To capture or not to capture

As is quite common for this blog, I’m gonna write this little rant post as a result of a Twitter argument (“twargument”?). The topic is “capturing” in regular expressions, and how that behavior affects string.split(RE), specifically in JavaScript. As usual, move on if you already disagree with me and don’t care what I have to say. Blah blah.

What follows is an approximation of how I’ve learned about regular expressions and capturing across my varied programming career of 10+ years. If you want to skip over the boring “history” lesson, feel free to jump ahead to my conclusions.

^

I would venture to guess that almost all programmers first learn a programming language (or several), and THEN, as that knowledge matures, they’re eventually exposed to regular expressions and learn them. In fact, I bet most programmers never formally learn regular expressions; they just kinda pick up on the syntax and fumble their way through it. Wash, rinse, repeat.

For my first 5 years of development, this was true for me. I read some bits and pieces of documentation on regular expressions, but for the most part, I was just trying to figure out what worked through trial and error. Gradually, over time, I learned various different things that helped make regular expressions seem clearer. Even this far into my career, I’d say I probably am only a 6 out of 10 in terms of understanding them — but that’s enough to make good use of them.

Before I ever wrote my first regular expression, I’d programmed in several languages. At one time or another in my life, I wrote code in C, Basic/QBasic, Pascal, PHP, and JavaScript.

And you know what one of the first syntactic things I learned in those languages was? In writing arithmetic and boolean logic expressions, you have to use ( ) to group operand expressions. In fact, I learned that in math class long before I ever wrote my first line of code. If I want to express 3 * 2 + 5 and I expect 21 instead of 11, I have to write 3 * (2 + 5).

The use of ( ) to group operand expressions is fundamental to mathematical syntax and to almost every major programming language I’ve been exposed to.

( ) overloaded

Then I started learning regular expressions, and I wanted to do something like find occurrences of “ab” or “abab” or “abababababab”, etc. in a longer string. In more formal terms, I wanted to match against a string and see if there’s one or more occurrences of “ab” in a sequence anywhere in the string.

And so how did I write my regular expression? First I tried /ab+/, but that didn’t work, because it gave me things like “abbbbbb” which is not what I wanted. Then I remembered my mathematical and programming roots, and realized the problem was operator binding (precedence), and that I needed to group “a” and “b” together and apply the + operator to the group. So, my regular expression became /(ab)+/, and voila — it worked!

It never occurred to me that the return value from .match() had other elements in it. All I cared about was the “ababababab” part at result[0]. And my regular expression faithfully gave me that.
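Here’s a quick sketch of that difference. The sample strings and variable names are made up purely for illustration; the only things taken from the story above are the two patterns, /ab+/ and /(ab)+/, and the .match() call.

```javascript
// Hypothetical sample strings, chosen only for illustration.
const runOfBs = "xx abbbb yy";
const runOfAbs = "xx ababab yy";

// Without parentheses, + binds only to the "b" right before it,
// so this means: one "a" followed by one or more "b".
const ungrouped = runOfBs.match(/ab+/);

// With parentheses, + applies to the whole "ab" sequence.
const grouped = runOfAbs.match(/(ab)+/);

console.log(ungrouped[0]); // "abbbb"
console.log(grouped[0]);   // "ababab"

// The grouped result quietly carries one extra element: the captured group.
console.log(grouped.length); // 2
console.log(grouped[1]);     // "ab"
```

If all you ever read is result[0], that extra element at index 1 is easy to miss, which is exactly how I went years without noticing it.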

Repeat that process hundreds of times over the years, in gradually more and more complex scenarios, and I’ve learned, informally through trial-and-error, how to write regular expressions that do what I want them to do (well, sorta) — when I need to group things, I use ( ).

Then, somewhere along the way, I read about this interesting notion of “capturing”. The first exposure I have to capturing groups is related to getting at the captured group data in the result array from a .match() call. For instance, result[1] has the first captured group match, result[2] has the second matched group, etc.

It’s a very powerful concept in regular expressions, and with it I realize I can do some really cool stuff. For instance, I write this regular expression: /foo(bar|(\d+))/. What I’m looking for is if “foo” happens, and is either followed by “bar”, or if not, if it’s followed by some digits, and maybe I specifically want to know what those matched digits are.

I read documentation on accessing capture groups from a call to .match(), and it says to count (1-based) the left parentheses to find out the ordinal index of the group, and then I can find my capture match in the result array at that location.
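To make that counting rule concrete, here’s a small sketch using the /foo(bar|(\d+))/ pattern from above. The input strings and variable names are my own invention for illustration; the point is just which index each left parenthesis maps to.

```javascript
// Group 1 starts at the first "(" and wraps the whole alternation.
// Group 2 starts at the second "(" and wraps only the digits.
const re = /foo(bar|(\d+))/;

// Hypothetical inputs, one for each branch of the alternation.
const withDigits = "xfoo123y".match(re);
const withBar = "xfoobar".match(re);

console.log(withDigits[0]); // "foo123" (the full match)
console.log(withDigits[1]); // "123"    (group 1)
console.log(withDigits[2]); // "123"    (group 2)

console.log(withBar[0]); // "foobar"
console.log(withBar[1]); // "bar"
console.log(withBar[2]); // undefined (group 2 never took part in the match)
```

Note that a group which doesn’t participate still keeps its slot in the result array; in JavaScript it just shows up as undefined.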
