To capture or not to capture

As is quite common for this blog, I’m gonna write this little rant post as a result of a twitter argument (“twargument”?). The topic is “capturing” in regular expressions, and how that behavior affects string.split(RE), specifically in JavaScript. As usual, move on if you already disagree with me and don’t care what I have to say. Blah blah.

What follows is an approximation of how I’ve learned about regular expressions and capturing across my varied programming career of 10+ years. If you want to skip over the boring “history” lesson and jump to my conclusions, take a look.

^

I would venture to guess that almost all programmers first learn a programming language (or several), and THEN as that knowledge matures, eventually they’re exposed to and learn regular expressions. In fact, I bet most programmers never formally learn regular expressions — they just kinda pick up on the syntax and fumble their way through it. Wash. rinse. repeat.

For my first 5 years of development, this was true for me. I read some bits and pieces of documentation on regular expressions, but for the most part, I was just trying to figure out what worked through trial and error. Gradually, over time, I learned various different things that helped make regular expressions seem clearer. Even this far into my career, I’d say I probably am only a 6 out of 10 in terms of understanding them — but that’s enough to make good use of them.

Before I ever wrote my first regular expression, I’d programmed in several languages . I wrote code (at one time in my life) in C, Basic/QBasic, Pascal, PHP, and JavaScript, before I ever had to write my first regular expression.

And you know what one of the first syntactic things I learned in those languages was? In writing arithmetic and boolean logic expressions, you have to use ( ) to group operand expressions. In fact, I learned that in math class long before I ever wrote my first line of code. If I want to express 3 * 2 + 5 and I expect 21 instead of 11, I have to write 3 * (2 + 5).

The use of ( ) to group operand expressions is fundamental to mathematical syntax and to almost every major programming language I’ve been exposed to.

( ) overloaded

Then I started learning regular expressions, and I wanted to do something like find in a longer string occurrences of “ab” or “abab” or “abababababab”, etc. In more formal terms, I wanted to “match against a string and see if there’s one or more occurrences of “ab” in a sequence anywhere in the string”.

And so how did I write my regular expression? First I tried /ab+/, but that didn’t work, because it gave me things like “abbbbbb” which is not what I wanted. Then I remembered my mathematical and programming roots, and realized the problem was operator binding (precedence), and that I needed to group “a” and “b” together and apply the + operator to the group. So, my regular expression became /(ab)+/, and voila — it worked!

It never occurred to me that the return value from .match() has other elements in it. All I cared about is the “ababababab” part at result[0]. And my regular expression faithfully does that.

Repeat that process hundreds of times over the years, in gradually more and more complex scenarios, and I’ve learned, informally through trial-and-error, how to write regular expressions that do what I want them to do (well, sorta) — when I need to group things, I use ( ).

Then, somewhere along the way, I read about this interesting notion of “capturing”. The first exposure I have to capturing groups is related to getting at the captured group data in the result array from a .match() call. For instance, result[1] has the first captured group match, result[2] has the second matched group, etc.

It’s a very powerful concept in regular expressions, and with it I realize I can do some really cool stuff. For instance, I write this regular expression: /foo(bar|(\d+))/. What I’m looking for is if “foo” happens, and is either followed by “bar”, or if not, if it’s followed by some digits, and maybe I specifically want to know what those matched digits are.

I read documentation on accessing capture groups from a call to .match(), and it says to count (1-based) the left parentheses to find out the ordinal index of the group, and then I can find my capture match in the result array at that location.

So, if I want to know the numbers found after foo, I look for results[2] (2 being the ordinal index of the second capture group, meaning the second left parenthesis in my regular expression). If results[2] is populated, then I know it must be the numbers that were found immediately after foo. And the overall match is at results[0], faithfully, just like it always was. And I’m happy. And I’ve solved my task (or so I think). I pay no attention to what may or may not be at results[1], because for my task I frankly don’t care.

\2

Of course, this documentation and testing has now misled me a little bit to think that the intention of “capture” groups is to be able to extract the matched group values. In fact (and I didn’t learn this until a lot later, as it’s often not expressed well if at all), the main purpose of capture groups is for “back-references”. Whaaa?

Back-references are extremely powerful: they allow you to reference the value of an earlier match later in the same regular expression. For instance, you might have a group like (["']) which can either match a ” or a ‘, and then you can later reference the value from that match to make sure that you are checking for the same value on the other end of the match (eg: either “…” or ‘…’). BTW, you accomplish the back-reference by saying \x where `x` is the ordinal index of the capture group you want to grab the value from.

The moral here is that capture groups are primarily about serving back-references. The fact that .match() and other API methods return the capture groups in the results array is not directly related to the processing behavior of the regular expression — it’s an intentional choice of the designer of that API.

They chose to assume that if the regular expression matches some capture groups, then that means the author must want those capture groups returned. Sometimes that’s true, but it’s certainly not universally true. Unfortunately, there’s no way to use capture groups in the regular expression without having them returned in the results.

(?: )

Several years down the road, still never having taken any formal class or read (and understood) any formal documentation on regular expressions, I feel decently confident that I know regular expressions enough to get what I need from them. When I need to group, I use ( ). And when I need to get capture groups, I also use ( ). The overloading of those two behaviors into the same operator hasn’t really bothered me too much, or even really directly occurred to me.

Then I ran across someone else’s complex regular expression, and I saw something like: /foo(?:bar|(\d+))/. That regular expression looks familiar, except for the “?:” in there. So I go searching through online documentation to try and find out what the heck that means. And I find some resource somewhere that says that (?: ) is officially a “non-capturing group”. Hmmm, I think. “Non-capturing”. Why do I care to make something “non-capturing”. I brush this off as not really making much sense.

6 months later, I saw such a non-capturing group again, this time in a regular expression one of my coworkers wrote. I asked them about it casually, and they said, “well, if I don’t need something to capture, it’s faster in the processing to tell the regex parser to not worry about capturing it.” Hmm… makes sense, I guess. Tell the regex engine if I don’t need it to capture. But I’ll be darned if I am gonna be very likely to remember that extra (ugly and non-semantic) “?:” in all my regular expression groupings. Besides, I’ve never been bitten by capturing hurting anything, including any performance hits that mattered to me.

Experiments

But I probably started to play around with “?:” a little bit anyway, out of curiosity. I went back to my /foo(bar|(\d+))/ regular expression, and I added in the ?:, and I tried to run my code and see if it went faster. I was dismayed to see that not only does it not go faster, but it breaks completely! What!?

So I go searching again, not really knowing what google keywords to use, and eventually I stumble back over that same document I read a long time ago about capture groups, and I re-read it, and I see in there that they say to count left parentheses of captured groups to find my matched group in the results array. It clicks! When I made one of my groups non-capturing, now I have to go change all my code that looks for the match of the digits at results[2], and change it to look at results[1].

Ugh. All I wanted to do was experiment with non-capturing groups for improving performance. After all, that’s why we do non-capturing right — to improve performance?

I don’t want to have to refactor a bunch of code and then re-validate it all. This “non-capturing” thing is starting to seem a little bogus to me. It seems more work than it’s worth. And it is definitely uglier. And so, not seeing any clear benefit, I doubt I’m gonna remember in the future to use non-capturing on most new regular expressions I write.

.*?

Some more years go by, and now I’m enrolled in computer science in college. I go through and take several computer science courses, and in a few of them, we get introduced briefly to the topic of regular expressions. They never go much deeper than the basics though. They perhaps mention in passing the idea of capturing (and maybe even non-capturing), but they never give us any practical reasons why that concept matters much.

As I advance through more and more computer science classes, I have the occasion from time to time to write regular expressions for various tasks. I do so based on my somewhat grass-roots previous learning (my classes certainly never rigorously go through much of it beyond surface intro stuff). And I get by just fine.

Not once is my code (or tests) ever graded down because I improperly “captured” a group I didn’t need to. I suppose the professor overlooked such details (maybe he doesn’t get why it matters, either), or perhaps he noticed it but it wasn’t what was being tested so it didn’t make sense to confuse me or grade me down. In any case, I sail on through these classes, only barely aware of the concept of “non-capturing” as it applies to regular expressions and groups, and blissfully unaware of any “bad practice” that’s now taken deep foothold on my programming style.

Then I take a class called “Theory of Abstract Languages” — really deep theoretical stuff. I’m humored to see that soon into the curriculum, we start talking about “regular expressions” — and I think to myself, “oh, I’ve written them for years, I’m good on this.” We actually go really deep into the theory behind regular expressions, and transformations, and all kinds of other academic stuff.

But again, we sort of gloss over the concept of “capturing” or “non-capturing”, as that kind of thing isn’t really germane to a theoretical discussion on regular expressions, it’s more the stuff of practical programming. I’m completely unaware of the fact that my “Theory” professor assumed that my “Programming 101″ professor would cover regular expressions in detail, and that vice versa my “Programming 101″ professor skipped over the nitty gritty of regular expressions, assuming my “Theory” professor would deal with it in detail.

And so, I graduate from college, with the entire curriculum of an engineering computer science degree in my head, and I’ve still never really had anyone explain to me why this notion of auto-capturing of groups is important, and moreover why it’s important to explicitly “non-capture” when I don’t need capturing. I’ve been exposed to this topic a dozen times now, but no one nor teaching has ever treated the topic with the rigor it deserves and requires, and I’m none-the-wiser in my ignorance.

{3,}

Fast forward another several years beyond getting my degree, and I still am writing code, and I’m still using regular expressions from time to time. And I’m pretty comfortable with them now. I’m rarely in a position where they don’t do what I want them to do, sometimes even the first time!

And I’ve seen the peculiar “?:” enough times that it doesn’t really catch my eye too often when I see others’ code using it. I still don’t use it myself, because it’s never occurred to me why it’s really all that important to do so. It still seems like more trouble than it’s worth.

To this point, I’ve had several occasions to write fairly complex algorithms using regular expressions. I’ve written code parsers/compilers, where I had to tokenize a string, construct an AST from the tokens, etc. In even those advanced algorithms, I got by just fine with my incomplete knowledge of regular expressions, because I knew enough to get done what I needed, and the big picture or truly indepth stuff is something I get away with not needing.

To tokenize a string, I could use str.split(…), but the hundred or so times I’ve used it in the past, across various languages, I’ve always known that .split() returns an array of the chunks of the string split up, but with the delimiter(s) removed.

This is fine, because there’s actually lots of times when I need that behaior. For instance, if I want an array of words, I get it by doing “one,two,three,four,five”.split(“,”). The .split(STRING) form works great for that task. Occasionally, I need something a little more sophisticated, and so I do something like “one,two|three;four,five”.split(/[,|;]/), and I still get the result I want.

In hindsight, almost never in those cases do I need to use .split(RE) with a regular expression that needs grouping or capturing/back-references. So, everything seems to work fine to me. I’m happily unaware of there being a silent gotcha waiting to trap me.

Since .split() doesn’t preserve my delimiters (or so I think), to tokenize a string, I devise a manual split process, doing something like this:

var op_token_regex = /\+\-\*\//g,
	lc = "", rc = str, tmp, tmp2, captured = 0,
	tokens = []
;
op_token_regex.lastIndex = 0;
while (tmp = op_token_regex.exec(str)) {
	lc = RegExp.leftContext;
	rc = RegExp.rightContext;
	tmp2 = str.substring(captured,op_token_regex.lastIndex-1);
	captured = op_token_regex.lastIndex;

	// something found before operator?
	if (tmp2 && tmp2.length > 0) {
		tokens.push(tmp2);
	}
	tokens.push(tmp[0]);
}
if (rc) {
	tokens.push(rc);
}

I use RE.exec(str) with a global regular expression, then using a while loop, I step through each match one at a time, and “capture” what I need from the original string input.

Admittedly, this code may not be super pretty, but it certainly gets the job done. The better part is that it gives me the code structure control to do more sophisticated things, like discarding certain tokens (like comments) during processing, based on certain conditions.

Stumble

Up to this point, I’ve been able to do some pretty complicated things with regular expressions, and I’ve felt like I’ve been pretty competent at them.

So it came as a big surprise to me yesterday on Twitter when I was asked about this regular expression: /(\s+|\-+)/ and why when used with .split() on a string like “a b-c”, it didn’t produce just ["a", "b","c"] like it seems it should. And in the ensuing discussions and testing, all that I’ve talked about so far in this blog post came crashing down together.

Instead, at least in most modern browsers, it produces ["a", " ", "b", "-", "c"]. Well, it’s obvious from looking at that what is happening. The delimiters I’m splitting on are now suddenly showing up in the results array along with the split chunks. Previously in all of my usage of .split() it’s never retained the delimiters in the results array, but now it seems to be. What’s the difference?

My initial reaction was that the ( ) is unnecessary in that regular expression (you don’t need the grouping), and indeed, if you remove them, the results array is as expected. It’s then that I realized what was happening: it’s not the grouping that’s changing the results array, it’s the capturing. Just like with .match(), .split() responds to capture group data by returning that data interleaved into the results array.

And there it is: 10+ years programming in half a dozen different languages, and I’ve never once seen anyone talk about or explain that the regular expression group capturing I knew about from .match() actually had some behavioral side effect for .split() — using the ( ) “capture” operator in the regular expression in this case is taken to mean “include captured data in the final result set”. This we knew was true of .match(), but it was far less obvious that the same was true of .split(), until now.

Except, the fly in the ointment is that .split() being used in this way is not reliable in JavaScript cross-browser. Though it may be specified in the “standard”, the behavior was not properly implemented in previous versions of IE.

So, we can’t exactly rely on this behavior. And more importantly, if we’re not diligent to avoid capture groups with .split(RE), we’ll introduce cross-browser bugs.

Look closely

Let’s examine what’s going on here.

First, the original designers of regular expressions decided that ( ) groups should, by default, capture. They decided, for some reason, that even though ( ) are almost universally used primarily for grouping in mathematics and all modern computer languages, for regular expressions they would break with that precedent and instead make ( ) primarily be about capturing, with grouping just being an offshoot behavior.

Why? Perhaps in their usage, back-references were much more common than just using ( ) for expression operand grouping. Perhaps they didn’t often use regular expression patterns that needed grouping. Or perhaps there’s some other motivation for this decision that is escaping me.

I’ve definitely used back-references, but I’d have to say that by far this is the less common usage — overwhelmingly, I use ( ) to group elements more for operator binding than for capturing/back-references.

Even if we accept that ( ) should have capturing behavior (I’m not convinced of that), since capturing/back-references seem like they are by nature less performant, it seems quite strange to me that this is the default behavior of the regular expression engine.

Wouldn’t it have made more sense to have capturing be an opt-in (rather than opt-out) behavior? For instance, the “?:” could be used on a group to capture it. Or, there could have been a modifier flag for the whole regular expression which turns on capturing, like “c”.

Moreover, even if you disagree with me about which should be the default behavior, I would still see a lot of benefit if there was a modifier that could turn off capturing across the whole regular expression, instead of forcing the author to turn it off for each group with “?:”. I for one would probably just use that modifier all the time, and only not use it in those rare cases where capturing is important to me.

Why isn’t there an easy way to turn off capturing for the whole regular expression?

APIs

Now, let’s examine the API’s for .match(), .exec(), and .split(). The designers of these functions decided that they would piggyback on the capturing behavior from the regular expression to trigger whether the capture data should be returned in the results array. This was a separate and intentional decision on top of the decision made by the regular expression engine to default to capturing with ( ).

The assumption is that if I use a ( ) capture group in my regular expression (perhaps for its most direct purpose: back-references), that I also want that captured group data returned in my results array. In other words, the “capture” directive in a regular expression is interpreted by the API to mean “include captured data into results”. This is not an entirely self-obvious or semantic leap to have been made, and it’s certainly not directly explicit, but rather just an implicit side-effect behavior.


What if I want to use capture groups for back-references, but I don’t want that captured data in my results array?

Imagine this:

var str = "some 'g' good \"s\" stuff going on 'h' here";
var results = str.split(/(["']).\1/); 
       // want: ["some", "good", "stuff going on", "here"]
       // get: ["some", "'", "good", """, "stuff going on", "'", "here"]

You see, in this case, I only have a ( ) capture group for the back-reference (\1) sake, and I don’t need or want that captured group to be in my returned results (in fact it makes my job harder to filter out that noise).

Because of the faulty assumptions made by the API functions, I can’t use back-reference capture groups in my regexes without automatically getting those groups in my result set. This is a failure of design on the API’s part. And a bunch of languages, not just JavaScript, made the same mistake (and/or copied each other’s mistakes).

There’s some inherent inconsistency, too. What happens to the “capture data” in a regular expression used by .test()? That data apparently gets silently discarded, because .test() only returns a Boolean. If I use capture groups inside a regular expression executed for .test(), those capture groups have no effect on how .test() returns its value. This is strange and inconsistent.

Wouldn’t it have made more sense for there to be an explicit instruction to the API that you want any capture data collected to be returned in the results? For instance, why couldn’t we have had str.split(RE, count, includeInResults{=false})? That way we could explicitly declare whether we want the captured data in our results or not.

Lesson Learned…

…the hard way, I guess. Now I have to try and re-train myself to think of ( ) as primarily a capturing mechanism and not a grouping mechanism. And I’m betting I’m not the only developer that has been astray on this topic for a long time.

$