As is quite common for this blog, I’m gonna write this little rant post as a result of a twitter argument (“twargument”?). The topic is “capturing” in regular expressions, and how that behavior affects string.split(RE), specifically in JavaScript. As usual, move on if you already disagree with me and don’t care what I have to say. Blah blah.

What follows is an approximation of how I’ve learned about regular expressions and capturing across my varied programming career of 10+ years. If you want to skip over the boring “history” lesson and jump to my conclusions, take a look.

^

I would venture to guess that almost all programmers first learn a programming language (or several), and THEN as that knowledge matures, eventually they’re exposed to and learn regular expressions. In fact, I bet most programmers never formally learn regular expressions: they just kinda pick up on the syntax and fumble their way through it. Wash, rinse, repeat.

For my first 5 years of development, this was true for me. I read some bits and pieces of documentation on regular expressions, but for the most part, I was just trying to figure out what worked through trial and error. Gradually, over time, I learned various different things that helped make regular expressions seem clearer. Even this far into my career, I’d say I probably am only a 6 out of 10 in terms of understanding them — but that’s enough to make good use of them.

Before I ever wrote my first regular expression, I’d programmed in several languages: at one time or another I wrote code in C, Basic/QBasic, Pascal, PHP, and JavaScript.

And you know what one of the first syntactic things I learned in those languages was? In writing arithmetic and boolean logic expressions, you have to use ( ) to group operand expressions. In fact, I learned that in math class long before I ever wrote my first line of code. If I want to express 3 * 2 + 5 and I expect 21 instead of 11, I have to write 3 * (2 + 5).

The use of ( ) to group operand expressions is fundamental to mathematical syntax and to almost every major programming language I’ve been exposed to.

( ) overloaded

Then I started learning regular expressions, and I wanted to do something like find occurrences of “ab” or “abab” or “abababababab”, etc. in a longer string. In more formal terms, I wanted to match against a string and see if there’s a sequence of one or more occurrences of “ab” anywhere in the string.

And so how did I write my regular expression? First I tried /ab+/, but that didn’t work, because it gave me things like “abbbbbb” which is not what I wanted. Then I remembered my mathematical and programming roots, and realized the problem was operator binding (precedence), and that I needed to group “a” and “b” together and apply the + operator to the group. So, my regular expression became /(ab)+/, and voila — it worked!
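As a quick sketch of that precedence difference (the sample strings are made up for illustration), the two patterns behave like this:

```javascript
// Without grouping, + binds only to the "b" right before it.
var ungrouped = "xxabbbbyy".match(/ab+/);

// With grouping, + applies to the whole "ab" sequence.
var grouped = "xxabababyy".match(/(ab)+/);

console.log(ungrouped[0]); // "abbbb"
console.log(grouped[0]);   // "ababab"
```

The overall match is at index 0 in both cases, which is all I was looking at back then.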

It never occurred to me that the return value from .match() has other elements in it. All I cared about is the “ababababab” part at result[0]. And my regular expression faithfully does that.

Repeat that process hundreds of times over the years, in gradually more and more complex scenarios, and I’ve learned, informally through trial-and-error, how to write regular expressions that do what I want them to do (well, sorta) — when I need to group things, I use ( ).

Then, somewhere along the way, I read about this interesting notion of “capturing”. The first exposure I have to capturing groups is related to getting at the captured group data in the result array from a .match() call. For instance, result[1] has the first captured group match, result[2] has the second matched group, etc.

It’s a very powerful concept in regular expressions, and with it I realize I can do some really cool stuff. For instance, I write this regular expression: /foo(bar|(\d+))/. What I’m looking for is if “foo” happens, and is either followed by “bar”, or if not, if it’s followed by some digits, and maybe I specifically want to know what those matched digits are.

I read documentation on accessing capture groups from a call to .match(), and it says to count (1-based) the left parentheses to find out the ordinal index of the group, and then I can find my capture match in the result array at that location.

So, if I want to know the numbers found after foo, I look for results[2] (2 being the ordinal index of the second capture group, meaning the second left parenthesis in my regular expression). If results[2] is populated, then I know it must be the numbers that were found immediately after foo. And the overall match is at results[0], faithfully, just like it always was. And I’m happy. And I’ve solved my task (or so I think). I pay no attention to what may or may not be at results[1], because for my task I frankly don’t care.
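Here is roughly what that result array looks like for the pattern above. The input strings are just assumed examples:

```javascript
var re = /foo(bar|(\d+))/;

var digits = "xx foo123 yy".match(re);
// digits[0] is "foo123"   (the overall match)
// digits[1] is "123"      (group 1: the whole bar-or-digits alternation)
// digits[2] is "123"      (group 2: just the digits)

var word = "xx foobar yy".match(re);
// word[0] is "foobar"
// word[1] is "bar"
// word[2] is undefined    (group 2 did not take part in the match)
```

So checking whether index 2 is populated is enough to tell the two cases apart, even though index 1 is sitting there the whole time.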

\2

Of course, this documentation and testing has now misled me a little bit to think that the intention of “capture” groups is to be able to extract the matched group values. In fact (and I didn’t learn this until a lot later, as it’s often not expressed well if at all), the main purpose of capture groups is for “back-references”. Whaaa?

Back-references are extremely powerful: they allow you to reference the value of an earlier match later in the same regular expression. For instance, you might have a group like (["']) which can match either a " or a ', and then you can later reference the value from that match to make sure that you are checking for the same value on the other end of the match (e.g. either "…" or '…'). BTW, you accomplish the back-reference by writing \x, where `x` is the ordinal index of the capture group you want to grab the value from.
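A minimal sketch of that quote-matching idea, assuming a simplified pattern where the quoted text contains no quote characters of either kind:

```javascript
// Group 1 captures the opening quote; \1 demands the same character to close.
var quoted = /(["'])[^"']*\1/;

var singleOk   = quoted.test("say 'hello' now");   // true
var doubleOk   = quoted.test('say "hello" now');   // true
var mismatched = quoted.test("say 'hello\" now");  // false: ' opened, " closed
```

The pattern never spells out which quote it expects at the end. It just says "whatever group 1 matched".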

The moral here is that capture groups are primarily about serving back-references. The fact that .match() and other API methods return the capture groups in the results array is not directly related to the processing behavior of the regular expression — it’s an intentional choice of the designer of that API.

They chose to assume that if the regular expression matches some capture groups, then that means the author must want those capture groups returned. Sometimes that’s true, but it’s certainly not universally true. Unfortunately, there’s no way to use capture groups in the regular expression without having them returned in the results.

(?: )

Several years down the road, still never having taken any formal class or read (and understood) any formal documentation on regular expressions, I feel decently confident that I know regular expressions enough to get what I need from them. When I need to group, I use ( ). And when I need to get capture groups, I also use ( ). The overloading of those two behaviors into the same operator hasn’t really bothered me too much, or even really directly occurred to me.

Then I ran across someone else’s complex regular expression, and I saw something like: /foo(?:bar|(\d+))/. That regular expression looks familiar, except for the “?:” in there. So I go searching through online documentation to try and find out what the heck that means. And I find some resource somewhere that says that (?: ) is officially a “non-capturing group”. Hmmm, I think. “Non-capturing”. Why do I care to make something “non-capturing”. I brush this off as not really making much sense.

6 months later, I saw such a non-capturing group again, this time in a regular expression one of my coworkers wrote. I asked them about it casually, and they said, “well, if I don’t need something to capture, it’s faster in the processing to tell the regex parser to not worry about capturing it.” Hmm… makes sense, I guess. Tell the regex engine if I don’t need it to capture. But I’ll be darned if I am gonna be very likely to remember that extra (ugly and non-semantic) “?:” in all my regular expression groupings. Besides, I’ve never been bitten by capturing hurting anything, including any performance hits that mattered to me.

Experiments

But I probably started to play around with “?:” a little bit anyway, out of curiosity. I went back to my /foo(bar|(\d+))/ regular expression, and I added in the ?:, and I tried to run my code and see if it went faster. I was dismayed to see that not only does it not go faster, but it breaks completely! What!?

So I go searching again, not really knowing what google keywords to use, and eventually I stumble back over that same document I read a long time ago about capture groups, and I re-read it, and I see in there that they say to count left parentheses of captured groups to find my matched group in the results array. It clicks! When I made one of my groups non-capturing, now I have to go change all my code that looks for the match of the digits at results[2], and change it to look at results[1].
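The index shift is easy to see side by side (again with an assumed sample input):

```javascript
var capturing    = "foo123".match(/foo(bar|(\d+))/);
var nonCapturing = "foo123".match(/foo(?:bar|(\d+))/);

// capturing:    ["foo123", "123", "123"]  -> digits at index 2
// nonCapturing: ["foo123", "123"]         -> digits now at index 1
console.log(capturing[2]);    // "123"
console.log(nonCapturing[1]); // "123"
console.log(nonCapturing[2]); // undefined, which is what broke my code
```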

Ugh. All I wanted to do was experiment with non-capturing groups for improving performance. After all, that’s why we do non-capturing right — to improve performance?

I don’t want to have to refactor a bunch of code and then re-validate it all. This “non-capturing” thing is starting to seem a little bogus to me. It seems more work than it’s worth. And it is definitely uglier. And so, not seeing any clear benefit, I doubt I’m gonna remember in the future to use non-capturing on most new regular expressions I write.

.*?

Some more years go by, and now I’m enrolled in computer science in college. I go through and take several computer science courses, and in a few of them, we get introduced briefly to the topic of regular expressions. They never go much deeper than the basics though. They perhaps mention in passing the idea of capturing (and maybe even non-capturing), but they never give us any practical reasons why that concept matters much.

As I advance through more and more computer science classes, I have the occasion from time to time to write regular expressions for various tasks. I do so based on my somewhat grass-roots previous learning (my classes certainly never rigorously go through much of it beyond surface intro stuff). And I get by just fine.

Not once is my code (or tests) ever graded down because I improperly “captured” a group I didn’t need to. I suppose the professor overlooked such details (maybe he doesn’t get why it matters, either), or perhaps he noticed it but it wasn’t what was being tested so it didn’t make sense to confuse me or grade me down. In any case, I sail on through these classes, only barely aware of the concept of “non-capturing” as it applies to regular expressions and groups, and blissfully unaware of any “bad practice” that’s now taken deep foothold on my programming style.

Then I take a class called “Theory of Abstract Languages” — really deep theoretical stuff. I’m humored to see that soon into the curriculum, we start talking about “regular expressions” — and I think to myself, “oh, I’ve written them for years, I’m good on this.” We actually go really deep into the theory behind regular expressions, and transformations, and all kinds of other academic stuff.

But again, we sort of gloss over the concept of “capturing” or “non-capturing”, as that kind of thing isn’t really germane to a theoretical discussion on regular expressions, it’s more the stuff of practical programming. I’m completely unaware of the fact that my “Theory” professor assumed that my “Programming 101” professor would cover regular expressions in detail, and that vice versa my “Programming 101” professor skipped over the nitty gritty of regular expressions, assuming my “Theory” professor would deal with it in detail.

And so, I graduate from college, with the entire curriculum of an engineering computer science degree in my head, and I’ve still never really had anyone explain to me why this notion of auto-capturing of groups is important, and moreover why it’s important to explicitly “non-capture” when I don’t need capturing. I’ve been exposed to this topic a dozen times now, but no one nor teaching has ever treated the topic with the rigor it deserves and requires, and I’m none-the-wiser in my ignorance.

{3,}

Fast forward another several years beyond getting my degree, and I still am writing code, and I’m still using regular expressions from time to time. And I’m pretty comfortable with them now. I’m rarely in a position where they don’t do what I want them to do, sometimes even the first time!

And I’ve seen the peculiar “?:” enough times that it doesn’t really catch my eye too often when I see others’ code using it. I still don’t use it myself, because it’s never occurred to me why it’s really all that important to do so. It still seems like more trouble than it’s worth.

To this point, I’ve had several occasions to write fairly complex algorithms using regular expressions. I’ve written code parsers/compilers, where I had to tokenize a string, construct an AST from the tokens, etc. In even those advanced algorithms, I got by just fine with my incomplete knowledge of regular expressions, because I knew enough to get done what I needed, and the big picture or truly in-depth stuff is something I get away with not needing.

To tokenize a string, I could use str.split(…), but the hundred or so times I’ve used it in the past, across various languages, I’ve always known that .split() returns an array of the chunks of the string split up, but with the delimiter(s) removed.

This is fine, because there are actually lots of times when I need that behavior. For instance, if I want an array of words, I get it by doing "one,two,three,four,five".split(","). The .split(STRING) form works great for that task. Occasionally, I need something a little more sophisticated, and so I do something like "one,two|three;four,five".split(/[,|;]/), and I still get the result I want.
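Spelled out, those two calls give the same array, with the delimiters dropped in both cases:

```javascript
// Splitting on a plain string delimiter.
var words = "one,two,three,four,five".split(",");

// Splitting on a regular expression: a character class with no ( ) groups.
var mixed = "one,two|three;four,five".split(/[,|;]/);

console.log(words); // ["one", "two", "three", "four", "five"]
console.log(mixed); // ["one", "two", "three", "four", "five"]
```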

In hindsight, almost never in those cases do I need to use .split(RE) with a regular expression that needs grouping or capturing/back-references. So, everything seems to work fine to me. I’m happily unaware of there being a silent gotcha waiting to trap me.

Since .split() doesn’t preserve my delimiters (or so I think), to tokenize a string, I devise a manual split process, doing something like this:

var op_token_regex = /[+\-*\/]/g,	// character class: any ONE of + - * /
	rc = str, tmp, tmp2, captured = 0,
	tokens = []
;
op_token_regex.lastIndex = 0;
while (tmp = op_token_regex.exec(str)) {
	// text between the previous operator and this one
	tmp2 = str.substring(captured,tmp.index);
	captured = op_token_regex.lastIndex;
	// whatever remains to the right of this operator
	rc = str.substring(captured);

	// something found before operator?
	if (tmp2 && tmp2.length > 0) {
		tokens.push(tmp2);
	}
	tokens.push(tmp[0]);
}
// anything left over after the last operator?
if (rc) {
	tokens.push(rc);
}

I use RE.exec(str) with a global regular expression, then using a while loop, I step through each match one at a time, and “capture” what I need from the original string input.

Admittedly, this code may not be super pretty, but it certainly gets the job done. The better part is that it gives me the code structure control to do more sophisticated things, like discarding certain tokens (like comments) during processing, based on certain conditions.

Stumble

Up to this point, I’ve been able to do some pretty complicated things with regular expressions, and I’ve felt like I’ve been pretty competent at them.

So it came as a big surprise to me yesterday on Twitter when I was asked about this regular expression: /(\s+|\-+)/ and why when used with .split() on a string like “a b-c”, it didn’t produce just ["a", "b","c"] like it seems it should. And in the ensuing discussions and testing, all that I’ve talked about so far in this blog post came crashing down together.

Instead, at least in most modern browsers, it produces ["a", " ", "b", "-", "c"]. Well, it’s obvious from looking at that what is happening. The delimiters I’m splitting on are now suddenly showing up in the results array along with the split chunks. Previously in all of my usage of .split() it’s never retained the delimiters in the results array, but now it seems to be. What’s the difference?

My initial reaction was that the ( ) is unnecessary in that regular expression (you don’t need the grouping), and indeed, if you remove them, the results array is as expected. It’s then that I realized what was happening: it’s not the grouping that’s changing the results array, it’s the capturing. Just like with .match(), .split() responds to capture group data by returning that data interleaved into the results array.
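To make the difference concrete, here is the same split done three ways. Only the first pattern has a capturing group:

```javascript
var input = "a b-c";

var withCapture    = input.split(/(\s+|\-+)/);   // capturing group
var withoutCapture = input.split(/(?:\s+|\-+)/); // non-capturing group
var noGroup        = input.split(/\s+|\-+/);     // no group at all

console.log(withCapture);    // ["a", " ", "b", "-", "c"]
console.log(withoutCapture); // ["a", "b", "c"]
console.log(noGroup);        // ["a", "b", "c"]
```

All three patterns match exactly the same delimiters. The only thing that changes is whether those delimiters come back in the array.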

And there it is: 10+ years programming in half a dozen different languages, and I’ve never once seen anyone talk about or explain that the regular expression group capturing I knew about from .match() actually had some behavioral side effect for .split() — using the ( ) “capture” operator in the regular expression in this case is taken to mean “include captured data in the final result set”. This we knew was true of .match(), but it was far less obvious that the same was true of .split(), until now.

Except, the fly in the ointment is that .split() being used in this way is not reliable in JavaScript cross-browser. Though it may be specified in the “standard”, the behavior was not properly implemented in previous versions of IE.

So, we can’t exactly rely on this behavior. And more importantly, if we’re not diligent to avoid capture groups with .split(RE), we’ll introduce cross-browser bugs.

Look closely

Let’s examine what’s going on here.

First, the original designers of regular expressions decided that ( ) groups should, by default, capture. They decided, for some reason, that even though ( ) are almost universally used primarily for grouping in mathematics and all modern computer languages, for regular expressions they would break with that precedent and instead make ( ) primarily be about capturing, with grouping just being an offshoot behavior.

Why? Perhaps in their usage, back-references were much more common than just using ( ) for expression operand grouping. Perhaps they didn’t often use regular expression patterns that needed grouping. Or perhaps there’s some other motivation for this decision that is escaping me.

I’ve definitely used back-references, but I’d have to say that by far this is the less common usage — overwhelmingly, I use ( ) to group elements more for operator binding than for capturing/back-references.

Even if we accept that ( ) should have capturing behavior (I’m not convinced of that), since capturing/back-references seem like they are by nature less performant, it seems quite strange to me that this is the default behavior of the regular expression engine.

Wouldn’t it have made more sense to have capturing be an opt-in (rather than opt-out) behavior? For instance, the “?:” could be used on a group to capture it. Or, there could have been a modifier flag for the whole regular expression which turns on capturing, like “c”.

Moreover, even if you disagree with me about which should be the default behavior, I would still see a lot of benefit if there was a modifier that could turn off capturing across the whole regular expression, instead of forcing the author to turn it off for each group with “?:”. I for one would probably just use that modifier all the time, and only not use it in those rare cases where capturing is important to me.

Why isn’t there an easy way to turn off capturing for the whole regular expression?

APIs

Now, let’s examine the APIs for .match(), .exec(), and .split(). The designers of these functions decided that they would piggyback on the capturing behavior from the regular expression to trigger whether the capture data should be returned in the results array. This was a separate and intentional decision on top of the decision made by the regular expression engine to default to capturing with ( ).

The assumption is that if I use a ( ) capture group in my regular expression (perhaps for its most direct purpose: back-references), that I also want that captured group data returned in my results array. In other words, the “capture” directive in a regular expression is interpreted by the API to mean “include captured data into results”. This is not an entirely self-obvious or semantic leap to have been made, and it’s certainly not directly explicit, but rather just an implicit side-effect behavior.


What if I want to use capture groups for back-references, but I don’t want that captured data in my results array?

Imagine this:

var str = "some 'g' good \"s\" stuff going on 'h' here";
var results = str.split(/(["']).\1/); 
       // want: ["some ", " good ", " stuff going on ", " here"]
       // get: ["some ", "'", " good ", "\"", " stuff going on ", "'", " here"]

You see, in this case, I only have a ( ) capture group for the back-reference (\1) sake, and I don’t need or want that captured group to be in my returned results (in fact it makes my job harder to filter out that noise).

Because of the faulty assumptions made by the API functions, I can’t use back-reference capture groups in my regexes without automatically getting those groups in my result set. This is a failure of design on the API’s part. And a bunch of languages, not just JavaScript, made the same mistake (and/or copied each other’s mistakes).

There’s some inherent inconsistency, too. What happens to the “capture data” in a regular expression used by .test()? That data apparently gets silently discarded, because .test() only returns a Boolean. If I use capture groups inside a regular expression executed for .test(), those capture groups have no effect on how .test() returns its value. This is strange and inconsistent.
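For comparison, here is the same back-reference pattern run through .test() and .exec() (the input string is an assumed example):

```javascript
var quoteRe = /(["']).\1/;
var sample = "some 'g' good";

var tested = quoteRe.test(sample);   // true, and nothing else comes back
var execed = quoteRe.exec(sample);   // the match plus the captured quote

console.log(tested);    // true
console.log(execed[0]); // "'g'"
console.log(execed[1]); // "'"
```

Same regular expression, same capture group, same input. One API hands the captured data back and the other throws it away.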

Wouldn’t it have made more sense for there to be an explicit instruction to the API that you want any capture data collected to be returned in the results? For instance, why couldn’t we have had str.split(RE, count, includeInResults{=false})? That way we could explicitly declare whether we want the captured data in our results or not.
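Lacking such a parameter, one possible workaround is to post-process the result. This is only a sketch: `splitWithoutCaptures` is a hypothetical helper (not a built-in), it assumes a spec-compliant .split() that interleaves every capture group after each chunk, and it assumes no `limit` argument is in play.

```javascript
// Hypothetical helper: split on `re`, then drop the interleaved capture data.
function splitWithoutCaptures(str, re) {
	// Count capture groups: appending "|" adds an empty alternative, so the
	// pattern always matches "" and exec() returns [match, g1, ..., gN].
	var groupCount = new RegExp(re.source + "|").exec("").length - 1;
	var parts = str.split(re);
	var chunks = [];
	// split() inserts all N group values after every chunk, so the real
	// chunks sit at every (N + 1)th position.
	for (var i = 0; i < parts.length; i += groupCount + 1) {
		chunks.push(parts[i]);
	}
	return chunks;
}

var str = "some 'g' good \"s\" stuff going on 'h' here";
var chunks = splitWithoutCaptures(str, /(["']).\1/);
console.log(chunks); // ["some ", " good ", " stuff going on ", " here"]
```

It works, but having to count groups and step through the array just to undo what the API did is rather the point of my complaint.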

Lesson Learned…

…the hard way, I guess. Now I have to try and re-train myself to think of ( ) as primarily a capturing mechanism and not a grouping mechanism. And I’m betting I’m not the only developer that has been astray on this topic for a long time.

$

This entry was written by getify, posted on Thursday November 04 2010 at 01:11 am, and filed under JavaScript.

12 Responses to “To capture or not to capture”

  • @hymanroth says:

    Kyle,

    My introduction to RE matches yours, like descending into the various levels of Dante’s Inferno!

    Agree the () should primarily be used for grouping, and that capturing should be opt-in.

    All in all, a great introduction followed by a well-argued point.

    Kudos.

  • Maybe you are beyond this reading now, but the Pattern Matching chapter (5) in the hearty “Programming Perl” (Larry Wall) book is one of the most succinct and well-written explanations of all things regex.

    This might have to do with the fact that Larry Wall, inventor of Perl, is a linguist, not a computer programmer. So he tends to approach things from a grammatical viewpoint as opposed to an engineering technical one.

    Despite that, if anyone is struggling over capturing/non-capturing, back-references, look-aheads, look-behinds, etc., this is a great place to start.

    And it doesn’t matter that it’s Perl as opposed to the language of your choice, because the concepts are identical and the syntax only slightly different.

  • fearphage says:

    I for one would probably just use that modifier all the time, and only not use it in those rare cases where capturing is important to me.

    That’s precisely how it is intended to be used.

  • getify says:

    @fearphage –

    That’s precisely how it is intended to be used.

    I presume by “it” you mean ( ) vs. (?: ). If so, I think you miss the point of this post.

    I know that ( ) captures and (?: ) doesn’t, and that I’m supposed to use ( ) when I want to capture and (?: ) when I don’t want to capture. This point has never been in question. I’m not sure why you (and others) seem to feel like you need to keep repeating that point, as if I disagree.

    Of course, such “facts” as knowledge now is/was not particularly common, self-obvious, or well spelled out in most documentation and discussions of regex, and so I’m sure lots of devs like me have gone for a long time with some incorrect habits (which are hard now to break) in regular expression writing before understanding a more full picture. Even now, I bet most devs don’t know that regex captures were primarily designed to service back-references, not for affecting the return result of certain API calls like .match().

    /(abc)/ matches exactly the same thing as /abc/, but the former will have some non-obvious effects on the results of certain API calls. Such nuances in side-effects on API call results are indirect at best, and entirely non-obvious: the source of a lot of confusion for devs like me.

    All of this conspires to make regular expressions MUCH harder to learn and get right, especially by having an irrational default for behavior that counters established precedent for ( ) operators. Even now that I know the importance of ( ) vs. (?: ), I still think it’s confusing and frustrating how it works. How much more so was that confusion/frustration before I had the understanding I now have.

    By contrast, in math (or in regular programming), if I take `3 * (2 + 5)` and wrap another ( ) around it to be `(3 * (2 + 5))` or even `((((3 * ((2) + (5))))))`, however unnecessary/overkill that may be, it doesn’t change the fact that I’ll get `21` back.

    Basically, what I’m saying is that it’s silly, non-obvious, and confusing for ( ) to do more work (or rather, to have a greater impact on behavior and outcome) than (?: )… put another way, why should I have to go to more trouble (by typing ?: over and over again in a dozen different groups) to opt out of capturing behavior, rather than having a modifier flag that can toggle capturing behavior on or off for the whole regex?

  • fearphage says:

    Your analogy is flawed. Parentheses have only one meaning and purpose in math. This is not the case with regular expressions as you have cited. I think it would be really easy if everyone was taught from day one that (?: ) is grouping and ( ) is grouping + capture. (?: ) should be the default in most cases. I find grouping comes up a lot more often than I want to capture the match. Regular Expressions aren’t magical or arbitrary. They do precisely what you tell them to all the time. I don’t see ( ) as confusing in a split because when I don’t include `?:`, I expect something to be captured. If it didn’t return the captured bits, I’d file a bug because it would no longer be adhering to regex specs. So if you always did exactly as you stated you would in the future:

    I for one would probably just use that modifier all the time, and only not use it in those rare cases where capturing is important to me.

    Then you would have never run into this issue and wouldn’t have talked about it on twitter and I wouldn’t be leaving this comment. You are using ( ) in an unintended way which rarely has unintended side effects. The problem is that you are capturing when you have no plans to use the captured values. You really want to group but you capture in addition to grouping. If you had been explicit with your syntax all along, this could have been avoided.

  • D. Hayes says:

    The main reason I think your position is odd is that every programming language I’ve ever seen has strange, unprecedented structures and you’re not complaining about those or saying they should be removed. I don’t think capturing groups having side effects for string.split is any weirder than any of the other unique features in programming land:

    JS: lexically scoped instead of block scoped, “this” redefinitions
    Python: whitespace matters, __methods__ are magic
    C: pre-processor macros
    C#: protected internal accessors
    Java: reliance on file-system hierarchy as program structure
    Objective C: sending messages to null is fine (quite awesome, actually)

    Yes, the API is nuanced… but no more nuanced than falling through case statements (not allowed in all major languages), OS-level threads vs. green threads, namespaces, or any other differentiator.

    Ultimately, to be this upset about regular expressions used within this one context makes me think that you’re more upset that it doesn’t work the way you think it should. Which, heck, I don’t think we should have semicolons… but there you go. ( =

    So maybe the solution is to write up a regular expression tutorial that highlights the differences between regex implementations (JS, Perl, Python, C#, etc…)? If you call out that difference, point to your tutorial from Twitter and Stack Overflow (or something), then the next grassroots regexer will find your post and not have this problem?

  • getify says:

    @David-
    You have some valid points about plenty of silliness in various languages. I only currently write code in JavaScript (and regular expressions), so that’s all I really care to complain about at the moment.

    However, regular expressions are considerably more obtuse and opaque to learn and master than almost any other modern high-level language, so the design decisions (and defaults) made, which make it even harder, are in my opinion candidates for criticism. Regex could be simpler (and prettier) to understand and learn and teach and master if some of this silliness wasn’t baked in.

    Moreover:
    1) I am upset that regular expressions default to capturing behavior, which is less performant, and they make it ugly and clunky (no global regex modifier) to opt out of such a performance hit. Even if I conceded that the current default behavior of ( ) capturing is sensible, it’s completely inane that I have to opt out individually for every single group. How on earth does that make any sense?

    2) I am also upset that the regex APIs in most modern languages made what I consider to be a severely faulty assumption that if I use capturing groups in my regex, that I always (and irrevocably) want that intermediate captured data returned in my result set.

    As shown immediately above, in some cases, like using back-references with .split(), I clearly only want capturing data remembered for the life of the regex, and NOT returned in my results.

    The bad assumptions made by the API designers (not a regular expression thing, a language API thing) of .split() (and .match/.exec) make it impossible to use back-references (very helpful and powerful) without getting the undesirable and difficult-to-filter-out behavior of interleaving those capture groups into my .split() result set.

    Again, it’s merely an annoyance/nitpick about the default behaviors, but a genuine complaint about faulty language API design assumptions in that there’s NO way to access the alternate (and in my case, needed) behavior from the API, which is to contain intermediate capture groups behavior only to the regex itself and not have it affect the overall results.

  • D. Hayes says:


    …if I use capturing groups in my regex, that I… want that intermediate captured data returned in my result set.

    This part tripped me up, I guess. My brain says, “That’s why they’re called capture groups — because you want to capture them.”

    Otherwise, I totally agree about regex being kinda impenetrable.

    \b(([\w-]+://?|www[.])[^\s()]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))

    Very, very close to brainfuck for me.

  • First of all, regular expressions can be tricky to understand and learn, but that’s especially true when you stumble in the dark through naive trial and error, read poor documentation, or learn from people who don’t know what they’re talking about. The terseness, flexibility, expressiveness, and power of regular expressions starts out as a curse for many beginners, but these same aspects of the language can become incredible assets to people who take the time to master them. I would argue that it’s these very aspects, even if some details were gotten wrong along the way, that have made regular expressions as popular as they are today.

    Another key point is that modern regular expressions are based on decades of precedent and prior art, with layers upon layers of possibly less than ideal syntax designed to be backward compatible with earlier regex flavors. Language designers can certainly make arbitrary breaking changes if they want, but I would argue that we need less fracturing of the already broad regex flavor landscape–not more.

    To respond to some of your more specific points:

    “Unfortunately, there’s no way to use capture groups in the regular expression without having them returned in the results.”

    Well, you could use string.match with a global regex (and I favor *only* using string.match with global regexes, and always use regexp.exec instead when working with nonglobal regexes), in which case backreferences wouldn’t be included in the result array. That aside, decoupling capturing from including backreferences in e.g. the result array returned by regex.exec would be silly, IMO. The performance hit related to capturing mostly stems from the cost of storing and maintaining backreference values during the matching process (the backreference values may be updated thousands of times during a single match due to backtracking) rather than making the captured values available at the end of the match. Although the use of backreferences within a regex is indeed common, it is even more common (at least in JS) to want to use the captured values outside of the regex. We don’t need to further complicate regexes with four grouping semantics (grouping only, capture for backreference only, capture for match result only, capture for backreference and match result) where the current two (capturing vs. noncapturing) will do.
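
    A small sketch of that difference, using a made-up pattern and input. The same pattern is run once through a nonglobal exec and once through a global match:

```javascript
const text = "a1 b2";
const pair = /([a-z])(\d)/;

// Nonglobal exec: the full match, followed by each capture group.
const viaExec = Array.from(pair.exec(text));

// Global match: only the full matches, capture groups are left out.
const viaMatch = text.match(new RegExp(pair.source, "g"));

console.log(viaExec, viaMatch);
```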

    If you wanted to argue, e.g., that the simpler and prettier (…) should be noncapturing, and (?:…) should be used for capturing (since capturing is needed far less frequently), I don’t think anyone would disagree in principle, but it would be a pointless argument since it is never going to happen. There are decades of precedent for (…) being used to capture, you’d break backward compatibility and portability, thousands of developers would have to relearn this, etc.

    There are some things you *can* do, though. E.g., .NET offers an explicit capture flag (/n) that turns off the capturing functionality of (…) per regex, and since named capturing is still possible (via e.g. `(?<name>…)`) you don’t lose any functionality through this. Oniguruma automatically disables the capturing functionality of (…) in regexes that include named capture. Those are just a couple examples of ways that other regex flavors have sought to improve the related issues, and which can be explored in future ECMAScript specs.

    “I don’t want to have to refactor a bunch of code and then re-validate it all. This “non-capturing” thing is starting to seem a little bogus to me. It seems more work than it’s worth.”

    I think you’re getting this backwards. If, to start with, you had used noncapturing groups for any groups that you did not need to be able to reference in code outside the regex, then refactoring the regex down the road would have been less likely to require you to refactor other code. Adding support for named capture/backreferences (currently possible via my XRegExp library at http://xregexp.com ) would mean you’d never have to worry about backreference indexes changing in the first place.
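
    To illustrate the index-shifting point, here is a hedged sketch. It assumes an engine with native named capture groups (added to the language in ES2018) rather than XRegExp, and the group names and patterns are invented for the example.

```javascript
// Original pattern with two named groups.
const before = /(?<year>\d{4})-(?<month>\d{2})/;

// Refactored pattern: a new leading group shifts every numeric index by one,
// but the names still point at the same data.
const after = /(?<era>AD |BC )?(?<year>\d{4})-(?<month>\d{2})/;

const m1 = before.exec("2011-03");
const m2 = after.exec("AD 2011-03");

console.log(m1[1], m2[1]);                     // numeric index 1 now means something else
console.log(m1.groups.year, m2.groups.year);   // the name is stable
```

    Code that reads the year by name survives the refactor untouched; code that reads index 1 does not.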

    “I still don’t use it myself, because it’s never occurred to me why it’s really all that important to do so. It still seems like more trouble than it’s worth.”

    For the most part, noncapturing groups save you from the kinds of confusion and hassle you describe in this post, but I guess you’ve figured that out by now. :) This old blog post of mine may shed a tiny bit more light on the subject: http://blog.stevenlevithan.com/archives/capturing-vs-non-capturing-groups

    “In even those advanced algorithms, I got by just fine with my incomplete knowledge of regular expressions”

    Imagine what you could’ve done, then, with more complete knowledge of regular expressions. :-D

    “RegExp.leftContext”
    “RegExp.rightContext”

    FYI, these properties have long been deprecated.

    “Wouldn’t it have made more sense to have capturing be an opt-in (rather than opt-out) behavior?”

    Yes. But you can’t just change it since it would break backward compatibility, portability, and developer expectations. Properly discussing why it was done this way originally (whether or not it was the right decision at the time) would take a fair amount of time and a lot of historical context.

    Note that, technically speaking, there’s neither an “opt-in” nor “opt-out,” at least in JS. There are two distinct types of groups, which you choose between. But I know what you mean–you want (…) to be the noncapturing group syntax. In a perfect world, I would agree with you. But in this world, I don’t. :P

    “Moreover, even if you disagree with me about which should be the default behavior, I would still see a lot of benefit if there was a modifier that could turn off capturing across the whole regular expression, instead of forcing the author to turn it off for each group with “?:”.”

    Several regex flavors do offer related features that help deal with this. I’ve already pointed to .NET and Oniguruma as examples.

    “Why isn’t there an easy way to turn off capturing for the whole regular expression?”

    Apart from the obvious that ECMAScript has evolved relatively slowly and different people have different priorities, I’m not sure that allowing you to simply turn off capturing for entire regexes would be the best way to improve the situation.

    “There’s some inherent inconsistency, too. What happens to the “capture data” in a regular expression used by .test()?”

    I don’t understand why there’s anything inconsistent about this. I think @fearphage might have muddied the waters a little bit on Twitter when suggesting that string.split needs to splice backreferences into its result in order to be “consistent” with string.match. I disagree with him on that–it’s simply an API decision that JS inherited from Perl. The fact that it works that way is occasionally very useful, to be sure, and when combined with judicious use of noncapturing groups it covers 99.5% of use cases. But there are those (in practice very rare) cases like you mentioned where you want to use backreferences within a regex delimiter passed to string.split without splicing backreferences into the result, in which case you need to replace the use of split with a couple extra lines of code.

    Meh. Not a big deal, IMHO.

    “Of course, such “facts” as knowledge now is/was not particularly common, self-obvious, or well spelled out in most documentation and discussions of regex, and so I’m sure lots of devs like me have gone for a long time with some incorrect habits (which are hard now to break) in regular expression writing before understanding a more full picture.”

    So join the light side of the force and write and/or disseminate better documentation and discussion about regexes. :)

    From @fearphage:

    “You are using ( ) in an unintended way which rarely has unintended side effects. The problem is that you are capturing when you have no plans to use the captured values. You really want to group but you capture in addition to grouping. If you had been explicit with your syntax all along, this could have been avoid[ed].”

    I’m inclined to agree. It doesn’t do much good to e.g. argue over whether the meaning of (…) and (?:…) should be switched. It’s never going to happen, at least without more fundamental changes to regex syntax and behavior a la Perl 6 (a fundamental shift that e.g. JS inventor Brendan Eich has already ruled out). Improvements are possible, new flags that change the meaning of regex syntax are possible, but breaking change is not unless you want to create a new regex implementation that no one will use since a good-enough implementation is already built into JS. For better or worse, we’re stuck with certain aspects of RegExp in JS, so devs would do well to dedicate the necessary time to fill in their regex knowledge holes.

  • getify says:

    @Steven-
    Thank you for your excellent and thoughtful comment. I’m quite honored that you’d take the time to respond to my little rant blog post. :)

    If you wanted to argue, e.g., that the simpler and prettier (…) should be noncapturing, and (?:…) should be used for capturing (since capturing is needed far less frequently), I don’t think anyone would disagree in principle, but it would be a pointless argument since it is never going to happen.

    It may be futile to argue for it in terms of it ever being changed, but I think it’s important that we be clear and blunt about the past mistakes we’ve made (rather than just glossing over them as water under the bridge) — doing so helps understand the issue better AND it hopefully helps us avoid such mistakes in future behavior/API design.

    .NET offers an explicit capture flag (/n) that turns off the capturing functionality of (…) per regex

    I’m thrilled to know there’s precedent for this. That’s really the point of that part of my post — that there should be such a flag. Of course, I wish the default behavior could change, but since it can’t (because of backwards-compatibility), the next best thing is extending behavior so that alternate behavior is selectable if desired.

    So, I’ll turn my efforts then to petitioning that the next version of ECMAScript spec will include this modifier flag and behavioral selection for regular expressions.

    “RegExp.leftContext”
    “RegExp.rightContext”

    FYI, these properties have long been deprecated.

    These properties may be deprecated, but I think it’s a mistake to do so, and a mistake to remove them. Why? Because JavaScript doesn’t support look-behind assertions. Yes, I’ve read your post on mimicking look-behind assertions using various tricks. I prefer to use RegExp.leftContext to have something to make my look-behind assertion against. For instance:

    var str = "hello bar2 world foobar somebar other"; 
    
    function my_split(str) {
       //var regex = /(?<!foo)bar/g; // doesn't work in JS
       var regex = /bar/g;
       var lc, rc = str, tmp, chunks = [], captured_idx = 0; 
    
       while (tmp = regex.exec(str)) {
          lc = RegExp.leftContext;
          rc = RegExp.rightContext;
          if (!lc.match(/foo$/)) {
             chunks.push(str.substring(captured_idx,regex.lastIndex-tmp[0].length));
             captured_idx = regex.lastIndex;
          }
       }
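       // push everything after the last accepted match; using rc alone would
       // drop text whenever the final "bar" was a skipped "foobar"
       if (captured_idx < str.length) {
          chunks.push(str.substring(captured_idx));
       }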
       return chunks;
    }
    
    my_split(str); // ["hello ", "2 world foobar some", " other"]
    

    Not terribly graceful or efficient, but constructing the code in that simple way lets me do lots more complicated stuff that would frankly be either impossible or impractical/unmaintainable in pure regex.

    Anyway, point is, deprecated or not, I use them regularly for various tasks, and so I hope they never get removed. Doing so would definitely break a lot of code backwards-compatibility wise.
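
    For comparison, a sketch of the same split written directly with a look-behind assertion. This assumes an engine that supports look-behind (part of the language since ES2018); in engines without it, the regex literal is a syntax error and the manual loop is still needed.

```javascript
const str = "hello bar2 world foobar somebar other";

// Split on "bar" only where it is not immediately preceded by "foo".
const chunks = str.split(/(?<!foo)bar/);

console.log(chunks);
```

    With the sample string this gives the same three chunks as the manual loop.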

    Note that, technically speaking, there’s neither an “opt-in” nor “opt-out,” at least in JS.

    This may or may not be true under the covers of the JS engine, but for a simple dev mind like mine, there’s (…) for grouping, and you add ?: into it to make (?:…) if you want it to not capture. That’s why it seems like you have to “opt-in” to non-capturing, or rather, opt-out of capturing, by adding ?: to the beginning of the base group operator. More plainly, (?: looks like ( + ?:, not its own separate 3 character operator `(?:`. Po-tay-toe, Po-tah-toe.

    cases like you mentioned where you want to use backreferences within a regex delimiter passed to string.split without splicing backreferences into the result, in which case you need to replace the use of split with a couple extra lines of code.

    What “couple of extra lines of code” would that be, exactly? Are you suggesting something like what I do above, where I do the split myself? Or are you suggesting that I use the native split() but then remove the unwanted values from the array by checking each value?

    I maintain that it’s not quite as easy as one would think to just easily (and generally) implement this non-capturing behavior for .split(). Maybe I’m missing an easy general solution, but it seems like every place I do the split, I’m going to have to do some more manual specific code to emulate each instance.

    Moreover, this is why I’m asserting the better thing would be to extend the native .split() (and match/exec for that matter, though less useful) to have a third “includeResults” parameter. Of course, for backwards-compat, that parameter should default to `true`, but at least I could set it to `false` for the cases where I need it.
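
    One possible shape for a general helper is sketched below. The name splitWithoutCaptures is hypothetical, not a native API or anything proposed for the spec, and the sketch deliberately skips empty matches, so it does not reproduce every edge case of the native split (no limit argument, no splitting between characters on an empty pattern).

```javascript
// Hypothetical helper: split on a regex, but never splice capture groups
// into the result. Back-references inside the pattern keep working.
function splitWithoutCaptures(str, delimiter) {
  // Work on a global copy so lastIndex advances and the caller's regex is untouched.
  const flags = delimiter.flags.includes("g") ? delimiter.flags : delimiter.flags + "g";
  const re = new RegExp(delimiter.source, flags);

  const chunks = [];
  let start = 0;
  let m;
  while ((m = re.exec(str)) !== null) {
    if (m[0] === "") {
      // Empty match: step forward to avoid looping forever, and do not split here.
      re.lastIndex++;
      continue;
    }
    chunks.push(str.slice(start, m.index));
    start = m.index + m[0].length;
  }
  chunks.push(str.slice(start)); // whatever follows the last delimiter
  return chunks;
}

console.log(splitWithoutCaptures("a--b__c-_d", /([-_])\1/));
```

    The capture group is still available to \1 while matching, but only the text between delimiters ends up in the array.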

  • Tom Shinnick says:

    First, the original designers of regular expressions decided that ( ) groups should, by default, capture.

    Please pardon the request to reflect on the cuneiform and clay tablet era of computing, but the original designers of regular expressions were concerned with editors and command line tools. The reason capturing was quite reasonable is that matched text was most likely to be immediately modified/used in the same command.

    s/(["'])foo\1/\1oof\1/

    Usages like this (and its fewer characters) drove the ‘incorrect’ definition. Think editors and keystrokes and that’ll give some insight.
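
    The same idea carried over to JavaScript, as a rough sketch: the captured quote character is used once inside the pattern (\1) and again in the replacement ($1). The sample line is made up, and the g flag is added here so that every occurrence is rewritten.

```javascript
const line = "say \"foo\" and 'foo' but not \"foo'";

// Match foo only when it is wrapped in a matching pair of quotes,
// then put the same quote character back around the replacement.
const result = line.replace(/(["'])foo\1/g, "$1oof$1");

console.log(result);
```

    The mismatched pair at the end is left alone because \1 has to equal whatever the group captured.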

  • Chris says:

    I maintain that it’s not quite as easy as one would think to just easily (and generally) implement this non-capturing behavior for .split(). Maybe I’m missing an easy general solution, but it seems like every place I do the split, I’m going to have to do some more manual specific code to emulate each instance.

    I had to deal with the split() including capturing groups today, so I implemented a very quick fix:

    Utils.SplitWithoutCapture = function (str, split) {
        return str.split(
            new RegExp(split.source.replace(/\(([^?].*?)\)/g, "(?:$1)"))
        );
    };

    I don’t know if this will work universally, but it will normally replace capturing groups with non-capturing groups. It takes a regex to fix one…
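
    One case where the rewrite does not help is the back-reference scenario discussed earlier in the thread. The sketch below applies the same source rewrite to a made-up delimiter and only inspects the resulting pattern text:

```javascript
// A delimiter that relies on a back-reference to its own capture group.
const original = /([-_])\1/;

// The same source rewrite used in the quick fix above.
const rewritten = original.source.replace(/\(([^?].*?)\)/g, "(?:$1)");

console.log(rewritten); // the group is now noncapturing, but \1 is still in the pattern
```

    Once the group is noncapturing, \1 has no group left to refer to, so the rewritten pattern no longer means "the same separator twice".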

    (https://github.com/christopherliu/standard-js/blob/master/core/utils.js)
