How I Would “Improve” Regular Expressions

Regular expressions are an indispensable tool for any developer. They ease the pain of parsing strings, and virtually every programming language and IDE supports them. But like anything else, there’s always room for improvement. In this post, I’ll share some of my ideas for how I would change regular expressions for my own benefit. I’ll refrain from calling them improvements, as I’m sure plenty of people will disagree with my ideas.

To keep things simple in the examples, I will not mix my ideas: each one is demonstrated on its own, within the context of everyday regular expressions. Furthermore, I will pretend whitespace is ignored in regular expressions (much like free-spacing mode).

Easing Parsing of Multi-Line Text

One of the first changes I would make to regular expressions is to change the way we deal with multiple lines in strings. The tokens ^ and $ seem simple enough in concept, but they never seem to work the way you expect: their behavior depends on whether multi-line mode is enabled and varies between implementations. For example, if you want a regex which spans three lines of text, something as simple as this most likely won’t work:

(^exp$){3}
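
To make that concrete, here is a quick sketch with Python’s re module: the intuitive pattern fails because ^ and $ are zero-width, so nothing in it ever consumes the line breaks, and you have to spell those out yourself.

import re

text = "exp\nexp\nexp"

# The intuitive pattern does not match: nothing consumes the newlines between lines.
print(re.search(r"(^exp$){3}", text, re.MULTILINE))      # None

# Consume the line breaks explicitly and it works.
print(re.search(r"(^exp$\n?){3}", text, re.MULTILINE))   # matches all three lines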

Rather than having two tokens which represent the beginning and end of a line, respectively, I would devise a single token with a new meaning. I’m not sure how to phrase the definition exactly, but it’s a single token which can match the beginning of a line, the end of a line, or both. It’s non-greedy, so it excludes line breaks by default. Let $ be the token used for this definition. Then the previous regex could be written like this (the second line shows the expanded form):

($exp){3}$
$exp$exp$exp$

If you want a regex which spans a single line, that would be:

$exp$

A single, empty line would be denoted as such:

$$
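
For what it’s worth, you can get fairly close to this token today with the alternation (?:^|\n|$), although unlike my proposal it drags the separating line breaks into the match. A rough Python sketch of the three-line example, just to illustrate:

import re

text = "exp\nexp\nexp"

# Stand-in for the proposed token: start of string, a line break, or end of string.
# Unlike the proposal, any line break it consumes ends up inside the match.
B = r"(?:^|\n|$)"

pattern = re.compile(f"{B}exp{B}exp{B}exp{B}")  # the expanded form $exp$exp$exp$
print(pattern.search(text))                     # matches all three lines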

But what about matching the beginning and end of a string in multi-line mode? We already have the escape sequences \A and \Z for that purpose.
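
To spell out the difference with a quick Python example: with multi-line mode on, ^ and $ work per line, while \A and \Z always pin to the ends of the whole string.

import re

text = "one\ntwo"

print(re.findall(r"^\w+$", text, re.MULTILINE))  # ['one', 'two'] - per line
print(re.findall(r"\A\w+", text))                # ['one'] - start of the whole string
print(bool(re.search(r"\w+\Z", text)))           # True - 'two' sits at the very end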

Cropping

There have been cases where I needed to find a match that appeared within some context, but I didn’t need the context, just the subtext. At times this has been difficult, or at least cumbersome. My idea, denoted by the token &, would allow you to write an initial expression to match the context, then add a second expression to match some subtext of the first expression. This is a crude example, but suppose you wanted to match a line of C code which had a variable declaration (with an optional initializer), but you only wanted to match the variable name (notice the & token):

^ (short|int|long) \s+ [a-z_] [a-z0-9_]* \s* (=.*)? \; $ & [a-z_] [a-z0-9_]*

This feature could be made more convenient by supporting the backreferences \1, \2, and so on, familiar from find and replace:

^ (short|int|long) \s+ ([a-z_] [a-z0-9_]*) \s* (=.*)? \; $ & \2

If that confused you, perhaps this abstract example will make things clearer. Suppose you wanted to match only exp2 but within the context of exp1 and exp3:

(exp1)(exp2)(exp3) & \2
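
Until something like & exists, the closest tool we have is an ordinary capture group: match the whole context, then pull out (or substitute) just the group you care about. A small Python sketch of the variable declaration example above; the group numbers are simply how I happened to write the pattern:

import re

line = "int counter = 42;"

# Match the whole declaration, but capture only the variable name in group 2.
decl = re.compile(r"^(short|int|long)\s+([a-z_][a-z0-9_]*)\s*(=.*)?;$")

m = decl.search(line)
print(m.group(0))             # the whole context: 'int counter = 42;'
print(m.group(2))             # just the subtext:  'counter'
print(decl.sub(r"\2", line))  # or crop it via find and replace: 'counter'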

Multiple Conditions

This feature is similar to the previous feature of cropping, except that it doesn’t crop the original match. There is a program that I used to use frequently called PRGrep which had one particular feature that I found convenient: naive search. I don’t know this by any other name, but a naive search allows you to match two or more expressions in any order. If you wanted to match a single line of text which contained three different words but in any order, you would need to write a separate expression for each unique order:

^.* ( (one.*two.*three) | (one.*three.*two) | (two.*one.*three) | (two.*three.*one) | (three.*one.*two) | (three.*two.*one) ) .*$

Not very convenient. Assigning the above feature to the token &&, the last expression could be reduced to this:

^.+$ && one && two && three

One shortcoming of this feature is that the sub-conditions could overlap, such that “twone” would match both one and two. It’s not a feature that I would use in production code, but it’s a convenient way to quickly search code or logs.
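
For what it’s worth, engines with lookahead can already express this, overlap quirk and all; each lookahead is an independent condition checked from the start of the line. A quick Python sketch:

import re

# Each lookahead is a separate condition on the same line, in any order.
naive = re.compile(r"^(?=.*one)(?=.*two)(?=.*three).+$", re.MULTILINE)

print(bool(naive.search("three blind mice, two shoes, one ring")))  # True
print(bool(naive.search("a twone and a three")))                    # True - the overlap noted above
print(bool(naive.search("only one and two here")))                  # False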

Macros

A macro would be a regular expression which optionally takes one or more expressions as arguments. Here is what I would want from macros:

  • An elegant syntax shared with named captures (though I would call the captures “variables”)
  • Recursive macros which use OR logic or conditionals for stop conditions.
  • Nested variables in recursive macros which would allow us to match nested HTML / XML tags.
  • Make it standard for developers to include custom macros in their apps. Furthermore, I would allow the possibility for regex libraries to define macros in code rather than as regular expressions.

As an example, I will use angle brackets < > for defining / delimiting macros (and variables). Line one defines a macro for matching a number. The second line invokes the macro. The third line defines a variable which stores a number.

This is not meant to be an official syntax. It is merely meant for demonstration purposes.

<number: [0-9]+>
<number>
<variable = <number>>
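
Until then, the closest equivalents I know of are pattern fragments composed in the host language (for the macro) and named groups (for the variable). A minimal Python sketch, with names of my own choosing:

import re

NUMBER = r"[0-9]+"                                    # <number: [0-9]+>

print(re.fullmatch(NUMBER, "2024"))                   # <number>, used on its own
m = re.search(rf"(?P<variable>{NUMBER})", "x = 42;")  # roughly <variable = <number>>
print(m.group("variable"))                            # '42'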

Here is how a macro with one or more arguments might be defined (\a matches a bell character, used here only as a delimiter):

<numbers, num1, num2: (\a[0-9]+\a){num1, num2}>
<numbers, 3, 6>
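
A macro with arguments maps naturally onto a function that builds the pattern string. Again, this is only an approximation in Python, with names of my own choosing:

import re

NUMBER = r"[0-9]+"

def numbers(num1, num2):
    # <numbers, num1, num2>: \a is a bell character, used purely as a delimiter
    return rf"(\a{NUMBER}\a){{{num1},{num2}}}"

pattern = re.compile(numbers(3, 6))                   # <numbers, 3, 6>
print(bool(pattern.fullmatch("\a1\a\a22\a\a333\a")))  # True: three bell-delimited numbers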

I’ll do my best to provide an example that demonstrates matching tags using recursion and nested variables. I cheated a bit and used one of my previous ideas in <closetag>. 🙂 Alas, this example is not perfect: if some nested pair of tags does not match, it will still match the parent tags and thus the whole thing.

<word: [A-Za-z][A-Za-z0-9]*>
<opentag: \<<word>\>>
<closetag, t: \<\/(t&<word>)\>>
<xml: <t=<opentag>> (<xml> | .+ <closetag, t>) >
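
Python’s standard re module has no recursion, so as a stand-in, here is a sketch of what the recursive <xml> macro is trying to do, written as a small recursive function on top of re, very much in the spirit of defining macros in code, as mentioned in the list above:

import re

TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)>")  # a <word> wrapped in <...> or </...>

def match_element(text, pos=0):
    """Return the position just past one well-formed element starting at pos, else None."""
    m = TAG.match(text, pos)
    if not m or m.group(1):            # must begin with an open tag, not a close tag
        return None
    name = m.group(2)                  # the tag name (what <closetag, t> crops out of t)
    pos = m.end()
    while True:
        m = TAG.match(text, pos)
        if m is None:                  # plain text: skip ahead to the next tag
            nxt = text.find("<", pos)
            if nxt <= pos:
                return None
            pos = nxt
        elif m.group(1):               # a close tag: it must match our open tag
            return m.end() if m.group(2) == name else None
        else:                          # another open tag: recurse into the child
            pos = match_element(text, pos)
            if pos is None:
                return None

print(match_element("<a><b>text</b></a>"))  # 18: consumed the whole string
print(match_element("<a><b>text</a></b>"))  # None: tags are not properly nested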

I realize that my syntax for macros isn’t perfect and has some ambiguities with regular expressions. Again, it is only for demonstration purposes. I quickly devised a syntax that is easy to read so as not to confuse my readers.

Standardization!!!

This is not so much an idea as something I feel just needs to be done. As useful as regular expressions are, sometimes they’re difficult to write yourself and you just want to find an expression online that does what you need. However, in my experience, regular expression engines vary in their implementations, so the same expression is unlikely to work across ALL engines. Sometimes, examples I find online work just fine, but often I’m forced to write the expression myself.

While standardization wouldn’t fix everything, I feel it would improve the situation enough that we could share and reuse regular expressions with ease.