How I Would “Improve” Regular Expressions

Regular expressions are an indispensable tool for any developer. They ease the pain of parsing strings, and virtually every programming language and IDE supports them. But like anything else, there’s always room for improvement. In this post, I’ll share some of my ideas for how I would change regular expressions for my own benefit. I’ll refrain from calling them improvements, as I’m sure plenty of people will disagree with my ideas.

To keep things simple, I will not mix my ideas in the examples: each idea is demonstrated independently of the others and only within the context of everyday regular expressions. Furthermore, I will pretend whitespace is ignored in regular expressions.

Easing Parsing of Multi-Line Text

One of the first changes I would make to regular expressions is the way we deal with multiple lines in strings. The tokens ^ and $ seem simple enough in concept, but they never quite work the way you expect: their behavior depends on the mode you’re in and varies between implementations. For example, if you want a regex which spans three lines of text, something simple like this most likely won’t work:

(^exp$){3}

Rather than having two tokens which represent the beginning and end of a line, respectively, I would devise a single token with a new meaning. I’m not sure how to phrase the definition exactly, but it’s a single token which can match the beginning of a line, the end of a line, or both. It’s non-greedy, so it excludes line breaks by default. Let $ be the token used for this definition. Then the previous regex could be written like this (the second line shows the expanded form):

($exp){3}$
$exp$exp$exp$

If you want a regex which spans a single line, that would be:

$exp$

A single, empty line would be denoted as such:

$$

But what about matching the beginning and end of a string in multi-line mode? We already have the escape sequences \A and \Z for that purpose.
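For comparison, here is roughly what the three-line match looks like today, where the line breaks have to be matched explicitly and multi-line behavior has to be switched on. This is a minimal sketch assuming a C++17 standard library with the std::regex::multiline flag; other engines spell the flag differently.

#include <iostream>
#include <regex>
#include <string>

int main() {
    std::string text = "exp\nexp\nexp";
    // With the C++17 multiline flag, ^ and $ match at line boundaries,
    // but the line breaks themselves still have to be matched explicitly.
    std::regex three_lines("(?:^exp$\\n?){3}", std::regex::multiline);
    std::cout << std::boolalpha << std::regex_search(text, three_lines) << "\n";
}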

Cropping

There have been cases where I needed to find a match that appeared within some context, but I didn’t need the context itself, just the subtext. At times this has been difficult, or at least cumbersome. My idea, denoted by the token &, would allow you to write an initial expression to match the context, then add a second expression to match some subtext of the first expression. This is a crude example, but suppose you wanted to match a line of C code containing a variable declaration (with an optional initializer), while only matching the variable name (notice the & token):

^ (short|int|long) \s+ [a-z_] [a-z0-9_]* \s+ (=.*)? \; $ & [a-z_] [a-z0-9_]*

This feature could be made even more convenient by supporting backreferences to capture groups, as is common in find and replace (\1, \2):

^ (short|int|long) \s+ ([a-z_] [a-z0-9_]*) \s+ (=.*)? \; $ & \2

If that confused you, perhaps this abstract example will make things clearer. Suppose you wanted to match only exp2 but within the context of exp1 and exp3:

(exp1)(exp2)(exp3) & \2
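For comparison, the closest you can get to this today is to capture the subtext in a group and read that group out of the match object. A minimal sketch of the variable-name example above in C++ (the pattern is simplified for illustration):

#include <iostream>
#include <regex>
#include <string>

int main() {
    std::string line = "int my_counter = 0;";
    // Match the whole declaration (the context), but capture only the
    // variable name (the subtext) in group 2.
    std::regex decl("^(short|int|long)\\s+([a-z_][a-z0-9_]*)\\s*(=.*)?;\\s*$");
    std::smatch m;
    if (std::regex_match(line, m, decl)) {
        std::cout << m[2] << "\n";  // prints "my_counter"
    }
}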

Multiple Conditions

This feature is similar to cropping, except that it doesn’t crop the original match. There is a program I used to use frequently called PRGrep which had one particular feature I found convenient: naive search. I don’t know it by any other name, but a naive search allows you to match two or more expressions in any order. If you wanted to match a single line of text which contained three different words, in any order, you would normally need to write a separate expression for each unique order:

^.* (?: (one.*two.*three)|(one.*three.*two)|(two.*one.*three)|(two.*three.*one)|(three.*one.*two)|(three.*two.*one) ) .*$

Not very convenient. Assigning this feature to the token &&, the expression above could be reduced to this:

^.+$ && one && two && three

One shortcoming of this feature is that the sub-conditions could overlap, such that “twone” would match both one and two. It’s not a feature that I would use in production code, but it’s a convenient way to quickly search code or logs.
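For what it’s worth, most modern engines can already approximate this with lookaheads, at some cost in readability. A rough sketch of the “three words in any order” search in C++ (std::regex’s ECMAScript grammar supports positive lookahead):

#include <iostream>
#include <regex>
#include <string>

int main() {
    std::string line = "three blind mice saw two cats chase one dog";
    // Each lookahead asserts that a word appears somewhere in the line,
    // so the words may occur in any order.
    std::regex all_three("^(?=.*one)(?=.*two)(?=.*three).*$");
    std::cout << std::boolalpha << std::regex_search(line, all_three) << "\n";  // true
}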

Macros

A macro would be a regular expression which optionally takes one or more expressions as arguments. Some of the things I would want from macros:

  • An elegant syntax shared with named captures (though I would call them “variables”)
  • Recursive macros which use OR logic or conditionals for stop conditions.
  • Nested variables in recursive macros which would allow us to match nested HTML / XML tags.
  • Make it standard for developers to include custom macros in their apps. Furthermore, I would allow the possibility for regex libraries to define macros in code rather than as regular expressions.

As an example, I will use angle brackets < > for defining and delimiting macros (and variables). The first line defines a macro for matching a number. The second line invokes the macro. The third line defines a variable which stores a number.

This is not meant to be an official syntax. It is merely meant for demonstration purposes.

<number: [0-9]+>
<number>
<variable = <number>>

How to define macros with one or more arguments (\a matches a bell character):

<numbers, num1, num2: (\a[0-9]+\a){num1, num2}>
<numbers, 3, 6>

I’ll do my best to provide an example that demonstrates matching tags using recursion and nested variables. I cheated a bit and used one of my previous ideas in <closetag>. 🙂 Alas, this example is not perfect: if some nested pair of tags does not match, the parent tags (and thus the whole thing) will still match.

<word: [A-Za-z][A-Za-z0-9]*>
<opentag: \<<word>\>>
<closetag, t: \<\/(t&<word>)\>>
<xml: <t=<opentag>> (<xml> | .+ <closetag, t>) >

I realize that my syntax for macros isn’t perfect and has some ambiguities with regular expressions. Again, it is only for demonstration purposes; I quickly devised a syntax that is easy to read so as not to confuse my readers.
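Lacking engine support, the closest everyday substitute is to compose patterns from named string fragments in the host language (which only covers the non-recursive part of the idea; true recursion needs something like PCRE’s (?R)). A hedged sketch of the <number> and <word> macros in C++; the names and the assignment pattern are my own, for illustration only:

#include <iostream>
#include <regex>
#include <string>

int main() {
    // Reusable "macros" as plain strings, stitched together by the host
    // language rather than by the regex engine itself.
    const std::string number = "[0-9]+";
    const std::string word   = "[A-Za-z][A-Za-z0-9]*";
    const std::string assign = "(" + word + ")\\s*=\\s*(" + number + ")";

    std::string line = "count = 42";
    std::smatch m;
    if (std::regex_match(line, m, std::regex(assign))) {
        std::cout << m[1] << " stores " << m[2] << "\n";  // count stores 42
    }
}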

Standardization!!!

This is not so much an idea as something that I feel just needs to be done. As useful as regular expressions are, sometimes they’re difficult to write yourself and you just want to find an expression online that does what you need. However, in my experience, regular expression engines vary in their implementations, so the same expression is unlikely to work across ALL engines. Sometimes the examples I find online work just fine, but often I’m forced to write the expression myself.

While standardization isn’t perfect, I feel that it would improve the situation so we can share and reuse regular expressions with ease.


In-Place Merge Sort Demystified

There are various implementations of in-place merge sort around the web, including my own, and various papers describing procedures for in-place merging. Then a few days ago, while researching Smoothsort, I stumbled upon a page titled Smoothsort Demystified, which gives a nicely detailed explanation of that sorting algorithm. I haven’t seen such a tutorial for in-place merge sort. At best, I’ve only seen vague explanations or descriptions which use terminology that many people are unlikely to understand. Even Wikipedia only gives a couple of details on in-place merging without any adequate explanation or demonstration. I figured, since I know a technique for in-place merge sort and how to implement it, I’ll take a stab at writing my own tutorial.

Update 8/31/14: It seems that my initial understanding of “rotations” was naive at best. I plan to revise this post to describe three in-place merging techniques: insertions -> range swaps -> rotations. Also, thanks to BonzaiThePenguin, I’m now aware of an in-place stable sorting algorithm called Block sort which runs in O(n log n) time. The algorithm seems highly complex and I haven’t had time to study it yet.

Before reading any further, it is highly recommended that you already understand merge sort and preferably have even implemented it yourself. If not, there are more than enough resources online that will adequately teach you the merge sort algorithm, including several YouTube videos. It would be best to familiarize yourself with the algorithm before attempting to follow my tutorial.

What do you mean by “in-place”? A standard merge sort requires you to allocate a buffer equal in length to the input being sorted. For example, if you wish to sort an array with 1000 elements, it is necessary to allocate an additional scratch array with 1000 elements as a space for merging elements before copying them back to the input array. This is why merge sort has O(n) space complexity. An in-place sorting algorithm doesn’t require allocating any additional memory for sorting. Several algorithms meet this requirement, including insertion sort and heap sort, which have O(1) space complexity. However, in-place merge sort has O(log n) space complexity. This is because, like quick sort, it is recursive, and each level of recursion pushes a frame onto the stack.

How does in-place merging affect the performance and complexity of merge sort? A naive implementation may be 2-3x slower than a standard merge sort. In-place merging only requires a minor increase in comparisons; the worst effect on performance comes from the significant increase in reads and writes (there is a lot of swapping going on). I will explain a couple of optimizations towards the end which can significantly increase the performance of the algorithm.

Is in-place merge sort stable? Yes! As long as it is implemented correctly, in-place merge sort will preserve the stability of merge sort.

Why would I want to use an in-place merge sort? The truth is, you probably wouldn’t. The only reason I can fathom is if you’re working with embedded systems with very limited memory. In the vast majority of cases, merge sort or Timsort would be a better choice. Otherwise, this is merely for those who are interested or graduate students who need a topic for their thesis.

Is this applicable to Timsort? Only partially. It is perfectly possible to implement an in-place natural merge sort. However, Timsort is an adaptive sort and has a galloping mode which can significantly reduce the number of comparisons on partially sorted data. If you were to implement an in-place Timsort, it would be impossible to implement galloping mode as well… at least using the technique I describe in this post.


Rotations


Before I move forward, I should define what is meant by rotations as they are used heavily in the merging process. Actually, rotations are used in a limited form here in a way that is functionally equivalent to std::swap_ranges in C++ (aka block swap). First, I’ll explain how rotations are used in the algorithm, then I’ll explain what rotations really are and why it’s still a proper usage of the term.

For in-place merging, rotations are used to swap the elements in two adjacent sublists which are equal in length. As you will soon see, this technique is used to move a large number of elements at once.

SwapRanges
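In code, that basic operation is just a block swap of two adjacent, equal-length ranges. A minimal sketch using std::swap_ranges:

#include <algorithm>
#include <cstddef>
#include <vector>

// Swap two adjacent, equal-length sublists in place:
// [ a0 a1 a2 | b0 b1 b2 ]  becomes  [ b0 b1 b2 | a0 a1 a2 ]
void block_swap(std::vector<int>& v, std::size_t first, std::size_t mid) {
    // The (mid - first) elements starting at 'first' are exchanged with
    // the same number of elements starting at 'mid'.
    std::swap_ranges(v.begin() + first, v.begin() + mid, v.begin() + mid);
}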

That is the basic operation of in-place merging. So what are rotations anyway? To rotate the elements in a sublist means to shift them some number of places to the right, with elements at the end wrapping around to the beginning (or vice versa). So if you have the list [1,2,3,4,5], shifting it one place to the right will give you [5,1,2,3,4].

The fact that the two ranges being swapped are adjacent is important in calling this operation a rotation. If the length of the sublist being rotated is an even number, then shifting the elements half that distance results in the same operation as swap_ranges above. In the following graphic, I shift the elements one place to the right three times, achieving the same result as before.

Rotate

That’s one way to look at rotations. The way that std::rotate in C++ is portrayed is different. Rather than shifting elements N places to the left or right, std::rotate rotates the elements around a pivot indicated by some middle element, such that the middle element becomes the new first element. Following that definition, if the middle element sits M places from the start of the range, then shifting the elements M places to the left achieves the same result.
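To make that concrete, here is a tiny example: rotating [1,2,3,4,5] around the element 4 (which sits three places from the start) makes 4 the new first element, exactly as if every element had been shifted three places to the left.

#include <algorithm>
#include <iostream>
#include <vector>

int main() {
    std::vector<int> v = {1, 2, 3, 4, 5};
    // The element pointed to by the middle iterator (the 4) becomes the
    // new first element of the range.
    std::rotate(v.begin(), v.begin() + 3, v.end());
    for (int x : v) std::cout << x << ' ';  // prints: 4 5 1 2 3
    std::cout << '\n';
}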


In-Place Merging


Now that I’ve explained rotations in too much detail, I can explain how to use them for in-place merging. Let us define two sorted sublists, A and B, that we want to merge in-place.

Ranges

The trick is to swap the largest elements in A for the smallest elements in B. A linear search can find the range of elements to rotate: start from the middle and extend outwards, comparing pairs, until you find an element in A which is no greater than its counterpart in B. Rotate the elements between these bounds.

LinearSearch

That’s the basic operation. However, something special happened. Of the two ranges, the smallest elements are in A and the largest elements are in B, albeit not in order. This means you can continue to merge A and B independently, so all you’re left with is two smaller merges. This also means that in-place merging is a recursive function in itself.

By recursively applying rotations in this manner, eventually you’ll merge the two ranges. Let’s apply this process once more and see what happens:

Rotate2

Just like that, the right-hand side is already sorted and the left-hand side is almost sorted. A couple more rotations on the left-hand side and the merge procedure would be complete.

In essence, each rotation moves the elements that much closer to their final location. When the sublists are large, rotations can move thousands of elements at a time. But as the sublists become smaller and smaller, rotations become less efficient. When the sublists become small enough, there are a couple of techniques you could apply that would be more efficient than applying several more rotations.
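Before getting to those optimizations, here is a minimal, unoptimized sketch that ties the pieces together: a linear search from the boundary outwards, a block swap, and then two smaller, independent merges. It is only meant to illustrate the recursion described above, not to be a tuned implementation.

#include <algorithm>
#include <cstddef>
#include <vector>

// In-place merge of two adjacent sorted ranges v[first, mid) and v[mid, last).
void inplace_merge_sketch(std::vector<int>& v,
                          std::size_t first, std::size_t mid, std::size_t last) {
    std::size_t max_k = std::min(mid - first, last - mid);

    // Extend outwards from the boundary while A's element is greater than
    // B's element (strictly greater, to preserve stability).
    std::size_t k = 0;
    while (k < max_k && v[mid - 1 - k] > v[mid + k]) {
        ++k;
    }
    if (k == 0) {
        return;  // A's largest <= B's smallest: already merged.
    }

    // Swap the largest k elements of A with the smallest k elements of B.
    std::swap_ranges(v.begin() + (mid - k), v.begin() + mid, v.begin() + mid);

    // Everything left of 'mid' is now <= everything right of it,
    // so the two halves can be finished independently.
    inplace_merge_sketch(v, first, mid - k, mid);
    inplace_merge_sketch(v, mid, mid + k, last);
}

Calling this merge from an ordinary top-down or bottom-up merge sort, in place of the usual buffered merge, gives the in-place merge sort described in this post.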


Binary Search


In the previous section, I stated that a linear search can be used to find the range of elements to rotate. While that is fine for small sublists, it would be very impractical for large sublists with thousands or millions of elements. Using a linear search, it may be necessary to compare thousands of elements to find the bounds of the range to rotate.

In general, a better approach is to use a binary search, which is guaranteed to find the bounds within O(log n) comparisons. A binary search can be tricky to implement, as it’s easy to get stuck in an infinite loop. However, the end result is worth it, as it finds the bounds much more quickly and with far fewer comparisons.

Start halfway between the center and the end, then move the search inwards or outwards depending on the result of each comparison. It’s important to only compare elements which are equidistant from the center!

BinarySearch
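As a sketch, this search can replace the linear scan in the merge above: it returns the same count of elements to swap on each side, but in O(log n) comparisons. It only ever compares pairs equidistant from the boundary, and because both sides are sorted, the comparison flips exactly once as you move outwards.

#include <algorithm>
#include <cstddef>
#include <vector>

// Binary search for k: the number of elements to swap across the boundary
// at 'mid'. Only elements equidistant from the boundary are compared.
std::size_t find_swap_count(const std::vector<int>& v,
                            std::size_t first, std::size_t mid, std::size_t last) {
    std::size_t lo = 0;
    std::size_t hi = std::min(mid - first, last - mid);
    while (lo < hi) {
        std::size_t i = lo + (hi - lo) / 2;
        if (v[mid - 1 - i] > v[mid + i]) {
            lo = i + 1;   // this pair still needs to be swapped
        } else {
            hi = i;       // boundary lies at or before i
        }
    }
    return lo;  // number of elements to swap on each side
}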


Optimizing Using Insertions


When the sublists become small enough, using insertions to move the last few elements into place would be more efficient than using rotations. However, how do you decide whether to use insertions or rotations? What I do is define a threshold for when to switch to insertion.

A well-optimized merge sort typically uses insertion sort to sort small sublists before merging anything. Similarly, a well-optimized quick sort uses insertion sort once a partition is small enough. Following these examples, I typically use insertion sort on sublists up to 32 elements in length. In the worst case, this implies (32+31+30+29+28+…) insertions, which can be calculated directly with the formula n*(n+1)/2. For 32 elements, this gives a worst case of 528 insertions.

Now we need a formula to determine whether the number of insertions required to merge two lists is below this threshold. This is actually simple: just multiply the lengths of the two lists. If one sublist has 10 elements and the other has 15, that gives a worst case of 10*15 = 150 insertions.
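As a sanity check of the arithmetic, the decision reduces to comparing the product of the two lengths against the fixed threshold. A tiny helper might look like this (the names and the run length of 32 are my own choices, as above):

#include <cstddef>

// Worst case, merging sublists of length m and n by insertion costs about
// m * n element moves; 528 = 32 * 33 / 2 is the worst case for
// insertion-sorting a 32-element run.
constexpr std::size_t kInsertionThreshold = 32 * 33 / 2;  // 528

bool should_use_insertion(std::size_t len_a, std::size_t len_b) {
    return len_a * len_b <= kInsertionThreshold;
}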


Using a scratch array for merging


By applying rotations, you effectively reduce two large sublists into several smaller sublists. When the sublists become small enough, a very effective optimization is to simply allocate a small buffer to merge the sublists in. This will allow you to greatly reduce the number of rotations and is far more efficient than using insertions.

This can be further optimized as well. While most implementations of merge sort allocate a scratch buffer of n elements, it is trivial to implement the merge so that it only needs n/2 elements (technically still O(n), but half the memory), and it is generally a bit quicker as well. This is the same technique Timsort uses to bound its worst-case scratch space at n/2 elements.

Begin by copying one of the sublists into the buffer. Then simply merge the elements in-place. Depending on which sublist is smaller, it may be necessary to merge the elements in reverse order. In the following graphic, the green cells represent a small buffer which is large enough to store a copy of the right sublist.

halfmerge

This is a great technique to greatly reduce the space requirement while only being marginally slower, although technically it wouldn’t be an in-place merge sort anymore. This technique is also applicable to Timsort and allows you to retain some of the benefit of galloping mode. I have implemented my own variant of Timsort which uses a scratch buffer of only n/1024 elements.
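A hedged sketch of that half-buffer merge, copying the right-hand sublist into a scratch buffer and merging from the right end so nothing is overwritten before it is consumed (a full implementation would copy whichever sublist is smaller and merge in the corresponding direction):

#include <algorithm>
#include <cstddef>
#include <vector>

// Merge sorted v[first, mid) and v[mid, last) using a scratch buffer that
// only holds the right-hand sublist.
void buffered_merge_right(std::vector<int>& v,
                          std::size_t first, std::size_t mid, std::size_t last) {
    std::vector<int> buf(v.begin() + mid, v.begin() + last);  // copy of B

    std::size_t a = mid;          // unconsumed tail of A ends here
    std::size_t b = buf.size();   // unconsumed tail of B (in buf) ends here
    std::size_t out = last;       // next free slot is out - 1

    while (a > first && b > 0) {
        // Take the larger tail element; ties take B so equal elements keep
        // their original A-before-B order (stability).
        if (v[a - 1] > buf[b - 1]) {
            v[--out] = v[--a];
        } else {
            v[--out] = buf[--b];
        }
    }
    // Whatever is left of B still needs to be copied back; leftovers of A
    // are already in their final positions.
    std::copy(buf.begin(), buf.begin() + b, v.begin() + first);
}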


Summary


In this post, I have demonstrated a practical technique for merging two sublists in place. It has limited applicability in the real world, but it is interesting nonetheless and is a topic of research that has been worthy of publication. When writing this tutorial, I did my best to explain things in simple terms and to provide some intuition for why the technique works. I hope that it was easy to understand and that you learned something new from reading it.

I’m looking for feedback on this article. Please feel free to post any questions or comments here or on Reddit.

Random Word Generator

I was looking through some old code of mine and found a little project I had worked on a number of years ago. I was designing an engine which would output random English-sounding words. I started with a simple idea and tweaked it through a few revisions to produce some good results. As an example, here are some of the words it generates:

nnehoffi
paxiebuyase
uadscutll
bizoladaull
qeqmehel
hoskyedroek
ibhnartuois
vdigro
skleftsom
wextbong
wemmempti
cnamlitl
jiinau
brirewsmew

The words are produced using a simple database that looks like this:

a abcdefghijklmnopqrstuvwxyz
aa bcdhklmnrstuvw
aaa ls
aab
aab adgimnopstuz
aac
aac achks
aad
aad aeghituy
aae
aae dlw

Generating Words

Begin by choosing a random letter a-z. Look up that first letter in the database to find another character which you may append to it. Now you have a sequence of two letters. As before, look up that pair of letters in the database to find a third character to append. Now you have a sequence of three letters. Again, look up that sequence in the database to find a fourth character to append, and so on. However, the maximum length of any lookup sequence is three, so once you have a sequence of four or more letters, you only look up the last three letters to find the next character to append. Limiting the lookup to three letters allows you to diverge from normal sequences and generate random words which don’t exist in the English language. Of course, the engine may also generate real English words by chance.

The last step is to finalize your randomly generated word. If you look back at the sample of the database, you’ll notice that some sequences of three letters are repeated but followed by nothing. This indicates that the sequence is a valid word ending. For example, it’s okay to have a word which ends with the sequence “aad”.

Dealing with “Dead-Ends”

The database isn’t perfect and it does have some dead-ends. Ultimately, generating words is a matter of trial and error. When I refer to dead-ends, I’m referring to two possibilities:

  • You have a sequence of characters to which you can append nothing
  • The last three characters of your word are not a valid ending

There are three basic techniques which I could employ:

  • I could unwind and try appending a different character
  • I could throw away the word and restart from scratch
  • If there is an invalid ending and the word doesn’t have to be a specific length, I could try appending more characters until I have a valid ending.

My preference is the second option, to restart from scratch. While there are some dead-ends in the database, they’re infrequent enough that simply starting over is a practical option. There may also be more sophisticated algorithms which are guaranteed to produce a valid word on every try, but I feel that’s unnecessary.

Building the Database

Of course, the engine would be worthless without the database. While generating any database is simple enough, creating a quality database is another matter. But first, how do you generate a database at all? You download some word lists, parse the words in those lists, break them down into parts, build the patterns, and write them to a database. My database was generated by sampling tens of thousands of real English words. It’s 145KB and contains 14288 lines.

Building a quality database is a matter of applying some filters to eliminate bad patterns. You could simply sift through the database and remove the bad patterns by hand, but with 14288 lines in my own database, that would be a tedious task. However, I did do some manual filtering by eliminating some bad word beginnings, 197 of them in total. Although real English words have these beginnings, they produced some odd results, so I filtered them from my database. Here are a few that I chose to eliminate:

dm dn dp dq drj drs drw dv dz fc fg fj fls

Besides the manual filtering, I defined one rule to eliminate some patterns from the database. Every three-character sequence, together with a character you may append to it, forms a window of four characters. Within these four characters, I required there be at least one consonant and one vowel. For example, the sequence “aae” cannot be followed by another vowel. As for the letter Y, which could be either a consonant or a vowel depending on the context, I chose to treat it as neither. While I could have analyzed the context to determine whether the Y was a consonant or a vowel, I chose to take the easy route. At worst, I may have lost a few patterns which would otherwise have been included in the database.

But Why???

The reason I created this database is that I was writing a password generator at the time and wanted to add an option to generate “pronounceable” passwords. The one thing I failed to do is determine how secure this option would actually be. However, I did have a simple idea to increase the strength of the passwords by using mixed case (“URugUaY”) or substituting some letters with similar-looking numbers and symbols (“3” for “E”, “$” for “S”). The code I wrote for this is a mess and not worth publishing, but one day I may write a small library which implements this engine and publish it to GitHub. For now, you can implement the engine yourself and download my database here: http://1drv.ms/1jF9L3y
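To make the generation procedure concrete, here is a hedged sketch in C++. The data structures are my own assumptions about how the database might be held in memory: “transitions” maps a sequence of one to three letters to the characters that may follow it, and “endings” holds the three-letter sequences that appear in the database with nothing after them (the valid word endings). Calling this in a loop until it returns a value implements the restart-from-scratch strategy.

#include <cstddef>
#include <map>
#include <optional>
#include <random>
#include <set>
#include <string>

// One attempt at generating a word of the requested length. Returns an
// empty optional when a dead end is hit, so the caller can simply retry.
std::optional<std::string> try_generate(
        const std::map<std::string, std::string>& transitions,
        const std::set<std::string>& endings,
        std::size_t target_length,
        std::mt19937& rng) {
    std::uniform_int_distribution<int> first_letter('a', 'z');
    std::string word(1, static_cast<char>(first_letter(rng)));

    while (word.size() < target_length) {
        // Only the last three letters (at most) are used for the lookup.
        std::string key = word.size() <= 3 ? word : word.substr(word.size() - 3);
        auto it = transitions.find(key);
        if (it == transitions.end() || it->second.empty()) {
            return std::nullopt;  // dead end: nothing can be appended
        }
        std::uniform_int_distribution<std::size_t> pick(0, it->second.size() - 1);
        word.push_back(it->second[pick(rng)]);
    }

    // The last three letters must form a valid ending; otherwise retry.
    if (word.size() >= 3 && endings.count(word.substr(word.size() - 3)) == 0) {
        return std::nullopt;
    }
    return word;
}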

Cogito Ergo Sum

The title, Cogito ergo sum, is a famous Classical Latin phrase by René Descartes which translates into English as, “I think, therefore I am.” It’s one of my favorite quotes of all time, largely because it relates to my fascination with consciousness (more specifically, sentience, the ability to experience sensations). So it annoys me when people misinterpret this quote and apply it in improper contexts. In fact, of all the times I’ve heard this phrase, I’ve only ever heard it used correctly once. As a recent example, I’ve been watching a series of Crash Course videos on psychology; in the 15th episode, at 1m19s, the host applies this quote incorrectly in the context of psychology.

To put it simply, the phrase, “I think, therefore I am,” came about as Descartes pondered the nature of consciousness, eventually pondering and doubting his own existence. But then he realized that the ability to doubt his own existence meant that there was a doubter, himself, doing the doubting. In other words, the ability to contemplate one’s own existence, or to think in general, is proof of one’s own existence. If you change a single word in the English translation, it might make more sense: “I think, therefore I exist.”

I can’t fully explain it, but consciousness fascinates me, and it’s something that I’ve spent a good amount of time contemplating myself. In relation to this quote, one aspect of conscious experience that I came to realize is that we are aware of our own conscious experience and that we’re able to communicate and talk about our conscious experiences. For me, that is strong evidence that consciousness plays a role in brain function and in controlling our behavior. As a consequence, this implies that consciousness has an effect on the physical world, however major or minor that effect may be. I consider this one of the fundamental properties of consciousness, and one that must be accounted for (explained) if we are ever to hope to develop a so-called Theory of Everything.

p.s. I recommend the Crash Course collection for those who are curious-minded, but not to those who are hoping to study and obtain a deep understanding of a subject.

“The sum of all natural numbers is -1/12…”

Recently, a video posted on Numberphile titled, Astounding Result, has been making the rounds on the web. Of course, this video has resulted in a plethora of criticisms, and I would argue some of it is justified. I wasn’t convinced either and had my skepticisms as well. However, I continued to research the problem and tried to gain a greater understanding. What I concluded is that the result isn’t necessarily wrong, but the way they present the problem and the terminology they use are both improper.

EDIT: Numberphile posted an excellent follow-up video on March 18 titled, Why -1/12 is a gold nugget. For those interested, I highly recommend watching it.

Early on in the video, they display this result in a “well-known string theory book,” presumably to give it some justification. If you look closely, this is the way in which the result is presented in the book:

Image 006

Notice that this is not an equation, as there’s no equality = sign. Rather, what you have is an arrow, which means something quite different. That’s the first mistake they made in the video: using the equality sign everywhere to signify that this summation “equals” -1/12, which is not how it’s understood in mathematics or physics.

Secondly, if you pause the video on a stable frame (where the camera isn’t jittering), you can clearly make out the preceding text in the book. To save you the trouble, this is what is written:

It can be evaluated by regulating the theory and then being careful to preserve Lorentz invariance in the renormalization.

There are a few special terms here which are, unfortunately, never used in the video. This is the second mistake they made in the video: the lack of proper terminology. For those of us who studied calculus in college, our natural instinct is to test whether the series converges on some finite value L. If not, we simply conclude the series diverges and move on with our day. However, we were only taught one notion of summation, the ordinary limit of partial sums; under that definition (and even under the slightly more general Cesàro summation), this series does indeed diverge. However, there are other techniques for “assigning a meaningful value to a divergent series,” collectively known as regularization (as well as renormalization, used in quantum mechanics).
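To give one concrete example of such a technique: in zeta function regularization, the Riemann zeta function is defined by a convergent series for Re(s) > 1, extended to the rest of the complex plane by analytic continuation, and the divergent sum is then assigned the value of the continuation at s = -1:

\zeta(s) = \sum_{n=1}^{\infty} \frac{1}{n^{s}} \quad (\operatorname{Re} s > 1),
\qquad \text{and by analytic continuation} \qquad
\zeta(-1) = -\frac{1}{12}.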

There was further criticism of how they proved these results, by taking these infinite sequences and adding them together, multiplying them, shifting them to the right, and so on. I’d rather not get into this much. Instead, I’ll just point you to the Riemann series theorem.

In the video and other places on the web, I’ve seen the claim that these results are used in “many areas of physics” and “have been observed in nature.” The most common example I’ve encountered relates to string theory, which I think is moot because, as I understand it, string theory hasn’t been tested experimentally (i.e. it’s untestable using existing equipment) and is therefore little more than just theory. However, in an alternate video proof [1], a reference is made to the Casimir force (i.e. the Casimir effect), which has been measured experimentally.

I’m not claiming to have any deep understanding of the mathematics or physics presented here. I simply felt that clarification is needed in order to settle the uproarious masses. I may investigate this topic further in the future, but for now, I’ve satisfied my skepticism. In conclusion, the result isn’t necessarily incorrect (I’ll leave it to the masses to disprove), but greater care could have been taken in the presentation, the proofs, and the terminology used.

[1] There’s a second, much longer and more complex proof presented in this video: Sum of Natural Numbers. The Wikipedia article covering this sequence (1+2+3+4+5+6…) can be found here. Another interesting sequence, covered in much more detail, can be found here (1-2+3-4+5-6…). The Wikipedia article for divergent series contains some useful information as well, including some of the methods used for summation. A couple of the highly critical responses prompted by this video: One and Two.