How I Would “Improve” Regular Expressions

Regular expressions are an indispensable tool for any developer. They ease the pain of parsing strings, and virtually every programming language and IDE supports them. But like anything else, there’s always room for improvement. In this post, I’ll share some of my ideas for how I would change regular expressions for my own benefit. I’ll refrain from calling them improvements, as I’m sure plenty of people will disagree with my ideas.

To keep things simple in the examples, I will not mix my ideas. I will demonstrate each idea independently of my other ideas and only within the context of everyday regular expressions. Furthermore, I will pretend whitespace is ignored in regular expressions.

Easing Parsing of Multi-Line Text

One of the first changes I would make to regular expressions is the way we deal with multiple lines in strings. The tokens ^ and $ seem simple enough in concept, but they never seem to work as you expect; their behavior depends on the mode you’re in and varies between implementations. For example, if you want a regex which spans three lines of text, something simple like this most likely won’t work:

(^exp$){3}
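
For comparison, this is the kind of thing you typically end up writing today; a minimal sketch of mine, assuming C++17’s std::regex with the multiline flag:

#include <iostream>
#include <regex>
#include <string>

int main()
{
    // Spanning three lines today means spelling out the line breaks
    // yourself; (^exp$){3} alone would not consume the \n characters.
    std::regex re("(^exp$\\n?){3}", std::regex::multiline);
    std::string text = "exp\nexp\nexp";
    std::cout << std::regex_search(text, re) << '\n';  // prints 1
}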

Rather than having two tokens which represent the beginning and end of a line, respectively, I would devise a single token with a new meaning. I’m not sure how to phrase the definition exactly, but it’s a single token which can match the beginning of a line, the end of a line, or both. It’s non-greedy, so it excludes line breaks by default. Let $ be the token used for this definition. Then the previous regex could be written like this (the second line shows the expanded form):

($exp){3}$
$exp$exp$exp$

If you want a regex which spans a single line, that would be:

$exp$

A single, empty line would be denoted as such:

$$

But what about matching the beginning and end of a string in multi-line mode? We already have the escape sequences \A and \Z for that purpose.

Cropping

There have been cases where I needed to find a match that appeared within some context, but I didn’t need the context, just the subtext. At times this has been difficult, or at least cumbersome. My idea, denoted by the token &, would allow you to write an initial expression to match the context, then a second expression to match some subtext of the first match. This is a crude example, but suppose you wanted to match a line of C code containing a variable declaration (with an optional initializer), but you only wanted to match the variable name (notice the & token):

^ (short|int|long) \s+ [a-z_] [a-z0-9_]* \s* (=.*)? \; $ & [a-z_] [a-z0-9_]*

This feature could be made more convenient by letting the cropping expression be a backreference (\1, \2, and so on), mirroring a common find-and-replace feature:

^ (short|int|long) \s+ ([a-z_] [a-z0-9_]*) \s* (=.*)? \; $ & \2

If that confused you, perhaps this abstract example will make things clearer. Suppose you wanted to match only exp2 but within the context of exp1 and exp3:

(exp1)(exp2)(exp3) & \2
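
For what it’s worth, the closest substitute today is a capture group: match the full context, then pull out just the group. A minimal sketch in C++ (the pattern is my condensed version of the declaration example above):

#include <iostream>
#include <regex>
#include <string>

int main()
{
    // Group 2 captures the variable name; printing m[2] is what the
    // "& \2" form above would express directly in the pattern itself.
    std::regex re(R"((short|int|long)\s+([a-z_][a-z0-9_]*)\s*(=.*)?;)");
    std::string line = "int counter = 42;";
    std::smatch m;
    if (std::regex_search(line, m, re))
        std::cout << m[2] << '\n';  // prints: counter
}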

Multiple Conditions

This feature is similar to the previous feature of cropping, except that it doesn’t crop the original match. There is a program I used to use frequently called PRGrep which had one particular feature that I found convenient: naive search. I don’t know this by any other name, but a naive search allows you to match two or more expressions in any order. If you wanted to match a single line of text containing three different words in any order, you would need to write a separate expression for each unique ordering:

^.* ( (one.*two.*three) | (one.*three.*two) | (two.*one.*three) | (two.*three.*one) | (three.*one.*two) | (three.*two.*one) ) .*$

Not very convenient. Assigning the above feature to the token &&, the last expression could be reduced to this:

^.+$ && one && two && three

One shortcoming of this feature is that the sub-conditions could overlap, such that “twone” would match both one and two. It’s not a feature that I would use in production code, but it’s a convenient way to quickly search code or logs.
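
Incidentally, lookaheads can already approximate the && idea in today’s engines, since each lookahead is an independent condition tested from the same position. A small sketch with std::regex:

#include <iostream>
#include <regex>
#include <string>

int main()
{
    // Three independent conditions, any order; and "twone" would
    // still satisfy both "one" and "two", just like the && version.
    std::regex re("^(?=.*one)(?=.*two)(?=.*three).*$");
    std::string line = "three, then two, then one";
    std::cout << std::regex_match(line, re) << '\n';  // prints 1
}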

Macros

A macro would be a regular expression which optionally takes one or more expressions as arguments.

  • An elegant syntax shared with named captures (though I would call them “variables”)
  • Recursive macros which use OR logic or conditionals for stop conditions.
  • Nested variables in recursive macros which would allow us to match nested HTML / XML tags.
  • Make it standard for developers to include custom macros in their apps. Furthermore, I would allow the possibility for regex libraries to define macros in code rather than as regular expressions.

As an example, I will use angle brackets < > for defining / delimiting macros (and variables). Line one defines a macro for matching a number. The second line invokes the macro. The third line defines a variable which stores a number.

This is not meant to be an official syntax. It is merely meant for demonstration purposes.

<number: [0-9]+>
<number>
<variable = <number>>

How to define macros with one or more arguments (\a matches a bell character):

<numbers, num1, num2: (\a[0-9]+\a){num1, num2}>
<numbers, 3, 6>

I’ll do my best to provide an example that demonstrates matching tags using recursion and nested variables. I cheated a bit and used one of my previous ideas in <closetag>. 🙂 Alas, this example is not perfect because, if some nested pair of tags does not match, it will still match the parent tags and thus the whole thing.

<word: [A-Za-z][A-Za-z0-9]*>
<opentag: \<<word>\>>
<closetag, t: \<\/(t&<word>)\>>
<xml: <t=<opentag>> (<xml> | .+ <closetag, t>) >

I realize that my syntax for macros isn’t perfect and has some ambiguities with regular expressions. Again, it is only for demonstration purposes; I quickly devised a syntax that is easy to read so as not to confuse my readers.

Standardization!!!

This is not so much an idea as something I feel just needs to be done. As useful as regular expressions are, sometimes they’re difficult to write yourself and you just want to find an expression online that does what you need. However, in my experience, regular expression engines vary in their implementations, so the same expression is unlikely to work across ALL engines. Sometimes, examples I find online work just fine, but often I’m forced to write the expression myself.

While standardization isn’t perfect, I feel that it would improve the situation so we can share and reuse regular expressions with ease.


In-Place Merge Sort Demystified

There are various implementations of in-place merge sort around the web, including my own, and various papers describing procedures for in-place merging. Then a few days ago, while researching Smoothsort, I stumbled upon a page titled Smoothsort Demystified which gives a nicely detailed explanation of that sorting algorithm. I haven’t seen such a tutorial for in-place merge sort. At best, I’ve only seen vague explanations or descriptions which use terminology that many people are unlikely to understand. Even Wikipedia only gives a couple of details of in-place merging without any adequate explanation or demonstration. I figured, since I know a technique for in-place merge sort and how to implement it, I’d take a stab at writing my own tutorial.

Update 8/31/14: It seems that my initial understanding of “rotations” was naive at best. I plan to revise this post to describe three in-place merging techniques: insertion -> range swaps -> rotations. Also, thanks to BonzaiThePenguin, I’m now aware of an in-place stable sorting algorithm called Block sort which runs in O(n log n) time. The algorithm seems highly complex and I haven’t had time to study it yet.

Before reading any further, it is highly recommended that you already understand merge sort and preferably have even implemented it yourself. If not, there are more than enough resources online that will adequately teach you the merge sort algorithm, including several YouTube videos. It would be best to familiarize yourself with the algorithm before attempting to follow my tutorial.

What do you mean by “in-place”? A standard merge sort requires you to allocate a buffer equal in length to the input being sorted. For example, if you wish to sort an array with 1000 elements, it is necessary to allocate an additional scratch array with 1000 elements as a space for merging elements before copying them back to the input array. This is why merge sort has O(n) space complexity. An in-place sorting algorithm doesn’t require allocating any additional memory for sorting. Several algorithms meet this requirement, including insertion sort and heap sort, which have O(1) space complexity. In-place merge sort, however, has O(log n) space complexity because, like quick sort, it is recursive and must push frames onto the call stack.

How does in-place merging affect the performance and complexity of merge sort? A naive implementation may be 2-3x slower than a standard merge sort. In-place merging only requires a minor increase in comparisons. The worst effect on performance occurs due to the significant increase in reads and writes (there is lots of swapping going on). I will explain a couple of optimizations towards the end which can significantly increase the performance of the algorithm.

Is in-place merge sort stable? Yes! As long as it is implemented correctly, in-place merge sort will preserve the stability of merge sort.

Why would I want to use an in-place merge sort? The truth is, you probably wouldn’t. The only reason I can fathom is if you’re working with embedded systems with very limited memory. In the vast majority of cases, merge sort or Timsort would be a better choice. Otherwise, this is merely for those who are interested or graduate students who need a topic for their thesis.

Is this applicable to Timsort? Only partially. It is perfectly possible to implement an in-place natural merge sort. However, Timsort is an adaptive sort and has a galloping mode which can significantly reduce the number of comparisons on partially sorted data. If you were to implement an in-place Timsort, it would be impossible to implement galloping mode as well… at least using the technique I describe in this post.


Rotations


Before I move forward, I should define what is meant by rotations as they are used heavily in the merging process. Actually, rotations are used in a limited form here in a way that is functionally equivalent to std::swap_ranges in C++ (aka block swap). First, I’ll explain how rotations are used in the algorithm, then I’ll explain what rotations really are and why it’s still a proper usage of the term.

For in-place merging, rotations are used to swap the elements in two adjacent sublists which are equal in length. As you will soon see, this technique is used to move a large number of elements at once.

[Figure: swapping two adjacent, equal-length ranges]

That is the basic operation of in-place merging. So what are rotations anyway? To rotate the elements in a sublist means to shift them some number of places to the right, with elements at the end wrapping around to the beginning (and vice versa). So if you have the list [1,2,3,4,5], shifting it one place to the right will give you [5,1,2,3,4].

The fact that the two ranges being swapped are adjacent is important in calling this operation a rotation. If the length of the sublist being rotated is an even number, then shifting the elements half that distance results in the same operation as swap_ranges above. In the following graphic, I shift the elements one place to the right three times, achieving the same result as before.

[Figure: rotating the sublist one place to the right, three times]

That’s one way to look at rotations. The way that std::rotate in C++ portrays them is different. Rather than shifting elements N places to the left or right, std::rotate rotates the elements on a pivot indicated by some middle element, such that the middle element becomes the new first element. Following that definition, if the middle element is at index M, then shifting the elements M places to the left achieves the same result.
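
To make the equivalence concrete, here is a minimal C++ sketch of mine; both calls turn {1,2,3,4,5,6} into {4,5,6,1,2,3}:

#include <algorithm>
#include <vector>

int main()
{
    std::vector<int> a = {1, 2, 3, 4, 5, 6};
    std::vector<int> b = {1, 2, 3, 4, 5, 6};

    // Block swap: exchange the two adjacent, equal-length halves.
    std::swap_ranges(a.begin(), a.begin() + 3, a.begin() + 3);

    // Rotation: the element at index 3 becomes the new first element,
    // i.e. a left rotation by 3 places.
    std::rotate(b.begin(), b.begin() + 3, b.end());

    // a == b == {4, 5, 6, 1, 2, 3}
}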


In-Place Merging


Now that I’ve explained rotations in too much detail, I can explain how to use them for in-place merging. Let us define two sorted sublists, A and B, that we want to merge in-place.

[Figure: two adjacent sorted sublists, A and B]

The trick is to swap the largest elements in A for the smallest elements in B. A linear search can find the range of elements to rotate: start from the middle and extend outwards for as long as the element in A is greater than the corresponding element in B, stopping once A’s element is less than or equal to B’s (which is also what preserves stability). Rotate the elements between these bounds.

[Figure: linear search outward from the boundary between A and B]

That’s the basic operation. However, something special happened. Of the two ranges, the smallest elements are in A and the largest elements are in B, albeit not in order. This means you can continue to merge A and B independently, so all you’re left with is two smaller merges. This also means that in-place merging is a recursive function in itself.

By recursively applying rotations in this manner, eventually you’ll merge the two ranges. Let’s apply this process once more and see what happens:

[Figure: applying the rotation process once more]

Just like that, the right-hand side is already sorted and the left-hand side is almost sorted. A couple more rotations on the left-hand side and the merge procedure would be complete.
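
To tie the steps together, here is a minimal C++ sketch of the procedure described so far (a linear search for the bounds, one swap, then two smaller merges; the function names are mine):

#include <algorithm>
#include <vector>

// Merge the sorted runs [first, middle) and [middle, last) in place.
void rotation_merge(std::vector<int>& v, int first, int middle, int last)
{
    if (first >= middle || middle >= last) return;  // one run is empty

    // Linear search outward from the boundary, comparing equidistant
    // elements, while A's tail is strictly greater than B's head.
    int n = 0;
    while (middle - n - 1 >= first && middle + n < last &&
           v[middle - n - 1] > v[middle + n])
        ++n;
    if (n == 0) return;  // nothing crosses the boundary: already merged

    // The rotation: swap the n largest of A with the n smallest of B.
    std::swap_ranges(v.begin() + (middle - n), v.begin() + middle,
                     v.begin() + middle);

    // Two smaller, independent merges remain, one on each side.
    rotation_merge(v, first, middle - n, middle);
    rotation_merge(v, middle, middle + n, last);
}

// An ordinary top-down merge sort becomes in-place by calling it.
void merge_sort(std::vector<int>& v, int first, int last)
{
    if (last - first < 2) return;
    int middle = first + (last - first) / 2;
    merge_sort(v, first, middle);
    merge_sort(v, middle, last);
    rotation_merge(v, first, middle, last);
}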

In essence, each rotation moves the elements that much closer to their final location. When the sublists are large, rotations can move thousands of elements at a time. But as the sublists become smaller and smaller, rotations become less efficient. When the sublists become small enough, there are a couple of techniques you could apply that would be more efficient than applying several more rotations.


Binary Search


In the previous section, I stated that a linear search can be used to find the range of elements to rotate. While that is fine for small sublists, it would be very impractical for large sublists with thousands or millions of elements. Using a linear search, it may be necessary to compare thousands of elements to find the bounds of the range to rotate.

In general, a better approach is to use a binary search, which is guaranteed to find the bounds within O(log n) comparisons. A binary search can be tricky to implement, as it’s easy to get stuck in an infinite loop. However, the end result is worth it, as it will find the bounds much more quickly and with far fewer comparisons.

Start half-way from the center and move the search inwards or outwards, depending on the result of the comparison. It’s important to only compare elements which are equidistant from the center!

[Figure: binary search for the rotation bounds]
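
A sketch of that search in C++; it works because, as you move away from the center, the left element shrinks and the right element grows, so the comparison flips exactly once:

#include <algorithm>
#include <vector>

// Find how many elements to swap across the boundary at `middle`.
int rotation_width(const std::vector<int>& v, int first, int middle, int last)
{
    int lo = 0;                                        // always a valid width
    int hi = std::min(middle - first, last - middle);  // widest possible swap
    while (lo < hi) {
        int i = lo + (hi - lo) / 2;
        // Compare a pair of elements equidistant from the center.
        if (v[middle - 1 - i] > v[middle + i])
            lo = i + 1;  // this pair still needs swapping; widen
        else
            hi = i;      // this pair is in order; narrow
    }
    return lo;
}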


Optimizing Using Insertions


When the sublists become small enough, using insertions to move the last few elements into place would be more efficient than using rotations. However, how do you decide whether to use insertions or rotations? What I do is define a threshold for when to switch to insertion.

A well optimized merge sort typically uses insertion sort to sort small sublists before merging anything. Similarly, a well optimized quick sort uses insertion sort once a partition is small enough. Following these precedents, I typically use insertion sort on sublists up to 32 elements in length. In the worst case, this implies (31+30+29+…+1) insertions, which can be calculated directly using the formula (n*(n-1)/2). For 32 elements, this gives us a worst case of 496 insertions.

Now we need a formula to determine whether the number of insertions required to merge two lists is below this threshold. This is actually simple: just multiply the lengths of the two lists. If one sublist has 10 elements and the other has 15 elements, that gives us a worst case of (10*15=150) insertions.
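
As a sketch, the whole decision comes down to a couple of lines (the constants mirror the 32-element threshold above):

// Worst case of insertion-sorting a 32-element run: 31+30+...+1 = 496.
const int kInsertionBudget = 32 * 31 / 2;

// Merging runs of lengths lenA and lenB by insertion moves at most
// lenA * lenB elements, so switch once that fits within the budget.
bool use_insertions(int lenA, int lenB)
{
    return lenA * lenB <= kInsertionBudget;
}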


Using a scratch array for merging


By applying rotations, you effectively reduce two large sublists into several smaller sublists. When the sublists become small enough, a very effective optimization is to simply allocate a small buffer to merge the sublists in. This will allow you to greatly reduce the number of rotations and is far more efficient than using insertions.

This can be further optimized as well. While most implementations of merge sort allocate a scratch buffer of n elements, it is trivial to implement the merge with a buffer of only n/2 elements (still O(n) space, but half the memory), and it is generally a bit quicker as well. This is the same technique that Timsort uses to cap its worst-case scratch space at n/2 elements.

Begin by copying one of the sublists into the buffer. Then merge the buffer and the other sublist back into the original array. Depending on which sublist is smaller, it may be necessary to merge the elements in reverse order. In the following graphic, the green cells represent a small buffer which is large enough to store a copy of the right sublist.

[Figure: merging with a half-size buffer (green cells)]
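
Here is a minimal sketch of that half-buffer merge, assuming the right run is the smaller one; merging backwards from the ends guarantees nothing is overwritten before it has been read:

#include <vector>

// Merge sorted runs [first, middle) and [middle, last) using a buffer
// the size of the right run only.
void merge_with_buffer(std::vector<int>& v, int first, int middle, int last)
{
    std::vector<int> buf(v.begin() + middle, v.begin() + last);
    int i = middle - 1;           // tail of the left run
    int j = (int)buf.size() - 1;  // tail of the buffered right run
    int k = last - 1;             // write position, moving backwards
    while (j >= 0) {
        // On ties take the buffered (right-hand) element, which keeps
        // equal elements in their original order, so the merge is stable.
        if (i >= first && v[i] > buf[j])
            v[k--] = v[i--];
        else
            v[k--] = buf[j--];
    }
    // Any remaining left-run elements are already in place.
}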

This is a great technique to greatly reduce the space complexity while only being marginally slower, albeit technically it wouldn’t be an in-place merge sort anymore. This technique is also applicable to Timsort and allows you to retain some of the benefit of galloping mode. I have implemented my own variant of Timsort with O(n/1024) space complexity.


Summary


In this post, I have demonstrated a practical technique for merging two sublists in place. It has limited applicability in the real world, but it is interesting nonetheless and is a topic of research that has been worthy of publication. When writing this tutorial, I did my best to explain things in simple terms and to provide some intuition for why this technique works. I hope that it was easy to understand and that you learned something new from reading it.

I’m looking for feedback on this article. Please feel free to post any questions or comments here or on Reddit.

Sanha pt. 2 – Containers and Composition

I’ve decided I wanted to share a little more of the ideas of my imaginary programming language. I say imaginary because it is merely a design; there is no compiler or interpreter for this language. In my previous post, I gave an introduction to the syntax and type system.

Disclaimer: The examples I provide are merely meant to demonstrate the features, not to show practical usage. As such, some of the examples may appear crude and poorly written.


Review of Types

In my former post, I covered a bit about the type system in Sanha. In essence, a hybrid of dynamic typing and static typing is employed. When a variable is declared, it may be initialized with a value of any type. But once a variable is initialized, its type cannot change for the remainder of its lifetime.

def foo(#cond bool)
{
    if(cond) return "Ten"
    else return 10
}
#value = foo(true)
value = 30 // Error - Cannot assign int32 to string

It’s possible to explicitly define types for variables or function arguments, though there is no explicit type syntax. Rather, objects of various forms may be interpreted as types in certain contexts. For example, [string] is an array containing a typeid, but it may be interpreted as an array of strings.

#arr [string] = ["Ten", "Twenty", "Thirty"]

I don’t know if there is a term for this paradigm; I simply refer to it as types as expressions. Any expression that results in an object of a special form may be interpreted as a type.

def foo(){ return [int32] }
#arr foo() = [10, 20, 30]

Containers

For this post, I will introduce the built-in container types and the capabilities they provide. I define a container as a collection of elements. There are three basic container types: Arrays, Tuples, and Objects.

Arrays in this language are much like those in other languages: they contain an ordered sequence of elements of a single type, and their length may vary. Square brackets [] are used to write array literals.

#arr [int32] = [10, 20, 30, 40, 50]
arr.length == 5

Tuples contain an ordered list of multiple types of a fixed length. Each index has a fixed type which cannot change once the tuple is initialized. They support random-access just like arrays. Since the elements can differ in type, tuples are one of the ways that you can employ dynamic typing in Sanha.

#tuple (string, int32) = ("Thirty Five", 35)
tuple.length == 2
tuple[0] == "Thirty Five"
tuple[1] == 35
tuple[0] = "Forty" // OK
tuple[0] = 40 // Error - Cannot assign int32 to string
tuple[1] = 40 // OK

Unlike tuples in D, which are automatically expanded anywhere they’re referenced, Sanha does not expand tuples. This means it’s possible to have nested tuples.

#tuple = ( (10, 20, 30) , (40, 50, 60) )

Since tuples are of a fixed length, it’s also possible to specify names for each element and access them as members of the tuple. In this way, tuples resemble C-style structs.

#tuple (#s string, #i int32) = (#s = "Forty", #i = 40)
tuple[0] is tuple.s
tuple[1] is tuple.i

Objects are an unordered set of members of varying types. As the name suggests, they are the basis of OOP, though there will be explicitly typed classes as well. Curly braces {} are used to write object literals. Objects may also contain methods (i.e. sub-functions).

#obj = {
    #s = "Fifty"
    #i = 50
    def timesTen()
    {
        return i * 10
    }
}
obj.i == 50
obj.timesTen() == 500

In my previous post, one of the examples contained a blurb showing sets as part of the language. I decided against that to enable certain semantics in the language which I’ll demonstrate later in this post. However, I figured sets, and other types of containers like hash tables, could be implemented as part of the standard library. I expect that I may include literals for a few types of containers which are simply implemented in the standard library.


Member Grouping

Everything you have seen so far is pretty conventional. However, if I ended it there, that would be boring. Before I get into composition, I wanted to introduce a smaller feature which I call member grouping.

Member grouping allows you to create a container using one or more members of an object. The following example demonstrates how to use the members of an object to construct an array, tuple, or another object:

#obj = {
    #a = 10
    #b = 20
    #c = 30
}
#array = obj.[a, b, c]
#tuple = obj.(a, b, c)
#obj2  = obj.{ #one = a, #two = b, #three = c }

Composition

This is my favorite feature of my own language. I refer to it as composition because it’s an extension of list comprehension but more powerful and flexible. In essence, composition encompasses the notion that you may write any sequence of statements and expressions which are evaluated in order and concatenated into a single container.

It’s important to know that many features which are ordinarily statements in other languages are expressions in Sanha. This includes control statements such as if/else or loops. The second important fact is that elements in a container may be separated by line-breaks as well as commas. So the following code is valid:

#arr = [
    150
    300
    450
    if(cond) 600
]
arr.length // may be 3 or 4, depending on result of (cond)

Now this is where things get fun. Loops are iterated multiple times, so the result is that loops insert multiple elements into a container. This is the functional equivalent of list comprehension from other languages.

// arr contains the values 0 to 99
#arr = [loop(#i in 0..100) i]
arr.length == 100

If you put the two prior examples together, you can construct a container from any sequence of expressions and statements.

#arr = [
    loop(#i in 0..10) i
    20
    loop(#i in 30..40) i
    if(cond) rand(0..100) // Made-up random generator function
    loop(#x in 0..10) loop(#y in 0..10) x*y
]

Arrays have the limitation that all of the elements must be of the same type. However, composition works in tuples as well. Just change the square brackets [] to parentheses () and you have a tuple!

// Not everything needs to be on a separate line
#tuple = ("The powers of two", loop(#i in 0..8) i**2)

Objects are a bit different. Objects require you to specify a name for every element. However, you may still include non-declarative statements in objects and they’ll be executed in order.

#obj = {
    #i = 100, #j = 200
    if(cond) printLine(i) // Result of expression is ignored
    #k = 300
}

There are two keywords which assist in composition: unpack and mixin.

Unpack is a statement which will expand the elements of any iterable object into an array or tuple. It’s no different than using a loop to perform the same operation; the keyword simply provides a convenient shortcut, and it proves useful for overcoming some inconveniences in the syntax.

#arr = [unpack 0..100]
#tuple = (unpack arr)

Mixin is a statement which will copy the members of an object into another object (applicable to tuples with named elements as well, though the elements will be in no particular order). So all of the members of that object will be redeclared in the current scope.

#obj = { #a = 10, #b = 20, #c = 30 }
#obj2 = {
    mixin obj
    #d = 40, #e = 50
}
obj2.a == 10
obj2.e == 50

Function Arguments

A little secret about function arguments: they’re nothing more than tuples. So all of the features I’ve demonstrated on tuples may be applied to function calls as well. For example, function arguments may be separated by line breaks:

printLine(
    "Hello, World!"
    newLine
    "The current date is: "
    getDateAsString()
    newLine
    "The current time is: "
    getTimeAsString()
)

Suppose one function returns a tuple and you want to pass those elements as arguments to another function. All you need to do is use the unpack keyword to expand the elements into the function call.

def one(){ return (10, 20, 30) }
def sum3(#a #b #c int32){ return a + b + c }
#value = sum3(unpack one())
value == 60

Hidden Semantics

Now that I have covered containers and composition, I can demonstrate how they affect the semantics of the syntax.

Many languages use curly braces for structured programming. Usually, there’s not much in terms of semantics here. However, considering that control statements can be used in expressions, what happens to the result when curly braces are introduced? A few languages, including Rust, choose to make the last statement in a block the result of the expression. That is not what happens in Sanha, though. Sanha always defines curly braces to mean objects. So if your control statement is followed by curly braces, the result is going to be an object.

// A crude example...
#obj = if(cond)
{
    #i = 30
    #j = 40
}
else
{
    #i = 50
    #j = 60
}
obj.(i, j) // may equal (30, 40) or (50, 60)

There is no rule forcing you to use curly braces either. You can just as well use parentheses to nest your code, though this goes against convention and is highly discouraged.

if(cond)
(
    printLine("Hello, World!")
)

The semantics state that control statements are always followed by a single statement or expression. While containers may have multiple statements and expressions, the entirety of the container is a single expression.

Let’s make things a bit more interesting. You know these semantics apply to control statements. Now consider, I define functions to be control statements as well because, in a sense, they also control the flow of execution. So this means that the same semantics of control statements apply to functions as well.

Nothing is ever implicitly returned from functions. A function returns nothing unless it contains some type of return statement (of which there are a few).

// Single statement function
def foo() return 35
// Nesting using arrays? Why not!
def bar()
[
    printLine("sup?")
    return "nutin"
]

Program Design

The semantics I’ve covered in this post may raise some questions. I’ll try to clarify a few things here.

There are a few “gotchas” to look out for, such as:

// Expression continues after closing curly brace
if(true) { printLine() } printLine() // Error
// Correct:
if(true) { printLine() }
printLine()
// Alternatively, you may use a semi-colon or comma:
if(true) { printLine() } ; printLine()

Suppose you have an if/else statement nested using curly braces and you want the result of that statement to be a single value and not an object. Simply store the desired value in a variable and extract it afterwards:

#value = if(cond) {
    #value1 = rand(), #value2 = rand()
    #x = if(value1 <= value2) value1 else value2
}.x // Extract value of x here

What if you want to extract two or more values into an array? Use member grouping and the unpack keyword.

// A crude example...
#arr = [
    unpack if(cond) {
        #a = 10, #b = 20, #c = 30
    }.(a, b, c)
    else {
        #a = 40, #b = 50, #c = 60
    }.(a, b, c)
]

The semantics allow you to open up your functions in various ways:

def foo(#arg)
if(typeof(arg) == string)
{
}
else
{
}

Closing Note

Everything I’ve covered so far raises the question: wouldn’t this be a horribly inefficient language? I’ve put a lot of thought into the design and features of this language. There are certain aspects of the language that may be inefficient, especially those which utilize dynamic typing or JIT’ing. The most important thing to know is that this is meant to be a compiled language, and there are several opportunities for optimization and elimination of dead code.

Semantically, using curly braces for structured programming requires creating objects everywhere. However, a compiler would be smart enough to know when an object isn’t captured or stored in a variable, so it can eliminate the object altogether and allocate variables on the stack rather than the heap.

While a variable can be initialized with any type, its type cannot change after it is initialized. In the majority of cases, a type for that variable can be deduced at compile-time, so there is zero overhead at runtime.

Ultimately, this language is nothing more than a design and I have no plans to actually implement a compiler or interpreter for this language. This is simply something I enjoy pondering about and have put a great amount of thought into. It gives me something to write about and perhaps somebody out there will find my ideas interesting enough to do something with them.

Random Word Generator

I was looking through some old code of mine and found a little project I had worked on a number of years ago. I was designing an engine which would output random English-sounding words. I started with a simple idea and tweaked it through a few revisions to produce some good results. As an example, here are some of the words it generates:

nnehoffi
paxiebuyase
uadscutll
bizoladaull
qeqmehel
hoskyedroek
ibhnartuois
vdigro
skleftsom
wextbong
wemmempti
cnamlitl
jiinau
brirewsmew

The words are produced using a simple database that looks like this:

a abcdefghijklmnopqrstuvwxyz
aa bcdhklmnrstuvw
aaa ls
aab
aab adgimnopstuz
aac
aac achks
aad
aad aeghituy
aae
aae dlw

Generating Words

Begin by choosing a random letter a-z. Look up that first letter in the database to find another character which you may append to it. Now you have a sequence of two letters. As before, look up that pair of letters in the database to find a third character to append. Now you have a sequence of three letters. Again, look up that sequence in the database to find a fourth character to append… However, the maximum length of any sequence of letters is three. So if you have a sequence of four or more letters, you only look up the last three letters to find the next character to append. Limiting the maximum sequence to three letters allows you to diverge from normal sequences and generate random words which don’t exist in the English language. Of course, this engine may also generate real English words by chance.

The last step is to finalize your randomly generated word. If you look back at the sample of the database, you’ll notice some sequences of three letters are repeated but followed by nothing. This indicates that the sequence is a valid word ending. For example, it’s okay to have a word which ends with the sequence “aad”.

Dealing with “Dead-Ends”

The database isn’t perfect and it does have some dead-ends. Ultimately, generating words is a matter of trial and error. When I refer to dead-ends, I’m referring to two possibilities:

  • You have a sequence of characters which you can append nothing to
  • The last three characters in your word is not a valid ending

There are three basic techniques which I could employ:

  • I could unwind and try appending a different character
  • I could throw away the word and restart from scratch
  • If there is an invalid ending and the word doesn’t have to be a specific length, I could try appending more characters until I have a valid ending.

My preference is the second option, to restart from scratch. While there are some dead-ends in the database, they’re infrequent enough that simply starting over is a practical option. There may also be more sophisticated algorithms which are guaranteed to produce a valid word on every try, but I feel that’s unnecessary.

Building the Database

Of course, the engine would be worthless without the database. While generating any database is simple enough, creating a quality database is another matter. But first, how do you generate a database? You simply need to download some word lists, then parse the words in those lists, break them down into parts, build the patterns, and write them to a database. My database was generated by sampling tens of thousands of real English words. It’s 145KB and contains 14288 lines.

Building a quality database is a matter of applying some filters to eliminate bad patterns. You could simply sift through the database and remove the bad patterns by hand, but with 14288 lines in my own database, this would be a tedious task. However, I did do some manual filtering by eliminating some bad word beginnings, 197 in total. Although real English words have these beginnings, they produced some odd results, so I filtered them from my database. Here are a few that I chose to eliminate:

dm dn dp dq drj drs drw dv dz fc fg fj fls

Besides the manual filtering, I defined one rule to eliminate some patterns from the database. For every sequence of three characters, there is a fourth character you may append to it. Of these four characters, I required there be at least one consonant and one vowel. For example, the sequence “aae” cannot be followed by another vowel. As for the letter Y, which can be either a consonant or a vowel depending on the context, I chose to treat it as neither. While I could have analyzed the context to determine whether the Y was a consonant or a vowel, I chose to take the easy route. At worst, I may have lost a few patterns which would otherwise have been included in the database.

But Why???

The reason I chose to create this database is that I was writing a password generator at the time and I wanted to add an option to generate “pronounceable” passwords. The one thing I failed to do is determine how secure this option would actually be. However, I did have a simple idea to increase the strength of the passwords by using mixed case (“URugUaY”) or substituting some letters with similar-looking numbers and symbols (“3” for “E”, “$” for “S”). The code I wrote for this is a mess and not worth publishing. But one day, I may write a small library which implements this engine and publish it to GitHub. For now, you can implement the engine yourself and download my database here: http://1drv.ms/1jF9L3y
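
To make the procedure concrete, here is a minimal C++ sketch of the generation loop. The container names and the loading of the database are assumptions of mine: `next` maps a sequence of up to three letters to the characters that may follow it, and `endings` holds the valid three-letter word endings.

#include <cstdlib>
#include <string>
#include <unordered_map>
#include <unordered_set>

std::string generate_word(
    const std::unordered_map<std::string, std::string>& next,
    const std::unordered_set<std::string>& endings,
    int length)  // desired length, assumed to be at least 3
{
    for (;;) {  // on any dead-end, throw the word away and restart
        std::string word(1, static_cast<char>('a' + std::rand() % 26));
        bool dead = false;
        while ((int)word.size() < length) {
            // Only the last three letters are ever looked up.
            std::string key = word.size() <= 3
                            ? word : word.substr(word.size() - 3);
            auto it = next.find(key);
            if (it == next.end() || it->second.empty()) { dead = true; break; }
            word += it->second[std::rand() % it->second.size()];
        }
        if (!dead && endings.count(word.substr(word.size() - 3)))
            return word;
    }
}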

Cogito Ergo Sum

The title, Cogito ergo sum, is a famous Classical Latin phrase by René Descartes which translates into English as, “I think, therefore I am.” It’s one of my favorite quotes of all time, largely because it relates to my fascination with consciousness (more specifically, sentience, the ability to experience sensations). So it annoys me when people misinterpret this quote and apply it in improper contexts. In fact, of all the times I’ve heard this phrase, I’ve only ever heard it used correctly once. As a recent example, I’ve been watching a series of Crash Course videos on Psychology. In the 15th episode at 1m19s, the host applies this quote incorrectly in the context of psychology.

To put it simply, the phrase, “I think, therefore I am,” came about as Descartes pondered the nature of consciousness, eventually pondering and doubting his own existence. But then he realized that the ability to doubt his own existence meant that there was a doubter, himself, doubting his own existence. In other words, the ability to contemplate one’s own existence, or to think in general, is proof of one’s own existence. If you change a single word in the English translation, it might make more sense: “I think, therefore I exist.”

I can’t fully explain it, but consciousness fascinates me, and it’s something that I’ve spent a good amount of time contemplating myself. In relation to this quote, one aspect of conscious experience that I came to realize is that we are aware of our own conscious experience and that we’re able to communicate and talk about our conscious experiences. For me, that is strong evidence that consciousness plays a role in brain function and in controlling our behavior. As a consequence, this implies that consciousness has an effect on the physical world, however major or minor that effect may be. I consider this one of the fundamental properties of consciousness and one that must be accounted for (explained) if we are ever to hope to develop a so-called Theory of Everything.

p.s. I recommend the Crash Course collection for those who are curious-minded, but not to those who are hoping to study and obtain a deep understanding of a subject.

An Introduction to Sanha

What is Sanha? It’s the name for a design of a programming language which I’ve been working on for a number of years. I call it a design because there’s no compiler, interpreter, or IDE. Quite literally, I haven’t written a single line of code for this language. It is nothing more than a blueprint. With that said, it began nearly a decade ago after I started writing code in C++. Eventually, I grew such a distaste for C++ that I set off to design a “better” language. It was about a year later that I discovered the D programming language, which made my little “project” unimportant (as I delicately put it, “I love D because I hate C++”). However, I had developed an interest in programming languages, so I continued to write down my thoughts and ideas. I didn’t originally call it Sanha; the name came later. After a while, I settled on the name Piranha, but I didn’t like it much. Eventually, I discovered the word sanha in the etymology of piranha; sanha literally means “tooth”.

So what is Sanha like? Initially, it began as a re-engineering of C++, copying its syntax and semantics, as that was the only language I really knew at the time. The language has undergone a drastic evolution since then, being highly influenced by D and Python, with C++ being a distant echo. The language still has a C-style feel, but there is a lot of hidden semantics in the syntax.

Get to it already! The conventional “Hello, World!” program:

module main
import std.console

def main(#args [string])
{
    printLine("Hello, World!")
}

Unfortunately, I couldn’t get any syntax highlighting to work here. Everything should seem fairly conventional: the language uses modules, ‘def’ is used to define functions, braces are used for nesting, and statements are terminated by a line-break. However, this may seem a bit unusual:

#args [string]

You may have deduced that this defines a parameter, args, which is an array of strings. There are a lot of hidden semantics in this single bit of code. But let’s start with the basics:

  • This defines an explicit type, which means Sanha is statically typed … kind of …
  • The type may be omitted so that the type is inferred from the initializer.
  • The design is such as not to give preference to either type inference or explicit typing; both are encouraged and appropriate in different contexts.

The syntax is actually more clever than that. The full truth is, Sanha employs a hybrid of static and dynamic typing. Variables are statically typed, but their type is not set in stone; given you don’t explicitly define a type, variables may be initialized with an object of any type, which is unknown at compile-time and deduced at runtime. The type of a variable cannot change once it is declared and/or initialized. Another way to think of it: dynamic typing happens at initialization, and static typing takes over thereafter.

Then there’s the matter of the type syntax. First, it’s important to understand that types are first-class citizens, so you may pass them by reference, store them in containers, et cetera. With that understanding, the term [string] is literally an array with a type in it. What this means is, types are expressions which may be evaluated at runtime. As long as an expression results in an object of a special form, it will be translated into a type. The following code is equally valid:

def foo()
{
    return [string]
}

def main(#args foo()){ }

The form is only translated into a type in certain contexts. Otherwise, WYSIWYG, [string] is an array with a type in it. There are a variety of forms that can be translated into types. Many types are defined using their own literals. These are some examples:

#array [int32] = [10, 20, 30]
#tuple (string, int32) = ("One Hundred", 100)
#set {int32} = {10, 20, 30}
#range int32..int32 = 10..20

So far, I’ve covered a bit of the concrete aspects of the language. There are many other aspects which are currently poorly defined. For example, I’d like to implement some sort of pattern matching in the language and foresee a variety of applications for it, but I haven’t the slightest clue as to how to design it. I’m also working on a paradigm of “unique mutable / shared immutable” data, which technically works but is poorly executed.

That’s it for now. Should I write another post about Sanha, I’ll likely cover more of the syntax and semantics as well as give an introduction to composition.

“The sum of all natural numbers is -1/12…”

Recently, a video posted on Numberphile titled Astounding Result has been making the rounds on the web. Of course, this video has resulted in a plethora of criticism, and I would argue some of it is justified. I wasn’t convinced either and had my skepticisms as well. However, I continued to research the problem and tried to gain a greater understanding. What I concluded is that the result isn’t necessarily wrong, but the way they present the problem and the terminology they use are both improper.

EDIT: Numberphile posted an excellent follow-up video on March 18 titled, Why -1/12 is a gold nugget. For those interested, I highly recommend watching it.

Early on in the video, they display this result in a “well-known string theory book,” presumably to give it some justification. If you look closely, this is the way in which the result is presented in the book:

[Image: the summation as printed in the string theory book]

Notice that this is not an equation, as there’s no equality (=) sign. Rather, what you have is an arrow, which means something quite different. That’s the first mistake they made in the video: using the equality sign everywhere to signify that this summation “equals” -1/12, which is not how it’s understood in mathematics or physics.
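
In other words, the line in the book has the shape (my rendering of it):

\[ \sum_{n=1}^{\infty} n \;\to\; -\frac{1}{12} \]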

Secondly, if you pause the video on a stable frame (where the camera isn’t jittering), you can clearly make out the preceding text in the book. To save you the trouble, this is what is written:

It can be evaluated by regulating the theory and then being careful to preserve Lorentz invariance in the renormalization.

There are a few special terms here which are, unfortunately, never used in the video. This is the second mistake they made in the video: the lack of proper terminology. For those of us who studied Calculus in college, our natural instinct is to test whether the sequence of partial sums converges on some finite value L. If not, we simply conclude the series diverges and move on with our day. But that is only one method of summation, and even under a mild generalization of it such as Cesàro summation, this series does indeed diverge. However, there are other techniques for “assigning a meaningful value to a divergent series,” collectively known as regularization (as well as renormalization, used in quantum mechanics).
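
As a concrete example of what such a technique assigns (my summary, not from the video): the Riemann zeta function is defined by a convergent series for Re(s) > 1 and then extended by analytic continuation, and it is the continued function that takes the value -1/12 at s = -1:

\[ \zeta(s) = \sum_{n=1}^{\infty} \frac{1}{n^{s}} \quad (\operatorname{Re}(s) > 1), \qquad \zeta(-1) = -\frac{1}{12} \]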

There was further criticism of how they proved these results, by taking these infinite sequences and adding them together, multiplying, shifting to the right, etc. I’d rather not get into this much. Instead, I’ll just point you to the Riemann series theorem.

In the video and other places on the web, I’ve seen the claim that these results are used in “many areas of physics” and “have been observed in nature.” The most common example I’ve encountered relates to string theory, which I think is moot because, as I understand it, string theory hasn’t been tested experimentally (i.e. it’s untestable using existing equipment) and is therefore little more than just theory. However, in an alternate video proof [1], a reference is made to the Casimir force (i.e. the Casimir effect), which has been measured experimentally.

I’m not claiming to have any deep understanding of the mathematics or physics presented here. I simply felt that clarification is needed in order to settle the uproarious masses. I may investigate this topic further in the future, but for now, I’ve satisfied my skepticism. In conclusion, the result isn’t necessarily incorrect (I’ll leave it to the masses to disprove), but greater care could have been taken in the presentation, the proofs, and the terminology used.

There’s a second, much longer and more complex proof presented in this video: [1] Sum of Natural Numbers. The Wikipedia article covering this sequence (1+2+3+4+5+6…) can be found here. Another interesting sequence (1-2+3-4+5-6…), covered in much more detail, can be found here. The Wikipedia article for divergent series contains some useful information as well, including some of the methods used for summation. A couple of the highly critical responses prompted by this video: One and Two.