Random Word Generator

I was looking through some old code of mine and found a little project I had worked on a number of years ago. I was designing an engine which would output random English-sounding words. I started with a simple idea and tweaked it through a few revisions to produce some good results. As an example, here are some of the words it generates:

nnehoffi
paxiebuyase
uadscutll
bizoladaull
qeqmehel
hoskyedroek
ibhnartuois
vdigro
skleftsom
wextbong
wemmempti
cnamlitl
jiinau
brirewsmew

The words are produced using a simple database that looks like this:

a abcdefghijklmnopqrstuvwxyz
aa bcdhklmnrstuvw
aaa ls
aab
aab adgimnopstuz
aac
aac achks
aad
aad aeghituy
aae
aae dlw

Generating Words Begin by choosing a random letter a-z. Look up that first letter in the database to find another character which you may append to it. Now you have a sequence of two letters. As before, look up that pair of letters in the database to find a third character to append. Now you have a sequence of three letters. Again, look up that sequence in the database to find a fourth character to append….. However, the maximum length of any sequence of letters is three. So if you have a sequence of four or more letters, you only look up the last three letters to find the next character to append. By limiting the max sequence to three letters, this allows you to diverge from normal sequences and generate random words which don’t exist in the English language. Of course, this engine may also generate real English words by chance. The last step is to finalize your randomly generated word. If you look back at the sample of the database, you’ll notice some sequences of three letters are repeated but followed by nothing. What this indicates is that this sequence is a valid word ending. For example, it’s okay to have a word which ends with the sequence “aad”. Dealing with “Dead-Ends” The database isn’t perfect and it does have some dead-ends. Ultimately, generating words is a matter of trial and error. When I refer to dead-ends, I’m referring to two possibilities:

  • You have a sequence of characters which you can append nothing to
  • The last three characters in your word is not a valid ending

There are three basic techniques which I could employ:

  • I could unwind and try appending a different character
  • I could throw away the word and restart from scratch
  • If there is an invalid ending and the word doesn’t have to be a specific length, I could try appending more characters until I have a valid ending.

My preference is the second option, to restart from scratch. While there are some dead-ends in the database, they’re infrequent enough that simply starting over is a practical option. There may also be more sophisticated algorithms which are guaranteed to produce a valid word on every try, but I feel that’s unnecessary. Building the Database Of course, the engine would be worthless without the database. While generating any database is simple enough, creating a quality database is another matter. But first, how do you generate a database? You simply need to download some word lists, then parse the words in those lists, break them down into parts, build the patterns, and write them to a database. My database was generated by sampling tens of thousands of real English words. It’s 145KB and contains 14288 lines. Building a quality database is a matter of applying some filters to eliminate bad patterns. You could simply sift through the database and remove the bad patterns by hand, but with 14288 lines in my own database, this would be a tedious task. However, I did do some manual filtering of my database by eliminating some bad word beginnings. I eliminated 197 bad word beginnings in total. Although real English words have these beginnings, it produced some odd results so I filtered them from my database. Here are a few that I chose to eliminate from my database:

dm dn dp dq drj drs drw dv dz fc fg fj fls

Besides for doing manual filtering. I defined one rule to eliminate some patterns from the database. For every three characters, there is a fourth character you may append to it. Of these four characters, I required there be at least one consonant and one vowel. For example, the sequence “aae” cannot be followed by another vowel. As for the letter Y, which could be either a consonant or vowel depending on the context, I chose to treat it as neither. While I could have analyzed the context to determine if the Y was a consonant or a vowel, I chose to take the easy route. At worst, I may have lost a few patterns which would have been included in the database otherwise. But Why??? The reason I chose to create this database is because I was writing a password generator at the time and I wanted to add an option to generate “pronounceable” passwords. The one thing I failed to do is determine how secure this option would actually be. However, I did have a simple idea to increase the strength of the passwords by using mixed case “URugUaY” or substituting some letters with similar looking numbers and symbols “3 for E” or “$ for S”. The code I wrote for this is a mess and not worth publishing. But one day, I may write a small library which implements this engine and publish it to GitHub. For now, you can implement the engine yourself and download my database here: http://1drv.ms/1jF9L3y