wug


The real words come from the SOWPODS dictionary. The pseudowords are automatically generated using statistical information gathered from the real words, so they often look convincingly real.

In a nutshell, the pseudowords are generated using a 5th-order Markov chain that is modeled upon the dictionary of real words. But what does THAT mean?

Basically, the Markov chain tells us the probability that a given letter will appear after the letters that came before it in a word. So, imagine that we've started generating a pseudoword, and the first letter is a "q". The Markov chain tells us this:

> q
a:    12 (1.41%)
i:     9 (1.06%)
o:     2 (0.24%)
u:   825 (97.06%)
w:     2 (0.24%)

So, 97% of the time, a "q" is followed by a "u". But it can also be followed by an "a", "i", "o" or "w". (Did you know "qwerty" is a word?) When generating a pseudoword, we're free to choose any of these follow-up letters, but we adhere to the same probabilities: 97% of the time, we'll also put a "u" after a "q". We simply weight our random decision accordingly.
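That weighted decision is easy to sketch in code. Here's a minimal example using Python's standard library (the `follow_q` table copies the counts shown above; the function name is just illustrative):

```python
import random

# Frequency table for letters following "q" (counts from the table above).
follow_q = {"a": 12, "i": 9, "o": 2, "u": 825, "w": 2}

def weighted_pick(freqs):
    """Pick a letter at random, weighted by its observed count."""
    letters = list(freqs)
    counts = [freqs[letter] for letter in letters]
    return random.choices(letters, weights=counts, k=1)[0]

# Roughly 97% of calls will return "u".
print(weighted_pick(follow_q))
```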

Once we choose the next letter, we simply repeat the process:

> qu
a:   371 (44.97%)
b:     4 (0.48%)
e:   117 (14.18%)
i:   286 (34.67%)
o:    43 (5.21%)
r:     4 (0.48%)

Note that we use the same general process when choosing the first letter of the word. (There's nothing special about the first letter.) Here are the probabilities for letters following an empty string:

> 
a: 10771 (6.03%)
b:  9973 (5.58%)
c: 16588 (9.28%)
d: 10402 (5.82%)
e:  7184 (4.02%)
f:  7138 (3.99%)
g:  5874 (3.29%)
h:  6498 (3.64%)
i:  6628 (3.71%)
j:  1480 (0.83%)
k:  1857 (1.04%)
l:  5317 (2.98%)
m:  9949 (5.57%)
n:  4454 (2.49%)
o:  6057 (3.39%)
p: 15066 (8.43%)
q:   850 (0.48%)
r: 10518 (5.89%)
s: 19732 (11.04%)
t:  9000 (5.04%)
u:  5231 (2.93%)
v:  2859 (1.60%)
w:  3922 (2.19%)
x:   152 (0.09%)
y:   590 (0.33%)
z:   601 (0.34%)

11% of the real words start with an "s". 11% of the pseudowords will, as well.
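The first-letter table above is just a tally over the whole dictionary. A minimal sketch of how such a table is built (the tiny `words` list here is illustrative; the real counts come from all of SOWPODS):

```python
from collections import Counter

# Hypothetical mini word list standing in for the SOWPODS dictionary.
words = ["quart", "quartz", "squash", "song", "set", "apple"]

# Tally the first letter of every word, then report counts and percentages
# in the same shape as the tables above.
first = Counter(word[0] for word in words)
total = sum(first.values())
for letter in sorted(first):
    print(f"{letter}: {first[letter]:5d} ({100 * first[letter] / total:.2f}%)")
```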

How do we know when to stop repeating this process? The Markov chain uses a special, non-alphabetical character to indicate "end of word". So the pseudowords won't run on forever, won't seem to end mid-word, and will terminate in a natural way. Here, we use a "$" character for this purpose.

> quart
$:     1 (1.64%)
a:     2 (3.28%)
e:    42 (68.85%)
i:     6 (9.84%)
o:     2 (3.28%)
s:     1 (1.64%)
z:     7 (11.48%)

The "$" entry means "quart" is a valid place to end a word.

A traditional (first-order) Markov chain would generate letters based only on the single previous letter. So if the word so far were "qua", we'd only care about letters that follow an "a", not letters that follow "qua". That isn't a large enough window to produce realistic-looking words; there isn't enough context.

Instead, we're using a "5th-order" Markov chain. This means we use a sliding window of the past 5 letters to compute the frequency table for the next letter.
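Putting the pieces together, here's a minimal sketch of such a chain: build the frequency tables from a word list, then generate by sliding a 5-letter window and sampling from the matching table. The tiny training list is purely illustrative; the real model is built from SOWPODS.

```python
import random
from collections import Counter, defaultdict

ORDER = 5   # window size: look back at most 5 letters
END = "$"   # end-of-word marker

def build_model(words):
    """For every window of up to ORDER letters, count which letter follows it."""
    model = defaultdict(Counter)
    for word in words:
        padded = word + END
        for i, letter in enumerate(padded):
            window = padded[max(0, i - ORDER):i]  # "" before the first letter
            model[window][letter] += 1
    return model

def generate(model):
    """Walk the chain from an empty window until the end marker comes up."""
    word = ""
    while True:
        counts = model[word[-ORDER:]]
        letter = random.choices(list(counts), weights=list(counts.values()))[0]
        if letter == END:
            return word
        word += letter

# Tiny illustrative training set; the real model uses the full dictionary.
model = build_model(["rims", "mimosa", "osmosis", "rimose"])
print(generate(model))
```

With such a small training set, every window appears in only one or two words, so the output is almost always one of the training words themselves; a full dictionary provides the variety that yields novel pseudowords.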

Here's an example showing the complete process of generating a pseudoword.

 -> r
r -> i
ri -> m
rim -> o
rimo -> s
rimos -> i
 imosi -> s
  mosis -> $

Note that this process can yield real words as well as fake words. Also note that if we didn't slide the window (i.e., if we used the entire word so far as context), we'd ONLY generate real words. We want a window size large enough that the words look realistic, but small enough that we don't generate mostly (or only) real words.

So, a final step of the process is to check if the word actually exists in the dictionary. When generating a pseudoword, we keep trying until we generate a word that isn't in the dictionary.
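This last step is plain rejection sampling. A toy sketch of the idea (the `dictionary` set and the fixed pool of chain outputs are stand-ins; real code would call the Markov-chain generator against SOWPODS):

```python
import random

# Toy stand-ins: a tiny "dictionary" of real words, and a generator that
# samples from a fixed pool of plausible chain outputs.
dictionary = {"rims", "rimose", "mimosa", "osmosis"}
chain_outputs = ["rims", "rimosa", "mimose", "osmosis"]

def generate():
    return random.choice(chain_outputs)

def generate_pseudoword():
    """Rejection sampling: retry until the output is NOT a real word."""
    while True:
        candidate = generate()
        if candidate not in dictionary:
            return candidate

print(generate_pseudoword())
```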

We actually use the same process to select the real words, too: run the generator, but keep the results that ARE in the dictionary. This way, we don't just pick random words from the dictionary with equal probability; the real words follow the same weighted decisions as the pseudowords, so they don't stand out.