Finding COCOA in CHOCOLATE without a dictionary?

13 August 2009, 10:05

Last night I went to MPUG, the Melbourne Python users group. I have been on that mailing list for seemingly years, and it looks like now there will be an attempt to have regular meetings. Woot.

There was a very interesting talk by Martin Schweitzer called “Primetime Wordfinding” or “Elegant String Searches”. (The slides were posted to the mailing list.)

The basic problem is thus:

Given a set of letters and a dictionary, find all words that can be made from those letters.

The method that he outlines is wonderfully elegant, and will be especially appreciated by maths geeks. However seeing dictionary = ’/usr/share/dict/words’ make me think, “Typical IR approach! Where’s the linguistics?”

It also made me wonder how many languages Linux ships word lists for. Apparently Ubuntu ships many varieties of English, Portugese, Bulgarian, Catalan, Danish, Dutch, Finnish, Faroese, French, Galician, Italian, Norwegian, German, Polish, Spanish, Swedish and Ukrainian. So Europe has decent coverage, but the rest of the world, hmm…

So, how about this revised problem:

Given a set of letters and a language, find all words that can be made from those letters.

We don’t have a dictionary but we have a language, which means we have (whether we consciously realise or not) the rules for

  1. how alphabetic letters map to phonemes (sound units)
  2. how phonemes can be combined to form syllables (the main concern)
  3. how syllables can be combined to form words.

I did a bit of looking to see if I could try and find a ready-made solution, and while it seems that syllable ‘parsing’ is a well-studied problem, syllable ‘generation’ is another matter.

Now this is going to be relatively tricky, because English doesn’t have good one-to-one correspondences between letters and phonemes.

So let’s hack some stuff together… as a first approximation, I’ll grab all the written examples from the Wikipedia articles on English phonology, English orthography and the IPA chart for English dialects.

>>> onsets = list("pbtdckgjfvszwmlnryh") + ["ch","th","sh"]
>>> onsets += ["pl","bl","cl","gl","pr","br","tr","dr","cr","gr","tw","dw","gu","qu"]
>>> onsets += ["fl","sl","fr","thr","shr","sw","thw","wh"]
>>> onsets += ["sp","st","sk"]
>>> onsets += ["sm","sn"]
>>> onsets += ["sph"]
>>> onsets += ["spl","spr","str","scl","scr","squ", "sc"]
>>> nuclei = ["a","e","i","o","u","ow","ou","ou","ie","igh","oi","eer","air","ee","ai"]
>>> nuclei += ["au","ea","ou","ai","ey","ei","er","ear","ir","oo","ou","igh","ough",
>>> codas = ["lp","lb","lt","ld","lk","rp","rb","rt","rd","rk","rgue","lf","lve","lth","lse",
>>> final = ["s","ed"]

That’s pretty yuck. And I’m not too sure at all about where some of those “r“s should go. A bit of a brute-force solution for problem #1 above. I would like to clean this up and somehow make sure it is complete.

Also, that “final” bit is not a linguistic thing, but it seems to me my codas are not accounting for plural words too well.

>>> # thankyou, martin!
>>> primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 
53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101, 103]
>>> def prime_val(ch):
...     return primes[ord(ch.lower()) - ord('a')]
>>> def get_val(word):
...     total =1
...     for ch in word:
...             total *= prime_val(ch)
...     return total
>>> magic = get_val("chocolate")
>>> nuclei_ok = [n for n in nuclei if magic % get_val(n)  0]
>>> onsets_ok = [o for o in onsets if magic % get_val(o)  0] + [""]
>>> codas_ok = [c for c in codas if magic % get_val(c)  0] + [""]
>>> syllables = []
>>> for o in onsets_ok:
...     for n in nuclei_ok:
...             for c in codas_ok:
...                     syllable = o + n + c
...                     if magic % get_val(syllable)  0:
...                             syllables.append(syllable)
>>> len(syllables)
>>> syllables
['talch', 'ta', 'telch', 'te', 'tolch', 'to', 'tealch', 'tea', 'toolch', 'too', 'toalch', 'toa', 'tol', 
'calt', 'calth', 'calch', 'cact', 'calct', 'ca', 'celt', 'celth', 'celch', 'cect', 'celct', 'ce', 'colt', 
'colth', 'colch', 'coct', 'colct', 'co', 'cealt', 'cealth', 'cealch', 'ceact', 'cealct', 'cea', 
'coolt', 'coolth', 'coolch', 'cooct', 'coolct', 'coo', 'coalt', 'coalth', 'coalch', 'coact', 
'coalct', 'coa', 'colct', 'col', 'lact', 'la', 'lect', 'le', 'loct', 'lo', 'leact', 'lea', 'looct', 'loo', 
'loact', 'loa', 'halt', 'hact', 'halct', 'ha', 'helt', 'hect', 'helct', 'he', 'holt', 'hoct', 'holct', 
'ho', 'healt', 'heact', 'healct', 'hea', 'hoolt', 'hooct', 'hoolct', 'hoo', 'hoalt', 'hoact', 
'hoalct', 'hoa', 'holct', 'hol', 'chalt', 'chact', 'chalct', 'cha', 'chelt', 'chect', 'chelct', 
'che', 'cholt', 'choct', 'cholct', 'cho', 'chealt', 'cheact', 'chealct', 'chea', 'choolt', 
'chooct', 'choolct', 'choo', 'choalt', 'choact', 'choalct', 'choa', 'cholct', 'chol', 'tha', 
'the', 'tho', 'thea', 'thoo', 'thoa', 'thol', 'clact', 'cla', 'clect', 'cle', 'cloct', 'clo', 'cleact', 
'clea', 'clooct', 'cloo', 'cloact', 'cloa', 'alt', 'alth', 'alch', 'act', 'alct', 'a', 'elt', 'elth', 
'elch', 'ect', 'elct', 'e', 'olt', 'olth', 'olch', 'oct', 'olct', 'o', 'ealt', 'ealth', 'ealch', 'eact', 
'ealct', 'ea', 'oolt', 'oolth', 'oolch', 'ooct', 'oolct', 'oo', 'oalt', 'oalth', 'oalch', 'oact', 
'oalct', 'oa', 'olct', 'ol']

Note that onsets and codas are optional, hence I add the empty string to those lists. (I forgot to factor in the “final” bit, although it doesn’t make any difference for the word “chocolate”.)

OK so now I have my syllables. You should find that these are basically all pronouncable in English, although they may not be the standard way of being written (for example, if “choct” was a valid word, I think it would be written as “chocked”. “ct” only seems to get to be a coda for a small number of words, like “tact”). And of course many of them are not valid as mono-syllabic words.

Now, how can we combine them into multi-syllabic words? Well, there are some word-level rules, but mostly they seem more relevant to pronunciation. So we should be reasonably safe with just concatenating syllables.

>>> syll2 = []
>>> for syllable in syllables:
...     remaining = syllables[:]
...     remaining.remove(syllable)
...     for r in remaining:
...             combined = syllable + r
...             if magic % get_val(combined) == 0:
...                     syll2.append(combined)
>>> len(syll2)

And now…. wait for it…. the big moment has arrived!

>>> 'cocoa' in syll2

At the moment this is perhaps not markedly better than just generating every permutation of every length string of the letters in “chocolate”. But there you go… I call it “Dictionary-Free, Linguistically Motivated String Searches”. :)

I am going to ponder if there is a better way to implement this in Prolog. But for Python, is there any way you can use a regular expression for generation rather than parsing? A kind of “regular expression for production”?

tags: , ,



Subscribe to comments on this post: rss / atom

Commenting is closed for this article.