Articles tagged: linguistics

Scrabble quibblers

277 days ago

“Want to play Scrabble?” my housemate asked yesterday, out of the blue. “Sure,” I replied, although I remembered with some reservations a previous game we had played, where she had declared some of my two-letter words unacceptable. House rules die hard, and hers included something vague about only including “real words”, or “words you know the meaning of”. “And none of those spelled-out letters, like EN and EM.”

Two things you need for a satisfying Scrabble game, in my opinion, are 1) a Scrabble dictionary, and 2) an agreement to accept any word in the Scrabble dictionary, and only words in the Scrabble dictionary. A regular dictionary will not do. It contains too many abbreviations, proper nouns and borrowings. My house rules amount to no house rules. But then I did often play with my grandfather, who is a crossword master, and you certainly need to be able to appreciate obscure words “for their own sake” to get across crosswords.

While I was sitting on my AD, knowing it to be a perfectly valid Scrabble word, I tried to figure out why this house rule annoyed me so. I mean sure, the first time you get an AE played against you for some phenomenal score, it is an outrageous affront to ordinary words. But the answer is to learn them yourself, not ban them, in my view. And the thing about the Scrabble dictionaries is, at least they’re well-defined. At least you are always playing to a common, fixed set of words. Once I’ve learned that these little words exist, I can’t unlearn them, even if I’ll likely never use them in speech. What annoyed me was the suggestion that “ordinary words” might be easily identifiable or obvious. What even constitutes a word is not a well-defined thing, especially when you get down to these little scraps of letters.

And that’s my vaguely linguistic rant for the day :)


---

Conversation traffic

362 days ago

A thought I had, careening in a taxi towards the Buenos Aires airport on my way home from Wikimania…

Driving into Buenos Aires on a Wednesday morning
By Alex-s, CC-BY license.

Driving a vehicle is rather like understanding conversation. Knowing the (explicit) road rules is not enough. You need to understand the implicit rules of driving in practice: what you might call the traffic rules. How much space do you leave? Where are lanes and how do they work, especially when changing? What does tooting the horn mean? Is the speed limit strictly adhered to or just a suggestion? How do you expect larger or smaller vehicles to behave?

You might think you just learn this once, and you’re done… until you experience driving in another country. Then you realise how much implicit knowledge you had about traffic norms in your first country.

Now the only place my analogy seems to fall down is when it all goes horribly wrong. If you get the road or traffic rules wrong, you will notice. :) But conversations can go horribly wrong without similar warning signs. For example, you could walk away from a conversation thinking, “That person is unbelievably RUDE! How could they think it was acceptable to behave that way!” when in fact they were unaware that the two of you had divergent conversation rules. If either of you had figured this out earlier and self-corrected or corrected the other, the flow could have been righted and continued without dragging a burning car-crash of a conversation behind it.


---

Finding COCOA in CHOCOLATE without a dictionary?

393 days ago

Last night I went to MPUG, the Melbourne Python Users Group. I have been on that mailing list for seemingly years, and it looks like now there will be an attempt to have regular meetings. Woot.

There was a very interesting talk by Martin Schweitzer called “Primetime Wordfinding” or “Elegant String Searches”. (The slides were posted to the mailing list.)

The basic problem is thus:

Given a set of letters and a dictionary, find all words that can be made from those letters.

The method that he outlines is wonderfully elegant, and will be especially appreciated by maths geeks. However, seeing dictionary = '/usr/share/dict/words' made me think, “Typical IR approach! Where’s the linguistics?”
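The slides are not reproduced here, but judging by the prime-number helper credited to Martin further down this post, the dictionary-based version presumably works along these lines. This is only a sketch under that assumption: the names `letter_product` and `find_words` and the sample word list are made up for illustration and are not taken from the talk.

```python
from string import ascii_lowercase

# One distinct prime per letter a-z (assumed mapping: a=2, b=3, c=5, ...).
PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
          43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101]
PRIME_FOR = dict(zip(ascii_lowercase, PRIMES))


def letter_product(word):
    """Multiply together the primes assigned to each letter of `word`."""
    total = 1
    for ch in word.lower():
        total *= PRIME_FOR[ch]
    return total


def find_words(letters, words):
    """Return the words that can be spelled from `letters`, using each
    supplied letter at most once."""
    magic = letter_product(letters)
    found = []
    for word in words:
        # Skip empty entries and anything containing characters outside a-z.
        if not word or any(ch not in PRIME_FOR for ch in word.lower()):
            continue
        # A word fits exactly when its product divides the product of the letters.
        if magic % letter_product(word) == 0:
            found.append(word)
    return found


# Hypothetical stand-in for a real word list.
sample_words = ["cocoa", "late", "cheat", "coat", "tool", "zebra", "latte"]
print(find_words("chocolate", sample_words))
```

Because prime factorisation is unique, the divisibility test also respects letter counts: “latte” is rejected since “chocolate” supplies only one “t”. A real run would pass in the lines of a word list file instead of `sample_words`.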

It also made me wonder how many languages Linux ships word lists for. Apparently Ubuntu ships many varieties of English, Portuguese, Bulgarian, Catalan, Danish, Dutch, Finnish, Faroese, French, Galician, Italian, Norwegian, German, Polish, Spanish, Swedish and Ukrainian. So Europe has decent coverage, but the rest of the world, hmm…

So, how about this revised problem:

Given a set of letters and a language, find all words that can be made from those letters.

We don’t have a dictionary but we have a language, which means we have (whether we consciously realise or not) the rules for

  1. how alphabetic letters map to phonemes (sound units)
  2. how phonemes can be combined to form syllables (the main concern)
  3. how syllables can be combined to form words.

I did a bit of looking to see if I could try and find a ready-made solution, and while it seems that syllable ‘parsing’ is a well-studied problem, syllable ‘generation’ is another matter.

Now this is going to be relatively tricky, because English doesn’t have good one-to-one correspondences between letters and phonemes.

So let’s hack some stuff together… as a first approximation, I’ll grab all the written examples from the Wikipedia articles on English phonology, English orthography and the IPA chart for English dialects.

>>> onsets = list("pbtdckgjfvszwmlnryh") + ["ch","th","sh"]
>>> onsets += ["pl","bl","cl","gl","pr","br","tr","dr","cr","gr","tw","dw","gu","qu"]
>>> onsets += ["fl","sl","fr","thr","shr","sw","thw","wh"]
>>> onsets += ["sp","st","sk"]
>>> onsets += ["sm","sn"]
>>> onsets += ["sph"]
>>> onsets += ["spl","spr","str","scl","scr","squ", "sc"]
>>> nuclei = ["a","e","i","o","u","ow","ou","ou","ie","igh","oi","eer","air","ee","ai"]
>>> nuclei += ["au","ea","ou","ai","ey","ei","er","ear","ir","oo","ou","igh","ough",
"y","oy","oa","ou","ow","ol","ar","ere","are","ear","or","ar","ore","oar","our",
"oor","ure","uer"]
>>> codas = ["lp","lb","lt","ld","lk","rp","rb","rt","rd","rk","rgue","lf","lve","lth","lse",
"lsh","lch","lge","rf","rve","rth","rce","rsh","rch","rge","lm","ln","rm","rn",
"rl","mp","nt","nd","nk","mph","mth","nth","nce","nze","nch","nge","ngth",
"ft","sp","st","sk","fth","pt","ct","pth","pse","ghth","tz","dth","dze","x","lpt",
"lfth","ltz","lst","lct","lx","rmth","mth","rpt","rpse","rtz","rst","rct","mpt",
"mpse","ndth","nct","nx","ngth","xth","xt"]
>>> final = ["s","ed"]

That’s pretty yuck. And I’m not too sure at all about where some of those “r”s should go. A bit of a brute-force solution for problem #1 above. I would like to clean this up and somehow make sure it is complete.

Also, that “final” bit is not a linguistic thing, but it seems to me my codas are not accounting for plural words too well.

>>> # thankyou, martin!
>>> primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 
53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101, 103]
>>> def prime_val(ch):
...     return primes[ord(ch.lower()) - ord('a')]
...
>>> def get_val(word):
...     # product of the primes for each letter in the word
...     total = 1
...     for ch in word:
...             total *= prime_val(ch)
...     return total
...
>>> magic = get_val("chocolate")
>>> nuclei_ok = [n for n in nuclei if magic % get_val(n) == 0]
>>> onsets_ok = [o for o in onsets if magic % get_val(o) == 0] + [""]
>>> codas_ok = [c for c in codas if magic % get_val(c) == 0] + [""]
>>> syllables = []
>>> for o in onsets_ok:
...     for n in nuclei_ok:
...             for c in codas_ok:
...                     syllable = o + n + c
...                     if magic % get_val(syllable) == 0:
...                             syllables.append(syllable)
... 
>>> len(syllables)
172
>>> syllables
['talch', 'ta', 'telch', 'te', 'tolch', 'to', 'tealch', 'tea', 'toolch', 'too', 'toalch', 'toa', 'tol', 
'calt', 'calth', 'calch', 'cact', 'calct', 'ca', 'celt', 'celth', 'celch', 'cect', 'celct', 'ce', 'colt', 
'colth', 'colch', 'coct', 'colct', 'co', 'cealt', 'cealth', 'cealch', 'ceact', 'cealct', 'cea', 
'coolt', 'coolth', 'coolch', 'cooct', 'coolct', 'coo', 'coalt', 'coalth', 'coalch', 'coact', 
'coalct', 'coa', 'colct', 'col', 'lact', 'la', 'lect', 'le', 'loct', 'lo', 'leact', 'lea', 'looct', 'loo', 
'loact', 'loa', 'halt', 'hact', 'halct', 'ha', 'helt', 'hect', 'helct', 'he', 'holt', 'hoct', 'holct', 
'ho', 'healt', 'heact', 'healct', 'hea', 'hoolt', 'hooct', 'hoolct', 'hoo', 'hoalt', 'hoact', 
'hoalct', 'hoa', 'holct', 'hol', 'chalt', 'chact', 'chalct', 'cha', 'chelt', 'chect', 'chelct', 
'che', 'cholt', 'choct', 'cholct', 'cho', 'chealt', 'cheact', 'chealct', 'chea', 'choolt', 
'chooct', 'choolct', 'choo', 'choalt', 'choact', 'choalct', 'choa', 'cholct', 'chol', 'tha', 
'the', 'tho', 'thea', 'thoo', 'thoa', 'thol', 'clact', 'cla', 'clect', 'cle', 'cloct', 'clo', 'cleact', 
'clea', 'clooct', 'cloo', 'cloact', 'cloa', 'alt', 'alth', 'alch', 'act', 'alct', 'a', 'elt', 'elth', 
'elch', 'ect', 'elct', 'e', 'olt', 'olth', 'olch', 'oct', 'olct', 'o', 'ealt', 'ealth', 'ealch', 'eact', 
'ealct', 'ea', 'oolt', 'oolth', 'oolch', 'ooct', 'oolct', 'oo', 'oalt', 'oalth', 'oalch', 'oact', 
'oalct', 'oa', 'olct', 'ol']

Note that onsets and codas are optional, hence I add the empty string to those lists. (I forgot to factor in the “final” bit, although it doesn’t make any difference for the word “chocolate”.)

OK, so now I have my syllables. You should find that these are basically all pronounceable in English, although they may not be written in the standard way (for example, if “choct” were a valid word, I think it would be written as “chocked”; “ct” only seems to appear as a coda in a small number of words, like “tact”). And of course many of them are not valid as monosyllabic words.

Now, how can we combine them into multi-syllabic words? Well, there are some word-level rules, but mostly they seem more relevant to pronunciation. So we should be reasonably safe with just concatenating syllables.

>>> syll2 = []
>>> for syllable in syllables:
...     remaining = syllables[:]
...     remaining.remove(syllable)
...     for r in remaining:
...             combined = syllable + r
...             if magic % get_val(combined) == 0:
...                     syll2.append(combined)
...
>>> len(syll2)
3382

And now…. wait for it…. the big moment has arrived!

>>> 'cocoa' in syll2
True

At the moment this is perhaps not markedly better than just generating every permutation of every length string of the letters in “chocolate”. But there you go… I call it “Dictionary-Free, Linguistically Motivated String Searches”. :)
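To make that comparison concrete, here is a rough sketch of the brute-force alternative: collect every distinct arrangement, of every length, of the letters in “chocolate”. The function name `distinct_arrangements` is just an illustrative choice, and the approach is deliberately naive.

```python
from itertools import permutations


def distinct_arrangements(letters):
    """Return every distinct string of length 1..len(letters) that uses
    each letter no more often than it occurs in `letters`."""
    found = set()
    for length in range(1, len(letters) + 1):
        for perm in permutations(letters, length):
            # permutations() treats repeated letters as different items,
            # so the set is what removes the duplicates.
            found.add("".join(perm))
    return found


candidates = distinct_arrangements("chocolate")
print(len(candidates))        # size of the brute-force search space
print("cocoa" in candidates)  # the target word is somewhere in there
```

Comparing `len(candidates)` with the length of the two-syllable list above gives a feel for how much the syllable rules prune the search, bearing in mind that the brute-force set also contains longer strings.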

I am going to ponder if there is a better way to implement this in Prolog. But for Python, is there any way you can use a regular expression for generation rather than parsing? A kind of “regular expression for production”?
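One possible answer for Python, offered as a sketch rather than a definitive technique: a pattern like (onset)?(nucleus)(coda)? is a sequence of slots, each with a finite list of alternatives, and `itertools.product` enumerates one choice per slot. The helper names `generate` and `fits` are hypothetical, and the short lists below only stand in for the full onsets, nuclei and codas above. This only covers patterns without unbounded repetition such as `*` or `+`.

```python
from collections import Counter
from itertools import product


def generate(*slots):
    """Yield every string formed by picking one alternative from each slot,
    in order. An empty string in a slot makes that slot optional."""
    for parts in product(*slots):
        yield "".join(parts)


def fits(candidate, letters):
    """True if `candidate` uses no letter more often than `letters` supplies it."""
    return not (Counter(candidate) - Counter(letters))


# Cut-down stand-ins for the full lists in this post.
onsets = ["", "c", "ch", "l"]
nuclei = ["o", "a", "oa"]
codas = ["", "t", "lt"]

syllables = [s for s in generate(onsets, nuclei, codas) if fits(s, "chocolate")]
print(syllables)
```

The `Counter` check plays the same role as the prime-product test: a candidate such as “lolt” is dropped because “chocolate” has only one “l”.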


---