Sunday, April 26, 2009

Falling for the magic formula

Conditional entropy and the Indus Script: A recent publication (Rajesh P. N. Rao, Nisha Yadav, Mayank N. Vahia, Hrishikesh Joglekar, R. Adhikari, and Iravatham Mahadevan, "Entropic Evidence for Linguistic Structure in the Indus Script", Science, published online 23 April 2009; and also the Supporting Online Material) claims a breakthrough in understanding the nature of the symbols found in inscriptions from the Indus Valley Civilization. (Via Language Log.)

Once again, Science falls for a magic formula that purports to answer a contentious question about language: is a certain ancient symbolic system a writing system? They would not, I hope, fall for a similar hypothesis in biology. They would, I hope, be skeptical that a single formula could account for the complex interactions within an evolutionarily built system. But somehow it escapes them that language is such a system, as are other culturally constructed symbol systems: they carry a great deal of specific information that cannot be captured by a single statistic.

The main failing of the paper under discussion seems to be the choice of an artificially unlikely null hypothesis. This is a common problem in statistical modeling of sequences, for example genomic sequences: it is hard to construct realistic null hypotheses that capture as much as possible of the statistics of the underlying process except for whatever the hypothesis under test is supposed to explain. In the case at hand: what statistics could distinguish a writing system from other symbol systems? As other authors have convincingly argued, first-order sequence statistics miss critical statistical properties of writing systems with respect to symbol repetition within a text, namely that certain symbols co-occur often (but mostly not adjacently) within a text because they represent common phonological or grammatical elements. Conversely, any given set of first-order statistics is compatible with a very large class of first-order Markov processes, almost none of which could be claimed to be anywhere close to the writing of a human language. In other words, most of the “languages” that the paper's test separates from its artificial straw man are nowhere close to any possible human language.
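The statistic at issue is, roughly, the conditional entropy of a sign given the immediately preceding sign, estimated from bigram counts. To fix ideas, here is a minimal Python sketch of a plain maximum-likelihood estimate from a single token sequence; the published analysis uses a smoothed estimator and a different normalization, so this is only an illustration of the quantity, not a reproduction of the paper's method.

```python
import numpy as np
from collections import Counter

def bigram_conditional_entropy(tokens):
    """Plain maximum-likelihood estimate of H(next sign | current sign),
    in bits, from a single token sequence (no smoothing)."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    unigrams = Counter(tokens[:-1])
    n = len(tokens) - 1
    h = 0.0
    for (a, b), count in bigrams.items():
        p_joint = count / n              # P(current = a, next = b)
        p_cond = count / unigrams[a]     # P(next = b | current = a)
        h -= p_joint * np.log2(p_cond)
    return h

# Sanity check: an i.i.d. uniform source over 4 signs should come out
# near log2(4) = 2 bits.
rng = np.random.default_rng(0)
print(bigram_conditional_entropy(list(rng.integers(0, 4, size=10_000))))
```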

To paraphrase Brad De Long's long-running lament: why oh why can't we have better scientific journal editors?

Update: Mark Liberman, Cosma Shalizi, and Richard Sproat actually ran the numbers. It turns out that they were able to generate curves similar to the paper's “linguistic” curves with memoryless sources. Ouch. As a bonus, we get R, Matlab, and Python scripts to play with, in an ecumenical demonstration of fast statistical modeling for those following at home.
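Their scripts are the place to look for the full demonstration; as a rough imitation (mine, not theirs), the sketch below compares the bigram conditional entropy of three toy sources: a memoryless source with a skewed, roughly geometric sign distribution, a memoryless uniform source (the maximum-entropy straw man), and a rigid deterministic cycle (the minimum-entropy straw man). The skewed source has no sequential dependencies at all, yet its conditional entropy falls between the two extremes, which is exactly why landing in that intermediate band says little about linguistic structure.

```python
import numpy as np
from collections import Counter

def cond_entropy(tokens):
    """Maximum-likelihood estimate of H(next sign | current sign) in bits."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    unigrams = Counter(tokens[:-1])
    n = len(tokens) - 1
    return -sum((c / n) * np.log2(c / unigrams[a]) for (a, b), c in bigrams.items())

rng = np.random.default_rng(1)
V, N = 100, 50_000   # vocabulary size and sequence length (arbitrary choices)

# Memoryless source with a skewed (roughly geometric) sign distribution.
p = 0.9 ** np.arange(V)
p /= p.sum()
skewed = list(rng.choice(V, size=N, p=p))

# Memoryless uniform source: every sign equally likely in every position.
uniform = list(rng.integers(0, V, size=N))

# Rigid source: signs always appear in the same fixed order.
rigid = list(np.arange(N) % V)

for name, seq in [("skewed i.i.d.", skewed), ("uniform i.i.d.", uniform), ("rigid", rigid)]:
    print(f"{name:>14}: {cond_entropy(seq):.2f} bits")
# Despite having no memory, the skewed source lands between the uniform
# and rigid extremes, much like the paper's "linguistic" curves.
```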
