Sunday, April 26, 2009

Falling for the magic formula

Conditional entropy and the Indus Script: A recent publication (Rajesh P. N. Rao, Nisha Yadav, Mayank N. Vahia, Hrishikesh Joglekar, R. Adhikari, and Iravatham Mahadevan, "Entropic Evidence for Linguistic Structure in the Indus Script", Science, published online 23 April 2009; see also the Supporting Online Material) claims a breakthrough in understanding the nature of the symbols found in inscriptions from the Indus Valley Civilization. (Via Language Log.)

Once again, Science falls for a magic formula that purports to answer a contentious question about language: is a certain ancient symbolic system a writing system? They would not, I hope, fall for a similar hypothesis in biology. They would, I hope, be skeptical that a single formula could account for the complex interactions within an evolutionarily built system. But somehow it escapes them that language is such a system, as are other culturally constructed symbol systems, which carry lots of specific information that cannot be captured by a single statistic.

The main failing of the paper under discussion seems to be its choice of an artificially unlikely null hypothesis. This is a common problem in statistical modeling of sequences, for example genomic sequences: it is hard to construct realistic null hypotheses that capture as much as possible of the statistics of the underlying process except for what the hypothesis under test is supposedly responsible for. In the case at hand: what statistics could distinguish a writing system from other symbol systems? As other authors have convincingly argued, first-order sequence statistics miss critical statistical properties of writing systems with respect to symbol repetition within a text, namely that certain symbols co-occur often (but mostly not adjacently) within a text because they represent common phonological or grammatical elements. Conversely, any given set of first-order statistics is compatible with a very large class of first-order Markov processes, almost none of which could be claimed to be anywhere close to the writing of a human language. In other words, most of the “languages” that the paper's test separates from its artificial straw man are nowhere close to any possible human language.
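To make that bigram point concrete, here is a minimal sketch (a toy example of my own, not anything from the paper or its critics): fit a first-order Markov chain to a character sequence, then sample from it. The sample reproduces the original's first-order statistics in expectation, yet it is nowhere close to a human language.

import random
from collections import Counter, defaultdict

# Toy corpus; any symbol sequence would do.
text = list("the cat sat on the mat and the dog sat on the log ")

# Estimate first-order (bigram) transition counts.
transitions = defaultdict(Counter)
for prev, nxt in zip(text, text[1:]):
    transitions[prev][nxt] += 1

def sample(n, start="t"):
    # Sample n symbols from the fitted first-order Markov chain.
    out, state = [start], start
    for _ in range(n - 1):
        symbols, counts = zip(*transitions[state].items())
        state = random.choices(symbols, weights=counts)[0]
        out.append(state)
    return "".join(out)

print(sample(80))  # first-order statistics preserved, but not English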

To paraphrase Brad De Long's long-running lament: why oh why can't we have better scientific journal editors?

Update: Mark Liberman, Cosma Shalizi, and Richard Sproat actually ran the numbers. It turns out that they were able to generate curves similar to the paper's “linguistic” curves with memoryless sources. Ouch. As a bonus, we get R, Matlab, and Python scripts to play with, in an ecumenical demonstration of fast statistical modeling for those following at home.
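For those who want the flavor without digging into those scripts, here is a rough single-number version of the comparison (my own toy code, not a reproduction of theirs, and not the paper's full curve): bigram conditional entropy for a rigid source, a uniform random source, and a memoryless source with a skewed, Zipf-like symbol distribution. The skewed memoryless source lands strictly between the two extremes, which is the kind of intermediate behavior the paper reads as a linguistic signature.

import math
import random
from collections import Counter

def conditional_entropy(seq):
    # Estimate H(next | previous) in bits from a symbol sequence.
    pair_counts = Counter(zip(seq, seq[1:]))
    prev_counts = Counter(seq[:-1])
    total = len(seq) - 1
    h = 0.0
    for (prev, nxt), c in pair_counts.items():
        h -= (c / total) * math.log2(c / prev_counts[prev])
    return h

K, N = 40, 100_000                                   # alphabet size, length
rigid = [i % K for i in range(N)]                    # deterministic cycle
uniform = [random.randrange(K) for _ in range(N)]    # i.i.d. uniform
zipf_weights = [1 / (r + 1) for r in range(K)]       # skewed, memoryless
skewed = random.choices(range(K), weights=zipf_weights, k=N)

for name, seq in [("rigid", rigid), ("uniform", uniform), ("skewed", skewed)]:
    print(f"{name:8s} H(next|prev) = {conditional_entropy(seq):.2f} bits")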

Tuesday, April 14, 2009

Strings are not Meanings Part 2.1

Strings are not Meanings Part 2.1: Fernando is right – these observations are powerful traces of how writers and readers organize and relate pieces of information. Just as a film of Kasparov is a trace of his playing chess.

I think I didn't make my point as strongly or precisely as I should have. The bubble chamber analogy is neat, but limited. In contrast to the traces in the chamber, the stuff out there on the internet, stored or in transit, is not just a record but also a huge external memory that is as causally central to our behavior as anything in our neural matter. The question then is, what's the actual division of labor between external and mental representation? I tend to believe that material and communicative culture carry a lot more of the burden than individual minds, similarly to how much more of the informational burden of current computing is carried by programs stored out there than by CPUs.

I think that Fernando approaches this space from a more behaviourist mindset – accepting the input, output and context but with no requirements for stuff happening ‘inside’.

No, my stance is definitely not behavioristic. There's lots of complexity ‘inside.’ But the patterns of representation and inference favored by symbolic AI have little to do with ‘inside’ as far as I can see. Instead, they are formalizations of language — as formal logic is — which explain little and oversimplify a lot. Given that, we might as well go right to the language stuff out there and drop the crippled intermediaries.

In addition to their taxonomic meaning, ontologies have come to refer to a requirement for communication – that the stuff I refer to maps to the same stuff for you.

But that's where it all falls apart. No formal system can ensure that kind of agreement. There is no rigid designation in nature. Our agreements about terms are practical, contextual, contingent. Language structure relates to common patterns of inference (for instance, monotonicity) that seem cognitively “easy” (whether they have innate “hardware” support I don't know). But asserting that is much less than postulating a whole fine-grained cognitive architecture of representations and inference algorithms out of thin air, when the alternative of computing directly with the external representations and usage events of language is available to us and so much richer than even the fanciest semantic net system.
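To illustrate what I mean by monotonicity as a pattern of inference, here is a toy sketch of my own, with made-up lexical facts and no claim about how any of this is implemented in heads: a quantifier marks its argument positions as upward or downward monotone, which licenses substituting a more general or a more specific term.

# Minimal monotonicity calculus over a tiny hand-coded subset relation.
SUBSET = {("poodle", "dog"), ("dog", "animal"), ("poodle", "animal")}

MONOTONICITY = {            # (restrictor, scope) polarity per quantifier
    "every": ("-", "+"),    # "every dog barks" entails "every poodle barks"
    "some":  ("+", "+"),    # "some poodle barks" entails "some dog barks"
    "no":    ("-", "-"),
}

def entails(quantifier, noun, replacement, position):
    # Does replacing `noun` by `replacement` in the given argument
    # position preserve truth, given the quantifier's monotonicity?
    polarity = MONOTONICITY[quantifier][position]
    if polarity == "+":                       # upward: may generalize
        return (noun, replacement) in SUBSET
    return (replacement, noun) in SUBSET      # downward: may specialize

print(entails("every", "dog", "poodle", 0))   # True: downward restrictor
print(entails("some", "poodle", "dog", 0))    # True: upward restrictor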

(Via Data Mining.)

Sunday, April 12, 2009

Strings and Meanings

I'm reading Alva Noë's so far (page 90) delightful Out of Our Heads. He makes much more concisely a point I tried to make earlier, which goes back to Hilary Putnam:

I am not myself, individually, responsible for making my words meaningful. They have their meanings thanks to the existence of a social practice in which I am allowed to participate.

This is the same whether the words are in my speech, in my writing, or strings in some data structure, maybe an ontology. Ontologies do not have magical powers. Their value is in their practical communicative success, as is the value of any other means of communication.

Saturday, April 11, 2009

Strings are not Meanings Part 2

Strings are not Meanings Part 2: Matt refines his earlier points:

Data may be unreasonably effective, but effective at what?

In asking this, I was really drawing attention to firstly the ability for large volumes of data (and not much else) to deliver interesting and useful results, but its inability to tell us how humans produce and interpret this data. One of the original motivations for AI was not simply to create machines that play chess better than people, but to actually understand how people’s minds work.

The data we were discussing in the original paper tells us a lot about how people “produce and interpret” it. Links, clicks, and markup, together with proximity, syntactic relations, and textual similarities and parallelisms, are powerful traces of how writers and readers organize and relate pieces of information to each other and to the environments in which they live. As David Lewis once said, just as the Web was emerging, they form a bubble chamber that records the traces of much of human activity and cognition. As with a bubble chamber, the record is noisy, it requires serious computation to interpret, and, most important of all, it requires prior hypotheses about what we are looking at to organize those computations.

How much those hypotheses depend on fine-grained models of “how people's minds work,” we really have no idea. If we were to measure the success of AI by its progress on creating such models, we'd have to see AI as a dismal, misguided failure. AI's successes, such as they are, are not about human minds, but about computational systems that behave in a more adaptive way in complex environments, including those involving human communication. Indeed, neither AI researchers, nor psychologists, nor linguists, nor neuroscientists have made much progress (not since I came into AI 30 years ago, anyway) in figuring out the division of labor between task-specific cognitive mechanisms and representations on the one hand, and shallower statistical, neural, and social learning systems on the other, in enabling human cognition and communication.

If anything, we have increasing reason to be humble about the alleged uniquely fine-grained nature of human cognition, as opposed to the broader, shallower power of a few inference-from-experience hacks, social interaction, and external situated memory (territory marking, as it were), not just in humans, to construct complex symbolic processing systems: Language as Shaped by the Brain, Out of Our Heads, The Origins of Meaning, Cultural Transmission and the Evolution of Human Behavior, ....

Despite all the ontology nay-sayers, a big chunk of our world is structured due to the well organized, systematic and predictable ways in which industry, society and even biology creates stuff.

Here, I want to draw attention to the skepticism around ontologies. Yes, they come at a cost, but it is also the case that they offer true grounding of interpretations of textual data. Let me give an example. The Lord of the Rings is a string used to refer to a book (in three parts), a sequence of films, various video games, board games, and so on. The ambiguity of the phrase requires a plurality of interpretations available to it. This is a 1-many mapping. The 1 is a string, but what is the type of the many? I actually see the type of work described in the paper as being wholly complementary with categorical knowledge structures.

Hey, you gave your own answer! The many are "a book", "a sequence of films", "[a] video game", ... Sure, the effect of the (re)presentation of those strings in certain media (including our neural media) in certain circumstances has causal connections to action involving various physical and social situations, such as that of buying an actual, physical book from a bookseller. But that causal aspect of meaning — which I contend is primary — is totally ignored by ontologies. Ontologies may pretend to be somehow more fundamental than mere text, but they are just yet another form of external memory, like all the others we already use, whose value will be determined by practical, socially-based activity, and not by somehow being magically imbued with “true grounding.” Grounding is not true or false; it's the side effect of causally-driven mental and social learning and construction.

What a symbol means is what it does to us and what we do with it, not some essence of the symbol somehow acquired by having it sit in a particular formal representation. No one has provided any evidence that by having "Harry Potter" sit somewhere in WordNet, the string becomes more meaningful than what we can glean from its manifold occurrences out there. It may be more useful to summarize a symbol's associations in a data structure for further processing, and I'm all for useful data structures, but they don't add anything informationally (they may add something in computational efficiency, of course), and they often lose a lot, because context of use gets washed out or lost.

Let's be serious — and a bit humbler — about what we are really doing with these symbolic representations: engineering — which is cool, don't worry — not philosophy or cognitive science. Much of this was already said or implied in McDermott's classic (unfortunately I can find it only behind the ACM paywall, so much for “Advancing Computing as a Science and a Profession,” but I digress...), which we'd do well to (re)read annually on the occasion of every AAAI conference, and whenever semantic delusions strike us. (Via Data Mining.)
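Just to be concrete about what such a summary amounts to, here is a toy sketch with made-up data, not a proposal: the 1-many mapping from a string to its referent types can be tabulated directly from tagged usage events, so the resulting structure is a convenient index over occurrences rather than an independent source of meaning.

from collections import defaultdict

# Hypothetical usage events: (string, context snippet, referent type).
occurrences = [
    ("The Lord of the Rings", "bought the paperback at the airport", "book"),
    ("The Lord of the Rings", "saw it in the theater last night", "film"),
    ("The Lord of the Rings", "the online game servers were down", "video game"),
    ("The Lord of the Rings", "reread the appendices on the train", "book"),
]

# Summarize: string -> referent type -> supporting contexts.
index = defaultdict(lambda: defaultdict(list))
for string, context, referent in occurrences:
    index[string][referent].append(context)

for referent, contexts in index["The Lord of the Rings"].items():
    print(f"{referent}: {len(contexts)} occurrence(s)")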

Update: Partha (thanks!) found this freely accessible copy of Artificial Intelligence Meets Natural Stupidity.