Monday, August 13, 2007

The Surface/Symbol Divide

The Surface/Symbol Divide: This approach to knowledge discovery is fixed at the surface level of text (and the surface level of the representation language of documents, to be complete). Consequently, the performance of the system highlights both what is good about statistical surface techniques (little training required - which is often the case for systems that work with both document structure, textual data and high precision seed input; works in (m)any language(s); fast) and what is bad (has no real knowledge of language). (Via Data Mining.)

What is "real knowledge of language"? Where does it come from? Why is it unobtainable with statistical techniques? For all we know, a somewhat more sophisticated statistical inference procedure might get rid of some of the errors that Matt highlights (I have some ideas that are too tentative to discuss). More generally, given how quickly our understanding of language acquisition is changing, how can anyone say with any certainty what "real knowledge of language" entails? It's time to retire the essentialism of "colorless green ideas".

4 comments:

Matthew said...

Fernando - I think you have mis-parsed 'no real knowledge of language' in this case. It is not ((real knowledge) of language); it is (real (knowledge of language)). It is a subtle difference, but it is a difference. You should, then, be asking 'what is "knowledge of language"?' In this case, it would be the ability to distinguish parts of speech - clearly, this system is not capable of doing this.

Fernando Pereira said...

I too thought "(real (knowledge of language))" is what you meant. My general point is that "knowledge of language" has the same lack of explanatory power as other essences like "vital force". More specifically, you must know that parts-of-speech can be distinguished fairly well by unsupervised "surface" statistical methods. This system does not do this because there are only so many hours in the day of even the brightest graduate student.

William Cohen said...

Fernando says: "...parts-of-speech can be distinguished fairly well by unsupervised 'surface' statistical methods. This system does not do this because there are only so many hours in the day of even the brightest graduate student."

Also because a) it's mostly finding semi-structured pages (lists and tables and such) where POS tagging would be less reliable, and b) it's language-independent.

I generally agree with both of you. I think the specific problem Matt points out can be (and probably will be) fixed with some additional analysis.

But I think the overall question remains - there are clear limitations to this sort of technique. It's very much oriented toward exploiting redundancy across sites, and it doesn't usually work for things that aren't relatively popular and well-known named entities. This is a sort of wisdom-of-crowds result, and finesses the problem of doing any real understanding of any particular page. In fact, it will get poor results on many of the hundreds of pages it processes - it works because it's aggregating information across many poorly-understood information sources.
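
As a rough illustration of that aggregation idea, here is a toy sketch. It is not the system under discussion: the function name `aggregate_candidates`, the `min_support` threshold, and the example pages are all made up for illustration. It simply counts how many distinct pages propose each candidate and keeps the ones with enough support.

```python
from collections import Counter


def aggregate_candidates(pages, min_support=2):
    """Rank candidate strings by how many distinct pages propose them.

    `pages` is a list of per-page extraction results (lists of strings).
    `min_support` is a hypothetical threshold, not a value from any real system.
    """
    support = Counter()
    for extracted in pages:
        # Count each candidate once per page so a single noisy page
        # cannot dominate the tally.
        support.update(set(extracted))
    kept = [(cand, n) for cand, n in support.items() if n >= min_support]
    # Highest support first; ties broken alphabetically for determinism.
    return sorted(kept, key=lambda item: (-item[1], item[0]))


# Invented example: every page mixes real entities with page-furniture noise.
pages = [
    ["Boston", "Chicago", "Click here"],
    ["Boston", "Denver", "Chicago"],
    ["Chicago", "Home", "Boston"],
    ["Denver", "Boston", "Sign in"],
]
ranked = aggregate_candidates(pages)
print(ranked)
```

No single page is "understood" here, yet the noise strings drop out because they do not repeat across pages. The same mechanism would also drop a genuine but obscure entity that appears on only one page, which is the limitation described above.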

Fernando Pereira said...

Regarding William's point b: for this task, you do not need POS tags, but a way of classifying tokens based on use and context that carries the relevant information, which is the distinction between proper names and other lexical categories. Of course, that assumes some morphological preprocessing, which might be challenging for highly inflected languages.