Thursday, June 26, 2008

"These can be aptly compared with the challenges, problems, and insights of particle physics"

"These can be aptly compared with the challenges, problems, and insights of particle physics": For this survey, I decided to start with some historical background, and so I went back to the famous ALPAC report. [...] This report is mainly remembered as a rather negative assessment of the quality, rate of progress, and economic value of research on Machine Translation. It's widely credited with eliminating, for more than two decades, nearly all U.S. Government support for MT research. [...] However, I had forgotten — if I ever knew — that as the title "Computers in Translation and Linguistics" suggests, the ALPAC report had another and more positive message. Here's a sample:

Today there are linguistic theoreticians who take no interest in empirical studies or in computation. There are also empirical linguists who are not excited by the theoretical advances of the decade–or by computers. But more linguists than ever before are attempting to bring subtler theories into confrontation with richer bodies of data, and virtually all of them, in every country, are eager for computational support.
(Via Language Log.)

Read Mark's whole post. But I can't resist the feeling that Mark is being too charitable towards ALPAC, who seem to have predicted the future of natural language processing exactly wrong. Use of computational methods and large corpora in linguistics is still the exception, while (statistical) machine translation is arguably the area of language processing that is making the most remarkable progress. At this year's ACL meeting, the pervasive impact of the core idea of statistical MT (mining co-occurrences in text to discover language units and unit correspondences) was evident, not only in advancing MT but also in many other areas of natural language processing. And statistical MT has generated new demand for advances in robust language analysis methods, as the recent progress of MT methods based on statistical grammars shows.

It's interesting how ALPAC could have gotten it so wrong. The main methods and metaphors of statistical MT are based on ideas from communication theory and (implicitly) from cryptology that should have been very natural to Pierce and his colleagues. Instead, they seemed to be captured by the formal syntax revolution in linguistics, and they expected that the main line of progress would be developing detailed linguistic theories and analyses. Twenty years later, the contrast between steady measurable advances in statistical speech recognition, which drew heavily from those communication theory/crypto ideas, and the relative floundering of symbolic methods in AI and language processing finally undid ALPAC and sent renewed DARPA funding into corpus collection and annotation, statistical NLP, and crucially statistical MT. Today, there's still very limited use of computational methods in core linguistic research, but you can get free machine translation services online that are getting steadily better by mining ever-growing data with statistical learning methods.

1 comment:

Mark Johnson said...

I'm sometimes asked for specific examples of linguistic insights that statistical methods provide.

The story I currently tell is that statistics is about inference, and statistical methods have given us a deeper understanding of inferential processes involving language, such as parsing and language acquisition.

Maybe this is enough. But I'd like to offer something more immediate and compelling.

In the early days of generative grammar, Chomsky presented a number of now-famous examples that followed from the theory he was developing. You might not agree with or even understand the theory, but it was hard not to accept that, e.g., in "John promised/persuaded Bill to shave himself" the referent of "himself" depends on the choice of verb.

Is there anything similarly compelling that statistical research on language can offer? (Saying that speech recognizers only work because they use statistical methods really doesn't cut it.)