Sunday, June 29, 2008

Exponential Stupidity

An Undertaking of Great Advantage, But Nobody to Know What It Is: Cosma Shalizi summarizes and links to a Wired report about an FBI data mining project. Leaving aside constitutional and political issues, what struck me most in the report was the following quote:

[...] the Justice Department claims that with this new data mining center’s access to billions of personnel records the “universe of subjects will expand exponentially.”
(Via Three-Toed Sloth.)

If the “universe of subjects” will expand exponentially, so will the universe of false positives for a fixed acceptance rate. Since the FBI's investigative capabilities cannot grow exponentially for demographic reasons, their only solution would be to decrease the acceptance rate of their data mining procedure. Given their past problems with false negatives, I'm not sure that would be tenable. Unless their data mining methods are getting exponentially better, which in my experience never happens.
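
To put rough numbers on that argument, here is a minimal sketch. The false-positive rate and the investigative capacity are made-up figures for illustration, not anything from the Wired report, and the sketch assumes essentially every record belongs to someone who is not a legitimate subject. It shows how the expected number of false positives scales with the number of records at a fixed rate, and how small the rate would have to become to keep the flagged cases within a fixed capacity.

```python
def expected_false_positives(population: int, false_positive_rate: float) -> float:
    """Expected number of innocent records flagged at a fixed false-positive rate."""
    return population * false_positive_rate


def rate_needed(population: int, capacity: int) -> float:
    """False-positive rate that keeps expected flags within a fixed capacity."""
    return capacity / population


# Hypothetical figures, chosen only to make the scaling visible.
FALSE_POSITIVE_RATE = 0.001  # 0.1% of innocent records get flagged
CAPACITY = 10_000            # flagged cases investigators can follow up on

for exponent in range(6, 10):
    population = 10 ** exponent
    flagged = expected_false_positives(population, FALSE_POSITIVE_RATE)
    needed = rate_needed(population, CAPACITY)
    print(f"{population:>13,} records: {flagged:>11,.0f} false positives; "
          f"rate needed to stay within capacity: {needed:.0e}")
```

Each tenfold increase in records brings a tenfold increase in false positives, so holding the workload constant requires a tenfold stricter rate each time, which is exactly where the false negatives come back in.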

Lest I sound too negative about data mining, here's a high-accuracy data mining rule: When high-ranking bureaucrats use the adverb “exponentially,” they are deluded or being economical with the truth.

Thursday, June 26, 2008

"These can be aptly compared with the challenges, problems, and insights of particle physics"

"These can be aptly compared with the challenges, problems, and insights of particle physics": For this survey, I decided to start with some historical background, and so I went back to the famous ALPAC report. [...] This report is mainly remembered as a rather negative assessment of the quality, rate of progress, and economic value of research on Machine Translation. It's widely credited with eliminating, for more than two decades, nearly all U.S. Government support for MT research. [...] However, I had forgotten — if I ever knew — that as the title "Computers in Translation and Linguistics" suggests, the ALPAC report had another and more positive message. Here's a sample:

Today there are linguistic theoreticians who take no interest in empirical studies or in computation. There are also empirical linguists who are not excited by the theoretical advances of the decade–or by computers. But more linguists than ever before are attempting to bring subtler theories into confrontation with richer bodies of data, and virtually all of them, in every country, are eager for computational support.
(Via Language Log.)

Read Mark's whole post. But I can't resist the feeling that Mark is being too charitable towards ALPAC, who seem to have predicted the future of natural language processing exactly wrong. Use of computational methods and large corpora in linguistics is still the exception, while (statistical) machine translation is arguably the area of language processing that is making the most remarkable progress. At this year's ACL meeting, the pervasive impact of the core idea of statistical MT (mining co-occurrences in text to discover language units and unit correspondences) was evident, not only in advancing MT but also in many other areas of natural language processing. And statistical MT has generated new demand for advances in robust language analysis methods, as the recent progress of MT methods based on statistical grammars shows.

It's interesting how ALPAC could have gotten it so wrong. The main methods and metaphors of statistical MT are based on ideas from communication theory and (implicitly) from cryptology that should have been very natural to Pierce and his colleagues. Instead, they seemed to be captured by the formal syntax revolution in linguistics, and they expected that the main line of progress would be developing detailed linguistic theories and analyses. Twenty years later, the contrast between steady measurable advances in statistical speech recognition — which drew heavily from those communication theory/crypto ideas — and the relative floundering of symbolic methods in AI and language processing finally undid ALPAC and sent renewed DARPA funding into corpus collection and annotation, statistical NLP, and crucially statistical MT. Today, there's still very limited use of computational methods in core linguistic research, but you can get free machine translation services online that are getting steadily better by mining ever growing data with statistical learning methods.

Monday, June 23, 2008

The End of Theory: The Data Deluge Makes the Scientific Method Obsolete

The End of Theory: The Data Deluge Makes the Scientific Method Obsolete: But faced with massive data, this approach to science — hypothesize, model, test — is becoming obsolete. [...] There is now a better way. Petabytes allow us to say: "Correlation is enough." We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot. (Via Wired News.)

I like big data as much as the next guy, but this is deeply confused. Where does Anderson think those statistical algorithms come from? Without constraints in the underlying statistical models, those "patterns" would be mere coincidences. Those computational biology methods Anderson gushes over all depend on statistical models of the genome and of evolutionary relationships.

Those large-scale statistical models are different from more familiar deterministic causal models (or from parametric statistical models) because they do not specify the exact form of observable relationships as functions of a small number of parameters, but instead they set constraints on the set of hypotheses that might account for the observed data. But without well-chosen constraints — from scientific theories — all that number crunching will just memorize the experimental data.
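
A toy sketch of that last point, under assumptions of my own: the data come from a noisy straight line, the "theory" is the constraint that the relationship is linear, and the unconstrained alternative is a lookup table that stores every training pair and guesses the training mean for anything it has not seen. None of this is one of the models Anderson mentions; it only illustrates memorization versus constrained fitting.

```python
import random


def fit_line(xs, ys):
    """Least-squares fit of y = a*x + b: the constrained hypothesis class."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return lambda x: a * x + b


def fit_lookup(xs, ys):
    """Unconstrained 'model': memorize each training pair, guess the mean elsewhere."""
    table = dict(zip(xs, ys))
    fallback = sum(ys) / len(ys)
    return lambda x: table.get(x, fallback)


def mse(model, xs, ys):
    """Mean squared error of a model on the given points."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


rng = random.Random(0)  # fixed seed so the run is repeatable


def observe(x):
    """Hypothetical experiment: a linear law plus Gaussian measurement noise."""
    return 2.0 * x + 1.0 + rng.gauss(0.0, 1.0)


train_x = list(range(0, 40, 2))  # inputs seen during fitting
test_x = list(range(1, 40, 2))   # held-out inputs, never seen during fitting
train_y = [observe(x) for x in train_x]
test_y = [observe(x) for x in test_x]

line = fit_line(train_x, train_y)
lookup = fit_lookup(train_x, train_y)

for name, model in (("line", line), ("lookup", lookup)):
    print(f"{name:>6}: train MSE {mse(model, train_x, train_y):8.3f}, "
          f"held-out MSE {mse(model, test_x, test_y):8.3f}")
```

The lookup table reproduces the training data perfectly and says nothing useful about the held-out points, while the line, which cannot even fit the training data exactly, carries over to them.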

Saturday, June 21, 2008

Books

After Dark

I hadn't read novels for a while. Maybe I was a bit disappointed by my fiction choices, maybe there were just too many interesting non-fiction books, like Microcosm, drawing my attention. But recently I had several trips where I needed something to read on the plane or at the hotel. I enjoyed Chabon's The Yiddish Policemen's Union (although, like Kavalier & Clay, it could have benefited from a tougher editor), but After Dark is something else. Spare, subtle, deceptively plain. If quantum entanglement can have a fictional embodiment, this may be it. The action seems random, but correlated in unexpected ways. Even in my own memory. When I started reading it on the flight from MSP to SFO (returning from ACL in Columbus), it evoked the long-ago reading of Report on Probability A, by Brian Aldiss. Aldiss did not achieve Murakami's grit and emotional density, but he got the entanglement.

I had never read any novel by Murakami. Then I read a self-portrait in The New Yorker that used his taking up of long-distance running as the driver for the story of becoming a full-time fiction writer. I had to read something by him after that.

Chile-Argentina

In other book news, the beautiful Chile-Argentina: Handbook of Ski Mountaineering in the Andes by Frédéric Lena arrived in a well-wrapped package from Grenoble. Beautiful photos, detailed maps and routes of many Andean ski tours, a few of which I've done. Now I just need to sort out my trip down there this summer...

Friday, June 6, 2008

Dreaming of Andean snow

It's fall in the Southern hemisphere, and those of us who have been smitten by the terrain, landscape, and people down there start dreaming of climbing another volcano to ski it. Frédéric Lena has climbed and skied many of them, and has now created a great site and book for all of us Andes skiing fanatics.

Sunday, June 1, 2008

Sierra Spring

I forgot my camera in the rush to pack and drive up to Strawberry on Friday evening, but even camera-less, I had a wonderful two days of backcountry spring skiing around Tahoe with friends. On Saturday, we hiked and skinned up from Carson Pass to Round Top and skied N-facing corn snow from the shoulder of Round Top down to the bowl between Round Top and The Sisters. Then we just had to slog out over melting snow and open ground bursting with wild flowers. On Sunday, we skinned up to ski three excellent runs off Tamarack Peak above the Mount Rose Highway, and we just had to remove our skis once on the way back to the road. Both days were warm, but some clouds and a cool breeze kept the snow from turning into mush and threatening wet slides on the aspects we traveled. Good skiing on June 1st is yet another of those Sierra gifts.