Tuesday, January 30, 2007

Bosworth On Why AJAX Failed, Then Succeeded

Bosworth On Why AJAX Failed, Then Succeeded: An anonymous reader writes "eWeek has a story describing a talk by former Microsoft developer Adam Bosworth, now a VP at Google, entitled 'Physics, Speed and Psychology: What Works and What Doesn't in Software, and Why.' Bosworth depicts issues with processing, broadband, natural language, and human behavior; and he dishes on Microsoft."

Given my interests, the most interesting part of the eWeek article is:

Natural language was billed as a replacement for the GUI, but it failed to achieve that. It also failed as a query language for databases, as a calculation language for spreadsheets and as a document creation language, Bosworth said. "Humans expect a human level of comprehension," he said, noting that database queries and spreadsheet formulas have to be exact.
But natural language got a second life, too, triggered in part by Microsoft Help, and the next step turned out to be Google, Bosworth said. The trick to being successful with natural language is to "start with a fuzzy problem, one no human can resolve anyway…orient it around search, and the magic is just in the ranking," he said.

Exactly. Another example that I'm familiar with is bioinformatics. All the methods we use for sequence comparison, gene prediction, regulatory network reconstruction, protein structure prediction, and so on, are fallible. If exact answers were required, nothing could be done. But inexact answers are still useful, and in any case there is no alternative. Natural language may be useless for database queries where exact answers are expected, like the standard database examples (employee, department, ...), but in bioinformatics databases, some of the core relations are uncertain anyway.
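
To make the point concrete, here is a toy sketch (in Python, with made-up scores) of the kind of inexact, ranked comparison that underlies sequence analysis. Real tools like BLAST use far more sophisticated scoring and heuristics, so treat this only as an illustration of "useful but inexact."

```python
# Toy global-alignment score (Needleman-Wunsch style) for two DNA fragments.
# A sketch only: real sequence comparison adds statistics, heuristics, and
# biologically motivated scoring matrices.

def alignment_score(a: str, b: str, match=1, mismatch=-1, gap=-2) -> int:
    """Return the best global alignment score between sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    # dp[i][j] = best score aligning a[:i] with b[:j]
    dp = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dp[i][0] = i * gap
    for j in range(1, cols):
        dp[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = dp[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            dp[i][j] = max(diag, dp[i - 1][j] + gap, dp[i][j - 1] + gap)
    return dp[rows - 1][cols - 1]

if __name__ == "__main__":
    # The score ranks candidate matches; there is no single "exact" answer.
    print(alignment_score("GATTACA", "GATCACA"))   # similar fragments
    print(alignment_score("GATTACA", "CCCGGG"))    # dissimilar fragments
```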

Monday, January 29, 2007

write a song for the dead

write a song for the dead: real innovation from a century long ago. A Lucent PR piece, but it was dead before it was Lucent.

As the last minute or so was rolling by, I was matching faces to where they are now: Harvard, NYU, Columbia, Penn, DARPA, ... I'm sure I could place many more if I slowed it down, but it's really not needed.

Saturday, January 27, 2007

mobile internet non-use

mobile internet non-use: Reported in Europe - and some commentary with some possible reasons. Ultimately mobile Internet has to be open (no walled gardens) and inexpensive. (Via tingilinde).

In the US, I use a fixed-monthly-fee EDGE/GPRS T-Mobile service to connect my laptop to the net over my Bluetooth phone. Quality varies from very usable to terrible depending on location and wireless congestion, but it is much more convenient than for-pay WiFi. Just over the last two weeks, I had two incidents (one with Sprint at SLC, the other with T-Mobile HotSpot at SFO) in which I tried to pay for airport WiFi service and the registration process failed somehow. In fact, I used to have a HotSpot account, but it stopped working for reasons that I still don't understand and am not willing to waste the time tracking down. The Balkanized state of WiFi provision is a joke.


Friday, January 26, 2007

The cost of search computations

I just picked up Why Choose This Book? by Read Montague as reading material for my flight back from a meeting. Coincidentally, I had been talking with several friends about the costs and benefits of the computations needed to add natural-language processing to search. Here's Montague on the cost of computation and the peculiar development of modern computers:

It is widely thought that the effort of the group at Bletchley Park saved many lives during the war. But I think it also attached to modern computing an odd legacy. The model for the code-breaker was speed and accuracy at all costs, which means loads of wasted heat. The machines did not have to decide how much energy each computation should get; instead, they simply oversupplied energy to all the computations and wasted most of it. And although we understand the urgent needs of the time, this style allowed them all to overlook a critical fact --- the amount of energy a computation “should” get is a measure of its value to the overall goal. Goals and energy allocation under those goals; these two features are generally missing from the modern model of computing today. Just as for the code-breakers, speed and accuracy are the primary constraints on modern computers.

This picture is changing radically for the search engine. The computation for each query has quantifiable costs and benefits. The costs include R&D, amortization of the computing infrastructure, energy, rent, and maintenance. The direct benefit is advertising revenue from the query response; indirect benefits, such as market share gained from greater search quality, are harder to measure but can still be estimated. A search engine's success is ultimately determined by how efficiently it turns those costs into advertising revenue.

Researchers often complain that search engines are not using (their) latest and greatest ideas in machine learning and natural-language processing. How could they be so resistant to ideas that must improve search quality?

Before I support their complaints, I'd want to see a good cost/benefit analysis. More elaborate algorithms cost more to develop and run. New page analysis, indexing, or retrieval methods will spread their costs all over the building and operation of search engine facilities. It is not inconceivable that one of the fancier query-processing schemes being bandied about would double storage and computation requirements. How much more advertising revenue would it have to generate just to break even with the current efficiency? How likely is it that it will?
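
To make the break-even question concrete, here is a back-of-the-envelope sketch; every number in it is invented for illustration, not an estimate of any real search engine's economics.

```python
# Back-of-the-envelope break-even check for a costlier query-processing scheme.
# All figures below are invented for illustration.

def breakeven_revenue_multiplier(cost_per_query: float,
                                 revenue_per_query: float,
                                 cost_multiplier: float) -> float:
    """By what factor must revenue per query grow so that profit per query
    is unchanged after costs grow by cost_multiplier?"""
    old_profit = revenue_per_query - cost_per_query
    new_cost = cost_multiplier * cost_per_query
    return (old_profit + new_cost) / revenue_per_query

# Hypothetical numbers: 0.3 cents cost and 1.0 cent revenue per query,
# and a fancier NLP pipeline that doubles storage and computation.
print(breakeven_revenue_multiplier(0.003, 0.010, 2.0))  # ~1.3x revenue needed
```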

Unlike the Bletchley Park model of computation discussed by Montague, the search engine has no absolute accuracy or timeliness requirements. Higher accuracy or a fresher index is good only insofar as it leads to better ad click-through. In addition, user response to search results depends on factors beyond accuracy, such as speed and clarity. These tradeoffs can be quantified and tested experimentally.

Search engines are very interesting from this perspective because they are persistent computations whose survival depends on their ability both to pay their way and to promise future profits, so that a high stock price reduces their cost of capital for growth. They are truly embodied computations, whose meaning is ultimately assigned by their survival value.


Thursday, January 25, 2007

The end of end-to-end?

Vint Cerf: one quarter of all computers part of a botnet: A new estimate on the prevalence of botnets from one of the Internet's leading authorities paints a disturbing picture of a world in which 25 percent of all Internet-connected machines have bot software installed on them. (Via ArsTechnica).

If all the intelligence is at the edge of the network, so is all the malice. End-to-end advocates need to take this very seriously, or self-serving walled-garden advocates will have a very strong argument for centralized control.

Wednesday, January 24, 2007

Norway: Apple's FairPlay DRM is illegal

Norway: Apple's FairPlay DRM is illegal: Norway today ruled that Apple's digital rights management technology on its iPod and iTunes store is illegal, following a report earlier this week that both France and Germany have also decided to go after Apple's closed iPod/iTunes ecosystem. According to Out-Law.com, the Consumer Ombudsman in Norway has ruled that the closed system is illegal because the songs, encoded with Apple's FairPlay DRM, cannot be played on any music device other than an iPod, breaking Norway's laws. (Via MacNN).

The arguments for the iPod/iTMS tie-in have been user experience quality and intellectual property protection. These arguments would be more believable if Apple had demonstrated a good-faith effort with competitors and standards bodies to create an open or freely licensable decoupling between devices and media delivery and protection protocols. It may be that such a decoupling is not feasible given the parties' conflicting requirements. But we won't know without an honest design and experimentation effort. Apple had the resources and public goodwill to lead this effort in a positive-sum direction when the iPod took off. They didn't, and now they'll have to fight defensively rather than drive the agenda.

Sunday, January 21, 2007

Doing Meta: from meta-language to meta-clippy

Doing Meta: from meta-language to meta-clippy: The theme of the recent January/February issue of Technology Review is "software", and the cover story is "Anything You Can Do, I Can Do Meta", by Scott Rosenberg. [...] I guess that the source must have been the language/metalanguage distinction in logic, though exactly how this usage came into proto-computer science in the 1950s and 1960s is not clear to me.

The earliest use of meta-language I can remember in computer science is in John McCarthy's 1962 LISP 1.5 Programmer's Manual:

The second important part of the LISP language is the source language itself which specifies in what way the S-expressions are to be processed. This consists of recursive functions of S-expressions. Since the notation for the writing of recursive functions of S-expressions is itself outside the S-expression notation, it will be called the meta language. These expressions will therefore be called M-expressions.

This approach to programming language specification should be contrasted with that of the equally famous Revised Report on the Algorithmic Language Algol 60 by Backus et al., 1963, who introduce the well-known BNF syntactic meta language (although they don't call it a meta language) but use informal natural language to describe the semantics of Algol 60. The notions of abstract machine and formal operational semantics still needed a few years to develop beyond the relatively simple recursive definitions sufficient for LISP 1.5.
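
As a toy illustration of how far simple recursive definitions can go, here is a small Python sketch of an evaluator for arithmetic S-expressions. It is only loosely inspired by LISP 1.5; it is emphatically not McCarthy's meta-language definition, and the operator names are made up.

```python
# Toy illustration of a recursive definition serving as a semantics for
# S-expressions. A sketch in Python, not McCarthy's meta-language.

def parse(tokens):
    """Parse a token list into nested Python lists and atoms."""
    tok = tokens.pop(0)
    if tok == "(":
        expr = []
        while tokens[0] != ")":
            expr.append(parse(tokens))
        tokens.pop(0)  # drop ")"
        return expr
    return int(tok) if tok.lstrip("-").isdigit() else tok

def evaluate(expr):
    """Evaluate a tiny arithmetic S-expression."""
    if isinstance(expr, int):
        return expr
    op, *args = expr
    vals = [evaluate(a) for a in args]
    if op == "plus":
        return sum(vals)
    if op == "times":
        out = 1
        for v in vals:
            out *= v
        return out
    raise ValueError(f"unknown operator: {op}")

source = "( plus 1 ( times 2 3 ) )"
print(evaluate(parse(source.split())))  # -> 7
```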

Tuesday, January 9, 2007

more questions and comments on the iphone..

more questions and comments on the iphone..: Apart from knowing what is in the device, how well the touch surface works, rf performance, etc etc etc ... the question of a software development kit and "openness" looms. [...] I would be more interested in buying one free of a service agreement (even if it costs several hundred dollars more) if it is open and if I can program it. There are any number of interesting things it could do as a small wifi device (including voip) (Via tingilinde).

If the iPod is anything to go by, good luck. The price Apple pays to be allowed to play in these spaces is DRM for media and walled gardens for telephony. And it's a price they are probably happy to pay, as it justifies tie-ins like the iPod/iTunes Store.

Apple has justified its closed systems with the benefits of a unified user experience. There's something there, but as the enclosed area becomes larger, the justification is increasingly suspect.

The fact that these enclosures are associated with businesses that have not distinguished themselves for their innovation and openness makes this direction especially worrisome. Apple's new niche will be fashion designer and toll collector for oligopolies.

Sunday, January 7, 2007

Germany Quits EU-Based Search Engine Project

Germany Quits EU-Based Search Engine Project: The Quaero project, a French initiative to build a European rival to Google, has lost the backing of the German government. The search engine was announced in 2005 by Jacques Chirac and Gerhard Schroeder, but the German government under Merkel has decided that Quaero isn't worth the $1.3-2.6 billion commitment that development would require. Germany will instead focus on a smaller search engine project called Theseus. From the article: 'According to one French participant, organizers disagreed over the fundamental design of Quaero, with French participants favoring a sophisticated search engine that could sift audio, video and other multimedia data, while German participants favored a next-generation text-based search engine.' (Via Slashdot).

This quote from the original article summarizes well why I was skeptical about this project from the beginning:

"In Germany I think there was also resistance to the idea of a top-down project driven by governments," said Andreas Zeller, chairman of software engineering at the University of Saarland in Saarbrücken, Germany, which supplied advisors to Quaero. "Success in the end is something that can't be planned but is something that begets itself."

The top-down project model does not work for building widely used goods and services because it is not responsive enough to early user feedback. The effectiveness of a search engine cannot be predicted, but it can be measured in the field. Bureaucratic top-down projects do not seem to be able to build something simple early, measure its effectiveness, and use the metrics to quickly evaluate proposed improvements. In other words, the design and development processes are not adaptive enough.

I like the analogy of learning to ski late in life. One of the hardest things is to learn to trust fast low-level feedback and small adaptations, and to push conscious control out of the way, because it is far too slow to do the right thing in time. Low-level adaptation comes from trying small adjustments and getting immediate feedback (oops, I'm out of balance!). Good teachers use exercises to decompose motions so that the student becomes aware of small perceptual and motor effects and can put robust adaptation processes into place.

In other words: effective complex artifacts or processes are unlikely to be designed as a whole; they are much more likely to evolve through a process that quickly evaluates combinations of robust, field-tested pieces. Like biological evolution.

Saturday, January 6, 2007

Avalanches and public safety

Now that the avalanche in Colorado that hit U.S. 40 is in all the news, it's as good a time as any to complain about how the federal and state governments have been starving avalanche forecasting and control, even as the number of winter users of public lands increases and more people move into beautiful Western areas with significant avalanche hazards. For example, the outstanding Utah Avalanche Center has a budget shortfall of $30K in a total budget of $256K:
Bruce Tremper, Director, Utah Avalanche Center: "There's more and more accidents and there is more need for avalanche education and avalanche forecasting. We are just barely keeping up."

The Director of the Center says it's been many years since the center has had a funding increase, yet the need for more services grows as more and more people head to the backcountry for recreation.

Bruce Tremper, Director, Utah Avalanche Center: "What's frustrating for us is we just keep seeing the same accidents over and over in the same places. Just the names change is all…there's just a lot of ignorance about avalanches and we would love to do more out reach programs to get more education."
The Friends of the UAC have been forced to start an emergency fund drive. If you have ever skied in Utah, consider contributing. I have.

To put this budget in perspective, Utah ski areas sell over 4M tickets each year. With current ticket prices running to over $50, a ticket surcharge of 10 cents, that is, about 2/1000 of the ticket price, would raise roughly $400K a year, more than one and a half times the Center's entire current budget.
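
A quick check of that arithmetic, using only the figures quoted above (4M tickets, a 10-cent surcharge, a $256K budget):

```python
# Back-of-the-envelope check of the surcharge estimate above.
tickets_per_year = 4_000_000      # Utah ski tickets sold per year (as quoted)
surcharge = 0.10                  # dollars per ticket
center_budget = 256_000           # current UAC budget in dollars

raised = tickets_per_year * surcharge
print(raised, raised / center_budget)   # 400000.0, ~1.56x the current budget
```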

Thursday, January 4, 2007

Bear meat

The current New Yorker has a wonderful mountaineering short story by Primo Levi, Bear Meat. I have missed so much what I felt he left unwritten when he died at 67 that this (for me) new story is a magical gift, on a subject that I love.

Wednesday, January 3, 2007

5 things

Steve tagged me a while ago, but I somehow missed that post. Seeing Hanna's response made me read Steve's post and so, here are five things you might not know about me:
  1. If I did not do what I do, I would have liked to work as a geologist (volcanology, glaciology).
  2. I first skied at Alpine Meadows, California, in the winter of 1983-84. I haven't yet recovered.
  3. My favorite novel is Musil's "The Man Without Qualities."
  4. My favorite composer is Bartók.
  5. My first foreign language was French, but it rusted away a bit in the constant presence of English.
I'll tag Lawrence only; since I'm a latecomer to this tag propagation, other targets have been tagged already.

The Two Faces of Natural Language Search

The Two Faces of Natural Language Search. Matt brings out a widely held misconception about natural language in search:
Read/Write Web writes about NLS and the NYT article:
Based on what we have seen so far, it is difficult to see how these companies can beat Google. Firstly, being able to enter the query using natural language is already allowed by Google, so this is not a competitive difference. It must then be the actual results that are vastly better. Now that is really difficult to imagine. Somewhat better maybe, but vastly different? Unlikely.
It seems there is a common misconception about NLS which limits the application of NLP to the search query. One has also to account for the fact that NLP can and will be applied to interpreting the data in the content store - for example, parsing the sentences in the text into some logical form that can then be indexed. I'm not sure how this misconception got started, but it renders Alex Iskold and Robert MacManus' statement that natural language is 'not a competitive difference' moot.

The operative words in the quote are "based on what we have seen so far." The discussions of NLP in corporate press releases and in the technology press are so trite that these writers have very likely never seen a convincing argument for NL document analysis as a foundation for search. Indeed, what would be a convincing argument? It is not as if even the most advanced NLP research has demonstrated reliable broad-coverage language analysis that could serve as the basis for significantly improved search. Some NLP researchers may believe that such a capability is just around the corner, but certainly none has been demonstrated yet as far as I know. What we have are hunches and promissory notes. Shallow NLP already plays a useful role in certain vertical search applications, but as for "parsing the sentences in the text into some logical form that can then be indexed," sure, we can run parsers on any text and index the resulting data structures, but it is a big leap from there to answering queries in a way that will seem an improvement to a wide range of users for a wide range of queries.

I am all for putting a lot of research effort into this area, but we have to be humble about the difficulties in our way. It is good to remember that the "bag-of-words" model for information retrieval (IR) goes back around 40 years (Salton's Automatic Information Organization and Retrieval was published in 1968), and it is only in the last ten years that good bag-of-words retrieval has made it into general use. Sure, there were a lot of incremental improvements, and important new ideas like PageRank, but it might not be too pessimistic to argue that the current state of NLP is comparable to the state of IR circa 1980.
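
For readers who haven't met it, here is a minimal sketch of what the bag-of-words model amounts to: documents and queries are reduced to weighted term multisets and ranked by vector similarity. The particular weighting and the toy documents below are illustrative choices, not a description of any production system.

```python
# Minimal bag-of-words retrieval sketch (TF-IDF weighting, cosine ranking).
# Just an illustration of the model discussed above, not a real IR system.
import math
from collections import Counter

def build_index(docs):
    """Tokenize documents and compute document frequencies."""
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter(term for toks in tokenized for term in set(toks))
    return tokenized, df

def weight(tokens, df, n_docs):
    """TF-IDF weights for one bag of words (add-one smoothed IDF)."""
    tf = Counter(tokens)
    return {t: tf[t] * math.log((1 + n_docs) / (1 + df[t])) for t in tf}

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = ["natural language search engines",
        "protein structure prediction methods",
        "search advertising revenue"]
tokenized, df = build_index(docs)
doc_vecs = [weight(toks, df, len(docs)) for toks in tokenized]
query_vec = weight("language search".split(), df, len(docs))
ranking = sorted(range(len(docs)),
                 key=lambda i: cosine(query_vec, doc_vecs[i]), reverse=True)
print(ranking)  # document indices ordered by similarity to the query
```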

I continue to believe that NLP will infiltrate search rather than take over search. As we develop more robust ways to recognize classes of concepts and relationships expressed in text, we will be able to improve indexing and query matching. We already see very simple concept recognition methods, such as the recognition of dates and addresses in GMail, that are useful even if they seem trivial compared with the grand ambitions of NLP. In work that I am involved in, recognizing accurately which genes are discussed in biomedical abstracts already helps scientists do more specific and complete literature searches, which is valuable to them given the accelerating growth of biomedical research. In this respect, I agree with the cited article that "vertical" search methods are best seen as helpful modules in a general search engine rather than as separate services.
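
Here is a minimal sketch of the kind of shallow concept recognition I have in mind, annotating text with extra index terms. The patterns and labels are invented for illustration; the real systems mentioned above (date and address detection in GMail, gene-mention tagging for Fable) are far more elaborate and heavily evaluated.

```python
# Minimal sketch of shallow concept recognition used to enrich an index.
# The patterns and labels below are invented for illustration only.
import re

CONCEPT_PATTERNS = {
    # crude date pattern, e.g. "January 30, 2007"
    "DATE": re.compile(r"\b(?:January|February|March|April|May|June|July|"
                       r"August|September|October|November|December)"
                       r"\s+\d{1,2},\s+\d{4}\b"),
    # crude gene-symbol pattern, e.g. "BRCA1" (hypothetical and very noisy)
    "GENE": re.compile(r"\b[A-Z][A-Z0-9]{2,5}\d\b"),
}

def annotate(text: str):
    """Return (concept_label, matched_span) pairs to add to the index."""
    hits = []
    for label, pattern in CONCEPT_PATTERNS.items():
        for m in pattern.finditer(text):
            hits.append((label, m.group(0)))
    return hits

doc = "On January 30, 2007 a study reported new regulators of BRCA1 expression."
print(annotate(doc))
# e.g. [('DATE', 'January 30, 2007'), ('GENE', 'BRCA1')]
```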

I have turned on comments for this post, to see what happens.

Tuesday, January 2, 2007

Natural Language Search

Natural Language Search:
Matt Hurst:
Fernando writes clearly about what the article could have been about, though I fear that his expectations for the intersection between journalists, their audience and this particular subject may be too high.

I disagree. If Times science writers like Natalie Angier or Nicholas Wade can write deeply and clearly about the most difficult scientific questions for the Times readership, I see no reason why the same expectations of quality should not apply to technology writing.


Then Matt gets into the substance:


Fernando writes:
The fundamental question about NLS is whether the potential gains from a deeper analysis of queries and indexed documents are greater than the losses from getting the analysis wrong.
I actually disagree with this as being the right question. I see NLP as being a strong contender for changing the utility of the web, and our interfaces into it (a.k.a. search engines), from the discovery of documents to the discovery of knowledge and information. Yes, that will be backed by documents, but they won't be the primary 'result'. For example, when I ask 'who invented the elevator?' I don't mean 'find me documents that, with a high probability, contain text that will answer the question: who invented the elevator?'. I really mean who invented the elevator?
NLS has the potential to come back with the result: Elisha Graves Otis.

I think that this distinction between "knowledge and information" and "documents" is a red herring, for several reasons:

  • Useful information is information in context. I don't just want to know a so-called fact; I want to know who stated it, where, and how. Your example is exactly the kind that people working in NLQA use all the time, and also the kind that is pretty useless except for trivia games and bad middle-school essays.

  • Even when we want to aggregate information across documents, we want the documents directly accessible to assess provenance and thus quality of the extracted information. For an example, check out the gene lister on Fable.

  • When the information is not explicitly stated in a document, I am very skeptical of current methods for drawing inferences involving information scattered among documents and other sources. There's good research going on in this area, such as the textual entailment challenges, but it is very far from what we would want to rely on for a practical search engine.


Finally, Matt complains rightly about this medium:
Ok - now allow me to complain about something else. I posted about the NYT article and Fernando, I assume, read my post and wrote his. I have now written a follow up and we have all linked to each other nicely. However, consider how annoying it is to follow this 'conversation.' Fernando could have left a comment on my post. I could have left a comment on his (though he actually has them turned off). The fact that there are multiple ways for this discussion to flow and there are no integrated mechanisms for readers (or writers) to tune in to the discussion makes a lie of the whole 'conversations in the blogosphere' proposition. It's been a problem for a long time and is an element of a theme which I think will be important next year - the efficiency of social media.

I didn't leave a comment in Matt's blog for two reasons:

  1. I want my web writing in one place, so that whoever gets my feed can see it.

  2. I dislike the editing environment of blog comment boxes.


I don't allow comments on my new Blogger blog because I have a low opinion of the S/N ratio in blog comments. I totally agree with Matt that this is not good. What I would like is a means for creating unified discussion threads in a distributed fashion. That is, Matt writes, I comment, he comments back, someone else chimes in, all in our own blogs, but a virtual thread is established that can be easily read by anyone.

Monday, January 1, 2007

Powerset In The New York Times

Powerset In The New York Times: A nice little article summarizing the playing field for novel search going into 2007. (Via DataMining).

It's good to see Barney and his colleagues in the Times. However, I didn't think much of the article. As is unfortunately common in the MSM, there is no substance in the story, except for who invested and how much. What is "natural language search" (NLS), in terms that would make sense to the average reader of the business section of the Times? If current search engines do not use NLS, is it just because they are too fat and distracted? Or are there technical, let alone scientific, reasons for the lack of NLS? The writer missed the opportunity to illustrate the issues and challenges with some concrete examples, for instance some of those that Barney discussed in his blog a while ago.

The fundamental question about NLS is whether the potential gains from a deeper analysis of queries and indexed documents are greater than the losses from getting the analysis wrong. The history of using deeper analysis to improve speech recognition accuracy does not give much cause for optimism. Even after many years of effort, the improvements are modest at best. And language modeling for speech recognition is a pretty simple task compared with answering natural language queries, which may require deeper inference involving a wide range of background knowledge.

A second question is whether users would be happy with NLS. Natural language is what we use to communicate with each other. Our use of natural language involves subtle expectations about our interlocutors, including their ability to talk back intelligently. If the interlocutor doesn't seem to keep up with those expectations, we may prefer a simpler, more predictable mode of communication.

I believe that natural-language processing (NLP) can help improve search even in the short-to-medium term. But those improvements are more likely to be incremental, as for instance when search engines become better able to recognize a wider range of entities and relationships in indexed pages that can be used to answer queries more precisely. As search engines start moving in this direction, users may gain confidence in the effectiveness of richer queries. NLS will be the end result of a long evolution, not something completely designed from the beginning.

Yes, I've been reading yet another book on evolution, Sean B. Carroll's The Making of the Fittest. Computer scientists and software engineers can benefit a lot from reflecting on the evolutionary processes that led to the most complex and adaptable information processing systems known. We still believe too much in upfront design, and not enough in quick search, testing, and selection.