Wednesday, January 3, 2007

The Two Faces of Natural Language Search

Matt brings out a widely held misconception about natural language in search:
Read/Write Web writes about NLS and the NYT article:
Based on what we have seen so far, it is difficult to see how these companies can beat Google. Firstly, being able to enter the query using natural language is already allowed by Google, so this is not a competitive difference. It must then be the actual results that are vastly better. Now that is really difficult to imagine. Somewhat better maybe, but vastly different? Unlikely.
It seems there is a common misconception about NLS that limits the application of NLP to the search query alone. One also has to account for the fact that NLP can and will be applied to interpreting the data in the content store, for example by parsing the sentences in the text into some logical form that can then be indexed. I'm not sure how this misconception got started, but it undermines Alex Iskold and Richard MacManus's claim that natural language is 'not a competitive difference'.

The operative words in the quote are "based on what we have seen so far." The discussions of NLP in corporate press releases and in the technology press are so trite that it is very likely the authors have never seen a convincing argument for NL document analysis as a foundation for search. Indeed, what would a convincing argument look like? It is not as if even the most advanced NLP research has demonstrated reliable broad-coverage language analysis that could serve as the basis for significantly improved search. Some NLP researchers may believe that such a capability is just around the corner, but certainly none has been demonstrated yet as far as I know. What we have are hunches and promissory notes. Shallow NLP already plays a useful role in certain vertical search applications, but as for "parsing the sentences in the text into some logical form that can then be indexed": sure, we can run parsers on any text and index the resulting data structures, but it is a big leap from that to answering queries in a way that will seem an improvement for a wide range of queries and users.
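To make the leap concrete, here is a toy sketch of what "parse into logical form, then index" might mean in miniature. Everything here is hypothetical: a naive triple extractor stands in for a real parser, and the "logical form" is just a subject-verb-object tuple. The point is only that queries then match against structures, not words:

```python
import re
from collections import defaultdict

def toy_parse(sentence):
    """A deliberately naive 'parse': take the first three words of a
    sentence as a subject-verb-object triple. A real system would need
    a full syntactic and semantic parser here."""
    words = re.findall(r"[a-z]+", sentence.lower())
    if len(words) < 3:
        return None
    return (words[0], words[1], words[2])

# Index maps a logical-form triple to the documents asserting it.
index = defaultdict(set)

def index_document(doc_id, text):
    for sent in text.split("."):
        triple = toy_parse(sent)
        if triple:
            index[triple].add(doc_id)

index_document(1, "Aspirin inhibits inflammation.")
index_document(2, "Caffeine blocks adenosine.")

# A query expressed in the same logical form matches structurally:
print(index[("aspirin", "inhibits", "inflammation")])  # {1}
```

Even this toy shows where the difficulty lies: the hard part is not indexing the data structures, it is producing logical forms reliable enough that structural matching beats word matching across diverse text and queries.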

I am all for putting a lot of research effort into this area, but we have to be humble about the difficulties in our way. It is good to remember that the "bag-of-words" model for information retrieval (IR) goes back around 40 years (Salton's Automatic Information Organization and Retrieval was published in 1968) and it is only in the last ten years that good bag-of-words retrieval has made it into general use. Sure, there were a lot of incremental improvements, and important new ideas like PageRank, but it might not be too pessimistic to argue that the current state of NLP is comparable to the state of IR circa 1980.
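For readers who have not met the model, the bag-of-words idea fits in a few lines: ignore word order entirely, weight each term by frequency and rarity, and rank documents by weighted overlap with the query. This is a minimal TF-IDF sketch with an illustrative document set, not any particular system's implementation:

```python
import math
from collections import Counter

docs = {
    1: "natural language search engines",
    2: "bag of words retrieval model",
    3: "language model for retrieval",
}

def tf_idf_vectors(docs):
    """Build term-weight vectors: term frequency times inverse document
    frequency, so rare terms count for more than common ones."""
    n = len(docs)
    df = Counter()          # in how many documents each term appears
    tfs = {}
    for d, text in docs.items():
        tf = Counter(text.split())
        tfs[d] = tf
        df.update(tf.keys())
    idf = {t: math.log(n / df[t]) for t in df}
    vecs = {d: {t: c * idf[t] for t, c in tf.items()} for d, tf in tfs.items()}
    return vecs, idf

def rank(query, vecs, idf):
    """Score each document by weighted term overlap with the query."""
    q = Counter(query.split())
    scores = {}
    for d, vec in vecs.items():
        s = sum(q[t] * vec.get(t, 0.0) for t in q)
        if s > 0:
            scores[d] = s
    return sorted(scores, key=scores.get, reverse=True)

vecs, idf = tf_idf_vectors(docs)
print(rank("bag of words", vecs, idf))  # [2]
```

That four decades separate this simple idea from its mature, web-scale deployment is exactly the cautionary point about timelines for NLP in search.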

I continue to believe that NLP will infiltrate search rather than taking over search. As we develop more robust ways to recognize classes of concepts and relationships expressed by text, we will be able to improve indexing and query matching. We already see very simple concept recognition methods, such as the detection of dates and addresses in GMail, that are useful even if they seem trivial compared with the grand ambitions of NLP. In work that I am involved in, recognizing accurately which genes are discussed in biomedical abstracts already helps scientists do more specific and complete literature searches, which is valuable to them given the accelerating growth of biomedical research. In this respect, I agree with the cited article that "vertical" search methods are best seen as helpful modules in a general search engine rather than as separate services.
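A toy illustration of the kind of simple concept recognition I mean, with a deliberately crude date pattern standing in for real recognizers (GMail's are not public, so this is purely an assumption-laden sketch). The key move is that recognized concept classes are indexed alongside plain words, so queries can target the class itself:

```python
import re
from collections import defaultdict

# Crude date pattern, e.g. "January 3, 2007". Real recognizers
# are far more elaborate and far more robust.
DATE = re.compile(
    r"\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)"
    r"[a-z]*\.? \d{1,2}(?:, \d{4})?\b"
)

def concept_tags(text):
    """Tag surface concepts so they can be indexed like words."""
    return [("DATE", m.group()) for m in DATE.finditer(text)]

index = defaultdict(set)

def index_doc(doc_id, text):
    # Index ordinary words...
    for word in text.lower().split():
        index[word].add(doc_id)
    # ...plus concept classes, enabling queries like "messages
    # that mention a date" rather than a literal word match.
    for concept, _span in concept_tags(text):
        index[concept].add(doc_id)

index_doc(1, "The meeting is on January 3, 2007 in the main hall.")
index_doc(2, "No dates mentioned here.")
print(index["DATE"])  # {1}
```

Gene-mention recognition in biomedical abstracts follows the same pattern at a much higher level of difficulty: the recognizer changes, but the way its output infiltrates the index does not.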

I have turned on comments for this post, to see what happens.
