Saturday, February 17, 2007

Reason, emotion, and the limits of computation

I listened today to a good Open Source podcast on Spinoza with António Damásio (Looking for Spinoza) and Rebecca Goldstein (Betraying Spinoza). I'm not qualified to comment on their readings of Spinoza, but I confirmed an impression I had from reading their books: they do not appear to recognize the role that the limits of computation play in the evolutionary necessity of the reason-emotion linkage. To the extent that we equate reasoning with logical (or probabilistic) inference, computational complexity tells us that a reasoner cannot explore most of the consequences of its beliefs. The reasoner needs some means of directing limited computational resources in promising directions, otherwise it will not reach any conclusions useful to its well-being. By coloring beliefs with associations to bodily states, emotions provide those directions. Damásio explains the reason-emotion connection beautifully in his books, but he does not stress that it is a necessity driven by computational limits, not just an intriguing and useful accident of evolutionary history.
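
To make the computational point concrete, here is a minimal sketch of forward-chaining inference under a fixed budget (my toy illustration, not anything from Damásio or Spinoza; the facts, rules, and salience scores are invented). Candidate inferences multiply as new conclusions are added, so a bounded reasoner has to rank them, and a salience score plays the directing role that emotions, on Damásio's account, play for us.

    import heapq
    import itertools

    # Toy forward-chaining reasoner (an illustration, not a cognitive model).
    # Facts are strings; each rule maps a two-fact premise set to a conclusion.
    # Every new conclusion creates new candidate premise pairs, so the space of
    # possible inferences grows combinatorially and a bounded reasoner must choose.

    def infer(facts, rules, salience, budget):
        """Perform at most `budget` inferences, highest-salience conclusions first."""
        known = set(facts)
        counter = itertools.count()      # tiebreaker so the heap never compares sets
        frontier, queued = [], set()     # heap entries: (-salience, tiebreak, premises)

        def queue_pairs(fact):
            for other in known:
                premises = frozenset((fact, other))
                if premises in rules and premises not in queued:
                    queued.add(premises)
                    heapq.heappush(
                        frontier, (-salience(rules[premises]), next(counter), premises))

        for fact in list(known):
            queue_pairs(fact)
        for _ in range(budget):
            if not frontier:
                break
            _, _, premises = heapq.heappop(frontier)
            conclusion = rules[premises]
            if conclusion not in known:
                known.add(conclusion)
                queue_pairs(conclusion)
        return known

    # With a budget of 2, salience steers the reasoner to the "flee" conclusion;
    # a flat salience function would exhaust the budget before ever deriving it.
    rules = {frozenset(("dark", "alone")): "possible danger",
             frozenset(("possible danger", "noise")): "flee",
             frozenset(("dark", "noise")): "curiosity"}
    danger_first = lambda c: 1.0 if "danger" in c or c == "flee" else 0.1
    print(infer({"dark", "alone", "noise"}, rules, danger_first, budget=2))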

The computational limits of reason are a truly modern discovery that did not seem to be foreshadowed in the thought of the great 17th-century rationalists, and that even today is not taken seriously enough.

Gear lust

My current alpine touring setup consists of Garmont Adrenalin boots, 04/05 Rossignol B3 skis, and Fritschi Freeride bindings. The boots are the most comfortable and responsive I've ever had (including alpine boots), and the skis are excellent on powder and crud. But it is a heavy setup for long tours. To lighten up radically would require switching to Dynafit bindings and Dynafit-compatible boots. I actually have an old pair of Dynafit-compatible Scarpa Laser boots, but they are too soft for me on the descent and quite uncomfortable (big-blister uncomfortable) on long tours. (I don't have the time, equipment, or skill to mod them into effectiveness as Lou Dawson suggests.) The heavier Adrenalins are so much better that I'm willing to pay the weight premium for them and compatible bindings. The only place left to cut weight is the skis. The Rossignol B3s are heavy, fat alpine skis. So, I just put money down on a pair of Ski Trab Stelvio Freerides, which promise to be almost 1 kg lighter per pair than the B3s. I'll be picking them up, mounted with Fritschi Explores, at Bent Gate after my visit to UC Boulder this week.

Translation From PR-Speak to English of Selected Portions of Macrovision CEO Fred Amoroso’s Response to Steve Jobs’s ‘Thoughts on Music’

Translation From PR-Speak to English of Selected Portions of Macrovision CEO Fred Amoroso’s Response to Steve Jobs’s ‘Thoughts on Music’: Remember those squiggly lines when you tried copying a commercial VHS tape? You can thank us for that.

John Gruber really nails DRM-speak. DRM proponents will say anything to cover up the fundamental incompatibility between users' control of their devices and licensors' control of media. Anything that does the licensor's bidding is guaranteed to interfere with the general-purpose computing capabilities of the device. Trying to protect crypto secrets in insecure devices means opaque functionality, lack of interoperability, and restrictions on the device's generality that would undermine innovation and freedom of speech.

Thursday, February 15, 2007

That annoying natural world

Drivers stuck for full day on Pa. road: Eugene Coleman, who is hyperglycemic, was trapped for 20 hours while on his way home to Hartford, Conn. from visiting his terminally ill mother in Georgia, along with his girlfriend and pregnant daughter. "How could you operate a state like this? It's totally disgusting," Coleman said. "God forbid somebody gets really stuck on the highway and has a life-threatening emergency. That person would have died."

He set out from Georgia while all the weather forecasts on the Eastern seaboard warned of a very large storm. He chose to take an inland route. It was obvious on Wednesday that traveling on the highway around here would be a really bad idea for at least 36 hours. Schools were closed. Motorists were warned to avoid traveling. Coleman traveled anyway, and he and many others like him cost Pennsylvania significant expense in emergency services because they couldn't be bothered to change their travel plans. They needed a reminder that Nature still bites.

Back To The Future: NLP, Search, Google and Powerset

Back To The Future: NLP, Search, Google and Powerset: Battelle then goes on to speculate about how these capabilities might surface in the Google UI. The last sentence in the above quote seems so close - at least in terms of vision - to some of the current wave of NLP search debate that it provokes the question: what happened to this project? Did Google try and fail? If you read it closely, you'll see that Norvig is talking about some key NLP concepts:
  • Entities (typed concepts expressed in short spans of text, generally noun phrases)
  • Ontologies (Java IS_A programming language)
  • Relationships (between entities)
I mean - couldn't you build a next gen search engine on such wonderful ideas?

I have no privileged information on what any search engine is doing in this area. But I've been doing research on entity extraction, parsing, and some relation extraction for the last eight years. It is very hard to create general, robust entity extractors. The most accurate current methods depend on labeled training data, and do not transfer well to domains different from those in which they were trained. For instance, a gene/protein extractor that performs at a state-of-the-art level on biomedical abstracts does terribly on biomedical patents (we did the experiment). Methods that do not depend on labeled data are less domain-dependent, but do not do very well on anything.
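
The abstracts-versus-patents result above was a real experiment; the following sketch is only a toy illustration (invented sentences, a deliberately naive tagger) of the underlying failure mode: the cues a supervised extractor learns from one domain's labeled text, here the context words flanking gene mentions in abstract-like sentences, simply do not appear in patent-like text.

    # Toy illustration of poor domain transfer: invented "labeled" sentences and
    # a deliberately naive tagger that learns which neighboring words flank gene
    # mentions. Nothing here reproduces the real abstracts-vs-patents experiment.

    # Each item: (tokens, indices of tokens that are gene mentions).
    abstracts = [
        ("we show that BRCA1 regulates repair".split(), {3}),
        ("TP53 expression was suppressed in tumors".split(), {0}),
        ("the kinase ATM phosphorylates CHK2 in vitro".split(), {2, 4}),
    ]
    patents = [
        ("a composition comprising an inhibitor of KRAS".split(), {6}),
        ("said vector encoding EGFR operably linked to a promoter".split(), {3}),
    ]

    def train(labeled):
        """Collect the words seen immediately before or after a gene mention."""
        cues = set()
        for tokens, gold in labeled:
            for i in gold:
                if i > 0:
                    cues.add(tokens[i - 1])
                if i + 1 < len(tokens):
                    cues.add(tokens[i + 1])
        return cues

    def tag(tokens, cues):
        """Predict a mention wherever a learned cue word is adjacent."""
        return {i for i in range(len(tokens))
                if (i > 0 and tokens[i - 1] in cues)
                or (i + 1 < len(tokens) and tokens[i + 1] in cues)}

    def recall(labeled, cues):
        found = sum(len(tag(tokens, cues) & gold) for tokens, gold in labeled)
        return found / sum(len(gold) for _, gold in labeled)

    cues = train(abstracts)
    print("in-domain recall:    ", recall(abstracts, cues))  # high by construction
    print("out-of-domain recall:", recall(patents, cues))    # the cues don't carry over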

Matt has proposed before that redundancy — the same information presented many times — can be exploited to correct the mistakes of individual components. I noted that this argument ignores the long tail. We cannot recover from mistakes on entities that occur just a few times, but unfortunately those are often the most important entities, particularly in technical domains. The common entities and facts are known already; it's the long tail of rarer, less-known ones that could lead to a new hypothesis.
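
A quick simulation makes the long-tail point vivid (the numbers and the Zipf-like 1/rank law are assumptions for illustration, not measurements from any corpus): most of the distinct entities you observe show up only once or twice, so there is nothing for redundancy to aggregate over exactly where it is needed.

    import random
    from collections import Counter

    # Assumed toy setup: 50,000 distinct entities whose mention frequencies follow
    # a Zipf-like 1/rank law, and a corpus of 200,000 extracted mentions. The
    # numbers are invented; only the shape of the distribution matters here.
    random.seed(0)
    n_entities, n_mentions = 50_000, 200_000
    weights = [1.0 / rank for rank in range(1, n_entities + 1)]
    corpus = random.choices(range(n_entities), weights=weights, k=n_mentions)

    counts = Counter(corpus)
    rare = sum(1 for c in counts.values() if c <= 2)
    print(f"distinct entities observed: {len(counts)}")
    print(f"observed at most twice:     {rare} ({rare / len(counts):.0%})")
    # Well over half of the distinct entities observed appear only once or twice,
    # so cross-mention voting has nothing to aggregate for exactly those entities.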

I don't believe that there is a secret helicopter to lift us all over the long uphill slog of experimenting with incrementally better extraction methods, occasionally discovering a surprising nugget, such as a method for adapting extractors between domains more effectively. This is how speech recognition grew from a lab curiosity to a practically useful if still limited technology, and I see no reason why Matt's three bullets above should be any different.

I am enthusiastic about research in this area, because I've seen significant progress in the last ten years. But I'm not convinced that the current methods are nearly general enough for a "next gen search engine." Matt's three bullets are not yet "wonderful ideas" ready for deployment, just "promising research areas." We should not forget the "AI winter" of 20 years ago that followed much over-promising and under-delivering, and quite a bit of investor disappointment.

Wednesday, February 14, 2007

Music execs criticise DRM systems

Music execs criticise DRM systems: More people would buy digital music if hi-tech locks were removed, say music executives.

But they don't believe that DRM can be removed because that would go against current major label strategies. Time to roll out that apt definition attributed to Einstein (although it could easily have been in The Devil's Dictionary):

Insanity: doing the same thing over and over again and expecting different results.

Tuesday, February 13, 2007

The Time To Build NLP Applications

The Time To Build NLP Applications: [...] I am proposing that a more sophisticated search engine would be explicit about ambiguity (rather than let the user and documents figure this out for themselves) and would take information from many sources to resolve ambiguity, recognize ambiguity and synthesize results.

Sure, those are interesting research goals. But the paths to those goals are not even close to being mapped out. Recognizing ambiguity is an especially difficult problem, because we do not have robust methods for recognizing and quantifying what we don't know. Stupid mistakes, even bugs, get blended with subtle distinctions in a hodgepodge of candidate interpretations, most of which will appear baffling to users.

This is why I believe that tested NLP techniques will creep into search from the inside, building out from islands of confidence. I like an evolutionary developmental biology analogy: complex, more versatile species did not arise by wholesale redesign of old species, but by co-opting robust, highly conserved modules into new roles. Reading recommendation: The Plausibility of Life by Marc W. Kirschner and John C. Gerhart.

Monday, February 12, 2007

Music Sales Would Explode Without DRM

Music Sales Would Explode Without DRM: Without restrictions on entire catalogs, "Sales would explode," says David Pakman, CEO of eMusic, the No. 2 online music retailer behind Apple's iTunes. "DRM has been holding the market back." His company is the only legitimate digital music service selling unrestricted songs, in the MP3 format.

Although eMusic's subscription model does not fit my rather bursty music-buying habits well, I have great respect for their strong stance against DRM. Without DRM, there would be a lot more competition between online music stores, and between device makers. Stores and players could more easily compete on format quality once compression formats are decoupled from DRM. Of course, competition may be exactly what some of the players here do not want...

Command, Option, Control

Command, Option, Control: The race is on to see who can say the most jackassed thing regarding Steve Jobs's "Thoughts on Music".

It's both entertaining and irritating, but not surprising, how so many commentators let their dislike of Steve Jobs get in the way of their ability to recognize facts and rational arguments. No one has yet falsified Jobs's main point: music DRM is useless because we all can get music without DRM — and with better quality — from CDs.

Sunday, February 11, 2007

24

24: To commemorate the Twenty Fourth Annual International Conference on Machine Learning (ICML-07), the FOX Network has decided to launch a new spin-off series in prime time. Through unofficial sources, I have obtained the story arc for the first season, which appears frighteningly realistic.

I prefer the art-house version in which the hero is convinced by older, slower colleagues of the futility of conference deadlines and the subtle beauty of carefully baked theorems, and agrees to take the time to write a journal submission and enjoy the warmth of family life.

NLP and Search: Free Your Mind

NLP and Search: Free Your Mind:
One of the basic paradigms of text mining, and a simple though constraining architectural paradigm, is the one document at a time pipeline. A document comes in, the machinery turns, and results pop out. However, this is limiting. It fails to leverage redundancy - the great antidote to the illusion that perfection is required at every step.

This is a puzzling assertion. Search ranking techniques like TFIDF and PageRank work on the collection as a whole, and exploit redundancy by aggregating term occurrences and links (a small sketch of that aggregation follows the quotation below). Current text-mining pipelines already look at extracted elements as a whole for reference resolution, for instance. Everyone I know working on these questions is working hard to exploit redundancy as much as possible. However, I still believe what I wrote in a review paper seven years ago:

While language may be redundant with respect to any particular question, and a task-oriented learner may benefit greatly from that redundancy [...], it does not follow that language is redundant with respect to the set of all questions that a language user may need to decide.
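
To make concrete the collection-level aggregation that TFIDF-style ranking performs, here is a minimal scorer over a toy collection (my own toy example, not the ranking of any particular engine): a term's weight in a document depends on document frequencies gathered from the whole collection, not from one document at a time.

    import math
    from collections import Counter

    # Minimal TF-IDF over a toy collection: the IDF factor is a collection-level
    # statistic, so every document's score depends on aggregate term occurrences.
    docs = {
        "d1": "cats chase mice and mice hide".split(),
        "d2": "dogs chase cats".split(),
        "d3": "mice like cheese".split(),
    }

    df = Counter()                       # document frequency of each term
    for tokens in docs.values():
        df.update(set(tokens))
    n_docs = len(docs)

    def tfidf_score(query, tokens):
        """Sum of tf * idf over query terms; idf is aggregated over the collection."""
        tf = Counter(tokens)
        return sum(tf[t] * math.log(n_docs / df[t]) for t in query if t in df)

    query = "mice cheese".split()
    ranking = sorted(docs, key=lambda d: tfidf_score(query, docs[d]), reverse=True)
    print(ranking)   # d3 outranks d1: "cheese" is rarer in the collection than "mice"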

Matt then raises the critical issue of a system's confidence in its answers:

The key to cracking the problem open is the ability to measure, or estimate, the confidence in the results. With this in mind, given 10 different ways in which the same information is presented, one should simply pick the results which are associated with the most confident outcome - and possibly fix the other results in that light.

The fundamental question then is whether confidence can be reliably estimated when we are dealing with heavy-tailed distributions. However large the corpus, most questions have very few responsive answers, and estimating confidence from very few instances is problematic. In other words, redundancy is much less effective when you are dealing with very specific queries, which are an important fraction of all queries, and exactly those for which NLP would be most useful if it could be used reliably. This is also one of the reasons why speaker-independent large-vocabulary speech recognition with current methods is so hard: however big a corpus you have for language modeling, many of the events you care about do not occur often enough to yield reliable probabilities. Dictation systems work because they can adapt to the speech and content of a single user. But search engines have to respond to all comers.
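
To see why a handful of redundant observations gives so little to go on, here is a small illustration with invented counts, using the standard Wilson score interval for a proportion: the plausible range for an estimated accuracy stays embarrassingly wide until the counts get large, and in a heavy-tailed world they usually don't.

    import math

    # Wilson 95% confidence interval for a proportion: with only a handful of
    # observations (the typical situation in the long tail), the interval is so
    # wide that "pick the most confident of the redundant readings" has little
    # to go on. The counts below are invented for illustration.

    def wilson_interval(successes, n, z=1.96):
        p = successes / n
        denom = 1 + z * z / n
        center = (p + z * z / (2 * n)) / denom
        half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
        return center - half, center + half

    for successes, n in [(2, 2), (4, 5), (80, 100), (800, 1000)]:
        lo, hi = wilson_interval(successes, n)
        print(f"{successes}/{n} agreeing readings -> plausible accuracy in "
              f"[{lo:.2f}, {hi:.2f}]")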

And as for the issue of 'understanding every query' this is where the issue of what Barney calls a grunting pidgin language comes in. For example, I saw recently someone landing on this blog via the query - to Google - 'youtube data mining'. As the Google results page suggested, this cannot be 'understood' in the same way that a query like 'data mining over youtube data' can. Does the user want to find out about data mining over YouTube data, or a video on YouTube about data mining?

That's a cute example, but Matt forgets the more likely natural-language query, 'data mining in youtube', which is a nice grammatical noun phrase ambiguous in exactly the way he describes. Language users exploit much shared knowledge and context to understand such ambiguous queries. Even the most optimistic proponents of current NLP methods would be hard-pressed to argue that the query I suggested and its multitude of relatives can be disambiguated reliably by their methods. Sure, you could argue that users will learn to be more careful with their language as Matt suggests, but all the evidence from the long line of work on natural language interfaces to databases from the early 70s to the early 90s suggests that is not the case. Our knowledge of language is mostly implicit, and it is difficult even for professional linguists to identify all the possible analyses of a complex phrase, let alone all of its possible interpretations in context. That makes it difficult for a user of a language-interpretation system to figure out how to reformulate a query to coax the system toward the corner of semantic space they have in mind — if they can even articulate what they have in mind.

So what's to be done?

  • Shallow NLP methods can be effective in recognizing specific types of entities and relationships that can improve search. I mentioned an example from my work before, but a lot more is possible and will be exploited over the next few years. Global inference methods for disambiguation and reference resolution are starting to be quite promising.
  • In the medium term, there might be reliable ways to move from 'bags of words' to 'bags of features' that include contextual, syntactic and term distribution evidence. The rough analog here is BLAST, which allows you to search efficiently for biological sequences that approximately match a pattern of interest, except that the pattern would be a multidimensional representation of a query (see the sketch after this list).
  • There are many difficult longer-term research questions in this area, but underlying many of them is the single question of how to do reliable inference with heavy-tailed data. Somehow, we need to be able to look at the data at different scales so that rare events are aggregated into more frequent clusters as needed for particular questions; a single clustering granularity is not enough.
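
As a sketch of what the second bullet might look like (my own toy construction; the feature templates are invented and this is neither BLAST nor any existing engine), represent each query and candidate span as a sparse bag of features mixing surface words, crude syntactic guesses, and local context, and retrieve by approximate overlap rather than exact term match:

    import math
    from collections import Counter

    # Toy "bag of features": each text becomes a sparse vector mixing surface
    # words, a crude part-of-speech guess, and adjacent-word context features.
    # The feature templates here are invented purely for illustration.

    def features(text):
        tokens = text.lower().split()
        feats = Counter()
        for i, tok in enumerate(tokens):
            feats[f"word={tok}"] += 1
            guess = "NOUNISH" if tok.endswith(("ing", "tion", "s")) else "OTHER"
            feats[f"pos~{guess}"] += 1
            if i > 0:
                feats[f"prev={tokens[i - 1]}_cur={tok}"] += 1
        return feats

    def cosine(a, b):
        dot = sum(a[f] * b[f] for f in a if f in b)
        norm = math.sqrt(sum(v * v for v in a.values())) * \
               math.sqrt(sum(v * v for v in b.values()))
        return dot / norm if norm else 0.0

    spans = ["mining video data on youtube",
             "a youtube video about data mining",
             "gold mining in south africa"]
    query = features("data mining over youtube data")
    for span in sorted(spans, key=lambda s: cosine(query, features(s)), reverse=True):
        print(f"{cosine(query, features(span)):.2f}  {span}")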

Saturday, February 10, 2007

As a Mac user, I wish Microsoft...

As a Mac user, I wish Microsoft...: As a Mac user, I wish Microsoft would run an Apple-like ad about the process by which Mac users get service for broken hardware. It would be really hard for Apple to respond, because their system for dealing with broken hardware is itself horribly broken. They need some serious incentives to fix this.

I'm lucky that a local independent store is pretty good at quick turnaround on repairs. But that's still second best compared with the on-site services that you get with Lenovo or Dell extended warranties.

Father of MPEG Replies To Jobs On DRM

Father of MPEG Replies To Jobs On DRM: marco_marcelli writes with a link to the founder and chairman of MPEG, Leonardo Chiariglione, replying to Steve Jobs on DRM and TPM. After laying the groundwork by distinguishing DRM from digital rights protection, Chiariglione suggests we look to GSM as a model of how a fully open and standardized DRM stack enabled rapid worldwide adoption.

I cannot imagine that Leonardo Chiariglione is clueless, so I must conclude that he is being disingenuous. There's a huge difference between mobile phones and computers. GSM mobile phones are specialized devices with secrets stored in a (somewhat) tamper-resistant hardware module, the SIM. The networks that connect them are centralized and closed. GSM is open only in the sense that it is available to all phone manufacturers and wireless providers. It is not open to me or you, because the connection between the phone and the network is tightly controlled by the provider.

We know that certain self-interested and powerful parties are trying to move computers and data networks in that direction: trusted cores in computers, with DRM software layers (as in Vista), and "walled garden" networks instead of the open internet. The only problem is that those directions are incompatible with the general-purpose nature of computers, and the end-to-end principle in networking. General purpose and end-to-end have been the keys to the explosive growth of computing and networking, which as businesses now dwarf the media business. Not to mention all the other businesses that depend on the ability of end users to determine how they store, process, and transmit information without requiring anyone's permission. Tim Berners-Lee's argument for net neutrality can be paralleled by an argument for computing neutrality. Attempts to control the use of bits in everyone's computer necessarily get in the way of the freedom to write and run code and make our computers do our bidding rather than sneakily obey secret orders from remote parties. If the freedom to run the code of our choice and communicate is taken from us, we can expect significant erosion of the economic and social benefits of the information economy. I remember well the disastrous efforts of telecom companies to control their customers' access to information and services in the name of "security" and "predictability." While they wasted uncounted millions in building impoverished walled gardens, search and commerce took off on the open Internet. Openness allowed varied investment and experimentation. Many efforts failed, but those that succeeded could continue to build and innovate without asking anyone for permission. As we know from biology, evolution requires diversity.

A subtly related story is going around: wringing of hands about the lower number of students interested in computer science, and the supposed need for academic computer science to refocus on applications and the management of technology, away from the design and analysis of programs, which is supposedly no longer much needed. If computers and networks are closed, most will be reduced to managing and using services controlled by others; only a small priesthood will be allowed inside the DRM sanctum, maybe wearing white coats like the mainframe operators of yore.

Friday, February 9, 2007

Powerset In PARC Deal

Powerset In PARC Deal: VentureBeat (once a critic of Powerset, now more of a believer) covers the story. In summary, Powerset's technology is not some rushed-together start-up demo, but the result of many man-years of research and development at PARC. VentureBeat's post contains an interview around the topic of NLP with Google's Peter Norvig. While Norvig gives some insight into the work on NLP at Google, he doesn't mention one of their main areas of focus: machine translation (MT). It is interesting to learn of his caution in the area of NLP for search while they are tasking a number of scientists at MT, possibly the hardest AI problem known to man.

Some five years ago, I was talking with George Doddington at a Human Language Technology (HLT) meeting in Arden House about the difficulty of building accurate, robust natural-language relation extraction systems, even for limited sets of relations. I commented that the problem is that the input-output function implemented by such a system is not "naturally" observed. I meant that (almost) no one does relation extraction for a living and writes down the result. We all read text and take notes, and some of those notes are about relations expressed in the text. But there are no large collections pairing texts with all the relations they express, or even with all the relations from a prescribed set. Indeed, one might wonder whether the function from texts to sets of expressed relations "exists" at all, in the sense that it is not obvious that people ever perform such a function in their heads, except as an evanescent step in their interpretation of and response to language.

In contrast, parallel corpora consisting of a text and its translation are widely available, simply because translation is something that is done anyway for a practical purpose, independently of any machine translation effort. Similarly, parallel corpora of spoken language and its transcription are created for a variety of practical purposes, from helping the deaf to court records. That is, the input-output function implemented by a machine translation or speech recognition system is explicitly available to us by example, and those examples can be used to validate proposed implementations or to train implementations with machine learning methods.

Anybody who has ever been involved in efforts to annotate text with syntactic or semantic information for use as "ground truth" in NLP work knows how difficult it is to get consistent, accurate results even from skilled human annotators. The problem is that those annotations are not "natural." They are theoretical constructs. Annotators need to be instructed in annotation with instruction manuals many pages long. Even then, many unforeseen situations arise in which reasonable annotators can differ greatly. That's one reason why it has been so difficult to develop usable relation extraction systems. If people can only agree on extracted relations 70% of the time, how can we expect people and programs to agree more often? A short paragraph may specify some 20 relations, and 0.7^20 ≈ 0.0008, a vanishingly small chance of getting them all correct. That's not because people left to their own devices cannot agree on the meaning of the paragraph, but because the annotation procedure cannot be specified precisely enough to be reproduced faithfully by multiple annotators. If it could, then we could as well use it as the specification for a program. The same problem arises in natural language search: there are no extensive records of the intended input-output function.
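
The arithmetic behind that pessimism is worth writing down explicitly (70% is the rough pairwise agreement figure quoted above; the independence across relations is my simplifying, optimistic assumption):

    # If annotators agree on a single relation about 70% of the time, and a short
    # paragraph expresses about 20 relations, then under a simplifying (and
    # optimistic) independence assumption the chance of full agreement on the
    # whole paragraph is tiny.
    agreement_per_relation = 0.7
    relations_per_paragraph = 20
    print(agreement_per_relation ** relations_per_paragraph)   # ~0.0008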

PARC's methods of language analysis and interpretation are the best I know of in their class. But they suffer from the critical limitation that their output is a theoretical construct and not something observable. Without a plentiful supply of input-output examples, it is extremely difficult to judge the accuracy of the method across a broad range of inputs, let alone use machine learning methods to learn how to choose among possible outputs.

The situation is different for keyword-based search, because the input-output function is in a sense trivial: return all the documents containing query terms. The only, very important, subtlety is the order in which to return documents, so that those most relevant are returned first. Relevance judgments are relatively easy to elicit in bulk, compared with trying to figure out whether an entity or relation is an appropriate answer to a natural language question across a wide range of domains.
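
This is what I mean by the keyword input-output function being trivial to state: a few lines specify it exactly (a toy unranked index below; ranking is where all the real work goes), whereas nothing comparably compact specifies "all the relations expressed in this text."

    from collections import defaultdict

    # A toy inverted index: the input-output function of keyword search is easy
    # to state exactly. The hard (and well-studied) part is the ordering of
    # results, which is where relevance judgments come in.
    docs = {
        "d1": "jobs writes about music and drm",
        "d2": "a post about machine translation",
        "d3": "statistical machine translation needs parallel text",
    }

    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():
            index[term].add(doc_id)

    def search(query):
        """Return the documents containing every query term (unranked)."""
        terms = query.split()
        if not terms:
            return set()
        return set.intersection(*(index[t] for t in terms))

    print(search("machine translation"))   # both d2 and d3 contain every query term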

The availability of large "natural" parallel corpora has enabled the relatively fast progress of statistical speech recognition and statistical machine translation over the last 20 years. Those systems are still pretty bad by human standards, but they are often usable. Machine translation is "possibly the hardest AI problem known to man" only if you expect human performance.

So, is natural language search impossible? No, I think there are ways to proceed. The critical question is whether we can find plentiful "natural" correspondences between questions and answers that we can use to bootstrap systems. There are some interesting ideas around on possible sources, and I expect more to emerge over the next few years.

Spring Break


Here's where I'm booked to go for Spring Break. Getting to fresh turns under our own power, March 4-11. Great conditions at the moment. When I booked last weekend, they still had places (hint, hint). I need to pack very light (25 pounds + skis) for the heli rides in and out. I'd like to bring a single malt along for a fireside sip, but the airlines don't like liquids. Oh well, water will have to do.

searching

searching: A NY Times piece on Powerset with quotes from Fernando... This seems like a very difficult nut to crack.

Miguel Helft wrote a clear and balanced piece, which was not easy given the complexity of the issues. I talked with him for over half an hour. He chose representative quotes from what I said, but it's of course impossible to go into details within the length limits of the daily press. I wrote about the issues in previous postings, and I'm writing a new posting that works out some of the arguments more fully.

Thursday, February 8, 2007

RIAA suggests Apple open up FairPlay

RIAA suggests Apple open up FairPlay:
Recording Industry Association of America (RIAA) chairman and chief executive Mitch Bainwol has rejected Apple CEO Steve Jobs' plea to remove the requirement of digital rights management (DRM) from digital music, suggesting instead that the Cupertino-based company open up its FairPlay DRM to competitors.
[...] Bainwol argues that the move would allow more consumers to make use of the iTunes Store to play tracks on portable players other than the iPod. "We have no doubt that a technology company as sophisticated and smart as Apple could work with the music community to make that happen," the chairman said.

And while Apple is working on it, why don't they also invent faster-than-light flight and perpetual motion? Did the RIAA forget SDMI, or are they conveniently omitting that "open" DRM debacle?

Saturday, February 3, 2007

Why NLP Is A Disruptive Force

Why NLP Is A Disruptive Force:
Fernando is still skeptical about the potential of NLP to play a major role in search.
I'm not skeptical about the potential of NLP. I'm skeptical about the approaches I'm reading and hearing about.
I may be putting words in Fernando's mouth, but I believe the reason he states this is because he is assessing its impact against the standard search interaction (type words in a box, get a list of URLs back). This is missing the point.
I'm not making that assumption. My group at AT&T Labs built one of the earliest Web question-answering systems, back in 1999, which identified interesting entities that might be answers to typed queries, as well as URLs for the pages that contained mentions of those entities. I understand the potential of answering queries with entities and relationships derived not only from text but also from structured data. That's why, for example, I have contributed to the Fable system, which sorts through the genes and proteins mentioned in MEDLINE abstracts responsive to a search query, linking them to standard gene and protein databases.
When one is dealing with language, one is dealing at a higher level of abstraction. Rather than sequences of characters (or tokens - what we might rudely refer to as words) we are dealing with logical symbols. Rather than the primary relationships being before and after (as in this word is before that word) we can capture relationships that are either grammatical (pretty interesting) or semantic (extremely interesting). With this ability to transform simple text into logical representations one has (had to) resolve a lot of ambiguity.
That's exactly where we differ. We do not yet have “this ability to transform simple text into logical representations.” To the extent that our methods reach for that ability, they are very fragile and narrow. Current methods that are robust and general can rely only on shallow distributional properties of words and texts, barely beyond bag-of-words search methods. That's why I believe that NLP successes will arise bottom-up, by improving current search where it can be improved (word-sense disambiguation, recognition of multiword terms, term normalization, recognition of simple entity attributes and relations), and not top-down with a total replacement of current search methods.
I'm claiming that changes to the back end will enable fundamental changes to how 'results' are served.
The Fable system I mentioned is a simple example of exactly that. A back-end entity tagger and normalizer recognizes mentions of genes and proteins. This is nontrivial because the same gene or protein may be mentioned using a variety of terms and abbreviations. We can aggregate the normalized mentions to rank genes/proteins according to their co-occurrence with search terms, and we can link the mentions to gene and protein databases. We are working to improve the system to support richer queries that find the genes and proteins involved in important biological relationships and processes. However, I do not believe that these particular methods generalize robustly and cost-effectively to more general search. In particular, some of the methods we use rely on supervised machine learning, which requires costly manual annotation of training corpora. That will not scale to the Web in general.
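
To give a feel for the kind of back-end processing involved, here is a stripped-down sketch of the normalize-then-aggregate idea (toy synonym table and invented abstracts; not Fable's actual code, lexicon, or scoring): surface mentions are mapped to canonical gene identifiers, and the normalized mentions are then ranked by co-occurrence with the query terms.

    from collections import Counter

    # Toy synonym table mapping surface mentions to canonical gene identifiers.
    # The entries and abstracts below are invented; a real normalizer has to cope
    # with a far larger and messier space of aliases and abbreviations.
    synonyms = {
        "p53": "TP53", "tp53": "TP53", "tumor protein 53": "TP53",
        "brca1": "BRCA1", "breast cancer 1": "BRCA1",
        "atm": "ATM",
    }

    abstracts = [
        "mutations in p53 are frequent in breast cancer",
        "breast cancer 1 interacts with tumor protein 53 in dna repair",
        "atm signalling in lymphoma",
    ]

    def normalize_mentions(text):
        """Return the canonical identifiers for any known alias in the text."""
        lowered = text.lower()
        return {canonical for surface, canonical in synonyms.items()
                if surface in lowered}

    def rank_genes(query_terms):
        """Count, per gene, the abstracts that mention it and contain all query terms."""
        counts = Counter()
        for text in abstracts:
            if all(term in text.lower() for term in query_terms):
                counts.update(normalize_mentions(text))
        return counts.most_common()

    print(rank_genes(["breast", "cancer"]))   # TP53 and BRCA1 co-occur with the query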

The situation gets much more complicated with “relationships that are either grammatical (pretty interesting) or semantic (extremely interesting).” Current large-coverage parsers make many mistakes (typically one per 10 words in newswire, one per 7 words in biomedical text). It gets much worse for semantic relationships. The best co-reference accuracy I know about is around 80%, that is, one in five references is made to the wrong antecedent. These mistakes compound. That is, the proposed back-end would create a lot of trash, which would be difficult to separate from the good stuff. We can do interesting research on these issues, but we are very far from generic, robust, scalable solutions.
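
The compounding works against you quickly. A back-of-envelope calculation using the error rates just quoted, an assumed three references per sentence, and an optimistic independence assumption:

    # Rough compounding of pipeline errors, using the rates quoted above and the
    # optimistic assumption that errors are independent across words and mentions.
    parse_error_per_word = 1 / 10          # newswire-style parsing
    coref_accuracy = 0.80                  # one in five references mis-resolved
    words_per_sentence = 20                # assumed
    mentions_per_sentence = 3              # assumed

    clean_parse = (1 - parse_error_per_word) ** words_per_sentence
    clean_coref = coref_accuracy ** mentions_per_sentence
    print(f"sentence parsed without error:         {clean_parse:.2f}")   # ~0.12
    print(f"all its references resolved correctly: {clean_coref:.2f}")   # ~0.51
    print(f"both:                                  {clean_parse * clean_coref:.2f}")
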
If you change the game (e.g. by changing the way results are provided) then the notion of quality has been disrupted. I'm not sure what the costs are that Fernando is referring to. CPU (e.g. time to process all content)? Response time?
Users have expectations of quality for current search. A “disruptive” search engine that does not meet those expectations while offering something more will be in trouble. As for costs, putting NLP algorithms in the back-end requires a lot more time to process the content and a lot more storage than current indexing methods. In addition, mistakes in back-end processing will be costly: if your parser or semantic interpreter messes up — and they all do, often — you have to reprocess a huge amount of material, which is not easy to do while trying to keep up with the growth of the Web. All of these costs have to be paid for with improved advertising revenue, or some new revenue source. I don't have detailed data to come up with actual estimates, but I would be very surprised if the costs of a search engine with an NLP back-end were less than twice those of a state-of-the-art non-NLP search engine. And that's probably lowballing it a lot.

Thursday, February 1, 2007

Ask Innovates Search UI

Ask Innovates Search UI:...There are two fundamental areas of innovation in search - the back end, which Powerset is disrupting with its NLP technology, and the UI which companies like Ask are disrupting.

I wish my friends at Powerset the best, but I fear that this claim of disruption is premature, unless you measure disruption by press attention. I believe that NLP is more likely to improve search incrementally, by stealth, than through one fell disruptive swoop. As I noted before, new search technology has to be cost-effective to succeed, and I know of no evidence of that for broad-coverage NLP methods. Unlike, say, financial calculations or personnel database searches, where accuracy is not optional, search quality can be and has to be traded off against search cost.