2005 : WHAT DO YOU BELIEVE IS TRUE EVEN THOUGH YOU CANNOT PROVE IT?

Marti Hearst
Computer Scientist, UC Berkeley, School of Information; Author, Search User Interfaces

The Search Problem is solvable.

Advances in computational linguistics and user interface design will eventually enable people to find answers to any question they have, so long as the answer is encoded in textual form and stored in a publicly accessible location. Advances in reasoning systems will, to a limited degree, make it possible to draw inferences and so find answers that are not explicitly present in the existing documents.

Several recent developments prompt me to make this claim. First, computational linguistics (also known as natural language processing or language engineering) has made great leaps forward in the last decade, due primarily to the availability of huge text collections from which statistics can be derived. Today's automatic language translation systems, for example, are derived almost entirely from statistical patterns extracted from text collections. They now work as well as hand-engineered systems and promise to keep improving. As another example, recent government-sponsored research in the area of (simple) question answering has produced a radical leap forward in the quality of results in this arena.

Of course, another important development is the rise of the Web and its most voracious consumer, the internet search engine. It is common knowledge that search engines make use of information associated with link structure to improve results rankings. But search engine companies also have enormous, albeit somewhat impoverished, repositories of information about how people ask for information. This behavioral information can be used to build better search tools. For example, some spelling correction algorithms make use of how people have corrected erroneous spellings in the past, by observing pairs of queries that occur one after the next. The second query is assumed to be the correction, if it is sufficiently similar to the first. Patterns are then derived that convert from different types of misspellings to their corrections.
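The query-pair idea above can be sketched in a few lines. This is only an illustration under stated assumptions, not how any particular search engine does it: the session format (a list of query strings in the order they were typed), the function names `edit_distance`, `mine_corrections`, and `best_correction`, and the similarity threshold `max_distance=2` are all hypothetical. Plain Levenshtein distance stands in for "sufficiently similar", and the sketch stops at counting observed pairs rather than generalizing them into misspelling patterns.

```python
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def mine_corrections(sessions, max_distance=2):
    """Count consecutive query pairs that are similar enough to be a respelling.

    `sessions` is assumed to be a list of per-user query lists, in typing order.
    """
    pairs = Counter()
    for queries in sessions:
        for first, second in zip(queries, queries[1:]):
            if first != second and edit_distance(first, second) <= max_distance:
                pairs[(first, second)] += 1
    return pairs


def best_correction(pairs, query):
    """Return the most frequently observed rewrite of `query`, or None."""
    candidates = {second: n for (first, second), n in pairs.items() if first == query}
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


if __name__ == "__main__":
    # Toy log: two users fix "recieve" the same way, one retypes it differently,
    # and one simply changes topic (too dissimilar to count as a correction).
    log = [
        ["recieve", "receive"],
        ["recieve", "receive"],
        ["recieve", "recipe"],
        ["weather", "news"],
    ]
    print(best_correction(mine_corrections(log), "recieve"))
```

Counting how often a rewrite recurs is what keeps a single unrelated follow-up query from being mistaken for a correction; a real system would also need far more data and a better similarity measure than raw edit distance.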

Another development in the field of computational linguistics is the manual creation of enormous lexical ontologies, which are then used to build axioms and rules about language use. These modern ontologies, unlike their predecessors, are of a large enough scale and simple enough design to be useful, although this work is in the early stages. There are also many attempts to build such ontologies automatically from large text collections; the most promising approach seems to be to combine the automated and the manual approaches.

As a side note, I am skeptical about the hype surrounding the Semantic Web—it is very difficult to characterize concepts in a systematic way, and even more so to force all the world's creators of information to conform to one schema. Automated analysis tools adapt to what people really do, rather than try to force people's expressions of information to conform to a standard.

Finally, advances in user interface design are key to producing better search results. The search field has learned an enormous amount in the ten years since the Web became a major presence in society, but as is often noted in the field, the interface itself hasn't changed much: after all this time, we still type words into a blank box and then select from a list of results. Experience shows that a search interface has to be a qualitative leap better than the standard in order to entice people to switch. I believe headway will be made in this area, most likely occurring in tandem with advances in natural language analysis.

It may well be the case that advances in audio, image, and video processing will keep pace with those of language analysis, making it possible to answer questions whose answers are stored in graphical and audio form. However, my expertise does not extend to these fields, so I will not make a claim about this.