Computer Scientist, UC Berkeley, School of Information; Author, Search User Interfaces
Computational Analysis of Language Requires Understanding Language

To me, having my worldview entirely altered is among the most fun parts of science. One mind-altering event occurred during graduate school. I was studying the field of Artificial Intelligence with a focus on Natural Language Processing. At that time there were intense arguments amongst computer scientists, psychologists, and philosophers about how to represent concepts and knowledge in computers, and whether those representations reflected in any realistic way how people represented knowledge. Most researchers thought that language and concepts should be represented in a diffuse manner, distributed across myriad brain cells in a complex network. But some researchers talked about the existence of a "grandmother cell," meaning that one neuron in the brain (or perhaps a concentrated group of neurons) was entirely responsible for representing the concept of, say, your grandmother. I thought this latter view was hogwash.

But one day in the early 90's I heard a story on National Public Radio about children who had Wernicke's aphasia, meaning that a particular region in their brains was damaged. This damage left the children with the ability to form complicated sentences with correct grammatical structure and natural sounding rhythms, but with content that was entirely meaningless. This story was a revelation to me -- it seemed like irrefutable proof that different aspects of language were located in distinct regions of the brain, and that therefore perhaps the grandmother cell could exist. (Steven Pinker subsequently wrote his masterpiece, "The Language Instinct," on this topic.)

Shortly after this, the field of Natural Language Processing was radically changed by an entirely new approach. As I mentioned above, in the early 90's most researchers were introspecting about language use and were trying to hand-code knowledge into computers. So people would enter data like "when you go to a restaurant, someone shows you to a table. You and your dining partners sit on chairs at your selected table. A waiter or waitress walks up to you and hands you a menu. You read the menu and eventually the waiter comes back and asks for your order. The waiter takes this information back to the kitchen." And so on, in painstaking detail.

But as large volumes of text started to become available online, people started developing algorithms to solve seemingly difficult natural language processing problems using very simple techniques. For example, how hard is it to write a program that can tell which language a stretch of text is written in? Sibun and Reynar found that all you need to do is record how often pairs of characters tend to co-occur in each language, and you only need to extract about a sentence from a piece of text to classify it with 99% accuracy into one of 18 languages! Another wild example is that of author identification. Back in the early 60's, Mosteller and Wallace showed that they could identify which of the disputed Federalist Papers were written by Hamilton vs. those written by Madison, simply by looking at counts of the function words (small structural words like "by", "from", and "to") that each author used.
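
To make the character-pair idea concrete, here is a minimal sketch in Python. It is an illustration only, not Sibun and Reynar's actual method: it assumes an add-one-smoothed bigram log-likelihood as the scoring rule, and the two tiny training texts and all function names are made up for the example. A real system would train on far more text per language.

```python
import math
from collections import Counter

def bigram_counts(text):
    """Count adjacent character pairs (including spaces) in lowercased text."""
    text = text.lower()
    return Counter(text[i:i + 2] for i in range(len(text) - 1))

def train(samples):
    """Build one bigram-count table per language from {language: text}."""
    return {lang: bigram_counts(text) for lang, text in samples.items()}

def log_likelihood(model, text, vocab_size):
    """Add-one-smoothed log-probability of the text's bigrams under a model.

    The smoothing choice is an assumption of this sketch.
    """
    total = sum(model.values())
    return sum(
        n * math.log((model[bg] + 1) / (total + vocab_size))
        for bg, n in bigram_counts(text).items()
    )

def classify(models, text):
    """Return the language whose bigram statistics best explain the text."""
    vocab_size = len(set().union(*models.values())) + 1  # +1 for unseen pairs
    return max(models, key=lambda lang: log_likelihood(models[lang], text, vocab_size))

# Toy training data, invented for illustration (not a real corpus).
samples = {
    "english": "the quick brown fox jumps over the lazy dog and the cat "
               "sleeps in the warm house while the rain falls on the street",
    "spanish": "el zorro salta sobre el perro perezoso y el gato duerme "
               "en la casa caliente mientras la lluvia cae en la calle",
}
models = train(samples)

english_guess = classify(models, "the dog sleeps in the house")
spanish_guess = classify(models, "el gato duerme en la casa")
print(english_guess, spanish_guess)
```

Even with a sentence or two of training text per language, pairs such as "th" and "he" or "el" and "la" separate the two toy languages; the striking part of the original result is how well this scales to many languages.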

The field as a whole is chipping away at the hard problems of natural language processing by using statistics derived from that mother-of-all-text-corpora, the Web. For example, how do you write a program to figure out the difference between a "student protest" and a "war protest"? The former is a demonstration against something, done by students, but the latter is not a demonstration done by a war.

In the old days, we would try to code all the information we could about the words in the noun compounds and try to anticipate how they interact. But today we use statistics drawn from counts of simple patterns on the web. Recently my PhD student Preslav Nakov has shown that we can often determine what the intended relationship between two nouns is by simply counting the verbs that fall between the two nouns, if we first reverse their order. So if we search the web for patterns like:

"protests that are * by students"

we find that the important verbs are "draw, involve, galvanize, affect, carried out by" and so on, whereas for "war protests" we find verbs such as "spread by, catalyzed by, precede", and so on.
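
The counting step can be sketched in a few lines of Python. This is a hedged illustration, not Nakov's implementation: the snippets below are invented stand-ins for web search results, the function name `count_fillers` is hypothetical, and the regular expression handles only the one pattern shown above, with a filler of one or two words.

```python
import re
from collections import Counter

def count_fillers(snippets, head, modifier):
    """Count what fills the '*' in '<head> that are * by <modifier>'."""
    pattern = re.compile(
        rf"\b{re.escape(head)} that are (\w+(?: \w+)?) by {re.escape(modifier)}\b",
        re.IGNORECASE,
    )
    counts = Counter()
    for snippet in snippets:
        for match in pattern.finditer(snippet):
            counts[match.group(1).lower()] += 1
    return counts

# Made-up snippets standing in for text returned by a web search.
snippets = [
    "Protests that are led by students often start on campus.",
    "The city saw protests that are organized by students.",
    "There were protests that are led by students in several towns.",
    "Protests that are carried out by students and workers drew crowds.",
    "Protests about the war continued through the spring.",
]

counts = count_fillers(snippets, "protests", "students")
print(counts.most_common())
```

The most frequent fillers then serve as evidence for the relationship between the two nouns; running the same count with "war" in place of "students" would give a different verb profile.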

The lesson we see over and over again is that simple statistics computed over very large text collections can do better at difficult language processing tasks than more complex, elaborate algorithms.