Natural Language Processing (NLP)

intelligently processing natural languages, like English or Chinese or German, is one of the biggest areas of AI research

  • NLP is studied by many fields, including linguistics, mathematics, computer science, psychology, and cognitive science

a language is essentially a set of strings, plus rules, called a grammar, for which strings are legal

languages also have semantics, i.e. meaning

for example, the English sentence “Mario is hungry.” is grammatically well-formed, but “Mario hungry is.” is ungrammatical

the semantics of “Mario is hungry” is that it’s an assertion that some particular person named Mario is hungry

  • the assertion may be true or false, of course

note that the ungrammatical sentence “Mario hungry is.” seems to have the same meaning

  • it is possible to have grammatical sentences without meaning, a famous example being “Colorless green ideas sleep furiously.”

nowadays, a lot of natural language processing is statistical in nature, i.e. it uses statistical and probabilistic algorithms to learn rules for solving language problems

  • such techniques don’t have a deep understanding of language in the way that humans understand language, yet nonetheless these techniques have proven extremely successful in solving many language problems

Language Models: N-grams

a useful way to model a language for some applications is as a sequence of words (or characters, or syllables, or sentences — different applications might best be handled with different tokens)

for example, suppose you have a sentence consisting of 5 words: <w1, w2, w3, w4, w5>

  • e.g. “Fonzi ate ten red apples” is a five-word sentence, where w1=Fonzi, w2=ate, w3=ten, etc.

the unigrams of the sentence are just the individual words: w1, w2, w3, w4, w5

the bigrams of the sentence are the pairs of adjacent words: (w1,w2), (w2,w3), (w3,w4), (w4,w5)

the trigrams of the sentence are the triples of 3 adjacent words: (w1,w2,w3), (w2,w3,w4), (w3,w4,w5)

in general, an n-gram is a sequence of n consecutive words from a text
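
to make the definitions concrete, here is a small Python sketch (the helper name `ngrams` is just for this example, not from any particular library) that extracts the word n-grams of a sentence:

```python
def ngrams(words, n):
    """Return the list of n-grams (tuples of n consecutive words) in words."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "Fonzi ate ten red apples".split()
print(ngrams(sentence, 1))  # unigrams: ('Fonzi',), ('ate',), ...
print(ngrams(sentence, 2))  # bigrams: ('Fonzi', 'ate'), ('ate', 'ten'), ...
print(ngrams(sentence, 3))  # trigrams: ('Fonzi', 'ate', 'ten'), ...
```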

given a large corpus of text, i.e. a large collection of documents, we can count n-grams and get useful information about the probability of particular n-grams

for example, suppose we have counted all the trigrams for a large collection of news articles from the web

given two words followed by a blank, such as “Simon Fraser ___”, we can make a good guess about the most likely way to fill in the blank by looking at all the trigrams that start with “Simon Fraser” and choosing the word that most frequently comes next
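
a rough sketch of how that blank-filling could work, assuming the corpus is just a list of tokenized sentences (the data and function names here are illustrative only):

```python
from collections import Counter, defaultdict

def train_trigrams(corpus):
    """Count, for each pair of adjacent words, how often each third word follows it."""
    followers = defaultdict(Counter)
    for sentence in corpus:
        for w1, w2, w3 in zip(sentence, sentence[1:], sentence[2:]):
            followers[(w1, w2)][w3] += 1
    return followers

def fill_blank(followers, w1, w2):
    """Guess the most likely word to follow 'w1 w2', or None if the pair was never seen."""
    counts = followers.get((w1, w2))
    return counts.most_common(1)[0][0] if counts else None

corpus = [["Simon", "Fraser", "University", "is", "in", "Burnaby"],
          ["Simon", "Fraser", "University", "has", "three", "campuses"],
          ["Simon", "Fraser", "was", "an", "explorer"]]
print(fill_blank(train_trigrams(corpus), "Simon", "Fraser"))  # University
```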

n-grams can also be done at the level of characters, which is useful for applications such as:

  • language identification, i.e. determining what language a text is written in
  • spelling correction
  • genre identification, i.e. determining if a text is a news story, a scientific publication, a novel, etc.
  • named entity recognition, i.e. determining the names of things (like people, countries, businesses) in a text
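
character n-grams are counted the same way as word n-grams, just over characters; as a very rough illustration of the language-identification idea, one could compare a text’s character trigrams against per-language profiles (the toy profiles and scoring rule below are assumptions for the sketch, not a method from these notes):

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Count the character n-grams of a text (including spaces)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def score(text, profile, n=3):
    """Score a text against a language profile: total count of shared character n-grams."""
    grams = char_ngrams(text.lower(), n)
    return sum(count for gram, count in grams.items() if gram in profile)

# toy "profiles" built from tiny samples -- a real system would build them from large corpora
english = set(char_ngrams("the quick brown fox jumps over the lazy dog".lower()))
german = set(char_ngrams("der schnelle braune fuchs springt über den faulen hund".lower()))

text = "the dog sleeps"
guess = "English" if score(text, english) >= score(text, german) else "German"
print(guess)
```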

Text Classification

this is the problem of labeling a text as being in one of a given set of pre-defined categories

for example, spam detection labels email as either “ham” or “spam”

for example, a marker gives an English essay a grade of “A+”, “A”, “B”, “C+”, etc.

n-grams can be useful in spam detection

  • typically, a spam detector is trained on both ham and spam messages, so that it can create a language model for Prob(Message|Spam) and another language model for Prob(Message|Ham)

  • then given a new message, Bayes rule is used to estimate the probability of spam/ham:

    \[\underset{c\,\in\,\{\textrm{ham},\,\textrm{spam}\}}{\textrm{argmax}}\ P(c \mid \textrm{Message}) = \underset{c\,\in\,\{\textrm{ham},\,\textrm{spam}\}}{\textrm{argmax}}\ P(\textrm{Message} \mid c)\,P(c)\]
    

in addition to word-based n-grams, spam filters may also use character-based n-grams so they can catch intentional misspellings like “f,r,e,e!”
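
a minimal sketch of the Bayes-rule classification above, using a unigram (1-gram) language model for each class; the add-one smoothing and log-probabilities are standard implementation tricks, and all names here are illustrative:

```python
import math
from collections import Counter

def train(messages):
    """Count word occurrences over a list of tokenized messages."""
    counts = Counter()
    for msg in messages:
        counts.update(msg)
    return counts

def log_prob(msg, counts, vocab_size):
    """log P(Message | class) under a unigram model with add-one smoothing."""
    total = sum(counts.values())
    return sum(math.log((counts[w] + 1) / (total + vocab_size)) for w in msg)

def classify(msg, spam_counts, ham_counts, p_spam=0.5):
    """Return the class with the higher P(Message | c) * P(c)."""
    vocab = len(set(spam_counts) | set(ham_counts))
    spam_score = log_prob(msg, spam_counts, vocab) + math.log(p_spam)
    ham_score = log_prob(msg, ham_counts, vocab) + math.log(1 - p_spam)
    return "spam" if spam_score > ham_score else "ham"

spam_counts = train([["free", "money", "now"], ["win", "free", "prize"]])
ham_counts = train([["meeting", "at", "noon"], ["see", "you", "at", "lunch"]])
print(classify(["free", "prize", "now"], spam_counts, ham_counts))  # spam
```

in practice the prior P(spam) would be estimated from the proportion of spam in the training data rather than being fixed at 0.5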

another interesting idea for text classification mentioned in the textbook is to use compression

  • text compression algorithms, like zip, essentially find patterns in text and replace those patterns with shorter descriptions
  • you can detect spam using compression as follows:
    • append all your training spam messages into a text file and compress it
    • append all your training ham messages into a text file and compress it
    • when you get a new message M, append it to both the spam and ham messages, and compress the result — whichever compresses to a smaller file is then used to classify M, e.g. if the spam messages compress smaller, then M is classified as spam
    • the intuition is that M will compress well with a set of other messages if M shares similar patterns with them (see the sketch after this list)
    • experiments with off-the-shelf compression tools show that they can do about as well as n-gram classification, but they’re not as fast
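
here is a rough sketch of the compression idea using Python’s built-in zlib (the original experiments used off-the-shelf tools like zip; comparing how much each pile *grows*, rather than the raw compressed sizes, is a small refinement so that training sets of different sizes can be compared fairly):

```python
import zlib

def compressed_size(text):
    """Number of bytes the text compresses to."""
    return len(zlib.compress(text.encode("utf-8")))

def classify_by_compression(message, spam_training_text, ham_training_text):
    """Classify a message by which training pile it compresses better with."""
    spam_growth = compressed_size(spam_training_text + message) - compressed_size(spam_training_text)
    ham_growth = compressed_size(ham_training_text + message) - compressed_size(ham_training_text)
    return "spam" if spam_growth < ham_growth else "ham"

spam_text = "win free money now! claim your free prize today! free offer!"
ham_text = "the meeting is at noon. see you at lunch. notes from class attached."
print(classify_by_compression("free prize! claim now!", spam_text, ham_text))  # spam
```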

Information Retrieval

information retrieval (IR) is the task of finding documents relevant to a user's needs

these days, IR is dominated by the web, but it turns out there is a long history of IR research before the web

many early IR systems used boolean keyword searching, i.e. you could write logical queries like (“farm” and (“BC” or “Alberta”))

  • boolean keyword searches can be powerful, but many users found it tricky to write the correct queries with them

most IR systems do queries on words using word count statistics

the idea is that the query is a set of words like “farming in Kansas”, and for every document in the database, a score is generated for how good a match that document is for the query

  • many IR systems do some pre-processing on queries, e.g. they might fix spelling errors, or change words like “farming” into a standard stem form such as “farm” (a toy stemming sketch follows this list)
    • “farms”, “farming”, “farmed” might all be changed into “farm”
  • the highest scoring documents are the ones that are returned
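
a toy suffix-stripping stemmer along these lines might look like the following; real IR systems use something more careful, such as the Porter stemmer:

```python
def crude_stem(word):
    """A toy suffix-stripping stemmer -- illustration only."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

print([crude_stem(w) for w in ["farms", "farming", "farmed", "farm"]])
# ['farm', 'farm', 'farm', 'farm']
```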

for example, the BM25 scoring function scores a query q against a document d by taking into account three things:

  • the frequency with which query terms appear in the document; this is called TF, for term frequency; e.g. if the query is “farming in Kansas”, then a document in which “farming” appears many times is considered to score more highly

  • the inverse document frequency, denoted IDF is a measure of how frequently a term appears in all documents, with a lower IDF meaning it occurs very frequently and so is likely not to be as informative as a term that occurs in fewer documents; e.g. the term “in” is a common word that occurs in many documents, while “farming” is less common; in BM25, IDF is defined as follows:

    \[\textrm{IDF}(q_i) = \log \frac{N - \textrm{DF}(q_i) + 0.5}{\textrm{DF}(q_i) + 0.5}\]

    here, \(N\) is the total number of documents in the database, and \(\textrm{DF}(q_i)\) is the number of documents that contain \(q_i\)

  • shorter documents that match the query are preferred over longer documents; the intuition is that, say, a million word document is likely to match lots of queries and not be terribly relevant, while a short document that matches the query is more likely to be relevant

here is the final formula:

\[\textrm{BM25}(d_j,q_{1:N})=\sum_{i=1}^{N}\textrm{IDF}(q_i)\cdot \frac{\textrm{TF}(q_i,d_j)\cdot (k+1)}{\textrm{TF}(q_i,d_j) + k \cdot (1-b+b\cdot \frac{|d_j|}{L})}\]

in this formula, \(q_i\) is word \(i\) of the query \(q\), \(\textrm{TF}(q_i,d_j)\) is the number of times word \(q_i\) occurs in document \(d_j\), \(L=\sum_{i}|d_i|/N\) is the average document length of documents in the database, and \(k\) and \(b\) are tuning parameters that are typically set to values like \(k=2.0\) and \(b=0.75\)
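
a direct transcription of the BM25 formula into Python might look like this (the tokenized-list document representation and the function name are assumptions for the sketch):

```python
import math
from collections import Counter

def bm25_score(query, doc, docs, k=2.0, b=0.75):
    """Score a tokenized query against a tokenized document using BM25."""
    N = len(docs)                              # total number of documents
    L = sum(len(d) for d in docs) / N          # average document length
    tf = Counter(doc)                          # term frequencies in this document
    score = 0.0
    for q in query:
        df = sum(1 for d in docs if q in d)    # number of documents containing q
        idf = math.log((N - df + 0.5) / (df + 0.5))
        score += idf * (tf[q] * (k + 1)) / (tf[q] + k * (1 - b + b * len(doc) / L))
    return score

docs = [["farming", "in", "kansas", "is", "big", "business"],
        ["kansas", "city", "weather", "report"],
        ["recipes", "for", "corn", "and", "wheat"]]
query = ["farming", "in", "kansas"]
for d in docs:
    print(d[:3], round(bm25_score(query, d, docs), 3))
```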

in practice, IR systems are carefully evaluated and tuned on many different sets of documents, and getting a good scoring function can be surprisingly tricky

  • plus, it requires some non-trivial software engineering to make the system fast and memory-efficient — anything less than instantaneous results will probably not be tolerated by most users these days!

web search engines use many of the same ideas as traditional IR systems, but hyperlinks are a major difference between web pages and regular documents

  • algorithms like Google’s PageRank take into account the quality of the web pages that “point” to a page, so that the ranking cannot be fooled by spammers who add lots of occurrences of a word to their page
  • before Google, most web search engines used traditional IR techniques, and the results were typically terrible
    • you might have to click through 2 or 3 pages of results to get what you wanted
  • the arrival of Google was a major breakthrough: it ranked web search results much better than any other major search engine, and soon became the dominant searching tool it is today

Information Extraction

information extraction is the process of skimming a text to look for information of interest

typically, an information extractor has a relatively simple and clear goal about what sort of information it is extracting

for example, an information extraction system might search a text to find all email addresses, or all company names, or all products and their prices, etc.

for relatively simple information extraction problems, such as finding all postal codes in a text, it might be enough to use regular expressions
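
for instance, Canadian postal codes follow a fixed letter-digit pattern, so a single regular expression is enough to pull them out (the pattern below is a simplification that ignores the letters that never actually appear in postal codes):

```python
import re

# letter digit letter, an optional space or hyphen, then digit letter digit, e.g. "V5A 1S6"
POSTAL_CODE = re.compile(r"\b[A-Za-z]\d[A-Za-z][ -]?\d[A-Za-z]\d\b")

text = "Mail it to 8888 University Dr, Burnaby BC V5A 1S6, or to the V6B3K9 office."
print(POSTAL_CODE.findall(text))  # ['V5A 1S6', 'V6B3K9']
```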

but for more complex information extraction, more sophisticated linguistic and learning techniques may be needed

  • e.g. suppose you want to get movie recommendations from newsgroup discussions
  • how would you determine what movies are being discussed, and what people think of the movie?
    • both good and bad movies might have a lot of discussion
  • perhaps you could somehow count how many people like a movie, and how many don’t
    • your program doesn’t need to understand why they like or dislike the movie; it just has to determine what the “sentiment” is, i.e. favorable or unfavorable (see the sketch below)
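
a very crude version of that sentiment-counting idea might just compare counts of favorable and unfavorable words in each post; the word lists and tie-breaking rule below are purely illustrative, and real sentiment analysis would use a learned classifier:

```python
FAVORABLE = {"loved", "great", "amazing", "fantastic", "recommend"}
UNFAVORABLE = {"hated", "terrible", "boring", "awful", "waste"}

def post_sentiment(post):
    """Label a post as favorable, unfavorable, or neutral by counting keyword hits."""
    words = set(w.lower() for w in post.split())
    pos, neg = len(words & FAVORABLE), len(words & UNFAVORABLE)
    return "favorable" if pos > neg else "unfavorable" if neg > pos else "neutral"

posts = ["Loved it, great pacing, would recommend",
         "What a boring waste of two hours",
         "Saw it last night with friends"]
print([post_sentiment(p) for p in posts])  # ['favorable', 'unfavorable', 'neutral']
```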