Pitacily's lovely desk


Natural Language Processing

What is Text Classification?

Text classification, also known as categorization, is the task of deciding, given a text of some kind, which of a predefined set of classes it belongs to. Language identification and genre classification are examples of text classification, as are sentiment analysis (classifying a movie or product review as positive or negative) and spam detection (classifying an email message as spam or not-spam).

 

Another way to think about classification is as a problem in data compression. A lossless compression algorithm takes a sequence of symbols, detects repeated patterns in it, and writes a description of the sequence that is more compact than the original.

For example, the text “0.142857142857142857” might be compressed to “0.[142857]*3.” Compression algorithms work by building dictionaries of subsequences of the text, and then referring to entries in the dictionary. The example here had only one dictionary entry, “142857.”

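In effect, compression algorithms are creating a language model. The LZW algorithm in particular directly models a maximum-entropy probability distribution. To do classification by compression, we first lump together all the spam training messages and compress them as a unit, and do the same for the non-spam messages. To classify a new message, we append it to each collection and compress again; the class whose compressed size grows by fewer bytes is the predicted class.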
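The idea of classifying by compression can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses the standard `zlib` module (DEFLATE) rather than LZW, and the function names and the tiny training corpus are made up for the example.

```python
import zlib


def compressed_size(text: str) -> int:
    """Number of bytes the text occupies after zlib (DEFLATE) compression."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def classify(message: str, corpora: dict) -> str:
    """Return the label whose training text 'absorbs' the message most cheaply.

    corpora maps a class label to a list of training messages (hypothetical data).
    """
    best_label, best_cost = None, None
    for label, messages in corpora.items():
        base = "\n".join(messages)
        # Extra bytes needed to encode the new message on top of this class.
        cost = compressed_size(base + "\n" + message) - compressed_size(base)
        if best_cost is None or cost < best_cost:
            best_label, best_cost = label, cost
    return best_label


# Hypothetical toy corpus, far too small for real use.
corpora = {
    "spam": [
        "win free money now, click here to claim your prize",
        "cheap pills, limited offer, buy now and save big",
    ],
    "ham": [
        "the project meeting is moved to thursday afternoon",
        "please review the attached quarterly report draft",
    ],
}

print(classify("win free money now, click here to claim your prize", corpora))
```

A message that repeats patterns already seen in one class can be encoded mostly as references to earlier text, so it adds few bytes to that class. With realistic amounts of training text the same comparison applies, although the results depend on the compressor used.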

 


 

 

What is Information Retrieval?

Information retrieval is the task of finding documents that are relevant to a user’s need for information. The best-known examples of information retrieval systems are search engines on the World Wide Web. A Web user can type a query such as [AI book] into a search engine and see a list of relevant pages. In this section, we will see how such systems are built. An information retrieval (henceforth IR) system can be characterized by:

  • A corpus of documents. Each system must decide what it wants to treat as a document: a paragraph, a page, or a multipage text.
  • Queries posed in a query language. A query specifies what the user wants to know. The query language can be just a list of words, such as [AI book]; or it can specify a phrase of words that must be adjacent.
  • A result set. This is the subset of documents that the IR system judges to be relevant to the query. By relevant, we mean likely to be of use to the person who posed the query, for the particular information need expressed in the query.
  • A presentation of the result set. This can be as simple as a ranked list of document titles or as complex as a rotating color map of the result set projected onto a three-dimensional space, rendered as a two-dimensional display.

 

Characteristics of an IR system, in brief:

  • A collection of documents. The system must determine what counts as a document, for example a paragraph, a page, and so on.
  • A user query. The query is an expression used to find the information the user needs. In its simplest form, a query is a list of keywords, and the documents that contain those keywords are the ones retrieved. Examples: [AI book]; [“AI book”]; [AI AND book]; [AI NEAR book]; [AI book site:www.aaai.org].
  • A result set. The subset of the documents that is relevant to the query.
  • A display of the result set. This can be as simple as a ranked list of document titles.
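The four parts listed above (corpus, query, result set, presentation) can be seen together in a very small keyword search sketch. It assumes the simplest query language, a list of words that must all appear, and ranks results by how often the query words occur; the function names and the three sample documents are hypothetical.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(corpus):
    """Build an inverted index: word -> Counter mapping doc id to word count."""
    index = defaultdict(Counter)
    for doc_id, text in corpus.items():
        for word in tokenize(text):
            index[word][doc_id] += 1
    return index


def search(index, query):
    """Return ids of documents containing every query word, best match first."""
    words = tokenize(query)
    if not words:
        return []
    # Result set: documents that contain all of the query words.
    result = set(index.get(words[0], {}))
    for word in words[1:]:
        result &= set(index.get(word, {}))

    # Presentation: rank by total occurrences of the query words.
    def score(doc_id):
        return sum(index[word][doc_id] for word in words)

    return sorted(result, key=lambda doc_id: (-score(doc_id), doc_id))


# Corpus: here each "document" is a single title (hypothetical data).
corpus = {
    "d1": "AI book: a modern approach to AI",
    "d2": "A cook book",
    "d3": "An introductory book on AI",
}
index = build_index(corpus)
print(search(index, "[AI book]"))
```

Real search engines use more refined scoring than raw word counts, but the structure is the same: index the corpus once, then answer each query from the index.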

 


 

 

HITS (Hyperlink-Induced Topic Search)

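HITS is a link-analysis algorithm similar to PageRank. Instead of just counting the links into a page, HITS starts from the pages that match the query and examines the pages linked to and from them. A page gains authority when pages relevant to the same topic link to it, so the better the linking pages match the topic of the page they point to, the higher the authority value of that page. Pages that link to many good authorities are in turn considered good hubs.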
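The authority and hub values can be computed by repeated updating. The sketch below assumes the set of query-relevant pages and their links has already been collected and is passed in as a dictionary; the function name, the fixed iteration count, and the three-page graph are choices made for this example only.

```python
def hits(links, iterations=50):
    """Compute hub and authority scores for a small link graph.

    links maps each page to the list of pages it links to (hypothetical input).
    """
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority of a page: sum of the hub scores of the pages linking to it.
        auth = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:
                auth[q] += hub[p]
        # Hub score of a page: sum of the authority scores of the pages it links to.
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        # Normalize both score sets so the values do not grow without bound.
        for scores in (auth, hub):
            norm = sum(v * v for v in scores.values()) ** 0.5 or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth


# Pages "a" and "b" both link to "c", so "c" should be the authority.
links = {"a": ["c"], "b": ["c"], "c": []}
hub, auth = hits(links)
print(max(auth, key=auth.get))
```

In this toy graph the page that receives both links ends up with all of the authority, while the two linking pages share the hub score equally.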

 


 

 

Prolog is a language based on first-order predicate logic. (Will revise/introduce this later.) We can assert some facts and some rules, then ask questions to find out what is true.

  • Facts:

[Image: example Prolog facts]

Note: facts are written in lower-case letters, with a full stop at the end.

 

  • Rules:

[Image: example Prolog rules]

  1. John likes someone if that someone is tall.
  2. A person examines a course if they teach that course.
  3. NOTE: “:-” means IF. It is meant to look a bit like a backwards arrow.
  4. NOTE: Variables are written as capitals (or as words starting with a capital).
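To make the meaning of the two rules concrete, here is a rough Python imitation of how facts and rules answer questions. It is not Prolog and not how a Prolog system works internally; the facts, the predicate names `tall` and `teaches`, and the function names are all assumptions made up for this sketch.

```python
# Facts: each fact is a tuple (predicate, argument, ...). All names are hypothetical.
FACTS = {
    ("tall", "mary"),
    ("tall", "bob"),
    ("teaches", "alice", "ai"),
}


def likes(person, other):
    """Rule 1: john likes someone IF that someone is tall."""
    return person == "john" and ("tall", other) in FACTS


def examines(person, course):
    """Rule 2: a person examines a course IF they teach that course."""
    return ("teaches", person, course) in FACTS


def who_does_john_like():
    """Ask the question with a variable: find every X such that john likes X."""
    return sorted(args[0] for (pred, *args) in FACTS if pred == "tall")


print(likes("john", "mary"))
print(examines("alice", "ai"))
print(who_does_john_like())
```

The last function shows the role a variable plays in a question: instead of checking one candidate, the system searches the facts for every value that makes the rule true.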

This entry was posted on Wednesday, June 4th, 2014 at 4:33 pm.
