Natural Language Processing: Difference between revisions

Latest revision as of 21:17, 7 September 2017

Artificial Intelligence^[1]

Natural language processing (NLP) is a field of computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and human (natural) languages and, in particular, with programming computers to process large natural language corpora effectively.^[2]

Process of parsing natural language

Text Normalization
1. Segmenting/tokenizing words from running text
2. Normalizing word formats
3. Segmenting sentences in running text
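
The three steps above can be sketched in a few lines of Python. This is only a minimal illustration, not a production tokenizer: the function names are hypothetical, the sentence splitter assumes that every ".", "!" or "?" followed by whitespace ends a sentence (so abbreviations such as "Dr." would be split incorrectly), and lowercasing stands in for word-format normalization.

```python
import re

def segment_sentences(text):
    """Naive sentence segmentation: split after ., ! or ? followed by whitespace.
    Assumption: the text contains no abbreviations or decimal numbers."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def tokenize(sentence):
    """Keep runs of word characters (allowing internal apostrophes) and
    throw away punctuation."""
    return re.findall(r"\w+(?:'\w+)*", sentence)

def normalize(tokens):
    """Lowercase every token as a very simple word-format normalization."""
    return [t.lower() for t in tokens]

text = "Please turn your homework in. Don't be late!"
sentences = segment_sentences(text)
tokens = [normalize(tokenize(s)) for s in sentences]
print(sentences)
print(tokens)
```

Real systems usually replace each of these functions with a more careful rule-based or trained component; the point here is only the order of the steps.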

Terms related to text normalization

Term	Definition
Text Normalization	Normalizing text means converting it to a more convenient, standard form.^[3]
Tokenization	A part of text normalization. Given a character sequence and a defined document unit, tokenization is the task of chopping it up into pieces, called tokens, perhaps at the same time throwing away certain characters, such as punctuation.^[4]
Token	A token is a linguistically significant and methodologically useful unit of text^[5]; it is the basic unit processed in an NLP task.^[6]
Lemmatization	A part of text normalization. The task of determining that two words have the same root, despite their surface differences.^[7]
Stemming	Stemming refers to a simpler version of lemmatization in which we mainly just strip suffixes from the end of the word.^[8]
Sentence segmentation	Breaking up a text into individual sentences, using cues like periods or exclamation points.^[9]
Corpus	In linguistics, a corpus (plural corpora) or text corpus is a large and structured set of texts (nowadays usually electronically stored and processed).^[10]
Utterance	In spoken language analysis, an utterance is the smallest unit of speech.^[11]
Disfluency	A speech disfluency (also spelled speech dysfluency) is any of various breaks, irregularities, or non-lexical vocables that occur within the flow of otherwise fluent speech; similar disfluencies occur in different forms in languages other than English. They include false starts (words and sentences that are cut off mid-utterance), phrases that are restarted or repeated, repeated syllables, fillers (grunts or non-lexical utterances such as "huh", "uh", "erm", "um", "well", "so", and "like"), and repaired utterances (instances of speakers correcting their own slips of the tongue or mispronunciations before anyone else gets a chance to).^[12]
Fillers or filled pauses	A part of disfluency. Words like uh and um are examples of fillers.^[13]
Lemma	A lemma is a set of lexical forms having the same stem, the same major part-of-speech, and the same word sense.^[14]
Word Form	The word form is the full inflected or derived form of the word.^[15]
Word Types	One of two ways of counting how many words there are in a corpus. Types are the number of distinct words in a corpus.^[16]
Word Token	The other way of counting how many words there are in a corpus. Tokens are the total number N of running words.^[17]
Clitic	In English morphology and phonology, a clitic is a word or part of a word that is structurally dependent on a neighboring word (its host) and cannot stand on its own. A clitic is said to be "phonologically bound," which means that it's pronounced, with very little emphasis, as if it were affixed to an adjacent word.^[18]
Case folding	Case folding is another kind of normalization. For tasks like speech recognition and information retrieval, everything is mapped to lower case.^[19]
Morphemes	A morpheme is the smallest meaningful unit in the grammar of a language that can't be divided into smaller meaningful parts.^[20]
Running Text	The body of text, as distinct from headings, footnotes, diagrams, data tables, logos, images, and other added content.^[21]
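
The distinction between word types and word tokens, and the effect of case folding on it, can be made concrete with a short sketch. The function name and the example sentence are illustrative assumptions, and whitespace splitting stands in for a real tokenizer.

```python
from collections import Counter

def count_types_and_tokens(tokens, fold_case=False):
    """Return (number of types, number of tokens) for a list of word tokens.
    With fold_case=True every token is lowercased before counting."""
    if fold_case:
        tokens = [t.lower() for t in tokens]
    counts = Counter(tokens)
    return len(counts), sum(counts.values())

# Hypothetical example sentence, split on whitespace for simplicity.
words = "The cat saw the other cat".split()

print(count_types_and_tokens(words))                  # "The" and "the" counted separately
print(count_types_and_tokens(words, fold_case=True))  # "The" and "the" merged
```

Without case folding the six tokens fall into five types, because "The" and "the" are counted as different words; with case folding they fall into four.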

Language models

A language model (LM) assigns probabilities to sentences and sequences of words; the simplest kind is the N-gram model. An N-gram is a sequence of N words: a 2-gram (or bigram) is a two-word sequence such as “please turn”, “turn your”, or “your homework”, and a 3-gram (or trigram) is a three-word sequence such as “please turn your” or “turn your homework”.^[22]
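
As an illustration, the bigrams and trigrams quoted above can be produced by sliding a window of N words over a token list. The helper name ngrams is an assumption made for this sketch, not part of any particular library.

```python
def ngrams(tokens, n):
    """Return every contiguous n-word sequence in tokens as a tuple."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "please turn your homework".split()

print(ngrams(tokens, 2))
# [('please', 'turn'), ('turn', 'your'), ('your', 'homework')]
print(ngrams(tokens, 3))
# [('please', 'turn', 'your'), ('turn', 'your', 'homework')]
```

A token list shorter than N simply yields an empty list.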

The goal of probabilistic language modeling is to compute the probability of a sentence or sequence of words.
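
One common way to approximate such a probability is a bigram model, in which each word is assumed to depend only on the word before it and each conditional probability is estimated from counts. The sketch below makes several assumptions that are not stated on this page: maximum-likelihood estimates without smoothing, hypothetical sentence-boundary markers "<s>" and "</s>", whitespace tokenization, and a made-up three-sentence training corpus.

```python
from collections import Counter

def train_bigram_counts(sentences):
    """Count context words and bigrams over sentences padded with <s> and </s>."""
    contexts, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        contexts.update(tokens[:-1])             # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent word pairs
    return contexts, bigrams

def sentence_probability(sentence, contexts, bigrams):
    """Multiply the estimates P(word | previous word) = count(prev, word) / count(prev).
    Any unseen bigram makes the whole product 0.0 because there is no smoothing."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        if contexts[prev] == 0:
            return 0.0
        prob *= bigrams[(prev, word)] / contexts[prev]
    return prob

# Hypothetical toy corpus, for illustration only.
corpus = [
    "please turn your homework in",
    "please turn the page",
    "turn your homework in",
]
contexts, bigrams = train_bigram_counts(corpus)
print(sentence_probability("please turn your homework in", contexts, bigrams))
print(sentence_probability("turn the homework", contexts, bigrams))
```

On this toy corpus the first sentence gets probability 2/3 × 2/3 = 4/9, while the second gets 0.0 because the bigram "the homework" never occurs in the training data; practical models add smoothing to avoid such zeros.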

References

[1] http://www.flaticon.com/

[2] https://en.wikipedia.org/wiki/Natural_language_processing

[3] https://web.stanford.edu/~jurafsky/slp3/ed3book.pdf

[4] https://nlp.stanford.edu/IR-book/html/htmledition/tokenization-1.html

[5] https://www.ibm.com/developerworks/community/blogs/nlp/entry/tokenization?lang=en

[6] https://www.researchgate.net/post/What_is_differance_between_String_and_token_in_Natural_Language_Processing_techniques

[7] https://web.stanford.edu/~jurafsky/slp3/ed3book.pdf

[8] https://web.stanford.edu/~jurafsky/slp3/ed3book.pdf

[9] https://web.stanford.edu/~jurafsky/slp3/ed3book.pdf

[10] https://en.wikipedia.org/wiki/Text_corpus

[11] https://en.wikipedia.org/wiki/Utterance

[12] https://en.wikipedia.org/wiki/Speech_disfluency

[13] https://web.stanford.edu/~jurafsky/slp3/ed3book.pdf

[14] https://web.stanford.edu/~jurafsky/slp3/ed3book.pdf

[15] https://web.stanford.edu/~jurafsky/slp3/ed3book.pdf

[16] https://web.stanford.edu/~jurafsky/slp3/ed3book.pdf

[17] https://web.stanford.edu/~jurafsky/slp3/ed3book.pdf

[18] https://www.thoughtco.com/what-is-clitic-words-1689757

[19] https://web.stanford.edu/~jurafsky/slp3/ed3book.pdf

[20] http://www.glossary.sil.org/term/morpheme

[21] https://en.wiktionary.org/wiki/running_text

[22] https://web.stanford.edu/~jurafsky/slp3/ed3book.pdf

@@ Line 56: / Line 56: @@
 ==  Language models ==
-A language model (LM) assigns probabilities to sentences and sequences of words, the N-gram. An N-gram is a sequence of N. N-gram words: a 2-gram (or bigram) is a two-word sequence of words like “please turn”, “turn your”, or ”your homework”, and a 3-gram (or trigram) is a three-word sequence of words like “please turn your”, or “turn your homework”<ref>https://web.stanford.edu/~jurafsky/slp3/ed3book.pdf</ref>
+A language model (LM) assigns probabilities to sentences and sequences of words, the N-gram.  N-gram words: a 2-gram (or bigram) is a two-word sequence of words like “please turn”, “turn your”, or ”your homework”, and a 3-gram (or trigram) is a three-word sequence of words like “please turn your”, or “turn your homework”<ref>https://web.stanford.edu/~jurafsky/slp3/ed3book.pdf</ref>
-The goal of probabilistic language modeling is to compute the the probability of a sentence or sequence of words.
+The goal of probabilistic language modeling is to compute the probability of a sentence or sequence of words.
 == References ==