Sentence Segmentation
Recommended Reading:
• Japanese, Chinese, Arabic Word Segmenter
Stanford-nlp: https://nlp.stanford.edu/software/segmenter.shtml
Language Identification -1
• Key Step:
• Identifying the language of the document
• Documents could be multilingual at the sentence level or
paragraph level too
Language Identification -2
• Writing System – Document triage
• Orthographic Features - Conventions of the written language (spelling,
hyphenation, word breaks, emphasis, and punctuation)
• Unique Character Set – Greek or Hebrew – helps in identifying the
language
• Shared Character Sets – helps in narrowing down to a small set of
languages
• Arabic & Persian
• Russian & Ukrainian
• Norwegian & Swedish
Language Identification -3
• Identifying Character Set
• Key Strategy
• sort the bytes in a file by frequency count and use the sorted list as a
signature vector for comparison via an n-gram model
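The key strategy above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the function names, the `top_k` cutoff, and the rank-based "out-of-place" distance used for the comparison are assumptions chosen for the example, and the two tiny profiles stand in for real training files.

```python
from collections import Counter


def byte_signature(data: bytes, top_k: int = 32) -> list[int]:
    """Return the byte values of `data`, most frequent first (the signature vector)."""
    counts = Counter(data)
    # Sort by descending count; ties are broken by byte value so the result is deterministic.
    ranked = sorted(counts, key=lambda b: (-counts[b], b))
    return ranked[:top_k]


def out_of_place(sig_a: list[int], sig_b: list[int]) -> int:
    """Rank-difference distance between two signatures (assumed comparison measure)."""
    rank_b = {b: i for i, b in enumerate(sig_b)}
    penalty = len(sig_b)  # cost for a byte that never appears in the other signature
    return sum(
        abs(i - rank_b[b]) if b in rank_b else penalty
        for i, b in enumerate(sig_a)
    )


def closest_profile(data: bytes, profiles: dict[str, bytes]) -> str:
    """Pick the profile whose byte signature is nearest to that of `data`."""
    sig = byte_signature(data)
    return min(profiles, key=lambda name: out_of_place(sig, byte_signature(profiles[name])))


# Hypothetical training samples, one per character set / script.
profiles = {
    "utf8-cyrillic": "привет мир это текст".encode("utf-8"),
    "ascii-latin": b"hello world this is text",
}
guess = closest_profile("мир текст".encode("utf-8"), profiles)
```

In practice the profiles would be built from much larger samples, and single bytes could be replaced by byte n-grams.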
Corpus Dependence
• Early NLP systems - could process only well-formed input conforming to
their hand-built grammar.
• Current Scenario – increasing availability of large corpora; multilingual;
wide range of data types (newswires, emails, web blogs)
• Misspellings
• Erroneous Punctuations and Spacing
• Difficult to write rules to govern the corpora from the Internet
• Punctuation marks correspond roughly to suprasegmental features of spoken language, but this
correspondence might not hold for every corpus
• The origin and purpose of a text determine which conventions it follows
• Web Pages – headers, images, navigation links, browser scripts – not actual content
Application Dependence
• Word and Sentence Segmentation – arbitrary distinctions
• I’m
• Contraction
• Tokeniser’s role
• Grammatical Structure
• Possessive Forms – the word governor’s
• Brown Corpus - one token, tagged as a possessive noun
• Susanne Corpus – two tokens (governor and ’s), tagged as singular noun and possessive
These are Application Dependent
Possessive pronoun – my, mine, our, ours, its, his, her, their, theirs, your, yours
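The application dependence of the possessive can be illustrated with a small tokenizer that supports both conventions. This is only a sketch: the function name `tokenize`, the `split_possessive` flag, and the regular expressions are assumptions made for the example, not the procedure used to build either corpus.

```python
import re


def tokenize(text: str, split_possessive: bool = False) -> list[str]:
    """Split text into word and punctuation tokens.

    With split_possessive=False, "governor's" stays one token.
    With split_possessive=True, it becomes "governor" + "'s".
    """
    # A word may contain internal apostrophes (straight or curly); any other
    # non-space, non-word character is a token of its own.
    tokens = re.findall(r"\w+(?:['’]\w+)*|[^\w\s]", text)
    if not split_possessive:
        return tokens
    result = []
    for tok in tokens:
        match = re.fullmatch(r"(\w+)(['’]s)", tok)
        if match:
            result.extend([match.group(1), match.group(2)])
        else:
            result.append(tok)
    return result


one_token = tokenize("the governor's office")
two_tokens = tokenize("the governor's office", split_possessive=True)
```

Note that the split rule cannot tell a possessive from a contraction such as it's; resolving that needs further context, which is part of why the choice is left to the application.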
Tokenization
• Space-delimited languages
• Unsegmented languages
Tokenization - challenges
• Space Delimiters – inadequate
• Ambiguous nature of writing systems
• Typographical structure of words
• Latin
• Alphabetic
• Space Delimited Text
Observation - 1
• Uses period in three different ways
• Within numbers as decimal points ($3.9)
• To mark abbreviations (Corp. and Sept.)
• To mark the end of the sentence (24.) – Here it is not a decimal point though
it occurs after a number
Observation -2
• Uses apostrophes in two ways
• To mark the possessive case (analyst’s)
• To show contractions (doesn’t)
(1) The contemporary viewer may simply wonder at the vast wooded
vistas rising up from the Saguenay River and Lac St. Jean, standing in for
the St. Lawrence River.
(2)The firm said it plans to sublease its current headquarters at 55 Water
St. A spokesman declined to elaborate.
1.1 Tokenizing Punctuation in English
QUOTATION MARKS & APOSTROPHES
• A major source of tokenization ambiguity
• Indicate quoted passage (Ambiguity: Open or Close)
• Sometimes the single quote and the apostrophe are the same character!
• Apostrophe
• Genitive case of a noun
• Contractions (solution: expand the word, but this requires knowledge of the language)
• Some plurals
• Ambiguities:
• Peter’s bag; Peter’s home; Popular during the 80’s
Common Approaches:
• Extensive Word List
• But no widely accepted guideline to DEFINE A WORD!!
2 Common Word Segmentation Algorithms
• Each character is a distinct word
• But does not help in parsing; POS tagging; text-to-speech system
• Thus hurts performance!!
• Forward-Backward Matching
• Results are compared
• Optimised Segmentation occurs
• Language-specific heuristics are used later
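A minimal sketch of forward-backward matching, assuming a greedy longest-match strategy against a lexicon. The function names, the toy lexicon, and the rule for choosing between the two results (fewer tokens, then fewer single-character tokens, then the backward result) are assumptions for illustration; real systems apply their own language-specific heuristics at that step.

```python
def forward_max_match(text: str, lexicon: set[str], max_len: int) -> list[str]:
    """Greedy longest match, scanning left to right."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            # Fall back to a single character when nothing in the lexicon matches.
            if text[i:j] in lexicon or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens


def backward_max_match(text: str, lexicon: set[str], max_len: int) -> list[str]:
    """Greedy longest match, scanning right to left."""
    tokens, j = [], len(text)
    while j > 0:
        for i in range(max(0, j - max_len), j):
            if text[i:j] in lexicon or i == j - 1:
                tokens.append(text[i:j])
                j = i
                break
    return tokens[::-1]


def bidirectional_match(text: str, lexicon: set[str]) -> list[str]:
    """Run both passes, compare the results, and keep the preferred one."""
    max_len = max(map(len, lexicon), default=1)
    fwd = forward_max_match(text, lexicon, max_len)
    bwd = backward_max_match(text, lexicon, max_len)
    if fwd == bwd:
        return fwd

    # Assumed heuristic: fewer tokens, then fewer single-character tokens.
    def cost(tokens: list[str]) -> tuple[int, int]:
        return (len(tokens), sum(len(t) == 1 for t in tokens))

    return min(bwd, fwd, key=cost)  # on a tie, min() keeps the backward result


lexicon = {"研究", "研究生", "生命", "命", "起源"}  # toy lexicon
result = bidirectional_match("研究生命起源", lexicon)
```

Here the forward pass greedily takes 研究生 and is left with the single character 命, while the backward pass recovers 研究 / 生命 / 起源, which the comparison step then prefers.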
2.1 Chinese Segmentation
• Hanzi – several thousand characters (a word consists of one or
more characters)
• Approaches:
• Statistical – use mutual information between characters from a corpus
• Lexical rule-based – syntax; semantics; morphological rules
• Hybrid - combination; weighted Finite State Transducer to identify dictionary
entries
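The statistical approach can be sketched with pointwise mutual information (PMI) between adjacent characters: pairs that co-occur more often than chance are kept together, and a word boundary is placed where the association is weak. The function names, the four-line toy corpus, and the threshold value are assumptions for illustration only.

```python
import math
from collections import Counter


def char_pmi(corpus: list[str]) -> dict[tuple[str, str], float]:
    """Estimate PMI (in bits) for every adjacent character pair seen in the corpus."""
    unigrams, bigrams = Counter(), Counter()
    for line in corpus:
        unigrams.update(line)
        bigrams.update(zip(line, line[1:]))
    n_uni, n_bi = sum(unigrams.values()), sum(bigrams.values())
    return {
        (a, b): math.log2(
            (count / n_bi) / ((unigrams[a] / n_uni) * (unigrams[b] / n_uni))
        )
        for (a, b), count in bigrams.items()
    }


def segment_by_pmi(text: str, pmi: dict[tuple[str, str], float], threshold: float) -> list[str]:
    """Join adjacent characters whose PMI reaches the threshold; split otherwise."""
    if not text:
        return []
    words = [text[0]]
    for prev, cur in zip(text, text[1:]):
        # Unseen pairs get -inf, so they always produce a boundary.
        if pmi.get((prev, cur), float("-inf")) >= threshold:
            words[-1] += cur
        else:
            words.append(cur)
    return words


toy_corpus = ["北京大学", "北京", "大学", "大学生"]  # hypothetical training text
pmi = char_pmi(toy_corpus)
words = segment_by_pmi("北京大学", pmi, threshold=2.0)
```

With real data the threshold would have to be tuned, and the PMI estimates would come from a large corpus rather than four short strings.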
2.2 Japanese Segmentation
• Alphabetic – Syllabic – Logographic
• Kanji
• Hiragana
• Katakana
• Romaji
• Challenges:
• Identifying the base
• Mix of writing systems may be used
• Approaches:
• Statistical techniques of Chinese apply
2.3 Unsegmented Alphabetic - Syllabic Languages
• Thai
• Balinese
• Javanese
• Khmer
• Approaches:
• Maximum matching
• Broken to syllables – trained syllable collocation is used
• Features:
• Fewer characters
• Longer words
3 Sentence Segmentation
• Different Punctuation Marks
• Disambiguating all instances of punctuations may help
• Thai – does not use a period but a space, which can be confused with a carriage
return
• Spaces are used in place of commas to mark clauses
• . ! ? are used to mark sentence boundaries, but they can
Sentence Segmentation
• The sentence delimiters ! and ? are relatively unambiguous
• Period “.” is quite ambiguous
• Sentence boundary
• In most NLP applications, the only sentence boundary punctuation marks considered are the
period, question mark, and exclamation point, and the definition of sentence is limited to the
text sentence which begins with a capital letter and ends in a full stop.
• Abbreviations like Inc. or Dr.
• Numbers like .02% or 4.3
• Consider the two examples below: English sentences that convey exactly the
same meaning; yet, by the traditional definition,
• the first would be classified as two sentences, the second as just one.
The semicolon in the second example could likewise be replaced by a comma or a
dash and retain the same meaning, while still being considered a single
sentence.
• Here is a sentence. Here is another.
• Here is a sentence; here is another.
Sentence Segmentation
• Build a binary classifier
• Looks at a “.”
• Decides EndOfSentence/NotEndOfSentence
• Classifiers: hand-written rules, regular expressions, or machine-learning
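A minimal sketch of the hand-written-rules variant of such a classifier. The abbreviation list, the function names, and the specific rules are assumptions for illustration; a practical system would use a far larger list or a trained model.

```python
import re

# Illustrative abbreviation list (an assumption, far from complete).
ABBREVIATIONS = {"dr", "mr", "mrs", "ms", "inc", "corp", "st", "sept"}


def is_end_of_sentence(text: str, i: int) -> bool:
    """Decide EndOfSentence / NotEndOfSentence for the period at text[i]."""
    before = re.search(r"\S*\Z", text[:i]).group()  # token directly before the period
    after = text[i + 1:]
    # Rule 1: a period between two digits is a decimal point (4.3).
    if before[-1:].isdigit() and after[:1].isdigit():
        return False
    # Rule 2: a period glued to the following character is not a boundary (.02%).
    if after and not after[0].isspace():
        return False
    # Rule 3: a period after a known abbreviation is not a boundary (Dr., Inc.).
    if before.lower() in ABBREVIATIONS:
        return False
    return True


def split_sentences(text: str) -> list[str]:
    """Split on ! and ? unconditionally, and on periods the classifier accepts."""
    sentences, start = [], 0
    for match in re.finditer(r"[.!?]", text):
        i = match.start()
        if text[i] in "!?" or is_end_of_sentence(text, i):
            sentences.append(text[start:i + 1].strip())
            start = i + 1
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


sentences = split_sentences("Dr. Smith paid $3.9 million. He left on Sept. 24. Really?")
```

Rule 3 is deliberately naive: an abbreviation that also ends a sentence, as in "55 Water St. A spokesman declined to elaborate.", is misclassified, which is the kind of case that motivates the machine-learning classifiers.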
Determining if a word is end-of-sentence:
a Decision Tree