08 Text Data Processing
Structured Data
– Data in fields
– Easily stored in databases
– E.g.:
  – Sensor data
  – Financial data
  – Click streams
  – Measurements
Social Media Data
Bag of Words
– Represent a document as a multiset of words
– Keep frequency information
– Disregard grammar and word order
– Feature vector: which words occur how often in a given text?
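A minimal sketch of this idea in Python, using collections.Counter as the multiset (the lowercase whitespace split is an assumed toy tokeniser, not the lecture's method):

from collections import Counter

def bag_of_words(text):
    # Naive tokenisation for illustration: lowercase, split on whitespace.
    tokens = text.lower().split()
    # The Counter is the multiset: word -> frequency; grammar and order are discarded.
    return Counter(tokens)

print(bag_of_words("friends romans countrymen lend me your ears"))
# Counter({'friends': 1, 'romans': 1, 'countrymen': 1, 'lend': 1, 'me': 1, 'your': 1, 'ears': 1})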
Tokenisation
[“friend”, “roman”, “countrymen”]
⬇
{“friend”: 1, “roman”: 1, “countrymen”: 1}

TF-IDF Weighting
– Damping
  – Sometimes want to reduce the impact of high counts
  – TF = log(|occurrences of term in doc|)
– Inverse document frequency (IDF)
  – Give less weight to terms that are common across documents
    • deals with the problems of the Zipf distribution
  – IDF = log(|total docs| / |docs containing term|)
– TFIDF
  – TFIDF = TF * IDF
– Example:
{“friend”: 1, “roman”: 2, “countrymen”: 1}
⬇
{“friend”: 0.1, “roman”: 0.8, “countrymen”: 0.2}
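A minimal sketch of these formulas in plain Python (the toy corpus and function names are assumptions for illustration; note that with this damped definition TF = log(1) = 0 for a term occurring once, which is why many implementations use 1 + log(count) instead):

import math

def tf(term, doc_tokens):
    # Damped term frequency: TF = log(|occurrences of term in doc|)
    count = doc_tokens.count(term)
    return math.log(count) if count > 0 else 0.0

def idf(term, docs):
    # IDF = log(|total docs| / |docs containing term|)
    containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / containing) if containing > 0 else 0.0

def tfidf(term, doc_tokens, docs):
    # TFIDF = TF * IDF
    return tf(term, doc_tokens) * idf(term, docs)

docs = [["friend", "roman", "roman", "countrymen"],
        ["friend", "lend", "ears"],
        ["friend", "roman", "empire"]]
print(tfidf("roman", docs[0], docs))   # log(2) * log(3/2), roughly 0.28
print(tfidf("friend", docs[0], docs))  # 0.0: occurs in every doc, so IDF = log(1) = 0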
[Figure: documents plotted as vectors in term space, with axes “Star” and “Diet”; examples shown: a document about movie stars and a document about astronomy.]
– Assumption: Documents that are close in direction and length are similar to one another.
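Closeness in direction is what cosine similarity measures (it ignores length); a minimal sketch over the two term axes above, with made-up weights:

import math

def cosine_similarity(a, b):
    # Cosine of the angle between two term-weight vectors:
    # 1.0 = identical direction, 0.0 = orthogonal (nothing in common).
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Vectors are ('Star', 'Diet') weights; the values are illustrative only.
movie_stars_doc = [0.9, 0.1]
astronomy_doc = [0.8, 0.05]
diet_doc = [0.1, 0.95]
print(cosine_similarity(movie_stars_doc, astronomy_doc))  # close to 1: similar direction
print(cosine_similarity(movie_stars_doc, diet_doc))       # much lower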
Feature Extraction in Python
– The scikit-learn library provides this functionality via its CountVectorizer class
– Example:
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['This is the first document.',
          'This is the second second document.',
          'And the third one.',
          'Is this the first document?']
vectorizer = CountVectorizer()
# fit_transform learns the vocabulary and builds the document-term matrix
matrix = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())  # learned vocabulary
print(matrix.toarray())                    # term counts per document
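Note: fit_transform returns a SciPy sparse matrix (most entries in a document-term matrix are zero), which is why the example converts it with toarray() before printing.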
– https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction
– See also: https://adataanalyst.com/scikit-learn/countvectorizer-sklearn-example/
Feature Extraction in Python (cont’d)
– CountVectorizer can be configured in quite some detail
– By default, CountVectorizer tokenises into single words of minimum length 2
  • Change this to also consider bigrams (terms consisting of 2 words):
    vectorizer = CountVectorizer(ngram_range=(1, 2))
– Convert input text to lower case; also strip certain accents from the text:
    vectorizer = CountVectorizer(lowercase=True, strip_accents='ascii')
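A short sketch of what the bigram setting changes, reusing two documents from the corpus above:

from sklearn.feature_extraction.text import CountVectorizer

corpus = ['This is the first document.',
          'Is this the first document?']
# With ngram_range=(1, 2) the vocabulary contains unigrams and bigrams,
# e.g. both 'this is' and 'is this', so some of the word-order
# information that plain bag-of-words discards is recovered.
vectorizer = CountVectorizer(ngram_range=(1, 2))
matrix = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())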
– If you haven’t done so, please join a group in your lab class for the upcoming assignment, which we will publish soon