
DATA2001 – Data Science, Big Data and Data Diversity


Week 8: Text Data Processing:
Feature Extraction & Analysis
Presented by
A/Prof Uwe Roehm
School of Computer Science



Today: Text Data Processing

Objective
Understanding how to process unstructured text data for categorisation and forecasting. Role of tokenisation and feature vectors.

Readings
– Data Science from Scratch, Ch. 13
– Doing Data Science, Ch. 4

Lecture
– Text Data, Unstructured Data
– Text Classification
– Information Extraction

Use Cases
– Spam detection
– Predicting box office returns
– Wikipedia


Introduction: Unstructured Data


Text data usually does not have a pre-defined data model, is unstructured, and is typically text-heavy, but may contain dates, numbers, and facts as well. This results in ambiguities that make it more difficult to understand than data in structured databases.


Structured data

– Data in fields
– Easily stored in databases
– E.g.:
  – Sensor data
  – Financial data
  – Click streams
  – Measurements


Text is Unstructured Data

– 80–90% of all potentially usable business information is unstructured
– E.g.:
  – Images
  – Video
  – Email
  – Social media

“In the history of cinematic mustaches, few have been as disgusting as that of Rye Gerhardt (Kieran Culkin), the youngest scion of North Dakota’s reigning crime family and the stray spark that sets off the powder-keg second season of Fargo.”
(From a Slate review of Fargo Season 2)
Information Overload

http://bhavnaober.blogspot.com.au/2015_02_01_archive.html
Social Media Data

http://hubpages.com/technology/Big-Data-Understanding-New-Insights#


Review of Machine Learning tasks

– Supervised learning – predict a value where ground truth is available in the training data
  – Prediction
  – Classification (categorical – discrete labels), Regression (quantitative – numeric values)
– Unsupervised learning – find patterns without ground truth in the training data
  – Clustering
  – Probability distribution estimation
  – Finding associations (in features)
  – Dimension reduction
– Other tasks: Semi-supervised learning, Reinforcement learning
Supervised vs Unsupervised Learning

[Figure: side-by-side illustration of supervised learning vs unsupervised learning]


Text Categorisation



Motivation: Legitimate eMail – or SPAM?



Motivation Task: Spam/Not-spam Detection

http://blog.dato.com/how-to-evaluate-machine-learning-models-part-2a-classification-metrics



Modelling Spam Detection as Classification

– Input:
  – Emails
  – SMS messages
  – Facebook pages

– Predict:
  – 1 (spam)
  – 0 (not-spam)


Core Idea: Text to Feature Vectors

– Represent document as a multiset of words
  – Keep frequency information
  – Disregard grammar and word order

– Feature Vector: which words occur how often in a given text?

http://www.python-course.eu/text_classification_python.php
Tokenisation

– Split a string (document) into pieces called tokens
– Possibly remove some characters, e.g., punctuation
– Remove “stop words” such as “a”, “the”, “and”, which are considered irrelevant
– What about “O’Neill”? “Aren’t”?

“Friends, Romans, Romans, countrymen”
⬇
[“Friends”, “Romans”, “Romans”, “countrymen”]
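A minimal sketch of such a tokeniser in plain Python; the token pattern and the tiny stop-word list are illustrative assumptions, not part of the lecture:

import re

STOP_WORDS = {"a", "the", "and"}   # illustrative stop-word list

def tokenise(text):
    # keep runs of letters and apostrophes, so "O'Neill" and "Aren't" stay whole
    tokens = re.findall(r"[A-Za-z']+", text)
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(tokenise("Friends, Romans, Romans, countrymen"))
# ['Friends', 'Romans', 'Romans', 'countrymen']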


Normalisation

– Map similar words to the same token
– Stemming/lemmatisation
  – Avoid grammatical and derivational sparseness
  – E.g., “was” => “be”
– Lower casing, encoding
  – E.g., “Naïve” => “naive”

[“Friends”, “Romans”, “Romans”, “countrymen”]
⬇
[“friend”, “roman”, “roman”, “countrymen”]
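A possible implementation, sketched with NLTK's PorterStemmer (an assumed library choice; the lecture does not prescribe one) and unicodedata for the accent stripping:

import unicodedata
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def normalise(tokens):
    out = []
    for t in tokens:
        t = t.lower()
        # strip accents: "naïve" -> "naive"
        t = unicodedata.normalize("NFKD", t).encode("ascii", "ignore").decode("ascii")
        out.append(stemmer.stem(t))
    return out

print(normalise(["Friends", "Romans", "Romans", "countrymen"]))
# ['friend', 'roman', 'roman', 'countrymen']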


Indicator Features

– Binary indicator feature for each word in a document
– Ignore frequencies

[“friend”, “roman”, “roman”, “countrymen”]
⬇
{“friend”: 1, “roman”: 1, “countrymen”: 1}
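In Python this is a one-liner over the token list:

tokens = ["friend", "roman", "roman", "countrymen"]
indicator = {t: 1 for t in set(tokens)}
print(indicator)   # {'friend': 1, 'roman': 1, 'countrymen': 1} (key order may vary)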


Term Frequency Weighting

– Term frequency
  – Give more weight to terms that are common in the document
  – TF = |occurrences of term in doc|
– Damping
  – Sometimes want to reduce impact of high counts
  – TF = log(|occurrences of term in doc|)

[“friend”, “roman”, “roman”, “countrymen”]
⬇
{“friend”: 1, “roman”: 2, “countrymen”: 1}
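collections.Counter gives the raw term frequencies directly; the damped variant below follows the slide's formula (note that log(1) = 0, so a 1 + log(tf) variant is also common in practice):

import math
from collections import Counter

tokens = ["friend", "roman", "roman", "countrymen"]
tf = Counter(tokens)
print(dict(tf))   # {'friend': 1, 'roman': 2, 'countrymen': 1}

# damped term frequency, as on the slide
damped = {t: math.log(c) for t, c in tf.items()}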


TF-IDF Weighting

– Inverse document frequency (IDF)
  – Give less weight to terms that are common across documents
  – Deals with the problems of the Zipf distribution
  – IDF = log(|total docs| / |docs containing term|)
– TF-IDF
  – TF-IDF = TF * IDF

[“friend”, “roman”, “countrymen”]
⬇
{“friend”: 0.1, “roman”: 0.8, “countrymen”: 0.2}
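A direct transcription of these formulas, using three toy documents invented for illustration:

import math

docs = [["friend", "roman", "roman", "countrymen"],
        ["roman", "empire"],
        ["friend", "ship"]]

def idf(term):
    # IDF = log(|total docs| / |docs containing term|)
    return math.log(len(docs) / sum(term in d for d in docs))

def tfidf(doc):
    # TF-IDF = TF * IDF for each distinct term in the document
    return {t: doc.count(t) * idf(t) for t in set(doc)}

print(tfidf(docs[0]))
# 'countrymen' (rare across docs) scores higher per occurrence than 'friend' or 'roman'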

Vector Space Model

– Documents are represented as vectors in term space
  – Terms are usually stems
  – Document vector values can be weighted by, e.g., frequency
– Queries are represented the same way as documents

     nova  galaxy  heat  h'wood  film  role  diet  fur
A     10      5      3
B                            5    10
C                           10     8     7
D                            9    10     5
E                                              10   10
F                                  9           10
G      5                           7                 9
H      6     10                                 2    8
I             7      5                          1    3

“Nova” occurs 10 times in text A, “galaxy” 5 times, “heat” 3 times.
(A blank cell means 0 occurrences.)
Document Vectors

All document vectors together form the Document-Term-Matrix (Feature-Matrix), one row per document id:

     nova  galaxy  heat  h'wood  film  role  diet  fur
A     10      5      3
B                            5    10
C                           10     8     7
D                            9    10     5
E                                              10   10
F                                  9           10
G      5                           7                 9
H      6     10                                 2    8
I             7      5                          1    3


We Can Plot the Vectors

[Figure: documents plotted as vectors in a 2-D term space with axes “Star” and “Diet”; a doc about movie stars, a doc about astronomy, and a doc about mammal behaviour point in different directions]

Assumption: Documents that are close in direction and length are similar to one another.
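The standard way to measure “close in direction” is cosine similarity; a small sketch over made-up two-dimensional count vectors for the “star”/“diet” axes:

import math

def cosine(u, v):
    # cosine of the angle between two term vectors: 1.0 = same direction
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

astronomy = [10, 0]   # counts for ("star", "diet"); illustrative numbers
movies    = [8, 1]
mammals   = [1, 9]
print(cosine(astronomy, movies))    # close to 1: similar direction
print(cosine(astronomy, mammals))   # close to 0: nearly orthogonal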
Feature Extraction in Python

– The scikit-learn library provides corresponding functionality via its CountVectorizer
– Example:

from sklearn.feature_extraction.text import CountVectorizer
from pprint import pprint

corpus = ['This is the first document.',
          'This is the second second document.',
          'And the third one.',
          'Is this the first document?']
vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform(corpus)
pprint(matrix)
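A short follow-up to inspect the result (get_feature_names_out assumes scikit-learn >= 1.0; older versions call it get_feature_names):

print(vectorizer.get_feature_names_out())
# ['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']
print(matrix.toarray())   # dense document-term matrix, one row per document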

– https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction
– See also: https://adataanalyst.com/scikit-learn/countvectorizer-sklearn-example/
Feature Extraction in Python (cont'd)

– CountVectorizer can be configured in quite some detail
– By default, CountVectorizer tokenises into single words of minimum length 2
  – Change to also consider bigrams (terms consisting of 2 words):
    vectorizer = CountVectorizer(ngram_range=(1,2))
– Convert input text to lower case; also ignore certain accents in the text:
    vectorizer = CountVectorizer(lowercase=True, strip_accents='ascii')
– Use indicator features (0 or 1) rather than term frequencies:
    vectorizer = CountVectorizer(binary=True)
– Specify a list of stop words that get ignored:
    vectorizer = CountVectorizer(stop_words=['the', 'a'])
– Only keep features within a certain document frequency range:
    vectorizer = CountVectorizer(min_df=0.1, max_df=0.5)
– Combined example:
    CountVectorizer(lowercase=True, strip_accents='ascii', binary=True)
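Applied to the corpus from the previous slide, a configured vectorizer's vocabulary then also contains bigrams (a quick sketch):

vectorizer = CountVectorizer(ngram_range=(1,2), stop_words=['the', 'a'])
matrix = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())
# unigrams plus bigrams such as 'first document' and 'second second'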
Feature Extraction with word2vec
– Open Source Tool by Google
– https://code.google.com/archive/p/word2vec/

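word2vec learns dense, low-dimensional word vectors rather than sparse counts. A minimal sketch using the gensim library (an assumed substitute; the slide links Google's original C tool):

from gensim.models import Word2Vec

sentences = [["friend", "roman", "countryman"],
             ["roman", "empire", "history"]]          # toy training corpus
model = Word2Vec(sentences, vector_size=50, min_count=1)
print(model.wv["roman"])                # a 50-dimensional dense vector
print(model.wv.most_similar("roman"))   # nearest words in the vector space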


Unstructured data in Supervised Regression

– Unstructured data can also be the input for regression (predict a numeric value, not a categorical label)
– Example: Predict box-office returns from movie reviews
– Example: Predict share price changes from stock-market announcements


The Curse of Dimensionality

– From a theoretical point of view, increasing the number of features should lead to better performance.
– In practice, however, including ever more features often leads to worse performance (the curse of dimensionality).
– The number of training examples required increases exponentially with the dimensionality.
– Dimensionality reduction can improve the prediction accuracy.
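One common way to reduce the dimensionality of sparse text features is truncated SVD (latent semantic analysis); a sketch with scikit-learn, not a method prescribed by the lecture:

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ['This is the first document.',
          'This is the second second document.',
          'And the third one.',
          'Is this the first document?']

tfidf = TfidfVectorizer().fit_transform(corpus)   # sparse: one column per term
svd = TruncatedSVD(n_components=2)                # keep 2 latent dimensions
reduced = svd.fit_transform(tfidf)
print(reduced.shape)                              # (4, 2)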


Information Extraction



Many language tasks require structured prediction

– Structured prediction: problems where the output is a structured object, rather than discrete or real values
– E.g., sequence tagging for part-of-speech (POS) tagging or named entity recognition

Word       POS tag
This       DT (determiner)
is         VBZ (verb)
a          DT (determiner)
tagged     JJ (adjective)
sentence   NN (noun)
.          .

https://en.wikipedia.org/wiki/Structured_prediction
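A POS-tagging sketch with spaCy, the library introduced later in this lecture (assumes the pre-trained model has been installed via: python -m spacy download en_core_web_sm):

import spacy

nlp = spacy.load("en_core_web_sm")   # pre-trained English pipeline
doc = nlp("This is a tagged sentence.")
for token in doc:
    print(token.text, token.tag_)    # Penn Treebank tags: DT, VBZ, ...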


Parsing Natural Language

https://spacy.io/demos/displacy?share=2473569563126265042



Natural Language Processing (NLP)

“Interdisciplinary field concerned with modelling natural language from a computational perspective.”
https://en.wikipedia.org/wiki/Computational_linguistics

– Understanding
– Tokenisation
– POS tagging
– Parsing
– Generation
– Summarisation


Information Extraction

“Task of automatically extracting structured information from unstructured and/or semi-structured documents.”
https://en.wikipedia.org/wiki/Information_extraction

– Named entity recognition
– Entity disambiguation
– Relation extraction


Knowledge Base Population (KBP)
– Aim is to build structured knowledge bases from massive
unstructured text corpora
– Two subtasks:
– Entity linking: identify mentions of entities, link to KB
– Slot filling: extract and populate facts for given entity



Entity Linking

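The mention-detection step of entity linking is named entity recognition, which spaCy's pre-trained pipeline provides out of the box (a sketch; the example sentence is made up):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Barack Obama was born in Hawaii and served as US president.")
for ent in doc.ents:
    print(ent.text, ent.label_)   # e.g. ('Barack Obama', 'PERSON'), ('Hawaii', 'GPE')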


Slot Filling



Parsing etc. in Python

– spaCy is an open-source software library for advanced Natural Language Processing, written in the programming languages Python and Cython.
– Visualisation of parses is possible via https://spacy.io/demos/displacy
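A minimal dependency-parsing sketch with spaCy; displacy can also render the parse tree locally:

import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The quick brown fox jumps over the lazy dog.")
for token in doc:
    print(token.text, token.dep_, "->", token.head.text)

displacy.serve(doc, style="dep")   # serves an interactive parse tree in the browser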


Review



Review: Unstructured Data

Objective
Understanding how to process unstructured text data for categorisation and forecasting. Role of tokenisation and feature vectors.

Readings
– Data Science from Scratch, Ch. 13
– Doing Data Science, Ch. 4

Lecture
– Text as unstructured data
– Text classification
– Information Extraction from Text

Use Cases
– Spam detection
– Predicting box office returns
– Wikipedia


Next Week

– No tutorials this week (Good Friday = public holiday)
– Next week is the Easter break
– Lectures start again on 26 April (semester Week 8)
– Week 8 topic: Geo-Spatial Data

– If you haven't done so yet, please join a group in your lab class for the upcoming assignment, which we will publish soon


Learn much more in COMP5046: Natural Language Processing
– “This unit introduces computational linguistics and the statistical techniques and
algorithms used to automatically process natural languages (such as English or Chinese).
It will review the core statistics and information theory, and the basic linguistics,
required to understand statistical natural language processing (NLP). Statistical NLP is
used in a wide range of applications, including information retrieval and extraction;
question answering; machine translation; and classifying and clustering of documents.
This unit will explore the key challenges of natural language to computational
modelling, and the state of the art approaches to the key NLP sub-tasks, including
tokenisation, morphological analysis, word sense representation, part-of-speech
tagging, named entity recognition and other information extraction, text categorisation,
phrase structure parsing and dependency parsing. You will implement many of these
sub-tasks in labs and assignments. The unit will also investigate the annotation process
that is central to creating training data for statistical NLP systems. You will annotate
data as part of completing a real-world NLP task.”
– https://www.sydney.edu.au/units/COMP5046
Some References
Text Classification
– http://www.python-course.eu/text_classification_python.php
– https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction
– https://adataanalyst.com/scikit-learn/countvectorizer-sklearn-example/
Information retrieval and analytics
– Manning et al. Introduction to information retrieval
http://nlp.stanford.edu/IR-book/
– Kibana user guide
https://www.youtube.com/watch?v=Kqs7UcCJquM
– Setting up Elasticsearch and Kibana for analytics
https://www.youtube.com/watch?v=wHWb1d_VGp8
NLP Parsing (not examinable)
– https://spacy.io/demos/displacy
Information extraction (not examinable)
– Ji and Grishman. KBP: successful approaches and challenges.
  http://www.aclweb.org/anthology/P11-1115.pdf
– Roth and Ji. Wikipedia and beyond (tutorial).
  http://nlp.cs.rpi.edu/paper/wikificationtutorial.pdf
– Bordes and Gabrilovich. Web-scale knowledge graphs.
  http://www.cs.technion.ac.il/~gabr/publications/papers/KDD14-T2-Bordes-Gabrilovich.pdf
