About the Reviewers

Afroz Hussain is a data scientist by profession and is currently associated with a
US-based data science and ML start-up, PredictifyMe. He has worked on many data
science projects and has extensive experience with Python, scikit-learn, and text
mining with NLTK. He has more than 10 years of programming and software
development experience, along with experience in data analysis and business
intelligence projects. He has acquired new skills in data science by taking online
courses and taking part in Kaggle competitions.

Sujit Pal works at Elsevier Labs, which is a research and development group within
the Reed-Elsevier PLC group. His interests are in the fields of information retrieval,
distributed processing, ontology development, natural language processing, and
machine learning. He is also interested in, and writes code in, Python, Scala, and Java.
He combines his skills in these areas in order to help build new features or feature
improvements for different products across the company. He believes in lifelong
learning and blogs about his experiences at sujitpal.blogspot.com.

Kumar Raj serves as a data scientist II at Hewlett-Packard Software solutions in
the research and development department, where he is responsible for developing
the analytics layer for core HP software products. He is a graduate of the Indian
Institute of Technology, Kharagpur, and has more than 2 years of experience
in various big data analytics domains, namely text analytics, web crawling and
scraping, HR analytics, virtualization system performance optimization, and
climate change forecasting.

www.it-ebooks.info
www.PacktPub.com

Support files, eBooks, discount offers, and more
For support files and downloads related to your book, please visit www.PacktPub.com.

Did you know that Packt offers eBook versions of every book published, with PDF and ePub
files available? You can upgrade to the eBook version at www.PacktPub.com, and as a print
book customer, you are entitled to a discount on the eBook copy. Get in touch with us at
service@packtpub.com for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up
for a range of free newsletters, and receive exclusive discounts and offers on Packt books
and eBooks.
https://www2.packtpub.com/books/subscription/packtlib

Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book
library. Here, you can search, access, and read Packt's entire library of books.

Why subscribe?
• Fully searchable across every book published by Packt
• Copy and paste, print, and bookmark content
• On demand and accessible via a web browser

Free access for Packt account holders

If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib
today and view 9 entirely free books. Simply use your login credentials for immediate access.

Table of Contents
Preface v
Chapter 1: Introduction to Natural Language Processing 1
Why learn NLP? 2
Let's start playing with Python! 5
Lists 5
Helping yourself 6
Regular expression 8
Dictionaries 9
Writing functions 10
Diving into NLTK 11
Your turn 17
Summary 17
Chapter 2: Text Wrangling and Cleansing 19
What is text wrangling? 19
Text cleansing 22
Sentence splitter 22
Tokenization 23
Stemming 24
Lemmatization 26
Stop word removal 26
Rare word removal 27
Spell correction 28
Your turn 28
Summary 29
Chapter 3: Part of Speech Tagging 31
What is Part of speech tagging 31
Stanford tagger 34
Diving deep into a tagger 35
Sequential tagger 36
N-gram tagger 37
Regex tagger 38
Brill tagger 39
Machine learning based tagger 39
Named Entity Recognition (NER) 40
NER tagger 40
Your Turn 42
Summary 43
Chapter 4: Parsing Structure in Text 45
Shallow versus deep parsing 46
The two approaches in parsing 46
Why we need parsing 46
Different types of parsers 48
A recursive descent parser 48
A shift-reduce parser 48
A chart parser 49
A regex parser 49
Dependency parsing 50
Chunking 52
Information extraction 55
Named-entity recognition (NER) 56
Relation extraction 57
Summary 58
Chapter 5: NLP Applications 59
Building your first NLP application 60
Other NLP applications 63
Machine translation 63
Statistical machine translation 65
Information retrieval 65
Boolean retrieval 66
Vector space model 66
The probabilistic model 67
Speech recognition 68
Text classification 68
Information extraction 70
Question answering systems 70
Dialog systems 71
Word sense disambiguation 71
Topic modeling 71
Language detection 72
Optical character recognition 72
Summary 72
Chapter 6: Text Classification 73
Machine learning 74
Text classification 75
Sampling 77
Naive Bayes 80
Decision trees 83
Stochastic gradient descent 84
Logistic regression 85
Support vector machines 85
The Random forest algorithm 87
Text clustering 87
K-means 88
Topic modeling in text 89
Installing gensim 89
References 91
Summary 92
Chapter 7: Web Crawling 93
Web crawlers 93
Writing your first crawler 94
Data flow in Scrapy 97
The Scrapy shell 98
Items 103
The Sitemap spider 105
The item pipeline 106
External references 108
Summary 108
Chapter 8: Using NLTK with Other Python Libraries 109
NumPy 110
ndarray 110
Indexing 111
Basic operations 111
Extracting data from an array 113
Complex matrix operations 114
Reshaping and stacking 116
Random numbers 118