11/9/22, 11:10 PM Getting Started with spaCy for NLP - KDnuggets

Getting Started with spaCy for NLP

In this blog, we will explore how to get started with spaCy, from installation through the various functionalities it provides.

By Yesha Shastri, AI Developer and Writer on November 1, 2022 in Natural Language Processing
https://www.kdnuggets.com/2022/11/getting-started-spacy-nlp.html 1/16
Image by Editor

Nowadays, NLP is one of the fastest-growing areas of AI, with applications across industries such as healthcare, retail, and banking, to name a few. As there is an increasing need to develop fast and scalable solutions, spaCy has become one of the go-to NLP libraries for developers. NLP products are developed to make sense of existing text data. The work mainly revolves around answering questions such as ‘What is the context of the data?’, ‘Does it represent any bias?’, and ‘Is there some similarity among words?’ in order to build valuable solutions.

spaCy is a library that helps to deal with such questions, and it provides a bunch of modules that are easy to plug and play. It is an open-source and production-friendly library that makes development and deployment smooth and efficient. Moreover, spaCy was not built with a research-oriented approach; it provides a limited set of functionalities for users to choose from, instead of multiple options, so that they can develop quickly.

In this blog, we will explore how to get started with spaCy, from installation through the various functionalities it provides.

Installation

To install spaCy enter the following command:

pip install spacy

spaCy generally requires trained pipelines to be loaded in order to access most of its functionalities. These pipelines contain pretrained models that perform prediction for some of the commonly used tasks. The pipelines are available in multiple languages and in multiple sizes. Here, we will install the small and medium-sized pipelines for English.

python -m spacy download en_core_web_sm
python -m spacy download en_core_web_md

Voila! You are now all set to start using spaCy.

Loading the Pipeline

Here, we will load the small English pipeline.

import spacy
nlp = spacy.load("en_core_web_sm")

The pipeline is now loaded into the nlp object.

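If the pipeline package has not been downloaded, spacy.load raises an OSError. The sketch below wraps the call in a small helper; `load_pipeline` is a hypothetical name used only for illustration, and the fallback to a blank English pipeline is an assumption made so the script keeps running, not something the article prescribes.

```python
import spacy


def load_pipeline(name="en_core_web_sm"):
    """Load a trained pipeline, falling back to a blank English pipeline.

    `load_pipeline` is a hypothetical helper, not part of spaCy.
    """
    try:
        return spacy.load(name)
    except OSError:
        # spacy.load raises OSError when the package cannot be found.
        # A blank pipeline still tokenizes, but has no tagger, parser, or NER.
        print(f"Pipeline '{name}' not found; run: python -m spacy download {name}")
        return spacy.blank("en")


nlp = load_pipeline()
print(nlp.lang)        # language code of the loaded pipeline
print(nlp.pipe_names)  # component names (an empty list for a blank pipeline)
```

Printing nlp.pipe_names is a quick way to see whether the trained components are actually available.
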
Next, we will be exploring the various functionalities of spaCy using an example. 

Tokenization

Tokenization is the process of splitting text into smaller units called tokens. For example, in a sentence the tokens would be words, whereas in a paragraph the tokens could be sentences. This step helps to understand the content by making it easy to read and process.

We first define a string. 
text = "KDNuggets is a wonderful website to learn machine learning with python"

Now we call the ‘nlp’ object on ‘text’ and store the result in a ‘doc’ object. The ‘doc’ object contains all the information about the text: the words, the whitespace, and so on.

doc = nlp(text)

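To see that the whitespace really is kept, here is a minimal sketch. It assumes a blank English pipeline (spacy.blank), which contains only the tokenizer, so no download is needed. Each token stores its own text and any trailing whitespace, so the original string can be rebuilt exactly.

```python
import spacy

# A blank English pipeline is enough here: it contains only the tokenizer.
nlp = spacy.blank("en")
text = "KDNuggets is a wonderful website to learn machine learning with python"
doc = nlp(text)

# Each token records its own text plus any trailing whitespace.
for token in doc[:3]:
    print(repr(token.text), repr(token.whitespace_))

# Joining text_with_ws for every token reproduces the original string.
rebuilt = "".join(token.text_with_ws for token in doc)
print(rebuilt == text)
print(len(doc))  # number of tokens
```
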

‘doc’ can be iterated over to walk through the tokens of the text. Each token has a ‘.text’ attribute that gives its text:

for token in doc:
    print(token.text)

output:

KDNuggets
is
a
wonderful
website
to
learn
machine
learning
with
python

In addition to splitting the text on whitespace, the tokenization algorithm also performs further checks on each split substring.

Source: spaCy documentation

As shown in the above image, after splitting the text on whitespace, the algorithm checks each substring for exceptions. The word ‘Let’s’ is a contraction, so an exception rule splits it again into ‘Let’ and ‘’s’. The punctuation marks are also split off. Moreover, the rules make sure not to split abbreviations like ‘N.Y.’ and treat them as a single token.

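The behaviour described above can be checked with a short sketch. It assumes a blank English pipeline, which is enough because only the tokenizer is involved, and uses a sentence along the lines of the one in the spaCy documentation figure. nlp.tokenizer.explain reports which rule produced each token.

```python
import spacy

nlp = spacy.blank("en")  # tokenizer only; no trained components needed
sentence = "Let's go to N.Y.!"

doc = nlp(sentence)
tokens = [token.text for token in doc]
print(tokens)

# tokenizer.explain yields (rule, substring) pairs for debugging.
for rule, substring in nlp.tokenizer.explain(sentence):
    print(f"{substring!r:8} {rule}")
```
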
Stop Words

One of the important preprocessing steps in NLP is to remove stop words from text. Stop words are common function words such as ‘to’, ‘with’, ‘is’, etc. which provide minimal context. spaCy allows easy identification of stop words through the ‘is_stop’ attribute of each token.

We iterate over all the tokens and check the ‘is_stop’ attribute.

for token in doc:
    if token.is_stop:
        print(token)

output:

is
to
with

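In practice the goal is usually to drop the stop words, not to print them. The sketch below filters them out. It assumes a blank English pipeline, which works because ‘is_stop’ is looked up from the language's default stop list and does not need a trained model.

```python
import spacy

# is_stop is a lexical attribute, so no trained model is needed.
nlp = spacy.blank("en")
doc = nlp("KDNuggets is a wonderful website to learn machine learning with python")

# Keep only the tokens that are not stop words.
content_words = [token.text for token in doc if not token.is_stop]
print(content_words)

# The default English stop list is exposed on the language defaults.
print(len(nlp.Defaults.stop_words))
print("with" in nlp.Defaults.stop_words)
```
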
Lemmatization

Lemmatization is another important preprocessing step for NLP pipelines. It reduces the different inflected forms of a single word to a common root form, the lemma, which cuts down the redundancy of same-meaning words. For example, it will convert ‘is’ -> ‘be’, ‘eating’ -> ‘eat’, and ‘N.Y.’ -> ‘n.y.’. With spaCy, words can be easily converted to their lemmas using the ‘.lemma_’ attribute of each token.

We iterate over all the tokens and print the ‘.lemma_’ attribute.

for token in doc:
    print(token.lemma_)

output:

kdnugget
be
wonderful
website
to
learn
machine
learning
with
python

Part-of-Speech (POS) Tagging

Automated POS tagging gives us an idea of the sentence structure by showing which categories of words dominate the content. This information forms an essential part of understanding the context. spaCy parses the content and tags the individual tokens with their respective parts of speech, available through the ‘.pos_’ attribute of each token.

We iterate over all the tokens and print the ‘.pos_’ attribute.

for token in doc:
    print(token.text, ':', token.pos_)

output:

KDNuggets : NOUN
is : AUX
a : DET
wonderful : ADJ
website : NOUN
to : PART
learn : VERB
machine : NOUN
learning : NOUN
with : ADP
python : NOUN

Dependency Parsing

Every sentence has an inherent structure in which the words have an interdependent relationship with each other. Dependency parsing can be thought of as building a directed graph wherein the nodes are words and the edges are relationships between the words. It extracts information on how one word relates to another grammatically: whether it is a subject, an auxiliary verb, a root, and so on. spaCy exposes this through the ‘.dep_’ attribute of each token, which describes the token's syntactic dependency.

We iterate over all the tokens and print the ‘.dep_’ attribute.

for token in doc:
    print(token.text, '-->', token.dep_)

output:

KDNuggets --> nsubj
is --> ROOT
a --> det
wonderful --> amod
website --> attr
to --> aux
learn --> relcl
machine --> compound
learning --> dobj
with --> prep
python --> pobj

Named Entity Recognition


All real-world objects have a name assigned to them for recognition, and likewise they are grouped into categories. For instance, the terms ‘India’, ‘U.K.’, and ‘U.S.’ fall under the category of countries whereas ‘Microsoft’, ‘Google’, and ‘Facebook’ belong to the category of organizations. spaCy already has trained models in the pipeline that can determine and predict the categories of such named entities.

We will access the named entities through the ‘.ents’ attribute of the ‘doc’ object. We will display the text, start character, end character, and label of each entity.

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

output:

KDNuggets 0 9 ORG
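
The short label names in the outputs above (POS tags, dependency labels, and entity labels) can be hard to read at first. As a small aid, spacy.explain looks up a description for a label in spaCy's glossary; the selection of labels below is just a sample taken from the earlier outputs.

```python
import spacy

# Labels taken from the outputs above: POS tags, dependency labels, entity label.
labels = ["AUX", "ADP", "PART", "nsubj", "relcl", "pobj", "ORG"]

for label in labels:
    # spacy.explain returns a short description, or None for unknown labels.
    print(f"{label:6} {spacy.explain(label)}")
```

This lookup does not require a trained pipeline, since it only reads a built-in glossary.
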

Word Vectors and Similarity


Often in NLP, we wish to analyze the similarity of words, sentences, or documents, which can be used for applications such as recommender systems or plagiarism detection tools, to name a few. The similarity score is calculated by finding the distance between the word embeddings, i.e., the vector representations of words. spaCy provides this functionality with the medium and large pipelines, which ship with word vectors. The large pipeline includes a bigger vector table and is generally more accurate. However, we will use the medium pipeline here just for the sake of understanding.

We first define the sentences to be compared for similarity.

nlp = spacy.load("en_core_web_md")

doc1 = nlp("Summers in India are extremely hot.")
doc2 = nlp("During summers a lot of regions in India experience severe temperatures.")
doc3 = nlp("People drink lemon juice and wear shorts during summers.")

print("Similarity score of doc1 and doc2:", doc1.similarity(doc2))
print("Similarity score of doc1 and doc3:", doc1.similarity(doc3))

output:

Similarity score of doc1 and doc2: 0.7808246189842116
Similarity score of doc1 and doc3: 0.6487306770376172
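
To make the score less of a black box: by default, spaCy's similarity is the cosine similarity between vectors, and a document's vector defaults to the average of its token vectors. The sketch below reproduces that idea in plain Python. The three-dimensional vectors are made up for illustration and are not real spaCy embeddings; `cosine_similarity` and `average` are hypothetical helper names.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention used here for an all-zero vector
    return dot / (norm_u * norm_v)


def average(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Made-up 3-dimensional "word vectors", for illustration only.
toy_vectors = {
    "summer": [0.9, 0.1, 0.0],
    "hot": [0.8, 0.2, 0.1],
    "juice": [0.1, 0.9, 0.2],
}

# A document vector is the average of its word vectors.
doc_a = average([toy_vectors["summer"], toy_vectors["hot"]])
doc_b = average([toy_vectors["summer"], toy_vectors["juice"]])
print(round(cosine_similarity(doc_a, doc_b), 3))
```

Because the document vector is an average, word order is ignored, which is one reason two sentences sharing many words can score high even when their meanings differ.
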

Rule-based Matching 

Rule-based matching can be considered similar to regex, wherein we specify the pattern to be found in the text. spaCy’s matcher module not only does this but also provides access to document information such as tokens, POS tags, lemmas, dependency structures, etc., which makes it possible to extract words based on multiple additional conditions.

Here, we will first create a matcher object that shares the pipeline's vocabulary. Next, we will define the pattern of text to be looked for and add it as a rule to the matcher. Finally, we will call the matcher on the input sentence.

from spacy.matcher import Matcher

matcher = Matcher(nlp.vocab)

doc = nlp("cold drinks help to deal with heat in summers")
pattern = [{'TEXT': 'cold'}, {'TEXT': 'drinks'}]
matcher.add('rule_1', [pattern], on_match=None)

matches = matcher(doc)
for _, start, end in matches:
    matched_segment = doc[start:end]
    print(matched_segment.text)

output:

cold drinks
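
The ‘TEXT’ key above matches the exact token text, so ‘Cold drinks’ at the start of a sentence would be missed. As a hedged variation, the sketch below uses ‘LOWER’ for case-insensitive matching, ‘IN’ to allow several alternatives, and the ‘?’ operator to make a token optional. The rule name `drink_rule` and the example sentence are made up, and a blank English pipeline is assumed because these attributes only need the tokenizer.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # LOWER, IN and OP only need the tokenizer
matcher = Matcher(nlp.vocab)

# "cold" or "hot" in any casing, an optional "fizzy", then "drinks".
pattern = [
    {"LOWER": {"IN": ["cold", "hot"]}},
    {"LOWER": "fizzy", "OP": "?"},
    {"LOWER": "drinks"},
]
matcher.add("drink_rule", [pattern])  # "drink_rule" is a made-up rule name

doc = nlp("Cold drinks help in summers, and hot drinks help in winters.")
found = [doc[start:end].text for _, start, end in matcher(doc)]
print(found)
```
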

Let's also look at another example wherein we attempt to find the word 'book' but only when it is used as a noun.

from spacy.matcher import Matcher

matcher = Matcher(nlp.vocab)

doc1 = nlp("I am reading the book called Huntington.")
doc2 = nlp("I wish to book a flight ticket to Italy.")

pattern2 = [{'TEXT': 'book', 'POS': 'NOUN'}]
matcher.add('rule_2', [pattern2], on_match=None)

matches = matcher(doc1)
print(doc1[matches[0][1]:matches[0][2]])

matches = matcher(doc2)
print(matches)

output:

book
[]

In this blog, we looked at how to install and get started with spaCy. We also explored the various basic functionalities it provides, such as tokenization, lemmatization, dependency parsing, part-of-speech tagging, named entity recognition, and so on. spaCy is a really convenient library when it comes to developing NLP pipelines for production purposes. Its detailed documentation, simplicity of use, and variety of functions make it one of the most widely used libraries for NLP.

Yesha Shastri is a passionate AI developer and writer pursuing a Master’s in Machine Learning from Université de Montréal. Yesha is intrigued to explore responsible AI techniques to solve challenges that benefit society and share her learnings with the community.