In this blog, we will explore how to get started with spaCy, from installation through the various functionalities it provides.
By Yesha Shastri, AI Developer and Writer on November 1, 2022 in Natural Language Processing
https://www.kdnuggets.com/2022/11/getting-started-spacy-nlp.html 1/16
11/9/22, 11:10 PM Getting Started with spaCy for NLP - KDnuggets
Image by Editor
NLP is one of the fastest-growing areas of AI, with applications across industries such as healthcare, retail, and banking. Given the increasing need for fast and scalable solutions, spaCy is one of the go-to NLP libraries for developers. NLP products are built to make sense of existing text data. The work mainly revolves around answering questions such as ‘What is the context of the data?’, ‘Does it represent any bias?’, and ‘Is there some similarity among words?’ in order to build valuable solutions.
spaCy helps answer such questions and provides a set of modules that are easy to plug and play. It is an open-source, production-friendly library that makes development and deployment smooth and efficient. Moreover, spaCy was not built with a research-oriented approach, so it offers a limited set of functionalities to choose from rather than many alternatives, which helps developers build quickly.

In this blog, we will explore how to get started with spaCy, from installation through the various functionalities it provides.

Installation
spaCy generally requires a trained pipeline to be loaded in order to access most of its functionalities. These pipelines contain pretrained models that make predictions for some of the most commonly used tasks. The pipelines are available in multiple languages and in multiple sizes. Here, we will install the small and medium-sized pipelines for English.

python -m spacy download en_core_web_sm
python -m spacy download en_core_web_md

Voila! You are now all set to start using spaCy.

Loading the Pipeline

import spacy
nlp = spacy.load("en_core_web_sm")

The pipeline is now loaded into the nlp object.
Next, we will be exploring the various functionalities of spaCy using an example.

Tokenization

The ‘doc’ object returned by calling ‘nlp’ on a piece of text can be iterated over to step through the text token by token. Each token has a ‘.text’ attribute that gives its text:

for token in doc:
    print(token.text)

output:

KDNuggets
is
a
wonderful
website
to
learn
machine
learning
with
python
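
The loop above assumes that a `doc` object already exists. As a minimal, self-contained sketch, the snippet below rebuilds the example sentence from the tokens in the output (an assumption, since the sentence is inferred rather than quoted) and tokenizes it with `spacy.blank("en")`, a tokenizer-only English pipeline that needs no downloaded model.

```python
import spacy

# Tokenizer-only English pipeline; no trained model download is needed for this step.
nlp = spacy.blank("en")

# Assumed example sentence, reconstructed from the token output above.
doc = nlp("KDNuggets is a wonderful website to learn machine learning with python")

tokens = [token.text for token in doc]
print(tokens)
print(len(doc))
```

A blank pipeline only tokenizes. Lemmas, POS tags, dependencies, and entities need a trained pipeline such as `en_core_web_sm`.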

In addition to splitting the text on whitespace, the tokenization algorithm also checks each resulting substring against tokenizer exception rules and for prefixes, suffixes, and infixes (such as punctuation) that should be split off.

    print(token)

output:

is
to
with
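
The start of the loop that produced this output is not shown. Since the printed words are all common English stop words, one plausible reading is a filter on the token attribute `is_stop`. A minimal sketch under that assumption, using an example sentence inferred from the outputs in this article and a tokenizer-only pipeline:

```python
import spacy

# Stop-word flags are lexical attributes, so a blank pipeline is enough here.
nlp = spacy.blank("en")

# Assumed example sentence, inferred from the outputs shown in this article.
doc = nlp("KDNuggets is a wonderful website to learn machine learning with python")

# Keep only the tokens that spaCy's English stop-word list flags.
stop_words = [token.text for token in doc if token.is_stop]
print(stop_words)
```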
Lemmatization
Lemmatization is another important preprocessing step for NLP pipelines. It collapses the different forms of a single word, reducing the redundancy of same-meaning words, by converting each word to its root lemma. For example, it will convert ‘is’ -> ‘be’, ‘eating’ -> ‘eat’, and ‘N.Y.’ -> ‘n.y.’. With spaCy, words can easily be converted to their lemmas using the ‘.lemma_’ attribute of each token in the ‘doc’ object.

We iterate over all the tokens and read the ‘.lemma_’ attribute of each.

for token in doc:
    print(token.lemma_)

output:

kdnugget
be
wonderful
website
to
learn
machine
learning
with
python

Part-of-Speech (POS) Tagging

Automated POS tagging gives us an idea of the sentence structure by showing which categories of words dominate the content and which do not. This information is an essential part of understanding the context. spaCy parses the content and tags the individual tokens with their respective parts of speech, available through the ‘.pos_’ attribute of each token in the ‘doc’ object.

We iterate over all the tokens and read the ‘.pos_’ attribute of each.

for token in doc:
    print(token.text, ':', token.pos_)

output:

KDNuggets : NOUN
is : AUX
a : DET
wonderful : ADJ
website : NOUN
to : PART
learn : VERB
machine : NOUN
learning : NOUN
with : ADP
python : NOUN
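
To make the idea of a dominant word category concrete, the tags can be tallied. The sketch below does not run the tagger. It builds a `Doc` by hand from the words and tags listed in the output above (so no trained model is required) and counts the tags with `collections.Counter`.

```python
from collections import Counter

import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Words and coarse POS tags copied from the output above, assigned by hand.
words = ["KDNuggets", "is", "a", "wonderful", "website", "to",
         "learn", "machine", "learning", "with", "python"]
pos = ["NOUN", "AUX", "DET", "ADJ", "NOUN", "PART",
       "VERB", "NOUN", "NOUN", "ADP", "NOUN"]
doc = Doc(nlp.vocab, words=words, pos=pos)

# Tally how often each part of speech occurs.
pos_counts = Counter(token.pos_ for token in doc)
print(pos_counts.most_common(1))  # [('NOUN', 5)]
```

With a trained pipeline, the same `Counter` line works unchanged on the `doc` returned by `nlp(...)`.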
Dependency Parsing
Every sentence has an inherent structure in which the words depend on one another. A dependency parse can be thought of as a directed graph in which the nodes are words and the edges are the relationships between them. It captures the grammatical role each word plays relative to another: whether it is a subject, an auxiliary verb, a root, and so on. In spaCy, each token in the ‘doc’ object has a ‘.dep_’ attribute that describes its syntactic dependency.

We iterate over all the tokens and read the ‘.dep_’ attribute of each.

for token in doc:
    print(token.text, '-->', token.dep_)

output:

KDNuggets --> nsubj
is --> ROOT
a --> det
to --> aux
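
The directed-graph view can be made explicit by listing the edges from each head to its dependents. The sketch below uses a hypothetical three-word sentence with a hand-written parse (it is not parser output), which lets it run without a trained model. With a trained pipeline, the same list comprehension applies to any parsed `doc`.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Hypothetical sentence with a hand-written parse, for illustration only.
words = ["spaCy", "is", "fast"]
heads = [1, 1, 1]  # index of each token's head; the root points to itself
deps = ["nsubj", "ROOT", "acomp"]
doc = Doc(nlp.vocab, words=words, heads=heads, deps=deps)

# Each non-root token contributes one directed edge: head --dep--> child.
edges = [(token.head.text, token.dep_, token.text)
         for token in doc if token.head.i != token.i]
print(edges)
```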

Named Entity Recognition

Real-world objects have names by which they are recognized, and those names can be grouped into categories. For instance, the terms ‘India’, ‘U.K.’, and ‘U.S.’ fall under the category of countries, whereas ‘Microsoft’, ‘Google’, and ‘Facebook’ belong to the category of organizations. spaCy's trained pipelines already include models that can detect such named entities and predict their categories.

We will access the named entities through the ‘.ents’ attribute of the ‘doc’ object. We will display the text, start character, end character, and label of each entity.

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

output:

KDNuggets 0 9 ORG

doc1 = nlp("Summers in India are extremely hot.")
doc3 = nlp("People drink lemon juice and wear shorts during summers.")

print("Similarity score of doc1 and doc2:", doc1.similarity(doc2))
print("Similarity score of doc1 and doc3:", doc1.similarity(doc3))

output:

Similarity score of doc1 and doc2: 0.7808246189842116
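
By default, `Doc.similarity` estimates similarity as the cosine of the angle between the two documents' vectors, each of which is the average of its word vectors. The pure-Python sketch below shows that calculation on made-up three-dimensional word vectors. The values are hypothetical and chosen for illustration; real pipelines such as `en_core_web_md` use vectors with hundreds of dimensions.

```python
import math


def average(vectors):
    """Element-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical word vectors for three tiny "documents".
doc_a = average([[1.0, 0.0, 1.0], [1.0, 2.0, 1.0]])    # averages to [1.0, 1.0, 1.0]
doc_b = average([[2.0, 2.0, 2.0], [4.0, 4.0, 4.0]])    # averages to [3.0, 3.0, 3.0]
doc_c = average([[1.0, -1.0, 0.0], [1.0, -1.0, 0.0]])  # averages to [1.0, -1.0, 0.0]

sim_ab = cosine_similarity(doc_a, doc_b)  # same direction, so close to 1
sim_ac = cosine_similarity(doc_a, doc_c)  # orthogonal, so close to 0
print(sim_ab, sim_ac)
```

Because the score depends only on direction, two documents whose averaged vectors point the same way score close to 1 regardless of length, which is why the small pipelines, which ship without word vectors, are a poor fit for this feature.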
Rule-based Matching
Rule-based matching is similar to regex in that we specify a pattern to be found in the text. spaCy’s matcher module not only does this but also has access to document information such as tokens, POS tags, lemmas, and dependency structures, which makes it possible to extract words based on multiple additional conditions.

Here, we will first create a matcher object that shares the pipeline's vocabulary. Next, we will define the pattern of text to look for and add it as a rule to the matcher. Finally, we will call the matcher on the input sentence.

from spacy.matcher import Matcher

matcher = Matcher(nlp.vocab)

doc = nlp("cold drinks help to deal with heat in summers")

# Assumed pattern, inferred from the output of this example
pattern = [{'TEXT': 'cold'}, {'TEXT': 'drinks'}]
matcher.add('rule_1', [pattern], on_match=None)
matches = matcher(doc)

for _, start, end in matches:
    matched_segment = doc[start:end]
    print(matched_segment.text)

output:

cold drinks
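
Each match returned by the matcher is a `(match_id, start, end)` tuple, where `start` and `end` are token indices and `match_id` is the hash of the rule name. The small self-contained sketch below assumes a tokenizer-only pipeline and a hypothetical rule name `COLD_DRINKS`. It recovers the rule name from the hash and matches case-insensitively via the `LOWER` attribute.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # tokenizer only; enough for text-based patterns
matcher = Matcher(nlp.vocab)

# Hypothetical rule name; LOWER compares against the lowercased token text.
matcher.add("COLD_DRINKS", [[{"LOWER": "cold"}, {"LOWER": "drinks"}]])

doc = nlp("Cold drinks help to deal with heat in summers")

results = []
for match_id, start, end in matcher(doc):
    rule_name = nlp.vocab.strings[match_id]  # look the hash up in the string store
    results.append((rule_name, start, end, doc[start:end].text))
print(results)
```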
Let's also look at another example wherein we attempt to find the word 'book' but only

matcher = Matcher(nlp.vocab)
doc1 = nlp("I am reading the book called Huntington.")
on_match=None)
matches = matcher(doc2)

book
[]
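
The sentence introducing this example is cut off and its pattern is not shown, so the following is only one plausible reading: restricting the match on 'book' by part of speech, so that the noun matches and the verb does not. The sketch assigns POS tags by hand instead of running a tagger, and both the pattern and the second sentence are hypothetical.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# Hypothetical pattern: the text "book", but only when tagged as a noun.
matcher.add("BOOK_NOUN", [[{"TEXT": "book", "POS": "NOUN"}]])

# Hand-assigned POS tags stand in for a trained tagger.
noun_doc = Doc(
    nlp.vocab,
    words=["I", "am", "reading", "the", "book", "called", "Huntington", "."],
    pos=["PRON", "AUX", "VERB", "DET", "NOUN", "VERB", "PROPN", "PUNCT"],
)
# Hypothetical second sentence in which "book" is a verb.
verb_doc = Doc(
    nlp.vocab,
    words=["Please", "book", "a", "table", "."],
    pos=["INTJ", "VERB", "DET", "NOUN", "PUNCT"],
)

noun_matches = [noun_doc[start:end].text for _, start, end in matcher(noun_doc)]
verb_matches = [verb_doc[start:end].text for _, start, end in matcher(verb_doc)]
print(noun_matches, verb_matches)
```

With a trained pipeline, the tags would come from the tagger and the same pattern could be applied to the `doc` objects returned by `nlp(...)`.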
In this blog, we looked at how to install and get started with spaCy. We also explored the basic functionalities it provides, such as tokenization, lemmatization, dependency parsing, part-of-speech tagging, and named entity recognition. spaCy is a convenient library for developing NLP pipelines for production. Its detailed documentation, ease of use, and variety of functions make it one of the most widely used libraries for NLP.