
spaCy Library

spaCy is a free, open-source library for advanced Natural Language Processing
(NLP) in Python. Unlike NLTK, which is widely used for teaching and research,
spaCy focuses on providing software for production use. spaCy also supports
deep learning workflows that let you connect statistical models trained by
popular machine learning libraries like TensorFlow, Keras, scikit-learn, or
PyTorch.

spaCy relies on models that are language-specific and come in different sizes.
You can load a spaCy model with spacy.load.

For example, here's how you would load the small English model, en_core_web_sm
(download it once first with: python -m spacy download en_core_web_sm).
import spacy
nlp = spacy.load('en_core_web_sm')

With the model loaded, you can process text like this:
doc = nlp("Today is a nice and sunny day, don't you think so?")

There's a lot you can do with the doc object you just created.

Tokenizing

Calling nlp on a string returns a document object that contains tokens. A token
is a unit of text in the document, such as an individual word or punctuation
mark. spaCy splits contractions like "don't" into two tokens, "do" and "n't".
You can see the tokens by iterating through the document.
for token in doc:
    print(token)
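
If you run this on the example sentence, each token prints on its own line:
"don't" shows up as the two tokens "do" and "n't", and the comma and question
mark are tokens of their own.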
Iterating through a document gives you token objects. Each of these tokens
comes with additional information. In most cases, the important ones
are token.lemma_ and token.is_stop.

Text preprocessing

There are a few types of preprocessing to improve how we model with words.
The first is "lemmatizing." The "lemma" of a word is its base form. For example,
"walk" is the lemma of the word "walking". So, when you lemmatize the word
"walking", you would convert it to "walk".
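
As a quick illustration, here's a minimal sketch that prints each token next to
its lemma (the sentence is just an invented example, and it reuses the nlp
pipeline loaded above):
# Print each token alongside its base form
for token in nlp("I was walking to the shops"):
    print(token.text, "->", token.lemma_)
Here "walking" comes back as "walk", and "was" comes back as "be".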

It's also common to remove stopwords. Stopwords are words that occur
frequently in the language and don't contain much information. English
stopwords include "the", "is", "and", "but", "not".
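
If you want to see the full list spaCy consults, the built-in English stopwords
are exposed as the STOP_WORDS set (a minimal sketch; the exact contents can
vary between spaCy versions):
from spacy.lang.en.stop_words import STOP_WORDS
# The default English stopword list behind token.is_stop
print(len(STOP_WORDS))
print(sorted(STOP_WORDS)[:10])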

With a spaCy token, token.lemma_ returns the lemma,
while token.is_stop returns the boolean True if the token is a stopword
(and False otherwise).
print("Token\t\tLemma\t\tStopword")
print("-" * 40)
for token in doc:
    print(f"{str(token)}\t\t{token.lemma_}\t\t{token.is_stop}")
