
12/04/2023, 19:35 How to Clean Text Data (Full Practical Walkthrough) - Fervent | Finance Courses, Investing Courses



How to Clean Text Data (Full Practical Walkthrough)

In this article, we’re going to learn how to clean text data.

So let’s get into it.


What is text cleaning?

Firstly, what exactly is text cleaning?


Ultimately, it’s just a process of transforming raw text into a format that’s
suitable for textual analysis.

Cleaning text data is imperative for any sort of textual analysis; and
naturally, the same applies for sentiment analysis
(https://www.ferventlearning.com/sentiment-analysis-an-overview/) or
more broadly, text mining as well.

And this holds regardless of whether you’re conducting sentiment analysis
(or other textual analysis) “manually”, or whether you’re using some sort of
machine learning algorithms.

Data cleansing is imperative for any sort of analysis. And textual analysis is
no exception.

Formally, text cleaning essentially involves vectorising text data.

Put differently, we’re going from, say, a text file or .txt file to some sort
of a vector.

Think of a column or a row in an Excel spreadsheet, or in a pandas
dataframe.

Text data, by definition and by construction, is unstructured. But the idea is
to move away from a blob of text to a format that’s a little more
structured.

What does text cleaning look like?

To help you understand what this looks like, let’s actually take a look at
what a blob of text looks like. And then see what the cleaned text looks
like.

Pre-Cleaning / processing

So right here, you’ve got just a bunch of text…


This is actually an excerpt from a management discussion and analysis or
MD&A filing.

And you can see that this looks like any ordinary blob of financial text.

So you’ve got some words. And then you’ve got your punctuation marks.
You’ve got some numbers, dates, parentheses, percentage signs, etc.

There’s a variety of different special characters in here. There’s a variety of
different words.

And a lot of it is not something we can actually use.


Post cleaning / processing

Ultimately, the only thing we’re really interested in is the actual words.

We don’t particularly care about the numbers. Or the symbols.

And nor do we care about punctuation marks.

We’re literally only really interested in the words.

And once you go through the cleaning process, here’s what the cleaned
text would look like…


We’ve now moved away from a blob of text to something that’s relatively
more structured in that it is a list of words.

This process of transforming text into a list of words, or a “bag of words”,
is also called “tokenization”. And each element inside the list is a “token”
(essentially a word).

You can see that all of the symbols have gone. There are no parentheses
and hyphens. All the punctuation marks have gone.

And all the numbers have gone, too.

Now we literally only have the words, without any unwanted characters.

How do we achieve this?

Well, it’s actually a three-step process, and the rest of this article walks
through each step.

How to clean text data using the 3 Step Process

Step 1: Remove numbers, symbols, and other unwanted characters

The 3 step process on how to clean text data starts with removing all the
numbers, symbols, and anything that’s not an alphabetic character from
the text.

So we remove literally anything that is not a word.


Why is removing non-alphabetic characters important?

Why is this necessary?

Because if we don’t do this, then we can essentially end up
underestimating sentiment, for example.

For instance, one of the ways to estimate sentiment is to use a
“proportional counts approach”.

Sentiment using a proportional counts approach can be estimated as:

$$\text{Sentiment} = \frac{\text{Number of words belonging to the sentiment language}}{\text{Total number of words in the document}}$$

Where you take the frequency counts of the (cleaned) words that belong to
a sentiment language (the numerator).

And you divide that by the total number of (cleaned) words in that
document (the denominator).
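
As a minimal sketch of the ratio just described, the snippet below divides the number of tokens found in a sentiment word list by the total number of tokens. The function name `proportional_sentiment`, the sample tokens, and the tiny `positive_words` set are illustrative assumptions, not taken from any particular library or lexicon.

```python
def proportional_sentiment(tokens, sentiment_words):
    """Share of tokens that belong to the given sentiment word list."""
    if not tokens:
        return 0.0  # avoid dividing by zero for an empty document
    matches = sum(1 for token in tokens if token in sentiment_words)
    return matches / len(tokens)


# Hypothetical, already-cleaned tokens and a toy positive word list.
cleaned_tokens = ["revenue", "growth", "strong", "costs", "improved"]
positive_words = {"growth", "strong", "improved"}

# 3 of the 5 tokens are in the word list.
score = proportional_sentiment(cleaned_tokens, positive_words)
print(score)
```

The empty-document guard is a design choice for this sketch; you may prefer to raise an error instead.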

Now, the total number of words to you and me intuitively would just include
words, right?

But if you don’t eliminate things like the numbers, symbols, and other
non-alphabetic characters…

Then it’s possible for the program to essentially include those symbols and
numbers as words. So essentially, the number of words would be
significantly higher than it actually is.

Because it’s counting the symbols as individual words. You and I, as
humans, know they’re not words.

But unless we explicitly code in the requirement to ignore numbers,
symbols and punctuation marks, and the like…

Unless we do that, the program is likely going to end up counting those
symbols and numbers as words.

So that’s the technical reason as to why we need to remove non-alphabetic
characters.
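
To make the inflated word count concrete, here is a small sketch. The sentence is made up for illustration (it is not the MD&A excerpt), and splitting on whitespace is just one simple way a program might count "words".

```python
# Hypothetical sentence with numbers and symbols separated by spaces.
raw_text = "Revenue grew 12 % in 2022 ( vs 8 % in 2021 ) ."

# A naive whitespace split treats every chunk as a "word".
naive_tokens = raw_text.split()

# Keeping only purely alphabetic chunks gives the count a human would expect.
word_tokens = [token for token in naive_tokens if token.isalpha()]

print(len(naive_tokens))  # every chunk, including numbers and symbols
print(len(word_tokens))   # alphabetic chunks only
```

With this sentence, the naive count is almost three times the number of actual words, so a ratio that uses it as the denominator would be pulled down accordingly.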

The fundamental rationale for removing non-alphabetic characters

But there’s also an underlying reasoning or rationale.

And that’s simply because numbers aren’t words. Symbols aren’t words.
And punctuation marks aren’t words.

If we’re trying to establish the words which relate to a specific sentiment
language, then numbers, symbols and punctuation marks are completely
irrelevant.

For us to achieve that objective, we only need the words.

We don’t need all of the other stuff that just happens to be inside a blob of
text.

And this is why it’s important for us to remove all of the non-alphabetic
characters as the first step in our text cleaning process.

Note that removing unwanted characters is fairly straightforward, using
either regular expressions or other built-in text cleaning functions, for
example, Python’s .isalpha() method.
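
The two options mentioned above can be sketched as follows. The sample sentence is made up, and the regular expression pattern is one possible choice (it keeps ASCII letters only), not a prescribed one. Note that `str.isalpha()` returns `True` for any Unicode letters, not just A to Z.

```python
import re

# Hypothetical sentence in the style of a financial filing.
raw_text = "Net sales increased 5.2% (to $1,204 million) year-over-year."

# Option 1: regular expression. Replace every character that is not an
# ASCII letter or whitespace with a space, then split on whitespace.
letters_only = re.sub(r"[^A-Za-z\s]", " ", raw_text)
regex_tokens = letters_only.split()

# Option 2: str.isalpha(). Split first, then keep only chunks made up
# entirely of letters. Chunks such as "(to" or "year-over-year." are
# dropped whole because they contain a non-letter character.
isalpha_tokens = [chunk for chunk in raw_text.split() if chunk.isalpha()]

print(regex_tokens)
print(isalpha_tokens)
```

The two approaches do not give the same result: the regular expression keeps words that were attached to punctuation (and splits hyphenated terms into parts), whereas the `.isalpha()` filter discards them. Which behaviour you want depends on your analysis.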

Step 2: Harmonise letter case

The next step in the 3 step process is to harmonise the letter case.

In an ordinary blob of text, we tend to have a mix of upper case, lower
case, and title case text.

And working with text that’s in different cases can be a little bit
problematic.

Why is harmonising letter case important?

Harmonising letter case helps us ensure that the words inside a document
that belong to a sentiment lexicon or sentiment language are actually
picked up.

To give you an example, consider the word “growth” with a capital G, so
“Growth”.

To you and me, as humans, that’s the same as growth with a lowercase g
(“growth”).

The word is the same.

They both mean the same thing.

Just because we write it with a capital G or we write the whole thing with
an upper case as “GROWTH”, or indeed the whole thing with lower case as
“growth”…

That doesn’t change the meaning of the word.

It still means growth.

You and I know that as humans.

But computers aren’t as clever as we’d like them to be.

And so for instance, if you were to ask Python whether the text string
“growth” is the same as “Growth”, Python will return False.

Because as far as Python is concerned, these two strings are not the same,
because their first characters differ in case.
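
You can check this comparison yourself in a couple of lines. The variable names are just for illustration.

```python
# Python compares strings character by character, and "g" is not "G".
same_as_typed = "growth" == "Growth"

# Converting both sides to the same case makes the comparison succeed.
same_after_lowering = "growth".lower() == "Growth".lower()
same_from_upper = "GROWTH".lower() == "growth"

print(same_as_typed)        # False
print(same_after_lowering)  # True
print(same_from_upper)      # True
```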

Python doesn’t care that, as far as the English language goes, they are in
fact the same thing.

And so if you now imagine the word/text string “growth” being in our
positive dictionary or positive sentiment language.

Say we’ve listed that word as lowercase…

Then if there’s a title case version of the word (“Growth”), the program is
not going to pick it up. It’s not going to include that word as one that
belongs to the sentiment language.

And the same goes if there was an upper case “GROWTH” – it won’t pick it
up.

Because our dictionary includes the word in a lower case as “growth”.

This is why it’s really important for us to make sure that the letter case of
all the text that we’re working with is consistent.
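
Here is a small sketch of the missed matches. The one-word `positive_words` set and the token list are hypothetical and exist only to show the effect.

```python
# Hypothetical one-word positive lexicon, stored in lower case.
positive_words = {"growth"}

tokens = ["Growth", "GROWTH", "growth", "costs"]

# Without harmonising the case, only the exact lower-case match is found.
hits_raw = [t for t in tokens if t in positive_words]

# After lower-casing every token, all three spellings are picked up.
lowered = [t.lower() for t in tokens]
hits_lowered = [t for t in lowered if t in positive_words]

print(len(hits_raw), len(hits_lowered))
```

Lower-casing the tokens (rather than the lexicon) works here because the lexicon is already in lower case; the point is simply that both sides must use the same case.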

Choice of letter case

You can choose to work with upper case, or title case, or lower case.

It doesn’t matter which specific case you end up working with. As long as
you’re consistent with that case throughout the Corpus.

Generally speaking, most people who work with text data tend to
harmonise the text data to lower case.

It just happens to be a typical convention or the general consensus /
general practice.

Okay, after harmonising the letter case across all words, the last thing we
need to do is remove all stopwords.

Step 3: Remove the most common words (stopwords)

The final step of the text cleaning process involves removing the most
common words, aka “stopwords”.

Stopwords are the most common words in a given language. And this
language can be a general language (e.g., English), or it could be a subject-
specific language; for instance, Finance.

The idea is to remove the words that are most commonly used in that
language. “a” is a stopword. As is “the”. And “an”, for example.
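
A minimal sketch of stopword removal is shown below. The `stopwords` set is a tiny hand-written list used as an assumption for illustration; real general-language lists (such as the English list distributed with NLTK) are much longer.

```python
# Toy stopword set (assumption); real lists contain many more words.
stopwords = {"a", "an", "the", "and", "of", "in"}

# Hypothetical tokens that have already been through steps 1 and 2.
tokens = ["the", "company", "reported", "an", "increase",
          "in", "the", "value", "of", "sales"]

# Keep only the tokens that are not in the stopword set.
filtered = [t for t in tokens if t not in stopwords]

print(filtered)
```

Storing the stopwords in a `set` keeps each membership check fast, which matters once the documents get long.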

Why is removing stopwords important?

Ultimately, it’s because the most common words are so common that they
add little to no value to any analysis.

It’s also because if we don’t remove stopwords, then we can end up
underestimating sentiment.

To see why we might end up underestimating sentiment, think about the
proportional counts estimate of sentiment that we talked about earlier in
this article.

You’ll recall that the numerator of that estimate is the frequency count of
all of the words which belong to a given sentiment language.

And the denominator of the proportional counts estimate is the total
number of words in that document.

Here’s the equation again, just in case you missed it:

$$\text{Sentiment} = \frac{\text{Number of words belonging to the sentiment language}}{\text{Total number of words in the document}}$$

And so if our document has words like “a”, “the”, “and”, etc., then, of course,
that’s going to increase the total number of words in the document.

That will increase the denominator of the proportional counts estimate.
Which will then naturally decrease the value of the proportional counts
based estimate of sentiment.
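
The effect on the denominator can be sketched with a toy example. The word lists, the tokens, and the helper name `proportional_sentiment` are all assumptions made for illustration.

```python
positive_words = {"growth", "strong"}        # toy positive lexicon (assumption)
stopwords = {"the", "a", "and", "of", "in"}  # toy stopword set (assumption)

# Hypothetical tokens, already lower-cased and stripped of non-letters.
tokens = ["the", "company", "reported", "strong", "growth",
          "in", "the", "sales", "of", "a", "product"]


def proportional_sentiment(words, lexicon):
    """Words found in the lexicon divided by the total number of words."""
    return sum(w in lexicon for w in words) / len(words)


kept = [t for t in tokens if t not in stopwords]

with_stopwords = proportional_sentiment(tokens, positive_words)   # 2 out of 11
without_stopwords = proportional_sentiment(kept, positive_words)  # 2 out of 6

print(with_stopwords, without_stopwords)
```

The numerator is the same in both cases (two positive words); only the denominator shrinks once the stopwords are removed, so the estimate rises.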

Language specific & subject specific stopwords

Now, importantly, stopwords aren’t necessarily limited to just the most
common words in the English language.

Broadly speaking, stopwords can comprise general language specific
words. But they can also comprise subject specific words.

So in finance and accounting, for instance, you might think of words like
“company”, “firm”, “management”, or “business” as examples of stopwords.
Because these are likely words that are extremely common across all
documents.

And so you can actually think of these common words as stopwords that
are specific to finance and accounting.
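
One simple way to combine the two kinds of stopwords is a set union, sketched below. The general list is a toy assumption; the finance words are the examples given in the text, not a published list.

```python
# Toy general-language stopwords (assumption).
general_stopwords = {"a", "an", "the", "and", "of"}

# Subject-specific examples mentioned in the text.
finance_stopwords = {"company", "firm", "management", "business"}

# The union contains every word from either set.
all_stopwords = general_stopwords | finance_stopwords

# Hypothetical cleaned tokens.
tokens = ["the", "management", "of", "the", "company",
          "expects", "margin", "expansion"]

filtered = [t for t in tokens if t not in all_stopwords]

print(filtered)
```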

Now, while general language specific stopword lists are available for
free (http://www.nltk.org/nltk_data/), subject specific stopwords, at least
at the time of writing, tend to be proprietary.

Some people have created stopword lists that are specific to certain
subjects, but they do not allow people to use those stopword lists for free.

A few of them allow free use for academic purposes, but not for commercial
purposes. And others don’t allow people to use those lists for anything for
free.

Subject specific stop words can be very important.

But they’re not imperative for you to use.

So it’s not like your sentiment analysis will completely break down if you
don’t use the subject specific stopwords.

They certainly can be very useful and important. But they’re not by any
means the be-all and end-all of sentiment analysis
(https://www.ferventlearning.com/sentiment-analysis-an-overview/).

So hopefully you now understand the process of cleaning text data, and
perhaps more importantly, you understand why the individual steps are
necessary.

Step 3.5 (Bonus / optional): Stem and Lemmatize

When exploring how to clean text data, the preceding 3 steps are
imperative.

In addition though, depending on the hypotheses that you’re working with,
text cleaning can also include “stemming” or “lemmatizing”. And it can also
include the removal of the most common words within the Corpus.

Stemming and lemmatizing essentially reduce all of the words down to
their core root word.

So for example, the word “managers” would be reduced down to “manager”.
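
To give a feel for the idea, here is a deliberately crude suffix-stripping sketch. It is a toy rule set written for this illustration, not a real stemmer or lemmatizer: established tools (for example the Porter stemmer available in NLTK) apply many more rules and may return stems that are not dictionary words.

```python
def crude_stem(word):
    """Strip a few common English suffixes. A toy rule set, not a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least four letters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word


words = ["managers", "manager", "reported", "reporting", "growth"]
stems = [crude_stem(w) for w in words]

print(stems)
```

Here "managers" and "manager" collapse to the same token, as do "reported" and "reporting", so they would be counted as the same word in any later analysis.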

In terms of removing the most common words within the Corpus, it’s a
simple case of removing the words that are used most commonly across all
documents inside the Corpus.
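
One possible way to find such words is to count, for each word, how many documents it appears in. The sketch below uses a hypothetical three-document corpus, and the rule "appears in every document" is an arbitrary threshold chosen for illustration.

```python
from collections import Counter

# Hypothetical mini-corpus of already-cleaned documents (lists of tokens).
corpus = [
    ["sales", "growth", "strong", "outlook"],
    ["sales", "decline", "weak", "outlook"],
    ["sales", "stable", "costs", "outlook"],
]

# Document frequency: in how many documents does each word appear?
# set(doc) ensures a word is counted at most once per document.
doc_freq = Counter(word for doc in corpus for word in set(doc))

# Treat words that appear in every document as corpus-specific stopwords.
corpus_stopwords = {w for w, n in doc_freq.items() if n == len(corpus)}

filtered_corpus = [[w for w in doc if w not in corpus_stopwords] for doc in corpus]

print(sorted(corpus_stopwords))
print(filtered_corpus)
```

In practice you might lower the threshold (say, words appearing in most documents rather than all of them); the right cut-off depends on your Corpus and hypotheses.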


Wrapping Up How to Clean Text Data

In this article, you’ve learnt the core fundamentals of how to clean text
data.

Specifically, we learned that cleaning text data involves transforming raw
text into a format that’s suitable for textual analysis.

This itself is a 3 step process, including:

- removing numbers, symbols, and everything that’s not an alphabetic character,
- harmonising letter case so they all have the same case, be that upper case, title case, or lower case
- removing stopwords
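
The three steps above can be strung together in a few lines. This is a sketch under stated assumptions rather than a definitive implementation: the function name `clean_text`, the toy `STOPWORDS` set, and the sample sentence are all made up for illustration.

```python
import re

# Toy stopword list (assumption); swap in a fuller list for real work.
STOPWORDS = {"a", "an", "the", "and", "of", "in", "to"}


def clean_text(raw_text, stopwords=STOPWORDS):
    """Turn a raw string into a list of cleaned, lower-case tokens."""
    # Step 1: keep letters only (everything else becomes a space).
    letters_only = re.sub(r"[^A-Za-z\s]", " ", raw_text)
    # Step 2: harmonise letter case (lower case by convention).
    lowered = letters_only.lower()
    # Step 3: split into tokens on whitespace and drop stopwords.
    return [t for t in lowered.split() if t not in stopwords]


tokens = clean_text("The Company reported that Revenue grew 12% in 2022, to $4.5 billion.")
print(tokens)
```

Lower-casing before removing stopwords matters: the stopword list is in lower case, so "The" would slip through if step 3 ran before step 2.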

Hopefully all of this makes sense.

If any part of this article is not quite clear, please read it again before
moving on any further.

Next steps? Discover how all this hard work can be used to create
profitable sentiment investing
(https://www.ferventlearning.com/sentiment-investing-guide/) strategies.


Copyright © 2023, Fervent · Privacy Policy (https://ferventlearning.com/privacy-policy/) · Terms and Conditions (https://ferventlearning.com/terms-conditions/)
