How To Clean Text Data (Full Practical Walkthrough) - Fervent - Finance Courses, Investing Courses
https://www.ferventlearning.com/how-to-clean-text-data/ 1/18
12/04/2023, 19:35 How to Clean Text Data (Full Practical Walkthrough) - Fervent | Finance Courses, Investing Courses
Ultimately, it’s just a process of transforming raw text into a format that’s
suitable for textual analysis.
Cleaning text data is imperative for any sort of textual analysis; and
naturally, the same applies for sentiment analysis
(https://www.ferventlearning.com/sentiment-analysis-an-overview/) or
more broadly, text mining as well.
Data cleansing is imperative for any sort of analysis. And textual analysis is
no exception.
Put differently, we’re going from, say, a text file or .txt file to some sort
of a vector.
To help you understand what this looks like, let’s actually take a look at
what a blob of text looks like. And then see what the cleaned text looks
like.
Pre-Cleaning / processing
And you can see that this looks like any ordinary blob of financial text.
So you’ve got some words. And then you’ve got your punctuation marks.
You’ve got some numbers, dates, parentheses, percentage signs, etc.
This article features concepts that are covered extensively in our course
on Investment Analysis with Natural Language Processing (NLP)
(https://www.ferventlearning.com/courses/investment-analysis-with-natural-language-processing-nlp/).
If you’re interested in learning how to leverage the power of text data for
investment analysis while working with real world data, you should
definitely check out the course
(https://www.ferventlearning.com/courses/investment-analysis-with-natural-language-processing-nlp/).
Ultimately, the only thing we’re really interested in is the actual words.
And once you go through the cleaning process, here’s what the cleaned
text would look like…
We’ve now moved away from a blob of text to something that’s relatively
more structured in that it is a list of words.
You can see that all of the symbols have gone. There are no parentheses
and hyphens. All the punctuation marks have gone.
Now we literally only have the words, without any unwanted characters.
Of course, you can also continue to read about the whole process further
below.
The 3 step process on how to clean text data starts with removing all the
numbers, symbols, and anything that’s not an alphabetic character from
the text.
Sentiment using a proportional counts approach can be estimated as…
Where you take the frequency counts of the (cleaned) words that belong to
a sentiment language (the numerator).
And you divide that by the total number of (cleaned) words in that
document (the denominator).
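Written out in symbols, the ratio described above looks roughly like the sketch below. The notation is an assumption made purely for illustration and is not taken from the original equation: S is a sentiment language (a list of words), d is a document, and count(w, d) is the number of times the cleaned word w appears in d.

```latex
\text{Sentiment}_S(d) \;=\; \frac{\sum_{w \in S} \text{count}(w, d)}{\sum_{w} \text{count}(w, d)}
```

The numerator only sums over words in the sentiment language, while the denominator sums over every cleaned word in the document, so the result always lies between 0 and 1.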
Now, to you and me, the total number of words would intuitively just include
words, right?
But if you don’t eliminate things like the numbers, symbols, and other
non-alphabetic characters…
Then it’s possible for the program to count those symbols and numbers as
words. So the number of words would be significantly higher than it
actually is.
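To make the effect on the denominator concrete, here is a minimal sketch in Python. The sentence, the one-word positive list, and the `proportion` helper are all hypothetical, made up for illustration only.

```python
# Hypothetical sentence and a hypothetical one-word positive word list.
text = "Revenue grew 12 % in 2022 ( up from 8 % ) - strong growth ."
positive_words = {"growth"}

# Naive approach: split on whitespace, so numbers and symbols count as "words".
naive_tokens = text.lower().split()

# Cleaned approach: keep only the tokens made up entirely of letters.
clean_tokens = [t for t in naive_tokens if t.isalpha()]


def proportion(tokens, lexicon):
    """Share of tokens that belong to the lexicon."""
    return sum(t in lexicon for t in tokens) / len(tokens)


print(len(naive_tokens), len(clean_tokens))      # 16 7
print(proportion(naive_tokens, positive_words))  # 1 / 16
print(proportion(clean_tokens, positive_words))  # 1 / 7
```

The numerator is the same in both cases (one positive word), but the naive denominator is more than twice as large, so the estimated sentiment is diluted.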
The fundamental rationale for removing non-alphabetic characters
And that’s simply because numbers aren’t words. Symbols aren’t words.
And punctuation marks aren’t words.
We don’t need all of the other stuff that just happens to be inside a blob of
text.
And this is why it’s important for us to remove all of the non-alphabetic
characters as the first step in our text cleaning process.
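One common way to do this first step in Python is with a regular expression. The sketch below is only one possible approach, not necessarily the one used in the course; the function name `remove_non_alphabetic` and the sample sentence are hypothetical, and the pattern assumes plain English text (ASCII letters only).

```python
import re


def remove_non_alphabetic(text: str) -> str:
    """Replace every run of non-letter characters with a single space."""
    # [^A-Za-z]+ matches one or more characters that are NOT ASCII letters.
    return re.sub(r"[^A-Za-z]+", " ", text).strip()


raw = "Net income rose 4.5% (to $1.2bn) in Q3-2022."
print(remove_non_alphabetic(raw))  # Net income rose to bn in Q
```

Notice that fragments such as “bn” and “Q” survive, because they are letters that happened to sit next to numbers. Whether to drop such leftovers is a separate judgement call.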
The next step in the 3 step process for cleaning text data is to harmonise
the letter case.
Working with text that’s in different cases can be a little bit
problematic.
Why is harmonising letter case important?
Just because we write it with a capital G or we write the whole thing with
an upper case as “GROWTH”, or indeed the whole thing with lower case as
“growth”…
For instance, if you were to ask Python whether the text string
“growth” is the same as “Growth”, Python will return False.
As far as Python is concerned, these two strings are not the same
because string comparison is case-sensitive: a lower case “g” and an
upper case “G” are different characters.
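You can check this yourself in a Python session. The variable names below are just illustrative.

```python
# String comparison in Python is case-sensitive.
same_lower_vs_title = "growth" == "Growth"
same_lower_vs_upper = "growth" == "GROWTH"

# Converting to a common case first makes the comparison succeed.
same_after_lowering = "Growth".lower() == "growth"

print(same_lower_vs_title, same_lower_vs_upper, same_after_lowering)
# False False True
```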
Python doesn’t care that, as far as the English language goes, they are in…
And so if you now imagine the word/text string “growth” being in our
positive dictionary or positive sentiment language.
Then if the text contains the title case word “Growth”, the program is not
going to pick it up. It’s not going to include that word as one that belongs
to the sentiment language.
And the same goes if there was an upper case “GROWTH” – it won’t pick it
up.
This is why it’s really important for us to make sure that the letter case
of all the text that we’re working with is consistent.
You can choose to work with upper case, or title case, or lower case.
It doesn’t matter which specific case you end up working with. As long as
you’re consistent with that case throughout the Corpus.
Generally speaking, most people who work with text data tend to
harmonise the text data to lower case.
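As a rough illustration of what harmonising to lower case buys you, the sketch below counts matches against a positive word list before and after lowering. The word list and tokens are hypothetical.

```python
# Hypothetical positive sentiment language, stored in lower case.
positive_words = {"growth"}

# Hypothetical tokens as they might appear in a raw document.
tokens = ["Growth", "GROWTH", "growth", "decline"]

# Without harmonising, only the exact lower case spelling matches.
hits_raw = sum(t in positive_words for t in tokens)

# After harmonising every token to lower case, all three variants match.
hits_lower = sum(t.lower() in positive_words for t in tokens)

print(hits_raw, hits_lower)  # 1 3
```

The same idea works with upper case, as long as the sentiment language is stored in the same case as the text.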
Okay, after harmonising the letter case across all words, the last thing we
need to do is remove all stopwords.
The final step of the text cleaning process involves removing the most
common words, aka “stopwords”.
Stopwords are the most common words in a given language. And this
language can be a general language (e.g., English), or it could be a subject-
specific language; for instance, Finance.
The idea is to remove the words that are most commonly used in that
language. “a” is a stopword. As is “the”. And “an”, for example.
And that’s ultimately because the most common words are so common,
that they actually add little to no value to any analysis.
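Here is a minimal sketch of stopword removal in Python. The `STOPWORDS` set is a tiny hand-written list used purely for illustration (real general-language lists are much longer), and the `remove_stopwords` name and sample tokens are hypothetical.

```python
# A tiny illustrative stopword list; real lists contain many more words.
STOPWORDS = {"a", "an", "the", "and", "of", "in"}


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Keep only the tokens that are not in the stopword list."""
    return [t for t in tokens if t not in stopwords]


tokens = ["the", "company", "reported", "a", "rise", "in", "profit"]
print(remove_stopwords(tokens))
# ['company', 'reported', 'rise', 'profit']
```

This assumes the tokens have already been harmonised to lower case, since the lookup is case-sensitive.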
You’ll recall that the numerator of that estimate is the frequency count of
all of the words which belong to a given sentiment language.
Here’s the equation again, just in case you missed it:
And so if our document has words like “a”, “the”, “and”, etc, then, of course,
that’s going to increase the total number of words in the document.
So in finance and accounting, for instance, you might think of words like
“company”, “firm”, “management”, or “business” as examples of stopwords.
Because these are likely words that are extremely common across all
documents.
And so you can actually think of these common words as stopwords that
are specific to finance and accounting.
Now, while general language stopword lists are available for
free (http://www.nltk.org/nltk_data/), subject specific stopwords – at least
at the time of writing – tend to be proprietary.
Some people have created stopword lists that are specific to certain
subjects, but they do not allow people to use those stopword lists for free.
Few of them allow free use for academic purposes, but not for commercial
purposes. And others don’t allow people to use those lists for anything for
free.
So it’s not like your sentiment analysis will completely break down if you
don’t use the subject specific stopwords.
They certainly can be very useful and important. But they’re not by any
means the be-all and end-all of sentiment analysis
(https://www.ferventlearning.com/sentiment-analysis-an-overview/).
So hopefully you now understand the process of cleaning text data, and
perhaps more importantly, you understand why the individual steps are
necessary.
When exploring how to clean text data, the preceding 3 steps are
imperative.
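Putting the 3 steps together, a cleaning function might look something like the sketch below. It is a simplified illustration under stated assumptions, not the course’s implementation: the `clean_text` name, the short `STOPWORDS` list, and the sample sentence are all hypothetical, and the pattern assumes ASCII English text.

```python
import re

# A short illustrative stopword list; swap in a fuller list in practice.
STOPWORDS = {"a", "an", "the", "and", "in", "of", "to", "by", "is", "that"}


def clean_text(text, stopwords=STOPWORDS):
    """Turn a raw blob of text into a list of cleaned words."""
    # Step 1: replace anything that is not a letter with a space.
    letters_only = re.sub(r"[^A-Za-z]+", " ", text)
    # Step 2: harmonise the letter case (lower case here).
    lowered = letters_only.lower()
    # Step 3: split into words and drop the stopwords.
    return [w for w in lowered.split() if w not in stopwords]


raw = ("The Company reported that revenue GREW by 12% in 2022, "
       "and Growth is expected to continue.")
print(clean_text(raw))
# ['company', 'reported', 'revenue', 'grew', 'growth', 'expected', 'continue']
```

The order matters a little: lowering the case before removing stopwords means “The” and “the” are both caught by a lower case stopword list.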
In terms of removing the most common words within the Corpus, it’s a
simple case of removing the words that are used most commonly across all
documents inside the Corpus.
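One way to find the words that are most common across a Corpus is to count how many documents each word appears in. The sketch below is an assumption about how this could be done, not a method stated in the article: the toy corpus is made up, and the cut-off (words that appear in every document) is an arbitrary choice you would tune in practice.

```python
from collections import Counter

# Hypothetical corpus: each document is already a list of cleaned words.
corpus = [
    ["company", "reported", "strong", "growth"],
    ["company", "announced", "loss"],
    ["management", "expects", "company", "growth"],
]

# Count the number of documents each word appears in (document frequency).
# set(doc) ensures a word is counted at most once per document.
doc_freq = Counter(word for doc in corpus for word in set(doc))

# Assumed rule: treat words found in every document as corpus stopwords.
corpus_stopwords = {w for w, n in doc_freq.items() if n == len(corpus)}

# Remove those words from every document.
cleaned = [[w for w in doc if w not in corpus_stopwords] for doc in corpus]

print(corpus_stopwords)  # {'company'}
print(cleaned[1])        # ['announced', 'loss']
```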
If any part of this article is not quite clear, please read it again before
moving on any further.
Next steps? Discover how all this hard work can be used to create
profitable sentiment investing
(https://www.ferventlearning.com/sentiment-investing-guide/) strategies.
Related Course: Investment Analysis with Natural Language Processing (NLP)
Logos of institutions used are owned by those respective institutions. Neither Fervent nor the institutions endorse
each other's products / services.