Text mining on the command line


Sabber Ahamed Jun 25, 2018 · 7 min read

For the last couple of days, I have been thinking about writing something on my recent experience using raw bash commands and regex to mine text. Of course, there are more sophisticated tools and libraries for processing text without writing so many lines of code. For example, Python has a built-in regex module, “re”, with many rich features for processing text, and ‘BeautifulSoup’ has nice built-in features for cleaning up raw web pages. I use these tools for faster processing of large text corpora, and when I feel too lazy to write code.

Most of the time, though, I prefer to use the command line. I feel at home on the command line, especially when I work with text data. In this tutorial, I use bash commands and regex to process raw and messy text data. I assume readers have a basic familiarity with regex and bash commands.

I show how bash commands like ‘grep,’ ‘sed,’ ‘tr,’ ‘column,’ ‘sort,’ ‘uniq,’ and ‘awk’ can be used with regex to process raw and messy text and then extract information. As an example, I use the complete works of Shakespeare provided by Project Gutenberg, in cooperation with World Library, Inc.

Look at the file first


The complete works of Shakespeare can be downloaded from this link. I downloaded the entire work and put it into a text file: “shakes.txt.” All right, now let’s start by looking at the file size:

ls -lah shakes.txt

### Display:
-rw-r--r--@ 1 sabber staff 5.6M Jun 15 09:35 shakes.txt

‘ls’ is the bash command that lists all the files and folders in a given directory. The ‘-l’ flag displays the file type, owner, group, size, date, and filename. The ‘-a’ flag displays all files, including hidden ones. The ‘-h’ flag, one of my favorites, displays file sizes in a human-readable format. The size of shakes.txt is 5.6 megabytes.

Explore the text


Okay, now let’s read the file to see what’s in it. I use the ‘less’ and ‘tail’ commands to explore parts of the file. The names of the commands hint at their functionality: ‘less’ is used to view the contents of a text file one screen at a time. It is similar to ‘more’ but has the extended capability of allowing both forward and backward navigation through the file. The ‘-N’ flag displays line numbers. Similarly, ‘tail’ shows the last few lines of a file (a quick ‘tail’ example follows the ‘less’ output below).


less -N shakes.txt

### Display:
1 <U+FEFF>
2 Project Gutenberg’s The Complete Works of William Shakespeare, by William
3 Shakespeare
4
5 This eBook is for the use of anyone anywhere in the United States and
6 most other parts of the world at no cost and with almost no restrictions
7 whatsoever. You may copy it, give it away or re-use it under the terms
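
Similarly, the end of the file can be inspected with ‘tail’ (a minimal sketch; the number of lines to show here is arbitrary):

tail -n 20 shakes.txt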

It looks like the first couple of lines are not Shakespeare’s work but some information about Project Gutenberg. Similarly, there are some lines at the end of the file unrelated to Shakespeare’s work. So I delete all the unnecessary lines from the file using ‘sed’ as below:

cat shakes.txt | sed -e '149260,149689d' | sed -e '1,141d' > shakes_new.txt

The above code snippet first deletes lines 149260 to 149689 at the tail and then deletes the first 141 lines. The unwanted lines include information about legal rights, Project Gutenberg, and the table of contents of the work.
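
To find those boundary line numbers in the first place, one option (a sketch, assuming the usual Project Gutenberg header and footer text appears in the file) is to search for the Gutenberg markers with line numbers enabled:

grep -n -i "project gutenberg" shakes.txt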

Basic Analysis
Now let’s compute some basic statistics on the file using a pipe (‘|’) and ‘awk’.

cat shakes_new.txt | wc | awk '{print "Lines: " $1 "\tWords: " $2 "\tCharacter: " $3 }'

### Display
Lines: 149118 Words: 956209 Character: 5827807

In the above code, I first extract the entire text of the file using ‘cat’ and then pipe it into ‘wc’ to count the number of lines, words, and characters. Finally, I use ‘awk’ to display the information. The counting and displaying can be done in many other ways; feel free to explore other options.
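
As one alternative, the same three counts can be computed in a single awk program (a sketch; note that ‘wc’ reports bytes in its third column while awk’s length() counts characters, so the totals can differ on multi-byte text):

awk '{ words += NF; chars += length($0) + 1 } END { print "Lines: " NR "\tWords: " words "\tCharacter: " chars }' shakes_new.txt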

Text processing
Now it’s time to clean the text for further analysis. Cleaning includes converting the text to lower case, removing all digits, removing all punctuation, and removing high-frequency words (stop words). Processing is not limited to these steps; it depends on the purpose. Since I intend to show some basic text processing, I focus only on the above steps.

First, I convert all the uppercase characters/words to lowercase, followed by removing all the digits and punctuation. To perform the processing, I use the bash command ‘tr’, which translates or deletes characters in a text stream.

cat shakes_new.txt | tr 'A-Z' 'a-z' | tr -d '[:punct:]' | tr -d '[:digit:]' > shakes_new_cleaned.txt

The code snippet above first converts the entire text to lower case and then removes all the punctuation and digits. The result of the above code:

### Display before:


1 From fairest creatures we desire increase,
2 That thereby beauty’s rose might never die,
3 But as the riper should by time decease,
4 His tender heir might bear his memory:
5 But thou contracted to thine own bright eyes,
6 Feed’st thy light’s flame with self-substantial fuel,
7 Making a famine where abundance lies,
8 Thy self thy foe, to thy sweet self too cruel:
9 Thou that art now the world’s fresh ornament,
10 And only herald to the gaudy spring,
11 Within thine own bud buriest thy content,
12 And, tender churl, mak’st waste in niggarding:
13 Pity the world, or else this glutton be,
14 To eat the world’s due, by the grave and thee.

### Display after:

1 from fairest creatures we desire increase
2 that thereby beautys rose might never die
3 but as the riper should by time decease
4 his tender heir might bear his memory
5 but thou contracted to thine own bright eyes
6 feedst thy lights flame with selfsubstantial fuel
7 making a famine where abundance lies
8 thy self thy foe to thy sweet self too cruel
9 thou that art now the worlds fresh ornament
10 and only herald to the gaudy spring
11 within thine own bud buriest thy content
12 and tender churl makst waste in niggarding
13 pity the world or else this glutton be
14 to eat the worlds due by the grave and thee
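
A note on portability: depending on the locale and the tr implementation, the POSIX character classes are a safer spelling of the same lowercasing step (an equivalent sketch of the command above):

cat shakes_new.txt | tr '[:upper:]' '[:lower:]' | tr -d '[:punct:]' | tr -d '[:digit:]' > shakes_new_cleaned.txt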

Tokenization is one of the basic preprocessing steps in natural language processing. Tokenization can be performed at either the word or the sentence level. In this tutorial, I show how to tokenize the file at the word level. In the code below, I first extract the cleaned text using ‘cat’ and then use ‘tr’ with its ‘-s’ (squeeze repeats) and ‘-c’ (complement) flags to put every word on its own line.

cat shakes_new_cleaned.txt | tr -sc 'a-z' '\12' > shakes_tokenized.txt

### Display (First 10 words)

1 from
2 fairest
3 creatures
4 we
5 desire
6 increase
7 that
8 thereby
9 beautys
10 rose
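
Here ‘\12’ is the octal escape for the newline character; the same command can be written more readably with ‘\n’ (an equivalent sketch):

cat shakes_new_cleaned.txt | tr -sc 'a-z' '\n' > shakes_tokenized.txt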

Now that we have all the words tokenized, we can answer questions like: what is the most (or least) frequent word in the entire works of Shakespeare? To do this, I first use the ‘sort’ command to sort all the words, then use the ‘uniq’ command with the ‘-c’ flag to count the frequency of each word. ‘uniq -c’ is similar to ‘groupby’ in Pandas or ‘GROUP BY’ in SQL. Finally, I sort the words by frequency in either ascending (least frequent first) or descending (most frequent first) order.

cat shakes_tokenized.txt | sort | uniq -c | sort -nr > shakes_sorted_desc.txt

### Display

29768 the
28276 and
21868 i
20805 to
18650 of
15933 a
14363 you
13191 my
11966 in
11760 that

cat shakes_tokenized.txt | sort | uniq -c | sort -n > shakes_sorted_asc.txt

### Display

1 aarons
1 abandoner
1 abatements
1 abatfowling
1 abbominable
1 abaissiez
1 abashd
1 abates
1 abbeys
1 abbots

The above results reveal some interesting observations. For example, the ten most frequent words are articles, pronouns, prepositions, or conjunctions. If we want to find more abstract information about the work, we have to remove all the stop words (articles, prepositions, pronouns, conjunctions, modal verbs, etc.). It also depends on the purpose of the analysis: one might be interested only in prepositions, in which case it is fine to keep them. On the other hand, the least frequent words, each appearing only once, include ‘aarons’, ‘abandoner’, and ‘abatements’.

Removing stop words


In the next step, I show the usage of ‘awk’ to remove all the stop words on the command line. In this tutorial, I use NLTK’s list of English stop words, to which I have added a couple more words. Details of the following code can be found in this StackOverflow answer. Details of the different options of awk can also be found in the awk manual (‘man awk’ on the command line).

awk 'FNR==NR{for(i=1;i<=NF;i++)w[$i];next}(!($1 in w))' stop_words.txt shakes_tokenized.txt > shakes_stopwords_removed.txt
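
For readability, here is the same awk program spread over several lines, with comments explaining each part (a functionally equivalent sketch):

awk '
  # While reading the first file (stop_words.txt), FNR==NR is true:
  # store every word on every line as a key of the array w, then move on.
  FNR == NR { for (i = 1; i <= NF; i++) w[$i]; next }

  # While reading the second file (shakes_tokenized.txt), print a line
  # (the default action) only if its first field is not a stop word.
  !($1 in w)
' stop_words.txt shakes_tokenized.txt > shakes_stopwords_removed.txt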

All right, after removing the stop words, let’s sort the words in ascending and descending order as above.

cat shakes_stopwords_removed.txt | sort | uniq -c | sort -nr > shakes_sorted_desc.txt

### Display most frequent

3159 lord
2959 good
2924 king
2900 sir
2634 come
2612 well
2479 would
2266 love
2231 let
2188 enter

cat shakes_stopwords_removed.txt | sort | uniq -c | sort -n > shakes_sorted_asc.txt

### Display least frequent

1 aarons
1 abandoner
1 abatements
1 abatfowling
1 abbominable
1 abaissiez
1 abashd
1 abates
1 abbeys
1 abbots

We see that the most frequent word used by Shakespeare is ‘lord’, followed by ‘good’. The word ‘love’ is also among the most frequent words. The least frequent words remain the same as before. A linguistics or literature student may interpret this information or gain better insight from these simple analytics.
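
As a quick sanity check, the count for any single word can be pulled straight out of the sorted file (a sketch; ‘love’ is just an example word here):

# In 'uniq -c' output the first field is the count and the second is the word.
awk '$2 == "love"' shakes_sorted_desc.txt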

Let’s discuss
Now that we are done with the necessary processing and cleaning, in the next tutorial I will discuss how to perform some more advanced analytics. Until then, if you have any questions, feel free to ask. Please leave a comment if you see any typos or mistakes, or if you have better suggestions. You can reach out to me:

Email: sabbers@gmail.com
LinkedIn: https://www.linkedin.com/in/sabber-ahamed/
Github: https://github.com/msahamed
Medium: https://medium.com/@sabber/
