
JON CLINDANIEL: A common use case for regular expressions

is in processing text from unstructured data you have just


scraped from the web.
We got a small hint of this in the last unit
with the commands grep, which stands for global regular expression
print, and sed, where we found a particular word or set of characters
and replaced them with another set of characters.
In this lesson, we'll be looking at four of George Washington's
presidential speeches from 1790 through 1792.
They're scraped directly from Wikipedia, so they
have a lot of HTML characters embedded within them.
So let's take a look at how regular expressions as well
as the find/replace routines we touched on in the last unit
can allow us to clean and structure data for further analysis.
So let's open up one of these text files just to see what we have here.
And you can see it's very messy.
So let's say one of our research questions
deals with the length of sentences in presidential speeches
over the course of a presidency.
Are they terse and to the point at the beginning and long-winded at the end?
Do they stay the same?
To answer these research questions, we need
to take word counts of each sentence and calculate an average word
count per sentence for each speech.
We should structure our output in a CSV file
for easy analysis with year, month, and sentence length
recorded for each presidential speech.
Now we could just manually count the number of words in each sentence,
but this would quickly become cumbersome if we
were looking at all the presidential speeches in US history.
We need to be able to automate this task.
Thankfully, though, this sort of automation
is easy with regular expressions and command line functions.
So let's open up the command line again, and I recommend for this lesson
that you write down the relevant code somewhere handy so you can easily
repeat the analysis on your own.
The code is a bit long to efficiently repeat by memory alone.
So first of all, let's load up our command line,
and I'm going to go to my desktop where these files are sitting.
And I'm going to first of all save the name of the text
file we're working with as a variable.
Variables are just a programmer's tool for storing data
within a command line session.
They're especially useful in this context
because we don't have to type the name of the file again
and again every time we use it.
We create a variable like so, where we set the variable file name equal
to the value 1790_01_08Washington1.txt, the name of our file.
I call this variable file name, but you could theoretically
call it whatever you want.
Now when I want to refer to the text file from the command line,
I just need to type the variable file name prefixed by a dollar sign,
and you'll see that the file name is now stored in memory under the name file
name.
Assigning file names to variables like this
also makes it easy to use the same command line code for different files
simply by changing the name of the text file associated with the variable file
name.
If I use the cat command with the file name variable,
you can see that the text file is loaded correctly.
If you want to check the value contained in your variable-- in this case,
the name of the text file--
you can print out the value using the echo command.
Let's try echoing the value in our variable file name.
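Put together, the variable assignment and echo check look something like this. This is a minimal sketch; the exact spelling of the file name here is an assumption based on the speech's date, so adjust it to match your own file:

```shell
# Store the file name in a variable so we don't have to retype it.
# (The exact file name is an assumption; use your own file's name.)
filename="1790_01_08Washington1.txt"

# echo prints the value currently stored in the variable.
echo "$filename"

# cat "$filename" would then print the contents of the file itself,
# once a file by that name exists in your working directory.
```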
Note that the file name includes the date of the speech.
We will need this information for our CSV output file containing
the year, month, and sentence length.
We can grab it directly from the variable file name like so.
For year, we want the first four characters in file name, 1790.
Note that in computer programming numbering starts at 0 and not 1.
To access the first four characters of dollar sign file name, then,
we type the file name like so, surrounded by curly braces,
and then indicate that we're starting at character 0
and including only 4 characters.
Note that we have the numbers 1790 now.
Let's save that value to a variable as well so
that we can access that later when we create our CSV.
If we do the same thing for month, we're now starting on character 5,
and we want to include two characters.
You can see this command correctly produces 01 for the month of January.
We'll save this as a variable, too, so we can add it to our output CSV
later on.
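In bash, those two extractions use substring expansion, which has the form dollar sign, curly braces, variable name, colon, offset, colon, length, with offsets counted from 0. A sketch, assuming the file name begins with the date in year, month, day order:

```shell
filename="1790_01_08Washington1.txt"   # assumed name; adjust to your file

# ${var:offset:length} slices a string, counting from 0.
year="${filename:0:4}"    # characters 0 through 3
month="${filename:5:2}"   # characters 5 and 6

echo "$year"    # 1790
echo "$month"   # 01
```

Note that this substring syntax is a bash feature, so it assumes you are running bash rather than a minimal POSIX shell.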
As we take a look at the speech, you can see
that there's still a lot of HTML tags and miscellaneous characters in there.
If we leave these tags and different characters in there,
we'll receive misleading word count numbers
since the word counts will include these tags and characters as words.
Before we do any analysis then, let's first write a new text file
with all of these tags cleaned out.
To do so I'm going to use the sed command we covered in section 4.4.
Instead of filtering out a single character or word,
though, I'm going to use a more complicated regular expression
to filter out any characters that occur between the less than and greater
than characters, in other words, all of the HTML span tags.
Regular expressions can get a bit complicated, but stick with me
and we'll walk through the logic of this example together.
Make sure that you're very careful with spaces and punctuation.
Every detail matters when you're dealing with regular expressions.
So first I'm going to type the same sed command with the edit flag
that I typed in the last unit.
Remember as well that we start the phrase in quotes with an s, like so.
Now we want to remove everything that starts with a less-than sign
and ends with a greater-than sign.
So we start our regular expression with a less-than sign,
and we end it with a greater-than sign.
The HTML tags can contain any number of word configurations
within the less than and greater than signs, though.
We need to specify that any character following the less than sign
that is not the greater than sign should also be removed.
To do this, we first introduce brackets, a caret character,
and a greater-than character after the less-than character.
The brackets are used to signify any characters that
are between the less than and greater than sign characters such as the word
span.
The caret is used to signify the logical operation not,
thus our regular expression currently matches any text pattern
that matches the less than character followed by any character that is not
the greater than character and then finally followed
by the greater than character.
For the sake of completeness, we should also
include cases where there are empty pairs of less than and greater
than characters, for instance, if there were no characters between the two.
To match zero or more of the characters identified by the bracket and caret
pattern, we must add an asterisk after the closing bracket like so,
finishing out our regular expression.
Then we finish out the sed command like in the last lesson,
finding anything that matches our regular expression pattern
and replacing it with nothing, i.e. deleting it.
We use a g at the end of the sed command to tell it
that we want to perform this operation globally, as many times
as text matching the expression appears on a line in the speech.
I then write the result out to a text file
so that we can use our results for further steps in our analysis.
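The tag-stripping pattern can be tried out on a single line before running it over the whole file. The sample line below is invented for illustration; applied to the real file, you would redirect sed's output to a new text file instead:

```shell
# <[^>]*> means: a "<", then zero or more characters that are NOT ">",
# then a ">" -- i.e. one whole HTML tag. The trailing g repeats the
# substitution for every tag on the line.
line='<span class="s1">Fellow citizens of the</span> <span>Senate</span>'
echo "$line" | sed 's/<[^>]*>//g'

# On the real file: sed 's/<[^>]*>//g' "$filename" > cleaned.txt
```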
So note that there are no HTML tags in the text anymore, exactly as we wanted.
Conveniently this is only a single line of code,
and we can reuse it whenever we want to do the same cleaning
operation for a text file, much faster than doing this cleaning manually.
You'll note, though, that at the bottom of the text,
and at the bottom of all the scraped speeches I have,
there is a statement about the speech being in the public domain.
For our analysis, let's also remove the statement and write the remaining text
to file for inspection.
Here I write another regular expression utilizing the backslash less-than
and backslash greater-than characters to bracket the phrase "This work is in
the" and match anything that starts with that full phrase.
I then end the regular expression with a period, which means any character,
and the asterisk, which again, matches zero or more
characters identified by the period.
So we have a regular expression matching text
that starts with the phrase "This work is in the" and finishes with zero
or more of any other character.
And sure enough, we can see that the public domain
statement is no longer in the text.
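That match can be sketched like so. The sample sentence here is invented, and backslash less-than and backslash greater-than are GNU sed word-boundary markers, so this assumes GNU sed rather than the BSD sed shipped with macOS:

```shell
# \<...\> anchors the phrase at word boundaries (GNU sed); .* then
# swallows everything from the phrase to the end of the line.
line='So help me God. This work is in the public domain in the United States.'
echo "$line" | sed 's/\<This work is in the\>.*//'
```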
Now we need to separate each sentence so we can measure their length.
To do this, we can again use the sed command
to replace all end of sentence periods with a new line character as follows.
So this new line character separates each sentence onto a different line.
Remember that periods have the meaning of any character
in regular expressions.
To actually match the character period and not
use its expanded definition of any character,
we need to insert a backslash in front of the period
to let the computer know that we're specifically
talking about periods here.
This is the same technique that can be used
if you're looking to identify any of the special characters used
in regular expressions.
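As a sketch on an invented line (again assuming GNU sed, which understands backslash n in the replacement as a newline):

```shell
# \. matches a literal period; the replacement \n is a newline,
# so each sentence ends up on its own line.
text='First point. Second point. Third point.'
echo "$text" | sed 's/\./\n/g'
```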
So we finally need to remove punctuation and then any empty lines.
To remove punctuation, we can use the bracketed shorthand
punct instead of typing in all of the possible punctuation values.
This sed statement matches all punctuation
and replaces it with nothing, i.e. it removes punctuation.
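A sketch of the punctuation-stripping step on a made-up line:

```shell
# [[:punct:]] is a POSIX character class matching any punctuation mark.
echo 'Hello, world! (A test.)' | sed 's/[[:punct:]]//g'
# prints: Hello world A test
```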
Finally we can write line word counts to file using the powerful text
processing utility awk.
The intricacies of awk are beyond the scope of this course,
but this particular command removes blank lines from consideration
and then prints out NF, the number of fields, or word count, for each line.
Thus if we want to look at the resulting text file,
we can see that we have word counts for every sentence in the speech.
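The awk step can be sketched like this on invented input. NF is awk's built-in count of fields, that is, whitespace-separated words, on the current line, and the condition NF > 0 skips blank lines:

```shell
# Print a word count for every non-blank line.
printf 'Fellow citizens of the Senate\n\nI embrace you\n' |
    awk 'NF > 0 { print NF }'

# On the real file: awk 'NF > 0 { print NF }' cleaned.txt > wc.txt
```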
So now we just need to average the word counts for each speech
and add the average to CSV with each speech's year and month.
Awk reduces this operation to a single line of code as follows.
Again, don't worry too much about the specifics of awk for this course,
but I will walk through the logic of how this particular command is working.
The first phrase in braces indicates that you
are taking the sum over all lines of the first and only column in the word
count text file, wc.txt.
The END indicates that at the end of the file, you stop the summation process
and print out the quotient of the sum and NR, the number of records.
So this gives us our average.
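A sketch with invented word counts standing in for wc.txt:

```shell
# { sum += $1 } runs on every line, accumulating column 1;
# END { ... } runs once after the last line, dividing by NR,
# the total number of records (lines) read.
printf '10\n20\n30\n' | awk '{ sum += $1 } END { print sum / NR }'
# prints: 20
```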
If we save the average sentence length in a variable,
this will make it easier to write it to CSV with its year and month data.
To write to CSV, we can simply echo our variable values out to the CSV
like we did in the redirection unit.
Note that we use two greater-than characters in this context.
This command means that we will append the data to the end of the listed CSV
rather than overwrite it.
That way when we do the same analysis for Washington's other speeches,
we can append their data and directly add it
to the same CSV using the same code.
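A sketch of the append step, using invented values and a temporary file in place of the real results CSV:

```shell
# Invented values standing in for the variables built earlier.
year=1790
month=01
avg=35.5

csvfile=$(mktemp)   # stand-in for your results CSV

# >> appends to the file instead of overwriting it, so rerunning
# this for each speech accumulates one row per speech.
echo "$year,$month,$avg" >> "$csvfile"
echo "$year,$month,$avg" >> "$csvfile"

cat "$csvfile"
rm "$csvfile"
```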
Now we just need to change the file listed for the variable file name,
and we can automate and repeat the same process for the remaining Washington
speech files.
Go ahead and pause your video and rerun the code
for the remainder of Washington's speeches.
After you're done, we can compare our results.
Now that you're done running your code again,
if you pull up your resulting CSV in Excel,
you can see that we have the year, month,
and sentence length for each one of the George Washington speeches
provided just as we wanted.
Congratulations.
Now we have an idea of how his sentence length changed
over the course of his presidency.
And we have the code to repeat this analysis for any number
of presidential speeches through time.
