
Do it Yourself Data Mining

A Three Part White Paper


By

Sherwin A. Steffin
May 15, 2009
Introduction
Back in the days when I was founder and CEO of a Macintosh software
publishing company (BrainPower, Inc.), I designed a powerful text analysis
product, named ArchiText, for the Macintosh platform. An independent
contractor was employed for program coding. When the company closed in
1990, the program disappeared from public view, but to this day it contains
features that I have found in no other free or low-priced program.

Unfortunately, the program operates only on obsolete Mac computers running OS
6.1-9.x, a rapidly declining number of computers. Additionally, it was developed
long before the Internet, so it cannot directly import web pages, but must rely
on plain text or on files in MS Word 4.0 or earlier format.

Given those limitations, I am looking for a programmer interested in carrying the
software design elements into a new program, usable on all platforms, and
incorporating features that were not available at that time, or whose development
costs prevented inclusion.

I see this as a project which would be undertaken without any financial
transactions between any parties involved in development. The program could
be sold as low-cost (under $20.00) shareware, providing revenue to all involved
in the program development. Anyone interested in reviewing the program who has
access to a Macintosh with the requisite OS can receive a copy by contacting the
author at the email address provided in my profile. Please put “ARCHITEXT” in
the subject line.

A Rationale for Reconstruction of a Twenty-Year-Old Program

The first thing anyone considering becoming involved in this project will ask is,
“Why should any effort go into rebuilding a twenty-year-old program?” As you
read through the features and capabilities of this program, you will find several
components missing from every modern search engine, elegant and innovative
as each is:

1. A search operator that will facilitate proximity searches.

Wikipedia provides a useful definition of the proximity search. As you read that
page you will note that while Google provides some facility for this feature, it
is more than a little clumsy to use. Other attempts to implement the feature
are similarly clumsy or very limited in scope. One example is an API designed
to do two-word proximity searches.
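
As a rough illustration of what such an operator does, here is a minimal Python sketch. It is not ArchiText's algorithm; the function name, the tokenizing rule, and the sample sentence are all assumptions made for the example. It reports every pair of word positions at which two terms fall within a chosen number of words of each other, in either order.

```python
import re

def proximity_matches(text, term_a, term_b, max_distance):
    """Return (position_a, position_b) word-index pairs where the two terms
    occur within max_distance words of each other, in either order."""
    # Assumed tokenizing rule: runs of letters, digits, apostrophes, hyphens.
    words = [w.lower() for w in re.findall(r"[A-Za-z0-9'-]+", text)]
    positions_a = [i for i, w in enumerate(words) if w == term_a.lower()]
    positions_b = [i for i, w in enumerate(words) if w == term_b.lower()]
    return [(a, b) for a in positions_a for b in positions_b
            if a != b and abs(a - b) <= max_distance]

# Hypothetical sample text, invented for this sketch.
sample = ("The pilot met the senator in Mena. "
          "Years later the senator denied knowing any pilot.")
print(proximity_matches(sample, "pilot", "senator", 3))  # [(1, 4)]
```

Widening `max_distance` to 4 would also report the second pair, which shows how the distance setting trades precision for recall.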

2. A SEARCH WITHIN SENTENCE or SEARCH WITHIN PARAGRAPH Operator

Perhaps there is no greater frustration than searching multiple terms using the
AND operator and finding thousands of pages that contain both terms, yet in
contexts totally unrelated to each other. Depending on the search terms
employed, the results may be well targeted, or may simply contain every term
with no connection among them.

You are searching for instances where George H.W. Bush and CIA drug-running
pilot Barry Seal appear and are identified as having a relationship
to each other. You do a quick search, Bush AND Seal. Google reports a
page count of 1,520,000 pages! Treating both terms as quote-limited
phrases, “George Bush” AND “Barry Seal,” still yields a count of 967
pages. Even after scanning link titles, a substantial number of pages
must be opened and scanned to determine whether there was
a specific commonality between these individuals.

If, however, the two terms can be linked as occurring within the same
sentence or paragraph, commonalities can be rapidly identified.
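
The idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: paragraphs are taken to be separated by blank lines, sentences are split naively after ".", "!" or "?" (so abbreviations such as "H.W." would be split wrongly), terms are matched as case-insensitive substrings, and all names and sample text are hypothetical.

```python
import re

def split_paragraphs(text):
    """Assumption: paragraphs are separated by one or more blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

def split_sentences(paragraph):
    """Naive split after '.', '!' or '?'; abbreviations will be split wrongly."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", paragraph) if s.strip()]

def units_with_all_terms(units, terms):
    """Keep only the units (sentences or paragraphs) that contain every term.
    Matching is by case-insensitive substring, so 'seal' also matches 'sealed'."""
    lowered = [t.lower() for t in terms]
    return [u for u in units if all(t in u.lower() for t in lowered)]

# Hypothetical two-paragraph document.
doc = ("Alpha met Beta in 1980. Gamma was absent.\n\n"
       "Alpha left town. Beta arrived later.")
paragraphs = split_paragraphs(doc)
sentences = [s for p in paragraphs for s in split_sentences(p)]

print(len(units_with_all_terms(paragraphs, ["Alpha", "Beta"])))  # 2
print(units_with_all_terms(sentences, ["Alpha", "Beta"]))
```

Both paragraphs contain both terms, but only one sentence does, which is exactly the narrowing the operator is meant to provide.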

3. Extraction and combination of all sentences or paragraphs having search
terms in common

In your investigation of the JFK assassination, you have assembled
many books and articles which you have converted to text. You want to
see how much agreement exists among authors regarding the frame-by-
frame analysis of the Zapruder film. You search for “Frame nnn.”
Instead of copying and pasting each paragraph in which the frame is
mentioned, you create one document containing all paragraphs in
which this frame is mentioned, along with the source from which each
was extracted.

The following material is a three-part white paper, introducing the reader to three
approaches to Data Mining, requiring little or no previous training and using
low-cost or free computer applications.

In Part I, the reader is introduced to some principles of text analysis, and then
walked through an example of how these principles can be applied using
ArchiText, the program described above.

Part II introduces some quantitative methods which yield readily interpretable
graphic displays. The differences between analytic and Exploratory Data Analysis
are presented. The use of Dot Plots, Box Plots, Frequency Analysis, Scatter
Plots, and Contingency Tables is introduced, with interpretive examples
presented.

Part III is an in-depth discussion of the use and interpretation of Contingency
Tables, and introduces the concept of Block Tagging, and how it is applied to
analytic activity.

Do-It-Yourself Data Mining – Part I
Text Analysis Using ArchiText

Principles of Text Analysis

Before considering the existing program, and the mechanics necessary for
updating it, it is important that we share a common understanding of the
principles of text analysis that this author considered in developing the original
design of the program.

To that end it will be useful to review some of the slides in a PowerPoint
presentation designed to clarify those principles.

Concordancing

The earliest analysis of text began with a process going back to the development
of printing. Called Concordancing, it consists of locating every word within a
text, and counting the frequency of its appearance within the text. This process
was first applied to scholarly analyses of the Bible. The slide below illustrates a
screen shot of a free Concordancing program, TextStat, available on the Net.

Getting the frequency of occurrence of specific words, by itself, has
considerable utility. Here is some of the information which can be derived just
from this data:

• Subject Information: Inspection of the proper and common nouns,
sorted by frequency, quickly provides an overview of the subject of the
text collection.

• Structural Analysis: Teachers, linguists, and others having interest in
the structure of language use can apply various ratios and percentages
to list contents. Examples include the percentage of prepositions in the
total word count, the ratio of active to passive voice verbs, etc.
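
Both kinds of information can be sketched with a few lines of Python. The code below is an illustration only, not how TextStat or ArchiText work internally; the preposition set is a deliberately tiny, assumed subset, and the sample sentence is invented.

```python
import re
from collections import Counter

# Illustrative subset only; a real preposition list would be much longer.
PREPOSITIONS = {"in", "on", "of", "to", "by", "with", "from", "at", "for"}

def word_frequencies(text):
    """Count every word in the text, ignoring case."""
    return Counter(re.findall(r"[a-z0-9'-]+", text.lower()))

def share_of(freqs, word_set):
    """Fraction of all word occurrences that belong to word_set."""
    total = sum(freqs.values())
    hits = sum(n for w, n in freqs.items() if w in word_set)
    return hits / total if total else 0.0

text = "The report of the panel went to the chair of the board."
freqs = word_frequencies(text)
print(freqs.most_common(2))           # [('the', 4), ('of', 2)]
print(share_of(freqs, PREPOSITIONS))  # 0.25
```

Sorting by frequency gives the subject overview; the ratio function gives one of the structural measures mentioned above.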

Still, the mere presence of a word, or even its frequency of occurrence, provides
relatively limited information regarding its usage within the corpus (the totality of
all documents under analysis). The next step, therefore, is to view a word or
words of interest within a context. You will often find the acronym KWIC (Key
Words In Context) referring to this process.
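
A KWIC listing is simple to sketch. The Python below is a minimal illustration, assuming a context window measured in whole words and a made-up sample text; the bracket notation around the keyword is just a display choice for this example.

```python
import re

def kwic(text, keyword, width=3):
    """List each occurrence of keyword with `width` words of context per side."""
    words = re.findall(r"\S+", text)
    lines = []
    for i, word in enumerate(words):
        # Compare without surrounding punctuation, ignoring case.
        if word.strip(".,;:!?\"'()").lower() == keyword.lower():
            left = " ".join(words[max(0, i - width):i])
            right = " ".join(words[i + 1:i + 1 + width])
            lines.append(f"{left} [{word}] {right}".strip())
    return lines

text = "The committee reviewed the report. The report named three witnesses."
for line in kwic(text, "report", width=2):
    print(line)
# reviewed the [report.] The report
# report. The [report] named three
```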

While not shown here, proximity searching is made available in the Query Editor,
initiated with the button shown.

Expanding to a full citation provides a full view of just how and where each word
appears in the context of the total document. This is quickly accomplished by
clicking the Citation button.

The Complexity of Word Tags

Using a number of methods, individual words can be linked to each other as they
occur throughout the corpus of material under study. This is especially important
when the document contains a large number of names of people, locations, or
events which can easily become confusing. Which person is linked to which
employer, location, or event? What about individuals sharing the same surname?
Separating into categories and clear identities brings clarity to confusion.

The next five slides, with text extracted from the 9/11 Commission Report,
illustrate this clarification process.

We initially begin with a paragraph of unedited text:

The first step in increasing the information value of individual words is to select a
category, in this case surnames of individuals. Using the Replace dialog in Word,
change all instances of a surname to ALL CAPS.

Any compound word which you wish to have listed completely must have the
space character replaced with a hyphen.

Prefixes serve to group words which are members of the same category together
so that they will appear in a group within a word listing.

Suffixing of the root word provides differentiation between identical names. Thus in
the example below, the two “AL-SHEHRIs” are identified as being siblings, but
can be separated with respect to individual activity.

Here are some rules for compound words, other than the names of people:

All of the techniques shown above enhance your capabilities for analysis of a
document or collection of documents, but none are essential to the full
employment of the program.
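
The tagging steps described above (ALL CAPS surnames, hyphenated compounds, category prefixes, and distinguishing suffixes) were done by hand with the Replace dialog in Word. Purely as an illustration, they could also be scripted. The Python sketch below is an assumption-laden example: the tag layout `p-SURNAME-F` (prefix, capitalized surname, first initial as suffix) is a hypothetical scheme chosen for the demonstration, and the names are invented.

```python
import re

def tag_name(text, first, surname, prefix="p-"):
    """Replace 'First Surname' with one tagged token such as p-SURNAME-F.

    Hypothetical scheme: category prefix + ALL CAPS surname (spaces turned
    into hyphens) + first initial as a suffix to separate identical surnames.
    """
    root = surname.upper().replace(" ", "-")
    tag = f"{prefix}{root}-{first[0].upper()}"
    pattern = re.escape(first) + r"\s+" + re.escape(surname)
    return re.sub(pattern, tag, text, flags=re.IGNORECASE)

# Invented sample text with two people sharing a surname.
text = "Ann Lee met Bob Lee. Ann Lee then left for New York."
text = tag_name(text, "Ann", "Lee")
text = tag_name(text, "Bob", "Lee")
# Compound words other than names: replace the space with a hyphen.
text = text.replace("New York", "New-York")
print(text)
# p-LEE-A met p-LEE-B. p-LEE-A then left for New-York.
```

Because every tagged token is now a single "word," a frequency listing will keep the two people separate while still sorting them next to each other.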

ArchiText, a Text Analysis (Text Mining) Program


Import and Split – Creating Nodes

The first step in using ArchiText is to import the corpus of one or more
documents into the program. In the example shown below the text of the 9/11
Commission Report is going to be subjected to analysis.

In general, if the document is of any significant length you'll want to split it into
categories or sections which represent some logical division of the whole. After
creating a new ArchiText file, select the file you're going to use.

You can elect to import the entire file, or to split the file into elements, referred to
as “Nodes.” If you choose the former, you will have just one node, having the
document name.

If you decide to split into sections, you can use any symbolic character (or
combination of symbolic characters) as the target string by which sections are
split, as shown above.

In this example, we have split the entire document into chapters, as illustrated in
the Node Directory shown below. Double clicking on any of the Nodes will open a
window in which the text of the node appears.
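
The import-and-split step can be pictured with a short Python sketch. This is not ArchiText's code; the function name, the "@@" marker, and the numbered node names are assumptions for the example. As in the description above, a text with no split marker yields a single node carrying the document name.

```python
def split_into_nodes(text, delimiter, document_name="Document"):
    """Split text at each delimiter string and return {node_name: node_text}."""
    parts = [p.strip() for p in text.split(delimiter)]
    parts = [p for p in parts if p]
    if len(parts) <= 1:
        # No split requested or possible: one node named after the document.
        return {document_name: text.strip()}
    # Hypothetical naming scheme: document name plus a two-digit sequence.
    return {f"{document_name} {i:02d}": part
            for i, part in enumerate(parts, start=1)}

raw = "Intro text.\n@@\nChapter one text.\n@@\nChapter two text."
nodes = split_into_nodes(raw, "@@", "Report")
print(list(nodes))  # ['Report 01', 'Report 02', 'Report 03']
```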

Node Selection

Regardless of the analysis you are going to perform, you can select all nodes,
an ordered set of nodes, or a discontinuous combination of nodes. Preferences
allow you to order nodes alphabetically or by time modified.

Keyword Lists

This is usually the first analysis you will do. Typically, you will already have
done a lot of tag preparation in the original file, as described above. In the
selection above, our interest is in identifying the key terrorist players, so the
keyword search was restricted to those nodes where they were discussed.

After selecting the nodes whose words are to be listed, the keyword dialog sets
up the parameters for the listing.

Most of these choices are self-explanatory, but the “stop word list” requires some
discussion. ArchiText comes pre-loaded with a modifiable list of words (articles,
prepositions and auxiliary verbs) which ordinarily are irrelevant to content. Thus,
when this item is checked, these words are eliminated from the frequency
listings. However, there are times when these words are useful for a given
analysis, and they can then be included by un-checking this box.

In this partial view of the resulting frequency list, each person has been prefixed
by “p-”, which facilitates grouping all of those named who fit into the category
“Person.” For those occurring with high frequency, we will proceed to extract all
information regarding them, and combine that information into a single new node
focused only on each of them.
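
A keyword listing with an optional stop word list, followed by grouping on a category prefix, might be sketched as follows. The stop word set here is a small assumed sample rather than ArchiText's actual list, the function names are hypothetical, and the tagged node texts are invented.

```python
import re
from collections import Counter

# Small illustrative stop word list; the program's own list is not reproduced.
STOP_WORDS = {"the", "a", "an", "of", "in", "to", "and", "was", "is", "were"}

def keyword_list(node_texts, use_stop_words=True):
    """Count words across the selected nodes, optionally dropping stop words.
    Case is preserved so that ALL CAPS tags stay distinct from ordinary words."""
    counts = Counter()
    for text in node_texts:
        for word in re.findall(r"[A-Za-z0-9'-]+", text):
            if use_stop_words and word.lower() in STOP_WORDS:
                continue
            counts[word] += 1
    return counts

def by_prefix(counts, prefix):
    """Entries starting with a category prefix, most frequent first."""
    return sorted(((w, n) for w, n in counts.items() if w.startswith(prefix)),
                  key=lambda item: (-item[1], item[0]))

selected = ["p-LEE-A met p-LEE-B in the hall.", "p-LEE-A was in the office."]
counts = keyword_list(selected)
print(by_prefix(counts, "p-"))  # [('p-LEE-A', 2), ('p-LEE-B', 1)]
```

Passing `use_stop_words=False` corresponds to un-checking the stop word box, so that words such as "the" are counted as well.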

Extract and Combine to make new nodes

In our first search, two of the terrorists have been selected. Selecting the “S” tab
will automatically initiate the search dialog. Remember that the nodes have been
preselected when the keyword list was constructed.

After pressing the “Start Search” button, you will see the following results in the
Directory.

Notice that the number of occurrences of the names of the two terrorists, within
each node, is highlighted. The next step is to extract just this information, and
combine it into a new node. To do this, select “Combine Nodes” from the
“Analysis” menu.

In the example shown below, we have searched for George Bush, and are
extracting all occurrences of his name throughout the nodes.

Select “Embed Node Name” if you want the source nodes named in the new
node. After completion, a new node is created containing only those paragraphs
in which Bush is named. The result of this combination looks like this:

This illustration is, of course, only a small portion of the nodes in which the Bush
name occurs. If you wish, you can “drill down” further, building a keyword list for
this node alone, and searching for other combinations related to Bush, as they
occur within paragraphs or sentences in which his name occurs. If desired, you
could build extracted nodes for any combinations of Bush and other words
included in your search.
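
The extract-and-combine operation can be outlined in Python as below. This is a sketch of the idea only, under the assumptions that nodes are held as a name-to-text mapping, paragraphs are separated by blank lines, and the source is recorded as a bracketed label; none of these details are taken from the program itself.

```python
import re

def combine_nodes(nodes, term, embed_node_name=True):
    """Collect every paragraph containing term from a {name: text} mapping
    into one new text, optionally labeling each with its source node."""
    extracted = []
    for name, text in nodes.items():
        for paragraph in re.split(r"\n\s*\n", text):
            paragraph = paragraph.strip()
            if paragraph and term.lower() in paragraph.lower():
                label = f"[{name}] " if embed_node_name else ""
                extracted.append(label + paragraph)
    return "\n\n".join(extracted)

# Invented nodes for illustration.
nodes = {
    "Chapter 1": "The budget was approved.\n\nNothing relevant here.",
    "Chapter 2": "An unrelated opening.\n\nCritics attacked the budget.",
}
print(combine_nodes(nodes, "budget"))
```

Running the same function on the combined result, with a second term, is one way to picture the "drill down" described above.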

Identify Relationships – Node Maps

In a 500-page document there are obviously a huge number of relationships
between people, events, locations, and other categories. Node Maps help you
find and manipulate these relationships in many different ways.

New maps are built in the same way as are new nodes -- by using the create
button for maps in the directory dialog. You will note that there are a number of
nodes which are not on the map, but which are available through selection and
pressing the "Add to Map" button. When nodes are deleted from the map, they
appear in the left column which is the "On-Call" list. Another way that nodes can
be added from the On-Call list is through a search which selects some of the
nodes in this list.

One way of visualizing the nodes found in a search is to change the size of the
nodes selected by that search. This option is available by selecting, "Change
Node Size," in the Map menu found on the main toolbar.

A far more powerful option is available. Using one or all of the eight linking tools
which are available, a "Parent Node" can be connected or linked to each of the
nodes to which a relationship exists. One example of this linking is shown in the
map below. In this case, Terrorist 001 (Osama bin Laden) is linked to each of the
chapters in which his name appears.
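
Conceptually, such a map is a set of links from a parent node to the nodes that mention it. The Python sketch below shows one way to derive those links automatically; it is an assumption about how a modern rewrite might do it, not a description of the original linking tools, and the node names and texts are invented.

```python
def link_parent_to_matches(nodes, parent_name, term):
    """Return {parent_name: [names of nodes whose text contains term]}."""
    linked = [name for name, text in nodes.items()
              if name != parent_name and term.lower() in text.lower()]
    return {parent_name: linked}

# Invented nodes; "Person 001" plays the role of the parent node.
nodes = {
    "Chapter 1": "Person 001 arrives in the city.",
    "Chapter 2": "No mention in this chapter.",
    "Chapter 3": "A letter from person 001 is found.",
}
print(link_parent_to_matches(nodes, "Person 001", "person 001"))
# {'Person 001': ['Chapter 1', 'Chapter 3']}
```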

As you see below, every node which is linked to another is illustrated in the nodes
window. Double clicking on any node name opens the node window, and
depending on the preference settings, will either open the source node and
destination node, or simultaneously open the destination node while closing the
source node.

Implications for data mining

The methodology employed here facilitates the discovery of all kinds of
relationships between people, events, locations, and in fact any word or phrase
and any other. Typically, as relationships are discovered, new sub-nodes will be
created so that those relationships can be examined and further linked to other
relationships.

It is not necessary to do the specialized tagging, which will be explained in the
following tutorial directed at methods of text analysis. Tagging simply makes it
easier to define categories of items, so that they are easier to locate and identify
within ArchiText, and provides a basis for quantitative analyses which can flow
from these categorical classifications.

Some limitations

While the design of this program offers features which this author has found in no
other program, because it was designed in 1988, there are some limitations and
deficiencies which demand starting from the beginning and rebuilding the
program shell. Listed below are some of the current problems which must be
resolved for the program to reach its potential power for its users:

• By far the most serious deficiency in this program is the fact that it will
only operate on older Macintosh computers still running OS 9.x or
earlier. The search and linking functions are available in no other
program, except highly expensive enterprise-level data mining
systems. Thus the program needs to be updated so that it is usable
on any platform.

• As currently constituted, the program can only import text in ASCII
format, and lacks the capability to open Internet files or read from
them.

• There are a number of deficiencies in the search algorithm, most
particularly the program's inability to process numerical searches.
Thus, a search for a number greater than, equal to, or less than
another quantity cannot currently be accomplished.

• As displayed above, the mapping capabilities of the program are
very limited, and a number of modifications should be made so that
more effective pictographic displays are readily available. An example
of one such possibility is shown below.

What’s Next?

Part II of this series discusses application of some of the more traditional
methods of Data Mining, describing the ease with which standard statistical
methods may be used to determine and present complex relationships existing
between words and numbers, without the necessity for advanced expertise, or
expensive and complex professional analytic software.

Do-It-Yourself Data Mining – Part II
Concepts and Display

Introduction
In beginning our consideration of Data Mining, readers will find many, if not all, of
the concepts involved to be foreign to their past experience. “Data mining
(DM), also called Knowledge-Discovery in Databases (KDD) or Knowledge-
Discovery, is the process of automatically searching large volumes of data for
patterns using tools such as classification, association rule mining, clustering,
etc. Data mining is a complex topic and has links with multiple core fields such
as computer science and adds value to rich seminal computational techniques
from statistics, information retrieval, machine learning and pattern recognition.”

For most of us, our major approach to processing information ultimately depends
on the viewpoint of others. This is because our entire education has largely
consisted of memorizing facts, and solving problems based on the rules that
others have given to us. We typically find that we rely on "facts" and
"conclusions" which come from those whose viewpoint is most similar to that
which we have developed over a long period of time. Here is a graphic example
which illustrates this process.

One reason that many avoid engaging in DM is the perception that it requires not
only training in statistics, but in database usage as well. Typically, those
employed as data analysts will have formal training and experience in database
programming languages, statistical programming, as well as research design.
This paper seeks to provide those who have competence in general computer
applications with the intellectual tools necessary to shortcut the heavy duty
software and training used by the professional.

It all depends on Point of View

Assume that you hold a “Liberal” viewpoint with respect to the war. You will tend
to reject the statements of conservatives – in essence “screening out” the Red
view of any dispute. You will tend to accept, in fact, even receive, only that
information which is in agreement with these long held views.

Conversely, if the view you hold is consistent with those held by conservatives,
you will tend to see and incorporate into your thinking the views held by members
of that group.

None of us is immune to the tendency to accept or reject new information to the
degree that it is consonant with our previously constructed world views. The
analysis tools discussed here can serve to free you from these frameworks and
provide new ways of looking at the information contained within the text.

What you will be learning

Looking at the names given to the disciplines and bodies of knowledge used by
those who are engaged in Data Mining, you are likely to be thinking that such
training is far beyond your own education. This article is designed to show you
that, while some of those who do DM have advanced academic training, the
core principles can be learned and employed by everyone, and in fact can be
used by students in their early teens.

For those lacking a strong background in statistical analysis and the tools of
quantitative analysis such as Regression, Correlation and Cross Tabulation
(Contingency Tables), this material will initially appear to be very new. Most who
have little or no experience using and calculating statistics tend to think of
statistics as a discipline which uses numeric values to reach conclusions or
results. While this is very much the case, there are also a number of
statistical methods which use text exclusively, or as a part of the calculations.

As you proceed through this tutorial, you will find that the concepts introduced
are much easier to understand than you had previously believed to be the case.

From Numbers and Words to Conclusions – An Example
Getting acquainted with statistical ideas
Before starting we need to define some terms used by statisticians when they
carry out research.

Populations and Samples

Two terms you will run across throughout this document are Sample and
Population. The 100 8th graders, whose responses we are going to obtain, are
a Sample of all boys and girls in the 8th grade, going to school in the United
States. This total is referred to as the 8th Grade Population. Whatever we do
with the sample, the greatest concern is that the results we find for the sample
are very similar to those which would be found if the entire Population could be
measured.

Variables

Information regarding either Populations or Samples is broadly divided into two
kinds of Variables – Discrete (Categorical) and Numeric.

The height of each member (case) in the sample as well as the grade point
average are the numeric variables which will be used in this example.

Categorical (Text) variables are labels which are assigned to place each case in
one of two or more different Categories. “A,” “B,” “C,” “D,” and “F” are all
elements of the Categorical Variable named “Grade.” These categories are
either located and extracted from the text of documents being analyzed or derived
from equations, such as the one shown below. In this instance a formula was used
to derive a letter grade from the grade point average for each case.
Figure 1 Deriving Letter Grades from Performance Average
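
For readers who prefer to see such a derivation written out, here is a minimal Python sketch. The cutoffs are not taken from the figure; they are inferred from the score ranges mentioned later in this paper (failing below 61, below "C" meaning below 71, ranges stepping by 10 up to 100), so treat them as an assumption.

```python
def letter_grade(average):
    """Map a numeric performance average (0-100) to a letter grade.
    Cutoffs are assumed from the 10-point ranges used in this paper."""
    if average >= 91:
        return "A"
    if average >= 81:
        return "B"
    if average >= 71:
        return "C"
    if average >= 61:
        return "D"
    return "F"

print([letter_grade(score) for score in (58, 66, 75, 88, 97)])
# ['F', 'D', 'C', 'B', 'A']
```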

All of the categorical variables in this example come from mathematical
derivations, since there is no real textual data from which they can be extracted.
Nonetheless, you will find yourself dealing with this same construction of
categories, especially when the text source is a series of tables.

First Steps – Defining the Analysis

More than anything you do, this is the most important element of conducting your
analysis. Your efforts are done for some purpose. This first step is where you
define the questions for which you seek answers. Here is an example of the kind
of question that you may seek to answer through your analysis.

There is a huge amount of material about whether global
warming is taking place, and if it is, what people can or
should do about it. There are other scientists who say that
the world’s temperature has nothing to do with what man
does or does not do. How can I determine whom to believe?

Or, here is another:

The President tells us that the way we need to go in Iraq is to
add troops to the war, but the majority of the
American public says we should get out. So, too, do many of
our generals, the majority of Congress, and many experts.
Who is right?

In order to illustrate the use of simple categories, an example has been
developed from a survey, sampling answers from students attending a middle
school located in a small Texas town. This hypothetical analysis generated a
number of questions which illustrate how one might use labels to evaluate
questions that would otherwise seem to require numeric data. Here are
some of the questions this analysis will answer.
1. Given the choice, during leisure time, of “Playing Sports” or “Doing
something else,” which choice will 8th graders make?
2. Is there a difference between genders with respect to which is more
likely to make either choice?
3. Is there a difference in academic performance between boys and girls?
4. Is there a difference in average height between those who select “Playing
Sports” and those who select “Doing something else”?
5. Is there a relationship between height and the choice of use of leisure
time?
6. Is there a difference in grades between those who select “Playing
Sports” and those who select “Doing something else”?

If this sample had been actually collected, it would have been done by asking
each sampled student the following questions:
1. Are you a boy or a girl? ________
2. What is the most recent report card grade you received in this class? ____
3. How tall are you (in inches)? ____
4. If you have a choice of activity on the weekend, which would you prefer to
do? ___ Play Sports ___ Do something else [Put an “X” on the line next to
your choice.]

From the 100 sets of answers to these 4 questions, a combined word count of a
little over 500 words, a wealth of information can be obtained. You
will begin to reach some powerful answers to your questions and to know the
likelihood that your answers are correct.

Composition of the Sample – Boys vs. Girls

The first thing we want to know is whether there are an equal number of boys
and girls in our sample. As you can see both are equally represented in the
sample.
Figure 2 – Gender composition of sample

Overall Academic Performance

What about the academic performance of all in the sample? Since the letter
grades are derived from numeric averages of performance throughout the school
year, we use a Bar Chart to inspect the number of students receiving each of the
five grades.
Figure 3 – Grade Frequencies

The horizontal axis contains the ranges of grades, in this case with the lowest
running from 51 to 60, proceeding in increments of 10 points, and ending in a
perfect score of 100. The vertical axis gives you the number of cases within the
selected ranges.
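
The counting behind such a chart is easy to sketch in Python. The example below assumes whole-number scores of 51 or above, the 10-point ranges just described, and a handful of invented scores; it is an illustration, not the output of the statistics program used for the figures.

```python
from collections import Counter

def grade_range(score, low=51, width=10):
    """Return the (start, end) range containing score, e.g. 78 -> (71, 80)."""
    start = low + ((score - low) // width) * width
    return (start, start + width - 1)

def range_counts(scores):
    """Number of cases falling in each range, in ascending order of range."""
    return dict(sorted(Counter(grade_range(s) for s in scores).items()))

# Invented scores for illustration.
scores = [55, 64, 72, 75, 78, 83, 88, 95]
for (start, end), count in range_counts(scores).items():
    print(f"{start}-{end}: {'#' * count}")
```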

We can see that performance is pretty much as we might expect – A few in the
failing and top ranges, more in the high and low ranges, and most right in the
middle…where a “C” is the grade awarded. But that is far from the whole story.

Looking at this graph, you will certainly want to know whether there is any
difference between the academic performance of the boys vs. the girls.

The first step is to inspect the distribution of scores within each category, Boys
and Girls, against the entire sample. To do this, we use a plot called a “Dot Plot,”
and with the boxes added, the plot is referred to as a “Box Plot.”
Figure 4 – Dot and Box Plots of Performance Averages

As you can see, the girls did better than the boys, both in the median of their
scores (white line in the center of the shaded areas) and in the top grade
received by a girl vs. that of a boy. While none of the girls failed (score below 61),
two of the boys did. Removing the boxes, we see the three boys who are
“outliers,” those who represent extreme low values disconnected from the rest of
their group, as well as the three high-scoring girls who are also disconnected
from the rest of their group by their high performance.
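
The numbers that a box plot draws can be computed directly. The Python sketch below uses the standard library's `statistics.quantiles` and flags outliers with the common convention of 1.5 times the interquartile range beyond the quartiles. That convention is an assumption here; the paper does not say which rule its plotting program applies, and the scores are invented.

```python
from statistics import quantiles

def box_summary(scores):
    """Median, quartiles, and outliers by the 1.5 x IQR convention (assumed)."""
    q1, q2, q3 = quantiles(scores, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [s for s in scores if s < low_fence or s > high_fence]
    return {"median": q2, "q1": q1, "q3": q3, "outliers": outliers}

# Invented scores: one value sits far below the rest of the group.
group = [52, 70, 72, 74, 75, 76, 78, 80, 82]
print(box_summary(group))
```

Computing the summary separately for each group gives the figures needed to compare medians and spot disconnected cases.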

Yet, while we have a good look at where the average scores fall, and the
distribution of scores in each group, we really don’t have any accurate count of
how many in each group received which letter grade. To get this precision, we
instead turn to a statistical tool referred to as a “Contingency Table.”

As you look at the table, what becomes evident is that, at every grade level, the
girls did better than the boys. There is something else that is evident. If you look
at the p-value shown at the bottom of the table, you will note that it is shown as p
= .0478. This tells you that, if gender and grades were in fact unrelated among all
8th graders, a difference at least as large as the one in this table would turn up
by chance in slightly fewer than 5 out of 100 such surveys.
Figure 5 – Table Boy vs. Girl Grades Received
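
For the curious, here is a hedged Python sketch of how a contingency table and a p-value of this kind can be computed. It assumes the Pearson chi-square test, which the paper does not name, and it uses a closed-form tail probability that is valid only when the degrees of freedom are even (a 2 x 5 gender-by-grade table has 4). The counts are invented and are not meant to reproduce the .0478 above.

```python
from collections import Counter
from math import exp

def contingency(rows, cols):
    """Cross-tabulate two parallel lists of category labels."""
    counts = Counter(zip(rows, cols))
    row_labels, col_labels = sorted(set(rows)), sorted(set(cols))
    table = [[counts[(r, c)] for c in col_labels] for r in row_labels]
    return row_labels, col_labels, table

def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat, (len(table) - 1) * (len(table[0]) - 1)

def p_value_even_df(stat, df):
    """Upper-tail chi-square probability; this closed form needs an even df."""
    if df % 2:
        raise ValueError("closed form shown here requires an even df")
    half, term, total = stat / 2, 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return exp(-half) * total

# Invented counts: rows are two groups, columns are three categories.
stat, df = chi_square([[10, 20, 30], [30, 20, 10]])
print(stat, df, p_value_even_df(stat, df))
```

A small p-value says only that a table this uneven would be unusual if the two variables were unrelated; it does not say how large or how important the difference is.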

Finding relationships

To this point, we have been working largely with word frequencies to
interpret the information we have displayed. Now we turn to some
numerical methods for determining relationships between the underlying
variables which have led to classifying some of the variables into words.

Height and Academic Performance

The 8th grade is a time of great change in physical growth for adolescents.
Thirteen-year-old girls tend to be well into puberty, while many boys lag in
development, causing a far greater variation in boys’ heights than is found with the
girls.

Figure 6 – Comparison of Heights for Boys and Girls

Nonetheless, boys at this age are, on average, approximately
two inches taller than the girls.

Figure 8 - Height vs. Grades – Full Sample

This Scatter Plot of Height vs. Grade averages for the entire sample is very
interesting. There is a modest trend toward those receiving higher grades being
shorter than those receiving lower grades. Boys are shown as Xs, and Girls are
shown as small open circles. The red line is the cutoff separating low and high
grades. Is this trend the same or different for boys and girls?

To answer this question we split the total by gender as shown in the figure below.
Figure 9 – Height vs. Grades by Gender

While we see this trend for boys persisting in the left plot, there appears to be
almost no relationship for the girls, as shown by the nearly level line in the girls’
plot at right. One of the nice things about this kind of display is that it does not
require the viewer to try to interpret complicated numeric calculations. Instead,
simply looking at the plot makes a number of things evident:
• There are only 7 girls, as opposed to 18 boys, who received grades
below “C” (< 71) in this sample.
• Looking at the boys’ heights, the shortest boys tended to get the best
grades, while the tallest boys were more evenly distributed in the grades they
received. Thus, one might posit that short boys may be more motivated
toward academic work than tall boys, since they have fewer distractions in their
attraction to the girls, and less likelihood of being engaged in time-consuming
sports activity.
• Conversely, there is almost no relationship between the heights of girls
and the grades they receive. While they may be involved in extra-curricular
activities, most parents will severely limit any dating activity by this age group.
Choice of Leisure Activity and Grades

Recall that there was another two-choice question in our example. It asked, “If you
have a choice of activity on the weekend, which would you prefer to do? ___ Play
Sports ___ Do something else.”

Recall that in our example, these students are living in a small town in Texas. In such
towns, high school football assumes a high degree of importance. This cultural
bias toward sports participation leads us to the following hypotheses:
• While the students in our sample are too young to participate in a high
school program, boys will have strong aspirations and interest in future
participation, leading to leisure time sports activity.
• Since high school football is a male-only sport, girls will show less
interest in sports participation, although some, of course, will participate in
programs that are equally open to boys and girls.
• Regardless of gender, students with larger body mass will have a
greater inclination towards sports participation than their smaller counterparts.
• For a variety of reasons, students who either participate, or are
emotionally invested, in organized athletic activity will tend towards lower
grade achievement than those not involved.

We begin this analysis with a contingency table showing the relationship between
academic performance and sports participation:
Figure 10 – Table Activity Choice vs. Grade Performance

When the choice of “Play Sports” vs. “Something Else,” is overlaid over grades
and height, we see a clear relationship between those receiving poor grades and
those choosing to spend their leisure time playing sports, rather than doing
“Something Else.” You will also note that this behavior is much more pronounced
among the boys (11) as compared with the same choices made by girls (3).

Figure 11 – Overlay of Activity Choice vs. Grades and Height

A Note about the “Findings”

In reviewing all of the above, it is important to note that all of these findings are
based upon a set of results constructed by the author. The survey questions
described were never actually given to any group of students, and the results are
completely fictitious. There are, in fact, a number of studies which take an
opposing view, finding that student athletes tend to be among the high
performers within their educational settings. One peer-reviewed study is
available here, involving many more variables than the sample provided here.

Statistical Programs

For anyone wanting to do serious analysis using the methods described above, a
statistical analysis program is required. Many readers will be reluctant either to
spend the money or to climb the steep learning curve required to master many
professional-level programs. The author uses a professional version of
DataDesk, a uniquely powerful yet easy-to-use program. A relatively low-cost
($75.00) Excel add-in for the same program is offered by the publisher, making
available all of the analyses described above.

Do-It-Yourself Data Mining – Part III
Using Block Tags to Analyze Text

Introduction
In Part II of this series of articles, you learned how text and numeric data can be
used to extract meaningful and useful information from large collections of textual
material. In this section, we look at how the textual content can be reorganized to
be extracted, as demonstrated in the previous article.

As before, your most important task is going to be to define, in advance, the
information you are expecting to derive from your efforts. As previously, this will
be stated in the form of questions to be answered or hypotheses to be tested.

There are two kinds of tagging which will be considered, each form serving to
answer different purposes.

Word Tags

Word Tags are words contained within the document which have particular
importance either because of the frequency with which they occur or their
association with other words within a sentence or paragraph. By capitalizing
(George BUSH), compounding (New-York), and adding prefixes (DOD-RUMSFELD)
or suffixes (BUSH-POTUS43) to words identified as important,
word elements can be combined and linked to others having some element in
common. A full discussion of Word Tags was provided in Part I of this series of
articles.
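
The Word Tag conventions above amount to a small set of find-and-replace rules. The following is a minimal Python sketch of that idea. The rule table, the function name apply_word_tags, and the sample sentence are all hypothetical illustrations, and nothing here is meant to describe how ArchiText itself performs tagging.

```python
import re

# Hypothetical tagging rules as (pattern, replacement) pairs. Illustrative only.
WORD_TAG_RULES = [
    (r"\bNew York\b", "New-York"),      # compounding
    (r"\bRumsfeld\b", "DOD-RUMSFELD"),  # capitalizing plus a prefix
    (r"\bBush\b", "BUSH"),              # capitalizing
]

def apply_word_tags(text, rules=WORD_TAG_RULES):
    """Apply each find-and-replace rule in order and return the tagged text."""
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    return text

sample = "George Bush and Donald Rumsfeld spoke in New York."
tagged = apply_word_tags(sample)
print(tagged)
```

Keeping the rules in one table makes it easy to apply exactly the same tags to every document in a collection.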

Block Tags

Block Tags are words added by the user to categorize sections of text (by
sentence or paragraph) such that Contingency Tables can be constructed
showing the dependence or independence of one category relative to another.

Before you scratch your head trying to figure out what I am saying, here is an
illustration of the process I am describing:

Each year the President presents his State of the Union
Speech to Congress and the watching nation. In 2005, a
popular President Bush, newly reelected, came to Congress
with a sweeping domestic agenda which he
anticipated bringing into law during his second term in office.
By the same time in 2006, with the war in Iraq going badly,
much of the nation’s attention was directed at resolving the
war, and away from the issues brought in 2005.

Here are some of the questions which flow from this brief description:

Was there a significant difference in the emphasis that the President gave to
various subjects which appeared in both speeches when compared against the
two years in question?

Since the speeches are used as a means of presenting an agenda and convincing
the audience of the value of the President’s views, what elements comprised the
form in which his views were presented?

We will closely inspect these and a number of other questions which Block Tags
answer.

Interpreting Contingency Tables


Before going on to the construction of Block Tags, we need to spend some time
looking at the power of contingency tables to provide you with precise knowledge
regarding the information you are seeking. There is a great deal available from
these seemingly simple numeric tables:

In preparing this example, the text of both State of the Union speeches was
divided into sentences, and then, several categories were assigned to each
sentence:
• YEAR
• DOMESTIC/FOREIGN AFFAIRS
• RATIONAL/EMOTIONAL (A “Selector” Variable)

Here is what they look like in generating the sample contingency tables:

The tables you have viewed previously were all simple 2 x 2 tables. This simply
refers to the number of columns and rows within each table. In the example
below, the two speeches are in columns, with each sentence classified as
referring to either FOREIGN AFFAIRS or DOMESTIC matters.

The question answered by this table is: Was there a difference between the two
speeches with respect to a change in emphasis regarding FOREIGN AFFAIRS
or DOMESTIC matters?

Since the number of sentences in the ’06 speech increases by 27 from that of
’05 (287 - 260), and the number of FOREIGN AFFAIRS sentences increases by the
same 27 (116 - 89), it may first appear that this does not really represent a huge
change.

To get a preliminary idea of the magnitude of the change, let us first substitute
the percentage of column totals to see whether this difference looks important.

We see a small increase (approximately 6.2 percentage points) in the share of sentences
devoted to FOREIGN AFFAIRS, but we still don’t know whether this increase is
important enough to say there has been a real change in emphasis, or whether this was
just a random occurrence. Perhaps there were different speech writers, or some
other factors which entered into the process of preparing the speech.
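
The percentage-of-column-totals step can be checked from the counts quoted in the text. In this Python sketch, the FOREIGN AFFAIRS counts (89 and 116) and the column totals (260 and 287) come from the discussion above. The DOMESTIC counts are obtained by subtraction, which is an assumption on my part, and the variable names are my own.

```python
# Counts quoted in the text; DOMESTIC is derived by subtraction (an assumption).
totals = {"05": 260, "06": 287}
foreign = {"05": 89, "06": 116}
domestic = {year: totals[year] - foreign[year] for year in totals}

# Percentage of each column (speech) devoted to FOREIGN AFFAIRS.
foreign_pct = {year: 100 * foreign[year] / totals[year] for year in totals}
change = foreign_pct["06"] - foreign_pct["05"]

for year in totals:
    print(f"'{year}: FOREIGN AFFAIRS is {foreign_pct[year]:.1f}% of {totals[year]} sentences")
print(f"Change: {change:.1f} percentage points")
```

Comparing column percentages rather than raw counts removes the effect of the two speeches having different lengths.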

To make a further determination of whether this apparently small difference is
really important, we need to look at another set of calculations in the table – the
Expected Values. Without going into how these values are calculated, they
should be interpreted in the following way: The greater the difference between
actual (counted) cell values and Expected Values, the greater the chance that
the row values are affected by the column values.
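
For readers who do want to see it, the usual calculation is: Expected Value = (row total x column total) / grand total. The sketch below applies that formula in Python to counts reconstructed from the text. The DOMESTIC row is derived by subtracting the FOREIGN AFFAIRS counts from the column totals, so treat it as an assumption, and the variable names are illustrative.

```python
# Observed counts: rows are subjects, columns are the '05 and '06 speeches.
observed = {
    "FOREIGN AFFAIRS": [89, 116],
    "DOMESTIC": [171, 171],  # derived: column totals 260 and 287 minus FOREIGN AFFAIRS
}

col_totals = [sum(col) for col in zip(*observed.values())]
grand_total = sum(col_totals)

# Expected count = row total * column total / grand total.
expected = {
    row: [sum(counts) * col_total / grand_total for col_total in col_totals]
    for row, counts in observed.items()
}

for row, counts in observed.items():
    for count, exp in zip(counts, expected[row]):
        print(f"{row}: observed {count}, expected {exp:.2f}, difference {count - exp:+.2f}")
```

In a 2 x 2 table every cell differs from its Expected Value by the same amount, with only the sign changing from cell to cell.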

All of the expected values are above or below their respective counts by a little
over 8 points. Thus we turn to the next measure of differences – the
Standardized Residuals. This number (after complex numeric processing) gives
you information about the magnitude of the variation of the expected values
from the actual counted values.
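
As a rough illustration, a common definition of the standardized residual is (observed - expected) / sqrt(expected). The Python sketch below assumes that definition, since the text does not say which formula the statistics program uses, and it reuses counts reconstructed from the text (the DOMESTIC row is derived by subtraction).

```python
import math

# Rows are subjects, columns are the '05 and '06 speeches.
observed = {"FOREIGN AFFAIRS": [89, 116], "DOMESTIC": [171, 171]}
years = ["05", "06"]

col_totals = [sum(col) for col in zip(*observed.values())]
grand_total = sum(col_totals)

residuals = {}
for row, counts in observed.items():
    row_total = sum(counts)
    for year, count, col_total in zip(years, counts, col_totals):
        expected = row_total * col_total / grand_total
        # Assumed definition: difference scaled by the square root of the expected count.
        residuals[(row, year)] = (count - expected) / math.sqrt(expected)

largest = max(residuals, key=residuals.get)
for cell, value in residuals.items():
    print(cell, f"{value:+.2f}")
print("Largest positive residual:", largest)
```

Under these assumptions the largest positive residual falls in the FOREIGN AFFAIRS cell for ’06.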

In looking at the highlighted number, we see that in the ’06 speech,
FOREIGN AFFAIRS showed the greatest increase in emphasis. The question
that remains to be answered is: How likely is it that the increase in FOREIGN
AFFAIRS statements was related to the time at which it occurred (2006), rather
than being merely due to chance?

To answer this question, we turn to two other statistics: The Chi-Square value,
and the Probability (p = ) of occurrence:

The Null Hypothesis


You would expect that the theory or hypothesis that you are stating, related to
the data above would read something like this:

The increase in FOREIGN AFFAIRS statements is related to the time that it
occurred (2006).

Because of the methods by which statisticians calculate the likelihood or
probability of a hypothesis being true, hypotheses are stated in just the reverse
of the form one would ordinarily expect to see. Thus:

There is no statistically significant difference between FOREIGN AFFAIRS
statements made in 2005 and 2006.

This form of stating a hypothesis is referred to as the Null Hypothesis. When
you see a “p-value” for such a statement, it is giving you the probability of
obtaining a difference at least as large as the one observed if this negative form
of your hypothesis were true. Researchers will Reject the
Null Hypothesis when the value of “p” is less than or equal to some value
previously determined by them. Thus, if we reject the Null Hypothesis, this
means we accept as likely the original form of the hypothesis.

Let’s see how this operates when using the contingency table with which we
have been working:

The number “p = 0.1355” is translated as: If there were truly no difference
between FOREIGN AFFAIRS statements made in 2005 and 2006, a table departing
this far from the Expected Values would still turn up by chance with probability
.1355 (that is, about 13.5% of the time).

For most researchers this number is far too high. Most researchers require a
p-value less than .05 (5%) or .01 (1%), and so we cannot reject the Null Hypothesis,
since the .1355 number is too large.

Without going into a technical explanation of the Chi-Square value, the general
rule is that the higher this number, the lower will be the p-value. The “df” refers to
“Degrees of Freedom.” It is derived from the number of rows and columns in the
table: (rows - 1) x (columns - 1), which is 1 for a 2 x 2 table. As df
increases, a larger Chi-Square value is required to obtain the same p-value.
Neither of these statistics is necessary for your interpretation of these tables.
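
For readers who nevertheless want to see where these numbers come from, here is a minimal Python sketch of the uncorrected Pearson Chi-Square test for the 2 x 2 table discussed above. The DOMESTIC counts are derived by subtraction from figures quoted in the text (an assumption), and the closed-form p-value via math.erfc is valid only when df = 1. Whether the author's statistics program computes the test exactly this way is not stated.

```python
import math

# Rows are FOREIGN AFFAIRS and DOMESTIC; columns are the '05 and '06 speeches.
# FOREIGN AFFAIRS counts and column totals are quoted in the text;
# DOMESTIC counts are derived by subtraction (an assumption).
observed = [[89, 116], [171, 171]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Pearson Chi-Square: sum of (observed - expected)^2 / expected over all cells.
chi_square = 0.0
for i, row in enumerate(observed):
    for j, count in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand_total
        chi_square += (count - expected) ** 2 / expected

# Degrees of freedom: (rows - 1) * (columns - 1).
df = (len(observed) - 1) * (len(observed[0]) - 1)

# For df = 1 only, the upper-tail probability has this closed form.
p_value = math.erfc(math.sqrt(chi_square / 2))

print(f"Chi-Square = {chi_square:.3f}, df = {df}, p = {p_value:.4f}")
```

With these reconstructed counts the result lands close to the p = 0.1355 quoted above, which suggests the reconstruction is reasonable.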

Admittedly, this “double-negative” approach is counterintuitive to almost everyone
who has not been involved with statistical research. However, any reading of
professional journals addressing such disciplines as medicine, psychology,
political science and many others, will show hypotheses presented as described
above.

Selector Variables

A selector variable is used to filter all of the counts within a contingency table,
reducing the totals in each cell to only those cases which meet the criteria for the
selector. In the example below, we add the EMOTIONAL Selector to the
contingency table we have been using.

Inspecting the two tables reveals some important differences between them.
• Comparing the Total Counts between the two tables, EMOTIONAL
statements account for 27.2% of all statements made by the President in both
years. EMOTIONAL sentences occurred 3.5 times more frequently in ’06 when
compared to ’05.
• With the selector applied, the count of DOMESTIC sentences drops from 342
to 97, and the ratio between the two years shifts from exactly 1:1 to EMOTIONAL
statements dominating ’06 by a ratio of 3.6:1.
• While total statements regarding FOREIGN AFFAIRS increased
by 30% between the two years, those having an EMOTIONAL rating
increased by 81%.
• Most importantly, the p value for the Null Hypothesis
has dropped from .135 to .065.
• Taking all of the data together, the interpretation which follows
is that the President substantially increased the use of emotional
appeals between the ’05 and ’06 speeches.
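
The selector mechanism described in this section can be sketched in a few lines of Python: the same counting routine is run twice, once on all coded sentences and once on only those matching the selector. The rows below are hypothetical placeholders, not the actual speech data, and the function name contingency is my own.

```python
from collections import Counter

# Hypothetical coded sentences as (year, subject, type). Not the actual speech data.
rows = [
    ("05", "DOMESTIC", "FACTUAL"),
    ("05", "DOMESTIC", "EMOTIONAL"),
    ("05", "FOREIGN AFFAIRS", "FACTUAL"),
    ("06", "DOMESTIC", "EMOTIONAL"),
    ("06", "FOREIGN AFFAIRS", "EMOTIONAL"),
    ("06", "FOREIGN AFFAIRS", "PROMISE"),
]

def contingency(rows, selector=None):
    """Count (subject, year) pairs, keeping only rows whose type matches the selector."""
    return Counter(
        (subject, year)
        for year, subject, kind in rows
        if selector is None or kind == selector
    )

full_table = contingency(rows)
emotional_table = contingency(rows, selector="EMOTIONAL")

print("All sentences:  ", dict(full_table))
print("EMOTIONAL only: ", dict(emotional_table))
```

Because the selector only filters rows before counting, the filtered table always has the same layout as the full table, with smaller cell totals.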

Contingency Table Summary


While much of the above may have seemed tedious and difficult to retain, the
more you derive from contingency table analysis of categorical variables, the more
powerful and useful will be the conclusions reached from using this tag
assignment system.

Preparing Block Tags
Selecting Variables
As previously mentioned, the number of variables and the variable headings
should be determined in advance. In the example which follows, there are three
variables with headings: YEAR, TYPE, and SUBJECT.

Category Selection
The general rule in determining the number of categories you are going to use is
to make each variable have the smallest number of categories allowing for the
information you seek to discover. Ideally, you will have a 2 x 2 table to work with.
Differences between expected and actual values will be at their greatest, giving
the opportunity for higher Chi-Square values with the lowest obtainable p values.
Of course, in real situations this is often not possible. In fact, until
you are actually in the process of coding variables, you will not know the names
of all the categories.

Text Categories Used in the Example


YEAR
• 05 Speech
• 06 Speech

TYPE
• EMOTIONAL = A conclusion based upon emotion, not facts.
(“Americans are a compassionate people,” or the “Axis of Evil.”)
• FACTUAL = A verifiable statement (“There were 12 marines killed in
the helicopter crash.”)
• F-CONCLUS = An inference or conclusion, based upon factual data
(“Another 20,000 troops are required to end the violence in Baghdad.”)
• PROMISE = A commitment made by the speaker. (“I will sign a bill
which…”)
• REQUEST = A request made by the speaker (“I ask that Congress
continue the Tax Reduction…”)
• REQUIRE = A demand made (“Congress must pass legislation to…”)
• RHETORICAL = A statement which communicates no information

SUBJECT
• CONGRESS = Statement regarding Congress
• ENERGY = Oil, nuclear, alternative energy sources
• INTERNATIONAL = All foreign affairs matters, except war
• LAW = Law enforcement and the courts
• MONEY = All things related to the economy, employment, Social
Security, Taxes, Budget, health insurance.
• SECURITY = Homeland Security issues
• WAR
• YOUTH

Coding the Document


Preparing the Source Text

Much of the text that you will be organizing and manipulating will come from web
pages, pdf files, or numerical tables. All raw data should be converted to Word
documents, so that the tools contained within can be utilized.

In our example we will be working with the State of the Union Addresses
delivered in 2005 and 2006. To assemble the two completed documents, you will
need to copy and paste from the text, changing pages, and eliminating
extraneous material. Since we are going to be working with two different
speeches, you should code one year at a time, and then combine the two
finished documents into one composite complete data set. In a moment, you will
see why.

Shown below is a fragment of the 2005 Address, showing the first four
paragraphs.

1. Use Global Replace to parse all paragraphs into their component sentences.

The result appears as shown below.

2. The following Global Replace dialog readies the text formatting of the two
speeches for Block Tag entry.

3. This results in the following text, with “2005” later replaced with “SOU-05.” A
separate Replace dialog is used for the 2006 speech.

4. Using the Table Menu, the Text is converted to the following Table:

5. Empty cells can be rapidly categorized using the appropriate tag as each
sentence is evaluated. Remember that there are two speeches to be coded.
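
Steps 1 through 4 above are described in terms of Word's Global Replace and Table tools. For readers who prefer a script, the same preparation can be approximated in Python. This is a sketch under stated assumptions: the sentence splitter is deliberately naive (it will break on abbreviations such as "Mr."), the function names are my own, and the sample text is a placeholder rather than a fragment of either speech.

```python
import re

def split_sentences(paragraph):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [p for p in parts if p]

def build_rows(text, year_tag):
    """One row per sentence: the year tag, empty TYPE and SUBJECT cells, then the sentence."""
    rows = []
    for paragraph in text.split("\n\n"):
        for sentence in split_sentences(paragraph):
            rows.append({"YEAR": year_tag, "TYPE": "", "SUBJECT": "", "TEXT": sentence})
    return rows

# Placeholder text, not taken from either speech.
sample = "First sentence here. Second one follows!\n\nA new paragraph starts. Is this the end?"
rows = build_rows(sample, "SOU-05")

for row in rows:
    print(row)
```

The empty TYPE and SUBJECT cells correspond to the empty table cells that are filled in by hand in Step 5.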

While the above appears to involve a substantial amount of work in preparing the
tag tables, Steps 1 through 4 take only a few minutes. After completion of Step 5,
simply copy the categorical columns to an Excel sheet, and then import the
variables to the statistics application you have chosen. You are now ready to
assess the results of the relationships existing among the variables.
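
Once the categorical columns are coded, cross-tabulating any two of them is a counting exercise. The Python sketch below is one way to do it without a statistics package. The coded rows are hypothetical examples, and crosstab is an illustrative helper name, not a function of any particular program.

```python
from collections import Counter

# Hypothetical coded rows; in practice these come from the tagged table.
coded = [
    {"YEAR": "SOU-05", "TYPE": "FACTUAL", "SUBJECT": "MONEY"},
    {"YEAR": "SOU-05", "TYPE": "PROMISE", "SUBJECT": "MONEY"},
    {"YEAR": "SOU-05", "TYPE": "EMOTIONAL", "SUBJECT": "WAR"},
    {"YEAR": "SOU-06", "TYPE": "EMOTIONAL", "SUBJECT": "WAR"},
    {"YEAR": "SOU-06", "TYPE": "EMOTIONAL", "SUBJECT": "MONEY"},
]

def crosstab(rows, row_var, col_var):
    """Return (counts, row_labels, col_labels) for two categorical variables."""
    counts = Counter((r[row_var], r[col_var]) for r in rows)
    row_labels = sorted({r[row_var] for r in rows})
    col_labels = sorted({r[col_var] for r in rows})
    return counts, row_labels, col_labels

counts, row_labels, col_labels = crosstab(coded, "TYPE", "YEAR")

# Print a simple text table: one line per TYPE, one column per YEAR.
print("TYPE".ljust(12) + "".join(label.rjust(8) for label in col_labels))
for row in row_labels:
    print(row.ljust(12) + "".join(str(counts[(row, col)]).rjust(8) for col in col_labels))
```

Changing the two variable names in the crosstab call is all that is needed to try a different pairing, such as SUBJECT against YEAR.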

Analysis from Two Perspectives


Analytic Statistics
Dating from the 19th Century, classical analytic statistics are familiar to anyone who
either conducts or evaluates scientific research. This method provides a means
for testing the reliability and validity of predefined hypotheses. Thus, a
pharmaceutical manufacturer believes that Drug “A” will be more effective
(or safer, etc.) than other drugs in treating disease “X.” The investigators
apply one or more “research designs” to conduct experiments designed to prove
or test their hypotheses. Important to note here is that the investigators have
already determined the expected results, and now are simply seeking to
determine whether their expectations are correct or not.

Exploratory Data Analysis


There are many phenomena that occur where we really don’t have any clearly
defined explanation for what we observe. This is particularly true in the social
and behavioral sciences. What factors, for example, had the most impact on the
2004 presidential election? What changes in American society can be expected if
current immigration legislation becomes law? Typically many variables will be
involved in the explanation, some of which will have unknown effects, or even not
be suspected of having any relationship to the Dependent (y, or “predicted”)
variable.

Comparing the Two State of the Union Speeches


Delivered fresh from his reelection, the State of the Union Speech of January
2005 reflected the President’s plans for using his political capital to advance his
legislative priorities in the year to follow. By the time the 2006 speech was delivered,
there had been a marked change in the challenges faced by the President. No
longer was support for the Iraq war by any means as great as it had been in the
previous year. Protests were well established, as was rising criticism of Bush
policy, both foreign and domestic.

Thus, we would expect to find significant differences in content emphasis, and in
tone. Moreover, since 2006 was to be a midterm election year, the congressional,
journalistic, and public audiences were all mindful of the implications of this
speech for that election.

Keep these considerations in mind as we explore the differences and
commonalities in the two speeches.

Emotionality – Rationality

An initial area of interest is the degree to which the style of the two speeches
stays consistent or changes. The ’05 speech reflected the President’s confidence
in his ability to bring the Iraq war, if not to an end, then at least toward his
promised direction. He advanced a number of favorite domestic priorities which
he was confident would be accepted and achieved. By 2006 much had changed.
Domestically, his plan for Social Security privatization, which he had so glowingly
announced, was dead in the water. While an elected government was developing
on schedule in Iraq, little progress, and in fact regression, was evident with respect
to sectarian and insurgent violence. American casualties continued to spike, and
the public was showing serious resistance to the confident assertions of the
Administration.

Given these circumstances, a reasonable inference is that the EMOTIONAL
TONE (defined here as the Ratio of Emotional to Rational Statements) will be
significantly greater for the 2006 speech than was the case in 2005.

Using all of the Type classifications, a comparison between the frequencies of
EMO-CONCLUS (Emotional Conclusions) and all other categories makes it very
evident that the increase in this category was the largest and most significant of
all changes between the two speeches. This is further reinforced by the decline in
FACTUAL statements between the two speeches.

This table displays all categories of statement types. If we restrict the categories
to “EMOTIONAL,” “RATIONAL” (F-CONCLUS + FACTUAL), and “OTHER,” the
major shift to emotionalism in the 2006 speech becomes evident.
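
Restricting the categories in this way is a simple recoding step. A minimal Python sketch follows. The RATIONAL grouping (F-CONCLUS + FACTUAL) is taken from the text, while treating EMO-CONCLUS as the same category as EMOTIONAL is my assumption, as are the names COLLAPSE and collapse_type.

```python
# Mapping from detailed TYPE tags to the three collapsed groups.
# RATIONAL = F-CONCLUS + FACTUAL, per the text.
COLLAPSE = {
    "EMOTIONAL": "EMOTIONAL",
    "EMO-CONCLUS": "EMOTIONAL",  # assumed to be the same category as EMOTIONAL
    "FACTUAL": "RATIONAL",
    "F-CONCLUS": "RATIONAL",
}

def collapse_type(tag):
    """Return the collapsed group for a TYPE tag; anything unlisted becomes OTHER."""
    return COLLAPSE.get(tag, "OTHER")

detailed = ["FACTUAL", "EMOTIONAL", "PROMISE", "F-CONCLUS", "RHETORICAL", "REQUEST"]
collapsed = [collapse_type(tag) for tag in detailed]
print(collapsed)
```

Recoding through a lookup table leaves the original detailed tags untouched, so the finer categories remain available for later analysis.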

The following point to the major shift from RATIONAL to EMOTIONAL
statements:
• EMOTIONAL statements changed from 15.8% of all of the ’05
statements to 37.6% of all ’06 Statements.
• Of all EMOTIONAL statements made in the combined speeches,
72.5% of them were made in the ’06 Speech.

Drilling Down for Deeper Insight

When we inspect Year, Subject, and Type Categories, there is a great deal we
can discover just by trying various combinations in contingency tables. The
pair of tables below, which compares the proportions of EMOTIONAL
statements between those having a subject of DOMESTIC matters and those
concerning FOREIGN AFFAIRS, reveals what may have been an unexpected
trend.

Here are some of the inferences supported by a side-by-side inspection of the
two tables.
• There was a significant shift between 2005 and 2006, with the
emotional component being substantially higher in 2006 than was the case in
the earlier year.
• That there was an unexpected trend is demonstrated by the fact that
DOMESTIC statements having an emotional component were more numerous than
was the case for FOREIGN AFFAIRS.

Sliding Granularity

Inspecting the above, we see a shift toward emotionality in the tone of the
’06 speech. While the p value of .066 is approaching an acceptable level of
significance (.05), we don’t yet know which of the subject categories accounts for
this change. Thus, we want to get to sufficient detail to tell us which categories
account for this shift to emotionality.

The Category “FOREIGN AFFAIRS” is quite easy to split into its respective
elements, since it is composed of only two: “WAR” and
“INTERNATIONAL.” “WAR” refers to military actions being taken in either Iraq or
Afghanistan. “INTERNATIONAL” refers to all other references to relations with
foreign countries.

Selecting the appropriate categories composing the “DOMESTIC” category is
more complex, since it is composed of seven elements (MONEY, VALUES, SECURITY,
LAW, YOUTH, ENERGY, and CONGRESS). To use all of them would
defeat the purpose of determining which among them was most affected by this
shift. As shown in the table below, only two (MONEY and VALUES) exceed 10%
of the combined statements for both years.
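
The two moves described here, rolling detailed subjects up into broad categories and then keeping only the elements above a 10% share, can be sketched as follows. The subject counts are hypothetical and were chosen only so that the arithmetic is easy to follow; they are not the counts from the speeches. The helper name broad_category is also my own.

```python
FOREIGN = {"WAR", "INTERNATIONAL"}

def broad_category(subject):
    """Roll a detailed SUBJECT tag up to DOMESTIC or FOREIGN AFFAIRS."""
    return "FOREIGN AFFAIRS" if subject in FOREIGN else "DOMESTIC"

# Hypothetical sentence counts per subject for both years combined.
subject_counts = {
    "MONEY": 30, "VALUES": 15, "SECURITY": 8, "LAW": 6, "YOUTH": 5,
    "ENERGY": 7, "CONGRESS": 4, "WAR": 15, "INTERNATIONAL": 10,
}
total = sum(subject_counts.values())

# Keep only DOMESTIC elements that exceed 10% of all statements.
major_domestic = [
    subject for subject, n in subject_counts.items()
    if broad_category(subject) == "DOMESTIC" and n / total > 0.10
]
print(major_domestic)
```

Sliding between the broad and detailed levels this way keeps the contingency tables small while still allowing a closer look at the categories that matter.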

Thus we assemble contingency tables for each of the 4 variables, looking for the
p values of each. First, the Main Table, showing the shift toward emotional
statements. Note the p value, which clearly establishes the shift toward
EMOTIONAL in the ’06 speech.

Next we compare the change for ’06 between DOMESTIC and FOREIGN
AFFAIRS.

Interpreting the above becomes a bit complicated, but if you follow the process, it
should become clear how to correctly read the results:
• A shift in tone toward the EMOTIONAL was determined in the
comparison of ’05/’06.
• Therefore, in the table above, we search for the EMOTIONAL cell
which contains the largest positive residual. This cell, as highlighted, is
the EMOTIONAL/DOMESTIC cell.
• Since the p value of .0245 < .05 (the maximum acceptable value), we
are assured that it meets our criterion for this change being statistically
significant.

Having determined that the shift lies in the DOMESTIC category, we need to
direct our attention to the two components on which we have focused, MONEY and
VALUES. We want to limit our analysis to determining which of these two has the
greater significance. This will be determined by their respective p values.

Comparing the two tables, it is immediately evident that MONEY is the element of
DOMESTIC policy which saw the greatest increase in EMOTIONAL statements.
We have three ways of confirming the relative weight of the two elements.
• The total number of MONEY statements (148) was more than double the
number of VALUES statements (68). Clearly, MONEY statements received
more attention than did VALUES.
• The MONEY/EMOTIONAL cell has a positive residual 2.3 times
greater than that of VALUES, suggesting that the power of EMOTIONAL
statements was far greater for MONEY than for VALUES.
• Finally, the Chi-Square value for MONEY was far greater (21.54) than
was that of VALUES (11.54), with the resulting p value for MONEY smaller than
that for VALUES (p ≤ .0001 is less than p = .0031).
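
The relationship between the Chi-Square values and p values in the last bullet can be checked by hand. The reported figures are consistent with tables having 2 degrees of freedom (for example, three statement types by two years). That df is my inference rather than something the text states. For df = 2 the upper-tail probability reduces to exp(-Chi-Square / 2), which the Python sketch below uses.

```python
import math

def p_value_df2(chi_square):
    """Upper-tail probability of a Chi-Square statistic with 2 degrees of freedom."""
    return math.exp(-chi_square / 2)

# Chi-Square values quoted in the text; df = 2 is an assumption.
p_money = p_value_df2(21.54)
p_values = p_value_df2(11.54)

print(f"MONEY:  p = {p_money:.6f}")
print(f"VALUES: p = {p_values:.4f}")
```

Under that assumption the results agree with the p values reported above, and they illustrate the general rule that a larger Chi-Square value gives a smaller p value at the same df.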

Summarizing the MONEY – VALUES Findings

o The first question of interest was whether there was a difference
between the ’05 and ’06 speeches. A trend in the direction of more
emotional statements being made in the ’06 speech was identified.
o Two general categories of statement had been previously
identified – DOMESTIC and FOREIGN AFFAIRS. The largest positive
shift in emotional statements was found to be in the category,
DOMESTIC.
o Finally, the DOMESTIC category held two significant elements,
VALUES and MONEY.
