
Do-It-Yourself Data Mining

Part I: Text Analysis Using ArchiText

Principles of Text Analysis

Before considering the existing

program, and the mechanics necessary for updating it, it is important that we share a common understanding of the principles of text analysis that this author considered in developing the original design of the program. To that end, it will be useful to review some of the slides in a PowerPoint presentation designed to clarify those principles.

Concordancing

The earliest analysis of text began with a process going back to the development of printing. Called Concordancing, it consists of locating every word within a text and counting the frequency of its appearance within the text. This process was first applied to scholarly analyses of the Bible. The slide below illustrates a screen shot of a free concordancing program, TextStat, available on the Net. Getting the frequency of occurrence of specific words, by itself, has considerable utility. Here is some of the information which can be derived just from this data:

Subject Information: Inspection of the proper and common nouns, sorted by frequency, quickly provides an overview of the subject of this text collection.

Structural Analysis: Teachers, linguists, and others interested in the structure of language use can apply various ratios and percentages to the list contents. Examples include the percentage of prepositions to the total word count, active to passive voice verbs, etc.

Still, the mere presence of a word, or even its frequency of occurrence, provides relatively limited information regarding its usage within the corpus (the totality of all documents under analysis). The next step, therefore, is to view a word or words of interest within a context. You will often find the acronym KWIC (Key Words In Context) referring to this process. While not shown here, proximity searching is made available in the Query Editor, initiated with the button shown. Expanding to a full citation provides a full view of just how and where each word appears in the context of the total document. This is quickly accomplished by clicking the Citation button.

The Complexity of Word Tags

Using a number of methods, individual words can be linked to each other as they occur throughout the corpus of material under study. This is especially essential when the document contains a large number of names of people, locations, or events which can easily become confusing. Which person is linked to which employer, location, or event? What about individuals sharing the same surname? Separating words into categories and clear identities brings clarity to confusion. The next five slides, with text extracted from the 9/11 Commission Report, illustrate this clarification process. We initially begin with a paragraph of unedited text.

The first step in increasing the information value of individual words is to select a category, in this case the surnames of individuals. Using the Replace dialog in Word, change all instances of a surname to ALL CAPS. Any compound term which you wish to have listed as a single item must have the space character replaced with a hyphen. Prefixes serve to group words which are members of the same category together so that they will appear as a group within a word listing. The sketch below illustrates these conventions in code.
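ArchiText itself is not required to experiment with these conventions. The following is a minimal Python sketch, using invented names and sample text, of the capitalization, hyphenation, and prefixing steps just described, followed by a simple concordance-style frequency list and a small KWIC display showing how prefixed tags cluster together:

```python
import re
from collections import Counter

# Invented sample sentence; the tagging rules mirror those described above.
text = ("Mohand al Shehri and Wail al Shehri boarded in Boston. "
        "Mohand al Shehri had arrived from New York with Atta.")

# 1. Capitalize surnames and hyphenate compounds so they list as one word.
text = re.sub(r"\bal Shehri\b", "AL-SHEHRI", text)
text = re.sub(r"\bAtta\b", "ATTA", text)
text = re.sub(r"\bNew York\b", "New-York", text)

# 2. Prefix tagged words so category members group together in a listing.
for name in ("AL-SHEHRI", "ATTA"):
    text = text.replace(name, "p-" + name)

# 3. Concordance-style frequency list.
words = re.findall(r"[\w-]+", text)
for word, count in Counter(words).most_common(5):
    print(f"{count:3d}  {word}")

# 4. A tiny KWIC (Key Word In Context) display for one tag.
target = "p-AL-SHEHRI"
for i, w in enumerate(words):
    if w == target:
        print(" ".join(words[max(0, i - 3):i]), "|", w, "|",
              " ".join(words[i + 1:i + 4]))
```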

Suffixing the root word provides differentiation between identical names. Thus, in the example below, the two AL-SHEHRIs are identified as being siblings, but can be separated with respect to individual activity. Here are some rules for compound words other than the names of people:

All of the techniques shown above enhance your capabilities for analysis of a document or collection of documents, but none are essential to the full employment of the program.

ArchiText, a Text Analysis (Text Mining) Program

Import and Split: Creating Nodes

The first step in using ArchiText is to import the corpus of one or more documents into the program. In the example shown below, the text of the 9/11 Commission Report is going to be subjected to analysis. In general, if the document is of any significant length, you'll want to split it into categories or sections which represent some logical division of the whole. After creating a new ArchiText file, select the file you're going to use. You can elect to import the entire file, or to split the file into elements, referred to as Nodes. If you choose the former, you will have just one node, bearing the document name. If you decide to split into sections, you can use any symbolic character (or combination of symbolic characters) as the target string by which sections are split, as shown above. In this example, we have split the entire document into chapters, as illustrated in the Node Directory shown below. Double clicking on any of the Nodes will open a window in which the text of the node appears.

Node Selection

Regardless of the analysis you are going to perform, you can select all nodes, an ordered set of nodes, or a discontinuous combination of nodes. Preferences allow you to order nodes alphabetically or by time modified.

Keyword Lists

This is typically the first analysis you will do, and you will usually have done a lot of tag preparation in the original file, as described above. In the selection above, our interest is in identifying the key terrorist players, so the keyword search was restricted to those nodes where they were discussed. After selecting the nodes whose words are to be listed, the keyword dialog sets up the parameters for the listing. Most of these choices are self-explanatory, but the stop word list requires some discussion. ArchiText comes pre-loaded with a modifiable list of words (articles, prepositions, and auxiliary verbs) which ordinarily are irrelevant to content. Thus, when this item is checked, these words are eliminated from the frequency listings. However, there are times when these words have usefulness for a given analysis, and they can then be included by un-checking this box. In this partial view of the resulting frequency list, each person has been prefixed with p-, which facilitates grouping all of those named into the category Person. For those occurring with high frequency, we will proceed to extract all information regarding them, and combine that information into a single new node focused on each of them.

Extract and Combine to Make New Nodes

In our first search, two of the terrorists have been selected. Selecting the S tab will automatically initiate the search dialog. Remember that the nodes were preselected when the keyword list was constructed. After pressing the Start Search button, you will see the following results in the Directory. Notice that the number of occurrences of the names of the two terrorists within each node is highlighted. A rough code approximation of this split-search-combine workflow is sketched below.
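ArchiText's own algorithms are not published, so the sketch below is only a rough Python approximation of the workflow just described -- splitting a corpus into nodes on a target string, counting search-term hits per node, and combining matching paragraphs into a new node. The file name, split marker, and tags are all invented for illustration:

```python
import re

# Hypothetical corpus file and split marker; substitute your own.
with open("911_report.txt") as f:
    corpus = f.read()

# Import and split: each section between markers becomes a node.
nodes = {f"Chapter {i + 1}": chunk
         for i, chunk in enumerate(corpus.split("###"))}

# Search: count occurrences of each tagged term within each node.
terms = ["p-ATTA", "p-AL-SHEHRI"]                 # illustrative tags
for name, body in nodes.items():
    counts = {t: len(re.findall(re.escape(t), body)) for t in terms}
    if any(counts.values()):
        print(name, counts)                       # hits per node

# Combine: pull every paragraph mentioning a term into one new node,
# embedding the source node name, much as Embed Node Name does.
matching = [f"[{name}] {para}"
            for name, body in nodes.items()
            for para in body.split("\n\n")
            if any(t in para for t in terms)]
combined = "\n\n".join(matching)
```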
The next step is to extract just this information and combine it into a new node. To do this, select Combine Nodes from the Analysis menu. In the example shown below, we have searched for George Bush, and are extracting all occurrences of his name throughout the nodes. Select Embed Node Name if you want the source nodes named in the new node. After completion, a new node is created containing only those instances in which Bush is named in a paragraph. The result of this combination looks like this: This illustration is, of course, only a small portion of the nodes in which the Bush name occurs. If you wish, you can drill down further, building a keyword list for this node alone and searching for other combinations related to Bush as they occur within paragraphs or sentences in which his name appears. If desired, you could build extracted nodes for any combination of Bush and other words included in your search.

Identify Relationships: Node Maps

In a 500-page document there are obviously a huge number of relationships between people, events, locations, and other categories. Node Maps facilitate your finding and manipulating these relationships in an almost unlimited number of ways. New maps are built in the same way as are new nodes -- by using the create button for maps in the directory dialog. You will note that there are a number of nodes which are not on the map, but which are available through selection and pressing the "Add to Map" button. When nodes are deleted from the map, they appear in the left column, which is the "On-Call" list. Another way that nodes can be added from the On-Call list is through a search which selects some of the nodes in this list. One way of visualizing the nodes found in a search is to change the size of the nodes selected by that search. This option is available by selecting "Change Node Size" in the Map menu found on the main toolbar. A far more powerful option is also available: using any of the eight linking tools, a "Parent Node" can be connected or linked to each of the nodes to which a relationship exists. One example of this linking is shown in the map below. In this case, Terrorist 001 (Osama bin Laden) is linked to each of the chapters in which his name appears. As you see below, every node which is linked to another is shown in the node's window. Double clicking on any node name opens the node window and, depending on the preference settings, will either open both the source node and destination node, or open the destination node while closing the source node.

Implications for Data Mining

The methodology employed here facilitates the discovery of all kinds of relationships between people, events, and locations -- in fact, of any word or phrase to any other. Typically, as relationships are discovered, new sub-nodes will be created so that those relationships can be examined and further linked to other relationships. It is not necessary to do the specialized tagging which will be explained in the following tutorial directed at methods of text analysis; it simply makes it easier to define categories of items, making their location and identification easier within ArchiText and providing a basis for the quantitative analyses which can flow from these categorical classifications.

Some Limitations

While the design of this program offers features which this author has found in no other program, because it was designed in 1988 there are some limitations and deficiencies which demand starting from the beginning and rebuilding the program shell.
Listed below are some of the current problems which must be resolved for the program to reach its potential power for its users: By far the most serious deficiency is that the program will only operate on older Macintosh computers still running OS 9.x or earlier. The search and linking functions are available in no other program except highly expensive, enterprise-level data mining systems. Thus the program needs to be updated so that it is usable on any platform.

As currently constituted, the program can only import text in ASCII format, and it lacks the capability to open or read Internet files. There are a number of deficiencies in the search algorithm, most particularly the program's inability to process numerical searches. Thus, a search for a number greater than, equal to, or less than another quantity cannot currently be accomplished. As displayed above, the mapping capabilities of the program are very limited, and a number of modifications should be made so that more effective pictographic displays are readily available. An example of one such possibility is shown below.

What's Next?

Part II of this series discusses the application of some of the more traditional methods of Data Mining, describing the ease with which standard statistical methods may be used to determine and present complex relationships existing between words and numbers, without the necessity for advanced expertise or for expensive and complex professional analytic software.

Do-It-Yourself Data Mining

Part II: Concepts and Display

Introduction

In beginning our consideration of Data Mining, readers will find many, if not all, of the concepts involved to be foreign to their past experience. Data mining (DM), also called Knowledge-Discovery in Databases (KDD), is the process of automatically searching large volumes of data for patterns, using tools such as classification, association rule mining, and clustering. Data mining is a complex topic with links to multiple core fields such as computer science, and it adds value to rich seminal computational techniques from statistics, information retrieval, machine learning, and pattern recognition.

For most of us, our major approach to processing information ultimately depends on the viewpoint of others. This is because our entire education has largely consisted of memorizing facts and solving problems based on the rules that others have given us. We typically find that we rely on "facts" and "conclusions" which come from those whose viewpoint is most similar to that which we have developed over a long period of time. Here is a graphic example which illustrates this process. One reason that many avoid engaging in DM is the perception that it requires not only training in statistics, but in database usage as well. Typically, those employed as data analysts will have formal training and experience in database programming languages, statistical programming, and research design. This paper seeks to provide those who have competence in general computer applications with the intellectual tools necessary to shortcut the heavy-duty software and training used by professionals.

It All Depends on Point of View

Assume that you hold a Liberal viewpoint with respect to the war. You will tend to reject the statements of conservatives, in essence screening out the Red view of any dispute. You will tend to accept, in fact even receive, only that information which is in agreement with these long-held views. Conversely, if the view you hold is consistent with those held by conservatives, you will tend to see and incorporate into your thinking the views held by members of that group. None of us is immune to the tendency to accept or reject new information to the degree that it is consonant with our previously constructed world views. The analysis tools discussed here can serve to free you from these frameworks and provide new ways of looking at the information contained within the text.

What You Will Be Learning

Looking at the names given to the disciplines and bodies of knowledge used by those engaged in Data Mining, you are likely to think that such training is far beyond your own education. This article is designed to show you that, while some of those who do DM have advanced academic experience, the core principles can be learned and employed by everyone -- and in fact can be used by students in their early teens. For those lacking a strong background in statistical analysis and the tools of quantitative analysis, such as Regression, Correlation, and Cross Tabulation (Contingency Tables), this material will initially appear very new. Most who have little or no experience using and calculating statistics tend to think of statistics as a discipline which uses numeric values to reach conclusions or results. While this is very much the case, there are also a number of statistical methods which use text exclusively, or as part of the calculations. As you proceed through this tutorial, you will find that the concepts introduced are much easier to understand than you had previously believed.

From Numbers and Words to Conclusions: An Example

Getting Acquainted with Statistical Ideas

Before starting, we need to define some terms used by statisticians when they carry out research.

Populations and Samples

Two terms you will run across throughout this document are Sample and Population. The 100 8th graders whose responses we are going to obtain are a Sample of all boys and girls in the 8th grade going to school in the United States. This total is referred to as the 8th Grade Population. Whatever we do with the sample, the greatest concern is that the results we find for the sample are very similar to those which would be found if the entire Population could be measured.

Variables

Information regarding either Populations or Samples is broadly divided into two kinds of Variables: Discrete (Categorical) and Numeric. The height of each member (case) in the sample, as well as the grade point average, are the numeric variables which will be used in this example. Categorical (Text) variables are labels assigned to place each case in one of two or more different Categories. A, B, C, D, and F are all elements of the Categorical Variable named Grade. These categories are either located and extracted from the text of documents being analyzed, or derived from equations, as shown below. In this instance, a formula was used to derive a letter grade from the grade point average for each case.

Figure 1: Deriving Letter Grades from Performance Average

All of the categorical variables in this example come from mathematical derivations, since there is no real textual data from which they can be extracted. Nonetheless, you will find yourself dealing with this same construction of categories, especially when the text source is a series of tables.

Composition of the Sample: Boys vs. Girls

The first thing we want to know is whether there are equal numbers of boys and girls in our sample. As you can see, both are equally represented in the sample.

Figure 2: Gender Composition of Sample

Overall Academic Performance

What about the academic performance of everyone in the sample? Since the letter grades are derived from numeric averages of performance throughout the school year, we use a Bar Chart to inspect the number of students receiving each of the five grades. A code sketch of this derivation and tally follows below.
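Figure 1's actual formula is not reproduced here, so the short Python sketch below uses invented cutoffs and an invented four-student roster purely to illustrate the idea: a numeric average is mapped to a categorical Grade, and the grades are then tallied the way the bar chart in Figure 3 tallies them.

```python
from collections import Counter

def letter_grade(avg: float) -> str:
    """Derive a categorical grade from a numeric average.
    The cutoffs are illustrative, not those used in Figure 1."""
    for cutoff, grade in ((91, "A"), (81, "B"), (71, "C"), (61, "D")):
        if avg >= cutoff:
            return grade
    return "F"

# Invented stand-in for the 100-student sample: (gender, height_in, average).
sample = [("M", 62, 88.5), ("F", 60, 93.0), ("M", 66, 67.2), ("F", 61, 74.8)]

grades = [letter_grade(avg) for _, _, avg in sample]
print(Counter(grades))   # frequency of each letter grade (cf. Figure 3)
```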

Figure 3: Grade Frequencies

The horizontal axis contains the ranges of grades, in this case with the lowest being between 51 and 60, proceeding in increments of 10 points, ending in a perfect score of 100. The vertical axis gives you the number of cases within the selected ranges. We can see that performance is pretty much as we might expect: a few in the failing and top ranges, more in the high and low ranges, and most right in the middle, where a C is the grade awarded. But that is far from the whole story. Looking at this graph, you will certainly want to know whether there is any difference between the academic performance of the boys vs. the girls. The first step is to inspect the frequency of each category, Boys and Girls, with respect to the entire sample. To do this, we use a plot called a Dot Plot; with the boxes added, the plot is referred to as a Box Plot.

Figure 4: Dot and Box Plots of Performance Averages

As you can see, the girls did better than the boys, both in the median of their scores (the white line in the center of the shaded areas) and in the top grade received by a girl vs. that of a boy. While none of the girls failed (a score below 61), two of the boys did. Removing the boxes, we see the three boys who are outliers, those who represent extreme low values disconnected from the rest of their group, as well as the three high-scoring girls who are also disconnected from the rest of their group by their high performance. Yet, while we have a good look at where the average scores fall, and the distribution of scores in each group, we really don't have any accurate count of how many in each group received which letter grade. To get this precision, we instead turn to a statistical tool referred to as a Contingency Table. As you look at the table, it becomes evident that, at every grade level, the girls did better than the boys. There is something else that is evident. If you look at the p-value shown at the bottom of the table, you will note that it is shown as p = .0478. This tells you that there are slightly less than 5 chances in 100 that you would find the boys equaling or surpassing the performance of the girls if you repeated this survey with other boys and girls in the 8th grade.

Figure 5: Table of Boy vs. Girl Grades Received

Finding Relationships

To this point, we have been working largely with word frequencies to interpret the information we have displayed. Now we are going to use some numerical methods for determining relationships between the underlying variables which have led to classifying some of the variables into words.

Height and Academic Performance

The 8th grade is a time of great change in physical growth for adolescents. Thirteen-year-old girls tend to be well into puberty, while many boys lag in development, causing far greater variation in boys' heights than is found with the girls.

Figure 6: Comparison of Heights for Boys and Girls

Nonetheless, boys at this age show an average difference, being approximately two inches taller than the girls.

Figure 8: Height vs. Grades, Full Sample

This Scatter Plot of Height vs. Grade averages for the entire sample is very interesting. There is a modest trend toward those receiving higher grades being shorter than those receiving lower grades. Boys are shown as Xs, and girls as small open circles. The red line is the cutoff separating low and high grades. Is this trend the same or different for boys and girls? To answer this question, we split the total by gender, as shown in the figure below.
Figure 9: Height vs. Grades by Gender

While we see this trend persisting for the boys in the left plot, there appears to be almost no relationship for the girls, as shown by the nearly level line in the girls' plot at right. One of the nice things about this kind of display is that it does not require the viewer to interpret complicated numeric calculations; instead, simply looking at the plot makes a number of things evident: There are only 7 girls, as opposed to 18 boys, who received grades below C (< 71) in this sample. Looking at the boys' heights, the shortest boys tended to get the best grades, while the tallest boys were more evenly distributed in the grades they received. Thus, one might posit that short boys may be more motivated toward academic work than tall ones, since they have fewer distractions in their attraction to the girls, and less likelihood of being engaged in time-consuming sports activity. Conversely, there is almost no relationship between the heights of girls and the grades they receive. While they may be involved in extracurricular activities, most parents will severely limit any dating activity by this age group.

Choice of Leisure Activity and Grades

Recall that there was another two-choice question in our example. It asked: If you have a choice of activity on the weekend, which would you prefer to do? ___ Play Sports ___ Do something else. Recall that in our example these students are living in a small town in Texas. In such towns, high school football assumes a high degree of importance. This cultural bias toward sports participation leads us to the following hypotheses:

1. While the students in our sample are too young to participate in a high school program, boys will have strong aspirations and interest in future participation, leading to leisure-time sports activity.

2. Since high school football is a male-only sport, girls will show less interest in sports participation, although some, of course, will participate in programs that are equally open to boys and girls.

3. Regardless of gender, students with larger body mass will have a greater inclination toward sports participation than their smaller counterparts.

4. For a variety of reasons, students who either participate in, or are emotionally invested in, organized athletic activity will tend toward lower grade achievement than those not involved.

We begin this analysis with a contingency table showing the relationship between academic performance and sports participation: When the choice of Play Sports vs. Something Else is overlaid over grades and height, we see a clear relationship between those receiving poor grades and those choosing to spend their leisure time playing sports rather than doing Something Else. You will also note that this behavior is much more pronounced among the boys (11) as compared with the same choices made by girls (3).

A Note About the Findings

In reviewing all of the above, it is important to note that all of these findings are based upon a set of results constructed by the author. The survey questions described were never actually given to any group of students, and the results are completely fictitious. There are, in fact, a number of studies which take an opposing view, finding that student athletes tend to be among the high performers within their educational settings. One peer-reviewed study, involving many more variables than the sample provided here, is available here.

Statistical Programs

For anyone wanting to do serious analysis using the methods described above, a statistical analysis program is required. Many readers will be reluctant to either spend the money or engage in the steep learning curve required to master many professional-level programs. The author uses a professional version of DataDesk, a uniquely powerful yet easy-to-use program.
A relatively low-cost ($75.00) Excel add-in for the same program is offered by the publisher, making available all of the analyses described above.
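For readers who would rather not buy anything, the kind of contingency-table test reported in Figure 5 can also be reproduced in Python with scipy. The counts below are an invented gender-by-grade matrix, not the actual Figure 5 data; the sketch simply shows where the Chi-Square statistic, degrees of freedom, and p-value come from.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Invented counts: rows are genders, columns are grades A..F.
table = np.array([
    [4, 10, 20, 12, 4],   # boys
    [8, 14, 19,  8, 1],   # girls
])

chi2, p, df, expected = chi2_contingency(table)
print(f"chi-square = {chi2:.2f}, df = {df}, p = {p:.4f}")
# A p-value at or below your preset threshold (commonly .05) would lead
# you to reject the null hypothesis of no gender/grade relationship.
```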

Do-It-Yourself Data Mining

Part III: Using Block Tags to Analyze Text

Introduction

In Part II of this series of articles, you learned how text and numeric data can be used to extract meaningful and useful information from large collections of textual material. In this section, we look at how textual content can be reorganized so that it can be extracted in the ways demonstrated in the previous article. As before, your most important task is going to be to define, in advance, the information you are expecting to derive from your efforts. As previously, this will be stated in the form of questions to be answered or hypotheses to be tested. There are two kinds of tagging which will be considered, each form serving different purposes.

Word Tags

Word Tags are words contained within the document which have particular importance, either because of the frequency with which they occur or because of their association with other words within a sentence or paragraph. By capitalizing (George BUSH), compounding (New-York), and adding prefixes (DOD.RUMSFELD) or suffixes (BUSH-POTUS43) to words identified as important, word elements can be combined and linked to others having some element in common. A full discussion of Word Tags was provided in Part I of this series of articles.

Block Tags

Block Tags are words added by the user to categorize sections of text (by sentence or paragraph) such that Contingency Tables can be constructed showing the dependence or independence of one category on another. Before you scratch your head trying to figure out what I am saying, here is an illustration of the process I am describing: Each year the President presents his State of the Union speech to Congress and the watching nation. In 2005, a popular President Bush, newly reelected, came to Congress with a sweeping domestic agenda which he anticipated bringing into law during his second term in office. By the same time in 2006, with the war in Iraq going badly, much of the nation's attention was directed at resolving the war, and away from the issues brought forward in 2005. Here are some of the questions which flow from this brief description: Was there a significant difference in the emphasis that the President gave to various subjects which appeared in both speeches when compared across the two years in question? Since the speeches are used as a means of presenting an agenda, and of convincing the audience of the value of the President's views, what elements comprised the form in which his views were presented? We will closely inspect these and a number of other questions which Block Tags can answer.

Interpreting Contingency Tables

Before going on to the construction of Block Tags, we need to spend some time looking at the power of contingency tables to provide you with precise knowledge regarding the information you are seeking. There is a great deal available from these seemingly simple numeric tables. In preparing this example, the text of both State of the Union speeches was divided into sentences, and then several categories were assigned to each sentence.

The Null Hypothesis

You would expect that the theory or hypothesis that you are stating, related to the data above, would read something like this: The increase in FOREIGN AFFAIRS statements is related to the time that it occurred (2006). Because of the methods by which statisticians calculate the likelihood or probability of an hypothesis being true, hypotheses are stated in just the reverse form from that which one would ordinarily expect to see. Thus: There is no statistically significant difference between FOREIGN AFFAIRS statements made in 2005 and 2006. This form of stating a hypothesis is referred to as the Null Hypothesis. When you see a p-value for such a statement, it is giving you the probability that this negative form of your hypothesis is true, or correct. Researchers will Reject the Null Hypothesis when the value of p is less than or equal to some value previously determined by them. Thus, if we reject the Null Hypothesis, this means we accept as likely the original form of the hypothesis. Let's see how this operates when using the contingency table with which we have been working: The df (degrees of freedom) is derived from the number of cells in the table. As df increases, a larger Chi-Square value is required to obtain the same p-value. Neither of these statistics is necessary for your interpretation of these tables. Admittedly, this double-negative approach is counterintuitive to almost everyone who has not been involved with statistical research. However, any reading of professional journals addressing such disciplines as medicine, psychology, political science, and many others will show hypotheses presented as described above.

Selector Variables

A selector variable is used to filter all of the counts within a contingency table, reducing the totals in each cell to only those cases which meet the criteria for the selector. In the example below, we add the EMOTIONAL selector to the contingency table we have been using. Inspecting the two tables reveals some important differences between them. A code sketch of this tagging-and-filtering workflow appears at the end of this section.

Exploratory Data Analysis

There are many phenomena that occur where we really don't have any clearly defined explanation for what we observe. This is particularly true in the social and behavioral sciences. What factors, for example, had the most impact on the 2004 presidential election? What changes in American society can be expected if current immigration legislation becomes law? Typically, many variables will be involved in the explanation, some of which will have unknown effects, or may not even be suspected of having any relationship to the Dependent (y, or predicted) variable.
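To make the block-tag mechanics concrete, here is a minimal pandas sketch. The sentence tags and EMOTIONAL flags below are invented placeholders, not the author's actual State of the Union coding; the point is the shape of the workflow: one row per tagged sentence, a contingency table of topic by year, and a selector variable that filters the rows before the table is rebuilt.

```python
import pandas as pd

# One row per sentence: year delivered, assigned topic tag, and a
# user-assigned EMOTIONAL flag. All values here are invented examples.
rows = [
    (2005, "DOMESTIC", False), (2005, "DOMESTIC", True),
    (2005, "FOREIGN-AFFAIRS", False), (2006, "FOREIGN-AFFAIRS", True),
    (2006, "FOREIGN-AFFAIRS", True), (2006, "DOMESTIC", False),
]
df = pd.DataFrame(rows, columns=["year", "topic", "emotional"])

# Contingency table: topic counts by year.
print(pd.crosstab(df["topic"], df["year"]))

# Selector variable: rebuild the table using only EMOTIONAL sentences.
emotional = df[df["emotional"]]
print(pd.crosstab(emotional["topic"], emotional["year"]))
```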
