Document Explorer User Guide

Methods for Contextual Discovery and Analysis

Hamilton-Locke, Inc.

QUALITATIVE AND QUANTITATIVE METHODS OF TEXT ANALYSIS FOR DOCUMENT EXPLORER/WORDCRUNCHER USERS

© 2002 Hamilton-Locke, Inc.

Table of Contents

I. DISCOVERY METHODS

INTRODUCTION

SECTION A – USING WORD COUNTS
  Counts of Total Words
  Total Word Counts
  Total Unique Words
  Counts by Groups of Words
  Counts of Word Parts
  Counts of Related Words
  Counts by Collocated Terms
  Viewable Search Information
  Neighborhood Width
  Rating
  Z-score
  Sample Frq
  Total Frq
  Percent
  Expected Frq
  Std Dev
  Sort Sequence

SECTION B – IN-CONTEXT SEARCHING
  Meaning and Word Use
  Consistency in Word Use
  Options and Elements of the Reference List Window
  Reference Window Citation Line
  View Option
  Occurrence Number
  Select Option
  Delete Option

SECTION C – DISCOVERING WITH RUNTIME CONCORDANCES

SECTION D – COMPARING READABILITY LEVELS

SECTION E – EXAMINING COLLOCATED WORDS

SECTION F – STYLE ANALYSIS
  What is Style?
  The Significance of Style Analysis
  Discovering Style
  1) Patterns in the content: What is in the text?
  2) Patterns in discretionary use: How is the content being used?
  3) Patterns in associations: How do things fit together?

SECTION G – TRANSLATION CONSISTENCY: WORD USE, THEMES AND IMAGERY
  Translation
  Discovery Before Translation
  Evaluating Prior Translations

SECTION H – TEXTUAL ANALYSIS METHODS FOR WRITERS
  Objective Evaluation
  Additional Writing Aids

PART II. ANALYTICAL METHODS

INTRODUCTION

SECTION A. METHODS OF SUMMARIZING OBSERVATIONS
  Levels of Measurement
  Types of Data
  Creating a Frequency Distribution
  Describing a Frequency Distribution
  Measures of Central Tendency
  Measures of Variability
  Various Vocabulary
  Parametric vs. Nonparametric Assumptions and Tests
  Advantages of Nonparametric Statistics
  Disadvantages of Nonparametric Procedures
  When to use Nonparametric Procedures

SECTION B. METHODS OF HYPOTHESIS TESTING
  Steps in Hypothesis Testing
  Test of Significance

SECTION C. VARIABLE SELECTION
  Measures of Total Words and Total Unique Words
  Readability or Grade Level
  Word Groupings (Comparing Sets of Key Words)
  Grammatical Discriminators
  Comparison Against a Pool
  Strings and Collocated Word Variables

SECTION D. STATISTICAL ANALYSIS USING MICROSOFT EXCEL

SECTION E. METHODS OF COMPARING OBSERVATIONS
  Two Independent Observations
  T-test for Two Independent Samples
  Sign Test
  Signed-Ranked Two-Sample Test (Mann-Whitney)
  Two Related Observations
  Paired T-Test
  Sign Test for Pairs
  Matched Pairs Signed-Ranked (Wilcoxon)
  Three or More Independent Observations
  One-Way Analysis of Variance
  Extension of Sign Test
  Kruskal-Wallis Test
  Multiple Comparisons
  Three or More Related Samples
  Two-Way ANOVA
  Friedman Two-Way ANOVA by Ranks
  Multiple Comparison Procedure for use with Friedman Test
  Use of Aligned Ranks (Hodges-Lehmann)
  Page’s Test for Ordered Alternatives

SECTION F. ASSOCIATION, TREND AND SLOPE COMPARISONS AND TIME SERIES
  Scattergram (Scatterplot)
  Determining Association (Correlation)
  Pearson Product-Moment Correlation Coefficient
  Spearman Rank Correlation
  Kendall’s Tau
  Olmstead-Tukey Corner Test of Association
  Phi Coefficient
  Yule’s Q Coefficient
  Goodman-Kruskal Coefficient
  Cramer’s Statistic
  Point Biserial Coefficient of Correlation
  Chi-Square Test of Independence
  Kendall’s Coefficient of Concordance W
  Partial Correlation Coefficient
  Trend and Slope Comparison (Regression)
  Theil Test
  Sign-Test for Trend
  Sen, Adichie Test
  Jaeckel, Hettmansperger-McKean
  Time Series
  Basic Concepts of Time Series
  Some Classes of Univariate Time-Series Models
  Autoregressive (AR) Process
  Moving Average (MA) Process
  ARMA
  SARIMA
  Periodic AR Models
  Fractional Integrated ARMA (abbreviated ARFIMA)
  State Space Models
  Growth Curve Models
  Non-linear Models
  Time-series Model Building
  Forecasting

SECTION G. GOODNESS OF FIT
  Introduction
  Chi-Square Goodness of Fit Test
  Kolmogorov-Smirnov One-Sample Test
  Kolmogorov-Smirnov Two-Sample Test
  Lilliefors

SECTION H. MULTIVARIATE METHODS
  Factor and Principal Component Analysis
  Cluster Analysis
  Discriminant or Classification Analysis

PART III. TABLES

  • 1. NORMAL DISTRIBUTION – AREAS UNDER THE NORMAL CURVE
  • 2. BINOMIAL DISTRIBUTION – CRITICAL VALUES OF THE BINOMIAL TEST
  • 3. F DISTRIBUTION – CRITICAL VALUES
  • 4. T DISTRIBUTION – CRITICAL VALUES
  • 6. CONVERTING R TO Z
  • 7. CHI-SQUARE DISTRIBUTION – CRITICAL VALUES
  • 8. STUDENTIZED RANGE STATISTIC – CRITICAL
  • 9. DUNNETTS TEST
  • 10. MANN-WHITNEY U TEST
  • 11. WILCOX RANKED SUMS TEST
  • 12. WILCOXON SIGNED RANKS TEST
  • 13. SAMPLE SIZE REQUIREMENTS

PART IV. BIBLIOGRAPHY AND APPENDIX
  CITATIONS FOR PART I. – DISCOVERY METHODS
  CITATIONS FOR PART II. – ANALYTICAL METHODS
  APPENDIX

I. DISCOVERY METHODS

Introduction

The study of the elements of language has long been relegated to the field of linguistics. Letters combine to form sounds, which are combined to form words, which are combined to form phrases and sentences. Words, phrases and sentences are the building blocks for conveying thoughts, concepts, theme and imagery. They are used to convey, convince, prove and provoke. They are the foundation of literature, poetry, history, law, government and business.

Language is part of every aspect of human existence. Interpersonal communication depends on some form of symbolic language whether oral, written or signed. Religions, laws and governments find their foundation in words. Entertainment, whether by reading, television, movies, or radio, is based on language. Businesses are founded on branding, name recognition, marketing campaigns and persuasive sales.

Language is also the foundation of academia. Literature, Composition, Foreign Language, Linguistics, Public Relations, Business Management, History and similar fields are directly dependent on language. Physics, Mathematics, Biology, Chemistry, Medicine, Engineering and other scientific fields are equally dependent on language in that they comprehend the nature, motives and ends of their work via language.

Language and humanity are interdependent. Where one is, the other will be found and, conversely, where one is not, the other will not be found. In light of this point it is easier to see the benefits of language study in all fields – as a rule, the process of language discovery and analysis is accompanied by a better understanding of humanity.

Language study always reveals something to us about ourselves, our individual and collective perceptions of the universe, the relationships between individuals and groups within a society and about our culture, customs, artistic achievements and social and political movements of a given era or across a time period. When we view language as dynamic rather than static we reveal its changes and its progressions.

The purpose of this manual is to discuss methods for text and discourse analysis so as to make apparent that the methods and tools of Document Explorer are applicable to all academic fields. It is expected that the processes of discovery and analysis will be revelatory to research in all fields.

To study language we examine the building blocks of language. These rudiments include words, phrases, sentences, themes and images. Admittedly, this is a simplistic way of defining language and only one of many, yet it suits the needs of this manual well.

This manual introduces and explains tools designed for electronic text analysis and gives examples of actual and possible research to show the versatility and practicality of these tools in all fields of academic research. Examples, though specific and categorized, should be seen as templates that illustrate the practical application of the Document Explorer tools to an array of academic fields.

Section A – Using Word Counts

Word counts are counts of:

  • 1. The total number of words in a document.

  • 2. The total number of unique words in a document.

  • 3. The total number of occurrences of a type or part of a word.

  • 4. The total number of groups of words.

  • 5. Related words.

  • 6. Collocated words.

Each of these counts has potential to reveal useful information about definitions, themes, concepts and images.

Document Explorer incorporates search tools to perform counts of punctuation, individual letters, words, sub-words and word parts, or counts by phrases. In addition, these tools facilitate specialized searches by related words or collocated terms. The following sections discuss each of these types of counts, explain what they are, give applications across several fields of study and outline procedures for Document Explorer users.

Counts of Total Words

Total word count has a long-standing tradition in the classroom. From grade school to the university, writing instructors set a minimum word requirement for compositions. Though far from an unfailing indicator, there is often a parallel between the holistic quality and the length of an essay. When it comes to text analysis, total word counts have the potential to reveal more than a vague association with quality.

Total word counts can be divided into two categories – (1) the total number of words and (2) the number of unique words in a text. A count of the total number of words assesses the size of the text. The count of the number of unique words assesses the size of the vocabulary used. These two measurements can also be calculated for sections or subsections or for different features of a text such as author, theme, time period, source, genre, etc.
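The two measurements can be illustrated outside of Document Explorer with a short Python sketch. It is not how Document Explorer counts words: the `tokenize` rule (lowercase runs of letters, digits and apostrophes) and the function names are assumptions made for this example, and a different rule for what counts as a word would give different totals.

```python
import re

def tokenize(text):
    """Split text into lowercase word tokens (assumed rule: letters, digits, apostrophes)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def word_counts(text):
    """Return (total number of words, number of unique words) for a text."""
    tokens = tokenize(text)
    return len(tokens), len(set(tokens))

sample = "The cat sat on the mat. The dog sat too."
total, unique = word_counts(sample)
print(total, unique)  # 10 7
```

The same function can be applied to a single section, author or time period by passing only that portion of the text.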

Total Word Counts

Total word count is a function of word economy. It can be assumed that the number of words, or verbiage, that an author uses in a text reflects the importance the author attaches to the theme addressed by that text. When the text is made up of several sections, each devoted to a different theme, a count of the words dedicated to each theme can be compared to the overall word economy. Data gathered in this manner can also suggest prejudice and partiality that the author may hold – the true value the author places on the subject treated or audience addressed may be discerned.
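As a minimal sketch of comparing the words dedicated to each theme with the whole, the following Python function computes each section's share of the total word count. The section names and sample text are hypothetical, and words are split on whitespace only, which is an assumption of this example rather than Document Explorer's behavior.

```python
def word_share(sections):
    """Map each section name to its fraction of the total word count."""
    counts = {name: len(text.split()) for name, text in sections.items()}
    total = sum(counts.values())
    if total == 0:
        return {name: 0.0 for name in counts}
    return {name: n / total for name, n in counts.items()}

# Hypothetical speech divided by theme.
speech = {
    "economy": "jobs growth wages taxes trade budget",
    "defense": "troops allies",
    "education": "schools teachers",
}
shares = word_share(speech)
for theme, share in shares.items():
    print(f"{theme}: {share:.0%}")
```

A large share shows only where the words were spent; whether that reflects importance or bias remains a matter of interpretation.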

Linguistic Applications:

How does the verbiage of various languages compare for describing a single event – are some languages inherently more concise than others?

How does word economy correlate with age and development?

How many words does one student use to describe an event compared to another student? Analyze by age group, e.g., compare kindergarten students to 3rd grade students.

Social Science Applications:

What proportion of words in a State of the Union Address deal with a particular issue compared to the other issues in the speech or to the opposing party rebuttal?

What proportion of the words in a newspaper, news transcript, or magazine deal with a particular issue? Do the proportions imply prejudice or bias?

What proportion of the words in textbooks deal with issues of a particular minority group and what can be inferred from such data? i

How many words compose the tax code for the United States? How many for the tax code of the Philippines? What does this imply about taxation of the two countries, about culture and about types of taxation (word counts in income tax vs. sales tax vs. property tax vs. business tax sections)?

How many words does politician A use in answering debate questions compared to politician B?

What is the word count for one inaugural address vs. another? Before TV and after TV? At war or at peace?

What is the ratio of the number of words in the Constitution of the United States regarding a principle to the number of words that the U.S. Congress or Supreme Court uses to articulate and interpret the related law?

Humanities Applications:

How does the propensity to use more verbiage vary from author to author or from one culture to another? Why?

How does one genre compare to another genre?

Business and Market Research Applications:

How many words are in the market survey? (How long is the survey and how much time will it take to read?)

In a verbatim response to a survey question, what category of respondents has the longest response? (Possibly demonstrates interest level by respondent.)

How many total words are in the advertisement, web page, or opinion editorial?

How does the count in the Annual Report compare to similar works?

Procedures: The Search Category Report

The Search Category Report contains general information about the document. Select Search on the Menu and select the Search Category Report. The following describes the contents of this report.

All Words:

Total Words: This is the total number of words in this search category. Depending upon the way the publisher set up the data for this book or group of books, some of these words may be numbers or punctuation marks.

Average Length: This is the average length (mean) of all the words in the search category. It is usually smaller than the average length of the unique words because many short words are high-frequency words, which skews the data toward smaller numbers.

Length Std. Dev.: This statistic gives an insight into the clustering of the length data. In this case it means that approximately 84% of the words will be shorter than 6.1 characters (4.0 + 2.1, the mean plus the standard deviation). The tool tip prompt for words 8 characters in length indicates these words are in the 82nd percentile.
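The 84% figure corresponds to the area under a normal curve below the mean plus one standard deviation. The Python sketch below reproduces that figure for a mean of 4.0 and a standard deviation of 2.1, and also computes the same statistics directly from a word list. The function name `length_summary` and the use of the population standard deviation are assumptions of this example; real word lengths are not exactly normally distributed, so the observed share below the cutoff may differ from 84%.

```python
from statistics import NormalDist, mean, pstdev

def length_summary(words):
    """Return (mean length, std dev, mean + std dev, observed share of words below that cutoff)."""
    lengths = [len(w) for w in words]
    mu = mean(lengths)
    sigma = pstdev(lengths)  # population standard deviation (assumed here)
    cutoff = mu + sigma
    share_below = sum(1 for n in lengths if n < cutoff) / len(lengths)
    return mu, sigma, cutoff, share_below

# Share of a normal distribution lying below mean + 1 standard deviation.
normal_share = NormalDist(mu=4.0, sigma=2.1).cdf(4.0 + 2.1)
print(round(normal_share, 3))  # 0.841

print(length_summary(["a", "to", "the", "word", "count", "lengths", "vocabulary"]))
```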

Average Frequency: The average frequency (count) of all the words in the search category. This is indicated by the red mark on the Low Frq. histogram. Notice that a lot of words occur only a few times and a few words occur a lot of times. This is common.

Frequency Std. Dev.: This statistic gives an insight into the clustering of the data. From this histogram, it is obvious that this is a different distribution of data than seen with the word lengths. For this reason, the only thing such a large standard deviation (relative to the mean) can tell us is that the data is not clustered.
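The pattern described above, with many words occurring only a few times and a few words occurring many times, can be seen in a small Python sketch. The function name, the sample tokens and the use of the population standard deviation are assumptions made for this illustration.

```python
from collections import Counter
from statistics import mean, pstdev

def frequency_summary(tokens):
    """Summarize how often each unique word occurs."""
    freqs = list(Counter(tokens).values())
    return {
        "average_frequency": mean(freqs),
        "frequency_std_dev": pstdev(freqs),
        "words_occurring_once": sum(1 for f in freqs if f == 1),
    }

tokens = "the the the the the the of of of cat dog mat sat ran".split()
summary = frequency_summary(tokens)
print(summary)
```

In this sample the standard deviation is almost as large as the mean, because one very frequent word sits far from the five words that occur only once.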

Please note that to view the Search Category Report on a subsection, first cut and paste the specific subsection of the document into MS Word™, convert that document by using the conversion icon on the Document Explorer toolbar, then open the book in Document Explorer and view the Search Category Report on the selected subsection.

Total Unique Words

A count of the unique words in a text is a count of the size of the vocabulary of an author. (Note that by this definition vocabulary is used in a broad sense and includes the different inflections of a word, and that counting the unique words of a text will not provide a lexicon for the text. Assessing lexicons is addressed in the section titled Counting by Related Words.) Counts of unique words, when compared to counts of total words in a document, show the richness of the vocabulary of the text.
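One common way to express this comparison is the ratio of unique words to total words, often called the type-token ratio. That name and the function below are not taken from Document Explorer; they are a minimal Python sketch, and the word counts used are hypothetical.

```python
def type_token_ratio(total_words, unique_words):
    """Return unique words divided by total words (higher means a richer vocabulary)."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return unique_words / total_words

# Two hypothetical works of equal length.
print(type_token_ratio(5000, 2500))  # 0.5
print(type_token_ratio(5000, 4000))  # 0.8
```

Because longer texts tend to repeat words more, the ratio is most meaningful when the texts being compared are of similar length.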

Linguistic Applications:

How does the size of the vocabulary change through the different stages of childhood?

How do language development and word repetition relate?

What is the size of the vocabulary of a particular language – how many unique words compose the standard newspaper?

Compare domestic or foreign language authors (or cultures) for vocabulary richness.

Compare vocabulary growth in bilingual and monolingual speakers.

Social Science Applications:

What vocabulary distinguishes a particular culture or sub-culture, e.g., gang terminology, Christian terminology, etc.?

Which newspaper or magazine has the biggest vocabulary or is the most expressive?

What is the vocabulary that distinguishes one newspaper from another? What is the difference between different newspapers when covering a particular story or theme?

Which political speakers have the richest vocabulary?

Humanities Applications:

When two authors each produce a work of 5,000 total words, one with 2,500 unique words, the other with 4,000 unique words – what does that say about the authors?

How does an author’s vocabulary richness differ between genres?

How do plays with a similar theme compare in vocabulary use? ii

Business and Market Research Applications:

In a verbatim response – what category of respondent had the largest vocabulary?

In an advertisement or web page – how many unique words are used?

PROCEDURES: TOTAL UNIQUE WORDS

The Search Category Report also contains information on the unique words in a document. The following outlines the information given in the report.

Unique Words

Total Words: The total number of unique words in this search category. This will match the number of words shown below the WordWheel. Depending upon the way the publisher set up the data for this book, some of these words may be numbers or punctuation marks. These words are shown on the WordWheel.

Average Length: The average number (called the mean) of characters in each unique word in the search category. This is also shown as a mark on the bottom axis of the Unique Words - Lengths histogram on the bottom right side of the dialog box above.

Length Std. Dev.: This statistic gives an insight into the clustering of the data. In this case it means that approximately 84% of the unique words will be shorter than 10.7 characters (mean + standard deviation). The tool tip prompt for unique words 11 characters in length indicates that these words are in the 83rd percentile.

Counts of Individual Words

As a discovery tool, counting individual words can have several applications, the primary ones being identifying themes and topic importance. Once themes are identified, they can be examined more carefully by looking at the words in context and the images they build. Discovering topic importance in a text may also provide direction for further analyses and perhaps a better understanding of an author’s approach to and opinion of the topic of the text and the audience to which it is addressed.

The basic idea with individual word counts is that you count how many times each word occurs in a text. This type of counting can be done not only for the text as a whole, but also for a particular part or voice in the text such as a character in a novel or a speaker in a political debate. Once the words in the text are listed in order of the frequency of occurrence, almost at a glance one can get an idea of the content of the text.
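The idea of counting each word, both for a whole text and for a particular voice within it, can be sketched in Python as follows. The debate excerpt, the speaker labels and the tokenizing rule are hypothetical and only illustrate the approach; in Document Explorer these counts are shown on the WordWheel.

```python
import re
from collections import Counter, defaultdict

def counts_by_speaker(turns):
    """Count word occurrences separately for each speaker.

    turns: iterable of (speaker, utterance) pairs.
    """
    counts = defaultdict(Counter)
    for speaker, utterance in turns:
        counts[speaker].update(re.findall(r"[a-z']+", utterance.lower()))
    return counts

# Hypothetical debate excerpt.
debate = [
    ("A", "Taxes, taxes and more taxes."),
    ("B", "Jobs first. Jobs and schools."),
    ("A", "Lower taxes create jobs."),
]
counts = counts_by_speaker(debate)

# List each speaker's words in order of frequency of occurrence.
for speaker in sorted(counts):
    print(speaker, counts[speaker].most_common(3))
```

Adding the per-speaker counters together gives the frequency list for the text as a whole.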

The vocabulary arrayed by counting will help the researcher discover something about the text, the author, or original situation in which the text was authored. This list of unique words can also reveal word groupings. These word groupings can be used in different ways and will be covered in-depth in the section called Counts by Groups of Words.

With counts of individual words, a researcher can also assess expectations for particular word use. Predictive models can be established and authorship examined within the framework of those models. The following examples illustrate applications in the various fields.

Linguistics Applications:

What are the most common words or types of words used at different stages of language development?

What are the most common words used in several languages as measured in newspapers – this would essentially be a comparison of the semantic content, e.g., man, hombre, ish, etc?

What words are overused by children of different ages?

Social Science Applications:

Which topics (foreign policy, welfare reform, domestic economy, etc.) does an author treat most extensively?

What is the importance of a specific topic in a political platform?

How many times does a given president reference deity in his inaugural address?

By performing word counts on political propaganda, what themes and imagery can be identified – what can be inferred about the authors? iii

How can counts of individual words help the researcher to better understand a political speaker – are there contrasts between the speaker’s choice of words and his self-characterization? iv

Humanities Applications:

What are the 100 most commonly used contextual words in Shakespeare’s sonnets, tragedies and comedies?

Which words distinguish one author from another? For example, an author may repeatedly use a preposition such as about when around, on, encircling, upon, up to, concerning, regarding, in relation, etc. may be equally applicable.

What words or what kind of words are most common to a particular genre?

Business and Market Research Applications:

In designing a market survey – what words are overused?

In examining a verbatim response – what words were used the most when answering a question? This is important for building classification codes for verbatim responses.

PROCEDURES: THE WORDWHEEL

The WordWheel lists all the words in the text along with individual word counts (frequency), word length and Z-score.


The WordWheel in WordCruncher is a Windows® list control. The width of the columns can be changed by dragging on the column separators (vertical bars). The scroll bar is used to reposition the word list. In order to type in the WordWheel, the keyboard focus must be activated (highlight a line by clicking on it) and typing must commence without hesitation. Please note that the Z-score gives users a feel for the difference between the sample frequency of a word and the expected frequency.

Counts by Groups of Words

Once individual word counts are examined the researcher can begin to tailor counts by targeting specific types or groups of words.

A researcher can discover a great deal about the author, the time period and the culture associated with the language by performing counts of word groups. The procedure for doing these counts is that the researcher first categorizes words into baskets of similar terms. Grouped terms carry similar connotations such as optimism, pessimism, activity, delay, caution, recklessness, division or faction, union, rebellion, submission, fear, etc. Word groupings can be very extensive and are usually grouped around a hypothesized theme. Once the hypothesized categories are set and words that pertain to the categories are identified, the researcher begins to search for those words within a text. The occurrence or omission of words in the text is what becomes revealing.
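As a rough illustration of the basket procedure described above, the following Python sketch counts how often the words of each hypothesized category occur in a text. The baskets, the sample sentence and the function name `basket_counts` are hypothetical; a real study would build its own categories around its own theme.

```python
import re
from collections import Counter

# Hypothetical baskets of similar terms, chosen only for this example.
BASKETS = {
    "optimism": {"hope", "bright", "confident"},
    "fear": {"afraid", "danger", "threat"},
}

def basket_counts(text, baskets):
    """Return the total number of occurrences of each basket's words."""
    words = re.findall(r"[a-z']+", text.lower())
    freq = Counter(words)
    # A Counter returns 0 for words that never occur, so omissions are handled.
    return {name: sum(freq[w] for w in terms) for name, terms in baskets.items()}

# Made-up sample text for illustration only.
text = "We hope for a bright future, but the threat of danger remains. Hope endures."
result = basket_counts(text, BASKETS)
print(result)
```

A basket that scores zero can be as revealing as one that scores high, since the omission of words is part of what the researcher is looking for.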

It would be well to note that for the purposes of word counts, an idiom (or any phrase that is repeated) can often be considered as a word. Document Explorer can search out a single word, groups of words, a phrase, or any combination of words and phrases.

The applications of this type of counting are myriad as are the potential implications for the results. Some examples are given below to give a feel for the range of applications for this type of word count.

Linguistic Applications:

How can word-group analyses be used to evaluate communication? v

How do words of a particular connotation (positive/negative, friendly, angry, etc.) appear in different media or in a single medium over a specific time period?

How has popular music changed with reference to a specific basket of terms?

Social Science Applications:

How can word-group analyses be used in qualitative evaluation of campaign speeches? vi

How have word-group analyses been used to evaluate WWII Nazi propaganda in film? vii

What do word-group counts on presidential speeches reveal about the current status of the United States or of another country?

What word groups are most common in long-term marriages vs. short-term marriages (those that end in divorce)? What are ratios of positive words to negative words in successful long-term marriages?

What is the ratio of positive to negative words in different media over time? How does this change compare to events of national or international significance?

How do word-group counts in student essays correlate with a propensity toward physical violence – is there a connection between the use of violent words and physical aggression?

Humanities Applications:

Which author makes the most use of words with a positive connotation? A negative connotation?

When considering the themes of different works, which author uses the fewest words to build a theme and which uses the most – what can be implied from such data?

What are the most common word groups in Shakespeare’s tragedies – in his comedies?

Business and Market Research Applications:

In evaluating verbatim responses to a market research survey question, this area facilitates the coding of responses. (Please see tutorial on Coding, Classifying and Ranking Contextual Data.)

In evaluating editorials – can the context be classified as positive, negative, ambiguous, ambivalent, decisive, etc.?

PROCEDURES: GROUPS OF WORDS

By searching with a + sign between the words or phrases, words can be grouped together. The use of wild cards (*) can list all words of a particular type, and by marking the box for “Use all word forms” the search will include all related forms of the words.


Counts of Word Parts

Examining the parts that make up a word is a specialty field of linguistics called morphology. Morphological analysis is based on examining how the parts of a word (including roots, prefixes, infixes and suffixes) are put together.

Morphology has stronger implications in some languages than others. For example, English relies on morphology much less than German. In German it is common to create a single word from other words or parts of words. “Kindergarten” is a German word used in English that demonstrates how German morphs words to create a single word. What morphology reveals depends a lot on the way the language uses morphology.

Applications of morphology in the field of linguistics are readily visible, though for the humanities and political science such applications may not be immediately apparent. What are the uses of morphological analysis? What can be learned by counting morphological occurrences? What can we say of the writer/speaker or the audience?

Verbs: We can look at verb conjugation and other inflections placed on words. These can vary from person to person or from speech community to speech community within the same language.

Pronunciation and orthography: British, Irish, Indian and American English all have different pronunciations. Where these pronunciations are indicated in the orthography they can be counted by Document Explorer.

Linguistics Applications:

When do children begin to use particular morphemic constructions such as past tense verb conjugation and the possessive suffix?

How does child speech differ from adult speech in the creative use of prefixes and suffixes?

How do foreign languages differ in the manner in which they use prefixes, suffixes and infixes?

Social Science Applications:

How do different social classes use prefixes and suffixes?

Does geographical location correlate with non-standard use of prefixes and suffixes?

How are prefixes and suffixes used in music lyrics? How does this vary among the different types of music, e.g., rap, country, pop, opera?

Humanities Examples:

In transcripts of theatrical works and in novels – what is the correlation between dialectal variations as indicated by non-standard prefix and suffix use and character development? What might the correlation reveal about the author’s views of the speech community that typically employs such variations?

How does Shakespeare use prefixes and suffixes, or how does prefix and suffix use differ from Middle English to Modern English?

PROCEDURES: PARTS OF WORDS

The use of word parts can be discovered by first examining the WordWheel to see which prefixes, infixes, or suffixes are attested. Researchers can then use a wild card marker (*) to search specific substrings (word parts.)
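The wild card idea can be illustrated with a short Python sketch that filters a word-frequency list by a `*` pattern. It uses the standard-library `fnmatch` module, whose pattern rules are an assumption here and may differ from Document Explorer's own wild card rules; the word list and the function name `wildcard_matches` are made up for the example.

```python
from fnmatch import fnmatchcase
from collections import Counter

def wildcard_matches(freq, pattern):
    """Return the words (with counts) that match a *-style wild card pattern."""
    return {word: n for word, n in freq.items() if fnmatchcase(word, pattern)}

# Hypothetical word counts, standing in for a list such as the WordWheel shows.
freq = Counter({"unhappy": 2, "undo": 1, "happy": 5, "happiness": 3, "kindness": 4})

prefix_un = wildcard_matches(freq, "un*")      # words beginning with un-
suffix_ness = wildcard_matches(freq, "*ness")  # words ending in -ness
print(prefix_un, sum(prefix_un.values()))
print(suffix_ness, sum(suffix_ness.values()))
```

Note that a pattern such as `un*` also matches words in which un- is not a true prefix (for example, under), so the matched list still needs to be reviewed by the researcher.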


Note that by using Boolean strings, a combination of word parts and word groups can be searched.

Counts of Related Words

A count of unique words counts each form of a word separately (e.g., hate, hateful, hatred). A lexicon counts all of the various forms of a word as the same word. For example, the words eat, eats, eaten and ate are different yet are based on the same lexical item and represent one entry in a lexicon. Counted this way, a text might have 8,000 total words and 4,500 unique words, but only 3,000 words based on a lexicon.
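The difference between the three counts can be seen in a small Python sketch. The lemma table `LEMMAS` below is a tiny hypothetical stand-in for a real lexicon, and the function name and sample sentence are likewise made up for illustration.

```python
import re

# Hypothetical lemma table: maps inflected forms to a single lexical entry.
LEMMAS = {"eats": "eat", "eaten": "eat", "ate": "eat"}

def vocabulary_sizes(text, lemmas):
    """Return (total words, unique words, lexicon entries) for a text."""
    words = re.findall(r"[a-z']+", text.lower())
    unique = set(words)
    # Forms not listed in the table are treated as their own lexical entry.
    lexical = {lemmas.get(w, w) for w in unique}
    return len(words), len(unique), len(lexical)

total, unique, lexical = vocabulary_sizes(
    "She eats what he ate, and they eat what was eaten.", LEMMAS
)
print(total, unique, lexical)
```

Here eats, ate, eat and eaten are four unique words but collapse to one lexicon entry, which is why the lexicon count is the smallest of the three.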

Discovering and analyzing the lexicon of an author can reveal a great deal. A lexicon is developed as a result of encounters with the world. Hence, by examining the quantitative and qualitative properties of the lexicon, it is possible to discover something of the level and type of education or the diversity and nature of life experiences that an author has had.

An individual’s lexicon changes as time passes. This may be intentional or subconscious. Authors who are very aware of their word use, probably most common in politics, adapt their words to fit the audience or topic addressed. Other authors’ lexicons change as they traverse life phases.

Researchers can examine the lexicon of an author as a whole, or for as many of their works as can be found. By examining the lexicon of an author at particular points in the author’s career and comparing data gathered from those analyses to life events, the researcher often discovers correlations that contribute to a deeper understanding of the author.

Linguistic Applications:

When do children begin to learn to expand their vocabulary with related words – when do they begin to use all verb inflections vs. just one or two?

How rich is a particular language in its availability of words for expression?

Social Science Applications:

Are certain forms of words more prevalent in a particular culture, in a particular newspaper, in speeches by a particular political party or in works by a particular author?

Humanities Applications:

How many different related words does Shakespeare use to convey a particular thought or image?

How does an author use related words to stress a particular message?

Business and Market Research Applications:

Use of the various lexicons in Document Explorer/WordCruncher facilitates “Word Groupings” and helps build classification codes for verbatim questions.

PROCEDURES: ALL WORD FORMS

In the Search window, mark Use all word forms to search for all related word forms. This function automatically includes all related word forms for individual words and groups of words in a single lexical search.


Counts by Collocated Terms

Collocated terms are words that are co-located or located near each other. Sometimes referred to as correlated or neighborhood terms, collocation is a great instrument for analyzing the content of a text. For example, in the King James Bible, by performing a collocation for the word love the researcher finds that it is most commonly associated with the words hate, neighbor and husband. These three words are called collocates of the word love in the King James Bible.

Collocation words are listed in an array with the statistical properties of counts, frequency and expected values. The list represents words that are located within a specified proximity to the search word. The array of words listed by frequency is a valuable list allowing the user to peruse the words most closely linked to the searched term. Simply viewing collocated terms from two different authors, newspapers, or political platforms can be very revealing. Collocations can be a tremendous tool for examining content, authorship, style, theme and image analysis.
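To make the idea of a neighborhood concrete, the following Python sketch counts the words found within a fixed number of positions before and after each occurrence of a search word. It is a simplified illustration, not the Document Explorer algorithm: the function name `collocates`, the window handling and the sample text are assumptions.

```python
import re
from collections import Counter

def collocates(text, anchor, before=3, after=3):
    """Count words within `before`/`after` positions of each anchor occurrence."""
    words = re.findall(r"[a-z']+", text.lower())
    neighbors = Counter()
    for i, w in enumerate(words):
        if w == anchor:
            # Words before and after the anchor; the anchor itself is excluded.
            window = words[max(0, i - before):i] + words[i + 1:i + 1 + after]
            neighbors.update(window)
    return neighbors

# Made-up sample text for illustration only.
text = "Love thy neighbor. Those who love do not hate. Hate is not love."
result = collocates(text, "love")
for word, n in result.most_common():
    print(word, n)
```

In this sketch a word that falls inside two overlapping neighborhoods is counted twice; whether a given tool does the same is something to check before comparing numbers.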

The statistical properties of collocated terms will be addressed in a later section, but it should be pointed out that by sorting the various statistical parameters each can reveal a different ranking of the collocated terms.

Applications:

What terms are generally collocated in a particular text?

Why does the author collocate those particular terms?

What are the statistical values associated with the Document Explorer collocation report?

Are the occurrences of the collocations statistically significant?

How can collocated terms assist in building classification coding for verbatim responses or media editorial evaluation?

PROCEDURES: SORT BY NEIGHBORS REPORT

The Sort by Neighbors report shows the occurrence-related data of words found adjacent to the search anchor word (first word in search argument). This report is always visible on the Sort by Neighbors tab. By using the Sort by Neighbors tab of User Preferences all options that affect the appearance of this report can be set. Below is a list of the data that is viewable with relation to any search.


Viewable Search Information

Neighborhood Width

This option determines the size of the report. The maximum neighborhood size is 25 words before and after the anchor word. In the example used from Constitution Papers, with the search results from the word "freedom" (67 hits) the following table shows the number of unique words in the neighborhood based upon neighborhood width.

These numbers are representative; however, the numbers will be different for every search. Notice that the range runs from the maximum of 25,25 to the two minimums (0,1 or 1,0). The usual neighborhood size is 10 or less. Words farther from the search anchor word than that usually have fewer associations with it. The cases where the minimums are used are unusual; however, they are valid and in some studies very useful.

Rating

This is a custom Document Explorer/WordCruncher statistic: it varies between 10 and -10. A rating greater than zero is shown in black and means you might wish to pay attention to these words. The explanation that follows is technical and requires exposure to statistical concepts. This rating is defined as:

Rating = ((word Z-score - average Z-score) / standard deviation of the Z-scores) × 2

A rating greater than 10.0 is set to 10, and a rating less than -10.0 is set to -10. This statistic has been normalized to allow comparisons between reports.
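For readers who want to see the rating arithmetic worked through, here is a minimal Python sketch of the definition given above. It assumes the population standard deviation is intended (the guide does not say which form is used), and the function name `ratings` and the input values are hypothetical.

```python
from statistics import mean, pstdev

def ratings(z_scores):
    """Rescale Z-scores: ((z - average) / standard deviation) x 2, limited to [-10, 10]."""
    avg = mean(z_scores)
    sd = pstdev(z_scores)  # Assumption: population standard deviation.
    if sd == 0:
        # All Z-scores identical: no word stands out, so rate everything 0.
        return [0.0 for _ in z_scores]
    return [max(-10.0, min(10.0, 2 * (z - avg) / sd)) for z in z_scores]

# Hypothetical Z-scores for three neighborhood words.
example = ratings([0.0, 2.0, 4.0])
print(example)

# One extreme word among many ordinary ones is capped at 10.
capped = ratings([0.0] * 30 + [100.0])
print(capped[-1])
```

Because each rating is measured against the average and spread of the Z-scores in the same report, a rating above zero simply marks a word whose Z-score is above the report's average.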

Z-score

Again, the explanation that follows is technical and requires exposure to statistical concepts. This statistic is designed to give you a feel for the difference between the sample frequency of a word in the neighborhood and the expected frequency (based on the ratio of the total neighborhood size to the total size of the document). This difference is divided by the standard deviation of the sample size to normalize this difference for comparison purposes (comparing data between two reports). This data is used in computing the rating.

Sample Frq

This is the frequency count of the number of times this word occurs in the neighborhoods.

Total Frq

This is the frequency count of the number of times this word occurs in the total book.

Percent

This is the ratio of sample frequency to total frequency (e.g., if the sample frequency is 10 and the total frequency is 67, the percent is 14.9%). This may be more useful than reporting the sample frequency because it incorporates the total frequency as well.

Expected Frq

This is the ratio of the neighborhood size to the book size, multiplied by the total frequency (e.g., if the neighborhood is 1/20th the size of the book, then we would expect to find 1/20th of the occurrences of the word in the neighborhood).

Std Dev

The standard deviation of the expected frequency (gives a feel for the clustering of these frequencies).
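The report columns described above can be worked through in a short Python sketch. Percent and Expected Frq follow the definitions given here. The standard deviation formula is NOT specified in this guide, so the sketch assumes a binomial model (each occurrence of the word falls inside the neighborhoods with probability equal to neighborhood size divided by book size); treat that line, the function name `neighbor_stats` and the neighborhood and book sizes as assumptions.

```python
from math import sqrt

def neighbor_stats(sample_frq, total_frq, neighborhood_size, book_size):
    """Compute Percent, Expected Frq, Std Dev and Z-score for one word."""
    p = neighborhood_size / book_size           # share of the book in the neighborhoods
    expected = p * total_frq                    # Expected Frq, as defined above
    percent = 100.0 * sample_frq / total_frq    # Percent, as defined above
    std_dev = sqrt(total_frq * p * (1 - p))     # ASSUMPTION: binomial standard deviation
    z_score = (sample_frq - expected) / std_dev
    return {"percent": percent, "expected": expected,
            "std_dev": std_dev, "z_score": z_score}

# Sample frequency 10 and total frequency 67 as in the Percent example;
# the sizes (5,000 of 100,000 words, i.e. 1/20th of the book) are hypothetical.
stats = neighbor_stats(10, 67, 5000, 100000)
print(round(stats["percent"], 1), round(stats["expected"], 2), round(stats["z_score"], 2))
```

A positive Z-score means the word turns up in the neighborhoods more often than its overall frequency would lead one to expect.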

Sort Sequence

The default sort sequence for this report is by rating (descending). You can change this by clicking on the column header you wish to have as the primary sort key. Clicking on the column header again changes the sort direction from descending to ascending or from ascending to descending.

Section B – In-Context Searching

The preceding section of the manual focused on how various types of counts can be used and their possible applications across different fields. This section of the manual will discuss discovery techniques. Many of the methods discussed in Section A provide a foundation for Section B – for example, counting individual words and word groups provides an overall view of what themes are found in a text. These themes can then be scrutinized by using techniques explained in this section of the manual.

In-context searching facilitates an examination of word use, meaning and consistency of use and meaning throughout a text or across a group of texts.

Document Explorer incorporates in-context search tools that allow the researcher to view how a particular word is used. One objective of searching in context is based on the fact that words may vary in meaning according to authorial discretion. The principle that words can vary in meaning is one aspect of language that makes it so versatile and creative.

Meaning and Word Use

By examining word use in-context the researcher is able to see the meaning that the author gives to particular words or phrases. Any word or phrase in a text has the potential to have a spectrum of meanings. A good example of this is the word love in English. The left column shows some possible uses of the word love with corresponding semantic content in the right column:

Use of the word love          Semantic content
John loves God.               Devotion or adoration.
Mary loves her husband.       Affection.
Sue loves Bob.                Romance.
He loves Pizza Hut.           Preference.
He loves to golf.             Derive pleasure/enjoyment.
Joe loves women.              Lust sensuously.
Jim loves money.              Lust greedily.

The technique of examining how an author uses individual words is a common method of literary and political analysis. This tool can reveal much about the tone (for example, positive or negative) of the text. Political spin is based on contextual word and phrase use. A fascinating exercise is to compare newspapers, magazines and news programs for contextual definitions of words. The contextual views can reveal slants toward liberal or conservative prejudice. Characteristics of contextual definitions are often a reflection of the author and his appraisal of the content, audience, time period, genre or medium.

An extension of searching for contextual definition is searching for definition consistency across a text or group of texts.

Consistency in Word Use

When researchers study the many uses of a word, they often do so across a series of texts rather than with an isolated text. A good example is in the field of law, where legal precedent often hangs on the consistent in-context use of a word or phrase. Lawyers and their researchers have to look through the documentation of many related cases and identify where and how a term is used to find a contextual definition that suits their needs.

Other fields, such as the humanities, examine consistency of term usage within a document or across a series of texts. Searching across authors, genres, or across a given time period allows researchers to examine word use and consistency. Political science researchers can scan for term usage across a library of newspaper articles or campaign speech transcripts and compare term use.

In the past, much of the research by this technique has been done manually. With Document Explorer, the computer shuffles through the paperwork, leaving valuable time for the researcher to observe the data and expand the search to related words or collocated terms.

Throughout their career, authors may use a word or phrase consistently or with variation. Variation in word use may be a result of an author’s personal views, experience, writing style, or something else entirely. The following examples provide additional applications of contextual searching.

Linguistic Applications:

How does an increase of variations in word use correlate with age and development?

How has the use of euphemisms developed in American newspapers?

How is a particular term used differently across fields, e.g., bond: in law a bail bond, in business stocks and bonds, in science a chemical bond, etc.?

Which English words have the broadest spectrum of meanings?

Which language averages the greatest number of meanings per word – what can be implied from this data?

Social Science Applications:

How are the terms love, hate, commitment, etc. used in successful vs. non-successful marriages?

Do Jay, Hamilton and Madison use tyranny similarly in the Federalist Papers?

How are terms such as morality, virtue and values used in the news media?

How are politically correct terms used in today’s newspapers compared to those from 1980, 1960, 1940, etc?

Are the terms in a proposed bill used the same as they were used in preexisting laws?

Humanities Applications:

What particular manner of using given terms or phrases distinguishes Shakespeare from another author? viii

How broadly does an author define a term throughout a single work?

Does a set of authors use a term consistently or differently – if differently, how?

How do the theme and imagery within a text change as the contextual definitions of terms change?

How does an author use words to build up imagery – are there patterns in the number of words or kinds of words used?

Business and Market Research Applications:

Is the market survey designed using a uniform vocabulary across the entire survey?

Is the advertisement designed with a uniform vocabulary?

In a verbatim response to a market survey – what terms are the same yet have different contextual definitions? How does this change the coding and classification of the responses?

PROCEDURES: IN-CONTEXT SEARCHING

After searching a word or phrase, the Reference List window shows the “hits” in-context and gives the location reference for each “hit.”


The Reference List window is a container that holds smaller reference windows. The following options apply to all three tabs:

  1. Reference windows can be deleted until only one remains.

  2. As many reference windows can be added/opened as the user desires.

  3. The scroll bar or mouse wheel changes which reference window is highlighted.

Options and Elements of the Reference List Window

Reference Window

Each reference window has a small text window along with a citation line, a view option, an occurrence number, a select option and a delete option, each explained below.

Citation Line

The citation line shows the location of the reference within the work being searched.

View Option

Clicking on the View option allows you to view searched words and phrases in a broader context in a text window.

Occurrence Number

A serial number assigned to each hit (search result).

Select Option

Clicking on the Select option allows you to transfer a reference found in the Reference List window to the Selected References window.

Delete Option

Selecting the Delete option allows you to delete references found in the Reference List window.

Section C – Discovering with Runtime Concordances

In its simplest form a concordance is a listing of all the words in a text, given within their respective contexts. For example, a concordance for a literary work such as Mark Twain’s Tom Sawyer would list every word in that book in alphabetical order, each word being accompanied by a pre-specified amount of text as it occurs before and after the word in the actual document.

A runtime concordance, on the other hand, lets you search for any word and immediately see the contextual uses of that word. The search program in Document Explorer is a runtime concordance. Advantages of this concordance builder as opposed to a textual concordance include (1) speed -- users can build a concordance faster than they can turn pages to find the cited word, (2) flexibility -- users can build a concordance for groups of words or phrases, (3) specificity -- users can build a concordance based on collocated terms and (4) wild cards -- users can build a concordance using wild cards for prefixes, suffixes, or word roots.

Examining word use with the search window and building runtime concordances are both based on the principle of viewing results in context. Document Explorer can build an enormous reference list of hits that can be exported for review and examination. With the use of wild cards (*) the entire book or library of books can be exported to a physical concordance or viewed in runtime.

A physical concordance displays all of the words at once (which can be a huge amount of data) and confines the researcher to the number of words set before and after each entry, which limits the usefulness of its contextual view. The runtime concordance allows the user to quickly broaden the context shown for a particular reference.
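The contrast between a fixed context and an adjustable one can be sketched in Python as a simple keyword-in-context listing. This is an illustration of the general technique only, not of how Document Explorer builds its Reference List; the function name `concordance`, the `width` parameter and the sample text are made up for the example.

```python
import re

def concordance(text, target, width=3):
    """Return (left context, hit, right context) for each occurrence of target."""
    words = re.findall(r"[A-Za-z']+", text)
    lines = []
    for i, w in enumerate(words):
        if w.lower() == target.lower():
            left = " ".join(words[max(0, i - width):i])
            right = " ".join(words[i + 1:i + 1 + width])
            lines.append((left, w, right))
    return lines

# Made-up sample text for illustration only (not a quotation).
text = "Tom said nothing. Tom looked at the fence and sighed."

narrow = concordance(text, "tom", width=2)  # fixed, narrow context
wide = concordance(text, "tom", width=4)    # same hits, broader context
for left, hit, right in wide:
    print(f"{left:>20} [{hit}] {right}")
```

Rerunning the same search with a larger `width` is the runtime equivalent of broadening the contextual view, something a printed concordance cannot do once its context width has been set.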

Concurrent examination of many words and phrases and the ability to control contextual view parameters are powerful tools for research. Although these tools are not limited to Linguistics, Humanities, Political Science and Foreign language, examples of research that has been done in these fields with the use of a concordance are given below.

Linguistic Applications:

Does authorial gender correlate with the way that particular words are used? ix

Social Science Applications:

Use the concordance program to compose a dictionary to teach foreign language vocabulary. x

When two political platforms are compared, is term use consistent between them?

When two legal documents are compared, is a specific term used with consistent or variant contextual meaning?

When comparing use of “politically correct” terms in newspapers it will be seen that these terms generally change over time. What can be implied from these changes?

For terms related to Civil Rights, freedom, prejudice, private, etc., how are these terms contextually used by several different forms of governments, e.g., Communist, Capitalist, Socialist, etc?

Humanities Applications:

How can concordances be used to detect plagiarism? xi

What are the contextual definitions for the terms justice, peace and freedom as used by one author compared to another?

Business and Market Research Applications:

How can the contextual searching and coding applications be expanded to include multiple classification categories?

How can runtime concordances be useful in adjusting classifications on the run?

PROCEDURES: SINGLE AND MULTIPLE-WORD CONCORDANCES

The runtime concordance is viewed from the Search Reference List window. Note how users can build multiple concordances very quickly. A runtime concordance can be built for a combination of words and for all word forms.

Single Word Concordance

Combination words and all word forms

The Reference List can be exported to a file to be edited or for use in print material by selecting File, Save Reference List.

Section D – Comparing Readability Levels

Readability levels are a measure of the ease or difficulty with which a text can be understood. A common formula for calculating this is the Flesch-Kincaid Readability Formula:

Grade Level = 0.39 × (average words per sentence) + 11.8 × (average syllables per word) – 15.59

This formula combines the average number of words per sentence with the average number of syllables per word, each weighted by a constant. There are many existing formulas used for calculating readability, all of which have their strengths and weaknesses. For our purposes, we are not looking at specific readability, but at comparing readability of different texts. The Flesch-Kincaid has proven accurate enough that it has been built into Microsoft’s word processor, Word. We will use this method in comparing the readability between texts.
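For comparing texts outside of a word processor, the grade level can be approximated with a short Python sketch. It uses the standard published Flesch-Kincaid Grade Level coefficients (0.39, 11.8 and 15.59) applied to words per sentence and syllables per word. The syllable counter is a crude vowel-group heuristic and an assumption of this example, so its results will not match Microsoft Word exactly; the function names and sample sentences are also made up.

```python
import re

def count_syllables(word):
    """Rough syllable estimate: count vowel groups, dropping a silent final e."""
    word = word.lower()
    n = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and n > 1:
        n -= 1
    return max(1, n)

def flesch_kincaid_grade(text):
    """Approximate Flesch-Kincaid Grade Level of a text."""
    # Assumption: sentences end at '.', '!' or '?'.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)

# Made-up sample texts for illustration only.
simple = flesch_kincaid_grade("The cat sat. The dog ran.")
hard = flesch_kincaid_grade(
    "Constitutional interpretation requires considerable deliberation."
)
print(round(simple, 2), round(hard, 2))
```

Because the purpose here is comparison rather than an exact grade, applying the same approximate method to every text matters more than the absolute values it produces. Very short samples such as these give extreme scores; real comparisons should use longer passages.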

In private, public and commercial sectors readability levels have been used in various ways. Some use it to promote reading by guiding readers to publications with a readability level that corresponds to the reader’s level – this is typified where libraries assign a readability level to books. Others suggest that any document conveying critical information, from legal forms to websites, should be tested for readability in order to increase the probability of successfully conveying the information. The readability level of the Miranda Rights has been examined in order to determine the chances of that information being unsuccessfully conveyed.

Readability levels can reflect authorship, the degree of education or intelligence of the author or the audience as the author may choose to write or speak with more simplicity or complexity, depending on the audience. The numerous causes for discrepancies among readability levels are not always immediately perceptible. The process for discovering these causes is to calculate and compare readability levels.

The following are general examples for using comparative readability levels across:

two or more different authors

several different speeches from the same author

different genres, e.g., fiction vs. prose

one time period to another

one geographical location to another

These examples of readability comparisons have been used in actual research. Many other types of comparisons can be found according to the needs and creativity of the researcher. This research can reveal differences, but in order to say that there is a significant difference the researcher must use statistical methods outlined in the analytical procedures section.

Linguistic Applications:

How do the readability levels of spoken vs. written texts compare to each other at different ages/developmental stages?

Social Science Applications:

What are the readability or grade levels of speeches given by several presidents of the United States? xii

What is the difference in readability levels of the responses to questions in a live debate compared to scripted speeches?

How do Congressional bills or documentation from federal agencies (IRS, EPA, etc.) compare to the readability levels of media that are commonly and generally accessed by the American people such as newsprint, news broadcasts, pulp fiction, etc?

Humanities Applications:

What does readability level say about an author, his audience, or his topics?

How does one author’s readability level differ from another’s (e.g., how does Shakespeare’s readability level compare to that of another author of that era such as Hobbes or Bacon)? Why?

Business and Market Research Applications:

What grade level is the advertisement or the opinion editorial?

How do you classify verbatim responses to a market survey by education level?

If the respondent’s education level is known, the readability level can help confirm either the education level or the interest level of the respondent.

PROCEDURES: CALCULATING FLESCH-KINCAID

This is the procedure for calculating readability levels with Microsoft’s Word ©. Select the Tools menu, then Spelling and Grammar. When the Spelling and Grammar checker opens, click on Options. Finally, check Show Readability Statistics under the Spelling & Grammar tab (the lowest box). When Microsoft Word finishes checking spelling and grammar, a dialog box will display information about the reading level of the document, including a readability score calculated by the Flesch-Kincaid formula.


Section E – Examining Collocated Words

Collocated terms are terms that are co-located or located near each other – performing a collocation on a word in a text shows a word and lists the words that most commonly occur near that word in the text. In the King James Bible the researcher finds that the word love is most commonly used along with the words hate, neighbor and husband. These three words are called collocates of the word love.
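The idea of a collocation count can be sketched in a few lines of Python. This is only an illustration of raw neighbor counting under stated assumptions: the function name `collocates`, the `width` parameter (the number of words examined on each side) and the sample sentence are all hypothetical, and the sketch does not reproduce Document Explorer's own rating or z-score statistics.

```python
import re
from collections import Counter


def collocates(text: str, target: str, width: int = 3) -> Counter:
    """Count the words found within `width` words on either side of `target`."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, word in enumerate(words):
        if word == target:
            before = words[max(0, i - width):i]
            after = words[i + 1:i + 1 + width]
            counts.update(before + after)
    return counts


# Hypothetical sample text, not taken from any particular edition.
sample = "Love thy neighbor. Hate not, but love thy neighbor still."
neighbors = collocates(sample, "love", width=2)
most_common = neighbors.most_common(3)
```

A wider neighborhood finds more distant associations but also admits more incidental words, so the width is worth varying when exploring a text.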

Collocation counts can help users discover themes and recognize imagery as well as analyze aspects of an author’s style. The associations an author gives to words, both of contrast and similarity, are much more easily identified in the results of a collocation count. The applications for using collocations are limited only by the ingenuity of the researcher. Below are some examples of ways in which collocation has been used in research.

Linguistic Applications:

What patterns of collocated words characterize a particular author’s works?

What words are most commonly collocated for ESL students?

What terms are commonly collocated by native English speaking children at different ages, e.g., at what age do freedom and speech begin to be collocated?

Which languages consistently rely on collocated terms (including reduplication) for expression? For example, in Mandarin Chinese chr fan is a common collocation which literally means eat rice but has come to be used as eat any food at all.

Social Science Applications:

What subcultures collocate words such as faith and god or big and government most frequently?

What are common collocations found in rap, country, classic rock, alternative, classical symphonic, elevator music, etc. – what do these collocations reveal about the cultures that produce and consume such music?

How do collocations vary from text to text as the nature of the content of those texts varies, e.g., what are collocations common to texts dealing with U.S. military histories vs. collocations found in texts dealing with the history of U.S. civil rights?

Humanities Applications:

When collocations are performed on an author’s works, what aspects of the author’s life become salient? xiii

How can imagery in a text be discovered with collocated terms? xiv

Business and Market Research Applications:

What words are consistently collocated with the classification grouping words? How does this change the classification word groupings? (bad, not bad)

What words are commonly collocated with the advertisement “hooks?”

PROCEDURES: THE COLLOCATION WINDOW

Collocations are shown in the “Sort by Neighbors” window. Double clicking on a specific word will show the collocation hit in context as shown below.


The above example shows some of the results for a collocation search on love in Hamlet. Highlighting a word (liar) will change the reference list to show the collocated term in context in the upper window. Select Help from the Menu for explanations of the statistics of the collocated terms.

The parameters for collocation statistics can be set/altered by clicking the Report Preferences icon (middle icon) on the Search Results window.

Section F – Style Analysis

To begin it would be well for the reader to note that there has been a great deal of research in the area of style analysis – authorship studies taking the forefront of the field. Such research projects are usually ongoing and quite lengthy. The aim of this section is not to provide an exhaustive history of style analysis but to reduce the process to its essentials and explain how to apply those essentials to research.

Style analysis fundamentally seeks to distinguish one text from another and to establish or discover something of the origins of the text analyzed. This section defines what style is, explains how style analysis is significant and discusses how to formulate research in order to identify styles.

What is Style?

Style is a flexible term. Most students have at least a feeling for what the word style can mean. In literature, style is related to the form of expression that an author uses. In order to offer a methodology for style analysis a clear definition of style must first be provided.

For the purposes of textual analysis style is a conglomerate effect of any number of distinct features, which in their sum provide an overall distinctness for the text. By that definition, style is a sufficient number of the features of a text, which taken as a whole, set the text apart from other texts. When people speak of Shakespeare’s style, they refer to the features of Shakespearean works that together make his works distinctly Shakespearean.

It would be well to define what is meant by features since features are what make up style. Features are components or facets of the text that can be identified and examined independent of other components or facets of the text. Textual features relating to the form of composition include such things as punctuation, orthography, sentence structure and word choice. Features relating to content include theme and poetic devices like metaphor.

In a text these kinds of features are everywhere – one could argue that the text is entirely composed of such features. To build a style the features must occur in some sort of pattern. To illustrate how feature patterns occur let us examine the feature of sentence complexity. The feature of sentence complexity is part of almost every text but in order for this feature to contribute to a style the researcher must identify a pattern of occurrence. A pattern of occurrence in sentence complexity could be something like compound sentences occurring only at the end of paragraphs or perhaps a complete absence of simple sentences throughout a given text. Thus, style consists of not merely random features clustered together to form a text but of identifiable patterns in the features of a text. Style analysis and feature pattern analysis are used interchangeably in this section.

The Significance of Style Analysis

Discovering the style of a text could be quite bland if the process never got outside of the text itself. If all that is known about the text is how its features are organized, not much is known. As it turns out, style will always point back to some aspect of the origin of the text. The correlation between the style and the origin of the text pushes the study of style beyond the text itself and grounds style analysis in the real world.

The aspects of textual origin that correlate with style are: (1) the genre of the text, such as poetry, expository, narrative, etc.; (2) the medium by which the text is presented, for example, book form, newsprint, conversation, etc.; (3) the era in which the text originated, e.g., during WWII or early in the 19th century; (4) the geographical area from which the text originated, Australia, New England, Rome, etc.; (5) the author of the text, Shakespeare, Joseph Conrad, a college student, the United States Supreme Court, etc.; (6) the nature of the content of the text, for example medical, military, academic, personal, etc.; and (7) the audience addressed by the text, such as citizens of a nation, a particular ethnic or age group, etc.

The point is that whenever patterns in features are discovered there will always be a correlation between the patterns and one or more of the aspects of textual origin. The example questions below illustrate some ways in which feature patterns and aspects of textual origin could correlate. Each bulleted item is followed by a parenthetical label for the aspects of textual origin that the question correlates to style.

Linguistic Applications:

How can style be analyzed to classify texts by genre? xv (genre)

Social Science Applications:

What is the difference in speech styles between adolescents and adults? (author)

What aspects of syntax vary from discourse to written texts? (medium)

How do features of style vary between texts explaining medical procedures vs. texts explaining military procedures? (nature of content)

How do feature patterns in conversational speech differ from those of public discourse? (medium, audience)

How do the feature patterns of speeches by a given political candidate differ from one audience to another – when the audience is composed mostly of African Americans vs. Caucasians? (audience)

How do styles of congressional bills vary over time? (era)

What features of the Federalist Papers of undetermined authorship have been analyzed to evince authorship by one party or another? (author)

Humanities Applications:

How are patterns in features of expository texts from the early 19th century distinct from patterns in the features of persuasive texts from the same era? (era, geographical area, genre)

When authorship is unknown, how can style analysis be used to establish a probable author? xvi (author)

What feature patterns distinguish personal communications by a group of famous authors from their professional writings? (nature of content)

What contrast does style analysis reveal between music lyrics and written poetry? (medium)

Business and Market Research Applications:

Is there a particular style to the advertisement?

Is there a pattern in the survey design?

Is there bias in the question outline?

Discovering Style

There is no single formula for discovering style – remember that a style consists of enough features occurring in identifiable patterns to make the text distinct from other texts. Every researcher has different motives for performing style analysis – some may want to examine authorship, others seek to discover the distinctness of one medium versus another and other researchers might want to know how style has changed over time for a given location or genre.

The researcher will begin stylistic analysis with a general research question relative to one or more of the seven aspects of origin, perhaps, “What are the characteristics of Emily Dickinson’s style?” or “How do the speaking styles of W. J. Clinton and Ronald Reagan differ?” After collecting the appropriate texts the researcher will begin a search for patterns in textual features.

These are the three categories that most features of a text fall into. Below each category are examples that show the researcher how to frame questions in order to discover feature patterns. The bolded words indicate textual features.

1) Patterns in the content: What is in the text?

What kinds of inflections occur in the texts – which verb conjugations are omitted that are generally common to texts?

What grammatical devices are used – are sentences simple, complex, compound?

What kinds of words are used– what is the ratio of function words to content words?

What orthographic variations are present – analyze vs. analyse?

What themes are present – good vs. evil, redemption, childhood innocence, etc?

What poetic devices are employed – is metaphor, rhyme or meter present?

What word groups are present – considering the 100 most common words, are they optimistic/pessimistic, legal, dynamic/sedentary, foreign, etc?

What collocations are present – freedom, speech and press, right and bear arms?

2) Patterns in discretionary use: How is the content being used?

Where in the text do inflections occur?

How are words selected – as a function of character development, poetic device or authorial dialect?

What terms are used to build up imagery – are themes formed by patterns of rhyming words, by use of idioms or by a set of terms being collocated repeatedly?

Is a word used denotatively or metaphorically – is head used to mean the body part or the uppermost portion of another entity such as a line of people or a river?

Is a word used denotatively or connotatively – is home used strictly to indicate a dwelling place or to bring to mind impressions of family, security and inclusion?

What types of novel or non-standard word use are evident – are there instances of backformation, e.g., orientate for orient; borrowing, e.g., avant-garde, carpe diem, amigo; vulgarity; blending, e.g., smog for smoke + fog; clipping, e.g., meg and net for megabyte and internet; coinage or word manufacture, e.g., musquirt for the clear runny juice that always comes out before the mustard?

What are the dynamics of the semantics – are contextual definitions consistent or do they vary?

What are the dynamics of the pragmatic aspects of the text – how are aspects of truth, quantity and relevance of information, etc. manipulated, e.g., to a humorous end or otherwise?

What is the nature of the collocations in the text – are the collocations repeatedly composed of only two or three words or are they composed of several words, do the collocations play any role in organizing imagery?

3) Patterns in associations: How do things fit together?

Which words are regularly collocated – dog and cat, man and woman, friendly and fire, etc?

What patterns in function word ratios and word groups are evident - what is the ratio of a collocation such as to x to (where x is any word) over all the instances of to?

What kinds of punctuation are used and where?

What words start and end sentences – what is the percentage of sentences that start with a, an, and, in, it, that end with it, that have the or with as the second-to-last word?
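Two of the feature questions above, the ratio of function words to content words and the share of sentences that start with particular words, lend themselves to a short worked sketch. The Python below is illustrative only: the `FUNCTION_WORDS` set is a deliberately tiny, hypothetical list (a real study would use a fuller one), the helper names are assumptions, and sentence splitting on ., ! and ? is a simplification.

```python
import re

# Illustrative, incomplete list of function words (an assumption for this sketch).
FUNCTION_WORDS = {"a", "an", "the", "and", "but", "or", "of", "to",
                  "in", "it", "that", "with"}


def split_sentences(text):
    """Return each sentence as a list of lowercase words."""
    parts = re.split(r"[.!?]+", text)
    return [re.findall(r"[a-z']+", p.lower()) for p in parts
            if re.search(r"[A-Za-z]", p)]


def sentence_start_share(text, starters):
    """Fraction of sentences whose first word is in `starters`."""
    sentences = split_sentences(text)
    hits = sum(1 for s in sentences if s[0] in starters)
    return hits / len(sentences)


def function_word_ratio(text):
    """Function words divided by content words (needs some content words)."""
    words = [w for s in split_sentences(text) for w in s]
    function = sum(1 for w in words if w in FUNCTION_WORDS)
    return function / (len(words) - function)


sample = "It was late. The dog barked. And then it rained."  # hypothetical text
share = sentence_start_share(sample, {"a", "an", "and", "in", "it"})
ratio = function_word_ratio(sample)
```

Computing the same measures for two sets of texts gives the researcher numbers that can later be compared with the statistical procedures in Part II.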

PROCEDURES: STYLE ANALYSIS

Style analysis is basically pattern analysis and Document Explorer is very useful for pattern discovery. Document Explorer’s search program allows the user not only to view all the hits in-context, but to view the entire document with the hits highlighted.


As Document Explorer users search on thematic words, search the WordWheel for terms or find collocated terms, patterns in lexical choice become immediately evident.

Searching all related words also reveals word patterns across a lexicon of related terms.

Section G – Translation Consistency: Word Use, Themes and Imagery

This part of the manual is divided into two segments. The first is a short summary of translation itself. It includes an overview of the essential nature of translation, common problems in translation and an explanation on how to use Document Explorer to avert some of those problems. The second segment contains a discussion of how to use Document Explorer tools to evaluate prior translations.

Translation

Before explaining how to use Document Explorer for translation, it will be helpful to identify the assumption behind the explanation: the essence of translation is to convey, through the text in the second language, the meaning that the author intended to convey through the text in the original language. This is the definition of translation that the manual uses.

Translation is not merely a function of changing all of the words from the original language to corresponding words in the second language. Even when the translator is familiar with both languages there are some factors that make it a difficult task. Here are some examples of the complex nature of translation:

When the second language does not have the cultural aspects to support the symbolism of the original. This occurs in modern translations of the Hebrew Old Testament. Ancient writers such as Isaiah relied heavily on metaphor and symbolic speech. Phrases such as “the ships of Tarshish” and “cedars of Lebanon” held significance for the Hebrew culture in Isaiah’s day, but Modern English lacks the cultural aspects for those symbols to hold the same significance.

When technical terms of the original language do not exist in the second language. Imagine translating a tractor repair manual from English to Arabic. How would the term “nine toothed dog” be rendered in Arabic if it indicates a gear in the transmission of the tractor? To translate the words exactly as they stand in English could result in unwanted confusion.

When a word has various meanings in the original language the translator must decide how the word is being used in order to make an accurate translation. Take run for example – my nose is running, they ran three miles, the river ran dry, his blood ran hot, she ran to the store, they ran the machine all day, Joe ran for a public office, run up the flag, etc.

As these examples illustrate, a word can have many potential meanings. The translator must take into consideration not just the isolated words but their individual contexts as well. Additionally, meaning exists on more levels than just that of the word – idioms and metaphors are composed of several words that convey a holistic meaning which exists above or beyond the meaning of the words themselves. This holistic meaning is what must be translated.


Discovery Before Translation

Document Explorer tools help to understand the relationship between the language of the original text and the meaning that it conveys. Discovering the source document before translation is one way to ensure a more accurate translation.

The greatest authors speak through imagery and not via mere words. If it is difficult to search out all of the occurrences of a particular word when analyzing a document prior to translation, it is far more complicated to discover themes, find imagery and analyze style. Hence the importance of a tool that allows the discovery and comparison of images in a text.

These are some ways in which Document Explorer tools can be used to discover a text prior to translation:

in-context searching – to understand word use, identify themes and imagery

collocations – to identify themes and imagery

style analysis – to note how textual features are organized

run-time concordance – to examine how much variation there is in word selection

Evaluating Prior Translations

Post translation analyses are done by comparing the translated document to the source document, by comparing several independent translations of the same document to each other or by a combination of both.

Document Explorer enables the researcher to evaluate several translations of a single document by comparing them to each other, even when the researcher is not familiar with the source language. Alternately, when fluent in the source language, the researcher can use synchronized windows to compare the source document and the translated documents line by line. Procedures for these comparisons are discussed below.

These are some ways in which Document Explorer tools can be used to evaluate a text after it has been translated:

Perform word counts - by counting how many total words are used, how many unique words and how many times each individual word is used.

Use in-context searching – by checking the consistency in word use, the theme treatment and the imagery.

Use synchronized screens – to view both the original document and translated documents or two-plus independent translations of the original as they are connected/synchronized paragraph by paragraph, chapter by chapter, etc.
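The word-count comparison in the list above can be sketched outside Document Explorer as well. The Python below is a minimal illustration under stated assumptions: the function names and the two sample renderings are hypothetical, and the tokenizer is a simple pattern suited to English text only.

```python
import re
from collections import Counter


def word_profile(text):
    """Total words, unique words and per-word counts for one translation."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    return {"total": len(words), "unique": len(counts), "counts": counts}


def compare_translations(first, second):
    """Side-by-side counts plus the vocabulary found in only one translation."""
    a, b = word_profile(first), word_profile(second)
    return {
        "total_words": (a["total"], b["total"]),
        "unique_words": (a["unique"], b["unique"]),
        "only_in_first": sorted(set(a["counts"]) - set(b["counts"])),
        "only_in_second": sorted(set(b["counts"]) - set(a["counts"])),
    }


# Two hypothetical renderings of the same source sentence.
report = compare_translations("Charity never faileth.", "Love never fails.")
```

The words found in only one translation are a natural starting list for in-context searching, since they mark the places where the translators made different choices.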


Applications:

Use Document Explorer to discover a poetic text and write an exegesis that serves as a template for translation of the source into several different languages. xvii

Evaluate a time series of translations of the Iliad, the Bible or any other work that has been translated repeatedly over time. How are the terms and ideas translated differently? For example, compare the Early Modern English of the King James Bible to the Modern English of the New International Version of the Bible.

Evaluate several independent translations of the same document. Examine the variance in the terms used to represent the same ideas in the source text.

Compare the several translations of terms used in presidential news briefings as they occur in foreign media. xviii

Considering the U.S. Constitution, how has the terminology been translated? How is arms in the second amendment rendered in other languages?

How is Tolstoy translated into one culture as compared to another – American vs. French vs. German?

How do translations of UN Resolutions or WTO Charters into several languages differ?

PROCEDURES: SYNCHRONIZED WINDOWS

Because the Document Explorer tools for much of translation analysis have been covered in previous sections, we will focus here on the synchronization tools and on using them together with word counts, in-context searching, run-time concordances, collocation and pattern analysis.


The Synchronous Layout allows you to select the default layout for the books in a book set. This data is used every time a book set is opened. WordCruncher creates separate panes for each book in the book set in accordance with the pane layout determined here. The Help files will walk the user through the synchronization process.

Synchronized layouts allow users not only to read files side by side while they are synchronized, but also to search the synchronized files.

Example 1. Reading Synchronized Files


Example 2. Searching Synchronized Files



Section H – Textual Analysis Methods for Writers

This section discusses two ways in which the tools explained in previous sections can be used: 1) to analyze one’s own writing in order to attain a greater degree of objectivity in evaluating the quality and impact of one’s own texts, and 2) as additional writing tools for authors.

Objective Evaluation

Evaluating one’s own writing can be very difficult. Aside from aversion to conscious self-criticism, the author often has a difficult time attaining the ear of the audience or the same level of objectivity that a dispassionate reader naturally has. Objectivity not only reveals deficiencies in the content of a text, but also provides an understanding of the general effect that the text might have on a reader. Since it is difficult for an author to perceive the text as would someone who is not involved in the process of constantly reading and reviewing it, Document Explorer tools can assist in finding a more objective position.

Below are some ways that Document Explorer tools can be used to gain a more objective standpoint when reviewing self-compositions.

WordWheel and Word Counts: Examine the WordWheel and its word counts – is the mixture of words what the author desires it to be, e.g., varied and dynamic or uniform and stale? Are thematic words overused or are they underused to the extent that there is not a sufficient reinforcement of the theme? Is there a necessity or opportunity for variation in word choice such as use of negated antonyms for understatement – not bad for good, or not good for bad, etc?

In-context searching: View term locations in the search window – are the terms used correctly in-context? Search on a list of cliché phrases – is it possible to substitute a more novel, appropriate or convincing phrase? Check the contextual use of specific terms to ensure appropriateness for the intended audience. Search the focus words of a speech for impact within the context.

Grade level: Perform a readability test on the document. Check for number and context of connectors, conjunctions and subordinators, e.g., and, but, or, however, hence, accordingly, etc. as the use of these words is a factor in complexity level.

Collocated terms: Check to see if collocations form themes and ideas that are fitting to the audience addressed or the content treated – are jargon phrases compatible with the audience?

Themes and devotion to themes: How often are thematic words used? Do they build toward a climax or do they reinforce a direction for the theme? Are themes overt as in Bush’s Thousand Points of Light speech or are they underlying/symbolic?

Imagery and Symbolism: What is the parallel between symbols from the text and the themes discussed? How are images constructed or what are their fundamental parts?


Style Analysis: What is the style? What do the patterns of textual features say about the author, the expected audience, the time period, etc.?

Additional Writing Aids

Word count analysis may provide the writer with ideas for an appropriate title.

In-context search tools of Document Explorer can be used to organize a thematic index for a manuscript.


PART II. ANALYTICAL METHODS

Introduction

This part of the manual describes the statistical procedures that will enable Document Explorer users to expand their document analysis capabilities. By understanding various procedures for text analysis (some simple and others complex), document exploration may be expanded to include statistically sound comparison and predictive modeling. Hence, a fundamental objective of this section is to explain how Document Explorer users can summarize observations, compare sets of observations and make projections using both deductive and inductive methods.

It is important to note that these statistics can be used to model relationships and estimate predictive models but that they are not meant to establish cause-and-effect relationships. Most cases of cause and effect must be left up to the theorists and qualitative reasoning. Quantitative modeling methods are simply that, “models” of correlation or relationship.

Although the Discovery Methods are simple enough for a broad audience, the Analytical Methods outlined here are complex and may be better suited for students and faculty with at least a rudimentary knowledge of Statistics, though the introduction to each section may be of interest to a broad range of researchers in that it explains the procedures in general terms and gives practical applications for the procedures.

The analysis procedures, in most cases, can be performed by computer software with little difficulty. Still, it is important for any researcher to understand the input data and the nature of the procedure in order to interpret the statistical data output. Matriculation in a course on general statistics, including non-parametric statistical procedures, is recommended.

Words, strings, collocation-counts, frequencies and distributions are the fundamentals for summarizing observations and making statistical comparisons and projections. Document Explorer readily supplies this output; additionally, data from Document Explorer are easily copied and pasted into spreadsheets and statistical programs (see Section D.)

The main sections of this part of the manual are summarized below.

Methods of Summarizing Observations:

Statistics is a way of summarizing observations through graphics and quantitative methods. This section explains much of the terminology covered in general statistics, including types of data, methods of summarizing location and dispersion parameters, creating and describing frequency distributions, measures of central tendency, measures of variability and other important concepts.

Hypothesis Testing:

This section explains the formation of hypotheses. The importance of this section cannot be overestimated. The formation of a correct hypothesis is key to variable selection and selecting the correct test procedure.

Variable Selection:

Text analysis has special types of variables that are outlined in this section. These variables can be combined to create combinations of variables and ratios.

Comparing Observations:

Statistical tests may be used to investigate the differences between means and differences between medians of two sets of observations. These observations may be “independent” or “paired.” This section will explain tests comparing two sets of observations and tests comparing three or more sets of observations.

Association and Trend Analysis: (Correlation and Regression Analysis)

This section examines methods of studying relationships between two different measures. Association (in terms of correlation) and trend analysis (in terms of regression) are methods for determining the relationships between measures.

Time Series

An extension of Association and Trend analysis is the examination of an observation across time. This section examines the special nature of time-related variables and various methods for modeling and forecasting.

Comparing Dispersion: (Goodness of Fit)

Comparing dispersion is a comparison of frequency distributions rather than location parameters. This comparison considers the word frequencies or the distribution of variables across a series of works.

Multivariate Analysis of Variance: (MANOVA)

This section examines methods of comparing multiple variables through classification, discrimination and clustering procedures.

Section A. Methods of Summarizing Observations

“Location” and “dispersion” are terms we use in answering most statistical questions. Measures of location include the mean, the median and the mode. These are methods of computing the central tendency of a distribution. Measures of dispersion describe how much the observations differ. This is called variability. Common measures of variability are the range, standard deviation and variance.
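As a quick illustration of these measures, the sketch below uses Python's standard `statistics` module on a small set of hypothetical values (imagine words-per-sentence counts copied from a spreadsheet). The data are invented for the example, and the population forms of variance and standard deviation are used here as an assumption; `statistics.variance` and `statistics.stdev` give the sample forms instead.

```python
import statistics

# Hypothetical observations, e.g., words per sentence in a short passage.
values = [2, 4, 4, 5, 7, 9, 11]

# Measures of location (central tendency).
mean = statistics.mean(values)      # arithmetic average
median = statistics.median(values)  # middle value of the sorted data
mode = statistics.mode(values)      # most frequent value

# Measures of dispersion (variability).
data_range = max(values) - min(values)
variance = statistics.pvariance(values)  # population variance
std_dev = statistics.pstdev(values)      # population standard deviation
```

For these values the mean is 6, the median is 5 and the mode is 4, which shows how the three measures of location can differ on the same data.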

Descriptive statistics: are used to describe the data and are not employed to draw predictions. Basically, they are methods and procedures used for presenting and summarizing data. Examples: tables, graphs.

Inferential Statistics: are used to make inferences or to make predictions; used to make conclusions about the population (all the objects that have something in common with one another) from the sample (set of objects derived from the population).

Levels of Measurement

Nominal: categorical, identify mutually exclusive categories; cannot be mathematically manipulated. Examples: hair color, SSN

Ordinal: rank-order, represent rank-orders but do not give any information about the differences between adjacent ranks. Example: Order of finish in a horse race

Interval: considers the relative order of the measures involved and also has equal differences between measurements corresponding to equal differences in the amount of the attribute being measured; does not have a true zero. Example: IQ

Ratio: has a true zero point, equal differences between measurements correspond to equal differences in the amount of the attribute being measured. Examples: weight, height, blood pressure level

Types of Data

Continuous: can assume any value within the range of values that defines the limits of that variable. Example: temperature

Discrete: can only assume a limited number of values. Example: values on the face of a die

Creating a Frequency Distribution

What do the observations look like graphically? This is a graphical look at the data and is what allows you to see the data in a physical representation. This means graphing counts and looking for general shapes in the data.

Frequency Histogram: histogram showing the frequency of individual data values on the vertical axis and the data value along the horizontal axis
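A frequency table and a rough text histogram can be produced with a few lines of Python, as a sketch of the idea rather than a substitute for a spreadsheet chart. The function names and the sample values are hypothetical. Note that this text version is turned on its side: the data values run down the left and the bar length shows the frequency.

```python
from collections import Counter


def frequency_table(values):
    """Return (value, frequency) pairs sorted by value."""
    return sorted(Counter(values).items())


def text_histogram(values):
    """Draw one row per data value, with one '#' per occurrence."""
    rows = [f"{value:>3} | {'#' * count}"
            for value, count in frequency_table(values)]
    return "\n".join(rows)


# Hypothetical data, e.g., word lengths in a short phrase.
lengths = [3, 3, 4, 5, 5, 5]
print(text_histogram(lengths))
```

Looking at the shape of the rows is often enough to tell whether the data have one peak, two peaks or a long tail before any formal measure is computed.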


Describing a Frequency Distribution

Normal: will look like a classic bell-shaped curve

Bimodal: will have two modes, so it looks like two humps or two bells next to each other

Skewed: asymmetric about the center. Skewness is a measure of the relative symmetry of the distribution; zero indicates symmetry. Positive values show a long right tail; negative values show a long left tail.

Kurtosis: a measure of relative peakedness, based on the size of the tails of a distribution. If a distribution is unimodal and symmetric, then K = 3 indicates a normal, bell-shaped distribution (mesokurtic); K < 3 indicates a platykurtic distribution (flatter than normal, with shorter tails); and K > 3 indicates a leptokurtic distribution (more peaked than normal, with longer tails). Kurtosis is calculated using this formula:

$$K = \frac{\sum (X - \mu)^4}{N \sigma^4} - 3,$$

where σ is the standard deviation. Because this formula subtracts 3, it gives excess kurtosis: a normal distribution yields a value of 0, and the cut-off of 3 above applies to the ratio before the 3 is subtracted.

[Figure: example mesokurtic, platykurtic, and leptokurtic distributions]
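The shape measures above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the method used by any particular package: σ is taken to be the population standard deviation (dividing by N), skewness is computed as the third standardized moment, and kurtosis follows the formula as printed, including the subtraction of 3, so roughly normal data give a value near 0.

```python
import math


def _mean_and_sigma(xs):
    """Return the mean and the population standard deviation (divide by N)."""
    n = len(xs)
    mu = sum(xs) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in xs) / n)
    return mu, sigma


def skewness(xs):
    """Third standardized moment: 0 for symmetric data, > 0 for a long right tail."""
    mu, sigma = _mean_and_sigma(xs)
    return sum((x - mu) ** 3 for x in xs) / (len(xs) * sigma ** 3)


def kurtosis(xs):
    """Sum of (X - mu)^4 divided by (N * sigma^4), minus 3, as in the printed formula."""
    mu, sigma = _mean_and_sigma(xs)
    return sum((x - mu) ** 4 for x in xs) / (len(xs) * sigma ** 4) - 3


print(skewness([1, 1, 1, 10]))    # positive: long right tail
print(kurtosis([1, 2, 3, 4, 5]))  # negative: flatter than normal
```

Both functions are undefined when every value is identical, because σ is then zero.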

Measures of Central Tendency

mode: data value that occurs most frequently in a sample; not necessarily unique; if there are two modes, the data are called bimodal; the mode is most useful for discrete data with a small range

median: middle score in a distribution; the point that defines the upper and lower 50 percent of the sample; the exact middle of the data set; “if n is odd the median is a member of the data set, while if n is even the median is the average of two adjacent values”; robust measure of central tendency because it is insensitive to outliers and extreme values; most commonly used in nonparametric tests

mean: average score of the distribution; “average of the sample data”; most common measure of central tendency; not robust to outliers and extreme values

Example:

in the distribution of the following seven scores:

5,6,8,9,13,13,16

the mode is 13, the median is 9 and the mean is 10
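The example can be checked with Python's standard `statistics` module. This is only a convenience sketch; the variable names are illustrative.

```python
import statistics

scores = [5, 6, 8, 9, 13, 13, 16]

mode = statistics.mode(scores)      # most frequent value
median = statistics.median(scores)  # middle value of the sorted data
mean = statistics.mean(scores)      # sum of the scores divided by their count

print(mode, median, mean)  # 13 9 10
```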

Measures of Variability

How is the vocabulary dispersed? How is a basket of key words dispersed?

Range: the difference between the maximum and the minimum value in a sample, a measure of dispersion; not robust to outliers and extreme values

Variance: the mean of the squared deviation scores; the sum of the squared deviations about the mean divided by the sample size minus one. The larger the variance the greater the dispersion or spread around the mean; not robust to outliers and extreme values

Standard Deviation: square root of the variance; measure of dispersion about the mean; measured in the same units as the mean

Central Tendency: description of the location of the middle of characteristic values in a distribution (mean, median, midrange, trimmed mean, modal class); position where the data tend to center

Dispersion: general reference to the “spread” of data values around the center of a distribution, including variance, standard deviation and range
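A short sketch of the three dispersion measures, using the same seven scores as the central tendency example. It assumes the sample definitions given above (variance divided by the sample size minus one), which is what `statistics.variance` and `statistics.stdev` compute.

```python
import statistics

scores = [5, 6, 8, 9, 13, 13, 16]

data_range = max(scores) - min(scores)     # maximum minus minimum
variance = statistics.variance(scores)     # squared deviations / (n - 1)
std_deviation = statistics.stdev(scores)   # square root of the variance

print(data_range, round(variance, 2), round(std_deviation, 2))
```

For these scores the squared deviations about the mean of 10 sum to 100, so the variance is 100/6 and the range is 11.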

Various Vocabulary

Parametric tests: require assumptions about the shape of the populations involved

Nonparametric tests: do not require assumptions about the shape of the populations involved (distribution-free tests)

Outlier: any sample observation that is more than 3 standard deviations from the mean. In general, it is an observation that may be from a different population because it differs markedly from the others in the sample.

Robust: the quality of being unaffected by a particular factor; example: the median is robust to outliers

Quartiles: the first quartile (q1) is the point along the x-axis which defines the lower 25 percent of the sample. The second quartile is the median. The third quartile is the point along the x-axis that defines the upper 25 percent of the sample.

Statistic: characteristic of a sample

Parameter: characteristic of a population
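The Outlier and Quartiles definitions above can be sketched as follows. Two assumptions are worth flagging: several conventions exist for interpolating quartiles, and this sketch simply uses the default of Python's `statistics.quantiles`; the outlier check uses the sample standard deviation. The function names are illustrative.

```python
import statistics


def quartiles(data):
    """Return (q1, q2, q3); q2 is the median."""
    q1, q2, q3 = statistics.quantiles(data, n=4)
    return q1, q2, q3


def find_outliers(data, threshold=3.0):
    """Return observations more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(data)
    sd = statistics.stdev(data)
    if sd == 0:
        return []
    return [x for x in data if abs(x - mean) > threshold * sd]


word_counts_per_page = [10] * 19 + [100]  # made-up data for illustration
print(find_outliers(word_counts_per_page))
```

In very small samples no observation can lie 3 standard deviations from the mean, so the rule is only informative for larger samples.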

Parametric vs. Nonparametric Assumptions and Tests

Nonparametric statistical tests are especially appropriate when the sample size is small, the data is not continuous, or you don’t think your data come from a normal distribution. Parametric tests make specific assumptions about one or more of the population parameters that characterize the underlying distribution for which the test is employed. Nonparametric tests make no such assumptions. Typically, parametric tests use interval or ratio data, while nonparametric tests use categorical/nominal and ordinal/rank-order data.

Advantages of Nonparametric Statistics

  • 1. Computations are quick and easy.

  • 2. May be applied when the data are measured on a weak measurement scale (such as nominal or ordinal).

  • 3. Depend only on a minimum of assumptions, which makes them quite general.

  • 4. Outliers have limited influence since the observations are usually replaced by signs or ranks.

  • 5. Can be used with data measured on a qualitative rather than quantitative scale.

  • 6. Valid for small sample sizes (less than 25). There is no minimum sample size required for most methods to be valid and reliable.

  • 7. Easy to use and understand.

  • 8. More widely applicable than parametric methods, since the techniques may be applied to phenomena for which it is impractical or impossible to obtain precise measurements on (at least) an interval scale.

Disadvantages of Nonparametric Procedures

  • 1. The arithmetic in many instances is tedious and laborious.

  • 2. May lose efficiency when converting data to simple signs or ranks.

  • 3. Since the calculations for most nonparametric methods are simple and rapid, these procedures are sometimes used when parametric procedures are more appropriate.

  • 4. Less flexible than linear models and ANOVA.

When to use Nonparametric Procedures

  • 1. The assumptions necessary for the valid use of a parametric procedure are not met.

  • 2. The data have been measured on a scale weaker than that required for the parametric procedure that would otherwise be employed.

  • 3. The hypothesis to be tested does not involve a population parameter.

  • 4. Data with notable outliers (which cannot be eliminated with transformations).

  • 5. Non-normal distribution of the dependent variable.

  • 6. Unequal variances across groups.

Section B. Methods of Hypothesis Testing

Steps in Hypothesis Testing

  • 1. Choose the null and alternative hypotheses.

  • 2. Set alpha level (usually α=0.01 or α=0.05).

  • 3. Choose the appropriate statistical test.

  • 4. Calculate the test statistic.

  • 5. Decide whether or not to reject the null hypothesis.

  • 6. Make a summary statement about the conclusion drawn from the statistical analysis.

Test of Significance

  • 1. Establish basic assumptions of the experiment or survey.

  • 2. Predict what outcome is expected under the assumptions using the sampling distribution.

  • 3. Observe the outcome.

  • 4. Calculate the probability, under the assumptions, of outcomes at least as extreme as our observation, using the sampling distribution.

  • 5. If this probability is large, then the outcome is consistent with assumptions. If the probability is small, then the outcome is inconsistent with assumptions, or there is statistically significant evidence against the assumptions.

null hypothesis: a statement of no change in status; remains with the status quo; the hypothesis you generally want to disprove; denoted by H0

alternative hypothesis: statement of change in the status; predicted change from normal; describing what you want to prove (this helps you decide to do a one- or two-sided test); denoted by H1

alpha: boundary value for credibility of H0 vs. H1; typically a small value, usually 0.05 or smaller; denoted by α

significance level: predetermined boundary value level for determining statistical significance; your accepted risk of an improbable event

test statistic: number computed from the data and used to test H0, assuming it is true

p-value: probability of getting the test statistic value or a more extreme value, assuming H0 is true. If p-value ≤ α, the result is statistically significant and you reject H0; otherwise there is no statistical significance and you do not reject H0.

REMEMBER: if any of the assumptions of the test is seriously violated, the reliability of the computed test statistic may be compromised.
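To make the steps concrete, here is a minimal sketch of one nonparametric test, the two-sided sign test, written in Python. The choice of test is an illustration and is not prescribed by this guide. The null hypothesis is that positive and negative differences are equally likely, the test statistic is the number of positive differences, and the p-value comes from the binomial distribution with probability 0.5. The function names are hypothetical.

```python
from math import comb


def sign_test_p_value(differences):
    """Two-sided exact sign test; zero differences are dropped."""
    nonzero = [d for d in differences if d != 0]
    n = len(nonzero)
    positives = sum(1 for d in nonzero if d > 0)
    tail = min(positives, n - positives)
    # Probability of a count at least this far toward one extreme under H0.
    one_tail = sum(comb(n, i) for i in range(tail + 1)) / 2 ** n
    return min(1.0, 2 * one_tail)


def decide(p_value, alpha=0.05):
    """Apply the decision rule: reject H0 when the p-value is at most alpha."""
    return "reject H0" if p_value <= alpha else "do not reject H0"


# Made-up paired differences, e.g. a word rate in text A minus the rate in text B.
differences = [1.2, 0.4, 0.9, 2.1, 0.3, 1.5, 0.8, 0.6, 1.1, 0.7]
p = sign_test_p_value(differences)
print(p, decide(p))
```

Because the observations are replaced by signs, a single extreme difference has no more influence than any other.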

Section C. Variable Selection

Textual analysis has a wide variety of individual variables and combinations of variables. The following is a selection of variables to consider in text analysis.

Measures of Total Words and Total Unique Words

Total Number of Words (N)

Total Unique Vocabulary (V)

Unique Word Ratio: UWR = V/N

Type-token Ratio: TTR = N_i/V_i

Pace: Pace = V_i/N_i. This statistic represents the rate at which new words are generated by an author.
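A minimal Python sketch of the basic counts follows. The tokenizer is an assumption made for illustration (lowercased runs of letters and apostrophes); Document Explorer/WordCruncher may define word boundaries differently, so counts can differ from the program's own reports.

```python
import re


def tokenize(text):
    """Split text into lowercase word tokens (illustrative rule only)."""
    return re.findall(r"[a-z']+", text.lower())


def word_counts(text):
    """Return (N, V): total number of words and total unique vocabulary."""
    tokens = tokenize(text)
    return len(tokens), len(set(tokens))


def unique_word_ratio(text):
    """UWR = V / N; returns 0.0 for an empty text."""
    n, v = word_counts(text)
    return v / n if n else 0.0


print(word_counts("The cat and the dog"))        # (5, 4)
print(unique_word_ratio("The cat and the dog"))  # 0.8
```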

Entropy

$$H = -\sum_i p_i \log p_i$$

p_i = probability of appearance of the ith word type = (number of occurrences) / (total # of words in the text)

Increasing the internal structure yields decreasing entropy. Increasing disorder (randomness) yields increasing entropy.

Adjusting for length of sample text

$$H = -100 \sum_i \frac{p_i \log p_i}{\log N}$$

(100 is a measure of diversity)
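The entropy measures can be sketched in Python as follows. Two assumptions are made here: the sum runs over all word types with a leading minus sign (the usual definition of entropy), and the natural logarithm is used. The base does not affect the length-adjusted version, since it cancels in the ratio.

```python
import math
from collections import Counter


def entropy(tokens):
    """H = -sum(p_i * log p_i) over word types, using natural logarithms."""
    n = len(tokens)
    return -sum((c / n) * math.log(c / n) for c in Counter(tokens).values())


def adjusted_entropy(tokens):
    """Entropy scaled by 100 / log N to adjust for the length of the text."""
    n = len(tokens)
    if n < 2:
        return 0.0  # log N is zero or undefined for fewer than two tokens
    return 100 * entropy(tokens) / math.log(n)


print(adjusted_entropy("the cat sat on the mat".split()))
```

Under this reading, a text in which every token is a different word scores 100, and a text that repeats a single word scores 0.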

Test for Once-used Words (Hapax Legomena)

$$R = \frac{100 \log N}{1 - V_1/V}$$

(V_1 = number of types which occur just once in a sample of text)

Tests the propensity of an author to choose between the alternatives of employing a word used previously or employing a new word. This test may also measure change over time in vocabulary richness and be helpful when problems of dating are an issue.
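A small Python sketch of the once-used-words statistic is given below. It assumes the natural logarithm (the guide does not state the base) and takes V_1 to be the number of word types that occur exactly once. The function name is illustrative.

```python
import math
from collections import Counter


def hapax_statistic(tokens):
    """R = 100 * log(N) / (1 - V_1 / V) for a list of word tokens."""
    counts = Counter(tokens)
    n = len(tokens)                                   # total words, N
    v = len(counts)                                   # unique vocabulary, V
    v1 = sum(1 for c in counts.values() if c == 1)    # once-used words, V_1
    if v1 == v:
        raise ValueError("R is undefined when every word type occurs only once")
    return 100 * math.log(n) / (1 - v1 / v)


print(hapax_statistic("a a b c".split()))
```

The statistic grows as the share of once-used words grows, and it is undefined for a text in which no word is repeated.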

Yule’s Characteristic

A measure of vocabulary richness based on the assumption that the occurrence of a given word is based on chance and can be regarded as a Poisson distribution

$$K = 10^4 \, \frac{\sum_r r^2 V_r - N}{N^2}$$

Simpson’s Index

Chance that two members of an arbitrarily chosen pair of tokens will belong to the same type.

$$D = 10^4 \, \frac{\sum_r r(r-1) V_r}{N(N-1)}$$

(r = 1, 2, 3, . . .)

(V_r = number of types which occur just r times in a sample of text)
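Both indices depend only on the frequency spectrum V_r, so they can be sketched together in Python. The code follows the formulas as printed in this guide, including the factor of 10^4 on Simpson's Index; the function names are illustrative.

```python
from collections import Counter


def frequency_spectrum(tokens):
    """Map r to V_r, the number of word types that occur exactly r times."""
    return Counter(Counter(tokens).values())


def yules_k(tokens):
    """K = 10^4 * (sum(r^2 * V_r) - N) / N^2."""
    n = len(tokens)
    total = sum(r * r * v_r for r, v_r in frequency_spectrum(tokens).items())
    return 10 ** 4 * (total - n) / n ** 2


def simpsons_d(tokens):
    """D = 10^4 * sum(r * (r - 1) * V_r) / (N * (N - 1)); needs N >= 2."""
    n = len(tokens)
    total = sum(r * (r - 1) * v_r for r, v_r in frequency_spectrum(tokens).items())
    return 10 ** 4 * total / (n * (n - 1))


tokens = "a a b c".split()
print(yules_k(tokens), simpsons_d(tokens))
```

For the four-token example the spectrum is V_1 = 2 and V_2 = 1, giving K = 1250 and D of about 1667.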

Readability or Grade Level

Word Length

Sentence Length

Number of Nouns

Number of Punctuation Marks

Grade Level (Readability)

Word length, sentence length, number of nouns, and number of punctuation marks are variables that combine to give an estimate of the grade level of the text. Grade levels can then be compared.

GL* = 0.39 × (average no. of words/sentence) + (average no. of vowels/word) - 15.59

* Flesch-Kincaid Readability Formula adjusted for foreign language use.
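The grade level formula can be sketched in Python as shown below. The sketch applies the formula exactly as printed above. How sentences, words, and vowels are identified is an assumption made for illustration: sentences end at ".", "!" or "?", words are runs of letters and apostrophes, and vowels are the letters a, e, i, o, u.

```python
import re


def grade_level(text):
    """GL = 0.39 * (words per sentence) + (vowels per word) - 15.59."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    vowels = sum(1 for word in words for ch in word.lower() if ch in "aeiou")
    words_per_sentence = len(words) / len(sentences)
    vowels_per_word = vowels / len(words)
    return 0.39 * words_per_sentence + vowels_per_word - 15.59


print(grade_level("The cat sat. The dog ran."))
```

Very short, simple texts give negative values, so the result is best used to compare texts rather than read as an absolute grade.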

Word Groupings (Comparing Sets of Key Words)

Frequent Non-Contextual Words (e.g. the, and, of, that, to, in, a)

Infrequent Non-Contextual Words (e.g. again, after, among, according, wherefore)

Non-contextual Word Ratio = (Top occurring non-contextual words) / (Least occurring non-contextual words)

Rank words by occurrence. Select the top ranking non-contextual words and the bottom ranking non-contextual words.

Preferred Words

Non-Preferred Words

Preferred Words Ratio = (Preferred Words - Non-preferred Words) / (Preferred Words + Non-preferred Words)

Rank words by occurrence. Select author preferred words and author non-preferred words.
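As a minimal sketch, the Preferred Words Ratio can be computed in Python from a token list and two word sets chosen by the analyst. The word lists below are made up for illustration, and the counts are taken to be total occurrences in the text.

```python
def preferred_words_ratio(tokens, preferred, non_preferred):
    """(P - NP) / (P + NP), where P and NP are occurrence counts in the text."""
    p = sum(1 for t in tokens if t in preferred)
    np_count = sum(1 for t in tokens if t in non_preferred)
    if p + np_count == 0:
        raise ValueError("none of the selected words occur in the text")
    return (p - np_count) / (p + np_count)


tokens = "upon while upon on whilst upon".split()
ratio = preferred_words_ratio(tokens, {"upon", "while"}, {"on", "whilst"})
print(ratio)
```

The ratio ranges from -1 (only non-preferred words occur) to +1 (only preferred words occur), which makes texts of different lengths comparable.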

Rare Words (R)

Most Common Words (C)

Rare/Common Words Ratio

Rare Words Slope 1 = R / C = (Rare Words) / (Most Common Words)

By plotting the results on a graph, a measure of slope (-/+) for the entire text is given. Slopes can then be compared.

New Words (N)

Rare/New Words Ratio

Rare Words Slope 2 = R / N = (Rare Words) / (New Words)

Function Words

Verb Words (e.g. run, walk ...)

Concept Words (e.g. faith, freedom, abuse)

Feminine Endings

Open Lines

Contractions (e.g. I’m, you’re, we’ve, I’ve, you’ve)

I do variants (e.g. I do not, I do, I do + verb)

Metric Fillers (e.g. if that, the which, when that, since that)

Adversions (e.g. look, look you, you see, do you see, mark my words, hear me, listen)

-th (e.g. fifth, heareth, sayeth, thinketh)

Prefixes (e.g. where-, there-, un-, fore-, dis-)

Suffixes (e.g. -less, -able, -ful, -ish, -ible, -ment, -like)

Positive Intensifiers (e.g. most, many, very, much, more)

Negative Intensifiers (e.g. none, no one, nothing, few)

Frequency of i-syllable words

Large words

Grammatical Discriminators

Parts of speech (i.e. nouns, verbs, adjectives, adverbs, pronouns, noun/verb ratio, verb/adjective ratio, prepositions, conjunctions, articles)

Sentence Constructs (, ; : .)

Verb Plots: measures the word count between verbs (e.g. sat . . . (5) . . . go . . . (3) . . . run . . . (4) . . . be)

Comparison Against a Pool

Distinctiveness Ratio

D = (freq of word from author1) / (freq of word from all other authors)
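The Distinctiveness Ratio can be sketched in Python as follows. One assumption is made: "freq" is read as relative frequency (occurrences divided by total words), so that an author's text and a pool of different sizes can be compared. The function names and sample data are illustrative.

```python
def relative_frequency(word, tokens):
    """Occurrences of `word` divided by the total number of tokens."""
    return tokens.count(word) / len(tokens)


def distinctiveness_ratio(word, author_tokens, pool_tokens):
    """Frequency of the word for author1 divided by its frequency in the pool."""
    pool_frequency = relative_frequency(word, pool_tokens)
    if pool_frequency == 0:
        raise ValueError("the word does not occur in the comparison pool")
    return relative_frequency(word, author_tokens) / pool_frequency


author = ["upon"] * 2 + ["other"] * 8   # made-up text by author1
pool = ["upon"] * 1 + ["other"] * 9     # made-up text from all other authors
print(distinctiveness_ratio("upon", author, pool))
```

Values above 1 indicate a word the author uses more often than the pool; values below 1 indicate a word the author tends to avoid.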

Strings and Collocated Word Variables

Word Patterns

Collocated Words

Word Parts and Rhyming Words

Word Parts (Suffixes, Infixes, Prefixes, Word Roots)

Rhyming Words (*ing, *ly, *tion)

Section D. Statistical Analysis Using Microsoft Excel

Document Explorer/WordCruncher exports data into formats that can be copied and pasted into most statistical analysis programs including SAS, SPSS and Minitab. The statistical functions provided in Microsoft Excel are also available, although there has been a great deal of criticism about the reliability of several Excel procedures (especially its Random Number Generator); still, as an educational tool Microsoft Excel has merit and is widely available.

The following is a list of Excel’s statistical functions.

Statistical worksheet functions perform statistical analysis on ranges of data. For example, a statistical worksheet function can provide statistical information about a straight line plotted through a group of values, such as the slope of the line and the y- intercept, or about the actual points that make up the straight line.

Microsoft Excel’s Statistical Functions

AVEDEV        GAMMALN       PERMUT
AVERAGE       GEOMEAN       POISSON
AVERAGEA      GROWTH        PROB
BETADIST      HARMEAN       QUARTILE
BETAINV       HYPGEOMDIST   RANK
BINOMDIST     INTERCEPT     RSQ
CHIDIST       KURT          SKEW
CHIINV        LARGE         SLOPE
CHITEST       LINEST        SMALL
CONFIDENCE    LOGEST        STANDARDIZE
CORREL        LOGINV        STDEV
COUNT         LOGNORMDIST   STDEVA
COUNTA        MAX           STDEVP
COVAR         MAXA          STDEVPA
CRITBINOM     MEDIAN        STEYX
DEVSQ         MIN           TDIST
EXPONDIST     MINA          TINV
FDIST         MODE          TREND
FINV          NEGBINOMDIST  TRIMMEAN
FISHER        NORMDIST      TTEST
FISHERINV     NORMINV       VAR
FORECAST      NORMSDIST     VARA
FREQUENCY     NORMSINV      VARP
FTEST         PEARSON       VARPA
GAMMADIST     PERCENTILE    WEIBULL
GAMMAINV      PERCENTRANK   ZTEST

In addition to these statistical functions, there are the following graphical representations for time series analysis:

LINEAR TRENDLINE       LOGARITHMIC TRENDLINE    POLYNOMIAL TRENDLINE
POWER TRENDLINE        EXPONENTIAL TRENDLINE    MOVING AVERAGE TRENDLINE

Microsoft Excel provides a set of data analysis tools — called the Analysis ToolPak — a step-saver when developing complex statistical or engineering analyses. You provide the data and parameters for each analysis; the tool uses the appropriate statistical or