
Using Language Examples in an Introductory SAS Programming Class
USCOTS, Ohio State University, Saturday, June 27th, 2009

Roger Bilisoly, PhD
Department of Mathematical Sciences, Central Connecticut State University

Why analyze language in a SAS class?


There are several excellent sources of free texts on the Web. For example:
Project Gutenberg at http://www.gutenberg.org/wiki/Main_Page
Google Books at http://books.google.com/
VIRGObeta at http://virgobeta.lib.virginia.edu/

There are several sources of free word lists on the Web. For example:
Moby word lists for English, German, Spanish, French, Italian, and Japanese at Gutenberg.org.
The American Cryptogram Association has lists for many additional languages. See http://cryptogram.org/cdb/words/words.html.
The National Puzzlers' League has many types of word lists for English. See http://www.puzzlers.org/dokuwiki/doku.php?id=solving:wordlists:about:start.

Adding variety to the types of data used could broaden the appeal of a statistics class.

Many examples of statistical analyses of text have already been developed by linguists and computer scientists. Corpus linguists use computers to analyze text samples designed to be representative of a certain aspect of a language. For example, the million-word Brown Corpus was created to be representative of American English in 1961.

Homework Problem: Find the Proportion of Each Letter of the Alphabet in Dickens's A Christmas Carol.
Are there any letter frequency anomalies? For example, does the letter J appear more often than average due to the name Jacob Marley?
This novel was originally published in 1843. How do its letter frequencies compare to American English in 1961, i.e., to the Brown Corpus?
How do its letter frequencies compare to German frequencies, e.g., to Goethe's Die Leiden des jungen Werther?
Complications: Other languages using the Latin alphabet often employ diacritical marks (e.g., German has umlauts) and sometimes add new letters (e.g., German has ß, the Eszett, which stands for a double s). Hence alphabets are more complex than one might first suppose.

This SAS Code Introduces both Character Data and Frequency Tables.
data carol;
  infile "C:\A_Christmas_Carol.txt";
  input char $1. @@;           /* read characters one at a time */
  lowchar = lowcase(char);
run;

data letters_carol;
  set carol;
  if anyalpha(lowchar) > 0;    /* keep alphabetic characters only */
run;

proc freq data=letters_carol order=freq;
  tables lowchar / out=carolfreq;
run;

The above code can be introduced early in a programming class, and the ability to read in external files is important for applications. The INPUT statement reads characters one at a time. SAS v9 has many character functions (LOWCASE and ANYALPHA are used here).

Letter Frequencies for A Christmas Carol with some comparisons.


The FREQ Procedure

                                  Cumulative   Cumulative
lowchar   Frequency   Percent     Frequency      Percent
e           14869      12.27        14869         12.27
t           10890       8.99        25759         21.26
o            9696       8.00        35455         29.26
a            9315       7.69        44770         36.95
h            8378       6.91        53148         43.86
i            8309       6.86        61457         50.72
n            7962       6.57        69419         57.29
s            7916       6.53        77335         63.82
r            7038       5.81        84373         69.63
d            5676       4.68        90049         74.31
l            4555       3.76        94604         78.07
u            3335       2.75        97939         80.82
w            3096       2.55       101035         83.38
c            3036       2.51       104071         85.88
g            2980       2.46       107051         88.34
m            2841       2.34       109892         90.68
f            2438       2.01       112330         92.70
y            2299       1.90       114629         94.59
p            2122       1.75       116751         96.35
b            1943       1.60       118694         97.95
k            1031       0.85       119725         98.80
v            1029       0.85       120754         99.65
x             131       0.11       120885         99.76
j             113       0.09       120998         99.85
q              97       0.08       121095         99.93
z              84       0.07       121179        100.00
ö               1       0.00       121180        100.00

Top 12 letters in frequency order for several sources:

Christmas Carol                   ETOAHI NSRDLU
Brown Corpus                      ETAOIN SRHLDU
Die Leiden des jungen Werthers    ENIRSH TADULC
Rule of Thumb                     ETAOIN SHRDLU

Proportion of the letter j: Dickens 0.0009, Brown 0.0020

The single ö comes from the word Laocoön, a figure from Greek mythology.
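One way the comparison between two texts might be set up is to merge the PROC FREQ output datasets by letter. This is only a sketch: carolfreq is the dataset created by the code above, while wertherfreq is a hypothetical dataset assumed to have been built the same way from the German text.

```sas
/* Sketch: wertherfreq is hypothetical; it is assumed to come from running */
/* the same DATA and PROC FREQ steps on the German text.                   */
proc sort data=carolfreq
          out=c(keep=lowchar percent rename=(percent=pct_carol));
  by lowchar;
run;

proc sort data=wertherfreq
          out=w(keep=lowchar percent rename=(percent=pct_werther));
  by lowchar;
run;

data compare;
  merge c w;                       /* one row per letter found in either text */
  by lowchar;
  diff = pct_carol - pct_werther;  /* missing if a letter occurs in only one text */
run;

proc print data=compare;
run;
```

Letters that occur in only one text (for example, ß) appear with a missing percentage in the other column, which makes the alphabet differences visible.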

Homework Problem: Find Initial Consonant Clusters.


How do languages differ in their use of consonants? As noted earlier, diacritical marks and additional letters make this complicated. In addition, the same sound can be represented in quite different ways in different languages. The sounds of a language are also restricted in where they can occur: these restrictions are called phonotactic constraints in linguistics.
For example, English has a ts sound (as in cats), but it doesn't appear at the beginning of words, except for loanwords like tsar (from the Russian царь, where ts = ц). German does have an initial ts sound, but it's represented with the letter z (as in Zimmer). However, ts can also appear where t ends a syllable and s starts the next syllable, as in pantsuit. In this case the sound is not the ts appearing in cats or tsar.

Studying initial consonant clusters restricts attention to one syllable, so syllable boundaries are not a problem. Let's compare English and German.

Initial Consonant Clusters: English vs. German


English

Obs   start   COUNT   PERCENT
  1   c        7333   8.01202
  2   r        7012   7.66129
  3   m        6261   6.84075
  4   d        5831   6.37094
  5   s        5540   6.05299
  6   p        5398   5.89784
  7   b        5265   5.75253
  8   h        4079   4.45671
  9   t        3713   4.05682
 10   l        3706   4.04917
 11   f        3426   3.74324
 12   g        2473   2.70199
 13   w        2284   2.49549
 14   n        2206   2.41027
 15   pr       2150   2.34908
 16   v        1924   2.10216
 17   st       1364   1.49030
 18   ch       1330   1.45315
 19   tr       1311   1.43240
 20   j        1104   1.20623
 21   k        1017   1.11117
 22   sh        987   1.07839

German

Obs   start   COUNT   PERCENT
  1   v       11356   9.43581
  2   g        9310   7.73577
  3   b        9282   7.71251
  4   w        8208   6.82011
  5   h        6851   5.69256
  6   z        6444   5.35438
  7   k        6214   5.16327
  8   s        4849   4.02908
  9   m        4847   4.02742
 10   f        4836   4.01828
 11   r        4035   3.35272
 12   d        3978   3.30536
 13   l        3501   2.90902
 14   t        3456   2.87162
 15   st       3144   2.61238
 16   n        2977   2.47362
 17   sch      2669   2.21770
 18   p        2521   2.09472
 19   tr       1724   1.43249
 20   pr       1681   1.39676
 21   sp       1348   1.12007
 22   fr       1296   1.07686
 23   gr       1258   1.04528

First, note that English and German phonology (the sounds the letters make) differ. For example, a German v is pronounced like the English f. Second, these two languages have different constraints on initial letters. For example, almost no words in German start with c, while z, which is pronounced like ts, is a common starting letter in German (it ranks 6th above). Third, the frequencies of initial letters do not match the overall letter frequencies found earlier.
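The slides do not show the code behind these tables, so the following is only a sketch of one way such counts might be produced. The file name is hypothetical, and treating y as a consonant is an assumption. The VERIFY function returns the position of the first character that is not in the given list, which here marks the end of the initial consonant run.

```sas
/* Sketch only: the file name and the consonant list are assumptions. */
data clusters;
  length word $30 start $30;
  infile "C:\wordlist.txt";        /* hypothetical word list, one word per line */
  input word;
  word = lowcase(word);
  /* position of the first character that is NOT a listed consonant */
  k = verify(word, 'bcdfghjklmnpqrstvwxyz');
  if k > 1;                        /* keep words that begin with a consonant */
  start = substr(word, 1, k - 1);  /* the initial consonant run */
run;

proc freq data=clusters order=freq noprint;
  tables start / out=clusterfreq;  /* variables: start, COUNT, PERCENT */
run;

proc print data=clusterfreq;
run;
```

Because the trailing blanks that pad word are not in the consonant list, a short word made up entirely of consonants is still handled: VERIFY stops at the first blank.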

Analyzing Word Games and Language


Many language games require finding words given specific letter constraints. Crossword puzzles and hangman are two examples. In linguistics, morphology, the study of the structure of words, can be analyzed in similar ways. Words are broken into morphemes, which are the smallest units of a word that have meaning.
For example, in English, many adverbs are formed by adding the morpheme -ly to an existing word. Compare "Scoot is quick" and "Scoot runs quickly": quick is an adjective in the former sentence, and quickly is an adverb in the latter. The rule is not absolute: in "Run here, Scoot, and be quick about it," quick is used as an adverb. However, a rule with exceptions can still be useful.
This adverb example is from Section 6.4.3 of Practical Text Mining with Perl (Bilisoly, 2008).

Can you solve the following word puzzles?


1. Find all the words that fit the following crossword puzzle pattern: ___b__u
2. Find all the words that fit the following hangman pattern: _e____s, where t, a, o, i, n don't appear.
3. How useful is the idea that most adverbs in English can be formed by adding -ly to an existing word? Unfortunately, there are many complications:
Happy becomes happily (y changes to i).
Seasonable becomes seasonably (e is dropped).
Automatic becomes automatically (-al- is added).
Hill becomes hilly (only y is added).
And there are words ending in -ly that are not adverbs: anomaly, apply, fly, etc.

Here are the SAS solutions to the crossword and hangman problems.
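data one;
  length word $30;
  infile "C:\crosswd.txt";
  input word;
  len = length(word);
run;

/* Crossword pattern ___b__u */
data two;
  set one;
  if len = 7;
  if substr(word,4,1) = 'b';
  if substr(word,7,1) = 'u';
run;

proc print data=two;
run;

/* Hangman pattern _e____s with no t, a, o, i, n */
data three;
  set one;
  if len = 7;
  if findc(word,'taoin') = 0;
  /* first and last e both at position 2, so e occurs only there */
  if findc(word,'e') = 2 and findc(word,'e',-30) = 2;
  /* first and last s both at position 7, so s occurs only there */
  if findc(word,'s') = 7 and findc(word,'s',-30) = 7;
run;

proc print data=three;
run;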

SAS output (crossword pattern):

Obs   word      len
  1   jambeau     7

SAS output (hangman pattern):

Obs   word      len
  1   bedbugs     7
  2   bedrugs     7
  3   bedumbs     7
  4   begulfs     7
  5   ferrums     7
  6   peplums     7
  7   rebuffs     7
  8   redbuds     7
  9   redbugs     7
 10   regulus     7
 11   vellums     7
 12   zephyrs     7

Word Inflections
A complete analysis of adverbs would be quite complicated. However, the exceptions noted earlier (happily, etc.) were easy to find by reading in a word list and then checking each word that ends in -ly to see if it is still a word after removing -ly.
There is a methodology called regular expressions that finds general text patterns. This is implemented in version 9 of SAS using functions such as PRXPARSE and PRXMATCH.
English is not very inflected, but this varies from language to language. For example, English is less inflected than German, and Finnish is heavily inflected.
Moreover, there are many other word structures (morphemes) to analyze: plurals, verb conjugations, compound nouns, etc.
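The -ly check described above might be sketched as follows. This is an illustration rather than the code actually used for the talk: it assumes the dataset one (with the variable word) created by the crossword code, and the dataset names ly_words and ly_exceptions are made up for this example. PRXMATCH is given the pattern directly as a string.

```sas
/* Sketch: assumes dataset ONE (variable WORD) from the crossword code. */
data ly_words;
  set one;
  /* \s*$ allows for the trailing blanks that pad a SAS character variable */
  if prxmatch('/\wly\s*$/', word) > 0;
  stem = substr(word, 1, length(word) - 2);   /* the word with ly removed */
run;

proc sql;
  /* words ending in ly whose stem is not itself in the word list */
  create table ly_exceptions as
    select word, stem
    from ly_words
    where stem not in (select word from one);
quit;

proc print data=ly_exceptions;
run;
```

With a typical English word list, words such as happily, hilly, and fly would be expected to land in ly_exceptions because happi, hil, and f are not words.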

Current Status
I used language examples in CCSU's STAT 456 (Fundamentals of SAS) in Spring 2009 for the first time. Initial feedback is mixed. The language examples were difficult for non-native speakers of English. Would this be helpful in an introductory class? I plan to ask my future classes about their interest in word games to judge whether this is worth pursuing at the introductory level.