
APLNG 578

Shizheng Zhang

Assignment 3

Brief description of each corpus:


Corpus genre     Source of texts                   Number of texts   Number of words
Literary text    DALN: http://daln.osu.edu/        5                 5401
News article     http://www.chinadailyasia.com/    5                 2895

I put the two corpora (LiteraryText.txt and NewsArticle.txt) into a new folder called hw3 under the programs folder.
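For reference, a minimal sketch of that setup, assuming the two .txt files start in the current directory:

mkdir -p ~/corpus/programs/hw3
cp LiteraryText.txt NewsArticle.txt ~/corpus/programs/hw3/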
(1) Lexical Analysis: Commands
1. Command to POS tag the corpora:
cd ~/corpus/programs
cd stanford-postagger-full-2014-01-04
sh stanford-postagger.sh models/english-bidirectional-distsim.tagger \
    ~/corpus/programs/hw3/LiteraryText.txt > ~/corpus/programs/hw3/LiteraryText.tag

sh stanford-postagger.sh models/english-bidirectional-distsim.tagger \
    ~/corpus/programs/hw3/NewsArticle.txt > ~/corpus/programs/hw3/NewsArticle.tag
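As an illustration (a made-up sentence, not a line from my corpora), the tagger attaches a Penn Treebank tag to each token with an underscore, so "She said nothing." comes out as:

She_PRP said_VBD nothing_NN ._.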

2. Command to lemmatize the corpora:


./morpha -t < ~/corpus/programs/hw3/LiteraryText.tag > ~/corpus/programs/hw3/LiteraryText.lem

./morpha -t < ~/corpus/programs/hw3/NewsArticle.tag > ~/corpus/programs/hw3/NewsArticle.lem
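With the -t option, morpha keeps each token's POS tag while replacing the inflected form with its lemma; for example (a made-up token, not a line from my corpora), said_VBD comes out as say_VBD.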

3. Commands to analyze the lexical complexity of the files in the hw3 directory:


python lc-anc.py ~/corpus/programs/hw3/LiteraryText.lem > ~/corpus/programs/hw3/LiteraryText.lex
python lc-anc.py ~/corpus/programs/hw3/NewsArticle.lem > ~/corpus/programs/hw3/NewsArticle.lex

Here's the output:


filename, sentences, wordtypes, swordtypes, lextypes, slextypes, wordtokens, swordtokens, lextokens, slextokens, ld, ls1, ls2, vs1, vs2, cvs1, ndw, ndwz, ndwerz, ndwesz, ttr, msttr, cttr, rttr, logttr, uber, lv, vv1, svv1, cvv1, vv2, nv, adjv, advv, modv
LiteraryText.lem, 279, 1139, 465, 990, 434, 4861, 647, 2345, 594, 0.48, 0.25, 0.41, 0.13, 11.16, 2.36, 1139, 36, 41.10, 39.10, 0.23, 0.79, 11.55, 16.34, 0.83, 21.57, 0.42, 0.38, 100.43, 7.09, 0.11, 0.48, 0.08, 0.04, 0.12

filename, sentences, wordtypes, swordtypes, lextypes, slextypes, wordtokens, swordtokens, lextokens, slextokens, ld, ls1, ls2, vs1, vs2, cvs1, ndw, ndwz, ndwerz, ndwesz, ttr, msttr, cttr, rttr, logttr, uber, lv, vv1, svv1, cvv1, vv2, nv, adjv, advv, modv
NewsArticle.lem, 102, 893, 347, 770, 335, 2645, 605, 1526, 575, 0.58, 0.38, 0.39, 0.19, 10.07, 2.24, 893, 40, 42.40, 40.40, 0.34, 0.82, 12.28, 17.36, 0.86, 24.84, 0.50, 0.62, 106.04, 7.28, 0.11, 0.46, 0.12, 0.04, 0.15

4. Discussion: According to the data above (which I compared in Excel), the sentences, wordtokens, and lextokens counts for the Literary Text corpus are almost twice those of the News Article corpus. I suspect this is largely because the former contains almost twice as many words as the latter. Interestingly, the swordtokens and slextokens counts are nearly the same for the two corpora. Given that the News Article corpus has fewer words than the Literary Text corpus but a similar number of sophisticated tokens, I suppose the News Article corpus uses proportionally more sophisticated words than the Literary Text corpus.
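To make measures like ttr and ld concrete, here is a minimal Python sketch of how they can be computed; this is only an illustration under my assumptions about the .lem format (one "lemma_TAG" item per token), not the actual lc-anc.py logic:

# ttr_ld_sketch.py -- a minimal illustration of two of the measures above,
# not the actual lc-anc.py implementation. Assumes each token in the .lem
# file looks like "lemma_TAG" (e.g., "say_VBD") and that lexical (content)
# words are those tagged as nouns, verbs, adjectives, or adverbs.
import sys

LEXICAL_PREFIXES = ("NN", "VB", "JJ", "RB")  # assumption about content-word tags

tokens = []
with open(sys.argv[1]) as f:
    for line in f:
        for item in line.split():
            if "_" not in item:
                continue
            lemma, tag = item.rsplit("_", 1)
            if tag.isalpha():  # skip punctuation tags like "." or ","
                tokens.append((lemma.lower(), tag))

word_tokens = len(tokens)
word_types = len({lemma for lemma, _ in tokens})
lex_tokens = sum(1 for _, tag in tokens if tag.startswith(LEXICAL_PREFIXES))

print("ttr =", round(word_types / word_tokens, 2))  # type-token ratio
print("ld  =", round(lex_tokens / word_tokens, 2))  # lexical density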

(2) Syntax Analysis: Commands


1. Command to parse the corpora:
sh stanford-parser-directory.sh ~/corpus/programs/hw3/

2. The syntactic structure I chose to examine is imperative sentences; its Tregex pattern is "S > ROOT <<# VB".
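To illustrate the pattern (with a made-up sentence, not one from my corpora): "S > ROOT" matches an S node that is a direct child of ROOT, and "<<# VB" further requires that the S is headed, through its chain of head daughters, by a base-form verb, which is characteristic of imperatives. An imperative like "Close the door." would typically be parsed as:

(ROOT
  (S
    (VP (VB Close)
      (NP (DT the) (NN door)))
    (. .)))

Here the S is a child of ROOT and its head word "Close" is tagged VB, so the pattern matches.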

Here are the commands to retrieve the results, each followed by the match count it printed:


./tregex.sh 'S > ROOT <<# VB' ~/corpus/programs/hw3/LiteraryText.parsed -C -o
9
./tregex.sh 'S > ROOT <<# VB' ~/corpus/programs/hw3/NewsArticle.parsed -C -o
0

3. Discussion: The results show 9 imperative sentences for the Literary Text corpus and 0 for the News Article corpus. Finding no imperative sentences in the news articles could mean something is wrong with my pattern, but it also makes sense: in my experience, news articles rarely contain many imperative sentences. The literary texts, on the other hand, are written narratives about people's experiences learning a language, so they involve a lot of personal ideas and are not very academic or formal; in that light, 9 matches seems plausible to me. Another important reason why both counts are so small is that the corpora themselves are very small, with only 5 texts each.

4. Command to use L2SCA to analyze the syntactic complexity:


python analyzeText.py ~/corpus/programs/hw3/LiteraryText.txt ~/corpus/programs/hw3/LiteraryText.sc
python analyzeText.py ~/corpus/programs/hw3/NewsArticle.txt ~/corpus/programs/hw3/NewsArticle.sc

The output for the Literary Text:


Filename,W,S,VP,C,T,DC,CT,CP,CN,MLS,MLT,MLC,C/S,VP/T,C/T,DC/C,DC/T,T/S,CT/T,CP/T,CP/C,CN/T,CN/C
LiteraryText.txt,4839,279,760,622,319,252,167,94,456,17.3441,15.1693,7.7797,2.2294,2.3824,1.9498,0.4051,0.7900,1.1434,0.5235,0.2947,0.1511,1.4295,0.7331

The output for the News Article:


Filename,W,S,VP,C,T,DC,CT,CP,CN,MLS,MLT,MLC,C/S,VP/T,C/T,DC/C,DC/T,T/S,CT/T,CP/T,CP/C,CN/T,CN/C
NewsArticle.txt,2623,102,289,222,111,88,58,79,362,25.7157,23.6306,11.8153,2.1765,2.6036,2.0000,0.3964,0.7928,1.0882,0.5225,0.7117,0.3559,3.2613,1.6306

5. Discussion: According to the two outputs, the raw counts of words, sentences, verb phrases, clauses, T-units, dependent clauses, complex T-units, coordinate phrases, and complex nominals for the Literary Text corpus are roughly twice those of the News Article corpus. Taken at face value, this might suggest that the language used in the Literary Text corpus is more sophisticated than that in the News Article corpus. However, I think the difference is mainly caused by the large gap in corpus size: the Literary Text corpus has 5401 words while the News Article corpus has only 2895, barely more than half as many. So I do not think this comparison of raw counts is sound, and more data would be needed. It also seemed odd to me that news articles would use less sophisticated language than literary texts; indeed, the length-normalized measures such as MLS, MLT, and MLC are actually higher for the News Article corpus. I think the two corpora I chose may not be very comparable, which leads to this result.
