Shizheng Zhang
Assignment 3
                 Source of texts                   Number of texts   Number of words
Literary text    DALN: http://daln.osu.edu/        5                 5401
News article     http://www.chinadailyasia.com/    5                 2895
(I put the two corpora, LiteraryText.txt and NewsArticle.txt, into a new folder called hw3 under
the folder programs.)
(1) Lexical Analysis: Command
1. Command to POS tag the corpora:
cd ~/corpus/programs
cd stanford-postagger-full-2014-01-04
sh stanford-postagger.sh models/english-bidirectional-distsim.tagger \
  ~/corpus/programs/hw3/LiteraryText.txt > ~/corpus/programs/hw3/LiteraryText.tag
sh stanford-postagger.sh models/english-bidirectional-distsim.tagger \
  ~/corpus/programs/hw3/NewsArticle.txt > ~/corpus/programs/hw3/NewsArticle.tag
LiteraryText.lem, 279, 1139, 465, 990, 434, 4861, 647, 2345, 594, 0.48, 0.25, 0.41, 0.13,
11.16, 2.36, 1139, 36, 41.10, 39.10, 0.23, 0.79, 11.55, 16.34, 0.83, 21.57, 0.42, 0.38, 100.43,
7.09, 0.11, 0.48, 0.08, 0.04, 0.12
4.Discussion: According to the data in the Excel sheet, the numbers of sentences, wordtokens and
lextokens for Literary Text are almost twice those for News Article. I guess this is largely
because the former has almost twice as many words as the latter. Interestingly, the swordtokens
and slextokens counts are almost the same for the two corpora. Since News Article has fewer
words than Literary Text but a similar number of these items, I suppose News Article has a
higher proportion of sophisticated words than Literary Text.
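Because raw counts grow with corpus size, proportions are easier to compare than totals. The sketch below recomputes three proportions from the LiteraryText.lem row above. It assumes the values in that row follow the column order sentences, wordtypes, swordtypes, lextypes, slextypes, wordtokens, swordtokens, lextokens, slextokens; that mapping, the helper name ratio, and the variable names ld, ls1 and sword_share are assumptions made for illustration, not something the output itself states.

```sh
# Counts copied from the LiteraryText.lem row (column mapping is an assumption).
wordtokens=4861
swordtokens=647
lextokens=2345
slextokens=594

# ratio NUM DEN: print NUM/DEN rounded to two decimals (hypothetical helper).
ratio() {
  awk -v n="$1" -v d="$2" 'BEGIN { printf "%.2f\n", n / d }'
}

ld=$(ratio "$lextokens" "$wordtokens")             # lexical tokens / word tokens
ls1=$(ratio "$slextokens" "$lextokens")            # sophisticated lexical tokens / lexical tokens
sword_share=$(ratio "$swordtokens" "$wordtokens")  # sophisticated word tokens / word tokens

echo "ld=$ld ls1=$ls1 sword_share=$sword_share"
```

The first two results agree with the 0.48 and 0.25 that appear in the same row, which suggests the assumed column mapping is at least plausible. Running the same calculation on the NewsArticle row would allow a like-for-like comparison.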
2. The syntactic structure I chose to define is the imperative sentence, and its Tregex pattern is:
S > ROOT <<# VB
3.Discussion: The results show 9 imperative sentences in the Literary Text and 0 in the News
Article. Finding no imperative sentences in the news articles made me wonder whether there was
something wrong with my pattern, but the result also makes sense, because in my experience news
articles usually do not contain many imperative sentences. The literary texts, however, are
people's written narratives about their experiences of learning a language, so they involve a
lot of personal ideas and are not very academic or serious; 9 results for them makes sense to
me. Another important reason why both numbers are so small is that the number of texts is very
small, only 5 texts for each corpus.
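Since the two corpora differ in size, one rough way to compare the imperative counts is to express them per 1,000 words. This is a minimal sketch using the match counts reported above (9 and 0) and the word counts from the table (5401 and 2895); the helper name per_thousand is hypothetical.

```sh
# per_thousand COUNT WORDS: matches per 1,000 words, two decimals (hypothetical helper).
per_thousand() {
  awk -v c="$1" -v w="$2" 'BEGIN { printf "%.2f\n", 1000 * c / w }'
}

literary_rate=$(per_thousand 9 5401)   # 9 imperatives in 5401 words
news_rate=$(per_thousand 0 2895)       # 0 imperatives in 2895 words

echo "LiteraryText: $literary_rate per 1,000 words"
echo "NewsArticle:  $news_rate per 1,000 words"
```

Even after normalization the Literary Text rate stays below 2 per 1,000 words, which is consistent with the point that both corpora are too small to support strong conclusions.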
5.Discussion: According to the two outputs, the numbers of words, sentences, verb phrases,
clauses, T-units, dependent clauses, complex T-units, coordinate phrases and complex nominals
for Literary Text are about twice those for News Article. This might suggest that the words and
expressions used in the Literary Text are more sophisticated than those in the News Article, but
I think it is mainly caused by the large difference in the number of words (tokens) between the
two corpora: Literary Text has 5401 words while News Article has only 2895, roughly half as
many. So I think the result is not scientific and needs more data. Also, personally I found it
strange that News Article would have less sophisticated words than Literary Text. I think the
two corpora I chose may not be very comparable, which led to this result.