Cst402 Final Project

Ruonan Wen

Project Report
1. Introduction
Hadoop MapReduce is a programming model and software framework for writing applications that rapidly
process vast amounts of data in parallel. “The Dream of the Red Chamber” is one of
China's Four Great Classical Novels. In this project, the author applied Hadoop MapReduce to the novel
to count how often each Chinese character occurs, and compared the result with the most frequent
characters in modern imaginative texts to see the differences and the development of the Chinese language.

2. Initial Motivation
Like English, the Chinese language has gone through a long period of change and development.
From the ancient style to the vernacular and on to modern Chinese, some characters were abandoned, some
were introduced, and some came to be expressed in a different way. It is particularly interesting to
study the development of the Chinese language.
In addition, as a new framework, Hadoop MapReduce has not yet been widely used, especially in China.
Applying Hadoop MapReduce to Chinese data seems particularly fascinating.

3. Input File
“The Dream of the Red Chamber” is one of China's Four Great Classical Novels. It is generally
acknowledged to be the pinnacle of classical Chinese novels. It was composed in 1784 during the Qing
Dynasty, when Chinese vernacular literature started to grow. Literary works written during the late Qing
Dynasty usually combined ancient and vernacular Chinese and are therefore also considered the start of the
modern Chinese language.
In this project, the author downloaded an electronic version (.txt format) of “The Dream of the Red Chamber”,
converted the file to Unicode, and applied Hadoop MapReduce to count the character frequencies.
Here is a sample of “The Dream of the Red Chamber” in Chinese:


4. Mapper
Read the file line by line. For each line, count the frequency of every character, then output each character with its frequency in that line. Here is the main part of the mapper algorithm:

5. Reducer
Add together all the frequencies of the same character. Here is the main part of the reducer algorithm:
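The mapper and reducer steps described in sections 4 and 5 can be sketched roughly as follows. This is a plain Python illustration, not the Hadoop code used in the project. The function names `mapper`, `reducer`, and `run_job` are hypothetical, and `run_job` only imitates Hadoop's shuffle step locally by sorting the mapper output by character.

```python
from collections import Counter
from itertools import groupby
from operator import itemgetter


def mapper(line):
    """Count the characters in one line and emit (character, count) pairs."""
    counts = Counter(line.rstrip("\n"))
    return list(counts.items())


def reducer(char, counts):
    """Add together all the counts emitted for the same character."""
    return char, sum(counts)


def run_job(lines):
    """Hypothetical local driver: sort mapper output by key, then reduce each group."""
    pairs = [pair for line in lines for pair in mapper(line)]
    pairs.sort(key=itemgetter(0))
    return dict(
        reducer(char, (count for _, count in group))
        for char, group in groupby(pairs, key=itemgetter(0))
    )


if __name__ == "__main__":
    # Two toy input lines, not text from the novel.
    print(run_job(["了的了", "的了"]))
```

Counting within each line before emitting keeps the mapper output small, which matches the description "output each character with its frequency in each line".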

6. Output File
The output file gives all the characters in this novel with their frequencies (including punctuation, which was discarded when analyzing). There are 4531 distinct characters in this novel, including punctuation and unreadable characters. Here is a sample:

Noticeably, some ancient Chinese characters cannot be displayed correctly after Unicode conversion:

These are not Chinese characters after conversion. However, from the results, we found that they were not frequent characters, so we ignored them in the experiment. The whole output file can be found in file ouput.rtf.

7. Results
The file result.xlsx gives the result of my project. The file presents the 3999 most frequent characters, sorted by frequency, in both modern imaginative texts and “The Dream of the Red Chamber.” The data for modern imaginative texts was obtained from an online source: http://lingua.mtsu.edu/chinese-computing/statistics/
Here is a sample of the result file:

There are four columns:
1. character in modern imaginative Chinese
2. character frequency in modern imaginative Chinese
3. character in the novel “The Dream of the Red Chamber”
4. character frequency in “The Dream of the Red Chamber”
The number on the side indicates the frequency ranking.
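Dropping punctuation and unreadable characters and then ranking the rest by frequency could look roughly like the sketch below. It is only an illustration under stated assumptions: the helper names `is_chinese` and `rank_by_frequency` are hypothetical, "Chinese character" is approximated by the basic CJK Unified Ideographs block (U+4E00 to U+9FFF), ties are broken by code point, and the counts in the usage example are made up.

```python
def is_chinese(char):
    """Assumption: only the basic CJK Unified Ideographs block counts as Chinese."""
    return "\u4e00" <= char <= "\u9fff"


def rank_by_frequency(counts, limit=None):
    """Return (rank, character, frequency) rows, most frequent first."""
    kept = [(c, n) for c, n in counts.items() if is_chinese(c)]
    # Sort by descending frequency; break ties by code point (an assumption).
    kept.sort(key=lambda item: (-item[1], item[0]))
    return [(rank, c, n) for rank, (c, n) in enumerate(kept[:limit], start=1)]


if __name__ == "__main__":
    # Made-up counts for illustration only, not figures from the project.
    toy_counts = {"了": 5, "，": 9, "的": 4, "?": 2}
    for row in rank_by_frequency(toy_counts):
        print(row)
```

Passing `limit=3999` would keep only as many rows as the result file described above.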

8. Analyses and Comparison
8.1. Chinese vernacular literature
Some characters, such as “了” and “的”, that are commonly used in vernacular Chinese and even in modern Chinese literary works are also the most frequent characters in “The Dream of the Red Chamber.” Since this novel was composed in the late Qing Dynasty, my experiment supports the view that “The Dream of the Red Chamber” is indeed one of the works that marked the start of Chinese vernacular literature.

Character                                               了      的
Frequency ranking in “The Dream of the Red Chamber”     1       2
Frequency ranking in modern imaginative Chinese         4       1

8.2. Characters in names of the novel
“The Dream of the Red Chamber” is a novel about four big families involving many people. Thus, there are a lot of names in this novel. From the result, we can see that some characters, such as “贾” and “袭”, are very common in the novel but sit at a very different position in the list of the 4000 most frequent characters in modern imaginative Chinese. This is because “贾” and “袭” appear in the names of main characters in this novel.

Character                                               贾      袭
Frequency ranking in “The Dream of the Red Chamber”     23      114
Frequency ranking in modern imaginative Chinese         1853    1347

Thus, on one hand, we could use this result to investigate the main characters (people) in the novel. On the other hand, when data from the novel is used to contribute to character counts for all the works of a particular period, it is important to discard this kind of character.

8.3. Development of modern Chinese
Even though “The Dream of the Red Chamber” marked the start of Chinese vernacular literature, it is a classic work. Many characters in this novel cannot be found in the modern imaginative Chinese list. In addition, some characters commonly used in classic works sit at a very different position in the modern imaginative Chinese list. For example:

Character                                               罢      便
Frequency ranking in “The Dream of the Red Chamber”     93      32
Frequency ranking in modern imaginative Chinese         990     165
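The comparisons in section 8 all come down to how far a character moves between the two rankings. A minimal Python sketch of that comparison follows, using only the six rankings quoted in this report. The helper names `rank_shift` and `largest_shifts` and the threshold of 100 positions are assumptions made for illustration, not part of the original analysis.

```python
# Frequency rankings quoted in this report:
# character -> (rank in the novel, rank in modern imaginative texts).
RANKS = {
    "了": (1, 4),
    "的": (2, 1),
    "贾": (23, 1853),
    "袭": (114, 1347),
    "罢": (93, 990),
    "便": (32, 165),
}


def rank_shift(novel_rank, modern_rank):
    """Positive values mean the character ranks higher in the novel."""
    return modern_rank - novel_rank


def largest_shifts(ranks, threshold):
    """Characters whose rank differs by at least `threshold`, largest shift first."""
    shifts = {c: rank_shift(*pair) for c, pair in ranks.items()}
    flagged = [(c, s) for c, s in shifts.items() if abs(s) >= threshold]
    return sorted(flagged, key=lambda item: -abs(item[1]))


if __name__ == "__main__":
    # The threshold of 100 positions is an arbitrary choice for this sketch.
    for char, shift in largest_shifts(RANKS, threshold=100):
        print(char, shift)
```

With this threshold, “了” and “的” are not flagged, while the name characters “贾” and “袭” show the largest shifts.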

9. Further Works
In this project, I applied Hadoop MapReduce to the novel to carry out the character counting. We did find some properties of this novel and some differences between it and the modern Chinese language. However, in this project we analyzed only “The Dream of the Red Chamber.” The Chinese language has a very long history, and it changed dynasty by dynasty. To see the development of Chinese as a language, we should apply Hadoop MapReduce to more works from different periods. In further work, I could combine the frequent Chinese characters of different periods in the same file and use Hadoop MapReduce to analyze the changes and differences. In addition, when Hadoop MapReduce was applied to Chinese data, especially classic literary works, some problems occurred in data conversion, because some ancient characters are no longer in use.

10. Thanks
Thanks to Dr. Eric G Berkowitz for his help throughout the whole semester. Thanks to Melissa Etling for helping me with Hadoop MapReduce. Finally, thanks to Kyle Schmitt for helping me convert the text file to Unicode.