
Computer Standards & Interfaces 20 (1998) 25–29

Recombinant Chinese Pinyin system for efficient processing of information in Chinese
Harry Tong *, Linda Jin
Argonne National Laboratory BioCARS, Building 434-B Argonne, IL 60439 USA
Accepted 1 August 1998

Abstract

An enhanced Pinyin (Chinese phonetic symbol) coding system has been developed for creating a novel Romanized version of the Chinese language. The homophonic problem associated with inputting Chinese characters to computers has been solved by the addition of a suffix to the conventional Pinyin of individual Chinese characters as well as multisyllable words/phrases. The suffix for the Pinyin of a Chinese character comprises a combination of a tone symbol and a single letter index that is mostly designated as the first letter of an English word giving an appropriate translation. For multisyllable Chinese words, the suffix consists only of a single letter that indicates either the tone of the last Chinese character or the first letter of an English translation. The resulting set of Romanized Chinese notations, christened ‘Chinish’, makes it possible to input Chinese characters as efficiently as English words. It is proposed that Chinish be used as a replacement for the current Pinyin system as a standard interface between Chinese speaking users and computers. © 1998 Elsevier Science B.V. All rights reserved.

Keywords: Chinese Pinyin system; Chinish; Coding system; Information processing

* Corresponding author. Tel.: +1 630 2520441; fax: +1 630 2520443; e-mail: tong@cars1.uchicago.edu

1. Introduction

It has long been recognized that the input of Chinese characters to computers represents a bottleneck for Chinese information processing. Numerous coding methods for Chinese have been developed to tackle this problem, most of which are based on the coding of individual components (called ‘radicals’) of the ideographic Chinese characters. There are about 6000 simplified Chinese characters in common use in the People’s Republic of China and in Singapore, and about 7000 traditional Chinese characters in Taiwan, in Hong Kong, and in a greater part of the overseas Chinese community. Since Chinese characters are highly complex, the various sets of rules for decomposing them into components suitable for computer processing are necessarily arbitrary to some degree. Although coding systems based on such sets of rules can achieve an acceptable level of efficiency in inputting Chinese for trained professional typists, they are difficult to learn and easy to forget once not in use. The alternative methods for coding Chinese characters are based on one of the Chinese phonetic systems called Hanyu Pinyin ( ) that was originally designed by the government of the People’s Republic of China for the most commonly spoken Chinese dialect known as Mandarin. Upon its publication, it was made clear that such a phonetic system was not intended to be

0920-5489/98/$ - see front matter © 1998 Elsevier Science B.V. All rights reserved.
PII: S0920-5489(98)00031-2

Fig. 1. Comparison of the styles of the Chinese character prompt box between (A) conventional input methods, as implemented in the ChinaPro system (© DORAK International, 1997–1998); and (B) a Chinish-based input system, as implemented in Chinish Write (© Yijun Ding and Harry Tong, 1998). The list of multiple choices of Chinese characters in A is displayed after ‘shi’ is entered in ChinaPro, which does not allow for tones, and those in B are displayed after ‘shih’ (with ‘h’ indicating the first tone) is typed in Chinish Write. The ranking indices for the Chinish notations in B are derived as follows: a as in ‘arthropod’ ( ); b in ‘body’ ( ); c in ‘carry out’ ( ); g in ‘grass’ ( ); k in ‘kou’ ( ), a Chinese radical name; l in ‘lion’ ( ); p in ‘poem’ ( ); t in ‘teacher’ ( ). The Chinese character ‘ ’ (lose) has only the space key as the ranking index, and is simply represented by the Pinyin ‘shi’ plus the first tone symbol ‘h’. Additionally, the Pinyin ‘shi’ itself codes for the Chinese number ‘ ’ (ten).
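
As a rough illustration of how the notations listed for Fig. 1B are composed, the following Python sketch joins the toneless Pinyin, a tone letter, and an optional ranking index. It is not taken from Chinish Write: the names `TONE_LETTERS`, `SHI_TONE_1`, `chinish`, and `prompt_box` are hypothetical, the tone letters follow the assignment given in Section 2 for the most frequently used sub-group, and the English glosses from the caption stand in for the Chinese characters themselves.

```python
# Tone letters for the most frequently used sub-group (Section 2):
# tones 1-4 map to h, j, k, l and tone 5 maps to f.
TONE_LETTERS = {1: "h", 2: "j", 3: "k", 4: "l", 5: "f"}

# Ranking indices for first-tone 'shi' as listed in the Fig. 1 caption.
# The values are the English glosses (or radical name) from the caption,
# used here as stand-ins for the characters (an assumption of this sketch).
SHI_TONE_1 = {
    "": "lose",  # no ranking letter: selected with the space key
    "a": "arthropod",
    "b": "body",
    "c": "carry out",
    "g": "grass",
    "k": "kou (radical name)",
    "l": "lion",
    "p": "poem",
    "t": "teacher",
}


def chinish(pinyin: str, tone: int, index: str = "") -> str:
    """Join toneless Pinyin, tone letter, and an optional one-letter index."""
    if len(index) > 1 or (index and not index.isalpha()):
        raise ValueError("ranking index must be empty or a single letter")
    return pinyin + TONE_LETTERS[tone] + index.lower()


def prompt_box(pinyin: str, tone: int, candidates: dict) -> list:
    """List (notation, gloss) pairs in alphabetical order of ranking index."""
    return [
        (chinish(pinyin, tone, idx), gloss)
        for idx, gloss in sorted(candidates.items())
    ]


if __name__ == "__main__":
    for notation, gloss in prompt_box("shi", 1, SHI_TONE_1):
        print(notation, gloss)
```

Because the ranking index is part of the notation itself, the same function serves both for building the prompt list and for typing a character directly once its index letter is known.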

used as an alphabetic version of the written language. With one Pinyin notation corresponding to 13–17 Chinese characters on average, the standard Chinese Pinyin system is extremely inefficient as a coding method, because most of the time the user has to select a Chinese character from a list of multiple choices in a prompt box located away from the main body of the text being worked on (Fig. 1A). However, despite the relative inefficiency caused by the homophonic problem, the Pinyin system has remained by far the most popular method for inputting Chinese characters, because for the vast majority of Chinese speakers it requires no special training. Besides serving as a natural coding system for inputting Chinese characters to computers, the Pinyin system is the key to learning correct Chinese pronunciations and to performing general Chinese information processing tasks such as using a dictionary and preparing a bibliographic list. Learning the Pinyin system forms part of the primary school education in most countries and regions where Chinese is spoken.

In order to address the homophonic problem associated with the conventional Pinyin system, we have developed an enhanced version of Pinyin (Chinish) that can be used as a precise alphabetic language, with each word capable of being translated into Chinese characters with 100% accuracy.

2. Design principles of Chinish

In the design of Chinish we have taken the following factors into consideration.
1. The possibility to fully leverage the basic knowledge of conventional Pinyin and English for most Chinese speakers with a need to communicate with a computer.
2. The usage frequency of Chinese characters/words as a guide in deciding which characters/words are to be given priority for the assignment of the suffix in case of conflicts.
3. Ergonomic consideration in the choice of letters for the tone symbols, given the conventional QWERTY design of the English keyboard.
4. Making a distinction between lowercase letters and uppercase letters to increase the coding capacity, with words starting with uppercase letters reserved for person or place names.
5. The need for novice users (with a command of conventional Pinyin) to start using the new language effectively immediately, as well as the need for veteran users to increase their efficiency over time, with only their physical typing speed as the ultimate constraint on how fast they can input Chinese characters.

In accordance with the design considerations outlined above, we have formally represented the Chinish notation of a Chinese character by two parts.
1. The standard toneless Pinyin for the pronunciation of the Chinese character.
2. A suffix that is unique with respect to all the homophones of the Chinese character, comprising a combination of a tone symbol and a ranking index that is a single letter from the English alphabet. The five tones of Mandarin pronunciations are formally represented by the Arabic numbers ‘1’ to ‘5’, but for ergonomic reasons they are replaced by a set of letters better suited for touch typing.

In the few cases where a Pinyin corresponds to more than twenty-six Chinese characters, all the characters are first sorted into sub-groups of twenty-six or fewer according to their usage frequencies: the 26 most frequently used characters belong to the first sub-group, which has a suffix containing a single tone symbol; the next 26 most frequently used characters have duplicated tone symbols in their suffix; and the 26 least frequently used characters have triplicated tone symbols in their suffix. The alphabetic replacements for the tone symbols are selected as follows:
1 → h; 2 → j; 3 → k; 4 → l; 5 → f;
11 → z; 22 → x; 33 → c; 44 → v;
111 → zz; 222 → xx; 333 → cc; 444 → vv.
These substitute letters for the tone symbols were chosen because they do not cause any conflicts with the standard Pinyin representations. The four tone symbols ‘1’, ‘2’, ‘3’, ‘4’ for the most frequently used first sub-group of Chinese characters are represented by the letters ‘h’, ‘j’, ‘k’, and ‘l’, respectively, which are arranged in tandem in the middle row of a conventional QWERTY type English keyboard. Accordingly, the QWERTY keyboard is ideally suited for inputting Chinese characters using the Chinish notations.

We have found that a most appropriate choice for the single letter index is the first letter of an English word giving a translation for the Chinese character itself. The English letter chosen in such a way can not only specify the ranking order of a Chinese character but can also serve as a ‘root’ for its meaning. Thus a Chinish notation so created has both a phonetic and a semantic component. It has been estimated that one sixth of Chinese people take English courses at some time in their lives, corresponding to a total of 200 million. To these people a properly chosen English translation can serve as an excellent mnemonic device for remembering the ranking index. Since a Chinese character usually corresponds to several English translations, the conflicts caused by several Chinese characters sharing the same pronunciations as well as English translations can be minimized. Such a system can also circumvent the differences between the traditional Chinese characters and their simplified counterparts, because its dependence on either of them has been minimized. For example, the Chinish notation for the Chinese character ‘ ’ is ‘shihl’, in which ‘shi’ is the conventional Pinyin, ‘h’ represents the tone symbol ‘1’, and ‘l’ is the ranking index taken from the English translation ‘lion’. For Chinese characters with relatively low usage frequency, or those with no appropriate English translations, the ranking index used is the first letter of the Chinese word depicting the pronunciation of either the first or the last radical forming the Chinese character. Special allowances are made for about 1500 of the most frequently used Chinese characters, which are simply coded by their conventional Pinyin, or by the conventional Pinyin plus the tone symbol. Applying this system, we have created unambiguous Chinish notations for the approximately 6500 Chinese characters that can be found in a typical modern Chinese dictionary (Zidian, 1992).

The use of the combination of the Pinyin of individual characters in coding multisyllable Chinese words/phrases, in particular for disyllables (Shuangpin, ‘ ’), has been found to be remarkably successful in minimizing the homophonic problem, and is implemented in some widely used Chinese processing systems such as NJSTAR (© Hongbo Data Systems, 1995–1998) and CHINAPRO (© DORAK International, 1997–1998). In cases where such a phonetic combination corresponds to two or more Chinese phrases, we have applied the principles used for coding single Chinese characters to resolve the ambiguities by adding a suffix to the phrases with lower usage frequency. In contrast with single Chinese characters, disambiguation for multisyllable Chinese words is achieved with a single-letter suffix that is either the tone symbol for the last Chinese character or the first letter of the English translation for the Chinese word. For example, the Chinish notation for the frequently used disyllabic word ‘ ’ (meaning) is ‘yiyi’, whereas that for the less frequently used word ‘ ’ (objection) is ‘yiyil’. To avoid coding and reading confusion with single Chinese characters, a one-letter suffix is added to two-character phrases in which the Pinyin of the last character consists of only two letters and starts with one of the four most commonly used tone symbols, irrespective of whether there are duplicated codings. For example, the Chinish notation for ‘ ’ (companion) is ‘banlvk’¹. For commonly used Chinese phrases consisting of two or more characters, we have also provided shorthand notations consisting of only the first letters of the Pinyin of the individual Chinese characters. With such a system we have already coded 20,000 two-character, 5400 three-character, 5200 four-character, and 50 five-character Chinese phrases. Thus our system is not only the easiest to learn for novice users, but also potentially the most efficient to use in the hands of experts.

¹ The non-English letter "ü" used in the conventional Pinyin is replaced by "v" for the two pronunciations "lü" and "nü".

3. Prospect of Chinish as the new standard Pinyin

We propose that Chinish be considered as a replacement for the current standard Chinese Pinyin system in Romanizing Chinese characters and phrases, because its adoption as the new standard will completely eliminate the homophonic problem associated with processing Chinese information using computers.

One striking feature of our system is the minimal amount of adjustment that is required of users who are already familiar with the conventional Pinyin input system. Since Chinish is strictly a superset of Pinyin, it is the easiest to learn for inputting Chinese characters compared with any other coding methods that also aim to tackle the homophonic problem (e.g., Dianpin from Gowell, 1997). Because our coding system requires only that the users make a special effort to remember the last letter in a Chinish notation for the less frequently used Chinese characters or phrases, they can start to use our system effectively right away with a prompt system that differs slightly from the conventional one in that the Chinese characters are arranged alphabetically rather than numerically (Fig. 1B). In the conventional prompt system a numeric index serves no more purpose than a temporary device for retrieving a Chinese character. By contrast, in our system an alphabetic index for accessing a Chinese character is also part of its Chinish notation. The difference is superficially small to a novice user, but as the user keeps using our system there will be less and less need to refer to the prompt system. Once most of the Chinish notations are learned by heart naturally in this way, the user can input a Chinese character as easily as an English speaker can type an English word, with focus only on the content of what is being inputted. Since mastering Chinish practically eliminates the need for a prompt system for advanced users, it follows naturally that the input string before translation into Chinese characters should appear in the main body of text, in the same manner as the processing of any other alphabetic language such as English or Dutch, rather than only in the prompt box, as is usually the case with most conventional Chinese input systems.

We have also noted that the Chinish coding system may serve as a first-approximation automatic translation mechanism between Chinese and English. It may make Chinese speakers interested in learning English for more efficient usage of Chinese. It may also encourage English speakers to learn Chinese, because they have a unique advantage in using the written Chinese language via Chinish. Thus the creation of Chinish not only solves the bottleneck problem of processing Chinese information if it is adopted as the next standard Chinese Pinyin, but also gives rise to a new written language for facilitating cultural exchanges between speakers of the two major world languages. Our method may also be adapted for Romanizing other languages that incorporate Chinese characters, such as Japanese and Korean.

The impact of the creation of Chinish on the general movement to Romanize Chinese cannot yet be easily assessed. On the one hand, Chinish is a version of Romanized Chinese that is as precise as the currently used ideographic versions of the language, besides being more amenable to automatic processing by computers. The adoption of Chinish notations as the basis for indexing bibliographic databases can eliminate most of the confusion and inconsistency associated with using Pinyin or Chinese characters themselves (Cai, 1991). On the other hand, the use of Chinish can help preserve the traditional Chinese writing systems by making Chinese characters much easier to input to a computer, the medium of choice in the information age.

Considering all the important advantages of the Chinish system over conventional Pinyin, we anticipate that Chinish will be a strong candidate for a new standard for Romanizing the Chinese language, and as such will be used as the standard interface language between Chinese speakers and computers.

4. Technical notes

4.1. Source materials

The Chinese character set sorted by conventional Pinyin that was used for creating the character dictionary of Chinish was obtained from an on-line archive maintained by the Chinese Community Information Center (CCIC, 1992), as were several Chinese phrase lists that form the basis of the phrase dictionary of Chinish. The list of single Chinese characters sorted by usage frequency was calculated by Tsai (1997). A list of 3000 commonly used English words sorted by usage frequency was obtained from Collins Cobuild (1997).

4.2. Calculation of phrase usage frequency

The usage frequency of Chinese phrases consisting of two to five Chinese characters was calculated by the computer program PHRFREQ (© Harry Tong, 1997–1998). Searches were performed on unprocessed Chinese phrases from the CCIC data archive against Chinese reading materials selected from on-line magazines and journals such as Language and Information (‘ ’, 1995–1998), Huaxia Wenzhai (‘ ’, 1991–1998) and Tulip (‘ ’, 1994–1998), totalling five million Chinese characters. The threshold for high usage frequency was set at 3 times per million characters.

4.3. Selection of English words and radical names

Assignment of ranking indices is first made for the Chinese characters with relatively high usage frequency. When given multiple translations (Wu et al., 1992), the English word with the highest usage frequency was chosen if it was not in conflict with the choices already made for the other Chinese characters in the same sub-group. For the assignment of 6500 Chinese characters, 2500 English words were used. For Chinese characters with low usage frequency, as well as those with no appropriate English translations, the first letter in the name of the first or the last radical was used for the ranking index. The definition of the radical names follows the conventions published in Chinese Characters (1985).

Acknowledgements

We thank Titia Sixma for critical comments on the manuscript. We also thank Xin Tong and Ewalt Scherer for helpful suggestions, and Yijun Ding for collaboration in writing the computer program CHINISH WRITE.

References

Chinese Characters, 1985 edn. Foreign Languages Press, Beijing, China.
Cai, S.E., 1991. A critique and assessment of the automation of East-Asian libraries in the United States. Coll. Res. Libr. 52 (6), 559–575.
Cobuild, C., 1997. HarperCollins Publishers, London, UK.
Data Archive of the Chinese Community Information Center (1992–1998) [On line]. Available http://www.ifcss.org/ftp-pub/software/data.
Tsai, C.H., 1997–1998. Frequency of usage and number of strokes of Chinese characters [On line]. Available http://casper.beckman.uiuc.edu/~c-tsai4/chinese/charfreq.html.
Wu, J.R. et al., 1992. A Chinese-English Dictionary. Commercial Press, Beijing, China.
Zidian, X., 1992 edn. Commercial Press, Beijing, China.
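
The selection procedure described in Section 4.3 can be pictured with a short Python sketch. This is only one possible reading of the text, not the procedure actually used to build the Chinish dictionary: the function name `assign_ranking_indices`, its arguments, and the character identifiers in the usage example are hypothetical, and the paper does not say what happens when the most frequent English word conflicts with an earlier choice, so trying the next most frequent translation and then the radical-name initial is an assumption.

```python
from math import inf


def assign_ranking_indices(chars_by_freq, translations, word_rank, radical_initial):
    """Greedy sketch of ranking-index selection for one homophone sub-group.

    chars_by_freq   -- character identifiers, most frequently used first
    translations    -- identifier -> list of (non-empty) English translations
    word_rank       -- English word -> frequency rank (1 = most frequent)
    radical_initial -- identifier -> first letter of a radical name (fallback)
    """
    taken = set()
    assigned = {}
    for char in chars_by_freq:
        # Order this character's translations from most to least frequent word.
        words = sorted(
            translations.get(char, []),
            key=lambda w: word_rank.get(w, inf),
        )
        letters = [w[0].lower() for w in words]
        # Assumed fallback: the initial of the first or last radical's name.
        if char in radical_initial:
            letters.append(radical_initial[char].lower())
        # Take the first candidate letter not already used in the sub-group.
        choice = next((c for c in letters if c not in taken), None)
        if choice is not None:
            taken.add(choice)
        assigned[char] = choice  # None marks a conflict left for manual review
    return assigned


if __name__ == "__main__":
    result = assign_ranking_indices(
        chars_by_freq=["c1", "c2", "c3"],
        translations={"c1": ["teacher"], "c2": ["tutor", "lion"], "c3": []},
        word_rank={"teacher": 10, "tutor": 50, "lion": 80},
        radical_initial={"c3": "k"},
    )
    print(result)
```

Processing characters in decreasing order of usage frequency gives the most common characters first pick of the letters, which matches the stated priority rule for resolving conflicts.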
