To cite this article: Gong Yitai (1999) Standardization in Chinese character processing and Chinese MARC records, Library
Collections, Acquisitions, and Technical Services, 23:3, 279-286
Library Collections, Acquisitions, & Technical Services, Vol. 23, No. 3, pp. 279–286, 1999
Copyright © 1999 Elsevier Science Ltd
Pergamon Printed in the USA. All rights reserved
1464-9055/99 $–see front matter
PII: S1464-9055(99)00037-8
GONG YITAI
Shanghai, China
E-mail: gonggec@online.sh.cn
INTRODUCTION
Automation and the application of information technology (IT) in libraries are key topics in
modern Chinese libraries. The zeal for IT is perhaps stronger here than in many other countries, due
in part to the speed of economic development and recognition of the essential role of IT in this
process; but this is tempered by a cautious approach on the part of government to the introduction
of IT that is accessible to the general population (including libraries). In addition, any move to
full-scale automation in China’s libraries depends on resolving two pressing issues:
Chinese character processing and Chinese MARC records. Without standardization in
these two areas, effective IT applications in library and information services will be compromised
to the extent of becoming unworkable. This paper addresses each issue in turn, showing how they
have been dealt with and the likely extent of success in the attempted solutions.
While the Chinese have great respect for their written language and its ideographic structure,
there is increasing recognition that this is perhaps the single greatest barrier to automation in
Chinese libraries. The number of Chinese characters is around 50,000 (in contrast to the 26
letters of the Latin alphabet), creating an almost insurmountable barrier to the assumed
simplicity of key strokes around which IT applications are constructed. Therefore, a study of
Chinese character processing and ways in which characters can be coded is a prerequisite for
effective automation in libraries and other types of organizations. Chinese character processing
began in the early 1970s, with initial research emphasizing the coding of Chinese characters. This
led to the development of two basic codes—the internal or machine code and the external or input
code.
Internal Code
The internal code represents characters by binary numbers and is the equivalent of the ASCII
or EBCDIC codes for the Western alphabet. In April 1981 the Technical Committee on National
Standardization of Documentation presented the Code for Chinese Character Sets for Information
Interchange (CCCSII) to ISO/TC 46 Subcommittee 4; this was enacted as a national standard (GB
2312-80) in May 1981 [1]. The primary set consists of 6,763 Chinese characters divided into two
groups, of which 3,755 most frequently used characters constitute the first group, and the remaining
3,008 less frequently used characters constitute the second group. The most significant feature of
the system is that characters are coded with two bytes instead of one. This is because one byte
(eight bits) can encode only 256 characters, whereas a two-byte code can represent 65,536
characters. Thus the tens of thousands of Chinese characters require at least a two-byte coding
system. If we regard ASCII or EBCDIC as unidimensional because a character corresponds to a one
byte number, then the Chinese system is two-dimensional because the code for each character
consists of two one-byte numbers.
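The arithmetic behind the two-byte design can be checked directly. The sketch below is illustrative only: it assumes Python's built-in `gb2312` codec as a stand-in for the GB 2312-80 internal code, and the sample character is chosen arbitrarily.

```python
# Number of distinct codes available with one and with two 8-bit bytes.
ONE_BYTE_CODES = 2 ** 8      # 256
TWO_BYTE_CODES = 2 ** 16     # 65,536

# GB 2312-80 defines its Chinese characters in two groups (figures from the text).
GROUP_1 = 3755               # most frequently used characters
GROUP_2 = 3008               # less frequently used characters
TOTAL = GROUP_1 + GROUP_2    # 6,763


def internal_code(char: str) -> tuple[int, int]:
    """Return the two one-byte numbers that encode one Chinese character.

    Assumption: Python's "gb2312" codec approximates the GB 2312-80
    internal code discussed in the text.
    """
    first, second = char.encode("gb2312")
    return first, second


if __name__ == "__main__":
    print(ONE_BYTE_CODES, TWO_BYTE_CODES, TOTAL)
    print([hex(b) for b in internal_code("中")])
```

The character set is far too large for a single byte but fits comfortably within the two-byte space, which is the point the paragraph above makes.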
The creation of this standard is based on a statistical analysis of Chinese characters. According
to a survey in 1985, 2,851 characters can cover 99% of occurrences, and 5,018 characters cover
99.99% [1]. Thus GB 2312-80, with 6,763 characters, is considered sufficient for ordinary use.
However, it lacks many seldom-used characters that librarians need, because libraries
record and process information that is often complex. To help overcome
this difficulty, a project to develop an expanded character set has been under way for several years
at the National Library of China, and the final result will be a new protocol for information
interchange.
Recently, in collaboration with the Chinese Standardization and Information Code and Classification
Institute, the Chinese Character Coding Subcommittee of the Chinese Society for Information
Science has developed the General Sets for Chinese Character Keyboarding and applied to
the State Technology Supervision Administration for a national standard. The Sets have collected
44,792 characters or words, which will contribute considerably to the standardization of Chinese
characters [2].
At present there are about 10 standard internal codes for Chinese characters used around the
world. In addition to GB 2312-80, other representative codes include the Code of Japanese Graphic
Character Sets for Information (JIS C6226-1978), Taiwan’s Chinese Industrial Standard Code for
Information Interchange (CISCII), the Code of Korean Graphic Character Sets for Information
(KIS 5619), and the US Chinese Character Code for Information Interchange (CCCII) [3].
Although all these standards are intended for information interchange, exchange among the systems
has itself become a problem because the same character is assigned different codes by the different
standards. The differences are of two types. First, the number of characters differs among the standards.
Second, characters are arranged in different sequences. For example, CCCSII arranges
characters in phonetic order (Pinyin system), while CISCII sorts
characters by radical and stroke. A fundamental problem concerning the coding
of Chinese characters is, therefore, the lack of an international standard; establishing one must be a priority
if information interchange is to become practical.
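The interchange problem described above can be made concrete with a small sketch. It assumes the legacy codecs that ship with Python (`gb2312`, `big5`, `euc_jp`) as stand-ins for national character codes; they are not necessarily the same standards as those named in the text, and the sample character is arbitrary.

```python
# Assumption: these Python codecs are stand-ins for the kinds of national
# standards named in the text, not necessarily the same standards.
CODECS = {
    "GB 2312 (mainland China)": "gb2312",
    "Big5 (Taiwan)": "big5",
    "JIS X 0208 via EUC-JP (Japan)": "euc_jp",
}


def codes_for(char: str) -> dict[str, str]:
    """Map each codec label to the hex code it assigns to one character."""
    return {label: char.encode(codec).hex() for label, codec in CODECS.items()}


if __name__ == "__main__":
    for label, code in codes_for("中").items():
        print(f"{label}: {code}")
```

Because each codec assigns the same character a different byte pair, a record written under one code cannot be read under another without a conversion table.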
Input Code
The input of Chinese characters is one of two major obstacles to computer processing of the
Chinese language. (The other is the lack of word “boundaries” to facilitate the segmentation of
Chinese words to form meaningful data processing units.) According to a survey conducted in
1991, more than 700 input methods have been devised, of which 40 to 50 methods have been
employed in the design of Chinese computers, but only 20 of these have found a reasonable level
of acceptance among users [4]. Generally speaking, the means of inputting Chinese characters
include keyboarding, ideographic recognition, and phonetic recognition. Although considerable
effort has gone into research on ideographic and phonetic recognition, most of this is either at the
experimental stage or has been applied in very limited areas. For example, a research team at the
Data Institute of the Ministry of Post and Telecommunications recently completed a project on
Chinese character OCR. The resulting product, Zhi Xing (Wisdom Star) OCR, is able to recognize
printed Chinese characters at the speed of 83 characters per second, with a success rate of 98% [5].
Qinghua University, Fujian Institute of Computer Science, and Harbin Industrial and Engineering
University, among others, have experimented with a photosensitive digitized plate, a device that
sends the pattern of a pen’s movements to the computer as the user writes. Qinghua University’s system
recognizes 3,755 characters, and its recognition rate has reached 80–95%, at a speed
roughly matching human handwriting [6]. For voice input, all efforts in China are still in early
experimental stages.
Keyboarding is the most fundamental and, for the present, most effective means of inputting
Chinese characters. It requires special Chinese keyboards, either large keyboards with each key a
Chinese character (similar to the Chinese typewriter) or smaller keyboards with each key an
orthographic or phonetic component. Because of the awkward size of large keyboards, the
“qwerty” keyboard and small Chinese keyboards have gained popularity in the design of Chinese
computers.
Keyboarding methods can be classified into four categories: phonetic-based codes,
orthographic-based codes, combined phonetic and orthographic-based codes, and position-section-
based systems. At present the most popular input code is the Five Stroke Coding System devised
by Wang Yongmin. This method analyzes characters into 130 roots, which in turn can be produced
by 26 keys based on a Chinese character root cycle table [7]. The Pinyin system is still in wide use
because of its convenience, but the Five Stroke Coding System is widely regarded as the more
effective method.
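A phonetic-based code can be sketched as a lookup from a romanized syllable to candidate characters. The table below is hypothetical and deliberately tiny; it is not taken from any real input method, and it only illustrates that one syllable usually maps to several characters, so the typist must make an extra selection.

```python
# Hypothetical, deliberately tiny table: a real input method would hold
# far more syllables and would rank candidates, for example by frequency.
PINYIN_TABLE = {
    "tu": ["图", "土", "兔"],
    "shu": ["书", "树", "数"],
    "guan": ["馆", "关", "管"],
}


def candidates(syllable: str) -> list[str]:
    """Return the characters a typist would have to choose among."""
    return PINYIN_TABLE.get(syllable.lower(), [])


def select(syllable: str, choice: int) -> str:
    """Pick the candidate at 1-based position `choice`, as on a menu."""
    options = candidates(syllable)
    if not 1 <= choice <= len(options):
        raise ValueError(f"no candidate {choice} for {syllable!r}")
    return options[choice - 1]


if __name__ == "__main__":
    word = "".join(select(s, 1) for s in ["tu", "shu", "guan"])
    print(word)
```

The selection step is the price of the convenience of typing by sound; an orthographic code such as the Five Stroke system trades that step for the effort of learning how characters decompose into roots.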
Data transfer has been a key issue in the design of library systems, particularly because of the
cost implications of resource sharing, especially the sharing of cataloguing data. In China, as
elsewhere, this means that the data must be available in a machine-readable format based on some
common standard. It is for this reason that the CN MARC Project (CN MARC) was launched in
1987 by the National Library of China. Based on UNIMARC developed in 1977 by IFLA, CN
MARC is now the national standard for the machine-readable bibliographic description of Chinese
documentation.
Structure of CN MARC
The general structure of CN MARC is as follows: Record Label, Directory, Data Fields, Record
Separator. The Record Label is a 24-character, fixed-length field that contains general information
about record length, record status, type of record, etc. The Directory consists of a series of
fixed-length, 12-character directory fields, each of which is subdivided. The Data Fields consist of
10 blocks: identification, coded information, descriptive data, notes, series entry, related title, subject, author,
international use, and national use. Altogether there are 125 fields in a CN MARC record. The Appendix,
which lists the fields in a full CN MARC record, shows that CN MARC has adopted many
international standards in structuring its format (e.g., AACR2 and ISBD), thus paving the way for
international and national resource sharing.
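The record layout described above (a 24-character label, 12-character directory entries, data fields, and a record separator) can be sketched as follows. The sketch assumes the ISO 2709 conventions used by UNIMARC: a directory entry is a 3-character tag, a 4-digit field length, and a 5-digit starting position, and the label stores the record length in positions 0 to 4 and the base address of data in positions 12 to 16. Whether CN MARC follows these positions exactly is an assumption here, and the tags and values are invented. Lengths are counted in characters for simplicity, whereas a real record counts bytes.

```python
FIELD_END = "\x1e"    # field terminator in ISO 2709
RECORD_END = "\x1d"   # record terminator (the "Record Separator")
LABEL_LEN = 24        # fixed length of the Record Label
ENTRY_LEN = 12        # fixed length of each Directory entry


def build_record(fields: list[tuple[str, str]]) -> str:
    """Assemble a toy record from (3-character tag, data) pairs."""
    directory, body = "", ""
    for tag, data in fields:
        chunk = data + FIELD_END
        # Entry = tag (3) + field length (4) + starting position (5).
        directory += f"{tag}{len(chunk):04d}{len(body):05d}"
        body += chunk
    directory += FIELD_END
    base = LABEL_LEN + len(directory)          # where the data fields begin
    total = base + len(body) + len(RECORD_END)
    # Label: record length, status "n", type "a", level "m", fixed codes,
    # base address of data, and the entry map "450 ".
    label = f"{total:05d}nam  22{base:05d}   450 "
    return label + directory + body + RECORD_END


def parse_record(record: str) -> dict:
    """Split a record into label values and a tag -> data mapping."""
    label = record[:LABEL_LEN]
    base = int(label[12:17])
    directory = record[LABEL_LEN:base - 1]     # drop the directory terminator
    fields = {}
    for i in range(0, len(directory), ENTRY_LEN):
        entry = directory[i:i + ENTRY_LEN]
        tag, length, start = entry[:3], int(entry[3:7]), int(entry[7:12])
        begin = base + start
        fields[tag] = record[begin:begin + length - 1]   # strip FIELD_END
    return {"length": int(label[:5]), "status": label[5],
            "type": label[6], "fields": fields}


if __name__ == "__main__":
    rec = build_record([("001", "CN0001"), ("200", "Example title")])
    print(parse_record(rec))
```

The dictionary keeps only one field per tag, which is enough for this illustration; a fuller parser would keep a list so that repeated fields survive.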
Sound though the structure of CN MARC is, the contents of fields and subfields are lengthy and
tedious, sometimes even duplicated. As there are 125 fields and 394 subfields in the CN MARC
format, it is likely that in most libraries some of this information is unnecessary and wastes
computer space. Furthermore, the present format structure of CN MARC is unable to accommodate
information on different types of documents, such as non-print materials, maps, and rare books.
Therefore, many experts in the use of MARC suggest that a radical revision of CN MARC is long
overdue.
● The bibliographic record should be in conformity with ISBD and the National Standards for
Bibliographic Descriptions both in terms of forms and contents.
● Libraries at any level can make direct use of bibliographic records to form their own files
without any change or conversion. In other words the bibliographic records can be directly
used by any library across the country so that the goal of resource sharing can be realized.
● The unique format structure can accommodate information on different types of documents,
such as monographs, serials, non-print materials, maps, rare books, archive materials, etc., in
order to facilitate the development and maintenance of software packages.
● The structure can be used either as an external communication format or the internal format
of a system. Without any conversion it is possible to display or print bibliographic products.
The general structure of the new generation CN MARC is as follows: Label and Code Area,
Descriptive Bibliographic Area, Control and Retrieval Area, Record Separator, Code Area. The
Label and Code Area contains information on control numbers (SBN, ISSN, etc.), type of
publication, language, country of origin, etc. The Descriptive Bibliographic Area consists of
bibliographic data based on the Bibliographical Description for Monographs (for materials in
Chinese) and the Descriptive Cataloguing Rules for Western Language Materials. All data are
arranged in text format so that they can be easily displayed on screen or printed. Table 1 shows the
general structure of this new format.
The Control and Retrieval Area contains information on access points. All the fields in the Area
are variable-length fields which can be repeated, but have no subfields. Information in this Area is
used to build indexes for quick search. Table 2 shows the general structure of this area—note the
fields for the two most commonly used Chinese classification schemes: Classification of Chinese
Literature (CCL) and Classification of the Library of the Chinese Academy of Sciences (CLCAS).
TABLE 1
Structure of Descriptive Bibliographic Area in New CN MARC

This new generation CN MARC is used by some libraries in Guangdong Province. Although it
has a number of potentially useful features, this format must be tested extensively in the field before
being widely adopted as a national MARC standard in China.
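As a rough illustration of how the repeatable, subfield-free access-point fields of the Control and Retrieval Area could feed an index for quick search, consider the sketch below. The field names, record identifiers, and values are hypothetical and are not the fields defined in the new CN MARC format.

```python
from collections import defaultdict

# Hypothetical records: each is a list of (field name, value) access points.
# Field names may repeat within a record, and values carry no subfields.
RECORDS = {
    "rec-1": [("author", "Author A"), ("author", "Author B"), ("class", "X1")],
    "rec-2": [("author", "Author A"), ("class", "Y2")],
}


def build_index(records: dict) -> dict:
    """Build an inverted index: (field name, value) -> set of record ids."""
    index = defaultdict(set)
    for record_id, fields in records.items():
        for name, value in fields:
            index[(name, value.lower())].add(record_id)
    return index


def search(index: dict, name: str, value: str) -> list[str]:
    """Return the ids of records that carry the given access point."""
    return sorted(index.get((name, value.lower()), set()))


if __name__ == "__main__":
    idx = build_index(RECORDS)
    print(search(idx, "author", "Author A"))
```

Because every access point is a whole field, indexing needs no subfield parsing, which is one plausible reading of why the format omits subfields in this area.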
CONCLUSION
The information profession in China has made substantial progress in areas unique to China’s
information environment (Chinese character processing and Chinese MARC records) as a prerequisite
to development of appropriate IT applications in libraries. But there are still numerous
hurdles to be overcome. In particular, as the developed countries make greater use of the Internet
and World Wide Web, information professionals in China are increasingly frustrated by the slow
pace of introducing this latest information-rich medium to the library sector. While it is true that
many libraries now have Web pages and some limited access to the Internet, it is not widely
available as an information retrieval tool in libraries. It is in the interests of the information sector
for this problem to be dealt with rapidly so that China can keep up with other countries and not
become information poor for want of access to a most important resource.
TABLE 2
Structure of Control and Retrieval Area in New CN MARC
REFERENCES
1. Yongquan, Liu. “New Developments in Computer and National Language Processing in China.” Information Science,
8 (1987), 64–70.
2. Ping, Li, & Yifan, Chen. “On the Generality, Standardization and Intelligence of Chinese Character Keyboarding.”
China Computerworld, 9 (November 1994).
3. Maruyama, S. “Bibliographic Control and Library Automation in Japan.” In C. Bossmeyer and S. Massil (Eds.),
Automated Systems for Access to Multilingual and Multiscript Library Materials: Problems and Solutions (Munich:
K. G. Saur, 1987); “Hundred Schools of Thoughts Contend on Chinese Interchange Code.” Central Daily News (overseas
ed.), 8 July 1989; Hyeon, K. S. “On Establishing a Standard Character Code System in Korea and Its Application to the
Korean MARC Database.” In C. Bossmeyer and S. Massil (Eds.), Automated Systems for Access to Multilingual and
Multiscript Library Materials: Problems and Solutions (Munich: K. G. Saur, 1987); Hsieh, C., et al. “The Design
and Application of the Chinese Character Code for Information Interchange (CCCII).” Journal of Library and
Information Science (USA/Taiwan), 7 (1981), 129–143.
4. “Great Achievements in Chinese Information Processing.” Computer World, 8 (1991), 2–3.
5. Shuangguam, Zhang, & Shuanlong, Zhou. “A Review of Chinese Information Processing in the Past 30 Years.” New
APPENDIX:
Fields in a Full CN MARC Record