
This article was downloaded by: [Stony Brook University]

On: 14 October 2014, At: 06:49


Publisher: Routledge
Informa Ltd Registered in England and Wales Registered Number: 1072954 Registered office: Mortimer House,
37-41 Mortimer Street, London W1T 3JH, UK

Library Collections, Acquisitions, and Technical Services
Publication details, including instructions for authors and subscription information:
http://www.tandfonline.com/loi/ulca20

Standardization in Chinese character processing and Chinese MARC records
Gong Yitai
Shanghai Documentation and Information Centre, Chinese Academy of Sciences, Shanghai, China
Published online: 03 Dec 2013.

To cite this article: Gong Yitai (1999) Standardization in Chinese character processing and Chinese MARC records, Library
Collections, Acquisitions, and Technical Services, 23:3, 279-286

To link to this article: http://dx.doi.org/10.1080/14649055.1999.10765580

Library Collections, Acquisitions, & Technical Services, Vol. 23, No. 3, pp. 279–286, 1999
Copyright © 1999 Elsevier Science Ltd
Pergamon Printed in the USA. All rights reserved
1464-9055/99 $–see front matter

PII: S1464-9055(99)00037-8

STANDARDIZATION IN CHINESE CHARACTER PROCESSING AND CHINESE MARC RECORDS

GONG YITAI

Shanghai Documentation and Information Centre
Chinese Academy of Sciences
Shanghai, China
E-mail: gonggec@online.sh.cn

Abstract—Chinese characters are recognized as a significant hindrance to the rapid
introduction of automated systems in Chinese libraries. Therefore, the coding of these
characters is a prerequisite for effective library automation. Two basic codes, internal
or machine code and external or input code, have been developed. These are described
and analyzed in this paper. © 1999 Elsevier Science Ltd

INTRODUCTION

Automation and the application of information technology (IT) are key concerns for modern
Chinese libraries. The zeal for IT is perhaps stronger in China than in many other countries, due
in part to the speed of economic development and recognition of the essential role of IT in this
process; but this is tempered by a cautious approach on the part of government to the introduction
of IT that is accessible to the general population (including libraries). In addition, any move to
full-scale automation in China’s libraries depends on resolving two pressing issues: Chinese
character processing and Chinese MARC records. Without standardization in these two areas,
effective IT applications in library and information services will be compromised to the extent of
becoming unworkable. This paper addresses each issue in turn, showing how it has been dealt with
and the likely extent of success of the attempted solutions.

CHINESE CHARACTER PROCESSING

While the Chinese have great respect for their written language and its ideographic structure,
there is increasing recognition that this is perhaps the single greatest barrier to automation in
Chinese libraries. The number of Chinese characters is around 50,000 (in contrast to the 26
letters of the Latin alphabet), creating an almost insurmountable obstacle to the simple keystroke
input around which IT applications are constructed. Therefore, a study of
Chinese character processing and ways in which characters can be coded is a prerequisite for
effective automation in libraries and other types of organizations. Chinese character processing
began in the early 1970s, with initial research emphasizing the coding of Chinese characters. This
led to the development of two basic codes—the internal or machine code and the external or input
code.

Internal Code
The internal code represents characters by binary numbers and is the equivalent of the ASCII
or EBCDIC codes for the Western alphabet. In April 1981 the Technical Committee on National
Standardization of Documentation presented the Code for Chinese Character Sets for Information
Interchange (CCCSII) to ISO/TC 46 Subcommittee 4; this was enacted as a national standard (GB
2312-80) in May 1981 [1]. The primary set consists of 6,763 Chinese characters divided into two
groups, of which 3,755 most frequently used characters constitute the first group, and the remaining
3,008 less frequently used characters constitute the second group. The most significant feature of
the system is that characters are coded with two bytes instead of one. This is because one byte
(eight bits) allows only 256 distinct codes, whereas a two-byte code can represent 65,536.
Thus the tens of thousands of Chinese characters require at least a two-byte coding
system. If we regard ASCII or EBCDIC as unidimensional because a character corresponds to a
one-byte number, then the Chinese system is two-dimensional because the code for each character
consists of two one-byte numbers.
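The two-byte, "two-dimensional" view described above can be illustrated with a short sketch. It assumes the `gb2312` codec shipped with Python's standard library and the common byte layout in which each byte is the row or cell number plus 0xA0; neither detail comes from the article, and the row ranges used for the two groups (16 to 55 and 56 to 87) are likewise an assumption about how the standard is laid out.

```python
# Capacity of one-byte and two-byte codes, as stated in the text.
ONE_BYTE_CODES = 2 ** 8      # 256
TWO_BYTE_CODES = 2 ** 16     # 65,536


def gb2312_row_cell(ch):
    """Return the (row, cell) pair of one Chinese character.

    Assumes the byte layout used by Python's gb2312 codec, where each
    byte equals the row or cell number plus 0xA0.
    """
    high, low = ch.encode("gb2312")   # exactly two bytes for a Chinese character
    return high - 0xA0, low - 0xA0


def count_assigned(rows):
    """Count the code points in the given rows that the codec can decode."""
    assigned = 0
    for row in rows:
        for cell in range(1, 95):     # each row has 94 cells
            try:
                bytes([row + 0xA0, cell + 0xA0]).decode("gb2312")
                assigned += 1
            except UnicodeDecodeError:
                pass                  # unassigned position
    return assigned


if __name__ == "__main__":
    print(ONE_BYTE_CODES, TWO_BYTE_CODES)
    print("中".encode("gb2312").hex(), gb2312_row_cell("中"))
    # Assumed row ranges for the first and second groups of characters.
    print(count_assigned(range(16, 56)), count_assigned(range(56, 88)))
```

The two counts printed at the end can be compared with the 3,755 and 3,008 characters that the article gives for the first and second groups.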
This standard is based on a statistical analysis of Chinese characters. According
to a survey in 1985, 2,851 characters can cover 99% of occurrences, and 5,018 characters cover
99.99% [1]. Thus GB 2312-80, with 6,763 characters, is considered sufficient for ordinary use.
However, librarians are hampered by the absence of many seldom-used characters, because libraries
record and process information that is often complex. To help overcome
this difficulty, a project to develop an expanded character set has been under way for several years
at the National Library of China, and the final result will be a new protocol for information
interchange.
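The coverage figures quoted above come from counting how often each character occurs in a body of text. A minimal sketch of that kind of calculation follows; the function name and the tiny sample string are hypothetical, and the 1985 survey itself would have used a far larger corpus whose details the article does not give.

```python
from collections import Counter


def chars_needed_for_coverage(text, target):
    """Return how many of the most frequent characters are needed so that
    they account for at least `target` (a fraction, e.g. 0.99) of all
    character occurrences in `text`. Whitespace is ignored."""
    counts = Counter(ch for ch in text if not ch.isspace())
    total = sum(counts.values())
    covered = 0
    for rank, (_, count) in enumerate(counts.most_common(), start=1):
        covered += count
        if covered / total >= target:
            return rank
    return len(counts)


if __name__ == "__main__":
    # Toy sample: one very common character and two rare ones.
    sample = "的" * 8 + "一" + "是"
    for target in (0.8, 0.9, 1.0):
        print(target, chars_needed_for_coverage(sample, target))
```

Run on a real corpus, the same loop would show the long tail that troubles librarians: the last fraction of a percent of occurrences needs thousands of additional characters.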
Recently, in collaboration with the Chinese Standardization and Information Code and
Classification Institute, the Chinese Character Coding Subcommittee of the Chinese Society for
Information Science has developed the General Sets for Chinese Character Keyboarding and applied to
the State Technology Supervision Administration for a national standard. The Sets comprise
44,792 characters or words, which will contribute considerably to the standardization of Chinese
characters [2].
At present there are about 10 standard internal codes for Chinese characters used around the
world. In addition to GB 2312-80, other representative codes include the Code of Japanese Graphic
Character Sets for Information (JIS C6226-1978), Taiwan’s Chinese Industrial Standard Code for
Information Interchange (CISCII), the Code of Korean Graphic Character Sets for Information
(KIS 5619), and the US Chinese Character Code for Information Interchange (CCCII) [3].
Although all these standards are intended for information interchange, exchange among the systems
has itself become a problem because the same character is assigned different codes by the different
standards. The differences are of two types. First, the number of characters differs among the standards.
Second, characters are arranged in different sequences. For example, CCCSII arranges characters
in phonetic order (Pinyin system), while CISCII sorts them by radical and stroke count.
A fundamental problem concerning the coding of Chinese characters is, therefore, the lack of an
international standard, and establishing one must be a priority if information interchange is to
become practical.
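The interchange problem can be made concrete by encoding one character under several national code sets. The sketch below uses codecs available in Python's standard library as stand-ins: `gb2312` for GB 2312-80, and `big5`, `euc_jp` and `euc_kr` for code sets used in Taiwan, Japan and Korea. These are not necessarily the exact standards named in the article (CISCII and CCCII, for instance, have no standard-library codec), so the mapping is an assumption made for illustration only.

```python
# Stand-in codecs from the Python standard library (an assumption; see lead-in).
CODECS = {
    "China (gb2312)": "gb2312",
    "Taiwan (big5)": "big5",
    "Japan (euc_jp)": "euc_jp",
    "Korea (euc_kr)": "euc_kr",
}


def codes_for(ch):
    """Return the hexadecimal two-byte code of `ch` under each stand-in codec."""
    return {name: ch.encode(codec).hex() for name, codec in CODECS.items()}


def transcode(data, source, target):
    """Convert bytes from one code set to another by way of a common
    intermediate character repertoire (Python's internal Unicode strings)."""
    return data.decode(source).encode(target)


if __name__ == "__main__":
    for name, code in codes_for("中").items():
        print(name, code)
```

Because every code set assigns its own number to the same character, exchanging records requires a conversion step such as `transcode`, and that step fails whenever the target set lacks the character.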

Input Code
The input of Chinese characters is one of two major obstacles to computer processing of the
Chinese language. (The other is the lack of word “boundaries” to facilitate the segmentation of
Chinese words to form meaningful data processing units.) According to a survey conducted in
1991, more than 700 input methods have been devised, of which 40 to 50 methods have been
employed in the design of Chinese computers, but only 20 of these have found a reasonable level
of acceptance among users [4]. Generally speaking, the means of inputting Chinese characters
include keyboarding, ideographic recognition, and phonetic recognition. Although considerable
effort has gone into research on ideographic and phonetic recognition, most of this is either at the
experimental stage or has been applied in very limited areas. For example, a research team at the
Data Institute of the Ministry of Post and Telecommunications recently completed a project on
Chinese character OCR. The resulting product, Zhi Xing (Wisdom Star) OCR, is able to recognize
printed Chinese characters at the speed of 83 characters per second, with a success rate of 98% [5].
Qinghua University, Fujian Institute of Computer Science, and Harbin Industrial and Engineering
University, among others, have experimented with a photosensitive digitized plate—a device to
send a pen’s moving pattern to the computer when it writes. Qinghua University’s system
recognizes 3,755 characters, with a recognition rate of 80–95% and a speed
basically matching human writing [6]. For voice input, all efforts in China are still in early
experimental stages.
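The second obstacle mentioned in parentheses above, the lack of word boundaries, can be illustrated with a simple dictionary-based segmenter. The sketch uses forward maximum matching, a technique chosen here for illustration and not named in the article; the lexicon and the sample phrase are hypothetical.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Split `text` into words by repeatedly taking the longest lexicon
    entry that starts at the current position. A character that starts
    no known word is emitted on its own."""
    words = []
    position = 0
    while position < len(text):
        longest = min(max_len, len(text) - position)
        for size in range(longest, 0, -1):
            piece = text[position:position + size]
            if size == 1 or piece in lexicon:
                words.append(piece)
                position += size
                break
    return words


if __name__ == "__main__":
    lexicon = {"图书", "图书馆", "自动化"}   # toy lexicon
    print(forward_max_match("图书馆自动化", lexicon))
```

With an empty lexicon the function falls back to one character per unit, which corresponds to the single-character approach to retrieval.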
Keyboarding is the most fundamental and, for the present, most effective means of inputting
Chinese characters. It requires special Chinese keyboards, either large keyboards with each key a
Chinese character (similar to the Chinese typewriter) or smaller keyboards with each key an
orthographic or phonetic component. Because large keyboards are unwieldy, the
QWERTY keyboard and small Chinese keyboards have gained popularity in the design of Chinese
computers.
Keyboarding methods can be classified into four categories: phonetic-based codes,
orthographic-based codes, combined phonetic and orthographic-based codes, and
position-section-based systems. At present the most popular input code is the Five Stroke Coding
System devised by Wang Yongmin. This method analyzes characters into 130 roots, which in turn can
be produced by 26 keys based on a Chinese character root cycle table [7]. The Pinyin system is still
in wide use because of its convenience, but the Five Stroke Coding System is widely regarded as the
more adequate method.
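A phonetic-based input code can be pictured as a lookup from a keyed syllable to the characters pronounced that way. The sketch below is a toy illustration with a hypothetical two-entry table; it is not the Pinyin system's actual implementation. It shows that one syllable usually yields several candidates, so a second selection keystroke is needed. That this extra step is part of why a shape-based code is rated more adequate is an interpretation, not a statement from the article.

```python
# Hypothetical miniature table: syllable typed on a QWERTY keyboard -> candidates.
PHONETIC_TABLE = {
    "zhong": ["中", "种", "重", "钟", "终"],
    "ma": ["马", "妈", "吗", "骂"],
}


def candidates(code):
    """Return the characters that share the keyed pronunciation."""
    return PHONETIC_TABLE.get(code, [])


def select(code, choice):
    """Pick the `choice`-th candidate (1-based), as a user would with a digit key."""
    options = candidates(code)
    if not 1 <= choice <= len(options):
        raise ValueError(f"no candidate {choice} for code {code!r}")
    return options[choice - 1]


if __name__ == "__main__":
    print(candidates("zhong"))
    print(select("zhong", 1))
```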

CHINESE MARC PROJECT

Data transfer has been a key issue in the design of library systems particularly because of the
cost implications in resource sharing and especially the sharing of cataloguing data. In China, as
elsewhere, this means that the data must be available in a machine-readable format based on some
common standard. It is for this reason that the CN MARC Project (CN MARC) was launched in
1987 by the National Library of China. Based on UNIMARC developed in 1977 by IFLA, CN
MARC is now the national standard for the machine-readable bibliographic description of Chinese
documentation.

Structure of CN MARC
The general structure of CN MARC is as follows: Record Label, Directory, Data Fields, Record
Separator. The Record Label is a 24-character, fixed-length field that contains general information
about record length, record status, type of record, etc. The Directory consists of a series of
fixed-length, 12-character directory fields, each of which is subdivided. The Data Fields consist of
10 blocks: identification, coded information, descriptive data, notes, series entry, related title, subject,
author, international use, national use. Altogether there are 125 fields in a CN MARC record. The Appendix,
which lists the fields in a full CN MARC record, shows that CN MARC has adopted many
international standards in structuring its format (e.g., AACR2 and ISBD), thus paving the way for
international and national resource sharing.
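The record layout just described (a 24-character Record Label, a Directory of 12-character entries, the Data Fields, and a Record Separator) can be sketched as follows. Several details are assumptions drawn from the ISO 2709 exchange format on which UNIMARC-style records are built, not from the article: each directory entry is split into a 3-character tag, a 4-digit field length and a 5-digit starting position; fields end with byte 0x1E and the record with byte 0x1D; label positions 0 to 4 hold the record length and 12 to 16 the base address of data. Indicators and subfields are left out to keep the sketch short, and the remaining label characters are placeholders.

```python
LABEL_LEN = 24        # Record Label length given in the text
ENTRY_LEN = 12        # directory entry length given in the text
FIELD_END = b"\x1e"   # field terminator (assumed, ISO 2709)
RECORD_END = b"\x1d"  # record separator (assumed, ISO 2709)


def build_record(fields, encoding="gb2312"):
    """Serialize a list of (tag, text) pairs into one record."""
    directory = b""
    body = b""
    for tag, text in fields:
        chunk = text.encode(encoding) + FIELD_END
        # Assumed entry layout: tag (3) + field length (4) + start offset (5).
        directory += f"{tag}{len(chunk):04d}{len(body):05d}".encode("ascii")
        body += chunk
    directory += FIELD_END
    base = LABEL_LEN + len(directory)
    total = base + len(body) + len(RECORD_END)
    # Record length, placeholder status codes, base address, placeholders.
    label = f"{total:05d}nam  22{base:05d}   450 ".encode("ascii")
    return label + directory + body + RECORD_END


def parse_record(raw, encoding="gb2312"):
    """Recover the (tag, text) pairs from a record made by build_record."""
    base = int(raw[12:17].decode("ascii"))
    directory = raw[LABEL_LEN:base - 1]          # drop the directory terminator
    fields = []
    for i in range(0, len(directory), ENTRY_LEN):
        entry = directory[i:i + ENTRY_LEN].decode("ascii")
        tag, length, start = entry[:3], int(entry[3:7]), int(entry[7:12])
        data = raw[base + start:base + start + length - 1]   # drop field terminator
        fields.append((tag, data.decode(encoding)))
    return fields


if __name__ == "__main__":
    record = build_record([("001", "CN0001"), ("200", "中文图书")])
    print(record[:LABEL_LEN])
    print(parse_record(record))
```

Lengths are counted in bytes, so a field holding four two-byte Chinese characters occupies eight bytes plus its terminator.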
Sound though the structure of CN MARC is, the contents of fields and subfields are lengthy and
tedious, sometimes even duplicated. As there are 125 fields and 394 subfields in the CN MARC
format, it is likely that in most libraries some of this information is unnecessary and wastes
computer space. Furthermore, the present format structure of CN MARC is unable to accommodate
information on different types of documents, such as non-print materials, maps, and rare books.
Therefore, many experts in the use of MARC suggest that a radical revision of CN MARC is long
overdue.

New Generation CN MARC


Thanks to the progress in full-text retrieval of Chinese materials, especially the advent of
technologies of single Chinese character retrieval, automatic segmentation of key words and
automatic indexing, a new generation of CN MARC has been developed by librarians at the
Zhongshan Library, Guangdong Province [8]. The new generation is based on four principles.

● The bibliographic record should be in conformity with ISBD and the National Standards for
Bibliographic Descriptions both in terms of forms and contents.
● Libraries at any level can make direct use of bibliographic records to form their own files
without any change or conversion. In other words the bibliographic records can be directly
used by any library across the country so that the goal of resource sharing can be realized.
● The unique format structure can accommodate information on different types of documents,
such as monographs, serials, non-print materials, maps, rare books, archive materials, etc., in
order to facilitate the development and maintenance of software packages.
● The structure can be used either as an external communication format or the internal format
of a system. Without any conversion it is possible to display or print bibliographic products.

The general structure of the new generation CN MARC is as follows: Label and Code Area,
Descriptive Bibliographic Area, Control and Retrieval Area, Record Separator, Code Area. The
Label and Code Area contains information on control numbers (SBN, ISSN, etc.), type of
publication, language, country of origin, etc. The Descriptive Bibliographic Area consists of
bibliographic data based on the Bibliographical Description for Monographs (for materials in
Chinese) and the Descriptive Cataloguing Rules for Western Language Materials. All data are
arranged in text format so that they can be easily displayed on screen or printed. Table 1 shows the
general structure of this new format.
The Control and Retrieval Area contains information on access points. All the fields in the Area
are variable-length fields which can be repeated, but have no subfields. Information in this Area is
used to build indexes for quick search. Table 2 shows the general structure of this area; note the
fields for the two most commonly used Chinese classification schemes: Classification of Chinese
Literature (CCL) and Classification of the Library of the Chinese Academy of Sciences (CLCAS).
This new generation CN MARC is used by some libraries in Guangdong Province. Although it
has a number of potentially useful features, this format must be tested extensively in the field before
being widely adopted as a national MARC standard in China.

TABLE 1
Structure of Descriptive Bibliographic Area in New CN MARC

Tag   Content of Field

100   Title proper (parallel title, subtitle)
101   Edition number and edition authors
102   Place and date of publication
103   Number of pages, illustrations, etc.
105   Series title, volume number, editor, ISSN
106   Notes
107   ISBN, terms of availability
108   Summary
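The use of the Control and Retrieval Area to build indexes can be sketched with a small inverted index. The tags are those listed for this area in Table 2; the record identifiers, the sample values and the extra per-character index on titles (a nod to single Chinese character retrieval) are hypothetical choices made for illustration, not details taken from the new format's specification.

```python
from collections import defaultdict

# Tags of the Control and Retrieval Area as listed in Table 2.
ACCESS_TAGS = {"200", "210", "250", "260", "261", "270", "271", "280"}


def build_index(records):
    """Map (tag, value) pairs to the set of record identifiers containing them.

    `records` maps a record identifier to a list of (tag, value) pairs.
    Fields are repeatable and have no subfields, so each value is indexed
    whole. Titles are additionally indexed character by character.
    """
    index = defaultdict(set)
    for record_id, fields in records.items():
        for tag, value in fields:
            if tag not in ACCESS_TAGS:
                continue                      # not an access point
            index[(tag, value)].add(record_id)
            if tag == "200":
                for ch in value:
                    if not ch.isspace():
                        index[("200:char", ch)].add(record_id)
    return index


if __name__ == "__main__":
    records = {
        "r1": [("200", "图书馆学"), ("210", "张三"), ("250", "图书馆")],
        "r2": [("200", "图书情报"), ("250", "图书馆"), ("250", "情报学")],
    }
    index = build_index(records)
    print(sorted(index[("250", "图书馆")]))
    print(sorted(index[("200:char", "馆")]))
```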

CONCLUSION

The information profession in China has made substantial progress in areas unique to China’s
information environment (Chinese character processing and Chinese MARC records) as a prerequisite
to development of appropriate IT applications in libraries. But there are still numerous
hurdles to be overcome. In particular, as the developed countries make greater use of the Internet
and World Wide Web, information professionals in China are increasingly frustrated by the slow
pace of introducing this latest information-rich medium to the library sector. While it is true that
many libraries now have Web pages and some limited access to the Internet, it is not widely
available as an information retrieval tool in libraries. It is in the interests of the information sector
for this problem to be dealt with rapidly so that China can keep up with other countries and not
become information poor for want of access to a most important resource.

TABLE 2
Structure of Control and Retrieval Area in New CN MARC

Tag   Content of Field

200   Title proper, subtitle, series title
210   Intellectual responsibility (author, editor)
250   Subject headings
260   CCL classification number
261   CLCAS classification number
270   Patent number
271   Standard number
280   Other access points

REFERENCES

1. Yongquan, Liu. “New Developments in Computer and National Language Processing in China.” Information Science,
8 (1987), 64–70.
2. Ping, Li, & Yifan, Chen. “On the Generality, Standardization and Intelligence of Chinese Character Keyboarding.”
China Computerworld, 9 (November 1994).
3. Maruyama, S. “Bibliographic Control and Library Automation in Japan.” In C. Bossmeyer and S. Massil (Eds.),
Automated Systems for Access to Multilingual and Multiscript Library Materials: Problems and Solutions (Munich:
K. G. Saur, 1987); “Hundred Schools of Thoughts Contend on Chinese Interchange Code.” Central Daily News (overseas
ed.), 8 July 1989; Hyeon, K. S. “On Establishing a Standard Character Code System in Korea and Its Application to the
Korean MARC Database.” In C. Bossmeyer and S. Massil (Eds.), Automated Systems for Access to Multilingual and
Multiscript Library Materials: Problems and Solutions (Munich: K. G. Saur, 1987); Hsieh, C., et al. “The Design
and Application of the Chinese Character Code for Information Interchange (CCCII).” Journal of Library and
Information Science (USA/Taiwan), 7 (1981), 129–143.
4. “Great Achievements in Chinese Information Processing.” Computer World, 8 (1991), 2–3.
5. Shuangguam, Zhang, & Shuanlong, Zhou. “A Review of Chinese Information Processing in the Past 30 Years.” New
Technology in Library and Information Science, 3 (1994), 49–54.
6. Chengjian, Sun, & Dahua, Zhang. “Recognition Techniques of Written Characters and Their Application to Library
Services.” New Technology in Library and Information Science, 1 (1993), 8–11.
7. Neghua, Chen. “Comparative Studies on Input Codes for Chinese Characters.” Library Journal, 2 (1994), 25–27.
8. Guangzuo, Chen. “On Single Chinese Character Retrieval System.” Information Science, 11 (1992), 11–18; Minzu,
Zeng. “The Prospect and a Decade’s Achievements in Computerized Information Retrieval in China.” Information
Science, 9 (1990), 57–66; Shaoqiang, Mo. “The Design and Research for a New Format of Machine Readable
Catalogue.” New Technology in Library and Information Science, 1 (1994), 2–8.

APPENDIX:
Fields in a Full CN MARC Record

Tag Contents of Field Repetition Notes

001 Record control number N mandatory
005 Record edition label N
010 ISBN Y
011 ISSN Y
020 National bibliography number Y
021 Copyright number Y
022 Government publication number Y
040 CODEN N
091 Uniform book number Y
092 Acquisition number Y
093 Patent number Y
094 Standard number N
100 General processing data N mandatory
101 Languages of materials N mandatory
102 Country of publication N
105 Encoded data field: monograph N
106 Encoded data field: physical description N
110 Encoded data field: series N
122 Encoded data field: time range of materials Y
200 Title and statement of responsibility N mandatory
205 Edition Y
207 Material particulars: series N
210 Publication and distribution N
215 Collation Y
225 Series data Y
300 General note Y
301 Identification number note Y
302 Code information note Y
303 General note of description data Y
304 Title and statement of responsibility note Y
305 Edition and bibliographic history note Y
306 Publication and distribution note Y
307 Media note Y
308 Series note Y
310 Binding and terms of availability note Y
311 Linking field note Y
312 Related title note Y
313 Subject note Y
314 Author note Y
315 Material particulars note Y
320 Bibliographic note Y
321 Index, abstract and reference note Y
324 Facsimile copy note Y
326 Publication frequency note Y
327 Contents note Y
328 Dissertation note Y
330 Summary or abstracts Y
332 Citation Y
345 Acquisition information note Y
410 Series Y
411 Subordinate of series Y
421 Supplement Y
422 Main periodical Y
423 Bound Y
430 Continuation Y
431 Continuation in part Y
432 Supersession Y
433 Supersession in part Y
434 Absorption Y
435 Absorption in part Y
436 Merger Y
437 Separated from Y
440 Continued by Y
441 Continued in part by Y
442 Superseded by Y
443 Superseded in part by Y
444 Absorbed by Y
445 Absorbed in part by Y
446 Split into Y
447 Merged with Y
448 Changed back Y
451 Other editions of a medium Y
452 Other editions of different medium Y
453 Translated as Y
454 Translated from Y
461 Collective work Y
462 Section Y
463 Single issue Y
464 Analysis of single issue Y
488 Other related works Y
500 Title proper Y
501 Collective uniform title Y
503 Unified conventional title entry Y
510 Parallel title Y
512 Cover title Y
513 Title as it appears on the added title page Y
514 Caption title Y
515 Running title Y
516 Spine title Y
517 Other title Y
520 Preceding title (series) Y
530 Key title (series) Y
531 Abbreviated title (series) Y
532 Expanded full title Y
540 Supplied title Y
541 Translated title supplied by cataloguers Y
600 Personal name subject heading Y
601 Corporate name subject heading Y
602 Family name subject heading Y
604 Author/title subject heading Y
605 Title subject heading Y
606 General subject heading Y
607 Geographical name subject heading Y
610 Key words not used in thesaurus Y
620 Publication place/production place/accessing point Y
660 Geographic name code Y
661 Time duration code Y
675 UDC Y
676 DDC Y
680 LC Y
686 Other classification numbers Y
690 Classification for Chinese Libraries Y
692 Classification for the Library of the Chinese Academy of Sciences Y
700 Personal name—primary responsibility N
701 Personal name—alternative responsibility Y
702 Personal name—secondary responsibility Y
710 Corporate name—primary responsibility N
711 Corporate name—alternative responsibility Y
712 Corporate name—secondary responsibility Y
720 Family name—primary responsibility Y
721 Family name—alternative responsibility Y
722 Family name—secondary responsibility Y
801 Source information of record Y
802 ISDS centre Y
905 Holdings Y
