
Michael S. Blekhman

MACHINE TRANSLATION: PROFESSIONAL EXPERIENCE


An introduction to an introduction
1998

Part 1. Philosophy

1.1. What is this book about?

This brochure is a kind of «introduction to an introduction to computer translation». It
doesn’t even have a list of references. It can hardly be looked upon as a thorough
compendium like the well-known brilliant book by John Hutchins and Harold Somers.
My task is much more modest. The fact is, I remember only too well the way I felt when my
teachers tried to introduce me to automatic text processing when I studied at Kharkov
State University. They were very serious and professional, which scared me terribly! It
would have been easier for them to address people of their level of knowledge, but a boy
or a girl of eighteen or twenty needs something to begin with before mastering deep ideas
and complicated terminology. I asked myself many times when listening to complicated
«introductory» courses: «If this is a mere introduction, will I ever be able to make head or
tail of the post-introductory course?»

As far as I understand, young people first want to know what it’s all about. They need
an introduction... to the forthcoming introduction. Lecturing to students of the Department
of Intelligent Information Systems at Kharkov Polytechnical University, I have seen this
dozens of times. I made up my mind to introduce the young ones to the magnificent
world of translation. I decided to make them my friends and allies, and only then
colleagues. I try to get them interested, just the way one of my idols (a teacher, in fact),
Lewis Carroll, did when asked to tell three little girls a story. My ‘story’ will be far from
being that brilliant, but I do hope the readers will first like it and then find it informative.
So I address this book to university students in the first place, but I’ll certainly be happy if
others also read it.

This book would never have appeared if not for my colleagues, who, at the same time,
are my friends, and two of them, Olga Bezhanova and Marina Bezhanova, are my
daughters.

I would like to say a few words about these people.

Mrs. Alla Rakova and Mr. Andrei Kursin are programmers and members of my company.
Both of them live in Kharkov, and we have been together for several years.
Mr. Jevgeni Pinchasik, an Amsterdam-based programmer and linguist, joined us in 1994,
the reason being The European Conference on Artificial Intelligence (ECAI) held in
Amsterdam. He took the trouble of presenting the PARS 3.0 system at the conference,
which required understanding and analyzing numerous subtleties.

As to Olga, she is a serious and promising young linguist and translator. Being fluent in
Russian, Ukrainian (her native tongues), and English, she is also doing French, German,
and Spanish very actively and reads classics in the original. By the way, my younger
daughter, Marina, though only 15, is also an active member of my company, her English
and German being not bad!

The book calls a spade a spade: I am far from wishing to persuade you that our MT
systems are perfect, or that we have at least solved the principal language engineering
problems. Nothing of the kind. They are further away from the ideal than from the zero
point. I will try to give the readers objective information - and let them be free in their
conclusions.

This book is a kind of summing up. It shows what has been done, and outlines some
prospects. At the same time, it will never be «finished» because, as the reader will
understand, summing up does not mean giving up. The work is going on, and it’s only
natural that I am planning to issue more up-to-date versions from time to time. And we
keep improving our products in order to furnish our customers with the best
commercial MT systems they can purchase on the market. So if you are interested,
don't hesitate to contact the author, that is, me! The third version of the book is going to
be still more interesting, to a large extent thanks to your comments.

1.2. Preliminary notes

More than twenty years have passed since my first scientific conference on automatic text
processing, held in 1977 in Kishinev, Moldavia, in the former Soviet Union. Between 1976
and 1989, I worked at the VNIITelektromash research institute in Kharkov, Ukraine. My
specialization was automatic abstracting and indexing, as well as discourse analysis. My
dissertation, which I defended in 1986 in Leningrad (St. Petersburg), analyzed the category of
definiteness in English texts.

During 20 years of my professional activity, teams I headed have developed computer
systems for information retrieval, abstracting, indexing, and, certainly, machine
translation.

I became a machine translator in 1986. My first prototype MT system was developed in
1987. I called it PARS, which is the abbreviation of the Russian name that means
«English-Russian translation of abstracts and papers».

In 1989, thanks to Mikhail Gorbachev's reforms, I moved to the private sector of the
Soviet economy. The first private company I worked at was the Escort innovation center.
We developed machine-aided translation systems.

In 1990, I joined Medicom Ltd, where PARS-2 and PARS/RU, the first Russian-
Ukrainian-Russian MT system, were developed. We became well-known. The Supreme
Soviet and Cabinet of Ministers of Ukraine became our customers among hundreds of
other organizations, enterprises, and institutions.

In 1993, I set up my own private company, which I called Lingvistica ‘93. It was based in
Kharkov, the second largest city in Ukraine, and it develops machine translation and
machine-aided translation systems as well as computer-based dictionaries. At present, it
is named Lingvistica ‘98, and it is in Montreal, Canada, where my family and I
moved in 1998.

I have the honor of being one of the numerous pupils of an outstanding Russian linguist,
Prof. Raimund Piotrowski, who has brought up dozens of specialists, among whom are
the authors of the well-known Stylus translation system.

I am also happy to name Prof. Victor Berzon and Dr. Boris Pevzner as my teachers: the
former (he passed away several years ago) was one of the most authoritative specialists in
discourse analysis in the former Soviet Union, while the latter (now residing in Israel) was
the first to formulate the idea of example-based machine translation in the Soviet Union (in
the early 1970s!).

In 1994, I wrote the first version of this book, a pamphlet which I entitled Description of
Russian and Ukrainian Morphologies in Commercial Machine Translation
Systems. People who read it said the game was worth the candle. They, especially Zoe
Mizuho of Brown University, inspired me greatly, assuring me that the book would be of
interest to university students taking a course in machine translation. This is the
greatest compliment for me! Well, if what I write is appreciated by those 25 years
younger, this makes me feel just as young myself.

1.3. History and principles

In 1998, I answered several questions put to me by Andrew Joscelyne of Language
International. Here is that dialog (though a bit modified), which, I hope, will be of
interest to my readers as it sheds some light on some basic notions and principles of
machine translation as well as on the history of MT.

Question:

As founder of the machine translation company Lingvistica ‘93 and creator of the PARS
family of MT systems, could you briefly resume the Russian / Ukrainian world of MT
development in the 70s and 80s (principal people but also main users and successes of
MT systems)?

Answer:

Machine translation dates back to the 1950s in the former Soviet Union. I could tell you
about the Soviet MT pioneers, but it’s much easier for me to concentrate on the 70s and 80s
simply because I graduated from the Kharkov State University in 1974, so I was a
witness of and then a participant in the development.

The first person to mention is, without a doubt, Professor Raimund Piotrowski, a man whose
role in Soviet language engineering has been really great. He is both a brilliant linguist
and a very energetic organizer. In the early 1970s, he founded the All-Union linguistic
group which he called Statistica Rechi (‘Speech Statistics’). It united language engineers
from all over the USSR: Leningrad, Moscow, Ukraine, Kazakhstan, Moldavia,
Uzbekistan, Azerbaidzhan, etc.

The first operational Soviet MT system was developed in 1976 at the Chimkent Teachers
Training College, by the Kazakhstan subgroup headed by Prof. K.Bektayev and Prof.
P.Sadchikova. The system ran on IBM-compatible mainframes and performed word-for-
word and phrase-for-phrase English-Russian translation of patent chemical texts. The
system was used at the Institute of Chemistry, Kazakhstan Academy of Sciences.

Piotrowski’s Moscow colleague, Prof. Yuri Marchuk, Director of the All-Union Center
for Translations, headed an MT project covering 3 language pairs: English-Russian
(AMPAR), German-Russian (NERPA), and French-Russian (FRAP). The AMPAR
system was launched in 1977. It was used for generating raw translations of technical
texts both at the Center and at some departmental research institutes. Marchuk published
a 2-volume English-Russian contextual dictionary that can be used (and I am planning to
use it in my forthcoming projects!) for disambiguation purposes. Dr. Yevgeni Lovtski
developed a special language for representing linguistic rules in AMPAR.

Doctors Boris Tikhomirov, Zoya Shaliapina, and Nina Leontieva investigated
various aspects of semantic-based and transfer-based MT. I believe that Zoya was the
best expert in Japanese-based MT in the USSR.

Dr. Boris Pevzner, my older friend and one of my teachers, published in the early 70s a
series of papers on example-based text processing, which I consider revolutionary. The
PARS «distant phrases» have very much to do with Pevzner’s ideas!

In the 80s, the Leningrad subgroup of Speech Statistics headed by Raimund Piotrowski
himself and his best pupil, Prof. Larisa Beliayeva, the most charming of all linguists I
have ever seen, launched an integrated language engineering project which included:
• MULTIS, a multilingual MT system based on what Larisa called MARS - a
multi-aspect Russian automatic dictionary (my PARS systems include a grammatical
Russian dictionary which resembles MARS to some extent!); the main language pairs
were English-Russian and French-Russian; the latter direction was headed by Dr.
Tatiana Apollonskaya;
• a system for automatic topic recognition preceding machine translation of
information messages; the system was designed by Dr. Yelena Shingareva;
• automatic abstracting of information messages, the project headed by one of my
teachers, Prof. Victor Berzon, and myself.

The corporate user of the system was a large governmental analytical bureau that
processed hundreds of such messages every day.

MULTIS was the first Soviet MT system running on personal computers. It was made
operational in 1988-1989 by Larisa Beliayeva as the ideologist, and Svetlana Sokolova
and Alexander Serebriakov, the programmers. MULTIS was an implementation of
several basic ideas put forward by Raimund Piotrowski back in 1971 in his epoch-making
paper in Problemy Strukturnoi Lingvistiki (‘Problems of Structural Linguistics’). One of
them consisted in assigning a single generalising translation to each ‘polysemantic’ word
(that is, a word having more than one meaning) instead of several translations (this is
where PARS differs from MULTIS). Some time later, the Stylus system was developed by Sokolova
and Serebriakov, based on the MULTIS linguistic principles, though much more efficient
from the technological point of view.
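
The difference between the two dictionary principles can be shown with a minimal sketch. It is only an illustration, not code from MULTIS or PARS: the entries, the ordering of the variants, and the function names are all invented for the example.

```python
# Hypothetical dictionary entries, invented for illustration only.
# One generalising equivalent per polysemantic word (the MULTIS principle).
SINGLE = {"plant": "установка"}

# Several equivalents per word (the PARS choice); the ordering is an assumption.
MULTIPLE = {"plant": ["завод", "установка", "растение"]}


def lookup_single(word):
    """Return the one generalising equivalent, or the word itself if unknown."""
    return SINGLE.get(word.lower(), word)


def lookup_multiple(word):
    """Return every recorded equivalent, or the word itself if unknown."""
    return MULTIPLE.get(word.lower(), [word])


print(lookup_single("plant"))
print(lookup_multiple("plant"))
```

With a single equivalent the program never has to choose, but the reader must reinterpret the word in context; with several equivalents, either the program or the post-editor has to pick one.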

Piotrowski’s main idea was what he called the engineering approach to language
modelling. My teacher argued that developing an MT system is a complicated process
consisting of numerous stages. The linguist models the text, implements it in an
operational (not hypothetical!) program, analyzes the results, modifies the model, and so
on, thus «growing» the system up from the «napkin» state. That’s exactly what we have
been doing to the PARS systems for almost 10 years now!

The 70s-80s were a period of scientific confrontation of two conceptions: the practical
(«engineering») approach to machine translation, most vividly expressed by Raimund
Piotrowski, and the theoretical approach, backed by such outstanding linguists as Prof.
Igor Melchuk and Prof. Yuri Apresian. They opposed the idea of automatic translation to
Piotrowski’s machine translation, and argued that the linguist’s task is to offer an
in-depth description of the language as the foundation of an AT algorithm instead of gradually
improving an imperfect MT system. Apresian’s group developed the ETAP family of
pilot MT systems translating from French and English into Russian. It’s interesting that
the word-for-word English-Russian translation module was used for translating patent
titles in the INPADOC information retrieval system.

A Kiev group headed by Oleg Galchenko developed an efficient system for translating
English patent titles into Russian based on a dictionary of 100,000 technical terms!

As to PARS, its 1st version was launched in 1989 and implemented at the Georgian
Medical Information Center for generating raw translations of the MEDLINE database
abstracts.

However, it was in the 1990s, with the advent of personal computers, that machine
translation was made accessible to hundreds of thousands of end users. Would it have
been possible without the first steps made by our teachers?

Question:

You are a practising translator (you have, for example, translated Alice's Adventures in
Wonderland). In which ways has a translator's experience and knowledge influenced the
design of PARS (and its ongoing developments), compared with purely "computer
science" type MT designs? How important is the role of the translator in your opinion?
Does it depend on the usage of the system or its fundamentals?

Answer:

My first impulse was to say that being a translator is by no means an advantage in
developing an MT system because translating is art, and you can’t make anyone,
including a computer, an artist. In other words, MT is not translation. I remember very
well translating Alice’s Adventures. You know what I had to do? Trying to make the
story funny and amusing, I had, more often than not, to invent, not even translate!

Of course, you may say that a technical text is by no means as hard to translate as Alice.
Yes, but there are some problems, too. When I worked as a technical translator at the
VNIITelektromash research institute, another teacher of mine, Vladimir Terletsky, a brilliant
chemist and metallurgist, showed me the translation of an English paper on powder
metallurgy, one paragraph of which sounded absolutely senseless, though it was
quite smooth syntactically. The translation had been made by a professional translator at
the Chamber of Commerce. Terletsky asked me to translate the same paragraph
word-for-word, as close to the context as possible, without trying to understand it. «It’s up to me
to make head or tail of it», he said. After I did, he exclaimed, «Thank God! Everything
is clear now!» It was clear to him, not to me.

So, my translation was a success because I acted like a computer program: I simply
substituted Russian words for the English ones and put them in the proper morphological
forms.
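
To make the idea concrete, here is a minimal sketch of such word-for-word substitution. It is not the PARS algorithm: the tiny glossary and the function name are invented for illustration, and the morphological step (putting each Russian word in its proper form) is left out entirely.

```python
import re

# Hypothetical toy glossary; a real MT dictionary holds many thousands of entries.
GLOSSARY = {
    "powder": "порошок",
    "metal": "металл",
    "and": "и",
    "iron": "железо",
}


def word_for_word(text, glossary=GLOSSARY):
    """Replace each known English word with its glossary equivalent.

    Unknown words are left untouched so that a human editor can spot them.
    Word order and punctuation are preserved exactly as in the source.
    """
    def substitute(match):
        word = match.group(0)
        return glossary.get(word.lower(), word)

    return re.sub(r"[A-Za-z]+", substitute, text)


print(word_for_word("Iron and metal powder"))
```

The output keeps the English word order and dictionary forms, which is exactly why such a raw translation looks clumsy to a linguist and yet may be perfectly clear to a subject specialist.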

At the same time, being a translator is a great thing for a machine translator. I always
understood very well that my colleagues might be disappointed with the numerous
childish mistakes PARS would be making. I even thought that all of them would prefer
extensive electronic dictionaries instead of an MT program. That’s why I am always
pleased and surprised when the system is praised by a translator. And the paradox lies in
the fact that it’s the most skilled and experienced translators, such as my late friend
Vladimir Kolykhmatov who worked for the Dupont company, who find PARS useful in
their everyday work.

One of the brightest examples of a translator’s attitude I have ever experienced was
a PARS presentation at the Antonov Aviation Plant in Kiev. I was surrounded by a group
of brilliant professionals who were watching, somewhat skeptically, the computer screen
while PARS was busy translating a technical text from Russian into English. They
analyzed the result attentively, and I asked one of them: «What do you think about it?»
What he said amazed me: «Well, it translates like a student». «A fresher or a senior?» I
asked him. He thought a little and said, smiling: «Like a sophomore». What he meant was
that the translation was quite understandable but rather primitive. «You flatter me!» I
replied. «The student is human, while the computer is not!»

Well, I did use my translator’s experience when I designed PARS. The peculiarities of
our systems consist in the service options rather than in the translation algorithms, and
the former were introduced because I am a translator.

Our systems feature specific target text post-editing facilities. The unique pen editor
being developed by a team headed by my friend, Alexander Kazakov, will provide hot
keys for the most typical editing operations performed by professional translators.

Another specific feature is the dictionary updating subsystem. I am really happy to hear
translators saying that PARS is user friendly from this point of view, and that they create
dictionaries of their own reflecting their professional experience. I think, however, that
the program can and should be further improved!

My elder daughter, Olga Bezhanova, and I use our MT systems together with the
Polyglossum scientific and technical dictionaries, developed by ETS Publishers Ltd., for
making professional translations. I always ask Olga to keep records of her work when
using these systems. The main conclusion she came to is this: a professional can’t do
without MT and MAT if he/she wants to be competitive! Olga edits 30 pages of technical
texts a day after Russian to English MT, while her colleagues not equipped with
computer programs do much worse, first of all because they lack serious on-line
professional dictionaries. If the subject area is covered by PARS specialist dictionaries,
editing the raw translation is several times easier than translating the same text manually.

You know, deep in my heart I suspect that machine translation as a scientific task is a
mathematical problem. But my practical experience tells me that an operational MT
system can only be designed by a linguist. Life is really full of paradoxes! I don’t know
how I would be able to head an MT team if I were a mathematician rather than a linguist
and translator.

Question:
Looking back, what do you feel are the most significant dimensions to developing an MT
system and in what order of priority: fitting in with "international standards"?, having
good dictionary compilers?, developing tools to help in the "industrial" process of
inputting dictionaries?, having a single end-client with a specific translation need, or
alternately total freedom to try out anything at all? Is MT design an exercise in computer
science or the art of communication engineering? Is it setting quality targets or quantity
targets (i.e. numbers of words in dictionary, number of rules)? Does the constant change
in the computing platforms (from DOS to Windows NT, the Mac/Windows battle, the
solutions to character coding etc.) affect your decisions?

Answer:

A very good question, though it requires looking forward rather than back! Let me try to
evaluate each of the above criteria using a 10-point scale.

Speaking of the «international standards», and standards in general, I am sure that the time has
not come yet for setting up standards in the MT area. System developers are still looking
for optimum solutions. We are united by computer hardware and software platforms, and
it’s a conditio sine qua non to make an MT system compatible with at least some of them,
but dozens if not hundreds of linguistic and technological solutions can hardly be
standardized, at least now. 5 points.

The dictionary updating and compiling tool is one of the most important characteristics of
an operational MT system. Boris Pevzner taught me that I should only develop
technologically efficient systems, that is, systems which can be easily used by as
many people as possible. This condition presupposes tuning the system to the user’s
requirements. A flexible dictionary updating program is sometimes an even more
important condition put forward by a professional translator than the translation quality
itself! An MT system is a product made by linguists and programmers for people who
have nothing to do with linguistics and programming (translation and linguistics are two
different things!), and we have to ask ourselves, «Look, how would I feel if I were the
user who wants to enter new words into the dictionary?» The procedure suggested by the
program must be natural and understandable.

Let me compare this with translating Alice’s Adventures: when I came across a pun or
some specific English expression, I did my best to find a translation, but I never included
it in the final variant until I offered it to Olga, who was 7 then - Alice’s age! I never told
her I was translating something - she didn’t have to know about my technical problems, I
simply offered her the Russian joke invented by me, and the only criterion was whether
she smiled or not. If the user smiles when entering new words into the dictionary, you did
well. If he or she looks serious, maybe your solution was not the best one. 10 points.

As to an industrial process of compiling dictionaries for MT, it’s one of my company’s
«visiting cards». Together with ETS Publishers, we developed a «conveyor» technology for
inputting more and more dictionaries into our systems. A colleague and friend of mine,
Igor Fagradiants, enlisted Russia’s most eminent lexicographers to develop the world’s
largest specialist English-Russian and German-Russian bidirectional dictionaries, which
are then converted into the PARS format. We are implementing a breakthrough
technology for interactive, example-based conversion of dictionary text files into PARS. This
will make dictionary compilation an industrial process backed up by first-class
lexicographers! 10 points.

Having a single client is another conditio sine qua non. One of my principles is to have a
definite user in mind when developing an MT system. This principle dates back to 1980,
when I began developing a retrieval system at VNIITelektromash. It was not easy to
convince my bosses, very wise and experienced people, though somewhat conservative,
as a real boss should be, to finance the work. I needed someone who would support me
and use the system in his or her everyday professional work to provide feedback. Dr.
Vladimir Terletsky, head of the powder metallurgy laboratory, was such a person. We
discussed system structure and ways of practical usage so scrupulously and carefully that
I saw the light clearly, and, due to this collaboration, the system was really made
technologically efficient.

Another such person was Dr. Valeri Yepifanov, whose assistance let me understand the
necessity of the topic recognition engine, which made our system an absolutely unique
tool for information retrieval. By the way, it was based on Pevzner’s ideas of example-
based text processing!

Developing an «organism» as complex as an MT system is hardly possible without
communicating regularly with one or several users, preferably amiable, though
reasonably critical. Of course, you should not accept all their advice and demands
uncritically, but you must take them into account! More than that, an exceptionally important
thing is that, in this case, you see that the system you are developing is needed, a feeling
which can hardly be overestimated!

Thus my principle is: develop your system for someone you know very well, and then
it’ll be accepted by many others! 10 points for having a concrete end-client + 7 more for
the possibility to know the opinions of as many «invisible» end-clients as possible.

What is MT design? Oh, well, it’s everything! An exercise in computer science? Yes.
You can’t design a competitive MT system if the most talented and progressive
programmers are not engaged. The art of communication engineering? Of course,
especially if you mean that the system is supposed to be part and parcel of a complex
technological process, not just a stand-alone program: you have to suggest a technology
for using the MT system, which is more than «simply» writing a translation program.

And it also means quite a lot of other things, first of all a challenge for a linguist because
teaching a computer is much harder than teaching a child: the child masters the basics
very soon and keeps studying on his/her own, sometimes asking you for assistance, while
the computer program always remains dependent on you, so «a perfect MT system» is
something I can hardly imagine. Thus in total:

• MT as a computer science task - 9 points;
• MT as a communication engineering task - 6 points;
• MT as a linguistic task - 10 points.

Is MT setting quality targets or quantity targets? Both, I believe. These two criteria
should not be separated from each other. Some of my colleagues say that a very large
dictionary is practically everything needed for producing good quality machine
translations since the lion’s share of the information is in the words, not in the syntactic
relations. This may result in a try-anything industrial-scale system designed for making
average quality, raw translations. In this case quality is sacrificed for quantity. Others
argue that quality is more important, and overburden the system with semantic data
making it hard for the end user, being no linguist, to extend the dictionary. This results in
an almost perfect pilot but not industrial-scale system. In this case quantity is sacrificed
for quality.

As to me, I prefer compromises. This is my profession de foi. An MT system should be a
good combination of quantitative and qualitative factors. However, if you ask me which I
prefer: to make a system simpler for the end user, i.e. potentially larger, or more
complicated (again, from the user’s point of view) and smaller, I will choose the former.
A simple system will be used and extended, that is, to some extent, improved. A
complicated system will hardly be used, which is why it hardly has any future.
Quantitative characteristics: 10 points. Qualitative characteristics: 8 points.

And, last but not least, the computer platforms. Well, this affects our solutions
greatly! Sometimes this almost drives me crazy. As soon as we come up with a new
version of a translation system, «those Americans» invent something else, so we have to
update our products. Since 1994, we have come a long way from DOS systems to
Windows 95 and Windows NT. All PARSes are linked to MS Word 6.0, MS Word 7.0,
and MS Office 97. And people from all over the world ask me: «Look, can you make
your systems compatible with Unix? Macintosh? Sun?» But I only choose those
platforms that can attract many customers instead of making PARS universal. Maybe
sometime, when I get rich, PARS will run on all kinds of machines in all environments,
but not now, unfortunately, although it’s no problem from the technical point of view.

At present, we prefer the most popular platforms to meet the major market requirements.
Windows 95 and Windows 98, as well as Windows NT, with WinWord 7.0 and MS
Office 97 are very broadly used all over the world, which makes our life easier. There are
no problems with fonts there, while Microsoft Tool Kits simplify compatibility! Number
of platforms: 6 points.

Question:
How do you relate to the ongoing international attempts to develop shared resources in
dictionaries (e.g. standard coding formats for dictionary entries) and corpora (tagging
etc.)? Do you feel it is important to you as a commercial developer to participate or
accept them? Or is it a loss of competitive advantage? Or do you prefer to wait for the
universities, Microsoft and the official organizations to sort it all out on their own first?
Do you feel you have a role to play in discussions on such standards?

Answer:

Shared dictionary resources are something I like very much! In general, I prefer
cooperation to competition. I would be very glad, even happy, to participate in
developing such resources as well as using them. For instance, one of my dreams is
making a German-English-German system, and ELRA offers extensive METAL
dictionaries. Why not use them? The same applies to the large German, Spanish, and French
dictionaries we would like to use for developing MT systems for other language pairs.

As to Microsoft, universities, and official organizations, I believe we can do the job
better. I am sure I have a role to play in discussions on such standards! You see, PARS
dictionaries, to say nothing of Polyglossum, now total more than 1 million terms, so I
believe that our opinion is worth listening to! I hope to present our MT dictionary
generation technology at forthcoming international forums, and I hope that some of my
colleagues will be interested in collaboration. Although the language pairs we process are
English-Russian, English-Ukrainian, German-Russian, German-Ukrainian, Russian-
Ukrainian (all are bidirectional), the principles can be extended to other languages and
language pairs. Besides, our dictionaries can be converted into a text format to be used
outside PARS.

Question:

You have developed the PARS MT family for handling Russian and Ukrainian and
English and German in various combinations. Can you explain the decisions which led
you to develop these systems: the commercial validity of Russian and English is certain,
but what about the others? Do you develop on the basis of "strategic" language pairs, or
ones you know, or ones with immediate commercial utility or for sentimental reasons?

Answer:

My first PARS project was launched in 1985 and covered English to Russian, which was
only natural for the Soviet Union in those years. It would have been strange to begin
one’s MT career with anything other than an English-Russian system!

In 1989-1990, we were financed by Marchuk’s Translation Center to develop a
Russian-English system, and we did it, being the first in Ukraine.

Developing a Russian-Ukrainian bidirectional product was a strategic decision, and again
we were the first. We came up with PARS/RU (the Russian abbreviation is РУМП) in
1990, and it is very popular in Ukraine. More than that: it was our main commercial
product for several years because the importance of Ukrainian is growing every day.
Unfortunately, computer pirates produce thousands of copies on CD-ROM, which makes
our life very difficult.

PARS/Ukrainian was to some extent the result of my reading the paper by Bogdan
Onyshkevich in a book on machine translation edited by Sergei Nirenburg (my
University friend, by the way). That paper described an English-Ukrainian prototype MT
system, and I decided to make a system to translate between Ukrainian and English. In
1992 we were financed by the computer department of the Ukrainian Supreme Rada
(Parliament), and came up with the world’s first Ukrainian-English-Ukrainian operational
MT system. Now its newest version is marketed in North America by the Montreal-based
Yevshan company. It was presented at the AMTA conference held in Montreal in 1996, at
MT Summit VI, in San Diego, in 1997, and at the University of Toronto in 1998. Some
of PARS/U dictionaries are quite unique, such as those on computers and
telecommunications developed by Dr. Orest Kossak and Dr. Roman Kravets.

The DOS version of PARS/D, the German-Russian system, was commissioned by the
Izvestia Concern, Moscow, to translate VWD information messages. Unfortunately, we
failed to complete the project for a number of reasons. In 1997, however, we resumed the
work thanks to financing from the Hamburg-based Igor Jourist Verlag. It’s a commercial
product translating between Russian and German now, but it needs intensive
improvement since you can’t develop a commercial MT system overnight. By the way,
PARS/D is Ukraine’s first MT system for this language pair.

The PARS/DU Ukrainian-German-Ukrainian system was developed in 1998 in the
framework of the KOPERNIK project launched by the Ukrainian Ministry of Education.
The project includes all our 5 PARSes on CD-ROM, in a nice package, with the user’s
guide in 4 languages. The number of dictionaries is very limited, which makes it possible
for us to sell the disk at a very low price to university, college, and high school students
and have something in store for professionals. You will be surprised to know that a
Ukrainian student who produces his/her student’s card can buy a KOPERNIK for the
equivalent of $13.

Question:

Quote from your book: "the more socially important a language is, the more attention it
requires from computer science". Could you make any predictions about the "emergence"
of socially important languages in the CIS or former Soviet countries in general, and how
you see the strategic development of MT in the region?

Answer:
I think that the role of Russian will be at least as high as it is now, and that of Ukrainian
will be growing rapidly provided that the left-wing forces do not win the future
elections to the Supreme Rada or the next Presidential elections. The need for MT
systems translating between Russian and Ukrainian is tremendous now in Ukraine, and it
will be increasing (see item 7).

To my great disappointment, I can see no other language in ex-Union whose importance
would become international in the nearest future. In 1990-1991, we discussed a Russian-Georgian
system with Dr. Roman Serebriany, Director of the Georgian Medical
Information Center, but we lost contact after the war broke out in Georgia. I have written
several letters to Roman, but he didn’t answer; I suspect he didn’t even receive them.

My colleagues tell me that the Government of Tatarstan, one of the Russian autonomous
republics, is investing serious money into developing MT systems for the Tatar language.
Let’s wait and see…

Speaking of strategic development of MT, my opinion is that Russia and Ukraine will
remain absolute leaders in this field in ex-Union. We have very serious linguistic and
programming traditions which our colleagues in other post-Soviet republics can hardly do
without in the nearest future. More than that, I don’t think that anyone there will be willing to
develop MT systems of their own for the Russian-English language pair; they
would rather use Stylus or PARS. However, I may be mistaken, of course.

Question:

Can you tell us anything about the state of and future prospects for the market for MT
systems in Russia, the Ukraine and elsewhere? What languages are involved, what
sectors of business, how MT fits in with the growth of Internet usage (e.g. I see a version
of PARS can be embedded in Netscape Navigator), what type of end users are showing
an interest in it for what types of translation tasks?

Answer:

The prospects in Ukraine and Russia depend, to a very large extent, upon the
Governments: either they launch serious prosecution of computer pirates, or some or
even most of the MT companies will simply disappear. Now, you may see all kinds of
pirated CDs in almost any Ukrainian or Russian computer store or at other places where
CDs are sold, such as the numerous «electronic market places». One of the freshest disks
is named «Flint’s Treasures», and it bears a picture of a hideous pirate taking his (his?)
treasures out of a bag. Hundreds of thousands of Styluses and PARSes can be pulled out
of those bags!

When I was in Ukraine, people called me on the phone asking me to help them with pirated
versions of PARS/Russian-Ukrainian and PARS/Russian-English: sometimes they
wouldn’t work at all. Others told me that pirated disks discredited our products, making
people think that the original versions were as bad.

Another item of importance is training qualified language engineers. You may be
astonished to hear that such specialists are not trained at Ukrainian universities. More
than that, language engineering as a scientific discipline is not present in the list of
specialities of the Ukrainian Highest Attestation Commission, so it’s rather difficult to
defend dissertations on language engineering in Ukraine! I hope, however, that at least
my initiative to organize a language engineering department at Kharkov Slavonic
University will be supported. I also have an idea of organizing an international student
language engineering group (see item 8).

As to the users, I can subdivide them into the following groups:

Individual users

A very numerous subgroup is made up of students who need their diplomas and other
kinds of papers to be translated from Russian into Ukrainian. We hope to meet their
requirements with the KOPERNIK CD-ROM and convince at least some of them to
abstain from using pirated disks.

Some people want to communicate with people living abroad. PARS/U is bought, in
particular, by Americans and Canadians wishing to communicate with their friends and
relatives residing in Ukraine. One of them told me: «They speak Ukrainian, and I speak
English. The only way to communicate is to use a computer program». I wonder if one of
the international pen pal organizations might be interested in using PARSes for
communication purposes. It would certainly require serious modifications to the systems
in order to take into account peculiarities of this style, but the idea itself seems rather
promising to me.

Professional free-lance translators in ex-Union make up another subgroup, though less
numerous. Their language pairs are mainly English, German, French, Italian to and from
Russian. Some of them like MT systems, some prefer MAT software (electronic
dictionaries such as Polyglossum), while others buy both. My opinion is, however, that
the majority of this group are still our potential clients. The fact is that the foreign
languages departments of Ukrainian universities train people who are good at languages
but have no idea of the computer as a translator’s everyday tool. Introducing elements of
language engineering at such departments would contribute a lot to expanding the circle
of our conscientious customers!

There is a group of individual users who require Russian to English translation of
scientific texts. Here is an example. A scientist asked me to translate his medical paper
for submission to a serious British journal. When I looked through the text, there was
only one thing which I understood - I could not do without PARS because the paper was
abundant in «awful» medical terms. I faced a dilemma: either to translate the text
manually looking every second or third word up in the Polyglossum Russian-English
medical dictionary, or to let PARS make a draft translation and post-edit it. I chose the
latter variant, and the paper was accepted.

Corporate users

MT and MAT systems seem to be very popular with corporate users.

Generally speaking, all kinds of organizations, both state-owned and private, use
PARS/RU for translating official documentation, including that of financial, scientific,
and technical nature, between Russian and Ukrainian.

Many Ukrainian banks use PARS/RU for translating financial documentation, such as
official instructions, between Russian and Ukrainian. Here is another example. In 1997, I
installed PARS/RU in one of the banks in the town of Saki, the Crimea. They use it to
translate megabytes of instructions they receive electronically from the Ukrainian
National Bank. Those texts are written in Ukrainian, the country’s state language, and the
problem is that many people in the Southern and Eastern parts of Ukraine don’t even
understand Ukrainian, to say nothing of speaking it.

A tendency that gains popularity is making MT systems part of integrated products, such
as PRAVO, a system very well-known in Ukraine. It is supplied on CD-ROM and
comprises the full set of Ukrainian laws and decrees, with a retrieval system and our
Ukrainian-Russian translation module. Our Ukrainian to English and German modules
will also be added.

I am especially proud that PARS/ER was used for translating Russian medical abstracts
into English for the Medical Practice journal published in Kharkov. I did it myself, first
running the texts through PARS and then post-editing the raw translations. Using MT
systems for translating abstracts in scientific journals may become a tendency.

Large plants and design bureaus that export their products are among the users of the
PARS/ER system. The Yangel Spacecraft Bureau in Dnepropetrovsk is among them. We
supplied PARS/Avia to them, which includes the core Russian-English-Russian system
and a number of terminological dictionaries on aviation, space, communications, etc.
Their reaction is very important for me: they say that PARS is better for translating
technical documentation, while Stylus is preferable for business correspondence. Well,
we’ll try to be up to the mark in both aspects!

A new tendency is the application of PARS to translate Russian textbooks and courses of
lectures into English for foreign students coming to study at our universities. A vivid
example is described by Olga Bezhanova: we translated Russian texts on aviation for
Kharkov State Aviation University, to which students from Iran were coming to study.

MT can and should also be used for purely academic purposes. An example is using
PARSes at Kharkov State Polytechnical University in the course of machine translation at
the Department of Intelligent Information Systems. Presently, we are going to set up a
department of language engineering at Kharkov Slavonic University. I plan to implement
all our systems there.

Access to the Internet and E-mail will contribute to a higher role of MT. However, this will
require not only technical (which is comparatively simple) but also linguistic solutions
because colloquial texts, which are very often to be found on web sites, to say nothing of
E-mail messages, are very hard to translate automatically. I am sure that the Internet and MT
will stimulate each other greatly. And this application is very promising. The fact is that
Internet resources are in fact inaccessible to Russian and Ukrainian speaking scientists
because of the language barrier, and so are the Russian and Ukrainian publications for the
English-speaking community. You should take into account that the state system of
scientific information, which was the pride of the former Soviet Union, does not exist in
Ukraine for a number of reasons, so the Internet will be a very good, though not the only,
source of information if the decision is taken to build up such a system in this
country.

I might also suggest that you should read my pamphlet on machine translation to find
more examples of using PARS, some of them being quite amusing and even instructive,
as I hope.

Question:

And finally, can you tell us how you see the evolution of your PARS systems: priority
language pairs, partnerships with other system developers or user companies, the kind of
commercial strategy you wish to pursue in this field?

Answer:

In June 1998, my family arrived in Montreal as permanent Canadian residents. A month
later, I was invited to collaborate with LanguageForce, a very rapidly growing California-based
MT company. To call this collaboration exciting would be to say the least, and
I am more than grateful to the company’s leaders, Ian Simpson and Yuri Mordovskoi, for
their proposal to combine our efforts. I hope that PARS will find ‘a new life’ in
LanguageForce’s Universal Translator, and I am planning to describe this work in the next
edition of this book.

One of the ideas I am cherishing consists in creating an international group of young
language engineers (linguists and programmers), preferably university students, whose
practical goal would be developing a new generation of PARSes, including new language
pairs, to be used by governmental organizations in Europe and the Americas. I think that
this group could be headed by me and some of my European colleagues, and financed
(may I dream of this?), for example, by some of the European foundations.

As to the academic goal, this work will be extremely useful for the young people who
join the group since they will learn and master much more than any University can offer!
On the other hand, maybe the universities will somewhat amend their curricula according
to the practical necessities.

I hope it would be a kind of honor to be enrolled in this group, and the students would be
willing to qualify! I am absolutely sure this would be a very serious stimulus for
developing language engineering in Ukraine. More than that, it would contribute to
eliminating skepticism in some of Ukrainian boys and girls who are greatly disappointed
with the present moral and economic situation in this country and are of the opinion that
science is no longer needed in our society and it’s much more prestigious to be someone
else rather than a scientist or engineer.

1.4. Background, or is it easy to translate?

Translating texts from one language into another is probably the most difficult task one
may think of. Marina, my younger daughter, with whom I have been communicating in
English since the day she was born, once pleaded,

«Oh, Daddy, I’d better tell you the whole story in the original than translate it into
Russian! You see, I understand everything, but I can't translate!»

When I was younger (but hardly less ambitious) I translated «Alice's Adventures» into
Russian. It was a challenge! The puns and rhymes that make any English kid laugh would
hardly mean anything to a Russian speaking child. So, I decided to render them in such a
way that the impression they made would be equal, though the linguistic meaning might
be quite different. Here is an example. This rhyme, formally, does not sound like the
famous «Father William», though it resembles Vladimir Mayakovsky's «What Is Good
and What Is Bad», which is as moralising as its English «prototype». I hope that this
example will be interesting for those speaking or learning Russian.

Крошка-сын к отцу пришел,
И спросила кроха:
- Что такое «хорошо»,
И что такое «плохо»?

«Если мальчик стекла бьет
И баклуши тоже,
Слава про него идет:
Очень он хороший!

Если в лужу он полез,
Намочил трусишки,
Говорю: Молодец!
Так держать, детишки!

Если маме нагрубил,
Бабушке и деду,
Мне такой мальчишка мил,
Дам ему конфету!

Если ж учится на «пять»,
Слабому поможет,
Про такого говорят:
Очень нехороший!

Если он цветы полил
И сварил картошку,
Я б такого отлупил
И подставил ножку!»

Мудрый папа спать пошел,
И сказала кроха:
- Плохо делать хорошо!
Лучше делать плохо!»

Take also the Cheshire Cat: «Cheshire» means absolutely nothing to Russian and
Ukrainian girls and boys. On the other hand, you can't do without a cat in this story. My
decision was to find out another «cat», which would be familiar to Russian speaking
children: Cat in Boots! And after the cat in question was found, the rest was easy: I
turned him into The Cat Without Boots; having no shoes, he had a lot of humor instead,
that's why he smiled so often. My elder daughter approved of that: being as young in
those times as Alice herself, she liked the new cat, proving her attitude with a real
Cheshire smile!

It turned out, however, that translating was an even easier task than teaching a computer
to translate.

In my first PARS 1.00 MT system, which was marketed in the Soviet Union in 1988-1989, I
tried to make the IBM-compatible mainframe translate English scientific and technical
texts in such a way that the users could extract useful information from the machine-made
Russian texts.

Here are sample English-Russian translations of INPADOC data base documents (patent
titles) made by PARS-1:

Source text                                           Target text

Functional human urokinase proteins.                  Функциональная человеческая урокиназа белки.
Hybrid proteins.                                      Гибридные белки.
Drainage bag and non-return valve assembly.           Дренажный мешок и не-return комплект клапана.
6-hydrogen purine derivatives.                        6-водород пуриновые дериваты.
Radiolabelled sucralfate compositions.                Мечен радиоактивными изотопами sucrasalfate составы.
Insulin derivatives modified in the B 30 positions.   Дериваты инсулина модифицированные в 30 позиций.
Aminopolycarboxilic acids and derivatives thereof.    Aminopolycarboxilic кислоты и дериваты Вышеуказанного (из указанного предмета).

Since then, up to the newest PARS-3 English-Russian bidirectional, I have never been
fully satisfied. More than that, I am sure I’ll never be. Sometimes, the electronic Eliza
Doolittle merely drives me crazy. And the «cats» never stop haunting me. Once,
translating a software manual into Russian, PARS came up with something like
«Защитите кота», which means «Protect the Cat»; I looked into the original and was
terrified to see that the system did not cope with the file name: Protect.cat!

Another example of PARS's brilliance was translating Mr Green as «Господин
зеленый», that is «a green gentleman» or «Mr Green in Color».

I have to teach it so many things that sometimes I expect it to beg, «Oh, Daddy, don't
make me translate!»

Back in 1979, Professor Hays, an authority in machine translation, answered my question
why he considered deep structure analysis necessary in MT. He said that otherwise the
user «might be deceived» by the seeming correctness of the machine output. I was only
(«already», as I thought) 28, and I disagreed. Now that I am 18 years older, I have not
become a champion of deep structures, but I can see that the user is really deceived, from
time to time. Want another example? Here you are.

My elder daughter, Olga, once wrote a letter to her pen friend in the USA, and I had it
machine translated into Russian, just to see how clever PARS was at that time. The
translation was not bad, in general, but one sentence, though quite correct syntactically,
had an absolutely different semantic meaning. The source was «I have passed all my
exams», and the output - «Я пропустила все мои экзамены», which means «I have
missed all my exams». The reason was the ambiguity of the English verb «to pass».

Part 2. Technology

In this chapter I am going to introduce you to some basic notions and principles of
machine translation. Please don’t be afraid of potential difficulties, and don’t think that
there will be no more interesting things! You know, when I was a student, I was eager to
have clear understanding of mathematics and programming, and whenever I opened a
thick book on those complicated subjects, it promised that I needed no special knowledge
to read the whole book almost as fiction. Reality, however, was rather gloomy: I failed to
understand a single word. Since then, I have hated reading books and even short papers having
formulas and scientific terms. Of course, I don’t deny that such clever books are required,
but the number of people who really read them is negligibly small compared to those who
need…an introduction!

That is why “technology”, as I understand it (I hope I do understand it, after more than
20 years in language engineering), is how the program does what it is supposed to do.
My idea is “to shoot two hares at a time” as the Russians say: I will show you how real-life
MT systems work, and this will help you grasp some basic MT notions.

When reading this chapter, you should not think, however, that I consider the PARS
family of systems to be the world’s best. I should be clever enough to understand that I
am not the cleverest. At the same time, PARS is in a sense a typical operational MT
system. On the other hand, it pioneers in some aspects, such as in the approach to
dictionary updating. And, for obvious reasons, I know PARS better than any other
translating program. So why not use it as an example?

2.1. Approaching Russian and Ukrainian, or what we wanted to do

2.1.1. General remarks

The reader may know that Russian and English were the two languages that attracted the
attention of the MT pioneers headed by Professor Leon Dostert. It was back in 1954, at
Georgetown University, that the first machine translation system prototype was
developed. It translated Russian sentences into English. Since then, quite a number of
languages have been tackled by language engineering, among which are French, German,
Spanish, Japanese, to name only a few, but Russian still remains a subject of special
interest.

As to Ukrainian, it has never been so lucky. The reason is clear: Ukraine has always been
considered part and parcel of the consolidated Union of «sovereign» republics,
represented by the common Russian language. As a result, it turns out that many people
don't even seem to know that Ukrainian exists as an «independent» language. A British
friend of mine, who is a really charming and educated person, once confessed,

«To tell you the truth, I've only heard of Russian and Georgian as the languages spoken
in the Soviet Union; but what is Ukrainian - a language, or just a jargon of Russian?»

The language situation in Ukraine is rather complicated due to political reasons. The problem
is that most people in that country are native speakers of Russian, though most of them
understand simple Ukrainian. West Ukraine is the only region where Ukrainian is
spoken «since napkins», though Russian is also understood and spoken there.

The situation with Ukrainian resembles that with Hebrew. The latter was revived as a
state language, which is being done now in Ukraine to Ukrainian. According to the Law
on the Languages, adopted several years ago, Ukrainian is to be the official language in
Ukraine, which demands serious efforts from computational linguistics because it is
becoming common knowledge that the more socially important a language is, the more
attention it requires from computer science.

So, as we see, it was only natural for me to include both Russian and Ukrainian in my
«creative paradigm».

2.1.2. Peculiarities

Russian and Ukrainian feature plenty of morphological ambiguities, which make it
difficult to develop a high-quality system translating even between these close languages, to
say nothing of such a language pair type as Russian (Ukrainian) on the one side, and a
West European language, such as English, on the other.

The main peculiarity of the East Slavonic languages is their case systems of nouns,
adjectives and participles, as well as verb conjugation.

Below is an example showing declension of the Russian noun «стенд» («a stand») and
two attributes, «наш» («our») and «выставочный»(«exhibition» in the meaning of an
attribute). Russian case endings are separated from the word stems with a vertical line (|).

Case           Russian                                             English

Nominative     Это наш| выставочн|ый стенд|.                       This is our exhibition stand.
Genitive       Возле наш|его выставочн|ого стенд|а.                Near our exhibition stand.
Dative         К наш|ему выставочн|ому стенд|у.                    To our exhibition stand.
Accusative     Мы видим наш| выставочн|ый стенд|.                  We see our exhibition stand.
Instrumental   Посетители довольны наш|им выставочн|ым стенд|ом.   Visitors are satisfied with our exhibition stand.
Prepositional  Мы расскажем вам о наш|ем выставочн|ом стенд|е.     We’ll tell you about our exhibition stand.
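
For readers who like to see things run, here is a minimal sketch of how the declension table above could be stored and used by a program. It is only an illustration in Python, not the way PARS actually keeps its morphology; the names CASES, ENDINGS, decline and decline_phrase are my own assumptions, and the endings are simply copied from the table.

```python
# Case endings copied from the declension table; an empty string is a zero ending.
# All names here are illustrative assumptions, not part of any real MT system.
CASES = ["nominative", "genitive", "dative",
         "accusative", "instrumental", "prepositional"]

ENDINGS = {
    "наш":       ["", "его", "ему", "", "им", "ем"],      # possessive pronoun
    "выставочн": ["ый", "ого", "ому", "ый", "ым", "ом"],  # hard-stem adjective
    "стенд":     ["", "а", "у", "", "ом", "е"],           # masculine inanimate noun
}


def decline(stem, case):
    """Attach the ending that the table lists for this stem and case."""
    return stem + ENDINGS[stem][CASES.index(case)]


def decline_phrase(case):
    """Decline all three words so that they agree in case."""
    return " ".join(decline(stem, case) for stem in ENDINGS)


for case in CASES:
    print(f"{case:14}{decline_phrase(case)}")
```

Six endings per word is the easy part; a real description has to cover every declension type, both numbers, and animate nouns, whose accusative differs.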

Ukrainian morphological structure has many common features with that in Russian, but
there are also numerous peculiarities. There are seven noun cases in Ukrainian instead of
six in Russian, the seventh one being the «calling case» (remember ‘O, Mouse’ in
Alice?).

Besides, Ukrainian, unlike Russian, is rather poor in participles: it’s correct to say
«прыгающая девочка» («a jumping girl») in Russian, where «прыгающая» is an active
participle (Participle I) of the verb «прыгать» («to jump»), which corresponds to the
Ukrainian phrase «дiвчинка, що стрибала» («a girl who was jumping»).

All this makes a description of Russian and Ukrainian morphology a very important
prerequisite for machine translation. Why? Because otherwise the translating
program will not recognize a word form in the source text. Here is an English example,
just to make the problem quite clear for you:

Suppose a computer program is to translate English texts not prepared deliberately for
this purpose. It’s only natural to suppose that the texts will have a lot of words absent in
the dictionary, such as dogs, played, sitting, ran, etc.; the dictionary has dog, play, sit,
run, doesn’t it? How will the program know that dogs means many dog-s, and played
means play-ed some time ago? The linguist should explain this to the program by means
of providing a list of endings with the meaning of each ending. Besides, the program
should understand that -s can mean either plural or present, depending on the context.
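
The idea of "a list of endings with the meaning of each ending" can be sketched in a few lines of Python. This is a toy under stated assumptions: the tiny DICTIONARY, the ENDINGS list, the IRREGULAR table and the function analyze are all invented for illustration, and a real system needs far richer rules.

```python
# Toy dictionary: base form -> parts of speech it may have (assumed data).
DICTIONARY = {
    "dog": {"noun"},
    "play": {"noun", "verb"},
    "sit": {"verb"},
    "run": {"noun", "verb"},
}

# The linguist's list of endings: (ending, part of speech, meaning).
ENDINGS = [
    ("s", "noun", "plural"),
    ("s", "verb", "present, 3rd person singular"),
    ("ed", "verb", "past"),
    ("ing", "verb", "present participle"),
]

# Forms that no ending rule can explain are simply listed.
IRREGULAR = {"ran": ("run", "verb", "past")}


def analyze(word):
    """Return every (base form, part of speech, meaning) reading of a word form."""
    word = word.lower()
    if word in IRREGULAR:
        return [IRREGULAR[word]]
    readings = []
    for ending, pos, meaning in ENDINGS:
        if not word.endswith(ending):
            continue
        stem = word[: -len(ending)]
        candidates = [stem]
        # "sitting" -> "sitt" -> "sit": undo a doubled final consonant.
        if len(stem) >= 2 and stem[-1] == stem[-2]:
            candidates.append(stem[:-1])
        for candidate in candidates:
            if pos in DICTIONARY.get(candidate, set()):
                readings.append((candidate, pos, meaning))
    return readings


for form in ["dogs", "played", "sitting", "ran", "plays"]:
    print(form, analyze(form))
```

Note that plays comes back with two readings, a plural noun and a present-tense verb. Choosing between them is exactly the "depending on the context" part, which this sketch leaves open.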

Please mind that Russian and Ukrainian, unlike English, have dozens of endings, most of
them having several meanings!

That’s not all, however. We have to develop what is called a computer grammar for these
Slavonic languages, too. Simply speaking, we are supposed to teach our translating
program to understand the syntactic structure of the sentences. Here is what I mean.

When at school, I was a very good pupil, and I liked literature and languages most of all.
But it wasn’t easy even for me to do those numerous “sentence analysis” exercises. We
had to underline the subject with a single solid line, the predicate was to be underlined
with two such lines, the attribute was underlined with a curvy line, the object - with a
dashed one, etc. Sometimes it was an ordeal, even though we analyzed sentences in our
native tongues, Russian and Ukrainian. And you will not object that no language is native
for the computer, will you? You may ask me why the machine should be that clever.
Here is an example, in English again.

Suppose our program comes across an English sentence to be translated into Russian:

A girl was playing with her kitten.

It’s so simple that any English-speaking kid will understand it easily. However, the
computer is not English-speaking, and it has to analyze the sentence to understand that

• a girl is the subject
• was playing is the predicate in Past Continuous
• with her kitten is the object, while her is an attribute; compare: I met her and I met her
brother

This information is necessary to provide correct translation, because:

• the Russian predicate should agree with the subject in gender and number
• the verb in the Russian sentence should be imperfective
• her should be translated as an attribute

So much information required for a short English sentence!
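
To make this less abstract, here is a hedged Python sketch of two of these decisions: choosing the Russian past-tense form that agrees with the subject, and telling her the attribute from her the object. The tables GENDER, NOUNS and PLAY_PAST_IMPERFECTIVE and both functions are assumptions made up for this example; only the four Russian verb forms are real.

```python
# Assumed toy lexicon: grammatical gender of the Russian equivalent of the subject.
GENDER = {"girl": "feminine", "boy": "masculine"}

# English nouns known to the toy program.
NOUNS = {"girl", "boy", "kitten", "brother"}

# Past tense of the imperfective verb "играть" (to play).
PLAY_PAST_IMPERFECTIVE = {
    "masculine": "играл",
    "feminine": "играла",
    "neuter": "играло",
    "plural": "играли",
}


def russian_predicate(subject, plural=False):
    """Pick the past imperfective form that agrees with the subject."""
    key = "plural" if plural else GENDER[subject]
    return PLAY_PAST_IMPERFECTIVE[key]


def role_of_her(next_word):
    """'her' before a noun is an attribute; otherwise it is an object."""
    return "attribute" if next_word in NOUNS else "object"


print(russian_predicate("girl"))   # the girl was playing
print(role_of_her("kitten"))       # with her kitten
print(role_of_her(None))           # I met her
```

Even this toy needs to know the subject before it can touch the verb, which is why the analysis has to come first.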

Well, in order to describe what we did, I will first tell you what we planned to do.

My desire was to develop commercial MT systems dealing with Russian and
Ukrainian, on the one side, and West European languages, first of all English, and
also German, on the other side.

To begin with, let’s clear up the notion «commercial».

A commercial product is something people buy because they need it and can use it. Here,
we must analyze two aspects:

a) what kind of MT system people really want, and
b) what kind of MT system people can really use.

It seems to me that a commercial MT system should only have minimum (or a bit more
than minimum) information about the words in the dictionary. It has to be a compromise
between the system designer's desire to develop a powerful linguistic tool, with
maximum information assigned to the words, which would produce high-quality
translations, and the understanding that such a tool will be useless from the customer's point
of view if working with it requires too much effort from the user. In other words,
the system should be both powerful and easy to operate and customize, and this desire bears a
strong intrinsic contradiction.

The problem is that powerful linguistic apparatus, which is a necessary prerequisite of
obtaining very high-quality output, requires a lot of semantic information in the system
dictionary. That is, the words entered into the dictionary must acquire special semantic
notions, which will make the system really clever. In this case, the words will be
described not only as parts of speech that can have such and such endings, but their
senses will also be represented. Here is an example.

The minimum information for the word dog is that it is a noun, and its plural is dogs.
Besides, it’s animate. However, much more can be said about it, for instance that a dog is
a domestic animal. The latter is very important. Just take a nice sentence provided by one
Japanese linguist: I saw a dog with a telescope. The program will not understand this
sentence if it doesn’t know that a dog is not human; besides, the program requires a
special database, in which it is stated that dogs can’t use telescopes.
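
Here is a minimal Python sketch of what such semantic information might look like and how a program could use it. Everything in it is an assumption made for illustration: the feature names in LEXICON, the CAN_USE "database", and the function attach_with_phrase. It is not how PARS works, since PARS deliberately avoids this kind of dictionary.

```python
# Assumed semantic features assigned to dictionary words.
LEXICON = {
    "I": {"animate", "human"},
    "man": {"animate", "human"},
    "dog": {"animate", "domestic animal"},
}

# Assumed knowledge base: which feature a user of the instrument must have.
CAN_USE = {"telescope": "human"}


def attach_with_phrase(subject, obj, instrument):
    """Decide what 'with a <instrument>' belongs to in
    '<subject> saw a <obj> with a <instrument>'.

    Returns "verb" (the subject used the instrument),
    "noun" (the object has the instrument), or "ambiguous".
    """
    required = CAN_USE[instrument]
    subject_can = required in LEXICON[subject]
    object_can = required in LEXICON[obj]
    if subject_can and not object_can:
        return "verb"
    if object_can and not subject_can:
        return "noun"
    return "ambiguous"


print(attach_with_phrase("I", "dog", "telescope"))  # verb
print(attach_with_phrase("I", "man", "telescope"))  # ambiguous
```

The second call shows the price of this approach: as soon as the object is human, the features alone no longer settle the question.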

It’s not easy to develop such a dictionary and such a database. Please mind that the
system doesn’t simply exist the way it was bought. No, the user will necessarily want to
‘customize’ it entering new words into the existing dictionaries and creating new
dictionaries. In this case, the user will have to know as much about semantics as the
linguist who developed the system, which may make the system practically useless, for
example, for an engineer who needs translations of texts in his/her subject area, but
knows nothing of semantic categories of the words entered into the dictionary in those
cases when the dictionary needs extending. And, as experience and common sense
show, it’s hardly possible to write an algorithm for assigning semantic categories to the
words being entered into system dictionaries, so it would be up to the user!

On the other hand, too little information is as dangerous as too much of it since the
system is to be not only convenient, but also practically useful: relying on too little
linguistic data, the system will not be able to give comprehensible translations in some
cases, no matter how easy it may be to enter new words into it.

That is why the authors of an MT system have to determine the scope of linguistic
information the system will really not do without, balancing between «too little» and «too
much».

Speaking of any language, especially Slavonic, the basic data is, first of all,
morphological.

Describing morphology, the linguist is to enter as much data on the languages being
described as possible, since otherwise the texts will not be analyzed and generated
properly in the MT process. The rule is: any word should be recognized in any text in any
of its forms. And this description is to be made as clear and convenient for the user as
possible; more than that, the best variant would be to offer an automatic procedure of
assigning morphological information to the words being entered into the dictionary
(‘encoding’, or ‘tagging’ them).
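
As a rough illustration of what an automatic encoding procedure might do, the Python sketch below guesses a declension class for a Russian noun from the last letter of its dictionary form and gives up when the letter is not decisive. The function guess_noun_class and its labels are hypothetical, and the rule is far cruder than anything a real system could rely on.

```python
# Russian consonant letters; a noun ending in one of them is taken to be masculine.
CONSONANTS = set("бвгджзклмнпрстфхцчшщ")


def guess_noun_class(dictionary_form):
    """Guess a coarse declension class from the dictionary form.

    Returns None when the last letter does not decide the class,
    so that the user can be asked instead of the program guessing.
    """
    last = dictionary_form[-1].lower()
    if last in CONSONANTS:
        return "masculine, consonant stem"
    if last == "а":
        return "feminine, -а"
    if last == "о":
        return "neuter, -о"
    # "ь" may be masculine (день) or feminine (ночь); other endings need more rules.
    return None


for noun in ["стенд", "книга", "окно", "день"]:
    print(noun, guess_noun_class(noun))
```

Even this small rule has exceptions: пальто ends in -о but does not decline at all. That is why an automatic procedure should still let the user correct its guess.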

And it would also be desirable to make the word entries look just or almost like those in
the traditional «paper» dictionaries familiar to any literate person.

2.2. The PARS project, or what we did

Speaking of a machine translation system, as I understand the problem, one may describe
it from three angles:

• how the system looks, that is what you can see when you run the program on your
computer;
• how the system works, that is what kind of translations it comes up with;
• how it does what it does, that is what algorithm the system uses for translating texts.

This book mainly covers the first two of the above three aspects, and only partially the
third one. The reason is that it’s not easy to describe an algorithm, the more so, in «an
introduction to an introduction». And, to tell you the truth, I am somewhat afraid of
numerous syntactic and semantic «trees» and schemes one can often come across in
serious publications on machine translation. If, God forbid, they are the first thing a
young person gets to know of MT, he or she will hardly ever force himself (herself) to
read another paper on a similar subject.

At the same time, this book is written in such a way that getting to know how our MT
systems work, you will certainly have the third of the above questions answered, at least
partially.

My whole career as a machine translator is connected with the PARS project, which I
launched in the mid-eighties. Several systems have been developed within that project,
and this is how the most up-to-date ones look and work.

2.2.1. Five PARSes: general description

Since 1986, we have been developing the English-Russian-English PARS (ПАРС)
system and, since 1990, PARS/RU (РУМП) - «Russian-Ukrainian-Russian machine
translation».

In 1996, yet another system by Lingvistica ’93 appeared on the market, PARS/U, for
translating between English and Ukrainian.

In 1997, the first version of the PARS/D German-Russian-German system was released,
and, in 1998, PARS/DU, the German-Ukrainian-German one.

These five systems are quite similar, so having mastered, for example, PARS, one will
easily master PARS/RU, PARS/U, and PARS/D, as well as any other MT system by
Lingvistica ’93.

Each system runs in 2 variants: the Windows-version and the DOS-version.


2.2.1.1. DOS versions

The MS DOS versions of our systems are used to translate ASCII texts.

One of the main peculiarities of these systems is the user(translator)-oriented built-in
two-window editor. It features some specific functions that correspond to the most
frequent text-editing operations made by professional translators:

 a key-stroke transposition of neighboring words;
 a key-stroke change of register (substitution of capital letters with small ones, and
vice versa);
 marking polysemantic (having more than one meaning) words and phrases in the
target text with asterisks, which ensures one keystroke substitution of a translation
variant;
 search for the next «new» word, that is a word not found in the dictionary;
 the possibility of entering «new» words into the dictionary directly from the text
editor, according to the «dictionary first» principle: the user opens the dictionary
and initiates entering the next «new» word into it, while the word entered is
highlighted in the text so that the user could see the context and give the right
translation(s).

The screen may be split either horizontally or vertically, and the user may scroll either
both windows synchronously, or the active one only. Besides, the target text may be
exported to another text editor that supports ASCII files, such as PenEdit, a pen editor
developed by the Kiev-based team led by my friend, Alexander Kazakov.

2.2.1.2. Windows versions

These systems work under Windows 3.1, Windows 95, Windows NT, and they translate
files in such formats as MS RTF, Windows text, DOS text, HTML, hypertext (CP 1251,
KOI8-Russian).

Each system may be activated directly from MS Word 6.0, MS Word 7.0, and MS Word
97: after the MT systems have been installed, the main menu of MS Word will have the
item Translate, with the option for running the corresponding system. The user opens the
source text in the editor and starts one of the systems, after which the machine translation
appears in the bottom window created by MS Word; in the target text, formatting of the
source one is preserved, such as fonts, styles, and tables (see Fig. 1).
Fig. 1. PARS has preserved the source table format in the target text

The polysemantic words and phrases are marked with «asterisks», just as in the DOS
versions (see Fig. 2).
Fig.2. PARS has translated Declaration on State Sovereignty of Ukraine from Russian
into English: translation variants for «nation» are «people» and «folk».

«New» words and phrases can be entered into the dictionary directly from the screen
according to the «text first» principle: the user marks the word/phrase to be entered,
clicks the New word button, and the word/phrase is written to the dictionary. Besides,
unlike the DOS-versions, not only separate words, but also phrases can be entered into
the dictionary directly from the text.

Besides, the user may translate on-screen Help texts and Internet WWW pages in the
HTML format. This is done via the Clipboard: the text portion to be translated is copied
to the Clipboard, the MT system is started, and the target text appears in a separate
window under the source text (see Fig. 3 and Fig. 4).
Fig. 3. PARS has translated a Help file from English into Russian
Fig. 4. PARS has translated an HTML file from English into Russian

The machine translation can be saved as a separate file.

Here are more characteristics of the MT systems developed by Lingvistica ’93:

a) the user may choose the dictionaries to be used according to the subject area, as well
as their priorities; up to 4 dictionaries can be used in each translation session.
b) PARS features automatic transliteration of proper names, that is rendering
Cyrillic characters with Latin ones, and vice versa. This turns out to be very useful
for dealing, for instance, with long and unusual foreign names, such as the Georgian
last name Дзодзуашвили, which PARS will transliterate as Dzodzuashvili. The
transliteration is given as a translation variant (see Fig. 5).
Fig. 5. This is how PARS transliterates Russian and English proper names when run
from Word 6.0 and 7.0
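
The idea of letter-by-letter transliteration is easy to sketch in Python. Mind that the
table below is a simplified Russian-to-Latin mapping chosen for illustration only; it is an
assumption, not the table PARS actually uses, and the name `transliterate` is hypothetical.

```python
# Simplified Cyrillic-to-Latin table (an illustrative assumption,
# not the actual PARS transliteration table).
CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e",
    "ж": "zh", "з": "z", "и": "i", "й": "y", "к": "k", "л": "l",
    "м": "m", "н": "n", "о": "o", "п": "p", "р": "r", "с": "s",
    "т": "t", "у": "u", "ф": "f", "х": "kh", "ц": "ts", "ч": "ch",
    "ш": "sh", "щ": "shch", "ъ": "", "ы": "y", "ь": "", "э": "e",
    "ю": "yu", "я": "ya", "ё": "yo",
}


def transliterate(word: str) -> str:
    """Render a Cyrillic word with Latin characters, letter by letter."""
    out = []
    for ch in word:
        # Characters missing from the table pass through unchanged.
        lat = CYR_TO_LAT.get(ch.lower(), ch)
        # Preserve the capitalisation of the original letter.
        out.append(lat.capitalize() if ch.isupper() else lat)
    return "".join(out)


print(transliterate("Дзодзуашвили"))  # Dzodzuashvili
```

A real system would also need the reverse (Latin-to-Cyrillic) direction, which is harder
because several Latin letters may correspond to one Cyrillic character.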

To finish the functional description, I would like to add that all the above systems, both
in DOS and Windows versions, run in stand-alone and network modes.

2.2.2. Dictionaries

It seems to me that one of the most important criteria of evaluating a commercial MT
system is its dictionary support subsystem: the easier it is to extend dictionaries supplied
with the system as well as to create user’s dictionaries, the better the system is in general.

2.2.2.1. User options

1) Dictionary entries in MT systems by Lingvistica ’93 resemble those in traditional
dictionaries (see Figs. 6 and 7), the difference being that in «paper» dictionaries it is
the head word which is replaced with a tilde in a phrase, this word bearing the main
sense of the word string, while in PARS, PARS/U, PARS/D, PARS/DU, and
PARS/RU dictionaries the first word is considered the head one.

Fig. 6. A dictionary entry in the PARS/U general dictionary


Fig. 7. A dictionary entry in the PARS dictionary on geology/mining

NB: Since Russian and Ukrainian are inflective languages, word endings are separated
from the stems with vertical lines.

2) Dictionaries in Lingvistica ’93 systems are bidirectional. For example, if the user
enters an English word with its Russian translation into a PARS dictionary, the
system automatically sets the opposite correspondence, Russian-English.
Accordingly, any dictionary can be browsed and edited from either side, for example,
English-Russian or Russian-English (see Fig. 8).
Fig. 8. A Russian-English word-entry in the PARS business dictionary

3) It is very important that a word/phrase can have a practically limitless number of
translations, which allowed us to implement the choice of translation variants in the
target text. The customer may use the one-keystroke transposition option in the
dictionary entry to assign a higher priority to the translation which is considered the
most likely one for the subject area. For example, in the PARS general dictionary, the
Russian word «общество» has two English translations - «society» and «company».
For translating socio-political texts, it is advisable to put the translation «society» in
the first place in the dictionary entry; then the word «company» will be placed
«under asterisk» as a translation variant. However, for translating financial-legal
texts, the order should be the opposite.

4) These systems feature automatic indexing (tagging) of Slavonic words being
entered into the dictionary: the system automatically assigns grammatical
characteristics to them, such as part of speech, declension, conjugation, and subclass
characteristics (such as gender) (see Fig. 9). If the program is unsure how to index a
word, the user can choose among several options. For example, the system will
not be sure about the declension of the Russian word «бензозаправщик»: should it be
declined like «автомат» (a thing) or like «инженер» (a human)?
Fig. 9. PARS has assigned grammatical characteristics to the Russian word «записать»
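
Options 2) and 3) above, bidirectional entries and ordered translation variants, can be
illustrated with a toy data structure. This is only a sketch under my own assumptions: the
class and method names are hypothetical, and the real PARS dictionary format is not
described here.

```python
from collections import defaultdict


class BilingualDictionary:
    """Toy two-way dictionary: each entry is indexed from both sides."""

    def __init__(self):
        self.forward = defaultdict(list)   # e.g. Russian -> [English, ...]
        self.backward = defaultdict(list)  # e.g. English -> [Russian, ...]

    def add(self, source: str, target: str) -> None:
        # Entering one correspondence automatically sets the opposite one.
        if target not in self.forward[source]:
            self.forward[source].append(target)
        if source not in self.backward[target]:
            self.backward[target].append(source)

    def promote(self, source: str, target: str) -> None:
        # Move a translation to first place so that it becomes the main one;
        # the others remain available as variants.
        variants = self.forward[source]
        variants.remove(target)
        variants.insert(0, target)

    def translate(self, source: str):
        # Return (main translation, variants shown «under asterisk»).
        variants = self.forward[source]
        return variants[0], variants[1:]


d = BilingualDictionary()
d.add("общество", "society")
d.add("общество", "company")
print(d.translate("общество"))    # ('society', ['company'])
d.promote("общество", "company")  # e.g. for financial-legal texts
print(d.translate("общество"))    # ('company', ['society'])
print(d.backward["company"])      # ['общество']
```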

2.2.2.2. How Lingvistica ’93 dictionaries are compiled

The main peculiarity is that dictionaries by professional lexicographers are broadly
used. The lists of dictionaries supplied with the systems comprise not only the
quantitative characteristics, but also the names of the authors.

PARS features a large spectrum of English-Russian-English specialist dictionaries, the
subject areas being technology, business, medicine, space engineering, electronics,
mathematics, chemistry, automobile building, etc. The total number of terms as of April
1998 was above 900,000 words and phrases in each part - English-Russian and
Russian-English.

Such great volumes could never have been compiled without the collaboration of
Lingvistica ’93 and ETS Ltd, the largest Russian publisher of electronic and paper
dictionaries, whose dictionaries run under the Polyglossum dictionary support program
(see Fig. 10).
Fig. 10. An English-Russian entry in the Polyglossum law dictionary

Under the joint PARS+Polyglossum project, the dictionaries of this world's largest
English-Russian dictionary base are semi-automatically converted into the PARS format.
The procedure consists of three stages.

a) First, the Polyglossum dictionary is imported into PARS.

b) Then, the Russian words of the new dictionary are encoded in a batch mode
according to the coincidence principle: the word acquires the same grammatical
characteristics as in the PARS dictionary that was set as the prototype.

c) At the last stage, a special program encodes the words that were not recognized by
the batch mode program. It uses the analogy principle: the word acquires the same
grammatical characteristics as similar words that were entered into the other
dictionaries before. If the program has several variants, the dictionary officer is
supposed to make a choice, after which the program goes on encoding the words.

Dictionaries are tagged very quickly: a dictionary of 50,000 translations can be processed
within 2-3 hours.

If there is no Polyglossum dictionary for a certain subject area, the PARS dictionary is
created by means of running a representative corpus of texts through the translation
system with subsequent input of «new» words and phrases into the dictionary, or by
means of scanning existing «paper» dictionaries provided we have the author’s
permission. Here is the description of this process, which I call The Automated Dictionary
Creation Technology. It was implemented in the Machine Translation Laboratory at
Kharkov State Polytechnical University.

According to the technology, the following sources are used to compile the dictionaries:

 existing printed bilingual dictionaries,
 existing electronic bilingual dictionaries,
 real-life texts.

The following procedure is used to enter words into the target dictionaries from the
printed and electronic ones.

1) scanning or manual entering, if the printed copy quality is poor;
2) converting into text format; «extra» fragments are deleted automatically and/or
manually, if necessary, such as comments, transcriptions, etc;
3) converting into the PARS communicative format followed by importing into
PARS;
4) the words in the dictionary obtained are encoded in a batch mode: a special program
is applied to attribute grammatical information to each word, according to the
«equality principle», that is the words are compared with the previously encoded
ones; besides, phrases are analyzed, and a word doesn't acquire any grammatical
tagging if it has been recognized as being an invariable part of the phrase;
5) a system linguist runs the interactive encoding program, which assigns grammatical
data to the words which were not identified at the previous stage; this program
recognises the invariable parts of the phrases and encodes the rest of the words
according to the «similarity principle» using the special grammatical index file. In
cases of alternative decisions, the program displays the list of the alternatives so that
the linguist could make a choice.
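
The batch stage (step 4) can be sketched as follows: words already present in a prototype
dictionary receive the same tags, and everything else is queued for the interactive stage
(step 5). The function name and the tag strings are hypothetical; the real PARS codes are
not shown in this book.

```python
def batch_encode(new_words, prototype):
    """Split new words into those encoded by exact match and those left over.

    `prototype` maps already-encoded words to their grammatical tags.
    """
    encoded, pending = {}, []
    for word in new_words:
        if word in prototype:
            encoded[word] = prototype[word]  # «equality principle»
        else:
            pending.append(word)             # left for the interactive stage
    return encoded, pending


# Hypothetical tags, for illustration only.
prototype = {"стол": "noun, masculine", "указать": "verb"}
encoded, pending = batch_encode(["стол", "сказать"], prototype)
print(encoded)  # {'стол': 'noun, masculine'}
print(pending)  # ['сказать']
```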

I understand very well that some dictionaries are to be updated much quicker to
include the most up-to-date terminology. This relates, for example, to such a
«terminologically flexible» sphere as telecommunications. The only way to do this is to
cooperate with the companies that generate new terminology.

The next paragraph explains the above procedure and gives you details which you may
certainly omit if you are not interested. I do hope the general idea is already clear. But I
would read it if I were you.

2.2.3. Grammar: some specific information for those interested in details


In order to let our MT systems translate to and from two Slavonic languages, Russian and
Ukrainian, we, first of all, describe the morphologies of each language, i.e. some
morphological characteristics are assigned to each word entered into the dictionary. This
is done manually at the first stage, when the basic general dictionary is compiled.

The following parts of speech are distinguished in PARS:

 noun
 attribute
 verb
 adverb
 preposition
 particle
 short adjective
 article
 conjunctions ‘and’ and ‘or’
 other conjunctions
 some grammatical classes and separate words characteristic of the other (non-
Slavonic) language: particle ‘to’, the word ‘that’ (see also Fig. 11).

Fig. 11. The part-of-speech list in PARS dictionaries

As you can see, some parts of speech, such as ‘noun’, ‘verb’, ‘attribute’, are characteristic
of both Slavonic and Germanic languages; others exist only in Russian or only in
English, such as ‘short adjective’ or ‘that’, respectively. The fact is that, in PARS and
PARS/U, all the grammatical characteristics, including those of the English words, are
assigned to the Slavonic words, English being a purely analytic language, with very few
morphological peculiarities. It means that, as a rule, when encoding a Slavonic word in a
word entry, i.e. assigning morphological characteristics to it, the English equivalent is
supposed to have the same ones.

Most of the parts of speech have additional morphological characteristics. Here is an
example.

Slavonic nouns have, among others, the gender/number characteristic (masculine,
feminine, neuter, or plural) and the singular and plural paradigms. A paradigm is a set of
morphological endings in all the grammatical cases, for example, ‘стол|, стол|а, стол|у,’
etc. (see Figs. 12 and 13).

Fig. 12. Singular paradigm of the Russian word ‘парень’(‘boy’)


Fig. 13. Plural paradigm of the Russian noun ‘парень’, the plural form being ‘парни’
(‘boys’)

A very important peculiarity of MT systems by Lingvistica ’93 is what I call distant
phrases, an idea I gained from my unforgettable hours-on-end talks with one of my
teachers, Dr. Boris Pevzner. He considered it unrealistic to list all possible phrases in a
dictionary, no matter how large the dictionary may be. It would be more reasonable, he
said, to enter a typical (model) phrase and a rule for making substitutions so that the
system could generate phrases similar to the model. It means that a phrase in this case
will not be something fixed, but rather a flexible unity similar to many other phrases.
Unlike fixed phrases, the elements of which are always adjacent, such as ‘in order to’,
‘door handle’, etc., a distant phrase may have a ‘gap’, for example, ‘pay...attention’: we
can see definite words in real-life texts instead of three dots, such as ‘pay great attention’,
‘pay extraordinarily serious attention’, etc.

So far, all our MT systems recognize only one kind of distant phrase. A distant phrase is
considered to be a 2-word source-language phrase having a 2-word translation. Two
‘gap’ types are distinguished:

 positional: in a text, not more than 5 words may appear between the left and the right
words;
 grammatical: no other part of speech may occur between them but an article, one or
more adjectives and/or adverbs.
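
Here is a minimal sketch of how such a 2-word distant phrase might be looked for in a
sentence. It assumes that each phrase is marked with one of the two gap types and that the
words already carry part-of-speech tags from the dictionary; both are my reading of the
description above, and all names are hypothetical.

```python
ALLOWED_IN_GAP = {"article", "adjective", "adverb"}  # the «grammatical» gap
MAX_GAP = 5                                          # the «positional» gap


def gap_ok(gap, gap_type):
    """Check the words between the left and right parts of the phrase.

    `gap` is a list of (word, part_of_speech) pairs.
    """
    if gap_type == "positional":
        return len(gap) <= MAX_GAP
    if gap_type == "grammatical":
        return all(pos in ALLOWED_IN_GAP for _, pos in gap)
    raise ValueError(f"unknown gap type: {gap_type}")


def find_distant_phrase(tagged, left, right, gap_type):
    """Return (i, j) positions of `left` ... `right`, or None if absent."""
    for i, (word, _) in enumerate(tagged):
        if word != left:
            continue
        for j in range(i + 1, len(tagged)):
            if tagged[j][0] == right and gap_ok(tagged[i + 1:j], gap_type):
                return i, j
    return None


sentence = [("pay", "verb"), ("extraordinarily", "adverb"),
            ("serious", "adjective"), ("attention", "noun")]
print(find_distant_phrase(sentence, "pay", "attention", "grammatical"))  # (0, 3)
```

With the grammatical gap, a noun between the two words (as in ‘pay the bill attention’)
would block the match, while the positional gap would still accept it.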

I’d like to add that one of the most promising ways of improving translation quality is to
make an MT system ‘cleverer’ by means of recognizing more types of distant phrases,
with maybe even more than one ‘gap’ in a phrase. I am planning to develop our MT
systems into those doing example-based translation: a dictionary of numerous typical
phrases will be made, which will let the translation program generate many more similar
phrases than are in the dictionary.

So, as we have seen, the words in the dictionary acquire grammatical characteristics.
Some semantic information is also assigned, such as ‘time’, ‘geographic notion’, etc.,
which makes it possible to determine the right meaning of a word in context. For
example, ‘for’ is translated ‘в течение’ (‘during’) if it precedes a word marked as ‘time’
in the dictionary.
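
A rule of this kind can be sketched in a few lines. Only the ‘time’ rule itself comes from
the text; the sample marks, the fallback translation ‘для’, and the function name are
assumptions made for illustration.

```python
# Hypothetical semantic marks assigned to dictionary words.
SEMANTIC_MARKS = {
    "hours": "time",
    "years": "time",
    "London": "geographic notion",
}


def translate_for(next_word: str) -> str:
    """Pick a Russian translation of 'for' from the mark of the next word."""
    if SEMANTIC_MARKS.get(next_word) == "time":
        return "в течение"
    return "для"  # assumed default when no semantic mark applies


print(translate_for("hours"))   # в течение
print(translate_for("London"))  # для
```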

However, semantics is something an ordinary end-user can hardly cope with when
extending the dictionary or compiling one of his/her own. That is why our systems will
even translate if no semantic marks are present, though maybe a little worse.

As a result of assigning the corresponding characteristics to the words in the basic general
dictionary, the latter usually consisting of about 20,000 word entries, a grammatical
dictionary of the language in question is compiled automatically, which is then used in
all our MT systems dealing with this language. The Russian grammatical dictionary is
used in PARS, PARS/D, and PARS/RU; Ukrainian - in PARS/RU and PARS/U. More
than that, if another MT system is to be developed, such as Russian-Spanish or
Ukrainian-German, it will use the corresponding grammatical dictionary. The question is
- how? Let’s see.

When a system dictionary is extended, that is new words and/or phrases are added to it,
or when a new dictionary is created, the system looks each new word up in the
grammatical dictionary and, if it is found there, assigns the same grammatical
characteristics to it. The reader may ask me why a word is entered into the system
dictionaries more than once. There may be two reasons:

 it has different meanings in different dictionaries;


 it may be part of different phrases, although its grammatical characteristics are the
same.

If, however, the word to be encoded is not present in the grammatical dictionary, the
system encodes it on analogy, assigning such morphological characteristics to it as the
most similar word has in the grammatical dictionary. Similarity is determined by the last
letters in the word. For example, the Russian word ‘сказать’ looks more like ‘указать’,
than ‘поднимать’. In order to implement the similarity principle, a program was made
which automatically creates a grammatical index corresponding to each grammatical
dictionary.
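
The similarity principle can be illustrated by comparing word endings. The sketch below
simply scans the whole grammatical dictionary for the word sharing the longest ending; the
grammatical index mentioned above presumably does this more efficiently, and the tag
strings here are hypothetical.

```python
def common_suffix_len(a: str, b: str) -> int:
    """Number of identical letters counted from the end of both words."""
    n = 0
    while n < min(len(a), len(b)) and a[-1 - n] == b[-1 - n]:
        n += 1
    return n


def encode_by_analogy(word, grammatical_dictionary):
    """Return the known word with the longest shared ending, and its tags."""
    best = max(grammatical_dictionary,
               key=lambda known: common_suffix_len(word, known))
    return best, grammatical_dictionary[best]


# Hypothetical tags, for illustration only.
grammatical_dictionary = {
    "указать": "verb, pattern A",
    "поднимать": "verb, pattern B",
}
print(common_suffix_len("сказать", "указать"))    # 6
print(common_suffix_len("сказать", "поднимать"))  # 3
print(encode_by_analogy("сказать", grammatical_dictionary))
# ('указать', 'verb, pattern A')
```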

What seems to me very important and encouraging, the grammatical dictionary/index
approach turned out to be also useful for dealing with the German language! It lets us
encode German words entered into the German part of the German-Russian-German
PARS/D system and in the similar German-Ukrainian-German system developed by
Lingvistica ’93, PARS/DU. Here is a word entry in the general German-Russian
dictionary (Fig. 14):

Fig. 14. A word entry in a PARS/D dictionary

Figure 15 illustrates automatic encoding of the German verb ‘abbauen’: as you can see, the
program has automatically assigned grammatical characteristics to this word.

Fig. 15. Automatic tagging of a German word


2.3. Translating

2.3.1. General principles: contemplation for those interested in details

Now that I have shown you how the systems look and given you some substantial
grammatical information, it would be interesting to see how they are used to translate
texts. In other words, what translation philosophy is laid in the foundation of those
systems?

First I wanted to make use of the «almost-classic» definition of three translation
approaches: direct, transfer-based, and interlingua-based. These terms look so very
scientific that they have to be explained.

Direct translation works on a word-for-word basis. For example, when translating from
Russian into English, the computer program substitutes each Russian word or phrase
found in the dictionary with its English equivalent. This is called direct translation
because the system is based on direct correspondence between 2 languages, such as
Russian-English, German-Spanish, Dutch-French, etc. It can only translate between the
given language pair, and it’s not capable of analyzing the source language sentence for
subsequent translation into another target language.

On the contrary, the transfer approach presupposes independent analysis of the source
text sentences as well as independent generation of the target text ones. This means that
the system, instead of translating word-for-word, first analyzes the source sentence and
comes up with a special grammatical representation of this sentence, and this
representation is then transformed into a sentence in the target language. “Transfer”
means the transition to the target language after the first stage of the translation process,
the analysis.

Generally speaking, the interlingua philosophy resembles the transfer one. You see, an
interlingua is a special artificial language used for making source language sentence
representations. The idea is really great! Just imagine that you have to develop an MT
system to translate between 20 languages, which would make 380 directed language pairs
(20 × 19)! Which would be easier to make: 380 direct translation programs, or 20 programs
for translating from each of the 20 languages into the interlingua plus 20 programs for
translating from the interlingua into each of the 20?
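
The arithmetic is easy to check. For n languages, direct translation needs one program per
ordered pair of distinct languages, that is n(n-1), while the interlingua approach needs one
program into and one out of the interlingua per language, that is 2n. The function names
below are, of course, just illustrative.

```python
def direct_programs(n: int) -> int:
    """One program per ordered pair of distinct languages."""
    return n * (n - 1)


def interlingua_programs(n: int) -> int:
    """One analyzer into and one generator out of the interlingua per language."""
    return 2 * n


print(direct_programs(20), interlingua_programs(20))  # 380 40
```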

Well, the more I thought about all this, the more convinced I became that it
would hardly be possible to use this definition practically, as it is very hard to draw a
demarcation line between the above three approaches. The main reason is that the
champions of this definition consider what they call «direct translation» quite fruitless,
while, on the other hand, systems translating «directly» are sold and, what is more
important, bought throughout the world, giving their developers honestly earned profits,
the latter being sometimes rather high. Or maybe we have to admit that no “pure direct”
translation systems really exist, and each system is a combination of philosophies, so a
different kind of terminology should be suggested.

In each of our PARSes, the translation program first generates a word-for-word
translation, and then brushes it up intensively, making it look as natural as it (the
program) can. That's why I call our approach FTA - «first-translate-then-analyze».

Generally speaking, FTA is usually resorted to if system developers don't want to view
the sentence as a single structural entity, considering it as a linear sequence of lexical
units and regarding syntactic and semantic relations merely for disambiguation purposes.

On the contrary, a system may first analyze the source text, and then translate it, using the
results of this analysis, thus working according to the FAT - «first-analyze-then-
translate» principle. Traditionally, the FAT-type systems consider the whole sentence as
a syntactic (or even semantic-syntactic) unit, the basic idea being that the more
information you use in your analysis, the better results you will obtain.

It should be taken into consideration, however, that mistakes are practically inevitable, or
at least highly probable in each case, that is «when she (the translation algorithm) is
good, she is very, very good, but when she is bad, she is horrid»: mistakes made in
analyzing an entity as complicated as a sentence will cause translation mistakes.

This situation is very well known to practical developers of language engineering
systems, who constantly face the «noise/recall (completeness)» dilemma. From time to
time, we come across the typical situation: too much analysis causes poorer translation
quality than no analysis at all. My opponents may object that «too much analysis»
really means «too little analysis», but have you ever seen enough analysis in real-life MT
systems?

PARSes rely on hundreds of rules to analyze the source text and synthesize the target
one, some of the rules being rather sophisticated, such as disambiguation of «-ed» forms
for English-Russian or English-Ukrainian translation purposes. However, the program
doesn't dare to view the sentence as a structural unit: it only analyzes a word if it is
grammatically ambiguous. At the same time, the set of rules is constantly extended in the
system «growing» process: we analyze translation results, and if a mistake is typical, i.e.
a certain ambiguity type occurs regularly, we try to develop a rule to eliminate
the ambiguity.

So, here is a general outline of the translation procedure in our systems.

Stage 1. The system makes what is called word-for-word and phrase-for-phrase
translation of the source text, recognizing phrases and single words and extracting the
corresponding grammatical data from the dictionary. This is done using the
morphological analysis rules. For example, when analyzing English texts, a table of
irregular verbs is used, as well as a set of rules for recognizing noun plural forms;
the German morphological analysis is based on the rules of linking German separable
prefixes to their corresponding verbs; Slavonic words are recognized in the source text
by means of special tables of Russian and Ukrainian paradigms.

Stage 2. The system analyzes the resulting text and does its best to eliminate as many
ambiguities as it can. Doing so, it makes use of special contextual rules for
grammatical and semantic disambiguation. In this case, contexts of ambiguous words
are analyzed.

Stage 3. The system generates the target text. The task consists in making the target
sentences look as natural as possible. The system tries to insert articles (which is even a
difficult task for some humans, to say nothing of an algorithm), changes word order, etc.

The latter two are very difficult-to-implement stages, and, as I said, ‘too much analysis’
may really turn out to be ‘too little analysis’. For instance, the system sometimes
transposes words in such a way that the resulting sentence seems to make no sense at all.
Boris Pevzner once said a system considered good may give poorer results than one
called primitive!

Let’s call a spade a spade: if the grammatical structures of the source and target
languages are not so much alike as, for example, those of Russian and Ukrainian
(although Russian and Ukrainian grammars have a lot of differences), the output texts are
very far from those made by qualified translators. When I hear or read that an MT system
ensures «80-90-percent accuracy», I am inclined to consider such a statement a mere
advertising trick, especially speaking of such different languages as Germanic and
Slavonic. Yes, machine grammars are being constantly improved, but, being a
professional language engineer, I can hardly imagine that computer programs will ever be
able to compete with qualified humans. Or maybe I am mistaken? People used to think
that a computer would never be ‘cleverer’ than a human chess player, but Deep Blue beat
Kasparov...

2.3.2. So, what can these systems do?

PARS/RU does translate texts in such a way that they are 70-80%, sometimes even 90%
ready for publication, the quality of Russian-Ukrainian translation being somewhat
higher than that of Ukrainian-Russian (I hope to explain this phenomenon sooner or
later). As to PARS, PARS/U, PARS/D, and PARS/DU, they are used to

 let the user have a general idea of the document, for example, when browsing large
databases, that is «scan» the text;
 create a draft for subsequent polishing, i.e. for turning the draft into a real
translation.

The option of selecting translation variants essentially simplifies editing of the machine-
made translation. This option, as we already know, also provides automatic transliteration
of proper names.

At the same time, numerous users, among which there are professional translators, say
that it is very hard to edit machine translations in MS Word. Let me explain this.

The main disadvantage of the FTA-type programs that translate between languages one
of which belongs to the Germanic group and the other to the Slavonic one, is that, more
often than not, the word order is not observed, and the translator has to change it
according to the rules of the target language. The reason is that observing word order
requires very serious transposition (‘transfer’) rules based not only on grammatical, but
also on semantic characteristics of the words, and using semantics in machine translation
is a task for a new generation of commercial MT systems.

Speaking of editing machine translations in MS Word, the only option that can be used
for transposing words is tiresome operation with text blocks, which is something quite
unnatural for a normal professional translator. Being a translator myself, I know very
well how conservative they (we!) are. And one of the most unpleasant things for a
conservative person, that is for a person who is well accustomed to doing something, is
changing his/her habits. If I write a draft on a sheet of paper, I use arrows to transpose
words instead of cutting a text portion out and sticking it into another place. An MT
system ‘embedded’ in Word suggests that I should act in a strange manner, which is
doubly bad:

 first, because it’s stupid to use scissors instead of arrows;
 second, because the program tries to make me change my habits instead of helping
me.

I am sure that one of the most promising directions in MT is developing a new generation
of text editors. A Windows-version of PenEdit has been developed by Alexander
Kazakov and his team. Among other things, it will let the user transpose words very
easily, using an electronic pen or a digitizer, just as if they were working with a sheet of
paper to write a text on. Alexander and I will describe the results in detail elsewhere.

Another characteristic feature of MT systems by Lingvistica ’93 is that they use up to 4
dictionaries in the translation session, and the user may set their priorities. When
translating, the system looks the word (phrase) up in the dictionary which has the highest
priority, then, if it was not found there, in the following one, etc. As it turns out, this
approach has not only advantages, but also drawbacks. Let’s discuss the latter.

a) To begin with, PARS comprises quite a number of dictionaries, which requires linking
up more than 4 dictionaries in some translation sessions. For example, the following
dictionaries can be used for translating aviation texts:
 general;
 aviation;
 aerospace;
 mathematics (for instance, when translating texts on mathematical modelling in
aircraft building);
 computers;
 aviation medicine;
 radioelectronics;
 ground and space communications;
 polytechnical.

b) Having found a word in one of the dictionaries, the system stops looking it up in the
rest, which may cause incorrect translation simply because one and the same word may
be present in different dictionaries and thus have different meanings.

c) Another large problem is the difficulty of assigning priorities to the dictionaries
correctly. For example, PARS once translated an English medical text into Russian
using the medical and general dictionaries in the indicated order of priorities, and the
word «flow» was translated as «менструация» («menstruation») instead of «поток»
(«flow»), the latter being suggested as a translation variant; but if the general dictionary
had had a higher priority, the translation would have been correct, and the wrong
translation would have been suggested as a variant.
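
The look-up order described above can be illustrated with a short Python sketch. It is not the PARS code: the two toy dictionaries, the function name look_up, and the rule that entries from lower-priority dictionaries are offered as variants are assumptions made for illustration, based on the «flow» example.

```python
# Toy dictionaries; the entries are illustrative, not taken from PARS.
MEDICAL = {"flow": "менструация"}
GENERAL = {"flow": "поток", "letter": "письмо"}


def look_up(word, dictionaries):
    """Return (translation, variants) for a word.

    `dictionaries` is a list ordered from highest to lowest priority.
    The first dictionary containing the word supplies the translation;
    different entries found further down the list become variants.
    """
    translation = None
    variants = []
    for entries in dictionaries:
        candidate = entries.get(word)
        if candidate is None:
            continue
        if translation is None:
            translation = candidate
        elif candidate != translation and candidate not in variants:
            variants.append(candidate)
    return translation, variants


# Medical dictionary first: the wrong sense wins.
print(look_up("flow", [MEDICAL, GENERAL]))
# General dictionary first: the right sense wins.
print(look_up("flow", [GENERAL, MEDICAL]))
```

Swapping the two dictionaries in the list is all it takes to exchange the main translation and the variant, which is why the choice of priorities matters so much.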

What do «people from the street» think of PARS? Here are two examples, somewhat
funny, but taken from real life.
One of our clients used PARS for the first time in his life to translate a business letter to
his American partners. He said he used to write his letters in English manually, which
took him a lot of time and caused him terrible headaches. He sent the machine
translation, with no comments, by e-mail. The answer was: «Congratulations, you are
making progress! Your English is much better, though you still have problems with
English grammar».
Here is an example from my own experience.
My Dutch business partners, one of whom is a native speaker of Russian, invited me to
their country, and sent me a letter to the effect that they were ready to provide
accommodation for me, and that my visa would be extended if necessary. The letter was
in English. I didn't open it before the plane landed at the Amsterdam airport. There, I
handed it to the customs officers; they read it and said everything was all right. When I
took the letter back, I noticed some grammatical mistakes in the text. «You see,» my
partners said when we met, «PARS did well, but you should go on working to improve the
translation algorithm!»

And now, to complete this paragraph, here is how the three PARSes translated the same
text, the Ukrainian Declaration on the State Sovereignty. In fact, the original text was in
Ukrainian (just all right for PARS/U), and I also had it translated by PARS/RU into
Russian and edited the translation manually, after which the Russian text was translated
into English and German by PARS and PARS/D, respectively. Mind that no translation
variants, although provided by the systems, are displayed in the illustrations.

Russian to German Translation

Die Deklaration über die staatlichen Souverenität der Ukraine.

Der Oberste Rat Ukrainischen SSR, aussprechend den Wille des Volkes der Ukraine,
strebend schaffen die demokratischen Gesellschaft, hervorgehend vom den Bedarfen der
allseitigen Versorgung der Rechte und den Freiheiten dem Mann, verehrend die
nationalen Rechte der allen Völker, sorgend um hochwertigen politischen,
ökonomischen, sozialen und die geistigen Entwicklung des Volkes der Ukraine,
bekennend die Notwendigkeit des Aufbaues des Rechtsstaat, habend Ziel aufnehmen die
Souverenität und die Selbstverwaltung des Volkes der Ukraine, proklamiert die
staatlichen Souverenität der Ukraine als die Vorherrschaft, die Selbständigkeit, die Fülle
und die Unteilbarkeit der Gewalt der Republik innerhalb ihren Gelände und die
Unabhängigkeit und die Gleichberechtigkeit in dem äußerliche Verkehr.

Ukrainian to English Translation

Declaration about the state sovereignty of Ukraine.

Supreme rada Ukrainian SSR expressing freedom the nation of Ukraine seeking to create
democratic society proceeding from from the needs of the all-round provision of rights
and the freedoms of man respecting national right all nations caring about complete
political, economic, social and spiritual development the nation of Ukraine accepting the
necessity of the construction of legal country having aim to affirm sovereignty and self-
governance the nation of Ukraine, declares the state sovereignty of Ukraine as
supremacy, independence, amplitude and the indivisibility of the authority of Republic
within it territory and self-support and equality into foreign communion.

Russian to English Translation

Declaration about the state sovereignty of Ukraine.

Supreme Soviet Ukrainian SSR expressing the will of the nation of Ukraine, aiming to
create democratic society, based on needs of the all-round provisioning of rights and
freedoms man respecting the national rights of all nations attending to full-value political,
economic, social and spiritual development of the nation of Ukraine recognising the
necessity of the building of legal state having purpose to affirm sovereignity and the
autonomy of the nation of Ukraine, proclaims the state sovereignity of Ukraine as
supremacy, independence, completeness and indivisibility the power of Republic within
its territory and independence and equal rights into exterior relations.

2.3.3. Translation technology: the PARS+Polyglossum tandem

Experience shows that the most efficient technology of translating from Russian into
English and from English into Russian is using an automatic translation system (PARS or
another) and a dictionary look-up system with large professional dictionaries (like
Polyglossum) to complement each other, if, and this is important, the MT system
dictionaries are not representative enough, that is

• either they don’t contain some specific terms the Polyglossum-like dictionaries have,
or
• the user needs explanations of some terms to choose the most appropriate variants.

The fact is that the Polyglossum system has a program for dictionary look-up, and the
word entries in its dictionaries contain numerous explanations and commentaries. That is
why Polyglossum is not only a source of new PARS dictionaries, but also serves for
translating technical terms which PARS fails to translate, or for choosing the most
appropriate translation variant if the human translator post-editing the raw machine
translation needs an explanation of a term.

2.4. Quality evaluation, or Is MT really useful?

2.4.1. Philosophy

What is a good translation? What do you think when you hear that a text has been
translated well, or badly? A definite answer can hardly be found. It's much easier to find
a grammatical or lexical mistake in a sentence than to say whether the translation is good
despite being grammatically or even lexically incorrect. If, for instance, you come across
the sentence «I has a interesting book», you will easily understand what it means and
correct both mistakes; so, on the one hand, the sentence is not correct, but, on the other
hand, it is understandable and, so to say, correctable. By contrast, there may be mistakes
that make the contents incomprehensible, or, which is even worse, «lead the user astray».

Being a practical machine translator, I approach the quality problem from the practical
point of view. In other words, evaluating the target texts, I mainly pay attention to the
usefulness of the translation, and, if the machine output is good for nothing, I analyze the
reasons. Of course, there are also mistakes whose causes in the algorithm must be
eliminated even though their influence on the comprehensibility of the output may be
negligible, such as «I has».

To approach the problem of translation quality, we should first analyze user subgroups
and see who needs what; otherwise we shall hardly be able to distinguish between
«good» and «bad», since «good» implies the existence of a user who is satisfied. In other
words, we must answer the question «Good for whom?».

2.4.2. Users

At present, I can see the following subgroups among the end users.

a) People who require what may be called «translation for information», that is, they view
the target text as a mere source of information rather than a finished document. One
such person, an authority in humus production and application, told me after having read
a very raw PARS-made translation of a scientific paper on humus that the quality was
quite satisfactory from the point of view of informativeness, and even more than that: it
turned out that, in fact, PARS coped with the job no worse than the girl translator who
was much better at syntax but rendered the contents in the same primitive but
comprehensible way.

Another example of the same kind is the requirement of an expert working with messages
rendered by world information agencies: this user is supposed to write analytical reviews
of such messages, and he/she requires quick translations he/she can read and understand
without looking into the source text.

In such cases, «global» post-editing should not be necessary; what the user really needs is
correction of those fragments that have been rendered so erroneously that no information
can be extracted from the translation.

b) People who need high-quality translation of the publication quality type, which
requires the intervention of a skilled human translator or post-editor.

One of the most frequent cases in modern translation practice is translating from Russian
into English various kinds of scientific and technical documents, which brings Russian-
speaking scholars and engineers closer to their English-speaking colleagues. Another
example is preparing legal documents, such as texts of documents translated by the
PARS/RU system between Ukrainian and Russian at the Ukrainian Parliament. It's clear
that this kind of translation requires the highest quality of post-editing, no matter whether
the source document is translated by a human or machine translator.

Here is a 7-point scale I composed for evaluating the quality of machine translations
produced for information extraction requirements.

Machine translation quality scale

1: Translation quality is unacceptable. This translation is absolutely useless for any
purpose. It is better to read the text in the original than such a «translation».

2: The quality is poor, few text fragments are comprehensible. It is hard to grasp what the
text is about.

3: Translation quality is rather low. The general idea is comprehensible, but it is very
hard to read large fragments of the machine product. Such translations can be used for
reference purposes only, for example, in libraries. The translation is useless, however, as
an information or even signal document.

4: Average quality. The document can be used for the first, «draft» reading of short or
average size documents for determining if high quality translation is necessary, or when
reading large documents - for selecting text fragments to be translated professionally.

5: Acceptable quality. The general idea is clear. Useful information can be drawn from
the target text. However, in general, the text is hard to read due to a large number of
various kinds of mistakes.

6: The translation is quite satisfactory. There are some stylistically poor and even obscure
fragments, but, in general, the translation can be used as a source of information.

7: The translation is good, though not stylistically perfect.

So far, the average mark for PARS output is 5, and if 6 as the average mark is ever
attained, it will be a serious victory for us.

As to evaluating the quality of machine translation made as the basis for attaining a
publication quality text, the only reasonable criterion seems to be whether it is easier to
post-edit the machine output than to translate the source text manually, and if it is, how
much easier it is.

Generally speaking, the situation is far from being cloudless, though quite a number of
people thank us for having designed a useful translation instrument. There are people
who have a negative attitude to PARS (or maybe to MT in general), saying that it is easier
to translate the whole text manually than to post-edit the machine product. The most
unexpected thing, however, is that the higher the professional skills, the better the
attitude to MT.

I discussed the problem with a lady translator who worked at the Izvestia Concern,
Moscow, where PARS was implemented. Izvestia is one of the most influential Russian
dailies, and PARS 2.04 had been selected among the MT systems marketed in Russia for
translating the VWD Information Agency data base. I was sitting there for about an hour
and a half, working with PARS, while she was manually translating VWD messages from
English into Russian, using a standard 2-window text editor and a pile of «paper»
dictionaries. I noticed that she only managed to translate 2 documents within that time,
though, certainly, she was supposed to come up with «ideal» translations, which is
unattainable for any MT software. It only took PARS 2.04 a minute to rough-translate
one document on an IBM/286 PC (PARS-3 would translate it in a second if run on a
Pentium), and the translation quality seemed not bad to me, the contents being quite
clear, so I gave it a «6». However, when asked if PARS could really help her, the girl said,
«You see, it seems useless since the machine product needs too much post-editing, which
is a terribly boring task for me».

On the contrary, one of my clients and friends, the late Mr. Vladimir Kolykhmatov, who
was a very experienced translator working for the Moscow Agency of the Dupont
Company, translating 80-90% from Russian into English and 10-20% from English into
Russian, said he could not do without PARS and added,

«I am too lazy to do the job all on my own; I prefer to post-edit the raw translation
instead of typing everything in with my fingers. And, what is very important, PARS
helps me get started: psychologically, it’s easier for me to join in after PARS did part of
the work».

When shown the translations his lady colleague had rejected, he rated them rather
highly, saying, «I don't understand what on earth she wants!»

Vladimir said he couldn't even imagine his work without using PARS.

I am very glad PARS has also been appreciated by non-translator end-users. In particular,
Dr. Boris Piskunov of the South Sakhalin Institute of Geology sent me a PARS-made
English translation of his scientific report he was supposed to submit to his South Korean
partners, with the commentary that he was quite satisfied as the volume of post-editing
needed was not large. I also received similar information from Dr. Vladimir Petushkov of
Ukrainian Welding Institute. By the way, both of them had mastered entering new words
into PARS dictionaries.

And, certainly, one of the most reliable experiments for me was the personal experience I
gained back in 1993-1994 using my old PARS-2 system.

One of the jobs was translating an English legal text of more than 40K; the work was to
be completed within 1 day, the source being presented as a hard copy on paper. In order
to do the translation manually, I would have had to have specialist dictionaries on my
working table, which I lacked, and even if I had had them, I would have hardly coped
with the task as I didn't even have enough time for typing the translation, to say nothing
of translating it.

So, I had the source text scanned and machine translated. The subsequent post-editing
was rather hard, but I did manage to complete the work on time, and handed the
translation in to the user as a pack of sheets of paper with the Russian text. The
translation was checked for authenticity by a professional editor, who said the quality
was high.

Another example from my own experience was translating a large technical text from
Russian into English, which I was to translate «as quick as possible», though the user's
requirement was not so strict as in the above situation: he asked me to make a text that
would be clear to his American colleague, though maybe somewhat rough syntactically
and even lexically. The text to translate was abundant in technical terms I had never even
heard of, either in English or in Russian, such as «воздуходувка», which, as it turned
out, means an air-blower. That translation was a success again: I made it in due time, and
the quality was not bad.

This experience is very important for me because I had to play the role of my numerous
clients and face all the advantages and disadvantages of the PARS system. What I see
now is that there are situations when one can't, or can hardly, do without machine
translation. More than that, I know for sure that I would not have agreed to translate those
texts if I had had no MT assistant, since it is easier for me to post-edit the machine product
than to translate the original text manually. Those who disagree will have to furnish the
customer in the evening with a manually made translation of a 40K-long legal or
technical text they received in the morning.

The years 1996-1997 have been very fruitful for practical evaluations of translations made
by the new generation of the system, PARS-3.

Olga, my elder daughter, a linguist, translator, and student at Kharkov State University,
used PARS to translate two large and rather complicated Russian scientific and technical
texts into English. I asked her to describe her experience, and she prepared two papers,
one of which was published in MT News International, and the other was accepted as a
contribution to Machine Translation Summit VI. She allowed me to include the papers
in this book, which I did, having updated them accordingly. Please mind that the PARS
dictionary characteristics are given as of the dates of the experiments, August 1996 and
February-March 1997.

2.4.3. Evaluation of Russian-English translations (by Olga Bezhanova)

2.4.3.1. Translating scientific texts

Here I will analyze the work accomplished on the order of The Russian Foundation for
Fundamental Research (RFFR). The work gave valuable material for the analysis of the
modern translation facilities.

It became evident for me long ago that for translating large volumes of texts abundant in
special terminology, the professional translator has to use both traditional «paper»
dictionaries and something less habitual - machine translation systems and electronic
dictionaries. The work that made up the subject of the present investigation can serve as
an example of using such facilities for making professional translations, from the point of
view of its volume, the complexity of the task, and the abundance of special terminology
that belongs to various subject areas.

2.4.3.1.1. Task description


First, some words about the task I had to solve to meet the requirements of RFFR. As is
well known, an enormous amount of research into all areas of science is carried out
annually in Russia. Some of these investigations are conducted under grants from various
Western foundations interested in the development of science in the countries of the
former USSR. Projects that have this kind of financial support are included in the «RFFR
Annual Bulletin».

The 1996 directory comprises about 400 pages. It embraces titles and bibliographic data
of several thousand research projects in the following areas:

• mathematics and information science;
• physics and astronomy;
• chemistry;
• biology and medicine;
• geosciences;
• liberal arts;
• databases and books issued in Russia in 1995.

The structure of the Directory was somewhat unusual for me as a translator: it consisted of
a brief introduction followed by approximately 5,500 titles of projects including the
author’s surname, the title of the project, the identification number, the name of the
institution (University, research institute, etc.) where the research was carried out, and
the city/area where this institution is located; the list of the abbreviations of the titles of
scientific institutions, as well as their addresses, was given at the end of the Directory.

It is evident that, except for the brief introduction, the task generally consisted in
translating not complete texts but the titles of research projects, each title consisting of
two to forty words.

The initial text was a Microsoft Word file of 1.2 MB. The customers required a similar
English text preserving the source text styles and formatting. The customers also
stipulated that the surnames were to be transliterated according to the rules of the
English language, while the titles of institutions were to be translated.

Translating from Russian into English is considerably more difficult for a Russian-
speaking person than translating from English into Russian because of the absence of
Russian-English terminology dictionaries for quite a number of subject areas. In the ex-
Union, many excellent English-Russian dictionaries of mathematics, astronomy,
chemistry, biology, etc. were published, but it is absolutely impossible to find
corresponding Russian-English dictionaries.

The work was supposed to be done within approximately a month. Taking into account
the days off, the translation took 34 days of intensive work, 5-7 hours a day,
consisting in post-editing the texts translated by PARS.

A great number of scientific terms in the source text relating to numerous subject areas
required using quite a number of various dictionaries. I am sure that translating texts of
such volume by one person in such a short period of time without using machine
translation software is impossible.

It is also necessary to say that, due to numerous misprints in the source text (more than
half of which were Latin letters used instead of Cyrillic ones in Russian words), the
machine translation quality in the first instance turned out to be lower than it could have
been if the text had had no misprints. All Russian words containing Latin letters were left
untranslated by the system, which considerably complicated post-editing. That is why
machine translation was preceded by context substitution of the Latin letters with their
Cyrillic «analogs», and this slowed down the whole process.
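
A clean-up of this kind, in which Latin look-alike letters inside Russian words are replaced with Cyrillic ones, might be sketched in Python as follows. This is an assumption about how such a substitution could be done, not a description of the tool actually used; the look-alike table is partial and the function names are hypothetical. The «context» is taken here to mean that a word is corrected only if it already contains at least one Cyrillic letter, so purely Latin words such as PARS are left alone.

```python
import re

# Latin letters and the Cyrillic letters they are easily mistaken for
# (a partial, illustrative table; both strings have the same length).
LATIN = "AaBCcEeHKMOoPpTXxy"
CYRILLIC = "АаВСсЕеНКМОоРрТХху"
LOOKALIKES = str.maketrans(LATIN, CYRILLIC)

HAS_CYRILLIC = re.compile(r"[\u0400-\u04FF]")
WORD = re.compile(r"[^\W\d_]+")  # a run of letters in any alphabet


def fix_word(match):
    """Fix one word, but only if it is a mixed-alphabet word."""
    word = match.group(0)
    if HAS_CYRILLIC.search(word):
        return word.translate(LOOKALIKES)
    return word


def fix_misprints(text):
    """Replace Latin look-alikes inside words that contain Cyrillic."""
    return WORD.sub(fix_word, text)
```

A word that mixes the two alphabets but contains a Latin letter with no Cyrillic look-alike would still be left partly uncorrected, so a real clean-up would need a manual check as well.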

Before the translation session, the Polyglossum system of dictionaries by ETS Ltd. was
activated on CD-ROM, which made it possible to access any of the dictionaries by
pressing Alt+Tab without exiting from WinWord.

Here is an example of a translation made by PARS (Fig.16):

Fig. 16. PARS has translated a portion of the source Russian text directly in WinWord 6.0

As a result of machine translation, the screen is split into two WinWord windows: the
source text is in the top window, and the bottom one contains the unedited, draft
translation produced by PARS. This makes it possible to post-edit the text comparing it
with the original. The polysemantic words as well as words beginning with capital letters,
i.e. potential proper names, are marked in the target text with asterisks. The translator can
choose a more suitable translation option of a polysemantic word (phrase), and the
transliteration of the proper name.

Words not found by PARS were looked up in Polyglossum dictionaries.

2.4.3.1.2. Translating subject area oriented chapters

Mathematics and information science

This chapter of the Directory comprised 800 titles of research projects in the field of
mathematics and information science.

In order to translate this chapter, the following dictionaries were set up in PARS (in a
descending order of priorities):

1) computer dictionary (25,000 terms in each part, Russian-English and
English-Russian);
2) technical (76,000 terms);
3) general (35,000 words and phrases).

It is to be noted that these dictionaries turned out to be not enough for the complete
translation of the chapter, so I also used the Polyglossum system of dictionaries when
post-editing this portion, namely its mathematics and polytechnical dictionaries, as well
as the largest paper Russian-English dictionary, by Prof. A.I. Smirnitski (55,000 words
and phrases).

On the whole, the chapter «Mathematics and information science» was translated by
PARS fairly well, especially the titles of research in the field of information science due
to the programming dictionary. On the other hand, some purely mathematical terms (for
example, «tetrahedron») were unfamiliar to PARS, and I had to look them up in the
Polyglossum mathematical dictionary.

The main difficulty when working with this chapter consisted in translating phrases
comprising surnames of «foreign» mathematicians, as, for example, Langevin equation.
Because many similar phrases were absent both in PARS and in Polyglossum, I had to
look them up in The Great Soviet Encyclopaedia, which presents names of well-known
scientists in their native languages.

It is also necessary to note that the first chapter of the Directory, «Mathematics and
Information Science», was automatically translated as a whole, which slowed down
post-editing because WinWord works much slower with long text portions. The rest of the
text was translated in portions comprising 300-400 titles each.

Physics, astronomy

The following dictionaries were set up in PARS for translating this chapter (consisting of
1290 titles):

1) technical;
2) radioelectronics (50,000 terms);
3) microelectronics (20,000);
4) general.

Due to the absence of a special dictionary of physics and astronomy in PARS, post-
editing of this chapter was more difficult than of the previous portion. The Polyglossum
dictionaries, comprising about 1,500,000 terms of various subject areas, were of great
help. When post-editing the chapter «Physics, astronomy», I made use of the
Polyglossum polytechnical dictionary.

The main problems arose in rendering the names of the planets and their satellites, which
I managed to find in the English-Russian astronomy paper dictionary.

Chemistry

For translating this chapter (659 titles), the technical and general dictionaries were set
up in PARS.

This chapter was the most difficult to translate since it comprised quite a lot of specific
chemical terms, such as фталоцианин (phthalocyanine), редокс (oxidation-reduction),
рацемат (racemate), аценафтен (acenaphthene), гваяцил (guaiacyl), etc.

I had to look up the words not found either by PARS or by Polyglossum in the Russian-
English Dictionary of Chemical Reactions and in the English-Russian Dictionary of
Petroleum Chemistry and Processing because, generally, the difficulty consisted in the
spelling of the unknown chemical terms.

For example, it was clear that the English translation of the term «стирил» could not
differ seriously from the Russian variant, but I was not sure whether it was «styril» or
«stiril». I found the word «styryl» in one of the paper dictionaries, which put an end to
my doubts.

It was very difficult to translate complex terms consisting of several components, for
example, «винилхалькогенополигалогенбензол». PARS failed to translate such words,
which is why it took me 6 days to post-edit this comparatively short chapter.

Coming across a word consisting of several components, I usually broke it into
sense-bearing parts and translated them in turn. Thus, the term

винилхалькогенополигалогенбензол

was broken into винил, халькоген, полигалоген, and бензол. The resulting «simple»
words were translated and united into one. It’s only natural that such work was very
labor-intensive and took much time.
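
The manual procedure just described can be imitated by a small Python sketch. It is only an illustration under stated assumptions: the glossary is a hypothetical four-entry list, its English equivalents are my own working glosses and not verified chemical nomenclature, and the greedy longest-match strategy with skipping of a connecting vowel is one simple way to segment such words, not the way PARS or Polyglossum works.

```python
# Hypothetical mini-glossary of components (illustrative glosses only).
GLOSSARY = {
    "винил": "vinyl",
    "халькоген": "chalcogen",
    "полигалоген": "polyhalogen",
    "бензол": "benzene",
}

# Connecting vowels that may join two components in a Russian compound.
LINKING_VOWELS = ("о", "е")


def split_compound(word, glossary=GLOSSARY):
    """Split a compound into known components, longest match first.

    Returns the list of components, or None if some part of the word
    cannot be matched. There is no backtracking, so an unlucky greedy
    choice can make the function give up on a splittable word.
    """
    parts = []
    pos = 0
    while pos < len(word):
        candidates = [part for part in glossary if word.startswith(part, pos)]
        if candidates:
            best = max(candidates, key=len)
            parts.append(best)
            pos += len(best)
        elif parts and word[pos] in LINKING_VOWELS:
            pos += 1  # skip a connecting vowel between two components
        else:
            return None
    return parts


parts = split_compound("винилхалькогенополигалогенбензол")
print(parts)
print([GLOSSARY[part] for part in parts])
```

The sketch only produces the component-by-component glosses; uniting them into a correctly spelled English term remains the translator's job.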

Biology, medicine

This chapter (908 titles) was the second most difficult to post-edit. The following PARS
dictionaries were used for translation:

1) medicine (20,000 terms);
2) aviation medicine (24,000 terms);
3) technical;
4) general.

The main difficulty consisted in translating the names and genera of animals and insects.
I could not do without the Russian-English paper dictionary by A.I. Smirnitski, in which
I found such terms as «иглокожие» («Echinodermata»), «ракообразные» («Crustacea»),
etc. The dictionary by A.I. Smirnitski comprises quite a number of biological terms, and I
made the greatest use of this dictionary for post-editing the chapter «Biology, medicine».

Also, I used The Great Soviet Encyclopaedia, in which I found such rare terms as
ветвистоусые (Cladocera), булавоусые (Rhopalocera), сивуч (Eumetopias jubatus).

Besides, I used The Random House Unabridged Dictionary intensively, including its
electronic variant. It let me clear up the spelling of such words as Leishmania, Pteropoda,
etc.

Geosciences

This chapter comprised 752 projects in such subject areas as geology, paleontology,
archaeology, ecology, etc. PARS translated this chapter very well, owing to the
presence of geological and ecological dictionaries; this made the translation quality
several times higher than for such chapters of the Directory as «Chemistry»
and «Biology, Medicine».

The chapter was translated in two stages. It was split into two portions of nearly equal
size that were translated by PARS with the following dictionaries:

The first portion:

1) geological dictionary (11,000 terms);
2) technical;
3) general.

Embarking on the translation of this chapter, I didn’t yet know that it comprised many
documents on ecology, which is why the ecological dictionary was not chosen for
translating the first portion. When post-editing it, I saw that more than half of the words
not found in the system dictionaries related to ecology, so I also set up the PARS
ecological dictionary (18,000 terms) for translating the second portion.

I also drew some conclusions (these are discussed below) as to setting up the priorities of
the dictionaries, which made me give the dictionary of technical terms the highest
priority for translating the second portion. Thus, the list of dictionaries
looked like this:

The second portion:

1) technical;
2) geological;
3) ecological;
4) general.

When post-editing this chapter, I actively used The Random House Unabridged
Dictionary to clear up the spelling of such words as Тетис (Tethys), пегматит
(pegmatite), and many others. In particular, I made use of the table of geological periods
given in this dictionary. It is to be noted that this was the only source where I managed
to find translations of a large number of geological terms. It only took me 4 days to
translate this chapter, which would have been impossible without using the Random
House dictionary.

Humanities and social sciences

This chapter (214 titles) was the easiest to translate. I set up the following PARS
dictionaries:

1) general;
2) economic dictionary (55,000 terms).

Despite the fact that this chapter contained occasional terms of biology,
geology and ecology, the number of words not found by PARS was very small.

«Databases of the 95-96ies»

This chapter was translated by PARS very well using the general and computer
dictionaries. Post-editing consisted in making minor corrections to the machine
translation.

2.4.3.1.3. Translating and post-editing

This part discusses the process of translation itself, including the difficulties encountered,
as well as the ways of overcoming them.

The draft translations provided by PARS were of different quality, depending on the
subject area of the source text and, accordingly, on the presence of terminological
dictionaries.

A few translations required no post-editing at all or only «cosmetic» post-editing. For
example:

Source text:

133. Озернюк Н.Д. Механизмы метаболической стабильности процессов развития.

Machine translation:

133. Ozernyuk N.D. Mechanisms of the metabolic stability of the development processes.

Source text:

151. Офицеров В.И. Исследование структурной организации конформационных
антигенных детерминант на примере белков оболочки вируса гепатита А.

Machine translation:

151. Ofitserov V.I. Research of the structural organization of conformational antigenic
determination on the example of the proteins of the capsule of hepatitis of virus A.

Such cases were rare, but they did occur.

On the other hand, in some cases the translation offered by PARS was to be changed
completely to obtain the correct text. The fact is that if the system dictionary doesn’t have
a set expression, PARS translates it word by word, sometimes making it hard to
understand.

For example, the phrase «принимающий решения» was translated as «receiving
decisions», by analogy with «receiving letters». The phrase «образ жизни» was rendered
as «image of life» instead of «way of life», etc.

Some translation problems were caused by the absence of some geographic names in the
PARS general dictionary, for example, Yekaterinburg, Petropavlovsk-Kamchatka,
Kabardino-Balkariya, etc. However, at the very beginning of my work I entered
these words into the dictionary, which simplified post-editing.

One of the merits of the PARS system is the option of selecting translation variants. Here
is an example.

A title that relates to biology or medicine is to be translated:

580. Носиков В.В. Поиск и изучение антигенных детерминант, связанных с
аутоиммунной деструкцией островковых бета-клеток при инсулинозависимом
сахарном диабете, с использованием библиотек бактериофагов,
экспрессирующих широкий спектр разнообразных пептидных эпитопов.

Machine translation:

580. Nosikov V.V. Search and the studies* of antigenic determinants bound with the
autoimmunity disruption of islet beta-cages* at инсулинозависимом sugar diabetes,
using the libraries of bacteriophages expressing the broad* spectrum of diversified*
peptide epitopes.

A double click on the asterisk will display the list of translation options for this
word/phrase. Having chosen one of the variants and pressed the button, the post-editor
inserts it into the text instead of the initial one.

In the above text, the following translation variants were offered: studies (research,
analysis), cages (cells), broad (wide, capacious, extensive, large-scale), diversified
(miscellaneous, diverse). The most important substitution is certainly ‘cells’ instead of
‘cages’.
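The asterisk mechanism described above can be sketched roughly as follows. This is only an illustration, not the PARS implementation: the `LEXICON` entries, the toy three-word phrase and the function names are all assumptions made for the example. The first translation in each word entry is used by default, and a word is marked with an asterisk whenever other variants exist.

```python
# Hypothetical mini-lexicon: the first item in each list is the default translation.
LEXICON = {
    "клеток": ["cages", "cells"],
    "широкий": ["broad", "wide", "capacious", "extensive", "large-scale"],
    "спектр": ["spectrum"],
}

def draft_translate(words):
    """Return (text, variants) pairs; the first dictionary entry is used by default."""
    result = []
    for word in words:
        options = LEXICON.get(word, [word])  # unknown words pass through unchanged
        result.append((options[0], options[1:]))
    return result

def render(tokens):
    """Show the draft, marking every word that has alternatives with an asterisk."""
    return " ".join(text + ("*" if variants else "") for text, variants in tokens)

def choose(tokens, index, variant):
    """Replace the word at `index` with one of the variants offered for it."""
    text, variants = tokens[index]
    if variant not in variants:
        raise ValueError(f"{variant!r} is not offered for {text!r}")
    tokens[index] = (variant, [])  # in this sketch the post-editor's choice is final
    return tokens

tokens = draft_translate(["широкий", "спектр", "клеток"])
print(render(tokens))   # broad* spectrum cages*
choose(tokens, 2, "cells")
print(render(tokens))   # broad* spectrum cells
```

The word-by-word output is deliberately crude; the point is only how the default variant, the asterisk and the post-editor's substitution relate to one another.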

One of the main features of the PARS system is that the translator may set up dictionary
priorities. When choosing the dictionaries to be used in the translation session, it is
recommended to arrange them in the optimum order, placing at the top of the list the
dictionary which is going to be used most frequently in this session. It is common
knowledge that a word can have quite a lot of translations out of context, and the
right translation depends on the subject area. A good example was given above: the
medical term «бета-клетки» was translated «beta-cages». In this case, the system treated
the polysemantic word «клетка» as a general-usage one, not as a medical term.

Another example shows a different situation: the system understood the Russian word
«опыт» as «experiment», so the machine translation of the sentence «in the context of the
European experience» was «in the context of European test*». The translation option for
the word «опыт» was «experience».

When translating the chapter «Geosciences», I understood very well the importance of
setting up dictionary priorities in PARS. As was mentioned above, the first part of the
chapter was machine translated with the geological dictionary having the highest priority.
The fact is, I believed that geology would be the main topic in this chapter. However, I
did not take into account that those texts comprised many common words, which would
be translated as their «geologic equivalents».

For example, «разработка нового метода» was translated «the mining of new method»,
whereas the word «development» should have been used.
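The effect of dictionary priorities can be illustrated with a small sketch. It assumes, purely for illustration, that each dictionary is a plain word-to-translation mapping and that the system takes the translation from the first dictionary in the priority list that contains the word; the dictionary contents and the `lookup` function are hypothetical.

```python
# Toy dictionaries (hypothetical entries based on the examples in the text).
GENERAL = {"клетка": "cage", "разработка": "development"}
MEDICAL = {"клетка": "cell"}
GEOLOGY = {"разработка": "mining"}

def lookup(word, dictionaries):
    """Return (translation, dictionary name) from the first dictionary that has the word."""
    for name, entries in dictionaries:
        if word in entries:
            return entries[word], name
    return None, None  # the word is in none of the selected dictionaries

medical_first = [("medical", MEDICAL), ("general", GENERAL)]
geology_first = [("geology", GEOLOGY), ("general", GENERAL)]
general_first = [("general", GENERAL), ("geology", GEOLOGY)]

print(lookup("клетка", medical_first))       # ('cell', 'medical')
print(lookup("разработка", geology_first))   # ('mining', 'geology')
print(lookup("разработка", general_first))   # ('development', 'general')
```

With the geological dictionary on top, the common word «разработка» receives its geological equivalent, which is exactly the trap described above.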

Without doubt, when assigning priorities to the dictionaries, one should remember that
it’s impossible to foresee all the cases of the contextual usage of a word. The system does
not «feel» differences between shades of meaning.

For example, when translating the chapter «Physics, astronomy» the phrase
«осажденный магнетик» was translated «besieged magnet». In this case, the system
failed to distinguish between the physical term «осаждение» («reduction») and the
general-usage «осаждение» (besieging a city or fortress). The reason is quite simple: in
the dictionary, the word «besiege» is the first translation, and the system gives preference
to it. Here I want to note another interesting feature of the PARS system: the possibility
of transposing the translations in the dictionary.

I’d also like to add that the draft machine translations produced by PARS sometimes had
funny peculiarities. Thus, sometimes Russian surnames were translated into English as
general-usage words: «Бобров - Beavers, Зубов - Teeth», etc. In such cases, I simply
used the transliteration facility.
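What a transliteration facility does can be sketched as a simple letter-by-letter mapping. The table below is one common romanization convention chosen for illustration; it is an assumption and not necessarily the scheme used in PARS.

```python
# Illustrative Cyrillic-to-Latin table (an assumed convention, not the PARS one).
TRANSLIT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e", "ё": "yo",
    "ж": "zh", "з": "z", "и": "i", "й": "y", "к": "k", "л": "l", "м": "m",
    "н": "n", "о": "o", "п": "p", "р": "r", "с": "s", "т": "t", "у": "u",
    "ф": "f", "х": "kh", "ц": "ts", "ч": "ch", "ш": "sh", "щ": "shch",
    "ъ": "", "ы": "y", "ь": "", "э": "e", "ю": "yu", "я": "ya",
}

def transliterate(word):
    """Transliterate a Russian word letter by letter, preserving capitalization."""
    out = []
    for ch in word:
        latin = TRANSLIT.get(ch.lower())
        if latin is None:
            out.append(ch)                   # hyphens, digits, Latin letters stay as they are
        elif ch.isupper():
            out.append(latin.capitalize())   # "ж" -> "Zh" for a capital letter
        else:
            out.append(latin)
    return "".join(out)

print(transliterate("Бобров"))   # Bobrov
print(transliterate("Зубов"))    # Zubov
```

Applied to a surname, the mapping bypasses the dictionary altogether, so «Бобров» can no longer turn into «Beavers».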

2.4.3.1.4. Conclusion

The translation of the RFFR Directory was very interesting and useful work for me. It
helped me work out a technique for handling similar texts and find out which
dictionaries contain terms relating to particular areas of knowledge. Experiments with the
choice of dictionaries for each of the chapters showed me which dictionaries are
better suited to translating texts in particular subject areas.

For the PARS system, this work turned out to be an even more useful stage of its
development. In the process of post-editing, new words and phrases were entered into the
dictionaries. New dictionary projects were launched. Presently, PARS comprises
dictionaries of mathematics (85,000 terms), chemistry (50,000 terms), oil and gas (70,000
terms), physics (88,000 terms), a much larger dictionary of geology (27,000 terms), etc.

I believe that permanent contacts of PARS and Polyglossum users with the authors of
these systems, Lingvistica ‘93 Co. and ETS Ltd. respectively, will contribute to further
raising the operational abilities of the systems.

2.4.3.2. Translating technical texts

Now I would like to tell you about my experience in the application of PARS and
Polyglossum to translate a large portion of technical texts in the field of aircraft building.
The texts were presented in the form of lectures for Iranian students who were coming to
study at Kharkov State Aviation University.

As it always happens, the work was «awfully urgent»: it was initiated not long before the
beginning of the academic semester, and, according to the contract, I had very little time.

First I tried to attain both high quality and the required speed, but very soon I understood
that a certain degree of perfection of the output texts was to be sacrificed as I needed a lot
of time for finding translations of specific technical terms, and there was very little time
left for polishing the style.

2.4.3.2.1. Manual translation using electronic dictionaries

The first portion of the texts (about 150 pages) came in «traditional» (though traditions
change quickly nowadays!) paper form. Because of the low quality of the printed copies
(they had been typed on an old typewriter), I could not scan the texts and then
run them through PARS. At the same time, I was supposed to present the
translations as electronic text files: typewriters are no longer used.

The texts had some words (5-7%) which I couldn't find in any general-usage
Russian-English dictionary. Those were mainly special technical terms, such as «бобышка», the
English translation for which is «boss», «линия выноски» - «extension line», «визирная
линия» - «hairline». The task was made easier by the Polyglossum
set of electronic dictionaries, in which all the unknown terms were found. The
mathematical and polytechnical dictionaries were of the greatest help, the former
comprising 85,000 word entries, and the latter 300,000.

It seems necessary to mention that, apart from the problems with technical terms, there were
certain difficulties with the syntactic structure of some of the source Russian sentences,
especially in the texts of lectures on technical maintenance, safety, and storage
requirements. The texts, written in a rather peculiar style never used
in everyday life, were very hard to translate. Some sentences had to be reread several
times before their meaning was grasped and grammatical structure understood, though
this may sound rather strange as I am a native Russian speaker!

Certainly, this slowed down the translation process, but it is always the case with manual
translation: you try to understand the grammar, and it may require a lot of time and
effort, especially if the author’s mentality differs drastically from yours, and the style
peculiarities are too exotic!

It should also be noted that one of the requirements for the translations was adherence to
the terminological system used at the Aviation University. The task was rather difficult as
there were 25 translators working on the lectures, and we could not coordinate our work
properly for purely organisational reasons.

However, in the long run, my translations were checked by a terminologist
experienced in translating aviation texts.

All the source texts can be divided into four groups according to the subject areas. The
first group consisted of purely technical texts, such as, for example, «Loft-Template
Method of Manufacturing Machine Parts». The lectures in the second group were
descriptions of special computer systems used in airplane design. Technical guides for
special devices composed the third group. The fourth group, which was the easiest to
translate from the lexical point of view but the most difficult one as to grammar,
comprised lectures on the general requirements for technical documentation, such as «The
Order and Rules of Manufacturing Certification».

It is quite obvious that different dictionaries and dictionary combinations had to be used
for translating these text groups. Thus, the first group required the mathematical and
polytechnical Polyglossum dictionaries. When translating texts of the second group, I
used another pair of Polyglossum dictionaries: software (25,000 word-entries) and
polytechnical. As for the third and fourth groups, their terminology was covered by the
polytechnical dictionary only.

Depending on the degree of difficulty of the source texts, the average time needed for
the manual translation of 15-20 text pages using Polyglossum was 3.5 hours. And I
would have hardly coped with the work without the Polyglossum dictionaries!

2.4.3.2.2. Post-Editing machine translations

The second text portion (about 650 K) was presented as WinWord and DOS files. The
lectures were translated by PARS and then post-edited. I used both PARS for Windows
and PARS for DOS. The following PARS dictionaries were used in various
combinations:

• general (40,000);
• polytechnical (76,000);
• concise aviation dictionary (7,000);
• aerospace (60,000);
• mathematics (85,000);
• computer dictionary (25,000).

The Polyglossum polytechnical dictionary was also used.

At this stage, I had even less time for doing the work. Every day at noon I was given
an original Russian text of 20-30 pages with the request to «please translate it
not later than tomorrow morning! The Iranians are coming!». And the stream of
such texts seemed endless.

The situation being so tense, sacrifices as to the stylistic purity of the end-translations had
to be made in order to submit translations as soon as possible. The idea of my
post-editing was to make the texts grammatically correct and understandable, omitting a
number of stylistic details, such as repetition of several «of»-clauses, misuse of articles
in those cases where this did not affect understanding, etc.

This allowed me to come up with translations that were grammatically and lexically
correct though stylistically far from ideal in a number of cases. Here are two examples of
machine translations left as they were, without any post-editing:

«Requirements to the execution of the outlines of modern airplanes and assurance of
inter-changeability of their aggregations».

«This advantage is especially noticeable on the larger level of loading».

Again, technical terms were the main difficulty, but the problem was solved by using
Polyglossum where PARS failed to translate a term: the time needed to find in
Polyglossum a word not translated by PARS and paste it into the text is, as a
mathematician would say, «negligibly little». Besides, all the «new» terms were
immediately entered into the corresponding PARS dictionaries by Lingvistica ‘93
dictionary officers, which made PARS «cleverer» with each text translated.

And, again, the source texts being abundant in specific technical terminology, the final
translations were checked by an experienced translator of aviation texts.

At the same time, there were also some sentences generated by PARS that had to be
changed completely, as, for example, the following one:

«After switching-on pumping station, if via 5 with pressure is not is heaved above 8 Pa
actuates signaling table ...ABORT».

If all or most of the sentences had been translated so poorly, post-editing would have
been much more difficult, which would have made MT quite or almost useless. However,
such examples were very rare.

The major conclusion was as follows.

It took me about 2-2.5 hours to post-edit 15-20 pages of texts using PARS, and, what
is very important, the work itself was not so boring and tiresome as manual
translation.

In other words, editing machine-made (in my situation, PARS-made) Russian-English
technical translations with the help of a set of additional electronic dictionaries (in my case,
Polyglossum) is about 3 times easier than translating manually using electronic
dictionaries only (again, Polyglossum), although the target texts will be stylistically
poorer.

Using PARS and Polyglossum, the translator can prepare 20-30 pages a day.

As to purely manual translation, as well as using terminologically poor systems, the
result would have been obvious; that is why such a situation simply was not
analysed.

I hear that professional translators are very severe as to machine translation. They say
that computers cannot compete with them. Well, PARS does not compete with me. It
helps.

2.4.4. Evaluating MT without analyzing the translations

Very often, at numerous demonstrations of MT systems developed under the PARS
project, I have to answer one and the same question:

‘What if a text is first translated and then back translated by the MT system? Will the
source text and the one obtained in such a way be alike?’

My friend and colleague, Jevgeni Pinchasik, called this procedure the iterative reverse
translation approach.

In August 1994, PARS 3.0 was exhibited at the European Conference on Artificial
Intelligence (ECAI) in Amsterdam, The Netherlands. The numerous visitors did not
speak Russian, so they could hardly evaluate translation quality directly. That is why a
different method was chosen, the Iterative Reverse Translation technique. It consists in
comparing the source text with the reverse translation of the machine product.
Translation reversibility means the possibility of translating the target text back into the
source language, which any of our translation tools can easily do, being bidirectional MT
systems.

Let's suppose that the linguistic capabilities of the PARS system are to be demonstrated to a
person who has a good command of only one of the two languages covered by the
system. In this case, the translation direction is determined by purely practical considerations,
that is, the user's language proficiency. Thus, if the user speaks English and doesn't speak
Russian, it would be only natural to have an English text translated into Russian (the 1st
cycle), and then have the Russian output back-translated into English (the 2nd cycle).

We shall call the subsystems performing the 1st and the 2nd cycle translations «the 1st
translation direction subsystem» and «the 2nd translation direction subsystem»,
respectively.

If the comparison shows substantial difference between the original text and the one we
obtain after translating the machine output into the source language, there may be two
reasons:

a) there are some errors in the dictionary and/or linguistic algorithm in the 1st
translation direction subsystem, or
b) it is the 2nd translation direction subsystem that is to be modified.

Practically, the user may correct the dictionary if necessary and repeat translation
procedures (iterations) until the comparison proves satisfactory.
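The procedure can be sketched in a few lines. Everything here is an assumption made for illustration: the two toy dictionaries stand in for the 1st and 2nd translation direction subsystems, translation is reduced to word-by-word lookup, and the similarity measure (word-level `difflib.SequenceMatcher` ratio) is just one possible way of comparing the two texts.

```python
from difflib import SequenceMatcher

def round_trip(text, forward, backward):
    """Translate word by word into the other language (1st cycle) and back (2nd cycle)."""
    intermediate = [forward.get(w, w) for w in text.lower().split()]
    return " ".join(backward.get(w, w) for w in intermediate)

def similarity(original, reverse):
    """Word-level similarity between 0 and 1; 1.0 means the two texts are identical."""
    a, b = original.lower().split(), reverse.lower().split()
    return SequenceMatcher(None, a, b).ratio()

# Toy dictionaries with hypothetical entries.
EN_RU = {"bank": "банк", "raises": "поднимает", "hope": "надежду"}
RU_EN = {"банк": "bank", "поднимает": "hoists", "надежду": "hope"}

source = "bank raises hope"
reverse = round_trip(source, EN_RU, RU_EN)
print(reverse)                                   # bank hoists hope
print(round(similarity(source, reverse), 2))     # 0.67

# One iteration: the user corrects the dictionary and repeats the procedure.
RU_EN["поднимает"] = "raises"
print(similarity(source, round_trip(source, EN_RU, RU_EN)))   # 1.0
```

A low score does not say which of the two subsystems is at fault; as noted above, that still has to be established by looking at the dictionaries and the algorithm.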

Here are some of the numerous examples, provided by Jevgeni:

Source: Financial Times, July 22, 1994

1. Original English text:       IBM earnings rise surprises Wall Street.
   Reverse English translation: Increase of earnings astonishes Wall street.

2. Original English text:       Bundesbank raises hope of fall in rates.
   Reverse English translation: Bundesbank hoists hope on the tumble of rate.

3. Original English text:       Telegraph's chairman in sharp attack on Cazenove.
   Reverse English translation: Chairman of Telegraph in sharp attack on Cazenove.

4. Original English text:       Italy fails to decide economic policy.
   Reverse English translation: Italy fails decide economic policy.

5. Original English text:       Latin Americans rediscover their neighbours' markets.
   Reverse English translation: Latin americans discover again markets of its neighbors.

6. Original English text:       Ecuador Parliament rejects Telecom sale.
   Reverse English translation: Parliament of Ecuador rejects selling Telecom.

7. Original English text:       UN lawyers take steps to set up criminal court.
   Reverse English translation: Lawyers UN taking steps in order to set up the court of criminals.

8. Original English text:       Japan puts conditions on backing Taiwan's bid to join GATT.
   Reverse English translation: Japan puts conditions on maintaining the proposal of Taiwan
                                to affiliate GATT.

Original English text:

Alcatel in China

Alcatel Australia has signed contracts, valued at a US$56 sum to provide
telecommunication equipment to China, agency AP-DJ reports from Canberra. The
company will supply and install digital switching exchanges to the autonomous regions
of Tibet and Ningxia and the province of Gansu.

Reverse English translation:

Alactel in China

Alcatel Australia signed contracts on sum of 56 dollars USA for the ensuring of China by
the equipment of telecommunication, discloses agency AP-DJ from Canberra. Company
will supply and establish the digital switching of exchanges for the autonomous districts
of Tibet and Ningxia and for province Gansu.

Note that these sample translations only display the first (main) meanings of the
polysemantic words. However, we already know that PARS features a special
multi-variant support facility, that is, each word having more than one translation in the
dictionary is marked with an asterisk so that the user can have all the variants displayed,
choose the most appropriate one and have it pasted into the target text instead of the
initial one.

So, one of the most obvious reasons for discrepancies between the source text and its
reverse translation can be formulated as follows.

Let the source language word A have three translations in the dictionary: A1, A2 and A3,
A1 being the first translation in the dictionary. Then PARS will translate A as A1 and
mark the latter with * in the target text, offering A2 and A3 as the variants. Let A1 have
two equivalents in the dictionary: B and A, the former being the first translation. In this
case, the system will translate A1 as B and mark it as a polysemantic word, displaying A
as the variant. Here is an example: the English word «raise» is translated «поднять» in
the PARS general dictionary, while «поднять», in its turn, has three translations: ‘raise’,
‘lift’, and ‘hoist’.

The user will be able either

a) to substitute A2 or A3 for A1 in the target text, or
b) to make A the main translation for A1 by means of transposing B and A in the
word-entry of the word A1.
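The situation with «raise» can be reproduced in a short sketch. The order of the translations in the «поднять» entry is assumed here so that «hoist» comes first, as in the sample above; the data and the function names are hypothetical and do not reflect the actual PARS dictionary format.

```python
# Word entries as ordered lists; the first item is the main translation.
EN_RU = {"raise": ["поднять"]}
RU_EN = {"поднять": ["hoist", "raise", "lift"]}   # order assumed for illustration

def first_translation(word, dictionary):
    """Return the main translation, an asterisk if variants exist, and the variants."""
    options = dictionary[word]
    mark = "*" if len(options) > 1 else ""
    return options[0], mark, options[1:]

def transpose(dictionary, word, preferred):
    """Make `preferred` the main translation by moving it to the front of the entry."""
    options = dictionary[word]
    options.remove(preferred)
    options.insert(0, preferred)

ru, _, _ = first_translation("raise", EN_RU)      # 1st cycle: raise -> поднять
en, mark, others = first_translation(ru, RU_EN)   # 2nd cycle: поднять -> hoist
print(en + mark, others)                          # hoist* ['raise', 'lift']

transpose(RU_EN, "поднять", "raise")              # option b) from the list above
print(first_translation(ru, RU_EN)[0])            # raise
```

After the transposition the reverse translation coincides with the source word, while the other variants remain available behind the asterisk.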
PART 3. CONCLUSION: Which way to go?

During all these years in language engineering, I have come to understand very well: machine
translation is a task that can never be solved completely. The paradox is that the more
rules of text analysis and synthesis you introduce into your system, the slower it
progresses. Well, at the beginning of a new project, the progress is quite obvious, every
new version translates much better than the previous one, and the «specific weight» of
each linguistic rule is very high. But the cleverer the system gets, the worse it reacts to
modifications. More than that, sometimes translations become even poorer after the
translation algorithm has been modified. Why? Because the new rules may contradict the
«old» ones. Just like in the Russian joke: when you pull out the nose, the tail gets in, and
vice versa. I think that an MT system can be compared with a human organism: when
young, it grows rapidly, but getting older, the growth slows down.

However, taking into account that a lack of progress will sooner or later result in stagnation and
death, what shall we do to prevent our systems from dying? How can we keep improving
them? Which way shall we go?

The more I think about it, the more convinced I become that we should teach the program
to make use of its own mistakes. I remember Boris Pevzner telling me 10 years ago,

«It’s strange that most of language engineering systems repeat the same mistakes every
time they tackle similar problems».

Isn’t it better to save the right and the wrong translations in two separate files and make
the system compare the text to be translated with the mistakes it used to make as well as
the situations when it avoided mistakes? That’s what is now called example-based
translation, and the files I have mentioned are called translation memory, while in the 80s
a colleague of mine suggested a less metaphorical term, precedential databases.

Boris Pevzner made the first experiments in example-based MT in the early 70s. He said
it was the most reliable and promising method in language engineering in general, and in
machine translation, in particular. His experimental systems translated English scientific
titles as well as German composite words into Russian using examples provided by the
teacher, the linguist in charge of the precedential databases. And he called this
technology «MT systems with a human teacher».

His basic idea is as follows:

• the program looks a text segment (such as a noun group) up in the «memory»; if the
result of the search is positive, the segment acquires the same translation as in the
«memory»;
• if the segment is absent in the memory, the program tries to find a segment looking
similar to the source one; the decision depends on the similarity criterion, for
example, «looks a text segment up» may be considered similar to «looks some
missing names up»;
• if such a similar segment is found, the program makes what is called substitution; in
the above example, it substitutes some missing names for a text segment;
• and, if no similar segment has been found, the program applies the FTA principles.
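The four steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not Pevzner's actual algorithm: the «memory» is a plain mapping from Russian segments to English translations, two segments count as similar when they share some leading or trailing words, and the `fallback` function is a placeholder for whatever regular translation routine the system would apply in the last step.

```python
def common_prefix(a, b):
    """Number of leading items that two word lists have in common."""
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n

def translate_segment(segment, memory, fallback):
    # Step 1: exact match in the memory.
    if segment in memory:
        return memory[segment]
    words = segment.split()
    # Step 2: look for a stored segment that shares a beginning and/or an end.
    for source, target in memory.items():
        stored = source.split()
        p = common_prefix(words, stored)
        s = common_prefix(words[p:][::-1], stored[p:][::-1])
        if p + s == 0:
            continue  # nothing in common: not similar under this criterion
        old_part = " ".join(stored[p:len(stored) - s])
        new_part = " ".join(words[p:len(words) - s])
        # Step 3: substitution, possible only if both differing parts are known.
        if old_part in memory and new_part in memory and memory[old_part] in target:
            return target.replace(memory[old_part], memory[new_part], 1)
    # Step 4: no similar segment found; hand over to the regular routine.
    return fallback(segment)

# Hypothetical memory built from phrases that occur in the titles quoted earlier.
MEMORY = {
    "механизмы метаболической стабильности": "mechanisms of metabolic stability",
    "метаболической стабильности": "metabolic stability",
    "структурной организации": "structural organization",
}

print(translate_segment("механизмы структурной организации", MEMORY, lambda s: None))
# mechanisms of structural organization
```

The similarity criterion is the part most open to choice; a real system would need something far more discriminating than shared prefixes and suffixes.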

Another idea is using a special contextual dictionary for semantic disambiguation. On my
shelf I keep, among other valuable books, the 2-volume «Contextological
Dictionary for Automatic Translating English Multi-Meaning Words into Russian» by
Prof. Yuri Marchuk, published in 1976 in Moscow. Marchuk’s idea is easy to understand:
if we compile a dictionary in which translations are accompanied with contexts, the latter
acting as «landmarks» for making a decision, the program, when coming across a word
with several translations, will compare its context in the source text with those provided
in the dictionary for disambiguation and make the right decision. For example, the
English word «red» will be translated into Russian as «рыжая» if the context is «fox»,
and «красная» in any other situation: the fact is that, in Russian, foxes have a different
color than ribbons, flags, balls, poppies, etc. That’s exactly what a contextual dictionary
is for!
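A contextual dictionary of this kind can be sketched as a nested mapping. The structure and the `disambiguate` function are assumptions made for illustration and say nothing about how Marchuk's dictionary is actually organised; the sketch also ignores gender agreement and simply returns the feminine forms used in the example.

```python
# Hypothetical contextual entry: context words act as «landmarks» for the decision.
CONTEXT_DICT = {
    "red": {"contexts": {"fox": "рыжая"}, "default": "красная"},
}

def disambiguate(word, neighbours, dictionary=CONTEXT_DICT):
    """Pick the translation whose landmark occurs among the neighbouring words."""
    entry = dictionary.get(word)
    if entry is None:
        return None  # the word has no contextual entry
    for neighbour in neighbours:
        if neighbour in entry["contexts"]:
            return entry["contexts"][neighbour]
    return entry["default"]  # no landmark found: use the general translation

print(disambiguate("red", ["fox"]))      # рыжая
print(disambiguate("red", ["ribbon"]))   # красная
```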

So, as you see, this book has no conclusion. Two or three years will pass, and a new
version will be written. I don’t think I’ll ever finish writing it simply because I’ll never
stop developing machine translation systems. And when I do, I hope there’ll be someone
to continue the work. First of all, my daughters.

ACKNOWLEDGEMENTS

There are quite a few people I would like to mention and thank in this summing-up book.
The only problem is that not all of them are alive any more.

I will never forget the influence of Raisa Pogorelova, who was my English teacher at the
University, upon my decision to bind my life with the English language.

Prof. Raimund Piotrowski was the first machine translator I ever met, and he became
one of my teachers back in the 70s. He founded and headed the famous «Statistika
rechi» («Speech Statistics») group based in St. Petersburg. It was there that I had the
pleasure of making the acquaintance of the charming machine translators Prof. Larisa
Beliayeva and Dr. Tatiana Apollonskaya.

The late Prof. Victor Berzon and Dr. Boris Pevzner, who now resides in Israel, were
the two outstanding linguists I collaborated with for many years. Victor introduced me
to modern linguistics. Boris, as well as Dr. Valeri Yepifanov, made me understand
what an operational, «industrial»-type information processing system is; so did Dr.
Vladimir Terletski, who was my «second father» during the 13 years of working at the
VNIITelektromash research institute.

Besides Andrei Kursin and Alla Rakova, whom I have already mentioned, I have
successfully collaborated with the following persons: Alexander Akselrod, Bella
Valenko, Alexander Zakharov, Vadim Obukhov, programmers; Dr. Oksana
Polonskaya and Yelena Zeitlina, linguists. I am deeply grateful to Mr. Vladimir
Topopolsky and Dr. Alexander Serebriany, who both supported the PARS-1 project
organizationally and financially.

As to the late Vladimir Kolykhmatov, the most brilliant translator I have ever met, and
my intimate friend, he supported the PARS-3 project, both financially and morally.

Andrei Dashko was my first serious business partner for several years, and it was he
who created conditions for my investigations in 1990-1993.

I also thank Igor Fagradiants, Director of ETS Publishers, the representative of
Lingvistica '93 in Russia. Igor is one of my closest friends and, certainly, the most
reliable partner. This pamphlet would have never been published without his direct
assistance. He, Dr. Leonid Kelner and Vladimir Petrov, my American partners, set up
POLYGLOSSUM, Inc., whose Vice President I have the honor to be.

My special thanks are to Dr. Yevgeni Lovtsky, a brilliant machine translator, who has
been my friend for 20 years now.

Great is my gratitude to and admiration of my daughters, Olga and Marina. I hope they
will do much more than their father has done and is going to do.

Dr. David Wigg, the Chairman of The Natural Language Specialist Group, and Dr. John
Hutchins, President of the International Association for Machine Translation, both
residing in Great Britain, keep me in close contact with the state-of-the-art in machine
translation in the West and publish my papers in their respective Newsletters, which lets
me feel a particle of a large international family.

There are five more persons who made all my work possible: Maria Krupetskaya, my
grandmother; Vladimir Petkevich, my grandfather; Klarissa Blekhman and Samuel
Blekhman, my parents; Nadezhda Bezhanova, my wife. To express my gratitude to
them, I'd have to compose a separate book. The only thing I will say here is that, if it had
not been for them, there would be one machine translator less first in the Soviet Union,
then in Ukraine, and now in Canada.
