April/May 2010 GETTING STARTED: Guide

Streamlining Internationalization and Localization
Evaluating Emerging Language Technologies
Corpus Linguistics and the Translation Process
Creating Your Own Multilingual Technology


Streamlining Internationalization and Localization
Ian Henderson
Ian Henderson is CEO of Rubric, a provider of localization services to the high technology industry for the past 15 years.

Evaluating Emerging Language Technologies
Vadim Berman
Vadim Berman is a cofounder and CEO of Digital Sonata, a provider of language engineering products and services.

Corpus Linguistics and the Translation Process
Thiana Donato
Thiana Donato is executive director and founder of All Tasks, a Brazilian company in the South American multilingual services market.

Creating Your Own Multilingual Technology
Dennis Wakabayashi and Chris Golaszewski
Dennis Wakabayashi, founder and chief online officer of Mojoti, has handled international business operations for a number of years. Chris Golaszewski is the manager of online development at Mojoti.

Subscriptions: Terri Jadick
Special Projects: Bernie Nova
Advertising
Subscriptions, customer service, back issues
Submissions: Editorial guidelines are available at
This guide is published as a supplement to MultiLingual, the magazine about language technology, localization, web globalization and international software development. It may be downloaded at


Guide: GETTING STARTED
Streamlining Internationalization
and Localization
Ian Henderson

Internationalization is defined as an enabling process: making original content, such as code, ready for markets around the world. Localization is then an adaptation process that prepares content for a specific target market globally.

Ideally, the two processes work seamlessly. Internationalization should ease the process of localization. Many years ago (perhaps after the fiasco of Y2K, when countless professionals scrambled to undo two-character year fields), IT and product teams, especially in the software industry, learned that it is better to design code or content with the intent of presenting it globally than to have to retrofit it after the fact.

Yet, the gaps between localization and internationalization often remain wide. In 2007, Lingoport conducted a survey of customers and vendors that identified a major gap between internationalization and localization teams, which can adversely impact time-to-market deadlines.

Three years later, the significant crevice between internationalization teams and localization providers persists. Time-to-market deadlines have shrunk even further, and expectations for lower internationalization and localization costs continue to increase. Furthermore, recent research, such as that conducted by Aberdeen Group (see sidebar), points to the value of integrated translation environments. Content providers and code developers who integrate teams from end to end will reap sizable benefits as they roll their products and services out worldwide.

The problem

Many times when we begin work as a language service provider (LSP), the internationalization process is fully complete. In the client's mind, it is now just a question of localizing the files and going to market. If localization were as simple as that, we would be out of work pretty quickly.

The issue we face as an LSP coming up to speed to get a product to local markets around the world is that the internationalization effort has frequently been completed without involving an experienced internationalization team or any other LSP for that matter. Take Synaptics, for example, a leading worldwide developer of human interface solutions for mobile computing, communications and entertainment devices. Previously, this company had updated its multilingual resource code (RC) files by adding new English strings at the end of each language section. This was a manual, error-prone and time-consuming process for the client. An added complication was that not all languages were in sync, so the added English strings varied from language to language. At our end, we had to extract the English strings from each section, translate them and patch them back into the multilingual files. The engineering process was laborious.

Ideally, the internationalization and localization teams, whether external or internal, work closely together. We have found that the ability to streamline is directly correlated to the volume and frequency of work. In fact, when we work with a client to streamline and reduce the effort and cost of the localization process, it is imperative that there is high volume and frequency of work as we often undertake these cost-reduction exercises without passing the cost on to the client. Once that relationship is established, we work with the client to standardize file formats, file names, codes and so on.

One of our first recommendations to clients is to employ an integrated translation management solution that includes standardized terminology in one place, for both the internationalization and localization teams, which is a big step in reducing redundancy and encouraging collaboration. It is not surprising that the recent Aberdeen research found that companies using an integrated solution (see Figure 1) were far more successful with production and version control.

[Figure 1: Solutions integrated with translation management. Bar chart comparing best-in-class companies with all others for process/project management and terminology management. Source: Aberdeen Group, Translating Product Documentation]

Within that integrated process, many factors influence overall success. For example, it is surprising how many clients believe that if all localizable text is put into Excel or XML files during the internationalization process, then the problem is solved. Unfortunately, this is not the case. All serious localization companies will use translation tools when localizing files, but these tools will only support standard file formats, such as RC and XLIFF. If you come up with your own XML schema or Excel spreadsheet, you can be pretty sure some engineering effort will be
required to separate translatable from non-translatable text. Fixing file format structure during the internationalization process speeds localization.

Next, we work with clients to ensure the file naming scheme follows a recognizable and consistent pattern. Typically translation tools retain the name of the source file for the target file, but put the target file in a different folder. For example, a source file called /en/resources.rc may end up as /fr/resources.rc. Alternatively, translation tools may rename the source file by adding or replacing language identifiers at the end of the file. /res/props.properties may become res/props_de-… Handling a mishmash of file naming conventions is not a problem in itself, but it adds time and increases cost as somebody needs to make sure the translated file names conform to the required pattern.

Using standard language and country codes reduces the risk of errors. We have one client using gr to denote German, while another client uses the same gr for Greek. Because of this, in one rushed instance we actually delivered the wrong language. Using standard ISO codes, such as de for German (Deutsch) and el for Greek (Ellinika), alleviates that problem.

Another client uses bs-ID for one of its languages. This is not Bosnian as spoken in Indonesia, but, in fact, refers to Bahasa Indonesia (id-ID in ISO terms). Similarly, bs-BS is neither Bosnian nor Bahasa as spoken in the Bahamas, but Bahasa Melayu (ms-MY in ISO speak). Straightening out these differences using the ISO codes from the very beginning streamlines the entire process.

Multilingual files add to the workload, as they need to be split into monolingual files and reassembled after translation. When we can work with the internationalization teams, we can limit the impact of multilingual source files on the localization process and cost.

Character, content and context

When working with software files, we often encounter the issue of having to escape characters. In many cases there will be no escaped character in the source phrase, so deciding how to escape an apostrophe character (') in French, for example, can be a challenge. Should it be: j'ai, j''ai, j\'ai, j\\'ai or j\\\'ai? We often see multiple different examples, even in the same file.

Within RC files there is usually some language-specific content. For example, the highlighted text below is not usually presented to the translator as a translatable text because translation tools will make these changes automatically.

    ////////////////////////////////////
    // English (U.S.) resources
    #if !defined(AFX_RESOURCE_DLL) || defined(AFX_TARG_ENU)
    #ifdef _WIN32
    LANGUAGE LANG_ENGLISH, SUBLANG_ENGLISH_US
    #pragma code_page(1252)
    #endif //_WIN32

    ////////////////////////////////////
    // Japanese resources
    #if !defined(AFX_RESOURCE_DLL) || defined(AFX_TARG_JPN)
    #ifdef _WIN32
    LANGUAGE LANG_JAPANESE, SUBLANG_DEFAULT
    #pragma code_page(932)
    #endif //_WIN32

RC localization tools are aware of this and will change the content accordingly; however, we have also seen instances where content needs to be introduced into the translated file. This can be addressed, but it requires additional effort, time and cost.

    #The file Contains the property Values. Please read \ as escape characters on the left hand side. For example Test\ Literal= should be read as Test Literal
    #Fri Aug 28 18:32:04 IST 2009
    Start\ Receipt=Start Receipt
    On\ Case\ Pre\ Receipt=On Case Pre Receipt
    Change\ Shipment\ Status=Change Shipment Status
    Blind\ Return\ Receipt=Blind Return Receipt

    #The file Contains the property Values. Please read \ as escape characters on the left hand side. For example Test\ Literal= should be read as Test Literal
    #Fri Aug 28 18:32:04 IST 2009
    French=Fran\u00E7ais
    Start\ Receipt=Proc\u00e9der \u00e0 la r\u00e9ception
    On\ Case\ Pre\ Receipt=Sur pr\u00e9r\u00e9ception de caisse
    Change\ Shipment\ Status=Modification d'\u00e9tat d'exp\u00e9dition
    Blind\ Return\ Receipt=Re\u00e7u retour sans autorisation

Translating out of context invariably leads to a lower quality product and is the biggest challenge facing linguists as clients try to reduce the localization cost. When all contextual information is stripped out and translated pieces are reduced to an Excel spreadsheet, the challenge for the linguist is considerable. What clients do not realize is that stripping out context may actually be more expensive, as more time has to be spent testing and fixing the translated strings in context after translation.

In closing, I would like to return to Synaptics, the company I mentioned before. We decided it would be best to create a master list of all English strings that had been translated in one or more languages. The cost of translating the complete master file for every language, even if the strings were not required in a particular language, turned out to be much cheaper than maintaining a separate list of English strings for each language. Once the master files have been translated, Synaptics merges all the languages into multilingual RC files by using automated scripts. Synaptics has been open to implementing suggested changes and eliminating wasted effort in order to streamline the process. As a result of reducing the overall effort, the translations are much quicker than before and cost less.

This centralized approach with a repository of common source language content worked for Synaptics. According to the recent Aberdeen study, that kind of centralization, paired with standardized terminology and a closed-loop review process, is crucial to achieving higher translation performance.
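The language-code and file-naming conventions discussed in this article can be checked mechanically before files go to translation. The following is a minimal Python sketch, not a description of any particular translation tool. The client names and the helper functions are hypothetical, the code pairs are the ones quoted in the article, and the `props_de` suffix form is an assumption because the article's own example is cut off.

```python
from pathlib import PurePosixPath

# Hypothetical per-client mapping from ad hoc codes to standard codes.
# The code pairs come from the article; the client names are invented.
CLIENT_CODE_MAP = {
    "client_a": {"gr": "de"},                          # gr used for German
    "client_b": {"gr": "el"},                          # gr used for Greek
    "client_c": {"bs-ID": "id-ID", "bs-BS": "ms-MY"},  # Bahasa Indonesia / Melayu
}


def normalize_code(client: str, code: str) -> str:
    """Return the standard code for a client's code, or the code unchanged."""
    return CLIENT_CODE_MAP.get(client, {}).get(code, code)


def target_in_folder(source: str, source_lang: str, target_lang: str) -> str:
    """Folder convention: /en/resources.rc becomes /fr/resources.rc."""
    path = PurePosixPath(source)
    parts = [target_lang if part == source_lang else part for part in path.parts]
    return str(PurePosixPath(*parts))


def target_with_suffix(source: str, target_lang: str) -> str:
    """Suffix convention (assumed form): props.properties becomes props_de.properties."""
    path = PurePosixPath(source)
    return str(path.with_name(f"{path.stem}_{target_lang}{path.suffix}"))


if __name__ == "__main__":
    lang = normalize_code("client_a", "gr")
    print(lang)
    print(target_in_folder("/en/resources.rc", "en", "fr"))
    print(target_with_suffix("/res/props.properties", lang))
```

Keeping the mapping per client matters here: the same code, gr, resolves to two different languages depending on who sent the files, which is exactly the mix-up the article describes.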
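The properties file listings in this article follow one escaping convention: spaces in keys are written as `\ ` and accented characters in values as `\uXXXX`. As a hedged illustration of how a translated string could be written back consistently, here is a small Python sketch. The function names are hypothetical, and the rules are a simplified subset of the Java properties format (for example, leading whitespace in values and comment characters at the start of a key are not handled).

```python
def to_unicode_escapes(text: str) -> str:
    """Replace every non-ASCII character with \\uXXXX escapes."""
    pieces = []
    for ch in text:
        code = ord(ch)
        if code < 128:
            pieces.append(ch)
        elif code <= 0xFFFF:
            pieces.append(f"\\u{code:04x}")
        else:
            # Characters outside the Basic Multilingual Plane need a surrogate pair.
            code -= 0x10000
            high = 0xD800 + (code >> 10)
            low = 0xDC00 + (code & 0x3FF)
            pieces.append(f"\\u{high:04x}\\u{low:04x}")
    return "".join(pieces)


def escape_key(key: str) -> str:
    """Escape backslashes first, then spaces and separators, then non-ASCII."""
    out = key.replace("\\", "\\\\")
    for ch in (" ", "=", ":"):
        out = out.replace(ch, "\\" + ch)
    return to_unicode_escapes(out)


def escape_value(value: str) -> str:
    """Escape backslashes and non-ASCII characters; apostrophes are left alone."""
    return to_unicode_escapes(value.replace("\\", "\\\\"))


def property_line(key: str, value: str) -> str:
    """Build one key=value line in the style of the listings above."""
    return f"{escape_key(key)}={escape_value(value)}"


if __name__ == "__main__":
    print(property_line("Start Receipt", "Procéder à la réception"))
    print(property_line("Change Shipment Status", "Modification d'état d'expédition"))
```

Applying a single function like this to every string is one way to avoid the situation described above, where several different apostrophe and backslash styles appear in the same file.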

The Guide From MultiLingual


Aberdeen Research Study Reveals Practices of Top-performing Companies

Beth Walsh

Aberdeen Group's recently released Translating Product Documentation study identifies how top companies effectively manage translation and localization efforts while reducing costs and increasing efficiency. Based on the experiences of nearly 200 companies, the study was done as a follow-up to Documentation Goes Global, a report completed in the spring of 2008 that determined most companies were facing translation cost increases from 18% to 32% due to increased volume and language requirements.

In the fall of 2009 Aberdeen also conducted research on Technical Communications as a Profit Center, which determined that technical communications departments provide significant customer-facing value by publishing product documentation online. Aberdeen analysts believed it was important, given the two previous studies' revelations, that they look at how top-performing best-in-class companies were managing to find the right balance between cost and quality in the localization chain.

[Figure: Technology use by best in class. Bar chart comparing best-in-class, industry average and laggard companies for process/project management, translation memory, terminology management, translation management system and machine translation.]

The results of the newest study reveal that companies that are most successful in managing their translation and localization efforts maintain consistently lower costs, are more efficient with personnel resources, and produce higher quality work than their competitors. Best-in-class companies save 240% over their competitors in translation expenses and 630% more in localization costs. They reduce the time required to complete translation projects by 30% and translate content into 48% more languages than their competitors. In addition, they complete 88% of their translation projects by targeted deadlines, and 91% come in under budget. Best-in-class companies translate into about 11 languages on average.

"Our research clearly demonstrates that top companies, focused on ROI, effectively manage time and expenses involved with translation and localization projects," said David Houlihan, senior research associate with Aberdeen's Product Innovation and Engineering practice. "We found that leading companies utilize integrated translation environments and realize performance improvements more than three times those achieved by their competitors. This high level of productivity comes with no sacrifice to the quality of work and may, in fact, improve quality."

Quality localization can have significant benefits for the enterprise, as the prior Aberdeen research showed that high-quality documentation contributes as much as a 41% increase in customer satisfaction scores and a 41% reduction in inbound calls to customer service organizations.

How do they achieve these great results? The capabilities reviewed in Aberdeen's research are divided into five core areas: process, organization, knowledge management, technology and performance management. Significant productivity drivers are increased control and transparency over the entire process, closed-loop processes that promote internal and external accountability, and automated reuse of content.

Leading performers are much more likely than their competitors to assign a dedicated project manager to manage the total translation process, institute a formal review process for translated documents, and control content with terminology management and the use of integrated translation management solutions. The highly specialized and irregular nature of translation work prevents many companies from maintaining a standing translation staff, and much of the work is outsourced. However, transparency facilitated by comprehensive management solutions is an aid to greater internal ownership over even outsourced translation and localization resources. Without this transparency it is difficult for companies to understand how to improve their translation processes, either in terms of operational execution or quality of output.

Increased reuse of translated content offers a compelling value proposition. As such, it is the initiative pursued most often by study participants, at about 45% of all respondents. However, what stands out in best-in-class companies is the process of incremental translation or creating topic-based authoring in source language content modules; this opens up the opportunity of reuse significantly, potentially leading to tremendous savings. Combined with standardization of terminology available to all translation workers and a formal closed-loop review process, it ensures consistency in both quality of translation and operational performance of partners. When the review is done by a native speaker, in particular, it enables companies to preserve the intended meaning to better serve their customers.

Technology is being used to support internal ownership and accountability as well as to gain cost and time savings, leading to higher efficiencies. Integrated translation management solutions are proving to be an emerging trend among best-in-class companies. The use of these solutions gives best-in-class performers a considerable advantage by providing them with a centralized repository for translated content and centralized control as well as easy accessibility of approved terminology by internal and external workers. This single source for all multilingual content further enables reuse, maintains version control and eliminates redundant rework across the localization chain. Aberdeen advises all levels to actively assess translation quality through formal ranking and asserts that centralized processes will continue to improve results.

Across Systems, which was a major sponsor of the research, found the results confirmed the approach they advise customers and prospects to take. "Aberdeen's research identified that the integration of project and terminology management into translation management solutions is an emerging practice of best-in-class companies," said Daniel Nackovski, president of Across Systems, Inc. "We were gratified to find the study supports our strategy to include project and workflow management, a translation memory, a terminology system and more in a unified work…

As reported in the Aberdeen study, about 48% of the best-in-class companies use translation management software solutions, and 28% have an integration with project management, both of which are emerging practices. However, the difference in adoption is high, with these top companies using it more than two times the norm. This means that even though it is an area where still less than half of companies are taking action, the great majority of those that do are reaching the top tier of performance, proving it is a highly useful practice.

Beth Walsh is the vice president of Clearpoint Agency


Guide: GETTING STARTED
Evaluating Emerging
Language Technologies
Vadim Berman

Fashion is not only for clothing and shoes. Trend following also can be applied to the world of technology. In the silicon gold rush, some technologies are more favored by wanna-be inventors than others. […], a barometer of innovation, lists 140 startups tagged with the expression VoIP (voice over internet protocol), 669 startups containing the word communication, 41 startups working on surveillance, and a whopping 849 startups tagged with natural language processing (NLP). The once-obscure field of language technology seems to be getting hot.

It makes sense that this is happening now. An individual can travel around the world within a couple of days. A global communication network has been established to capture bits and pieces of reality in clear video and audio signals that can be stored forever. Business processes are mostly digitized. Now users want machines to understand human language. Buzzword addicts call all this Web 3.0.

But human languages have their own logic, which is nothing like the strict true-or-false machine logic, and machines have their own different ways of making sense of human language. While regular business logic can manifest itself via labels, textboxes and the like, linguistic logic is largely invisible. You put text in, you get text out. It either matches your expectations or it does not. But 99.999% of the inner workings of this programming iceberg are under water. Linguistic software, while not appearing very high-tech on the surface, is a mind-boggling array of wires, cogs, counterweights, pulleys and buttons, designed to run by itself. Like all complex mechanisms, it is prone to breaking. If you are shopping in this area, you have to either dive into this insanely complex world or know the tricks of the trade. Come to think of it, the tricks of the trade are mandatory in any case; no one has the time to check everything.

Usability: capability and desire

Looking at the exhibits in historical museums, one cannot but admire the craftsmanship of the old masters. Kitchen utensils, furniture and wheel-lock guns are decorated with complex ornaments and precious stones. However, more practical and down-to-earth minds may say this is a waste. The gargoyles and the Greek deities don't add one bit of usability to the tool. The best example of an incredible effort with little practical use is the wooden pocket watches of the Russian Bronnikov brothers. While magnificent and unique in the way they are made, these chronographs did not accomplish much on the practical side of things: a pocket watch is still a pocket watch, and wood does not last as long as steel. The first and simple test, if you are looking to invest in anything, is to ask if it is usable and practical.

The recently surfaced semantic search engines seem to be questionable in that respect. Many critics point out that in practice they do not yield a much improved experience over the tried-and-true keyword search. Try to assess the market realistically, and see if the complexity and the costs are worth the niche they are going to fill. Common sense applies, as usual. Avoid wishful thinking.

If you are looking for a tool to accomplish a certain task, are you sure that this beautiful and intelligent masterpiece can handle it well? Consider the following example. You are building a software package to search content in a foreign language. Some people take a straightforward approach: apply machine translation (MT) to the content, index it and connect to a plain search engine. Can it work? Maybe, but MT is yet to become accurate enough to be reliable for some types of language pairs, such as Chinese-English or Japanese-French. With an accuracy of 70% to 80%, nearly every third or fifth word is incorrect, which may result in arcane, unexplainable search results.

The same principle applies to a crude approach in building speech-to-speech MT systems. Take two reasonably good systems, text MT and speech recognition. Let's assume both have an accuracy of 0.9. When linked together, the complete solution has the accuracy of 0.9 times 0.9 = 0.81. If the MT system is rule-based, it will not take kindly to the lack of punctuation in the text input, and the accuracy is likely to degrade further. This means that we'll have a frustrating tendency to get every fifth word wrong.

On the other hand, with stronger emphasis on the underlying algorithms some aspects do not have to be scrutinized as much as they are in other software. User interface is not that difficult to change, so let it be even if you don't like it. Stability is paramount in software, but in the early stages it does not have to influence your decision too much.

Scalability: from toy data to real world

But how should the results be checked when a product is still in development? Language technologies have a distinctive trait that makes them so insanely hard. While a normal database application may deal with a small or moderate amount of data, the linguistic applications by definition must deal with a potentially infinite set of words comprising a language and endless combinations within this infinite set.

A newly born application doesn't know much of this infinite set. It starts off with a small portion of data, and this is usually why the examples are limited. They all may work great, but there are just ten or twenty of them. This is normal, but if a technology is limited to this toy world, it is not of much use. Most developers understand it. The question is, however, how they plan to enlarge the scope of the input. Learning from corpora? Importing machine-readable dictionaries? User input? Crowdsourcing?

There are no good and bad methods, just suitable and unsuitable ones or well-planned and not well-planned strategies. Try to check whether this data acquisition
The Guide From MultiLingual

GETTING STARTED: Guide

method has been tried already. Apply com- and the regionalisms are coming from. Fur- solid results. Experience, successful track
mon sense. Does it work? What is required thermore, if a system is good in principle record, social standing, reputation, hard
to make it work on a production level? and offers some customization capabili- work there is no escaping the basics.
Finances, personnel, linguistic resources? ties, the support for regional dialects can It might be more difficult with a startup.
Does it require 50 expensive highly skilled be added externally. Dont bother to check Usually, odds are against the gold dig-
computational linguists to build a dictionary domains you know youll never use or ones gers, so startup people either have a gam-
manually or petabytes of high-quality cor- that the system is not built for. I remember bling trait or are not experienced enough
pora for a rare language? Try to assess the a customer testing a MT system by trying to to understand how long and difficult the
feasibility and suitability of these resources translate a fragment of an Agatha Christie path before them is. Young entrepreneurs
necessary for growth. Even if a spaceship thriller. This is not guaranteed to work well, usually have more drive than their more
can carry you to the stars, it is difficult to use for good reason. Language engineering is experienced counterparts, but they have
its potential if the fuel must be pure gold. meant to handle mundane tasks, not to pro- other traits as well, and only the future
This, however, does not mean that if the duce literary masterpieces. According to a (or maybe also will tell
developer is unable to meet expectations, classically apt comparison, it is similar to whether it is a winning combination.
the requirements are unrealistic. The bud- assessing the performance of an industrial More often than one would have
get might be tiny, as creating new technol- robot by making it dance Swan Lake. thought, new technologies are presented
ogies is not as glamorous and profitable in by people with questionable honesty and
the beginning of a venture. Be understand- Extensibility professionalism. With the abundance of
ing, but analytical. What if the system seems to be a good strange characters and plenty of legiti-
basis for what you are looking for but does mate hard-working garage inventors,
Real-world examples not have the exact functionality that you some shamans are successfully posing
Did you ever wonder why so many natu- are looking for? Due to the complexity of as real doctors. There is no recipe to tell
ral language search engines and speech language engineering, the choice of lin- a scam, and even when the technology
recognition packages demonstrate their guistic tools is small. There is rarely a wide itself is legitimate, peek under the hood.
capabilities by looking either for pizza or array of choices, so an almost-suitable There is no place for impractical vision-
sushi? While these might be just similari- product may be the only option. aries at the steering wheel. They may
ties in the life style and culinary prefer- Then, in addition to other criteria, you contribute to the main idea and maybe
ences of the linguistic crowd, the main need to see how fast the product can be even the initial architecture, but they are
reason may be quite prosaic. These exam- adapted to your needs. The answer is likely to doom the enterprise no matter
ples are perfect to demonstrate systems often in no time. In fact, the extension how good the technology or the prospects
that claim to be production-ready. is almost there, 95% complete. Dont fall are. Driving a car or flying a plane does not
When humans must strain their brain into that one. Even though the choice allow for chasing birds or stars.
to understand or spell long, rare, exotic of products and suppliers is scarce, the Should techies or salespeople run a com-
words, the machines have a different language engineering job market is even pany? Normally, salespeople, but I believe
problem. The main problem is ambigu- scarcer. These guys may not be so sure language engineering is a bit different. It is
ity. Epidermis or uranium do not have too themselves. Feelings of a developer for his a small, tightly-knit community where many
many interpretations, and so they are easy brainchild are similar to those of a mother people arrive from other industries. Its idio-
for machines. On the other hand, words for her child. Neither is usually the best syncrasies are so distinctive that an external
with numerous meanings such as put or address to seek for an objective opinion. observer might doubt these people actually
set are a nightmare for every NLP package. Of course, if the extension is trivial and live on the same planet as the rest of the
Human readability is not the same and is does not touch on the linguistic parts, mankind. Mainstream salespeople might
often the opposite of machine readabil- there are no reasons to worry. However, if not be able to figure out this strange world,
ity. However, epidermis is a specialized it touches the core engine or requires imple- let alone explain the small technicalities to
term and might not be present in a small menting or modifying some linguistic logic, a potential customer. Imagine that you are
dictionary. Pizza and sushi, on the other the feasibility should be carefully analyzed. buying an electric appliance, and the sales-
hand, are quite common, yet still unam- Common sense applies, like everywhere person tells you, "Well, the interface is very
biguous. These words are targets that are else. Try asking what the plan is and whether intuitive. I know that there are three green
quite easy to hit. a similar modification has been done before. buttons, one red bulb, and a lever. I'm not
Try tougher tasks. Is this about food? Another useful question is "What can go sure what they do. I think you need to pull
Try "steak" for a speech input, or "lamb with wrong?" "Nothing" is not a good answer, the lever, but to make sure, I'll just catch our
sage" for MT (yes, the latter often yields especially if replied immediately. main techie and he'll tell you how to make
"lamb with a wise man" in statistical MTs). it work." Apparently, with salespeople like
See how well the system makes complex Doctors and shamans these, there are no sales. I've seen it hap-
decisions. At a more advanced stage, Technology might be the heart of the pening, too.
don't forget to introduce noise, either lit- offering, but this is not all. If a person has Finally, as the saying goes, if something
erally for speech or figuratively for text. a strong, well-functioning heart but severe is too good to be true, it probably is. Don't
Don't overdo your attempts to make the issues with other vital organs, one can't struggle to find overlooked diamonds;
system fail, though. Accents and regional- call it perfect health. Similarly, the man- look for more realistic copper, nickel or sil-
isms are only relevant if the system is to be agement and other relevant parts of the ver, and you won't spend your efforts and
deployed in the markets where the accents team also must be capable of delivering resources on fool's gold.
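The advice above about introducing noise "figuratively for text" can be tried with very little tooling. The following is a minimal sketch, not a method described in the article: the function names (`add_typo_noise`, `build_test_cases`) are hypothetical, and adjacent-letter swaps are just one assumed way to simulate typing errors in test phrases before sending them to whichever MT system is being evaluated.

```python
import random


def add_typo_noise(text: str, rate: float = 0.1, seed: int = 0) -> str:
    """Swap adjacent letters at roughly `rate` of the eligible positions.

    Assumption: typos are modeled as transpositions only. Spaces and
    punctuation are never moved, so word boundaries stay intact.
    """
    rng = random.Random(seed)  # fixed seed keeps test runs repeatable
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        both_letters = chars[i].isalpha() and chars[i + 1].isalpha()
        if both_letters and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip the swapped pair so it is not swapped back
        else:
            i += 1
    return "".join(chars)


def build_test_cases(phrases, rates=(0.0, 0.1, 0.3)):
    """Pair every phrase with every noise level: (original, rate, noisy)."""
    return [(p, r, add_typo_noise(p, r)) for p in phrases for r in rates]


if __name__ == "__main__":
    for original, rate, noisy in build_test_cases(["steak", "lamb with sage"]):
        print(f"{rate:.1f}  {original!r} -> {noisy!r}")
```

Keeping a clean copy (rate 0.0) next to each noisy variant makes it easy to see at what point the system's output starts to degrade, without overdoing the attempts to make it fail.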

Corpus Linguistics and
the Translation Process
Thiana Donato

The multilingual services market has received a series of innovations through computational linguistics or natural language processing (NLP), a multidisciplinary area that encompasses artificial intelligence, information technology and linguistics, using computer processes to handle human language. Artificial intelligence is the field of research within computer science that studies how machines can think, simulating the human capacity for intelligence and solving problems. As a result of the integration of these sciences, research has been providing important applications for translators' work, such as search tools, spell-checkers and voice recognition, as well as tools in computer-aided translation (CAT), including translation memory (TM), terminology management and machine translation (MT). These projects aim to develop a search mechanism for the most common terms, by segmentation, thus eliminating repetition and resulting in a more natural translation. The goal of these artificial intelligence researchers is to develop CAT tools and MT that can simulate the human ability to think and solve problems.

Corpus linguistics studies language in use, investigating language through observation of large quantities of authentic data contained in the corpus, which is a representative set of texts on a particular area, electronically organized to enable searches by using specialized search tools. Corpus linguistics considers language as a probabilistic system. That is, there are many possibilities for an expression in language, but not all are as frequent.

Research in this area advanced in the 1980s with the widespread use of personal computers that led to the increased availability and accessibility of corpora and processing tools, helping to strengthen research in the field and reinforcing the fact that this area of research is and always has been closely related to technology. Since then, research on the subject has contributed to translation in several ways. Using the most commonly used patterns in a language results in a translation that flows more naturally and is more faithful to the native language. Also, the majority of MT systems are based on a corpus comprised of bilingual texts (original and translated).

The computational tools used by corpus linguistics provide a mechanism that collects, stores and analyzes linguistic data, the so-called corpus. This data is used as research material that can help elaborate theories about language functionality.

Some programs list words according to the frequency with which they occur in the corpus. Others are called concordancers and allow specific word searches in a corpus, pulling up a comprehensive list of phrases that shows the contexts in which the word has been used. Tagging is also commonly used to analyze corpora automatically and produce codes, or tags, that contain only data belonging to a particular morphosyntactic or syntactic category.

This area of research has contributed to improving hybrid MT software, through its theories on linguistic variables, directly influencing the translation so that the final text is as close as possible to the original one. The MT systems are based on a corpus comprised of bilingual texts (original and translated) and a database with systems of rules and statistics. Technological innovations can therefore speed up the translation process, resulting in better-quality MT, with the human translator acting as a sort of validator of the MT data.

This is a valuable contribution when we consider that the first technological advance used to support translation work was the development of MT, created by the Americans in the 1950s to spy on the Russians during the Cold War period. These software components were capable of analyzing sentences based on grammar, giving rise to very unnatural, sometimes meaningless translations that had to be corrected and validated by a human translator. Today, the most famous MT system worldwide is Google's, which shows that, at least currently, the results of MT cannot be satisfactory without human intervention.

Another technological contribution was the development of CAT tools, which gave rise to software products such as Trados, Déjà Vu and Wordfast. These tools, besides considering grammar, use a TM that enables terms used in a text to be standardized and added to a glossary, making quality control in translation easier. These tools are designed to support the translator's work, for instance, by storing previously translated segments in a TM so that when the same segment of text appears again, the software brings up the previous translation used for that phrase.

Each technological advance brings rumors that the days of the professional translator are numbered. However, the work of human translators continues to be essential. Technology is no substitute for human work, but is rather a tool to help speed up certain types of translation work.

Terminology is one of the areas that may be significantly influenced by corpus linguistics, which has been developing vocabularies by using its own methodology. Glossaries are prepared from a corpus, creating a kind of filter so that the vocabulary shows only terms contained in the corpus, compiled according to specific criteria. As a result, the glossary contains the most commonly used terms for a particular area of specialization. Another characteristic of glossaries created by corpus linguistics is that they are rich in authentic examples extracted from the corpus and other information that can facilitate the translator's task. Therefore, the type of translation that can benefit most from corpus linguistics is technical translation, which focuses on various areas of specialization from a technical or scientific standpoint. This is a type of translation that involves a high degree of terminology research and the development of glossaries to ensure the use of standardized terminology in the document in question, and also for any future projects carried out on the same subject.

Both the reference material and the research material that have led to the development of computer tools can speed up the technical translation process and provide gains in terms of quality, by giving the translator not only a better knowledge of the specialized terminology of the industry that
the translation is aimed at, but also the support of multifunctional software, like the programs that have been launched in the multilingual services market.

In Brazil, for example, research in corpus linguistics is still in its infancy, but it has been gathering strength. Brazilian research in this field is carried out by interest groups such as the COMET project (Corpus Multilíngue para Ensino e Tradução), developed together with the modern literature department of the Faculty of Philosophy, Literature and Human Sciences at the University of São Paulo (USP). Members are mostly graduate students and volunteers.

An example of the contribution of corpus linguistics is CorTrad, a project developed by USP, Linguateca and NILC, which applies a methodology proposed by corpus linguistics that has new functionalities, such as new search types, for translation. The project also enables different versions of the same translation to be compared and specific structural components to be consulted. CorTrad is available on COMET's website. One of its main advantages is its efficient search mechanism, which refines the search into three different subcorpora, including genre, text type and other specific characteristics. So far, this project has produced two important reference materials in the areas of Brazilian cuisine and receiving guests. What makes this project different is its presentation of a parallel corpus that makes it possible to compare the original with the translation.

Another contribution is CorTec, a technical corpus for Portuguese-English that enables terminology comparisons. It is divided into 14 subcorpora segmented into specialized areas. These studies are recent and are still in the initial stages; however, they need to have their relevance acknowledged. The development of language technology is extremely dependent on these studies, which means that the growth of the translation market depends on investments in this area of research.

Some TM systems have already received new functionalities derived from corpus linguistics methodology. Although it would be incorrect to say that statistical MT uses some type of corpus linguistics, it is true that these methods and techniques can help computational linguistics develop new mechanisms for TM systems.

Currently, corpus linguistics is being developed in various linguistic research centers around the world. One of the major centers is in Great Britain, with projects being carried out at various universities, in the cities of Birmingham, Brighton, Lancaster, Liverpool, London and others. Research in British institutions has contributed to the theorization of corpora and other support materials in various areas. In the Scandinavian countries there are also active centers dedicated to this research.

Corpus linguistics appears to be more widespread in Europe than in other parts of the world. In the United States, corpus linguistics exists but is more modest. North American researchers are more engaged in projects involving NLP, which, although closely related to computer sciences with various characteristics in common with corpus linguistics, is treated separately.

A new trend in the worldwide corpus linguistics scenario is investment by private companies, through partnerships between companies and universities. The business world has a great interest in studies in this area of knowledge for commercial purposes such as the automated processing of texts, computerization of databases, and the creation of intelligent voice and data management systems.
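Three of the tools described in this article (the word frequency list, the concordancer and the translation memory) are simple enough to illustrate in a few lines. The sketch below is only an illustration under stated assumptions: the function and class names are hypothetical, the tokenizer is a naive one for unaccented English text, and the TM matches exact segments only, whereas commercial CAT tools also offer fuzzy matching.

```python
import re
from collections import Counter


def tokenize(text):
    """Naive tokenizer (assumption): lowercase runs of a-z and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def frequency_list(corpus):
    """List (word, count) pairs for the whole corpus, most frequent first."""
    counts = Counter()
    for document in corpus:
        counts.update(tokenize(document))
    return counts.most_common()


def concordance(corpus, word, window=3):
    """Return every occurrence of `word` with up to `window` tokens of
    context on each side, as (left, keyword, right) tuples."""
    hits = []
    for document in corpus:
        tokens = tokenize(document)
        for i, token in enumerate(tokens):
            if token == word.lower():
                left = " ".join(tokens[max(0, i - window):i])
                right = " ".join(tokens[i + 1:i + 1 + window])
                hits.append((left, token, right))
    return hits


class TranslationMemory:
    """Exact-match TM: stores translated segments and returns the stored
    translation when the same source segment appears again."""

    def __init__(self):
        self._segments = {}

    @staticmethod
    def _key(segment):
        # Ignore case and extra whitespace when comparing segments.
        return " ".join(segment.lower().split())

    def store(self, source, target):
        self._segments[self._key(source)] = target

    def lookup(self, source):
        """Return the previous translation, or None if the segment is new."""
        return self._segments.get(self._key(source))


if __name__ == "__main__":
    corpus = ["The corpus is a set of texts.", "A corpus helps the translator."]
    print(frequency_list(corpus)[:3])
    print(concordance(corpus, "corpus", window=2))
```

A glossary "filter" of the kind described above could start from `frequency_list`, keeping only terms that meet the chosen criteria, and use `concordance` to attach authentic examples to each entry.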

Creating Your Own
Multilingual Technology
Dennis Wakabayashi and Chris Golaszewski

Our global landscape presents opportunity everywhere, from business, educational, social, local, travel and humanitarian efforts, just to name a few. If you look around, you'll find endless communication methods using electronic language enablement. What's great about all this opportunity is the vast number of different ways you can manifest technology to educate, foster peace or help those in need.

Multilanguage technology falls into one of three camps: human translation, machine translation (MT) or hybrid solutions. Human translation technology is handled today by systems that use contact, billing and workflow management to shuffle translation jobs to a large group of online translators. These systems are often accessible by an application program interface (API).

With an API, you can access the core translation methods and bubble them up to your own user interface. MT systems are also accessible by API. Hybrid translation technologies are less common because many of these are proprietary and/or protected by patents. Hybrid solutions allow users to take advantage of both human translation and MT from within one application.

From these three core technologies any number of great ideas can emerge. www, for example, allows users to upload video and then work as a giant crowdsource community to translate subtitles into languages around the world.

Ingredients for a new technology

In our case, an idea emerged from the void we observed in social networking when it comes to actively connecting members regardless of natively spoken language. We set out to create Mojofiti, the basis for our "how to" explanation. Our goal is to create a place where internet users from around the world can gather to publish blogs, send messages and socially interact, with all that interaction invisibly translated behind the scenes so that readers can traverse the landscape of the content.

Open-source technology is a kind of goodness that allows ideas to rapidly develop, reusing modular programming that's already been done by someone else. It's like going to an assembly line and picking all the parts needed for your creation for free.

As great as that sounds, those parts still require considerable programming and thinking to coalesce into software that works the way you want. In the case of Mojofiti, getting things to scale consistently became a focus of our thinking and developmental investment. In some cases it's like coaxing a square peg into a round hole or creating an efficient custom adapter kit.

For the blog publishing system, we chose WordPress MU for the following reasons:

Search engine optimization (SEO) advantage: Over the years WordPress has done a number of savvy things to play nice with Google's search. Permalinks and sitemapping, for example, gave us confidence that our users would have a first-class chance of experiencing competent SEO throughout the world.

Worldwide: WordPress is localized in over 50 languages worldwide. What this meant to us is that users from around the world would have access to robust publishing tools from the onset of our development.

Open source: WordPress' philosophy is something we support and were happy to take advantage of. Here's a link to one of the videos that influenced our decision: mullenweg-wordpress-gpl

At the time we were concocting our technology recipe, WordPress was developing this new way to link users together called BuddyPress. BuddyPress was in beta and felt more like duct tape and spider webs than anything else, but we believed in the potential and the track record of WordPress and so decided to develop it. During our initial development, we saw a great leap forward with the release of BuddyPress 1.1.

What BuddyPress did for us was allow our multilingual users to link to each other and share things such as e-mail and short-format communications called "wires."

We tried a number of ways to language-enable our WordPress MU + BuddyPress environment. After several attempts with various plug-in technologies, we were able to get a modified version of http:// and Google's API to work.

We chose our hosting provider because it can scale in small increments. This meant that we could grow our hosting in small steps as we grew, so we didn't have to pay for much unused hosting space over the development and growth stages.

Our first and favored project management (PM) solution, while not open source, is Basecamp (http://basecam) from 37 Signals. We began thinking this would prove to be our end-all solution but found our use of the product to be best suited for file management, high-level PM duties and overall business goals management. For a nominal fee, we handle graphic source files, requirements documentation, high-level business goals and projects with Basecamp.

With high-level information being tracked in Basecamp, we wanted a separate solution to track the details of our development efforts: bug fixes and iterative feature requests. We decided to use Mantis (www) for this purpose. The Development Manager translates high-level business goals into digestible tasks for insertion into Mantis. We then try to group related requests for assignment to specific developers. Each developer sets the tickets to Resolved upon completion, and
a Change Manager closes them once pushed to our live environment.

Human resources

We utilized a senior engineer who was able to strategically lead the holistic hosting, system administration and programming development. Mojofiti at any given time has one manager coordinating human, software and financial resources related to a system of production. The production system manages bug tracking, new development and priorities.

Our programming team is comprised of several PHP, MySQL and WordPress-specific developers. Typical duties include everything from printing a user's selected primary language to a template to creating a BuddyPress-compatible plugin, thus allowing users to request and save crowdsourced translations. A few overall challenges faced at the start of this project include a URL rewriting override for the default method provided by BuddyPress in order for Transposh to work as expected; a handful of smaller compatibility issues with Transposh and this WordPress MU/BuddyPress environment; a modification to the default Blog creation process in order to pick up a default set of Transposh settings per Blog; and a change to the links in the default Transposh output so that they work within the multiuser setup.

We have designers who contribute creatively to the user experience. Most of these people have five or more years' experience, which helps a lot when you want to cycle through things frequently and continuously.

Technological development

So after we determined our ingredients, we set out to manifest the idea. Our process went something like this.

At the Business Requirement Documentation stage we gather user experience, customer benefits and resource availability information. We evaluate these items together and distill a potential combination that exhibits a strong opportunity for the users of the software to benefit. In the case of Mojofiti, it amounted to users publishing blogs, with those users/blogs united into a social network system with system-wide communications without language barriers. Costs were to be less than $500,000.

At the Production Scheduling stage, the production manager maps out the development. Then come staff review and approval. The staff gets together and reviews the work, business requirements and the proposed schedule. If approved, the resources are allocated and the work begins. Next comes the user interface design development, where the teams do the design of screens associated with the software. Then again come the staff review and approval, and the team collectively determines if we are on target to meet the business goals.

Software/programming/development is the next logical step. PHP/MySQL programmers get to work. Business goals are redefined as digestible and logical development tasks. The development team determines the appropriate technical solution and executes. Development cycles iterate until the business goals are met.

Then come quality assurance and testing of the software from a technical perspective, which sometimes includes an external focus group team or service to double check the functionality and user experience. There's another staff review and approval and then the closed beta launch.

The closed beta step allows a larger group of our teams, both internal and external, to test the production. Sometimes we include our public relations teams, advertising agency and investors. A punchlist of items to be completed before launch is reviewed, prioritized and worked on. Final staff review and approval take place before launching the beta to the public.

At the open beta step, the public gets to test the software and weigh in on any updates, bugs, modifications or changes that are to be considered. Feedback from the public beta is then reviewed and prioritized by staff. If all items are done and approved, we move to launch. The internal launch includes contingency planning, server configurations and release schedule. At this time all things move live to the public servers. Finally, there's the public launch, and the files are transferred to live servers. Refinements become a versioning system where we launch new updates weekly or monthly to sites.

Once your software is in a place so that users can start working with it, get it online. We recommend a beta label to inform users that the site is a work in progress. During this period, have users start to tell you what's working and what's not, then identify and fix bugs and improve the site. This process is probably the most efficient way to develop as it gives you real-world insight into how to manage your ongoing investments to get the best results for your users.

Figure: Iterations of the site from the design team.
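The hybrid approach described in this article, where MT supplies a translation and users can request and save crowdsourced ones, can be pictured as a simple lookup with a fallback. The sketch below is an illustration only and is not the code used on the site: `HybridTranslator`, `save_human_translation` and `fake_mt` are hypothetical names, and `fake_mt` is a stand-in for whatever MT API you call, since the real service and its parameters are not shown here.

```python
from typing import Callable, Dict, Tuple


class HybridTranslator:
    """Prefer a saved human translation; otherwise fall back to MT."""

    def __init__(self, machine_translate: Callable[[str, str], str]):
        # `machine_translate(text, lang)` is supplied by the caller, so any
        # MT service can be plugged in (assumption: it returns a string).
        self._mt = machine_translate
        self._human: Dict[Tuple[str, str], str] = {}

    def save_human_translation(self, text: str, lang: str, translation: str) -> None:
        """Store a crowdsourced translation for one text/language pair."""
        self._human[(text, lang)] = translation

    def translate(self, text: str, lang: str) -> Tuple[str, str]:
        """Return (translation, origin), where origin is 'human' or 'machine'."""
        saved = self._human.get((text, lang))
        if saved is not None:
            return saved, "human"
        return self._mt(text, lang), "machine"


def fake_mt(text: str, lang: str) -> str:
    """Placeholder MT function for the demo; it only tags the text."""
    return f"[{lang}] {text}"


if __name__ == "__main__":
    translator = HybridTranslator(fake_mt)
    print(translator.translate("Welcome to my blog", "pt"))
    translator.save_human_translation("Welcome to my blog", "pt", "Bem-vindo ao meu blog")
    print(translator.translate("Welcome to my blog", "pt"))
```

Returning the origin alongside the text lets the user interface show readers whether a passage was machine translated or improved by the community.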

