Apertium: A Unique Free/Open-Source MT System for Related Languages (But Not Only)
Free/Open-Source MT System
for Related Languages
[but not only]
Gema Ramírez Sánchez¹
Mikel L. Forcada¹,²
¹ Prompsit Language Engineering, Elx, Spain
² Universitat d'Alacant, Alacant, Spain
#LocWorld34
Outline
Apertium components
Ready-to-use Apertium products
Machine translation but not only!
Licensing: free/open-source
The Apertium community
Research and business with Apertium
Languages and language pairs
Success cases
Funding
Apertium components
Apertium components: the
engine /1
A fast, free/open-source, modular,
shallow-transfer, language-independent
machine translation engine with:
text format management,
translation memory querying,
finite-state lexical processing,
statistical and constraint-based lexical
disambiguation, and
shallow structural transfer based on
finite-state pattern matching
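The stages listed above can be illustrated with a toy English-to-Spanish pipeline. This is only a sketch under stated assumptions: the dictionaries (`MONODIX`, `BIDIX`, `GENDIX`), the single disambiguation constraint and the single reordering rule are hypothetical, plain Python dictionaries stand in for finite-state transducers, and format management and translation memory lookup are left out.

```python
# Toy English-to-Spanish shallow-transfer pipeline.
# All dictionaries and rules below are hypothetical illustrations,
# not Apertium data or Apertium code.

# Monolingual dictionary: surface form -> possible (lemma, PoS, number) readings.
MONODIX = {
    "the": [("the", "det", None)],
    "red": [("red", "adj", None)],
    "house": [("house", "n", "sg")],
    "houses": [("house", "vblex", "sg"), ("house", "n", "pl")],
}
# Bilingual dictionary: (source lemma, PoS) -> (target lemma, inherent gender).
BIDIX = {
    ("the", "det"): ("el", None),
    ("red", "adj"): ("rojo", None),
    ("house", "n"): ("casa", "f"),
}
# Generation dictionary: (target lemma, gender, number) -> surface form.
GENDIX = {
    ("el", "f", "sg"): "la", ("el", "f", "pl"): "las",
    ("casa", "f", "sg"): "casa", ("casa", "f", "pl"): "casas",
    ("rojo", "f", "sg"): "roja", ("rojo", "f", "pl"): "rojas",
}


def analyse(text):
    """Lexical processing: return every dictionary reading of each token."""
    return [MONODIX[token] for token in text.lower().split()]


def disambiguate(analyses):
    """Constraint-style rule: discard verb readings right after a det or adj."""
    chosen = []
    for readings in analyses:
        previous_pos = chosen[-1][1] if chosen else None
        if previous_pos in ("det", "adj"):
            readings = [r for r in readings if r[1] != "vblex"] or readings
        chosen.append(readings[0])
    return chosen


def lexical_transfer(units):
    """Bilingual lookup: translate each lemma and attach the target gender."""
    transferred = []
    for lemma, pos, number in units:
        target_lemma, gender = BIDIX[(lemma, pos)]
        transferred.append(
            {"lemma": target_lemma, "pos": pos, "gen": gender, "num": number}
        )
    return transferred


def structural_transfer(units):
    """Pattern det-adj-n: reorder to det-n-adj and copy agreement from the noun."""
    output, i = [], 0
    while i < len(units):
        window = units[i:i + 3]
        if [u["pos"] for u in window] == ["det", "adj", "n"]:
            det, adj, noun = window
            for unit in (det, adj):
                unit["gen"], unit["num"] = noun["gen"], noun["num"]
            output.extend([det, noun, adj])
            i += 3
        else:
            output.append(units[i])
            i += 1
    return output


def generate(units):
    """Morphological generation: look up the inflected target surface forms."""
    return " ".join(GENDIX[(u["lemma"], u["gen"], u["num"])] for u in units)


def translate(text):
    units = disambiguate(analyse(text))
    return generate(structural_transfer(lexical_transfer(units)))


print(translate("the red houses"))  # las casas rojas
```

Words missing from the toy dictionaries raise a `KeyError`; a real system needs an explicit policy for unknown words.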
Apertium components: the
engine /2
Most of the engine was developed inside the
Apertium project but some external
technologies are used:
Helsinki Finite-State toolkit (for some
morphologically-rich languages),
VISL CG-3 (constraint grammars for
rule-based lexical disambiguation).
Apertium components: the data
Apertium components: the data.
A typical language pair
Language pair organization
2 monolingual packages (A, B), each with:
1 monolingual dictionary (monodix)
1 tagset + probabilities
1 plain/tagged corpus
1 postgeneration dictionary
1 bilingual package (A–B), with:
1 bilingual dictionary (bidix)
2 sets of structural transfer (grammar) rules (levels 1–3)
Format: typically XML-based (sometimes text-based) files
Sizes:
Monodixes: 10k–90k lemmata; 100k–23M surface forms; 85–97% coverage
Bidixes: 8k–90k bilingual lemma correspondences
Rules: 100 (one level) to 300 (three levels) per translation direction
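The relationship between lemmata, surface forms and coverage can be made concrete with a small sketch. The XML fragment below is a simplified example modelled on the monodix layout (entries that combine a stem with a paradigm); it is an assumption for illustration only, and real dictionaries also declare an alphabet and symbol definitions. The helper names and the naive token-based coverage measure are hypothetical.

```python
import xml.etree.ElementTree as ET

# Simplified monodix-style fragment (illustrative only): two lemmata that
# share one inflection paradigm with an empty suffix and an "s" suffix.
TOY_MONODIX = """
<dictionary>
  <pardefs>
    <pardef n="house__n">
      <e><p><l></l><r><s n="n"/><s n="sg"/></r></p></e>
      <e><p><l>s</l><r><s n="n"/><s n="pl"/></r></p></e>
    </pardef>
  </pardefs>
  <section id="main" type="standard">
    <e lm="house"><i>house</i><par n="house__n"/></e>
    <e lm="tool"><i>tool</i><par n="house__n"/></e>
  </section>
</dictionary>
"""


def load_surface_forms(xml_text):
    """Expand every entry with its paradigm; return (lemmata, surface forms)."""
    root = ET.fromstring(xml_text)
    # Paradigm name -> list of suffixes (an empty <l> element has text None).
    paradigms = {
        pardef.get("n"): [(e.find("p/l").text or "") for e in pardef.findall("e")]
        for pardef in root.iter("pardef")
    }
    lemmata, forms = set(), set()
    for entry in root.find("section").findall("e"):
        lemmata.add(entry.get("lm"))
        stem = entry.find("i").text
        for suffix in paradigms[entry.find("par").get("n")]:
            forms.add(stem + suffix)
    return lemmata, forms


def coverage(text, forms):
    """Naive coverage: share of whitespace-separated tokens found in the dictionary."""
    tokens = text.lower().split()
    return sum(token in forms for token in tokens) / len(tokens)


lemmata, forms = load_surface_forms(TOY_MONODIX)
print(len(lemmata), len(forms))                # 2 4
print(coverage("tools build houses", forms))   # 2 of 3 tokens are known
```

Two lemmata expand to four surface forms here; in morphologically rich languages the ratio is far higher, which is why surface-form counts in the sizes above reach the millions.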
Apertium components: the
tools
Free/open-source tools:
compilers to turn linguistic data into a fast
and compact form used by the engine
and
software to learn disambiguation or
translation rules from corpora.
Ready-to-use Apertium
products
A stand-alone Java application for the
desktop: apertium-caffeine.
An Android version for handhelds.
A stand-alone version (Apertium Simpleton)
for Windows and MacOS.
Plug-ins and support for CAT platforms:
OmegaT, MateCat, MemoQ, Trados Studio.
Available as a PPA repository for GNU/Linux
users.
Apertium extras: mobile app
Full offline mode!
Over 60 translation directions!
On Android!
No need to install: web access
www.apertium.org
No need to install: web/API
access
Other portals with all Apertium languages:
Prompsit's portal: + TMX +
navigate & translate
iTranslate4.eu portal: multiengine
Other portals with some Apertium languages:
UOC, UPV, UA (+ TMX + terminology support
+ more formats)
GiellaTekno portal
etc.
Also API access and connectors to translation
tools are marketed
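As a rough illustration of what API access can look like, the sketch below builds a request URL and reads a JSON reply in the style of Apertium's APy web service. Everything specific here is an assumption that does not come from the slides and should be checked against the current service documentation: the `/translate` path, the `langpair` and `q` parameters, the `responseData.translatedText` field, the placeholder base URL and the sample reply. No network call is made.

```python
import json
from urllib.parse import urlencode


def build_translate_url(base_url, source, target, text):
    """Build a GET URL for an APy-style /translate endpoint (assumed layout)."""
    query = urlencode({"langpair": f"{source}|{target}", "q": text})
    return f"{base_url}/translate?{query}"


def extract_translation(response_body):
    """Read the translated text from an APy-style JSON reply (assumed shape)."""
    payload = json.loads(response_body)
    return payload["responseData"]["translatedText"]


# Hypothetical base URL and hand-written sample reply, for illustration only.
url = build_translate_url("https://example.org/apy", "spa", "cat", "Hola mundo")
sample_reply = '{"responseData": {"translatedText": "Hola m\\u00f3n"}, "responseStatus": 200}'

print(url)
print(extract_translation(sample_reply))
```

Keeping URL construction and reply parsing separate from the HTTP call makes it easy to swap in whichever client a translation tool or connector already uses.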
Machine translation but not
only! /1
Machine translation but
not only! /2
[Diagram: data plus tools yield stand-alone components]
Monodix → Morphological analyser
Tagset + prob. → PoS tagger
Bidix → Lexical transfer
All together → Full MT
Licensing: free/open-source /1
Licensing: free/open-source /2
The free/open-source model creates a
community which effectively connects
researchers, developers, vendors, and users
in a continuum.
The Apertium community
Very active group of hundreds of developers
Contributions to Apertium at Sourceforge
Wiki documentation (wiki.apertium.org)
Easy entry: Apertium linguistic modelling is
simple, no need to program.
IRC channel #apertium on freenode.net
Mailing lists: apertium-stuff@lists.sf.net and
other lists
The Apertium community
[A search for Apertium faces in Google Images]
The Apertium community:
activities
President and project management
committee election according to bylaws
Support: mail, chat, online meetings
Maintenance: pairs, web, mobile app
Manuals & documentation: wiki, manuals,
how-tos, training materials
Organization of Google Summer of Code and
Google Code-In activity
Outreach activities: conferences, workshops
Language-related groups
Research and business with
Apertium
Apertium is already an active research and
business platform:
Research: 40+ publications, 2 PhD theses, 4
master's theses.
Business: companies (Prompsit, Eleka,
Imaxin Software, etc.) offering services to
customers such as Autodesk, Adobe, the
Government of Catalonia, 2 daily newspapers
in Spain, freelancers and LSPs
Languages and language pairs
/1
Language data is encoded mostly in XML,
but some language pairs contain data
encoded in other text-based formats.
There are currently more than 40 stable
language pairs (bilingual data).
Languages and language pairs
/2
Languages and language pairs
/3
Languages and language pairs
/4
Year | Milestone | Language pairs
2004 | The Spanish Ministry of Industry funds a consortium to build FOSS MT for the languages of Spain | -
2005 | Apertium RBMT platform is launched, providing engine, tools and data under free licenses | 3 pairs: es-ca, es-gl and es-pt
2005-2009 | Language pair-driven innovation, still very European-focused language pairs | +19: fr, en, eo, ro, eu, oc, cy, nn, nb, sv, da, is, mk, bg, ast, br
2010 | Five years on! | 22 pairs!!!
2011-2015 | Consolidated community, support for non-European languages, new tools and reorganisation of data | +19: af, nl, hr, sr, mt, sl, arg, sme, urd, hin, kaz, tat, id, ms, ar
2017 | Twelve years on! | 43 pairs!!!
Apertium loves small
languages
Breton–French
Aragonese–Spanish/Catalan
Occitan–Catalan/Spanish
Italian–Sardinian
North Sámi–Norwegian
Icelandic–Swedish
Spanish–Spanish Sign Language
Language pairs with approx.
95% text coverage
Language Lemmata Inflection models Surface forms
Apertium language-pair
life cycles
For new pairs:
resource compilation
basic system creation (85% coverage, most
frequent structural phenomena)
evaluation
typically takes 3–6 months
A related-languages pair
performance: apertium-es-pt
From Masselot et al., 2010 (Using the Apertium
Spanish–Brazilian Portuguese MT system for
localization):
Post-editing effort (word error rate): 20%
Post-editing speed: average 4,500 words/day
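Post-editing effort expressed as a word error rate can be computed as the word-level edit distance between the raw MT output and its post-edited version, divided by the length of the post-edited text. The sketch below assumes that common definition (the slide does not spell out the exact formula), and the two English sentences are hypothetical examples chosen to give a 20% rate.

```python
def word_error_rate(mt_output, post_edited):
    """Word-level edit distance from MT output to post-edited text, per reference word."""
    hyp, ref = mt_output.split(), post_edited.split()
    if not ref:
        raise ValueError("post-edited reference must contain at least one word")
    # previous[j] holds the edit distance between the hypothesis words
    # processed so far and the first j reference words.
    previous = list(range(len(ref) + 1))
    for i, hyp_word in enumerate(hyp, start=1):
        current = [i]
        for j, ref_word in enumerate(ref, start=1):
            substitution = previous[j - 1] + (0 if hyp_word == ref_word else 1)
            deletion = previous[j] + 1      # drop a word from the MT output
            insertion = current[j - 1] + 1  # add a missing reference word
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# Hypothetical sentences: two of ten words had to be changed.
raw = "this tool create drawings and documents with a complete set"
final = "this tool creates drawings and documents with a full set"
print(word_error_rate(raw, final))  # 0.2
```

At a rate of 0.2, roughly one word in five needs to be touched by the post-editor.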
Related language-pair
post-editing experience /1
Related language-pair
post-editing experience /2
Original Spanish: Cree documentación y dibujos 2D con un completo conjunto de herramientas de dibujo, edición y anotación.
Apertium output: Crê documentação e desenhos 2D com um completo conjunto de ferramentas de desenho, edição e anotação.
Portuguese final: Produza desenhos e documentação 2D com um conjunto abrangente de ferramentas de desenho, edição e anotação.
Some success cases /2
Also by-products:
Based on Apertium
Some other success cases /3
In Wikimedia Content Translation,
Apertium translates Wikipedia content
Wikimedia Content Translation
into Norwegian Nynorsk
Co-funded project on MT
for Scandinavian
languages including
community outreach starts
Other success cases involving
interaction with other
communities
Translators Without Borders develop
crisis-specific, portable machine translation
from English to Kurdish languages (Kurmanji,
Sorani) on Apertium.
Apertium and language experts help promote a
unified standard for Occitan by defining and
selecting it for Spanish–Occitan and
Catalan–Occitan MT
Funding /1
The Ministry of Industry, Tourism and
Commerce of Spain (also, the Ministries of
Education and Science and of Science and
Technology of Spain)
The Secretariat for Technology and the
Information Society of the Government of
Catalonia
The European Commission (DGT training and
Abu-Matran project)
The Ministry of Foreign Affairs of Romania
Funding /2
Universitat d'Alacant and Universitat Oberta
de Catalunya
Ofis Publik ar Brezhoneg (Breton Language
Board)
Ministry of Education and Science of the
Republic of Kazakhstan
Google Summer of Code scholarships
(2009–2014, 2016, 2017) and Google Code-In
donations (2010–2016).
And many other private companies
You can be part of it!
If you want to build, integrate, or customize
fast, reliable, predictable machine translation for
your application.
If you'd rather understand application-oriented
dictionaries and rules than deal with the
magic of embeddings, decoders, phrase tables,
convolutions, or probabilities.
If there's no way you can amass and curate
millions of translated words to train a system
for your language or application.
Sharing