Published by: Crowdsourcing.org on Jul 26, 2011
What languages are spoken by crowdsourced workers?
Language and cognition research that used to take thousands of dollars over several months can now be completed in a matter of hours for a few dollars. This is thanks to the recent proliferation of crowdsourcing technologies like Amazon Mechanical Turk (AMT) and CrowdFlower, which have opened up new avenues of research by engaging a ready, online workforce through microtasking. But we haven’t yet seen much work in cross-language crowdsourced studies (with some notable exceptions from Johns Hopkins, like Ann Irvine and Alexandre Klementiev). There have been a few studies of the demographics of people using AMT (especially The New Demographics of Mechanical Turk, by Panos Ipeirotis), but no one had thought to simply ask the workers what languages they spoke. So I did.
Languages spoken by AMT workers
I asked a few hundred AMT workers what languages they spoke other than English. This was staggered over a number of tasks at different times of the day and week to get as much variety as possible. You are welcome to play with the results: crowdsourcing_languages.csv (contact me directly for expanded metadata). This gave about 2,100 data points in total.
The languages were (deep breath): Afrikaans, Albanian, American Sign Language, Ancient Greek, Arabic, Assamese, Badaga, Bengali, Bhojpuri, Bulgarian, Cantonese, Catalan, Caucasian, Cebuano, Chattisgarhi, Chinese, Coorgi, Creole, Croatian, Czech, Danish, Dusun, Dutch, Esperanto, Estonian, Farsi, Finnish, Flemish, French, Fuzhou, Galician, Garwali, German, Greek, Gujarati, Haryanvi, Hawaiian, Hebrew, Hindi, Hokkien, Hungarian, Icelandic, Ilocano, Ilonggo, Indonesian, Irish, Italian, Japanese, Kadazan, Kannada, Kiswahili, Klingon, Konkani, Korean, Kurdish, Kutchi, Latin, Latvian, Lithuanian, Macedonian, Maithli, Malay, Malayalam, Mandarin, Manipuri, Marathi, Marwari, Nepali, Norwegian, Orriya, Pa Dutch, Pig Latin, Plattduitsch, Polish, Portuguese, Punjabi, Pushto, Rajasthani, Romanian, Russian, Sanskrit, Serbian, Shanghainese, Sindhi, Slovenian, Sowrashtra, Spanish, Swedish, Swiss German, Tagalog, Tamil, Telugu, Thai, Tulu, Turkish, Ukrainian, Urdu, Vietnamese, Visayan, Yiddish and Yupik! By smoothing estimates, it is safe to predict that at least a few hundred more are spoken by AMT workers.
There are some fuzzy (and not so fuzzy) interpretations. Hindi and Urdu are one language with some minor dialectal variation, as are Indonesian and Malay. At the other end, a number of the participants who reported speaking ‘Chinese’ probably speak any number of related languages, as distinct languages are often called ‘dialects’ within China, especially in relation to the more prestigious languages. ‘Pig Latin’ is not a language. The one person who claimed to speak Klingon … well, who knows, perhaps they do.
I combined the results with the WALS database to map the lineage and origin of many of the languages, showing a huge geographical bias in the distribution. The world’s languages are concentrated in or near the tropics, but those spoken here were predominantly European or non-tropical Asian in origin. Despite that, it is great to see a scattering of less widely spoken languages like Kadazan (Austronesian) and Yupik (Eskimo-Aleut), showing that despite the biases in overall volume there is a rich variety of languages spoken by AMT workers. Six of the ten most commonly spoken (Tamil, Malayalam, Telugu, Kannada, Marathi and Gujarati) do not yet have online translation tools via Google or Bing, so there is clearly great scope to support online translation for new languages, too.
To populate the map in an interesting way, I also calculated the most frequent language reported at each hour of the day, restricting this to one language per timezone. This gives us 24 languages (see below), one for each hour of the day. I’ve added these to the map at midday for the timezone for which they were most frequently spoken. This is more for visual effect than anything else, but it does give an idea of the optimal time to run tasks for any specific language, and strongly correlates with the part of the world that the language originates in (there are surprisingly few crossing lines).
The most common language per timezone, limited to one timezone per language
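The hour-by-hour calculation described above can be sketched roughly as follows. This is a minimal illustration and not necessarily the procedure used for the map: it assumes each response is reduced to a hypothetical (hour, language) pair, and it uses a simple greedy rule (highest counts first) to enforce one language per hour and one hour per language. The function name and sample data are made up for the example.

```python
from collections import Counter

def top_language_per_hour(responses):
    """Assign at most one language to each hour and one hour to each language.

    `responses` is an iterable of (hour, language) pairs, where `hour`
    is an integer from 0 to 23. This input format is an assumption;
    the real CSV may be laid out differently.
    """
    counts = Counter(responses)
    # Highest counts first; ties are broken by hour, then by language
    # name, so the result is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0][0], kv[0][1]))

    assigned = {}            # hour -> language
    used_languages = set()   # languages already placed at some hour
    for (hour, language), _count in ranked:
        if hour in assigned or language in used_languages:
            continue
        assigned[hour] = language
        used_languages.add(language)
    return assigned

# Made-up sample data, for illustration only.
sample = [
    (3, "Tamil"), (3, "Tamil"), (3, "Hindi"),
    (4, "Tamil"), (4, "Hindi"),
    (15, "Spanish"),
]
result = top_language_per_hour(sample)
print(result)
```

In the sample, Tamil is claimed by hour 3, where it is most frequent, so hour 4 falls back to Hindi. A greedy pass like this is easy to follow but is not guaranteed to give the best overall assignment; an optimal matching would need a different method.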
In a recent paper we argued that ‘introspective’ analysis of invented sentences was no longer a required fallback for language studies, as we can quickly obtain speaker judgments about sentences through crowdsourced experiments (with Steven Bethard, Victor Kuperman, Vicky Tzuyin Lai, Robin Melnick, Christopher Potts, Tyler Schnoebelen and Harry Tily). Given the variety of languages here, it is safe to say that researchers also don’t need to limit studies to their native languages. Even language researchers who choose not to undertake fieldwork can now contribute to our knowledge of the world’s linguistic diversity, making for an exciting future for our field.
There is the potential to be more proactive than this study in seeking out speakers of other languages through crowdsourcing platforms. Scott Novotney and Chris Callison-Burch recently found additional Korean speakers for one study by creating a new task asking people to ‘find a Korean speaker’. They shared the income with those people who found the speakers of Korean, finding an unlikely combination of affordable outsourcing and pyramid schemes for the forces of good. I wish I’d thought of that. Another proactive approach would be to partner directly with organizations that establish microtasking centers around the world, like Samasource. Their workers are concentrated in areas of high linguistic diversity, and while most complete tasks for western businesses, there is no doubt that many would enjoy contributing to research about their more local languages. The least well-studied languages are in less-resourced parts of the world, and so the wages typically paid on crowdsourcing platforms could provide a competitive income while contributing to an individual’s work experience and digital skills.

