Welcome to Scribd!

Discriminating Similar Languages Persian and Dari

Uploaded by

0% found this document useful (0 votes)

9 views2 pages

This document summarizes research on discriminating the similar languages of Persian and Dari using language identification techniques. The researchers created a corpus of 14,000 sentences each in Persian and Dari and were able to identify the languages with 96% accuracy using character and word n-grams. They then tested the models on an out-of-domain corpus and achieved 87% accuracy, showing the ability to generalize. Feature analysis revealed lexical, morphological and orthographic differences between the languages. The identification of these differences can help in adapting existing natural language processing resources for the under-resourced language of Dari.

Original Description:

Copyright

Available Formats

PDF, TXT or read online from Scribd

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Report this Document

Copyright:

Available Formats

Download as PDF, TXT or read online from Scribd

Flag for inappropriate content

0% found this document useful (0 votes)

9 views2 pages

Discriminating Similar Languages Persian and Dari

Uploaded by

Sana Isam

Copyright:

Available Formats

Download as PDF, TXT or read online from Scribd

Flag for inappropriate content

Jump to Page

You are on page 1of 2

Search inside document

See discussions, stats, and author profiles for this publication at: https://www.researchgate.

net/publication/279298572

Discriminating Similar Languages: Persian and Dari

Conference Paper · March 2015

CITATIONS READS

0 344

1 author:

Shervin Malmasi
Harvard Medical School
78 PUBLICATIONS 2,864 CITATIONS

SEE PROFILE

All content following this page was uploaded by Shervin Malmasi on 30 June 2015.

The user has requested enhancement of the downloaded file.

Discriminating Similar Languages:
Persian and Dari

Shervin Malmasi
Centre for Language Technology
Macquarie University, Sydney, Australia
shervin.malmasi@mq.edu.au

ABSTRACT
Although widely-studied in recent years, Language Identification (LID) systems for determining the
language of input texts often fail to discriminate between similar languages like Croatian-Serbian
and Malay-Indonesian. This has brought attention to the task of discriminating similar languages,
varieties and dialects – including a recent shared task [3].
Persian (also known as Farsi) and Dari (Eastern Persian, spoken predominantly in Afghanistan)
are two close variants that have not hitherto been investigated in LID and we report the first results
on this pair. Dari is a low-resourced but important language, particularly for the U.S. due to its
ongoing involvement in Afghanistan, which has led to increasing research interest [1].
We developed a corpus of 28k sentences (14k per-language) and using character and word n-grams,
we discriminated them with 96% accuracy. Out-of-domain cross-corpus evaluation was conducted to
test the discriminative models’ generalizability, achieving 87% accuracy in classifying 79k sentences
from the Uppsala Persian Corpus. Feature analysis revealed lexical, morphological and orthographic
inter-language differences.
Further to determining document languages, LID has applications in character encoding detection,
statistical machine translation, inducing dialect-to-dialect lexicons and authorship profiling in the
forensic linguistics domain. In Information Retrieval it can help filter documents (e.g. news articles
or search results) by dialect.
LID can also be used in other Natural Language Processing tasks, including the adaptation
of tools like part-of-speech taggers for low-resourced languages [2]. Since Dari is too different to
directly apply Persian resources, the distinguishing features identified through LID can assist in
adapting existing resources.

BODY
Language Identification methods using surface features can distinguish close
linguistic variants Persian and Dari with 96% accuracy.

REFERENCES
[1] S. Condon, L. Hernandez, D. Parvaz, M. S. Khan, and H. Jahed. Producing Data for
Under-Resourced Languages: A Dari-English Parallel Corpus of Multi-Genre Text. 2012.
[2] A. Feldman, J. Hana, and C. Brew. A cross-language approach to rapid creation of new
morpho-syntactically annotated resources. In Proceedings of LREC, pages 549–554, 2006.
[3] M. Zampieri, L. Tan, N. Ljubešic, and J. Tiedemann. A report on the DSL shared task 2014.
COLING 2014, page 58, 2014.

Volume 3 of Tiny Transactions on Computer Science

This content is released under the Creative Commons Attribution-NonCommercial ShareAlike License. Permission to
make digital or hard copies of all or part of this work is granted without fee provided that copies are not made or
distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page.
CC BY-NC-SA 3.0: http://creativecommons.org/licenses/by-nc-sa/3.0/.

View publication stats

Subdialectal Differences in Sorani Kurdish
Document8 pages
Subdialectal Differences in Sorani Kurdish
100 Slavic
No ratings yet
Adalya
Document10 pages
Adalya
Sana Isam
No ratings yet
Sample Acmsmall
Document15 pages
Sample Acmsmall
Zubair Tahir Khanzada
No ratings yet
ELBERGUI Initiation Recherche
Document6 pages
ELBERGUI Initiation Recherche
sofia
No ratings yet
Code Mixing A Challenge For Language Identification in The Language of Social Media - 2001541112 - Ni Putu Ayu Devi Agustini
Document12 pages
Code Mixing A Challenge For Language Identification in The Language of Social Media - 2001541112 - Ni Putu Ayu Devi Agustini
Devi Agustini
No ratings yet
L2 Speaking Development During Study Abroad: Fluency, Accuracy, Complexity, and Underlying Cognitive Factors
Document16 pages
L2 Speaking Development During Study Abroad: Fluency, Accuracy, Complexity, and Underlying Cognitive Factors
kerrin galvez
No ratings yet
1 s2.0 S0167639322000292 Main
Document16 pages
1 s2.0 S0167639322000292 Main
Abdelkbir Ws
No ratings yet
167 574 3 PB
Document5 pages
167 574 3 PB
Sana Isam
No ratings yet
Accent Classification Among Punjabi, Urdu, Pashto, Saraiki and Sindhi Accents of Urdu Language
Document8 pages
Accent Classification Among Punjabi, Urdu, Pashto, Saraiki and Sindhi Accents of Urdu Language
Farzana Butt - 66806/TCHR/MGB
No ratings yet
Feature Hashing For Language and Dialect Identification
Document5 pages
Feature Hashing For Language and Dialect Identification
Sana Isam
No ratings yet
Arabic Multi-Dialect Segmentation: bi-LSTM-CRF vs. SVM: Habash and Sadat 2006
Document10 pages
Arabic Multi-Dialect Segmentation: bi-LSTM-CRF vs. SVM: Habash and Sadat 2006
Husien Naser
No ratings yet
Transliteration of Arabizi Into Arabic Orthography
Document12 pages
Transliteration of Arabizi Into Arabic Orthography
Ichraf Khechine
No ratings yet
Creating Lexical Resources For Endangered Languages: January 2014
Document10 pages
Creating Lexical Resources For Endangered Languages: January 2014
filmonts11
No ratings yet
Seminar Powerpoint
Document27 pages
Seminar Powerpoint
ዘብሄረ መኧኸኧተኽ ኸብታሙ
No ratings yet
Debapriya Sengupta, Goutam Saha - Identification of The Major Language Families of India and Evaluation of Their Mutual Influence 2016
Document16 pages
Debapriya Sengupta, Goutam Saha - Identification of The Major Language Families of India and Evaluation of Their Mutual Influence 2016
sreenathpnr
No ratings yet
Contrastive Analysis
Document16 pages
Contrastive Analysis
any one
No ratings yet
Language and Dialect Identification of Cuneiform T
Document11 pages
Language and Dialect Identification of Cuneiform T
Romuald Dion
No ratings yet
Multilingual Acoustic and Language Modeling For Et
Document6 pages
Multilingual Acoustic and Language Modeling For Et
Legesse Samuel
No ratings yet
LT 273 Oa
Document26 pages
LT 273 Oa
boutazmait97
No ratings yet
Lesson 2
Document2 pages
Lesson 2
Gomer Jay Legaspi
No ratings yet
V.vo JSLW Corpus
Document13 pages
V.vo JSLW Corpus
ThomasNguyễn
No ratings yet
Light Stemming For Arabic Information Retrieval
Document34 pages
Light Stemming For Arabic Information Retrieval
Vega Delta
No ratings yet
Lichouri LID 2023
Document6 pages
Lichouri LID 2023
MUHAMMAD AHSAN
No ratings yet
Paper+2+ (2022 3 1) +A+Grammar+Sketch+of+Southern+Sinama+Language PDF
Document53 pages
Paper+2+ (2022 3 1) +A+Grammar+Sketch+of+Southern+Sinama+Language PDF
Fatima Raiza K. Sammara
No ratings yet
An Effective Bi-LSTM Word Embedding System For Analysis and Identification of Language in Code-Mixed Social Media Text in English and Roman Hindi
Document13 pages
An Effective Bi-LSTM Word Embedding System For Analysis and Identification of Language in Code-Mixed Social Media Text in English and Roman Hindi
Big Daddy
No ratings yet
08-05-2021-1620457714-8 - Ijll-4. Ijll - Analyzing Different Phrases Structure Between Malay, Arabic and English
Document10 pages
08-05-2021-1620457714-8 - Ijll-4. Ijll - Analyzing Different Phrases Structure Between Malay, Arabic and English
iaset123
No ratings yet
Strategies of TranslatingIdioms
Document17 pages
Strategies of TranslatingIdioms
Assignments EFF
No ratings yet
Icpram 2020 38
Document9 pages
Icpram 2020 38
Melat Fissha
No ratings yet
Arabic Automatic Speech Recognition Transcripts
Document9 pages
Arabic Automatic Speech Recognition Transcripts
Sana Isam
No ratings yet
Automatic Wordnet Development For Low-Resource Languages Using Cross-Lingual WSD
Document28 pages
Automatic Wordnet Development For Low-Resource Languages Using Cross-Lingual WSD
SHIKHAR AGNIHOTRI
No ratings yet
1 s2.0 S1877042814025853 Main
Document6 pages
1 s2.0 S1877042814025853 Main
Mhilal Şahan
No ratings yet
Code Mixing: A Challenge For Language Identification in The Language of Social Media
Document11 pages
Code Mixing: A Challenge For Language Identification in The Language of Social Media
Devi Agustini
No ratings yet
Symmetry: Low-Resource Named Entity Recognition Via The Pre-Training Model
Document15 pages
Symmetry: Low-Resource Named Entity Recognition Via The Pre-Training Model
Mehari Yohannes
No ratings yet
A Novel Unsupervised Corpus-Based Stemming
Document16 pages
A Novel Unsupervised Corpus-Based Stemming
Office Work
No ratings yet
Research On Regional Languages
Document6 pages
Research On Regional Languages
Abhishek Rana
No ratings yet
Design and Implementation of Morphology Based Spell Checker
Document9 pages
Design and Implementation of Morphology Based Spell Checker
Singitan Yomiyu
No ratings yet
Vocabulary Knowledge As A Predictor of Performance in Writing and Speaking: A Case of Turkish EFL Learners
Document32 pages
Vocabulary Knowledge As A Predictor of Performance in Writing and Speaking: A Case of Turkish EFL Learners
Vera Zokli
No ratings yet
Rule Based Hindi To Urdu Transliteration System: Bushra78622@yahoo - in
Document5 pages
Rule Based Hindi To Urdu Transliteration System: Bushra78622@yahoo - in
nipun sharma
No ratings yet
Zariquiey Etal
Document29 pages
Zariquiey Etal
Adriano Wilson Valencia . B
No ratings yet
Language Lexicons For Hindi-English Multilingual Text Processing
Document8 pages
Language Lexicons For Hindi-English Multilingual Text Processing
IAES IJAI
No ratings yet
Some Good Projects in NLP
Document3 pages
Some Good Projects in NLP
DharmendraChoudhary
No ratings yet
Globaltrait: Personality Alignment of Multilingual Word Embeddings
Document8 pages
Globaltrait: Personality Alignment of Multilingual Word Embeddings
ASDAD zxc
No ratings yet
Translation Students' Knowledge of Lexical Cohesion Patterns and Their Performance in The Translation of English Texts
Document7 pages
Translation Students' Knowledge of Lexical Cohesion Patterns and Their Performance in The Translation of English Texts
Trúc Phạm
No ratings yet
Meky Jurnal
Document10 pages
Meky Jurnal
Gustian Siharta
No ratings yet
A Linguistic Analysis of Simplified and Authentic Texts: The Modern Language Journal, 91, I, (2007)
Document16 pages
A Linguistic Analysis of Simplified and Authentic Texts: The Modern Language Journal, 91, I, (2007)
envi
No ratings yet
Kespeech An Open Source Speech
Document12 pages
Kespeech An Open Source Speech
gupnaval2473
No ratings yet
Effect of V and G On Writing
Document16 pages
Effect of V and G On Writing
chio
No ratings yet
Semisupervised Data Driven Word Sense... (Pratibha Rani and Others)
Document11 pages
Semisupervised Data Driven Word Sense... (Pratibha Rani and Others)
arpit
No ratings yet
A Comparative Analysis of The Similar Word Formation Processes in English and Arabic
Document21 pages
A Comparative Analysis of The Similar Word Formation Processes in English and Arabic
Nur Hamelia
No ratings yet
Characteristics of Malay Translated Hadith Corpus
Document10 pages
Characteristics of Malay Translated Hadith Corpus
22080359
No ratings yet
The Modern Language Journal - 2007 - CROSSLEY - A Linguistic Analysis of Simplified and Authentic Texts
Document16 pages
The Modern Language Journal - 2007 - CROSSLEY - A Linguistic Analysis of Simplified and Authentic Texts
dowo
No ratings yet
Contrastive Analysis
Document3 pages
Contrastive Analysis
Glennmelove
No ratings yet
Borodkin Kenett Faust & Mashal 2016
Document11 pages
Borodkin Kenett Faust & Mashal 2016
Olga Tszaga
No ratings yet
Saraiki Language Hybrid Stemmer Using Rule-Based and LSTM-Based Sequence-To-Sequence Model Approach
Document24 pages
Saraiki Language Hybrid Stemmer Using Rule-Based and LSTM-Based Sequence-To-Sequence Model Approach
INNOVATIVE COMPUTING REVIEW
No ratings yet
Syntactic Analysis Between Spoken and Written Language: A Case Study On A Senior High Student's Lexical Density
Document33 pages
Syntactic Analysis Between Spoken and Written Language: A Case Study On A Senior High Student's Lexical Density
Joneth Derilo-Vibar
No ratings yet
Miller 2016
Document14 pages
Miller 2016
Jose Alfonso Szabo Calvo
No ratings yet
Agic2015 Universal
Document8 pages
Agic2015 Universal
Ge Arias
No ratings yet
Implemented Stemming Algorithms For Six Ethiopian Languages
Document5 pages
Implemented Stemming Algorithms For Six Ethiopian Languages
balj balh
No ratings yet
Enayat
Document15 pages
Enayat
Caku Chen
No ratings yet
Analysis of a Medical Research Corpus: A Prelude for Learners, Teachers, Readers and Beyond
From Everand
Analysis of a Medical Research Corpus: A Prelude for Learners, Teachers, Readers and Beyond
Georgette Nicolas Jabbour
No ratings yet
Jproc 2012 2237151
Document24 pages
Jproc 2012 2237151
Sana Isam
No ratings yet
German Dialect Identification Using Classifier Ensembles
Document7 pages
German Dialect Identification Using Classifier Ensembles
Sana Isam
No ratings yet
Arabic Automatic Speech Recognition Transcripts
Document9 pages
Arabic Automatic Speech Recognition Transcripts
Sana Isam
No ratings yet
Jira A Kurdish Speech Recognition System Designing
Document19 pages
Jira A Kurdish Speech Recognition System Designing
Sana Isam
No ratings yet
Findings of The VarDial Evaluation Campaign 2017
Document15 pages
Findings of The VarDial Evaluation Campaign 2017
Sana Isam
No ratings yet
Kurdish Language, Its Family and Dialects
Document14 pages
Kurdish Language, Its Family and Dialects
Sana Isam
No ratings yet
Feature Hashing For Language and Dialect Identification
Document5 pages
Feature Hashing For Language and Dialect Identification
Sana Isam
No ratings yet
K. M. Petyt Study of Dialect
Document231 pages
K. M. Petyt Study of Dialect
Sana Isam
No ratings yet
Hidden Markov Model and Persian Speech Recognition
Document9 pages
Hidden Markov Model and Persian Speech Recognition
Sana Isam
No ratings yet
FARSDAT
Document12 pages
FARSDAT
Sana Isam
No ratings yet
DIWAN A Dialectal Word Annotation Tool For Arabic
Document10 pages
DIWAN A Dialectal Word Annotation Tool For Arabic
Sana Isam
No ratings yet
Jira A Kurdish Speech Recognition System
Document19 pages
Jira A Kurdish Speech Recognition System
Sana Isam
No ratings yet
Etman Paper1
Document13 pages
Etman Paper1
Sana Isam
No ratings yet
2015 Attention Based Models For Speech Recognition Paper
Document9 pages
2015 Attention Based Models For Speech Recognition Paper
Sana Isam
No ratings yet
ERG Tokenization and Lexical Categorization A Sequence Labelling Approach
Document106 pages
ERG Tokenization and Lexical Categorization A Sequence Labelling Approach
Sana Isam
No ratings yet
Fine-Grained Arabic Dialect Identification
Document13 pages
Fine-Grained Arabic Dialect Identification
Sana Isam
No ratings yet
A New Approach For Persian Speech Recognition
Document6 pages
A New Approach For Persian Speech Recognition
Sana Isam
No ratings yet
German Dialect Identification in Interview Transcriptions
Document6 pages
German Dialect Identification in Interview Transcriptions
Sana Isam
No ratings yet
Discriminative N-Gram Selection For Dialect Recognition
Document4 pages
Discriminative N-Gram Selection For Dialect Recognition
Sana Isam
No ratings yet
A Multi Purpose and Large Scale Speech Corpus in Persian and English For Speaker and Speech Recognition The Deepmine Database
Document6 pages
A Multi Purpose and Large Scale Speech Corpus in Persian and English For Speaker and Speech Recognition The Deepmine Database
Sana Isam
No ratings yet
167 574 3 PB
Document5 pages
167 574 3 PB
Sana Isam
No ratings yet
International Journal of Current Advanced Research: A Survey On Automatic Speech Recognition System Research Article
Document11 pages
International Journal of Current Advanced Research: A Survey On Automatic Speech Recognition System Research Article
Sana Isam
No ratings yet
Paragraph Structure For Key Stage 5: Moving Beyond Point, Evidence, Explanation (PEE)
Document10 pages
Paragraph Structure For Key Stage 5: Moving Beyond Point, Evidence, Explanation (PEE)
Sean Clyde
No ratings yet
KET Vocabulary Test
Document9 pages
KET Vocabulary Test
kamy214
No ratings yet
Homonym y
Document6 pages
Homonym y
Dương Thảo Linh
No ratings yet
Comparative and Superlative Adjectives
Document18 pages
Comparative and Superlative Adjectives
Sarah Khalil
No ratings yet
Animal Farm. Vocab From Chapters 9 and 10 and Essay
Document2 pages
Animal Farm. Vocab From Chapters 9 and 10 and Essay
Yue Fan
No ratings yet
A Practical Worksheet
Document1 page
A Practical Worksheet
Hamid Ben
No ratings yet
Fitri Zakiah
Document111 pages
Fitri Zakiah
Diah Maulidya
No ratings yet
Pythagorean Plato McClain
Document204 pages
Pythagorean Plato McClain
dkwuk8720
100% (5)
Title of The Program: Fueling Students' Grammatical Skill On Subject-Verb Agreement
Document6 pages
Title of The Program: Fueling Students' Grammatical Skill On Subject-Verb Agreement
TCHR KIM
No ratings yet
C Annotation
Document502 pages
C Annotation
Bogotano SY
No ratings yet
Public Notice - New Translator - Sri Lanka (15.01.2024)
Document1 page
Public Notice - New Translator - Sri Lanka (15.01.2024)
Rodrigo da Frota
No ratings yet
What Are Interesting Language Facts About The Dutch Language?
Document8 pages
What Are Interesting Language Facts About The Dutch Language?
DutchTrans
No ratings yet
Ilovepdf Merged
Document7 pages
Ilovepdf Merged
Ahmad Rifath
No ratings yet
Technical English Civil Engineering and Construction
Document31 pages
Technical English Civil Engineering and Construction
samir
0% (1)
Lesson Plans To Accommodate Diverse Learners
Document5 pages
Lesson Plans To Accommodate Diverse Learners
Amirah Husna
No ratings yet
Teaching Receptive and Productive Skills
Document37 pages
Teaching Receptive and Productive Skills
Nurul Hanisah
No ratings yet
Prepositions
Document7 pages
Prepositions
Atul Rajak
No ratings yet
Grade 8 q3 m6
Document6 pages
Grade 8 q3 m6
SilvestrePariolanDionela
No ratings yet
Gerunds and Infinitives Part 2
Document16 pages
Gerunds and Infinitives Part 2
Try W
No ratings yet
To Use
Document4 pages
To Use
Olga Lomadze
No ratings yet
COCOMO II - Reference Manual
Document88 pages
COCOMO II - Reference Manual
gtomasig
No ratings yet
Education Innovation Capsule Proposal
Document4 pages
Education Innovation Capsule Proposal
Andrea Lyn Palma
No ratings yet
TOEIC A Reading Unit 1
Document45 pages
TOEIC A Reading Unit 1
Hụê Chan
No ratings yet
Vowel Sounds
Document3 pages
Vowel Sounds
Gusni Dariani
No ratings yet
Eported Peech K S F T: Instructions
Document2 pages
Eported Peech K S F T: Instructions
ariana grande
No ratings yet
Resume
Document2 pages
Resume
api-3745674
100% (4)
Spanish Vocabulary Genius Book1111 PDF (1) .PDF WORD
Document114 pages
Spanish Vocabulary Genius Book1111 PDF (1) .PDF WORD
Amra-Refik Mujanovic
No ratings yet
Importance of Mechanics in Writing
Document3 pages
Importance of Mechanics in Writing
Jesa Mae Copioso
No ratings yet
Be Going To For Plans
Document1 page
Be Going To For Plans
David Carlos Bertolotto Huamaní
No ratings yet
A Figure of Speech Is Just That
Document6 pages
A Figure of Speech Is Just That
Heidi
No ratings yet