1-Word Frequency Lists:: Monolingual Parallel Comparable

A corpus is a collection of text gathered according to explicit criteria that is used for linguistic investigation. Corpus analysis tools include word frequency lists, stop lists to ignore common words, and concordances to display contexts of searched words. Concordances can be monolingual, bilingual, or multilingual. Collocations identify words that commonly occur together based on their mutual information score. Annotation and markup can be added to corpora for linguistic purposes like part-of-speech tagging or to distinguish word meanings, or for non-linguistic purposes like publication dates.

Original title: حاسوبية ٧ ("Computational 7")

Uploaded by

haifa

Corpora and Corpus-Analysis Tools

A corpus is simply a collection of texts or utterances that is used as a basis for conducting some type of linguistic investigation.

More recently, the term corpus has come to refer to a large collection of electronic texts gathered according to explicit criteria.

Types of corpora:

*Monolingual
*Parallel: bilingual or multilingual
*Comparable: monolingual, bilingual or multilingual

Corpus analysis tools:

1-Word frequency lists:

*The list can be sorted in different orders (e.g. alphabetical order).
*The user can discover the number of tokens (all the words, including repeated words) and types (only the different words).
#Lemmatized lists:
*They group related word forms together rather than listing each individual word form separately.
*A lemma is used to describe a word that includes and represents all related forms.
*Lemmatizing a word list automatically runs into the problem of homographs, i.e. two words that have the same spelling but different parts of speech: the tool puts them under one lemma even though they are two different words.
#A stop list:
*A list of any items that the user wants the computer to ignore (e.g. articles).
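
The points above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample text, the stop list and the tiny hand-made lemma table (`TEXT`, `STOP_LIST`, `LEMMAS`) are hypothetical, and real tools use much larger resources.

```python
import re
from collections import Counter

# Hypothetical sample data, invented for this sketch.
TEXT = "The cat saw the cats. The cats were seen by a dog."
STOP_LIST = {"the", "a", "by"}
# Hand-made lemma table. Note that "saw" is mapped to "see" blindly, even
# though it could also be the noun "saw": this is the homograph problem.
LEMMAS = {"cats": "cat", "saw": "see", "seen": "see", "were": "be"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def frequency_list(tokens, stop_list=frozenset(), lemmas=None):
    """Count word forms (or lemmas), ignoring anything on the stop list."""
    lemmas = lemmas or {}
    kept = (lemmas.get(t, t) for t in tokens if t not in stop_list)
    return Counter(kept)


tokens = tokenize(TEXT)
raw = frequency_list(tokens)
print("tokens:", len(tokens))  # all running words, repetitions included
print("types:", len(raw))      # different word forms only

# Sort by descending frequency, then alphabetically.
for word, count in sorted(raw.items(), key=lambda kv: (-kv[1], kv[0])):
    print(word, count)

# Lemmatized list with the stop list applied.
filtered = frequency_list(tokens, STOP_LIST, LEMMAS)
print(sorted(filtered.items()))
```

Sorting the same counts with a different key (for example `sorted(raw)` for plain alphabetical order) gives the other orderings mentioned above.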

2-Concordances:
-Monolingual -Bilingual
*The results are displayed after a search has been conducted.
*The most common format is KWIC (key word in context), in which all occurrences of the search term are lined up in the center.
*Contexts can be sorted in a variety of ways (e.g. order of appearance, alphabetically).
*Search patterns (types): exact string, case-sensitive, wildcard, Boolean operators, context search.
*A parallel corpus is a corpus that contains a collection of source texts (ST) in language A aligned with their translations into language B.
*Alignment is the process whereby sections of the ST are linked up with their corresponding translations.
*The aligned sections are typically displayed either alongside each other or one above the other; they can also be sorted (e.g. alphabetically).
*Most bilingual concordancers are bidirectional (the search can be entered in either language A or language B).
*Statistical measures are a more sophisticated feature: the tool tries to identify potential equivalents specifically and shows the results in separate windows. The advantage of this type of search is that the ST and the target text (TT) can be sorted independently to reveal patterns in both languages.
*A bilingual query is a type of search in which the user specifies a search term in both languages; it is useful for checking whether or not a given translation is attested.
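
As a rough illustration of the concordancing ideas above, the following Python sketch builds a KWIC display with a `*` wildcard and an optional case-sensitive mode, and runs a simple bilingual query over aligned sentence pairs. The function names (`kwic`, `show`, `bilingual_query`) and the sample data are assumptions made for this example; Boolean operators and context search are not covered.

```python
import re

# Hypothetical sample data, invented for this sketch.
TEXT = "A corpus is a collection of texts. Corpora are used in corpus linguistics."
ALIGNED = [  # (ST in language A, TT in language B), already aligned
    ("The printer is out of paper.", "L'imprimante n'a plus de papier."),
    ("Turn off the printer.", "Éteignez l'imprimante."),
    ("The paper is jammed.", "Le papier est coincé."),
]


def kwic(text, term, width=25, case_sensitive=False):
    """Return (left context, keyword, right context) for every hit."""
    flags = 0 if case_sensitive else re.IGNORECASE
    # Treat "*" as a wildcard for any word characters; the rest is literal.
    pattern = r"\b" + re.escape(term).replace(r"\*", r"\w*") + r"\b"
    hits = []
    for match in re.finditer(pattern, text, flags):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        hits.append((left, match.group(), right))
    return hits


def show(hits, width=25):
    """Print the hits with the keyword starting in the same column."""
    for left, keyword, right in hits:
        print(f"{left:>{width}} {keyword} {right}")


def bilingual_query(aligned_pairs, source_term, target_term):
    """Return aligned pairs in which both search terms occur."""
    return [
        (st, tt) for st, tt in aligned_pairs
        if source_term.lower() in st.lower() and target_term.lower() in tt.lower()
    ]


hits = kwic(TEXT, "corp*")
show(hits)                                          # order of appearance
show(sorted(hits, key=lambda h: h[2].lower()))      # sorted by right context
print(bilingual_query(ALIGNED, "printer", "imprimante"))
```

Because each hit keeps its left and right context separately, the same list can be re-sorted on either side without searching again.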

3-Collocations:
A collocation is regarded as two words that go together.
Mutual information (MI) is a formula for determining whether or not two words are collocates.
If two words are strongly connected, they will have a high MI score; if not, they will have a low MI score.
Its drawbacks:
1-It assumes that the different words occur as completely independent events, whereas languages are actually full of dependencies.
2-MI usually requires about 5 co-occurrences within a corpus in order to be valid.
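
The notes do not give the formula itself. A common pointwise formulation is MI(x, y) = log2( P(x, y) / (P(x) P(y)) ), which compares how often two words actually co-occur with how often they would co-occur if they were independent. The sketch below assumes that formulation, treats only adjacent word pairs as co-occurrences, and uses a hypothetical toy corpus; the `min_cooccurrences` threshold mirrors the second drawback above.

```python
import math
from collections import Counter


def mutual_information(tokens, min_cooccurrences=5):
    """Score adjacent word pairs with pointwise MI (log base 2)."""
    n = len(tokens)
    word_freq = Counter(tokens)
    pair_freq = Counter(zip(tokens, tokens[1:]))  # adjacent pairs only
    scores = {}
    for (x, y), f_xy in pair_freq.items():
        if f_xy < min_cooccurrences:
            continue  # too few co-occurrences for the score to be valid
        # Co-occurrences expected if x and y were independent events.
        expected = word_freq[x] * word_freq[y] / n
        scores[(x, y)] = math.log2(f_xy / expected)
    return scores


# Hypothetical toy corpus: "strong tea please" repeated five times.
tokens = "strong tea please".split() * 5
scores = mutual_information(tokens)
for pair, score in sorted(scores.items()):
    print(pair, round(score, 2))
```

In this toy corpus the pair ("please", "strong") occurs only four times, so it is filtered out rather than scored.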

Annotation and mark-up:
Whether to add them depends on the project at hand.
They require a greater initial investment of time, but subsequently allow more specific searching.

#Linguistic (annotation):
*The advantage is that it allows users to focus their research more narrowly.
1-Syntactic
*Each word in the corpus has its part of speech specified with a tag (there is no standard tag set).
*Tagger programs add part-of-speech tags to a corpus automatically (the output needs to be edited by a human).
2-Semantic
*It distinguishes between multiple meanings of a word, such as homonyms (words that have the same spelling but different meanings).
#Non-linguistic (mark-up):
*A way of adding non-linguistic information to a corpus.
*It makes it possible to ask the computer to retrieve only the occurrences that meet specific search criteria (such as the title of a text or its publication date).
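
To show how annotation and mark-up narrow a search, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the mini-corpus, the `word/TAG` format, the tag names (the notes stress that there is no standard tag set) and the `search` function. Semantic annotation is not covered.

```python
# Hypothetical mini-corpus: each text carries mark-up (title, year) and
# part-of-speech annotation in word/TAG form. Tag names are made up.
CORPUS = [
    {"title": "Workshop Guide", "year": 1998,
     "tagged": "He/PRON saw/VERB the/DET saw/NOUN on/PREP the/DET bench/NOUN"},
    {"title": "Travel Diary", "year": 2005,
     "tagged": "We/PRON saw/VERB the/DET river/NOUN"},
]


def search(corpus, word, tag=None, **metadata):
    """Find `word`, optionally restricted by POS tag and by mark-up fields."""
    hits = []
    for text in corpus:
        # Skip texts whose mark-up does not match the requested values.
        if any(text.get(key) != value for key, value in metadata.items()):
            continue
        for position, item in enumerate(text["tagged"].split()):
            form, pos = item.rsplit("/", 1)
            if form.lower() == word.lower() and tag in (None, pos):
                hits.append((text["title"], position, pos))
    return hits


print(search(CORPUS, "saw"))              # every occurrence
print(search(CORPUS, "saw", tag="NOUN"))  # only the noun, via annotation
print(search(CORPUS, "saw", year=2005))   # only one text, via mark-up
```

The untagged search returns both the verb and the noun "saw"; adding the tag or the publication year is what allows the more specific searching described above.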
