
2021/‫‏‬6/‫‏‬26 GitHub - MagedSaeed/farasapy: A Python implementation of Farasa toolkit

MagedSaeed / farasapy


Table of Content
https://github.com/MagedSaeed/farasapy 1/7

Disclaimer
Introduction
Installation
How to use
AN IMPORTANT REMARK
An Overview
Standalone Mode
Interactive Mode
Contribution
Want to cite?
Useful URLs

downloads 615/week
license MIT
python 3.6
pypi v0.0.13

Disclaimer

This is a Python API wrapper for the Farasa [http://qatsdemo.cloudapp.net/farasa/] toolkit.


Although this work is licensed under MIT, the original work (the toolkit) is strictly
permitted for research purposes only. For any commercial use, please contact the toolkit
creators [http://qatsdemo.cloudapp.net/farasa/].

Introduction
Farasa is an Arabic NLP toolkit serving the following tasks:


1. Segmentation.
2. Stemming.
3. Named Entity Recognition (NER).
4. Part Of Speech tagging (POS tagging).
5. Diacritization.

The toolkit is built and compiled in Java. Developers who want to use it without using this
library may call the binaries directly from their code.
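
As a rough illustration of what calling the binaries directly could look like from Python, the sketch below builds a `java -jar` command and runs it with `subprocess`. The jar path and any extra arguments are placeholders supplied by the caller; they are assumptions for this sketch and are not taken from the toolkit's documentation.

```python
import subprocess
from typing import List, Sequence


def build_java_command(jar_path: str, extra_args: Sequence[str] = ()) -> List[str]:
    """Build the argument list for running a jar with the `java` launcher."""
    # `java -jar <file>` is the standard way to run an executable jar.
    # Which arguments a given Farasa jar accepts is NOT assumed here;
    # the caller passes them in `extra_args`.
    return ["java", "-jar", jar_path, *extra_args]


def run_jar(jar_path: str, extra_args: Sequence[str] = ()) -> str:
    """Run the jar and return whatever it writes to standard output."""
    completed = subprocess.run(
        build_java_command(jar_path, extra_args),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return completed.stdout
```

Every caller would have to repeat this plumbing, locate the jars, and manage input and output files, which is the kind of work a wrapper can take over.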

Because Python is a general-purpose language that is popular for many NLP tasks, automating
these calls to the toolkit from Python code is convenient. This is where this wrapper fits in.

Installation

pip install farasapy

How to use
An interactive Google Colab notebook for the library is available here
[https://colab.research.google.com/drive/1xjzYwmfAszNzfR6Z2lSQi3nKYcjarXAW?usp=sharing].

AN IMPORTANT REMARK
Because the library is a wrapper around Java jars, Java must be installed on your system and
available on your PATH. A version below Java 1.7 is not recommended.


Some binaries are computationally HEAVY!
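
If you want to verify the Java requirement yourself before instantiating anything, a check along the following lines may help. This is a minimal sketch and not the library's own check; the helper names are hypothetical, and the version parsing assumes the usual `version "..."` line that `java -version` prints.

```python
import re
import shutil
import subprocess
from typing import Optional, Tuple


def parse_java_version(version_output: str) -> Optional[Tuple[int, int]]:
    """Extract (major, minor) from the text printed by `java -version`."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2) or 0)


def java_is_recent_enough(version: Tuple[int, int]) -> bool:
    """True when the version is at least 1.7 (newer JDKs report 9, 11, 17, ...)."""
    return version >= (1, 7)


def check_java() -> bool:
    """Return True if `java` is on PATH and reports a version of 1.7 or later."""
    if shutil.which("java") is None:
        return False
    # `java -version` writes its report to stderr, not stdout.
    result = subprocess.run(
        ["java", "-version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    version = parse_java_version(result.stderr + result.stdout)
    return version is not None and java_is_recent_enough(version)
```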

An Overview
Farasapy wraps the toolkit's APIs in separate classes, each in its own file. Import the class
you need from its file as follows:

from farasa.pos import FarasaPOSTagger
from farasa.ner import FarasaNamedEntityRecognizer
from farasa.diacratizer import FarasaDiacritizer
from farasa.segmenter import FarasaSegmenter
from farasa.stemmer import FarasaStemmer

If you are using the library for the first time, it needs to download the Farasa toolkit
binaries first. No manual steps are required: whenever you instantiate an object of any of its
classes, the library checks for the binaries and downloads them if they do not exist. The
following example instantiates a FarasaStemmer object on first use of the library.

stemmer = FarasaStemmer()

perform system check...

check java version...

Your java version is 1.8 which is compatiple with Farasa

check toolkit binaries...

some binaries are not existed..

downloading zipped binaries...

100%|███████████████████████████████████████| 200M/200M [02:39<00:00, 1.26MiB/s]

extracting...

toolkit binaries are downloaded and extracted.

Dependencies seem to be satisfied..

task [STEM] is initialized in STANDALONE mode...


Let us stem the following example:

sample =\
'''
يُشار إلى أن اللغة العربية يتحدثها أكثر من 422 مليون نسمة ويتوزع متحدثوها
في المنطقة المعروفة باسم الوطن العربي بالإضافة إلى العديد من المناطق الأخرى
المجاورة مثل الأهواز وتركيا وتشاد والسنغال وإريتريا وغيرها.وهي اللغة
الرابعة من لغات منظمة الأمم المتحدة الرسمية الست.
'''

stemmed_text = stemmer.stem(sample)
print(stemmed_text)

'أشار إلى أن لغة عربي تحدث أكثر من 422 مليون نسمة توزع متحدثوها في منطقة معروف اسم وطن عربي إضافة إلى عديد من منطقة آخر مجاور مثل أهواز تركيا تشاد سنغال أريتريا غير . هي لغة رابع من لغة منظمة أمة متحد رسمي ست .'

You may notice that the last line of object instantiation states that the object is instantiated in
STANDALONE mode. Farasapy, like the toolkit binaries themselves, can run in two different
modes: Interactive and Standalone.

Standalone Mode
In standalone mode, the instantiated object calls the binary each time it performs its task. It
writes the input text to a temporary file, executes the binary on that file, and finally
reads the output from another temporary file. These temporary files are deleted once the task
ends. Note that some binaries, such as the diacritizer, may take a very long time to start.
This mode is therefore preferred when you have a long text that you want to process only
once.
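
The temporary-file round trip described above can be sketched as follows. This is not farasapy's actual code: the function name, the `<command> <input_path> <output_path>` calling convention, and the stand-in "binary" (a one-line Python program that upper-cases a file) are all assumptions made so the example runs without Java.

```python
import os
import subprocess
import sys
import tempfile
from typing import Sequence


def run_with_temp_files(base_command: Sequence[str], text: str) -> str:
    """Write `text` to a temp file, run the command on it, read the output file.

    The command is invoked as: <base_command> <input_path> <output_path>.
    That calling convention is hypothetical and chosen only for this sketch.
    """
    # The directory and both files are deleted when the `with` block exits.
    with tempfile.TemporaryDirectory() as workdir:
        input_path = os.path.join(workdir, "input.txt")
        output_path = os.path.join(workdir, "output.txt")
        with open(input_path, "w", encoding="utf-8") as handle:
            handle.write(text)
        # A new process is started for every call, so startup cost is paid each time.
        subprocess.run([*base_command, input_path, output_path], check=True)
        with open(output_path, encoding="utf-8") as handle:
            return handle.read()


# Stand-in for a real binary: a tiny Python program that upper-cases a file.
STAND_IN = [
    sys.executable,
    "-c",
    "import sys; src, dst = sys.argv[1:3]; "
    "open(dst, 'w', encoding='utf-8').write(open(src, encoding='utf-8').read().upper())",
]

print(run_with_temp_files(STAND_IN, "hello farasa"))
```

Because every call pays the full process startup cost, this pattern suits one large input far better than many small ones.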

Interactive Mode


In interactive mode, the object starts the binary once, when it is instantiated. It then feeds
text to the binary interactively and captures the output for each input. However, avoid passing
very long lines, since the output, just as in a shell, might not be as expected. It is good
practice to terminate such objects with my_obj.terminate() once they are no longer needed, to
avoid unexpected behaviour in your code.

As a best practice, use INTERACTIVE mode when the input text is small and you need to run
the task multiple times. STANDALONE mode is best for large input texts where the task is
expected to run only once.
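
To make the difference concrete, the sketch below keeps a single child process alive and exchanges one line per call, which is the general pattern interactive mode relies on. It is an illustration only, not farasapy's implementation: the `LineProcess` class is hypothetical, and the child is a small Python program that reverses each line, standing in for a Java binary.

```python
import subprocess
import sys
from typing import Sequence


class LineProcess:
    """Keep one child process alive and exchange one line of text per call."""

    def __init__(self, command: Sequence[str]) -> None:
        self._process = subprocess.Popen(
            list(command),
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            encoding="utf-8",
            bufsize=1,  # line-buffered pipes on the Python side
        )

    def send(self, line: str) -> str:
        """Send a single line and block until the child answers with one line."""
        if "\n" in line:
            raise ValueError("send() expects a single line of text")
        self._process.stdin.write(line + "\n")
        self._process.stdin.flush()
        return self._process.stdout.readline().rstrip("\n")

    def terminate(self) -> None:
        """Close the pipes and stop the child process."""
        self._process.stdin.close()
        self._process.stdout.close()
        self._process.terminate()
        self._process.wait()


# Stand-in for a real binary: echoes every input line reversed.
STAND_IN = [
    sys.executable,
    "-u",
    "-c",
    "import sys\nfor line in sys.stdin:\n    print(line.rstrip('\\n')[::-1], flush=True)",
]

worker = LineProcess(STAND_IN)
try:
    print(worker.send("abc"))
    print(worker.send("farasa"))
finally:
    worker.terminate()  # always release the child process
```

The startup cost is paid once, so repeated short inputs are cheap. The one-line-in, one-line-out protocol is also why very long or multi-line inputs are a poor fit for this pattern.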

To work in interactive mode, pass the interactive=True option to your object's
constructor.

The following example runs the segmentation API interactively.

segmenter = FarasaSegmenter(interactive=True)

perform system check...

check java version...

Your java version is 1.8 which is compatiple with Farasa

check toolkit binaries...

Dependencies seem to be satisfied..

/path/to/the/library/farasa/__base.py:40: UserWarning: Be careful with large lines as they may break on interactive mode. You may switch to Standalone mode for such cases.

warnings.warn("Be careful with large lines as they may break on interactive mode. You may switch to Standalone mode for such cases.")

initializing [SEGMENT] task in INTERACTIVE mode...

task [SEGMENT] is initialized interactively.

segmented = segmenter.segment(sample)
print(segmented)

'يشار إلى أن ال+لغ+ة ال+عربي+ة يتحدث+ها أكثر من 422 مليون نسم+ة و+يتوزع متحدثوها في ال+منطق+ة ال+معروف+ة باسم ال+وطن ال+عربي ب+ال+إضاف+ة إلى ال+عديد من ال+مناطق ال+أخرى ال+مجاور+ة مثل ال+أهواز و+تركيا و+تشاد و+ال+سنغال و+إريتريا و+غير+ها . و+هي ال+لغ+ة ال+رابع+ة من لغ+ات منظم+ة ال+أمم ال+متحد+ة ال+رسمي+ة ال+ست .'


Contribution
Credit for the desegmentation code goes to @Wissam Antoun
[https://github.com/WissamAntoun/Farasa_Desegmenter] and his repository
[https://github.com/WissamAntoun/Farasa_Desegmenter].

Want to cite?
You can find the list of publications to cite here: http://qatsdemo.cloudapp.net/farasa/.

Useful URLs
The official site: http://alt.qcri.org/farasa/
farasa from GitHub topics: https://github.com/topics/farasa
A repository by one of the toolkit authors containing WikiNews corpus:
https://github.com/kdarwish/Farasa
