Training Textblob With Custom Datasets: En-Spelling

This document discusses how to train TextBlob's spellchecking capabilities with custom datasets. It explains that TextBlob uses word usage statistics to suggest spellcheck corrections. These statistics can be trained on specific text corpora to improve accuracy for that domain. The document walks through rewriting a script to train a model on Charles Darwin's "On the Origin of Species", then tests the trained model on a sample text, correcting around 2 of 3 misspelled words.

Uploaded by GAYATRI RAM PADILE

Training TextBlob with Custom Datasets

What if you want to spellcheck a language that TextBlob doesn't support
out of the box? Or maybe you want to get just a little bit more
precise? Well, there might be a way to achieve this. It all comes down to
the way spell checking works in TextBlob.

TextBlob uses statistics of word usage in English to make smart
suggestions on which words to correct. It keeps these statistics in a file
called en-spelling.txt, but it also allows you to make your very own word
usage statistics file.
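
To make the idea of a word usage statistics file concrete, here is a minimal sketch that counts words in a string and renders them as "word count" lines. It uses only the standard library; the helper names count_words and format_stats and the sample sentence are made up for illustration, and this is not TextBlob's own code.

```python
import re
from collections import Counter


def count_words(text):
    """Count lowercase alphabetic words in the given text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def format_stats(counts):
    """Render the counts as 'word count' lines, sorted alphabetically."""
    return "\n".join(f"{word} {n}" for word, n in sorted(counts.items()))


# Hypothetical sample text, only for illustration
sample = "The origin of species. The Origin!"
stats = format_stats(count_words(sample))
print(stats)
```

Words that occur more often get a higher count, which is the signal a frequency-based spellchecker can use to prefer one candidate correction over another.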

Let's try to make one for our Darwin example. We'll use all the words in
"On the Origin of Species" to train. You can use any text, just make
sure it has enough words that are relevant to the text you wish to correct.

In our case, the rest of the book will provide great context and additional
information that TextBlob would need to be more accurate in the correction.

Let's rewrite the script:

from textblob.en import Spelling
import re

with open("originOfSpecies.txt", "r") as f1:  # Open our source file
    text = f1.read()                          # Read the file

textToLower = text.lower()                    # Lower all the capital letters
words = re.findall("[a-z]+", textToLower)     # Find all the words and place them into a list
oneString = " ".join(words)                   # Join them into one string

pathToFile = "train.txt"                      # The path we want to store our stats file at
spelling = Spelling(path=pathToFile)          # Connect the path to the Spelling object
spelling.train(oneString, pathToFile)         # Train: count the words and write the stats file

If we look into the train.txt file, we'll see:

a 3389
abdomen 3
aberrant 9
aberration 5
abhorrent 1
abilities 1
ability 4
abjectly 1
able 54
ably 5
abnormal 17
abnormally 2
abodes 2
...

This indicates that the word "a" shows up 3389 times, while "ably" shows
up only 5 times. To test out this trained model, we'll use suggest(text)
instead of correct(text). It returns a list of word-confidence tuples. The
first element in the list will be the word it's most confident about, so
we can access it via suggest(text)[0][0].
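
To illustrate what a list of word-confidence tuples looks like and why [0][0] picks the best guess, here is a small sketch that ranks candidate words by their share of the total count. The rank_candidates helper and the candidate list are hypothetical, and the confidence formula is an assumption for illustration rather than a description of TextBlob's internals; the counts are taken from the train.txt excerpt above.

```python
def rank_candidates(candidates, counts):
    """Return (word, confidence) tuples, most confident first.

    Confidence here is the candidate's count divided by the total
    count of all candidates (an assumption made for this sketch).
    """
    total = sum(counts.get(word, 0) for word in candidates) or 1
    ranked = [(word, counts.get(word, 0) / total) for word in candidates]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Counts copied from the train.txt excerpt
counts = {"able": 54, "ably": 5, "abnormal": 17}

# Hypothetical candidates for some misspelled input
ranked = rank_candidates(["ably", "able"], counts)
best = ranked[0][0]  # first tuple, first element: the top word
print(ranked)
print(best)
```

The more frequent word ends up first, so indexing with [0][0] selects it.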

Note that this might be slower, so go word by word while spell-checking, as
dumping huge amounts of data can result in a crash:

from textblob.en import Spelling

pathToFile = "train.txt"
spelling = Spelling(path=pathToFile)  # Use our custom stats file

with open("test.txt", "r") as f:
    text = f.read()

words = text.split()
corrected = ""
for word in words:
    # Spell checking word by word; take the most confident suggestion
    corrected = corrected + " " + spelling.suggest(word)[0][0]

print(corrected)

And now, this will result in:

As far as I am all to judge after long attending to the subject the conditions
of life appear to act in two ways—directly on the whole organisation or on
certain parts alone and indirectly by acting the reproduce system It respect to
the direct action we most be in mid the in every case as Professor Weismann as
lately insisted and as I have incidently shown in my work on "Variatin under
Domesticcation," there are two facts namely the nature of the organism and the
nature of the conditions The former seems to be much th are important for
nearly similar variations sometimes arise under as far as we in judge
dissimilar conditions and on the other hand dissimilar variations arise under
conditions which appear to be nearly uniform The effects on the offspring are
either definite or in definite They may be considered as definite when all or
nearly all the offspring off individuals exposed to certain conditions during
several generations are modified in the same manner.

This fixes around 2 out of 3 misspelled words, which is pretty good,
considering it ran without much context.
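
A figure like "2 out of 3" can be checked with a short script when a clean reference text is available. The sketch below assumes three word lists of equal length (reference, misspelled input, and corrected output); the correction_rate helper and the sample words are hypothetical and are not taken from the run above.

```python
def correction_rate(reference, misspelled, corrected):
    """Fraction of misspelled words that the correction restored.

    All three arguments are word lists aligned position by position.
    """
    errors = [(ref, cor)
              for ref, mis, cor in zip(reference, misspelled, corrected)
              if mis != ref]
    if not errors:
        return 0.0
    fixed = sum(1 for ref, cor in errors if cor == ref)
    return fixed / len(errors)


# Hypothetical sample data, only for illustration
reference = "the nature of the organism".split()
misspelled = "teh natur of the organizm".split()
corrected = "the nature of the organic".split()

rate = correction_rate(reference, misspelled, corrected)
print(f"{rate:.2f}")
```

Comparing position by position only works if the spellchecker keeps the number of words unchanged, which holds for the word-by-word loop used here.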
