
Training TextBlob with Custom Datasets

What if you want to spellcheck a language that TextBlob doesn't support
out of the box? Or maybe you just want to be a little more precise? There
might be a way to achieve this, and it all comes down to the way spell
checking works in TextBlob.

TextBlob uses statistics of word usage in English to make smart
suggestions on which words to correct. It keeps these statistics in a file
called en-spelling.txt, but it also allows you to make your very own word
usage statistics file.

Let's try to make one for our Darwin example. We'll use all the words in
"On the Origin of Species" to train. You can use any text, just make sure
it has enough words that are relevant to the text you wish to correct.

In our case, the rest of the book will provide great context and additional
information that TextBlob would need to be more accurate in the correction.
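To make the idea of word usage statistics concrete, here is a minimal sketch of a frequency-based corrector. It is a simplified illustration, not TextBlob's actual code: the helper names (count_words, edits1, best_guess) and the tiny training sentence are assumptions made for this example, and it only looks one edit away from the input word.

```python
import re
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def count_words(text):
    """Build a word -> count table from raw text."""
    return Counter(re.findall("[a-z]+", text.lower()))

def edits1(word):
    """All strings one edit (delete, swap, replace, insert) away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    swaps = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in ALPHABET]
    inserts = [a + c + b for a, b in splits for c in ALPHABET]
    return set(deletes + swaps + replaces + inserts)

def best_guess(word, counts):
    """Return the most frequent known word within one edit, else the word itself."""
    if word in counts:
        return word
    candidates = [w for w in edits1(word) if w in counts]
    return max(candidates, key=counts.get) if candidates else word

# Hypothetical training text; a real run would use the whole book.
counts = count_words(
    "the variation of species and the varieties of species under nature"
)
print(best_guess("specis", counts))  # species
```

The more relevant text the table is built from, the more likely the right word is among the known candidates, which is why training on the rest of the book should help.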

Let's rewrite the script:

from textblob.en import Spelling
import re

with open("originOfSpecies.txt", "r") as f1:  # Open our source file
    text = f1.read()                           # Read the file

textToLower = text.lower()                     # Lower all the capital letters
words = re.findall("[a-z]+", textToLower)      # Find all the words and place them into a list
oneString = " ".join(words)                    # Join them into one string

pathToFile = "train.txt"                       # The path we want to store our stats file at
spelling = Spelling(path=pathToFile)           # Connect the path to the Spelling object
spelling.train(oneString, pathToFile)          # Train

If we look into the train.txt file, we'll see:

a 3389
abdomen 3
aberrant 9
aberration 5
abhorrent 1
abilities 1
ability 4
abjectly 1
able 54
ably 5
abnormal 17
abnormally 2
abodes 2
...

This indicates that the word "a" shows up 3389 times, while "ably" shows
up only 5 times. To test out this trained model, we'll use suggest(text)
instead of correct(text). It returns a list of word-confidence tuples. The
first element in the list is the word it's most confident about, so we can
access it via suggest(text)[0][0].
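As a small illustration of the two data shapes mentioned so far, the sketch below parses "word count" lines like those in train.txt and picks the top entry from a list of word-confidence tuples. The helper names and the sample suggestion list are hypothetical and only mimic the shapes described above. Nothing here calls TextBlob.

```python
def parse_counts(lines):
    """Parse 'word count' lines, as seen in train.txt, into a dict."""
    counts = {}
    for line in lines:
        parts = line.split()
        if len(parts) == 2:  # skip blank or malformed lines
            counts[parts[0]] = int(parts[1])
    return counts

def top_suggestion(suggestions, fallback):
    """Return the word of the first (word, confidence) tuple, or fallback if empty."""
    return suggestions[0][0] if suggestions else fallback

sample = ["a 3389", "abdomen 3", "aberrant 9", "able 54", "ably 5"]
counts = parse_counts(sample)
print(counts["a"], counts["ably"])  # 3389 5

# Hypothetical suggest() result, ordered from most to least confident
suggestions = [("variation", 0.9), ("variations", 0.1)]
print(top_suggestion(suggestions, "variatin"))  # variation
```

The fallback argument is a design choice for this sketch: it keeps the original word when there is nothing to pick from, instead of raising an IndexError.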

Note that spell-checking with suggest() might be slower, so go word by
word, as dumping huge amounts of data at once can result in a crash:

from textblob.en import Spelling

pathToFile = "train.txt"
spelling = Spelling(path=pathToFile)  # Use the statistics file we trained

with open("test.txt", "r") as f:
    text = f.read()

words = text.split()
corrected = ""
for i in words:
    corrected = corrected + " " + spelling.suggest(i)[0][0]  # Spell checking word by word

print(corrected)

And now, this will result in:

As far as I am all to judge after long attending to the subject the conditions
of life appear to act in two ways—directly on the whole organisation or on
certain parts alone and indirectly by acting the reproduce system It respect to
the direct action we most be in mid the in every case as Professor Weismann as
lately insisted and as I have incidently shown in my work on "Variatin under
Domesticcation," there are two facts namely the nature of the organism and the
nature of the conditions The former seems to be much th are important for
nearly similar variations sometimes arise under as far as we in judge
dissimilar conditions and on the other hand dissimilar variations arise under
conditions which appear to be nearly uniform The effects on the offspring are
either definite or in definite They may be considered as definite when all or
nearly all the offspring off individuals exposed to certain conditions during
several generations are modified in the same manner.

This fixes around 2 out of 3 misspelled words, which is pretty good
considering that the spell checker ran without much context.
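One way to put a number on a claim like this is to compare three aligned word lists: the correct reference text, the misspelled input, and the corrected output. The sketch below is a hypothetical helper (fix_rate is not part of TextBlob), and the sample sentences are made up for illustration rather than taken from the run above. It assumes the corrector neither splits nor merges words, so the lists stay aligned by position.

```python
def fix_rate(reference, misspelled, corrected):
    """Fraction of misspelled words that were restored to the reference word.

    All three arguments are equal-length lists of words aligned by position.
    """
    if not (len(reference) == len(misspelled) == len(corrected)):
        raise ValueError("word lists must be aligned and of equal length")
    # Positions where the input differs from the reference are the misspellings
    wrong = [i for i, (r, m) in enumerate(zip(reference, misspelled)) if r != m]
    if not wrong:
        return 1.0
    fixed = sum(1 for i in wrong if corrected[i] == reference[i])
    return fixed / len(wrong)

# Hypothetical aligned samples
reference = "the conditions of life appear to act in two ways".split()
misspelled = "teh condtions of lfe appear to act in tvo ways".split()
corrected = "the conditions of life appear to act in to ways".split()
print(fix_rate(reference, misspelled, corrected))  # 0.75
```

Here three of the four misspelled words are restored, so the rate is 0.75. Words that were already correct are ignored, which means this measure does not penalize a corrector that changes a good word into a wrong one.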
