A Spell-Checker in R - Anrprogrammer

9/17/2017 A spell-checker in R | anrprogrammer
anrprogrammer
A summary of my Statistics, Mathematics, CSC, AI, Machine Learning Projects
A spell-checker in R
I came across Dr. Peter Norvig’s blog about writing a basic spell-checker (http://norvig.com/spell-correct.html), and
just had to try to implement it in R. Please excuse the ugly-ish code (I have not optimized it or commented it
adequately at this point, but you can get the idea of what it does by reading Dr. Norvig’s blog). If anyone knows of
any pre-built spell-checker packages in R, please let me know in a comment!
I do not think R is a particularly good language for this sort of activity, but I got it to work out fine. The first few lines
here create a list of common words and their frequencies in Norvig's big.txt corpus, which stand in for their
frequencies in the English language. The following lines may take a few minutes to run on an average machine, but
I will try to upload the result soon so that you can just download the table instead of creating it yourself…
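# strip.text() is defined in the next listing; run that definition first.
# quote = "" keeps scan() from treating quotation marks in the text as field delimiters.
words <- scan("http://norvig.com/big.txt", what = character(), quote = "")
words <- strip.text(words)
counts <- table(words)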
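To make the shape of the frequency table concrete, here is a minimal sketch on a toy sentence. The toy text, the simplified tokenizer, and the cache file name are all assumptions made for illustration; the real table comes from big.txt and strip.text.

```r
# Toy corpus standing in for big.txt (an assumption made for illustration).
toy.text <- "The cat sat on the mat. The dog sat too."

# Simplified tokenizer: lowercase, keep letters only, split on runs of spaces.
toy.words <- tolower(toy.text)
toy.words <- gsub("[^a-z]", " ", toy.words)
toy.words <- unlist(strsplit(toy.words, " +"))
toy.words <- toy.words[toy.words != ""]

# table() returns one count per distinct word, named by the word itself.
toy.counts <- table(toy.words)
print(toy.counts)
print(as.integer(toy.counts["the"]))  # how often "the" occurs

# Cache the table so it does not have to be rebuilt every session.
# The file name is hypothetical.
cache.file <- file.path(tempdir(), "toy-counts.rds")
saveRDS(toy.counts, cache.file)
restored <- readRDS(cache.file)
```

Saving the real counts table the same way would avoid re-downloading and re-tokenizing the corpus each time.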
Next, here are the functions we need to do the spell-check operations…
# This is a text processing function, which I
# borrowed from a CMU Data mining course professor.
strip.text <- function(txt) {
  # remove apostrophes (so "don't" -> "dont", "Jane's" -> "Janes", etc.)
  txt <- gsub("'", "", txt)
  # convert to lowercase
  txt <- tolower(txt)
  # change other non-alphanumeric characters to spaces
  txt <- gsub("[^a-z0-9]", " ", txt)
  # change runs of digits to spaces (so numbers are dropped)
  txt <- gsub("[0-9]+", " ", txt)
  # split and make one vector
  txt <- unlist(strsplit(txt, " "))
  # remove empty words
  txt <- txt[txt != ""]
  return(txt)
}

# Words within 1 transposition.
Transpositions <- function(word = FALSE) {
  N <- nchar(word)
  # With fewer than two letters there is nothing to swap.
  if (N < 2) {
    return(word)
  }
  chars <- unlist(strsplit(word, NULL))
  out <- rep(word, N - 1)
  for (i in 1:(N - 1)) {
    swapped <- chars
    # Swap the adjacent letters at positions i and i + 1.
    swapped[c(i, i + 1)] <- chars[c(i + 1, i)]
    out[i] <- paste(swapped, collapse = "")
  }
  return(out)
}

# Single letter deletions.
# Thanks to luiscarlosmr for partial correction in comments
Deletes <- function(word = FALSE) {
  N <- nchar(word)
  # Pre-allocate a character vector, one slot per deleted position.
  out <- character(N)
  word <- unlist(strsplit(word, NULL))
  for (i in seq_len(N)) {
    out[i] <- paste(word[-i], collapse = "")
  }
  return(out)
}

# Single-letter insertions.
Insertions <- function(word = FALSE) {
  N <- nchar(word)
  out <- list()
  for (letter in letters) {
    out[[letter]] <- rep(word, N + 1)
    for (i in 1:(N + 1)) {
      # Letters before position i, the new letter, then the rest of the word.
      out[[letter]][i] <- paste(substr(word, 1, i - 1), letter,
                                substr(word, i, N), sep = "")
    }
  }
  # unlist() names each result by letter and position, e.g. "l4".
  out <- unlist(out)
  return(out)
}

# Single-letter replacements.
Replaces <- function(word = FALSE) {
  N <- nchar(word)
  out <- list()
  for (letter in letters) {
    out[[letter]] <- rep(word, N)
    for (i in 1:N) {
      # Letters before position i, the new letter, then the letters after i.
      out[[letter]][i] <- paste(substr(word, 1, i - 1), letter,
                                substr(word, i + 1, N), sep = "")
    }
  }
  out <- unlist(out)
  return(out)
}

# All Neighbors with distance "1"
Neighbors <- function(word) {
  neighbors <- c(word, Replaces(word), Deletes(word),
                 Insertions(word), Transpositions(word))
  return(neighbors)
}

# Probability as determined by our corpus.
Probability <- function(word, dtm) {
  # Number of words, total (the sum of all counts, not the number of distinct words)
  N <- sum(dtm)
  word.number <- which(names(dtm) == word)
  count <- dtm[word.number]
  pval <- count/N
  return(pval)
}

# Correct a single word.
Correct <- function(word, dtm) {
  # If it is a word, just return it.
  if (word %in% names(dtm)) {
    out <- word
  }
  # Otherwise, check for neighbors.
  else {
    neighbors <- Neighbors(word)
    # Which of the neighbors are known words?
    known <- which(neighbors %in% names(dtm))
    N.known <- length(known)
    # If there are no known neighbors, including the word,
    # look farther away (neighbors of neighbors).
    if (N.known == 0) {
      print(paste("Having a hard time matching '", word, "'...", sep = ""))
      neighbors <- unlist(lapply(neighbors, Neighbors))
    }
    # Then filter out non-words.
    neighbors <- neighbors[which(neighbors %in% names(dtm))]
    N <- length(neighbors)
    # If we found some neighbors, find the one with the highest
    # probability.
    if (N >= 1) {
      P <- numeric(N)
      for (i in 1:N) {
        P[i] <- Probability(neighbors[i], dtm)
      }
      out <- neighbors[which.max(P)]
    }
    # If no neighbors still, return the word.
    else {
      out <- word
    }
  }
  return(out)
}

# Correct an entire document.
CorrectDocument <- function(document, dtm) {
  by.word <- unlist(strsplit(document, " "))
  N <- length(by.word)
  for (i in seq_along(by.word)) {
    by.word[i] <- Correct(by.word[i], dtm = dtm)
  }
  corrected <- paste(by.word, collapse = " ")
  return(corrected)
}
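To get a feel for how many candidates one word produces, here is a compact, self-contained restatement of the four edit types on the word "cat". Edits1 is a hypothetical name used only for this sketch; it is not part of the listing above, and it assumes a word of at least two letters.

```r
# Hypothetical helper: all strings one edit away from `word`, grouped by edit type.
Edits1 <- function(word) {
  n <- nchar(word)
  stopifnot(n >= 2)                        # keeps the index arithmetic simple
  chars <- strsplit(word, "")[[1]]
  heads <- substring(word, 1, 0:n)         # "", "c", "ca", "cat"
  tails <- substring(word, 1:(n + 1), n)   # "cat", "at", "t", ""

  # Drop the letter at each position.
  deletes <- paste0(heads[1:n], tails[-1])
  # Swap each pair of adjacent letters.
  transposes <- paste0(heads[1:(n - 1)], chars[2:n], chars[1:(n - 1)],
                       tails[3:(n + 1)])
  # Put each of the 26 letters in place of each position.
  replaces <- paste0(as.vector(outer(heads[1:n], letters, paste0)), tails[-1])
  # Put each of the 26 letters into each of the n + 1 gaps.
  inserts <- paste0(as.vector(outer(heads, letters, paste0)), tails)

  list(deletes = deletes, transposes = transposes,
       replaces = replaces, inserts = inserts)
}

edits <- Edits1("cat")
print(lengths(edits))     # candidates per edit type (duplicates included)
print(edits$deletes)
print(edits$transposes)
```

For a word of n letters this gives n deletions, n - 1 transpositions, 26n replacements and 26(n + 1) insertions before duplicates are removed, which is why looking two edits away gets slow quickly.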
The above functions generate “neighbors” of words, determine probabilities of the neighbors, and return the best
ones. Function “CorrectDocument” will correct an entire document (with special characters and punctuation
removed), and “Correct” will simply correct a word. Here are some sample runs.
> Correct("speling", dtm = counts)
        l4
"spelling"
> Correct("korrecter", dtm = counts)
[1] "Having a hard time matching 'korrecter'..."
      c1.d9
"corrected"
> CorrectDocument("the quick bruwn fowx jumpt ovre tha lasy dog", dtm = counts)
[1] "the quick brown fox jump over the last dog"
As you can see, this function is obviously not perfect: "jumpt" became "jump" and "lasy" became "last". It does
some basic corrections automatically, but there are some improvements to be made. More to come!
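One possible improvement, sketched here under assumptions, is to pick the best candidate with a single vectorized lookup instead of looping over the candidates. BestKnown is a hypothetical helper and the counts below are made up for illustration; they are not taken from big.txt.

```r
# Toy frequency table (made-up counts, for illustration only).
toy.counts <- c(the = 500, last = 40, lazy = 5, brown = 12)

# Hypothetical helper: return the most frequent known candidate,
# or `fallback` when none of the candidates is a known word.
BestKnown <- function(candidates, counts, fallback) {
  known <- unique(candidates[candidates %in% names(counts)])
  if (length(known) == 0) {
    return(fallback)
  }
  # Indexing by name looks up every count at once, so no loop is needed.
  known[which.max(counts[known])]
}

print(BestKnown(c("lasy", "last", "lazy", "lays"), toy.counts, fallback = "lasy"))
```

With these toy counts the helper prefers "last" over "lazy" simply because it is the more frequent word, which is the same kind of choice the sample run makes for "lasy".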
This entry was posted in cipher, R, Statistics and tagged data mining, machine learning, R, statistics on February 8,
2012 [https://anrprogrammer.wordpress.com/2012/02/08/a-spell-checker-in-r/] .
8 thoughts on “A spell-checker in R”
Richie Cotton
February 8, 2012 at 12:04 pm
Very nice, but did you know about the aspell function in the utils package?
RGuy Post author
Yes, I had come across that in my search. I didn’t think much about it since you cannot directly supply it with a
string and have it corrected, but it could easily perform the same function (and better, I’m sure) if I just output the
string to a file and then read it in using aspell. I will likely do the latter if I need any serious spell-correction done
with R, but learning the idea behind very basic spell-checking was interesting!
luiscarlosmr
Kindly, let me add something about this code:
# Deletes
# TRY: word[-i]
Deletes <- function(word = FALSE) {
N <- nchar(word)
out<-mat.or.vec(1,length(N))
word <- unlist(strsplit(word, NULL))
for(i in 1:length(N)) {
out[i] <- paste(word[-i], collapse = "")
}
return(out)
}
As you can see I just added a line. Have a great time and this code is really interesting to me. Thank you.
Deepak
October 2, 2012 at 12:39 pm
Actually, a small typo in the final module, you need to change
by.word <- unlist(strsplit(essay, " "))
to
by.word <- unlist(strsplit(document, " "))
The full version of the module is
# Correct an entire document.
CorrectDocument <- function(document, dtm) {
  by.word <- unlist(strsplit(document, " "))
  N <- length(by.word)
  for (i in 1:N) {
    by.word[i] <- Correct(by.word[i], dtm = dtm)
  }
  corrected <- paste(by.word, collapse = " ")
  return(corrected)
}
Dan
October 15, 2012 at 2:49 am
Is there a license for your code?
RGuy Post author
October 15, 2012 at 2:55 am
Nope, it is completely free to use, and has no license associated with it.
Dan
October 16, 2012 at 9:57 pm
Awesome! I’d love to use it in a project! Unfortunately, “no license associated with it” doesn’t exactly mean “free to
use” because it’s copyrighted by default. Do you mean that you release it into the public domain? Thanks again, this
was an awesome lesson, and I’d love to be able to adapt it for a need of mine.
Ankit
July 9, 2013 at 2:15 pm
Hello, I am working on a project where I need to build a spell checker and that’s how I stumbled onto your blog
post. But I’m facing an issue when I am trying to execute the code. R gives me the following error:
Error in out[i] <- paste(word[-i], collapse = "") :
  object 'out' not found
Do you have any idea what could be the reason for such an error?
