    O_q = Σ_{i=0}^{n−1} O(w_i, w_{i+1})                    (2)

where O_q represents the occurrence of the candidate; n represents the number of words in the candidate; w_i represents the i-th word; and O(w_i, w_{i+1}) represents the occurrence of the bigram (w_i, w_{i+1}).
Based on the above method, each candidate has an occurrence value. Therefore, the multiple correction candidates of a given query can be ranked by their occurrences.
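As a sketch of how Equation (2) scores and ranks candidates, the following assumes a small hypothetical bigram occurrence table; in practice these counts would come from the source data:

```python
# Hypothetical bigram occurrence table O(w_i, w_{i+1}); example values only.
BIGRAM_OCCURRENCE = {
    ("coffee", "mug"): 9500,
    ("coffee", "mud"): 120,
    ("ceramic", "coffee"): 800,
}

def candidate_occurrence(words):
    """O_q: sum of occurrences of the candidate's adjacent word bigrams.
    Unseen bigrams contribute 0."""
    return sum(
        BIGRAM_OCCURRENCE.get((words[i], words[i + 1]), 0)
        for i in range(len(words) - 1)
    )

def rank_candidates(candidates):
    """Rank correction candidates by descending occurrence value."""
    return sorted(candidates, key=candidate_occurrence, reverse=True)

print(rank_candidates([["coffee", "mud"], ["coffee", "mug"]])[0])
# ['coffee', 'mug'] ranks first because its bigram is far more frequent
```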
Figure 3: Cascade Structure
However, the method also has three drawbacks:
1. The occurrence depends on the quality of the source data. If the source data is noisy, the occurrences of some incorrect bigrams can also be very high.
2. When there is more than one data source (e.g., the user query log and the product catalog), a given candidate may have one occurrence per data source. It is hard to combine these occurrences for the same candidate because different data sources have different noise levels and different signal coverage.
3. In Equation (2), the occurrence of a candidate is defined as the mean of its bigram occurrences, but the variance of those occurrences is not considered. Suppose there are two candidates: one has a large mean and a large variance, the other a small mean and a small variance. It is hard to conclude which one is better.
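Drawback 3 can be made concrete with a small numeric example; the bigram occurrence values below are hypothetical:

```python
# Hypothetical bigram occurrences for two correction candidates:
# candidate A has a large mean driven by one very popular bigram,
# candidate B has a smaller mean but much more consistent bigrams.
from statistics import mean, pvariance

candidate_a = [10000, 50, 30]   # one dominant bigram, two rare ones
candidate_b = [900, 850, 950]   # uniformly moderate bigrams

print(mean(candidate_a), pvariance(candidate_a))  # large mean, large variance
print(mean(candidate_b), pvariance(candidate_b))  # small mean, small variance
# A's mean (3360.0) beats B's (900.0), yet A's bigrams are far less
# consistent; the mean alone cannot decide which candidate is better.
```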
Therefore, a novel ranking model is proposed to solve the above
problems.
Table 1: Test-case counts and accuracy by correction type.

| Type      | Correct   | Brand  | NWE    | RWE    | Break  | Concatenate | Phonetic | Product | Unit   | Dim    | Overall   |
| Count     | 1,559,534 | 4,678  | 10,083 | 1,528  | 10,594 | 89,762      | 345,323  | 9,490   | 50,538 | 13,498 | 2,095,028 |
| THD SC v1 | 91.57%    | 64.71% | 42.94% | 15.05% | 22.71% | 25.25%      | 40.96%   | 62.17%  | 33.19% | 23.76% | 77.71%    |
| THD SC v2 | 95.36%    | 67.36% | 43.18% | 17.24% | 23.07% | 24.29%      | 40.03%   | 64.98%  | 32.25% | 23.37% | 80.32%    |
| Lucene SC | 81.24%    | 43.50% | 26.10% | 21.92% | 4.18%  | 13.97%      | 20.72%   | 48.23%  | 10.76% | 5.72%  | 65.26%    |
information is a huge benefit for increasing correction accuracy. However, due to noise in the bigram co-occurrence model, some incorrect bigrams also have high occurrences, e.g., the popular error "dish washer". This results in low performance on the correction of real word errors.

The highest accuracy values for the different correction types are marked in bold in the tables. As shown in Table 1, the proposed THD spell correction v2 has the highest overall accuracy among the three algorithms. It is also the best on most of the correction types ("Correct", "Brand", "NWE", "Break" and "Product"). This shows that the word embedding based dictionary, the multi-candidate ranking model and their integration improve spell correction. The correction accuracy on real word errors is still lower than that of Lucene spell correction, but it is much better than that of THD spell correction v1. On the correction types "Concatenate", "Phonetic", "Unit" and "Dim", THD spell correction v1 still achieves the highest accuracy, but the differences from THD spell correction v2 are very small.

7 CONCLUSION

In summary, the proposed word embedding based multi-source (product catalog, query log and Wikipedia) dictionary construction and the multi-candidate ranking model improve spell correction more than other context based algorithms such as a plain bigram co-occurrence model, and perform much better than approaches that consider only edit distance and phonetic matching. By adopting word embedding, multiple data sources can be integrated without a weighting mechanism to balance them. In addition, the merits of each data source are exploited effectively, e.g., the low-noise Wikipedia data is used to trim invalid terms. With the parallel structure based multi-candidate generation, all possible candidates are obtained synchronously, and the proposed word embedding based ranking model then selects the best one.

8 FUTURE WORK

Across all of the comparisons, real word errors are the weakest point of the proposed THD spell correction algorithms. Most real word errors are blocked at the spell checker component of spell correction, which judges every word in them valid, especially some popular errors, e.g., "coffee mud" instead of the correct "coffee mug". Even Google and Microsoft Bing cannot handle such queries. When some search engines encounter these queries, they present results matching only "coffee" or only "mud", and users click the results they actually want. The clicked content can be used as user feedback to help identify the error "coffee mud" and correct it to "coffee mug". This user behavior data can then be fed into a deep neural network for real word error checking and correction.
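This click-feedback idea can be sketched as a simple counting heuristic; the click log entries, the helper name, and the threshold below are all hypothetical illustrations, not the proposed neural model:

```python
# Hypothetical click log: (issued_query, clicked_result_title) pairs.
# If users who type "coffee mud" overwhelmingly click results about
# "coffee mug", the clicks suggest a real-word-error correction.
click_log = [
    ("coffee mud", "ceramic coffee mug 11 oz"),
    ("coffee mud", "travel coffee mug"),
    ("coffee mud", "coffee mug set of 4"),
    ("coffee mud", "potting soil mud kit"),
]

def suggest_correction(query, log, term="mug", min_ratio=0.5):
    """Replace the last query word with `term` if at least `min_ratio`
    of the clicked titles for this query contain `term` instead
    (hypothetical heuristic for illustration only)."""
    titles = [title for q, title in log if q == query]
    hits = sum(term in title for title in titles)
    if titles and hits / len(titles) >= min_ratio:
        return query.rsplit(" ", 1)[0] + " " + term
    return query

print(suggest_correction("coffee mud", click_log))  # coffee mug
```

A learned model would replace the fixed `term` and threshold with features derived from many queries, but the signal it consumes is the same.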
SIGIR 2019 eCom, July 2019, Paris, France Wang et al.
ACKNOWLEDGMENTS
We thank our colleagues Yuemeng Li, Mingzi Cao, Felipe Castrillon, and Nagaraj Palanichamy for assistance with the software development
and experiments, and Surya Kallumadi for comments that greatly
improved this paper.