Product Classification in E-Commerce Using Distributional Semantics
book and book. Non-book data is more discriminative, with an average description + title length of around 10 to 15 words, whereas book descriptions have an average length greater than 200 words. To give more importance to the title, we generally weight it three times as much as the description (sketched below). The distribution of items over leaf categories (verticals) is highly skewed and heavy-tailed, and suffers from sparseness, as shown in Figure 3. We use random forest and k-nearest-neighbor as base classifiers because they are less affected by data skewness.

Figure 3: Distribution of items over sub-categories and leaf categories (verticals) for the non-book dataset.
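For illustration, a minimal sketch (our code, not the authors') of the 3x title weighting: since the downstream representations are built from token counts, repeating the title tokens three times up-weights them accordingly. The whitespace tokenizer is an assumption.

```python
def weighted_tokens(title, description, title_weight=3):
    """Return a token list in which title tokens occur `title_weight`
    times, so any count-based representation up-weights the title."""
    title_toks = title.lower().split()
    desc_toks = description.lower().split()
    return title_toks * title_weight + desc_toks

# Example: title tokens now contribute 3x to tf-idf / cluster histograms.
tokens = weighted_tokens("apple iphone 5s 16gb silver",
                         "smartphone with a 4 inch retina display")
```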
We removed data samples with multiple paths to simplify the problem to single-path prediction. Overall, we have 0.16 million training and 0.11 million testing samples for book data, and 0.5 million training and 0.25 million testing samples for non-book data. Since the taxonomy evolved over time, not all category nodes are semantically mutually exclusive; some ambiguous leaf categories are even meta-categories. We handle this by giving a unique id to every node in the category tree of the book data. Furthermore, there are also category paths with different categories at the top and similar categories at the leaf nodes, i.e. reduplications of the same path with synonymous labels.
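A small sketch of one way such unique ids could be assigned, keying each node on its full path prefix so that identical labels under different parents remain distinct; representing paths as label lists is our assumption.

```python
node_ids = {}  # (label_1, ..., label_d) prefix -> unique integer id

def path_to_node_ids(path):
    """Assign a unique id to every node on a category path, keyed by the
    full prefix so identical labels under different parents stay distinct
    (e.g. Books > Fiction vs. Comics > Fiction)."""
    ids = []
    for depth in range(1, len(path) + 1):
        prefix = tuple(path[:depth])
        if prefix not in node_ids:
            node_ids[prefix] = len(node_ids)
        ids.append(node_ids[prefix])
    return ids

print(path_to_node_ids(["Books", "Fiction", "Thriller"]))  # e.g. [0, 1, 2]
```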
The quality of the descriptions and titles also varies a lot. There are titles and descriptions that do not contain enough information to decide on a unique appropriate category. There were labels like Others and General at various depths of the taxonomy tree which carry no specific semantic meaning. Also, descriptions with the special label 'wrong procurement' were removed manually for consistency.

Level   #Categories   %Data Samples
1       21            34.9%
2       278           22.64%
3       1163          25.7%
4       970           12.9%
5       425           3.85%
6       18            0.10%

Table 2: Percentage of book data ending at each depth level of the book taxonomy hierarchy, which has a maximum depth of 6.
6 Results

The classification system is evaluated using the usual precision metric, defined as the fraction of products in the test data for which the classifier predicts the correct taxonomy path. Since there are multiple similar paths in the data set, predicting a single path is not appropriate. One solution is to predict more than one path, or better, a ranked list of 3 to 6 paths whose predicted label coverage matches the labels in the true path. The ranking is obtained from the confidence scores of the predictor, as sketched below. We also calculate the confidence score of the correct prediction path using the k (3 to 6) confidence scores of the individual predicted paths. For the purpose of measuring accuracy when more than one path is predicted, the classifier result is counted as correct when the correct class (i.e. the path assigned by the seller) is among the returned classes (paths). Thus we calculate Top 1, Top 3 and Top 6 prediction accuracy when 1, 3 and 6 paths are predicted, respectively.
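A minimal sketch of this ranked-list prediction, using a random forest's class probabilities as the confidence scores; the toy data and K = 6 are our stand-ins.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X_train = rng.rand(200, 50)            # toy document vectors
y_train = rng.randint(0, 10, 200)      # toy path (class) ids
X_test = rng.rand(5, 50)

clf = RandomForestClassifier(n_estimators=20, random_state=0)
clf.fit(X_train, y_train)

K = 6
proba = clf.predict_proba(X_test)                  # confidence per path
order = np.argsort(proba, axis=1)[:, ::-1][:, :K]  # indices of K best paths
top_k_paths = clf.classes_[order]                  # ranked list of K paths
top_k_conf = np.take_along_axis(proba, order, axis=1)
```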
6.1 Non-Book Data Result

We also compare our results with document vectors formed by averaging the word vectors of the words in a document, i.e. Average Word Vectors (AWV); with the Distributed Bag of Words version of Paragraph Vector of Le and Mikolov (PV-DBoW); and with the frequency histogram of the word distribution over word clusters, i.e. the Bag of Cluster Vector (BoCV). We keep the classifier (a random forest with 20 trees) the same for all document vector representations. We compare performance with respect to the number of clusters, word-vector dimension, document-vector dimension and vocabulary dimension (tf-idf) for the various models.
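For concreteness, a schematic sketch of the cluster-based representations being compared, under our assumptions rather than the authors' exact formulation: in particular, the idf grading inside gwBoWV is our reading of the graded-weighting idea, and `wv`, `cluster_of` and `idf` are assumed lookup tables obtained from word2vec training, k-means clustering and corpus statistics respectively.

```python
import numpy as np

def doc_vectors(tokens, wv, cluster_of, idf, n_clusters, dim):
    """tokens: words of one document; wv: word -> vector of size `dim`;
    cluster_of: word -> k-means cluster id; idf: word -> idf weight."""
    awv = np.zeros(dim)               # Average Word Vectors
    bocv = np.zeros(n_clusters)       # Bag of Cluster Vector (histogram)
    gw = np.zeros((n_clusters, dim))  # gwBoWV: one block per cluster
    words = [w for w in tokens if w in wv]
    for w in words:
        c = cluster_of[w]
        awv += wv[w]
        bocv[c] += 1
        gw[c] += idf[w] * wv[w]       # graded (idf) weighting: our assumption
    if words:
        awv /= len(words)
    # gwBoWV dimension = n_clusters * dim, matching the experiments below
    return awv, bocv, gw.reshape(-1)
```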
Figure 4 shows results for a random forest (20 trees) trained with the various document vector methods on 0.2 million training and 0.2 million testing samples with 3589 classes. It compares our approach, gwBoWV, with the PV-DBoW and PV-DM models for varying word-vector dimension and number of clusters. The word-vector dimension for gwBoWV and BoCV is 200. Note that AWV, PV-DM and PV-DBoW are independent of the cluster number and have dimension 600. Clearly gwBoWV performs much better than the other methods, especially PV-DBoW and PV-DM.

Figure 4: Comparison of prediction accuracy for path prediction using different methods of document vector generation.

Table 3 shows the effect of varying the number of clusters on accuracy for non-book data, with 0.2 million training and 0.2 million testing samples and 200-dimension word vectors.

#Cluster   CBoW     SGNS
10         81.35%   82.12%
20         82.29%   82.74%
50         83.66%   83.92%
65         83.85%   84.07%
80         83.91%   84.06%
100        84.40%   84.80%

Table 3: Classification accuracy for varying numbers of clusters with fixed word-vector size 200 on non-book data, for the CBoW and SGNS architectures. #Train samples = 0.2 million, #Test samples = 0.2 million.

We use the notation given below to define our evaluation metrics for Top K path prediction:

• τ* represents the true path for a product description.

• τ_i represents the ith path predicted by our algorithm, where i ∈ {1, 2, ..., K}.

• T* represents the set of nodes in the true path τ*.

• T_i represents the set of nodes in the ith predicted path τ_i, where i ∈ {1, 2, ..., K}.

• p(τ*) represents the probability our algorithm assigns to the true path τ*; p(τ*) = 0 if τ* ∉ {τ_1, τ_2, ..., τ_K}.

• p(τ_i) represents the probability of the ith predicted path, where i ∈ {1, 2, ..., K}.

We use four evaluation metrics to measure performance of the top K predictions, described below; a code sketch of all four follows.

1. Prob Precision @ K (PP): PP@K = p(τ*) / (p(τ_1) + p(τ_2) + ... + p(τ_K)).

2. Count Precision @ K (CP): CP@K = 1 if τ* ∈ {τ_1, τ_2, ..., τ_K}, else CP@K = 0.

3. Label Recall @ K (LR): LR@K = |T* ∩ (T_1 ∪ ... ∪ T_K)| / |T*|, where |S| denotes the number of elements of a set S.

4. Label Correlation @ K (LC): LC@K = |T_1 ∩ ... ∩ T_K| / |T_1 ∪ ... ∪ T_K|.

Table 4 shows the results on all evaluation metrics for varying word-vector dimensions and cluster numbers. Table 5 shows the results of top 6 path prediction for the tf-idf baseline with varying dimension.
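As a concrete illustration, a minimal Python sketch of the four metrics above, assuming each path is represented as a tuple of node ids and that the classifier supplies the K path probabilities; the function and variable names are ours.

```python
def topk_metrics(true_path, pred_paths, pred_probs):
    """true_path: tuple of node ids; pred_paths: K ranked path tuples;
    pred_probs: the K corresponding probabilities p(tau_i)."""
    p_true = dict(zip(pred_paths, pred_probs)).get(true_path, 0.0)
    pp = p_true / sum(pred_probs)                 # Prob Precision @ K
    cp = 1.0 if true_path in pred_paths else 0.0  # Count Precision @ K
    t_star = set(true_path)
    node_sets = [set(p) for p in pred_paths]
    union = set.union(*node_sets)
    lr = len(t_star & union) / len(t_star)        # Label Recall @ K
    lc = len(set.intersection(*node_sets)) / len(union)  # Label Correlation @ K
    return pp, cp, lr, lc
```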
6.2 Book Data Result

Book data is harder to classify. There are more cases of improper paths and labels in the taxonomy, and hence we had to do a lot of pre-processing. Around 51% of the books did not have labels at all, and 15% of the books were given extremely ambiguous labels like 'general' and 'others'. To maintain consistency we prune the above 66% of the data samples and work with the remaining 34%, i.e. 0.37 million samples.

To handle improper labels and ambiguity in the taxonomy we use multiple classifiers: one predicting the path (or leaf) label, another predicting node labels, and multiple classifiers, one at each depth level of the taxonomy tree, that predict node labels at that level. In depth-wise node classification we also introduce a 'none' label to denote missing labels at a particular level, i.e. for paths that end at earlier levels, as sketched below. However, we only take a random strata sample for this 'none' label.
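A sketch of how the depth-wise labels and the 'none' strata sample could be constructed; the maximum depth of 6 follows Table 2, while the keep fraction and names are our assumptions.

```python
import random

MAX_DEPTH = 6  # maximum depth of the book taxonomy (Table 2)

def depthwise_labels(path):
    """Label for the level-d classifier is the node at depth d,
    or 'none' when the path ends before that level."""
    return [path[d] if d < len(path) else "none" for d in range(MAX_DEPTH)]

def subsample_none(docs, labels, keep=0.1, seed=0):
    """Random strata sample: keep only a fraction of 'none' rows so the
    missing-label class does not dominate a depth-level training set."""
    rng = random.Random(seed)
    pairs = [(x, y) for x, y in zip(docs, labels)
             if y != "none" or rng.random() < keep]
    return [x for x, _ in pairs], [y for _, y in pairs]
```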
6.3 Ensemble Classification

We use the ensemble of multi-type predictors described in Section 4 for the final classification. For dimensionality reduction we use univariate feature selection based on the ANOVA F-value (analysis of variance) criterion. We obtain improved results on all four evaluation metrics with the new ensemble technique, as shown in Table 6 for book data; a schematic sketch follows the table. The list below explains how the first column of Table 6 should be interpreted:

• tf-idf (A-2-C): term frequency-inverse document frequency features with the #A top 1- and 2-gram words, using #C random forest trees.

• path-1 (A-C): path prediction model without ensemble, trained with gwBoWV with #A features (clusters x word-vector dimension), using #C trees.

• Dep (A+B-C): trained with gwBoWV with A features; B is the size of the output probability vectors (#total nodes) over all depths from the depth-wise classifiers, using #C trees.

• node (A+B): trained with gwBoWV with A features; B is the size of the output probability vector (#total nodes) from the level-1 node classifier, using #C trees.

• comb-2 (A): level-two combined ensemble classifier with A reduced features (21706 original features).

Method                PP      CP      LR      LC
tfidf(4000-2-20)      41.33   75.00   86.86   22.14
tfidf(8000-2-20)      41.39   74.95   86.83   22.16
tfidf(10000-2-20)     41.39   74.96   86.85   22.18
path-1(4100-15)       39.86   74.17   86.37   22.19
path-1(8080-20)       41.08   74.83   86.60   22.19
Dep(5100+2875-20)     41.54   75.34   87.08   22.47
node(4100+2810)       41.54   74.68   86.65   22.34
comb-2(8000)          45.64   77.26   88.86   24.57
comb-2(6000)          46.68   75.74   87.67   25.08
comb-2(10000)         42.82   75.83   87.62   23.08

Table 6: Results of the various approaches for Top 6 predictions on book data.
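A minimal scikit-learn-style sketch of the two ingredients named above, ANOVA F-value feature selection and a level-two classifier stacked on level-one output probabilities; the wiring is our guess at the comb-2 setup, not the authors' exact pipeline, and the toy data is a stand-in.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.RandomState(0)
X = rng.rand(300, 100)                 # toy gwBoWV document vectors
y_node = rng.randint(0, 5, 300)        # toy level-1 node labels
y_path = rng.randint(0, 8, 300)        # toy path labels

# Level-1 node classifier; its output probabilities become extra
# features (in practice these should come from held-out predictions).
node_clf = RandomForestClassifier(n_estimators=20, random_state=0)
node_clf.fit(X, y_node)
X_aug = np.hstack([X, node_clf.predict_proba(X)])

# ANOVA F-value feature selection down to A features, then the
# level-two combined (comb-2 style) path classifier.
selector = SelectKBest(f_classif, k=50).fit(X_aug, y_path)
comb2 = RandomForestClassifier(n_estimators=20, random_state=0)
comb2.fit(selector.transform(X_aug), y_path)
```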
7 Conclusions

We presented a novel compositional technique that uses embedded word vectors to form appropriate document vectors. Further, to capture the importance, weight and distinctiveness of words across documents, we used a graded weighting approach for composition based on recent work by Singh and Mukerjee (2015), where instead of their weighting scheme we weight using cluster frequency. Our document vectors are embedded in a vector space different from the word-embedding vector space. This document vector space is higher dimensional and tries to encode the intuition that a document has more topics or senses than a word.

We also developed a new technique which uses an ensemble of multiple classifiers that predict label paths, node labels and depth-wise labels to decrease classification error. We tested our method on data sets from a leading e-commerce platform and showed improved performance compared with competing approaches.

8 Future Work

Instead of using k-means we could use the Chinese Restaurant Process for clustering. Another direction is to extend the gwBoWV approach to learn supervised class document vectors that take the label into account in some fashion during embedding.

References

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. 2003. A neural probabilistic language model. The Journal of Machine Learning Research, 3:1137-1155.

Samy Bengio, Jason Weston, and David Grangier. 2010. Label embedding trees for large multi-class tasks. In Advances in Neural Information Processing Systems 23, pages 163-171. Curran Associates, Inc.

Wei Bi and James T. Kwok. 2011. Multi-label classification on tree- and DAG-structured hierarchies. In ICML.

Zellig Harris. 1954. Distributional structure. Word, 10:146-162.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532-1543. ACL.

Zornitsa Kozareva. 2015. Everyone likes shopping! Multi-class product categorization for e-commerce. In Human Language Technologies: The 2015 Annual Conference of the North American Chapter of the ACL, pages 1329-1333.

Shailesh Kumar, Joydeep Ghosh, and M. Melba Crawford. 2002. Hierarchical fusion of multiple classifiers for hyperspectral data analysis. Pattern Analysis and Applications, 5:210-220.

Quoc V. Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. arXiv preprint arXiv:1405.4053.

Omer Levy and Yoav Goldberg. 2014a. Dependency-based word embeddings. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, 2:302-308.

Omer Levy and Yoav Goldberg. 2014b. Linguistic regularities in sparse and explicit word representations. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning, pages 171-180. Association for Computational Linguistics.

Omer Levy and Yoav Goldberg. 2014c. Neural word embedding as implicit matrix factorization. In Advances in Neural Information Processing Systems 27, pages 2177-2185.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111-3119.

Pranjal Singh and Amitabha Mukerjee. 2015. Words are not equal: Graded weighting model for building composite document vectors. In Proceedings of the Twelfth International Conference on Natural Language Processing (ICON-2015). BSP Books Pvt. Ltd.

Dan Shen, Jean-David Ruvini, Manas Somaiya, and Neel Sundaresan. 2011. Item categorization in the e-commerce domain. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management, CIKM '11, pages 1921-1924.

Dan Shen, Jean-David Ruvini, and Badrul Sarwar. 2012. Large-scale item categorization for e-commerce. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management, CIKM '12, pages 595-604.

Richard Socher, Alex Perelygin, Jean Y. Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), volume 1631, page 1642. Citeseer.

Wang Ling and Chris Dyer. 2015. Two/too simple adaptations of word2vec for syntax problems. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

Gui-Rong Xue, Dikan Xing, Qiang Yang, and Yong Yu. 2008. Deep classification in large-scale text hierarchies. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '08, pages 619-626.
The dependency-based word vectors use the same training method as SGNS. Compared to linear-context-based vectors learned in the same way, the dependency-based vectors are found to perform better on functional similarity. However, for the task of topical similarity estimation, the linear-context-based word vectors encode better distributional semantic content.