Using Fuzzy Support Vector Machine in Text Categorization Based on Reduced Matrices
Vu Thanh Nguyen
University of Information Technology, Ho Chi Minh City, Viet Nam
email: nguyenvt@uit.edu.vn
Abstract - In this article, the authors compare results from using a Fuzzy Support Vector Machine (FSVM) with a Fuzzy Support Vector Machine combined with Latent Semantic Indexing and Random Indexing on reduced matrices (FSVM_LSI_RI). Our results show that FSVM_LSI_RI provides better Precision and Recall than FSVM. In this experiment, a corpus comprising 3299 documents from the Reuters-21578 corpus was used.
Keywords – SVM, FSVM, LSI, RI
I. INTRODUCTION
Text categorization is the task of assigning a text to one or more of a set of predefined categories. As with most other natural language processing applications, representational factors are decisive for the performance of the categorization. By far the most common representational scheme in text categorization is the Bag-of-Words (BoW) approach, in which a text is represented as a vector t_i = (w_1, ..., w_n) of word weights, where w_1, ..., w_n are the weights of the words in the text. The BoW representation ignores all semantic or conceptual information; it simply looks at surface word forms. Modern BoW representations are based on three models: the Boolean model, the Vector Space model, and the Probability model.

There have been attempts at deriving more sophisticated representations for text categorization, including the use of n-grams or phrases (Lewis, 1992; Dumais et al., 1998), or augmenting the standard BoW approach with synonym clusters or latent dimensions (Baker and McCallum, 1998; Cai and Hofmann, 2003). However, none of the more elaborate representations manages to significantly outperform the standard BoW approach (Sebastiani, 2002). In addition, they are typically more expensive to compute.

To address this, the authors introduce a new method for producing concept-based representations for natural language data. This method is a combination of Random Indexing (RI) and Latent Semantic Indexing (LSI); computation time for Singular Value Decomposition on an RI-reduced matrix is almost halved compared to LSI alone. The authors use this method to create concept-based representations for a standard text categorization problem, and use these representations as input to an FSVM classifier. The categorization results are compared to those reached using standard BoW representations from the Vector Space Model (VSM), and the authors also demonstrate how the performance of the FSVM can be improved by combining representations.
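To make the BoW scheme concrete, here is a minimal sketch of building such a word-weight vector, using raw term frequencies as the weights; the toy documents and vocabulary are illustrative only, not the paper's data.

from collections import Counter

# Toy document collection (illustrative only, not the Reuters-21578 corpus)
documents = [
    "the cat sat on the mat",
    "the dog chased the cat",
]

# Vocabulary over the whole collection; one weight per vocabulary word
vocabulary = sorted({word for doc in documents for word in doc.split()})

def bow_vector(text, vocabulary):
    """Represent a text as a vector of word weights (raw term frequencies here)."""
    counts = Counter(text.split())
    return [counts[word] for word in vocabulary]

for doc in documents:
    print(bow_vector(doc, vocabulary))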
II. VECTOR SPACE MODEL (VSM) ([14])
1. Data Structuring
In the Vector Space Model, documents are represented as vectors in t-dimensional space, where t is the number of indexed terms in the collection. The weight of term i in document j is evaluated as

w_ij = l_ij · g_i · n_j

where:
- l_ij denotes the local weight of term i in document j,
- g_i is the global weight of term i in the document collection,
- n_j is the normalization factor for document j.

The local weight is computed as

l_ij = log(1 + f_ij)

where:
- f_ij is the frequency of token i in document j,
- p_ij is the probability of token i occurring in document j (used in computing the global weight g_i).
2. Term-document matrix
The VSM is implemented by forming a term-document matrix. The term-document matrix is an m × n matrix, where m is the number of terms and n is the number of documents:

A = \begin{pmatrix} d_{11} & d_{12} & \cdots & d_{1n} \\ d_{21} & d_{22} & \cdots & d_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ d_{m1} & d_{m2} & \cdots & d_{mn} \end{pmatrix}

where:
- each row of the term-document matrix corresponds to a term,
- each column of the term-document matrix corresponds to a document,
- d_ij is the weight associated with token i in document j.
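A minimal sketch of assembling such a weighted term-document matrix with the local weight l_ij = log(1 + f_ij) described above; for simplicity the global weight and normalization factor are taken as 1, and the toy corpus is illustrative.

import math
from collections import Counter

# Toy corpus (illustrative only)
documents = [
    "stocks rise as markets rally",
    "markets fall on trade fears",
    "central bank holds rates",
]

terms = sorted({w for doc in documents for w in doc.split()})
doc_counts = [Counter(doc.split()) for doc in documents]

# m x n term-document matrix: rows are terms, columns are documents,
# entries d_ij = log(1 + f_ij), with g_i = n_j = 1 for simplicity.
A = [[math.log(1 + counts[term]) for counts in doc_counts] for term in terms]

print(len(A), "terms x", len(A[0]), "documents")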
III. LATENT SEMANTIC INDEXING (LSI) ([1]-[4])
 
The vector space model presented in Section II suffers from the curse of dimensionality. In other words, as the size of the problem increases and it becomes more complex, the processing time required to construct the vector space and to query the document space increases as well. In addition, the vector space model exclusively measures term co-occurrence: the inner product between two documents is nonzero if and only if there is at least one shared term between them. Latent Semantic Indexing (LSI) is used to overcome the problems of synonymy and polysemy.
 
1. Singular Value Decomposition (SVD) ([5]-[9])
 
LSI is based on a mathematical technique called Singular Value Decomposition (SVD). The SVD decomposes a term-by-document matrix A into three matrices: a term-by-dimension matrix U, a singular-value matrix Σ, and a document-by-dimension matrix V^T. The purpose of the SVD analysis is to detect semantic relationships in the document collection. The decomposition is performed as follows:

A = U Σ V^T

where:
- U is an orthogonal m×m matrix whose columns are the left singular vectors of A,
- Σ is a diagonal matrix whose diagonal holds the singular values of A in descending order,
- V is an orthogonal n×n matrix whose columns are the right singular vectors of A.

To generate a rank-k approximation A_k of A, where k << r, each matrix factor is truncated to its first k columns. That is, A_k is computed as

A_k = U_k Σ_k V_k^T

where:
- U_k is an m×k matrix whose columns are the first k left singular vectors of A,
- Σ_k is a k×k diagonal matrix whose diagonal is formed by the k leading singular values of A,
- V_k is an n×k matrix whose columns are the first k right singular vectors of A.

In LSI, the rank-k approximation A_k of A is what matters: it captures the latent associations between the terms used in the documents and removes the noise that variation in term usage would otherwise introduce into index-based retrieval [6], [7], [8]. Because a k-dimensional LSI space (k << r) is used, differences that do not matter for the "meaning" are removed. Keywords that often appear together in documents are mapped to nearly the same region of the k-dimensional LSI space, even when they never appear together in the same document.
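A minimal sketch of this rank-k truncation using NumPy's SVD; the matrix and the value of k are illustrative, not the paper's data.

import numpy as np

# Illustrative term-document matrix (m terms x n documents)
A = np.array([
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 1.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])

k = 2  # target LSI dimensionality (k << r)

# Full SVD: A = U @ diag(s) @ Vt, singular values in descending order
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Truncate each factor to its first k columns / k leading singular values
U_k, S_k, Vt_k = U[:, :k], np.diag(s[:k]), Vt[:k, :]

# Rank-k approximation A_k = U_k Sigma_k V_k^T
A_k = U_k @ S_k @ Vt_k
print(np.round(A_k, 3))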
2. Drawbacks of the LSI model
- SVD is often treated as a "magical" process.
- SVD is computationally expensive.
- It requires an initial "huge matrix" step.
- It is linguistically agnostic.
IV. RANDOM INDEXING (RI) ([6], [10])
Random Indexing is an incremental vector space model that is computationally less demanding (Karlgren and Sahlgren, 2001). The Random Indexing model reduces dimensionality by, instead of giving each word a whole dimension, giving each word a random vector of much lower dimensionality than the total number of words in the text.

Random Indexing differs from the basic vector space model in that it does not give each word an orthogonal unit vector. Instead, each word is given a vector of length 1 in a random direction. The dimensionality of this random vector is chosen to be smaller than the number of words in the documents, with the end result that not all words will be orthogonal to each other, since the rank of the matrix will not be high enough. This can be formulated as

A T = Ã

where A is the original matrix representation of the d × w word-document matrix as in the basic vector space model, T is a w × k matrix of random vectors representing the mapping between each word w_i and its k-dimensional random vector, and Ã is A projected down into d × k dimensions. A query is then matched by first multiplying the query vector with T, and then finding the column in Ã that gives the best match.

T is constructed by, for each column in T (each corresponding to a row in A), selecting n different rows; n/2 of these are assigned the value 1/√n, and the rest are assigned -1/√n. This ensures unit length, and that the vectors are distributed evenly on the unit sphere of dimension k (Sahlgren, 2005). An even distribution ensures that every pair of vectors has a high probability of being orthogonal.

Information is lost during this process (by the pigeonhole principle, and because the rank of the reduced matrix is lower). However, if Random Indexing is used on a matrix with very few nonzero elements, the induced error decreases, since the likelihood of a conflict within each document, and between documents, decreases. Using Random Indexing on a matrix will nonetheless introduce a certain error into the results. These errors are introduced by words that match other words, i.e. where the scalar product between the corresponding vectors is nonzero. In the matrix this shows up as false positive matches, created for every pair of words that have a nonzero scalar product in the vector space of the matrix. False negatives can also be created by words whose corresponding vectors cancel each other out.
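A minimal sketch of the projection described above: each word receives a sparse k-dimensional index vector with n nonzero entries (half +1/√n, half -1/√n), the index vectors are collected into T, and the d × w matrix A is projected to d × k. Sizes and data are illustrative.

import numpy as np

rng = np.random.default_rng(0)

d, w = 5, 100   # documents x words (illustrative sizes)
k, n = 20, 4    # reduced dimensionality, nonzero entries per index vector

A = rng.random((d, w))          # stand-in word-document matrix

# Build T (w x k): each row is a word's random index vector with n nonzero
# entries, n/2 of value +1/sqrt(n) and n/2 of value -1/sqrt(n) (unit length).
T = np.zeros((w, k))
for i in range(w):
    positions = rng.choice(k, size=n, replace=False)
    signs = np.array([1.0] * (n // 2) + [-1.0] * (n // 2))
    rng.shuffle(signs)
    T[i, positions] = signs / np.sqrt(n)

A_tilde = A @ T                 # project A down to d x k dimensions
print(A_tilde.shape)            # (5, 20)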
Advantages of Random Indexing:
- Based on Pentti Kanerva's theories on Sparse Distributed Memory.
- Uses distributed representations to accumulate context vectors.
- An incremental method that avoids the "huge matrix" step.
 
 
 
V. COMBINING RI AND LSI
We have seen the advantages and disadvantages of both RI and LSI: RI is efficient in terms of computational time but does not preserve as much information as LSI; LSI, on the other hand, is computationally expensive but produces highly accurate results, in addition to capturing the underlying semantics of the documents. As mentioned earlier, a hybrid algorithm was proposed that combines the two approaches to benefit from the advantages of both. The algorithm works as follows:
- First, the data is pre-processed with RI down to a lower dimension k1.
- Then LSI is applied on the reduced, lower-dimensional data to further reduce it to the desired dimension k2.

This algorithm should improve the running time relative to LSI, and the accuracy relative to RI. As mentioned earlier, the time complexity of SVD is O(cmn) for large, sparse datasets. It is reasonable, then, to assume that a lower dimensionality will result in faster computation, since the cost depends on the dimensionality m.
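A minimal sketch of this two-stage reduction, reusing the random projection and truncated SVD from the earlier sketches; k1, k2, and the data are illustrative, not the paper's settings.

import numpy as np

rng = np.random.default_rng(1)

def random_index(A, k, n=4):
    """Step 1: project the word axis of A down to k dimensions with sparse random vectors."""
    w = A.shape[1]
    T = np.zeros((w, k))
    for i in range(w):
        pos = rng.choice(k, size=n, replace=False)
        signs = np.array([1.0] * (n // 2) + [-1.0] * (n // 2))
        rng.shuffle(signs)
        T[i, pos] = signs / np.sqrt(n)
    return A @ T

def lsi_reduce(A, k):
    """Step 2: truncated SVD; each row becomes a k-dimensional document vector."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] * s[:k]

A = rng.random((500, 3000))       # illustrative document-by-word matrix
k1, k2 = 300, 100                 # illustrative intermediate and final dimensions

A_ri = random_index(A, k1)        # RI: 500 x 3000 -> 500 x 300
docs_lsi = lsi_reduce(A_ri, k2)   # LSI on the smaller matrix: 500 x 300 -> 500 x 100
print(docs_lsi.shape)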
VI. TEXT CATEGORIZATION

1. Support Vector Machines

The support vector machine is a very specific class of algorithms, characterized by the use of kernels, the absence of local minima, the sparseness of the solution, and the capacity control obtained by acting on the margin, or on other "dimension independent" quantities such as the number of support vectors.
 
Let S = {(x_1, y_1), ..., (x_l, y_l)} be a training sample set, where x_i ∈ R^n, with corresponding binary class labels y_i ∈ {1, -1}. Let φ be a non-linear mapping from the original data space to a high-dimensional feature space; we therefore replace the sample points x_i and x_j with their mapped images φ(x_i) and φ(x_j), respectively. Let the weight and the bias of the separating hyperplane be w and b, respectively. We define a hyperplane that can act as the decision surface in feature space as follows:

w · φ(x) + b = 0    (1)

To separate the data linearly in the feature space, the decision function must satisfy the following constraint conditions. The optimization problem is

Minimize:   \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{l} \xi_i    (2)
Subject to: y_i (w · φ(x_i) + b) ≥ 1 - ξ_i,  ξ_i ≥ 0,  i = 1, ..., l

where ξ_i is a slack variable introduced to relax the hard-margin constraints, and the regularization constant C > 0 implements the trade-off between the maximal margin of separation and the classification error.

To solve the optimization problem, we introduce the following Lagrange function:

L(w, b, ξ, α, β) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{l} \xi_i - \sum_{i=1}^{l} \alpha_i [ y_i (w · φ(x_i) + b) - 1 + \xi_i ] - \sum_{i=1}^{l} \beta_i \xi_i    (3)

where α_i ≥ 0 and β_i ≥ 0 are Lagrange multipliers. Differentiating L with respect to w, b, and ξ_i, and setting the results to zero, the optimization problem (2) translates into the following simple dual problem:

Maximize:   W(α) = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i \alpha_j y_i y_j K(x_i, x_j)    (4)
Subject to: \sum_{i=1}^{l} \alpha_i y_i = 0,  0 ≤ α_i ≤ C,  i = 1, ..., l

where K(x_i, x_j) = (φ(x_i), φ(x_j)) is a kernel function satisfying the Mercer theorem.

Let α* be the optimal solution of (4), with corresponding weight w* and bias b*. According to the Karush-Kuhn-Tucker (KKT) conditions, the solution of the optimal problem (4) must satisfy

α_i* [ y_i (w* · φ(x_i) + b*) - 1 + ξ_i ] = 0,  i = 1, ..., l    (5)

where the α_i* are nonzero only for a subset of the vectors x_i, called support vectors. Finally, the optimal decision function is

f(x) = sgn( \sum_{i=1}^{l} \alpha_i^{*} y_i K(x_i, x) + b^{*} )    (6)

where the sum effectively runs over the support vectors.
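A minimal sketch of training a soft-margin kernel SVM of this kind with scikit-learn; the data, kernel, and C value are illustrative and not the paper's experimental setup.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)

# Illustrative two-class data (e.g. reduced document vectors)
X = rng.random((40, 5))
y = np.where(X[:, 0] + X[:, 1] > 1.0, 1, -1)

# Soft-margin SVM with an RBF kernel; C is the regularization constant
# trading off margin width against classification error.
clf = SVC(kernel="rbf", C=1.0)
clf.fit(X, y)

print("support vectors:", clf.support_.size)
print("predictions:", clf.predict(X[:5]))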
2. Fuzzy Support Vector Machines ([12]-[13])

Consider the aforementioned binary training set S. We choose a proper membership function and obtain s_i, the fuzzy membership value of the training point x_i. The training set then becomes the fuzzy training set

S_f = {(x_1, y_1, s_1), ..., (x_l, y_l, s_l)}

where x_i ∈ R^n, with corresponding binary class labels y_i ∈ {1, -1} and 0 ≤ s_i ≤ 1. Then, the quadratic programming problem for classification can be described as follows:

Minimize:   \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{l} s_i \xi_i
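The fuzzy memberships s_i scale each point's slack penalty, so less reliable training points contribute less to the objective. One way to approximate this with scikit-learn is to pass the memberships as per-sample weights; this is a rough sketch of that idea, not the paper's implementation, and the membership heuristic and data are illustrative.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)

X = rng.random((40, 5))
y = np.where(X[:, 0] + X[:, 1] > 1.0, 1, -1)

# Illustrative fuzzy memberships s_i in (0, 1]: points closer to their class
# mean get higher membership (a common heuristic, assumed here).
s = np.empty(len(X))
for label in (-1, 1):
    idx = np.where(y == label)[0]
    dist = np.linalg.norm(X[idx] - X[idx].mean(axis=0), axis=1)
    s[idx] = 1.0 - dist / (dist.max() + 1e-6)

# sample_weight scales the per-sample penalty, mimicking the C * s_i * xi_i term.
clf = SVC(kernel="rbf", C=1.0)
clf.fit(X, y, sample_weight=s)
print(clf.predict(X[:5]))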
