
DATA VISUALIZATION USING MULTIDIMENSIONAL SCALING

Project Report
Submitted in Partial Fulfilment of the Requirements
For the Award of the Degree of

Integrated Dual Degree
in

Computer Science and Engineering

Under the Guidance of
Prof. A. K. Agarwal

By
S Sujana
Roll No. 10400EN005

DEPARTMENT OF COMPUTER ENGINEERING
INDIAN INSTITUTE OF TECHNOLOGY (BANARAS HINDU UNIVERSITY)
VARANASI 221005, INDIA



CERTIFICATE

This is to certify that the project entitled, Data Visualization Using Multidimensional Scaling submitted by S Sujana in partial fulfilment of the requirements for the award of Integrated Dual Degree in Computer Science & Engineering at the Indian Institute Of Technology, Banaras Hindu University, is an authentic work carried out by her under my supervision and guidance. To the best of my knowledge, the matter embodied in the project report has not been submitted to any other University/Institute for the award of any degree or diploma.

Supervisor
Prof. A. K. Agrawal Department Of Computer Engineering IIT(BHU), Varanasi

ACKNOWLEDGEMENTS

I would like to convey my deepest gratitude to Professor A. K. Agarwal, who guided me through this project. His keen interest, continuous motivation, suggestions and support helped me immensely in successfully completing this project.

I would also like to thank Prof. R. B. Mishra, Head of the Department of Computer Engineering, for allowing me to avail all the facilities of the Department necessary for this project.

S Sujana Roll No. 10400EN005

ABSTRACT
Data visualization is a powerful technique of data exploration that harnesses the power of the human mind to detect structures and patterns in visual data. This project discusses data visualization using multidimensional scaling, a technique for analyzing similarity/dissimilarity data on a set of objects.
Multidimensional scaling takes the dissimilarity data as input and returns a set of coordinates of the objects in a low-dimensional Euclidean space, such that the distances between these points are preserved as much as possible. Here, this has been carried out by implementing the SMACOF algorithm (Scaling by MAjorizing A COmplicated Function), which offers faster convergence than the classical multidimensional scaling algorithm.

CONTENTS
Abstract
1. Introduction
   1.1 Exploratory Data Analysis
   1.2 Data Visualization
2. Multidimensional Scaling
   2.1 Need
   2.2 Formulation
   2.3 Paradigm
   2.4 Classification
   2.5 Applications
   2.6 Comparison of Techniques
3. SMACOF
   3.1 Mathematical Background
   3.2 Algorithm
4. Implementation
   4.1 Preparatory Work
   4.2 Specification
   4.3 Pseudocode
5. User Guide
   5.1 Input Format
   5.2 Output
6. Test Cases
7. Conclusion
8. Further Scope
   8.1 Tunneling Method
9. References

INTRODUCTION
Exploratory Data Analysis
Exploratory data analysis is an approach to analyzing data to summarize its main characteristics, often with visual methods. The purpose of this analysis is to see structure in the data.
Exploratory data analysis can be described as data-driven hypothesis generation. The data is examined in search of structures that may indicate deeper relationships between cases or variables. This process stands in contrast to hypothesis testing, which begins with a proposed model or hypothesis and undertakes statistical manipulations to determine the likelihood that the data arose from such a model.
The distinction here is that it is the patterns in the data which give rise to the hypothesis, in contrast to situations in which the hypothesis is generated from theoretical arguments about underlying mechanisms. This increases our confidence in the resulting hypotheses, as these results are less likely to be influenced by external factors, like the context in which the data was found, the analyst's personal biases, traditional methods used and so on.
Multidimensional scaling, principal components analysis, scatter plots, etc. are some of the typical graphical techniques used in exploratory data analysis.

Data Visualization
Data visualization is the presentation of data in a pictorial or graphical format. It is basically an abstraction of data. Its main goal is to communicate information clearly and effectively using visual means.
For centuries, people have depended on visual representations such as charts and maps to understand information more easily and quickly. As more and more data is collected and analyzed, decision makers at all levels welcome data visualization software that enables them to see analytical results presented visually, find relevance among the millions of variables, communicate concepts and hypotheses to others, and even predict the future.
Visualization methods are used to display data in ways that harness the particular strengths of human pattern processing abilities. Visual methods are important in data mining as they are ideal for sifting through data to find unexpected relationships. They enable the analyst to grasp the structure, and hence meaning, of the data faster. These methods, however, may not be able to communicate clearly for extremely large data sets.
Visualization also enables representing data in a universal and accessible manner. It makes it simple to share ideas with others and encourages explorative data analysis.

Some common visualization techniques are:
- Standard 2D/3D displays, such as bar charts and x-y plots
- Geometrically transformed displays, such as landscapes and parallel coordinates
- Icon-based displays, such as needle icons and star icons
- Dense pixel displays, such as the recursive pattern and circle segments
- Stacked displays, such as treemaps and dimensional stacking
Multidimensional scaling is a visualization technique of the first kind.

MULTIDIMENSIONAL SCALING
Multidimensional scaling refers to a class of algorithms for exploratory data analysis which visualize proximity relations of objects by distances between points in a low-dimensional Euclidean space. Proximity values are represented as dissimilarity values.
Given a set of data objects, multidimensional scaling aims to place each object in an N-dimensional space such that the inter-object distances are preserved as much as possible. For the case when N = 2, the problem can be stated as plotting a scatterplot of the given data set.

Need
One of the most important goals in visualizing data is to get a sense of how near or far points are from each other. Often, this can be done with a scatter plot. However, for some analyses, the data that we have might not be in the form of points at all, but rather in the form of pairwise similarities or dissimilarities between cases, observations, or subjects. There are no points to plot.
Even if the data are in the form of points rather than pairwise distances, a scatter plot of these data might not be useful. For some kinds of data, the relevant way to measure how near two points are might not be their Euclidean distance. While scatter plots of the raw data make it easy to compare Euclidean distances, they are not always useful when comparing other kinds of inter-point distances, city block distance for example, or even more general dissimilarities. Also, with a large number of variables, it is very difficult to visualize distances unless the data can be represented in a small number of dimensions. Some sort of dimension reduction is usually necessary.
Multidimensional scaling (MDS) is a set of methods that address all these problems. MDS allows you to visualize how near points are to each other for many kinds of distance or dissimilarity metrics and can produce a representation of your data in a small number of dimensions.

Formulation
Formally, multidimensional scaling can be presented as follows.
The data to be analyzed is a collection of I objects (colors, faces, stocks, ...) on which a distance function is defined:

delta_ij := distance between the i-th and j-th objects.

These distances are the entries of the dissimilarity matrix Delta.

The goal of MDS is, given Delta, to find I vectors x_1, ..., x_I in R^N such that

||x_i - x_j|| ~ delta_ij for all i, j,

where ||.|| is a vector norm.
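As a small illustration of how the entries delta_ij arise, the sketch below builds a dissimilarity matrix from four hypothetical objects described by two features each (the objects and features are invented for the example):

```python
import numpy as np

# Four hypothetical objects, each described by two features.
objects = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0], [3.0, 4.0]])

# delta[i, j] holds the Euclidean distance between objects i and j;
# the result is symmetric with a zero diagonal (a dissimilarity matrix).
diff = objects[:, None, :] - objects[None, :, :]
delta = np.sqrt((diff ** 2).sum(axis=-1))

print(delta[0, 3])  # -> 5.0 (a 3-4-5 right triangle)
```

MDS then searches for points whose pairwise distances reproduce this matrix as closely as possible.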

Paradigm
1. Multidimensional scaling is typically used as a visualization technique for proximity data.
2. When the dissimilarities are distances between high-dimensional objects, it acts as a (often nonlinear) dimension-reduction technique. Hence, it makes complex and coupled data more understandable.
3. When the dissimilarities are shortest-path distances in a graph, it functions as a graph layout technique. It is useful in visualizing weighted graphs, both planar and non-planar.
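The third use can be sketched as follows: derive a shortest-path distance matrix from a graph and treat it as the dissimilarity input for MDS. The sketch assumes SciPy's shortest_path routine and an invented four-node path graph:

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path

# A hypothetical path graph 0 - 1 - 2 - 3 with unit edge weights
# (zeros in the dense matrix mean "no edge").
adj = np.array([
    [0., 1., 0., 0.],
    [1., 0., 1., 0.],
    [0., 1., 0., 1.],
    [0., 0., 1., 0.],
])

# The shortest-path metric serves as the dissimilarity matrix for MDS.
delta = shortest_path(adj, directed=False)
print(delta[0, 3])  # -> 3.0
```

Running MDS on such a matrix produces a planar (or spatial) layout of the graph.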

Classification
Here, we discuss three classifications of multidimensional scaling algorithms:
- Classical MDS
- Metric MDS
- Non-Metric MDS

Classical MDS
Classical multidimensional scaling is also known as Principal Coordinates Analysis. It takes an input matrix giving dissimilarities between pairs of items and outputs a coordinate matrix whose configuration minimizes a loss function called strain.
This method tries to find the main axes through a matrix. It is a kind of eigenanalysis (sometimes referred to as "singular value decomposition") and calculates a series of eigenvalues and eigenvectors. Each eigenvalue has an eigenvector, and there are as many eigenvectors and eigenvalues as there are rows in the initial matrix.

Eigenvalues are usually ranked from the greatest to the least. The first eigenvalue is often called the "dominant" or "leading" eigenvalue. Using the eigenvectors, we can visualize the main axes through the initial distance matrix. Eigenvalues are also often called "latent values".
The result is a rotation of the data matrix: it does not change the positions of points relative to each other; it just changes the coordinate system. Thus, we can visualize individual and/or group differences. Individual differences can be used to show outliers.
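This eigen-decomposition view can be sketched in a few lines of NumPy. The function below is a minimal Torgerson-style implementation written for illustration, not the software developed in this project:

```python
import numpy as np

def classical_mds(D, k=2):
    """Classical (Torgerson) MDS: double-center the squared distances,
    then take the k leading eigenvectors scaled by sqrt(eigenvalue)."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n       # centering matrix
    B = -0.5 * J @ (D ** 2) @ J               # double-centered Gram matrix
    evals, evecs = np.linalg.eigh(B)          # eigenvalues in ascending order
    idx = np.argsort(evals)[::-1][:k]         # pick the k largest
    return evecs[:, idx] * np.sqrt(np.maximum(evals[idx], 0))

# Distances between the corners of a 3 x 4 rectangle.
pts = np.array([[0, 0], [3, 0], [0, 4], [3, 4]], dtype=float)
D = np.sqrt(((pts[:, None] - pts[None]) ** 2).sum(-1))
X = classical_mds(D, k=2)
D_hat = np.sqrt(((X[:, None] - X[None]) ** 2).sum(-1))
print(np.allclose(D, D_hat))  # distances reproduced exactly -> True
```

Because the input distances here are exactly Euclidean, the recovered configuration reproduces them perfectly (up to rotation and reflection).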

Metric MDS
Metric multidimensional scaling is a superset of classical MDS that generalizes the optimization procedure to a variety of loss functions and input matrices of known distances with weights and so on. It also offers a choice of different criteria to construct the configuration, and allows missing data and weights.
The distances between objects are required to be proportional to the dissimilarities, or to some explicit function of the dissimilarities.
A useful loss function in this context is called stress, represented as sigma(X). The function is a cost or loss function that measures the squared differences between ideal (m-dimensional) distances and actual distances in r-dimensional space. It is defined as:

sigma(X) = SUM_{i<j} w_ij (d_ij(X) - delta_ij)^2

where w_ij is a weight for the measurement between a pair of points (i, j), d_ij(X) is the Euclidean distance between points i and j in the configuration X, and delta_ij is the ideal distance between the points (their separation) in the m-dimensional data space. w_ij can be used to specify a degree of confidence in the similarity between points (e.g. 0 can be specified if there is no information for a particular pair).
The problem of multidimensional scaling is thus reduced to minimizing this stress function sigma(X). A common approach to this is termed Least Squares Scaling, which typically involves a gradient descent on the stress. There exist other algorithms which do not rely on gradient descents. One of these methods, aimed at minimizing a stress function of the Sammon type, is known by the acronym SMACOF (Scaling by MAjorizing A COmplicated Function). It is based on an iterative majorization algorithm that introduces ideas from convex analysis.
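The stress function is straightforward to compute directly; a minimal sketch with an invented three-point configuration follows (the helper name stress is illustrative):

```python
import numpy as np

def stress(X, delta, W=None):
    """Weighted raw stress: sum over i<j of w_ij * (d_ij(X) - delta_ij)^2."""
    n = len(X)
    if W is None:
        W = np.ones((n, n))                  # unit weights by default
    d = np.sqrt(((X[:, None] - X[None]) ** 2).sum(-1))
    i, j = np.triu_indices(n, k=1)           # each pair counted once
    return (W[i, j] * (d[i, j] - delta[i, j]) ** 2).sum()

# A configuration that reproduces delta exactly has zero stress.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
delta = np.sqrt(((X[:, None] - X[None]) ** 2).sum(-1))
print(stress(X, delta))  # -> 0.0
```

Any distortion of the configuration (for example, scaling X) makes the stress strictly positive.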


Non Metric MDS


In metric multidimensional scaling, the distances between objects are required to be proportional to the dissimilarities, or to some explicit function of the dissimilarities. In non-metric multidimensional scaling, this condition is relaxed to require only that the distances between the objects increase in the same order as the dissimilarities between the objects.
In contrast to metric multidimensional scaling, non-metric multidimensional scaling finds both a non-parametric monotonic relationship between the dissimilarities in the item-item matrix and the Euclidean distances between items, and the location of each item in the low-dimensional space. The relationship is typically found using isotonic regression.
Non-metric multidimensional scaling includes an additional optimization step to smooth the data. This is usually done by carrying out isotonic regression. The implementation for non-metric multidimensional scaling usually fits smoothly into the implementation of metric multidimensional scaling by including the optimization step.
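The isotonic regression step can itself be sketched with the pool-adjacent-violators algorithm. The helper below is a stand-alone illustration (not part of the project code): given embedding distances listed in order of increasing dissimilarity, it produces the monotone disparities.

```python
import numpy as np

def pava(y):
    """Pool-adjacent-violators algorithm: the least-squares non-decreasing
    fit to the sequence y (the core of isotonic regression)."""
    blocks = []                      # each block holds [mean, size]
    for v in y:
        blocks.append([float(v), 1])
        # merge adjacent blocks while they violate monotonicity
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, s2 = blocks.pop()
            m1, s1 = blocks.pop()
            blocks.append([(m1 * s1 + m2 * s2) / (s1 + s2), s1 + s2])
    return np.concatenate([[m] * int(s) for m, s in blocks])

# Embedding distances in order of increasing dissimilarity;
# PAVA turns them into monotone disparities.
d = np.array([0.5, 1.8, 1.2, 2.5])
disparities = pava(d)
print(disparities)  # -> [0.5 1.5 1.5 2.5]
```

The violating pair (1.8, 1.2) is pooled to its mean 1.5, which is the closest non-decreasing sequence in the least-squares sense.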

Applications
Multidimensional scaling is extremely useful for scientific visualization and data mining. Here, some of its common applications are mentioned.
In marketing, multidimensional scaling is a statistical technique for taking the preferences and perceptions of respondents and representing them on a visual grid, called perceptual maps. Potential customers are asked to compare pairs of products and make judgments about their similarity. Whereas other techniques (such as factor analysis, discriminant analysis, and conjoint analysis) obtain underlying dimensions from responses to product attributes identified by the researcher, multidimensional scaling obtains the underlying dimensions from respondents' judgments about the similarity of products. This is an important advantage. It does not depend on researchers' judgments. It does not require a list of attributes to be shown to the respondents. The underlying dimensions come from respondents' judgments about pairs of products. Because of these advantages, multidimensional scaling is the most common technique used in perceptual mapping.
In cognitive sciences, multidimensional scaling is often used to study confusion data. Confusion data represents the possibility of different items being mistaken for each other. Thus, it is a form of a similarity matrix. Multidimensional scaling presents this data in a visual form such that the items which are grouped together in the representation are more likely to be similar, and thus confused for each other. Multidimensional scaling is often demonstrated with Rothkopf's Morse code data set, as an example of a confusion data set.
In the social sciences, proximity data take the form of similarity ratings for pairs of stimuli such as tastes, colors, sounds, people, nations, etc.


In archaeology, similarity of two digging sites can be quantified based on the frequency of shared features in artifacts found in the sites.
In classification problems: in classification with large numbers of classes, pairwise misclassification rates produce confusion matrices that can be analyzed as similarity data. An example would be confusion rates of phonemes in speech recognition.
Another early use of MDS was for dimension reduction: given high-dimensional data y_1, ..., y_N in R^K (K large), compute a matrix of pairwise distances dist(y_i, y_j) = D_ij, and use distance scaling to find lower-dimensional x_1, ..., x_N in R^k (k << K) whose pairwise distances reflect the high-dimensional distances D_ij as well as possible. In this application, distance scaling is a nonlinear competitor of principal components. Classical scaling, on the other hand, is identical to principal components when used for dimension reduction.
In chemistry, MDS can be used for molecular conformation, that is, the problem of reconstructing the spatial structure of molecules. This situation differs from the above areas in that 1) actual distance information is available from experiments or theory, and 2) the only meaningful embedding dimension is k = 3, physical space. Configurations are here called conformations.
Yet another use of MDS is for graph layout, an active area at the intersection of discrete mathematics and network visualization. From graphs one can derive distances, such as shortest-path metrics, which can be subjected to MDS for planar or spatial layout. Note that shortest-path metrics are generally strongly non-Euclidean, hence significant residual should be expected in this type of application.

Comparison of Techniques
Metric multidimensional scaling is preferred when the dissimilarities in the input need to be strictly honoured. On the other hand, non-metric scaling is used when the ranks of the dissimilarities need to be preserved. This approach tends to take a longer time on large data sets.
An often discussed deficit of the classical multidimensional scaling techniques such as Sammon mapping is their inherent batch character. A run of the program will only yield an embedding of the corresponding data without direct generalization capabilities. To project new data, the program has to be restarted on the pooled data, because a projection of additional data will modify the embedding of the old data as well.
Another, perhaps more urgent deficit is the amount of proximity values that characterize large data sets. For nonlinear dimension reduction, the standard technique clusters the data beforehand and visualizes the resulting cluster prototypes. This coarse-graining of a large data set by clustering is unsatisfactory and often unacceptable. The need to overcome this drawback has recently initiated a number of developments. These approaches share the common idea to use the Sammon stress function as a relative supervisor to train a nonlinear mapping. We will look at one of these methods at the end of this report.


SMACOF
SMACOF stands for Scaling by MAjorizing A COmplicated Function. It is a strategy to solve the problem of multidimensional scaling by using majorization to minimize a stress function.

Mathematical Background
Majorization
Before describing details about SMACOF, we give a brief overview of the general concept of majorization, which optimizes a particular objective function, in our application referred to as stress. More details about the particular stress functions and their surrogates for various SMACOF extensions will be elaborated below.
In a strict sense, majorization is not an algorithm but rather a prescription for constructing optimization algorithms. The principle of majorization is to construct a surrogate function which majorizes a particular objective function. For MDS, majorization was introduced by De Leeuw (1977a) and further elaborated in De Leeuw and Heiser (1977) and De Leeuw and Heiser (1980).
From a formal point of view, majorization requires the following definitions. Let us assume we have a function f(x) to be minimized. Finding an analytical solution for a complicated f(x) can be rather cumbersome. Thus, the majorization principle suggests finding a simpler, more manageable surrogate function g(x, y) which majorizes f(x), i.e. for all x

g(x, y) >= f(x)

where y is some fixed value called the supporting point. The surrogate function should touch the surface at y, i.e. f(y) = g(y, y), which, at the minimizer x* of g(x, y) over x, leads to the inequality chain

f(x*) <= g(x*, y) <= g(y, y) = f(y)

called the sandwich inequality.
Majorization is an iterative procedure which consists of the following steps:
1. Choose an initial starting value y := y0.
2. Find x(t) such that g(x(t), y) <= g(y, y).
3. Stop if f(y) - f(x(t)) < epsilon, else set y := x(t) and proceed with step 2.
This procedure can be extended to multidimensional spaces, and as long as the sandwich inequality holds, it can be used to minimize the corresponding objective function. In MDS the objective function, called stress, is a multivariate function of the distances between objects.
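The three steps above can be illustrated on a toy one-dimensional function (not a stress function; the example and the surrogate are invented for illustration, using the fact that cos has second derivative bounded by 1 so a quadratic upper bound exists at every supporting point):

```python
import math

# Toy example: minimize f(x) = x^2 - cos(x) by majorization.
# Since cos has second derivative bounded by 1, for any supporting point y
#     g(x, y) = x^2 - cos(y) + sin(y)*(x - y) + 0.5*(x - y)^2
# satisfies g(x, y) >= f(x) and g(y, y) = f(y).  Minimizing g over x
# gives the closed-form update x = (y - sin(y)) / 3.
def f(x):
    return x * x - math.cos(x)

y = 2.0                                # step 1: starting value y0
for _ in range(100):
    x = (y - math.sin(y)) / 3.0        # step 2: minimize the surrogate
    if f(y) - f(x) < 1e-12:            # step 3: stop on small improvement
        break
    y = x

print(round(y, 6))  # -> 0.0, the global minimizer of f
```

By the sandwich inequality, f never increases from one iterate to the next, which is exactly the property SMACOF exploits for stress.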

Algorithm
Multidimensional scaling input data are typically an n x n matrix Delta of dissimilarities based on observed data. Delta is symmetric, non-negative, and hollow (i.e. has a zero diagonal). The problem we solve is to locate n points, indexed i, j = 1, ..., n, in low-dimensional Euclidean space in such a way that the distances between the points approximate the given dissimilarities delta_ij. Thus we want to find an n x p matrix X such that d_ij(X) ~ delta_ij, where

d_ij(X) = sqrt( SUM_{s=1}^{p} (x_is - x_js)^2 ).

The index s = 1, ..., p denotes the number of dimensions in the Euclidean space. The elements of X are called configurations of the objects. Thus, each object is scaled in a p-dimensional space such that the distances between the points in the space match as well as possible the observed dissimilarities. By representing the results graphically, the configurations represent the coordinates in the configuration plot.
Now we make the optimization problem more precise by defining stress sigma(X) as

sigma(X) = SUM_{i<j} w_ij (d_ij(X) - delta_ij)^2.

Here, W is a known n x n matrix of weights w_ij, also assumed to be symmetric, non-negative, and hollow. We assume, without loss of generality, that

SUM_{i<j} w_ij delta_ij^2 = n(n - 1)/2

and that W is irreducible (De Leeuw 1977a), so that the minimization problem does not separate into a number of independent smaller problems. W can for instance be used for imposing missing value structures: w_ij = 1 if delta_ij is known and w_ij = 0 if delta_ij is missing. However, other kinds of weighting structures are allowed, along with the restriction w_ij >= 0.
Following De Leeuw (1977a), the stress defined above can be decomposed as

sigma(X) = eta_delta^2 + eta^2(X) - 2 rho(X).

From the normalization above it follows that the first component is eta_delta^2 = SUM_{i<j} w_ij delta_ij^2 = n(n - 1)/2. The second component eta^2(X) = SUM_{i<j} w_ij d_ij^2(X) is a weighted sum of the squared distances d_ij^2(X), and thus a convex quadratic. The third one, i.e. rho(X) = SUM_{i<j} w_ij delta_ij d_ij(X), is a weighted sum of the d_ij(X), and -2 rho(X) is consequently concave.
The third component is the crucial term for majorization. Let us define the matrix

A_ij = (e_i - e_j)(e_i - e_j)'

whose elements equal 1 at a_ii = a_jj, -1 at a_ij = a_ji, and 0 elsewhere. Furthermore, we define

V = SUM_{i<j} w_ij A_ij

as the weighted sum of row- and column-centered matrices A_ij. Hence, we can rewrite

eta^2(X) = tr X'VX.

For a similar representation of rho(X) we define the matrix

B(X) = SUM_{i<j} w_ij s_ij(X) A_ij,

where

s_ij(X) = delta_ij / d_ij(X) if d_ij(X) > 0, and 0 otherwise.

Using B(X) we can rewrite rho(X) as

rho(X) = tr X'B(X)X

and, consequently, the stress decomposition becomes

sigma(X) = eta_delta^2 + tr X'VX - 2 tr X'B(X)X.

At this point it is straightforward to find the majorizing function of sigma(X). Let us denote the supporting point by Y which, in the case of MDS, is an n x p matrix of configurations. Similar to B(X) above we define

B(Y) = SUM_{i<j} w_ij s_ij(Y) A_ij

with

s_ij(Y) = delta_ij / d_ij(Y) if d_ij(Y) > 0, and 0 otherwise.

The Cauchy-Schwarz inequality implies that for all pairs of configurations X and Y, we have

rho(X) = tr X'B(X)X >= tr X'B(Y)Y.

Thus we minorize the convex function rho(X) with a linear function. This gives us a majorization of stress:

sigma(X) <= tau(X, Y) = eta_delta^2 + tr X'VX - 2 tr X'B(Y)Y.

Obviously, tau(X, Y) is a (simple) quadratic function in X which majorizes stress. Finding its minimum analytically involves setting the gradient to zero:

grad tau(X, Y) = 2VX - 2B(Y)Y = 0.

To solve this equation system we use the Moore-Penrose inverse V+, which leads to

X = V+ B(Y)Y.

This is known as the Guttman transform (Guttman 1968) of a configuration. Note that if w_ij = 1 for all i != j, we have

V = n(I - n^{-1} 11')  and  V+ = n^{-1}(I - n^{-1} 11'),

and the Guttman transform simply becomes

X = n^{-1} B(Y)Y.
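For unit weights, the Guttman transform can be sketched directly in NumPy. The code below is a stand-alone illustration (separate from the project's implementation); the dissimilarities form a valid triangle, so the stress can reach zero and, by the sandwich inequality, never increases from one update to the next:

```python
import numpy as np

def guttman_transform(Y, delta):
    """One SMACOF update X = (1/n) * B(Y) Y, assuming unit weights w_ij = 1."""
    n = len(Y)
    d = np.sqrt(((Y[:, None] - Y[None]) ** 2).sum(-1))
    np.fill_diagonal(d, 1.0)               # avoid 0/0; s_ii is set to 0 below
    s = delta / d                          # s_ij(Y) = delta_ij / d_ij(Y)
    np.fill_diagonal(s, 0.0)
    B = -s
    np.fill_diagonal(B, s.sum(axis=1))     # row sums of B are zero
    return B @ Y / n

def stress(X, delta):
    d = np.sqrt(((X[:, None] - X[None]) ** 2).sum(-1))
    i, j = np.triu_indices(len(X), k=1)
    return ((d[i, j] - delta[i, j]) ** 2).sum()

# Three points with distances 2, 3, 4 embed exactly in the plane.
delta = np.array([[0, 2, 3], [2, 0, 4], [3, 4, 0]], dtype=float)
Y = np.random.default_rng(0).standard_normal((3, 2))
for _ in range(200):
    X = guttman_transform(Y, delta)
    assert stress(X, delta) <= stress(Y, delta) + 1e-9  # monotone descent
    Y = X
print(stress(Y, delta))  # near zero after convergence
```

Repeated application of the transform is exactly the SMACOF iteration described in the next chapter.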


IMPLEMENTATION
Preparatory Work
Python Programming Language
Python is a widely used general-purpose, high-level programming language. Its design philosophy emphasizes code readability, and its syntax allows programmers to express concepts in fewer lines of code than would be possible in languages such as C. The language provides constructs intended to enable clear programs on both a small and large scale.
Python supports multiple programming paradigms, including object-oriented, imperative, functional and procedural styles. It features a dynamic type system and automatic memory management and has a large and comprehensive standard library.
Like other dynamic languages, Python is often used as a scripting language, but is also used in a wide range of non-scripting contexts. Using third-party tools (such as Py2exe or PyInstaller), Python code can be packaged into standalone executable programs. Python interpreters are available for many operating systems.
CPython, the reference implementation of Python, is free and open-source software and has a community-based development model, as do nearly all of its alternative implementations. CPython is managed by the non-profit Python Software Foundation.

Numpy library
NumPy is an extension to the Python programming language, adding support for large, multidimensional arrays and matrices, along with a large library of high-level mathematical functions to operate on these arrays. The ancestor of NumPy, Numeric, was originally created by Jim Hugunin with contributions from several other developers. In 2005, Travis Oliphant created NumPy by incorporating features of Numarray into Numeric with extensive modifications. NumPy is open source and has many contributors.
NumPy is the fundamental package for scientific computing with Python. It contains among other things:

- a powerful N-dimensional array object
- sophisticated (broadcasting) functions
- tools for integrating C/C++ and Fortran code
- useful linear algebra, Fourier transform, and random number capabilities

Besides its obvious scientific uses, NumPy can also be used as an efficient multidimensional container of generic data. Arbitrary data types can be defined. This allows NumPy to seamlessly and speedily integrate with a wide variety of databases. NumPy is licensed under the BSD license, enabling reuse with few restrictions.
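As a small illustration of the broadcasting functions mentioned above, arrays of different shapes can be combined without explicit loops:

```python
import numpy as np

# Broadcasting: the 1-D array of column means stretches to match the 2-D array.
a = np.arange(6).reshape(2, 3)      # [[0, 1, 2], [3, 4, 5]]
col_means = a.mean(axis=0)          # [1.5, 2.5, 3.5]
centered = a - col_means            # element-wise subtraction, shape (2, 3)
print(centered.sum())  # -> 0.0
```

This style of vectorized computation is what makes the distance and stress calculations in the implementation concise and fast.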

PyQT
PyQt is a Python binding of the cross-platform GUI toolkit Qt. It is one of Python's options for GUI programming. Popular alternatives are PySide (the Qt binding with official support and a more liberal licence), PyGTK, wxPython, and Tkinter (which is bundled with Python). Like Qt, PyQt is free software. PyQt is implemented as a Python plug-in.
PyQt implements around 440 classes and over 6,000 functions and methods, including:
- a substantial set of GUI widgets
- classes for accessing SQL databases (ODBC, MySQL, PostgreSQL, Oracle)
- QScintilla, a Scintilla-based rich text editor widget
- data-aware widgets that are automatically populated from a database
- an XML parser
- SVG support
- classes for embedding ActiveX controls on Windows (only in the commercial version)

Matplotlib library
matplotlib is a plotting library for the Python programming language and its NumPy numerical mathematics extension. It provides an object-oriented API for embedding plots into applications using general-purpose GUI toolkits like wxPython, Qt, or GTK. There is also a procedural "pylab" interface based on a state machine (like OpenGL), designed to closely resemble that of MATLAB.
matplotlib was originally written by John Hunter, has an active development community, and is distributed under a BSD-style license.

Specification
Description
The multidimensional scaling software allows its user to visualize data points on a 2-dimensional scatterplot for given proximity data and analyze its performance with respect to various configurable parameters.

Product Perspective

The software must, upon getting input from the user in the required format, process it to initialize itself, then run the multidimensional scaling algorithm on the information and provide the output to the user in the required format.

Product Scope
The software shall have the following scope:
- Users can get the multidimensional scaling output for various configurations of cores for analysis.
- It may be specialized for particular types of data by providing wrappers that generate input in the required format and that read the output generated and display or process it for further analysis.

User Interface
The Graphical User Interface has been developed using PyQt, a Python wrapper for the Qt4 Framework.

Pseudocode
1. Generate an initial configuration of the points randomly.

    def check_random_state(seed):
        """Turn seed into a np.random.RandomState instance.

        If seed is None, return the RandomState singleton used by np.random.
        If seed is an int, return a new RandomState instance seeded with seed.
        If seed is already a RandomState instance, return it.
        Otherwise raise ValueError.
        """
        if seed is None or seed is np.random:
            return np.random.mtrand._rand
        if isinstance(seed, (numbers.Integral, np.integer)):
            return np.random.RandomState(seed)
        if isinstance(seed, np.random.RandomState):
            return seed
        raise ValueError('%r cannot be used to seed a numpy.random.RandomState'
                         ' instance' % seed)

    # Randomly choose an initial configuration
    random_state = check_random_state(seed)
    X = random_state.rand(n_samples * n_components)
    X = X.reshape((n_samples, n_components))

2. Compute the stress.

    # Compute distances
    dis = euclidean_distances(X)
    disparities = similarities
    # Compute stress
    stress = ((dis.ravel() - disparities.ravel()) ** 2).sum() / 2

3. Compute the Guttman transform.

    # Update X using the Guttman transform
    dis[dis == 0] = 1e-5
    ratio = disparities / dis
    B = -ratio
    B[np.arange(len(B)), np.arange(len(B))] += ratio.sum(axis=1)
    X = 1. / n_samples * np.dot(B, X)

4. Iterate 2 and 3 until convergence.

    dis = np.sqrt((X ** 2).sum(axis=1)).sum()
    if old_stress is not None:
        if (old_stress - stress / dis) < eps:
            break
    old_stress = stress / dis

This implementation executes a fixed number of runs of the above algorithm and returns the result having minimal stress.
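Assembled into a single routine, the four steps might look as follows. This is a compact NumPy-only sketch mirroring the pseudocode, not the exact project code; the function name smacof and its default parameters are illustrative:

```python
import numpy as np

def smacof(delta, n_components=2, n_init=4, max_iter=300, eps=1e-6, seed=0):
    """Minimal SMACOF driver following the four steps above:
    random init, stress, Guttman transform, iterate; best of n_init runs."""
    rng = np.random.RandomState(seed)
    n = len(delta)
    best_X, best_stress = None, np.inf
    for _ in range(n_init):
        X = rng.rand(n, n_components)                    # step 1: random init
        old = None
        for _ in range(max_iter):
            d = np.sqrt(((X[:, None] - X[None]) ** 2).sum(-1))
            stress = ((d - delta)[np.triu_indices(n, 1)] ** 2).sum()  # step 2
            d[d == 0] = 1e-5                             # guard against 0/0
            ratio = delta / d
            np.fill_diagonal(ratio, 0)
            B = -ratio
            np.fill_diagonal(B, ratio.sum(axis=1))
            X = B @ X / n                                # step 3: Guttman
            norm = np.sqrt((X ** 2).sum(axis=1)).sum()
            if old is not None and old - stress / norm < eps:  # step 4
                break
            old = stress / norm
        if stress < best_stress:
            best_X, best_stress = X, stress
    return best_X, best_stress

# Three collinear "objects" at positions 0, 1, 2 embed almost exactly.
delta = np.array([[0, 1, 2], [1, 0, 1], [2, 1, 0]], dtype=float)
X, s = smacof(delta)
print(s < 1e-3)  # -> True
```

Running several random restarts and keeping the lowest-stress result, as here, mitigates the local-minimum problem discussed in the Further Scope chapter.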


USER GUIDE
Input Format
The input shall contain the following:
- The first line shall contain a list of labels for the data items, of length n.
- This is followed by the dissimilarity matrix of dimension n x n.
- This is followed by a line containing the number of runs of multidimensional scaling with different initial configurations.
- The next line contains the number of runs to be executed in parallel.
- The next line contains the maximum number of iterations in a single run.
- The next line contains the relative tolerance with respect to stress to achieve convergence.
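A reader for this format might look as follows. This is a hypothetical sketch; the function name parse_input, the file name, and the example values are all invented, and the actual software's parser may differ:

```python
import numpy as np

def parse_input(path):
    """Hypothetical reader for the input format described above."""
    with open(path) as f:
        rows = [line.split() for line in f if line.strip()]
    labels = rows[0]
    n = len(labels)
    delta = np.array(rows[1:1 + n], dtype=float)   # n x n dissimilarity matrix
    n_init = int(rows[1 + n][0])       # runs with different initial configs
    n_jobs = int(rows[2 + n][0])       # runs executed in parallel
    max_iter = int(rows[3 + n][0])     # maximum iterations in a single run
    eps = float(rows[4 + n][0])        # relative stress tolerance
    return labels, delta, n_init, n_jobs, max_iter, eps

# A tiny example input file for three labelled items.
with open("mds_input.txt", "w") as f:
    f.write("A B C\n0 1 2\n1 0 1\n2 1 0\n4\n2\n300\n0.001\n")

labels, delta, n_init, n_jobs, max_iter, eps = parse_input("mds_input.txt")
print(labels, delta.shape)  # -> ['A', 'B', 'C'] (3, 3)
```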

Output
The output shows the plot of the points obtained from multidimensional scaling. In the case where labels are not provided in the input, the points are marked by their indices.
The matplotlib toolbar can be used to pan/zoom into the plot. The navigation buttons can be used to return or move forward to a previous or next view. The plot can be saved in a file for further analysis.
The stress shown is the final stress obtained in the multidimensional scaling process.


TEST CASES
Input:


Output:


CONCLUSION
Multidimensional scaling has been implemented using the majorization technique, by the SMACOF algorithm.
For a given dissimilarity matrix, the algorithm successfully computes the positions of the data in two-dimensional space. Various parameters can be used to tune the algorithm.
Thus, multidimensional scaling using the SMACOF algorithm has been successfully implemented.


FURTHER SCOPE
As noted earlier, one of the drawbacks of the majorization approach to multidimensional scaling is the issue of the algorithm getting stuck in a local minimum while minimizing the stress. Only a few global minimization strategies have been developed for MDS, the most prominent algorithm being the tunneling method. This deterministic scheme allows the algorithm to escape local minima by tunneling to new configurations with the same stress, possibly providing a starting point for further stress reduction.

Tunneling Method
The tunneling method is an approach to multidimensional scaling. It alternates a local search step, in which a local minimum is sought, with a tunneling step, in which a different configuration is sought with the same STRESS as the previous local minimum. In this manner successively better local minima are obtained, and the last one is often a global minimum.
It consists of an iterative two-step procedure: in the first step, a local minimum is sought, and in the second step, another configuration is determined with exactly the same STRESS. It can be described by the following analogy.
Suppose we wish to find the lowest spot in a selected area in the Alps. First, we pour some water and see where it stops: the local search. From this point, a global search is performed by digging tunnels horizontally until we come out of the mountain. There we pour water again, find out where it stops, and dig tunnels again. If we stay underground for a long time while digging the tunnel, we simply conclude that the last spot was in fact the lowest place in the area, the candidate global minimum.
An important and attractive feature of the tunneling algorithm is that successive local minima always have lower or equal function values. The tunneling step is the crux of the tunneling method. It is performed by minimization of a particular function, called the tunneling function. The effectiveness of the tunneling method is determined by the success of the tunneling step.
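The alternation can be illustrated on a one-dimensional toy function with two minima. This sketch uses SciPy's generic optimizers and finds the equal-STRESS point by root finding; the actual tunneling method minimizes a dedicated tunneling function instead, so this is only an analogy of the local-search/tunnel/local-search cycle:

```python
import numpy as np
from scipy.optimize import minimize_scalar, brentq

# Toy "stress surface" with two minima; the one near x = +1 is not global.
f = lambda x: (x * x - 1) ** 2 + 0.3 * x

# Local search: descend within [0, 2], reaching the local minimum near +1.
x_loc = minimize_scalar(f, bounds=(0, 2), method="bounded").x
level = f(x_loc)

# Tunneling step: find a different point with the same function value,
# here by locating a sign change of f(x) - level to the left of x_loc.
g = lambda x: f(x) - level
xs = np.linspace(-3, -1, 200)
i = np.where(np.sign(g(xs[:-1])) != np.sign(g(xs[1:])))[0][0]
x_tunnel = brentq(g, xs[i], xs[i + 1])

# Pour water again: a second local search from the tunnel exit
# reaches a minimum with a strictly lower value.
x_best = minimize_scalar(f, bounds=(x_tunnel, 0), method="bounded").x
print(f(x_best) < f(x_loc))  # -> True
```

As in the mountain analogy, each tunnel exit is at the same height as the previous minimum, so successive minima can only get lower.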


REFERENCES

1. D. J. Hand, Heikki Mannila, Padhraic Smyth. Principles of Data Mining.
2. Introduction to Multidimensional Scaling. http://www.mathworks.in/help/stats/multidimensionalscaling.html
3. W. S. Torgerson. Multidimensional Scaling I: Theory and Method.
4. I. Borg, P. J. F. Groenen. Modern Multidimensional Scaling: Theory and Applications.
5. Andreas Buja, Deborah F. Swayne, Michael L. Littman, Nathaniel Dean, Heike Hofmann, and Lisha Chen. Data Visualization with Multidimensional Scaling.
6. Piotr Pawliczek, Witold Dzwinel. Interactive Data Mining by Using Multidimensional Scaling.
7. J. de Leeuw. Applications of Convex Analysis to Multidimensional Scaling.
8. Jan de Leeuw. Convergence of the Majorization Method for Multidimensional Scaling.
9. J. de Leeuw, P. Mair. Multidimensional Scaling Using Majorization: SMACOF.
10. Patrick J. F. Groenen and Willem J. Heiser. The Tunneling Method for Global Optimization in Multidimensional Scaling.
11. Hansjörg Klock, Joachim M. Buhmann. Data Visualization by Multidimensional Scaling: A Deterministic Annealing Approach.

