
Gunnar Rätsch, Friedrich Miescher Laboratory of the Max Planck Society, Spemannstraße 37, 72076 Tübingen, Germany, http://www.tuebingen.mpg.de/~raetsch

1 Introduction

The Machine Learning field evolved from the broad field of Artificial Intelligence, which aims to mimic intelligent abilities of humans by machines. In the field of Machine Learning one considers the important question of how to make machines able to "learn". Learning in this context is understood as inductive inference, where one observes examples that represent incomplete information about some "statistical phenomenon". In unsupervised learning one typically tries to uncover hidden regularities (e.g. clusters) or to detect anomalies in the data (for instance some unusual machine function or a network intrusion). In supervised learning, there is a label associated with each example. It is supposed to be the answer to a question about the example. If the label is discrete, the task is called a classification problem; otherwise, for real-valued labels, we speak of a regression problem. Based on these examples (including the labels), one is particularly interested in predicting the answer for other cases before they are explicitly observed. Hence, learning is not only a question of remembering but also of generalization to unseen cases.
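The difference between remembering and generalizing can be made concrete with a toy sketch in plain Python (the data and the mean-threshold rule are purely illustrative, not from the text):

```python
# Supervised data: 1-D example inputs with labels.
examples = [1.0, 2.0, 3.0, 4.0]
class_labels = [-1, -1, 1, 1]       # discrete labels -> a classification problem
reg_labels = [1.1, 1.9, 3.2, 3.9]   # real-valued labels -> a regression problem

# Pure memorization: a lookup table has no answer for unseen inputs.
memorized = dict(zip(examples, class_labels))
print(memorized.get(2.5))  # -> None: no generalization

# A learned rule (here a simple threshold halfway between the class means)
# also answers for inputs that were never observed.
neg_mean = sum(x for x, y in zip(examples, class_labels) if y == -1) / 2
pos_mean = sum(x for x, y in zip(examples, class_labels) if y == 1) / 2
threshold = (neg_mean + pos_mean) / 2
print(1 if 2.5 >= threshold else -1)  # -> 1
```

The lookup table reproduces the training answers perfectly yet fails on any new input; the threshold rule is inferred inductively from the examples and therefore generalizes.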

2 Supervised Classification

An important task in Machine Learning is classification, also referred to as pattern recognition, where one attempts to build algorithms capable of automatically constructing methods for distinguishing between different exemplars, based on their differentiating patterns. Watanabe [1985] described a pattern as "the opposite of chaos; it is an entity, vaguely defined, that could be given a name." Examples of patterns are human faces, text documents, handwritten letters or digits, EEG signals, and the DNA sequences that may cause a certain disease. More formally, the goal of a (supervised) classification task is to find a functional mapping between the input data X, describing the input pattern, and a class label Y (e.g. −1 or +1), such that Y = f(X). The construction of the mapping is based on so-called training data supplied to the classification algorithm. The aim is to accurately predict the correct label on unseen data. A pattern (also: "example") is described by its features, i.e. the characteristics of the examples for a given problem. For instance, in a face recognition task some features could be the color of the eyes or the distance between the eyes. Thus, the input

to a pattern recognition task can be viewed as a two-dimensional matrix, whose axes are the examples and the features. Pattern classification tasks are often divided into several sub-tasks:

1. Data collection and representation.
2. Feature selection and/or feature reduction.
3. Classification.

Data collection and representation are mostly problem-specific. Therefore it is difficult to give general statements about this step of the process. In broad terms, one should try to find invariant features that describe the differences in classes as best as possible. Feature selection and feature reduction attempt to reduce the dimensionality (i.e. the number of features) for the remaining steps of the task. Finally, the classification phase of the process finds the actual mapping between patterns and labels (or targets). In many applications the second step is not essential or is implicitly performed in the third step.

3 Classification Algorithms

Although Machine Learning is a relatively young field of research, there exist more learning algorithms than I can mention in this introduction. I chose to describe six methods that I frequently use when solving data analysis tasks (usually classification). The first four methods are traditional techniques that have been widely used in the past and work reasonably well when analyzing low dimensional data sets with not too few labeled training examples. In the second part I will briefly outline two methods (Support Vector Machines & Boosting) that have received a lot of attention in the Machine Learning community recently. They are able to solve high-dimensional problems with very few examples (e.g. fifty) quite accurately and also work efficiently when examples are abundant (for instance several hundred thousands of examples).

3.1 Traditional Techniques

k-Nearest Neighbor Classification: Arguably the simplest method is the k-Nearest Neighbor classifier [Cover and Hart, 1967]. Here the k points of the training data closest to the test point are found, and a label is given to the test point by a majority vote between the k points. This method is highly intuitive and attains – given its simplicity – remarkably low classification errors, but it is computationally expensive and requires a large memory to store the training data.

Linear Discriminant Analysis computes a hyperplane in the input space that minimizes the within-class variance and maximizes the between-class distance [Fisher, 1936]. It can be efficiently computed in the linear case even with large data sets. However, often a linear separation is not sufficient. Nonlinear extensions by using kernels exist [Mika et al., 2003]; these, however, are difficult to apply to problems with large training sets.
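The k-Nearest Neighbor rule described above can be sketched in a few lines of plain Python (a minimal sketch using Euclidean distance and an unweighted majority vote; the toy data set is illustrative only):

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, test_point, k=3):
    """Classify test_point by a majority vote among its k nearest training points."""
    # Euclidean distance from the test point to every training point.
    dists = [(math.dist(p, test_point), label) for p, label in zip(train_x, train_y)]
    # Labels of the k closest training points.
    k_labels = [label for _, label in sorted(dists)[:k]]
    # Majority vote decides the predicted class.
    return Counter(k_labels).most_common(1)[0][0]

# Two well-separated 2-D clusters with labels -1 and +1.
train_x = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (2.0, 2.0), (2.1, 1.9), (1.8, 2.2)]
train_y = [-1, -1, -1, +1, +1, +1]
print(knn_predict(train_x, train_y, (0.1, 0.1), k=3))  # point near the first cluster -> -1
```

Note that prediction requires distances to all stored training points, which illustrates why the method needs the full training set in memory and becomes expensive for large data sets.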

Decision Trees: Another intuitive class of classification algorithms are decision trees. These algorithms solve the classification problem by repeatedly partitioning the input space, so as to build a tree whose nodes are as pure as possible (that is, they contain points of a single class). Classification of a new test point is achieved by moving from top to bottom along the branches of the tree, starting from the root node, until a terminal node is reached. Decision trees are simple yet effective classification schemes for small datasets. The computational complexity scales unfavorably with the number of dimensions of the data. Large datasets tend to result in complicated trees, which in turn require a large memory for storage. The C4.5 implementation by Quinlan [1992] is frequently used and can be downloaded at http://www.rulequest.com/Personal.

Figure 1: An example of a decision tree. (Figure taken from Yom-Tov [2004])

Neural Networks are perhaps one of the most commonly used approaches to classification. Neural networks (suggested first by Turing [1992]) are a computational model inspired by the connectivity of neurons in animate nervous systems. A simple scheme for a neural network is shown in Figure 2. Each circle denotes a computational element referred to as a neuron, which computes a weighted sum of its inputs, and possibly performs a nonlinear function on this sum. If certain classes of nonlinear functions are used, the function computed by the network can approximate any function (specifically a mapping from the training patterns to the training targets), provided enough neurons exist in the network and enough training examples are provided. A further boost to their popularity came with the proof that they can approximate any function mapping via the Universal Approximation Theorem [Haykin, 1999].

Figure 2: A schematic diagram of a neural network. Each circle in the hidden and output layer is a computational element known as a neuron. (Figure taken from Yom-Tov [2004])

3.2 Large Margin Algorithms

Machine learning rests upon the theoretical foundation of Statistical Learning Theory [e.g. Vapnik, 1995], which provides conditions and guarantees for good generalization of learning algorithms. Within the last decade, large margin classification techniques have emerged as a practical result of this theory of generalization. Roughly speaking, the margin is the distance of the example to the separation boundary, and a large margin classifier generates decision boundaries with large margins to almost all training examples. The two most widely studied classes of large margin classifiers are Support Vector Machines (SVMs) [Boser et al., 1992, Cortes and Vapnik, 1995] and Boosting [Valiant, 1984, Schapire, 1992].

Support Vector Machines work by mapping the training data into a feature space by the aid of a so-called kernel function and then separating the data using a large margin hyperplane (cf. Algorithm 1). Intuitively, the kernel computes a similarity between two given examples.
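A kernel can be evaluated directly as such a similarity score. A minimal plain-Python sketch of two commonly used kernels (RBF and polynomial), with illustrative parameter defaults:

```python
import math

def rbf_kernel(x, xp, sigma=1.0):
    """k(x, x') = exp(-||x - x'||^2 / sigma^2): nearby points score close to 1."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, xp))
    return math.exp(-sq_dist / sigma ** 2)

def poly_kernel(x, xp, d=2):
    """k(x, x') = (x . x')^d: similarity via the inner product, raised to degree d."""
    return sum(a * b for a, b in zip(x, xp)) ** d

x, y = (1.0, 0.0), (0.0, 1.0)
print(rbf_kernel(x, x))                      # identical points -> 1.0
print(rbf_kernel(x, y))                      # squared distance 2 -> exp(-2), about 0.135
print(poly_kernel((1.0, 2.0), (3.0, 1.0)))   # (1*3 + 2*1)^2 -> 25.0
```

The RBF kernel decays with distance, so it behaves like a soft neighborhood measure; the polynomial kernel grows with the inner product of the two examples.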

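The large-margin idea itself can be illustrated with a minimal linear SVM trained by subgradient descent on the regularized hinge loss. This is a simplified primal-form sketch (not the kernelized dual formulation solved by standard SVM packages), with step size, regularization, and iteration count chosen ad hoc for the toy data:

```python
import random

def train_linear_svm(xs, ys, lam=0.01, epochs=200, lr=0.1, seed=0):
    """Minimize (lam/2)*||w||^2 + average hinge loss max(0, 1 - y*(w.x + b))."""
    rng = random.Random(seed)
    w = [0.0, 0.0]
    b = 0.0
    data = list(zip(xs, ys))
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            margin = y * (w[0] * x[0] + w[1] * x[1] + b)
            # Regularization shrinks w; the hinge term pushes points with
            # margin < 1 further toward the correct side of the hyperplane.
            w = [wi - lr * lam * wi for wi in w]
            if margin < 1:
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

def svm_predict(w, b, x):
    return 1 if w[0] * x[0] + w[1] * x[1] + b >= 0 else -1

# Two linearly separable clusters.
xs = [(0.0, 0.0), (0.3, 0.2), (0.1, 0.4), (2.0, 2.0), (2.2, 1.8), (1.9, 2.3)]
ys = [-1, -1, -1, 1, 1, 1]
w, b = train_linear_svm(xs, ys)
print([svm_predict(w, b, x) for x in xs])  # training points end up correctly separated
```

Because the hinge penalty is active whenever the margin is below 1, the learned hyperplane is pushed away from both clusters rather than merely separating them, which is exactly the large-margin behavior described above.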
Given labeled sequences (x1 . sj ) +C m i=1 ξi yi f (xi ) ≥ 1 − ξi Boosting The basic idea of boosting and ensemble learning algorithms in general is to iteratively combine relatively simple base hypotheses – sometimes called rules of thumb – for the ﬁnal prediction. M¨ ller et al.. Research papers and implementations can be downloaded from the kernel machines web-site http://www.j=1 αi αj k(si . . . In boosting the base hypotheses are linearly combined. 2002].org. 1995.e. For so-called positive deﬁnite kernels. x ) = exp x−x and polynomial kernels k(x. (xm . x ) = (x · x )d . . Sch¨ lkopf and u o Smola. ym ) (x ∈ X and y ∈ {−1. 4 Summary Machine Learning research has been extremely active the last few years. i. Algorithm 1 The Support Vector Machine with regularization parameter C.kernel-machines.between two given examples. the ﬁnal prediction is the weighted majority of the votes. the large margin then ensures that these examples are correctly classiﬁed as well. 2003]. 2001]. 2001.. the optimization problem can be solved efﬁciently and SVMs have an interpretation as a hyperplane separation in a high dimensional feature space [Vapnik. 2001. The result is a large number of very accurate and efﬁcient algorithms that are quite easy to use for a practitioner. Research papers and implementations can be downloaded from http://www. In the case of two-class classiﬁcation. the SVM computes a function m f (s) = i=1 αi k(xi . Most commonly used kernel functions are RBF kernels 2 k(x. σ2 The SVM ﬁnds a large margin separation between the training examples and previously unseen examples will often be close to the training examples. the decision rule generalizes. . Hence. The combination of these simple rules can boost the performance drastically. +1}) and a kernel k.boosting. where the coefﬁcients αi and b are found by solving the optimization problem minimize subject to m i. y1 ).org. One uses a so-called base learner that generates the base hypotheses. 
Boosting techniques have been a a used on very high dimensional data sets and can quite easily deal with than hundred thousands of examples. x) + b. It seems rewarding and almost mandatory for (computer) scientist and engineers to learn how and where Machine Learning can help to automate tasks 4 . Support Vector Machines have been used on million dimensional data sets and in other cases with more than a million examples [Mangasarian and Musicant. Meir and R¨ tsch. It has been shown that Boosting has strong ties to support vector machines and large margin classiﬁcation [R¨ tsch.
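As an illustration, here is a compact AdaBoost-style sketch (one common boosting variant, not necessarily the specific algorithms cited above) that linearly combines decision stumps over a 1-D toy data set in plain Python:

```python
import math

def stumps(xs):
    """All thresholds with both orientations: h(x) = s if x >= t else -s."""
    return [(t, s) for t in sorted(set(xs)) for s in (1, -1)]

def stump_predict(h, x):
    t, s = h
    return s if x >= t else -s

def adaboost(xs, ys, rounds=3):
    m = len(xs)
    weights = [1.0 / m] * m
    ensemble = []  # list of (alpha, stump)
    for _ in range(rounds):
        # Base learner: pick the stump with the smallest weighted error.
        best_h, best_err = None, None
        for h in stumps(xs):
            err = sum(w for w, x, y in zip(weights, xs, ys) if stump_predict(h, x) != y)
            if best_err is None or err < best_err:
                best_h, best_err = h, err
        best_err = max(best_err, 1e-12)   # guard against division by zero
        if best_err >= 0.5:
            break                         # no rule of thumb better than chance
        alpha = 0.5 * math.log((1 - best_err) / best_err)
        ensemble.append((alpha, best_h))
        # Reweight: misclassified points gain weight for the next round.
        weights = [w * math.exp(-alpha * y * stump_predict(best_h, x))
                   for w, x, y in zip(weights, xs, ys)]
        z = sum(weights)
        weights = [w / z for w in weights]
    return ensemble

def predict(ensemble, x):
    score = sum(alpha * stump_predict(h, x) for alpha, h in ensemble)
    return 1 if score >= 0 else -1

# A labeling no single threshold can separate: +1 inside, -1 outside.
xs = [0, 1, 2, 3, 4, 5]
ys = [-1, -1, 1, 1, -1, -1]
ens = adaboost(xs, ys, rounds=3)
errors = sum(predict(ens, x) != y for x, y in zip(xs, ys))
print(errors)  # -> 0
```

The best single stump misclassifies two of the six points, yet the weighted majority of three stumps classifies all of them correctly, which is the "boost" in performance the text refers to.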

4 Summary

Machine Learning research has been extremely active the last few years. The result is a large number of very accurate and efficient algorithms that are quite easy to use for a practitioner. It seems rewarding and almost mandatory for (computer) scientists and engineers to learn how and where Machine Learning can help to automate tasks or provide predictions where humans have difficulties to comprehend large amounts of data. The long list of examples where Machine Learning techniques were successfully applied includes: text classification and categorization [e.g. Joachims, 2001] (for instance spam filtering), bioinformatics (e.g. cancer tissue classification [Furey et al., 2000], gene finding [Zien et al., 2000, Sonnenburg et al., 2002]), brain computer interfacing [e.g. Blankertz et al., 2003], monitoring of electric appliances [e.g. Onoda et al., 2000], network intrusion detection [e.g. Laskov et al., 2004], drug discovery [e.g. Warmuth et al., 2003], optimization of hard disk caching strategies [e.g. Gramacy et al., 2003], disk spin-down prediction [e.g. Helmbold et al., 2000], high-energy physics particle classification, recognition of hand writing, natural scene analysis, etc.

Obviously, in this brief summary I have to be far from complete. I did not mention regression algorithms (e.g. ridge regression, regression trees), unsupervised learning algorithms (such as clustering, principal component analysis), reinforcement learning, online learning algorithms or model-selection issues. Some of these techniques extend the applicability of Machine Learning algorithms drastically and would each require an introduction of their own. I would like to refer the interested reader to two introductory books [Mendelson and Smola, 2003, von Luxburg et al., 2004], which are the result of the last annual Summer Schools on Machine Learning (cf. http://mlss.cc). Furthermore, more detailed introductions to the discussed algorithms can be found in Meir and Rätsch [2003], Rätsch [2001], Müller et al. [2001] and Yom-Tov [2004].

Acknowledgments

I would like to thank Julia Lüning for encouraging me to write this paper. Also, thanks to Sören Sonnenburg and Nora Toussaint for proofreading the manuscript. For the preparation of this work I used and modified some parts of Müller et al. [2001]. This work was partly funded by the PASCAL Network of Excellence project (EU #506778).

References

B. Blankertz, G. Dornhege, C. Schäfer, R. Krepki, J. Kohlmorgen, K.-R. Müller, V. Kunzmann, F. Losch, and G. Curio. BCI bit rates and error detection for fast-pace motor commands based on single-trial EEG analysis. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 11:127–131, 2003.

B.E. Boser, I. Guyon, and V.N. Vapnik. A training algorithm for optimal margin classifiers. In D. Haussler, editor, Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, pages 144–152, 1992.

C. Cortes and V.N. Vapnik. Support vector networks. Machine Learning, 20:273–297, 1995.

T.M. Cover and P.E. Hart. Nearest neighbor pattern classifications. IEEE Transactions on Information Theory, 13(1):21–27, 1967.

R.A. Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7:179–188, 1936.

T.S. Furey, N. Cristianini, N. Duffy, D.W. Bednarski, M. Schummer, and D. Haussler. Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics, 16(9):906–914, 2000.

R.B. Gramacy, M.K. Warmuth, S.A. Brandt, and I. Ari. Adaptive caching by refetching. In Advances in Neural Information Processing Systems 15, pages 1465–1472. MIT Press, 2003.

S. Haykin. Neural Networks: A comprehensive foundation. 2nd edition, Prentice-Hall, 1999.

D.P. Helmbold, D.D.E. Long, T.L. Sconyers, and B. Sherrod. Adaptive disk spin-down for mobile computers. Mobile Networks and Applications, 5(4):285–297, 2000.

T. Joachims. The Maximum-Margin Approach to Learning Text Classifiers. PhD thesis, Computer Science Dept., University of Dortmund, 2001.

P. Laskov, C. Schäfer, and I. Kotenko. Intrusion detection in unlabeled data with quarter-sphere support vector machines. In Proc. DIMVA, pages 71–82, 2004.

O.L. Mangasarian and D.R. Musicant. Lagrangian support vector machines. JMLR, 1:161–177, 2001.

R. Meir and G. Rätsch. An introduction to boosting and leveraging. In S. Mendelson and A. Smola, editors, Advanced Lectures on Machine Learning, volume 2600 of LNAI, pages 119–184. Springer, 2003.

S. Mendelson and A. Smola, editors. Advanced Lectures on Machine Learning, volume 2600 of LNAI. Springer, 2003.

S. Mika, G. Rätsch, J. Weston, B. Schölkopf, A. Smola, and K.-R. Müller. Constructing descriptive and discriminative nonlinear features: Rayleigh coefficients in kernel feature spaces. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(5):623–627, May 2003.

K.-R. Müller, S. Mika, G. Rätsch, K. Tsuda, and B. Schölkopf. An introduction to kernel-based learning algorithms. IEEE Transactions on Neural Networks, 12(2):181–201, March 2001.

T. Onoda, G. Rätsch, and K.-R. Müller. A non-intrusive monitoring system for household electric appliances with inverters. In H. Bothe and R. Rojas, editors, Proc. of NC'2000. ICSC Academic Press Canada/Switzerland, 2000.

J.R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1992.

G. Rätsch. Robust Boosting via Convex Optimization. PhD thesis, University of Potsdam, Neues Palais 10, 14469 Potsdam, Germany, October 2001.

R.E. Schapire. The Design and Analysis of Efficient Learning Algorithms. PhD thesis, MIT Press, 1992.

B. Schölkopf and A.J. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.

S. Sonnenburg, G. Rätsch, A. Jagota, and K.-R. Müller. New methods for splice-site recognition. In J.R. Dorronsoro, editor, International Conference on Artificial Neural Networks – ICANN'02, LNCS 2415, pages 329–336. Springer Berlin, 2002.

A.M. Turing. Intelligent machinery. In D.C. Ince, editor, Collected works of A.M. Turing: Mechanical Intelligence. Elsevier Science Publishers, Amsterdam, The Netherlands, 1992.

L.G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, November 1984.

V.N. Vapnik. The nature of statistical learning theory. Springer Verlag, New York, 1995.

U. von Luxburg, O. Bousquet, and G. Rätsch, editors. Advanced Lectures on Machine Learning, volume 3176 of LNAI. Springer, 2004.

M.K. Warmuth, J. Liao, G. Rätsch, M. Mathieson, S. Putta, and C. Lemmen. Support Vector Machines for active learning in the drug discovery process. Journal of Chemical Information Sciences, 43(2):667–673, 2003.

S. Watanabe. Pattern recognition: Human and mechanical. Wiley, 1985.

E. Yom-Tov. An introduction to pattern classification. In U. von Luxburg, O. Bousquet, and G. Rätsch, editors, Advanced Lectures on Machine Learning, volume 3176 of LNAI, pages 1–23. Springer, 2004.

A. Zien, G. Rätsch, S. Mika, B. Schölkopf, T. Lengauer, and K.-R. Müller. Engineering Support Vector Machine kernels that recognize translation initiation sites. BioInformatics, 16(9):799–807, September 2000.
