This document provides an overview of the data mining and machine learning techniques discussed in the first three chapters of a book. Chapter 1 defines the data mining process and contrasts supervised and unsupervised learning. Chapter 2 covers data types and data preparation. Chapter 3 introduces classification and supervised learning algorithms such as Naive Bayes and k-Nearest Neighbours. It explains how Naive Bayes uses probabilities to determine the most likely classification, and how k-Nearest Neighbours uses the classifications of the k closest training instances.
A. CHAPTER 1: This chapter defines data mining, starting with the KDD (Knowledge Discovery in Databases) process: data from various data sources is integrated into a data store, prepared, and then mined for patterns, which are finally interpreted to arrive at the knowledge sought. -Datasets consist of examples (instances), which may be labelled (with a target attribute) or unlabelled (without one). -Within data mining there are two main types of machine learning: supervised learning, which includes classification, regression, and association rules; and unsupervised learning, which includes clustering (e.g. k-means).
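The labelled/unlabelled distinction above can be illustrated with a minimal sketch (the attribute names and values here are invented for illustration, not taken from the book):

```python
# A labelled dataset: each instance pairs attribute values with a target class.
labelled = [
    ({"outlook": "sunny", "temp": 30}, "play"),
    ({"outlook": "rain",  "temp": 18}, "no play"),
]

# An unlabelled dataset: the same instances with no target attribute,
# as used by unsupervised methods such as clustering.
unlabelled = [attrs for attrs, _ in labelled]
```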
B. CHAPTER 2: This chapter shows how the data itself can affect data mining: the types of variable (nominal, binary, ordinal, ...) -It also distinguishes categorical from continuous attributes, which prepares for a very important part, data preparation and data cleaning, including the removal or replacement of missing values. -It introduces the standard formulation of the data input to data mining algorithms that will be assumed throughout the book, goes on to distinguish between different types of variable, and considers issues relating to the preparation of data prior to use, particularly the presence of missing data values and noise. The UCI Repository of datasets is introduced.
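One common way to handle missing values mentioned above is to replace them with the mean of the values that are present. A minimal sketch (the function name `impute_missing` is an invented illustration, not the book's code):

```python
from statistics import mean

def impute_missing(values, missing=None):
    """Replace missing entries with the mean of the present values."""
    present = [v for v in values if v is not missing]
    fill = mean(present)  # mean of the non-missing values
    return [fill if v is missing else v for v in values]
```

For categorical attributes, a common alternative is to substitute the most frequent value instead of the mean.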
C. CHAPTER 3: In this part we discover supervised learning techniques, starting with the meaning of classification: many practical decision-making tasks can be formulated as classification problems, i.e. assigning people or objects to one of a number of categories, for example customers who are likely to buy or not buy a particular product in a supermarket. -The Naïve Bayes algorithm uses probability theory to find the most likely of the possible classifications. Probability appears in everyday life: the probability of an event, e.g. that the 6.30 p.m. train from London to your local station arrives on time, is a number from 0 to 1 inclusive, with 0 indicating 'impossible' and 1 indicating 'certain'. The Naïve Bayes algorithm gives us a way of combining the prior probability and the conditional probabilities in a single formula, which we can use to calculate the probability of each of the possible classifications in turn; having done this, we choose the classification with the largest value. -We also discussed the k-Nearest Neighbours algorithm, which is mainly used when all attribute values are continuous, although it can be modified to deal with categorical attributes. The idea is to estimate the classification of an unseen instance using the classifications of the instance or instances closest to it, in some sense that we need to define: first find the k training instances that are closest to the unseen instance, then take the most commonly occurring classification among these k instances.
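The Naïve Bayes procedure described above (prior probability multiplied by the conditional probability of each attribute value, then pick the largest product) can be sketched as follows. This is an illustrative implementation with an invented toy dataset, not the book's own code:

```python
from collections import Counter

def naive_bayes_classify(train, unseen):
    """train: list of (attribute_dict, label) pairs; unseen: attribute_dict.
    Returns the label maximising prior * product of conditional probabilities."""
    label_counts = Counter(label for _, label in train)
    best_label, best_score = None, -1.0
    for label, count in label_counts.items():
        score = count / len(train)  # prior probability of this classification
        rows = [attrs for attrs, l in train if l == label]
        for attr, value in unseen.items():
            matches = sum(1 for r in rows if r.get(attr) == value)
            score *= matches / count  # conditional probability P(value | label)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy training data (invented for illustration).
train = [
    ({"outlook": "sunny"}, "no"),
    ({"outlook": "sunny"}, "no"),
    ({"outlook": "rain"}, "yes"),
    ({"outlook": "overcast"}, "yes"),
]
```

In practice a smoothing term is usually added so that a single unseen attribute value does not force a probability of zero; this sketch omits it for clarity.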
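The k-Nearest Neighbours steps above (find the k closest training instances, then take the majority classification) can be sketched as follows, assuming Euclidean distance as the measure of closeness; the dataset is an invented toy example:

```python
import math
from collections import Counter

def knn_classify(train, unseen, k=3):
    """train: list of (feature_tuple, label) pairs; unseen: feature_tuple.
    Finds the k nearest training instances by Euclidean distance and
    returns their most common classification."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], unseen))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy training data: two clusters of continuous-valued instances.
points = [
    ((0, 0), "a"), ((0, 1), "a"),
    ((5, 5), "b"), ((6, 5), "b"), ((5, 6), "b"),
]
```

Choosing an odd k helps avoid ties in the majority vote for two-class problems.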