Algorithm:
a. For every point, find all the neighbor points within eps, and mark as
core points those with more than MinPts neighbors.
b. For each core point that is not already assigned to a cluster, create a
new cluster.
c. Recursively find all points that are density-connected to it and assign
them to the same cluster as the core point.
d. Iterate through the remaining unvisited points in the dataset. Those
points that do not belong to any cluster are noise.
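The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the names `dbscan`, `region_query` and `NOISE` are made up for this sketch, the expansion uses a work list instead of recursion, and a point is treated as a core point when its eps-neighborhood (counting the point itself) holds at least `min_pts` points, which is one common convention and may differ slightly from the "more than MinPts" wording above.

```python
from math import dist  # Euclidean distance, Python 3.8+

NOISE = -1  # hypothetical label for points that end up in no cluster


def region_query(points, i, eps):
    """Return indices of all points within eps of points[i] (including i)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id 0, 1, ... or NOISE."""
    labels = [None] * len(points)          # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:       # assumption: "at least min_pts"
            labels[i] = NOISE              # may become a border point later
            continue
        cluster += 1                       # step b: new cluster for this core point
        labels[i] = cluster
        queue = list(neighbors)            # step c: expand density-connected points
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster        # noise reachable from a core point is a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is itself a core point, keep expanding
    return labels


# Hypothetical toy data: two tight groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
```

With this toy data the two groups receive different cluster ids and the isolated point is labeled as noise.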
5. Write in detail about Support Vector Machines.
Support Vector Machine (SVM) is a supervised machine learning algorithm used
for both classification and regression. The objective of the SVM algorithm is
to find a hyperplane in an N-dimensional space that distinctly classifies the
data points.
The dimension of the hyperplane depends upon the number of features. If
the number of input features is two, then the hyperplane is just a line. If
the number of input features is three, then the hyperplane becomes a 2-D
plane. It becomes difficult to imagine when the number of features exceeds
three.
Let’s consider two independent variables x1, x2 and one dependent variable
which is either a blue circle or a red circle.
From the figure above it is clear that there are multiple lines that
segregate our data points, i.e. that classify them into red and blue
circles. So how do we choose the best line or, in general, the best
hyperplane that segregates our data points?
Selecting the best hyperplane:
One reasonable choice for the best hyperplane is the one that represents the
largest separation, or margin, between the two classes.
So we choose the hyperplane whose distance to the nearest data point on each
side is maximized. If such a hyperplane exists, it is known as the
maximum-margin hyperplane (hard margin). So from the above figure, we
choose L2.
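To make "distance to the nearest data point" concrete, the sketch below computes the margin of a candidate line and picks the candidate with the larger margin. The data points, the two candidate lines "A" and "B", and the function name `geometric_margin` are hypothetical and are not taken from the figure; a real SVM searches over all hyperplanes rather than a fixed list.

```python
from math import hypot


def geometric_margin(w, b, points, labels):
    """Smallest signed distance from any point to the line w[0]*x1 + w[1]*x2 + b = 0.

    Labels are +1 / -1. A negative result means the line misclassifies
    at least one point.
    """
    norm = hypot(w[0], w[1])
    return min(y * (w[0] * x1 + w[1] * x2 + b) / norm
               for (x1, x2), y in zip(points, labels))


# Hypothetical toy data: -1 and +1 stand in for the two circle colours.
points = [(1.0, 1.0), (2.0, 1.0), (4.0, 4.0), (5.0, 4.0)]
labels = [-1, -1, +1, +1]

# Two hypothetical separating lines, each given as (w, b).
candidates = {
    "A": ((1.0, 0.0), -3.0),   # vertical line x1 = 3
    "B": ((1.0, 1.0), -5.5),   # diagonal line x1 + x2 = 5.5
}

margins = {name: geometric_margin(w, b, points, labels)
           for name, (w, b) in candidates.items()}
best = max(margins, key=margins.get)
```

Both lines separate the toy data, but line "B" keeps every point farther away, so it is the better choice under the maximum-margin criterion.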
Let’s consider a scenario like the one shown below.
The SVM algorithm is able to ignore the outlier and find the hyperplane
that maximizes the margin. In this sense, SVM is robust to outliers.
For this type of data, SVM finds the maximum margin as it did for the
previous data sets and, in addition, adds a penalty each time a point
crosses the margin.
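One common way to write "maximum margin plus a penalty for points that cross it" is the hinge-loss objective. The sketch below assumes that formulation; the function name `soft_margin_objective`, the penalty weight `C` and the data are illustrative choices, not something the text specifies.

```python
def soft_margin_objective(w, b, points, labels, C=1.0):
    """0.5 * ||w||^2 + C * sum of hinge penalties max(0, 1 - y * (w.x + b)).

    The first term favours a wide margin; the second adds a penalty for
    every point that lies inside the margin or on the wrong side.
    """
    regulariser = 0.5 * sum(wi * wi for wi in w)
    penalty = 0.0
    for x, y in zip(points, labels):
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        penalty += max(0.0, 1.0 - y * score)
    return regulariser + C * penalty


# Hypothetical data: four well-separated points, then one outlier added.
points = [(1.0, 1.0), (2.0, 1.0), (4.0, 4.0), (5.0, 4.0)]
labels = [-1, -1, +1, +1]
w, b = (1.0, 0.0), -3.0                     # the line x1 = 3

clean = soft_margin_objective(w, b, points, labels)
with_outlier = soft_margin_objective(
    w, b, points + [(4.5, 4.5)], labels + [-1])
```

The outlier raises the objective through its penalty term only, so a single stray point does not force the separating line to move as long as the penalty is cheaper than shrinking the margin.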
Say our data is like that shown in the figure above. SVM solves this by
creating a new variable using a kernel. We call a point on the line xi and
create a new variable yi as a function of its distance from the origin o.
If we plot this, we get something like what is shown below.
In this case, the new variable y is created as a function of distance from
the origin. A non-linear function that creates a new variable is referred
to as a kernel.
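The idea of adding a variable that depends on the distance from the origin can be sketched as follows. The data and the names `lift` and `threshold` are hypothetical. Strictly speaking, this shows an explicit mapping to a higher-dimensional space; kernel functions used in practice obtain the same effect without computing the new coordinates directly.

```python
# Hypothetical 1-D data: the "inner" class sits near the origin,
# the "outer" class lies far from it on both sides, so no single
# cut point on the line separates them.
xs = [-4.0, -3.0, -1.0, 0.0, 1.0, 3.0, 4.0]
labels = ["outer", "outer", "inner", "inner", "inner", "outer", "outer"]


def lift(x):
    """Map a 1-D point to 2-D by adding y = x**2 (squared distance from the origin)."""
    return (x, x * x)


lifted = [lift(x) for x in xs]

# In the lifted space a horizontal line y = threshold separates the classes.
threshold = 5.0
predicted = ["outer" if y > threshold else "inner" for _, y in lifted]
```

After the mapping, every "inner" point has y of at most 1 and every "outer" point has y of at least 9, so a straight line in the (x, y) plane separates them.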
SVM Kernel:
The SVM kernel is a function that takes a low-dimensional input space and
transforms it into a higher-dimensional space, i.e. it converts a
non-separable problem into a separable problem. It is mostly useful in
non-linear separation problems. Simply put, the kernel performs some
extremely complex data transformations and then finds a way to separate
the data based on the labels or outputs defined.
6. Explain Regression techniques with examples.
- Numerical prediction is similar to classification
a. Construct a model
b. Use model to predict continuous or ordered value for a given input
- Prediction is different from classification
a. Classification predicts categorical class labels.
b. Prediction models continuous-valued functions.
- Major method for prediction: regression
a. Model the relationship between one or more independent or predictor
variables and a dependent or response variable.
- Types of Regression
a. Linear (simple regression and multiple regression)
b. Non-linear regression
c. Other regression methods: generalized linear model, Poisson
regression, log-linear models, regression trees
Linear Vs Non-Linear:
Linear Regression – The degree of a polynomial is 1 in linear regression.
Non-linear Regression – The degree of a polynomial > 1
Linear Regression Example:
Linear Regression: Involves a response variable y and a single predictor
variable x.
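$y = w_0 + w_1 x$
where $w_0$ (y-intercept) and $w_1$ (slope) are regression coefficients.
Method of least squares: estimates the best-fitting straight line
$w_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}$   (1)
$w_0 = \bar{y} - w_1 \bar{x}$   (2)
where $\bar{x}$ and $\bar{y}$ are the means of the x and y values.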
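A direct transcription of the two least squares formulas into Python might look like this. The function name `least_squares_line` and the four data points are hypothetical; the points were chosen to lie exactly on a known line so the result is easy to check.

```python
def least_squares_line(xs, ys):
    """Return (w0, w1) for the best-fitting line y = w0 + w1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator and denominator of the slope formula.
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    w1 = s_xy / s_xx
    # The intercept follows from the means and the slope.
    w0 = y_bar - w1 * x_bar
    return w0, w1


# Hypothetical data lying exactly on y = 2 + 3x.
w0, w1 = least_squares_line([0, 1, 2, 3], [2, 5, 8, 11])
```

For data that lie exactly on a line, the fit recovers that line's intercept and slope.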
Multiple linear regression: involves more than one predictor variable
a. Training data is of the form (X1, y1), (X2, y2), . . ., (X|D|, y|D|)
b. Ex. For 2-D data, we may have: y = w0 + w1x1 + w2x2
c. Solvable by an extension of the least squares method or by using
statistical software such as SAS or S-Plus
d. Many nonlinear functions can be transformed into the above.
Example: Straight-line regression using the method of least squares. Table
6.7 shows a set of paired data where x is the number of years of work
experience of a college graduate and y is the corresponding salary of the
graduate.
The 2-D data can be graphed on a scatter plot, as in Figure 6.26. The plot
suggests a linear relationship between the two variables, x and y
We model the relationship that salary may be related to the number of
years of work experience with the equation y = w0 + w1x.
Given the above data, we compute $\bar{x} = 9.1$ and $\bar{y} = 55.4$.
Substituting these values into equations 1 and 2, we get $w_1 \approx 3.5$
and $w_0 = 55.4 - (3.5)(9.1) \approx 23.6$.
Thus, the equation of the least squares line is estimated by y = 23.6 + 3.5x.
Using this equation, we can predict that the salary of a college graduate
with, say, 10 years of experience is $58,600.
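The arithmetic of this example can be checked with a few lines of Python. Only the figures stated in the text are used (the means, the slope 3.5 and the rounded intercept 23.6); the data table itself is not reproduced here. The name `predict_salary` is hypothetical, and salary is taken to be in thousands of dollars, as the $58,600 figure implies.

```python
x_bar, y_bar = 9.1, 55.4      # means stated in the text
w1 = 3.5                      # slope of the fitted line stated in the text

# Intercept from the means and the slope, before rounding.
w0_exact = y_bar - w1 * x_bar  # about 23.55, reported as 23.6 in the text

w0 = 23.6                      # rounded intercept used in the text


def predict_salary(years):
    """Predicted salary in thousands of dollars for the line y = 23.6 + 3.5x."""
    return w0 + w1 * years


prediction = predict_salary(10)   # about 58.6, i.e. $58,600
```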
Non-Linear Regression:
- Some nonlinear models can be modeled by a polynomial function
- A polynomial regression can be transformed into linear regression model.
For example,
$y = w_0 + w_1 x + w_2 x^2 + w_3 x^3$
Convertible to linear with new variables: $x_2 = x^2$, $x_3 = x^3$
$y = w_0 + w_1 x + w_2 x_2 + w_3 x_3$
- Other functions, such as power functions, can also be transformed into a
linear model.
- Some models are intractably nonlinear.
a. It may still be possible to obtain least squares estimates through
extensive calculation on more complex formulae.
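The transformation of a cubic polynomial into a linear model can be sketched as below: each x is expanded into the new variables (x, x^2, x^3), and the weights are then found with ordinary multiple linear regression. Solving the normal equations by Gaussian elimination is one possible method, chosen here to keep the sketch free of external libraries; the function names and the data are hypothetical.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):                   # back substitution
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def fit_linear(rows, ys):
    """Least squares weights [w0, w1, ...] for y = w0 + w1*z1 + w2*z2 + ..."""
    X = [[1.0] + list(r) for r in rows]              # leading 1 for the intercept
    k = len(X[0])
    XtX = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
    Xty = [sum(x[i] * y for x, y in zip(X, ys)) for i in range(k)]
    return solve(XtX, Xty)


def fit_cubic(xs, ys):
    """Fit y = w0 + w1*x + w2*x^2 + w3*x^3 as a linear model in new variables."""
    rows = [(x, x ** 2, x ** 3) for x in xs]         # x2 = x^2, x3 = x^3
    return fit_linear(rows, ys)


# Hypothetical data generated from y = 1 + 2x - x^2 + 0.5x^3.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [1 + 2 * x - x ** 2 + 0.5 * x ** 3 for x in xs]
weights = fit_cubic(xs, ys)
```

The same `fit_linear` helper also covers plain multiple linear regression: pass the predictor values directly instead of powers of x.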
7. What is outlier detection? Explain distance-based outlier
detection.
- What are outliers?
a. A set of objects that are considerably dissimilar from the remainder of
the data.
- Problem: Define and find outliers in large data sets
- Applications:
a. Credit card fraud detection
b. Telecom fraud detection
c. Customer segmentation
d. Medical analysis
Statistical Approaches:
b. Divisive:
Divisive hierarchical clustering is precisely the opposite of
agglomerative hierarchical clustering. In divisive hierarchical
clustering, we start with all of the data points in a single cluster and,
in every iteration, we separate out the data points that are not similar
to the rest of their cluster. In the end, we are left with N clusters.
[Figures: split on feature X, split on feature Y, split on feature Z]
From the above images, we can see that the information gain is maximum
when we split on feature Y. So the best-suited feature for the root node
is feature Y. We can also see that when the dataset is split by feature Y,
each child contains a pure subset of the target variable, so we don’t need
to split the dataset further. The final tree for the above dataset
therefore consists of a single split on feature Y.
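The information-gain comparison used to pick the root feature can be sketched as follows. The labels and the two example splits are hypothetical and are not the dataset from the figures; entropy with base-2 logarithms is assumed as the impurity measure.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


# Hypothetical target values and two candidate splits.
parent = ["yes"] * 4 + ["no"] * 4
pure_split = [["yes"] * 4, ["no"] * 4]                 # every child is pure
mixed_split = [["yes", "yes", "no", "no"],
               ["yes", "yes", "no", "no"]]             # children as mixed as the parent

gain_pure = information_gain(parent, pure_split)
gain_mixed = information_gain(parent, mixed_split)
```

A split whose children are pure removes all uncertainty and so has the highest possible gain, while a split that leaves the class mix unchanged has zero gain.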
10. Explain Spatial Data Mining.
- Geometric, geographic or spatial data: space-related data
a. Example: Geographic space, VLSI design, model of human brain, 3-D
space representing the arrangement of chains of protein molecule.
- Spatial database system vs. image database systems.
a. Image database system: handling digital raster image, may also
contain techniques for object analysis and extraction from images and
some spatial database functionality.
b. Spatial database system: handling objects in space that have identity
and well-defined extents, locations and relationships.
- Spatial databases contain spatial-related information.
a. Ex:- Geographic databases (maps), VLSI, CAD design.
- Spatial databases may be represented in raster format, which consists of
n-dimensional bit maps or pixel maps.
a. Ex:- weather forecasting data, which is generally represented in raster
format, where each pixel registers the rainfall in a given area.
- Data mining techniques can be used to describe the characteristics of
particular locations where rainfall is heavy or light.
- A spatial database that stores spatial objects that change over time
is called a spatiotemporal database.
GIS (Geographic Information System)
- GIS
a. Analysis and visualization of geographic data
- Common analysis functions of GIS
a. Search
b. Location analysis
c. Terrain analysis
d. Flow analysis
e. Distribution
f. Spatial analysis/statistics
g. Measurements
Modeling Spatial Objects:
- Two important alternative views
a. Single objects: distinct entities arranged in space each of which has
its own geometric description
Ex: modeling cities, forests, rivers
b. Spatially related collection of objects: describe space itself
Ex: modeling land use, partition of a country into districts
Example: British Columbia Weather Pattern Analysis
- Input:
a. A map with about 3,000 weather probes scattered in B.C.
b. Daily data for temperature, precipitation, wind velocity, etc.
c. Data warehouse using star schema
- Output:
a. A map that reveals patterns: merged (similar) regions
- Goals:
a. Interactive analysis
b. Fast response time
c. Minimizing storage space used
- Challenge:
a. A merged region may contain hundreds of “primitive” regions
Star Schema of the BC Weather Warehouse
11. Explain Text Mining.
- Text databases are databases that contain word descriptions for
objects. These descriptions are usually long sentences, paragraphs, or
documents. Ex:- summary reports, product specifications, any other documents.
- Text databases are rapidly growing due to the increasing amount of
information available in electronic form, such as electronic publications,
various kinds of electronic documents, e-mail and the World Wide Web.
- Data stored in most text databases are semi-structured, in that
they are neither completely unstructured nor completely structured.
Text Data Analysis and Information Retrieval
- Information retrieval (IR) is a field that has been developing in
parallel with database systems for many years. Database systems have
focused on query and transaction processing of structured data.
- Information retrieval is concerned with the organization and retrieval of
information from a large number of text-based documents.
- Some database system problems are usually not present in information
retrieval systems, such as concurrency control, transaction management,
and update.
- A typical information retrieval problem is to locate relevant documents in
a document collection based on a user’s query, which is often some
keywords describing an information need, although it could also be an
example relevant document.
- Basic Measure for Text Retrieval: Precision and Recall
Precision: This is the percentage of retrieved documents that are in fact
relevant to the query. It is formally defined as
Precision = |{Relevant} ∩ {Retrieved}|/|{Retrieved}|
Recall: This is the percentage of documents that are relevant to the query
and were, in fact, retrieved. It is formally defined as
Recall = |{Relevant} ∩ {Retrieved}|/|{Relevant}|
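The two set-based definitions translate directly into Python set operations. The function name `precision_recall` and the document ids are hypothetical; returning 0.0 when a set is empty is a convention chosen for this sketch.

```python
def precision_recall(relevant, retrieved):
    """Return (precision, recall) for two collections of document ids."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = len(relevant & retrieved)          # |{Relevant} ∩ {Retrieved}|
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query result: 4 relevant documents exist, 3 were retrieved,
# and 2 of the retrieved documents are relevant.
relevant = {"d1", "d2", "d3", "d4"}
retrieved = {"d2", "d3", "d5"}
precision, recall = precision_recall(relevant, retrieved)
```

Here precision is 2/3 (two of the three retrieved documents are relevant) and recall is 2/4 (two of the four relevant documents were retrieved).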
An information retrieval system often needs to trade off recall for
precision or vice versa. One commonly used trade-off is the F-score, which
is defined as the harmonic mean of recall and precision: