7B-Data Handling and BI 21 Part 2

Data Handling and Business Intelligence
Table of Contents
Part 2
2.1 - Clustering analysis of SmileClinic Dataset
2.2 - Most common Data Mining Methods used in Businesses
2.3 - Advantages/Disadvantages of SPSS over Excel
References
Part 2:
1. How many clinic customers eat rice?

In the figure below, 0 represents customers who do not eat rice and 1 represents
customers who do eat rice.
Therefore, 60% of customers (60 of the 100 in the dataset) eat rice and the remaining
40% do not.
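The percentages can be computed directly from the 0/1 coding. The following is a minimal sketch in Python, assuming a hypothetical list `eats_rice` that stands in for the dataset column; the five values are illustrative and are not the clinic data.

```python
from collections import Counter

def category_shares(values):
    """Return {category: percentage of observations} for a coded column."""
    counts = Counter(values)
    total = len(values)
    return {category: 100 * n / total for category, n in counts.items()}

# Hypothetical sample, NOT the SmileClinic data: 1 = eats rice, 0 = does not.
eats_rice = [1, 0, 1, 1, 0]
shares = category_shares(eats_rice)
print(shares)
```

The same helper would work for any other coded column, such as gender.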
2. How many customers are male and female?

In the figure below, 1 represents female customers and 2 represents male customers.
We infer that the ratio of male to female customers is 1:1.
3. What is the Mean/Median Age of customers?
The mean age of customers is 20.35, whereas the median age is 19.
4. What is the Mean/Median Age of customers who eat rice?
The mean age of customers who eat rice is 20.22, whereas their median age is 19.
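Both pairs of figures come from the same calculation, once over all customers and once over the rice eaters only. The following is a minimal sketch using Python's standard `statistics` module. The `customers` records are hypothetical and are not the clinic data.

```python
from statistics import mean, median

# Hypothetical records, NOT the SmileClinic data: (age, eats_rice).
customers = [(18, 1), (19, 0), (19, 1), (22, 1), (27, 0)]

ages = [age for age, _ in customers]
# Keep only customers coded 1 (eats rice) before summarising.
rice_ages = [age for age, eats_rice in customers if eats_rice == 1]

overall = (mean(ages), median(ages))
rice_only = (mean(rice_ages), median(rice_ages))
print("all customers:", overall)
print("rice eaters:  ", rice_only)
```

A mean above the median, as reported for the clinic data, usually suggests that a few older customers pull the average upwards.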
2.1 - Clustering analysis of SmileClinic Dataset.
In cluster analysis, we are given a set of observations and use an algorithm to find groups
in which the observations are more similar to one another than to observations in other
groups. Similarity can be measured on any factor, quality or criterion that the clustering
algorithm uses or that we set before execution begins. Since our goal is to find clusters of
homogeneous observations, the distinction between dependent and independent variables is of
very little value to us. The analysis can be carried out even if we do not have access to the
output (label) of each observation. Hence it falls under the category of unsupervised
machine learning algorithms.
The smile_clinic dataset we are provided with holds details of 100 patients: their names,
age, gender and whether or not they eat rice. We are required to find clusters of similar
observations based on gender and age. There are many clustering algorithms we could use for
the analysis, but we choose hierarchical clustering.
In hierarchical clustering, similar data points are grouped into clusters step by step. The
number of clusters can range from 1 to the total number of observations under consideration.
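To make the merging process concrete, here is a minimal sketch of agglomerative clustering in plain Python. It is not the SPSS implementation. It assumes single linkage (nearest neighbour) and squared Euclidean distance, and the function names and the five `points` are hypothetical.

```python
from itertools import combinations

def squared_euclidean(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def cluster_distance(points, cluster_a, cluster_b):
    """Single linkage: distance between the closest pair across two clusters."""
    return min(
        squared_euclidean(points[i], points[j])
        for i in cluster_a
        for j in cluster_b
    )

def agglomeration_schedule(points):
    """Merge the two closest clusters repeatedly until one cluster remains.

    Returns one (stage, cluster_a, cluster_b, coefficient) row per merge,
    where the coefficient is the distance at which the merge happened.
    """
    clusters = [[i] for i in range(len(points))]  # every point starts alone
    schedule = []
    while len(clusters) > 1:
        coefficient, a, b = min(
            (cluster_distance(points, clusters[a], clusters[b]), a, b)
            for a, b in combinations(range(len(clusters)), 2)
        )
        schedule.append((len(schedule) + 1, clusters[a], clusters[b], coefficient))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a is still valid
    return schedule

# Hypothetical (age, gender code) points, NOT the SmileClinic data.
points = [(18, 1), (19, 1), (30, 2), (31, 2), (50, 1)]
schedule = agglomeration_schedule(points)
for row in schedule:
    print(row)
```

With n points the schedule always has n - 1 rows. Because age spans a much wider range than a 1/2 gender code, age dominates the distance in this sketch, and in practice the variables are often standardised first.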
We then find the appropriate number of clusters by looking at the coefficient value in
the agglomeration schedule table.
1. We load the CSV file in SPSS and click on Analyze > Classify > Hierarchical Cluster.
2. We select the variables to perform clustering on (age and gender).
3. We click on the Statistics button and see that the default settings already meet our
requirements, so we make no changes.
4. Then we click on the Plots button, check the Dendrogram option and disable the
icicle plot.
5. Next we click on the Method button to select the methods used in our analysis.
- We set the cluster method to Nearest neighbour (single linkage).

- For the measure, we select the Interval option and keep the default
of Squared Euclidean distance.
6. Then we click Continue, and then OK in the parent window, to start the analysis.
7. Once the analysis is done, a new window appears with the results.
8. We look at the agglomeration schedule table to obtain the number of clusters.
9. We see that the algorithm takes the agglomerative approach, i.e. it starts with each of
the 100 observations in its own cluster and merges two clusters at every stage, so there
are 99 stages and the number of clusters falls from 100 to 1.
10. Looking at the table, we find two noticeable jumps in the coefficient column.
a. The first occurs in the transition from stage 81 to 82, which gives
us 100 - 81 = 19 clusters.
b. The next occurs in the transition from stage 95 to 96, which suggests
100 - 95 = 5 clusters.
11. The dendrogram also shows 5 clusters at the major level and 19 at the minor level.
12. We take 5 to be an appropriate number of clusters based on the evidence from the
table and the dendrogram.
13. Hence, in the smile_clinic dataset, we can create 5 groups of customers based
on their age and gender. The customers within each group are similar on these two
factors.
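The reading of the agglomeration schedule in steps 9 to 12 can be sketched in Python as follows. With n observations, n - stage clusters remain after a given stage, and a large increase in the coefficient marks a natural place to stop merging. The function names and the `coefficients` column are hypothetical. Only the stage numbers 81 and 95 come from the analysis above.

```python
def clusters_after_stage(n_observations, stage):
    """Clusters remaining once `stage` merges have been carried out."""
    return n_observations - stage

def jumps(coefficients):
    """Increase in the coefficient from each stage to the next.

    Returns (stage_before_jump, increase) pairs, largest increase first.
    """
    increases = [
        (stage, later - earlier)
        for stage, (earlier, later) in enumerate(
            zip(coefficients, coefficients[1:]), start=1
        )
    ]
    return sorted(increases, key=lambda item: item[1], reverse=True)

# Stage numbers reported in the text, for 100 observations.
print(clusters_after_stage(100, 81))  # jump between stages 81 and 82
print(clusters_after_stage(100, 95))  # jump between stages 95 and 96

# Hypothetical coefficient column for 6 observations (5 stages),
# NOT taken from the SPSS output.
coefficients = [1.0, 1.5, 2.0, 9.0, 10.0]
stage, increase = jumps(coefficients)[0]
print(stage, increase, clusters_after_stage(6, stage))
```

Picking the largest jump automatically is only a heuristic. As in the steps above, the dendrogram is worth checking before settling on a number of clusters.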
2.2 - Most common Data Mining Methods used in Businesses:
Today, data is one of the most important assets a company can have: data about its
customers, their habits, their likes and so on. The more data a company has, the better
placed it is to turn that data into actionable insights and use them to serve its
clients/users better.
But to turn that data into something that can be acted upon, the company first needs to
analyse it using a suitable technique. Some of the most common data mining
techniques are as follows:
1. Classification analysis: Companies use classification analysis to assign customers to
predefined groups based on their characteristics and then serve them accordingly.
Ex: Banks use classification analysis to determine which customers to offer a
loan to.
2. Association rule learning: This technique helps businesses find which products a
customer is most likely to buy given that they have bought some other product. It helps
companies understand users' needs and increase sales.
3. Regression analysis: In regression analysis, predictor variables are used to
predict the value of a response variable. It studies how changes in the predictors
drive changes in the response variable.
Ex: Predicting the number of calls made before a successful purchase.
4. Clustering analysis: In clustering, observations are grouped into clusters based on
the degree of similarity in their characteristics. It allows us to group homogeneous
observations based on the patterns they exhibit, in order to gain better insights.
Ex: Identifying fake news on the internet based on keywords.
5. Anomaly/outlier detection: This technique helps us find anomalies in datasets.
Outliers are sometimes treated as noise and can distort an analysis if they are not dealt
with beforehand, pulling the output away from the actual insights.
Ex: System health monitoring is used in computer systems to keep
computational resources in check and alert the responsible staff if suspicious activity is
observed.
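As an illustration of the last technique, here is a minimal sketch of one common outlier rule, the modified z-score based on the median and the median absolute deviation (MAD). The constant 0.6745 and the threshold 3.5 are conventional choices and are assumptions here, not taken from the text above. The `cpu_load` readings are hypothetical.

```python
from statistics import median

def mad_outliers(values, threshold=3.5):
    """Return the values whose modified z-score exceeds `threshold`.

    The median and MAD are used because they are less affected by the
    outliers themselves than the mean and standard deviation are.
    """
    centre = median(values)
    mad = median([abs(v - centre) for v in values])
    if mad == 0:
        return []  # no spread to measure against; this sketch gives up
    return [v for v in values if abs(0.6745 * (v - centre) / mad) > threshold]

# Hypothetical CPU-load readings in percent, NOT real monitoring data.
cpu_load = [10, 11, 9, 10, 12, 11, 10, 95]
outliers = mad_outliers(cpu_load)
print(outliers)
```

In a monitoring setting, a flagged reading like this would typically trigger an alert rather than be silently dropped.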
2.3 - Advantages/Disadvantages of SPSS over Excel:
- SPSS stands for Statistical Package for the Social Sciences and is software used for
statistical analysis of data, whereas Microsoft Excel is spreadsheet software that
serves the more general purpose of data entry and quick manipulation.
- In terms of usage, SPSS is oriented towards professionals who want to perform
advanced statistical operations and draw insights from large datasets, whereas Excel
is general-purpose spreadsheet software used by everyday users and
professionals alike for small data manipulation tasks.
- SPSS is more focused on speed and performance, whereas Excel helps us reduce
data redundancy.
- SPSS allows us to batch process large volumes of data and draw inferences, whereas
Excel is quite limited in the volume of data it can process at once.
- The main aim of SPSS is to carry out data manipulation techniques efficiently to obtain
good results, whereas Excel primarily focuses on safe handling and storage of data.
- SPSS uses batch processing to process data, whereas Excel allows us to use formulas
to manipulate data flexibly.
References
https://exceljet.net/formula/get-year-from-date
https://exceljet.net/excel-functions/excel-sort-function
https://www.excel-easy.com/data-analysis/pivot-tables.html
https://www.displayr.com/what-is-hierarchical-clustering/
https://www.educba.com/spss-vs-excel/
https://towardsdatascience.com/5-data-mining-techniques-businesses-need-to-know-about-20fd723800b2