Clustering and Factor Analysis in R

Methodology
The plan was to collect all the survey data first and then analyze it in R.
Data Cleaning:
Eyeballing the data with the team at the outset helped identify a few flaws. There were a few
missing cells, and the female:male ratio column was recorded inconsistently: in some places it
was written in strict “:” ratio form, while in others “/” or “;” was used as the separator.
Data cleaning was therefore the obvious first step.
We first imported the data from a .csv file and dropped the columns that had no numerical
significance, such as Respondent_ID, collector_ID, start_date, End_date and IP Address. The
meaningful columns were 10 to 27, so only these were kept. Because a discrepancy between the
1st and 2nd rows affected the column labels, the 2nd row was deleted altogether and the columns
were given fresh names. The female:male ratio problem was handled by creating a new proxy
variable, “sex ratio”, in which the values were converted from ratio notation to numeric
values. Finally, every row containing an “NA” was removed completely: imputing a mean or mode
is hard to justify because an individual's response could have been very different. Removing
the complete rows keeps the survey data sound.
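The ratio conversion and row removal described above can be sketched roughly as follows. This is a minimal illustration and not the code used in the study: the data frame `raw`, the columns `fm_ratio` and `q1`, the helper `ratio_to_numeric` and all values are hypothetical, and the proxy variable is written as `sex_ratio` here.

```r
# Hypothetical raw responses; column names and values are illustrative only.
raw <- data.frame(
  fm_ratio = c("1:2", "3/4", "2;1", NA, "1:1"),
  q1       = c(4, 5, NA, 3, 2),
  stringsAsFactors = FALSE
)

# Convert a "female:male" string written with ":", "/" or ";" into a number.
# Anything that does not split into two valid numbers becomes NA.
ratio_to_numeric <- function(x) {
  parts <- strsplit(x, "[:/;]")
  vapply(parts, function(p) {
    if (length(p) != 2 || anyNA(p)) return(NA_real_)
    nums <- suppressWarnings(as.numeric(trimws(p)))
    if (anyNA(nums) || nums[2] == 0) return(NA_real_)
    nums[1] / nums[2]
  }, numeric(1))
}

raw$sex_ratio <- ratio_to_numeric(raw$fm_ratio)

# Keep only the numeric columns and drop every row that contains an NA.
clean <- na.omit(raw[, c("sex_ratio", "q1")])
```

Dropping rows with `na.omit()` mirrors the choice above to remove incomplete responses rather than impute them.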
Factor Analysis:
The variables in the data were essentially the questions asked in the survey. We wanted to
know which hidden attributes were responsible for the responses. Factor analysis was therefore
carried out: the KMO test checked whether the data were suitable for factoring, the scree plot
indicated the number of factors, and the loadings of the variables onto those factors were
then computed. The factors were given names. Factor analysis helps us club the numerous
variables into a few, which makes further analysis easier.
Cluster Analysis:
Since we wanted to know who was actually afraid of statistics, we went a step further. After
finding the factors, we applied clustering to identify which groups of people could be bad at
statistics and which could be good. These groups could be used as target areas for future
analysis.
Results and Insights
The data was cleaned using a mixture of loops and string operations, and a new variable called
“sex ratio” was created for use in the factor analysis. A total of 18 variables were used in
the factor analysis. First the correlation matrix was computed with the correlation function,
“corr <- cor(Assmnt_sub)”, followed by a KMO test. The results of the KMO test were
encouraging: the value for every variable was above 0.5, as shown below.
The overall KMO is 0.82, hence we are good to proceed.
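For readers who want to see what the KMO measure computes, here is a rough sketch written in base R so it has no package dependencies. The names `Assmnt_sub` and `corr` come from the text above, but the data here is simulated, the helper `kmo` is hypothetical, and the resulting numbers have nothing to do with the survey. In practice a packaged implementation such as `KMO()` from the psych package could be used instead.

```r
# Simulated stand-in for the survey data: four items driven by one latent trait.
set.seed(1)
n <- 200
f <- rnorm(n)
Assmnt_sub <- data.frame(
  v1 = f + rnorm(n), v2 = f + rnorm(n),
  v3 = f + rnorm(n), v4 = f + rnorm(n)
)
corr <- cor(Assmnt_sub)

# KMO compares squared correlations with squared partial correlations.
# Values near 1 suggest the variables share enough variance to factor.
kmo <- function(r) {
  r_inv <- solve(r)
  # Partial correlations derived from the inverse correlation matrix
  p <- -r_inv / sqrt(outer(diag(r_inv), diag(r_inv)))
  diag(r) <- 0
  diag(p) <- 0
  list(
    overall      = sum(r^2) / (sum(r^2) + sum(p^2)),
    per_variable = colSums(r^2) / (colSums(r^2) + colSums(p^2))
  )
}

res <- kmo(corr)
```

`res$overall` corresponds to the overall figure and `res$per_variable` to the per-variable values that are compared against the 0.5 threshold.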
We then proceeded with the scree plot and found only two points above the cutoff line at an
eigenvalue of 1, which indicated two underlying factors. Next the factor loadings were
calculated using the “fa” function, and the two factors that came up had characteristics
related to:
1. Statistics courses taken up to now, the number of students in the class, how many
times a week one studies statistics etc. These all relate to personal capabilities, be it
concentrating in class, doing self-study or showing self-interest, hence this factor was
named “self_effort”.
2. The second factor that emerged had values more related to the instructors: how their
delivery was, whether the students were able to follow, whether the instructor explained
using diagrams, how many people you think are better than you in class, whether the reading
materials provided were helpful, how comfortable you are with maths, how comfortable you are
with a programming language etc. These all indicated “outside_effort”.
The loadings were negative for a few variables, which indicated that the reverse of the
given question was affecting the factor. The loadings were rotated using the varimax
method for a better fit with the factors.
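The scree-plot cutoff and the rotated loadings can be illustrated with a small sketch. Note the assumptions: the data is simulated, the item names are hypothetical, and base R's `factanal()` is swapped in for the “fa” function mentioned above so that the example needs no extra packages. It is not the analysis that produced the results in this report.

```r
# Simulated items: a1-a3 follow one latent factor, b1-b3 another.
set.seed(42)
n  <- 300
f1 <- rnorm(n)
f2 <- rnorm(n)
survey <- data.frame(
  a1 =  f1 + rnorm(n, sd = 0.6),
  a2 =  f1 + rnorm(n, sd = 0.6),
  a3 = -f1 + rnorm(n, sd = 0.6),  # reverse-worded item, expect a negative loading
  b1 =  f2 + rnorm(n, sd = 0.6),
  b2 =  f2 + rnorm(n, sd = 0.6),
  b3 =  f2 + rnorm(n, sd = 0.6)
)

# Eigenvalues of the correlation matrix are what a scree plot displays;
# counting those above 1 applies the cutoff-line rule.
eig <- eigen(cor(survey))$values
n_factors <- sum(eig > 1)
# plot(eig, type = "b"); abline(h = 1, lty = 2)  # draws the scree plot

# Maximum-likelihood factor analysis with varimax rotation.
fit   <- factanal(survey, factors = n_factors, rotation = "varimax")
loads <- unclass(fit$loadings)  # plain matrix: items in rows, factors in columns
```

In `loads`, the reverse-worded item `a3` takes the opposite sign from `a1` and `a2` on their shared factor, which is the pattern the paragraph above describes for negative loadings.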
Two new variables with the names stated above were then created to represent the factors
in further analysis. Each was computed by simply averaging the variables that load on the
factor, with a variable entering negatively when its loading was negative. The respondents
were then clustered with the K-means method on the basis of the two newly created
variables. The optimal number of clusters was found using the elbow method. As visible
from the graph below, a total of 4 clusters were to be formed.
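A sketch of the signed averaging and the elbow search follows, under stated assumptions: the items `s1` to `s3` and `o1`, `o2`, their loading signs and the helper `signed_mean` are all hypothetical, and the data is random noise, so the clusters carry no meaning. Only the names `self_effort` and `outside_effort` and the choice of 4 clusters are taken from the text.

```r
# Simulated item responses; names and loading signs are illustrative only.
set.seed(7)
n <- 200
items <- data.frame(
  s1 = rnorm(n), s2 = rnorm(n), s3 = rnorm(n),
  o1 = rnorm(n), o2 = rnorm(n)
)
self_signs    <- c(s1 = 1, s2 = 1, s3 = -1)  # s3 assumed to load negatively
outside_signs <- c(o1 = 1, o2 = -1)          # o2 assumed to load negatively

# Average the items of one factor, flipping those with a negative loading.
signed_mean <- function(df, signs) {
  as.vector(as.matrix(df[, names(signs)]) %*% signs) / length(signs)
}

scores <- data.frame(
  self_effort    = signed_mean(items, self_signs),
  outside_effort = signed_mean(items, outside_signs)
)

# Elbow method: total within-cluster sum of squares for k = 1..8.
wss <- sapply(1:8, function(k) {
  kmeans(scores, centers = k, nstart = 20)$tot.withinss
})
# plot(1:8, wss, type = "b")  # look for the bend in this curve

# Final clustering with the number of clusters read off the elbow plot.
km <- kmeans(scores, centers = 4, nstart = 20)
```

`km$cluster` holds the group label of each respondent, and `km$centers` gives the average `self_effort` and `outside_effort` of each group, which is what the interpretation of the clusters relies on.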
Clustering showed that there are 4 distinct groups. The purple group may be those who are
good at statistics, as they respond positively to outside effort and also put in
self-effort. Those in blue, i.e. with low self-effort, may be the weakest of the batch and
must be motivated to put in more self-effort in order to improve.
One interesting finding was that although the scree plot for the factor analysis showed
two points above the cutoff, the split was better when the factoring was done with 3
factors. Three clear groups then emerged: self effort, outside effort and class
concentration. These could help explain the variables better.
