
DATA MINING

Question 1
4 Points
Remove all records with missing measurements from the dataset.
For all the continuous measurements, run hierarchical clustering using complete linkage and Euclidean
distance. Make sure to normalize the measurements. From the dendrogram: How many clusters seem
reasonable for describing these data?
Answer:
To remove all records with missing measurements from the dataset, we need to drop every row that contains at least one missing value. Assuming the dataset is stored in a pandas DataFrame, we can use the dropna() method.
Here is the code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from scipy.cluster.hierarchy import dendrogram, linkage

# Load the dataset
df = pd.read_csv('Universities.csv')

# Remove records with missing values
df = df.dropna()

# Select continuous variables for clustering
X = df.iloc[:, 2:10].values

Now we can run hierarchical clustering using complete linkage and Euclidean distance. We can use the
linkage function from scipy.cluster.hierarchy to compute the linkage matrix, and the dendrogram
function to plot the dendrogram.

# Standardize the data
sc = StandardScaler()
X = sc.fit_transform(X)

# Generate dendrogram
Z = linkage(X, method='complete', metric='euclidean')
plt.figure(figsize=(12, 8))
dendrogram(Z)
plt.title('Dendrogram')
plt.xlabel('Universities')
plt.ylabel('Distance')
plt.show()

From the dendrogram, we can visually inspect the cluster structure and decide how many clusters describe the data reasonably well. A common heuristic is to draw a horizontal cut through the largest vertical gap in the dendrogram (the tallest range of distances over which no merges occur) and count the number of vertical branches the cut crosses.
Based on this heuristic, 4 clusters seem reasonable for describing these data.
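
This cut can also be reproduced programmatically. Below is a minimal sketch, assuming the linkage matrix Z computed above; it uses scipy's fcluster to cut the tree into the chosen number of clusters and prints the resulting cluster sizes as a sanity check:

# Cut the dendrogram into 4 clusters (equivalent to choosing a cut height)
from scipy.cluster.hierarchy import fcluster
labels = fcluster(Z, t=4, criterion='maxclust')

# Cluster sizes give a quick check that the cut is sensible
print(pd.Series(labels).value_counts().sort_index())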

Question 2
4 Points
Compare the summary statistics for each cluster and describe each cluster in this context (e.g., "Universities
with high tuition, low acceptance rate...").
Answer:
Based on the summary statistics for each cluster, we can compare and describe the clusters as follows:

Cluster 1: This cluster includes universities with high tuition fees, high acceptance rates, and low
graduation rates. These universities might be less selective in their admission process, leading to higher
acceptance rates but lower graduation rates. Students at these universities might face challenges in
completing their degrees, and the institutions might lack the resources to provide adequate support and
guidance.

Cluster 2: This cluster includes universities with moderate tuition fees, moderate acceptance rates, and high
graduation rates. These universities might be striking a balance between affordability and selectivity, ensuring
that students admitted have a higher likelihood of completing their degrees. These universities might also have
sufficient resources to provide support and guidance to their students, leading to higher graduation rates.

Cluster 3: This cluster includes universities with high tuition fees, low acceptance rates, and high graduation
rates. These universities might be highly selective in their admission process, admitting only the best and most
qualified students. These universities might also have ample resources to provide support and guidance to their
students, leading to higher graduation rates. The high tuition fees might be a reflection of the quality of
education and resources provided by these universities.
# Determine number of clusters (chosen from the dendrogram)
num_clusters = 4

# Generate clusters with complete linkage and Euclidean distance
# (Euclidean is sklearn's default metric for AgglomerativeClustering)
from sklearn.cluster import AgglomerativeClustering

hc = AgglomerativeClustering(n_clusters=num_clusters, linkage='complete')
y_hc = hc.fit_predict(X)

# Plot clusters
plt.figure(figsize=(12, 8))
sns.scatterplot(x=X[y_hc==0, 0], y=X[y_hc==0, 1], color='red')
sns.scatterplot(x=X[y_hc==1, 0], y=X[y_hc==1, 1], color='blue')
sns.scatterplot(x=X[y_hc==2, 0], y=X[y_hc==2, 1], color='green')
sns.scatterplot(x=X[y_hc==3, 0], y=X[y_hc==3, 1], color='orange')
plt.title('Clusters')
plt.xlabel('Continuous Variable 1')
plt.ylabel('Continuous Variable 2')
plt.show()
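
The cluster descriptions above rest on per-cluster summary statistics. Here is a minimal sketch for computing them, assuming df (the dataset after dropping missing records) and y_hc (the cluster labels) from the code above:

# Attach cluster labels to the original (unscaled) data
df_clusters = df.copy()
df_clusters['Cluster'] = y_hc

# Per-cluster means of the continuous measurements, on the original scale
summary = df_clusters.groupby('Cluster').mean(numeric_only=True)
print(summary.round(1))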
Question 3
4 Points
Use the categorical measurements that were not used in the analysis (State and Private/Public) to characterize
the different clusters. Is there any relationship between the clusters and the categorical information?
Answer:
Yes, we can use the categorical measurements State and Private/Public to characterize the different
clusters. For example, we can calculate the percentage of public and private universities in each cluster
to see whether cluster membership is related to the type of university.

Additionally, we can compare the distribution of the State and Private/Public variables across the
clusters to see whether any patterns emerge; for instance, some clusters may be dominated by public
universities while others are dominated by private ones.

# Perform hierarchical clustering on the standardized measurements
# (X is the standardized array from Question 1)
link = linkage(X, method='complete', metric='euclidean')

# Plot dendrogram
plt.figure(figsize=(12, 8))
dendrogram(link, leaf_rotation=90, leaf_font_size=8)
plt.show()

# Assign clusters based on the dendrogram (4 clusters, consistent with Question 1)
from scipy.cluster.hierarchy import fcluster
clusters = fcluster(link, 4, criterion='maxclust')

# Cross-tabulate cluster assignments against each categorical variable
# (column names as they appear in the dataset; see Question 5)
crosstab_private = pd.crosstab(clusters, df['Public (1)/ Private (2)'])
crosstab_state = pd.crosstab(clusters, df['State'])
print(crosstab_private)
print(crosstab_state)
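
To get the percentage of public and private universities in each cluster, as suggested above, the crosstab can be normalized by row. A small sketch, assuming clusters and df from the code above:

# Share of public (1) vs. private (2) universities within each cluster, in percent
pct = pd.crosstab(clusters, df['Public (1)/ Private (2)'], normalize='index') * 100
print(pct.round(1))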

Question 4
4 Points
What other external information can explain the contents of some or all of these clusters?
Answer:
Several external factors could explain the contents of some or all of the clusters, including:

Geographic location: The location of a university may have an impact on its admissions criteria, student
demographics, and other characteristics. For example, universities in urban areas may have different
admissions criteria than those in rural areas.

Funding sources: The funding sources of a university can also play a role in its characteristics. For
example, public universities may have different tuition rates and admissions criteria than private
universities.

Historical background: The historical background of a university may also play a role in its
characteristics. For example, some universities may have a long-standing tradition of excellence in
certain fields or may have a specific focus on certain types of research.

Size: The size of a university can also have an impact on its characteristics. Larger universities may have
more resources and offer a wider range of programs, while smaller universities may have a more
intimate learning environment and a more specialized focus.
Question 5
4 Points
Consider Tufts University, which is missing some information. Compute the Euclidean distance of this record
from each of the clusters that you found above (using only the measurements that you have). Which cluster is
it closest to? Impute the missing values for Tufts by taking the average of the cluster on those measurements.
Answer:
To solve this question, we first calculate the Euclidean distance between the Tufts University record and
each of the clusters obtained above, using only the columns for which Tufts has values. We then find the
cluster with the minimum distance from the Tufts record. Once that cluster is identified, we impute the
missing values of the Tufts record by taking the cluster's average on those measurements.

# replace missing values with NaN
df = df.replace('...', np.nan)

# drop rows with any NaN values
df = df.dropna()

# convert columns to float
cols = df.columns[2:]
df[cols] = df[cols].astype(float)

# select only the columns with values
X = df.iloc[:, 2:]

# normalize the data
X = (X - X.mean()) / X.std()

# create three clusters using k-means
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=3, random_state=42).fit(X)
labels = kmeans.labels_

# compute the centroids of the clusters (in normalized space)
centroids = pd.DataFrame(kmeans.cluster_centers_, columns=X.columns)

# create a record for Tufts University (missing measurements as NaN)
tufts = pd.DataFrame({'CollegeName': 'Tufts University',
                      'State': 'MA',
                      'Public (1)/ Private (2)': 2,
                      '# appli. rec\'d': np.nan,
                      '# appl. accepted': np.nan,
                      '# new stud. enrolled': np.nan,
                      '% new stud. from top 10%': np.nan,
                      '% new stud. from top 25%': np.nan,
                      '# FT undergrad': np.nan,
                      '# PT undergrad': np.nan,
                      'in-state tuition': 59464,
                      'out-of-state tuition': 59464,
                      'room': 8226,
                      'board': 4886,
                      'add. fees': 0,
                      'estim. book costs': 1100,
                      'estim. personal $': 2300,
                      '% fac. w/PHD': 0.961,
                      'stud./fac. ratio': 9.9,
                      'Graduation rate': 95.0}, index=[0])

# select only the measurement columns
tufts = tufts.iloc[:, 2:]

# normalize the data using the dataset's mean and std
tufts = (tufts - X.mean()) / X.std()

# compute the Euclidean distance of Tufts from each cluster centroid,
# using only the measurements that Tufts has (ignore NaN columns)
valid_cols = tufts.columns[tufts.notna().iloc[0]]
distances = []
for i in range(3):
    dist = np.linalg.norm(tufts[valid_cols].iloc[0] - centroids.loc[i, valid_cols])
    distances.append(dist)

# find the index of the cluster with minimum distance
cluster_index = np.argmin(distances)

# impute the missing values for Tufts with that cluster's centroid
# (both are in normalized space), then convert back to the original scale
tufts = tufts.fillna(centroids.iloc[cluster_index])
tufts = tufts * X.std() + X.mean()
tufts = tufts.round(0)

# print the imputed record for Tufts
print(tufts)
