
7/30/2020 Exploratory Data Analysis

In [1]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:

# Read in the dataset:


df = pd.read_table("merged_uscol.txt", sep=",")
df.head()

Out[2]:

   FICE  College_name.x                 States  Public_indicator  Average_M\r\nath_SAT_score  ...
0  1002  Alabama Agri. & Mech. Univ.    AL      1                 NaN                         ...
1  1004  University of Montevallo       AL      1                 NaN                         ...
2  1009  Auburn University-Main Campus  AL      1                 575.0                       ...
3  1012  Birmingham-Southern College    AL      2                 575.0                       ...
4  1016  University of North Alabama    AL      1                 NaN                         ...

5 rows × 51 columns

In [3]:

# Remove carriage return and newline sequences in column names:


df.columns = list(map(lambda x: x.replace("\r\n", ""), df.columns.tolist()))
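
The same cleanup can also be written with the vectorised string methods on the column index. The sketch below uses a small made-up frame (`toy` is a hypothetical name, not the college dataset) with one header broken by a stray `\r\n`, and assumes the replacement should be literal rather than a regular expression.

```python
import pandas as pd

# Toy frame with a header broken by a stray "\r\n", mimicking the issue above.
toy = pd.DataFrame({"FICE": [1002], "Average_M\r\nath_SAT_score": [575.0]})

# Literal (non-regex) replacement applied to every column name at once.
toy.columns = toy.columns.str.replace("\r\n", "", regex=False)

cleaned_columns = toy.columns.tolist()
print(cleaned_columns)
```

Either form gives the same column names; the `.str` version simply avoids the explicit `map` and `lambda`.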


In [4]:

# Return first 5 rows of our dataframe:


df.head()

Out[4]:

   FICE  College_name.x                 States  Public_indicator  Average_Math_SAT_score  ...
0  1002  Alabama Agri. & Mech. Univ.    AL      1                 NaN                     ...
1  1004  University of Montevallo       AL      1                 NaN                     ...
2  1009  Auburn University-Main Campus  AL      1                 575.0                   ...
3  1012  Birmingham-Southern College    AL      2                 575.0                   ...
4  1016  University of North Alabama    AL      1                 NaN                     ...

5 rows × 51 columns

In [5]:

# Replace the NaN values in each numeric column with that column's mean:
df.fillna(df.mean(numeric_only=True), inplace=True)
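
Mean imputation keeps every row but changes the shape of each filled column: imputed rows all share one value, so the spread shrinks. In Out[8] the median of Average_Math_SAT_score (512.605144) equals its mean, which is what one would expect if many values in that column were imputed. The sketch below illustrates the effect on a made-up four-row frame; the names `toy`, `College` and `Math_SAT` are hypothetical and the numbers are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up data: two of the four scores are missing.
toy = pd.DataFrame({
    "College": ["A", "B", "C", "D"],
    "Math_SAT": [500.0, np.nan, 600.0, np.nan],
})

# numeric_only=True skips the text column when computing the column means.
col_means = toy.mean(numeric_only=True)
filled = toy.fillna(col_means)

# Compare the spread before and after imputation.
std_before = toy["Math_SAT"].std()
std_after = filled["Math_SAT"].std()
print(filled)
print(std_before, std_after)
```

The standard deviation after filling is smaller than before, so any later step that relies on column variance (such as a z-score filter) sees a narrower distribution than the observed data had.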

In [6]:

df.head()

Out[6]:

   FICE  College_name.x                 States  Public_indicator  Average_Math_SAT_score  ...
0  1002  Alabama Agri. & Mech. Univ.    AL      1                 512.605144              ...
1  1004  University of Montevallo       AL      1                 512.605144              ...
2  1009  Auburn University-Main Campus  AL      1                 575.000000              ...
3  1012  Birmingham-Southern College    AL      2                 575.000000              ...
4  1016  University of North Alabama    AL      1                 512.605144              ...

5 rows × 51 columns


In [7]:

# Let's explore our dataset now:


df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1133 entries, 0 to 1132
Data columns (total 51 columns):
 #  Column                                             Non-Null Count  Dtype
--  ------                                             --------------  -----
 0  FICE                                               1133 non-null   int64
 1  College_name.x                                     1133 non-null   object
 2  States                                             1133 non-null   object
 3  Public_indicator                                   1133 non-null   int64
 4  Average_Math_SAT_score                             1133 non-null   float64
 5  Average_Verbal_SAT_score                           1133 non-null   float64
 6  Average_Combined_SAT_score                         1133 non-null   float64
 7  Average_ACT_score                                  1133 non-null   float64
 8  First_quartile_Math_SAT                            1133 non-null   float64
 9  Third_quartile_Math_SAT                            1133 non-null   float64
10  First_quartile_Verbal_SAT                          1133 non-null   float64
11  Third_quartile_Verbal_SAT                          1133 non-null   float64
12  First_quartile_ACT                                 1133 non-null   float64
13  Third_quartile_ACT                                 1133 non-null   float64
14  Number_applications_received                       1133 non-null   float64
15  Number_applicants_accepted                         1133 non-null   float64
16  Number_new_students_enrolled                       1133 non-null   float64
17  new_students_from_top_ten_percent_HS_class         1133 non-null   float64
18  students_from_top_twenty_five_percent_of_HS_class  1133 non-null   float64
19  Number_fulltime_undergraduates                     1133 non-null   float64
20  Number_parttime_undergraduates                     1133 non-null   float64
21  In_state_tuition                                   1133 non-null   float64
22  Out_state_tuition                                  1133 non-null   float64
23  Room_and_board_costs                               1133 non-null   float64
24  Room_costs                                         1133 non-null   float64
25  Board_costs                                        1133 non-null   float64
26  Additional_fees                                    1133 non-null   float64
27  Estimated_book_costs                               1133 non-null   float64
28  Estimated_personal_spending                        1133 non-null   float64
29  Pct_of_faculty_with_PhD                            1133 non-null   float64
30  Pct_of_faculty_with_terminal_degree                1133 non-null   float64
31  Student_and_faculty_ratio                          1133 non-null   float64
32  Pct_alumni_who_donate                              1133 non-null   float64
33  Instructional_expenditure_per_student              1133 non-null   float64
34  Graduation_rate                                    1133 non-null   float64
35  College_name.y                                     1133 non-null   object
36  State                                              1133 non-null   object
37  Type                                               1133 non-null   object
38  Average_salary_full_professors                     1133 non-null   float64
39  Average_salary_associate_professors                1133 non-null   float64
40  Average_salary_assistant_professors                1133 non-null   float64
41  Average_salary_all_ranks                           1133 non-null   int64
42  Average_compensation_full_professors               1133 non-null   float64
43  Average_compensation_associate_professors          1133 non-null   float64
44  Average_compensation_assistant_professors          1133 non-null   float64
45  Average_compensation_all_ranks                     1133 non-null   int64
46  Number_of_full_professors                          1133 non-null   int64
47  Number_of_associate_professors                     1133 non-null   int64
48  Number_of_assistant_professors                     1133 non-null   int64
49  Number_of_instructors                              1133 non-null   int64
50  Number_of_faculty_all_ranks                        1133 non-null   int64
dtypes: float64(37), int64(9), object(5)
memory usage: 451.6+ KB

We now have no null values, because every missing numeric value was replaced with its column mean. Let's explore the cleaned data further.
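
A per-column count of missing values, taken before and after imputation, is a quick way to confirm this. The sketch below uses a made-up frame (`toy`, with hypothetical columns `a` and `b`) and also shows a limitation worth keeping in mind: filling with column means only touches numeric columns, so a text column with gaps would still report missing values. In the dataset above, `df.info()` reports 1133 non-null entries for the object columns as well, so that case does not arise here.

```python
import numpy as np
import pandas as pd

# Made-up frame: one numeric and one text column, each with a gap.
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", None, "z"]})

missing_before = toy.isna().sum()

# Fill numeric columns with their means; text columns are left untouched.
filled = toy.fillna(toy.mean(numeric_only=True))
missing_after = filled.isna().sum()

print(missing_before)
print(missing_after)
```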


In [8]:

# Return information about our dataset columns/features:


df.describe()

Out[8]:

       FICE          Public_indicator  Average_Math_SAT_score  Average_Verbal_SAT_score  ...
count  1133.000000   1133.00000        1133.000000             1133.000000               ...
mean   2955.491615   1.61165           512.605144              465.243570                ...
std    2136.044239   0.48759           51.038802               43.852767                 ...
min    1002.000000   1.00000           320.000000              280.000000                ...
25%    1893.000000   1.00000           496.000000              450.000000                ...
50%    2638.000000   2.00000           512.605144              465.243570                ...
75%    3406.000000   2.00000           520.000000              468.000000                ...
max    29261.000000  2.00000           750.000000              665.000000                ...

8 rows × 46 columns


In [9]:

# Return the number of unique values in each column:


df.nunique()

Out[9]:

FICE 1132
College_name.x 1110
States 51
Public_indicator 2
Average_Math_SAT_score 227
Average_Verbal_SAT_score 206
Average_Combined_SAT_score 315
Average_ACT_score 17
First_quartile_Math_SAT 83
Third_quartile_Math_SAT 80
First_quartile_Verbal_SAT 65
Third_quartile_Verbal_SAT 82
First_quartile_ACT 21
Third_quartile_ACT 20
Number_applications_received 1007
Number_applicants_accepted 974
Number_new_students_enrolled 804
new_students_from_top_ten_percent_HS_class 89
students_from_top_twenty_five_percent_of_HS_class 91
Number_fulltime_undergraduates 1018
Number_parttime_undergraduates 809
In_state_tuition 850
Out_state_tuition 868
Room_and_board_costs 735
Room_costs 548
Board_costs 434
Additional_fees 410
Estimated_book_costs 157
Estimated_personal_spending 376
Pct_of_faculty_with_PhD 88
Pct_of_faculty_with_terminal_degree 75
Student_and_faculty_ratio 197
Pct_alumni_who_donate 62
Instructional_expenditure_per_student 1051
Graduation_rate 89
College_name.y 1112
State 52
Type 4
Average_salary_full_professors 428
Average_salary_associate_professors 303
Average_salary_assistant_professors 235
Average_salary_all_ranks 343
Average_compensation_full_professors 486
Average_compensation_associate_professors 373
Average_compensation_assistant_professors 305
Average_compensation_all_ranks 431
Number_of_full_professors 298
Number_of_associate_professors 255
Number_of_assistant_professors 241
Number_of_instructors 83
Number_of_faculty_all_ranks 493
dtype: int64


In [10]:

# Return the counts of all the categorical values in the "Type" column:
df["Type"].value_counts()

Out[10]:

IIB 598
IIA 356
I 178
VIIB 1
Name: Type, dtype: int64

In [11]:

# Drop all categorical columns except "Type", which we convert to numerical columns next:
df.drop(["College_name.x", "States", "College_name.y", "State"], axis=1, inplace=True)

In [12]:

# Convert "Type" to numerical columns:


df = pd.get_dummies(df)
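
`pd.get_dummies` leaves numeric columns alone and replaces each text column with one indicator column per distinct value, named `<column>_<value>`. A minimal sketch on a made-up three-row frame follows; `toy` is a hypothetical name, and the FICE and Type values are only borrowed for illustration.

```python
import pandas as pd

# Made-up frame: one numeric column and one categorical column.
toy = pd.DataFrame({"FICE": [1002, 1009, 1012], "Type": ["IIA", "I", "IIB"]})

# The text column "Type" becomes one indicator column per category.
encoded = pd.get_dummies(toy)

encoded_columns = encoded.columns.tolist()
print(encoded_columns)
print(encoded)
```

Each row has exactly one indicator set, so the indicator columns always sum to 1. If that redundancy matters for a later model, `pd.get_dummies(df, drop_first=True)` is one option for dropping one level per category.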


In [13]:

# Check that our "Type" column has been replaced with numerical columns for "Type":
df.columns.tolist()

Out[13]:

['FICE',
'Public_indicator',
'Average_Math_SAT_score',
'Average_Verbal_SAT_score',
'Average_Combined_SAT_score',
'Average_ACT_score',
'First_quartile_Math_SAT',
'Third_quartile_Math_SAT',
'First_quartile_Verbal_SAT',
'Third_quartile_Verbal_SAT',
'First_quartile_ACT',
'Third_quartile_ACT',
'Number_applications_received',
'Number_applicants_accepted',
'Number_new_students_enrolled',
'new_students_from_top_ten_percent_HS_class',
'students_from_top_twenty_five_percent_of_HS_class',
'Number_fulltime_undergraduates',
'Number_parttime_undergraduates',
'In_state_tuition',
'Out_state_tuition',
'Room_and_board_costs',
'Room_costs',
'Board_costs',
'Additional_fees',
'Estimated_book_costs',
'Estimated_personal_spending',
'Pct_of_faculty_with_PhD',
'Pct_of_faculty_with_terminal_degree',
'Student_and_faculty_ratio',
'Pct_alumni_who_donate',
'Instructional_expenditure_per_student',
'Graduation_rate',
'Average_salary_full_professors',
'Average_salary_associate_professors',
'Average_salary_assistant_professors',
'Average_salary_all_ranks',
'Average_compensation_full_professors',
'Average_compensation_associate_professors',
'Average_compensation_assistant_professors',
'Average_compensation_all_ranks',
'Number_of_full_professors',
'Number_of_associate_professors',
'Number_of_assistant_professors',
'Number_of_instructors',
'Number_of_faculty_all_ranks',
'Type_I',
'Type_IIA',
'Type_IIB',
'Type_VIIB']


In [14]:

df.head()

Out[14]:

   FICE  Public_indicator  Average_Math_SAT_score  Average_Verbal_SAT_score  ...
0  1002  1                 512.605144              465.24357                 ...
1  1004  1                 512.605144              465.24357                 ...
2  1009  1                 575.000000              501.00000                 ...
3  1012  2                 575.000000              525.00000                 ...
4  1016  1                 512.605144              465.24357                 ...

5 rows × 50 columns

In [15]:

# Let's create an intuitive dataset containing only the features we think are significant:
labels_to_drop = ['First_quartile_Math_SAT',
'Third_quartile_Math_SAT',
'First_quartile_Verbal_SAT',
'Third_quartile_Verbal_SAT',
'First_quartile_ACT',
'Third_quartile_ACT',
'Average_salary_full_professors',
'Average_salary_associate_professors',
'Average_salary_assistant_professors',
'Average_compensation_full_professors',
'Average_compensation_associate_professors',
'Average_compensation_assistant_professors',
'Number_of_full_professors',
'Number_of_associate_professors',
'Number_of_assistant_professors',
'Number_of_instructors',
'Pct_alumni_who_donate'
]
df_intuitive = df.drop(labels=labels_to_drop, axis=1)
df_intuitive.to_csv('intuitive_data.txt', header=True, index=None, sep=',')


In [16]:

# Let's visualize our correlation matrix using a heatmap:


plt.figure(figsize=(12,12))
sns.heatmap(df.corr(), cmap="coolwarm")
plt.savefig("correlation matrix", dpi=300, bbox_inches="tight")

Let's examine the explanatory variables that correlate strongly (positively or negatively) with Graduation_rate.
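
Besides reading the heatmap by eye, the correlations with the target can be ranked directly from the correlation matrix. The sketch below uses a made-up five-row frame; the column names match the dataset but the numbers are invented for illustration and are not results from it.

```python
import pandas as pd

# Invented values, chosen only so that one feature rises and one falls
# with the target.
toy = pd.DataFrame({
    "Graduation_rate": [40.0, 55.0, 60.0, 75.0, 90.0],
    "In_state_tuition": [2000.0, 5000.0, 7000.0, 12000.0, 15000.0],
    "Student_and_faculty_ratio": [22.0, 18.0, 17.0, 12.0, 9.0],
})

# Take the target's column of the correlation matrix, drop its
# self-correlation, and sort from most negative to most positive.
corr_with_target = (
    toy.corr()["Graduation_rate"]
    .drop("Graduation_rate")
    .sort_values()
)
print(corr_with_target)
```

Applied to the real `df`, the two ends of the sorted series would be the candidates for the scatter plots that follow.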


In [17]:

# We see a positive correlation here:


sns.lmplot(x="In_state_tuition", y="Graduation_rate", data=df)
plt.title("Graduation Rate vs In State Tuition")
plt.savefig("grad_rate vs in_state_tuition", dpi=300, bbox_inches="tight")

In [18]:

# We see a positive correlation here:


sns.lmplot(x="Out_state_tuition", y="Graduation_rate", data=df)
plt.title("Graduation Rate vs Out State Tuition")
plt.savefig("grad_rate vs out_state_tuition", dpi=300, bbox_inches="tight")


In [19]:

# We see a negative correlation here:


sns.lmplot(x="Number_parttime_undergraduates", y="Graduation_rate", data=df)
plt.title("Graduation Rate vs Number of part time undergraduates")
plt.savefig("grad_rate vs Number_parttime_undergraduate", dpi=300, bbox_inches="tight")

In [20]:

# We see a negative correlation here:


sns.lmplot(x="Student_and_faculty_ratio", y="Graduation_rate", data=df)
plt.title("Graduation Rate vs Student and Faculty Ratio")
plt.savefig("grad_rate vs Student_and_faculty_ratio", dpi=300, bbox_inches="tight")


In [21]:

# The box plot shows that our data contains some outliers, which we remove in the next cell:
plt.figure(figsize=(12,12))
sns.boxplot(x="Public_indicator", y="Graduation_rate", data=df)
plt.savefig("grad_rate vs public_indicator", dpi=300, bbox_inches="tight")


In [22]:

# Remove every row that has a value 3 or more standard deviations away from its column mean:
df = df[df.apply(lambda x: np.abs(x - x.mean()) / x.std() < 3).all(axis=1)]
df.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 832 entries, 1 to 1104
Data columns (total 50 columns):
 #  Column                                             Non-Null Count  Dtype
--  ------                                             --------------  -----
 0  FICE                                               832 non-null    int64
 1  Public_indicator                                   832 non-null    int64
 2  Average_Math_SAT_score                             832 non-null    float64
 3  Average_Verbal_SAT_score                           832 non-null    float64
 4  Average_Combined_SAT_score                         832 non-null    float64
 5  Average_ACT_score                                  832 non-null    float64
 6  First_quartile_Math_SAT                            832 non-null    float64
 7  Third_quartile_Math_SAT                            832 non-null    float64
 8  First_quartile_Verbal_SAT                          832 non-null    float64
 9  Third_quartile_Verbal_SAT                          832 non-null    float64
10  First_quartile_ACT                                 832 non-null    float64
11  Third_quartile_ACT                                 832 non-null    float64
12  Number_applications_received                       832 non-null    float64
13  Number_applicants_accepted                         832 non-null    float64
14  Number_new_students_enrolled                       832 non-null    float64
15  new_students_from_top_ten_percent_HS_class         832 non-null    float64
16  students_from_top_twenty_five_percent_of_HS_class  832 non-null    float64
17  Number_fulltime_undergraduates                     832 non-null    float64
18  Number_parttime_undergraduates                     832 non-null    float64
19  In_state_tuition                                   832 non-null    float64
20  Out_state_tuition                                  832 non-null    float64
21  Room_and_board_costs                               832 non-null    float64
22  Room_costs                                         832 non-null    float64
23  Board_costs                                        832 non-null    float64
24  Additional_fees                                    832 non-null    float64
25  Estimated_book_costs                               832 non-null    float64
26  Estimated_personal_spending                        832 non-null    float64
27  Pct_of_faculty_with_PhD                            832 non-null    float64
28  Pct_of_faculty_with_terminal_degree                832 non-null    float64
29  Student_and_faculty_ratio                          832 non-null    float64
30  Pct_alumni_who_donate                              832 non-null    float64
31  Instructional_expenditure_per_student              832 non-null    float64
32  Graduation_rate                                    832 non-null    float64
33  Average_salary_full_professors                     832 non-null    float64
34  Average_salary_associate_professors                832 non-null    float64
35  Average_salary_assistant_professors                832 non-null    float64
36  Average_salary_all_ranks                           832 non-null    int64
37  Average_compensation_full_professors               832 non-null    float64
38  Average_compensation_associate_professors          832 non-null    float64
39  Average_compensation_assistant_professors          832 non-null    float64
40  Average_compensation_all_ranks                     832 non-null    int64
41  Number_of_full_professors                          832 non-null    int64
42  Number_of_associate_professors                     832 non-null    int64
43  Number_of_assistant_professors                     832 non-null    int64
44  Number_of_instructors                              832 non-null    int64
45  Number_of_faculty_all_ranks                        832 non-null    int64
46  Type_I                                             832 non-null    uint8
47  Type_IIA                                           832 non-null    uint8
48  Type_IIB                                           832 non-null    uint8
49  Type_VIIB                                          832 non-null    uint8
dtypes: float64(37), int64(9), uint8(4)
memory usage: 308.8 KB
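
The filter above is a z-score rule: a row survives only if every one of its columns lies within 3 standard deviations of that column's mean. It went from 1133 rows to 832, and because it ran over all 50 columns it also screened the FICE identifier and the dummy columns; the single `Type_VIIB` row, for example, sits far more than 3 standard deviations from that column's mean. The sketch below shows the same rule on made-up, deterministic data and restricts it to a chosen list of columns; `toy`, `x`, `y` and `cols` are hypothetical names, and limiting the columns is one possible variation rather than what the notebook does.

```python
import numpy as np
import pandas as pd

# Deterministic toy data: two well-behaved columns plus one planted outlier.
toy = pd.DataFrame({
    "x": np.linspace(40.0, 60.0, 200),
    "y": np.linspace(8.0, 12.0, 200),
})
toy.loc[0, "x"] = 500.0

# Columns to screen; restricting the list keeps identifiers and dummy
# columns out of the test (the choice of columns here is illustrative).
cols = ["x", "y"]
z = (toy[cols] - toy[cols].mean()) / toy[cols].std()

# Keep a row only if all of its screened columns have |z| < 3.
keep = (z.abs() < 3).all(axis=1)
filtered = toy[keep]
print(len(toy), "->", len(filtered))
```

Only the planted outlier row is dropped here. On real data, the number of rows removed grows with the number of columns screened, since a row is discarded as soon as any single column exceeds the threshold.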


In [23]:

df.describe()

Out[23]:

       FICE         Public_indicator  Average_Math_SAT_score  Average_Verbal_SAT_score  ...
count  832.000000   832.000000        832.000000              832.000000                ...
mean   2752.307692  1.639423          507.004439              461.343310                ...
std    1192.384792  0.480457          37.914697               32.679401                 ...
min    1004.000000  1.000000          380.000000              350.000000                ...
25%    1939.250000  1.000000          495.000000              450.000000                ...
50%    2653.500000  2.000000          512.605144              465.243570                ...
75%    3388.250000  2.000000          512.605144              465.243570                ...
max    9345.000000  2.000000          655.000000              579.000000                ...

8 rows × 50 columns
