Numpy NP Pandas PD Matplotlib - Pyplot PLT Seaborn SNS: "Merged - Uscol - TXT" ","

7/30/2020 Exploratory Data Analysis
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
In [2]:
# Read in the dataset:

df = pd.read_table("merged_uscol.txt", sep=",")
df.head()
Out[2]:
   FICE  College_name.x                 States  Public_indicator  Average_M\r\nath_SAT_score  Average_Verb
0  1002  Alabama Agri. & Mech. Univ.    AL      1                 NaN
1  1004  University of Montevallo       AL      1                 NaN
2  1009  Auburn University-Main Campus  AL      1                 575.0
3  1012  Birmingham-Southern College    AL      2                 575.0
4  1016  University of North Alabama    AL      1                 NaN
5 rows × 51 columns
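As a side note, pd.read_table with sep="," should behave like pd.read_csv, which is the more common spelling for comma-separated files. The sketch below checks this on a small in-memory sample; the sample text is hypothetical and only mimics a few of the columns above.

```python
import io

import pandas as pd

# Hypothetical three-column sample; the empty field becomes NaN on read.
raw = "FICE,States,Average_Math_SAT_score\n1002,AL,\n1009,AL,575\n"

via_read_table = pd.read_table(io.StringIO(raw), sep=",")
via_read_csv = pd.read_csv(io.StringIO(raw))

# DataFrame.equals treats NaNs in the same position as equal.
same = via_read_table.equals(via_read_csv)
print(same)
```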
In [3]:
# Remove carriage return and newline sequences in column names:

df.columns = list(map(lambda x: x.replace("\r\n", ""), df.columns.tolist()))
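The same cleanup can also be written with the vectorized string methods on the column index. This is a minimal sketch on a hypothetical two-column frame, not the notebook's data:

```python
import pandas as pd

# Hypothetical toy frame whose header contains a stray "\r\n" sequence.
toy = pd.DataFrame({"Average_M\r\nath_SAT_score": [575.0], "FICE": [1009]})

# regex=False makes the pattern a literal string rather than a regular expression.
toy.columns = toy.columns.str.replace("\r\n", "", regex=False)

cleaned_columns = toy.columns.tolist()
print(cleaned_columns)
```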
In [4]:
# Return first 5 rows of our dataframe:

df.head()
Out[4]:
   FICE  College_name.x                 States  Public_indicator  Average_Math_SAT_score  Average_Verbal_
0  1002  Alabama Agri. & Mech. Univ.    AL      1                 NaN
1  1004  University of Montevallo       AL      1                 NaN
2  1009  Auburn University-Main Campus  AL      1                 575.0
3  1012  Birmingham-Southern College    AL      2                 575.0
4  1016  University of North Alabama    AL      1                 NaN
In [5]:
# Replace NaN values with the mean of the corresponding numeric column.
# numeric_only=True skips the text columns, which have no mean:
df.fillna(df.mean(numeric_only=True), inplace=True)
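To see what mean imputation does, here is a small sketch on a hypothetical frame (the values are made up for illustration). Every missing entry receives the same value, so when many entries are missing the median and quartiles of the column can end up equal to the mean.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one text column, one numeric column with gaps.
toy = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "score": [500.0, np.nan, 600.0, np.nan],
})

# Column means of the numeric columns only (NaNs are skipped by default).
col_means = toy.mean(numeric_only=True)

# fillna with a Series fills each column using the entry with the matching label.
filled = toy.fillna(col_means)
print(filled)
```

In this toy case the filled column has a median equal to the imputed mean, which is the same pattern as a 50% row that matches the mean row in a describe() summary.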
In [6]:
df.head()
Out[6]:
   FICE  College_name.x                 States  Public_indicator  Average_Math_SAT_score  Average_Verbal_
0  1002  Alabama Agri. & Mech. Univ.    AL      1                 512.605144
1  1004  University of Montevallo       AL      1                 512.605144
2  1009  Auburn University-Main Campus  AL      1                 575.000000
3  1012  Birmingham-Southern College    AL      2                 575.000000
4  1016  University of North Alabama    AL      1                 512.605144
In [7]:
# Let's explore our dataset now!

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1133 entries, 0 to 1132
Data columns (total 51 columns):
 #  Column                                             Non-Null Count  Dtype
--  ------                                             --------------  -----
 0  FICE                                               1133 non-null   int64
 1  College_name.x                                     1133 non-null   object
 2  States                                             1133 non-null   object
 3  Public_indicator                                   1133 non-null   int64
 4  Average_Math_SAT_score                             1133 non-null   float64
 5  Average_Verbal_SAT_score                           1133 non-null   float64
 6  Average_Combined_SAT_score                         1133 non-null   float64
 7  Average_ACT_score                                  1133 non-null   float64
 8  First_quartile_Math_SAT                            1133 non-null   float64
 9  Third_quartile_Math_SAT                            1133 non-null   float64
10  First_quartile_Verbal_SAT                          1133 non-null   float64
11  Third_quartile_Verbal_SAT                          1133 non-null   float64
12  First_quartile_ACT                                 1133 non-null   float64
13  Third_quartile_ACT                                 1133 non-null   float64
14  Number_applications_received                       1133 non-null   float64
15  Number_applicants_accepted                         1133 non-null   float64
16  Number_new_students_enrolled                       1133 non-null   float64
17  new_students_from_top_ten_percent_HS_class         1133 non-null   float64
18  students_from_top_twenty_five_percent_of_HS_class  1133 non-null   float64
19  Number_fulltime_undergraduates                     1133 non-null   float64
20  Number_parttime_undergraduates                     1133 non-null   float64
21  In_state_tuition                                   1133 non-null   float64
22  Out_state_tuition                                  1133 non-null   float64
23  Room_and_board_costs                               1133 non-null   float64
24  Room_costs                                         1133 non-null   float64
25  Board_costs                                        1133 non-null   float64
26  Additional_fees                                    1133 non-null   float64
27  Estimated_book_costs                               1133 non-null   float64
28  Estimated_personal_spending                        1133 non-null   float64
29  Pct_of_faculty_with_PhD                            1133 non-null   float64
30  Pct_of_faculty_with_terminal_degree                1133 non-null   float64
31  Student_and_faculty_ratio                          1133 non-null   float64
32  Pct_alumni_who_donate                              1133 non-null   float64
33  Instructional_expenditure_per_student              1133 non-null   float64
34  Graduation_rate                                    1133 non-null   float64
35  College_name.y                                     1133 non-null   object
36  State                                              1133 non-null   object
37  Type                                               1133 non-null   object
38  Average_salary_full_professors                     1133 non-null   float64
39  Average_salary_associate_professors                1133 non-null   float64
40  Average_salary_assistant_professors                1133 non-null   float64
41  Average_salary_all_ranks                           1133 non-null   int64
42  Average_compensation_full_professors               1133 non-null   float64
43  Average_compensation_associate_professors          1133 non-null   float64
44  Average_compensation_assistant_professors          1133 non-null   float64
45  Average_compensation_all_ranks                     1133 non-null   int64
46  Number_of_full_professors                          1133 non-null   int64
47  Number_of_associate_professors                     1133 non-null   int64
48  Number_of_assistant_professors                     1133 non-null   int64
49  Number_of_instructors                              1133 non-null   int64
50  Number_of_faculty_all_ranks                        1133 non-null   int64
dtypes: float64(37), int64(9), object(5)
memory usage: 451.6+ KB
We now have no null values, so the dataset is complete. Let's explore the cleaned data further.
In [8]:
# Return information about our dataset columns/features:

df.describe()
Out[8]:
FICE Public_indicator Average_Math_SAT_score Average_Verbal_SAT_score A
count 1133.000000 1133.00000 1133.000000 1133.000000
mean 2955.491615 1.61165 512.605144 465.243570
std 2136.044239 0.48759 51.038802 43.852767
min 1002.000000 1.00000 320.000000 280.000000
25% 1893.000000 1.00000 496.000000 450.000000
50% 2638.000000 2.00000 512.605144 465.243570
75% 3406.000000 2.00000 520.000000 468.000000
max 29261.000000 2.00000 750.000000 665.000000
In [9]:
# Return the number of unique values in each column:

df.nunique()
Out[9]:
FICE 1132
College_name.x 1110
States 51
Public_indicator 2
Average_Math_SAT_score 227
Average_Verbal_SAT_score 206
Average_Combined_SAT_score 315
Average_ACT_score 17
First_quartile_Math_SAT 83
Third_quartile_Math_SAT 80
First_quartile_Verbal_SAT 65
Third_quartile_Verbal_SAT 82
First_quartile_ACT 21
Third_quartile_ACT 20
Number_applications_received 1007
Number_applicants_accepted 974
Number_new_students_enrolled 804
new_students_from_top_ten_percent_HS_class 89
students_from_top_twenty_five_percent_of_HS_class 91
Number_fulltime_undergraduates 1018
Number_parttime_undergraduates 809
In_state_tuition 850
Out_state_tuition 868
Room_and_board_costs 735
Room_costs 548
Board_costs 434
Additional_fees 410
Estimated_book_costs 157
Estimated_personal_spending 376
Pct_of_faculty_with_PhD 88
Pct_of_faculty_with_terminal_degree 75
Student_and_faculty_ratio 197
Pct_alumni_who_donate 62
Instructional_expenditure_per_student 1051
Graduation_rate 89
College_name.y 1112
State 52
Type 4
Average_salary_full_professors 428
Average_salary_associate_professors 303
Average_salary_assistant_professors 235
Average_salary_all_ranks 343
Average_compensation_full_professors 486
Average_compensation_associate_professors 373
Average_compensation_assistant_professors 305
Average_compensation_all_ranks 431
Number_of_full_professors 298
Number_of_associate_professors 255
Number_of_assistant_professors 241
Number_of_instructors 83
Number_of_faculty_all_ranks 493
dtype: int64
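FICE has 1132 unique values across 1133 rows, so exactly one FICE code appears twice. A quick way to surface such rows is DataFrame.duplicated; the sketch below uses a hypothetical toy frame rather than the real data.

```python
import pandas as pd

# Hypothetical toy frame in which one identifier is repeated.
toy = pd.DataFrame({
    "FICE": [1002, 1004, 1004, 1009],
    "College_name.x": ["A", "B", "B (second entry)", "C"],
})

n_rows = len(toy)
n_unique = toy["FICE"].nunique()

# keep=False marks every member of a duplicated group, not only the later ones.
dupes = toy[toy.duplicated(subset="FICE", keep=False)]
print(n_rows - n_unique)
print(dupes)
```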
In [10]:
# Return the counts of all the categorical values in the "Type" column:
df["Type"].value_counts()
Out[10]:
IIB 598
IIA 356
I 178
VIIB 1
Name: Type, dtype: int64
In [11]:
# Drop all categorical columns except "Type", which we convert to numerical columns next:
df.drop(["College_name.x", "States", "College_name.y", "State"], axis=1, inplace=True)
In [12]:
# Convert "Type" to numerical columns:

df = pd.get_dummies(df)
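For reference, pd.get_dummies leaves numeric columns untouched and replaces each text column with one indicator column per category, named with the original column as prefix. A minimal sketch on a hypothetical three-row frame follows; the dtype of the indicator columns depends on the pandas version.

```python
import pandas as pd

# Hypothetical toy frame with one numeric and one categorical column.
toy = pd.DataFrame({"FICE": [1, 2, 3], "Type": ["I", "IIA", "IIB"]})

encoded = pd.get_dummies(toy)

encoded_columns = encoded.columns.tolist()
print(encoded_columns)
```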
In [13]:
# Check that our "Type" column has been replaced with numerical columns for "Type":
df.columns.tolist()
Out[13]:
['FICE',
'Public_indicator',
'Average_Math_SAT_score',
'Average_Verbal_SAT_score',
'Average_Combined_SAT_score',
'Average_ACT_score',
'First_quartile_Math_SAT',
'Third_quartile_Math_SAT',
'First_quartile_Verbal_SAT',
'Third_quartile_Verbal_SAT',
'First_quartile_ACT',
'Third_quartile_ACT',
'Number_applications_received',
'Number_applicants_accepted',
'Number_new_students_enrolled',
'new_students_from_top_ten_percent_HS_class',
'students_from_top_twenty_five_percent_of_HS_class',
'Number_fulltime_undergraduates',
'Number_parttime_undergraduates',
'In_state_tuition',
'Out_state_tuition',
'Room_and_board_costs',
'Room_costs',
'Board_costs',
'Additional_fees',
'Estimated_book_costs',
'Estimated_personal_spending',
'Pct_of_faculty_with_PhD',
'Pct_of_faculty_with_terminal_degree',
'Student_and_faculty_ratio',
'Pct_alumni_who_donate',
'Instructional_expenditure_per_student',
'Graduation_rate',
'Average_salary_full_professors',
'Average_salary_associate_professors',
'Average_salary_assistant_professors',
'Average_salary_all_ranks',
'Average_compensation_full_professors',
'Average_compensation_associate_professors',
'Average_compensation_assistant_professors',
'Average_compensation_all_ranks',
'Number_of_full_professors',
'Number_of_associate_professors',
'Number_of_assistant_professors',
'Number_of_instructors',
'Number_of_faculty_all_ranks',
'Type_I',
'Type_IIA',
'Type_IIB',
'Type_VIIB']
In [14]:
df.head()
Out[14]:
FICE Public_indicator Average_Math_SAT_score Average_Verbal_SAT_score Average_Com
0 1002 1 512.605144 465.24357
1 1004 1 512.605144 465.24357
2 1009 1 575.000000 501.00000
3 1012 2 575.000000 525.00000
4 1016 1 512.605144 465.24357
In [15]:
# Let's create an "intuitive" dataset that keeps only the features we think are significant:
labels_to_drop = ['First_quartile_Math_SAT',
'Third_quartile_Math_SAT',
'First_quartile_Verbal_SAT',
'Third_quartile_Verbal_SAT',
'First_quartile_ACT',
'Third_quartile_ACT',
'Average_salary_full_professors',
'Average_salary_associate_professors',
'Average_salary_assistant_professors',
'Average_compensation_full_professors',
'Average_compensation_associate_professors',
'Average_compensation_assistant_professors',
'Number_of_full_professors',
'Number_of_associate_professors',
'Number_of_assistant_professors',
'Number_of_instructors',
'Pct_alumni_who_donate'
]
df_intuitive = df.drop(labels=labels_to_drop, axis=1)
df_intuitive.to_csv('intuitive_data.txt', header=True, index=None, sep=',')
In [16]:
# Let's visualize our correlation matrix using a heatmap:

plt.figure(figsize=(12,12))
sns.heatmap(df.corr(), cmap="coolwarm")
# `quality` only ever applied to JPEG output and is rejected by recent Matplotlib releases, so it is omitted:
plt.savefig("correlation matrix", dpi=300, bbox_inches="tight")
Let's examine the explanatory variables that have a strong correlation (positive or negative) with Graduation_rate.
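One way to pick those variables without reading them off a plot is to take the Graduation_rate column of the correlation matrix and sort it. The sketch below runs on a hypothetical toy frame; the numbers are invented purely to illustrate the call pattern.

```python
import pandas as pd

# Hypothetical toy values, chosen only so that one feature rises and one falls with the target.
toy = pd.DataFrame({
    "Graduation_rate": [40.0, 50.0, 60.0, 70.0, 80.0],
    "In_state_tuition": [2000.0, 4000.0, 5000.0, 9000.0, 12000.0],
    "Student_and_faculty_ratio": [22.0, 19.0, 17.0, 13.0, 10.0],
})

# Correlation of every feature with the target, most negative first.
corr_with_target = (
    toy.corr()["Graduation_rate"]
    .drop("Graduation_rate")
    .sort_values()
)
print(corr_with_target)
```

The head and tail of the sorted series are then the candidates for a closer look with scatter plots.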
In [17]:
# We see a positive correlation here:

sns.lmplot(x="In_state_tuition", y="Graduation_rate", data=df)
plt.title("Graduation Rate vs In State Tuition")
plt.savefig("grad_rate vs in_state_tuition", dpi=300, bbox_inches="tight")
In [18]:
# We see a positive correlation here:

sns.lmplot(x="Out_state_tuition", y="Graduation_rate", data=df)
plt.title("Graduation Rate vs Out State Tuition")
plt.savefig("grad_rate vs out_state_tuition", dpi=300, bbox_inches="tight")
In [19]:
# We see a negative correlation here:

sns.lmplot(x="Number_parttime_undergraduates", y="Graduation_rate", data=df)
plt.title("Graduation Rate vs Number of part time undergraduates")
plt.savefig("grad_rate vs Number_parttime_undergraduate", dpi=300, bbox_inches="tight")
In [20]:
# We see a negative correlation here:

sns.lmplot(x="Student_and_faculty_ratio", y="Graduation_rate", data=df)
plt.title("Graduation Rate vs Student and Faculty Ratio")
plt.savefig("grad_rate vs Student_and_faculty_ratio", dpi=300, bbox_inches="tight")
In [21]:
# The box plot shows that our data contains some outliers, which we remove in the next step:
plt.figure(figsize=(12,12))
sns.boxplot(x="Public_indicator", y="Graduation_rate", data=df)
plt.savefig("grad_rate vs public_indicator", dpi=300, bbox_inches="tight")
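The points a box plot draws beyond its whiskers follow, by default, the 1.5 × IQR rule applied within each group. The sketch below reproduces that rule numerically on a hypothetical toy frame; the values are invented, and the quartiles use pandas' default linear interpolation.

```python
import pandas as pd

# Hypothetical toy data: group 1 contains one implausibly high graduation rate.
toy = pd.DataFrame({
    "Public_indicator": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    "Graduation_rate": [50.0, 55.0, 60.0, 65.0, 118.0, 70.0, 72.0, 75.0, 78.0, 80.0],
})

grouped = toy.groupby("Public_indicator")["Graduation_rate"]

# Per-group quartiles, broadcast back to the original row order.
q1 = grouped.transform(lambda s: s.quantile(0.25))
q3 = grouped.transform(lambda s: s.quantile(0.75))
iqr = q3 - q1

# A point is flagged when it lies more than 1.5 * IQR outside the quartiles.
rate = toy["Graduation_rate"]
is_outlier = (rate < q1 - 1.5 * iqr) | (rate > q3 + 1.5 * iqr)
print(toy[is_outlier])
```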
In [22]:
# Keep only the rows in which every column lies within 3 standard deviations of that column's mean:
df = df[df.apply(lambda x: np.abs(x - x.mean()) / x.std() < 3).all(axis=1)]
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 832 entries, 1 to 1104
Data columns (total 50 columns):
 #  Column                                             Non-Null Count  Dtype
--  ------                                             --------------  -----
 0  FICE                                               832 non-null    int64
 1  Public_indicator                                   832 non-null    int64
 2  Average_Math_SAT_score                             832 non-null    float64
 3  Average_Verbal_SAT_score                           832 non-null    float64
 4  Average_Combined_SAT_score                         832 non-null    float64
 5  Average_ACT_score                                  832 non-null    float64
 6  First_quartile_Math_SAT                            832 non-null    float64
 7  Third_quartile_Math_SAT                            832 non-null    float64
 8  First_quartile_Verbal_SAT                          832 non-null    float64
 9  Third_quartile_Verbal_SAT                          832 non-null    float64
10  First_quartile_ACT                                 832 non-null    float64
11  Third_quartile_ACT                                 832 non-null    float64
12  Number_applications_received                       832 non-null    float64
13  Number_applicants_accepted                         832 non-null    float64
14  Number_new_students_enrolled                       832 non-null    float64
15  new_students_from_top_ten_percent_HS_class         832 non-null    float64
16  students_from_top_twenty_five_percent_of_HS_class  832 non-null    float64
17  Number_fulltime_undergraduates                     832 non-null    float64
18  Number_parttime_undergraduates                     832 non-null    float64
19  In_state_tuition                                   832 non-null    float64
20  Out_state_tuition                                  832 non-null    float64
21  Room_and_board_costs                               832 non-null    float64
22  Room_costs                                         832 non-null    float64
23  Board_costs                                        832 non-null    float64
24  Additional_fees                                    832 non-null    float64
25  Estimated_book_costs                               832 non-null    float64
26  Estimated_personal_spending                        832 non-null    float64
27  Pct_of_faculty_with_PhD                            832 non-null    float64
28  Pct_of_faculty_with_terminal_degree                832 non-null    float64
29  Student_and_faculty_ratio                          832 non-null    float64
30  Pct_alumni_who_donate                              832 non-null    float64
31  Instructional_expenditure_per_student              832 non-null    float64
32  Graduation_rate                                    832 non-null    float64
33  Average_salary_full_professors                     832 non-null    float64
34  Average_salary_associate_professors                832 non-null    float64
35  Average_salary_assistant_professors                832 non-null    float64
36  Average_salary_all_ranks                           832 non-null    int64
37  Average_compensation_full_professors               832 non-null    float64
38  Average_compensation_associate_professors          832 non-null    float64
39  Average_compensation_assistant_professors          832 non-null    float64
40  Average_compensation_all_ranks                     832 non-null    int64
41  Number_of_full_professors                          832 non-null    int64
42  Number_of_associate_professors                     832 non-null    int64
43  Number_of_assistant_professors                     832 non-null    int64
44  Number_of_instructors                              832 non-null    int64
45  Number_of_faculty_all_ranks                        832 non-null    int64
46  Type_I                                             832 non-null    uint8
47  Type_IIA                                           832 non-null    uint8
48  Type_IIB                                           832 non-null    uint8
49  Type_VIIB                                          832 non-null    uint8
dtypes: float64(37), int64(9), uint8(4)
memory usage: 308.8 KB
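The 3-standard-deviation rule can be checked on a small example. This is a minimal sketch on a hypothetical two-column frame with one planted outlier; it expresses the same filter with whole-frame arithmetic instead of apply.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 50 rows, with an extreme value planted in the last row of "x".
toy = pd.DataFrame({
    "x": np.arange(50, dtype=float),
    "y": np.linspace(0.0, 1.0, 50),
})
toy.loc[49, "x"] = 1000.0

# Column-wise z-scores: distance from the column mean in units of the column's standard deviation.
z = (toy - toy.mean()) / toy.std()

# Keep a row only if all of its columns are within 3 standard deviations.
mask = (z.abs() < 3).all(axis=1)
filtered = toy[mask]
print(len(toy), len(filtered))
```

Note that the rule is applied to every column, including identifiers such as FICE and the one-hot Type columns. A category that occurs only once, like the single Type_VIIB row in Out[10], has a very large z-score in its indicator column, so that row is dropped as well.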
In [23]:
df.describe()
Out[23]:
FICE Public_indicator Average_Math_SAT_score Average_Verbal_SAT_score A
count 832.000000 832.000000 832.000000 832.000000
mean 2752.307692 1.639423 507.004439 461.343310
std 1192.384792 0.480457 37.914697 32.679401
min 1004.000000 1.000000 380.000000 350.000000
25% 1939.250000 1.000000 495.000000 450.000000
50% 2653.500000 2.000000 512.605144 465.243570
75% 3388.250000 2.000000 512.605144 465.243570
max 9345.000000 2.000000 655.000000 579.000000
