Data Science & Machine Learning Cheat Sheet

1 Main Concepts

[Figure: Data Science Life Cycle]

[Figure: Machine Learning Framework]

Next, we will apply these concepts to the medical cost prediction problem as per the course example, but they are also applicable to other machine learning problems.

2 Data Loading

Import Python modules

    import numpy as np                                    # Numpy
    import pandas as pd                                   # Pandas
    import matplotlib.pyplot as plt                       # Matplotlib
    from sklearn.model_selection import train_test_split  # Scikit learn
    from sklearn.linear_model import LinearRegression

Read and Visualize the data

    data = pd.read_csv(Path_to_data)  # Read CSV file in Pandas
    data.head()                       # Display first 5 rows

3 Exploratory Data Analysis

Understand your data

    rows = data.shape[0]     # Number of Samples
    columns = data.shape[1]  # Number of Columns
    data.info()              # Data types, Missing values
    data.describe()          # Statistical description of columns

Distribution of charges

    data["charges"].plot(kind="hist")
    plt.title("Distribution of charges")
    plt.xlabel("Charges")
    plt.ylabel("Frequency")
    plt.show()

Correlation between smoking and cost of treatment

    smokers = data[(data.smoker == "yes")]     # Get smokers
    non_smokers = data[(data.smoker == "no")]  # Get non smokers
    fig = plt.figure(figsize=(12,5))           # Create the figure
    ax = fig.add_subplot(121)                  # 1st subplot smokers
    ax.hist(smokers["charges"])                # Smokers histogram
    ax.set_title('charges for smokers')        # Set subplot title
    # Repeat subplot for non smokers

Correlation between age and cost of treatment

    plt.scatter(smokers["age"], smokers["charges"], color='r')
    plt.scatter(non_smokers["age"], non_smokers["charges"], color='b')
    plt.xlabel("Age")
    plt.ylabel("Charges")
    plt.show()

The idea is that in this phase, we can understand how the features are correlated through different plots.

Correlation between BMI and cost of treatment

    plt.hist(obese["charges"], color='r')
    plt.hist(overweight["charges"], color='y')
    plt.hist(healthy["charges"], color='g')
    plt.hist(underweight["charges"], color='b')
    plt.title("Charges distribution")
    plt.xlabel("Charges")
    plt.ylabel("Frequency")
    plt.show()

4 Data Preprocessing

Removing missing data

    data.isnull().sum()   # Missing values in each column
    data = data.dropna()  # Drop rows with missing values

Removing unused Columns

    data.drop("region", axis=1, inplace=True)

The idea is to remove columns that do not contribute to our prediction. In our example, region does not affect the cost charged.

Convert Categorical columns to numerical

    gender = {'male':0, 'female':1}
    data['sex'] = data['sex'].apply(lambda x: gender[x])

    smokers = {'no':0, 'yes':1}
    data['smoker'] = data['smoker'].apply(lambda x: smokers[x])

Normalization

    data_max = data.max()
    data = data.divide(data_max)

The idea is to divide each column by its maximum value.

5 Model Training and testing

Data Splits

    X = data.iloc[:,0:-1].values  # Store all columns except last one as inputs in X
    y = data.iloc[:,-1].values    # Store the last column as the output (label) in y
    x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)  # Split dataset into 80/20

Linear Regression Modeling

    model = LinearRegression()   # Define our regression model
    model.fit(x_train, y_train)  # Train our model

Model Evaluation

    print('Model score {}'.format(model.score(x_test,y_test)))  # Evaluate the model based on score

Note that there are several ways to evaluate your model that you will see later on during other courses.

Feature importance

    columns_names = data.columns[0:-1].values
    features_importance = model.coef_
    plt.barh(columns_names, features_importance)
    plt.title('Features Importance')
    plt.xlabel('importance')
    plt.ylabel('feature')
    plt.show()

© 2021, Zaka AI, Inc. All Rights Reserved.
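The BMI histograms in section 3 plot four DataFrames (obese, overweight, healthy, underweight) that the sheet never defines. A minimal sketch of one way to build them is given here. The cut-offs 18.5, 25 and 30 are the common adult BMI category boundaries and are an assumption, since the sheet does not state which thresholds it uses. The sample rows are made up for illustration.

```python
import pandas as pd

# Hypothetical sample; in the sheet, `data` comes from pd.read_csv.
data = pd.DataFrame({
    "bmi": [17.2, 22.0, 27.5, 31.4, 35.0, 24.9],
    "charges": [2100.0, 3500.0, 6200.0, 14000.0, 39000.0, 4100.0],
})

# Assumed cut-offs (common adult BMI categories); the sheet does not give its own.
underweight = data[data["bmi"] < 18.5]
healthy = data[(data["bmi"] >= 18.5) & (data["bmi"] < 25)]
overweight = data[(data["bmi"] >= 25) & (data["bmi"] < 30)]
obese = data[data["bmi"] >= 30]

# Report the size and mean charge of each group.
groups = [("underweight", underweight), ("healthy", healthy),
          ("overweight", overweight), ("obese", obese)]
for name, group in groups:
    print(name, len(group), group["charges"].mean())
```

With these four names defined, the plt.hist calls from the BMI block can be run unchanged.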

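The encoding and normalization steps of section 4 can be checked on a tiny table. This is only a sketch: the rows are invented, the name smoker_codes is hypothetical, and Series.map is swapped in for the sheet's apply with a lambda (for a complete dictionary the two give the same result).

```python
import pandas as pd

# Hypothetical rows that mimic the medical cost columns (values are made up).
data = pd.DataFrame({
    "age": [19, 33, 60, 45],
    "sex": ["female", "male", "male", "female"],
    "bmi": [27.9, 22.7, 25.8, 36.5],
    "smoker": ["yes", "no", "no", "yes"],
    "charges": [16000.0, 4000.0, 12000.0, 40000.0],
})

# Same encodings as the sheet; Series.map replaces apply + lambda here.
gender = {"male": 0, "female": 1}
smoker_codes = {"no": 0, "yes": 1}
data["sex"] = data["sex"].map(gender)
data["smoker"] = data["smoker"].map(smoker_codes)

# Divide each column by its maximum so the largest entry in every column becomes 1.
data_max = data.max()
normalized = data.divide(data_max)

print(normalized.round(3))
```

Dividing by the maximum keeps values in [0, 1] only when a column is non-negative, and a column whose maximum is 0 would produce NaN or infinite values, so it is worth inspecting data_max before dividing.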
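The sheet notes that a model can be evaluated in several ways. As a rough illustration, the sketch here runs the split, fit and score steps of section 5 on synthetic data and adds mean absolute error as a second metric. The data and the coefficients 0.7 and 0.2 are invented for the example. The choice of mean_absolute_error is an assumption, since the sheet does not say which other metrics it has in mind.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed data: 200 samples, 3 features in [0, 1].
rng = np.random.default_rng(42)
X = rng.random((200, 3))
# Assumed ground truth for illustration: y depends mostly on the first feature.
y = 0.7 * X[:, 0] + 0.2 * X[:, 1] + 0.01 * rng.standard_normal(200)

# 80/20 split, as in the sheet.
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression()
model.fit(x_train, y_train)

# For LinearRegression, score() returns the R^2 coefficient of determination.
r2 = model.score(x_test, y_test)
# Mean absolute error is one of the other metrics available in sklearn.metrics.
mae = mean_absolute_error(y_test, model.predict(x_test))

print("R^2: {:.3f}, MAE: {:.4f}".format(r2, mae))
print("Coefficients:", model.coef_)
```

Because the features here share the same [0, 1] scale, the fitted coefficients can be compared directly, which is the assumption behind reading model.coef_ as feature importance after the normalization step.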