
Data Cleaning/Cleansing/Wrangling - formatting data so it can be computed on easily and
certain operations can be performed

Python Libraries :

=> for Scientific Computing Libraries :-


1. Pandas (Data Structures & Tools) - for effective data manipulation and analysis;
gives fast access to structured data; its primary data structure, a two-dimensional
table of rows and columns, is called a DataFrame, and it also provides easy indexing
functionality
2. NumPy (Arrays and Matrices) - arrays for inputs and outputs; used for fast
array processing
3. SciPy (Integrals, solving differential equations, optimization) - includes
functions for advanced mathematical operations (a short combined sketch follows this list)
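A minimal combined sketch of the three libraries (the numbers are made-up example values):

import pandas as pd
import numpy as np
from scipy import optimize

df = pd.DataFrame({"x": [1, 2, 3], "y": [2.1, 3.9, 6.2]})  # a small Pandas DataFrame
arr = np.array(df["y"])  # fast NumPy array built from a column
# SciPy optimization: find a slope b that minimizes the squared error
result = optimize.minimize(lambda b: float(((b[0] * df["x"] - df["y"]) ** 2).sum()), x0=[1.0])
print(df.head(), arr.mean(), result.x)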

=> for Visualization Libraries :-


1. Matplotlib (plots & graphs) - used to prepare graphs and plots; highly
customizable (a small plotting sketch follows this list)
2. Seaborn (plots: heat maps, time series, violin plots) - built on top of Matplotlib
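A small plotting sketch (made-up values; assumes seaborn 0.11+, where histplot is available):

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

prices = pd.Series([13495, 16500, 13950, 17450, 15250])  # made-up example prices
sns.histplot(prices)             # Seaborn draws onto a Matplotlib figure
plt.title("Price distribution")  # Matplotlib calls still work, since Seaborn builds on it
plt.show()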

=> for Algorithmic Libraries :-


1. Scikit-Learn (Machine Learning) - includes tools for statistical modelling such as
regression, classification, and clustering; this library is built on NumPy,
SciPy, and Matplotlib (a minimal sketch follows this list)
2. Statsmodels (explore data, estimate statistical models, and perform statistical tests) -
for statistical modelling and estimation
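A minimal Scikit-Learn sketch with toy data (the values are made up for illustration):

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])  # toy feature column
y = np.array([2.0, 4.1, 6.0, 8.2])          # toy target values
model = LinearRegression().fit(X, y)        # regression, one of the modelling tools listed above
print(model.coef_, model.intercept_)        # slope and intercept of the fitted line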

Importing data - the process of loading and reading data into Python from various source
formats like .csv, .json, .xlsx, .hdf

IMPORTING:

import pandas as pd
url = "https://archive.ics.uci.edu/ml"
df = pd.read_csv(url)
Without a header row: df = pd.read_csv(url, header=None)

PRINTING:
df - prints the entire DataFrame
df.head(n) - shows the first n rows of the DataFrame
df.tail(n) - shows the last n rows of the DataFrame

ADDING HEADERS:
headers = ["abc", "cvs", "xyz"....]
Replace the default header with: df.columns = headers

EXPORTING TO CSV:
path = r"C:\Windows\...\automobile.csv" (a raw string, so the backslashes are not
treated as escape characters)
df.to_csv(path)

READING:
pd.read_sql(), pd.read_excel(), pd.read_json().....
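A few hedged examples of these readers (every file name, the database, and the table name here are hypothetical placeholders):

import pandas as pd
import sqlite3

df_xlsx = pd.read_excel("data.xlsx")  # needs an engine such as openpyxl for .xlsx files
df_json = pd.read_json("data.json")
conn = sqlite3.connect("example.db")  # read_sql needs an open database connection
df_sql = pd.read_sql("SELECT * FROM automobiles", conn)  # "automobiles" is a hypothetical table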

TYPES:
df.dtypes - identify the data type of each column
DESCRIBE:
df.describe() - summarizes the numeric columns, excluding NaN values; to include all
columns (including non-numeric ones): df.describe(include="all")

NaN - Not a Number

Dealing with missing values :


1. Drop the missing values - drop the whole variable, or drop the single data entry
2. Replace the missing values - replace with an average, with the most frequent value,
or based on another function
3. Leave it as is

DROP EMPTY VALUES:


df.dropna() - e.g. df.dropna(subset=["price"], axis=0, inplace=True) drops the rows
where "price" is missing

REPLACE MISSING VALUES:


df.replace(missing_value, new_value)
import numpy as np
mean = df["normalized-losses"].mean()
df["normalized-losses"] = df["normalized-losses"].replace(np.nan, mean)
(replace returns a new Series, so assign the result back to the column)

Data Formatting - bringing data into a common standard of expression that allows users
to make meaningful comparisons

Converting from one unit to another is easy to perform in Pandas,


e.g. miles per gallon --> litres per 100 km
df["city-mpg"] = 235/df["city-mpg"]
df.rename(columns={"city-mpg": "city-L/100km"}, inplace=True)

df.dtypes - identify data types (an attribute, not a method, so no parentheses)


df.astype() - convert from one format to another
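For example, a short sketch (assuming "price" was read in as an object/string column):

df["price"] = df["price"].astype("float")  # cast the column from object to float
print(df.dtypes)                           # confirm the new type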

DATA NORMALIZATION:
bringing data into a specific range so that variables become comparable

Approaches to normalization:
1. Simple Feature Scaling - X_new = X_old / X_max (range 0 to 1)
df["length"] = df["length"]/df["length"].max()

2. Min-Max - X_new = (X_old - X_min) / (X_max - X_min) (range 0 to 1)


df["length"] = (df["length"]-df["length"].min())/(df["length"].max() -
df["length"].min())

3. Z-score - X_new = (X_old - mu) / sigma (values typically range from about -3 to +3)


df["length"] = (df["length"] - df["length"].mean())/df["length"].std()

BINNING:
a method of data preprocessing
groups values into bins
converts numeric values into categorical values
groups a set of numerical values into a set of bins for a better understanding of the
data distribution

np.linspace - returns evenly spaced numbers over an interval; used here to create the bin edges


import numpy as np
bins = np.linspace(min(df["price"]), max(df["price"]), 4) - 4 edges make 3 bins
group_names = ["Low", "Medium", "High"] - dividing into 3 bins/groups
df["price-binned"] = pd.cut(df["price"], bins, labels=group_names,
include_lowest=True) - cut sorts and segments the values into the bins
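A small follow-up sketch to inspect how the rows fell into the three bins (using the "price-binned" column created above):

print(df["price-binned"].value_counts())  # number of rows that landed in each bin
import matplotlib.pyplot as plt
plt.hist(df["price"], bins=3)             # the same 3 equal-width bins, shown as a histogram
plt.xlabel("price")
plt.ylabel("count")
plt.show()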

CATEGORICAL FORM TO NUMERIC:


using the "One-Hot Encoding" technique
pd.get_dummies(df["fuel"]) - converts each categorical value into an indicator variable (0 or 1)
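A hedged sketch of attaching the dummy columns back to the DataFrame (the concat/drop steps are an assumption beyond the note above, but a common pattern):

dummies = pd.get_dummies(df["fuel"])   # one 0/1 indicator column per fuel category
df = pd.concat([df, dummies], axis=1)  # attach the indicator columns to the DataFrame
df.drop("fuel", axis=1, inplace=True)  # the original text column is no longer needed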

EXPLORATORY DATA ANALYSIS (EDA):


summarize the main characteristics of the data
gain a better understanding
uncover relationships
extract important variables

Descriptive Statistics: basic features of a dataset


df.describe() - shows a summary table
drive_wheels_counts = df["drive-wheels"].value_counts().to_frame()
drive_wheels_counts.rename(columns={"drive-wheels": "value_counts"}, inplace=True)
drive_wheels_counts.index.name = "drive-wheels"

import seaborn as sns
sns.boxplot(x="drive-wheels", y="price", data=df)

In scatter plots we put the predictor (independent) variable on the x-axis and the target
(dependent) variable on the y-axis.

ex :
import matplotlib.pyplot as plt
y = df["engine-size"]
x = df["price"]
plt.scatter(x, y)

plt.title("Scatter plot of Engine Size & Price")
plt.xlabel("Price")
plt.ylabel("engine size")

GROUPING :
df.groupby() - group data by the values of one or more variables
df_test = df[["drive-wheels", "body-style", "price"]]
df_grp = df_test.groupby(["drive-wheels", "body-style"], as_index=False).mean()
Cast the result into pivot form:
df_pivot = df_grp.pivot(index="drive-wheels", columns="body-style")
=> Create a heatmap from the same data:
plt.pcolor(df_pivot, cmap="RdBu")
plt.colorbar()
plt.show()

ANALYSIS OF VARIANCE (ANOVA):


a statistical test used to find the correlation between different groups of a
categorical variable
A large F-value implies strong correlation and a small F-value implies weak correlation
ANOVA between two groups:
from scipy import stats
df_anova = df[["make", "price"]]
grouped_anova = df_anova.groupby(["make"])
anova_results_1 = stats.f_oneway(grouped_anova.get_group("honda")["price"],
grouped_anova.get_group("subaru")["price"])

CORRELATION:
measures to what extent different variables are interdependent
correlation doesn't imply causation
e.g. Rain --> Umbrella
Positive Linear Relationship:
sns.regplot(x="engine-size", y="price", data=df)
plt.ylim(0,)
Negative Linear Relationship:
sns.regplot(x="highway-mpg", y="price", data=df)
plt.ylim(0,)
Pearson Correlation:
measures the strength of the relationship between 2 features.
1. Correlation coefficient (close to +1 is a large positive correlation, close to -1 is a
large negative correlation, close to 0 means no relationship)
2. P-value (p < 0.001 - strong certainty, p < 0.05 - moderate, p < 0.1 - weak,
p > 0.1 - no certainty in the result)

from scipy import stats
pearson_coef, p_value = stats.pearsonr(df["horsepower"], df["price"])

