
Data Cleaning/Cleansing/Wrangling - formatting data so it can be computed on easily and
certain operations can be performed

Python Libraries :

=> for Scientific Computing Libraries :-


1. Pandas (Data Structures & Tools) - for effective data manipulation and analysis;
gives fast access to structured data; its primary data structure, a two-dimensional
table of rows and columns, is called a DataFrame, and it also provides easy indexing
functionality
2. NumPy (Arrays and Matrices) - arrays for inputs and outputs; used for fast
array processing
3. SciPy (Integrals, solving differential equations, optimization) - includes
functions for advanced mathematical operations (a short combined sketch follows this list)
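A minimal combined sketch of the three libraries (the numbers are made-up example values):

import pandas as pd
import numpy as np
from scipy import optimize

df = pd.DataFrame({"x": [1, 2, 3], "y": [2.1, 3.9, 6.2]})  # a small Pandas DataFrame
arr = np.array(df["y"])  # fast NumPy array built from a column
# SciPy optimization: find a slope b that minimizes the squared error
result = optimize.minimize(lambda b: float(((b[0] * df["x"] - df["y"]) ** 2).sum()), x0=[1.0])
print(df.head(), arr.mean(), result.x)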

=> for Visualization Libraries :-


1. Matplotlib (plots & graphs) - used to prepare graphs and plots; highly
customizable (a small plotting sketch follows this list)
2. Seaborn (plots: heat maps, time series, violin plots) - built on top of Matplotlib
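A small plotting sketch (made-up values; assumes seaborn 0.11+, where histplot is available):

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

prices = pd.Series([13495, 16500, 13950, 17450, 15250])  # made-up example prices
sns.histplot(prices)             # Seaborn draws onto a Matplotlib figure
plt.title("Price distribution")  # Matplotlib calls still work, since Seaborn builds on it
plt.show()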

=> for Algorithmic Libraries :-


1. Scikit-Learn (Machine Learning) - includes tools for statistical modelling such as
regression, classification, and clustering; this library is built on NumPy,
SciPy, and Matplotlib (a minimal sketch follows this list)
2. Statsmodels (explore data, estimate statistical models, and perform statistical tests) -
for statistical modelling and estimation
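A minimal Scikit-Learn sketch with toy data (the values are made up for illustration):

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])  # toy feature column
y = np.array([2.0, 4.1, 6.0, 8.2])          # toy target values
model = LinearRegression().fit(X, y)        # regression, one of the modelling tools listed above
print(model.coef_, model.intercept_)        # slope and intercept of the fitted line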

Importing data - the process of loading and reading data into Python from various source
formats like .csv, .json, .xlsx, .hdf

IMPORTING:

import pandas as pd
url = "https://archive.ics.uci.edu/ml"
df = pd.read_csv(url)
Without a header row: df = pd.read_csv(url, header=None)

PRINTING:
df - prints the entire DataFrame
df.head(n) - shows the first n rows of the DataFrame
df.tail(n) - shows the last n rows of the DataFrame

ADDING HEADERS:
headers = ["abc", "cvs", "xyz"....]
Replace the default header with: df.columns = headers

EXPORTING TO CSV:
path = r"C:\Windows\...\automobile.csv" (a raw string, so the backslashes are not
treated as escape characters)
df.to_csv(path)

READING:
pd.read_sql(), pd.read_excel(), pd.read_json().....
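A few hedged examples of these readers (every file name, the database, and the table name here are hypothetical placeholders):

import pandas as pd
import sqlite3

df_xlsx = pd.read_excel("data.xlsx")  # needs an engine such as openpyxl for .xlsx files
df_json = pd.read_json("data.json")
conn = sqlite3.connect("example.db")  # read_sql needs an open database connection
df_sql = pd.read_sql("SELECT * FROM automobiles", conn)  # "automobiles" is a hypothetical table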

TYPES:
df.dtypes - identify the data type of each column
DESCRIBE:
df.describe() - summarizes the numeric columns, excluding NaN values; to include all
columns (including non-numeric ones): df.describe(include="all")

NaN - Not a Number

Dealing with missing values :


1. Drop the missing values - drop the whole variable, or drop the single data entry
2. Replace the missing values - replace with an average, with the most frequent value,
or based on another function
3. Leave it as is

DROP EMPTY VALUES:


df.dropna() - e.g. df.dropna(subset=["price"], axis=0, inplace=True) drops the rows
where "price" is missing

REPLACE MISSING VALUES:


df.replace(missing_value, new_value)
import numpy as np
mean = df["normalized-losses"].mean()
df["normalized-losses"] = df["normalized-losses"].replace(np.nan, mean)
(replace returns a new Series, so assign the result back to the column)

Data Formatting - bringing data into a common standard of expression that allows users
to make meaningful comparisons

Converting from one unit to another is easy to perform in Pandas,


e.g. miles per gallon --> litres per 100 km
df["city-mpg"] = 235/df["city-mpg"]
df.rename(columns={"city-mpg": "city-L/100km"}, inplace=True)

df.dtypes - identify data types (an attribute, not a method, so no parentheses)


df.astype() - convert from one format to another
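For example, a short sketch (assuming "price" was read in as an object/string column):

df["price"] = df["price"].astype("float")  # cast the column from object to float
print(df.dtypes)                           # confirm the new type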

DATA NORMALIZATION:
bringing data into a specific range so that variables become comparable

Approaches to normalization:
1. Simple Feature Scaling - X_new = X_old / X_max (range 0 to 1)
df["length"] = df["length"]/df["length"].max()

2. Min-Max - X_new = (X_old - X_min) / (X_max - X_min) (range 0 to 1)


df["length"] = (df["length"]-df["length"].min())/(df["length"].max() -
df["length"].min())

3. Z-score - X_new = (X_old - mu) / sigma (values typically range from about -3 to +3)


df["length"] = (df["length"] - df["length"].mean())/df["length"].std()

BINNING:
a method of data preprocessing
groups values into bins
converts numeric values into categorical values
groups a set of numerical values into a set of bins for a better understanding of the
data distribution

np.linspace - returns evenly spaced numbers over an interval; used here to create the bin edges


import numpy as np
bins = np.linspace(min(df["price"]), max(df["price"]), 4) - 4 edges make 3 bins
group_names = ["Low", "Medium", "High"] - dividing into 3 bins/groups
df["price-binned"] = pd.cut(df["price"], bins, labels=group_names,
include_lowest=True) - cut sorts and segments the values into the bins
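A small follow-up sketch to inspect how the rows fell into the three bins (using the "price-binned" column created above):

print(df["price-binned"].value_counts())  # number of rows that landed in each bin
import matplotlib.pyplot as plt
plt.hist(df["price"], bins=3)             # the same 3 equal-width bins, shown as a histogram
plt.xlabel("price")
plt.ylabel("count")
plt.show()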

CATEGORICAL FORM TO NUMERIC:


using the "One-Hot Encoding" technique
pd.get_dummies(df["fuel"]) - converts each categorical value into an indicator variable (0 or 1)
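A hedged sketch of attaching the dummy columns back to the DataFrame (the concat/drop steps are an assumption beyond the note above, but a common pattern):

dummies = pd.get_dummies(df["fuel"])   # one 0/1 indicator column per fuel category
df = pd.concat([df, dummies], axis=1)  # attach the indicator columns to the DataFrame
df.drop("fuel", axis=1, inplace=True)  # the original text column is no longer needed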

EXPLORATORY DATA ANALYSIS (EDA):


summarize the main characteristics of the data
gain a better understanding
uncover relationships
extract important variables

Descriptive Statistics: basic features of a dataset


df.describe() - shows a summary table
drive_wheels_counts = df["drive-wheels"].value_counts().to_frame()
drive_wheels_counts.rename(columns={"drive-wheels": "value_counts"}, inplace=True)
drive_wheels_counts.index.name = "drive-wheels"

import seaborn as sns
sns.boxplot(x="drive-wheels", y="price", data=df)

In scatter plots we put the predictor (independent) variable on the x-axis and the target
(dependent) variable on the y-axis.

ex :
import matplotlib.pyplot as plt
y = df["engine-size"]
x = df["price"]
plt.scatter(x, y)

plt.title("Scatter plot of Engine Size & Price")
plt.xlabel("Price")
plt.ylabel("engine size")

GROUPING :
df.groupby() - group data by the values of one or more variables
df_test = df[["drive-wheels", "body-style", "price"]]
df_grp = df_test.groupby(["drive-wheels", "body-style"], as_index=False).mean()
Cast the result into pivot form:
df_pivot = df_grp.pivot(index="drive-wheels", columns="body-style")
=> Create a heatmap from the same data:
plt.pcolor(df_pivot, cmap="RdBu")
plt.colorbar()
plt.show()

ANALYSIS OF VARIANCE (ANOVA):


a statistical test used to find the correlation between different groups of a
categorical variable
A large F-value implies strong correlation and a small F-value implies weak correlation
ANOVA between two groups:
from scipy import stats
df_anova = df[["make", "price"]]
grouped_anova = df_anova.groupby(["make"])
anova_results_1 = stats.f_oneway(grouped_anova.get_group("honda")["price"],
grouped_anova.get_group("subaru")["price"])

CORRELATION:
measures to what extent different variables are interdependent
correlation doesn't imply causation
e.g. Rain --> Umbrella
Positive Linear Relationship:
sns.regplot(x="engine-size", y="price", data=df)
plt.ylim(0,)
Negative Linear Relationship:
sns.regplot(x="highway-mpg", y="price", data=df)
plt.ylim(0,)
Pearson Correlation:
measures the strength of the relationship between 2 features.
1. Correlation coefficient (close to +1 is a large positive correlation, close to -1 is a
large negative correlation, close to 0 means no relationship)
2. P-value (p < 0.001 - strong certainty, p < 0.05 - moderate, p < 0.1 - weak,
p > 0.1 - no certainty in the result)

from scipy import stats
pearson_coef, p_value = stats.pearsonr(df["horsepower"], df["price"])

