certain operations
Python Libraries:
Importing data - the process of loading and reading data into Python from various
sources and formats such as .csv, .json, .xlsx and .hdf
IMPORTING:
import pandas as pd
url = "https://archive.ics.uci.edu/ml"
df = pd.read_csv(url)
If the file has no header row: df = pd.read_csv(url, header=None)
PRINTING:
df - prints the entire dataframe
df.head(n) - shows the first n rows of the dataframe
df.tail(n) - shows the last n rows of the dataframe
ADDING HEADERS:
Replace the default header with a list of column names:
headers = ["abc", "cvs", "xyz", ...]
df.columns = headers
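A minimal sketch of reading header-less data and then assigning column names. The
in-memory CSV text is invented for illustration and stands in for a real file or
URL; the three column names are the placeholders used in these notes.

```python
import io
import pandas as pd

# Hypothetical three-column data with no header row (stands in for a real file or URL).
raw_csv = "1,2.5,a\n2,3.5,b\n3,4.5,c\n"

# header=None tells pandas the first line is data, so columns get integer labels.
df = pd.read_csv(io.StringIO(raw_csv), header=None)
default_columns = list(df.columns)

# Replace the integer labels with descriptive names (placeholder names from the notes).
headers = ["abc", "cvs", "xyz"]
df.columns = headers

print(default_columns)
print(df.head(2))
```

The list assigned to df.columns must have exactly one name per column, otherwise
pandas raises an error.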
EXPORTING TO CSV:
path = r"C:\Windows\...\automobile.csv"   (raw string, so backslashes are not read as escapes)
df.to_csv(path)
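A small sketch of a CSV round trip, assuming a made-up two-row dataframe and a
temporary directory instead of the Windows path above, so that it runs on any
system. The index=False argument is an optional choice, not something the notes
require.

```python
import os
import tempfile
import pandas as pd

# Invented sample data, for illustration only.
df = pd.DataFrame({"make": ["audi", "bmw"], "price": [13950, 16430]})

# Write to a temporary directory so the sketch does not depend on a fixed path.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "automobile.csv")
    # index=False leaves out the row index, which to_csv writes by default.
    df.to_csv(path, index=False)
    # Read the file back to confirm what was written.
    df_back = pd.read_csv(path)

print(df_back)
```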
READING:
pd.read_sql(), pd.read_excel(), pd.read_json().....
TYPES:
df.dtypes
DESCRIBE:
df.describe() - summarizes numeric columns only and skips missing (NaN) values;
to include non-numeric columns as well, use df.describe(include="all")
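To make the difference concrete, here is a short sketch on an invented dataframe
with one numeric column (containing a missing value) and one text column.

```python
import numpy as np
import pandas as pd

# Invented sample data: "price" is numeric with one NaN, "fuel" is text.
df = pd.DataFrame({
    "price": [13495.0, 16500.0, np.nan, 13950.0],
    "fuel": ["gas", "gas", "diesel", "gas"],
})

# Default: only numeric columns are summarized; NaN is excluded from the count.
numeric_summary = df.describe()

# include="all": text columns are added, with count / unique / top / freq rows.
full_summary = df.describe(include="all")

print(numeric_summary)
print(full_summary)
```

In the full summary, statistics that do not apply to a column (for example the
mean of "fuel") show up as NaN.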
Data Formatting - bringing data into a common standard of expression, which allows
users to make meaningful comparisons
DATA NORMALIZATION:
rescaling the values of different variables into a similar range so that they can
be compared fairly
Approaches to normalization:
1. Simple Feature Scaling - Xnew = Xold/Xmax (range 0-1 for non-negative values)
df["length"] = df["length"]/df["length"].max()
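A minimal sketch of simple feature scaling on invented length values. It assumes
the column is non-negative with a positive maximum, which is what keeps the
result inside the 0-1 range.

```python
import pandas as pd

# Invented sample lengths, for illustration only.
df = pd.DataFrame({"length": [168.8, 171.2, 176.6, 192.7]})
original_first = df["length"].iloc[0]
original_max = df["length"].max()

# Simple feature scaling: divide every value by the column maximum.
df["length"] = df["length"] / df["length"].max()

print(df)
```

After scaling, the largest value becomes exactly 1 and every other value is its
fraction of that maximum.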
BINNING:
a data preprocessing method
groups values into bins
can convert numeric values into categorical ones
grouping a set of numerical values into a set of bins gives a better understanding
of the data distribution
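One way to bin a numeric column is pd.cut, sketched below. The prices, the choice
of three equal-width bins and the labels "low", "medium" and "high" are all
assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Invented sample prices, for illustration only.
df = pd.DataFrame({"price": [5000, 9000, 15000, 22000, 30000, 45000]})

# Four equally spaced edges define three equal-width bins.
bins = np.linspace(df["price"].min(), df["price"].max(), 4)
group_names = ["low", "medium", "high"]  # hypothetical labels

# include_lowest=True makes the first bin include the minimum value itself.
df["price-binned"] = pd.cut(df["price"], bins, labels=group_names, include_lowest=True)

print(df)
print(df["price-binned"].value_counts())
```

The value_counts() call shows how many rows fall in each bin, which is a quick
view of the distribution.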
In scatter plots we put the predictor variable on the x-axis and the target
variable on the y-axis.
ex: x = df["engine-size"]
y = df["price"]
plt.scatter(x, y)
GROUPING :
df.groupby() - groups rows by one or more categorical variables so that an
aggregate (e.g. the mean) can be computed per group
df_test = df[['drive-wheels', 'body-style', 'price']]
df_grp = df_test.groupby(["drive-wheels", "body-style"], as_index=False).mean()
Convert the result into a pivot table:
df_pivot = df_grp.pivot(index="drive-wheels", columns="body-style")
=> Create a heatmap from the same data:
plt.pcolor(df_pivot, cmap='RdBu')
plt.colorbar()
plt.show()
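The grouping and pivot steps can be sketched on a tiny invented dataset as
follows. The rows and prices are made up; only the column names come from the
notes. Plotting is left out so the example runs without a display.

```python
import pandas as pd

# Invented sample rows, for illustration only.
df = pd.DataFrame({
    "drive-wheels": ["fwd", "fwd", "rwd", "rwd", "rwd"],
    "body-style": ["sedan", "hatchback", "sedan", "sedan", "hatchback"],
    "price": [10000.0, 8000.0, 20000.0, 24000.0, 15000.0],
})

# Mean price for each (drive-wheels, body-style) combination.
df_grp = df.groupby(["drive-wheels", "body-style"], as_index=False).mean()

# Pivot: one row per drive-wheels value, one column per body-style value.
df_pivot = df_grp.pivot(index="drive-wheels", columns="body-style")

print(df_grp)
print(df_pivot)
```

Because no values argument is passed to pivot, the resulting columns form a
two-level index such as ("price", "sedan"). Combinations that never occur in the
data would appear as NaN in the pivot table.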
CORRELATION:
measures to what extent different variables are interdependent
correlation doesn't imply causation
Rain --> Umbrella
Positive Linear Relationship
sns.regplot(x="engine-size", y="price", data=df)
plt.ylim(0,)
Negative Linear Relationship
sns.regplot(x="highway-mpg", y="price", data=df)
plt.ylim(0,)
Pearson Correlation:
Measures the strength of the linear correlation between two features.
1. Correlation coefficient (close to +1: strong positive relationship, close to -1:
strong negative relationship, close to 0: no linear relationship)
2. P-value (p < 0.001: strong certainty in the result, p < 0.05: moderate certainty,
p < 0.1: weak certainty, p > 0.1: no certainty in the result)
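Both numbers can be obtained in one call. The sketch below assumes SciPy is
installed and uses scipy.stats.pearsonr, which the notes do not name; the
engine-size and price values are invented.

```python
import pandas as pd
from scipy import stats

# Invented sample data, for illustration only.
df = pd.DataFrame({
    "engine-size": [97, 109, 130, 152, 181, 209],
    "price": [7000, 9500, 13500, 16500, 22000, 30000],
})

# pearsonr returns the correlation coefficient and the two-sided p-value.
pearson_coef, p_value = stats.pearsonr(df["engine-size"], df["price"])

# pandas computes the same coefficient (without a p-value) via Series.corr.
pandas_coef = df["engine-size"].corr(df["price"])

print(pearson_coef, p_value)
```

For this made-up data the coefficient is close to +1, because price rises almost
in step with engine size.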
DUMMY VARIABLES:
pd.get_dummies(df['fuel']) - converts a categorical column into indicator columns,
one per category
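As an illustration of what pd.get_dummies returns, here is a short sketch on an
invented "fuel" column with two categories.

```python
import pandas as pd

# Invented sample data, for illustration only.
df = pd.DataFrame({"fuel": ["gas", "diesel", "gas", "gas"]})

# One indicator column per category; each row is marked in exactly one of them.
dummies = pd.get_dummies(df["fuel"])

print(dummies)
```

Depending on the pandas version the indicator values are shown as True/False or
as 1/0; calling dummies.astype(int) gives 1/0 in either case.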