Data Handling Part II

DESCRIPTIVE STATISTICS WITH PANDAS
• Used to summarise the given data.

METHODS USED:
• max, min, count, sum, mean, median, mode, quartiles, variance.
CALCULATING MAXIMUM VALUES
max()
• Used to calculate the maximum value of each column (or row) of the DataFrame, regardless of its data type.
• By default the calculation is column-wise.
SYNTAX:
• DataFrame.max(axis=None, skipna=bool/None, numeric_only=bool/None)
• axis = 0 (maximum of each column) / 1 (maximum of each row)
• skipna (exclude NA/null values)
• numeric_only (accepts a bool. True: numeric columns only. False: every column. By default None in older pandas versions and False from pandas 2.0 onwards.)
To get the maximum value of a specific column in a pandas DataFrame:
df['columnname'].max()
CALCULATING MINIMUM VALUES
min()
• Used to calculate the minimum value of each column (or row) of the DataFrame, regardless of its data type.
• By default the calculation is column-wise.
SYNTAX:
• DataFrame.min(axis=None, skipna=bool/None, numeric_only=bool/None)
• axis = 0 (minimum of each column) / 1 (minimum of each row)
• skipna (exclude NA/null values)
• numeric_only (accepts a bool. True: numeric columns only. False: every column. By default None in older pandas versions and False from pandas 2.0 onwards.)
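The max() and min() calls described above can be tried on a small table. The following is a minimal sketch; the DataFrame, its column names and its values are invented for illustration.

```python
import pandas as pd

# Hypothetical marks table, invented for illustration
df = pd.DataFrame({
    "name": ["Asha", "Ravi", "Meena"],
    "maths": [78, 92, 85],
    "science": [88, 79, 91],
})

col_max = df.max(numeric_only=True)              # maximum of each numeric column
col_min = df.min(numeric_only=True)              # minimum of each numeric column
row_max = df[["maths", "science"]].max(axis=1)   # maximum of each row
maths_max = df["maths"].max()                    # maximum of one specific column

print(col_max)
print(col_min)
print(row_max.tolist())
print(maths_max)
```

Passing numeric_only=True leaves the text column "name" out of the result; without it, pandas also compares the strings.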
CALCULATING SUM
sum()
• Used to add all of the values in each column (or each row) of a DataFrame.
• Skips all the missing values by default.
SYNTAX:
• DataFrame.sum(axis=None, skipna=bool/None, ...)
CALCULATING COUNT
count()
• Used to get the number of values present in each column or row.
• It counts only the non-NA entries; NA values are ignored.
SYNTAX:
• DataFrame.count(axis=None, numeric_only=bool/None)
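To see how sum() and count() treat missing values, here is a small sketch. The table and its values are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical marks with one missing value in each column
df = pd.DataFrame({
    "maths": [78, np.nan, 85],
    "science": [88, 79, np.nan],
})

col_sum = df.sum()              # column totals, NaN is skipped
row_sum = df.sum(axis=1)        # row totals, NaN is skipped
col_count = df.count()          # non-NA entries in each column
row_count = df.count(axis=1)    # non-NA entries in each row

print(col_sum)
print(row_sum.tolist())
print(col_count)
print(row_count.tolist())
```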
CALCULATING MODE
mode()
• mode: the most frequently repeated value(s) in a given set of numbers.
• The mode() function calculates the mode, i.e. the most frequently occurring value(s), along the selected axis.
• Can return multiple values. If no value repeats, all values are returned in sorted order.
• axis 0 → get the mode of each column
• axis 1 → get the mode of each row
SYNTAX:
• DataFrame.mode(axis=None, numeric_only=True/False)
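A short sketch of mode() on invented data follows. It shows a column with one mode, a column with two modes, and a Series in which no value repeats.

```python
import pandas as pd

# Hypothetical data: "a" has one mode, "b" has two
df = pd.DataFrame({"a": [1, 2, 2, 3], "b": [5, 5, 6, 6]})

modes = df.mode()                            # one column of modes per original column
a_modes = modes["a"].dropna().tolist()       # shorter columns are padded with NaN
b_modes = modes["b"].tolist()

# When no value repeats, every value is returned, in sorted order
no_repeat = pd.Series([3, 1, 2]).mode().tolist()

print(modes)
print(no_repeat)
```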
CALCULATING MEAN
mean()
• mean: the average.
• The mean() function calculates the arithmetic mean (average) of the columns or rows of a DataFrame.
SYNTAX:
• DataFrame.mean(axis=None, numeric_only=True/False, skipna=None)
CALCULATING MEDIAN
median()
• median: the middle value.
• The median() function calculates the median (middle value) of the columns or rows of a DataFrame.
SYNTAX:
• DataFrame.median(axis=None, numeric_only=True/False, skipna=None)
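The difference between mean() and median() can be seen on a small example. The marks below are invented for illustration.

```python
import pandas as pd

# Hypothetical marks of four students
df = pd.DataFrame({
    "maths": [70, 80, 90, 100],
    "science": [60, 65, 75, 100],
})

col_mean = df.mean()          # average of each column
col_median = df.median()      # middle value of each column
row_mean = df.mean(axis=1)    # average of each row

print(col_mean)
print(col_median)
print(row_mean.tolist())
```

With an even number of values, the median is the average of the two middle values, so the science median here lies between 65 and 75.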
CALCULATING QUANTILE
quantile()
• A quantile (fractile) is a cut point that divides a sample into equal-sized subgroups; quartiles divide it into quarters.
• The quantile() function is used to get the quantiles of the rows or columns of the DataFrame.
• Quartiles divide the data into four equal parts:
• 1st quartile → 25% → .25
• 2nd quartile → 50% → .5 (Median)
• 3rd quartile → 75% → .75
SYNTAX:
• DataFrame.quantile(q, axis=None, ...)
• By default it returns the 2nd quartile (the median) of all numeric values.
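The three quartiles can be requested in one call by passing a list of fractions. This is a minimal sketch on made-up marks.

```python
import pandas as pd

# Hypothetical marks
df = pd.DataFrame({"marks": [10, 20, 30, 40, 50]})

default_q = df.quantile()                     # no argument: q=0.5, the median
quartiles = df.quantile([0.25, 0.5, 0.75])    # one row per requested quantile

print(default_q)
print(quartiles)
```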
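CALCULATING VARIANCE
var()
• The var() function calculates the variance of the columns or rows of a DataFrame.
• Variance is the average of the squared differences from the mean.
SYNTAX:
• DataFrame.var(axis=None, numeric_only=True/False, ...)
STEPS TO CALCULATE VARIANCE
• Find the mean.
• Subtract the mean from each number and square the result.
• Find the average of the squared differences, dividing by n - 1 rather than n (pandas computes the sample variance by default).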
CALCULATING STANDARD DEVIATION
std()
• The std() function returns the standard deviation of the values.
• Standard deviation is calculated as the square root of the variance.
SYNTAX:
• DataFrame.std(axis=None, numeric_only=True/False, ...)
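The variance steps can be checked by hand against var() and std(). The sketch below uses invented numbers and assumes the pandas default of dividing by n - 1 (ddof=1).

```python
import pandas as pd

# Hypothetical values
s = pd.Series([1, 2, 3, 4, 5])

mean = s.mean()                                   # step 1: find the mean
squared_diff = (s - mean) ** 2                    # step 2: subtract the mean and square
manual_var = squared_diff.sum() / (len(s) - 1)    # step 3: average, dividing by n - 1

pandas_var = s.var()                 # sample variance (ddof=1 by default)
population_var = s.var(ddof=0)       # divides by n instead
pandas_std = s.std()                 # square root of the sample variance

print(manual_var, pandas_var, population_var, pandas_std)
```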
describe()
• The describe() function displays the descriptive statistics in a single command.
SYNTAX:
• DataFrame.describe(percentiles=None, include=None, exclude=None)
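As a quick illustration, describe() on a numeric column returns the count, mean, standard deviation, minimum, quartiles and maximum together. The data below is invented.

```python
import pandas as pd

# Hypothetical marks
df = pd.DataFrame({"marks": [10, 20, 30, 40, 50]})

summary = df.describe()    # one row per statistic, one column per numeric column

print(summary)
```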
DATA AGGREGATIONS
• Data aggregation is the process where data is collected and presented in a summarized format for statistical analysis.
• "Aggregation means to transform the dataset and produce a single numeric value from an array."
• Aggregation can be applied to one or more columns together.
• Aggregation functions include max(), min(), sum(), count(), std() and var().
aggregate()
SYNTAX:
• DataFrame.aggregate('function', axis=None)
• DataFrame.aggregate(['function', 'function'], axis=None)
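Both forms of aggregate() are sketched below on an invented table: a single function name returns one value per column, and a list of names returns one row per function.

```python
import pandas as pd

# Hypothetical marks table
df = pd.DataFrame({
    "maths": [78, 92, 85],
    "science": [88, 79, 91],
})

single = df.aggregate("max")                    # one value per column
multi = df.aggregate(["min", "max", "sum"])     # one row per function

print(single)
print(multi)
```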
SORTING
• Sorting refers to the arrangement of data elements in a specific order, which can be either ascending or descending.
• The sort_values() function is used to sort the data values of a DataFrame.
SYNTAX:
• DataFrame.sort_values(by, axis=0/1, ascending=True/False)
• by: defines the column(s) to sort by
• axis (0 = row-wise, 1 = column-wise)
• ascending (True / 1 = ascending, False / 0 = descending)
• The sort_index() function is used to sort or arrange the rows or columns on the basis of their index values (labels).
SYNTAX:
• DataFrame.sort_index(axis=0/1, ascending=True/False)
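The two sorting functions can be compared on a small table with a shuffled index. The names, marks and index labels are invented for this sketch.

```python
import pandas as pd

# Hypothetical table with a deliberately unordered index
df = pd.DataFrame(
    {"name": ["Ravi", "Asha", "Meena"], "marks": [92, 78, 85]},
    index=[2, 0, 1],
)

by_marks = df.sort_values(by="marks", ascending=False)   # highest marks first
by_index = df.sort_index()                                # rows ordered by index label
by_columns = df.sort_index(axis=1)                        # columns ordered by name

print(by_marks)
print(by_index)
print(by_columns)
```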
HANDLING MISSING VALUES
A missing value is denoted by NaN.
The two common ways to handle missing values are:
1. Drop the object having missing values.
2. Fill or estimate the missing values.
CHECKING MISSING VALUES
• The isnull() function is used to check whether any value is missing in the DataFrame.
• Returns True where a missing value is found, otherwise False.
Syntax: dataframe_object.isnull()
dataframe_object['column_name'].isnull()
// checks NaN column-wise //
The any() function is used to check whether a column has a missing value anywhere in the dataset.
dataframe_object.isnull().any()
The any() function can also be used for a particular attribute (column).
dataframe_obj['column_name'].isnull().any()
To find the number of NaN values in each column, use sum() along with the isnull() function.
dataframe_obj.isnull().sum()
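The checks above can be combined as in the following sketch. The table is invented and has a single missing value in the "maths" column.

```python
import numpy as np
import pandas as pd

# Hypothetical marks with one missing value
df = pd.DataFrame({
    "maths": [78, np.nan, 85],
    "science": [88, 79, 91],
})

mask = df.isnull()                        # True wherever a value is missing
has_nan = df.isnull().any()               # per column: is anything missing?
maths_has_nan = df["maths"].isnull().any()
nan_counts = df.isnull().sum()            # number of NaN values per column

print(mask)
print(has_nan)
print(nan_counts)
```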
DROPPING MISSING VALUES
Dropping removes every row that has one or more missing values.
The dropna() function is used to drop entire rows from the DataFrame.
dataframe_obj.dropna()
ESTIMATING MISSING VALUES
Missing values can be filled using estimated or approximate values, e.g. the value before or after the missing value, the average, etc.
The fillna(num) function is used to replace missing values with the value specified in num.
dataframe_obj.fillna(num)
dataframe_obj.fillna(method='pad') // same as method='ffill'
method='pad' replaces the missing value with the value before it.
dataframe_obj.fillna(method='bfill')
method='bfill' replaces the missing value with the value after it.
Note: the method argument of fillna() is deprecated in recent pandas releases (from 2.1); dataframe_obj.ffill() and dataframe_obj.bfill() give the same results.
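Dropping and filling can be compared side by side. The sketch below uses invented data and calls ffill() and bfill() directly, which behave like the 'pad' and 'bfill' fill methods.

```python
import numpy as np
import pandas as pd

# Hypothetical marks with two missing values
df = pd.DataFrame({"marks": [10.0, np.nan, 30.0, np.nan]})

dropped = df.dropna()                          # rows containing NaN are removed
zero_filled = df.fillna(0)                     # replace NaN with a fixed value
mean_filled = df.fillna(df["marks"].mean())    # replace NaN with the column average
forward = df.ffill()                           # copy the previous value forward
backward = df.bfill()                          # copy the next value backward

print(dropped)
print(mean_filled)
print(forward)
print(backward)
```

The last row stays NaN after bfill() because there is no later value to copy from.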
GROUPBY
The groupby() function is used to split the data into groups based on some criteria (along an axis).
STEPS TO GROUP
1. Split the data into groups (based on the criteria).
2. Apply a function to each group independently.
3. Combine the results to form a new DataFrame.
dataframe_obj.groupby(column_name).aggregate_function()
first() - displays the first entry from each group.
last() - displays the last entry from each group.
size() - displays the size of each group.
groups - displays the group data (group name, row index).
get_group() - displays the data of a single group.
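The split, apply and combine steps and the helper functions listed above can be sketched as follows. The department and salary data are invented.

```python
import pandas as pd

# Hypothetical employee table
df = pd.DataFrame({
    "dept": ["A", "B", "A", "B", "A"],
    "salary": [10, 20, 30, 40, 50],
})

g = df.groupby("dept")            # split the rows into groups by department

totals = g["salary"].sum()        # apply sum() to each group and combine
first = g.first()                 # first entry of each group
last = g.last()                   # last entry of each group
size = g.size()                   # number of rows in each group
rows_of_a = list(g.groups["A"])   # row index labels belonging to group "A"
group_a = g.get_group("A")        # all rows of a single group

print(totals)
print(size)
print(group_a)
```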
PIVOTING
• Reshapes the data in a DataFrame.
• Allows columns to be transformed into rows and rows into columns.
• Summarizes large amounts of data.
• Used to summarize, sort, reorganize, group, count, total or average data stored in a table.
• PIVOTING FUNCTIONS IN PYTHON PANDAS
1. pivot() 2. pivot_table()
pivot()
• The pivot() method creates a new DataFrame after reshaping the data based on column values.
• Takes three arguments: index, columns and values.
• SYNTAX:
dataframe_object.pivot(index=..., columns=..., values=...)
• index - the column of the original table whose values become the index of the new DataFrame.
• columns - the column of the original table whose values become the columns of the new DataFrame.
• values - the column of the original table whose values fill the new DataFrame.
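A minimal pivot() sketch follows, reshaping an invented "long" table of marks (one row per student and subject) into a "wide" table with one row per student.

```python
import pandas as pd

# Hypothetical long-format marks table
df = pd.DataFrame({
    "name": ["Asha", "Asha", "Ravi", "Ravi"],
    "subject": ["maths", "science", "maths", "science"],
    "marks": [78, 88, 92, 79],
})

# One row per name, one column per subject, cells filled from "marks"
wide = df.pivot(index="name", columns="subject", values="marks")

print(wide)
```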
pivot_table()
• Pivoting with aggregation.
• The pivot_table() method is used when the same index/column combination occurs more than once, which pivot() cannot handle.
• SYNTAX:
pd.pivot_table(DataFrame, index=..., columns=..., values=..., aggfunc=...)
• DataFrame - the pandas DataFrame.
• index - the column of the original table whose values become the index of the new DataFrame.
• columns - the column of the original table whose values become the columns of the new DataFrame.
• aggfunc - the function(s) to use, such as sum, max, min, mean, std, etc.
• values - the column whose values are to be aggregated.
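The difference from pivot() shows up when a combination of index and column occurs more than once. In this sketch on invented sales data, "Delhi" and "pen" appear twice, so pivot() raises an error while pivot_table() adds the duplicates together.

```python
import pandas as pd

# Hypothetical sales table with a duplicated (city, item) pair
df = pd.DataFrame({
    "city": ["Delhi", "Delhi", "Mumbai", "Mumbai", "Delhi"],
    "item": ["pen", "pen", "pen", "book", "book"],
    "sales": [10, 20, 5, 7, 3],
})

# pivot() cannot reshape when an index/column pair is duplicated
try:
    df.pivot(index="city", columns="item", values="sales")
    pivot_failed = False
except ValueError:
    pivot_failed = True

# pivot_table() aggregates the duplicates, here by summing them
table = pd.pivot_table(df, values="sales", index="city",
                       columns="item", aggfunc="sum")

print(pivot_failed)
print(table)
```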
REINDEXING AND RENAMING
• rename()
• The rename() method is used to rename the index labels (or column labels) of a DataFrame.
• Syntax:
dataframe_object.rename(index=..., inplace=True/False)
• set_index()
• The set_index() method is used to change the index to some other column of the DataFrame.
• Syntax:
dataframe_object.set_index(column, inplace=True)
• reset_index()
• The reset_index() method is used to restore the default continuous integer index (0, 1, 2, ...).
• Syntax:
dataframe_object.reset_index(inplace=True)
• reindex()
• The reindex() method is used to conform the DataFrame to a given list of index or column labels, changing their order and filling labels that did not exist before with NaN. It returns a new DataFrame and has no inplace argument.
• Syntax:
dataframe_object.reindex(index=..., columns=...)
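The four methods can be tried together on a small table. The roll numbers and names are invented; each call below returns a new DataFrame rather than changing df in place.

```python
import pandas as pd

# Hypothetical student table
df = pd.DataFrame({
    "roll": [101, 102, 103],
    "name": ["Asha", "Ravi", "Meena"],
})

renamed = df.rename(index={0: "a", 1: "b", 2: "c"})   # new labels for the rows
indexed = df.set_index("roll")                         # "roll" becomes the index
restored = indexed.reset_index()                       # back to 0, 1, 2 with "roll" as a column
reordered = indexed.reindex([103, 101, 104])           # new order; 104 does not exist, so NaN

print(renamed)
print(indexed)
print(restored)
print(reordered)
```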
