MODULE 1
Module Handbook
Data Analytics - Statistics and Data Visualization

Objectives

At the end of this module, you should be able to understand:
+ Introduction to terms in statistics
+ Measures of central tendency - Mean, Median, Mode
+ Measures of spread - Variance, Standard Deviation, Range
+ Measures of shape - Skewness, Kurtosis
+ Correlation Analysis
+ Data Visualization with matplotlib
+ Data Visualization with seaborn

Introduction to Statistics

Some important terminologies
+ Population: A collection of persons, objects or items. E.g. all automobiles, all employees of Microsoft.
+ Census: When researchers gather data from the whole population for a given measurement of interest, it is called a census. E.g. the US population census, taken every 10 years.
+ Sample: A portion of the population that is representative of the whole. E.g. 75 samples of dairy milk taken for quality testing.

Statistics and types of statistics
Statistics is a form of mathematical analysis that uses quantified models, representations and synopses for a given set of experimental data or real-life studies. Statistics studies methodologies to gather, review, analyze and draw conclusions from data.
+ Descriptive Statistics: Using data gathered on a group to reach conclusions about that same group.
+ Inferential Statistics: Gathering data from a sample and using the statistics to draw conclusions about the population the sample was taken from.

Types of Variables

Categorical Data Type
Qualitative data are often termed categorical data: data that can be sorted into categories according to their characteristics.

Nominal Variable (unordered list)
A variable that has two or more categories, without any implied ordering.
Examples:
+ Gender - Male, Female
+ Marital Status - Unmarried, Married, Divorcee
+ State - New Delhi, Haryana, Illinois, Michigan

Ordinal Variable (ordered list)
A variable that has two or more categories, with a clear ordering.
Examples:
+ Scale - Strongly Disagree, Disagree, Neutral, Agree, Strongly Agree
+ Rating - Very low, Low, Medium, Great, Very great

Numerical Data Type
Quantitative data are often termed numerical data.

Interval
An interval variable is similar to an ordinal variable, except that the intervals between its values are equally spaced. In other words, it has order and equal intervals.
Examples:
+ Temperature in Celsius - A temperature of 30°C is higher than 20°C, and a temperature of 20°C is higher than 10°C. The size of these intervals is the same.
+ Annual Income in Dollars - Three people make $5,000, $10,000 and $15,000. The second person makes $5,000 more than the first person and $5,000 less than the third person, and the size of these intervals is the same. (Income also has a true zero, so strictly it is a ratio variable as well; the example illustrates only the equal-interval property.)

Ratio
It is interval data with a natural zero point. When the variable equals 0.0, there is none of that variable.
Examples:
+ Height
+ Weight
+ Temperature in Kelvin - It is a ratio variable, as 0.0 Kelvin really does mean 'no temperature'.

Measure of Central Tendency - Mean, Median & Mode
A measure of central tendency describes a whole set of data with a single value that represents the centre of its distribution. There are three main measures of central tendency: the mode, the median and the mean.
+ Mean: average value
+ Mode: most frequent value
+ Median: middle value

Mean
It is the sum of the observations divided by the sample size. The mean of the values 5, 6, 6, 8, 9, 9, 9, 9, 10, 10 is (5+6+6+8+9+9+9+9+10+10)/10 = 8.1.
Limitation: It is affected by extreme values. Very large or very small numbers can distort the answer.

Median
It is the middle value. It splits the data in half: half of the data are above the median and half are below it.
Advantage: It is NOT affected by extreme values.
Very large or very small numbers do not affect it.

Mode
It is the value that occurs most frequently in a dataset.
Advantage: It can be used when the data is not numerical.
Disadvantages:
1. There may be no mode at all if none of the values repeat.
2. There may be more than one mode.

[Figure: When to use mean, median and mode?]

Measure of Spread
It refers to the spread or dispersion of scores. There are four main measures of variability: range, interquartile range, standard deviation and variance.
+ Range: difference between the maximum and minimum in a distribution
+ Standard Deviation: average distance of scores in a distribution from their mean
+ Variance: square of the standard deviation

Range
It is simply the largest observation minus the smallest observation.
Advantage: It is easy to calculate.
Disadvantage: It is very sensitive to outliers and does not use all the observations in a data set.

Variance
The variance is the average of the squared deviations about the arithmetic mean for a set of numbers. The population variance is denoted by $\sigma^2$.

Standard Deviation
It is a measure of the spread of data about the mean, and it is the square root of the variance:

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$$

where $\mu$ is the population mean and $N$ is the number of observations.
Advantage: It gives a better picture of your data than the mean alone.

Measure of Shape
Skewness and kurtosis are used to measure the shape of data.
+ Skewness: degree of asymmetry of the scores in a distribution
+ Kurtosis: flatness or peakedness of the curve

Skewness
It is a measure of symmetry or, more precisely, of the lack of symmetry. A distribution is symmetric if it looks the same to the left and right of the center point.
[Figure: positions of the mean, median and mode in left-skewed, normal and right-skewed distributions]

Kurtosis
It is a measure of whether the data are peaked or flat relative to a normal distribution. Higher values indicate a higher, sharper peak; lower values indicate a lower, less distinct peak.
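The measures of central tendency, spread and shape described above can be checked on the ten values used in the mean example. The sketch below is one possible way to do this with Python's standard library only; the variable names are illustrative, and the skewness and kurtosis lines assume the simple population (moment-based) definitions, which is only one of several conventions in use.

```python
import statistics

# Sample values taken from the mean example in the text.
values = [5, 6, 6, 8, 9, 9, 9, 9, 10, 10]
n = len(values)

# Central tendency
mean = statistics.mean(values)      # sum of observations / sample size
median = statistics.median(values)  # middle value of the sorted data
mode = statistics.mode(values)      # most frequent value

# Spread (population versions, i.e. dividing by N)
data_range = max(values) - min(values)
variance = statistics.pvariance(values)
std_dev = statistics.pstdev(values)  # square root of the variance

# Shape, from central moments (assumed convention: no bias correction)
m2 = sum((x - mean) ** 2 for x in values) / n
m3 = sum((x - mean) ** 3 for x in values) / n
m4 = sum((x - mean) ** 4 for x in values) / n
skewness = m3 / m2 ** 1.5  # negative: longer tail on the left
kurtosis = m4 / m2 ** 2    # a normal distribution gives 3

print("mean:", mean, "median:", median, "mode:", mode)
print("range:", data_range, "variance:", variance, "std dev:", std_dev)
print("skewness:", skewness, "kurtosis:", kurtosis)
```

For this data the mean (8.1) is below the median and mode (both 9), which matches the negative skewness: the single low value 5 pulls the mean to the left.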
[Figure: general forms of kurtosis: (a) leptokurtic, (b) mesokurtic (normal), (c) platykurtic]

Correlation Analysis
Correlation is used to test relationships between quantitative variables or categorical variables. In other words, it is a measure of how things are related. The study of how variables are correlated is called correlation analysis.
+ Correlations are useful because if you can find out what relationship variables have, you can make predictions about future behavior.
+ A correlation coefficient is a way to put a value to the relationship. Correlation coefficients have a value between -1 and 1.
+ A 0 means there is no linear relationship between the variables at all.
+ A -1 or 1 means that there is a perfect negative or positive correlation.

Data Visualization with matplotlib
Matplotlib is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms. Matplotlib can be used in Python scripts, the Python and IPython shells, the Jupyter notebook, web application servers, and four graphical user interface toolkits.
Matplotlib tries to make easy things easy and hard things possible. You can generate plots, histograms, power spectra, bar charts, error charts, scatter plots, etc., with just a few lines of code. For examples, see the sample plots and thumbnail gallery.
For simple plotting the pyplot module provides a MATLAB-like interface, particularly when combined with IPython. For the power user, you have full control of line styles, font properties, axes properties, etc., via an object oriented interface or via a set of functions familiar to MATLAB users.

Line Plot

    import matplotlib.pyplot as plt

    # With a single list, the values are used as y and the index as x.
    plt.plot([1.6, 2.7])
    plt.show()

Pie chart

    import matplotlib.pyplot as plt

    plt.figure(figsize=(8, 8))
    labels = 'Frogs', 'Hogs', 'Dogs', 'Logs'
    fracs = [15, 30, 45, 10]
    # Offset for the 'Hogs' wedge; pass explode=explode to plt.pie() to apply it.
    explode = (0, 0.05, 0, 0)
    plt.pie(fracs, labels=labels, autopct='%1.1f%%', shadow=True)
    plt.show()

[Figure: pie chart with wedges Frogs, Hogs, Dogs and Logs]

Working with text
The text() command can be used to add text in an arbitrary location, and xlabel(), ylabel() and title() are used to add text in the indicated locations (see Text introduction for a more detailed example).

    import numpy
    import matplotlib.pyplot as plt

    mu, sigma = 100, 15
    x = mu + sigma * numpy.random.randn(10000)
    # density=True replaces the old normed=1 argument, which was removed from matplotlib.
    n, bins, patches = plt.hist(x, 50, density=True, facecolor='g', alpha=0.75)
    plt.xlabel('Smarts')
    plt.ylabel('Probability')
    plt.title('Histogram of IQ')
    plt.text(60, .025, r'$\mu=100,\ \sigma=15$')
    plt.axis([40, 160, 0, 0.03])
    plt.grid(True)
    plt.show()

[Figure: Histogram of IQ, annotated with mu = 100, sigma = 15]

Working with multiple figures and axes

    import numpy
    import matplotlib.pyplot as plt

    t1 = numpy.arange(0.0, 5.0, 0.1)
    t2 = numpy.arange(0.0, 5.0, 0.02)
    f = numpy.exp(-t1) * numpy.cos(2 * numpy.pi * t1)
    f2 = numpy.exp(-t2) * numpy.cos(2 * numpy.pi * t2)

    plt.figure(figsize=(12, 5))
    plt.subplot(211)  # 2 rows, 1 column, first plot
    plt.plot(t1, f, 'bo', t2, f2)
    plt.subplot(212)  # 2 rows, 1 column, second plot
    plt.plot(t2, numpy.cos(2 * numpy.pi * t2), 'r--')
    plt.show()

[Figure: two stacked subplots, a damped cosine on top and an undamped cosine below]

Data Visualization with Seaborn
Seaborn is a Python visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics. It is built on top of matplotlib and tightly integrated with the PyData stack, including support for numpy and pandas data structures and statistical routines from scipy and statsmodels.
Some of the features that seaborn offers are:
+ Several built-in themes for styling matplotlib graphics
+ Tools for choosing color palettes to make beautiful plots that reveal patterns in your data
+ Functions for visualizing univariate and bivariate distributions or for comparing them between subsets of data
+ Tools that fit and visualize linear regression models for different kinds of independent and dependent variables
+ Functions that visualize matrices of data and use clustering algorithms to discover structure in those matrices
+ A function to plot statistical timeseries data with flexible estimation and representation of uncertainty around the estimate
+ High-level abstractions for structuring grids of plots that let you easily build complex visualizations

Seaborn aims to make visualization a central part of exploring and understanding data. The plotting functions operate on dataframes and arrays containing a whole dataset and internally perform the necessary aggregation and statistical model-fitting to produce informative plots. If matplotlib "tries to make easy things easy and hard things possible", seaborn tries to make a well-defined set of hard things easy too.
The plotting functions try to do something useful when called with a minimal set of arguments, and they expose a number of customizable options through additional parameters. Some of the functions plot directly into a matplotlib axes object, while others operate on an entire figure and produce plots with several panels. In the latter case, the plot is drawn using a Grid object that links the structure of the figure to the structure of the dataset.
Seaborn should be thought of as a complement to matplotlib, not a replacement for it. When using seaborn, it is likely that you will often invoke matplotlib functions directly to draw simpler plots already available through the pyplot namespace.
Further, the seaborn functions aim to make plots that are reasonably "production ready" (including extracting semantic information from Pandas objects to add informative labels), but full customization will require changing attributes on the matplotlib objects directly. The combination of seaborn's high-level interface and matplotlib's customizability and wide range of backends makes it easy to generate publication-quality figures.

Plotting with categorical data

    import numpy as np
    import pandas as pd
    import matplotlib as mpl
    import matplotlib.pyplot as plt
    import seaborn as sns

    sns.set(style="whitegrid", color_codes=True)
    np.random.seed(sum(map(ord, "categorical")))  # fixed seed for repeatable plots

    # Example datasets that ship with seaborn (downloaded on first use).
    titanic = sns.load_dataset("titanic")
    tips = sns.load_dataset("tips")
    iris = sns.load_dataset("iris")

Categorical scatterplots

Stripplots

    sns.stripplot(x="day", y="total_bill", data=tips)
    plt.show()

[Figure: strip plot of total_bill by day (Thur, Fri, Sat, Sun)]

    # jitter=True spreads the points horizontally so they overlap less.
    sns.stripplot(x="day", y="total_bill", data=tips, jitter=True)
    plt.show()

[Figure: jittered strip plot of total_bill by day]

Swarmplots

    sns.swarmplot(x="day", y="total_bill", hue="sex", data=tips)
    plt.show()

[Figure: swarm plot of total_bill by day, colored by sex]

Boxplot

    # Flag weekend days so the boxes can be colored by that flag.
    tips["weekend"] = tips["day"].isin(["Sat", "Sun"])
    sns.boxplot(x="day", y="total_bill", hue="weekend",
                data=tips, dodge=False)
    plt.show()

[Figure: box plot of total_bill by day, colored by weekend (False/True)]
Violin Plot

    sns.violinplot(x="total_bill", y="day", hue="time", data=tips)
    plt.show()

[Figure: violin plots of total_bill by day, split by time (Lunch, Dinner)]

Visualizing continuous data

    # The next line is a Jupyter/IPython magic; omit it in a plain Python script.
    %matplotlib inline
    import random
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns

    df = pd.DataFrame()
    df['x'] = random.sample(range(1, 100), 25)
    df['y'] = random.sample(range(1, 100), 25)

Scatterplot

    # fit_reg=False draws only the points, without a regression line.
    sns.lmplot(x='x', y='y', data=df, fit_reg=False)

Density Plot

    # distplot is deprecated in seaborn 0.11 and later;
    # sns.histplot(df.x, kde=True, stat="density") is the closest replacement.
    sns.distplot(df.x)
    plt.show()

[Figure: density plot of x]

Histogram

    plt.hist(df.x, alpha=.3)
    sns.rugplot(df.x)
    plt.show()

Heatmap

    sns.heatmap([df.y, df.x], annot=True, fmt="d")
    plt.show()

[Figure: annotated heatmap of the y and x values]

Plotting bivariate distributions

    mean, cov = [0, 1], [(1, .5), (.5, 1)]
    data = np.random.multivariate_normal(mean, cov, 200)
    df = pd.DataFrame(data, columns=["x", "y"])
    x, y = np.random.multivariate_normal(mean, cov, 1000).T

Jointplot

    sns.jointplot(x="x", y="y", data=df)

[Figure: joint plot of x and y, annotated with pearsonr = 0.52]

Visualizing pairwise relationships in a dataset

    iris = sns.load_dataset("iris")
    sns.pairplot(iris)
    plt.show()

[Figure: pair plot of the iris dataset]

Summary
This module covered the following topics:
+ Introduction to terms in statistics
+ Measures of central tendency - Mean, Median, Mode
+ Measures of spread - Variance, Standard Deviation, Range
+ Measures of shape - Skewness, Kurtosis
+ Correlation Analysis
+ Data Visualization with matplotlib
+ Data Visualization with seaborn
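As a worked recap of the correlation analysis topic, the sketch below computes a Pearson correlation coefficient by hand with the standard library. The function name pearson_r and all the data are hypothetical, chosen only to show the three reference points mentioned in the module: a coefficient near 1, near -1, and 0.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations, and sums of squared deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant data")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]   # increases with x
falling = [10, 8, 6, 4, 2]  # decreases with x
v_shape = [2, 1, 0, 1, 2]   # depends on x, but not linearly

r_rising = pearson_r(x, rising)
r_falling = pearson_r(x, falling)
r_v_shape = pearson_r(x, v_shape)

print("rising: ", round(r_rising, 6))
print("falling:", round(r_falling, 6))
print("v-shape:", round(r_v_shape, 6))
```

The V-shaped series is included on purpose: its coefficient is 0 even though y is fully determined by x, which is why a 0 is best read as "no linear relationship" and not as "no relationship of any kind".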
