The 8 Basic Statistics Concepts for Data Science
https://www.kdnuggets.com/2020/06/8-basic-statistics-concepts.html
Understanding the fundamentals of statistics is a core capability for
becoming a Data Scientist. Review these essential ideas that will be
pervasive in your work and raise your expertise in the field.
By Shirley Chen, Data Analyst @ Outdoorsy on April 21, 2022 in Data
Science
Statistics is a form of mathematical analysis that uses quantified models
and representations for a given set of experimental data or real-life studies.
The main advantage of statistics is that it presents information in an easy-to-understand form. Recently, I reviewed all my statistics materials and organized them into the 8 basic statistics concepts for becoming a data scientist:
Understand the Type of Analytics
Probability
Central Tendency
Variability
Relationship Between Variables
Probability Distributions
Hypothesis Testing and Statistical Significance
Regression
Understand the Type of Analytics
Descriptive Analytics tells us what happened in the past and helps a
business understand how it is performing by providing context to help
stakeholders interpret information.
Diagnostic Analytics takes descriptive data a step further and helps you
understand why something happened in the past.
Predictive Analytics predicts what is most likely to happen in the future
and provides companies with actionable insights based on the information.
Prescriptive Analytics provides recommendations regarding actions that
will take advantage of the predictions and guide the possible actions toward
a solution.
Probability
Probability is the measure of the likelihood that an event will occur in a
Random Experiment.
Complement: P(A) + P(A’) = 1
Intersection: P(A∩B) = P(A)P(B|A), which reduces to P(A)P(B) only when A and B are independent
Union: P(A∪B) = P(A) + P(B) − P(A∩B)
Conditional Probability: P(A|B) is the probability of event A occurring given that event B has occurred. P(A|B) = P(A∩B)/P(B), when P(B) > 0.
Independent Events: Two events are independent if the occurrence of one
does not affect the probability of occurrence of the other: P(A∩B) = P(A)P(B),
P(A|B) = P(A), and P(B|A) = P(B), where P(A) ≠ 0 and P(B) ≠ 0.
Mutually Exclusive Events: Two events are mutually exclusive if they
cannot both occur at the same time. P(A∩B)=0 and P(A∪B)=P(A)+P(B).
Bayes’ Theorem describes the probability of an event based on prior
knowledge of conditions that might be related to the event:
P(A|B) = P(B|A)P(A)/P(B), when P(B) > 0.
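To make these rules concrete, here is a minimal Python sketch that applies each of them to hypothetical probabilities (all numbers below are made up for illustration):

```python
# Hypothetical probabilities for two events A and B.
p_a = 0.30          # P(A)
p_b = 0.40          # P(B)
p_b_given_a = 0.50  # P(B|A)

# Complement: P(A') = 1 - P(A)
p_not_a = 1 - p_a

# Intersection via the chain rule: P(A∩B) = P(A) * P(B|A)
p_a_and_b = p_a * p_b_given_a

# Union: P(A∪B) = P(A) + P(B) - P(A∩B)
p_a_or_b = p_a + p_b - p_a_and_b

# Conditional probability: P(A|B) = P(A∩B) / P(B)
p_a_given_b = p_a_and_b / p_b

# Bayes' Theorem gives the same answer: P(A|B) = P(B|A) * P(A) / P(B)
p_a_given_b_bayes = p_b_given_a * p_a / p_b

print(p_not_a, p_a_and_b, p_a_or_b, p_a_given_b, p_a_given_b_bayes)
```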
Central Tendency
Mean: The average of the dataset.
Median: The middle value of an ordered dataset.
Mode: The most frequent value in the dataset. If multiple values occur
most frequently, we have a multimodal distribution.
Skewness: A measure of the asymmetry of a distribution.
Kurtosis: A measure of whether the data are heavy-tailed or light-tailed
relative to a normal distribution.
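All five measures can be computed in a few lines of Python; the sample data below is arbitrary:

```python
import numpy as np
from scipy import stats
from statistics import multimode

data = [2, 3, 3, 5, 7, 8, 8, 8, 10, 21]  # arbitrary sample

print("Mean:    ", np.mean(data))        # average of the dataset
print("Median:  ", np.median(data))      # middle value of the ordered data
print("Mode(s): ", multimode(data))      # most frequent value(s); >1 entry means multimodal
print("Skewness:", stats.skew(data))     # asymmetry around the mean
print("Kurtosis:", stats.kurtosis(data)) # excess kurtosis relative to a normal distribution
```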
Variability
Range: The difference between the highest and lowest value in the dataset.
Percentiles, Quartiles and Interquartile Range (IQR)
Percentiles — A measure that indicates the value below which a given
percentage of observations in a group of observations falls.
Quartiles — Values that divide the data points into four more or less
equal parts, or quarters.
Interquartile Range (IQR)— A measure of statistical dispersion and
variability based on dividing a data set into quartiles. IQR = Q3 − Q1
Variance: The average squared difference of the values from the mean,
measuring how spread out a set of data is relative to the mean.
Standard Deviation: The square root of the variance; it measures the
typical deviation of each data point from the mean.
Standard Error (SE): An estimate of the standard deviation of the sampling
distribution.
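A short NumPy sketch covering these measures of spread, again on arbitrary sample data:

```python
import numpy as np

data = np.array([4, 7, 9, 11, 12, 15, 18, 21, 25, 30])  # arbitrary sample

data_range = data.max() - data.min()       # range: highest minus lowest value
q1, q3 = np.percentile(data, [25, 75])     # first and third quartiles
iqr = q3 - q1                              # interquartile range: IQR = Q3 - Q1
sample_var = np.var(data, ddof=1)          # sample variance (n - 1 denominator)
sample_std = np.std(data, ddof=1)          # sample standard deviation
standard_error = sample_std / np.sqrt(len(data))  # SE of the sample mean

print(data_range, q1, q3, iqr, sample_var, sample_std, standard_error)
```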
Relationship Between Variables
Causality: Relationship between two events where one event is affected by
the other.
Covariance: A quantitative measure of the joint variability between two or
more variables.
Correlation: Measures the strength and direction of the relationship
between two variables and ranges from -1 to 1; it is the normalized
version of covariance.
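As a quick sketch, NumPy computes both directly (the two variables below are made up and constructed to move roughly together):

```python
import numpy as np

# Two arbitrary variables; y is built to move roughly with x.
x = np.array([1, 2, 3, 4, 5, 6, 7, 8])
y = np.array([2, 1, 4, 5, 4, 7, 8, 9])

cov_xy = np.cov(x, y)[0, 1]        # sample covariance between x and y
corr_xy = np.corrcoef(x, y)[0, 1]  # Pearson correlation: covariance
                                   # normalized by both standard deviations
print(cov_xy, corr_xy)             # correlation always lies in [-1, 1]
```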
Probability Distributions
Probability Distribution Functions
Probability Mass Function (PMF): A function that gives the probability
that a discrete random variable is exactly equal to some value.
Probability Density Function (PDF): A function for continuous data where
the value at any given sample can be interpreted as providing a relative
likelihood that the value of the random variable would equal that sample.
Cumulative Distribution Function (CDF): A function that gives the probability
that a random variable is less than or equal to a certain value.
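To compare the three side by side, here is a brief scipy.stats sketch; the distributions and parameters are chosen arbitrarily for illustration:

```python
from scipy import stats

# PMF: probability a discrete variable equals an exact value,
# e.g. exactly 3 successes in 10 trials with p = 0.5.
print(stats.binom.pmf(3, n=10, p=0.5))

# PDF: relative likelihood of a continuous variable at a point,
# e.g. the standard normal density at x = 0.
print(stats.norm.pdf(0))

# CDF: probability the variable is less than or equal to a value,
# e.g. P(Z <= 1.96) for a standard normal.
print(stats.norm.cdf(1.96))
```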
Continuous Probability Distribution
Uniform Distribution: Also called a rectangular distribution, it is a
probability distribution in which all outcomes are equally likely.
Normal/Gaussian Distribution: The curve of the distribution is bell-
shaped and symmetrical. It is related to the Central Limit Theorem: the
sampling distribution of the sample means approaches a normal
distribution as the sample size gets larger.
Exponential Distribution: A probability distribution of the time between
the events in a Poisson point process.
Chi-Square Distribution: The distribution of the sum of squared standard
normal deviates.
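Each of these continuous distributions is available in scipy.stats; the sketch below draws samples from each (the parameter values are arbitrary):

```python
from scipy import stats

n = 10_000  # number of random draws per distribution

uniform_samples = stats.uniform.rvs(loc=0, scale=1, size=n)  # Uniform on [0, 1]
normal_samples = stats.norm.rvs(loc=0, scale=1, size=n)      # Normal(mean=0, sd=1)
expon_samples = stats.expon.rvs(scale=0.5, size=n)           # Exponential; scale = 1/λ, so rate λ = 2
chi2_samples = stats.chi2.rvs(df=3, size=n)                  # Chi-square, 3 degrees of freedom

# Central Limit Theorem intuition: means of repeated samples from ANY of
# these distributions look increasingly normal as the sample size grows.
sample_means = stats.expon.rvs(scale=0.5, size=(1000, 50)).mean(axis=1)
# A histogram of sample_means would already look approximately normal.
```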
Discrete Probability Distribution
Bernoulli Distribution: The distribution of a random variable in a single
trial with only 2 possible outcomes, namely 1 (success) with probability p,
and 0 (failure) with probability 1 − p.
Binomial Distribution: The distribution of the number of successes in a
sequence of n independent experiments, each with only 2 possible
outcomes, namely 1 (success) with probability p, and 0 (failure) with
probability 1 − p.
Poisson Distribution: The distribution that expresses the probability of a
given number of events k occurring in a fixed interval of time if these
events occur with a known constant average rate λ and independently of
the time since the last event.
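A quick scipy.stats sketch evaluating the PMF of each discrete distribution at arbitrary parameter values:

```python
from scipy import stats

# Bernoulli: one trial with P(success) = 0.3.
print(stats.bernoulli.pmf(1, p=0.3))    # P(X = 1) = 0.3

# Binomial: probability of exactly 4 successes in 10 trials with p = 0.3.
print(stats.binom.pmf(4, n=10, p=0.3))

# Poisson: probability of exactly 2 events when the average rate λ = 3.
print(stats.poisson.pmf(2, mu=3))
```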
Hypothesis Testing and Statistical Significance
Null and Alternative Hypothesis
Null Hypothesis: A general statement that there is no relationship between
two measured phenomena or no association among groups.
Alternative Hypothesis: A statement contrary to the null hypothesis,
asserting that a relationship or association does exist.
In statistical hypothesis testing, a type I error is the rejection of a true null
hypothesis, while a type II error is the non-rejection of a false null
hypothesis.
Interpretation
P-value: The probability of obtaining a test statistic at least as extreme as
the one observed, given that the null hypothesis is true. When p-value > α, we
fail to reject the null hypothesis; when p-value ≤ α, we reject the null
hypothesis, and we can conclude that we have a significant result.
Critical Value: A point on the scale of the test statistic beyond which we
reject the null hypothesis; it is derived from the level of significance α of
the test. It depends on the test statistic, which is specific to the type of test,
and on the significance level α, which defines the sensitivity of the test.
Significance Level and Rejection Region: The rejection region is actually
dependent on the significance level. The significance level is denoted
by α and is the probability of rejecting the null hypothesis if it is true.
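As a small numeric illustration, here is how the p-value and the critical value relate for a hypothetical standard normal test statistic at α = 0.05:

```python
from scipy import stats

alpha = 0.05
z_observed = 2.2  # hypothetical observed test statistic

# Critical value: the point beyond which we reject H0 (two-sided test).
z_critical = stats.norm.ppf(1 - alpha / 2)    # ≈ 1.96

# P-value: probability of a statistic at least this extreme under H0.
p_value = 2 * stats.norm.sf(abs(z_observed))

print(z_critical, p_value)
if p_value <= alpha:                          # equivalently: |z| > z_critical
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")
```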
Z-Test
A Z-test is any statistical test for which the distribution of the test statistic
under the null hypothesis can be approximated by a normal distribution
and tests the mean of a distribution in which we already know the
population variance. Therefore, many statistical tests can be conveniently
performed as approximate Z-tests if the sample size is large or
the population variance is known.
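A minimal one-sample Z-test sketch, assuming the population standard deviation is known (the data and parameters are hypothetical):

```python
import numpy as np
from scipy import stats

sample = np.array([52, 49, 55, 51, 53, 54, 50, 56, 52, 48,
                   53, 51, 55, 50, 54, 52, 49, 53, 51, 55,
                   52, 54, 50, 53, 51, 56, 52, 49, 54, 53])  # n = 30
mu0 = 50     # population mean under the null hypothesis
sigma = 3    # known population standard deviation

# Z statistic: (sample mean - μ0) / (σ / √n)
z = (sample.mean() - mu0) / (sigma / np.sqrt(len(sample)))
p_value = 2 * stats.norm.sf(abs(z))  # two-sided p-value
print(z, p_value)
```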
T-Test
A T-test is the statistical test used when the population variance is
unknown and the sample size is small (n < 30).
A paired sample means that we collect data twice from the same group,
person, item, or thing. An independent sample implies that the two samples
come from two completely different populations.
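scipy.stats provides the common T-test variants; a brief sketch with made-up data:

```python
from scipy import stats

before = [72, 75, 70, 68, 74, 71, 73, 69]   # e.g. scores before treatment
after = [75, 78, 72, 71, 77, 74, 75, 73]    # same group measured again
group_b = [65, 70, 66, 68, 64, 69, 67, 71]  # a completely different group

print(stats.ttest_1samp(before, popmean=70))  # one-sample: mean vs. 70
print(stats.ttest_rel(before, after))         # paired: same group twice
print(stats.ttest_ind(before, group_b))       # independent: two populations
```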
ANOVA (Analysis of Variance)
ANOVA is a way to find out whether experimental results are significant.
One-way ANOVA compares the means of two or more independent groups
using one independent variable. Two-way ANOVA is the extension of one-way
ANOVA that uses two independent variables to estimate the main effects and
the interaction effect.
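A one-way ANOVA sketch with scipy.stats, using three arbitrary groups:

```python
from scipy import stats

# Three independent groups (arbitrary data), one independent variable.
group1 = [23, 25, 27, 22, 26]
group2 = [30, 29, 31, 32, 28]
group3 = [24, 26, 23, 25, 27]

f_stat, p_value = stats.f_oneway(group1, group2, group3)
print(f_stat, p_value)  # a small p-value suggests at least one group mean differs
```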
Chi-Square Test
The Chi-Square Test checks whether the observed frequencies of a discrete
(categorical) variable match the frequencies expected under a hypothesized
distribution. The Goodness of Fit Test determines whether a sample of one
categorical variable fits a specified distribution. The Chi-Square Test for
Independence compares two categorical variables to see if there is a
relationship between them.
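Both variants are available in scipy.stats; the counts below are made up for illustration:

```python
from scipy import stats

# Goodness of Fit: do observed counts of one categorical variable
# match the counts expected under a hypothesized distribution?
observed = [18, 22, 20, 25, 15]
expected = [20, 20, 20, 20, 20]
print(stats.chisquare(f_obs=observed, f_exp=expected))

# Test for Independence: is there a relationship between two
# categorical variables, given their contingency table?
table = [[10, 20, 30],
         [20, 25, 15]]
chi2, p, dof, expected_freq = stats.chi2_contingency(table)
print(chi2, p)
```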
Regression
Linear Regression
Assumptions of Linear Regression
Linear Relationship
Multivariate Normality
No or Little Multicollinearity
No or Little Autocorrelation
Homoscedasticity
Linear Regression is a linear approach to modeling the relationship
between a dependent variable and one independent variable.
An independent variable is a variable that is controlled in a scientific
experiment to test the effects on the dependent variable. A dependent
variable is a variable being measured in a scientific experiment.
Multiple Linear Regression is a linear approach to modeling the
relationship between a dependent variable and two or more independent
variables.
Multiple Linear Regression Formula: y = β0 + β1x1 + β2x2 + ... + βpxp + ε
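As a minimal sketch of fitting a multiple linear regression with statsmodels, using synthetic data where the true coefficients are known:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
# Synthetic dependent variable: y = 2 + 3*x1 - 1.5*x2 + noise
y = 2 + 3 * x1 - 1.5 * x2 + rng.normal(scale=0.5, size=n)

X = sm.add_constant(np.column_stack([x1, x2]))  # add the intercept term β0
model = sm.OLS(y, X).fit()                      # ordinary least squares
print(model.params)                             # estimates of β0, β1, β2
```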
Steps for Running the Linear Regression
Step 1: Understand the model description, causality, and directionality
Step 2: Check the data, categorical data, missing data, and outliers
An outlier is a data point that differs significantly from other observations.
We can detect outliers with the standard deviation method or the
interquartile range (IQR) method.
A dummy variable takes only the value 0 or 1 to indicate the effect of a
categorical variable.
Step 3: Simple Analysis — Check the relationship between the dependent
variable and each independent variable, and between the independent
variables themselves
Use scatter plots to check the correlation
Multicollinearity occurs when two or more independent variables
are highly correlated. We can measure it with the Variance Inflation Factor
(VIF): if VIF > 5, the variables are highly correlated, and if VIF > 10, there is
certainly multicollinearity among them (see the sketch after this list).
An Interaction Term captures how the effect of one independent variable
on the dependent variable changes with the value of another independent
variable.
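Here is the VIF check sketched with statsmodels, on a synthetic design matrix in which one predictor is deliberately built to be highly correlated with another:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = x1 * 0.9 + rng.normal(scale=0.2, size=200)  # deliberately correlated with x1
x3 = rng.normal(size=200)                        # independent predictor

X = sm.add_constant(np.column_stack([x1, x2, x3]))
# Compute a VIF for each predictor (skip column 0, the constant).
for i in range(1, X.shape[1]):
    print(f"VIF x{i}: {variance_inflation_factor(X, i):.2f}")
# Rule of thumb from above: VIF > 5 flags high correlation, and
# VIF > 10 indicates serious multicollinearity.
```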
Step 4: Multiple Linear Regression — Check the model and the correct
variables
Step 5: Residual Analysis
Check that the residuals are normally distributed.
Homoscedasticity describes a situation in which the variance of the error
term is constant across all values of the independent variables, meaning
that the residuals are spread evenly along the regression line.
Step 6: Interpretation of Regression Output
R-Squared is a statistical measure of fit that indicates how much of the
variation in the dependent variable is explained by the independent
variables. A higher R-Squared value represents smaller differences
between the observed data and the fitted values.
P-value
Regression Equation
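A brief statsmodels sketch showing where these three outputs come from, again on synthetic data:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = rng.normal(size=100)
y = 1 + 2 * x + rng.normal(scale=0.5, size=100)  # synthetic data

model = sm.OLS(y, sm.add_constant(x)).fit()
print(model.rsquared)  # R-Squared: share of variance in y explained by x
print(model.pvalues)   # p-values for the intercept and slope
print(model.params)    # coefficients of the regression equation
# model.summary() prints the full regression output in one table.
```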
Shirley Chen is a Data Analyst at Outdoorsy.
Original. Reposted with permission.