Ms Data Science S, 24 (WEEK# 1)

Lahore Garrison University

DSC-701 Statistical and Mathematical Methods for Data Science
Week-1, Semester: Spring-2024
Prepared by:
Dr. Quratulain Rana
Assistant Professor
Learning Objectives
After studying this lecture, the student will be able to:

• Understand the importance of statistical and mathematical methods in data science.
• Understand the fundamental statistical concepts.
• Learn how to summarize and describe data using measures such as mean, median, variance, standard deviation, percentiles, and quartiles.
• Analyse and interpret the shape, centre, and spread of data distributions.
• Identify outliers and understand their impact on data analysis.

What is Data Science?
• Data science is an interdisciplinary field that combines statistics, mathematics, computer science, and domain knowledge to extract insights and knowledge from data.
• It involves collecting, processing, analysing, and interpreting large volumes of structured and unstructured data to make informed decisions.

Importance of Statistical and Mathematical
Methods
In the realm of data science, statistical and mathematical methods form the
cornerstone of analysis and interpretation. Statistics enables data scientists
to uncover patterns, trends, and insights, while mathematical methods
provide the framework for modelling and predictive analytics.
Understanding the significance of these methods is vital for harnessing the
true power of data science.
Overview of Course Content
• Statistics
• Probability
• Linear Algebra
• Optimization

Introduction to Statistics
Statistics is the study of data collection, analysis, interpretation, presentation, and organization. It plays a crucial role in understanding and making decisions based on data.

Common Statistical Methods in Data Science
• Descriptive Statistics
• Inferential Statistics

Descriptive Statistics: Measures of Central Tendency, Variability, and Data Visualization
• Central Tendency: Includes mean, median, and mode, characterizing the center of the data.
• Variability Measures: Quantifies the spread or dispersion of the data, including range and standard deviation.
• Data Visualization: Utilizing statistical methods for effective visual representation of data.

Inferential Statistics: Hypothesis Testing and Confidence Intervals
• Hypothesis Testing: Examines the plausibility of an assumption made about a population parameter.
• Confidence Intervals: Provides a range of values to estimate an unknown population parameter.

Application of Descriptive Methods
Descriptive statistical methods can be applied to data sets from the following domains:
1. Business:
• Analysing sales data to calculate measures such as mean, median, and standard deviation to understand the distribution of sales revenue.
• Summarizing customer demographics using descriptive statistics such as frequencies, percentages, and averages to identify target markets.
2. Healthcare:
• Investigating patient health data to compute descriptive statistics like mean blood pressure, cholesterol levels, or BMI to understand overall health trends.
• Analysing hospital admission records to identify patterns such as the most common diagnoses or lengths of stay.
• Summarizing clinical trial data to assess the efficacy of a new drug by calculating descriptive statistics such as mean treatment effect.
3. Finance:
• Examining stock market data to compute descriptive statistics like average daily returns, volatility, and correlations between different assets.
• Analysing credit card transaction data to identify patterns of spending behaviour among different demographic groups.
4. Social Sciences:
• Conducting surveys and analysing responses to calculate descriptive statistics such as frequencies, averages, and measures of variability to understand public opinions and attitudes.
• Examining crime data to identify hotspots and trends using descriptive statistics like crime rates, frequencies of specific types of crimes, and demographic characteristics of offenders.
• Analysing educational assessment scores to assess student performance and identify areas for improvement by calculating measures such as mean scores, standard deviations, and percentiles.

Some Key Terms
Data
In data science, "data" refers to raw facts, statistics, observations, measurements, or records that are collected, saved, and analysed with the goal of gaining insights, making decisions, and solving problems. Text, numbers, photos, audio, video, and other formats can all be used to represent data.

Types of Data
Some important types of data are:

Structured Data: Data organized in a tabular format, often found in databases or spreadsheets. Examples include:
1. Sales records: Date, customer ID, product ID, quantity, and price.
2. Employee information: Name, age, department, salary, and hire date.
3. Sensor readings: Time, temperature, humidity, and pressure.
Unstructured Data: Data that does not have a predefined structure, making it more challenging to analyse. Examples include:
1. Text data: Emails, social media posts, customer reviews, and articles.
2. Multimedia data: Images, audio files, and video footage.
3. Web data: HTML documents, web logs, and clickstream data.

Population and Sample
The terms "population" and "sample" refer to two distinct groups of data that are central to statistical analysis:
Population:
• The population is the entire set of individuals, items, or observations of interest that the researcher wants to study. It includes all possible subjects that share a common set of characteristics.
• In many cases, the population is too large or impractical to study in its entirety. For example, if you want to understand the average height of all adults in a country, the population would be all adults in that country.
Sample:
• A sample is a subset of the population that is selected for study. It should be representative of the larger population, because it is used to make inferences or draw conclusions about the population as a whole.
• Samples are chosen through various sampling methods that aim to reflect the characteristics of the population accurately. Random sampling, stratified sampling, and cluster sampling are common techniques.
• Continuing with the previous example, instead of measuring the height of every adult in a country, you might select a sample of 1000 adults from various regions and demographics to represent the entire adult population.
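The height example can be sketched in Python to show how simple random sampling and stratified sampling differ in practice. This is only an illustration under stated assumptions: the population of (region, height) records is simulated and entirely hypothetical, and the helper names `simple_random_sample`, `stratified_sample`, and `mean_height` are made up for this sketch.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical population of (region, height_cm) records; all values are simulated.
population = (
    [("north", rng.gauss(172, 8)) for _ in range(6000)]
    + [("south", rng.gauss(166, 8)) for _ in range(4000)]
)

def mean_height(records):
    """Average of the height field over a list of (region, height) records."""
    return sum(height for _, height in records) / len(records)

def simple_random_sample(records, k, rng):
    """Every record has the same chance of being selected."""
    return rng.sample(records, k)

def stratified_sample(records, k, rng):
    """Proportional allocation: each region contributes in proportion to its size.

    Rounding the shares can make the total differ from k for awkward splits;
    here the shares are exactly 600 and 400.
    """
    strata = {}
    for record in records:
        strata.setdefault(record[0], []).append(record)
    sample = []
    for members in strata.values():
        share = round(k * len(members) / len(records))
        sample.extend(rng.sample(members, share))
    return sample

srs = simple_random_sample(population, 1000, rng)
strat = stratified_sample(population, 1000, rng)

print("population mean:", mean_height(population))
print("simple random sample mean:", mean_height(srs))
print("stratified sample mean:", mean_height(strat))
```

Both sample means should land close to the population mean. The stratified version additionally guarantees that each region appears in the sample in the same proportion as in the population.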
Parameter and Statistic
Parameter:
• A parameter is a numerical characteristic of a population. It is a fixed value that describes some aspect of the entire population being studied.
• Parameters are typically denoted using Greek letters. Common parameters include the population mean (μ), population standard deviation (σ), population proportion (p), population median, etc.
• Parameters are often unknown and are estimated using sample data in inferential statistics.
Statistic:
• A statistic is a numerical characteristic of a sample. It is calculated from sample data and is used to estimate or infer the corresponding parameter of the population.
• Statistic values can vary from one sample to another and are subject to sampling variability.
• Common statistics include the sample mean (x̄), sample standard deviation (s), etc.
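The difference between a fixed parameter and a varying statistic can be seen in a small simulation. The sketch below assumes a hypothetical finite population (the integers 1 to 100), so that the population mean μ is known exactly; in real studies μ is usually unknown.

```python
import random

rng = random.Random(1)  # fixed seed so the sketch is reproducible

# Hypothetical finite population: the integers 1..100.
population = list(range(1, 101))
mu = sum(population) / len(population)  # parameter: a single fixed value

# Draw several samples of size 10; each one yields its own sample mean.
sample_means = []
for _ in range(5):
    sample = rng.sample(population, 10)
    sample_means.append(sum(sample) / len(sample))

print("population mean (parameter):", mu)
print("sample means (statistics):", sample_means)
```

The parameter stays the same on every run, while the five sample means scatter around it. That scatter is the sampling variability mentioned above.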
Basics of Summarizing Data
Data summarization is the process of condensing and presenting key information from a dataset to gain insights and draw meaningful conclusions. Effective data summarization is a fundamental step in the analysis of data. Here are the basics of data summarization:
Descriptive Statistics
Descriptive statistics measuring the location, spread, variability, and other characteristics can be computed immediately. We discuss the following statistics:
• Mean, measuring the average value of a data set;
• Median, measuring the central value;
• Quantiles and quartiles, showing where certain portions of a sample are located;
• Variance, standard deviation, and interquartile range, measuring variability and spread of data.

Mean
Example 1: To evaluate the effectiveness of a processor for a certain type of task, we recorded the CPU time in seconds for n = 30 randomly chosen jobs (data set CPU).
Calculate the average (mean) CPU time.
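As a minimal sketch of the computation, the sample mean is the sum of the observations divided by their number, x̄ = (x₁ + … + xₙ)/n. The values below are hypothetical and are not the lecture's data set CPU; the function name `sample_mean` is chosen only for illustration.

```python
from statistics import mean

def sample_mean(values):
    """Arithmetic mean: sum of the observations divided by their count."""
    if not values:
        raise ValueError("need at least one observation")
    return sum(values) / len(values)

# Hypothetical CPU times in seconds (NOT the lecture's data set CPU).
cpu_times = [20, 30, 40, 50, 60]

print(sample_mean(cpu_times))  # 40.0
print(mean(cpu_times))         # the standard library gives the same value
```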

Median
Another simple measure of location is the sample median, which estimates the population median. It is much less sensitive to extreme observations than the sample mean.
Example 2: Consider the data given in Example 1 and calculate the median.

Solution: Since n = 30 is even, find the n/2 = 15th smallest and the (n + 2)/2 = 16th smallest observations. These are 42 and 43. Any number between them is a sample median (typically reported as 42.5).
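The even-n rule used in the solution (average the two middle observations) can be sketched as follows. The function name `sample_median` and both data lists are hypothetical; they are chosen so that the effect of one extreme value on the mean and on the median is easy to see.

```python
def sample_median(values):
    """Middle value for odd n; average of the two middle values for even n."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("need at least one observation")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

# Hypothetical observations; the second list replaces 70 with an extreme value.
times = [20, 30, 40, 50, 60, 70]
times_with_outlier = [20, 30, 40, 50, 60, 700]

print(sample_median(times), sum(times) / len(times))  # 45.0 45.0
print(sample_median(times_with_outlier),
      sum(times_with_outlier) / len(times_with_outlier))  # 45.0 150.0
```

The single extreme observation moves the mean from 45 to 150 but leaves the median unchanged.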
Quantiles, Percentiles, and Quartiles
Example 3: Consider the data given in Example 1 and calculate the 1st and 3rd quartiles.
Solution:
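As a minimal sketch of a quartile computation on hypothetical values (not data set CPU), Python's standard `statistics.quantiles` function can be used. Quartile conventions differ between textbooks and software, so the two methods shown may disagree with each other and with a hand calculation that follows a different rule.

```python
from statistics import quantiles

# Hypothetical observations (NOT the lecture's data set CPU).
values = [10, 20, 30, 40, 50, 60, 70, 80, 90]

# n=4 asks for the three cut points Q1, Q2 (the median), and Q3.
inclusive = quantiles(values, n=4, method="inclusive")
exclusive = quantiles(values, n=4, method="exclusive")

print(inclusive)  # [30.0, 50.0, 70.0]
print(exclusive)  # [25.0, 50.0, 75.0]
```

When reporting quartiles, it helps to state which convention was used, since Q1 and Q3 can shift slightly from one method to another, especially for small samples.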
Variance and Standard deviation
Example 4: For the data in Example 1, compute the sample variance and standard deviation.
Solution:
The sample standard deviation is:
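For reference, a sketch of the computation on hypothetical values (not data set CPU). It assumes the usual definition of the sample variance, s² = Σ(xᵢ − x̄)² / (n − 1), with the sample standard deviation s taken as its square root; the function name `sample_variance` is illustrative.

```python
import math
from statistics import variance

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)

# Hypothetical observations (NOT the lecture's data set CPU).
values = [20, 30, 40, 50, 60]

s2 = sample_variance(values)  # squared deviations sum to 1000, so 1000 / 4
s = math.sqrt(s2)

print(s2)            # 250.0
print(round(s, 4))   # 15.8114
```

The variance is expressed in squared units of the data, while the standard deviation is in the same units as the observations, which is why it is usually the one reported.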
Interquartile range
Detection of outliers
Example 5: For the data in Example 1, calculate the IQR. Can we suspect that the data has an outlier?
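The 1.5(IQR) rule flags any observation below Q1 − 1.5·IQR or above Q3 + 1.5·IQR as a suspected outlier. The sketch below applies it to hypothetical values (not data set CPU); the function name `iqr_outliers` is illustrative, and the quartiles use the "inclusive" method of `statistics.quantiles`, which is one convention among several.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return (iqr, outliers) using the k * IQR fences around Q1 and Q3."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return iqr, [x for x in values if x < lower or x > upper]

# Hypothetical observations with one suspiciously large value.
values = [10, 20, 30, 40, 50, 60, 70, 80, 200]

iqr, outliers = iqr_outliers(values)
print(iqr)       # 40.0  (Q1 = 30, Q3 = 70)
print(outliers)  # [200] (fences are -30 and 130)
```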
Practice Question 1
A network provider investigates the load of its network. The number of concurrent users is recorded at
fifty locations (thousands of people),
(a) Compute the sample mean, variance, and standard deviation of the number of concurrent users.
(b) Compute the interquartile range. Are there any outliers?
Practice Question 2
Consider three data sets:
a) Compute sample means and sample medians. Do they support your findings about skewness and
symmetry? How?
Practice Question 3
The following data set represents the number of new computer accounts registered during ten consecutive
days.
43, 37, 50, 51, 58, 105, 52, 45, 45, 10.
a) Compute the mean, median, quartiles, and standard deviation.
b) Check for outliers using the 1.5(IQR) rule.
c) Delete the detected outliers and compute the mean, median, quartiles, and standard deviation again.
d) Make a conclusion about the effect of outliers on basic descriptive statistics.
Q & A

References
• Michael Baron, Probability and Statistics for Computer Scientists, 2nd Edition.
• David C. Lay and Steven R. Lay, Linear Algebra and Its Applications, 5th Edition.
• Gilbert Strang, Introduction to Linear Algebra, 5th Edition.
• David Forsyth, Probability for Computer Scientists, online edition.
