
Lahore Garrison University


DSC-701 Statistical and Mathematical Methods for Data Science
Week 1, Semester: Spring 2024
Prepared by: Dr. Quratulain Rana, Assistant Professor

Learning Objectives

After studying this lecture, the student will be able to:


▪ Understand the importance of statistical and mathematical methods in data science.
▪ Understand the fundamental statistical concepts.
▪ Learn how to summarize and describe data using measures such as mean, median, variance, standard deviation, percentiles, and quartiles.
▪ Analyse and interpret the shape, centre, and spread of data distributions.
▪ Identify outliers and understand their impact on data analysis.


What is Data Science?

▪ Data science is an interdisciplinary field that combines statistics, mathematics, computer science, and domain knowledge to extract insights and knowledge from data.
▪ It involves collecting, processing, analysing, and interpreting large volumes of structured and unstructured data to make informed decisions.

Importance of Statistical and Mathematical Methods

In the realm of data science, statistical and mathematical methods form the
cornerstone of analysis and interpretation. Statistics enables data scientists
to uncover patterns, trends, and insights, while mathematical methods
provide the framework for modelling and predictive analytics.
Understanding the significance of these methods is vital for harnessing the
true power of data science.

Overview of Course Content

▪ Statistics
▪ Probability
▪ Linear Algebra
▪ Optimization

Introduction to Statistics

Statistics is the study of data collection, analysis, interpretation, presentation, and organization. It plays a crucial role in understanding and making decisions based on data.

Common Statistical Methods in Data Science

▪ Descriptive Statistics
▪ Inferential Statistics

Descriptive Statistics: Measures of Central Tendency, Variability, and Data Visualization

Central Tendency
Includes mean, median, and mode, characterizing the centre of the data.

Variability Measures
Quantifies the spread or dispersion of the data, including range and standard deviation.

Data Visualization
Utilizing statistical methods for effective visual representation of data.

Inferential Statistics: Hypothesis Testing and Confidence Intervals

Hypothesis Testing
Examines the plausibility of an assumption made about a population parameter.

Confidence Intervals
Provides a range of values to estimate an unknown population parameter.
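As a rough illustration of a confidence interval, the sketch below computes an approximate 95% interval for a population mean from a small sample. The data values and the helper name `approx_mean_ci` are hypothetical, and the normal (z-based) approximation is an assumption made here for brevity; for small samples a t-based interval is usually preferred.

```python
import math
import statistics

def approx_mean_ci(sample, confidence=0.95):
    """Approximate z-based confidence interval for the population mean."""
    n = len(sample)
    x_bar = statistics.mean(sample)
    s = statistics.stdev(sample)  # sample standard deviation (n - 1 divisor)
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    margin = z * s / math.sqrt(n)
    return x_bar - margin, x_bar + margin

# Hypothetical measurements, for illustration only
sample = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7]
low, high = approx_mean_ci(sample)
print(f"approximate 95% CI: ({low:.2f}, {high:.2f})")
```

The interval is centred on the sample mean and widens as the sample standard deviation grows or the sample size shrinks.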

Application of Descriptive Methods

Descriptive statistical methods can be applied to data sets from the following domains:
1. Business:
• Analysing sales data to calculate measures such as mean, median, and standard deviation to understand the distribution of sales revenue.
• Summarizing customer demographics using descriptive statistics such as frequencies, percentages, and averages to identify target markets.
2. Healthcare:
• Investigating patient health data to compute descriptive statistics like mean blood pressure, cholesterol levels, or BMI to understand overall health trends.
• Analysing hospital admission records to identify patterns such as the most common diagnoses or lengths of stay.
• Summarizing clinical trial data to assess the efficacy of a new drug by calculating descriptive statistics such as mean treatment effect.
3. Finance:
• Examining stock market data to compute descriptive statistics like average daily returns, volatility, and correlations between different assets.
• Analysing credit card transaction data to identify patterns of spending behaviour among different demographic groups.
4. Social Sciences:
• Conducting surveys and analysing responses to calculate descriptive statistics such as frequencies, averages, and measures of variability to understand public opinions and attitudes.
• Examining crime data to identify hotspots and trends using descriptive statistics like crime rates, frequencies of specific types of crimes, and demographic characteristics of offenders.
• Analysing educational assessment scores to assess student performance and identify areas for improvement by calculating measures such as mean scores, standard deviations, and percentiles.


Some Key Terms

Data

In data science, "data" refers to raw facts, statistics, observations, measurements, or records that are collected, saved, and analysed with the goal of gaining insights, making decisions, and solving issues. Text, numbers, photos, audio, video, and other formats can all be used to represent data.

Types of Data

Some important types of data are:

▪ Structured Data: Data organized in a tabular format, often found in databases or spreadsheets. Examples include:
1. Sales records: Date, customer ID, product ID, quantity, and price.
2. Employee information: Name, age, department, salary, and hire date.
3. Sensor readings: Time, temperature, humidity, and pressure.
▪ Unstructured Data: Data that does not have a predefined structure, making it more challenging to analyse. Examples include:
1. Text data: Emails, social media posts, customer reviews, and articles.
2. Multimedia data: Images, audio files, and video footage.
3. Web data: HTML documents, web logs, and clickstream data.

Population and Sample

The terms "population" and "sample" refer to two distinct groups of data that are central to statistical analysis:
▪ Population:
• The population is the entire set of individuals, items, or observations of interest that the researcher wants to study. It includes all possible subjects that share a common set of characteristics.
• In many cases, the population is too large or impractical to study in its entirety. For example, if you want to understand the average height of all adults in a country, the population would be all adults in that country.
▪ Sample:
• A sample is a subset of the population that is selected for study. It should be representative of the larger population, because it is used to make inferences or draw conclusions about the population as a whole.
• Samples are chosen through various sampling methods so that they reflect the characteristics of the population as accurately as possible. Random sampling, stratified sampling, and cluster sampling are common techniques.
• Continuing with the previous example, instead of measuring the height of every adult in a country, you might select a sample of 1000 adults from various regions and demographics to represent the entire adult population.
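The two sampling schemes named above can be sketched in a few lines. Everything in this example is an assumption made for illustration: the simulated population of 10,000 adults, the region labels, and the height distribution are hypothetical, not data from the lecture.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical population: (region, height in cm) for 10,000 adults
regions = ["north", "south", "east", "west"]
population = [(random.choice(regions), random.gauss(168, 9)) for _ in range(10_000)]

# Simple random sample: every adult has the same chance of being selected
simple_sample = random.sample(population, 1000)

# Stratified sample: sample each region in proportion to its share of the population
stratified_sample = []
for region in regions:
    stratum = [person for person in population if person[0] == region]
    k = round(1000 * len(stratum) / len(population))
    stratified_sample.extend(random.sample(stratum, k))

population_mean = statistics.mean(h for _, h in population)
simple_mean = statistics.mean(h for _, h in simple_sample)
stratified_mean = statistics.mean(h for _, h in stratified_sample)

print("population mean:", round(population_mean, 2))
print("simple random sample mean:", round(simple_mean, 2))
print("stratified sample mean:", round(stratified_mean, 2))
```

Both sample means land close to the population mean, which is the point of drawing a representative sample instead of measuring everyone.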

Parameter and Statistic

▪ Parameter:
• A parameter is a numerical characteristic of a population. It is a fixed value that describes some aspect of the entire population being studied.
• Parameters are typically denoted using Greek letters. Common parameters include the population mean (μ), population standard deviation (σ), population proportion (p), population median, etc.
• Parameters are often unknown and are estimated using sample data in inferential statistics.
▪ Statistic:
• A statistic is a numerical characteristic of a sample. It is calculated from sample data and is used to estimate or infer the corresponding parameter of the population.
• Statistic values can vary from one sample to another and are subject to sampling variability.
• Common statistics include the sample mean (x̄), sample standard deviation (s), etc.
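A short simulation makes the distinction concrete: the parameter μ is one fixed number, while the statistic x̄ changes from sample to sample. The population below is hypothetical and generated only for illustration.

```python
import random
import statistics

random.seed(7)  # fixed seed so the sketch is reproducible

# Hypothetical population of 5,000 values; its mean is the parameter mu
population = [random.uniform(0, 100) for _ in range(5000)]
mu = statistics.mean(population)

# Each sample of size 50 produces its own statistic x_bar
sample_means = [statistics.mean(random.sample(population, 50)) for _ in range(5)]

print("parameter mu:", round(mu, 2))
print("sample means:", [round(m, 2) for m in sample_means])
```

The five sample means differ from one another and from μ; that spread is the sampling variability mentioned above.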

Basics of Summarizing Data

Data summarization is the process of condensing and presenting key information from a dataset to gain insights and draw meaningful conclusions. Effective data summarization is a fundamental step in the analysis of data. Here are the basics of data summarization:
▪ Descriptive Statistics
Descriptive statistics measuring the location, spread, variability, and other characteristics can be computed immediately. We discuss the following statistics:
▪ Mean, measuring the average value of the data;
▪ Median, measuring the central value;
▪ Quantiles and quartiles, showing where certain portions of a sample are located;
▪ Variance, standard deviation, and interquartile range, measuring variability and spread of data.

Mean

Example 1: To evaluate the effectiveness of a processor for a certain type of task, we recorded the CPU time in seconds for n = 30 randomly chosen jobs (data set CPU).

Calculate the average (mean) CPU time.
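The sample mean is the arithmetic average, x̄ = (x₁ + … + xₙ)/n. A minimal sketch on a short hypothetical list of CPU times (these values are for illustration only and are not the data set of Example 1):

```python
import statistics

# Hypothetical CPU times in seconds, for illustration only
cpu_times = [12, 25, 31, 40, 52]

# Sample mean: sum of the observations divided by their number
x_bar = sum(cpu_times) / len(cpu_times)

print(x_bar)                       # 32.0
print(statistics.mean(cpu_times))  # same value from the standard library
```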


Median

Another simple measure of location is the sample median, which estimates the population median. It is much less sensitive to extreme observations than the sample mean.

Example 2: For the data in Example 1, calculate the median.

Solution: Since n = 30 is even, find the n/2 = 15th smallest and the (n + 2)/2 = 16th smallest observations. These are 42 and 43. Any number between them is a sample median (typically reported as 42.5).
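The same rule can be written as a small function: for odd n take the middle observation, for even n report the midpoint of the two middle ones. The helper name `sample_median` and the data are hypothetical; the value 500 is included to show how an extreme observation moves the mean but barely affects the median.

```python
import statistics

def sample_median(data):
    """Middle value for odd n; midpoint of the two middle values for even n."""
    xs = sorted(data)
    n = len(xs)
    mid = n // 2
    if n % 2 == 1:
        return xs[mid]
    return (xs[mid - 1] + xs[mid]) / 2

# Hypothetical data; 500 is an extreme observation
data = [12, 25, 31, 40, 52, 500]

print(sample_median(data))    # 35.5
print(statistics.mean(data))  # 110
```

Dropping the value 500 would pull the mean down to 32, while the median would only move from 35.5 to 31.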


Quantiles, Percentiles, and Quartiles
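One common textbook convention, assumed here, takes a sample p-quantile to be a value that exceeds at most 100p% of the sample and is exceeded by at most 100(1 − p)% of it. Percentiles are quantiles with p written as a percentage, and the quartiles Q1, Q2 (the median), and Q3 are the 25th, 50th, and 75th percentiles. The sketch below implements that convention; the helper name `sample_quantile` and the data are hypothetical, and software libraries often interpolate differently, so their results may differ slightly.

```python
import math
from fractions import Fraction

def sample_quantile(data, p):
    """p-quantile: the ceil(n*p)-th smallest value, or a midpoint when n*p is an integer."""
    if not 0 < p < 1:
        raise ValueError("p must be strictly between 0 and 1")
    xs = sorted(data)
    n = len(xs)
    k = Fraction(str(p)) * n       # exact arithmetic avoids float rounding in n * p
    if k.denominator == 1:         # n*p is an integer: average the two neighbours
        i = int(k)
        return (xs[i - 1] + xs[i]) / 2
    return xs[math.ceil(k) - 1]    # otherwise take the ceil(n*p)-th smallest value

# Hypothetical sample of n = 10 observations
data = [7, 3, 9, 15, 4, 11, 6, 20, 8, 12]

q1 = sample_quantile(data, 0.25)  # first quartile
q2 = sample_quantile(data, 0.50)  # median
q3 = sample_quantile(data, 0.75)  # third quartile
print(q1, q2, q3)                 # 6 8.5 12
```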


Example 3: For the data in Example 1, calculate the 1st and 3rd quartiles.

Solution:


Variance and Standard deviation
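Assuming the usual definitions, the sample variance is s² = Σ(xᵢ − x̄)² / (n − 1) and the sample standard deviation is its square root, s = √s². A minimal sketch on hypothetical data, checked against Python's standard library:

```python
import math
import statistics

# Hypothetical observations, for illustration only
data = [2, 4, 4, 5, 7, 8]

n = len(data)
x_bar = sum(data) / n

# Sample variance: squared deviations from the mean, divided by n - 1
s2 = sum((x - x_bar) ** 2 for x in data) / (n - 1)
s = math.sqrt(s2)  # sample standard deviation

print(s2)           # 4.8
print(round(s, 4))  # 2.1909

# The standard library uses the same n - 1 divisor
assert math.isclose(s2, statistics.variance(data))
assert math.isclose(s, statistics.stdev(data))
```

Dividing by n − 1 instead of n makes s² an unbiased estimator of the population variance σ².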


Example 4: For the data in Example 1, compute the sample variance and standard deviation.

Solution:

The sample standard deviation is:

Interquartile range

Detection of outliers

Example 5: For the data in Example 1, calculate the IQR. Can we suspect that the data has an outlier?
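The interquartile range is IQR = Q3 − Q1, and the 1.5(IQR) rule flags any observation below Q1 − 1.5(IQR) or above Q3 + 1.5(IQR) as a suspected outlier. The sketch below applies the rule to hypothetical data. The helper name `iqr_outliers` is an assumption, and the quartiles come from `statistics.quantiles` with the "inclusive" method, which is one of several quartile conventions and may not match a hand calculation exactly.

```python
import statistics

def iqr_outliers(data):
    """Return (iqr, lower_fence, upper_fence, suspected_outliers) using the 1.5(IQR) rule."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    outliers = [x for x in data if x < lower or x > upper]
    return iqr, lower, upper, outliers

# Hypothetical data with one unusually large value
data = [10, 12, 13, 14, 15, 16, 17, 18, 60]
iqr, lower, upper, outliers = iqr_outliers(data)
print(iqr, lower, upper, outliers)  # 4.0 7.0 23.0 [60]
```

A flagged value is only a suspect: whether to keep, correct, or remove it depends on how the observation arose.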

Practice Question 1

A network provider investigates the load of its network. The number of concurrent users (in thousands of people) is recorded at fifty locations.

(a) Compute the sample mean, variance, and standard deviation of the number of concurrent users.

(b) Compute the interquartile range. Are there any outliers?

Practice Question 2

Consider three data sets:

a) Compute sample means and sample medians. Do they support your findings about skewness and symmetry? How?

Practice Question 3

The following data set represents the number of new computer accounts registered during ten consecutive
days.
43, 37, 50, 51, 58, 105, 52, 45, 45, 10.
a) Compute the mean, median, quartiles, and standard deviation.
b) Check for outliers using the 1.5(IQR) rule.
c) Delete the detected outliers and compute the mean, median, quartiles, and standard deviation again.
d) Make a conclusion about the effect of outliers on basic descriptive statistics.


Q&A

References

▪ Probability and Statistics for Computer Scientists, 2nd Edition, Michael Baron
▪ Linear Algebra and Its Applications, 5th Edition, David C. Lay and Steven R. Lay
▪ Introduction to Linear Algebra, 5th Edition, Gilbert Strang
▪ Probability for Computer Scientists, Online Edition, David Forsyth
