Ms Data Science S, 24 (WEEK# 1)
Dr. Quratulain Rana
Assistant Professor
Learning Objectives
In the realm of data science, statistical and mathematical methods form the
cornerstone of analysis and interpretation. Statistics enables data scientists
to uncover patterns, trends, and insights, while mathematical methods
provide the framework for modelling and predictive analytics.
Understanding the significance of these methods is vital for harnessing the
true power of data science.
Descriptive Statistics
Inferential Statistics
Data Visualization
Utilizing statistical methods for effective visual representation
of data.
Descriptive statistical methods can be applied to data sets from the following domains:
1. Business:
• Analysing sales data to calculate measures such as mean, median, and standard deviation to understand the distribution of sales revenue.
• Summarizing customer demographics using descriptive statistics such as frequencies, percentages, and averages to identify target markets.
2. Healthcare:
• Investigating patient health data to compute descriptive statistics like mean blood pressure, cholesterol levels, or BMI to understand overall health trends.
• Analysing hospital admission records to identify patterns such as the most common diagnoses or lengths of stay.
• Summarizing clinical trial data to assess the efficacy of a new drug by calculating descriptive statistics such as mean treatment effect.
3. Finance:
• Examining stock market data to compute descriptive statistics like average daily returns, volatility, and correlations between different assets.
• Analysing credit card transaction data to identify patterns of spending behaviour among different demographic groups.
4. Social Sciences:
• Conducting surveys and analysing responses to calculate descriptive statistics such as frequencies, averages, and measures of variability to understand
public opinions and attitudes.
• Examining crime data to identify hotspots and trends using descriptive statistics like crime rates, frequencies of specific types of crimes, and demographic
characteristics of offenders.
• Analysing educational assessment scores to assess student performance and identify areas for improvement by calculating measures such as mean scores,
standard deviations, and percentiles.
Data
In statistics, "population" and "sample" refer to two distinct groups of data that are central to statistical analysis:
Population:
• The population refers to the entire set of individuals, items, or observations of interest that the researcher wants to
study. It includes all possible subjects that share a common set of characteristics.
• In many cases, the population is too large or impractical to study in its entirety. For example, if you want to understand
the average height of all adults in a country, the population would be all adults in that country.
Sample:
• A sample is a subset of the population that is selected for study. Ideally it is representative of the larger population, so that it can be
used to make inferences or draw conclusions about the population as a whole.
• Samples are chosen through sampling methods designed to make them reflect the characteristics of the
population. Random sampling, stratified sampling, and cluster sampling are common techniques.
• Continuing with the previous example, instead of measuring the height of every adult in a country, you might select a
sample of 1000 adults from various regions and demographics to represent the entire adult population.
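The height example can be sketched in Python as follows. This is a minimal illustration, not part of the original slides: the population is simulated, and the names `population`, `simple_random_sample`, and `stratified_sample`, the four regions, and the height distribution are all assumptions made for the sketch.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: 100,000 adults, each with a region and a height in cm.
regions = ["north", "south", "east", "west"]
population = [
    {"region": random.choice(regions), "height_cm": random.gauss(168, 9)}
    for _ in range(100_000)
]

def simple_random_sample(pop, n):
    """Draw n distinct members; every member is equally likely to be chosen."""
    return random.sample(pop, n)

def stratified_sample(pop, n, key):
    """Sample each stratum in proportion to its share of the population.

    Because of rounding, the total may differ from n by a unit or two.
    """
    strata = {}
    for person in pop:
        strata.setdefault(person[key], []).append(person)
    sample = []
    for members in strata.values():
        k = round(n * len(members) / len(pop))
        sample.extend(random.sample(members, k))
    return sample

srs = simple_random_sample(population, 1000)
strat = stratified_sample(population, 1000, "region")
print(len(srs), len(strat))
```

Stratifying by region guarantees that every region appears in the sample in roughly its population proportion, whereas a simple random sample achieves this only on average.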
Parameter and Statistic
Parameter:
• A parameter is a numerical characteristic of a population. It is a fixed value that describes some aspect of the entire population being
studied.
• Parameters are typically denoted using Greek letters. Common parameters include the population mean (μ), population standard
deviation (σ), population proportion (p), population median, etc.
• Parameters are often unknown and are estimated using sample data in inferential statistics.
Statistic:
• A statistic is a numerical characteristic of a sample. It is calculated from sample data and is used to estimate or infer the corresponding
parameter of the population.
• Statistic values can vary from one sample to another and are subject to sampling variability.
• Common statistics include the sample mean (x̄), sample standard deviation (s), etc.
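The distinction can be illustrated with a short Python sketch. The population here is simulated purely for illustration (its size, mean, and spread are assumptions, not data from the slides). The population mean is computed once and stays fixed, while the sample mean changes with every new sample.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical population of 50,000 heights (cm); in practice it is rarely fully observed.
population = [random.gauss(168, 9) for _ in range(50_000)]

# Parameter: a fixed value computed from the entire population.
mu = statistics.fmean(population)

# Statistic: computed from a sample, so it varies from sample to sample.
sample_means = [
    statistics.fmean(random.sample(population, 1000)) for _ in range(5)
]

print(f"population mean (parameter): {mu:.2f}")
for i, xbar in enumerate(sample_means, start=1):
    print(f"sample {i} mean (statistic): {xbar:.2f}")
```

Each sample mean lands close to the population mean but none matches it exactly, which is the sampling variability described above.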
Basics of Summarizing Data
Data summarization is the process of condensing and presenting key information from a dataset to gain
insights and draw meaningful conclusions. Effective data summarization is a fundamental step in the
analysis of data. Here are the basics of data summarization:
Descriptive Statistics
Descriptive statistics measuring the location, spread, variability, and other characteristics can be computed
immediately. We discuss the following statistics:
• Mean, measuring the average value of a sample;
• Median, measuring the central value;
• Quantiles and quartiles, showing where certain portions of a sample are located;
• Variance, standard deviation, and interquartile range, measuring variability and spread of data.
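For reference, the measures listed above are usually defined as follows for a sample x_1, ..., x_n. The notation is the standard textbook one and is assumed here rather than taken from the slides.

```latex
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i
\qquad
s^2 = \frac{1}{n-1}\sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2
\qquad
s = \sqrt{s^2}
\qquad
\mathrm{IQR} = Q_3 - Q_1
```

Here Q_1 and Q_3 denote the first and third quartiles, and the divisor n - 1 in the sample variance makes it an unbiased estimator of the population variance.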
Example 1: To evaluate the effectiveness of a processor for a certain type of task, we recorded the following
CPU times in seconds for n = 30 randomly chosen jobs (data set CPU),
Another simple measure of location is the sample median, which estimates the population median.
It is much less sensitive to extreme observations (outliers) than the sample mean.
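A small Python sketch makes this robustness concrete. The two data sets are hypothetical and chosen only for illustration: they differ in a single value, which is replaced by an extreme one.

```python
import statistics

data = [10, 12, 13, 15, 16]           # hypothetical measurements
with_outlier = [10, 12, 13, 15, 100]  # same data, largest value replaced by an extreme one

mean_before = statistics.mean(data)
mean_after = statistics.mean(with_outlier)
median_before = statistics.median(data)
median_after = statistics.median(with_outlier)

print("mean:  ", mean_before, "->", mean_after)
print("median:", median_before, "->", median_after)
```

One extreme value more than doubles the mean, while the median does not move at all.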
Example 3: Consider the data given in Example 1 and calculate the 1st and 3rd quartiles.
Solution:
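The Example 1 data are not reproduced here, so the sketch below uses a small hypothetical sample instead. It assumes one common textbook convention for quartiles: if n·k/4 is not an integer, take the next order statistic; if it is an integer, average the two adjacent order statistics. Other conventions exist (for instance, Python's `statistics.quantiles` interpolates), so results may differ slightly between tools.

```python
def quartiles(data):
    """Return (Q1, Q2, Q3) under the convention described in the lead-in."""
    xs = sorted(data)
    n = len(xs)

    def q(k):  # k-th quartile, k = 1, 2, 3
        whole, rem = divmod(n * k, 4)
        if rem:
            # n*k/4 is not an integer: take the (whole + 1)-th smallest value
            return xs[whole]
        # n*k/4 is an integer: average the whole-th and (whole + 1)-th smallest values
        return (xs[whole - 1] + xs[whole]) / 2

    return q(1), q(2), q(3)

sample = [3, 7, 8, 5, 12, 14, 21, 13, 18, 9]  # hypothetical data, n = 10
q1, q2, q3 = quartiles(sample)
print(q1, q2, q3)
```

For a sample of n = 30, this convention picks the 8th smallest value as Q1 and the 23rd smallest as Q3.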
Example 4: For the data in Example 1, compute the sample variance and standard deviation.
Solution:
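Since the Example 1 data are not reproduced here, the following sketch shows the computation on a small hypothetical sample. It uses the usual sample variance with divisor n − 1 and cross-checks the result against Python's standard `statistics` module.

```python
import math
import statistics

def sample_variance(data):
    """Sample variance: squared deviations from the mean, divided by n - 1."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / (n - 1)

sample = [3, 7, 8, 5, 12, 14, 21, 13, 18, 9]  # hypothetical data, n = 10

s2 = sample_variance(sample)
s = math.sqrt(s2)  # standard deviation is the square root of the variance

print(f"variance = {s2:.3f}, standard deviation = {s:.3f}")
print("library check:", statistics.variance(sample), statistics.stdev(sample))
```

The standard deviation is expressed in the same units as the data, which is why it is usually reported alongside the mean rather than the variance.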
Example 5: For the data in Example 1, calculate the IQR. Can we suspect that the
data has an outlier?
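A common way to answer this kind of question is the 1.5(IQR) rule: values below Q1 − 1.5·IQR or above Q3 + 1.5·IQR are flagged as suspected outliers. The sketch below applies it to a hypothetical sample, since the Example 1 data are not reproduced here. Quartiles come from `statistics.quantiles` with its default method, which is an assumption: other quartile conventions can shift the fences slightly.

```python
import statistics

def iqr_outliers(data):
    """Return (IQR, suspected outliers) using the 1.5(IQR) rule."""
    q1, _, q3 = statistics.quantiles(data, n=4)  # default 'exclusive' method
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    return iqr, [x for x in data if x < lower or x > upper]

sample = [3, 7, 8, 5, 12, 14, 21, 13, 18, 9, 60]  # hypothetical data with one extreme value

iqr, suspects = iqr_outliers(sample)
print("IQR:", iqr)
print("suspected outliers:", suspects)
```

A flagged value is only a suspect: whether it is a recording error or a genuine extreme observation has to be judged from the context of the data.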
A network provider investigates the load of its network. The number of concurrent users is recorded at
fifty locations (thousands of people),
(a) Compute the sample mean, variance, and standard deviation of the number of concurrent users.
a) Compute sample means and sample medians. Do they support your findings about skewness and
symmetry? How?
Practice Question 3:
The following data set represents the number of new computer accounts registered during ten consecutive
days.
43, 37, 50, 51, 58, 105, 52, 45, 45, 10.
a) Compute the mean, median, quartiles, and standard deviation.
b) Check for outliers using the 1.5(IQR) rule.
c) Delete the detected outliers and compute the mean, median, quartiles, and standard deviation again.
d) Make a conclusion about the effect of outliers on basic descriptive statistics.
Q&A
Probability and Statistics for Computer Scientists, 2nd Edition, Michael Baron
Linear Algebra and Its Applications, 5th Edition, David C. Lay and Steven R. Lay
Introduction to Linear Algebra, 5th Edition, Gilbert Strang
Probability for Computer Scientists, online Edition, David Forsyth.