Data Analysis Methods for Statistics and Data Mining

Methods for Data Analysis
Basic Statistical Concepts
Set 2 – Data Mining
IS 119 SY 2022-2023 1st Semester

Statistics
• Statistical Thinking – thought processes that
focus on ways to understand, manage, and
reduce variation.
– Get information
– Conduct a survey
– Control for variation
– Example: shopping – surveying stores to find where
the items you want are cheapest

Statistics
• Definition
– (Plural sense) is a set of numerical data
– (Singular sense) is a branch of science which deals
with the collection, presentation, analysis, and
interpretation of data
– Applications:
• Building macroeconomic models to estimate economic
relationships and evaluate government policies
• Studying the effects of specific interventions on the
welfare of households or institutions

Statistics
– Applications:
• In finance, regression and correlation analysis
help explain the relationship of a financial
ratio to a set of other business variables
• Statistical models can be used to forecast sales in the coming
year
• A political party may want to study the effects of
campaign expenditures on voting outcomes
• Academic institutions may use statistics to model the
outcomes of their graduates via tracer studies

Basic Terminologies
• Population VS sample (collection of all
elements VS part or subset of the population)
• Parameter VS statistic
– Parameter: a numerical characteristic of the
population (usually denoted in Greek letters)
– Statistic: a numerical characteristic of the sample
• Descriptive VS Inferential statistics
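
The population/sample and parameter/statistic distinctions above can be illustrated with a small sketch. The income figures and the choice of sampled households are hypothetical, made up only for illustration.

```python
from statistics import mean

# Hypothetical population: monthly incomes (thousand pesos) of ALL 8 households in a village.
population = [12, 15, 18, 20, 22, 25, 30, 58]

# A sample is a part (subset) of the population; these 4 households are an arbitrary pick.
sample = [15, 20, 25, 30]

mu = mean(population)   # parameter: numerical characteristic of the population
x_bar = mean(sample)    # statistic: numerical characteristic of the sample

print(f"parameter (population mean) = {mu}")
print(f"statistic (sample mean)     = {x_bar}")
```

The statistic generally differs from the parameter it estimates, which is why inferential methods are needed when only a sample is available.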

Basic Terminologies
• Descriptive VS Inferential statistics
– Descriptive: composed of methods concerned
with collecting, describing, and analyzing a set of
data without drawing conclusions or inferences
about a larger group.
– Inferential: composed of methods concerned with
the analysis of a subset of data leading to
predictions or inferences about the entire set of
data

Basic Terminologies
Descriptive
• Simply describes what is, or what the data show
• Used to describe what is going on in our data

Inferential
• Conclusions are reached that extend beyond the immediate
data alone
• Used to infer from the sample data what the population
might think
• Used to judge whether an observed difference between
groups is dependable or might have happened by chance
• Used to make inferences from the sample data to more
general conditions

Types of Data
• Qualitative or categorical data – objects being
studied are grouped into labeled categories based
on some qualitative traits
– Examples: sex (male or female); civil status (single, married,
separated, widowed, etc.)
– Commonly summarized as percentages or proportions
• Quantitative or numerical data – refers to any
attribute that is measured numerically
– Discrete: counts (no. of poor households, etc)
– Continuous: numerical responses from measurements
(income, etc)
– Commonly summarized using averages or means
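
As a rough sketch of the two summaries named above, the following computes proportions for a qualitative variable and a mean for a quantitative one. The survey records are hypothetical.

```python
from collections import Counter
from statistics import mean

# Hypothetical survey records: (civil status, monthly income in thousand pesos).
records = [
    ("single", 18.0),
    ("married", 32.5),
    ("married", 27.0),
    ("widowed", 15.5),
    ("single", 22.0),
]

statuses = [status for status, _ in records]
incomes = [income for _, income in records]

# Qualitative data: summarize each category as a proportion of all records.
counts = Counter(statuses)
proportions = {status: count / len(statuses) for status, count in counts.items()}

# Quantitative (continuous) data: summarize with the mean.
mean_income = mean(incomes)

print(proportions)
print(mean_income)
```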
Methods of Summarizing the Data
• Textual Presentation
• Tabular
• Graphical
• Computation of summary measures
– Measure of location or central tendency (mean,
median or mode)
– Measure of statistical dispersion like standard
deviation, variance or range
– Measure of location such as percentile, or quartile
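
Quartiles are the one summary measure in this list that is not worked out later in these notes, so a minimal sketch may help. It uses Python's standard `statistics.quantiles`; the data set is hypothetical, and the library's default interpolation rule is only one of several conventions, so other textbooks or software may report slightly different quartiles.

```python
from statistics import quantiles

# Hypothetical data set.
data = [1, 2, 3, 4, 5, 6, 7]

# n=4 splits the ordered data into four parts and returns the
# three cut points: Q1, Q2 (the median), and Q3.
q1, q2, q3 = quantiles(data, n=4)

print(q1, q2, q3)
```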

Measure of Central Tendency
• Mean: the ratio of the sum of all values of
observations to the number of observations in
the data set
– Properties: reflects the magnitude of every
observation since each contributes to the value of
the mean
– Affected by extreme values
– Weighted mean: subgroup means can be combined into an
overall mean when properly weighted (for example, by subgroup size)
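
The properties of the mean listed above can be checked with a short sketch. The subgroups and the extreme value are hypothetical, and weighting by subgroup size is an assumed (though common) choice of weights.

```python
def mean(values):
    """Arithmetic mean: sum of the observations divided by their count."""
    return sum(values) / len(values)


def weighted_mean(values, weights):
    """Weighted mean: each value counts in proportion to its weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Hypothetical subgroups of different sizes.
group_a = [2, 4, 6]   # mean 4.0, size 3
group_b = [10]        # mean 10.0, size 1

# Subgroup means weighted by subgroup size recover the overall mean.
combined = weighted_mean(
    [mean(group_a), mean(group_b)],
    [len(group_a), len(group_b)],
)
overall = mean(group_a + group_b)

# A single extreme value pulls the mean strongly upward.
with_extreme = mean([2, 4, 6, 100])

print(combined, overall, with_extreme)
```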

• Median: value which divides the ordered data
set into two equal parts
– Example (even number of observations): 4 observations, 1, 3, 2, 1;
ordered from lowest to highest: 1, 1, 2, 3; thus
median = (1 + 2)/2 = 1.5
– Example (odd number of observations): 5 observations, 1, 5, 9, 11, 12;
median = 9
– Properties: a positional value, hence not affected by
extreme values; not amenable to further
computation (medians of subgroups cannot be combined)
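
A minimal sketch of the median rule, using the two examples above. The last call, with a hypothetical extreme value of 1200, illustrates that the median does not move when an extreme value changes.

```python
def median(values):
    """Middle value of the ordered data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


even_case = median([1, 3, 2, 1])          # ordered: 1, 1, 2, 3
odd_case = median([1, 5, 9, 11, 12])
with_extreme = median([1, 5, 9, 11, 1200])  # hypothetical extreme value

print(even_case, odd_case, with_extreme)
```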

• Mode: value which occurs most often
– Example: data set {1, 1, 1, 1, 2, 3, 4, 4}; Mode is 1
– For {1, 1, 2, 2, 3}, modes are 1 and 2
– For {1, 2, 3, 4}, there is no mode
– Properties: determined by the frequency of
occurrence and not by values of observation
– Cannot be manipulated algebraically
– Can be defined with qualitative and quantitative
data
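
The three mode examples above can be reproduced with a small helper. The function name `modes` is hypothetical, and it follows the convention used in these notes that a data set in which every value occurs equally often has no mode; other references treat that case differently. It assumes a non-empty data set.

```python
from collections import Counter


def modes(values):
    """Return the most frequent value(s); an empty list means 'no mode'."""
    counts = Counter(values)
    top = max(counts.values())
    # Convention from the notes: if all distinct values tie, report no mode.
    if len(counts) > 1 and all(c == top for c in counts.values()):
        return []
    return sorted(value for value, c in counts.items() if c == top)


print(modes([1, 1, 1, 1, 2, 3, 4, 4]))   # one mode
print(modes([1, 1, 2, 2, 3]))            # two modes
print(modes([1, 2, 3, 4]))               # no mode
print(modes(["single", "married", "married"]))  # works for qualitative data too
```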
Measures of Dispersion
Measure how scattered the data values are
around the mean/average.
• Range: the length of the interval which
contains all the data
– Calculated by subtracting the lowest value from the
highest. It is a weak measure of
dispersion since it depends on only two
observations except when sample size is large

• Range: advantages and limitations
– Simple and easy to understand
– Gives the limits within which all values fall
– Does not show how values cluster within the interval
– Depends on the values of the extreme items
– Sensitive to sampling variations
– Not tractable mathematically
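
A short sketch of the range and two of its limitations. Both data sets are hypothetical: they share the same range even though their values cluster very differently, and adding one extreme value changes the range drastically.

```python
def data_range(values):
    """Range: highest value minus lowest value."""
    return max(values) - min(values)


# Hypothetical data sets with the same lowest and highest values.
clustered = [10, 11, 11, 12, 12, 12, 13, 20]   # most values near 12
spread = [10, 12, 14, 15, 16, 17, 18, 20]      # values spread across the interval

same_range = data_range(clustered) == data_range(spread)

# The range depends only on the two extreme observations.
with_extreme = data_range(spread + [100])

print(data_range(clustered), data_range(spread), with_extreme)
```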

• Variance: a measure of dispersion of data with
respect to the mean.
– Measures the average squared distance between each of a
set of data points and their mean value.
– Population variance: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2$
– Sample variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$

• Standard Deviation:
– Measures, on average, the dispersion of each
observation from the mean. A large amount of
variation means that the data values are far from the
mean, hence the sd is large
– Population sd: $\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2}$
– Sample sd: $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}$
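
The variance and standard deviation formulas can be sketched directly. The data set is hypothetical; the only difference between the population and sample versions is the divisor (N versus n - 1).

```python
import math


def population_variance(xs):
    """Sum of squared deviations from the mean, divided by N."""
    mu = sum(xs) / len(xs)
    return sum((x - mu) ** 2 for x in xs) / len(xs)


def sample_variance(xs):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    x_bar = sum(xs) / len(xs)
    return sum((x - x_bar) ** 2 for x in xs) / (len(xs) - 1)


# Hypothetical observations with mean 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]

pop_var = population_variance(data)
pop_sd = math.sqrt(pop_var)     # standard deviation is the square root of the variance
samp_var = sample_variance(data)
samp_sd = math.sqrt(samp_var)

print(pop_var, pop_sd, samp_var, samp_sd)
```

Python's standard `statistics` module offers `pvariance`, `pstdev`, `variance`, and `stdev` for the same four quantities.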

Measures of Relative Dispersion
• Unitless measures used to compare the scatter of
one distribution with the scatter of another
distribution.
• Coefficient of Variation: a statistical measure of
the dispersion of data points in a data series
around the mean. It is the ratio of the standard
deviation to the mean.
– Useful when interest is in the size of variation relative
to the size of the observation.
$CV = \frac{\sigma}{\mu} \times 100\%$

Example 1
The foreign exchange rate is an indicator of the stability of the
peso and of economic performance.
Market forces, not government policy, have determined
the level of the peso, since the government intervenes, through the
Bangko Sentral ng Pilipinas, only when there are speculative
elements in the market. Given below are the means and
standard deviations of the quarterly peso-dollar exchange rate for the
periods 1998 to 1999 and 2000 to 2001. Which of the two
periods is more stable?

Period       Mean    Standard Deviation
1998-1999    40.4    2.01
2000-2001    48.6    1.21

Solution to Example 1
• 1998-1999
CV(1998-1999) = (2.01/40.4) × 100% = 4.98%
• 2000-2001
CV(2000-2001) = (1.21/48.6) × 100% = 2.49%
• The period 2000-2001 has the smaller coefficient of variation,
thus it is more stable with
respect to the peso-dollar exchange rate
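
The computation in Example 1 can be reproduced with a few lines. The function name is hypothetical; the means and standard deviations are the ones given in the example.

```python
def coefficient_of_variation(sd, mean):
    """Standard deviation as a percentage of the mean."""
    return sd / mean * 100


cv_1998_1999 = coefficient_of_variation(2.01, 40.4)
cv_2000_2001 = coefficient_of_variation(1.21, 48.6)

# The period with the smaller CV varies less relative to its own mean.
more_stable = "2000-2001" if cv_2000_2001 < cv_1998_1999 else "1998-1999"

print(f"CV 1998-1999: {cv_1998_1999:.2f}%")
print(f"CV 2000-2001: {cv_2000_2001:.2f}%")
print(f"More stable period: {more_stable}")
```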
Inferential Statistics
• Deals with the methods to generalize what the
sample data show
• Inferential statistics answers the question:
Can I generalize to the population the patterns/
differences/profile that I see in my sample?
• Note that there is no need to do inferential
statistics if the data are already population data
Next topic…
• Studying Relationships
– Correlation Analysis
– Regression Analysis
– Important Statistics and Tests
