
Methods for Data Analysis

Basic Statistical Concepts

Set 2 – Data Mining

IS 119 SY 2022-2023 1st Semester


Statistics
• Statistical Thinking – thought processes that
focus on ways to understand, manage and
reduce variation.
– Get information
– Conduct a survey
– Control for variation
– Example: shopping – survey stores to find where the
items you want are cheapest



Statistics
• Definition
– (Plural sense) is a set of numerical data
– (Singular sense) is a branch of science which deals
with the collection, presentation, analysis, and
interpretation of data
– Applications:
• Build macroeconomic models to estimate economic
relationships and evaluate government policies
• Study the effects of specific interventions on the
welfare of households or institutions



Statistics
– Applications:
• In finance, regression and correlation analysis can
help explain the relationship of a financial ratio to a
set of other business variables
• Statistical models can be used to forecast sales in the
coming year
• A political party may want to study the effects of
campaign expenditures on voting outcomes
• Academic institutions may use statistics to model
their graduates via tracer studies



Basic Terminologies
• Population VS sample (collection of all
elements VS part or subset of the population)
• Parameter VS statistic
– Parameter: a numerical characteristic of the
population (usually denoted in Greek letters)
– Statistic: a numerical characteristic of the sample
• Descriptive VS Inferential statistics
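The population/sample and parameter/statistic distinction can be sketched in a few lines of Python. The income figures below are simulated purely for illustration (an assumption, not data from the course), and the names `population`, `sample`, `mu`, and `x_bar` are hypothetical.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: monthly incomes (thousand pesos) of 1,000 households.
population = [random.gauss(30, 8) for _ in range(1000)]

# A sample is a part (subset) of the population: 50 households drawn at random.
sample = random.sample(population, 50)

mu = statistics.mean(population)  # parameter: describes the whole population
x_bar = statistics.mean(sample)   # statistic: describes only the sample

print(f"parameter (population mean) = {mu:.2f}")
print(f"statistic (sample mean)     = {x_bar:.2f}")
```

The two printed values are usually close but not equal; that gap is what inferential methods try to quantify.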



Basic Terminologies
• Descriptive VS Inferential statistics
– Descriptive: composed of methods concerned
with collecting, describing, and analyzing a set of
data without drawing conclusions or inferences
about a larger group.
– Inferential: composed of methods concerned with
the analysis of a subset of data leading to
predictions or inferences about the entire set of
data



Basic Terminologies
Descriptive
• Simply describe what is or what the data shows
• Used to describe what's going on in our data

Inferential
• Conclusions are reached that extend beyond the
immediate data alone
• Used to try to infer from the sample data what the
population might think
• Used to judge whether an observed difference
between groups is dependable or might have
happened by chance
• Used to make inferences from the sample data to
more general conditions

Types of Data
• Qualitative or categorical data – objects being
studied are grouped into labeled categories based
on some qualitative traits
– Examples: sex (male or female); status (single, married,
separated, widowed, etc.)
– Commonly summarized as percentages or proportions
• Quantitative or numerical data – refers to any
attribute that is measured numerically
– Discrete: counts (no. of poor households, etc)
– Continuous: numerical responses from measurements
(income, etc)
– Commonly summarized using averages or means
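As a minimal sketch of the two summaries named above, the snippet below reports proportions for a qualitative variable and a mean for a quantitative one. The survey records are made up for illustration, and all variable names are hypothetical.

```python
from collections import Counter
import statistics

# Hypothetical survey records: (civil status, household income in thousand pesos)
records = [
    ("single", 18.0),
    ("married", 32.5),
    ("married", 41.0),
    ("widowed", 22.0),
    ("single", 25.5),
]

statuses = [status for status, _ in records]  # qualitative (categorical)
incomes = [income for _, income in records]   # quantitative (continuous)

# Qualitative data: summarize as proportions per category.
counts = Counter(statuses)
proportions = {status: n / len(statuses) for status, n in counts.items()}

# Quantitative data: summarize with the mean.
mean_income = statistics.mean(incomes)

print(proportions)   # share of respondents in each status
print(mean_income)   # average income
```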
Methods of Summarizing the Data
• Textual Presentation
• Tabular
• Graphical
• Computation of summary measures
– Measure of location or central tendency (mean,
median or mode)
– Measure of statistical dispersion like standard
deviation, variance or range
– Measure of position such as percentile or quartile
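Quartiles can be computed with Python's standard `statistics.quantiles`. This is only a sketch: the data set is invented, and the `"inclusive"` method is one of several quartile conventions, so other software may report slightly different cut points.

```python
import statistics

# Hypothetical data set, already ordered for readability.
data = [2, 4, 4, 5, 7, 9, 10, 12, 15]

# n=4 splits the ordered data into four parts and returns the three cut points.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

print(q1, q2, q3)
print(q2 == statistics.median(data))  # the second quartile is the median
```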



Measure of Central Tendency
• Mean: the ratio of the sum of all values of
observations to the number of observations in
the data set
– Properties: reflects the magnitude of every
observation since each contributes to the value of
the mean
– Affected by extreme values
– Weighted mean: means of subgroups combined
when properly weighted
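The properties above can be checked with a short sketch. The helper names `mean` and `weighted_mean` and the numbers are illustrative assumptions; the weights here are the subgroup sizes.

```python
def mean(values):
    """Sum of all observations divided by the number of observations."""
    return sum(values) / len(values)

def weighted_mean(means, weights):
    """Combine subgroup means, weighting each by its subgroup size."""
    return sum(m * w for m, w in zip(means, weights)) / sum(weights)

scores = [2, 4, 6, 8]
print(mean(scores))          # 5.0
print(mean(scores + [100]))  # 24.0: one extreme value pulls the mean upward

group_a = [2, 4, 6]  # mean 4.0, size 3
group_b = [10]       # mean 10.0, size 1
combined = weighted_mean(
    [mean(group_a), mean(group_b)],
    [len(group_a), len(group_b)],
)
print(combined)                  # 5.5
print(mean(group_a + group_b))   # 5.5, same as the weighted mean
```

An unweighted average of the two subgroup means would give 7.0, which is why the weights matter.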



Measure of Central Tendency
• Median: value which divides the ordered data
set into two equal parts
– Example: (case of even) 4 observations – 1, 3, 2, 1;
ordered from lowest to highest: 1, 1, 2, 3, thus
md= (1+2)/2=1.5
– (case of odd) 5 observations 1, 5, 9, 11, 12, md=9
– Properties: positional value hence not affected by
extreme values; not amenable to further
computation (cannot be combined)
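The even and odd cases above can be reproduced with a small function. `median` here is a hand-written sketch for illustration, not a reference implementation.

```python
def median(values):
    """Middle value of the ordered data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([1, 3, 2, 1]))         # 1.5 (even case from the slide)
print(median([1, 5, 9, 11, 12]))    # 9   (odd case from the slide)
print(median([1, 5, 9, 11, 1200]))  # 9   (an extreme value does not move it)
```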



Measure of Central Tendency
• Mode: value which occurs most often
– Example: data set {1, 1, 1, 1, 2, 3, 4, 4}; Mode is 1
– For {1, 1, 2, 2, 3}, modes are 1 and 2
– For {1, 2, 3, 4}, there is no mode
– Properties: determined by the frequency of
occurrence and not by values of observation
– Cannot be manipulated algebraically
– Can be defined with qualitative and quantitative
data
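The three examples above can be sketched as follows. The helper `modes` is hypothetical and follows the slide's convention that a data set has no mode when every value occurs equally often; other conventions exist.

```python
from collections import Counter

def modes(values):
    """Return the most frequent value(s); empty list if all values tie."""
    counts = Counter(values)
    highest = max(counts.values())
    # Convention assumed from the slide: all values equally frequent means no mode.
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    return [value for value, c in counts.items() if c == highest]

print(modes([1, 1, 1, 1, 2, 3, 4, 4]))         # [1]
print(modes([1, 1, 2, 2, 3]))                  # [1, 2]
print(modes([1, 2, 3, 4]))                     # []
print(modes(["single", "married", "married"])) # ['married'], works for qualitative data
```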
Measures of Dispersion
Measure how scattered the data values are
around the mean/average.
• Range: the length of the interval which
contains all the data
– Calculated by subtracting the lowest value from the
highest. It is a weak measure of dispersion since it
depends on only two observations, except when
sample size is large



Measures of Dispersion
• Range: the length of the interval which
contains all the data
– Simple, easy to understand
– Gives a quick overall picture since it gives the limits of the data
– Does not show how values cluster within the interval
– Depends on the value of extreme items
– Sensitive to sampling variations
– Not tractable mathematically
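A short sketch of why the range is a weak measure: it uses only the two extreme observations. The data sets and the helper name `value_range` are made up for illustration.

```python
def value_range(values):
    """Highest observation minus lowest observation."""
    return max(values) - min(values)

clustered = [10, 50, 50, 50, 50, 90]  # most values sit at 50
spread = [10, 26, 42, 58, 74, 90]     # values evenly spread out

print(value_range(clustered))  # 80
print(value_range(spread))     # 80: same range, very different clustering

with_outlier = clustered + [500]
print(value_range(with_outlier))  # 490: one extreme item changes the range entirely
```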



Measures of Dispersion
• Variance: a measure of dispersion of data with
respect to the mean.
– Measures the average squared distance between
each data point and the mean value.

– Population variance: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2$

– Sample variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$
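The two variance formulas differ only in the divisor (N for a population, n-1 for a sample). The sketch below implements both by hand on an invented data set and compares them with Python's `statistics.pvariance` and `statistics.variance`; the function names are illustrative.

```python
import math
import statistics

def population_variance(values):
    """Sum of squared deviations from the mean, divided by N."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)

def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    x_bar = sum(values) / len(values)
    return sum((x - x_bar) ** 2 for x in values) / (len(values) - 1)

data = [2, 4, 4, 4, 5, 5, 7, 9]  # mean is 5, squared deviations sum to 32

print(population_variance(data))  # 4.0 (32 / 8)
print(sample_variance(data))      # 32 / 7, a little larger than 4

# Cross-check against the standard library.
print(math.isclose(population_variance(data), statistics.pvariance(data)))
print(math.isclose(sample_variance(data), statistics.variance(data)))
```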


Measures of Dispersion
• Standard Deviation:
– Measures, on average, how far each observation
lies from the mean. A large amount of variation
means that the data values are far from the mean,
hence the standard deviation (sd) is large

– Population sd: $\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2}$

– Sample sd: $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}$
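To illustrate that values far from the mean give a large standard deviation, the sketch below compares two invented data sets that share the same mean. `population_sd` is a hypothetical helper that takes the square root of the population variance.

```python
import math
import statistics

def population_sd(values):
    """Square root of the population variance."""
    mu = sum(values) / len(values)
    return math.sqrt(sum((x - mu) ** 2 for x in values) / len(values))

tight = [48, 49, 50, 51, 52]  # mean 50, values close to the mean
wide = [10, 30, 50, 70, 90]   # mean 50, values far from the mean

print(population_sd(tight))  # about 1.41
print(population_sd(wide))   # about 28.28

# Cross-check against the standard library.
print(math.isclose(population_sd(wide), statistics.pstdev(wide)))
```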



Measures of Relative Dispersion
• Unitless and are used to compare the scatter of
one distribution with the scatter of another
distribution.
• Coefficient of Variation: is a statistical measure of
the dispersion of data points in a data series
around the mean. It is the ratio of the standard
deviation to the mean.
– Useful when interest is in the size of variation relative
to the size of the observation.

$CV = \frac{\sigma}{\mu} \times 100\%$



Example 1
The foreign exchange rate is an indicator of the stability of the
peso and of economic performance. Market forces, not
government policy, have determined the level of the peso,
since the government intervenes through the Bangko Sentral
ng Pilipinas only when there are speculative elements in the
market. Given below are the means and standard deviations of
the quarterly peso-dollar (P-$) exchange rate for the periods
1998 to 1999 and 2000 to 2001. Which of the two periods is
more stable?
Period       Mean    Standard Deviation
1998-1999    40.4    2.01
2000-2001    48.6    1.21



Solution to Example 1
• 1998-1999
CV(1998-1999) = (2.01/40.4) × 100% = 4.98%

• 2000-2001
CV(2000-2001) = (1.21/48.6) × 100% = 2.49%

• Thus the period 2000-2001 is more stable with
respect to the peso-dollar exchange rate, since its
CV is smaller
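The arithmetic in this solution can be reproduced with a short sketch. The function name `coefficient_of_variation` is illustrative; the means and standard deviations are the ones given in Example 1.

```python
def coefficient_of_variation(sd, mean):
    """Standard deviation as a percentage of the mean (unitless)."""
    return sd / mean * 100

cv_1998_1999 = coefficient_of_variation(2.01, 40.4)
cv_2000_2001 = coefficient_of_variation(1.21, 48.6)

print(round(cv_1998_1999, 2))  # 4.98
print(round(cv_2000_2001, 2))  # 2.49

# The period with the smaller CV has less variation relative to its mean.
more_stable = "2000-2001" if cv_2000_2001 < cv_1998_1999 else "1998-1999"
print(more_stable)  # 2000-2001
```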
Inferential Statistics
• Deals with the methods to generalize what the
sample data show

• Inferential statistics answer the question:
– Can I generalize to the population the patterns/
differences/profile that I see in my sample?

• Note that there is no need to do inferential
statistics if the data are already population data

Next topic…
• Studying Relationships
– Correlation Analysis
– Regression Analysis
– Important Statistics and Tests
