Foundations or Research Analysis

What is Data?
• Observations of a set of variables

• Lowest level of abstraction from which information is derived
• Each Discipline has evolved its own method of classification of data
• Two Broad Classifications of Data Based on Source

– Primary Data:
• Data Collected from Primary Source
– Secondary Data:
• Data Collected From Secondary Source
Classification :: Statistics
• Categorical Data
– The Objects are grouped into categories based on some Qualitative Trait
– The resultant data are merely labels or categories
– Example:
• Hair Color: Brown / Black / Red
• Attitude toward Smoking: Favor / Neutral / Against
• Measurement Data
– The Objects are “measured” on some Quantitative Trait
– The resultant data is a set of numbers
– Example:
• Age of the Students
• JEMAT Score
• Number of Students Not Attending Class
Categorical Data
• Nominal Data
– A type of categorical data in which numbers act as a label without having
any specific meaning
– Example:
• Male : 1
• Female : 2
• Ordinal Data
– A type of categorical data in which numbers act as a guide to the level of
importance of the object
– Example:
• Mild
• Moderate
• Severe
Measurement Data
• Discrete Data
– Only Certain Values are Possible
– There are gaps between the possible values
– Are generated through the process of Counting
– Example:
• Number of students in the class
• Number of Employees Absent from Work
• Continuous Data
– Any value within an interval is possible with a suitable measuring device
– Theoretically, the number can be accurate to any desired number of
decimal places
– Are generated through the process of Measurement
– Example:
• Height in cm
• Time to complete the assignment
Classification :: Scaling Theory
• Nominal Data (Order: No, Distance: No, Origin: No)
– A type of categorical data in which numbers act as a label without having
any specific meaning
– Example:
• Male : 1
• Female : 2
• Ordinal Data (Order: Yes, Distance: No, Origin: No)
– A type of categorical data in which numbers act as a guide to the level of
importance of the object
– Example:
• Mild
• Moderate
• Severe
Classification :: Scaling Theory
• Interval Data (Order: Yes, Distance: Yes, Origin: No)
– Quantitative data that does not have a true zero point
– Allows comparison of differences within the scale, but not ratios or
comparisons outside the scale
– Widely used in social research, although many researchers are unclear
about what qualifies as an interval scale
– Example:
• Definitely Will Buy / Probably Will Buy / May or May not Buy / Probably Will not
Buy / Definitely Will not Buy (a rating scale that is strictly ordinal but is
commonly treated as interval)
• Ratio Data (Order: Yes, Distance: Yes, Origin: Yes)
– Quantitative data that has a true zero point
– Allows conversion to another scale while preserving relative magnitudes (ratios)
– Example:
• Distance in Kms
Why understand Data?
• The type of Analysis depends on the Type of data you
have collected
• General Guideline is as follows ("+" means in addition to
the statistics listed for the scales above it):
– Nominal Data: Mode, Chi-Square
– Ordinal Data: + Median / Percentiles
– Interval Data: + Mean / SD / Correlation / Regression / ANOVA
– Ratio Data: + Geometric Mean / Harmonic Mean / Coefficient of Variation / Logarithms
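The guideline above is cumulative: each scale inherits the statistics of the weaker scales. A minimal Python sketch of that lookup follows. The names `SCALES`, `ADDED_STATISTICS` and `permissible_statistics` are illustrative assumptions, not part of the original slides.

```python
# Scales ordered from weakest to strongest, as in the guideline above.
SCALES = ["nominal", "ordinal", "interval", "ratio"]

# Statistics that each scale ADDS to those of the weaker scales.
ADDED_STATISTICS = {
    "nominal": ["mode", "chi-square"],
    "ordinal": ["median", "percentiles"],
    "interval": ["mean", "standard deviation", "correlation", "regression", "ANOVA"],
    "ratio": ["geometric mean", "harmonic mean", "coefficient of variation", "logarithms"],
}


def permissible_statistics(scale):
    """Return every statistic allowed at `scale`, including inherited ones."""
    level = SCALES.index(scale)
    allowed = []
    for name in SCALES[: level + 1]:
        allowed.extend(ADDED_STATISTICS[name])
    return allowed


print(permissible_statistics("ordinal"))
```

For ordinal data the lookup returns the nominal statistics plus the median and percentiles, but not the mean.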
Some Points to Remember
• Tend to use Interval Scales
• Data need not be comparable with other studies
• Data has to make sense in your context
• Students fail to understand the importance of Data
– Wrong Approach
• “I have collected the data… now what do I do?”
– Right Approach
• “What data do I need? Why do I need it? Where will I get it? How will I
analyse it to get the answer?”
Descriptive Statistics :: A Quick Review
Measures of Central Tendency
• Central tendency is “loosely” defined as the concept of
location of the center of a distribution of data
• Three basic measures
– Arithmetic Mean
– Median
– Mode
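As a quick illustration, the three measures can be computed with Python's standard `statistics` module. The `scores` list is hypothetical data chosen for this sketch.

```python
import statistics

scores = [52, 61, 61, 67, 70, 74, 88]  # hypothetical test scores

mean = statistics.mean(scores)      # sum of values / number of values
median = statistics.median(scores)  # middle value of the sorted data
mode = statistics.mode(scores)      # most frequent value

print(mean, median, mode)
```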
Arithmetic Mean
• Advantages:
– Easy to Compute
– Affected by every value in the set of observations
– Defined by rigid mathematical formulation
– It is relatively reliable
– It represents the “center of gravity” of the data
• Disadvantages:
– Unduly affected by extremely small and / or large values
– Cannot be calculated for data with open-ended classes
– Is a good measure only when the distribution is fairly symmetric
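The sensitivity of the mean to extreme values can be seen in a small sketch. The `incomes` figures are hypothetical. One very large value is added, and the mean is compared with the median before and after.

```python
import statistics

incomes = [20, 22, 25, 27, 30]   # hypothetical, fairly symmetric data
with_outlier = incomes + [500]   # the same data plus one extreme value

print(statistics.mean(incomes), statistics.median(incomes))
print(statistics.mean(with_outlier), statistics.median(with_outlier))
```

The single extreme value moves the mean from 24.8 to 104, while the median only moves from 25 to 26.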
Median
• Advantages
– Refers to the “Middle Value” of the distribution
– It is a “positional measure”
– Useful in case of open-ended classes
– Not seriously affected by Extreme Values
– Most appropriate for dealing with Qualitative Rank Data
– Has a series of related positional measures like Quartiles, Deciles,
Percentiles
• Disadvantages:
– It does not take every value into consideration
– It is not capable of algebraic treatment
– It is erratic if the number of items is small
Mode
• Advantages:
– It is the most typical or representative value of a distribution
– Not unduly affected by extreme values
– It can be used to describe qualitative phenomena
• Disadvantages:
– Mode may not be there in a distribution or may be present more
than once in a distribution
– Not capable of algebraic treatment
– It is not rigidly defined for calculation
Relation Between the 3 Measures
• In a moderately skewed distribution (empirical relationship):
Mode ≈ 3 Median – 2 Mean
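A small worked example of the empirical relationship is sketched below. The function name `empirical_mode` and the input values are assumptions made for illustration. The result is only an approximation, and only for moderately skewed data.

```python
def empirical_mode(mean, median):
    """Approximate the mode from the mean and median: 3 * Median - 2 * Mean."""
    return 3 * median - 2 * mean


# Hypothetical summary values: mean 50, median 48.
print(empirical_mode(50, 48))
```

With a mean of 50 and a median of 48 the estimated mode is 3 × 48 – 2 × 50 = 44.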
Measures of Dispersion
• Dispersion is defined as the degree to which data tends to
spread about a central value
• Four Absolute Measures, each with a corresponding Relative Measure
– Range : Coefficient of Range
– Quartile Deviation : Coefficient of Quartile Deviation
– Mean Absolute Deviation : Coefficient of MAD
– Standard Deviation : Coefficient of Variation
• Range and QD are positional measures of dispersion
• MAD and SD are calculated measures of dispersion
Range
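• Range = Largest Value – Smallest Value
• Coefficient of Range = (Largest Value – Smallest Value) / (Largest Value + Smallest Value)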
• Advantages
– Simplest to understand and compute
• Disadvantages:
– Not based on each and every item in the data
– Does not take into account the shape of distribution
– Cannot be computed in case of open ended classes
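A minimal sketch of the range and its relative counterpart, using the usual textbook definitions (largest minus smallest, and that difference divided by their sum). The function names and data are illustrative assumptions.

```python
def value_range(data):
    """Absolute measure: largest value minus smallest value."""
    return max(data) - min(data)


def coefficient_of_range(data):
    """Relative measure: (largest - smallest) / (largest + smallest)."""
    largest, smallest = max(data), min(data)
    return (largest - smallest) / (largest + smallest)


values = [12, 15, 18, 22, 30]  # hypothetical data
print(value_range(values), coefficient_of_range(values))
```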
Quartile Deviation
• Inter Quartile Range (IQR) = Q3 – Q1
• Quartile Deviation (Semi IQR) = (Q3 – Q1) / 2
• Coefficient of QD = (Q3 – Q1) / (Q3 + Q1)
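The three quartile-based measures can be sketched with `statistics.quantiles` from the Python standard library. Quartile conventions differ between textbooks and software. The `method="inclusive"` choice, the function name `quartile_measures` and the data are assumptions for this example.

```python
import statistics


def quartile_measures(data):
    """Return (IQR, quartile deviation, coefficient of QD) for a list of numbers."""
    # "inclusive" is one of several quartile conventions (assumed here).
    q1, _q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    return iqr, iqr / 2, iqr / (q3 + q1)


values = [2, 4, 6, 8, 10, 12, 14, 16, 18]  # hypothetical data
iqr, qd, coeff_qd = quartile_measures(values)
print(iqr, qd, coeff_qd)
```

For this data Q1 = 6 and Q3 = 14, so IQR = 8, QD = 4 and the coefficient of QD = 8 / 20 = 0.4.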
Quartile Deviation
• Advantages:
– Can measure variation in open ended distributions
– It is extremely useful in case of erratic or badly skewed data
– It is not affected by extreme values
• Disadvantages:
– Ignores 50% of the data
– Is not capable of mathematical manipulation
– Is not strictly a measure of dispersion about a central value:
• It effectively shows only the distance between two positional points
Mean Absolute Deviation
• Mean Absolute Deviation (MAD) defined as:
MAD = Σ |x – A| / n, where A is the Mean or the Median and n is the
number of observations
• Coefficient of MAD defined as:
= MAD / Median or MAD / Mean (matching the average used for A)
• Advantages:
– Simple to understand and compute
– Based on each and every item in the data
– Less affected by extreme values than the Standard Deviation
• Disadvantage:
– It is not capable of further mathematical treatment
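A minimal sketch of the MAD and its coefficient, taking deviations from the mean by default. The function names and the data are illustrative assumptions. Passing the median as `center` gives the median-based version.

```python
import statistics


def mean_absolute_deviation(data, center=None):
    """Average absolute deviation from `center` (the mean if not given)."""
    if center is None:
        center = statistics.mean(data)
    return sum(abs(x - center) for x in data) / len(data)


def coefficient_of_mad(data):
    """MAD about the mean, divided by the mean."""
    return mean_absolute_deviation(data) / statistics.mean(data)


values = [2, 4, 6, 8, 10]  # hypothetical data
print(mean_absolute_deviation(values), coefficient_of_mad(values))
```

Here the mean is 6, the absolute deviations are 4, 2, 0, 2, 4, so MAD = 12 / 5 = 2.4 and the coefficient is 2.4 / 6 = 0.4.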
Standard Deviation
• Defined as “Root Mean Squared Deviation from Mean”
SD = √( Σ (x – Mean)² / n )
• Coefficient of Variation
CV = (SD / Mean) × 100
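The sketch below computes the standard deviation in its population form (dividing by n, which matches "root mean squared deviation from mean") and the coefficient of variation as a percentage. The data are hypothetical. A sample standard deviation (dividing by n – 1, `statistics.stdev`) would give a slightly larger value.

```python
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data

mean = statistics.mean(values)
sd = statistics.pstdev(values)  # population SD: sqrt(sum((x - mean)^2) / n)
cv = sd / mean * 100            # coefficient of variation, in percent

print(mean, sd, cv)
```

For this data the mean is 5, the SD is 2 and the coefficient of variation is 40%.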
Standard Deviation
• Advantages:
– Best Measure of Dispersion
– Possible to calculate the combined standard deviation of two or
more groups
– Chebycheff’s Theorem (Chebycheff, 1821-1894)
• Whatever the distribution, at least 75% of the values will fall
within +/- 2 SD from the mean of the distribution and at least 89% will
fall within +/- 3 SD from the mean of the distribution
– Has relation with other measures (approximately, for a normal distribution):
• QD = 0.667 SD
• MD (Mean Absolute Deviation) = 0.80 SD
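Chebycheff's bound, 1 – 1/k² for k standard deviations, can be checked on any data set. The sketch below uses deliberately skewed, hypothetical data. The function names are assumptions for illustration.

```python
import statistics


def chebyshev_bound(k):
    """Minimum share of values within k standard deviations of the mean."""
    return 1 - 1 / k ** 2


def share_within(data, k):
    """Observed share of values within k population SDs of the mean."""
    mu = statistics.mean(data)
    sigma = statistics.pstdev(data)
    inside = [x for x in data if abs(x - mu) <= k * sigma]
    return len(inside) / len(data)


skewed = [1, 1, 1, 2, 2, 3, 4, 5, 8, 40]  # hypothetical, badly skewed data
for k in (2, 3):
    print(k, chebyshev_bound(k), share_within(skewed, k))
```

Even for this skewed data the observed shares (0.9 and 1.0) stay above the guaranteed minimums of 0.75 and about 0.89.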
Skewness
• Refers to the asymmetry in the shape of the distribution
• Important to test skewness in data analysis as skewed
data suggest that the assumption of normality is violated
Skewness - Measures
• Karl Pearson’s Measure of Skewness:
Sk = (Mean – Mode) / Standard Deviation
OR
Sk = 3 (Mean – Median) / Standard Deviation
- Skewness coefficient > 0 is positively skewed
- Skewness coefficient < 0 is negatively skewed
- Skewness coefficient = 0 is symmetrical
• Bowley’s Measure
Sk = (Q3 + Q1 – 2 Median) / (Q3 – Q1)
• Moments Measure
Sk = μ3 / σ³, where μ3 is the third central moment and σ is the Standard Deviation
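A minimal sketch of the median-based form of Karl Pearson's measure, 3 (Mean – Median) / SD, using the population standard deviation. The function name and the data sets are illustrative assumptions.

```python
import statistics


def pearson_skewness(data):
    """Pearson's median-based skewness: 3 * (mean - median) / SD."""
    mean = statistics.mean(data)
    median = statistics.median(data)
    sd = statistics.pstdev(data)  # population SD (assumed convention)
    return 3 * (mean - median) / sd


right_tailed = [1, 2, 3, 4, 10]  # hypothetical: long tail to the right
symmetric = [1, 2, 3, 4, 5]      # hypothetical: symmetric
left_tailed = [-10, -4, -3, -2, -1]  # mirror image of right_tailed

print(pearson_skewness(right_tailed))
print(pearson_skewness(symmetric))
print(pearson_skewness(left_tailed))
```

The three results are positive, zero and negative respectively, matching the interpretation rules above.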
Kurtosis
• Kurtosis means “Bulginess”
• Refers to the degree of flatness or peaked-ness in the
region about the mode of the distribution:
– Lepto-Kurtic : If the curve is more peaked than Normal Curve
– Meso-Kurtic : If the curve is the same as the Normal Curve
– Platy-Kurtic : If the curve is less peaked than Normal Curve
• Kurtosis by itself does not violate normality (the Normal Curve is
Meso-Kurtic); a marked departure from it does
• Important to check Kurtosis because it shows the
distribution of data around the mode
KURTOSIS - Measures
• Kurtosis = μ4 / σ⁴, where μ4 is the fourth central moment and σ is the Standard Deviation
• Excess Kurtosis = Kurtosis – 3
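The moment-based definition can be sketched as follows: kurtosis is the fourth central moment divided by the squared second central moment, and excess kurtosis subtracts 3. The function names and the data are illustrative assumptions. Statistical packages may apply small-sample corrections that this sketch omits.

```python
def central_moment(data, r):
    """r-th central moment: average of (x - mean) ** r."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** r for x in data) / len(data)


def kurtosis(data):
    """Fourth central moment divided by the squared variance."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2


def excess_kurtosis(data):
    """Kurtosis relative to the normal distribution (which has kurtosis 3)."""
    return kurtosis(data) - 3


flat = [1, 2, 3, 4, 5]  # hypothetical, evenly spread data
print(kurtosis(flat), excess_kurtosis(flat))
```

For this evenly spread data the kurtosis is 6.8 / 4 = 1.7, so the excess kurtosis is negative and the data would be described as platykurtic.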
Interpretation
• A normal distribution has kurtosis exactly 3 (excess
kurtosis exactly 0). Any distribution with kurtosis ≈3
(excess ≈0) is called mesokurtic.
• A distribution with kurtosis <3 (excess kurtosis <0) is
called platykurtic. Compared to a normal distribution, its
central peak is lower and broader, and its tails are shorter
and thinner.
• A distribution with kurtosis >3 (excess kurtosis >0) is
called leptokurtic. Compared to a normal distribution, its
central peak is higher and sharper, and its tails are longer
and fatter.
[Figures: example curves for Leptokurtic, Mesokurtic and Platykurtic distributions]
Uses of Skewness and Kurtosis
• The distributions of most stock prices and asset returns
are positively or negatively skewed. Skewness indicates
whether a given or future data point is more likely to fall
above or below the mean. It is basically related to
asymmetries (or risks) in information. Higher risks are
generally expected to lead to higher returns
• Kurtosis is used to describe how returns are distributed
around the mean and in the tails. For example, if past data
yield a leptokurtic distribution, most returns cluster close
to the mean, but the fatter tails make occasional extreme
gains or losses more likely than under a normal distribution.
A platykurtic distribution has thinner tails, so extreme
returns are less likely.
What is Descriptive Statistics?
• The following need to be reported:
– Arithmetic Mean
– Median
– Mode
– Standard Deviation
– Variance
– Kurtosis
– Skewness
– Range
– Minimum
– Maximum
– Sum
– Count
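The full list above can be produced in one pass. The sketch below is one possible summary function, not a standard routine. The name `describe`, the dictionary keys and the data are assumptions, and the moments use the population form (dividing by n), so results can differ slightly from spreadsheet or package output that applies sample corrections.

```python
import statistics


def describe(data):
    """Summarise a list of numbers with the measures listed above."""
    n = len(data)
    mean = sum(data) / n
    # Central moments about the mean (population form, dividing by n).
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    m4 = sum((x - mean) ** 4 for x in data) / n
    return {
        "mean": mean,
        "median": statistics.median(data),
        "mode": statistics.mode(data),
        "standard_deviation": m2 ** 0.5,
        "variance": m2,
        "kurtosis": m4 / m2 ** 2,   # undefined if all values are equal
        "skewness": m3 / m2 ** 1.5,
        "range": max(data) - min(data),
        "minimum": min(data),
        "maximum": max(data),
        "sum": sum(data),
        "count": n,
    }


values = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data
for name, value in describe(values).items():
    print(f"{name}: {value}")
```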