
What is Data?

• Observations of a set of variables


• Lowest level of abstraction from which information is derived

• Each Discipline has evolved its own method of classification of data

• Two Broad Classifications of Data Based on Source


– Primary Data:
• Data Collected from Primary Source
– Secondary Data:
• Data Collected From Secondary Source

Classification :: Statistics
• Categorical Data
– The Objects are grouped into categories based on some Qualitative Trait
– The resultant data are merely labels or categories
– Example:
• Hair Color: Brown / Black / Red
• Attitude Toward Smoking: Favor / Neutral / Against
• Measurement Data
– The Objects are “measured” on some Quantitative Trait
– The resultant data is a set of numbers
– Example:
• Age of the Students
• JEMAT Score
• Number of Students Not Attending Class

Categorical Data
• Nominal Data
– A type of categorical data in which numbers act as a label without having
any specific meaning
– Example:
• Male : 1
• Female : 2
• Ordinal Data
– A type of categorical data in which numbers act as a guide to the rank
order of the objects
– Example:
• Mild
• Moderate
• Severe

Measurement Data
• Discrete Data
– Only Certain Values are Possible
– There are gaps between the possible values
– Are generated through the process of Counting
– Example:
• Number of students in the class
• Number of Employees Absent from Work
• Continuous Data
– Any value within an interval is possible with a suitable measuring device
– Theoretically, the number can be accurate to any desired number of
decimal places
– Are generated through the process of Measurement
– Example:
• Height in cm
• Time to complete the assignment

Classification :: Scaling Theory
• Nominal Data (Order: No; Distance: No; Origin: No)
– A type of categorical data in which numbers act as a label without having
any specific meaning
– Example:
• Male : 1
• Female : 2
• Ordinal Data (Order: Yes; Distance: No; Origin: No)
– A type of categorical data in which numbers act as a guide to the rank
order of the objects
– Example:
• Mild
• Moderate
• Severe


Classification :: Scaling Theory
• Interval Data (Order: Yes; Distance: Yes; Origin: No)
– Quantitative data that does not have a true zero point
– Differences between values are meaningful, but ratios of values are not
– Widely used in social research, though many researchers are not clear
about what an interval scale is
– Example:
• Definitely Will Buy / Probably Will Buy / May or May not Buy / Probably Will not
Buy / Definitely Will not Buy
• Ratio Data (Order: Yes; Distance: Yes; Origin: Yes)
– Quantitative data with a true zero point
– Ratios of values are meaningful and are preserved when the data is
converted to another unit
– Example:
• Distance in km


Why understand Data?
• The type of Analysis depends on the Type of data you
have collected
• The general guideline is as follows (each scale also permits the
statistics listed for the scales before it):
– Nominal Data: Mode, Chi-Square
– Ordinal Data: + Median / Percentiles
– Interval Data: + Mean / SD / Correlation / Regression / ANOVA
– Ratio Data: + Geometric Mean / Harmonic Mean / Coefficient of Variation / Logarithms
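
The guideline above can be read as a cumulative lookup table. The following is a minimal sketch of that reading; the names `SCALES`, `ADDED_AT_SCALE`, and `permissible_statistics` are hypothetical and chosen only for illustration.

```python
# Scales ordered from least to most informative.
SCALES = ["nominal", "ordinal", "interval", "ratio"]

# Statistics that become available at each scale (taken from the guideline).
ADDED_AT_SCALE = {
    "nominal": ["mode", "chi-square"],
    "ordinal": ["median", "percentiles"],
    "interval": ["mean", "standard deviation", "correlation",
                 "regression", "ANOVA"],
    "ratio": ["geometric mean", "harmonic mean",
              "coefficient of variation", "logarithms"],
}


def permissible_statistics(scale):
    """Return every statistic allowed at `scale`, including lower scales."""
    level = SCALES.index(scale)  # raises ValueError for an unknown scale
    allowed = []
    for name in SCALES[: level + 1]:
        allowed.extend(ADDED_AT_SCALE[name])
    return allowed


print(permissible_statistics("ordinal"))
# ['mode', 'chi-square', 'median', 'percentiles']
```

Reading the table cumulatively is one interpretation of the "+" signs in the guideline: a mean is not offered for ordinal data, but a median is still available for ratio data.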

Some Points to Remember
• Tend to use Interval Scales
• Data need not be comparable with other studies
• Data has to make sense in your context
• Students fail to understand the importance of Data
– Wrong Approach
• “I have collected the data… now what do I do?”
– Right Approach
• “What data do I need? Why do I need it? Where will I get it? How will I
analyse it to get the answer?”

Descriptive Statistics :: A Quick Review

Measures of Central Tendency
• Central tendency is “loosely” defined as the concept of
location of the center of a distribution of data
• Three basic measures
– Arithmetic Mean
– Median
– Mode
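
As a quick illustration, the three measures can be computed with Python's standard `statistics` module. The `scores` list is made-up data used only for this sketch.

```python
import statistics

# Hypothetical test scores (illustrative data, not from the slides).
scores = [52, 60, 60, 65, 71, 78, 90]

mean = statistics.mean(scores)      # sum of values / number of values
median = statistics.median(scores)  # middle value of the sorted data
mode = statistics.mode(scores)      # most frequent value

print(mean, median, mode)  # 68 65 60
```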

Arithmetic Mean
• Advantages:
– Easy to Compute
– Affected by every value in the set of observations
– Defined by rigid mathematical formulation
– It is relatively reliable
– It represents the “center of gravity” of the data
• Disadvantages:
– Unduly affected by extremely small and / or large values
– Cannot be calculated for data with open-ended classes
– Is a good measure only when the distribution is fairly symmetric

Median
• Advantages
– Refers to the “Middle Value” of the distribution
– It is a “positional measure”
– Useful in case of open ended class
– Not seriously affected by Extreme Values
– Most appropriate for dealing with Qualitative Rank Data
– Has a series of related positional measures like Quartiles, Deciles,
Percentiles
• Disadvantages:
– It does not take every value into consideration
– It is not capable of algebraic treatment
– It is erratic if the number of items is small

Mode
• Advantages:
– It is the most typical or representative value of a distribution
– Not unduly affected by extreme values
– It can be used to describe qualitative phenomenon
• Disadvantages:
– Mode may not be there in a distribution or may be present more
than once in a distribution
– Not capable of algebraic treatment
– It is not rigidly defined for calculation

Relation Between the 3 Measures
• In a moderately skewed distribution (approximately):
Mode = 3 Median – 2 Mean
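
The relation is empirical, so it gives an estimate rather than an exact value. A small sketch, using the hypothetical helper name `empirical_mode` and made-up inputs (mean 68, median 65):

```python
def empirical_mode(mean, median):
    """Estimate the mode from the empirical relation Mode = 3 Median - 2 Mean."""
    return 3 * median - 2 * mean


# Illustrative values: mean 68 and median 65.
print(empirical_mode(mean=68, median=65))  # 59
```

The same relation can be rearranged as Mean – Mode = 3 (Mean – Median), which is why Karl Pearson's skewness measure has two equivalent-looking forms.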

Measures of Dispersion
• Dispersion is defined as the degree to which data tends to
spread about a central value
• Four Absolute Measures, Each with a Relative Counterpart
– Range (relative: Coefficient of Range)
– Quartile Deviation (relative: Coefficient of Quartile Deviation)
– Mean Absolute Deviation (relative: Coefficient of MAD)
– Standard Deviation (relative: Coefficient of Variation)

• Range and QD are positional measures of dispersion


• MAD and SD are calculated measures of dispersion

Range
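• Range = Largest Value – Smallest Value
• Coefficient of Range = (Largest – Smallest) / (Largest + Smallest)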

• Advantages
– Simplest to understand and compute
• Disadvantages:
– Not based on each and every item in the data
– Does not take into account the shape of distribution
– Cannot be computed in case of open ended classes

Quartile Deviation
• Inter Quartile Range (IQR) = Q3 – Q1

• Quartile Deviation (Semi IQR) = (Q3 – Q1) / 2

• Coefficient of QD = (Q3 – Q1) / (Q3 + Q1)
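
A short sketch of the positional measures of dispersion. The data are made up, and the quartiles come from `statistics.quantiles` with its default "exclusive" method; other quartile conventions can give slightly different values.

```python
import statistics

# Illustrative data (not from the slides).
data = [2, 4, 6, 8, 10, 12, 14]

# Range and its relative counterpart.
data_range = max(data) - min(data)
coef_range = data_range / (max(data) + min(data))

# Quartiles (default "exclusive" method), then IQR, QD and coefficient of QD.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1
qd = iqr / 2
coef_qd = (q3 - q1) / (q3 + q1)

print(data_range, coef_range)  # 12 0.75
print(q1, q3, iqr, qd, coef_qd)  # 4.0 12.0 8.0 4.0 0.5
```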

Quartile Deviation
• Advantages:
– Can measure variation in open ended distributions
– It is extremely useful in case of erratic or badly skewed data
– It is not affected by extreme values
• Disadvantages:
– Ignores 50% of the data
– Is not capable of mathematical manipulation
– Is sometimes not considered a true measure of dispersion:
• It effectively shows only the distance between two positional points

Mean Absolute Deviation
• Mean Absolute Deviation (MAD) defined as:
MAD = Σ |x – A| / N, where A is the Mean or the Median
• Coefficient of MAD defined as:
= MAD / Median or MAD / Mean
• Advantages:
– Simple to understand and compute
– Based on each and every item in the data
– Less affected by extreme values than the range or the standard deviation
• Disadvantage:
– It is not capable of mathematical treatment
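
A minimal sketch of MAD and its coefficient, taking deviations from the mean. The function name `mean_absolute_deviation` and the data are assumptions made for illustration; passing the median as `center` gives the median-based variant.

```python
import statistics


def mean_absolute_deviation(data, center):
    """Average absolute distance of the values from `center`."""
    return sum(abs(x - center) for x in data) / len(data)


# Illustrative data (not from the slides).
data = [2, 4, 6, 8, 10]

mean = statistics.mean(data)
mad = mean_absolute_deviation(data, mean)
coef_mad = mad / mean  # relative measure: MAD / Mean

print(mad)
print(round(coef_mad, 3))
```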

Standard Deviation
• Defined as “Root Mean Squared Deviation from Mean”:
SD = √[ Σ (x – Mean)² / N ]

• Coefficient of Variation = (SD / Mean) × 100
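
A short sketch of both quantities, assuming the population form of the SD (dividing by N), which matches the "root mean squared deviation" wording. The data are made up; `statistics.stdev` would give the sample form (dividing by N – 1) instead.

```python
import statistics

# Illustrative data (not from the slides).
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)
sd = statistics.pstdev(data)  # population SD: sqrt(sum((x - mean)^2) / N)
cv = 100 * sd / mean          # coefficient of variation, in percent

print(mean, sd, cv)
```

Because the CV divides by the mean, it is only meaningful for ratio-scale data with a non-zero mean.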

Standard Deviation
• Advantages:
– Best Measure of Dispersion
– Possible to calculate the combined standard deviation of two or
more groups
– Chebyshev’s Theorem (P. L. Chebyshev, 1821-1894)
• Whatever the distribution, at least 75% of the values will fall
within +/- 2 SD of the mean and at least 88.9% (about 89%) will
fall within +/- 3 SD of the mean
– Has approximate relations with other measures (for a normal distribution):
• QD ≈ 0.667 SD
• MD ≈ 0.80 SD
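
The percentages in Chebyshev's theorem come from the bound 1 – 1/k² for k standard deviations. A small sketch, with hypothetical helper names and deliberately skewed made-up data, compares the bound with the observed share of values:

```python
import statistics


def chebyshev_lower_bound(k):
    """Minimum fraction of values within k SDs of the mean (requires k > 1)."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 1 - 1 / k ** 2


def fraction_within(data, k):
    """Observed fraction of values within k population SDs of the mean."""
    mean = statistics.mean(data)
    sd = statistics.pstdev(data)
    inside = [x for x in data if abs(x - mean) <= k * sd]
    return len(inside) / len(data)


# Strongly skewed illustrative data: nine small values and one large one.
data = [1] * 9 + [50]

print(chebyshev_lower_bound(2))            # 0.75
print(round(chebyshev_lower_bound(3), 3))  # 0.889
print(fraction_within(data, 2))            # 0.9
```

The observed share (0.9) is above the guaranteed minimum (0.75), as the theorem requires for any distribution.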

Skewness
• Refers to the asymmetry in the shape of the distribution

• Important to test skewness in data analysis as skewed
data suggest that the assumption of normality is violated

Skewness - Measures
• Karl Pearson’s Measure of Skewness:
(Mean – Mode) / Standard Deviation
OR
3 (Mean – Median) / Standard Deviation
– Skewness coefficient > 0 is positively skewed
– Skewness coefficient < 0 is negatively skewed
– Skewness coefficient = 0 is symmetrical

• Bowley’s Measure = (Q3 + Q1 – 2 Median) / (Q3 – Q1)

• Moments Measure = μ3 / σ³, where μ3 is the third central moment and σ is the SD
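
The three measures can be compared on the same data. This is a sketch under stated assumptions: the function names and the data are made up, Pearson's measure uses the median form, population moments are used throughout, and quartiles come from the default method of `statistics.quantiles`.

```python
import statistics


def pearson_skewness(data):
    """Karl Pearson's measure, median form: 3 (Mean - Median) / SD."""
    mean = statistics.mean(data)
    median = statistics.median(data)
    return 3 * (mean - median) / statistics.pstdev(data)


def bowley_skewness(data):
    """Bowley's measure: (Q3 + Q1 - 2 Median) / (Q3 - Q1)."""
    q1, q2, q3 = statistics.quantiles(data, n=4)
    return (q3 + q1 - 2 * q2) / (q3 - q1)


def moment_skewness(data):
    """Moments measure: third central moment divided by SD cubed."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    return m3 / m2 ** 1.5


# Illustrative data with a long right tail (one large value).
right_skewed = [1, 2, 2, 3, 3, 3, 4, 4, 5, 13]

for measure in (pearson_skewness, bowley_skewness, moment_skewness):
    print(measure.__name__, round(measure(right_skewed), 3))
```

All three coefficients are positive for this data, matching the "positively skewed" rule, although their magnitudes are not directly comparable with one another.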

Kurtosis
• Kurtosis means “Bulginess”
• Refers to the degree of flatness or peaked-ness in the
region about the mode of the distribution:
– Lepto-Kurtic : If the curve is more peaked than Normal Curve
– Meso-Kurtic : If the curve is the same as the Normal Curve
– Platy-Kurtic : If the curve is less peaked than Normal Curve

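• Kurtosis by itself does not make a distribution asymmetric, but a
kurtosis far from that of the Normal Curve does indicate a departure from normality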


• Important to check Kurtosis because it shows the
distribution of data around the mode

Kurtosis - Measures

• Kurtosis = μ4 / σ⁴, where μ4 is the fourth central moment and σ is the SD

• Excess Kurtosis = Kurtosis – 3

Interpretation
• A normal distribution has kurtosis exactly 3 (excess
kurtosis exactly 0). Any distribution with kurtosis ≈3
(excess ≈0) is called mesokurtic.
• A distribution with kurtosis <3 (excess kurtosis <0) is
called platykurtic. Compared to a normal distribution, its
central peak is lower and broader, and its tails are shorter
and thinner.
• A distribution with kurtosis >3 (excess kurtosis >0) is
called leptokurtic. Compared to a normal distribution, its
central peak is higher and sharper, and its tails are longer
and fatter.
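
The interpretation rules above can be sketched as a small classifier. The function names, the data, and the `tolerance` used to decide what counts as "≈ 3" are all assumptions made for illustration; population moments are used.

```python
def kurtosis(data):
    """Moment-based kurtosis: fourth central moment / (variance squared)."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n
    m4 = sum((x - mean) ** 4 for x in data) / n
    return m4 / m2 ** 2


def classify_kurtosis(data, tolerance=0.1):
    """Label the data by excess kurtosis; `tolerance` is an arbitrary choice."""
    excess = kurtosis(data) - 3
    if excess > tolerance:
        return "leptokurtic"
    if excess < -tolerance:
        return "platykurtic"
    return "mesokurtic"


# Illustrative data: evenly spread values vs. a sharp peak with two outliers.
flat = list(range(1, 11))
peaked = [0] * 8 + [-10, 10]

print(classify_kurtosis(flat))    # platykurtic
print(classify_kurtosis(peaked))  # leptokurtic
```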
[Figures: example curves for Leptokurtic, Mesokurtic, and Platykurtic distributions]
Uses of Skewness and Kurtosis
• Most stock prices and asset returns are positively or
negatively skewed. Skewness indicates whether a given or
future data point is more likely to fall above or below
the mean. It is basically related to asymmetries (or
risks) in information. Higher risks are associated with
higher expected returns
• Kurtosis is used to describe how returns behave around the mean.
For example, if past data yield a leptokurtic distribution,
most return values lie close to the mean, but the fatter
tails mean that occasional extreme gains or losses are more
likely than under a normal distribution. A platykurtic distribution
has returns spread more broadly around the mean, with fewer extreme values.
What is Descriptive Statistics?
• The following need to be reported:
– Arithmetic Mean
– Median
– Mode
– Standard Deviation
– Variance
– Kurtosis
– Skewness
– Range
– Minimum
– Maximum
– Sum
– Count
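
A minimal sketch that produces all twelve items in one report. The function name `describe` and the data are hypothetical. Population formulas are assumed throughout, so the SD, variance, skewness, and kurtosis may differ from spreadsheet or package output that uses sample formulas.

```python
import statistics


def describe(data):
    """Return the twelve descriptive statistics listed above as a dict."""
    n = len(data)
    mean = statistics.mean(data)
    # Central moments (population form) for skewness and kurtosis.
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    m4 = sum((x - mean) ** 4 for x in data) / n
    return {
        "mean": mean,
        "median": statistics.median(data),
        "mode": statistics.mode(data),
        "standard deviation": statistics.pstdev(data),
        "variance": statistics.pvariance(data),
        "kurtosis": m4 / m2 ** 2,
        "skewness": m3 / m2 ** 1.5,
        "range": max(data) - min(data),
        "minimum": min(data),
        "maximum": max(data),
        "sum": sum(data),
        "count": n,
    }


# Illustrative data (not from the slides).
summary = describe([2, 4, 4, 4, 5, 5, 7, 9])
for name, value in summary.items():
    print(f"{name}: {value}")
```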
