
DATE: 30-10-2022

STATISTICAL DATA ANALYSIS TECHNIQUES


Learning Outcomes
At the end of this class, students will be able to:

❑ Describe statistical analysis techniques

❑ Describe primary measurement scales with appropriate examples

❑ Define normal distribution and its characteristics

❑ Highlight appropriate tests under parametric and non-parametric statistical analysis

❑ Compute the t-test, F-test, chi-square test, and ANOVA using MS Excel

❑ Describe non-parametric statistical tests and the associated tests

❑ Describe univariate, bivariate, and multivariate analysis

❑ Highlight methods of data presentation with suitable examples


Introduction
❑ Statistical analysis techniques allow researchers, businesses and

organizations to make sense of data and guide their decision making.

❑ There are different types of statistical analysis techniques that can be

applied to a wide range of data, industries and applications.

❑ Knowing the different statistical analysis methods and how to use them can

help you explore data, find patterns, and discover trends in your research.

❑ Choosing the appropriate statistical test for each research question is the

most challenging part of statistics, but it is also essential for drawing valid

conclusions from the data.


Primary Measurement Scales

A zero on a ratio scale means there is a total absence of the variable you are measuring.
Normal Distribution
❑ An important aspect of the "description" of a variable.

❑ It is a probability distribution that is symmetric about the mean, showing that data near

the mean are more frequent in occurrence than data far from the mean.

❑ Why is the normal distribution so important?

❑ Because the sampling distributions of many test statistics are normal or follow

forms that can be derived from the normal distribution.

❑ It also indicates the type of analysis that can be performed on the data (selection of tests).

❑ The normal distribution is a family of curves, each defined by its mean and standard deviation (SD).

❑ The curves are - Symmetrical (mean=median=mode)

- Bell-shaped
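The shape described above can be checked numerically. The following is a minimal sketch using Python's standard `statistics.NormalDist`; the helper name `share_within` and the mean and SD values are illustrative assumptions, not part of the slides. It shows how much of a normal distribution lies within 1, 2 and 3 SDs of the mean.

```python
from statistics import NormalDist


def share_within(k: float, mean: float = 0.0, sd: float = 1.0) -> float:
    """Probability that a normal variable lies within k SDs of its mean."""
    dist = NormalDist(mu=mean, sigma=sd)
    return dist.cdf(mean + k * sd) - dist.cdf(mean - k * sd)


# Values near the mean are far more frequent than values in the tails.
for k in (1, 2, 3):
    print(f"within {k} SD of the mean: {share_within(k):.4f}")
```

The result does not depend on the particular mean or SD chosen, which is why one family of curves covers all normally distributed variables.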
Parametric Statistical Tests
❑ Parametric tests are hypothesis testing procedures which assume that the variables of
interest are measured on an interval or ratio scale and observations must be drawn from
the normally distributed population.
❑ When the dependent variable is measured on a continuous scale, then a parametric test
should typically be selected.
❑ These tests include t-tests, F-tests, z-tests and ANOVA.
❑ The appropriate use of such tests requires one to check whether the data fulfil certain
assumptions or conditions.
❖ First, observations should be independent i.e. the occurrence of A should not affect
the probability of B.
❖ Second, the data should follow a normal distribution (for model residuals, with
mean equal to zero and a given variance).
❑ Statistical tests such as Kolmogorov-Smirnov, Shapiro-Wilk and D’Agostino-Pearson are
used to check this assumption; their null hypothesis is that the sample data come from
a normal distribution.
Parametric Statistical Tests Cont’d
t-test:
i. Independent t-test (Student's):
- one-sample or two-sample t-test
- independent groups (males/females, Malay/Chinese, case/control)

ii. Paired t-test:
- dependent groups (before/after, 2 methods used on the same patients)
ANOVA
- Comparing between more than 2 groups (Malay/Chinese/Indians)
Pearson correlation
- Correlation between 2 quantitative variables (weight and height; weight and age).
Example 1: t-test
The t-test is used when you have 2 conditions and would like to compare them
to see whether the difference between them is significant.
In this example, I want to see if there is a significant difference between
my 2 conditions.
❑ My data came from the same people. Condition A was recorded in the
morning, while Condition B was recorded in the afternoon.
The arguments of Excel's T.TEST function are:
❑ array1: values from Condition A
❑ array2: values from Condition B
❑ tails: choose 1 for a one-tailed test if you have a directional hypothesis
(i.e. you can predict the direction of the effect); otherwise choose 2 for
a two-tailed test.
❑ type: 1 if the data come from the same participants (paired)
❑ type: 2 if the data come from different groups and the variances of the
groups are the same
❑ type: 3 if the data come from different groups and the variances of the
groups are not the same
The resulting p-value shows that the difference between the 2 conditions is
significant at the 0.05 level.
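For readers who want to see the arithmetic behind the paired case (same participants measured twice), here is a minimal sketch in Python. The measurements are hypothetical, since the slide's data are not reproduced in the text, and the function name `paired_t_statistic` is an illustrative choice. It computes the t statistic only and compares it with a tabulated critical value instead of computing a p-value.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(before, after):
    """Return (t, degrees of freedom) for a paired t-test."""
    if len(before) != len(after):
        raise ValueError("paired samples must have the same length")
    # Work on the per-participant differences (after minus before).
    diffs = [b - a for a, b in zip(before, after)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1


# Hypothetical measurements for 5 participants (NOT the data from the slide).
condition_a = [12, 15, 11, 14, 13]  # morning
condition_b = [14, 18, 12, 17, 15]  # afternoon

t_stat, df = paired_t_statistic(condition_a, condition_b)
T_CRITICAL = 2.776  # two-tailed critical value at the 0.05 level for df = 4
print(f"t = {t_stat:.3f}, df = {df}, significant: {abs(t_stat) > T_CRITICAL}")
```

A statistic larger in absolute value than the critical value corresponds to a two-tailed p-value below 0.05.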
Demo Session
Example 2: f-test
We can use the F-test to test whether the variances of 2 populations are equal.

Female Male
935 978
955 782
967 905
1002 973
1000 1006
964 1017
1952 995
933

The F-value is greater than the one-tailed critical value, therefore we reject H0 (the null hypothesis).

$H_0: \sigma_1^2 = \sigma_2^2$ (The null hypothesis states that the variances of the two groups are equal)

$H_1: \sigma_1^2 \neq \sigma_2^2$ (The alternative hypothesis states that the variances of the two groups are not equal)

To carry out this test in MS Excel, Click Data tab -> Analysis Group ->
click on Data Analysis -> Select F-Test Two-Sample for Variances ->OK
-> Variable 1 & 2 Range -> choose Output Range -> OK
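The F statistic that Excel reports is the ratio of the two sample variances. A minimal sketch in Python follows, using the values from the table above. It assumes that the unpaired value 933 belongs to the Female column, which the extracted table does not make certain, and the function name `f_statistic` is illustrative.

```python
from statistics import variance

# Data from the slide. ASSUMPTION: the lone value 933 is a Female observation.
female = [935, 955, 967, 1002, 1000, 964, 1952, 933]
male = [978, 782, 905, 973, 1006, 1017, 995]


def f_statistic(sample_1, sample_2):
    """Return (F, df1, df2), where F = sample variance 1 / sample variance 2."""
    f = variance(sample_1) / variance(sample_2)
    return f, len(sample_1) - 1, len(sample_2) - 1


f_stat, df1, df2 = f_statistic(female, male)
print(f"F = {f_stat:.2f} with ({df1}, {df2}) degrees of freedom")
```

The F value is then compared with the critical value for the same degrees of freedom, as in the Excel output.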
Demo Session
Example 3: ANOVA
We can use the ANOVA test when comparing more than 2 groups.

This example illustrates how to compute a 1-way ANOVA in MS Excel.

Young Middle Old
15 8 12
14 9 13
6 10 14
8 11 15
10 12 16
5 6 5
8 8 11
4 15 12
6 5 11
3 7 15

$H_0: \mu_1 = \mu_2 = \mu_3$ (The means of the 3 groups are equal)

$H_1$: at least one group mean differs from the others

With the p-value < 0.05, we reject the null hypothesis that the means of the 3 groups are equal.


To carry out this test in MS Excel, Click Data tab -> Analysis
Group -> click on Data Analysis -> Select Anova: Single Factor
-> OK -> Input Range -> Labels in first row -> choose Output
Range -> OK
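The F statistic in the Excel output compares the variability between group means with the variability within groups. As a rough sketch, it can be reproduced in Python from the Young/Middle/Old data above. The function name `one_way_anova_f` is illustrative, and the p-value is left to Excel or an F table.

```python
from statistics import mean

young = [15, 14, 6, 8, 10, 5, 8, 4, 6, 3]
middle = [8, 9, 10, 11, 12, 6, 8, 15, 5, 7]
old = [12, 13, 14, 15, 16, 5, 11, 12, 11, 15]


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    # Between-group sum of squares: group means versus the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: each value versus its own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within


f_stat, df_b, df_w = one_way_anova_f([young, middle, old])
print(f"F = {f_stat:.3f} with ({df_b}, {df_w}) degrees of freedom")
```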
Demo Session
Non-Parametric Statistical Tests
❑ When data fail to fulfil the assumptions of a parametric test, researchers opt
for non-parametric tests because they are less restrictive.
❑ These tests can also be used for small sample sizes (< 30) and when the
variables are non-metric (i.e. not measured on an interval or ratio scale).
❑ Non-parametric tests are suited to variables measured on a
nominal or ordinal scale.
❑ The widely used non-parametric tests are:
➢ Chi-square test
➢ Fisher’s exact test
➢ Mann-Whitney U test
➢ McNemar’s test
➢ Wilcoxon rank-sum test
➢ Wilcoxon signed rank test
➢ Kruskal-Wallis 1-way analysis of variance
➢ Spearman correlation
Example: Chi-square test
❑ This example illustrates how to compute a Chi-square test in MS Excel.
❑ In this example, we have a sample of 200 people who visited a local pub.
❑ We want to perform a Chi-square test of independence to see if there is
an association between gender and smoking status in our sample.
❑ The actual values from our experiment are known as the observed values.
❑ H0: There is no association between gender and smoking status
❑ H1: There is an association between gender and smoking status

❑ If p > 0.05, we fail to reject H0

❑ If p < 0.05, we reject H0 in favour of H1


❑ Step 1: Add up each of the rows and columns
❑ Step 2: Work out the expected value for each entry of the table
Expected value = (row total × column total) / overall total

❑ Step 3: Compute your p-value

Select the cell where you want your p-value to appear -> type
=CHISQ.TEST(range of your observed values, range of your expected values) ->
press Enter

With the p-value < 0.05, we reject H0 in favour of H1.
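The three steps above can be sketched in Python as follows. The observed counts are hypothetical (the slide's table of 200 people is not reproduced in the text), and the function names are illustrative. The p-value shortcut used here is valid only for 1 degree of freedom, i.e. a 2 x 2 table.

```python
from math import erfc, sqrt


def chi_square_independence(observed):
    """Return (chi-square statistic, table of expected values)."""
    row_totals = [sum(row) for row in observed]        # Step 1
    col_totals = [sum(col) for col in zip(*observed)]  # Step 1
    total = sum(row_totals)
    # Step 2: expected = (row total x column total) / overall total
    expected = [[r * c / total for c in col_totals] for r in row_totals]
    chi2 = sum(
        (o - e) ** 2 / e
        for o_row, e_row in zip(observed, expected)
        for o, e in zip(o_row, e_row)
    )
    return chi2, expected


def p_value_df1(chi2):
    """Upper-tail p-value for 1 degree of freedom only (a 2 x 2 table)."""
    return erfc(sqrt(chi2 / 2))


# HYPOTHETICAL counts for 200 people: rows = male, female;
# columns = smoker, non-smoker.
observed = [[60, 40], [30, 70]]
chi2, expected = chi_square_independence(observed)
p = p_value_df1(chi2)  # Step 3
print(f"chi-square = {chi2:.3f}, p = {p:.6f}")
```

With these made-up counts the p-value is far below 0.05, so H0 (no association) would be rejected.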
Demo Session
Further Examples on Statistical Tests
Univariate Analysis
❑ Univariate analysis deals with only one variable. It is otherwise known as single-variable analysis.
❑ It doesn't deal with causes or relationships (unlike regression)
❑ Its major purpose is to describe data (Descriptive statistics)
❑ It takes data, summarizes that data and finds patterns in the data.
❑ Methods to describe patterns found in univariate data include:
➢ Central Tendency (mean, mode and median)
➢ Dispersion: range, variance, maximum, minimum, quartiles (including the interquartile range)
➢ Standard Deviation.

❑ Graphs or charts for this purpose include:


❖ Frequency Distribution Tables.
❖ Bar Charts.
❖ Histograms.
❖ Frequency Polygons.

❖ Pie Charts.
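As a small illustration of the descriptive measures listed above, the following sketch uses Python's standard `statistics` module on a hypothetical single variable. The data values are made up for illustration.

```python
from statistics import mean, median, mode, quantiles, stdev, variance

# Hypothetical observations of one variable (not from the slides).
values = [4, 8, 6, 5, 3, 8, 9, 5, 8]

summary = {
    "mean": mean(values),
    "median": median(values),
    "mode": mode(values),
    "minimum": min(values),
    "maximum": max(values),
    "range": max(values) - min(values),
    "variance": variance(values),       # sample variance
    "standard deviation": stdev(values),
}

# Quartiles; the interquartile range is Q3 - Q1.
q1, q2, q3 = quantiles(values, n=4)
summary["interquartile range"] = q3 - q1

for name, value in summary.items():
    print(f"{name}: {value}")
```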
Bivariate Analysis
❑ Identifies the relationship between 2 variables

❑ It is a form of statistical analysis, used to find out if there is a relationship between two sets of values.

❑ It usually involves the variables X and Y.

❑ For example, the association between smoking and lung cancer.

❑ Another example could be, if you want to find out the relationship between caloric intake and weight.

❑ Caloric intake would be your independent variable, X and weight would be your dependent variable, Y.

❑ With bivariate analysis, there is a Y value for each X.

❑ Let’s say you had a caloric intake of 3,000 calories per day and a weight of 300lbs.

❑ You would write that with the x-variable followed by the y-variable: (3000,300).

❑ (X,Y)=(100,56),(23,84),(398,63),(56,42)

Common types of bivariate analysis include:


❖ Scatter plots
❖ Regression Analysis

❖ Correlation Coefficients
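To make the correlation coefficient concrete, here is a minimal sketch that computes Pearson's r by hand for the (X, Y) pairs listed above. The function name `pearson_r` is illustrative.

```python
from math import sqrt
from statistics import mean

# The (X, Y) pairs listed on the slide.
pairs = [(100, 56), (23, 84), (398, 63), (56, 42)]


def pearson_r(points):
    """Pearson correlation coefficient for a list of (x, y) pairs."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)


r = pearson_r(pairs)
print(f"r = {r:.4f}")
```

A value of r close to 0, as for these four pairs, indicates little or no linear relationship. Values near +1 or -1 indicate a strong one.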
Multivariate Analysis
❑ Identifies the relationship between more than 2 variables

❑ For example, the association between alcohol consumption, smoking and lung cancer.

❑ Multivariate analysis is used to study more complex sets of data than
what univariate analysis methods can handle.

❑ This type of analysis is mostly performed with software (e.g. SPSS or
SAS), as working with even the smallest of data sets can be
overwhelming by hand.

❑ Multivariate analysis methods include: additive trees, multidimensional
scaling, cluster analysis, principal component analysis, etc.

❑ Which method you choose depends upon the type of data you have
and what your goals are.
Methods of Data Presentation
Data Tabulation
❑ Tabulation is the first step before data is used for analysis.
❑ Tabulation can be in the form of simple tables or a frequency
distribution table (i.e., data is split into convenient groups).

A good table should include the following parts:
❑ Table number and title: These are placed above the table. The title
is usually written right after the table number.
❑ Caption subhead: This refers to columns and rows.
❑ Body: It contains all the data under each subhead.
❑ Source: It indicates whether the data is secondary, in which case it
should be acknowledged.

[Figure: layout of a simple frequency distribution table (S.F.D.T.), showing the title of the table, a column for the name of the variable (units of variable) with its categories, Frequency and % columns, a Total row, and a Source line.]
Example 1
❑ Table 1: Distribution of 50 patients at the hospital in
October 2022 according to their ABO blood groups

Blood group Frequency %
A 12 24
B 18 36
AB 5 10
O 15 30
Total 50 100
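The percentage column of such a table is each frequency divided by the total, times 100. A minimal sketch using the Table 1 counts follows. The function name `with_percentages` is illustrative.

```python
# Frequencies from Table 1 (ABO blood groups of 50 patients).
frequencies = {"A": 12, "B": 18, "AB": 5, "O": 15}


def with_percentages(freq):
    """Map each category to (frequency, percentage of the total)."""
    total = sum(freq.values())
    return {group: (n, 100 * n / total) for group, n in freq.items()}


table = with_percentages(frequencies)
for group, (n, pct) in table.items():
    print(f"{group:<6}{n:>4}{pct:>7.0f}")
print(f"{'Total':<6}{sum(frequencies.values()):>4}{100:>7}")
```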
Example 2
❑ Table 2: Distribution of 50 patients at the hospital in
February 2022 according to their age.

Age (years) Frequency %

<20 12 24
20-39 18 36
40-49 5 10
50+ 15 30

Total 50 100
Example 3: Complex Frequency Distribution Table

❑ Table 3: Distribution of 20 lung cancer patients at the hospital
and 40 controls in January 2022 according to smoking.

              Lung cancer
Smoking       Cases          Control        Total
              No.    %       No.    %       No.    %
Smoker        15     75      8      20      23     38.33
Non-smoker    5      25      32     80      37     61.67
Total         20     100     40     100     60     100


Graphical Presentation
❑ A graph or chart portrays the visual presentation of data

using symbols such as lines, dots, bars or slices.

❑ It depicts the trend of a certain set of measurements or

shows comparison between two or more sets of data or

quantities.

Quantitative data:
◼ Histogram
◼ Frequency Polygon
◼ Scatter Diagram
◼ Line graph

Qualitative data:
▪ Bar Chart
▪ Pie Chart
Nominal Data Presentation
It is all the same information (based on the same data),
just presented by different methods.
Summary Statistics (Mathematical Presentation)
❑ Summary statistics summarize and provide information about your
sample data.
❑ They tell you something about the values in your data set.
❑ This includes where the mean lies and whether your data is skewed.
They fall into three main categories:
❑ Measures of location (also called central tendency)
❖ Mean (also called the arithmetic mean or average)
❖ Geometric mean (used for interest rates and other types of growth)
❖ Trimmed mean (the mean with outliers excluded; an outlier is a piece of data that is
an abnormal distance from other points)
❖ Median (the middle of a dataset).
❑ Measures of spread.
❖ Range (how spread out your data is).
❖ Interquartile range (where the “middle fifty” percent of your data is).
❖ Quartiles (boundaries for the lowest, middle and upper quarters of data).
❖ Skewness (does your data have mainly low, or mainly high values?).
❖ Kurtosis (a measure of how much data is in the tails).
❑ Graphs/charts.
❖ Histogram.
❖ Frequency Distribution Table.
❖ Box plot.
❖ Bar chart.
❖ Scatter plot.
❖ Pie chart.
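A few of the less familiar measures above can be sketched in Python. The data are hypothetical, the helper names `trimmed_mean` and `skewness` are illustrative, and the skewness shown is the simple moment-based version, which is one of several definitions in use.

```python
from statistics import geometric_mean, mean, pstdev


def trimmed_mean(values, cut=1):
    """Mean after dropping the `cut` smallest and `cut` largest values."""
    ordered = sorted(values)
    return mean(ordered[cut:len(ordered) - cut])


def skewness(values):
    """Moment-based skewness: the average cubed z-score."""
    m, s = mean(values), pstdev(values)
    return sum(((x - m) / s) ** 3 for x in values) / len(values)


# Hypothetical data with one extreme high value (an outlier).
data = [2, 4, 4, 5, 6, 7, 8, 100]
print("mean:", mean(data))
print("trimmed mean:", trimmed_mean(data))
print("skewness:", skewness(data))  # positive: long tail of high values

# Geometric mean of yearly growth factors (+10%, +20%, -10%).
growth_factors = [1.10, 1.20, 0.90]
print("average growth factor:", geometric_mean(growth_factors))
```

The trimmed mean is much less affected by the single extreme value than the ordinary mean, which is why it is listed as an outlier-resistant measure of location.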
Determining the Appropriate Statistical Test
What statistical test should I use for my data?

❑ Whichever statistical analysis method you choose, take careful note of its
potential drawbacks as well as its formula.

❑ There is no gold standard, and no right or wrong method to use.

❑ The choice depends on the type of data you have collected and on the
insights you hope to gain from the analysis.


Class Attendance
Please click on the link below to submit your class attendance.

https://forms.gle/SPizKfEhKFNGrbNh6
