
DATA PRESENTATION AND ANALYSIS
DR. FAHAD MAHMOOD
LECTURER
NATIONAL INSTITUTE OF PREVENTIVE AND SOCIAL MEDICINE (NIPSOM)
STEPS OF DATA ANALYSIS
• Data Collection: Guided by the requirements you’ve identified, it’s time to collect the data from your sources.
Sources include case studies, surveys, interviews, questionnaires, direct observation, and focus groups. Make
sure to organize the collected data for analysis.
• Data Cleaning: Not all of the data you collect will be useful, so it’s time to clean it up. This process is where you
remove white spaces, duplicate records, and basic errors. Data cleaning is mandatory before sending the
information on for analysis.
• Data Analysis: Here is where you use data analysis software and other tools to help you interpret and
understand the data and arrive at conclusions. Data analysis tools include Excel, Python, R, Looker, RapidMiner,
Chartio, Metabase, Redash, and Microsoft Power BI.
• Data Interpretation: Now that you have your results, you need to interpret them and come up with the best
courses of action, based on your findings.
• Data Visualization: Data visualization is a fancy way of saying, “graphically show your information in a way that
people can read and understand it.” You can use charts, graphs, maps, bullet points, or a host of other methods.
Visualization helps you derive valuable insights by helping you compare datasets and observe relationships.
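The data cleaning step above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the helper `clean_records` and the survey rows are hypothetical, and real cleaning usually involves many more checks than trimming white space and dropping duplicates.

```python
def clean_records(records):
    """Strip surrounding white space from every field and drop duplicate records."""
    seen = set()
    cleaned = []
    for record in records:
        # Trim leading and trailing white space from each field.
        normalized = tuple(field.strip() for field in record)
        # Keep only the first occurrence of each normalized record.
        if normalized not in seen:
            seen.add(normalized)
            cleaned.append(normalized)
    return cleaned


# Hypothetical survey rows: (respondent id, answer)
raw_rows = [("001", " yes"), ("002", "no "), ("001", "yes"), ("003", "yes")]
cleaned_rows = clean_records(raw_rows)
print(cleaned_rows)
```

After trimming, the third row becomes identical to the first and is dropped as a duplicate.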
DATA ANALYSIS

• Data analysis is a process of finding, collecting, cleaning, examining, and modeling data to derive useful
information and insights and understand the derived information for data-driven decision-making.
• Different statistics and methods are used to describe the characteristics of the members of a sample or
population, to explore the relationships between variables, to test research hypotheses, and to represent
data visually. Here we will discuss:
➢ Descriptive Statistics
➢ Tests of Significance
➢ Graphical/Pictorial Methods
DESCRIPTIVE STATISTICS

Descriptive statistics can be useful for two purposes:


• To provide basic information about the characteristics of a sample or population. These characteristics
are represented by variables in a research study dataset.
• To highlight potential relationships between these characteristics, or the relationships among the
variables in the dataset.
• The four most common descriptive statistics are:
➢ Proportions, Percentages and Ratios
➢ Measures of Central Tendency
➢ Measures of Dispersion
➢ Measures of Association
PROPORTIONS, PERCENTAGES AND RATIOS

• In research, variables with discrete, qualitative categories are called nominal or categorical variables.
The categories can be given numerical codes, but they cannot be ranked, added, or multiplied.
• Examples of nominal variables include gender (male, female), preschool program attendance (yes, no),
and race/ethnicity (White, African American, Hispanic, Asian, American Indian). Researchers calculate
proportions, percentages and ratios in order to summarize the data from nominal or categorical
variables and to allow for comparisons to be made between groups.
PROPORTIONS, PERCENTAGES AND RATIOS (CONT.)

• Proportion—The number of cases in a category divided by the total number of cases across all categories of a
variable.
• Percentage—The proportion multiplied by 100 (or the number of cases in a category divided by the total
number of cases across all categories of a variable, times 100).
• Ratio—The number of cases in one category to the number of cases in a second category.
Example:
• A researcher selects a sample of 100 students from a Head Start program. The sample includes 20 White
children, 30 African American children, 40 Asian children and 10 children of mixed-race/ethnicity.
• Proportion of Asian children in the program = 40 / (20+30+40+10) = .40.
• Percentage of Asian children in the program = .40 x 100 = 40%.
• Ratio of Asian children to White children in the program = 40/20 = 2.0, or the ratio of Asian to White children
enrolled in the Head Start program is 2 to 1.
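The Head Start example can be checked with a short Python sketch. The counts come from the example above; the variable names are illustrative only.

```python
# Counts from the Head Start example above.
counts = {"White": 20, "African American": 30, "Asian": 40, "Mixed": 10}
total = sum(counts.values())

# Proportion: cases in one category divided by all cases.
proportion_asian = counts["Asian"] / total
# Percentage: the same fraction expressed out of 100.
percentage_asian = 100 * counts["Asian"] / total
# Ratio: cases in one category relative to cases in another category.
ratio_asian_to_white = counts["Asian"] / counts["White"]

print(proportion_asian, percentage_asian, ratio_asian_to_white)
```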
MEASURES OF CENTRAL TENDENCY

• Proportions, percentages and ratios are used to summarize the characteristics of a sample or population that fall into discrete
categories. Measures of central tendency are the most basic and, often, the most informative description of a population's
characteristics, when those characteristics are measured using an interval scale.
• Measures of central tendency describe the "average" member of the sample or population of interest. There are three measures of
central tendency:
• Mean—The arithmetic average of the values of a variable. To calculate the mean, all the values of a variable are summed and divided
by the total number of cases.
• Median—The value within a set of values that divides the values in half (i.e. 50% of the variable's values lie above the median, and
50% lie below the median).
• Mode—The value of a variable that occurs most often.
Example:
• The annual incomes of five randomly selected people in the United States are $10,000, $10,000, $45,000, $60,000, and $1,000,000.
• Mean Income = (10,000 + 10,000 + 45,000 + 60,000 + 1,000,000) / 5 = $225,000.
• Median Income = $45,000.
• Modal Income = $10,000.
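The income example can be reproduced with Python's standard `statistics` module, as a minimal sketch. It also shows why the median is often preferred for skewed data: the single $1,000,000 income pulls the mean far above what most of the five people earn.

```python
import statistics

# Annual incomes from the example above.
incomes = [10_000, 10_000, 45_000, 60_000, 1_000_000]

mean_income = statistics.mean(incomes)      # sum of values divided by number of cases
median_income = statistics.median(incomes)  # middle value of the sorted list
modal_income = statistics.mode(incomes)     # most frequently occurring value

print(mean_income, median_income, modal_income)
```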
MEASURES OF DISPERSION
• Measures of dispersion provide information about the spread of a variable's values. There are three key measures of
dispersion:
• Range is simply the difference between the smallest and largest values in the data. Researchers often report simply the
values of the range (e.g., 75 – 100).
• Variance is a commonly used measure of dispersion, or how spread out a set of values is around the mean. It is
calculated by taking the average of the squared differences between each value and the mean. The variance is the
standard deviation squared.
• Standard deviation, like variance, is a measure of the spread of a set of values around the mean of the values. The wider
the spread, the greater the standard deviation and the greater the range of the values from their mean. A small standard
deviation indicates that most of the values are close to the mean. A large standard deviation, on the other hand, indicates
that the values are more spread out.
Example:
• Five randomly selected children were administered a standardized reading assessment. Their scores on the assessment
were 50, 50, 60, 75 and 90, with a mean score of 65.
• Range = 90 - 50 = 40.
• Variance = [(50 - 65)² + (50 - 65)² + (60 - 65)² + (75 - 65)² + (90 - 65)²] / 5 = 1,200 / 5 = 240. Dividing by n - 1 = 4
instead, as is usual when estimating from a sample, gives 300.
• Standard deviation = √240 ≈ 15.5 (or √300 ≈ 17.3 with the n - 1 divisor).
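The reading-score example can be checked with Python's standard `statistics` module, as a minimal sketch. In that module, `pvariance` and `pstdev` divide by n (the "average of squared differences" definition), while `variance` and `stdev` divide by n - 1, the usual choice when the data are a sample. The two divisors give different results for these five scores.

```python
import statistics

# Reading assessment scores from the example above.
scores = [50, 50, 60, 75, 90]

score_range = max(scores) - min(scores)

# Divide the sum of squared deviations (1,200) by n = 5.
population_variance = statistics.pvariance(scores)
population_sd = statistics.pstdev(scores)

# Divide the same sum by n - 1 = 4.
sample_variance = statistics.variance(scores)
sample_sd = statistics.stdev(scores)

print(score_range, population_variance, sample_variance)
print(round(population_sd, 2), round(sample_sd, 2))
```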
STANDARD DEVIATION

You collect data on job satisfaction ratings from three groups of employees using simple random sampling.
The mean (M) ratings are the same for each group; the mean is the value on the x-axis when the curve is at its peak.
However, their standard deviations (SD) differ from each other.

The standard deviation reflects the dispersion of the distribution. The curve with the lowest standard deviation has
a high peak and a small spread, while the curve with the highest standard deviation is flatter and more widely spread.
MEASURES OF ASSOCIATION

Measures of association indicate whether two variables are related. Two measures are commonly used:
• Chi-square test of independence
• Correlation
CHI-SQUARE TEST

• Chi-Square test of independence is used to evaluate whether there is an association between two
variables.
• It is most often used with nominal data (i.e., data that are put into discrete categories: e.g., gender
[male, female] and type of job [unskilled, semi-skilled, skilled]) to determine whether they are
associated. However, it can also be used with ordinal data.
• Assumes that the samples being compared (e.g., males, females) are independent.
• Tests the null hypothesis of no association between the two variables (i.e., type of job is not related to
gender).
CHI-SQUARE TEST (CONT.)

To test for associations, a chi-square is calculated in the following way:


• Suppose a researcher wants to know whether there is a relationship between gender and two types of
jobs, construction worker and administrative assistant.
• To perform a chi-square test, the researcher counts the number of female administrative assistants, the
number of female construction workers, the number of male administrative assistants, and the number
of male construction workers in the data.
• These counts are compared with the number that would be expected in each category if there were no
association between job type and gender (for each cell, the expected count is the row total multiplied by the
column total, divided by the overall total). The association between the two variables is determined to be
significant (the null hypothesis is rejected) if the value of the chi-square statistic is greater than or equal to
the critical value found in a chi-square table for the chosen significance level (typically .05) and the degrees
of freedom associated with the test.
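The calculation described above can be sketched in plain Python. The counts below are hypothetical (the slides give none), and `chi_square_statistic` is an illustrative helper, not a library function. Each expected count is computed as row total × column total / grand total, and no continuity correction is applied. The critical value 3.841 is the standard chi-square table entry for 1 degree of freedom at the .05 level.

```python
def chi_square_statistic(observed):
    """Return the chi-square statistic and degrees of freedom for a table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Count expected in this cell if the two variables were unrelated.
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (obs - expected) ** 2 / expected

    dof = (len(observed) - 1) * (len(observed[0]) - 1)
    return chi2, dof


# Hypothetical counts. Rows: female, male.
# Columns: administrative assistant, construction worker.
observed = [[40, 10],
            [15, 35]]

chi2, dof = chi_square_statistic(observed)

CRITICAL_VALUE_DF1_ALPHA_05 = 3.841  # chi-square table, df = 1, significance level .05
reject_null = chi2 >= CRITICAL_VALUE_DF1_ALPHA_05

print(round(chi2, 2), dof, reject_null)
```

With these made-up counts the statistic is far above the critical value, so the null hypothesis of no association would be rejected.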
CORRELATION

• Correlation is the degree to which two variables are associated. Variables are positively correlated if they both tend
to increase at the same time.
• For example, height and weight are positively correlated because as height increases, weight also tends
to increase.
• Variables are negatively correlated if as one increases the other decreases.
• For example, number of police officers in a community and crime rates are negatively correlated
because as the number of police officers increases the crime rate tends to decrease.

Correlation Coefficient
• A measure of the degree to which two variables are related. A correlation coefficient is always between
-1 and +1. If the correlation coefficient is between 0 and +1, the variables are positively correlated.
If it is between 0 and -1, the variables are negatively correlated. A coefficient of 0 indicates no linear relationship.
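The slides do not say which correlation coefficient is meant. The sketch below assumes Pearson's product-moment coefficient, the most common choice for interval data. The helper `pearson_r` and the height and weight values are hypothetical and only illustrate the sign and range of the coefficient.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products of deviations from the means.
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations for each variable.
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / math.sqrt(ss_x * ss_y)


# Hypothetical measurements for five people.
heights_cm = [150, 160, 170, 180, 190]
weights_kg = [52, 58, 69, 77, 90]

r_height_weight = pearson_r(heights_cm, weights_kg)
r_perfect_negative = pearson_r([1, 2, 3], [3, 2, 1])

print(round(r_height_weight, 3), r_perfect_negative)
```

The first coefficient is positive because weight rises with height in the made-up data. The second pair moves in exactly opposite directions, which gives the lower bound of -1.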
TESTS OF SIGNIFICANCE

• Chi-Square test
• t-test
• Analysis of Variance (ANOVA)
TESTS OF SIGNIFICANCE (CONT.)

Widely used tests of statistical significance are described briefly below.


• Chi-Square test is used when testing for associations between categorical variables (e.g., differences in whether a
child has been diagnosed as having a cognitive disability by gender or race/ethnicity). It is also used as a goodness-
of-fit test to determine whether data from a sample come from a population with a specific distribution.
• T-tests are used when comparing the means of exactly two groups (e.g. the average heights of men and women).
• ANOVA tests are used when comparing the means of more than two groups (e.g. the average heights of children,
teenagers, and adults).
T-TEST

• A t-test is a statistical test that is used to compare the means of two groups. It is often used in
hypothesis testing to determine whether a process or treatment actually has an effect on the
population of interest, or whether two groups are different from one another.
• Example: You want to know whether the mean petal length of roses differs according to species. You
find two different species of roses growing in a garden and measure 25 petals of each species. You can
test the difference between these two groups using a t-test.
• The null hypothesis (H0) is that the true difference between these group means is zero.
• The alternate hypothesis (Ha) is that the true difference is different from zero.
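A two-group comparison like the petal example can be sketched in plain Python. The petal lengths below are hypothetical, and only five per species are used to keep the arithmetic easy to follow. The helper `pooled_t_statistic` assumes the classic equal-variance (pooled) form of the t-test, which the slides do not specify. The critical value 2.306 is the two-tailed t-table entry for 8 degrees of freedom at the .05 level.

```python
import math
import statistics


def pooled_t_statistic(sample_a, sample_b):
    """Return the equal-variance two-sample t statistic and its degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    var_a = statistics.variance(sample_a)  # sample variance (n - 1 divisor)
    var_b = statistics.variance(sample_b)
    dof = n_a + n_b - 2
    # Weighted average of the two sample variances.
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / dof
    # Standard error of the difference between the two means.
    std_error = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    t = (statistics.mean(sample_a) - statistics.mean(sample_b)) / std_error
    return t, dof


# Hypothetical petal lengths (cm) for two species.
species_a = [4.1, 4.3, 4.0, 4.4, 4.2]
species_b = [4.8, 5.0, 4.9, 5.1, 4.7]

t_stat, dof = pooled_t_statistic(species_a, species_b)

CRITICAL_T_DF8_ALPHA_05 = 2.306  # two-tailed t table, df = 8, significance level .05
reject_null = abs(t_stat) >= CRITICAL_T_DF8_ALPHA_05

print(round(t_stat, 2), dof, reject_null)
```

Here the absolute value of t is well above the critical value, so H0 (no difference between the group means) would be rejected for these made-up data.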
ONE-SAMPLE, TWO-SAMPLE, OR PAIRED T-TEST?

• When choosing a t-test, you will need to consider two things: whether the groups being compared
come from a single population or two different populations, and whether you want to test the
difference in a specific direction.
• If the groups come from a single population (e.g. measuring before and after an experimental
treatment), perform a paired t-test.
• If the groups come from two different populations (e.g. two different species, or people from two
separate cities), perform a two-sample t-test (a.k.a. independent t-test).
• If there is one group being compared against a standard value (e.g. comparing the acidity of a liquid to a
neutral pH of 7), perform a one-sample t-test.
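The one-sample and paired cases can be sketched in the same way. A paired t-test is equivalent to a one-sample t-test on the within-pair differences against a mean of zero. The helpers `one_sample_t` and `paired_t` and all data values below are hypothetical illustrations.

```python
import math
import statistics


def one_sample_t(sample, reference_value):
    """t statistic and degrees of freedom for a sample mean against a fixed value."""
    n = len(sample)
    std_error = statistics.stdev(sample) / math.sqrt(n)
    t = (statistics.mean(sample) - reference_value) / std_error
    return t, n - 1


def paired_t(before, after):
    """Paired t-test: a one-sample test on the differences against zero."""
    differences = [a - b for b, a in zip(before, after)]
    return one_sample_t(differences, 0.0)


# Hypothetical pH readings compared with a neutral pH of 7.
ph_readings = [6.8, 7.1, 6.9, 7.0, 6.7]
t_ph, dof_ph = one_sample_t(ph_readings, 7.0)

# Hypothetical exam scores for the same five students before and after a treatment.
scores_before = [60, 65, 70, 72, 68]
scores_after = [66, 70, 73, 80, 71]
t_paired, dof_paired = paired_t(scores_before, scores_after)

print(round(t_ph, 3), dof_ph)
print(round(t_paired, 3), dof_paired)
```

Each statistic would then be compared with the t-table critical value for its degrees of freedom, as with the two-sample test.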
Test | Predictor variable | Outcome variable | Research question example
Paired t-test | Categorical; 1 predictor | Quantitative; groups come from the same population | What is the effect of two different test prep programs on the average exam scores for students from the same class?
Independent t-test | Categorical; 1 predictor | Quantitative; groups come from different populations | What is the difference in average exam scores for students from two different schools?
ANOVA | Categorical; 1 or more predictors | Quantitative; 1 outcome | What is the difference in average pain levels among post-surgical patients given three different painkillers?
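For the ANOVA case, the F statistic of a one-way ANOVA can be sketched in plain Python. It compares the variation between group means with the variation within groups. The pain scores and the helper `one_way_anova_f` are hypothetical. The critical value 5.14 is the F-table entry for 2 and 6 degrees of freedom at the .05 level.

```python
def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic with between and within degrees of freedom."""
    all_values = [value for group in groups for value in group]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(group) / len(group) for group in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(group) * (mean - grand_mean) ** 2
        for group, mean in zip(groups, group_means)
    )
    # Variation of individual values around their own group mean.
    ss_within = sum(
        (value - mean) ** 2
        for group, mean in zip(groups, group_means)
        for value in group
    )

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical pain scores for patients given three different painkillers.
painkiller_a = [2, 3, 4]
painkiller_b = [4, 5, 6]
painkiller_c = [6, 7, 8]

f_stat, df_between, df_within = one_way_anova_f([painkiller_a, painkiller_b, painkiller_c])

CRITICAL_F_2_6_ALPHA_05 = 5.14  # F table, df = (2, 6), significance level .05
reject_null = f_stat >= CRITICAL_F_2_6_ALPHA_05

print(f_stat, df_between, df_within, reject_null)
```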
GRAPHICAL/PICTORIAL METHODS

There are several graphical and pictorial methods that enhance understanding of individual variables and
the relationships between variables. Graphical and pictorial methods provide a visual representation of the
data. Some of these methods include:
• Bar charts
• Pie charts
• Line graphs
• Scatter plots
• Geographical Information Systems (GIS)
• Sociograms
BAR CHART
PIE CHART
LINE GRAPH
