Data Presentation and Analysis
DR. FAHAD MAHMOOD
LECTURER
NATIONAL INSTITUTE OF PREVENTIVE AND SOCIAL MEDICINE (NIPSOM)
STEPS OF DATA ANALYSIS
• Data Collection: Guided by the requirements you’ve identified, it’s time to collect the data from your sources.
Sources include case studies, surveys, interviews, questionnaires, direct observation, and focus groups. Make
sure to organize the collected data for analysis.
• Data Cleaning: Not all of the data you collect will be useful, so it’s time to clean it up. This process is where you
remove white spaces, duplicate records, and basic errors. Data cleaning is mandatory before sending the
information on for analysis.
• Data Analysis: Here is where you use data analysis software and other tools to help you interpret and
understand the data and arrive at conclusions. Data analysis tools include Excel, Python, R, Looker, Rapid Miner,
Chartio, Metabase, Redash, and Microsoft Power BI.
• Data Interpretation: Now that you have your results, you need to interpret them and come up with the best
courses of action, based on your findings.
• Data Visualization: Data visualization is a fancy way of saying, “graphically show your information in a way that
people can read and understand it.” You can use charts, graphs, maps, bullet points, or a host of other methods.
Visualization helps you derive valuable insights by helping you compare datasets and observe relationships.
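As a rough illustration of the Data Cleaning step above, the sketch below trims white space and drops duplicate records using only the Python standard library. The function name `clean_records` and the sample records are hypothetical and are not part of the lecture.

```python
def clean_records(records):
    """Trim white space from string fields and drop exact duplicate records."""
    cleaned = []
    seen = set()
    for record in records:
        # Strip leading/trailing white space from every string value.
        tidy = {k: v.strip() if isinstance(v, str) else v for k, v in record.items()}
        # Two records are duplicates if all of their cleaned fields match.
        key = tuple(sorted(tidy.items()))
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(tidy)
    return cleaned


# Hypothetical survey records: the first two differ only by stray white space.
raw = [
    {"id": 1, "name": " Ayesha ", "district": "Dhaka"},
    {"id": 1, "name": "Ayesha", "district": "Dhaka "},
    {"id": 2, "name": "Rahim", "district": "Khulna"},
]
cleaned = clean_records(raw)
print(cleaned)
```

Trimming before comparing matters here: without it, the first two records would not be recognised as duplicates.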
DATA ANALYSIS
• Data analysis is a process of finding, collecting, cleaning, examining, and modeling data to derive useful
information and insights and understand the derived information for data-driven decision-making.
• Different statistics and methods are used to describe the characteristics of the members of a sample or
population, to explore the relationships between variables, to test research hypotheses, and to visually
represent data. Here we will discuss:
• Descriptive Statistics
• Tests of Significance
• Graphical/Pictorial Methods
DESCRIPTIVE STATISTICS
• In research, variables with discrete, qualitative categories are called nominal or categorical variables.
The categories can be given numerical codes, but they cannot be ranked, added, or multiplied.
• Examples of nominal variables include gender (male, female), preschool program attendance (yes, no),
and race/ethnicity (White, African American, Hispanic, Asian, American Indian). Researchers calculate
proportions, percentages and ratios in order to summarize the data from nominal or categorical
variables and to allow for comparisons to be made between groups.
PROPORTIONS, PERCENTAGES AND RATIOS
• Proportion: The number of cases in a category divided by the total number of cases across all categories of a
variable.
• Percentage: The proportion multiplied by 100 (or the number of cases in a category divided by the total
number of cases across all categories of a variable, multiplied by 100).
• Ratio: The number of cases in one category divided by the number of cases in a second category.
Example:
• A researcher selects a sample of 100 students from a Head Start program. The sample includes 20 White
children, 30 African American children, 40 Asian children and 10 children of mixed-race/ethnicity.
• Proportion of Asian children in the program = 40 / (20+30+40+10) = .40.
• Percentage of Asian children in the program = .40 x 100 = 40%.
• Ratio of Asian children to White children in the program = 40/20 = 2.0, or the ratio of Asian to White children
enrolled in the Head Start program is 2 to 1.
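The same Head Start example can be reproduced in a few lines of Python. This is only a sketch; the variable names are illustrative and the counts are the ones given in the example.

```python
# Category counts from the Head Start example.
counts = {
    "White": 20,
    "African American": 30,
    "Asian": 40,
    "Mixed race/ethnicity": 10,
}
total = sum(counts.values())  # 100 children in the sample

proportion_asian = counts["Asian"] / total          # cases in category / all cases
percentage_asian = 100 * counts["Asian"] / total    # proportion expressed per 100
ratio_asian_to_white = counts["Asian"] / counts["White"]  # one category vs. another

print(proportion_asian, percentage_asian, ratio_asian_to_white)
```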
MEASURES OF CENTRAL TENDENCY
• Proportions, percentages and ratios are used to summarize the characteristics of a sample or population that fall into discrete
categories. Measures of central tendency are the most basic and, often, the most informative description of a population's
characteristics, when those characteristics are measured using an interval scale.
• Measures of central tendency describe the "average" member of the sample or population of interest. There are three measures of
central tendency:
• Mean—The arithmetic average of the values of a variable. To calculate the mean, all the values of a variable are summed and divided
by the total number of cases.
• Median—The value within a set of values that divides the values in half (i.e. 50% of the variable's values lie above the median, and
50% lie below the median).
• Mode—The value of a variable that occurs most often.
Example:
• The annual incomes of five randomly selected people in the United States are $10,000, $10,000, $45,000, $60,000, and $1,000,000.
• Mean Income = (10,000 + 10,000 + 45,000 + 60,000 + 1,000,000) / 5 = $225,000.
• Median Income = $45,000.
• Modal Income = $10,000.
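The income example can be checked with Python's standard `statistics` module, as sketched below. Note how the single very large income pulls the mean far above the median.

```python
import statistics

# Annual incomes from the example above (US dollars).
incomes = [10_000, 10_000, 45_000, 60_000, 1_000_000]

mean_income = statistics.mean(incomes)      # sum of values / number of cases
median_income = statistics.median(incomes)  # middle value of the sorted list
modal_income = statistics.mode(incomes)     # most frequent value

print(mean_income, median_income, modal_income)
```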
MEASURES OF DISPERSION
• Measures of dispersion provide information about the spread of a variable's values. There are three key measures of
dispersion:
• Range is simply the difference between the smallest and largest values in the data. Researchers often report simply the
values of the range (e.g., 75 – 100).
• Variance is a commonly used measure of dispersion, or how spread out a set of values is around the mean. It is
calculated by taking the average of the squared differences between each value and the mean. The variance is the
standard deviation squared.
• Standard deviation, like variance, is a measure of the spread of a set of values around the mean of the values. The wider
the spread, the greater the standard deviation and the greater the range of the values from their mean. A small standard
deviation indicates that most of the values are close to the mean. A large standard deviation, on the other hand, indicates
that the values are more spread out.
Example:
• Five randomly selected children were administered a standardized reading assessment. Their scores on the
assessment were 50, 50, 60, 75 and 90, with a mean score of 65.
• Range = 90 - 50 = 40.
• Variance = [(50 - 65)² + (50 - 65)² + (60 - 65)² + (75 - 65)² + (90 - 65)²] / 5 = 1200 / 5 = 240.
(Dividing by n - 1 = 4 instead of n = 5, as is usual when estimating from a sample, gives 1200 / 4 = 300.)
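The reading-score example can be worked through with the standard `statistics` module. The sketch below shows both conventions for the variance: the sum of squared deviations from the mean is 1200, so dividing by N = 5 gives 240 and dividing by N - 1 = 4 gives 300. Which convention the lecture intends is an assumption the reader should check.

```python
import statistics

# Reading assessment scores from the example above.
scores = [50, 50, 60, 75, 90]

score_range = max(scores) - min(scores)             # 90 - 50 = 40
population_variance = statistics.pvariance(scores)  # 1200 / 5 = 240 (divide by N)
sample_variance = statistics.variance(scores)       # 1200 / 4 = 300 (divide by N - 1)

# The standard deviation is the square root of the matching variance.
population_sd = statistics.pstdev(scores)
sample_sd = statistics.stdev(scores)

print(score_range, population_variance, sample_variance)
print(round(population_sd, 2), round(sample_sd, 2))
```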
STANDARD DEVIATION
• The standard deviation is the square root of the variance.
MEASURES OF ASSOCIATION
Measures of association indicate whether two variables are related. Two measures are commonly used:
• Chi-square test of independence
• Correlation
CHI-SQUARE TEST
• Chi-Square test of independence is used to evaluate whether there is an association between two
variables.
• It is most often used with nominal data (i.e., data that are put into discrete categories: e.g., gender
[male, female] and type of job [unskilled, semi-skilled, skilled]) to determine whether they are
associated. However, it can also be used with ordinal data.
• Assumes that the samples being compared (e.g., males, females) are independent.
• Tests the null hypothesis of no difference between the two variables (i.e., type of job is not related to
gender).
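To make the test concrete, here is a minimal sketch of how the chi-square statistic is computed from a table of observed counts. The counts below are invented for illustration (gender by type of job) and do not come from any study. The statistic sums (observed - expected)² / expected over all cells, where each expected count is row total × column total / grand total.

```python
def chi_square_independence(observed):
    """Return the chi-square statistic and degrees of freedom for a table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Count expected in this cell if the two variables were unrelated.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (obs - expected) ** 2 / expected

    dof = (len(observed) - 1) * (len(observed[0]) - 1)
    return statistic, dof


# Hypothetical counts. Rows: male, female. Columns: unskilled, semi-skilled, skilled.
observed = [
    [20, 30, 50],
    [30, 30, 40],
]
chi2, dof = chi_square_independence(observed)
print(round(chi2, 3), dof)
```

With 2 degrees of freedom the 5% critical value is about 5.99, so a statistic below that would not lead to rejecting the null hypothesis for these made-up counts.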
CORRELATION
• Correlation is the degree to which two variables are associated. Variables are positively correlated if they both tend
to increase at the same time.
• For example, height and weight are positively correlated because as height increases weight also tends
to increase.
• Variables are negatively correlated if as one increases the other decreases.
• For example, number of police officers in a community and crime rates are negatively correlated
because as the number of police officers increases the crime rate tends to decrease.
CORRELATION COEFFICIENT
• A measure of the degree to which two variables are related. A correlation coefficient is always between
-1 and +1. If the correlation coefficient is between 0 and +1 then the variables are positively correlated.
If the correlation coefficient is between 0 and -1 then the variables are negatively correlated.
A coefficient of 0 indicates no linear relationship, and values closer to -1 or +1 indicate a stronger one.
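The lecture does not name a specific coefficient. The sketch below assumes Pearson's correlation coefficient, the most common choice for interval data, and uses invented height and weight values purely for illustration.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of products of deviations, and sums of squared deviations.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: heights in cm and weights in kg for five people.
heights = [150, 160, 170, 180, 190]
weights = [50, 58, 65, 72, 85]

r = pearson_r(heights, weights)
print(round(r, 3))  # positive, since weight rises with height in this sample
```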
TESTS OF SIGNIFICANCE
• Chi-Square test
• t-test
• Analysis of Variance (ANOVA)
TESTS OF SIGNIFICANCE (CONT.)
• A t-test is a statistical test that is used to compare the means of two groups. It is often used in
hypothesis testing to determine whether a process or treatment actually has an effect on the
population of interest, or whether two groups are different from one another.
• Example: You want to know whether the mean petal length of roses differs between species. You
find two different species of roses growing in a garden and measure 25 petals of each species. You can
test the difference between these two groups using a t-test.
• The null hypothesis (H0) is that the true difference between these group means is zero.
• The alternate hypothesis (Ha) is that the true difference is different from zero.
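A minimal sketch of the two-group comparison follows. It assumes Welch's form of the t statistic, which does not require equal variances; the lecture does not say which form is intended. The petal lengths are invented, and only five per species are used instead of 25 to keep the example short.

```python
import math
import statistics


def welch_t(group_a, group_b):
    """t statistic for the difference between two independent group means (Welch's form)."""
    mean_diff = statistics.mean(group_a) - statistics.mean(group_b)
    # Standard error of the difference, using each group's sample variance.
    standard_error = math.sqrt(
        statistics.variance(group_a) / len(group_a)
        + statistics.variance(group_b) / len(group_b)
    )
    return mean_diff / standard_error


# Hypothetical petal lengths in cm for two rose species.
species_a = [4.1, 4.3, 3.9, 4.4, 4.2]
species_b = [5.0, 5.2, 4.8, 5.1, 4.9]

t_stat = welch_t(species_a, species_b)
print(round(t_stat, 2))
```

A t statistic far from zero is evidence against H0. In practice the statistic is compared against a t distribution to obtain a p-value, which statistical software does automatically.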
ONE-SAMPLE, TWO-SAMPLE, OR PAIRED T-TEST?
• When choosing a t-test, you will need to consider two things: whether the groups being compared
come from a single population or two different populations, and whether you want to test the
difference in a specific direction.
• If the groups come from a single population (e.g. measuring before and after an experimental
treatment), perform a paired t-test.
• If the groups come from two different populations (e.g. two different species, or people from two
separate cities), perform a two-sample t-test (a.k.a. independent t-test).
• If there is one group being compared against a standard value (e.g. comparing the acidity of a liquid to a
neutral pH of 7), perform a one-sample t-test.
Paired t-test
• Predictor variable: Categorical; 1 predictor
• Outcome variable: Quantitative; groups come from the same population
• Research question example: What is the effect of two different test prep programs on the average
exam scores for students from the same class?
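The paired and one-sample cases can be sketched with the same small helper, because a paired t-test is a one-sample t-test on the within-pair differences. The function names and all data values below are hypothetical.

```python
import math
import statistics


def one_sample_t(values, reference):
    """t statistic comparing the mean of `values` against a fixed reference value."""
    standard_error = statistics.stdev(values) / math.sqrt(len(values))
    return (statistics.mean(values) - reference) / standard_error


def paired_t(before, after):
    """t statistic for paired measurements: one-sample test on the differences vs. 0."""
    differences = [a - b for b, a in zip(before, after)]
    return one_sample_t(differences, 0)


# Hypothetical exam scores for the same five students before and after a program.
before = [60, 65, 70, 55, 80]
after = [66, 70, 73, 60, 86]
t_paired = paired_t(before, after)

# Hypothetical pH readings of a liquid, compared against a neutral pH of 7.
ph_readings = [6.8, 7.1, 6.9, 7.0, 6.7]
t_one_sample = one_sample_t(ph_readings, 7)

print(round(t_paired, 2), round(t_one_sample, 2))
```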
GRAPHICAL/PICTORIAL METHODS
There are several graphical and pictorial methods that enhance understanding of individual variables and
the relationships between variables by providing a visual representation of the data.
Some of these methods include:
• Bar charts
• Pie charts
• Line graphs
• Scatter plots
• Geographical Information Systems (GIS)
• Sociograms
[Example figures: bar chart, pie chart, line graph]