
IBM 524
Academic Research Methods and Ethics
Looking at Data

• INSTRUCTOR : Prof. Dr. Ufuk Türen

• E-MAIL : ufuk.turen@ostimteknik.edu.tr

• SCHEDULE : Friday - 18.00-20.50

• PLACE : Class no - 424

Course Objective

• The main purpose of this course is to examine the research process (problem identification, data
collection, data analysis and interpretation of results), to review certain scientific research methods
(experimental method, descriptive method, historical method, etc.), and to enable students to learn
literature research, data collection, data evaluation and report writing in practice. Statistics and
software packages (SPSS 25.0) required for data evaluation and report writing will also be used during
the course.

• This course covers the structure of science and scientific research, scientific methods and different
views on these methods, problem, research model, universe and sample, data collection and data
collection methods (quantitative and qualitative data collection techniques), data recording, analysis,
interpretation and reporting. It includes the explanation of research and writing techniques
accompanied by basic concepts related to the social sciences. This course will also discuss ethical
considerations related to conducting and reporting scientific research.

Course Content
WEEK 1      Introduction
WEEK 2-3    Introduction to Scientific Research Methods
WEEK 4      Qualitative Research Methods
WEEK 5      Hypothesis Development; Questionnaire Design
WEEK 6      The Concept of Measurement, Attitude Measurement and Attitude Scales
WEEK 7      Sampling Fundamentals
WEEK 8      Midterm Exam
WEEK 9      Introduction to SPSS
WEEK 10-11  SPSS-I
WEEK 12     SPSS-II
WEEK 13     Sample Analysis
WEEK 14     Ethics in Scientific Research; Research Report Preparation
WEEK 15     In-class presentations
WEEK 16     Final Exam


Grading

• Midterm Exam (30%)
• Final Exam (50%)
• Homework (20%)

Homework

• Problem-based homework – each week

Clinical Data Example

• 1. Kline et al. (2002)

– The researchers analyzed data from 934 emergency room patients with suspected pulmonary
embolism (PE). Only about 1 in 5 actually had PE. The researchers wanted to know what clinical
factors predicted PE.

– I will use four variables from their dataset today:

• Pulmonary embolism (yes/no)
• Age (years)
• Shock index = heart rate/systolic BP
• Shock index categories = take shock index and divide it into 10 groups (lowest to highest
shock index)
Descriptive Statistics

Types of Variables: Overview

Categorical
• Binary: 2 categories
• Nominal: more than 2 categories
• Ordinal: categories where order matters

Quantitative
• Discrete: numerical values
• Continuous: numerical values on an uninterrupted scale

Categorical Variables
Also known as “qualitative.” Values are categories, for example:

• Treatment groups
• Exposure groups
• Disease status

Categorical Variables
• Dichotomous (binary) – two levels

• Dead/alive
• Treatment/placebo
• Disease/no disease
• Exposed/Unexposed
• Heads/Tails
• Pulmonary Embolism (yes/no)
• Male/female

Categorical Variables

• Nominal variables – Named categories. Order doesn’t matter!

• The blood type of a patient (O, A, B, AB)
• Marital status
• Occupation

• Ordinal variable – Ordered categories. Order matters!

• Staging in breast cancer as I, II, III, or IV
• Birth order: 1st, 2nd, 3rd, etc.
• Letter grades (A, B, C, D, F)
• Ratings on a scale from 1-5
• Ratings on: always; usually; many times; once in a while; almost never; never
• Age in categories (10-20, 20-30, etc.)
• Shock index categories (Kline et al.)

Quantitative Variables
• Numerical variables; may be arithmetically
manipulated.

– Counts
– Time
– Age
– Height

Quantitative Variables
• Discrete Numbers – a limited set of distinct values, such as
whole numbers.

• Number of new AIDS cases in CA in a year (counts)


• Years of school completed
• The number of children in the family (cannot have a half a child!)
• The number of deaths in a defined time period (cannot have a partial death!)
• Roll of a die

Quantitative Variables
• Continuous Variables - Can take on any number within a
defined range.

• Time-to-event (survival time)


• Age
• Blood pressure
• Serum insulin
• Speed of a car
• Income
• Shock index (Kline et al.)

Looking at Data
• How are the data distributed?

– Where is the center?


– What is the range?
– What’s the shape of the distribution (e.g., Gaussian, binomial,
exponential, skewed)?

• Are there “outliers”?

• Are there data points that don’t make sense?

The first rule of statistics:
USE COMMON SENSE!

90% of the information is contained in the graph.

Frequency Plots (univariate)

Categorical variables
– Bar Chart

Continuous variables
– Box Plot
– Histogram

Bar Chart

• Used for categorical variables to show frequency or proportion in each category.

• Translate the data from frequency tables into a pictorial representation.
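The step from a frequency table to a bar chart can be sketched in a few lines of Python. This is only an illustration: the list `pe_status` is rebuilt from the PE counts quoted elsewhere in these slides (181 yes, 750 no), and the helper name `frequency_table` is made up for this example.

```python
from collections import Counter

# PE outcomes rebuilt from the counts quoted in the slides (181 yes, 750 no).
pe_status = ["yes"] * 181 + ["no"] * 750


def frequency_table(values):
    """Return {category: (count, proportion)} for a categorical variable."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: (n, n / total) for category, n in counts.items()}


table = frequency_table(pe_status)
for category, (count, proportion) in table.items():
    # One text "bar" per category; each '#' stands for about 2% of patients.
    bar = "#" * round(proportion * 50)
    print(f"{category:>3} {count:4d} {proportion:6.1%} {bar}")
```

The printed bars carry the same information as the table rows, which is all a bar chart does.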

Bar Chart: categorical variables
[Figure: bar chart of a categorical variable with two bars, NO and YES]

Bar Chart for SI categories
[Figure: bar chart; x-axis: Shock Index Category (1-10), y-axis: Number of Patients (0-200)]
Much easier to extract information from a bar chart than from a table!
Box plot and histograms: for continuous variables

To show the distribution (shape, center, range, variation) of continuous variables.

Box Plot: Shock Index
[Figure: box plot of SI; y-axis: Shock Index Units (0.0-2.0)]
• maximum (1.7); points beyond the upper whisker are outliers
• upper “whisker”: Q3 + 1.5 IQR = 0.8 + 1.5(0.25) = 1.175
• 75th percentile (0.8)
• median (0.66)
• 25th percentile (0.55)
• interquartile range (IQR) = 0.8 - 0.55 = 0.25
• lower “whisker”: minimum (or Q1 - 1.5 IQR)
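The whisker arithmetic on this box plot can be reproduced with a short sketch. It assumes only the quartiles read off the slide (Q1 = 0.55, Q3 = 0.8); the function name `boxplot_fences` is invented for the illustration.

```python
def boxplot_fences(q1, q3, k=1.5):
    """Return (IQR, lower fence, upper fence) using the k * IQR rule."""
    iqr = q3 - q1
    return iqr, q1 - k * iqr, q3 + k * iqr


# Quartiles of the shock index as read off the slide.
iqr, lower_fence, upper_fence = boxplot_fences(0.55, 0.8)
print(f"IQR = {iqr:.2f}, fences = ({lower_fence:.3f}, {upper_fence:.3f})")

# The maximum shown on the slide (1.7) lies above the upper fence,
# so it would be drawn as an outlier rather than as the whisker end.
print(1.7 > upper_fence)
```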
Histogram of SI
[Figure: histogram of SI; x-axis: SI (0.0-2.0), y-axis: Percent (0-25)]

Histogram
100 bins (too much detail)
[Figure: histogram of SI; x-axis: SI (0.0-2.0), y-axis: Percent (0-6)]

Histogram
2 bins (too little detail)
[Figure: histogram of SI; x-axis: SI (0.0-2.0), y-axis: Percent]
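How the number of bins changes a histogram can be tried out with a small sketch. The values in `si_values` are hypothetical numbers made up for this illustration (they are not Kline's data), and `histogram_counts` is an illustrative helper, not a library function.

```python
def histogram_counts(values, n_bins, low, high):
    """Count values in n_bins equal-width bins spanning [low, high]."""
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for v in values:
        # Clamp the index so that v == high falls into the last bin.
        index = min(int((v - low) / width), n_bins - 1)
        counts[index] += 1
    return counts


# Hypothetical shock index values, invented for illustration only.
si_values = [0.45, 0.52, 0.55, 0.58, 0.61, 0.63, 0.66, 0.66,
             0.70, 0.74, 0.80, 0.85, 0.95, 1.20, 1.70]

for n_bins in (2, 8, 100):
    counts = histogram_counts(si_values, n_bins, 0.0, 2.0)
    print(n_bins, "bins: tallest bar =", max(counts),
          "| empty bins =", counts.count(0))
```

With 2 bins almost everything lands in one bar; with 100 bins most bars are empty. A moderate number of bins shows the shape.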
Box Plot: Shock Index
[Figure: box plot of SI; y-axis: Shock Index Units (0.0-2.0)]

Box Plot: Age
[Figure: box plot of AGE; y-axis: Years (0-100); labeled minimum, 25th percentile, median, 75th percentile, maximum and interquartile range]
More symmetric

Histogram: Age
[Figure: histogram of AGE; x-axis: AGE (Years), 0-100; y-axis: Percent (0-14)]
Some histograms from your class
(n=24)
Starting with politics.

Feelings about math and writing

Optimism

Diet

Habits

Measures of central tendency

• Mean
• Median
• Mode

Central Tendency
• Mean – the average; the balancing point

Calculation: the sum of values divided by the sample size

In math shorthand: $\bar{X} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n}$

Mean: example
Some data:
Age of participants: 17 19 21 22 23 23 23 38

$\bar{X} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{17 + 19 + 21 + 22 + 23 + 23 + 23 + 38}{8} = 23.25$
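The same calculation as a minimal Python sketch, using the eight ages from the slide (the function name `mean` is simply a convenient label here):

```python
ages = [17, 19, 21, 22, 23, 23, 23, 38]


def mean(values):
    """Sum of the values divided by the sample size."""
    return sum(values) / len(values)


print(sum(ages), len(ages), mean(ages))
```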

Mean of age in Kline’s data
Means Section of AGE

Parameter   Mean       Median   Geometric Mean   Harmonic Mean   Sum     Mode
Value       50.19334   49       46.66865         43.00606        46730   49

[Figure: histogram of AGE (Years), 0-100; y-axis: Percent (0-14); the mean is marked as the balancing point]
Mean
• The mean is affected by extreme values (outliers)

Data 1, 2, 3, 4, 5: Mean = $\frac{1 + 2 + 3 + 4 + 5}{5} = \frac{15}{5} = 3$
Data 1, 2, 3, 4, 10: Mean = $\frac{1 + 2 + 3 + 4 + 10}{5} = \frac{20}{5} = 4$
Central Tendency
• Median – the exact middle value

Calculation:
• If there are an odd number of observations, find the middle value
• If there are an even number of observations, find the middle two
values and average them.
Median: example
Some data:
Age of participants: 17 19 21 22 23 23 23 38

Median = (22+23)/2 = 22.5
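The odd/even rule for the median can be written out directly. This is a minimal sketch; Python's standard library already offers `statistics.median`, and the hand-written version below is only meant to mirror the two cases on the slide.

```python
def median(values):
    """Middle value of the sorted data (average of the middle two if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


ages = [17, 19, 21, 22, 23, 23, 23, 38]
print(median(ages))
```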


Median of age in Kline’s data
Means Section of AGE

Parameter   Mean       Median   Geometric Mean   Harmonic Mean   Sum     Mode
Value       50.19334   49       46.66865         43.00606        46730   49

[Figure: histogram of AGE (Years), 0-100; y-axis: Percent (0-14); 50% of the mass lies on each side of the median]
Does PE have a median?
• Yes, if you line up the 0’s and 1’s, the middle number is 0.
Median
• The median is not affected by extreme values (outliers).

[Figure: two dot plots on a 0-10 axis, the second with an extreme value; Median = 3 in both]
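The contrast between the two measures can be checked with Python's standard `statistics` module. The two lists are the ones used in the mean example (1 to 5, and the same values with 5 replaced by 10).

```python
from statistics import mean, median

original = [1, 2, 3, 4, 5]
with_outlier = [1, 2, 3, 4, 10]

# The mean shifts when one value becomes extreme; the median does not.
print("mean:  ", mean(original), "->", mean(with_outlier))
print("median:", median(original), "->", median(with_outlier))
```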
Central Tendency
• Mode – the value that occurs most frequently
Mode: example
Some data:
Age of participants: 17 19 21 22 23 23 23 38

Mode = 23 (occurs 3 times)
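A short sketch of the same count, using `collections.Counter` and `statistics.multimode` from the standard library (`multimode` returns every value tied for most frequent, which matters when a data set has more than one mode):

```python
from collections import Counter
from statistics import multimode

ages = [17, 19, 21, 22, 23, 23, 23, 38]

value, occurrences = Counter(ages).most_common(1)[0]
print(value, occurrences)   # most frequent value and how often it occurs
print(multimode(ages))      # all modes, as a list
```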


Mode of age in Kline’s data

Means Section of AGE

Parameter   Mean       Median   Geometric Mean   Harmonic Mean   Sum     Mode
Value       50.19334   49       46.66865         43.00606        46730   49


Mode of PE?
• 0 appears more than 1, so 0 is the mode.
Measures of Variation/Dispersion
• Range
• Percentiles/quartiles
• Interquartile range
• Standard deviation/Variance
Range

• Difference between the largest and the smallest observations.

Range of age: 94 years - 15 years = 79 years
[Figure: histogram of AGE (Years), 0-100; y-axis: Percent (0-14)]
Range of PE?
• 1-0 = 1
Quartiles
[Figure: the ordered data split by Q1, Q2 and Q3 into four segments of 25% each]
◼ The first quartile, Q1, is the value for which 25% of the observations are smaller and 75%
are larger
◼ Q2 is the same as the median (50% are smaller, 50% are larger)
◼ Only 25% of the observations are greater than the third quartile, Q3
Interquartile Range

• Interquartile range = 3rd quartile – 1st quartile = Q3 – Q1

Interquartile Range: age

minimum   Q1   Median (Q2)   Q3   maximum
15        35   49            65   94

Each of the four intervals holds 25% of the observations.

Interquartile range = 65 – 35 = 30
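Quartiles and the IQR can be computed with `statistics.quantiles` from the Python standard library. This is a sketch on the small eight-person age sample from the earlier examples, since the full Kline data set is not reproduced in these slides. Note that software packages use slightly different interpolation rules for quartiles, so Q1 and Q3 from another tool may differ a little.

```python
from statistics import quantiles

ages = [17, 19, 21, 22, 23, 23, 23, 38]

# n=4 asks for the three cut points that split the data into quarters.
q1, q2, q3 = quantiles(ages, n=4)
iqr = q3 - q1
print(q1, q2, q3, iqr)

# For Kline's age data the slide gives the quartiles directly.
kline_iqr = 65 - 35
print(kline_iqr)
```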
Variance
• Average (roughly) of squared deviations of values from the mean

$S^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{X})^2}{n - 1}$
Why squared deviations?
• Adding deviations will yield a sum of 0.
• Absolute values are tricky!
• Squares eliminate the negatives.

• Result:
– Increasing contribution to the variance as you go farther from
the mean.
Standard Deviation

• Most commonly used measure of variation
• Shows variation about the mean
• Has the same units as the original data

$S = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{X})^2}{n - 1}}$
Calculation Example:
Sample Standard Deviation
Age data (n=8) : 17 19 21 22 23 23 23 38
n=8 Mean = $\bar{X}$ = 23.25

$S = \sqrt{\frac{(17 - 23.25)^2 + (19 - 23.25)^2 + \cdots + (38 - 23.25)^2}{8 - 1}} = \sqrt{\frac{281.5}{7}} \approx 6.3$
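The calculation can be retraced step by step in Python. This is a minimal sketch on the eight ages above; `sample_variance` is an illustrative helper, and `statistics.stdev` from the standard library is used only as a cross-check.

```python
import math
import statistics

ages = [17, 19, 21, 22, 23, 23, 23, 38]


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    m = sum(values) / len(values)
    squared_deviations = sum((x - m) ** 2 for x in values)
    return squared_deviations / (len(values) - 1)


variance = sample_variance(ages)
sd = math.sqrt(variance)
print("sum of squared deviations:", variance * (len(ages) - 1))
print("variance:", round(variance, 3), "| SD:", round(sd, 1))
print("library check:", round(statistics.stdev(ages), 1))
```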
Std. dev is a measure of the “average” scatter around the mean.

Estimation method: if the distribution is bell shaped, the range is around 6 SD, so here a
rough guess for SD is 79/6 ≈ 13.
[Figure: histogram of AGE (Years), 0-100; y-axis: Percent (0-14)]
Std. Deviation age

Variation Section of AGE

Parameter   Variance   Standard Deviation
Value       333.1884   18.25345
Std Dev of Shock Index

Std. dev is a measure of the “average” scatter around the mean.

Estimation method: if the distribution is bell shaped, the range is around 6 SD, so here a
rough guess for SD is 1.4/6 = 0.23.
[Figure: histogram of SI; x-axis: SI (0.0-2.0), y-axis: Count (0-250)]
Std. Deviation SI

Variation Section of SI

Parameter Variance Standard Deviation Std Error of Mean Interquartile Range

Value 4.155749E-02 0.2038566 6.681129E-03 0.2460432


1.430856
Std. Dev of binary variable, PE

$S = \sqrt{\frac{181 \times (1 - 0.1944)^2 + 750 \times (0 - 0.1944)^2}{931 - 1}} = \sqrt{\frac{145.8}{930}} = 0.3959$

Std. dev is a measure of the “average” scatter around the mean.
[Figure: histogram of PE; 80.56% at 0 and 19.44% at 1]
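For a 0/1 variable, the mean and standard deviation follow from the two counts alone. The sketch below uses the counts on the slide (181 ones, 750 zeros); `binary_summary` is a name made up for this illustration.

```python
import math


def binary_summary(n_ones, n_zeros):
    """Mean, sample variance and sample SD of a 0/1 variable from its counts."""
    n = n_ones + n_zeros
    p = n_ones / n  # the mean of a 0/1 variable is the proportion of ones
    sum_sq = n_ones * (1 - p) ** 2 + n_zeros * (0 - p) ** 2
    variance = sum_sq / (n - 1)
    return p, variance, math.sqrt(variance)


pe_mean, pe_variance, pe_sd = binary_summary(181, 750)
print(round(pe_mean, 4), round(pe_variance, 6), round(pe_sd, 4))
```

The mean of a binary variable is simply the proportion of ones, which is why it equals the 19.44% bar in the histogram.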
Std. Deviation PE

Variation Section of PE

Parameter   Variance   Standard Deviation
Value       0.156786   0.3959621


Comparing Standard Deviations
[Figure: three dot plots on a common axis from 11 to 21]
Data A: Mean = 15.5, S = 3.338
Data B: Mean = 15.5, S = 0.926
Data C: Mean = 15.5, S = 4.570
Slide from: Statistics for Managers Using Microsoft® Excel 4th Edition, 2004 Prentice-Hall
Bienaymé-Chebyshev Rule
• Regardless of how the data are distributed, a certain percentage of values must fall
within k standard deviations from the mean:

Note use of μ (mu) to represent “mean”.
Note use of σ (sigma) to represent “standard deviation.”

At least $1 - 1/1^2 = 0\%$ of values fall within k = 1 (μ ± 1σ)
At least $1 - 1/2^2 = 75\%$ of values fall within k = 2 (μ ± 2σ)
At least $1 - 1/3^2 \approx 89\%$ of values fall within k = 3 (μ ± 3σ)
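The three bounds come from the single expression 1 - 1/k², which a few lines of Python can tabulate. The function name `chebyshev_lower_bound` is chosen for this illustration only.

```python
def chebyshev_lower_bound(k):
    """Minimum fraction of values within k standard deviations of the mean."""
    if k <= 0:
        raise ValueError("k must be positive")
    # For k below 1 the formula goes negative, so the bound is just 0.
    return max(0.0, 1 - 1 / k ** 2)


for k in (1, 2, 3):
    print(f"k = {k}: at least {chebyshev_lower_bound(k):.0%}")
```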
Symbol Clarification
• S = Sample standard deviation (example of a “sample statistic”)
• σ = Standard deviation of the entire population (example of a “population parameter”) or
from a theoretical probability distribution
• $\bar{X}$ = Sample mean
• µ = Population or theoretical mean
**The beauty of the normal curve:

No matter what μ and σ are, the area between μ-σ and μ+σ is about 68%; the area between
μ-2σ and μ+2σ is about 95%; and the area between μ-3σ and μ+3σ is about 99.7%. Almost all
values fall within 3 standard deviations.
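These three areas can be verified numerically. For a normal distribution, the probability of falling within k standard deviations of the mean equals erf(k/√2), which does not depend on μ or σ; the sketch below uses `math.erf` from the standard library, and the function name `normal_coverage` is made up here.

```python
import math


def normal_coverage(k):
    """Area under a normal curve within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} SD: {normal_coverage(k):.1%}")
```

Comparing these figures with the Chebyshev bounds (0%, 75%, 89%) shows how much more the bell shape pins down than the distribution-free rule.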
68-95-99.7 Rule

68% of the
data

95% of the data

99.7% of the data


Summary of Symbols
• S² = Sample variance
• S = Sample standard dev.
• σ² = Population (true or theoretical) variance
• σ = Population standard dev.
• $\bar{X}$ = Sample mean
• µ = Population mean
• IQR = interquartile range (middle 50%)
What’s wrong with this
graph?

from: ER Tufte. The Visual Display of Quantitative Information. Graphics Press, Cheshire, Connecticut,
1983, p.69
Notice the X-axis
Correctly scaled X-axis…
Report of the Presidential Commission on the Space Shuttle Challenger Accident, 1986
(vol 1, p. 145)
The graph excludes the observations where no O-rings failed.
Smooth curve at least shows the trend toward failure at high and low temperatures…

◼ http://www.math.yorku.ca/SCS/Gallery/
Even better: graph all the data (including non-failures) using a logistic
regression model

Tappin, L. (1994). "Analyzing data relating to the Challenger disaster". Mathematics Teacher, 87, 423-426
What’s wrong with
this graph?

from: ER Tufte. The Visual Display of Quantitative Information. Graphics Press, Cheshire, Connecticut,
1983, p.74
What’s the message here?

Diagraphics II, 1994

From: Johnson R. Just the Essentials of Statistics. Duxbury Press, 1995.
For more examples…
• http://www.math.yorku.ca/SCS/Gallery/
“Lying” with statistics
• More accurately, misleading with statistics…
Example 1: projected statistics
Lifetime risk of melanoma:
1935: 1/1500
1960: 1/600
1985: 1/150
2000: 1/74
2006: 1/60

http://www.melanoma.org/mrf_facts.pdf
Example 1: projected statistics
• How do you think these statistics are calculated?

• How do we know what the lifetime risk of a person born in 2006 will be?
Example 1: projected statistics
Interestingly, a clever clinical researcher
recently went back and calculated (using
SEER data) the actual lifetime risk (or risk up
to 70 years) of melanoma for a person born in
1935.

The answer?
Closer to 1/150 (one order of magnitude off)

(Martin Weinstock of Brown University, AAD conference 2006)


Example 2: propagation of statistics

• In many papers and reviews of eating disorders in women athletes, authors cite
the statistic that 15 to 62% of female athletes have disordered eating.
• I’ve found that this statistic is attributed to about 50 different sources in the
literature and cited all over the place, with or without citations...
For example…
• In a recent review (Hobart and Smucker, The Female Athlete
Triad, American Family Physician, 2000):

• “Although the exact prevalence of the female athlete triad is unknown, studies have
reported disordered eating behavior in 15 to 62 percent of female college athletes.”

• No citations given.
And…
• Fact Sheet on eating disorders:

• “Among female athletes, the prevalence of eating disorders is reported to be between
15% and 62%.”
Citation given: Costin, Carolyn. (1999) The Eating Disorder Source Book: A comprehensive
guide to the causes, treatment, and prevention of eating disorders. 2nd edition. Lowell
House: Los Angeles.
And…
• From a Fact Sheet on disordered eating from a college
website:

• “Eating disorders are significantly higher (15 to 62 percent) in the athletic population
than the general population.”

• No citation given.
And…
• “Studies report between 15% and 62% of college
women engage in problematic weight control behaviors
(Berry & Howe, 2000).” (in The Sport Journal, 2004)

• Citation: Berry, T.R. & Howe, B.L. (2000, Sept). Risk factors for disordered eating in
female university athletes. Journal of Sport Behavior, 23(3), 207-219.
And…
• 1999 NY Times article

• “But informal surveys suggest that 15 percent to 62 percent of female athletes are
affected by disordered behavior that ranges from a preoccupation with losing weight to
anorexia or bulimia.”
And
• “It has been estimated that the prevalence of disordered
eating in female athletes ranges from 15% to 62%.” (in
Journal of General Internal Medicine 15 (8), 577-590.)

• Citations:
Steen SN. The competitive athlete. In: Rickert VI, ed. Adolescent Nutrition: Assessment
and Management. New York, NY: Chapman and Hall; 1996:223-47.
Tofler IR, Stryer BK, Micheli LJ. Physical and emotional problems of elite female
gymnasts. N Engl J Med. 1996;335:281-3.
Where did the statistics come
from?
• The 15%: Dummer GM, Rosen LW, Heusner WW, Roberts PJ, and Counsilman JE. Pathogenic
weight-control behaviors of young competitive swimmers. Physician Sportsmed 1987; 15: 75-84.

• The “to”: Rosen LW, McKeag DB, Hough DO, Curley VC. Pathogenic weight-control behaviors
in female athletes. Physician Sportsmed 1986; 14: 79-86.

• The 62%: Rosen LW, Hough DO. Pathogenic weight-control behaviors of female college
gymnasts. Physician Sportsmed 1988; 16: 140-146.
Where did the statistics come
from?
• Study design? Control group?
– Cross-sectional survey (all)
– No non-athlete control groups

• Population/sample size?
– Convenience samples
– Rosen et al. 1986: 182 varsity athletes from two midwestern universities
(basketball, field hockey, golf, running, swimming, gymnastics, volleyball,
etc.)
– Dummer et al. 1987: 486 9-18 year old swimmers at a swim camp
– Rosen et al. 1988: 42 college gymnasts from 5 teams at an athletic
conference
Where did the statistics come
from?
• Measurement?
– Instrument: Michigan State University Weight Control Survey
– Disordered eating = at least one pathogenic weight control behavior:
• Self-induced vomiting
• fasting
• Laxatives
• Diet pills
• Diuretics
• In the 1986 survey, they required use 1/month; in the 1988 survey, they required use
twice-weekly
• In the 1988 survey, they added fluid restriction
Where did the statistics come
from?
• Findings?
– Rosen et al. 1986: 32% used at least one “pathogenic weight-
control behavior” (ranges: 8% of 13 basketball players to 73.7%
of 19 gymnasts)
– Dummer et al. 1987: 15.4% of swimmers used at least one of
these behaviors
– Rosen et al. 1988: 62% of gymnasts used at least one of these
behaviors
References

• http://www.math.yorku.ca/SCS/Gallery/
• Kline et al. Annals of Emergency Medicine 2002; 39: 144-152.
• Statistics for Managers Using Microsoft® Excel 4th Edition, 2004 Prentice-Hall
• Tappin, L. (1994). "Analyzing data relating to the Challenger disaster".
Mathematics Teacher, 87, 423-426
• Tufte. The Visual Display of Quantitative Information. Graphics Press, Cheshire,
Connecticut, 1983.
• Wainer, H. Visual Revelations: Graphical Tales of Fate and Deception from Napoleon
Bonaparte to Ross Perot. 1997.
Mean of Pulmonary Embolism? (Binary variable?)

$\bar{X} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{181 \times 1 + 750 \times 0}{931} = \frac{181}{931} = 0.1944$

[Figure: histogram of PE; x-axis: PE (0 or 1), y-axis: Percent; 80.56% (750) at 0 and 19.44% (181) at 1]
THIRD-GENERATION, INNOVATIVE AND ENTREPRENEURIAL
UNIVERSITY MODEL

www.ostimteknik.edu.tr
