
Introduction to Statistics

Types of Data Analysis


Quantitative Methods
Testing theories using numbers

Qualitative Methods
Testing theories using language
Examples: magazine articles, interviews, conversations, newspapers, media broadcasts

The Research Process

Initial Observation
Find something that needs explaining
Observe the real world. Read other research.

Test the concept: collect data


Collect data to see whether your hunch is correct. To do this you need to define variables:
anything that can be measured and can differ across entities or over time.

The Research Process

Generating and Testing Theories


Theories
A hypothesized general principle or set of principles that explains known findings about a topic and from which new hypotheses can be generated.

Hypothesis
A prediction from a theory. E.g. the number of people turning up for a Big Brother audition who have narcissistic personality disorder will be higher than the general level in the population (1%).

Falsification
The act of disproving a theory or hypothesis.

TV Show
Big Brother is a reality television franchise created by John de Mol. The premise of the show is that a group of people live together in a large house, isolated from the outside world. They are continuously watched by television cameras. Each series lasts for around three months, with 12 to 16 contestants entering the house. In order to win the final cash prize, all the contestants have to do is survive periodic evictions and be the last one standing. The idea for Big Brother is said to have come during a brainstorm session at the Dutch-based international television production firm Endemol on March 10, 1997. The first-ever version of Big Brother was broadcast in 1999 on Veronica in the Netherlands. Since then the format has become a worldwide TV franchise, airing in many countries in varying adaptations.

The Research Process


Why are there so many contestants with unpleasant personalities (narcissistic personality disorder, i.e. ND)? We need some data to see whether this observation is true. Measure it using the variable "personality of the contestant". It was found that 75% had this disorder; this supports the observation.
Theory 1: People with narcissistic disorder (ND) are more likely to audition for Big Brother than those without.
Theory 2: Producers of Big Brother are more likely to select people with higher ND than those with lower ND.
A prediction from a theory is known as a hypothesis. Two predictions come from the theories above:
H1: The people turning up for the audition have a higher rate of ND than the general public level, i.e. 1%.
H2: The Big Brother selection panel is more likely to choose people with higher ND.
Testing H1: clinical psychologists interview contestants at the time of audition to diagnose ND.
Testing H2: collect data on the successful participants after the audition.

narcissism (also narcism) n. 1. Excessive love or admiration of oneself. See synonyms at conceit. 2. A psychological condition characterized by self-preoccupation, lack of empathy, and unconscious deficits in self-esteem. 3. Pleasure derived from contemplation or admiration of one's own body or self, especially as a fixation on or a regression to an infantile stage of development. 4. The attribute of the human psyche characterized by admiration of oneself but within normal limits.

H1 is supported: (845 / 7662) × 100 = 11% of the people auditioning had ND, versus 1% in the general population.

H2: The panel had a bias towards higher ND


Out of 12 contestants, they selected 9 with higher ND.

Another observation: the contestants with high ND are convinced that the public will love them and that they will win.
H: People with high ND are convinced they will win. Testing the hypothesis: ask the question "Do you think you will win?" 7 of the 9 say yes, which confirms the observation. Another theory: these contestants think they will win because they do not realize they have a personality disorder (ND). Hypothesis: these people believe they are similar to others and do not have a different personality. Hypothesis testing: ask those who thought they would win whether their personalities were different from the norm. If all 7 said that their personality was different from the norm, this contradicts the theory; this is called falsification.

Other research carried out on Big Brother:
1. People with ND think they are more interesting than others.
2. They think that they deserve more success than others.
3. They think that others like them because they have special personalities.

A rival theory comes from another researcher: the problem is not that the ND contestants do not realize that they have a personality disorder, but that they falsely believe the ND is perceived positively by other people (i.e. they believe that their personality makes them likeable). The hypothesis from this theory: if higher-ND contestants are asked to evaluate what other people think of them, they will overestimate other people's perceptions.

The Research Process

Data Collection 1: What to Measure?


Hypothesis:
Coca-Cola kills sperm.

Independent Variable
The proposed cause. A predictor variable. A manipulated variable (in experiments). Coca-Cola in the hypothesis above.

Dependent Variable
The proposed effect. An outcome variable. Measured, not manipulated (in experiments). Sperm in the hypothesis above.

Levels of Measurement
Categorical (entities are divided into distinct categories):
Binary variable: There are only two categories
e.g. dead or alive.

Nominal variable: There are more than two categories


e.g. whether someone is an omnivore, vegetarian, vegan, or fruitarian.

Ordinal variable: The same as a nominal variable but the categories have a logical order
e.g. whether people got a fail, a pass, a merit or a distinction in their exam.

Continuous (entities get a distinct score):


Interval variable: Equal intervals on the variable represent equal differences in the property being measured
e.g. the difference between 6 and 8 is equivalent to the difference between 13 and 15.

Ratio variable: The same as an interval variable, but the ratios of scores on the scale must also make sense
e.g. a score of 16 on an anxiety scale means that the person is, in reality, twice as anxious as someone scoring 8.
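The four levels above can be sketched in a few lines of Python. This is an illustrative sketch only; the variable names and values are made up, not from the slides.

```python
# Binary: exactly two categories
alive = True

# Nominal: more than two unordered categories
diet = "vegan"  # one of {"omnivore", "vegetarian", "vegan", "fruitarian"}

# Ordinal: categories with a logical order
grades = ["fail", "pass", "merit", "distinction"]
rank = grades.index("merit")      # 2; higher index means a better grade

# Interval: equal intervals represent equal differences in the property
assert (8 - 6) == (15 - 13)       # both differences represent the same amount

# Ratio: ratios of scores are also meaningful
anxiety_a, anxiety_b = 16, 8
assert anxiety_a / anxiety_b == 2  # twice as anxious
```

Note that the ordinal ranks only encode order, not distance: the gap between "fail" and "pass" need not equal the gap between "merit" and "distinction".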

Measurement Error
Measurement error
The discrepancy between the actual value we're trying to measure and the number we use to represent that value.

Example:
You (in reality) weigh 80 kg. You stand on your bathroom scales and they say 83 kg. The measurement error is 3 kg.

Validity
Whether an instrument measures what it set out to measure.

Content validity
Evidence that the content of a test corresponds to the content of the construct it was designed to cover

Ecological validity
Evidence that the results of a study, experiment or test can be applied, and allow inferences, to real-world conditions.

Reliability
Reliability
The ability of the measure to produce the same results under the same conditions.

Test-Retest Reliability
The ability of a measure to produce consistent results when the same entities are tested at two different points in time.

Data Collection 2: How to Measure


Correlational research:
Observing what naturally goes on in the world without directly interfering with it.

Cross-sectional research:
This term implies that data come from people at different age points with different people representing each age point.

Experimental research:
One or more variables are systematically manipulated to see their effect (alone or in combination) on an outcome variable. Statements can be made about cause and effect.

Experimental Research Methods


Cause and Effect (Hume, 1748)
1. Cause and effect must occur close together in time (contiguity);
2. The cause must occur before an effect does;
3. The effect should never occur without the presence of the cause.

Confounding variables: the Tertium Quid


A variable (which we may or may not have measured) other than the predictor variables that potentially affects an outcome variable. E.g. the relationship between cancer and suicide is confounded by self-esteem.

Ruling out confounds (Mill, 1865)


An effect should be present when the cause is present, and when the cause is absent the effect should be absent also. Control conditions: the cause is absent.

Methods of Data Collection


Between-group / between-subject / independent designs
Different entities in experimental conditions

Repeated measures (within-subject)


The same entities take part in all experimental conditions. Economical, but prone to practice effects and fatigue.

Types of Variation
Systematic Variation
Differences in performance created by a specific experimental manipulation.

Unsystematic Variation
Differences in performance created by unknown factors.
Age, Gender, IQ, Time of day, Measurement error etc.

Randomization
Minimizes unsystematic variation.

Chimpanzee Experiment

Chimpanzees running the economy


Training phase 1: chimps use computers to change various parameters of the economy. As they set the parameters they get results on the screens indicating economic growth. This feedback is meaningless because they cannot read.

Training phase 2: the same, except that if economic growth is good, they get a banana. This feedback is meaningful.

Notes: If there is no experimental manipulation, we expect the chimps' behaviour to be the same in both phases, because external factors such as age, gender, IQ and motivation will be similar for both conditions. Chimps who score high in phase 1 will also score high in phase 2, and so on. However, performance will not be identical; this is known as unsystematic variation.

If we introduce an experimental manipulation (i.e. provide bananas as feedback in one of the sessions), then we do something different to the participants. The only difference between conditions 1 and 2 is the manipulation the experimenter has made (in this case, that the chimps get bananas as a positive reward in one condition but not in the other). Therefore, any difference between the means of the two conditions is probably due to the experimental manipulation. So, if the chimps perform better in one training phase than the other, this has to be because bananas were used to provide feedback in one training phase but not the other. Differences in performance created by a specific experimental manipulation are known as systematic variation.
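The distinction between the two kinds of variation can be illustrated with a small simulation. This is a hypothetical sketch, not part of the original notes: all the numbers (30 chimps, a fixed +10 boost from the banana reward, the noise levels) are invented for illustration.

```python
import random

random.seed(42)  # make the simulation reproducible

N = 30
# Individual differences (age, IQ, motivation...) give each chimp a baseline
baseline = [random.gauss(50, 10) for _ in range(N)]

# Phase 1: no meaningful feedback; scores are baseline plus random noise
phase1 = [b + random.gauss(0, 5) for b in baseline]

# Phase 2: bananas as feedback; the manipulation adds a fixed +10 to everyone
phase2 = [b + 10 + random.gauss(0, 5) for b in baseline]

mean1 = sum(phase1) / N
mean2 = sum(phase2) / N
print(f"phase 1 mean: {mean1:.1f}, phase 2 mean: {mean2:.1f}")
```

The scatter of scores within each phase is unsystematic variation; the difference between the two phase means mostly reflects the +10 manipulation, i.e. systematic variation.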

The Research Process

Analysing Data: Histograms


Frequency Distributions (aka Histograms)
A graph plotting values of observations on the horizontal axis, with a bar showing how many times each value occurred in the data set.

The Normal Distribution


Bell shaped Symmetrical around the centre

The Normal Distribution

Properties of Frequency Distributions


Skew
The symmetry of the distribution. Positive skew (scores bunched at low values with the tail pointing to high values). Negative skew (scores bunched at high values with the tail pointing to low values).

Kurtosis
The heaviness of the tails. Leptokurtic = heavy tails. Platykurtic = light tails.

Skew

Kurtosis

Central tendency: The Mode


Mode
The most frequent score

Bimodal
Having two modes

Multimodal
Having several modes

A Bimodal Distribution

Central Tendency: The Median


Median
The middle score when scores are ordered.

Example
Number of friends of 11 Facebook users.
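The same friends data are listed on the Range slide later in these notes; a short Python sketch finds the middle score:

```python
friends = [22, 40, 53, 57, 93, 98, 103, 108, 116, 121, 252]

scores = sorted(friends)      # order the scores first
n = len(scores)               # 11 scores, so the middle one is the 6th
median = scores[n // 2]       # 0-based index 5 = 6th value
print(median)  # → 98
```

With an even number of scores there is no single middle value; the usual convention is to average the two middle scores.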

Central Tendency: The Mean


Mean
The sum of scores divided by the number of scores: x̄ = (Σᵢ xᵢ) / n.

Example
Number of friends of 11 Facebook users: 22, 40, 53, 57, 93, 98, 103, 108, 116, 121, 252.

x̄ = (22 + 40 + 53 + 57 + 93 + 98 + 103 + 108 + 116 + 121 + 252) / 11 = 1063 / 11 = 96.64
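The calculation above is easy to check in Python (the friends data are taken from the slide):

```python
friends = [22, 40, 53, 57, 93, 98, 103, 108, 116, 121, 252]

mean = sum(friends) / len(friends)   # 1063 / 11
print(round(mean, 2))  # → 96.64
```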

The Dispersion: Range


The Range
The smallest score subtracted from the largest

Example
Number of friends of 11 Facebook users: 22, 40, 53, 57, 93, 98, 103, 108, 116, 121, 252. Range = 252 − 22 = 230. Very biased by outliers.
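The outlier sensitivity is easy to see in code: dropping the single extreme score (252) more than halves the range.

```python
friends = [22, 40, 53, 57, 93, 98, 103, 108, 116, 121, 252]

rng = max(friends) - min(friends)
print(rng)  # → 230

# Remove the outlier (252) and the range changes drastically
no_outlier = [x for x in friends if x != 252]
print(max(no_outlier) - min(no_outlier))  # → 99
```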

The Dispersion: The Interquartile range


Quartiles
The three values that split the sorted data into four equal parts. Second quartile = median. Lower quartile = median of the lower half of the data. Upper quartile = median of the upper half of the data.
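The rule above can be sketched in Python using the friends data from the Range slide. One assumption is made here: with an odd number of scores, the overall median is excluded from both halves (conventions vary on this point).

```python
friends = [22, 40, 53, 57, 93, 98, 103, 108, 116, 121, 252]

def median(xs):
    """Middle value of a sorted list; average of the two middle values if even."""
    xs = sorted(xs)
    n = len(xs)
    mid = n // 2
    return xs[mid] if n % 2 else (xs[mid - 1] + xs[mid]) / 2

scores = sorted(friends)
n = len(scores)
q2 = median(scores)              # second quartile = the median
lower = scores[: n // 2]         # lower half, excluding the median (odd n)
upper = scores[(n + 1) // 2:]    # upper half, excluding the median (odd n)
q1, q3 = median(lower), median(upper)
print(q1, q2, q3, q3 - q1)  # → 53 98 116 63
```

The interquartile range (q3 − q1 = 63 here) is far less affected by the outlier 252 than the full range is.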

Using Frequency distributions to go beyond data


Frequency distributions: how often scores actually occurred. Frequency distributions in terms of probability: how likely it is that a score would occur.

Number of suicides by age:
How likely is it that a 70-year-old commits suicide? Not likely: 3/172.
How likely is it that a 30-year-old commits suicide? More likely: 33/172.

We have talked about how frequency distributions can be used to get a rough idea of the probability of a score occurring. We can be more precise: for any distribution of scores we can calculate the probability of obtaining a score of a certain size (though doing so directly would be tedious and complex). Statisticians have defined common distributions known as probability distributions. From these it is possible to calculate the probability of getting particular scores, based on the frequencies with which a particular score occurs in a distribution with these common shapes. One of these distributions is the normal distribution. Statisticians have calculated the probability of certain scores occurring in a normal distribution with a mean of 0 and a standard deviation of 1, so we can use probability tables to find the likelihood of a score.

172 suicides in total

Various Types of Distributions


Look at the web page

Going beyond the data: Z-scores


Not all the data we collect have a mean of 0 and a standard deviation of 1. What to do? E.g. a data set with a mean of 567 and a standard deviation of 52.97.

We can convert any data set into one that has a mean of 0 and a standard deviation of 1. To centre the data around 0, subtract the mean from each score. Then divide the resulting score by the standard deviation to ensure that the data have a standard deviation of 1.
Look at the Example in the Excel File

Z-scores
Standardising a score with respect to the other scores in the group. Expresses a score in terms of how many standard deviations it is away from the mean. The distribution of z-scores has a mean of 0 and SD = 1.

z = (X − X̄) / s
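The standardisation steps above can be sketched in a few lines of Python. The three scores below are made up for illustration (chosen so the result is easy to verify by hand), not taken from the slides.

```python
data = [527, 567, 607]  # toy scores; mean 567, sample SD 40

mean = sum(data) / len(data)
# Sample standard deviation (divide by n - 1)
s = (sum((x - mean) ** 2 for x in data) / (len(data) - 1)) ** 0.5

# z-score: subtract the mean, then divide by the SD
z = [(x - mean) / s for x in data]
print([round(v, 2) for v in z])  # → [-1.0, 0.0, 1.0]
```

After the transformation the scores have mean 0 and standard deviation 1, so each z-score reads directly as "how many SDs from the mean".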

Standard Normal Distribution
Frequency distribution in terms of probability: the Beachy Head example
How likely is it that a 70-year-old commits suicide?
How likely is it that a 30-year-old commits suicide?

Number of suicides by age

Therefore, if we have a distribution similar to the normal, we can use the table of probabilities for the normal distribution. Example: the probability of a person aged 70 or more. The mean is 36 and the standard deviation is 13. z = (X − X̄) / s, giving a z-score of 2.62. Look up this value in the z table and you will find a value of 0.0044, or 0.44%. The normal distribution and z-scores thus allow us to go one step beyond our data: from a set of scores we can calculate the probability of obtaining particular values.
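The table lookup in the example can be reproduced with Python's standard library: the upper tail of the standard normal can be computed from the complementary error function, so no z table (or SciPy) is needed. The mean (36), SD (13) and age (70) are the values from the example above.

```python
import math

mean, sd = 36, 13     # age distribution from the Beachy Head example
age = 70

z = round((age - mean) / sd, 2)   # 2.62, as looked up in the z table
# P(Z >= z) for a standard normal, via the complementary error function
p = 0.5 * math.erfc(z / math.sqrt(2))
print(z, round(p, 4))  # → 2.62 0.0044
```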

Properties of z-scores
z-scores are important because their values cut off certain important percentages of the distribution. The first important value is 1.96, followed by 2.58 and 3.29. 1.96 cuts off the top 2.5% of the distribution; −1.96 cuts off the bottom 2.5%. As such, 95% of z-scores lie between −1.96 and 1.96; 99% lie between −2.58 and 2.58; and 99.9% lie between −3.29 and 3.29.
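These three cut-offs are easy to verify with the standard library: the proportion of a standard normal between −z and +z is erf(z/√2).

```python
import math

def coverage(z):
    """Proportion of a standard normal lying between -z and +z."""
    return math.erf(z / math.sqrt(2))

for z in (1.96, 2.58, 3.29):
    print(z, round(coverage(z), 3))
# → 1.96 0.95
#   2.58 0.99
#   3.29 0.999
```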


Fitting statistical models to the data
Scientific statements can be split into testable hypotheses.

The hypothesis or prediction that comes from your theory usually says that an effect will be present. This hypothesis is called the alternative hypothesis and is denoted by H1. (It is sometimes also called the experimental hypothesis.) There is another type of hypothesis, though, called the null hypothesis, denoted by H0. This hypothesis is the opposite of the alternative hypothesis and so would usually state that an effect is absent.

Fitting statistical models to the data


The Big Brother example Alternative hypothesis: Big Brother contestants will score higher on personality disorder questionnaires than members of the public. Null hypothesis: Big Brother contestants and members of the public will not differ in their scores on personality disorder questionnaires.
The reason we need the null hypothesis is that we cannot prove the experimental hypothesis using statistics, but we can reject the null hypothesis.

If our data give us confidence to reject the null hypothesis, then this provides support for our experimental hypothesis. However, be aware that even if we can reject the null hypothesis, this doesn't prove the experimental hypothesis; it merely supports it.

Rather than talking about accepting or rejecting a hypothesis (as some textbooks do), we should talk about the chances of obtaining the data we've collected assuming that the null hypothesis is true, i.e. what is the chance of obtaining our data if the Big Brother contestants and the general public are the same?

Fitting statistical Models to the data


In the Big Brother example, it was found that 75% (9/12) of them had a disorder. When we analyse our data, we are really asking, Assuming that contestants are no more likely to have personality disorders than members of the public, is it likely that 75% or more of the contestants would have personality disorders? Intuitively the answer is that the chances are very low: if the null hypothesis is true, then most contestants would not have personality disorders because they are relatively rare. Therefore, we are very unlikely to have got the data that we did if the null hypothesis were true.
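The intuition above can be made concrete with a rough calculation. As an illustration only (the notes do not do this calculation), assume contestants are selected independently and that under the null hypothesis each has exactly the 1% population rate of ND; the chance of seeing 9 or more of 12 with ND is then a binomial tail probability.

```python
import math

n, k, p = 12, 9, 0.01   # 12 contestants, 9 observed with ND, 1% rate under H0

# P(9 or more of the 12 have ND | null hypothesis true)
tail = sum(
    math.comb(n, j) * p**j * (1 - p) ** (n - j)
    for j in range(k, n + 1)
)
print(f"{tail:.2e}")   # astronomically small
```

A probability this small is exactly the "very unlikely to have got the data if the null hypothesis were true" from the text.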

Fitting statistical Models to the data


What if we found that only 1 contestant reported having a personality disorder (about 8%)? If the null hypothesis is true, and contestants are no different in personality to the general population, then only a small number of contestants would be expected to have a personality disorder. The chances of getting these data if the null hypothesis is true are, therefore, higher than before. When we collect data to test theories we have to work in these terms: we cannot talk about the null hypothesis being true or the experimental hypothesis being true, we can only talk in terms of the probability of obtaining a particular set of data if, hypothetically speaking, the null hypothesis was true.

