
1r Batxillerat Matemàtiques INS Pau Vila

One-dimensional and two-dimensional statistics

Contents

1 Types of data: Quantitative vs categorical variables
  1.1 Quantitative variables
  1.2 Categorical variables
2 Graphics
  2.1 Bar Graph
  2.2 Grouped Bar Graph
  2.3 Pictogram
  2.4 Line graph
  2.5 Stem and Leaf Plot
  2.6 Pie or Circle graph
  2.7 Histogram
  2.8 Frequency Polygons
  2.9 Box plot
  2.10 Scatter plot
3 Calculations and formulas
  3.1 Univariate statistics
  3.2 Bivariate statistics
4 Instruments for making calculations and graphs
5 Correlation and Causation
6 Statistics and problem solving
  6.1 Scientific method
  6.2 Populations and samples
  6.3 Conclusions and confidence
  6.4 Statistical calculators
7 Project



In statistical research, a variable is defined as an attribute of an object of study. Choosing which
variables to measure is central to good experimental design.

Example If you want to test whether some plant species are more salt-tolerant than others, some key
variables you might measure include the amount of salt you add to the water, the species of plants
being studied, and variables related to plant health like growth and wilting.

You need to know which types of variables you are working with in order to choose appropriate
statistical tests and interpret the results of your study.

You can usually identify the type of variable by asking two questions:

- What type of data does the variable contain?
- What part of the experiment does the variable represent?

1 Types of data: Quantitative vs categorical variables

Data is a specific measurement of a variable – it is the value you record in your data sheet. Data is
generally divided into two categories:

Quantitative data represents amounts.

Categorical data represents groupings.

A variable that contains quantitative data is a quantitative variable; a variable that contains categorical
data is a categorical variable. Each of these types of variable can be broken down into further types.

1.1 Quantitative variables

When you collect quantitative data, the numbers you record represent real amounts that can be added,
subtracted, divided, etc. There are two types of quantitative variables: discrete and continuous.

1.2 Categorical variables

Categorical variables represent groupings of some kind. They are sometimes recorded as numbers, but
the numbers represent categories rather than actual amounts of things.


There are three types of categorical variables: binary, nominal, and ordinal variables.

*Note that sometimes a variable can work as more than one type! An ordinal variable can also be
used as a quantitative variable if the scale is numeric and doesn’t need to be kept as discrete integers.

2 Graphics

There are several types of graphics. The choice depends on the type of data collected and the
information you want to transmit. Each has a set of advantages and disadvantages. The following
sections summarize the main types of graphs available. Knowing each one of them is fundamental
for a correct reading of the information they contain.

2.1 Bar Graph

Description: The height of the bars shows the frequency. The bars can be vertical or horizontal.
There is an empty space between the bars.
Advantages Allows you to easily compare. It has strong visual impact.
Disadvantages It can only be used to convey simple information.


2.2 Grouped Bar Graph

Description: For each value of the variable a group of bars appears.


Advantages Lets you compare different data groups for the same values of the variable.
Disadvantages Cannot be used for variables that have a lot of data.

2.3 Pictogram

Description: The data are represented by symbols related to the object under study.
Advantages Very attractive. Great visual impact.
Disadvantages Gives little information. Little precision.


2.4 Line graph

Description: They are formed by lines. On the horizontal axis is the time variable.
Advantages Allows for various types of comparisons. It allows studying the variation of a variable
over time.
Disadvantages It does not easily identify the continuity of the variation.

2.5 Stem and Leaf Plot

Description: The data are divided into two parts: the stem and the leaves. The stem is on the left
side of the vertical trace and the leaves on the right side.
Advantages All data appears on the chart. It is not necessary to build a frequency table beforehand.
It gives a visual interpretation of how the data is distributed.
Disadvantages It is not advisable when there are too many or too few stems. It gives little
information when the data is very scattered.


2.6 Pie or Circle graph

Description: A circle is divided into sectors. The angle of each sector is proportional to the
corresponding frequency.
Advantages It is useful when the analysis of proportions is more important than the actual values.
It has a strong visual impact.
Disadvantages It should only be used when the variable takes few values. A single chart does not
allow you to compare two groups of data.

2.7 Histogram

Description: It is a bar graph in which the height of the bars is proportional to the frequency
(when all the intervals have the same width). There is no space between bars. It is only used if
the variable is quantitative and the scale of the values is continuous.
Advantages For certain situations, it is the only correct way to present the data. The histogram
gives an idea of how the data is distributed.
Disadvantages Difficult to construct when the intervals have different widths. However, with
graphing calculators or computers, this problem is overcome.


2.8 Frequency Polygons

Description: It is a line graph that is obtained by joining the midpoints of the upper base of the
rectangles of the histogram.
Advantages Lets you compare histograms using only the respective frequency polygons.
Disadvantages Difficult manual construction. Using technology, this problem is overcome.

2.9 Box plot

Description: It consists of a rectangle and two straight segments. About 50% of the data is within
the rectangle, 25% to the right and 25% to the left.

Neither segment can be longer than 1.5 times the interquartile range (Q3 - Q1). If a segment exceeds
this distance, the extra length is represented by a dashed line extending the segment, in the direction
of the isolated point corresponding to the maximum or minimum.

Marks 1B   Frequency
0          2
4          4
5          7
6          6
7          3
8          2
10         1


Interquartile range = 6.5 - 4.5 = 2

The distance from Q1 to the minimum is 4.5, which is greater than 1.5 times the interquartile
range (3). The same happens between the maximum and Q3, where the distance is 3.5.
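The values used above can be checked with a short Python sketch. It assumes that the quartiles are taken as the medians of the lower and upper halves of the sorted data (other quartile conventions exist and may give slightly different numbers); the variable names are illustrative only.

```python
from statistics import median

# Marks 1B frequency table from the text: mark -> frequency
marks_1b = {0: 2, 4: 4, 5: 7, 6: 6, 7: 3, 8: 2, 10: 1}

# Expand the table into a sorted list of individual marks
data = sorted(mark for mark, freq in marks_1b.items() for _ in range(freq))
n = len(data)

# Assumed convention: quartiles are the medians of the two halves
# (the overall median is left out of both halves when n is odd)
q1 = median(data[: n // 2])
q2 = median(data)
q3 = median(data[(n + 1) // 2 :])
iqr = q3 - q1

# A whisker may not extend more than 1.5 * IQR beyond the box
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_limit or x > upper_limit]

print("min, Q1, median, Q3, max:", min(data), q1, q2, q3, max(data))
print("IQR:", iqr, "isolated points:", outliers)
```

With this convention the box runs from 4.5 to 6.5, and the marks 0 and 10 fall beyond the 1.5 · IQR limits, so they are drawn as isolated points.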

Advantages For a simple observation, it gives an idea of how the data is distributed.
Disadvantages For its construction it is necessary to know: the minimum, the maximum, the median
(middle quartile), the lower and the upper quartile.

Example and self-assessment

2.10 Scatter plot

Description: A scatter plot displays the relationship between two factors of the experiment. A trend
line is used to determine positive, negative or no correlation.
Advantages Shows a trend in the data relationship. Retains exact data values and sample size.
Shows minimum/maximum and outliers.
Disadvantages A flat trend line gives inconclusive results. Hard to visualize results in large data sets.
Data on both axes should be continuous.


Activity 2.1: Graphics

1. Plot and compare the box plot of each of the following distributions of the grades obtained
in the last high school maths exam of three different groups.

Marks 1A  Frequency    Marks 1B  Frequency    Marks 1C  Frequency
2         2            0         1            1         4
3         1            3         3            2         6
4         5            5         7            3         3
5         7            6         6            4         2
6         6            7         4            5         1
7         1            8         2            6         1
8         3            10        2            7         3
9         5

2. Waiting time (in minutes) is measured in three fast food restaurants (1,2,3). The results
are shown in the following box plot

a) Which restaurant shows the highest waiting range?


b) Which restaurant minimizes the waiting time?
c) If a wait time of 8 minutes is set, which restaurant will you choose?

3. Here are two box plots. The first one shows the results of an experimental measurement of
the length of tree leaves. The second one shows the number of people who attend mass
daily in a small mountain town. Both variables have an outlier. Why? Justify whether any of
these outliers should be removed.


Activity 2.2: Graphics

1. The University of Lleida is conducting a study on the effect of rainy days in September
on the "rovellons" harvest. For this purpose they have data from the last 30 years
detailing the number of rainy days in September and the amount of "rovellons" (in tons)
harvested in Catalonia. These data are given as pairs (A,B), where A is the number of days
of rain in September and B is the number of tons of "rovellons".

Plot a scatter plot to discover whether there is a good linear correlation between the two
variables.

3 Calculations and formulas

3.1 Univariate statistics

Qualitative variables

• Absolute frequency

• Relative frequency, Percentages

• Cumulative Frequency


Example 3.1: Univariate qualitative variable

When Apple Computer introduced the iMac computer in August 1998, the company wanted
to learn whether the iMac was expanding Apple's market share. Was the iMac just attracting
previous Macintosh owners? Or was it purchased by newcomers to the computer market,
and by previous Windows users who were switching over? To find out, 500 iMac customers
were interviewed. Each customer was categorized as a previous Macintosh owner, a previous
Windows owner, or a new computer purchaser. The qualitative data results were displayed in
a frequency table.

Previous ownership   Frequency   Relative frequency   %     Cumulative frequency   Cumulative relative frequency
None                 85          0.17                 17    85                     0.17
Windows              60          0.12                 12    145                    0.29
Mac                  355         0.71                 71    500                    1.00
Total                500         1.00                 100
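The columns of this frequency table can be rebuilt from the absolute frequencies alone. The following Python sketch shows one possible way to do it; the variable names are illustrative and not part of the original example.

```python
from itertools import accumulate

# Absolute frequencies from the iMac survey in the text
counts = {"None": 85, "Windows": 60, "Mac": 355}

total = sum(counts.values())
cumulative = list(accumulate(counts.values()))  # running totals

rows = []
for (category, freq), cum in zip(counts.items(), cumulative):
    relative = freq / total            # relative frequency
    percentage = 100 * relative        # the same value as a percentage
    cumulative_relative = cum / total  # cumulative relative frequency
    rows.append((category, freq, relative, percentage, cum, cumulative_relative))

for row in rows:
    print("{:<8} {:>4} {:>5.2f} {:>4.0f} {:>4} {:>5.2f}".format(*row))
```

Note that cumulative frequencies are normally meaningful only when the categories have a natural order; here they simply follow the order of the table.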

Quantitative variables

• Central values: mean, mode, median


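Mean The mean is the average of a data set. It is a measure of central tendency.

\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{N} = \frac{x_1 + x_2 + \cdots + x_n}{N} \]

For data given in a frequency table, where the value x_i appears f_i times:

\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i \cdot f_i}{N} = \frac{x_1 \cdot f_1 + x_2 \cdot f_2 + \cdots + x_n \cdot f_n}{N} \]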
• Situation (position) values: median, 1st and 3rd quartiles, min, max
In a symmetric distribution,
- the median is in the center of the box
- the whiskers have the same length
- mean = median = mode
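As an illustration of the central values, the sketch below computes the mean, median and mode from a frequency table. It reuses the Marks 1B table from the box plot section; the variable names are illustrative.

```python
from statistics import median

# Marks 1B frequency table (mark -> frequency), taken from section 2.9
marks_1b = {0: 2, 4: 4, 5: 7, 6: 6, 7: 3, 8: 2, 10: 1}

# N is the total number of data, i.e. the sum of the frequencies
N = sum(marks_1b.values())

# Mean with frequencies: sum of x_i * f_i divided by N
mean = sum(x * f for x, f in marks_1b.items()) / N

# Median: middle value of the sorted, expanded data
data = sorted(x for x, f in marks_1b.items() for _ in range(f))
med = median(data)

# Mode: the value with the highest frequency
mode = max(marks_1b, key=marks_1b.get)

print(f"N = {N}, mean = {mean:.2f}, median = {med}, mode = {mode}")
```

Here the three central values are close to each other but not identical, so the distribution is only approximately symmetric.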


Activity 3.1: Univariate quantitative variable

1. Consider the following data set: 4 ; 5 ; 6 ; 6 ; 6 ; 7 ; 7 ; 7 ; 7 ; 7 ; 7 ; 8 ; 8 ; 8 ; 9 ; 10

Plot the corresponding histogram. Observe and explain it.
Calculate the mean, median and mode.

2. Redo the same exercises with this data: 4 ; 5 ; 6 ; 6 ; 6 ; 7 ; 7 ; 7 ; 7 ; 7 ; 7 ; 8

The distribution is called left-skewed because its tail extends to the left.

3. Redo with 6 ; 7 ; 7 ; 7 ; 7 ; 7 ; 7 ; 8 ; 8 ; 8 ; 9 ; 10. What do you think this distribution
is called?

4. Exercise

5. Exercise

• Spread values: range, variance, standard deviation, interquartile range, variation coefficient

Range = maximum value - minimum value
Variance of a population (for a sample used to estimate a population, divide by N - 1 instead of N)
The term variance refers to a statistical measurement of the spread between numbers in a data
set. More specifically, variance measures how far each number in the set is from the mean
(average), and thus from every other number in the set. In particular, it measures the degree of


dispersion of data around the sample’s mean.

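\[ \sigma^2 = \frac{\sum (x_i - \bar{x})^2}{N} = \frac{\sum x_i^2}{N} - \bar{x}^2 \]

\[ \sigma^2 = \frac{\sum (x_i - \bar{x})^2 \cdot f_i}{N} = \frac{\sum x_i^2 \cdot f_i}{N} - \bar{x}^2 \]

Standard deviation = \(\sqrt{\text{variance}}\)
Interquartile range = Q3 − Q1
Variation coefficient shows the extent of variability in relation to the mean of the population.
The higher the CV, the greater the dispersion.

\[ CV = \frac{\text{standard deviation}}{\text{mean}} = \frac{S}{\bar{x}} \]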

Uses of univariate statistics

1. The weights of 5 patients are 70, 60, 56, 83 and 79 kg (mean = 69.6 kg, standard deviation
s = 10.44 kg) and their systolic blood pressures (TAS) are 150, 170, 135, 180 and 195 mmHg
(mean = 166 mmHg, standard deviation = 21.3 mmHg).
Which distribution is more dispersed, weight or blood pressure?
If we compare the standard deviations, we observe that the standard deviation of blood pressure
is much higher; however, we cannot directly compare the standard deviations of two variables
that have different measurement scales, so we calculate the coefficients of variation
(CV weight = 15%, CV TAS = 12.8%). In view of the results, we observe that the weight variable
has greater dispersion.

2. When data is distributed symmetrically (and we have already said that this occurs when its mean
and median values are close), mean and standard deviation are used to describe that variable.
In the case of skewed distributions, the median and range are more appropriate measures. In
this case, quartiles are also often used.
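The comparison in point 1 can be reproduced with a few lines of Python. This is a minimal sketch that uses the population standard deviation (dividing by N), which is what matches the values quoted in the text; the function name `variation_coefficient` is illustrative.

```python
from statistics import mean, pstdev

# Data of the 5 patients from the text
weights_kg = [70, 60, 56, 83, 79]
tas_mmhg = [150, 170, 135, 180, 195]

def variation_coefficient(values):
    """Population standard deviation divided by the mean."""
    return pstdev(values) / mean(values)

cv_weight = variation_coefficient(weights_kg)
cv_tas = variation_coefficient(tas_mmhg)

print(f"weight: mean = {mean(weights_kg):.1f}, s = {pstdev(weights_kg):.2f}, CV = {cv_weight:.1%}")
print(f"TAS:    mean = {mean(tas_mmhg):.1f}, s = {pstdev(tas_mmhg):.2f}, CV = {cv_tas:.1%}")
```

Because the CV has no units, the two values can be compared directly even though one variable is measured in kg and the other in mmHg.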

Activity 3.2: Univariate quantitative variable

1. Suppose that, in a small town of 50 people, one person earns 5,000,000 per year and the
other 49 each earn 30,000. Which is the better measure of the "center," the mean or the
median?


Example 3.2: Univariate quantitative variable

1. Exercises

2. You want to illustrate that changing units affects different measures differently.
You have data on the average heights of males in the UK from 1900 to 1980. Find the
mean and variance of these heights, then calculate these measures for heights in feet and
inches (1 in = 2.54 cm). Choose an appropriate chart or plot to illustrate these changes.

Year Heights in Cm
1900 169.4
1910 170.9
1920 171
1930 173.9
1940 174.9
1950 176
1960 176.9
1970 177.1
1980 176.8

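\[ \bar{x} = \frac{1566.9}{9} = 174.1 \]

\[ \sigma^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} \]

\[ \sigma^2 = \frac{70.4}{9 - 1} = 8.8 \]

To find the mean and variance in inches, we simply need to follow the rules for changing
units: dividing every value by 2.54 divides the mean by 2.54 and the variance by 2.54².

\[ \bar{x}_{in} = \frac{174.1}{2.54} = 68.5 \]

\[ \sigma^2_{in} = \frac{8.8}{2.54^2} = 1.4 \]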
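The rule for changing units can be checked numerically. The sketch below computes the mean and the sample variance (dividing by n - 1, as in the worked solution) in centimetres, converts every height to inches, and recomputes both measures. The variable names are illustrative.

```python
from math import isclose
from statistics import mean, variance

# Average heights in cm, from the table in the text (1900 to 1980)
heights_cm = [169.4, 170.9, 171, 173.9, 174.9, 176, 176.9, 177.1, 176.8]
CM_PER_INCH = 2.54

mean_cm = mean(heights_cm)
var_cm = variance(heights_cm)  # sample variance: divides by n - 1

# Convert each value, then recompute the measures from scratch
heights_in = [h / CM_PER_INCH for h in heights_cm]
mean_in = mean(heights_in)
var_in = variance(heights_in)

print(f"cm: mean = {mean_cm:.1f}, variance = {var_cm:.1f}")
print(f"in: mean = {mean_in:.1f}, variance = {var_in:.1f}")

# The mean scales with the factor, the variance with the factor squared
print(isclose(mean_in, mean_cm / CM_PER_INCH))
print(isclose(var_in, var_cm / CM_PER_INCH**2))
```

The same reasoning shows that the standard deviation scales with the factor itself, while the variation coefficient does not change at all.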


Activity 3.3: Univariate quantitative variable

1. Annual Survey of Salary Structure (2020)

True or false?
a) The average annual gross earnings per worker were 25,165.51 euros in 2020.
b) The mean is a good measure to describe the salary.
c) Half of the population has gross earnings of less than 25,165.51 euros.
d) The most common annual gross earnings are 18,480.19 euros.
e) Approximately 550,000 workers earn 13,970.80 euros.

2. A basketball team needs a winger. The coach has observed 2 players from other teams in
their last matches. The points achieved by each one are:
Player 1: 16, 14, 13, 13, 14
Player 2: 25, 10, 8, 6, 21
Which player would you choose?

3. A company wants to buy a batch of light bulbs and is considering two possible
manufacturers, A and B. The price of the bulbs is the same in the two options. The company
tests 6 bulbs from each one:

Bulb   Hours A   Hours B
B1     400       650
B2     550       400
B3     600       400
B4     450       700
B5     550       350
B6     500       450

4. We know that the average monthly salary in town A is 1500 euros and the standard deviation
is 200, while in town B the average is 1850 euros and the standard deviation is 390.
a) Is this information enough to know which of the populations has more dispersion in
their monthly salaries?
b) Find out which of the populations has more dispersed salaries.


3.2 Bivariate statistics

Quantitative variables

• Covariance

• Determination coefficient, R² → all function models

• Pearson's linear correlation coefficient, r → only linear model: r is the square root of R² (with the sign of the covariance)

• Equation of the line of best fit → only linear model, x against y / y against x

Covariance of two variables

Covariance is a measure of how much two random variables vary together. It's similar to variance,
but where variance tells you how a single variable varies, covariance tells you how two variables vary
together.

\[ \sigma_{xy} = \frac{\sum x_i \cdot y_j \cdot f_{ij}}{N} - \bar{x} \cdot \bar{y} \]

Linear coefficient of Pearson

The Pearson correlation coefficient, denoted by r, is a measure of the linear trend between two
variables. The value of r ranges between -1 and 1.

\[ r = \frac{\sigma_{xy}}{\sigma_x \cdot \sigma_y} \]

When r = zero, it means that there is no linear association between the variables. However, there
might be some nonlinear relationship but if r = zero then there is no consistent linear component to
that relationship.

When r = 1, it means that there is a perfect positive linear relationship between the variables, and
all individuals sampled lie exactly on the line of best fit with a positive slope.

If 0 < r < 1 then there is a positive linear trend and the sampled individuals are scattered around the
line of best fit; the smaller the absolute value of r the less well the data can be visualized by a single
linear relationship. If r is positive then an increase in the value of one variable is associated with an
increase in the other variable.

When r = -1, it means that there is a perfect negative linear relationship between the variables, and all
individuals sampled lie exactly on the line of best fit with a negative slope.


If -1 < r < 0 then there is a negative linear trend and the sampled individuals are scattered around
the line of best fit; the smaller the absolute value of r the less well the data can be visualized
by a single linear relationship. If r is negative then an increase in the value of one variable is
associated with a decrease in the other variable.

The value of r² is called the coefficient of determination. The coefficient of determination is the
proportion of the variance of one variable that is explained by its linear relationship with the other.

Equations of the lines of best fit

Linear regression attempts to model the relationship between two variables by fitting a linear equation
to observed data. One variable is considered to be an explanatory variable, and the other is considered
to be a dependent variable. For example, a modeler might want to relate the weights of individuals
to their heights using a linear regression model.

\[ y - \bar{y} = \frac{\sigma_{xy}}{\sigma_x^2} \cdot (x - \bar{x}) \]

\[ x - \bar{x} = \frac{\sigma_{xy}}{\sigma_y^2} \cdot (y - \bar{y}) \]

One of the most common reasons for fitting a regression model is to use the model to predict the
values of new observations. You must use the first equation to predict Y when X is known, and the
second one to predict X when Y is known. The two lines intersect at the point \((\bar{x}, \bar{y})\).

We use the following steps to make predictions with a regression model:

Step 1: Collect the data.


Step 2: Fit a regression model to the data.
Step 3: Verify that the model fits the data well.
Step 4: Use the fitted regression equation to predict the values of new observations.
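The four steps above can be sketched in Python as follows. The data set is hypothetical (it does not come from the text) and was chosen only so that the arithmetic is easy to follow; the formulas are the ones given in this section for the covariance, r and the line of best fit of y on x.

```python
from math import sqrt

# Step 1: collect the data (hypothetical values, for illustration only)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

# Step 2: fit the regression line of y on x
mean_x = sum(x) / n
mean_y = sum(y) / n
var_x = sum(v * v for v in x) / n - mean_x**2
var_y = sum(v * v for v in y) / n - mean_y**2
cov_xy = sum(a * b for a, b in zip(x, y)) / n - mean_x * mean_y

slope = cov_xy / var_x                 # sigma_xy / sigma_x^2
intercept = mean_y - slope * mean_x    # the line passes through (mean_x, mean_y)

# Step 3: check how well the line fits the data
r = cov_xy / sqrt(var_x * var_y)
r_squared = r**2

# Step 4: predict new values, staying inside the range of observed x
def predict(x_new):
    return intercept + slope * x_new

print(f"y = {slope:.2f}x + {intercept:.2f}, r = {r:.2f}, R^2 = {r_squared:.2f}")
print(f"prediction for x = 3.5: {predict(3.5):.2f}")
```

To predict x from y, the roles of the two variables are swapped and the slope becomes the covariance divided by the variance of y.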


Example 3.3: Linear regression

Fat percentage and kilocalories per container have been studied in different brands of natural
yogurts. The results obtained in six of them are:

Scatter plot

We calculate all the statistical parameters to find out if there is a relationship between the fat
and the calories in the yogurt
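\[ \bar{x} = \frac{\sum x_i \cdot f_i}{N} = \frac{14.2}{6} = 2.37 \qquad \bar{y} = \frac{373}{6} = 62.17 \]

\[ \sigma_x^2 = \frac{\sum x_i^2 \cdot f_i}{N} - \bar{x}^2 = \frac{35.06}{6} - 2.37^2 = 0.23 \]

\[ \sigma_y^2 = \frac{\sum y_i^2 \cdot f_i}{N} - \bar{y}^2 = \frac{23655}{6} - 62.17^2 = 77.39 \]

\[ \sigma_{xy} = \frac{\sum x_i \cdot y_j \cdot f_{ij}}{N} - \bar{x} \cdot \bar{y} = \frac{904.9}{6} - 2.37 \cdot 62.17 = 3.47 \]

\[ r = \frac{\sigma_{xy}}{\sigma_x \cdot \sigma_y} = \frac{3.47}{\sqrt{0.23} \cdot \sqrt{77.39}} = 0.85 \]

\[ R^2 = 0.85^2 = 0.7225 \]

\[ m_{xy} = \frac{\sigma_{xy}}{\sigma_x^2} = \frac{3.47}{0.23} = 15.1 \]

\[ y - \bar{y} = \frac{\sigma_{xy}}{\sigma_x^2} \cdot (x - \bar{x}) \]

\[ y - 62.17 = 15.1 \cdot (x - 2.37) \]

\[ y = 15.1x + 26.38 \]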

As the correlation is high, r = 0.85, it is reasonable to make estimates within the data range.
For a fat percentage of 2.5, the kilocalories will be approximately 64.13. However, estimating
a value for a fat percentage of 10 is not valid, because x = 10 is very far from the range of data we
considered.
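The calculation can be repeated in Python using only the sums quoted in the example (the individual yogurt data are not reproduced here). This is a sketch, not the method used in the original solution: it keeps full precision in the intermediate values, so the slope and the prediction differ slightly from the hand calculation, which rounds at every step.

```python
from math import sqrt

# Sums taken from the worked example (6 yogurts)
n = 6
sum_x, sum_y = 14.2, 373          # fat percentage, kilocalories
sum_x2, sum_y2 = 35.06, 23655     # sums of squares
sum_xy = 904.9                    # sum of products

mean_x = sum_x / n
mean_y = sum_y / n
var_x = sum_x2 / n - mean_x**2
var_y = sum_y2 / n - mean_y**2
cov_xy = sum_xy / n - mean_x * mean_y

r = cov_xy / sqrt(var_x * var_y)
slope = cov_xy / var_x
intercept = mean_y - slope * mean_x

# Estimate inside the data range only
kcal_at_2_5 = intercept + slope * 2.5

print(f"r = {r:.2f}, R^2 = {r**2:.2f}")
print(f"y = {slope:.2f}x + {intercept:.2f}")
print(f"estimate for 2.5% fat: {kcal_at_2_5:.1f} kcal")
```

The correlation coefficient agrees with the value in the text, and the estimate for 2.5% fat stays close to 64 kcal; the small differences show how sensitive the slope is to early rounding.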


Activity 3.4: Scatter plot

1. Nutrition at Starbucks. The scatterplot below shows the relationship between the number
of calories and amount of carbohydrates (in grams) Starbucks food menu items contain.
Since Starbucks only lists the number of calories on the display items, we are interested
in predicting the amount of carbs a menu item has based on its calorie content.

Describe the relationship between number of calories and amount of carbohydrates (in
grams) that Starbucks food menu items contain.
Why might we want to fit a regression line to these data?

2. Body measurements. Researchers studying anthropometry collected body girth
measurements and skeletal diameter measurements, as well as age, weight, height and
gender for 507 physically active individuals. The scatterplot below shows the relationship
between height and shoulder girth (over deltoid muscles), both measured in centimeters.

a) Describe the relationship between shoulder girth and height.
b) The mean shoulder girth is 107.20 cm with a standard deviation of 10.37 cm. The
mean height is 171.14 cm with a standard deviation of 9.41 cm. The correlation between
height and shoulder girth is 0.67.
Write the equation of the regression line for predicting height.
c) A randomly selected student from your class has a shoulder girth of 100 cm. Predict
the height of this student using the model.
d) The student from part (c) is 160 cm tall. Calculate the residual, and explain what this
residual means.

Activity 3.5: Equation of the lines of best fit

1. A company manufactures an electronic device to be used in a very wide temperature
range. The company knows that increased temperature shortens the life time of the
device, and a study is therefore performed in which the life time is determined as a
function of temperature. The following data is found:

Is it possible to use a linear regression model, which expresses the life time as a linear
function of the temperature?

2. You are interested in knowing the relationship between the weather and tourism levels.
To investigate, you collect data from the touristic centre in a city during one month in
the summer, counting the number of people that arrive at the square at the same time
every day. Given the data set below, what is the correlation between temperature and
tourism? Interpret the correlation and name a few other reasons why these two variables
are or are not related.

Temperature Number of Visitors


12 87
21 150
20 110
25 90
17 85
15 70
13 90

3. There are two variables that need to be studied: weight loss and days spent exercising
in one month. You are given a data set in which individuals have been asked the number
of days they exercise for more than half an hour in one month. What kind of regression
model can you use here? What are the results of this regression given the data set below?
Interpret the model's estimators.

Exercise Days Weight Loss (in kg)


0 4
4 1
8 1.5
12 2
16 4
20 5
24 2


4 Instruments for making calculations and graphs

- GeoGebra spreadsheet: calculations of central, spread, and position values; box plots, histograms, and lines of best fit.
- Excel: lines of best fit, R².
- Graphs: to create original graphs.
- Gapminder: an independent educational non-profit fighting global misconceptions; real-life cases.

5 Correlation and Causation

What are correlation and causation and how are they different?

Two or more variables are considered to be related, in a statistical context, if their values change so
that as the value of one variable increases or decreases so does the value of the other variable
(although it may be in the opposite direction).

For example, for the two variables "hours worked" and "income earned" there is a relationship between
the two if an increase in hours worked is associated with an increase in income earned. If we consider
the two variables "price" and "purchasing power", as the price of goods increases a person's ability to
buy these goods decreases (assuming a constant income).

Correlation is a statistical measure (expressed as a number) that describes the size and
direction of a relationship between two or more variables. A correlation between variables, however,
does not automatically mean that the change in one variable is the cause of the change in the values
of the other variable.

Causation indicates that one event is the result of the occurrence of the other event; i.e. there is a
causal relationship between the two events. This is also referred to as cause and effect.

Theoretically, the difference between the two types of relationships is easy to identify: an action or
occurrence can cause another (e.g. smoking causes an increase in the risk of developing lung cancer),
or it can correlate with another (e.g. smoking is correlated with alcoholism, but it does not cause
alcoholism). In practice, however, it remains difficult to clearly establish cause and effect, compared
with establishing correlation.


Examples of correlation but not causation

https://www.correlated.org/1543


6 Statistics and problem solving

The main purpose of Statistics is to solve problems that involve a large amount of data, not all of
which is accessible because of time, money or any other circumstance:

Examples.
- the effectiveness of a new vaccine in preventing a disease.
- quality control in a factory where car parts are manufactured.
- the veracity of a saying about weather.

6.1 Scientific method

Being a science, Statistics solves problems by using the scientific method:

• The problem to be solved gives rise to questions.

• Hypotheses are the provisional answers to these questions.

• Hypotheses are tested with statistical work:

- Data collection: we obtain one or more samples.
- Data analysis:
First, we describe the characteristics (variables) of the sample that are related to the problem,
with calculations and graphs.
Next, we use these results for the sample to infer results for the whole population.

• Finally, we state our conclusions in terms of the hypotheses.

Example:


Problem: How many pines in Collserola (population) are affected by the processionary caterpillar?
Hypothesis: A few of them, less than 10 per cent.
Data collection: We observe 200 pines distributed in different areas of Collserola (sample).
Data analysis:
- First, we count the number of affected pines in our sample (22) and we calculate the sample
proportion of affected pines: 11%, which is more than 10%.
- We do not assume that this is the proportion for the whole population. Statistical theory states that
the proportion for the population could be any number in the interval (6.66%, 15.34%).
Conclusion: The result obtained is compatible with the hypothesis, because the interval includes values below 10%.
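The text does not say how the interval (6.66, 15.34) was obtained. It is consistent with the usual normal approximation for a proportion at a 95% confidence level, so the following Python sketch assumes that method; the function name `proportion_interval` is illustrative.

```python
from math import sqrt
from statistics import NormalDist

def proportion_interval(successes, n, confidence=0.95):
    """Normal-approximation confidence interval for a proportion.

    Assumption: this is the textbook formula p +/- z * sqrt(p(1-p)/n);
    the original text does not state which method it used.
    """
    p = successes / n
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    margin = z * sqrt(p * (1 - p) / n)
    return p - margin, p + margin

# 22 affected pines in a sample of 200
low, high = proportion_interval(22, 200)
print(f"sample proportion: {22 / 200:.0%}")
print(f"interval: ({100 * low:.2f}%, {100 * high:.2f}%)")
```

Since the lower end of the interval is below 10%, the sample does not contradict the hypothesis, even though the sample proportion itself is 11%.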

6.2 Populations and samples

In order to know about some characteristics of all the individuals of a POPULATION, we would need
to have data for all of them.

However, usually it is not possible to reach all these individuals (e.g. all the trees of a forest, all the
people suffering from an illness around the world,...).

What we do instead is to obtain information about only some individuals of the population that we
want to study. This available group is a SAMPLE.

Population All the individuals that constitute the object of study, even though they are inaccessible.
Examples:
- all the human beings.
- all the pieces made in the car factory.

Sample The set of individuals obtained, preferably at random, to test the hypotheses.
We expect that they represent the whole population. If the individuals are not chosen at random,
the sample is likely to be biased.

Examples:
- a group of 300 people from a particular town.
- one out of every thousand car parts made in the factory

Sample size The number of individuals that belong to the sample.
It should be large enough to ensure some correctness of the conclusions.

Sampling How the data is chosen:
- Random (a draw to serve at a polling station in a school, drug control at the airport,...)
- Systematic (street survey, police drug control,...)
- Stratified (age, gender, level of education,...)
- Cluster (geographical areas, counties,...)
- Convenience (people for testing a medicine)


Data source Where the data comes from:


(a) Observation. Counting/Measurement
(b) Experimentation. Counting/Measurement.
(c) Preexisting data / Official websites
(d) Polls, surveys and questionnaires

Results The results we obtain are only valid for the sample we are using.

However, the results of a sample give an estimation about the population. The better the sample, the
better the estimation we obtain for the population.

The bigger the sample, the better the estimation of the values for the population.

6.3 Conclusions and confidence

Confidence interval
We expect the population value to be around the sample value. Statistical theory gives the width of
the interval centered on the value of the sample.

Confidence level
How sure we are that the sample represents the population. In other words: How sure we are that
the value of the population will be inside the confidence interval. Usually we ask for a 95% or a 99%
confidence level.


Sample size, again
Besides being the number of individuals that belong to the sample, the sample size is also the number
of individuals that the sample needs in order to ensure a certain confidence interval and a certain
confidence level.

6.4 Statistical calculators

They let us obtain confidence intervals, among other results, automatically. We can make the
calculations without knowing the underlying theory.
select-statistics.co.uk/calculators/

Statistics Kingdom


Example 6.1: Confidence intervals

If exactly 30% of a sample of 200 pine trees in a forest have processionary caterpillar nests, we
cannot assume that exactly 30% of the pines of the whole forest have this problem.
However, we can assume that the percentage for the forest will be within an interval around
30%.
We have calculators that give us these intervals, which are called "CONFIDENCE INTERVALS".
For our example, with a forest of 10 000 pines, the calculator gives us these results:

If our sample size is 200, our conclusion will be that the problem could be affecting any
percentage of pines inside the interval (23.7 , 36.3).
But if our sample size were bigger, for example 400, we would have a narrower interval for our
estimate and conclude that the problem could be affecting any percentage of pines in the whole
forest inside the interval (25.6 , 34.4).
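The example does not describe the formula behind the calculator. The two intervals quoted are reproduced by the normal approximation at a 95% confidence level combined with a finite population correction for the 10 000 pines, so the sketch below assumes that method; the function name `finite_population_interval` is illustrative.

```python
from math import sqrt
from statistics import NormalDist

def finite_population_interval(p, n, population, confidence=0.95):
    """Confidence interval for a proportion in a finite population.

    Assumption: normal approximation with a finite population
    correction; the online calculator may use a different method.
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    standard_error = sqrt(p * (1 - p) / n)
    correction = sqrt((population - n) / (population - 1))
    margin = z * standard_error * correction
    return p - margin, p + margin

# 30% of affected pines in the sample, forest of 10 000 pines
for n in (200, 400):
    low, high = finite_population_interval(0.30, n, 10_000)
    print(f"n = {n}: ({100 * low:.1f} , {100 * high:.1f})")
```

Doubling the sample size does not halve the width of the interval: the margin shrinks roughly with the square root of the sample size.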

7 Project

In pairs, you must solve one of these two problems.


The final product must be a report.
