
Oct 13, 2015


lecture

© All Rights Reserved


1.1

What's it about

practical point of view) is to take those data, to put them into a form

which is informative (so they become information), and then use the

information to make decisions. So we start in this chapter to consider

how to present data and gain simple information; in chapters 2 to 4 we

develop the idea of probability distributions, in chapter 5 we see how

we can fit them to the data, and in chapters 7 to 9 we see what

decisions can be made based on this.

1.2

True or false?

1.2.1

The boxes for Puma sports shoes say "Average contents 2". Is this useful

information?

Most people have more than the average number of feet. True or

false?

1.2.2

What does it mean to say that the probability of rain in London

tomorrow is 79%?

1.2.3

Dan is being tried for murder on a remote island. The only evidence against him is the

forensic evidence: blood samples taken from the scene of the crime match Dan's.

The chance of such a match happening if Dan were in fact innocent, and the

match were just a coincidence, is calculated (by the prosecution's expert witness) as

1 in 10000. The defence's expert witness, however, points out that the crime could

have been committed by any one of the 40000 people on the island. If you were on

the jury, would you find Dan guilty? What do you think is the probability of his guilt?

person or to let a guilty person go free? This will lead us to the idea of

Type I errors and Type II errors in chapter 9.
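
One hedged way to put a number on the probability of Dan's guilt is Bayes' rule. The sketch below rests on assumptions that are not part of the case as stated: exactly one of the 40000 islanders committed the crime, each is equally likely to be the culprit before the forensic evidence is considered, and the culprit's blood is certain to match.

```python
from fractions import Fraction

# Assumptions (not in the original problem): exactly one culprit, every
# islander equally likely a priori, and the culprit always matches.
population = 40000
p_match_if_innocent = Fraction(1, 10000)

prior_guilty = Fraction(1, population)
prior_innocent = 1 - prior_guilty

# Bayes' rule: P(guilty | match)
posterior_guilty = prior_guilty / (
    prior_guilty + prior_innocent * p_match_if_innocent
)

# Number of innocent islanders expected to match purely by coincidence
expected_innocent_matches = (population - 1) * p_match_if_innocent

print(float(expected_innocent_matches))
print(float(posterior_guilty))
```

Under these assumptions about four innocent islanders would be expected to match by coincidence, so the match on its own leaves the probability of guilt at roughly one in five.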

1.2.4

LONDON (Reuters) - House prices rose at the quickest pace in more than four years

during the three months to October, a survey showed on Thursday, supporting

evidence of a resilient housing market in spite of rising borrowing costs.

The Royal Institution of Chartered Surveyors said its house price balance rose to +48.1

in October from +45.7 in the three months to September -- the strongest reading since

September 2002 and more than double the long-run average of +21.

The sales-to-stock ratio, which economists say is a more reliable gauge of the health

of the market, rose to +40.9 from +39.1 -- the highest ratio in over two years.

"Even after last week's interest rate rise, surveyors are still confident that the housing

market will remain buoyant," said RICS spokesman Ian Perry.

"The market is unlikely to feel cold winds from high finance costs until mid-year at the

earliest as economic conditions are favourable."

The RICS data chimes with other property surveys which have portrayed a robust

market following August's interest rate rise to 4.75 percent, but analysts have said the

market may flag next year after last week's rise in borrowing costs.

The Bank of England hiked rates to 5 percent -- their highest level in five years -- but

did not make explicit mention of the buoyant house market when it gave reasons for

the move.

This is from the BBC news page for 16 November 2006. Most readers will

not know what the figures mean, and no explanation is given.

1.3

Misusing data

produce misleading information.

We want to illustrate the gender ratio in a class: suppose 58% are male

and 42% are female. If they are represented by squares, and the

lengths of the sides of the squares are proportional to the numbers, then

we get these. But we would probably look at the areas, in which case

the first square is almost twice the area of the second: in fact $0.58^2 / 0.42^2 \approx 1.907$. On the other hand, if we draw them so

that the areas are in the correct ratio, we get

various ways).
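
The arithmetic behind the two drawings can be checked with a short sketch. The variable names are illustrative only; the 58% and 42% figures are the ones from the class example.

```python
import math

male, female = 0.58, 0.42

# Sides proportional to the numbers: the eye compares areas,
# which are the squares of the sides.
area_ratio_if_sides_proportional = (male / female) ** 2

# Areas proportional to the numbers: the sides must then be in the
# ratio of the square roots.
side_ratio_if_areas_proportional = math.sqrt(male / female)

print(round(area_ratio_if_sides_proportional, 3))  # 1.907
print(round(side_ratio_if_areas_proportional, 3))
```

So an honest drawing needs sides that differ far less than the raw percentages do.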

Suppose we look at the monthly sales of some product. We might be

given these plots:

[Figure: monthly sales, vertical axis running from 1080 to 1140]

[Figure: monthly sales, vertical axis running from 200 to 1200]

But using this scale the increase is seen to be much more modest.

Which of these is better depends on the purpose and the reader: for

example, whether the aim is to convince a backer that it is booming, or to

act as a tool in prediction.

1.4

agreed that the rate of increase of house prices in some area (say

Surrey) should be found. How would it be measured? House prices

need to be measured at different times. But there are so many houses

sold in the area: a conservative estimate is about 5000 per month. It

would be too time-consuming to find all of them, so we use a sample,

a selection of those we are interested in. We can then use the sample

to estimate the quantities we want for the whole set. The whole set of

interest is called the population. An important issue is how accurate

the estimate based on the sample will be.

Just as in MT181 you met different types of numbers, we find different

types of data. If we are measuring (say) the heights of a group of

plants, or the weights of animals, then they can be any real non-negative number. If we are considering currency rates, then in

practice they are given to a fixed number of decimal places, but they

are so close together that we can assume that they also can be any

real non-negative number, and if we are looking at changes, then they

can be any real number.

If on the other hand we are looking at the number of goals scored in

football matches in a league, or the number of people entering a shop

in a day, then they will be always non-negative integers. There may be

a limit on the value, too: if y is the number of heads that occur if a coin

is tossed 12 times, then y can only take integer values such that

$0 \le y \le 12$.

All these are quantitative variates: the first group are continuous and

the second group are discrete. We shall look at continuous variates in

chapter 2 and discrete variates in chapter 3. But variates can also be

qualitative: we can look for example at the gender of insects (M or F),

or nationality (UK, French, Pakistani, ...). A survey of a television

programme may involve sampling the views of members of the

population in the form "Brilliant", "Good", "Satisfactory", "Rubbish". Sometimes

qualitative results can be converted to quantitative ones, and this is

dealt with in MT230.

Take another item from the BBC on 16 November 2006.

The Maltese and the Greeks are

the heavyweights of Europe,

figures from the European

Commission reveal.

The Italians and French are the most

trim, while the average Briton - like the average European - is

slightly over the ideal weight.

The average European is overweight

heart disease, is a growing problem across much of the

developed world.

The Commission plans to launch a strategy to tackle obesity

next year.

Obesity is measured by

calculating body mass

index (BMI).

A BMI of between 18.5

and 25 is considered

healthy, between 25 and

30 overweight and above

30 obese.

EUROPE'S HEAVYWEIGHTS

Malta - 26.6

Greece - 25.9

Finland - 25.8

Luxembourg - 25.7

Hungary - 25.6

Cyprus - 25.6

Lithuania - 25.5

Slovenia - 25.5

Denmark - 25.5

UK - 25.4

The latest figures show

that the average citizen in 20 EU countries - including the UK, where the average BMI is 25.4 - is overweight.

The average person in the other five, including Italy and France,

is officially healthy.

Poor self-awareness

The Maltese recorded the highest average BMI at 26.6, the

Italians the lowest, at 24.3.

Despite the figures, only 38% of

EU citizens consider themselves

to be overweight, according to

the poll of about 1,000 people

in each of the 25 EU member

states.

Most blamed a sedentary life

for restricting their scope to be

healthier.

SLIMMEST COUNTRIES

Italy - 24.3

France - 24.5

Austria - 24.8

Poland - 24.8

Netherlands - 24.9

Slovakia - 25.0

Belgium - 25.1

Latvia - 25.1

Estonia - 25.2

Czech Rep - 25.2

Body Mass Index figures

physical activity.

And a large majority - 85% - said public authorities should play a

stronger role in fighting obesity.

EU Health and Consumer Protection Commissioner Markos

Kyprianou said: "This survey provides us with valuable insights into

the concerns of EU citizens on health and nutrition."

The item also explains how the BMI is calculated. How can they be

sure that the national figures are correct? Clearly they have not

measured everyone in the EU, so they must have sampled. Was the

choice a representative sample? In the last section on self-awareness

they say that they sampled 1000 people from each EU country.

Doesn't that mean that the result will be much more reliable for Malta,

with fewer than 400000 inhabitants, than for Germany, with over 82000000

inhabitants? How much better? In chapter 7 we look at the reliability

of such samples.
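
As a rough preview, and only under the assumption of simple random sampling without replacement (the article does not say how the poll was conducted), the standard error of a sample mean is multiplied by the finite population correction $\sqrt{(N - n)/(N - 1)}$, where $N$ is the population size and $n$ the sample size. The sketch below evaluates that factor using the round population figures quoted above; the function name `fpc` is illustrative.

```python
import math

def fpc(population_size, sample_size):
    """Finite population correction for sampling without replacement."""
    return math.sqrt((population_size - sample_size) / (population_size - 1))

n = 1000  # people polled in each member state
for country, population in [("Malta", 400_000), ("Germany", 82_000_000)]:
    print(country, round(fpc(population, n), 5))
```

Both factors are very close to 1, which suggests that with 1000 respondents the size of the country makes little difference to the reliability of the estimate.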

1.5

Take first the marks of just five students (taken from last year's MT171

test): 70, 84, 61, 48, 60. The minimum is 48, the maximum is 84, and the

obvious measure is the mark of the average student, the central value,

which is 61. This is the median. Now consider these figures for the sales

of a company over 30 weeks. This uses the Data Display instruction in

MINITAB.

Data Display

Sales

2817 2913 3040 2539 2323 2763 2898 2834 2786 3095 2411 2972 3405 2852 2756

2754 2492 2804 3263 2902 2705 2724 3114 2400 3260 2658 3194 2672 2716 3069

Arranged in increasing order, these are:

2323 2400 2411 2492 2539 2658 2672 2705 2716 2724 2754 2756 2763 2786 2804

2817 2834 2852 2898 2902 2913 2972 3040 3069 3095 3114 3194 3260 3263 3405

We can see that the minimum is 2323 and the maximum 3405, and the

central value is a measure of the location of the values. But in this case

there isn't a central value, since we have an even number (30) of them:

if we had 29 then the fifteenth value would be in the middle. The

best we can do is take the average of the two in the middle, here the

fifteenth and sixteenth, 2804 and 2817, which is 2810.5. This is the

median.
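
Both medians can be checked with Python's standard library; this is a small sketch using `statistics.median`, which averages the two middle values when the count is even.

```python
from statistics import median

marks = [70, 84, 61, 48, 60]
sales = [
    2817, 2913, 3040, 2539, 2323, 2763, 2898, 2834, 2786, 3095,
    2411, 2972, 3405, 2852, 2756, 2754, 2492, 2804, 3263, 2902,
    2705, 2724, 3114, 2400, 3260, 2658, 3194, 2672, 2716, 3069,
]

print(median(marks))  # 61: odd count, so the middle value
print(median(sales))  # 2810.5: even count, so the mean of 2804 and 2817
```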

To get an idea of the spread around this, we can take the values one

quarter and three quarters up the ordered lists, giving the lower quartile

Q1 and the upper quartile Q3 (we can also call the median Q2). One

quarter of the way from the first to the thirtieth here is number $\frac{1}{4}(1 + 30) = 7.75$, so here we take a value between the seventh and

eighth in the list, $\frac{1}{4}(1 \times 2672 + 3 \times 2705) = 2696.75$. In the same way the

upper quartile is at $\frac{3}{4}(1 + N)$, here 23.25, so we take a point between

the 23rd and 24th values, $\frac{1}{4}(3 \times 3040 + 1 \times 3069) = 3047.25$. The inter-quartile

range is Q3 - Q1, the range containing half the values, here 350.5.
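
The quartile rule used here (position $\frac{1}{4}(1 + N)$ or $\frac{3}{4}(1 + N)$ in the ordered list, with linear interpolation between neighbours) can be sketched as follows. The function names are illustrative, and other software may use slightly different conventions.

```python
import math

sales = [
    2817, 2913, 3040, 2539, 2323, 2763, 2898, 2834, 2786, 3095,
    2411, 2972, 3405, 2852, 2756, 2754, 2492, 2804, 3263, 2902,
    2705, 2724, 3114, 2400, 3260, 2658, 3194, 2672, 2716, 3069,
]

def value_at_position(ordered, position):
    """Interpolate linearly at a 1-based, possibly fractional, position."""
    lower = math.floor(position)
    if lower < 1:
        return ordered[0]
    if lower >= len(ordered):
        return ordered[-1]
    fraction = position - lower
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])

def quartiles(values):
    """Lower and upper quartiles at positions (n+1)/4 and 3(n+1)/4."""
    ordered = sorted(values)
    n = len(ordered)
    q1 = value_at_position(ordered, (n + 1) / 4)
    q3 = value_at_position(ordered, 3 * (n + 1) / 4)
    return q1, q3

q1, q3 = quartiles(sales)
print(q1, q3, q3 - q1)  # 2696.75 3047.25 350.5
```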

We can also define percentiles: the pth percentile is the value p% of

the way up the list. Q1 is the 25th percentile, the median is the 50th

percentile, and Q3 is the 75th percentile.

Another measure of location is the mode, the most frequently

occurring value. We shall not need to use it.

1.6

There are several ways of getting a quick idea of the location and

dispersion of the figures. One is the use of a boxplot, as here.

[Figure: Boxplot of Sales; vertical axis Sales, marked from 2500 to 3500]

Here the line runs from the maximum (here 3405) down to the minimum

(here 2323). The box runs from the upper quartile (here 3047.25, one

quarter of the way from 3040 to 3069) to the lower quartile (here

2696.75, three quarters of the way from 2672 to 2705). The line across

the middle is at the median (here 2810.5, halfway from 2804 to 2817).

So half of the values lie within the box, and we can see the range

easily.

We can also use a histogram. A set of intervals is chosen, and the

number of data values in each interval plotted. It is important to

choose the intervals so that a reader gets a good picture: too many

intervals and the picture is confusing, too few and information is lost.

The advantage of a histogram over a boxplot is that you get a feel for

the distribution of the data values.

[Figure: Histogram of Sales; horizontal axis Sales, marked from 2400 to 3400; vertical axis Frequency, from 0 to 7]

It is clear that this is better than the version below: we have lost too

much information by making the intervals too large.

[Figure: Histogram of Sales with wider intervals; horizontal axis marked from 2400 to 3200; vertical axis Frequency, from 0 to 18]
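
The effect of interval width on a histogram can be sketched by counting the sales figures in intervals of two different widths. The starting point 2300 and the widths 100 and 400 are assumptions chosen for illustration; they are not necessarily the intervals MINITAB used for the figures.

```python
from collections import Counter

sales = [
    2817, 2913, 3040, 2539, 2323, 2763, 2898, 2834, 2786, 3095,
    2411, 2972, 3405, 2852, 2756, 2754, 2492, 2804, 3263, 2902,
    2705, 2724, 3114, 2400, 3260, 2658, 3194, 2672, 2716, 3069,
]

def bin_counts(values, start, width):
    """Count values in intervals [start + k*width, start + (k+1)*width)."""
    counts = Counter((v - start) // width for v in values)
    # Counter returns 0 for empty intervals, so gaps are kept in the output.
    return {start + k * width: counts[k] for k in range(max(counts) + 1)}

for width in (100, 400):
    print(f"interval width {width}")
    for left_edge, count in bin_counts(sales, 2300, width).items():
        print(f"{left_edge:>5} {'*' * count}")
```

With the narrower intervals the clustering around 2700 to 2900 and the gap near 3300 are visible; with only three wide intervals both features disappear.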

This is MINITAB's version.

Stem-and-Leaf Display: Sales

Stem-and-leaf of Sales N = 30

Leaf Unit = 10

1 23 2

4 24 019

5 25 3

7 26 57

14 27 0125568

(5) 28 01359

11 29 017

8 30 469

5 31 19

3 32 66

1 33

1 34 0

Here the centre column gives the first two significant figures of the data

values, and each figure in the right hand column gives the next figure

of each value. So in the first row there is one value close to 2320, and

in the second row there are three values, close to 2400, 2410 and 2490.

The left hand column has one figure in a bracket, which gives the

position of the median, and the number there is the number of values

in that interval. The other figures in the left hand column are the

cumulative numbers of points, counted from each end of the list towards the median. Note

that as well as giving the figures, the shape of the right hand side gives

a feel for the distribution. As with histograms, the unit needs to be

chosen to give the most useful information. If the unit here is changed

to 100 instead, we get this, which is much less helpful.

Stem-and-Leaf Display: Sales

Stem-and-leaf of Sales N = 30

Leaf Unit = 100

(22) 2 3444566777777788888999

8 3 00011224
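
A stem-and-leaf display is simple to build by hand or in code. The following is a minimal sketch, not MINITAB's implementation: digits below the leaf unit are dropped, the last remaining digit becomes the leaf and the digits before it form the stem. Unlike MINITAB, it does not print the cumulative counts or empty stems.

```python
from collections import defaultdict

sales = [
    2817, 2913, 3040, 2539, 2323, 2763, 2898, 2834, 2786, 3095,
    2411, 2972, 3405, 2852, 2756, 2754, 2492, 2804, 3263, 2902,
    2705, 2724, 3114, 2400, 3260, 2658, 3194, 2672, 2716, 3069,
]

def stem_and_leaf(values, leaf_unit):
    """Map each stem to its string of leaves, in increasing order."""
    rows = defaultdict(str)
    for v in sorted(values):
        # Drop digits below the leaf unit, then split off the last digit.
        stem, leaf = divmod(v // leaf_unit, 10)
        rows[stem] += str(leaf)
    return dict(rows)

for unit in (10, 100):
    print(f"Leaf Unit = {unit}")
    for stem, leaves in stem_and_leaf(sales, unit).items():
        print(f"{stem:>3} {leaves}")
```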

1.7

location. If our data values are $y_1, y_2, \ldots, y_n$, then this is defined by

$$\bar{y} = \frac{1}{n}\sum_{j=1}^{n} y_j = \frac{1}{n}\left(y_1 + y_2 + \cdots + y_n\right). \tag{1.1}$$

This is the sample mean. In the example above,

$$\bar{y} = \frac{1}{30}\left(2817 + 2913 + \cdots + 3069\right) = \frac{85131}{30} = 2837.7.$$

We next consider a measure of the dispersion around the sample

mean. The sample variance $s^2$ is given by

$$s^2 = \frac{1}{n-1}\sum_{j=1}^{n}\left(y_j - \bar{y}\right)^2. \tag{1.2}$$

The sample standard deviation $s$ is (not surprisingly) the square root of

that:

$$s = \sqrt{\frac{1}{n-1}\sum_{j=1}^{n}\left(y_j - \bar{y}\right)^2}. \tag{1.3}$$

In the example,

$$s^2 = \frac{1}{29}\left[(2817 - 2837.7)^2 + (2913 - 2837.7)^2 + \cdots + (3069 - 2837.7)^2\right] = \frac{2052556}{29} \approx 70778$$

and $s = \sqrt{70778} \approx 266.0$.
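
These figures can be reproduced with Python's `statistics` module, whose `variance` and `stdev` functions use the $n - 1$ divisor of (1.2) and (1.3). This is a checking sketch only.

```python
from statistics import mean, stdev, variance

sales = [
    2817, 2913, 3040, 2539, 2323, 2763, 2898, 2834, 2786, 3095,
    2411, 2972, 3405, 2852, 2756, 2754, 2492, 2804, 3263, 2902,
    2705, 2724, 3114, 2400, 3260, 2658, 3194, 2672, 2716, 3069,
]

y_bar = mean(sales)    # sample mean, equation (1.1)
s2 = variance(sales)   # sample variance, equation (1.2)
s = stdev(sales)       # sample standard deviation, equation (1.3)

print(round(y_bar, 1), round(s2), round(s, 1))  # 2837.7 70778 266.0
```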

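A useful identity for calculation is

$$\sum_{j=1}^{n}\left(y_j - \bar{y}\right)^2 = \sum_{j=1}^{n} y_j^2 - n\bar{y}^2. \tag{1.4}$$

To see this, expand the square:

$$\sum_{j=1}^{n}\left(y_j - \bar{y}\right)^2 = \sum_{j=1}^{n} y_j^2 - 2\bar{y}\sum_{j=1}^{n} y_j + n\bar{y}^2,$$

and use $\sum_{j=1}^{n} y_j = n\bar{y}$. In

the example, $\sum_{j=1}^{n} y_j^2 = 243628795$, so that

$$\sum_{j=1}^{n}\left(y_j - \bar{y}\right)^2 = 243628795 - \frac{85131^2}{30} = 2052556$$

as before.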
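
The identity (1.4) can be checked numerically on the sales data by computing the sum of squared deviations both ways. The variable names are illustrative.

```python
sales = [
    2817, 2913, 3040, 2539, 2323, 2763, 2898, 2834, 2786, 3095,
    2411, 2972, 3405, 2852, 2756, 2754, 2492, 2804, 3263, 2902,
    2705, 2724, 3114, 2400, 3260, 2658, 3194, 2672, 2716, 3069,
]

n = len(sales)
total = sum(sales)                          # 85131
sum_of_squares = sum(y * y for y in sales)  # 243628795
y_bar = total / n

# Left-hand side of (1.4): sum of squared deviations from the mean
direct = sum((y - y_bar) ** 2 for y in sales)

# Right-hand side of (1.4): sum of squares minus n times the squared mean
shortcut = sum_of_squares - n * y_bar ** 2

print(round(direct), round(shortcut))  # 2052556 2052556
```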

MINITAB will produce descriptive statistics: a set of useful information.

Here we have

Descriptive Statistics: Sales

Variable   N  N*    Mean  SE Mean  StDev  Minimum      Q1  Median      Q3  Maximum
Sales     30   0  2837.7     48.6  266.0   2323.0  2696.8  2810.5  3047.3   3405.0

data, the sample mean, standard deviation and median we have

already seen, the maximum and minimum are obvious, Q1 and Q3 are

the lower and upper quartiles, and the standard error of the mean is

the standard deviation divided by $\sqrt{N}$: here $266 / \sqrt{30} = 48.56$. It will crop up

in chapter 7.
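
The standard error of the mean reported by MINITAB can be reproduced directly from its definition; this is a small sketch using the unrounded standard deviation.

```python
import math
from statistics import stdev

sales = [
    2817, 2913, 3040, 2539, 2323, 2763, 2898, 2834, 2786, 3095,
    2411, 2972, 3405, 2852, 2756, 2754, 2492, 2804, 3263, 2902,
    2705, 2724, 3114, 2400, 3260, 2658, 3194, 2672, 2716, 3069,
]

# Standard error of the mean: standard deviation divided by sqrt(N)
se_mean = stdev(sales) / math.sqrt(len(sales))

print(round(se_mean, 1))  # 48.6
```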

1.8

Grouped data

If the size of the sample is large, it may be better to summarize the data

by giving the frequencies in certain classes instead that is, the number

of data values in each class or interval. Consider two examples.

1.8.1

The numbers of goals scored in 100 matches in a league were recorded

as follows:

Number of goals scored in match    Frequency
0                                   9
1                                  23
2                                  33
3                                  22
4                                  12
5                                   1

In this case a histogram is appropriate.

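[Figure: histogram "Goals in each match"; vertical axis Frequency, from 0 to 35]

The sample mean is

$$\frac{1}{100}\left(9 \times 0 + 23 \times 1 + 33 \times 2 + 22 \times 3 + 12 \times 4 + 1 \times 5\right) = 2.08.$$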
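
For data given as a frequency table, the sample mean is the frequency-weighted average of the values. A minimal sketch for the goals data, with illustrative variable names:

```python
# Number of goals in a match -> number of matches with that many goals
goal_frequencies = {0: 9, 1: 23, 2: 33, 3: 22, 4: 12, 5: 1}

matches = sum(goal_frequencies.values())
total_goals = sum(goals * freq for goals, freq in goal_frequencies.items())
mean_goals = total_goals / matches

print(matches, total_goals, mean_goals)  # 100 208 2.08
```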


1.8.2

Now consider 120 customers in a supermarket: the amounts they spend

are recorded. Here the minimum spent was 1.19 and the maximum

was 78.11. The amounts were divided into classes which run from

0.01 to 10.00, 10.01 to 20.00 and so on. The figures are

Amount spent     Frequency    Midpoint of interval
 0.01 - 10.00    20            5.00
10.01 - 20.00    35           15.00
20.01 - 30.00    17           25.00
30.01 - 40.00    13           35.00
40.01 - 50.00    14           45.00
50.01 - 60.00    12           55.00
60.01 - 70.00     4           65.00
70.01 - 80.00     5           75.00

We can approximate the mean by assuming that all the amounts in

each interval were equal to the midpoint, so we get as an estimate of

the mean 28.58. Since this is a sample anyway, it is likely to be as

good as anything. In the same way we can estimate the standard

deviation as 19.78. But is this the appropriate method? It is clear from

the histogram that the distribution is not symmetric, and so perhaps we

should make use of that. We could take the median, which is 25, with

the lower quartile 15 and the upper quartile 45, instead of the mean

and standard deviation.

We could continue by estimating the standard deviation. In this case,

however, it is unwise for two reasons. The quartiles (and the percentiles

generally) are more useful, and the estimation of the variance is

numerically unreliable.
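
The estimates quoted for the supermarket data can be reproduced by treating every amount as if it sat at the midpoint of its class. The sketch below does this for the mean, standard deviation, median and quartiles; the helper names are illustrative, and the quartile positions follow the $(n + 1)/4$ convention used earlier.

```python
import math

# (midpoint, frequency) for each class in the table of amounts spent
classes = [(5.0, 20), (15.0, 35), (25.0, 17), (35.0, 13),
           (45.0, 14), (55.0, 12), (65.0, 4), (75.0, 5)]

n = sum(f for _, f in classes)

mean_est = sum(m * f for m, f in classes) / n
sd_est = math.sqrt(sum(f * (m - mean_est) ** 2 for m, f in classes) / (n - 1))

def midpoint_at(rank):
    """Midpoint of the class holding the value with this 1-based rank."""
    cumulative = 0
    for m, f in classes:
        cumulative += f
        if rank <= cumulative:
            return m
    raise ValueError("rank exceeds sample size")

def quantile_est(p):
    """Estimate at position p*(n+1), interpolating between neighbouring ranks."""
    position = p * (n + 1)
    lower = math.floor(position)
    a, b = midpoint_at(lower), midpoint_at(min(lower + 1, n))
    return a + (position - lower) * (b - a)

print(round(mean_est, 2), round(sd_est, 2))                       # 28.58 19.78
print(quantile_est(0.25), quantile_est(0.5), quantile_est(0.75))  # 15.0 25.0 45.0
```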

[Figure: histogram "supermarket sales"; horizontal axis marked from 10 to 70; vertical axis Frequency, from 0 to 40]

1. Why use the apparently complicated formula for the sample standard deviation $s = \sqrt{\frac{1}{n-1}\sum_{j=1}^{n}\left(y_j - \bar{y}\right)^2}$ when one could use the interquartile range, or just the average deviation from the mean $\frac{1}{n}\sum_{j=1}^{n}\left|y_j - \bar{y}\right|$? The answer is that we can use calculus easily on $s^2$, but not on the other formulae, and differentiating quadratic functions leads to linear formulae which are very easy to handle.

2. Why the $\frac{1}{n-1}$ factor rather than $\frac{1}{n}$ in (1.2)? The answer is rather subtle, and will be explained later when we meet the idea of unbiasedness.
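
As one illustration of the first remark (a sketch, not necessarily the application the course has in mind), consider choosing the constant $c$ that is closest to all the data in the sense of minimising the sum of squared deviations. Here $S(c)$ is notation introduced only for this example.

```latex
S(c) = \sum_{j=1}^{n} (y_j - c)^2
\quad\Longrightarrow\quad
\frac{dS}{dc} = -2\sum_{j=1}^{n} (y_j - c) = -2\,(n\bar{y} - nc).

\frac{dS}{dc} = 0
\quad\Longleftrightarrow\quad
c = \bar{y},
\qquad
\frac{d^2S}{dc^2} = 2n > 0 .
```

Differentiating the quadratic gives a linear equation whose solution is the sample mean. The corresponding sum of absolute deviations $\sum_{j=1}^{n} |y_j - c|$ is not differentiable at the data values, so the same approach does not apply directly.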
