Head to savemyexams.co.uk for more awesome resources

A Level Maths AQA 

6. Large Data Set

CONTENTS
6.1 Large Data Set

© 2015-2023 Save My Exams, Ltd. · Revision Notes, Topic Questions, Past Papers

6.1 Large Data Set

Using a Large Data Set


What is a large data set?
As part of your course there is a large data set that you can use
It contains lots of information
You are not expected to memorise any results from the data
You will have an advantage if you are familiar with the large data set
Understand what the variables are
Understand the terminology used
Understand the context
You will not get a copy of the large data set in your exam
If you are required to calculate anything using the large data set, you will be given an
extract within the question
What skills can I practise with a large data set?
Cleaning data
There might be missing data
You could identify outliers and question their validity
Sampling and hypothesis testing
You can practise different methods of sampling using the data
You could use a sample to test a hypothesis
Statistical measures and diagrams
You could calculate summary statistics for different variables
You could create different diagrams
You can interpret the summary statistics and diagrams (as it is real data you could
explore the context behind the results)
You could compare summary statistics and diagrams
Do I have to use spreadsheets and other technology?
You will not be assessed on using spreadsheets
However, it is a useful skill for your future career
You could use technology to calculate the summary statistics and create the statistical
diagrams
This will help you to practise these skills whilst using real data
Spreadsheets can calculate summary statistics
In the exam you could use the statistics mode on your calculator
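As a rough illustration of how technology can calculate summary statistics, here is a minimal Python sketch. The masses are made-up values chosen for the example and are not taken from the large data set; a spreadsheet or the statistics mode on a calculator would give the same kind of results.

```python
import statistics

# Hypothetical vehicle masses in kg (invented for illustration,
# NOT taken from the large data set)
masses = [1120, 1250, 1305, 1340, 1410, 1495, 1580]

mean_mass = statistics.mean(masses)      # arithmetic mean
median_mass = statistics.median(masses)  # middle value of the sorted data
sd_mass = statistics.pstdev(masses)      # standard deviation, dividing by n

print(f"mean = {mean_mass:.1f} kg")
print(f"median = {median_mass} kg")
print(f"standard deviation = {sd_mass:.1f} kg")
```

Note that `pstdev` divides by n; if a question expects the n - 1 divisor, `statistics.stdev` could be used instead.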

Summary of the Large Data Set

What is the data about? 


The large data set for AQA comes from the UK Department for Transport Stock Vehicle Database
(loosely referred to as “Cars” or “Vehicle data”)
The full database is too large to work with directly, so AQA have extracted some of the data into
a spreadsheet, and this is what should be used to study parts of the statistics course
Some of the data in the spreadsheet is coded, so keep a close eye on the information
contained under “Definition of fields” and “Field Values”
Beware! As the codes are numbers, it may look as though you can calculate statistics such as
the mean with them, but this would not make sense
e.g. “The mean of the propulsion type data is 2 so the mean propulsion type is diesel”
does not make sense, but it may be okay to say “diesel is the modal (most frequent)
propulsion type of vehicles in the sample”
You are likely to be asked to “use your knowledge of the large data set” – this is where
familiarity with its key features can be an advantage
e.g. knowing that the mass of a vehicle includes an average 75 kg driver
Only mention things that can be justified from the dataset
e.g. there is only one electric vehicle in the whole data set, so do not use or
assume things you may have heard about electric cars in the news recently
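The warning about coded data can be sketched in Python. The list of codes below is a hypothetical sample invented for illustration; only the meanings of codes 1 (petrol) and 2 (diesel) come from these notes.

```python
import statistics

# Meanings given in the notes: 1 = petrol, 2 = diesel
# (the remaining codes are listed under "Field Values")
FUEL_NAMES = {1: "petrol", 2: "diesel"}

# Hypothetical sample of PropulsionTypeid values (invented for illustration)
codes = [1, 2, 2, 1, 2, 2, 1, 2]

# The mode is meaningful for coded (qualitative) data
modal_code = statistics.mode(codes)
print(f"Modal propulsion type: {FUEL_NAMES[modal_code]}")

# statistics.mean(codes) would run without error, but the result
# would be meaningless because the codes are labels, not measurements
```

The software will not stop you from averaging the codes, so it is up to you to recognise that the data is qualitative.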
What variables are included in the large data set?
Reference
A unique number given to each individual vehicle by AQA to index the data
Could be used to easily identify a vehicle and all its information
The first few pieces of data about the vehicles are qualitative
Make
Only the five most frequently registered makes are included
BMW, Ford, Toyota, Vauxhall and Volkswagen
PropulsionTypeid
A data value of 1, 2, 3, 7 or 8 indicates the type of fuel powering the vehicle
(4, 5 and 6 are not used in the AQA extracted dataset)
1 indicates a petrol-powered vehicle and 2 a diesel-powered vehicle
The full codes are listed under “Field Values”
BodyTypeid
Also given by coded values defined in “Field Values”; these represent the style of
vehicle, including (amongst others) convertibles and MPVs (multi-purpose vehicles)
GovRegion
The database only includes cars registered in England (rather than the UK)
The region of a vehicle is determined by the postcode of the current registered keeper
The regions included are London, North West and South West
KeeperTitleid
The last of the coded values defined in “Field Values” represents whether the current
registered keeper is male, female, a company or unknown
The remaining data values are all quantitative
Engine size
Size (capacity) of the engine measured in cubic centimetres (cc)

Year registered 
Vehicles included in the extract were first registered in either 2002 or 2016
The introduction says the precise dates are
3 June 2002 – 9 June 2002
6 June 2016 – 12 June 2016
Knowing that only a few days from each year are included gives an idea of the sheer size
of the full database
Mass
Measured in kilograms (kg)
The mass of an average driver (75 kg) is included in the figures quoted
Emissions
The remaining data values centre around the emissions from the vehicles
CO2 – carbon dioxide emissions, measured in g/km
CO – carbon monoxide emissions, measured in g/km
NOX – oxides of nitrogen emissions, measured in g/km
part – particulate emissions, measured in g/km
(this measure only applies to diesel cars)
hc – hydrocarbon emissions, measured in g/km
Random number
A random number is generated by the spreadsheet for each vehicle; it is not part of the
data set but can be used to randomly select vehicles when sampling
Be aware that the random number changes each time the spreadsheet is refreshed
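One way the random number column could be used for sampling is to sort the vehicles by their random number and take the first few. The Python sketch below mimics that idea under stated assumptions: the reference numbers are hypothetical, and the seed value is an arbitrary choice that keeps the example repeatable (unlike the spreadsheet, where the numbers change on refresh).

```python
import random

# Hypothetical reference numbers standing in for the Reference column
references = list(range(1, 21))

# Fixed seed (arbitrary choice) so the sketch gives the same sample each run
rng = random.Random(2002)

# Mimic the spreadsheet: attach a random number to each vehicle
tagged = [(rng.random(), ref) for ref in references]

# Sort by the random number and keep the first n vehicles
n = 5
sample = [ref for _, ref in sorted(tagged)[:n]]
print(sample)
```

Because every vehicle is equally likely to receive one of the smallest random numbers, this gives a simple random sample of size n without repeats.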
Is the data complete?
Various data values are blank within the spreadsheet; others are 0 where this makes no
sense (such as the mass of the car)
There is no information as to why these occur but be aware they exist
Under the “Definition of fields” tab there is some extra information about the emissions
data
CO2 emissions are known for 83% of vehicles in the whole database
CO emissions are known for 82% of vehicles in the whole database
NOX emissions are known for 81% of vehicles in the whole database
Particulate (part) emissions apply only to diesel vehicles (24% of the whole database)
Hydrocarbon (hc) emissions are known for 51% of vehicles in the whole database
The above means that the data should be cleaned before samples are taken
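As a minimal sketch of what cleaning might look like, the Python below removes rows where the mass is blank or 0 before any sample is taken. The rows are hypothetical values invented for illustration, and `has_valid_mass` is a helper name made up for this example.

```python
# Hypothetical extract: (reference, mass in kg); None stands for a blank cell
rows = [(1, 1340), (2, 0), (3, None), (4, 1495), (5, 1210)]


def has_valid_mass(row):
    """Return True when the mass is present and greater than 0."""
    _, mass = row
    return mass is not None and mass > 0


# Keep only the rows with a usable mass; record what was removed
cleaned = [row for row in rows if has_valid_mass(row)]
removed = [row for row in rows if not has_valid_mass(row)]

print(f"kept {len(cleaned)} rows, removed {len(removed)} rows")
```

Keeping a record of the removed rows makes it easy to state how many values were discarded and why.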
What are the key features I need to know about the data set?
These have been mentioned in the lists above but here is a summary of those we have seen
used in exam and practice papers
There are only five makes, and Ford was the most frequently registered
There is only one electric vehicle in the database
Data is from a few days in summer and only in two years – 2002 and 2016
The mass of a vehicle includes an average 75 kg driver
Emissions data (CO2, CO and NOX) is only known for around 80% of the whole
database
Particulate emissions are only applicable to diesel cars

Worked Example

Jay collects data on the masses of vehicles first registered in 2002, taking a random
sample of size 30.
(a)
Use your knowledge of the large data set to explain why Jay should clean the data
before taking a sample.

(b)
Jay’s calculations show the mean mass of a vehicle in his sample is 1340 kg.
Using your knowledge of the large data set, write down an estimate for the mean
mass of an empty vehicle in the whole database, justifying your answer.
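
A possible line of reasoning for part (b), offered as a sketch rather than an official mark scheme: the large data set records the mass of each vehicle including an average 75 kg driver, so an estimate for the empty mass subtracts 75 kg from the sample mean. This assumes the sample mean is a reasonable estimate for the wider data, even though Jay sampled only vehicles first registered in 2002.

```latex
\[
\text{estimated mean mass of an empty vehicle} = 1340 - 75 = 1265 \text{ kg}
\]
```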

Exam Tip
As vehicle emissions are frequently mentioned in news articles, be wary of
confusing popular opinion with what can be justified using the information
contained within the large data set.
