Question   Maximum   Assessor's marks   Moderator's marks   External Examiner
Number     marks     and initials       and initials        initials
                     1   Initials       2   Initials
Q1         10
Q2         10
Q3         10
Q4         10
Q5         10
Q6         25
Q7         25
Total      100
Marks
QUESTION 1 — Data Science [10 marks]
Briefly outline how each phase of this data science lifecycle could be applied to an electric
car assembly production line and its supply chain. Give at least one specific example that
demonstrates your understanding of each phase.
[10 marks]
QUESTION 2 [10 marks]
Suppose that 1 in 10 adults suffer from a particular common disease for which a diagnostic
test has been developed. The test is such that an individual adult who actually has the
disease will show a positive test 95% of the time, whereas an individual adult without the
disease will show a positive test only 2% of the time.
(a) Draw a probability tree to summarise this information and find the probability of
each outcome. If a randomly selected adult is tested, what is the probability they
test positive? If you answer by drawing a diagram on paper, take a photo and paste
into this document.
[6 marks]
P(positive) = P(positive | disease) × P(disease) + P(positive | no disease) × P(no disease)
= 0.95 × 0.10 + 0.02 × 0.90
= 0.113
(b) Flip your probability tree from part (a). Suppose that a randomly selected adult is
tested and the test is positive. What is the probability that this adult does not have
the disease?
[4 marks]
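P(no disease | positive) = P(positive | no disease) × P(no disease) / P(positive)
= 0.018 / 0.113
≈ 0.1593
If the individual has a positive result, the probability that the individual does not have
the disease is about 15.93%.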
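The two answers above can be checked numerically. The sketch below is only an illustration: the variable names are chosen here, and the probabilities are the ones stated in the question.

```r
# Probabilities stated in the question
p_disease     <- 0.10   # prevalence: 1 in 10 adults
p_pos_disease <- 0.95   # P(positive | disease)
p_pos_healthy <- 0.02   # P(positive | no disease)

# Joint probabilities for the four leaves of the probability tree
leaves <- c(
  disease_pos = p_disease * p_pos_disease,
  disease_neg = p_disease * (1 - p_pos_disease),
  healthy_pos = (1 - p_disease) * p_pos_healthy,
  healthy_neg = (1 - p_disease) * (1 - p_pos_healthy)
)
print(leaves)

# Part (a): total probability of a positive test
p_pos <- leaves[["disease_pos"]] + leaves[["healthy_pos"]]

# Part (b): Bayes' rule, P(no disease | positive)
p_healthy_given_pos <- leaves[["healthy_pos"]] / p_pos

print(round(p_pos, 4))                # 0.113
print(round(p_healthy_given_pos, 4))  # 0.1593
```

The four leaf probabilities sum to 1, which is a quick sanity check on the tree.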
QUESTION 3 — Probability Distributions [10 marks]
Suppose the weight of sugar contained in “1kg” bags actually follows a normal distribution
with mean 1.030kg and standard deviation of 0.014kg.
(a) Use R to find the probability that a randomly selected bag of sugar is underweight.
Include R code and output in your answer.
[4 marks]
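One possible approach to part (a), assuming "underweight" means a bag containing less than the nominal 1 kg, is to evaluate the normal cumulative distribution function at 1 with pnorm. The variable names are illustrative.

```r
mu    <- 1.030  # mean weight in kg, from the question
sigma <- 0.014  # standard deviation in kg, from the question

# P(X < 1) for X ~ N(mu, sigma^2); assumes "underweight" means below 1 kg
p_under <- pnorm(1, mean = mu, sd = sigma)
print(round(p_under, 4))
```

Under that assumption roughly 1.6% of bags are underweight, since 1 kg lies a little over two standard deviations below the mean.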
(b) The top 10% of sugar bags (by weight) are rejected during quality control as being
too heavy. What weight of sugar bag is the division between those that are rejected
and those that are accepted? Give your answer to 3 decimal places and include any
R code you write in your answer.
[4 marks]
Division = 1.030 + 1.2816 × 0.014 ≈ 1.048 kg (the 90th percentile of the distribution)
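The rejection threshold in part (b) is the 90th percentile of the weight distribution, which qnorm returns directly. This is a minimal sketch with illustrative variable names.

```r
mu    <- 1.030  # mean weight in kg, from the question
sigma <- 0.014  # standard deviation in kg, from the question

# Weight below which 90% of bags fall; the heaviest 10% lie above it
cutoff <- qnorm(0.90, mean = mu, sd = sigma)
print(round(cutoff, 3))  # 1.048
```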
(c) The plot below shows the probability density function (pdf) of weight of sugar in
these “1kg” bags. What is the weight of a sugar bag (value of x ) corresponding to
the point shown (solid dot) where the slope of the curve is steepest (or the curvature
changes)?
[2 marks]
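The slope with the largest absolute value is the steepest. A positive slope means the function
is ascending from left to right; a negative slope means it is descending.
For a normal pdf the slope is steepest at the two inflection points x = μ ± σ, where the
curvature changes sign. Here these are 1.030 − 0.014 = 1.016 kg (rising side) and
1.030 + 0.014 = 1.044 kg (falling side); the answer is whichever of the two the dot marks in
the plot.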
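The location of the steepest slope can be checked numerically. The sketch below evaluates the analytic derivative of the normal pdf, f'(x) = −(x − μ)/σ² · f(x), on a fine grid; the grid spacing and variable names are choices made here for illustration.

```r
mu    <- 1.030
sigma <- 0.014

# Fine grid covering four standard deviations either side of the mean
x <- seq(mu - 4 * sigma, mu + 4 * sigma, by = 1e-5)

# Derivative of the normal pdf at each grid point
slope <- -(x - mu) / sigma^2 * dnorm(x, mean = mu, sd = sigma)

x_up   <- x[which.max(slope)]  # steepest rising point
x_down <- x[which.min(slope)]  # steepest falling point
print(round(c(x_up, x_down), 3))  # 1.016 1.044
```

Both values sit exactly one standard deviation from the mean, as expected for the inflection points of a normal curve.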
QUESTION 4 — Big Data [10 marks]
The National Grid operates the electricity transmission network across England and Wales.
It is responsible for balancing the supply of electricity from generators (such as nuclear
power plants and wind farms) and the demand for electricity to be sent to consumers (such
as homes, businesses and factories). On the demand side, “smart” meters installed in over
25 million homes across England and Wales read electricity usage at a rate of 4 times every
hour. On the supply side, electricity generators obtain data about the amount of electricity
being generated by each generating plant, selling price, state of the generation equipment,
weather information, and losses on transmission lines. There are approximately 2000
electricity generating stations in the UK.
(a) Briefly justify why we might or might not consider this electricity data as Big Data.
[4 marks]
Big data is defined as data that contains greater variety, arrives in increasing volumes
and does so with greater velocity. These are known as the 3 Vs. Put simply, big data means
larger, more complex data sets, especially from new data sources.
Big data allows companies to detect trends and spot patterns that can be used for future
benefit. It can help to identify which customers are most likely to buy products, or help
to optimise marketing campaigns by identifying which advertising strategies give the best
return on investment. It is easy to see that organisations that "know" more than their
competitors will outperform their peers in the long run.
One of the key business drivers behind big data is the ability to start data-driven
decision making. Data-driven decision making refers to the practice of basing decisions on
the analysis of data rather than purely on intuition. Instead of making a decision based
on experience, the decision is based on the best available evidence.
Big data also comes with security problems: security and privacy are key concerns. Bad
actors can abuse big data: if information falls into the wrong hands, it can be used for
phishing, scams, and to spread disinformation.
(b) Suppose all of this electricity data is stored in a data centre (in the cloud) and that
the National Grid wish to predict the future demand for electricity over the next one
hour period across England and Wales. Briefly outline a model that may be useful
for this purpose and assess whether or not MapReduce would be suitable for
implementing this model. Be specific about the name of the model (or model type)
and inputs to that model.
[6 marks]
The ARIMA model plays an important role in predicting the future trends of a time series.
Krishna et al. discussed the ARIMA model as a way to capture the process of electricity
consumption within the power system.
An ARIMA model is a class of statistical models for analysing and forecasting time series
data.
It explicitly caters to a suite of standard structures in time series data, and as such
provides a simple yet powerful method for making skilful time series forecasts.
ARIMA is an acronym that stands for AutoRegressive Integrated Moving Average. It is a
generalization of the simpler AutoRegressive Moving Average model and adds the notion of
integration.
This acronym is descriptive, capturing the key aspects of the model itself. Briefly, they
are:
AR: Autoregression. A model that uses the dependent relationship between an observation
and some number of lagged observations.
I: Integrated. The use of differencing of raw observations (e.g. subtracting an observation
from the observation at the previous time step) in order to make the time series
stationary.
MA: Moving Average. A model that uses the dependency between an observation and a residual
error from a moving average model applied to lagged observations.
Each of these components is explicitly specified in the model as a parameter. The standard
notation is ARIMA(p,d,q), where the parameters are replaced with integer values to indicate
the particular ARIMA model being used.
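As an illustration of the ARIMA(p,d,q) notation, the sketch below fits an ARIMA(1,1,0) model with base R's arima and forecasts four steps ahead, which corresponds to one hour if readings arrive four times per hour. The series is simulated and purely hypothetical; it is not real demand data, and the order (1,1,0) is an assumption made for the example rather than a recommended choice.

```r
set.seed(1)

# HYPOTHETICAL stand-in for a quarter-hourly demand series (simulated, not real data)
demand <- 1000 + arima.sim(list(order = c(1, 1, 0), ar = 0.7), n = 200)

# Fit ARIMA(p = 1, d = 1, q = 0): one autoregressive lag after first differencing
fit <- arima(demand, order = c(1, 1, 0))
print(fit)

# Forecast the next four quarter-hour readings, i.e. the next hour
fc <- predict(fit, n.ahead = 4)
print(fc$pred)  # point forecasts
print(fc$se)    # standard errors, which grow with the forecast horizon
```

In practice the inputs would be the aggregated historical demand readings, and p, d and q would be chosen by inspecting the series rather than fixed in advance.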
QUESTION 5 [10 marks]
The figure below illustrates a rotation in which the original blue flag has been rotated to
the position of the red flag.
(a) Write R code to implement the 2 ×2 matrix M that represents this rotation. Use R to
find the inverse of the matrix M (and print it out) and describe the transformation
represented by the inverse of M . Include both R code and output in your answer.
[6 marks]
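The figure is not reproduced here, so the angle of rotation is unknown. The sketch below assumes, purely for illustration, an anticlockwise rotation of 90 degrees about the origin; theta should be replaced with the angle read from the figure.

```r
# ASSUMPTION: 90 degrees anticlockwise; the real angle comes from the figure
theta <- pi / 2

# Standard 2 x 2 rotation matrix (matrix() fills column by column)
M <- matrix(c(cos(theta), sin(theta),
              -sin(theta), cos(theta)), nrow = 2)

# Inverse of M
M_inv <- solve(M)

print(round(M, 3))
print(round(M_inv, 3))
```

For any rotation matrix the inverse equals the transpose and represents a rotation by the same angle in the opposite direction.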
S = [ −1  0 ]
    [  0  1 ]
Describe the transformation represented by S (in words). Is the composite
transformation represented by the matrix product SM the same as the composite
transformation represented by the matrix product MS ? Justify your answer.
[4 marks]
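The matrix S maps (x, y) to (−x, y), which is a reflection in the y-axis. Whether SM and MS agree can be checked by multiplying them out. In the sketch below M is again assumed to be a 90 degree anticlockwise rotation, because the figure is not available; the comparison should be repeated with the actual M.

```r
S <- matrix(c(-1, 0,
               0, 1), nrow = 2)  # reflection in the y-axis

# ASSUMPTION: M is a 90 degree anticlockwise rotation (angle not given in the text)
theta <- pi / 2
M <- matrix(c(cos(theta), sin(theta),
              -sin(theta), cos(theta)), nrow = 2)

SM <- S %*% M  # rotate first, then reflect
MS <- M %*% S  # reflect first, then rotate

print(round(SM, 3))
print(round(MS, 3))
print(isTRUE(all.equal(SM, MS)))  # FALSE for this choice of M
```

Matrix multiplication is not commutative in general, so the order in which the two transformations are applied matters.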
QUESTION 6 [25 marks]
In RStudio, make sure you go to the Session menu, select Set Working Directory and then
Source File Location. Save the “jurassic.csv” file to the same folder as where your R code is
saved.
library(tidyverse)
dino = read_csv('jurassic.csv')
(a) Write R code to find the number of rows and the names of the columns in this
dataset. Include only your R code in your answer.
[2 marks]
(b) Write R code to construct a summary table giving the number of dinosaurs of each
diet in this dataset. Include both R code and output in your answer.
[4 marks]
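Parts (a) and (b) might be approached as sketched below. Because jurassic.csv is not available here, the code builds a small hypothetical stand-in data frame; its rows and the name column are invented for illustration, and only the type and diet columns are named in the questions. Base R is used so the sketch runs without extra packages.

```r
# HYPOTHETICAL stand-in for jurassic.csv (invented rows, for illustration only)
dino <- data.frame(
  name = c("A", "B", "C", "D", "E"),
  type = c("ceratopsian", "ceratopsian", "large theropod",
           "sauropod", "small theropod"),
  diet = c("herbivorous", "herbivorous", "carnivorous",
           "herbivorous", "carnivorous")
)

# Part (a): number of rows and column names
print(nrow(dino))
print(names(dino))

# Part (b): number of dinosaurs of each diet
diet_counts <- table(dino$diet)
print(diet_counts)
```

With the tidyverse loaded, count(dino, diet) from dplyr gives the same summary as a tibble.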
(c) Explain clearly what the R code given below does and why it is needed.
[5 marks]
(d) Using the dataset output from part (c), use plot to reproduce the graphical plot
below as accurately as possible. Give one conclusion that you can draw from the
graphical plot. Include both R code and the plot that your code produces in your
answer.
[7 marks]
(e) Use plot to build a grouped bar chart showing the number of dinosaurs of each
combination of type and diet. What conclusion can you draw about dinosaurs of
type “ceratopsian”? Include both R code and the plot that your code produces in
your answer.
[7 marks]
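A grouped bar chart for part (e) could look like the sketch below. It uses a hypothetical stand-in data frame with invented rows, since the real dataset is not available, and base R's barplot so that it runs without extra packages.

```r
# HYPOTHETICAL stand-in for jurassic.csv (invented rows, for illustration only)
dino <- data.frame(
  type = c("ceratopsian", "ceratopsian", "large theropod",
           "sauropod", "small theropod"),
  diet = c("herbivorous", "herbivorous", "carnivorous",
           "herbivorous", "carnivorous")
)

# Cross-tabulate diet (rows) against type (columns)
counts <- table(dino$diet, dino$type)
print(counts)

# beside = TRUE draws one bar per diet within each type, i.e. a grouped bar chart
barplot(counts, beside = TRUE, legend.text = TRUE,
        xlab = "type", ylab = "number of dinosaurs")

# With ggplot2 loaded, an equivalent chart is:
# ggplot(dino, aes(x = type, fill = diet)) + geom_bar(position = "dodge")
```

A type whose bars appear for only one diet, as ceratopsian does in this invented sample, would indicate that every dinosaur of that type in the data shares that diet.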
QUESTION 7 [25 marks]
In RStudio, make sure you go to the Session menu, select Set Working Directory and then
Source File Location. Save the “hept.csv” file to the same folder as where your R code is
saved.
library(tidyverse)
hept = read_csv('hept.csv')
library(GGally)
ggpairs(select(hept,-name), aes(colour=year))
(a) The scatter matrix below is produced by the R code given above. Use only the
information in the scatter matrix to comment (as comprehensively as possible) on
the relationship between sprint and longjump.
[6 marks]
(c) Assess whether the linear model “points~longjump” is a good fit to the dataset by
considering the corresponding Residuals vs Fitted and Normal Q-Q diagnostic plots
produced by the R code given below. Interpret and comment on the residual of the
athlete with name Nwaba (from the 2016 Olympics).
library(ggfortify)
autoplot(model, data=hept, colour='year')
[5 marks]
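The object model used above is not defined in the code shown; it is presumably the result of fitting points~longjump with lm. The sketch below makes that assumption and uses simulated, hypothetical data in place of hept.csv. Base R's plot method for lm objects stands in for autoplot so that no extra packages are needed.

```r
set.seed(42)

# HYPOTHETICAL stand-in for hept.csv (simulated values, not real results)
hept <- data.frame(longjump = runif(30, min = 5.5, max = 7.0))
hept$points <- 2000 + 700 * hept$longjump + rnorm(30, sd = 150)

# ASSUMPTION: 'model' is the simple linear regression named in the question
model <- lm(points ~ longjump, data = hept)
print(coef(model))

# which = 1 gives Residuals vs Fitted, which = 2 gives Normal Q-Q
plot(model, which = 1:2)

# The observation with the largest absolute residual, for closer inspection
res <- residuals(model)
print(res[which.max(abs(res))])
```

A good fit shows residuals scattered evenly around zero with no curve or funnel shape, and Q-Q points close to the reference line. A large positive residual for one athlete means she scored more points than her long jump alone would predict.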
[6 marks]
— End of assessment —