
For assessor use only

Question number | Maximum marks | Assessor's marks and initials | Moderator's marks and initials | External Examiner initials
Q1              | 10            |                               |                                |
Q2              | 10            |                               |                                |
Q3              | 10            |                               |                                |
Q4              | 10            |                               |                                |
Q5              | 10            |                               |                                |
Q6              | 25            |                               |                                |
Q7              | 25            |                               |                                |
Total marks     | 100           |                               |                                |
QUESTION 1 — Data Science [10 marks]

Consider the Data Science Lifecycle (OSEMN pipeline) as shown below.

Briefly outline how each phase of this data science lifecycle could be applied to an electric
car assembly production line and its supply chain. Give at least one specific example that
demonstrates your understanding of each phase.

Image from https://www.fleetnews.co.uk/news/manufacturer-news/2022/06/30/uk-car-production-up-in-may-after-10-consecutive-months-of-decline

[10 marks]

The five OSEMN phases, and how each could be applied to the electric car assembly production line and its supply chain, are as follows (a brief R sketch follows the list):


1. Obtain: the raw data on which the data science work will be based is collected. For the production line this could include machine sensor readings from each assembly station, production throughput logs, and supplier delivery records for parts such as battery packs.
2. Scrub: the collected data is cleaned, so that incomplete, duplicated or clearly erroneous records are removed or corrected and only useful data is kept. For example, sensor readings logged while a station was powered down, or deliveries with missing part numbers, would be filtered out.
3. Explore: the cleaned data is examined carefully to identify patterns and candidate relationships. For example, exploratory plots could show which supplier delays are associated with slowdowns at particular assembly stations.
4. Model: statistical or machine learning models are built on the prepared data. For example, a regression model could predict daily assembly output from supplier lead times and machine availability.
5. Interpret: the modelling results are translated into conclusions and actions for the business. For example, the model might suggest holding larger buffer stock of the parts whose delivery delays most strongly reduce output.
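A minimal R sketch of how these phases might look in practice for the production line (the file name assembly_line.csv and the columns cycle_time and parts_delay are purely hypothetical):

library(tidyverse)
line = read_csv('assembly_line.csv')                  # Obtain: collect raw production and supplier data
line = drop_na(filter(line, cycle_time > 0))          # Scrub: drop missing or clearly invalid records
summary(line$cycle_time)                              # Explore: inspect distributions and patterns
fit = lm(cycle_time ~ parts_delay, data = line)       # Model: relate assembly time to supply delays
summary(fit)                                          # Interpret: translate coefficients into actions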
QUESTION 2 — Probability [10 marks]

Suppose that 1 in 10 adults suffer from a particular common disease for which a diagnostic
test has been developed. The test is such that an individual adult who actually has the
disease will show a positive test 95% of the time, whereas an individual adult without the
disease will show a positive test only 2% of the time.

(a) Draw a probability tree to summarise this information and find the probability of
each outcome. If a randomly selected adult is tested, what is the probability they
test positive? If you answer by drawing a diagram on paper, take a photo and paste
into this document.
[6 marks]

Given: P(disease) = 0.1

P(positive test | disease) = 0.95

P(positive test | no disease) = 0.02

To find: P(positive test)

P(positive test)

= P(positive test | disease) * P(disease) + P(positive test | no disease) * P(no disease)

= 0.95 * 0.1 + 0.02 * (1 - 0.1)

= 0.113

Probability of a positive test result = 11.3%
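The same calculation can be checked in R (a minimal sketch of the law of total probability):

p_disease = 0.1     # P(disease)
p_pos_d   = 0.95    # P(positive test | disease)
p_pos_nd  = 0.02    # P(positive test | no disease)
p_pos = p_pos_d * p_disease + p_pos_nd * (1 - p_disease)
p_pos               # 0.113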

(b) Flip your probability tree from part (a). Suppose that a randomly selected adult is
tested and the test is positive. What is the probability that this adult does not have
the disease?
[4 marks]

To find: P(no disease | positive test)

Using Bayes' theorem:

P(no disease | positive test)

= P(positive test | no disease) * P(no disease) / P(positive test)

= 0.02 * (1 - 0.1) / 0.113

= 0.1593

If the test result is positive, the probability that the individual does not have the disease is 15.93%.
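A minimal R sketch of the Bayes' theorem calculation, reusing the quantities defined in the part (a) sketch:

p_nd_pos = p_pos_nd * (1 - p_disease) / p_pos   # P(no disease | positive test)
p_nd_pos                                        # approximately 0.1593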
QUESTION 3 — Probability Distributions [10 marks]
Suppose the weight of sugar contained in “1kg” bags actually follows a normal distribution
with mean 1.030kg and standard deviation of 0.014kg.
(a) Use R to find the probability that a randomly selected bag of sugar is underweight.
Include R code and output in your answer.
[4 marks]

pnorm(1, mean=1.030, sd=0.014)   # P(weight < 1 kg), approximately 0.016

(b) The top 10% of sugar bags (by weight) are rejected during quality control as being
too heavy. What weight of sugar bag is the division between those that are rejected
and those that are accepted? Give your answer to 3 decimal places and include any
R code you write in your answer.
[4 marks]

Division ≈ 1.048 kg: this is the 90th percentile of the weight distribution, so bags heavier than 1.048 kg are rejected.
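A minimal R sketch of this calculation (the cutoff is the 90th percentile of the weight distribution, since the heaviest 10% are rejected):

qnorm(0.9, mean = 1.030, sd = 0.014)   # approximately 1.048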

(c) The plot below shows the probability density function (pdf) of weight of sugar in
these “1kg” bags. What is the weight of a sugar bag (value of x ) corresponding to
the point shown (solid dot) where the slope of the curve is steepest (or the curvature
changes)?

[2 marks]

The slope of the pdf has its greatest absolute value at the inflection points of the curve, where the curvature changes sign. For a normal distribution these occur one standard deviation either side of the mean, i.e. at x = 1.030 ± 0.014. Depending on which side of the peak the marked point lies, the corresponding weight is 1.016 kg or 1.044 kg.
QUESTION 4 — Big Data [10 marks]

The National Grid operates the electricity transmission network across England and Wales.
It is responsible for balancing the supply of electricity from generators (such as nuclear
power plants and wind farms) and the demand for electricity to be sent to consumers (such
as homes, businesses and factories). On the demand side, “smart” meters installed in over
25 million homes across England and Wales read electricity usage at a rate of 4 times every
hour. On the supply side, electricity generators obtain data about the amount of electricity
being generated by each generating plant, selling price, state of the generation equipment,
weather information, and losses on transmission lines. There are approximately 2000
electricity generating stations in the UK.

(a) Briefly justify why we might or might not consider this electricity data as Big Data.

[4 marks]

Big Data is usually characterised by the "3 Vs": data of greater variety, arriving in increasing volumes and at higher velocity. Put simply, Big Data means larger and more complex data sets, often drawn from new data sources. Big Data allows organisations to detect trends and spot patterns that can be used for future benefit; it can help identify which customers are likely to buy products, or optimise marketing campaigns by showing which techniques give the best return on investment. Organisations that "know" more than their competitors tend to outperform them in the long run.
One of the key business drivers behind Big Data is data-driven decision making: basing decisions on the analysis of data rather than purely on intuition or experience, so that choices are based on the best available evidence.
Big Data also brings security and privacy concerns: if the data falls into the wrong hands it can be used for phishing, scams and spreading disinformation.
The electricity data described here fits this definition. Around 25 million smart meters each reporting 4 readings per hour, plus continuous data streams from roughly 2000 generating stations, give very high volume and velocity, and the mix of usage readings, prices, equipment state and weather observations gives variety, so it is reasonable to treat it as Big Data.
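As a rough back-of-envelope illustration of the volume and velocity, using only the figures given in the question:

25e6 * 4 * 24   # 25 million meters x 4 readings/hour x 24 hours = 2.4 billion readings per day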

(b) Suppose all of this electricity data is stored in a data centre (in the cloud) and that
the National Grid wish to predict the future demand for electricity over the next one
hour period across England and Wales. Briefly outline a model that may be useful
for this purpose and assess whether or not MapReduce would be suitable for
implementing this model. Be specific about the name of the model (or model type)
and inputs to that model.

[6 marks]

The ARIMA model is well suited to predicting the future values of a time series. Krishna et al. discuss an ARIMA model that captures the pattern of electricity consumption within a power system.
An ARIMA model is a class of statistical models for analysing and forecasting time series data.
It explicitly caters for a set of standard structures in time series data, and as such provides a simple but powerful method for making skilful time series forecasts.
ARIMA is an acronym for AutoRegressive Integrated Moving Average. It is a generalisation of the simpler AutoRegressive Moving Average (ARMA) model and adds the notion of integration.
The acronym is descriptive, capturing the key aspects of the model itself. Briefly, these are:
AR: Autoregression. A model that uses the dependent relationship between an observation and some number of lagged observations.
I: Integrated. The use of differencing of raw observations (e.g. subtracting an observation from the observation at the previous time step) in order to make the time series stationary.
MA: Moving Average. A model that uses the dependency between an observation and a residual error from a moving average model applied to lagged observations.
Each of these components is explicitly specified in the model as a parameter. A standard notation ARIMA(p, d, q) is used, where the parameters are replaced by integer values to indicate the specific ARIMA model being used.

The parameters of the ARIMA model are defined as follows:


p: the number of lag observations included in the model, also called the lag order.
d: the number of times that the raw observations are differenced, also called the degree of differencing.
q: the size of the moving average window, also called the order of moving average.
For this application the input would be the historical hourly demand series obtained by aggregating the smart meter readings (possibly supplemented by recent generation and weather data), and the output would be the forecast total demand for the next one-hour period. ARIMA is well suited to this forecasting task. MapReduce, by contrast, is designed for batch processing of data that can be split into independent chunks; fitting an ARIMA model is an inherently sequential, iterative computation over an ordered time series, so MapReduce is not a natural fit for fitting the model itself, although it could usefully be applied to aggregate the raw meter readings into the hourly demand series fed to the model.
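A minimal R sketch of this kind of forecast, under the assumptions that demand is a hypothetical numeric vector of historical hourly national demand and that the forecast package is installed:

library(forecast)
demand_ts = ts(demand, frequency = 24)   # hourly series with a daily seasonal cycle
fit = auto.arima(demand_ts)              # selects p, d and q (and any seasonal terms) automatically
forecast(fit, h = 1)                     # point forecast and prediction interval for the next hour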
QUESTION 5 — Linear Algebra [10 marks]

The figure below illustrates a rotation in which the original blue flag has been rotated to
the position of the red flag.

(a) Write R code to implement the 2 × 2 matrix M that represents this rotation. Use R to
find the inverse of the matrix M (and print it out) and describe the transformation
represented by the inverse of M. Include both R code and output in your answer.

[6 marks]

Write your answer in this box.
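A minimal sketch of the general pattern only; the actual rotation angle must be read from the figure, and the 90° anticlockwise angle below is purely illustrative:

theta = pi / 2                                    # illustrative angle only
M = matrix(c(cos(theta), sin(theta),
             -sin(theta), cos(theta)), nrow = 2)  # rotation matrix, filled column by column
Minv = solve(M)                                   # inverse of M: rotation by -theta
M
Minv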

(b) Let S be the matrix

S = [ -1  0 ]
    [  0  1 ]
Describe the transformation represented by S (in words). Is the composite
transformation represented by the matrix product SM the same as the composite
transformation represented by the matrix product MS ? Justify your answer.

[4 marks]

Write your answer in this box.
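A minimal sketch for checking the order of composition numerically, assuming M has already been defined as in part (a):

S = matrix(c(-1, 0, 0, 1), nrow = 2)   # S maps (x, y) to (-x, y)
S %*% M                                # apply M first, then S (acting on column vectors)
M %*% S                                # apply S first, then M
identical(S %*% M, M %*% S)            # generally FALSE: the two products differ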

QUESTION 6 — Data Wrangling and Exploratory Data Analysis [25 marks]


Consider the dataset in the file “jurassic.csv” (provided along with these questions on Aula)
on dinosaurs from the Natural History Museum (in London).

In RStudio, make sure you go to the Session menu, select Set Working Directory and then
Source File Location. Save the “jurassic.csv” file to the same folder as where your R code is
saved.

library(tidyverse)
dino = read_csv('jurassic.csv')

(a) Write R code to find the number of rows and the names of the columns in this
dataset. Include only your R code in your answer.
[2 marks]

Write your answer in this box.
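One possible sketch (not necessarily the intended model answer):

nrow(dino)    # number of rows
names(dino)   # column names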

(b) Write R code to construct a summary table giving the number of dinosaurs of each
diet in this dataset. Include both R code and output in your answer.
[4 marks]

Write your answer in this box.
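One possible sketch, assuming the diet column is named diet as implied by the question:

count(dino, diet)   # one row per diet, with the number of dinosaurs of that diet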

(c) Explain clearly what the R code given below does and why it is needed.

newdino = separate(dino, length, c('L', NA), sep='m', convert=TRUE)

[5 marks]

Write your answer in this box.

(d) Using the dataset output from part (c), use plot to reproduce the graphical plot
below as accurately as possible. Give one conclusion that you can draw from the
graphical plot. Include both R code and the plot that your code produces in your
answer.
[7 marks]

Write your answer in this box.

(e) Use plot to build a grouped bar chart showing the number of dinosaurs of each
combination of type and diet. What conclusion can you draw about dinosaurs of
type “ceratopsian”? Include both R code and the plot that your code produces in
your answer.
[7 marks]

Write your answer in this box.
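One possible sketch using ggplot2 (loaded by the tidyverse), assuming the relevant columns of newdino are named type and diet:

ggplot(newdino, aes(x = type, fill = diet)) +
  geom_bar(position = 'dodge')   # grouped bars: one bar per diet within each dinosaur type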

QUESTION 7 — Exploratory Data Analysis and Linear Models [25 marks]


At the Olympic games, athletes in the Heptathlon compete in seven track and field events,
gaining points in each event depending on their performance. The dataset in the file
“hept.csv” (provided along with these questions on Aula) contains data about the athletes
that completed the Heptathlon at the 1988 and 2016 Olympics. The winner (gold medal) is
the athlete with the most overall points (Joyner-Kersee in 1988 and Thiam in 2016).
Performances for the 200m sprint (units of seconds), longjump (units of metres), and javelin
throw (units of metres) events are given, along with the overall points gained over all seven
events.

In RStudio, make sure you go to the Session menu, select Set Working Directory and then
Source File Location. Save the “hept.csv” file to the same folder as where your R code is
saved.

library(tidyverse)
hept = read_csv('hept.csv')
library(GGally)
ggpairs(select(hept,-name), aes(colour=year))

(a) The scatter matrix below is produced by the R code given above. Use only the
information in the scatter matrix to comment (as comprehensively as possible) on
the relationship between sprint and longjump.

[6 marks]

Write your answer in this box.


(b) Suppose we wish to use the performance (time or distance only, not points from
individual events) in each of the events to predict the variable points (the response
variable, e.g., Thiam was the gold medal winner in 2016 with 6810 points). Write R
code to fit the linear model “points~longjump” to the dataset and construct an
appropriate scatterplot including the line of best fit. Write down the equation of the
fitted model. Include your R code and scatterplot in your answer.
[8 marks]

Write your answer in this box.
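One possible sketch, assuming the columns are named points, longjump and year as described in the question; the model object is called model so that it matches the diagnostic code in part (c):

model = lm(points ~ longjump, data = hept)
coef(model)                                 # intercept and slope of the fitted line
ggplot(hept, aes(x = longjump, y = points)) +
  geom_point(aes(colour = year)) +
  geom_smooth(method = 'lm', se = FALSE)    # single line of best fit over all athletes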

(c) Assess whether the linear model “points~longjump” is a good fit to the dataset by
considering the corresponding Residuals vs Fitted and Normal Q-Q diagnostic plots
produced by the R code given below. Interpret and comment on the residual of the
athlete with name Nwaba (from the 2016 Olympics).

library(ggfortify)
autoplot(model, data=hept, colour='year')

[5 marks]

Write your answer in this box.


(d) Suppose we wish to predict points using all of the available variables (except name)
as predictors, i.e., “points~sprint+longjump+javelin+year”. Write R code to fit this
linear model to the dataset and write down the equation of the fitted model. Would
you consider this linear model to be a “better” model than “points~longjump”?
Justify your answers using appropriate R code and output.

[6 marks]

Write your answer in this box.
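One possible sketch, assuming the column names given in the question and that model is the single-predictor model from part (b); comparing adjusted R-squared is one simple way to judge which model fits better:

model2 = lm(points ~ sprint + longjump + javelin + year, data = hept)
summary(model2)                   # coefficients of the fitted model and overall fit statistics
summary(model)$adj.r.squared      # adjusted R-squared of points ~ longjump
summary(model2)$adj.r.squared     # adjusted R-squared of the larger model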

— End of assessment —
