
Ndejje University

Engineering Maths IV

Assignment
Probability and Statistics

FRANCIS WANTONO
Question 1
a) Explain what you understand by a statistical model.
b) Write down a random variable which could be modelled by
i. a discrete uniform distribution,
ii. a normal distribution.
Question 2
A group of students believes that the time taken to travel to college, T minutes, can be assumed
to be normally distributed. Within the college 5% of students take at least 55 minutes to travel
to college and 0.1% take less than 10 minutes.
Find the mean and standard deviation of T.
Question 3
The discrete random variable X has probability function

a) Show that $k = \frac{1}{15}$.

Find the value of


b) E(2X + 3)
c) Var(2X – 4)
Question 4
A drilling machine can run at various speeds, but in general the higher the speed the sooner the
drill needs to be replaced. Over several months, 15 pairs of observations relating to speed, s
revolutions per minute, and life of drill, h hours, are collected.
For convenience the data are coded so that x = s – 20 and y = h – 100 and the following
summations obtained.

$\sum x = 143$; $\sum y = 391$; $\sum x^2 = 2413$; $\sum y^2 = 22441$; $\sum xy = 484$.

a) Find the equation of the regression line of h on s.


b) Interpret the slope of your regression line.
c) Estimate the life of a drill revolving at 30 revolutions per minute.
Question 5
a) Explain briefly the advantages and disadvantages of using the quartiles to summarize a set of
data.
a) Describe the main features and uses of a box plot.

The distances, in kilometres, travelled to school by the teachers in two schools, A and B, in the
same town were recorded. The data for School A are summarized in Diagram 1.

For School B, the least distance travelled was 3 km and the longest distance travelled was 55 km.
The three quartiles were 17, 24 and 31 respectively.
An outlier is an observation that falls either 1.5 × (interquartile range) above the upper quartile
or 1.5 × (interquartile range) below the lower quartile.
b) Draw a box plot for School B.
c) Compare and contrast the two box plots.
Question 6
For any married couple who are members of a tennis club, the probability that the husband has
a degree is $\frac{3}{5}$ and the probability that the wife has a degree is $\frac{1}{2}$. The probability that the
husband has a degree, given that the wife has a degree, is $\frac{11}{12}$.

A married couple is chosen at random.


a) Show that the probability that both of them have degrees is $\frac{11}{24}$.
b) Draw a Venn diagram to represent these data.
Find the probability that
c) only one of them has a degree,
d) Neither of them has a degree.
Two married couples are chosen at random.
e) Find the probability that only one of the two husbands and only one of the two wives
have degrees.

Question 7
A piece of string AB has length 12 cm. A child cuts the string at a randomly chosen point P, into
two pieces. The random variable X represents the length, in cm, of the piece AP.
a) Suggest a suitable model for the distribution of X and specify it fully.
b) Find the cumulative distribution function of X.
c) Write down P(X < 4).
Question 8
A manufacturer of chocolates produces 3 times as many soft centred chocolates as hard centred
ones.
Assuming that chocolates are randomly distributed within boxes of chocolates, find the
probability that in a box containing 20 chocolates there are;
a) Equal numbers of soft centred and hard centred chocolates,
b) Fewer than 5 hard centred chocolates.
A large box of chocolates contains 100 chocolates.
c) Write down the expected number of hard centred chocolates in a large box.
Question 9
The continuous random variable X has probability density function f(x) given by

a) Sketch f(x) for all values of x.


b) Calculate E(X).
c) Show that the standard deviation of X is 0.459 to 3 decimal places.
d) Show that for $1 \le x \le 3$, $P(X \le x)$ is given by $\frac{1}{80}(x^4 - 1)$ and specify fully the
cumulative distribution function of X.
e) Find the interquartile range for the random variable X.
Some statisticians use the following formula to estimate the interquartile range:
Interquartile range $= \frac{4}{3} \times$ standard deviation.

Question 10
At the end of a season a league of eight ice hockey clubs produced the following table showing
the position of each club in the league and the average attendances (in hundreds) at home
matches.

a) Calculate the Spearman rank correlation coefficient between position in the league and
average home attendance.
b) Stating clearly your hypotheses and using a 5% two-tailed test, interpret your rank
correlation coefficient.
Many sets of data include tied ranks.
c) Explain briefly how tied ranks can be dealt with.
Question 11
The three tasks most frequently carried out in a garage are A, B and C. For each of the tasks the
times, in minutes, taken by the garage mechanics are assumed to be normally distributed with
means and standard deviations given in the following table.

Assuming that the times for the three tasks are independent, calculate the probability that;
d) the total time taken by a single randomly chosen mechanic to carry out all three tasks
lies between 533 and 655 minutes,
e) a randomly chosen mechanic takes longer to carry out task B than task C.
Question 12
In 1798, Henry Cavendish estimated the density of the earth by using a torsion balance. His 29
measurements follow, expressed as a multiple of the density of water.

a) Calculate;

i. Construct a frequency distribution table starting from 12.1-12.4

ii. Use the data to obtain measures of central tendency

iii. Obtain the measures of dispersion

iv. Draw the necessary graphs

b) Using information provided below about the hydrocarbon levels and oxygen purity;

Hydrocarbon Level x (%) Oxygen Purity y (%)


0.99 90.01
1.02 89.05
1.15 91.43
1.29 93.74
1.46 96.73
1.36 94.45
0.87 87.59
1.23 91.77
1.55 99.42
1.4 93.65
1.19 93.54
1.15 92.52
0.98 90.56
1.01 89.54
1.11 89.85
1.2 90.39
1.26 93.25
1.32 93.41
1.43 94.98
0.95 87.33

i. Estimate the regression equation for the line of best fit of y on x


ii. Determine the coefficient of determination
iii. Find the sum of squares of the residuals (SSR)
iv. Estimate the variance of the error
v. Predict the oxygen purity when the hydrocarbon level is 1%.

THE END

HINT
The regression equation is:
$\hat{y} = a + bx$
The exact equation is:
$y = a + bx + e$
where $e = y - \hat{y}$ is the residual (error), $\hat{y}$ is the estimated value and $y$ is the observed value.
$b = \dfrac{S_{xy}}{S_{xx}}$
$a = \bar{y} - b\bar{x}$
$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

where SSE is the sum of squares for errors.

Standard error of the estimate: $S_\varepsilon = \sqrt{\dfrac{SSE}{n-2}}$

Variance of the error: $S_\varepsilon^2 = \dfrac{SSE}{n-2}$
Also;
The coefficient of determination is used to measure the strength of the linear relationship.
Coefficient of determination $= R^2 = r^2$,
where $r$ is the correlation coefficient:
$r = \dfrac{n\sum xy - \sum x \sum y}{\sqrt{n\sum x^2 - (\sum x)^2}\,\sqrt{n\sum y^2 - (\sum y)^2}}$
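The hint formulas can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the hint does not define $S_{xy}$ and $S_{xx}$, so the usual definitions $S_{xy} = \sum (x - \bar{x})(y - \bar{y})$ and $S_{xx} = \sum (x - \bar{x})^2$ are assumed, the function name `fit_line` is hypothetical, and the five data points are made up for the demonstration (they are not the assignment data).

```python
import math

def fit_line(xs, ys):
    """Least-squares line y_hat = a + b*x with the summary quantities from the hint."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n

    # Assumed definitions: S_xy = sum((x - x_bar)(y - y_bar)), S_xx = sum((x - x_bar)^2)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)

    b = s_xy / s_xx          # slope
    a = y_bar - b * x_bar    # intercept

    # SSE: squared differences between observed y and estimated y_hat
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    var_error = sse / (n - 2)

    # Correlation coefficient from the raw-sum formula in the hint
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    r = (n * sum_xy - sum_x * sum_y) / (
        math.sqrt(n * sum_x2 - sum_x ** 2) * math.sqrt(n * sum_y2 - sum_y ** 2)
    )

    return {
        "a": a,
        "b": b,
        "sse": sse,
        "var_error": var_error,
        "std_error": math.sqrt(var_error),
        "r": r,
        "r_squared": r * r,
    }

# Hypothetical toy data, for illustration only
toy_x = [1, 2, 3, 4, 5]
toy_y = [2, 4, 5, 4, 5]
result = fit_line(toy_x, toy_y)
print(result)
```

For the toy data the fitted line is approximately $\hat{y} = 2.2 + 0.6x$, with SSE close to 2.4 and an error variance close to 0.8.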

The Spearman Correlation
Spearman’s correlation coefficient is a statistical measure of the strength of a monotonic
relationship between paired data. In a sample it is denoted by $r_s$ or $\rho$ and is by design constrained
to $-1 \le r_s \le 1$.
Its interpretation is similar to that of Pearson’s: the closer $r_s$ is to ±1, the stronger the
monotonic relationship.
Verbally, we can describe the strength of the correlation using the absolute value of $r_s$:
– 0.00-0.19 “very weak”
– 0.20-0.39 “weak”
– 0.40-0.59 “moderate”
– 0.60-0.79 “strong”
– 0.80-1.00 “very strong”
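The verbal scale above can be expressed as a small lookup. This is a minimal sketch: the function name `describe_strength` is hypothetical, and the band edges are assumed to fall at 0.20, 0.40, 0.60 and 0.80, since the list does not say how a value such as 0.195 should be classified.

```python
def describe_strength(r_s):
    """Map a Spearman coefficient to the verbal label for its absolute value."""
    magnitude = abs(r_s)
    if magnitude > 1:
        raise ValueError("a correlation coefficient must lie between -1 and 1")
    # Assumed band edges: each band starts at the lower value shown in the list
    if magnitude < 0.20:
        return "very weak"
    if magnitude < 0.40:
        return "weak"
    if magnitude < 0.60:
        return "moderate"
    if magnitude < 0.80:
        return "strong"
    return "very strong"

print(describe_strength(-0.65))
```

The sign of $r_s$ is ignored here because the scale describes strength only; the direction of the relationship is read from the sign separately.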
The Spearman correlation is used in two general situations:
It measures the relationship between two ordinal variables; that is, X and Y both consist of
ranks.
It measures the consistency of direction of the relationship between two variables. In this case,
the two variables must be converted to ranks before the Spearman correlation is computed.
The calculation of the Spearman correlation requires:
Two variables are observed for each individual.
The observations for each variable are rank ordered.
Note that the X values and the Y values are ranked separately.
After the variables have been ranked, the Spearman correlation is computed by either:
i. Using the Pearson formula with the ranked data.
ii. Using the special Spearman formula (assuming there are few, if any, tied ranks).
When there are no tied ranks:

$\rho = 1 - \dfrac{6 \sum d_i^2}{n(n^2 - 1)}$
Where di = difference in paired ranks and n = number of cases.
When there are tied ranks:
$\rho = \dfrac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}}$
Where i = paired score
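Both formulas can be sketched in Python as follows. The function names are hypothetical, the data are made up for illustration, and tied values are assumed to share the average of the rank positions they occupy, which is a common convention but is not stated in the text above.

```python
import math

def average_ranks(values):
    """Rank values from smallest (1) upwards; tied values share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next value equals the current one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of positions start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks

def spearman_no_ties(xs, ys):
    """Special formula 1 - 6*sum(d^2) / (n(n^2 - 1)); intended for data without ties."""
    rank_x, rank_y = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

def spearman_on_ranks(xs, ys):
    """Pearson formula applied to the ranks; also usable when ties are present."""
    rank_x, rank_y = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    mean_x, mean_y = sum(rank_x) / n, sum(rank_y) / n
    numerator = sum((rx - mean_x) * (ry - mean_y) for rx, ry in zip(rank_x, rank_y))
    denominator = math.sqrt(
        sum((rx - mean_x) ** 2 for rx in rank_x) * sum((ry - mean_y) ** 2 for ry in rank_y)
    )
    return numerator / denominator

# Hypothetical data without ties: both formulas should agree
toy_x = [10, 20, 30, 40, 50]
toy_y = [1, 3, 2, 5, 4]
print(spearman_no_ties(toy_x, toy_y), spearman_on_ranks(toy_x, toy_y))
```

With no ties the two functions give the same value (about 0.8 for the toy data). When ties are present, only the rank-based Pearson version should be relied on.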

Reading Assignment
Read about how to test for the significance of a correlation
Question 13 (Not Compulsory)
The following data comprises 23 groundwater samples that were collected and analyzed to
determine the Uranium concentration (ppb) and the TDS (mg/L).
a) Plot a scatter diagram for the data
b) From the scatter diagram plotted, would you use the Pearson or Spearman’s method to
determine the correlation? Explain the reasons for your choice
c) Determine and explain the relationship between the 2 variables.
d) Investigate the significance of the correlation established

Question 14 (Not Compulsory)
Using the data provided below, construct rankings for each variable and calculate the rank order
correlation coefficient using Spearman's Rho. Determine if the correlation is statistically
significant. Show all work. Draw a conclusion related to the correlation.

Limitations of r
• Observe that seemingly high values of r e.g. r = 0.70, explain only about 50% of the
variability in the response variable y. So take care when interpreting correlation
coefficients.
• A low value for r does not necessarily imply absence of a relationship – could be a curved
relationship! So plotting the data is also important
• Tests exist for testing if there is no association. But depending on the sample size, even
low values of r e.g. r = 0.20 can give significant results – not a very useful finding!
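The three points above can be checked numerically. The sketch below uses made-up data and hypothetical function names. For the third point it assumes the standard test statistic $t = r\sqrt{n-2}/\sqrt{1-r^2}$ for the hypothesis of no linear association, which is not given in this handout.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient from deviations about the means."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

def t_statistic(r, n):
    """Assumed test statistic for H0: no linear association (n - 2 degrees of freedom)."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# Point 1: r = 0.70 corresponds to r^2 = 0.49, about half of the variability
explained = 0.70 ** 2

# Point 2: a perfect curved relationship (y = x^2) can still give r = 0
curve_x = [-3, -2, -1, 0, 1, 2, 3]
curve_y = [x ** 2 for x in curve_x]
r_curved = pearson_r(curve_x, curve_y)

# Point 3: the same weak r = 0.20 gives a larger test statistic as n grows
t_small_sample = t_statistic(0.20, 20)
t_large_sample = t_statistic(0.20, 100)

print(explained, r_curved, t_small_sample, t_large_sample)
```

The second result shows why plotting matters: y is completely determined by x, yet the linear correlation is zero.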
