
APPLIED SCIENCE PRIVATE UNIVERSITY

DEPARTMENT OF MECHANICAL AND INDUSTRIAL ENGINEERING

Laboratory of statistics and Excel programming

Student manual

Prepared by:
Eng. Rania Obaid
Exp#1: Descriptive statistics

Statistics is a mathematical science including methods of collecting, organizing and
analyzing data in such a way that meaningful conclusions can be drawn from them.
In general, its investigations and analyses fall into two broad categories called
descriptive and inferential statistics.
Descriptive statistics deals with the processing of data without attempting to draw any
inferences from it. The data are presented in the form of tables and graphs. The
characteristics of the data are described in simple terms. Events that are dealt with
include everyday happenings such as accidents, prices of goods, business, incomes,
sports data, and population data.

Statistical measures:

In the real world, there are many situations in which a large group of data is
collected. In order to make sense of the data, statistical measures are used. These
measures help to generalize a group of data, make inferences about it, and compare
it with other groups of data. Statistical measures include the mean, median,
mode, range, and others. Depending on the situation, certain measures may be
more helpful than others in interpreting data.

- The mean, commonly referred to as the average, is the sum of all the data
items divided by the number of data items.
Mean= Sum of observations/Number of observations
- The median is the middle number in a set of data that is ordered from least to
greatest. If there is an even number of data, you take the average of the
middle two numbers to find the median.
If the total number of observations is odd, the median is:
Median = {(n+1)/2}th term
If the total number of observations is even, the median is:
Median = [(n/2)th term + {(n/2)+1}th term]/2
where n is the number of observations.

- The mode is the number that occurs most often.


- Standard deviation is a measure of the amount of variation or dispersion of
a set of values.
- Variance is defined as the average of the squared differences from the mean.

Variance = $\sigma^2$, where $\sigma$ is the standard deviation.

- Kurtosis is a measure of the "tailedness" of the probability distribution of a real-valued
random variable.

$$\text{Kurtosis} = \frac{n \sum_{i=1}^{n} (Y_i - \bar{Y})^4}{\left(\sum_{i=1}^{n} (Y_i - \bar{Y})^2\right)^2}$$

Where

- $Y_i$: ith variable of the distribution
- $\bar{Y}$: mean of the distribution
- $n$: number of variables in the distribution

- Skewness is a measure of the asymmetry of the probability distribution of a real-valued
random variable about its mean. The skewness value can be positive, zero,
negative, or undefined.

skewness = (3 * (mean - median)) / standard deviation

- Range is defined as the difference between the maximum value and the minimum
value.

Range = Maximum Value - Minimum Value
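The measures listed above can be checked quickly outside Excel. The following is a minimal sketch using Python's standard `statistics` module on the data set x = 1, 2, 3, 3, 6 from this experiment's example; the variable names are illustrative, and note that `statistics.variance` uses the sample form (division by n - 1).

```python
import statistics

# Data set from the worked example in this experiment.
x = [1, 2, 3, 3, 6]

mean = statistics.mean(x)            # sum of observations / number of observations
median = statistics.median(x)        # middle value of the ordered data
mode = statistics.mode(x)            # most frequent value
variance = statistics.variance(x)    # sample variance (divides by n - 1)
std_dev = statistics.stdev(x)        # square root of the sample variance
data_range = max(x) - min(x)         # maximum value - minimum value

# Skewness by the formula 3 * (mean - median) / standard deviation.
pearson_skew = 3 * (mean - median) / std_dev

print(mean, median, mode, variance, round(std_dev, 3), data_range, pearson_skew)
```

Because the mean and the median of this data set are both 3, the median-based skewness formula gives zero here.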


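Example:

Find the descriptive statistics for the data x = 1, 2, 3, 3, 6, using the
working table below.

x          $x-\bar{x}$   $(x-\bar{x})^2$   $(x-\bar{x})/s$   $((x-\bar{x})/s)^3$   $(x-\bar{x})^4$
1          -2            4                 -1.069            -1.22                 16
2          -1            1                 -0.535            -0.15                 1
3          0             0                 0                 0                     0
3          0             0                 0                 0                     0
6          3             9                 1.604             4.12                  81
Sum = 15   0             14                0                 2.749                 98

Mean: $\bar{x} = \sum x / n = 15/5 = 3$

Variance: $s^2 = \sum (x-\bar{x})^2 / (n-1) = 14/4 = 3.5$

Standard deviation: $s = \sqrt{3.5} = 1.871$

Skewness: $\frac{n}{(n-1)(n-2)} \sum \left(\frac{x_i-\bar{x}}{s}\right)^3 = \frac{5}{4 \times 3} \times 2.749 = 1.145$

Kurtosis: $\frac{1}{(n-1)\,s^4} \sum (x-\bar{x})^4 = \frac{98}{4\,(1.871)^4} = 2$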
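The hand calculation above can be reproduced step by step. This is a minimal sketch that follows the same formulas as the worked example (sample standard deviation, skewness with the n/((n-1)(n-2)) factor, kurtosis divided by (n-1)s^4); the variable names are illustrative only.

```python
import math

x = [1, 2, 3, 3, 6]
n = len(x)

mean = sum(x) / n
# Sample standard deviation: sqrt( sum (x - mean)^2 / (n - 1) )
s = math.sqrt(sum((v - mean) ** 2 for v in x) / (n - 1))

# Standardized deviations (x - mean) / s, as in the fourth table column.
z = [(v - mean) / s for v in x]

# Skewness = n / ((n - 1)(n - 2)) * sum of cubed standardized deviations.
skewness = n / ((n - 1) * (n - 2)) * sum(zi ** 3 for zi in z)

# Kurtosis = sum (x - mean)^4 / ((n - 1) * s^4), the form used in the example.
kurtosis = sum((v - mean) ** 4 for v in x) / ((n - 1) * s ** 4)

print(round(s, 3), round(skewness, 3), round(kurtosis, 3))
```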
Installing data analysis in excel:

1. Click the File tab, click Options, and then click the Add-Ins category.

2. In the Manage box, select Excel Add-ins and then click Go.
3. In the Add-Ins box, check the Analysis ToolPak check box, and then click OK.

4. If Analysis ToolPak is not listed in the Add-Ins available box, click Browse to locate it.
5. If you are prompted that the Analysis ToolPak is not currently installed on your
computer, click Yes to install it.

6. On the Data tab, click Data Analysis, then select Descriptive Statistics.

7. Click OK, then select the Input Range (the data you want to analyze).

8. Select the Output Range, check Summary statistics, Kth Largest and Kth Smallest,
then click OK.

The analysis will appear in the table as the following:


Classwork 1:

In the table below column A shows the suggested retail price (SRP) for a book. In
column B, the worksheet shows the units sold of each book through one popular
bookselling outlet. You might choose to use the Descriptive Statistics tool to
summarize this data set by using excel.

Column A   Column B
SRP        Units     Revenue
44.95      982       $44,141
42.95      792       $34,016
64.95      800       $51,440
44.95      744       $35,600
59.95      712       $47,480
49.95      609       $30,420
43.95      342       $15,822
Exp#2: Histogram

A frequency distribution shows how often each different value in a set of data occurs.
A histogram is the most commonly used graph to show frequency distributions. We
use a histogram when:

- The data are numerical.
- You want to see the shape of the data's distribution, especially when determining
whether the output of a process is distributed approximately normally.
- You are analyzing whether a process can meet the customer's requirements.
- You are analyzing what the output from a supplier's process looks like.
- You want to see whether a process change has occurred from one time period to another.
- You want to determine whether the outputs of two or more processes are different.
- You wish to communicate the distribution of data quickly and easily to others.

How to build a histogram

If you are creating a histogram manually or using Excel, you will need to calculate
the number of bins, the bin width, and the bin intervals.

Example:

The following is a list of salaries (in dollars) of employees found in Morgan Company:

850 600 930 650 570 800 1200 620 950 740 490 1150 550 840 770 990
830 750 1220 1700

Make a frequency distribution table for this data and find the histogram describing
this data.

Solution:

We omit the units (dollars) while calculating. The values go from 490 to 1700, and we
divide this range into 4 intervals of equal length:
The maximum value = 1700
The minimum value = 490
Bin width = (max - min)/n, where n is the number of intervals
Bin width = (1700 - 490)/4 = 302.5

Bin (upper limit)   Frequency
792.5               9
1095                7
1397.5              3
1700                1
Total               20
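The frequency table can be verified with a few lines of code. This is a minimal sketch, assuming each value is counted in the first bin whose upper limit is greater than or equal to it; the names `upper_limits` and `frequencies` are illustrative.

```python
salaries = [850, 600, 930, 650, 570, 800, 1200, 620, 950, 740,
            490, 1150, 550, 840, 770, 990, 830, 750, 1220, 1700]

n_bins = 4
low, high = min(salaries), max(salaries)
width = (high - low) / n_bins                      # (1700 - 490) / 4

# Upper limit of each bin: min + width, min + 2*width, ...
upper_limits = [low + width * k for k in range(1, n_bins + 1)]

frequencies = [0] * n_bins
for value in salaries:
    # Count the value in the first bin whose upper limit is >= the value.
    for k, limit in enumerate(upper_limits):
        if value <= limit:
            frequencies[k] += 1
            break

for limit, count in zip(upper_limits, frequencies):
    print(limit, count)
```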
Building histogram in Excel:

1. First, enter the bin numbers (upper levels) in the range

2. On the Data tab, in the Analysis group, click Data Analysis.

3. Select Histogram and click OK.


4. Select the range A2:A19.

5. Click in the Bin Range box and select the range C4:C8.

6. Click the Output Range option button, click in the Output Range box and select cell
F3.

7. Check Chart Output.


8. Click OK.

9. To remove the space between the bars, right click a bar, click Format Data Series
and change the Gap Width to 0%.

Class work 2:

The following is a list of prices (in dollars) of birthday cards found in various drug
stores:

1.45 2.20 0.75 1.23 1.25 1.25 3.09 1.99 2.00 0.78 1.32 2.25 3.15 3.85 0.52
0.99 1.38 1.75 1.22 1.75

Using Excel, make a frequency distribution table for this data and build the
histogram with a gap width of zero.

Homework 2:

The following data represents the actual liquid weight in 16 "twelve-ounce" cans.
Construct a frequency distribution with four classes from this data.
11.95 11.91 11.86 11.94 12.00 11.93 12.00 11.94
12.10 11.95 11.99 11.94 11.89 12.01 11.99 11.94
Find the upper bound for each range and build a histogram using Excel.
Exp #3 Ranking and percentile

In statistics, percentile rank is used to compare a specific score to other scores in a
group of scores.

In the world of statistics, percentile rank refers to the percentage of scores that are
equal to or less than a given score. Percentile ranks, like percentages, fall on a
continuum from 0 to 100. For example, a percentile rank of 35 indicates that 35% of
the scores in a distribution of scores fall at or below the score at the 35th percentile.
Percentile ranks are useful when you want to quickly understand how a particular
score compares to the other scores in a distribution of scores. For instance, knowing
someone scored 235 points on an exam doesn't tell you much. You don't know how
many points were possible, and even if you did, you wouldn't know how that person's
score compared to the rest of his classmates. If, however, you were told that he
scored at the 90th percentile, then you would know that he did as well or better than
90% of his class.
To identify the percentile rank of a score:

$P = \frac{100\,R}{n+1}$

Where,

P = Percentile Rank

R = Rank

n = number of ranks

Percentile is mainly applied to the data set of scores where ranking needs to be
identified. In addition, every 25th percentile is known as one quartile. Out of 100, the
25th percentile is known as 1st quartile. 50th percentile is known as 2nd quartile or
median, and the 75th percentile is known as 3rd quartile. The difference between the 3rd and
1st quartiles is called the interquartile range.

In statistics, “ranking” refers to the data transformation in which numerical or ordinal
values are replaced by their rank when the data are sorted. If, for example, the
numerical data 3.4, 5.1, 2.6, 7.3 are observed, the ranks of these data items would
be 2, 3, 1 and 4 respectively. In another example, the ordinal data hot, cold, warm
would be replaced by 3, 1, and 2. In these examples, the ranks are assigned to
values in ascending order. (In some other cases, descending ranks are used.) Ranks
are related to the indexed list of order statistics, which consists of the original dataset
rearranged into ascending order.
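The two ranking examples above can be sketched in code. This is an illustrative helper, not a standard library function; it assumes ascending ranks and, for tied values (a case the text does not cover), gives every tied value the lowest rank of the group.

```python
def ascending_ranks(values, key=None):
    """Replace each value by its 1-based position in the ascending sort."""
    ordered = sorted(values, key=key)
    # list.index returns the first match, so ties share the lowest rank.
    return [ordered.index(v) + 1 for v in values]

numeric_ranks = ascending_ranks([3.4, 5.1, 2.6, 7.3])

# Ordinal data needs an explicit ordering; this mapping is an assumption
# that encodes cold < warm < hot.
temperature_order = {"cold": 0, "warm": 1, "hot": 2}
ordinal_ranks = ascending_ranks(["hot", "cold", "warm"], key=temperature_order.get)

print(numeric_ranks)
print(ordinal_ranks)
```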
Example 1:
The following data represents the marks out of 110 for math students:
33, 48, 57, 65, 67, 69, 75, 77, 77, 78, 80, 83, 85, 86, 87, 88, 89, 90, 99, 99, 99, 99,
100, 101, 105
The position of the 20th percentile is:
$i = (P/100) \times N = (20/100) \times 25 = 5$ (a whole number)
Because i is a whole number, we take the average of the values at positions i and i+1,
which means the 5th and 6th positions:
20th percentile = (67 + 69)/2 = 68
That means 20% of the students scored 68 or lower, and 80% scored above 68.
For the 25th percentile:
$i = (25/100) \times 25 = 6.25$ (not a whole number)
So we round the position up to 7:
25th percentile = 75
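The rule used in Example 1 (average positions i and i+1 when i is a whole number, otherwise round i up) can be written as a short function. This is a minimal sketch of that rule only; `percentile_value` is an illustrative name, and other tools may interpolate differently and return slightly different values.

```python
import math

def percentile_value(sorted_data, p):
    """Value at the p-th percentile using the position rule i = (p/100) * N."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    n = len(sorted_data)
    i = p * n / 100
    if i == int(i):
        i = int(i)
        # Whole number: average the values at positions i and i + 1 (1-based).
        return (sorted_data[i - 1] + sorted_data[i]) / 2
    # Not a whole number: round up and take the value at that position.
    return sorted_data[math.ceil(i) - 1]

marks = [33, 48, 57, 65, 67, 69, 75, 77, 77, 78, 80, 83, 85, 86, 87, 88, 89,
         90, 99, 99, 99, 99, 100, 101, 105]

print(percentile_value(marks, 20))
print(percentile_value(marks, 25))
```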

Using the Rank and Percentile in excel

Now that we have added the Analysis ToolPak, select the Data Analysis button and
select Rank and Percentile as shown below.
- First, select the Input Range.
- If your data is in columns, leave Grouped By as Columns (otherwise select Rows).
- If you have labels for your columns, select Labels in first row.
- Lastly, select your Output options: select Output Range to place the results in a
specific place on your worksheet, or select New Worksheet Ply or New
Workbook, depending on your preference.

Class work 3:

The following table represents the points earned by students in their final exam. Find
the rank and percentile for this data by using Data Analysis in Excel.

ID# Points
1147714 2
8878227 6
2374437 9
5513818 10
5609061 16
9575015 3
1151895 1
257927 4
845499 13
1969208 14
1832668 18
5989062 5
2515549 11
9354869 19
0797729 15
5787235 20
2140321 8
4462612 12
3442522 17
6162906 7
Exp #4 Correlation and covariance

Correlation is a statistical measure that expresses the extent to which two variables
are linearly related (meaning they change together at a constant rate). It's a common
tool for describing simple relationships without making a statement about cause and
effect. The sample correlation coefficient, r, quantifies the strength of the relationship.

Correlation coefficient formulas are used to find how strong a relationship is between
data. The formulas return a value between -1 and 1, where:

- 1 indicates a perfect (the strongest possible) positive relationship.
- -1 indicates a perfect (the strongest possible) negative relationship.
- A result of zero indicates no linear relationship at all.

There are several types of correlation coefficient formulas.

One of the most commonly used formulas is Pearson's correlation coefficient
formula. If you're taking a basic stats class, this is the one you'll probably use:

Pearson correlation coefficient:

$$r = \frac{n\sum XY - \sum X \sum Y}{\sqrt{\left[n\sum X^2 - \left(\sum X\right)^2\right]\left[n\sum Y^2 - \left(\sum Y\right)^2\right]}}$$

- r = strength of the correlation between variables x and y
- n = sample size
- ∑ = sum of what follows
- X = every x-variable value
- Y = every y-variable value
- XY = the product of each x-variable score and the corresponding y-variable
score

Covariance is a measure of how much two random variables vary together. It's
similar to variance, but where variance tells you how a single variable
varies, covariance tells you how two variables vary together.

The formula for a population is:

$$\text{Cov}(X, Y) = \frac{\sum (X - \mu)(Y - \nu)}{n}$$

where:

- X and Y are random variables
- E(X) = μ is the expected value (the mean) of the random variable X
- E(Y) = ν is the expected value (the mean) of the random variable Y
- n = the number of items in the data set
- ∑ = summation notation

Also, Correlation = $\frac{\text{Cov}(x, y)}{\sigma_x \times \sigma_y}$

Where,

Cov(x, y) is the covariance between x and y

$\sigma_x$ and $\sigma_y$ are the standard deviations of x and y.

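Example

Calculate correlation and covariance for the following data set:

x: 2.1, 2.5, 3.6, 4.0 (mean = 3.05, rounded to 3.1 in the working below; the result is unaffected because the y deviations sum to zero)
y: 8, 10, 12, 14 (mean = 11)

To find the correlation, substitute the sums ($\sum X = 12.2$, $\sum Y = 44$, $\sum XY = 141$,
$\sum X^2 = 39.62$, $\sum Y^2 = 504$) into the Pearson formula:

$$r = \frac{4(141) - (12.2)(44)}{\sqrt{\left[4(39.62) - (12.2)^2\right]\left[4(504) - (44)^2\right]}}$$

r = 0.979

To find the sample covariance, substitute the values into the formula and solve:
$\text{Cov}(X, Y) = \frac{\sum (X - \mu)(Y - \nu)}{n - 1}$
= [(2.1-3.1)(8-11) + (2.5-3.1)(10-11) + (3.6-3.1)(12-11) + (4.0-3.1)(14-11)] / (4-1)
= [(-1)(-3) + (-0.6)(-1) + (0.5)(1) + (0.9)(3)] / 3
= (3 + 0.6 + 0.5 + 2.7) / 3
= 6.8/3
Cov(X,Y) = 2.267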
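The correlation and covariance for this data set can be recomputed directly from the sums. This is a minimal sketch with illustrative variable names; it uses the exact mean of x (3.05), and it also prints the population covariance (division by n) for comparison with the sample value (division by n - 1).

```python
import math

x = [2.1, 2.5, 3.6, 4.0]
y = [8, 10, 12, 14]
n = len(x)

sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(a * b for a, b in zip(x, y))
sum_x2 = sum(a * a for a in x)
sum_y2 = sum(b * b for b in y)

# Pearson correlation coefficient from the raw sums.
r = (n * sum_xy - sum_x * sum_y) / math.sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)

mean_x, mean_y = sum_x / n, sum_y / n
sum_products = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))

cov_sample = sum_products / (n - 1)   # sample covariance, as in the worked example
cov_population = sum_products / n     # population covariance

print(round(r, 3), round(cov_sample, 3), round(cov_population, 3))
```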
Finding correlation and covariance by using excel:

To find the correlation and covariance follow these steps:

1. Click the Data tab's Data Analysis command button.

The Data Analysis dialog box appears.

2. When Excel displays the Data Analysis dialog box, select the Correlation tool
from the Analysis Tools list and then click OK.

Excel displays the Correlation dialog box.

3. Identify the range of X and Y values that you want to analyze.

4. Select an output location.

5. Click OK.

Excel calculates the correlation coefficient for the data that you identified and places
it in the specified location.

Now to find the covariance follow these steps:

1. Click the Data Analysis command button on the Data tab.

The Data Analysis dialog box appears.

2. When Excel displays the Data Analysis dialog box, select the Covariance tool
from the Analysis Tools list and then click OK.

Excel displays the Covariance dialog box.

3. Identify the range of X and Y values that you want to analyze.

4. Select an output location.

Use the Output Options radio buttons and text boxes to specify where Excel should
place the results of the covariance analysis. To place the results into a range in the
existing worksheet, select the Output Range and then identify the range address in
the Output Range text box.

5. Click OK after you select the output options.

Excel calculates the covariance information for the data that you identified and places
it in the specified location.

Homework 2:

The following table shows the age of patients (X) and the glucose level in their blood
(Y); find the value of the correlation coefficient and the variance for this data.

AGE X GLUCOSE LEVEL Y

43 99

21 65

25 79

42 75

57 87

59 81
Exp #5 Random number generation and sampling

A random number generator (RNG) is a mathematical construct, either computational
or as a hardware device, that is designed to generate a random set of numbers that
should not display any distinguishable patterns in their appearance or generation,
hence the word random. A true random number generator cannot rely on
mathematical equations and computational algorithms to get a random number
because if there is an equation involved, then it is not random.

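The first step in simulating numbers from a distribution is to be able to independently
simulate random numbers $u_1, u_2, \dots, u_N$ from a continuous uniform
distribution between zero and one. Such a random variable has pdf

$$f(u) = \begin{cases} 1, & 0 \le u \le 1 \\ 0, & \text{otherwise} \end{cases}$$

and cdf

$$F(u) = \begin{cases} 0, & u < 0 \\ u, & 0 \le u \le 1 \\ 1, & u > 1 \end{cases}$$

These two are plotted in the following figure.

Figure 5.1: Pdf (left) and cdf (right) of the continuous uniform distribution between
zero and one.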
How to Generate Random Numbers in Excel

The Data Analysis command in Excel also includes a Random Number Generation
tool. The Random Number Generation tool is considerably more flexible than the
RAND function, which is the other tool that you have available within Excel to produce
random numbers.

To produce random numbers, take the following steps:

1. To generate random numbers, first click the Data tab's Data Analysis command
button. Excel displays the Data Analysis dialog box.

2. In the Data Analysis dialog box, select the Random Number Generation entry from
the list and then click OK.

Excel displays the Random Number Generation dialog box.

3. Use the Number of Variables text box to specify how many columns of values
you want in your output range. Similarly, use the Number of Random
Numbers text box to specify how many rows of values you want in the output
range.

You don't absolutely need to enter values into these two text boxes, by the way. You
can also leave them blank. In this case, Excel fills all the columns and all the rows in
the output range.
4. Select the distribution method.

Select one of the distribution methods from the Distribution drop-down list. The
Distribution drop-down list provides several distribution methods: Uniform, Normal,
Bernoulli, Binomial, Poisson, Patterned, and Discrete. Typically, if you want a pattern
of distribution other than Uniform, you'll know which one of these distribution
methods is appropriate.

5. Provide any parameters needed for the distribution method. If you select a
distribution method that requires parameters, or input values, use the
Parameters text box (Value and Probability Input Range) to identify the
worksheet range that holds the parameters needed for the distribution
method.

6. Select a starting point for the random number generation. You have the
option of entering a value that Excel will use to start its generation of random
numbers. Identify the output range.

7. After you describe how you want Excel to generate random numbers and
where those numbers should be placed, click OK.

Excel generates the random numbers.
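The same ideas (number of variables, number of random numbers, distribution, parameters, and a starting seed) can be sketched with Python's standard `random` module. This is an illustration under stated assumptions, not a description of how Excel generates its numbers: the seed 42, the table size, and the distribution parameters are all hypothetical.

```python
import random

# A fixed seed plays the role of the optional starting point:
# the same seed reproduces the same sequence of numbers.
rng = random.Random(42)

n_rows, n_cols = 5, 2   # "Number of Random Numbers" and "Number of Variables"

# Uniform distribution between 0 and 1, one list per row.
uniform_table = [[rng.uniform(0, 1) for _ in range(n_cols)] for _ in range(n_rows)]

# Normal distribution with a hypothetical mean of 100 and standard deviation of 15.
normal_column = [rng.gauss(100, 15) for _ in range(n_rows)]

# Bernoulli distribution with a hypothetical success probability of 0.3.
bernoulli_column = [1 if rng.random() < 0.3 else 0 for _ in range(n_rows)]

for row in uniform_table:
    print([round(u, 4) for u in row])
```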

Sampling

Sampling is the selection of a subset (a statistical sample) of individuals from within
a statistical population to estimate characteristics of the whole population.
Statisticians attempt to collect samples that are representative of the population in
question. Sampling has lower costs and faster data collection than measuring the
entire population and can provide insights in cases where it is infeasible to measure an
entire population.

Each observation measures one or more properties (such as weight, location, color)
of independent objects or individuals. In survey sampling, weights can be applied to
the data to adjust for the sample design, particularly in stratified sampling. Results
from probability theory and statistical theory are employed to guide the practice. In
business and medical research, sampling is widely used for gathering information
about a population. Acceptance sampling is used to determine if a production lot of
material meets the governing specifications.
Excel's Sampling tool selects a random sample from your range of values (a sample being a
portion of the whole range). For example, if the range holds competition entries, this
ensures that the winner has been chosen with integrity.

The screen shot below will be used for this example.

1. Select the Data tab and Data Analysis as per screen shot below.

2. Select Sampling and then OK.

3. Then select or type in the Input Range, Number of Samples and Output Range as
below. Select OK.
4. The results will be displayed as reflected in the image below.
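Drawing a random sample from a range of values can also be sketched in code. This is a minimal illustration with a hypothetical population of entry numbers 1 to 100 and an arbitrary seed; it shows both sampling with replacement (a value may be drawn more than once) and without replacement (each value at most once).

```python
import random

rng = random.Random(7)                 # arbitrary seed, for reproducibility only
population = list(range(1, 101))       # hypothetical entry numbers 1..100
sample_size = 10

# With replacement: each draw is independent, so repeats are possible.
sample_with_replacement = [rng.choice(population) for _ in range(sample_size)]

# Without replacement: random.sample never returns the same element twice.
sample_without_replacement = rng.sample(population, sample_size)

print(sample_with_replacement)
print(sample_without_replacement)
```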

Homework 5:

The life of a fully-charged cell phone battery is normally distributed with a mean of
(1414) hours and a standard deviation of (11) hours. By using Excel:

1. Generate random numbers for (150) cell phone and draw the histogram for this
data.

2. Give a random sample of (20) cell phone batteries.


Exp #6 VLOOKUP

VLOOKUP stands for 'Vertical Lookup'. It is a function that makes Excel search for a
certain value in a column (the so-called 'table array') in order to return a value from a
different column in the same row. We use VLOOKUP when we need to find things in
a table or a range by row. For example, look up a price of an automotive part by the
part number, or find an employee name based on their employee ID.

A VLOOKUP function consists of 4 components:


1. The value you want to look up;
2. The range in which you want to find the value and the return value;
3. The number of the column within your defined range, that contains the return
value;
4. 0 or FALSE for an exact match with the value you are looking for; 1 or TRUE for
an approximate match.

Syntax: VLOOKUP ([value], [range], [column number], [false or true])

Example:

Most of the time you are looking for an exact match when you use the VLOOKUP
function in Excel. Let's take a look at the arguments of the VLOOKUP function.

1. The VLOOKUP function below looks up the value 53 (first argument) in the
leftmost column of the red table (second argument).

2. The value 4 (third argument) tells the VLOOKUP function to return the value in the
same row from the fourth column of the red table.
Note: the Boolean FALSE (fourth argument) tells the VLOOKUP function to return an
exact match. If the VLOOKUP function cannot find the value 53 in the first column, it will
return a #N/A error.
3. Here's another example. Instead of returning the salary, the VLOOKUP function below
returns the last name (third argument is set to 3) of ID 79.

Let's take a look at an example of the VLOOKUP function in approximate match mode
(fourth argument set to TRUE).

1. The VLOOKUP function below looks up the value 85 (first argument) in the leftmost
column of the red table (second argument). There's just one problem. There's no value
85 in the first column.
2. Fortunately, the Boolean TRUE (fourth argument) tells the VLOOKUP function to
return an approximate match. If the VLOOKUP function cannot find the value 85 in
the first column, it will return the largest value smaller than 85. In this example,
this will be the value 80.

3. The value 2 (third argument) tells the VLOOKUP function to return the value in the
same row from the second column of the red table.

Note: always sort the leftmost column of the red table in ascending order if you
use the VLOOKUP function in approximate match mode (fourth argument set to
TRUE).
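The exact-match and approximate-match behaviour described above can be mimicked in a few lines. This is a sketch of the lookup logic only, not Excel's implementation; the function name `vlookup` and both tables are hypothetical and stand in for the red tables in the examples.

```python
from bisect import bisect_right

def vlookup(value, table, column, approximate=False):
    """Look up `value` in the first column and return the entry in `column` (1-based)."""
    if approximate:
        # Requires the first column to be sorted in ascending order.
        keys = [row[0] for row in table]
        position = bisect_right(keys, value)   # rows whose key is <= value
        if position == 0:
            return None                        # nothing small enough (Excel shows #N/A)
        return table[position - 1][column - 1]
    for row in table:
        if row[0] == value:
            return row[column - 1]
    return None                                # no exact match (Excel shows #N/A)

# Hypothetical employee table: ID, first name, last name, salary.
employees = [
    (53, "Mia", "Clark", 58000),
    (72, "Omar", "Haddad", 49500),
    (79, "Liam", "Reed", 61000),
]

# Hypothetical grade table, sorted ascending by minimum score.
grades = [(0, "F"), (60, "D"), (70, "C"), (80, "B"), (90, "A")]

print(vlookup(53, employees, 4))                   # exact match, fourth column
print(vlookup(79, employees, 3))                   # exact match, third column
print(vlookup(85, grades, 2, approximate=True))    # falls back to the row for 80
```

Sorting matters only in approximate mode, because the binary search assumes ascending keys; the exact-match branch scans every row and works on unsorted tables.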
Homework:

The speed and rank for several kinds of animals is shown in the following table:

By using vlookup in excel:

1. Find the missing data required in the following table:

2. Find the kind of animal that has a speed of 52 (mph).


Exp #7 Test of hypothesis on mean

What Is Hypothesis Testing?

Hypothesis testing is an act in statistics whereby an analyst tests an assumption
regarding a population parameter. The methodology employed by the analyst
depends on the nature of the data used and the reason for the analysis. Hypothesis
testing is used to assess the plausibility of a hypothesis by using sample data. Such
data may come from a larger population, or from a data-generating process.

Here, we would like to discuss some common hypothesis testing problems. We
assume that we have a random sample X1, X2,..., Xn from a distribution and our
goal is to make inference about the mean of the distribution μ. We consider three
hypothesis testing problems. The first one is a test to decide between the following
hypotheses:

H0: μ=μ0

Ha: μ≠μ0

In this case, the null hypothesis is a simple hypothesis and the alternative
hypothesis is a two-sided hypothesis (i.e., it includes both μ<μ0 and μ>μ0). We call
this hypothesis test a two-sided test. The second and the third cases are one-sided
tests. More specifically, the second case is

H0: μ≤μ0

Ha: μ>μ0
Here, both H0 and Ha are one-sided, so we call this test a one-sided test. The third
case is very similar to the second case. More specifically, the third scenario is

H0: μ≥μ0

Ha: μ<μ0
In all three cases, we use the sample mean
$\bar{X} = \frac{X_1 + X_2 + \dots + X_n}{n}$
1. Single mean example:
We claim:

a. Ho (the null hypothesis) μ =55 MPa , Ha(the alternative hypothesis) μ≠55

b. Ho μ≥55 MPa , Ha μ<55 MPa (left tailed)

c. Ho μ≤55 MPa, Ha μ>55 MPa (right tailed)


For case (a), single value:
Case 1: Single value

$\sigma^2$ known: $z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$

$\sigma^2$ unknown: $t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$

Steps of Hypothesis Testing:


1. Claim
2. Set parameter
3. Sample
4. Analysis
5. Result
6. Conclusion
7. decision

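Example for single value:

Case 2:

1. Claim Ho: μ=8,

Ha: μ≠8

2. Set α=0.01

3. $\bar{x}$ = 7.8, n=50, σ=0.5

4. $z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}} = \frac{7.8 - 8}{0.5 / \sqrt{50}} = -2.83$

zcr = +2.575, -2.575 (from statistics tables)

5. Reject Ho, because |z| = 2.83 is greater than 2.575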

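Example 2 (σ unknown):
1. Claim Ho: μ ≥ 46

Ha: μ < 46

We use the t distribution

2. α=0.05

3. $\bar{x}$ = 42, s=11.9, n=12

Df (degrees of freedom) = n-1 = 12-1 = 11

tcr = -1.796 (from statistics tables)

4. $t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} = \frac{42 - 46}{11.9 / \sqrt{12}} = -1.16$

5. Do not reject Ho, because t = -1.16 is not less than tcr = -1.796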
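Both single-mean examples use the same statistic, (sample mean - μ0) divided by (spread / √n), with the spread being σ when it is known and s when it is not. The sketch below recomputes them; the function name is illustrative, the critical values are the table values quoted in the examples, and the second example takes μ0 = 46, the value that reproduces the computed t of -1.16.

```python
import math

def one_sample_statistic(sample_mean, mu0, spread, n):
    """(sample_mean - mu0) / (spread / sqrt(n)): z if spread is sigma, t if it is s."""
    return (sample_mean - mu0) / (spread / math.sqrt(n))

# Example 1: sigma known, two-sided test at alpha = 0.01.
z = one_sample_statistic(7.8, 8, 0.5, 50)
z_critical = 2.575                      # from statistics tables
reject_two_sided = abs(z) > z_critical

# Example 2: sigma unknown, left-tailed test at alpha = 0.05 with 11 degrees of freedom.
t = one_sample_statistic(42, 46, 11.9, 12)
t_critical = -1.796                     # from statistics tables
reject_left_tailed = t < t_critical

print(round(z, 2), reject_two_sided)
print(round(t, 2), reject_left_tailed)
```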
Exp #8 Test on two means

The t-Test Paired Two Sample for Means tool performs a paired two-sample Student's
t-Test to ascertain if the null hypothesis (means of two populations are equal) can be
accepted or rejected. This test does not assume that the variances of both populations
are equal. Paired t-tests are typically used to test the means of a population before and
after some treatment, e.g. two samples of math scores from students before and after a
lesson.

The result of this tool is a calculated t-value. This value can be negative or positive,
depending on the data. Assuming that the population means are equal:

- If t < 0, P(T <= t) one-tail is the probability that a value of the t-Statistic would be
observed that is more negative than t.
- If t > 0, P(T <= t) one-tail is the probability that a value of the t-Statistic would be
observed that is more positive than t.
- P(T <= t) two-tail is the probability that a value of the t-Statistic would be observed
that is larger in absolute value than t.

The example datasets below were taken from a population of 10 students. The
students were given the same test at the beginning and end of the school year. Use
the Paired t-Test to determine if the average score of the 2nd test has improved over
the average score of the 1st test.

To run the t-test:

1. On the XLMiner Analysis ToolPak pane, click t-Test Paired Two Sample for
Means.
2. Enter A2:A11 for Variable 1 Range. This is our first set of values, the values
recorded at the beginning of the school year.
3. Enter B2:B11 for Variable 2 Range. This is our second set of values, the values
recorded at the end of the school year.
4. Enter "0" for Hypothesized Mean Difference. This means that we are testing that
the means between the two samples are equal.
5. Uncheck Labels since we did not include the column headings in our Variable 1
and 2 Ranges.
6. Keep the Alpha = 0.05.
7. Enter D1 for the Output Range.
8. Click OK.

The results are below.

- Cells E4 and F4 contain the mean of each sample, Variable 1 = Beginning and
Variable 2 = End.
- Cells E5 and F5 contain the variance of each sample.
- Cells E6 and F6 contain the number of observations in each sample.
- Cell E7 contains the Pearson Correlation, which indicates that the two variables
are rather closely correlated.
- Cell E8 contains our entry for the Hypothesized Mean Difference. Cell E9
contains the degrees of freedom, 10 - 1 = 9.
- Cell E10 contains the result of the actual t-test. We will compare this value to the
t Critical two-tail statistic. Note: Use a one-tail test if you have a direction in
your hypothesis, e.g. if testing that a value is above or below some level.
- In this example, P(T <= t) two-tail (0.0000321) is the probability of observing a
t-Statistic larger in absolute value than the one obtained (7.633), assuming the
population means are equal. The t-Statistic also exceeds the critical t value (2.26)
in absolute value. Since the p-value is less than our alpha, 0.05, we reject the null
hypothesis that there is no difference in the means of the two samples.
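A paired test works on the per-student differences rather than on the two samples separately. The sketch below assumes the usual paired t statistic, the mean difference divided by (standard deviation of the differences / √n) with n - 1 degrees of freedom; the scores are hypothetical and are not the data set used in the Excel example.

```python
import math
import statistics

# Hypothetical scores for 10 students (not the manual's data).
before = [70, 65, 80, 75, 60, 85, 90, 72, 68, 77]
after = [75, 70, 84, 80, 66, 88, 94, 78, 70, 83]

# Work on the paired differences (after - before).
differences = [a - b for a, b in zip(after, before)]
n = len(differences)

mean_diff = statistics.mean(differences)
sd_diff = statistics.stdev(differences)        # sample standard deviation

hypothesized_difference = 0                    # null hypothesis: no change
t_stat = (mean_diff - hypothesized_difference) / (sd_diff / math.sqrt(n))
degrees_of_freedom = n - 1

print(round(t_stat, 3), degrees_of_freedom)
```

The sign of the statistic depends on the order of subtraction, so the direction (after - before here) should be fixed before interpreting a one-tail result.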

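Test on two means

Hypotheses (all three cases):
Ho: $\mu_1 - \mu_2 = d_0$
Ha: $\mu_1 - \mu_2 \neq d_0$

$\sigma_1$, $\sigma_2$ known:
$Z = \frac{(\bar{x}_1 - \bar{x}_2) - d_0}{\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}}$

$\sigma_1$, $\sigma_2$ unknown but equal ($\sigma_1 = \sigma_2 = \sigma$):
$t = \frac{(\bar{x}_1 - \bar{x}_2) - d_0}{S_p \sqrt{1/n_1 + 1/n_2}}$
$S_p = \sqrt{\frac{S_1^2 (n_1 - 1) + S_2^2 (n_2 - 1)}{n_1 + n_2 - 2}}$
Df = $n_1 + n_2 - 2$

$\sigma_1$, $\sigma_2$ unknown but not equal:
$t = \frac{(\bar{x}_1 - \bar{x}_2) - d_0}{\sqrt{S_1^2/n_1 + S_2^2/n_2}}$
Df = $\frac{\left(S_1^2/n_1 + S_2^2/n_2\right)^2}{\frac{(S_1^2/n_1)^2}{n_1 - 1} + \frac{(S_2^2/n_2)^2}{n_2 - 1}}$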
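The three two-mean statistics (variances known, unknown but equal, unknown and unequal) can be compared side by side on summary statistics. This is a minimal sketch with illustrative function names; the means, variances, and sample sizes are hypothetical and chosen only to show the calculation.

```python
import math

def z_known(mean1, mean2, sigma1_sq, sigma2_sq, n1, n2, d0=0.0):
    """Z statistic when both population variances are known."""
    return ((mean1 - mean2) - d0) / math.sqrt(sigma1_sq / n1 + sigma2_sq / n2)

def t_pooled(mean1, mean2, s1_sq, s2_sq, n1, n2, d0=0.0):
    """t statistic and Df when variances are unknown but assumed equal."""
    df = n1 + n2 - 2
    sp = math.sqrt((s1_sq * (n1 - 1) + s2_sq * (n2 - 1)) / df)   # pooled std. dev.
    t = ((mean1 - mean2) - d0) / (sp * math.sqrt(1 / n1 + 1 / n2))
    return t, df

def t_unequal(mean1, mean2, s1_sq, s2_sq, n1, n2, d0=0.0):
    """t statistic and approximate Df when variances are unknown and unequal."""
    v1, v2 = s1_sq / n1, s2_sq / n2
    t = ((mean1 - mean2) - d0) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

# Hypothetical summary statistics.
mean1, mean2 = 20.0, 18.0
var1, var2 = 4.0, 9.0
n1, n2 = 10, 12

z = z_known(mean1, mean2, var1, var2, n1, n2)
t_eq, df_eq = t_pooled(mean1, mean2, var1, var2, n1, n2)
t_ne, df_ne = t_unequal(mean1, mean2, var1, var2, n1, n2)

print(round(z, 3))
print(round(t_eq, 3), df_eq)
print(round(t_ne, 3), round(df_ne, 2))
```

The unequal-variance Df is generally not a whole number; tables are usually entered with it rounded down.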
Exp #9: t-test assuming equal and unequal variances

How to do t-Tests in Excel:


1. Make sure the Analysis ToolPak is enabled (see Exp#1).

2. Click Data Analysis in the Data menu to display the analyses you can
perform. Among other options, the popup presents three types of t-test.

3. To perform a 2-sample t-test in Excel, arrange your data in two columns, as
shown below.

4. From the Data Analysis popup, choose t-Test: Two-Sample Assuming Equal
Variances.
5. Under Input, select the ranges for both Variable 1 and Variable 2.
6. In Hypothesized Mean Difference, you'll typically enter zero. This value is the
null hypothesis value, which represents no effect. In this case, a mean
difference of zero represents no difference between the two methods, which is
no effect.
7. Check the Labels checkbox if you have meaningful variable names in row 1.
This option makes the output easier to interpret. Ensure that you include the
label row in step #5.
8. Excel uses a default Alpha value of 0.05, which is usually a good value. Alpha
is the significance level. Change this value only when you have a specific
reason for doing so.
9. Click OK.

For the example data, your popup should look like the image below:

10. After Excel creates the output, adjust the width of column A to display all the text
in it.
Homework:

The table shows the life time for old and new batteries:

Old batteries New batteries

38 42

53 54

51 53

48 50

50 48

Given that $\bar{x}_o = 48$, $\bar{x}_N = 49.4$, $\sigma_o^2 = 34$, $\sigma_N^2 = 22$, $S_o^2 = 34.5$, $S_N^2 = 22.8$

Ho: μo- μN=0

Ha: μo- μN≠0

Tell if you will reject Ho or not, manually and by using excel.


Exp #10: Regression

Linear regression is a basic and commonly used type of predictive analysis. The overall idea
of regression is to examine two things: (1) does a set of predictor variables do a good job in
predicting an outcome (dependent) variable? (2) Which variables in particular are significant
predictors of the outcome variable, and in what way do they–indicated by the magnitude and
sign of the beta estimates–impact the outcome variable? These regression estimates are
used to explain the relationship between one dependent variable and one or more
independent variables. The simplest form of the regression equation with one dependent and
one independent variable is defined by the formula y = c + b*x, where y = estimated
dependent variable score, c = constant, b = regression coefficient, and x = score on the
independent variable.
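For one independent variable, the constant c and the coefficient b in y = c + b*x can be estimated by ordinary least squares. The sketch below assumes the standard formulas b = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)² and c = ȳ - b·x̄; the function name and the data are hypothetical.

```python
def fit_line(xs, ys):
    """Return (c, b) for the least-squares line y = c + b * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx                 # regression coefficient (slope)
    c = mean_y - b * mean_x       # constant (intercept)
    return c, b

# Hypothetical data, chosen only to illustrate the calculation.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

c, b = fit_line(xs, ys)
y_hat = c + b * 6                 # estimated y for a new x of 6

print(round(c, 3), round(b, 3), round(y_hat, 3))
```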

Naming the Variables: there are many names for a regression's dependent variable. It may
be called an outcome variable, criterion variable, endogenous variable, or regressand. The
independent variables can be called exogenous variables, predictor variables, or regressors.

Three major uses for regression analysis are (1) determining the strength of predictors, (2)
forecasting an effect, and (3) trend forecasting.

Regression in excel:

This example teaches you how to run a linear regression analysis in Excel and how to
interpret the Summary Output.

Below you can find our data. The big question is: is there a relation between Quantity Sold
(Output) and Price and Advertising (Input)? In other words: can we predict Quantity Sold if we
know Price and Advertising?

1. On the Data tab, in the Analysis group, click Data Analysis.

2. Select Regression and click OK.

3. Select the Y Range (A1:A8). This is the response variable (also called dependent variable).

4. Select the X Range(B1:C8). These are the explanatory variables (also called independent
variables). These columns must be adjacent to each other.

5. Check Labels.

6. Click in the Output Range box and select cell A11.

7. Check Residuals.

8. Click OK.

Excel produces the following Summary Output (rounded to 3 decimal places).


R Square
R Square equals 0.962, which is a very good fit. 96% of the variation in Quantity Sold is
explained by the independent variables Price and Advertising. The closer to 1, the better the
regression line (read on) fits the data.

Significance F and P-values


To check if your results are reliable (statistically significant), look at Significance F (0.001). If
this value is less than 0.05, you're OK. If Significance F is greater than 0.05, it's probably
better to stop using this set of independent variables. Delete a variable with a high P-value
(greater than 0.05) and rerun the regression until Significance F drops below 0.05.

Most or all P-values should be below 0.05. In our example this is the case
(0.000, 0.001 and 0.005).

Coefficients
The regression line is: y = Quantity Sold = 8536.214 -835.722 * Price + 0.592 * Advertising. In
other words, for each unit increase in Price, Quantity Sold decreases by 835.722 units. For
each unit increase in Advertising, Quantity Sold increases by 0.592 units. This is valuable
information.
You can also use these coefficients to do a forecast. For example, if price equals $4 and
Advertising equals $3000, you might be able to achieve a Quantity Sold of 8536.214 -835.722
* 4 + 0.592 * 3000 = 6970.
Residuals
The residuals show you how far away the actual data points are from the predicted data points
(using the equation). For example, the first data point equals 8500. Using the equation, the
predicted data point equals 8536.214 -835.722 * 2 + 0.592 * 2800 = 8523.009, giving a
residual of 8500 - 8523.009 = -23.009.
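The forecast and the residual can be recomputed from the regression line. This sketch uses the coefficients as quoted in the text, which are rounded to 3 decimal places, so its results differ slightly from the figures above (6970 and -23.009), which presumably come from unrounded coefficients; the function name is illustrative.

```python
# Coefficients as quoted in the text (rounded to 3 decimal places).
INTERCEPT = 8536.214
B_PRICE = -835.722
B_ADVERTISING = 0.592

def predict_quantity(price, advertising):
    """Quantity Sold predicted by the regression line."""
    return INTERCEPT + B_PRICE * price + B_ADVERTISING * advertising

# Forecast for a price of $4 and advertising of $3000.
forecast = predict_quantity(4, 3000)

# Residual for the first data point: actual value minus predicted value.
first_prediction = predict_quantity(2, 2800)
first_residual = 8500 - first_prediction

print(round(forecast, 3), round(first_prediction, 3), round(first_residual, 3))
```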

You can also create a scatter plot of these residuals.


Homework

In the table below, the xi column shows scores on the aptitude test. Similarly, the yi column
shows statistics grades. The last two rows show sums and mean scores that we will use to
conduct the regression analysis.

Student xi yi

1 95 85

2 85 95

3 80 70

4 70 65

5 60 70

Sum 390 385

Mean 78 77

If a student made an 80 on the aptitude test, find the estimated statistics grade ŷ .
