Student manual
Prepared by:
Eng. Rania Obaid
Exp#1: Descriptive statistics
Statistical measures:
In the real world, there are many situations in which a large group of data is
collected. In order to make sense of the data, statistical measures are used. These
measures help to generalize a group of data, make inferences about it, and compare
it with other groups of data. Statistical measures include the mean, median,
mode, range, and others. Depending on the situation, certain measures may be
more helpful than others in interpreting data.
- The mean, commonly referred to as the average, is the sum of all the data
items divided by the number of data items.
Mean = (Sum of observations) / (Number of observations)
- The median is the middle number in a set of data that is ordered from least to
greatest. If there is an even number of data items, take the average of the
middle two numbers to find the median.
If the total number of observations is odd, the median is:
Median = {(n+1)/2}th term
If the total number of observations is even, the median is:
Median = [(n/2)th term + {(n/2)+1}th term] / 2
where n is the number of observations.
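- The variance measures how far the observations spread around the mean. For a sample:
Variance = s² = Σ (xi - x̄)² / (n - 1)
where xi is the i-th observation, x̄ is the mean, and n is the number of
observations. The standard deviation s is the square root of the variance.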
- Range: is defined as the difference between the maximum value and the minimum
one.
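Example: find the statistical measures for the data x = 1, 2, 3, 3, 6.
Mean = (1 + 2 + 3 + 3 + 6)/5 = 3
s² = [(1-3)² + (2-3)² + (3-3)² + (3-3)² + (6-3)²] / (5-1) = 14/4 = 3.5
s = √3.5 = 1.871
Kurtosis = Σ (xi - x̄)⁴ / [(n-1) s⁴] = 98 / [4 (1.871)⁴] = 2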
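The measures for x = 1, 2, 3, 3, 6 can also be checked outside Excel. The following is a minimal sketch using Python's standard statistics module; the variable names are illustrative, and the fourth-moment ratio uses the (n-1) divisor that the hand calculation above appears to use.

```python
import statistics

x = [1, 2, 3, 3, 6]

mean = statistics.mean(x)            # sum of observations / number of observations
median = statistics.median(x)        # middle value of the ordered data
mode = statistics.mode(x)            # most frequent value
data_range = max(x) - min(x)         # maximum minus minimum
variance = statistics.variance(x)    # sample variance (divides by n - 1)
std_dev = statistics.stdev(x)        # square root of the sample variance

# Fourth-moment ratio as in the hand calculation: sum((x - mean)^4) / ((n - 1) * s^4)
fourth_moment_ratio = sum((v - mean) ** 4 for v in x) / ((len(x) - 1) * variance ** 2)

print("mean:", mean)
print("median:", median)
print("mode:", mode)
print("range:", data_range)
print("variance:", variance)
print("standard deviation:", round(std_dev, 3))
print("fourth-moment ratio:", fourth_moment_ratio)
```

The printed standard deviation (1.871) and ratio (2.0) agree with the values worked out by hand.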
Installing data analysis in excel:
1. Click the File tab, click Options, and then click the Add-Ins category.
2. In the Manage box, select Excel Add-ins and then click Go.
3. In the Add-Ins box, check the Analysis ToolPak check box, and then click OK.
4. If Analysis ToolPak is not listed in the Add-Ins available box, click Browse to locate it.
5. If you are prompted that the Analysis ToolPak is not currently installed on your
computer, click Yes to install it.
7. Click OK, then select the Input Range (the data you want to analyze).
8. Select the output range and click the summary statistics, Kth largest and
In the table below column A shows the suggested retail price (SRP) for a book. In
column B, the worksheet shows the units sold of each book through one popular
bookselling outlet. You might choose to use the Descriptive Statistics tool to
summarize this data set by using excel.
Column A   Column B
SRP        Units      Revenue
44.95      982        $44,141
42.95      792        $34,016
64.95      800        $51,440
44.95      744        $35,600
59.95      712        $47,480
49.95      609        $30,420
43.95      342        $15,822
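As a cross-check on the Excel output, the SRP and units columns can be summarized with a short script. This is only a sketch: the describe function and its field names are hypothetical and cover just a few of the quantities that the Descriptive Statistics tool reports.

```python
import statistics

# Columns A and B from the table above
srp = [44.95, 42.95, 64.95, 44.95, 59.95, 49.95, 43.95]
units = [982, 792, 800, 744, 712, 609, 342]

def describe(values):
    """Return a small summary of a column (hypothetical helper, not Excel's exact output)."""
    return {
        "count": len(values),
        "sum": sum(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "std_dev": statistics.stdev(values),   # sample standard deviation
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),
    }

summary = {"SRP": describe(srp), "units": describe(units)}

for column, stats in summary.items():
    print(column)
    for name, value in stats.items():
        print("  ", name, "=", round(value, 3))
```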
Exp#2: Histogram
A frequency distribution shows how often each different value in a set of data occurs.
A histogram is the most commonly used graph to show frequency distributions. We
use a histogram when:
If you are creating a histogram manually or using Excel, you will need to calculate:
Example:
The following is a list of salaries (in dollars) of employees found in Morgan Company:
850 600 930 650 570 800 1200 620 950 740 490 1150 550 840 770 990
830 750 1220 1700
Make a frequency distribution table for this data and find the histogram describing
this data.
Solution:
We omit the units (dollars) while calculating. The values go from 490 to 1700, so we can
divide this range into 4 intervals of equal length:
The maximum value = 1700
The minimum value = 490
Bin width = (max - min)/n
where n is the number of intervals.
Bin width = (1700 - 490)/4
= 302.5
Bin Frequency
792.5 9
1095 7
1397.5 3
1700 1
Total 20
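The frequency table above can be reproduced with a few lines of code. The sketch below assumes, as the table does, that each bin is labelled by its upper bound and that a value equal to a bound is counted in that bin.

```python
salaries = [850, 600, 930, 650, 570, 800, 1200, 620, 950, 740,
            490, 1150, 550, 840, 770, 990, 830, 750, 1220, 1700]

n_bins = 4
minimum, maximum = min(salaries), max(salaries)
bin_width = (maximum - minimum) / n_bins          # (1700 - 490) / 4

# Upper bound of each bin: min + width, min + 2*width, ...
upper_bounds = [minimum + bin_width * (k + 1) for k in range(n_bins)]

# Count each salary in the first bin whose upper bound it does not exceed
frequencies = [0] * n_bins
for value in salaries:
    for k, bound in enumerate(upper_bounds):
        if value <= bound:
            frequencies[k] += 1
            break

for bound, count in zip(upper_bounds, frequencies):
    print(bound, count)
print("Total", sum(frequencies))
```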
Building histogram in Excel:
5. Click in the Bin Range box and select the range C4:C8.
6. Click the Output Range option button, click in the Output Range box and select cell
F3.
9. To remove the space between the bars, right click a bar, click Format Data Series
and change the Gap Width to 0%.
Class work 2:
The following is a list of prices (in dollars) of birthday cards found in various drug
stores:
1.45 2.20 0.75 1.23 1.25 1.25 3.09 1.99 2.00 0.78 1.32 2.25 3.15 3.85 0.52
0.99 1.38 1.75 1.22 1.75
Using Excel, make a frequency distribution table for this data and draw the
histogram with zero gap between the bars.
Homework 2:
The following data represents the actual liquid weight in 16 "twelve-ounce" cans.
Construct a frequency distribution with four classes from this data.
11.95 11.91 11.86 11.94 12.00 11.93 12.00 11.94
12.10 11.95 11.99 11.94 11.89 12.01 11.99 11.94
Find the upper bound for each range and build a histogram using Excel.
Exp #3 Ranking and percentile
In the world of statistics, percentile rank refers to the percentage of scores that are
equal to or less than a given score. Percentile ranks, like percentages, fall on a
continuum from 0 to 100. For example, a percentile rank of 35 indicates that 35% of
the scores in a distribution of scores fall at or below the score at the 35th percentile.
Percentile ranks are useful when you want to quickly understand how a particular
score compares to the other scores in a distribution of scores. For instance, knowing
someone scored 235 points on an exam doesn't tell you much. You don't know how
many points were possible, and even if you did, you wouldn't know how that person's
score compared to the rest of his classmates. If, however, you were told that he
scored at the 90th percentile, then you would know that he did as well as or better than
90% of his class.
To find the percentile rank of a score:
P = (R × 100) / (n + 1)
where
P = percentile rank
R = rank of the score
n = number of ranks
Percentile is mainly applied to the data set of scores where ranking needs to be
identified. In addition, every 25th percentile is known as one quartile. Out of 100, the
25th percentile is known as 1st quartile. 50th percentile is known as 2nd quartile or
median, the 75th percentile is known as 3rd quartile. The difference between 3rd and
1st quartile is called an Interquartile range.
In statistics, "ranking" refers to the data transformation in which numerical or ordinal
values are replaced by their rank when the data are sorted. If, for example, the
numerical data 3.4, 5.1, 2.6, 7.3 are observed, the ranks of these data items would
be 2, 3, 1 and 4 respectively. In another example, the ordinal data hot, cold, warm
would be replaced by 3, 1, and 2. In these examples, the ranks are assigned to
values in ascending order. (In some other cases, descending ranks are used.) Ranks
are related to the indexed list of order statistics, which consists of the original dataset
rearranged into ascending order.
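The two ranking examples above can be sketched in code. The helper name ascending_ranks is hypothetical, and tied values (which the examples do not contain) would all receive the lowest rank of the tie in this version.

```python
def ascending_ranks(values, key=None):
    """Replace each value by its 1-based position in the sorted data."""
    ordered = sorted(values, key=key)
    return [ordered.index(v) + 1 for v in values]

# Numerical data
numeric_ranks = ascending_ranks([3.4, 5.1, 2.6, 7.3])

# Ordinal data: the order of the categories must be stated explicitly
scale = ["cold", "warm", "hot"]
ordinal_ranks = ascending_ranks(["hot", "cold", "warm"], key=scale.index)

print(numeric_ranks)
print(ordinal_ranks)
```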
Example 1:
The following data represents the marks out of 110 for math students:
33, 48, 57, 65, 67, 69, 75, 77, 77, 78, 80, 83, 85, 86, 87, 88, 89, 90, 99, 99, 99, 99,
100, 101, 105
The position of the 20th percentile is:
i = (P/100) × N
= (20/100) × 25
i = 5 (a whole number)
Because i is a whole number, take the average of the values at positions i and i+1,
that is, the 5th and 6th positions:
20th percentile = (67 + 69)/2 = 68
This means that 20% of the students scored 68 or lower and 80% scored above 68.
For the 25th percentile:
i = (25/100) × 25 = 6.25 (not a whole number)
so round up to position 7:
25th percentile = 75
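The rule used in this example (average two neighbours when the position is a whole number, otherwise round the position up) can be written as a small function. This is a sketch of that rule only; Excel's own percentile functions use a different interpolation and may return slightly different values. The function name is illustrative and it assumes 0 < p < 100.

```python
import math

marks = [33, 48, 57, 65, 67, 69, 75, 77, 77, 78, 80, 83, 85, 86, 87, 88,
         89, 90, 99, 99, 99, 99, 100, 101, 105]

def percentile_by_position(data, p):
    """Percentile using the position rule i = (p/100) * N on sorted data."""
    ordered = sorted(data)
    i = p * len(ordered) / 100
    if i == int(i):
        i = int(i)
        # Whole number: average the values at positions i and i + 1 (1-based)
        return (ordered[i - 1] + ordered[i]) / 2
    # Not a whole number: round up and take the value at that position
    return ordered[math.ceil(i) - 1]

p20 = percentile_by_position(marks, 20)
p25 = percentile_by_position(marks, 25)
print(p20, p25)
```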
Now that we have added the Analysis ToolPak, select the Data Analysis button and
select Rank and Percentile as shown below.
First select the Input Range
Class work 3:
The following table shows the points scored by students in their final exam. Find
the rank and percentile for this data using Data Analysis in Excel.
ID# Points
1147714 2
8878227 6
2374437 9
5513818 10
5609061 16
9575015 3
1151895 1
257927 4
845499 13
1969208 14
1832668 18
5989062 5
2515549 11
9354869 19
0797729 15
5787235 20
2140321 8
4462612 12
3442522 17
6162906 7
Exp #4 Correlation and covariance
Correlation is a statistical measure that expresses the extent to which two variables
are linearly related (meaning they change together at a constant rate). It's a common
tool for describing simple relationships without making a statement about cause and
effect. The sample correlation coefficient, r, quantifies the strength of the relationship.
Correlation coefficient formulas are used to find how strong a relationship is between
data. The formulas return a value between -1 and 1, where:
Covariance is a measure of how much two random variables vary together. It's
similar to variance, but where variance tells you how a single variable
varies, covariance tells you how two variables vary together.
Cov(X,Y) = Σ (X - μ)(Y - ν) / (n - 1)
where
X and Y are random variables
E(X) = μ is the expected value (the mean) of the random variable X
E(Y) = ν is the expected value (the mean) of the random variable Y
n = the number of items in the data set
Σ = summation notation
Example
To find the correlation, use the formula above and substitute the values:
r = 0.979
To find the covariance, substitute the values into the formula and solve:
Cov(X,Y) = Σ (X - μ)(Y - ν) / (n - 1)
= [(2.1-3.1)(8-11) + (2.5-3.1)(10-11) + (3.6-3.1)(12-11) + (4.0-3.1)(14-11)] / (4-1)
= [(-1)(-3) + (-0.6)(-1) + (0.5)(1) + (0.9)(3)] / 3
= (3 + 0.6 + 0.5 + 2.7) / 3
= 6.8/3
Cov(X,Y) = 2.267
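The worked example can be verified numerically. The sketch below takes the X and Y values from the substitution step (X = 2.1, 2.5, 3.6, 4.0 and Y = 8, 10, 12, 14) and assumes the usual sample formulas with the (n-1) divisor; the function names are illustrative.

```python
import math

x = [2.1, 2.5, 3.6, 4.0]
y = [8, 10, 12, 14]

def sample_covariance(a, b):
    """Sum of products of deviations from the means, divided by n - 1."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    products = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    return products / (len(a) - 1)

def correlation(a, b):
    """Pearson r: covariance divided by the product of the standard deviations."""
    return sample_covariance(a, b) / math.sqrt(
        sample_covariance(a, a) * sample_covariance(b, b)
    )

cov_xy = sample_covariance(x, y)
r = correlation(x, y)
print("Cov(X,Y) =", round(cov_xy, 3))
print("r =", round(r, 3))
```

The hand calculation rounds the mean of X (3.05) to 3.1; because the deviations of Y sum to zero, this rounding does not change the covariance.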
Finding correlation and covariance by using excel:
2. When Excel displays the Data Analysis dialog box, select the Correlation tool
from the Analysis Tools list and then click OK.
Excel calculates the correlation coefficient for the data that you identified and places
it in the specified location.
2. When Excel displays the Data Analysis dialog box, select the Covariance tool
from the Analysis Tools list and then click OK.
Use the Output Options radio buttons and text boxes to specify where Excel should
place the results of the covariance analysis. To place the results into a range in the
existing worksheet, select the Output Range and then identify the range address in
the Output Range text box.
Excel calculates the covariance information for the data that you identified and places
it in the specified location.
Homework 2:
The following table shows the age of patients (X) and the glucose level in their blood
(Y); find the value of the correlation coefficient and the variance for this data.
Age (X)   Glucose level (Y)
43        99
21        65
25        79
42        75
57        87
59        81
Exp #5 Random number generation and sampling
Figure 5.1: Pdf (left) and cdf (right) of the continuous uniform between zero and
one.
How to Generate Random Numbers in Excel
The Data Analysis command in Excel also includes a Random Number Generation
tool. The Random Number Generation tool is considerably more flexible than the
RAND function, which is the other tool that you have available within Excel to produce
random numbers.
1. To generate random numbers, first click the Data tab's Data Analysis command
button. Excel displays the Data Analysis dialog box.
2. In the Data Analysis dialog box, select the Random Number Generation entry from
the list and then click OK.
3. Use the Number of Variables text box to specify how many columns of values
you want in your output range. Similarly, use the Number of Random
Numbers text box to specify how many rows of values you want in the output
range.
You don't absolutely need to enter values into these two text boxes, by the way. You
can also leave them blank. In this case, Excel fills all the columns and all the rows in
the output range.
4. Select the distribution method.
Select one of the distribution methods from the Distribution drop-down list. The
Distribution drop-down list provides several distribution methods: Uniform, Normal,
Bernoulli, Binomial, Poisson, Patterned, and Discrete. Typically, if you want a pattern
of distribution other than Uniform, you'll know which one of these distribution
methods is appropriate.
5. Provide any parameters needed for the distribution method. If you select a
distribution method that requires parameters, or input values, use the
Parameters text box (Value and Probability Input Range) to identify the
worksheet range that holds the parameters needed for the distribution
method.
6. Select a starting point for the random number generation. You have the
option of entering a value that Excel will use to start its generation of random
numbers. Identify the output range.
7. After you describe how you want Excel to generate random numbers and
where those numbers should be placed, click OK.
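The same choices (number of variables, number of random numbers, distribution, parameters and starting seed) can be sketched in code. This is an illustration rather than a description of how Excel generates its numbers: the function name, its parameter names and the three distributions covered are assumptions made for the example.

```python
import random

def generate_random_table(rows, columns, distribution="uniform", seed=None, **params):
    """Return a rows x columns table of random numbers (illustrative helper)."""
    rng = random.Random(seed)   # the seed plays the role of the starting point
    if distribution == "uniform":
        low, high = params.get("low", 0.0), params.get("high", 1.0)
        draw = lambda: rng.uniform(low, high)
    elif distribution == "normal":
        mean, std_dev = params.get("mean", 0.0), params.get("std_dev", 1.0)
        draw = lambda: rng.gauss(mean, std_dev)
    elif distribution == "bernoulli":
        p = params.get("p", 0.5)
        draw = lambda: 1 if rng.random() < p else 0
    else:
        raise ValueError("unsupported distribution: " + distribution)
    return [[draw() for _ in range(columns)] for _ in range(rows)]

# 5 rows and 2 columns of uniform numbers between 0 and 1, reproducible via the seed
uniform_table = generate_random_table(5, 2, distribution="uniform", seed=1)
normal_table = generate_random_table(3, 1, distribution="normal", seed=1, mean=10, std_dev=2)

for row in uniform_table:
    print(row)
```

Using the same seed twice gives the same table, which is useful when results need to be repeated.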
Sampling
Each observation measures one or more properties (such as weight, location, color)
of independent objects or individuals. In survey sampling, weights can be applied to
the data to adjust for the sample design, particularly in stratified sampling. Results
from probability theory and statistical theory are employed to guide the practice. In
business and medical research, sampling is widely used for gathering information
about a population. Acceptance sampling is used to determine if a production lot of
material meets the governing specifications.
This tool selects a random sample from your range of values (a sample being a
portion of the whole range), therefore ensuring that your competition winner has been
chosen with integrity.
1. Select the Data tab and Data Analysis as per screen shot below.
3. Then select or type in the Input Range, Number of Samples and Output Range as
below. Select OK.
4. The results will be displayed as reflected in the image below.
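Drawing a random sample from a range of values can be sketched as follows. The list of entrants is hypothetical, and this version draws without replacement, so no value is picked twice; the Excel tool may behave differently in that respect.

```python
import random

def draw_sample(values, number_of_samples, seed=None):
    """Pick number_of_samples distinct items at random from values."""
    rng = random.Random(seed)
    return rng.sample(values, number_of_samples)

# Hypothetical input range: 20 competition entrants
entrants = ["Entrant " + str(i) for i in range(1, 21)]

winners = draw_sample(entrants, 3, seed=42)
print(winners)
```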
Homework 5:
The life of a fully charged cell phone battery is normally distributed with a mean of
(1414) hours and a standard deviation of (11) hours. Using Excel:
1. Generate random numbers for (150) cell phones and draw the histogram for this
data.
VLOOKUP stands for "Vertical Lookup". It is a function that makes Excel search for a
certain value in a column (the so-called "table array") in order to return a value from a
different column in the same row. We use VLOOKUP when we need to find things in
a table or a range by row. For example, look up a price of an automotive part by the
part number, or find an employee name based on their employee ID.
Example:
Most of the time you are looking for an exact match when you use the VLOOKUP
function in Excel. Let's take a look at the arguments of the VLOOKUP function.
1. The VLOOKUP function below looks up the value 53 (first argument) in the
leftmost column of the red table (second argument).
2. The value 4 (third argument) tells the VLOOKUP function to return the value in the
same row from the fourth column of the red table.
Note: the Boolean FALSE (fourth argument) tells the VLOOKUP function to return an
exact match. If the VLOOKUP function cannot find the value 53 in the first column, it will
return a #N/A error.
3. Here's another example. Instead of returning the salary, the VLOOKUP function below
returns the last name (third argument is set to 3) of ID 79.
Let's take a look at an example of the VLOOKUP function in approximate match mode
(fourth argument set to TRUE).
1. The VLOOKUP function below looks up the value 85 (first argument) in the leftmost
column of the red table (second argument). There's just one problem. There's no value
85 in the first column.
3. Fortunately, the Boolean TRUE (fourth argument) tells the VLOOKUP function to
return an approximate match. If the VLOOKUP function cannot find the value 85 in
the first column, it will return the largest value smaller than 85. In this example,
this will be the value 80.
4. The value 2 (third argument) tells the VLOOKUP function to return the value in the
same row from the second column of the red table.
Note: always sort the leftmost column of the red table in ascending order if you
use the VLOOKUP function in approximate match mode (fourth argument set to
TRUE).
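The two match modes can be imitated in code to make the logic concrete. The sketch below is not Excel's implementation: the vlookup function, the employee rows and the grade table are all hypothetical, and None stands in for the #N/A error.

```python
import bisect

def vlookup(lookup_value, table, col_index, approximate=False):
    """Search the first column of table and return the value in column col_index (1-based)."""
    keys = [row[0] for row in table]
    if approximate:
        # Requires the first column sorted in ascending order.
        # Finds the largest key that does not exceed lookup_value.
        position = bisect.bisect_right(keys, lookup_value) - 1
        if position < 0:
            return None
        return table[position][col_index - 1]
    for row in table:
        if row[0] == lookup_value:
            return row[col_index - 1]
    return None   # no exact match (Excel would show #N/A)

# Hypothetical table: ID, first name, last name, salary
employees = [
    (53, "Mia", "Clark", 58000),
    (79, "Ann", "Lee", 61000),
]
salary_of_53 = vlookup(53, employees, 4)
last_name_of_79 = vlookup(79, employees, 3)
missing = vlookup(99, employees, 4)

# Hypothetical table sorted by score: minimum score, grade
grades = [(0, "F"), (60, "D"), (70, "C"), (80, "B"), (90, "A")]
grade_for_85 = vlookup(85, grades, 2, approximate=True)

print(salary_of_53, last_name_of_79, missing, grade_for_85)
```

Since 85 is not in the first column of the grade table, the approximate lookup falls back to the row for 80.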
Homework:
The speed and rank for several kinds of animals is shown in the following table:
H0: μ=μ0
H1: μ≠μ0
In this case, the null hypothesis is a simple hypothesis and the alternative
hypothesis is a two-sided hypothesis (i.e., it includes both μ<μ0 and μ>μ0). We call
this hypothesis test a two-sided test. The second and the third cases are
one-sided tests. More specifically, the second case is
H0: μ≤μ0
Ha: μ>μ0
Here, both H0 and H1 are one-sided, so we call this test a one-sided test. The third
case is very similar to the second case. More specifically, the third scenario is
H0: μ≥μ0
Ha: μ<μ0
In all three cases, we use the sample mean
x̄ = (X1+X2+...+Xn)/n
1. Single mean example:
We claim:
Single value
σ² known:   z = (x̄ - μo) / (σ/√n)
σ² unknown: t = (x̄ - μo) / (s/√n)
Case 2:
Ha: μ≠8
2. Set α=0.01
4. z = (x̄ - μo) / (σ/√n) = (7.8 - 8) / (0.5/√50)
z = -2.83
5. Reject Ho, because |z| = 2.83 exceeds the two-sided critical value 2.576 for α=0.01.
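The test statistic and the decision can be checked with a short script. This sketch reads the values from the calculation above (sample mean 7.8, μo = 8, standard deviation 0.5, n = 50, α = 0.01) and uses the standard normal distribution from Python's statistics module for the critical value.

```python
import math
from statistics import NormalDist

x_bar = 7.8     # sample mean
mu_0 = 8.0      # hypothesized mean
sigma = 0.5     # known standard deviation
n = 50          # sample size
alpha = 0.01    # significance level

z = (x_bar - mu_0) / (sigma / math.sqrt(n))

# Two-sided test: compare |z| with the critical value for alpha/2 in each tail
z_critical = NormalDist().inv_cdf(1 - alpha / 2)
p_value = 2 * (1 - NormalDist().cdf(abs(z)))
reject_h0 = abs(z) > z_critical

print("z =", round(z, 2))
print("critical value =", round(z_critical, 3))
print("p-value =", round(p_value, 4))
print("reject Ho:", reject_h0)
```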
Example 2 (σ unknown)
1. Claim Ho: μ ≥ 6
Ha: μ< 6
We use t distribution
2. α=0.05
√n
5. Do not reject
Exp #7 Test on two means
The t-Test Paired Two Sample for Means tool performs a paired two-sample Student's
t-Test to ascertain if the null hypothesis (means of two populations are equal) can be
accepted or rejected. This test does not assume that the variances of both populations
are equal. Paired t-tests are typically used to test the means of a population before and
after some treatment, for example two samples of math scores from students before and after a
lesson.
The result of this tool is a calculated t-value. This value can be negative or positive,
depending on the data. Assuming that the population means are equal:
If t < 0, P(T <= t) one-tail is the probability that a value of the t-Statistic would be
observed that is more negative than t.
If t >0, P(T<=t) one tail is the probability that a value of the t-Statistic would be
observed that is more positive than t.
P(T <=t) two tail is the probability that a value of the t-Statistic would be observed
that is larger in absolute value than t.
The example datasets below were taken from a population of 10 students. The
students were given the same test at the beginning and end of the school year. Use
the Paired t-Test to determine if the average score of the 2nd test has improved over
the average score of the 1st test.
1. On the XLMiner Analysis ToolPak pane, click t-Test Paired Two Sample for
Means.
2. Enter A2:A11 for Variable 1 Range. This is our first set of values, the values
recorded at the beginning of the school year.
3. Enter B2:B11 for Variable 2 Range. This is our second set of values, the values
recorded at the end of the school year.
4. Enter "0" for Hypothesized Mean Difference. This means that we are testing that
the means between the two samples are equal.
5. Uncheck Labels since we did not include the column headings in our Variable 1
and 2 Ranges.
6. Keep the Alpha = 0.05.
7. Enter D1 for the Output Range.
8. Click OK.
Cells E4 and F4 contain the mean of each sample, Variable 1 = Beginning and
Variable 2 = End.
Cells E5 and F5 contain the variance of each sample.
Cells E6 and F6 contain the number of observations in each sample.
Cell E7 contains the Pearson Correlation which indicates that the two variables
are rather closely correlated.
Cell E8 contains our entry for the Hypothesized Mean Difference. Cell E9
contains the degrees of freedom, 10 – 1 = 9.
Cell E10 contains the result of the actual t-test. We will compare this value to the
t-Critical two-tail statistic. Note: Use a one-tail test if you have a direction in
your hypothesis, i.e. if testing that a value is above or below some level.
In this example, P(T <= t) two tail (0.0000321) gives the probability that a
t-Statistic larger in absolute value than the observed one (7.633) would be
observed. The observed value also exceeds the Critical t value (2.26). Since the
p-value is less than our alpha, 0.05, we reject the null hypothesis that there is
no significant difference in the means of each sample.
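The t-Statistic of a paired test comes from the differences between the two measurements of each student. The sketch below shows that calculation; the scores are hypothetical (they are not the example data set), the function name is illustrative, and the difference is taken as end minus beginning, so the sign may be the opposite of what a given tool reports.

```python
import math
import statistics

def paired_t(first, second):
    """Return (t, degrees of freedom) for paired samples, using second - first."""
    if len(first) != len(second):
        raise ValueError("paired samples must have the same length")
    differences = [b - a for a, b in zip(first, second)]
    n = len(differences)
    mean_difference = statistics.mean(differences)
    sd_difference = statistics.stdev(differences)     # sample standard deviation
    t = mean_difference / (sd_difference / math.sqrt(n))
    return t, n - 1

# Hypothetical scores for three students (illustration only)
beginning = [70, 65, 80]
end = [75, 70, 88]

t_statistic, degrees_of_freedom = paired_t(beginning, end)
print("t =", round(t_statistic, 3), "with", degrees_of_freedom, "degrees of freedom")
```

The resulting t is then compared with the critical t value for the chosen alpha and degrees of freedom, as in the Excel output.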
2. After you enable it, click Data Analysis in the Data menu to display the
analyses you can perform. Among other options, the popup presents three
types of t-test, which we'll cover next.
4. From the Data Analysis popup, choose t-Test: Two-Sample Assuming Equal
Variances.
5. Under Input, select the ranges for both Variable 1 and Variable 2.
6. In Hypothesized Mean Difference, you'll typically enter zero. This value is the
null hypothesis value, which represents no effect. In this case, a mean
difference of zero represents no difference between the two methods, which is
no effect.
7. Check the Labels checkbox if you have meaningful variable names in row 1.
This option makes the output easier to interpret. Ensure that you include the
label row in step #3.
8. Excel uses a default Alpha value of 0.05, which is usually a good value. Alpha
is the significance level. Change this value only when you have a specific
reason for doing so.
9. Click OK.
For the example data, your popup should look like the image below:
10. After Excel creates the output, adjust the width of column A to display all text
in it.
Interpreting the Two-Sample t-Test Results
Test on two means
The table shows the lifetime of old and new batteries:
Old   New
38    42
53    54
51    53
48    50
50    48
Given that x̄o = 48, x̄N = 49.4, σo² = 34, σN² = 22, So² = 34.5, SN² = 22.8
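The sample means and sample variances of the two columns can be computed directly from the table, and one possible comparison is the two-sample t statistic with equal variances assumed. The sketch below treats the first column as the old batteries and the second as the new ones; the choice of the pooled-variance test is an assumption, since the exercise statement does not say which test is intended.

```python
import math
import statistics

old = [38, 53, 51, 48, 50]
new = [42, 54, 53, 50, 48]

mean_old, mean_new = statistics.mean(old), statistics.mean(new)
var_old, var_new = statistics.variance(old), statistics.variance(new)   # n - 1 divisor
n_old, n_new = len(old), len(new)

# Pooled variance (assumes both populations share the same variance)
pooled_variance = ((n_old - 1) * var_old + (n_new - 1) * var_new) / (n_old + n_new - 2)
t = (mean_old - mean_new) / math.sqrt(pooled_variance * (1 / n_old + 1 / n_new))

print("means:", mean_old, mean_new)
print("sample variances:", var_old, var_new)
print("t =", round(t, 4), "with", n_old + n_new - 2, "degrees of freedom")
```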
Linear regression is a basic and commonly used type of predictive analysis. The overall idea
of regression is to examine two things: (1) does a set of predictor variables do a good job in
predicting an outcome (dependent) variable? (2) Which variables in particular are significant
predictors of the outcome variable, and in what way do they (as indicated by the magnitude and
sign of the beta estimates) impact the outcome variable? These regression estimates are
used to explain the relationship between one dependent variable and one or more
independent variables. The simplest form of the regression equation with one dependent and
one independent variable is defined by the formula y = c + b*x, where y = estimated
dependent variable score, c = constant, b = regression coefficient, and x = score on the
independent variable.
Naming the Variables: there are many names for a regression's dependent variable. It may
be called an outcome variable, criterion variable, endogenous variable, or regressand. The
independent variables can be called exogenous variables, predictor variables, or regressors.
Three major uses for regression analysis are (1) determining the strength of predictors, (2)
forecasting an effect, and (3) trend forecasting.
Regression in excel:
This example teaches you how to run a linear regression analysis in Excel and how to
interpret the Summary Output.
Below you can find our data. The big question is: is there a relation between Quantity Sold
(Output) and Price and Advertising (Input)? In other words: can we predict Quantity Sold if we
know Price and Advertising?
4. Select the X Range(B1:C8). These are the explanatory variables (also called independent
variables). These columns must be adjacent to each other.
5. Check Labels.
7. Check Residuals.
8. Click OK.
Most or all P-values should be below 0.05. In our example this is the case
(0.000, 0.001 and 0.005).
Coefficients
The regression line is: y = Quantity Sold = 8536.214 - 835.722 * Price + 0.592 * Advertising. In
other words, for each unit increase in Price, Quantity Sold decreases by 835.722 units. For
each unit increase in Advertising, Quantity Sold increases by 0.592 units. This is valuable
information.
You can also use these coefficients to do a forecast. For example, if price equals $4 and
Advertising equals $3000, you might be able to achieve a Quantity Sold of 8536.214 -835.722
* 4 + 0.592 * 3000 = 6970.
Residuals
The residuals show you how far away the actual data points are from the predicted data points
(using the equation). For example, the first data point equals 8500. Using the equation, the
predicted data point equals 8536.214 -835.722 * 2 + 0.592 * 2800 = 8523.009, giving a
residual of 8500 - 8523.009 = -23.009.
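The forecast and the residual can be reproduced from the regression equation. The sketch below uses the coefficients as printed (rounded to three decimals), so its results differ slightly from figures computed with Excel's unrounded coefficients; the function name is illustrative.

```python
# Coefficients as printed in the Summary Output (rounded to three decimals)
INTERCEPT = 8536.214
PRICE_COEFFICIENT = -835.722
ADVERTISING_COEFFICIENT = 0.592

def predict_quantity_sold(price, advertising):
    """Evaluate the regression line for a given Price and Advertising."""
    return INTERCEPT + PRICE_COEFFICIENT * price + ADVERTISING_COEFFICIENT * advertising

# Forecast for Price = 4 and Advertising = 3000
forecast = predict_quantity_sold(4, 3000)

# Residual of the first data point: actual value minus predicted value
actual_first = 8500
residual_first = actual_first - predict_quantity_sold(2, 2800)

print("forecast:", round(forecast, 3))
print("residual of first point:", round(residual_first, 3))
```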
In the table below, the xi column shows scores on the aptitude test. Similarly, the yi column
shows statistics grades. The last two rows show sums and mean scores that we will use to
conduct the regression analysis.
Student   xi    yi
1         95    85
2         85    95
3         80    70
4         70    65
5         60    70
Sum       390   385
Mean      78    77
If a student made an 80 on the aptitude test, find the estimated statistics grade ŷ .
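One way to approach this is to fit the line y = c + b*x by least squares and then evaluate it at x = 80. The sketch below assumes the standard least-squares formulas, b = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)² and c = ȳ - b*x̄; the variable names follow the regression equation given earlier.

```python
x = [95, 85, 80, 70, 60]   # aptitude test scores
y = [85, 95, 70, 65, 70]   # statistics grades

x_mean = sum(x) / len(x)
y_mean = sum(y) / len(y)

# Sums of products and squares of the deviations from the means
s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
s_xx = sum((xi - x_mean) ** 2 for xi in x)

b = s_xy / s_xx             # regression coefficient (slope)
c = y_mean - b * x_mean     # constant (intercept)

y_hat = c + b * 80          # estimated statistics grade for an aptitude score of 80

print("b =", round(b, 3))
print("c =", round(c, 3))
print("estimated grade for x = 80:", round(y_hat, 2))
```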