
MINITAB

By:
Ramandeep Kaur
Assistant Professor
Economics
Introduction
• Minitab was developed at the Pennsylvania State University in 1972 by
researchers Barbara F. Ryan, Thomas A. Ryan, Jr., and Brian L. Joiner.
• Minitab is a statistical software package for data analysis. It is
widely used in a variety of industries, including
healthcare, manufacturing, and education. Minitab provides users with
tools to perform statistical analysis, including hypothesis testing,
regression analysis, and ANOVA.
• Minitab Statistical Software Version 21.1.0 is available
How to Download it?

• Go to minitab.com

• Products > Minitab Statistical Software

• Free Trial for 30 days by filling in the form

• Email in 15 minutes with a link to download the software
Types of Statistics
1) Descriptive:
• Descriptive statistics is the branch of statistics used to describe
data. It summarizes the attributes of a sample in such a way that a
pattern can be drawn from the group.
• Descriptive statistics uses two tools to organize and describe data.
These are given as follows:
Measures of Central Tendency - mean, median, and mode.
Measures of Dispersion - Range, standard deviation, variance, quartiles,
and absolute deviation
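As a rough illustration of the two groups of measures listed above, the sketch below computes them with Python's standard `statistics` module. The sample values are hypothetical and are not taken from any Minitab data file.

```python
import statistics

# Hypothetical sample: number of online courses completed by 8 students.
courses = [2, 4, 4, 5, 7, 8, 9, 9]

# Measures of central tendency
mean = statistics.mean(courses)          # arithmetic average
median = statistics.median(courses)      # middle value of the sorted data
modes = statistics.multimode(courses)    # most frequent value(s)

# Measures of dispersion
data_range = max(courses) - min(courses)
variance = statistics.variance(courses)  # sample variance (divides by n - 1)
std_dev = statistics.stdev(courses)      # sample standard deviation
q1, q2, q3 = statistics.quantiles(courses, n=4)  # quartile cut points
iqr = q3 - q1                            # interquartile range

print(f"mean={mean}, median={median}, modes={modes}")
print(f"range={data_range}, variance={variance:.3f}, sd={std_dev:.3f}, IQR={iqr}")
```

Quartile conventions differ between packages, so the quartiles printed here may not match Minitab's output exactly for the same data.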
Types of Statistics
2) Inferential
• Inferential statistics is a branch of statistics that is used to make
inferences about the population by analyzing a sample. When the
population data is very large it becomes difficult to use it. In such
cases, certain samples are taken that are representative of the entire
population. Inferential statistics draws conclusions regarding the
population using these samples.
• Some methodologies used in inferential statistics are as follows:
➢Hypothesis Testing - z-test, F-test, t-test, etc.
➢Regression Analysis
Overview of Minitab
Session Window
Data Input methods
• Write directly in Minitab worksheet

• Copy and paste from Excel: press Ctrl+A and Ctrl+C in the Excel file, then Ctrl+V in the Minitab worksheet

• Open Excel worksheet:

File > Open Worksheet > Files of Type > Excel > Look in > (Select file location) > Open

• Open MTW files:

File > Open Worksheet > Files of Type > Minitab > Look in > (Select file location) > Open

• Open other files:

File > Open Worksheet > Files of Type > All > Look in > (Select file location) > Open
Data Types in Minitab

• Changing Types of Data:

Data > Change Data Type > select the conversion (Numeric to Text, Text to Numeric,
and so on)
Visual Representations of Data

• Frequency tables, pie charts, and bar charts can all be used to
display data.
• A frequency table contains the counts of how often each value
occurs in the dataset. Minitab uses the term tally to describe a
frequency table.
• In addition to counts, a Minitab tally can include cumulative values
as well.
• A pie chart displays data concerning one categorical variable by
partitioning a circle into "slices" that represent the proportion in each
category.
• A bar chart is a graph that can be used to display data with vertical or
horizontal bars, symbolizing the number of cases in each category.
Frequency Table
To create a frequency table of the primary campus variable in Minitab:

1.Open the data file in Minitab


2.From the tool bar, select Stat > Tables > Tally Individual Variables
3.Double click the variable Primary Campus in the box on the left to insert it
into the Variable box on the right
4.Under Statistics, check Counts and Percents
5.Click OK

(For practice, tally the Work Status variable as well)
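To make the idea of a tally concrete, here is a minimal Python sketch that builds a frequency table with counts, percents, and cumulative values. The campus names are made up for illustration, and the `tally` function is a hypothetical helper, not part of Minitab.

```python
from collections import Counter

# Hypothetical raw data for a categorical variable (campus names are made up).
primary_campus = ["Main", "Online", "Main", "North", "Main", "Online"]

def tally(values):
    """Return rows of (category, count, percent, cumulative count, cumulative percent)."""
    counts = Counter(values)
    total = len(values)
    rows, running = [], 0
    for category in sorted(counts):
        running += counts[category]
        rows.append((category, counts[category], 100 * counts[category] / total,
                     running, 100 * running / total))
    return rows

rows = tally(primary_campus)
for row in rows:
    # Columns: category, count, percent, cumulative count, cumulative percent
    print("{:<8} {:>3} {:>6.1f} {:>3} {:>6.1f}".format(*row))
```

The last row's cumulative count equals the number of observations, which is a quick check that no category was missed.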


• To create a two-way table of the Work Status and Primary
Campus variables in Minitab:
1.Open the data file in Minitab
2.From the tool bar, select Stat > Tables > Cross Tabulation and Chi-
Square
3.Click in the Rows box, then double click the variable Work Status to
insert it into the Rows box on the right
4.Click in the Columns box, then double click the variable Primary
Campus to insert it into the Columns box on the right
5.Click OK
Pie Chart (raw data)
To create a pie chart using raw data:

1.Open the data file in Minitab


2.From the tool bar, select Graph > Pie Chart...
3.Select Counts of Unique Values
4.Click OK
5.Double click the variable Primary Campus in the box on the left to
insert it into the Categorical variables box on the right
6.Click OK
Pie Chart (summarised data)
To create a pie chart using summarized data:

1.Enter the data into a blank Minitab worksheet with one column containing
the Campus names and a second column containing the Count for each campus
(open file campus count pie chart)
2.From the tool bar, select Graph > Pie Chart...
3.Select Summarized Data in a Table
4.Click OK
5.Double click Campus in the box on the left to insert it into the Categorical
variable box on the right
6.Double click Count in the box on the left to insert it into the Summary
variables box on the right
7.Click OK
Bar Graph
(used for categorical data)
To create a bar graph of the primary campus variable in Minitab:
1.Open the data file in Minitab
2.From the tool bar, select Graph > Bar Chart > Counts of Unique
Values...
3.Select Simple
4.Click OK
5.Double click the variable Primary Campus in the box on the left to
insert it into the Categorical variable box on the right
6.Click OK
• To create a bar chart using summarized data:
1.Enter the data into a blank Minitab worksheet with one column containing
the Campus names and a second column containing the Count for each
campus
2.From the tool bar, select Graph > Bar Chart > Values from a Table...
3.Under One Column of Values, select Simple
4.Click OK
5.Double click Count in the box on the left to insert it into the Graph-
variable box on the right
6.Double click Campus in the box on the left to insert it into the Categorical
variable box on the right
7.Click OK
• To create a clustered bar chart of the Work Status and Primary
Campus variables in Minitab:
1.Open the data file in Minitab
2.From the tool bar, select Graph > Bar Chart > Counts of Unique
Values
3.Select Cluster (Select Stack for stacking)
4.Click OK
5.Double click the variables Work Status and Primary Campus to insert
them both into the Categorical variables box on the right
6.Click OK
Histograms
(used for quantitative data)
• To create a histogram of the number of online courses completed in
Minitab:
1.Open the data set in Minitab
2.From the tool bar, select Graph > Histogram...
3.Under One Y Variable, select Simple
4.Click OK
5.Double click the variable Online Courses Completed in the box on the
left to insert it into the Y-variable box on the right
6.Click OK
Symmetry/Skewness
• Quantitative variables are often
discussed in terms of their
shape. Both dotplots and
histograms can be used to
interpret a distribution's shape.
• Symmetrical Distribution
A distribution that is similar on
both sides of the center.
Normal Distribution
One specific type of symmetrical distribution. This is also known as
a bell-shaped distribution.
Skewed
A distribution in which values are
more spread out on one side of
the center than on the other.

Right Skewed
A distribution in which the higher
values (towards the right on a
number line) are more spread
out than the lower values. This is
also known as positively skewed.
Left Skewed
A distribution in which the lower values (towards the left on a number line) are more spread
out than the higher values. This is also known as negatively skewed.
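The shape descriptions above can also be checked numerically. The sketch below computes the moment coefficient of skewness, which is positive for right-skewed data and negative for left-skewed data. The cut-off of 0.5 used to call a distribution "roughly symmetrical" is an arbitrary assumption for this example, and the data are hypothetical.

```python
def skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5, using population moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5

def describe_shape(values, tolerance=0.5):
    """Label the shape; the tolerance is an assumed rule of thumb."""
    g1 = skewness(values)
    if g1 > tolerance:
        return "right skewed"
    if g1 < -tolerance:
        return "left skewed"
    return "roughly symmetrical"

# Hypothetical data sets
long_right_tail = [1, 1, 2, 2, 3, 10]
long_left_tail = [-x for x in long_right_tail]
balanced = [1, 2, 3, 4, 5]

for data in (long_right_tail, long_left_tail, balanced):
    print(data, "->", describe_shape(data))
```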
❑Measures of Central Tendency
• Mean, Median, Mode

❑Measures of Spread
• The standard deviation is the most commonly used measure of
variability
• It is denoted σ (sigma)
• When computing the standard deviation by hand, it is necessary to
first compute the variance. The variance is equal to the standard
deviation squared.
To obtain measures of central tendency and variability in Minitab:

1.Open the data set in Minitab


2.From the tool bar, select Stat > Basic Statistics > Display Descriptive
Statistics...
3.Double click the variable Online Courses Completed in the box on the
left to insert it into the Variables box on the right
4.Click on the Statistics button and select the descriptive statistics you
want displayed (e.g., Variance, Interquartile range, Mode)
5.Click OK
6.Click OK
Scatterplots
A scatterplot is a graphical representation of two quantitative variables
in which the explanatory variable is on the x-axis and the response
variable is on the y-axis. It is used to display the relationship between
the explanatory and response variables.

• Explanatory variable
Variable that is used to explain variability in the response variable,
also known as an independent variable or predictor variable.

• Response variable
The outcome variable, also known as a dependent variable.
• Relationship: There is a positive
linear relationship between
height and shoe size in this
sample.
• There is a negative linear
relationship between the
maximum daily temperature and
coffee sales.
The file below contains data concerning students' quiz averages and final
exam scores. Let's construct a scatterplot with the quiz averages on the x-
axis and final exam scores on the y-axis.
• Grades.csv
1.Open the data file in Minitab
2.From the tool bar, select Graph > Scatterplot > Simple
3.Double click the variable Final on the left to move it to the Y variable box
on the right
4.Double click the variable Quiz_Average on the left to move it to the X
variable box on the right
5.Click OK
Correlation
A measure of the direction and strength of the relationship between
two variables.

• Properties of Pearson’s r:
Correlation: Relationships

Absolute Value of r    Strength of the Relationship
0 - 0.2                Very weak
0.2 - 0.4              Weak
0.4 - 0.6              Moderate
0.6 - 0.8              Strong
0.8 - 1.0              Very strong
1.Open the data file in Minitab: Exam.mwx (or Exam.csv)
2.Choose Stat > Basic Statistics > Correlation.
3.Double click Quiz_Average and Final in the box on the left to insert
them into the Variables box on the right
4.Click OK
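For readers who want to see what the correlation output represents, the following sketch computes Pearson's r by hand and labels its strength using the table above. The quiz and final scores are hypothetical, and treating each boundary value (0.2, 0.4, ...) as belonging to the higher category is an assumption, since the table's ranges overlap at their endpoints.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def strength(r):
    """Map |r| to the labels in the table (boundaries go to the higher label)."""
    size = abs(r)
    if size < 0.2:
        return "very weak"
    if size < 0.4:
        return "weak"
    if size < 0.6:
        return "moderate"
    if size < 0.8:
        return "strong"
    return "very strong"

# Hypothetical scores for five students
quiz_average = [60, 70, 80, 90, 100]
final = [58, 65, 74, 78, 88]

r = pearson_r(quiz_average, final)
print(f"r = {r:.3f} ({strength(r)}, {'positive' if r > 0 else 'negative'})")
```

The sign of r gives the direction of the relationship and its absolute value gives the strength.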
Regression
1.Open the Minitab file: Exam.mwx (or Exam.csv)
2.Select Stat > Regression > Regression > Fit Regression Model
3.Double click Final in the box on the left to insert it into the Responses
(Y) box on the right
4.Double click Quiz_Average in the box on the left to insert it into
the Continuous Predictors (X) box on the right
5.Click OK
• Interpretation
In the output in the above example we are given a simple linear
regression model of Final = 12.1 + 0.751 Quiz_Average
This means that the y-intercept is 12.1 and the slope is 0.751.

• Identify and interpret the slope.


The slope is 0.751. For every one-point increase in a student's quiz average, the
predicted final score increases by 0.751 points.
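The interpretation of the intercept and slope can be sketched in a few lines of Python. The `fit_line` helper below is an illustrative least-squares fit, not Minitab's implementation; `predict_final` simply plugs a quiz average into the fitted equation reported above.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for simple linear regression."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
             / sum((a - mean_x) ** 2 for a in x))
    intercept = mean_y - slope * mean_x
    return intercept, slope

def predict_final(quiz_average, intercept=12.1, slope=0.751):
    """Predicted final exam score from Final = 12.1 + 0.751 Quiz_Average."""
    return intercept + slope * quiz_average

# Raising the quiz average by one point raises the prediction by the slope.
print(predict_final(80))
print(predict_final(81) - predict_final(80))

# Sanity check of the fitting helper on points that lie exactly on y = 1 + 2x.
print(fit_line([1, 2, 3], [3, 5, 7]))
```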
Time Series
• A time series plot displays time on the x-axis and a quantitative
response variable on the y-axis.
1.Open the sample data, Apple Stock Price.
2.Choose Graph > Time Series Plot
3.In Series, enter ‘Close'.
4.Click Time/Scale.
5.Under Time Scale, select Calendar, and then select Month.
6.Click OK in each dialog box.
Linear Regression
• The coefficient of determination, or R², is the proportion of the variance in the
dependent variable that is predictable from the independent variable. It indicates how much
of the variation in the response the model explains.
• In simple linear regression, the coefficient of determination is the square of the correlation (r), so it
ranges from 0 to 1, or 0-100%.
• If R² is equal to 0, then the dependent variable cannot be predicted from the independent
variable.
• If R² is equal to 1, then the dependent variable can be predicted from the independent variable
without any error.
• If R² is between 0 and 1, it indicates the extent to which the dependent variable is
predictable. An R² of 0.10 means that 10 percent of the variance in the y variable is predicted
from the x variable, an R² of 0.20 means that 20 percent is predicted
from the x variable, and so on.
• R² tends to optimistically estimate the fit of the linear regression. It never decreases as
more effects are included in the model. Adjusted R² attempts to correct for this
overestimation. Adjusted R² might decrease if a specific effect does not improve the model.
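A small sketch may help show how the two statistics relate. It uses the usual definitions, R-squared = 1 - SS_residual / SS_total and adjusted R-squared = 1 - (1 - R-squared)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors. The numbers in the example are hypothetical.

```python
def r_squared(y, y_hat):
    """Coefficient of determination from observed and fitted values."""
    mean_y = sum(y) / len(y)
    ss_res = sum((obs - fit) ** 2 for obs, fit in zip(y, y_hat))
    ss_tot = sum((obs - mean_y) ** 2 for obs in y)
    return 1 - ss_res / ss_tot

def adjusted_r_squared(r2, n, p):
    """Adjust R-squared for n observations and p predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Hypothetical case: adding a second predictor nudges R-squared up slightly,
# yet the adjusted value goes down because the gain is too small.
one_predictor = adjusted_r_squared(0.900, n=20, p=1)
two_predictors = adjusted_r_squared(0.901, n=20, p=2)
print(f"{one_predictor:.4f} -> {two_predictors:.4f}")
```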
Multiple Regression
• Step 1: Determine the dependent and independent variables. All of them
should be continuous.
• Y (dependent variable) is the final exam score. X1, X2, and
X3 (independent variables) are the scores of exams one, two, and three
respectively. All x variables are continuous.
• Step 2: Start building the multiple linear regression model
1.Click Stat → Regression → Regression
2.A new window named “Regression” pops up.
3.Select “FINAL” as “Response” and “EXAM1”, “EXAM2” and “EXAM3” as
“Predictors.”
4.Press “OK”
• We use the VIF (Variance Inflation Factor) to determine if multicollinearity exists.
• Multicollinearity
• Multicollinearity is the situation when two or more independent variables in a multiple
regression model are correlated with each other. It can make the estimated effects of
individual independent variables unreliable.
• Rules of thumb to analyze variance inflation factor (VIF):
• If VIF = 1, there is no multicollinearity.
• If 1 < VIF < 5, there is small multicollinearity.
• If 5 ≤ VIF < 10, there is medium multicollinearity.
• If VIF ≥ 10, there is large multicollinearity.
• How to Deal with Multicollinearity
1.Increase the sample size.
2.Collect samples with a broader range for some predictors.
3.Remove the variable with high multicollinearity and high p-value.
4.Remove variables that are included more than once.
5.Combine correlated variables to create a new one.
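The VIF rules of thumb listed earlier can be sketched as follows. The sketch relies on the standard definition VIF = 1 / (1 - R²_j), where R²_j is the R-squared obtained by regressing predictor j on the other predictors. Reading the "medium" band as VIF from 5 up to (but not including) 10 is an assumption made here so that the categories do not overlap, and the R²_j values are hypothetical.

```python
def vif(r_squared_j):
    """Variance inflation factor for one predictor.

    r_squared_j is the R-squared from regressing that predictor
    on all the other predictors in the model.
    """
    return 1.0 / (1.0 - r_squared_j)

def vif_label(value):
    """Apply the rules of thumb (medium band assumed to be 5 <= VIF < 10)."""
    if value >= 10:
        return "large"
    if value >= 5:
        return "medium"
    if value > 1:
        return "small"
    return "none"

# Hypothetical R-squared values for three predictors
for name, r2_j in [("EXAM1", 0.50), ("EXAM2", 0.85), ("EXAM3", 0.95)]:
    value = vif(r2_j)
    print(f"{name}: VIF = {value:.2f} ({vif_label(value)} multicollinearity)")
```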
Interpreting the Results
R-square Adj = 98.8%
• About 98.8% of the variation in FINAL can be explained by the predictor
variables EXAM2 & EXAM3.

Variables p-value:
• Both are significant (less than 0.05).

Equation: FINAL = −4.34 + 0.356*EXAM1 + 0.543*EXAM2 + 1.17*EXAM3

• −4.34 is the Y intercept: the predicted value of FINAL when every exam score is 0.
Confidence Interval
A range computed using sample statistics to estimate an unknown population
parameter with a stated level of confidence.
Example: Statistical Anxiety
• The statistics professors at a university want to estimate the average statistics
anxiety score for all of their undergraduate students. It would be too time
consuming and costly to give every undergraduate student at the university
their statistics anxiety survey. Instead, they take a random sample of 50
undergraduate students at the university and administer their survey.
• Using the data collected from the sample, they construct a 95% confidence
interval for the mean statistics anxiety score in the population of all university
undergraduate students. We could say, “we are 95% confident that the mean
statistics anxiety score of all undergraduate students at this university is
between 26 and 32.”
At the center of a confidence interval is the sample statistic, such as a sample
mean or sample proportion.
This is known as the point estimate.
Point Estimate
Sample statistic that serves as the best estimate for a population parameter
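As a rough sketch of how a confidence interval is built around a point estimate, the code below uses point estimate ± critical value × standard error. It uses the normal critical value from Python's `statistics.NormalDist` as an approximation; a t-based interval, as used when the population standard deviation is unknown, would be slightly wider. The sample mean of 29 and standard deviation of 10.8 are hypothetical values chosen only so that the result lands close to the 26 to 32 interval in the example; only the sample size of 50 comes from the text.

```python
import math
from statistics import NormalDist

def mean_confidence_interval(sample_mean, sample_sd, n, confidence=0.95):
    """Approximate confidence interval for a mean using a normal critical value."""
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
    margin = z_star * sample_sd / math.sqrt(n)               # margin of error
    return sample_mean - margin, sample_mean + margin

# Hypothetical summary statistics for the statistics anxiety sample
low, high = mean_confidence_interval(sample_mean=29, sample_sd=10.8, n=50)
print(f"95% CI: ({low:.1f}, {high:.1f})")
```

The point estimate sits at the center of the interval, and a larger sample shrinks the margin of error.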
Hypothesis Testing
• P-value:
• The P-value is known as the probability value. It is defined as the
probability, assuming the null hypothesis is true, of getting a result that
is at least as extreme as the one actually observed.
Types of T-tests

• One-Sample t-test
In a one-sample t-test, we compare the mean of one group against a
set value. This set value can be any theoretical value (or it can be
the population mean).
• Independent Two-Sample t-test
The two-sample t-test is used to compare the means of two
different samples.
• Paired Sample t-test
Here, we compare the means of one group measured at two different times.
T-Test
• When Should We Perform a T-test?
• Example: Consider a telecom company that has two service centers in the city. The
company wants to find out whether the average time required to service a customer is the
same in both stores. The company measures the average time taken by 50 random
customers in each store. Store A takes 22 minutes, while Store B averages 25 minutes.
Can we say that Store A is more efficient than Store B in terms of customer service?
• It does seem that way, doesn’t it? However, we have only looked at 50 random customers
out of the many people who visit the stores. Simply looking at the average sample time
might not be representative of all the customers who visit both stores.
• This is where the t-test comes into play. It helps us understand if the
difference between two sample means is actually real or simply due to chance.
Assumptions for Performing a T-test

1.The dependent variable should be measured on a continuous scale


2.The independent variable should consist of two categorical, independent
groups.
3.The data should resemble a bell-shaped curve, i.e., it should be
normally distributed.
4.A large sample size helps the sampling distribution of the mean approach a
normal distribution (with small samples, the t-test relies more heavily on
the data themselves being approximately normal)
5.Variances among the groups should be equal (for independent two-
sample t-test)
Steps:
1.Open the data file in Minitab (example: Golf )
2.From the tool bar, select Stat > Basic Statistics > 2-Sample t
3.Select Samples in different columns
4.Select First: Current, Second: New
5.Click Options
6.Confidence Interval: 95.0, Test Difference: 0.0, Alternative: Not equal
7.Click OK
8.Select Assume Equal Variances
9.Click OK
Note:
• To determine whether to reject the null hypothesis using the
t-value, compare the t-value to the critical value.
• If the absolute value of the t-value is greater than the
critical value, you reject the null hypothesis.
• If the absolute value of the t-value is less than the critical
value, you fail to reject the null hypothesis.
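The decision rule in the note can be sketched in Python for the equal-variances (pooled) two-sample t-test. The means of 22 and 25 minutes and the sample sizes of 50 come from the service-center example; the standard deviations of 6 and 7 minutes are hypothetical, and the critical value (roughly 1.98 for 98 degrees of freedom at the 5% level, two-sided) is assumed to be read from a t table rather than computed.

```python
import math

def pooled_t_statistic(mean1, sd1, n1, mean2, sd2, n2):
    """t statistic and degrees of freedom, assuming equal variances."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    std_error = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / std_error, df

def decide(t_value, critical_value):
    """Compare |t| with the critical value, as described in the note."""
    if abs(t_value) > critical_value:
        return "reject H0"
    return "fail to reject H0"

# Store A vs Store B; the standard deviations are made up for illustration.
t_value, df = pooled_t_statistic(22, 6, 50, 25, 7, 50)
print(f"t = {t_value:.3f}, df = {df}: {decide(t_value, critical_value=1.984)}")
```

With different (hypothetical) standard deviations the same 3-minute gap could easily fail to reach significance, which is why the sample means alone are not enough.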
Sample Dataset File: Golf
• Case Study: Par Inc
• Par Inc., is a major manufacturer of golf equipment. Management believes
that Par’s market share could be increased with the introduction of a cut-
resistant, longer-lasting golf ball. Therefore, the research group at Par has
been investigating a new golf ball coating designed to resist cuts and
provide a more durable ball. The tests with the coating have been
promising. One of the researchers voiced concern about the effect of the
new coating on driving distances. Par would like the new cut-resistant ball
to offer driving distances comparable to those of the current-model golf
ball. To compare the driving distances for the two balls, 40 balls of both the
new and current models were subjected to distance tests. The testing was
performed with a mechanical hitting machine so that any difference
between the mean distances for the two models could be attributed to a
difference in the design.
Exercise:
• Questions:
1.Formulate and present the rationale for a hypothesis test that Par
could use to compare the driving distances of the current and new
golf balls.
2.Analyze the data to provide the hypothesis testing conclusion. What
is the p-value for your test? What is your recommendation for Par
Inc.?
3.Do you see a need for larger sample sizes and more testing with the
golf balls? Discuss.
Results:

Two Tailed Two Sample Independent T Test


•In this scenario, the p-value is 0.188, which is greater than 0.05. Hence, we fail to reject the null hypothesis.

•From the given data, it may be concluded that there is no statistically significant change in driving distance due
to the new coating on the golf balls.
•However, the recommendation is that the test be repeated with a larger sample size covering a number of golf courses
(at least five different ones) to improve the accuracy of the test results and to negate any effect of one type of ground.
Also, the results need to be interpreted, and future actions planned, with an understanding of other characteristics
such as size, shape, and weight.
Paired t test
Z-Test
• Z-statistic – Z Test
• The Z-statistic is used when the sample follows a normal distribution
and the population standard deviation is known. It is calculated from
population parameters such as the mean and standard deviation.
• A one-sample Z test is used when we want to compare a sample
mean with a population mean.
• A two-sample Z test is used when we want to compare the means
of two samples.
For explanation
• H0: S = 80
• H1: S ≠ 80
• Suppose the Z statistic comes out to be 4.56 and we test at the 95% confidence level.
• The critical value of Z at the 95% confidence level is ±1.96.
• Here, absolute value of Z>Critical value of Z.(that means the absolute
value lies in rejection area in figure).
• Therefore H0 is rejected.
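The comparison above can be reproduced with Python's `statistics.NormalDist`. This is a minimal sketch: the `one_sample_z` helper and its inputs (sample mean 82, population standard deviation 4, n = 64) are hypothetical and only show how a Z statistic is formed, while the value 4.56 is the one quoted in the example.

```python
import math
from statistics import NormalDist

def one_sample_z(sample_mean, mu0, sigma, n):
    """Z statistic for a sample mean against a hypothesized mean mu0."""
    return (sample_mean - mu0) / (sigma / math.sqrt(n))

def two_sided_z_test(z, alpha=0.05):
    """Return the critical value, two-sided p-value, and reject decision."""
    std_normal = NormalDist()
    critical = std_normal.inv_cdf(1 - alpha / 2)    # about 1.96 for alpha = 0.05
    p_value = 2 * (1 - std_normal.cdf(abs(z)))
    return critical, p_value, abs(z) > critical

critical, p_value, reject = two_sided_z_test(4.56)
print(f"critical = ±{critical:.2f}, p = {p_value:.2g}, reject H0: {reject}")

# Hypothetical inputs showing how a Z statistic is computed
print(one_sample_z(sample_mean=82, mu0=80, sigma=4, n=64))
```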
F test

• For comparisons involving three or more groups, we prefer the F test.

• F statistics are never negative, and the F distribution is skewed right.
• Performing t-tests on multiple pairs of groups increases the chance of a Type I error.
ANOVA is used in such cases.

• Analysis of variance (ANOVA) can determine whether the means of three or
more groups are different.
PROBLEM STATEMENT
• In the sample dataset, the variable Sprint is the respondent's
time (in seconds) to sprint a given distance, and Smoking is an
indicator about whether or not the respondent smokes (0 =
Nonsmoker, 1 = Past smoker, 2 = Current smoker).
• Let's use ANOVA to test if there is a statistically significant
difference in sprint time with respect to smoking status. Sprint
time will serve as the dependent variable, and smoking status
will act as the independent variable.
• We conclude that the mean sprint time is significantly different
for at least one of the smoking groups (F(2, 350) = 9.209, p < 0.001).
• Note that the ANOVA alone does not tell us specifically which
means were different from one another.
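To show where the F statistic and its two degrees of freedom come from, here is a minimal one-way ANOVA sketch in plain Python. The sprint times are hypothetical (three per group), so the result will not match the F value reported above; in that output, the degrees of freedom (2, 350) correspond to 3 groups and 353 respondents.

```python
def one_way_anova(groups):
    """Return (F statistic, df between groups, df within groups)."""
    k = len(groups)                                  # number of groups
    n = sum(len(g) for g in groups)                  # total observations
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Variation of group means around the grand mean
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of observations around their own group mean
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical sprint times (seconds) by smoking status
sprint = [
    [6.0, 6.5, 7.0],   # 0 = Nonsmoker
    [6.5, 7.0, 7.5],   # 1 = Past smoker
    [8.0, 8.5, 9.0],   # 2 = Current smoker
]
f_stat, df_between, df_within = one_way_anova(sprint)
print(f"F({df_between}, {df_within}) = {f_stat:.3f}")
```

A large F means the group means differ by more than the spread within groups would suggest; it does not say which groups differ.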
Chi-Square Test

• For categorical variables, we perform a chi-square test.
It is a nonparametric test.
• Types of chi-squared tests:
• Chi-squared test of independence: determines whether or not
there is a significant relationship between two categorical
variables. This test is also known as the Chi-Square Test of
Association.
• Chi-squared goodness of fit test: helps to determine whether the sample
data correctly represent the population.
Data Requirements for Chi-Square

1. Two or more categories (groups) for each variable.
2. Independence of observations.
   a. There is no relationship between the subjects in each group.
   b. The categorical variables are not "paired" in any way (e.g. pre-test/post-test
   observations).
3. Relatively large sample size.
   a. Expected frequencies for each cell are at least 1.
   b. Expected frequencies should be at least 5 for the majority (80%) of
   the cells.
Steps:
• Open required file. (sample_dataset_2014)
• Stat > Tables > Cross Tabulation and Chi-square
• A new window pops-up.
Categorical variables: For Rows: (Gender), For Columns: (Smoking)
• Select Chi-Square > From Display, Select Chi-square Analysis
• Press Ok
• Press Ok
PROBLEM STATEMENT
• In the sample dataset 2014, respondents were asked their gender and whether or not they were a
cigarette smoker. There were three answer choices: Nonsmoker, Past smoker, and Current smoker.
• Suppose we want to test for an association between smoking behavior (nonsmoker (0), past smoker (1),
or current smoker (2)) and gender (male (0) or female (1)) using a Chi-Square Test of Independence (we'll
use α = 0.05).
• Hypothesis:
H0: There is no association between smoking behaviour and gender.
H1: There is an association between smoking behaviour and gender.
• Results:
Since the p-value is greater than our chosen significance level (α = 0.05), we do not reject the null
hypothesis. Rather, we conclude that there is not enough evidence to suggest an association between
gender and smoking.
Based on the results, we can state the following:
No association was found between gender and smoking behavior
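The test of independence used here can be sketched by hand for a 2 x 3 table (gender by smoking status). The counts below are hypothetical and are not the values from the sample dataset. Because a 2 x 3 table has (2 - 1)(3 - 1) = 2 degrees of freedom, the p-value has the simple closed form exp(-χ²/2); that shortcut holds only for 2 degrees of freedom.

```python
import math

def chi_square_independence(table):
    """Return (chi-square statistic, degrees of freedom, expected counts)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    # Expected count = row total * column total / grand total
    expected = [[r * c / total for c in col_totals] for r in row_totals]
    chi2 = sum((obs - exp) ** 2 / exp
               for obs_row, exp_row in zip(table, expected)
               for obs, exp in zip(obs_row, exp_row))
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df, expected

def p_value_df2(chi2):
    """Upper-tail probability of a chi-square variable with exactly 2 df."""
    return math.exp(-chi2 / 2)

# Hypothetical counts: rows = gender, columns = non, past, current smoker
observed = [
    [90, 30, 30],
    [100, 25, 25],
]
chi2, df, expected = chi_square_independence(observed)
p_value = p_value_df2(chi2)
smallest_expected = min(min(row) for row in expected)
print(f"chi-square = {chi2:.3f}, df = {df}, p = {p_value:.3f}")
print("reject H0" if p_value < 0.05 else "fail to reject H0")
```

Checking `smallest_expected` against the thresholds of 1 and 5 is a quick way to confirm the sample-size requirement before trusting the p-value.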
