
COPENHAGEN BUSINESS SCHOOL Statistics

DEPARTMENT OF FINANCE Søren Feodor Nielsen


CENTER FOR STATISTICS September 26, 2023

First workshop

The first problem focuses on basic JMP-use with graphs and summaries. The second problem
recreates the test we did in the first lecture. Please note that even though the second problem
refers to one of the “Web Apps”, it does not ask you to use the app.

Forbes data-set

1. Start by reading the data into JMP:

(a) Open JMP and then click File → Open... and find the data set where you have
placed it.
(b) Select forbes.txt and choose Open as: Data using best guess before pressing
return; you may have to uncheck Use default program to open. Uncheck to open
as text.
When done, the data table should look exactly like figure 1; numbers are flush right,
categories flush left. Note also the icons in the Columns-section on the left.

Figure 1: Forbes data

2. Construct a histogram over the market values of the companies in the data set (Analyze
→ Distribution, click on Market_value and then on Y, Columns and OK). Turn the
histogram so that it is horizontal rather than vertical by right-clicking on (or clicking on the
▾ in) the Distributions-header and choosing Stack (there are at least three ways of turning
a histogram).

3. The distribution of the market values is highly skewed to the right, which makes it very
difficult to get any information out of the histogram. The usual way of handling this is to
take logs. To do so, construct a new variable:

(a) Go back to the data table, choose Cols → New Column... and give it a new
(sensible!) name.
(b) Next go to the Column Properties and select Formula. First find Log (natural
logarithm, or “ln”) in the Transcendental-section, and then click on Market_value.
If you prefer, you may use Log10 (base-10 logarithm) instead, but then some of the
results will not match those in the Selected Solutions. Press OK.
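
For those who like to see the transformation outside JMP, here is a minimal sketch in Python. The market values are made up for illustration and are not taken from forbes.txt. It shows that the natural logarithm and the base-10 logarithm differ only by the constant factor ln(10), so the shape of the histogram is the same whichever you choose; only the numbers on the axis change.

```python
import math

# Hypothetical market values (NOT from the Forbes data set).
market_value = [50.0, 400.0, 1200.0, 9000.0, 75000.0]

# Natural logarithm: the counterpart of Log in JMP's formula editor.
log_mv = [math.log(v) for v in market_value]

# Base-10 logarithm: the counterpart of Log10.
log10_mv = [math.log10(v) for v in market_value]

# The two transformations differ only by the constant factor ln(10).
ratios = [ln / l10 for ln, l10 in zip(log_mv, log10_mv)]

for v, ln, l10 in zip(market_value, log_mv, log10_mv):
    print(f"{v:>9.1f}  ln = {ln:6.3f}  log10 = {l10:6.3f}")
```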

4. (a) Now make a histogram of the log-transformed market values.


(b) Have a look at the summary statistics and compare them to the “empirical rule” (a
quick calculation in your head will be quite sufficient).

5. Make a barplot for the sector-variable.

(a) Start by making a “histogram” for sector. Then right-click the sector-header to turn
it into a barplot: Histogram Options → Separate Bars.
(b) Add a count axis (Histogram Options) and click on Show Percent.
(c) Then change your mind and change the count axis to a percentage axis (Prob Axis)
and replace the percentages by counts.

Consider which is better, and when you have made up your mind:

(a) Select the graph: in the menu at the top (if it is hidden, let your mouse hover
over it), click on the “fat cross” icon and use it to select the graph.
(b) Right-click on the selected output and choose Copy, then open (e.g.) Word and
paste it.

6. It may be better to have the bars in your barplot ordered according to height (i.e. a Pareto
plot). To do this

(a) Go back to the data table, right-click on the sector variable-heading and pick Column
Properties → Value Order.
(b) Use the graph you have saved to decide on a suitable ordering (Finance is the largest
sector, so it should be moved up to the top of the list, etc.).
(c) Then make a new barplot: The easy way to do this is to go back to the barplot you
have made, right-click on the Distribution-header, choose Redo → Redo Analysis.
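
The ordering used for a Pareto plot is simply "largest count first". A minimal Python sketch of that idea, with made-up sector labels (the real ones come from the sector column):

```python
from collections import Counter

# Hypothetical sector labels (NOT the real data).
sectors = ["Finance", "Energy", "Finance", "Retail", "Finance", "Energy"]

counts = Counter(sectors)

# most_common() lists the categories by count, largest first;
# this is the order you would enter under Value Order.
value_order = [name for name, _ in counts.most_common()]

for name in value_order:
    print(f"{name:<10} {counts[name]}")
```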

7. Go back to the data table:

(a) Right-click on the sector variable-heading and choose Sort → Ascending to get the
data sorted according to the sector-variable. JMP may complain and insist on open-
ing a new data table-window; if so use the new window.

(b) Now exclude all observations except those in the four largest sectors (largest in this
data set):
• Scroll down in the data table until you get to Hi-Tech (the fifth largest sector)
and then mark all rows starting here all the way down.*
• Right click on a selected row number and choose Exclude/Unexclude (not Hide
and Exclude!).
8. Now make histograms and obtain summaries for market value stratified by sector:

(a) Analyze → Distribution, let Market_value be Y and put sector into the by-box.
(b) Compare the median market values of the 4 sectors.
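
What the by-box does is compute the same summaries separately within each sector. A minimal Python sketch with made-up numbers (not the Forbes data); note that with an even number of observations the median is the average of the two middle values, which is why a median can end in .5:

```python
from collections import defaultdict
from statistics import median

# Hypothetical (sector, market value) pairs, NOT the real data.
rows = [
    ("Finance", 300.0), ("Finance", 900.0),
    ("Energy", 500.0), ("Energy", 700.0), ("Energy", 1100.0),
]

# Collect the market values sector by sector.
groups = defaultdict(list)
for sector, value in rows:
    groups[sector].append(value)

# One median per sector.
medians = {sector: median(values) for sector, values in groups.items()}
print(medians)
```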

9. Also make a contingency table and a mosaic plot to see how the grouped number of
employees depends on sector:

(a) Analyze → Fit Y by X with grouped number of employees as Y, Response and
sector as X, Factor.
(b) In the resulting contingency table, right click in the upper left corner and unselect
Total % and Col %.
10. Go back to the data table and “unexclude” the excluded observations (right-click on an
excluded row and choose Unexclude).

11. Construct a new variable, log-transformed sales (as in 3).

12. Select the Graph builder in the Graph-menu:

(a) Put the log-transformed market values on the vertical axis and the log-transformed
sales on the horizontal. You should now have a graph showing you how the loga-
rithm of market value depends on the logarithm of sales.
(b) Right click on the graph, choose Graph → Marker size and change this to 2. If the
graph does not change, choose another value for Marker size. Click Done when
done.

13. In the Graph menu select Graph Builder. Let the log-transformed market values be Y and
log-transformed sales X. There is a smooth curve in the scatterplot; remove it by clicking
on the icon with points and a smooth curve above the graph. Click on grouped_employee
and then on Overlay at the top right corner of the graph; this will make observations in
different grouped_employee-groups have different colours†.
Based on this graph, do you think the relationship between market values and sales depends
on the size of the company?

14. Fit a regression of log of market value to log of sales:

(a) Choose Analyze → Fit Y by X, choose log of market value as Y, Response and log
of sales as X, Factor. Press OK.
(b) Right-click on the Bivariate Fit...-header and choose Fit Line.

Note the equation of the line.
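
The fitted line is a least-squares line. A minimal Python sketch of the calculation, using made-up log-values (not the Forbes data); the function name fit_line is just chosen for this illustration:

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sum of squares around the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical log-transformed values (NOT the real data).
log_sales = [1.0, 2.0, 3.0, 4.0]
log_market_value = [2.1, 2.9, 3.6, 4.4]

intercept, slope = fit_line(log_sales, log_market_value)
print(f"log market value = {intercept:.3f} + {slope:.3f} * log sales")
```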


* This will only work as intended if you have managed to order the values of sector in the
previous problem; JMP sorts according to the ordering of the variable.

† Different plot symbols could be good. Maybe you can find a way to get this?

Simulating coin tosses

There is a set of “Web apps” for the book available online. One of these applets, “Random
Number”, simulates coin tosses. The lecturer tried it out one Saturday afternoon and found
that after 700 tosses, the coin had ended up heads 374 times and tails 326 times. Not wanting
to ask his students to use this app unless he was convinced it actually worked, he decided to
make a statistical test of the hypothesis that the probability of getting Heads is 50%.

1. Open a new data table in JMP (File → New → Data table). In the first column, write
“Heads” and “Tails”. Make a new column (New column in the Cols-menu) and type in the
observed data. It should look similar to the data table in figure 2 when done.

Figure 2: A small data table

2. To test the hypothesis that there is a 50-50 chance of each possible outcome (Heads or
Tails):
(a) Choose Analyze → Distribution; here the first variable you created should be the
Y whereas the second should be Freq. Press OK.
(b) Turn the histogram and make it into a barplot (if you have the time).
(c) Right-click on the header and select Test probabilities; write in the hypothesis
value (0.5 or 0,5‡) and press Done to get the result (the Pearson-line in the output).
The test statistic you get out is not the same as the one used in the first lecture (which is
the one used in the book), but the test is the same: Same p-value, same conclusion. After
having seen the p-value from the test and remembering that small p-values are evidence
that the hypothesis is not true, do you think the applet works?
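
If you want to check the Pearson-line by hand, here is a minimal Python sketch using the counts from the text. The z statistic in the last lines is an assumption about which statistic was used in the first lecture; it is included only to show that its square equals the Pearson statistic, which is why the p-value is the same.

```python
import math

# Observed counts from the text and the hypothesised probabilities.
observed = {"Heads": 374, "Tails": 326}
hypothesised = {"Heads": 0.5, "Tails": 0.5}

n = sum(observed.values())

# Pearson statistic: sum of (observed - expected)^2 / expected.
chi2 = sum(
    (observed[k] - n * hypothesised[k]) ** 2 / (n * hypothesised[k])
    for k in observed
)

# With two outcomes there is 1 degree of freedom, and the upper tail
# of the chi-square distribution is P(X > x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi2 / 2))

# ASSUMPTION: the usual z statistic for a proportion; z squared = chi2.
p_hat = observed["Heads"] / n
z = (p_hat - 0.5) / math.sqrt(0.5 * 0.5 / n)

print(f"chi2 = {chi2:.4f}, z = {z:.4f}, p-value = {p_value:.4f}")
```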
3. Also find a confidence interval for the proportion of Heads; right-click on the header,
choose Confidence Interval.
JMP uses a better, but more complicated, formula for the confidence interval for an
unknown probability than the formula given in the book; you will not get exactly the same
result if you use the book’s formula.
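
To see how much two formulas can differ, here is a minimal Python sketch. Both choices are assumptions: the text names neither formula, so the sketch takes the book's formula to be the simple "estimate plus/minus 1.96 standard errors" (Wald) interval and the more complicated one to be the Wilson score interval.

```python
import math

heads, n = 374, 700
z = 1.959964  # approximate 97.5% quantile of the standard normal
p_hat = heads / n

# ASSUMED book formula (Wald): p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n).
wald_half = z * math.sqrt(p_hat * (1 - p_hat) / n)
wald = (p_hat - wald_half, p_hat + wald_half)

# ASSUMED JMP formula (Wilson score interval).
denom = 1 + z**2 / n
centre = (p_hat + z**2 / (2 * n)) / denom
half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
wilson = (centre - half, centre + half)

print(f"Wald:   ]{wald[0]:.4f}; {wald[1]:.4f}[")
print(f"Wilson: ]{wilson[0]:.4f}; {wilson[1]:.4f}[")
```

With 700 tosses the two intervals only differ from about the fourth decimal; with fewer observations the difference would be larger.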

‡ Use the appropriate decimal symbol, i.e. comma if your computer “speaks” Danish or any
continental European language, dot if your computer thinks you are from the US or the UK.

Selected solutions

Forbes data set

4. The empirical rule says that

• 68% of the observations lie between 5.75 and 8.21
• 95% of the observations lie between 4.52 and 9.44
• almost all of the observations lie between 3.29 and 10.67

Judging from the quantiles, the intervals actually contain something like 80%, 88-90% and
98% of the observations; not a very impressive performance for the empirical rule.
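
The three intervals are mean ± 1, 2 and 3 standard deviations. A minimal Python sketch of the arithmetic; the mean and standard deviation are not quoted in the solution, so here they are inferred from the first interval (midpoint and half-width of 5.75 to 8.21):

```python
# Inferred from the 68% interval above: midpoint and half-width.
mean = (5.75 + 8.21) / 2  # about 6.98
sd = (8.21 - 5.75) / 2    # about 1.23

# Empirical rule: mean +/- k standard deviations for k = 1, 2, 3.
intervals = {
    k: (round(mean - k * sd, 2), round(mean + k * sd, 2))
    for k in (1, 2, 3)
}

for k, (low, high) in intervals.items():
    print(f"mean +/- {k} sd: {low} to {high}")
```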

8. The median market values of the four largest sectors (in terms of number of observations
in the data set) are

Finance 606
Energy 779
Manufacturing 1093.5
Retail 1001.5

9. The counts for the finance sector are 2, 8, 6, 1, 0 (ordered by number of employees); row
percentages are 11.76, 47.06, 35.29, 5.88 and 0.
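
A row percentage is the count divided by the row total. A minimal Python sketch that reproduces the percentages from the counts given above:

```python
# Counts for the finance sector, ordered by number of employees.
counts = [2, 8, 6, 1, 0]
total = sum(counts)

# Row percentage = 100 * count / row total.
row_pct = [round(100 * c / total, 2) for c in counts]

print(total, row_pct)
```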

14. The regression equation is

log market value = 1.339 + 0.743 ⋅ log sales
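
To read the equation on the original scale you can transform back. A minimal Python sketch, assuming the natural logarithm was used (the choice that matches these solutions); the function name and the sales figures are made up for illustration:

```python
import math

# Coefficients from the regression equation above.
intercept, slope = 1.339, 0.743


def predict_market_value(sales):
    """Back-transform the fitted line: exp(intercept + slope * ln(sales))."""
    return math.exp(intercept + slope * math.log(sales))


# Doubling sales multiplies the back-transformed value by 2 ** slope,
# whatever the starting level of sales.
factor = predict_market_value(2000.0) / predict_market_value(1000.0)
print(f"factor when sales double: {factor:.3f}")
```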

Does the Web App work?

2. The p-value is 0.0696.

3. The 95%-confidence interval for the probability of “Heads” is ]0.497; 0.571[
