
Part 1

Descriptive Statistics
CHAPTER 1

Introduction to Data Analysis

In God we trust, others must provide data.


— Edwin R. Fisher (1978) to the U.S. House of Representatives (sometimes attributed
to W. Edwards Deming)

1. Introduction

How do you form your opinions? Do you base them on:

• Hearsay?
• What others tell you?
• Evidence?

Do you trust claims more if they come from someone you know (e. g., via Facebook)? But where did the
person you know find the claim? Are you being manipulated (if so, by whom and to what end)? How
do you make decisions and could you make them in a better way?
We will study probability and statistics1, with a view to helping decisions.

Good decisions are based on facts, not opinions and emotions.


— Ronald D. Snee (1988) in The College Mathematics Journal

At the start of the season, how do you pick your fantasy football team?

• Which are the best players?


• But what do we mean by best?
• What kind of measure of last season’s performance should we use?
– Total points (but what if a player like de Bruyne was injured for a long period)?
– Points per minute played?
– (Points per minute played)/cost? — since there is a £100 million budget
• Having chosen a measure, what evidence do we have? Is past performance a good guide (there
will be form, luck and other uncertainties involved)?
– When making transfers, should we focus on a short recent time window of a player’s
performance, or a longer one?
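The competing measures above can be made concrete with a small sketch. The player figures below are invented purely for illustration; the point is only that each measure can produce a different ranking:

```python
# Hypothetical players: (name, total_points, minutes_played, cost_in_millions)
players = [
    ("A", 200, 3400, 12.5),
    ("B", 160, 1600, 6.0),
    ("C", 140, 2000, 4.0),
]

def total_points(p):
    return p[1]

def points_per_minute(p):
    return p[1] / p[2]

def value_for_money(p):
    # (points per minute) / cost -- relevant when there is a fixed budget
    return points_per_minute(p) / p[3]

# The same data, three different rankings, depending on the chosen measure.
for measure in (total_points, points_per_minute, value_for_money):
    ranking = sorted(players, key=measure, reverse=True)
    print(measure.__name__, [p[0] for p in ranking])
```

With these invented figures, player A tops the total-points ranking, B tops points per minute, and C tops value for money: the choice of statistic decides the "best" player.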

This is just a small example of the everyday decisions we make. Big businesses make (and review) similar
(though more consequential) decisions every day.
Notice that the choice of measure (the “statistic”) can have a big effect on the final decision. Propagandists are famous for choosing the measure (or evidence) that suits their “narrative” — the effect they want to create. And people’s behaviour adjusts to the way it is measured.
To make an informed decision we need to:

• be clear on what question we are trying to answer;


• have the right information;
• choose a meaningful way to measure and/or summarise this information; and
• be explicit about any assumptions we make and any inherent risk.
1Statistics gets the stat- part of its name because this kind of thinking was first used systematically by states when
basing state policy on demographic and economic data.

4 MIS10090

Businesses have to make decisions all the time. Long-term strategic decisions concern decisions on what
business to engage in and where to locate. Short-term operational decisions need to be made about how
to get the product to the customer. All of these decisions need to be made on the basis of solid data
analysis if the business is to succeed.
Disadvantageous consumer decision-making was probably instrumental in both the Irish
and global financial crises. Evidence suggests that many consumers lack basic financial
knowledge and that their choices of financial services are subject to systematic biases.
— (Lunn, 2012)
Using mathematical and scientific approaches to aid decision making is the main topic of Operational
Research; when large amounts of data are involved, the term Business Analytics is often used.

1.1. Why Study Data Analysis? Data Analysis helps us to structure problems to facilitate
informed decision making. We need to know if it is valid to compare, for example, price packages from
different mobile phone operators or different utility providers.
Given a large amount of data, we need to be able to weed out the relevant data and summarise the
quantitative elements; that is, we need to be able to make sense of the numbers. Given a small amount
of data (a sample), we need to be able to make inferences about the population that the sample data
came from.
Businesses need to track their performance to make sure they are on target to meet their goals, to
identify problems, to identify opportunities and to comply with regulatory obligations. So a business
manager can use Data Analysis to:
• properly present and analyse data;
• draw conclusions about populations based on sample information;
• improve processes;
• obtain reliable forecasts;
• track Key Performance Indicators (KPIs) as part of a Balanced Scorecard.
Businesses are obliged to produce certain reports. In a regulated industry sector such as finance or
telecommunications, businesses have to show that they are in compliance with the rules. Publicly quoted
companies have to report to their shareholders. Each company has to ensure that the correct amount
of tax is paid to Revenue, that its employees are paid correctly, that its suppliers are paid correctly and
that it receives the correct payments for its services.
In this course, we will study Data Analysis applied to Decision Making. As well as the lecture notes, we
recommend the textbook (Lind et al., 2019), and you may also find (Berenson et al., 2012) useful.
Exercise 1.1. Consider the possible benefits of studying Data Analysis, particularly in the wider world,
outside of business.

2. Business Analytics: analysing data

The broader study of evidence-based decision making is called Business Analytics. Key parts of Business
Analytics include Data Mining to find patterns in data, and Visualisation to present the results to the
user or decision maker.
Knowledge Discovery (from Data) is an iterative and interactive process involving many steps, usu-
ally drawing on data stored in databases. Data Analysis is an integral part of Knowledge Discovery.
Knowledge Discovery is a vast field, with a strong statistical element.
The two primary “high level” tasks of the Knowledge Discovery process are description and prediction,
which ties in with the areas of Analytics:
Descriptive analytics: describing what the data tells us about what has happened (finding
patterns in the data that are meaningful to human beings);
Predictive analytics: predicting from the data what will happen (future or unknown values of
other variables).
Data Analysis for Decision Makers 5

There is also
Prescriptive analytics: using insight gleaned from the data to decide what we should do
which often uses optimisation approaches (covered in the Business Analytics module) to find the “best”
answer.
The Knowledge Discovery process has several main steps:
Data selection: The desired subset of the data and the attributes of interest are found by examining the entire data set;
Data cleansing: Noise and outliers are removed.2 In other pre-processing, field values are transformed to standard units and, in integration, new fields may be created by combining existing fields.
Data Exploration: Preliminary statistical summary and reporting (much of our course covers
this);
Data Mining: Interesting patterns are found by the use of Data Mining algorithms;
Evaluation: The patterns are presented to the end user in a form he or she can understand, for
example, through visualisation: we will cover this to a limited extent. This usually involves a
decision support system.
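The selection, cleansing and exploration steps can be sketched on a toy data set. This is a minimal illustrative sketch, not a real pipeline, and the sales figures are invented:

```python
import statistics

# Toy data set: (store, daily_sales) records; None marks a missing value
# and 9999.0 is an obvious recording error (outlier).
raw = [
    ("D1", 120.0), ("D1", None), ("D2", 95.0), ("D2", 101.0), ("D2", 9999.0),
]

# Data selection: keep the attributes of interest (here, all of them).
selected = raw

# Data cleansing: drop missing values and implausible outliers.
clean = [(store, v) for store, v in selected if v is not None and v < 1000]

# Data exploration: a preliminary statistical summary per store.
for store in sorted({s for s, _ in clean}):
    values = [v for s, v in clean if s == store]
    print(store, "observations:", len(values), "mean:", statistics.mean(values))
```

Note the caution from the footnote above: in anomaly detection, the “outlier” record might be precisely the thing we want to keep.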
Figure 1.1 illustrates how the steps of Knowledge Discovery build on each other to aid decision making
in business.

Figure 1.1. How the Knowledge Discovery process aids decision making in business

Exercise 1.2. Consider how data analysis can contribute to other areas of knowledge discovery than
data exploration. Do some online research on Business Analytics (you will find many different views!)
and consider where data analysis fits in.

3. Problem Solving and Methodology

We need to be clear about what problem we are trying to solve. We may be asked to:
• find the percentage of the vote that each political party will get in the next election;
• predict the number of Dáil seats that each party will get;
• predict which party may form a government.
2Caution is needed here. Nobody wants noise polluting their data, but sometimes — as in anomaly detection — it is
precisely the unusual things in the data that we want to find.

These are actually different problems. We will need to


• clearly define the problem (which involves making simplifying assumptions);
• use an appropriate model for this problem;
• apply an appropriate method(ology) to solve the problem; and
• interpret the solution (output data) in real-world terms.

input data → Model (assumptions, equations, etc.) → [solution method] → Solution (output data) → interpretation

Figure 1.2. Model and methodology

Data Analysis is not as straightforward as simply applying a mathematical technique to a set of data:
• The question we are trying to answer will dictate the technique and methodology to be applied;
• We must first understand and structure the problem by talking to the business users;
• This may involve making assumptions about the problem, to simplify it or because we lack
certain data;
• We may then need to gather new data or interpret existing data;
• We will then apply an appropriate data analysis technique and interpret the results.
So we need all of the components in Figure 1.2 to tackle any business problem.
We also need to understand the limitations, assumptions and specific applicability of quantitative analysis
techniques. We need to understand the risk associated with any proposed solution.
Exercise 1.3. Identify a problem where data analysis could be of use to:
• define the question,
• identify any assumptions (including implicit assumptions, the kind we often do not realise we
are making),
• consider what data could be relevant,
• ponder what could be an appropriate data analysis technique (this may be a challenge at this
stage in the course), and
• imagine what an interpretation of the data might look like.

4. Data Gathering

Let’s start with the ‘Input Data’. We will call an individual item of data we have collected or observed
an observation.
Where do we get that data from?
Each business may have many internal IT systems that track and process its internal operational and
customer data. Many businesses use Enterprise Resource Planning (ERP) systems to integrate their
organisational systems. For example, a supermarket chain may have
• a Corporate Data Warehouse (CDW) system to monitor what stock it holds,
• a Point of Sale (POS) system to sell goods to its customers,
• a Marketing system to monitor customer activity (via loyalty cards) and
• Financial systems to report on profitability.
Other applications include:
• Social Networks may use your list of contacts,
• Google may track your movements by storing your complete location history;
• and so on.
These systems may be interconnected and (subject to data protection, GDPR, etc.) may be able to
provide the required input data.

4.1. Primary and Secondary Data. Data that is sourced or observed directly is called Primary
Data. Here, the data analyst is the person collecting the data. Primary data is gathered directly:

• from internal IT systems or


• by conducting a survey or experiment.

Alternatively, some data can be bought or sourced from a data supplier; this is called Secondary Data:
it is indirect in that it has come through an intermediary; the data analyst is not the data collector.
Examples include:

• analysing census data;


• examining data from print journals or data published on the internet.

4.2. Types of Data and Levels of Measurement. This course is concerned with Quantitative
(clearly measurable, e. g., numerical) answers rather than Qualitative ones (thoughts and opinions).3
However, data can be categorical or numerical; furthermore, it can have several possible levels of measurement. We now explain these terms.
We can class data as Numerical when it is a number of any kind.

• For example, the stock price of IBM Common Stock is $166.05.

We can class data as Categorical when there is a finite collection of distinct possible answers.

• For example, the answer to ‘Do you have a Smart Phone?’ is either ‘Yes’ or ‘No’.

If no order is implied among the possible answers, the scale is said to be a nominal scale or level of
measurement (the lowest level of measurement); for example,

‘Which Political Party do you Support? Fianna Fáil, Fine Gael, Labour, Sinn Féin, . . . ’

does not intend to imply any ordering of the response categories.


An ordinal scale or level of measurement (the next highest) is where some order or ranking is implied in
the set of categorical responses; for example,

‘How did you rate the service in our coffee shop today? “Poor, Fair, Average, Good or Excellent”?’

implies an ordering of the responses. This is an example of a Likert scale.


Numerical data (numbers) can either be

• discrete (have gaps between them), e. g., we’re counting things; or


• continuous, as when we’re measuring things.

For example, the lecturer doesn’t know in advance of class how many students will turn up on any given
day: it can vary depending on many factors. The number of students who arrive could be 42 or some
other number. We call this a variable (more precisely, a random variable: see later) and often we let X
or some other symbol represent the value of the variable.
The number of students in a lecture hall is an example of a discrete variable, measured in whole units.
How late they come to class is an example of a continuous variable. Lateness can be measured to fractions
of a second. The accuracy of our measurement is determined by our ability to measure time.
Numbers have a natural ordering from small to big, so they automatically have the ordinal level of measurement. But the ranking in an ordinal level of measurement doesn’t say anything about how big the differences between successive items are. Numbers can also have higher levels of measurement:
3Natural Language Processing is an algorithmic technique that is used to try to draw understanding from text data.
It is used by companies such as Google to determine target recipients for, e. g., Google Ads, and Large Language Models
(LLMs) form the basis of increasingly powerful AIs such as GPT.

• interval level of measurement; here, the interval or distance between values has meaning, and
there is a chosen unit of measurement, but not necessarily a “zero” point; an example could
be dress sizes or the Celsius temperature scale (where there is a “zero”, but it is assigned by
convention, rather than having a natural meaning in terms of Physics);
• ratio level of measurement (the highest level); here, the data are based on a scale with a chosen
unit of measurement and there is also a meaningful (not just conventional) “zero” point, so
we can take ratios of values; examples include most numerical variables we encounter, such as
production levels, salaries, heights, weights, distances, stock prices, etc.
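The practical difference between the ordinal and ratio levels can be illustrated with a short sketch (the labels and salary figures are invented):

```python
# Ordinal level: order is meaningful, but differences between ranks are not.
service_scale = ["Poor", "Fair", "Average", "Good", "Excellent"]
rank = {label: i for i, label in enumerate(service_scale)}

# We can legitimately compare ranks...
assert rank["Good"] > rank["Fair"]
# ...but rank["Good"] - rank["Fair"] == 2 does NOT mean "two units better",
# and rank["Good"] / rank["Fair"] is meaningless.

# Ratio level: a true zero point makes ratios meaningful.
salary_a, salary_b = 60_000, 30_000
print(salary_a / salary_b)  # A earns twice what B earns -- a valid statement
```

This is why, for example, it makes sense to say one salary is twice another, but not that an “Excellent” rating is twice a “Fair” one.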
Exercise 1.4. Assign each of the following to the highest appropriate level of measurement:

(1) Standard & Poor’s bond ratings AAA, AA, A, BBB, BB, B, CCC, CC, C, DDD, DD, D;
(2) A person’s age;
(3) A person’s internet provider;
(4) Student letter grade on this module;
(5) Student Grade Point Average (GPA).

4.3. Summary of data and levels of measurement. In short, the levels of measurement ordered
from lowest to highest are:

(1) nominal level of measurement;


(2) ordinal level of measurement;
(3) interval level of measurement;
(4) ratio level of measurement.

Summarising, some of the reasons we need data are:

• To provide input to a survey;


• To provide input to a study;
• To measure performance of service or production process;
• To evaluate conformance to standards;
• To assist in formulating alternative courses of action;
• To satisfy curiosity.
Exercise 1.5. Identify possible data (not mentioned in the notes above) that fall into each of the
following levels of measurement; for each, assign it to the highest level of measurement that applies to it:

• nominal level of measurement;


• ordinal level of measurement;
• interval level of measurement;
• ratio level of measurement.

5. Survey Methods and Sampling

The set we are studying (the population) may be too large to capture in full — or the expense of doing
this might be prohibitive. Thus, we may choose a restricted subset of the population to survey and
study: this subset is called a sample.
However, we need to take care in how we choose the items of the sample — we do not want the items
in the sample to have very different characteristics from those of the population as a whole. That is, we
need the sample to be representative.
Later we will see how we can use the data gathered in a survey to carry out Inferential Statistics and
make statements (to a certain level of confidence) about the entire population based on the sample
results.
We can only do this if we conduct probability samples where items (or subjects) are chosen for inclusion
in a survey according to their probability of occurrence, and so are representative of the population as
a whole.

We are only interested in probability samples. Thus, we are not interested in nonprobability samples
where items are chosen for inclusion without regard to their probability of occurrence. An example of a
nonprobability sample is a self-selecting web survey. Such samples are said to be biased: we cannot say
they are truly representative of the population from which they were drawn.
Samples can be obtained using tables of random numbers or computer random number generators.
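A computer random number generator makes drawing a simple random sample straightforward. A minimal sketch, where the “population” is just a list of invented ID numbers from some sampling frame:

```python
import random

# The sampling frame: ID numbers 1..1000 (invented for illustration).
population = list(range(1, 1001))

random.seed(42)  # fixed seed so the sketch is reproducible

# Draw 50 items without replacement: every item in the frame has the
# same chance of inclusion, making this a probability sample.
sample = random.sample(population, k=50)

print(len(sample), "items drawn,", len(set(sample)), "distinct")
```

Contrast this with a self-selecting web survey, where inclusion depends on who chooses to respond rather than on a known probability.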
Later, you will meet Sampling Methods such as Systematic, Stratified and Cluster Sampling.
Finally, we address the issue of Survey Worthiness. You’ll have heard the phrase ‘Garbage in — Garbage
Out’. We need to ensure the quality and integrity of the data we use in our analysis; otherwise, our
results and conclusions are worthless.
Typically, only between 2 and 4 people out of every hundred respond to a survey. When
did you last agree to participate in a survey or take a phone call to ascertain your opinions or habits?
This lack of willingness is called survey fatigue and gives rise to non-response error. Many companies
offer an incentive or a donation to a charitable cause to encourage you to participate in their survey.
Unless care is taken when conducting a survey, the results may be unintentionally skewed toward a
certain section of the population: this is called coverage error or selection bias. We also need to ensure
that questions are clear and unambiguous so that participants are not confused or misled; similarly, it
would be unethical to use ‘leading’ questions to elicit a particular response from participants.
Exercise 1.6. Suppose (pre- or post-Covid) a pollster wants to carry out a survey to find people’s
opinion in response to some question. The pollster walks around a neighbourhood in Stillorgan, South
Dublin, between 10am and 4pm, knocking on the doors, and asking the question of anyone who opens
the door.
How representative will the responses to the survey be? Can you identify any issues with the methodology? Could the survey have selection bias? If so, why?
Now consider how your answers to the above might vary depending on the question asked. (Hint: what
if the question was whether the respondent supported setting up a playground in the local park?)

6. Summarising Data: Data Presentation

Businesses deal with large volumes of raw data: this only becomes useful information when it has been
processed and tidied.
It is difficult to interpret large tables of numbers; instead, data is often presented as a plot or graph, to
leverage our brains’ ability to interpret visual data.
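As a first step towards a plot, raw observations are usually condensed into a frequency table. A minimal sketch on invented survey responses, with a crude text-based bar chart:

```python
from collections import Counter

# Raw observations (invented): responses on a Likert-type service scale.
responses = ["Good", "Fair", "Good", "Excellent", "Good", "Fair", "Poor"]

# Condense the raw data into a frequency table.
freq = Counter(responses)

# Present it: category, count, and a simple bar of '#' characters.
for category, count in freq.most_common():
    print(f"{category:<10}{count:>3}  {'#' * count}")
```

Even this tiny summary is easier to read than the raw list, which is the whole point of data presentation.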
We will see more on this in the next Chapter.

7. Analytics Software

Microsoft Excel, part of Microsoft Office, is a spreadsheet manipulation and graphing software package.
It is an important tool in business, having been around for over 30 years.
When doing data analysis and business analytics, however, one should not rely solely on a single software
package. Spreadsheets are powerful, particularly when the data to process is compact and well organised.
However, in the real world of business, that is rarely the case; spreadsheets are particularly weak for the
following:
• Working with very large data collections;
• Working with data spread across several tables;
• Working with incomplete data;
• Keeping track of the sequence of data manipulations executed;
• Working with multidimensional tables;
• Keeping up-to-date results with fresh data;
• Collaboration among multiple users.

Despite these limitations, spreadsheets are still very powerful, particularly when working with small
datasets; their visual nature allows quick visualisation and manipulation of data, and makes the direct
consequences of that manipulation immediately apparent. For these reasons, Microsoft Excel will be used
to demonstrate the techniques used in this course.

8. Chapter Summary

In this chapter we have:


• motivated the study of Data Analysis in Decision Making;
• placed it in the wider picture of Business Analytics;
• outlined the importance of correctly defining the problem, modelling and methodology;
• introduced types of data including numerical (discrete, continuous) and categorical;
• discussed survey methods;
• outlined some principles of data presentation; and
• mentioned some strengths and weaknesses of spreadsheet software.
