
Unit 7: Data Analytics

Business Understanding

Business understanding is the essential and mandatory first phase in any data mining or data analytics project.

It involves identifying and describing the fundamental aims of the project from a business perspective. This may
involve solving a key business problem or exploring a particular business opportunity. Such problems might be:

 Establishing whether the business has been performing or under-performing and in which areas
 Monitoring and controlling performance against targets or budgets
 Identifying areas where efficiency and effectiveness in business processes can be improved
 Understanding customer behaviour to identify trends, patterns and relationships
 Predicting sales volumes at given prices
 Detecting and preventing fraud more easily
 Using scarce resources most profitably
 Optimising sales or profits
Having identified the aims of the project to address the business problem or opportunity, the next step is to
establish a set of project objectives and requirements. These are then used to inform the development of a
project plan. The plan will detail the steps to be performed over the course of the rest of the project and should
cover the following:

 Deciding which data needs to be selected from internal or external sources;


 Acquiring suitable data;
 Establishing the criteria that will determine whether or not the project has been a success;
 Developing an understanding of the acquired data;
 Cleaning and preparing the data for modelling;
 Selecting suitable tools and techniques for modelling;
 Creating appropriate models from the data;
 Evaluating the created models;
 Visualising the information obtained from the data;
 Implementing a solution or proposal that achieves the original business objective.
 Evaluating whether the project has been a success using the predetermined criteria.

Data Understanding

The second phase of the CRISP-DM process involves obtaining and exploring the data identified as part of the
previous phase. It has three separate steps, each resulting in the production of a report: data acquisition, data description and data exploration.


Data acquisition
This step involves retrieving the data from their respective sources and producing a data acquisition report that lists the sources of data, along with their provenance and the tools or techniques used to acquire them. It should also document any issues which arose during the acquisition, along with the relevant solutions. This report will facilitate the replication of the data acquisition process if the project is repeated in the future.

Data description

The next step requires loading the data and performing a rudimentary examination of the data to aid in the
production of a data quality report. This report should describe the data that has been acquired.

It should detail the number of attributes and the type of data they contain. For quantitative data, this should
include descriptive statistics such as minimum and maximum values as well as their mean and median and other
statistical measures. For qualitative data, the summary should include the number of distinct values, known as the cardinality of the data, and how many instances of each value exist. The first step is to describe the raw data. For instance, if analysing a purchases ledger, you would at this stage produce counts of the number of transactions for each department and cost centre, the minimum, mean and maximum for amounts, and so on. Relationships between variables are examined in the data exploration step (e.g. by calculating correlations). For
both types of data, the report should also detail the number of missing or invalid values in each of the attributes.

If there are multiple sources of data, the report should state on which common attributes these sources will be
joined. Finally, the report should include a statement as to whether the data acquired is complete and satisfies
the requirements outlined during the business understanding phase.
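For analysts working in a programming environment rather than a spreadsheet, a minimal sketch of how such a data quality summary could be produced with the Python pandas library is shown below. The file name and the column handling are illustrative assumptions, not part of any standard.

```python
import pandas as pd

df = pd.read_csv("purchases_ledger.csv")        # hypothetical data source

# One row per attribute: data type, number of missing values and cardinality
summary = pd.DataFrame({
    "dtype": df.dtypes,
    "missing": df.isna().sum(),
    "distinct_values": df.nunique(),
})

# Descriptive statistics for the quantitative attributes only
numeric_stats = df.select_dtypes("number").agg(["min", "max", "mean", "median"]).T
summary = summary.join(numeric_stats)

print(summary)
```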

Data exploration

This step builds on the data description and involves using statistical and visualisation techniques to develop a
deeper understanding of the data and their suitability for the analysis.

These may include:

 Performing basic aggregations;


 Studying the distribution of the data, either by producing descriptive statistics such as means, medians and standard deviations or by plotting histograms;
 Examining the relationships between pairs of attributes, e.g. by calculating correlations or fitting regressions for numeric data, or by chi-square testing for categorical data; and
 Exploring the distribution and relationships in significant subsets of the data
These exploratory data analysis techniques can help provide an indication of the likely outcome of the analysis and may uncover patterns in the data that are worth subjecting to further examination.

The results of the exploratory data analysis should be presented as part of a data exploration report that should
also detail any initial findings.
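A hedged sketch of how these exploratory steps might look in Python with pandas; the column names such as "department" and "amount" are hypothetical, and the histogram assumes matplotlib is installed.

```python
import pandas as pd

df = pd.read_csv("purchases_ledger.csv")        # hypothetical data source

# Basic aggregations, e.g. transaction counts and totals by department
print(df.groupby("department")["amount"].agg(["count", "sum", "mean"]))

# Distribution of a numeric attribute
print(df["amount"].describe())
df["amount"].hist(bins=30)                      # histogram (needs matplotlib)

# Relationships between pairs of numeric attributes
print(df.select_dtypes("number").corr())
```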

Data Preparation

As with the data understanding phase, the data preparation phase is composed of multiple steps. It is about ensuring that the correct data, in the correct form, is used so that the data analytics model can work effectively:

1. Data selection
2. Data cleaning
3. Data integration
4. Feature engineering
Data selection

The first step in data preparation is to determine the data that will be used in the analysis. This decision will be
informed by the reports produced in the data understanding phase but may also be based on the relevance of
particular datasets or attributes to the objectives of the data mining project, as well as the capabilities of the tools
and systems used to build analytical models. There are two distinct types of data selection, both of which may
be used as part of this step.

Feature selection is the process of eliminating features or variables which exhibit little predictive value or those
that are highly correlated with others and retaining those that are the most relevant to the process of building
analytical models such as:

 Multiple linear regression, where the correlation between multiple independent variables and the
dependent variable is used to model the relationship between them;
 Decision trees, simulating human approaches to solving problems by dividing the set of predictors into
smaller and smaller subsets and associating an outcome with each one.
 Neural networks, a naïve simulation of multiple interconnected brain cells that can be configured to
learn and recognise patterns.
Sampling may be needed if the amount of data exceeds the capabilities of the tools or systems used to build the
model. This normally involves retaining a random selection of rows as a predetermined percentage of the total
number of rows. Often, surprisingly small samples can give reasonably reliable information about the wider population of data, as is the case with voter exit polls in local and national elections.

Any decisions taken during this step should be documented, along with a description of the reasons for
eliminating non-significant variables or selecting samples of data from a wider population of such data.
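As an illustration only, the sketch below shows one simple way such decisions could be automated in pandas: dropping one of each pair of highly correlated numeric features and then retaining a 10% random sample. The file name, the 0.9 correlation threshold and the sample fraction are all assumptions.

```python
import pandas as pd

df = pd.read_csv("model_input.csv")             # hypothetical dataset

# Feature selection: drop one of each pair of highly correlated numeric features
corr = df.select_dtypes("number").corr().abs()
to_drop = set()
for i, a in enumerate(corr.columns):
    for b in corr.columns[i + 1:]:
        if corr.loc[a, b] > 0.9 and b not in to_drop:
            to_drop.add(b)
selected = df.drop(columns=list(to_drop))

# Sampling: retain a random 10% of rows if the full dataset is too large
sample = selected.sample(frac=0.10, random_state=42)
print(sorted(to_drop), len(sample))
```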

Data cleaning

Data cleaning is the process of ensuring the data can be used effectively in the analytical model. It involves processing the missing and erroneous data identified during the data understanding phase. Erroneous data (values outside of reasonably expected ranges) are generally set as missing.

Missing values in each feature are then replaced either using simple rules of thumb, such as setting them to be
equal to the mean or median of data in the feature or by building models that represent the patterns of missing
data and using those models to 'predict' the missing values.

Other data cleaning tasks include transforming dates into a common format and removing non-alphanumeric
characters from text. The activities undertaken, and decisions made during this step should be documented in a
data cleaning report.
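A minimal pandas sketch of the cleaning tasks described above; the column names, the "reasonable range" for amounts and the date format are hypothetical and would need to be replaced with values appropriate to the actual data.

```python
import numpy as np
import pandas as pd

df = pd.read_csv("model_input.csv")                          # hypothetical dataset

# Treat values outside a reasonably expected range as missing
df.loc[~df["amount"].between(0, 100_000), "amount"] = np.nan

# Simple rule of thumb: replace missing values with the feature median
df["amount"] = df["amount"].fillna(df["amount"].median())

# Transform dates into a common format
df["invoice_date"] = pd.to_datetime(df["invoice_date"], dayfirst=True, errors="coerce")

# Remove non-alphanumeric characters from text
df["description"] = df["description"].str.replace(r"[^A-Za-z0-9 ]", "", regex=True)
```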

Data integration

Data mining algorithms expect a single source of data to be organised into rows and columns. If multiple
sources of data are to be used in the analysis, it is necessary to combine them. This involves using common
features in each dataset to join the datasets together. For example, a dataset of customer details may be
combined with records of their purchases. The resulting joined dataset will have one row for each purchase
containing attributes of the purchase combined with attributes related to the customer.
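In Python this join could be sketched with pandas as below, assuming hypothetical files and a common 'customer_id' attribute. A left join keeps every purchase row even where the matching customer record is missing, which preserves the one-row-per-purchase structure described above.

```python
import pandas as pd

customers = pd.read_csv("customers.csv")        # hypothetical: one row per customer
purchases = pd.read_csv("purchases.csv")        # hypothetical: one row per purchase

# Join on the common attribute so each purchase row also carries the customer's details
combined = purchases.merge(customers, on="customer_id", how="left")
print(combined.head())
```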

Feature engineering

This optional step involves creating new variables or derived attributes from the features originally included, in order to improve the model's capability. It is frequently performed when the data analyst feels that the derived attribute or new variable is likely to make a positive contribution to the modelling process, and where it involves a complex relationship that the model is unlikely to infer by itself.

An example of a derived feature might be adding attributes such as the amount a customer spends on different products in a given time period, how soon they pay and how often they return goods, in order to assess the profitability of that customer more reliably, rather than just measuring the gross profit generated by the customer based on sales values.
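A sketch of how such derived attributes might be built with pandas, using hypothetical column names for spend, payment delay and returns.

```python
import pandas as pd

combined = pd.read_csv("purchases_with_customers.csv")   # hypothetical joined dataset

features = combined.groupby("customer_id").agg(
    total_spend=("amount", "sum"),
    avg_days_to_pay=("days_to_pay", "mean"),
    return_rate=("returned", "mean"),          # proportion of purchases returned
)

# A derived measure of customer value that the model is unlikely to infer by itself
features["spend_net_of_returns"] = features["total_spend"] * (1 - features["return_rate"])
```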

Modelling

This key part of the data mining process involves creating generalised, concise representations of the data.
These are frequently mathematical in nature and are used later to generate predictions from new, previously
unseen data.

Determine the modelling techniques to be used


The first step in creating models is to choose the modelling techniques which are the most appropriate, given
both the nature of the analysis and of the data used. Many modelling methods make assumptions about the
nature of the data. For example, some methods can perform well in the presence of missing data whereas others will fail to produce a valid model.

Design a testing strategy

Before proceeding to build a data analytics model, you will need to determine how you are going to assess the quality of its predictive ability, in other words, how well the model will perform on data it has not yet seen. This involves holding aside a subset of the data specifically for this purpose and using it to evaluate how far the model's predictions of the dependent variable are from the actual values in the data.
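A minimal sketch of this hold-out approach using the scikit-learn library; the dataset, the column names and the 25% test share are assumptions made for illustration.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

df = pd.read_csv("model_input.csv")                       # hypothetical dataset
X, y = df[["wet_days", "avg_temp", "sun_hours"]], df["sales"]

# Hold back 25% of the rows purely for assessing predictive performance
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LinearRegression().fit(X_train, y_train)
print("MAE on unseen data:", mean_absolute_error(y_test, model.predict(X_test)))
```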

Evaluation

At this stage in the project, you need to verify and document that the results you have obtained from modelling have sufficient veracity (are reliable enough) for you to support or reject the hypotheses formulated in the business understanding stage. For example, if you have performed a multiple regression analysis to predict sales based on weather patterns, are you sure that the results you have obtained are statistically significant enough for you to implement the solution, and have you checked that there are no other intermediate variables linked to the X and Y variables in your relationship which provide a more direct causal link?

Before proceeding to final deployment of the model, it is important to thoroughly evaluate it and review the
steps executed to create it; to be certain the model properly achieves the business objectives. A key objective is
to determine if there is some important business issue that has not been sufficiently considered. At the end of
this phase, a decision on the use of the data mining results should be reached.

At this stage, you will determine whether it is feasible to move on to the final phase, deployment, or whether it is
preferable to return to and refine some of the earlier steps. The outcome of this phase should be a document
providing an overview of the evaluation and details of the final decision together with a supporting rationale for
proceeding.

Deployment

During this final phase, the outcome of the evaluation will be used to establish a timetable and strategy for the
deployment of the data mining models, detailing the required steps and how they should be implemented.

Data mining projects are rarely 'set it and forget it' in nature. At this time, you will need to develop a
comprehensive plan for the monitoring of the deployed models as well as their future maintenance.
This should take the form of a detailed document. Once the project has been completed there should be a final
written report, re-stating and re-affirming the project objectives, identifying the deliverables, providing a
summary of the results and identifying any problems encountered and how they were dealt with.

 Depending on the requirements, the deployment phase can be as simple as generating a report and
presenting it to the sponsors or as complex as implementing a repeatable data mining process across the
enterprise.
 In many cases, it is the customer, not the data analyst, who carries out the deployment steps. However,
even if the analyst does carry out the deployment, it is important for the customer to clearly understand
which actions need to be carried out in order to actually make use of the created models.
 This is where data visualisation is most important as the data analyst hands over the findings from the
modelling to the sponsor or the end user and these should be presented and communicated in a form
which is easily understood.

The 3 V’s of Big Data

The main focus in big data and the digital revolution is not so much about the quantity of data, although this is a
big advantage, but it is more about the speed and currency of the data and the variety in which it is made
available. Sophisticated data analytics is about accessing data that is useful for decision making and the three
things that Big Data brings to improve the quality of decision making are:


Volume - for reliability
Velocity - for timeliness
Variety - for relevance

Volume

In data analytics, the amount of data can make a difference. With big data, you will often have to process large
amounts of data, most of it unstructured and with low information density.

The term volume is used to refer to these massive quantities of data. Most of this may have no value but you
will not know until you somehow try and structure it and use it. This data can come from a wide range of
sources as we will see later, but could include social media data, hits on a website, or results from surveys or
approval ratings given by consumers.

The main benefit from the volume of big data is the additional reliability it gives the data analyst. As any
statistician knows, the more data you have, the more reliable your analysis becomes and the more confident you
are about using the results you obtain to inform decision-making.

For some organisations the quantity of this data will be enormous, and will be difficult to collect, store and
manage without the correct infrastructure, including adequate storage and processing capacity.

Velocity

Velocity refers to the rate at which data is received, stored and used. In today’s world transactions are conducted
and recorded in real time. Every time you scan your goods at a supermarket, the store knows instantly how
much inventory it still has available and so it knows as soon as possible when it needs to re-order each item.

Similarly, as people shop with debit and credit cards using their phone apps, these transactions are updated
immediately. The bank knows immediately that funds have gone out of your account. The business also knows
that funds have been transferred into their account – all in real time.
Variety

In the past the data we collected electronically came in the very familiar rows and columns form encountered in
any database or spreadsheet. With the advent of the Internet and, more recently, the World Wide Web, the forms data comes in have significantly broadened. Variety refers to the multitude of types of data that are available for analysis as a result of these changes. Thanks to rapid developments in communications technology, the data we store increasingly comes in forms which possess far less structure. Examples include numerical data, plain text, audio, pictures and videos.

With the increasingly prevalent use of mobile internet access, sensor data also counts as a new data type. We
still also have the more traditional data types; those that are highly structured such as data held in relational
databases. Corporate information systems such as Enterprise Resource Planning, Customer Relationship
Management and financial accounting functions employ such database systems and the data these systems
contain are a valuable resource for data analysts to work with.

Unstructured data require significant additional processing, as in the data preparation stage of the CRISP-DM framework, to transform them into meaningful and useful data which can be used to support decision-making. However, being able to access and use them provides richer information, which can make the information obtained from the analysis more relevant and significant than larger amounts of data from more structured sources.

The value and lessons to be learned from Big Data

Big data has become an important form of organisational capital. For some of the world’s biggest tech
companies, such as Facebook, a large part of the value they offer comes from their data, which they’re
constantly analysing to produce more efficiency and develop new revenue streams.

However, the impact of big data and data reliance doesn't stop with the tech giants. Data is increasingly
considered by many enterprises to be a key business asset with significant potential value.

Data which is not used or analysed has no real value. However, value can be added to data as it is cleaned,
processed, transformed and analysed. Data collected can be considered to be the raw material, as in a
manufacturing process and is frequently referred to as 'raw data'.

Some of this raw material is unrefined, such as unstructured data, and some refined, as is the case with
structured data. Such data needs to be stored in a virtual warehouse, such as a cloud storage provider or an on-
premise storage solution.

The cleaning and transformation of the data into a suitable form for analysis is really where the value is being
added, so that the data can become the finished product - the useful information which needs to be delivered or
communicated to the user. Reliable, timely and relevant information is what the customer wants.

What about the veracity of your data?

Deriving value from big data isn’t only about analysing it. It is a discovery process that requires insightful
analysts, business users and managers who ask the right questions, recognise patterns, make informed
assumptions, and predict behaviour.

If the original assumptions are wrong, the interpretation of the original business question or issue is incorrect, or
the integrity of the data used in the analysis is suspect, the data analysis may yield unreliable or irrelevant
information. A data analyst must be sceptical of the information that comes out of the data analytics process and
properly challenge or verify what it is saying.
Recent technological breakthroughs have exponentially reduced the cost of data storage and computing, making
it easier and less expensive to store and process more data than ever before. As the costs of handling big data are
becoming cheaper and more accessible, it is possible to make more accurate and informed business decisions as
long as the big data is stored, processed and interpreted appropriately.

Platforms for Big Data storage and processing


HDFS
The Hadoop Distributed File System allows the storage of extremely large files in a highly redundant
manner, using a cluster of computers, in this case built using ‘off-the-shelf’ commodity hardware.
MapReduce
This is a divide and conquer approach to big data processing, allowing processing of data to be
distributed across multiple computers in a Hadoop cluster.
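The idea can be illustrated with the classic word-count example. The sketch below imitates the map, shuffle and reduce stages in plain Python on a single machine; in a real Hadoop cluster each stage would be distributed across many nodes, and the documents are invented for illustration.

```python
from collections import defaultdict
from itertools import chain

documents = ["big data needs big storage", "data drives decisions"]

# Map: each document is processed independently, emitting (word, 1) pairs
mapped = chain.from_iterable(((word, 1) for word in doc.split()) for doc in documents)

# Shuffle: pairs are grouped by key (the word)
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: the values for each key are aggregated
word_counts = {word: sum(counts) for word, counts in groups.items()}
print(word_counts)      # e.g. {'big': 2, 'data': 2, ...}
```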
Hive
Hive is a data warehouse and query tool used to analyse large sets of data stored on HDFS. It uses a SQL-like language called HiveQL, which is declarative: in other words, you specify what you want, not how to retrieve it.
Pig
Another high-level programming language used to query large data sets stored on HDFS. It is a data-
flow language that specifies the flows of data from one task to another.
HBase
A NoSQL database that runs on Hadoop clusters. NoSQL stands for 'Not Only SQL' and describes a pattern of data access that is better suited to very large data stores. HBase differs from relational databases in a number of ways, not least in that it stores data physically by column (grouped into column families) rather than by row.
Drill
A data processing environment for large-scale data projects where data is spread across thousands
of nodes in a cluster and the volume of data is in the petabytes.
Types of analytics
Descriptive Analytics

Descriptive analytics takes raw data and summarises or describes it in order to provide useful information about
the past. In essence, this type of analytics attempts to answer the question 'What has happened?'.

Descriptive analytics do exactly what the name implies: they 'describe' raw data and allow the user to see and analyse data which has been classified and presented in some logical way. They are analytics that describe the past, where the past refers to any point in time at which an event has occurred, whether that was a second ago or a year ago.

Descriptive analytics are useful because they allow analysts to learn from past behaviours, and understand how
they might influence future outcomes. Spreadsheet tools such as filtering and pivot tables are an excellent way
to view and analyse historic data in a variety of ways.

Descriptive statistics can be used to summarise many different types of business data, such as total sales by volume or value, cost breakdowns, average amounts spent per customer and profitability per product.


An example of this kind of descriptive analytics can be illustrated where retail data on the sales, cost of sales
(COS) and gross profit margin (GP) in six retail outlets of a range of five products within each store are tracked
over time to establish trends and or to detect potential fraud or loss.

By looking at the overall figures for the company as a whole, or even by individual product across the company,
or for a store as a whole, the business leader may not notice any unusual trends or departures from the expected
levels from a chart or graph of these measures.

See below how all these metrics are reasonably constant when the overall performance is described:

Only by analysing and charting these trends more closely by product, in each individual store (for example by using pivot tables), could the business leader detect if and where there is any specific fraud or loss; such discrepancies become more apparent when this type of micro-level descriptive analysis is undertaken. In the above example it looks like there was a problem with Product 2 in Store 6. See below:

When the trend for Product 2 in Store 6 is examined more closely, it can be seen that the GP margin falls from 33% down to about 17%. This has nothing to do with sales, which remain relatively constant over time, but is caused by a significantly rising COS, which increases from just above $800 in periods 1 and 2 to $1,000 by period 5. In this case the business manager, possibly an internal auditor, would be looking at a potential loss or theft of inventory relating to this product and would need to investigate further.
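The same micro-level analysis could be reproduced outside Excel. The pandas sketch below assumes a hypothetical file with store, product, period, sales and cost-of-sales columns and tabulates the GP margin trend for every store/product combination, which would make an anomaly such as Store 6 / Product 2 stand out.

```python
import pandas as pd

sales = pd.read_csv("store_product_sales.csv")   # hypothetical columns: store, product, period, sales, cos

sales["gp_margin"] = (sales["sales"] - sales["cos"]) / sales["sales"]

# GP margin trend for every store/product combination, period by period
trend = sales.pivot_table(index=["store", "product"], columns="period", values="gp_margin")
print(trend.round(2))
```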

Another example of descriptive analytics would be to analyse historical data about customers.

A summary database is shown below:

This database shows a data table with three products which are being sold by a business and how much each
product sells for, how much it costs to make or purchase, and what cost is associated with the return of each unit
of the product from customers. The database itself shows, for each of the 30 customers, how much of each
product they have purchased, how many they have returned and an analysis of how much sales and gross profit
each customer has generated is included. The database also shows the totals, the means and the medians of each
measure.

Using data analytics, this type of descriptive analysis could help the business understand more about the
following business questions:

 Which products are customers buying?


 Which customers bring in the most revenue?
 Which customers are the most profitable?
 Which customers return the most or fewest goods?
 Which products are being returned the most or the least?
This kind of descriptive analytics can help the business manager understand their customers and their buying behaviour, so that they can improve their marketing and promotion with these customers and target their communications to them more effectively. The business can also gain a greater understanding of its costs, such as the costs of returns associated with different products and customers, and try to find out why some products or some customers cost more due to returns and address these issues.

The spreadsheet above uses filters at the top of each column so that the analyst can sort the data in any way they choose. For example, they might wish to see a list of customers in order of sales, profitability or the number of returns they process.

A powerful tool to use in descriptive analytics is pivot tables in Excel. Pivot tables allow the original data table in a spreadsheet to be presented in a number of different ways, where the rows and columns can be interchanged or where only certain fields or data are displayed.

For example, from the above data, the analyst may want to focus only on the returns which each customer sends back to the company. The following shows an example of how that might be done:


Using the pivot table feature under the Insert tab of the spreadsheet and selecting the range of data in the previous example, the analyst can specify that, for each customer ID under the row labels, only the values for the quantity of returns per product and the totals are to be shown.

The output of this pivot table would appear as follows:


By examining this, it seems that the total numbers of returns of products 1 and 3 are very similar across all customers, but what does this really tell us? To gain more insight it would be necessary to compare these returns figures with the actual quantities of each product sold to each customer, to identify what percentage of each product was returned overall. More relevant still would be an analysis of the percentage of returns by product returned by individual customers, to establish which customers sent back the greatest proportion of returns under each product type, requiring an even more targeted analysis.
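For comparison, a pivot of returns as a percentage of quantities sold could be sketched in pandas as follows; the file and column names are hypothetical.

```python
import pandas as pd

data = pd.read_csv("customer_product_sales.csv")   # hypothetical columns: customer_id, product, qty_sold, qty_returned

pivot = data.pivot_table(index="customer_id", columns="product",
                         values=["qty_sold", "qty_returned"], aggfunc="sum")

# Percentage of each product that each customer sent back
return_pct = pivot["qty_returned"] / pivot["qty_sold"] * 100
print(return_pct.round(1))
```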
An increasingly popular area in which to apply descriptive data analytics is finance, using externally available information from the stock markets to help inform and support investment decisions. Many analysts source their data from a range of external sources such as Yahoo Finance, Google Finance or other easily accessible and free-to-use databases. This means that historical data on share prices and stock market indices are readily and widely available for anyone to use.

As an example, finance analysts often need to calculate the riskiness of stocks in order to estimate the equity cost of capital and to inform their investment plans.

An analyst wants to estimate the beta of Amazon shares against the Standard and Poor's (S&P) 100 stock index. The beta measures how volatile the periodic returns on this share have been, relative to the returns on the S&P index as a whole.

To do this, the analyst would access the financial data from an external website and download it to their spreadsheet.

In the example shown, monthly returns on Amazon shares have been measured against the returns on the S&P index between February 2017 and January 2019 and shown using a scatter chart:


The above sheet shows the share/index price and returns data on the left and the comparative returns plotted in a scatter chart on the right. The beta of the Amazon stock is calculated in cell F5, and the formula used in F5 is shown in the formula bar above the spreadsheet.

The easiest way to calculate a beta is to estimate the slope of a best-fit line through the data. This is achieved using the =SLOPE function in Excel, selecting the range of returns from the individual company (Y axis) and then the range containing the returns from the stock market as a whole (X axis).

Interpreting this, it can be seen that Amazon has a positive beta, meaning that as the stock market rises over a period, the price of the company's shares also rises in that same period. The beta calculated here is +1.2 for Amazon over the period considered, which means that investing in Amazon's shares on their own is riskier than investing in, or tracking, the index as a whole.
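The same beta calculation can be sketched in Python. The sketch below assumes a hypothetical file of monthly closing prices; the slope of the best-fit line of stock returns against index returns is the equivalent of Excel's =SLOPE calculation, and the covariance/variance ratio is shown as a cross-check.

```python
import numpy as np
import pandas as pd

prices = pd.read_csv("monthly_prices.csv", index_col="month")   # hypothetical columns: amazon, sp100

returns = prices.pct_change().dropna()

# Beta = slope of the best-fit line of stock returns against index returns
beta = np.polyfit(returns["sp100"], returns["amazon"], 1)[0]

# Cross-check: covariance of the two return series divided by the variance of the index
beta_check = np.cov(returns["amazon"], returns["sp100"])[0, 1] / np.var(returns["sp100"], ddof=1)

print(round(beta, 2), round(beta_check, 2))
```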

Predictive Analytics

Predictive analytics builds statistical models from processed raw data with the aim of being able to forecast
future outcomes. It attempts to answer the question 'What will happen?'

This type of analytics is about understanding the future. Predictive analytics provides businesses with valuable
insights based on data which allow analysts to extrapolate from the past to assume behaviours and outcomes in
the future. It is important to remember that data analytics cannot be relied upon to exactly “predict” the future
with complete certainty. Business managers should therefore be sceptical and recognise the limitations of such
analytics, and that predictions can only be based on reasonable probabilities and assumptions.

These analytics use historical (descriptive data) and statistical techniques to estimate future outcomes based on
observed relationships between attributes or variables. They identify patterns in the data and apply statistical
models and algorithms to capture relationships between various data sets. Predictive analytics can be used
throughout the organisation, from forecasting customer behaviour and purchasing patterns to identifying trends
in manufacturing processes and the predicted impact on quality control.
Regression
Regression analysis is a popular method of predicting a continuous numeric value. A simple example in a
business context would be using past data on sales volumes and advertising spend to build a regression model
that allows managers to predict future sales volumes on the basis of the projected or planned advertising spend.
Using a single predictor or independent variable to forecast the value of a target or dependent variable is known
as simple regression. The inclusion of multiple independent variables is more typical of real-world applications
and is known as multiple-regression.

The simplest regression models, such as those produced by Microsoft Excel, assume that the relationship
between the independent variables and the dependent variable is strictly linear. It is possible to accommodate a
limited range of alternative possible relationships by transforming the variables using logarithms or by raising
them to a power. More sophisticated algorithms can model curved or even arbitrarily-shaped relationships
between the variables.

The performance of a regression model is determined by how far its predictions are from the actual values. If larger errors should be penalised more heavily, the squared differences are used; otherwise the absolute differences are used. In Excel, this information is given by a regression output table which indicates the predictive value of the independent variable(s) for the dependent variable. The key statistic is R², which ranges from 0, a completely random association, to 1, a perfect correlation. The statistical significance of the relationships can also be confirmed by looking at the P-values and the Significance F, which should be sufficiently small to allow greater confidence.
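For readers working outside Excel, the same statistics (coefficients, adjusted R², P-values and the equivalent of Significance F) can be obtained in Python with the statsmodels package. The file and column names below are hypothetical.

```python
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("barbecue_sales.csv")      # hypothetical columns: wet_days, avg_temp, sun_hours, sales

X = sm.add_constant(df[["wet_days", "avg_temp", "sun_hours"]])
results = sm.OLS(df["sales"], X).fit()

print(results.params)           # intercept and coefficients
print(results.rsquared_adj)     # adjusted R-squared
print(results.pvalues)          # P-values for each coefficient
print(results.f_pvalue)         # the equivalent of Excel's Significance F
```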

One common application, most people are familiar with, is the use of predictive analytics to estimate sales of
product based on different factors such as the weather.

The following spreadsheet includes data on monthly barbecue sales and how these are potentially influenced by:

 Wet days in the month


 Average monthly temperature
 Monthly hours of sunshine
Let’s examine an example of this now:


The above is descriptive data, which shows the historical weather patterns and the monthly sales of barbecues
over a 24 month period.
Using the Data Analysis Tool Pack in Excel allows the data analyst to undertake a multiple linear regression to measure the relationships between each weather factor and the dependent variable, which is barbecue sales.

This is done as follows using the multiple linear regression function under the data analysis tab in the Excel
spreadsheet:

First of all the analyst needs to identify the range in which the Y or dependent variable is found. In this case
these are the cells E4:E27. These are usually presented in a single column. Next the analyst should input the
range of X or independent variables. These are in the cell range B4:D27 and for ease of analysis these are
normally presented in adjacent columns. Then the confidence level should be decided upon; usually, and by default, this is 95%.

Finally the analyst will need to decide where the output table should be presented; in the same worksheet or in
another worksheet. For presentation purposes, the output is best placed in a separate worksheet.

To enable data analysis tools such as Regression and Solver in Excel on your device, you need to go into:

File > Options > Add-Ins

Then, at the bottom, click Go and check the boxes for the Analysis ToolPak and the Solver Add-in.

Undertaking this process with the data above reveals the following output table:

From the above output table the analyst can see that the overall relationship between the independent X variables (rain, temperature and sunshine hours) and the dependent Y variable (barbecue sales) is strong, because the Adjusted R² of 0.95 is very near to 1. The statistical reliability of this result is supported by the fact that the Significance F and the P-values are all smaller than the significance level, which is 5% in this example. The combination of these results would give the analyst high confidence that the relationships between the X and Y variables are strong, allowing the analyst to assume that the X variables are reasonably good predictors of the level of barbecue sales.

This information can therefore be used to predict future sales given certain weather trends. This equation can be
obtained from the table above as follows:

Y = 588 - 30X1 - 20X2 + 10X3. This equation is given by taking the coefficient values from the summary
output table, from the top to the bottom, with the first figure being the intercept or constant which shows where
the line crosses the Y axis and the rest of the values being the X coefficients.

By knowing forecast weather data and inputting this into a model, it is then possible to predict how many
barbecues may be sold in future periods which can help the business plan their procurement of barbecues,
advertising and promotion campaigns and manage their budgets more effectively.
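Applying the fitted equation is then simple arithmetic. The short sketch below wraps the equation in a function; the forecast weather values passed in are purely illustrative.

```python
def predict_barbecue_sales(wet_days, avg_temp, sun_hours):
    """Apply the fitted equation Y = 588 - 30*X1 - 20*X2 + 10*X3."""
    return 588 - 30 * wet_days - 20 * avg_temp + 10 * sun_hours

# Illustrative forecast month (the weather values are hypothetical)
print(predict_barbecue_sales(wet_days=8, avg_temp=18, sun_hours=150))
```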

Exercising some scepticism here, a good data analyst might question the equation above because X2 (temperature) has a negative coefficient, indicating that temperature is negatively correlated with barbecue sales, whereas one would normally expect more barbecue sales when the temperature is higher. Some additional testing might therefore be needed, such as using data from other periods to check whether consistent results can be obtained to verify the original findings.

Prescriptive Analytics

Prescriptive analytics is a development of predictive analytics that allows us to forecast multiple future
outcomes based on suggested courses of action, showing the potential effects of each decision. It seeks to
determine the best course of action to take, based on past data. In effect, it helps answer the question 'What
should we do?'

The relatively new field of prescriptive analytics allows users to 'prescribe' a number of different possible actions and to guide business managers or customers towards an optimal or best solution. This type of analytics
is about advising and supporting decision-makers. Prescriptive analytics attempt to quantify the effect of future
decisions in order to advise on possible outcomes before the decisions are actually made. Prescriptive analytics
go beyond predictions and explain why certain outcomes will happen, providing recommendations regarding
actions which will optimise outcomes rather than just predicting them.

So prescriptive analytics will result in the recommendation of one or more possible courses of action which
allow objectives set out in the business understanding stage of the CRISP-DM framework to be met.
Prescriptive analytics use a combination of techniques and tools such as business rules, algorithms, machine
learning and computer modelling. These techniques are applied to data drawn from multiple internal and external sources.

Prescriptive analytics are relatively complex to model. When designed correctly, they can have a large impact
on how businesses make effective decisions, and on the business’s profitability.

Larger companies are successfully using prescriptive analytics to optimise business activities, whether that is to
maximise sales or minimise costs to make sure that they are delivering the right products at the right time at the
right price, and optimising the customer experience.

Some techniques of optimisation are already well known to accountants, such as limiting factor analysis, cost
volume profit (CVP) techniques, linear programming under constraints and such traditional techniques as
economic order quantity (EOQ) analysis used for optimising the frequency and size of purchase orders into the
business.

Spreadsheets are an excellent tool for applying prescriptive analytics. The key tools are 'What-if' analysis features such as 'Scenario Manager', which allows the analyst to quickly and easily test the outcomes of several scenarios and choose the ones which best meet the business objectives, and optimisation tools such as 'Goal Seek'. 'Goal Seek' is a technique that allows the analyst to vary a chosen input cell until a target cell reaches a specified value. This is a powerful tool in that the spreadsheet itself works out a solution or solutions which meet the targeted outcome set for the objective cell.

Let's now take a look at an example of 'Goal Seek':

In the above model of ABC Company, an income statement is presented which shows sales and variable costs
by unit and total contribution, overheads (fixed costs) and profit. The model also shows the shareholder capital
employed of ABC.

Goal Seek has been used to select cell $D$15 as the objective cell and to set a target value for that cell; the tool then changes the value of cell $C$7 (sales price) to a value which meets the objective of a 10% return on shareholder capital employed (ROSCE).
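The same kind of single-variable goal seeking can be sketched in plain Python with a bisection search. All of the figures in the model below (units, variable cost, fixed costs and capital employed) are hypothetical stand-ins for the ABC Company spreadsheet values.

```python
def rosce(price, units=10_000, var_cost=6.0, fixed_costs=25_000, capital=150_000):
    """Return on shareholder capital employed for a given selling price (all inputs hypothetical)."""
    profit = (price - var_cost) * units - fixed_costs
    return profit / capital

def goal_seek(target, low=0.0, high=100.0, tol=1e-6):
    """Bisection search for the price at which rosce(price) equals the target, much as Goal Seek does."""
    while high - low > tol:
        mid = (low + high) / 2
        if rosce(mid) < target:
            low = mid
        else:
            high = mid
    return (low + high) / 2

print(round(goal_seek(0.10), 2))    # selling price needed for a 10% ROSCE
```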

An even more powerful tool within Excel is 'Solver', a more flexible, versatile and powerful version of 'Goal Seek'. An objective can be set, subject to a series of constraints, and the tool will either present the analyst with a solution that meets the stated objective and all of the constraints, or report that no feasible solution can be found. The tool is powerful because, where a feasible solution cannot be found, the analyst can instead set the objective to a minimum or maximum target value, allowing the tool to explore and find the best solution possible given the constraints that exist.

It is clear that such tasks as solving transportation or linear programming problems are easily achieved using
‘Solver’. Let’s now look at some examples of this type of application in action:

As a fun example, it is worth demonstrating how 'Solver' could be used to solve the magic square problem. Solving the 'Magic Square' requires the integers 1-9 to be arranged within a nine-cell square in such a way that no number is repeated and every row, column and diagonal adds up to 15. The spreadsheet sums each row, column and diagonal, all of which should add up to 15 if the optimal solution is reached. The highlighted cell, A14, will show a standard deviation of zero if all the totals in the rows, columns and diagonals are equal, as they are currently with all cells being empty.

Solver is able to provide a solution for this fairly quickly.


When the prescriptive analyst opens the solver tool they are doing the following:

1. Setting the objective cell as A14.
2. Setting the target value of that cell to zero (the standard deviation).
3. Changing the cells $B$4:$D$6, subject to the following constraints:
    - All the cells in the square must have values less than or equal to 9;
    - All the cells in the square must have values greater than or equal to 1;
    - All the cells must be different;
    - All the cells must be integers (whole numbers).
4. Setting the solving method to Evolutionary or GRG Nonlinear, two of the options within the solving method field. The analyst can then ask the spreadsheet to solve the problem, and the tool will arrive at one of the feasible solutions, which are rotations or mirror images of one another.
5. The resulting solution to the magic square problem is then displayed in the square, with every row, column and diagonal summing to 15 (a brute-force alternative in Python is sketched below for comparison).
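Because the search space is tiny, the magic square can also be found by brute force in a few lines of Python:

```python
from itertools import permutations

# Brute-force the 3x3 magic square: every row, column and diagonal must sum to 15
solutions = []
for p in permutations(range(1, 10)):
    rows = [p[0:3], p[3:6], p[6:9]]
    cols = [p[0::3], p[1::3], p[2::3]]
    diagonals = [(p[0], p[4], p[8]), (p[2], p[4], p[6])]
    if all(sum(line) == 15 for line in rows + cols + diagonals):
        solutions.append(rows)

print(len(solutions))           # the solutions are rotations and mirror images of one another
for row in solutions[0]:
    print(row)
```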

Another, more complex, optimisation problem commonly encountered in business is the transportation problem. This is shown below, using Solver as a data analytics tool to minimise transportation costs between depots and stores:
A company wishes to minimise its costs of delivering televisions from three depots (D1, D2 and D3) to three
stores (S1, S2, S3).

The problem is set out on an Excel spreadsheet as follows:


In the above example the cost per mile of delivering TVs, the distances between depots and the stores and the
capacities of the stores to hold TVs are given.

The Solver objective would be to minimise the total cost in the yellow cell E22, subject to the constraints that
the total allocations of TVs to each store from all depots cannot exceed the maximum capacity of the store to
hold TVs and that the total number of TVs transported from the depots cannot exceed the number of TVs held at
each depot. The optimal solution is presented in the green figures shown in the range C15:E18.

In this case the data analyst is setting the objective cell to a minimum value, subject to the constraints that exist
and as would be expected in a problem like this, the Simplex Linear Programming (LP) algorithm is the most
appropriate to use here.

These are given in the Solver table, like so:


The table above shows that cell E22 must be minimised, subject to the totals in cells C18:E18 (the number of TVs delivered to each store) being less than or equal to the values in cells C20:E20 (the maximum capacity of each store to hold TVs in stock). The other constraint is that all the TVs available in the depots must be distributed to stores.

Solver finds a solution by changing the cells C15:E17 until the objective and constraints are met.

Note that if there was less total capacity in the stores to hold TVs than there were TVs available in the depots,
Solver would report a non-feasible solution and the analyst must be careful to anticipate such problems in their
model design.
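Outside Excel, the same class of problem can be solved with a linear programming library. The sketch below uses scipy.optimize.linprog with hypothetical costs, depot stocks and store capacities; it ships everything out of the depots while respecting store capacities, mirroring the Solver constraints described above.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical cost of delivering one TV from each depot (rows) to each store (columns)
cost = np.array([[4.0, 6.0, 9.0],
                 [5.0, 3.0, 7.0],
                 [8.0, 5.0, 4.0]])
depot_stock = [30, 40, 30]        # TVs available at D1, D2, D3
store_capacity = [40, 35, 35]     # maximum TVs each store can hold

n_depots, n_stores = cost.shape
c = cost.flatten()                # decision variables x[i, j], flattened row by row

# Each depot must ship out exactly the stock it holds
A_eq = np.zeros((n_depots, n_depots * n_stores))
for i in range(n_depots):
    A_eq[i, i * n_stores:(i + 1) * n_stores] = 1

# Each store cannot receive more than its capacity
A_ub = np.zeros((n_stores, n_depots * n_stores))
for j in range(n_stores):
    A_ub[j, j::n_stores] = 1

res = linprog(c, A_ub=A_ub, b_ub=store_capacity, A_eq=A_eq, b_eq=depot_stock,
              bounds=[(0, None)] * (n_depots * n_stores), method="highs")
print(res.x.reshape(n_depots, n_stores))   # optimal allocation of TVs
print(res.fun)                             # minimum total cost
```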

Another version of this model is more accountancy related, specifically in Performance Management. The
following example is a model to determine an optimal production plan for a company producing four products
with a maximum level of demand for each product and using specific amounts of three different types of raw
material. There are limited quantities of the raw materials available, meaning that not all the demand for all
products can be met.

Let’s examine an example of this:



The model has used Solver to determine the optimal quantities of each of the four products to produce in cells
C16:F16.

The model then determines the optimal production plan and calculates the contribution each product generates
when implementing that plan and the total contribution is shown in cell H26.

The solver parameters for this are as follows:

In the above solver engine, the data analyst has specified that the objective cell is the total contribution in cell
H26. The target is to maximise this value by changing the quantities of the products in cells C16:F16.

This is to be achieved, subject to the following constraints:

 The production of each product must be less than or equal to the maximum demand for each product in cells C4:H4.
 The production plan must be calculated in integers or complete units of each product
 The usage of each of the three materials in the production plan must be less than or equal to the total availability of each material shown in the red cells H19:H21. (One way of setting this problem up outside Excel is sketched below.)
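The production-planning model could be sketched outside Excel using the third-party PuLP library, which handles the integer constraint directly. All of the contribution, demand and material figures below are hypothetical, not the values in the spreadsheet.

```python
import pulp

products = ["P1", "P2", "P3", "P4"]
contribution = {"P1": 30, "P2": 25, "P3": 40, "P4": 20}      # contribution per unit
max_demand = {"P1": 100, "P2": 150, "P3": 80, "P4": 120}
materials = {"M1": 900, "M2": 600, "M3": 750}                # kg available
usage = {                                                    # kg of material per unit of product
    "M1": {"P1": 2, "P2": 3, "P3": 4, "P4": 1},
    "M2": {"P1": 1, "P2": 2, "P3": 2, "P4": 1},
    "M3": {"P1": 3, "P2": 1, "P3": 2, "P4": 2},
}

plan = pulp.LpProblem("optimal_production_plan", pulp.LpMaximize)
qty = pulp.LpVariable.dicts("qty", products, lowBound=0, cat="Integer")

plan += pulp.lpSum(contribution[p] * qty[p] for p in products)              # maximise contribution
for p in products:
    plan += qty[p] <= max_demand[p]                                         # demand limits
for m, available in materials.items():
    plan += pulp.lpSum(usage[m][p] * qty[p] for p in products) <= available # material limits

plan.solve()
print({p: qty[p].value() for p in products}, pulp.value(plan.objective))
```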
Finally, another example of prescriptive analytics is now demonstrated.

A mobile phone company is trying to maximise its coverage of customers in a given region. The region has five cities: A, B, C, D and E. Each city has a different population.

The business problem for the data analyst is to recommend to the phone company where to construct a radio mast so that all city populations are reached, while minimising costs by using a radio mast with only enough power to cover the minimum range needed to achieve the overall objective.

The problem is set out in the spreadsheet below, with a chart showing the (X, Y) coordinates of all five cities and the radio mast set at an initial position of (0, 0):



In the above model the (X, Y) coordinates of the cities in the region are shown in a coordinate table in cells B5:C9, and the populations of each city are shown in cells D5:D9.

The mast range is shown in cell B13, currently displaying 12.81 miles at this default position.

The model then uses Pythagoras' theorem to calculate the distances from each city to the mast, displayed in cells G5:G9. The illustration shows the distance from City C to the mast, calculated where the city coordinate is (5, 1) and the mast coordinate is (0, 0):

The distance from the mast to the city along the dotted line (the hypotenuse) is equal to the square root of the sum of the squares of the other two sides:

√((X1 - X2)² + (Y1 - Y2)²)

In this case the distance is √(5² + 1²) = √(25 + 1) = √26 = 5.099, which is the value displayed in cell G7.
The model displays the coordinates of the cities and the mast in a coordinate chart on the right-hand side of the spreadsheet. The mast location is shown in orange and the coordinates of the mast are held in cells B10:C10, currently set at (0, 0) by default until the problem is solved.

The table then contains IF statements to determine whether the mast has the range to reach each city at a given location, and a Solver engine is then used to determine the mast location that minimises the range needed to reach the populations in all five cities.

The solving method used to determine the optimal position of the mast is GRG Nonlinear, as shown below:

In the above Solver engine, the data analyst has specified that the objective cell is the mast range in the highlighted cell B13. The target is to minimise this value, seeking the minimum range needed by the mast, by changing the coordinates of the mast in cells B10 and C10.

This is to be achieved, subject to the following constraints:

 The mast coordinates should be 10 or under


 The mast coordinates should be integers (round numbers)
 The mast coordinates must be 1 or over
 The total population reached in cell I10 must equal 73 (‘000) being the total population of all five
cities.
Selecting the GRG Nonlinear solving method and clicking solve reveals the following result:
The solution arrived at is to locate the mast at coordinate (6, 5) which achieves coverage of all cities and the
minimum range the mast needs to achieve this objective is five miles.
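Because the mast coordinates are restricted to integers between 1 and 10, the same kind of answer could also be found by exhaustive search. The sketch below does this in Python; City C matches the (5, 1) coordinate used in the worked example, but the other city coordinates and populations are illustrative assumptions rather than the values in the spreadsheet.

```python
import math

# City (x, y) coordinates and populations in '000; City C matches the (5, 1) example,
# the other values are illustrative assumptions rather than the spreadsheet figures
cities = {"A": (2, 9, 20), "B": (9, 3, 15), "C": (5, 1, 12), "D": (8, 8, 16), "E": (3, 4, 10)}

best = None
for x in range(1, 11):                  # candidate integer mast coordinates between 1 and 10
    for y in range(1, 11):
        # To reach every city (and hence the whole population), the range needed
        # is the Pythagorean distance to the furthest city
        needed = max(math.hypot(x - cx, y - cy) for cx, cy, _pop in cities.values())
        if best is None or needed < best[0]:
            best = (needed, (x, y))

print(best)     # minimum range required and the mast location that achieves it
```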

Data analytics methodologies

Artificial Intelligence

Artificial intelligence is an important branch of computer science that has the broad aim of creating machines that behave intelligently. The field has several subfields, including robotics and machine learning. There are three major categories of artificial intelligence:

1. Artificial Narrow Intelligence
2. Artificial General Intelligence
3. Artificial Super Intelligence

Most of the current artificial intelligence is narrow, while general intelligence is becoming increasingly likely to
be commonplace in the near future. Super-intelligence is not yet even remotely likely.

Artificial Narrow Intelligence

Artificial Narrow Intelligence or Weak AI is so-called because it is limited to the performance of specialised and
highly-specific tasks. Amazon’s Alexa is an example of artificial narrow intelligence. Most commercial
applications of AI are examples of Artificial Narrow Intelligence.

Artificial General Intelligence

Artificial General Intelligence, also known as Strong AI or Human-Level AI, is the term used for artificial
intelligence that permits a machine to have the same capabilities as a human.
Artificial Super Intelligence

Artificial Super Intelligence goes beyond general intelligence and results in machines that have superior
capabilities than humans do.

Artificial Intelligence

Artificial Intelligence has many uses in business and finance, many of which draw heavily on machine learning,
including:

 Using sophisticated pattern-recognition techniques to identify potentially fraudulent insurance claims and credit/debit card transactions.
 Employing network analysis to detect bank accounts likely to be used for the transfer of the proceeds of
crime.
 Customer segmentation and targeted advertising.
 Identifying IT outages before they happen using data from real-time monitoring and pattern-
recognition techniques.
 Using data from GPS sensors on delivery trucks and machine learning to optimise routes and ensure
maximum fleet usage.
 Product recommendation systems, such as that used by Amazon.
 Analysing customer sentiment from social media posts, using Natural Language Processing.
 Predicting the future direction and volatility of the stock market by building predictive models based on
past data and macroeconomic variables.
 Dynamic pricing of goods and services using sales, purchasing and market data together with machine
learning.
 Active monitoring of computer networks for intrusion attempts.
Robotics

Robotics is an interdisciplinary branch of artificial intelligence which draws on the disciplines of computer
science, electronic engineering and mechanical engineering and is concerned with the development of machines
which can perform human tasks and reproduce human actions. The human tasks robotics seeks to replicate
include logic, reasoning and planning.

Not all robots are designed to resemble human appearance, but many will be given human-like features to allow
them to perform physical tasks otherwise performed by humans. The design of such robots makes considerable
use of sensor technology, including but not limited to computer vision systems, which allow the robot to 'see'
and identify objects.

Robots are frequently used on production lines in large manufacturing enterprises, but can also be found in the
autopilot systems in aircraft as well as in the more recent and growing development of self-driving or
autonomous cars. All these examples represent the 'narrow' category of artificial intelligence. Robotics is
increasingly used in business and finance.

 Robo-advisors use speech-recognition and knowledge bases to assist customers of financial institutions
in selecting the most suitable products.
 Artificial Intelligence in mobile applications is being employed to assist customers of banks in the
managing of their personal finances.
 Businesses are increasingly employing robotic assistants in customer facing roles such as technical
support on telephones and websites.

Machine Learning

Machine learning is the use of statistical models and other algorithms to enable computers to learn from data. It is divided into two distinct types: unsupervised and supervised learning. The main feature of machine learning is that the machine learns from its own experience of interacting with the data it is processing and can make decisions independently of any input from human beings. Such systems can adapt or create their own algorithms to help them make better and more relevant decisions on the basis of this experience.

Unsupervised learning draws inferences and learns structure from data without being provided with any labels, classifications or categories. In other words, unsupervised learning can occur without any prior knowledge of the data or the patterns it may contain.

The most frequently used form of unsupervised learning is clustering, which is the task of grouping a set of observations so that those in the same group (cluster) are more similar to each other in some way than they are to those in other clusters. There are multiple methods of determining similarity or dissimilarity, the most commonly used being some form of distance measure, with observations that are close to each other being considered part of the same cluster.

The quality of clusters can be determined by a number of evaluation measures. These generally base their
quality score on how compact each cluster is and how distant it is from other clusters.
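A minimal clustering sketch using the scikit-learn KMeans implementation; the customer features and the choice of three clusters are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customer features: [annual spend, number of returns]
X = np.array([[520, 1], [480, 0], [150, 6], [170, 5], [900, 2], [950, 1]])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)            # the cluster assigned to each customer
print(kmeans.cluster_centers_)   # the 'typical' customer at the centre of each cluster
```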

Another frequently encountered form of unsupervised learning is market basket analysis or affinity analysis.
This type of analysis is designed to uncover co-occurrence relationships between attributes of particular
individuals or observations. For instance, a supermarket which has used market basket analysis may discover
that a particular brand of washing powder and fabric conditioner frequently occur in the same transaction, so
offering a promotion for one of the two will likely increase the sales of both, but offering a promotion for the
purchase of both is likely to have little impact on revenue.

The use of market basket analysis can also be found in online outlets such as Amazon, who use the results of the
analysis to inform their product recommendation systems. The two most frequently used market basket analysis approaches are the Apriori algorithm and frequent pattern growth.
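A sketch of a small market basket analysis using the Apriori implementation in the third-party mlxtend package; the baskets and the support threshold are invented for illustration.

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules
from mlxtend.preprocessing import TransactionEncoder

# Hypothetical supermarket baskets
baskets = [
    ["washing powder", "fabric conditioner", "milk"],
    ["washing powder", "fabric conditioner"],
    ["bread", "milk"],
    ["washing powder", "fabric conditioner", "bread"],
]

encoder = TransactionEncoder()
onehot = pd.DataFrame(encoder.fit(baskets).transform(baskets), columns=encoder.columns_)

frequent = apriori(onehot, min_support=0.5, use_colnames=True)
rules = association_rules(frequent, metric="lift", min_threshold=1.0)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])
```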

Supervised Learning is similar to the human task of concept learning. At its most basic level, it allows a
computer to learn a function that maps a set of input variables to an output variable using a set of example input-
output pairs. It does this by analysing the supplied examples and inferring what the relationship between the two
may be.

The goal is to produce a mapping that allows the algorithm to correctly determine the output value for as yet unseen data instances. This is closely related to the predictive analytics covered earlier in the unit: the machine learns from past relationships between variables and builds up a measure of how some variables, factors or behaviours predict the responses it should give. For example, knowing the temperature and other weather conditions for the coming week allows the machine to calculate orders for specific products based on those forecast factors.
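A minimal supervised learning sketch along the lines of the weather example, using a scikit-learn decision tree; the input-output pairs are invented for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical example input-output pairs: [average temperature, sunshine hours] -> units ordered
X = np.array([[12, 80], [15, 110], [20, 160], [24, 200], [27, 230], [9, 60]])
y = np.array([40, 55, 90, 130, 150, 30])

model = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)

# Predict orders for a forecast week the model has not yet seen
print(model.predict(np.array([[22, 180]])))
```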
Mainstream tools and key applications of data analytics

Tools and applications for Descriptive Analytics

There are many tools available for descriptive analytics, some of which are briefly described below:


Microsoft Excel

Microsoft Excel with the Analysis ToolPak add-in is a relatively easy to use yet powerful application for
descriptive analysis. Its main drawback is that the number of rows of data that can be processed is limited to
just over one million (1,048,576). However, it is a viable and readily available tool for descriptive statistical
analysis of smaller datasets.

RapidMiner

RapidMiner is a data science software platform developed by the company of the same name that provides an
integrated environment for data preparation, machine learning, deep learning, text mining, and predictive
analytics.

WEKA

WEKA, the Waikato Environment for Knowledge Analysis is a suite of machine learning software written in
Java, developed at the University of Waikato, New Zealand.

KNIME

KNIME, the Konstanz Information Miner, is a free and open-source data analytics, reporting and integration
platform. KNIME integrates various components for machine learning and data mining through its modular data
pipelining concept.

R

R is a statistical programming language and computing environment created by the R Foundation for Statistical
Computing. The R language is widely used among statisticians and data miners for developing statistical
software. It is particularly useful for data analysts because it can read a wide range of data formats and
supports much larger datasets than are currently possible with spreadsheets.

Python
Python is a general-purpose programming language that can make use of additional code in the form of
'packages' that provide statistical and machine learning tools.
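
As an illustration, the short sketch below assumes the pandas package is installed and uses a made-up purchases-ledger extract to produce the kind of descriptive summary discussed earlier in the unit: statistics for a quantitative attribute and value counts (cardinality) for a qualitative one.

# Descriptive statistics sketch in Python (assumes pandas is installed).
import pandas as pd

# Hypothetical purchases-ledger extract.
ledger = pd.DataFrame({
    "department": ["Grocery", "Bakery", "Deli", "Grocery", "Bakery"],
    "amount": [120.50, 35.00, 48.75, 210.00, 27.30],
})

# Quantitative summary: count, mean, min, max and quartiles for 'amount'.
print(ledger["amount"].describe())

# Qualitative summary: distinct values and the frequency of each department.
print(ledger["department"].value_counts())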

SAS

SAS is a commercial provider of Business Intelligence and data management software with a suite of solutions
that include artificial intelligence and machine learning tools, data management, risk management and fraud
intelligence.

SPSS Statistics

SPSS Statistics is a commercial solution from IBM which, while originally designed for social science research, is
increasingly used in health sciences and marketing. In common with the other applications listed here, it
provides a comprehensive range of tools for descriptive statistics.

Stata

Stata is a commercial statistical software solution frequently used in economics and health sciences.

Tools and Applications for Predictive Analytics

All of the tools mentioned in the previous section can also be used for predictive analytics. Some, such as Excel
and SPSS Statistics, are limited in the range of predictive analytics tasks they can perform. In particular, these
tools do not offer the wide range of options for classification or advanced regression available in more
specialised tools.

Predictive analytics features are also provided by applications and services such as IBM Predictive Analytics,
SAS Predictive Analytics, Salford Systems SPM 8, SAP Predictive Analytics and the Google Cloud Prediction API. R
and Python can also be used to perform predictive analytics.

Other tools in the predictive analytics space include SPSS Modeler from IBM, Oracle Data Mining, Microsoft
Azure Machine Learning and TIBCO Spotfire.

Tools in the prescriptive analytics space are fewer in number. One frequently overlooked solution is the 'what if'
analysis toolset built into Excel. This simple yet effective small-scale prescriptive analytics tool allows the
user to model different scenarios by plugging different values into a worksheet's formulas.

As mentioned earlier in the unit, there is also ‘Scenario Manager’, which allows the analyst to test outcomes
from different scenarios. The most powerful of these tools is ‘Solver’, an Excel add-in that provides flexible
and powerful optimisation; examples of how ‘Solver’ can help solve business problems and determine optimal
solutions have already been illustrated.

Although spreadsheets are versatile tools that most people have access to and can easily use, R and Python are
two other widely used options for more advanced prescriptive analytics. As programming languages, they give the
user the flexibility to design prescriptive analytical models limited in their sophistication only by the
programmer's skill, ingenuity and imagination.
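
As one hedged illustration of the kind of optimisation problem Solver is often used for, the sketch below solves a small made-up product-mix problem in Python, assuming the SciPy package is available; it is not a substitute for Solver, merely an example of prescriptive analytics expressed in code.

# Illustrative prescriptive-analytics (linear programming) sketch using SciPy.
# Hypothetical problem: maximise profit 20x + 30y subject to resource constraints.
from scipy.optimize import linprog

c = [-20, -30]            # linprog minimises, so the profits are negated
A_ub = [[2, 4],           # machine hours used per unit of products x and y
        [3, 2]]           # labour hours used per unit of products x and y
b_ub = [100, 90]          # machine and labour hours available

result = linprog(c, A_ub=A_ub, b_ub=b_ub,
                 bounds=[(0, None), (0, None)], method="highs")

print("Optimal production quantities:", result.x)
print("Maximum profit:", -result.fun)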

Self-test on AI, Machine learning and data analytics tools

You have now reached the 'Self-test on AI, Machine learning and data analytics tools'. Select 'start test' to begin.
Data visualisation and communication
The purpose of data visualisation

Data visualisation allows us to:

 Summarise large quantities of data effectively.


 Answer questions that would be difficult, if not impossible, to answer using non-visual analyses.
 Discover questions that were not previously apparent and reveal previously unidentified patterns.
 View the data in its context.

The benefits of data visualisation


By utilising data visualisation techniques, we can:

 Quickly identify emerging trends and hidden patterns in the data.


 Gain rapid insights into data which are relevant and timely.
 Rapidly process vast amounts of data.
 Identify data quality issues.
The history of data visualisation

Data visualisation is not a new concept. It could be argued that it reaches all the way back to pre-history.

One of the most ancient calculating devices, the abacus, was invented by the Chinese over 2,500 years ago. It is
not only an early calculator but also an early example of data visualisation, where the number of beads counted
on each rod shows relative quantities. The abacus has two sections, top and bottom, divided by a bar or ‘beam’.
When beads are pushed upwards or downwards towards the bar they are considered counted.

The magnitude of the numbers increases by a multiple of 10 for each rod, moving from right to left, with the far
right-hand rod containing the beads with the lowest value or denomination. This means that, on the far-right rod,
each of the five beads below the bar is worth 1 unit. Each of the two beads above the bar is worth the same as
all five beads below it, so each upper bead on the far-right rod is worth 5. On the next rod to the left, each of
the bottom beads is worth 10 and each of the top beads is worth 50, and so on.
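
Purely as an illustration of this place-value logic (the example is not from the unit), the short Python sketch below converts a hypothetical abacus reading into its numeric value, treating each upper bead as worth five lower beads on the same rod.

# Converting an abacus reading into a number (illustrative Python sketch).
def abacus_value(rods):
    # Each rod is (upper_beads_counted, lower_beads_counted), given left to right.
    # An upper bead is worth 5 units of the rod's place value, a lower bead 1 unit.
    total = 0
    for position, (upper, lower) in enumerate(reversed(rods)):
        place_value = 10 ** position          # 1, 10, 100, ... from right to left
        total += (upper * 5 + lower) * place_value
    return total

# Hypothetical reading: hundreds rod 1 lower bead, tens rod 1 upper + 2 lower beads,
# units rod 1 upper + 3 lower beads: 1*100 + 7*10 + 8*1 = 178.
print(abacus_value([(0, 1), (1, 2), (1, 3)]))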

In the following example, Mr Hoo has been calculating his fuel expenses for a month on an abacus and has
arrived at the total. (Note: the minimum denomination on the abacus is assumed to be $1.)
Types of data visualisation - Comparison

In business many types of data visualisation are used to present data and information more clearly and
effectively to users of that data. Visualisation types for comparison and composition among categories are
classified into two distinct types:


1. Static
2. Dynamic
Static

Static comparison allows comparison between categories at a single point in time.

For a single category or a small number of categories, each with few items, a column chart is the best choice. In
this case, sales of grocery, home wares, deli and bakery products are shown as sold in two different stores.

For a single category or a small number of categories, each with many items, a bar chart is the best choice. In
the chart below, where there are six stores, the vertical column chart becomes less useful. A more effective way
to visualise this data is a horizontal bar chart, with stores listed on the Y axis and sales along the X axis.
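
Since the charts themselves are not reproduced here, the sketch below shows how such a comparison might be drawn in Python with the matplotlib package (an assumption, as are the made-up sales figures), using a horizontal bar chart for six stores as described above.

# Horizontal bar chart sketch for static comparison (assumes matplotlib is installed).
import matplotlib.pyplot as plt

stores = ["Store A", "Store B", "Store C", "Store D", "Store E", "Store F"]
sales = [420, 380, 510, 290, 460, 350]    # hypothetical sales figures ($000)

# With many items, horizontal bars are easier to read than vertical columns.
plt.barh(stores, sales)
plt.xlabel("Sales ($000)")
plt.ylabel("Store")
plt.title("Sales by store at a single point in time")
plt.tight_layout()
plt.show()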

Dynamic

Dynamic comparison permits comparison between categories over time.


For data covering a large number of periods, or a small number of periods with many categories, a line chart is a
better alternative to a column or bar chart. In the line chart below, the total sales of each product across all
stores over a four-year period are shown.

Types of data visualisation - Composition

The pie chart is an example of a static visualisation, but shows the relative composition using the size of the
slices, each of which represents a simple share of the total.

Types of data visualisation – Composition

The waterfall chart shows how each component adds to or subtracts from the total.
In this example, the green bars represent revenue, adding to the total. The red bars represent costs and are
subtracted from the total. The net amount left after costs have been subtracted from the revenues is represented
by the blue bar, which is profit.

Types of data visualisation – Composition

Dynamic composition shows the change in the composition of the data over time. Where the analysis involves
few periods, a stacked bar chart is used when the absolute value of each category matters in addition to the
relative differences between categories.

Types of data visualisation – Composition

Where only the relative differences between categories matter, a stacked 100% column chart can be used. In
this example it is useful as a way of visualising how much of total sales is made up of each product group. In
the example below it can be seen that in 2018 grocery is becoming a bigger component of total sales, while deli
and bakery sales are declining as a percentage of the total.
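
A hedged sketch of how such a chart could be built in Python follows, assuming the matplotlib and NumPy packages and using invented sales figures; each product group's sales are converted to a percentage of the yearly total and stacked to 100%.

# 100% stacked column chart sketch (assumes matplotlib and NumPy are installed).
import numpy as np
import matplotlib.pyplot as plt

years = ["2016", "2017", "2018"]
sales = {                                   # hypothetical sales by product group ($000)
    "Grocery": np.array([500, 560, 640]),
    "Deli":    np.array([200, 190, 170]),
    "Bakery":  np.array([150, 140, 120]),
}

totals = sum(sales.values())                # total sales per year across all groups
bottom = np.zeros(len(years))
for group, values in sales.items():
    share = values / totals * 100           # each group as a percentage of the total
    plt.bar(years, share, bottom=bottom, label=group)
    bottom += share

plt.ylabel("Share of total sales (%)")
plt.legend()
plt.show()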

Types of data visualisation – Relationship


The scatter plot is ideal for visualising the relationship between two variables and identifying potential
correlations between them. Each observation for the two variables is plotted as a point, with the position on the
x axis representing the value of one variable and the position on the y axis representing the value of the other.

The example below, taken from earlier in the unit, is a scatter diagram of barbecue sales against the recorded
hours of sunshine per month:

The scatter diagram shows that there is a reasonably close positive correlation between the monthly hours of
sunshine and the sales of barbecues.
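
Because the original chart is not reproduced here, the sketch below recreates the idea in Python with invented monthly figures, assuming the NumPy and matplotlib packages; it plots the scatter diagram and prints the Pearson correlation coefficient as a numerical check on the visual impression.

# Scatter plot and correlation coefficient sketch (assumes NumPy and matplotlib).
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical monthly figures: hours of sunshine and barbecues sold.
sunshine = np.array([60, 75, 90, 120, 150, 180, 200, 170, 130, 95, 70, 55])
bbq_sales = np.array([12, 15, 22, 35, 48, 60, 68, 55, 40, 25, 16, 10])

# A Pearson coefficient close to +1 indicates a strong positive relationship.
r = np.corrcoef(sunshine, bbq_sales)[0, 1]
print("Correlation coefficient:", round(r, 2))

plt.scatter(sunshine, bbq_sales)
plt.xlabel("Hours of sunshine per month")
plt.ylabel("Barbecue sales (units)")
plt.show()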

Types of data visualisation – Relationship

Although it is possible to introduce a third variable and create a 3D scatter chart, such charts can be difficult
for the user to visualise and interpret. In this instance, the preferred solution is to produce multiple 2D scatter
charts for each combination of variables. An alternative to this approach is the bubble chart, which is a standard
2D scatter chart where the values in the third variable are represented by the size of the points or ‘bubbles’. The
bubble chart can sometimes be difficult to interpret where the number of observations is high and the range of
values in the third variable is wide.

The main problem with bubble charts is the difficulty readers have in comparing the size of each bubble in
absolute terms. A useful feature to resolve this problem is to add a key indicating the relative size of the
bubbles. The bubble chart below shows the position on the grid of the sales generated for a product over four
quarterly periods. The size of the bubbles represents the advertising spend in each quarter. As the size and
height of the bubbles in the graph are positively correlated, this chart shows clearly that the greater the
advertising spend, the greater the quarterly sales generated. The key, which has been added in white, helps the
reader gauge the absolute sales generated in different periods against the total advertising spend.
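
As the chart itself is not shown here, the sketch below indicates how a comparable bubble chart could be produced in Python, assuming the matplotlib package and using made-up quarterly figures; the third variable (advertising spend) is encoded in the area of each point.

# Bubble chart sketch (assumes matplotlib is installed).
import matplotlib.pyplot as plt

quarters = [1, 2, 3, 4]
sales = [120, 180, 260, 340]          # hypothetical quarterly sales ($000)
ad_spend = [10, 18, 30, 42]           # hypothetical advertising spend ($000)

# Scale the advertising spend into marker areas so the bubbles are clearly visible.
sizes = [spend * 20 for spend in ad_spend]
plt.scatter(quarters, sales, s=sizes, alpha=0.5)

plt.xticks(quarters)
plt.xlabel("Quarter")
plt.ylabel("Sales ($000)")
plt.title("Quarterly sales (bubble size = advertising spend)")
plt.show()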
What makes a good visualisation?

According to Andy Kirk, a data visualisation specialist, a good data visualisation should have the following
qualities:

 It must be trustworthy;
 It must be accessible; and
 It must be elegant.
In his work on graphical excellence, statistician Edward R. Tufte describes an effective visualisation as:

 the well-designed presentation of interesting data: a matter of substance, of statistics, and of design;
 consisting of complex ideas communicated with clarity, precision, and efficiency;
 that which gives to the viewer the greatest number of ideas in the shortest time with the least ink in the
smallest space;
 nearly always multivariate;
 requiring that we tell the truth about the data.
What makes a good visualisation?

What does this mean in practice?


Kirk's principle of trustworthiness and Tufte's call to tell the truth about the data mean that we should actively
avoid deliberately or accidentally constructing visualisations that do not accurately depict the truth about the
underlying data. This includes, but is not limited to, choosing the most appropriate visualisation; ensuring all
axes start with the lowest values (preferably zero) at the bottom left of the chart; and ensuring that axes and
data series are labelled and, where possible, share the same scale. It is common to see politicians and advertisers
ignore these principles in order to influence their audiences to believe what they wish to tell them.

The principle of accessibility suggested by Kirk echoes Tufte's statement that a good visualisation should not
only give the viewer the greatest number of ideas in the shortest space of time, but should also have clarity,
precision and efficiency. In effect, this means concentrating on those design elements that actively contribute to
visualising the data and avoiding the use of unnecessary decoration, which Tufte refers to as 'chart junk'. It also
means we should avoid trying to represent too many individual data series in a single visualisation, breaking
them into separate visualisations if necessary.
Accessibility is, to paraphrase Kirk, all about offering the most insight for the least amount of viewer effort.
This implies that a significant part of designing a visualisation is to understand the needs of the audience for the
work and making conscious design decisions based on that knowledge.

Careful use of colour is a fundamental part of ensuring an accessible design. The choice of colours should be a
deliberate decision. They should be limited in number, should complement each other and should be used in
moderation to draw attention to key parts of the design. When colour is used this way, it provides a cue to the
viewer as to where their attention should be focussed. For visualisations with a potentially international
audience, the choice of colours should also be informed by cultural norms. For example, in the western world,
the colour red means danger but for the population of East Asia, it signifies luck or spirituality.

An accessible design should enable the viewer to gain rapid insights from the work, and familiar visual metaphors
help in much the same way that a handle enables us to use a cup more efficiently. These metaphors include the
'traffic light' design used to provide a high-level summary of the state of some key business risk indicator, or
the speedometer metaphor employed to indicate performance. Both are frequently found on the Business Intelligence
dashboards used by businesses.

Kirk's final principle, elegance, is a difficult one to describe but is vital to any successful visualisation. The key
here is that what is aesthetically pleasing is usually simpler and easier to interpret, and is likely not only to
catch our attention but to hold it for longer. A good design should avoid placing obstacles in the way of the viewer;
it should flow seamlessly and guide the viewer to the key insights it is designed to impart.
Ethical considerations in the use of data

End of unit data analysis activity

The following activity requires you to analyse the data given in relation to rail travellers in a country called
Beeland and the value of ticket sales generated per month at each of the 30 stations on the main railway line to
the capital city.

You can download the spreadsheet below.

When you download the spreadsheet, please enable it. In it you will see a table showing a railway network with
data on the following X or independent variables relating to each station:
 Miles from capital city
 Population
 Spend on ticket barriers
 Predominant socioeconomic group
The furthest right column shows the Y or dependent variable which shows ticket sales at each station.

Once you have opened the spreadsheet, click on the Data tab and then on the Data Analysis command at the
right-hand side. Select Regression from the list and you will see the regression input dialog. Select the Y
variable cells (Ticket sales) as the Input Y Range, then click on the Input X Range and select the X variable
data in the four adjacent columns. Keep the confidence level at the default of 95% and choose the New Workbook
option under Output options. You do not need to select any other options under Residuals or Normal Probability.
You will then see the regression output, including an ANOVA table, which shows the results of your regression.

You must use this information and the values of the Intercept and the X variables to answer the following
questions.
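
If you would like to cross-check your Excel output, the sketch below runs the same kind of multiple regression in Python, assuming the pandas and statsmodels packages; the file name and column headings are assumptions and should be changed to match the actual spreadsheet, and the socioeconomic group is assumed to be coded numerically, as in the Excel activity.

# Multiple regression sketch (assumes pandas and statsmodels; names are assumptions).
import pandas as pd
import statsmodels.api as sm

data = pd.read_excel("beeland_stations.xlsx")       # hypothetical file name

# Y (dependent) variable and the four X (independent) variables.
y = data["Ticket sales"]
X = data[["Miles from capital city", "Population",
          "Spend on ticket barriers", "Predominant socioeconomic group"]]

# Add a constant so the model estimates an intercept, then fit by ordinary least squares.
model = sm.OLS(y, sm.add_constant(X)).fit()

# The summary includes the F statistic, R squared, the intercept and a coefficient
# for each X variable, comparable to Excel's regression output.
print(model.summary())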
https://www.youtube.com/watch?v=SswgYWffDOw
