BA104 – FUNDAMENTALS OF PREDICTIVE ANALYTICS


LESSON 2 - DATA VISUALIZATION TOOLS

The Need for Data Visualization

In the world of Big Data, data visualization tools and technologies are essential to
analyze massive amounts of information and make data-driven decisions. Data
visualization is the graphical representation of information and data. By using visual
elements like charts, graphs, and maps, data visualization tools provide an accessible
way to see and understand trends, outliers, and patterns in data.

What is Data Visualization?

• is broadly defined as a method of encoding quantitative, relational, or spatial
information into images
• is the graphical representation of information and data through the visual
elements like charts, graphs, maps and dashboards
• deals with how to present data, to the right people, at the right time, to enable
them to gain insights most effectively

BENEFITS OF DATA VISUALIZATION TOOLS

Data visualization tools offer new approaches that dramatically improve the ability to
grasp information hidden in large volumes of business data. The primary advantages
of data visualization to decision makers and their organizations are as follows:
1) Enhanced Assimilation of Business Information
Data visualization enables users to receive vast amounts of information
regarding operational and business conditions. Data visualization allows decision
makers to see connections between multi-dimensional data sets and provides new
ways to interpret data through the use of graphs, charts, infographics and other rich
graphical representations.

2) Quick Access to Relevant Business Insights


By adopting visual data discovery, business organizations improve their ability to
find the information they need when they need it, and to do so more productively
than other companies.

3) Determine Patterns in Business Operations
Data visualization enables users to see interesting and previously unknown
patterns, for example the relationship between business operations and the
related performance measures. With data visualization, it is easier to see how
day-to-day work affects overall business performance and to find out whether an
operational change caused an increase or decrease in that performance.

4) Rapid Identification of Latest Trends


In this age, the volume of data that companies are able to gather about customers
and market conditions can provide business leaders with insights into new
revenue and business opportunities, presuming they can spot the opportunities
in the mountain of data. Using data visualization, decision makers are able to
grasp shifts in customer behaviors and market conditions across multiple data
sets much more quickly.

5) Accurate Customer Sentiment Analysis


Using data visualization, companies can dig deeper into customer sentiment and
other data, which reveals emerging opportunities to launch new services for
their customers. These insights enable enterprises to act on new business
opportunities and stay ahead of their rivals.

6) Direct Interaction with Data


Data visualization also helps companies manipulate and interact with their data
directly. One of its greatest strengths is how it brings actionable insights to the
surface. Unlike static tables and charts that can only be viewed, data
visualization tools enable users to interact with the data.

7) Predictive Sales Analysis


With the help of real-time data visualization, sales executives can carry out
advanced predictive analytics on their sales figures: they can view up-to-date
figures and see why certain products are underperforming and why sales are
lagging. For example, discounts offered by competitors may be one of the
reasons.

8) Easy Comprehension of Data


Using data visualization, companies can take huge volumes of data and make them
easily comprehensible, whether in the field of entertainment, current affairs,
financial issues or political affairs. It also gives them deep insight, prompting
them to take good decisions and immediate business action if needed.

DATA VISUALIZATION APPLICATIONS

Many commercial and non-commercial data visualization tools are available in the
market. Some of the popular data visualization tools in use are Tableau,
QlikView, Sisense, Looker, Google Data Studio, Zoho Analytics, FusionCharts,
Highcharts, Datawrapper, Klipfolio, Kibana, Chartio, Plotly, Infogram, Visme,
Geckoboard, AnyChart, D3.js, Microsoft Power BI, IBM Watson Analytics and SAP
Analytics Cloud.

DATA TRANSFORMATION

Data comes in many forms, such as text, numbers, images and videos. For
example, a customer details form may have a few fields left empty; such data is
known as missing data. In most cases, data may be missing, unstructured, or
lacking a regular structure. Before data can be visualized, it must be cleaned to
make it fit for further processing.

Data cleansing has a long history in databases and is a key step in the process known
as extract, transform, load (ETL), which is commonly used in data warehouses and shown
in figure 2.1. In ETL, data is extracted from one or more sources; transformed into its
proper format and structure, including cleansing of the data; and finally loaded into a
final target location, such as a single database or file, which can be used for business
analytics and data visualization.

Extraction, Transformation and Load (ETL)


1) Extraction

The first step of the ETL process is extraction. In this step, data is extracted from
various source systems, which can be in formats such as relational databases,
SQL, XML and flat files, into the staging area. It is important to store the extracted
data in the staging area first and not load it directly into the data warehouse,
because the extracted data is in various formats and may be corrupted. Loading it
directly into the data warehouse may damage the warehouse, and rollback will be
much more difficult. This makes extraction one of the most important steps of the
ETL process.

2) Transformation
The second step of the ETL process is transformation. In this step, a set of rules or
functions is applied to the extracted data to convert it into a single standard
format.
It may involve the following processes/tasks:
• Filtering – loading only certain attributes into the data warehouse.
• Cleaning – filling up the NULL values with some default values, mapping
U.S.A, United States and America into USA, etc.
• Joining – joining multiple attributes into one.
• Splitting – splitting a single attribute into multiple attributes.
• Sorting – sorting tuples on the basis of some attribute (generally the key
attribute).
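
The transformation tasks listed above can be sketched in a few lines of Python. This is
a minimal illustration under stated assumptions, not a prescribed implementation: the
record fields (id, full_name, country, internal_note), the COUNTRY_ALIASES mapping and
the default value "UNKNOWN" are all hypothetical.

```python
# Hypothetical extracted records; field names and values are illustrative only.
raw_rows = [
    {"id": 3, "full_name": "Ana Cruz", "country": "United States", "internal_note": "x"},
    {"id": 1, "full_name": "Ben Lee", "country": None, "internal_note": "y"},
    {"id": 2, "full_name": "Cy Tan", "country": "U.S.A", "internal_note": "z"},
]

# Cleaning rule from the text: map several spellings to one standard value.
COUNTRY_ALIASES = {"U.S.A": "USA", "United States": "USA", "America": "USA"}


def transform(rows):
    out = []
    for row in rows:
        # Splitting: one attribute (full_name) becomes two attributes.
        first, last = row["full_name"].split(" ", 1)
        # Cleaning: fill NULL values with a default, then standardize spellings.
        country = row["country"] if row["country"] is not None else "UNKNOWN"
        country = COUNTRY_ALIASES.get(country, country)
        # Filtering: only selected attributes are kept (internal_note is dropped).
        out.append({
            "id": row["id"],
            "first_name": first,
            "last_name": last,
            # Joining: several attributes are combined into one.
            "label": f"{last}, {first} ({country})",
            "country": country,
        })
    # Sorting: order the tuples by the key attribute.
    return sorted(out, key=lambda r: r["id"])


clean_rows = transform(raw_rows)
for r in clean_rows:
    print(r)
```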

3) Loading

The third and final step of the ETL process is loading. In this step, the transformed
data is finally loaded into the data warehouse. Sometimes the warehouse is updated
very frequently, and sometimes loading is done at longer but regular intervals. The
rate and period of loading depend solely on the requirements and vary from system
to system.
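
The idea of placing extracted data in a staging area before loading it can be sketched
with Python's built-in sqlite3 module. This is an illustrative sketch only: an in-memory
SQLite database stands in for the data warehouse, and the table names staging_sales and
warehouse_sales, the sample rows and the validity rule are hypothetical.

```python
import sqlite3

# An in-memory SQLite database stands in for the warehouse (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging_sales (item TEXT, qty TEXT)")
conn.execute("CREATE TABLE warehouse_sales (item TEXT NOT NULL, qty INTEGER NOT NULL)")

# Extracted rows land in staging first; one of them is corrupted.
extracted = [("dog food", "12"), ("cat toy", "7"), ("bird seed", "n/a")]
conn.executemany("INSERT INTO staging_sales VALUES (?, ?)", extracted)
conn.commit()


def load(connection):
    """Copy valid staging rows into the warehouse inside one transaction."""
    rows = connection.execute("SELECT item, qty FROM staging_sales").fetchall()
    valid = [(item, int(qty)) for item, qty in rows if qty.isdigit()]
    # 'with connection' commits on success and rolls back if an error is raised.
    with connection:
        connection.executemany("INSERT INTO warehouse_sales VALUES (?, ?)", valid)
    return len(valid), len(rows) - len(valid)


loaded, rejected = load(conn)
print(f"loaded={loaded} rejected={rejected}")
```

Because the corrupted row is rejected while still in staging, it never reaches the
warehouse table and nothing has to be rolled back there.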

DATA VISUALIZATION TOOLS & TECHNIQUES


1) BAR CHART
Bar charts use rectangular blocks of varying heights, where the height of each
block corresponds to the value of the quantity being represented. The vertical
axis shows the values (for example, the total number of each type of object
counted) and the horizontal axis shows the categories. When counting the
different types of vehicles in a parking lot, the individual blocks could represent
cars, vans, motorcycles and jeeps, and their heights could represent the count of
each vehicle type.

In other words, a bar chart uses horizontal or vertical bars to show
comparisons among categories. The longer the bar, the greater the value it
represents. One axis of the chart shows the specific categories (dimensions)
being compared, and the other axis represents a discrete value (metric).
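
The parking-lot example can be sketched as a text bar chart using only the Python
standard library. The list of observed vehicles is made-up sample data, and the text
rendering is a stand-in for what a charting tool would draw.

```python
from collections import Counter

# Hypothetical observations from the parking lot (sample data for illustration).
observed = ["car", "van", "car", "motorcycle", "car", "jeep", "van", "car"]

# Category (dimension) -> count (metric).
counts = Counter(observed)


def bar_chart(data, symbol="#"):
    """Return horizontal text bars; bar length is proportional to the count."""
    width = max(len(name) for name in data)
    lines = []
    for name, value in sorted(data.items(), key=lambda kv: -kv[1]):
        lines.append(f"{name.ljust(width)} | {symbol * value} {value}")
    return "\n".join(lines)


chart = bar_chart(counts)
print(chart)
```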

*Stacked bar chart comparing consumer spending across different categories for different
generations

*Overlapping bar chart comparing branch efficiency across locations in terms of people and
profits

*Column chart comparing net migration for different countries


2) PIE CHART

Pie charts are used extensively in presentations and offices. They show
proportions and percentages between categories by dividing a circle into proportional
segments. The arc length of each segment represents the proportion of its category,
while the full circle represents the total sum of all the data, equal to 100%. Pie charts
are ideal for giving the reader a quick idea of the proportional distribution of the data.

One major disadvantage to using pie charts is that they cannot show more than a few
values, because as the number of values shown increases, the size of each segment/slice
becomes smaller. This makes them unsuitable for large amounts of data.
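
The relationship between a category's value and its slice can be sketched as follows.
The category totals are hypothetical; the calculation simply expresses each value as a
share of the total (100%) and as a central angle of the full circle (360 degrees).

```python
# Hypothetical category totals (illustrative values only).
totals = {"Dogs": 50, "Cats": 30, "Birds": 20}


def pie_slices(values):
    """Return {category: (percent_of_total, central_angle_in_degrees)}."""
    grand_total = sum(values.values())
    return {
        name: (100 * v / grand_total, 360 * v / grand_total)
        for name, v in values.items()
    }


slices = pie_slices(totals)
for name, (percent, angle) in slices.items():
    print(f"{name}: {percent:.1f}% of the circle, {angle:.1f} degrees")
```

With many categories, each percentage and angle shrinks, which is why the slices become
hard to read.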

*Management in U.S. Manufacturing: How many key performance indicators were monitored at
this establishment?

*Indian Language Use


*Working Population in America (2018)


3) DATA TABLES

Data tables display the data in a grid of rows and columns. Each column represents
a dimension or metric, while each row is one record of the data. Tables
automatically summarize the data. Each row in the table displays the summary for
each unique combination of the dimensions included in the table definition. Each
metric in the table is summarized according to the aggregation type for that metric
(sum, average, count, etc.).

For example, in Google Data Studio, a table can have up to 10 dimensions and 20
metrics. A data table which presents sales data for a fictional pet store is shown in
Table 2.1. The store sells items for dogs, cats, and birds, with several products in
each category.

Table 2.2 shows just the category dimension and quantity metric for table 2.1. It has
aggregated the quantities sold per category. Since there are only 3 categories in the
data set, the table shows just 3 rows.

Table 2.3 contains 6 rows, 1 for each item. The quantity sold metric is now
aggregated per item.
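
The way a table summarizes a metric per dimension can be sketched in plain Python. The
sales records below are invented stand-ins for Table 2.1 (the actual table values are
not reproduced here), and summarize is a hypothetical helper that applies the sum
aggregation.

```python
from collections import defaultdict

# Hypothetical sales records standing in for Table 2.1 (values are made up).
sales = [
    {"category": "Dog", "item": "Leash", "qty": 3},
    {"category": "Dog", "item": "Chew toy", "qty": 5},
    {"category": "Cat", "item": "Scratching post", "qty": 2},
    {"category": "Cat", "item": "Catnip", "qty": 4},
    {"category": "Bird", "item": "Seed mix", "qty": 6},
    {"category": "Bird", "item": "Perch", "qty": 1},
    {"category": "Dog", "item": "Leash", "qty": 2},
    {"category": "Cat", "item": "Catnip", "qty": 1},
]


def summarize(records, dimension, metric="qty"):
    """Sum the metric for each unique value of the chosen dimension."""
    totals = defaultdict(int)
    for rec in records:
        totals[rec[dimension]] += rec[metric]
    return dict(totals)


by_category = summarize(sales, "category")  # one row per category, as in Table 2.2
by_item = summarize(sales, "item")          # one row per item, as in Table 2.3
print(by_category)
print(by_item)
```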

FREQUENCY DISTRIBUTION TABLES


4) SCATTER PLOTS (CHARTS)

Scatter charts can be used to look for relationships between variables. These
charts show the data as points or circles on a graph using X (horizontal) and Y
(vertical) axes. Scatter charts can include a trend line that shows how the
variables in the chart are related. They tend to be used more frequently in
scientific fields, but there are use cases for scatter charts in the business
world as well.
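
One common way to obtain a trend line is an ordinary least-squares fit. The sketch
below assumes that method (the text does not name one) and uses made-up points; the
function name trend_line is hypothetical.

```python
# Hypothetical (x, y) points (made-up values for illustration).
points = [(1, 2.0), (2, 4.1), (3, 5.9), (4, 8.2), (5, 9.8)]


def trend_line(pairs):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope, intercept = trend_line(points)
print(f"y = {slope:.2f} * x + {intercept:.2f}")
```

A positive slope indicates that y tends to rise with x; a negative slope indicates the
opposite.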

For example, to manage a bus fleet, we need to understand the relationship
between miles driven and cost per mile. The scatter plot may look something like
figure 2.16.

To focus primarily on the cases where cost per mile is above average, a slightly
modified scatter chart can be designed, as given in figure 2.17.

From figure 2.17, we can observe that cost per mile is higher than average when
fewer than about 1,700 miles or more than about 3,300 miles are driven.
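
The filtering behind such a modified chart can be sketched as follows. The fleet
observations are invented for illustration (they are not the data behind the figures);
the code keeps only the points whose cost per mile is above the fleet average.

```python
# Hypothetical (miles driven, cost per mile) observations for a bus fleet.
buses = [
    (1200, 2.9), (1600, 2.6), (2000, 2.1), (2500, 1.9),
    (3000, 2.0), (3400, 2.5), (3800, 3.0),
]

average_cost = sum(cost for _, cost in buses) / len(buses)

# Keep only the points that a "cost above average" scatter chart would highlight.
above_average = [(miles, cost) for miles, cost in buses if cost > average_cost]

print(f"average cost per mile: {average_cost:.2f}")
for miles, cost in above_average:
    print(f"{miles} miles -> {cost:.2f} per mile")
```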

*Prices for each carat of Diamond


TIME SERIES CHART


Time series forecasting is a critical requirement for many organizations. The starting
point of forecasting is a time series visualization, which provides the flexibility to reflect
on historical data and analyze trends and seasonal components. It also helps to compare
multiple dimensions over time, spot trends and identify seasonal patterns in the data. A
few examples include stock market analysis, population trend analysis using a census,
or sales and profit trends over time.

Time series analysis is a statistical technique for recording and analyzing data points
collected over a period of time at regular intervals, such as daily, monthly, or yearly.
A time series chart is the graphical representation of the time series data across that
period.
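
A simple way to make the trend in a time series easier to see is a moving average.
The sketch below assumes that technique for illustration (the text does not prescribe
one) and uses hypothetical monthly sales figures.

```python
# Hypothetical monthly sales figures (illustrative values only).
monthly_sales = [100, 120, 90, 110, 130, 100, 120, 140]


def moving_average(series, window=3):
    """Average each run of `window` consecutive points to smooth the series."""
    if window < 1 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    return [
        sum(series[i:i + window]) / window
        for i in range(len(series) - window + 1)
    ]


trend = moving_average(monthly_sales, window=3)
print([round(v, 1) for v in trend])
```

The raw figures move up and down from month to month, while the smoothed values rise
steadily, which is the kind of trend a time series chart is meant to reveal.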

DATA VISUALIZATION TECHNIQUE: Hypothesis vs. Prediction

In day-to-day life, we come across a lot of data and a wide variety of content.
Sometimes there is so much information that we cannot tell whether it is correct.
This is where “hypothesis testing” comes in: it helps gather proof and evidence
for or against a belief or piece of information.

Hypothesis testing is an integral part of statistical inference. It is used to decide
whether sample data drawn from a population supports a given hypothesis about a
population parameter. The test weighs the evidence in the sample and decides whether
the hypothesized condition is satisfied. In simpler terms, it tries to establish whether
a fact or statement is likely to be true.
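
As one concrete illustration, the sketch below runs an exact two-sample permutation
test. This particular test is chosen here only as an example (the text does not name a
specific test), the sales figures are hypothetical, and permutation_p_value is an
illustrative helper. The null hypothesis is that the group labels make no difference.

```python
from itertools import combinations

# Hypothetical daily sales with and without a promotion (made-up numbers).
with_promo = [30, 28, 35, 32]
without_promo = [20, 22, 25, 19]


def permutation_p_value(a, b):
    """One-sided exact permutation test for mean(a) > mean(b).

    The p-value is the share of all possible relabelings of the pooled data
    whose mean difference is at least as large as the observed one.
    """
    pooled = a + b
    observed = sum(a) / len(a) - sum(b) / len(b)
    total = sum(pooled)
    hits = count = 0
    for idx in combinations(range(len(pooled)), len(a)):
        group_sum = sum(pooled[i] for i in idx)
        diff = group_sum / len(a) - (total - group_sum) / len(b)
        count += 1
        if diff >= observed - 1e-12:
            hits += 1
    return hits / count


p = permutation_p_value(with_promo, without_promo)
print(f"p-value = {p:.4f}")
```

A small p-value means the observed difference would be unusual if the labels did not
matter, so the data counts as evidence against the null hypothesis. The cut-off used
to call a result significant (often 0.05) is a convention, not part of the test.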

HYPOTHESIS VS PREDICTION

A hypothesis is a statement that provides an answer to a proposed question using
known facts and background research. Typically, hypotheses serve as starting points for
further study.

A prediction is a statement that uses existing data to forecast future events. Predictions
can be types of guesses, but they usually come directly from observations.

For example, if a delivery driver comes to your house every day at 2 p.m. for four days
in a row, you might predict that the driver will come the following day at the same
time. Based on your previous observations, your prediction is a likely foretelling of
future behavior.

EXAMPLES OF HYPOTHESES AND PREDICTIONS

Here are some example scenarios that can help you better understand hypotheses and
predictions:

Diet example
A teenager notices that a change in their diet has made their skin more oily and prone to
breakouts. They make the following hypothesis and prediction:

Hypothesis: Eating greasy, high-fat foods causes acne.

Prediction: If I eat healthier food, then my skin will produce less oil.

In this scenario, the independent variable is the person’s diet, and the dependent
variable is their skin. To test their hypothesis, the teenager can change the independent
variable and record the differences this makes on the dependent variable.

Lemonade stand example


A young girl with a lemonade stand on a busy street determines that she made more money on
Monday than she did on Tuesday. Monday was a sunny day with a high of 88 degrees. On
Tuesday, it rained, and the temperature dropped to 67 degrees. The girl makes the following
hypothesis and prediction to perform an experiment:

Hypothesis: Lemonade sales are higher when the temperature is warmer.

Prediction: If tomorrow is sunny and nice, I’ll make more money than I did on
Tuesday.

In this scenario, the weather is the independent variable, and lemonade sales are the
dependent variable. Although she can't control the weather, the girl can test her
hypothesis by recording the temperature and her sales each day to see whether there
is a correlation that supports her hypothesis.
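
The daily record keeping described above could be analyzed with the Pearson
correlation coefficient, sketched below. Only the 88 and 67 degree temperatures come
from the example; the sales amounts and the remaining days are invented, and pearson_r
is an illustrative helper.

```python
from math import sqrt

# Hypothetical daily records: (high temperature in degrees, sales in dollars).
# Only the 88 and 67 degree days appear in the example; every sales figure
# and all other days are invented for illustration.
records = [(88, 42.0), (67, 15.0), (75, 24.0), (82, 35.0), (70, 20.0), (91, 45.0)]


def pearson_r(pairs):
    """Pearson correlation coefficient between the two columns of `pairs`."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / sqrt(var_x * var_y)


r = pearson_r(records)
print(f"r = {r:.2f}")
```

A value of r close to +1 is consistent with the hypothesis that sales rise with
temperature. It supports the hypothesis but does not prove it, since other factors
(such as the day of the week) are not controlled.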

Gardener example
A gardener notices that when he plants his tomato plants next to marigolds, fewer nematodes
affect the roots of his crops. He creates the following hypothesis and prediction:

Hypothesis: Marigolds are a good companion crop for tomatoes because they reduce
nematodes.

Prediction: If I plant marigolds next to my tomatoes, then I can produce more tomatoes.

In this scenario, the presence of marigolds is the independent variable, and the tomato
plants' yield is the dependent variable. The gardener plants marigolds near some of his
tomatoes and leaves others without a companion crop. To test his hypothesis, he records
the outcomes for the dependent variable to see if his prediction holds true.

DATA VISUALIZATION AND DATA ANALYTICS COMPARISON


Data Visualization

• Data visualization is the graphical representation of information and data in a
pictorial or graphical format (Example: charts, graphs, and maps). Data
visualization tools provide an accessible way to see and understand trends,
patterns in data and outliers.

• Data visualization tools and technologies are essential to analyze massive
amounts of information and make data-driven decisions.

Data Analytics

• Data analytics is the process of analyzing data sets in order to draw conclusions
about the information they contain, increasingly with the aid of specialized
software and systems.

• Data analytics helps a business optimize its performance, as well as make
informed business decisions.

• The techniques and processes of data analytics have been automated into
mechanical processes and algorithms that work over raw data for human
consumption.

Definition
  Data Visualization: Data visualization is the graphical representation of
  information and data in a pictorial or graphical format.
  Data Analytics: Data analytics is the process of analyzing data sets in order to
  make decisions about the information they have, increasingly with specialized
  software and systems.

Benefits
  Data Visualization:
  • Identify areas that need attention or improvement
  • Clarify which factors influence customer behavior
  • Helps understand which products to place where
  • Predict sales volumes
  Data Analytics:
  • Identify the underlying models and patterns
  • Acts as an input source for the data visualization
  • Helps in improving the business by predicting needs and drawing conclusions

Used for
  Data Visualization: The goal of data visualization is to communicate information
  clearly and efficiently to users by presenting it visually.
  Data Analytics: Every business collects data; data analytics helps the business
  make more-informed business decisions by analyzing the data.

Relation
  Data Visualization: Data visualization helps to get better perception.
  Data Analytics: Together, data visualization and analytics draw conclusions about
  the datasets. In a few scenarios, analytics might act as a source for visualization.

Industries
  Data Visualization: Technologies and techniques are widely used in Finance,
  Banking, Healthcare, Retailing, etc.
  Data Analytics: Technologies and techniques are widely used in Commercial,
  Finance, Healthcare, Crime detection, Travel agencies, etc.

Platforms
  Data Visualization: Big data processing, Service management dashboards, Analysis
  and design
  Data Analytics: Big data processing, Data mining, Analysis and design

Techniques
  Data Visualization: Can be static or interactive
  Data Analytics: Can be prescriptive analytics or predictive analytics

Performed by
  Data Visualization: Data Engineers/Scientists
  Data Analytics: Data Analysts/Functional Analysts

How Should I Interpret a Data Visualization?

Data visualizations can take on multiple formats and can represent a diversity of
information types and combinations, all of which can impact your ability to understand
what is being represented.

Sentence starters are one way to scaffold students' interpretation of data visuals.
Sentence starters provide a focal point for students to begin writing (or saying) an
interpretation of the data they are viewing in graphical form.

Sentence starters can range in their cognitive demand, moving from identifying
information and patterns in the graph to generating comparisons, predictions,
and hypotheses.

Sentence starters teachers can provide students include:

• This graph shows …
• A pattern I notice in the graph is …
• An anomaly/outlier/different pattern in the graph is …
• A difference between … and … is …
• A similarity between … and … is …
• If this pattern continued, I predict …
• A probable reason for that pattern is …
• A probable reason for this difference is …
• When I first looked at this graph …
• The data that most stood out to me was …
DATA VISUALIZATION EXERCISES

Source: Figure 3 in Boden Institute, University of Sydney 2014. Evidence Brief Obesity: Sugar-
Sweetened Beverages, Obesity and Health. Australian National Preventive Health Agency,
Canberra.
Hypothesis Formulation Statements

• This graph shows the types of drinks drunk by Australian children.
• A general pattern I notice in the graph is that as the child's age increases, they
drink more of these kinds of drinks.
• A reason for this pattern might be that older children can go out and buy
their own drinks.
• A different pattern in the graph is that energy drinks go down for 14- to
16-year-olds.
• A reason for this pattern might be that they prefer drinking other drinks.
• The data that most stood out to me was that sports drinks were drunk more than
soft drinks.

Sample Interpretation:

____________________________________________________________________

Hypothesis Statement:

____________________________________________________________________

Prediction:

____________________________________________________________________

Source: Manning, M., Smith, C., & Mazerolle, P. (2013). The estimated societal costs of alcohol misuse in
Australia. Trends and Issues in Crime and Criminal Justice no. 454. Canberra: Australian Institute of Criminology

Sample Interpretation
This graph shows the estimated societal costs of alcohol misuse in Australia. The total estimated
cost exceeds $14 billion. The largest cost relates to productivity, which accounted for 42.1% or
$6.046 billion. Traffic accidents comprised 25.5% or a quarter of the costs ($3.662 billion).
Alcohol misuse had the least cost to the health system, costing $1.686 billion.
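
The figures quoted in this interpretation can be cross-checked with a little
arithmetic: each cost divided by its share should give the same total. The sketch below
uses only the numbers stated above; the health system's percentage is a derived value
that is not given in the text.

```python
# Figures quoted in the sample interpretation (billions of dollars, share of total).
productivity_cost, productivity_share = 6.046, 0.421
traffic_cost, traffic_share = 3.662, 0.255
health_cost = 1.686

# Each (cost, share) pair implies a total: total = cost / share.
total_from_productivity = productivity_cost / productivity_share
total_from_traffic = traffic_cost / traffic_share

# Derived value, not stated in the text: the health system's share of the total.
health_share = health_cost / total_from_productivity

print(f"implied total: {total_from_productivity:.2f} vs {total_from_traffic:.2f} billion")
print(f"health share (derived): {health_share:.1%}")
```

Both implied totals should agree closely and sit above $14 billion, which is
consistent with the statement that the total cost exceeds that amount.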

Hypothesis?
Prediction?

Source: Surveillance of notifiable infectious diseases in Victoria, 2011–2014

Sample Interpretation

This graph shows the number of notified laboratory-confirmed cases of influenza in
Victoria from 2011 to 2014. Each year, there is a spike in confirmed cases, which begins
in June and lasts until October. This coincides with winter, when people are more likely
to be spending time indoors. The number of cases during the winter spike has also
increased each year. In 2011, the peak number of cases was around 800, while in 2014
the peak was just over 3000.

Hypothesis?

Prediction?

What is the trend line for this scatterplot?


It naturally decreases. This is an example of a weak or low negative correlation. It is negative
because as the number of kilometers increases, the weight decreases. It is a weak correlation
because the data points are not closely grouped.

Hypothesis?
Prediction?