
CSE 2015 - Data Analysis and Visualization

Module 1 - Introduction to Data Analysis

Module 1: Introduction to Data Visualization [12 Hrs] [Bloom's Level Selected: Understand]
Data collection, Data Preparation Basic Models - Overview of data visualization - Data Abstraction - Task Abstraction - Analysis: Four Levels for Validation, Interacting with Databases, Data Cleaning and Preparation, Handling Missing Data, Data Transformation.
Python Libraries: NumPy, pandas, matplotlib, ggplot, Introduction to pandas Data Structures.
Introducing Data
• Facts and statistics collected together for reference or analysis.
• Data has to be transformed into a form that is efficient for movement or processing.

Overview of Data Analysis
• Data analysis is defined as the process of cleaning, transforming, and modeling data to discover useful information for business decision-making.
• The purpose of data analysis is to extract useful information from data and to make decisions based on that analysis.
• A simple example of data analysis: whenever we make a decision in our day-to-day life, we think about what happened last time or what will happen if we choose that particular option.

• This is nothing but analyzing our past or anticipated future and making decisions based on it.
• For that, we gather memories of our past or expectations of our future.
• That, in essence, is data analysis. When an analyst does the same thing for business purposes, it is called data analysis.

Data Analysis Tools
[Figure: overview of common data analysis tools]

Data in the Real World
[Figure: examples of data in the real world]
Data Collection

Data collection is the process of gathering information from various sources in order to analyse it and make informed decisions based on the data collected. This can involve various methods, such as surveys, interviews, experiments, and observation.
Types of Data Collection
• Primary Data Collection
• Primary data collection is the process of gathering original and firsthand information
directly from the source or target population.
• Secondary Data Collection
• Secondary data collection is the process of gathering information from existing sources
that have already been collected and analyzed by someone else, rather than conducting
new research to collect primary data.
• Qualitative Data Collection
• Qualitative data collection is used to gather non-numerical data such as opinions,
experiences, perceptions, and feelings, through techniques such as interviews, focus
groups, observations, and document analysis.
• Quantitative Data Collection
• Quantitative data collection is used to gather numerical data that can be analyzed using statistical methods. This data is typically collected through surveys, experiments, and other structured data collection methods.
Data Collection Methods
• Surveys
• Surveys involve asking questions to a sample of individuals or
organizations to collect data. Surveys can be conducted in
person, over the phone, or online.
• Interviews
• Interviews involve a one-on-one conversation between the
interviewer and the respondent. Interviews can be structured or
unstructured and can be conducted in person or over the phone.
• Focus Groups
• Focus groups are group discussions that are moderated by a
facilitator. Focus groups are used to collect qualitative data on a
specific topic.
• Observation
• Observation involves watching and recording the behavior of people,
objects, or events in their natural setting. Observation can be done
overtly or covertly, depending on the research question.
• Experiments
• Experiments involve manipulating one or more variables and
observing the effect on another variable. Experiments are commonly
used in scientific research.
• Case Studies
• Case studies involve in-depth analysis of a single individual,
organization, or event. Case studies are used to gain detailed
information about a specific phenomenon.
Data Preparation

• Data preparation is the process of gathering, combining, structuring and organizing data so it can be used in business intelligence (BI), analytics and data visualization applications.

• Data Collection - Relevant data is gathered from operational systems, data warehouses, data lakes and other data sources.

• Data Discovery and Profiling - The next step is to explore the collected data to better understand what it contains and what needs to be done to prepare it for the intended uses.

• Data Cleansing - Next, the identified data errors and issues are corrected to create complete and accurate data sets.

• Data Structuring - At this point, the data needs to be modeled and organized to meet the analytics requirements.

• Data Transformation and Enrichment - Data enrichment further enhances and optimizes data sets as needed, through measures such as augmenting and adding data.

• Data Validation and Publishing - In this last step, automated routines are run against the data to validate its consistency, completeness and accuracy. The prepared data is then stored in a data warehouse, a data lake or another repository.
Benefits of Data Preparation

• Ensure the data used in analytics applications produces reliable results
• Identify and fix data issues that otherwise might not be detected
• Enable more informed decision-making by business executives and operational workers
• Reduce data management and analytics costs
• Avoid duplication of effort in preparing data for use in multiple applications
• Get a higher ROI from BI and analytics initiatives
Overview of Data Visualization
• The purpose of visualization is to gain insight, by means of interactive graphics, into various aspects of some process we are interested in, such as a scientific simulation or a real-world process.

[Figure: questions targeted by the visualization process]

[Figure: conceptual view of the visualization process]
Data Abstraction

• Data abstraction is the process of concealing irrelevant or unwanted data from the end user. It is a concept that allows us to store and manipulate data in a more abstract, efficient way. This type of abstraction separates the logical representation of data from its physical storage, giving us the ability to focus on the important aspects of the data without being bogged down by the details.
Challenges with Data Abstraction
Understanding Data Complexity
Data abstraction requires an understanding of both complex data
structures and logical rules. Although abstracting data can involve
simplifying it for easier management purposes, this doesn’t necessarily
mean less complexity.

Hiding Details while Remaining Accurate
Data abstraction is also a way to hide certain details from view without compromising accuracy or security.

Limitations of Schemas and Abstraction Layers
When it comes to documenting large datasets, predefined schemas are often used as an easy way to structure the data correctly.

Benefits of Data Abstraction

• Efficiency: Abstraction allows us to manipulate data in a more abstract way, separating logical representation from physical storage.
• Focus on Essentials: By ignoring unnecessary details, we can concentrate on what truly matters.
• System Efficiency: Users access relevant data without hassle, and the system operates efficiently.
What is Data Validation?

Data validation refers to the process of ensuring the accuracy and quality of data. It is implemented by building several checks into a system or report to ensure the logical consistency of input and stored data.

Types of Data Validation

1. Data Type Check
• A data type check confirms that the data entered has the correct data type. For example, a field might only accept numeric data. If this is the case, then any data containing other characters such as letters or special symbols should be rejected by the system.
2. Code Check
• A code check ensures that a field is selected from a valid
list of values or follows certain formatting rules. For
example, it is easier to verify that a postal code is valid
by checking it against a list of valid codes. The same
concept can be applied to other items such as country
codes and NAICS industry codes.

3. Range Check
A range check will verify whether input data falls within a
predefined range. For example, latitude and longitude are
commonly used in geographic data. A latitude value should be
between -90 and 90, while a longitude value must be between
-180 and 180. Any values out of this range are invalid.

4. Format Check
Many data types follow a certain predefined format. A common
use case is date columns that are stored in a fixed format like
“YYYY-MM-DD” or “DD-MM-YYYY.” A data validation procedure
that ensures dates are in the proper format helps maintain
consistency across data and through time.

5. Consistency Check
• A consistency check is a type of logical check that
confirms the data’s been entered in a logically consistent
way. An example is checking if the delivery date is after
the shipping date for a parcel.

6. Uniqueness Check
• Some data like IDs or e-mail addresses are unique by
nature. A database should likely have unique entries on
these fields. A uniqueness check ensures that an item is
not entered multiple times into a database.

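Several of these checks can be automated with pandas. A minimal sketch, assuming a small hypothetical parcel DataFrame (all column names and values are purely illustrative):

import pandas as pd

# Hypothetical parcel data (illustrative only)
df = pd.DataFrame({
    "order_id": [101, 102, 102, 104],
    "latitude": [12.9, 95.0, 45.3, -33.8],
    "ship_date": ["2023-01-05", "2023-01-10", "2023-01-12", "2023-01-15"],
    "delivery_date": ["2023-01-08", "2023-01-09", "2023-01-20", "2023-01-18"],
})

# Format check: parsing fails loudly if a date is not in YYYY-MM-DD form
df["ship_date"] = pd.to_datetime(df["ship_date"], format="%Y-%m-%d")
df["delivery_date"] = pd.to_datetime(df["delivery_date"], format="%Y-%m-%d")

range_ok = df["latitude"].between(-90, 90)            # range check
consistent = df["delivery_date"] >= df["ship_date"]   # consistency check
unique_ok = ~df["order_id"].duplicated()              # uniqueness check

# Rows that fail at least one check
print(df[~(range_ok & consistent & unique_ok)])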
What is Data Cleaning?

• Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset. When combining multiple data sources, there are many opportunities for data to be duplicated or mislabeled.

• If data is incorrect, outcomes and algorithms are unreliable, even though they may look correct.
What is the difference between Data Cleaning and Data
Transformation?

• Data cleaning is the process that removes data that does not
belong in your dataset. Data transformation is the process
of converting data from one format or structure into
another.
• Transformation processes are also referred to as data wrangling or data munging: transforming and mapping data from one "raw" form into another format for warehousing and analysis.

Data Cleaning Steps

Step 1: Remove duplicate or irrelevant observations

Step 2: Fix structural errors


Structural errors arise when you measure or transfer data and notice strange naming conventions, typos, or incorrect capitalization. These inconsistencies can cause mislabeled categories or classes.

Step 3: Filter unwanted outliers

Step 4: Handle missing data
You can’t ignore missing data because many algorithms will not
accept missing values.
Step 5: Validate and QA
At the end of the data cleaning process, you should be able to
answer these questions as a part of basic validation:
• Does the data make sense?
• Does the data follow the appropriate rules for its field?
• Does it prove or disprove your working theory, or bring any insight to light?
• Can you find trends in the data to help you form your next theory?
• If not, is that because of a data quality issue?

Data Transformation
1. Removing Duplicates
Duplicate rows may be found in a DataFrame for any number of reasons. Here is an example:

Relatedly, drop_duplicates returns a DataFrame where the
duplicated array is False:

Suppose we had an additional column of values and wanted to filter duplicates only based on the 'k1' column:
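The example frames from the original slides are not reproduced here; the following sketch, modeled on the usual pandas illustration (column names 'k1', 'k2', 'v1' are assumptions), shows the duplicated and drop_duplicates methods:

import pandas as pd

data = pd.DataFrame({"k1": ["one", "two"] * 3 + ["two"],
                     "k2": [1, 1, 2, 3, 3, 4, 4]})

print(data.duplicated())       # boolean Series: True where a row repeats an earlier one
print(data.drop_duplicates())  # keeps only the first occurrence of each row

data["v1"] = range(7)                  # an additional column of values
print(data.drop_duplicates(["k1"]))    # filter duplicates based on 'k1' only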

2. Transforming Data Using a Function or Mapping

• Consider the following hypothetical data collected about various kinds of meat:

• Suppose you wanted to add a column indicating the type of animal
that each food came from. Let’s write down a mapping of each
distinct meat type to the kind of animal:

The map method on a Series accepts a function or dict-like object containing a mapping. We
need to convert each value to lowercase using the str.lower Series method:
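A sketch of this workflow (the meat table is the customary illustrative example; the exact values are assumptions):

import pandas as pd

data = pd.DataFrame({"food": ["bacon", "pulled pork", "bacon", "Pastrami",
                              "corned beef", "Bacon", "pastrami", "honey ham"],
                     "ounces": [4, 3, 12, 6, 7.5, 8, 3, 5]})

meat_to_animal = {"bacon": "pig", "pulled pork": "pig", "pastrami": "cow",
                  "corned beef": "cow", "honey ham": "pig"}

# Lowercase first, so 'Bacon' and 'bacon' map to the same animal
data["animal"] = data["food"].str.lower().map(meat_to_animal)
print(data)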

3. Replacing Multiple Values
If you want to replace multiple values at once, you instead pass a
list and then the substitute value:

To use a different replacement for each value, pass a list of substitutes:

The argument passed can also be a dict:
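A minimal sketch of all three forms, using -999 and -1000 as illustrative sentinel values:

import numpy as np
import pandas as pd

data = pd.Series([1.0, -999.0, 2.0, -999.0, -1000.0, 3.0])

print(data.replace([-999, -1000], np.nan))       # one substitute for the whole list
print(data.replace([-999, -1000], [np.nan, 0]))  # a different substitute per value
print(data.replace({-999: np.nan, -1000: 0}))    # the same mapping passed as a dict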

4. Renaming Axis Indexes
Like values in a Series, axis labels can be similarly transformed by a
function or mapping of some form to produce new, differently
labeled objects.
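A short sketch, assuming an illustrative DataFrame indexed by state names:

import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(12).reshape((3, 4)),
                    index=["Ohio", "Colorado", "New York"],
                    columns=["one", "two", "three", "four"])

data.index = data.index.map(lambda x: x[:4].upper())      # transform labels in place
print(data.rename(index=str.title, columns=str.upper))    # or produce a relabeled copy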

5. Discretization and Binning
Continuous data is often discretized or otherwise separated into
“bins” for analysis. Suppose you have data about a group of people in
a study, and you want to group them into discrete age buckets:
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]

Let’s divide these into bins of 18 to 25, 26 to 35, 36 to 60, and finally
61 and older. To do so, you have to use cut, a function in pandas:
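A sketch with the ages listed above:

import pandas as pd

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]

cats = pd.cut(ages, bins)              # Categorical of intervals such as (18, 25]
print(cats.codes)                      # bin index assigned to each age
print(pd.Series(cats).value_counts())  # number of ages falling in each bin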

6. Detecting and Filtering Outliers

Filtering or transforming outliers is largely a matter of applying array operations. Consider a DataFrame with some normally distributed data:

To select all rows having a value exceeding 3 or –3, you can use the
any method on a boolean DataFrame:
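A sketch of both selections (the 1000 x 4 normally distributed DataFrame is illustrative):

import numpy as np
import pandas as pd

data = pd.DataFrame(np.random.standard_normal((1000, 4)))

col = data[2]
print(col[col.abs() > 3])                  # outliers in a single column
print(data[(data.abs() > 3).any(axis=1)])  # rows with any value exceeding 3 or -3

data[data.abs() > 3] = np.sign(data) * 3   # cap values to the interval [-3, 3]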

7. Computing Indicator/Dummy Variables

Another type of transformation for statistical modeling or machine learning applications is converting a categorical variable into a "dummy" or "indicator" matrix.
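A minimal sketch with pandas get_dummies (the 'key' column is illustrative):

import pandas as pd

df = pd.DataFrame({"key": ["b", "b", "a", "c", "a", "b"],
                   "data1": range(6)})

dummies = pd.get_dummies(df["key"], prefix="key")  # one indicator column per category
print(df[["data1"]].join(dummies))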

NumPy

NumPy is a Python package; it stands for 'Numerical Python'. It is a library consisting of multidimensional array objects and a collection of routines for processing arrays.

Numeric, the ancestor of NumPy, was developed by Jim Hugunin. Another package, Numarray, was also developed, having some additional functionalities. In 2005, Travis Oliphant created the NumPy package by incorporating the features of Numarray into Numeric.

Operations using NumPy
Using NumPy, a developer can perform the following
operations −
• Mathematical and logical operations on arrays.
• Fourier transforms and routines for shape manipulation.
• Operations related to linear algebra.
• NumPy has in-built functions for linear algebra and random
number generation.

NumPy - A Replacement for MATLAB
NumPy is often used along with packages like SciPy (Scientific Python) and Matplotlib (a plotting library). This combination is widely used as a replacement for MATLAB, a popular platform for technical computing. The Python alternative to MATLAB is now seen as a more modern and complete programming language. It is open source, which is an added advantage of NumPy.

The NumPy package is imported using the following syntax:
import numpy as np

The most important object defined in NumPy is an N-dimensional array type called ndarray. It describes a collection of items of the same type. Items in the collection can be accessed using a zero-based index. Every item in an ndarray takes the same size of block in memory, and each element in an ndarray is an object of a data-type object (called dtype).

[Figure: relationship between ndarray, data type object (dtype) and array scalar type]

numpy.array(object, dtype = None, copy = True, order = None, subok = False, ndmin = 0)

The above constructor takes the following parameters:

1. object - Any object exposing the array interface method returns an array, or any (nested) sequence.
2. dtype - Desired data type of the array. Optional.
3. copy - Optional. By default (True), the object is copied.
4. order - C (row major) or F (column major) or A (any) (default).
5. subok - By default, the returned array is forced to be a base-class array. If True, sub-classes are passed through.
6. ndmin - Specifies the minimum dimensions of the resultant array.

Example:

import numpy as np
a = np.array([1, 2, 3])
print(a)

# more than one dimension
import numpy as np
a = np.array([[1, 2], [3, 4]])
print(a)

Example:
# minimum dimensions
import numpy as np
a = np.array([1, 2, 3, 4, 5], ndmin = 2)
print(a)

# dtype parameter
import numpy as np
a = np.array([1, 2, 3], dtype = complex)
print(a)

NumPy - Array Attributes

ndarray.shape:
This array attribute returns a tuple consisting of array dimensions. It can also be
used to resize the array.

import numpy as np
a = np.array([[1, 2, 3], [4, 5, 6]])
print(a.shape)

# this resizes the ndarray
import numpy as np
a = np.array([[1, 2, 3], [4, 5, 6]])
a.shape = (3, 2)
print(a)

NumPy - Array Attributes

import numpy as np
a = np.array([[1, 2, 3], [4, 5, 6]])
b = a.reshape(3, 2)
print(b)

ndarray.ndim:
This array attribute returns the number of array dimensions.

# an array of evenly spaced numbers
import numpy as np
a = np.arange(24)
print(a)

NumPy - Array Attributes

# this is a one-dimensional array
import numpy as np
a = np.arange(24)
print(a.ndim)

# now reshape it
b = a.reshape(2, 4, 3)
print(b)
# b now has three dimensions
NumPy - Array Attributes

numpy.itemsize:
This array attribute returns the length of each element of array in bytes.

# dtype of array is int8 (1 byte)
import numpy as np
x = np.array([1, 2, 3, 4, 5], dtype = np.int8)
print(x.itemsize)

# dtype of array is now float32 (4 bytes)
import numpy as np
x = np.array([1, 2, 3, 4, 5], dtype = np.float32)
print(x.itemsize)
NumPy - Array Creation Routines
A new ndarray object can be constructed by any of the following array creation routines or using a low-level ndarray constructor.
numpy.empty:
It creates an uninitialized array of the specified shape and dtype. It uses the following constructor:
numpy.empty(shape, dtype = float, order = 'C')

1. shape - Shape of the empty array, as an int or tuple of ints
2. dtype - Desired output data type. Optional
3. order - 'C' for C-style row-major array, 'F' for FORTRAN-style column-major array
NumPy - Array Creation Routines
import numpy as np
x = np.empty([3, 2], dtype = int)
print(x)

numpy.zeros:
Returns a new array of specified size, filled with zeros.
numpy.zeros(shape, dtype = float, order = 'C')

1. shape - Shape of the array, as an int or sequence of ints
2. dtype - Desired output data type. Optional
3. order - 'C' for C-style row-major array, 'F' for FORTRAN-style column-major array
NumPy - Array Creation Routines

# array of five zeros. Default dtype is float
import numpy as np
x = np.zeros(5)
print(x)

import numpy as np
x = np.zeros((5,), dtype = int)
print(x)

# custom (structured) dtype
import numpy as np
x = np.zeros((2, 2), dtype = [('x', 'i4'), ('y', 'i4')])
print(x)
numpy.ones:
Returns a new array of specified size and type, filled with ones.
numpy.ones(shape, dtype = None, order = 'C')
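Example (mirroring the zeros examples above):

# array of five ones. Default dtype is float
import numpy as np
x = np.ones(5)
print(x)

# integer ones in a 2 x 2 array
x = np.ones([2, 2], dtype = int)
print(x)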
NumPy - Array From Existing Data

numpy.asarray:
This function is similar to numpy.array except that it has fewer parameters. This routine is useful for converting a Python sequence into an ndarray.

numpy.asarray(a, dtype = None, order = None)

1. a - Input data, in any form such as a list, list of tuples, tuple, tuple of tuples or tuple of lists
2. dtype - By default, the data type of the input data is applied to the resultant ndarray
3. order - C (row major) or F (column major). C is the default
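Example:

# convert a plain Python list to an ndarray
import numpy as np
x = [1, 2, 3]
a = np.asarray(x)
print(a)

# set the dtype while converting a tuple
a = np.asarray((1, 2, 3), dtype = float)
print(a)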
dropna( ): Drop missing values
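This fragment comes from the handling-missing-data material; a minimal pandas sketch of the surrounding methods (the Series values are illustrative):

import numpy as np
import pandas as pd

data = pd.Series([1.0, np.nan, 3.5, np.nan, 7.0])

print(data.isnull())              # boolean mask of missing values
print(data.dropna())              # drop missing values
print(data.fillna(0))             # or fill them with a constant
print(data.fillna(data.mean()))   # or with a statistic such as the mean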
What is Matplotlib?

• Matplotlib is an open-source drawing library that supports various drawing types.
• You can generate plots, histograms, bar charts, and other types of charts with just a few lines of code.
• It's often used in web application servers, shells, and Python scripts.

Pyplot is a Matplotlib module that provides simple functions for adding
plot elements, such as lines, images, text, etc. to the axes in the current
figure.

Matplotlib Subplots
You can use the subplot() method to add more than one plot in a figure.
Syntax: plt.subplot(nrows, ncols, index)
The three integer arguments specify the number of rows, the number of columns, and the index of the plot within the subplot grid.
Module 2

SCALAR & VECTOR POINTS


Outline for class
✓ Scalar and Point techniques
✓ Color maps
✓ Contouring – Height Plots - Vector visualization techniques
✓ Vector properties – Vector Glyphs – Vector Color Coding
✓ Matrix visualization techniques
Motivation
• A scalar is a physical quantity that has only magnitude.
• Visualizing scalar data is frequently encountered in science, engineering, and
medicine, but also in daily life.
• Scalar datasets, or scalar fields, represent functions f: D → R, where the domain D is usually a subset of R² or R³.
• Each point of D is mapped to a real number by the function f: for any given point (x, y) in R² or (x, y, z) in R³ that belongs to the domain D, the function f assigns a single real value, which could represent temperature, pressure, density, or any other scalar quantity that varies continuously over space.

• There exist many scalar visualization techniques, both for 2D and 3D datasets.
• We present a number of the most popular scalar visualization techniques: color mapping, contouring, heat maps, topological analysis, and height plots.
Scalar Visualization
Color Mapping
• Color mapping is a common scalar visualization technique that maps scalar
data to colors, and displays the colors on the computer system.
• The scalar mapping is implemented by indexing into a color lookup table.
• Scalar values then serve as indices into this lookup table.
• The lookup table holds an array of colors.
• Associated with the table is a minimum and maximum scalar range into which the
scalars are mapped.
• Scalar values greater than the maximum are clamped to the maximum color, and scalar values less than the minimum are clamped to the minimum color.
Scalar Visualization
Then, for each scalar value si, the index i into the color table with n entries is given as:

    i = 0                                if si ≤ min
    i = n − 1                            if si ≥ max
    i = ⌊n · (si − min) / (max − min)⌋   otherwise

[Figure: scalar value si indexing into a color lookup table with entries rgb0, rgb1, rgb2, ..., rgbn-1]

• si: the scalar value to be mapped to a color.
• min, max: the minimum and maximum scalar values of the dataset, defining the bounds of the scalar range.
• n: the number of entries in the color table.
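A minimal sketch of this lookup in Python (the four-entry blue-to-red color table is an illustrative assumption):

def color_index(s, s_min, s_max, n):
    # Map scalar s to an index into a color table with n entries
    if s <= s_min:
        return 0
    if s >= s_max:
        return n - 1
    return int(n * (s - s_min) / (s_max - s_min))

# Illustrative 4-entry table: blue, green, yellow, red
table = [(0, 0, 1), (0, 1, 0), (1, 1, 0), (1, 0, 0)]
for s in [0.0, 0.3, 0.6, 1.0]:
    print(s, table[color_index(s, 0.0, 1.0, len(table))])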

Scalar Visualization
Transfer Functions
A more general form of the lookup table is called a transfer function. A transfer function is any expression that maps scalar values into a color specification. For example, a function can be used to map scalar values into separate intensity values for the red, green, and blue components.
[Figure: transfer functions mapping scalar value to red, green, and blue intensities]
Scalar Visualization
Color mapping is a one-dimensional visualization technique: it maps one piece of information (a scalar value) into a color specification. However, the display of color information is not limited to one-dimensional displays. Often we use color information mapped onto 1-D, 2-D, or 3-D objects. This is a simple way to increase the information content of our visualization. In 3-D, cutting planes can be used to visualize the data inside.
Scalar Visualization
• Many other color map designs are possible. For example, geographical
applications often encode landscape height using a particular color map that
contains colors, which suggest typical relief forms at different heights, including
blue (sea level), green (fields), beige (hills), brown (mountains), and white
(mountain peaks).
• In other applications, such as medical imaging, the simple luminance color map
works best.
• Rainbow coloring can result in a loss of perceived linearity, because scalar values are mapped to hue: some users perceive the colors to change "faster" per spatial unit in the higher yellow-to-red range than in the lower blue-to-cyan range.
Scalar Visualization
Example:
[Figure: the same data rendered with a luminance color map and with a rainbow color map. Images courtesy of Alexandru Telea]
Scalar Visualization
Example:
[Figure: the same color map quantized to 256, 32, 16, and 8 colors. Images courtesy of Alexandru Telea]
Scalar Visualization
[Figure: the same dataset visualized without and with a color map]
Scalar Visualization
Contouring
• A natural extension to color mapping is contouring.
• When we see a surface colored with data values, the eye often separates
similarly colored areas into distinct regions.
• When we contour data, we are effectively constructing the boundary between
these regions. These boundaries correspond to contour lines (2-D) or surfaces
(3-D) of constant scalar value.
• Examples of 2-D contour displays include weather maps annotated with lines of constant temperature (isotherms), or topographic maps drawn with lines of constant elevation.
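A sketch of 2-D contouring with Matplotlib (the Gaussian scalar field is an illustrative stand-in for, say, a temperature map):

import numpy as np
import matplotlib.pyplot as plt

# Sample a 2D scalar field on a regular grid
x, y = np.meshgrid(np.linspace(-3, 3, 100), np.linspace(-3, 3, 100))
z = np.exp(-(x**2 + y**2))              # scalar value at each grid point

cs = plt.contour(x, y, z, levels=8)     # isolines of constant scalar value
plt.clabel(cs, inline=True)             # label each isoline with its value
plt.show()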
Scalar Visualization
Three-dimensional contours are called isosurfaces, and can be approximated by many polygonal primitives. Examples of isosurfaces include surfaces of constant medical image intensity corresponding to body tissues such as skin, bone, or other organs. (The corresponding isovalue for the same tissue, however, is not necessarily constant among several different scans.) Other abstract isosurfaces, such as surfaces of constant pressure or temperature in fluid flow, may also be created.
Scalar Visualization
Marching Cubes
Lorensen and Cline introduced Marching Cubes in 1987.
[William E. Lorensen, Harvey E. Cline, "Marching Cubes: A High Resolution 3D Surface Construction Algorithm", ACM Computer Graphics Vol. 21 No. 4 (SIGGRAPH 1987 Proceedings)]

Marching Cubes (MC) is an efficient method for extracting isosurfaces from a scalar data set defined on a regular grid. Similar to marching squares, a surface segment is computed for each cell of the grid that approximates the isosurface.
Scalar Visualization
Example: [Figure: isosurface extraction example]
Scalar Visualization
Height Plots
Height plots, also called elevation or carpet plots, are defined as follows. Given a two-dimensional surface Ds ⊂ D, part of a scalar dataset D, height plots can be described by the mapping operation

    m: Ds → R³,  m(x) = x + s(x) n(x),  for all x ∈ Ds

where s(x) is the scalar value of D at the point x and n(x) is the normal to the surface Ds at x. In other words, the height plot mapping operation "warps" a given surface Ds included in the data set along the surface normal, with a factor proportional to the scalar values.
Scalar Visualization
Example: a torus surface (a) and its "warped" variant (b), with the height corresponding to the scalar value.
[Figure: (a) torus surface; (b) height-plot-warped torus. Images courtesy of Alexandru Telea]
Vector Visualization
Vector data is a three-dimensional representation of direction and magnitude. Vector
data often results from the study of fluid flow, or when examining derivatives, i.e. rate of
change, of some quantity.

Different visualization techniques are available for vector data sets, for example:
• Hedgehogs and oriented glyphs
• Warping
• Displacement plots
• Time animation
• Streamlines
Vector Glyph

    l = (x, x + kv(x))

• The vector glyph mapping technique associates a vector glyph (or icon) with the sampling points of the vector dataset: each glyph is a line l from the sample point x to x + kv(x), where v(x) is the vector at x and k is a scaling factor.
• The magnitude and direction of the vector attribute are indicated by the various properties of the glyph: location, direction, orientation, size and color.
• Many variations of this framework exist:
• Lines (convey direction)
• 3D cones (convey direction + orientation)
• Arrows (convey direction + orientation)
Vector Glyph
[Figure: line glyphs (hedgehog glyphs) for the velocity field of a 2D magnetohydrodynamic (MHD) simulation; the original 256 x 256 field sub-sampled by a factor of 8 to 32 x 32]

Vector Glyph
[Figure: the same field sub-sampled by a factor of 4 (64 x 64) and by a factor of 2 (128 x 128), compared against the original 256 x 256]

Problems with a dense representation using glyphs:
(1) clutter
(2) misrepresentation
Vector Glyph
[Figure: random sub-sampling gives a better result than regular sub-sampling]
Vector Glyph: 3D
[Figure: 3D glyph visualizations of a 128 x 85 x 42 simulation box (456,960 data points):
100,000 glyphs - visual occlusion is a problem;
10,000 glyphs - less occlusion;
100,000 glyphs at 0.15 transparency - less occlusion;
a 3D velocity isosurface]
Vector Glyph
• The glyph method is simple to implement and intuitive to interpret.
• High-resolution vector datasets must be sub-sampled in order to avoid overlapping of neighboring glyphs.
• The glyph method is a sparse visualization: it does not represent all points.
• Occlusion is a problem.
• Subsampling artifacts: difficult to interpolate.
• Alternative: the color mapping method is a dense visualization.
Vector Color Coding
• Similar to scalar color mapping, vector color coding associates a color with every point in the data domain.
• Typically, the HSV system (color wheel) is used:
• Hue encodes the direction of the vector, e.g., by its angular position on the color wheel.
• Value encodes the magnitude of the vector.
• Saturation is set to one.
For example, a bright red vector may indicate a strong wind blowing to the east, while a dimmer red vector might represent a weaker wind in the same direction. This technique is particularly useful for visualizing vector fields in a way that makes it easy to discern both the direction and the magnitude at each point.

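A sketch of this coding with NumPy and Matplotlib, assuming a synthetic rotational vector field:

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import hsv_to_rgb

# Synthetic 2D vector field (vx, vy) sampled on a grid
y, x = np.mgrid[-2:2:200j, -2:2:200j]
vx, vy = -y, x                               # a simple rotational field

angle = np.arctan2(vy, vx)                   # direction of each vector
mag = np.hypot(vx, vy)                       # magnitude of each vector

hsv = np.zeros(x.shape + (3,))
hsv[..., 0] = (angle + np.pi) / (2 * np.pi)  # hue encodes direction
hsv[..., 1] = 1.0                            # saturation set to one
hsv[..., 2] = mag / mag.max()                # value encodes magnitude

plt.imshow(hsv_to_rgb(hsv), origin="lower")
plt.show()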
Vector Color Coding
[Figure: 2-D velocity field of the MHD simulation, color-coded by orientation and magnitude]

Vector Color Coding
[Figure: the same field color-coded by orientation only, with no magnitude]
Module 2
Visualization Techniques on Trees, Graphs & Networks
Introduction
An important application of visualization is the conveying of relational information, e.g., how data items or records are related to each other. These interrelationships can take many forms:
• part/subpart, parent/child, or other hierarchical relation;
• connectedness, such as cities connected by roads or computers connected by networks;
• derived from, as in a sequence of steps or stages;
• shared classification;
• similarities in values;
• similarities in attributes (e.g., spatial, temporal).
Introduction Cont…
• Relationships can be simple or complex:
• unidirectional or bi-directional
• nonweighted or weighted
• certain or uncertain

Indeed, the relationships may provide more and richer information than that contained in the
data records.
Applications for visualizing relational information are equally diverse, from categorizing
biological species, to exploring document archives, to studying a terrorist network.
Displaying Hierarchical Structures (Trees)
Trees or hierarchies (we'll use the terms interchangeably) are one of the most common
structures to hold relational information. We can divide these techniques into two classes of
algorithms:
• space-filling
• non-space-filling
Space Filling Method
• Space-filling techniques make maximal use of the display space. This is accomplished by
using juxtapositioning to imply relations, as opposed to, for example, conveying relations
with edges joining data objects.
• The two most common approaches to generating space-filling hierarchies are rectangular
and radial layouts.
1. Treemap: a rectangle is recursively divided into slices, alternating horizontal and vertical slicing, based on the populations of the subtrees at a given level.

2. Sunburst display: the root of the hierarchy is placed at the center of the display, and nested rings convey the layers of the hierarchy. Each ring is divided based on the number of nodes at that level.
Space Filling Method
Pseudo code for treemap:
Start: Main Program
Width = width of rectangle
Height = height of rectangle
Node = root node of the tree
Origin = position of rectangle, e.g., [0, 0]
Orientation = direction of cuts, alternating between horizontal and vertical
Treemap(Node, Orientation, Origin, Width, Height)
End: Main Program
Space Filling Method
Treemap(node n, orientation o, position orig, hsize w, vsize h)
  if n is a terminal node (i.e., it has no children)
    draw-rectangle(orig, w, h)
    return
  for each child of n (child-i), get number of terminal nodes in subtree
  sum up number of terminal nodes
  compute percentage of terminal nodes in n from each subtree (percent-i)
  if orientation is horizontal
    for each subtree
      compute offset of origin based on origin and width (offset-i)
      Treemap(child-i, vertical, orig + offset-i, w * percent-i, h)
  else
    for each subtree
      compute offset of origin based on origin and height (offset-i)
      Treemap(child-i, horizontal, orig + offset-i, w, h * percent-i)
End: Treemap
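A compact Python rendering of this pseudocode using Matplotlib rectangles (the example tree and all drawing details are illustrative assumptions):

import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle

def leaves(node):
    kids = node.get("children", [])
    return 1 if not kids else sum(leaves(c) for c in kids)

def treemap(ax, node, orient, x, y, w, h):
    # Recursively slice the rectangle by each subtree's share of leaf nodes
    if not node.get("children"):                       # terminal node
        ax.add_patch(Rectangle((x, y), w, h, fill=False))
        ax.text(x + w / 2, y + h / 2, node["name"], ha="center", va="center")
        return
    total = leaves(node)
    offset = 0.0
    for child in node["children"]:
        frac = leaves(child) / total
        if orient == "horizontal":                     # cut along the width
            treemap(ax, child, "vertical", x + offset, y, w * frac, h)
            offset += w * frac
        else:                                          # cut along the height
            treemap(ax, child, "horizontal", x, y + offset, w, h * frac)
            offset += h * frac

tree = {"name": "root", "children": [
    {"name": "a", "children": [{"name": "a1"}, {"name": "a2"}]},
    {"name": "b"},
]}
fig, ax = plt.subplots()
ax.set_xlim(0, 1)
ax.set_ylim(0, 1)
treemap(ax, tree, "horizontal", 0.0, 0.0, 1.0, 1.0)
plt.show()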
[Figure: example treemap]
Space Filling Method
Pseudo code for sunburst display
Start: Main Program
Start = start angle for a node (initially 0)
End = end angle for a node (initially 360)
Origin = position of center of sunburst, e.g., [0, 0]
Level = current level of hierarchy (initially 0)
Width = thickness of each radial band, based on max depth and display size
Sunburst(Node, Start, End, Level)
End: Main Program
Space Filling Method
Pseudo code for sunburst display
Sunburst(node n, angle st, angle en, level l)
  if n is a terminal node (i.e., it has no children)
    draw-radial-section(Origin, st, en, l * Width, (l+1) * Width)
    return
  for each child of n (child-i), get number of terminal nodes in subtree
  sum up number of terminal nodes
  compute percentage of terminal nodes in n from each subtree (percent-i)
  for each subtree
    compute start/end angle based on size of subtrees, order, and angle range (st-i, en-i)
    Sunburst(child-i, st-i, en-i, l+1)
End: Sunburst
Non-Space Filling Methods
The most common representation used to visualize tree or hierarchical relationships is a node-link
diagram.
The drawing of such trees is influenced the most by two factors:
• The fan-out degree (e.g., the number of siblings a parent node can have)
• The depth (e.g., the furthest node from the root)
➢ When designing an algorithm for drawing any node-link diagram (not just trees), one must consider
three categories of often-contradictory guidelines:
➢ drawing conventions,
➢ constraints, and
➢ aesthetics.
➢ Conventions may include restricting edges to be either a single straight line, a series of rectilinear lines,
polygonal lines, or curves.
➢ Constraints may include requiring a particular node to be at the center of the display, or that a group of
nodes be located close to each other, or that certain links must either go from top to bottom or left to
right.
Each of the above guidelines can be used to drive the algorithm design.
Non-Space Filling Methods
Aesthetics, however, often have significant impact on the interpretability of a tree or
graph drawing, yet often result in conflicting guidelines. Some typical aesthetic rules
include:
• minimize line crossings
• maintain a pleasing aspect ratio
• minimize the total area of the drawing
• minimize the total length of the edges
• minimize the number of bends in the edges
• minimize the number of distinct angles or curvatures used
• strive for a symmetric structure
Non-Space Filling Methods
A simple tree drawing procedure is given below:
• Slice the drawing area into equal-height slabs, based on the depth of the tree.
• For each level of the tree, determine how many nodes need to be drawn.
• Divide each slice into equal-sized rectangles based on the number of nodes at that level.
• Draw each node in the center of its corresponding rectangle.
• Draw a link between the center-bottom of each node and the center-top of its child node(s).
Displaying Arbitrary Graphs/Networks
• A Graph is a non-linear data structure consisting of vertices and edges.
• The vertices are sometimes also referred to as nodes and the edges are lines or arcs that connect any two
nodes in the graph.
• More formally, a graph is composed of a set of vertices (V) and a set of edges (E). The graph is denoted by G(V, E).

Components of a Graph
• Vertices: Vertices are the fundamental units of the graph. Sometimes, vertices are also known as vertex or
nodes. Every node/vertex can be labeled or unlabelled.
• Edges: Edges are drawn or used to connect two nodes of the graph. It can be ordered pair of nodes in a
directed graph. Edges can connect any two nodes in any possible way.
Displaying Arbitrary Graphs/Networks
Node-Link Graphs:

• A cutvertex is any node that causes the graph to be disconnected if it is removed.
• A biconnected graph is one without a cutvertex.
• A block is a maximally biconnected subgraph of a graph.
• A separating pair means two vertices whose removal causes a biconnected graph to
become disconnected.
• A triconnected graph is one without a separating pair. A planar triconnected graph has
a unique embedding.
Displaying Arbitrary Graphs/Networks
Node-Link Graphs:
[Figure: node-link diagram examples, including a social network]
Displaying Arbitrary Graphs/Networks
Matrix Representations for Graphs:
• An alternate visual representation of a graph is via an adjacency matrix, which is an N by N grid (where
N is the number of nodes), where position (i, j) represents the existence (or not) of a link between nodes
i and j.
• This may be a binary matrix, or the value might represent the strength or weight of the link between the
two nodes.
• This method overcomes one of the biggest problems with node-link diagrams, namely that of crossing
edges, though it doesn't scale well to graphs with large numbers (thousands) of nodes.
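A small sketch that builds a binary adjacency matrix and displays it as a grid (nodes and edges are illustrative):

import numpy as np
import matplotlib.pyplot as plt

nodes = ["A", "B", "C", "D"]
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]

adj = np.zeros((len(nodes), len(nodes)), dtype=int)
for i, j in edges:              # undirected graph: mark both (i, j) and (j, i)
    adj[i, j] = adj[j, i] = 1

plt.imshow(adj, cmap="Greys")   # dark cell = a link exists between node i and node j
plt.xticks(range(len(nodes)), nodes)
plt.yticks(range(len(nodes)), nodes)
plt.show()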
[Figure: adjacency matrix visualization of a graph]
Module 2

Visual Variables
Introduction
• Visual variables are distinctions that we can use to create and differentiate symbols on a map.
• Visual variables are the attributes or properties of a graphical element that can be visually perceived and used to
convey information. They are the building blocks of visual design and play a crucial role in visual communication.
• There are 10 visual distinctions available for symbolization: location, size, shape, orientation, focus,
arrangement, texture, saturation, hue, and value.
• These visual techniques can be used to create a pleasing aesthetic, convey precise geographic information,
and create a visual hierarchy that can be understood by the viewer of the map.
Types: Visual Variable
Location
• The location visual variable is the position of the object and the environment.
• Location can be determined in absolute, relative, or cognitive terms.
• In any case, location determines where in our environment the object exists.
• No matter whether the data is qualitative or quantitative in nature, in order to be mapped it must have a location.
Types: Qualitative
• Qualitative visual variables are used for nominal data.
• The goal of qualitative visual variables is to show how entities differ from each other.
• Qualitative visual variables show the grouping of similar entities.
Types: Hue
Hue, more commonly known as color, represents a wavelength on the visible portion of the
electromagnetic spectrum.
Hue is great for identifying items as unique, or of a type of item.
Hue creates a perception of groups or likeness.
The images show how hue can be applied to data with 0 to 3 dimensions.
Types: Orientation
The orientation visual variable changes the orientation of the object and creates a perception
of group or likeness.
Types: Shape
• The shape visual variable identifies an item as unique or of a type.
• The shape visual variable refers to a point symbol although it can be arranged to resemble
a line and placed inside an area or three-dimensional shape.
• The shape does not have to be a geometric form; it can also be a pictorial form.
Types: Arrangement
• The arrangement visual variable refers to the placement of elements composing a pattern
or a texture.
• Arranging patterns or textures differently can create a perception of a single unique item or
a group of items.
Types: Texture
• Texture refers to the symbols covering an area. Textures identify items as unique or of a
type.
Types: Focus
• Focus represents uncertainty by making the symbols look fuzzy or out of focus. The more
uncertain a value is the fuzzier or out of focus it should look.
Types: Quantitative Visual Variables
• Quantitative visual variables are used to display ordinal, interval, or ratio data.
• The goal of the quantitative visual variable is to show relative magnitude or order between
entities.
Types: Size
• The size visual variable changes the size of the symbol to imply relative levels of importance.
• Line thickness implies relative flow levels in the case of road traffic or water flow through
a river.
Types: Value, Saturation
• The visual variables value and saturation represent different magnitudes or order in a data value. It is important that you only vary the saturation or the value, but not both, for a given hue.
• This represents a single variable, shown in a single hue, with different quantitative values represented by a difference in saturation or value.
Types: Focus
• Again, the focus visual variable represents uncertainty in quantitative values.
Module 2

Map Color & Other Channels


Introduction
Color
• Color is best understood in terms of three separate channels: luminance, hue, and saturation. The major design choice for colormap construction is whether the intent is to distinguish between categorical attributes or to encode ordered attributes.
• Sequential ordered colormaps show a progression of an attribute from a minimum to a maximum value, while diverging ordered colormaps have a visual indication of a zero point in the center, where the attribute values diverge to negative on one side and positive on the other.
Color
• Bivariate colormaps are designed to show two attributes simultaneously using carefully
designed combinations of luminance, hue, and saturation.
• The characteristics of several more channels are also covered: the magnitude channels of
size, angle, and curvature and the identity channels of shape and motion
Color Vision
The retina of the eye has 2 different kinds of receptors.
• The rods actively contribute to vision only in low-light.
• The main sensors in normal lighting conditions are the cones.
Color Spaces
• The space of colors that the human visual system can detect is three-dimensional; that is, it can be adequately described using three separate axes.
RGB System

• The most common color space in computer graphics is the system where colors are
specified as triples of red, green, and blue values.
• Although this system is computationally convenient, it is a very poor match for the
mechanics of how we see.
• The red, green, and blue axes of the RGB color space are not useful as separable
channels; they give rise to the integral perception of a color.
HSL System

• The hue–saturation–lightness or HSL system is more intuitive and is heavily used by artists
and designers.
• The hue axis captures what we normally think of as pure colors that are not mixed with
white or black: red, blue, green, yellow, purple, and so on.
• The saturation axis is the amount of white mixed with that pure color. For instance, pink is
a partially desaturated red.
• The lightness axis is the amount of black mixed with a color.
HSL System

• Luminance and saturation are magnitude channels, while hue is an identity channel.
Transparency
A fourth channel strongly related to the other three color channels is transparency:
information can be encoded by decreasing the opacity of a mark from fully opaque to
completely see-through.
• Transparency cannot be used independently of the other color channels because of its
strong interaction effects with them.
• Transparency is used most often with superimposed layers, to create a foreground layer
that is distinguishable from the background layer.
• It is frequently used redundantly, where the same information is encoded with another
channel as well.
Colormap

A colormap specifies a mapping between colors and data values; that is, a visual encoding
with color.
Using color to encode data is a powerful and flexible design choice, but colormap design has
many pitfalls for the unwary.
Colormaps can be categorical or ordered, and ordered colormaps can be either sequential
or diverging.
Categorical Colormap
• A categorical colormap uses color to encode categories and groupings.
• Categorical colormaps are normally segmented. They are also known as qualitative colormaps.
• Very effective when used appropriately; for categorical data, they are the next best
channel after spatial position.
• Categorical colormaps are typically designed by using color as an integral identity channel
to encode a single attribute, rather than to encode three completely separate attributes
with the three channels of hue, saturation, and luminance.
Ordered Colormap
• An ordered colormap is appropriate for encoding ordinal or quantitative attributes.
• A sequential colormap ranges from a minimum value to a maximum value.
• A diverging colormap has two hues at the endpoints and a neutral color as a midpoint,
such as white, gray, or black, or a high-luminance color such as yellow.
Other Channels
Size Channels:
• Size is a magnitude channel suitable for ordered data.
• Length is one-dimensional (1D) size; more specifically, height is vertical size and width is
horizontal size. Area is two-dimensional (2D) size, and volume is three-dimensional (3D)
size.
• Our judgement of length is extremely accurate.
• Our judgement of area is significantly less accurate.
• The volume channel is quite inaccurate.
Other Channels
Angle Channels
• The angle channel encodes magnitude information based on the orientation of a mark:
the direction that it points.
• There are two slightly different ways to consider orientation that are essentially the same
channel. With angle, the orientation of one line is judged with respect to another line.
With tilt, an orientation is judged against the global frame of the display.
• While this channel is somewhat less accurate than length and position, it is more accurate than area.
Other Channels
Curvature Channel
• The curvature channel is not very accurate, and it can only be used with line marks.
• It cannot be used with point marks that have no length, or area marks because their
shape is fully constrained.
• The number of distinguishable bins for this channel is low, probably around two or three;
it is in an equivalence class with volume (3D size) at the bottom of the magnitude channel
ranking.
Other Channels
Shape Channels
• Shape is an identity channel that can be used with point and line marks.
• Applying the shape channel to line marks results in stipple patterns such as dotted and
dashed lines.

Motion Channels
• Several kinds of motion are also visual channels, including direction of motion, velocity of
motion, and flicker frequency.
Module 2

Heat Map
What is Heat Map?
• A heat map is the visualization of data that represents the magnitude of a value in a
color code ranging from minor to major intensity.
• The name is a metaphor born from the technique used to depict heat: blue means cool, red means hot, and intermediate temperatures are coded as the gradient between those two. The variation of color is often portrayed in intensity and hue, highlighting the extent of the phenomenon to make it easier to interpret.
Types of Heat Map
• The Spatial Heat Map: represented on a canvas that depicts a two-dimensional space; it can be a geographical map, a web page, or another cartesian representation.
• Grid Heatmap: this type of heat map displays the magnitude of a phenomenon using a two-dimensional matrix. Columns and rows categorize a cell (the location), and the cell's color code defines the intensity of the phenomenon's value.
Heat map Visualizations
Geographical Heatmap
• A geographical heatmap is a spatial map to visualize data according to geographical location. This
can be done to show the phenomenon’s intensity, such as weather trends or demographic
information.
• The heatmap quality will be determined by the density of the dots in the map, and the color spread.
• This means that the more latitude-longitude dots you have on the map, the better the representation
of reality.
• Precision geographical heat maps are created with mathematical-statistical tools, like R, Python, or
more specialized tools, and require a lot of data.
Heat map Visualizations

[Figure: heat map representing the temperature of the world during 2100, by NASA]
Heat map Visualizations
Choropleth Maps – An alternative to Geographical Heat Maps

• Choropleth maps differ from geographic spatial heat maps, mainly because they do not use pure spatial information.
• They use predefined segments of the map (areas/regions) to aggregate the phenomenon variable.
• This aggregation makes the map simpler to understand, and it has become an even more popular tool in business, politics, and social studies.
Bubble Chart Heatmap
• Bubble charts are a generalization of the scatter plot.
• Each point is located on a cartesian axis (X, Y) and a circle is created in it.
• The size of the circle is a third dimension in the visualization, used to represent a magnitude.
• When complemented with a color gradient, a fourth dimension can be represented as a heat map.
• Bubble charts can be used both spatially and in the form of a grid.
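A minimal sketch of a bubble chart used as a heat map (all data values are illustrative):

import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]                 # position on the cartesian axes
y = [2, 4, 1, 5, 3]
size = [100, 300, 500, 200, 400]    # circle size: third dimension (magnitude)
heat = [0.1, 0.4, 0.9, 0.6, 0.3]    # color gradient: fourth dimension

sc = plt.scatter(x, y, s=size, c=heat, cmap="coolwarm")
plt.colorbar(sc, label="intensity")
plt.show()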
Matrix Heatmap
• The Matrix heatmap uses a two-dimensional matrix to represent the phenomenon.
• This is basically a grid map with rows, columns, and cell colors to represent data.
• The matrix heatmap depicts the magnitude of a phenomenon based on a 2D matrix, with each category or trait
representing a dimension (e.g. year, month, and temperature).
• An example of a matrix heat map is the following analysis of a sales team. [Figure: sales-team matrix heat map]
Clustered Heatmap
• Clustered heat maps are a specialization of matrix heat maps; generally used in medicine, biological studies,
and mathematics.
• Their purpose is to aid in the visual comparison of sample sets.
• To understand the structure, everything starts with a matrix heat map, where columns are “measuring sets”, and
rows are the measured variable.
• Each cell contains the magnitude measured for the pair set-value.
• Then a clustering algorithm is applied to create Dendrograms, first for the variables (rows), later for the sets
(columns).
[Figure: clustered heat map with dendrograms for rows and columns]
Correlation Heatmap
• This visualization is used to interpret the correlation phenomena of a set of measured variables.
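A minimal sketch, computing pairwise correlations of illustrative random data and displaying them as a heat map:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame(np.random.standard_normal((100, 4)),
                  columns=["a", "b", "c", "d"])
corr = df.corr()                            # pairwise correlation matrix

plt.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
plt.xticks(range(len(corr)), corr.columns)
plt.yticks(range(len(corr)), corr.columns)
plt.colorbar(label="correlation")
plt.show()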
Abstract Positioning Heat Map
• Abstract positioning heatmaps are those where the spatial canvas is not a geographical map but another kind of plane to be analyzed for a ranging phenomenon.
• The positions in the plane are determined by cartesian X, Y axes; the plane is set as a background, and the phenomena are placed with the intensity defined by the selected color code.
