
Maps: A map is a symbolic representation of selected characteristics of a place, usually drawn on a flat surface. Maps present information about the world in a simple, visual way. They teach about the world by showing sizes and shapes of countries, locations of features, and distances between places. Maps can show distributions of things over Earth, such as settlement patterns. They can show exact locations of houses and streets in a city neighborhood.

A map can also be defined as a diagram or collection of data showing the spatial distribution of something or the relative positions of its components.

Maps can be used as a general reference to show landforms, political boundaries, water bodies, and the positions of cities.

Thematic Maps : Thematic maps pull in attributes or statistics about a location and represent that data
in a way that enables a greater understanding of the relationships between locations and the discovery
of spatial patterns in the data that we are exploring.

Thematic maps display distributions, or patterns, over Earth’s surface. They emphasize one theme, or
topic. These themes can include information about people, other organisms, or the land. Examples
include crop production, people’s average income, where different languages are spoken, or average
annual rainfall.

Map Design : Visualizing data with maps involves making decisions in three basic areas:
 Projection
 Scale
 Symbolization

Map Projections: A projection is a system of mathematics and geometry by which the information on the surface of a sphere (the Earth) can be transferred onto a flat piece of paper (a map).

All projections result in some distortion of the relationships between features on the sphere when they
are projected onto a flat surface.  These distortions include:
    »   the direction between a feature and surrounding features
    »   the distance between a feature and surrounding features
    »   the shape of any feature
    »   the size of any feature

There are three basic developable surfaces—plane, cylinder, and cone—which result in three kinds of
map grids—azimuthal, cylindrical, and conic. 

Distortion increases with the distances from the point or line of contact—tangent or secant—between
the developable surface and the globe.

For this reason, cartographers recommend cylindrical projections for continents around the equator
(e.g., Africa, South America), conic projections for middle-latitude continents (e.g., Asia, North America),
and azimuthal projections for polar regions.

Types of Projections:
• Mercator projection
It is conformal. Areas and shapes vary with latitude, especially away from the Equator, reaching extreme distortion in the polar regions. All indicatrices are circles because there is no angular distortion. (A minimal sketch of its projection equations follows this list.)
• Equal-area cylindrical projection
It preserves area. Shapes are distorted from north to south in middle latitudes and from east to
west in extreme latitudes.
• Mollweide projection
The shapes decrease in the north–south scale in the high latitudes and increase in the low
latitudes, with the opposite happening in the east–west direction.
• Robinson projection
All points have some level of shape and area distortion; both properties are nearly correct in the middle latitudes.
• Sinusoidal projection
It preserves area, so that areas on the map are proportional to the same areas on the Earth. Shapes are obliquely distorted away from the central meridian and near the poles.
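The projection idea can be made concrete with a small Python sketch of the standard spherical Mercator equations (x = R·λ, y = R·ln tan(π/4 + φ/2)); the Earth radius value, central meridian, and function name below are illustrative assumptions.

import math

R = 6_371_000.0  # mean Earth radius in metres (approximate)

def mercator(lat_deg: float, lon_deg: float, lon0_deg: float = 0.0):
    """Project geographic coordinates to spherical-Mercator x/y in metres."""
    lam = math.radians(lon_deg - lon0_deg)
    phi = math.radians(lat_deg)
    x = R * lam
    y = R * math.log(math.tan(math.pi / 4 + phi / 2))
    return x, y

# Distortion grows with latitude: the same 1-degree step in longitude maps to
# the same x everywhere, but y stretches rapidly toward the poles.
print(mercator(0, 1))    # near the Equator
print(mercator(80, 1))   # high latitude: y is greatly exaggerated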

Map Scale: Map scale refers to the relationship (or ratio) between distance on a map and the corresponding distance on the ground. For example, on a 1:100,000 scale map, 1 cm on the map equals 1 km on the ground.
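This ratio arithmetic can be sketched in a few lines of Python; the helper name and units below are illustrative assumptions.

def ground_distance_km(map_distance_cm: float, scale_denominator: int) -> float:
    # Ground distance for a measurement on a 1:scale_denominator map.
    ground_cm = map_distance_cm * scale_denominator  # same unit as the map measurement
    return ground_cm / 100_000                       # 100,000 cm per kilometre

print(ground_distance_km(1, 100_000))  # 1.0 km, matching the 1:100,000 example above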

Visual encoding is the process of matching the phenomena to be visualized, which is provided by the
dataset (data scale and attributes), to the most suitable type of representation (graphical elements and
visual properties).
For encoding we need to look at the three aspects:
• Data
• Titles and legends of map
• Variables of visual encoding

• Simply put, a cartogram is a map. But a cartogram is a unique type of map because it combines statistical information with geographic location. Physical or topographical maps show relative area, distance, and terrain, but they do not provide any data about the inhabitants of a place.
• Cartograms on the other hand take some measurable variable: total population,
age of inhabitants, electoral votes, GDP, etc., and then manipulate a place’s area
to be sized accordingly. The produced cartogram can really look quite different
from the maps of cities, states, countries, and the world that are more
recognizable. It all depends on how a cartographer needs or wants to display the
information.
• Cartograms come in all shapes and sizes, literally, and with the continuous
advances in technology of geographic information system (GIS) software
cartograms are produced with more precision and greater graphics than ever.
There are two main types of cartograms: area cartograms and distance
cartograms.
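The core idea of an area cartogram, sizing each region by a measured variable instead of by its land area, can be sketched as follows; the regions and population figures are illustrative assumptions.

population = {"A": 8_000_000, "B": 2_000_000, "C": 500_000}
total_drawing_area = 100.0                     # arbitrary drawing units to distribute
total_pop = sum(population.values())

# Each region's drawn area is made proportional to its population.
cartogram_area = {region: total_drawing_area * pop / total_pop
                  for region, pop in population.items()}
print(cartogram_area)   # region A dominates the drawing; region C nearly vanishes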
Time: Time itself is an inherent data dimension that is central to the tasks of revealing trends and
identifying patterns and relationships in the data. Time and time-oriented data have distinct
characteristics that make it worthwhile to treat such data as a separate data type.
Characteristics of time can significantly improve the expressiveness of visual representations.
Hence, it is vital to
(1) choose a visual representation that fits the data characteristics (cyclic time in this case), and
(2) parameterize the visual representation accordingly, in order to detect patterns hidden in the data.

Time has 4 general aspects:


1. Scale
2. Scope
3. Arrangement
4. Viewpoint

Scale is divided into three parts:


• Ordinal – Only a relative order relation is given (before, during, after)
• Discrete – Temporal distances are counted in multiples of a smallest possible unit (e.g., minutes, seconds); this is the most common scale
• Continuous – Time maps to the real numbers: between any two points in time, another point in time exists
Scope
 Point based
o Has a temporal extent equal to 0
o Can be seen in analogy to discrete Euclidean points in space
o No information is given about the region between two points in time
o Example: May 1, 2014 00:00:00
 Interval based
o Relates to a subsection of time having a temporal extent greater than 0
o Example: [May 1, 2014 00:00:00, May 1, 2014 23:59:59]
Arrangement
 Linear
o We mostly consider time as proceeding linearly from past to future
o Each time value has a unique predecessor and successor
 Cyclic
o Time domain is composed of a set of recurring time values.
o Any time value A is succeeded and preceded at the same time by any other time value B
o E.g. Winter comes before summer, but winter also succeeds summer
Viewpoint
 Ordered
Ordered time domains consider things that happen one after the other. On a more detailed level, we might also distinguish between totally and partially ordered domains.
 Totally ordered – Only one thing can happen at a time
 Partially ordered – Overlapping events are allowed
 Branching
Multiple strands of time branch out and allow the description and comparison of alternative scenarios (e.g., in project planning). Only one path through time will actually happen.
 Multiple perspectives
Facilitates simultaneous (even contrary) views of time. Examples of this are eyewitness reports that describe the same situation, each of which is slightly different; various accounts of a disaster reported in different countries and time zones; or stochastic multi-run simulations.

Hierarchical organization of time


The hierarchical organization of time and concrete time elements is determined based on granularity,
time primitives, and determinacy.
• Granularity and calendars: None vs. Single vs. Multiple
Basically, granularity can be thought of as a (human-made) abstraction of time in order to make it easier to deal with time in everyday life (such as minutes, hours, days, weeks, months). More generally, granularity describes mappings from time values to larger or smaller conceptual units.
1. If a granularity and calendar system is supported by the time model, we characterize it as multiple granularity
2. If every time value is given in terms of milliseconds, it is single granularity
3. If none of these abstractions is supported, then it is none
• Time Primitives: Instant vs. Interval vs. Span (see the sketch after this list)
Time primitives can be seen as an intermediary layer between data elements and the time domain
1. Anchored (Absolute)
a. Instant [Fixed position along the time domain]
b. Interval [Fixed position along the time domain]
2. Unanchored (Relative)
a. Span [No absolute position in time]
• Determinacy: Determinate vs. Indeterminate
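The anchored and unanchored primitives above can be represented with Python's standard datetime types; the class names below are illustrative assumptions, reusing the interval example from the Scope section.

from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Instant:          # anchored: a fixed position on the time domain
    at: datetime

@dataclass
class Interval:         # anchored: a fixed start and end on the time domain
    start: datetime
    end: datetime

@dataclass
class Span:             # unanchored: a duration with no absolute position
    length: timedelta

day = Interval(datetime(2014, 5, 1, 0, 0, 0), datetime(2014, 5, 1, 23, 59, 59))
print(Instant(datetime(2014, 5, 1)), day, Span(timedelta(hours=8)))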

Data & Time
• Preprocessing is an important part of data visualization.
• Two types of data – Nominal and Ordinal.
• Structure of data – Scalar, Vector and Tensor.
• Methods used for data preprocessing
 Data Cleaning
 Assigning values
 Imputations
 Clustering and Segmentation

Descriptive data – categorical data
Transactional data – numeric data
Descriptive data visualization → bar chart

"Amount" is the semantic label; a value such as $75 is the data itself. Data can be viewed structurally (columns) and as mathematical or measure data (rows/records).

Observations are of two types.


1. Transactional or Ordinal
a. Binary
b. Discrete
c. Continuous
2. Descriptive or Nominal
a. Categorical
b. Ranked
c. Arbitrary
Structure within a record

Scalar – It is a point in space which represents individual numbers of data in a record. It is an entity that
has only magnitude and no direction. Examples of scalar quantities include mass, electric charge,
temperature, distance, etc. A point in space can have several different numbers associated with it; if
there is no underlying connection between them then they are simply multiple separate scalar fields.

Vector – Multiple variables in one record taken together form a vector (e.g., [Sales, $100]). A vector is an entity that is characterized by a magnitude and a direction. Examples of vector quantities are displacement, velocity, magnetic field, etc. For instance, every point on the earth may be in the gravitational force field of the earth.

Tensor – A generalization of scalars and vectors: a scalar is a tensor of rank 0 and a vector is a tensor of rank 1. When we visualize data, tensors let us describe how the data changes; as the underlying data changes, the scale of the data changes with it. The geometric intuition is that the full information at each point in a tensor field cannot be represented by just an arrow and would require a more complex shape such as an ellipsoid.
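The three structures can be made concrete with a small numerical sketch; the temperature, velocity, and stress-like values below are illustrative assumptions.

import numpy as np

temperature = 21.5                    # scalar: a single magnitude (rank-0 tensor)
velocity = np.array([3.0, 4.0])       # vector: magnitude and direction (rank-1 tensor)
stress = np.array([[2.0, 0.5],        # rank-2 tensor: more than an arrow is needed
                   [0.5, 1.0]])       # at each point, e.g., drawn as an ellipsoid

print(np.linalg.norm(velocity))       # 5.0, the vector's magnitude
print(np.linalg.eigvalsh(stress))     # the ellipsoid's axis lengths come from the eigenvalues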

Get the data from the source
Clean the data
Analyze the data:
• Statistical analysis
• Outlier detection can indicate records with erroneous data fields
• Cluster analysis can help segment the data into groups exhibiting strong similarities
• Correlation analysis can help users eliminate redundant fields or highlight associations between dimensions that might not have been apparent otherwise
Plot the data
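A minimal pandas sketch of this pipeline; the file name sales.csv and the price and quantity columns are illustrative assumptions.

import pandas as pd

df = pd.read_csv("sales.csv")              # get the data from the source
df = df.dropna()                           # clean: drop incomplete records

numeric = df.select_dtypes("number")
print(numeric.describe())                  # statistical analysis
print(numeric.corr())                      # correlation analysis between dimensions

z = (numeric - numeric.mean()) / numeric.std()
print(df[(z.abs() > 3).any(axis=1)])       # outlier detection: records far from the mean
# cluster analysis (e.g., k-means from scikit-learn) could segment similar records here

df.plot(kind="scatter", x="price", y="quantity")   # plot the data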

1. Discarding the bad record: This seemingly drastic measure, namely to throw away any data record containing a missing or erroneous field, is actually one of the most commonly applied, since the quality of the remaining data entries in that record may be in question. However, this can potentially lead to a significant loss of information, especially in data sets containing large numbers of records. In some domains, as much as 90% of records have at least one missing or erroneous field. In addition, those records with missing data may be the most interesting (e.g., due to a malfunctioning sensor or an overly high response to a drug).

2. Assigning a sentinel value: In the sentinel approach, the sentinel value could be some data-specific convention, such as indicating a missing integer value with -9999 or some rare bit pattern, or it could be a more global convention, such as indicating a missing floating-point value with NaN (Not a Number), a special value. If such a negative sentinel value is included in computations, it can distort the statistical analysis.

Another popular strategy is to have a designated sentinel value for each variable in the data set
that can be assigned when the real value in a record is in question. For example, in a variable
that has a range of 0 to 100, one might use a value such as −5 to designate an erroneous or
missing entry. Then, when the data is visualized, the records with problematic data entries will
be clearly visible. Of course, if this strategy is chosen, care must be taken not to perform
statistical analysis on these sentinel values.

3. Assigning the average value: This strategy works reasonably well when the data set is large.

A simple strategy for dealing with bad or missing data is to replace it with the average value for
that variable or dimension.
An advantage to using this strategy is that it minimally affects the overall statistics for that
variable. The average, however, may not be a good “guess” for this particular record.
Another drawback of using this method is that it may mask or obscure outliers, which can be of
particular interest.
4. Assign values based on nearest neighbors:
A better approximation for a substitute value is to find the record that has the highest similarity with the
record in question, based on analyzing the differences in all other variables. The basic idea here is that if
record A is missing an entry for variable i, and record B is closer than any other record to A without
considering variable i, then using the value of variable i from record B as a substitute in A is a reasonable
assumption.

The problem with this approach, however, is that variable i may be most dependent on only a subset of
the other dimensions, rather than on all dimensions, and so the best nearest neighbour based on all
dimensions may not be the best substitute for this particular dimension.
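A minimal sketch of this nearest-neighbour substitution; the small array and the use of NaN as the missing-value marker are illustrative assumptions.

import numpy as np

def impute_nearest(data: np.ndarray, row: int, col: int) -> float:
    # Fill data[row, col] with the value of the most similar complete record,
    # measuring similarity over all variables except the missing one.
    others = np.delete(np.arange(data.shape[1]), col)
    candidates = [r for r in range(data.shape[0])
                  if r != row and not np.isnan(data[r, col])]
    best = min(candidates,
               key=lambda r: np.nansum((data[r, others] - data[row, others]) ** 2))
    return data[best, col]

data = np.array([[1.0, 2.0, np.nan],
                 [1.1, 2.1, 5.0],
                 [9.0, 8.0, 1.0]])
print(impute_nearest(data, 0, 2))   # 5.0, taken from the most similar record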
Imputation: the process of developing methods for generating values to replace missing or erroneous data. Imputation seeks to find values that have high statistical confidence.

In a normal distribution, we impute the value with the mean.
In a skewed distribution, we use the median as the imputation value.
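A minimal sketch of that rule, choosing the mean or the median based on the measured skewness; the skew threshold and the income figures are illustrative assumptions.

import pandas as pd

def impute(series: pd.Series, skew_threshold: float = 1.0) -> pd.Series:
    # Use the median for strongly skewed data, otherwise the mean.
    fill = series.median() if abs(series.skew()) > skew_threshold else series.mean()
    return series.fillna(fill)

incomes = pd.Series([30_000, 32_000, 35_000, 40_000, 250_000, None])
print(impute(incomes))   # the heavy right skew makes the median the safer fill value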

Normalization : Normalization is the process of transforming a data set so that the results satisfy a
particular statistical property. A simple example of this is to transform the range of values a particular
variable assumes so that all numbers fall within the range of 0.0 to 1.0. Other forms of normalization
convert the data such that each dimension has a common mean and standard deviation. Normalization
is a useful operation since it allows us to compare seemingly unrelated variables. It is important in
visualization as well, since graphical attributes have a range of values that are possible, and thus to map
data to those attributes, we need to convert the data range to be compatible with the graphical
attribute range.

Normalization may also involve bounding values, so that, for example, values exceeding a particular
threshold are capped at that threshold. In this way the details falling within the specified range can be
more effectively interpreted when mapped to a specific graphical attribute. For example, density values
in a tomographic data set may have a substantial range, yet the range of interest for someone
interpreting the data may be a very small portion of that range. By truncating the range and normalizing,
the variation across the shortened range will be more easily perceived. This is especially important when
extreme outliers exist.

Normalized value = (Original – Min) / (Max – Min)

This is min-max normalization; it rescales all values into the range 0.0 to 1.0. Note that it is sensitive to extreme outliers, which can compress the remaining values into a narrow band.

Another form can be obtained by using the formula

Normalized value = (Original – Mean) / (Standard deviation)

This method is known as standardization; the resulting values have a mean of 0 and a standard deviation of 1, and the shape of the distribution of the data remains the same.
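Both formulas can be checked with a few lines of NumPy; the sample array is an illustrative assumption.

import numpy as np

x = np.array([2.0, 4.0, 6.0, 10.0])

min_max = (x - x.min()) / (x.max() - x.min())   # min-max normalization into [0, 1]
z_score = (x - x.mean()) / x.std()              # standardization: mean 0, std 1

print(min_max)    # [0.   0.25 0.5  1.  ]
print(z_score)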

Sometimes, for the sake of analysis and visualization we need to separate data into contiguous regions,
where each region corresponds to a particular classification of data.
This is called segmentation.
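A minimal sketch of segmentation on a one-dimensional signal: classify each sample, then group adjacent samples with the same class into contiguous regions. The signal and the 0.5 threshold are illustrative assumptions.

import numpy as np

signal = np.array([0.1, 0.2, 0.8, 0.9, 0.85, 0.3, 0.2])
labels = np.where(signal > 0.5, "high", "low")        # classification of each sample

regions, start = [], 0
for i in range(1, len(labels) + 1):
    if i == len(labels) or labels[i] != labels[start]:
        regions.append((start, i - 1, labels[start]))  # (first index, last index, class)
        start = i
print(regions)   # [(0, 1, 'low'), (2, 4, 'high'), (5, 6, 'low')]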
• Perception means recognizing, organizing and interpreting.
• Can be preattentive or attentive
• Color, texture and motion are important part of human perception

Interpreting at a glance is called preattentive processing.

Perception: We see what we expect to see; hence visualization must take into account what people know and expect. The main purpose of data visualization is to aid in good decision making. To make good decisions, we need to be able to understand trends, patterns, and relationships from a visual. This is also known as drawing insights from data. Perception is the ability to interpret the surrounding environment by processing information through sight, hearing, touch, smell, and taste. We don't see images with our eyes; we see them with our brains. The experience of visual perception is in fact what goes on inside our brains when we see a visual.

Perception deals with the human senses that generate signals from the environment through sight,
hearing, touch, smell, and taste. Vision and audition are the most well understood. Simply put,
perception is the process by which we interpret the world around us, forming a mental representation of the environment. This representation is not isomorphic to the world, but is subject to many correspondence differences and errors. The brain makes assumptions about the world to overcome the inherent ambiguity in all sensory data and in response to the task at hand.

Illusion : In a data visualization context, illusions are dangerous because they can make us see things
that aren’t really there in the data. One of the most common places these illusions affect data
visualization is in color scales. To avoid them, choose good colors. Size illusions can also be a major
problem, and come from context as well. They are especially problematic in bubble plots. No matter the
type of illusion, they can definitely cause difficulties in accurately portraying your data. The best defense
is just to be aware of the conditions that could cause them and when your visualization has those
conditions, double check for illusions.

Preattentive Processing: A preattentive visual property is one which is processed in spatial memory without our conscious action. In essence, it takes less than 500 milliseconds for the eye and the brain to process a preattentive property of any image. Preattentive processing refers to the body's processing of sensory information that occurs before the conscious mind starts to pay attention to any specific objects in its vicinity. All available information is pre-attentively processed; the brain then filters and processes what is important. An example of this is that when a person walks out of their home, the first thing that is noticed is the temperature and whether it is day or night; then the mind starts to process the events that are occurring in the area. If we tuned our awareness to everything, we would very soon be overwhelmed, so we selectively pay attention to things that catch our attention.

Information that stands out the most, or that is relevant to what a person is thinking about, is selected for further and more complete analysis by conscious (attentive) processing.

Variables of Visual Encoding: These help you find clarity when you try to create something using data. Visual encoding is a mapping from data to display elements. You, as a data visualiser, encode data visually, and the viewer must decode that information.

Image variables → plane (position), size, value

Differential variables → texture, color, orientation, shape
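A minimal matplotlib sketch of visual encoding, mapping four data attributes to position, size, and colour value; the small dataset is an illustrative assumption.

import matplotlib.pyplot as plt

x      = [1, 2, 3, 4]          # attribute 1 -> horizontal position on the plane
y      = [10, 14, 9, 17]       # attribute 2 -> vertical position on the plane
amount = [50, 120, 80, 200]    # attribute 3 -> marker size
rating = [0.2, 0.9, 0.5, 0.7]  # attribute 4 -> colour value

plt.scatter(x, y, s=amount, c=rating, cmap="viridis")
plt.colorbar(label="rating")   # the legend lets the viewer decode the colour encoding
plt.show()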

Motion can be a very functional component, a visual element that attracts humans' attention. Motion is used to animate the forms in a visualization to illustrate values of the data; the visual encodings are directly and continuously linked to the underlying data values.
The use of motion is common in certain areas of visualization, for example, the animation of particles,
dye, or glyphs to represent the direction and magnitude of a vector field (e.g., fluid flow visualization).
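A minimal sketch of glyph-based flow visualization; the rotational vector field below is an illustrative assumption chosen only to show how arrows encode direction and magnitude.

import numpy as np
import matplotlib.pyplot as plt

Y, X = np.mgrid[-2:2:20j, -2:2:20j]
U, V = -Y, X                              # a simple circular (rotational) field

plt.quiver(X, Y, U, V)                    # each arrow glyph encodes direction and magnitude
plt.title("Glyphs for a vector field")
plt.show()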

Animation can bring data to life, during both the visual exploration and storytelling phases. By animating
data visualisations, you can engage viewers in ways other methods may not be able to. For example,
transitions from one stage to the next enable us to track changes.

Animation is the technique of photographing successive drawings or positions of puppets or models to create an illusion of movement when the film is shown as a sequence. Use visuals that register in our recognition, which we then organize and interpret.
The intuition behind animation seems clear enough: if a two-dimensional image is good, then a moving image should be better. Movement is familiar: we are accustomed to both moving through the real world and seeing things in it move smoothly. All around us, items move, grow, and change color in ways that we understand deeply and richly. In a visualization, animation might help a viewer work through the logic behind an idea by showing the intermediate steps and transitions, or show how data collected over time changes. A moving image might offer a fresh perspective, or invite users to look deeper into the data presented. An animation might also smooth the change between two views, even if there is no temporal component to the data.
Animation can be a powerful technique when used appropriately, but it can be very bad when used poorly. Some animations can enhance the visual appeal of the visualization being presented, but may make exploration of the dataset more difficult; other animated visualizations facilitate exploration.
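A minimal sketch of animating a smooth transition with matplotlib's FuncAnimation; the shifting sine wave is an illustrative stand-in for data that changes over time.

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation

fig, ax = plt.subplots()
x = np.linspace(0, 2 * np.pi, 200)
(line,) = ax.plot(x, np.sin(x))

def update(frame):
    line.set_ydata(np.sin(x + frame / 10))   # shift the wave slightly each frame
    return (line,)

anim = FuncAnimation(fig, update, frames=100, interval=50, blit=True)
plt.show()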
Visual perception is the ability to interpret the surrounding environment by processing information that is contained in visible light. The resulting perception is also known as eyesight, sight, or vision.

