
GT Mentorship

Research Proposal
Alex Blackert
Bryan Gorman
January 21, 2022

Using Bayesian Inference to Increase the Accuracy of Data Visualization Comprehension

Overview of Research
The field of research is predictive analytics and data visualization, focusing on a new
method of optimizing data visualizations using Bayesian analysis. Numerous biases have been
identified that specifically affect how individuals perceive trends and relationships among
variables in data visualizations, and the magnitude of many of these biases has been linked to
one or more features of a visualization. Optimizing these features is therefore a natural route to
maximizing a visualization's effectiveness. The present research explores the use of Bayesian
inference to measure how changes to a data visualization affect the accuracy of readers'
conclusions. The experiment measures how close a test group's interpretation of a data
visualization is to an ideal Bayesian model, and uses the difference between the resulting belief
curves across trials to pinpoint how the tested feature can be optimized.

Background and Rationale


There are countless individual contexts in which data visualizations are interpreted and
must convey information efficiently and accurately. Each of these contexts demands a
specialized data model, and viewers are subject to different biases depending on the data type
and context. Because of this, general research on data visualizations that applies across many
situations is extremely difficult. An experiment that lays a foundation for future research is
therefore the most appropriate approach. By demonstrating a valid method of analyzing the
effect of a specific aspect of a data model, this research can provide a template for how other
researchers can investigate properties of specialized data visualizations in specific contexts.
Within the proposed study, the method of statistical analysis used is Bayesian statistics.
Bayesian inference, a key part of Bayesian statistics, is a method of belief updating that
combines prior beliefs with new evidence to form an optimal posterior belief. At its most basic,
this is done by applying Bayes' rule to simple point probabilities, but it can also be done with
full probability distributions. By measuring an individual's prior beliefs and presenting them
with a dataset, an ideal Bayesian posterior can be computed. This ideal posterior can then be
compared to the individual's actual posterior to quantify the error in how the individual
incorporated the dataset into their beliefs. When done on a sample, an aggregate model can
be formed to view the mean belief change of a population given a specific data visualization.
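As a minimal sketch of this comparison, the update below works over a grid of candidate slope values. The proposal names MATLAB or similar tools; Python is used here purely for illustration, and the prior width, dataset, and noise level are all hypothetical.

import numpy as np

# Grid of candidate slopes for the believed linear relationship.
slopes = np.linspace(-2.0, 2.0, 401)

# Hypothetical elicited prior: the participant believes the slope is near
# 0.5, with the cone width suggesting an uncertainty of about 0.4.
prior = np.exp(-0.5 * ((slopes - 0.5) / 0.4) ** 2)
prior /= prior.sum()

# Hypothetical dataset shown to the participant, with known noise sigma.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 30)
y = 0.8 * x + rng.normal(0.0, 0.2, 30)
sigma = 0.2

# Bayes' rule on the grid: posterior is proportional to prior * likelihood
# (intercept fixed at zero to keep the sketch one-dimensional).
log_lik = np.array([-0.5 * np.sum(((y - s * x) / sigma) ** 2) for s in slopes])
posterior = prior * np.exp(log_lik - log_lik.max())
posterior /= posterior.sum()

This computed posterior is the ideal against which a participant's elicited posterior would be compared.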
The method of belief elicitation incorporated into the study design is the Line-Cone
method. It is used for a bivariate dataset, in which the believed relationship between two
variables is measured. Participants first draw a line on a coordinate plane indicating the trend
they believe is most likely. They then draw several alternative lines that they consider realistic.
The cone formed around the initial prediction provides the density from which the prior
probability distribution is constructed.
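One plausible way to turn the elicited line and cone into a prior density is sketched below; the exact construction is not fixed by the proposal, and the elicited slope values are hypothetical.

import numpy as np
from scipy.stats import norm

# Hypothetical elicited responses: the slope of the best-guess line and
# the slopes of the alternative lines that bound the cone.
best_slope = 0.5
alternative_slopes = np.array([0.1, 0.3, 0.8, 1.0])

# Center the prior on the best guess and let the spread of the
# alternatives set the width of the cone.
spread = alternative_slopes.std()

slopes = np.linspace(-2.0, 2.0, 401)
prior = norm.pdf(slopes, loc=best_slope, scale=spread)
prior /= prior.sum()  # normalize over the grid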

Primary Research Methodology


Research Question
How can data visualizations be improved to reduce bias and increase the accuracy of a viewer's
interpretation of the information?
Research Hypothesis
To increase the accuracy with which non-professional audiences comprehend medical statistics,
Bayesian cognitive models should be used to preemptively improve the presentation of the
statistical information that is distributed.
Research Methodology
Quantitative research will be performed in order to reach a solid conclusion about an observed
trend. This will also allow the magnitude of the trend to be measured, which adds an extra layer
of usefulness to the information collected. The design will be experimental, allowing a
cause-and-effect relationship to be established through the manipulation of an independent
variable. In this experiment, the independent variable will be the degree of the color density
gradient used: it will vary from no color, to a simple color gradient that emphasizes data density
in a scatter plot, to a heavier, more contrasted gradient (a rough sketch of these conditions
follows below). There will likely be more than three conditions, because the experiment is
designed to locate an optimum, which may not lie at an extreme.
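The sketch below shows one way the three conditions might be rendered; the colormaps and density estimate are illustrative choices, not the study's final design.

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

# Hypothetical bivariate sample for the stimulus scatter plot.
rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, 500)
y = 0.8 * x + rng.normal(0.0, 0.5, 500)

# Point-density estimate that drives the color gradient.
density = gaussian_kde(np.vstack([x, y]))(np.vstack([x, y]))

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
axes[0].scatter(x, y, color="black")              # no color gradient
axes[1].scatter(x, y, c=density, cmap="Blues")    # simple density gradient
axes[2].scatter(x, y, c=density, cmap="inferno")  # heavier, more contrasted
for ax, title in zip(axes, ["No color", "Simple gradient", "High contrast"]):
    ax.set_title(title)
plt.show()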
In this case, the experiment uses Bayesian analysis to find the condition that minimizes
error. Minimum error refers to the smallest distance between the curves of the elicited posterior
and the calculated Bayesian posterior. When the experiment is conducted, a prior distribution
will be collected from each participant using the Line-Cone belief elicitation method. This
allows participants to show the relationship they believe holds between the two tested variables,
and then to provide alternative possibilities they consider plausible, which together form a
probability distribution. This distribution is later combined with the sample dataset to compute
a Bayesian posterior distribution. After communicating their beliefs, the participants are shown
a data visualization containing information relevant to the topic they previously conveyed their
prior beliefs about. They then communicate their beliefs about the same relationship again, after
viewing the data. This new belief, the elicited posterior, is measured against the calculated
Bayesian posterior to assess how accurately the participants learned from the dataset. MATLAB
or other similar, readily available tools will be used to generate the sample datasets and perform
the statistical calculations. By comparing the accuracy of learning across conditions, the ideal
color gradient can then be determined.
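The proposal does not fix a particular distance measure between the two posterior curves; total variation distance is one simple candidate, sketched here over a shared grid.

import numpy as np

def posterior_error(elicited, calculated):
    """Distance between a participant's elicited posterior and the
    calculated Bayesian posterior, both defined over the same grid."""
    elicited = np.asarray(elicited, dtype=float)
    calculated = np.asarray(calculated, dtype=float)
    elicited /= elicited.sum()
    calculated /= calculated.sum()
    # Total variation distance: half the absolute area between the curves.
    return 0.5 * np.abs(elicited - calculated).sum()

The gradient condition with the smallest mean error across participants would then be taken as the optimized visualization.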

Product Objectives
The paper produced by this study will itself serve as the product. The methodology
of the study should give other professionals grounds to analyze the effectiveness of altering
other aspects of a data visualization. The present study serves as a proof of concept for
further research into the optimization of data visualizations for specific circumstances and
audiences. This will allow professionals to study aspects of data models that are specifically
relevant to the contexts they work in, since data size, density, variability, and many other
qualities may shift drastically between contexts.
The audience for this paper is therefore professional researchers seeking to mitigate
biases in the interpretation of data visualizations. As a proof of concept, the study can
provide an example of how Bayesian analysis can be applied to optimize different qualities of a
data model in niche contexts.

Logistical Considerations
Sample size is the only major logistical barrier to this research project. Acquiring an
adequate sample will be especially difficult, and even then it is unlikely to reach professional
standards. Because each test in the experiment is quick, the same sample group can be used for
multiple trials, with the order randomized between individuals. This could even improve the
accuracy of the results, since the learning process would be compared within the same
individuals. Doing so would cut down on the time required for testing, but would increase
preparation time, because additional datasets would have to be created. Within the next month,
the datasets will be created and sample testing will take place. Depending on the chosen
sampling methods, more time may be allocated to preparation of materials or to testing.

Significance of Research
The accuracy of data comprehension is a large field of research. One area in which this
type of research is extraordinarily beneficial is military vessels such as submarines. Operators in
these settings are bombarded with dense sonar, radar, and other sensor data. Because these are
high-pressure environments, it is important that the visualization models used are optimized to
lead the operator to the correct conclusions. All measurements come with a degree of
uncertainty, and conveying this uncertainty in data visualizations in an optimal way is crucial.
Failing to convey uncertainty can lead to overconfident errors in important decisions, while
overemphasizing it can make data seem useless or prevent an operator from making correct
choices. By demonstrating how a feature of a data model can be measured and optimized to
increase accurate comprehension, the paper can provide a proof of concept for further research
and discussion. It may also surface new questions whose answers could further increase decision
accuracy in certain situations. This will benefit the paper's audience, all of whom require insight
in specific, niche contexts and will therefore have a proof of concept for studying a model
feature that is especially relevant to their own context of research.
