1.0 INTRODUCTION

Data is a set of values of qualitative or quantitative variables. Pieces of data are individual pieces
of information. While the concept of data is commonly associated with scientific research, data is
collected by a huge range of organizations and institutions, including businesses (e.g., sales data,
revenue, profits, stock price), governments (e.g., crime rates, unemployment rates, literacy rates)
and non-governmental organizations (e.g., censuses of the number of homeless people by non-
profit organizations).

Data is measured, collected and reported, and analyzed, whereupon it can be visualized using
graphs, images or other analysis tools. Data as a general concept refers to the fact that some
existing information or knowledge is represented or coded in some form suitable for better usage
or processing. Raw data ("unprocessed data") is a collection of numbers or characters before it
has been "cleaned" and corrected by researchers. Raw data needs to be corrected to remove
outliers or obvious instrument or data entry errors (e.g., a thermometer reading from an outdoor
Arctic location recording a tropical temperature). Data processing commonly occurs by stages,
and the "processed data" from one stage may be considered the "raw data" of the next stage.
Field data is raw data that is collected in an uncontrolled "in situ" environment. Experimental
data is data that is generated within the context of a scientific investigation by observation and
recording. Data has been described as the new oil of the digital economy.

Data, information, knowledge and wisdom are closely related concepts, but each has its own role
in relation to the other, and each term has its own meaning. According to a common view, data is
collected and analyzed; data only becomes information suitable for making decisions once it has
been analyzed in some fashion. Data is often assumed to be the least abstract concept,
information the next least, and knowledge the most abstract. In this view, data becomes
information by interpretation.

Gathering data can be accomplished through a primary source (the researcher is the first person
to obtain the data) or a secondary source (the researcher obtains the data that has already been
collected by other sources, such as data disseminated in a scientific journal). Data analysis
methodologies vary and include data triangulation and data percolation. The latter offers an
articulate method of collecting, classifying and analyzing data using five possible angles of
analysis (at least three) in order to maximize the research's objectivity and permit an
understanding of the phenomena under investigation as complete as possible: qualitative and
quantitative methods, literature reviews (including scholarly articles), interviews with experts,
and computer simulation. The data are thereafter "percolated" using a series of pre-determined
steps so as to extract the most relevant information.
2.0 DATA ANALYSIS

Data analysis, also known as analysis of data or data analytics, is a process of inspecting,
cleansing, transforming, and modeling data with the goal of discovering useful information,
suggesting conclusions, and supporting decision-making. Data analysis has multiple facets and
approaches, encompassing diverse techniques under a variety of names, in different business,
science, and social science domains.

Raw data can take a variety of forms, including measurements, survey responses, and
observations. In its raw form, this information can be incredibly useful, but also overwhelming.
Over the course of the data analysis process, the raw data is ordered in a way which will be
useful. For example, survey results may be tallied, so that people can see at a glance how many
people answered the survey, and how people responded to specific questions.
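As a minimal sketch of such a tally (the survey question and responses below are invented), a few lines of Python are enough to count answers:

    from collections import Counter

    # Hypothetical raw responses to a single yes/no/undecided survey question
    responses = ["yes", "no", "yes", "undecided", "yes", "no", "yes"]

    tally = Counter(responses)
    print("Total respondents:", sum(tally.values()))
    for answer, count in tally.most_common():
        print(f"{answer}: {count}")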

Data mining is a particular data analysis technique that focuses on modeling and knowledge
discovery for predictive rather than purely descriptive purposes, while business intelligence
covers data analysis that relies heavily on aggregation, focusing on business information. In
statistical applications data analysis can be divided into descriptive statistics, exploratory data
analysis (EDA), and confirmatory data analysis (CDA). EDA focuses on discovering new
features in the data and CDA on confirming or falsifying existing hypotheses. Predictive
analytics focuses on application of statistical models for predictive forecasting or classification,
while text analytics applies statistical, linguistic, and structural techniques to extract and classify
information from textual sources, a species of unstructured data. All are varieties of data
analysis.

Data integration is a precursor to data analysis, and data analysis is closely linked to data
visualization and data dissemination. The term data analysis is sometimes used as a synonym for
data modeling.

Data Analysis is the process of systematically applying statistical and/or logical techniques to
describe and illustrate, condense and recap, and evaluate data. According to Shamoo and Resnik
(2003), various analytic procedures “provide a way of drawing inductive inferences from data
and distinguishing the signal (the phenomenon of interest) from the noise (statistical fluctuations)
present in the data”.

While data analysis in qualitative research can include statistical procedures, many times
analysis becomes an ongoing iterative process where data is continuously collected and analyzed
almost simultaneously. Indeed, researchers generally analyze for patterns in observations through
the entire data collection phase (Savenye & Robinson, 2004). The form of the analysis is
determined by the specific qualitative approach taken (field study, ethnography, content analysis,
oral history, biography, unobtrusive research) and the form of the data (field notes, documents,
audiotape, and videotape).

An essential component of ensuring data integrity is the accurate and appropriate analysis of
research findings. Improper statistical analyses distort scientific findings, mislead casual readers
(Shepard, 2002), and may negatively influence the public perception of research. Integrity issues
are just as relevant to the analysis of non-statistical data.

In the course of organizing the data, trends often emerge, and these trends can be highlighted in
the write-up of the data to ensure that readers take note. In a casual survey of ice cream
preferences, for example, more women than men might express a fondness for chocolate, and
this could be a point of interest for the researcher. Modeling the data with the use of mathematics
and other tools can sometimes exaggerate such points of interest in the data, making them easier
for the researcher to see.

Analysis refers to breaking a whole into its separate components for individual examination.
Data analysis is a process for obtaining raw data and converting it into information useful for
decision-making by users. Data is collected and analyzed to answer questions, test hypotheses or
disprove theories.

Statistician John Tukey defined data analysis in 1961 as: "Procedures for analyzing data,
techniques for interpreting the results of such procedures, ways of planning the gathering of data
to make its analysis easier, more precise or more accurate, and all the machinery and results of
(mathematical) statistics which apply to analyzing data.”
There are several phases that can be distinguished as described below:

Fig. 1 Data science process flowchart from "Doing Data Science", Cathy O'Neil and Rachel Schutt, 2013

2.1 Data Requirements

The data necessary as inputs to the analysis are specified based upon the requirements of those
directing the analysis or of the customers who will use the finished product of the analysis. The general
type of entity upon which the data will be collected is referred to as an experimental unit (e.g., a
person or population of people). Specific variables regarding a population (e.g., age and income)
may be specified and obtained. Data may be numerical or categorical (i.e., a text label for
numbers).

2.2 Data Collection

Data is collected from a variety of sources. The requirements may be communicated by analysts
to custodians of the data, such as information technology personnel within an organization. The
data may also be collected from sensors in the environment, such as traffic cameras, satellites,
recording devices, etc. It may also be obtained through interviews, downloads from online
sources, or reading documentation.

2.3 Data Processing

The phases of the intelligence cycle used to convert raw information into actionable intelligence
or knowledge are conceptually similar to the phases in data analysis.

Data initially obtained must be processed or organised for analysis. For instance, this may
involve placing data into rows and columns in a table format (i.e., structured data) for further
analysis, such as within a spreadsheet or statistical software.
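A brief sketch of this step (the field names and values are hypothetical), using the pandas library to arrange raw records into rows and columns:

    import pandas as pd

    # Hypothetical raw records, e.g. transcribed from survey forms
    raw_records = [
        {"respondent": 1, "age": 34, "income": 52000},
        {"respondent": 2, "age": 29, "income": 47000},
        {"respondent": 3, "age": 41, "income": 61000},
    ]

    # Arrange the records into a table (rows and columns) for further analysis
    df = pd.DataFrame(raw_records)
    print(df)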

2.4 Data Cleaning

Once processed and organised, the data may be incomplete, contain duplicates, or contain errors.
The need for data cleaning will arise from problems in the way that data is entered and stored.
Data cleaning is the process of preventing and correcting these errors. Common tasks include
record matching, identifying inaccuracies, assessing the overall quality of existing data,
deduplication, and column segmentation. Such data problems can also be identified through a
variety of analytical techniques. For example, with financial information, the totals for particular
variables may be compared against separately published numbers believed to be reliable.
Unusual amounts above or below pre-determined thresholds may also be reviewed. There are
several types of data cleaning that depend on the type of data such as phone numbers, email
addresses, employers, etc. Quantitative methods for outlier detection can be used to remove data
that was likely entered incorrectly. Spell checkers can be used to reduce the number of mistyped
words in textual data, but it is harder to tell whether the words themselves are correct.
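As an illustration of one such quantitative outlier-detection method, the sketch below applies the common interquartile-range (IQR) rule to hypothetical temperature readings, echoing the Arctic-thermometer example from the introduction:

    import pandas as pd

    # Hypothetical outdoor readings (°C) from an Arctic station; 35.0 is almost
    # certainly an instrument or data-entry error
    readings = pd.Series([-18.2, -17.9, -19.5, -18.7, 35.0, -20.1, -18.4])

    q1, q3 = readings.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

    # Flag values outside the IQR fences for review rather than silently dropping them
    outliers = readings[(readings < lower) | (readings > upper)]
    print("Flagged for review:", outliers.tolist())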

2.5 Exploratory Data Analysis

Once the data is cleaned, it can be analyzed. Analysts may apply a variety of techniques referred
to as exploratory data analysis to begin understanding the messages contained in the data. The
process of exploration may result in additional data cleaning or additional requests for data, so
these activities may be iterative in nature. Descriptive statistics such as the average or median
may be generated to help understand the data. Data visualization may also be used to examine
the data in graphical format, to obtain additional insight regarding the messages within the data.
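A minimal sketch of this exploratory step, assuming the cleaned data sits in a pandas DataFrame with hypothetical age and income columns:

    import pandas as pd

    df = pd.DataFrame({"age": [34, 29, 41, 55, 23],
                       "income": [52000, 47000, 61000, 80000, 39000]})

    # Descriptive statistics: count, mean, standard deviation, min, quartiles, max
    print(df.describe())

    # Medians and a simple pairwise correlation as further exploratory checks
    print(df.median())
    print(df["age"].corr(df["income"]))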

2.6 Modeling and Algorithms

Mathematical formulas or models called algorithms may be applied to the data to identify
relationships among the variables, such as correlation or causation. In general terms, models may
be developed to evaluate a particular variable in the data based on other variable(s) in the data,
with some residual error depending on model accuracy (i.e., Data = Model + Error).

Inferential statistics includes techniques to measure relationships between particular variables.


For example, regression analysis may be used to model whether a change in advertising
(independent variable X) explains the variation in sales (dependent variable Y). In mathematical
terms, Y (sales) is a function of X (advertising). It may be described as Y = aX + b + error,
where the model is designed such that a and b minimize the error when the model predicts Y for
a given range of values of X. Analysts may attempt to build models that are descriptive of the
data to simplify analysis and communicate results.
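A small sketch of fitting such a model by ordinary least squares (the advertising and sales figures are invented for illustration):

    import numpy as np

    # Hypothetical advertising spend (X) and sales (Y) for six periods
    X = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    Y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])

    # Fit Y = aX + b by least squares, so that a and b minimise the error term
    a, b = np.polyfit(X, Y, deg=1)
    residuals = Y - (a * X + b)

    print(f"a = {a:.2f}, b = {b:.2f}")
    print("Residual sum of squares:", round(float(np.sum(residuals ** 2)), 3))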

2.7 Data Product


A data product is a computer application that takes data inputs and generates outputs, feeding
them back into the environment. It may be based on a model or algorithm. An example is an
application that analyzes data about customer purchasing history and recommends other
purchases the customer might enjoy.
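A toy sketch of such a recommendation product, scoring products by how often they are purchased together (the purchase histories are invented):

    from collections import Counter
    from itertools import combinations

    # Hypothetical purchase histories: one set of product codes per customer
    histories = [{"A", "B", "C"}, {"A", "C"}, {"B", "C", "D"}, {"A", "B"}]

    # Count how often each pair of products appears in the same history
    pair_counts = Counter()
    for basket in histories:
        for pair in combinations(sorted(basket), 2):
            pair_counts[pair] += 1

    def recommend(product, top_n=2):
        """Suggest the products most often bought together with `product`."""
        scores = Counter()
        for (p, q), count in pair_counts.items():
            if product == p:
                scores[q] += count
            elif product == q:
                scores[p] += count
        return [item for item, _ in scores.most_common(top_n)]

    print(recommend("A"))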

2.8 Communication

Once the data is analyzed, it may be reported in many formats to the users of the analysis to
support their requirements. The users may have feedback, which results in additional analysis.
As such, much of the analytical cycle is iterative.

When determining how to communicate the results, the analyst may consider data visualization
techniques to help clearly and efficiently communicate the message to the audience. Data
visualization uses information displays such as tables and charts to help communicate key
messages contained in the data. Tables are helpful to a user who might look up specific numbers,
while charts (e.g., bar charts or line charts) may help explain the quantitative messages contained
in the data.

Fig. 2: Data visualization to understand the results of a data analysis.


Author Stephen Few described eight types of quantitative messages that users may attempt to
understand or communicate from a set of data and the associated graphs used to help
communicate the message. Customers specifying requirements and analysts performing the data
analysis may consider these messages during the course of the process.

1. Time-series: A single variable is captured over a period of time, such as the
unemployment rate over a 10-year period. A line chart may be used to demonstrate the
trend.

Fig. 3: A time series illustrated with a line chart demonstrating trends in U.S. federal spending and revenue over
time

2. Ranking: Categorical subdivisions are ranked in ascending or descending order, such as a
ranking of sales performance (the measure) by sales persons (the category, with each
sales person a categorical subdivision) during a single period. A bar chart may be used to
show the comparison across the sales persons.
3. Part-to-whole: Categorical subdivisions are measured as a ratio to the whole (i.e., a
percentage out of 100%). A pie chart or bar chart can show the comparison of ratios, such
as the market share represented by competitors in a market.
4. Deviation: Categorical subdivisions are compared against a reference, such as a
comparison of actual vs. budget expenses for several departments of a business for a
given time period. A bar chart can show comparison of the actual versus the reference
amount.
5. Frequency distribution: Shows the number of observations of a particular variable for
a given interval, such as the number of years in which the stock market return is between
intervals such as 0–10%, 11–20%, etc. A histogram, a type of bar chart, may be used for
this analysis.
6. Correlation: Comparison between observations represented by two variables (X,Y) to
determine if they tend to move in the same or opposite directions. For example, plotting
unemployment (X) and inflation (Y) for a sample of months. A scatter plot is typically
used for this message.
7. Nominal comparison: Comparing categorical subdivisions in no particular order, such as
the sales volume by product code. A bar chart may be used for this comparison.
8. Geographic or geospatial: Comparison of a variable across a map or layout, such as the
unemployment rate by state or the number of persons on the various floors of a building.
A cartogram is a typical graphic used.
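These chart choices can be reproduced with standard plotting libraries; a brief sketch of the first two message types (time-series and ranking), using invented figures and the matplotlib library:

    import matplotlib.pyplot as plt

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

    # 1. Time-series: a single variable over time, shown as a line chart
    years = list(range(2011, 2021))
    rate = [9.0, 8.1, 7.4, 6.2, 5.3, 4.9, 4.4, 3.9, 3.7, 8.1]  # hypothetical rates
    ax1.plot(years, rate, marker="o")
    ax1.set_title("Time-series (line chart)")

    # 2. Ranking: a measure by categorical subdivision, shown as a bar chart
    salespeople = ["Ann", "Bob", "Cal", "Dee"]
    sales = [120, 95, 87, 60]  # hypothetical sales figures
    ax2.bar(salespeople, sales)
    ax2.set_title("Ranking (bar chart)")

    plt.tight_layout()
    plt.show()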

3.0 BARRIERS TO EFFECTIVE ANALYSIS

Barriers to effective analysis may exist among the analysts performing the data analysis or
among the audience. Distinguishing fact from opinion, cognitive biases, and innumeracy are all
challenges to sound data analysis.

3.1 Confusing Fact and Opinion

Effective analysis requires obtaining relevant facts to answer questions, support a conclusion or
formal opinion, or test hypotheses. Facts by definition are irrefutable, meaning that any person
involved in the analysis should be able to agree upon them. For example, in August 2010, the
Congressional Budget Office (CBO) estimated that extending the Bush tax cuts of 2001 and
2003 for the 2011–2020 time period would add approximately $3.3 trillion to the national debt.
Everyone should be able to agree that indeed this is what CBO reported; they can all examine the
report. This makes it a fact. Whether persons agree or disagree with the CBO is their own
opinion.

As another example, the auditor of a public company must arrive at a formal opinion on whether
financial statements of publicly traded corporations are "fairly stated, in all material respects."
This requires extensive analysis of factual data and evidence to support their opinion. When
making the leap from facts to opinions, there is always the possibility that the opinion is
erroneous.
3.2 Cognitive Biases

There are a variety of cognitive biases that can adversely affect analysis. For example,
confirmation bias is the tendency to search for or interpret information in a way that confirms
one's preconceptions. In addition, individuals may discredit information that does not support
their views.

Analysts may be trained specifically to be aware of these biases and how to overcome them. In
his book Psychology of Intelligence Analysis, retired CIA analyst Richards Heuer wrote that
analysts should clearly delineate their assumptions and chains of inference and specify the
degree and source of the uncertainty involved in the conclusions. He emphasized procedures to
help surface and debate alternative points of view.

3.3 Innumeracy

Effective analysts are generally adept with a variety of numerical techniques. However,
audiences may not have such literacy with numbers or numeracy; they are said to be innumerate.
Persons communicating the data may also be attempting to mislead or misinform, deliberately
using bad numerical techniques.

For example, whether a number is rising or falling may not be the key factor. More important
may be the number relative to another number, such as the size of government revenue or
spending relative to the size of the economy (GDP) or the amount of cost relative to revenue in
corporate financial statements. This numerical technique is referred to as normalization or
common-sizing. There are many such techniques employed by analysts, whether adjusting for
inflation (i.e., comparing real vs. nominal data) or considering population increases,
demographics, etc. Analysts apply a variety of techniques to address the various quantitative
messages described in the section above.
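A small worked sketch of such normalization, expressing spending as a share of GDP (common-sizing) and deflating a nominal series to real terms (all figures invented):

    # Hypothetical nominal spending, GDP, and price index for three years
    nominal_spending = [100.0, 110.0, 125.0]
    gdp = [1000.0, 1080.0, 1150.0]
    price_index = [1.00, 1.04, 1.09]   # base year = 1.00

    # Common-sizing: spending relative to the size of the economy
    share_of_gdp = [s / g for s, g in zip(nominal_spending, gdp)]

    # Real (inflation-adjusted) spending in base-year terms
    real_spending = [s / p for s, p in zip(nominal_spending, price_index)]

    print([f"{x:.1%}" for x in share_of_gdp])
    print([round(x, 1) for x in real_spending])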

Analysts may also analyze data under different assumptions or scenarios. For example, when
analysts perform financial statement analysis, they will often recast the financial statements
under different assumptions to help arrive at an estimate of future cash flow, which they then
discount to present value based on some interest rate, to determine the valuation of the company
or its stock. Similarly, the CBO analyzes the effects of various policy options on the
government's revenue, outlays and deficits, creating alternative future scenarios for key
measures.
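For the discounting step mentioned above, a minimal sketch (the cash-flow estimates and discount rate are hypothetical):

    # Hypothetical estimated future cash flows (one per year) and discount rate
    cash_flows = [120.0, 130.0, 140.0, 150.0]
    rate = 0.08

    # Present value: each year's cash flow discounted back to today
    present_value = sum(cf / (1 + rate) ** year
                        for year, cf in enumerate(cash_flows, start=1))
    print(round(present_value, 2))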

3.4 Training of Staff Conducting Analyses

A major challenge to data integrity can arise when the use of inductive techniques goes
unmonitored. Content analysis, for example, requires raters to assign topics to text material
(comments). The threat to integrity arises when raters have received inconsistent training or
bring different prior training experiences. Previous experience may affect how raters perceive
the material or even how they perceive the nature of the analyses to be conducted. Thus, one
rater could assign topics or codes to material in a way that differs significantly from another
rater. Strategies to address this include clearly stating the list of analysis procedures in the
protocol manual, consistent training, and routine monitoring of raters.

3.5 Data Recording Method

Analyses could also be influenced by the method in which data was recorded. For example,
research events could be documented by:

a. recording audio and/or video and transcribing later,

b. either a researcher-administered or a self-administered survey,

c. either a closed-ended or an open-ended survey,

d. preparing ethnographic field notes from a participant/observer and

e. requesting that participants themselves take notes, compile and submit them to
researchers.

While each methodology employed has its own rationale and advantages, issues of objectivity and
subjectivity may be raised when the data is analyzed.

3.6 Manner of Presenting Data


At times investigators may enhance the impression of a significant finding by determining how
to present derived data (as opposed to data in its raw form), which portion of the data is shown,
why, how and to whom (Shamoo, Resnik, 2003). Nowak (1994) notes that even experts do not
agree in distinguishing between analyzing and massaging data. Shamoo (1989) recommends that
investigators maintain a sufficient and accurate paper trail of how data was manipulated for
future review.

3.7 Environmental/Contextual Issues

The integrity of data analysis can be compromised by the environment or context in which data
was collected (e.g., face-to-face interviews vs. focus groups). The interaction occurring within a
dyadic relationship (interviewer-interviewee) differs from the group dynamic occurring within a
focus group because of the number of participants, and how they react to each other’s responses.
Since the data collection process could be influenced by the environment/context, researchers
should take this into account when conducting data analysis.

3.8 Following Acceptable Norms for Disciplines

Every field of study has developed its accepted practices for data analysis. Resnik (2000) states
that it is prudent for investigators to follow these accepted norms. Resnik further states that the
norms are ‘…based on two factors:

1) the nature of the variables used (i.e., quantitative, comparative, or qualitative), and

2) assumptions about the population from which the data are drawn (i.e., random
distribution, independence, sample size, etc.)’.

If one uses unconventional norms, it is crucial to clearly state that this is being done, and to
show how this new and possibly unaccepted method of analysis is being used, as well as how it
differs from other, more traditional methods.

For example, Schroder, Carey, and Vanable (2003) juxtapose their identification of new and
powerful data analytic solutions developed to count data in the area of HIV contraction risk with
a discussion of the limitations of commonly applied methods.

3.9 Drawing Unbiased Inference

The chief aim of analysis is to distinguish whether an observed event reflects a true effect or a
false one. Any bias occurring in the collection of the data, or in the selection of the method of
analysis, will increase the likelihood of drawing a biased inference. Bias can occur when
recruitment of study participants falls below the minimum number required to demonstrate
statistical power, or when a sufficient follow-up period needed to demonstrate an effect is not
maintained (Altman, 2001).

3.10 Reliability and Validity

Researchers performing either quantitative or qualitative analyses should be aware of
challenges to reliability and validity. For example, in the area of content analysis, Gottschalk
(1995) identifies three factors that can affect the reliability of analyzed data:

a. stability, or the tendency for coders to consistently re-code the same data in the same
way over a period of time,
b. reproducibility, or the tendency for a group of coders to classify category membership
in the same way, and
c. accuracy, or the extent to which the classification of a text corresponds statistically to a
standard or norm.

The potential for compromising data integrity arises when researchers cannot consistently
demonstrate stability, reproducibility, or accuracy of data analysis.

According to Gottschalk (1995), the validity of a content analysis study refers to the
correspondence of the categories (the classifications that raters assigned to text content) to the
conclusions, and the generalizability of results to a theory (did the categories support the study’s
conclusion, and was the finding adequately robust to support or be applied to a selected
theoretical rationale?).
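Gottschalk's criteria are not tied to any single statistic, but one common way to quantify agreement between coders (bearing on reproducibility) is percent agreement together with Cohen's kappa; the sketch below uses invented topic codes from two hypothetical raters:

    from collections import Counter

    # Hypothetical topic codes assigned by two raters to the same ten comments
    rater_a = ["risk", "cost", "risk", "other", "cost",
               "risk", "other", "cost", "risk", "risk"]
    rater_b = ["risk", "cost", "cost", "other", "cost",
               "risk", "other", "cost", "risk", "other"]

    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Agreement expected by chance, from each rater's marginal code frequencies
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / n ** 2

    kappa = (observed - expected) / (1 - expected)
    print(f"percent agreement = {observed:.2f}, Cohen's kappa = {kappa:.2f}")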

REFERENCES

Beynon-Davies, P. (2002). Information Systems: An introduction to informatics in organisations.
Basingstoke, UK: Palgrave Macmillan. ISBN 0-333-96390-3.
Beynon-Davies, P. (2009). Business information systems. Basingstoke, UK: Palgrave. ISBN
978-0-230-20368-6.
Gottschalk, L. A. (1995). Content analysis of verbal behavior: New findings and clinical
applications. Hillsdale, NJ: Lawrence Erlbaum Associates, Inc.
Jeans, M. E. (1992). Clinical significance of research: A growing concern. Canadian Journal of
Nursing Research, 24, 1-4.
Lefort, S. (1993). The statistical versus clinical significance debate. Image, 25, 57-62.
Kendall, P. C., & Grove, W. (1988). Normative comparisons in therapy outcome. Behavioral
Assessment, 10, 147-158.

Mesly, Olivier (2015). Creating Models in Psychological Research. United States: Springer
Psychology. 126 pages. ISBN 978-3-319-15752-8.

Nowak, R. (1994). Problems in clinical trials go far beyond misconduct. Science. 264(5165):
1538-41.
O'Neil, Cathy, and Schutt, Rachel (2013). Doing Data Science. O'Reilly. ISBN 978-1-449-
35865-5.

Resnik, D. (2000). Statistics, ethics, and research: an agenda for education and reform.
Accountability in Research, 8: 163-88.
Schroder, K.E., Carey, M.P., Vanable, P.A. (2003). Methodological challenges in research on
sexual risk behavior: I. Item content, scaling, and data analytic options. Ann Behav Med,
26(2): 76-103.
Shamoo, A.E., Resnik, D.B. (2003). Responsible Conduct of Research. Oxford University Press.
Shamoo, A.E. (1989). Principles of Research Data Audit. Gordon and Breach, New York.
Shepard, R.J. (2002). Ethics in exercise science research. Sports Med, 32 (3): 169-183.
Silverman, S., Manson, M. (2003). Research on Teaching in Physical Education Doctoral
Dissertations: a Detailed Investigation of Focus, Method, and Analysis. Journal of
Teaching in Physical Education, 22(3): 280-297.
Smeeton, N., Goda, D. (2003). Conducting and presenting social work research: some basic
statistical considerations. Br J Soc Work, 33: 567-573.
Thompson, B., Noferi, G. (2002). Statistical, practical, clinical: How many types of significance
should be considered in counseling research? Journal of Counseling & Development,
80(4): 64-71.
