
Data Analytics Workflow (CC) Walter Johnston

Data Analytics Workflow


(Source: Walter Johnston, 5/16/2019)

Diagram Explanation

Each circle or oval figure in the diagram represents a resource, activity or deliverable in the process
under description. Figures with only outgoing arrows are resources while figures with only incoming
solid arrows are deliverables. Each solid arrow shows information flow, starting at the various
categories of data sources and leading to deliverables for each step in a path. Dotted arrows indicate
feedback paths available for process improvement.

Definitions and Comments

Public Data

Public Data is openly available data obtainable through download or web scraping and is considered
both free and unrestricted. This data may require registration and/or a login but has no restrictions on
transfer, sharing, or redistribution. Retention of metadata such as date obtained, URL, timestamps or
other identifiers is recommended as the data moves through to the Local Analysis Data stage. Activities
including (but not limited to) data auditing, cleaning, and re-formatting should be performed in the later
Local Analysis Data stage. No assumptions as to temporal or spatial coverage, completeness or
accuracy of Public Data are made at this process stage. Public Data often, but not always, originates at
a public or non-governmental organization (NGO) source.
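
The metadata retention recommended above could be sketched as follows. This is a minimal illustration
in Python, not a prescribed method: the function name build_provenance, the field names, and the
example URL are all assumptions made for this sketch, and the checksum is an added convenience for
detecting later changes to the file.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import Optional


def build_provenance(url: str, payload: bytes,
                     obtained_at: Optional[datetime] = None) -> dict:
    """Return a metadata record to store alongside a downloaded file.

    Field names are illustrative assumptions, not a standard.
    """
    obtained_at = obtained_at or datetime.now(timezone.utc)
    return {
        "source_url": url,                       # where the data came from
        "obtained_at": obtained_at.isoformat(),  # when it was obtained
        "size_bytes": len(payload),
        "sha256": hashlib.sha256(payload).hexdigest(),  # detects later changes
    }


# Hypothetical example: the URL and payload are placeholders, not a real source.
record = build_provenance(
    "https://example.org/data.csv",
    b"id,value\n1,10\n",
    obtained_at=datetime(2019, 5, 16, tzinfo=timezone.utc),
)
print(json.dumps(record, indent=2))
```

Storing such a record next to the raw file keeps the original untouched while the identifiers travel
with it into the Local Analysis Data stage.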

Proprietary Data

Proprietary Data may originate as Public Data. It may also originate at a non-public source. The
primary motivations for use of Proprietary Data are: lack of an alternative source, a large user community
which facilitates acceptance of analytical results, and “curation” which for current purposes implies a
degree of data auditing, cleaning, and (possibly) re-formatting. Any advantages derived from
“curation” must be balanced against in-house costs (including but not limited to both equipment and
personnel). Its defining characteristics include cost(s) and restrictions on transfer, sharing, or
redistribution.

Internal Data

Internal Data is data which already exists for other use(s) inside the current work environment and which
is already kept current and maintained at a level sufficient for the currently intended analysis.

Local Analysis Data

Local Analysis Data is the “local” storage of all Internal, Proprietary or Public data to be used in the
current analysis. This repository must be controlled by the organization performing the analysis to
maintain data integrity, consistency, and availability. Before data is made available through this
repository it must be audited for layout and content, cleaned for accuracy and consistency, and
formatted/transformed for easy use in the analysis. “Local” storage mechanisms can vary but must
include safe and reliable persistence.
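
The audit, clean, and format steps described above might look like the following sketch. It uses only
the Python standard library; the column layout (id, region, value), the function names, and the sample
rows are hypothetical and stand in for whatever the actual data requires.

```python
import csv
import io

# Hypothetical layout that the analysis expects.
EXPECTED_COLUMNS = ["id", "region", "value"]


def audit_layout(fieldnames):
    """Audit step: list the expected columns that the file lacks."""
    return [c for c in EXPECTED_COLUMNS if c not in (fieldnames or [])]


def clean_and_format(rows):
    """Clean and format rows; return (usable, rejected).

    Rejected rows are kept rather than silently dropped so they can be reviewed.
    """
    usable, rejected = [], []
    for row in rows:
        try:
            usable.append({
                "id": int(row["id"]),
                "region": row["region"].strip().upper(),  # consistent labels
                "value": float(row["value"]),
            })
        except (KeyError, ValueError, AttributeError, TypeError):
            rejected.append(row)
    return usable, rejected


# Made-up sample data: one clean row and one row with an unusable value.
raw = "id,region,value\n1, north ,10.5\n2,south,n/a\n"
reader = csv.DictReader(io.StringIO(raw))
missing = audit_layout(reader.fieldnames)
usable, rejected = clean_and_format(reader)
print(missing, usable, len(rejected))
```

Only the usable rows would be written to the repository; the rejected rows and any missing columns are
findings to resolve before the data is made available for analysis.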

Question(s)

Question(s) are the motivations for data analysis. Initial questions from a customer (internal or
external) will often be vague and may reduce to a variation of “how does X affect Y” or “can we
predict X”. There is generally a backlog of unasked questions which may be exposed as customer
questions are answered. It is also often the case that answers to an initial question will either change
the importance of existing questions or suggest new questions.

Analytics

Analytics, as used in this document, is the use of data to answer questions. Our course is specifically
focused on understanding the available tools, their use, and the interpretation of the results which they
present. The fundamental purpose of Analytics is insight.

Ad Hoc Reports

Ad Hoc Reports are the result of questions asked without the expectation of repeated answers. These
reports can be pilot studies investigating the feasibility, cost, and value of answering a specific question,
and may then be used as the basis for Production. Code developed for these reports should be
retained and managed, as questions have a tendency to be repeated, with or without variations.

Production

Production is the repeated use of Analytics results to assess new/changed data. This may be done on a
scheduled or on-demand basis.

Production Report(s)/Output(s)

Production Report(s)/Output(s) are the persistent results from Production. The content of these
report(s)/output(s) can change as the underlying data inputs change. Output(s) are considered to be the
retained, persistent result(s) of any data transformation while Report(s) are the subset intended for use
directly by people (printouts, listings, plots, etc.).

Assessment

Assessment is the on-going comparison of predictions generated through Production to actual
outcomes. These comparisons need to be checked for compliance with expected error rates and
magnitudes to provide feedback to the Analytics activity on whether Production needs to be updated.
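
The comparison described above can be sketched in a few lines of Python. The choice of mean absolute
error as the measure, the 10% tolerance, the function name assess, and the sample numbers are all
assumptions made for illustration; the appropriate error measure depends on the question being answered.

```python
def assess(predictions, actuals, expected_mae, tolerance=0.10):
    """Compare predictions to actual outcomes.

    Flags a need for update when the observed mean absolute error exceeds
    the expected value by more than the given tolerance (assumed 10%).
    """
    if not predictions or len(predictions) != len(actuals):
        raise ValueError("predictions and actuals must be non-empty and equal in length")
    errors = [abs(p - a) for p, a in zip(predictions, actuals)]
    mae = sum(errors) / len(errors)
    return {
        "mae": mae,                  # typical error magnitude
        "max_error": max(errors),    # worst single miss
        "needs_update": mae > expected_mae * (1 + tolerance),
    }


# Made-up numbers: predictions from Production versus outcomes observed later.
result = assess([10, 12, 9, 15], [11, 12, 13, 14], expected_mae=1.0)
print(result)
```

A result with needs_update set would be the feedback signal sent back to the Analytics activity, the
dotted-arrow path in the diagram.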
