
1

TM351
Data management and analysis
3

Caveat

• These slides DO NOT replace the course learning materials
• Exams WILL BE derived from the full set of the course learning materials
4

SESSION 1
Module guide, software guide, part 1 and part 2
7

MODULE GUIDE
11

1.3 Module structure


The major areas are:
• the data analysis pipeline (Parts 2–5), which looks at specific
issues around data acquisition, preparation, analysis and
presentation
• relational database management systems (Parts 8–12)
• document and non-relational systems and distributed
storage and processing (Parts 13–18)
• data warehousing and data mining (Parts 19–22)
• linked data and the ‘semantic web’ (Parts 24–26).
12

1.3 Module structure


The other areas are:
• an introduction to data management and analysis (Part 1)
• data protection and data privacy (Part 6)
• when spreadsheets fail – how scale changes things (Part 7)
• the secure management of data (Part 23).
17

Learning Python
Online tutorial
https://docs.python.org/3/tutorial/index.html
Textbook
• http://interactivepython.org/runestone/static/pythonds/Introduction/ReviewofBasicPython.html

• Start by completing the Bootcamp notebooks (Part 1, Notebooks 1.1–1.5)
18

Learning materials
• Downloadable from the Central LMS (course contents).

• Course software (downloadable from the internet)


19

Course software
• We will not use the virtual machine indicated in the original
OU software guide
• For data management we will use PostgreSQL and
MongoDB
• For data analysis we will use Anaconda as the main
analysis environment, with Python as the main
programming language plus a variety of analysis libraries
23

Assessment
MTA: 30%
TMA: 20%
Final: 50%
24

SOFTWARE GUIDE
26

Changes from OU guide


• Different guide
• No virtualization
• Different software
28

General guidelines
• Read the software guide now
• Install Anaconda now
• Go through the bootcamp notebooks 1.1-1.5 now
29

PART 1
Introducing data management and analysis
33

Data and data sets (Rob Kitchin 2014)

Form, structure, source, producer and type

Characterisation and characteristic values:
• Form
– Quantitative: numerical data
– Qualitative: non-numeric data, such as texts, audio, imagery, video
• Structure
– Structured: well organised, with a well-defined data model
– Semi-structured: more irregular structure, often hierarchical to an arbitrary depth
– Unstructured: no identifiable data model or structure
• Source
– Captured: data captured and recorded for a particular purpose
– Exhaust: data produced as a by-product of some other activity, such as the list of items in your shopping basket, collected in the process of calculating how much you owe the supermarket for them
– Original or derived: derived data is the result of processing original data, for example, the total cost (derived) of items in your shopping basket (original)
– Transient: transient data is data that is generated but is of little value, so not collected (for example, cursor positions on websites)
34

Data and data sets (Rob Kitchin 2014)

Characterisation and characteristic values:
• Producer
– Primary: generated by the producer for their own use
– Secondary: data provided by a producer to another user for (re)use over and above the primary use
– Tertiary: derived data published for use by third parties, e.g. statistical tables and reports
• Type
– Indexical: data that includes unique identifiers (e.g. a UK National Insurance number), allowing data items to be linked across distinct data collections
– Attribute: properties or attributes of a data item; multiple attributes of the same item (e.g. a customer’s name, age and postcode) may be unique to that item
– Metadata: data about data – see Section 2.4 below
35

2.4 Metadata
• data about the dataset itself.
Three kinds:
• descriptive: supporting identification and discovery: for
example, the name of a dataset, or a description of its
contents
• structural: relating to the structure of the dataset: for
example, the column headings in a tabular dataset
• administrative: recording the means by which the
dataset came into being and how it may be, or may have
been, used.
• Can also be subject to data analysis
36

3.1 Stakeholders
• A dataset or database may have a very broad range of
stakeholders
• Different stakeholders will have widely different concerns.
• For example, if data about an individual is being analysed,
then:
• that individual is a stakeholder
• So is the Information Commissioner (Data Protection Act),
protecting the legal interests of all data subjects.
39

3.2 Scale
The three (or six) Vs
• Volume: our traditional measure of data size – how much
there is of it.
• Variety: in many different, sometimes incompatible forms
and representations.
• Velocity: how fast new data is generated and has to be
processed.
• Three more Vs are now becoming current:
• Veracity: the quality of the data; how ‘clean’ it is.
• Validity: to what extent the facts the data incorporates are
correct and consistent for their context
• Volatility: how quickly data changes, or becomes invalid.
42

Big data
• At the greatest extent of the scales of the three (or six) Vs
lies ‘big data’.
• big data aims to gather, analyse, link, and compare large
datasets to identify patterns
• ‘N equals All’ approach: computational power has
become so great that the whole population can now be
analysed, without the need for prior hypotheses.
• You will learn techniques to analyse data in a ‘hypothesis-
free’ way in later sections.
• There are cases in which this strategy fails spectacularly
e.g. predicting flu outbreaks
43

4 Data handling
• Comprises two distinct sets of activities, roles and
responsibilities: data management and data analysis.
• In reality, though, these two are highly interdependent.
• Two ways to characterise data handling:
• as a cycle or life cycle (which mainly emphasises data
management)
• or as a pipeline (which combines data management with the way
the data is used).
44

4.1 The data cycle

Figure: the data cycle – use well-managed data for other purposes
45

4.2 The data pipeline

Figure 1.4 A data analysis pipeline

• In practice, activities that appear neatly separated are actually
combined or revisited.
• For example, after initial preparation and some early
analysis there may be a need to identify and acquire more
data, which will itself require preparation and analysis.
46

Variations

Figure 1.5 A variation on the data analysis pipeline

• From the ‘School of Data’ civil society project, which
provides skills development to NGOs involved in
investigative data projects.
48
49

Further elaborations
• The pipeline representation could be redrawn to highlight
specific stakeholder requirements
• it is possible to show different stakeholder requirements
throughout the data flow.
• If a specific viewpoint is required it can be extracted from
the full representation.
• Such deeply decomposed descriptions could be
annotated to show how and where different tools or
technologies are applied, or how aspects of data
management might change at different points in the
pipeline. This is useful for large corporations.
52

Issues in data management


key issues:
• legal
• control
• curation
• flexibility
• currency and maintenance.
62

Some common large-scale data management architectures
• OLTP (online transaction processing): guarantees a high
degree of consistency and correctness in the data through a
series of transactions, fault tolerance and high availability to
large numbers of users in real time. Examples: order entry and
processing, and financial transactions.
• OLAP (online analytical processing): extracts and views settled
(not frequently updated) data from selected points of view: for
example, seeing data on sales aggregated by sales region,
monthly sales, product ranges, etc. Data is carefully cleaned
and aggregated before being stored.
• Data stream processing (DSP) systems support processing
of data arriving in continuous streams. Processing is triggered
by certain data elements arriving on the stream, or by user
intervention, so new results can be generated as long as new
data arrives to be processed (see the sketch below).
• OLTP and OLAP are covered in more detail in Part 19.
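As an illustration only, the following minimal Python sketch mimics the stream-processing idea with a generator; the readings, threshold and function names are invented for this example.

# Toy data stream processing: results are emitted as soon as
# qualifying elements arrive, rather than after a whole batch is stored.

def temperature_stream():
    # Stand-in for a continuous source (a sensor feed, message queue, ...).
    for reading in [18.2, 19.0, 25.7, 18.9, 26.3]:
        yield reading

def alerts(stream, threshold=25.0):
    # Processing is triggered per element as it arrives on the stream.
    for value in stream:
        if value > threshold:
            yield 'ALERT: reading {} exceeds {}'.format(value, threshold)

for message in alerts(temperature_stream()):
    print(message)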
64

Analysis for a purpose


• Simple questions: ‘who?’, ‘what?’, ‘which?’, ‘where?’,
‘how many?’. With increasing information and context we
can work up to the storytelling questions of ‘why?’, ‘how?’
and ‘what if?’.
• Testing a hypothesis: Starts with a hypothesis and a
dataset. Chosen data analysis tasks are applied and their
results are validated, interpreted and applied to the
hypothesis.
• Finding a story: Start with no particular hypothesis or
questions. Analysts are ‘finding the story’ the data has to
tell. The story will develop from exploring the data until the
story emerges.
67

Issues in data analysis


• trust
• data quality and fitness for purpose
• bias
• completeness and correctness
• reproducibility and provenance.
68

Trust
• This is the issue that overarches all other data analysis
activities. Analysing data is pointless unless stakeholders can
trust the results. Trust can relate to several aspects of data
analysis:
• Trust in the data itself: in its origins, documentation, security,
curation and in the quality of its maintenance.
• Trust in the processing applied to the data: have appropriate,
proven methods been applied? How well has the code base
been verified and validated? How repeatable and robust are
the results of the processing? How transparently have the
results been reported?
• Trust in the data managers and analysts themselves: their
competence, their understanding of procedures and processes,
and of concepts of fitness for purpose, data quality, appropriate
interpretation of results and requirements.
• The remaining sections briefly cover issues that relate to trust.
69

Data quality and fitness for purpose


High-level attributes of data:
• Accuracy
• Validity
• Reliability
• Timeliness
• Relevance
• Consistency
• Completeness
• Provenance

• However, we can only really talk sensibly about these


attributes if we measure and review them in some way.
70

Bias
• Human bias
• Bias in data capture
• Bias in data cleaning
• Bias in data handling
80

Reproducibility and provenance


• Using Jupyter Notebooks can result in documented,
reproducible data processes.
• However, strict scientific method also expects fuller
documentation of the context for that processing, allowing
others to follow your reasoning as well as your code.
• So, the Notebooks can be used to capture the wider meta
descriptions and reasoning as well as simply producing
well-documented code and processes.
• Note, too, that scientific protocol requires analysts to track
and record the transformations they apply to datasets, so
that others can decide about fitness for purpose and place
trust in the data they use.
81

Reproducibility and provenance


• They might need to produce details of the software
packages, versions, hardware and operating systems
used, in case, at some future date, errors are revealed in
them.
• This applies to datasets sourced from elsewhere, as well
as to datasets made available to others. Therefore,
capturing the provenance of a dataset is of interest to
data researchers.
• In the art and antiquities world the provenance of an item
is a record of the ownership, ownership transfers,
management and restoration applied to the item. An
article with a valid and full provenance has a detailed
history to support claims of authenticity.
82

Figure 1.9 Provenance tracking in a big data workflow
• Figure 1.9 illustrates a ‘big data’ workflow from a data-centric
scientific research process. It shows how the provenance tracking
is expected to cover the whole data management pipeline. Many
scientific data analysis workflows have very similar structures.
• At step 1, data is collected, often by a third party.
• At step 2, data is transferred to the researcher as raw data. Such
datasets often arrive encrypted and must be decrypted beforehand.
This step is sometimes performed with non-standard tools on a
different machine. In many domains where the privacy of subjects is
of very high importance, researchers must follow specific protocols.
• At step 3, data is then loaded onto a file system for analysis.
• At step 4, the analysis takes place. Besides the input data, the
researcher can supply parameters to the analytic software, which
tunes the analysis. The result is transformed data. This analysis
may be performed repeatedly, while the researcher fine-tunes the
parameters. The software itself could be changed either by the
researcher or a third-party vendor.
• At step 5, the data is extracted out of a distributed cluster to be
visualised and described to produce a publication.
• At the final step, the derived data and/or publication is placed into
publicly accessible storage where it could serve as raw data for
new research.
83

4.6 Data engineering


• Data engineering has been defined as:

‘the multi-disciplinary practice of engineering computing


systems, computer software, or extracting information
partly through the analysis of data’ (Buntine, 1997).
84

4.6 Data Engineering


The tasks of Data engineers include:
• collecting data over space and time
• cleaning it of errors
• anonymising it
• filtering it
• representing it so that it can be exported from one system
and imported into others
• sorting and storing it across distributed systems
• shaping it into forms that allow it to be analysed
• visualising it.
• Must respect legal and ethical concerns.
85

4.7 Ethics and professionalism in data management and analysis
• Ethics
• Has been debated by philosophers for millennia
• Ethical behaviour can be summarised as ‘doing the right thing’:
acting in ways that individuals and society would judge to be
consistent with good values, considering the impact of
behaviour on individuals, communities, the environment, etc.
• There are so many vague or debatable terms here that it is not
surprising that there has been so much argument and
uncertainty.

• Professionalism
• Professionalism recognises that those in a professional
position are often asked to make decisions, to plan activity or
engage in debates which can impact on individuals,
communities, and society as a whole, and as such they need to
behave in an ethical way.
90

5 Summary
• If you have followed this introduction you will:
• be able to articulate the module aims
• have been introduced to a simplified model of the data analysis
pipeline, and how that relates to the data management and
analysis tasks in more complex models
• have been introduced to some of the different stakeholder
roles, for whom various issues and concerns are of greater or
lesser importance
• have gained an understanding of some of the key issues in the
management and analysis of data.
• Practically, you should have:
• installed Anaconda and run it
• worked through the bootcamp Notebooks
• optionally, created your own IPython Notebook scribble pad.
91

EXERCISES
For Part 1
92

Exercise 1.1 Exploratory


• 10 minutes
• Think of one or two datasets that you are familiar with. Try
to characterise their different components using Kitchin’s
model. Does the model seem complete to you? Are there
any additional characteristics you think should be in
there?
93

Exercise 1.2 Exploratory


10 minutes
Briefly consider the following two scenarios – in both cases consider who you think the
stakeholder groups might be and consider what expectations they may have about the data use
and management. Consider what would be the impact on each of these groups of not taking
their requirements and expectations into account.
Record this in table form:

Stakeholder Expectation Impact if not met

Scenario 1
I curate the family digital photo collection, held on a networked home PC using commercial
photo album management software. Photographs are uploaded and catalogued irregularly
(usually when memory cards are full, or after major ‘events’) on various devices and cloud
storage spaces. Backups are made to a password-protected portable hard drive.
Scenario 2
The secretary of a small ice hockey club keeps membership records, contact details, relevant
medical records, etc. in a spreadsheet. Game statistics, fixtures, results, player performance
records, etc. are held in a database, and extracts from these are published on the club website.
All communication with club members is filed in various software and email packages. Financial
records of members’ payments to the club, outgoings, etc. are held in an encrypted
spreadsheet. All club and player insurance documents and official documentation from the ice
hockey governing body are scanned and held in an encrypted folder.
94

Exercise 1.3 Self-assessment


• 5 minutes
• Can you think of types of problems that analysts might explore
where incomplete or inaccurate data from operational systems
may be acceptable?
Discussion
• In business contexts, there should be policies and processes
that cover routine analyses, and these will require that the
necessary data meets the business requirements for quality.
However, in more ad hoc analyses, perhaps using data
collected for different purposes, or exhaust data from other
processes, it will be necessary to document the reasoning
behind a decision to use it, and ensure that due attention has
been paid to legal and ethical use.
95

Exercise 1.4 Self-assessment


• 5 minutes
• Why might the following sampling processes lead to a bias?
• Students who successfully complete a module are asked to give their views on its
workload and assessment.
• A study of passenger train use and the impact of train ticket prices on passenger
numbers uses data captured on board London-bound commuter trains.
• Discussion
• It is fairly safe to assume that any student who has successfully completed a
module has had a relatively positive experience of workload and assessment, and
their views are likely to reflect this. A group of students who failed, or failed to
complete a module might harbour more negative views, but the population
described excludes these students. The sample bias is in the population selection
and its relationship to the impact of the module aspects being explored.
• Again the population selection is skewed: commuter train passengers are often
forced to travel by train irrespective of cost. This has the effect of reducing the
impact of pricing on passenger numbers. Data from other transport modes might
mitigate this bias by showing fluctuations against other forms of transport; there
may also be ways of recognising non-work-related train passengers within the
sample set: for example, including a survey question on reasons for travel. Note
that the impact of ticket prices on passenger numbers is quite different from the
passenger’s feelings about ticket prices.
96

Exercise 1.5 Self-assessment


• 5 minutes
• How might you use metadata to help you keep track of the origins of a particular
dataset in a big research data pipeline such as the one described in Figure 1.9?
• Discussion
• Metadata is descriptive data – data about the data or the processes applied to the
data.
• Individual data and result sets could be tagged with standard descriptors stating,
for example, who applied the processing and when, the source data files, the
versions of software used to process or generate the data or results, any standard
processing, preprocessing or validation checks. These could be included in files
giving a provenance history for each dataset.
• Alternatively, if the above pipeline represented a standard processing sequence
used repeatedly, then the metadata could be held at the pipeline level, with a
document or file management system relating specific files to the processing
applied to them, and the process history. A specific process could be introduced
into the pipeline for setting up and verifying a repository of metadata tagging, in
order to check-out and check-in datasets and results.
97

ACTIVITIES
For part 1
98

Activity 1.1 Practical (to be completed during this week)
• 90 minutes
• Read through the Software Guide and follow the steps for
installing the Anaconda software (30 minutes).
• Ensure that the software has installed correctly, and you
can access the module Notebooks (10 minutes).
• Work through the IPython ‘bootcamp’ Notebooks (i.e.
Notebooks 01.1 through 01.5), to familiarise yourself with
basic Python coding (40 minutes).
• Optionally – create your own IPython Notebook for use as
a working file and scribble pad (10 minutes).
99

Activity 1.2 Social


• Ongoing throughout the module – as noted in the Module
Guide you might want to find 20–30 minutes each week
for engaging with the module forums.
• Definitions
• If you want to challenge or improve the definitions given
here, or propose alternatives, (cartoon-based definitions
are also encouraged!) please do so in the module forum.
Similarly, if you discover new definitions (formal or
informal) of terms we use in the module you should
document your findings on the forum – crowdsourcing a
working definition is often more valuable than simply
being provided with a formal definition.
100
101

Activity 1.3 Practical (checkpoint)


• By the time you have reached the end of this section, you
should ensure that you have at least installed the
[Anaconda and PostgreSQL] software on your
computer, and checked the software has installed
properly (see the Software Guide).
• Once you have installed and checked the required
software, you should then be working
through the bootcamp Notebooks (if you need a refresher
on Python), and the advice on creating and using
Notebooks as scribble pads.
102

Activity 1.4 Exploratory


• 10 minutes
• Find the poem, The Blind Men and the Elephant by John
Godfrey Saxe on the web and read it.
103

Activity 1.5 Media


• 10 minutes
• Watch and listen to the following media clips, which
capture how ‘big data’ captured the popular imagination in
recent years.
• Bang Goes the Theory trailer, April 2014
• Data, Data Everywhere, BBC Radio 4, December 2013
• Bang Goes the Theory (full episode), April 2014 (optional)
104

Activity 1.6 Media


• 2 minutes
• The mythical status of Big Data is picked up on in:
• Data, Data Everywhere, BBC Radio 4, December 2013
105

Activity 1.7 Exploratory


• 15 minutes
• Read the article by Tim Harford (2014), ‘Big data: are we
making a big mistake?’, which briefly describes the tale of
the Google flu predictor and offers a general critique of
populist ideas around big data.
• What does Harford identify as the ‘four exciting claims’
made for big data. What problems does he identify for
each of them?
106

Activity 1.8 Exploratory


• 15 minutes
• Find another example describing a data life cycle or data
pipeline and compare it to the views described above.
• Does it introduce any new steps?
• What role, if any, do feedback or repeated loops play in the
process?
• If the model particularly challenges the models above, or
makes more sense to you, post a link to it in the module forum,
saying why you think it is a good example, or a good contrast
to the model shown above.
• Hint: If you’re struggling to find an alternative then the
previously referenced Kensington and Chelsea Transparency
Policy (2011) document has a workflow diagram, on page 14,
for publishing data on their Transparency Portal.
107

Activity 1.9 Practical (checkpoint)


• By the time you have reached the end of section 4.3, you
should ensure that you have at least installed and
checked the TM351 software, and worked through the
bootcamp Notebooks.
108

Activity 1.10 Exploratory


• 15 minutes
• From the Information Commissioner’s website
(https://ico.org.uk/) locate the key principles of the Data
Protection Act. Look for the definition of ‘personal’ data.
Do you process personal data?
Discussion
• Principles can be found at: Data protection principles
• Personal data definition taken from the same website:
Key definitions of the Data Protection Act
• The Data Protection Act is considered further in Part 6.
109

Activity 1.11 Self-assessment


• 5 minutes
• Think about any disastrous data losses you have
experienced, or know about. Would an enforced backup
policy have helped mitigate the problems that arose as a
result of the loss? If you have an example to share you
can post your experience on the forum (you should avoid
the use of real names of companies and individuals to
spare their blushes). It’s also valuable if you can comment
on other students’ examples, especially around how to
avoid similar losses in the future. These shared examples
help you build up a case log of shared experiences.
110

Activity 1.12 Exploratory


• 10 minutes
• As an example of a very thorough data description, skim
through the Royal Mail Programmer’s Guide to using the
Postcode Address File (PAF) intended to make PAF usable by
programmers and others outside the Post Office.
• In particular, look at the definition of a postcode on page 17
and note how, on page 21, the PAF’s relationship to other files
is recorded.
• Discussion
• The PAF is intended to support programmers. Programmers
need detailed descriptions, so it’s not surprising that this is a
very detailed guide. However, even this single document
describes postcodes at different levels of detail, depending on
context. Choosing the right level of detail for the audience can
be challenging.
111

Activity 1.13 Exploratory


• 5 minutes
• Have a brief look at the range of cautionary examples of
spurious correlation at tylervigen.com.
• Discussion
• These should serve as stark reminders that if the analysis
says one thing, and common sense says another, then go
back and check your processing before preparing to tell
the world.
112

Activity 1.14 Practical (checkpoint)


• At this point in the module you should be confident that
you have installed the TM351 software
correctly, you have worked through the bootcamp
Notebooks (if necessary) and are familiar with the
features of the Notebooks as a note-taking and scribble
pad space.
• The module will make use of the
installed software extensively over the coming weeks – so
it is important that you begin Part 2 with a working
environment installed.
113

PART 2
Acquiring and representing data
116

Workload
• This week you will be working through the module content, reading and
using standards documents, and doing practical work importing data into
IPython and OpenRefine.
• You will be spending about half your study time this week on practical
activities and exercises. Most of the practical activities occur towards the
end of the reading material, so ensure you work through the early
material quickly to leave sufficient time for the practical work.
• During this part of the module you will work through five Notebooks,
looking at Python’s pandas and developing skills in reading, writing and
manipulating content in different file formats.
• Activity 2.2 uses 02.2.0 Data file formats – file encodings (10 minutes).
• Activity 2.7 uses 02.1 Pandas DataFrames (60 minutes).
• Activity 2.10 uses 02.2.1 Data file formats – CSV (20 minutes).
• Activity 2.11 uses 02.2.2 Data file formats – JSON (20 minutes).
• Activity 2.14 (optional) uses 02.2.3 Data file formats – other (20 minutes).
• In addition there are two screencasts in Activity 2.12 (30 minutes), which
give a short introduction to the OpenRefine tool.
117

Acquiring and representing data


• Parts 2-5 cover the 4 stages of the pipeline.

Figure 2.1 A data analysis pipeline


• Two simple means of representing complex, structured data:
• the table
• the document

• numerous problems that can arise when trying to represent


data that exists ‘out here’, ‘in the world’, ‘in there’, on a
computer. This may require difficult decisions.
119

2.1 Data ‘out here’ and ‘in there’

Figure 2.2 Data ‘out here’ and data ‘in there’


• Note that the data, as represented, will generally have to
make its way out into the world again, either to be passed
to another system (probably altered in some way), or
visualised, or in some other way reported on.
120

2.2 In the beginning was the bit: data and data types
• Ex: Java data type declaration:
float mySalary = 99000.00;

• Typing is useful in a programming language to:

1. allocate an appropriate amount of memory for the values
2. constrain the legal operations that may be applied to a variable

• For example, the statements


float mySalary = 99000.00;
mySalary = mySalary * “Thomas”;

• should result in a compilation error.


121

2.2 In the beginning was the bit: data and data types
• Operators may be overloaded.
• try out the following in your Jupyter scribble pad:
a = 1 + 2
b = 1.1 + 2
c = 2 + 1.1
d = '1' + '2'
print(a, b, c, d)
• in Python 3 you will get
3 3.1 3.1 12

• The result is returned in the more expressive data type

• e = 1 + '2' results in an error: unsupported
operand type(s) for +: 'int' and 'str'.
122

2.2 In the beginning was the bit: data and data types
• Every typed programming language supplies a different set of
atomic primitive types.
• Java offers:
• boolean (1 bit of information)
• byte (1-byte signed)
• char (2-byte unsigned)
• short (2-byte signed)
• int (4-byte signed)
• double (8-byte floating point).
• Python offers a richer set of primitive types, including complex
numbers and various collection types.
• There are no inherently atomic types (except the bit).
• The atomic types offered by any programming language are all
conventions, built on top of the bit and the byte.
123

2.3 Character encodings and their semantics
• To represent a book on the computer, simply represent each
character in it as one or more bytes. Problems:
• no obvious rules on how to create the stream of bytes.
• once in the digital realm the semantics of the original words in the
book have been lost
• Character encodings define the mappings between the digital
representation of a character and the character itself.
• ASCII is a 7-bit scheme that can encode 128 different values (0–127)
• Unicode: originally a 16-bit encoding, it can now accommodate up to
1,114,112 distinct code points from many languages, each requiring up to
21 bits. Each character is represented by a unique integer.
• UTF-8 can encode all 1,112,064 valid code
points in Unicode using one to four 8-bit bytes. The name is derived
from Unicode Transformation Format – 8-bit.
• It is backward compatible with ASCII. Code points with lower numerical
values, which occur more frequently, are encoded using fewer bytes.
• It has become the dominant encoding on the web.
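A small, hedged illustration in Python (the characters below are arbitrary examples, not taken from the module materials):

# ASCII characters need a single byte under UTF-8; other characters need more.
for ch in ['A', 'é', '€']:
    encoded = ch.encode('utf-8')
    print(ch, '->', list(encoded), '({} bytes)'.format(len(encoded)))

# Decoding with the wrong encoding keeps the bytes but loses the semantics.
data = 'café'.encode('utf-8')
print(data.decode('utf-8'))    # café
print(data.decode('latin-1'))  # cafÃ© – same bytes, wrong interpretation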
124

Unicode
125

Common encoding for data in different formats
• It is essential to be able to identify what character
encoding has been used to represent the data elements
so that they can be appropriately decoded. This becomes
especially problematic when data is being drawn in from
several streams using different encodings, as illustrated in
Figure 2.3.

Figure 2.3 Producing a common encoding for data in different formats
128

The Likert scale


• This widely used scale in social science research can be
used to map qualitative responses onto a quantitative
scale, typically an ordinal scale where only the rank order
is important.
• For example, a five-point scale for assessing the extent to
which survey respondents agree or disagree with a
particular statement would use the values: Strongly
disagree, Disagree, Neither agree nor disagree, Agree,
Strongly agree.
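A minimal pandas sketch of holding such responses as an ordered categorical, which preserves the rank order without pretending the five points are true numbers; the response values below are invented, not part of the module materials:

import pandas as pd

levels = ['Strongly disagree', 'Disagree', 'Neither agree nor disagree',
          'Agree', 'Strongly agree']
responses = pd.Series(['Agree', 'Strongly agree', 'Disagree', 'Agree'])

# An ordered categorical records the rank order of the five scale points.
coded = pd.Series(pd.Categorical(responses, categories=levels, ordered=True))
print(coded.value_counts(sort=False))   # counts, in scale order
print(coded.min())                      # comparisons respect the ordering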
129

Numbers for measurements – Stevens’ NOIR

• In a 1946 paper in Science, psychologist Stanley Smith
Stevens (Stevens, 1946) identified four classes (NOIR):

• Nominal
• Ordinal
• Interval
• Ratio
130

Nominal
• numbers are used only as labels, in which case words or
letters would serve equally as well
• Items have no inherent order, so the only legitimate
operations on these are a test for equality, or counting
the number of instances of the members of each category,
or finding the statistical mode (that is, most common
element).
• Example: gender coded as numbers (1 = M, 2 = F)
131

Ordinal
• numbers have a rank ordering, so in addition to an
equality test, it is legitimate to find the median (middle-
ranked value) as well as the mode, and order items into
percentiles.
• Example 1: a Likert scale

Excellent = 1, Very good = 2, Good = 3, Acceptable = 4, Poor = 5

• Example 2: letter grades

A = 4, B = 3, C+ = 2.5, C = 2, D = 1.5, F = 0
132

Interval
• numbers on an interval scale can be ranked, and we
know how far apart things are, such as on a temperature
scale, but without a specific origin being stated.
• Legitimate operations thus include finding the mean
(average) value, standard deviation and correlations.
• Example 1: temperature in Celsius or Fahrenheit scale
• Example 2: Date when measured from a particular epoch
• Example 3: Interest rates
• Example 4: Locations in Cartesian coordinates
• Example 5: Direction in degrees measured from true or
magnetic north
133

Ratio
• Numbers are on an interval scale, but with a meaningful,
known, fixed origin, such as the Kelvin temperature
scale, whose origin is absolute zero.
• Example 1: Temperature in Kelvin Scale
• Example 2: Mass of an object
• Example 3: Height of an object
• Example 4: Force on an object
• Example 5: The magnetic or the electrical field value
136

2.5 Complex data


• The previous discussion did not consider the second and third
problems of data representation:
• maintaining relationships between attributes and
• preserving the semantics of the things ‘out here’ once inside the
digital realm.
137

2.5 Complex data - Examples


• Example 1: a digital book: given a suitable, stated character
encoding, the semantics of the book’s words and characters
are probably preserved, but most books possess internal
structure, such as chapters, sections and paragraphs as well.
• Example 2: an experiment, simply finding a correct data
encoding is not enough – measurements and their timestamps
have to be tied together, and the semantic problem remains
unsolved.
• Example 3: the car dealership presents the same difficulties.
A naive approach to representing such categorical data might
be to represent each category as a number. However:
• encoding the cars as integers would allow us to do meaningless
statistical operations on them,
• any trace of the semantics of each car type is lost.
• We must be able to represent the way in which the attributes of an
object are tied to that object.
138

2.5 Complex data


• So, there are two general problems here:
• to find a way to tie data values – measurements, timestamps,
objects, attributes – together in appropriate structures
• to make their semantics identifiable, e.g. to show which values are
timestamps and which are cars, and that these values actually
represent timestamps and cars.
139

Complex data: Representing dates


• in Python we have the built-in date type, and given a particular
date specified as a date type, we can obtain the ISO-formatted
version of the date, or write out the date in a specific format:

from datetime import date

d = date(2013, 1, 15)
d.isoformat()                   # '2013-01-15'
d.strftime('%d %B, %Y (%a)')    # '15 January, 2013 (Tue)'

• allowing us to express the date in different ways, or different


facts about it, all from the same underlying object.
140

Grouping related data


• Numerous facilities to group related values together are
also usually found in most programming languages, in the
form of complex data types such as
• lists
• dictionaries
• sets.

As regards problem (2), rather less dependably, the


semantics of values, or collections of values, can be
conveyed through appropriate variable and collection
names.
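A minimal Python sketch of the idea, with invented measurement values:

# A dictionary ties each attribute value to a named key, so some of the
# semantics travel with the data rather than living only in variable names.
reading = {'timestamp': '2013-01-15T09:30:00',
           'sensor': 'greenhouse-1',
           'temperature_c': 21.4}

# A list groups repeated readings while keeping each one intact.
readings = [reading,
            {'timestamp': '2013-01-15T10:30:00',
             'sensor': 'greenhouse-1',
             'temperature_c': 22.1}]

for r in readings:
    print(r['timestamp'], r['temperature_c'])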
142

3 Representing structured data: tables


• the table is a very common schema for representing structured data
• According to the W3C draft Model for Tabular Data and Metadata on
the Web (W3C, 2015),
• Tabular data is data that is structured into rows,
• each of which contains information about some thing.
• Each row contains the same number of cells (although some of these
cells may be empty), which provide values of properties of the thing
described by the row.
• In tabular data, cells within the same column provide values for the
same property of the thing described by the particular row.
• This is what differentiates tabular data from other line-oriented
formats.
• According to the W3C model, then, a table must contain at least one
column and at least one row.
• Spreadsheets use worksheets of two-dimensional, cell-based tabular
displays.
146

3 Representing structured data: tables

Table 2.5 A table with a metadata title


• the standard provides for the specification of the particular data
type that each cell is expected to hold.
• So, in our example in Table 2.5, we might specify that each cell
in the ‘Make’ and ‘Model’ columns must contain a string, and
that the ‘Mileage’ column must contain an integer.
• we might wish to construct tables to represent more complex
forms of structuring. In the table below, a hierarchical structure
is represented.
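A hedged pandas sketch of the same idea – each DataFrame column carries a data type, so a ‘Mileage’ column can be constrained to integers; the values below are illustrative, not the module’s Table 2.5:

import pandas as pd

cars = pd.DataFrame({'Make': ['Skoda', 'Ferrari'],
                     'Model': ['Fabia', 'Enzo'],
                     'Mileage': [61000, 12500]})

# Each column has its own dtype: object (strings) for Make and Model,
# an integer type for Mileage.
print(cars.dtypes)

# Enforcing the expected type fails loudly if a value does not conform.
cars['Mileage'] = cars['Mileage'].astype(int)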
147

3 Representing structured data: tables

Table 2.6 Metadata used to group columns forming a hierarchy

• In some applications, such as spreadsheets, it is possible


to ‘merge’ these cells to span several columns.
148

3.1 Logical presentations and physical representations
• From the point of view of data managers and other users
of a tool (a DBMS or spreadsheet software), the data is
presented in the form of tables.
• However, this says nothing about how the data is actually
represented on the physical machine.
• Data that appears as a set of tables may well be stored in
the form of indexed files, or as arrays, or whatever.
149

3.1 Logical presentations and physical representations

Figure 2.4 The logical and physical representation of real-world objects
151

3.2 Representing tables in web pages


• Tabular data can be represented in two forms within a web
page, both of which allow the browser and its plug-ins to
handle the interaction between the logical and physical aspects
of the data. The two forms are:
• As an (X)HTML <table> element (W3C, 2004). Some
manipulation of the table data, such as searching and sorting,
is made possible through numerous plug-ins, such as
DataTables.net, a plug-in for the jQuery library.
• As a JavaScript data object for manipulation. JavaScript
toolkits from Yahoo (2015) and Google (2015) now include data
table objects that support the manipulation and display of table
data in web pages.
• These toolkits are generally tightly integrated within a wider set
of tools. For example, in the case of Google’s DataTable and
DataView, data stored in JavaScript objects can be directly
pulled in to Google chart components.
152

4 Representing structured data: documents
• Data scientists like to refer to a document as meaning
any file or representation that embodies a particular data
record.
• Books are usually divided into chapters, or sections; they
may contain illustrations, footnotes, endnotes, tables of
contents, indexes and special headings. They may
employ a single typeface, or a variety of them. We can
look on all of these first of all as structural data, none of
which could be captured in a simple sequence of Unicode
characters.
• So, how can the structure of our data be captured?
• The most widespread way of capturing document
structure is markup.
153

4.1 Basic markup


• The best-known markup language is the Hypertext Markup Language (HTML).
• Figure 2.5 shows a fragment of a (terrific) book, represented in HTML 4.0 markup.

• The browser mediates between the logical and physical representations
• you can switch between them using ‘View Source’
• note how different structural items are represented within the document (in red):
• a title
• headings
• tables
• You can specify how the text within these tags is presented by using a separate style sheet
(you can see the stylesheet reference in the HTML fragment that follows).
• The tags in some sense capture the semantics which can also be highlighted. For example,
• Act I of the play is specified as being a heading at level 3 (h3)
• the style sheet will specify the font, emboldening, indentation, etc. that applies to all headings
at this level.
• The tool that mediates between the user and document (presumably a browser) will then
apply this styling as required.
• In some cases, however, the styling is expressed directly: <i> </i> (poor practice).
154

4.1 Basic markup


<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0
Transitional//EN"
"http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head>
<title>Julius Caesar: Entire Play </title>
<meta http-equiv="Content-Type" content="text/html;
charset=iso-8859-1">
<LINK rel="stylesheet" type="text/css" media="screen"
href="/shake.css">
</HEAD>
<body bgcolor="#ffffff" text="#000000">
155

Example HTML 4.0


<table width="100%" bgcolor="#CCF6F6">
<tr><td class="play" align="center">The Life and Death of
Julius Caesar</td></tr>
<tr><td class="nav" align="center">
<a href="/Shakespeare">Shakespeare homepage</A>
| <A href="/julius_caeser/">Julius Caesar</A>
| Entire play
</td></tr>
</table>
156

Example HTML 4.0


<H3> ACT I </h3>
<h3> SCENE I. Rome. A street. </h3>
<blockquote>
<i>Enter FLAVIUS, MARULLUS, and certain Commoners</i>
</blockquote>
157

Example HTML 4.0


<A NAME=speech1><b>FLAVIUS</b></a>
<blockquote>
<A NAME=1.1.1>Hence! home, you idle creatures get you
home:</A><br>
<A NAME=1.1.2>Is this a holiday? what! know you not,</A><br>
<A NAME=1.1.3>Being mechanical, you ought not walk</A><br>
<A NAME=1.1.4>Upon a labouring day without the sign</A><br>
<A NAME=1.1.5>Of your profession? Speak, what trade art
thou?</A><br>
</blockquote>

Figure 2.5 Text showing HTML tagging of the opening lines of the play Julius Caesar
158

4.2 When basic markup is not sufficient
• Example 1: newspaper reports consist of an identifiable headline,
a byline that identifies the reporter, and an initial ‘lede’
paragraph; HTML handles this fine.
• Example 2: TM351 module materials contain a much richer
combination of specialised elements (headings,
examples, etc.)
• To preserve the semantics of each element we need a tool
like XML
159

4.3 Extended markup


• XML allows user-defined tags
• XML can represent arbitrary data structures
• XML documents are both human and machine readable.
160

Example XML (cars)


<?xml version="1.0"?>
<!DOCTYPE CARS-IN-STOCK SYSTEM "cars.dtd"> <!-- XML schema defines allowable tags and tag
structures – allows checking the XML file against the schema -->
<?xml-stylesheet type="text/css" href="xmlcarstyle.css"?>
<CARS-IN-STOCK> <!-- User-defined tag -->
<TITLE>Cars currently in stock</TITLE>
<CAR> <!-- User-defined tag -->
<MAKE>Skoda</MAKE> <!-- User-defined tag -->
<MODEL>Fabia</MODEL> <!-- User-defined tag -->
<REGISTRATION>OK15NNB</REGISTRATION> <!-- User-defined tag -->
<PRICE> 275,000</PRICE> <!-- User-defined tag -->
</CAR>
<CAR>
<MAKE>Ferrari</MAKE>
<REGISTRATION>LR56WRU</REGISTRATION>
<PRICE> 9,871</PRICE>
<MODEL>Enzo</MODEL>
</CAR>
</CARS-IN-STOCK>
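For illustration only, a short Python sketch using the standard-library ElementTree module to read a simplified version of the document above (declarations and comments omitted):

import xml.etree.ElementTree as ET

xml_text = """<CARS-IN-STOCK>
  <CAR>
    <MAKE>Skoda</MAKE>
    <MODEL>Fabia</MODEL>
    <REGISTRATION>OK15NNB</REGISTRATION>
    <PRICE>275,000</PRICE>
  </CAR>
</CARS-IN-STOCK>"""

root = ET.fromstring(xml_text)
for car in root.findall('CAR'):
    # The user-defined tags keep the semantics of each value explicit.
    print(car.find('MAKE').text, car.find('MODEL').text,
          car.find('REGISTRATION').text)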
161

Namespaces
• XML from different schemas can be combined in the same
document
• namespaces allow for tags to be qualified and thus given
the meaning associated with the right schema. Ex:

<root xmlns:foaf="http://xmlns.com/foaf/0.1/"
xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos">
<foaf:Person>
<foaf:name>Tim Berners-Lee</foaf:name>
</foaf:Person>
</root>
• more about XML and namespaces in Section 5.6.
162

Summary
• data managers will be concerned to:
• find suitable forms of representation for basic data
elements such as numbers, names, dates, etc.
• represent the relationships between these data elements
• preserve the semantics of the ‘things out here’ they
represent.
• managers will be required to share data with others.
• they must also consider how their captured data may be
packaged so that it can be transported reliably between
different applications and stakeholders.
163

5 Transporting data
• Q: what technologies can package captured data for
transfer, while preserving the carefully chosen
representation choices and semantic information?
• XML is popular as a message passing format in the delivery of web
services.
Also,
• CSV – comma-separated values file (sometimes referred to as a
comma-separated variable file).
• JSON – JavaScript Object Notation.
165

Basic CSV

Table 2.7 Data from a car dealership


• Rows are separated by line breaks.
• Fields are separated by commas.
• Each line contains the same number of fields, in the same
order
• the first row in CSV:
• 10/3/15, OK15NNB, Mr. Graham Kennedy, 15, Acacia
Avenue, Leicester, UK, "LE21 4NB", £274,350
But there are immediate problems with this:
166

Problems

• The first row, represented in CSV as:

10/3/15, OK15NNB, Mr. Graham Kennedy, 15, Acacia Avenue,
Leicester, UK, "LE21 4NB", £274,350
Problem 1 – lost semantic information: allow an optional header row,
like this:
Date, Reg, Customer name, Customer address, Postcode, Paid
10/3/15, OK15NNB, Mr. Graham Kennedy, 15, Acacia Avenue,
Leicester, UK, "LE21 4NB", £274,350
Problem 2 – commas within text fields: surround the field with double
quotes, like this:
10/3/15, OK15NNB, "Mr. Graham Kennedy, 15, Acacia
Avenue, Leicester, UK", "LE21 4NB", "£274,350"
167

Problems

• 10/3/15, OK15NNB, Mr. Graham Kennedy, 15, Acacia Avenue,
Leicester, UK, "LE21 4NB", £274,350
• The postcode field already contains double quotes: escape them by
doubling them, like this:
10/3/15, OK15NNB, "Mr. Graham Kennedy, 15, Acacia Avenue,
Leicester, UK", """LE21 4NB""", "£274,350"
• Reading values that are spread over several lines as a single field:
there must be an agreed line delimiter.
• Paid is represented as a string, but it should be an integer: CSV has no
facilities to check or enforce typing.
• Also note that all the elements of a column are expected to be of the
same type, rather than a heterogeneous list of items of differing data
types.
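A short Python sketch of how the standard csv module applies these quoting rules automatically when a row is written and read back (using the slide's example values):

import csv, io

row = ['10/3/15', 'OK15NNB',
       'Mr. Graham Kennedy, 15, Acacia Avenue, Leicester, UK',
       '"LE21 4NB"', '£274,350']

buffer = io.StringIO()
csv.writer(buffer).writerow(row)
# Fields containing commas or quotes are quoted, and embedded
# double quotes are doubled, as the rules above require.
print(buffer.getvalue())

buffer.seek(0)
print(next(csv.reader(buffer)))   # round-trips back to the original fields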
168

CSV dialects
• CSV dialects differences:
• different line terminators
• the final line of the file may or may not employ the line
terminator
• leading and trailing white space may or may not be
ignored.
• different separator characters for header rows.
• different escaping characters.
• some dialects will always quote every field.
169

CSV dialects
• Alternative dialects of CSV can be specified using the CSV
Dialect Description Format (CSVDDF) (Data Protocols, 2015a).
For example:
{
"csvddfVersion": 1.0,
"dialect": {
"delimiter": ",",
"doubleQuote": false,
"lineTerminator": "\r\n",
"quoteChar": "\"",
"skipInitialSpace": false
}
}
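For illustration, the corresponding settings can be registered as a named dialect with Python's csv module and used when reading; mapping CSVDDF keys onto csv module options like this is our own assumption, not part of the CSVDDF specification:

import csv, io

# Roughly mirror the CSVDDF settings shown above.
csv.register_dialect('csvddf_example',
                     delimiter=',',
                     doublequote=False,
                     escapechar='\\',   # added so quotes can still be escaped when doubling is off
                     lineterminator='\r\n',
                     quotechar='"',
                     skipinitialspace=False)

sample = 'Make,Model\r\nSkoda,Fabia\r\n'
for row in csv.reader(io.StringIO(sample), dialect='csvddf_example'):
    print(row)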
• In a brief review of the territory, as well as a community call to
arms, the Open Data Institute’s Jeni Tennison called for 2014 to
be ‘The Year of CSV’ (Tennison, 2014).
172

JSON - example
• Valid JSON, taken from RFC 7159.
• It embodies an attribute/value representation style.
• Note that the object is contained within braces, i.e. ‘{’ ‘}’:
{
"Image": {
"Width": 800,
"Height": 600,
"Title": "View from 15th Floor",
"Thumbnail": {
"Url": "http://www.example.com/image/481989943",
"Height": 125,
"Width": 100
},
"Animated" : false,
"IDs": [116, 943, 234, 38793]
}
}
• Figure 2.6 JSON representation of an image object
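A quick Python sketch of reading that object with the standard json library (the attribute names are those of the RFC example above):

import json

text = '''{
  "Image": {
    "Width": 800, "Height": 600,
    "Title": "View from 15th Floor",
    "Thumbnail": {"Url": "http://www.example.com/image/481989943",
                  "Height": 125, "Width": 100},
    "Animated": false,
    "IDs": [116, 943, 234, 38793]
  }
}'''

data = json.loads(text)           # JSON objects become Python dicts
print(data['Image']['Title'])     # View from 15th Floor
print(data['Image']['IDs'][0])    # nested lists and objects are preserved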
173

XML equivalent
<image>
<width>800</width> <height>600</height>
<title>View from 15th Floor</title>
<thumbnail>
<url>http://www.example.com/image/481989943</url>
<height>125</height>
<width>100</width>
</thumbnail>
<animated>false</animated>
<IDs>
<id>116</id>
<id>943</id>
<id>234</id>
<id>38793</id>
</IDs>
</image>
Figure 2.7 XML representation of an image object, corresponding
to Figure 2.6
174

5.3 Introducing OpenRefine


• a tool for previewing and cleaning datasets.
• Can import and preview:
• CSV files (and other CSV-like files with arbitrary delimiters)
• Excel spreadsheet files
• XML files
• JSON data.
• Can load data from a local file or from one or more web
addresses/URLs, or from text cut and pasted directly.
175

5.4 JSON and CSV together


• The most significant problem with CSV: the semantics of column
definitions are not declared anywhere.
• There is no requirement that column headers are
specified within a CSV file
• No clear and conventionally encoded way of describing
either their data type or any other metadata.
176

Figure 2.8 An example JSON table schema for car data
• The JSON Table Schema (Data Protocols, 2015b)
provides a candidate solution to the problem of
associating column type metadata with the columns in a
CSV file. Going back to our car dealership example, a
table schema might be something like:
{
"name": "CARS-IN-STOCK",
"title": "Vehicles currently for sale",
# fields is an ordered list of descriptors
# one for each column in the table
"fields": [
177

Figure 2.8 An example JSON table schema for car data
# a field-descriptor
{
"name": "MAKE",
"title": "The manufacturer of a car currently held in stock",
"type": "String",
"description": "A name from the approved list of makers"
...
},
178

Figure 2.8 An example JSON table schema for car data
# a field-descriptor
{
"name": "MODEL",
"title": "The model of a car currently held in stock",
"type": "String",
"description": "A model from this maker’s list"
...
},
... more field descriptors
]
}
179

CSV and JSON Together


• Note how the columns in the table are held together in a list
structure, delineated by square brackets.
• The Tabular Data Package (Data Protocols, 2014), a
candidate lightweight standard, extends this idea by packaging
one or more CSV files with a metadata file (datapackage.json)
that describes their contents. Key columns can also be defined
on which table indices can be built, or that can be used to link
the contents of different tables together.
• The whole data package – which would be represented on a
computer as a folder or directory containing one or more CSV
files and the JSON metadata file – could be transported as a
single compressed file. The specification does not currently
specify a compression type, or file suffix, for identifying such
data packages.
180

5.5 Checking validity


• You can check whether a text file is a valid JSON file by
using the JSONLint service (Dary, n.d.).
• Similarly CSV files and CSV files qualified with a schema
description based on the JSON Table Schema can also
be checked for validity using the CSV Lint service
provided by the Open Data Institute (Open Data Institute,
n.d.).
• Note: Even if a text file has the right format and passes
the relevant lint checker, the software you later use to
read the file may not handle the file correctly – always
check the file with the software you intend to use.
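Alongside the online linters, a quick local syntax check can be made with Python's json module; the file name below is hypothetical, and this checks syntax only, not conformance to a table schema:

import json

def is_valid_json(path):
    # Returns True if the file parses as JSON; otherwise reports why not.
    try:
        with open(path, encoding='utf-8') as f:
            json.load(f)
        return True
    except (json.JSONDecodeError, UnicodeDecodeError) as err:
        print('Not valid JSON:', err)
        return False

# Example call (hypothetical file name):
# is_valid_json('datapackage.json')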
181

5.6 Other data representation and packaging conventions
• other representation and packaging standards exist:
• those specific to particular applications (such as statistics or
mathematical computing applications).
• Excel spreadsheets (XLS & XLSX)
• although they suffer many of the same problems as CSV in their
informal structuring (for example, tabular data and informal comments in
the same sheets; multiple sheets in the same spreadsheet).
• DBMSs can export data in ‘standardised’ ways, such as the SQL
statements required to recreate the database.
• However, these interchange formats may be brittle when data is
transported between different DBMS, or even between different versions
of the same DBMS.
183

6 Summary
• In this part, you have learned about:
• some of the different ways in which data elements can be represented, and the
perils inherent in doing so
• two formats for representing complex data in such a way as to preserve its
structure and semantics
• two formats for packaging data so that it can be shared in a convenient way.
• Practically you will have worked with:
• the Python pandas library to represent structured data, using Series and
DataFrame objects
• the Python libraries to import and export CSV and JSON datasets
• OpenRefine as a data acquisition and parsing tool.
• You have also started to get your hands dirty with some real datasets, using
OpenRefine and pandas to open, and explore the properties of, a variety of data-
containing documents.
• In the next part of the module you will have an opportunity to work with a wider
range of real world datasets. Acquiring and representing data is one thing, but in
many cases it may not be clean and well behaved. Before it can be properly
manipulated it may well have to be given a good cleaning first.
184

ACTIVITIES
For part 2
185

Activity 2.1 Self-assessment


• 10 minutes
• Along with the simple ASCII coding scheme, the UTF-8 coding scheme
(Unicode Transformation Format, 8-bit encoded) is another widely used
character encoding.
• According to the Python Unicode HOWTO what rule, or rules, allow string
elements to be mapped from their Unicode values to equivalent UTF-8
values?
• Discussion
• UTF-8 uses the following rules:
• If the code point is less than 128, it’s represented by the corresponding
byte value.
• If the code point is 128 or greater, it’s turned into a sequence of two,
three, or four bytes, where each byte of the sequence is between 128
and 255.
• (Note: the code point is the integer value of the character’s representation
in Unicode, here expressed in decimal (base 10).)
• Here are three examples – one with code point less than 128, and two
with code points above 128.
186

Activity 2.1 Self-assessment


Three examples:
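The original slide shows the three examples as an image; as an illustrative substitute (the characters chosen here are ours, not necessarily those on the slide):

# One code point below 128 and two above, with their UTF-8 byte sequences.
for ch in ['Z', 'ñ', '中']:
    print(ch, 'code point', ord(ch), '-> bytes', list(ch.encode('utf-8')))
# Z  code point 90    -> bytes [90]             (one byte, same as ASCII)
# ñ  code point 241   -> bytes [195, 177]       (two bytes, each 128–255)
# 中 code point 20013 -> bytes [228, 184, 173]  (three bytes, each 128–255)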
187

Activity 2.2 Notebook


• 10 minutes
• Work through Notebook 02.2.0 Data file formats - file
encodings, which relates to this topic.
188

Activity 2.3 Exploratory (optional)


• 20 minutes
• Read through Nicholas R. Chrisman’s critique of Stevens’
NOIR (Chrisman, 1995) Beyond Stevens: A revised approach
to measurement for geographic information, up to, but not
including, the section ‘A larger framework for measurement’
(p. 277).
• While reading, consider the following questions:
• What distinguished the representational view of measurement
from the extensive view?
• What did the change in perspective mean for the development
of social sciences and how do you think this might extend to
data collection and analysis in general?
• Chrisman claims that ‘Stevens’ four “scales” are usually
presented as a complete set, but they are far from exhaustive’
(p. 273). What additional scales does he identify?
189

Activity 2.4 Social


• 10 minutes and ongoing
• Read over the following article from BBC News, which
describes several examples of what happens when the
wrong form of representation is used:
• Great miscalculations: The French railway error and 10
others
• If you know of any other examples, or find others while
studying this module, post them to the forum.
190

Activity 2.5 Exploratory


• 10 minutes
• Read quickly through Section 4 of the W3C model (W3C,
2015), and then consider the following questions:
• What are the essential properties of a table?
• What are the essential properties of a table cell?
• Hint: When reading Section 4, the essential properties are
tagged ‘must’.
• Discussion
• A table must have one or more columns, the order of which is
significant, and a list of rows, the order of which is significant. A
table cell within a particular table is associated with a particular
row and a particular column. Its contents are described in
terms of a literal string value and a semantic value.
191

Activity 2.6 Self-assessment


• 5 minutes
• Is this form of table annotation (used in Table 2.6)
allowable within the W3C standard?
• Discussion
• At the time of writing, this form of annotation doesn’t seem
to be available within the W3C model.
192

Activity 2.7 Notebook


• 60 minutes
• The pandas package contains many facilities for handling
tabular data in the form of DataFrames. Work through
Notebook 02.1 Pandas DataFrames to see how to
construct and perform some basic manipulation over
DataFrames – these will be used extensively throughout
the rest of the module.
193

Activity 2.8 Practical


• 20 minutes
• At the time of writing, the default reference document against which to define a CSV
generator or parser is information memo RFC 4180 – Common Format and MIME Type for
Comma-Separated Values (CSV) Files.
• Note: RFCs – or Request for Comments – are a collection of documents which describe
various actual and suggested practices relevant to the internet. Most RFCs deal with
technical arrangements and conventions, often called protocols, capturing agreements on
the format of the data and related issues (Korpela, 2004).
• By referring to the RFC, how would you change the following line to encode it as a single
field (one string value, complete with commas and quotes)?
• "Meadowlands", 23, Madeup Roundabout, Applechester, PX12 4AR, UK
• Answer
• Paragraphs 5 and 6 of Section 2 of the RFC say that if you want to treat the comma and
quotes as part of the string you need to surround the field in double quotes. The double
quotes around Meadowlands then need to be doubled up.
• Following these requirements we get:
• """Meadowlands"", 23, Madeup Roundabout, Applechester, PX12 4AR, UK"
• Discussion
• The CSV format, while common, is not interpreted consistently by all software that reads CSV files – it is therefore advisable to check the software you use against the CSV files you intend to work with, to ensure that the right fields are formed when the file is ingested.
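• For illustration only, the following sketch uses Python's built-in csv module to show that the standard library applies the same quoting rules as the RFC (the output shown in the comment matches the encoding given in the answer above):

import csv
import io

# The address treated as ONE field, complete with commas and double quotes
field = '"Meadowlands", 23, Madeup Roundabout, Applechester, PX12 4AR, UK'

# Writing the row: csv.writer applies the RFC 4180-style quoting for us
buffer = io.StringIO()
csv.writer(buffer).writerow([field])
print(buffer.getvalue().strip())
# Prints: """Meadowlands"", 23, Madeup Roundabout, Applechester, PX12 4AR, UK"

# Reading it back recovers the original single field intact
row = next(csv.reader(io.StringIO(buffer.getvalue())))
print(row[0] == field)   # True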
194
Activity 2.9 Exploratory
• 20 minutes
• Read Jeni Tennison’s blog post now – The Year of CSV – bearing in
mind the following questions:
• Why is data published in the CSV format of interest at all?
• Why is CSV insufficient as a format for publishing data on the web?
• Discussion
• Tennison’s post describes how tabular data is in widespread use in
many applications on the web, and is the dominant way of
representing open public data. Tabular data tends to be published in a
variety of forms, including Excel spreadsheet documents and as CSV,
but not in any standard or coherent way. While Excel allows for the
inclusion of descriptive, contextual information within a spreadsheet
itself, this information is often provided in an ad hoc and inconsistent
manner. On the other hand, the simpler CSV document is not as
expressive and has no ‘clean’ way of incorporating this contextual
information.
195
Activity 2.10 Notebook
• 20 minutes
• You can now work through Notebook 02.2.1 Data file
formats - CSV.
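• As a taster – not taken from the notebook itself, with made-up column names and values – the sketch below shows one way of loading CSV text into a pandas DataFrame:

import io
import pandas as pd

# A small CSV given inline for illustration; normally you would pass a filename or URL
csv_text = """site,reading_count,mean_temp
Applechester,12,18.4
Madeupton,7,17.9
"""

df = pd.read_csv(io.StringIO(csv_text))
print(df)
print(df.dtypes)   # pandas infers integer and float types for the numeric columns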
196
Activity 2.11 Notebook
• 20 minutes
• You can now work through Notebook 02.2.2 Data file
formats - JSON.
• Specialised representations based on JSON may also be available; for example, the JSON-stat statistical table schema (Badosa, 2013) as used by the UK’s Office for National Statistics (ONS) in their API (ONS, n.d.).
• Since its standardisation in 2006 (as RFC 4627), JSON has come to replace XML in many web applications. There has been considerable debate about which is the better format, which you might care to follow if you have time (Crockford, 2006; Marinescu and Tilkov, 2006).
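• As a brief, standalone illustration – not taken from the notebook, and using a made-up record – Python’s built-in json module can serialise and parse JSON directly:

import json

# A small, made-up record (illustrative values only)
record = {
    'programme': 'TM351',
    'topics': ['CSV', 'JSON', 'XML'],
    'students': 120
}

# Serialise to a JSON string...
text = json.dumps(record, indent=2)
print(text)

# ...and parse it back into Python objects
data = json.loads(text)
print(data['topics'][1])   # JSON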
197
Activity 2.12 Practical
• 30 minutes
• Here is a quick preview of how we can use OpenRefine to load in a
dataset.
• Watch video: Loading a dataset into OpenRefine
• Instructions on how to run OpenRefine appear in the Software Guide.
• Now open the following files in a text editor and observe their formatting.
Then see if you can preview them in a tabular format using OpenRefine.
• https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/85792/Spend-Transactions-with-descriptions-CO-04-April-2010.csv_.csv
• http://www.timeshighereducation.co.uk/Journals/THE/THE/24_January_2013/attachments/UCAS%20acceptance%20figures.xlsx
• http://www.bbc.co.uk/programmes/p002w6r2/episodes/player.json
• http://www.bbc.co.uk/programmes/p002w6r2/episodes/player.xml
• We can also export data from OpenRefine, as the following clip shows.
• Watch video: Exporting data from OpenRefine
198
Activity 2.13 Practical (optional)
• 30 minutes
• Using a data portal, web search or otherwise, find a well-
formatted CSV file that includes at least one column of each of
the following types – a string, an integer, a float or double, a
date-time – or alternatively create a CSV file for yourself.
• Hint: web search engines often support filters that can be used to limit the results returned. For example, adding the search limit ‘filetype:<suffix>’ to a query will only return files with the stated suffix, so filetype:pdf will only return search results that point to PDF files.
• Now create a JSON Table Schema for it. Upload your CSV file to CSVLint, along with its JSON Table Schema, and
check that it validates. Now introduce several errors into your
CSV file and test it again. Does the CSVLint tool detect the
errors? Does it help you address the errors? How, if at all,
could the CSVLint tool be extended to provide any more
support?
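• As a rough sketch of what such a schema might look like – the file names, column names and values below are invented, and the exact type names accepted may depend on the version of the JSON Table Schema specification and on CSVLint – a CSV file and a matching schema could be written as follows:

import csv
import json

# A tiny, made-up CSV with a string, an integer, a float and a date-time column
with open('sensors.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['site', 'reading_count', 'mean_temp', 'recorded_at'])
    writer.writerow(['Applechester', 12, 18.4, '2015-06-01T09:00:00Z'])

# A minimal JSON Table Schema describing those columns
# (type names such as 'datetime' may vary between versions of the specification)
schema = {
    'fields': [
        {'name': 'site', 'type': 'string'},
        {'name': 'reading_count', 'type': 'integer'},
        {'name': 'mean_temp', 'type': 'number'},
        {'name': 'recorded_at', 'type': 'datetime'}
    ]
}

with open('sensors.schema.json', 'w') as f:
    json.dump(schema, f, indent=2)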
199
Activity 2.13 Practical (optional)
• Many text editors support syntax highlighting for JSON. Online highlighters are also available
(Kornilov, n.d.). One of the best ways of viewing JSON
data feeds in a styled way within your browser is to use a
browser extension, such as JSONView.
200
Activity 2.14 Notebook
• 20 minutes
• You can see how (and whether) Python libraries can handle some other common data formats in Notebook 02.2.3 Data file formats - other.
• Remember, too, that other libraries were probably still in development at the time the module was written and may prove useful in particular projects – so if we don’t mention a particular format, it is still worth searching the wider Python ecosystem (for example, PyPI).
201
EXERCISES
For part 2
202
Exercise 2.1 Self-assessment
• 5 minutes
• To what extent, if any, do you think a consideration of
character encodings is required when acquiring,
representing and saving a dataset?
• Discussion
• If a data file is opened with the wrong encoding, errors are
likely to be thrown when a character cannot be decoded
with that encoding. In representing character data, it is
important to choose an encoding that can encode all the
characters that are likely to appear in it. If a dataset is
saved using an inappropriate character encoding,
information may be lost as byte values will map to the
wrong characters.
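• A short illustration of these failure modes (the string used is arbitrary):

# 'café' encoded as UTF-8 bytes
data = 'café'.encode('utf-8')            # b'caf\xc3\xa9'

# Decoding with the wrong encoding either raises an error...
try:
    data.decode('ascii')
except UnicodeDecodeError as err:
    print('ASCII cannot decode these bytes:', err)

# ...or silently maps the bytes to the wrong characters ('mojibake')
print(data.decode('latin-1'))            # cafÃ©

# Saving with an encoding that cannot represent a character loses information
print('café'.encode('ascii', errors='replace'))   # b'caf?'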
203
Exercise 2.2 Self-assessment
• 5 minutes
• Which of the NOIR measurement scales do you think each of the following corresponds to, and why?
• A series of timestamps.
• A set of prices of currently available mobile phones.
• A set of unique account numbers held by a bank.
• A set of numerical responses to questions in an online survey, in which participants are asked to state how
strongly they agree with certain propositions.
• A set of numerical responses to questions in an online survey, in which participants are asked to state
which kind of product packaging they prefer.
• A set of measurements of temperature in degrees Celsius.
• Answer
• Interval. Timestamps can be ordered (ranked) and each gap (i.e. seconds, minutes, etc.) is uniform; but there is no notional ‘first’ second (we could march time back forever a second at a time).
• Ratio. Prices have a fixed origin (zero) and otherwise have the order and uniform intervals of an interval scale.
• Nominal. A bank account number is not really a number (it’s a numeric label). It’s not meaningful to
manipulate bank account numbers as numbers: you could use the underlying numeric value to perform
arithmetic, but that’s not using them as account numbers.
• Ordinal. Responses have an order (strength of agreement), but the actual numeric value is largely
irrelevant.
• Nominal. The numeric value has no ordering inherent in the choice of the numbers: the numbers are index
labels into the list of packaging types.
• Interval. The Celsius scale has no true zero point (unlike the Kelvin scale’s absolute zero), but otherwise has order and a fixed separation between consecutive values.
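• For illustration, the sketch below shows how these distinctions might be reflected in pandas – plain categories for nominal data, ordered categories for ordinal data, and numeric columns where differences (interval) or ratios (ratio) are meaningful. The values are made up:

import pandas as pd

# Nominal: packaging choices are just labels – no order, no arithmetic
packaging = pd.Series(['box', 'tube', 'box', 'bag'], dtype='category')
print(packaging.cat.categories)

# Ordinal: survey responses have an order, but gaps between them are not meaningful
agreement = pd.Categorical(
    ['agree', 'strongly agree', 'disagree'],
    categories=['strongly disagree', 'disagree', 'neutral', 'agree', 'strongly agree'],
    ordered=True
)
print(agreement.max())                   # strongly agree

# Interval: differences between temperatures are meaningful, ratios are not
temps_c = pd.Series([18.5, 21.0, 19.2])
print(temps_c.diff())

# Ratio: prices have a true zero, so ratios make sense too
prices = pd.Series([199.0, 398.0])
print(prices[1] / prices[0])             # 2.0 – 'twice as expensive'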
204
Exercise 2.3 Self-assessment
• 5 minutes
• Write down four or five examples of things that you think
might be considered a document.
• Discussion
• You may have started with some of the things that are
often referred to as formal ‘documents’, such as birth
certificates, passports and driving licences. But you may
also have thought of documents created by productivity
tools such as Microsoft Word, or Excel, each with a
different file type (e.g. .docx, .xlsx). And if things with
different file types can be different sorts of document, then
so can HTML pages (.html), PNG image files, MP3 audio
files, or MPEG video files.
205
Exercise 2.4 Self-assessment
• 5 minutes
• To what extent do you think HTML could capture the structure
of the data in each of our three simple examples, from the start
of Section 2? How about the semantics?
• Discussion
• This turns out to be quite a complex issue, hinging on what we
take the word ‘semantic’ to encompass. HTML certainly seems
capable of capturing a lot of the structural elements we might
expect in a book, for instance – paragraphs, titles, and so on.
The numerical data in our second example could probably be
expressed in some form of table, as could the information
about cars in our third example. However, while tags are useful
for capturing this kind of structural information, it is not clear
just how much actual semantic information can be captured by
the range of tags supplied by HTML.
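• As a small illustration – the table below is invented, and this assumes an HTML parser such as lxml or html5lib is installed alongside pandas – an HTML table’s structure can be recovered mechanically, while its semantics cannot:

import io
import pandas as pd   # pd.read_html needs an HTML parser such as lxml or html5lib

# A tiny, invented HTML table about cars
html = """
<table>
  <tr><th>make</th><th>model</th><th>top_speed</th></tr>
  <tr><td>Austin</td><td>Allegro</td><td>90</td></tr>
  <tr><td>Morris</td><td>Minor</td><td>75</td></tr>
</table>
"""

# The table structure is recovered mechanically...
df = pd.read_html(io.StringIO(html))[0]
print(df)

# ...but nothing in the markup says what 'top_speed' means or what units it uses –
# that semantic information lives outside the HTML tags.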