TM351
Data management and analysis
Caveat
SESSION 1
Module guide, software guide, part 1 and part 2
MODULE GUIDE
Learning Python
Online tutorial
https://docs.python.org/3/tutorial/index.html
Textbook
• http://interactivepython.org/runestone/static/pythonds/Introduction/ReviewofBasicPython.html
Learning materials
• Downloadable from the Central LMS (course contents).
Course software
• We will not use the virtual machine indicated in the original OU software guide.
• For data management we will use PostgreSQL and MongoDB.
• For data analysis we will use Anaconda as the main analysis environment, with Python as the main programming language plus a variety of analysis libraries.
Assessment
MTA: 30%
TMA: 20%
Final: 50%
SOFTWARE GUIDE
General guidelines
• Read the software guide now
• Install Anaconda now
• Go through the bootcamp Notebooks 1.1–1.5 now
PART 1
Introducing data management and analysis
2.4 Metadata
• data about the dataset itself.
Three kinds:
• descriptive: supporting identification and discovery: for
example, the name of a dataset, or a description of its
contents
• structural: relating to the structure of the dataset: for
example, the column headings in a tabular dataset
• administrative: recording the means by which the
dataset came into being and how it may be, or may have
been, used.
• Can also be subject to data analysis
3.1 Stakeholders
• A dataset or database may have a very broad range of
stakeholders
• Different stakeholders will have widely different concerns.
• For example, if data about an individual is being analysed, then:
• that individual is a stakeholder
• so is the Information Commissioner (under the Data Protection Act), protecting the legal interests of all data subjects.
3.2 Scale
The three (or six) Vs
• Volume: our traditional measure of data size – how much
there is of it.
• Variety: in many different, sometimes incompatible forms
and representations.
• Velocity: how fast new data is generated and has to be
processed.
• Three more Vs are now becoming current:
• Veracity: the quality of the data; how ‘clean’ it is.
• Validity: to what extent the facts the data incorporates are correct and consistent for their context.
• Volatility: how quickly data changes, or becomes invalid.
Big data
• At the greatest extent of the scales of the three (or six) Vs
lies ‘big data’.
• Big data aims to gather, analyse, link and compare large datasets to identify patterns.
• ‘N equals All’ approach: computational power has
become so great that the whole population can now be
analysed, without the need for prior hypotheses.
• You will learn techniques to analyse data in a ‘hypothesis-
free’ way in later sections.
• There are cases in which this strategy fails spectacularly, e.g. predicting flu outbreaks.
4 Data handling
• Comprises two distinct sets of activities, roles and
responsibilities: data management and data analysis.
• In reality, though, these two are highly interdependent.
• Two ways to characterise data handling:
• as a cycle or life cycle (which mainly emphasises data
management)
• or as a pipeline (which combines data management with the way
the data is used).
Variations
Further elaborations
• The pipeline representation could be redrawn to highlight
specific stakeholder requirements
• it is possible to show different stakeholder requirements
throughout the data flow.
• If a specific viewpoint is required it can be extracted from
the full representation.
• Such deeply decomposed descriptions could be
annotated to show how and where different tools or
technologies are applied, or how aspects of data
management might change at different points in the
pipeline. This is useful for large corporations.
Trust
• This is the issue that overarches all other data analysis
activities. Analysing data is pointless unless stakeholders can
trust the results. Trust can relate to several aspects of data
analysis:
• Trust in the data itself: in its origins, documentation, security,
curation and in the quality of its maintenance.
• Trust in the processing applied to the data: have appropriate,
proven methods been applied? How well has the code base
been verified and validated? How repeatable and robust are
the results of the processing? How transparently have the
results been reported?
• Trust in the data managers and analysts themselves: their
competence, their understanding of procedures and processes,
and of concepts of fitness for purpose, data quality, appropriate
interpretation of results and requirements.
• The remaining sections briefly cover issues that relate to trust.
Bias
• Human bias
• Bias in data capture
• Bias in data cleaning
• Bias in data handling
Professionalism
• Professionalism recognises that those in a professional
position are often asked to make decisions, to plan activity or
engage in debates which can impact on individuals,
communities, and society as a whole, and as such they need to
behave in an ethical way.
5 Summary
• If you have followed this introduction you will:
• be able to articulate the module aims
• have been introduced to a simplified model of the data analysis
pipeline, and how that relates to the data management and
analysis tasks in more complex models
• have been introduced to some of the different stakeholder
roles, for whom various issues and concerns are of greater or
lesser importance
• have gained an understanding of some of the key issues in the
management and analysis of data.
• Practically, you should have:
• installed Anaconda and run it
• worked through the bootcamp Notebooks
• optionally, created your own IPython Notebook scribble pad.
EXERCISES
For Part 1
Scenario 1
I curate the family digital photo collection, held on a networked home PC using commercial
photo album management software. Photographs are uploaded and catalogued irregularly
(usually when memory cards are full, or after major ‘events’) on various devices and cloud
storage spaces. Backups are made to a password-protected portable hard drive.
Scenario 2
The secretary of a small ice hockey club keeps membership records, contact details, relevant
medical records, etc. in a spreadsheet. Game statistics, fixtures, results, player performance
records, etc. are held in a database, and extracts from these are published on the club website.
All communication with club members is filed in various software and email packages. Financial
records of members’ payments to the club, outgoings, etc. are held in an encrypted
spreadsheet. All club and player insurance documents and official documentation from the ice
hockey governing body are scanned and held in an encrypted folder.
ACTIVITIES
For part 1
PART 2
Acquiring and representing data
Workload
• This week you will be working through the module content, reading and
using standards documents, and doing practical work importing data into
IPython and OpenRefine.
• You will be spending about half your study time this week on practical
activities and exercises. Most of the practical activities occur towards the
end of the reading material, so ensure you work through the early
material quickly to leave sufficient time for the practical work.
• During this part of the module you will work through five Notebooks,
looking at Python’s pandas and developing skills in reading, writing and
manipulating content in different file formats.
• Activity 2.2 uses 02.2.0 Data file formats – file encodings (10 minutes).
• Activity 2.7 uses 02.1 Pandas DataFrames (60 minutes).
• Activity 2.10 uses 02.2.1 Data file formats – CSV (20 minutes).
• Activity 2.11 uses 02.2.2 Data file formats – JSON (20 minutes).
• Activity 2.14 (optional) uses 02.2.3 Data file formats – other (20 minutes).
• In addition there are two screencasts in Activity 2.12 (30 minutes), which
give a short introduction to the OpenRefine tool.
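The pandas work in these Notebooks follows a common pattern. As a minimal sketch (with made-up data, not taken from the module Notebooks), a DataFrame can be built and round-tripped through CSV like this:

```python
# A minimal sketch of the kind of pandas work the activities cover:
# building a DataFrame and round-tripping it through CSV.
import io
import pandas as pd

# Hypothetical data, for illustration only.
df = pd.DataFrame({
    "name": ["Ada", "Grace"],
    "score": [92, 87],
})

# Write to CSV and read it back, as in the data file format Notebooks.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
df2 = pd.read_csv(buffer)
print(df2["score"].mean())  # 89.5
```

In the Notebooks you will read from real files rather than an in-memory buffer, but the `to_csv`/`read_csv` pattern is the same.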
Unicode
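As a taste of why file encodings matter (a minimal sketch, not from the module materials): the same text maps to different byte sequences under different encodings, and decoding with the wrong encoding silently corrupts the text.

```python
# The same text stored as different byte sequences under different encodings.
text = "café"

utf8_bytes = text.encode("utf-8")      # b'caf\xc3\xa9' – 5 bytes
latin1_bytes = text.encode("latin-1")  # b'caf\xe9'     – 4 bytes

# Decoding with the wrong encoding produces mojibake rather than an error:
print(latin1_bytes.decode("latin-1"))  # café
print(utf8_bytes.decode("latin-1"))    # cafÃ©
```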
• Nominal
• Ordinal
• Interval
• Ratio
Nominal
• numbers are used only as labels, in which case words or
letters would serve equally as well
• Items have no inherent order, so the only legitimate
operations on these are a test for equality, or counting
the number of instances of the members of each category,
or finding the statistical mode (that is, most common
element).
• Example: gender coded as numbers (1=M, 2=F)
Ordinal
• numbers have a rank ordering, so in addition to an
equality test, it is legitimate to find the median (middle-
ranked value) as well as the mode, and order items into
percentiles.
• Example 1: Likert scale responses
• Example 2: letter grades mapped to grade points:
A = 4, B = 3, C+ = 2.5, C = 2, D = 1.5, F = 0
Interval
• numbers on an interval scale can be ranked, and we
know how far apart things are, such as on a temperature
scale, but without a specific origin being stated.
• Legitimate operations thus include finding the mean
(average) value, standard deviation and correlations.
• Example 1: temperature in Celsius or Fahrenheit scale
• Example 2: Date when measured from a particular epoch
• Example 3: Interest rates
• Example 4: Locations in Cartesian coordinates
• Example 5: Direction in degrees measured from true or
magnetic north
Ratio
• Numbers are on an interval scale, but with a meaningful, known, fixed origin, such as in the Kelvin temperature scale – origin absolute zero.
• Example 1: Temperature in Kelvin Scale
• Example 2: Mass of an object
• Example 3: Height of an object
• Example 4: Force on an object
• Example 5: The magnetic or the electrical field value
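The four levels of measurement above can be illustrated with a minimal sketch (hypothetical data, not from the module materials) of the legitimate operations on each, using Python's statistics module:

```python
import statistics

# Nominal: packaging type codes – only counting/mode is meaningful.
packaging = [1, 2, 2, 3, 2, 1]    # 1=box, 2=bag, 3=tin (hypothetical labels)
print(statistics.mode(packaging))  # 2 – the most common category

# Ordinal: Likert responses – median and mode are meaningful, the mean is not.
likert = [1, 2, 2, 3, 4, 5, 5]
print(statistics.median(likert))   # 3

# Interval: temperatures in Celsius – mean and standard deviation are meaningful.
celsius = [18.0, 21.5, 19.0, 22.5]
print(statistics.mean(celsius))    # 20.25

# Ratio: masses in kg – ratios are also meaningful, thanks to the fixed zero origin.
masses = [2.0, 4.0]
print(masses[1] / masses[0])       # 2.0 – 'twice as heavy' makes sense
```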
Namespaces
• XML from different schemas can be combined in the same
document
• namespaces allow for tags to be qualified and thus given
the meaning associated with the right schema. Ex:
<root xmlns:foaf="http://xmlns.com/foaf/0.1/"
      xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos">
    <foaf:Person>
        <foaf:name>Tim Berners-Lee</foaf:name>
    </foaf:Person>
</root>
• more about XML and namespaces in Section 5.6.
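A minimal sketch (an assumed example, not from the module materials) of how such namespaced XML can be parsed with Python's built-in xml.etree.ElementTree, mapping the foaf prefix to its namespace URI:

```python
import xml.etree.ElementTree as ET

doc = """<root xmlns:foaf="http://xmlns.com/foaf/0.1/">
  <foaf:Person>
    <foaf:name>Tim Berners-Lee</foaf:name>
  </foaf:Person>
</root>"""

root = ET.fromstring(doc)

# Qualified tags are looked up via a prefix-to-URI mapping; the prefix used
# here need not match the one in the document, only the URI matters.
ns = {"foaf": "http://xmlns.com/foaf/0.1/"}
name = root.find("foaf:Person/foaf:name", ns)
print(name.text)  # Tim Berners-Lee
```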
Summary
• data managers will be concerned to:
• find suitable forms of representation for basic data
elements such as numbers, names, dates, etc.
• represent the relationships between these data elements
• preserve the semantics of the ‘things out here’ they
represent.
• managers will be required to share data with others.
• they must also consider how their captured data may be
packaged so that it can be transported reliably between
different applications and stakeholders.
5 Transporting data
• Q: what technologies can package captured data for
transfer, while preserving the carefully chosen
representation choices and semantic information?
• XML is popular as a message passing format in the delivery of web
services.
Also,
• CSV – comma-separated values file (sometimes referred to as a
comma-separated variable file).
• JSON – JavaScript Object Notation.
Basic CSV
Problems
CSV dialects
• CSV dialects differ in:
• line terminators
• whether the final line of the file employs the line terminator
• whether leading and trailing white space is ignored
• separator characters
• escaping characters
• whether header rows are present
• whether every field is always quoted.
CSV dialects
• Alternative dialects of CSV can be specified using the CSV
Dialect Description Format (CSVDDF) (Data Protocols, 2015a).
For example:
{
    "csvddfVersion": 1.0,
    "dialect": {
        "delimiter": ",",
        "doubleQuote": false,
        "lineTerminator": "\r\n",
        "quoteChar": "\"",
        "skipInitialSpace": false
    }
}
• In a brief review of the territory, as well as a community call to
arms, the Open Data Institute’s Jeni Tennison called for 2014 to
be ‘The Year of CSV’ (Tennison, 2014).
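The CSVDDF settings above correspond closely to the parameters of Python's built-in csv module (the field names differ slightly between the two). As a minimal sketch, with hypothetical data, a matching dialect can be defined and used like this:

```python
import csv
import io

class CsvddfDialect(csv.Dialect):
    delimiter = ","
    doublequote = False       # CSVDDF "doubleQuote": false
    escapechar = "\\"         # needed when doublequote is off
    lineterminator = "\r\n"   # CSVDDF "lineTerminator"
    quotechar = '"'           # CSVDDF "quoteChar"
    skipinitialspace = False  # CSVDDF "skipInitialSpace"
    quoting = csv.QUOTE_MINIMAL

# With doubleQuote off, embedded quote characters are backslash-escaped.
data = 'name,comment\r\nAda,"says \\"hi\\""\r\n'
rows = list(csv.reader(io.StringIO(data), dialect=CsvddfDialect))
print(rows)  # [['name', 'comment'], ['Ada', 'says "hi"']]
```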
JSON - example
• valid JSON from RFC 7159.
• embodies an attribute/value representation style.
• Note that the object is contained within braces i.e. ‘{’ ‘}’:
{
    "Image": {
        "Width": 800,
        "Height": 600,
        "Title": "View from 15th Floor",
        "Thumbnail": {
            "Url": "http://www.example.com/image/481989943",
            "Height": 125,
            "Width": 100
        },
        "Animated": false,
        "IDs": [116, 943, 234, 38793]
    }
}
• Figure 2.6 JSON representation of an image object
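As a minimal sketch, the Figure 2.6 object can be parsed with Python's built-in json module, after which the attribute/value pairs are ordinary dictionary entries:

```python
import json

raw = """
{
  "Image": {
    "Width": 800,
    "Height": 600,
    "Title": "View from 15th Floor",
    "Thumbnail": {
      "Url": "http://www.example.com/image/481989943",
      "Height": 125,
      "Width": 100
    },
    "Animated": false,
    "IDs": [116, 943, 234, 38793]
  }
}
"""

# JSON objects become dicts, arrays become lists, false becomes False.
image = json.loads(raw)["Image"]
print(image["Title"])             # View from 15th Floor
print(image["Thumbnail"]["Url"])  # http://www.example.com/image/481989943
print(image["IDs"][0])            # 116
```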
XML equivalent
<image>
    <width>800</width>
    <height>600</height>
    <title>View from 15th Floor</title>
    <thumbnail>
        <url>http://www.example.com/image/481989943</url>
        <height>125</height>
        <width>100</width>
    </thumbnail>
    <animated>false</animated>
    <IDs>
        <id>116</id>
        <id>943</id>
        <id>234</id>
        <id>38793</id>
    </IDs>
</image>
Figure 2.7 XML representation of an image object, corresponding
to Figure 2.6
6 Summary
• In this part, you have learned about:
• some of the different ways in which data elements can be represented, and the
perils inherent in doing so
• two formats for representing complex data in such a way as to preserve its
structure and semantics
• two formats for packaging data so that it can be shared in a convenient way.
• Practically you will have worked with:
• the Python pandas library to represent structured data, using Series and
DataFrame objects
• the Python libraries to import and export CSV and JSON datasets
• OpenRefine as a data acquisition and parsing tool.
• You have also started to get your hands dirty with some real datasets, using
OpenRefine and pandas to open, and explore the properties of, a variety of data-
containing documents.
• In the next part of the module you will have an opportunity to work with a wider
range of real world datasets. Acquiring and representing data is one thing, but in
many cases it may not be clean and well behaved. Before it can be properly
manipulated it may well have to be given a good cleaning first.
ACTIVITIES
For part 2
EXERCISES
For part 2
• Answer
• Interval. Timestamps can be ordered (ranked) and each gap (i.e. seconds, minutes, etc.) is uniform; but
there is no notional ‘first’ second (we could march time back forever a second at a time).
• Ratio. Prices have a fixed origin (zero), and otherwise behave as an interval scale (ordered, with uniform gaps).
• Nominal. A bank account number is not really a number (it’s a numeric label). It’s not meaningful to
manipulate bank account numbers as numbers: you could use the underlying numeric value to perform
arithmetic, but that’s not using them as account numbers.
• Ordinal. Responses have an order (strength of agreement), but the actual numeric value is largely
irrelevant.
• Nominal. The numeric value has no ordering inherent in the choice of the numbers: the numbers are index
labels into the list of packaging types.
• Interval. The Celsius scale has no origin point (unlike Kelvin’s absolute zero), but otherwise has order and
fixed separation between consecutive values.