
CIS 601: Fundamentals of Data Science
Master of Science in Data Science
Dr. Mohammed Mustafa
Faculty of Computers and Information Technology
University of Tabuk
Based on slides from Dr. Steven Skiena and others, including note slides from Dr. Osama Alia

Lecture 3: Assembling Data Sets


Data Munging
Good data scientists spend most of their time
cleaning and formatting data.
The rest spend most of their time complaining
that no data is available.
Data munging or data wrangling is the art of
acquiring data and preparing it for analysis.
Data wrangling
Data wrangling, sometimes referred to as data
munging, is the process of transforming and
mapping data from one "raw" data form into
another format with the intent of making it more
appropriate and valuable for a variety of
downstream purposes such as analytics.
(Wikipedia)
Wrangling Examples
►want the core news text, title, date, etc. off a web page, e.g.,
“Apple’s iPhone loses top spot to Android in Australia”
►want the text plus details from the PDF file
“Data Wrangling: The Challenging Journey from the Wild to
the Lake”
►want all article titles from PubMed results XML
►want to digitize the text off a scanned letter
►want to extract all the sentences referring to Hillary Clinton
in a news article
Wrangling Examples, cont.
►your company has customer records in 4 different
databases in different formats; you want a single
standardised set of customer names and addresses
►convert addresses in your customer database into
geographic latitude and longitude
►convert free-text dates to a standard format, e.g. “next
Tuesday”, “2nd January 15”, “January 3 next year”, “3rd
Friday in the month”, “03/31/15”, “31/03/15” (a sketch
follows this list)
►recognise which values in your data are “unknown” or
“illegal”
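
As a minimal sketch of free-text date normalisation, the snippet below uses the python-dateutil package (an assumption; the slides do not prescribe a library). Relative phrases such as “next Tuesday” need extra calendar logic that dateutil alone does not supply.

# A minimal sketch using python-dateutil (assumed installed).
from dateutil import parser

for raw in ["2nd January 15", "03/31/15", "January 3, 2016"]:
    # dayfirst=False reads 03/31/15 as March 31; set it True for 31/03/15
    print(parser.parse(raw, dayfirst=False).date().isoformat())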
Languages for Data Science
● Python: contains libraries and features (e.g.,
regular expressions) that make munging easier.
● R: the programming language of statisticians.
● Matlab: fast and efficient matrix operations.
● Java/C: languages for Big Data systems.
● Mathematica/Wolfram Alpha: symbolic
math.
● Excel: bread-and-butter tool for spreadsheet analysis.
Common Software
• Access: SQL, Hadoop, MS SQL Server, Pig, Spark
• Wrangling: common scripting languages (Python, Perl)
• Visualisation: Tableau, Matlab, JavaScript + D3.js
• Statistical analysis: Weka, SAS, R
• Multi-purpose: Python, R, SAS, KNIME, RapidMiner
• Cloud-based: Azure ML (Microsoft), AWS ML (Amazon)
Notebook Environments
Mixing code, data,
computational results,
and text is essential
for projects to be:
● reproducible
● tweakable
● documented.
Data Pipelines
Notebooks make it easier to
maintain data pipelines, the
sequence of processing steps
from start to finish.
Expect to have to redo your
analysis from scratch, so build
your code to make it
possible.
Standard Data Formats
Historically, computer scientists would rather
share a toothbrush than a data format.
But accepted standards are now available:
● CSV files: for tables like spreadsheets
● XML: for structured but non-tabular data.
● JSON: JavaScript Object Notation, for APIs.
● SQL databases: for multiple related tables.
Structured Data Formats
• CSV, comma separated values
sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
4.6,3.1,1.5,0.2,setosa
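
A minimal sketch of reading such a file with Python’s standard csv module, assuming the rows above are saved as iris.csv:

import csv

# Read the CSV above into dicts keyed by the header row.
with open("iris.csv", newline="") as f:
    rows = list(csv.DictReader(f))

print(rows[0]["species"])              # -> setosa
print(float(rows[0]["sepal_length"]))  # values arrive as strings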

• hierarchies, e.g., Newick tree format

(A,B,(C,D)E)F;

• networks, e.g., the GraphViz (DOT) format


digraph graphname {
a -> b -> c;
b -> d;
}
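
A minimal sketch of loading the DOT snippet above into Python, assuming the networkx and pydot packages are installed and the graph is saved as graphname.dot:

import networkx as nx

# read_dot returns a MultiDiGraph; collapse it to a plain DiGraph.
g = nx.DiGraph(nx.nx_pydot.read_dot("graphname.dot"))
print(list(g.edges()))  # [('a', 'b'), ('b', 'c'), ('b', 'd')]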
PageRank: The web as a behavioral dataset
Graph Data
Lots of interesting data
has a graph structure:
• Social networks
• Communication networks
• Computer Networks
• Road networks
• Citations
• Collaborations/Relationships
• …

Some of these graphs can get
quite large (e.g., the Facebook
user graph)
JSON
• Similar to XML but simpler

{
  "firstName": "John",
  "lastName": "Smith",
  "isAlive": true,
  "age": 25,
  "address": {
    "streetAddress": "21 2nd Street",
    "city": "New York",
    "state": "NY",
    "postalCode": "10021-3100"
  },
  "phoneNumbers": [
    { "type": "home",   "number": "212 555-1234" },
    { "type": "office", "number": "646 555-4567" },
    { "type": "mobile", "number": "123 456-7890" }
  ]
}
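
Python’s standard json module turns such a record into nested dicts and lists; a minimal sketch, assuming the record above is saved as person.json:

import json

with open("person.json") as f:
    person = json.load(f)

print(person["address"]["city"])            # -> New York
print(person["phoneNumbers"][0]["number"])  # -> 212 555-1234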
Kind of Data
• STRUCTURED DATA
  – lists
  – n × m arrays
  – hierarchies (e.g., organization chart)
  – networks (e.g., travel routes; hypertext = links)
• UNSTRUCTURED DATA
  – text
  – images
  – video
  – sound
• Unstructured data can often be made structured by, e.g.,
parsing language, segmenting images, etc.
• Generic data-interchange formats: XML, JSON
Parsing
• Given a known grammar, unstructured text data can be parsed
to discover its inherent structure

• “It ain’t over till the fat lady sings”

((it, (ain't, over)), (till, ((the, (fat, lady)), sings)))
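
A minimal sketch of grammar-driven parsing with the nltk package (an assumption; this toy grammar covers only a fragment of the quote, not the full bracketing above):

import nltk

# A toy context-free grammar for part of the sentence.
grammar = nltk.CFG.fromstring("""
  S   -> NP VP
  NP  -> Det Adj N
  VP  -> V
  Det -> 'the'
  Adj -> 'fat'
  N   -> 'lady'
  V   -> 'sings'
""")

for tree in nltk.ChartParser(grammar).parse("the fat lady sings".split()):
    print(tree)  # (S (NP (Det the) (Adj fat) (N lady)) (VP (V sings)))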

• Similarly, images can be segmented into parts


[Figures: image segmentation, examples 1 and 2]
Big Data
• A crucial part of the rise of Data Science is the steep increase in the amount
and availability of data.

• Big Data refers not only to the quantity of data but also to its character:
– VOLUME: lots of it
– VELOCITY: fast (streaming)
– VARIETY: all kinds, not nice and “clean”, (challenge is standardization)
– VERACITY: can it be trusted?

• One way to differentiate between a large amount of data vs. big data is
that big data often refers to data that wasn’t collected for the same
purpose for which it is being used.
Where Does Data Come From?
The critical issue in any modelling project is
finding the right data set.
This will certainly be the case for your projects!
Large data sets often come with valuable
metadata: e.g. book titles, image captions,
Wikipedia edit history...
Repurposing metadata requires imagination.
Data can be obtained from many sources.



Sources of Data
● Proprietary data sources
● Government data sets
● Academic data sets
● Web search
● Sensor data
● Crowdsourcing
● Sweat equity
See also Google Dataset Search: https://toolbox.google.com/datasetsearch
Proprietary Data Sources
Facebook, Google, Amazon, Blue Cross, etc.
have exciting user/transaction/log data sets.
Most organizations have/should have internal
data sets of interest to their business.
Getting outside access is usually impossible.
Companies sometimes release rate-limited
APIs, including Twitter and Google.
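
A generic sketch of polite access to a rate-limited API; the endpoint and parameters below are placeholders, not a real Twitter or Google URL:

import time
import requests  # third-party package, assumed installed

for page in range(5):
    r = requests.get("https://api.example.com/v1/items",
                     params={"page": page})
    r.raise_for_status()
    print(len(r.json()))  # do something useful with the payload
    time.sleep(1.0)       # stay under the provider's rate limit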
Government Data Sources
● City, State, and Federal governments are
increasingly committed to open data.
● Data.gov has over 100,000 open data
sets!
● The Freedom of Information Act (FOIA)
lets you request data that is not already open.
● Preserving privacy is often the big issue in
whether a data set can be released.
Academic Data Sets
● Making data available is now a requirement
for publication in many fields.
● Expect to be able to find economic, medical,
demographic, and meteorological data if
you look hard enough.
● Track data sets down from relevant papers, and ask the authors.
● Google the topic plus “Open Science” or “data”.
Web Search/Scraping
Webpages often contain valuable text and numerical
data, which we would like to get our hands on.
There are two distinct steps to make this happen,
spidering and scraping:
• Spidering is the process of downloading the right set
of pages for analysis.
• Scraping is the fine art of stripping this content from
each page to prepare it for computational analysis.
Web Search/Scraping
Scraping is the fine art of stripping text/data
from a webpage.
Libraries exist in Python to help parse/scrape
the web (e.g. BeautifulSoup), but first search:
● Are APIs available from the source?
● Did someone previously write a scraper?
Terms of service limit what you can legally do.
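
A minimal scraping sketch with requests and BeautifulSoup (both assumed installed; the URL is a placeholder):

import requests
from bs4 import BeautifulSoup

html = requests.get("https://example.com/article").text
soup = BeautifulSoup(html, "html.parser")

print(soup.title.string)      # page title
for p in soup.find_all("p"):  # body paragraphs
    print(p.get_text(strip=True))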
Available Data Sources
● Bulk Downloads: e.g. Wikipedia, IMDB,
the Million Song Dataset.
● API access: e.g. New York Times, Twitter,
Facebook, Google.
Be aware of limits and terms of use.
Sensor Data Logging
The “Internet of Things” enables amazing applications:
● Measure the weather using Flickr images.
● Measure earthquakes using accelerometers
in cell phones.
● Identify traffic flows through GPS on taxis.
Build logging systems: storage is cheap!
Sweat Equity
But sometimes you must work for your data
instead of stealing it.
Much historical data still exists only on paper or
PDF, requiring manual entry/curation.
At one record per minute, you can enter 1,000
records in only two 8-hour work days (1,000 minutes ≈ 17 hours).
Often projects require sweat equity.
Crowdsourcing
Many amazing open data resources have been
built up by teams of contributors:
● Wikipedia/Freebase
● IMDB
Crowdsourcing platforms like Amazon Mechanical Turk
enable you to pay armies of people to help
you gather data, e.g., through human annotation.
Project Data Sources
● Miss Universe?
● Movie gross?
● Baby weight?
● Art auction price?
● Snow on Christmas?
● Super Bowl / College Champion?
● Ghoul Pool?
● Future Gold / Oil Price?
Cleaning Data: Garbage In, Garbage Out
Many issues arise in ensuring the sensible
analysis of data from the field, including:
● Distinguishing errors from artifacts.
● Data compatibility / unification.
● Imputation of missing values.
● Estimating unobserved (zero) counts.
● Outlier detection.
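
As one illustration, a minimal sketch of mean imputation with pandas (assumed installed; the column name is made up):

import pandas as pd

df = pd.DataFrame({"height_cm": [162.0, None, 180.0, 175.0]})
df["height_cm"] = df["height_cm"].fillna(df["height_cm"].mean())
print(df)  # the missing entry becomes the column mean, ~172.33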
