Data Management in Big Data Era
CILO Outline
Hierarchy of data
- Bit: 0 and 1
- Byte: 8 bits
- Character: 1, 2 or 4 bytes (1 byte: ASCII; 2 bytes: GB2312 or Big5; 4 bytes: Unicode stored as UTF-32)
- Field: string of characters
- Record: a group of related fields
- File: collection of records
- Database: collection of files with minimal redundancy
A Chinese character needs two bytes in Big5 (Traditional Chinese) or GB2312
(Simplified Chinese)
If you want to show both Traditional Chinese and Simplified Chinese together, you need to
use Unicode (Big5, GB2312 and Unicode are three different standards)
Unicode is a character set that covers the characters of all languages
in the world; stored at a fixed width (UTF-32), each character takes 4 bytes.
- Unicode (UTF-8) has largely replaced Big5 and GB2312 on Chinese Web sites
- UTF-8 (UCS Transformation Format, 8-bit) is a variable-length encoding: it uses 1 to 4 bytes per character, with 1 byte for ASCII and typically 3 bytes for a Chinese character
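The byte counts discussed above can be checked directly. The following is a minimal sketch using Python's built-in codecs; the helper names `byte_length` and `report` are made up for this illustration, and `utf-32-be` is used as the fixed 4-byte form of Unicode.

```python
# Encodings named in the slides, plus UTF-32 as the fixed-width Unicode form.
ENCODINGS = ["big5", "gb2312", "utf-8", "utf-32-be"]


def byte_length(text: str, encoding: str) -> int:
    """Return how many bytes `text` occupies in the given encoding."""
    return len(text.encode(encoding))


def report(char: str) -> dict:
    """Map each encoding to the byte length of `char`, or None if the
    character does not exist in that standard."""
    sizes = {}
    for enc in ENCODINGS:
        try:
            sizes[enc] = byte_length(char, enc)
        except UnicodeEncodeError:
            sizes[enc] = None  # character is not part of this standard
    return sizes


if __name__ == "__main__":
    for ch in ["A", "中"]:
        print(ch, report(ch))
```

The character 中 exists in both Big5 and GB2312 (2 bytes each), takes 3 bytes in UTF-8 and 4 bytes in UTF-32, while the letter A takes a single byte in UTF-8.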
Relational Databases
- Relation or table (file)
- Tuple or row (record)
- Attribute or column (field)
- Relationships
Big Data Page 5
SQL
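As an illustration of the relational terms above, here is a minimal sketch using Python's standard `sqlite3` module. The tables `customer` and `purchase`, their columns, and the sample rows are hypothetical and are not taken from the slides; they only show a relation (table), tuples (rows), attributes (columns) and a relationship expressed through a SQL join.

```python
import sqlite3


def build_demo() -> sqlite3.Connection:
    """Create an in-memory database with two related tables (hypothetical data)."""
    conn = sqlite3.connect(":memory:")
    # A relation (table) whose attributes (columns) are id, name and city.
    conn.execute(
        "CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, city TEXT)"
    )
    # Each inserted tuple (row) is one record.
    conn.executemany(
        "INSERT INTO customer (id, name, city) VALUES (?, ?, ?)",
        [(1, "Alice", "Hong Kong"), (2, "Bob", "Taipei")],
    )
    # The relationship: purchase.customer_id refers to customer.id.
    conn.execute(
        "CREATE TABLE purchase ("
        "id INTEGER PRIMARY KEY, "
        "customer_id INTEGER REFERENCES customer(id), "
        "item TEXT)"
    )
    conn.executemany(
        "INSERT INTO purchase (customer_id, item) VALUES (?, ?)",
        [(1, "shirt"), (1, "tie"), (2, "shirt")],
    )
    return conn


def items_by_customer(conn: sqlite3.Connection, name: str) -> list:
    """Follow the relationship with a JOIN and list one customer's items."""
    rows = conn.execute(
        "SELECT p.item FROM purchase AS p "
        "JOIN customer AS c ON c.id = p.customer_id "
        "WHERE c.name = ? ORDER BY p.item",
        (name,),
    ).fetchall()
    return [row[0] for row in rows]


if __name__ == "__main__":
    print(items_by_customer(build_demo(), "Alice"))
```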
In layman's terms:
A data warehouse is a large store of data accumulated from a wide range
of sources within a company and used to guide management decisions.
Still makes no sense?
Data Mining:
finding interesting trends and patterns in a large database (or
data warehouse) to guide decisions about future activities.
That is, finding the RULES in the data.
Data mining is related to:
- Exploratory data analysis in statistics.
- Knowledge discovery and machine learning in artificial intelligence.
A not very good example:
- After seeing 100 to 500 birds, you come to the conclusion that “all birds can fly”
- The rule is found by observing data
- There are always exceptions
- Is a penguin a bird?
Find items that are purchased together by the SAME customer
{shirt}=>{tie}
LHS=>RHS
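A rule such as {shirt}=>{tie} is usually judged by how often it holds in the data. The sketch below computes two standard association-rule measures, support and confidence, which the slides do not name explicitly; the basket data and the function names are hypothetical and serve only as an illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for basket in transactions if itemset <= set(basket))
    return hits / len(transactions)


def confidence(transactions, lhs, rhs):
    """Among transactions containing LHS, the fraction that also contain RHS."""
    lhs_support = support(transactions, lhs)
    if lhs_support == 0:
        return 0.0  # the rule never applies in this data
    return support(transactions, set(lhs) | set(rhs)) / lhs_support


# Hypothetical purchases, one set of items per customer.
BASKETS = [
    {"shirt", "tie"},
    {"shirt", "tie", "belt"},
    {"shirt"},
    {"socks"},
]

if __name__ == "__main__":
    print("support:", support(BASKETS, {"shirt", "tie"}))
    print("confidence:", confidence(BASKETS, {"shirt"}, {"tie"}))
```

With these four baskets, two of the four contain both items (support 0.5), and two of the three shirt buyers also bought a tie (confidence 2/3).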
- Velocity (too slow): cannot be transferred from one site to another.
- Variety (too complex): cannot be handled by a single program or a set of programming packages (data are unstructured).
Big Data is not new; it has been around for years
It was too expensive
Needed supercomputers
Not every organization could afford it
Even when an organization could afford it, it was not profitable
Only government agencies and very large firms could afford it
CIA and FBI hunting down someone
Predicting tomorrow’s weather
Imagine a program that predicts tomorrow’s weather
but needs 25 hours to run
Predicting when the end of the world will be
The count-down clock
Now it is affordable and profitable
Use clusters of PCs (thousands of them) to emulate a supercomputer
The entrance fee for Big Data has dropped from billions of dollars to millions
SSD storage
Up to about 100 times faster than old magnetic disk drives
Reduction in Memory cost
In-core (in-memory) databases become possible
Distributed Cluster Computing
Do not need ONE very, very, very fast machine
Have lots of PCs clustered together
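The idea of many ordinary machines sharing one job can be sketched on a single computer. In the example below, threads from Python's standard `concurrent.futures` module stand in for the PCs of a cluster (an assumption made purely for illustration; a real cluster would send each chunk to a different machine). The function names and the sample records are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(records, n_chunks):
    """Cut the data into roughly equal pieces, one per worker."""
    size = max(1, -(-len(records) // n_chunks))  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def count_items(chunk):
    """Work done independently by one worker on its own piece of the data."""
    return Counter(chunk)


def distributed_count(records, n_workers=4):
    """Count items by splitting the work, then merging the partial results."""
    chunks = split_into_chunks(records, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(count_items, chunks))
    total = Counter()
    for partial in partials:
        total.update(partial)  # add each worker's counts to the total
    return total


if __name__ == "__main__":
    sales = ["shirt", "tie", "shirt", "belt", "tie", "shirt"]
    print(distributed_count(sales, n_workers=3))
```

No single worker ever sees all the data; each counts its own chunk, and only the small partial results are combined at the end.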
Cloud Computing
Do not need to put all the data in a single place
Pervasive Computing
Mobile devices
Internet-enabled household products
Microwave ovens that can download recipes
CCTVs that are online
AI in data analytics
Image recognition, Voice recognition, Face recognition, OCR
Pattern discovery
Data mining (finding rules in data)
Big Data Applications
Google Now (an app on iPhone and Android)
uses a natural language user interface
answers questions
makes recommendations
performs actions by delegating requests to web sites
Siri on Apple devices
Some may consider Siri only a joke
Find new customers
Send them social media messages when they are nearby
Generate personalized promotional messages
Hunt down someone
Police or government agents.
Cross-platform marketing
Receive an email from a furniture store right after you purchase an apartment
Health Care
Track down the source of a virus (a biological virus, not a computer virus)
Diagnose very rare medical cases
Database
Data Warehouse
Data Mining
Big Data
Page 5
- https://en.wikipedia.org/wiki/Relational_model
Page 16
- 3V: https://apandre.wordpress.com/2013/11/19/datawatch/