
Data management in big data era

CILO Outline

Course Intended Learning Outcome (CILO)


 Describe key concepts in different areas (Big Data) of information technology (IT)
and explain their implications for our daily life.
Outline
 Database
 Data Warehouse
 Data Mining
 Big Data

Big Data Page 2


How to store data?

 Hierarchy of data
- Bit: 0 and 1
- Byte: 8 bits
- Character: 1, 2 or 4 bytes (1 byte: ASCII; 2 bytes: GB2312 or Big5; 4 bytes: Unicode)
- Field: string of characters
- Record: a group of related fields
- File: collection of records
- Database: collection of files with minimal redundancy

[Figure: example table with its fields and records labelled]

Big Data Page 3


How many bytes in a character?

 A character can be 1 byte, 2 bytes or 4 bytes


- English character can be represented using 1 byte known as the ASCII code
ASCII = American Standard Code for Information Interchange
Letter   ASCII Code   Bit Notation
I        73           0100 1001
J        74           0100 1010
M        77           0100 1101

 A Chinese character may need two bytes using Big5 (Traditional Chinese) or GB2312
(Simplified Chinese)
 If you want to show both Traditional Chinese and Simplified Chinese together, you need to
use Unicode (Big5, GB2312 and Unicode are three different standards)
 Unicode is a standard that covers the characters of all languages in the world;
stored at full fixed width (UTF-32), each character takes 4 bytes.
- Unicode (UTF-8) is replacing Big5 and GB on Chinese Web sites
- UTF-8 (UCS Transformation Format, 8-bit) is a variable-length encoding: a character
takes 1 to 4 bytes, and fewer than 4 when the leading bytes of its code would be 0’s
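The byte counts above can be checked with a short Python sketch. It relies only on the codecs that ship with Python (`ascii`, `big5`, `gb2312`, `utf-8`, `utf-32-be`); the helper names `bit_notation` and `encoded_length` and the sample character 中 are illustrative choices, not part of the slides.

```python
def bit_notation(ch):
    """Return (code, bits) for a one-byte character, bits grouped as 'xxxx xxxx'."""
    code = ord(ch)
    bits = format(code, "08b")          # 8 binary digits, zero padded
    return code, bits[:4] + " " + bits[4:]


def encoded_length(ch, encoding):
    """Number of bytes the character occupies in the given encoding."""
    return len(ch.encode(encoding))


# Reproduce the ASCII table for I, J and M.
for letter in "IJM":
    print(letter, *bit_notation(letter))

# Compare how many bytes one Chinese character needs in each standard.
# 'utf-32-be' is used so that no byte-order mark is counted.
for enc in ("big5", "gb2312", "utf-8", "utf-32-be"):
    print(enc, encoded_length("中", enc))
```

An English letter stays at 1 byte in UTF-8, which is why UTF-8 is usually much smaller than fixed 4-byte storage for mixed text.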

Big Data Page 4


Database Management System (DBMS)

 Software package designed for controlling access to the database and
managing data resources efficiently
 Available for various sizes and types of computers
- e.g. MS-ACCESS, MS-SQL server, Oracle
 Most common DBMSs use the Relational Model
- with table, column and tuple
 Data are accessed using SQL
(Structured Query Language)

 Relational Databases
- Relation or table (file)
- Tuple or row (record)
- Attribute or column (field)
- Relationships
Big Data Page 5
SQL

 Structured Query Language (SQL)
- A standard computer language designed to manipulate Relational DBMS
- All Relational DBMS (MS-ACCESS, MS-SQL server, Oracle) may use SQL
to access the DBMS
 SQL may SELECT, INSERT, UPDATE, DELETE rows (records) in the
tables using proper syntax
 Table can be created using CREATE TABLE command

Find all Body_style after 1994:
    SELECT Body_style
    FROM Corvettes
    WHERE Year > 1994 ;

Input a record into the Corvettes table:
    INSERT INTO Corvettes(Vette_id, Body_style, Miles, Year, State)
    VALUES (37, 'convertible', 25.5, 1986, 17) ;

Change a value in the Corvettes table:
    UPDATE Corvettes
    SET Year = 1996
    WHERE Vette_id = 17 ;

Make a new Table States with 2 fields using State_id as key:
    CREATE TABLE States (
        State_id INTEGER PRIMARY KEY NOT NULL,
        State CHAR(20)) ;
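To try these statements without installing a DBMS, one option is Python's built-in `sqlite3` module with an in-memory database. This is only a sketch: the slides do not give the definition of the Corvettes table, so its column types below are assumptions, and the two starting rows (Vette_id 17 and 18) are made-up sample data.

```python
import sqlite3


def run_corvette_demo():
    """Run the four SQL statements from the slide against an in-memory database."""
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()

    # CREATE TABLE as on the slide.
    cur.execute(
        "CREATE TABLE States ("
        " State_id INTEGER PRIMARY KEY NOT NULL,"
        " State CHAR(20))"
    )
    # Assumed schema for Corvettes (not given in the slides).
    cur.execute(
        "CREATE TABLE Corvettes ("
        " Vette_id INTEGER PRIMARY KEY NOT NULL,"
        " Body_style CHAR(20), Miles REAL, Year INTEGER, State INTEGER)"
    )
    # Made-up sample rows so that UPDATE and SELECT have something to act on.
    cur.executemany(
        "INSERT INTO Corvettes(Vette_id, Body_style, Miles, Year, State)"
        " VALUES (?, ?, ?, ?, ?)",
        [(17, "coupe", 40.0, 1990, 2), (18, "hatchback", 12.0, 1997, 3)],
    )

    # INSERT, UPDATE and SELECT exactly as on the slide.
    cur.execute(
        "INSERT INTO Corvettes(Vette_id, Body_style, Miles, Year, State)"
        " VALUES (37, 'convertible', 25.5, 1986, 17)"
    )
    cur.execute("UPDATE Corvettes SET Year = 1996 WHERE Vette_id = 17")
    cur.execute("SELECT Body_style FROM Corvettes WHERE Year > 1994 ORDER BY Vette_id")
    body_styles = [row[0] for row in cur.fetchall()]
    conn.close()
    return body_styles


print(run_corvette_demo())  # ['coupe', 'hatchback']
```

The convertible inserted from the slide is not returned because its Year is 1986, while row 17 is returned only because the UPDATE moved its Year to 1996.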

Big Data Page 6


Data warehouse

 The term data warehouse was coined by William H. Inmon, who is
known as the Father of Data Warehousing. He defines it as:
A data warehouse is a subject-oriented, integrated, time-variant
and nonvolatile collection of data that supports management's
decision-making process.
 Does not make any sense, does it?

 In layman's terms:
 A data warehouse is a large store of data accumulated from a wide range
of sources within a company and used to guide management decisions.
 Still makes no sense?

 Or in the simplest form:
 A data warehouse is a Database of Databases
- It is NOT yet BIG data (what BIG data is comes later)

Big Data Page 7


Data Mining

 Data Mining:
finding interesting trends and patterns in large databases (or
data warehouses) to guide decisions about future activities.
 That is, find out RULES in the data.
 Data mining is related to:
- Exploratory data analysis in statistics.
- Knowledge discovery and machine learning in artificial intelligence.
 A not very good example:
- After seeing 100 to 500 birds, you come to the conclusion that “all birds can fly”
- The rule is found by observing data
- There are always exceptions
- Is a penguin a bird?

Big Data Page 8


Data Mining in business

 A typical example – Market Basket Analysis.


 A market basket is a collection of items purchased by a
customer in a single customer transaction.
- A customer transaction consists of a single visit to a store, a
single order through a mail-order catalog, or an order at a virtual
store on the web.
 Market basket analysis aims to:
- identify items that are purchased together.
- use this information to improve the layout of goods in a store or
the layout of catalog pages.

Big Data Page 9


Purchase Relation for Market Basket Analysis

Find out items that are purchased together by the SAME customer

transid  custid  date     item   qty
111      201     5/1/16   shirt  2
111      201     5/1/16   tie    1
111      201     5/1/16   shoes  1
111      201     5/1/16   socks  6
112      105     6/3/16   shirt  1
112      105     6/3/16   tie    1
112      105     6/3/16   shoes  1
113      106     5/10/16  shirt  1
113      106     5/10/16  shoes  1
114      201     6/1/16   shirt  2
114      201     6/1/16   tie    2
114      201     6/1/16   socks  4

All tuples (records) in a group have the same transid, and together
they describe a customer transaction, which involves purchases of one or
more items.
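As a rough illustration of the first step of market basket analysis, the Python sketch below groups the purchase records by transid and counts how often each pair of items appears in the same transaction. The names `PURCHASES`, `group_into_baskets` and `count_item_pairs` are hypothetical; only the data come from the table.

```python
from collections import Counter, defaultdict
from itertools import combinations

# (transid, custid, date, item, qty) records copied from the Purchase relation.
PURCHASES = [
    (111, 201, "5/1/16", "shirt", 2), (111, 201, "5/1/16", "tie", 1),
    (111, 201, "5/1/16", "shoes", 1), (111, 201, "5/1/16", "socks", 6),
    (112, 105, "6/3/16", "shirt", 1), (112, 105, "6/3/16", "tie", 1),
    (112, 105, "6/3/16", "shoes", 1),
    (113, 106, "5/10/16", "shirt", 1), (113, 106, "5/10/16", "shoes", 1),
    (114, 201, "6/1/16", "shirt", 2), (114, 201, "6/1/16", "tie", 2),
    (114, 201, "6/1/16", "socks", 4),
]


def group_into_baskets(purchases):
    """Map each transid to the set of items bought in that transaction."""
    baskets = defaultdict(set)
    for transid, _custid, _date, item, _qty in purchases:
        baskets[transid].add(item)
    return dict(baskets)


def count_item_pairs(baskets):
    """Count, for every pair of items, the transactions containing both."""
    pair_counts = Counter()
    for items in baskets.values():
        for pair in combinations(sorted(items), 2):
            pair_counts[pair] += 1
    return pair_counts


baskets = group_into_baskets(PURCHASES)
pair_counts = count_item_pairs(baskets)
for pair, count in pair_counts.most_common():
    print(pair, count)
```

Pairs with high counts, such as shirt and tie, are the candidates a store might place close together.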

Big Data Page 10


Association Rules

 Given a collection of customer purchases (or “transactions”) of the form:

transid custid date item qty

we want to identify rules of the form:

{shirt}=>{tie}

This rule should be read as follows:

“If a shirt is purchased in a transaction, it is likely that a tie
will also be purchased in that transaction.”

Big Data Page 11


Association Rules

 An association rule is a statement that describes the transactions in the
database
- Extrapolation to future transactions should be done with caution.
 More generally, an association rule has the form:

LHS=>RHS

where both LHS and RHS are sets of items.

The interpretation of such a rule is that if every item in LHS is purchased in a
transaction, then it is likely that the items in RHS are purchased as well.

Big Data Page 12


Association Rules

Simple statistics, or is it?

 There are two important measures for an association rule:


- Support – The support for a rule LHS => RHS is the support for the set
of items LHS ∪ RHS.
- Confidence – Consider transactions containing all the items in LHS. The
confidence for a rule LHS => RHS is the percentage of such transactions
that also contain all items in RHS.
- More precisely, let sup(LHS) be the percentage of transactions that
contain LHS and let sup(LHS ∪ RHS) be the percentage of
transactions that contain both LHS and RHS. Then the confidence of
the rule LHS => RHS is:

$$\text{confidence}(LHS \Rightarrow RHS) = \frac{\text{sup}(LHS \cup RHS)}{\text{sup}(LHS)}$$

- The confidence of a rule is an indication of the strength of the rule.

Big Data Page 13


The Use of Association Rules for Prediction

Try to investigate how supermarkets arrange goods on the shelves.
Almost ALL supermarkets arrange them similarly.
This is due to Market Basket Analysis in data mining.
 Association rules are widely used for prediction.
- Association rules describe existing data accurately but
can be misleading when used naively for prediction.
- For example, consider the rule {shirt}=>{tie}: the
confidence associated with this rule is the conditional
probability of a tie purchase given a shirt purchase
over the given database.
For the 4 transactions on page 10:
sup(shirt ∪ tie) = 75%
sup(shirt) = 100% (one)
Confidence of {shirt}=>{tie} is 75% (75% divided by one)
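The support and confidence figures for the four transactions can be checked with a few lines of Python. This is a minimal sketch, assuming each transaction is reduced to the set of items it contains; the function names `support` and `confidence` are illustrative, and values are returned as fractions (0.75 means 75%).

```python
# The four transactions (transid 111 to 114), each reduced to its set of items.
TRANSACTIONS = [
    {"shirt", "tie", "shoes", "socks"},  # 111
    {"shirt", "tie", "shoes"},           # 112
    {"shirt", "shoes"},                  # 113
    {"shirt", "tie", "socks"},           # 114
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(lhs, rhs, transactions):
    """sup(LHS u RHS) / sup(LHS); assumes LHS occurs in at least one transaction."""
    both = set(lhs) | set(rhs)
    return support(both, transactions) / support(lhs, transactions)


print(support({"shirt"}, TRANSACTIONS))                # 1.0
print(support({"shirt", "tie"}, TRANSACTIONS))         # 0.75
print(confidence({"shirt"}, {"tie"}, TRANSACTIONS))    # 0.75
print(confidence({"tie"}, {"shirt"}, TRANSACTIONS))    # 1.0
```

The last line shows that a rule and its reverse can have different confidence: every transaction with a tie also has a shirt, but not every transaction with a shirt has a tie.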

Big Data Page 14


The 3 V’s
Big Data

Volume (Too BIG): Cannot be stored in a single machine and sometimes cannot be stored in a single site.

Velocity (Too slow): Cannot be transferred from one site to another.

Variety (Too complex): Cannot be handled by a single program or a set of programming packages (data are unstructured).

Big Data is too BIG, too slow or too complex to be handled by the traditional data processing methods.

Big Data Page 15


The 3 V’s
Data are not only text: sound, video, office files, maps, fingerprints, X-ray images, weather data, DNA strings, etc.

Data keep coming in. Cannot wait for them to be stored on hard disk.

Zetta: 21 0’s (1,000,000,000,000,000,000,000)
Tera: 12 0’s (1,000,000,000,000)

Big Data Page 16


BIG data past and future


Big Data is not new, it has been around for years

It was too expensive

Need super-computers

Not every organization can afford it

Even if some organizations can afford it, it is not profitable

Only Government agencies and very large firms can afford it

CIA and FBI to hunt down someone

Predict tomorrow’s weather

Imagine a program that can predict tomorrow’s weather
but needs 25 hours to run

Predict when the end of the world is

The count-down clock


Now it is affordable and profitable

Use clusters of PC’s (thousands of them) to emulate a super computer

Entrance fee of Big Data is reduced from Billion$ to Million$

Big Data Page 17


Big Data Technologies


SSD storage

100 times faster than old magnetic disk drives

Reduction in Memory cost

In-core Database possible

Distributed Cluster Computing

Do not need ONE very very very fast machine

Have lots of PC’s clustered together

Cloud Computing

Do not need to put all the data in a single place

Pervasive Computing

Mobile devices

Internet enabled household products

microwave that can download recipes

CCTV’s that are online

AI in data analytics

Image recognition, Voice recognition, Face recognition, OCR

Pattern discovery

Data mining (finding rules in data)
Big Data Page 18
BIG data Applications


Google Now (an app on iPhone and Android)

use natural language user interface

answer questions

make recommendations

perform actions by delegating requests to some web sites

Siri on Apple

Some may consider Siri only a joke

Find new customers

Send him/her Social Media messages when nearby

Generate personalized promotional messages

Hunt down someone

Police or government agents.

Cross platform marketing

Receive email from a furniture store when you have just purchased an apartment

Health Care

Track down the source of a Virus (Bio-virus not computer virus)

Diagnose very rare health care cases

Big Data Page 19


The 4th, 5th and 6th V in Big Data

 In addition to Volume, Variety, Velocity


 Veracity
- Quality of captured data must be maintained
- Corrupted data -> Corrupted decision
- The good-old saying “Garbage in Garbage out”
 Variability
- Which version of data is the latest version?
- How to make sure you are using the latest data value?
- Store multiple versions of the same data (BIG DATA -> BIGGER DATA)
 Value
- Any use?
- The ultimate goal of BIG DATA
- BIG data must be useful – as it requires a HUGE amount of resources (billions of $)

Big Data Page 20


What have you learnt?

 Database
 Data Warehouse
 Data Mining
 Big Data

Big Data Page 21


Credits

 Page 5
- https://en.wikipedia.org/wiki/Relational_model
 Page 16
- 3V: https://apandre.wordpress.com/2013/11/19/datawatch/
