
Principles of Data Management

and Mining

CS 504 Spring 2020


Lecture 1

Introduction
People
• Instructor: Dr. Ping Deng
– E-mail: pideng@gmu.edu
– Office: ENGR 4608
– Office hours: W 2-4 PM
• TA: Archange Destine
– E-mail: adestine@masonlive.gmu.edu
– Office hours: TR 3-4PM @ ENGR 4456
Textbooks
• NoSQL Distilled: A Brief Guide to the Emerging World of
Polyglot Persistence by Sadalage and Fowler
• Data Science for Business: What You Need to Know About Data
Mining and Data-Analytic Thinking by Provost and Fawcett

Recommended Books
• Introducing Data Science: Big data, machine learning, and
more using Python tools by Cielen, Meysman, and Ali
• Python for Data Analysis by Wes McKinney
• Data Mining: Concepts and Techniques by Han and Kamber
• Fundamentals of Database Systems by Elmasri and Navathe

All except the last book are available on Safari:

https://login.mutex.gmu.edu/login?url=http://proquest.safaribooksonline.com/
Grading
– 10% In Class Quizzes
– 30% Homework
– 30% Midterm Exam
– 30% Final Exam
• Homework will be submitted on Blackboard unless
otherwise noted
• The lowest quiz score for the semester will be dropped
• You have a budget of 3 late days, which you may use as
you wish for homework
• Grades will be changed only when a grading error has been
made. All grade change requests are due within a week of
the grade becoming available on Blackboard
• There will be no make-up exams or quizzes unless
previously arranged with the instructor
Policies
• Cheating/plagiarism will NOT be tolerated. Please be familiar
with the Honor Code policies. Note that Blackboard validates
your assignments against a database of all previously submitted
work. You may be asked to explain your answers on
homework, quizzes, and exams. Failure to explain your own
work will raise serious doubt about whether you followed these
policies. All cases will be forwarded to the Honor Committee
with a recommendation for an F in the course
• Electronics
Topics
• ER & EER model
• Relational data model and ER & EER to
relational mapping
• SQL
• NoSQL
• Classification
• Clustering
Why are we here?
• This class is NOT about:
– Instantly becoming a data scientist
– Knowing all there is to know about databases
– Knowing all there is to know about data mining
So what is it about?
• The idea is to give students an understanding
of what working with data (processing, storing,
using) is about
• A ‘tour’ of the science and technology of
– Relational data model
– NoSQL
– Mining principles
• An understanding of the modeling process
Software
• RDBMS (Oracle/MySQL)
• NoSQL (MongoDB)
• Data mining (Weka)
• Python (Anaconda)
What Exactly Does “Big Data” Mean?

• Massive collections of records – think 10s of PBs
• To some, Big Data means using a NoSQL system or a
parallel relational DBMS
• Typically housed on large clusters of low-cost processors
• Facebook has 2700 nodes in their cluster with 60PB of
storage!! Truly stunning.
How much data?
• Google processes 20 PB a day (2008)
• Facebook has 2.5 PB of user data + 15 TB/day (2009)
• eBay has 6.5 PB of user data + 50 TB/day (2009)
• 2.5 quintillion bytes (2.5 exabytes) generated each day (IBM estimate)

640K ought to be
enough for anybody.
Some Big Data Stats

• 35ZB = enough data to fill a stack of DVDs reaching halfway
to Mars (if you like analogies…)

[Chart: Amount of Stored Data by Sector (in petabytes, 2009) –
sectors such as manufacturing, government, communications,
banking, healthcare, retail, education, and transportation,
ranging from about 227 PB to 966 PB]

• 1 zettabyte? = 1 million petabytes = 1 billion terabytes
= 1 trillion gigabytes

Sources: "Big Data: The Next Frontier for Innovation, Competition
and Productivity," McKinsey Global Institute Analysis; US Bureau
of Labor Statistics
Why the Sudden Explosion
of Interest?
• An increased number and variety of data sources that generate
large quantities of data
– Sensors (e.g. location, acoustical, …)
– Web 2.0 (e.g. Twitter, wikis, …)
– Web clicks
• Realization that data was too valuable to delete

• Dramatic decline in the cost of hardware, especially storage


– If storage were still $100/GB, there would be no big data revolution
underway
More data or better model?
• In 2001, Banko and Brill published a paper:
– More data leads to better accuracy (are you
surprised?)
– With increasing amounts of data, the accuracy of
different algorithms converged! (See the sketch
after this list.)
– Conclusion (at the time): machine learning
methods don’t matter, data does (do you believe
this?)
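
(A rough, hedged illustration of this convergence effect – not Banko
and Brill’s actual experiment, which used natural-language tasks. The
Python sketch below trains two quite different classifiers on growing
slices of a synthetic dataset; the dataset, models, and parameters are
all illustrative assumptions.)

# Illustrative simulation: train two very different classifiers on
# growing samples of a synthetic dataset and watch their test
# accuracies converge as the training set grows.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic "large corpus": 100,000 labeled examples.
X, y = make_classification(n_samples=100_000, n_features=20,
                           n_informative=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(max_depth=10, random_state=0),
}

for n in (100, 1_000, 10_000, 80_000):
    scores = []
    for name, model in models.items():
        model.fit(X_train[:n], y_train[:n])
        scores.append(f"{name}: {model.score(X_test, y_test):.3f}")
    print(f"n = {n:>6} -> " + ", ".join(scores))
# Typical runs show the accuracy gap between the two models
# shrinking as the training size n grows.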
Wild Predictions: A Myth?
In 2008, Chris Anderson published an article
entitled “The End of Theory: The Data Deluge
Makes the Scientific Method Obsolete” in Wired
magazine
Main premise: if you let the data ‘talk,’ you don’t
need a scientist (or an expert, for that matter)
Why we still need people and theory
• Reason 1: The Power of Chance
– When you have lots of data, patterns will appear…
with trillions of points and thousands of metrics,
correlations will appear by pure chance (just as some
people win the lottery every day against impossible
odds)
• Example: just 1,000 time series give you nearly half
a million pairwise correlations (1,000 × 999 / 2 =
499,500). Most of them will be useless! (See the
sketch after this list.)
– Many of those patterns will have no predictive power
whatsoever
– You need somebody with knowledge to help separate
the noise from the signal
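
(A minimal sketch of the “half a million correlations” claim, assuming
nothing beyond NumPy: generate 1,000 independent random walks – so any
relationship between them is pure chance – and count how many pairwise
correlations nonetheless look strong.)

# Patterns by pure chance: correlate pairs of independent random walks.
import numpy as np

rng = np.random.default_rng(0)
n_series, length = 1_000, 200

# 1,000 independent random walks – by construction, no real relationships.
series = rng.standard_normal((n_series, length)).cumsum(axis=1)

# 1,000 series give 1,000 * 999 / 2 = 499,500 pairwise correlations.
corr = np.corrcoef(series)                    # rows are the series
upper = corr[np.triu_indices(n_series, k=1)]  # each pair counted once

print(f"pairwise correlations: {upper.size}")                 # 499500
print(f"|r| > 0.9 by chance alone: {(np.abs(upper) > 0.9).sum()}")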
Why we still need people and theory
• Reason 1 (concrete example)
– A once-famous indicator of economic
performance was the winner of the Super Bowl
• From 1967 to 1997, the stock market gained an average
of 14% for the rest of the year when a team from the
NFC won the game, and fell 10% when a team from the
AFC won instead
• That pattern (besides having no foundation) has since
been broken
– You can easily find that pattern when blindly
looking for correlations in Big Data
Why we still need people and theory
• Reason 2: Data is Noisy!
– Statistical inferences are much stronger when you
back them up with theory or at least think about
their causes
– It’s easy to confuse noise with signal
– Examples:
• Economics: at least one economic forecasting firm
predicted a recession in 2011 by using lots of
specialized indexes (related to overfitting – more later)
• Weather predictions: those based only on statistics are
not accurate enough
Why we still need people and theory
• Reason 3: if we want to produce good models,
we have to understand the theory
– In machine learning, a basic principle tells you that
models should not exhibit too much bias or too
much variance (a toy illustration follows)
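
(A toy numeric illustration of the bias/variance principle, using
made-up data: fit a too-simple polynomial, degree 1, and a too-flexible
one, degree 15, to noisy samples of a sine curve, then compare errors on
the training data versus fresh data from the same process.)

# Underfitting vs. overfitting on noisy samples of a sine curve.
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 1, 30))
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 30)          # noisy training data
x_new = np.sort(rng.uniform(0, 1, 30))                      # fresh sample,
y_new = np.sin(2 * np.pi * x_new) + rng.normal(0, 0.2, 30)  # same process

for degree in (1, 15):
    p = Polynomial.fit(x, y, degree)   # least-squares fit on a scaled domain
    train_mse = np.mean((p(x) - y) ** 2)
    test_mse = np.mean((p(x_new) - y_new) ** 2)
    print(f"degree {degree:>2}: train MSE {train_mse:.3f}, "
          f"test MSE {test_mse:.3f}")
# Degree 1 misses the curve on both sets (high bias, underfitting);
# degree 15 fits the training points almost perfectly but does worse
# on new data (high variance, overfitting).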
Why we still need people and theory
• Selecting the right model requires knowledge
of the prediction field
• Example of high bias predictions (underfitting)
– Usually political predictions are biased
• Example of high variance predictions
(overfitting)
– Earthquake prediction
Reason 4: Explain the Data
• There are at least two types of models:
– Predictive: I want to predict the future
– Explanatory: I want to explain a process
• Most predictive models only care about
reducing prediction error
• As such, they might not be good at explaining
the data
Fundamental Concept 1
• Extracting useful knowledge from data to
solve business problems can be treated
systematically by following a process with
reasonably well-defined stages
– The Cross Industry Standard Process for Data
Mining (CRISP-DM)
[Figure: the CRISP-DM process – business understanding, data
understanding, data preparation, modeling, evaluation, deployment]
Fundamental Concept 2
• From a large mass of data, information
technology can be used to find informative
descriptive attributes of entities of interest
Fundamental Concept 3
• If you look too hard at a set of data, you will
find something – but it might not generalize
beyond the data you are looking at
Fundamental Concept 4
• Formulating data mining solutions and
evaluating the results involves thinking
carefully about the context in which they will
be used
