
Musings on Data Science and

Students Experiencing Data Analytics


New England SENCER Center for Innovation

Prof. Randy Paffenroth


Data Science Program
Department of Mathematical Sciences
Worcester Polytechnic Institute

rcpaffenroth@wpi.edu

2014
My Research

"Internet Connectivity Access layer" by User:Ludovic.ferre -
Internet_Connectivity_Overview2_Access.svg. Licensed under Creative
Commons Attribution-Share Alike 3.0 via Wikimedia Commons -
http://commons.wikimedia.org/wiki/File:Internet_Connectivity_Access_layer.svg#mediaviewer/File:Internet_Connectivity_Access_layer.svg
This is a panel, so I want to be
provocative!
Provocative
Adjective
1. tending or serving to provoke; inciting,
stimulating, irritating, or vexing.
So, I will be a little sad if I don’t end up irritating
anyone :)
The first war: Terminology
• Analyzing data has a long history!
• There have been many terms that have been
used to describe such endeavors:
• Statistics
• Artificial Intelligence
• Machine learning
• Data analytics
• Since I happen to work in a “Data Science”
program perhaps I may be allowed the
indulgence of using that terminology…
Whatever we call it, what makes
things different now?
Experiments, observations, and numerical simulations in many
areas of science and business are currently generating terabytes
of data, and in some cases are on the verge of generating
petabytes and beyond. Analyses of the information contained in
these data sets have already led to major breakthroughs in fields
ranging from genomics to astronomy and high-energy physics and
to the development of new information-based industries.
- Frontiers in Massive Data Analysis, National Research Council of the National Academies

Given a large mass of data, we can by judicious selection
construct perfectly plausible unassailable theories—all of
which, some of which, or none of which may be right.
- Paul Arnold Srere
The ability to take data—to be able to understand it, to process it, to
extract value from it, to visualize it, to communicate it—that’s going to
be a hugely important skill in the next decades, not only at the
professional level but even at the educational level for elementary
school kids, for high school kids, for college kids. Because now we
really do have essentially free and ubiquitous data. So the
complementary scarce factor is the ability to understand that data and
extract value from it.
- Hal Varian, Google's Chief Economist, http://www.mckinsey.com/insights/innovation/hal_varian_on_how_the_web_challenges_managers

My personal goal: Getting students to be able to
think critically about data.
What is Big Data?
• There are many examples of "data", but what makes some of
it “big”? The classic definition revolves around the three
Vs.
• Volume, velocity, and variety.
• Volume: There is just a lot of it being generated all
the time. Things get interesting and “big” when you
can’t fit it all on one computer anymore. Why? There
are many ideas here such as MapReduce, Hadoop, etc.
that all revolve around being able to process data that
goes from Terabytes, to Petabytes, to Exabytes.
• Velocity: Data is being generated very quickly. Can
you even store it all? If not, then what do you get rid of
and what do you keep?
• Variety: The data types you mention all take different
shapes. What does it mean to store them so that you
can play with or compare them?
Image: http://pl.wikipedia.org/wiki/Green_Giant#mediaviewer/Plik:Jolly_green_giant.jpg
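One standard answer to the velocity question ("what do you get rid of and what do you keep?") is to keep a fixed-size uniform random sample of the stream. The sketch below uses reservoir sampling for this; it is an illustration chosen for this write-up, not something taken from the slides, and the names reservoir_sample and readings are hypothetical.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k items from a stream of unknown length.

    Only k items are ever held in memory, so the stream itself never
    has to be stored.
    """
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i replaces a stored item with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


# Hypothetical stream: a generator, so the data is never materialized.
readings = (x * x for x in range(100_000))
sample = reservoir_sample(readings, k=5, rng=random.Random(0))
print(sample)
```

The trade-off is that everything outside the sample is gone for good, so this only helps when a random subset is enough to answer the question being asked.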
Is Big Data the same as Data
Science?
• Are Big Data and Data Science the same thing?
• I wouldn't say so...
• Data Science can be done on small data sets.
• And not everything done using Big Data would
necessarily be called Data Science.
• But there certainly is a substantial overlap!

[Figure: diagram relating "Big Data" and "Data Science"]
Can you even be certain?
• For real world problems, I claim that you will never be
certain of any inferences from data.
• I mean, what happens to your carefully thought out marketing
plan for some rocking slacks when the Martians land?
• What is unacceptable is when the data you actually have
does not support the conclusion you report.
(Public domain image)


It can be easy to fool yourself!
Human beings are really good at pattern detection...
Perhaps a bit too good!

http://en.wikipedia.org/wiki/Cydonia_(region_of_Mars)
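The same trap shows up in numbers as well as in pictures. As a minimal sketch (not from the slides), the code below generates pure noise and then goes looking for a pattern in it: with only 10 observations and 200 unrelated candidate series, the best match usually looks like a strong correlation. The sizes, the seed, and the helper name pearson are all assumptions made for illustration.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(2014)
n_points, n_series = 10, 200

# Everything here is independent Gaussian noise: there is no real signal.
target = [rng.gauss(0, 1) for _ in range(n_points)]
noise = [[rng.gauss(0, 1) for _ in range(n_points)] for _ in range(n_series)]

# "Judicious selection": report only the series that matches best.
correlations = [pearson(target, series) for series in noise]
best = max(correlations, key=abs)
print(f"strongest correlation found in pure noise: {best:+.2f}")
```

Reporting only the winning series, without mentioning the other 199 that were tried, is exactly the case where the data does not support the conclusion.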
Skills for Data Science

http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
Which is most important?

http://en.wikipedia.org/wiki/View_of_the_World_from_9th_Avenue

http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
WPI Data Science Program:
A Collaboration

• Computer Science Department
• Mathematical Sciences Department
• Business School
M.S. in Data Science Program
GRADUATE QUALIFYING PROJECT OR MS THESIS (3 TO 9 CREDITS)

CONCENTRATION AND ELECTIVES (9 TO 15 CREDITS)

MATHEMATICAL ANALYTICS (3 CREDITS)
DATA ACCESS & MANAGEMENT (3 CREDITS)
DATA ANALYTICS & MINING (3 CREDITS)
BUSINESS INTELLIGENCE & CASE STUDIES (3 CREDITS)

INTEGRATIVE DATA SCIENCE (3 CREDITS)


Data Science Core
INTEGRATIVE DATA SCIENCE:
DS 501 INTRODUCTION TO DATA SCIENCE (NEW COURSE)

MATHEMATICAL ANALYTICS (SELECT ONE):
MA 543/DS 502 STATISTICAL METHODS FOR DATA SCIENCE (NEW COURSE)
MA 542 REGRESSION ANALYSIS
MA 554 APPLIED MULTIVARIATE ANALYSIS

DATA ACCESS AND MANAGEMENT (SELECT ONE):
CS 542 DATABASE MANAGEMENT SYSTEMS
MIS 571 DATABASE APPLICATIONS DEVELOPMENT
CS 561 ADVANCED TOPICS IN DATABASE SYSTEMS
CS 585/DS 503 BIG DATA MANAGEMENT (NEW COURSE)

DATA ANALYTICS AND MINING (SELECT ONE):
CS 548 KNOWLEDGE DISCOVERY AND DATA MINING
CS 539 MACHINE LEARNING
CS 586/DS 504 BIG DATA ANALYTICS (NEW COURSE)

BUSINESS INTELLIGENCE AND CASE STUDIES (SELECT ONE):
MIS 584 BUSINESS INTELLIGENCE
MKT 568 DATA MINING BUSINESS APPLICATIONS

Data Science Certificate Program (18 credits):
• 15 CREDIT DATA SCIENCE CORE
plus
• 3 CREDIT ELECTIVE
2014 Data Science Cohort

EDUCATIONAL FOUNDATION
QUANTITATIVE/COMPUTATIONAL BACKGROUNDS
PROGRAMMING WITH DATA STRUCTURES AND ALGORITHMS FOR COMPUTATIONAL SKILLS
QUANTITATIVE SKILLS: CALCULUS, LINEAR ALGEBRA AND STATISTICS

10% FULBRIGHT SCHOLARS

EMPLOYMENT HISTORIES
SENIOR RESEARCH ANALYST
SENIOR BUSINESS ANALYST
PATIENT FINANCIAL SERVICES
DATA BASE ANALYST-ARCHITECT
DECISION SCIENTIST
MINISTRY OF FINANCE
LAHEY HEALTH
TECHNICAL PROGRAM MANAGEMENT
U.S. DEPARTMENT OF STATE

GENDER
66.70% Male
33.3% Female

NATIONALITY
CAMBODIA
INDIA
CHINA
PAKISTAN
TAIWAN
IRAN
U.S.A.
BRAZIL
NEPAL
AFGHANISTAN
INDONESIA
2014 Data Science Cohort

FALL 2014
Total Applicants: 126
Total acceptances: 33
Fulbright Scholars: 3
Brazil Science Mobility Student: 1
Countries Represented: 9
Domestic Students: 5
International Students: 28

Many hold more than one earned Bachelor’s Degree
US Universities include Columbia, UNH and WPI
Dean Oates gave two Awards of $5K to outstanding students.
These awards help attract top students.
Skills Acquired by Our Students
Fundamental/Technical:
SQL / Data Modeling / Cleaning
Data Integration / Warehousing
Statistical Learning / Machine Learning
Distributed Computing
Big Data Management
Classification / Regression / Decision Trees
Business Intelligence
Distributed Mining Algorithms

Tools:
Oracle / MySQL / DB2 / SQLServer
R / SAS / SciKit
Weka / RapidMiner / MATLAB
IBM Cognos / SPSS Modeler
Hadoop / Mahout / Cassandra
Python / Java / Cloud Computing
Storm / Spark / InfoSphere Streams
Spotfire / Tableau

Professional Skills:
Story Telling / Visualization
Business Use Cases / Entrepreneurship
Presentations / Reports
Interdisciplinary Teams / Leadership
Data Science Tools for Students:
Free!
Software:
• Python
• http://www.python.org/
• iPython: http://ipython.org/
• Numpy: http://www.numpy.org/
• Pandas: http://pandas.pydata.org/
• Matplotlib: http://matplotlib.org/
• Mayavi: http://mayavi.sourceforge.net/
• Scikit-learn: http://scikit-learn.org/stable/

Data:
• UCI Machine learning repository
• http://archive.ics.uci.edu/ml/
• Kaggle
• https://www.kaggle.com/
• U.S. Government
• https://www.data.gov/
