
Data Science

Programming
Welcome to IST 380 !


an advocate of
concrete computing –
and HMC's mascot
About myself

Who Zach Dodds

Where Harvey Mudd College

What Research includes robotics and computer vision

When Mondays 7-10pm here in ACB 119

Contact Information
dodds@cs.hmc.edu
909-607-0867

Office Hours:
Friday mornings, 9-11 am, or set up a time...
HMC Beckman B111
TMI?

fan of low-tech games


fan of low-level AI
IST 380 ~ the big picture

What is it? Why me?


IST 380 ~ the big picture

What is it?
Data Science
Venn Diagram

Hmmm… where am I
on this diagram?
Data?!
• Neighbor's name

• A place they consider home

• Are they working at a company now? Where?

• How many U.S. states have they visited?

• Their favorite unhealthy food… ?

• Do they have any "Data Science" background? (statistics, machine learning, CS)

state reminders…
Data!
• Neighbor's name: Zachary Dodds

• A place they consider home: Pittsburgh, PA

• Are they working at a company now? Where? Harvey Mudd

• How many U.S. states have they visited? 44

• Their favorite unhealthy food… ? M&Ms

• Do they have any "Data Science" background? (statistics, machine learning, CS) mostly CS for me…
This class is truly seminar-style: I'm here, as you are, in order to gain insights into this very new field…

be sure to set up your login + profile for the submission site…


Data Science concerns

Is "Data Science"
important or just trendy?
Data Science concerns

Hmmm…
the companies are expanding as fast as the data!
There's certainly a lot of it!
Data, data everywhere…

Data produced each year (logarithmic scale)
2002: 5 EB
2006: 161 EB
2009: 800 EB
2011: 1.8 ZB
2015: 8.0 ZB

For comparison
Human brain's capacity: 14 PB
100-years of HD video + audio: 60 PB
also marked on the chart: 120 PB

Units
1 TB = 1000 GB
1 Petabyte == 1000 TB
References

(2002) 5 EB: http://www2.sims.berkeley.edu/research/projects/how-much-info-2003/execsum.htm
(2006) 161 EB: http://www.emc.com/collateral/analyst-reports/expanding-digital-idc-white-paper.pdf
(2009) 800 EB: http://www.emc.com/collateral/analyst-reports/idc-digital-universe-are-you-ready.pdf
(2011) 1.8 ZB: http://www.emc.com/leadership/programs/digital-universe.htm
(2015) 8 ZB: http://www.emc.com/collateral/analyst-reports/idc-extracting-value-from-chaos-ar.pdf
(life in video) 60 PB: in 4320p resolution, extrapolated from 16MB for 1:21 of 640x480 video (w/sound), almost certainly a gross overestimate, as sleep can be compressed significantly!
(brain) 14 PB: http://www.quora.com/Neuroscience-1/How-much-data-can-the-human-brain-store
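To get a feel for how fast those yearly totals grow, here is a small sketch in R. It uses only the five estimates quoted on the chart, converted to exabytes with the slide's decimal units (1 ZB = 1000 EB); the variable names are illustrative, not part of the course materials.

```r
# Yearly data-production estimates quoted on the chart, in exabytes
# (assumes 1 ZB = 1000 EB, matching the slide's decimal units).
year     <- c(2002, 2006, 2009, 2011, 2015)
exabytes <- c(5, 161, 800, 1800, 8000)

# Overall growth factor from the first estimate to the last
growth_factor <- exabytes[length(exabytes)] / exabytes[1]

# Average yearly multiplier implied by that growth (a geometric mean)
years_elapsed     <- year[length(year)] - year[1]
yearly_multiplier <- growth_factor^(1 / years_elapsed)

print(growth_factor)       # 1600
print(yearly_multiplier)   # roughly 1.76
```

Under these figures, the estimated volume grows by roughly three quarters every year, which is why the chart needs a logarithmic scale.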


I'd call it data,
not information

wisdom

knowledge

information

data
Big Data?

I agree with this…


Make data easier to use ~ by using it!

It may be true that


Data Science isn't a
science – but that
doesn't mean it's
not useful!
IST 380 ~ the big picture

What? Why?
Data Science
Programming Data Rules

All of our insights – large and small, permanent and


ephemeral, natural and artificial – come about
through the integration of lots of data.

Data Science simply recognizes that the rules and


skills behind those insights are widely applicable…
A few examples…

Make3d

Andrew Ng ~
Computers and
Thought award,
2009

How is this being done?


and how do we succeed?

… Data Science is at the heart of computer science


A few examples…

Learning to
Powerslide

Stanford's
Autonomous
Vehicles project
(Thrun et al.)

… Data Science is at the heart of computer science


A few examples…

Learning ground
from obstacles

"my summer was


finding that red line"

… Data Science is at the heart of computer science


A few examples…

classification segmentation

Learning ground from obstacles


Insights beyond science
Marketing
Visualization

Motivation
Recommender Systems

predicting
movie ratings
Netflix Prize

(I don't know this guy) Bob Bell, winner of the "Netflix prize"

Napoleon Dynamite = 1.22 Finding Nemo = ??


Batman Begins = .75 Lord of the Rings = ??
Some films are difficult to predict…
Netflix Prize

(I don't know this guy) Bob Bell, winner of the "Netflix prize"

Napoleon Dynamite = 1.22 Finding Nemo = .67


Batman Begins = .75 Lord of the Rings = .42
Some films are difficult to predict… and others are easier!
Why IST 380 ?
Specific skills:
R statistical environment (and the S programming language)

Experience with several statistical analyses (descriptive statistics)

Experience with predictive statistics (modeling) and


machine learning algorithms

Broad background:
Final project ~ open-ended with datasets of your choice

You'll be confident and capable with whatever datasets you


encounter in the future – on your own or as part of a team.
About IST 380 …
Details
Web Page:
http://www.cs.hmc.edu/~dodds/IST380

Assignments, online text, necessary files, lecture slides are linked


First week's assignment: Getting started with R

Textbook An introduction to Data Science


freely available online jsresearch.net/groups/teachdatascience/

and many online resources…


Grab both of
these now…

Programming: R
www.r-project.org/
Homepage
Go to the course page http://www.cs.hmc.edu/~dodds/IST380/

Grab R and the text from


these two links…
Homework
Assignments
~ 2-5 problems/week
~ 100 points
extra credit, often
Due Tuesday of the following week by 11:59 pm.
Assignment 1 due Tuesday, February 5.

1 week + 1 day…
Homework
Assignments
~ 2-5 problems/week
~ 100 points
extra credit, often
Due Tuesday of the following week by 11:59 pm.
Assignment 1 due Tuesday, February 5.

On your own or in groups of 2.


Working on programs: Divide the work at the keyboard evenly!

Submitting programs: at the submission website

install software ensure accounts are working


Today's Lab: try out R - the first HW is officially due on 2/5
Outline
approximate!

Weeks 1-5 "Data Science"
using R
descriptive statistics
predictive statistics
probability distributions

Weeks 6-10 "Machine Learning"
statistical modeling
support vector machines (SVMs)
nearest neighbors (NN)
random forests
k-means algorithm

Weeks 11-15 Final Project

No breaks?!


Grading
Grades
Based on points percentage
~ 800 points for assignments
~ 400 points for the final project

if score >= 0.95: grade = "A"
elif score >= 0.90: grade = "A-"
elif score >= 0.86: grade = "B+"
see the course syllabus for the full list...
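Since the course language is R, the same cutoffs can be sketched as an R function. This is only an illustration: `letter_grade` and the point totals below are made-up names and numbers, and only the three cutoffs shown above are included (the rest are in the syllabus).

```r
# Letter grade from a points fraction between 0 and 1.
# Only the three cutoffs listed on the slide are included; anything lower
# returns NA here because the remaining cutoffs live in the syllabus.
letter_grade <- function(score) {
  if (score >= 0.95) {
    "A"
  } else if (score >= 0.90) {
    "A-"
  } else if (score >= 0.86) {
    "B+"
  } else {
    NA_character_
  }
}

# Hypothetical totals: roughly 800 assignment + 400 project points possible
earned   <- 1100
possible <- 800 + 400
letter_grade(earned / possible)   # 1100/1200 is about 0.917, so "A-"
```

The `else if` chain matters: checking from the highest cutoff down and stopping at the first match keeps a 0.96 from being overwritten by the lower grades.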

Final project
• the last ~4 weeks will work towards a larger, final project
• there will be a short design phase and a short final presentation
• choose your own problem to study (I'll have some suggestions, too.)
• I'd encourage you to connect R and our Data Science techniques
to other datasets or projects that you use/need/like, etc.
Academic Honesty
This course operates under CGU's (and all of Claremont Schools')
Academic Honesty policies…

•Your work must be your own. This must be true for the whole
team, if you're working in a pair.

•Consulting with others (except team members or myself) is


encouraged, but has to be limited to discussion and debugging
of problems. Sharing of written, electronic, or verbal
solutions/files/code is a violation of CGU’s academic honesty
policy.

•A reasonable guideline: Work is your own if you could delete


all of it and recreate it yourself.
Thoughts?
Getting to know… R

R is the programmer's toolkit for statistics; SAS, Stata, SPSS are preferred by those in business intelligence

http://lang-index.sourceforge.net/#categ
Getting to know… R

Free… and very well supported online…


Getting to know… R

R is responsive, up-to-date, and flexible: Data Science vs. Statistics


Getting to know… R

Try it!

1) Find the IST 380 course webpage


www.cs.hmc.edu/~dodds/IST380/

2) Download and install R

3) Run R and try some basic commands at the prompt:

6 * 7

rnorm(10)

x <- 380
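Here is roughly what to expect from those three commands, as an annotated sketch. The `set.seed()` call and the final line are additions for illustration; they are not on the slide.

```r
6 * 7            # arithmetic at the prompt: prints 42
set.seed(380)    # optional addition: makes the random draws repeatable
rnorm(10)        # ten random draws from a standard normal distribution
x <- 380         # assignment: stores 380 in x and prints nothing
x                # typing a name prints its value: 380
```

Without `set.seed()`, `rnorm(10)` gives different numbers on every run, which is expected.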
Getting started!

1) Open Matloff's Why R? notes

2) Skip ahead to page 7, the "5 minute example session"

3) Try out the commands in section 2.2 to get started…

4) When you finish, save your session and submit it!

This is problem 1 this week


Saving your session

1) Create a folder named hw1, perhaps on your desktop

2) Use the Save to file… (Windows) or Save as…


(Mac) in order to save your current console session into
hw1
3) Name that file pr1.txt

4) From your operating system, open up that file in


order to confirm it contains your whole session!

This is problem 1 this week


Submitting your work
1) Zip up hw1 into hw1.zip

2) From the course webpage, click on the submission


site link.

3) Choose a submission site login name & let me know!

4) Once your account is made, login, change your password


to something you know, and submit hw1.zip

5) You can submit again – all copies are saved… troubles? email me!
This webserver can be
spacey -- I should know!

You've completed Problem 1!


Reflection
Assignment?

Creating a vector?

Printing?

Average and standard deviation?

Comments?

Comments?
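As a quick recap of those reflection points, here is a minimal sketch in R. The numbers and the names `scores`, `avg`, and `spread` are made up for illustration; they do not come from the tutorial.

```r
# This is a comment: R ignores everything after a #
scores <- c(2, 4, 4, 4, 5, 5, 7, 9)   # assignment, creating a vector with c()
print(scores)                          # explicit printing
avg    <- mean(scores)                 # average: 5
spread <- sd(scores)                   # sample standard deviation (divides by n - 1)
avg
spread
```

Note that `sd()` computes the sample standard deviation, so it will be slightly larger than the population version that divides by n.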
R types

You can use mode() to view the type of a variable.
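A few things to try with `mode()`, as a small sketch (the values are arbitrary examples):

```r
mode(380)          # "numeric"
mode("IST 380")    # "character"
mode(TRUE)         # "logical"
x <- c(1.5, 2, 3)
mode(x)            # "numeric": a vector reports the type of its elements
```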


Where's the big data?
c ~ concatenate

the colon : also


creates vectors

Vectors are R sequences of a single type of element
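Both ways of building vectors can be sketched like this. The variable names and values are only examples.

```r
states     <- c(44, 12, 30)     # c() concatenates values into a vector
more       <- c(states, 50)     # it also joins existing vectors
one_to_ten <- 1:10              # the colon builds a run of integers
mixed      <- c(1, "two", 3)    # mixing types coerces everything to one type

length(more)         # 4
length(one_to_ten)   # 10
mode(mixed)          # "character"
```

The last line shows the "single type" rule in action: R converts the numbers to text rather than storing mixed types in one vector.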


Analyzing vectors – try these…

Square brackets [] can "subset" (or "slice") vectors


Analyzing vectors

you can use a


boolean vector
to subset
another vector

Square brackets [] can "subset" (or "slice") vectors
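A short sketch of subsetting, first by position and then with a boolean vector. The vector `v` is just an example.

```r
v <- c(10, 20, 30, 40, 50)

v[2]            # 20 (R counts from 1)
v[2:4]          # 20 30 40
v[-1]           # everything except the first element

big <- v > 25   # boolean vector: FALSE FALSE TRUE TRUE TRUE
v[big]          # 30 40 50
v[v > 25]       # the same thing in one step
```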


NA
R uses NA to represent data that is "not available"
The function is.na( ) tests for NA

What is going on here?

This uses subsetting to remove NA values!
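The slide's screenshot is not reproduced here, so this is a stand-in sketch of the same idea with a made-up vector named `visited`:

```r
visited <- c(44, NA, 12, NA, 30)

is.na(visited)                      # FALSE TRUE FALSE TRUE FALSE
clean <- visited[!is.na(visited)]   # keep only the entries that are not NA
clean                               # 44 12 30

mean(visited)                       # NA: missing values propagate
mean(visited, na.rm = TRUE)         # mean of the available values only
```

The `!` flips the boolean vector from `is.na()`, so the subset keeps exactly the entries that are available.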


Data frames

R's fundamental data structures are data frames


The next tutorial will introduce them…
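As a small preview, here is a sketch of a data frame built from the neighbor survey at the start of class. The single row reuses the instructor's sample answers; the column names are invented for this example.

```r
# One row per person; column names here are illustrative assumptions.
neighbors <- data.frame(
  name           = "Zachary Dodds",
  home           = "Pittsburgh, PA",
  states_visited = 44,
  favorite_food  = "M&Ms"
)

str(neighbors)              # one observation of four variables
neighbors$states_visited    # 44: each column is a vector
```

Each column of a data frame is a vector, so everything from the vector slides still applies column by column.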
Irises…

virginica setosa

data() yields many built-in data files. This is iris


Subsetting iris data
df[rows,cols]

As with vectors, you can "subset" data frames.
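The `df[rows,cols]` pattern can be tried directly on the built-in iris data. This is a sketch; the particular rows and columns are arbitrary choices.

```r
data(iris)                   # load the built-in data set
dim(iris)                    # 150 rows, 5 columns

iris[1:3, ]                  # first three rows, all columns
iris[, "Species"]            # one column, all rows

# Rows chosen with a boolean vector, columns chosen by name
setosa <- iris[iris$Species == "setosa", c("Sepal.Length", "Species")]
nrow(setosa)                 # 50
```

Leaving the rows or the columns slot empty means "all of them", which is why the comma is still needed.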


Lab…
The 2nd part of each class meeting is dedicated to lab work.

I welcome you to stay for the lab, but it is not required.

Today's lab:

Work through Santorico and Shin's Tutorial for the R


Statistical Package and submit the console sessions as
pr2_1.txt, pr2_2.txt, pr2_3.txt, pr2_4.txt, and pr2_5.txt.

This is a nice reinforcement of vectors, introduction to


data frames, and a look at the graphics that R supports.
Homework
Problem 3: Challenge exercises in R

These will reinforce the "subsetting" and data-


analysis introduction from pr2's tutorial.

Problem 4: Introduction to Data Science, early chapters

This is a fuller background on R and the field


of data science

(submit your console session for both of these…)


Lab !
CS vs. IS and IT ?
greater integration
system-wide issues

smaller details
machine specifics

www.acm.org/education/curric_vols/CC2005_Final_Report2.pdf
CS vs. IS and IT ?

Where will IS go?


CS vs. IS and IT ?
IT ?

Where will IT go?


IT ?
The bigger picture

Weeks 10-12 Weeks 13-15


Objects Final Projects

Week 10 Week 13
classes vs. objects final projects

Week 11 Week 14
methods and data final projects

Week 12 Week 15
inheritance final exam
This class is truly seminar-style: we're developing expertise in this field together.
