
Feedback from last class, emails, Piazza postings, etc.
• Updated due dates for Project 1: No late submissions allowed. With every phase submission, submit the updated previous phase. TAs, please note that you do not regrade any previous phases, only the current phase. Once a grade is assigned for a phase, it is final.
• Phase 1 (DS tasks 1, 2, 3, and 4): 2/27
• Phase 2 (DS task 5): 3/14 (changed from 3/7)
• Phase 3 (DS tasks 6 and 7): 3/20
• Phase 4 (DS tasks 9 and 10): 3/27
• Phase 5 (DS task 8): 4/3

NOTE: No other dates change.

You have to know problem solving and programming to do well in this class.


Problem statement (in your own words)
• A representative but concise title
• What is your project about? Topics/subject area, core issue addressed
• Why is it important? Why should you care?
• Who will it help? Who are the beneficiaries, and how? What is your hypothesis?
• What are the data sources? Size, characteristics
• What questions are you trying to answer?
• What are you going to do with the data?
• How are you going to analyze the data? Give the reasoning for the methods, if applicable.
• References and attributions.
DS book
• Look at the announcements on ublearns.
• Look at the web site for the course description:
www.cse.buffalo.edu/~bina/cse487/spring2021
• J. VanderPlas, Python Data Science Handbook. O'Reilly. ISBN: 9781491912058, November 2016. (Online version available.)
• Before the semester started, I sent an email to all registered students asking them to read and execute the Python code to learn about DS and prepare for the class. Do it now if you haven’t. All the answers for Project 1 are in there.
Big data: Data-intensive computing book
1. Big data analytics: Data-Intensive Text Processing with MapReduce. J. Lin and C.
Dyer, Morgan & Claypool Publishers (October 10, 2010), ASIN : B0094J3FXM.
2. Hadoop and Spark infrastructure: Hadoop 2 Quick-Start Guide: Learn the
Essentials of Big Data Computing in the Apache Hadoop 2 Ecosystem. D. Eadline,
Addison Wesley Professional, ASIN : B017A8UACW, 2015.
• We’ll begin with Lin and Dyer’s book (1).
• Numerous resources are available online too!
• I will start the lesson today on this topic. Project 2 is based on this.
Sign up with the TAs.
• If you don’t sign up and get approval for your topic, you will not be graded!
• Attend the TAs’ office hours.
• Work around religious days. I have given a generous number of days for each phase.

• Chen Yuan: chyuan@buffalo.edu
• Ping Yu: pingyu@buffalo.edu
• Jiayi Xian: jxian@buffalo.edu
Quiz
• Next class we have a quiz. 100 points/week
• Have your device ready. Only one attempt is allowed.
• I don’t want any excuses, such as: my laptop fell and broke, I am at a hotspot, the wireless freaked out, the power went out exactly at the time of the quiz, or any of the many other excuses your creative minds come up with.
Attendance and Reading
• Attendance is important.
• This is a synchronous, real-time class.

• Last note: READ, READ, READ
• There is nothing more important than reading. There is no shortcut to learning. Learning starts with reading and writing, arithmetic and coding.
Project 1 Help and Platform --Feb 16
• The TAs and I had a very productive meeting.
• The TAs have prepared a very nice notebook guiding you through your Project 1 work.
• It is about “Reproducible Research” using a Jupyter notebook.
• You’ll have to install Jupyter, prepare a notebook, and keep updating the notebook for each phase.
• Thus you’ll have just one notebook for all your work, including documentation, code, and visualization. Please see the project details on ublearns.
• Also consult the TAs by email or on Piazza.
Question on Piazza about dataset
• Somebody asked whether a “sensitive” and different dataset will be used for future projects.
• Here is my answer: I think this person meant to say “sensible”, implying that the data sets, and the research organizations that do socially relevant projects and data collection, are “insensitive” or “nonsense”.
• Understand that most raw data is messy, noisy, and dirty; be happy somebody is doing the work of collecting these sets for you.
• 80% of data science work is data cleaning and pre-processing.
• Please look at the earlier lectures for what I discussed about Projects 1, 2, etc.
• Project 1 uses “structured data”; Project 2 will focus on “unstructured/structured big data”.
February 18
Feedback from project response
• I have only 14 responses out of 141 people. Here is the Google link: https://forms.gle/2cizmZsTun6AMkNo8
• Discuss the data sets and the idea with the TAs or me.
• Do you know my office hours? MW 8:30-10:00 am; the Zoom link is on ublearns.
• The TAs will post their office hours on ublearns too, with the link.
Data Generation Processes
Have potential concerns and biases

• There are a variety of processes for capturing events as data, each of which has its own limitations and assumptions. The primary modes of data collection fall into the following categories:
• Sensors: The volume of data being collected by sensors has increased dramatically in the last decade. Sensors that automatically detect and record information, such as pollution sensors that measure air quality, are now entering the personal data management sphere (think of FitBits or other step counters). Assuming these devices have been properly calibrated, they offer a reliable and consistent mechanism for data collection.
• Surveys: Data that is less externally measurable, such as people’s opinions or personal histories, can be gathered from surveys. Because surveys depend on individuals’ self-reporting of their behavior, the quality of data may vary (across surveys, or across individuals). Depending on the domain, people may have poor recall (e.g., people don’t remember what they ate last week) or have incentives to respond in a particular way (e.g., people may over-report healthy behaviors). The biases inherent in survey responses should be recognized and, when possible, adjusted for in your analysis.
• Record keeping: In many domains, organizations use both automatic and manual processes to keep track of their activities. For example, a hospital may track the length and result of every surgery it performs (and a governing body may require that hospital to report those results). The reliability of such data will depend on the quality of the systems used to produce it. Scientific experiments also depend on diligent record keeping of results.
• Secondary data analysis: Data can be compiled from existing knowledge artifacts or measurements, such as counting word occurrences in a historical text (computers can help with this!).
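As a small illustration of secondary data analysis, here is a minimal sketch of counting word occurrences with Python's standard library. The sample sentence is a made-up stand-in for a historical text, and the function name `count_words` is an assumption introduced for this example.

```python
import re
from collections import Counter

def count_words(text):
    """Return a Counter mapping each lowercase word to its number of occurrences."""
    # Treat runs of letters and apostrophes as words; ignore case and punctuation.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)

# Hypothetical stand-in for a historical text.
sample = "The quick brown fox jumps over the lazy dog. The dog barks."
counts = count_words(sample)

# Show the most frequent words first.
for word, n in counts.most_common(3):
    print(word, n)
```

For a real text you would read the file contents into `sample` first; the counting step stays the same.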
Finding data

• Government publications:
  • U.S. government’s open data: https://www.data.gov
  • Government of Canada open data: https://open.canada.ca/en/open-data
  • Open Government Data Platform India: https://data.gov.in
  • City of Seattle open data portal: https://data.seattle.gov
  • City of Buffalo open data: https://data.buffalony.gov/
• News and journalism:
  • New York Times Developer Network: https://developer.nytimes.com
  • FiveThirtyEight: Our Data: https://data.fivethirtyeight.com
• Scientific Research:
  • Nature: Recommended Data Repositories: https://www.nature.com/sdata/policies/repositories
• Social networks and media organizations:
  • Twitter developer platform: https://developer.twitter.com/en/docs
  • Google APIs Explorer: https://developers.google.com/apis-explorer/
• Online communities:
  • Kaggle: “the home of data science and machine learning”: https://www.kaggle.com
  • Socrata: data as a service platform: https://opendata.socrata.com
  • UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/index.php
  • /r/DataSets: https://www.reddit.com/r/datasets/
• From original surveys by NPOs:
  • Pew research: https://www.pewresearch.org/
Interpreting data
• Acquiring domain knowledge
• You do not necessarily need to be an expert in the problem domain (though it wouldn’t hurt); you just need to acquire sufficient domain knowledge to work within that problem domain!
• Understanding data schemas (metadata, i.e., data about data), for forming questions and hypotheses:
• “What meta-data is available for the data set?”
• “Who created the data set? Where does it come from?”
• “What features does the data set have?”
• What “real-world” aspect does each column attempt to capture?
• For continuous data: what units are the values in?
• For categorical data: what different categories are represented, and what do those mean?
• What is the possible range of values?
• “What terms do you not know or understand?”
• As you read through a data set (or anything, really), you should write down the terms and phrases you are not familiar with to look up later. This will discourage you from (inaccurately) guessing a term’s meaning, and will help you distinguish between terms you have and have not yet clarified.
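The schema questions above (which features exist, which are continuous or categorical, what ranges and categories appear) can be partly answered in code. Below is a minimal sketch using only the standard library; the tiny inlined data set, its column names, and the helper `profile` are all hypothetical and chosen just for illustration.

```python
import csv
import io

# Hypothetical data set, inlined so the sketch is self-contained.
RAW = """city,temp_c,sky
Buffalo,-3.5,snow
Seattle,8.0,rain
Buffalo,1.0,cloudy
Seattle,10.5,rain
"""

def profile(rows):
    """Summarize each column: numeric range if every value parses as a number,
    otherwise the sorted set of categories."""
    summary = {}
    for column in rows[0].keys():
        values = [row[column] for row in rows]
        try:
            numbers = [float(v) for v in values]
            summary[column] = {"kind": "continuous",
                               "min": min(numbers), "max": max(numbers)}
        except ValueError:
            summary[column] = {"kind": "categorical",
                               "categories": sorted(set(values))}
    return summary

rows = list(csv.DictReader(io.StringIO(RAW)))
schema = profile(rows)
for column, info in schema.items():
    print(column, info)
```

A profile like this does not tell you the units or what a category means; those answers still have to come from the data set's documentation.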
Using data to answer questions
• What are the questions you want answered? What is your hypothesis?
• Consider the Global Burden of Disease study data set, produced by the Institute for Health Metrics and Evaluation, which details the burden of disease in the United States and around the world.
• “What is the worst disease in the United States?”
• Which disease causes the largest number of deaths in the United States?
• Which disease causes the most premature deaths in the United States?
• Which disease causes the most disability in the United States?
• What patterns and predictions can we make?
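To see why the vague question "What is the worst disease?" has to be sharpened, here is a minimal sketch in which the answer changes with the metric. The numbers and the condition names are made up for illustration only; they are NOT Global Burden of Disease figures, and the metric names are assumptions that simply mirror the three questions above.

```python
# Made-up numbers for three placeholder conditions (not real study data).
burden = {
    "condition A": {"deaths": 900, "premature_deaths": 200, "disability": 1500},
    "condition B": {"deaths": 300, "premature_deaths": 250, "disability": 2000},
    "condition C": {"deaths": 50, "premature_deaths": 40, "disability": 8000},
}

def worst_by(metric):
    """Return the condition with the largest value for the given metric."""
    return max(burden, key=lambda name: burden[name][metric])

for metric in ("deaths", "premature_deaths", "disability"):
    print(metric, "->", worst_by(metric))
```

Each metric picks a different "worst" condition here, which is the point: the question must name the measure before the data can answer it.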
February 23
Quiz
• One shot, no second time
• Close all your browsers before you start the quiz
• No chat on zoom or on mobile messaging media
• Video on when you are taking the quiz.
• Quiz: ONLY every other week.
• No quiz this week, but we’ll have one next week. This allows me to discuss the
concepts without a break with the option of the quiz either on Tuesday or Thursday
depending on lecture coverage.
• Also this allows you to focus on your projects without a break to study for the quiz.
Project 1 – Phase 1
• Choose data that you know something about.
• It is alright to change data sets/ideas now rather than later.
• Form relevant questions and hypotheses. You can still redefine and add questions. (That does NOT mean you’ll be regraded for Phase 1. Once a phase is graded you can always improve it, but there is no regrade for that phase. Redoing a phase may, however, help you score better in the next phases.)
• Look at “data cleaning” tasks: numpy, pandas, data frames, etc. Data processing is also considered in this phase.
• Then come EDA, analysis, visualization, etc. Make sure your current data is amenable to analysis by application of ML and statistical models.
• “Having more observations (rows) is more significant than having hundreds of features (cols).”
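As a rough idea of what "data cleaning" operations look like, here is a minimal sketch in plain Python so that it runs without extra installs; in the project you would normally do the same steps with pandas. The messy sample data, the column names, and the helper `clean` are hypothetical.

```python
import csv
import io

# Hypothetical messy data: stray whitespace, a missing value,
# a non-numeric entry, and a duplicated row.
RAW = """name,age,height_cm
 Ana ,34,170
Bo,,182
Cy,forty,165
 Ana ,34,170
Di,29, 158
"""

def clean(rows):
    """Trim whitespace, convert numeric columns, and drop incomplete or duplicate rows."""
    cleaned, seen = [], set()
    for row in rows:
        name = row["name"].strip()
        try:
            age = int(row["age"].strip())
            height = float(row["height_cm"].strip())
        except ValueError:
            continue  # missing or non-numeric value: drop the row
        record = (name, age, height)
        if name and record not in seen:
            seen.add(record)
            cleaned.append({"name": name, "age": age, "height_cm": height})
    return cleaned

rows = list(csv.DictReader(io.StringIO(RAW)))
tidy = clean(rows)
print(len(rows), "raw rows ->", len(tidy), "clean rows")
```

Dropping rows is only one possible policy; depending on your questions, imputing a missing value may be the better choice, and the choice should be documented in the notebook.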
Preferred way to communicate with TAs
• For questions related to the project: use Piazza.
• If it concerns your data: use private communication on Piazza.
• If it is something of a personal nature: communicate with me or the TA by email.
Plan for project 2
• Graph data?
February 25
Not much!
• I saw some fantastic explorations from some students!
• Can you locate the submission link on ublearns? Do it now. I don’t want panic messages on the due date.
• There is so much cleaning and pre-processing that you can do! People are not “reading” the book or online tutorials.
• Having numeric data helps in downstream analytics and modeling!
• Consider the downstream processes/tasks when selecting your data set and formulating your questions and hypotheses.
• Finally, if you’ve been lazy, start your Project 1 NOW.
How many operations?
• Min, max? What operations?
• A minimum is given, but the idea is to clean and pre-process the data.
• Do not turn in messy data just because you are done with 10 operations!
March 2
Quiz
• No tricks please.
• Do the right thing.
• Come prepared.
• Learn.
• I have a dashboard of what’s going on during the lecture.
• Only ONE attempt.
Project
• You have every right to kick out a non-contributing team member. Be brave.
• I will ask the TAs to give a 0 to the non-working, deadbeat team member. Please let us know. How?
• TAs, note: you can go back and give the non-contributing member a 0, and move him/her out of the team into a singleton team.
• Here is a method: the CRC card (Classes, Responsibilities, Collaborations).
Project name
Team: Member 1’s name, Member 2’s name
• Task 1: Who did it? Responsibility of member 1
• Task 2: Who did it? Responsibility of member 2
• Task 3: Who did it? Responsibility of member 1
• Task 4: Who did it? Responsibility of member 2
• Task 5: Who did it? Responsibility of member 2
• Task 6: Who did it? Responsibility of member 1
EDA: Exploratory data analysis
• Plots, charts
• Summary
• Other functions to evaluate the quality of the data
What will I learn in this course? From Lecture Feb2!
• You’ll learn about the basic data analytics process, from defining a problem to cleaning and processing data for downstream analytics (and application of algorithms) and knowledge extraction. (Figure adapted from the Doing Data Science book.)
[Figure labels: “structured data”, “tables.csv”, “relational database”]
• You’ll learn about big data infrastructures and algorithms. Managing large-scale data requires special structures and algorithms. (Figure from Lin and Dyer’s book.)
[Figure labels: “structured and unstructured data”, “key-value store”]
• You’ll learn about newer data challenges and methods to address them, for example streaming data and how to process and analyze it in a timely manner. (Examples of data streams are social media data and multi-modal enterprise data.)
[Figure labels: “structured, unstructured and streaming data”, “Data stream”, “messages”]
Project 2
• 50 points
• 15 points for installing the Hadoop environment
• 15 points for a basic MR job (I will let you know the job)
• 20 points for another MR job
• Choose a data set on the authorship of publications (IEEE, ASEE, History, Neurology) in a given domain. (Ahmad and Nitya, please note, you cannot make up your own project and data.)
• Big data > 1 GB
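To give a feel for what an MR job over authorship data could look like, here is a minimal in-memory sketch of the map, shuffle, and reduce steps. It is not the assigned job (that will be announced) and it does not use Hadoop; the publication records and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical publication records: (title, list of authors).
PUBLICATIONS = [
    ("Paper one", ["Lee", "Garcia"]),
    ("Paper two", ["Garcia"]),
    ("Paper three", ["Lee", "Okafor", "Garcia"]),
]

def map_phase(record):
    """Emit an (author, 1) pair for every author of one publication."""
    _title, authors = record
    for author in authors:
        yield author, 1

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Sum the counts collected for one author."""
    return key, sum(values)

pairs = [pair for record in PUBLICATIONS for pair in map_phase(record)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)
```

On a real cluster the mapper and reducer run on different machines over data far larger than memory; only the two small functions carry over unchanged in spirit.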
March 4
Sign up with the TAs.
• If you don’t sign up and get approval for your topic, you will not be graded!
• Attend the TAs’ office hours.
• Work around religious days. I have given a generous number of days for each phase.
• Chen Yuan: chyuan@buffalo.edu
• Ping Yu: pingyu@buffalo.edu
• Chen Xu: chenxu@buffalo.edu
What to do with project sign up?
• The first 25 who signed up with Ping stay with Ping.
• The first 25 who signed up with Jiayi move to Chen Xu.
• The rest of the people who signed up with Ping and Jiayi move to Chen Yuan’s signup sheet.
• Please do it now.


About the quiz?
• Please READ ahead; prepare; keep up with lectures.
• Many of you learned that the quizzes cannot be answered with a Google search.
• This is one of many quizzes.
• Overall grading is relative to the performance of the class, so a score of 60 may be good.
• No quiz next week, but we have one the week after next.
Reading
• Thanks to Josh for asking this question.
• See the slide shown next from Feb 23!
• I started off Lin and Dyer by stating what I am going to cover from that
book.
Lin and Dyer’s text (from Feb 23 lecture slides)
• Chapter 1: Please read; it sets the context for MR
• Chapter 2: MR Basics: analysis/walkthrough of a sample problem
• Chapter 3: MR Algorithm Design (up to p. 60)
• Chapter 4: Inverted index for text retrieval (up to section 4.4 inclusive)
• Chapter 5: Graph algorithms (PageRank and other classical algorithms)
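As a preview of the inverted index topic, here is a minimal single-machine sketch written in a map/reduce style: the map step emits a posting per distinct term in a document, and the reduce step sorts each term's postings. It is a simplified illustration under my own assumptions, not the book's code; the two tiny documents and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical document collection: document id -> text.
DOCS = {
    "d1": "big data needs big ideas",
    "d2": "data beats ideas",
}

def map_phase(doc_id, text):
    """Emit (term, (doc_id, term_frequency)) for each distinct term in one document."""
    freqs = defaultdict(int)
    for term in text.lower().split():
        freqs[term] += 1
    for term, tf in freqs.items():
        yield term, (doc_id, tf)

def build_index(docs):
    """Group postings by term (shuffle), then sort each postings list (reduce)."""
    postings = defaultdict(list)
    for doc_id, text in docs.items():
        for term, posting in map_phase(doc_id, text):
            postings[term].append(posting)
    return {term: sorted(plist) for term, plist in postings.items()}

index = build_index(DOCS)
for term in sorted(index):
    print(term, index[term])
```

Looking up a query term is then a dictionary access, for example `index["data"]` lists every document containing "data" together with how often it occurs there.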
About the project EDA: Phase 2
• https://www.itl.nist.gov/div898/handbook/eda/eda.htm
• https://www.itl.nist.gov/div898/handbook/eda/section1/eda11.htm
• Let’s go through this: once again, we would like an elegant, cohesive presentation of EDA and not a hodge-podge of items!
• Exploratory data analysis is about understanding the data.
• “Useful” plots and charts: don’t fill your explorations with plots and charts; include only useful ones.
• Statistics: search for “statistics” and find out what is meant by it.
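As one concrete reading of "statistics" in EDA, here is a minimal sketch that computes a few common summary statistics for a single numeric column with Python's standard `statistics` module. The sample values are made up, and which statistics are worth reporting depends on your data and questions.

```python
import statistics

# Hypothetical measurements for one numeric column.
values = [2, 4, 4, 4, 5, 5, 7, 9]

summary = {
    "count": len(values),
    "min": min(values),
    "max": max(values),
    "mean": statistics.mean(values),
    "median": statistics.median(values),
    # Population standard deviation (divides by n, not n - 1).
    "population_stdev": statistics.pstdev(values),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

A summary like this complements the plots rather than replacing them: two columns can share the same mean and standard deviation and still look very different when charted.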
