postings, etc.
• Updated due dates for Project 1: No late submissions allowed. With every phase submission, also submit the updated previous phase. TAs, please note that you do not regrade any previous phases, only the current phase. Once a grade is assigned for a phase, it is final.
• Phase 1: (DS tasks 1, 2, 3 and 4): 2/27
• Phase 2: (DS task 5): 3/7 3/14
• Phase 3: (DS tasks 6, 7): 3/20
• Phase 4: (DS tasks 9, 10): 3/27
• Phase 5: (DS task 8): 4/3
You have to know problem solving and programming to do well in this class.
•There are a variety of processes for capturing events as data, each of which has its own limitations and assumptions. The primary modes of data
collection fall into the following categories:
•Sensors: The volume of data being collected by sensors has increased dramatically in the last decade. Sensors that automatically detect and record
information, such as pollution sensors that measure air quality, are now entering the personal data management sphere (think of FitBits or other step
counters). Assuming these devices have been properly calibrated, they offer a reliable and consistent mechanism for data collection.
•Surveys: Data that is less externally measurable, such as people’s opinions or personal histories, can be gathered from surveys. Because surveys are
dependent on individuals’ self-reporting of their behavior, the quality of data may vary (across surveys, or across individuals). Depending on the
domain, people may have poor recall (e.g., people don’t remember what they ate last week) or have incentives to respond in a particular way (e.g.,
people may over-report healthy behaviors). The biases inherent in survey responses should be recognized and, when possible, adjusted for in your
analysis.
•Record keeping: In many domains, organizations use both automatic and manual processes to keep track of their activities. For example, a hospital
may track the length and result of every surgery it performs (and a governing body may require that hospital to report those results). The reliability of
such data will depend on the quality of the systems used to produce it. Scientific experiments also depend on diligent record keeping of results.
•Secondary data analysis: Data can be compiled from existing knowledge artifacts or measurements, such as counting word occurrences in a
historical text (computers can help with this!).
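As a small illustration of secondary data analysis, here is a minimal sketch of counting word occurrences in a text. The function name `count_words` and the sample sentence are made up for this example; a real historical text would typically be read from a file, and the simple tokenization used here (lowercase letters and apostrophes only) is an assumption that may not suit every text.

```python
import re
from collections import Counter


def count_words(text: str) -> Counter:
    """Count case-insensitive word occurrences in a text."""
    # Lowercase the text, then pull out runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


# Hypothetical sample text standing in for a historical document.
sample = "The quick brown fox jumps over the lazy dog. The dog sleeps."

word_counts = count_words(sample)
# Show the two most frequent words with their counts.
print(word_counts.most_common(2))
```

`Counter.most_common(n)` returns the `n` most frequent words, which is often a quick first look at what a text is about.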
Finding data
• Government publications:
  • U.S. government’s open data: https://www.data.gov
  • Government of Canada open data: https://open.canada.ca/en/open-data
  • Open Government Data Platform India: https://data.gov.in
  • City of Seattle open data portal: https://data.seattle.gov
  • City of Buffalo open data: https://data.buffalony.gov/
• News and journalism:
  • New York Times Developer Network: https://developer.nytimes.com
  • FiveThirtyEight: Our Data: https://data.fivethirtyeight.com
• Scientific Research:
• Social networks and media organizations:
  • Twitter developer platform: https://developer.twitter.com/en/docs
  • Google APIs Explorer: https://developers.google.com/apis-explorer/
• Online communities:
  • Kaggle: “the home of data science and machine learning”: https://www.kaggle.com
  • Socrata: data as a service platform: https://opendata.socrata.com
  • UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/index.php
  • /r/DataSets: https://www.reddit.com/r/datasets/
• From original surveys by NPOs:
  • Pew Research: https://www.pewresearch.org/
• Task 6: EDA (exploratory data analysis)
• Plots, charts
• Summary
• Other functions to evaluate the quality of the data
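The three EDA ingredients listed above (a plot, a summary, and a data-quality check) can be sketched with only the Python standard library. This is an illustrative sketch, not the required approach for Task 6: the `readings` records, the field names `station` and `pm25`, and the helper functions `summarize` and `text_histogram` are all hypothetical, and a real project would more likely use a plotting and data-frame library.

```python
import statistics
from collections import Counter

# Hypothetical sensor records; None marks a missing measurement.
readings = [
    {"station": "A", "pm25": 12.0},
    {"station": "A", "pm25": 15.5},
    {"station": "B", "pm25": None},
    {"station": "B", "pm25": 9.5},
    {"station": "C", "pm25": 48.0},
]


def summarize(records, field):
    """Summary statistics plus a missing-value count for one numeric field."""
    values = [r[field] for r in records if r[field] is not None]
    return {
        "count": len(values),
        "missing": len(records) - len(values),  # simple data-quality check
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
    }


def text_histogram(records, field):
    """A crude text bar chart of how often each value of a field occurs."""
    counts = Counter(r[field] for r in records)
    return "\n".join(f"{key}: {'#' * n}" for key, n in sorted(counts.items()))


summary = summarize(readings, "pm25")
print(summary)
print(text_histogram(readings, "station"))
```

In this made-up data the mean is well above the median, which is the kind of hint (a possible outlier at station C) that EDA is meant to surface before any modeling.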
What will I learn in this course? From Lecture Feb 2!
• You’ll learn about the basic data analytics process, from defining a problem to cleaning and processing data for downstream analytics (including the application of algorithms) and knowledge extraction. (Figure adapted from the book Doing Data Science.)
• You’ll learn about big data infrastructures and algorithms. Managing large-scale data requires special structures and algorithms. (Figure from Lin and Dyer’s book.)
• You’ll learn about newer data challenges and methods to address them. One example is streaming data and how to process and analyze it in a timely manner. (Examples of data streams are social media data and multi-modal enterprise data.)