Feedback From Last Class, Emails, Piazza Postings, Etc
• Updated due dates for Project 1: No late submissions allowed. With every phase
submission, also submit the updated previous phase. TAs, please note: do not regrade any
previous phases, only the current phase. Once a grade is assigned for a phase, it is final.
• Phase 1 (DS tasks 1, 2, 3, and 4): 2/27
• Phase 2 (DS task 5): 3/7
• Phase 3 (DS tasks 6 and 7): 3/20
• Phase 4 (DS tasks 9 and 10): 3/27
• Phase 5 (DS task 8): 4/3
You have to know problem solving and programming to do well in this class.
Problem statement (in your own words)
• A representative but concise title
• What is your project about? Topics/subject area, core issue addressed
• Why is it important? Why should you care?
• Who will it help? Who are the beneficiaries, and how? What is your hypothesis?
• There are a variety of processes for capturing events as data, each of which has its own limitations and assumptions. The primary modes of data collection fall into the following categories:
• Sensors: The volume of data being collected by sensors has increased dramatically in the last decade. Sensors that automatically detect and record information, such as pollution sensors that measure air quality, are now entering the personal data management sphere (think of Fitbits or other step counters). Assuming these devices have been properly calibrated, they offer a reliable and consistent mechanism for data collection.
• Surveys: Data that is less externally measurable, such as people’s opinions or personal histories, can be gathered from surveys. Because surveys depend on individuals’ self-reporting of their behavior, the quality of data may vary (across surveys, or across individuals). Depending on the domain, people may have poor recall (e.g., people don’t remember what they ate last week) or have incentives to respond in a particular way (e.g., people may over-report healthy behaviors). The biases inherent in survey responses should be recognized and, when possible, adjusted for in your analysis.
• Record keeping: In many domains, organizations use both automatic and manual processes to keep track of their activities. For example, a hospital may track the length and result of every surgery it performs (and a governing body may require that hospital to report those results). The reliability of such data will depend on the quality of the systems used to produce it. Scientific experiments also depend on diligent record keeping of results.
• Secondary data analysis: Data can be compiled from existing knowledge artifacts or measurements, such as counting word occurrences in a historical text (computers can help with this!).
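As a rough illustration of the secondary data analysis idea above, here is a minimal Python sketch that counts word occurrences in a short passage. The sample text (the opening sentence of the Gettysburg Address), the function name count_words, and the tokenization rule are all assumptions chosen for this example and do not come from the slides; a real analysis would need to decide how to treat hyphens, numbers, and spelling variants.

```python
import re
from collections import Counter


def count_words(text: str) -> Counter:
    """Return a Counter mapping each lowercase word to its number of occurrences."""
    # Lowercase the text, then pull out runs of letters.
    # An apostrophe is kept only when it sits inside a word (e.g. "don't").
    words = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    return Counter(words)


# Illustrative sample only: the opening sentence of the Gettysburg Address.
sample = (
    "Four score and seven years ago our fathers brought forth on this "
    "continent, a new nation, conceived in Liberty, and dedicated to the "
    "proposition that all men are created equal."
)

counts = count_words(sample)
print(counts["and"])          # occurrences of one specific word
print(counts.most_common(5))  # the five most frequent words
```

To analyze a longer document, the same function could be applied to the contents of a text file read with open(path, encoding="utf-8").read(), where path is whatever file you have on hand.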
Finding data
• Government publications:
  • U.S. government’s open data: https://www.data.gov
  • Government of Canada open data: https://open.canada.ca/en/open-data
  • Open Government Data Platform India: https://data.gov.in
  • City of Seattle open data portal: https://data.seattle.gov
  • City of Buffalo open data: https://data.buffalony.gov/
• News and journalism:
  • New York Times Developer Network: https://developer.nytimes.com
  • FiveThirtyEight: Our Data: https://data.fivethirtyeight.com
• Social networks and media organizations:
  • Twitter developer platform: https://developer.twitter.com/en/docs
  • Google APIs Explorer: https://developers.google.com/apis-explorer/
• Online communities:
  • Kaggle: “the home of data science and machine learning”: https://www.kaggle.com
  • Socrata: data as a service platform: https://opendata.socrata.com
  • UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/index.php
  • /r/DataSets: https://www.reddit.com/r/datasets/
• From original surveys by NPOs:
  • Pew research: https://www.pewresearch.org/
• Scientific Research: