
Assignment M1 [Group Work and Group Submission]:

Exploring Open Datasets | Effort 60-90 min


Purpose
The purpose of this assignment is to:

• Get a better understanding of the types of open datasets available for a thorough investigation.
• Investigate ways that a specific dataset can be used for exploratory and/or predictive analysis.
• Investigate whether a specific dataset exhibits any potential biases that can impact the results of
exploratory or predictive analysis.
• Describe why machine learning may or may not be a suitable approach to tackle questions
around a specific dataset.
• Describe how a specific chosen dataset, and the exploratory and predictive questions around it, can be of
impact.

Instructions
After going through all the required resources posted on Moodle under “Module 1”, write a report that addresses the
following eight questions. Refer to the analytic rubric provided at the end of the instructions. Each question is worth
8 points. The overall grade is the sum of grades taken over all questions.

1. Explore further websites for identifying open datasets (the recommended resources might give you
further clues if you need them):
a. Provide a summary of one promising website you encountered. Avoid websites where the data is too clean and
almost ready for a machine learning setup. An example of such a website to avoid is Kaggle. (8 pts)
b. Explain the significance of the datasets hosted there. (8 pts)
2. Following your explorations:
a. Identify one tabular dataset that you think would make a good candidate for machine learning and provide a
link to this dataset. Explain why your choice is a good candidate to demonstrate the data science pipeline. Restrict
your choice to a dataset with no more than ten or so features in total, and with at least two to three numerical
features and two to three categorical features. This is to keep further analyses on this dataset manageable. Note:
even if you find a dataset with more than 10 columns, you can bring it down to 10 by removing columns you think
are not important. (8 pts)
b. Describe what each row and each column in this dataset designate, with clear reference to the independent
variables and dependent variable(s). (8 pts)
c. Describe the ownership of the data, how it was generated, and whether further data scraping can help augment
the number of samples/records in this dataset. (8 pts)
d. Explain whether some of the data revolves around human subjects, in which case, explain how issues related to
privacy and confidentiality have (or have not) been addressed. (8 pts)
e. What ancillary (supplementary) data sources can you rely on to enrich the feature space of this chosen
dataset? Remark on the feasibility of enriching the feature space or, otherwise, on the difficulty of doing so. (8 pts)
3. Describe ways that this dataset can be explored for interesting patterns or causal inferences. (8 pts)

4. Describe ways that this dataset can be explored for interesting predictive questions. (8 pts)
5. Describe why the overarching problem this dataset can help tackle is of any impact. For example,
describe what would change, say, five years down the road, if any of the questions above are
addressed. (8 pts)

6. Why would machine learning possibly be a good fit for your chosen dataset/problem, and in what
ways can it outperform previous classical methods? Elaborate by referring to a select number of
references (say, at least two and at most five select references) approaching this from a classical,
non-ML viewpoint. (8 pts)

7. Machine learning is often used with data mined from administrative, social or clinical data sources
not collected as part of standard research procedures. As such, datasets might exhibit biases
associated with data generation or sampling processes.
a. What data bias issues might be associated with your chosen dataset? (8 pts)
b. What are ways that such biases can be mitigated? (8 pts)
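
As a quick aid for question 2a, the sketch below checks whether a candidate table meets the size guidelines stated there (no more than about ten features, at least two numerical and at least two categorical). It is only an illustration under stated assumptions: the sample data, the column names, and the helper names `profile_columns` and `meets_guidelines` are hypothetical, and a column is treated as numerical only if every non-empty value parses as a number, so dates and free text would be counted as categorical.

```python
import csv
import io


def is_number(value: str) -> bool:
    """Return True if the string parses as a float."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def profile_columns(rows: list[dict[str, str]]) -> dict[str, list[str]]:
    """Split column names into numerical and categorical.

    Assumption: a column is numerical only if all of its non-empty
    values parse as numbers; everything else counts as categorical.
    """
    numerical, categorical = [], []
    for name in rows[0].keys():
        values = [row[name] for row in rows if row[name] != ""]
        if values and all(is_number(v) for v in values):
            numerical.append(name)
        else:
            categorical.append(name)
    return {"numerical": numerical, "categorical": categorical}


def meets_guidelines(profile: dict[str, list[str]],
                     max_features: int = 10,
                     min_each: int = 2) -> bool:
    """Check the feature-count guidelines from question 2a."""
    n_num = len(profile["numerical"])
    n_cat = len(profile["categorical"])
    return (n_num + n_cat <= max_features
            and n_num >= min_each
            and n_cat >= min_each)


# Toy data for illustration only (not a real dataset).
SAMPLE_CSV = """age,income,hours,region,education
34,52000,40,North,Bachelor
51,,35,South,Master
27,31000,20,North,High school
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
profile = profile_columns(rows)
print(profile)
print("Meets 2a guidelines:", meets_guidelines(profile))
```

For a real dataset, the rows would come from the downloaded file rather than the inline sample, and any columns judged unimportant can be dropped before running the check.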

Category: Quality of information and critical thinking
1 pt (unsatisfactory): Information has little or nothing to do with the main topic or simply restates the
main concept. It does not advance discussion. Does not provide documentation for sources where applicable.
Does not respond to critical thinking questions posed where applicable.
2 pts (marginal): Information clearly relates to the main topic. No details and/or examples are given
despite being applicable. Responds to critical thinking questions but does not engage in premise
reflection, where applicable.
3 pts (good): Information clearly relates to the main topic. It provides at least one supporting detail or
example where applicable. It occasionally provides documentation where applicable. Some critical thinking
and reflection are demonstrated in discussion where applicable.
4 pts (exemplary): Information clearly relates to the main topic and adds new concepts and information. It
includes several supporting details and/or examples, where applicable. It consistently establishes source
documentation for ideas, where applicable. It enhances the critical thinking process consistently through
reflection, where applicable. Is a quality response that advances thoughts forward and promotes an
interesting debate.

Category: Professional language
1 pt (unsatisfactory): Professional vocabulary and writing style are not used.
2 pts (marginal): Professional vocabulary and writing style are used occasionally throughout the discussion.
3 pts (good): Professional vocabulary and writing style are used frequently throughout the discussion.
4 pts (exemplary): Professional vocabulary and writing style are used consistently throughout the discussion.
