
Bangabandhu Sheikh Mujibur Rahman Digital University, Bangladesh

Kaliakair, Gazipur

Week 02 Tutorial: Data Science Process

1 Suppose your task as a data scientist at a university is to design a data mining system to
examine the university course database, which contains the following information: the
name, address, and status (e.g., undergraduate or graduate) of each student, the courses
taken, and the student's cumulative grade point average (GPA).

i. Describe the architecture you would choose. What is the purpose of each
component of this architecture?
ii. Describe the data science process you will follow.

2 Suppose we are building several machine learning models to analyze the performance of a
Formula One (F1) driver. Consider the following cases:

i) Model_1 consists of only two features, say the circuit name and the country name.
ii) Model_2 consists of 4 features: the two above plus, say, the weather and the max speed
of the car.
iii) Model_3 consists of 8 features: all of the above plus, say, the driver’s experience,
number of wins, car condition, and the driver’s physical fitness.
iv) Model_4 consists of 16 features: all of the above plus, say, the driver’s age, latitude,
longitude, the driver’s height, hair color, car color, the car company, and the driver’s
marital status.
v) Model_5 consists of 32 features.
vi) Model_6 consists of 64 features.
vii) Model_7 consists of 128 features.
viii) Model_8 consists of 256 features.
ix) Model_9 consists of 512 features.
x) Model_10 consists of 1024 features.

Assuming the training data remains constant, it is observed that as the number of features
increases, the accuracy tends to increase up to a certain threshold and then starts to
decrease. In the above example, accuracy of Model_1 < accuracy of Model_2 < accuracy of
Model_3, but this trend does not extend to the models with more than 8 features. You might
wonder why performance starts to degrade even though we are giving the model extra
information to learn from.

i. Identify the type of problem.


ii. How can you solve the problem?
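
The trend described in this question can be explored on synthetic data. The sketch below is only an illustration under stated assumptions: the data generator, the choice of a 1-nearest-neighbour classifier, the sample sizes, and all function names are hypothetical and are not part of the tutorial. Only the first 8 features carry signal; every further feature is pure noise. The training set is generated once and held constant, as in the question, and the exact accuracies depend on the random seed.

```python
import random


def make_dataset(n_samples, n_features, n_informative, rng):
    """Synthetic binary data: only the first n_informative features carry signal."""
    rows, labels = [], []
    for _ in range(n_samples):
        label = rng.randint(0, 1)
        shift = 0.5 if label == 1 else -0.5
        # Informative features are centred on +/-0.5; the rest are pure noise.
        row = [rng.gauss(shift if j < n_informative else 0.0, 1.0)
               for j in range(n_features)]
        rows.append(row)
        labels.append(label)
    return rows, labels


def predict_1nn(train_x, train_y, query, n_used):
    """Label of the nearest training point, using only the first n_used features."""
    best_label, best_dist = None, float("inf")
    for row, label in zip(train_x, train_y):
        dist = sum((row[j] - query[j]) ** 2 for j in range(n_used))
        if dist < best_dist:
            best_dist, best_label = dist, label
    return best_label


def accuracy_by_feature_count(feature_counts, n_train=80, n_test=80,
                              n_informative=8, seed=0):
    """Test accuracy of 1-NN for each feature count, on one fixed training set."""
    rng = random.Random(seed)
    widest = max(feature_counts)
    train_x, train_y = make_dataset(n_train, widest, n_informative, rng)
    test_x, test_y = make_dataset(n_test, widest, n_informative, rng)
    results = {}
    for k in feature_counts:
        hits = sum(predict_1nn(train_x, train_y, q, k) == y
                   for q, y in zip(test_x, test_y))
        results[k] = hits / n_test
    return results


if __name__ == "__main__":
    for k, acc in accuracy_by_feature_count([2, 4, 8, 16, 32, 64, 128]).items():
        print(f"{k:>4} features: accuracy = {acc:.2f}")
```

Running the script prints one accuracy per feature count, which can be compared against the pattern stated in the question.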

3 According to the World Health Organization (WHO), stroke is the second leading cause of
death globally, responsible for approximately 11% of total deaths.

Prepared by: Nurjahan Nipa, Lecturer, Department of Internet of Things & Robotics Engineering (IRE), BDU

This dataset has 5110 records and is used to predict whether a patient is likely to have a
stroke based on input parameters such as gender, age, various diseases, and smoking status.
Each row in the data provides relevant information about one patient.

1) id: unique identifier
2) gender: "Male", "Female" or "Other"
3) age: age of the patient
4) hypertension: 0 if the patient doesn't have hypertension, 1 if the patient has
hypertension
5) heart_disease: 0 if the patient doesn't have any heart diseases, 1 if the patient has a heart
disease
6) ever_married: "No" or "Yes"
7) work_type: "children", "Govt_job", "Never_worked", "Private" or "Self-employed"
8) Residence_type: "Rural" or "Urban"
9) avg_glucose_level: average glucose level in blood
10) bmi: body mass index
11) smoking_status: "formerly smoked", "never smoked", "smokes" or "Unknown"*
12) stroke: 1 if the patient had a stroke or 0 if not

*Note: "Unknown" in smoking_status means that the information is unavailable for this
patient

Now, identify the type of each attribute.
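
Before reasoning about attribute types, it can help to check that a record actually conforms to the field list above. The following is a minimal sketch, assuming records arrive as Python dictionaries; the names `ALLOWED_VALUES`, `NUMERIC_FIELDS` and `validate_record`, the non-negativity check, and the sample record are hypothetical and used only for illustration. It does not answer the attribute-type question itself.

```python
# Value sets copied from the field list above.
# "Govt_job" is the assumed spelling of the government-job category.
ALLOWED_VALUES = {
    "gender": {"Male", "Female", "Other"},
    "hypertension": {0, 1},
    "heart_disease": {0, 1},
    "ever_married": {"No", "Yes"},
    "work_type": {"children", "Govt_job", "Never_worked", "Private",
                  "Self-employed"},
    "Residence_type": {"Rural", "Urban"},
    "smoking_status": {"formerly smoked", "never smoked", "smokes", "Unknown"},
    "stroke": {0, 1},
}

# Fields described only as measurements, with no fixed value set.
NUMERIC_FIELDS = ("age", "avg_glucose_level", "bmi")


def validate_record(record):
    """Return a list of problems found in one patient record (empty if none)."""
    problems = []
    for field, allowed in ALLOWED_VALUES.items():
        value = record.get(field)
        if value not in allowed:
            problems.append(f"{field}: unexpected value {value!r}")
    for field in NUMERIC_FIELDS:
        value = record.get(field)
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            problems.append(f"{field}: expected a number, got {value!r}")
        elif value < 0:
            # Assumption: none of these measurements can be negative.
            problems.append(f"{field}: negative value {value!r}")
    return problems


# A made-up record for illustration only; it is not taken from the dataset.
sample = {
    "id": 1, "gender": "Female", "age": 54, "hypertension": 0,
    "heart_disease": 0, "ever_married": "Yes", "work_type": "Private",
    "Residence_type": "Urban", "avg_glucose_level": 98.5, "bmi": 27.1,
    "smoking_status": "Unknown", "stroke": 0,
}
print(validate_record(sample))
```

For the sample record the printed list is empty; changing `smoking_status` to a value outside the listed set would produce one reported problem.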

4 Consider the figure below.

Here, ETL is the process of extracting data from multiple sites, transforming the data to
meet your requirements, and then loading it into a target storage system.


Briefly describe the characteristics of a data warehouse, a data mart, and a data lake.
Identify each type in the aforementioned figure.
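
The extract, transform, load sequence described in this question can be sketched with the Python standard library alone. This is a minimal illustration under stated assumptions, not the pipeline shown in the figure: the two CSV extracts, the `students` table, its column names, and the rule of dropping rows with a missing GPA are all hypothetical, and an in-memory SQLite database stands in for the target storage system.

```python
import csv
import io
import sqlite3

# Two hypothetical source extracts, standing in for data from "multiple sites".
SITE_A_CSV = "student_id,name,gpa\n1,Ada,3.7\n2,Lin,3.1\n"
SITE_B_CSV = "student_id,name,gpa\n3,Omar,\n4, Eva ,3.9\n"


def extract(csv_text):
    """Extract: read raw rows from one source as dictionaries of strings."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def transform(rows, site):
    """Transform: drop rows with a missing GPA, fix types, tag the source site."""
    cleaned = []
    for row in rows:
        if not row["gpa"].strip():
            continue  # assumed rule: a record without a GPA is not loaded
        cleaned.append((int(row["student_id"]), row["name"].strip(),
                        float(row["gpa"]), site))
    return cleaned


def load(conn, rows):
    """Load: write the cleaned rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS students ("
        "student_id INTEGER PRIMARY KEY, name TEXT, gpa REAL, source_site TEXT)"
    )
    conn.executemany("INSERT INTO students VALUES (?, ?, ?, ?)", rows)
    conn.commit()


def run_etl():
    """Run extract -> transform -> load for every source and return the target."""
    conn = sqlite3.connect(":memory:")
    for site, text in (("A", SITE_A_CSV), ("B", SITE_B_CSV)):
        load(conn, transform(extract(text), site))
    return conn


if __name__ == "__main__":
    target = run_etl()
    for row in target.execute("SELECT * FROM students ORDER BY student_id"):
        print(row)
```

Keeping the three stages as separate functions mirrors the three letters of ETL, so each stage can be replaced independently, for example by pointing `load` at a different target system.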

5 Consider a use case where a machine learning model has to analyze photos and identify the
ones that contain dogs. If the model was trained on a data set in which the majority of
photos show dogs outside in parks, it may learn to use grass as a feature for
classification, and may not recognize a dog inside a room.

i. What kind of problem is this?


ii. How can we solve this kind of problem?
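
The scenario in this question can be reproduced with a toy example. The sketch below is an assumption-laden illustration, not a real image classifier: each photo is reduced to two hypothetical binary features (`grass`, `fur`), the eight training photos are made up, and the "model" is a one-feature rule that keeps whichever feature agrees with the label most often on the training set.

```python
# Each training photo: (features, label), where label 1 means "contains a dog".
# In this made-up training set every dog photo was taken on grass.
TRAIN = [
    ({"grass": 1, "fur": 1}, 1),  # dog in a park
    ({"grass": 1, "fur": 1}, 1),  # dog in a park
    ({"grass": 1, "fur": 1}, 1),  # dog in a park
    ({"grass": 1, "fur": 1}, 1),  # dog in a park
    ({"grass": 0, "fur": 1}, 0),  # cat indoors
    ({"grass": 0, "fur": 1}, 0),  # cat indoors
    ({"grass": 0, "fur": 0}, 0),  # empty room
    ({"grass": 0, "fur": 0}, 0),  # empty room
]


def train_stump(samples, feature_names):
    """Pick the single binary feature that matches the label most often."""
    best_feature, best_hits = None, -1
    for name in feature_names:
        hits = sum(1 for features, label in samples if features[name] == label)
        if hits > best_hits:
            best_feature, best_hits = name, hits
    return best_feature


def predict(stump_feature, features):
    """Predict 'dog' (1) exactly when the chosen feature is present."""
    return features[stump_feature]


chosen = train_stump(TRAIN, ["grass", "fur"])
indoor_dog = {"grass": 0, "fur": 1}

print(chosen)                       # grass (8 of 8 training hits, fur gets 6)
print(predict(chosen, indoor_dog))  # 0, so the indoor dog is missed
```

On this training set `grass` separates the two classes perfectly, so the rule relies on it and then misclassifies a dog photographed inside a room.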

6 Outline the major research challenges of data mining in one specific application domain,
such as stream/sensor data analysis, spatiotemporal data analysis, or bioinformatics.

7 What are the major challenges of mining a huge amount of data (such as billions of tuples)
in comparison with mining a small amount of data (such as a data set of a few hundred tuples)?

8 Briefly describe the following advanced database systems and applications: object-
relational databases, spatial databases, text databases, multimedia databases, the World
Wide Web.
