Homework 02 ISE-291-T222

ISE-291: Homework 02
ISE-291: Introduction to Data Science
Term 222
Homework 02 2023
Covers: Topics 4-5 Material
Deadline: 10 March 2023, 11:59 PM
Homework Presentation & Submission:

❖ You must attempt only the first three problems (A, B, and C), while
the others (D & E) are for practice purposes.
❖ Provide the output solution for each sub-problem (part). Otherwise,
50% marks will be deducted.
❖ Every sub-problem (part) should be answered on a DIFFERENT
CELL as given in the template.
❖ EVERY CELL should have a problem and part number clearly
written in the first line.
❖ All cells of your homework should be in CHRONOLOGICAL
order. One cell per sub-problem.
❖ Any text should be written as a comment in the code cell. Do NOT
modify the code cell into the markdown cell.
❖ Submit the entire homework as ONE single .ipynb document.
❖ Do NOT add/delete any cell in the given template.
Problem A [40 Marks]: Consider data given in “HW2_DataA” Microsoft Excel Comma Separated
Values (.CSV) file.
Using the given data, do the following tasks:
A-1. [3 marks]: Read the data (Assume the 1st row in HW2_DataA contains the column headings).
Then display:
(i) 10 randomly selected rows
(ii) 10 last rows for all columns.
A-2. [5 marks]: Create a new dataframe (let’s say: ndf) by selecting the first 20 rows from column-2
(Type_of_Payment) to column-6 (Late_delivery) and then sort the second column of the new
dataframe in descending order and the third column of the new dataframe in ascending order.
A-3. [2 marks]: Considering the new dataframe from part A-2, display the separate statistical
summary for numerical and non-numerical columns.
A-4. [5 marks]: From the original dataset “HW2_DataA”, how many customers below the age of 40
ordered a smart watch using the first class shipping mode?
A-5. [5 marks]: Add a new column (Order_Item_Discount_New) by converting
“Order_Item_Discount” column values from float to nearest integer using the apply command
and custom function. [Note: using lambda function is not allowed]
A-6. [2 marks]: Show the histogram of Shipping_Mode differentiated by Late_delivery column.
A-7. [3 marks]: Select the rows having Order_Item_Discount greater than 30% and plot their
relationship to Sales_per_customer. What can you conclude from the plot?
A-8. [10 marks]: Draw a plot showing the payment method preference (Type_of_Payment) based on
the Customer_Segment (your graph must show the count of the different types of
Customer_Segment differentiated by Type_of_Payment) and answer the following questions:
(i) Which customer segment has the highest number of DEBIT type?
(ii) Which customer segment prefers PAYMENT over the other types?
(iii) Roughly speaking, what is the most common payment method?
A-9. [5 marks]: Make boxplots for the Sales_per_customer of the COMPLETE and PENDING orders
of Smart watch. Compare the two boxplots and explain how the median Sales per customer is
related to the Order_Status.
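As a rough illustration of two of the techniques named in Problem A (a mixed-direction sort and apply with a named function instead of a lambda), here is a minimal sketch. It runs on a small made-up table, not on HW2_DataA; the column names mirror the assignment, but the values and the helper name to_nearest_int are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy data: column names mirror the assignment, values are made up.
toy = pd.DataFrame({
    "Type_of_Payment": ["DEBIT", "CASH", "DEBIT", "PAYMENT"],
    "Order_Item_Discount": [12.4, 30.6, 7.8, 18.2],
    "Sales_per_customer": [250.0, 99.9, 250.0, 120.5],
})


def to_nearest_int(value):
    """Round a float to the nearest integer (a named function, not a lambda)."""
    return int(round(value))


# Apply the custom function element by element to one column.
toy["Order_Item_Discount_New"] = toy["Order_Item_Discount"].apply(to_nearest_int)

# Sort one column in descending order and another in ascending order.
sorted_toy = toy.sort_values(
    by=["Sales_per_customer", "Order_Item_Discount"],
    ascending=[False, True],
)
print(sorted_toy)
```

The ascending argument takes one flag per column listed in by, which is what allows the two sort directions to differ.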
Problem B [40 Marks]: Consider the data given in “HW2_DataB” Microsoft Excel (.csv) file and
described in Table 1. Note: Solve all the following questions using Python. Use the Pandas & Sklearn library for
all the following analyses.
Table 1. Data Description

Field              Description
ID                 The ID of the patient is automatically assigned
Gender             The Gender of the patient
Age                The recorded Age of the patient
hypertension       The patient hypertension diagnosis
heart_disease      The patient heart diagnosis
ever_married       The patient marital status
work_type          The type of work of the patient
Residence_type     The home location of the patient
avg_glucose_level  The recorded sugar level of the patient
bmi                The recorded Body Mass Index of the patient
smoking_status     The recorded smoking habits of the patient
stroke             If the patient had a heart stroke
has_kids           The recorded number of children of the patient
Using the given data do the following:
B-1. [3 marks]: Read and display the data. Identify the number of rows and columns. Does any
column have missing data? If yes, provide their name.
B-2. [2 marks]: Type Consistency: For each column, identify each field type and verify that each
column in Python is identified correctly. If there is any discrepancy, then indicate it.
B-3. [5 marks]: Filter noise: Looking at the data, some values in the numeric column (“age”) were
entered as less than 1 (by mistake). Fix the inconsistencies. Furthermore, find the unique categorical
values and remove unknowns (if any).
B-4. [7 marks]: Handling NaN values: Drop all columns containing 30% or more missing values.
Then impute the columns having missing values.
B-5. [5 marks]: Normalization/Transformation: Normalize all numeric columns to a mean of zero and
standard deviation of one and print only normalized columns.
B-6. [5 marks]: Encoding: Convert “work_type” using label encoder.
B-7. [5 marks]: Encoding: For the “ever_married,” convert it using binary values (0 and 1). Do not
drop any new column(s).
B-8. [8 marks]: General questions (write your answers in a jupyter notebook):
(i) When is it best to use a label encoder rather than one-hot encoding?
(ii) What are data cube aggregation and discretization?
(iii) Give a real-world example of direct and indirect data acquisition approaches.
(iv) Give a real-world example of structured data and unstructured data.
(v) Why is there a need to scale numerical data using a Min-Max scaler?
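The preprocessing steps listed in Problem B (dropping sparse columns, imputing, standardizing, binary encoding) can be sketched on a small made-up table. This is only one possible approach under stated assumptions: the values below are invented, the median is an arbitrary choice of imputation strategy, and the new column names age_z and ever_married_bin are hypothetical. It uses pandas alone; the assignment itself asks for Pandas and Sklearn on HW2_DataB.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: names follow Table 1, values are made up.
toy = pd.DataFrame({
    "age": [30.0, 45.0, np.nan, 60.0],        # 25% missing
    "bmi": [22.0, np.nan, np.nan, 31.0],      # 50% missing
    "ever_married": ["Yes", "No", "Yes", "Yes"],
})

# Share of missing values per column; keep columns below the 30% threshold.
missing_share = toy.isna().mean()
kept = toy.loc[:, missing_share < 0.30].copy()

# Impute the remaining numeric gaps (median is an assumed choice of strategy).
kept["age"] = kept["age"].fillna(kept["age"].median())

# Standardize to mean 0 and standard deviation 1 (population std, ddof=0).
kept["age_z"] = (kept["age"] - kept["age"].mean()) / kept["age"].std(ddof=0)

# Encode a two-valued column as 0/1 while keeping the original column.
kept["ever_married_bin"] = kept["ever_married"].map({"No": 0, "Yes": 1})
print(kept)
```

Using ddof=0 matches the population standard deviation that sklearn's StandardScaler uses; the pandas default (ddof=1) would give slightly different values.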
Problem C [20 Marks]: Consider the UCI Machine Learning Repository:
https://archive.ics.uci.edu/ml/datasets.php
Access the repository and then do the following:
C-1. [5 marks]: Find the “Abalone” Data Set and download the data file. The data file is in the “Data
Folder” with “.data” extension. To customize and limit your search, you may use the filters on
the left side of the page with the following criteria:
Attribute Type: mixed; Data Type: multivariate; # Attributes: less than 10.
C-2. [5 marks]: Identify the columns’ headers of the data. (Tip: The columns’ headers are under the
“Attribute Information” section in the “Abalone Data Set” page.)
C-3. [5 marks]: Read the data into a Pandas dataframe. Add the list of columns’ headers to your
“abalone.data” file. (Tip: Create a list of columns’ headers and then add it as data headers to
your data file)
C-4. [5 marks]: Display a statistical summary for each of the numerical and non-numerical data.
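Reading a headerless file and attaching column names, as in C-3 and C-4, might look roughly like the sketch below. It deliberately does not use the Abalone file: the in-memory text and the header names (category, measure, items) are placeholders made up for illustration.

```python
import io

import pandas as pd

# Stand-in for a headerless ".data" file; the contents are made up.
raw = io.StringIO("A,0.5,10\nB,0.7,12\nA,0.6,9\n")

# Hypothetical headers; for real data, take them from the data set's documentation.
headers = ["category", "measure", "items"]

# header=None says the file has no header row; names supplies the headers.
df = pd.read_csv(raw, header=None, names=headers)

# Separate summaries for numerical and non-numerical columns.
numeric_summary = df.describe()
non_numeric_summary = df.describe(exclude="number")
print(numeric_summary)
print(non_numeric_summary)
```

For a file on disk, the io.StringIO object would be replaced by the file path.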
Problem D: (Practice only. No submission required.)
Consider the following Python methods, available in native Python, or pandas/seaborn libraries:
D-1. pandas.DataFrame()
D-2. pandas.read_csv()
D-3. pandas.DataFrame.head()
D-4. pandas.DataFrame.index
D-5. pandas.DataFrame.columns
D-6. pandas.DataFrame.describe()
D-7. pandas.DataFrame.info()
D-8. pandas.DataFrame.loc()
D-9. pandas.DataFrame.iloc()
D-10. pandas.DataFrame.sort_values()
D-11. pandas.DataFrame.isin()
D-12. pandas.DataFrame.value_counts()
D-13. pandas.DataFrame.apply()
D-14. pandas.DataFrame.applymap()
D-15. seaborn.relplot()
D-16. seaborn.pairplot()
D-17. seaborn.catplot()
Answer the following questions for each of the above methods:
(i) State the purpose/usage of the method/attribute.

(ii) List all the arguments of the method.
(iii) Classify the arguments as positional or keyword arguments.
(iv) Write the default values for each of the keyword arguments.
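One way to explore questions (ii) to (iv) programmatically, alongside help() and the online documentation, is Python's standard inspect module. The sketch below is a convenience only, and the helper name describe_arguments is hypothetical. It works for callables such as pandas.DataFrame.head; attributes such as pandas.DataFrame.index are not callables and have no signature to inspect.

```python
import inspect

import pandas as pd


def describe_arguments(func):
    """Return (name, kind, has_default, default) for each parameter of a callable."""
    rows = []
    for name, param in inspect.signature(func).parameters.items():
        has_default = param.default is not inspect.Parameter.empty
        default = param.default if has_default else None
        # kind is e.g. POSITIONAL_OR_KEYWORD or KEYWORD_ONLY
        rows.append((name, param.kind.name, has_default, default))
    return rows


for row in describe_arguments(pd.DataFrame.head):
    print(row)
```

Note that Python reports many parameters as POSITIONAL_OR_KEYWORD, meaning they can be passed either way.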
Problem E: (Practice only. No submission required.)
Consider the following Python methods, available in native Python, or pandas/sklearn libraries:
E-1. pandas.DataFrame.index
E-2. pandas.DataFrame.columns
E-3. pandas.DataFrame.dtypes
E-4. pandas.DataFrame.select_dtypes()
E-5. pandas.DataFrame().apply()
E-6. pandas.DataFrame.map()
E-7. pandas.get_dummies()
Answer the following questions for each of the above methods:
(i) State the purpose and usage of the above methods.

(ii) List all the arguments of the method.
(iii) Classify the arguments as positional or keyword arguments.
(iv) Identify the data types for each of the arguments.
(v) Write the default values for each of the arguments.
Consider the following python classes, available in sklearn library:
E-8. sklearn.preprocessing.LabelEncoder.fit()
E-9. sklearn.preprocessing.LabelEncoder.transform()
E-10. sklearn.preprocessing.StandardScaler.transform()
Answer the following questions for each of the above classes:
(i) State the purpose and usage of the above methods.

(ii) List all the methods and properties.
(iii) Discuss the .fit() method.
(iv) Discuss the .transform() method.
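For orientation on the .fit() and .transform() questions above, here is a minimal sketch of the usual two-step pattern with LabelEncoder and StandardScaler. The input values are made up for illustration; in general, fit learns something from the data (the sorted classes, or the mean and standard deviation) and transform applies what was learned.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Made-up categorical values, for illustration only.
labels = ["Private", "Govt", "Private", "Self"]

encoder = LabelEncoder()
encoder.fit(labels)                 # learns the sorted unique classes
codes = encoder.transform(labels)   # maps each label to its class index
print(encoder.classes_, codes)

# Made-up numeric column; StandardScaler expects a 2-D array (rows x features).
values = np.array([[1.0], [2.0], [3.0]])

scaler = StandardScaler()
scaler.fit(values)                  # learns the per-column mean and std
scaled = scaler.transform(values)   # subtracts the mean, divides by the std
print(scaler.mean_, scaled.ravel())
```

Both classes also offer fit_transform(), which performs the two steps in one call on the same data.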
☞ Note: You can use the following online references to answer the above questions:
♣ https://docs.python.org/3.8/library/functions.html#help
♦ https://docs.python.org/3/library/index.html
♥ https://pandas.pydata.org/pandas-docs/stable/index.html
♠ https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing
------------------------------------------- THE END -------------------------------------------
