ISE-291: Homework 02

ISE-291: Introduction to Data Science

Term 222

Homework 02 (2023)
Covers: Topics 4-5 Material
Deadline: 10 March 2023, 11:59 PM

Homework Presentation & Submission:


❖ You must attempt only the first three problems (A, B, and C), while
the others (D & E) are for practice purposes.
❖ Provide the output solution for each sub-problem (part). Otherwise,
50% marks will be deducted.
❖ Every sub-problem (part) should be answered on a DIFFERENT
CELL as given in the template.
❖ EVERY CELL should have a problem and part number clearly
written in the first line.
❖ All cells of your homework should be in CHRONOLOGICAL
order. One cell per sub-problem.
❖ Any text should be written as a comment in the code cell. Do NOT
convert a code cell into a markdown cell.
❖ Submit the entire homework as ONE single .ipynb document.
❖ Do NOT add/delete any cell in the given template.

Problem A [40 Marks]: Consider data given in “HW2_DataA” Microsoft Excel Comma Separated
Values (.CSV) file.

Using the given data, do the following tasks:

A-1. [3 marks]: Read the data (assume the 1st row in HW2_DataA contains the column headings). Then display:

(i) 10 randomly selected rows

(ii) the last 10 rows for all columns.
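
Hint: a minimal sketch of the calls involved in A-1, assuming pandas is available. The inline CSV text and the Order_Id column are made up for illustration only; in practice the path to the HW2_DataA file would be passed to read_csv instead.

```python
import io
import pandas as pd

# Stand-in for the real file: 30 synthetic rows plus a header row.
csv_text = "Order_Id,Sales_per_customer\n" + "\n".join(
    f"{i},{100 + i}" for i in range(30)
)

# header=0 (the default) treats the first row as the column headings.
df = pd.read_csv(io.StringIO(csv_text), header=0)

# 10 random rows; random_state only makes the draw repeatable.
random_rows = df.sample(n=10, random_state=0)

# Last 10 rows, all columns.
last_rows = df.tail(10)

print(random_rows)
print(last_rows)
```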

A-2. [5 marks]: Create a new dataframe (let’s say: ndf) by selecting the first 20 rows from column-2 (Type_of_Payment) to column-6 (Late_delivery), and then sort the second column of the new dataframe in descending order and the third column of the new dataframe in ascending order.
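
Hint: one possible reading of A-2 is a single sort with two keys, where the third column breaks ties in the second. The sketch below assumes that reading and uses a synthetic frame with placeholder column names c1 to c7 (not the real headings).

```python
import pandas as pd

# Synthetic 25-row frame; column-2 .. column-6 sit at positions 1 .. 5.
df = pd.DataFrame({
    "c1": range(25),
    "c2": ["DEBIT", "CASH"] * 12 + ["DEBIT"],
    "c3": [i % 3 for i in range(25)],
    "c4": [25 - i for i in range(25)],
    "c5": [0.5 * i for i in range(25)],
    "c6": [i % 2 for i in range(25)],
    "c7": range(25),
})

# First 20 rows, columns 2 through 6 (iloc end positions are exclusive).
ndf = df.iloc[:20, 1:6]

# Second column descending, third column ascending.
second, third = ndf.columns[1], ndf.columns[2]
ndf_sorted = ndf.sort_values(by=[second, third], ascending=[False, True])

print(ndf_sorted.head())
```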

A-3. [2 marks]: Considering the new dataframe from part A-2, display separate statistical summaries for the numerical and non-numerical columns.
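
Hint: describe() can be restricted by dtype, which gives the two separate summaries. A small sketch on made-up values (the frame below is not the real ndf):

```python
import numpy as np
import pandas as pd

# Tiny stand-in for ndf with one text column and two numeric columns.
ndf = pd.DataFrame({
    "Type_of_Payment": ["DEBIT", "CASH", "DEBIT", "TRANSFER"],
    "Sales_per_customer": [10.0, 20.0, 30.0, 40.0],
    "Late_delivery": [0, 1, 1, 0],
})

# Numeric columns only: count, mean, std, min, quartiles, max.
numeric_summary = ndf.describe(include=[np.number])

# Everything that is not numeric: count, unique, top, freq.
text_summary = ndf.describe(exclude=[np.number])

print(numeric_summary)
print(text_summary)
```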

A-4. [5 marks]: From the original dataset “HW2_DataA”, how many customers below the age of 40 ordered a smart watch using the first class shipping mode?
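
Hint: a count like this is usually a combination of boolean masks. In the sketch below, Customer_Age and Product_Name are hypothetical column names and the category spellings are assumptions; check the real headings and values in HW2_DataA before reusing it.

```python
import pandas as pd

# Made-up rows; Customer_Age and Product_Name are hypothetical column names.
df = pd.DataFrame({
    "Customer_Age": [25, 45, 38, 39, 52],
    "Product_Name": ["Smart watch", "Smart watch", "Laptop",
                     "Smart watch", "Smart watch"],
    "Shipping_Mode": ["First Class", "First Class", "First Class",
                      "Standard Class", "Same Day"],
})

# Each condition is a boolean Series; & combines them row by row.
mask = (
    (df["Customer_Age"] < 40)
    & (df["Product_Name"] == "Smart watch")
    & (df["Shipping_Mode"] == "First Class")
)

# True counts as 1, so the sum is the number of matching rows.
count = int(mask.sum())
print(count)
```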

A-5. [5 marks]: Add a new column (Order_Item_Discount_New) by converting the “Order_Item_Discount” column values from float to the nearest integer using the apply command and a custom function. [Note: using a lambda function is not allowed]
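
Hint: apply accepts any named function, so no lambda is needed. A minimal sketch with made-up discount values; the helper name to_nearest_int is an arbitrary choice.

```python
import pandas as pd

def to_nearest_int(value):
    """Round a float to the nearest integer.

    Python's round() sends exact .5 ties to the even neighbour,
    so decide whether that tie rule is acceptable for the data.
    """
    return int(round(value))

df = pd.DataFrame({"Order_Item_Discount": [12.4, 30.6, 7.2, 18.0]})

# The named function is passed directly to apply.
df["Order_Item_Discount_New"] = df["Order_Item_Discount"].apply(to_nearest_int)

print(df)
```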

A-6. [2 marks]: Show the histogram of Shipping_Mode differentiated by Late_delivery column.

A-7. [3 marks]: Select the rows having Order_Item_Discount greater than 30% and plot their relationship to Sales_per_customer. What can you conclude from the plot?

A-8. [10 marks]: Draw a plot showing the payment method preference (Type_of_Payment) based on the Customer_Segment (your graph must show the count of the different types of Customer_Segment differentiated by Type_of_Payment) and answer the following questions:

(i) Which customer segment has the highest number of DEBIT type?

(ii) Which customer segment prefers PAYMENT over the other types?

(iii) Roughly speaking, what is the most common payment method?
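
Hint: the counts behind such a plot can be cross-checked with a contingency table. The sketch below uses made-up rows, and the segment names are assumptions. A count plot with Customer_Segment on the x-axis and Type_of_Payment as the hue (for example with seaborn) would show the same numbers graphically.

```python
import pandas as pd

# Made-up rows; the segment names are assumed, not taken from the data.
df = pd.DataFrame({
    "Customer_Segment": ["Consumer", "Consumer", "Corporate",
                         "Home Office", "Consumer", "Corporate"],
    "Type_of_Payment": ["DEBIT", "PAYMENT", "DEBIT",
                        "CASH", "DEBIT", "TRANSFER"],
})

# Rows: segments, columns: payment types, cells: number of orders.
counts = pd.crosstab(df["Customer_Segment"], df["Type_of_Payment"])

# Segment with the most DEBIT orders.
top_debit_segment = counts["DEBIT"].idxmax()

# Payment type with the largest total over all segments.
most_common_payment = counts.sum(axis=0).idxmax()

print(counts)
print(top_debit_segment, most_common_payment)
```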

A-9. [5 marks]: Make boxplots for the Sales_per_customer of the COMPLETE and PENDING orders of Smart watch. Compare the two boxplots and explain how the median Sales per customer is related to the Order_Status.
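
Hint: the median line of each boxplot can be verified numerically with a group-by. The sketch below uses made-up values and leaves out the product filter, because the name of the product column is not given here.

```python
import pandas as pd

# Made-up orders; the real data would first be filtered to Smart watch.
df = pd.DataFrame({
    "Order_Status": ["COMPLETE", "PENDING", "COMPLETE",
                     "CLOSED", "PENDING", "COMPLETE"],
    "Sales_per_customer": [100.0, 90.0, 120.0, 500.0, 110.0, 140.0],
})

# Keep only the two statuses being compared.
subset = df[df["Order_Status"].isin(["COMPLETE", "PENDING"])]

# One median per status; these match the centre lines of the boxplots.
medians = subset.groupby("Order_Status")["Sales_per_customer"].median()

print(medians)
```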

Problem B [40 Marks]: Consider the data given in the “HW2_DataB” Microsoft Excel (.csv) file and described in Table 1.
Note: Solve all the following questions using Python. Use the Pandas & Sklearn libraries for all the following analyses.

Table 1. Data Description


Field               Description
ID                  The ID of the patient is automatically assigned
Gender              The Gender of the patient
Age                 The recorded Age of the patient
hypertension        The patient hypertension diagnosis
heart_disease       The patient heart diagnosis
ever_married        The patient marital status
work_type           The type of work of the patient
Residence_type      The home location of the patient
avg_glucose_level   The recorded sugar level of the patient
bmi                 The recorded Body Mass Index of the patient
smoking_status      The recorded smoking habits of the patient
stroke              If the patient had a heart stroke
has_kids            The recorded number of children of the patient

Using the given data do the following:

B-1. [3 marks]: Read and display the data. Identify the number of rows and columns. Does any column have missing data? If yes, provide their names.
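
Hint: shape and isna() cover both questions. A sketch on a small made-up frame (the real data comes from HW2_DataB):

```python
import numpy as np
import pandas as pd

# Made-up stand-in for a few columns of the patient data.
df = pd.DataFrame({
    "age": [67.0, 0.5, 45.0, 30.0],
    "bmi": [36.6, np.nan, 28.1, np.nan],
    "smoking_status": ["never smoked", "Unknown", "smokes", "formerly smoked"],
})

# shape is a (rows, columns) tuple.
n_rows, n_cols = df.shape

# Number of missing cells per column, then the names with at least one.
missing_per_column = df.isna().sum()
columns_with_missing = missing_per_column[missing_per_column > 0].index.tolist()

print(n_rows, n_cols)
print(columns_with_missing)
```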

B-2. [2 marks]: Type Consistency: For each column, identify each field type and verify that each column in Python is identified correctly. If there is any discrepancy, then indicate it.

B-3. [5 marks]: Filter noise: Looking at the data, some values in the numeric column (“age”) were entered as less than 1 (by mistake). Fix the inconsistencies. Furthermore, find the unique categorical values and remove unknowns (if any).
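
Hint: the sketch below shows one way to flag and clean such values on made-up rows. It assumes that ages below 1 should be marked as missing (to be imputed later) and that unknowns are spelled "Unknown"; both are assumptions, and a different fix may be intended for the real data.

```python
import numpy as np
import pandas as pd

# Made-up rows with two suspicious ages and two unknown categories.
df = pd.DataFrame({
    "age": [67.0, 0.32, 45.0, 0.8, 30.0],
    "smoking_status": ["never smoked", "Unknown", "smokes",
                       "formerly smoked", "Unknown"],
})

# Assumption: ages below 1 are entry errors, so mark them as missing.
df.loc[df["age"] < 1, "age"] = np.nan

# Distinct category values, in order of first appearance.
unique_status = df["smoking_status"].unique().tolist()

# Drop the rows whose category is the assumed "Unknown" label.
cleaned = df[df["smoking_status"] != "Unknown"].reset_index(drop=True)

print(unique_status)
print(cleaned)
```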

B-4. [7 marks]: Handling NaN values: Drop all columns containing 30% or more missing values. Then impute the columns having missing values.
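
Hint: the share of missing values per column is the mean of isna(). The sketch below assumes median imputation for numeric columns and mode imputation for the others, which is only one of several reasonable choices; the notes column is hypothetical and exists only to show a column being dropped.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "bmi": [36.6, np.nan, 28.1, 30.0, 25.3],             # 20% missing
    "notes": [np.nan, np.nan, "x", np.nan, "y"],         # 60% missing (hypothetical column)
    "work_type": ["Private", None, "Private", "Govt_job", "Private"],  # 20% missing
})

# Fraction of missing cells in each column.
missing_share = df.isna().mean()

# Keep only columns with less than 30% missing.
kept = df.loc[:, missing_share < 0.30].copy()

# Assumed strategy: median for numeric columns, most frequent value otherwise.
for col in kept.columns:
    if pd.api.types.is_numeric_dtype(kept[col]):
        kept[col] = kept[col].fillna(kept[col].median())
    else:
        kept[col] = kept[col].fillna(kept[col].mode().iloc[0])

print(kept)
```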

B-5. [5 marks]: Normalization/Transformation: Normalize all numeric columns to a mean of zero and standard deviation of one and print only normalized columns.
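
Hint: sklearn's StandardScaler performs this kind of z-score scaling (it uses the population standard deviation). A sketch on made-up values, assuming scikit-learn is installed:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up rows: two numeric columns and one text column.
df = pd.DataFrame({
    "age": [20.0, 40.0, 60.0],
    "avg_glucose_level": [80.0, 100.0, 120.0],
    "gender": ["Male", "Female", "Male"],
})

# Only the numeric columns are scaled.
numeric_cols = df.select_dtypes(include="number").columns

scaler = StandardScaler()
normalized = pd.DataFrame(
    scaler.fit_transform(df[numeric_cols]),
    columns=numeric_cols,
    index=df.index,
)

# Print only the normalized columns.
print(normalized)
```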

B-6. [5 marks]: Encoding: Convert “work_type” using label encoder.
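
Hint: LabelEncoder assigns integer codes to the sorted unique labels. A sketch with made-up work types, assuming scikit-learn is installed; the name of the new column is an arbitrary choice.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    "work_type": ["Private", "Self-employed", "Govt_job", "Private"],
})

encoder = LabelEncoder()

# fit learns the sorted label set; transform maps each label to its position.
df["work_type_encoded"] = encoder.fit_transform(df["work_type"])

print(list(encoder.classes_))
print(df)
```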

B-7. [5 marks]: Encoding: For the “ever_married” column, convert it using binary values (0 and 1). Do not drop any new column(s).
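
Hint: one reading of "do not drop any new column(s)" is one-hot encoding with every indicator column kept. The sketch below follows that reading with pandas.get_dummies and assumes the column holds the values "Yes" and "No".

```python
import pandas as pd

# Assumption: ever_married holds "Yes"/"No" strings.
df = pd.DataFrame({"ever_married": ["Yes", "No", "Yes"]})

# drop_first=False (the default) keeps one 0/1 column per category.
encoded = pd.get_dummies(df, columns=["ever_married"], dtype=int)

print(encoded)
```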

B-8. [8 marks]: General questions (write your answers in a jupyter notebook):

(i) When is it best to use a label encoder rather than one hot encoding?

(ii) What are data cube aggregation and discretization?

(iii) Give a real-world example of direct and indirect data acquisition approaches.

(iv) Give a real-world example of structured data and unstructured data.

(v) Why is there a need to scale numerical data using a Min-Max scaler?

Problem C [20 Marks]: Access the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets.php), and then do the following:

C-1. [5 marks]: Find the “Abalone” Data Set and download the data file. The data file is in the “Data Folder” with “.data” extension. To customize and limit your search, you may use the filters on the left side of the page with the following criteria:

Attribute Type: mixed; Data Type: multivariate; # Attributes: less than 10.

C-2. [5 marks]: Identify the columns’ headers of the data. (Tip: The columns’ headers are under the “Attribute Information” section in the “Abalone Data Set” page.)

C-3. [5 marks]: Read the data into a Pandas dataframe. Include the list of columns’ headers to your “abalone.data” file. (Tip: Create a list of columns’ headers and then add it as data headers to your data file)
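
Hint: a file without a header row can be read with header=None and a names list. In the sketch below, the two inline rows are illustrative stand-ins in the same comma-separated layout, and the header names should be checked against the “Attribute Information” section before use.

```python
import io
import pandas as pd

# Header names as listed for the Abalone data; verify them on the dataset page.
columns = ["Sex", "Length", "Diameter", "Height", "Whole weight",
           "Shucked weight", "Viscera weight", "Shell weight", "Rings"]

# Illustrative rows only; the real call would pass the path to abalone.data.
sample = (
    "M,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15\n"
    "F,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9\n"
)

# header=None: the file has no heading row; names supplies the headings.
abalone = pd.read_csv(io.StringIO(sample), header=None, names=columns)

print(abalone.head())
```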

C-4. [5 marks]: Display a statistical summary for each of the numerical and non-numerical data.

Problem D: (Practice only. No submission required.)

Consider the following Python methods, available in native Python, or pandas/seaborn libraries:

D-1. pandas.DataFrame()

D-2. pandas.read_csv()

D-3. pandas.DataFrame.head()

D-4. pandas.DataFrame.index

D-5. pandas.DataFrame.columns

D-6. pandas.DataFrame.describe()

D-7. pandas.DataFrame.info()

D-8. pandas.DataFrame.loc()

D-9. pandas.DataFrame.iloc()

D-10. pandas.DataFrame.sort_values()

D-11. pandas.DataFrame.isin()

D-12. pandas.DataFrame.value_counts()

D-13. pandas.DataFrame.apply()

D-14. pandas.DataFrame.applymap()

D-15. seaborn.relplot()

D-16. seaborn.pairplot()

D-17. seaborn.catplot()

Answer the following questions for each of the above methods:

(i) State the purpose/usage of the method/attribute.


(ii) List all the arguments of the method.
(iii) Classify the arguments as positional or keyword arguments.
(iv) Write the default values for each of the keyword arguments.
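
Hint: besides help() and the online documentation, Python's inspect module can list the arguments of a method together with their kind and default value. A minimal sketch; the helper name describe_arguments is an arbitrary choice, and attributes such as pandas.DataFrame.index have no signature to inspect.

```python
import inspect
import pandas as pd

def describe_arguments(func):
    """Return (name, kind, default) for every parameter of func.

    default is None when the parameter has no default value,
    otherwise the repr() of the default (so a default of None shows as 'None').
    """
    rows = []
    for name, param in inspect.signature(func).parameters.items():
        has_default = param.default is not inspect.Parameter.empty
        default = repr(param.default) if has_default else None
        rows.append((name, param.kind.name, default))
    return rows

head_args = describe_arguments(pd.DataFrame.head)
for name, kind, default in head_args:
    print(name, kind, default)
```

The kind column reports values such as POSITIONAL_OR_KEYWORD or KEYWORD_ONLY, which helps with question (iii).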

Problem E: (Practice only. No submission required.)

Consider the following Python methods, available in native Python, or pandas/sklearn libraries:

E-1. pandas.DataFrame.index

E-2. pandas.DataFrame.columns

E-3. pandas.DataFrame.dtypes

E-4. pandas.DataFrame.select_dtypes()

E-5. pandas.DataFrame.apply()

E-6. pandas.DataFrame.map()

E-7. pandas.get_dummies()

Answer the following questions for each of the above methods:

(i) State the purpose and usage of the above methods.


(ii) List all the arguments of the method.
(iii) Classify the arguments as positional or keyword arguments.
(iv) Identify the data types for each of the arguments.

(v) Write the default values for each of the arguments.

Consider the following Python classes, available in the sklearn library:

E-8. sklearn.preprocessing.LabelEncoder.fit()

E-9. sklearn.preprocessing.LabelEncoder.transform()

E-10. sklearn.preprocessing.StandardScaler.transform()

Answer the following questions for each of the above classes:

(i) State the purpose and usage of the above methods.


(ii) List all the methods and properties.
(iii) Discuss the .fit() method.
(iv) Discuss the .transform() method.

☞ Note: You can use the following online references to answer the above questions:

♣ https://docs.python.org/3.8/library/functions.html#help

♦ https://docs.python.org/3/library/index.html

♥ https://pandas.pydata.org/pandas-docs/stable/index.html

♠ https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing

------------------------------------------- THE END -------------------------------------------
