
Data Strategy

Assignment for Seminar Paper, Part 1: Classification

Task 1: Titanic - Machine Learning from Disaster

Objective: Use machine learning models to predict the probability of survival of Titanic
passengers based on their attributes such as gender, age, etc.

Dataset: https://www.kaggle.com/c/titanic/overview

• Obtain the dataset from Kaggle. If you don’t have an account, create one on Kaggle.
• Download the training set. Later, you can split it into training and test sets using
Orange's Test and Score widget.
• Upload the file to Orange using the “CSV File Import” widget. Since Orange is
unfamiliar with this dataset, ensure you select the correct target variable in the “Select
Columns” widget. Make all subsequent connections from the Select Columns
widget; otherwise, other widgets will not recognize your selection.
• Remove the “Fare” variable by moving it to the Ignored section of the Select
Columns widget.
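
For intuition about what a random train/test split does, here is a minimal Python sketch. It is not what Orange runs internally: the 70/30 ratio, the seed, and the helper name train_test_split are assumptions chosen for illustration, and the rows are stand-in indices rather than real passenger records.

```python
import random


def train_test_split(rows, test_fraction=0.3, seed=42):
    """Shuffle a copy of rows and split it into (train, test)."""
    shuffled = list(rows)
    # A fixed seed makes the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Stand-in for the 891 rows of Kaggle's train.csv (indices only, no real data).
rows = list(range(891))
train, test = train_test_split(rows)
print(len(train), len(test))
```

Every row ends up in exactly one of the two sets, so the model is scored on passengers it has not seen during training.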

Resources:

• Introduction video to the Titanic dataset: https://www.youtube.com/watch?v=8yZMXCaFshs
• For inspiration on how to approach the problem, check out the Code and Discussion
tabs on Kaggle. Some contributors also have explanatory videos in Python, from
which you can learn about the methods to employ:
https://www.youtube.com/watch?v=I3FBJdiExcg
• In the provided files folder, refer to these handouts: confusionMatrix,
TreeRandomForest. For interpreting the dispersion value in feature statistics, use the
VariationCoefficient handout.
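
As a rough illustration of the coefficient of variation (standard deviation divided by the mean), the following Python sketch may help when reading the dispersion column for numeric features. The values are made up, and the use of the sample standard deviation is an assumption; the handout and Orange may use a slightly different convention.

```python
import statistics


def coefficient_of_variation(values):
    """Return the standard deviation divided by the mean (unitless)."""
    mean = statistics.fmean(values)
    if mean == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    # Assumption: sample standard deviation (n - 1 in the denominator).
    return statistics.stdev(values) / mean


# Made-up values for illustration only, not taken from the Titanic dataset.
values = [10.0, 20.0, 30.0]
cv = coefficient_of_variation(values)
print(f"coefficient of variation: {cv:.2f}")
```

Because the result has no unit, it allows the spread of variables measured on different scales, such as age and ticket class, to be compared.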

Instructions:

1. Conduct a descriptive analysis using the feature statistics.
2. Visualize the data:
o Show the probability of survival based on gender and age.
o Display the distribution of survival based on passenger class using a
histogram.
3. Train machine learning models. Use all the classification models you are familiar
with.
4. Score the models and comment on various metrics, primarily accuracy. Also, consider
other metrics from the Confusion Matrix handout like Recall, Precision, etc.
5. Specifically, display the decision tree and interpret it. In the decision tree component,
set the maximum number of levels to about 4-5 for better visualization. When
interpreting, be cautious and consider the relative size of each group in the dataset.
6. Showcase predictions for specific individuals and provide an interpretation.
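
The metrics named in step 4 can all be derived from the four cells of a confusion matrix. The Python sketch below shows the standard formulas; the counts are hypothetical and do not come from any trained model, and the function name classification_metrics is chosen here for illustration.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute common metrics from confusion-matrix counts.

    tp: survivors predicted as survivors
    fp: non-survivors predicted as survivors
    fn: survivors predicted as non-survivors
    tn: non-survivors predicted as non-survivors
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=80, fp=20, fn=30, tn=137)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Accuracy alone can look good when one class dominates, which is why precision and recall for the "survived" class are worth reporting alongside it.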

Note: Treat this assignment seriously and explore beyond the scope of our block teaching.
Experiment with different charts, correlation tables, and heatmaps. The more methods and
models you employ, the better. If you are unsure about something, don't hesitate to search for
it online.

Task 2: Airline Passenger Survey

Objective: Similar to the Titanic task, this assignment focuses on predicting passenger
satisfaction with airline services using machine learning models.

Dataset: https://www.kaggle.com/datasets/teejmahal20/airline-passenger-satisfaction

• Retrieve the dataset from Kaggle.
• Follow similar steps as mentioned in the Titanic task for importing and processing the
dataset in Orange.

Resources:

• Kaggle can be a valuable source of inspiration. Visit the Code and Discussion tabs to
get insights from other data science enthusiasts. Some might have created
visualizations or models that can provide you with a different perspective on the data.

Instructions:

1. Start by conducting a descriptive analysis of the dataset.
2. Visualize key aspects of the data to understand patterns and trends:
o Explore variables that might influence passenger satisfaction.
o Use different visualization techniques like bar charts, pie charts, and scatter
plots to represent your data effectively.
3. Train various machine learning models.
4. Evaluate the performance of your models using appropriate metrics. Accuracy is
essential, but also consider other metrics that might give you a holistic view of the
model's performance.
5. Interpret the results of your models. Which features significantly influence passenger
satisfaction? Are there any surprising insights?
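
One possible way to approach the question of which features influence satisfaction is to rank categorical features by information gain, i.e. how much knowing the feature reduces the entropy of the target. The assignment does not prescribe this method; the Python sketch below is an illustration under that assumption, and the four rows are toy values, not records from the Kaggle dataset.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy after grouping by feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Toy rows for illustration only (not taken from the dataset).
satisfaction = ["satisfied", "satisfied",
                "neutral or dissatisfied", "neutral or dissatisfied"]
travel_class = ["Business", "Business", "Eco", "Eco"]
gender = ["Female", "Male", "Female", "Male"]

gain_class = information_gain(travel_class, satisfaction)
gain_gender = information_gain(gender, satisfaction)
print(f"class: {gain_class:.2f} bits, gender: {gain_gender:.2f} bits")
```

In this toy example the travel class separates the two outcomes perfectly while gender carries no information, so the class feature would be ranked first. On real data the differences are usually much smaller and should be checked against what the trained models show.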

Note: Emphasize data storytelling. Just like the Titanic task, the way you present your
findings is crucial. Dive deep into the data, and don't hesitate to explore beyond the provided
instructions. The goal is to provide meaningful insights that can be valuable for improving
airline services.

Resources:

• https://antonellocalamea.medium.com/orange-meets-titanic-f483a5c317c7
• https://www.kaggle.com/code/startupsci/titanic-data-science-solutions/notebook
