Francisco GARCIA
Samer LAHOUD
Ingrid VALVERDE
Lucas VELOSO
1 Introduction
2 Data Analysis
Data Collection And Features
Data Split
Exploratory Analysis
Categorical and Numerical Variables
3 Data Preprocessing
Categorical Variables
Numerical Variables
Feature Scaling
4 Predicting Delay Times of TGVs
5 Predicting the Causes of the Delays
Introduction
Primary goal: predict TGV delay times from January to June 2023, together with their main causes.
This presentation covers the preprocessing and analysis of the SNCF dataset, the prediction process, and the results obtained.
Data Analysis Data Collection And Features
Dataset Split
Exploratory Analysis
Categorical Values
National services: 85%.
International services: 15%.
Major stations: Paris Lyon and Paris Montparnasse (33%).
Other stations: Paris Est, Lyon Part Dieu, and Marseille St. Charles (12%).
Numerical Values
Comprehensive analysis of numerical feature distributions.
Identified outliers and skewed distributions.
Outliers: extreme delay cases.
Skewed distributions: underlying patterns or factors.
Goal: preprocess data effectively and choose appropriate modeling techniques.
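One way to screen numerical features for skew and outliers, as described above, is sketched here. This is a minimal sketch assuming pandas; the helper `summarize_numeric`, the 1.5 x IQR rule, and the toy column `delay_minutes` are illustrative assumptions, not the procedure or schema used in the study.

```python
import pandas as pd


def summarize_numeric(df: pd.DataFrame, k: float = 1.5) -> pd.DataFrame:
    """Return skewness and IQR-based outlier counts for each numeric column."""
    numeric = df.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # A value is flagged when it lies more than k * IQR outside the quartiles.
    outside = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return pd.DataFrame({"skewness": numeric.skew(), "n_outliers": outside.sum()})


# Toy data (hypothetical): one extreme delay among otherwise small values.
toy = pd.DataFrame({"delay_minutes": [2.0, 3.0, 4.0, 3.0, 5.0, 4.0, 60.0]})
summary = summarize_numeric(toy)
```

A strongly positive skewness together with a non-zero outlier count is the kind of signal that would motivate a transformation or a robust model.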
Data Analysis Categorical and Numerical Variables
Uniform-like distribution in retard_moyen_arrivee and retard_moyen_tous_trains_depart.
Departure delay-related variables show an exponential pattern.
Example: distribution of nb_train_depart_retard (see Figure).
Data Preprocessing
Feature excitation spectrum regarding the mean delay of all trains upon arrival.
Data Preprocessing Categorical Variables
Key Observations
Numerical Variables
Missing Values
The only missing values were in the training set and were introduced by the lag features.
They were handled by:
Dropping the affected records using dropna
No imputation was needed
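The missing values arise because a lagged column has no predecessor for the first period of each series. A minimal sketch of that effect and of the dropna step follows, assuming pandas; the column names `route`, `month`, and `mean_delay` are hypothetical and do not reflect the SNCF schema.

```python
import pandas as pd

# Hypothetical toy frame: two routes observed over three months.
df = pd.DataFrame({
    "route": ["A", "A", "A", "B", "B", "B"],
    "month": [1, 2, 3, 1, 2, 3],
    "mean_delay": [5.0, 7.0, 6.0, 2.0, 3.0, 4.0],
})

# Sort so that shift(1) really refers to the previous month of the same route.
df = df.sort_values(["route", "month"])
df["mean_delay_lag1"] = df.groupby("route")["mean_delay"].shift(1)

# The first month of each route has no previous value, so its lag is NaN.
# Dropping those rows removes every missing value without any imputation.
clean = df.dropna(subset=["mean_delay_lag1"]).reset_index(drop=True)
```

Each additional lag removes one more leading row per group, so deeper lags shrink the training set accordingly.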
Data Preprocessing Numerical Variables
Numerical Variables
Transformation of Variables
Feature Scaling
Pre-feature Selection
Constant and Quasi-Constant Features
Duplicated Features
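A possible way to carry out these two pre-selection steps is sketched below, assuming pandas. The helper `drop_uninformative`, its `threshold` parameter, and the toy columns are assumptions made for illustration; the slides do not state which threshold or tooling was used.

```python
import pandas as pd


def drop_uninformative(df: pd.DataFrame, threshold: float = 0.99) -> pd.DataFrame:
    """Drop constant, quasi-constant, and duplicated columns."""
    # Share of rows taken by the most frequent value in each column.
    top_share = df.apply(
        lambda col: col.value_counts(normalize=True, dropna=False).iloc[0]
    )
    # Constant columns have a share of 1.0, quasi-constant ones sit near it.
    kept = df.loc[:, top_share < threshold]
    # Transposing lets duplicated() compare whole columns; the first copy stays.
    return kept.loc[:, ~kept.T.duplicated()]


# Hypothetical toy frame with one column of each problematic kind.
toy = pd.DataFrame({
    "constant": [1] * 10,
    "quasi": [0] * 9 + [1],
    "a": list(range(10)),
    "a_copy": list(range(10)),
    "b": [1, 0] * 5,
})
result = drop_uninformative(toy, threshold=0.9)
```

Transposing is convenient on small frames but copies the data, so a column-hash comparison may be preferable on wide or very long tables.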
Conclusion
Aim: predict delays and identify their root causes from January to June 2023.
Best model for delay times: Extreme Gradient Boosting (XGBoost).
Ensemble model R2 score: 0.8414 on the test set.
The feed-forward neural network for predicting causes showed room for improvement.
Incorporating more granular data could enhance model performance.
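For reference, the R2 score quoted above is the coefficient of determination: one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it in plain Python; the function name `r_squared` and the toy values are illustrative assumptions, and scikit-learn's `sklearn.metrics.r2_score` gives the same quantity.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    # Squared error of the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Squared error of always predicting the mean of the targets.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy values (hypothetical), not taken from the SNCF data.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 7.5, 9.0]
score = r_squared(y_true, y_pred)
```

A score of 1 means perfect predictions, while 0 means the model does no better than predicting the mean delay.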