
Prediction of TGV Delays

Francisco GARCIA
Samer LAHOUD
Ingrid VALVERDE
Lucas VELOSO

October 29, 2023


Table of Contents

1 Introduction
2 Data Analysis
Data Collection And Features
Data Split
Exploratory Analysis
Categorical and Numerical Variables
3 Data Preprocessing
Categorical Variables
Numerical Variables
Feature Scaling
4 Predicting Delay Times of TGVs
5 Predicting the Causes of the Delays

Introduction

The primary goal is to predict TGV delay times from January to June
2023, together with their main causes.
We discuss the preprocessing and analysis of the SNCF dataset,
the prediction process, and the results obtained.

Data Collection And Features

Data obtained from the SNCF dataset (.csv) of monthly TGV
regularity by route.
Available on the French government open-data website.
The dataset has both categorical and numerical features.
Lagged features were created based on departure and arrival stations.
These are valuable for predicting train arrival delays and their causes.
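The lagged features described above can be sketched as follows. This is a minimal pure-Python illustration, not the deck's actual code; the record fields (`dep`, `arr`, `delay`) are hypothetical names, and records are assumed to be sorted chronologically.

```python
from collections import defaultdict

def add_lag_feature(records, key_fields, value_field, lag=1):
    """Append a lagged copy of `value_field` to each record, where the lag
    is taken within the group defined by `key_fields` (e.g. the route,
    identified by departure and arrival stations)."""
    history = defaultdict(list)  # route key -> past values, in order
    out = []
    for rec in records:
        key = tuple(rec[f] for f in key_fields)
        past = history[key]
        # None marks the missing lag at the start of each route's history.
        lagged = past[-lag] if len(past) >= lag else None
        new_rec = dict(rec)
        new_rec[f"{value_field}_lag{lag}"] = lagged
        out.append(new_rec)
        past.append(rec[value_field])
    return out
```

Grouping by route before shifting ensures one route's history never leaks into another's lag values.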

Dataset Split

Data from 2018 to 2022: training and validation sets.


Data from January to June 2023: testing set.

Set        Size of X   Size of y
X_train         5707        5707
X_valid         1721        1721
X_test           726         726

Table: Sizes of Training, Validation, and Testing Sets
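The chronological split above can be sketched in a few lines. This is an illustrative helper, assuming each record carries year and month fields (the names `annee` and `mois` are hypothetical):

```python
def time_split(records, year_field="annee", month_field="mois"):
    """Split chronologically: 2018-2022 goes to training/validation,
    January-June 2023 goes to the test set."""
    train_valid = [r for r in records if 2018 <= r[year_field] <= 2022]
    test = [r for r in records
            if r[year_field] == 2023 and r[month_field] <= 6]
    return train_valid, test
```

A time-based split like this avoids leaking future information into training, unlike a random split.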

Exploratory Analysis

19.2% of the variables are categorical.

Major missing data: commentaire retards depart,
commentaire annulation, and commentaires retard arrivee.
commentaire annulation and commentaire retards depart
are entirely absent.

Categorical and Numerical Variables

Categorical Values
National services: 85%.
International services: 15%.
Major stations: Paris Lyon and Paris Montparnasse (33%).
Other stations: Paris Est, Lyon Part Dieu, and Marseille St. Charles (12%).

Numerical Values
Comprehensive analysis of numerical feature distributions.
Identified outliers and skewed distributions.
Outliers: extreme delay cases.
Skewed distributions: underlying patterns or factors.
Goal: preprocess the data effectively and choose appropriate modeling
techniques.

Numerical Variables Insights

Uniform-like distributions in retard moyen arrivee and
retard moyen tous trains depart.
Departure delay-related variables show an exponential pattern.
Example: distribution of nb train depart retard (see Figure).

Feature Distribution Example

Example of the analysis of a feature's distribution across all instances.



Data Preprocessing

Separated categorical and numerical variables.

Performed data imputation on these variables.
A "Missing" category was added for missing instances.
Transformed all categorical variables into numerical ones using a
Decision Tree Encoder (DTE).
Checked correlations using a heat map (see Figure).
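The imputation and encoding steps can be sketched as follows. The encoder here is a simple stand-in, not the deck's actual DTE implementation: it maps each category to the mean target value observed for it, which is what a shallow decision tree split on a single categorical feature effectively assigns to each group.

```python
def impute_missing(values, placeholder="Missing"):
    """Replace missing categorical entries (None) with an explicit
    'Missing' category, so missingness itself becomes a signal."""
    return [v if v is not None else placeholder for v in values]

def fit_category_encoding(categories, targets):
    """Map each category to the mean target value observed for it --
    a stand-in for the Decision Tree Encoder."""
    sums, counts = {}, {}
    for c, t in zip(categories, targets):
        sums[c] = sums.get(c, 0.0) + t
        counts[c] = counts.get(c, 0) + 1
    return {c: sums[c] / counts[c] for c in sums}
```

A real pipeline would fit the encoding on the training set only and reuse it on validation and test data, to avoid target leakage.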

Heat Map of Correlation

Heat map of the correlations between features.

Feature Excitation Spectrum

Feature excitation spectrum with respect to the mean delay of all
trains upon arrival.

Key Observations

Many variables are uncorrelated: this reduces redundant information.

Variables most correlated with the target:
retard moyens tous trains arrivee lag1 (corr = 0.39)
duree moyenne lag1 (corr = 0.26)
nbr train retard sup 15 lag1 (corr = 0.24)

Numerical Variables
Missing Values

The only missing values were in the training set, caused by the lag features.
They were handled by dropping the affected records with dropna;
no imputation was needed.
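Dropping the lag-induced missing rows can be sketched without pandas; this mirrors what `dropna` does for the rows whose lag values are undefined at the start of each route's history:

```python
def drop_missing_rows(records, fields):
    """Keep only records with no missing value (None) in any of the
    given fields -- the plain-Python analogue of pandas' dropna."""
    return [r for r in records if all(r[f] is not None for f in fields)]
```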

Numerical Variables
Transformation of Variables

Objective: obtain symmetric distributions of the variables.

Used the Yeo-Johnson transformation.
Importance: many models assume approximately symmetric distributions.
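In practice this is typically done with scikit-learn's `PowerTransformer(method='yeo-johnson')`, which also fits the parameter λ by maximum likelihood. As a minimal sketch, the transform itself for a given λ is:

```python
import math

def yeo_johnson(x, lmbda):
    """Yeo-Johnson power transform of a single value for a given lambda.
    Unlike Box-Cox, it is defined for negative inputs as well."""
    if x >= 0:
        if lmbda != 0:
            return ((x + 1) ** lmbda - 1) / lmbda
        return math.log(x + 1)          # limit case lambda = 0
    else:
        if lmbda != 2:
            return -(((-x + 1) ** (2 - lmbda)) - 1) / (2 - lmbda)
        return -math.log(-x + 1)        # limit case lambda = 2
```

With λ = 1 the transform is the identity; other values of λ compress one tail of the distribution, which is how it reduces skew.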

Feature Scaling

Decision: keep the outliers.

Used RobustScaler (robust to outliers).
Boxplots verified that most samples fall between -1 and 1.
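RobustScaler centers on the median and scales by the interquartile range; a minimal pure-Python sketch of those statistics (scikit-learn's implementation handles the fitting and per-column application):

```python
def robust_scale(values):
    """Scale by median and interquartile range, so extreme delays
    influence the scaling far less than with mean/std z-scores."""
    s = sorted(values)
    n = len(s)

    def quantile(q):
        # Linear interpolation between closest ranks.
        pos = q * (n - 1)
        lo, hi = int(pos), min(int(pos) + 1, n - 1)
        frac = pos - lo
        return s[lo] * (1 - frac) + s[hi] * frac

    median = quantile(0.5)
    iqr = quantile(0.75) - quantile(0.25)
    scale = iqr if iqr != 0 else 1.0
    return [(v - median) / scale for v in values]
```

Note how the single outlier in the test below barely affects the scaling of the other samples, which stay in [-1, 1].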

Pre-feature Selection
Constant and Quasi-Constants Features

Removed features with the same value in >99% of instances.

Benefits: reduce noise and complexity, and improve the model's pattern
identification.
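Flagging quasi-constant features can be sketched as a frequency check; this is an illustrative helper, with columns represented as a name-to-values mapping:

```python
from collections import Counter

def quasi_constant_columns(columns, threshold=0.99):
    """Return names of columns whose most frequent value covers more
    than `threshold` of the instances; such columns carry almost no
    information and mostly add noise."""
    flagged = []
    for name, values in columns.items():
        top_count = Counter(values).most_common(1)[0][1]
        if top_count / len(values) > threshold:
            flagged.append(name)
    return flagged
```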

Pre-feature Selection
Duplicated Features

Eliminated duplicate features to:

Prevent overfitting
Reduce computational cost
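Removing duplicated features amounts to keeping the first of any set of identical columns; a minimal sketch, again with columns as a name-to-values mapping:

```python
def drop_duplicate_columns(columns):
    """Keep only the first of any set of identical columns.
    `columns` maps feature name -> list of values; insertion order of
    the dict is treated as the column order."""
    seen = set()
    kept = {}
    for name, values in columns.items():
        key = tuple(values)
        if key not in seen:        # first column with this content wins
            seen.add(key)
            kept[name] = values
    return kept
```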

Predicting Delay Times of TGVs

Compared 20 models (see Table ??).

Best performer: Extreme Gradient Boosting (xgboost).
MAE: 1.2583, R²: 0.5597.

Model Selection and Performance

Ensemble Model: xgboost, Huber Regressor, Gradient Boosting Regressor.

Performance Metrics:
R²: 0.8414
MAE: 0.9016
RMSE: 1.4454
RMSLE: 0.1851
MAPE: 0.2184
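The reported metrics can be computed directly from predictions; a minimal sketch covering MAE, RMSE, and R² (RMSLE and MAPE follow the same pattern):

```python
import math

def regression_metrics(y_true, y_pred):
    """Compute MAE, RMSE, and the R^2 score for a set of predictions."""
    n = len(y_true)
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    rmse = math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)
    mean_t = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot   # 1 for perfect fit, 0 for predicting the mean
    return {"MAE": mae, "RMSE": rmse, "R2": r2}
```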

Neural Network Overview

Aim: Predict the reasons for train delays (6 categories).


Chose a neural network for its ability as a universal function
approximator.
It can automatically learn relevant features.
Implemented a feed-forward neural network using PyTorch.

Feed Forward Model Specifications

Model                      Feed-Forward
Activation Function        GeLU
Number of Hidden Neurons   512
Number of Hidden Layers    3
Dropout                    0.1
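The GeLU activation listed above can be written exactly in terms of the standard normal CDF; a minimal sketch (PyTorch provides it as `torch.nn.GELU`):

```python
import math

def gelu(x):
    """Exact GeLU activation: x * Phi(x), where Phi is the standard
    normal CDF, expressed via the error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))
```

Unlike ReLU, GeLU is smooth and lets small negative inputs pass through slightly, which often helps gradient flow in deeper networks.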

Data Preparation and Training

Data converted into manageable batches (batch size 512).

Training regimen adjusted based on performance metrics.
Computation performed on a V100 GPU.
Used Mean Squared Error loss and the Adam optimizer.
Trained over 15 epochs.

Neural Network Evaluation

Evaluated the model using the loss and the R² score.

Low R² scores suggest the need for richer data to improve predictions.

Conclusion

Aim: predict delays and identify root causes from January to June 2023.
Best model for delay times: Extreme Gradient Boosting (xgboost).
Ensemble model R² score: 0.8414 on the test set.
The feed-forward neural network for predicting causes showed room
for improvement.
Incorporating more granular data could enhance model performance.
