
Prediction of TGV Delays

Francisco GARCIA
Samer LAHOUD
Ingrid VALVERDE
Lucas VELOSO

October 29, 2023


Table of Contents

1 Introduction
2 Data Analysis
Data Collection And Features
Data Split
Exploratory Analysis
Categorical and Numerical Variables
3 Data Preprocessing
Categorical Variables
Numerical Variables
Feature Scaling
4 Predicting Delay Times of TGVs
5 Predicting the Causes of the Delays

Introduction

The primary goal is to predict TGV delay times from January to June
2023, together with their main causes.
We discuss the preprocessing and analysis of the SNCF dataset,
the prediction process, and the results obtained.

Data Collection And Features

Data obtained from the SNCF dataset (.csv) of monthly TGV
regularity by route.
Available on the French government open-data website.
The dataset has both categorical and numerical features.
Lagged features were created based on departure and arrival stations.
These are valuable for predicting train arrival delays and their causes.
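The lagged features described above can be sketched as follows. This is a minimal pure-Python illustration, not the deck's actual code; the record fields (`dep`, `arr`, `delay`) are hypothetical names, and records are assumed to be sorted chronologically.

```python
from collections import defaultdict

def add_lag_feature(records, key_fields, value_field, lag=1):
    """Append a lagged copy of `value_field` to each record, where the lag
    is taken within the group defined by `key_fields` (e.g. the route,
    identified by departure and arrival stations)."""
    history = defaultdict(list)  # route key -> past values, in order
    out = []
    for rec in records:
        key = tuple(rec[f] for f in key_fields)
        past = history[key]
        # None marks the missing lag at the start of each route's history.
        lagged = past[-lag] if len(past) >= lag else None
        new_rec = dict(rec)
        new_rec[f"{value_field}_lag{lag}"] = lagged
        out.append(new_rec)
        past.append(rec[value_field])
    return out
```

Grouping by route before shifting ensures one route's history never leaks into another's lag values.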

Dataset Split

Data from 2018 to 2022: training and validation sets.


Data from January to June 2023: testing set.

Set        Size of X   Size of y
X_train         5707        5707
X_valid         1721        1721
X_test           726         726

Table: Sizes of Training, Validation, and Testing Sets
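The chronological split above can be sketched in a few lines. This is an illustrative helper, assuming each record carries year and month fields (the names `annee` and `mois` are hypothetical):

```python
def time_split(records, year_field="annee", month_field="mois"):
    """Split chronologically: 2018-2022 goes to training/validation,
    January-June 2023 goes to the test set."""
    train_valid = [r for r in records if 2018 <= r[year_field] <= 2022]
    test = [r for r in records
            if r[year_field] == 2023 and r[month_field] <= 6]
    return train_valid, test
```

A time-based split like this avoids leaking future information into training, unlike a random split.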

Exploratory Analysis

19.2% of the variables are categorical.

Major missing data: commentaire retards depart,
commentaire annulation, and commentaires retard arrivee.
commentaire annulation and commentaire retards depart
are entirely absent.

Categorical and Numerical Variables

Categorical Values
National services: 85%.
International services: 15%.
Major stations: Paris Lyon and Paris Montparnasse (33%).
Other stations: Paris Est, Lyon Part Dieu, and Marseille St. Charles (12%).

Numerical Values
Comprehensive analysis of numerical feature distributions.
Identified outliers and skewed distributions.
Outliers: extreme delay cases.
Skewed distributions: underlying patterns or factors.
Goal: preprocess the data effectively and choose appropriate modeling
techniques.

Numerical Variables Insights

Uniform-like distributions in retard moyen arrivee and
retard moyen tous trains depart.
Departure delay-related variables show an exponential pattern.
Example: distribution of nb train depart retard (see Figure).

Feature Distribution Example

Example of the analysis of a feature's distribution across all instances.



Data Preprocessing

Separated categorical and numerical variables.

Performed data imputation on these variables.
A "Missing" category was added for missing instances.
Transformed all categorical variables into numerical ones using a
Decision Tree Encoder (DTE).
Checked correlations using a heat map (see Figure).
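The imputation and encoding steps can be sketched as follows. The encoder here is a simple stand-in, not the deck's actual DTE implementation: it maps each category to the mean target value observed for it, which is what a shallow decision tree split on a single categorical feature effectively assigns to each group.

```python
def impute_missing(values, placeholder="Missing"):
    """Replace missing categorical entries (None) with an explicit
    'Missing' category, so missingness itself becomes a signal."""
    return [v if v is not None else placeholder for v in values]

def fit_category_encoding(categories, targets):
    """Map each category to the mean target value observed for it --
    a stand-in for the Decision Tree Encoder."""
    sums, counts = {}, {}
    for c, t in zip(categories, targets):
        sums[c] = sums.get(c, 0.0) + t
        counts[c] = counts.get(c, 0) + 1
    return {c: sums[c] / counts[c] for c in sums}
```

A real pipeline would fit the encoding on the training set only and reuse it on validation and test data, to avoid target leakage.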

Heat Map of Correlation

Heat map of the correlations between features.

Feature Excitation Spectrum

Feature excitation spectrum with respect to the mean delay of all
trains upon arrival.

Key Observations

Many variables are uncorrelated: this reduces redundant information.

Variables most correlated with the target:
retard moyens tous trains arrivee lag1 (corr = 0.39)
duree moyenne lag1 (corr = 0.26)
nbr train retard sup 15 lag1 (corr = 0.24)

Numerical Variables
Missing Values

The only missing values were in the training set, caused by the lag features.
They were handled by dropping the affected records with dropna;
no imputation was needed.
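Dropping the lag-induced missing rows can be sketched without pandas; this mirrors what `dropna` does for the rows whose lag values are undefined at the start of each route's history:

```python
def drop_missing_rows(records, fields):
    """Keep only records with no missing value (None) in any of the
    given fields -- the plain-Python analogue of pandas' dropna."""
    return [r for r in records if all(r[f] is not None for f in fields)]
```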

Numerical Variables
Transformation of Variables

Objective: obtain symmetric distributions of the variables.

Used the Yeo-Johnson transformation.
Importance: many models assume approximately symmetric distributions.
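In practice this is typically done with scikit-learn's `PowerTransformer(method='yeo-johnson')`, which also fits the parameter λ by maximum likelihood. As a minimal sketch, the transform itself for a given λ is:

```python
import math

def yeo_johnson(x, lmbda):
    """Yeo-Johnson power transform of a single value for a given lambda.
    Unlike Box-Cox, it is defined for negative inputs as well."""
    if x >= 0:
        if lmbda != 0:
            return ((x + 1) ** lmbda - 1) / lmbda
        return math.log(x + 1)          # limit case lambda = 0
    else:
        if lmbda != 2:
            return -(((-x + 1) ** (2 - lmbda)) - 1) / (2 - lmbda)
        return -math.log(-x + 1)        # limit case lambda = 2
```

With λ = 1 the transform is the identity; other values of λ compress one tail of the distribution, which is how it reduces skew.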

Feature Scaling

Decision: keep the outliers.

Used RobustScaler (robust to outliers).
Boxplots verified that most samples fall between -1 and 1.
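RobustScaler centers on the median and scales by the interquartile range; a minimal pure-Python sketch of those statistics (scikit-learn's implementation handles the fitting and per-column application):

```python
def robust_scale(values):
    """Scale by median and interquartile range, so extreme delays
    influence the scaling far less than with mean/std z-scores."""
    s = sorted(values)
    n = len(s)

    def quantile(q):
        # Linear interpolation between closest ranks.
        pos = q * (n - 1)
        lo, hi = int(pos), min(int(pos) + 1, n - 1)
        frac = pos - lo
        return s[lo] * (1 - frac) + s[hi] * frac

    median = quantile(0.5)
    iqr = quantile(0.75) - quantile(0.25)
    scale = iqr if iqr != 0 else 1.0
    return [(v - median) / scale for v in values]
```

Note how the single outlier in the test below barely affects the scaling of the other samples, which stay in [-1, 1].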

Pre-feature Selection
Constant and Quasi-Constants Features

Removed features with the same value in >99% of instances.

Benefits: reduce noise and complexity, and improve the model's pattern
identification.
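Flagging quasi-constant features can be sketched as a frequency check; this is an illustrative helper, with columns represented as a name-to-values mapping:

```python
from collections import Counter

def quasi_constant_columns(columns, threshold=0.99):
    """Return names of columns whose most frequent value covers more
    than `threshold` of the instances; such columns carry almost no
    information and mostly add noise."""
    flagged = []
    for name, values in columns.items():
        top_count = Counter(values).most_common(1)[0][1]
        if top_count / len(values) > threshold:
            flagged.append(name)
    return flagged
```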

Pre-feature Selection
Duplicated Features

Eliminated duplicate features to:

Prevent overfitting
Reduce computational cost
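Removing duplicated features amounts to keeping the first of any set of identical columns; a minimal sketch, again with columns as a name-to-values mapping:

```python
def drop_duplicate_columns(columns):
    """Keep only the first of any set of identical columns.
    `columns` maps feature name -> list of values; insertion order of
    the dict is treated as the column order."""
    seen = set()
    kept = {}
    for name, values in columns.items():
        key = tuple(values)
        if key not in seen:        # first column with this content wins
            seen.add(key)
            kept[name] = values
    return kept
```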

Predicting Delay Times of TGVs

Compared 20 models (see Table ??).

Best performer: Extreme Gradient Boosting (xgboost).
MAE: 1.2583, R²: 0.5597.

Model Selection and Performance

Ensemble Model: xgboost, Huber Regressor, Gradient Boosting Regressor.

Performance Metrics:
R²: 0.8414
MAE: 0.9016
RMSE: 1.4454
RMSLE: 0.1851
MAPE: 0.2184
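The reported metrics can be computed directly from predictions; a minimal sketch covering MAE, RMSE, and R² (RMSLE and MAPE follow the same pattern):

```python
import math

def regression_metrics(y_true, y_pred):
    """Compute MAE, RMSE, and the R^2 score for a set of predictions."""
    n = len(y_true)
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    rmse = math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)
    mean_t = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot   # 1 for perfect fit, 0 for predicting the mean
    return {"MAE": mae, "RMSE": rmse, "R2": r2}
```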

Neural Network Overview

Aim: Predict the reasons for train delays (6 categories).


Chose a neural network for its ability as a universal function
approximator.
It can automatically learn relevant features.
Implemented a feed-forward neural network using PyTorch.

Feed Forward Model Specifications

Model                      Feed-Forward
Activation Function        GeLU
Number of Hidden Neurons   512
Number of Hidden Layers    3
Dropout                    0.1
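The GeLU activation listed above can be written exactly in terms of the standard normal CDF; a minimal sketch (PyTorch provides it as `torch.nn.GELU`):

```python
import math

def gelu(x):
    """Exact GeLU activation: x * Phi(x), where Phi is the standard
    normal CDF, expressed via the error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))
```

Unlike ReLU, GeLU is smooth and lets small negative inputs pass through slightly, which often helps gradient flow in deeper networks.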

Data Preparation and Training

Data converted into manageable batches (batch size 512).

Training regimen adjusted based on performance metrics.
Computation performed on a V100 GPU.
Used Mean Squared Error loss and the Adam optimizer.
Trained over 15 epochs.

Neural Network Evaluation

Evaluated the model using the loss and the R² score.

Low R² scores suggest the need for richer data to improve predictions.

Conclusion

Aim: predict delays and identify root causes from January to June 2023.
Best model for delay times: Extreme Gradient Boosting (xgboost).
Ensemble model R² score: 0.8414 on the test set.
The feed-forward neural network for predicting causes showed room
for improvement.
Incorporating more granular data could enhance model performance.
