Modeling Traffic Delay Severity Caused by Accidents: A Machine Learning Approach
As part of my MSc in Data Science at Copenhagen Business School, I recently completed the final project for my Machine Learning course together with Eduard Aguado Marin, Luca Giovanni Gudi, and David Yoshio Uraji.
Our work, “Modeling Traffic Delay Severity Caused by Accidents: A Machine Learning Approach”, explored how supervised classification techniques can be applied to predict the severity of traffic delays using only information available at the time of an accident.
The research focused on the question: Can we accurately identify high-severity traffic events in real time to support more effective traffic management and routing decisions?
To address this, we followed a complete machine learning workflow: data preprocessing, feature engineering, class balancing, model selection, and validation. The goal was a model that prioritizes the timely detection of high-impact incidents.
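To make the binarization and class-balancing steps concrete, here is a minimal sketch using only the Python standard library. It is not the project's actual code: the severity threshold of 3, the random-undersampling strategy, the helper names `binarize` and `undersample`, and the toy data are all assumptions chosen for illustration.

```python
import random
from collections import Counter


def binarize(severity, threshold=3):
    """Map an ordinal severity score to 1 (high) or 0 (low).

    The threshold is an assumption made for this sketch.
    """
    return 1 if severity >= threshold else 0


def undersample(rows, labels, seed=0):
    """Randomly drop majority-class rows until both classes are the same size."""
    rng = random.Random(seed)
    by_class = {0: [], 1: []}
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    n = min(len(by_class[0]), len(by_class[1]))
    balanced = [
        (row, label)
        for label, group in by_class.items()
        for row in rng.sample(group, n)
    ]
    rng.shuffle(balanced)
    return [row for row, _ in balanced], [label for _, label in balanced]


# Toy data: made-up severity scores skewed toward low severity.
toy_rng = random.Random(42)
severities = toy_rng.choices([1, 2, 3, 4], weights=[10, 60, 25, 5], k=1000)
rows = [{"id": i} for i in range(len(severities))]
labels = [binarize(s) for s in severities]

balanced_rows, balanced_labels = undersample(rows, labels)
print("before:", Counter(labels))
print("after: ", Counter(balanced_labels))
```

Undersampling is only one way to balance classes; oversampling or class weights are common alternatives, and the sketch does not claim to match the method we used.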
Related Project
This project investigates the use of machine learning to classify the severity of traffic delays caused by roadway accidents, using only features available at the time of the incident. The problem addressed is the need for timely identification of high-impact events to support traffic management and routing decisions. The research question is how accident-related traffic delay severity can be predicted from real-time features, with a focus on minimizing false negatives for high-severity cases. Concepts applied include supervised classification, class balancing, feature engineering, and model validation. The analysis is based on the US Accidents dataset, which contains over 7.7 million records; the data was cleaned, binarized, balanced, and used to train four models. Histogram-Based Gradient Boosting (HGBoost) achieved the highest recall, 0.79, outperforming Random Forest, Logistic Regression, and Multilayer Perceptron, which showed higher accuracy but lower sensitivity to severe cases. These results suggest that HGBoost is best suited for applications where the accurate identification of high-severity delays is prioritized. It is recommended as the preferred model when recall is the primary objective and training efficiency is also relevant.
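The trade-off between accuracy and recall described above can be illustrated with a small worked example. The sketch below uses made-up labels and predictions for two hypothetical models (`pred_a` and `pred_b`); none of these numbers come from the project. It only shows how a model can score higher on accuracy while missing more of the high-severity cases.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 marks high severity."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def recall(y_true, y_pred):
    """Share of true high-severity cases that the model flags: tp / (tp + fn)."""
    tp, _, fn, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if (tp + fn) else 0.0


def accuracy(y_true, y_pred):
    """Share of all cases classified correctly: (tp + tn) / total."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + fn + tn)


# Illustrative data: 10 high-severity and 90 low-severity events.
y_true = [1] * 10 + [0] * 90

# Hypothetical model A: flags 8 of 10 severe events, with 20 false alarms.
pred_a = [1] * 8 + [0] * 2 + [1] * 20 + [0] * 70
# Hypothetical model B: flags 4 of 10 severe events, with 2 false alarms.
pred_b = [1] * 4 + [0] * 6 + [1] * 2 + [0] * 88

for name, pred in [("A", pred_a), ("B", pred_b)]:
    print(name, "accuracy:", accuracy(y_true, pred), "recall:", recall(y_true, pred))
```

In this toy setup model B has the higher accuracy but misses more severe events than model A, which is why recall is the more relevant metric when false negatives on high-severity cases are the main concern.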