Pyramid Scene Parsing Network for Driver Distraction Classification Pyramid Scene Parsing Network for Driver Distraction Classification

In recent years, there has been a persistent increase in the number of road accidents worldwide. The US National Highway Traffic Safety Administration reports that distracted driving is responsible for approximately 45 percent of road accidents. In this study, we tackle the challenge of automating the detection and classification of driver distraction, along with the monitoring of risky driving behavior. Our proposed solution is based on the Pyramid Scene Parsing Network (PSPNet), which is a semantic segmentation model equipped with a pyramid parsing module. This module leverages global context information through context aggregation from different regions. We introduce a lightweight model for driver distraction classification, where the final predictions benefit from the combination of both local and global cues. For model training, we utilized the publicly available StateFarm Distracted Driver Detection Dataset. Additionally, we propose optimization techniques for classification to enhance the model’s performance.


INTRODUCTION
According to recent research conducted by the Moroccan National Agency for Road Safety, distracted driving was a contributing factor in 3,005 road fatalities and more than 84 585 injuries in Morocco in the year 2020.Regrettably, this issue seems to be worsening year after year. (1,2,3)Distracted driving, as defined by Strayer et al. (4) encompasses any activity that diverts a driver's attention away from the road, such as texting, eating, conversing with passengers, or adjusting the stereo.In light of this, the objective of our research is to develop and implement a model for the detection and classification of distracted driving in smart cars, leveraging semantic segmentation techniques (5) and convolutional neural networks (CNNs). (4)To identify and categorize driver distraction from visual cues, we explored various models, including convolutional neural networks (CNNs). (4)Building upon the state-of-the-art findings in this field, we devised a simplified model based on PSPNet, which yielded promising results. (6,7)

Related Work
Distracted driving can generally be classified into four distinct forms, as outlined by Strayer et al. (8) cognitive, visual, manual, and auditory distractions.When a driver becomes distracted, they divert their focus and actions away from driving-related tasks, engaging in non-driving activities.Some activities inherently involve multiple forms of distraction.For instance, using a cell phone for calls or texting can encompass all four forms of distractions.
Manual Distraction: This occurs when a driver's hands are taken off the steering wheel, impacting their ability to control the vehicle (as depicted in figure 1a).Common instances include eating, drinking, smoking, or retrieving items from a purse or wallet.
Visual Distraction: In this scenario, the driver's attention shifts to looking at a device instead of the road, which is one of the most prevalent distractions (as depicted in figure 1b).Examples encompass glancing at a GPS device, focusing on the entertainment center, observing a passenger.
Cognitive Distraction: Cognitive distractions emerge when the driver's focus is drawn away from driving by interpreting information from a device (as illustrated in Figure 1c).Common instances include listening to a podcast, engaging in conversations through hands-free devices, conversing with other passengers.Auditory Distraction because noise distracts the driver.The table 1 shows some examples of distraction actions and their mapping to distraction types.

Pyramid scene parsing network (PSPNet)
PSPNet, as introduced in Zhao et al. (9) 's work, utilizes a pretrained CNN (6) and employs the dilated network technique to extract feature maps from input images.The final feature map size is reduced to 1/8 of the input image dimensions.We leverage the pyramid pooling module to aggregate contextual information on top of this feature map.The pooling kernels cover various portions of the image, including the entire, half, and smaller Data and Metadata.2023; 2:154 2 areas, thanks to our four-level pyramid structure.These aggregated features are then combined to form a holistic representation.In the final step, we fuse this combined data with the original feature map and apply a convolution layer to produce the ultimate prediction map.
The task of understanding visual scenes necessitates semantic image segmentation, as highlighted in Chen et al. (1) 's work.Semantic segmentation aims to classify each pixel in the input image, effectively performing pixel-level object segmentation. (2)This technique finds applications in diverse fields, such as autonomous driving, robotics, medical image analysis, video surveillance, and more.Consequently, it becomes imperative to enhance the accuracy and precision of semantic image segmentation both in theoretical research and practical implementation.This study primarily introduces the PSPNet, a scene analysis model based on pyramid synthesis, (9) along with a parameter optimization approach tailored to the PSPNet model, leveraging GPU distributed computing for improved efficiency.In figure 2, we offer an overview of our proposed PSPNet.Our approach commences with an input image (a) and initially employs a Convolutional Neural Network (CNN) to extract the feature map from the final convolutional layer (b).Following this, we apply a pyramid parsing module to collect a wide array of subregion representations.Subsequently, we utilize upsampling and concatenation layers to construct the final feature representation (c), which encapsulates both local and global context information.To conclude, this representation is input into a convolutional layer to generate the ultimate per-pixel prediction (d).
We term this module as the 'pyramid pooling module,' which is meticulously crafted to establish a global scene context based on the final-layer feature map of the deep neural network, as vividly depicted in part (c) of figure 2.
The pyramid pooling module adeptly amalgamates features across four distinct pyramid scales.The coarsest level, highlighted in red, engages in global pooling to yield a single-bin output.The subsequent pyramid level further subdivides the feature map into discrete sub-regions, creating pooled representations for various spatial locations.

Proposed Method
The method we propose is composed of the following phases: Semantic Segmentation Utilizing the Pyramid Scene Parsing Network (PSPNet) model, we delve into a critical domain of computer vision known as semantic segmentation.This aspect is fundamental for addressing various scene interpretation challenges.Semantic segmentation involves the prediction of the category associated with each pixel or, more precisely, determining the category to which a given pixel belongs.By precisely delineating the object region to which a pixel pertains, our goal is to enhance the accuracy of pixel categorization.
Method: We employed three evaluation metrics-precision, recall, and F1 score-to assess and analyze the effectiveness of our image segmentation proposal based on PSPNet, drawing from our experimental results.In our implementation, we utilized the multi-scale parallel convolutional neural network model with PSPNet.To tackle the challenges posed by the complex and dynamic nature of the images, diverse driver positions, and the risk of overfitting or disrupting the parameter structure, we employed small-sample transfer learning as a means to constrain the parameter learning process.Classification: For the classification of the driver's pose (and thus her/his distraction), we used The convolutional neural network (CNN).Figure 3 The diagram below describes the pipeline of the proposed method.After segmentation of the database images with PSPNET Model, the proposed convolutional neural network (CNN) is used as the classification model.

Dataset
The dataset we used to train and test the models is the StateFarm's distraction detection dataset. (7)

Implementation Details
Python is used to implement the suggested preprocessing and classification pipeline.A pre-trained PSPNET Model was employed for the pose estimation step.The Keras library on the Tensorflow backend was used to implement the CNN baseline model.80 percent of the dataset was used to train all classifiers (including CNN), and the remaining 20 percent was used to test them.We compared the performance of the proposed method detection and classification for driver distraction using semantic segmentation.To evaluate the performance of different classifiers, we used four performance

Figure 2 .
Figure 2. Diagram of semantic segmentation of images used in PSPNet Model

Figure 3 .
Figure 3.The pipeline of our proposed method for driver distraction classification.Pyramid Scene Parsing Network for Driver Distraction Classification

Table 1 .
Assignment to common distraction actions and distraction types

Table 2
present the 10 classes of the dataset and the number of images for each class.

Table 2 .
StateFarm distraction-detection dataset and total images in the class

Table 3 .
Classification model on the test set, in termes of accuracy, macro average precision, recall and F1-score.10 classes schema