Deep Learning for Pothole Detection on Indonesian Roadways

Traffic accidents are common on Indonesian roadways and involve cars, motorcycles, and public transportation. Road fatalities are caused by speeding, alcohol, distraction, fatigue, and poor road conditions. Road infrastructure and environmental conditions account for 30% of Indonesian traffic incidents, driver skill and character for 61%, and vehicle factors such as vehicle standardization for 9%. Poor road conditions can damage or immobilize vehicles and lead to crashes. According to authorities, three people die in traffic in Indonesia every hour. According to the BPS 2021 Land Transportation Statistics report, 31.91 percent of Indonesia's roads were damaged, totaling 174,298 kilometers. As roads deteriorate, accidents among Indonesian motorists are becoming more common. Using a single camera, a deep learning algorithm can detect road degradation such as potholes and road cracks. In this study, the model is trained using transfer learning and fine-tuning on the YOLOv5 Nano architecture. Validated in three major scenarios, the model performs well when the confidence level is set appropriately. The model achieves a precision of 0.8, while recall and mAP@0.5 are both 0.5.


INTRODUCTION
Each year, a significant number of road accidents occur in Indonesia [1]. These collisions involve automobiles, motorcycles [2], trucks, and public transportation vehicles, among others [3], [4]. Road accidents are caused by reckless driving, excessive speeding, driving under the influence of alcohol or narcotics, distracted driving, fatigue, and poor road conditions [5]. In Indonesia, poor road conditions contribute substantially to traffic accidents [6]. According to the Director General of Land Transportation, Pudji Hartanto [7], 30% of traffic accidents in Indonesia are influenced by road infrastructure and environmental conditions, 61% by human factors such as driving ability and driver character, and the remaining 9% by vehicle factors such as whether vehicles on the road meet standards. Poor road conditions can have severe effects: they increase the likelihood of vehicle damage, reduce the driver's ability to control the vehicle, immobilize vehicles, and raise the likelihood of collisions between vehicles [8]. According to Indonesian police statistics, three people die every hour in road accidents [9]. According to the 2021 Land Transportation Statistics report by the Central Statistics Agency (BPS) [10], the cumulative length of damaged roads in Indonesia in 2021 reached 174,291 kilometers, or approximately 31.91%. This extensive road damage leaves drivers susceptible to collisions. Experts recommend studies that help drivers become more aware of road damage, such as potholes and cracks, which can lead to traffic accidents.

1 Dataset
The dataset used in this study is derived from video logs that are publicly available on YouTube. The footage consists of MotoVlogs recorded by motorbike riders [11]-[14]. At least four videos are used as training data, and three more videos are left unprocessed in Roboflow so that they can serve as test data. Frames are extracted from the training videos and processed on the Roboflow platform [15]. The extracted images are labeled in Roboflow using bounding boxes, as shown in Figure 1. Labeling makes pothole identification easier by excluding the broad background, allowing inference to focus on only the damaged road sections. After labeling, the dataset contained 952 images, which were then pre-processed and augmented by adding noise and varying exposure and blur, as shown in Figure 2. In addition, all images are resized to a common resolution of 480x480 pixels. The main goal of this preprocessing and augmentation is to produce a more diverse dataset, so that new samples are obtained without re-labeling. The final dataset contains 2316 images [16], which are suitable for training and testing.
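To illustrate why noise, exposure, and blur augmentation yields new samples without re-labeling, the following is a minimal pure-Python sketch. It is not Roboflow's implementation. The function names, the tiny 4x4 grayscale frame, the exposure range of 0.8 to 1.2, and the noise level of 10 are all assumptions chosen for illustration. The point it shows is that these operations change pixel values but not object positions, so each augmented copy can reuse the original bounding boxes.

```python
import random


def clamp(value):
    """Keep a pixel value inside the 8-bit range 0..255."""
    return max(0, min(255, int(round(value))))


def adjust_exposure(image, factor):
    """Scale every pixel by `factor` (>1 brightens, <1 darkens)."""
    return [[clamp(p * factor) for p in row] for row in image]


def add_noise(image, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation `sigma`."""
    return [[clamp(p + rng.gauss(0.0, sigma)) for p in row] for row in image]


def box_blur(image):
    """Replace each pixel with the mean of its 3x3 neighborhood."""
    height, width = len(image), len(image[0])
    blurred = []
    for y in range(height):
        row = []
        for x in range(width):
            window = [
                image[j][i]
                for j in range(max(0, y - 1), min(height, y + 2))
                for i in range(max(0, x - 1), min(width, x + 2))
            ]
            row.append(clamp(sum(window) / len(window)))
        blurred.append(row)
    return blurred


def augment(image, boxes, rng):
    """Return three augmented copies; each reuses the original boxes."""
    variants = [
        adjust_exposure(image, rng.uniform(0.8, 1.2)),  # hypothetical range
        add_noise(image, 10.0, rng),                    # hypothetical sigma
        box_blur(image),
    ]
    return [(variant, list(boxes)) for variant in variants]


# Tiny 4x4 grayscale stand-in for a 480x480 frame, with one labeled box.
frame = [[100, 120, 140, 160] for _ in range(4)]
labels = [(0, 0.5, 0.5, 0.25, 0.25)]  # (class, x_center, y_center, w, h)
samples = augment(frame, labels, random.Random(0))
print(len(samples))  # one entry per augmentation
```

Geometric augmentations such as flips or crops would not share this property, because they move the objects and therefore require the boxes to be transformed as well.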

2 Model Architecture
Computer vision detects and locates objects in pictures and videos. In real-world applications, deep learning models can recognize objects reliably and effectively [17]-[19]. To recognize and locate objects in pictures and videos [20], [21], an object detection model employs a robust backbone network and a detection head [22]. Through data annotation [23], [24], optimization [25], and evaluation [26], [27], the model can be trained to recognize objects accurately and consistently. The architecture, loss function, and evaluation criteria are determined by the research aims and constraints [28], [29].
Object detection models are either two-stage or one-stage. Faster R-CNN [30] and Mask R-CNN [31] are two-stage models. These models first propose regions using a region proposal network (RPN), which suggests candidate object regions and their bounding box coordinates. The second stage classifies the proposed regions and refines their bounding boxes. By concentrating on likely object locations during region proposal and then updating the predictions, two-stage models increase detection accuracy. YOLO [32], [33] and SSD [34], [35] are one-stage models that estimate bounding boxes and class probabilities in a single pass over the input image. These models are fast and suitable for real-time applications. They find objects across the image by using anchor boxes of varied sizes and aspect ratios. One-stage models are faster than two-stage models, although they are less accurate [36]-[38]. This research uses a one-stage model based on the YOLOv5 architecture [39]. Ultralytics provides several versions of the YOLOv5 network [40], which are distinguished by network complexity. The researchers chose the Nano version of the YOLOv5 architecture (YOLOv5n) for the study [41]. As the term "nano" suggests, its network is very simple, so that the model can run on low-resource platforms such as smartphones [42], Raspberry Pi [43], and microcontrollers [44]. The architecture used in this research is depicted in Figure 3. As illustrated, the model has 281 layers and a total of three million parameters.

Figure 4 Research methodology flowchart
Figure 4 shows the research methodology used in this study. The research begins with the collection of data relevant to the problem, specifically image data of potholes. The collected data is then used for training and testing. In this study, the training procedure is divided into two stages: fine tuning and training. The purpose of both stages is to produce a trained model that meets the needs of the research. Once the trained model has been produced, its performance is evaluated by applying it to video data not used during training.

RESULTS AND DISCUSSION
The experiments were run on a machine with a Ryzen 9 CPU, 16 GB of RAM, and an Nvidia GeForce RTX 3050 graphics processing unit with 4 GB of memory. Broadly, this research is divided into three major stages: "fine tuning," "training," and "testing."

1 Fine Tuning
Fine tuning uses a transfer learning method. The pretrained model provided by Ultralytics serves as the foundation model for training. This pretrained model, however, ships with generic default hyperparameters. The researchers therefore perform fine tuning to adapt the hyperparameters to the image dataset being used. Fine tuning is carried out by enabling the evolve process during training, which updates the hyperparameters used at each training step. The evolve method seeks the best hyperparameter configuration for the dataset so that the model can learn from it more effectively. Fine tuning ran for 800 iterations with a batch size of 32, an image size of 480x480 pixels, and the SGD (Stochastic Gradient Descent) optimizer. The data used for fine tuning is the same as the data used in the subsequent training, namely the labeled and processed dataset. The results in Figure 5 were obtained after running the process for several hours. Three loss functions are shown: box loss, objectness loss, and classification loss. Box loss is the model's bounding box loss; it measures how accurately the predicted bounding box fits the object. Objectness loss measures how well the model recognizes that an object is present in an image. Finally, classification loss measures how accurately the model assigns an object to its class. As Figure 5 shows, the YOLOv5n model is able to minimize its loss. Because only one object class is used in this study, the classification loss stays at 0.
Restricting detection to a single object class lessens the inference load on the model so that the available resources, in this case a laptop, are not overburdened. Judging by the loss graph in Figure 5, the results of fine tuning are quite good, so the researchers chose to apply the hyperparameters it produced.
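The idea behind hyperparameter evolution can be sketched as a simple mutate-and-select loop. This is a deliberately simplified illustration and not the Ultralytics implementation. The hyperparameter names, starting values, bounds, mutation settings, and especially `toy_fitness` are assumptions: in a real run, the fitness of each candidate would come from training the detector and scoring it on validation data, which is why the process takes hours.

```python
import random


def mutate(hyp, bounds, rng, sigma=0.2, prob=0.8):
    """Randomly scale each hyperparameter, then clip it to its bounds."""
    child = {}
    for name, value in hyp.items():
        low, high = bounds[name]
        if rng.random() < prob:
            value = value * (1.0 + rng.gauss(0.0, sigma))
        child[name] = min(high, max(low, value))
    return child


def evolve(base, bounds, fitness_fn, generations, seed=0):
    """Keep the best configuration seen across `generations` mutations."""
    rng = random.Random(seed)
    best, best_fitness = dict(base), fitness_fn(base)
    for _ in range(generations):
        candidate = mutate(best, bounds, rng)
        fitness = fitness_fn(candidate)
        if fitness > best_fitness:
            best, best_fitness = candidate, fitness
    return best, best_fitness


def toy_fitness(hyp):
    """Hypothetical stand-in for 'train briefly, then score on validation'."""
    return -((hyp["lr0"] - 0.02) ** 2) - ((hyp["momentum"] - 0.9) ** 2)


# Illustrative starting values and limits, not the values used in the study.
base = {"lr0": 0.01, "momentum": 0.95}
bounds = {"lr0": (1e-5, 0.1), "momentum": (0.6, 0.98)}
best, score = evolve(base, bounds, toy_fitness, generations=800)
print(best, score)
```

Because a candidate replaces the current best only when its fitness is higher, the returned configuration is never worse than the starting one under the chosen fitness measure.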

2 Model Training
The training procedure comes next. The YOLOv5n6 model is trained using the same setup as in the fine tuning phase, together with the hyperparameters obtained from fine tuning. Training was configured for a maximum of 1000 epochs with a patience of 500 epochs. Training terminated at the 753rd epoch because the model had made no substantial progress over the preceding 500 epochs, a technique known as early stopping. Early stopping is effective at avoiding excessive training time, allowing the best model to be obtained in the least amount of time. Figure 6 shows the results of training. Although the objectness loss appears to move in the opposite direction, the model keeps the resulting loss consistent. This behavior stems from the very low model complexity, as the network is not as detailed as that of a more complex model. Nevertheless, as the graph shows, the model is still able to keep the loss low.
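The early stopping rule described here can be sketched in a few lines. The class and function names are hypothetical, epochs are assumed to be counted from zero, and the score sequence is synthetic: it stops improving at epoch 253 purely to show how a patience of 500 produces a stop at epoch 753. The epoch at which the study's model actually peaked is not reported.

```python
class EarlyStopper:
    """Stop training when the score has not improved for `patience` epochs."""

    def __init__(self, patience):
        self.patience = patience
        self.best_score = float("-inf")
        self.best_epoch = 0

    def step(self, epoch, score):
        """Record the score for `epoch`; return True if training should stop."""
        if score > self.best_score:
            self.best_score = score
            self.best_epoch = epoch
        return (epoch - self.best_epoch) >= self.patience


def run(scores, patience):
    """Return (last epoch run, best epoch) for a sequence of per-epoch scores."""
    stopper = EarlyStopper(patience)
    for epoch, score in enumerate(scores):
        if stopper.step(epoch, score):
            return epoch, stopper.best_epoch
    return len(scores) - 1, stopper.best_epoch


# Synthetic metric: improves up to epoch 253, then plateaus (illustration only).
synthetic_scores = [min(epoch, 253) for epoch in range(1000)]
stopped_at, best_epoch = run(synthetic_scores, patience=500)
print(stopped_at, best_epoch)  # 753 253
```

A large patience such as 500 makes the rule tolerant of long plateaus, at the cost of running many epochs after the best checkpoint has already been found.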

Figure 7 Training metrics history
In addition to the loss graphs, the researchers examine changes in precision, recall, and mean average precision (mAP), as shown in Figure 7. From early in training, the model reaches its highest precision while maintaining recall and mAP. The graph shows that the model can detect potholes with high precision, although it detects them only under certain conditions. The final outcome of training is an AUC (area under the curve) of 0.56 for the precision-recall curve.
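As a reminder of how these metrics are defined, the sketch below computes precision and recall for one image at an intersection-over-union (IoU) threshold of 0.5, the same threshold that mAP@0.5 refers to. It is a simplified greedy matcher, not the evaluation code used in the study, and the boxes and confidences are made-up values. Boxes are assumed to be `(x_min, y_min, x_max, y_max)` tuples.

```python
def iou(a, b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_recall(preds, truths, iou_thr=0.5):
    """preds: list of (confidence, box); truths: list of ground-truth boxes."""
    matched = set()
    true_pos = 0
    # Most confident predictions claim ground-truth boxes first.
    for _, box in sorted(preds, key=lambda p: p[0], reverse=True):
        best_idx, best_iou = None, iou_thr
        for idx, gt in enumerate(truths):
            if idx in matched:
                continue
            overlap = iou(box, gt)
            if overlap >= best_iou:
                best_idx, best_iou = idx, overlap
        if best_idx is not None:
            matched.add(best_idx)
            true_pos += 1
    false_pos = len(preds) - true_pos
    false_neg = len(truths) - true_pos
    precision = true_pos / (true_pos + false_pos) if preds else 0.0
    recall = true_pos / (true_pos + false_neg) if truths else 0.0
    return precision, recall


# Made-up example: one correct detection, one false alarm, one missed pothole.
truths = [(0, 0, 10, 10), (20, 20, 30, 30)]
preds = [(0.9, (1, 1, 10, 10)), (0.6, (50, 50, 60, 60))]
print(precision_recall(preds, truths))  # (0.5, 0.5)
```

High precision with lower recall, as reported for this model, corresponds to few false alarms but a number of missed potholes.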

3 Model Testing
At this stage, the model is evaluated with video data that was not included in the training dataset. The video data is a video log (Vlog) recorded while riding a motorized vehicle, in this case a motorcycle. In general, the model can detect potholes on paved roads. Figures 8 and 9 show that the model can detect both small and large potholes at riding speeds in the video of roughly 20-30 km/h. The researchers also tested several inference scenarios. Three primary scenarios were used, with confidence threshold values of 0.5, 0.6, and 0.8. The key difference between the three is the model's sensitivity when identifying objects. At a confidence threshold of 0.5, the model detects every object it deems a pothole. However, at this level of sensitivity the model also detects objects other than potholes, as shown in Figure 10. Figure 10(a) shows road joints and cracks being detected. Furthermore, as illustrated in Figure 10(b), (c), and (d), the model recognizes road borders, grass, and asphalt areas on the road as potholes. This over-detection introduces substantial bias, prompting the researchers to raise the confidence threshold to 0.6.

Figure 11 Inference result scenario 2
At a confidence threshold of 0.6, the model filters out non-pothole objects better than at 0.5. However, as illustrated in Figure 11(a), the model still has difficulty rejecting things that resemble damaged road but are not. Furthermore, as illustrated in Figure 11(b) and (d), the model detects relatively large asphalt patches as well as smooth holes. In this and the earlier scenario, the model also has difficulty rejecting objects that are not part of the road, such as the tree trunks in Figure 11(c). Too much sensitivity leads to significant bias during inference.
After several trials, the researchers opted for a confidence threshold of 0.8 to avoid over-detection of objects other than potholes.

Figure 12 Inference result scenario 3
With this higher confidence threshold, the model recognizes severe potholes, as seen in Figure 12(a). Furthermore, as demonstrated in Figure 12(b), the model is capable of identifying small holes. However, as seen in Figure 12(c) and (d), the model still classifies asphalt patches as potholes. These asphalt patch detections still cause a rather substantial bias in the inference results. The error may have several causes, including inaccuracies in the dataset used during fine tuning and training, or a model that needs additional training to distinguish potholes from non-pothole objects. Further study of the asphalt patch case is needed for the model's inferences to become more precise. Apart from the errors on asphalt patches, the model detects potholes very well, and the researchers believe it is suitable for deployment on edge computing devices such as Raspberry Pi, Arduino, or smartphones, where it could later help make motorists aware of potholes on the road. Deployment is subject to several conditions, such as a confidence threshold that must be adjusted to the driving speed.
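The effect of the three confidence thresholds can be illustrated with a small filtering sketch. The detection records and their confidence values below are invented for illustration and are not results from the study. The function name and the dictionary layout are likewise assumptions.

```python
def filter_by_confidence(detections, threshold):
    """Keep only detections whose confidence is at least `threshold`."""
    return [d for d in detections if d["confidence"] >= threshold]


# Invented detections for a single frame (illustration only).
detections = [
    {"label": "pothole", "confidence": 0.91},
    {"label": "pothole", "confidence": 0.72},
    {"label": "pothole", "confidence": 0.63},
    {"label": "pothole", "confidence": 0.55},
]

for threshold in (0.5, 0.6, 0.8):
    kept = filter_by_confidence(detections, threshold)
    print(threshold, len(kept))
# 0.5 4
# 0.6 3
# 0.8 1
```

Raising the threshold removes low-confidence detections, which suppresses many false alarms but can also discard genuine potholes that the model is less sure about. This is the trade-off behind choosing 0.8 in the third scenario.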

CONCLUSION
This study focuses on the use of deep learning for the detection of potholes on roadways using only a single camera. The dataset is MotoVlog video data collected from YouTube and processed with the Roboflow platform. The Nano version of YOLOv5 was employed as the deep learning model architecture. This version is designed to run on devices with limited resources, such as smartphones, microcontrollers, and microprocessors. The most efficient model was obtained by combining transfer learning and fine tuning, after 800 epochs of fine tuning and 753 epochs of training. The model runs well during testing; however, the confidence threshold needs to be adjusted in certain scenarios to reduce bias during inference.

RECOMMENDATION
In future work, the researchers will deploy the model on edge computing devices. Future research can also run model inference on various vehicles (such as cars and motorcycles) under varying speed conditions. In addition, the dataset used for training should be double-checked and focused on potholes that could cause traffic accidents. Data can also be collected at various speeds, allowing for more precise and effective application on edge computing devices. With these advances, it is hoped that traffic accidents in Indonesia will be reduced and that motorists will be more aware of the condition of the roads they travel on.