
Superpixel-based Vehicle Detection using Plane Normal Vector in Disparity Space

  • Seo, Jeonghyun (Dept. of Electrical and Electronic Engineering, Yonsei University) ;
  • Sohn, Kwanghoon (Dept. of Electrical and Electronic Engineering, Yonsei University)
  • Received : 2015.12.25
  • Accepted : 2016.05.18
  • Published : 2016.06.30

Abstract

This paper proposes a framework for superpixel-based vehicle detection using plane normal vectors in disparity space. We utilize the two stages common to vehicle detection: Hypothesis Generation (HG) and Hypothesis Verification (HV). At the HG stage, we set the regions of interest (ROI) by estimating the lane, and track them to reduce the computational cost of the overall process. The image is then divided into compact superpixels, each of which is viewed as a plane characterized by its normal vector in disparity space. A representative normal vector is then computed at the superpixel level, which alleviates the well-known problems of conventional color-based and depth-based approaches. Based on the assumption that the central-bottom of the input image always lies on the navigable region, the road and obstacle candidates are simultaneously extracted using the plane normal vectors clustered by the K-means algorithm. At the HV stage, the separated obstacle candidates are verified by employing HOG and SVM as the feature and classifier, respectively. To this end, we trained the SVM classifier on HOG features from the KITTI training dataset. The experimental results demonstrate that the proposed vehicle detection system outperforms conventional HOG-based methods both qualitatively and quantitatively.


1. INTRODUCTION

Over the past decade, the incidence of urban traffic accidents has continued to increase, while the overall incidence of traffic accidents has slowly decreased. According to the “European Accident Research and Safety Report 2013” [1], more than 90 percent of driving accidents originate from driver error. To reduce accidents caused by such errors, there is a trend toward the mandatory installation of intelligent vehicle systems using various sensors such as Lidar, laser, radar, and cameras.

These sensor-based Advanced Driver Assistance Systems (ADASs) have achieved great progress, providing a better understanding of the environment in order to improve traffic safety and efficiency. An ADAS mainly assists human drivers by alerting them in situations such as lane and road departure, obstacle collision, pedestrian and vehicle detection, and parking assistance. Among the various sensing modalities, imaging technology has developed remarkably in recent years. Vision-based sensors are cheaper, smaller, and of higher quality than ever before, while computing power has dramatically increased. In particular, vision-based sensors provide rich information such as color, shape, and depth, and have the advantage that drivers can effectively understand the surrounding environment from the images.

With advances in camera sensing and computational technologies, vehicle detection using monocular/stereo vision and sensor fusion has become an extremely active research area in the intelligent vehicles community. Most monocular vision-based systems utilize low-level information such as color, texture, edges, and motion information obtained from difference images between temporally neighboring frames. This low-level information is powerful for understanding the environment around the vehicle and for detecting obstacles. However, monocular vision-based systems suffer from the loss of depth and structure information. To overcome these limitations, stereo vision-based systems have been introduced and actively researched in recent years. Besides supplementing the depth and structure information, stereo vision-based systems can also provide the real distance to obstacles and inform drivers of their accurate locations. However, the quality of this information is highly dependent on the estimated depth, which is measured by calculating the difference between the position of a feature in a reference image and its position in the target image. In this paper, a vehicle detection method is proposed to alleviate the problems of conventional color-based and depth-based approaches. It is based on superpixel-level plane normal vectors in disparity space, where the Histogram of Oriented Gradients (HOG) and a Support Vector Machine (SVM) are used as the feature and classifier, respectively.

The rest of this paper is organized as follows. Chapter 2 provides a brief review of vision-based vehicle detection. We then describe the proposed vehicle detection method in Chapter 3. Chapter 4 shows the experimental results and their evaluation on the KITTI dataset. Finally, we conclude the paper with a discussion of further work in Chapter 5.

 

2. RELATED WORKS

Vehicle detection methods can be categorized into two groups: 1) correlation-based approaches using template matching and 2) learning-based approaches using an object classifier.

1) Template Matching: The correlation-based method uses a predefined template and determines a similarity measure by calculating the correlation between the ROI and the template. Since there are various vehicle models with different appearances, a general template with the common features of a vehicle is used. These features include the rear window and number plate, a rectangular box with a specific aspect ratio, and a “U”-shaped pattern with one horizontal and two vertical edges [2], [3]. A vehicle can appear at different sizes depending on its distance from the camera; therefore, the correlation test is usually performed at several scales of the ROI, as sketched below. The intensity of the image is also normalized before the correlation test to obtain a consistent result.
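As an illustration only (not the code of [2] or [3]), such a multi-scale normalized correlation test might be sketched in Python with OpenCV as follows; the scale set is an assumption for the example.

```python
import cv2

def template_score(roi_gray, template_gray, scales=(0.5, 0.75, 1.0, 1.5)):
    """Best normalized cross-correlation of a vehicle template over scales."""
    best = -1.0
    for s in scales:
        t = cv2.resize(template_gray, None, fx=s, fy=s)
        if t.shape[0] > roi_gray.shape[0] or t.shape[1] > roi_gray.shape[1]:
            continue                       # template must fit inside the ROI
        # TM_CCOEFF_NORMED performs the intensity normalization noted above.
        res = cv2.matchTemplate(roi_gray, t, cv2.TM_CCOEFF_NORMED)
        best = max(best, float(res.max()))
    return best
```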

2) Classifier-Based Vehicle Verification: This approach uses two-class image classifiers to differentiate vehicle candidates from non-vehicle candidates. A classifier learns the characteristics of a vehicle’s appearance from training images. Training follows a supervised learning approach in which a large set of labeled positive (vehicle) and negative (non-vehicle) images is used. The most common classification schemes for vehicle verification include SVM [4,20] and AdaBoost [5]. To facilitate the classification, training images are first preprocessed to extract descriptive features. The selection of features is important for achieving good classification results: a good feature set should capture most of the variability in the appearance of a vehicle. Numerous features have been proposed for vehicle classification, such as HOG [6], Gabor features [7], and Principal Component Analysis (PCA) [8].

The HOG feature captures local histograms of an image’s gradient orientations in a dense grid; it was first proposed by Dalal and Triggs [6] for human classification. A system using a linear SVM classifier trained on HOG features was able to spot vehicles in different traffic scenarios [9].

Gabor features have long been used for texture analysis of images [10], as they capture local line and edge information at different orientations and scales. An SVM classifier trained on Gabor features was tested for vehicle detection and achieved a 94.5% detection rate at a 1.5% false detection rate [7]; it was shown to outperform a PCA-feature-based ANN classifier. A systematic and general evolutionary Gabor filter optimization (EGFO) approach that selects optimal Gabor parameters (orientation and scale) was investigated to improve vehicle detection performance [11]. An SVM classifier trained on boosted Gabor features, with parameters (orientation and scale) selected by learning, reported a 96% detection rate [12].

 

3. PROPOSED METHOD

3.1 Algorithm overview

The proposed vehicle detection method aims at effective collision avoidance by identifying not only vehicles but also the road. The overview of the proposed framework is illustrated in Fig. 1. We utilize the two stages common to vision-based vehicle detection methods: Hypothesis Generation (HG) and Hypothesis Verification (HV).

Fig. 1.Overview of the proposed framework.

At the HG stage, lane estimation is performed for adaptive ROI selection. We employ Canny edge detection and the Hough transform, and the lane is tracked every frame. Simultaneously, disparity estimation is performed using Semi-Global Matching (SGM) [14], and Simple Linear Iterative Clustering (SLIC) superpixels [13] are extracted based on color information; these superpixels partition the estimated disparity map so that a plane normal vector can be computed in disparity space for each segment. Clustering is then applied to the plane normal vectors at the superpixel level. Under the assumption that the central-bottom of the input image always lies on the road region, the K-means clustering algorithm is used to decide whether each plane belongs to the road or to an obstacle. In this process, we extract the obstacle candidates within the selected ROI. Next, at the HV stage, the obstacle candidate regions are verified by employing HOG and SVM as the feature and classifier, respectively. To this end, we trained the SVM classifier with HOG features from the KITTI training dataset. We then extract HOG features from the obtained obstacle candidates and classify vehicles with the trained SVM classifier. Each step of the proposed framework is described in the following subsections.

3.2 Selection of a region of interest (ROI) using lane detection and tracking

A vanishing point is the image point at which the projections of parallel lines intersect. It is an invariant feature of the image and has been used for both qualitative and quantitative image analysis. Methods using local oriented textures can detect more reliable vanishing points on unstructured roads, but edge-based vanishing point detection is better suited to real-time implementation. Since we mainly deal with vehicle detection in structured environments such as the highway and urban scenes of the KITTI dataset, we employ edge-based vanishing point detection. First, edges are detected by the Canny edge detector, and lines are then estimated by the Hough transform; detecting straight lines is a necessary prerequisite for all vanishing point detection methods. Lines are formed by grouping connected regions of pixels that have similar gradient orientations. Among the many detected lines, we extract the two dominant lines according to their angles. We assume that these two dominant lines are the lane markings and designate the region beneath them as the region of interest (ROI). This ROI is essential for reducing the computational cost of the subsequent processing, and it makes vehicle detection robust to damaged roads and varied traffic environments. Thus, the ROI obtained via vanishing point detection provides a strong constraint in challenging traffic environments.
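Below is a minimal sketch of this ROI selection, assuming Python with OpenCV rather than the authors' MATLAB environment; the Canny thresholds, Hough vote threshold, and angle ranges are illustrative assumptions, not values from the paper.

```python
import cv2
import numpy as np

def lane_roi(gray):
    """Detect two dominant lane lines and return the ROI mask beneath them."""
    edges = cv2.Canny(gray, 50, 150)                    # Canny edge detection
    lines = cv2.HoughLines(edges, 1, np.pi / 180, 120)  # Hough transform
    if lines is None:
        return None, None

    left, right = [], []
    for rho, theta in lines[:, 0]:                      # sorted by votes
        angle = np.degrees(theta)
        if 20 < angle < 70:                             # left-leaning lines
            left.append((rho, theta))
        elif 110 < angle < 160:                         # right-leaning lines
            right.append((rho, theta))
    if not left or not right:
        return None, None
    dominant = [left[0], right[0]]                      # two dominant lines

    # ROI: pixels lying below both lane lines (image y grows downward).
    h, w = gray.shape
    ys, xs = np.mgrid[0:h, 0:w]
    mask = np.ones((h, w), bool)
    for rho, theta in dominant:
        mask &= xs * np.cos(theta) + ys * np.sin(theta) >= rho
    return dominant, mask
```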

3.3 Road and obstacle region extraction using plane normal vector in disparity space

Disparity measures the difference between the position of a feature in a reference image and its position in the target image. Since it provides depth information robust to illumination variation, we estimate disparity to distinguish the road from obstacles. In this paper, dense disparity maps are computed using semi-global matching (SGM) [14]. Although SGM is one of the most reliable stereo matching algorithms, the estimated disparity map is still sparse in challenging traffic scenes containing low-textured road surfaces or shady regions. As shown in Fig. 2(a), the disparity values in low-textured regions are sparse due to estimation failure. To handle this problem, the image is segmented into superpixels and a representative disparity value is allocated to each superpixel. Note that disparity estimation relies on the smoothness assumption, i.e., nearby pixels with similar appearance have similar disparities. The low-texture and shadow problems can thus be handled effectively by over-segmentation, since superpixels are likely to be uniform in color and texture. We perform the over-segmentation of the input image into superpixels using SLIC [13]. As shown in Fig. 2(b), superpixels produce spatially appealing segments of the road and tend to preserve boundaries effectively. A brief sketch of this step is given after Fig. 2.

Fig. 2.Scene including a low-textured road surface. (a) Original image, (b) its corresponding disparity map.
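A rough sketch of this step under stated assumptions: OpenCV's StereoSGBM stands in for the SGM of [14], scikit-image's slic for [13], and the parameter values are illustrative.

```python
import cv2
import numpy as np
from skimage.segmentation import slic

def disparity_and_superpixels(left_bgr, right_bgr, n_segments=800):
    """Dense SGM-style disparity plus a SLIC over-segmentation."""
    left_gray = cv2.cvtColor(left_bgr, cv2.COLOR_BGR2GRAY)
    right_gray = cv2.cvtColor(right_bgr, cv2.COLOR_BGR2GRAY)

    sgbm = cv2.StereoSGBM_create(minDisparity=0, numDisparities=128,
                                 blockSize=5, P1=8 * 5 * 5, P2=32 * 5 * 5)
    disp = sgbm.compute(left_gray, right_gray).astype(np.float32) / 16.0
    disp[disp < 0] = np.nan                # mark failed estimates as errors

    labels = slic(cv2.cvtColor(left_bgr, cv2.COLOR_BGR2RGB),
                  n_segments=n_segments, compactness=10)
    return disp, labels
```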

In order to estimate the geometric information of the input stereo pair, we find the best-fit plane for each superpixel. The best-fit plane is obtained from a set of points consisting of image coordinates and disparity. Let us denote the given stereo pair IL : I→R3 and IR : I→R3, and their corresponding disparity map D : I→L, which assigns each pixel index p to a disparity dp∈L, where I ⊂ N2 is a dense discrete image domain and L is a discrete set of disparity candidates. Our goal is to estimate the plane normal vector fields NL : I→R3 and NR : I→R3. Without loss of generality, we concentrate on deriving NL in the following.

Let Si denote the i-th index set of pixels in IL. The plane is defined as follows;

d = Ai u + Bi v + Ci,    (1)

where Ai, Bi, and Ci represent the constants of the plane equation in the i-th superpixel, and u, v, and d are the variables of the X, Y, and Z coordinates, respectively. We then form the error function Ei for fitting the best-fit plane in Si as follows;

Ei = Σp∈S̃i (Ai up + Bi vp + Ci − dp)²,    (2)

where S̃i denotes the i-th index set of non-error disparity pixels in Si. Since equation (2) is a quadratic form, the optimal values of (Ai,Bi,Ci) can be computed by setting its derivatives to zero as follows;

∂Ei/∂Ai = ∂Ei/∂Bi = ∂Ei/∂Ci = 0.    (3)

Then, the representative normal vector of the plane in Si is estimated as follows;

ni = (Ai, Bi, −1) / ‖(Ai, Bi, −1)‖.    (4)

As a result, the representative normal vector in equation (4) is assigned to each superpixel. Since the normal vector is estimated only from the non-error disparity values in each superpixel, the proposed framework is robust to stereo matching errors, as shown in Fig. 3. A compact least-squares sketch of this fit is given after Fig. 3.

Fig. 3.Effectiveness of the superpixel-based computation. (a) pixel-wise estimation, (b) superpixel-based estimation.
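The plane fit of equations (1)-(4) reduces to an ordinary least-squares problem per superpixel. The following is a compact sketch, not the authors' MATLAB implementation: np.linalg.lstsq solves the quadratic minimization of equation (2) in closed form, and the fitted plane d = A·u + B·v + C has normal (A, B, −1), normalized as in equation (4).

```python
import numpy as np

def superpixel_normals(disp, labels):
    """Fit d = A*u + B*v + C per superpixel; return unit plane normals."""
    normals = {}
    for i in np.unique(labels):
        vs, us = np.nonzero(labels == i)      # pixel coordinates (v: row, u: col)
        d = disp[vs, us]
        valid = ~np.isnan(d)                  # use non-error disparities only
        if valid.sum() < 3:                   # a plane needs at least 3 points
            continue
        M = np.column_stack([us[valid], vs[valid], np.ones(valid.sum())])
        (A, B, C), *_ = np.linalg.lstsq(M, d[valid], rcond=None)
        n = np.array([A, B, -1.0])            # normal of A*u + B*v - d + C = 0
        normals[i] = n / np.linalg.norm(n)    # equation (4)
    return normals
```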

Here we aim to determine, for each superpixel Si, whether it belongs to the road region by considering the similarity of the normal vectors ni. Based on the assumption that the central-bottom of the input image is included in the road region, we apply the K-means clustering algorithm to the set {ni} to decide whether each superpixel belongs to the road or not [15]. Let the normal vectors be clustered into a set of K clusters C = {c1, c2, ⋯, cK}, where cl is the index set of the l-th cluster. The K-means algorithm finds the partition that minimizes the squared error between the mean of each cluster and the normal vectors in the cluster. Let µl be the mean of the normal vectors in the l-th cluster. The sum of squared errors between µl and the normal vectors in the l-th cluster is defined as

el = Σi∈cl ‖ni − µl‖².    (5)

The goal of K-means clustering is to minimize the sum of squared errors over all K clusters as follows:

E(C) = Σl=1,…,K Σi∈cl ‖ni − µl‖².    (6)

Then, K sets of normal vectors are extracted. Let cf be the cluster containing the normal vectors at the central-bottom of the given scene. Finally, the road region is determined as follows;

Si ∈ road if i ∈ cf, and Si ∈ obstacle otherwise.    (7)

As illustrated in Fig. 4, the proposed scheme effectively classifies the road since the clustering is performed in 3D space. Within the ROI shown in Fig. 5(a), the road and obstacles are separated as in Fig. 5(b). A compact sketch of this clustering step is given after Fig. 5.

Fig. 4.Colorization of plane normal vector.

Fig. 5.Extraction of obstacle and road region in ROI. (a) ROI, (b) Result.
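The clustering of equations (5)-(7) can be sketched as follows, assuming scikit-learn's KMeans as a stand-in for the K-means of [15]; K = 3 and the seed position are illustrative, and the sketch assumes the central-bottom superpixel received a valid normal.

```python
import numpy as np
from sklearn.cluster import KMeans

def split_road_obstacle(normals, labels, K=3):
    """Cluster superpixel normals; the cluster c_f holding the
    central-bottom superpixel is labeled road (equation (7))."""
    ids = sorted(normals)                         # superpixel indices
    X = np.array([normals[i] for i in ids])       # the set {n_i}
    assign = KMeans(n_clusters=K, n_init=10).fit_predict(X)

    h, w = labels.shape
    seed = labels[h - 1, w // 2]                  # central-bottom superpixel
    c_f = assign[ids.index(seed)]

    road = {i for i, a in zip(ids, assign) if a == c_f}
    obstacle = set(ids) - road
    return road, obstacle
```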

3.4 Vehicle detection using active learning

The HOG descriptor presented in [6] provides excellent performance relative to other existing feature sets, including wavelets. The basic hypothesis is that local object appearance and shape can often be characterized rather well by the distribution of local intensity gradients or edge directions, even without precise knowledge of the corresponding gradient or edge positions.

In [6], each detection window is divided into cells of size 8 × 8 pixels, and each group of 2 × 2 cells is integrated into a block in a sliding fashion, so blocks overlap with each other. For each pixel I(x,y), the gradient magnitude m(x,y) (equation (10)) and orientation θ(x,y) (equation (11)) are computed within these cells. A local one-dimensional orientation histogram of gradients is then formed from the gradient orientations of the sample points within a cell. Each histogram divides the gradient angle range into a predefined number of bins (e.g., 9 bins), and the gradient magnitudes vote into the orientation histogram.
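The displayed equations (10) and (11) are not reproduced in this text; the sketch below shows the standard Dalal-Triggs gradient computation they refer to, using centered [−1, 0, 1] differences.

```python
import numpy as np

def gradient_mag_ori(I):
    """Per-pixel gradient magnitude (eq. (10)) and orientation (eq. (11))."""
    I = I.astype(float)
    gx = np.zeros_like(I)
    gy = np.zeros_like(I)
    gx[:, 1:-1] = I[:, 2:] - I[:, :-2]      # horizontal [-1, 0, 1] filter
    gy[1:-1, :] = I[2:, :] - I[:-2, :]      # vertical   [-1, 0, 1] filter
    m = np.sqrt(gx ** 2 + gy ** 2)          # gradient magnitude
    theta = np.arctan2(gy, gx)              # gradient orientation
    return m, theta
```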

Each block contains a concatenated vector of all its cells; in other words, each block is represented by a 36-D feature vector that is normalized to L2 unit length (equation (12)). Each 64 × 128 detection window is represented by 7 × 15 blocks, producing a total of 3780 features per detection window. This feature extraction is thus a dense representation that maps local image regions to a high-dimensional feature space. These features are then used to train a linear SVM classifier.
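The window layout above can be reproduced with scikit-image's hog (used here as an assumed stand-in for the descriptor of [6]): a 64 × 128 window with 8 × 8 cells, 2 × 2 blocks, and 9 bins yields exactly 7 × 15 × 36 = 3780 features.

```python
import numpy as np
from skimage.feature import hog

window = np.random.rand(128, 64)          # a 64 x 128 detection window (rows x cols)
features = hog(window, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), block_norm='L2')   # L2 unit-length blocks
assert features.shape == (3780,)          # 7 x 15 blocks x 36-D each
```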

We use a linear SVM rather than a Radial Basis Function (RBF) kernel SVM as our binary classifier: since the number of HOG features is large, mapping the data into a higher-dimensional space may be unnecessary, and a linear SVM is computationally faster. Training is simplest in the case where the training patterns are linearly separable. The classifier is a linear function of the form

f(m) = wᵀm + b,

such that for each training example mi, the function yields f(mi) ≥ 0 for yi = +1 and f(mi) < 0 for yi = −1, and f(m) = wᵀm + b = 0 is the separating hyperplane.

The SVM [16] is a supervised learning model with associated learning algorithms that analyze data and recognize patterns based on the structural risk minimization principle. In the simple binary classification case, the objective of the SVM is to find the separating hyperplane with the maximum margin. A simple form of the SVM classifier is

f(m) = sgn(Σj=1,…,N αj yj K(m, mj) + b),

where m is the feature vector of an observation, y ∈ {+1, −1} is the class label, mj is the feature vector of the j-th training sample, N is the total number of training samples, and K(m, mj) is a kernel function. In the SVM learning process, the coefficients α = {α1, α2, ⋯, αN} are computed.

The training data consist of positive and negative examples, which are fixed-resolution image windows. Each positive window usually contains a single centered instance of a vehicle, while negative windows are randomly sub-sampled and cropped from a set of images containing no vehicles. We extract HOG features from all the training data and train the linear SVM classifier on these high-dimensional feature vectors, as shown in Fig. 1.
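A minimal training sketch, assuming scikit-learn's LinearSVC as the linear SVM solver; the regularization constant C is an illustrative choice, not a value from the paper.

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_detector(pos_feats, neg_feats):
    """Train a linear SVM on stacked HOG vectors with labels +1 / -1."""
    X = np.vstack([pos_feats, neg_feats])
    y = np.concatenate([np.ones(len(pos_feats)), -np.ones(len(neg_feats))])
    return LinearSVC(C=0.01).fit(X, y)    # C = 0.01 is illustrative
```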

In practice, running the preliminarily trained classifier on the set of training images containing no instances of the object generates many false positives. To reduce false positives and make full use of the training images, we use the preliminary detector to exhaustively scan the negative training images for hard examples (false positives), and then re-train the classifier on the augmented training set (the original positives and negatives plus the hard examples) to produce the final detector.
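The bootstrapping round might look as follows, reusing train_detector from the sketch above; scan_windows is a hypothetical helper that yields the HOG vector of every window in an image.

```python
import numpy as np

def mine_and_retrain(clf, pos_feats, neg_feats, neg_images, scan_windows):
    """One hard-negative mining round: collect false positives, retrain."""
    hard = []
    for img in neg_images:                 # images containing no vehicles
        for f in scan_windows(img):        # hypothetical window scanner
            if clf.decision_function(f[None, :])[0] > 0:
                hard.append(f)             # false positive -> hard example
    if hard:
        neg_feats = np.vstack([neg_feats, np.array(hard)])
    return train_detector(pos_feats, neg_feats)   # final detector
```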

During the detection phase, the binary window classifier is scanned across the regions of the obstacle candidates obtained in the HG stage. This typically produces multiple overlapping detections for each object instance, so we separate the individual obstacle candidates using the segmented disparity map. The representative disparity value of each superpixel is estimated as its dominant value, as shown in Fig. 6(a), and the region of each obstacle candidate is extracted by disparity similarity. The candidates are then verified by our classifier, and we detect vehicles as shown in Fig. 6(b). A sketch of the representative-disparity computation is given after Fig. 6.

Fig. 6.Result of vehicle detection. (a) Representative disparity value of the superpixel, (b) Result of vehicle detection.
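The dominant (representative) disparity of each superpixel can be sketched as the mode of its quantized valid disparities; the integer quantization is an illustrative choice.

```python
import numpy as np

def representative_disparity(disp, labels):
    """Mode of the valid disparities within each superpixel."""
    rep = {}
    for i in np.unique(labels):
        d = disp[labels == i]
        d = d[~np.isnan(d)]                # drop error disparities
        if d.size == 0:
            continue
        vals, counts = np.unique(np.round(d).astype(int), return_counts=True)
        rep[i] = int(vals[np.argmax(counts)])   # dominant value
    return rep
```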

 

4. EXPERIMENTAL RESULTS

Our processing platform is a standard PC with a 3.4 GHz Intel i7 CPU and 8 GB of RAM, and the computation environment is MATLAB R2014a. The KITTI dataset [17] is selected as the test set for performance evaluation because it provides scenes captured while driving around a city, in rural areas, and on highways. In addition to providing all the data in raw format, it supplies benchmarks for each task. The KITTI dataset contains 7,500 images of street scenes divided into 150 sequences of varying duration. The stereo cameras are mounted approximately level with the ground plane, and the camera images, cropped to a size of 1382 × 512 pixels, form a calibrated, synchronized, and rectified autonomous driving dataset.

The KITTI dataset consists of 7,500 images of various traffic scenes, comprising a total of 40,000 labeled objects with various aspects, poses, and illumination conditions. Among them, we use the vehicle-labeled image patches as positives and randomly cropped image patches as negatives. The positive dataset includes 1,701 image patches containing front, rear, and side views of vehicles, as shown in Fig. 7. Since the resolution of the image patches varies, we normalize the size of each image to 128 × 128 pixels.

Fig. 7.Training examples: positives (left) and negatives (right).

A very important issue in training a classifier for a single object class is how to select effective negative training samples. Since negative training samples can include all kinds of images, a prohibitively large set would be needed to be representative, which would also require an infeasible amount of computation in training. To alleviate this problem, the bootstrapping method proposed by Sung and Poggio [18] is used to incrementally train the classifier, as illustrated in Fig. 8.

Fig. 8.The bootstrap training diagram.

Fig. 9 shows detection examples produced by the proposed method: Fig. 9(a) shows example images with extracted obstacle candidates, and Fig. 9(b) shows the corresponding detection results. Fig. 10 compares the proposed method with simple HOG-based vehicle detection methods. Vehicle detection performance is evaluated using the measure discussed in [17]. As summarized in Fig. 11, the average false positive rate of the proposed method decreases by 6.15% compared with the HOG-based method [6].

Fig. 9.Examples of the proposed vehicle detection: (a) Region of vehicle candidates, (b) Detection result.

Fig. 10.Comparison with simple HOG-based vehicle detection methods.

Fig. 11.False positive rate of vehicle detection methods. (a) HOG-based vehicle detection [6], (b) Result of the proposed vehicle detection.

 

5. CONCLUSION

This paper proposed a framework for superpixel-based vehicle detection using plane normal vectors in disparity space. We utilize the two stages common to vehicle detection: HG and HV. Conventional stereo-based HG methods provide unsatisfactory results for stereo pairs captured under uncontrolled conditions such as illumination distortions, and most efforts to address this problem have been devoted to developing robust post-processing. The main advantage of our HG method is that the disparity-based normal vector yields road geometry insensitive to illumination changes, while the superpixel-based processing enables robust disparity estimation on low-textured road regions and reflective vehicle surfaces. Experiments performed on various road scenes have shown that the proposed framework robustly extracts the road and obstacle candidates under real traffic conditions and outperforms conventional HOG-based methods both qualitatively and quantitatively. Because the regions to be verified are dramatically reduced, the proposed system can verify vehicles efficiently. The framework is also applicable to other feature-based object detection problems. Evaluated on the KITTI dataset, the proposed vehicle detection system reduces the false detection rate by 6.15% compared to the conventional HOG-based method.

References

  1. Volvo Trucks, European Accident Research and Safety Report, 2013.
  2. P. Parodi and G. Piccioli, “A Feature-based Recognition Scheme for Traffic Scenes,” Proceeding of IEEE Intelligent Vehicle Symposium, pp. 229-234, 1995.
  3. A. Bensrhair, M. Bertozzi, and A. Broggi, “A Cooperative Approach to Vision-based Vehicle Detection,” Proceeding of IEEE Intelligent Transportation Systems, pp. 207-212, 2001.
  4. V.N. Vapnik, The Nature of Statistical Learning Theory, Springer, NewYork, 1995.
  5. Y. Freund and R.E. Schapire, “A Decision-Theoretic Generalization of On-line Learning and an Application to Boosting,” Proceeding of Conference Computational Learning Theory, pp. 23-37, 1995.
  6. N. Dalal and B. Triggs, “Histograms of Oriented Gradients for Human Detection,” Proceeding of IEEE Computer Vision and Pattern Recognition, pp. 886-893, 2005.
  7. S. Zehang, G. Bebis, and R. Miller, “On-road Vehicle Detection Using Gabor Filters and Support Vector Machines,” Proceeding of International Conference Digital Signal Processing, Vol. 2, pp. 1019-1022, 2002.
  8. Q. Truong and B. Lee, “Vehicle Detection Algorithm Using Hypothesis Generation and Verification,” Emerging Intelligent Computing Technology and Applications, pp. 534-543, 2009.
  9. L. Mao, M. Xie, Y. Huang, and Y. Zhang, “Preceding Vehicle Detection Using Histograms of Oriented Gradients,” Proceeding of IEEE International Conference Communications, Circuits and Systems, pp. 354-358, 2010.
  10. S.E. Grigorescu, N. Petkov, and P. Kruizinga, “Comparison of Texture Features Based on Gabor Filters,” IEEE Transactions on Image Processing, Vol. 11, No. 10, pp. 1160-1167, 2002. https://doi.org/10.1109/TIP.2002.804262
  11. S. Zehang, G. Bebis, and R. Miller, “On-road Vehicle Detection Using Evolutionary Gabor Filter Optimization,” IEEE Transactions on Intelligent Transportation Systems, Vol. 6, No. 2, pp. 125-137, 2005. https://doi.org/10.1109/TITS.2005.848363
  12. H. Cheng, N. Zheng, and C. Sun, “Boosted Gabor Features Applied to Vehicle Detection,” IEEE Proceeding of International Conference Pattern Recognition, pp. 662-666, 2006.
  13. R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Susstrunk, “SLIC Superpixels Compared to State-of-the-art Superpixel Methods,” IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 34, No. 11, pp. 2274-2282, 2012. https://doi.org/10.1109/TPAMI.2012.120
  14. H. Hirschmuller, “Accurate and Efficient Stereo Processing by Semiglobal Matching and Mutual Information,” Proceeding of IEEE Conference Computer Vision and Pattern Recognition, Vol. 2, pp. 807-814, 2005.
  15. T. Kanungo, D.M. Mount, N.S. Netanyahu, C.D. Piatko, R. Silverman, and A.Y. Wu, “An Efficient k-Means Clustering Algorithm: Analysis and Implementation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 24, No. 7, pp. 881-892, 2002. https://doi.org/10.1109/TPAMI.2002.1017616
  16. T. Malisiewicz, A. Gupta, and A.A. Efros, “Ensemble of Exemplar-SVMs for Object Detection and Beyond,” Proceeding of IEEE International Conference Computer Vision, pp. 89-96, 2011.
  17. A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision Meets Robotics: The KITTI Dataset,” The International Journal of Robotics Research, 2013.
  18. K.K. Sung and T. Poggio, “Example-based Learning for View-based Human Face Detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 20, No. 1, pp. 39-51, 1998. https://doi.org/10.1109/34.655648
  19. N. Dalal, B. Triggs, and C. Schmid, “Human Detection Using Oriented Histograms of Flow and Appearance,” Proceeding of European Conference on Computer Vision, pp. 428-441, 2006.
  20. M.S. Choi, J.H. Lee, J.H. Suk, T.M. Roh, and J.C. Shim, “Vehicle Detection based on the Haar-like Feature and Image Segmentation,” Journal of Korea Multimedia Society, Vol. 13, No. 9, pp. 1314-1321, 2010.
