# 6 DoF Vehicle Pose Estimation for Autonomous Racing

6DoF vehicle detection and pose estimation. [Left] Input image displaying perspective of a trailing vehicle, with predicted 3D bounding box (green) and ground truth annotation (red dots) generated from an automatic labeling tool. [Right] Map showing track boundaries, along with the trailing vehicle’s pose (yellow), detected vehicle’s pose estimate (green), and ground truth pose of detected vehicle (red).

## Abstract

Over the past few years, deep learning has increasingly been used to estimate position and orientation of objects in monocular camera images. In this project, a pipeline for training a convolutional neural network (CNN) for detecting vehicles and estimating 6DoF vehicle poses in real-time is described. This pipeline involves an automatic labeling tool capable of rapidly labeling real world data for training, a single-shot CNN for 3D bounding box estimation, a PnP algorithm for estimating a 6DoF pose, and a ROS node for running this detection onboard a physical vehicle. Vehicle detection and pose estimation is the first stage for enabling racing between AutoRally platforms (autorally.github.io) that will use vision for tracking opponents and making control decisions.

## Background

The AutoRally platform [1] is a 1/5th scale autonomous vehicle capable of performing high-speed, aggressive maneuvers on off-road terrain. This platform has two forward-facing monocular cameras, an IMU, GPS, and hall-effect wheel speed sensors. The objective of this project is to detect AutoRally vehicles given a single monocular RGB camera image, and estimate the 6 degree-of-freedom (DoF) pose of that vehicle. 6DoF pose estimation involves estimating the position (x,y,z) and orientation (roll, pitch, yaw) of an object of interest. The motivation for this capability is to eventually have mutiple AutoRally vehicles autonomously racing against each other using vision, which will require vehicles to have accurate pose estimation of other vehicles on the track.

2D bounding box estimation for objects in images has achieved remarkable results using deep learning. SSD [2], and YOLO v2 [3] are examples of extremely fast single-shot object detection algorithms. Only more recently has the challenge of estimating 6DoF poses with deep learning been explored, with much improvement to be made. Many variations of 6DoF pose estimation algorithms using deep learning exist, although they often have different requirements in order to estimate pose. For example, BB8 [4] creates a 2D segmentation mask of an object and subsequntly applies a convolutional neural network (CNN) to predict the 3D bounding box coordinates from a cropped image, and solves for the pose using a PnP algorithm. [5] regresses 3D bounding box coordinates from a 2D bounding box prediction, and applies additional constraints since the network is specifically predicting poses of vehicles in urban environments. Kehl extends SSD to SSD-6D [6], which predicts 2D bounding boxes that have corresponding likely 6DoF pose hypotheses, and is trained entirely on synthetic data. Deep-6dpose is an end-to-end architecture for detecting and segmenting objects, while also predicting 6DoF poses with a CNN and region proposal network (RPN) [7]. Xiang also proposes an end-to-end network that separates semantic labeling, 3D translation estimation, and 3D rotation estimation to produce 6DoF poses, and also applies a pose refinement stage to improve prediction accuracy [8].

One issue is that the majority of these approaches are trained and evaluated on the LINEMOD [9] dataset, which consists of household objects that are very different from the object of interest – the AutoRally platform. Additionally, these approaches may require physical models, or densely annotated labels for semantic segmentation, which are often not available for many perceptions tasks. For this project, the work of Tekin [10] is extended to predict the 6DoF pose of AutoRally vehicles. Tekin extends YOLO v2 to predict 3D bounding boxes, and uses a PnP algorithm [11] for estimating the 6DoF pose of objects. This work is used for this project because the network runs at nearly 50 Hz on a Titan X GPU, which is much faster than many other state-of-the-art solutions to 6DoF pose estimation. Inference time is critical for the AutoRally platform, since high-speed racing requires sufficiently high detection rates of other vehicles on the track.

• An automatic labeling tool that labels several thousand images per hour without assistance from a human.
• The single-shot architecture proposed in [10], along with the training procedure and PnP algorithm for 6 DoF pose estimation.
• A ROS node for estimating vehicle positions within a map of the racetrack.

In this page, two vehicles are used. The trailing vehicle will be referred to as “Cat”, and the leading vehicle will be referred to as “Mouse”. The pose estimation pipeline is run onboard Cat, who is recording camera images. Mouse is the vehicle being detected. Additionally, it is assumed that the roll and pitch detected vehicles is 0, since the track where data is collected is approximately flat.

## Approach

### Automatic Labeling

The network architecture used for this project takes an image with a 3D bounding box label as an input. CNNs for object recognition tasks often require thousands of training images in order to generalize well. However, the process of labeling can be time-consuming, and labeling 3D bounding boxes can be difficult even for a human. Therefore, training images are collected using an automatic labeling tool with no human supervision. This tool requires one AutoRally platform to be driving in front (Mouse) of another, with the trailing vehicle (Cat) recording images. When Mouse is in the field of view of Cat’s camera, the automatic labeling tool will use the known pose estimates of both vehicles to map the position of Mouse onto the image recorded by Cat’s camera (Fig. 1). The following equation was used to identify the pixel representing the center of Mouse in an image

$r$$KG_{Cat}^{Cam}G_{world}^{Cat}q_{world}$        (1)

where $r=(x,y)$ is the pixel value when scaled, $K$ is the camera matrix containing the intrinsic parameters of Cat’s camera, $G_{Cat}^{Cam}$ is the transformation matrix that takes a point in Cat’s pose reference frame to Cat’s camera reference frame, $G_{world}^{Cat}$ is the transformation matrix that takes a point in the world reference frame and projects it to Cat’s pose reference frame. $q_{world}$ is a homogeneous point that represents the position of Mouse. Since the dimensions of the vehicle are known, Eq. (1) can be used to map the 3D world coordinates of Mouse’s eight corners onto an image, which establishes a 3D bounding box.

The Kalibr [12] calibration tool is used to obtain the matrices $K$ and $G_{Cat}^{Cam}$. The AutoRally platform uses the factor graph localization toolbox GTSAM [13] to estimate its pose and velocity based on GPS, IMU, and wheel-speed signals. This pose estimate is used to obtain the transformation from the world reference frame to Cat’s pose reference frame, $G_{world}^{Cat}$. GTSAM is also used to estimate the position of Mouse in the world frame, $q_{world}$.

Figure 1. 3D bounding box labels (green) generated from Automatic labeling tool.

Data was collected by driving Mouse in front of Cat on an off-road test track, with one of Cat’s cameras recording images. The track used for collecting data is an oval dirt track, with a width of approximately 3 meters and maximum outer diameter of 30 meters. Recorded images where Mouse is visible are labeled as class ‘0’ (car), and images where Mouse is not visible are labeled as class ‘1’ (no car). Because poor GPS signal can result in ground truth pose estimates with higher errors, the data is cleaned by removing segments where GPS signal is poor. The automatically labeled dataset consists of 21234 training images (17026 class ‘0’, 4208 class ‘1’) and 728 testing images (636 class ‘0’, 92 class ‘1’). The training data is collected from approximately 35 minutes of driving data, and the testing data is collected from approximately 70 seconds of driving data. The reason why there is a large difference between the size of the training and testing datasets is that the model is currently being trained on nearly all of the automatically labeled examples so that experiments can be run for vision-based vehicle racing, which is important for future research on the Autorally platform.

### Network Architecture and Training

The architecture used in this project is the single-shot deep CNN architecture from [10] (Fig. 2a), which is used to detect the AutoRally platform and predict the 3D bounding box coordinates. An advantage of using this approach is that the labels only need 2D pixel coordinates of 3D bounding box locations, as opposed to other methods which require dense computer models of the object or semantic segmentation labels. This architecture is the same as the architecture in YOLO v2 with 23 convolutional layers and 5 max-pooling layers, although the output dimensions are different since 3D bounding boxes are represented by 9 control points (2D coordinates of eight tightly-fitted corners around an object, plus centroid) as opposed to 2D bounding boxes which are represented by 5 control points (four corners and centroid). This architecture also predicts a confidence value associated with each bounding box prediction which is computed based on the distance between the predicted bounding box pixel coordinates and the ground truth coordinates. The model takes a single RGB image as input, and divides the image into SxS cells (Fig. 2c). Each cell predicts the image coordinates of 3D bounding boxes in the image. For this project, 13 is used as the value for S. All training software uses the PyTorch framework.

Figure 2. Overview: (a) The single-shot CNN architecture. (b) An example input image with four objects. (c) The S × S grid showing cells
responsible for detecting the four objects. (d) Each cell predicts 2D locations of the corners of the projected 3D bounding boxes in the
image. (e) The 3D output tensor from our network, which represents for each cell a vector consisting of the 2D corner locations, the class
probabilities and a confidence value associated with the prediction (Figure from [10]).

The loss function used for this architecture consists of three components, shown in Eq. (2). $L_{pt}$ is the coordinate loss, which is computed using mean-squared error. $L_{conf}$ is the confidence loss, which is also computed using mean-squared error. $L_{id}$ is the classification loss, which is computed using cross entropy. Like YOLO v2, cells with objects have $\lambda_{conf}$ set to 5, and cells without objects have $\lambda_{conf}$ set to 0.1. $\lambda_{pt}$ and $\lambda_{id}$ are set to 1.

$L = \lambda_{pt}L_{pt}+\lambda_{conf}L_{conf}+\lambda_{id}L_{id}$        (2)

The network is pre-trained on ImageNet classification task. Like [10], data augmentation is applied to the training images by randomly changing hue, saturation, and exposure values by up to a factor of 1.5, and images are randomly scaled and translated by up to a factor of 20% of the input image size, with random flips. The input image size is 416×416. Stochastic gradient descent is used for optimization. The network is trained using a batch size of 4, with an initial learning rate of 0.001 that is divided by 10 after 4 epochs and again after 6 epochs. Additionally, a momentum value of 0.9, and regularization value of 0.0005 is used.

### Pose Estimation

A ROS node is created to take camera images as input, run the single-shot network, and estimate the pose of detected vehicles in the world. The network output consists of proposed 3D bounding boxes with a corresponding class (car vs. no car) and a confidence value. To compute the pose of Mouse relative to Cat’s camera’s pose, a PnP algorithm is used. This algorithm takes the 9 bounding box points as input, and returns 3D translation and rotation vectors. However, this pose output is relative to Cat’s camera, and we wish to know Mouse’s pose in the global map frame. Therefore, a series of transformations must be applied (Eq. 3). $G_{Mouse}^{cam}$ is the single-shot network output, which is the pose of Mouse in Cat’s camera frame. $G_{cam}^{Cat}$ is the transformation between Cat’s camera and Cat’s body frame, which is computed using Kalibr [12] to produce a camera-imu calibration. $G_{Cat}^{world}$ is the transformation matrix that converts a point in Cat’s reference frame to the world reference frame.

$G_{Mouse}^{world}=G_{Cat}^{world}G_{cam}^{Cat}G_{Mouse}^{cam}$        (3)

Using this transformation, the pose of the detecting vehicle and detected vehicle can be visualized on the map (Fig. 3).

Figure 3. Cat’s image shown in bottom left. The predicted bounding box (green) and ground truth (red dots) are shown. Global map is shown on the right, with the outer boundaries of the test track (white). Cat’s pose (yellow), predicted pose of Mouse (green), and ground truth pose of Mouse (red) are shown.

## Experiments and Results

The network runs at approximately 15 Hz on an Nvidia GTX 1050 Ti GPU. Training loss for the network, along with pixel error on the testing data, is shown in Fig. 4. For a given test example, pixel error is the norm between the 2D coordinates of the ground truth 3D bounding box projected into the image and the 2D coordinates of the projected 3D bounding box that surround the predicted vehicle pose from the PnP algorithm. This pixel error provides some intuition on how the network performs as part of the entire pipeline of predicting pose.

Figure 4. Training loss (left) and pixel error of the test set (right).

While the loss converges after approximately 30000 training steps, it is not clear how well the entire pose estimation pipeline using the single-shot network performs for the task of estimating the pose of an AutoRally platform in an image. Therefore, an assessment library was written to compute metrics and display visualizations of the pose estimation pipeline’s performance. The results are shown in Fig. 5-7.

Figure 5. [Top Left] Ground truth position of Mouse (green) and single-shot prediction position (dots). Lighter shades of dots indicate higher confidence. [Top Right] 2D position error (yellow) between the ground truth and predicted position of Mouse, and corresponding confidence of the prediction (blue). [Bottom Left] Ground truth heading of Mouse (green) and single-shot prediction heading (red dots). [Bottom Right] Heading error (yellow) between the ground truth and predicted heading of Mouse, and corresponding confidence of the prediction (blue).

The mean 2D position error on the test data is 3.77 meters, with a standard deviation of 9.51 meters. This is quite high, likely because there are outliers with errors over 100 meters which can occur when Mouse is very far from Cat, and minor changes in the bounding box prediction can cause significant changes in the PnP algorithm result. The mean absolute heading error is 0.135 radians, with a standard deviation of 0.9 radians. This error is also quite high. Additionally, it seems like the predicted heading is overestimating the ground truth heading. This could be an artifact of poor camera calibration, or improper optimization of the PnP algorithm.

Figure 6. [Top] Cumulative distribution function of 2D position error between ground truth and predicted position of Mouse. [Bottom] Cumulative distribution function of absolute heading error between ground truth and predicted heading of Mouse.

CDFs (Fig. 6) are shown to help better understand the distribution of errors for the single-shot pose predictor. Half of the predictions have position errors less than 1.74 meters, and absolute heading errors less than 30 degrees. While heading errors are quite high for the entire distribution, position errors are reasonable up until about 1-sigma. Notice that the maximum position error is 170 meters, which is impossible given the dimensions of the test track.

Figure 7. [Top] Error between the ground truth and predicted position of Mouse, plotted against the distance between Cat and Mouse. [Bottom] Error between the ground truth and predicted heading of Mouse, plotted against the distance between Cat and Mouse.

Fig. 7 shows that prediction errors increase as the distance between Cat and Mouse increase. This is reasonable because as Mouse becomes smaller in Cat’s image, it becomes increasingly difficult to estimate a tight-fitting bounding box around the vehicle, and small errors can result in significant pose errors.

Detection of multiple vehicles. The top 2 highest confidence 3D bounding boxes are shown. A green bounding box indicates that the bounding box has the highest confidence of all predictions, while a red bounding box indicates that the bounding box does not have the highest confidence. Both predicted bounding boxes are red if the network is most confident that the image does not have a vehicle. Interestingly, The network is able to detect an AutoRally platform with a clear body cover, even though all training data contained solid, black body covers. This is likely because the vehicle has the same overall structure, which the network has learned.

## Conclusion

In this project, a pipeline for estimating the 6DoF pose of an AutoRally vehicle is described. This pipeline involves automatic labeling to generate training data, a single-shot network architecture for 3D bounding box predictions, a PnP algorithm for producing a 6DoF pose, and a ROS node for estimating the global position of detected vehicles in real-time. This pipeline achieves position errors of less than 1.5 meters and heading errors of less than 30 degrees at close ranges of less than 5 meters. However, it is clear that this pipeline suffers at farther detection ranges. For future work, this pipeline could be improved with the following:

• Higher quality labels. Currently, labels can only be as accurate as the automatic labeling tool, which suffers when GPS quality is poor. This means that an improved labeling tool or human annotations may be required.
• More diverse data. The training data for this project was collected on just two separate days, with relatively uniform driving patterns. Additional data collection with different weather and track conditions, with more varying vehicle poses, may improve pose estimation performance.
• Improved camera-IMU calibration. The calibration between Cat’s camera and Cat’s body frame is important for transforming the position of Mouse into the world frame. Errors in this calibration can cause large pose estimate errors at farther ranges.
• Improved training regime. For this project, the majority of development time was spent on collecting data and writing the infrastructure for predicting pose of detected vehicles in real-time. Performance can be improved with better hyperparameter tuning.

As this pose detection pipeline will lead to vehicle tracking, algorithms like a particle filter can be applied to the vehicle detections in order to smooth the estimated vehicle poses using a motion model.

## References

[1] B. Goldfain, P. Drews, C. You, M. Barulic, O. Velev, P. Tsiotras, and J. M. Rehg, “AutoRally An open platform for aggressive autonomous driving”, arXiv:1806.00678 [cs], Jun. 2018.

[2] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single Shot MultiBox Detector. In ECCV, 2016.

[3] J. Redmon and A. Farhadi, “YOLO9000: Better, faster, stronger,” in IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, July 21–26 2017, pp. 6517–6525.

[4] M. Rad and V. Lepetit. BB8: A Scalable, Accurate, Robust to Partial Occlusion Method for Predicting the 3D Poses of Challenging Objects without Using Depth. In ICCV, 2017.

[5] A. Mousavian, D. Anguelov, J. Flynn, and J. Kosecka. 3D Bounding Box Estimation Using Deep Learning and Geometry. arXiv:1612.00496, 2016.

[6] W. Kehl, F. Manhardt, F. Tombari, S. Ilic, and N. Navab. SSD-6D: Making RGB-Based 3D Detection and 6D Pose Estimation Great Again. In ICCV, 2017.

[7] T. T. Do, M. Cai, T. Pham, and I. D. Reid. Deep-6dpose: Recovering 6D object pose from a single RGB image. arXiv:1802.10367, 2018.

[8] Y. Xiang, T. Schmidt, V. Narayanan, and D. Fox. PoseCNN: A Convolutional Neural Network for 6D Object Pose Estimation in Cluttered Scenes. arXiv preprint arXiv:1711.00199, 2017.

[9] S. Hinterstoisser, V. Lepetit, S. Ilic, S. Holzer, G. Bradski, K. Konolige, and N. Navab. Model Based Training, Detection and Pose Estimation of Texture-less 3D Objects in Heavily Cluttered Scenes. In ACCV, 2012.

[10] Bugra Tekin, Sudipta N Sinha, and Pascal Fua. Real-time seamless single shot 6D object pose prediction. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

[11] V. Lepetit, F. Moreno-Noguer, and P. Fua. EPnP: An Accurate O(n) Solution to the PnP problem. IJCV, 2009.

[12] J. Maye, P. Furgale, R. Siegwart (2013). Self-supervised Calibration for Robotic Systems, In Proc. of the IEEE Intelligent Vehicles Symposium (IVS).

[13] F. Dellaert, “Factor graphs and GTSAM: A hands-on introduction,” Georgia Institute of Technology, Tech. Rep., 2012.