Open access peer-reviewed chapter

Practical Insights on Automotive SLAM in Urban Environments

Written By

Piotr Skrzypczyński

Reviewed: 23 September 2022 Published: 02 November 2022

DOI: 10.5772/intechopen.108262

From the Edited Volume

Autonomous Mobile Mapping Robots

Edited by Janusz Bȩdkowski

Abstract

This chapter tackles the problem of simultaneous localization and mapping (SLAM) using laser scanners or vision as a viable alternative to accurate satellite-based localization, which is popular and easy to implement with modern technology but may fail in many urban scenarios. Two state-of-the-art localization algorithms are considered, LOAM and ORB-SLAM3, which use the optimization-based formulation of SLAM and rely on laser and vision sensing, respectively. The focus is on the practical aspects of localization and the accuracy of the obtained trajectories. The chapter contributes a series of experiments conducted with an electric car equipped with a carefully calibrated multisensory setup comprising a 3D laser scanner, a camera, and a smartphone for collecting the exteroceptive measurements. The results of applying the two different SLAM algorithms to the data sequences collected with this vehicle-based setup clearly demonstrate that not only expensive laser sensors but also monocular vision, including a commodity smartphone camera, can be used to obtain reasonably accurate vehicle trajectories off-line in an urban environment.

Keywords

  • SLAM
  • LiDAR
  • vision
  • GNSS
  • factor graph
  • evaluation

1. Introduction

In the last decade, we have witnessed a rapid development of methods, algorithms, and technologies that make vehicles more autonomous. This trend focuses on self-driving cars but also includes public transportation vehicles and advanced driver assistance systems. In all cases, knowledge of the vehicle’s pose and the availability of an internal world model are enabling conditions for efficient task and motion planning, reasoning about the environment’s semantics, and man–machine interaction.

Although the most widely used localization solution in automotive applications is the global navigation satellite system (GNSS), the availability of the GNSS signal is limited in many practical scenarios, such as driving through a tunnel, maneuvering in an underground parking lot, or navigating among tall buildings downtown. Therefore, simultaneous localization and mapping [1] and visual odometry (VO) [2] are considered complementary means of yielding vehicle pose estimates that enable safe navigation of autonomous or semi-autonomous vehicles. While there is a large body of research on SLAM and VO, many issues remain open when it comes to more complicated yet more practical cases, including 3D perception and lifelong map learning [3].

The two sensing modalities that dominate automotive SLAM are passive cameras and 3D laser scanners, known as LiDARs (light detection and ranging). Both classes of sensors have important advantages but also some limitations. Active 3D laser sensors work well under any lighting conditions, while passive cameras struggle with poorly illuminated scenes and sudden lighting changes. LiDAR sensors provide reasonably dense depth images at ranges of tens or even hundreds of meters. These data make it possible to build dense maps of the environment, whereas passive vision SLAM systems usually rely on sparse maps of point features. However, passive cameras are certainly the most affordable and easy-to-use sensors that can be deployed at minimal cost, and visual SLAM systems can achieve satisfying accuracy of trajectory estimation [4].

This chapter presents an experimental study of the trajectory estimation accuracy of two popular open-source SLAM systems in the context of navigation for autonomous vehicles in an urban environment. It contributes a unique set of experiments conducted using an autonomous electric vehicle equipped with a rich set of exteroceptive sensors. An important part of these experiments is the ground truth trajectories collected using an accurate RTK-GPS (real-time kinematic global positioning system) receiver. These data allow us to test two state-of-the-art SLAM solutions: LiDAR odometry and mapping (LOAM) [5] and ORB-SLAM3 [4], using two sensing modalities: LiDAR and passive vision, respectively.

Additionally, while performing these experiments we also collected images from the camera of a commodity smartphone attached to the windshield of the vehicle. This low-cost camera serves as a test bed for minimal-cost localization in urban environments, as it can be attached to any vehicle.

The aim of the presented research is to answer the question of how accurate the trajectories obtained using different SLAM solutions and different sensing modalities are, in scenarios imitating urban driving but under the full control of an accurate GNSS solution providing precise ground truth trajectories.

The collected data are processed off-line, as SLAM is often used to obtain reference positions for data collected by a vehicle. This is the case in the CityBrands project, of which the presented research is a part, where trajectories are collected to identify the locations of billboards and other advertisement installations. One of the purposes of this project is to verify the thesis that it is possible to obtain vehicle pose estimates with a local translational error not exceeding 1 m using only a smartphone’s camera as a sensor.

2. Concept and implementation of experiments

The experiments were carried out using a set of integrated sensors attached rigidly to a common frame mounted on a vehicle. A Melex electric work car was used during the experiments. The rigid frame with the sensors was mounted on its roof, and the data were collected by a computer placed in the luggage compartment, except for the data from a smartphone, which were stored directly in its memory. The basic components are an aluminum frame with a GNSS receiver and an Ouster OS1-128 Gen2 3D laser scanner with a range of 150 m and a 360° horizontal field of view. The other components of the system are a Basler acA1440-220um USB3 camera with a Basler C125-0418-5M F1.8 f=4 mm lens and an Xsens MTi-30-2A8G4 AHRS (attitude and heading reference system) sensor (Figure 1A).

Figure 1.

Melex electric vehicle with a sensory system configured for the experiments: 1 – Camera, 2 – LiDAR, 3 – IMU, 4 – GNSS, 5 – Smartphone (A), a smartphone visible behind the vehicle’s windshield (B), and a close-up of the main sensors on the roof (C).

The whole sensor set was calibrated both in terms of internal camera parameters (intrinsics) and in terms of external parameters (extrinsics) defining the transformations between the sensor coordinate systems. All the onboard exteroceptive sensors were integrated under Linux using the robot operating system (ROS) [6], with ROS nodes deployed for each of the sensors, writing to defined ROS topics and, finally, saving the data in the flexible “rosbag” format. The ROS system also provided time synchronization to all these nodes by proper time stamping of the messages written to the ROS topics and then to the rosbag files.
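As an illustration of how such stamp-based synchronization can be inspected, the minimal Python sketch below pairs camera, LiDAR, and IMU messages by their header time stamps using the standard ROS message_filters package; the topic names are assumptions for the example, not necessarily those used in the described setup.

```python
#!/usr/bin/env python
# Minimal sketch (not the project's code): soft-synchronizing camera, LiDAR,
# and IMU messages by their ROS header stamps, assuming hypothetical topics.
import rospy
import message_filters
from sensor_msgs.msg import Image, PointCloud2, Imu

def synced_callback(image_msg, cloud_msg, imu_msg):
    # The stamps are close but not identical; the residual offset is handled
    # by the spatiotemporal calibration step described in Section 2.1.
    dt = (image_msg.header.stamp - cloud_msg.header.stamp).to_sec()
    rospy.loginfo("camera-LiDAR stamp offset: %.4f s", dt)

if __name__ == "__main__":
    rospy.init_node("sync_check")
    image_sub = message_filters.Subscriber("/camera/image_raw", Image)
    cloud_sub = message_filters.Subscriber("/ouster/points", PointCloud2)
    imu_sub = message_filters.Subscriber("/imu/data", Imu)
    # Approximate-time policy: group messages whose stamps differ by < 50 ms.
    sync = message_filters.ApproximateTimeSynchronizer(
        [image_sub, cloud_sub, imu_sub], queue_size=10, slop=0.05)
    sync.registerCallback(synced_callback)
    rospy.spin()
```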

Including a Samsung Galaxy A20 smartphone (Figure 1B) in the multisensory system as a second camera requires calibration of the external parameters (translation and rotation) of the smartphone’s camera relative to the Basler camera (Figure 1C). Since the smartphone runs Android, it is not connected to the system as a ROS node and is not subject to the system time synchronization. The images from the smartphone’s camera were collected using a dedicated Android application, as the default Samsung app that supports the camera on the A20 phone does not allow setting a specific focus value (locking the autofocus). With automatic camera focus and white balance settings, it was not possible to calibrate the camera accurately.

2.1 Calibration

Calibration is essential in a multi-sensor system that is intended to provide data for the comparison of different methods and sensing modalities. In our case, calibration was performed offline, on data sequences collected before starting the tests, applying a number of different tools.

The main software package used to calibrate the sensors was Kalibr [7], a system for the calibration of multi-camera and camera-IMU (inertial measurement unit) systems. Kalibr is integrated with ROS, which made many operations on the data stored in rosbag files much easier. The entire calibration process was accomplished in several steps:

  1. Calibration of the Basler camera parameters using Kalibr.

  2. Calibration of the Basler camera with the Xsens AHRS (which is a type of IMU), performed with Kalibr with the sensor frame removed from the vehicle, as the procedure required moving the camera/IMU assembly in order to produce appropriate data.

  3. Calibration of the Basler camera with the Ouster OS1 LiDAR, conducted using the novel spatiotemporal calibration method [8], which makes it possible to obtain an accurate translation and rotation between the coordinate systems of both sensors and also provides an estimate of the time difference between the image frames and the laser scans. Note that in our setup there is no hardware trigger for the sensors, and the time is synchronized only by ROS time stamps, which requires estimating the actual offset introduced by the hardware.

  4. Calibration of the parameters of the Samsung Galaxy A20 camera.

  5. Calibration of the external parameters between the Basler camera and the smartphone’s camera, considered as a stereo pair.

Surprisingly, the last two tasks turned out to be the most difficult. An attempt to use Kalibr for these calibrations failed, probably because the smartphone’s camera has a rolling shutter, and the Kalibr procedure was unable to find some of the corners of the checkerboard pattern used for calibration (Figure 2A). The resulting camera parameters, although looking correct, were inappropriate for the ORB-SLAM3 system, causing problems with its initialization (Figure 2B and C). Finally, the Galaxy A20 camera was calibrated using the classic approach [9] and software that allowed us to manually mark the calibration corners on the checkerboard. These parameters allowed ORB-SLAM3 to initialize correctly (Figure 2D). The resulting calibration parameters describing the relative translations and rotations of the sensors were used to build a TF tree in the ROS environment (Figure 2E).

Figure 2.

Calibration of the multisensory system on the vehicle: Calibration checkerboard shown to the Galaxy A20 smartphone’s camera (A), initialization of the ORB-SLAM3 system on an image from the Galaxy camera (B), ORB-SLAM3 incorrectly initialized because of improper calibration of the camera (C), point features tracked correctly by ORB-SLAM3 when the camera is correctly calibrated (D), and the ROS RViz view of the transformation tree in the fully calibrated multisensory system (E).
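The classic calibration procedure used for the smartphone camera can be sketched as follows with OpenCV, assuming that the checkerboard corners have already been marked manually; the checkerboard dimensions, image resolution, and the corner-loading helper are hypothetical placeholders, not the actual tool used in the project.

```python
# Minimal sketch of the classic camera calibration with manually marked
# checkerboard corners (pinhole model with radial/tangential distortion).
import numpy as np
import cv2

# Assumed checkerboard geometry: 8x6 inner corners, 40 mm squares.
pattern_size = (8, 6)
square_size = 0.04  # [m]

# 3D corner coordinates in the checkerboard frame (z = 0 plane).
objp = np.zeros((pattern_size[0] * pattern_size[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern_size[0], 0:pattern_size[1]].T.reshape(-1, 2)
objp *= square_size

# image_points: list of (N, 1, 2) float32 arrays, one per calibration image,
# holding the manually marked corner pixels in the same order as objp.
image_points = load_manually_marked_corners()   # hypothetical helper
object_points = [objp] * len(image_points)
image_size = (1280, 720)                        # assumed image resolution

rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    object_points, image_points, image_size, None, None)
print("reprojection RMS [px]:", rms)
print("camera matrix:\n", K)
print("distortion coefficients:", dist.ravel())
```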

2.2 Ground truth acquisition

The solution for obtaining the reference (ground truth) trajectory is a satellite navigation system. In order to achieve the centimeter-level accuracy required in the application under consideration, corrections from a ground station with a known geographic position must be transmitted to the RTK-GPS receiver [10]. The setup used in the experiments applied a u-blox C099-F9P board with a ZED-F9P module for RTK-GPS, which can provide a localization accuracy of 2 cm at a position frame rate of 10 Hz when operating in the L1/L2 mode. With access to commercial RTK corrections transmitted via the Internet and an LTE modem, it was not necessary to establish a separate reference station at a known position.
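To compare the RTK-GPS fixes with SLAM trajectories expressed in meters, the geodetic coordinates have to be converted into a local metric frame. The sketch below shows one common way of doing this (a WGS-84 geodetic-to-ECEF conversion followed by a rotation into a local ENU frame anchored at the first fix); it is an illustrative assumption rather than the exact procedure used in the project, and the coordinate values are made up.

```python
# Minimal sketch: converting RTK-GPS fixes (latitude, longitude, height) to a
# local ENU frame anchored at the first fix of the run.
import numpy as np

_A = 6378137.0                 # WGS-84 semi-major axis [m]
_F = 1.0 / 298.257223563       # WGS-84 flattening
_E2 = _F * (2.0 - _F)          # first eccentricity squared

def geodetic_to_ecef(lat, lon, h):
    lat, lon = np.radians(lat), np.radians(lon)
    n = _A / np.sqrt(1.0 - _E2 * np.sin(lat) ** 2)
    x = (n + h) * np.cos(lat) * np.cos(lon)
    y = (n + h) * np.cos(lat) * np.sin(lon)
    z = (n * (1.0 - _E2) + h) * np.sin(lat)
    return np.array([x, y, z])

def ecef_to_enu(p, p_ref, lat0, lon0):
    lat0, lon0 = np.radians(lat0), np.radians(lon0)
    # Rotation from ECEF to the local ENU frame at the reference point.
    r = np.array([
        [-np.sin(lon0),                np.cos(lon0),               0.0],
        [-np.sin(lat0) * np.cos(lon0), -np.sin(lat0) * np.sin(lon0), np.cos(lat0)],
        [ np.cos(lat0) * np.cos(lon0),  np.cos(lat0) * np.sin(lon0), np.sin(lat0)],
    ])
    return r @ (p - p_ref)

# Usage: anchor the local frame at the first RTK fix (made-up values).
fixes = [(52.4010, 16.9500, 85.0), (52.4011, 16.9502, 85.1)]
lat0, lon0, h0 = fixes[0]
p_ref = geodetic_to_ecef(lat0, lon0, h0)
enu = [ecef_to_enu(geodetic_to_ecef(*f), p_ref, lat0, lon0) for f in fixes]
print(enu)
```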

3. SLAM problem

Over the past few years, traditional SLAM algorithms based on filtering (extended Kalman filter, particle filter) [11] have been replaced by methods based on optimizing graph representations of the SLAM problem [12], most often in the form of a graph of sensor poses [13] or a factor graph [14] binding multiple variables subject to optimization, which describe both the state of the agent (vehicle/sensor) and a map of the environment in the form of features.

As shown in [12], the SLAM problem can be viewed in terms of factor graph optimization. The factor graph formulation can be applied to many forms of SLAM and localization, regardless of the sensing modality used (e.g., to WiFi fingerprints [15]) and regardless of the features/landmarks used, for example, visual point features [4, 16] or LiDAR-based segment features [17]. The most general form, a method known in the computer vision field as structure from motion (SfM), assumes that all historical sensor positions are related to the observed features through measurements. These relationships can be thought of as a Markov random field with variables x describing the sensor poses and variables f describing the features (Figure 3A). However, if SLAM is used not for a limited scene but for a large area, the factor graph grows: new poses are added as the vehicle moves, and new features appear as the vehicle/sensor explores new parts of the environment.

Figure 3.

Markov random field for the SfM problem (A). Factor graph for position-only SLAM (B) and factor graph for bundle adjustment SLAM (C). Larger blue circles are vehicle/sensor positions, green circles are features considered in optimization, empty (white) circles are those features and poses that have been marginalized, graph arcs are measurements, and rectangles denote factors from pose-to-pose constraints (red) or pose-to-feature constraints (black).

Processing a very large factor graph should be avoided due to computational performance limitations. For this reason, pose-based graph SLAM systems marginalize all historical features (and thus do not have an explicit map) and keep only the graph of poses with constraints between them (Figure 3B). These constraints are derived from the estimated motion of the vehicle and are formed locally, within a certain window along the trajectory, by solving a problem analogous to visual odometry. Constraints (factor graph arcs) can also represent loop closures, that is, relationships between locations that are distant in the pose chain but lie close enough spatially to be observed simultaneously. Loop closure constraints are essential to the successful implementation of the SLAM task, as they allow the system to reduce trajectory drift compared with a comparable visual odometry system. The constraints are represented on the graph as factors, shown in Figure 3B as small rectangles. In the pose graph formulation of SLAM [13], direct observations of the features are not used, as these measurements are utilized only to compute the constraints between agent poses in the graph. In pose graph optimization, new pose values are estimated that minimize the following error:

$$\underset{\mathbf{T}_1,\ldots,\mathbf{T}_n}{\operatorname{argmin}} \sum_i \mathbf{h}(\mathbf{T}_i,\mathbf{T}_{i+1})^{T}\,\boldsymbol{\Omega}_{i,i+1}\,\mathbf{h}(\mathbf{T}_i,\mathbf{T}_{i+1}) + \sum_j \mathbf{h}\bigl(\mathbf{T}_{l_j^1},\mathbf{T}_{l_j^2}\bigr)^{T}\,\boldsymbol{\Omega}_{l_j}\,\mathbf{h}\bigl(\mathbf{T}_{l_j^1},\mathbf{T}_{l_j^2}\bigr), \tag{1}$$

where $\mathbf{T}_1,\ldots,\mathbf{T}_n$ are the optimized sensor poses, $\mathbf{h}(\mathbf{T}_i,\mathbf{T}_{i+1})$ is the pose error between the $i$-th and $(i+1)$-th poses resulting from the measurement of the visual odometry process, and $\mathbf{h}(\mathbf{T}_{l_j^1},\mathbf{T}_{l_j^2})$ is the pose error of the $j$-th loop closure measurement, defined as the error between the $l_j^1$-th and $l_j^2$-th poses related by this loop closing event.
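A small, self-contained illustration of the pose graph optimization in (1) is sketched below using the GTSAM library as an example factor graph back-end (the systems discussed in this chapter use their own optimizers, e.g., g2o; GTSAM is chosen here only for brevity of the Python example). A square trajectory of five 2D poses with one loop closure factor is optimized from a drifted initial guess; all numbers are made up.

```python
# Minimal sketch of pose graph SLAM as in Eq. (1): odometry factors along a
# square path plus one loop closure factor, optimized with Levenberg-Marquardt.
import numpy as np
import gtsam

graph = gtsam.NonlinearFactorGraph()
odom_noise = gtsam.noiseModel.Diagonal.Sigmas(np.array([0.2, 0.2, 0.1]))
loop_noise = gtsam.noiseModel.Diagonal.Sigmas(np.array([0.1, 0.1, 0.05]))
prior_noise = gtsam.noiseModel.Diagonal.Sigmas(np.array([1e-3, 1e-3, 1e-3]))

# Anchor the first pose (removes the gauge freedom of the pose graph).
graph.add(gtsam.PriorFactorPose2(0, gtsam.Pose2(0.0, 0.0, 0.0), prior_noise))

# Odometry factors h(T_i, T_{i+1}): drive 2 m, then turn 90 deg, four times.
odom = gtsam.Pose2(2.0, 0.0, np.pi / 2.0)
for i in range(4):
    graph.add(gtsam.BetweenFactorPose2(i, i + 1, odom, odom_noise))

# Loop closure factor h(T_0, T_4): pose 4 should coincide with pose 0.
graph.add(gtsam.BetweenFactorPose2(0, 4, gtsam.Pose2(0.0, 0.0, 0.0), loop_noise))

# Initial guess corrupted by simulated drift.
initial = gtsam.Values()
guesses = [(0.0, 0.0, 0.0), (2.1, 0.1, 1.5), (2.3, 2.1, 3.1),
           (0.1, 2.2, -1.7), (0.3, 0.2, 0.1)]
for i, (x, y, th) in enumerate(guesses):
    initial.insert(i, gtsam.Pose2(x, y, th))

result = gtsam.LevenbergMarquardtOptimizer(graph, initial).optimize()
for i in range(5):
    print(i, result.atPose2(i))
```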

In contrast, a full SLAM system uses a nonlinear estimation technique [18] to optimize the constraint graph, which, however, has a much larger number of factors directly linking feature positions to sensor poses (Figure 3C). Moreover, not all poses are used directly; only those selected as keyframes are stored in the map. The choice of keyframes strongly depends on the sensing modality used and on the processing of features, but it is crucial to both the efficiency and robustness of SLAM [19]. An edge represents the constraint resulting from a sensor measurement between the i-th pose and the j-th feature. The uncertainty of each constraint is represented by its information matrix Ω, which can be determined by inverting the covariance matrix of the particular measurement [19]. This approach is similar to the bundle adjustment (BA) method used to efficiently solve the SfM problem [20], but it can be applied in real time and used for large optimization problems (thus large maps) due to its computational efficiency [21]. The optimization step can be described mathematically as:

$$\underset{\mathbf{T},\mathbf{F}}{\operatorname{argmin}} \sum_{i=1}^{n} \sum_{j\in F_i} \mathbf{f}(\mathbf{f}_j,\mathbf{T}_i)^{T}\,\boldsymbol{\Omega}_{i,j}\,\mathbf{f}(\mathbf{f}_j,\mathbf{T}_i) + \sum_k \mathbf{h}\bigl(\mathbf{T}_{l_k^1},\mathbf{T}_{l_k^2}\bigr)^{T}\,\boldsymbol{\Omega}_{l_k}\,\mathbf{h}\bigl(\mathbf{T}_{l_k^1},\mathbf{T}_{l_k^2}\bigr), \tag{2}$$

where $\mathbf{T}$ is the set of all optimized sensor poses, $\mathbf{F}$ is the set of all optimized features, $F_i$ is the set of indices of all features visible from the $i$-th sensor pose, and $\mathbf{f}(\mathbf{f}_j,\mathbf{T}_i)$ is the error function of the measurement of the $j$-th feature observed from the $i$-th pose. Note that the part of the optimization problem related to loop closing is described in the same way as in (1), with $\mathbf{h}(\mathbf{T}_{l_k^1},\mathbf{T}_{l_k^2})$ being the pose error of the $k$-th loop closure measurement. The SLAM formulations (1) and (2) adopt the notion of topological loop closing, as defined in [19], which requires detecting the event of revisiting an already mapped area and constructing an additional pose-to-pose constraint by matching the reobserved features to the known part of the map. However, [19] points out that geometric loop closing is also possible, where the vehicle/sensor does not need to detect the revisit event but matches the observed features to the known map using local criteria. These local criteria differ depending on the type of features. In LiDAR-based SLAM with planar or linear features, geometric distances are applied to discriminate incorrect matches [22], while in visual SLAM with point features [16], robust matching of descriptors can be employed [23]. Still, geometric loop closing is only local, because if a very large loop has to be closed, the accumulated pose drift may prevent the agent from correctly matching the local features to the map. This is not the case for the loop closing formulation shown in (1) and (2), because in topological loop closing the respective locations (poses) are matched using global, appearance-based methods [24]. Note that the conceptual diagram in Figure 3C shows both ways of loop closing on a single graph.
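For a visual SLAM system, the feature measurement error in (2) is typically the reprojection error of a map point in the image. The following sketch shows this error term for a simple pinhole camera model; the intrinsic parameters, point, and pixel coordinates are arbitrary illustration values, and lens distortion is omitted.

```python
# Minimal sketch of the feature measurement error f(f_j, T_i) from Eq. (2)
# for point features: the reprojection error of a 3D map point observed in
# the image taken from pose T_i (pinhole model, no distortion).
import numpy as np

def reprojection_error(T_wc, p_w, uv_observed, K):
    """T_wc: 4x4 camera-to-world pose, p_w: 3D point in the world frame,
    uv_observed: measured pixel, K: 3x3 intrinsic matrix."""
    T_cw = np.linalg.inv(T_wc)                  # world-to-camera transform
    p_c = T_cw[:3, :3] @ p_w + T_cw[:3, 3]      # point in the camera frame
    uv = K @ (p_c / p_c[2])                     # pinhole projection
    return uv[:2] - uv_observed                 # 2D residual in pixels

# The weighted squared error contributed to Eq. (2) by this observation is
# e^T Omega e, with Omega the inverse covariance of the pixel measurement.
K = np.array([[700.0, 0.0, 640.0], [0.0, 700.0, 360.0], [0.0, 0.0, 1.0]])
T = np.eye(4)                                   # camera at the world origin
e = reprojection_error(T, np.array([1.0, 0.5, 4.0]),
                       np.array([810.0, 450.0]), K)
Omega = np.eye(2)                               # assumed 1 px std. deviation
print("residual:", e, "weighted squared error:", float(e @ Omega @ e))
```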

3.1 LOAM system

The open-source LiDAR odometry and mapping (LOAM) is a state-of-the-art solution that ranked first in car localization on the KITTI benchmark [25]. The processing of laser scans in LOAM is divided into several stages: feature extraction, odometry, mapping, and integration. The system operates on detected planar and edge features, which make it possible to reduce the extensive 3D laser scanner data stream. These features are then matched between the current and previous scans. In this step, the LiDAR sensor pose (and indirectly the vehicle pose) is estimated using the constraints imposed by the associated features. The LOAM system assumes that scanning takes place continuously while the scanner is in motion. Therefore, after the odometry stage, the measured points are projected onto the coordinate system associated with the starting point of the scanning process, and a geometrically corrected point cloud is obtained. This corrected cloud is the input to the mapping procedure, which optimizes the match between the new cloud from the sensor and the global map (Figure 4A), giving a more accurate pose estimate than that obtained in the odometry step. Map optimization is performed less frequently (1 Hz) than odometry (10 Hz) because this step is more computationally intensive. The final integration step combines the pose from the map optimization with the latest available pose from the laser odometry to provide the best available sensor pose estimate at the given time. Note that LOAM does not use explicit (i.e., topological) loop closing. Although it is possible to implement this type of loop closing in a LiDAR-based SLAM with planar and line features [17], LOAM uses an approach that can rather be seen as instantaneous, geometric loop closing by matching the local features to the already known part of the map.

Figure 4.

Visualization of the LOAM global point cloud map during one of the experiments (A), the point features extracted by ORB-SLAM3 in its internal map representation (B), and detected directly on an image (C).

The LOAM system is set up to work with the Velodyne VLP-16 scanner; however, the scans from the Ouster OS1-128 sensor were used in the described study, so the program code was modified accordingly. The modifications included parameters related to the different horizontal and vertical observation angles, as well as necessary changes in the assumptions about how the points are acquired in each scan, to correctly represent the measurements at the beginning of the scan based on the current motion estimate.
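The geometric correction of the scan points mentioned above (projecting all points to the coordinate system of the scan start using the current motion estimate) can be illustrated with the simplified sketch below. It is an assumption about the general technique rather than LOAM's actual code: each point is shifted according to a constant-velocity estimate and its acquisition time within the 0.1 s sweep, with rotation ignored for brevity.

```python
# Minimal sketch of motion-correcting ("deskewing") a rotating LiDAR scan by
# projecting every point to the coordinate system of the scan start under a
# constant-velocity motion assumption (translation only, for brevity).
import numpy as np

def deskew_scan(points, timestamps, velocity, scan_start_time):
    """points: (N, 3) array in the sensor frame at the time of measurement,
    timestamps: (N,) per-point acquisition times [s],
    velocity: (3,) estimated linear velocity of the sensor [m/s]."""
    dt = (timestamps - scan_start_time)[:, None]   # (N, 1) elapsed time
    # Undo the translation accumulated since the start of the sweep.
    return points - velocity[None, :] * dt

# Usage with made-up numbers: the sensor moves 5 m/s forward during a sweep.
pts = np.random.rand(1000, 3) * 20.0
t0 = 0.0
stamps = np.linspace(t0, t0 + 0.1, pts.shape[0])
corrected = deskew_scan(pts, stamps, np.array([5.0, 0.0, 0.0]), t0)
print(corrected.shape)
```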

3.2 ORB-SLAM3 system

The visual SLAM system ORB-SLAM3 is a variant of the ORB-SLAM algorithm presented in Ref. [16], which was developed as a monocular SLAM using the concepts of a point-feature map and bundle adjustment optimization. The third version, available as open-source software [4], can use data from a single camera (monocular SLAM), a stereo camera, or an RGB-D camera providing depth measurements. Constraint graph optimization is implemented in ORB-SLAM3 using the g2o library [18]. The system architecture is divided into a front-end and a back-end storing the map (Figure 4B). ORB-SLAM3 retains all the features of monocular SLAM, also taking into account features devoid of depth information and optimizing the map only on the basis of the feature reprojection errors in the images. In addition, ORB-SLAM3 fully implements loop closure based on scene view recognition, using the DBoW2 bag-of-visual-words algorithm [24]. This makes it possible to efficiently close loops of any size, not just local loops. Note that although the detection of loop closures in ORB-SLAM3 is appearance-based, the factor graph optimization after closing a loop is accomplished at the level of features rather than pose-to-pose constraints. In ORB-SLAM3, multi-scale ORB features [26] are used throughout the system: for tracking the sensor motion (Figure 4C), matching the current perception to the map (creating constraints), and closing loops. ORB-SLAM3 uses optimization based on the factor graph concept at several stages of processing; it is applied not only for global optimization, as shown in (2), but also for frequent local bundle adjustment over several keyframes and their associated features to obtain a local map reconstruction. A robust Huber kernel [27] can be used in optimization with the reprojection error to reduce the influence of spurious associations.
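The basic building block of this pipeline, multi-scale ORB feature extraction and binary descriptor matching, can be illustrated with OpenCV as in the sketch below; this is only an illustration of the feature type, not the ORB-SLAM3 front-end itself, and the image file names are placeholders.

```python
# Minimal sketch: detect multi-scale ORB features [26] in two frames and match
# their binary descriptors with a Hamming-distance brute-force matcher.
import cv2

img1 = cv2.imread("frame_000100.png", cv2.IMREAD_GRAYSCALE)  # placeholder files
img2 = cv2.imread("frame_000101.png", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create(nfeatures=2000, scaleFactor=1.2, nlevels=8)
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

# Cross-checked Hamming matching discards many spurious associations [23].
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)
print("tracked feature candidates:", len(matches))

vis = cv2.drawMatches(img1, kp1, img2, kp2, matches[:100], None)
cv2.imwrite("orb_matches.png", vis)
```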

During tracking, a simple camera motion model is used, assuming motion at a constant velocity. This model can be considered sufficient in light of the scenarios of the conducted experiments, in which the Melex vehicle moved at speeds from 10 to 30 km/h. Although ORB-SLAM3 can also work as a visual-inertial SLAM, which makes it much more robust, the application of ORB-SLAM3 was restricted to monocular SLAM only, for the sake of comparison between the results from the two cameras used in the presented setup.

A more challenging aspect of applying ORB-SLAM3 in the conducted experiments is the use of a smartphone as a camera. The developers of ORB-SLAM3 do not provide a separate implementation for mobile devices and do not support Android. However, as offline processing of the collected data is assumed, only the quality of the images and the availability of correct calibration data are important, because the images from the Samsung A20 are further processed on an x86 computer under Linux.

4. Methodology for assessing the accuracy of SLAM

With regard to VO and SLAM algorithms, one can speak about the estimation accuracy of two components of the created model: the sensor trajectory and the map itself. Both of these components are of practical importance in the tasks of navigation and recognition of the environment, and they are interrelated, since the accuracy of the trajectory estimation determines the accuracy of the registration of the positions of the features that form the map of the environment.

A standard sensor providing precise reference trajectories in outdoor SLAM is GNSS, used here in the form of the RTK-GPS receiver integrated into the sensor test bed on the Melex vehicle. In order to assess the accuracy of a trajectory estimated by either visual or LiDAR SLAM, the available ground truth trajectory needs to be compared to the estimated one using a proper method. Such methods have been proposed in Ref. [28] and are now widely used in robotics localization research. These are methods for evaluating the absolute trajectory error (ATE) and the relative pose error (RPE) of the sensor positions. The ATE value is the Euclidean distance between the corresponding points of the estimated trajectory and the ground truth trajectory. Thus, the ATE metric allows us to determine how far the estimated position is from the reference position at a certain moment in time on the estimated trajectory. The ATE metric is calculated for the entire trajectory as the root mean squared (RMS) error, which allows a convenient comparison of results for different SLAM algorithms on the same test data set. Assuming that two trajectories are given, a ground truth one and an estimated one, with the same number of positions n, and that the positions of the sensor are given as homogeneous matrices, the ATE metric for the i-th frame on the trajectory can be determined as:

$$E_i^{\mathrm{ATE}} = \left(\mathbf{T}_i^{gt}\right)^{-1}\mathbf{T}_i, \tag{3}$$

and then the ATE value for the entire trajectory is obtained by calculating the RMS error value over all frames. Note that in order to obtain a correct ATE, the trajectories must be converted to a common global coordinate system before the calculation is performed. The need for geometric matching occurs in the case of a trajectory estimated by visual odometry or a monocular visual SLAM algorithm, because for algorithms of this kind the trajectory scale s is an unobservable variable [19]. For this reason, the estimated trajectory may exactly reproduce the shape of the reference trajectory but have a different scale. It is possible to address the scale problem in monocular systems if data from additional sensors are available, in particular inertial measurements from an IMU or AHRS [29]. However, this possibility was not exploited in this project, because the stream of images from the smartphone could not be synchronized accurately in time with the AHRS measurements, while the aim was also to compare the trajectories obtained using visual SLAM from the two different types of cameras mounted on the Melex. On the other hand, the correct scale can easily be retrieved by processing the trajectory offline and using the Umeyama algorithm [30] with only several GNSS global pose measurements as reference points.

In the case of LiDAR SLAM, the scale drift problem does not occur. Trajectory matching is performed offline by finding the transformation that minimizes the distance between two groups of rigidly connected points representing these trajectories (vehicle poses); the most commonly used method to accomplish this task is the Umeyama algorithm [30].
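A compact sketch of this evaluation step is given below: a NumPy implementation of the Umeyama alignment [30] (with the optional scale needed for monocular SLAM), followed by the ATE RMS computation over the aligned positions. It is an illustrative implementation under the stated assumptions; in practice, ready-made evaluation tools implementing the same procedure can be used.

```python
# Minimal sketch: Umeyama alignment [30] of the estimated trajectory to the
# ground truth (rotation R, translation t, and optionally scale s), followed
# by the ATE RMS of the remaining point-to-point distances.
import numpy as np

def umeyama_align(est, gt, with_scale=True):
    """est, gt: (N, 3) arrays of corresponding positions; returns s, R, t."""
    mu_e, mu_g = est.mean(axis=0), gt.mean(axis=0)
    e, g = est - mu_e, gt - mu_g
    cov = g.T @ e / est.shape[0]
    u, d, vt = np.linalg.svd(cov)
    s_mat = np.eye(3)
    if np.linalg.det(u) * np.linalg.det(vt) < 0.0:
        s_mat[2, 2] = -1.0                      # keep a proper rotation
    r = u @ s_mat @ vt
    var_e = (e ** 2).sum() / est.shape[0]
    s = (d * np.diag(s_mat)).sum() / var_e if with_scale else 1.0
    t = mu_g - s * r @ mu_e
    return s, r, t

def ate_rms(est, gt, with_scale=True):
    s, r, t = umeyama_align(est, gt, with_scale)
    aligned = (s * (r @ est.T)).T + t
    return np.sqrt(((aligned - gt) ** 2).sum(axis=1).mean())

# Usage: a monocular trajectory that is a scaled, rotated copy of ground truth.
gt = np.cumsum(np.random.rand(200, 3), axis=0)
est = 0.4 * gt @ np.array([[0, -1, 0], [1, 0, 0], [0, 0, 1]]).T + 2.0
print("ATE RMS after alignment with scale [m]:", ate_rms(est, gt))
```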

On the other hand, the RPE metric determines the difference in transformation (separately for the rotational and translational parts) that arises after following the estimated trajectory and the reference trajectory independently for a given number of frames/images, and then calculating the roto-translation between the estimated trajectory and its ground truth counterpart. The RPE metric for the i-th image frame is given by the formula:

$$E_i^{\mathrm{RPE}} = \left(\left(\mathbf{T}_i^{gt}\right)^{-1}\mathbf{T}_{i+1}^{gt}\right)^{-1}\left(\mathbf{T}_i^{-1}\,\mathbf{T}_{i+1}\right). \tag{4}$$

Taking the translational or rotational part of this transformation yields the local translational or rotational error measure, respectively. The RPE values for the entire trajectory are calculated from the corresponding part (translational or rotational) as the RMS error value over all nodes.
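The per-frame RPE of Eq. (4) and its translational RMS can be computed with a few lines of NumPy, as in the sketch below; the two short trajectories at the end are made-up usage examples.

```python
# Minimal sketch of the RPE computation from Eq. (4): for consecutive frames,
# the relative motion of the ground truth is compared with the relative motion
# of the estimate, and the translational part of the residual is reported.
import numpy as np

def rpe_translational(T_gt, T_est):
    """T_gt, T_est: lists of 4x4 homogeneous pose matrices of equal length."""
    errors = []
    for i in range(len(T_gt) - 1):
        delta_gt = np.linalg.inv(T_gt[i]) @ T_gt[i + 1]     # reference motion
        delta_est = np.linalg.inv(T_est[i]) @ T_est[i + 1]  # estimated motion
        residual = np.linalg.inv(delta_gt) @ delta_est      # Eq. (4)
        errors.append(np.linalg.norm(residual[:3, 3]))      # translational part
    errors = np.array(errors)
    return errors, np.sqrt((errors ** 2).mean())

# Usage with two made-up, slightly different trajectories of 4x4 poses.
def pose(x, y):
    T = np.eye(4)
    T[0, 3], T[1, 3] = x, y
    return T

gt = [pose(i * 1.0, 0.0) for i in range(10)]
est = [pose(i * 1.02, 0.01 * i) for i in range(10)]
per_frame, rms = rpe_translational(gt, est)
print("RPEt max [m]:", per_frame.max(), "RPEt RMS [m]:", rms)
```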

The main measure of the sensor position estimation error used in evaluating the accuracy of the tested algorithms is the ATE for the i-th frame. The aggregated ATE (RMS) measure is of lesser importance in this context, since it is not assumed that the resulting maps will be globally accurate, but rather that the user can obtain reasonably small errors, for example, a translational error smaller than 1 m, in the coordinate system associated with the local map. This ensures correct recognition of the road structure by the vehicle, which in turn makes it possible to stay in the proper lane and to localize objects in the vicinity of the road.

In this context, the translational part of the RPE metric (RPEt) is also useful. From the point of view of evaluating the results of the performed experiments, the main advantage of the RPE metric is that it can be applied directly to the evaluation of a trajectory estimated from a monocular camera, in the presented case both the Basler camera and the smartphone’s camera. With the ATE metric, such a possibility does not exist, because for monocular SLAM the estimated trajectory and the reference trajectory differ in scale.

The roll and pitch angles of the sensor are of lesser importance, as it is assumed that the vehicle moves on generally flat ground in the urban environment. However, the roll and pitch angles for the reference trajectory can be read from the AHRS relative to the gravity vector. This is used to ensure that the assumption of driving on a reasonably flat surface is met for the collected trajectories.

5. Experimental results

Three fully documented experiments were conducted using the pre-calibrated set of sensors on the electric car. The vehicle traveled at speeds ranging from 10 to 30 km/h along various routes in the area of the Warta campus of Poznan University of Technology and the streets adjacent to this area. The routes were documented by plotting the ground truth trajectories (obtained from RTK-GPS) on maps from the OpenStreetMap service (Figure 5).

Figure 5.

Ground truth trajectories of the three experiments obtained from RTK-GPS and plotted on OpenStreetMap: Exp. 1 – a single loop inside the campus (A), Exp. 2 – multiple loops inside the campus (B), and Exp. 3 – a large loop partially outside the campus (C).

In each run, the vehicle trajectory estimated from the OS1-128 laser scanner data had a relative position error that did not exceed 1 m with respect to the RTK-GPS data (Table 1). At the same time, all the estimated trajectories are topologically consistent and not distorted (Figure 6), and they are also characterized by small absolute (ATE) errors (the red error lines are barely visible).

| RPEt [m] | Exp. 1 max | Exp. 1 RMS | Exp. 2 max | Exp. 2 RMS | Exp. 3 max | Exp. 3 RMS |
|---|---|---|---|---|---|---|
| LOAM Ouster | 1.83 | 0.53 | 1.93 | 0.45 | 1.14 | 0.51 |
| ORB-SLAM3 Basler | 1.71 | 0.38 | 0.92 | 0.27 | 0.40 | 0.18 |
| ORB-SLAM3 Galaxy | 0.49 | 0.23 | 0.43 | 0.29 | 0.42 | 0.18 |

Table 1.

Relative translational errors of the trajectory estimated by LOAM (LiDAR SLAM) and ORB-SLAM3 (visual SLAM) for all experiments.

Figure 6.

Plots of the ground truth and estimated trajectories for LOAM/Ouster with visualization of the ATE values (red lines) for Exp. 1 (A), Exp. 2 (B), and Exp. 3 (C).

At the same time, the presented experiments confirmed the main thesis: it is possible to obtain the localization of a vehicle equipped with a single passive camera with a relative accuracy no worse than 1 m (Table 1) under conditions similar to the real operating conditions of urban navigation. The difference between the conditions of the experiments and real conditions came down to the use of a vehicle moving more slowly than a typical car and the absence of heavy traffic of other vehicles (only single vehicles and pedestrians were present).

There are clearly significant differences in the accuracy of the trajectory estimated by ORB-SLAM3 depending on the scenario of a given experiment (Figure 7). The decisive influence on the quality of localization is the number of point features observed and tracked over a sufficient number of frames. The absence of such features, for example, because of driving at a great distance from objects with a distinct surface texture, and/or the presence of unstable (weakly repeatable) features generated on trees, causes a sharp increase in the local (relative) errors and, consequently, a distortion of the estimated trajectory. In the case of the second experiment, in which the vehicle made several passes over the same terrain (loop), it can also be seen that the data from a single camera (monocular visual SLAM) do not in some cases allow the loop to be closed accurately and the trajectory drift to be fully reduced. As a consequence, the ATE values are much larger for the vision-based ORB-SLAM3 than for the LiDAR-based LOAM, as shown in Table 2.

Figure 7.

Plots of the ground truth and estimated trajectories for ORB-SLAM3/Basler with visualization of the ATE values (red lines) for Exp. 1 (A), Exp. 2 (B), and Exp. 3 (C).

| ATE [m] | Exp. 1 max | Exp. 1 RMS | Exp. 2 max | Exp. 2 RMS | Exp. 3 max | Exp. 3 RMS |
|---|---|---|---|---|---|---|
| LOAM Ouster | 2.49 | 1.26 | 3.11 | 2.08 | 6.33 | 2.55 |
| ORB-SLAM3 Basler | 2.59 | 1.15 | 12.90 | 5.72 | 12.59 | 6.53 |
| ORB-SLAM3 Galaxy | 7.50 | 3.04 | 8.29 | 4.61 | 70.83 | 30.07 |

Table 2.

Absolute errors of the trajectory estimated by LOAM (LiDAR SLAM) and ORB-SLAM3 (visual SLAM) for all experiments.

The differences in the accuracy of the trajectory estimated by ORB-SLAM3 from the low-cost camera of the smartphone, measured by the relative metric (Table 1), are small: the obtained values of the relative translation errors (both the RMS and the maximum values) do not differ significantly from the analogous values obtained when estimating the trajectory based on the images from the professional camera (Figure 8).

Figure 8.

Plots of the ground truth and estimated trajectories for ORB-SLAM3/Galaxy with visualization of the ATE values (red lines) for Exp. 1 (A), Exp. 2 (B), and Exp. 3 (C).

In some cases, the maximum value of the relative error estimated from the smartphone images is even significantly smaller than for the Basler camera (0.49 m vs. 1.71 m). This allows us to conclude that the smartphone camera, under proper calibration, is a sufficient sensor for feature-based monocular SLAM, despite having a rolling shutter and a simplified lens. It should be noted that an advantage of the Galaxy A20 smartphone camera (and many others) over the Basler camera with the f = 4 mm lens is a wider angular field of view in the horizontal plane. This can increase the number of matched point features during more rapid changes of the vehicle orientation. At the same time, however, the depth of field of the smartphone camera is noticeably smaller than for the classic camera configuration compared here, which in turn leads to the detection of fewer point features overall, especially at greater distances from the vehicle. As already noted, the number of point features observed and tracked over a sufficient number of frames has a decisive impact on the quality of pose tracking (local localization), and the lack of a sufficient number of features results in a sharp increase in the local (relative) errors and, consequently, a distortion of the estimated trajectory.

In the case of experiments 2 and 3, during which the vehicle made several loops returning to the same areas, it can also be seen that the data from the Galaxy A20 smartphone camera do not, in some cases, allow for an accurate loop closure and a reduction of the trajectory drift. This is particularly evident in Figure 8C: ineffective loop closing, probably due to too few features, led to a very large ATE error for the long trajectory (15,288 images were processed), despite the small relative errors.

6. Conclusions

The presented results allow us to conclude that using laser scanning to obtain reference trajectories regardless of the availability of the GNSS signal (which often disappears in urban environments) is effective, and that it can be applied as part of a methodology for testing vision-based localization systems, as the accuracy obtained is sufficient.

This research has shown that the trajectories estimated by a state-of-the-art visual SLAM system are of acceptable accuracy in the context of local navigation in the urban environment. The obtained accuracy of the vehicle pose estimates is also sufficient for the localization of selected objects on the roadsides, such as the billboards considered in the CityBrands project.

The multi-loop paths of the presented experiments, which are not present in the publicly available datasets for SLAM, for example, KITTI [25], made it possible to demonstrate the crucial role of topological loop closing in obtaining globally consistent trajectories. A clear decrease in this global consistency was observed for the trajectories obtained from the images acquired using the smartphone with its rolling-shutter camera. This can be attributed to the smaller number of point features detected in the lower-quality images, which in some cases was insufficient to detect a loop using the bag-of-visual-words approach. This was observed for trajectories that otherwise had local errors not greater than those of their counterparts estimated from the images collected by the professional global-shutter camera. However, supplementing the SLAM algorithm with only locally available (at selected points) GNSS data (which here served only as ground truth) should allow the observed shortcomings to be removed. In general, this research demonstrated that even low-cost commodity hardware can be used to obtain useful trajectories of a vehicle in an urban environment if recent algorithms are applied for data processing and the entire processing pipeline is implemented carefully, with a focus on proper calibration of the sensors.

Acknowledgments

This work was funded by POIR.01.01.01-00-0494/20 (CityBrands) grant from the National Centre for Research and Development. The author would like to thank K. Ćwian, A. Banaszczyk, and K. Żywanowski for their help in conducting the presented experiments.

References

  1. Cadena C, Carlone L, Carrillo H, Latif Y, Scaramuzza D, Neira J, et al. Past, present, and future of simultaneous localization and mapping: Toward the robust-perception age. IEEE Transactions on Robotics. 2016;32(6):1309-1332
  2. Scaramuzza D, Fraundorfer F. Visual odometry: Part I the first 30 years and fundamentals. IEEE Robotics & Automation Magazine. 2011;18(4):80-92
  3. Skrzypczyński P. LiDAR localization and mapping for autonomous vehicles: Recent solutions and trends. In: Automation 2021: Recent Achievements in Automation, Robotics and Measurement Techniques, AISC 1390. Cham: Springer; 2021. pp. 251-261
  4. Campos C, Elvira R, Gómez Rodríguez JJ, Montiel JMM, Tardós JD. ORB-SLAM3: An accurate open-source library for visual, visual-inertial and multi-map SLAM. IEEE Transactions on Robotics. 2021;37(6):1874-1890
  5. Zhang J, Singh S. Low-drift and real-time LiDAR odometry and mapping. Autonomous Robots. 2017;41(2):401-416
  6. Quigley M, Conley K, Gerkey B, Faust J, Foote T, Leibs J, et al. ROS: An open-source Robot Operating System. In: Workshops at the IEEE International Conference on Robotics and Automation. Kobe: IEEE; 2009
  7. Rehder J, Siegwart R, Furgale P. A general approach to spatiotemporal calibration in multisensor systems. IEEE Transactions on Robotics. 2016;32(2):383-398
  8. Nowicki MR. Spatiotemporal calibration of camera and 3D laser scanner. IEEE Robotics and Automation Letters. 2020;5(4):6451-6458
  9. Zhang Q, Pless R. Extrinsic calibration of a camera and laser range finder (improves camera calibration). In: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). Sendai: IEEE; 2004. pp. 2301-2306
  10. Takasu T. RTKLIB: Open source program package for RTK-GPS. In: FOSS4G 2009. Tokyo: Open Source Geospatial Foundation; 2009
  11. Durrant-Whyte H, Bailey T. Simultaneous localization and mapping: Part I. IEEE Robotics and Automation Magazine. 2006;13(2):99-110
  12. Strasdat H, Montiel JMM, Davison AJ. Visual SLAM: Why filter? Image and Vision Computing. 2012;30(2):65-77
  13. Grisetti G, Kümmerle R, Stachniss C, Burgard W. A tutorial on graph-based SLAM. IEEE Intelligent Transportation Systems Magazine. 2010;2(4):31-43
  14. Kschischang F, Frey B, Loeliger H-A. Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory. 2001;47(2):498-519
  15. Nowicki M, Skrzypczyński P. Indoor navigation with a smartphone fusing inertial and WiFi data via factor graph optimization. In: Mobile Computing, Applications, and Services. MobiCASE 2015, LNCS 162. Cham: Springer; 2015. pp. 280-298
  16. Mur-Artal R, Montiel JMM, Tardós JD. ORB-SLAM: A versatile and accurate monocular SLAM system. IEEE Transactions on Robotics. 2015;31(5):1147-1163
  17. Ćwian K, Nowicki M, Wietrzykowski J, Skrzypczyński P. Large-scale LiDAR SLAM with factor graph optimization on high-level geometric features. Sensors. 2021;21(10):3445
  18. Kümmerle R, Grisetti G, Strasdat H, Konolige K, Burgard W. g2o: A general framework for graph optimization. In: Proc. IEEE International Conference on Robotics and Automation. Shanghai: IEEE; 2011. pp. 3607-3613
  19. Strasdat H. Local Accuracy and Global Consistency for Efficient Visual SLAM [PhD dissertation]. London: Imperial College; 2012
  20. Triggs B, McLauchlan PF, Hartley RI, Fitzgibbon AW. Bundle adjustment – A modern synthesis. In: Vision Algorithms: Theory and Practice, LNCS 1883. Berlin, Heidelberg: Springer; 2000. pp. 298-372
  21. Konolige K. Sparse sparse bundle adjustment. In: Proc. British Machine Vision Conference. Aberystwyth: British Machine Vision Association; 2010. pp. 102.1-102.11
  22. Segal A, Haehnel D, Thrun S. Generalized-ICP. In: Proc. Robotics: Science and Systems. Seattle: RSS Foundation; 2009
  23. Schmidt A, Kraft M, Fularz M, Domagala Z. Comparative assessment of point feature detectors and descriptors in the context of robot navigation. Journal of Automation, Mobile Robotics & Intelligent Systems. 2013;7(1):11-20
  24. Galvez-López D, Tardos JD. Bags of binary words for fast place recognition in image sequences. IEEE Transactions on Robotics. 2012;28(5):1188-1197
  25. Geiger A, Lenz P, Urtasun R. Are we ready for autonomous driving? The KITTI vision benchmark suite. In: IEEE Conference on Computer Vision and Pattern Recognition. Providence: IEEE; 2012. pp. 3354-3361
  26. Rublee E, Rabaud V, Konolige K, Bradski G. ORB: An efficient alternative to SIFT or SURF. In: IEEE International Conference on Computer Vision. Barcelona: IEEE; 2011. pp. 2564-2571
  27. Będkowski J. Large-Scale Simultaneous Localization and Mapping. Cognitive Intelligence and Robotics. Singapore: Springer; 2022
  28. Sturm J, Engelhard N, Endres F, Burgard W, Cremers D. A benchmark for the evaluation of RGB-D SLAM systems. In: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). Vilamoura: IEEE; 2012. pp. 573-580
  29. Nützi G, Weiss S, Scaramuzza D, Siegwart R. Fusion of IMU and vision for absolute scale estimation in monocular SLAM. Journal of Intelligent and Robotic Systems. 2011;61:287-299
  30. Umeyama S. Least-squares estimation of transformation parameters between two point patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence. 1991;13(4):376-380
