Open access peer-reviewed chapter

Mobile Industrial Robotic Vehicles: Navigation with Visual SLAM Methodologies

Written By

Xenofon Karamanos, Giorgos Karamitsos, Dimitrios Bechtsis and Dimitrios Vlachos

Submitted: 23 February 2023 Reviewed: 26 February 2023 Published: 27 March 2023

DOI: 10.5772/intechopen.1001346

From the Edited Volume

Autonomous Vehicles - Applications and Perspectives

Denis Kotarski and Petar Piljek


Abstract

Mobile industrial robotic vehicles use cutting-edge technologies and have been widely adopted over the last decade as a means of improving sustainability. Recent navigation approaches are commonly divided into two categories: (i) laser-based and (ii) visual-based. Many researchers have proposed laser-based SLAM navigation systems, but both their two-dimensional (2D) and three-dimensional (3D) variants still lack critical information about the facility layout, such as color and texture, in contrast with visual-based methods. Moreover, visual-based methods use more affordable sensor devices, indicatively monocular, stereo and RGB-D cameras, that provide highly detailed information about the operating environment. The reconstruction of the 3D digital twin environment is more accurate and detailed, enabling the mobile industrial robotic vehicle to navigate the facility layout and accomplish a much greater variety of tasks. The proposed research discusses recent developments in visual-based methods and analyses various well-known systems. Performance assessment is also reviewed using the Robot Operating System (ROS) to compare the methods and examine their suitability for various facility layouts.

Keywords

  • SLAM
  • ROS
  • visual SLAM
  • RGB-D
  • MIR

1. Introduction

Mobile Industrial Robotic (MIR) systems are emerging in the Industry 4.0 context in order to improve economic, environmental, and social sustainability indicators. Even during the pandemic, while warehouses and manufacturing sites faced significant challenges, the technology remained prominent with a high Return on Investment (ROI). Moreover, robotic systems’ installations are expected to grow even further with the recovery from the pandemic, according to the International Federation of Robotics [1].

Warehouses and manufacturing sites worldwide continuously strive to improve their automated processes and the quality control of their products. Numerous activities are repetitive, and performing them manually is unproductive, non-creative and often creates safety issues. To overcome this burden, workers can be assisted by autonomous mobile robots that perform heavy and tedious tasks, deliver goods and carry out quality checks. Autonomous agents can accurately and timely move products and deliver raw materials for the manufacturing processes, allowing the human capital to perform specialized and more complex tasks [2]. Autonomous agents could also perform quality and quantity control activities on all the parts and sub-parts of an order, improving the quality of services and reducing costs. Autonomous agents should be able to sense the environment and perform actions such as localization, path finding, obstacle identification and avoidance, and product and semi-product identification and picking. This enables them to navigate the facility safely and efficiently and to cooperate with the human capital and the machinery on the production line.

Towards accomplishing all these critical tasks, MIRs need to localize themselves in the facility and identify the relative position of all the static (machinery, working cells, charging positions) and dynamic (humans, products and semi-products, other vehicles) entities in the industrial facility. Simultaneous Localization and Mapping (SLAM) [3, 4] is considered an essential component of robot navigation. SLAM is also used in a variety of use cases, such as AR/VR gaming [5], self-driving cars [6] and healthcare [7].

SLAM is a complex problem and is often described as a chicken-and-egg problem: an agent must localize itself in the environment, but that requires some form of map correspondence, and to create that map, the agent must accurately identify its own position. The SLAM literature is still evolving in terms of accuracy, computational cost, and reliability. Many research approaches have tried to overcome this problem using multiple sensors such as ultrasonic and sonar sensors, cameras (monocular, stereo, depth) and LiDARs. With recent hardware developments in affordable Time of Flight (ToF) and RGB-D cameras, SLAM research is moving towards full Visual SLAM (vSLAM) approaches. The main reason is the rich information cameras provide, including color, depth and, after processing, semantic knowledge.


2. Materials and methods

SLAM algorithms are necessary for the safe and autonomous navigation of MIRs. They are used to map the agent’s environment and, at the same time, localize the agent within it. SLAM algorithms come in many variants depending on the use case. They can be categorized by their Degrees of Freedom (DoF) into two main categories, 2D (3 DoF) SLAM and 3D (6 DoF) SLAM, with each category having its own subcategories. 2D SLAM is used most of the time due to its lower computational cost and excellent results. The most common sensor in 2D environments is the LiDAR, but others, such as ultrasonic and sonar range finders, are also used. 3D SLAM is accomplished using various 3D sensors such as stereo cameras, RGB-D cameras, and 3D laser scanners. Our focus is on 3D SLAM algorithms, specifically visual 3D SLAM using either stereo or RGB-D cameras. We chose to focus on cameras due to their lower entry price point with respect to 3D LiDAR sensors. Moreover, cameras can provide information that would otherwise not be available, such as color and texture, and visual SLAM is still an active research area aiming to lower computational costs and increase robustness.

Visual SLAM uses information acquired from image sensors, such as monocular, stereo or RGB-D cameras. Examples of such cameras are any single grayscale or color camera with no depth information, the Stereolabs ZED stereo camera and the Intel RealSense D435 RGB-D camera, respectively (see Notes).

Most SLAM algorithms are implemented as standalone applications. Open-source robotic research libraries and frameworks, such as the Robot Operating System (ROS), help make them available through ROS wrappers. It is also possible to implement a method fully on top of ROS.

ROS provides a variety of software modules for enabling robot communication and procedures such as localization, mapping, path planning and scheduling, and paves the way for the rapid reconfiguration of mobile robots. Through the publisher-subscriber and topics mechanism, ROS wrappers give researchers the ability to easily deploy the software to various robotic platforms and a wide range of sensor configurations.
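As a minimal illustration of the publisher-subscriber mechanism mentioned above, the sketch below subscribes to the RGB and depth image streams that a visual SLAM wrapper would typically consume; it assumes ROS 1 (rospy), and the topic names are illustrative rather than those of a specific camera driver.

```python
#!/usr/bin/env python
# Minimal sketch of the ROS publisher-subscriber pattern for RGB-D input.
# The topic names are hypothetical; adjust them to your own sensor driver.
import rospy
from sensor_msgs.msg import Image

def rgb_callback(msg):
    # Each incoming frame could be forwarded to a visual SLAM wrapper here.
    rospy.loginfo("RGB frame %dx%d, encoding=%s", msg.width, msg.height, msg.encoding)

def depth_callback(msg):
    rospy.loginfo("Depth frame %dx%d", msg.width, msg.height)

if __name__ == "__main__":
    rospy.init_node("vslam_input_listener")
    rospy.Subscriber("/camera/rgb/image_raw", Image, rgb_callback)      # hypothetical topic
    rospy.Subscriber("/camera/depth/image_raw", Image, depth_callback)  # hypothetical topic
    rospy.spin()
```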

2.1 SLAM map structures

2D SLAM usually uses the occupancy grid map data structure. The map is split into cells whose status is either occupied, free or unknown. The format efficiently provides the data for navigation tasks, such as planning algorithms and obstacle avoidance. Being a 2D structure, if 3D information is present, a cell may be occupied at height z1 but free at height z2. This information is lost when building the 2D map, and the SLAM algorithm must decide what the cell status should be, either by heuristics or by user-defined rules.
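The toy sketch below illustrates this information loss: a vertical column of 3D cells is collapsed into a single 2D cell by a simple heuristic. The cell values follow the common ROS nav_msgs/OccupancyGrid convention, but the flattening rule itself is only an assumed example.

```python
UNKNOWN, FREE, OCCUPIED = -1, 0, 100  # convention used by ROS nav_msgs/OccupancyGrid

def flatten_to_2d(voxel_column):
    """Collapse a vertical column of 3D cells into one 2D cell.
    A simple assumed heuristic: the 2D cell is occupied if any height slice is occupied."""
    if OCCUPIED in voxel_column:
        return OCCUPIED
    if all(v == UNKNOWN for v in voxel_column):
        return UNKNOWN
    return FREE

# Occupied at z1 but free at z2: the 2D map can keep only one value.
print(flatten_to_2d([OCCUPIED, FREE, UNKNOWN]))  # -> 100 (occupied)
```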

3D SLAM, on the other hand, keeps that information. It utilizes either 3D meshes or octree data formats. An octree is a structure similar to the occupancy grid but in 3D: instead of a cell it uses a voxel, a cube. The environment is split into eight voxels, each of which is recursively divided into another eight voxels until the desired resolution is reached. Even though it is computationally costly, it offers significantly more information. Utilizing the 3D environment, we can gather semantic information that would otherwise be lost and use it for both localization and object manipulation. Most algorithms that provide volumetric information also use graph-based optimization techniques to deliver more reliable and compact results.
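The following toy Python sketch shows the recursive subdivision idea behind an octree. It is a simplified illustration, not the OctoMap implementation used by the libraries discussed later.

```python
class OctreeNode:
    """Toy octree sketch: a cubic volume is recursively split into eight
    child voxels until the requested resolution is reached."""
    def __init__(self, center, size):
        self.center, self.size = center, size
        self.children = None      # created lazily on first insertion
        self.occupied = False

    def insert(self, point, resolution):
        if self.size <= resolution:           # leaf voxel reached: mark occupied
            self.occupied = True
            return
        if self.children is None:             # split into 8 octants
            half = self.size / 2.0
            self.children = [
                OctreeNode((self.center[0] + dx * half / 2.0,
                            self.center[1] + dy * half / 2.0,
                            self.center[2] + dz * half / 2.0), half)
                for dx in (-1, 1) for dy in (-1, 1) for dz in (-1, 1)]
        # pick the octant that contains the point and recurse
        idx = ((point[0] > self.center[0]) * 4 +
               (point[1] > self.center[1]) * 2 +
               (point[2] > self.center[2]))
        self.children[idx].insert(point, resolution)

root = OctreeNode(center=(0.0, 0.0, 0.0), size=8.0)
root.insert((1.3, -2.1, 0.4), resolution=0.25)   # marks one 0.25 m voxel as occupied
```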

2.2 Sensors description

Monocular sensors refer to a single camera module producing a single image, whether color or black-and-white. Monocular cameras are usually cheap, lightweight and low-power devices. They are mostly preferred when low computational cost and light weight are required, for example on drones.

Stereo sensors are a pair of, usually monocular, cameras with a known physical relationship, meaning that the distance between their focal points and their common field of view are known (see Figure 1). After calibration is performed, a disparity image can be computed from a pair of images captured simultaneously using stereo matching algorithms, and depth and 3D points can then be obtained. If one of the two cameras is an RGB sensor, colored 3D points can be produced. The cost is higher compared to monocular devices, but recent developments have made the hardware more affordable. Stereo sensors are highly usable in outdoor environments, where RGB-D sensors fall short because of their structured light patterns.
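As a small worked example of how stereo matching yields depth, the sketch below applies the standard pinhole relation Z = f · B / d to a few disparity values; the focal length and baseline are illustrative, not those of a particular camera.

```python
# Back-of-the-envelope depth from disparity for a calibrated stereo pair,
# using the standard pinhole relation Z = f * B / d. The focal length and
# baseline below are illustrative values, not from a specific camera.
def depth_from_disparity(disparity_px, focal_px=700.0, baseline_m=0.12):
    """Return metric depth for a disparity measured in pixels."""
    if disparity_px <= 0:
        return float("inf")          # zero disparity -> point at infinity
    return focal_px * baseline_m / disparity_px

for d in (1, 5, 20, 80):
    print(f"disparity {d:>3} px -> depth {depth_from_disparity(d):6.2f} m")
```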

Figure 1.

Stereo camera model [8].

An RGB-D sensor is a module that combines an RGB camera with some type of depth-sensing mechanism, such as infrared light patterns. It can produce a color image, per-pixel depth maps, and a 3D point cloud. However, RGB-D sensors have limited range and their accuracy degrades beyond a certain distance threshold. Due to the nature of their depth-sensing mechanism, they are most suitable for indoor applications. Furthermore, their initial cost is higher than that of monocular sensors, but thanks to recent hardware developments the cost has stabilized at a reasonable price range (Figure 2).
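The sketch below shows how a per-pixel depth map is turned into a 3D point cloud with a pinhole camera model; the intrinsic parameters are typical illustrative values, not those of a specific RGB-D sensor (real drivers publish them on a camera_info topic).

```python
import numpy as np

# Minimal sketch of deprojecting a depth image (in meters) into 3D points,
# assuming pinhole intrinsics (fx, fy, cx, cy). Values below are illustrative.
def deproject(depth, fx, fy, cx, cy):
    h, w = depth.shape
    us, vs = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (us - cx) * z / fx
    y = (vs - cy) * z / fy
    return np.stack([x, y, z], axis=-1)      # (h, w, 3) array of 3D points

depth = np.full((480, 640), 1.5)              # fake flat scene 1.5 m away
points = deproject(depth, fx=525.0, fy=525.0, cx=319.5, cy=239.5)
print(points[240, 320])                       # center pixel, approximately [0, 0, 1.5]
```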

Figure 2.

RGB-D sensor structure (Kinect) [9].

An Inertial Measurement Unit (IMU) combines an accelerometer and a gyroscope and provides information about changes in robot movement, such as acceleration, position, and orientation, at extremely high rates. Its small size also makes it possible to incorporate it into camera sensors as an extra source of data. Unfortunately, because its errors accumulate over time, it is not considered reliable without additional information and corrective measures (Figure 3).
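The short simulation below illustrates why raw IMU integration is unreliable on its own: a small, assumed accelerometer bias produces a position error that grows quadratically with time.

```python
import numpy as np

# Illustration of IMU drift: a small constant bias in the measured
# acceleration grows quadratically in the estimated position.
dt, steps = 0.01, 1000                    # 100 Hz for 10 s
true_accel = 0.0                          # the robot is actually standing still
bias = 0.05                               # assumed 0.05 m/s^2 accelerometer bias

vel = pos = 0.0
for _ in range(steps):
    measured = true_accel + bias + np.random.normal(0, 0.02)
    vel += measured * dt                  # integrate acceleration -> velocity
    pos += vel * dt                       # integrate velocity -> position
print(f"position error after 10 s of dead reckoning: {pos:.2f} m")
```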

Figure 3.

Inertial measurement unit (IMU) components (Model IMU, GPS, and INS/GPS - MATLAB & Simulink (mathworks.com)).


3. Literature review

The identification and classification of SLAM algorithms for MIRs require an analysis of the literature in the field. Our taxonomy focuses mostly on: (i) the sensors used in the methodology, such as monocular, RGB-D or stereo cameras; (ii) the map representation used, i.e., landmark-, graph- or volumetric-based; and (iii) availability as open-source software and ROS compatibility. Following that, the selected and reviewed articles were critically taxonomized. The procedure can be seen in Figure 4.

Figure 4.

Workflow used to identify the literature.

The search was conducted in Elsevier’s Scopus database. Terms such as ‘3D SLAM’, ‘RGB SLAM’ and ‘stereo SLAM’, among others, were searched in the ‘Article title, Abstract, Keywords’ field. The studies selected for our taxonomy were written in English. A total of 18 studies were identified for review.

This section summarizes various SLAM methods (Table 1) categorized by their sensor input. This study focuses on monocular, stereo and RGB-D sensors, with or without an IMU (Visual-Inertial SLAM), LiDAR (Light Detection And Ranging), or a combination of those.

Reference | Input sensor | Map representation (as noted)
MonoSLAM [10] | M | —
T-SLAM [11] | M, L | —
DTAM [12] | M | —
KinectFusion [13] | RGB-D | —
SLAM++ [14] | RGB-D | Object level
Kimera [15] | M, S, I | Mesh
OpenVSLAM [16] | M, S, R | —
PL-SLAM (2019) [17] | S | —
Visual-Based Semantic [18] | M | —
Vitamin-E [19] | M | —
RTAB-Map [20] | M, S, R, L, I | Octree
SegMap [21] | S, R | —
DS-SLAM [22] | M, S, R | Octree
Sparse3D [23] | RGB-D | M/D
SceneSLAM [24] | M, S, R, L | M/D
ProSLAM [25] | S | —
DPI-SLAM [26] | RGB-D | 3D point cloud
S-PTAM [27] | S | —
ORB-SLAM3 [28] | M, S, R, I | —

Table 1.

List of the reviewed SLAM algorithms, their input sensors and the map representation where noted. M, monocular; S, stereo; R or RGB-D, pair of RGB image and depth image; L, LiDAR; I, IMU; M/D, module dependent.

3.1 Monocular-based methods

Research [10] introduces MonoSLAM, a real-time algorithm for simultaneous localization and mapping with a single freely moving camera. In this work, localization is the main output of interest. A map is undoubtedly built, but it is a sparse map of landmarks optimized for enabling localization. The authors use a probabilistic feature-based map representing a snapshot of the current estimates of the camera and features and the uncertainty of those estimates. Updates are performed with an Extended Kalman Filter, and a large patch size of 11x11 pixels is used to identify long-term landmarks. The system is initialized with the help of known targets placed in front of the camera. A motion model and prediction are developed to remove the uncertainty of the control inputs. To reduce the time needed to find features in the image, the image position of each feature is predicted before deciding which to measure; efficiency is provided by this active feature search, ensuring that no image processing effort is wasted. A feature is initialized only after several frames containing it have passed, so that a depth estimate is available. A map-management algorithm is also included because of the low maximum number of features the system can handle.

In [12], the authors created a system for real-time camera tracking and reconstruction using an RGB-only camera and direct manipulation of image intensities. DTAM (Dense Tracking and Mapping) consists of three stages: the map is initialized by stereo measurements, the camera pose is estimated by matching synthetic views to the live feed, and depth is estimated for every pixel using multi-baseline stereo. Tracking is done by comparing the input image with synthetic view images generated from the reconstructed map.

The authors in [18] were inspired by the human visual navigation system. They created a SLAM method that incorporates semantic information about the environment. To do so, they use the ORB-SLAM2 [29] method for 3D point cloud generation and combine it with the semantic information provided by the Convolutional Neural Network (CNN) model PSPNet-101 [30]. The main idea is the extraction of semantic information from the camera images. The pixel-level semantic result and the current frame are then sent to the SLAM system for reconstruction. The pixel-level information is associated with the map points using a Bayesian update rule, which updates the probability distribution of each map point. Moreover, the landmarks are projected into the SLAM map and are also associated with the nearest keyframes. The resulting map can be used for re-localization.
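A hedged sketch of such a per-point Bayesian label update is shown below; the class list and observation probabilities are made up for illustration and do not come from the cited work.

```python
# Toy per-point Bayesian label fusion: each map point keeps a probability
# distribution over semantic classes and multiplies in the per-pixel CNN
# output of every new observation, then renormalizes.
def bayes_update(prior, likelihood):
    posterior = [p * l for p, l in zip(prior, likelihood)]
    norm = sum(posterior)
    return [p / norm for p in posterior]

classes = ["road", "building", "vegetation"]    # illustrative classes
belief = [1 / 3] * 3                            # uninformative prior
for obs in ([0.7, 0.2, 0.1], [0.6, 0.3, 0.1]):  # two simulated CNN observations
    belief = bayes_update(belief, obs)
print(dict(zip(classes, (round(b, 3) for b in belief))))
```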

Vitamin-E [19] is an indirect monocular SLAM algorithm. The system achieves high accuracy and robustness by tracking extremely dense feature points. Feature tracking is accomplished by monitoring the local extrema of curvature of the image intensities and following them over multiple images. The position of features in the next frame is predicted using a Geman-McClure kernel function. The reconstructed map is then adjusted by bundle adjustment to minimize reprojection errors. Moreover, a novel optimization technique called ‘subspace Gauss-Newton’ is introduced, which partially updates variables rather than all of them at once. This results in accurate 3D points, allowing point cloud generation and further dense geometry reconstruction such as meshing and noise removal.

3.2 Stereo based methods

PL-SLAM [17] is a method designed to run in low-textured environments. It uses both feature points and line segments found in the scene and achieves real-time performance. The system is split into three threads: visual odometry, local mapping and loop closure. The local mapping and loop closure threads are invoked only when a new keyframe is inserted. The map is a set of keyframes and 3D features such as keypoints and line segments. The map structure is a graph, so bundle adjustment can be performed. Loop closure runs in parallel to local mapping by extracting a descriptor for each keyframe image and comparing it to the current one. The best match is considered a loop closure only if the surrounding keyframes are also similar.

ProSLAM [25] is a graph-based lightweight SLAM system that maps the environment from a stereo image stream. The authors implement the system in a highly modular way and, although the design allows parallelization, in a single thread. The four main modules are frame point generation, position tracking, map management, and re-localization. The authors also describe the data structures’ internals and how they interact with all the remaining classes. A FAST detector is used for keypoint detection together with BRIEF descriptors. Epipolar feature matching is then used to find the pairs of features in the stereo image pair. Subsequently, the position tracking module estimates the motion between the two frames. Map management is responsible for map generation and for keeping track of landmarks and local maps produced from the output of the two preceding modules. Finally, the re-localization module performs graph optimization to compensate for the accumulated drift. Graph optimization uses the following procedure: first, a similarity search is performed to identify past observations likely to have been taken at the current location, using a Hamming Binary Search Tree; then, a geometric validation finds a consistent transform that maximizes the overlap between the two locations returned by the similarity search; finally, all local maps are refined to incorporate the newly added loop closure constraints.
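The sketch below illustrates the kind of Hamming-distance matching of binary descriptors (such as BRIEF) that the stereo matching and similarity search above rely on. The 8-bit descriptors are toy values (real BRIEF descriptors are typically 256 bits), and the brute-force search merely stands in for ProSLAM’s Hamming Binary Search Tree.

```python
# Toy Hamming-distance matching between binary descriptors.
def hamming(a, b):
    return bin(a ^ b).count("1")

left_descriptors = {0: 0b10110010, 1: 0b01101100, 2: 0b11110000}
right_descriptors = {0: 0b10110011, 1: 0b00001111, 2: 0b01101110}

for i, d_left in left_descriptors.items():
    j, dist = min(((j, hamming(d_left, d_right))
                   for j, d_right in right_descriptors.items()), key=lambda t: t[1])
    print(f"left feature {i} best matches right feature {j} (Hamming distance {dist})")
```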

3.3 RGB-D methods

KinectFusion [13] provides real-time mapping of an indoor scene as a single global implicit surface model. The current sensor pose is obtained by tracking the current depth frame relative to the global surface model using the ICP [31] algorithm. KinectFusion can track the 6-DoF camera pose using all live data available, integrate depth measurements directly into the global dense volumetric model and change the view dynamically. The GPU is used for high-speed tracking and reconstruction. The main components are surface measurement, surface reconstruction update using a Truncated Signed Distance Function (TSDF) representation, and surface prediction, which closes the loop between mapping and localization by tracking the depth frame against the globally fused model. Finally, sensor pose estimation is performed using ICP between the predicted surface and the current sensor measurement.
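The following simplified sketch shows the spirit of the TSDF update along a single camera ray: each voxel stores a truncated signed distance to the surface and a weight, fused as a running weighted average. It is a one-dimensional illustration under assumed parameters, not KinectFusion’s GPU implementation.

```python
import numpy as np

# Simplified TSDF fusion along one ray.
TRUNC = 0.05                               # truncation distance in meters (assumed)

def update_voxel(tsdf, weight, voxel_depth, measured_depth):
    sdf = measured_depth - voxel_depth     # signed distance along the ray
    if sdf < -TRUNC:                       # voxel far behind the surface: not updated
        return tsdf, weight
    d = min(1.0, sdf / TRUNC)              # truncate and normalize to [-1, 1]
    new_weight = weight + 1.0
    new_tsdf = (tsdf * weight + d) / new_weight
    return new_tsdf, new_weight

# Voxels every 2 cm along a ray; the sensor observes a surface at 1.0 m.
voxels = np.linspace(0.90, 1.08, 10)
tsdf, w = np.zeros_like(voxels), np.zeros_like(voxels)
for i, vd in enumerate(voxels):
    tsdf[i], w[i] = update_voxel(tsdf[i], w[i], vd, measured_depth=1.0)
print(np.round(tsdf, 2))   # the sign change marks the surface crossing; far voxels stay at 0
```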

The authors in [14] present SLAM++, a 3D SLAM paradigm with object-level semantic labels. Many objects repeat across areas; hence they are scanned beforehand and used for identification during mapping. The method is as follows. First, a database of common objects is created using KinectFusion [13], and a mesh is extracted from the Truncated Signed Distance Volume (TSDV) using marching cubes. The map representation preferred in the article is a graph, where each node stores either the estimated pose of an object or the historical pose of the camera at a timestep. Each object node is annotated with a type from the object database, and each pose node contains, as a factor, the measurement of an object’s pose relative to the camera pose. Real-time object recognition with the 6-DoF pose of 3D objects is implemented on the GPU. An object is detected and localized via the accumulation of votes in a parameter space, the basis of a vote being the correspondence between point-pair features. Matching similar scene features can be performed in parallel via a vectorized binary search. Camera tracking and accurate object pose estimation are accomplished using ICP [31], and graph optimization is used to further refine the poses. The system can also include prior information such as a common ground plane, and it offers a re-localization mode for when tracking is lost.

SegMap [21] is a 3D SLAM method that uses a segment-based map representation for localization and mapping. The system decomposes the robot’s surroundings into a set of segments, each represented by a distinctive, low-dimensional learning-based descriptor. The descriptor is computed by a CNN composed of three convolutional and two fully connected layers. The network is trained on the KITTI dataset using classification and scaled reconstruction losses. The descriptors are then used for data association by segment descriptor retrieval and matching. The same descriptors also allow the reconstruction of the environment’s map and the extraction of semantic information. This is the main feature of SegMap: using features of very low dimensionality, and hence a high compression rate, it can reconstruct the environment.

Sparse3D [23] is an off-line reconstruction system. Given an ordered (video stream) or unordered set of images with low inter-frame overlap, it produces a 3D reconstructed map. The system is composed of three main steps: feature extraction and correspondence, pairwise frame matching, and multi-graph global optimization for alignment pruning and refinement. The system uses three distinct kinds of keypoints to build the features: Shi-Tomasi and SIFT keypoints with SIFT descriptors for the 2D images, and NARF keypoints with FPFH descriptors for the 3D (depth) images. An affinity matrix is constructed to measure both the similarity and the spatial coherency between feature points and feature pairs, respectively. The best correspondence creates a transformation between the frames; note that different alignments can be found due to the three different kinds of keypoints used. A global optimization on a multi-graph is then performed, where multi-graph means that each node may be connected to another node by more than one edge.

3.4 Combinations of sensors (mono/stereo/RGB-D with IMU/LiDAR)

T-SLAM [11] is a map-building method that integrates topological and geometric maps created independently using multiple sensors. It uses a global topological map and a set of two or more local geometric maps, and each node of the topological map is registered with every local geometric map. To build the topological map, the robot samples the environment according to a sampling plan. During this first trip around the environment, called the Environment Familiarization phase, it collects features with its sensors into the Reference Sequence. Repeating the motion performed during the Environment Familiarization phase should propel the robot along the path described by the Reference Sequence, whereas any other maneuver will take the robot to a place that was not sampled in that phase, called a Lost_Place. The original views can be modeled as a left-to-right graph, which is augmented by the insertion of Lost_Places. When the robot needs to localize itself, the current view is compared to the previously collected views, and the robot’s current position is inferred using a Hidden Markov Model. The local geometric maps are created using SLAM methods such as DP-SLAM [32] and FastSLAM [33], which produce grid-based metric maps using particle filters. In their implementation, two grayscale cameras and a LiDAR are used.

Kimera [15] is an open-source library for real-time metric-semantic SLAM. It combines state-of-the-art efforts from various research areas into a SLAM system composed of four key modules: Kimera-VIO, the Visual-Inertial Odometry module, for fast and accurate state estimation; Kimera-RPGO, for Robust Pose Graph Optimization; Kimera-Mesher, for fast per-frame and multi-frame 3D mesh computation; and Kimera-Semantics, which builds a more accurate global 3D mesh using a volumetric approach and semantically annotates it using 2D pixel-wise semantic segmentation. Kimera also supports the ROS ecosystem for online SLAM or offline datasets. It has real-time capabilities for obstacle avoidance and 3D meshing, and it is modular, allowing each module to be replaced or executed in isolation. Kimera-RPGO is responsible for loop closure detection and for computing globally consistent keyframe poses using robust Pose Graph Optimization (PGO). Loop closure is accomplished with the help of the DBoW2 library, and PGO is implemented in GTSAM [34] with a modern outlier rejection method called Incremental Consistent Measurement Set Maximization.

OpenVSLAM [16] is open-source software designed as a modular SLAM framework with high usability and extensibility. Various camera models can be used, such as perspective, fisheye and equirectangular, and support for new ones can be implemented. Another crucial point is the ability to save the created maps and re-use them for localization. The implementation is divided into three modules. The tracking module estimates the current camera pose and decides whether a frame should be regarded as a new keyframe; if so, it is sent to the mapping and global optimization modules. The mapping module is responsible for creating the map, triangulating new 3D points using the inserted keyframes, and performing bundle adjustment for optimization. The last module, global optimization, performs loop detection and pose-graph optimization.

RTAB-Map (Real-Time Appearance-Based Mapping) [20] is an open-source library released in 2013 that has since evolved into a full SLAM solution. Initially, the library implemented loop closure detection with a memory management approach, limiting the size of the map so that loop closure was always processed in real time, even in long-term and large-scale environments. The full SLAM approach is graph-based, cross-platform and integrated with the ROS ecosystem. RTAB-Map is split into several modules, among them Short-Term Memory (STM), Working Memory (WM), Long-Term Memory (LTM), Loop Closure and Proximity Detection, and Global Map Assembling. STM is responsible for creating the nodes containing the odometry pose, sensor data and any other information useful to the other modules. The links between nodes are rigid transformations, created either by STM or by the Loop Closure and Proximity Detection module, and are used as constraints for graph optimization. RTAB-Map’s memory management runs on top of the graph management modules. It limits the size of the graph so that long-term online SLAM can be achieved in large environments: when an update violates the real-time constraint, some nodes are transferred from WM to LTM, and a node transferred to LTM is no longer used by the modules working on WM. The choice of nodes depends on a weighting mechanism. When a loop closure is detected within WM, the neighbors of that location can be brought back from LTM. RTAB-Map can use external odometry inputs (visual or LiDAR based). It is a fully ROS-compatible node, outputting online 3D point clouds or 2D occupancy grids that can be used by external libraries for navigation.
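The toy sketch below illustrates the working-memory/long-term-memory idea in isolation: when the online graph exceeds an assumed budget, the lowest-weight nodes are moved out of the working memory. It is only a conceptual illustration of the transfer policy, not RTAB-Map’s actual weighting scheme.

```python
# Toy WM/LTM transfer policy: keep the working memory within a budget by
# moving the lowest-weight nodes to long-term memory.
WM_BUDGET = 5

working_memory = {}            # node_id -> weight (e.g., how often re-observed)
long_term_memory = {}

def add_node(node_id, weight):
    working_memory[node_id] = weight
    while len(working_memory) > WM_BUDGET:
        victim = min(working_memory, key=working_memory.get)   # lowest-weight node
        long_term_memory[victim] = working_memory.pop(victim)

for i in range(8):
    add_node(i, weight=(i % 3))
print("WM:", sorted(working_memory), "LTM:", sorted(long_term_memory))
```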

DS-SLAM [22] is an extended version of ORB-SLAM2. It consists of five threads and utilizes semantic segmentation along with moving consistency checks. The five threads are tracking, semantic segmentation, loop closure, local mapping, and dense octree map creation. Semantic segmentation is performed in real time using the SegNet neural network to classify objects found in the scene. The system also checks the ORB feature points and, if any of them fall on moving objects, identified by the moving consistency check and the semantics, they are dropped to reduce computational time. Finally, an octree-based structure is used in the mapping thread to add the semantically segmented information, without any moving objects, to the map.

SceneSLAM [24] is an extensible modular SLAM framework with scene detection algorithms. The system comprises three main components: a SLAM module, a Scene Detection module, and a Software Bus module. The SLAM and Scene Detection modules are interfaces, meaning that various algorithms can be implemented and switched on or off at runtime based on certain conditions, and the Software Bus is the module responsible for tracking these changes. The system is based on a finite state machine: after the initialization phase, a scene detection algorithm is run and, based on its result, the switch state is enabled, which switches the SLAM algorithm according to the detected scene. One SLAM algorithm is used per scene detection value. Experiments were performed with dark/bright scene detection algorithms and laser/RGB-D SLAM modules, respectively. Any scene detection or SLAM method can be used as long as it implements the respective interface, and it can be switched on or off by the module loader of the Software Bus.

DPI-SLAM [26] is a dense planar-inertial SLAM system that reconstructs 3D dense models of large indoor environments. Its inputs are RGB-D images and Inertial Measurement Unit (IMU) data. The architecture consists of three main parts: odometry estimation and frame labeling, local depth fusion, and global planar-inertial mapping with structural constraints and loop closing. Starting with an RGB-D frame, the pose is predicted using preintegrated IMU measurements and refined with RGB-D odometry. The depth fusion module fuses the depth of several frames up to the last frame into a local depth map. In the third part, plane extraction is performed and point clusters are produced. Planes are then associated with existing landmarks using a projective method, added to a global factor graph and optimized together with the existing Visual Odometry (VO) and inertial factors. Loop closure constraints are also added and the graph is optimized again. Loop closures are found with a bag-of-words approach, and the optimization algorithm is based on iSAM2 [35].

ORB-SLAM3 [28] is an extension of the ORB-SLAM and ORB-SLAM VI (Visual-Inertial) systems. This combination facilitates short-term, mid-term and long-term data associations using all previously gathered information, not only that of the last few seconds. ORB-SLAM3 also introduces multi-map data association, providing the infrastructure for building maps by combining multiple mapping sessions, which can later be used for improved localization. It uses three different threads, one for each subsystem: tracking, local mapping, and loop and map merging. ORB-SLAM3 also introduces a camera-model-agnostic design, meaning it can work not only with a pinhole model but with any other model, such as a fisheye model, provided it is implemented. It also offers robustness when the agent gets lost, with a two-stage method of short-term and long-term loss handling. Loop closure is provided using the DBoW2 bag-of-words place recognition system [36], and temporal and geometric consistency checks are performed for improved results. Finally, if a loop closure involves a different map session or a non-active map, a map merge is performed.


4. Performance analysis

The Robot Operating System (ROS) [37] provides 2D SLAM out of the box, but the search for a 3D alternative or replacement is still evolving. Benchmarking is the natural way to identify a suitable solution that can be supported in the long term.

Several review papers experimentally compare combinations of these algorithms. The SLAM algorithms are put to the test on open datasets published by other researchers. The datasets provide many different environments, indoor and outdoor, and different camera movement styles, such as on-board a robot, handheld, with rapid changes, or purely rotational. This makes it possible to stress test the algorithms on edge cases and find the most stable, robust, and best performing one. Notable datasets on which many algorithms are tested are EuRoC MAV, TUM RGB-D and KITTI. Each one contains raw data from different sensors, environments, and camera movements: EuRoC provides stereo data with IMU of indoor environments captured by a drone, TUM provides indoor RGB-D data captured either on-board a robot or from a handheld device, and KITTI provides outdoor stereo data without IMU information, captured from a camera mounted on top of a car.
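A typical accuracy metric reported on these datasets is the absolute trajectory error (ATE). The sketch below computes it as a plain RMSE between estimated and ground-truth positions, omitting the trajectory alignment step that full evaluation tools (for example, the TUM scripts) perform first; the trajectories are made-up toy data.

```python
import numpy as np

# Absolute trajectory error (ATE) as RMSE of the position differences,
# without the alignment step used by full benchmarking tools.
def ate_rmse(estimated, ground_truth):
    diff = np.asarray(estimated) - np.asarray(ground_truth)
    return np.sqrt((np.linalg.norm(diff, axis=1) ** 2).mean())

gt  = [[0, 0, 0], [1, 0, 0], [2, 0, 0], [3, 0, 0]]                      # toy ground truth
est = [[0, 0.02, 0], [1.05, 0.01, 0], [1.98, -0.03, 0], [3.1, 0.02, 0]]  # toy estimate
print(f"ATE RMSE: {ate_rmse(est, gt):.3f} m")
```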

In a recent comparison [38], the authors try to find the best 3D vSLAM system to incorporate into the ROS 2 ecosystem. They test three of the latest state-of-the-art, feature-rich SLAM systems: ORB-SLAM3, OpenVSLAM and RTAB-Map. In the comparison, OpenVSLAM performed robustly on all three datasets, EuRoC, TUM and KITTI. When IMU information is available, they propose incorporating it, as it improves overall accuracy and robustness. While there is still much to improve in vSLAM, such as the handling of lighting conditions, they note that vSLAM is an essential tool for moving away from expensive LiDAR sensors.

The comparison in [39] tries to shed further light on this matter by benchmarking a separate set of systems on six different datasets. The datasets cover a thorough list of probable situations, such as stereo, RGB-D and IMU sensors, fast motion, dynamic objects, illumination changes and sensor degradation, captured on ground robots, drones and handheld devices, and even synthesized data. The algorithms tested include ORB-SLAM2 [29], ORB-SLAM3 [28], OpenVINS [40], FullFusion [41], ReFusion [42] and ElasticFusion [43]. The conclusion is that ORB-SLAM3 provides the best balance across the various conditions of illumination, rapid changes, and dynamic objects, at the cost of frame rates below 15 FPS. The authors also confirm that the added inertial information in ORB-SLAM3 improved overall robustness compared to ORB-SLAM2.

The performance of vSLAM methodologies varies depending on the environment and the robot’s motion. Poor illumination or illumination changes, the texture of the environment and fast motion are some contributing factors. Overall, OpenVSLAM performed the most robustly in all the different scenarios with high reliability. ORB-SLAM3, even though hard crashes (segmentation faults) could occur, manages to perform robustly under various conditions with good accuracy. The comparisons from [38, 39] can be seen in Figures 5 and 6.

Figure 5.

Comparison from [38] between ORB-SLAM3, OpenVSLAM and RTAB-Map on the KITTI dataset.

Figure 6.

Comparison from [39] of SLAM systems across various settings.


5. Implementation: Q-CONPASS project

Q-CONPASS is a system currently under development that focuses on assisting the manufacturing process and quality control. A robotic agent will assist the workers and the management team in numerous ways. First, the robot will move between workspaces and a dedicated area where small and expendable materials are stored. It will pick up the components needed for the currently manufactured unit and transport them back to the workspace. In this way, unnecessary trips by workers will be eliminated and the workers will be utilized in a more efficient and productive manner. Moreover, it was found that many shipped orders were missing crucial components, which cost the company either reimbursements or individual shipments of the missing parts. Hence, the robotic agent will also have specific stationary spots in some of the workspaces. One spot will be used for quality control: it will monitor the various components needed to complete an order and check whether the worker placed each component in the box. Another spot will be used for monitoring the workers and their ergonomic posture [44]. A report of indicators describing a worker’s correct and incorrect postures will be handed to the management team in order to improve the worker safety procedures and avoid further health problems.

RTAB-Map, the most widely used algorithm on ROS, was deployed in the Q-CONPASS project. A TurtleBot 2, equipped with a LiDAR and an RGB-D camera, was used as the MIR, and a rosbag file was recorded on site during the initial mapping.
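Before feeding such a recording to RTAB-Map offline, it can be useful to check which topics were captured. The sketch below does this with the ROS 1 rosbag Python API; the bag filename is hypothetical and the topics present depend on the TurtleBot 2 sensor configuration.

```python
# Inspect a recorded rosbag: list the topics, message types and counts.
import rosbag

with rosbag.Bag("qconpass_mapping_session.bag") as bag:   # hypothetical filename
    info = bag.get_type_and_topic_info()
    for topic, meta in info.topics.items():
        print(f"{topic}: {meta.msg_type}, {meta.message_count} messages")
```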

Afterwards, RTAB-Map was run offline to produce a 3D map of the environment. Using the default parameters and enforcing 3 DoF for localization, since it is a ground robot, RTAB-Map was able to produce the 3D map. Moreover, RTAB-Map enables users to export the produced point clouds, as well as the robot poses and an OctoMap version of the map. Figure 7 shows the map produced inside the RTAB-Map Viewer as a point cloud and the OctoMap version in ROS RViz.

Figure 7.

Mapping using RTAB-map GUI (above) and RViz for visualization (below).

Figure 8 shows the same scene inside the RTAB-Map Viewer and RViz.

Figure 8.

Same scene visualized from RTAB-map viewer (above) vs. RViz (OctoMap octree) (below).


6. Discussion and conclusions

SLAM is an essential part of a wide range of applications, including autonomous vehicles, Virtual and Augmented Reality (AR/VR) and robotics. Being able to operate in an environment safely and robustly is a must. To accomplish that, various sensors can be used, but of special interest is the camera sensor, whether mono, stereo, or RGB-D. Camera sensors are cheap and provide a lot of information in comparison to high-cost 3D LiDARs. Many algorithms and methodologies have been proposed to bring out the best of cameras despite their computational cost, and the barriers to safe and robust localization and mapping have dropped significantly. Out of the many methods that exist, some are quite popular and are becoming the baselines against which other implementations are compared. ORB-SLAM2, RTAB-Map and OpenVSLAM have so far been the most popular, and ORB-SLAM3 is gaining traction. Visual SLAM is increasingly used, and a lot of research is still ongoing on how to minimize its cost, but the improvement so far is tremendous, allowing it to run even on embedded devices, albeit at lower rates. Moreover, combined with more sensors such as an IMU, Visual-Inertial SLAM can provide further improvements in accuracy and robustness.


Acknowledgments

This work has been supported by the European Regional Development Fund of the European Union and Greek national funds through the Operational Program Competitiveness, Entrepreneurship and Innovation (project code: T6YBΠ-00238).

References

  1. World Robotics Summary 2020. Available from: https://ifr.org/img/worldrobotics/Executive_Summary_WR_Industrial_Robots_2021.pdf
  2. Azadeh K, Roy D, de Koster MBMR. Dynamic human-robot collaborative picking strategies. SSRN Electronic Journal. 2020. Available from: https://www.semanticscholar.org/paper/Dynamic-Human-Robot-Collaborative-Picking-Azadeh-Roy/87b513dfa95b2308550da324135aa9e3d829afba
  3. Durrant-Whyte H, Bailey T. Simultaneous localization and mapping: Part I. IEEE Robotics & Automation Magazine. 2006;13(2):99-110. Available from: http://ieeexplore.ieee.org/document/1638022/
  4. Bailey T, Durrant-Whyte H. Simultaneous localization and mapping (SLAM): Part II. IEEE Robotics and Automation Magazine. 2006;13(3):108-117. Available from: http://ieeexplore.ieee.org/document/1678144/
  5. Jinyu L, Bangbang Y, Danpeng C, Nan W, Guofeng Z, Hujun B. Survey and evaluation of monocular visual-inertial SLAM algorithms for augmented reality. Virtual Reality & Intelligent Hardware. 2019;1(4):386-410
  6. Singandhupe A, La H. A review of SLAM techniques and security in autonomous driving. In: Proceedings—3rd IEEE International Conference on Robotic Computing, IRC 2019. 2019. pp. 602-607
  7. Juneja A, Bhandari L, Mohammadbagherpoor H, Singh A, Grant E. A comparative study of SLAM algorithms for indoor navigation of autonomous wheelchairs. In: 2019 IEEE International Conference on Cyborg and Bionic Systems, CBS 2019. 2019. pp. 261-266
  8. Asada M, Tanaka T, Hosoda K. Visual tracking of unknown moving object by adaptive binocular visual servoing. In: IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems. 1999. pp. 249-254. Available from: https://ieeexplore.ieee.org/document/815998
  9. Martin Martin R, Lorbach M, Brock O. Deterioration of depth measurements due to interference of multiple RGB-D sensors. In: IEEE International Conference on Intelligent Robots and Systems. 2014. pp. 4205-4212
  10. Davison AJ, Reid ID, Molton ND, Stasse O. MonoSLAM: Real-time single camera SLAM. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2007;29(6):1052-1067
  11. Ferreira F, Amorim I, Rocha R, Dias J. T-SLAM: Registering topological and geometric maps for robot localization in large environments. In: 2008 IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems. Seoul, Korea (South): IEEE; 2008. pp. 392-398. Available from: https://ieeexplore.ieee.org/document/4648097
  12. Newcombe RA, Lovegrove SJ, Davison AJ. DTAM: Dense tracking and mapping in real-time. In: 2011 International Conference on Computer Vision. Barcelona, Spain: IEEE; 2011. pp. 2320-2327. Available from: https://www.robots.ox.ac.uk/~vgg/rg/papers/newcombe_davison__2011__dtam.pdf
  13. Newcombe RA, Fitzgibbon A, Izadi S, Hilliges O, Molyneaux D, Kim D, et al. KinectFusion: Real-time dense surface mapping and tracking. In: 2011 10th IEEE International Symposium on Mixed and Augmented Reality. Basel, Switzerland: IEEE; 2011. pp. 127-136. Available from: https://ieeexplore.ieee.org/document/6162880/
  14. Salas-Moreno RF, Newcombe RA, Strasdat H, Kelly PHJ, Davison AJ. SLAM++: Simultaneous localisation and mapping at the level of objects. In: 2013 IEEE Conference on Computer Vision and Pattern Recognition. Portland, OR, USA: IEEE; 2013. pp. 1352-1359. Available from: http://ieeexplore.ieee.org/document/6619022/
  15. Rosinol A, Abate M, Chang Y, Carlone L. Kimera: An open-source library for real-time metric-semantic localization and mapping. Journal of Visual Languages and Computing. 2020;11(3):1689-1696
  16. Sumikura S, Shibuya M, Sakurada K. OpenVSLAM: A versatile visual SLAM framework. In: Proceedings of the 27th ACM International Conference on Multimedia (MM ‘19). New York, NY, USA: Association for Computing Machinery; 2019. pp. 2292-2295. DOI: 10.1145/3343031.3350539
  17. Gomez-Ojeda R, Moreno FA, Zuñiga-Noël D, Scaramuzza D, Gonzalez-Jimenez J. PL-SLAM: A stereo SLAM system through the combination of points and line segments. IEEE Transactions on Robotics. 2019;35(3):734-746
  18. Zhao Z, Mao Y, Ding Y, Ren P, Zheng N. Visual-based semantic SLAM with landmarks for large-scale outdoor environment. In: 2019 2nd China Symposium on Cognitive Computing and Hybrid Intelligence (CCHI), Xi’an, China. 2019. pp. 149-154. DOI: 10.1109/CCHI.2019.8901910
  19. Yokozuka M, Oishi S, Thompson S, Banno A. Vitamin-E: Visual tracking and mapping with extremely dense feature points. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition. 2019;2019:9633-9642
  20. Labbé M, Michaud F. RTAB-Map as an open-source lidar and visual simultaneous localization and mapping library for large-scale and long-term online operation. Journal of Field Robotics. 2019;36(2):416-446
  21. Dubé R, Cramariuc A, Dugas D, Nieto J, Siegwart R, Cadena C. SegMap: 3D segment mapping using data-driven descriptors. Robotics: Science and Systems. 2018;39(2-3):339-355. Available from: https://journals.sagepub.com/doi/full/10.1177/0278364919863090
  22. Yu C, et al. DS-SLAM: A semantic visual SLAM towards dynamic environments. In: 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain. 2018. pp. 1168-1174. DOI: 10.1109/IROS.2018.8593691. Available from: https://ieeexplore.ieee.org/document/8593691
  23. Le C, Li X. Sparse3D: A new global model for matching sparse RGB-D dataset with small inter-frame overlap. CAD Computer Aided Design. 2018;102:33-43. DOI: 10.1016/j.cad.2018.04.018. Available from: https://digitalcommons.lsu.edu/eecs_pubs/722/
  24. Tong Z, Shi D, Yang S. SceneSLAM: A SLAM framework combined with scene detection. In: 2017 IEEE International Conference on Robotics and Biomimetics (ROBIO), Macau, Macao. 2017. pp. 487-494. DOI: 10.1109/ROBIO.2017.8324464. Available from: https://ieeexplore.ieee.org/document/8324464
  25. Schlegel D, Colosi M, Grisetti G. ProSLAM: Graph SLAM from a programmer’s perspective. In: 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, QLD, Australia. 2018. pp. 3833-3840. DOI: 10.1109/ICRA.2018.8461180. Available from: https://ieeexplore.ieee.org/document/8461180
  26. Hsiao M, Westman E, Kaess M. Dense planar-inertial SLAM with structural constraints. In: 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, QLD, Australia. 2018. pp. 6521-6528. DOI: 10.1109/ICRA.2018.8461094. Available from: https://ieeexplore.ieee.org/document/8461094
  27. Pire T, Fischer T, Castro G, de Cristóforis P, Civera J, Jacobo BJ. S-PTAM: Stereo parallel tracking and mapping. Robotics and Autonomous Systems. 2017;93(3):27-42
  28. Campos C, Elvira R, Rodriguez JJG, Jose JM, Tardos JD. ORB-SLAM3: An accurate open-source library for visual, visual inertial, and multimap SLAM. IEEE Transactions on Robotics. 2021;37(6):1874-1890. Available from: https://ieeexplore.ieee.org/document/9440682
  29. Mur-Artal R, Tardos JD. ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras. IEEE Transactions on Robotics. 2017;33(5):1255-1262. Available from: http://ieeexplore.ieee.org/document/7946260/
  30. Zhao H, Shi J, Qi X, Wang X, Jia J. Pyramid scene parsing network. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA. 2017. pp. 6230-6239. DOI: 10.1109/CVPR.2017.660. Available from: https://ieeexplore.ieee.org/document/8100143
  31. Zhang Z. Iterative point matching for registration of free-form curves and surfaces. International Journal of Computer Vision. 1994;13(2):119-152
  32. Eliazar AI, Parr R. DP-SLAM 2.0. In: IEEE International Conference on Robotics and Automation, 2004. Proceedings. ICRA ‘04, New Orleans, LA, USA. 2004;2:1314-1320. DOI: 10.1109/ROBOT.2004.1308006. Available from: https://ieeexplore.ieee.org/document/1308006
  33. Thrun S, Montemerlo M, Koller D, Wegbreit B, Nieto J, Nebot E. FastSLAM: An efficient solution to the simultaneous localization and mapping problem with unknown data association. Journal of Machine Learning Research. 2004;4(3):380-407
  34. Dellaert F. Factor Graphs and GTSAM: A Hands-on Introduction. 2012. Available from: https://smartech.gatech.edu/handle/1853/45226
  35. Kaess M, Johannsson H, Roberts R, Ila V, Leonard JJ, Dellaert F. iSAM2: Incremental smoothing and mapping using the Bayes tree. The International Journal of Robotics Research. 2012;31(2):216-235
  36. Gálvez-López D, Tardós JD. Bags of binary words for fast place recognition in image sequences. IEEE Transactions on Robotics. 2012;28:1188-1197
  37. Quigley M, Conley K, Gerkey BP, Faust J, Foote T, Leibs J, et al. ROS: An open-source Robot Operating System. In: ICRA Workshop on Open Source Software. 2009. Available from: https://www.bibsonomy.org/bibtex/2281f400bf541a0022e41ace75d9156ea/markusjordan88
  38. Merzlyakov A, Macenski S. Comparison of modern general-purpose visual SLAM approaches. In: 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Prague, Czech Republic. 2021. pp. 9190-9197. DOI: 10.1109/IROS51168.2021.9636615. Available from: https://ieeexplore.ieee.org/document/9636615
  39. Bujanca M, Shi X, Spear M, Zhao P, Lennox B, Lujan M. Robust SLAM systems: Are we there yet? In: 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Prague, Czech Republic. 2021. pp. 5320-5327. DOI: 10.1109/IROS51168.2021.9636814. Available from: https://ieeexplore.ieee.org/document/9636814
  40. Geneva P, Eckenhoff K, Lee W, Yang Y, Huang G. OpenVINS: A research platform for visual-inertial estimation. In: 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France. 2020. pp. 4666-4672. DOI: 10.1109/ICRA40945.2020.9196524. Available from: https://ieeexplore.ieee.org/document/9196524
  41. Bujanca M, Luján M, Lennox B. FullFusion: A framework for semantic reconstruction of dynamic scenes. In: 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), Seoul, Korea (South). 2019. pp. 2168-2177. DOI: 10.1109/ICCVW.2019.00272. Available from: https://ieeexplore.ieee.org/document/9022128
  42. Palazzolo E, Behley J, Lottes P, Giguère P, Stachniss C. ReFusion: 3D reconstruction in dynamic environments for RGB-D cameras exploiting residuals. In: 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China. 2019. pp. 7855-7862. DOI: 10.1109/IROS40897.2019.8967590. Available from: https://ieeexplore.ieee.org/document/8967590
  43. Whelan T, Leutenegger S, Salas Moreno R, Glocker B, Davison A. ElasticFusion: Dense SLAM without a pose graph. In: Robotics: Science and Systems. Robotics: Science and Systems Foundation; 2015. Available from: http://www.roboticsproceedings.org/rss11/p01.pdf
  44. Chatzis T, Konstantinidis D, Dimitropoulos K. Automatic ergonomic risk assessment using a variational deep network architecture. Sensors. 2022;22:6051. Available from: https://www.mdpi.com/1424-8220/22/16/6051/htm

Notes

  1. https://www.stereolabs.com/zed/specs/
  2. https://www.intelrealsense.com/depth-camera-d435/
