Visual-Tactile Fusion for Robotic Stable Grasping

Stable grasping is the basis of robotic manipulation. It requires a balance between the contact forces and the manipulated object. Judging the status of a grasp from vision is direct, since it relies on the object's shape or texture, but it remains quite challenging. Tactile sensors offer an effective complement. In this work, we propose a visual-tactile fusion framework for predicting grasp stability, which also uses the intrinsic properties of the object. More than 2550 grasping trials are collected using a novel robot hand with multiple tactile sensors, and a visual-tactile-intrinsic deep neural network (DNN) is evaluated on them. The experimental results show the superiority of the proposed method.


Introduction
In recent years, dexterous robotic manipulation has attracted increasing worldwide attention because it plays an important role in service robotics. Stable grasping is the basis of manipulation, yet it remains challenging because it depends on many factors, such as the actuator, sensor, movement, object, and environment. With the development of neural networks, data-driven methods [1] have become popular. For example, Levine et al. used 14 robots to perform over 800,000 random grasps to collect data and train a convolutional neural network (CNN) [2]. Guo et al. trained a deep neural network (DNN) with 12K labeled images to learn end-to-end grasping policies [3]. Mahler et al. built a dataset of millions of point clouds to train the Grasp Quality Convolutional Neural Network (GQ-CNN) with an analytic metric. GQ-CNN then produced an optimal grasp strategy that achieves a 93% success rate for eight kinds of objects [4][5][6]. Zhang et al. trained robots to manipulate objects from videos made in virtual reality (VR). For pick-and-place tasks, the success rate increased with the number of samples [7]. Sufficient high-quality data is therefore important for robotic grasping.
A few datasets for robot grasping have been developed. The Playpen dataset contains 60 hours of grasping data from the PR2 robot with RGBD cameras [8]. The Columbia dataset collects about 22,000 grasping samples via the GraspIt! simulator [9]. Besides experiments with robots and numerical simulations, human manipulation videos are also useful: self-supervised learning algorithms have been developed from video demonstrations [10]. While the above datasets cover the whole grasping process, other datasets concentrate on specific tasks, such as grasp planning and slip detection. Pinto et al. had robots automatically generate labeled images for grasp planning over 50,000 trials using self-supervised learning algorithms [11]. MIT built a grasp dataset with a vision-based tactile sensor and external vision [12]. While some experiments induced slip by applying extra force or fixing the objects [13,14], other researchers recorded actual random grasping, with 46% failures in 1000 grasps [15,16]. Such real data can contribute to precision grasping [17]. The sheer variety of everyday objects makes datasets difficult to build. Some researchers therefore select common objects and build 3D object model sets, such as the KIT objects [18] and the YCB object set [19], which are more convenient for research. However, few datasets include both visual and tactile data. Sufficient visual, tactile, and position data can clearly describe the grasping process and improve the robot's grasping ability.
Given the previous work, it is necessary to build a complete dataset for robotic manipulation. In this chapter, a new grasp dataset based on a three-finger robot hand is built. The structure of the multimodal dataset is introduced in detail in the following sections. Moreover, a CNN and long short-term memory networks (LSTMs) are designed for grasp stability prediction.

Grasp stability prediction
In this section, the multimodal fusion framework of grasp stability prediction is proposed.

Visual representation learning
The visual image set provides only 2700 × 2 = 5400 sets of image data in total, all of which are used. With so little data, it is difficult to extract visual features with a convolutional neural network (the ResNet-18 architecture is used in our experiment), because training converges poorly on a small dataset. We therefore use a time-contrastive network [10], which draws anchor, positive, and negative samples from videos of the grasping process. We then define the triplet loss function [20] and exploit the continuous change of motion in the video to learn the manipulation process. The learned visual features also serve as pre-training for the convolutional part of the subsequent grasp stability prediction network. As shown in Figure 1, cameras at multiple angles record video of the same grasping process. Images taken at the same time from different viewpoints should represent the same robot state, so the distance between their embedding vectors should be relatively small. Images taken from the same viewpoint at different times represent different grasping states, so the distance between their embedding vectors should be relatively large. Here f(x_i^a), f(x_i^p), and f(x_i^n) denote the anchor, positive, and negative image features extracted by the CNN, and the loss function [21] is defined on these features so that the anchor lies closer to the positive than to the negative.
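The description above matches the standard triplet formulation. A sketch in that notation follows; the margin α, the squared Euclidean distance, and the sum over N triplets are assumptions taken from the usual formulation rather than values stated in this chapter.

```latex
% Constraint: the anchor is closer to the positive than to the negative by a margin \alpha
\left\| f(x_i^a) - f(x_i^p) \right\|_2^2 + \alpha < \left\| f(x_i^a) - f(x_i^n) \right\|_2^2

% Hinge-style loss that penalizes triplets violating the constraint
L = \sum_{i=1}^{N} \max\left( 0,\; \left\| f(x_i^a) - f(x_i^p) \right\|_2^2 - \left\| f(x_i^a) - f(x_i^n) \right\|_2^2 + \alpha \right)
```

A triplet that already satisfies the constraint contributes zero to the loss, so training concentrates on the views and time steps that the embedding still confuses.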

Predicting grasp stability
To describe object properties such as shape and size, images are captured from two cameras before grasping and are denoted by Ib (Figure 2). Id is the position of the robot relative to the grasped object. The visual feature fv is computed from these inputs by R, where R represents the pre-trained neural network. The images are passed through a standard convolutional network that uses the ResNet-18 architecture.
Different from the previous work [22], tactile sensors are used to measure the force applied by the robot during manipulation. Because the tactile data are sequences, LSTMs are applied as the feature extractor: the tactile feature ft is the LSTM output at the last time step, and T0, T1, ..., TT are the LSTM inputs at each step.
The mass and mass distribution of the object also affect grasp stability. To simplify the problem, the weight of the object is assumed to be known and the mass distribution is assumed to be uniform. The intrinsic object property is then described by fi, the intrinsic object feature, which a multilayer perceptron (MLP) extracts from the object weight w.
The sensory modalities provide complementary information about the prospects for a successful grasp. For example, the camera images may show that the gripper is near the center of the object, while the tactile data show that the force is sufficient to keep the grasp stable. To study multimodal fusion for predicting grasp outcomes, a neural network is trained to predict whether the robot's grasp will succeed by integrating visual, tactile, and intrinsic object features. The network computes y = f(X), where y is the probability of a successful grasp and X = [fv, ft, fi] combines the features of the three modalities: visual, tactile, and object intrinsic properties.
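The fusion step can be illustrated with a minimal sketch, assuming the three feature vectors are simply concatenated and passed to a single sigmoid output. The function names, the toy feature values, and the single linear layer are hypothetical stand-ins, not the network used in the experiments.

```python
import math


def fuse(fv, ft, fi):
    """Concatenate visual, tactile, and intrinsic features into X = [fv, ft, fi]."""
    return list(fv) + list(ft) + list(fi)


def predict_success(x, weights, bias):
    """Map the fused vector to a probability with one linear layer and a sigmoid.

    The single linear layer is a hypothetical stand-in for the trained network.
    """
    if len(x) != len(weights):
        raise ValueError("weights must match the fused feature length")
    z = sum(w * v for w, v in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))


# Toy features; real ones would come from ResNet-18, the LSTM, and the MLP.
fv = [0.2, -0.1, 0.4]
ft = [0.5, 0.3]
fi = [0.8]
x = fuse(fv, ft, fi)
y = predict_success(x, weights=[0.0] * len(x), bias=0.0)
```

With all weights at zero the model has no preference between outcomes, so y sits at the midpoint of the probability range; training would move the weights so that informative features shift y toward success or failure.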
Training the network: the weights of the visual CNN are initialized from the model pretrained in the Visual representation learning section. The visual representation network is trained for 200 epochs using the Adam optimizer [23], starting with a learning rate of 10^-5, which is decreased by an order of magnitude halfway through the training process.
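The learning-rate schedule can be sketched as a simple step function, assuming an initial rate of 10^-5 and a single drop at the midpoint of 200 epochs. The function name and its defaults are illustrative, not taken from the original implementation.

```python
def learning_rate(epoch, total_epochs=200, base_lr=1e-5):
    """Step schedule: base_lr for the first half, base_lr / 10 for the second half."""
    if not 0 <= epoch < total_epochs:
        raise ValueError("epoch out of range")
    return base_lr if epoch < total_epochs // 2 else base_lr / 10.0


# One learning rate per epoch, indexed from 0.
schedule = [learning_rate(e) for e in range(200)]
```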
During training, the RGB images are cropped to the region containing the table that holds the objects. Then, following standard practice in object recognition, the images are resized to 256 × 256 and randomly cropped to 224 × 224. The images are also randomly flipped in the horizontal direction. Although the underlying data are unchanged, this augmentation helps to prevent overfitting.
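The crop-and-flip augmentation can be sketched in plain Python on a nested list that stands in for a resized image. The flip probability of 0.5 and all function names are assumptions for illustration; an image library would normally perform these steps.

```python
import random


def random_crop(image, size, rng):
    """Cut a size x size window at a random offset from a 2-D list of pixels."""
    height, width = len(image), len(image[0])
    if size > height or size > width:
        raise ValueError("crop larger than image")
    top = rng.randint(0, height - size)
    left = rng.randint(0, width - size)
    return [row[left:left + size] for row in image[top:top + size]]


def random_hflip(image, rng, p=0.5):
    """Mirror the image left to right with probability p (p is an assumption)."""
    if rng.random() < p:
        return [row[::-1] for row in image]
    return image


def augment(image, rng, crop=224):
    """Random crop followed by a random horizontal flip."""
    return random_hflip(random_crop(image, crop, rng), rng)


rng = random.Random(0)
# Stand-in for an image already resized to 256 x 256; each pixel stores its (row, col).
resized = [[(r, c) for c in range(256)] for r in range(256)]
sample = augment(resized, rng)
```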

Experiment and data collection
The experiment platform consists of the Eagle Shoal robot hand, two RealSense SR300 cameras, and the UR5 robot arm. As shown in Figure 3, they are arranged around a table 600 mm long and 600 mm wide. A layer of sponge on the table surface provides protection, and a soft flannel sheet covers the table.
The general grasp dataset is built by varying shape, size, weight, grasp style, and other factors. The objects include cuboids and cylinders of different sizes as well as special shapes, and their weights are changed by adding granules or water. Different grasping methods are tested by grasping from three directions: back, right, and top. Unstable grasping data are generated by inducing slip through added weight, changed grasp force, and adjusted grasp position (Table 1). The detailed process is as follows:
1. The object is placed at the center of the table; the front camera captures the point cloud data, from which the target's position is computed.
2. The half-height position of the object is chosen as the grasp point, the robot is controlled to approach the object, and a random error of ±5 mm is added.
3. Based on the object's size, the robot hand grasps in position-loop mode; after 1 second, the robot arm lifts at a speed of 20 mm/s.
4. After the robot arm moves to a certain position, the robotic fingers' position is checked for change. If the hand bends too much, the grasp is labeled as a failure, and the hand is opened directly to prepare for the next grasp.
5. Otherwise, the grasp is labeled as a success. For a light object, the robot puts the object down; for a heavy object, the robot opens the hand and drops the object directly.
6. The object is placed back at the center of the table, and the robot arm returns to the initial position and waits for the next loop.
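Two steps of the procedure above can be sketched in code: choosing the grasp height with a random error of up to 5 mm in either direction, and labeling a grasp from how far the fingers have closed. The uniform error distribution, the normalized bend value, and the bend_limit threshold are hypothetical; the chapter does not state them.

```python
import random


def sample_grasp_height(object_height_mm, rng, max_error_mm=5.0):
    """Half the object height plus a uniform random error within +/- max_error_mm."""
    return object_height_mm / 2.0 + rng.uniform(-max_error_mm, max_error_mm)


def label_grasp(finger_bend, bend_limit):
    """Return 'failure' when the fingers have closed past bend_limit, else 'success'.

    bend_limit is a hypothetical threshold; the chapter does not state its value.
    """
    return "failure" if finger_bend > bend_limit else "success"


rng = random.Random(0)
height = sample_grasp_height(120.0, rng)  # grasp height for a 120 mm tall object
labels = [label_grasp(b, bend_limit=0.8) for b in (0.3, 0.95)]
```

The labeling rule reflects the position-loop grasp: if the object has slipped out, nothing stops the fingers from closing further, so excessive bending signals a failed grasp.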
The proposed method is compared with traditional classifiers, including k-nearest neighbor (KNN) [24], support-vector machine (SVM) [25], and naive Bayes (NB) [26]. The 2550 sets are divided into 80% for training and 20% for testing. The KNN classifier uses k = 3, and the SVM uses a radial basis function (RBF) kernel. The first criterion is the success rate n/m, where n is the number of detections and m is the number of labeled samples. The comparison in Table 2 shows that both LSTM and SVM achieve a high success rate. However, the SVM's detections lie on the falling edge, which means that the SVM model obtains its good classification result by learning falling-edge features. The falling edge indicates that the object has already been dropped, so it cannot help to achieve a stable grasp. SVM is therefore unsuitable for this test.
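The data split and the success-rate criterion can be sketched as follows. The shuffling, the seed, and the reading of n as the number of predictions that match their labels are assumptions made for illustration.

```python
import random


def split_indices(n_samples, train_fraction=0.8, seed=0):
    """Shuffle sample indices and split them into training and test sets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    cut = int(round(n_samples * train_fraction))
    return indices[:cut], indices[cut:]


def success_rate(predicted, labels):
    """n / m, with n the predictions that match their labels and m the labeled samples."""
    if len(predicted) != len(labels) or not labels:
        raise ValueError("need equally sized, non-empty inputs")
    n = sum(1 for p, t in zip(predicted, labels) if p == t)
    return n / len(labels)


train_idx, test_idx = split_indices(2550)
rate = success_rate([1, 0, 1, 1], [1, 0, 0, 1])
```

For 2550 sets, an 80/20 split gives 2040 training sets and 510 test sets.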
Besides the success rate, a second criterion is needed to evaluate slip detection. If the predicted result turns from 1 to 0 earlier than the label does, the sample is counted as an ahead sample. With n_ahead denoting the number of ahead samples, the ahead rate is n_ahead/m. The results are shown in Table 2. Under these two criteria, LSTM shows superior performance, attaining both a higher success rate and a higher ahead rate (Figures 4 and 5).
Table 2. Classification results of different classifiers.
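The ahead-rate criterion can be sketched as below, assuming each sample is a binary sequence in which 1 means a stable grasp and 0 means slip, and that a sample counts as ahead only when both the prediction and the label contain a 1-to-0 transition. The function names and toy sequences are illustrative.

```python
def first_drop(sequence):
    """Index of the first 1 -> 0 transition, or None if there is none."""
    for t in range(1, len(sequence)):
        if sequence[t - 1] == 1 and sequence[t] == 0:
            return t
    return None


def ahead_rate(predictions, labels):
    """n_ahead / m over paired prediction and label sequences."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("need equally sized, non-empty inputs")
    n_ahead = 0
    for pred, label in zip(predictions, labels):
        t_pred, t_label = first_drop(pred), first_drop(label)
        # Ahead sample: the prediction drops strictly before the label does.
        if t_pred is not None and t_label is not None and t_pred < t_label:
            n_ahead += 1
    return n_ahead / len(labels)


preds = [[1, 1, 0, 0, 0], [1, 1, 1, 1, 0], [1, 1, 1, 1, 1]]
labels = [[1, 1, 1, 0, 0], [1, 1, 1, 0, 0], [1, 1, 1, 0, 0]]
rate = ahead_rate(preds, labels)
```

In this toy case only the first prediction drops before its label, so one of the three samples counts as ahead. A classifier that reacts only on the falling edge would score well on success rate but poorly here.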

Conclusions
In this chapter, an end-to-end approach for predicting stable grasps is proposed. Raw visual, tactile, and object intrinsic information is used, with the tactile sensors providing detailed information about contacts, forces, and compliance. More than 2500 grasps are collected autonomously, and a multimodal deep neural network model is proposed for predicting grasp stability from the different modalities. The results show that the visual-tactile fusion method improves the ability to predict grasp outcomes. To further validate the method, real-world evaluations of the different models in active grasping are carried out. Our experimental results demonstrate the superiority of the proposed method.