Open access peer-reviewed chapter

Path Planning in Rough Terrain Using Neural Network Memory

By Nancy Arana-Daniel, Roberto Valencia-Murillo, Alma Y. Alanís, Carlos Villaseñor and Carlos López-Franco

Submitted: May 23rd 2017Reviewed: October 5th 2017Published: December 20th 2017

DOI: 10.5772/intechopen.71486

Downloaded: 721


Learning navigation policies in an unstructured terrain is a complex task. The Learning to Search (LEARCH) algorithm constructs cost functions that map environmental features to a certain cost for traversing a patch of terrain. These features are abstractions of the environment, in which trees, vegetation, slopes, water and rocks can be found, and the traversal costs are scalar values that represent the difficulty for a robot to cross given the patches of terrain. However, LEARCH tends to forget knowledge after new policies are learned. The study demonstrates that reinforcement learning and long-short-term memory (LSTM) neural networks can be used to provide a memory for LEARCH. Further, they allow the navigation agent to recognize hidden states of the state space it navigates. This new approach allows the knowledge learned in the previous training to be used to navigate new environments and, also, for retraining. Herein, navigation episodes are designed to confirm the memory, learning policy and hidden-state recognition capabilities, acquired by the navigation agent through the use of LSTM.


  • robot navigation
  • learning to search
  • reinforcement learning
  • LSTM
  • unstructured terrain
  • rough terrain
  • cost function

1. Introduction

Autonomous robot navigation in unstructured terrain allows a robot to move through an environment for which the selection of traversable terrain is not a deterministic decision [1]. A mobile robot must make decisions of when to traverse patches of terrain that could be dangerous or that might consume too many resources.

Several approaches have been developed in order to solve this problem; some of them are focused on classifying traversable terrain [2, 3], and others in coupling the perceptual and planning systems of the robot using functions that map features of the environment to scalar values that represent the traversability of the terrain [1, 4, 5]. These works try to resolve the task of autonomous robot navigation in unstructured terrain; however, these approaches do not address the issue as an integrated system.

In most cases, a human expert provides information for cost map constructions heuristically. In other cases, this information is then used to construct a cost function that automatically maps features to costs; however, note that this cost function is established heuristically. The problem with this methodology is that the traversability of a given feature for a robot is difficult to quantify. In contrast, humans can determine traversal trajectories relatively easily.

An alternative to establishing cost functions is to use an algorithm that automatically constructs and tunes a cost function. Learning to Search (LEARCH) [1] is an algorithm that uses learning from demonstration in order to construct a cost function. In this approach, a human expert exhibits a desirable behavior (a sample path) over certain terrain; then, LEARCH adjusts a cost function in order to match the behavior exhibited by the expert.

LEARCH has the advantage that the cost function to be constructed can be chosen from among linear functions, parametric functions, neural networks and decision trees, to name a few [1]. In particular, neural networks and other learning machines such as support vector machines (SVM) have exhibited considerable generalization capability in different scenarios [6].

However, as will be demonstrated in this paper, the LEARCH generalization capability decreases as the number of sample paths increases; this is because the error decays over time during training.

In this study, this problem with the LEARCH algorithm is addressed using a long-short-term memory (LSTM) neural network and reinforcement learning (RL). Recurrent neural networks are capable of finding hidden states, as shown in [7]. Therefore, we propose a complex learning system that allows a navigation agent to learn navigation policies and determine complex traversability cost functions. Furthermore, this system can retain the knowledge learned in past navigation episodes in memory and generalize this knowledge for use in new episodes.


2. Learning to Search

This section presents an overview of the LEARCH algorithm, along with the results of generalization tests conducted using LEARCH. The objective is to explain the need for memory for this algorithm in order to improve its performance.

The LEARCH algorithm is based on the concept of inverse optimal control [8], which addresses the problem of finding a cost map such that a known trajectory through an environment is optimally navigated using this map. In addition, non-linear maximum margin planning [1] with the support vector regression machine [9] is used to learn behavior from an expert, that is, human expert.

Let Sbe a state space operated by a path planner. Fis a feature space defined over S. Then, for every x ∈ Sa corresponding feature vector Fx ∈ Sexists. The Fxvectors are inputs for the cost function C, which maps Fto scalar values. Cis defined as the weighted sums of functions Ri ∈  R, where Ris a space of limited complexity that maps from the feature space to a scalar [1].

We define a path Pas a sequence of states x ∈ Sthat lead from the start spoint to the goal g. The cost of each state is C(Fx); thus, the cost of the entire path is defined as


Consider a path provided by an expert, i.e. sample path Pe, which runs from a start state seto a goal state ge. In order to learn from the expert demonstration, a cost function such that Peis the optimal path from seto geis required. This task can be expressed as the following optimization problem [1]:


where λis a term that scales the regularization term REG(C). P̂is a path computed by a planner over the cost space, and Leis a loss function that encodes the similarity between paths. The latter is defined as


The sub-gradient is used to minimize O[C]. In the cost function space, the sub-gradient is


where δis the Dirac delta and P is the optimal path for the actual cost map.

In order to avoid overfitting, a cost function space is considered. Cis now defined as the space of weighted sums of functions Ri ∈  R, where Ris a space of functions of limited complexity, which maps from the feature space to a scalar. The possible choices for Rinclude linear functions, parametric functions, neural networks and decision trees. Thus,


The functional gradient is projected onto the direction set by finding the element Ri ∈  Rthat maximizes the inner product 〈−∇OF[C], R〉. This maximization can be regarded as a learning problem. Here,




As in [1] the projection of the functional gradient can be regarded as a weighted classification problem. It can be seen that the regression targets yxare positive in regions of the feature space for which the planned path visits more than the sample path and negative in the opposite case. Here, this approach is viewed as minimizing the error induced by visiting states that are not in the sample path. Then, the visitation count Uis the cumulative count of the number of states x ∈ Psuch that Fx = F. The visitation counts can be split into positive and negative components, depending on whether they correspond to the current planned path or the sample path:


Ignoring the regularization term of Eq. (4), the regression targets and weights can be computed as functions of the visitation counts. Then, the regressor targets can be obtained using these visitation counts. Further, with this regressor, the cost function can be expressed as


where j = {1, 2, 3, …, n}, with nbeing the number of iterations; Ris the regressor; and ηis the learning rate.

2.1. Learning to Search experiments

This section describes the experiment conducted to test the LEARCH generalization capabilities. Satellite-like images were selected for feature extraction. Further, patches of terrain were divided into grid cells, with a vector being created for each cell. These vectors represented a value for each of the following features of the environment in each dimension: the vegetation density, slope, the presence of gravel or rocks and the presence of water; each scalar value represents an abstraction of the feature; for example, the scalar value for vegetation represents its density, a patch of terrain with grass would be represented with a low value, and a patch of terrain with a tree would be represented with a high value of vegetation. In these experiments vectors of dimension 4 were used, that is, for the patch of terrain with grass, the vector [0, 2, 0, 0] would be its representation. A human expert-traced sample paths over the terrain, as shown in Figure 1.

Figure 1.

Left: Satellite like image. Right: Grid cells and lines with different colours representing sample paths.

With this information a support vector regressor (SVR) was used to learn the cost function Riof Eq. (5). Note that, after LEARCH is executed, the trained SVR can map the features of the terrain directly into traversal costs. Figure 2 shows a diagram of the algorithm used for training.

Figure 2.

LEARCH diagram. F is the feature map,Mis the cost map,seandgerepresent the start and the goal points, respectively,P* is the optimal path,Peis the sample path,Tis the vector of regressor targets values for training a regressorR,C(F is the cost function andj= {1,2,3,?,n}, wherenis the number of iterations needed to train the cost function.

The procedure is explained as follows:

  • A cost map Mis constructed using a feature map and a cost function (C(F)).

  • The path planner D computes an optimal path P from start point seto the goal point geover M.

  • Using the sample path from the expert path Peand P, the vector Uindicating the visitation counts is constructed as shown in Eq. (8).

  • Using U, the regressor targets are computed, and this regressor is trained.

  • The cost function is updated using Eq. (9).

  • This process is repeated until Peand P are equal.

The traversal costs can be color coded for demonstration purposes, as shown in Figure 3, where values near 20 represent the patches with the highest crossing difficulty and those near zero represent terrain that is easy to traverse.

Figure 3.

Example of a cost map which belongs to the real map shown inFigure 1.

In this experiment, in order to prove the generalization capabilities of LEARCH (i.e. its ability to use policies learned in past navigation episodes during new episodes), we employed the following methodology. First, we trained the LEARCH system described in Figure 1 using an initial map and one sample path. Then, we incrementally added to this learned knowledge (using the same initial map) by incorporating more paths to be learned by LEARCH (one by one). The results of these experiments are shown in Figure 4, where the image (a) of the figure shows the cost map obtained with a cost function trained using one sample path and the image (b) shows the results obtained by adding a path to the training, and so on until five sample paths are used. The image (f) of Figure 4 shows the cost map obtained with a cost function trained using eight sample paths.

Figure 4.

Costs maps obtained using different numbers of paths of terrain.

From the cost map (e) of Figure 4, in comparison with the cost map (f), it is apparent that information from the environment is missing after the cost function is trained with more sample paths. That is, some states are no longer recognized as states with a high traversal cost. It is important to note that the costs that are most affected are those furthest from the sample paths, in comparison with the costs of the corresponding states on the original path; therefore, the generalization capability of the LEARCH system is very poor. This problem renders the task of finding the optimal path difficult. In addition, the path planner could compute a path that traverses dangerous terrain. Further, note that, the use of only a few sample paths is not a solution to the problem of obtaining a system with knowledge of a greater number of area costs than those attached to the sample paths. This is because such sample paths cannot contain all the information necessary for a good and complete representation of the environment.

In this study, other experiments to prove the limitations of LEARCH were performed, in which we trained the system using nonrepresentative environment paths. That is, the paths taught by the expert traversed many cells of the environment that did not contain sufficient representative features of the environment or cells that did not have significant differences in cost. Figure 5 shows examples of these paths, which allowed the LEARCH system to acquire nonrepresentative knowledge that was then generalized over the cost map. The cost map at the left of Figure 5 is less generalized compared with the more descriptive costs shown on the map at the right of Figure 5.

Figure 5.

Left: cost map computed with five representative paths. Right: cost map computed with five non representative paths.

Therefore, in order to address the problems with the LEARCH system, we propose the use of an LSTM as part of the system. Inclusion of an LSTM allows the navigation agent to learn navigation policies and complex traversability cost functions and, furthermore, to retain memory of the knowledge learned in the past navigation episodes for reuse during new episodes. The latter capability allows expensive retraining to be avoided when the navigation environment is similar to those already explored by the agent and allows hidden states of the extremely large state space represented by a nonstructured or rough terrain to be recognized. We present the LSTM in the next section.


3. Long-short-term memory neural network

LSTM is a recurrent neural network architecture originally designed for supervised time-series learning. It addresses the problem that errors propagated back in time tend to vanish in multilayer neural networks (MLPs). Enforcing a constant error flow in constant error carousels (CEC) is a solution for vanishing errors [7].

These CECs are processing units having linear activation functions that do not decay over time. CECs can become filled with useless information if access to them is not regulated; therefore, specialized multiplicative units called input gates regulate access to the CECs. Further, their access to activation of other network units is regulated by multiplicative units called output gates. In addition, forget gates are added to CECs in order to reset information that is no longer useful. A combination of a CEC and its input, output and forget gates is called a memory cell (Figure 6).

Figure 6.

Graphic representation of a memory cell.

The activation updates at each time step tin this type of neural network are computed as follows. For the hidden unit activation yh, the output unit activation yk, the input gate activation yin, the output gate activation yout and the forget gate activation yφ, we have


where wimis the weight of the connection from unit mto unit i. For the activation function fi, the standard logistic sigmoid function for all units is chosen, except for output units, for which it is the identity function [7]. The CEC activation, also known as memory cell state, is calculated using


where gis a logistic sigmoid function scaled to the [−2, 2] range and scjv0. Finally, the activation update for the memory cell output is calculated from


The learning process implemented for LSTM in this paper is a variation of real-time recurrent learning (RTRL), as described in Ref. [7] which is a variation of [9]. In this variant, when the error arrives at a cell, it stops propagation further back in time. However, the error is used to update incoming weights when it leaves the memory cell through the input gate.


4. Reinforcement learning y long-short-term memory neural network

In order to teach the LSTM to navigate an unstructured terrain, RL was implemented as described in Ref. [7]. In this approach, an LSTM approximates the value function Vof the RL algorithm, which teaches a robotic agent how to navigate a T-shaped maze environment.

This problem is a partially observable Markov decision process, in which the agent is unaware of the full state of the environment and must infer this information using current observations. In this study, these observations are the same feature vectors of the environment that were used for previous LEARCH experiments, and these vectors are the input for the LSTM.

The LSTM outputs represent the advantage values A(s, a) of each action, where ais the action taken in state s. They are used to compute the value of the state Vs=maxaAsa, which represents the action with the higher advantage value.

To perform weight updates, truncated backpropagation through time was implemented with RL. A function approximator’s prediction error at time step t, ETD(t), is computed using Eq. (13) and is propagated one step back in time through all the units of the network, except for the CECs, for which the error is backpropagated for an indefinite amount of time [7]. Thus,


where ris the immediate reward, γis a discount factor in the [0, 1] range and kscales the difference between the values of the optimal and suboptimal actions. It is worth mentioning that only the output associated with the executed action receives the error signal.

During the learning process, the agent can explore the environment using the state values; however, directed exploration (i.e. exploration for which a predictor is used to direct the exploration stage, so as to avoid clueless exploration of the entire state space) is important in order to learn complex terrain navigation. When an undirected exploration is conducted, RL tries every action in the same way over all states; however, in unstructured terrain, some states provide ambiguous information about the environment rendering it difficult for the agent to determine the state of the environment. Other states provide clear information; therefore, the agent must direct its exploration to discover the ambiguous states. In order to explore the environment, an MLP was implemented for directed exploration. This MLP input was the same as the LSTM, and the MLP objective was to predict the absolute value of the current temporal difference error, i.e. ETD(t). This aided prediction of which observations were associated with a larger error. The desired MLP output was obtained using Eq. (14), and backpropagation was employed to train the MLP:


The MLP output yv(t) is used as the temperature of the Boltzmann action selection rule, which has the form


where nis the number of actions available to the agent.

The complete learning process and the manner in which the LEARCH and RL-LSTM systems are connected is shown in Figure 7. The entire process occurs offline. First, the LEARCH algorithm iterates until the required cost map Mis obtained. Then, the RL-LSTM algorithm begins the process of training the LSTM using the costs converted into rewards r. The feature map Fis obtained from the robotic agent and used by both systems.

Figure 7.

LEARCH-RL-LSTM system showing the manner in which the two systems are connected to train the LSTM. The entire process occurs offline. First, the LEARCH algorithm iterates until the required cost mapMis obtained. Then, the RL-LSTM algorithm begins the process of training the LSTM using the costs converted into rewardsr. The feature mapFis obtained from the robotic agent and used by both systems.

In order to prove the generalization and long-term memory capabilities of LSTM, training was performed using patches of terrain containing representative features of rough terrain. That is, an entire map is not used to train the LSTM (Figure 8). In this way, an efficient training phase is achieved by taking advantage of the above-mentioned capabilities. In the next section, we show the results of the experiments conducted to confirm these capabilities. In addition, we prove the efficacy of the LSTM for mapping tasks that require inference of hidden states, i.e. smoothing or noise recognition.

Figure 8.

(a) Example of a real environment modelled as a grid map. (b) Patches of terrain used for training are marked with an orange box.

The LEARCH algorithm builds a cost function; however, as noted in Section 2, the cost function capability for generalization is limited and decays as the number of training paths grows. As the motivation for employing a cost function is to obtain the cost of traversing a patch of terrain so that the path planning system can compute the optimal path with the minimal traversal cost, we propose the extraction of terrain patches having descriptive characteristics of rough terrain for navigation. Hence, the traversal costs for these environment features can be determined using LEARCH, and the costs can be transformed to rewards for a RL algorithm [10].


5. Results

In this section, the results of the experimental tests are presented. Five environments were designed for navigation policy learning using the LEARCH-RL-LSTM system shown in Figure 7. Here, each environment was modeled as a grid, and each model was referred to as a map. Each map was a grid having dimensions of 20 × 20 cells. Further, each cell represented a patch of terrain, and this patch was represented by a vector of dimension 4, where each dimension was a scalar value representing the vegetation density, terrain slope, rock size or the presence of water. Figure 9 shows in the lower right corner the color code used to illustrate the manner in which this environment was designed. Each environment differed by 5% from the previous one, i.e. 20% of the states in map 5 differed from those of map 1. These maps are shown in Figure 9.

Figure 9.

Maps used in experiments. Lower right corner of the second row: colour code used to represent environment features.

Table 1 lists the results of experiments conducted using the LEARCH system alone to learn the navigation policies and cost functions of the five maps. In order to test the capability of LEARCH to reuse knowledge learned in previous navigation episodes, the following process was employed. Once LEARCH learned the navigation policies and cost function of map 1, this knowledge was used as initial knowledge to start navigation episodes involving the remaining maps. As is apparent from the first row of Table 2, it was not necessary to retrain the LEARCH row shows, and it was not necessary to retrain the LEARCH system to learn the demonstrated behavior and cost function of map 2. In other words, the LEARCH system could apply the knowledge learned from map 1 to map 2. However, this behavior did not occur for the other maps. For maps 3, 4 and 5, and when attempting to reuse the knowledge learned from map 1, it was necessary to retrain the LEARCH system. In these learning episodes, an increased number of iterations were necessary in order to acquire the new knowledge (as is apparent when Table 1 is compared with row one on Table 2, it can be concluded that the previous knowledge learned using map 1 is even detrimental to the system performance when new maps are processed, even if the new maps are very similar to map 1. Therefore, the LEARCH system was shown to have a very poor generalization capability.

EnvironmentMap 1Map 2Map 3Map 4Map 5

Table 1.

Iterations needed to learn demonstrated behavior using LEARCH system.

EnvironmentMap 2Map 3Map 4Map 5
Iterations LEARCH0547
Iterations LEARCH-RL-LSTM0000

Table 2.

Iterations needed to learn demonstrated behavior using knowledge of map 1, for LEARCH and LEARCH-RL-LSTM system.

When RL-LSTM was integrated with LEARCH to improve the capability for reusing knowledge learned from previous navigation episodes, there was no need to retrain the system. This is apparent from the second row of Table 2, where all the demonstrated behavior for maps 2 to 5 could be learned using the knowledge learned from map 1 only. It is important to note that, although the results were obtained from relatively small maps, it was necessary to retrain the cost function using LEARCH in each of these cases. Further, when LEARCH-RL-LSTM was employed, retraining was unnecessary when the patches of terrain were similar, because this system can generalize knowledge from previous navigation episodes. Note that, when the agent navigates in real time, even small retraining episodes are computationally expensive. Further, the agent is required to stop navigating until the retraining episode ends. However, for LEARCH-RL-LSTM, retraining is unnecessary when the environment is similar to those already known from previous navigation episodes.

Another set of environment maps was also used to test both algorithms. For these new environments, features that were not observed in previous scenarios were included. In this experiment, map 6 was the base of knowledge, and two new features were included in maps 7, 8 and 9. The states differed in the same way as in the previous experiment, with 5% of the states in each map being different from those of the previous maps. However, these differences included new features in order to simulate a dynamic environment, i.e. we simulate that the terrain of the map 6 suddenly changed when the agent navigates again on this map introducing new features on some cells of the grid of map 6. The maps used for this experiment are shown in Figure 10.

Figure 10.

Maps used in second set of experiments to simulate dynamic environments. Lower right corner of the second row: sample of a map with the patch of terrain used for retraining marked by an orange box.

The LSTM used in these experiments was trained offline. During agent navigation, an efficient training episode was only executed if necessary, i.e. only if the action that LSTM learned to take is dangerous for the agent. These training episodes were efficient, because only a fraction of the environment was used (such as the patch of terrain shown in the lower right corner of Figure 10 each time the robot encountered a new state or required navigation assistance.

5.1. Noise tests

In the previous experiments, we assumed that the agent could infer the current state of the environment model based on the features observed by the agent. However, in a real scenario, the agent must infer the actual state via a perceptual system based on data obtained through noisy sensors such as cameras, a Global Positioning System (GPS) or LiDAR. In outdoor environments, two states (patches of terrain) can be very similar; however, the same action in these similar states could lead to different resultant actions. In case of noisy signals, one state could be interpreted as another similar state or, alternatively, as a new state that is not explicitly represented in the environment model, i.e. a hidden state.

To test these two systems in more realistic environment, a noise signal was induced to the inputs of both systems. A real uniform distribution bounded to a maximum of [−1, 1] (20% of noise) was used. Then, several runs of each system were conducted with an initial limit of [−0.1, 0.1] (2% of noise) and increments of [−0.1, 0.1] in the noise signal, until the maximum limits where both systems failed to infer the real state for the agent were determined. Tables 3 and 4 show the test results for both systems with noise; the noise range values are the maximum limits of the noise supported by the system using that map. Note that the results of the maps 6–9 yielded by the LEARCH system are omitted, because this system could not reproduce the desired behavior on these maps under the supplied noise levels.

EnvironmentMap 1Map 2Map 3Map 4Map 5
Noise-supported LEARCH2%2%2%2%6%
Noise-supported LEARCH-RL-LSTM10%8%10%8%12%

Table 3.

Maximum noise supported by both systems in tests where the desired behavior could be reproduced with maps 1–5.

EnvironmentMap 6Map 7Map 8Map 9
Noise-supported LEARCH0%0%0%0%
Noise-supported LEARCH-RL-LSTM10%8%8%8%

Table 4.

Maximum noise supported by both systems in tests where the desired behavior could be reproduced with maps 6–9.


6. Conclusion

LEARCH is an efficient method for learning a cost function that maps environment features to traversal costs and can then be used to navigate an unstructured terrain. However, as demonstrated by the experiments conducted in this work, this algorithm is incapable of reusing knowledge in an efficient manner. Indeed, zero knowledge is sometimes preferable to reusing previously learned knowledge.

We concluded that LEARCH cannot reuse knowledge because of a lack of memory; because of this lack of memory, the cost function cannot correlate knowledge learned in earlier training episodes with the new information provided by new environments; therefore, an LSTM was proposed. The LSTM can relate knowledge using memory cells, and this knowledge can be used to manage dynamic environments. This performance was demonstrated in experiment, where a dynamic environment was simulated through addition of new features that were not included in previous training episodes.

In addition, we implemented these two approaches to manage real scenarios in which noisy signals were present. The experiments showed that LEARCH-RL-LSTM can reproduce the desired behavior and navigate through the environment.

© 2017 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution 3.0 License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

How to cite and reference

Link to this chapter Copy to clipboard

Cite this chapter Copy to clipboard

Nancy Arana-Daniel, Roberto Valencia-Murillo, Alma Y. Alanís, Carlos Villaseñor and Carlos López-Franco (December 20th 2017). Path Planning in Rough Terrain Using Neural Network Memory, Advanced Path Planning for Mobile Entities, Rastislav Róka, IntechOpen, DOI: 10.5772/intechopen.71486. Available from:

chapter statistics

721total chapter downloads

More statistics for editors and authors

Login to your personal dashboard for more detailed statistics on your publications.

Access personal reporting

Related Content

This Book

Next chapter

Path Planning Based on Parametric Curves

By Lucía Hilario Pérez, Marta Covadonga Mora Aguilar, Nicolás Montés Sánchez and Antonio Falcó Montesinos

Related Book

First chapter

Software‐Defined Optical Networking (SDON): Principles and Applications

By Yongli Zhao, Yuqiao Wang, Wei Wang and Xiaosong Yu

We are IntechOpen, the world's leading publisher of Open Access books. Built by scientists, for scientists. Our readership spans scientists, professors, researchers, librarians, and students, as well as business professionals. We share our knowledge and peer-reveiwed research papers with libraries, scientific and engineering societies, and also work with corporate R&D departments and government entities.

More About Us