Velocity Planning via Model-Based Reinforcement Learning: Demonstrating Results on PILCO for One-Dimensional Linear Motion with Bounded Acceleration

Hsuan-Cheng Liao; Han-Jung Chou; Jing-Sin Liu

doi:10.5772/intechopen.103690

Abstract

The time-optimal control problem (TOCP) has faced new practical challenges, such as those from the deployment of agile autonomous vehicles in diverse uncertain operating conditions without accurate system calibration. In this study to meet a need to generate feasible speed profiles in the face of uncertainty, we exploit and implement probabilistic inference for learning control (PILCO), an existing sample-efficient model-based reinforcement learning (MBRL) framework for policy search, to a case study of TOCP for a vehicle that was modeled as a constant input-constrained double integrator with uncertain inertia subject to uncertain viscous friction. Our approach integrates learning, planning, and control to construct a generalizable approach that requires minimal assumptions (especially regarding external disturbances and the parametric dynamics model of the system) for solving TOCP approximately as the perturbed solutions close to time-optimality. Within PILCO, a Gaussian Radial basis functions is implemented to generate control-constrained rest-to-rest near time-optimal vehicle motion on a linear track from scratch with data-efficiency in a direct way. We briefly introduce the importance of the applications of PILCO and discuss the learning results that PILCO would actually converge to the analytical solution in this TOCP. Furthermore, we execute a simulation and a sim2real experiment to validate the suitability of PILCO for TOCP by comparing with the analytical solution.

Keywords

model-based reinforcement learning (MBRL)
applied reinforcement learning
time-optimal control problem (TOCP)
velocity learning
vehicle control

Author Information

Show +

Hsuan-Cheng Liao
- Institute of Information Science, Taipei, Taiwan, ROC
Han-Jung Chou
- Institute of Information Science, Taipei, Taiwan, ROC
Jing-Sin Liu*
- Institute of Information Science, Taipei, Taiwan, ROC

*Address all correspondence to: liu@iis.sinica.edu.tw

1. Introduction

Optimal control–based approaches have played key roles in trajectory planning or replanning and in optimization of control inputs with numerous applications, such as robotics and autonomous driving. These approaches have recently been used in autonomous systems. Optimal control formulations typically require accurate knowledge of the dynamics and can account for a more general set of constraints and objectives (performance measures) [1, 2] relative to other approaches. Many cost functions, such as those for time and energy, are used to define the desired behavior of the controlled system. As part of an effort to enhance working efficiency and productivity by having tasks be completed as fast as possible, especially for tasks involving repetitive state-to-state transfer in trajectory execution, Time-Optimal Control Problem (TOCP), for which the objective function is the terminal time, has been extensively studied. For TOCP, control bounds, different boundary conditions or paths with different curvature profiles and lengths, and various choices of the physical parameters such as mass, friction, can produce different velocity solutions. As a result, different maximum velocities and different travel times are produced. It was studied first on articulated robot manipulators for industrial and aerospace applications [3, 4, 5, 6, 7].

In recent years, the deployment of agile autonomous systems such as autonomous driving vehicles [8, 9], mobile robots [1, 10, 11] and new robot platforms such as humanoid robots [12] and unmanned aerial vehicles (uavs) have posed new increasingly important challenges to TOCP. Furthermore, a crucial aspect of traditionally solving TOCP is that analytical or learned time-optimal velocity solution are both platform and path (task) dependent, therefore, model-based and goal-directed. These challenges lie in several aspects including nonlinear, nonconvex, multi-dimensional state and control spaces, as well as various platform dependent constraints, and most importantly uncertainties of the environment in real world problems and make the model only able to approximate the reality. Therefore, computational solutions to TOCP based on an not so accurate model are not reliable and not practical for online applications. By contrast, learning-based control algorithms for dynamical systems learn to generate the desired system behavior without any complicated system formalism or predefined controller parameters a priori and thereby achieve more generalization and platform independency. One promising approach in the context of intelligent planning and control involves the use of reinforcement learning (RL) [13, 14] for learned behaviors, which can be viewed as a class of optimal control resolutions. The effectiveness and performance of RL is of task instance and platform specific, i.e. as a function of the transition and reward functions induced by the evaluated policy and the system dynamics. Therefore, to address the practical challenges in facilitating RL algorithms for a wide range of real-world decision-making problems such as the autonomous vehicles in the context of diverse driving scenarios, it is generally believed that only by proposing specific applications of RL on concrete cases can better demonstrate related issues and in which algorithm works well for a specific task instance [15].

With this aim for study, in this paper, the learning goal is to recover a near time-optimal rest-to-rest one-dimensional linear motion on a double integrator with embedded uncertainties (frictions of motion) and constant control constraint. We assume no prior knowledge of any parametric system dynamics model for deriving the optimality conditions. In addition, the characteristics of the learning task is that the vehicle mass is uncertain and the environment characteristics such as friction is unknown, and both parameters affect the maximum speed at each position along the path. It is worth mentioning that any single-input controllable second-order system is feedback equivalent to a double integrator. Thus, we use a simplified but still general, precise enough vehicle model, damped double integrator, as the foundation and demonstration for TOCP for more complicated, high-dimensional nonlinear vehicle model. The analytical solution of TOCP to double integrator subject to constant acceleration bound is known of a second-order ODE solution with an assumption by bang-bang control.

In the scope of this paper, our MBRL-based approach to recover the time-optimal motion of double integrator features a two-stage process for integrating learning, planning and control. In the first stage, we employ a Model-Based Reinforcement Learning (MBRL) framework, Probabilistic Inference for Learning Control (PILCO) [16], to generate a control-constrained rest-to-rest near time-optimal motion from scratch. The analytical solution is used as a baseline for effectively assessing the learning results. Because Monte-Carlo updates of the parametrized policy renders it difficult to incorporate velocity limits within an instance of locomotion, hence, as a second stage, we apply rescaling to give a speed profile respecting the additional velocity limit.

The outcome is then a time-optimal velocity profile under both velocity and control constraints, with no prior required knowledge and replanning. Both simulation and sim-to-real experiments are conducted and confirm our approach applicable.

Our main contributions include:

A novel application of a model-based reinforcement learning algorithm, PILCO, serving as adaptive optimal control, learns from scratch the time-optimal control of double-integrator vehicle model in the presence of uncertainties. The near time-optimality and data-efficiency is observed in light of the simulation and sim2real experimental validation we present.
A distinct feature of this work compared to the majority of the related literature [17, 18, 19] is that the learning results are evaluated and verified effectively by the analytical time-optimal motion, instead of based on a lot of test scenarios. The results, which are valid for line following of a single input controllable second-order uncertain system, allow interpretations in terms of integrated learning, planning, and control.

The remainder of the article is structured as follows. In Section 2, we summarize the approaches to the TOCP, specifically conventional and RL approaches. In Section 3, we outline the key elements of the PILCO algorithm, which are dynamics modeling, trajectory prediction, policy evaluation, and policy improvement. In Section 4, we present our simulation results on time-optimal velocity learning for an autonomous vehicle with double integrator dynamics whose analytical solution to TOCP is derived in Appendix A as the verification baseline. A sim2real experimental validation on a low-cost car along with discussions in Section 5 is provided. Finally, Section 6 provides conclusions.

2. Related work

The aim of time-optimal vehicle control is to control a vehicle such that it reaches a target state as quickly as possible (e.g., in racing or emergencies). Minimal-time velocity profile along a prespecified curve, as a subclass of TOCP subject to hard control constraints resulting from input saturation, state constraints and external disturbance, is nowadays applied to a variety of modern autonomous systems such as autonomous driving, uav and robotics. The existence of time-optimal trajectories is guaranteed by Pontryagin Maximum Principle (PMP) [20] for a vehicle with explicit dynamics model in state-space form [21]. The derivations of the optimal control and state trajectories are generally computationally expensive; computationally cheaper yet accurate methods are required, especially considering the need for rapid (even real-time) computation in most real-world industrial or engineering systems in response to changes in operating conditions and the environment. In this section, we briefly present the most common numerical approaches for systems with known dynamics model developed in robotics and autonomous vehicles; we then discuss RL-based approaches that handle uncertainty.

2.1 Approaches to the TOCP with known dynamics

Solutions to the TOCP can be categorized as complete or decoupled approaches. In the complete approach, where the aim is to solve challenging problems with general vehicle dynamics and constraints in their entirety, the optimal state and input trajectories are simultaneously determined; some direct or indirect transcription methods have been developed as part of this approach for trajectory optimization [2, 22, 23] and played an important role in their numerical performance of trajectories. By contrast, in the decoupled approach (e.g., the path velocity decomposition approach and path-constrained trajectory generation approach), the trajectory generation is decomposed into two subproblems for the path geometry to be decoupled from the velocity along the geometric path. In the decoupled approach, the first step is planning a geometric path for connecting two states (configurations or poses) in adherence to geometric constraints, such as obstacle avoidance or smoothness requirements, and the second step is designing a time-scaling function (that represents either information on timing or the velocity and acceleration) along the planned state-to-state-transfer geometric path. This approach results in a one-parameter family of velocity profiles, i.e. the parametrization of the vehicle-position-dependent velocity along the path as a function of the single path parameter of the arc length. The velocity and acceleration of the vehicle on each position of the path can be altered by the design of the time-scaling function respecting smoothness requirement (such as small jerk) and fixed boundary conditions (such as the initial and target positions and velocities are precisely specified) and the kinodynamic constraints constraints. A fair amount of literature on maximizing the speeds along the path with the acceleration, torque, jerk (or torque/acceleration derivative) constraints [3, 8, 10, 11]. In general, a model predictive control (MPC) framework can be used in the decoupled approach to generate the safe velocity profile and the input commands for following a given planned geometric path in terms of known system dynamics [24]. The following three methods have been commonly used for TOCP.

2.1.1 Hamilton-Jacobi-bellman (HJB) equation

A popular approach to obtain time-optimal motion for a system with known dynamic model and fixed boundary conditions under the safety and kinodynamic constraints of a vehicle is via optimal control or model predictive control formulations. This approach requires the derivation of optimality conditions for the state trajectories and control policies based on PMP or Dynamic Programming Principle (DPP) [20]. This yields the Two-point Boundary Value Problem of HJB partial differential equations with initial condition on the state and final condition on the costate for time-optimization of trajectories. The advantage of generality is that more general state and input constraints and objective functions can be taken into account at the cost of heavy numerical burden. For example, time-optimality can be traded off against energy to yield less aggressive control to steer the vehicles slower but smoother. Additionally, HJB equation approach is a practically useful approach in that many numerical solvers of HJB equations are available.

2.1.2 Convex optimization (CO)

The Hamiltonian of TOCP for robotic manipulator is shown to be convex with respect to the control input. TOCP is transformed into a convex optimization problem with a single state through a nonlinear change of variables [4], where the acceleration and velocity at discretized locations on the path are the optimization variables. Then, followed by [5] the work is further extended to meet speed dependent requirements. Such approach is simple and robust thanks to the existing convex optimization libraries, yet only convex objective functions can be concerned. However, the convex optimization program contains a large number of variables and inequality constraints, making it slow and less suitable for real-time applications.

2.1.3 Numerical integration (NI)

Since the vehicle velocity highly depends on the path to be followed, the universally applicable decoupled approach splits the motion planning problems to two sub-problems of finding a geometric path and planning the velocity at each position of the vehicle along the path to manage the computational complexity to generate a suboptimal motion trajectory. This result in a one-parameter family of velocity profiles, or the velocity (bound) along the path depending on the vehicle position on the path is parametrized as a function of the single path parameter (or a scalar curvilinear abscissa coordinate) s, usually the arc length. By description of the dynamics and constraints along the path to be followed on the sṡ phase plane, then this method generates the velocity limit curve on the phase plane from the velocity and acceleration bounds. The travel time is determined by the path velocity ṡ along the path or the time scaling function st that is solved by optimization tools to meet the imposed constraints. The generation of minimum-time velocity profile along the given path is greatly simplified to the determination of switching structure in the phase plane [12]. Essentially, NI searches for switching points on the phase plane and establishes the velocity profile by integrating forward with the acceleration limits and backward with the deceleration limits pivoting from those points.

2.2 Reinforcement learning (RL)

RL refers to the learning of a policy, defined as a mapping from state space to action space, by means of maximizing a reward the agent receives from the environments it operates and interacts. When the system dynamics is unknown, with the dynamical system modeled as a reward-maximizing RL agent and the desired behavior expressed as a reward function, the system can be trained to automatically execute an optimal sequence of actions (trajectories) under the present environmental conditions to complete a given mission. The RL framework reformulates the problem as a Markov Decision Process for the autonomous agent, which maximizes the long-term rewards and does not necessarily need the transition dynamics beforehand. RL offers a diverse set of model-based and model-free algorithms to improve the performance of RL agent based on the reward it received. The mission may involve integrated planning and control with time and computational resource budget in real-time applications, a system model therefore is a good informative basis for predicting the behaviors and enhancing performance, instead of tuning the behaviors frequently and manually in practical situations. However, it is often infeasible to derive a precise, analytical model that is provably correct for the actual system within regions and time horizon that the model is valid, since the existence of parametric uncertainties, unmodelled dynamics, external disturbance in perception and agent-environment interaction is inevitable. To ameliorate the problems from these interwoven factors that affect planning and control performance, researchers in the field of modeling and control have been increasingly interested in model-building or model learning in a data-driven setting based on nonparametric and probabilistic models [25, 26] as a means to enhance the performance of the underlying controlled system. Model-based RL (MBRL), complementary to the control approaches such as robust and adaptive control, model predictive control and fuzzy control, is an attractive intelligent model-based control approach that integrates learning, planning and control: it learns a dynamics model and then the derived characteristics of a learned model is exploited for generating trajectories and learning the policy. While the model-free RL attracts the most scientific interest, in MBRL algorithms that employ derived or learned system dynamic models, during policy evaluation, the state evolution calculated by a predictive model under a given policy can be used to estimate the impact of the policy on the reward. Therefore MBRL is more data-efficient in the context of diverse goal-directed planning tasks since fewer interactions between the agent and the environment are required to learn a good policy faster, in contrast to some model-free approaches. A surging number of researches demonstrated that learning, planning and control of autonomous systems such as robotics and self-driving vehicles can be cast as MBRL tasks, due to the use of an accurate, reliable learned model of the agent-environment interactions as an internal simulation in task execution and the basis for any optimization and real time control for achieving highly effective control performance [13, 14, 27]. Despite its faster convergence over the model-free frameworks, MBRL suffers from model bias and accumulated model prediction errors that greatly affect the control policy learning and rewards by leveraging the model-generated trajectories characteristics. To improve model learning performance and thus policy learning performance (the controller parameters are learned in light of currently learned model), system transition modeling or model fitting techniques ranging from deterministic methods such as physics-based (first principles based) formulation, to stochastic methods are developed [15]. Among which, nonparametric regression models such as Gaussian Process (GP) that extracts the information from the sampled data with the high data efficiency to make accurate predictions in PILCO [16]. In contrast to other probabilistic models that maintain a distribution over random variables, GP builds one over the underlying functions that generate the data. Therefore, it has no prior assumption on the function mapping current states and actions to future states. The flexibility and expressiveness GP to refine the uncertainty estimate offers makes it an effective approximator for modeling the unknown system dynamics (transition function from input data to output observation or measurement) that continuously evolves over time (i.e. trajectories), and is employed in this study. Some popular GP implementations are tabulated in [28] for practitioners. The most recent open source MBRL-Lib is released to reduce the coding burden [29].

3. MBRL for time-optimal vehicle motion

The imperfect modeling of system dynamics and perception of environment with significant noise makes machine learning a viable approach to the practical, near minimum-time velocity planning for autonomous systems that do not rely on heavy dynamic model-based computations using identification techniques. In order to find the time-optimal control policy for a vehicle dynamics model with uncertainties along a predefined path, in RL setting, it is learned from limited trial driving experiences the action (sequence) to be applied at each possible state confined on the path for the uncertain system (i.e. system dynamics along the path) with the received rewards as feedback, aiming to allocate higher trajectory reward, to determine the next observed state. There are a number of choices of RL algorithms. An earlier work [17] used Q-learning for car-like vehicle motion planning. Another study [18] analyzed transfer learning in obstacle avoidance behaviors in similar environments with similar obstacle patterns, where the state of the environment is represented by the obstacle pattern. Recently, a model-free actor-critic RL algorithm was applied to time-optimal velocity planning along an arbitrary path in [19]. That study demonstrated that the incorporation of velocity computation through the exploitation of a vehicle dynamics model is practically feasible for improving learning outcomes. These studies exemplify the practical utility of RL in improving a model’s ability to control vehicle driving (encoded as a set of trajectories). These work, among others, shows that vehicle driving skill (encoded as a set of trajectories) learning via RL is promising for real autonomous driving. Along this line of work, in the present study, the learning task for vehicle control demonstrated is a one-dimensional vehicle maneuvering task. A data-driven state feedback control approach was designed through the learning of dynamics model; in such a control scheme, the optimal time-scaling function (i.e., one that makes the vehicle reach the target as quickly as possible) is recovered or approximated through a set of sampled trajectories of unknown vehicle model under the physical constraints imposed by the vehicle. The simulated vehicle model is represented by a damped double integrator whose solution to the TOCP is known if no uncertainties exist (see Appendix A) and the numerical solution for double integrator with given boundary states and bounded acceleration is solved in e.g. [23] as the trajectory optimization via barrier function, if the system is completely known. We exploited and implemented PILCO [16], a data-efficient MBRL, in a simulation and then in a real-world experiment involving a toy car. PILCO has had high performance in benchmark tasks involving low-dimensional state spaces, such as in the control of an inverted pendulum and in cart-pole swing-up; specifically, it has demonstrated unprecedented performance in modeling uncertain system dynamics and optimizing a control policy accordingly. The paper [30] contains an easy introduction of PILCO, and some of its extensions and modifications. As summarized in Algorithm 1, PILCO employs the nonparametric GP for the learning of the unknown dynamics and corresponding uncertainty estimates in a probabilistic dynamics model. PILCO finds the optimal policy parameters which minimizes the expected episodic trajectory cost based on learned probabilistic model. The core elements of the PILCO framework, including dynamics modeling, trajectory prediction, policy evaluation and policy optimization, are briefly described in this section.

Algorithm 1: PILCO [16]

1: Define parametrized policy: π:xt×θ→ut

2: Initialize policy parameters θ∈N0I randomly

3: Execute actual system and record initial data

4: repeat

5: Learn system dynamics model with GP

6: repeat

7: Predict system trajectories

8: Evaluate policy: Jπθ=∑t=0TγtExcostxtθ

9: Policy improvement: Update parameter θ by analytic gradient dJπθdθ

until convergence θ∗ to obtain π∗=πθ∗

10: Executeπ∗ on actual system and record data

11: until task completed

3.1 Dynamics modeling

In the real world, model uncertainties and model errors are inevitable in the process of modeling a dynamic system. Various methods have been formulated for modeling and learning unknown system dynamics [13, 25, 26]. In a data-driven setting, system dynamics (instead of representations by differential or difference equations) or vector fields that contain uncertainties, nonlinearities, and disturbances are represented in the form of a set of trajectories obtained from the iterated performace of a mission. Therefore, the model learning algorithm must be able to cope with the uncertainty and noise in the collected trajectory data. PILCO adopts GP probabilistic modeling and inference to learn the transition dynamics of a real-world agent), as represented by a prediction model of the true system dynamics by a probability distribution on a space of transition functions for planning (computing the desired state and control trajectories) and control learning. Therefore, PILCO effectively handles uncertainties and reduces the effect of model errors or simplification on the represented system dynamics, as derived through nontrivial mathematical and physical equations. In this respect, PILCO has eliminated the common drawback of model-based frameworks to some extent. Consider an unknown system described by

xt+1=fxtutwithxt∈RD,ut∈RFE1

In PILCO, a GP can be used to model the unknown transition function (1). The training inputs are data in the form of state–action xu pairs generated by the unknown transition function

x˜t=xtut∈RD+FE2

where ut=πxtθ with θ as policy parameters depends on a policy π:RD→RF mapping the perceived state to an action. The training target for model learning is chosen as the delta state (the difference between consecutive states) for predicting the difference between current state and next state given action:

Δt=xt+1−xt∈RDE3

In this paper, the mean and variance of the prior multivariate Gaussian distribution for f modeled as a GP are chosen to be a zero mean function and squared exponential covariance kernel function, respectively,

kx˜ix˜j=σf2exp−12x˜ix˜jTΛ−1(x˜ix˜j)E4

where the variance of the function σf2 and Λ≔diagl12l22…lD+F2 depending on the length scales are the hyperparameters. With n training samples X˜≔x˜1…x˜n and y≔Δ1…Δn, the posterior GP hyperparameters are learned through evidence maximization and describes a one-step prediction model of xt+1 for state trajectory generation from Δt and xt as follows.

posterior state distribution

pxt+1xtut=Nxt+1μt+1Σt+1E5

mean

μt+1=xt+EfΔtE6

variance

Σt+1=VarfΔtE7

where EfΔt and VarfΔt can be calculated from prior distribution. In practice, computationally tractable means and variances of GP are used as the most likely estimate for the training data and the confidence in the prediction, respectively, for further decision-making.

3.2 Deterministic trajectory prediction

For the subsequent step of policy evaluation, PILCO first predicts long term system trajectories with the learned transition dynamics, given a policy. The distribution of state xt at time t is assumed to be Gaussian with mean μt and covariance Σt where pxt∼NμtΣt. In order to predict next state xt+1, the distribution px˜t and put are needed. This is done by assuming put is Gaussian and by approximating state-control distribution px˜t by a Gaussian with correct mean and variance. The mean and covariance of the predictive control distribution put is obtained by integrating out the state from ut=πxtθ.

put=∫px˜tdxtE8

The distribution of the change in state Δt

pΔt=∬pfx˜tx˜tpx˜tdfdx˜tE9

is subsequently approximated by a Gaussian distribution with mean μΣ and variance ΣΔ through calculating of the posterior mean and covariance of pΔt by moment matching or linearization [16]. The posterior state distribution in (5–7) can then be approximated by pxt+1∼Nμt+1Σt+1 with

μt+1=μt+μΔE10

Σt+1=Σt+ΣΔ+CovxtΔt+1+CovΔt+1xtE11

CovxtΔt+1=CovxtutΣu−1CovutΔt+1E12

3.3 Policy evaluation (reward function)

The policy is evaluated with the expected return. To this end, the predictive trajectories pxtt=01…N are retrieved, given a policy, to compute the expected cumulative reward (13, 14) for policy evaluation.

Jθ=∑t=0TγtExcostxtθE13

Excostxtθ=∫costxtNxtμtΣtdxtE14

where γ is the future discount factor which determines the importance of future costs on the reward and quantifies the time after which the costs have less influence on the rewards. Note that a large γ will cause the accumulated cost (13) calculated at the end of episode to reduce more slowly in late time as time index t→N. A small γ means current reward is more important than the future rewards, thus would be good for uniform convergence. Using the currently learned model for policy evaluation is the key to data-efficiency of PILCO.

*Remark. Since the learning reward (13, 14) does not include the control regularization term (such as using L1, L2, or mixed L1-L2 norm regularization as additional sparsity-inducing cost for (13, 14) [9]), it allows gradient computation for model-based optimization described in the following subsection.

3.4 Cost function design

In general, there are quite a number of cost functions possible for the reward of RL acting as the control for uncertain system, yet the effectiveness of a specific RL algorithm depends on the applications indeed. The objective function can be multimodal to allow different skills to be learned. For example, a parametric cost function [31] can be used to switch the cost from one that follows a quadratic cost function to a time-optimal cost during learning to generate different feedback control schemes in response to different events. Kabzan et al. [32] used a progress-maximizing cost function, defined as one that ensured that the learning agent drove as far as possible within every time step. For TOCP we consider, an appropriate per-step cost at xt=xt at sampling time t is represented in (15)

costxt=1−exp−12σc2∥xt−xtarget∥22∈01E15

where xtarget is the target state and the cost width can be tuned with σc. It measures how fast the vehicle progresses on the track to reach the target (subject to tolerance, i.e. the neighborhood of the target state) in terms of exponential function of pairwise Euclidean distance within the episode horizon. The data-driven control (18, 19) for (15) is to maximize the accumulated trajectory cost (13, 14). The expected cost depending on the state, thus on the control parameter θ through the mean and covariance of ut=πxtθ, allows for analytic integration [16].

3.5 Policy optimization

With the model uncertainty handled by the GP, PILCO employs model-based policy search for planning and uses analytic gradient of cost function for optimization of policy parameter. The policy is improved episodically through the gradient information of (13): the policy update step is in the gradient directions toward high reward region of action space to search for optimal policy (best action sequence) directly. Since the cumulative reward function and transition function are differentiable with respect to the policy parameter θ, analytic gradient Jπθ/dθ with respect to the policy parameter, which depends on the policy parametrization, is available for many interesting control parametrizations, which involves several applications of chain rule [16]. Finally, an advantage of PILCO is that through the analytic expression of the cost function with respect to policy parameter, any standard gradient-based optimization method can be implemented to search directly in the policy space for the optimal policy parameter θ of high (say, thousands) dimension, which minimizes the total cost Jπθ so as to obtain desired state trajectory with higher reward.

4. Simulation results and discussions

4.1 Simulation scenario and settings

For the simulation purpose, we consider a simple aggressive driving task whose action space is one-dimensional acceleration in a given interval, and the state space is a two-dimensional space of position and velocity with the position confined on a linear track to constrain the exploration for sample efficiency. The driving scenario is visualized in Figure 1, in which the autonomous car is represented by a black box. We used a double integrator with an unknown but constant point mass to simulate the behavior of a real vehicle traveling along a straight line on flat ground with unknown but constant viscous friction. Let xt denote the distance traveled by a point mass m on a frictional ground with viscous friction coefficient c controlled by an applied control subject to symmetric constraint u∈Uad=−umaxumax, where the action space Uad is an admissible control set (a convex polytope) that overcomes static friction and avoid slipping. Defining x=xẋT as the state vector of simulated vehicle, the state equations for the vehicle dynamics can be written as

Figure 1.
Setup of one-dimensional state-to-state transfer task. The black box depicts the car, which is modeled using a point-mass double integrator. The car begins moving from the origin (green star) at rest along a straight line to reach the target (red star) along a rough plane. The task involved the execution of different acceleration control policies on the double integrator with embedded uncertainties from the same rest state to reach a target state. The resulting state–input pair and cost at each sampling time point were recorded.

ẋ=Ax+bu,x0=0E16

where

A=010−cm,b=01mE17

Since the vehicle (15) is controllable, the existence of a time-optimal control input with at most one switching to steer the vehicle from a rest state x0=0 at time zero to a neighborhood of another state xt in a given time t is ensured by controllability is ensured by PMP. In Appendix A, we provide the analytical solution of state-to-state transfer TOCP to (15) with explicit parameter dependence. The insights offered by the analytical solution for learning and its performance is that given the symmetric control bound, the control trajectory and switch time are determined by cm and c (the natural frequency and damping ratio), while the state (position and velocity) trajectory depends on cm, m explicitly and c implicitly via the control.

Settings. For concreteness, the learning scenario is illustrated in Figure 1, in which the vehicle is represented by a black box. The vehicle had a horizontal length of 30cm and a mass of 0.5kg. In addition, to mimic a real vehicle on the road, we set a symmetric bound of ±4m/s2 on the acceleration control and a friction coefficient of 0.1. The motion began from rest at the origin and proceeded along a linear track for a fixed duration Tterminal per an acceleration policy. It is desired that the target state xT=xtarget=50T for a prescribed distance L=5 with a T≤Tterminal value that is as small as possible (i.e., the policy determines the maximum ẋ for each point x on the pre-specified path and a travel time T for task completion), where Tterminal is the duration of learning. Crucially, the agent itself has no prior knowledge on the mass or friction coefficient. The simulation is conducted on an Intel Core i7-8700k and 16 GB RAM using MATLAB.

To facilitate the episodic policy search in PILCO, we set each episode to be Tterminal=4s, and each episode was further discretized into 40 time steps (i.e., Δt=0.1s for each time step). We ran 16 episodes for the agent to learn the task, and the first task was randomly initialized. The code is available at https://github.com/brianhcliao/PILCO_AlphaBot. Our choice of learning time Tterminal=4 (for one episode) in the simulations was a trade-off between computational cost (which increases if more data are collected) and performance, where overly small and large scales of Tterminal result in aggressive and relaxed learning, respectively. When each episode was run, the agent considered an episode, in which the vehicle was returned to the same initial state (rest state at t=0) after a policy in the parametric form of (18, 19) to the simulated damped double integrator model (15). In this task, both the state and input could be observed for the data to be collected. In one control cycle and for each maneuver policy, we collected a batch of trajectory data generated by (15) (the trajectory did not violate the acceleration limits) at the sampling times points of t=0.1,0.2,…,3.9,4.0 (in seconds) over the time horizon 04. Therefore, a total of 16 episodes (including an initial random-policy round) were executed, each of which had a total of N=40 samples of state–control pairs. Subsequently, a constant-size data set Di, (i=1,2,…,16) was constructed, and each of which is composed of a sequence of precisely 40 state–action pairs, where the corresponding state was visited by the vehicle and corresponding control input

Di=x1iu1i…(xNtargetiuNtargeti)…(x40ju40j)E18

u1i=πx1i−1θi−1,…,u40i=πx40i−1θi−1E19

where Ntarget≤N=40 denote the first sampling time at which the car passes through the target region. These 40 state-control pairs collected at different sampling times are correlated via state transition which maps current state and acceleration to the next state. The time-discrete system as the state transition function is obtained by using Euler method to (15) and gets the controllability matrix

xn+1i=xni+ΔtAxni+Δtbuni=I+ΔtAxni+ΔtbutiE20

C=I+ΔtA=1Δt01−Δtcm=controllabilitymatrixE21

where n=0,1,2,…,39, x0i=x0u0i given, u0i is an initial input for arbitrary initial exploration. Since C is nonsingular for any m and c, the system considered is ensemble controllable to guarantee the existence of an appropriate input for steering task, even the system parameters are unknown. The data set Di in each trial recorded how the vehicle modified its input at each sampled point on the path in accordance with the stage cost at each sampling time point, which was the only type of feedback information that the vehicle received during the vehicle–environment interaction during learning. The i-th episode learning data were to be used for a demonstration in which the reward in the next (i + 1)th learning cycle was used to revise the policy for an optimization over action sequences in an effort to minimize the distance between the predicted future state and the target by large progress; this revision was conducted through an online estimation of the vehicle model based on the batch of collected data (for the whole completed episode) during the learning process.

4.2 Data-driven approximate time-optimal control design

We posit the basis functions ϕii=12…nb where nb is the total number of basis functions that can be used to represent the policy as a linear combination with a set of parameters over the basis functions [16]; this linear combination is represented as follows:

uxθ=∑t=0nbwiϕixE22

We choose ϕix in the form of Gaussian Radial Basis Function (RBF)

ϕix=exp−12x−μiTΔ−1x−μiE23

where μii=12…nb are support points, and Δ is covariance matrix. Then θi=wiμiΔ represents the policy parameter vector of the weight, mean, and covariance of each Gaussian RBF. The choice of parametric controller (18, 19) in the form of linear combination of Gaussian RBFs lies in function approximation property and good generalization property, where the controller parameters and the number of basis functions can be optimized using various methods. This implies robustness to vehicle parametric uncertainties and disturbances in learning. Therefore, the class of learning-based parametrized control policy for effectively reject external disturbance and compensate parametric uncertainties we consider is defined with a mapping that maps weights wi and basis functions ϕi to a full (predicted) state (position and velocity) feedback control. Since the optimal control is generally unknown, nb is chosen to be sufficiently large (100 in simulation) to allow an accurate approximation in model prediction and input-constrained control for the specific state-to-state transfer task.

4.3 Model learning performance and efficiency

PILCO uses GP that relates the policy learning performance to the double integrator with embedded uncertainties [specifically mass m and friction coefficient c in (15)] through its interactions with the environment for model learning. GP maps the policy parameter to the reward, which correlates the policy learning and one-step predictive model trajectory learning. How accurate the learned model trajectories can thus be measured by the reduction in travel time or increase in rewards over episodes, since an accurate dynamic model is required to derive the optimality conditions. In our case, the maneuver time minimization is not directly involved in the transformed cost (13, 14, 19). In fact, via maximizing the expected sum of discounted rewards of progress (or elapsed distance measured along the path) per step by applying admissible acceleration at each time step over the entire episode horizon 0Tterminal of learning, the maneuver or control policy tends to achieve a maximum progress per sampling time. Since the decrease of distance to target is directly related to reduction of travel time, it is worth noting that the model learning performance is gradually improved over episode as validated in the cost v.s. the number of episodes (time complexity) plot of Figure 2 with reduced cost and consistently with the reduced motion duration. This is because decaying factor to the power of t, γt in the time-accumulated trajectory cost sum (13) favors the immediate distance cost (19) by maximizing the distance traveled in the first few samples using the discounted factor in the cost function. This promotes the goal-reaching to be achieved with large progression per step at the beginning of the trajectory subject to symmetric acceleration bound of the vehicle, thus faster trajectory so as to fit trajectories with higher rewards.

Figure 2.
Graph of total cost against episode. Cost was reduced until (near) convergence was achieved at the second episode and was stable after convergence.

We can see two highly different trajectory behaviors in Figure 3a, while Figure 3b shows the learning curves in the position-velocity phase plane. The initial motion direction is either aligned with the desired direction toward the target or counterdirectional with it. The first trajectory generated in the first episode was counterintuitive in that the vehicle drove away from the destination. This unusual behavior was associated with a high trajectory cost due to divergence from the target, causing the rest of the trajectories in the second to last episodes to have their directions of motion switched when the vehicle was initially traveling toward the target. We observe that in the second episode, a nearly correct state response is generated to fit the highest reward. The learning behavior from second episode until the final one is relatively identical with small steady-state oscillation due to no terminal cost to control the terminal state at the end of horizon. The task-specific cost (19) is approaching zero as the final state is only around the target (i.e. xT→xtarget), considering the inaccuracy in reaching the target and residual vibration was present after the vehicle reached the neighborhood of the target. The required fast motion induces overshooting the target and when overshoot occurs, turn-back to be closer to the target is observed during learning. The policy is updated and exploration is terminated to maintain the reward at its highest throughout the remaining learning episodes.

Figure 3.
Converged learning outcome of state (position, velocity) and input (acceleration) trajectories of the vehicle. The velocity profile was triangular and the acceleration input exhibited bang–bang control characteristics. After reaching the neighborhood of the target, the vehicle attempted to stop and remain at the target. A little steady-state oscillation around the target was present because the vehicle could not decelerate fast enough to come to rest at the target. That is, the required fast motion induces overshooting the target if it does not brake in time and then reversed as hard as possible to be closer to the target.

4.4 Policy learning performance-verifying time-optimality

Theoretically, in ideal situations of no external disturbances and no model errors, by taking into account the input constraints, the analytical solution can be obtained (see Appendix A) for the damped double integrator when the time-optimal acceleration input is a bang-bang control [20] (i.e., a piecewise constant ±umax at all time points that is on the boundary of the admissible control set Uad) with one sign change (see also the following interpretation in Section 5.2). Accordingly, the effectiveness of the learned policy with respect to the time-optimal motion task is measured by analytical solution. As shown in Figure 4b and Table 1, the learned velocity trajectory is visually converges to a profile which is very similar the analytical solution. We see that the characteristics of exact solution are learned: the learned velocity profile almost coincides with the time-optimal zigzag profile with exactly one switching point whose height and corresponding time are, respectively, the maximum allowed velocity and tsw, and optimal acceleration is applied for each possible state on the linear track under symmetric acceleration constraint. For different c/m and c parameters, the fastest goal-reaching behavior from the same initial rest state results in different zigzag velocity profiles with different travel times for given L and fixed umax. Table 1 shows the effect of parameter variations due to approximate dynamics model on achievable minimum time. Note that T∗≠2tsw when c≠0. It is because the vehicle should not only have the highest acceleration to reach a highest speed at tsw and decelerate sufficiently fast to meet the null boundary conditions, but also, affected by the viscous frictional force which resists the vehicle’s movement speed to increase. The acceleration policy that produces this unique zigzag position-velocity profile corresponding to the bang-bang control with appropriate switching time to meet the boundary conditions is the goal of the policy learning based on learned model in an attempt to match.

Figure 4.
Learned response in time and phase plane. Episode 1 features the car reversing. In episode 2, a correct, nearly time-optimal motion toward the target was produced. (a) Vehicle position at each episode over 40 sampling time steps spanning a time horizon of 4 s. (b) Predicted position–velocity trajectories in the phase plane through application of data-driven control on learned model for the goal-reaching task along a linear track. The state trajectories became more accurate (time-optimal) with respect to the predicted velocity trajectory as the model was updated during learning. The time-optimal velocity trajectory upon convergence with one instance of switching is shown.

ma	cb	tswc(s)	T∗d (s)
0.5	0.0	0.79057	1.58114
0.5	0.1	0.854717	1.58443
0.5	0.096	0.852088	1.58418
0.5	0.104	0.857352	1.5847
0.48	0.1	0.836147	1.55229
0.52	0.1	0.872974	1.61595

Table 1.

This shows tsw and T∗ of the triangle profile in the presence of ∓4% deviation of both mass and friction coefficient simultaneously with accurate c=0.1, m=0.5 for given L=5 and umax=4. For comparison, tsw=0.79057 and T∗=2tsw=1.58114 for the case of c=0. It indicates the friction clearly slows down the fastest motion.

^a

m is the mass, a given parameter.

^b

c is the coefficient of friction, also a given parameter.

^c

t_sw is the switching time under bang-bang control.

^d

T^∗ is the minimum time spent in the whole motion.

In conclusion, the success of the simulation of goal-reaching trajectories along the track (task relevant states in reachable set) rendered the time-optimal trajectories on a straight road feasible. As the simulation shows, the objective function decreases the approximation errors introduced by the currently learned GP model and reduces consistently the travel time very effectively for optimal policy search to fit trajectories with higher rewards, and the resulting policy encodes the desirable spatial and temporal correlations approximately. However, the GP learning results from the collected data set are only valid for the given path and task, but the laws governing physical dynamics apply across paths and tasks. Furthermore, there is discrepancy between the desired time-optimality and the objective of cumulative maximum progress per step we implemented, we observed that the algorithm, despite the natural exploitation–exploration characteristics of the saturating cost function, was sometimes not guaranteed to be globally optimal; it could get stuck in a local optimum because the optimization problem was not convex [16, 18], but still yields good results owing to learning convergence of the generalizable controller as approximate solution to TOCP, whose sensitivity with respect to parameter variations of m and c is small.

5. Sim2Real policy transfer experiment

In real-world experiments, a relatively general uncertain nonlinear vehicle model is either not readily available or overly complicated for the design of state-to-state steering maneuvers. Therefore, to solve this problem, one approach to sim2real control is the transfer of skills learned from a simple simulated model to a similar real system executing a similar task. To test the learned policy in a real-world environment and as detailed in this section, we conducted a sim2real validation experiment with a low-cost Raspberry Pi–controlled small car, called AlphaBot, as shown in Figure 5. The simple car was equipped with photo-interrupters and ultrasonic sensors for state measurements. It could be controlled to either accelerate or brake. Because the robot unit was inexpensive, the low pulse-width-modulation duty cycles that were translated from the control signals might be incapable of driving the car, which also makes the task more complex. Before we conducted the experiment, we verified that the robot had sufficient power to travel a distance of L along the linear track to reach the target. In contrast to the simulation scenario, the experimental setup had an additional velocity limit for the car, which functioned as a state constraint. The velocity limit stemmed from the voltage constraints of the board and was therefore not directly handled by the control signal.

Figure 5.
Low-cost model car AlphaBot for the experiment (left photo) and the experimental setup involved in driving AlphaBot along a linear track.

5.1 Setup

A small car was designated to travel from a point 180cm from the wall to a point 50cm from the wall. Its heading was fixed at a forward-facing angle of 0∘ for straight line motion. The task characteristics (system dynamics along a linear track on a frictional plane) and the environment for learning in the simulation and experiment were very similar. We assumed that the model of the car was similar to that in the experiment; it was a second-order dynamics model with slightly changed parameters. This model is universal because it is based on fundamental physical principles. Thus, we could reuse the same RBF controller (18, 19) that was used in the simulated double integrator (15) as a time-optimal feedback control of the actual vehicle. When applied to a real vehicle, the learned policy from simulation runs thus can be viewed as an optimal demonstration by the simulated system (15). This policy provides a good understanding of the time-optimal motion of the vehicle motion with only an input constraint under no disturbance and no state constraint. In the experiment, the control signal generated from the RL algorithm ranged from −2 to 2 and was translated into the rate of change of the duty cycles of the motor PWM signal on board.

5.2 Results

As indicated in Figure 6, the sim2real transfer experiment, which applied the same data-driven control learned from simulation, worked well on the low-cost car. The vehicle drove at each point on the path at its highest allowable velocity; this observation was similar to that in the simulation. This validates the slight generalization capability (robustness and stability) of the MBRL with Gaussian RBF kernel under similar dynamics and task constraints. The learning curve is illustrated in Figure 6a. The total cost was 40 at the initial trial of the task for the saturating cost function. The vehicle attained a lowest cost of approximately 16 and a travel time of approximately 2.2 s at the seventh episode with a total experience time of 28s. This approach was deemed to be very efficient because learning was successful after only a small number of trials. In Figure 6b, immediate cost at every time step (per-step cost) at various episodes is plotted, showing the learning process and indicating that the arrival time decreased over episodes. The decrease was, however, not monotonic; thus, the intermediate trajectories prior to the completion of learning may not be acceptable until a nearly time-optimal control input is finally obtained at convergence, which was the desired control goal. As indicated in Figure 6c, by following a nearly time-optimal velocity along the track, the final position of the car, despite not overshooting the target or fluctuating, was slightly off the target by 50cm partly because of modeling uncertainties and localization errors from lateral tracking error and sensor inaccuracies. A velocity limit was set because of the electronic voltage constraints on the robot and was therefore not directly handled by the robot control signal.

Figure 6.
Experimental results. (a) Graph of total cost against episodes. Trajectory cost decreased after some transients until convergence at the seventh episode where a travel time of approximately 2 s was achieved. (b) Intermediate costs in selected episodes. (c) Outcome with respect to learning state and control trajectories.

5.3 Discrepancies between simulation and experiment

Prior knowledge about a detailed description of the vehicle dynamics that contain uncertainties is not available or is not required in both simulation and sim2real experiment. Instead, the model is learnt through the Gaussian Processes to reduce the accumulated cost, thus contributing to decrease the maneuver time. However, the convergence in the simulation tends to require less learning episodes to converge, which possibly stems from the following major differences. The first important difference is that the learning experiment is on the basis of more complicated vehicle dynamics. In fact, the inherent factors confronted during real world experiment that cannot be neglected or accounted for in a simple physical model (18). These can affect the control performance include uncertainties caused by lateral drift and wheel sideslip of the vehicle, physical properties such as inertia and complicated, hard to model friction characteristics (such as Coulomb friction proportional to signẋ and aerodynamic (quadratic) drag force proportional to ẋ2, in addition to viscous friction proportional to ẋ), or unknown external forces inherent in the vehicle-terrain interaction as a result of the inaccuracy of sensor measurements, motor characteristics and torque disturbances and variation of environment interactions such as uneven ground. These factors make the prediction of the sampled states on the linear track under the input ambiguous and affect the stability of system to reach the target. Considerations of vehicle hardware also entail an additional velocity limit (due to a limit to how much voltage the hardware can take) that ensures that rapid motion causes no harm or instability on the vehicle at each trial. These factors result in differing state trajectories in the experiment and simulation under the same control. Due to low sensitivity of TOCP with respect to system parameters, however, the convergence of learning to the approximate solution is ensured under these factors. Therefore, a new approximate optimal solution, if the system deviated from the initially demonstrated trajectory, is recovered by learning controller.

5.4 Interpretations of learning results

In a scenario of short-distance driving on a linear path, we see that data-driven control (18, 19) realizing the nearly time-optimal state-to-state steer control in a simulated model is demonstrated as a good approximated time-optimal control of actual system with similar dynamics and task characteristics. The error which is in sim2real experiment between the learned simulated model from the dataset and learned near time-optimal control performance can be split into two aspects. The first one is the error between the actual state xat and the theoretical time-optimal state xoptt. The second one is the error between the theoretical time-optimal state xoptt and the learned simulated state xt. Let xopttuoptt=uxopttθoptt be the time-optimal state-input pair of simulation model (15). For a successful sim2real experiment, we assume that the simulation model (15) can be extended to the actual system (24) in real experiment

ẋa=Axa+buopt+δAxaE24

xa0=0,u∈Uad=−umaxumax

where xat is the actual state, and

δA=000δcmE25

is a bounded additive disturbance due to the error δc of actual frictions and simulated friction settings with other a priori unknown factors that linearly related to xa, umax is the maximum control limitation (or state constraint) confined by hardware setting. Note (24) applies the same time-optimal control (18) learned from the simulated system. We assume the error δc is bounded by kc>0. Since xopt satisfies (15), we then have

ẋa−ẋopt=Axa−xopt+δAxaE26

Integrating (26) and moving associated terms, we have

xa−xopt=eAt∫0te−AsδAxadsE27

for 0<t<Tterminal. Since Tterminal is finite, xa is bounded in 0Tterminal (note, (24) can be solved analytically), the difference ∣xa−xopt∣ is proportional to kc with a constant C, independent of t. Therefore, the error between the actual state xat and the theoretical time-optimal state xoptt is arbitrarily small if kc tends to 0.

On the other hand, the dynamics system with transition function is discretized by Euler method as

xt+1=xt+ΔtAxt+ΔtbutE28

Since this is the dynamic system learned by PILCO, thus the difference between xt and xopt is proportional to Δt2 with constant C, independent of t. Therefore, as kc,Δt→0, we have xa→xopt.

6. Conclusions

A case study of TOCP, whose solution typically requires a known model with given boundary conditions for applying PMP, is framed as a MBRL task for rest-to-rest steering along a linear path by a vehicle that was modeled as an uncertain double integrator subject to constant acceleration bound. An important aspect of this paper is a novel application of PILCO, an existing data-efficient MBRL, to approximately solve TOCP for uncertain damped double integrator without accurate parameters. Feasibility of learning convergence is empirically verified by parametric sensitivity of exact time-optimal solution. The consistency of learned velocity profile closer to that obtained by analytical time-optimal control is shown by simulation first and then implemented on a sim2real experiment. The learned velocity can be further revised to account for a velocity limit through scaling without the need of replanning or relearning for performing less aggressive maneuvers. Our case study expands the scope of problems that can be successfully solved by MBRL (specifically, PILCO), serving as a robust adaptive optimal control, without prior parametric model representation, and it demonstrates the capability in compensating for uncertainties and external disturbances, which can cause the state trajectories to deviate from the optimal simulated state trajectory. For the challenging problem of learning a safe velocity for various road topologies and traffic flows, MBRL suffers from the accumulated compounding error over long horizon. And the comparison with other learning approaches to solution of optimal control problems is our future work.

Acknowledgments

This work was supported by an internal funding from Institute of Information Science, Academia Sinica, Taipei, Taiwan.

d2xdt2=−cmdxdt+1mutE29

where

x0=0,xT=Lẋ0=0,ẋT=0∣ut∣≤umaxE30

Integrating the Eq. (29), we then get

ẋt−ẋ0=−cmxt−x0+1mUt−U0E31

where

Ut=U0+∫0tusdsE32

Reducing the Eq. (31) by invoking the boundary conditions, we get

ẋt=−cmxt+1mUt−U0E33

Thus,

xt=x0+e−cmt∫0tecmsmUt−U0dsE34

The state trajectory with fixed initial state can be expressed by the control ut, thereby removing the system dynamics constraint.

From the terminal condition at T, we get

L=e−cmT∫0TecmsmUt−U0dsE35

Therefore,

LecmT=∫0Tecmsm∫0sus˜ds˜ds=∫0T∫0secmsmus˜ds˜ds=∫0T∫sTecmsmus˜dsds˜=∫0Tus˜∫sTecmsmdsds˜=1c∫0Tus˜ecmT−ecmsds˜=1cecmT∫0Tusds−1c∫0Tusecmsds=1cecmTUT−U0−1c∫0TusecmsdsE36

And substituting into (33) with boundary condition of ẋT=0, we get

UT=Lc+U0E37

Therefore,

∫0Tusecmsds=0E38

Thus, the conditions (32), (37) and (38) can be equivalently converted into

∫0Tusds=Lc∫0Tusecmsds=0E39

(39) together with another condition ∣ut∣≤umax on 0T, these three are complete constraints on ut of the problem. Now, according to bang-bang control, the solution ut should be ±umax for T to achieve the minimum. Since (29) is a linear second-order system, the bang-bang control has exactly one switching time. Let tsw denote the switching time and T∗ the minimum time possible. Based on this pattern for ut,

ut=umaxift∈0tsw−umaxift∈tswT∗E40

we can define

ut=c1I0tsw+c2ItswT∗E41

where

Iabx=1ifx∈ab0otherwiseE42

Substituting (41) into (39), it becomes

c1tsw+c2T−tsw=Lcc1mcecmtsw+c2mcecmT−ecmtsw=0E43

Therefore, the c1 and c2 are

c1=LcecmT−ecmtswtswecmT−1−Tecmtsw−1c2=−Lcecmtsw−1tswecmT−1−Tecmtsw−1E44

According to ∣ut∣≤umax and bang-bang principle, we have

c1=LcecmT−ecmtswtswecmT−1−Tecmtsw−1=umaxc2=−Lcecmtsw−1tswecmT−1−Tecmtsw−1=−umaxE45

Thus, ecmT−ecmtsw=ecmtsw−1 and we get tsw=mclnecmT+12. Now, substituting it into

c1=LcecmT−ecmtswtswecmT−1−Tecmtsw−1=umaxE46

we get the minimum case of T, i.e. T=T∗. (This can be computed by numerical methods. See results in Table 1).

References

1. Ostafew CJ, Schoellig AP, Barfoot TD, Collier J. Speed daemon: Experience-based mobile robot speed scheduling. In: Canadian Conference on Computer and Robot Vision. USA: IEEE; 2014. pp. 56-62
2. Rao AV. Trajectory optimization: A survey. In: Optimization and Optimal Control in Automotive Systems. Cham: Springer; 2014. pp. 3-21
3. Bobrow J, Dubowsky S, Gibson J. Time-optimal control of robotic manipulators along specified paths. International Journal of Robotics Research. 1985;4(3):3-17
4. Verscheure D, Demeulenaere B, Swevers J, DeSchutter J, Diehl M. Time-optimal path tracking for robots: A convex optimization approach. IEEE Transcation on Automatic Control. 2009;54(10):2318-2327
5. Tohid A, Norrlöf M, Löfberg J, Hansson A. Convex optimization approach for time-optimal path tracking of robots with speed dependent constraints. IFAC Proceedings Volumes. 2011, 2011;44(1):14648-14653
6. Shin K, McKay N. Selection of near-minimum time geometric paths for robotic manipulators. IEEE Transactions on Automatic Control. 1986;31(6):501-511
7. Wigstrom O, Lennartson B, Vergnano A, Breitholtz C. High-level scheduling of energy optimal trajectories. IEEE Transactions on Automation Science and Engineering. 2013;10(1):57-64
8. Bianco CGL, Romano M. Optimal velocity planning for autonomous vehicles considering curvature constraints. In: IEEE International Conference on Robotics and Automation. USA: IEEE; 2007. pp. 2706-2711
9. Dinev T, Merkt W, Ivan V, Havoutis I, Vijayakumar S. Sparsity-inducing Optimal Control Via Differential Dynamic Programming. USA: IEEE; 2020. arXiv preprint arXiv:2011.07325
10. Kunz T, Stilman M. Time-optimal trajectory generation for path following with bounded acceleration and velocity. In: Proceedings of Robotics Science and Systems VIII. Cambridge, Massachusetts, United States: MIT Press; 2012. pp. 1-8
11. Jond HB, Nabiyev VV, Akbarimajd A. Planning of mobile robots under limited velocity and acceleration. In: 22nd Signal Processing and Communications Applications Conference. USA: IEEE; 2014. pp. 1579-1582
12. Pham Q. A general, fast, and robust implementation of the time-optimal path parameterization algorithm. IEEE Transactions on Robotics. 2014;30(6):1533-1540
13. Polydoros AS, Nalpantidis L. Survey of model-based reinforcement learning: Applications on robotics. Journal of Intelligent & Robotic Systems. 2017;86(2):153-173
14. Kober J, Bagnell JA, Peters J. Reinforcement learning in robotics: A survey. The International Journal of Robotics Research. 2013;32(11):1238-1274
15. Dulac-Arnold G, Levine N, Mankowitz DJ, Li J, Paduraru C, Gowal S, et al. Challenges of real-world reinforcement learning: Definitions, benchmarks and analysis. Machine Learning. 2021;110(9):1-50
16. Deisenroth M, Rasmussen CE. PILCO: A model-based and data-efficient approach to policy search. In: 28th International Conference on Machine Learning (ICML-11). Bellevue, WA, USA: ICML; 2011. pp. 465-472
17. Martinez-Marin, T. (2005). Learning optimal motion planning for car-like vehicles. IEEE International Conference on Computational Intelligence for Modelling, Control and Automation IEEE USA pp.601-612
18. Saha O, Dasgupta P, Woosley B. Real-time robot path planning from simple to complex obstacle patterns via transfer learning of options. Autonomous Robots. 2019:1-23
19. Hartman G, Shiller Z, Azaria A. Deep reinforcement learning for time optimal velocity control using prior knowledge. In: IEEE 31st International Conference on Tools with Artificial Intelligence (ICTAI). USA: IEEE; 2018. arXiv preprint arXiv:1811.11615
20. Liberzon D. Calculus of Variations and Optimal Control Theory: A Concise Introduction. Princeton University Press; 2011
21. Ozatay E, Ozguner U, Filev D. Velocity profile optimization of on road vehicles: Pontryagin’s maximum principle based approach. Control Engineering Practice. 2017;61:244-254
22. Stryk O, Bulirsch R. Direct and indirect methods for trajectory optimization. Annals of Operation Research. 1992;37(1):357-373
23. Hauser J, Saccon A. A barrier function method for the optimization of trajectory functionals with constraints. In: Proceedings of the 45th IEEE Conference on Decision and Control. USA: IEEE; 2006. pp. 864-869
24. Qian X, Navarro I, de La Fortelle A, Moutarde F. Motion planning for urban autonomous driving using Bézier curves and MPC. In: IEEE 19th International Conference on Intelligent Transportation Systems (ITSC). USA: IEEE; 2016. pp. 826-833
25. Song C, Boularias A. Identifying Mechanical Models Through Differentiable Simulations. Ithaca, New York: Cornell University; 2020. arXiv preprint arXiv:2005.05410
26. Geist AR, Trimpe S. Structured learning of rigid-body dynamics: A survey and unified view. GAMM‐Mitteilungen. 2020;44(2):e202100009. arXiv preprint arXiv:2012.06250
27. Moerland TM, Broekens J, Jonker CM. Model-based Reinforcement Learning: A Survey. Ithaca, New York: Cornell University; 2020. arXiv preprint arXiv:2006.16712
28. Liu M, Chowdhary G, Da Silva BC, Liu SY, How JP. Gaussian processes for learning and control: A tutorial with examples. IEEE Control Systems Magazine. 2018;38(5):53-86
29. Pineda L, Amos B, Zhang A, Lambert NO, Calandra R. MBRL-LIB: A Modular Library for Model-based Reinforcement Learning. Ithaca, New York: Cornell University; 2021. arXiv preprint arXiv:2104.10159. Available from: https://github.com/facebookresearch/mbrl-lib
30. Brunzema P. Review on Data-Efficient Learning for Physical Systems using Gaussian Processes. Berlin, Germany: ResearchGate; 2021. Available from: researchgate.net
31. Sprague CI, Izzo D, Ögren P. Learning a Family of Optimal State Feedback Controllers. Ithaca, New York: Cornell University; 2019. arXiv preprint arXiv:1902.10139
32. Kabzan J, Hewing L, Liniger A, Zeilinger MN. Learning-based model predictive control for autonomous racing. IEEE Robotics and Automation Letters. 2019;4(4):3363-3370

[1] 1. Ostafew CJ, Schoellig AP, Barfoot TD, Collier J. Speed daemon: Experience-based mobile robot speed scheduling. In: Canadian Conference on Computer and Robot Vision. USA: IEEE; 2014. pp. 56-62

[2] 2. Rao AV. Trajectory optimization: A survey. In: Optimization and Optimal Control in Automotive Systems. Cham: Springer; 2014. pp. 3-21

[3] 3. Bobrow J, Dubowsky S, Gibson J. Time-optimal control of robotic manipulators along specified paths. International Journal of Robotics Research. 1985;4(3):3-17

[4] 4. Verscheure D, Demeulenaere B, Swevers J, DeSchutter J, Diehl M. Time-optimal path tracking for robots: A convex optimization approach. IEEE Transcation on Automatic Control. 2009;54(10):2318-2327

[5] 5. Tohid A, Norrlöf M, Löfberg J, Hansson A. Convex optimization approach for time-optimal path tracking of robots with speed dependent constraints. IFAC Proceedings Volumes. 2011, 2011;44(1):14648-14653

[6] 6. Shin K, McKay N. Selection of near-minimum time geometric paths for robotic manipulators. IEEE Transactions on Automatic Control. 1986;31(6):501-511

[7] 7. Wigstrom O, Lennartson B, Vergnano A, Breitholtz C. High-level scheduling of energy optimal trajectories. IEEE Transactions on Automation Science and Engineering. 2013;10(1):57-64

[8] 8. Bianco CGL, Romano M. Optimal velocity planning for autonomous vehicles considering curvature constraints. In: IEEE International Conference on Robotics and Automation. USA: IEEE; 2007. pp. 2706-2711

[9] 9. Dinev T, Merkt W, Ivan V, Havoutis I, Vijayakumar S. Sparsity-inducing Optimal Control Via Differential Dynamic Programming. USA: IEEE; 2020. arXiv preprint arXiv:2011.07325

[10] 10. Kunz T, Stilman M. Time-optimal trajectory generation for path following with bounded acceleration and velocity. In: Proceedings of Robotics Science and Systems VIII. Cambridge, Massachusetts, United States: MIT Press; 2012. pp. 1-8

[11] 11. Jond HB, Nabiyev VV, Akbarimajd A. Planning of mobile robots under limited velocity and acceleration. In: 22nd Signal Processing and Communications Applications Conference. USA: IEEE; 2014. pp. 1579-1582

[12] 12. Pham Q. A general, fast, and robust implementation of the time-optimal path parameterization algorithm. IEEE Transactions on Robotics. 2014;30(6):1533-1540

[13] 13. Polydoros AS, Nalpantidis L. Survey of model-based reinforcement learning: Applications on robotics. Journal of Intelligent & Robotic Systems. 2017;86(2):153-173

[14] 14. Kober J, Bagnell JA, Peters J. Reinforcement learning in robotics: A survey. The International Journal of Robotics Research. 2013;32(11):1238-1274

[15] 15. Dulac-Arnold G, Levine N, Mankowitz DJ, Li J, Paduraru C, Gowal S, et al. Challenges of real-world reinforcement learning: Definitions, benchmarks and analysis. Machine Learning. 2021;110(9):1-50

[16] 16. Deisenroth M, Rasmussen CE. PILCO: A model-based and data-efficient approach to policy search. In: 28th International Conference on Machine Learning (ICML-11). Bellevue, WA, USA: ICML; 2011. pp. 465-472

[17] 17. Martinez-Marin, T. (2005). Learning optimal motion planning for car-like vehicles. IEEE International Conference on Computational Intelligence for Modelling, Control and Automation IEEE USA pp.601-612

[18] 18. Saha O, Dasgupta P, Woosley B. Real-time robot path planning from simple to complex obstacle patterns via transfer learning of options. Autonomous Robots. 2019:1-23

[19] 19. Hartman G, Shiller Z, Azaria A. Deep reinforcement learning for time optimal velocity control using prior knowledge. In: IEEE 31st International Conference on Tools with Artificial Intelligence (ICTAI). USA: IEEE; 2018. arXiv preprint arXiv:1811.11615

[20] 20. Liberzon D. Calculus of Variations and Optimal Control Theory: A Concise Introduction. Princeton University Press; 2011

[21] 21. Ozatay E, Ozguner U, Filev D. Velocity profile optimization of on road vehicles: Pontryagin’s maximum principle based approach. Control Engineering Practice. 2017;61:244-254

[22] 22. Stryk O, Bulirsch R. Direct and indirect methods for trajectory optimization. Annals of Operation Research. 1992;37(1):357-373

[23] 23. Hauser J, Saccon A. A barrier function method for the optimization of trajectory functionals with constraints. In: Proceedings of the 45th IEEE Conference on Decision and Control. USA: IEEE; 2006. pp. 864-869

[24] 24. Qian X, Navarro I, de La Fortelle A, Moutarde F. Motion planning for urban autonomous driving using Bézier curves and MPC. In: IEEE 19th International Conference on Intelligent Transportation Systems (ITSC). USA: IEEE; 2016. pp. 826-833

[25] 25. Song C, Boularias A. Identifying Mechanical Models Through Differentiable Simulations. Ithaca, New York: Cornell University; 2020. arXiv preprint arXiv:2005.05410

[26] 26. Geist AR, Trimpe S. Structured learning of rigid-body dynamics: A survey and unified view. GAMM‐Mitteilungen. 2020;44(2):e202100009. arXiv preprint arXiv:2012.06250

[27] 27. Moerland TM, Broekens J, Jonker CM. Model-based Reinforcement Learning: A Survey. Ithaca, New York: Cornell University; 2020. arXiv preprint arXiv:2006.16712

[28] 28. Liu M, Chowdhary G, Da Silva BC, Liu SY, How JP. Gaussian processes for learning and control: A tutorial with examples. IEEE Control Systems Magazine. 2018;38(5):53-86

[29] 29. Pineda L, Amos B, Zhang A, Lambert NO, Calandra R. MBRL-LIB: A Modular Library for Model-based Reinforcement Learning. Ithaca, New York: Cornell University; 2021. arXiv preprint arXiv:2104.10159. Available from: https://github.com/facebookresearch/mbrl-lib

[30] 30. Brunzema P. Review on Data-Efficient Learning for Physical Systems using Gaussian Processes. Berlin, Germany: ResearchGate; 2021. Available from: researchgate.net

[31] 31. Sprague CI, Izzo D, Ögren P. Learning a Family of Optimal State Feedback Controllers. Ithaca, New York: Cornell University; 2019. arXiv preprint arXiv:1902.10139

[32] 32. Kabzan J, Hewing L, Liniger A, Zeilinger MN. Learning-based model predictive control for autonomous racing. IEEE Robotics and Automation Letters. 2019;4(4):3363-3370

Velocity Planning via Model-Based Reinforcement Learning: Demonstrating Results on PILCO for One-Dimensional Linear Motion with Bounded Acceleration

Artificial Intelligence Annual Volume 2022

Abstract

Keywords

Author Information

Hsuan-Cheng Liao

Han-Jung Chou

Jing-Sin Liu*

1. Introduction

2. Related work

2.1 Approaches to the TOCP with known dynamics

2.1.1 Hamilton-Jacobi-bellman (HJB) equation

2.1.2 Convex optimization (CO)

2.1.3 Numerical integration (NI)

2.2 Reinforcement learning (RL)

3. MBRL for time-optimal vehicle motion

3.1 Dynamics modeling

3.2 Deterministic trajectory prediction

3.3 Policy evaluation (reward function)

3.4 Cost function design

3.5 Policy optimization

4. Simulation results and discussions

4.1 Simulation scenario and settings

Figure 1.

4.2 Data-driven approximate time-optimal control design

4.3 Model learning performance and efficiency

Figure 2.

Figure 3.

4.4 Policy learning performance-verifying time-optimality

Figure 4.

Table 1.

5. Sim2Real policy transfer experiment

Figure 5.

5.1 Setup

5.2 Results

Figure 6.

5.3 Discrepancies between simulation and experiment

5.4 Interpretations of learning results

6. Conclusions

Acknowledgments

References

Continue reading from the same book

Artificial Intelligence Annual Volume 2022