Open access peer-reviewed chapter - ONLINE FIRST

Traffic State Prediction and Traffic Control Strategy for Intelligent Transportation Systems

Written By

Shangbo Wang

Submitted: October 4th, 2021 Reviewed: November 17th, 2021 Published: March 9th, 2022

DOI: 10.5772/intechopen.101675

Intelligent Electronics and Circuits - Terahertz, IRS, and Beyond Edited by Mingbo Niu

From the Edited Volume

Intelligent Electronics and Circuits - Terahertz, IRS, and Beyond [Working Title]

Dr. Mingbo Niu


The recent development of V2V (Vehicle-to-Vehicle), V2I (Vehicle-to-Infrastructure), V2X (Vehicle-to-Everything) and vehicle automation technologies has enabled the concept of Connected and Automated Vehicles (CAVs) to be tested and explored in practice. Traffic state prediction and control are two key modules for CAV systems. Traffic state prediction is important for CAVs because adaptive decisions and control strategies, such as adjusting traffic signals, turning left or right, stopping or accelerating, and deciding vehicle motion, rely on the completeness and accuracy of traffic data. For a given traffic state and input action, the future traffic states can be predicted via data-driven approaches such as deep learning models. RL (Reinforcement Learning) based approaches have gained the most popularity in developing optimum control and decision-making strategies because they can maximize the long-term reward in a complex system via interaction with the environment. However, RL techniques still have drawbacks, such as a slow convergence rate for high-dimensional states, which need to be overcome in future research. This chapter aims to provide a comprehensive survey of the state-of-the-art solutions for traffic state prediction and traffic control strategies.


  • traffic state prediction
  • traffic control
  • deep learning
  • V2V
  • V2I
  • V2X
  • CAV
  • RL

1. Introduction

Connected and Automated Vehicles (CAVs) are currently an area of extensive research, and there are grounds to expect that the introduction of CAVs may revolutionize the whole transportation sector [1]. Many predictions state that CAVs will solve many of the problems experienced on roads today, such as congestion, traffic accidents and lost time [2].

Traffic state prediction and traffic control are two key modules in transportation systems with CAVs [3]. Traffic states such as flow, speed and congestion play vital roles in traffic management, public service and traffic control [4]. By predicting the evolution of the traffic state timely and accurately, decision-makers and traffic controllers can issue effective policies and control inputs that avoid traffic congestion ahead of time; ITS (Intelligent Transportation Systems), advanced traffic management systems and traveler information systems therefore rely on real-time traffic state prediction. Traffic control can be divided into a decision-making module and a vehicle control module. The former optimizes mobility, safety and energy consumption by using the vehicle trajectory prediction results to calculate vehicle platoon sizes, speed, flow, density, merging and diverging flows and traffic signals, while the latter uses onboard units to perform vehicle path control, vehicle fleet control and steering wheel, throttle, brake and other actuator control based on the control commands [3]. How to predict the future traffic state timely and accurately and how to deliver an effective traffic control strategy are fundamental issues in ITS.

Traffic state prediction approaches can be broadly divided into two groups: parametric and non-parametric approaches [5]. Parametric approaches use models that capture all the information needed for prediction within a finite set of parameters. Popular parametric techniques include ARIMA (Autoregressive Integrated Moving Average) [6, 7, 8, 9], linear regression [10] and Kalman Filter (KF) based methods [11]. These are linear models and achieve high accuracy when the traffic data have linear characteristics. The ARIMA model assumes that future data will resemble the past and is widely used for time series that can be made stationary by differencing. It is specified by three values: the order of the autoregressive part (p), the degree of differencing (d) and the order of the moving average part (q). The model order can be selected by Akaike’s Information Criterion (AIC) combined with the likelihood of the historical data, while the model parameters can be estimated by maximizing the log-likelihood function. Extending the ARIMA time series model into the spatial domain results in the STARIMA (Space–Time Autoregressive Integrated Moving Average) model, which can deliver more accurate traffic predictions because it exploits the spatio-temporal correlation of traffic data. To capture the spatial correlation, the STARIMA model adds a spatial matrix comprising the spatial adjacency and weight structure, and the number of spatial lags for the STAR and STMA models. The drawback of the ARIMA model for traffic prediction is its strong assumption that traffic data become stationary after differencing, which is difficult to fulfill because of the non-stationary characteristics of traffic data. The linear regression model assumes a linear relationship between the input variables and the single output variable and estimates the regression coefficients from historical traffic data. KF based methods allow a unified approach to the prediction of all processes that can be given a state space representation. Although EKF (Extended Kalman Filtering) can be used to deal with the non-linearity of traffic data, it is difficult to approximate most non-linear functions accurately, which can lead to relatively large errors.
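As a minimal illustration of the differencing step (the role of d in ARIMA), the following Python sketch differences a synthetic flow series with a linear trend and estimates the coefficient of a zero-mean first-order autoregressive model by least squares. The series values and the helper names `difference` and `fit_ar1` are assumptions made for illustration only; a practical study would use a statistics library and real detector data.

```python
def difference(series, d=1):
    """Apply d-th order differencing: y_t = x_t - x_{t-1}, repeated d times."""
    out = list(series)
    for _ in range(d):
        out = [b - a for a, b in zip(out, out[1:])]
    return out


def fit_ar1(series):
    """Least-squares estimate of phi in x_t = phi * x_{t-1} + noise (zero-mean AR(1))."""
    num = sum(a * b for a, b in zip(series, series[1:]))
    den = sum(a * a for a in series[:-1])
    return num / den


# Hypothetical flow counts with a linear upward trend (illustrative values only).
flow = [100 + 3 * t for t in range(10)]
diffed = difference(flow, d=1)  # the trend is removed: every entry equals the slope
phi = fit_ar1([1.0, 0.5, 0.25, 0.125])  # a geometrically decaying series
```

In this toy case, first-order differencing turns the trending series into a constant one, which is the sense in which d = 1 makes it stationary.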

Non-parametric models such as DL (Deep Learning) outperform parametric models because traffic data are stochastic, non-deterministic, non-linear and multidimensional [5]. DL is a subset of machine learning (ML) that is based on the concept of the deep neural network (DNN), and it has been widely used for data classification, natural language processing (NLP) and object recognition [5]. The most popular DL models used for traffic state prediction include the Convolutional Neural Network (CNN) [12, 13, 14], Deep Belief Network (DBN) [15, 16], Recurrent Neural Network (RNN) [17, 18, 19] and Autoencoder (AE) [20]. CNN is useful for traffic prediction because traffic data are two-dimensional and CNN can extract their spatial features. Each CNN neuron is connected only to a small subset of the input, which decreases the computational complexity of the training process. DBN is a stack of multiple RBMs (Restricted Boltzmann Machines), which can be used to estimate the probability distribution of the input traffic data. LSTM (Long Short-Term Memory) is a special type of RNN that captures the temporal features of traffic data and overcomes the vanishing gradient problem of the standard RNN.

Traffic control strategies can generally be divided into classical methods and learning-based methods. Classical methods develop the traffic controller from control theory or optimization-based techniques, and include the dynamic traffic assignment based nonlinear controller [21], the standard proportional-integral (PI) controller [22, 23], the robust PI controller [24], model-based predictive control (MPC) [25, 26], the linear quadratic controller [27], mixed-integer non-linear programming (MINLP) [28] and the multi-objective optimization based decision-making model [29]. Learning-based methods use artificial intelligence technologies to achieve decision-making and control for CAVs, and can be further divided into three categories: statistical learning-based methods, deep learning-based (DL) methods and reinforcement learning-based (RL) methods. The RL-based method is currently one of the most commonly used learning-based techniques for traffic control and decision-making because RL can solve complex control problems by using the Markov decision process (MDP) to describe the interaction between agent and environment [4]. The most popular RL-based methods include Q-learning for adaptive traffic signal control [30, 31], multi-agent RL approaches [32, 33, 34, 35] and the Nash Q-learning strategy [36]; many other RL-based approaches are also available in the literature. Q-learning based traffic signal control aims to minimize the average accumulated travel time by greedily selecting an action at each iteration. Multi-agent RL approaches are more commonly used in network signal optimization and can generally be divided into centralized RL, which treats the whole system as a single agent, and decentralized RL, which distributes the global control among the agents. The Nash Q-learning strategy is a decentralized multi-agent RL strategy that performs iterated updates by assuming Nash equilibrium behavior over the current Q-values. It can be shown that traffic signal control using the Nash Q-learning strategy converges to at least one Nash equilibrium for stationary control policies. However, Nash Q-learning cannot achieve Pareto optimality because it does not consider cooperation among the agents.

This chapter provides a comprehensive survey of state-of-the-art traffic state prediction and traffic control techniques. It is organized as follows. In Section 2, we first introduce the fundamental structure and main characteristics of two important DL models, CNN and LSTM (Long Short-Term Memory), as well as their advantages in traffic state prediction; we then describe how hybrid traffic state prediction combines the two models to achieve better accuracy. In Section 3, we detail RL fundamentals and describe how RL can be applied to traffic control and decision-making, focusing on multi-agent RL approaches and discussing their pros and cons. Section 4 summarizes the chapter.


2. DL-based traffic state prediction approaches

In this section, we first give a brief overview of machine learning and deep learning concepts. Then, we introduce the architectures of two DL models, CNN and LSTM, which perform well on high-dimensional and temporally correlated data. Finally, a hybrid model of CNN and LSTM is described, and we discuss how prediction accuracy can be improved by incorporating spatio-temporal correlation.

2.1 Overview of deep learning

ML approaches are broadly classified into two categories: supervised learning and unsupervised learning [5]. Supervised learning requires the input data to be clearly labeled. It involves a function $y = f(x)$ that maps input $x$ to output $y$ [5], and it performs two tasks: regression and classification. In contrast, unsupervised learning aims to cluster data by extracting the data pattern. Popular supervised learning approaches include Random Forest (RF), Support Vector Machine (SVM), Bayesian methods and Artificial Neural Network (ANN), whereas unsupervised learning approaches include Autoencoder, Principal Component Analysis (PCA) and Deep Belief Network (DBN). To perform prediction tasks with supervised learning, a model first needs to be trained on a training dataset with a sufficient number of samples. New predictions are then obtained by inputting the feature vectors of new samples to the trained model. Cross-validation is usually used to validate the prediction performance: the whole dataset is first divided into $K$ folds; then, for each fold $k$, (i) a model is trained using the remaining $K-1$ folds as training data and (ii) the resulting model is validated on the $k$-th fold.
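The cross-validation procedure can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the folds are contiguous and unshuffled, and the function names `k_fold_indices`, `cross_validate`, `fit_mean` and `mse` are hypothetical helpers introduced here for illustration, with a constant mean predictor standing in for a real model.

```python
def k_fold_indices(n_samples, K):
    """Split indices 0..n_samples-1 into K contiguous folds of near-equal size."""
    base, extra = divmod(n_samples, K)
    folds, start = [], 0
    for k in range(K):
        size = base + (1 if k < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validate(xs, ys, K, fit, score):
    """Train on K-1 folds, validate on the held-out fold, and average the scores."""
    folds = k_fold_indices(len(xs), K)
    scores = []
    for k in range(K):
        val = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        scores.append(score(model, [xs[i] for i in val], [ys[i] for i in val]))
    return sum(scores) / K


def fit_mean(xs, ys):
    """Stand-in 'model': always predicts the training mean."""
    return sum(ys) / len(ys)


def mse(model, xs, ys):
    """Mean squared error of the constant prediction on the validation fold."""
    return sum((y - model) ** 2 for y in ys) / len(ys)


folds = k_fold_indices(10, 3)
avg_error = cross_validate(list(range(6)), [2.0] * 6, 3, fit_mean, mse)
```

For time series such as traffic data, shuffled folds can leak future information into training, so contiguous or forward-chaining splits are often preferred; that choice is left open here.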

DL is a branch of ML that aims to construct a computational model with multiple processing layers to support high-level data abstraction. It can automatically extract features from data, without human intervention, to explore hidden relationships among different attributes of the dataset [37]. The concepts of DL are inspired by the thinking process of the human brain. Hence, the majority of DL architectures use the framework of the Artificial Neural Network (ANN), which consists of input, hidden and output layers with nonlinear computational elements (neurons or processing units). The network depth (the number of layers) can be adjusted according to the feature dimensions and complexity of the data. The number of neurons at the input layer is equal to the number of independent variables, while the number of neurons at the output layer is equal to the number of dependent variables, which can be single or multiple. Neurons of two successive layers are connected by weights, which are updated while training the model. Each neuron receives the outputs of the previous layer, forms their weighted sum and passes it through an activation function (Figure 1).

Figure 1.

(left) ANN with one input layer, two hidden layers and one output layer; $W$, $V$, $\theta$: weighting matrices between IL and HL1, HL1 and HL2, and HL2 and OL; (right) an illustration of output generation.

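Figure 2.

CNN structure.

Let us take the four-layer ANN in Figure 1 as an example. During the training process, the value of the $k$-th neuron at the output layer can be calculated by

$$y_k = f\left(\sum_{j=1}^{J} \theta_{jk}\, f\left(\sum_{m=1}^{M} V_{mj}\, f\left(\sum_{n=1}^{N} W_{nm}\, x_n + b_m\right) + b_j\right) + b_k\right) \tag{1}$$

where $f$ is the activation function, $x_n$ is the independent variable at the $n$-th neuron of the input layer, $b_k$, $b_j$ and $b_m$ are bias values, $N$, $M$ and $J$ are the numbers of neurons at the input layer, hidden layer 1 and hidden layer 2, respectively, and $W_{nm}$, $V_{mj}$ and $\theta_{jk}$ are the entries of the weighting matrices $W$, $V$ and $\theta$. Then, the unknown parameters $W$, $V$, $\theta$ are adjusted to minimize a loss function such as the MSE (Mean Squared Error) by back propagation algorithms [38, 39].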
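The layer-by-layer computation of the four-layer ANN (weighted sum, bias, activation) can be sketched as follows. This is a minimal sketch rather than a training implementation: the weights are placeholders, the sigmoid is one assumed choice of activation, and the identity activation in the usage example is used only to make the arithmetic easy to check by hand.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense(inputs, weights, biases, f):
    """One layer: out[j] = f(sum_i weights[i][j] * inputs[i] + biases[j])."""
    return [
        f(sum(weights[i][j] * inputs[i] for i in range(len(inputs))) + biases[j])
        for j in range(len(biases))
    ]


def forward(x, W, b_m, V, b_j, theta, b_k, f=sigmoid):
    """Input layer -> hidden layer 1 -> hidden layer 2 -> output layer."""
    h1 = dense(x, W, b_m, f)
    h2 = dense(h1, V, b_j, f)
    return dense(h2, theta, b_k, f)


# Placeholder weights (N = 2 inputs, M = 1, J = 1, one output); not trained values.
identity = lambda z: z
y = forward([1.0, 2.0], W=[[1.0], [1.0]], b_m=[0.0],
            V=[[2.0]], b_j=[1.0], theta=[[3.0]], b_k=[0.0], f=identity)

# With all-zero parameters and a sigmoid activation, every neuron outputs 0.5.
y_zero = forward([1.0, 2.0], [[0.0], [0.0]], [0.0], [[0.0]], [0.0], [[0.0]], [0.0])
```

With the identity activation the hidden values are 3 and 7, so the output is 21; replacing `identity` with `sigmoid` gives the nonlinear network.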

2.2 Fundamental structure of CNN and LSTM

In this section, we examine two popular DL architectures, CNN and LSTM, which are widely used for multidimensional and time-sequential datasets. CNNs have been extensively applied in various fields, including traffic flow prediction [14, 40, 41], computer vision [42] and face recognition [43], while LSTMs are a special kind of RNN mainly applied to temporal data processing, such as traffic state prediction [34, 44], speech processing [45] and NLP (Natural Language Processing) [46].

The significant difference between a fully connected ANN and a CNN is that CNN neurons are connected only to a small subset of the input, which decreases the total number of parameters in the network [47]. CNNs can extract important and distinctive features from multidimensional data by means of filtering operations. A commonly used type of CNN, which is similar to the multi-layer perceptron (MLP), consists of several convolution layers followed by pooling layers and fully connected layers. The CNN structure is illustrated in Figure 2; it consists of the input layer, convolution layer, pooling layer and fully connected layer. A convolution layer outputs a higher abstraction of the features. Each convolution layer uses several filters, each with a distinct set of weights and with dimensions smaller than the data size. In the training phase, the filter weights are determined automatically according to the assigned task. The filters of each convolution layer are slid over the input, computing the sum of the products of input and filter at each position, which yields one feature map per filter. Each feature map detects a distinct high-level feature, which is then processed by a pooling layer and a fully connected layer. A ReLU activation function is applied to set all negative values in the feature map to zero.
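The filtering operation described above (sliding a small filter over the input, summing the products, then applying ReLU) can be sketched as follows. The input values and the filter are made up for illustration, and `conv2d_valid` and `relu_map` are hypothetical helper names; no padding, stride or pooling is included.

```python
def conv2d_valid(x, w):
    """Slide filter w over input x (no padding) and sum the elementwise products."""
    rows, cols = len(x), len(x[0])
    fr, fc = len(w), len(w[0])
    return [
        [
            sum(w[p][q] * x[i + p][j + q] for p in range(fr) for q in range(fc))
            for j in range(cols - fc + 1)
        ]
        for i in range(rows - fr + 1)
    ]


def relu_map(feature_map):
    """Set all negative entries of a feature map to zero."""
    return [[max(0.0, v) for v in row] for row in feature_map]


# Illustrative 3 x 3 input and a 1 x 2 filter that responds to a drop along each row.
x = [[1, 2, 4],
     [3, 3, 3],
     [5, 2, 0]]
w = [[1, -1]]
feature_map = conv2d_valid(x, w)
activated = relu_map(feature_map)
```

The same two filter weights are reused at every position, which is the weight sharing that keeps the parameter count small.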

The benefits of CNNs over other statistical learning methods and DL methods are as follows [48]:

  1. CNNs have the weight sharing feature, which reduces the number of trainable network parameters and in turn helps the network speed up the training process and avoid overfitting.

  2. Concurrently learning the feature extraction layers and the classification layer causes the model output to be both highly organized and highly reliant on the extracted features.

  3. Large-scale network implementation is much easier with CNN than with other neural networks.

CNNs and other kinds of ANNs such as MLP are not designed for sequences and time series data because they have no memory element. In such cases, RNNs can deliver more accurate results. RNNs are widely used in traffic state prediction because traffic data have temporal dependencies, which cannot be captured by CNNs or other feedforward ANNs. The RNN structure is illustrated in Figure 3: an RNN contains an internal memory element that memorizes the previous output.

Figure 3.

RNN structure.

The current output $h_t$ depends not only on the present input $x_t$ but also on the previous output $h_{t-1}$, which can be expressed by

$$h_t = f_h\left(W_i x_t + W_r h_{t-1} + b_h\right), \qquad y_t = f_t\left(W_o h_t + b_y\right)$$

where $W_i$, $W_r$ and $W_o$ are respectively the weighting matrices for the current input vector $x_t$, the previous output vector $h_{t-1}$ and the current output vector $h_t$, $b_h$ and $b_y$ are the bias vectors, and $f_h$ and $f_t$ are the activation functions.
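The recurrence, in which the current output depends on both the present input and the previous output, can be sketched as follows. This is a minimal sketch with placeholder weights; choosing tanh for the hidden activation and the identity for the output activation is an assumption, and `matvec`, `rnn_step`, `rnn_output` and `run_rnn` are hypothetical helper names.

```python
import math


def matvec(W, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def rnn_step(x_t, h_prev, W_i, W_r, b_h):
    """h_t = tanh(W_i x_t + W_r h_{t-1} + b_h); tanh is an assumed activation."""
    pre = [a + b + c for a, b, c in zip(matvec(W_i, x_t), matvec(W_r, h_prev), b_h)]
    return [math.tanh(p) for p in pre]


def rnn_output(h_t, W_o, b_y):
    """y_t = W_o h_t + b_y, with an identity output activation assumed."""
    return [a + b for a, b in zip(matvec(W_o, h_t), b_y)]


def run_rnn(xs, h0, W_i, W_r, b_h, W_o, b_y):
    """Process a sequence, carrying the hidden state from one step to the next."""
    h, ys = h0, []
    for x_t in xs:
        h = rnn_step(x_t, h, W_i, W_r, b_h)
        ys.append(rnn_output(h, W_o, b_y))
    return ys, h


# Scalar toy example with placeholder weights; the second input is zero.
ys, h_last = run_rnn([[0.5], [0.0]], [0.0], W_i=[[1.0]], W_r=[[1.0]],
                     b_h=[0.0], W_o=[[2.0]], b_y=[0.0])
```

Even though the second input is zero, the second hidden state is non-zero because it carries over the first one, which is the memory effect that a feedforward network lacks.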

LSTM was first proposed in [49] to overcome the vanishing gradient problem of other RNNs. A typical LSTM network consists of an input layer, a recursive hidden layer and an output layer. In the recursive hidden layer, each neuron is made up of four structures: a forget gate, an input gate, an output gate and a memory block. The state of the memory cell reflects the features of the input, while the three gates can read, update and delete the features stored in the cell. The LSTM structure is illustrated in Figure 4.

Figure 4.

LSTM structure.

The past information carried by the cell state $c_{t-1}$ can be regulated by the current input state $x_t$, the previous output $h_{t-1}$ and the gate $\sigma$, which is usually composed of a sigmoid neural network layer and a pointwise multiplication operation. From Figure 4, the forget gate $f_t$, input gate $i_t$, cell state $c_t$, output gate $o_t$ and hidden state $h_t$ at the $t$-th time instant can be expressed by

$$\begin{aligned}
f_t &= \sigma\left(W_f\,[h_{t-1}, x_t] + b_f\right) \\
i_t &= \sigma\left(W_i\,[h_{t-1}, x_t] + b_i\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh\left(W_c\,[h_{t-1}, x_t] + b_c\right) \\
o_t &= \sigma\left(W_o\,[h_{t-1}, x_t] + b_o\right) \\
h_t &= o_t \odot \tanh\left(c_t\right)
\end{aligned} \tag{6}$$

where $W_{\cdot}$ and $b_{\cdot}$ are respectively the weighting matrix and bias vector, the subscript $\cdot$ stands for $f$, $i$, $c$ or $o$, $[h_{t-1}, x_t]$ is the concatenation of the previous hidden state and the current input, and $\odot$ is point-wise multiplication. The forget gate decides which information needs attention and which can be ignored. The current input $x_t$ and the previous hidden state $h_{t-1}$ are passed through the sigmoid function, which generates values between 0 and 1 that determine how much of each part of the old cell state is kept. The input gate updates the cell state by the following operations: first, values between 0 and 1 are generated by passing the current input $x_t$ and the previous hidden state $h_{t-1}$ through a second sigmoid function. Then, the same hidden state and current input are passed through the tanh function to generate candidate values, which are scaled by the output of the first operation. Finally, the current cell state $c_t$ is obtained as the weighted sum of the scaled candidate values and the past cell state $c_{t-1}$. The output gate determines the value of the next hidden state. First, the current input $x_t$ and the previous hidden state $h_{t-1}$ are passed through a third sigmoid function. Then, the new cell state is passed through the tanh function. The product of these two values determines which information the hidden state should carry.

In general, LSTM addresses the vanishing gradient problem that makes network training difficult for long temporal sequences, so that long-term dependencies in the data can be learned to improve the prediction accuracy.
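A single LSTM time step with the forget, input and output gates can be sketched as follows. This is a minimal sketch under stated assumptions: each gate uses one weighting matrix applied to the concatenation of the previous hidden state and the current input, all parameter values are placeholders rather than trained weights, and `affine` and `lstm_step` are hypothetical helper names.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, v, b):
    """Return W v + b for a list-of-rows matrix W."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step; p holds the weights W_* and biases b_* of the four blocks."""
    z = h_prev + x_t  # list concatenation [h_{t-1}, x_t]
    f = [sigmoid(a) for a in affine(p["W_f"], z, p["b_f"])]    # forget gate
    i = [sigmoid(a) for a in affine(p["W_i"], z, p["b_i"])]    # input gate
    g = [math.tanh(a) for a in affine(p["W_c"], z, p["b_c"])]  # candidate values
    o = [sigmoid(a) for a in affine(p["W_o"], z, p["b_o"])]    # output gate
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c


# One hidden unit and one input; all-zero parameters are placeholders.
zeros = {k: [[0.0, 0.0]] for k in ("W_f", "W_i", "W_c", "W_o")}
zeros.update({k: [0.0] for k in ("b_f", "b_i", "b_c", "b_o")})
h, c = lstm_step([0.3], [0.0], [2.0], zeros)  # every gate equals sigmoid(0) = 0.5

# A large forget bias and a very negative input bias keep the old cell state.
keep = dict(zeros)
keep["b_f"], keep["b_i"] = [50.0], [-50.0]
h_keep, c_keep = lstm_step([0.3], [0.0], [2.0], keep)
```

The second call shows why the cell state can carry information across many steps: when the forget gate saturates near 1 and the input gate near 0, the cell state passes through almost unchanged.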

2.3 Traffic state prediction with hybrid DL models

Although CNN and LSTM have advantages in dealing with traffic data with spatio-temporal dependencies, the complex and non-linear characteristics of traffic data make it hard to obtain accurate predictions with a single model [5]. Several studies have proposed that prediction accuracy can be improved by hybrid modeling, such as combining CNN and LSTM [50, 51, 52, 53].

The spatial and temporal features can be fully extracted by hybrid models, in which the CNN captures the spatial features of traffic data and the LSTM extracts the temporal features. Suppose that the traffic state data of $K$ locations $\{s_i\}_{i=1}^{K}$ at time instants $\{t-N, \ldots, t-1\}$ are used as inputs to predict the traffic states at $\{t, t+1, \ldots, t+h\}$. The real-time measured data can be arranged into a matrix:

$$S = \begin{bmatrix}
s_{1,t-N} & s_{1,t-N+1} & \cdots & s_{1,t-1} \\
s_{2,t-N} & s_{2,t-N+1} & \cdots & s_{2,t-1} \\
\vdots & \vdots & \ddots & \vdots \\
s_{K,t-N} & s_{K,t-N+1} & \cdots & s_{K,t-1}
\end{bmatrix} \tag{7}$$

Note that $s_{k,t}$ in the matrix can also be a vector, which may include flow, speed, position, etc. A traffic state such as traffic flow has spatio-temporal characteristics; that is, the traffic state at each location at a certain time instant depends on that of the neighboring locations at the current or other time instants.
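The arrangement of the measurements into a location-by-time matrix can be sketched as follows. The readings are made-up numbers, and `build_state_matrix` and `column` are hypothetical helper names; each row is one location and each column is one time instant of the input window.

```python
def build_state_matrix(readings, t, N):
    """Return the K x N matrix whose columns are time instants t-N, ..., t-1.

    readings[k][tau] is the state of location k at time index tau.
    """
    if t - N < 0:
        raise ValueError("not enough history for the requested window")
    return [row[t - N:t] for row in readings]


def column(S, n):
    """n-th column of S: the state of all K locations at one time instant."""
    return [row[n] for row in S]


# Hypothetical flow counts for K = 2 locations over 5 time instants.
readings = [[10, 11, 12, 13, 14],
            [20, 21, 22, 23, 24]]
S = build_state_matrix(readings, t=4, N=3)
first_instant = column(S, 0)
```

Reading the matrix by columns gives spatial snapshots, and reading it by rows gives the time series of one location.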

There are mainly two ways of hybridization. The first is to extract spatio-temporal features by concatenating CNN and LSTM: each column of $S$ is first input into a CNN model to obtain the high-level spatial feature map, which is then input into an LSTM model to capture the temporal features. The second is to run the CNN and LSTM in parallel, treating the extracted spatial and temporal features as equally important: the same traffic state data are input into both models, and the final prediction is obtained by passing the outputs of the two models through an FC (Fully Connected) layer. The structure of the two hybridizations is illustrated in Figure 5.

Figure 5.

(left) concatenated hybrid model; (right) parallelized hybrid model.

For concatenated hybrid models, the real-time measured data matrix $S$ is first parallelized in the time domain; then, a one-dimensional CNN is used to capture the high-level spatial features for each channel; finally, the high-level spatial feature map is input into LSTM models to generate the final prediction.

The high-level spatial feature map output by the one-dimensional CNN can be expressed by


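where $x_{l,t}$ is the $l$-th high-level feature at the $t$-th time instant and can be obtained as follows

$$x_{l,t} = f_l\left(\left[w_t \ast S_t\right]_l + b_t\right) \tag{9}$$

where $w_t$ is the one-dimensional filter with $K-L+1$ coefficients, $\ast$ is the convolution operation, $[\cdot]_l$ denotes the $l$-th element of the filtered column, $S_t$ is the $t$-th column of $S$, $b_t$ is the bias and $f_l$ denotes the $l$-th activation function. Because the filter has $K-L+1$ coefficients and $S_t$ has $K$ entries, each column yields $L$ features. For simplicity, only one convolution layer is displayed in Eq. (9). In practice, multiple convolution and pooling layers can be used, depending on the data size.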
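The column-wise one-dimensional filtering can be sketched as follows. This is a minimal sketch under stated assumptions: a single filter is shared by all columns, ReLU is used as the activation, the sliding sum of products is used as the convolution (without flipping the filter), and the data and filter values are made up; `conv1d_valid` and `spatial_feature_map` are hypothetical helper names.

```python
def conv1d_valid(col, w, bias=0.0, f=lambda z: max(0.0, z)):
    """Filter one column of K entries with a filter of K-L+1 taps, giving L features."""
    L = len(col) - len(w) + 1
    return [f(sum(w[i] * col[l + i] for i in range(len(w))) + bias) for l in range(L)]


def spatial_feature_map(S, w, bias=0.0):
    """Apply the filter to every column of the K x N matrix S; the result is L x N."""
    K, N = len(S), len(S[0])
    cols = [conv1d_valid([S[k][n] for k in range(K)], w, bias) for n in range(N)]
    return [list(row) for row in zip(*cols)]  # transpose so that columns are time instants


# K = 4 locations, N = 2 time instants, averaging filter with 2 taps (so L = 3).
S = [[1, 2],
     [3, 4],
     [5, 6],
     [7, 8]]
X = spatial_feature_map(S, w=[0.5, 0.5])
```

Each output feature here averages two neighboring locations at one time instant, so the feature map has one row fewer than the data matrix.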

To extract the temporal features, the high-level spatial feature vector for single or multiple time instants will be selected for the input of each LSTM, which is denoted as


where $F_n$ is the high-level spatial feature map for the $n$-th LSTM network denoted as


where $M$ is the adjustable input window size. The spatio-temporal features output by the LSTM are denoted as $H = [H_0, H_1, \ldots, H_{N-1}]$, where $H_n$ is a $K \times T$ matrix and $T$ is the adjustable output window size with $M + T \le N$. Combining Figure 4 and Eq. (6), the generated spatio-temporal features are iteratively determined by


where $W_{\cdot,F}$ and $W_{\cdot,H}$ are respectively the weighting matrices for the current input high-level spatial feature matrix and the previous spatio-temporal feature matrix, $\mathrm{vec}$ denotes vectorization, which is needed because $F_n$ and $H_{n-1}$ differ in size, and $\sigma$ and $\tanh$ are respectively the sigmoid function and the hyperbolic tangent function with vector input.

Concatenated hybrid models use a one-dimensional CNN to obtain spatial features over a smaller range and do not contain a fully connected layer at the output of the LSTM models; they therefore have low learning complexity. However, the temporal features delivered by the LSTM are strongly correlated with the spatial features output by the CNN, which requires some special assumptions about the raw data.

For parallelized hybrid models, the historical data matrix $S$ need not be parallelized in the time domain; rather, it is input into a CNN and an LSTM simultaneously to extract the high-level spatial and temporal feature maps independently. Then, the final prediction is generated by merging the high-level spatial and temporal features via a fully connected layer. The high-level spatial feature map can be obtained by filtering $S$ via two-dimensional CNNs. Suppose that we use a CNN with $L$ convolution layers and that the spatial filter for the $l$-th layer is denoted as $w_{i,j}^{l}$, $i = 0, 1, \ldots, I-1$; $j = 0, 1, \ldots, J-1$, where $I \times J$ is the size of the spatial filter. Given the historical data matrix $S$ in Eq. (7), the output of the $l$-th layer of the $n$-th CNN is obtained by


where $o_{i,j,n}^{l,c}$ represents the $(i,j)$-th output of the $l$-th layer of the $n$-th CNN, $\sigma$ is the activation function and $b_{i,j}^{l}$ is the $(i,j)$-th bias of the $l$-th layer of the CNN.

An LSTM is used to obtain the high-level temporal feature map. The output of the $n$-th LSTM can be obtained by Eq. (12) with the input of $S_n$, which can be denoted as


By applying a fully connected layer to the output of the $L$-th CNN layer $o^{L,c}$ and the output of the $n$-th LSTM $o_n^{L}$, the $n$-th final prediction can be obtained by


where $W_F$ and $b_F$ are the weighting matrix and bias vector for the fully connected layer, respectively.

In parallelized hybrid models, the spatial and temporal feature maps are considered to be of the same importance, and thus are extracted independently. The fully connected layer merges the output of CNN and LSTM without any special assumptions about the high-level spatial and temporal features.
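The merging step of the parallelized hybrid model can be sketched as follows: the spatial features from the CNN branch and the temporal features from the LSTM branch are concatenated and passed through one fully connected layer. The feature values and weights are placeholders, a linear output is assumed, and `fully_connected_merge` is a hypothetical helper name; the two branches themselves are not modeled here.

```python
def fully_connected_merge(spatial, temporal, W_F, b_F):
    """Concatenate both feature vectors and apply y = W_F z + b_F."""
    merged = list(spatial) + list(temporal)
    return [sum(w * v for w, v in zip(row, merged)) + b for row, b in zip(W_F, b_F)]


# Placeholder features: two from the CNN branch, one from the LSTM branch.
spatial_features = [1.0, 2.0]
temporal_features = [3.0]
prediction = fully_connected_merge(spatial_features, temporal_features,
                                   W_F=[[0.5, 0.5, 1.0]], b_F=[0.1])
```

Because the layer sees both feature vectors side by side, the relative importance of spatial and temporal information is learned through the weights instead of being fixed in advance.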

Traffic state has strong periodic features because people tend to repeat similar behaviors in the same time period on different days or on the same day of different weeks. For example, most people routinely go to work in the morning and go home in the evening during the peak hours [53], and most people go shopping on weekends rather than weekdays. These periodic features can be used as supplementary information to predict the future traffic state. For short-term traffic state prediction, the real-time data only contain the data before the prediction time instant, but the historical data of previous days or weeks cover the full period; that is, the traffic state observed after the inspected time instant on previous days or weeks can also be used to predict the state at the inspected time instant. If parallelized hybrid models are used, the complete prediction structure should contain a CNN and an LSTM for the real-time data and a CNN and a bidirectional LSTM for the historical data, which are connected by a fully connected layer.

The bidirectional LSTM is composed of two independent LSTMs, one forward and one backward, whose inputs are the time series before and after the inspected time instant, respectively. The final prediction of the bidirectional LSTM is obtained by concatenating the outputs of the forward and backward LSTMs. The structure of the bidirectional LSTM is depicted in Figure 6.

Figure 6.

Bidirectional LSTM structure.

Suppose that, additionally, we have historical traffic state data $S^{d}$ on the $d$-th day, which is denoted as


where $D$ is the number of previous days. We assume that each previous day has data at $2N+1$ time instants available for prediction. The input and output of the forward LSTM are given by Eqs. (14) and (15), and the input of the backward LSTM is given by


Using Eq. (12), the output of the $n$-th backward LSTM can be obtained by


Then, the $n$-th temporal feature can be obtained by concatenating $o_n^{L}$ and $o_n^{BL}$.


3. Reinforcement learning based traffic signal control

Accurate and efficient traffic state prediction provides continuous and precise traffic status and vehicle states based on past information. How to use the current and predicted traffic states to make real-time optimum decisions is the main task of the traffic signal control module in ITS. The objectives of traffic signal control include minimizing the average waiting time at multiple intersections, reducing traffic congestion and maximizing network capacity. Real-time linear feedback control approaches and MPC (Model-based Predictive Control) have been designed specifically for traffic signal control systems to achieve these targets. The drawback of linear feedback control is that the system must remain in the linear region at all times for the controller to be valid. Although MPC has advantages such as the ability to impose constraints, its main shortcoming is that it needs an accurate dynamic model, which is difficult to obtain for traffic control systems. Data-driven approaches such as DRL (Deep Reinforcement Learning) based traffic control techniques have been widely proposed for ITS in recent years because RL can solve complex control problems and deep learning helps to approximate highly nonlinear functions from complex datasets. In this section, we first briefly review the fundamental principles of RL. Then, we focus on multi-agent DRL based traffic signal control techniques: decentralized multi-agent advantage actor-critic, which can converge to a local optimum and overcomes the scalability issue by accounting for the non-stationarity of MDP transitions caused by the policy updates of neighboring agents; and the Nash Q-learning strategy, which can converge to a Nash equilibrium by considering only the competition among agents.

3.1 Overview of reinforcement learning

Reinforcement Learning (RL) is a promising data-driven approach for decision-making and control in complex dynamic systems. RL is formally based on the Markov Decision Process (MDP), a general mathematical framework for sequential decision-making that consists of five elements [54]:

  • A set of states $S$, which contains all possible states the system can be in;

  • A set of actions $A$, which contains all possible actions the system can take;

  • A transition probability $T(s_{t+1} \mid s_t, a_t)$, which maps a state-action pair at each time $t$ to a distribution over the next state $s_{t+1}$;

  • A reward function $r(s_t, a_t, s_{t+1})$, which gives the instantaneous reward for taking action $a_t$ in state $s_t$ when transitioning to the next state $s_{t+1}$;

  • A discount factor $\gamma$ between 0 and 1 for future rewards.

RL aims to maximize a numerically defined reward by interacting with the environment, learning how to behave without any prior knowledge. In traffic signal control systems, RL is used to find the best control policy $\pi$ that maximizes the expected cumulative reward $\mathbb{E}[R_t \mid s, \pi]$ for each state $s$, where the cumulative discounted reward is

$$R_t = \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k} \tag{19}$$

where $\gamma$ is the discount factor that reflects the importance of future rewards and $r_t$ is the $t$-th instantaneous reward. Choosing a larger $\gamma$ means that the agent’s actions depend more strongly on future rewards, whereas a smaller $\gamma$ results in actions that mostly care about $r_t$.
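The effect of the discount factor can be checked numerically with a short sketch. The reward sequence is finite and made up for illustration, and `discounted_return` is a hypothetical helper name; it accumulates the rewards backward so that each reward is weighted by the appropriate power of the discount factor.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**k * rewards[k], accumulated from the last reward backward."""
    R = 0.0
    for r in reversed(rewards):
        R = r + gamma * R
    return R


rewards = [1.0, 1.0, 1.0]  # illustrative instantaneous rewards
far_sighted = discounted_return(rewards, gamma=0.5)  # 1 + 0.5 + 0.25
myopic = discounted_return(rewards, gamma=0.0)       # only the first reward counts
```

With a discount factor of zero, only the immediate reward contributes, which matches the statement that a small discount factor makes the agent care mostly about the current reward.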

RL can generally be classified into model-based RL, which knows or learns the transition model from state $s_t$ to $s_{t+1}$, and model-free RL, which explores the environment without knowing or learning a transition model. Model-based RL emphasizes planning; that is, agents can keep track of all the routes at future time instants by predicting the next state and reward. Model-free RL can estimate the optimal policy without using or estimating the dynamics of the environment. In practice, model-free RL estimates either a value function or the policy by interacting with the environment and observing the responses, and can accordingly be classified into value-based RL and policy-based RL. Value-based RL is mostly used for control problems with a discrete state-action space. Q-learning and SARSA are the two main value-based RL techniques, in which the values of state-action pairs (Q-values) are stored in a Q-table and are learned via the recursive nature of the Bellman equation, which uses the Markov property [54]:

$$Q^{\pi}(s_t, a_t) = \mathbb{E}_{s_{t+1}}\left[r_t + \gamma \sum_{a_{t+1}} \pi(a_{t+1} \mid s_{t+1})\, Q^{\pi}(s_{t+1}, a_{t+1})\right] \tag{20}$$

where $\pi(a \mid s)$ is the probability that action $a$ is selected given state $s$ and $Q^{\pi}$ is the expected accumulative reward for an action $a$ and a state $s$:

$$Q^{\pi}(s, a) = \mathbb{E}\left[R_t \mid s_t = s, a_t = a, \pi\right] \tag{21}$$
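The stochasticity in Eq. (21) comes from the control policy $\pi$ and the transition probability from $s_t$ to $s_{t+1}$. Value-based RL updates $Q^\pi$ with a learning rate $0 < \alpha < 1$ by

$$Q^\pi(s_t, a_t) \leftarrow Q^\pi(s_t, a_t) + \alpha \left( y_t - Q^\pi(s_t, a_t) \right) \tag{22}$$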

where $y_t$ is the Temporal Difference (TD) target for $Q^\pi(s_t, a_t)$ and can be determined by


The learning rate $\alpha$ controls the speed at which $Q^\pi(s_t, a_t)$ is updated: a larger $\alpha$ allows a fast update but may cause oscillation over training epochs, while a smaller $\alpha$ tends to preserve the old $Q^\pi(s_t, a_t)$ and thus may take a longer time to train.
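The role of the learning rate can be sketched with a tabular update. The snippet below assumes the Q-learning form of the TD target, $y_t = r_t + \gamma \max_a Q(s_{t+1}, a)$, which is one common choice (SARSA would instead use the action actually taken in the next state). The state and action labels, the reward value and the function names are hypothetical.

```python
from collections import defaultdict


def td_target(q, reward, next_state, actions, gamma):
    """Assumed Q-learning target: r + gamma * max_a Q(next_state, a)."""
    return reward + gamma * max(q[(next_state, a)] for a in actions)


def q_update(q, state, action, target, alpha):
    """Move Q(state, action) a fraction alpha of the way toward the target."""
    q[(state, action)] += alpha * (target - q[(state, action)])
    return q[(state, action)]


actions = ["keep_phase", "switch_phase"]  # hypothetical action labels
q_fast = defaultdict(float)               # Q-tables initialised to zero
q_slow = defaultdict(float)

# One observed transition with reward -4.0 (e.g. a negative waiting time).
y = td_target(q_fast, -4.0, "s1", actions, gamma=0.9)
fast = q_update(q_fast, "s0", "keep_phase", y, alpha=0.9)
slow = q_update(q_slow, "s0", "keep_phase", y, alpha=0.1)
print(y, fast, slow)
```

After a single update, the table with the larger learning rate has moved most of the way to the target, while the one with the smaller learning rate stays close to its old value.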

Value-based RL does not work well for continuous control problems with an infinite-dimensional action space or for high-dimensional problems, because it is difficult to explore all the states in a large, continuous space and store them in a table. In such cases, policy-based RL can provide better solutions than value-based RL. By treating the policy $\pi_\theta$ as a probability distribution over state-action pairs parameterized by $\theta$, the policy parameters $\theta$ are updated to maximize the objective function $J(\theta)$, which can be written as


The optimum policy parameters $\theta^*$ are selected to maximize $J(\theta)$, with

$$\theta^* = \arg\max_{\theta} J(\theta) \tag{25}$$

Policy-based RL tries to select the optimum actions by using the gradient of the objective function with respect to θ, which can be written as


where $Q^{\pi_\theta}(s, a)$ cannot be determined directly; thus, the Monte Carlo method is used to sample $Q^{\pi_\theta}(s, a)$ from $M$ trajectories and take the empirical average


where $R_{t_m}$ is given by Eq. (19). No analytical solution can be provided for Eq. (27); thus, the optimum solution of Eq. (25) can be obtained by stochastic gradient descent algorithms with


where $R_t$ is an estimator of the mean of the reward function. Eq. (28) shows that the policy gradient can learn a stochastic policy by updating the parameters $\theta$ at each iteration. Thus, policy-based RL does not need to implement an explicit exploration-exploitation trade-off: a stochastic policy allows the agent to explore the state space without always taking the same action. However, policy-based RL typically converges to a local optimum rather than a global optimum.
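A minimal sketch of a Monte Carlo policy-gradient step is given below, assuming a softmax policy over a small discrete action set with a single (implicit) state, so that $\theta$ is simply a list of action preferences. The sampled (action, return) pairs, the step size `beta` and all function names are illustrative assumptions rather than the chapter's implementation.

```python
import math


def softmax(theta):
    """Action probabilities of a softmax policy with preferences theta."""
    m = max(theta)
    exps = [math.exp(x - m) for x in theta]
    z = sum(exps)
    return [e / z for e in exps]


def grad_log_softmax(theta, action):
    """Gradient of log pi_theta(action): one-hot(action) - probabilities."""
    probs = softmax(theta)
    return [(1.0 if k == action else 0.0) - p for k, p in enumerate(probs)]


def policy_gradient_step(theta, samples, beta):
    """Ascend the empirical average of grad log pi(a_m) * R_m over samples."""
    grad = [0.0] * len(theta)
    for action, ret in samples:
        g = grad_log_softmax(theta, action)
        for k in range(len(theta)):
            grad[k] += g[k] * ret / len(samples)
    return [t + beta * g for t, g in zip(theta, grad)]


theta = [0.0, 0.0]                                  # uniform initial policy
samples = [(0, 1.0), (1, 0.0), (0, 1.0), (1, 0.0)]  # hypothetical (a, R)
new_theta = policy_gradient_step(theta, samples, beta=0.5)
print(new_theta, softmax(new_theta))
```

Because action 0 was followed by higher returns in the samples, its probability increases after the step, while the policy remains stochastic and keeps sampling both actions.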

Actor-critic RL combines the characteristics of policy-based and value-based methods: an actor controls the agent's behavior based on the policy, while a critic evaluates the taken action based on the value function. From Eq. (27), the objective function can be rewritten as


The loss function for policy and value updating can be respectively defined as


where $V_w(s_{t_m})$ is the approximation function for $Q^{\pi_\theta}(s_{t_m}, a_{t_m})$ and $w$ is its parameter. Actor-critic RL aims to iteratively find the optimum $\theta$ and $w$ that minimize the objective functions in (30).
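The interplay of actor and critic can be sketched as two empirical losses over a minibatch. The forms below (a return-weighted negative log-likelihood for the actor and a mean squared error for the critic) are a common choice and are assumed here for illustration; the log-probabilities, returns and value estimates are made-up numbers.

```python
import math


def actor_critic_losses(log_probs, returns, values):
    """Return (policy_loss, value_loss) averaged over a minibatch.

    log_probs: log pi_theta(a_m | s_m) of the actions that were taken
    returns:   sampled discounted returns R_m
    values:    critic estimates V_w(s_m)
    """
    m = len(returns)
    # Actor: minimising this raises the log-probability of high-return actions.
    policy_loss = -sum(lp * r for lp, r in zip(log_probs, returns)) / m
    # Critic: regress the value estimates onto the sampled returns.
    value_loss = sum((r - v) ** 2 for r, v in zip(returns, values)) / m
    return policy_loss, value_loss


log_probs = [math.log(0.5), math.log(0.25)]  # hypothetical minibatch
returns = [2.0, 4.0]
values = [1.0, 4.0]
policy_loss, value_loss = actor_critic_losses(log_probs, returns, values)
print(policy_loss, value_loss)
```

In practice both losses would be minimised with gradient steps on $\theta$ and $w$; the sketch only evaluates them.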

Recall that $R_{t_m}$ is obtained by taking $T$ samples from the stored minibatch with $R_{t_m} = \sum_{i=0}^{T-1} \gamma^i r_{t_m+i}$; thus, it may have a relatively large bias and variance. Advantage Actor-Critic (A2C) aims to solve this problem by learning the Advantage values $A_{t_m}$ instead of $R_{t_m}$, which are defined by


where $w$ is the parameter of the approximation function from the last iteration. Then, the loss function for policy updating can be rewritten as


3.2 Multi-agent deep reinforcement learning based traffic signal control

A real traffic network consists of multiple signalized intersections, each of which can be considered as an agent. The states of the $i$-th agent (intersection), such as the total number of approaching vehicles, the position and speed of each vehicle, and the vehicle flow and density of the links, are determined not only by the $i$-th agent's action (adjustment of the green-time proportion) but also by other agents' actions. Hence, traffic signal control can be modeled as a cooperative and competitive game, in which the learning process must consider other agents' actions to reach a globally optimum solution. When multiple agents are present, the standard MDP is no longer suitable for describing the environment because actions from other agents can influence the state dynamics. In such a case, a Markov game can be defined by the tuple $\left(N, \{S_i\}_{i \in N}, \{A_i\}_{i \in N}, T, \{\pi_{\theta_i}\}_{i \in N}, \{r_i\}_{i \in N}, \gamma\right)$, where

  • $N$ is the number of agents, with $N > 1$;

  • $S_i$ is the state space of the $i$-th agent, which contains all possible states the $i$-th agent can be in; $S = S_1 \times S_2 \times \cdots \times S_N$ is called the joint state space;

  • $A_i$ is the action space of the $i$-th agent; $A = A_1 \times A_2 \times \cdots \times A_N$ is called the joint action space;

  • $T(s_{t+1,i} \mid s_{t,i}, a_{t,1}, a_{t,2}, \ldots, a_{t,N})$ is the transition probability of the $i$-th agent, i.e., the probability of transiting from $s_{t,i}$ to $s_{t+1,i}$ under a joint action $a = (a_{t,1}, a_{t,2}, \ldots, a_{t,N}) \in A$;

  • $\pi_{\theta_i}$ is the control policy of the $i$-th agent, i.e., the probability of selecting an action given a state; $\pi_\theta = (\pi_{\theta_1}, \pi_{\theta_2}, \ldots, \pi_{\theta_N})$ is the global control policy;

  • $r_i$ is the instantaneous reward function of the $i$-th agent;

  • $\gamma \in [0, 1]$ is the discount factor.

Centralized multi-agent RL treats the multi-agent system as a single-agent system with joint state space $S$ and joint action space $A$, and thus it can achieve the global optimum. However, it is infeasible for large-scale traffic control systems because of the extremely high dimension of the joint state-action space [32]. Decentralized Multi-Agent deep RL (MARL) can overcome the scalability issue by distributing the global control to the local agents, each of which is controlled based on its locally observed states and on the states and actions communicated by other agents.
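To make the scalability argument concrete, the sketch below enumerates the joint action space for two intersections and counts its size as the number of agents grows. It assumes that every intersection has the same small set of local actions; the phase names and helper names are hypothetical.

```python
from itertools import product


def joint_action_space(local_actions, n_agents):
    """Enumerate every joint action (a_1, ..., a_N) of N identical agents."""
    return list(product(local_actions, repeat=n_agents))


def joint_space_size(n_local_actions, n_agents):
    """Size of the joint action space, |A_i| ** N, without enumerating it."""
    return n_local_actions ** n_agents


local_actions = ("NS_green", "EW_green")  # hypothetical two-phase signal
two_agents = joint_action_space(local_actions, 2)
print(two_agents)

# A centralized controller faces |A_i| ** N joint actions, whereas each
# decentralized agent only chooses among its own |A_i| local actions.
for n in (2, 5, 25):
    print(n, joint_space_size(len(local_actions), n))
```

Even with only two local actions per intersection, a 25-intersection network already has more than 33 million joint actions, which illustrates why a centralized table or network over the joint space quickly becomes impractical.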

Suppose we have a multi-intersection traffic network, which can be modeled as a graph $G(V, E)$. The size of the graph is the number of intersections, denoted by $N$, and $e_{ij} = 1$ if vertices $v_i$ and $v_j$ are neighbors. The neighborhood of $v_i$, denoted by $\Omega_i$, can be manually selected within a certain coverage limit, and thus the local region is defined by $L_i = \Omega_i \cup v_i$. Thanks to the advancement of communication technologies for ITS, it is possible to share the instantaneous rewards $\{r_i\}_{i \in N}$, states $\{s_{t,i}\}_{i \in N}$, actions $\{a_{t,i}\}_{i \in N}$ and policies $\{\pi_{\theta_i}\}_{i \in N}$ among agents. The objective of MARL is to maximize the total reward function $Q^{\pi_\theta} \approx \sum_{i=1}^{N} Q_i^{\pi_\theta}$, where $Q_i^{\pi_\theta}$ is the local reward of the $i$-th agent given the global control policy and can be expressed by


where the global states and policies can be communicated from all other agents in the system as well as from the neighborhood $L_i$. In multi-agent traffic signal control problems, the summation of the local rewards can only approximate $Q^{\pi_\theta}$ because of competition among intersections; for example, a decrease in the average vehicle waiting time at the $i$-th intersection may increase the waiting time at neighboring intersections. Hence, the objective of decentralized MARL can be approximated as the maximization of each local reward function with consideration of the states and policies from the neighborhood, which can be expressed by


We assume that Eq. (33) has a continuous state-action space, and thus multi-agent A2C can be applied to search for the optimum policy parameters. From Eq. (31), the Advantage value for the $i$-th intersection can be expressed by


where $R_{t_m,i}$ is the weighted sum of $T$ samples from the minibatch, and $V_{w_i}(s_{t_m} \mid \pi_{\theta_i}, \pi_{\theta_j}, j = 1, \ldots, N, j \neq i)$ is the approximation of the value function for the input sample $s_{t_m}$ with the function parameter of the last iteration. The $T+1$ samples are obtained from the same stationary policy $\{\pi_{\theta_i}, \pi_{\theta_j}, j = 1, \ldots, N, j \neq i\}$, whose elements respectively represent the control policies of the $i$-th intersection and of the neighboring intersections at the last iteration, and $s_{t_m, L_i} := \{s_{t_m, j}\}_{j \in L_i}$ represents the states of all neighbors of the $i$-th intersection at $t_m$. We assume that agents are synchronized, that is, the policy and value updates for all agents happen simultaneously at the end of each episode, and the delay for information exchange among agents is ignored. Therefore, within each episode, the dynamic system can be considered stationary although the trajectory of each agent is influenced by multiple policies $\{\pi_{\theta_i}, \pi_{\theta_j}, j = 1, \ldots, N, j \neq i\}$. Then, the loss functions for policy and value updating can be obtained as


If each agent follows Eqs. (35) and (36) in a decentralized manner, a local optimum policy $\pi_{\theta_i}$ can be achieved if the other agents achieve the optimum policies $\pi_{\theta_j}, j = 1, \ldots, N, j \neq i$ within the same episode. However, if the optimum $\theta_j, j = 1, \ldots, N, j \neq i$ cannot be achieved, or if $\theta_j, j = 1, \ldots, N, j \neq i$ are updated within the same episode, the policy gradient may be inconsistent across minibatches, and thus convergence to a local optimum cannot be guaranteed, since $A_{t_m,i}$ is conditioned on the changing $\{\pi_{\theta_i}, \pi_{\theta_j}, j = 1, \ldots, N, j \neq i\}$.

In practice, the information exchange among multiple intersections may not be synchronized and communication delay should be considered, which causes policy changes within the same episode and thus leads to non-stationarity. Several studies have tried to stabilize convergence and relieve non-stationarity. Tesauro proposes “Hyper-Q” learning, in which values of mixed strategies rather than base actions are learned and other agents' strategies are estimated from observed actions via Bayesian inference [55]. Foerster et al. include low-dimensional fingerprints, such as $\varepsilon$ of $\varepsilon$-greedy exploration and the number of updates, in the Q-function input to disambiguate the age of replayed samples.

To relieve non-stationarity, the key is to keep the policies of neighboring agents fixed within one episode. We can apply a DNN to approximate the local policy $\pi_{\theta_i, t_m}(s_{t_m, L_i})$ when the size of $L_i$ is relatively large. If we consider the latest sampled policies from neighbors as additional inputs of the DNN, besides the current state from $L_i$, the local policy can be rewritten as


Then, the loss function for policy updating can be rewritten by


Even if the policies of the neighbors are fixed and treated as additional inputs, it is still difficult to approximate $A_{t_m,i}$ given by Eq. (35), and thus convergence to a local optimum cannot be guaranteed. Recall that the total reward function can be approximated as the summation of the local reward functions. Thus, a decomposable global reward with a spatial discount factor can be proposed to solve the problem.


where $\alpha$ is the spatial discount factor with $0 \le \alpha \le 1$ and $d$ is the distance between the $i$-th agent and the $j$-th agent. The spatial discount factor scales down the reward in spatial order to emphasize the role played by the policy of the local agent. Compared to sharing the same weights across agents, the spatial discount factor is more flexible for the trade-off between greedy control ($\alpha = 0$) and cooperative control ($\alpha = 1$), and is more relevant for estimating the advantage of the local policy. By applying the spatial discount factor to neighboring states, we have


Then, the cumulative discounted reward can be obtained by


and the local return and Advantage value $A_{t_m,i}$ for the $i$-th agent can be expressed by


and Eq. (38) can be rewritten as


The loss function for value updating can be expressed as


The decentralized MA2C can overcome the scalability issue and achieve a local optimum (Pareto optimality). How to achieve the global optimum with a decentralized approach when the global reward function is non-convex is a direction for future research.
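The spatially discounted reward discussed in this section can be sketched as follows, assuming that the distance $d$ is the hop count between intersections on the network graph and that the reward seen by agent $i$ is $\sum_j \alpha^{d(i,j)} r_j$. Both are assumptions made for illustration, as are the three-intersection line network and the reward values.

```python
from collections import deque


def hop_distances(adjacency, source):
    """Breadth-first search: hop count from source to each reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adjacency[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def spatially_discounted_reward(adjacency, rewards, agent, alpha):
    """Sum of alpha**d(agent, j) * r_j over all reachable intersections j."""
    dist = hop_distances(adjacency, agent)
    return sum((alpha ** d) * rewards[j] for j, d in dist.items())


adjacency = {0: [1], 1: [0, 2], 2: [1]}  # three intersections in a line
rewards = {0: -2.0, 1: -4.0, 2: -8.0}    # hypothetical local rewards

greedy = spatially_discounted_reward(adjacency, rewards, 0, 0.0)
mixed = spatially_discounted_reward(adjacency, rewards, 0, 0.5)
cooperative = spatially_discounted_reward(adjacency, rewards, 0, 1.0)
print(greedy, mixed, cooperative)
```

With $\alpha = 0$ agent 0 sees only its own reward, with $\alpha = 1$ it sees the unweighted sum over the network, and intermediate values weight nearer intersections more heavily.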

Compared to decentralized MA2C, Nash Q-learning does not consider cooperation among agents; it therefore has lower computational complexity but can only achieve the Nash equilibrium. Nash Q-learning aims to find the optimal global control policy $\pi_\theta$ by iteratively updating agents' actions to maximize their Q functions, assuming Nash equilibrium behavior, that is:


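where $Q_{t,i}(s_{t,i}, a_{t,1}, \ldots, a_{t,N})$ is the Nash Q function at the $t$-th iteration for the $i$-th agent and $\pi^*_{\theta,t}$ is the joint Nash equilibrium strategy at the $t$-th iteration. Under the joint Nash equilibrium strategy, the following relation should be fulfilled:

$$Q_{t,i}\left(s_{t+1,i}, a_{t+1,1}, \ldots, a_{t+1,N} \mid \pi^*_{\theta,t+1}\right) \ge Q_{t,i}\left(s_{t+1,i}, a_{t+1,1}, \ldots, a_{t+1,N} \mid \pi^*_{\theta_1,t+1}, \ldots, \pi^*_{\theta_{i-1},t+1}, \pi_{\theta_i,t+1}, \pi^*_{\theta_{i+1},t+1}, \ldots, \pi^*_{\theta_N,t+1}\right) \quad \text{for } i = 1, 2, \ldots, N \tag{46}$$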

Eqs. (45) and (46) show that, at each iteration $t$, agent $i$ observes its current state $s_{t,i}$ and takes an action to maximize its Q function based on $s_{t,i}$ and the other agents' actions. The update of the $i$-th agent's action causes the actions of agents $-i$, which denotes all agents excluding agent $i$, to be updated. The $t$-th joint Nash equilibrium strategy is not obtained until the joint Nash equilibrium action has converged.
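The equilibrium condition can be illustrated on a single stage game between two intersections. The sketch below is not the full Nash Q-learning algorithm: it assumes that each agent's Q-values for the current state are already given as a table over joint actions, and it only iterates best responses until no agent wants to deviate. The payoff numbers (negative waiting times) and the function names are hypothetical.

```python
def is_nash(q_tables, joint_action, n_actions):
    """True if no agent gains by deviating unilaterally from joint_action."""
    for i, q in enumerate(q_tables):
        current = q[joint_action]
        for a in range(n_actions):
            deviation = joint_action[:i] + (a,) + joint_action[i + 1:]
            if q[deviation] > current:
                return False
    return True


def best_response_iteration(q_tables, start, n_actions, max_rounds=100):
    """Let agents update in turn; stop when a full round changes nothing."""
    joint = start
    for _ in range(max_rounds):
        previous = joint
        for i, q in enumerate(q_tables):
            best = max(range(n_actions),
                       key=lambda a: q[joint[:i] + (a,) + joint[i + 1:]])
            joint = joint[:i] + (best,) + joint[i + 1:]
        if joint == previous:
            return joint
    return None  # no pure-strategy fixed point found within max_rounds


# Hypothetical Q-values keyed by the joint action (a_1, a_2), actions 0 or 1.
q1 = {(0, 0): -3.0, (0, 1): -10.0, (1, 0): -1.0, (1, 1): -6.0}
q2 = {(0, 0): -3.0, (0, 1): -1.0, (1, 0): -10.0, (1, 1): -6.0}

equilibrium = best_response_iteration((q1, q2), start=(0, 0), n_actions=2)
print(equilibrium, is_nash((q1, q2), equilibrium, 2))
```

In this toy example the iteration settles at the joint action (1, 1), which satisfies the no-deviation check even though (0, 0) would give both agents a higher value, mirroring the gap between a Nash equilibrium and a cooperative optimum.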

In traffic signal control applications, the state space $S$ can comprise the number of vehicles, the positions, speeds and lane order of vehicles, the phase duration, etc., for the specific intersections; the action space $A$ can comprise phase switch, phase duration or phase percentage, etc.; and the reward $r$ can be based on queue length, average waiting time, cumulative time delay, network capacity, etc. The objective of the multi-agent RL strategy is to minimize the accumulated average waiting time or queue length, etc.

By conducting a simulation in SUMO for a two-intersection case, we observe in Figure 7 that centralized DQN outperforms centralized Q-learning in terms of reward value (average waiting time, in seconds) and convergence rate (number of iterations). When the number of agents is small (two, in this case), the average waiting time achieved by the centralized methods converges to a local optimum that is better than the Nash equilibrium delivered by Nash Q-learning. However, Nash Q-learning converges faster than the centralized methods.

Figure 7. Comparison of different multi-agent RL methods for traffic signal control.


4. Summary

In this chapter, we introduced deep learning-based traffic state prediction techniques, which can provide accurate future information for traffic control and decision making. Traffic state data exhibit strong correlation in the spatial and temporal domains, which can be exploited by applying CNN and LSTM techniques to improve prediction accuracy. CNN is used to capture high-level spatial features, while LSTM handles time-sequential data well by extracting high-level temporal features. We first reviewed the fundamentals of deep learning and presented the architectures of CNN and LSTM. Then, we introduced how to combine these two models to form concatenated hybrid models and parallelized hybrid models. Finally, we proposed bidirectional LSTM models to enhance prediction performance by learning additional high-level temporal features from the historical data of previous days.

Furthermore, we introduced the decentralized multi-agent Advantage Actor-Critic technique and Nash Q-learning for traffic signal control applications. We first briefly reviewed the fundamental principles of RL. Then, we focused on multi-agent DRL-based traffic signal control techniques such as decentralized multi-agent Advantage Actor-Critic, which can converge to a local optimum and overcome the scalability issue by accounting for the non-stationarity of the MDP transitions caused by policy updates in the neighborhood.

The main contributions of this chapter can be summarized as follows:

  • We reviewed the state-of-the-art techniques in traffic state prediction and traffic control strategies, and provided readers with a clear framework for understanding how to apply deep learning models to traffic state prediction and how to deal with multi-agent traffic control by using RL strategies.

  • We proposed hybrid prediction models, which utilize CNN and LSTM to capture the spatio-temporal features of traffic data.

  • We proposed a multi-agent deep RL (MARL) strategy, which operates in a decentralized manner and considers the cooperation among agents, and thus can overcome the scalability issue and achieve a local optimum.

  • We compared the centralized RL methods (Q-learning and DQN) with the Nash Q-learning strategy in terms of reward value and convergence rate.


References

  1. Gora P, Rüb I. Traffic models for self-driving connected cars. Transportation Research Procedia. 2016;14:2207-2216
  2. Calvert S, Schakel WJ, Lint JWC. Will automated vehicles negatively impact traffic flow? Journal of Advanced Transportation. 2017;2017:8
  3. Yang Cheng YZ. Connected Automated Vehicle Highway (CAVH): A Vision and Development Report for Large Scale Automated Driving System (ADS) Deployment
  4. Liu Q, Li X, Yuan S, Li Z. Decision-making technology for autonomous vehicles: Learning-based methods, applications and future outlook. 2021
  5. Miglani A, Kumar N. Deep learning models for traffic flow prediction in autonomous vehicles: A review, solutions, and challenges. Vehicular Communications. 2019;20:100184
  6. Duan P, Mao G, Yue W, Wang S. A unified STARIMA based model for short-term traffic flow prediction. 21st International Conference on Intelligent Transportation Systems (ITSC). Maui, HI. 2018. pp. 1652-1657
  7. Napiah M, Kamaruddin I. ARIMA models for bus travel time prediction. Journal of the Institution of Engineers Malaysia. 2010;71
  8. Kumar SV, Vanajakshi L. Short-term traffic flow prediction using seasonal ARIMA model with limited input data. European Transport Research Review. 2015;7:21
  9. Williams BM, Durvasula PK, Brown DE. Urban freeway traffic flow prediction: Application of seasonal autoregressive integrated moving average and exponential smoothing models. Transportation Research Record. 1998;1644:132-141
  10. Ahn J, Ko E, Kim EY. Highway traffic flow prediction using support vector regression and Bayesian classifier. 2016 International Conference on Big Data and Smart Computing (BigComp). 2016. pp. 239-244. DOI: 10.1109/BIGCOMP.2016.7425919
  11. Kumar SV. Traffic flow prediction using Kalman filtering technique. Procedia Engineering. 2017;187:582-587
  12. Ranjan N, Bhandari S, Zhao HP, Kim H, Khan P. City-wide traffic congestion prediction based on CNN, LSTM and transpose CNN. IEEE Access. 2020;8:81606-81620
  13. Ma X, Dai Z, He Z, Ma J, Wang Y, Wang Y. Learning traffic as images: A deep convolutional neural network for large-scale transportation network speed prediction. Sensors. 2017;17:818
  14. Yang D, Li S, Peng Z, Wang P, Wang J, Yang H. MF-CNN: Traffic flow prediction using convolutional neural network and multi-features fusion. IEICE Transactions on Information and Systems. 2019;E102.D:1526-1536
  15. Bao X, Jiang D, Yang X, Wang H. An improved deep belief network for traffic prediction considering weather factors. Alexandria Engineering Journal. 2021;60:413-420
  16. Huang W, Song G, Hong H, Xie K. Deep architecture for traffic flow prediction: Deep belief networks with multitask learning. IEEE Transactions on Intelligent Transportation Systems. 2014;15:2191-2201
  17. Lu S, Zhang Q, Chen G, Seng D. A combined method for short-term traffic flow prediction based on recurrent neural network. Alexandria Engineering Journal. 2021;60:87-94
  18. Sadeghi-Niaraki A, Mirshafiei P, Shakeri M, Choi S-M. Short-term traffic flow prediction using the modified Elman recurrent neural network optimized through a genetic algorithm. IEEE Access. 2020;8:217526-217540
  19. Tian Y, Pan L. Predicting short-term traffic flow by long short-term memory recurrent neural network. 2015 IEEE International Conference on Smart City/SocialCom/SustainCom (SmartCity). 2015. pp. 153-158. DOI: 10.1109/SmartCity.2015.63
  20. Wang W, Bai Y, Yu C, Gu Y, Feng P, Wang X, et al. A network traffic flow prediction with deep learning approach for large-scale metropolitan area network. 2018 IEEE/IFIP Network Operations and Management Symposium. 2018. pp. 1-9
  21. Wang S, Li C, Yue W, Mao G. Network capacity maximization using route choice and signal control with multiple OD pairs. IEEE Transactions on Intelligent Transportation Systems. 2020;21:1595-1611
  22. Keyvan-Ekbatani M, Kouvelas A, Papamichail I, Papageorgiou M. Exploiting the fundamental diagram of urban networks for feedback-based gating. Transportation Research Part B: Methodological. 2012;46:1393-1403
  23. Elouni M, Rakha HA. Weather-tuned network perimeter control - A network fundamental diagram feedback controller approach. 2018 International Conference on Vehicle Technology and Intelligent Transport Systems. 2018. pp. 82-90
  24. Haddad J, Shraiber A. Robust perimeter control design for an urban region. Transportation Research Part B: Methodological. 2014;68:315-332
  25. Sirmatel II, Geroliminis N. Economic model predictive control of large-scale urban road networks via perimeter control and regional route guidance. IEEE Transactions on Intelligent Transportation Systems. 2018;19:1112-1121
  26. Kouvelas A, Saeedmanesh M, Geroliminis N. A linear formulation for model predictive perimeter traffic control in cities. IFAC-PapersOnLine. 2017;50:8543-8548
  27. Aboudolas K, Geroliminis N. Perimeter and boundary flow control in multi-reservoir heterogeneous networks. Transportation Research Part B: Methodological. 2013;55:265-281
  28. Mohebifard R, Hajbabaie A. Optimal network-level traffic signal control: A Benders decomposition-based solution algorithm. Transportation Research Part B: Methodological. 2019;121:252-274
  29. Wu W, Wang Z-J, Chen X-M, Wang P, Li M-X, Ou Y-J-X, et al. A decision-making model for autonomous vehicles at urban intersections based on conflict resolution. Journal of Advanced Transportation. 2021;2021:8894563
  30. Lu S, Liu X, Dai S. Incremental multistep Q-learning for adaptive traffic signal control based on delay minimization strategy. 2008 7th World Congress on Intelligent Control and Automation. 2008. pp. 2854-2858. DOI: 10.1109/WCICA.2008.4593378
  31. Shoufeng L, Ximin L, Shiqiang D. Q-learning for adaptive traffic signal control based on delay minimization strategy. 2008 IEEE International Conference on Networking, Sensing and Control. 2008. pp. 687-691
  32. Chu T, Wang J, Codecà L, Li Z. Multi-agent deep reinforcement learning for large-scale traffic signal control. IEEE Transactions on Intelligent Transportation Systems. 2019
  33. Wang T, Cao J, Hussain A. Adaptive traffic signal control for large-scale scenario with cooperative group-based multi-agent reinforcement learning. Transportation Research Part C: Emerging Technologies. 2021;125:103046
  34. Wang X, Ke L, Qiao Z, Chai X. Large-scale traffic signal control using a novel multi-agent reinforcement learning. IEEE Transactions on Cybernetics. 2021;51(1):174-187
  35. Li Z, Xu C, Zhang G. A deep reinforcement learning approach for traffic signal control optimization. 2021
  36. Guo J, Harmati I. Evaluating semi-cooperative Nash/Stackelberg Q-learning for traffic routes plan in a single intersection. Control Engineering Practice. 2020;102:104525
  37. Shickel B, Tighe PJ, Bihorac A, Rashidi P. Deep EHR: A survey of recent advances on deep learning techniques for electronic health record (EHR) analysis. IEEE Journal of Biomedical and Health Informatics. 2018;22(5):1589-1604. DOI: 10.1109/JBHI.2017.2767063
  38. Lin C-T, Lee CSG. Neural Fuzzy Systems: A Neuro-Fuzzy Synergism to Intelligent Systems. USA: Prentice-Hall, Inc.; 1996
  39. Chon T-S, Park Y-S, Kim J-M, Lee B-Y, Chung Y-J, Kim Y. Use of an artificial neural network to predict population dynamics of the forest–pest pine needle gall midge (Diptera: Cecidomyiida). Environmental Entomology. 2000;29:1208-1215
  40. Fouladgar M, Parchami M, Elmasri R, Ghaderi A. Scalable deep traffic flow neural networks for urban traffic congestion prediction. 2017 International Joint Conference on Neural Networks (IJCNN). 2017. pp. 2251-2258
  41. Yu H, Wu Z, Wang S, Wang Y, Ma X. Spatiotemporal recurrent convolutional networks for traffic prediction in transportation networks. 2018 Proceedings of the 2nd International Conference on Computer and Data Analysis (ICCDA). 2018. pp. 28-35
  42. Fang W, Love PED, Luo H, Ding L. Computer vision for behaviour-based safety in construction: A review and future directions. Advanced Engineering Informatics. 2020;43:100980
  43. Li H-C, Deng Z-Y, Chiang H-H. Lightweight and resource-constrained learning network for face recognition with performance optimization. Sensors. 2020;20
  44. Hassannayebi E, Ren C, Chai C, Yin C, Ji H, Cheng X, et al. Short-term traffic flow prediction: A method of combined deep learnings. Journal of Advanced Transportation. 2021;2021:9928073
  45. Dinler ÖB, Aydin N. An optimal feature parameter set based on gated recurrent unit recurrent neural networks for speech segment detection. Applied Sciences. 2020;10
  46. Jagannatha A, Yu H. Structured prediction models for RNN based sequence labeling in clinical text. Proceedings of the Conference on Empirical Methods in Natural Language Processing. 2016. pp. 856-865
  47. Mohammadi M, Al-Fuqaha A, Sorour S, Guizani M. Deep learning for IoT big data and streaming analytics: A survey. IEEE Communication Surveys and Tutorials. 2018;20:2923-2960
  48. Alzubaidi L, Zhang J, Humaidi AJ, Al-Dujaili A, Duan Y, Al-Shamma O, et al. Review of deep learning: Concepts, CNN architectures, challenges, applications, future directions. Journal of Big Data. 2021;8:53
  49. Hochreiter S, Schmidhuber J. Long short-term memory. Neural Computation. 1997;9:1735-1780
  50. Duan Z, Yang Y, Zhang K, Ni Y, Bajgain S. Improved deep hybrid networks for urban traffic flow prediction using trajectory data. IEEE Access. 2018;6:31820-31827
  51. Liu Y, Zheng H, Feng X, Chen Z. Short-term traffic flow prediction with Conv-LSTM. 2017
  52. Sainath TN, Vinyals O, Senior A, Sak H. Convolutional, long short-term memory, fully connected deep neural networks. 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2015. pp. 4580-4584
  53. Wu Y, Tan H. Short-term traffic flow forecasting with spatial-temporal correlation in a hybrid deep learning framework. ArXiv. 2016;abs/1612.01022
  54. Haydari A, Yilmaz Y. Deep reinforcement learning for intelligent transportation systems: A survey. CoRR. 2020;abs/2005.00935
  55. Tesauro G. Extending Q-learning to general adaptive multi-agent systems. 2003 Proceedings of the 16th International Conference on Neural Information Processing Systems (NIPS). 2003. pp. 871-878
