CUDA Accelerated 2-OPT Local Search for the Traveling Salesman Problem

This research involves the development of a compute unified device architecture (CUDA) accelerated 2-opt local search algorithm for the traveling salesman problem (TSP). As one of the fundamental mathematical approaches to solving the TSP problem, the time complexity has generally reduced its efficiency, especially for large problem instances. Graphic processing unit (GPU) programming, especially CUDA has become more mainstream in high-performance computing (HPC) approaches and has made many intractable problems at least reasonably solvable in acceptable time. This chapter describes two CUDA accelerated 2-opt algorithms developed to solve the asymmetric TSP problem. Three separate hardware configurations were used to test the developed algorithms, and the results validate that the execution time decreased significantly, especially for the large problem instances when deployed on the GPU.


Introduction
This research addresses two very important aspects of computational intelligence, algorithm design, and high-performance computing. One of the fundamental problems in this field is the TSP, which has been used as a poster child for the notorious P ¼ NPassertion in theoretical computer science.
TSP in nominal form is considered NP-Complete, when attempted using exact deterministic heuristics. The time complexity when solving it using the Held-Karp algorithm is On 2 2 n ðÞ and the space complexity is On 2 n ðÞ .Whensolvingtheproblem using optimization algorithms and approximation, then problem tends to be NP-Hard.
2-opt is considered the simplest local search for the TSP problem. Theoretical knowledge about this heuristic is still very limited [1]; however, simple euclidean distance variants have been shown to have complexity of On 3 ðÞ [2]. Generally, the computed solution has been shown to be within a few percentage points of the global optimal [3].
One of the empirical approaches of improving the execution of the algorithm is applying high performance computing (HPC) paradigm to the problem. This is generally possible if the problem is deducible to a parallel form.

2-OPT algorithm
The 2-opt algorithm is one of the most famous heuristics developed originally for solving the TSP problem. It was first proposed by Croes [20]. Along with 3-opt, generalized as k-opt [21], these heuristics are based on exchange of up to k edges in a TSP tour (more information on application of k-opt local search techniques to TSP problems can be obtained from [22]). Together they are called exchange or local improvement heuristics. The exchange is considered to be a single move, from this point of view, such heuristics search the neighborhood of the current solution, that is, perform a local search and provide a locally optimal solution (k-optimal) to the problem [23].
The 2-opt procedure requires a starting feasible solution. It then proceeds by replacing the two non-adjacent edges, v i , v iþ ðÞ and v j , v jþ ÀÁ by v i , v j ÀÁ and v iþ , v jþ ÀÁ , and reversing one of the subpaths produced by dropping of edges, in order to maintain the consistent orientation of the tour. For example, the subpath v i , v iþ , … , v j , v jþ ÀÁ is replaced by v i , v j , … , v iþ , v jþ ÀÁ . The solution cost change produced in this way can be expressed as .IfΔ ij < 0, the solution produced by the move improves upon its predecessor. The procedure iterates until no move where Δ ij < 0 (no improving move) can be found [24].
The 2-opt local search was described by Kim et al. [25] as follows: Step 1: Let S be the initial solution, fS ðÞits objective function value. Set S * ¼ Step 2: Consider exchange result S 0 such that fS 0 Step 3: if S 6 ¼ S * set S ¼ S * , i ¼ 1, j ¼ i þ 1 and go to step 2. Otherwise output best solution S and terminate the process.

CUDA
General purpose GPU computing (GPGPU) programming was introduced by Apple Cooperation, which created the Kronos Group [26] to further develop and promote this new approach to accelerate scientific computing paradigms.
GPU's offer significantly faster acceleration due to their uniquer hardware architecture. GPGPU's started to increase in application from 2006. At this point NVIDIA decided to create its propriety unique architecture called Compute Unified Device Architecture (CUDA), specific for their Tesla generation GPU cards. In order to support this architect, specific API primitive extensions of C, C++ and Fortran extensions has been developed [27,28].
The specific C/C++ language extension for the C language is called the CUDA-C. This contains a number of accelerated libraries, extensions, and APIs. These are scalable and freely available without professional license. The main computational bottleneck is the splitting of the task between GPU and CPU tasks, where CPU handles better memory management and memory checking and GPU handles the data acceleration using parallization. It is considered heterogenous programming, where compute intensive data parallel tasks are offloaded on to the GPU.
CUDA contains three specific paradigms, thread hierarchy, memory hierarchy and synchronization. These can be further divided into coarse-grained parallelism on the blocks in grid parallization and fine-grain parallization in the threads in block, which requires low-level synchronization.

Thread hierarchy
CUDA kernels are special function calls, which is used for data parallization. Each kernel launches threads which are grouped into blocks which are then grouped into grids. Communication is done synchronously by threads in a block, whereas blocks are independent. Certain programming techniques needs to be undertaken to ensure data synchronization and validity between blocks. Threads in different blocks are not able to communicate with each other.
Threads are distinguished by their unique threadId in their respective blockId, which allows operating on specific data in the global and shared memory.

Memory hierarchy
There are different memory types in the GPU, which CUDA can utilize. Some memory structures are based on cache, some are read-only, etc. The first higher level memory structure is called the global memory, which can be accessed by all memory blocks. Due to its size and access level, it is the slowest memory on the GPU. The second memory level is the shared memory, which is shared by blocks, which threads within blocks can access. The third memory is the register memory, which are only accessible by threads, and can be used to local variables. This is the smallest and fastest memory in the GPU. If there are larger memory structures, and when registers are not sufficient, local memory can be then utilized. Another memory is constant memory which cannot be changed by the kernel code. The final memory is the texture memory, which is a read-only cache that provides a speed-up for locality in data access by threads [29].

Synchronization
Blocks in grids are used in coarse-grained parallelism and threads in a specific block are used in fine-grained parallelism. Data sharing in the scope of a kernel is done by threads in the block. The number of threads are limited by the device architecture design (max. 1024) and also by thread memory resource consumption. There is a level of scalability as the blocks are scheduled independently. Each block is assigned to a streaming multiprocessor (MS) in the GPU [29, 30].

CUDA-based 2-opt algorithm
This section presents the parallel CUDA-based version of 2-opt algorithm. This is a modification of the local search for permutative flowshop with makespan criterion problem [31] and its NEH variant [32]. Before coming to the parallel implementation description, however, the more detailed pseudocode of sequential version is provided in Algorithm 5, in order to enable better understanding of the CUDA algorithm design.
Algorithm 1: 2-opt sequential version. The Swap(T,j,i) procedure swaps jÀth and iÀth cities of tour T As can be seen already from the analysis of description of 2-opt, the task that can be done in parallel is the exploration of neighborhood of the current solution. This is divided between individual CUDA blocks. Possible neighbors of the current solution are split evenly between the launched blocks, which then explores these neighbors evenly including the fitness evaluations. If a new better solution is found, it is then stored into the global memory allocation of that block. Thereafter, if at least one of the launched blocks finds an improving solution during the iteration, the best cost solution amongst all blocks is obtained and stored into memory as the current solution for the next iteration. Otherwise, the current solution is returned as the best. It should be noted that the fitness function is not parallelized, as only a single thread in each block is tasked with this task.
Each block explores approximately the same amount of possible neighbors to the current solution (in the worst case, when no improving solution is found), including the cost evaluation. However, if it finds an improving solution, that solution is stored into the global memory allocated for each block, and the block terminates. If at least one of the blocks found an improving solution, the minimal cost solution amongst all blocks is found and stored into memory as the current solution for the next iteration. Otherwise, the current solution is returned. The cost function evaluation itself was not parallelized, as in each block only a single thread performs this task.
The outline of the parallel algorithm can be given as follows: Step 1: Set current solution S = Initial solution.
Step 2: Explore the neighborhood of S by G blocks in parallel. In each block b: Step 1.1: Determine initial index i for b.
Step 1.2: Explore all neighbors of S created by swapping of i and j, j ∈ 1, … , N fg . If improving neighbor T found, go to step 1.4.
Step 1.4: Store T and its objective function value f T into global memory and terminate.
Step 3: If no improving solution found, exit procedure and return S as the best solution found. Otherwise determine the best solution amongst those found by blocks in parallel.
Step 4: Store best solution as S. Gotostep 2. Where N is the number of cities in the tour and i is the outer loop index (see Algorithm 1 for sequential version of 2-opt).

Exploration and evaluation of neighboring solutions
The neighbors of solution are generated and evaluated in this kernel. From the sequential version pseudocode (Algorithm 1), it is obvious that the function of generating individual neighbors by swapping every possible pair of jobs pair-wise i, j ðÞ for i ¼ 1, … , N and j ¼ i þ 1, … , N can be considered independent and therefore executed in parallel. These solutions can be stored in the shared memory after generation. After evaluation, if the new solution has better fitness value compared to the current one, it is stored into the global memory allocated for each block,to avoid data races between blocks (this is illustrated in Figure 1 depicting memory layout for six cities and four blocks). The improvements counter in the global memory is incremented using an atomic operation. This counter is compared against zero after the kernel termination, to determine if the stopping criterion of the algorithm was met. The fitness function itself is evaluated by only a single thread; the other threads in a block process the elements of the solution when transferring data between shared and global memory locations. It is logically impractical to allocate the full number of N À 1 ðÞ 2 =2 blocks on the GPU in most case scenarios. This number can be very large, whereas the number of SMs and the number of resident blocks on SM is limited by various factors, such as the number of threads in a block and a registers/shared memory usage.
The optimal number of threads in a block maximizing the number of resident blocks, as well as GPU occupancy, can be easily determined based on the calculations performed in the CUDA occupancy calculator tool [33], as a function of the number of cities in a tour (which determines the size of shared memory used). This can maximizes the utilization of the GPU, while reducing the total global memory size required by the grid, as well as the workload done by the search for minimal cost solution in the next kernel. The mapping of the blocks to the tasks however can becomes more complicated to implement in code.
Using the assumption that the number of blocks will be nearly always smaller than the aforementioned function of the number of actual cities for the problem instances of interest (problems with cities larger than 30), only the outer loop of the sequential 2-opt algorithm was parallelized. The inner loop is performed by each block sequentially. This reduces the data transfers between global and shared memory, and does not eliminate the advantage of the low complexity of the swap operation at the same time. If the solution created by swapping jobs i and j is worse than the current one, it is easy to reverse this change by swapping again j and i, with equal complexity. Therefore, maximally N À 1 blocks are needed for this function. The mapping of blocks to tasks is illustrated in Figure 2.

Parallel reduction to obtain minimal cost
The parallel reduction procedure is used to find the index of the solution with the minimal fitness value. This employs shared memory to store the data being used, whereas the data is initially copied from the global to shared memory. In this step, each active thread compares two costs, and stores the smaller of the two costs on the place of the first cost, along with its original index (cost is represented as a structure containing two elements: cost value, and cost index). Using this reduction, the first element of the costs array contains the minimal cost found, along with its respective solution index. This pair is then written into global memory.

Device synchronization and subsequence update
In the final process, a new kernel copies the best indexed solution into the current solution buffer, and the next step of the main loop can be performed. A global CUDA device synchronization is required for relatively large data (for a tour size/number of threads in a block of size more than approximately 100, as was empirically confirmed) before the start of the synchronization. As each of the  kernels consumes some of the GPU resources, it is necessary to wait, until the pending kernels completely finish the execution, and release their resources, otherwise the GPU freezes and unsuccessful kernel launches start to appear. This is done by calling cudaDeviceSynchronize() function from the host code, after the Update kernel is launched. Figure 3 outlines the memory layout of the previously described code (without TSP input data, for the current subsequence size 2, city tour 4. The data fields not used in the current step are grayed out). The candidate solutions are stored in one global memory 1D array, which conceptually represents 2D array, wherein each row contains one candidate tour. The respective costs are stored in a separate array. The TSP problem input data (distance between cities) are stored in the similar fashion in global memory (because of its large size).
This implementation is expected to provide in each step the speedup proportional to the number of solutions generated.

2-OPT variants
Two versions of the 2-opt local search was implemented in this work. The first is the LS2OPT variant, which uses the search with the first ascend strategy. In this strategy, the next tour is the first improving solution found. This can be given in Algorithm 2.
The second variant is the MLS2OPT version, which is the best ascend strategy.In this strategy, the next tour is the best improving solution found in the 2-swap neighborhood as given in Algorithm 3.

Experimentation design
The experimentation design is as the following. Three different CPU'sandthree different GPU's are used to run the two different 2-opt variants on a selected number of asymmetric TSP instances (ATSP). The only measure is the time complexity.
The problem instances of the ATSP was obtained from the TSP library [34]. The following problems were selected due to differing city sizes as given in Table 1.
The machine specifications is given in Table 2. Three separate machines were used with differing CPUs and GPUs. Two machines were on a Windows 10 operating system and the other is a Central Washington University Supercomputer cluster running Ubuntu [35]. Machine 2 and 3 utilized headless GPU's.

Results and analysis
The results are grouped by the machine architectures, as there is a dependency between the CPU and GPU. Thirty experimentations was done of each problem instance on each machine for each algorithm and the average time is given in the tables (* in msec). The percentage relative difference (PRD) is calculated between the CPU and GPU times as given in Eq. (3). Negatives values (given as bolded text in the tables) indicate that the GPU execution is faster.
The first part of the first machine experiment results of the LS2OPT and its CUDA variant is given in Table 3. The first column is the problem instances and the second and third column is the CPU and GPU average results of the LS2OPT in milliseconds. The final column is the PRD results. From all the results, apart from the ftv64 instance, the GPU produced faster results. The average time was 22480.28 ms for the CPU and 2168.57 ms for the GPU. The average PRD was À47.29% for all experiments. A deeper analysis shows that for the larger instances, the PRD was over 80%.
The plot of the execution time is given in Figure 4 where the execution speedup is clearly identifiable for the larger instances.
The second part of the first machine experimentation is the MLS2OPT and its CUDA variant and the result are given in Table 4. For all the problem instances, the execution time for the GPU was significantly better. The average time was 14183.85 ms for the CPU and 1854.28 ms for the GPU. The average PRD was À52.55% for all experiments. Apart from two instances, all the other were above 85% PRD.
The plot of the execution time is given in Figure 5, where the execution speedup is linearly identifiable for the larger instances.
The first part of the second machine experiment results of the LS2OPT and its CUDA variant is given in Table 5. As the NVidia Titan Xp is a dedicated headless TESLA category GPU, the computational times are better than the CPU for all the results. The average time was 12157.14 ms for the CPU and 857 ms for the GPU. The average PRD was À64.92% for all experiments. A deeper analysis shows that for the larger instances, the PRD was over 90%. As the transfer overhead for the PCIe bus is  compensated by more extensive experimentation, larger instances performed faster on the GPU. The plot of the execution time is given in Figure 6, where the execution speedup is clearly identifiable for the larger instances.
The second part of the second machine experimentation is the MLS2OPT and its CUDA variant and the result are given in Table 6. For all the problem instances, the execution time for the GPU was significantly better. The average time was 7955.28 ms for the CPU and 616.28 ms for the GPU. The average PRD was À63.39% for all experiments. The three larger instances were above 90% PRD.
The plot of the execution time is given in Figure 7, where the execution speedup is linearly identifiable for the larger instances.
The first part of the third machine experiment results of the LS2OPT and its CUDA variant is given in Table 7. Generally, the NVidia P100 is regarded as an industry leading GPU solution for scientific computing. This is coupled with the IBM Power 8 CPU Architecture. For all the problem instances the result was   significantly better. The average time was 28592.43 ms for the CPU and 1536.42 ms for the GPU. The average PRD was À87.83% for all experiments. The plot of the execution time is given in Figure 8, where the execution speedup is clearly identifiable for the larger instances.
The second part of the third machine experimentation is the MLS2OPT and its CUDA variant and the result are given in Table 8. For all the problem instances, the execution time for the GPU was again significantly better. The average time was 23429.14 ms for the CPU and 751 ms for the GPU. The average PRD was À92.78% for all experiments. The PRD is the highest of all experiments.
The plot of the execution time is given in Figure 9, where the execution speedup is linearly identifiable for the larger instances.
The final comparison is of the three GPU's on the two separate algorithms. Figure 10 shows the values of the three GPU's on the problem instances for the LS2OPTCUDA algorithm. For the small sized problem, the timing is not significantly distinct. The distinction only becomes variable when the instance sizes increase. Overall, the NVidia Titan Xp is the best performing GPU for this algorithm.         Figure 11 shows the results of the MLS2OPTCUDA algorithm on the problem. As with the previous case, the distinction only becomes obvious for large sized problem instances. Again the NVidia Titan Xp is the best performing GPU for this algorithm.

Algorithm comparison
This section discusses the tour cost obtained by the two different 2-OPT approaches developed here compared with published research. The first comparison is done with the best known solution in literature, which can be obtained from the TSPLib [36]. Table 9 gives the comparison results between the optimal and the results obtained from the LS2OPTCUDA and MLS2OPTCUDA algorithms on the P100   GPU. The results are compared using the PRD Eq. (3). The GPU is replaced with the optimal value and the CPU is replaced by the obtained result.
The PRD values comparison shows that the LS2OPT is at most 40% away from the optimal value for ftv170 instance and À 9% for the rbg403 instance. For the MLS2OPT comparison, the PRD is À39% from the optimal value for ftv170 instance and À 5% for the rbg403 instance. On average, the MLS2OPT is a better performing algorithm with an average of 15853.86 against 16356.57 for the LS2OPT algorithm. A plot of the comparison values is given in Figure 12.
The second comparison is now done with four different evolutionary algorithms as given in Table 10. Theses are the Discrete Particle Swarm Optimization (DPSO) algorithm [37], Discrete Self-Organizing Algorithm (DSOMA) [38], Enhanced Differential Evolution (EDE) algorithm and the Chaos driven Enhanced Differential Evolution (EDE C ) algorithm [17]. The DPSO and DSOMA algorithms were revised for the TSP problem and the 2-OPT local search was removed from the algorithms to compare the results without any local search implemented. EDE and EDE C are published algorithms however only three instances were published. Both these algorithms had the 2-OPT local search embedded in them.   From the results, it was obvious that evolutionary algorithms without local search heuristics are not as effective as the 2-opt local search heuristic or algorithms with both directed and local search combined. Therefore, it is important to combine these two algorithms as in [39]. As reported in [39] that the execution time of local search can be around 95-99% of the total run time of the algorithm, it is viable to accelerate the local search heuristics.

Conclusions
This chapter introduces a CUDA accelerated 2-opt algorithm for the TSP problem. As one of the most common and widely used approaches to solve the problem, the 2-opt approach can be considered as canonical in the field.
GPU programming, especially CUDA has gained significant traction for high performance computing. Readily available hardware has made programming a much easier and available task.
Two variants of the 2-opt algorithm have been coded in CUDA to show the acceleration of computational time. This has been tested against a sample of test    instances from literature. From the results obtained, it is clear that even for a relatively cheap GPU such as the GTX 1050 the performance improvement is significant, especially for larger sized problem instances. These were compared against industry leading CPU's such as Intel i7-X series and IBM Power 8. One of the interesting aspects was that the Titan Xp performed better than the P100 for these instances. It is difficult to identify the reasons, as the same code was deployed on all machines, however the IBM and Intel architecture differences and different C/C++ compiler usage may have affected the performance. The physical configuration of the GPU's inside the hardware and its connection to the motherboard and memory bandwidth issues could also add to the time overhead. However, when analyzing the cost-performance of the GPU's then the $1500 Titan Xp is a better GPU than the $15,000 P100 in this case.
However, the clear distinction is that there is a significant improvement to be had when applying the CUDA version of the 2-opt algorithm. The next direction of this research is to combine it with powerful swarm meta-heuristics with a layered approach, and try and solve very large TSP instances. © 2020 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/ by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.