Open access peer-reviewed chapter

# CUDA Accelerated 2-OPT Local Search for the Traveling Salesman Problem

Written By

Donald Davendra, Magdalena Metlicka and Magdalena Bialic-Davendra

Submitted: October 7th, 2019 Reviewed: June 5th, 2020 Published: September 3rd, 2020

DOI: 10.5772/intechopen.93125

From the Edited Volume

## Novel Trends in the Traveling Salesman Problem

Edited by Donald Davendra and Magdalena Bialic-Davendra


## Abstract

This research involves the development of a compute unified device architecture (CUDA) accelerated 2-opt local search algorithm for the traveling salesman problem (TSP). Although 2-opt is one of the fundamental approaches to solving the TSP, its time complexity has generally limited its efficiency, especially for large problem instances. Graphics processing unit (GPU) programming, especially CUDA, has become more mainstream in high-performance computing (HPC) and has made many intractable problems at least reasonably solvable in acceptable time. This chapter describes two CUDA accelerated 2-opt algorithms developed to solve the asymmetric TSP. Three separate hardware configurations were used to test the developed algorithms, and the results validate that the execution time decreased significantly, especially for the large problem instances when deployed on the GPU.

### Keywords

• traveling salesman problem
• CUDA
• 2-opt
• local search
• GPU programming

## 1. Introduction

This research addresses two very important aspects of computational intelligence: algorithm design and high-performance computing. One of the fundamental problems in this field is the TSP, which has been used as a poster child for the notorious P = NP assertion in theoretical computer science.

TSP in nominal form is considered NP-complete when attempted using exact deterministic heuristics. The time complexity when solving it using the Held-Karp algorithm is O(2^n · n^2) and the space complexity is O(2^n · n). When solving the problem using optimization algorithms and approximation, the problem tends to be NP-hard.

2-opt is considered the simplest local search for the TSP. Theoretical knowledge about this heuristic is still very limited [1]; however, simple Euclidean distance variants have been shown to have a complexity of O(n^3) [2]. Generally, the computed solution has been shown to be within a few percentage points of the global optimum [3].

One of the empirical approaches to improving the execution of the algorithm is applying the high-performance computing (HPC) paradigm to the problem. This is generally possible if the problem is reducible to a parallel form. A number of different HPC approaches exist, namely threads, OpenMP, MPI and CUDA. CUDA is by far the most complex and accelerated approach, as it requires programming on the GPU instead of the central processing unit (CPU).

Since its inception, CUDA has been widely used to solve a large number of computational problems [4]. This research looks to harness it to implement the 2-opt local search for the TSP.

The outline of the chapter is as follows. The mathematical background of the TSP is introduced, followed by the 2-opt algorithm. CUDA is subsequently discussed, and the two 2-opt algorithm variants developed in CUDA are described. The experimentation design discusses the hardware specifications of the three different architectures, and the obtained results are then discussed and analyzed with respect to execution time.

## 2. Traveling salesman problem

The TSP is a well-studied problem in the literature [5, 6], which in essence tries to find the shortest path that visits a set of customers and returns to the first. A number of studies have been done using both approximation-based approaches [7] and metaheuristics. Metaheuristics are generally based on evolutionary approaches. A brief outline of different approaches can be obtained from:

1. Tabu Search: [8]

2. Simulated Annealing: [9]

3. Genetic Algorithm: [10, 11]

4. Ant Colony Optimization: [12]

5. Particle Swarm Optimization: [13]

6. Cuckoo Search: [14]

7. Firefly Algorithm: [15]

8. Water Cycle Algorithm: [16]

9. Differential Evolution Algorithm: [17]

10. Artificial Bee Colony: [18]

11. Self Organizing Migrating Algorithm: [19]

The TSP decision variables can be expressed as shown in Eq. (1).

$$x_{ij} = \begin{cases} 1 & \text{if the path goes from city } i \text{ to city } j \\ 0 & \text{otherwise} \end{cases} \tag{1}$$

where $x_{ij} = 1$ if city $i$ is connected with city $j$, and $x_{ij} = 0$ otherwise. For $i = 0, \ldots, n$, let $u_i$ be an artificial variable, and finally take $c_{ij}$ to be the distance from city $i$ to city $j$. The objective function can then be formulated as Eq. (2):

$$\begin{aligned}
\min \; & \sum_{i=0}^{n} \sum_{j=0,\, j \neq i}^{n} c_{ij} x_{ij} \\
& 0 \le x_{ij} \le 1 && i, j = 0, \ldots, n \\
& u_i \in \mathbb{Z} && i = 0, \ldots, n \\
& \sum_{i=0,\, i \neq j}^{n} x_{ij} = 1 && j = 0, \ldots, n \\
& \sum_{j=0,\, j \neq i}^{n} x_{ij} = 1 && i = 0, \ldots, n \\
& u_i - u_j + n x_{ij} \le n - 1 && 1 \le i \neq j \le n
\end{aligned} \tag{2}$$
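The objective of Eq. (2) can be checked by brute force on a tiny instance: every tour fixes one $x_{ij}$ per row and column, so summing $c_{ij}$ over the edges a tour uses and minimizing over all tours gives the optimum. A minimal C++ sketch, using a hypothetical 4-city asymmetric distance matrix:

```cpp
#include <algorithm>
#include <cassert>
#include <climits>
#include <numeric>
#include <vector>

// Brute-force minimisation of the Eq. (2) objective: enumerate all tours
// fixing city 0 as the start, and sum c[i][j] over the edges the tour uses
// (i.e. over the x_ij that equal 1). Only viable for very small n.
int bruteForceTsp(const std::vector<std::vector<int>>& c) {
    int n = (int)c.size();
    std::vector<int> perm(n - 1);
    std::iota(perm.begin(), perm.end(), 1);   // cities 1..n-1
    int best = INT_MAX;
    do {
        int cost = c[0][perm[0]];
        for (int k = 0; k + 1 < (int)perm.size(); ++k)
            cost += c[perm[k]][perm[k + 1]];
        cost += c[perm.back()][0];            // return to the first city
        best = std::min(best, cost);
    } while (std::next_permutation(perm.begin(), perm.end()));
    return best;
}
```

For the illustrative matrix used in the test below, the shortest closed tour is 0 → 2 → 3 → 1 → 0.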

## 3. 2-OPT algorithm

The 2-opt algorithm is one of the most famous heuristics originally developed for solving the TSP. It was first proposed by Croes [20]. Along with 3-opt, generalized as k-opt [21], these heuristics are based on the exchange of up to k edges in a TSP tour (more information on the application of k-opt local search techniques to TSP problems can be obtained from [22]). Together they are called exchange or local improvement heuristics. The exchange is considered to be a single move; from this point of view, such heuristics search the neighborhood of the current solution, that is, perform a local search and provide a locally optimal (k-optimal) solution to the problem [23].

The 2-opt procedure requires a starting feasible solution. It then proceeds by replacing two non-adjacent edges, $(v_i, v_{i+1})$ and $(v_j, v_{j+1})$, by $(v_i, v_j)$ and $(v_{i+1}, v_{j+1})$, and reversing one of the subpaths produced by dropping the edges, in order to maintain a consistent orientation of the tour. For example, the subpath $v_i v_{i+1} \ldots v_j v_{j+1}$ is replaced by $v_i v_j \ldots v_{i+1} v_{j+1}$. The solution cost change produced in this way can be expressed as $\Delta_{ij} = c(v_i, v_j) + c(v_{i+1}, v_{j+1}) - c(v_i, v_{i+1}) - c(v_j, v_{j+1})$. If $\Delta_{ij} < 0$, the solution produced by the move improves upon its predecessor. The procedure iterates until no move with $\Delta_{ij} < 0$ (no improving move) can be found [24].
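The move and its $\Delta_{ij}$ gain can be sketched in a few lines of C++. Note that the four-edge formula is exact for symmetric costs; under asymmetric costs the reversal also changes the interior edge costs, so the sketch below (and its test matrix) assumes a symmetric instance:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Cost change of the 2-opt move described above: drop edges (v_i, v_{i+1})
// and (v_j, v_{j+1}), add (v_i, v_j) and (v_{i+1}, v_{j+1}). Exact only for
// symmetric c, since reversal re-orients the interior edges.
int delta(const std::vector<std::vector<int>>& c,
          const std::vector<int>& tour, int i, int j) {
    int n = (int)tour.size();
    int vi = tour[i], vin = tour[(i + 1) % n];
    int vj = tour[j], vjn = tour[(j + 1) % n];
    return c[vi][vj] + c[vin][vjn] - c[vi][vin] - c[vj][vjn];
}

// Apply the move by reversing the subpath v_{i+1} .. v_j, which keeps a
// consistent orientation of the rest of the tour.
void apply2opt(std::vector<int>& tour, int i, int j) {
    std::reverse(tour.begin() + i + 1, tour.begin() + j + 1);
}
```

In the test, the move on edges (0,1) and (2,3) of tour 0-1-2-3 raises the cost by exactly the computed $\Delta$, confirming the formula.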

The 2-opt local search was described by Kim et al. [25] as follows:

Step 1: Let $S$ be the initial solution and $f(S)$ its objective function value. Set $S^* = S$, $i = 1$, $j = i + 1 = 2$.

Step 2: Consider the exchange result $S'$ such that $f(S') < f(S^*)$. Set $S^* = S'$. If $j < n$, repeat step 2. Otherwise set $i = i + 1$ and $j = i + 1$. If $i < n$, repeat step 2; otherwise go to step 3.

Step 3: If $S \neq S^*$, set $S = S^*$, $i = 1$, $j = i + 1$ and go to step 2. Otherwise output the best solution $S^*$ and terminate the process.

## 4. CUDA

General-purpose GPU computing (GPGPU) programming was introduced by Apple, which created the Khronos Group [26] to further develop and promote this new approach to accelerating scientific computing paradigms. GPUs offer significantly faster acceleration due to their unique hardware architecture. GPGPU applications started to increase from 2006, at which point NVIDIA decided to create its proprietary architecture called Compute Unified Device Architecture (CUDA), specific to their Tesla generation GPU cards. In order to support this architecture, specific API primitive extensions of C, C++ and Fortran have been developed [27, 28].

The specific language extension for the C language is called CUDA-C. It contains a number of accelerated libraries, extensions, and APIs. These are scalable and freely available without a professional license. The main computational consideration is the splitting of the task between the GPU and CPU, where the CPU better handles memory management and memory checking, and the GPU handles data acceleration using parallelization. This is considered heterogeneous programming, where compute-intensive data-parallel tasks are offloaded onto the GPU.

CUDA contains three specific paradigms: thread hierarchy, memory hierarchy and synchronization. These can be further divided into coarse-grained parallelism of blocks in a grid, and fine-grained parallelism of threads in a block, which requires low-level synchronization.

### 4.1 Thread hierarchy

CUDA kernels are special function calls which are used for data parallelization. Each kernel launches threads, which are grouped into blocks, which are in turn grouped into grids. Communication is done synchronously by threads in a block, whereas blocks are independent. Certain programming techniques need to be undertaken to ensure data synchronization and validity between blocks. Threads in different blocks are not able to communicate with each other.

Threads are distinguished by their unique threadId within their respective blockId, which allows operating on specific data in global and shared memory.
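The standard way a thread derives its data position from these two identifiers can be modeled on the CPU as follows (the grid and block sizes in the test are illustrative only):

```cpp
#include <cassert>

// CPU-side model of the CUDA indexing idiom: a thread's global position is
// computed from its block index, the block size, and its thread index,
// mirroring blockIdx.x * blockDim.x + threadIdx.x in a kernel.
int globalIndex(int blockId, int blockDim, int threadId) {
    return blockId * blockDim + threadId;
}
```

Each thread then uses this global index to address its own element of a global or shared memory array.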

### 4.2 Memory hierarchy

There are different memory types on the GPU which CUDA can utilize. Some memory structures are cache based, some are read-only, etc. The highest level memory structure is called global memory, which can be accessed by all blocks. Due to its size and access level, it is the slowest memory on the GPU. The second memory level is shared memory, which is allocated per block and which the threads within that block can access. The third memory is register memory, which is accessible only by individual threads and can be used for local variables. This is the smallest and fastest memory on the GPU. For larger data structures, when registers are not sufficient, local memory can be utilized. Another memory is constant memory, which cannot be changed by kernel code. The final memory is texture memory, a read-only cache that provides a speed-up for locality in data access by threads [29].

### 4.3 Synchronization

Blocks in grids are used for coarse-grained parallelism, and threads in a specific block are used for fine-grained parallelism. Data sharing in the scope of a kernel is done by the threads in a block. The number of threads is limited by the device architecture design (max. 1024) and also by per-thread memory resource consumption. There is a level of scalability, as the blocks are scheduled independently. Each block is assigned to a streaming multiprocessor (SM) on the GPU [29, 30].

## 5. CUDA-based 2-opt algorithm

This section presents the parallel CUDA-based version of the 2-opt algorithm. It is a modification of the local search for the permutative flowshop problem with makespan criterion [31] and its NEH variant [32]. Before the parallel implementation description, however, a more detailed pseudocode of the sequential version is provided in Algorithm 1, in order to enable a better understanding of the CUDA algorithm design.

Algorithm 1: 2-opt sequential version. The Swap(T, j, i) procedure swaps the jth and ith cities of tour T
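The algorithm listing itself is not reproduced in this text; the following C++ sketch shows what the Swap-based sequential 2-opt of Algorithm 1 might look like, assuming a first-improvement scan over all city pairs and an illustrative asymmetric instance in the test:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Cost of a closed tour T under an asymmetric distance matrix c.
int tourCost(const std::vector<std::vector<int>>& c, const std::vector<int>& T) {
    int n = (int)T.size(), cost = 0;
    for (int k = 0; k < n; ++k) cost += c[T[k]][T[(k + 1) % n]];
    return cost;
}

// Sequential 2-opt in the Swap(T, j, i) form: repeatedly scan all pairs
// (i, j), keep a swap whenever it improves the tour, and stop when a full
// pass yields no improvement, i.e. the tour is locally optimal.
std::vector<int> twoOpt(const std::vector<std::vector<int>>& c, std::vector<int> T) {
    int n = (int)T.size();
    int best = tourCost(c, T);
    bool improved = true;
    while (improved) {
        improved = false;
        for (int i = 0; i < n - 1; ++i)
            for (int j = i + 1; j < n; ++j) {
                std::swap(T[i], T[j]);                   // Swap(T, j, i)
                int cost = tourCost(c, T);
                if (cost < best) { best = cost; improved = true; }
                else             std::swap(T[i], T[j]);  // undo non-improving swap
            }
    }
    return T;
}
```

On the small test instance this local search happens to reach the global optimum, which is not guaranteed in general.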

As can be seen from the description of 2-opt, the task that can be done in parallel is the exploration of the neighborhood of the current solution. This is divided between individual CUDA blocks. Possible neighbors of the current solution are split evenly between the launched blocks, which then explore these neighbors, including the fitness evaluations. If a new better solution is found, it is stored into the global memory allocation of that block. Thereafter, if at least one of the launched blocks finds an improving solution during the iteration, the best cost solution amongst all blocks is obtained and stored into memory as the current solution for the next iteration. Otherwise, the current solution is returned as the best. It should be noted that the fitness function is not parallelized, as only a single thread in each block performs this task.


The outline of the parallel algorithm can be given as follows:

Step 1: Set the current solution S = initial solution.

Step 2: Explore the neighborhood of S by G blocks in parallel. In each block b:

Step 2.1: Determine the initial index i for b.

Step 2.2: Explore all neighbors of S created by swapping i and j, j = 1, …, N. If an improving neighbor T is found, go to step 2.4.

Step 2.3: Determine the next index i for b. If i ≥ N, terminate. Otherwise go to step 2.2.

Step 2.4: Store T and its objective function value f(T) into global memory and terminate.

Step 3: If no improving solution is found, exit the procedure and return S as the best solution found. Otherwise determine the best solution amongst those found by the blocks in parallel.

Step 4: Store the best solution as S. Go to step 2.

Here, N is the number of cities in the tour and i is the outer loop index (see Algorithm 1 for the sequential version of 2-opt).

### 5.1 Exploration and evaluation of neighboring solutions

The neighbors of the solution are generated and evaluated in this kernel. From the sequential version pseudocode (Algorithm 1), it is obvious that generating the individual neighbors by swapping every possible pair of cities (i, j) for i = 1, …, N and j = i + 1, …, N can be considered independent and therefore executed in parallel. These solutions can be stored in shared memory after generation. After evaluation, if the new solution has a better fitness value than the current one, it is stored into the global memory allocated for each block, to avoid data races between blocks (this is illustrated in Figure 1, depicting the memory layout for six cities and four blocks). The improvements counter in global memory is incremented using an atomic operation. This counter is compared against zero after kernel termination, to determine if the stopping criterion of the algorithm was met. The fitness function itself is evaluated by only a single thread; the other threads in a block process the elements of the solution when transferring data between shared and global memory locations.
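The improvements-counter bookkeeping can be modeled on the host side as follows. In the sketch, each loop iteration stands in for one block (the CUDA kernel would use atomicAdd on a counter in global memory); the function name and the improving/non-improving pattern in the test are illustrative only:

```cpp
#include <atomic>
#include <cassert>
#include <vector>

// Host-side model of the stopping-criterion bookkeeping: every "block" that
// finds an improving neighbor increments a shared counter atomically. After
// the kernel, a zero counter means no block improved, so the search stops.
int countImprovements(const std::vector<bool>& blockFoundImproving) {
    std::atomic<int> improvements{0};
    for (bool found : blockFoundImproving)   // each iteration models one block
        if (found) improvements.fetch_add(1);   // atomicAdd equivalent
    return improvements.load();
}
```

A non-zero counter tells the host to run the reduction kernel and start the next iteration; zero terminates the loop.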

It is logically impractical to allocate the full number of N(N − 1)/2 blocks on the GPU in most scenarios. This number can be very large, whereas the number of SMs and the number of resident blocks on an SM are limited by various factors, such as the number of threads in a block and register/shared memory usage. The optimal number of threads in a block, maximizing the number of resident blocks as well as GPU occupancy, can be easily determined based on the calculations performed in the CUDA occupancy calculator tool [33], as a function of the number of cities in a tour (which determines the size of shared memory used). This maximizes the utilization of the GPU, while reducing the total global memory size required by the grid, as well as the workload done by the search for the minimal cost solution in the next kernel. The mapping of the blocks to the tasks, however, becomes more complicated to implement in code.

Using the assumption that the number of blocks will nearly always be smaller than the aforementioned function of the number of cities for the problem instances of interest (instances with more than 30 cities), only the outer loop of the sequential 2-opt algorithm was parallelized. The inner loop is performed by each block sequentially. This reduces the data transfers between global and shared memory, while retaining the advantage of the low complexity of the swap operation: if the solution created by swapping cities i and j is worse than the current one, it is easy to reverse the change by swapping j and i again, with equal complexity. Therefore, at most N − 1 blocks are needed for this function. The mapping of blocks to tasks is illustrated in Figure 2.
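A CPU-side model of this block-to-task mapping might look as follows. Each logical "block" owns a set of outer indices i (a grid-stride pattern), scans the inner loop sequentially, and undoes any non-improving swap instead of keeping a per-neighbor copy of the tour; the function names and the test instance are illustrative:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Cost of a closed tour T under an asymmetric distance matrix c.
int tourCost(const std::vector<std::vector<int>>& c, const std::vector<int>& T) {
    int n = (int)T.size(), cost = 0;
    for (int k = 0; k < n; ++k) cost += c[T[k]][T[(k + 1) % n]];
    return cost;
}

// Model of one block's work: block b strides over outer indices i, scans the
// inner loop j sequentially, and reverses each swap in O(1) if it does not
// improve. Returns the first improving neighbor found, or an empty vector.
std::vector<int> blockSearch(const std::vector<std::vector<int>>& c,
                             std::vector<int> T, int b, int numBlocks,
                             int bestCost) {
    int n = (int)T.size();
    for (int i = b; i < n - 1; i += numBlocks) {   // grid-stride outer loop
        for (int j = i + 1; j < n; ++j) {
            std::swap(T[i], T[j]);
            if (tourCost(c, T) < bestCost) return T;   // improving neighbor
            std::swap(T[i], T[j]);                     // cheap undo
        }
    }
    return {};
}
```

The host (or a follow-up kernel) would then reduce over the per-block results to pick the next current solution.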

### 5.2 Parallel reduction to obtain minimal cost

The parallel reduction procedure is used to find the index of the solution with the minimal fitness value. It employs shared memory to store the data being used, with the data initially copied from global to shared memory. In each step, each active thread compares two costs and stores the smaller of the two in the place of the first, along with its original index (a cost is represented as a structure containing two elements: the cost value and the cost index). Using this reduction, the first element of the costs array finally contains the minimal cost found, along with its respective solution index. This pair is then written into global memory.

### 5.3 Device synchronization and subsequence update

In the final step, a new kernel copies the best indexed solution into the current solution buffer, and the next iteration of the main loop can be performed. A global CUDA device synchronization is required for relatively large data (for a tour size/number of threads in a block of more than approximately 100, as was empirically confirmed). As each of the kernels consumes some of the GPU resources, it is necessary to wait until the pending kernels completely finish execution and release their resources; otherwise the GPU freezes and unsuccessful kernel launches start to appear. This is done by calling the cudaDeviceSynchronize() function from the host code after the Update kernel is launched.

Figure 3 outlines the memory layout of the previously described code (without the TSP input data), for a current subsequence of size 2 and a city tour of size 4; the data fields not used in the current step are grayed out. The candidate solutions are stored in one 1D global memory array, which conceptually represents a 2D array wherein each row contains one candidate tour. The respective costs are stored in a separate array. The TSP input data (distances between cities) are stored in a similar fashion in global memory (because of their large size).
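The 1D-array-as-2D-array layout reduces to simple row-major indexing; a minimal sketch with illustrative sizes (two candidate tours of three cities):

```cpp
#include <cassert>
#include <vector>

// Row-major addressing of the candidate-solution store: one flat global
// array holds B rows of N cities each, so row r, column col lives at
// r * N + col. The same scheme is used for the distance matrix.
int cityAt(const std::vector<int>& tours, int N, int row, int col) {
    return tours[row * N + col];
}
```

Each block writes its candidate into its own row, so no two blocks touch the same region of the flat array.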

This implementation is expected to provide, in each step, a speedup proportional to the number of solutions generated in parallel.

## 6. 2-OPT variants

Two versions of the 2-opt local search were implemented in this work. The first is the LS2OPT variant, which uses the first ascent strategy: the next tour is the first improving solution found. This is given in Algorithm 2.

The second variant is the MLS2OPT version, which uses the best ascent strategy: the next tour is the best improving solution found in the 2-swap neighborhood, as given in Algorithm 3.
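The difference between the two strategies amounts to one early return in the neighborhood scan. A minimal C++ sketch of a single neighborhood pass (the function name and the test instance are illustrative, not the chapter's code):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Cost of a closed tour T under an asymmetric distance matrix c.
int tourCost(const std::vector<std::vector<int>>& c, const std::vector<int>& T) {
    int n = (int)T.size(), cost = 0;
    for (int k = 0; k < n; ++k) cost += c[T[k]][T[(k + 1) % n]];
    return cost;
}

// One pass over the 2-swap neighborhood. With best = false it returns the
// first improving tour found (LS2OPT, first ascent); with best = true it
// scans every pair and returns the best improvement (MLS2OPT, best ascent).
std::vector<int> ascend(const std::vector<std::vector<int>>& c,
                        std::vector<int> T, bool best) {
    int n = (int)T.size(), bestCost = tourCost(c, T);
    std::vector<int> bestT = T;
    for (int i = 0; i < n - 1; ++i)
        for (int j = i + 1; j < n; ++j) {
            std::swap(T[i], T[j]);
            int cost = tourCost(c, T);
            if (cost < bestCost) {
                bestCost = cost;
                bestT = T;
                if (!best) return bestT;   // first ascent: stop immediately
            }
            std::swap(T[i], T[j]);         // restore T before the next pair
        }
    return bestT;                           // best ascent: best of whole pass
}
```

On the test instance the two strategies pick different neighbors from the same starting tour, which is exactly why their iteration counts and run times differ.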

## 7. Experimentation design

The experimentation design is as follows. Three different CPUs and three different GPUs are used to run the two 2-opt variants on a selected number of asymmetric TSP (ATSP) instances. The only measure is the execution time.

The ATSP problem instances were obtained from the TSP library [34]. The following problems were selected due to their differing city sizes, as given in Table 1.

| Data | Cities |
| --- | --- |
| ft70 | 70 |
| ftv64 | 65 |
| ftv170 | 171 |
| kro124p | 100 |
| rbg323 | 323 |
| rbg358 | 358 |
| rbg403 | 403 |
### Table 1.

TSP instances and number of cities.

The machine specifications are given in Table 2. Three separate machines were used with differing CPUs and GPUs. Two machines ran the Windows 10 operating system, and the other is a Central Washington University supercomputer cluster running Ubuntu [35]. Machines 2 and 3 utilized headless GPUs.

Algorithm 2: LS2OPT sequential version. The Swap(T, j, i) procedure swaps the jth and ith cities of tour T

Algorithm 3: MLS2OPT sequential version. The Swap(T, idx, i) procedure swaps the idxth and ith cities of tour T, where idx is the best 2-swap jth index found after the iteration

| Specifications | Machine 1 | Machine 2 | Machine 3 |
| --- | --- | --- | --- |
| Processor | Intel i7-9750H / GTX 1050 | Intel i7-7800X / Titan Xp | Power 8 / P100 |
| Memory | 16 GB / 2 GB | 32 GB / 12 GB | 32 GB / 16 GB |
| Cores | 4 / 640 | 6 / 3840 | 6 / 3584 |
| OS | Win10 | Win10 | Ubuntu |
| Language | C++ / CUDA-C | C++ / CUDA-C | C++ / CUDA-C |
| IDE | Visual Studio 17 | Visual Studio 17 | Makefile |
| Cost (USD) | $200 | $1500 | $15,000 |

### Table 2.

Machine specifications (CPU / GPU values separated by a slash).

## 8. Results and analysis

The results are grouped by machine architecture, as there is a dependency between the CPU and GPU. Thirty experiments were done for each problem instance on each machine for each algorithm, and the average time in milliseconds (ms) is given in the tables. The percentage relative difference (PRD) between the CPU and GPU times is calculated as given in Eq. (3). Negative values indicate that the GPU execution is faster.

$$PRD = \frac{GPU - CPU}{CPU} \times 100 \tag{3}$$

The first part of the first machine experiment results, of LS2OPT and its CUDA variant, is given in Table 3. The first column is the problem instance, and the second and third columns are the CPU and GPU average results of LS2OPT in milliseconds. The final column is the PRD. For all instances apart from ftv64, the GPU produced faster results. The average time was 22480.28 ms for the CPU and 2168.57 ms for the GPU. The average PRD was −47.29% over all experiments. A deeper analysis shows that for the larger instances, the PRD was over 80%.

| Data | Intel i7-9750H LS2OPT | NVidia GTX 1050 LS2OPTCUDA | PRD (%) |
| --- | --- | --- | --- |
| ft70 | 42 | 34 | −19.047 |
| ftv64 | 14 | 30 | 114.29 |
| ftv170 | 322 | 87 | −72.98 |
| kro124p | 580 | 111 | −80.86 |
| rbg323 | 43,854 | 2963 | −93.24 |
| rbg358 | 51,069 | 4096 | −91.98 |
| rbg403 | 61,481 | 7859 | −87.22 |
| Average | 22480.28 | 2168.57 | −47.29 |

### Table 3.

Results of the experiments of Intel i7-9750H and NVidia GTX 1050 on the LS2OPT and LS2OPTCUDA algorithms. All results are in milliseconds (ms).

The plot of the execution time is given in Figure 4, where the execution speedup is clearly identifiable for the larger instances.

The second part of the first machine experimentation is MLS2OPT and its CUDA variant, and the results are given in Table 4. For all but one of the problem instances, the execution time on the GPU was significantly better. The average time was 14183.85 ms for the CPU and 1854.28 ms for the GPU. The average PRD was −52.55% over all experiments.
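The PRD values in these tables can be reproduced directly from Eq. (3); a minimal sketch, using the ft70 row of Table 3 (CPU 42 ms, GPU 34 ms) as the sample figures:

```cpp
#include <cassert>
#include <cmath>

// PRD from Eq. (3): negative when the GPU run is faster than the CPU run.
double prd(double gpu, double cpu) {
    return (gpu - cpu) / cpu * 100.0;
}
```

A positive PRD, as for ftv64 in Table 3, means the GPU run was slower than the CPU run.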
Apart from two instances, all the others were above 85% PRD.

| Data | Intel i7-9750H MLS2OPT | NVidia GTX 1050 MLS2OPTCUDA | PRD (%) |
| --- | --- | --- | --- |
| ft70 | 37 | 21 | −43.24 |
| ftv64 | 26 | 52 | 100.00 |
| ftv170 | 619 | 78 | −87.40 |
| kro124p | 303 | 75 | −75.25 |
| rbg323 | 21,205 | 2525 | −88.09 |
| rbg358 | 31,330 | 3775 | −87.95 |
| rbg403 | 45,767 | 6454 | −85.90 |
| Average | 14183.85 | 1854.28 | −52.55 |

### Table 4.

Results of the experiments of Intel i7-9750H and NVidia GTX 1050 on the MLS2OPT and MLS2OPTCUDA algorithms. All results are in milliseconds (ms).

The plot of the execution time is given in Figure 5, where the execution speedup is clearly identifiable for the larger instances.

The first part of the second machine experiment results, of LS2OPT and its CUDA variant, is given in Table 5. As the NVidia Titan Xp is a dedicated headless Tesla-class GPU, the computational times are better than the CPU for all the results. The average time was 12157.14 ms for the CPU and 857 ms for the GPU. The average PRD was −64.92% over all experiments. A deeper analysis shows that for the larger instances, the PRD was over 90%. As the transfer overhead of the PCIe bus is amortized over more extensive computation, the larger instances performed faster on the GPU.

| Data | Intel i7-7800X LS2OPT | NVidia Titan Xp LS2OPTCUDA | PRD (%) |
| --- | --- | --- | --- |
| ft70 | 21 | 18 | −14.29 |
| ftv64 | 12 | 8 | −33.33 |
| ftv170 | 183 | 77 | −57.92 |
| kro124p | 306 | 94 | −69.28 |
| rbg323 | 23,619 | 1467 | −93.79 |
| rbg358 | 27,614 | 1848 | −93.31 |
| rbg403 | 33,345 | 2487 | −92.54 |
| Average | 12157.14 | 857 | −64.92 |

### Table 5.

Results of the experiments of Intel i7-7800X and NVidia Titan Xp on the LS2OPT and LS2OPTCUDA algorithms. All results are in milliseconds (ms).

The plot of the execution time is given in Figure 6, where the execution speedup is clearly identifiable for the larger instances.

The second part of the second machine experimentation is MLS2OPT and its CUDA variant, and the results are given in Table 6. For all the problem instances, the execution time on the GPU was significantly better. The average time was 7955.28 ms for the CPU and 616.28 ms for the GPU. The average PRD was −63.39% over all experiments.
The three larger instances were above 90% PRD.

| Data | Intel i7-7800X MLS2OPT | NVidia Titan Xp MLS2OPTCUDA | PRD (%) |
| --- | --- | --- | --- |
| ft70 | 20 | 16 | −20.00 |
| ftv64 | 11 | 8 | −27.27 |
| ftv170 | 321 | 58 | −81.93 |
| kro124p | 164 | 56 | −65.85 |
| rbg323 | 11,517 | 941 | −91.83 |
| rbg358 | 17,109 | 1059 | −93.81 |
| rbg403 | 24,445 | 2176 | −91.10 |
| Average | 7955.28 | 616.28 | −63.39 |

### Table 6.

Results of the experiments of Intel i7-7800X and NVidia Titan Xp on the MLS2OPT and MLS2OPTCUDA algorithms. All results are in milliseconds (ms).

The plot of the execution time is given in Figure 7, where the execution speedup is clearly identifiable for the larger instances.

The first part of the third machine experiment results, of LS2OPT and its CUDA variant, is given in Table 7. The NVidia P100 is generally regarded as an industry-leading GPU solution for scientific computing. It is coupled here with the IBM Power 8 CPU architecture. For all the problem instances the result was significantly better. The average time was 28592.43 ms for the CPU and 1536.42 ms for the GPU. The average PRD was −87.83% over all experiments.

| Data | Power 8 LS2OPT | NVidia P100 LS2OPTCUDA | PRD (%) |
| --- | --- | --- | --- |
| ft70 | 57 | 10 | −82.46 |
| ftv64 | 23 | 6 | −73.91 |
| ftv170 | 430 | 75 | −82.56 |
| kro124p | 754 | 61 | −91.91 |
| rbg323 | 61,775 | 3245 | −94.75 |
| rbg358 | 64,419 | 3587 | −94.43 |
| rbg403 | 72,689 | 3771 | −94.81 |
| Average | 28592.43 | 1536.42 | −87.83 |

### Table 7.

Results of the experiments of Power 8 and NVidia P100 on the LS2OPT and LS2OPTCUDA algorithms. All results are in milliseconds (ms).

The plot of the execution time is given in Figure 8, where the execution speedup is clearly identifiable for the larger instances.

The second part of the third machine experimentation is MLS2OPT and its CUDA variant, and the results are given in Table 8. For all the problem instances, the execution time on the GPU was again significantly better. The average time was 23429.14 ms for the CPU and 751 ms for the GPU. The average PRD was −92.78% over all experiments, the highest of all the experiments.
| Data | Power 8 MLS2OPT | NVidia P100 MLS2OPTCUDA | PRD (%) |
| --- | --- | --- | --- |
| ft70 | 53 | 7 | −86.79 |
| ftv64 | 33 | 4 | −87.88 |
| ftv170 | 811 | 52 | −93.59 |
| kro124p | 385 | 35 | −90.91 |
| rbg323 | 30,120 | 1124 | −96.27 |
| rbg358 | 44,709 | 1215 | −97.28 |
| rbg403 | 87,893 | 2820 | −96.79 |
| Average | 23429.14 | 751 | −92.78 |

### Table 8.

Results of the experiments of Power 8 and NVidia P100 on the MLS2OPT and MLS2OPTCUDA algorithms. All results are in milliseconds (ms).

The plot of the execution time is given in Figure 9, where the execution speedup is clearly identifiable for the larger instances.

The final comparison is of the three GPUs on the two separate algorithms. Figure 10 shows the values of the three GPUs on the problem instances for the LS2OPTCUDA algorithm. For the small-sized problems, the timings are not significantly distinct. The distinction only becomes apparent as the instance sizes increase. Overall, the NVidia Titan Xp is the best performing GPU for this algorithm. Figure 11 shows the results of the MLS2OPTCUDA algorithm on the problems. As in the previous case, the distinction only becomes obvious for the large problem instances. Again, the NVidia Titan Xp is the best performing GPU for this algorithm.

## 9. Algorithm comparison

This section discusses the tour cost obtained by the two different 2-opt approaches developed here, compared with published research. The first comparison is done with the best known solutions in the literature, which can be obtained from TSPLib [36]. Table 9 gives the comparison results between the optimal values and the results obtained from the LS2OPTCUDA and MLS2OPTCUDA algorithms on the P100 GPU. The results are compared using the PRD of Eq. (3), where the GPU term is replaced with the optimal value and the CPU term is replaced by the obtained result.
| Data | Optimal | LS2OPT | PRD (%) | MLS2OPT | PRD (%) |
| --- | --- | --- | --- | --- | --- |
| ft70 | 38,673 | 43,163 | −10.40 | 43,310 | −10.71 |
| ftv64 | 1839 | 2744 | −32.98 | 2554 | −28.00 |
| ftv170 | 2755 | 4559 | −39.57 | 4510 | −38.91 |
| kro124p | 36,230 | 58,014 | −37.55 | 55,011 | −34.14 |
| rbg323 | 1326 | 1681 | −21.12 | 1535 | −13.62 |
| rbg358 | 1163 | 1625 | −28.43 | 1459 | −20.29 |
| rbg403 | 2465 | 2710 | −9.04 | 2598 | −5.12 |
| Average | 12064.43 | 16356.57 | −25.58 | 15853.86 | −21.54 |

### Table 9.

Comparison of 2-opt vs. optimal values.

The PRD comparison shows that LS2OPT is at most 40% away from the optimal value (the ftv170 instance) and as little as 9% away (the rbg403 instance). For MLS2OPT, the PRD is −39% from the optimal value for the ftv170 instance and −5% for the rbg403 instance. On average, MLS2OPT is the better performing algorithm, with an average tour cost of 15853.86 against 16356.57 for LS2OPT. A plot of the comparison values is given in Figure 12.

The second comparison is with four different evolutionary algorithms, as given in Table 10. These are the Discrete Particle Swarm Optimization (DPSO) algorithm [37], the Discrete Self-Organizing Migrating Algorithm (DSOMA) [38], the Enhanced Differential Evolution (EDE) algorithm and the Chaos driven Enhanced Differential Evolution (EDEC) algorithm [17]. The DPSO and DSOMA algorithms were revised for the TSP problem, and the 2-opt local search was removed from them in order to compare results without any local search implemented. EDE and EDEC are published algorithms; however, results for only three instances were published. Both of these algorithms had the 2-opt local search embedded in them.

| Data | MLS2OPT | DPSO | DSOMA | EDE | EDEC |
| --- | --- | --- | --- | --- | --- |
| ft70 | 43,310 | 54,444 | 51,325 | 40,285 | 39,841 |
| ftv64 | 2554 | 4711 | 4423 | | |
| ftv170 | 4510 | 19,102 | 9522 | 6902 | 4578 |
| kro124p | 55,011 | 113,153 | 75,373 | 41,180 | 39,574 |
| rbg323 | 1535 | 4852 | 4523 | | |
| rbg358 | 1459 | 5692 | 4874 | | |
| rbg403 | 2598 | 6373 | 4427 | | |
| Average | 15853.86 | 29761.00 | 22066.71 | | |

### Table 10.

MLS2OPT vs. evolutionary algorithms.

From the results, it is obvious that evolutionary algorithms without local search heuristics are not as effective as the 2-opt local search heuristic, or as algorithms with both directed and local search combined.
Therefore, it is important to combine the two, as in [39]. As reported in [39], the execution time of the local search can be around 95-99% of the total run time of the algorithm, so it is viable to accelerate the local search heuristics.

## 10. Conclusions

This chapter introduces a CUDA accelerated 2-opt algorithm for the TSP. As one of the most common and widely used approaches to solving the problem, the 2-opt approach can be considered canonical in the field. GPU programming, especially CUDA, has gained significant traction in high-performance computing, and readily available hardware has made such programming a much easier and more accessible task.

Two variants of the 2-opt algorithm have been coded in CUDA to show the acceleration of computational time. These have been tested against a sample of test instances from the literature. From the results obtained, it is clear that even for a relatively cheap GPU such as the GTX 1050, the performance improvement is significant, especially for the larger problem instances. The comparisons were made against industry-leading CPUs such as the Intel i7-X series and the IBM Power 8.

One of the interesting aspects was that the Titan Xp performed better than the P100 for these instances. It is difficult to identify the reasons, as the same code was deployed on all machines; however, the IBM and Intel architecture differences and different C/C++ compiler usage may have affected the performance. The physical configuration of the GPUs inside the hardware, their connection to the motherboard, and memory bandwidth issues could also add to the time overhead. However, when analyzing the cost-performance of the GPUs, the $1500 Titan Xp is a better GPU than the $15,000 P100 in this case.

However, the clear conclusion is that there is a significant improvement to be had when applying the CUDA version of the 2-opt algorithm. The next direction of this research is to combine it with powerful swarm metaheuristics in a layered approach, and to try to solve very large TSP instances.

## References

1. Englert M, Roglin H, Vocking B. Worst case and probabilistic analysis of the 2-opt algorithm for the TSP. Algorithmica. 2014;68:190-264
2. Van Leeuwen J, Schoon A. Untangling a traveling salesman tour in the plane. In: Proceedings of the 7th International Workshop on Graph-Theoretical Concepts in Computer Science. The Netherlands: Rijksuniversiteit, Vakgroep Informatica; 1981. pp. 87-98
3. Johnson D, McGeoch L. The traveling salesman problem: A case study in local optimization. In: Aarts E, Lenstra J, editors. Local Search in Combinatorial Optimization. Hoboken, NJ, USA: John Wiley and Sons; 1997
4. Farber R. CUDA Application Design and Development. Burlington, MA, USA: Morgan Kaufmann; 2012
5. Lawler EL, Lenstra JK, Kan AR, Shmoys DB. The Traveling Salesman Problem: A Guided Tour of Combinatorial Optimization. Vol. 3. New York: Wiley; 1985
6. Davendra D, editor. Traveling Salesman Problem, Theory and Applications. Rijeka: IntechOpen; 2010
7. Laporte G. The traveling salesman problem: An overview of exact and approximate algorithms. European Journal of Operational Research. 1992;59(2):231-247
8. Li H, Alidaee B. Tabu search for solving the black-and-white travelling salesman problem. Journal of the Operational Research Society. 2016;67(8):1061-1079
9. Kirkpatrick S, Gelatt C, Vecchi M. Optimization by simulated annealing. Science. 1983;220(4598):671-680
10. Grefenstette J, Gopal R, Rosmaita B, Van Gucht D. Genetic algorithms for the traveling salesman problem. In: Proceedings of the First International Conference on Genetic Algorithms and their Applications. New Jersey: Lawrence Erlbaum; 1985. pp. 160-168
11. Oliver I, Smith D, Holland JR. Study of permutation crossover operators on the traveling salesman problem. In: Genetic Algorithms and Their Applications: Proceedings of the Second International Conference on Genetic Algorithms; July 28-31, 1987; Massachusetts Institute of Technology, Cambridge, MA. Hillsdale, NJ: L. Erlbaum Associates; 1987
12. Yu B, Yang Z-Z, Yao B. An improved ant colony optimization for vehicle routing problem. European Journal of Operational Research. 2009;196(1):171-176
13. Tang K, Li Z, Luo L, Liu B. Multi-strategy adaptive particle swarm optimization for numerical optimization. Engineering Applications of Artificial Intelligence. 2015;37:9-19
14. Yang X-S, Deb S. Cuckoo search via Lévy flights. In: World Congress on Nature & Biologically Inspired Computing (NaBIC 2009). NY, USA: IEEE Publications; 2009. pp. 210-214
15. Yang X-S. Firefly algorithms for multimodal optimization. In: Stochastic Algorithms: Foundations and Applications. Berlin, Heidelberg, Germany: Springer; 2009. pp. 169-178
16. Osaba E, Del Ser J, Sadollah A, Bilbao MN, Camacho D. A discrete water cycle algorithm for solving the symmetric and asymmetric traveling salesman problem. Applied Soft Computing. 2018;71:277-290
17. Davendra D, Zelinka I, Senkerik R, Bialic-Davendra M. Chaos driven evolutionary algorithm for the traveling salesman problem. In: Davendra D, editor. Traveling Salesman Problem, Theory and Applications. Rijeka: IntechOpen; 2010. DOI: 10.5772/13107
18. Li L, Cheng Y, Tan Y, Niu B. A discrete artificial bee colony algorithm for TSP problem. In: Proceedings of the 7th International Conference on Intelligent Computing: Bio-Inspired Computing and Applications (ICIC'11). Berlin, Heidelberg: Springer-Verlag; 2011. pp. 566-573. DOI: 10.1007/978-3-642-24553-4_75
19. Davendra D, Zelinka I, Pluhacek M, Senkerik R. DSOMA—Discrete self-organising migrating algorithm. In: Self-Organizing Migrating Algorithm: Methodology and Implementation. Berlin, Heidelberg, Germany: Springer; 2016. pp. 51-63
20. Croes G. A method for solving traveling-salesman problems. Operations Research. 1958;6(6):791-812
21. Shen L. Computer solutions of the traveling salesman problem. Bell System Technical Journal. 1965;44(10):2245-2269
22. Savelsbergh M. An efficient implementation of local search algorithms for constrained routing problems. European Journal of Operational Research. 1990;47(1):75-85
23. Johnson D, McGeoch L. The traveling salesman problem: A case study in local optimization. Local Search in Combinatorial Optimization. 1997;1:215-310
24. Gutin G, Punnen A. The Traveling Salesman Problem and Its Variations. Vol. 12. Berlin, Heidelberg, Germany: Springer; 2002
25. Kim B, Shim J, Zhang M. Comparison of TSP algorithms. In: Project for Facilities Planning and Materials Handling; 1998
26. Khronos Group. Available from: https://www.khronos.org/ [Accessed: 24 May 2020]
27. Sanders J, Kandrot E. CUDA by Example. 1st Print Edition. Addison-Wesley; 2010
28. Kirk D, Wen-mei W. Programming Massively Parallel Processors: A Hands-on Approach. Newnes; 2012
29. NVIDIA. CUDA C Programming Guide. Santa Clara, CA, USA: NVIDIA Corporation; 2020
30. NVIDIA. Kepler GK110. Santa Clara, CA, USA: NVIDIA Corporation; 2012
31. Metlicka M. Framework for scheduling problems [master's thesis]. Ostrava, Czech Republic: Technical University of Ostrava; 2015
32. Metlicka M, Davendra D, Hermann F, Meier M, Amann M. GPU accelerated NEH algorithm. In: 2014 IEEE Symposium on Computational Intelligence in Production and Logistics Systems (CIPLS), Orlando, FL; 2014. pp. 114-119. DOI: 10.1109/CIPLS.2014.7007169
33. NVIDIA. CUDA C Best Practices Guide [Online]. 2020
34. TSP Library ATSP Dataset. Available from: http://elib.zib.de/pub/mp-testdata/tsp/tsplib/atsp/index.html [Accessed: 02 January 2020]
35. CWU Turing Supercomputer. Available from: http://www.cwu.edu/faculty/turing-cwu-supercomputer [Accessed: 20 February 2020]
36. TSP Library ATSP Best Known Solutions. Available from: http://elib.zib.de/pub/mp-testdata/tsp/tsplib/atsp-sol.html [Accessed: 17 May 2020]
37. Wang X, Tang L. A discrete particle swarm optimization algorithm with self-adaptive diversity control for the permutation flowshop problem with blocking. Applied Soft Computing. 2012;12:652-662
38. Davendra D, Bialic-Davendra M. Discrete self organizing algorithm for pollution vehicle routing problem. In: Proceedings of the Genetic and Evolutionary Computation Conference (GECCO '20 Companion). New York, NY, USA: ACM; 2020. p. 8. DOI: 10.1145/3377929.3398076
39. Merz P, Freisleben B. Genetic local search for the TSP: New results. In: Proceedings of 1997 IEEE International Conference on Evolutionary Computation (ICEC '97), Indianapolis, IN, USA; 1997. pp. 159-164. DOI: 10.1109/ICEC.1997.592288
