Open access peer-reviewed chapter

# CUDA Accelerated 2-OPT Local Search for the Traveling Salesman Problem

By Donald Davendra, Magdalena Metlicka and Magdalena Bialic-Davendra

Submitted: October 7th 2019Reviewed: June 5th 2020Published: September 3rd 2020

DOI: 10.5772/intechopen.93125

## Abstract

This research involves the development of a compute unified device architecture (CUDA) accelerated 2-opt local search algorithm for the traveling salesman problem (TSP). As one of the fundamental mathematical approaches to solving the TSP problem, the time complexity has generally reduced its efficiency, especially for large problem instances. Graphic processing unit (GPU) programming, especially CUDA has become more mainstream in high-performance computing (HPC) approaches and has made many intractable problems at least reasonably solvable in acceptable time. This chapter describes two CUDA accelerated 2-opt algorithms developed to solve the asymmetric TSP problem. Three separate hardware configurations were used to test the developed algorithms, and the results validate that the execution time decreased significantly, especially for the large problem instances when deployed on the GPU.

### Keywords

• traveling salesman problem
• CUDA
• 2-opt
• local search
• GPU programming

## 1. Introduction

This research addresses two very important aspects of computational intelligence, algorithm design, and high-performance computing. One of the fundamental problems in this field is the TSP, which has been used as a poster child for the notorious P=NPassertion in theoretical computer science.

TSP in nominal form is considered NP-Complete, when attempted using exact deterministic heuristics. The time complexity when solving it using the Held-Karp algorithm is On22nand the space complexity is On2n. When solving the problem using optimization algorithms and approximation, then problem tends to be NP-Hard.

2-opt is considered the simplest local search for the TSP problem. Theoretical knowledge about this heuristic is still very limited [1]; however, simple euclidean distance variants have been shown to have complexity of On3[2]. Generally, the computed solution has been shown to be within a few percentage points of the global optimal [3].

One of the empirical approaches of improving the execution of the algorithm is applying high performance computing (HPC) paradigm to the problem. This is generally possible if the problem is deducible to a parallel form. A number of different HPC approaches exist, namely, threads, OpenMP, MPI and CUDA. CUDA is by far the most complex and accelerated approach, as it requires programming on the GPU instead of the central processing unit (CPU).

Since its inception, CUDA has been widely used to solve a large number of computational problems [4]. This research looks to harness this approach to implement the 2-opt approach to the TSP problem.

The outline of the chapter follows with the introduction of the mathematical background of the TSP problem followed by the 2-opt algorithm. CUDA is subsequently discussed and the two CUDA developed 2-opt algorithm variants are described. The experimentation design discusses the hardware specifications of the three different architectures and then the obtained results are discussed and analyzed in respect to the execution time.

## 2. Traveling salesman problem

The TSP is a well-studied problem in literature [5, 6], which in essence tries to find the shortest path that visits a set of customers and returns to the first. A number of studies have been done using both approximation-based approaches [7] and metaheuristics. Metaheuritics are generally based on evolutionary approaches. A brief outline of different approaches can be obtained from:

1. Tabu Search: [8]

2. Simulated Annealing: [9]

3. Genetic Algorithm: [10, 11]

4. Ant Colony Optimization: [12]

5. Particle Swarm Optimization: [13]

6. Cuckoo Search: [14]

7. Firefly Algorithm: [15]

8. Water Cycle Algorithm: [16]

9. Differential Evolution Algorithm: [17]

10. Artificial Bee Colony: [18]

11. Self Organizing Migrating Algorithm: [19]

The TSP function can be expressed as shown in Eq. (1).

xij=1thepathgoesfromcityitocityj0otherwiseE1

where xij=1if city i is connected with city j, and xij=0otherwise. For i=0,,n, let uibe an artificial variable and finally take cijto be the distance from city i to city j. The objective function can be then formulated as Eq. (2):

mini=0nji,j=0ncijxij0xij1i,j=0,,nuiZi=0,,ni=0,ijnxij=1j=0,,nj=0,jinxij=1i=0,,nuiuj+nxijn11ijnE2

## 3. 2-OPT algorithm

The 2-opt algorithm is one of the most famous heuristics developed originally for solving the TSP problem. It was first proposed by Croes [20]. Along with 3-opt, generalized as k-opt [21], these heuristics are based on exchange of up to k edges in a TSP tour (more information on application of k-opt local search techniques to TSP problems can be obtained from [22]). Together they are called exchange or local improvement heuristics. The exchange is considered to be a single move, from this point of view, such heuristics search the neighborhood of the current solution, that is, perform a local search and provide a locally optimal solution (k-optimal) to the problem [23].

The 2-opt procedure requires a starting feasible solution. It then proceeds by replacing the two non-adjacent edges, vivi+and vjvj+by vivjand vi+vj+, and reversing one of the subpaths produced by dropping of edges, in order to maintain the consistent orientation of the tour. For example, the subpath vivi+vjvj+is replaced by vivjvi+vj+. The solution cost change produced in this way can be expressed as Δij=cvivj+cvi+vj+cvivi+cvjvj+. If Δij<0, the solution produced by the move improves upon its predecessor. The procedure iterates until no move where Δij<0(no improving move) can be found [24].

The 2-opt local search was described by Kim et al. [25] as follows:

Step 1: Let Sbe the initial solution, fSits objective function value. Set S=S,i=1,j=i+1=2.

Step 2: Consider exchange result S'such that fS'<fS. Set S=S'. if j<nrepeat step 2. Otherwise set i=i+1and j=i+1. if i<nrepeat step 2, otherwise go to step 3.

Step 3: if SSset S=S,i=1,j=i+1and go to step 2. Otherwise output best solution Sand terminate the process.

## 4. CUDA

General purpose GPU computing (GPGPU) programming was introduced by Apple Cooperation, which created the Kronos Group [26] to further develop and promote this new approach to accelerate scientific computing paradigms. GPU’s offer significantly faster acceleration due to their uniquer hardware architecture. GPGPU’s started to increase in application from 2006. At this point NVIDIA decided to create its propriety unique architecture called Compute Unified Device Architecture (CUDA), specific for their Tesla generation GPU cards. In order to support this architect, specific API primitive extensions of C, C++ and Fortran extensions has been developed [27, 28].

The specific C/C++ language extension for the C language is called the CUDA-C. This contains a number of accelerated libraries, extensions, and APIs. These are scalable and freely available without professional license. The main computational bottleneck is the splitting of the task between GPU and CPU tasks, where CPU handles better memory management and memory checking and GPU handles the data acceleration using parallization. It is considered heterogenous programming, where compute intensive data parallel tasks are offloaded on to the GPU.

CUDA contains three specific paradigms, thread hierarchy, memory hierarchy and synchronization. These can be further divided into coarse-grained parallelism on the blocks in grid parallization and fine-grain parallization in the threads in block, which requires low-level synchronization.

CUDA kernels are special function calls, which is used for data parallization. Each kernel launches threads which are grouped into blocks which are then grouped into grids. Communication is done synchronously by threads in a block, whereas blocks are independent. Certain programming techniques needs to be undertaken to ensure data synchronization and validity between blocks. Threads in different blocks are not able to communicate with each other.

Threads are distinguished by their unique threadId in their respective blockId, which allows operating on specific data in the global and shared memory.

### 4.2 Memory hierarchy

There are different memory types in the GPU, which CUDA can utilize. Some memory structures are based on cache, some are read-only, etc. The first higher level memory structure is called the global memory, which can be accessed by all memory blocks. Due to its size and access level, it is the slowest memory on the GPU. The second memory level is the shared memory, which is shared by blocks, which threads within blocks can access. The third memory is the register memory, which are only accessible by threads, and can be used to local variables. This is the smallest and fastest memory in the GPU. If there are larger memory structures, and when registers are not sufficient, local memory can be then utilized. Another memory is constant memory which cannot be changed by the kernel code. The final memory is the texture memory, which is a read-only cache that provides a speed-up for locality in data access by threads [29].

### 4.3 Synchronization

Blocks in grids are used in coarse-grained parallelism and threads in a specific block are used in fine-grained parallelism. Data sharing in the scope of a kernel is done by threads in the block. The number of threads are limited by the device architecture design (max. 1024) and also by thread memory resource consumption. There is a level of scalability as the blocks are scheduled independently. Each block is assigned to a streaming multiprocessor (MS) in the GPU [29, 30].

## 5. CUDA-based 2-opt algorithm

This section presents the parallel CUDA-based version of 2-opt algorithm. This is a modification of the local search for permutative flowshop with makespan criterion problem [31] and its NEH variant [32]. Before coming to the parallel implementation description, however, the more detailed pseudocode of sequential version is provided in Algorithm 5, in order to enable better understanding of the CUDA algorithm design.

Algorithm 1: 2-opt sequential version. The Swap(T,j,i) procedure swaps jth and ith cities of tour T

As can be seen already from the analysis of description of 2-opt, the task that can be done in parallel is the exploration of neighborhood of the current solution. This is divided between individual CUDA blocks. Possible neighbors of the current solution are split evenly between the launched blocks, which then explores these neighbors evenly including the fitness evaluations. If a new better solution is found, it is then stored into the global memory allocation of that block. Thereafter, if at least one of the launched blocks finds an improving solution during the iteration, the best cost solution amongst all blocks is obtained and stored into memory as the current solution for the next iteration. Otherwise, the current solution is returned as the best. It should be noted that the fitness function is not parallelized, as only a single thread in each block is tasked with this task.

Each block explores approximately the same amount of possible neighbors to the current solution (in the worst case, when no improving solution is found), including the cost evaluation. However, if it finds an improving solution, that solution is stored into the global memory allocated for each block, and the block terminates. If at least one of the blocks found an improving solution, the minimal cost solution amongst all blocks is found and stored into memory as the current solution for the next iteration. Otherwise, the current solution is returned. The cost function evaluation itself was not parallelized, as in each block only a single thread performs this task.

The outline of the parallel algorithm can be given as follows:

Step 1: Set current solution S = Initial solution.

Step 2: Explore the neighborhood of S by G blocks in parallel. In each block b:

Step 1.1: Determine initial index ifor b.

Step 1.2: Explore all neighbors of S created by swapping of iand j,j1N. If improving neighbor T found, go to step 1.4.

Step 1.3: Determine next index ifor b. If iN, terminate. Otherwise go to step 1.2.

Step 1.4: Store T and its objective function value fTinto global memory and terminate.

Step 3: If no improving solution found, exit procedure and return S as the best solution found. Otherwise determine the best solution amongst those found by blocks in parallel.

Step 4: Store best solution as S. Go to step 2.

Where Nis the number of cities in the tour and iis the outer loop index (see Algorithm 1 for sequential version of 2-opt).

### 5.1 Exploration and evaluation of neighboring solutions

The neighbors of solution are generated and evaluated in this kernel. From the sequential version pseudocode (Algorithm 1), it is obvious that the function of generating individual neighbors by swapping every possible pair of jobs pair-wise ijfor i=1,,Nand j=i+1,,Ncan be considered independent and therefore executed in parallel. These solutions can be stored in the shared memory after generation. After evaluation, if the new solution has better fitness value compared to the current one, it is stored into the global memory allocated for each block, to avoid data races between blocks (this is illustrated in Figure 1 depicting memory layout for six cities and four blocks). The improvements counter in the global memory is incremented using an atomic operation. This counter is compared against zero after the kernel termination, to determine if the stopping criterion of the algorithm was met. The fitness function itself is evaluated by only a single thread; the other threads in a block process the elements of the solution when transferring data between shared and global memory locations.

It is logically impractical to allocate the full number of N12/2blocks on the GPU in most case scenarios. This number can be very large, whereas the number of SMs and the number of resident blocks on SM is limited by various factors, such as the number of threads in a block and a registers/shared memory usage. The optimal number of threads in a block maximizing the number of resident blocks, as well as GPU occupancy, can be easily determined based on the calculations performed in the CUDA occupancy calculator tool [33], as a function of the number of cities in a tour (which determines the size of shared memory used). This can maximizes the utilization of the GPU, while reducing the total global memory size required by the grid, as well as the workload done by the search for minimal cost solution in the next kernel. The mapping of the blocks to the tasks however can becomes more complicated to implement in code.

Using the assumption that the number of blocks will be nearly always smaller than the aforementioned function of the number of actual cities for the problem instances of interest (problems with cities larger than 30), only the outer loop of the sequential 2-opt algorithm was parallelized. The inner loop is performed by each block sequentially. This reduces the data transfers between global and shared memory, and does not eliminate the advantage of the low complexity of the swap operation at the same time. If the solution created by swapping jobs iand jis worse than the current one, it is easy to reverse this change by swapping again jand i, with equal complexity. Therefore, maximally N1blocks are needed for this function. The mapping of blocks to tasks is illustrated in Figure 2.

### 5.2 Parallel reduction to obtain minimal cost

The parallel reduction procedure is used to find the index of the solution with the minimal fitness value. This employs shared memory to store the data being used, whereas the data is initially copied from the global to shared memory. In this step, each active thread compares two costs, and stores the smaller of the two costs on the place of the first cost, along with its original index (cost is represented as a structure containing two elements: cost value, and cost index). Using this reduction, the first element of the costs array contains the minimal cost found, along with its respective solution index. This pair is then written into global memory.

### 5.3 Device synchronization and subsequence update

In the final process, a new kernel copies the best indexed solution into the current solution buffer, and the next step of the main loop can be performed. A global CUDA device synchronization is required for relatively large data (for a tour size/number of threads in a block of size more than approximately 100, as was empirically confirmed) before the start of the synchronization. As each of the kernels consumes some of the GPU resources, it is necessary to wait, until the pending kernels completely finish the execution, and release their resources, otherwise the GPU freezes and unsuccessful kernel launches start to appear. This is done by callingcudaDeviceSynchronize()function from the host code, after the Update kernel is launched.

Figure 3 outlines the memory layout of the previously described code (without TSP input data, for the current subsequence size 2, city tour 4. The data fields not used in the current step are grayed out). The candidate solutions are stored in one global memory 1D array, which conceptually represents 2D array, wherein each row contains one candidate tour. The respective costs are stored in a separate array. The TSP problem input data (distance between cities) are stored in the similar fashion in global memory (because of its large size).

This implementation is expected to provide in each step the speedup proportional to the number of solutions generated.

## 6. 2-OPT variants

Two versions of the 2-opt local search was implemented in this work. The first is the LS2OPT variant, which uses the search with the first ascend strategy. In this strategy, the next tour is the first improving solution found. This can be given in Algorithm 2.

The second variant is the MLS2OPT version, which is the best ascend strategy. In this strategy, the next tour is the best improving solution found in the 2-swap neighborhood as given in Algorithm 3.

## 7. Experimentation design

The experimentation design is as the following. Three different CPU’s and three different GPU’s are used to run the two different 2-opt variants on a selected number of asymmetric TSP instances (ATSP). The only measure is the time complexity.

The problem instances of the ATSP was obtained from the TSP library [34]. The following problems were selected due to differing city sizes as given in Table 1.

DataCities
ft7070
ftv6465
ftv170171
kro124p100
rbg323323
rbg358358
rbg403403

### Table 1.

TSP instances and number of cities.

The machine specifications is given in Table 2. Three separate machines were used with differing CPUs and GPUs. Two machines were on a Windows 10 operating system and the other is a Central Washington University Supercomputer cluster running Ubuntu [35]. Machine 2 and 3 utilized headless GPU’s.

Algorithm 2: LS2OPT sequential version. The Swap(T,j,i) procedure swaps jth and ith cities of tour T

Algorithm 3: MLS2OPT sequential version. The Swap(T,idx,i) procedure swaps idxth and ith cities of tour T, where idxth is the best 2-swap schedule jth index found after iteration

SpecificationsMachine 1Machine 2Machine 3
ProcessorIntel i7-9750HGTX 1050Intel i7-7800XTitan XpPower 8P100
Memory16 GB2 GB32 GB12 GB32 GB16 GB
Cores46406384063584
OSWin10Win10Ubuntu
LanguageC++CUDA-CC++CUDA-CC++CUDA-C
IDEVisual Studio 17Visual Studio 17Makefile
Cost (USD)$200$1500$15,000 ### Table 2. Machines specifications. ## 8. Results and analysis The results are grouped by the machine architectures, as there is a dependency between the CPU and GPU. Thirty experimentations was done of each problem instance on each machine for each algorithm and the average time is given in the tables (* in msec). The percentage relative difference (PRD) is calculated between the CPU and GPU times as given in Eq. (3). Negatives values (given as bolded text in the tables) indicate that the GPU execution is faster. PRD=GPUCPU/CPU100E3 The first part of the first machine experiment results of the LS2OPT and its CUDA variant is given in Table 3. The first column is the problem instances and the second and third column is the CPU and GPU average results of the LS2OPT in milliseconds. The final column is the PRD results. From all the results, apart from the ftv64 instance, the GPU produced faster results. The average time was 22480.28 ms for the CPU and 2168.57 ms for the GPU. The average PRD was −47.29% for all experiments. A deeper analysis shows that for the larger instances, the PRD was over 80%. DataIntel i7-9750H LS2OPTNvidia GTX 1050 LS2OPTCUDAPRD (%) ft704234−19.047 ftv641430114.29 ftv17032287−72.98 kro124p580111−80.86 rbg32343,8542963−93.24 rbg35851,0694096−91.98 rbg40361,4817859−87.22 Average22480.282168.57−47.29 ### Table 3. Results of the experiments of Intel i7-9750H and NVidia GTX 1050 on the LS2OPT and LS2OPTCUDA algorithms. All results are in milliseconds (ms). The plot of the execution time is given in Figure 4 where the execution speedup is clearly identifiable for the larger instances. The second part of the first machine experimentation is the MLS2OPT and its CUDA variant and the result are given in Table 4. For all the problem instances, the execution time for the GPU was significantly better. The average time was 14183.85 ms for the CPU and 1854.28 ms for the GPU. The average PRD was −52.55% for all experiments. Apart from two instances, all the other were above 85% PRD. DataIntel i7-9750H MLS2OPTNvidia GTX 1050 MLS2OPTCUDAPRD (%) ft703721−43.24 ftv642652100.00 ftv17061978−87.40 kro124p30375−75.25 rbg32321,2052525−88.09 rbg35831,3303775−87.95 rbg40345,7676454−85.90 Average14183.851854.28−52.55 ### Table 4. Results of the experiments of Intel i7-9750H and NVidia GTX 1050 on the MLS2OPT and MLS2OPTCUDA algorithms. All results are in milliseconds (ms). The plot of the execution time is given in Figure 5, where the execution speedup is linearly identifiable for the larger instances. The first part of the second machine experiment results of the LS2OPT and its CUDA variant is given in Table 5. As the NVidia Titan Xp is a dedicated headless TESLA category GPU, the computational times are better than the CPU for all the results. The average time was 12157.14 ms for the CPU and 857 ms for the GPU. The average PRD was −64.92% for all experiments. A deeper analysis shows that for the larger instances, the PRD was over 90%. As the transfer overhead for the PCIe bus is compensated by more extensive experimentation, larger instances performed faster on the GPU. DataIntel i7-7800X LS2OPTNVidia Titan Xp LS2OPTCUDAPRD (%) ft702118−14.29 ftv64128−33.33 ftv17018377−57.92 kro124p30694−69.28 rbg32323,6191467−93.79 rbg35827,6141848−93.31 rbg40333,3452487−92.54 Average12157.14857−64.92 ### Table 5. Results of the experiments of Intel i7-7800X and NVidia titan Xp on the LS2OPT and LS2OPTCUDA algorithms. All results are in milliseconds (ms). The plot of the execution time is given in Figure 6, where the execution speedup is clearly identifiable for the larger instances. The second part of the second machine experimentation is the MLS2OPT and its CUDA variant and the result are given in Table 6. For all the problem instances, the execution time for the GPU was significantly better. The average time was 7955.28 ms for the CPU and 616.28 ms for the GPU. The average PRD was −63.39% for all experiments. The three larger instances were above 90% PRD. DataIntel i7-7800X MLS2OPTNVidia Titan Xp MLS2OPTCUDAPRD (%) ft702016−20.00 ftv64118−27.27 ftv17032158−81.93 kro124p16456−65.85 rbg32311,517941−91.83 rbg35817,1091059−93.81 rbg40324,4452176−91.10 Average7955.28616.28−63.39 ### Table 6. Results of the experiments of Intel i7-7800X and NVidia titan Xp on the MLS2OPT and MLS2OPTCUDA algorithms. All results are in milliseconds (ms). The plot of the execution time is given in Figure 7, where the execution speedup is linearly identifiable for the larger instances. The first part of the third machine experiment results of the LS2OPT and its CUDA variant is given in Table 7. Generally, the NVidia P100 is regarded as an industry leading GPU solution for scientific computing. This is coupled with the IBM Power 8 CPU Architecture. For all the problem instances the result was significantly better. The average time was 28592.43 ms for the CPU and 1536.42 ms for the GPU. The average PRD was −87.83% for all experiments. DataPower 8 LS2OPTNVidia P100 LS2OPTCUDAPRD (%) ft705710−82.46 ftv64236−73.91 ftv17043075−82.56 kro124p75461−91.91 rbg32361,7753245−94.75 rbg35864,4193587−94.43 rbg40372,6893771−94.81 Average28592.431536.42−87.83 ### Table 7. Results of the experiments of power 8 and NVidia P100 on the LS2OPT and LS2OPTCUDA algorithms. All results are in milliseconds (ms). The plot of the execution time is given in Figure 8, where the execution speedup is clearly identifiable for the larger instances. The second part of the third machine experimentation is the MLS2OPT and its CUDA variant and the result are given in Table 8. For all the problem instances, the execution time for the GPU was again significantly better. The average time was 23429.14 ms for the CPU and 751 ms for the GPU. The average PRD was −92.78% for all experiments. The PRD is the highest of all experiments. DataPower 8 MLS2OPTNVidia P100 MLS2OPTCUDAPRD (%) ft70537−86.79 ftv64334−87.88 ftv17081152−93.59 kro124p38535−90.91 rbg32330,1201124−96.27 rbg35844,7091215−97.28 rbg40387,8932820−96.79 Average23429.14751−92.78 ### Table 8. Results of the experiments of power 8 and NVidia P100 on the MLS2OPT and MLS2OPTCUDA algorithms. All results are in milliseconds (ms). The plot of the execution time is given in Figure 9, where the execution speedup is linearly identifiable for the larger instances. The final comparison is of the three GPU’s on the two separate algorithms. Figure 10 shows the values of the three GPU’s on the problem instances for the LS2OPTCUDA algorithm. For the small sized problem, the timing is not significantly distinct. The distinction only becomes variable when the instance sizes increase. Overall, the NVidia Titan Xp is the best performing GPU for this algorithm. Figure 11 shows the results of the MLS2OPTCUDA algorithm on the problem. As with the previous case, the distinction only becomes obvious for large sized problem instances. Again the NVidia Titan Xp is the best performing GPU for this algorithm. ## 9. Algorithm comparison This section discusses the tour cost obtained by the two different 2-OPT approaches developed here compared with published research. The first comparison is done with the best known solution in literature, which can be obtained from the TSPLib [36]. Table 9 gives the comparison results between the optimal and the results obtained from the LS2OPTCUDA and MLS2OPTCUDA algorithms on the P100 GPU. The results are compared using the PRD Eq. (3). The GPU is replaced with the optimal value and the CPU is replaced by the obtained result. DataOptimalLS2OPTPRD (%)MLS2OPTPRD (%) ft7038,67343,163−10.4043,310−10.71 ftv6418392744−32.982554−28.00 ftv17027554559−39.574510−38.91 kro124p36,23058,014−37.5555,011−34.14 rbg32313261681−21.121535−13.62 rbg35811631625−28.431459−20.29 rbg40324652710−9.042598−5.12 Average12064.4316356.57−25.5815853.86−21.54 ### Table 9. Comparison of 2OPT vs. optimal values. The PRD values comparison shows that the LS2OPT is at most 40% away from the optimal value for ftv170 instance and − 9% for the rbg403 instance. For the MLS2OPT comparison, the PRD is −39% from the optimal value for ftv170 instance and − 5% for the rbg403 instance. On average, the MLS2OPT is a better performing algorithm with an average of 15853.86 against 16356.57 for the LS2OPT algorithm. A plot of the comparison values is given in Figure 12. The second comparison is now done with four different evolutionary algorithms as given in Table 10. Theses are the Discrete Particle Swarm Optimization (DPSO) algorithm [37], Discrete Self-Organizing Algorithm (DSOMA) [38], Enhanced Differential Evolution (EDE) algorithm and the Chaos driven Enhanced Differential Evolution (EDEC) algorithm [17]. The DPSO and DSOMA algorithms were revised for the TSP problem and the 2-OPT local search was removed from the algorithms to compare the results without any local search implemented. EDE and EDECare published algorithms however only three instances were published. Both these algorithms had the 2-OPT local search embedded in them. DataMLS2OPTDPSODSOMAEDEEDEC ft7043,31054,44451,32540,28539,841 ftv64255447114423 ftv170451019,102952269024578 kro124p55,011113,15375,37341,18039,574 rbg323153548524523 rbg358145956924874 rbg403259863734427 Average15853.8629761.0022066.71 ### Table 10. MLS2OPT vs. evolutionary algorithms. From the results, it was obvious that evolutionary algorithms without local search heuristics are not as effective as the 2-opt local search heuristic or algorithms with both directed and local search combined. Therefore, it is important to combine these two algorithms as in [39]. As reported in [39] that the execution time of local search can be around 95–99% of the total run time of the algorithm, it is viable to accelerate the local search heuristics. ## 10. Conclusions This chapter introduces a CUDA accelerated 2-opt algorithm for the TSP problem. As one of the most common and widely used approaches to solve the problem, the 2-opt approach can be considered as canonical in the field. GPU programming, especially CUDA has gained significant traction for high performance computing. Readily available hardware has made programming a much easier and available task. Two variants of the 2-opt algorithm have been coded in CUDA to show the acceleration of computational time. This has been tested against a sample of test instances from literature. From the results obtained, it is clear that even for a relatively cheap GPU such as the GTX 1050 the performance improvement is significant, especially for larger sized problem instances. These were compared against industry leading CPU’s such as Intel i7-X series and IBM Power 8. One of the interesting aspects was that the Titan Xp performed better than the P100 for these instances. It is difficult to identify the reasons, as the same code was deployed on all machines, however the IBM and Intel architecture differences and different C/C++ compiler usage may have affected the performance. The physical configuration of the GPU’s inside the hardware and its connection to the motherboard and memory bandwidth issues could also add to the time overhead. However, when analyzing the cost-performance of the GPU’s then the$1500 Titan Xp is a better GPU than the \$15,000 P100 in this case.

However, the clear distinction is that there is a significant improvement to be had when applying the CUDA version of the 2-opt algorithm. The next direction of this research is to combine it with powerful swarm meta-heuristics with a layered approach, and try and solve very large TSP instances.

chapter PDF
Citations in RIS format
Citations in bibtex format

## More

© 2020 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution 3.0 License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

## How to cite and reference

### Cite this chapter Copy to clipboard

Donald Davendra, Magdalena Metlicka and Magdalena Bialic-Davendra (September 3rd 2020). CUDA Accelerated 2-OPT Local Search for the Traveling Salesman Problem, Novel Trends in the Traveling Salesman Problem, Donald Davendra and Magdalena Bialic-Davendra, IntechOpen, DOI: 10.5772/intechopen.93125. Available from:

### Related Content

#### Novel Trends in the Traveling Salesman Problem

Edited by Donald Davendra

Next chapter

By Weiqi Li

First chapter

#### Traveling Salesman Problem: an Overview of Applications, Formulations, and Solution Approaches

By Rajesh Matai, Surya Singh and Murari Lal Mittal

We are IntechOpen, the world's leading publisher of Open Access books. Built by scientists, for scientists. Our readership spans scientists, professors, researchers, librarians, and students, as well as business professionals. We share our knowledge and peer-reveiwed research papers with libraries, scientific and engineering societies, and also work with corporate R&D departments and government entities.

View all Books