Analysis of the Performance of the Fish School Search Algorithm Running in Graphic Processing Units
In: Theory and New Applications of Swarm Intelligence, www.intechopen.com

Fish School Search (FSS) is a computational intelligence technique invented by Bastos-Filho and Lima-Neto in 2007 and first presented in Bastos-Filho et al. (2008). FSS was conceived to solve search problems and it is based on the social behavior of schools of fish. In the FSS algorithm, the search space is bounded and each possible position in the search space represents a possible solution for the problem. During the algorithm execution, each fish has its positions and weights adjusted according to four FSS operators, namely, feeding, individual movement, collective-instinctive movement and collective-volitive movement. FSS is inherently parallel since the fitness can be evaluated for each fish individually. Hence, it is quite suitable for parallel implementations.


Introduction
Fish School Search (FSS) is a computational intelligence technique invented by Bastos-Filho and Lima-Neto in 2007 and first presented in Bastos-Filho et al. (2008). FSS was conceived to solve search problems and it is based on the social behavior of schools of fish. In the FSS algorithm, the search space is bounded and each possible position in the search space represents a possible solution for the problem. During the algorithm execution, each fish has its positions and weights adjusted according to four FSS operators, namely, feeding, individual movement, collective-instinctive movement and collective-volitive movement. FSS is inherently parallel since the fitness can be evaluated for each fish individually. Hence, it is quite suitable for parallel implementations.
In recent years, the use of Graphic Processing Units (GPUs) has been proposed for various general-purpose computing applications. GPU-based platforms afford great advantages for applications requiring intensive parallel computing, and their parallel floating-point processing capacity allows one to obtain high speedups. These advantages, together with the FSS architecture, suggest that a GPU-based FSS may produce a marked reduction in execution time, because the fitness evaluation and the update processes of the fish can be parallelized in different threads. Nevertheless, some aspects should be considered when adapting an application to be executed on these platforms, such as memory allocation and communication between blocks.
Some computational intelligence algorithms have already been adapted to be executed on GPU-based platforms. Some variations of the Particle Swarm Optimization (PSO) algorithm suitable for GPUs were proposed by Zhou & Tan (2009). In that article, the authors compared the performance of such implementations to a PSO running on a CPU. Some tests regarding the scalability of the algorithms as a function of the number of dimensions were also presented. Bastos-Filho et al. (2010) presented an analysis of the performance of PSO algorithms when the random numbers are generated on the GPU and on the CPU. They showed that the XORshift random number generator for GPUs, described by Marsaglia (2003), presents enough quality to be used in the PSO algorithm. They also compared different GPU-based versions of the PSO (synchronous and asynchronous) to the CPU-based algorithm.

Zhu & Curry (2009) adapted an Ant Colony Optimization (ACO) algorithm to optimize benchmark functions on GPUs. A variation for local search, called SIMT-ACO-PS (Single Instruction Multiple Threads, ACO, Pattern Search), was also parallelized. They presented an analysis of the parallelization process regarding the generation of ants in order to minimize the communication overhead between the CPU and the GPU. The proposals achieved remarkable speedups.
To the best of our knowledge, there are no FSS implementations for GPUs. So, in this paper we present the first parallel approach for the FSS algorithm suitable for GPUs. We discuss some important issues regarding the implementation in order to improve the time performance. We also consider some other relevant aspects, such as when and where it is necessary to set synchronization barriers. The analysis of these aspects is crucial to provide high-performance FSS approaches for GPUs. In order to demonstrate this, we carried out simulations using a parallel processing platform developed by NVIDIA, called CUDA.
This paper is organized as follows: in the next Section we present an overview of the FSS algorithm. In Section 3, we introduce some basic aspects of the NVIDIA CUDA architecture and GPU computing. Our contribution and the results are presented in Sections 4 and 5, respectively. In the last Section, we present our conclusions, where we also suggest future work.

Fish School Search
Fish School Search (FSS) is a stochastic, bio-inspired, population-based global optimization technique. As mentioned by Bastos-Filho et al. (2008), FSS was inspired by the gregarious behavior of some fish species, which serves specifically to generate mutual protection and synergy in performing collective tasks, both of which improve the survivability of the entire group.
The search process in FSS is carried out by a population of limited-memory individuals: the fish. Each fish in the school represents a point in the fitness function domain, like the particles in Particle Swarm Optimization (PSO) (Kennedy & Eberhart, 1995) or the individuals in Genetic Algorithms (GA) (Holland, 1992). The search guidance in FSS is driven by the success of the members of the population.
The main feature of FSS is that all fish carry an innate memory of their success: their weights. The original version of the FSS algorithm has four operators, which can be grouped in two classes: feeding and swimming. The feeding operator is related to the quality of a solution and the three swimming operators drive the fish movements.

Individual movement operator
The individual movement operator is applied to each fish in the school at the beginning of each iteration. Each fish chooses a new position in its neighbourhood, and this new position is then evaluated using the fitness function. The candidate position $n_i$ of fish $i$ is determined by Equation (1), proposed by Bastos-Filho et al. (2009):
$$n_i(t) = x_i(t) + \mathrm{rand}[-1,1] \cdot step_{ind} \qquad (1)$$
where $x_i(t)$ is the current position of fish $i$ and $\mathrm{rand}[-1,1]$ is a random number generated by a uniform distribution in the interval $[-1,1]$, drawn for each dimension. $step_{ind}$ is a percentage of the search space amplitude and is bounded by two parameters ($step_{ind\_min}$ and $step_{ind\_max}$).
$step_{ind}$ decreases linearly over the iterations in order to increase the exploitation ability as the search progresses. After the calculation of the candidate position, the movement only occurs if the new position presents a better fitness than the previous one.
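As an illustration of the operator described above, the following is a minimal sequential Python sketch, not the GPU implementation evaluated in this chapter. The function names, the clamping of the candidate to the search bounds, and the numeric step values used in any call are assumptions made for the example; the fitness is assumed to be minimized, as in the simulations reported later.

```python
import random


def linear_step(step_max, step_min, iteration, max_iterations):
    """Step size that decays linearly from step_max to step_min."""
    fraction = iteration / max(1, max_iterations - 1)
    return step_max - (step_max - step_min) * fraction


def individual_movement(position, fitness, step_ind, lower, upper, rng):
    """Try one individual move for a single fish (minimization).

    Returns (position, displacement, gain). The fish moves only if the
    candidate is better; otherwise displacement and gain are zero.
    """
    # Candidate per dimension: x + rand[-1, 1] * step_ind.
    # Clamping to [lower, upper] is an assumption of this sketch.
    candidate = [
        min(upper, max(lower, x + rng.uniform(-1.0, 1.0) * step_ind))
        for x in position
    ]
    old_f, new_f = fitness(position), fitness(candidate)
    if new_f < old_f:
        displacement = [c - x for c, x in zip(candidate, position)]
        return candidate, displacement, old_f - new_f
    return list(position), [0.0] * len(position), 0.0
```

The returned displacement and fitness gain are exactly the quantities that the feeding and collective-instinctive operators consume, which is why the sketch returns them alongside the position.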

Feeding operator
Each fish can grow or diminish in weight, depending on its success or failure in the search for food. The fish weight is updated once in every FSS cycle by the feeding operator, according to Equation (2):
$$W_i(t+1) = W_i(t) + \frac{\Delta f_i}{\max(\Delta f)} \qquad (2)$$
where $W_i(t)$ is the weight of fish $i$, $f[x_i(t)]$ is the value of the fitness function (i.e. the amount of food) at $x_i(t)$, $\Delta f_i$ is the difference between the fitness value of the new position $f[x_i(t+1)]$ and the fitness value of the current position $f[x_i(t)]$ for each fish, and $\max(\Delta f)$ is the maximum value of these differences in the iteration. A weight scale ($W_{scale}$) is defined in order to limit the weight of the fish; it is assigned a value equal to half the total number of iterations in the simulations. The initial weight of each fish is equal to $W_{scale}/2$.
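The feeding rule can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `feed`, the normalization by the largest absolute difference, the guard for the case in which no fish improved, and the lower weight bound of 1 are assumptions of the example and are not stated in the text above.

```python
def feed(weights, delta_f, w_scale):
    """Update fish weights from the fitness gains of the last individual move.

    weights: current weight of each fish
    delta_f: fitness gain of each fish (0.0 for fish that did not move)
    w_scale: upper limit for the weight of a fish
    """
    # Assumption: normalize by the largest absolute gain of the iteration.
    biggest = max(abs(d) for d in delta_f)
    if biggest == 0.0:
        # No fish improved, so no weight changes in this cycle.
        return list(weights)
    # Assumption: weights are kept inside [1, w_scale].
    return [
        min(w_scale, max(1.0, w + d / biggest))
        for w, d in zip(weights, delta_f)
    ]
```

With 10,000 iterations, the text implies a weight scale of 5,000 and an initial weight of 2,500, so a call such as `feed([2500.0, 2500.0], [2.0, 1.0], 5000.0)` raises the weight of the more successful fish by one unit and that of the other by half a unit.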

Collective-instinctive movement operator
After all fish have moved individually, their positions are updated according to the influence of the fish that had successful individual movements. This movement is based on the fitness evaluation of the fish that achieved better results, as shown in Equation (3):
$$x_i(t+1) = x_i(t) + \frac{\sum_{j=1}^{N} \Delta x_{ind_j} \, \Delta f_j}{\sum_{j=1}^{N} \Delta f_j} \qquad (3)$$
where $N$ is the number of fish in the school and $\Delta x_{ind_j}$ is the displacement of fish $j$ due to the individual movement in the FSS cycle.
One must observe that $\Delta x_{ind_j} = 0$ for fish that did not execute the individual movement.
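A minimal Python sketch of this operator is given below, assuming that the displacements and fitness gains of the individual movement have already been collected for the whole school. The function name and the decision to leave the school untouched when no fish improved (to avoid a division by zero) are assumptions of the example.

```python
def collective_instinctive(positions, displacements, delta_f):
    """Shift every fish by the gain-weighted average individual displacement.

    positions:     list of fish positions (one list of floats per fish)
    displacements: individual displacement of each fish (zeros if it did not move)
    delta_f:       fitness gain of each fish (0.0 if it did not move)
    """
    total = sum(delta_f)
    if total == 0.0:
        # Assumption: if no fish improved, the school does not drift.
        return [list(p) for p in positions]
    dims = len(positions[0])
    # Weighted average displacement, one component per dimension.
    drift = [
        sum(d[k] * df for d, df in zip(displacements, delta_f)) / total
        for k in range(dims)
    ]
    return [[x + m for x, m in zip(p, drift)] for p in positions]
```

Note that the drift vector is a reduction over the whole school, so in a parallel version every fish must have finished its individual movement before the sums are read.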

Collective-volitive movement operator
The collective-volitive movement occurs after the other two movements. If the fish school search has been successful, the radius of the school should contract; if not, it should dilate. Thus, this operator increases the capacity to auto-regulate the exploration-exploitation granularity. The fish school dilation or contraction is applied to every fish position with regard to the fish school barycenter, which can be evaluated by using Equation (4), where $N$ is the number of fish in the school:
$$B(t) = \frac{\sum_{i=1}^{N} x_i(t) \, W_i(t)}{\sum_{i=1}^{N} W_i(t)} \qquad (4)$$
We use Equation (5) to perform the fish school expansion (in this case we use the sign +) or contraction (in this case we use the sign −):
$$x_i(t+1) = x_i(t) \pm step_{vol} \, r_1 \, \frac{x_i(t) - B(t)}{d(x_i(t), B(t))} \qquad (5)$$
where $r_1$ is a number randomly generated in the interval $[0, 1]$ by a uniform probability density function, and $d(x_i(t), B(t))$ evaluates the Euclidean distance between fish $i$ and the barycenter. $step_{vol}$ is called the volitive step and controls the step size of the fish. It is defined as a percentage of the search space range and is bounded by two parameters ($step_{vol\_min}$ and $step_{vol\_max}$). $step_{vol}$ decreases linearly from $step_{vol\_max}$ to $step_{vol\_min}$ along the iterations of the algorithm. This helps the algorithm to start with an exploration behavior and change dynamically to an exploitation behavior.
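The barycenter and the contraction or dilation step can be sketched in Python as follows. This is an illustrative sequential sketch: the function names, the boolean flag `school_gained_weight` used to decide between contraction and dilation, and the handling of a fish that sits exactly on the barycenter are assumptions of the example.

```python
import math
import random


def barycenter(positions, weights):
    """Weight-averaged position of the school."""
    total = sum(weights)
    dims = len(positions[0])
    return [
        sum(p[k] * w for p, w in zip(positions, weights)) / total
        for k in range(dims)
    ]


def collective_volitive(positions, weights, step_vol, school_gained_weight, rng):
    """Contract the school towards its barycenter on success, dilate otherwise."""
    center = barycenter(positions, weights)
    # Assumption: success is signalled by the caller, e.g. total weight increased.
    sign = -1.0 if school_gained_weight else 1.0
    moved = []
    for p in positions:
        dist = math.sqrt(sum((x - b) ** 2 for x, b in zip(p, center)))
        if dist == 0.0:
            # Assumption: a fish on the barycenter has no direction and stays put.
            moved.append(list(p))
            continue
        r1 = rng.random()  # uniform in [0, 1)
        moved.append(
            [x + sign * step_vol * r1 * (x - b) / dist for x, b in zip(p, center)]
        )
    return moved
```

Because the barycenter is a weighted sum over all fish, it is a natural candidate for a block-level reduction in shared memory, which is consistent with the memory usage described in the next section.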

GPU computing and CUDA architecture
In recent years, Graphic Processing Units (GPUs) have appeared as a possibility to drastically speed up general-purpose computing applications. Because of their parallel computing mechanism and fast floating-point operations, GPUs have been applied successfully in many applications. Some examples of GPU applications are physics simulations, financial engineering, and video and audio processing. Despite all these successful applications, some algorithms cannot be effectively implemented on GPU platforms. In general, numerical problems that present parallel behavior can profit from this technology, as can be seen in NVIDIA (2010a).
Even after some efforts to develop Application Programming Interfaces (APIs) to facilitate developer activities, GPU programming is still a hard task. To overcome this, NVIDIA introduced a general-purpose parallel computing platform named Compute Unified Device Architecture (CUDA). CUDA presents a new parallel programming model to automatically distribute and manage the threads in the GPUs.
CUDA allows programs written in the C programming language to communicate directly with the GPU instructions by using minimal extensions. It has three main abstractions: a hierarchy of groups of threads, shared memories and barriers for synchronization (NVIDIA, 2010b). These abstractions allow one to divide the problem into coarse sub-problems, which can be solved independently in parallel. Each sub-problem can be further divided into minimal procedures that can be solved cooperatively in parallel by all threads within a block. Thus, each block of threads can be scheduled on any of the available processing cores, regardless of the execution order.
Some issues must be considered when modeling the Fish School Search algorithm for the CUDA platform. In general, the algorithm correctness must be guaranteed, since race conditions in a parallel implementation may lead to outdated results. Furthermore, since we want to execute the algorithm as fast as possible, it is worth discussing where it is necessary to set synchronization barriers and in which memory we shall store the algorithm information.
The main bottleneck in the CUDA architecture lies in the data transfer between the host (CPU) and the device (GPU). Any transfer of this type may degrade the execution time. Thus, this operation should be avoided whenever possible. One alternative is to move some operations from the host to the device. Even when an operation does not seem worth moving (because it is not very parallel), generating the data on the GPU is faster than transferring huge volumes of data.
CUDA platforms present a well-defined memory hierarchy, which includes distinct types of memory in the GPU platform. Furthermore, the time to access these distinct types of memory varies. Each thread has a private local memory and each block of threads has a shared memory accessible by all threads inside the block. Moreover, all threads can access the same global memory. All these memory spaces follow a memory hierarchy: the fastest one is the local memory and the slowest is the global memory; accordingly, the smallest one is the local memory and the largest is the global memory. Then, if there is data that must be accessed by all threads of a block, the shared memory might be the best choice. However, the shared memory can only be accessed by the threads inside its block and its size is not very large. In the FSS versions, most of the variables are global when used in kernel functions. Shared memory was also used to perform the barycenter calculations. Local memory was used to hold the thread, block and grid dimension indexes on the device and also to compute the specific benchmark function.
Another important aspect is the necessity to set synchronization barriers. A barrier forces a thread to wait until all other threads of the same block reach the barrier. It helps to guarantee the correctness of the algorithm running on the GPU, but it can reduce the time performance. Furthermore, threads within a block can cooperate among themselves by sharing data through some shared memory and must synchronize their execution to coordinate the memory accesses (see Fig. 1). Although GPUs are known for their parallel high-precision operations, there are GPUs with only single-precision capacity. Since many computational problems need double-precision computation, this limitation may lead to bad results. Therefore, these GPUs are inappropriate for solving some types of problems.
Fig. 1. Illustration of a Grid of Thread Blocks
The CUDA capacity to execute a high number of threads in parallel is due to the hierarchical organization of these threads as a grid of blocks. A thread block is a set of threads that cooperate in order to share data efficiently using a fast shared memory. Besides, the threads of a block must synchronize with each other to coordinate their accesses to the memory.
The maximum number of threads running in parallel in a block is defined by the number of processing units of the GPU and by its architecture. Therefore, each GPU has its own limitation. As a consequence, an application that needs to exceed this limitation has to be executed sequentially with more blocks; otherwise it might obtain wrong or, at least, outdated results.
In order to process the algorithm in parallel, one must inform the CUDA platform of the number of parallel copies of the kernel functions to be executed. These copies are also known as parallel blocks and are divided into a number of execution threads.
The structures defined by grids can be split into blocks in two dimensions. The blocks are divided into threads that can be structured in one to three dimensions. As a consequence, the kernel functions can be easily instantiated (see Fig. 2). When a kernel function is invoked by the CPU, it runs in separate threads within the corresponding block. Each thread that executes a kernel function has a thread identifier, which can be accessed within the kernel through the two built-in variables threadIdx and blockIdx. The size of the data to be processed or the number of processors available in the system is used to define the number of thread blocks in a grid. The GPU architecture and its number of processors define the maximum number of threads in a block. On current GPUs, a thread block may contain up to 1024 threads. For this chapter, the simulations were made with GPUs that support up to 512 threads per block. Table 1 presents the configuration of grids, blocks and threads used for each kernel function. Another important concept in the CUDA architecture is the warp, which refers to 32 threads grouped to be executed in lockstep, i.e. each thread in a warp executes the same instruction on different data (Sanders & Kandrot, 2010). In this chapter, as already mentioned, the data processing is performed directly in the device memories. For these functions, the qualifier __global__ must be used so that the kernel functions, which run on the device, can be invoked from the host. The qualifier __device__ declares functions that are executed on the device and can only be invoked from the device (NVIDIA, 2010b).
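To make the indexing scheme concrete, the sketch below emulates a one-dimensional grid launch sequentially in Python. It is not CUDA code and says nothing about performance; it only shows how a block index, a block size and a thread index combine into a global element index, and why a bounds guard is needed when the data size is not a multiple of the block size. The function names are hypothetical, and the one-dimensional layout is an assumption of the example (the configuration actually used is the one in Table 1).

```python
def blocks_needed(n_items, threads_per_block):
    """Number of blocks so that every item gets a thread (ceiling division)."""
    return (n_items + threads_per_block - 1) // threads_per_block


def global_index(block_idx, block_dim, thread_idx):
    """Element handled by a thread; mirrors blockIdx.x * blockDim.x + threadIdx.x."""
    return block_idx * block_dim + thread_idx


def launch(n_items, threads_per_block, kernel):
    """Sequential emulation of a 1-D grid launch of `kernel` over n_items."""
    for block_idx in range(blocks_needed(n_items, threads_per_block)):
        for thread_idx in range(threads_per_block):
            i = global_index(block_idx, threads_per_block, thread_idx)
            # Guard: the last block may contain threads with no item to process.
            if i < n_items:
                kernel(i)
```

For a school of 30 fish and a limit of 512 threads per block, `blocks_needed(30, 512)` is 1, so in this one-thread-per-fish layout the whole school would fit in a single block and block-level barriers would cover every fish.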
The FSS pseudocode shown in algorithm 1 depicts which functions can be parallelized in GPUs.

The synchronous FSS
The synchronous FSS must be implemented carefully with barriers to prevent any race condition that could generate wrong results. These barriers, indicated by the __syncthreads() function in CUDA, guarantee correctness, but they come with a caveat: since each fish needs to wait for all the others, the barriers harm the performance.
In the synchronous version, the synchronization barriers were inserted after the following functions (see Algorithm 1): fitness evaluation, update of the new position, calculation of the fish weights, calculation of the barycenter and update of the step values.
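The role of a barrier can be illustrated on the CPU with Python threads. In the sketch below, each thread plays the role of one fish: it publishes its own fitness, waits at a barrier, and only then reads the values of the others. Here `threading.Barrier` stands in for __syncthreads(); this is an analogy for the synchronization pattern only, and the function name `run_synchronous` is hypothetical.

```python
import threading


def run_synchronous(fitness_values):
    """Each 'fish' publishes its fitness, then reads the school-wide best."""
    n = len(fitness_values)
    shared = [None] * n      # plays the role of memory visible to all threads
    seen_best = [None] * n   # what each fish observed after the barrier
    barrier = threading.Barrier(n)

    def fish(i):
        shared[i] = fitness_values[i]   # phase 1: write own result
        barrier.wait()                  # wait until every fish has written
        seen_best[i] = min(shared)      # phase 2: safe to read all results

    threads = [threading.Thread(target=fish, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen_best
```

With the barrier in place, every fish observes the same, complete set of values. Without it, a fish could read entries that other fish have not written yet, which is the kind of outdated information that the asynchronous versions deliberately tolerate.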

The asynchronous FSS
In general, an iteration of the asynchronous approach is faster than an iteration of the synchronous one due to the absence of some synchronization barriers. However, the results will probably be worse, since the information acquired is not necessarily the current best.
Here, we propose two different approaches for the asynchronous FSS. The first one, called Asynchronous Version A, keeps synchronization barriers at some points in the code. In this case, we have maintained the synchronization barriers only in the functions used to update the positions and evaluate the barycenter. The pseudocode of the Asynchronous FSS Version A is shown in Algorithm 2. In the second approach, called Asynchronous Version B, all the synchronization barriers were removed from the code in order to have a fully asynchronous version. The pseudocode of the Asynchronous FSS Version B is shown in Algorithm 3.
In order to calculate the execution time of each simulation, we used the CUDA event API, which handles the creation and destruction of events and also records the time of the events in a timestamp format (NVIDIA, 2010b).
We used a 1296 MHz GeForce GTX 280 with 240 processing cores to run the GPU-based FSS algorithms. All simulations were performed using 30 fish, and we ran 50 trials to evaluate the average fitness. All schools were randomly initialized in an area far from the optimal solution in every dimension. This allows a fair convergence analysis between the algorithms. All the random numbers needed by the FSS algorithm running on the GPU were generated by a normal distribution using the proposal depicted in Bastos-Filho et al. (2010).
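For reference, the XORshift family of generators described by Marsaglia (2003) and mentioned in the Introduction can be sketched in a few lines. The Python sketch below uses the 32-bit variant with the shift triple (13, 17, 5); whether the implementation used in this chapter adopts this exact variant, triple or seeding scheme is an assumption of the example, and the mapping of the integer stream to other distributions is not shown.

```python
MASK32 = 0xFFFFFFFF  # keep the state inside 32 bits, as a C uint32_t would


def xorshift32(state):
    """One step of a 32-bit xorshift generator with shifts (13, 17, 5)."""
    state ^= (state << 13) & MASK32
    state ^= state >> 17
    state ^= (state << 5) & MASK32
    return state & MASK32


def uniform_stream(seed, count):
    """Return `count` floats in [0, 1) from a non-zero 32-bit seed."""
    values, state = [], seed
    for _ in range(count):
        state = xorshift32(state)
        values.append(state / 2 ** 32)
    return values
```

The generator needs only a few shifts and XOR operations on a single word of state, which is why it is attractive when each thread must produce its own random numbers on the device instead of receiving them from the host.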
In all these experiments, we used a combination of individual and volitive steps, with both initial and final limits defined as a percentage of the function search space (Bastos-Filho et al., 2009). Three benchmark functions were used in the simulations; they are described in Equations (6) to (8). All the functions are used as minimization problems. The Rosenbrock function is a simple unimodal problem. The Rastrigin and Griewank functions are highly complex multimodal functions that contain many local optima.
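The first one is the Rosenbrock function. It has a global minimum located in a banana-shaped valley. The region where the minimum point is located is very easy to reach, but convergence to the global minimum is hard to achieve. The function is defined as follows:
$$F_{Rosenbrock}(x) = \sum_{i=1}^{D-1} \left[ 100 \left( x_{i+1} - x_i^2 \right)^2 + \left( 1 - x_i \right)^2 \right] \qquad (6)$$
The second function is the generalized Rastrigin, a multimodal function that induces the search towards deep local minima arranged as sinusoidal bumps:
$$F_{Rastrigin}(x) = 10D + \sum_{i=1}^{D} \left[ x_i^2 - 10 \cos(2 \pi x_i) \right] \qquad (7)$$
Equation (8) shows the Griewank function, which is also a multimodal function:
$$F_{Griewank}(x) = 1 + \sum_{i=1}^{D} \frac{x_i^2}{4000} - \prod_{i=1}^{D} \cos\left( \frac{x_i}{\sqrt{i}} \right) \qquad (8)$$
where $D$ is the number of dimensions and $x_i$ is the $i$-th coordinate of the position. All simulations were carried out in 30 dimensions.
Function     Search space            Initialization range    Optimum
Rastrigin    −5.12 ≤ x_i ≤ 5.12      2.56 ≤ x_i ≤ 5.12       0.0^D
Griewank     −600 ≤ x_i ≤ 600        300 ≤ x_i ≤ 600         0.0^D
Table 3. Functions used: search space, initialization range and optima.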
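The three benchmark functions can be written directly in Python using their standard textbook definitions. The sketch below is a plain sequential reference, useful for checking a parallel implementation against known optima; the function names are chosen for the example.

```python
import math


def rosenbrock(x):
    """Rosenbrock function; global minimum 0 at x = (1, ..., 1)."""
    return sum(
        100.0 * (x[i + 1] - x[i] ** 2) ** 2 + (1.0 - x[i]) ** 2
        for i in range(len(x) - 1)
    )


def rastrigin(x):
    """Generalized Rastrigin function; global minimum 0 at the origin."""
    return 10.0 * len(x) + sum(
        xi * xi - 10.0 * math.cos(2.0 * math.pi * xi) for xi in x
    )


def griewank(x):
    """Griewank function; global minimum 0 at the origin."""
    total = sum(xi * xi for xi in x) / 4000.0
    product = 1.0
    for i, xi in enumerate(x, start=1):  # coordinates are indexed from 1
        product *= math.cos(xi / math.sqrt(i))
    return 1.0 + total - product
```

Each function depends only on the position of a single fish, so the fitness of all fish can be evaluated independently, which is the property that makes one-thread-per-fish evaluation possible.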
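Figs. 3, 4 and 5 present the fitness convergence as a function of the number of iterations for the Rosenbrock, Rastrigin and Griewank functions, respectively. Tables 4, 5 and 6 present the average value and the standard deviation of the fitness at the 10,000th iteration for the Rosenbrock, Rastrigin and Griewank functions, respectively.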
Analyzing the convergence of the fitness values, the results for the parallel FSS versions on the GPU demonstrate that there is no reduction in the quality of the solutions compared with the original version running on the CPU. Furthermore, there is a slight improvement in the quality of the values found for the Rastrigin function (see Fig. 4), especially for the asynchronous FSS Version B. This might occur because the outdated data generated by the race condition can avoid premature convergence to local minima in multimodal problems. According to the execution time results in Tables 7, 8 and 9, all FSS implementations based on the GPU achieved a time performance around 6 times better than the CPU version.

Conclusion
In this chapter, we presented a parallelized version of the Fish School Search (FSS) algorithm for graphics hardware acceleration platforms. We observed a significant reduction of the execution time when compared to the original FSS version running on the CPU. This swarm intelligence technique proved to be very well adapted to solving some optimization problems in a parallel manner: with GPU parallel computing, results of the same or slightly better quality were obtained in less time. Since FSS can be easily parallelized, we demonstrated that by implementing FSS on a GPU one can benefit from the distributed floating-point processing capacity. We obtained a speedup of around 6 with a cheap GPU card. We expect to obtain a higher performance on more sophisticated GPU-based architectures. Since the asynchronous version achieved the same fitness performance with a lower processing time, we recommend this option. As future work, one can investigate the performance in more complex problems and assess the scalability on more advanced platforms.
The NVIDIA CUDA platform classifies NVIDIA GPUs using what it calls Compute Capability, as depicted in NVIDIA (2010b). The cards that support double-precision floating-point numbers have Compute Capability 1.3 or 2.x. The cards with Compute Capability 2.x can run up to 1,024 threads in a block and have 48 KB of shared memory space. The other ones can only execute 512 threads per block and have 16 KB of shared memory space.

Fig. 2. CUDA C program structure

Fig. 3. Rosenbrock's fitness convergence as a function of the number of iterations.

Fig. 5. Griewank's fitness convergence as a function of the number of iterations.
Table 2 presents the parameters used for the individual and volitive steps.

Table 2. Initial and Final values for Individual and Volitive steps.

Table 4. The Average Value and Standard Deviation of the Fitness value at the 10,000th iteration for the Rosenbrock function.

Tables 7, 8 and 9 present the average value and the standard deviation of the execution time and the speedup for the Rosenbrock, Rastrigin and Griewank functions, respectively.

Table 6. The Average Value and Standard Deviation of the Fitness value at the 10,000th iteration for the Griewank function.

Table 7. The Average Value and Standard Deviation of the Execution Time and Speedup Analysis for the Rosenbrock function.

Table 8. The Average Value and Standard Deviation of the Execution Time and Speedup Analysis for the Rastrigin function.

Table 9. The Average Value and Standard Deviation of the Execution Time and Speedup Analysis for the Griewank function.