The Speedup of Discrete Event Simulations by Utilizing CPU Caching

Discrete event simulation (DES) has been widely used for both theoretical and application researches in various fields. The low-cost numerical experiments by DES can be carried out repeatedly to investigate the collective behavior or dynamics of a complex system which usually consists of a huge number of elements or is too expensive to be established realistically [3]. As for engineering applications, new algorithms, measures, or even standards, often require a great deal of DES experiments before actual deploying, especially in the field of communication networks. However, the feasibility of DES method relies not only on the correctness of computational model of the target system but also the spend in the time of computation. An experiment that costs over days of computer computation will limit the application area of DES considerably.


Introduction
Discrete event simulation (DES) has been widely used for both theoretical and application researches in various fields.The low-cost numerical experiments by DES can be carried out repeatedly to investigate the collective behavior or dynamics of a complex system which usually consists of a huge number of elements or is too expensive to be established realistically [3].As for engineering applications, new algorithms, measures, or even standards, often require a great deal of DES experiments before actual deploying, especially in the field of communication networks.However, the feasibility of DES method relies not only on the correctness of computational model of the target system but also the spend in the time of computation.An experiment that costs over days of computer computation will limit the application area of DES considerably.For a DES system designed for communication network simulation, which is implemented by the event-driven method, computation tasks waiting for processing are usually represented by their corresponding events.These events need to be maintained in-sequence by a systematic scheduler.The causality that a scheduler must preserve is usually guaranteed by sorting and dispatching according to the time of event.An event is to be suspended unless its time value is smaller than or equal to the present time.When a great number of tasks or events are in pending, the cost of event scheduling becomes the top contributor to the computational time.For a large scale communication network simulation, for example, scheduling can consume over 40% of the total computational time [12].To accelerate scheduling, several practical algorithms have been proposed and adopted for some widely used simulators such as NS2 [9] and OMNet++ [14].These algorithms can be categorized, according to the data structure used, into linear list, partitioning list, heap and tree.Among them, the partitioning list has attracted much attention.
The partitioning list algorithm proposed by R. Brown is called Calendar Queue (CQ) [4], where a data structure, named bucket, is used to partition events into a sequence of sub-lists.
CQ can reduce the time for event searching and decrease the complexity down to O (1).Since then, some improved algorithms, such as Dynamic CQ (DCQ) [1], SNOOPy CQ [13], and sluggish CQ [15], have been proposed in order to improve the adaptability of CQ to the event distribution in the time domain.However, the way of event enqueue and dequeue remains unchanged.For these algorithms, the complexity of O(1) is only valid for bucket searching and it doesn't take into account event locating within a bucket.In addition, the overhead for bucket managing inevitably results in some negative effects on the performance of computation.K. Chung, J. Sang and V. Rego compared some earlier-day scheduling algorithms and found that, for a token-ring network simulation, both CQ and DCQ are no better than those based on heap or tree structure [6].
In the past decades, researchers have put their interesting into parallel and distributed approaches [7] to speedup simulation by a cluster of computers or processors.Parallel and distributed event scheduling need to cope with challenges such as time synchronization and task balance and remain a gap to extensive adoption.An alternative way to speedup DES simulation, we think, is to exploit the potentials offered by the modern processors.As an instance of attempts, recently H. Park and P.A. Fishwick proposed a graphics processing units (GPU) based algorithm which shows 10-fold speedup [11].However, until now the caching mechanism of processor has not been considered in existing event scheduling algorithms.Herein we provide a cache aware algorithm and verify its improvement on the performance by extending the conventional NS2.
The motivation to utilize caching of processor or CPU is straightforward.
Both cache based method and cache aware method have been applied successfully in some heavy data-retrieving algorithms, including IP routing table lookup [5] and Radix tree implementation [2], where a huge amount of data are frequently accessed.This situation happens for a large scale DES as well, and CPU caching is benefit to speed up event scheduling.
The chapter is organized as follows.Section 2 gives a analysis of event driven mechanism of the NS2 simulator, an estimation on the number of event of the typical application, and a description of the conventional CQ algorithm.A cache aware algorithm is presented in Section 3, followed by a complexity analysis on enqueue and dequeue operations.In section 4, experiments to verify CPU cache awareness are provided, aiming at the evaluation of performance and its relationship with the size of event queue.Section 5 summarizes the chapter.

NS2 and Calendar Queue algorithm
For the typical event-driven DES system, simulation events waiting for processing are buffered in a queue and sorted according to their simulation time or timestamp.A simple and natural choice is to introduce a linear list to manage the queue of event.The number and distribution of events in the time domain hence dominate the performance of queue management.Such performance depends on not only simulation complexity, but also the implementation of a simulator.In order to make the issue tractable, the following analyses are developed on the widely used network simulator NS2 [9].The event generation and consumption can be illustrated by tracing a packet transportation along a simulation link.As it can be imaged, the end of packet transferring at a output port of node will trigger a next serving event which will be sent back to the output queue, i.e.Queue, in the future time of simulation.While time delay due to propagation through the LinkDelay will create a packet arriving event which will be forwarded the input interface, i.e.TTLChecker, of the next node later.These two kinds of events lead to invocations of the class Queue and TTLChecker are buffered into the event queue of Scheduler,a ss h o w e d in Fig. 2.
Packet generating is controlled by the object Application, which has been developed and divided into two types which represent the stochastic generator and application simulator, respectively.The former creates an event that will trigger packet generating of Application recursively.The latter generates packet according to the mechanism of the simulated application.The following discussion will focus on the former for simplicity.

Population of standing events
As mentioned above, there are 3 types of events in NS2, including packet generating, timer and "at-event".Packet is used to model packet's transmission, timer to build protocol state-machine and packet's generator, and "at-event"to control computation.Application and its derivations are designed to generate a flow of packets, where one or more timer(s) is/are used to trigger the generation in a recursive manner.Queue and LinkDelay are designed for packet forwarding and transmitting through a link connecting a pair of neighbor nodes, respectively.Two timers are generated when a LinkDelay handles a received packet, one for triggering packet receiving at the next node later on, the other for calling Queue back when the LinkDelay is available for transmitting next packet buffered in the Queue.Hopping of a packet along a link, hence, induce two consequential events.
Let h k denotes the number of hops or links over an end-to-end path indexed by k.T h e number, represented by g k , of packets are generated by the corresponding Application during One Way Delay (OWD) of the path.The total number of Application's instances, n, is assumed as the same as the number of paths.Then, in the context of NS2, one packet-typed event induces 2h k timer-typed events if the packet does not drop during forwarding and transmitting.Therefore, the number N of pending events is as follows, where, k ∈ [1..n],i sa l s ot h ei n d e xo fApplicationso fn concurrent traffic flows.For a network with 100 nodes and a full mesh typed traffic pattern, n equals 9900.
For a typical traffic with bandwidth demand 1 Mbps and packet size 1000 bytes, the packet generation rate r is 125 packet/s.Assuming the propagation delay along each link is fixed as 2 ms and each path consists of 6 hops in average, we have h k = 6 and OWD is greater than 12 ms.Then, It is easy to figure out that the number of events is approximately, Here, for simplicity, we neglect the effects of packet transmitting and queuing at an output port.As one knows, transmitting delay is inverse proportion to the bandwidth of output link, and it contributes an increment to OWD.Queuing can bring about packet drops and introduce ad e c r e m e n tt og k during network congestion.However, as it is the result of overloading, network congestion implies much more pending events.Therefore, the above estimated N can be seen taken roughly as the low bound for event scheduling.
If N events are buffered in single one linear list, the average times of memory accessing for sorted insertion is about N/2 (10 5 ).Such heavy volume of accessing can lead to much too high time overhead.For networks with burst pattern traffics and higher traffic demands, the speed of event scheduling becomes a critical factor for fast simulation.

The Calendar Queue scheduling
The default event scheduler of NS2 is based on Calendar Queue (CQ) algorithm, which works similar to a desktop calendar [4].In CQ, a bucket is used to stand for the day of year and store a sub-list of events.Events happening on the same day, despite the deference of year, are stored in the same bucket as shown in Fig. 3.
For simplicity, 5 events, E 0 to E 4 , are illustrated in Fig. 3, whose timestamps are defined as 0, 1, 1, 3 and 16 in second.Given that the depth or size of bucket, T B , is 2 seconds in time and one year consists of 3 buckets, the length of year, T y , is 6 seconds.Therefore, events E 0 to E 2 are located in bucket B(0), E 3 in B( 1),andE 4 in B(2) of the third year.
The bucket location or index is determined by the following arithmetic computation, where, t e is the timestamp of an event.E 0 , E 1 and E 2 are inserted in the bucket B(0) since computations according to Eq.( 4) result in the same value 0. Again, the computation of the time, 16, of E 4 gives a value 2, indexing the bucket B(2).
The computation of Eq.( 4) is independent on the number of events in the queue, the complexity is therefore of O(1) [4].However, event enqueue in a single bucket works as the 51 The Speedup of Discrete Event Simulations by Utilizing CPU Caching Figure 3.The structural relationship between event and bucket in CQ same way as a linear list.In the case as showed in Fig. 3, the insertion of E 2 will require two extra steps of comparing.The final complexity is dominated by the sorted insertion in the bucket unless the number of bucket is greater than the total number of events and the distribution of event in time is uniform.
Actually, bucket plays as a container and its size determines events partitioning.The fixed structure of bucket becomes an obstacle for performance improvement.It has been shown that a skewed nonuniform distribution of events in the time domain can degrade the performance of CQ [12].Since then, several improved scheduling algorithms [1,13,15] have been proposed to solve the problem of adaptability to the distribution.
In addition to the structure of bucket, there are two issues that CQ and the related improved algorithms can not cope with readily.One is concurrent events insertion and the other is the earliest event fetching.The former can be seen from the insertion of E 2 after E 1 as shown in Fig. 3.Although they have equal timestamp, logically E 2 occurs later than E 1 and one more comparison is needed to insert E 2 .If the number of concurrent events is huge, say 9900 as cited in Eq.( 3), a large number of comparisons are needed.The latter can be illustrated via E 4 's fetching.After E 3 departing, a CQ scheduler will carry out 6 times of bucket access across two years, then reach E 4 .This is more complex than a linear list, in which the earliest event is always located at the head.
As is seen, CQ and the related improved algorithms do not take the modern structure of processor into considerations, especially high speed CPU caches.It has been demonstrated that a cache aware algorithm can speedup greatly an application that involves heavy data-retrieving [2,5].In the following section, we provide a fast cache aware scheduling algorithm to accelerate the simulation of large scale communication networks.

Data structure for event partitioning
Similar to CQ, the algorithm with cache awareness belongs to the category of partitioning list.Two lists work in a correlated manner, one for DES events and named event queue (Qe), and the other for indexing the sub-list of DES events and named digest queue (Qd).Qe is organized the same as a linear list, while Qd is implemented in an array structure acts as a ring-typed buffer.The element of Qd, called digest, is used to index the tail of a sub-list of Qe, as depicted in Fig. 4. Ad i ge s tD j contains a time delimiter tm and an event pointer ptr to the corresponding DES event of a sub-list.In Fig. 2, D 0 is the digest of the sub-list (E 0 , E 1 ), and D j is the digest of the sub-list (E i , ..., E k−1 ).The time delimiter of digest is defined as follows, where the coefficient g is used to control the size of sub-list, and ts is the timestamp of the corresponding event.If g = 1, in unit of timestamp, a sub-list consists of only those events with the same timestamp.For g > 1, events happen within g interval are indexed by a digest pointing to the latest of them.In effect, the configurable coefficient g is analog to bucket size of CQ.The difference is that two successive digests need not have their tm contiguous.In other words, D j+1 → tm -D j → tm is allowed to be greater than 1×g.This makes our sub-list more adaptive to event distribution in the time domain.
The value of ptr of a digest is the address of corresponding DES event.The function of this address is two-fold, one for heading of the next sub-list, the other for tailing of the current sub-list.The reason why the ptr points to the tail of a sublist is to avoid repeating comparisons for concurrent events with the same time-stamp.This situation happens more frequently in case simulation parameters, ex.bandwidth or link delay, are configured with the same value.

Enqueue and dequeue operations
As described in Algorithm 1, inserting a new DES event, or ENQUEUE, is a little more complex operation than a linear list.There are 6 significant steps in ENQUEUE operation, including, • to calculate the time maker, x, of the new DES event, according to Eq.( 4); • to find the target sub-list by comparing x with each digest of Qe; • to sort and insert the new event into the sub-list by InsertList(); • to replace the digest if the new event is inserted at the tail, otherwise; • to update the digest if digest queue is full, otherwise; • to create a digest and insert it into digest ring buffer by InsertDigQueue().InsertDigQueue (new, x, j + 1); end Since an array structure is used for the digest queue, the conventional ring-buffer access method can be adopted to implement InsertDigQueue().The function InsertList() is, however, the same as that for a linear list.
The event departure or DEQUEUE operation is simply as a linear list, as shown in Algorithm 2. The function FetchList() is to fetch the head of DES event list, and the following block of codes to update the variable h is to remove the first of digest queue.The extra processing on the head is used to keep the correctness of digesting.

Cache awareness and computational complexity
The difference between a cache aware algorithm and CQ is the way to partition events into sub-lists.For the cache aware algorithm, an array of digest is designed to index sub-lists and works as a ring buffer.In the ENQUEUE operation, both sub-list searching and digest inserting (InsertDigQueue()) execute at first over the array.
Generally speaking, a fixed array of data is with great possibility allocated by the memory management of an OS to be continuously distributed in the memory space.This locality is feasible for fast accessing by a processor with caches.Caches are fast memories in which a block of data around the request are loaded from the main memory.The caching operation is benefit to the future requests if they locate in the loaded block.The ratio, β, of access latency of main memory to cache, is usually greater than 10.Therefore, an optimum data structure and its accessing can be designed to utilize the benefit.Two types of approach have been used in designing, one is to access cache directly by instructions based on the hardware structure, the other is to design a deliberate data structure and access them as locally as possible.The former promises better performance but less applicable for different hardware structures, while the latter is cache awareness and has a better compatibility for general purposes.
Buckets of CQ/DCQ embedded in NS2 are allocated by an array and can assure locality.But this locality can not make use of the benefit of caching, because the bucket determination for event searching is computed directly and has no relationship with buckets' allocation.
For the cache aware algorithm, the target sub-list is searched via the digest queue Qd.I fw h ole of Qd can be loaded into caches, the searching can speed up β times.Therefore, the complexity of sub-list searching is equivalent to O( m 2β ).For an average distribution of event in time, the size of each sub-list bing n m ,t h ec o m p l e x i t yo fInsertList() is then O( n 2m ).This leads to the complexity of O( m 2β )+O ( n 2m ) for an ENQUEUE operation.The optimum condition is as follows, m = βn. ( 6 )  Consequently, the lowest complexity of ENQUEUE is O( n β ), twice as that of CQ configured with m = βn buckets.However, for the case of a sub-list consists of concurrent events with the same timestamp, the cache aware algorithm has an advantage over CQ since that the tail of sub-list is indexed.Sublist searching of the algorithm is of O(1) while that of CQ goes up of O( n β ).
As for DEQUEUE operation, the cache aware algorithm is equivalent to remove the head of a list.The complexity is O(1) and it is independent on the distribution of event in time.As a conclusion, the algorithm is better than CQ.

Description of implementation
The cache aware algorithm has been implemented within an extended class CacheScheduler which is derived from the existing ListScheduler.T w o f u n c t i o n s , insert() and deque(), are overridden, and some helper functions appended.Modifications are based on the NS with version 2.33 and should be compatible with 55 The Speedup of Discrete Event Simulations by Utilizing CPU Caching the most of other versions because no change is required except a config modification for Simulator is required to select CacheScheduler rather than CalendarScheduler in default.
T w ofi x e da r r a ya r ed e fi n e di nt h ec l a s sCacheScheduler, one named key_ with type of unsigned int for digest's time, the other named event_ with type of Event * for digest's pointer.Two variable members, head_ and tail_, are defined to index the ring-typed buffer.The coefficient g in Eq.( 5) is represented by the third variable member pricision_,andthe size of digest by the last size_.computations, the simulated network is examined and replaced by a regenerated topology until it is fully connected.In order to make the experiment more generalized, the simulated network topology is built randomly.The number of network nodes and the pattern of traffics are adjustable.   1 and Fig. 5 show the computational time spending for simulations T versus the number of traffic flows, for CQ (Calendar), the cache aware (Cache) and the linear list (List) schedulers.

The overridden function
Experiments are carried out on a personal computer with a Intel(R) Pentium(R) 4 typed CPU with 2.93 GHz in frequency and 1024 kB in cache size.The computational time is evaluated by invoking TCL predefined command clock at 0.0 and 2.0 seconds in the simulation time.
It can be seen from Fig. 5 that the cache aware algorithm is always faster than CQ and the maximum speedup factor reaches 37.
59 The Speedup of Discrete Event Simulations by Utilizing CPU Caching Fig. 6 shows the computational time varying with the size of digest queue, for the case configured with a cache aware scheduler and 30 sources, i.e. 99 × 30 = 2970 flows.Since a ring-buffer management is used in the cache aware algorithm, the actual capacity of the queue is one smaller than its literal size.The condition with the queue sized 1 means to disable the digest queue, and the cache aware algorithm is degraded and equivalent to the linear list.The condition with the queue sized 2 allows only one digest which always points to the tail of DES event list.Experiment results on the size of digest queue are also listed in Tab.From Fig. 6 and Tab.2, it can concluded that a cache aware algorithm makes use of the benefit of CPU caches when the size of digest queue is greater than 16.It is can be seen also that the speedup efficiency relies on the size of cache-line rather than the size and topology of the simulated network.The detailed analysis is given in the next section.

Effectiveness of CPU caching
The result of the experiment indicates that the cache aware algorithm can speedup DES event scheduling for large scale networks.Continuously allocated digest queue in memory is benefit to fully use the feature of caching.Meanwhile, the digest queue acts as a classifier to partition DES event list into sub-lists with a shorter length, and hence reduce the searching time of ENQUEUE operations.
The digest in the cache aware algorithm is managed by two variables, key_ and Event_, each 4 bytes in length.Hence, a digest queue with size 16 occupies 128 bytes in the main memory.This is exactly equivalent to 128-byte sized L2 cache line of the processor [8] used in experiments.The dropping approaching the size 16 shown in Fig. 6 coincides with the CPU L2 caching capability.Therefore, the algorithm is cache-line awareness for the CPU chip employed in our experiments.This means that computation time reduction remains nearly constant when the size of digest queue increases over that of cache line, i.e., 8 elements or 64 bytes, roughly 10 as showed in experiments.In this condition, missing in single a cache line causes a regular main memory access so that the contribution of CPU caching is saturated.
The algorithm provided in this chapter is not the best that can fully utilize the capability of CPU resources since that the design of digest queue and operations have no knowledge of CPU's microstructure.However, a further improvement that depends on the specific CPU products will bring a serious issue in compatibility.The trade-off between performance and compatibility needs extensive investigations and relates to the application field of discrete event simulations.

Conclusion
Discrete event simulation method has been employed in various research fields for exploring a complex or large system numerically.The applicability of this kind of simulation method relies in large on the computation speed.In this chapter, fast event scheduling approaches for simulations of large scale networks is discussed.The typical simulation procedures are described according to NS2 platform, followed by a brief analysis of the conventional Calendar Queue algorithm.Based on the list partitioning method, a digest queue on the fixed array structure is introduced to partition and index the sub-list of DES events.The double-list structure can utilize the benefit of caching mechanism of modern CPU chips.Details of enqueue and dequeue algorithm are given and the complexity analysis is presented.Also, developments based on the open-source NS2 software are showed for the algorithm implementation.In order to verify the benefit of cache awareness to speedup simulation, we report computational experiments over an on-shelf personal computer.It is shown that the performance of a cache aware algorithm is considerably better than the conventional Calendar Queue.Experiment results show that the cache aware algorithm makes simulation faster by up to a factor of 37. The improvement in computation efficiency is applicable for any ordinary network topology and traffic pattern since the randomized topologies and patterns are employed in our experiments.
For simplicity, the algorithm's design and implementation are focused on the network simulation with all traffics assumed over UDP/IP protocol stacks.As for the simulation involves in TCP connection, events used for protocol machine controlling, such as those for time-out processing, bring numbers of canceling operations into event scheduling.The event cancelation in the queue based on the ring-typed buffer should result in more computational complex than that of convectional algorithms.A better solution, we think, is to put these "at-events" being significant for the local network node into a separate object from the global event scheduler.This separation is not only benefit in speedup to the scheduler but also to parallel enabler.
On the other hand, the advancement of multi-core architecture of CPU deserves to be verified whether the proposed algorithm can utilize caches of multi-core chip as well.Moreover, it's valuable to study multi-core technology to achieve parallel speedup in the meantime.

Figure 2 .
Figure 2. The configuration node and link of NS2 and typical events written in italic the Scheduler include ListScheduler, which implements the linear list algorithm, and CalendarScheduler, which implements CQ and DCQ.

Figure 4 .
Figure 4. Event digests and their relationship with DES events

Figure 5 .
Figure 5. Computational time and speedup factor over CQ VS traffics (in unit of 99 flows)

Figure 6 .
Figure 6.Computation time VS digest queue size

Scheduler 1 1 ListScheduler NsObject Simulator Application Queue TTLChecker Agent Process LinkDelay CalendarScheduler TclObject Handler Classifier Connector Figure 1. The hierarchical structure of main classes of NS2 in UML schema 2.1. Event handling mechanism
ENQUEUE operation over a non-empty digest queue input : A new event new, a DES event list headed by H, and a digest ring headed by h and tailed by t. output: Updated event list and digest ring.
if D[j] → tm == x then /*To check the time maker of the digest*/ if list → ts <= new → ts then /*To replace the end of sub-list*/ (D[j] → ptr) ←− new; return; end else if t + 1 ==h then /*To update the digest*/ (D[j] → ptr) ←− new; (D[j] → tm) ←− x; else /*To create a digest and insert it at (j+1)-th of the digest ring */ deque() is defined in c++ as following, The function findDigest() results the pointer p of sublist of DES event after which the new should be inserted, and the index idx of the sublist in the digest queue.The c++ codes of findDigest() is defined as, Codes in the above from Line 05 to 06 are to find out the sublist that the event timestamped with key should be contained.Codes from Line 08 to 10 are to handle the special case of the first sublist, and codes from Line 13 to 17 are introduced to compensate tail pointing mechanism used in sublist digesting, considering that the head of a sublist is equivalent to the tail of its previous one.
where, the variable member queue_ points to the head of DES event list.The condition (event_[head_] == e) is for checking whether it should remove the head of digest queue or not.The c++ codes of the function insert() is showed as following, where, the functions findDigest() and insertDigest() operate over the array key_ and event_.

Table 1 .
Computational times of three scheduling algorithms and speedups of the proposed (Cache) to CQ

Table 2 .
Computational time (T) varies with the size of digest queue