Modeling parameters and their typical values.
In shared-memory bus-based multiprocessors, the number of processors is often limited by the (shared) bus; when the utilization of the bus approaches 100%, processors spend an increasing amount of time waiting to get access to the bus (and shared memory) and this degrades their performance. The limitations imposed by the bus depend upon many parameters, and different parameters affect the performance in different ways. This chapter uses timed Petri nets to model shared-memory bus-based multiprocessors at the instruction execution level and shows how the performance of processors and the system are affected by different modeling parameters. Discrete-event simulation of the developed net models is used to get performance results.
- shared-memory multiprocessors
- bus-based multiprocessors
- timed Petri nets
- performance analysis
- discrete-event simulation
Due to continuous progress in manufacturing technologies, the performance of microprocessors has been steadily improving over several decades, doubling every 18 months (the so-called Moore’s law ). The capacity of memory chips has also been doubling every 18 months, but the performance has been improving less than 10% per year . The performance gap  between the processor and its memory have been doubling approximately every 6 years, and an increasing part of the processor’s time is being spent on waiting for the completion of memory operations. Although multilevel cache memories are used to reduce the average latencies of memory accesses, matching the performances of the processor and the memory is an increasingly difficult task. In effect, it is often the case that more than 50% of processor cycles are spent waiting for the completion of memory accesses . A model of a pipelined processor at the instruction execution level is used in this chapter to study the mismatch of processor and memory performances.
This model of a processor is then used for performance analysis of shared-memory bus-based multiprocessors. The main objective of this analysis is to study the degradation of the processor’s performance when the utilization of the (shared) bus approaches 100%. This performance degradation limits the number of processors in bus-based systems.
Modeling and analysis of shared-memory bus-based systems requires a flexible formalism that can easily handle concurrent activities as well as synchronization of different events and processes that occur in such systems. Petri nets [5, 6] are such formal models.
As formal models, Petri nets are bipartite directed graphs in which the two types of vertices represent, in a very general sense, conditions and events. An event can occur only when all conditions associated with it (represented by arcs directed to the event) are satisfied. An occurrence of an event usually satisfies some other conditions, indicated by arcs directed from the event. Hence, an occurrence of one event causes some other event (or events) to occur, and so on.
In order to study the performance aspects of systems modeled by Petri nets, the durations of modeled activities must also be taken into account. This can be done in different ways, resulting in different types of temporal nets . In timed Petri nets , occurrence times are associated with events, and the events occur in real time (as opposed to instantaneous occurrences in other models). For timed nets with constant or exponentially distributed occurrence times, the state graph of a net is a Markov chain (or an embedded Markov chain). If the state space of a timed net is finite and reasonably small, the stationary probabilities of states can be determined by standard methods . Then these stationary probabilities are used for the derivation of many performance characteristics of the model . In other cases, discrete event simulation  is used to find the performance characteristics of a timed net.
In this chapter, timed Petri nets are used to model shared-memory bus-based multiprocessor systems. Section 2 recalls basic concepts of Petri nets and timed Petri nets. Section 3 discusses a model of a pipelined processor and its performance as a function of modeling parameters. Shared-memory bus-based systems are described and analyzed in Section 4. Section 5 concludes the chapter.
2. Timed Petri nets
In Petri nets, concurrent activities are represented by tokens that can move within a (static) graph-like structure of the net. More formally, a marked place/transition Petri net is defined as a pair , where the structure is a bipartite directed graph, , with two types of vertices, a set of places and a set of transitions , and a set of directed arcs connecting places with transitions and transitions with places, . The initial marking function assigns nonnegative numbers of tokens to places of the net, . Marked nets can be equivalently defined as .
A place is shared if it is connected to more than one transition. A shared place is free-choice if the sets of places connected by directed arcs to all transitions sharing are identical. A shared place is (dynamically) conflict-free if for each marking reachable from the initial marking, at most one transition sharing is enabled. If a shared place is not free-choice and not conflict-free, the transitions sharing are conflicting.
In timed nets , occurrence times are associated with transitions, and transition occurrences are real-time events, i.e., tokens are removed from input places at the beginning of the occurrence period and are deposited to the output places at the end of this period. All occurrences of enabled transitions are initiated in the same instants of time in which the transitions become enabled (although some enabled transitions may not initiate their occurrences). If, during the occurrence period of a transition, the transition becomes enabled again, a new, independent occurrence can be initiated, which will overlap with the other occurrence(s). There is no limit on the number of simultaneous occurrences of the same transition (sometimes this is called infinite occurrence semantics). Similarly, if a transition is enabled “several times” (i.e., it remains enabled after initiating an occurrence), it may start several independent occurrences in the same time instant.
More formally, a timed Petri net is a triple, , where is a marked net, is a choice function which assigns probabilities to transitions in free-choice classes, or relative frequencies of occurrences to conflicting transitions, , and is a timing function, which assigns an (average) occurrence time to each transition of the net, , where is the set of nonnegative real numbers.
The occurrence times of transitions can be either deterministic or stochastic (i.e., described by some probability distribution function). In the first case, the corresponding timed nets are referred to as D-timed nets ; in the second, for the (negative) exponential distribution of occurrence times, the nets are called M-timed nets (Markovian nets) . In both cases, the concepts of state and state transitions have been formally defined and used in the derivation of different performance characteristics of the model. In simulation applications, other distributions can also be used, for example, the uniform distribution (U-timed nets) is sometimes a convenient option. In timed Petri nets, different distributions can be associated with different transitions in the same model providing flexibility that is used in simulation examples that follow.
In timed nets, the occurrence times of some transitions may be equal to zero, which means that the occurrences are instantaneous; all such transitions are called immediate (while the others are called timed). Since immediate transitions have no tangible effects on the (timed) behavior of the model, it is convenient to “split” the set of transitions into two parts, the set of immediate and the set of timed transitions, and to first perform all occurrences of the (enabled) immediate transitions, and then (still in the same time instant), when no more immediate transitions are enabled, to start the occurrences of (enabled) timed transitions. It should be noted that such a convention effectively introduces the priority of immediate transitions over the timed ones, so the conflicts of immediate and timed transitions are not allowed in timed nets. Detailed characterization of the behavior of timed nets with immediate and timed transitions is given in .
3. Pipelined processors
A timed Petri net model of a pipelined processor  at the level of instruction execution is shown in Figure 1 (as usual, timed transitions are represented by solid bars, and immediate transitions by thin bars). It is assumed that the first level cache does not delay the processor, while cache misses (at the first level cache) introduce a delay of processor cycles. For simplicity, only two levels of cache memory are represented in the model; it appears that such a simplification does not affect the results in a significant way .
Place Pnxt is marked when the processor is ready to execute the next instruction. Pnxt is a free-choice place with three possible outcomes that model issuing an instruction without any further delay (Ts0 with the choice probability ), a single-cycle pipeline stall (modeled by Td1 with the choice probability associated with Ts1), and a two-cycle pipeline stall (modeled by Td2 and then Td1 with the choice probability assigned to Ts2). Other pipeline stalls could be represented in a similar way, if needed.
Marked place Cont indicates that an instruction is ready to be issued to the execution pipeline. It is assumed that once the instruction enters the pipeline, it will progress through all the stages and, eventually, leave the pipeline. Since the details of pipeline implementation are not important for performance analysis of the processor, they are not represented here. Only the first stage of the execution pipeline is shown as timed transition Trun.
Done is another free-choice place which determines if the executing instruction results in a cache miss or not. Transition Tnxt occurs (with the corresponding probability) if cache miss does not occur and the processor can continue fetching and issuing instructions. Cache miss is represented by . The choice probability associated with Tsel determines the instruction runlength, , i.e., the average number of instructions between two consecutive cache misses; if this choice probability is equal to 0.1, the runlength is equal to 10; if it is equal to 0.2, the runlength is 5; and so on.
is another free-choice place; it models the hits and misses of the second-level cache. The probability associated with transition represents the hit ratio of the second-level cache (the occurrence time of is the average access time to the second-level cache, ) while the miss ratio is associated with transition which represents accesses to the main memory (with the occurrence time ).
Typical values of modeling parameters used in this chapter are shown in Table 1 .
|First-level cache hit rate||0.9|
|Second-level cache hit rate||0.8|
|First-level cache access time||1|
|Second-level cache access time||5|
|Main memory access time||25|
|Prob. of one-cycle pipeline stall||0.1|
|Prob. of two-cycle pipeline stall||0.05|
All temporal data in Table 1 (i.e., cache and memory access times) are in processor cycles.
Processor utilization as a function of , the hit rate of the first-level cache, is shown in Figure 2 for two values of the second-level cache access time, , and . It should not be surprising that processor utilization is quite sensitive to the values of , but is much less sensitive to the values of .
Processor utilization as a function of , the hit rate of the second-level cache, is shown in Figure 3 for two values of the main memory access time, and . Processor utilization is rather insensitive to values of , and does not change much with .
Processor utilization as a function of the probability of pipeline stalls is shown in Figure 4 for three combinations of values of and .
Again, processor utilization is rather insensitive to the probability of pipeline stalls as well as the values of and .
For pipelined processors shown in Figure 1 , processor utilization can be estimated using the following formula:
For the values of modeling parameters shown in Table 1 , processor utilization is:
4. Shared-memory bus–based systems
An outline of a shared-memory bus-based multiprocessor is shown in Figure 5 . The system is composed of identical processors, which access the shared memory using a system bus. To reduce the average access time to the shared memory, the processors use (multilevel) cache memories. It is assumed that memory consistency is provided by a cache coherence mechanism , which usually increases the miss ratio of accessing caches (and is otherwise not represented in the model).
A timed Petri net model of a shared-memory bus-based multiprocessor is shown in Figure 6 . It contains models of processors (only two are shown in Figure 6 ), which are copies of the model shown in Figure 1 except for the main memory (transition ) which becomes shared memory in Figure 6 . The remaining part of Figure 6 is modeling the bus that coordinates accesses of processors to the shared memory.
When a processor , , requests an access to shared memory, place becomes marked. If the bus is available (i.e., if place is marked), the occurrence of transition indicates that processor begins its access to shared memory. Transitions , …, constitute a conflict class with a fair resolution of conflicts (i.e., all conflicting processors have the same probability of being selected for accessing memory). In real systems, accessing the shared bus is often based on priorities assigned to processors; such priorities could easily be represented using inhibitor arcs in Petri nets.
Place collects memory access requests from all processors (occurrences of transitions ). The end of memory access (i.e., the end of the occurrence of ) is indicated by an occurrence of transition of the processor which initiated memory access. The occurrence of also returns a token to , allowing another access to shared memory to be executed.
In Figure 7 , the bus utilization approaches 100% for about five processors. Moreover, the degradation of processors’ performance due to increasing waiting times for accessing the bus (and shared memory) is well illustrated in Figure 7 .
The average waiting time (in processor cycles) of accessing shared memory (i.e., the times from requesting memory access in place to granting this access by an occurrence of ) is shown in Figure 8 as a function of the number of processors in the system.
Figure 8 shows that the waiting times increase almost linearly with the number of processors when this number is greater than 5, i.e., when the bus (and shared memory) is utilized in almost 100%.
If the value of the second-level cache hit rate, , increases (and the other parameters do not change), the number of accesses to main memory is reduced, so that the performances of processors and the whole system improve. Figure 9 shows the utilization of processors and the bus as functions of the number of processors in the system for . It also shows that the reduced (in comparison with Figure 7 ) utilization of the bus allows the increase of the number of processors without significant degradation of their performance.
By a similar argument, reduced hit rate at the first-level cache, , increases the number of accesses to the second-level cache as well as to the main memory, and this results in reduced performance of the system. Figure 10 shows the utilization of processors and the bus as functions of the number of processors in the system for . It provides a good illustration of the degradation of performance when compared with Figure 7 .
The number of processors for which the bus is used to almost 100% can be estimated by the following formula:
For the case shown in Figure 7 , this number is:
For the case shown in Figure 9 , this value is 8.2.
There are several ways in which the number of processors can be increased in bus-based systems without sacrificing the processors’ performance. The simplest approach is to introduce the second bus which allows two concurrent accesses to shared memory, provided the memory is dual port (it allows two concurrent accesses). Figure 11 outlines a dual bus shared-memory system.
Petri net model of a dual bus system is the same as in Figure 6 ; the only difference is the initial marking of place , which now requires two tokens to represent two concurrent accesses to shared memory. It should be observed observed that, for a small number of processors, the utilization of each bus in Figure 12 is one half of that in Figure 7 , and also the number of processors that can be used in such a dual bus system without degradation of their performance is twice as large as in a single bus system ( Figure 7 ).
The results shown in Figure 12 are very similar to those shown in Figure 9 . The second bus in a shared-memory system allows to perform two concurrent accesses to the shared memory. From a single processor’s performance point of view, this effect is similar to reducing two times the number of accesses to the shared memory, and this is the effect of reducing two times the miss rate for the second-level cache (which is shown in Figure 9 ).
If dual port memory cannot be used, the shared memory can be split into several independent modules which can be accessed concurrently by the processors provided that the bus is also split into sections associated with each module, with processors accessing all such sections, as shown in Figure 13 for four independent memory modules. The main difference between a multibus system ( Figure 11 ) and a system with split bus is in accessing the shared memory; in a multiple bus system, the whole shared memory is accessed by each bus, while in a split bus system ( Figure 13 ), each section of the bus accesses only one memory module. In the system shown in Figure 13 , up to four (the number of memory modules) memory accesses can be performed concurrently, but if two (or more) processors request access to the same memory module, the requests are served one after another.
In Figure 14 , there is a free-choice place for each processor , . This free-choice place selects the requested memory module by transitions , , and forwards the memory access request to the selected memory module (place ). If the selected module is available, i.e., if place is marked, the access to shared memory is initiated by the occurrence of . When this memory access is completed, the occurrence of releases the memory modules (by returning a token to ) and resumes instruction execution in the processor that requested the memory access.
If memory module is not available when it is requested, the memory access is delayed (in ) until the requested module becomes available.
It is possible that more than one processor becomes waiting for the same memory module. The selection of the processor which will get access first is random with the same probability assigned to all waiting processors. In real systems, there is usually some priority scheme that determines the order in which the waiting processors access the bus. Such a priority scheme could easily be modeled if it is needed (for example, for studying the starvation effect which can be created when the system is overloaded).
In Figure 14 , the selection of memory modules is random, with the same probabilities for all modules. If this policy is not realistic, a different memory accessing policy can be implemented, for example, the probabilities of accessing consecutive memory modules by each processor could be used to model sequential processing of large arrays, and so on.
In Figure 15 , even for 20 processors, the average utilization of the bus is close to 80%, so the system can accommodate more processors.
5. Concluding remarks
The chapter uses timed Petri nets to model shared-memory bus-based architectures at the level of instruction execution to study the effects of modeling parameters on the performance of the system. The models are rather simple with straightforward representation of modeling parameters.
Performance results presented in this chapter have been obtained by the simulation of developed Petri net models. However, the model shown in Figure 7 has only 10 states, so its analytical solution (for different values of modeling parameters) can be easily obtained and compared with simulation results to verify their accuracy. Table 2 shows such a comparison of processor utilization for several combinations of parameters and . In all cases, the simulation-based results are very close to the analytical ones.
|Simulated results||Analytical results|
The models of multiprocessor systems are usually composed of many copies of the same submodel of a processor and possibly other elements of the system. Colored Petri nets  can significantly simplify such models by eliminating copies of similar subsystems. Analysis of colored Petri nets is, however, much more complex than that of ordinary Petri nets.
Finally, it should be noted that the performance of real-life multiprocessor systems very rarely can be described by a set of parameters that remain stable for any significant period of time. The basic parameters like the hit rates depend upon the executed programs as well as their data, and can change very quickly in a significant way. Consequently, the characteristics presented in this chapter can only be used as some insight into the complex behavior of multiprocessor systems.