Modeling parameters and their typical values.
Abstract
In sharedmemory busbased multiprocessors, the number of processors is often limited by the (shared) bus; when the utilization of the bus approaches 100%, processors spend an increasing amount of time waiting to get access to the bus (and shared memory) and this degrades their performance. The limitations imposed by the bus depend upon many parameters, and different parameters affect the performance in different ways. This chapter uses timed Petri nets to model sharedmemory busbased multiprocessors at the instruction execution level and shows how the performance of processors and the system are affected by different modeling parameters. Discreteevent simulation of the developed net models is used to get performance results.
Keywords
 sharedmemory multiprocessors
 busbased multiprocessors
 timed Petri nets
 performance analysis
 discreteevent simulation
1. Introduction
Due to continuous progress in manufacturing technologies, the performance of microprocessors has been steadily improving over several decades, doubling every 18 months (the socalled Moore’s law [1]). The capacity of memory chips has also been doubling every 18 months, but the performance has been improving less than 10% per year [2]. The performance gap [3] between the processor and its memory have been doubling approximately every 6 years, and an increasing part of the processor’s time is being spent on waiting for the completion of memory operations. Although multilevel cache memories are used to reduce the average latencies of memory accesses, matching the performances of the processor and the memory is an increasingly difficult task. In effect, it is often the case that more than 50% of processor cycles are spent waiting for the completion of memory accesses [4]. A model of a pipelined processor at the instruction execution level is used in this chapter to study the mismatch of processor and memory performances.
This model of a processor is then used for performance analysis of sharedmemory busbased multiprocessors. The main objective of this analysis is to study the degradation of the processor’s performance when the utilization of the (shared) bus approaches 100%. This performance degradation limits the number of processors in busbased systems.
Modeling and analysis of sharedmemory busbased systems requires a flexible formalism that can easily handle concurrent activities as well as synchronization of different events and processes that occur in such systems. Petri nets [5, 6] are such formal models.
As formal models, Petri nets are bipartite directed graphs in which the two types of vertices represent, in a very general sense, conditions and events. An event can occur only when all conditions associated with it (represented by arcs directed to the event) are satisfied. An occurrence of an event usually satisfies some other conditions, indicated by arcs directed from the event. Hence, an occurrence of one event causes some other event (or events) to occur, and so on.
In order to study the performance aspects of systems modeled by Petri nets, the durations of modeled activities must also be taken into account. This can be done in different ways, resulting in different types of temporal nets [7]. In timed Petri nets [8], occurrence times are associated with events, and the events occur in real time (as opposed to instantaneous occurrences in other models). For timed nets with constant or exponentially distributed occurrence times, the state graph of a net is a Markov chain (or an embedded Markov chain). If the state space of a timed net is finite and reasonably small, the stationary probabilities of states can be determined by standard methods [9]. Then these stationary probabilities are used for the derivation of many performance characteristics of the model [10]. In other cases, discrete event simulation [11] is used to find the performance characteristics of a timed net.
In this chapter, timed Petri nets are used to model sharedmemory busbased multiprocessor systems. Section 2 recalls basic concepts of Petri nets and timed Petri nets. Section 3 discusses a model of a pipelined processor and its performance as a function of modeling parameters. Sharedmemory busbased systems are described and analyzed in Section 4. Section 5 concludes the chapter.
2. Timed Petri nets
In Petri nets, concurrent activities are represented by tokens that can move within a (static) graphlike structure of the net. More formally, a marked place/transition Petri net
A place is shared if it is connected to more than one transition. A shared place
In timed nets [8], occurrence times are associated with transitions, and transition occurrences are realtime events, i.e., tokens are removed from input places at the beginning of the occurrence period and are deposited to the output places at the end of this period. All occurrences of enabled transitions are initiated in the same instants of time in which the transitions become enabled (although some enabled transitions may not initiate their occurrences). If, during the occurrence period of a transition, the transition becomes enabled again, a new, independent occurrence can be initiated, which will overlap with the other occurrence(s). There is no limit on the number of simultaneous occurrences of the same transition (sometimes this is called infinite occurrence semantics). Similarly, if a transition is enabled “several times” (i.e., it remains enabled after initiating an occurrence), it may start several independent occurrences in the same time instant.
More formally, a timed Petri net is a triple,
The occurrence times of transitions can be either deterministic or stochastic (i.e., described by some probability distribution function). In the first case, the corresponding timed nets are referred to as Dtimed nets [12]; in the second, for the (negative) exponential distribution of occurrence times, the nets are called Mtimed nets (Markovian nets) [13]. In both cases, the concepts of state and state transitions have been formally defined and used in the derivation of different performance characteristics of the model. In simulation applications, other distributions can also be used, for example, the uniform distribution (Utimed nets) is sometimes a convenient option. In timed Petri nets, different distributions can be associated with different transitions in the same model providing flexibility that is used in simulation examples that follow.
In timed nets, the occurrence times of some transitions may be equal to zero, which means that the occurrences are instantaneous; all such transitions are called immediate (while the others are called timed). Since immediate transitions have no tangible effects on the (timed) behavior of the model, it is convenient to “split” the set of transitions into two parts, the set of immediate and the set of timed transitions, and to first perform all occurrences of the (enabled) immediate transitions, and then (still in the same time instant), when no more immediate transitions are enabled, to start the occurrences of (enabled) timed transitions. It should be noted that such a convention effectively introduces the priority of immediate transitions over the timed ones, so the conflicts of immediate and timed transitions are not allowed in timed nets. Detailed characterization of the behavior of timed nets with immediate and timed transitions is given in [8].
3. Pipelined processors
A timed Petri net model of a pipelined processor [14] at the level of instruction execution is shown in
Figure 1
(as usual, timed transitions are represented by solid bars, and immediate transitions by thin bars). It is assumed that the first level cache does not delay the processor, while cache misses (at the first level cache) introduce a delay of
Place
Marked place
Typical values of modeling parameters used in this chapter are shown in Table 1 .
Symbol  Parameter  Value 


Firstlevel cache hit rate  0.9 

Secondlevel cache hit rate  0.8 

Firstlevel cache access time  1 

Secondlevel cache access time  5 

Main memory access time  25 

Prob. of onecycle pipeline stall  0.1 

Prob. of twocycle pipeline stall  0.05 
All temporal data in Table 1 (i.e., cache and memory access times) are in processor cycles.
Processor utilization as a function of
Processor utilization as a function of
Processor utilization as a function of the probability of pipeline stalls
Again, processor utilization is rather insensitive to the probability of pipeline stalls as well as the values of
For pipelined processors shown in Figure 1 , processor utilization can be estimated using the following formula:
For the values of modeling parameters shown in Table 1 , processor utilization is:
The estimated values agree quite well with the values shown in Figures 2 – 4 .
4. Sharedmemory bus–based systems
An outline of a sharedmemory busbased multiprocessor is shown in
Figure 5
. The system is composed of
A timed Petri net model of a sharedmemory busbased multiprocessor is shown in
Figure 6
. It contains models of
When a processor
Place
Figure 7 shows the utilization of processors and the bus as functions of the number of processors in a sharedmemory system for the values of modeling parameters shown in Table 1 .
In Figure 7 , the bus utilization approaches 100% for about five processors. Moreover, the degradation of processors’ performance due to increasing waiting times for accessing the bus (and shared memory) is well illustrated in Figure 7 .
The average waiting time (in processor cycles) of accessing shared memory (i.e., the times from requesting memory access in place
Figure 8 shows that the waiting times increase almost linearly with the number of processors when this number is greater than 5, i.e., when the bus (and shared memory) is utilized in almost 100%.
If the value of the secondlevel cache hit rate,
By a similar argument, reduced hit rate at the firstlevel cache,
The number of processors for which the bus is used to almost 100% can be estimated by the following formula:
For the case shown in Figure 7 , this number is:
For the case shown in Figure 9 , this value is 8.2.
There are several ways in which the number of processors can be increased in busbased systems without sacrificing the processors’ performance. The simplest approach is to introduce the second bus which allows two concurrent accesses to shared memory, provided the memory is dual port (it allows two concurrent accesses). Figure 11 outlines a dual bus sharedmemory system.
Petri net model of a dual bus system is the same as in
Figure 6
; the only difference is the initial marking of place
The results shown in Figure 12 are very similar to those shown in Figure 9 . The second bus in a sharedmemory system allows to perform two concurrent accesses to the shared memory. From a single processor’s performance point of view, this effect is similar to reducing two times the number of accesses to the shared memory, and this is the effect of reducing two times the miss rate for the secondlevel cache (which is shown in Figure 9 ).
If dual port memory cannot be used, the shared memory can be split into several independent modules which can be accessed concurrently by the processors provided that the bus is also split into sections associated with each module, with processors accessing all such sections, as shown in Figure 13 for four independent memory modules. The main difference between a multibus system ( Figure 11 ) and a system with split bus is in accessing the shared memory; in a multiple bus system, the whole shared memory is accessed by each bus, while in a split bus system ( Figure 13 ), each section of the bus accesses only one memory module. In the system shown in Figure 13 , up to four (the number of memory modules) memory accesses can be performed concurrently, but if two (or more) processors request access to the same memory module, the requests are served one after another.
Petri net model of a system outlined in Figure 13 is shown in Figure 14 where only two processors and two memory modules are detailed.
In
Figure 14
, there is a freechoice place
If memory module is not available when it is requested, the memory access is delayed (in
It is possible that more than one processor becomes waiting for the same memory module. The selection of the processor which will get access first is random with the same probability assigned to all waiting processors. In real systems, there is usually some priority scheme that determines the order in which the waiting processors access the bus. Such a priority scheme could easily be modeled if it is needed (for example, for studying the starvation effect which can be created when the system is overloaded).
In Figure 14 , the selection of memory modules is random, with the same probabilities for all modules. If this policy is not realistic, a different memory accessing policy can be implemented, for example, the probabilities of accessing consecutive memory modules by each processor could be used to model sequential processing of large arrays, and so on.
Figure 15 shows the utilization of processors and busses as functions of the number of processors in a system outlined in Figure 13 .
In Figure 15 , even for 20 processors, the average utilization of the bus is close to 80%, so the system can accommodate more processors.
5. Concluding remarks
The chapter uses timed Petri nets to model sharedmemory busbased architectures at the level of instruction execution to study the effects of modeling parameters on the performance of the system. The models are rather simple with straightforward representation of modeling parameters.
Performance results presented in this chapter have been obtained by the simulation of developed Petri net models. However, the model shown in
Figure 7
has only 10 states, so its analytical solution (for different values of modeling parameters) can be easily obtained and compared with simulation results to verify their accuracy.
Table 2
shows such a comparison of processor utilization for several combinations of parameters


Simulated results  Analytical results 

0.8  0.8  0.3341  0.3333 
0.8  0.9  0.3846  0.3846 
0.9  0.8  0.4763  0.4762 
0.9  0.9  0.5255  0.5263 
The models of multiprocessor systems are usually composed of many copies of the same submodel of a processor and possibly other elements of the system. Colored Petri nets [17] can significantly simplify such models by eliminating copies of similar subsystems. Analysis of colored Petri nets is, however, much more complex than that of ordinary Petri nets.
Finally, it should be noted that the performance of reallife multiprocessor systems very rarely can be described by a set of parameters that remain stable for any significant period of time. The basic parameters like the hit rates depend upon the executed programs as well as their data, and can change very quickly in a significant way. Consequently, the characteristics presented in this chapter can only be used as some insight into the complex behavior of multiprocessor systems.
References
 1.
Hamilton S. Taking Moore’s law into the next century. IEEE Computer. 1999; 32 (1):4348  2.
Patterson DA, Hennessy JL. Computer Architecture—A Quantitative Approach. 4th ed. San Mateo, CA: Morgan Kaufmann; 2006  3.
Wilkes MV. The memory gap and the future of highperformance memories. ACM Architecture News. 2001; 29 (1):27  4.
Mutlu O, Stark J, Wilkerson C, Patt YN. Runahead execution: An effective alternative to large instruction windows. IEEE Micro. 2003; 23 (6):2025  5.
Murata T. Petri nets: Properties, analysis and applications. Proceedings of IEEE. 1989; 77 (4):541580  6.
Reisig W. Petri nets—An introduction (EATCS Monographs on Theoretical Computer Science. Vol. 4). New York, NY: SpringerVerlag; 1985  7.
PopovaZeugmann L. Time and Petri Nets. Berlin, Heidelberg: SpringerVerlag; 2013  8.
Zuberek WM. Timed petri nets—Definitions, properties and applications. Microelectronics and Reliability (Special Issue on Petri Nets and Related Graph Models). 1991; 31 (4):627644  9.
Allen AA. Probability, Statistics and Queueing Theory with Computer Science Applications. 2nd ed. San Diego, CA: Academic Press; 1991  10.
Jain R. The Art of Computer Systems Performance Analysis. New York, NY: Wiley Interscience; 1991  11.
Pooch UW, Wall JA. Discrete Event Simulation. Boca Raton, FL: CRC Press; 1993  12.
Zuberek WM. Dtimed petri nets and modelling of timeouts and protocols. Transactions of the Society for Computer Simulation. 1987; 4 (4):331357  13.
Zuberek WM. Mtimed petri nets, priorities, preemptions, and performance evaluation of systems. In: Advances in Petri Nets 1985 (LNCS 222). Berlin, Heidelberg: SpringerVerlag; 1986. pp. 478498  14.
Ramamoorthy CV, Li HF. Pipeline architecture. ACM Computing Surveys. 1977; 9 (1):61102  15.
Zuberek WM. Modeling and analysis of simultaneous multithreading. In: Proceedings of the 14th International Conference on Analytical and Stochastic Modeling Techniques and Applications (ASMTA07), a part of the 21st European Conference on Modeling and Simulation (ECMS’07); Prague, Czech Republic; 2007. pp. 115120  16.
Suh T, Lee HHS, Blough DM. Integrating cache coherence protocols for heterogeneous multiprocessor system. Part 2. IEEE Micro. 2004; 24 (5):5569  17.
Jensen K, Kristensen LM. Coloured Petri Nets—Modeling and Validation of Concurrent Systems. Berlin, Heidelberg: SpringerVerlag; 2009