Simultaneous multithreading – modeling parameters and their typical values
1. Introduction
In modern computer systems, the performance of the whole system is increasingly often limited by the performance of its memory subsystem [1]. Due to continuous progress in manufacturing technologies, the performance of processors has been doubling every 18 months (the so-called Moore's law [2]), but the performance of memory chips has been improving only by 10% per year [1], creating a "performance gap" between the processor's performance and the required memory bandwidth [3]. More detailed studies have shown that the number of processor cycles required to access main memory doubles approximately every six years [4]. In effect, the performance of applications increasingly depends on the performance of the system's memory hierarchy, and it is not unusual that processors spend as much as 60% of their time waiting for the completion of memory operations [4].
Memory hierarchies, and in particular multi-level cache memories, have been introduced to reduce the effective latency of memory accesses [5]. Cache memories provide efficient access to information when the information is available at lower levels of the memory hierarchy; occasionally, however, long-latency memory operations are needed to transfer the information from the higher levels of the memory hierarchy to the lower ones. Extensive research has focused on reducing and tolerating these large memory access latencies.
Techniques which tolerate long–latency memory accesses include out–of–order execution of instructions and instruction–level multithreading. The idea of out–of–order execution [1] is to execute, instead of waiting for the completion of a long–latency operation, instructions which (logically) follow the long–latency one, but which do not depend upon the result of this long–latency operation. Since out–of–order execution exploits instruction–level concurrency in the executed sequential instruction stream, it conveniently maintains code–base compatibility [6]. In effect, the instruction stream is dynamically decomposed into micro-threads, which are scheduled and synchronized at no cost in terms of executing additional instructions. Although this is desirable, speedups using out–of–order execution on superscalar pipelines are not so impressive, and it is difficult to obtain a speedup greater than 2 using 4 or 8-way superscalar issue [7]. Moreover, in modern processors, memory latencies are so long that out–of–order processors require very large instruction windows to tolerate them.
Although ultra-wide out-of-order superscalar processors were predicted as the architecture of one-billion-transistor chips, with a single 16 or 32-wide-issue processing core and huge branch predictors to sustain good instruction level parallelism, the industry has not been moving toward the wide-issue superscalar model [8]. Design complexity and power efficiency direct the industry toward narrow-issue, high-frequency cores and multithreaded processors. According to [6]: "Clearly something is very wrong with the out-of-order approach to concurrency if this extravagant consumption of on-chip resources is only providing a practical limit on speedup of about 2."
Instruction–level multithreading [9], [10], [1] is a technique of tolerating long–latency memory accesses by switching to another thread (if it is available for execution) rather than waiting for the completion of the long–latency operation. If different threads are associated with different sets of processor registers, switching from one thread to another (called “context switching”) can be done very efficiently [11], in one or just a few processor cycles.
In simultaneous multithreading [12], [6] several threads can issue instructions at the same time. If a processor contains several functional units or it contains more than one instruction execution pipeline, the instructions can be issued simultaneously; if there is only one pipeline, only one instruction can be issued in each processor cycle, but the (simultaneous) threads complement each other in the sense that whenever one thread cannot issue an instruction (because of pipeline stalls or context switching), an instruction is issued from another thread, eliminating ‘empty’ instruction slots and increasing the overall performance of the processor.
Simultaneous multithreading combines hardware features of wide-issue superscalar processors and multithreaded processors [12]. From superscalar processors it inherits the ability to issue multiple instructions in each cycle; from multithreaded processors it takes hardware state for several threads. The result is a processor that can issue multiple instructions from multiple threads in each processor cycle, achieving better performance for a variety of workloads.
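The slot-filling idea can be illustrated with a small sketch. The stall probability below is a hypothetical value chosen only for illustration (it is not a parameter of the model developed later in this chapter): whenever at least one of the simultaneous threads can issue, the cycle is not wasted.

```python
import random

random.seed(1)

def filled_slot_fraction(n_threads, cycles=20_000, p_stall=0.3):
    """Fraction of single-pipeline issue slots that are filled when each
    thread independently stalls with probability p_stall in a cycle,
    but the slot is used as long as at least one thread can issue.
    (Toy illustration of slot filling, not the model of Section 3.)"""
    filled = 0
    for _ in range(cycles):
        if any(random.random() >= p_stall for _ in range(n_threads)):
            filled += 1
    return filled / cycles
```

With one thread, roughly 70% of the slots are filled; adding a second simultaneous thread raises this to about 91%, because a slot is now lost only when both threads stall in the same cycle.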
The main objective of this work is to study the performance of simultaneously multithreaded processors in order to determine how effective simultaneous multithreading can be; in particular, whether simultaneous multithreading can overcome the speedup "barrier" of out-of-order execution (equal to about 2 [13]). A timed Petri net [14] model of multithreaded processors at the instruction execution level is developed, and performance results for this model are obtained by event-driven simulation of the developed model. Since the model is rather simple, the simulation results are verified (with respect to accuracy) by state-space-based performance analysis (for those combinations of modeling parameters for which the state space remains reasonably small).
Section 2 recalls basic concepts of timed Petri nets which are used in this study. A model of simultaneous multithreading, used for performance exploration, is presented in Section 3. Section 4 discusses the results obtained by event-driven simulation of the model introduced in Section 3. Section 5 contains concluding remarks including a short comparison of simulation and analytical results.
2. Timed Petri nets
A marked place/transition Petri net is typically defined [15], [16] as a pair M = (N, m0), where N = (P, T, A) is a net structure composed of a set of places P, a set of transitions T, and a set of directed arcs A connecting places with transitions and transitions with places, and where m0 is the initial marking function which assigns nonnegative numbers of tokens to the places of the net.

A place is shared if it is an input place of more than one transition; places which are not shared are conflict-free. A shared place is free-choice if all transitions sharing it have identical sets of input places, so either all these transitions are enabled (by a marking) or none of them is.

All places which are not conflict-free and not free-choice are conflict places. Transitions sharing conflict places are (directly or indirectly) potentially in conflict (i.e., whether they are in conflict or not depends upon the marking function; for different marking functions the sets of transitions which are in conflict can be different). All transitions which are potentially in conflict constitute a conflict class. All conflict classes are disjoint. It is assumed that conflicts are resolved by random choices of occurrences among the conflicting transitions; these random choices are independent in different conflict classes.
In timed nets [14], occurrence times are associated with transitions, and transition occurrences are real–time events, i.e., tokens are removed from input places at the beginning of the occurrence period, and they are deposited to the output places at the end of this period. All occurrences of enabled transitions are initiated in the same instants of time in which the transitions become enabled (although some enabled transitions may not initiate their occurrences). If, during the occurrence period of a transition, the transition becomes enabled again, a new, independent occurrence can be initiated, which will overlap with the other occurrence(s). There is no limit on the number of simultaneous occurrences of the same transition (sometimes this is called infinite occurrence semantics). Similarly, if a transition is enabled “several times” (i.e., it remains enabled after initiating an occurrence), it may start several independent occurrences in the same time instant.
More formally, a timed Petri net is a triple T = (M, c, f), where M is a marked place/transition net, c is a choice function which assigns probabilities (of conflict resolution) to transitions in conflict classes, and f is a timing function which assigns occurrence times to the transitions of the net.
The occurrence times of transitions can be either deterministic or stochastic (i.e., described by some probability distribution function); in the first case, the corresponding timed nets are referred to as D–timed nets [18], in the second, for the (negative) exponential distribution of firing times, the nets are called M–timed nets (Markovian nets [17]). In both cases, the concepts of state and state transitions have been formally defined and used in the derivation of different performance characteristics of the model [14]. Only D–timed Petri nets are used in this paper.
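The occurrence rule described above can be sketched as a small event-driven simulator for D-timed nets. This is an illustrative sketch only: conflicts, if any, are resolved in a fixed order rather than by the independent random choices described earlier, and the net used in the example (a single thread alternating between execution and a memory access) is hypothetical.

```python
import heapq

def simulate(transitions, marking, horizon):
    """Event-driven simulation of a D-timed Petri net with
    'infinite occurrence semantics': tokens leave the input places when
    an occurrence starts, are deposited to the output places when it
    ends, and occurrences of the same transition may overlap.
    `transitions` maps a name to (inputs, outputs, occurrence_time),
    where inputs/outputs are dicts of place -> token count.
    Returns, for each transition, the total time spent occurring
    within the horizon."""
    marking = dict(marking)
    events = []                                 # (completion time, seq, outputs)
    busy = {name: 0.0 for name in transitions}
    now, seq = 0.0, 0
    while now < horizon:
        # initiate every occurrence that can start at time `now`
        progress = True
        while progress:
            progress = False
            for name, (inputs, outputs, d) in transitions.items():
                if all(marking.get(p, 0) >= n for p, n in inputs.items()):
                    for p, n in inputs.items():
                        marking[p] -= n
                    heapq.heappush(events, (now + d, seq, outputs))
                    seq += 1
                    busy[name] += min(d, horizon - now)
                    progress = True
        if not events:
            break
        # advance to the next completion and deposit its tokens
        now, _, outputs = heapq.heappop(events)
        for p, n in outputs.items():
            marking[p] = marking.get(p, 0) + n
    return busy

# a single thread alternating between 10 cycles of execution ("run")
# and a 5-cycle memory access ("mem"):
busy = simulate(
    {"run": ({"ready": 1}, {"waiting": 1}, 10.0),
     "mem": ({"waiting": 1}, {"ready": 1}, 5.0)},
    {"ready": 1},
    horizon=1500.0)
# "run" occurs 10 of every 15 cycles, i.e., 2/3 of the time
```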
The occurrence times of some transitions may be equal to zero, which means that the occurrences are instantaneous; all such transitions are called immediate (while the others are called timed).
Performance analysis of net models can be based on their behavior (i.e., the set of reachable states) or on the structure of the net; the former is called reachability (or state-space) analysis, and the latter structural analysis.
3. Models of simultaneous multithreading
A timed Petri net model of a simple multithreaded processor is shown in Fig.1 (as usual, timed transitions are represented by solid bars, and immediate ones by thin bars).
For simplicity, Fig.1 shows only one level of memory; this simplification is removed later in this section.
If a long-latency operation is detected in the issued instruction, the issuing thread becomes suspended, the processor performs a context switch, and instruction issuing continues from another ready thread; the suspended thread becomes ready again when its memory access completes.
The choice probability associated with detecting a long-latency operation determines the average thread runlength, ℓt, i.e., the average number of instructions issued between consecutive long-latency operations.
The number of memory ports, i.e., the number of simultaneous accesses to memory, is controlled by the initial marking of the place representing available memory ports.
In a similar way, the number of simultaneous threads (or instruction issue units) is controlled by the initial marking of the place representing available instruction issue units.
Memory hierarchy can be incorporated into the model shown in Fig.1 by refining the representation of memory. In particular, levels of memory hierarchy can be introduced by replacing the subnet modeling memory with a subnet in which each access is directed, with appropriate probabilities, to one of the levels of the hierarchy, each level with its own access time (Fig.2).
The effects of memory hierarchy can be compared with a uniform, non–hierarchical memory by selecting the parameters in such a way that the average access time of the hierarchical model (Fig.2) is equal to the access time of the non–hierarchical model (Fig.1).
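The equivalence can be checked directly: with hit ratio p for the first level, the average access time of a two-level memory is p*t1 + (1-p)*t2. The numbers below are hypothetical, chosen only so that the average matches a uniform memory with a 5-cycle access time.

```python
def average_access_time(p_hit, t_hit, t_miss):
    """Average access time (in cycles) of a two-level memory in which a
    fraction p_hit of accesses is served by the first level (t_hit
    cycles) and the rest by the second level (t_miss cycles)."""
    return p_hit * t_hit + (1.0 - p_hit) * t_miss

# hypothetical split: 90% of accesses take 1 cycle, 10% take 41 cycles,
# averaging to 5 cycles (the same as a uniform 5-cycle memory)
tm_equiv = average_access_time(0.9, 1.0, 41.0)
```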
Processors with different numbers of instruction issue units and instruction execution pipelines can be described by a pair of numbers, the first number denoting the number of instruction issue units, and the second – the number of instruction execution pipelines. In this sense a 3-2 processor is a (multithreaded) processor with 3 instruction issue units and 2 instruction execution pipelines.
For convenience, all temporal properties are expressed in processor cycles, so the occurrence times of timed transitions are (multiples of) the processor cycle.
The main modeling parameters and their typical values are shown in Table 1.
4. Performance exploration
The model developed in the previous section is evaluated for different combinations of modeling parameters. Performance results are obtained by event-driven simulation of timed Petri net models.
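As a rough cross-check of the simulation approach, the 1-1 processor can be approximated by a simple round-robin, discrete-time sketch: each thread issues ℓt instructions (with the stall probabilities of Table 1), then performs a tm-cycle memory access while the processor, after a context switch, continues with the next ready thread. This is a simplified stand-in for the net model, not the model itself, and the default parameter values are those of Table 1.

```python
import random

def utilization_1_1(n_threads, lt=10, tm=5, tcs=1, ps1=0.2, ps2=0.1,
                    runlengths=20_000, seed=42):
    """Fraction of processor cycles spent issuing instructions in a
    simplified 1-1 processor: after issuing lt instructions a thread
    starts a tm-cycle memory access and, after a tcs-cycle context
    switch, the processor continues with the next ready thread.
    (Round-robin approximation of the Petri net model.)"""
    rng = random.Random(seed)
    ready = [0.0] * n_threads   # completion times of pending memory accesses
    now = issued = 0.0
    for _ in range(runlengths):
        i = min(range(n_threads), key=ready.__getitem__)
        now = max(now, ready[i]) + tcs        # wait (if needed) and switch
        for _ in range(lt):                   # issue one runlength
            r = rng.random()                  # 1 cycle plus possible stall
            now += 1 + (1 if r < ps1 else 2 if r < ps1 + ps2 else 0)
        issued += lt
        ready[i] = now + tm                   # start the memory access
    return issued / now
```

As in Fig. 3, the utilization grows with the number of available threads and then saturates once enough threads are available to hide the memory latency.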
The utilization of the processor and memory, as a function of the number of available threads, for a 1-1 processor (i.e., a processor with a single instruction issue unit and a single instruction execution pipeline) is shown in Fig. 3.
Table 1. Main modeling parameters and their typical values.

  symbol   parameter                            typical values
  ------   ----------------------------------   --------------
           number of available threads          1, ..., 10
           number of execution pipelines        1, 2, ...
           number of simultaneous threads       1, 2, 3, ...
  ℓt       thread runlength                     10
  tm       average memory access time           5
  tcs      context switching time               1, 3
  ps1      prob. of one-cycle pipeline stall    0.2
  ps2      prob. of two-cycle pipeline stall    0.1
The value of the processor utilization for a single available thread is low because each long-latency memory access leaves the processor idle; as the number of available threads increases, the utilization grows quickly and then levels off as the processor approaches saturation.
The utilization of the processor can be improved by introducing a second (simultaneous) thread which issues its instructions in the slots unused by the first thread. Fig.4 shows the utilization of the processor and memory for a 2-1 processor, i.e., a processor with two (simultaneous) threads (or two instruction issue units) and a single pipeline. The utilization of the processor is improved by almost 50% and is within a few percent of its upper bound (of 100%).
The influence of pipeline stalls (probabilities ps1 and ps2) on the performance of the processor is shown in Fig.5.
Fig. 6 shows the utilizations of the processor and memory when the probabilities of pipeline stalls, ps1 and ps2, are reduced.
A more realistic model of memory, which captures the idea of a two-level hierarchy, is shown in Fig.2. In order to compare the results of this model with Fig.3 and Fig.4, the parameters of the two-level memory are chosen in such a way that its average access time is equal to the memory access time tm of the one-level model in Fig.1.
The results for a 1-1 processor with a two–level memory are shown in Fig.7, and for a 2-1 processor in Fig.8.
The results in Fig.7 and Fig.8 are practically the same as in Fig.3 and Fig.4; the multiple levels of memory hierarchy apparently have no significant effect on the performance results, so the remaining results are shown for (equivalent) one-level memory models.
The effects of simultaneous multithreading in a more complex processor, e.g., a processor with two instruction issue units and two instruction execution pipelines, i.e., a 2-2 processor, can be obtained in a very similar way. The utilization of the processor (shown as the sum of the utilizations of both pipelines, with values ranging from 0 to 2) is shown in Fig.9.
When another instruction issue unit is added, the utilization increases by about 40%, as shown in Fig.10.
Further increase of the number of simultaneous threads (in a processor with 2 pipelines) can provide only small improvements of the performance because the utilizations of both the processor and the memory are quite close to their limits. The performance of the system can be improved by increasing the number of pipelines, but then the memory becomes the system bottleneck, so its performance also needs to be improved, for example, by introducing dual ports (which allow two accesses to be handled at the same time). The performance of a 5-3 processor with a dual-port memory is shown in Fig.11 (the utilization of the processor is the sum of the utilizations of its 3 pipelines, so it ranges from 0 to 3).
Fig.11 shows that, for 3 pipelines and 5 simultaneous threads, more than 6 available threads provide a speedup that is almost equal to 3.
System bottlenecks can be identified by comparing service demands for different components of the system (in this case, the memory and the pipelines); the component with the maximum service demand is the bottleneck because it is the first component to reach its utilization limit and to prevent any further increase of the overall performance. For a single runlength (of all simultaneous threads), the total service demand for the memory grows with the number of simultaneous threads and the average memory access time (and is halved by a dual-port memory), while the demand for each pipeline grows with the thread runlength increased by the expected pipeline stall cycles.
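The service-demand comparison can be sketched numerically. The formulas below are back-of-the-envelope approximations (not taken verbatim from the model): per runlength of all ns simultaneous threads, the memory serves ns accesses of tm cycles each (shared by its ports), while each pipeline executes its share of ns*ℓt instructions, each costing one cycle plus the expected stall cycles.

```python
def service_demands(ns, lt=10, tm=5, ps1=0.2, ps2=0.1,
                    n_pipelines=1, mem_ports=1):
    """Approximate per-runlength service demands (in cycles) for the
    memory and for one pipeline; the larger value marks the bottleneck.
    Defaults follow the typical values of Table 1."""
    memory = ns * tm / mem_ports
    # each instruction: 1 cycle + expected stall cycles (ps1*1 + ps2*2)
    pipeline = ns * lt * (1 + ps1 + 2 * ps2) / n_pipelines
    return memory, pipeline

# 5-3 processor: with a single-port memory the memory demand (25.0
# cycles) exceeds the per-pipeline demand (about 23.3 cycles), so the
# memory is the bottleneck; a dual-port memory (12.5 cycles) shifts
# the bottleneck back to the pipelines.
```

This matches the observation above: the dual ports are what allow the 5-3 processor to approach a speedup of 3.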
Simultaneous multithreading is quite flexible with respect to context switching times because the (simultaneous) threads fill the instruction issuing slots which would normally remain empty during context switching. Fig.12 shows the utilization of the processor and memory in a 1-1 processor with the context switching time tcs increased from 1 to 3 processor cycles.
Fig.13 shows the utilization of the processor and memory in a 2-1 processor, also with tcs = 3.
5. Concluding remarks
The simultaneous multithreading discussed in this paper is used to increase the performance of processors by tolerating long-latency operations. Since long-latency operations are playing an increasingly important role in modern computer systems, so is simultaneous multithreading. Its implementation, as well as the required hardware resources, is much simpler than in the case of the out-of-order approach, and the resulting speedup scales well with the number of simultaneous threads. The main challenge of simultaneous multithreading is to balance the system by maintaining the right relationship between the number of simultaneous threads and the performance of the memory hierarchy.
All presented results indicate that the number of available threads required for improved performance of the processor is quite small: it is typically greater by 2 or 3 threads than the number of simultaneous threads. A larger number of available threads provides rather insignificant further improvements of the system's performance.
The presented models of multithreaded processors are quite simple, and for small values of the modeling parameters the state space of the model remains small enough for exhaustive, state-space-based performance analysis. The following tables compare, for growing numbers of available threads, the number of states of the model and the processor utilization obtained by simulation with the analytical results.
  number of           number of   processor utilization
  available threads   states      (simulation and analysis)
  1                   11          0.538 / 0.536
  2                   52          0.670 / 0.671
  3                   102         0.684 / 0.685
  4                   152         0.685 / 0.686
  5                   202         0.685 / 0.686
  number of           number of   processor utilization
  available threads   states      (simulation and analysis)
  1                   11          0.538 / 0.536
  2                   80          1.030 / 1.031
  3                   264         1.384 / 1.381
  4                   555         1.568 / 1.568
  5                   951         1.655 / 1.647
The comparisons show that the results obtained by simulation of net models are very similar to the analytical results obtained from the analysis of states and state transitions.
A similar performance analysis of simultaneous multithreading, but using a slightly different model, was presented in [20]. All results presented there are very similar to the results presented in this work, which is an indication that the performance of simultaneously multithreaded systems is insensitive to (at least some) variations of implementation.
It should also be noted that the presented model is oversimplified with respect to the probabilities of pipeline stalls and does not take into account the dependence of stall probabilities on the history of instruction issuing. In fact, the model is "pessimistic" in this regard, and the predicted performance, presented in the paper, is worse than the expected performance of real systems. However, the effects of this simplification are not expected to be significant.
Acknowledgement
The Natural Sciences and Engineering Research Council of Canada partially supported this research through grant RGPIN-8222.
References
1. Patterson D.A., Hennessy J.L.: Computer Architecture: A Quantitative Approach (4th ed.); Morgan Kaufmann, 2006.
2. Hamilton S.: "Taking Moore's law into the next century"; IEEE Computer, 32(1):43-48, 1999.
3. Wilkes M.V.: "The memory gap and the future of high-performance memories"; ACM Computer Architecture News, 29(1):2-7, 2001.
4. Sinharoy B.: "Optimized thread creation for processor multithreading"; The Computer Journal, 40(6):388-400, 1997.
5. Baer J.-L.: Microprocessor Architecture: From Simple Pipelines to Chip Multiprocessors; Cambridge University Press, 2010.
6. Jesshope C.: "Multithreaded microprocessors - evolution or revolution"; in Advances in Computer Systems Architecture (LNCS 2823), pp. 21-45, 2003.
7. Tseng J., Asanovic K.: "Banked multiport register files for high-frequency superscalar microprocessors"; Proc. 30th Int. Annual Symp. on Computer Architecture, San Diego, CA, pp. 62-71, 2003.
8. Burger D., Goodman J.R.: "Billion-transistor architectures: there and back again"; IEEE Computer, 37(3):22-28, 2004.
9. Byrd G.T., Holliday M.A.: "Multithreaded processor architectures"; IEEE Spectrum, 32(8):38-46, 1995.
10. Dennis J.B., Gao G.R.: "Multithreaded architectures: principles, projects, and issues"; in Multithreaded Computer Architecture: A Summary of the State of the Art; Kluwer Academic, pp. 1-72, 1994.
11. Ungerer T., Robic B., Silc J.: "Multithreaded processors"; The Computer Journal, 43(3):320-348, 2002.
12. Eggers S.J., Emer J.S., Levy H.M., Lo J.L., Stamm R.L., Tullsen D.M.: "Simultaneous multithreading: a foundation for next-generation processors"; IEEE Micro, 17(5):12-19, 1997.
13. Mutlu O., Stark J., Wilkerson C., Patt Y.N.: "Runahead execution: an effective alternative to large instruction windows"; IEEE Micro, 23(6):20-25, 2003.
14. Zuberek W.M.: "Timed Petri nets - definitions, properties and applications"; Microelectronics and Reliability (Special Issue on Petri Nets and Related Graph Models), 31(4):627-644, 1991.
15. Murata T.: "Petri nets: properties, analysis and applications"; Proceedings of the IEEE, 77(4):541-580, 1989.
16. Reisig W.: Petri Nets: An Introduction (EATCS Monographs on Theoretical Computer Science 4); Springer-Verlag, 1985.
17. Zuberek W.M.: "M-timed Petri nets, priorities, preemptions, and performance evaluation of systems"; in Advances in Petri Nets 1985 (LNCS 222); Springer-Verlag, pp. 478-498, 1986.
18. Zuberek W.M.: "D-timed Petri nets and modelling of timeouts and protocols"; Transactions of the Society for Computer Simulation, 4(4):331-357, 1987.
19. Zuberek W.M.: "Modeling using timed Petri nets - event-driven simulation"; Technical Report #9602, Department of Computer Science, Memorial University, St. John's, Canada A1B 3X5, 1996.
20. Zuberek W.M.: "Modeling and analysis of simultaneous multithreading"; Proc. 14th Int. Conf. on Analytical and Stochastic Modeling Techniques and Applications (ASMTA-07), part of the 21st European Conference on Modeling and Simulation (ECMS'07), Prague, Czech Republic, pp. 115-120, 2007.