Discrete markings of the FSPN model (CBR=3)
Fluid models have been used and investigated in queuing theory . Recently, the concept of fluid models was used in the context of Stochastic Petri Nets, referred to as Fluid Stochastic Petri Nets (FSPNs) [2-6]. In FSPNs, the fluid variables are represented by fluid places, which can hold fluid rather than discrete tokens. Transition firings are determined by both discrete and fluid places, and fluid flow is permitted through the enabled timed transitions in the Petri Net. By associating exponentially distributed or zero firing time with transitions, the differential equations for the underlying stochastic process can be derived. The dynamics of an FSPN are described by a system of first-order hyperbolic partial differential equations (PDEs) combined with initial and boundary equations. The general system of PDEs may be solved by a standard discretization approach. In , the problem of immediate transitions has also been addressed in relation to the fluid levels, by allowing fluid places to be connected to immediate transitions. The transportation of fluid in zero time is described by appropriately chosen boundary conditions.
In a typical multiple-issue processor, instructions flow through pipeline and pass through separate pipeline stages connected by buffers. An open multi-chain queuing network can present this organization, with each stage being a service center with a limited buffer size. Considering a machine that employs multiple execution units capable to execute large number of instructions in parallel, the service and storage requirements of each individual instruction are small compared to the total volume of the instruction stream. Individual instructions may then be regarded as atoms of a fluid flowing through the pipeline. The objective of this approach is to approximate large buffer levels by continuous fluid levels and decrease state-space complexity. Thus, in the first part of this chapter, we employ an analytical model based on FSPNs, derive the state equations for the underlying stochastic process and present performance evaluation results to illustrate its usage in deriving measures of interest. The attempt to capture the dynamic behavior of an ILP processor with aggressive use of prediction techniques and speculative execution is a rare example that demonstrates the usage of this recently introduced formalism in modeling actual systems. Moreover, we take into consideration numerical transient analysis and present numerical solution of a FSPN with more than three fluid places. Both the application of finite-difference approximations for the partial derivatives [7,8], as well as the discrete-event simulation of the proposed FSPN model [9,10], allow for the evaluation of a number of performance measures and lead to numerous conclusions regarding the performance impact of predictions and speculative execution with varying parameters of both the microarchitecture and the operational environment. The numerical solution makes possible the probabilistic analysis of the dynamic behavior, whereas the advantage of the discrete-event simulation is the much faster generation of performance evaluation results. Since the modeling framework is implementation-independent, it can be used to estimate the performance potential of branch and value prediction, as well as to assess the operational environment influence on the performance of ILP processors with much more aggressive, wider instruction issue.
Another challenging task in the application of FSPNs is the modeling and performance analysis of Peer-to-Peer (P2P) live video streaming systems. Web locations offering live video content increasingly attract more and more visitors, which, if the system is based on the client/server architecture, leads to sustainability issues when clients rise above the upload capabilities of the streaming servers. Since IP Multicast failed to satisfy the requirements of an affordable, large scale live video streaming, in the last decade the science community intensively works in the field of P2P networking technologies for live video broadcast. P2P live video streaming is a relatively new paradigm that aims for streaming live video content to a large number of clients with low cost. Even though many such applications already exist, these systems are still in their early stages and prior to creation of such a system it is necessary to analyze performance via representative model that provides significant insight into the system’s behavior. Nevertheless, modeling and performance analysis of P2P live video streaming systems is a complex combinatorial problem, which requires addressing many properties and issues of such systems. Inspired by several research articles concerned with modeling of their behavior, in the second part of this chapter, we present how FSPNs can be used for modeling and performance analysis of a mesh based P2P live video streaming system. We adopt fluid flow to represent bits as atoms of a fluid that travel through fluid pipes (network infrastructure). If we represent peers with discrete tokens and video bits as fluid, then we have numerous possibilities to evaluate the performance of the system. The developed model is simple and quite flexible, providing performance evaluation of a system that accounts for a number of system features, such as: network topology, peer churn, scalability, peer average group size, peer upload bandwidth heterogeneity, video buffering, control traffic overhead and admission control for lesser contributing peers. In this particular case, discrete-event simulation (DES) is carried out using SimPy (http://simpy.sourceforge.net) – an object-oriented, process-based discrete-event simulation language based on standard Python (http://www.python.org), which provides the modeler with components of a simulation model including processes (for active components) and resources (for passive components) and provides monitor variables to aid in gathering statistics.
2. Part A: Fluid atoms in ILP processor pipelines
Most of the recent microprocessor architectures assume sequential programs as input and use a parallel execution model. The hardware is expected to extract the parallelism out of the instruction stream at run-time. The efficiency is highly dependent on both the hardware mechanisms and the program characteristics, i.e. the instruction-level parallelism (ILP) the programs exhibit. Many ILP processors speculatively execute control-dependent instructions before resolving the branch outcome. They rely upon branch prediction in order to tolerate the effect of control dependences. A branch predictor uses the current fetch address to predict whether a branch will be fetched in the current cycle, whether that branch will be taken or not, and what the target address of the branch is. The predictor uses this information to decide where to fetch from in the next cycle. Since the branch execution penalty is only seen if the branch was mispredicted, a highly accurate branch predictor is a very important mechanism for reducing the branch penalty in a high performance ILP processor.
A variety of branch prediction schemes have been explored  – they range between fixed, static displacement-based, static with profiling, and various dynamic schemes, like Branch History Table with n-bit counters, Branch Target Address Cache, Branch Target Instruction Cache, mixed, two-level adaptive, hybrid, etc. Some research studies have also proposed concepts to implement high-bandwidth instruction fetch engines based on multiple branch prediction. Such concepts include trace cache  or the more conventional multiple-block fetching .
On the other hand, given that a majority of static instructions exhibit very little variations in values that they produce/consume during the course of a program’s execution , data dependences can be eliminated at run-time by predicting the outcome values of instructions (value prediction) and by executing the true data dependent instructions. In general, the outcome value of an instruction can be assigned to registers, memory locations, condition codes, etc. The execution is speculative, as it is not assured that instructions were fed with correct input values. Since the correctness of execution must be maintained, speculatively executed instructions retire only if the prediction was proven correct – otherwise, they are discarded.
Several architectures have been proposed for value prediction  – last value predictor, stride predictor, context predictors and hybrid approaches in order to get good accuracy over a set of programs due to the different data value locality characteristics that can be exploited only by different schemes. Based on instruction type, value prediction is sometimes identified as prediction of the outcome of arithmetic instructions only, and the prediction of the outcome of memory access instructions as a different class, referred to as memory prediction.
2.1. Model definition
A model should always have a form that is more concise and closer to a designer’s intuition about what a model should look like. In the case of a processor pipeline, the simplest description would be that the instructions flow and pass through separate pipeline stages connected by buffers. Control dependences stall the inflow of useful instructions (fluid) into the pipeline, whereas true data dependences decrease the aperture of the pipeline and the outflow rate. The buffer levels always vary and affect both the inflow and outflow rates. Branch prediction techniques tend to eliminate stalls in the inflow, while value prediction techniques help keeping outflow rate as high as possible.
Representing the dynamic behavior of systems subject to randomness or variability is the main concern of stochastic modeling. It relies on the use of random variables and their distribution functions . We assume that the distribution of the time between two consecutive occurrences of branch instructions in the fluid stream is exponential with rate λ. The rate depends on the instruction fetch bandwidth, as well as the program’s average basic block size. Branches vary widely in their dynamic behavior, and predictors that work well on one type of branches may not work as well on others. A set of hard-to-predict branches that comprise a fundamental limit to traditional branch predictors can always be identified . We assume that there are two classes: easy-to-predict and hard-to-predict branches, and the expected branch prediction accuracy is higher for the first, and lower for the second. The probabilities to classify a branch as either easy- or hard-to-predict depend on the program characteristics.
When the instruction fetch rate is low, a significant portion of data dependences span across instructions that are fetched consecutively . As a result, these instructions (a producer-consumer pair) will eventually initiate their execution in a sequential manner. In this case, the prediction becomes useless due to the availability of the consumer’s input value. Hence, in each cycle, an important factor is the number of instructions that consume results of simultaneously initiated producer instructions. We assume that the distribution of the time between two consecutive occurrences of consuming instructions in the fluid stream is exponential with rate μ. The rate depends on the number of instructions that simultaneously initiate execution at a functional unit, as well as the program’s average dynamic instruction distance. We assume that there are two classes of consuming instructions: (1) instructions that consume easy-to-predict values and (2) instructions that consume hard-to-predict values. The expected value prediction accuracy is higher for the first and lower for the second. The probability to classify a value as either easy- or hard-to-predict depends on the program’s characteristics, similarly to the branch classification.
The set of programs executed on the machine represent the input space. Programs with different characteristics are executed randomly and independently according to the operational profile. We partition the input space by grouping programs that exhibit as nearly as possible homogenous behavior into program classes. Since there are a finite number of partitions (classes), the upper limits of λ and μ, as well as the probabilities to classify a branch/value as either easy- or hard-to-predict are considered to be discrete random variables and have different values for different program classes.
2.2. FSPN representation
We assume that the pipeline is organized in four stages: Fetch, Decode/Issue, Execute and Commit. Fluid places PIC, PIB, PRS/LSQ, PROB, PRR, PEX and PREG, depicted by means of two concentric circles (Figure 1), represent buffers between pipeline stages: instruction cache, instruction buffer, reservation stations and load/store queue, reorder buffer, rename registers, instructions that have completed execution and architectural registers. Five of them have limited capacities: ZIBmax, ZRS/LSQmax, ZRRmax, ZROBmax and ZEXmax. We prohibit both an overflow and a negative level in a fluid place. The fluid place PTIME has the function of an hourglass: it is constantly filled at rate 1 up to the level 1 and then flushed out, which corresponds to the machine clock cycle. ZTIME(t) denotes the fluid level in PTIME at time t. Fluid arcs are drawn as double arrows to suggest a pipe. Flow rates are piecewise constant, i.e. take different values at the beginning of each cycle and are limited by the fetch/issue width of the machine (W). Rates depend on the vector of fluid levels and change when TCLOCK fires and the fluid in PTIME is flushed out. The flush out arc is drawn as thick single arrow.
Let be the fluid levels at the beginning of the clock cycle, i.e. , , , , and , where and .
A high-bandwidth instruction fetch mechanism fetches up to W instructions per cycle and places them in the instruction buffer. The fetch rate is given by:
In the case of a branch misprediction, the fetch unit is effectively stalled and no useful instructions are added to the buffer. Instruction cache misses are ignored.
Instruction issue tries to send W instructions to the appropriate reservation stations or the load/store queue on every clock cycle. Rename registers are allocated to hold the results of the instructions and reorder buffer entries are allocated to ensure in-order completion. Among the instructions that initiate execution in the same cycle, speculatively executed consuming instructions are forced to retain their reservation stations. As a result, the issue rate is given by:
Up to W instructions are in execution at the same time. With the assumptions that functional units are always available and out-of-order execution is allowed, the instructions initiate and complete execution with rate:
During the execute stage, the instructions first check to see if their source operands are available (predicted or computed). For simplicity, we assume that the execution latency of each instruction is a single cycle. Instructions execute and forward their own results back to subsequent instructions that might be waiting for them (no result forwarding delay). Every reference to memory is present in the first-level cache. With the last assumption, we eliminate the effect of the memory hierarchy.
The instructions that have completed execution are ready to move to the last stage. Up to W instructions may commit per cycle. The results in the rename registers are written into the register file and the rename registers and reorder buffer entries freed. Hence:
In order to capture the relative occurrence frequencies of different program classes, we introduce a set of weighted immediate transitions in the Petri Net. Each program class is assigned an immediate transition with weight . The operational profile is a set of weights. The probability of firing the immediate transition represents the probability of occurrence of a class i program, given by:
A token in PSTART denotes that a new execution is about to begin. The process of firing one of the immediate transitions randomly chooses a program from one of the classes. The firing of transition puts i tokens in place PCLASS, which identify the class. At the same time instant, tokens occur in places PFETCH and PINITIATE, while the fluid place PIC is filled with fluid with volume Vi equivalent to the total number of useful instructions (program volume).
Firing of exponential transition TBRANCH corresponds to a branch instruction occurrence. The parameter changes at the beginning of each clock cycle and formally depends on both the number of tokens in PCLASS and the fetch rate:
where is its upper limit for a given program class i at maximum fetch rate (rFETCH=W). The branch is classified as easy-to-predict with probability pBEP, or hard-to-predict with probability 1-pBEP. In either case, it is correctly predicted with probability pBEPC (pBHPC), or mispredicted with probability 1-pBEPC (1-pBHPC). These probabilities are included in the FSPN model as weights assigned to immediate transitions TBEP, TBHP, TBEPC, TBHPC, TBEPMIS and TBHPMIS, respectively. This approach is known as synthetic branch prediction. Branch mispredictions stall the fluid inflow for as many cycles as necessary to resolve the branch (CBR tokens in place PBMIS). Usually, a branch is not resolved until its execution stage (CBR=3). With several consecutive firings of TCLOCK, these tokens are consumed one at a time and moved to PRESOLVED. As soon as the branch is resolved, transition TCONTINUE fires, a token appears in place PFETCH and the inflow resumes.
Similar to this, firing of exponential transition TCONSUMER corresponds to the occurrence of a consuming instruction among the instructions that initiated execution. The parameter changes at the beginning of each clock cycle and formally depends on both the number of tokens in PCLASS and the initiation rate:
where is its upper limit for a given program class i when maximum possible number of instructions simultaneously initiate execution (rINITIATE=W). The consumed value is classified as easy-to-predict with probability pVEP, or hard-to-predict with probability 1-pVEP. In either case, it is correctly predicted with probability pVEPC (pVHPC), or mispredicted with probability 1-pVEPC (1-pVHPC). These probabilities are included in the FSPN model as weights assigned to immediate transitions TVEP, TVHP, TVEPC, TVHPC, TVEPMIS and TVHPMIS, respectively. Whenever a misprediction occurs (token in place PVMIS), the consuming instruction has to be rescheduled for execution. The firing of immediate transition TREEXECUTE causes transportation of fluid in zero time. Fluid jumps have deterministic height of 1 (one instruction) and take place when the fluid levels in PRS and PEX satisfy the condition . Jumps that would go beyond the boundaries cannot be carried out. The arcs connecting fluid places and immediate transitions are drawn as thick single arrows. The fluid flow terminates at the end of the cycle when all the fluid places except PREG are empty and TEND fires.
2.3. Derivation of state equations
When executing a class i program, the nodes mi of the reachability graph (Figure 2) consist of all the tangible discrete markings, as well as those in which the enabling of immediate transitions depends on fluid levels and cannot be eliminated, since they are of mixed tangible/vanishing type (Table 1). It is important to note that the number of discrete markings does not depend on the machine width in any way.
|Number of tokens||#PFETCH||#PBMIS||#PINITIATE||#PVMIS|
A vector of fluid levels supplements discrete markings. It gives rise to a stochastic process in continuous time with continuous state space. The total amount of fluid contained in PIC, PIB, PRS/LSQ, PEX and PREG is always equal to Vi, and the amount of fluid contained in PRR (as well as PROB) is equal to the total amount of fluid in PRS/LSQ and PEX. Therefore, only the fluid levels ZIB(t), ZRS/LSQ(t), ZEX(t) and ZREG(t) are identified as four supplementary variables (components of the fluid vector ), which provide a full description of each state.
The instantaneous rates at which fluid builds in each fluid place are collected in diagonal matrices:
The matrix of transition rates of exponential transitions causing the state changes is:
Let be an abbreviation for the volume density that is the transient probability of being in discrete marking mi at time t, with fluid levels in an infinitesimal environment around . If , according to [4-6] the evolution of the process is described by a coupled system of nine partial differential equations in four continuous dimensions plus time:
If is the vector of initial fluid levels, the initial conditions are:
Since fluid jumps shift probability mass along the continuous axes (in addition to discrete state change), firing of transition TREEXECUTE at time t can be seen as a jump to another location in the four-dimensional hypercube defined by the components of the fluid vector. It can be described by the following boundary conditions:
The firing of transitions TCLOCK and TCOUNT at time t0 causes switching from one discrete marking to another. Therefore:
Similarly, the firing of transition TEND when all the fluid places except PREG are empty, causes switching from any discrete marking to 9i:
The probability mass conservation law is used as a normalization condition. It corresponds to the condition that the sum of all state probabilities must equal one. Since no particle can pass beyond barriers, the sum of integrals of the volume densities over the definition range evaluates to one:
Let be the state of the discrete marking process at time t. The probabilities of the discrete markings are obtained by integrating volume densities:
The fluid levels at the beginning of each clock cycle are computed as follows:
Finally, the flow rates and the parameters and are computed as indicated by Eqs. 1-4, 6 and 7, respectively.
2.4. Performance measures
Let τ be a random variable representing the time to absorb into . The distribution of the execution time of a program with volume Vi is:
with mean execution time:
Consequently, the sustained number of instructions per cycle (IPC) is given by:
When the input space is partitioned, IPC is the ratio between the average volume and the average execution time of all the programs of different classes, as indicated by the operational profile:
The sum of probabilities of the discrete markings that do not carry a token in place PFETCH gives the probability of a stall in the instruction fetch unit at time t:
Because of the discrete nature of pipelining, additional attention should be given to the probability that no useful instructions will be added to the instruction buffer in the cycle beginning at time t0 (complete stall in the instruction fetch unit that can lead to an effectively empty instruction buffer) due to branch misprediction. It can be obtained by summing up the probabilities of the discrete markings that still carry one or more tokens in place PBMIS immediately after firing of TCLOCK:
In addition, the execution efficiency is introduced, taken as a ratio between the number of useful instructions and the total number of instructions executed during the course of a program’s execution:
where is the average initiation rate.
2.5. Numerical experiments and performance evaluation results
We have used finite difference approximations to replace the derivatives that appear in the PDEs: forward difference approximation for the time derivative and first-order upwind differencing for the space derivatives, in order to improve the stability of the method [7,8]:
The explicit discretization of the right-hand-side coupling term allows the equations for each discrete state to be solved separately before going on to the next time step. The discretization is carried out on a hypercube of size with step size in direction of zIB, zRS, zEX and zREG, and step size in time. The computational complexity for the solution is
since for each of time steps we must increment each solution value in the four-dimensional grid for eight of the nine discrete markings. The storage requirements of the algorithm are at least
since for eight of nine discrete markings we must store a four-dimensional grid of floating-point numbers (solutions at successive time steps can be overwritten).
Unless indicated otherwise, , and , where n=4 is the number of continuous dimensions. With these capacities of fluid places, virtually all name dependences and structural conflicts are eliminated. Step size is varied between (coarser grid, usually when the prediction accuracy is high) and (finer grid, usually when the prediction accuracy is low).
Considering a low-volume program (Vi=50 instructions) executed on a four-wide machine (W=4), we investigate:
The influence of branch prediction accuracy on the distribution of the program’s execution time, when value prediction is not involved (Figure 3a),
The influence of branch prediction accuracy on the probability of a complete stall in the instruction fetch unit (Figure 3b), and
The influence of value prediction accuracy on the distribution of the program’s execution time, when perfect branch prediction is involved (Figure 4).
It is indisputably clear that both branch and value prediction accuracy improvements reduce the mean execution time of a program and increase performance. As an illustration, the size of the shaded area in Figure 3a is equal to the mean execution time when perfect branch prediction is involved, and IPC is computed as indicated by Eq. (20). In addition, looking at Figure 3b one can see that the probability of a complete stall in the instruction fetch unit, which can lead to an empty instruction buffer in the subsequent cycle, decreases with branch prediction accuracy improvement. As a result, both the utilization of the processor and the size of dynamic scheduling window increase as branch prediction accuracy increases.
The correctness of the discretization method is verified by comparing the numerical transient analysis results with the results obtained by discrete-event simulation, which is specifically implemented for this model and not for a general FSPN. The types of events that need to be scheduled in the event queue are either transition firings or the hitting of a threshold dependent on fluid levels. We have used a Unif[0,1] pseudo-random number generator to generate samples from the respective cumulative distribution functions and determine transition firing times via inversion of the cdf (“Golden Rule for Sampling”). Discrete-event simulation alone has been used to obtain performance evaluation results for wide machines with much more aggressive instruction issue (W>>1).
It takes quite some effort to tune the numerical algorithm parameters appropriately, so that a sufficiently accurate approximation is obtained. Various discretization and convergence errors may cancel each other, so that sometimes a solution obtained on a coarse grid may agree better with the discrete-event simulation than a solution on a finer grid – which, by definition, should be more accurate. In Figures 5a-b, a comparison of discretization results and results obtained using discrete-event simulation for a four-wide machine is given. Furthermore, Figure 5c shows the performance of several machines with realistic predictors executing a program with an average basic block size of eight instructions, given that about 25% of the instructions that initiate execution in the same clock cycle are consuming instructions.
Since the conservation of probability mass is enforced, the differences between the numerical transient analysis and the discrete-event simulation results arise only from the improper distribution of the probability mass over the solution domain. Due to the inherent dissipation error of the first-order accurate numerical methods, the solution at successive time steps is more or less dissipated to neighboring grid nodes. The phenomenon is emphasized when the number of discrete state changes is increased owing to the larger number of mispredictions.
The results are satisfactorily close to each other, especially when the prediction accuracy is high, which is common in recent architectures. Yet, we believe that much work is still uncompleted and many questions are still open for further research in the field of development of strategies for reducing the amount of memory needed to represent the volume densities, as well as efficient discretization schemes for numerical transient analysis of general FSPNs. Alternating direction implicit (ADI) methods  in order to save memory, and parallelization of the numerical algorithms to reduce runtime have been suggested.
In the remainder of this part, we do not distinguish the numerical transient analysis results from the results given by discrete-event simulation of the FSPN model. Initially we analyze the efficiency of branch prediction by varying branch prediction accuracy. Value prediction is not involved at all. The speedup is computed by dividing the IPC achieved with certain branch prediction accuracy over the IPC achieved without branch prediction (). For the moment, the input space is not partitioned and program volume is set to Vi=106 instructions.
It is observed that, looking at Figures 6a-b, branch prediction curves have an exponential shape. Therefore, building branch predictors that improve the accuracy just a little bit may be reflected in a significant performance increase. The impact of a given increment in accuracy is more noticeable when it experiences a slight improvement beyond the 90%. Another conclusion drawn from these figures is that one can benefit most from branch prediction in programs with relatively short basic blocks (high ) and which do not suffer excessively from true data dependences (low ). When the ratio is high, true data dependences overshadow control dependences. As a result, the amount of ILP that is expected without value prediction in a machine with extremely aggressive instruction issue is far below the maximum possible value, even with perfect branch prediction. Value prediction has to be involved to go beyond the limits imposed by true data dependences.
Next, we analyze the efficiency of value prediction by varying value prediction accuracy (Figures 7a-b). The speedup is computed by dividing the IPC achieved with certain value prediction accuracy over the IPC achieved without value prediction (). With perfect branch prediction, it seems clear that the value prediction curves have a linear behavior. Therefore, it is worthwhile to build a predictor that significantly improves the accuracy. Only a small improvement on the value predictor accuracy has a little impact on ILP processor performance, regardless of the accuracy range. Another conclusion drawn from these figures is that the effect of value prediction is more noticeable when a significant number of instructions consume results of simultaneously initiated producer-instructions during execution (high ), i.e. when true data dependences have a much higher influence on the program’s total execution time.
Branch prediction has a very important influence on the benefits of value prediction. One can see that the performance increase is less significant when branch prediction is realistic. Because mispredicted branches limit the number of useful instructions that enter the instruction window, the processor is able to provide almost the same number of instructions to leave the instruction window, even with lower value prediction accuracy. As a result, graphs tend to flatten out. Correct value predictions can only be exploited when the fetch rate is quite high, i.e. when mispredicted branches are infrequent. Branch misprediction becomes a more significant performance limitation with wider processors (Figure 7b).
In addition, we investigate branch and value prediction efficiency with varying machine width (Figures 8a-c). The speedup in this case is computed by dividing the IPC achieved in a machine over the IPC achieved in a scalar counterpart (W=1, μi=0). The speedup due to branch prediction is obviously higher in wider machines. With perfect branch prediction, the speedup unconditionally increases with the machine width. For a given width, the speedup is higher when there are a smaller number of consuming instructions (low ). With realistic branch prediction, there is a threshold effect on the machine width: below the threshold the speedup increases with the machine width, whereas above the threshold the speedup is close to a limit – machine width is by far larger than the average number of instructions provided by the fetch unit. The threshold decreases with increasing the number of mispredicted branches.
The maximum additional speedup that value prediction can provide is computed by dividing the IPC achieved with perfect value prediction over the IPC achieved without value prediction (Figures 9a-c). With perfect branch prediction, some true data dependences can always be eliminated, regardless of the machine width. Actually, the maximum additional speedup is predetermined by the ratio . However, with realistic branch prediction, the additional speedup diminishes when the machine width is above a threshold value. It happens earlier when there are a smaller number of consuming instructions and/or a larger number of mispredicted branches. In either case, the number of independent instructions examined for simultaneous execution is sufficiently higher than the number of fetched instructions that enter the instruction window. Again, branch prediction becomes more important with wider processors.
The rate at which consuming instructions occur depends on the initiation rate. Therefore, we also investigate the value prediction efficiency with varying instruction window size (varying capacity of the fluid place ) (Figures 10a-b). The speedup is computed in the same way as in the previous instance. It increases with the instruction window size in [W, 2W], but the increase is more moderate when there are a smaller number of consuming instructions (low ) and/or branch prediction is not perfect. As the instruction window grows larger, performance without value prediction saturates, as does the performance with perfect value prediction. The upper limit value emerges from the fact that in each cycle up to W new instructions may enter the fluid place and up to W consuming instructions may be forced to retain their reservation stations. One should also note that the speedup for W>>1 and realistic branch prediction is almost constant with increasing instruction window size. Two scenarios arise in this case: (1) the number of consuming instructions is large – the speedup is constant but still noticeable as there are not enough independent instructions in the window without value prediction, and (2) the number of consuming instructions is small – there is no speedup as there are enough independent instructions in the window even without value prediction, regardless of the window size. Again, the main reasons for this behavior are the small number of consuming instructions and the large number of mispredicted branches.
In order to investigate the operational environment influence, we partitioned the input space into several program classes, each of them with at least one different aspect: branch rate, consuming instruction rate, probability to classify a branch as easy-to-predict or probability to classify a value as easy-to-predict. We concluded that the set of programs executed on a machine have a considerable influence on the perceived IPC. Since the term program may be interchangeably used with the term instruction stream, these observations give good reason for the analysis of the time varying behavior of programs in order to find simulation points in applications to achieve results representative of the program as a whole. From a user perspective, a machine with more sophisticated prediction mechanisms will not always lead to a higher perceived performance as compared to a machine with more modest prediction mechanisms but more favorable operational profile [20,21].
3. Part B: fluid atoms in P2P streaming networks
In P2P live streaming systems every user (peer) maintains connections with other peers and forms an application level logical network on top of the physical network. The video stream is divided in small pieces called chunks which are streamed from the source to the peers and every peer acts as a client as well as a server, forwarding the received video chunks to the next peer after some short buffering. The peers are usually organized in one of the two basic types of logical topologies: tree or mesh. Hence, the tree topology forms structured network of a single tree as in , or multiple multicast trees as in , while mesh topology is unstructured and does not form any firm logical construction, but organizes peers in swarming or gossiping-like environments, as in . To make greater use of their complementary strengths, some protocols use combination of these two aspects, forming a hybrid network topology, such as . Hence, members are free to join or leave the system at their own free will (churn), which leads to a certain user driven dynamics resulting in constant disruptions of the streaming data delivery. This peer churn has high influence on the quality of offered services, especially for P2P systems that offer live video broadcast. Also, P2P network members are heterogeneous in their upload bandwidth capabilities and provide quite different contribution to the overall system performance. Efficient construction of P2P live video streaming network requires data latency reduction as much as possible, in order to disseminate the content in a live manner. This latency is firstly introduced by network infrastructure latency presented as a sum of serialization latency, propagation delay, router processing delay and router queuing delay. The second type of delay is the initial start-up delay required for filling the peer’s buffer prior to the start of the video play. The buffer is used for short term storage of video chunks which often arrive out of sequence in manner of order and/or time, and resolving this latency issue requires careful buffer modeling and management. Thus, buffer size requires precise dimensioning because even though larger buffers offer better sequence order or latency compensation, they introduce larger video playback delay. Contrary, small buffers offer smaller playback delay, but the system becomes more error prone. Also, since the connections between participating peers in these P2P logical networks are maintained by the means of control messages exchange, the buffer content (buffer map) is incorporated in these control messages and it is used for missing chunks acquisition. Chunk requesting and forwarding is controlled by a chunk scheduling algorithm, which is responsible for on-time chunk acquisition and delivery among the neighboring peers, which is usually based on the available content and bandwidth of the neighboring peers. A lot of research activities are strictly focused on designing better chunk scheduling algorithms [26,27] that present the great importance of carefully composed scheduling algorithm which can significantly compensate for churn or bandwidth/latency disruptions. Beside the basic coding schemes, in latest years an increasing number of P2P live streaming protocols use Scalable Video Coding (SVC) technologies. SVC is an emerging paradigm where the video stream is split in several sub-streams and each sub-stream contributes to one or more characteristics of video content in terms of temporal, spatial and SNR/quality scalability. Mainly, two different concepts of SVC are in greater use: Layered Video Coding (LVC) where the video stream is split in several dependently decodable sub-stream called Layers, and Multiple Description Coding (MDC) where the video stream is split in several independently decodable sub-stream called Descriptions. A number of P2P video streaming models use LVC  or MDC [23,28] and report promising results.
3.1. Model definition
As a base for our modeling we use the work in [29,30], where several important terms are defined. One of them is the maximum achievable rate that can be streamed to any individual peer at a given time, which is presented in Eq. (26).
rMAX – maximum achievable streaming rate
rSERVER – upload rate of the server
rPi – upload rate of the ith peer
n – number of participating peers.
Clearly, rMAX is a function of rSERVER, rPi and n, i.e. rMAX = φ(rSERVER, rP, n). This maximum achievable rate to a single peer is further referred to as the fluid function, or φ(). The second important definition is of the term Universal Streaming. Universal Streaming refers to the streaming situations when each participating peer receives the video stream with bitrate no less than the video rate, and in  it is achievable if and only if:
where rVIDEO is the rate of the streamed video content.
Hence, the performance measures of the system are easily obtained by calculating the Probability for Universal Streaming (PUS).
Now, we add one more parameter to the previously mentioned to fulfill the requirements of our model. We define the stream function ψ() which, instead of the maximum, represents the actual streaming rate to any individual peer at any given time, and ψ() satisfies:
3.2. FSPN representation
The FSPN representation of the P2P live streaming system model that accounts for: network topology, peer churn, scalability, peer average group size, peer upload bandwidth heterogeneity, video buffering, control traffic overhead and admission control for lesser contributing peers, is given in Figure 11. We assume asymmetric network settings where peers have infinite download bandwidths, while stream delay, peer selection strategies and chunk size are not taken into account.
Similar as in  we assume two types of peers: high contributing peers (HP) with upload bitrate higher than the video rate, and low contributing peers (LP) with upload bitrate lower than the video rate. Different from the fluid function φ(), beside the dependency to rSERVER, rP, and n, the stream function ψ() depends on the level of fluid in the unique fluid place PB as well:
where ZB represents the level of fluid in PB.
The FSPN model in Figure 11 comprises two main parts: the discrete part and the continuous (fluid) part of the net. Single line circles represent discrete places that can contain discrete tokens. The tokens, which represent peers, move via single line arcs to and out of the discrete places. Fluid arcs, through which fluid is pumped, are drawn as double lines to suggest a pipe. The fluid is pumped through fluid arcs and is streamed to and out of the unique fluid place PB which represents a single peer buffer. The rectangles represent timed transitions with exponentially distributed firing times, and the thin short lines are immediate transitions. Peer arrival, in general, is described as a stochastic process with exponentially distributed interarrival times, with mean 1/λ, where λ represents the arrival rate. We make another assumption that after joining the system peers’ sojourn times (T) are also exponentially distributed. Clearly, since each peer is immediately served after joining the system, we have a queuing network model with an infinite number of servers and exponentially distributed joining and leaving rates. Hence, the mean service time T is equal to 1/μ, which transferred to FSPN notation leads to the definition of the departure rate as µ multiplied by the number of peers that are concurrently being served. Now, λ represents peer arrival in general, but the different types of peers do not share the same occurrence probability (pH and pL). This occurrence distribution is defined by immediate transitions TAHP and TALP and their weight functions pH and pL. Hence, HP arrive with rate λH = pH * λ, and LP arrive with rate λL = pL * λ, where pH + pL = 1. In this particular case pH = pL = 0.5, but, if needed, these occurrence probabilities can be altered. This way the model with peer churn is represented by two independent M/M/∞ Poisson processes, one for each of the different types of peers. The average number of peers that are concurrently being served defines the size of the system as a whole (SSIZE) and is derived from the queuing theory:
TA is a timed transition with exponentially distributed firing times that represents peer arrival, and upon firing (with rate λ) puts a token in PCS. PCS (representing the control server) checks the type of the token and immediately forwards it to one of the discrete places PHP or QLP (PLP). Places PHP and PLP accommodate the different types of peers in our P2P live streaming system model. QLP on the other hand, represents queuing station for the LP, which is connected to the place PLP with the immediate transition TI that is guarded by a Guard function G.
The Guard function G is a Boolean function whose values are based on a given condition. The expression of a given condition is the argument of the Guard function and serves as enabling condition for the transition TI. If the argument of G evaluates to true, TI is enabled. Otherwise, if the argument of G evaluates to false, TI is disabled. For the model that does not take admission control into account G is always enabled, but when we want to evaluate the performance of a system that incorporates admission control we set the argument of the guard function as in Eq. (31):
Transitions TDHP and TDLP are enabled only when there are tokens in discrete places PHP and PLP. These are marking dependent transitions, which, when enabled, have exponentially distributed firing times with rate μ #PHP and μ #PLP respectively, where #PHP and #PLP represent the number of tokens in each discrete place. Upon firing they take one token out of the discrete place to which they are connected.
Concerning the fluid part of the model, we represent bits as atoms of fluid that travel through fluid pipes (network infrastructure) with rate dependent on the system’s state (marking). Beside the stream function as a derivative of several parameters, we identify three separate fluid flows (streams) that travel through the network with different bitrates. The main video stream represents the video data that is streamed from the source to the peers that we refer to as the video rate (rVIDEO). The second stream is the play stream which is the stream at which each peer plays the streamed video data, referred to as the play rate (rPLAY), and the third stream is the control traffic overhead, referred to as control rate (rCONTROL), which describes the exchange of control messages needed for the logical network construction and management. As mentioned earlier, transitions TDHP and TDLP are enabled only when there are tokens in discrete places PHP and PLP respectively and beside the fact that they consume tokens when firing, when enabled, they constantly pump fluid through the fluid arc to the fluid place. Flow rates of ψ() are piecewise constant and depend on the number of tokens in the discrete places and their upload capabilities. Continuous place PB represents single peer’s buffer, which is constantly filled with rate ψ() and drained with rate (rPLAY + rCONTROL). ZB is the amount of fluid in PB and ZBMAX is the buffer’s maximum capacity. Transition TSERVER represents the functioning of the server, which is always enabled (except when there are no tokens in any of the discrete places) and constantly pumps fluid toward the continuous place PB with maximum upload rate of rSERVER. Transition TPLAY represents the video play rate, which is also always enabled and constantly drains fluid from the continuous place PB, with rate rPLAY. TCONTROL, that represents the exchange of control messages among neighboring peers, is the third transition that is always enabled, has the priority over TPLAY, and constantly drains fluid from PB with rate rCONTROL. For further analysis we derived the rate of rCONTROL from  where it is declared that it linearly depends on the number of peers in the neighborhood, and for rVIDEO of 128 kbps, the protocol overhead is 2% for a group of 64 users, which leads to a bitrate of 2.56 kbps. Thus, for our performance analysis we assume that peers are organized in neighborhoods with an average size of 60 members where rCONTROL is 2.4 kbps. For the sake of convenience and chart plotting we also define the average upload rate of the participating peers as rAVERAGE, which is given in Eq. (32):
Since in our model of a P2P live video streaming system we take in consideration rCONTROL as well, Universal Streaming is achievable if and only if:
3.3. Discrete-event simulation
The FSPN model of a P2P live video streaming system accurately describes the behavior of the system, but suffers from state space explosion and therefore analytic/numeric solution is infeasible. Hence, we provide a solution to the presented model using process-based discrete-event simulation (DES) language. The simulations are performed using SimPy which is a DES package based on standard Python programming language. It is quite simple, but yet extremely powerful DES package that provides the modeler with simulation processes that can be used for active model components (such as customers, messages or vehicles), and resource facilities (resources, levels and stores) which are used for passive simulation components that form limited capacity congestion points like servers, counters, and tunnels. SimPy also provides monitor variables that help in gathering statistics, and the random variables are provided by the standard Python random module.
Now, although we deal with vast state space, we provide the solution by identifying four distinct cases of state types. These cases of state types are combination of states of the discrete part and the continuous part of the FSPN, and are presented in Table 2. Hence, the rates at which fluid builds up in the fluid place PB, in each of these four cases, can be described with linear differential equations that are given in Eq. (34).
|case 1||if||ZB = ZBMAX and φ() ≥ rVIDEO+rCONTROL||then||ψ() = rVIDEO+rCONTROL and|
rPLAY = rVIDEO
|case 2||if||0 <ZB ≤ ZBMAX and φ() < rVIDEO+rCONTROL||then||ψ() = φ() and rPLAY = rVIDEO|
|case 3||if||0 ≤ ZBUF<ZBUFMAX and φ() ≥ rVIDEO+rCONTROL||then||ψ() = φ() and rPLAY = rVIDEO|
|case 4||if||ZBUF = 0 and φ() < rVIDEO+rCONTROL||then||ψ() = φ() and rPLAY < rVIDEO|
In the next few lines (Table 3a-d) we briefly present the definitions of some of the the FSPN model components in SimPy syntax. Algorithm 1 presents the definition of SimPy processes for the different types of tokens. All the FSPN places (as well as rPLAY) are defined as resource facilities of the type “Level” and are given in Algorithm 2. The formulation of a “Level” for representing the rPLAY was enforced by the requirement for monitoring and modifying the rPLAY at each instant of time. Algorithm 3 presents TA combined with TAHP and TALP where it is defined as two separate SimPy Proceses that independently generate two different types of token processes. Algorithm 4 represents the definition of transitions TDHP and TDLP.
|Definition of HP token|
|class tokenHP (Process):|
def join (self):
yield put, self, Php, 1
|Definition of LP token with integrated Guard for TI|
|class tokenLP (Process):|
def join (self):
if (Php.amount + Plp.amount) == 0:
yield put, self, Plp, 1
yield put, self, Qlp, 1
return (((Rserver + (Plp.amount + 1)*Rlp + Php.amount*Rhp)/((Plp.amount + 1) + Php.amount)) "/>= Rvideo + Rcontrol)
yield waituntil, self, GuardOFF
yield get, self, Qlp, 1
yield put, self, Plp, 1
yield passivate, self
|(a) Algorithm 1: Definition of tokens in SimPy|
|Pcs = Level (name = ‘Control Server’, initialBuffered=0, monitored = True)|
|Php = Level (name = ‘Discrete Place Php’, initialBuffered=0, monitored = True)|
|Plp = Level (name = ‘Discrete Place Plp’, initialBuffered=0, monitored = True)|
|Qlp = Level (name = ‘Queuing Station’, initialBuffered=0, monitored = True)|
|Pb = Level (name = ‘Peer Buffer’, initialBuffered=Zbmax, monitored = True)|
|Pplay = Level (name = ‘Play rate’, initialBuffered=Rvideo, monitored = True)|
|(b) Algorithm 2: Definition of FSPN places in SimPy|
|Transition TA combined with TAHP||class HPgenerator (Process):|
def generate (self, end):
while now() < end:
yield peerHP = tokenHP ()
activate (peerHP, peerHP.join())
yield hold, self, expovariate (pH * Lamda)
|Transition TA combined with TALP||class LPgenerator (Process):|
def generate (self, end):
while now() < end:
peerLP = tokenLP ()
activate (peerLP, peerLP.join())
yield hold, self, expovariate (pL * Lamda)
|(c) Algorithm 3: Definition of transition TA combined with TAHP and TALP|
|Transition TDHP||class HPdeparture (Process):|
def depart (self, end):
return (Php.amount "/> 0)
yield waituntil, self, Condition
yield hold, self, expovariate (Mi * Php.amount)
yield get, self, Php, 1
|Transition TDLP||class LPdeparture (Process):|
def depart (self, end):
return (Plp.amount "/> 0)
yield waituntil, self, Condition
yield hold, self, expovariate (Mi * Plp.amount)
yield get, self, Plp, 1
|(d) Algorithm 4: Definition of transitions TDHP and TDLP|
For simulating the fluid part of the FSPN, time discretization is applied where a SimPy “Stream Processs” checks the system state in small time intervals and consequently makes changes to the level of fluid in the fluid place PB and rPLAY according to Eq. (34). For gathering the results we use the frequency theory of probability where the probability for Universal Streaming is computed as the amount of time the system spends in Universal Streaming mode against the total simulation time.
3.4. Performance evaluation results and analysis
In this section we make a brief evaluation of three system sizes:
Small system with an average of 100 concurrent participating peers
Medium system with an average of 500 concurrent participating peers
Large system with an average of 5000 concurrent participating peers
The simulation scenario is as follows: rSERVER = (rVIDEO + rCONTROL)*3, upload bandwidth of HP is rHP = 700kbps, upload bandwidth of LP is rLP = 100kbps, and sojourn time T = 45 minutes. For gathering the performance results we vary the rVIDEO and we plot the PUS against the quotient of rVIDEO/rAVERAGE, where rAVERAGE for this case is 400kbps. For calculating the PUS of a single scenario we calculate the average of 150 simulations for the small system, and an average of 75 simulations for the medium and large system, while each single simulation simulates 10 hours of system activity. Initial conditions are: ZB0 = ZBMAX, where ZB0 is the amount of fluid in PB in time t0 = 0 and all discrete places are empty.
Comparison of performance of small and medium systems with and without AC is presented in Figure 12, from which an obvious conclusion is inferred that AC almost does not have any direct influence on the performance, but considering the incremented initial delay, incorporation of AC would only have a negative effect on the quality of offered services. Regarding the performance of the system in respect to system scaling, presented in Figure 13, it is obvious that scaling causes increase in performance, but only to a certain point after which performance steeply decreases. Fortunately, the performance decrease is in the region of under capacity which is usually avoided, so it can be concluded that larger systems perform better than smaller ones. Finally, Figure 14 shows that optimal buffer size is about 30 seconds of stored material, and larger buffers only slightly improve performance, but introduce quite large play out delay which leads to diminished quality of user experience.
In the first part of this chapter, we have introduced an implementation-independent analytical modeling approach to evaluate the performance impact of branch and value prediction in modern ILP processors, by varying several parameters of both the microarchitecture and the operational environment, like branch and value prediction accuracy, machine width, instruction window size and operational profile. The proposed analytical model is based on recently introduced Fluid Stochastic Petri Nets (FSPNs). We have also presented performance evaluation results in order to illustrate its usage in deriving measures of interest. Since the equations characterizing the evolution of FSPNs are a coupled system of partial differential equations, the numerical transient analysis poses some interesting challenges. Because of a mixed, discrete and continuous state space, another important avenue for the solution is the discrete-event simulation of the FSPN model. We believe that our stochastic modeling framework reveals considerable potential for further research in this area, needed to better understand speculation techniques in ILP processors and their performance potential under different scenarios.
In the second part of this chapter, we have shown how the FSPN formalism can be used to model P2P live video streaming systems. We have also presented a simulation solution method using process-based discrete-event simulation language whenever analytic/numeric solution becomes infeasible, that is usually a result of state space explosion. We managed to create a model that accounts for numerous features of such complex systems including: network topology, peer churn, scalability, average size of peers’ neighborhoods, peer upload bandwidth heterogeneity and video buffering, among which control traffic overhead and admission control for lesser contributing peers are introduced for the first time.