Open access peer-reviewed chapter - ONLINE FIRST

Perspective Chapter: Dynamic Timing Enhanced Computing for Microprocessor and Deep Learning Accelerators

Written By

Jie Gu and Russ Joseph

Submitted: 18 September 2023 Reviewed: 26 September 2023 Published: 03 November 2023

DOI: 10.5772/intechopen.113296


From the Edited Volume

Deep Learning - Recent Findings and Research [Working Title]

Ph.D. Manuel Jesus Domínguez-Morales, Dr. Javier Civit-Masot, Mr. Luis Muñoz-Saavedra and Dr. Robertas Damaševičius


Abstract

Modern microprocessors such as CPUs, GPUs, and the recent deep learning accelerators exhibit significant runtime timing variation, i.e. dynamic timing slack, due to the diverse instructions and programs being executed inside the processor cores. Many studies show that the processors fully occupy their dedicated clock cycle only in a small fraction of the system execution, e.g. 13% of the time. This brings a new opportunity to enhance the performance of processors and accelerators by exploiting the dynamic timing slack based on the instructions being executed inside the programs. This chapter presents recent developments in the “dynamic timing enhanced computing scheme,” where excessive runtime timing margin is utilized to boost computing performance on both microprocessors and deep learning accelerators. Simulation and test chip measurement results are presented to elaborate the benefits of the dynamic timing enhanced computing scheme in terms of performance and energy saving on modern processors. Both hardware design and software techniques using compiler optimization are presented as a holistic solution.

Keywords

  • dynamic timing slack
  • deep neural network accelerator
  • adaptive clocking
  • compiler optimization
  • error detection circuit

1. Introduction

Thanks to Moore’s law, modern integrated circuits (ICs) have enjoyed six decades of exponential growth. While sustaining such historical growth faces increasing challenges, the IC industry has been performing well in recent years with ever-increasing revenue, as shown in Figure 1 [1]. Many large software or system companies, e.g. Google, Microsoft, Meta, and Amazon, have also joined IC development, producing high-end computing chips, including the emerging deep learning accelerators, e.g. Google’s Tensor Processing Unit (TPU) [2].

Figure 1.

Continuous growth of semiconductor revenue in recent years and projected future growth [1].

However, from a technology point of view, the scaling of transistors at each generation leads to tremendous challenges in delivering robust and power-efficient ICs. As supply voltages have dropped to 0.8 V at 12 nm or below and the number of integrated processor cores easily reaches 60 or above, maintaining the power and timing integrity of the chip is much more difficult than in the “good old days,” when the supply voltage was 1.8 V in 180 nm or 1.2 V in 130 nm with only a single-core processor. A significant amount of margin has been added by IC designers to maintain timing and power integrity against all sources of variation. However, the margins added to secure the chip’s timing integrity lead to wasted power and a significant loss of performance, e.g. more than 20% voltage margin or 25% energy loss in commonly used GPU processors [3]. The situation exacerbates at each technology generation. Therefore, more intelligent ways of managing chip timing and power are needed to sustain the growth of IC technology.

This chapter describes a new opportunity for enhancing computing performance by “squeezing” out the unused timing margin. Contrary to the conventional worst-case-based timing management method, this chapter presents a holistic solution ranging from circuits to the software compiler to remove the pessimism of the conventional approach, leading to significantly boosted chip performance and energy efficiency. The techniques have been applied to both conventional processors and deep learning accelerators, showing major improvements with very small overheads. The organization of this chapter is as follows: Section 2 describes the background of timing margin and dynamic timing slack during chip runtime; Section 3 describes circuit techniques to exploit the dynamic timing slack for performance enhancement of modern microprocessors and deep learning accelerators; Section 4 describes high-level compiler optimization techniques to further exploit dynamic timing slack; Section 5 lists a few practical considerations for the presented techniques. Conclusions are drawn in Section 6.


2. Background

This section describes the basics of timing analysis and dynamic timing slack inside modern ASIC chips.

2.1 Power consumption and static timing analysis of modern ASIC chips

Modern Very Large Scale Integrated (VLSI) systems have been governed by two fundamental equations:

\[ P_d = \gamma\, C\, V_{dd}^{2}\, f \tag{1} \]
\[ f = \frac{1}{T} = \frac{I_{ds}}{Q} \approx \frac{\beta\,(V_{dd}-V_{th})^{\alpha}}{C\,V_{dd}} \approx f_0\left(1 + k\,\frac{V_{dd}-V_{dd,nom}}{V_{dd,nom}}\right) \tag{2} \]

where Pd is the dynamic power consumption of the IC, C is the load capacitance of the circuits, Vdd is the supply voltage, Vdd,nom is the nominal supply voltage, f is the operating frequency, Q is the charge that needs to be delivered to the output during the switching of the logic gate, f0 is the operating frequency at the nominal supply voltage, Vth is the threshold voltage of the transistors, γ is the activity factor representing the percentage of time the logic gates toggle, Ids is the transistor current, α is the power-law exponent (close to 1) in the widely used alpha-power model of transistor current, β is a lumped device parameter constant, and k is an empirical linearization parameter valid in the range where the supply voltage is above the threshold voltage. Eq. (2) is highly simplified, based on the discharge of the load capacitance by the current of a transistor in an inverter. Note that during the switching of a logic gate, the transistor current Ids traverses the saturation and linear regions, which have different current expressions. Hence, a precise closed-form relationship between frequency and supply voltage is challenging to obtain. More complex models have been derived previously [4, 5]. For the sake of qualitative estimation, Eq. (2) presents an approximately linear relationship between frequency and supply voltage around the nominal supply voltage. Overall, Eqs. (1) and (2) establish a relationship between power consumption, supply voltage, and operating frequency.
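To make the two relationships concrete, the following is a minimal Python sketch of Eqs. (1) and (2) using the linearized frequency model; all constants (activity factor, load capacitance, k, f0, and the voltage values) are placeholder numbers for illustration, not parameters of any specific technology.

```python
# Minimal sketch of Eqs. (1) and (2): dynamic power and the linearized
# frequency-vs-supply model around the nominal voltage. All constants are
# illustrative placeholders, not values from a specific technology.

def dynamic_power(gamma, c_load, vdd, freq):
    """Eq. (1): P_d = gamma * C * Vdd^2 * f."""
    return gamma * c_load * vdd**2 * freq

def frequency(vdd, vdd_nom=0.8, f0=1.0e9, k=1.0):
    """Eq. (2), linearized: f ~= f0 * (1 + k * (Vdd - Vdd_nom) / Vdd_nom)."""
    return f0 * (1.0 + k * (vdd - vdd_nom) / vdd_nom)

if __name__ == "__main__":
    gamma, c_load = 0.2, 1.0e-9          # activity factor, lumped capacitance (F)
    for vdd in (0.72, 0.80, 0.88):       # -10%, nominal, +10% supply
        f = frequency(vdd)
        p = dynamic_power(gamma, c_load, vdd, f)
        print(f"Vdd={vdd:.2f} V  f={f/1e9:.2f} GHz  Pd={p*1e3:.1f} mW")
```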

At the target speed of the design, timing analysis is used to make sure that the sequential circuits work properly under the timing constraints, i.e. the setup and hold timing constraints, which are represented by Eqs. (3) and (4):

\[ \text{Slack}_{setup} = T_{clk} - \left[\left(D_{clk\_launch} + D_{clk\_q} + D_{pd\_max} - D_{clk\_capture}\right) + T_{setup}\right] \tag{3} \]
\[ \text{Slack}_{hold} = \left(D_{clk\_launch} + D_{clk\_q} + D_{pd\_min} - D_{clk\_capture}\right) - T_{hold} \tag{4} \]

where Slack_setup and Slack_hold are the timing margins of the design, Tclk is the clock period, Dclk_launch and Dclk_capture are the path delays of the launch and capture branches of the clock tree, Dclk_q is the propagation delay of the flip-flop, Dpd_max and Dpd_min are the longest and shortest propagation delays of the combinational circuits, and Tsetup and Thold are the setup and hold times of the flip-flops. Figure 2(a) illustrates the timing analysis in Eqs. (3) and (4), which calculate the timing margin of each logic path in terms of timing slack. A negative slack means a timing violation. By design, all timing paths should have a positive timing slack based on Eqs. (3) and (4). As can be seen, the delay of the combinational circuits within each pipeline stage of the digital circuits is bounded by the clock period along with several other delay contributors, e.g. flip-flop delay, setup/hold times of the flip-flops, etc. A negative setup slack means the clock period is shorter than what is needed, causing a setup timing violation which leads to a potential error in the chip’s operation. Similarly, a negative hold slack will lead to a hold timing violation resulting in logic malfunctions.
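The following is a minimal sketch of the two checks in Eqs. (3) and (4) written as plain Python functions; the delay values in the example are invented, in nanoseconds, and only meant to show how a positive slack indicates a met constraint.

```python
# Minimal sketch of the setup/hold checks in Eqs. (3) and (4).
# All delays are in nanoseconds and purely illustrative.

def setup_slack(t_clk, d_launch, d_capture, d_clk_q, d_pd_max, t_setup):
    """Eq. (3): positive means the setup constraint is met."""
    return t_clk - ((d_launch + d_clk_q + d_pd_max - d_capture) + t_setup)

def hold_slack(d_launch, d_capture, d_clk_q, d_pd_min, t_hold):
    """Eq. (4): positive means the hold constraint is met."""
    return (d_launch + d_clk_q + d_pd_min - d_capture) - t_hold

if __name__ == "__main__":
    t_clk = 2.0                      # target clock period
    s = setup_slack(t_clk, d_launch=0.10, d_capture=0.12,
                    d_clk_q=0.08, d_pd_max=1.60, t_setup=0.05)
    h = hold_slack(d_launch=0.10, d_capture=0.12,
                   d_clk_q=0.08, d_pd_min=0.20, t_hold=0.03)
    print(f"setup slack = {s:+.2f} ns, hold slack = {h:+.2f} ns")
```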

Figure 2.

Timing analysis of digital circuits. (a) Basic diagram of timing paths in a digital IC; (b) various timing margin built into the clock period of a microprocessor.

To guarantee no timing violations, static timing analysis (STA) is performed on the design. STA evaluates all possible timing paths on the chip to ensure that all setup and hold timing constraints are met. The worst timing paths, i.e. the paths with the smallest timing slack, are referred to as the critical paths. The setup timing of the critical paths determines the clock period of the design. This means the clock period of an IC chip needs to be longer than the delay of the critical paths, which represent the worst-case timing condition of the processor. STA is typically performed using EDA tools, e.g. Cadence Innovus or Synopsys Design Compiler. STA needs to be performed across all design corners to guarantee the timing integrity of the chip under all operating conditions. The process of performing STA to ensure no timing violations is referred to as timing closure.

2.2 Timing margin and variation

While Eqs. (1)–(4) are relatively simple, the real chip’s operation is much more complicated due to the uncertainty attributed to process, voltage, and temperature (PVT) variation. To make sure that the chip functions under all operating conditions, i.e. PVT corners, significant timing margins are kept to deal with numerous variations and uncertainties. Figure 2(b) illustrates the margins allocated to sources of variation such as reliability or aging, process variation of threshold voltage, temperature changes, supply voltage droops, clock jitter, etc. The dynamic timing slack (DTS) is the timing variation due to the different computing tasks being performed on the chip, which will be described in later sections. All sources of variation need to be considered and added into the operating delay of the circuits to determine the clock period of the chip. The additional timing margins, e.g. those for supply droops, jitter, and temperature changes, can cause 15–20% of performance or throughput margin in the design [3, 6, 7]. According to Eqs. (1) and (2), a 15% performance (or frequency) loss translates into about a 30% power overhead (if the performance is recovered by raising the supply voltage). Hence, significant performance loss or power waste occurs in modern processors simply to account for the uncertainty of the chip’s operation, e.g. PVT variations and DTS. The issue becomes worse with each generation of the technology due to the further decrease of supply voltage and the increased current consumption of the chips.
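As a quick sanity check of the 30% figure, one can assume the linearized Eq. (2) with k ≈ 1: recovering a 15% frequency shortfall then requires roughly a 15% higher supply, and Eq. (1) gives the resulting dynamic power penalty at the restored frequency. The arithmetic below is illustrative only.

```latex
% Back-of-the-envelope check, assuming k \approx 1 in the linearized Eq. (2):
% recovering a 15% frequency loss needs roughly a 15% higher supply voltage.
\[
  f_0\!\left(1 + k\,\tfrac{\Delta V}{V_{dd,nom}}\right) \approx 1.15\, f_0
  \;\Rightarrow\; \tfrac{\Delta V}{V_{dd,nom}} \approx 0.15 .
\]
% At the restored frequency, Eq. (1) gives the dynamic power overhead:
\[
  \frac{P_d}{P_{d,nom}} = \left(\frac{V_{dd}}{V_{dd,nom}}\right)^{2}
  \approx (1.15)^{2} \approx 1.32,
\]
% i.e. roughly a 30% power overhead spent purely on margin recovery.
```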

2.3 Dynamic timing slack of modern microprocessors

Based on the discussion in the preceding sections, conventional timing analysis and timing closure are “worst-case” bounded. In other words, the clock frequency is determined by the worst timing paths, i.e. the critical paths in the microprocessor. However, the execution of the critical paths is infrequent because multiple operating conditions must be activated simultaneously to exercise them. For instance, within an Arithmetic Logic Unit (ALU), the circuit delay required to execute “SHIFT” or “OR” will be quite different from the delay required for “ADD” or “CMP”. As a result, the real-time delay of the digital circuits depends on which instructions are being executed. This means the critical paths may not be exercised all the time, leading to unused timing margin during the real operation of microprocessors. We refer to this unused runtime timing margin as dynamic timing slack (DTS), which varies cycle by cycle during the operation of digital circuits. Below, we discuss experimental results on DTS in a variety of computing platforms including CPUs, GPUs, and deep neural network accelerators.

Figure 3(a) shows the histogram of the gate-level delay at each pipeline stage of an ARMv7 CPU core in a 45 nm CMOS technology, including the instruction fetch (IF), instruction decode (ID), operand fetch (OF), execution (EX), and memory (MEM) stages. A large range of delay variation, more than 3X, is observed. Figure 3(b) shows the statistics of the delay variation of the instructions of the ARMv7 CPU, which show a different worst-case delay for each instruction. More importantly, we can observe that only a small number of instructions exercise the long critical-path delay, while the majority of instructions can be executed much faster. As a result, the pipelines experience the top 10% of delay, i.e. close to the critical-path delay, only 13% of the time. This observation is understandable because, while the conventional design methodology bounds the clock period by the critical paths, there is no guarantee that each instruction at each clock cycle exercises the critical paths in actual operation. In reality, different instructions sensitize different segments of the logic and observe considerably different delays, an important fact that has been overlooked in conventional hardware design. As a result, the conventional hardware design method, which only focuses on the worst-case critical path, is pessimistic in its setting of the clock frequency.
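A minimal sketch of the kind of trace post-processing behind Figure 3 is shown below: given a cycle-by-cycle (instruction, delay) log from gate-level simulation, it derives the per-instruction worst-case delay and the fraction of cycles that land within the top 10% of the critical-path delay. The trace, opcodes, and delay numbers are synthetic, chosen only to mirror the trend described above.

```python
# Sketch of the trace analysis behind Figure 3: from a cycle-by-cycle
# (instruction, stage delay) log of a gate-level simulation, derive the
# per-instruction worst-case delay and how often execution comes close to
# the critical path. The trace below is synthetic, for illustration only.
from collections import defaultdict

trace = [  # (opcode, observed stage delay in ns)
    ("ADD", 1.85), ("SHIFT", 0.70), ("OR", 0.65), ("CMP", 1.90),
    ("ADD", 1.80), ("SHIFT", 0.72), ("LDR", 1.20), ("OR", 0.60),
]

critical_path = 2.0                       # STA critical-path delay (ns)
near_critical = 0.9 * critical_path       # "top 10% of delay" threshold

worst = defaultdict(float)
for op, delay in trace:
    worst[op] = max(worst[op], delay)

frac = sum(d >= near_critical for _, d in trace) / len(trace)
for op, d in sorted(worst.items(), key=lambda kv: -kv[1]):
    print(f"{op:6s} worst-case delay {d:.2f} ns")
print(f"cycles within 10% of the critical path: {frac:.0%}")
```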

Figure 3.

Dynamic timing slack in an ARMv7 CPU core. (a) Statistics of cycle-to-cycle delay variation of ARMv7 CPU core at different pipeline stages running a benchmark application, i.e. binary search in Mibench; (b) histogram of delay variation by instruction types of the ARMv7 CPU core.

Figure 4 shows a similar study on an open-source GPU core [8, 9] and a more recent deep neural network (DNN) accelerator [10, 11]. Delay varies cycle by cycle depending on the instructions or operands being executed inside the processors or accelerators, with the critical or near-critical timing paths being exercised only a small percentage of the time. For instance, for the DNN accelerator, a 40% DTS margin (i.e. the circuit delay comes close to the clock period during only 60% of the runtime) is observed within the PEs of the accelerator due to the timing dependency on the operands and the high sparsity in the input features of the convolutional neural network [10, 11]. Similar studies performed at the program level revealed the presence of up to 30% program-dependent DTS, or a 20% program-dependent voltage margin, in various modern low-power embedded microprocessors and GPUs [3, 12]. All the above studies highlight the important observation that significant dynamic timing slack exists within modern microprocessors and that conventional STA-based timing control is overly pessimistic. Therefore, it would be highly beneficial to exploit the DTS at runtime to improve the operating speed of microprocessors. The following sections discuss techniques developed in recent years to exploit DTS for enhancing the performance and power efficiency of modern microprocessors and deep learning accelerators.
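The composite-delay effect suggested by Figure 4(b), bottom, can be illustrated with a small Monte-Carlo sketch: when many PEs share one clock, the usable period in each cycle is the maximum delay over all of them, so the chance that at least one PE exercises a near-critical path rises quickly with the PE count. The delay distribution below is synthetic and only meant to show the trend.

```python
# Sketch of why the composite delay over a group of PEs shifts toward the
# worst case: under one shared clock the usable period is the per-cycle
# maximum over all PEs. The per-PE delay distribution is synthetic.
import random

random.seed(0)
CRITICAL = 1.0

def pe_delay():
    # Most MAC cycles are fast; a small fraction lands near the critical path.
    if random.random() < 0.1:
        return random.choice([0.5, 0.6, 0.7, CRITICAL])
    return random.uniform(0.4, 0.7)

for n_pes in (1, 16, 256):
    cycles = [max(pe_delay() for _ in range(n_pes)) for _ in range(2_000)]
    mean = sum(cycles) / len(cycles)
    near = sum(c > 0.9 * CRITICAL for c in cycles) / len(cycles)
    print(f"{n_pes:4d} PEs: mean composite delay {mean:.2f}, "
          f"near-critical cycles {near:.0%}")
```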

Figure 4.

Dynamic timing slack in GPU and deep neural network (DNN) accelerators. (a) Histogram of delay variation at different pipeline stages of an open-source GPU core; (b) histogram of delay variation at each PE (top) and composite delay variation from a group of PEs (bottom) inside DNN accelerators.


3. Circuit techniques for dynamic timing enhanced computing

While DTS exists broadly within microprocessors, it is not trivial to exploit it for performance improvement. This is because the execution of critical paths is difficult to predetermine before a program runs. Even if the critical path is exercised rarely, the clock frequency still needs to be set at the slowest speed dictated by static timing analysis to ensure that no errors are generated by the programs. Adjusting the clock period usually requires a frequency change in the mixed-signal circuits, e.g. the phase-locked loop (PLL), which typically takes microseconds, i.e. thousands of clock cycles. Hence, it is not easy in conventional microprocessors to adjust the clock period within the short window of time required to exploit DTS.

3.1 Error detection circuit techniques

If the clock period is set shorter than required by the timing paths, a timing violation will occur, leading to logic or computational errors at the output of the pipeline stages. If a circuit could detect the occurrence of such errors and correct them through subsequent operations, we would be able to reduce the timing margin allocated for the rarely executed critical paths. Error detection circuits (EDC) are designed for this purpose. The most well-known technique in this category is the “Razor” technique, which utilizes special error detection flip-flops to detect timing errors occurring inside the pipelines at runtime and alert the system to take corrective action [7, 13, 14, 15]. After detection of a timing error, the correction requires a flush of the instructions in the CPU pipeline and a re-issue of the failed instructions. If an error persists, the clock cycle is further extended to remove the timing violations. Figure 5 illustrates the EDC techniques at the circuit level and pipeline level [6, 7]. The use of this detection and correction mechanism allows the microprocessor to operate with a much-reduced timing margin, thus achieving improved performance and tolerance of chip-level variations such as supply droop, temperature, or clock jitter, as well as DTS. It has been shown that 33% energy saving can be achieved with Razor techniques by lowering the supply voltage [7].
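The trade-off behind detect-and-correct operation can be captured with a toy cycle-level model, sketched below: it is not the Razor circuit itself, only the bookkeeping. The clock is set shorter than the worst case, every cycle whose logic delay exceeds the period counts as a detected error, and each error is charged a replay penalty. The delay distribution and the 10-cycle penalty are invented for illustration.

```python
# Toy cycle-level model of error detection and replay in the spirit of
# Razor-style EDC: run with a clock shorter than the worst case, detect the
# cycles whose logic delay exceeds the period, and charge a replay penalty
# for each detected error. Delay distribution and penalties are illustrative.
import random

random.seed(1)
CRITICAL = 2.0                                  # worst-case path delay (ns)

def sample_delay():
    r = random.random()
    if r < 0.01:                                # rare near-critical cycles
        return random.uniform(1.7, CRITICAL)
    if r < 0.13:                                # occasional medium-delay cycles
        return random.uniform(1.0, 1.7)
    return random.uniform(0.6, 1.0)             # common fast cycles

delays = [sample_delay() for _ in range(100_000)]

def runtime(period_ns, replay_penalty_cycles=10):
    errors = sum(d > period_ns for d in delays)
    cycles = len(delays) + errors * replay_penalty_cycles
    return cycles * period_ns, errors

baseline, _ = runtime(CRITICAL)                 # worst-case clocking, no errors
for period in (1.9, 1.7, 1.5):
    total, errs = runtime(period)
    print(f"period {period:.1f} ns: {errs:5d} replays, "
          f"runtime = {total / baseline:.2f}x of the worst-case clock")
```

Running the sketch shows the expected shape of the trade-off: a modestly shortened clock wins despite a few replays, while an overly aggressive clock loses the gain to replay overhead.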

Figure 5.

Error detection techniques for DTS. (a) Two types of error detection flip-flops; (b) use of razor flip-flops in a CPU pipeline.

More recently, EDC techniques have been applied to deep learning accelerators. For instance, error detection was applied to the multiply-accumulate (MAC) units of a DNN accelerator, exploiting the DTS in the MAC computation and leading to 30% energy saving [16]. Similarly, a negative-margin timing error detection circuit, which detects errors on timing paths that are less critical but more frequently activated than the worst-case critical paths, was applied to a DNN accelerator, achieving a 238% frequency gain or 59% power reduction in a recent work [17].

Although Razor-type error detection and correction techniques bring an attractive concept of “self-correction” to the digital systems in modern microprocessors, several prominent challenges are yet to be overcome: (a) as hold timing is sacrificed, it has to be specially handled by delay padding for Razor techniques to work; (b) extra design effort deviating from the conventional digital design flow is needed for Razor techniques, such as the implementation of pipeline correction or flushing operations, min-delay fixes, and the design and insertion of Razor flip-flops. Hence, the integration of error detection and correction into the modern ASIC design flow using conventional EDA tools is a crucial step to enable broad adoption of such techniques [6].

3.2 PVT monitor circuit and DVFS system

It is worth briefly mentioning that many adaptive techniques such as dynamic voltage and frequency scaling (DVFS) and adaptive voltage scaling (AVS) also aim to reduce excessive timing or voltage margin built in at design time when “better than worst-case” conditions are observed during runtime. A significant amount of industry work has been reported utilizing on-chip sensors, e.g. PVT sensors, to detect real-time chip conditions, e.g. process corners or supply droop, and to reduce the voltage or boost the frequency when margins are not needed. PVT tracking or body biasing combined with knowledge of the programs being executed in the microprocessor enables DVFS and AVS operations with reduced timing margin, leading to a 50 mV supply reduction in a CPU [18]. Recently, such DVFS and AVS techniques have been applied to DNN accelerators combined with precision scaling, showing up to 60% energy saving in accelerators [19]. However, DVFS and AVS techniques are only applicable to adjustments of the supply voltage or clock frequency due to static variations, e.g. chips fabricated at fast or slow corners, or semi-static variations, e.g. aging or slow changes of chip temperature. As a result, DVFS and AVS, which operate only at microsecond-to-millisecond time scales, cannot be applied for timing adjustment at the finer resolution where DTS exists. This leaves opportunities to further create DTS-based performance enhancement at finer time scales, as discussed in the next section.

3.3 Exploitation of dynamic timing slack through adaptive clocking

As pointed out in Section 3.1, EDC techniques have challenges due to their intrusive design approach, where special flip-flops need to be inserted and extra hold timing margin needs to be created, leading to difficulties in using them directly within a conventional ASIC design flow with common EDA tools. Given the short-term nature of dynamic timing slack, if one can adjust the clock period at fine resolution, i.e. every single clock cycle, based on knowledge of the programs or instructions being executed in the digital core, one should be able to fully exploit the dynamic timing slack for performance enhancement. Such techniques have been demonstrated recently, with the advantage of easy adoption as there is no change to the ASIC design flow for the ASIC implementation.

3.3.1 Dynamic phase scaling for fast clock period adjustment

PLLs are conventionally used to generate clock signals. Unfortunately, it takes microseconds to change the clock frequency due to the low-pass filter used inside the PLL. To combat fast events in modern microprocessors and the need for rapid clock frequency changes, many sophisticated clock management circuits have been developed in recent years. For example, integrated clock management circuits from Intel are used to adjust the clock period in the event of supply noise [6, 20]. Benefitting from the development of high-performance clock generation circuits, state-of-the-art clock management allows the processor to perform clock period adjustment at the time scale of a single clock cycle, which is much faster than the loop bandwidth of a PLL.

Figure 6 shows a recent adaptive clock generator circuit where a PLL or a delay-locked loop (DLL) is used to generate multi-phase clocks with a phase selection command issued from the microprocessor [10, 11, 21]. The technique is referred to as dynamic phase scaling (DPS) [21, 22]. The delay chain or ring oscillator inside a PLL or DLL generates multiple phases of a single clock. A multiplexer is used to select which phase is sent out as the clock signal, providing a variable clock period. The phase selection can be performed within a clock cycle, enabling dynamic clock period adjustment at the speed of the clock frequency without waiting for the prolonged re-locking operation of the PLL or DLL. The capability of single-cycle clock period adjustment allows real-time DTS exploitation governed by runtime knowledge from the microprocessor, as described in the next section. Special circuit and layout efforts need to be taken at clock generation and clock distribution to guarantee that the clock phases are well generated and preserved when routed from the PLL/DLL to the clock root of the microprocessor [11].
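A purely behavioral sketch of the DPS idea is given below, assuming a DLL-style delay chain that provides N equally spaced phases of a reference clock; per-cycle phase-select commands then stretch or shrink the effective period in whole phase steps. The phase count, reference period, and command sequence are invented for illustration, and no circuit-level behavior (jitter, phase hand-off) is modeled.

```python
# Behavioral sketch of dynamic phase scaling (DPS): a DLL-like delay chain
# provides N equally spaced phases of the reference clock; selecting which
# phase launches the next edge stretches or shrinks the effective period by
# whole phase steps, cycle by cycle. A timing model only, no circuit detail.

N_PHASES = 8
T_REF = 2.0                                    # reference clock period (ns)
PHASE_STEP = T_REF / N_PHASES                  # 0.25 ns per phase step

def effective_period(phase_shift_steps):
    """Period seen by the logic when the next edge is moved by +/- steps."""
    return T_REF + phase_shift_steps * PHASE_STEP

# Per-cycle phase commands issued by the core (e.g. from instruction decode):
# negative = shrink the cycle (fast instruction), positive = stretch it.
commands = [-2, -2, 0, -1, +1, -2, 0, -2]
periods = [effective_period(c) for c in commands]
print("cycle periods (ns):", periods)
print(f"average period {sum(periods)/len(periods):.2f} ns "
      f"vs fixed worst-case {T_REF:.2f} ns")
```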

Figure 6.

Fast clock phase generator circuitry for cycle-by-cycle computing-adaptive clock period adjustment.

3.3.2 Runtime detection of dynamic timing slack on DNN accelerators and performance gain

Because critical paths are only exercised under certain conditions, one can monitor the toggling of internal logic path signals to anticipate the level of dynamic timing slack. The knowledge of dynamic timing slack is then used to control the clock period through the dynamic phase scaling described in Section 3.3.1. Figure 7 shows two examples of runtime detection of dynamic timing slack as demonstrated recently [8, 11]. Figure 7(a) shows the detection of the execution of critical paths within the pipeline stages of a general-purpose graphics processing unit (GPGPU). The ALU operation codes of different instructions and several critical control signals are monitored in real time to obtain information on the DTS [8]. As GPGPUs use dynamically scheduled execution, the assignment of an instruction to an available scalar or vector ALU must be detected in real time, within one or two clock cycles of latency, to provide sufficient time for clock adjustment using dynamic phase scaling. Measurement results on a 65 nm test chip show that this technique allows an 18% speedup of GPGPU operation or a 30% equivalent power saving, as shown in Figure 7(c).

Figure 7(b) shows the implementation of runtime detection of DTS on the systolic array of a DNN accelerator [11]. As DNN accelerators execute very few instruction types, the DTS comes mostly from the operands being fetched into the processing elements (PEs). As a result, to extract DTS information, the operands, i.e. the inputs and weights of the convolutional neural network, are monitored in real time. A relationship between operand values and the timing delay of the corresponding MAC units is derived through timing analysis at design time and recorded in an on-chip look-up table in the timing control module. Hence, the delay of the PEs can be estimated before the MAC calculation thanks to the pipeline latency of the systolic operation of the inputs. The clock can then be adaptively selected based on the knowledge of the PEs’ operands to improve the performance of the DNN accelerator. However, in a DNN accelerator with hundreds of PEs, any PE could exercise a critical path, leading to a high probability of reaching the worst-case timing scenario, as shown at the bottom of Figure 4(b), where the histogram of PE delay shifts toward the high end as the number of PEs increases. To overcome this challenge, a distributed clock chain technique, as shown in Figure 7(b), is used to split the PE array into 16 loosely connected clock domains, one per PE row. As a result, the PEs within each clock domain, i.e. within each PE row, can exploit DTS separately, while neighboring PE rows are managed so as not to violate the timing constraints between rows. This arrangement maximizes the benefit of DTS exploitation when the number of PEs is high. Measurements on a 65 nm convolutional neural network (CNN) test chip show a 19% performance improvement or 34% power saving using the adaptive-clock-based DTS exploitation technique with an overhead of 5.5%, as shown in Figure 7(d).
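A hedged sketch of the operand-driven clock selection described above is shown below: a design-time lookup table maps coarse operand classes to the clock period the MAC is assumed to need, and each clock domain (one PE row here) picks the slowest requirement among its PEs for the coming cycle. The operand classes, period values, and row size are all invented for illustration; they are not the values used in the cited test chip.

```python
# Sketch of operand-driven clock selection for a systolic DNN accelerator:
# a design-time LUT maps coarse operand classes to the clock period that the
# MAC needs, and each loosely coupled clock domain (one PE row here) selects
# the slowest requirement among its PEs for the upcoming cycle. The classes
# and period values are illustrative, not from the cited test chip.

# Design-time LUT: operand class -> required MAC period (ns)
PERIOD_LUT = {"zero": 1.2, "small": 1.6, "full": 2.0}

def classify(activation, weight):
    if activation == 0 or weight == 0:
        return "zero"                      # sparsity: almost no switching
    if abs(activation) < 16 and abs(weight) < 16:
        return "small"                     # short carry chains
    return "full"                          # worst-case operands

def row_period(row_activations, row_weights):
    """One clock domain per PE row: take the slowest PE requirement."""
    return max(PERIOD_LUT[classify(a, w)]
               for a, w in zip(row_activations, row_weights))

acts = [0, 3, 0, 12, 0, 0, 7, 0]           # incoming activations for one row
wts  = [5, 2, 9, 1, 0, 4, 3, 8]            # corresponding weights
print(f"selected period for this row: {row_period(acts, wts):.1f} ns")
```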

Figure 7.

Diagram of exploitation of dynamic timing slacks in (a) GPGPU; (b) convolutional neural network (CNN) accelerator; (c) measured performance gain on the GPGPU; (d) measured performance gain on the CNN accelerator.

3.3.3 Implementation consideration and EDA support

Compared to the Razor technique, which disturbs the ASIC design flow and requires special Razor flip-flops, the adaptive clocking techniques for DTS exploitation preserve the conventional ASIC design flow without the need for special circuit cells, extra hold-fixing effort, or a system setup for error correction operations. Instead, the adaptive clocking technique requires special design effort on the clock generation circuits, thus adding complexity to the mixed-signal design and SoC integration. However, as the clock generation circuitry, i.e. the PLL or DLL, and the global clock distribution circuits are conventionally handled using an analog mixed-signal design flow, this additional effort does not change the existing design flow for digital ASIC modules or analog mixed-signal circuits. As a result, the adaptive clocking techniques are more compatible with conventional IC design methodology.

It is worth mentioning that special timing analysis is required to identify DTS at the circuit level. As seen in the DNN accelerator example in Figure 7, discovering the operand timing dependency relies on “conditional” STA support from modern EDA tools, where the worst-case timing under certain signal conditions, e.g. certain input bits fixed at 0 or 1, is obtained to generate knowledge of the DTS under different operand or instruction conditions [11]. A systematic way of estimating DTS under given signal conditions or instructions to generate runtime DTS information is one of the future developments that could significantly improve the benefits and adoptability of these techniques. Please also note that DTS exploitation does not conflict with existing DVFS or AVS techniques. The conventional timing margins for PVT variations or jitter still have to be applied in the design to secure timing and power integrity at different corners.


4. Compiler techniques for dynamic timing enhanced computing

As the previous sections have shown, conditional static timing analysis can identify conditions that indicate when DTS is available. This can work well in simple designs where the presence of certain opcodes may imply that DTS will be present in intervening cycles. However, there are situations where the amount of DTS available may depend on interactions between two or more instructions. A common example in a microprocessor pipeline would be when a value produced by one instruction is forwarded to a subsequent one via the bypass network. In this case, additional information may need to be supplied to the hardware. In this section, we will turn to (1) co-design mechanisms which can be used to communicate information constructed off-line to the hardware for use in dynamic clock management and (2) software techniques which can create additional DTS within the instruction stream. This can be efficiently done through small modifications to the instruction set architecture (ISA) and the compiler.

4.1 Encoding timing into the instruction set

Instruction execution in most pipelines used in low-power processors and microcontrollers is statically scheduled. This generally means that the compiler can know exactly which instructions could occur in the pipeline and what their sequencing will be.

This enables off-line program profiling which uses the conditional STA described above. The profiling process considers the sequence of instructions in the pipeline and makes a conservative estimate of the amount of dynamic timing slack that will be present during instruction execution. The compiler automatically embeds the dynamic timing information into the program binary to direct the real-time adaptive clock management. Note that encoding the timing information directly into the instruction sequence avoids an increase in code size and removes the need for additional hardware resources to store the clock control information.

Figure 8 shows how the timing information can be embedded into the existing ARMv7 instructions by repurposing some of the instruction encoding bits. In the 32-bit ARMv7 ISA, the condition code is held within instruction bits [31:28] and can represent 16 unique condition cases. Code analysis has shown that, in most programs, a relatively small set of these condition codes are frequently used. Specifically, the common condition cases like equal (EQ) and not equal (NE) are the most abundantly utilized (∼75% of the cases). The remaining condition cases appear rarely. Consequently, dynamic timing can be comfortably encoded into the 4-bit condition code by remapping the condition code without otherwise altering the instruction word or increasing the footprint of the application.

Figure 8.

Embedding timing information into ARM instructions. The condition code field of the ARM instruction set is used to provide timing information of the instructions.

As shown in Figure 8, the least significant bit of the condition code field (bit [28] of the instruction word) is used as a mode selector to identify the condition code usage. Note that in the original ARM instruction set, the 4-bit condition code allows 16 different types of guards. For this altered ISA, the compiler and hardware are modified to use this new encoding space.

The instruction encoding allows a combination of reduced conditional control and timing information to be embedded in the uppermost 4 bits of the instruction word. The approach relies on timing tables (F1Table, F2Table, F3Table) which specify quantized dynamic timings for instructions; bits of the encoded instruction are then used to choose an index into the table. Depending on the value of bit [28], condition codes may or may not be used. Under normal operation, if bit [28] equals zero, the instruction is executed unconditionally, and bits [31:29] specify a 3-bit clock control. Using the F3Table, the encoded values allow the clock period TPLL to be scaled in the range of −30% to +40% for unconditional instructions. If bit [28] equals one, i.e. indicating a conditional instruction, the instruction is executed using bits [30:29] to specify the condition cases EQ/NE/GT/LE, and bit [31] is used to select the clock phase. For conditional instructions, a binary choice of scaling between 0 and +40% is implemented. Overall, the approach leverages the compiler’s global view of the system to achieve better DTS exploitation than would be possible by just inspecting the opcodes at runtime.
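The following Python sketch decodes the repurposed condition-code field as described: bit [28] selects the mode, bits [31:29] index a 3-bit clock-control table for unconditional instructions, and conditional instructions use bit [31] as a binary phase choice with bits [30:29] encoding EQ/NE/GT/LE. The table contents and scaling values are placeholders, not the chip's actual F3Table.

```python
# Sketch of the repurposed ARMv7 condition-code field described above:
# bit [28] selects the mode; unconditional instructions use bits [31:29] as a
# 3-bit index into a clock-scaling table, conditional instructions use bit
# [31] as a binary phase choice and bits [30:29] for EQ/NE/GT/LE. The table
# contents below are placeholders, not the chip's actual F3Table.

F3TABLE = [-0.30, -0.20, -0.10, 0.0, 0.10, 0.20, 0.30, 0.40]  # clock scaling
COND = ["EQ", "NE", "GT", "LE"]

def decode_timing(instr_word):
    mode = (instr_word >> 28) & 0x1
    if mode == 0:                               # unconditional instruction
        idx = (instr_word >> 29) & 0x7
        return {"cond": None, "clock_scale": F3TABLE[idx]}
    cond = COND[(instr_word >> 29) & 0x3]       # conditional instruction
    stretch = (instr_word >> 31) & 0x1          # binary choice: 0% or +40%
    return {"cond": cond, "clock_scale": 0.40 if stretch else 0.0}

# Example: an unconditional word with bits [31:29] = 0b001 -> -20% clock period
word = (0b001 << 29) | (0 << 28) | 0x0123456
print(decode_timing(word))
```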

A timing calibration scheme is also implemented to adjust for the timing variation of instructions. In the special timing calibration mode, the most significant bit of the condition code (bit [31] of the instruction word) specifies whether the instruction will be used for calibration. When bit [31] equals one, this indicates a calibration cycle, in which a shorter clock period is applied to calibrate the minimum timing requirement. As before, bit [28] serves as the mode selector. For non-conditional instructions, bits [30:29] hold the dynamic timing information. If there is a need to make use of condition codes while in calibration mode, external control signals instead govern the phase selection. When timing calibration is active, the target instructions or instruction sequences execute with short dynamic clock periods. The system tracks the minimum observed clock period that still ensures correct instruction execution. This is later read out and reported as the calibrated timing.
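A minimal sketch of the calibration loop follows: sweep the clock period downward for a target instruction sequence, check correctness against a reference, and keep the shortest period that still passes. The `run_and_check` function is a stand-in for the on-chip execute-and-compare step, and the step size and delay value are invented.

```python
# Sketch of the online timing-calibration idea: execute a target instruction
# sequence with progressively shorter clock periods and record the minimum
# period that still produces correct results. `run_and_check` stands in for
# the on-chip execution-and-compare step and is a placeholder.

def run_and_check(period_ns, true_delay_ns=1.62):
    """Pretend execution: correct iff the period covers the real path delay."""
    return period_ns >= true_delay_ns

def calibrate(start_ns=2.0, step_ns=0.05, floor_ns=1.0):
    period = start_ns
    best = start_ns
    while period >= floor_ns and run_and_check(period):
        best = period                     # last period known to be safe
        period -= step_ns                 # try a shorter clock next
    return best

print(f"calibrated minimum safe period: {calibrate():.2f} ns")
```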

4.2 Compiler-assisted optimization

As shown in the previous section, the compiler’s global view of the program allows for a tighter estimate of DTS, which can be efficiently embedded in the program. The compiler also has considerable leeway to reorganize code to create additional DTS within the program. Three peephole optimizations, instruction substitution, instruction reordering, and instruction overlapping, as shown in Figure 9, are provided as examples. Traditionally, peephole optimizations appear in most compiler backends and allow the compiler to incrementally improve the code by replacing a short sequence of instructions with one or more instructions that achieve the same effect more efficiently. In this case, the optimizations do not aim to reduce the dynamic instruction count but instead create additional timing slack.

Figure 9.

Compiler optimizations that enhance dynamic timing slack. Three peephole optimizations, including substitution, reordering, and latency overlapping, are shown as examples.

4.2.1 Instruction substitution

The first optimization strategy replaces long-delay instructions with shorter-delay instructions that have an identical semantic effect. The inspiration for this is the observation that different code sequences with the same semantics can exhibit divergent delays due to different path sensitization in the circuit logic. For example, in the ARMv7 ISA, an equivalence check can be performed by either a cmp or a teq instruction. When used to determine NE/EQ status, these two operations are semantically equivalent, but the logic paths that they exercise in hardware can be different. In hardware, the teq functionality (a simple test of whether two operands are equal) can be supported by XOR operations. On the other hand, a cmp instruction more generally identifies inequality conditions (less than, greater than) in addition to the simple NE/EQ and thus requires subtraction using adders. This involves carry chains, which increase the delay. As a result, teq does not require the same clock period as cmp, since its logic paths are shorter. This simple substitution does not impact the dynamic instruction count and would not benefit a traditional system. In a system which supports DTS, it creates a significant opportunity.
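A toy peephole pass in the spirit of this substitution is sketched below: when a compare only feeds EQ/NE-guarded instructions, it is replaced with teq so the shorter XOR-based path is exercised. The tuple-based instruction representation, the flag-consumer scan, and the example block are all invented for illustration and do not correspond to a real compiler backend.

```python
# Toy peephole pass in the spirit of the cmp -> teq substitution: if a compare
# only feeds EQ/NE-guarded instructions (no signed GT/LT use of its flags), a
# teq (XOR-based, shorter path) is semantically sufficient and exposes more
# DTS. The tuple-based instruction form here is invented for illustration.

EQ_NE_ONLY = {"EQ", "NE"}

def substitute_teq(block):
    out = []
    for i, (op, args) in enumerate(block):
        if op == "cmp":
            # Collect the condition codes of later instructions that consume
            # the flags (until the next flag-setting instruction).
            consumers = set()
            for op2, args2 in block[i + 1:]:
                if op2 in ("cmp", "teq", "tst"):
                    break
                if args2 and args2[-1] in ("EQ", "NE", "GT", "LE", "LT", "GE"):
                    consumers.add(args2[-1])
            if consumers and consumers <= EQ_NE_ONLY:
                out.append(("teq", args))   # shorter logic path, same semantics
                continue
        out.append((op, args))
    return out

block = [("cmp", ("r0", "r1")), ("beq", ("label", "EQ")), ("add", ("r2", "r3"))]
print(substitute_teq(block))
```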

4.2.2 Instruction reordering

The second optimization strategy re-sequences groups of instructions to boost the amount of DTS. A prominent example of reordering is for producer-consumer data dependences. For these read-after-write (RAW) dependencies, results generated by an earlier instruction must be forwarded to the following instruction via the register bypass network. Re-sequencing can, in some cases, eliminate back-to-back data dependencies by increasing the distance between the instructions, which may free up additional timing slack. Instruction reordering to improve performance is already commonplace in compilers; applying it to create DTS is a novel use.

4.2.3 Latency overlapping

The third and final optimization hides the latency of one or more timing critical operations by scheduling them in the same cycle as another even more critical one. Consider a situation depicted in the bottom panel of Figure 9, where two critical paths are sensitized at different pipeline stages in back-to-back cycles. This optimization will seek to alter the instruction execution order so that the critical paths are exercised during the same clock cycle. In essence, this gives the compiler license to co-schedule two or more clock stretching operations, i.e. long delay instructions, at the same time.

4.3 Benefits of compiler optimization

The techniques described in the previous sections were implemented on a six-stage ARMv7 CPU test chip fabricated in a 55 nm technology [22]. Six programs taken from the SPEC CPU 2006 benchmark suite were run on the ARMv7 CPU. All applications were cross-compiled for the ARM architecture using the LLVM compiler. Figure 10 shows the impact of the various techniques.

Figure 10.

Performance improvement and energy savings with DPS, timing calibration, and compiler optimizations.

The results show that the instruction-driven adaptive clock scheme achieved approximately 14% performance improvement. Furthermore, timing calibration was applied to obtain more accurate instruction dynamic timing under process variation. In the experiment, the instruction timing calibration led to an additional performance improvement of 3–5%, or about 8% additional energy saving. By leveraging the compiler to optimize the runtime instruction sequences, the performance improvement is further increased by up to 4%. Overall, the proposed DPS scheme achieved up to 22% performance improvement, with an average of 20% across the different test programs. Equivalently, an average of 28% and up to 32% energy saving was achieved by the described operation scheme, as shown in Figure 10 [22].

Additional gains can be made by extending the ISA and compiler to create more opportunities for DTS [23, 24]. Subsequent work has shown that address computations found in typical programs can stress critical paths [24]. By changing the layout of data in memory and modifying the way that addresses are calculated, compiler and architecture co-design can further achieve greater benefits from DTS.

While the above compiler assistance technique uses the example of a CPU, similar techniques can be applied to deep learning accelerators through the instructions that control the operation of the DNN accelerator, such as the bit precision, systolic dataflow, etc. DTS varies with such configurations and hence can be exploited for performance gain.


5. Considerations of dynamic timing enhanced computing on deep learning accelerators

The examples provided in Section 3 highlight the DTS techniques that can be applied to modern DNN accelerators to improve accelerator performance. While significant improvements, e.g. the 19% performance gain shown in Figure 7, are observed on the accelerators, several important considerations need to be taken into account when using such techniques.

  1. DTS is highly data dependent, and sparsity improves the benefits

As DNN accelerators have far fewer instruction types than a CPU or GPU, the DTS is mainly determined by the inputs or weights of the MAC operations. As a result, the statistics of the input data can be analyzed to estimate the level of DTS. As is commonly known, high sparsity exists in CNN operations due to the ReLU activation function, and this sparsity helps increase the DTS because zero-valued operands require very little time to execute (see the sketch after this list). Hence, any sparsity enhancement technique applied during the training or inference phase can further improve the benefits of DTS exploitation.

  2. DTS exists over longer execution intervals in DNN accelerators

As the dataflow or operation scheme of a DNN accelerator changes only after thousands of cycles, e.g. at the transitions between CNN layers and fully-connected layers, the DTS of DNN accelerators can be exploited at a coarse resolution, e.g. using a different DTS setting for each CNN layer. This may reduce the overall benefit of the technique but also significantly reduces the implementation effort, as conventional DVFS can be applied to exploit DTS over durations of thousands of clock cycles.

  3. Data format plays an important role for DTS

As DTS for DNN accelerators is highly data dependent, the precision or quantization of the inputs or weights contributes to the level of DTS in the operations. This can be seen in the result in Figure 7(b), where the benefits drop with the lower number of bits used for the DNN. A recent development shows that using a sign-magnitude data format instead of the two’s complement format further improves the benefits of DTS with Razor techniques on DNN accelerators [16]. Hence, it is important to consider the data format when applying the techniques described in this chapter.
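The sketch referenced from consideration 1 above follows: it estimates the DTS available in a layer from the sparsity of its activations, using an invented mapping from operand class to required MAC period (similar in spirit to the LUT example in Section 3.3.2). The period values and sparsity figures are illustrative placeholders.

```python
# Sketch (referenced from consideration 1 above): estimating the DTS available
# in a layer from activation sparsity. Zero operands need far less MAC time,
# so the expected per-cycle period requirement drops as sparsity rises. The
# period values are illustrative placeholders.

ZERO_PERIOD = 1.2      # ns needed when an operand is zero (little switching)
FULL_PERIOD = 2.0      # ns needed for worst-case operands

def expected_period(sparsity):
    """Expected MAC period requirement given the fraction of zero operands."""
    return sparsity * ZERO_PERIOD + (1.0 - sparsity) * FULL_PERIOD

for sparsity in (0.0, 0.3, 0.6, 0.9):        # typical post-ReLU sparsity range
    t = expected_period(sparsity)
    gain = FULL_PERIOD / t - 1.0
    print(f"sparsity {sparsity:.0%}: expected period {t:.2f} ns, "
          f"~{gain:.0%} average DTS headroom")
```

Note that this per-PE expectation is an upper bound on the usable headroom; as discussed in Section 3.3.2, sharing one clock across many PEs pulls the achievable period toward the per-cycle maximum.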


6. Conclusions

The demand for performance on modern computing hardware is growing rapidly, especially due to the heavy workloads of deep learning applications. While classic technology scaling from Moore’s law has significantly slowed down, new circuit techniques are being developed to provide the required performance gains. Different from many high-performance computing techniques that focus on improvements at the system or architecture level of a computing platform, this chapter describes recent circuit-level techniques that are effective in improving the performance of ASIC designs, especially deep learning accelerators. By exploiting the runtime timing variation, i.e. dynamic timing slack, of instructions or operands, significant performance gains can be obtained on modern processors including CPUs, GPUs, and deep learning accelerators. To achieve these goals, circuit techniques including error detection circuits and adaptive clocking are presented in this chapter as viable solutions which effectively reduce the redundant timing margin at runtime in modern accelerators. In addition, compiler techniques that reorder or replace the instructions executed by the digital cores are presented to bring another level of performance improvement. Measurement results on silicon test chips of deep neural network accelerators show close to 20% performance gain or 30% energy saving from the dynamic timing enhanced computing scheme presented in this chapter.


Acknowledgments

This work was partially supported by the National Science Foundation of the United States under grants #CCF-1618065 and #CCF-1908488.

References

  1. “Semiconductor Revenue,” Statista Market Insights, 2023 [Online]. Available from: https://www.statista.com/outlook/tmo/semiconductors/worldwide
  2. Jouppi NP et al. In-datacenter performance analysis of a tensor processing unit. In: Proceedings of the 44th Annual International Symposium on Computer Architecture - ISCA’17. Toronto, ON, Canada: ACM Press; 2017. pp. 1-12. DOI: 10.1145/3079856.3080246
  3. Leng J, Buyuktosunoglu A, Bertran R, Bose P, Zu Y, Reddi VJ. Predictive guardbanding: Program-driven timing margin reduction for GPUs. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems. Jan. 2021;40(1):171-184. DOI: 10.1109/TCAD.2020.2992684
  4. Kaur B, Alam N, Manhas SK, Anand B. Efficient ECSM characterization considering voltage, temperature, and mechanical stress variability. IEEE Transactions on Circuits and Systems I: Regular Papers. 2014;61(12):3407-3415. DOI: 10.1109/TCSI.2014.2336511
  5. Acharya LC, Sharma AK, Ramakrishan V, Mandal A, Dasgupta S, Bulusu A. Variation aware timing model of CMOS inverter for an efficient ECSM characterization. In: 2021 22nd International Symposium on Quality Electronic Design (ISQED). Santa Clara, CA, USA: IEEE; 2021. pp. 251-256. DOI: 10.1109/ISQED51717.2021.9424341
  6. Bowman KA et al. A 45 nm resilient microprocessor core for dynamic variation tolerance. IEEE Journal of Solid-State Circuits. Jan 2011;46(1):194-208. DOI: 10.1109/JSSC.2010.2089657
  7. Das S et al. RazorII: In situ error detection and correction for PVT and SER tolerance. IEEE Journal of Solid-State Circuits. 2009;44(1):32-48. DOI: 10.1109/JSSC.2008.2007145
  8. Jia T, Wei Y, Joseph R, Gu J. An adaptive clock scheme exploiting instruction-based dynamic timing slack for a GPGPU architecture. IEEE Journal of Solid-State Circuits. 2020;55(8):2259-2269. DOI: 10.1109/JSSC.2020.2979451
  9. Jia T, Joseph R, Gu J. 19.4 an adaptive clock management scheme exploiting instruction-based dynamic timing slack for a general-purpose graphics processor unit with deep pipeline and out-of-order execution. In: 2019 IEEE International Solid-State Circuits Conference - (ISSCC). San Francisco, CA, USA: IEEE; 2019. pp. 318-320. DOI: 10.1109/ISSCC.2019.8662389
  10. Jia T, Ju Y, Gu J. A compute-adaptive elastic clock chain technique with dynamic timing enhancement for 2D PE array based accelerators. In: International Solid-State Circuit Conference (ISSCC), San Francisco; 2020. DOI: 10.1109/ISSCC19947.2020.9063062
  11. Jia T, Ju Y, Gu J. A dynamic timing enhanced DNN accelerator with compute-adaptive elastic clock chain technique. IEEE Journal of Solid-State Circuits. 2021;56(1):55-65. DOI: 10.1109/JSSC.2020.3027953
  12. Cherupalli H, Kumar R, Sartori J. Exploiting dynamic timing slack for energy efficiency in ultra-low-power embedded systems. In: 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA). Seoul, South Korea: IEEE; 2016. pp. 671-681. DOI: 10.1109/ISCA.2016.64
  13. Das S et al. A self-tuning DVS processor using delay-error detection and correction. IEEE Journal of Solid-State Circuits. 2006;41(4):792-804. DOI: 10.1109/JSSC.2006.870912
  14. Fojtik M et al. Bubble razor: Eliminating timing margins in an ARM cortex-M3 processor in 45 nm CMOS using architecturally independent error detection and correction. IEEE Journal of Solid-State Circuits. 2013;48(1):66-81. DOI: 10.1109/JSSC.2012.2220912
  15. Whatmough PN, Das S, Bull DM. A low-power 1-GHz razor FIR accelerator with time-borrow tracking pipeline and approximate error correction in 65-nm CMOS. IEEE Journal of Solid-State Circuits. 2014;49(1):84-94. DOI: 10.1109/JSSC.2013.2284364
  16. Whatmough PN, Lee SK, Brooks D, Wei G-Y. DNN engine: A 28-nm timing-error tolerant sparse deep neural network processor for IoT applications. IEEE Journal of Solid-State Circuits. 2018;53(9):2722-2731. DOI: 10.1109/JSSC.2018.2841824
  17. Shen Z, Shan W, Du Y, Li Z, Yang J. Beyond eliminating timing margin: An efficient and reliable negative margin timing error detection for neural network accelerator without accuracy loss. IEEE Journal of Solid-State Circuits. 2023;58(5):1462-1471. DOI: 10.1109/JSSC.2022.3220525
  18. Mair H et al. 2.5 a 7nm FinFET 2.5GHz/2.0GHz dual-gear Octa-Core CPU subsystem with power/performance enhancements for a fully integrated 5G smartphone SoC. In: 2020 IEEE International Solid-State Circuits Conference - (ISSCC). San Francisco, CA, USA: IEEE; 2020. pp. 50-52. DOI: 10.1109/ISSCC19947.2020.9062897
  19. Moons B, Uytterhoeven R, Dehaene W, Verhelst M. 14.5 envision: A 0.26-to-10TOPS/W subword-parallel dynamic-voltage-accuracy-frequency-scalable convolutional neural network processor in 28nm FDSOI. In: 2017 IEEE International Solid-State Circuits Conference (ISSCC). San Francisco, CA, USA: IEEE; 2017. pp. 246-247. DOI: 10.1109/ISSCC.2017.7870353
  20. Tokunaga C et al. 5.7 a graphics execution core in 22nm CMOS featuring adaptive clocking, selective boosting and state-retentive sleep. In: 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC). San Francisco, CA, USA: IEEE; 2014. pp. 108-109. DOI: 10.1109/ISSCC.2014.6757359
  21. Jia T, Joseph R, Gu J. An instruction driven adaptive clock phase scaling with timing encoding and online instruction calibration for a low power microprocessor. In: ESSCIRC 2018 - IEEE 44th European Solid State Circuits Conference (ESSCIRC). Dresden: IEEE; 2018. pp. 94-97. DOI: 10.1109/ESSCIRC.2018.8494244
  22. Jia T, Joseph R, Gu J. An instruction-driven adaptive clock management through dynamic phase scaling and compiler assistance for a low power microprocessor. IEEE Journal of Solid-State Circuits. 2019;54(8):2327-2338. DOI: 10.1109/JSSC.2019.2912510
  23. Fan Y, Jia T, Gu J, Campanoni S, Joseph R. Compiler-guided instruction-level clock scheduling for timing speculative processors. In: Proceedings of the 55th Annual Design Automation Conference. San Francisco, California: ACM; Jun. 2018. pp. 1-6. DOI: 10.1145/3195970.3196013
  24. Fan Y, Campanoni S, Joseph R. Time squeezing for tiny devices. In: Proceedings of the 46th International Symposium on Computer Architecture - ISCA’19. Phoenix, Arizona: ACM Press; 2019. pp. 657-670. DOI: 10.1145/3307650.3322268
