In this chapter, we describe the NEMsCAM cell, a new content-addressable memory (CAM) cell that combines CMOS technology with nanoelectromechanical (NEM) switches. The memory part of NEMsCAM is built from two complementary nonvolatile NEM switches and is located on top of the CMOS-based comparison component. As a use case, we evaluate first-level instruction and data translation lookaside buffers (TLBs) in 16 nm CMOS technology at 2 GHz. The simulation results demonstrate that, compared to a CMOS-only TLB, the NEMsCAM TLB reduces the energy consumption per search operation (by 27%), in standby mode (by 53.9%), and per write operation (by 41.9%), as well as the area (by 40.5%), with minimal performance overhead.
- nanoelectromechanical (NEM) switch
- content-addressable memory (CAM) cell
- translation lookaside buffer (TLB)
Computing technology has made unprecedented progress in the last decades as a result of CMOS technology scaling in line with Moore’s law. Transistor feature sizes have shrunk to half at each generation, and consequently, the number of transistors per chip has doubled every 2 years. However, CMOS scaling faces serious problems caused by the exponential increase of leakage current as the technology scales. The subthreshold leakage is mainly determined by the subthreshold swing (S) of a device, which is defined as the gate voltage reduction needed to reduce the subthreshold current by one decade (S = dVgs/d(log Id)). For bulk CMOS, the subthreshold swing has a fundamental lower limit of about 60 mV/decade, which leads to a large increase in the power density. This limitation prevents manufacturers from fabricating smaller devices and forces them to look for alternative solutions targeting higher performance and efficiency. In order to maintain the scaling ability, a significant amount of research is ongoing to explore various non-CMOS technologies (emerging technologies) as a replacement for volatile and nonvolatile memories. This also motivates us, in this chapter, to exploit one of the most promising emerging technologies (nanoelectromechanical switches) to improve the processor performance.
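The 60 mV/decade figure quoted above can be checked numerically. The following is a minimal sketch, assuming the standard thermal expression S = ln(10)·kT/q for an ideal MOSFET; the function names are illustrative and do not come from the chapter.

```python
import math

BOLTZMANN_J_PER_K = 1.380649e-23        # Boltzmann constant k (SI value)
ELEMENTARY_CHARGE_C = 1.602176634e-19   # elementary charge q (SI value)


def thermal_swing_limit_mv_per_decade(temperature_k: float) -> float:
    """Ideal subthreshold swing limit S = ln(10) * kT / q, in mV/decade."""
    thermal_voltage_v = BOLTZMANN_J_PER_K * temperature_k / ELEMENTARY_CHARGE_C
    return math.log(10.0) * thermal_voltage_v * 1e3


def off_current_ratio(delta_vgs_mv: float, swing_mv_per_decade: float) -> float:
    """Factor by which the subthreshold current drops when Vgs is lowered by delta_vgs_mv."""
    return 10.0 ** (delta_vgs_mv / swing_mv_per_decade)


if __name__ == "__main__":
    swing = thermal_swing_limit_mv_per_decade(300.0)
    print(f"Swing limit at 300 K: {swing:.1f} mV/decade")
    print(f"Current reduction for a 300 mV gate swing: {off_current_ratio(300.0, swing):.3g}x")
```

With a swing this large, every decade of leakage reduction costs roughly 60 mV of gate voltage, which is why a device with a much steeper transition is attractive at low supply voltages.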
Nanoelectromechanical (NEM) switches have been suggested as a promising candidate for replacing CMOS technology. NEM switches provide unique characteristics that are not available in conventional MOS devices, such as near-zero leakage current and a near-infinitely steep subthreshold slope (subthreshold swing <0.1 mV/decade). Such characteristics make them ideal for designing highly energy-efficient structures. However, NEMs have a relatively long mechanical switching delay compared to the intrinsic delay of CMOS devices, and to date, they suffer from limited endurance (10^11 write cycles). Also, NEMs do not offer a turn-on current as high as that of CMOS transistors.
In spite of the large mechanical delay and the limited number of reliable cycles, NEMs have been useful for a wide range of applications such as FPGAs (used as programmable routing switches), adders, flip-flops, memories [4, 11], DACs, and ADCs, where the long switching time and the limited number of switching cycles are not important issues. Most of these circuits combine NEMs and CMOS technology in order to exploit the strengths of each technology and alleviate its weaknesses, achieving low-power and high-performance operation for some critical components. Motivated by these observations, in this chapter, we describe a new content-addressable memory (CAM) cell design, NEMsCAM, based on both NEMs and CMOS technologies, for use in processor structures where writes are relatively infrequent, for example, the translation lookaside buffer (TLB).
Content-addressable memories (CAMs) have been widely adopted for applications that depend on fully associative and high-speed search operations, such as translation lookaside buffers (TLBs), network routers, and data compression. Since the search operation requires fully parallel and fast comparisons, CAMs incur high energy consumption and area costs. Previous works have explored the design of CAMs with emerging memory technologies to mitigate these issues [14–18]. However, those CAMs suffer mainly from increased search latency due to the employed technology, which prevents them from being used in performance-critical structures such as TLBs.
The memory component of NEMsCAM cell is designed with two complementary nonvolatile NEM switches while the comparison circuits are designed with CMOS transistors to allow fast search operation. The novel structure of NEMsCAM along with unique characteristics of NEMs considerably reduces the layout area as well as the energy consumption. Also, in this design, both Out and OutB are available simultaneously, which is essential to design a CAM cell.
As a use case, we leverage the NEMsCAM cell to build fully associative TLBs. The TLB has been pointed out as a critical component for energy and performance in modern processors. A translation lookaside buffer (TLB) is a cache that is employed to accelerate virtual-to-physical address translation. The processor searches the TLB on every memory operation using the virtual page number. Hence, the TLB is a crucial component for the performance and power consumption of computers. We evaluate first-level data and instruction TLBs in 16 nm technology at a 2 GHz frequency. Our analysis demonstrates that the proposed TLB reduces the energy per search operation, per write operation, and in standby mode by 27%, 41.9%, and 53.7%, respectively, and the area by 40.5% compared to a CMOS-only TLB. Also, both designs execute the search operation in one clock cycle. Furthermore, we show that the increased write latency of the NEMs introduces minimal performance overhead (0.27% on average). The main contributions of this chapter are as follows: (1) We describe the NEMsCAM cell design based on complementary nonvolatile NEM switches and CMOS transistors. (2) We explain the design of highly efficient first-level TLBs for data and instruction accesses based on the NEMsCAM cell. (3) We evaluate the proposed designs at both circuit and system level and compare them to CMOS-only TLBs.
Section 2 provides background information, whereas Sections 3 and 4 describe the design of the NEMsCAM cell and the NEMsCAM TLB, respectively. In Section 5, we present our evaluation methodology and the obtained results. Finally, in Section 6, we summarize the chapter.
This section provides background information related to this work. First, we briefly review recently proposed emerging technologies. Then, we describe NEM switches in detail and their prior use in memories. Finally, we describe CMOS-only CAM cells and fully associative TLB structures.
2.1. Emerging technologies
In order to continue the trend of Moore’s Law, many emerging technologies have been explored, such as phase-change memory (PCRAM), magnetoresistive RAM (MRAM), spin-transfer torque magnetoresistive RAM (STT-RAM), ferroelectric RAM (FeRAM), memristor, and nanomechanical memory (NEM). Typical performance parameters of these memory technologies are presented in Table 1 [14, 25].
|Traditional technologies||Emerging technologies|
|Half pitch (F) (nm)||50||65||90||90||180||130||65||3–10||10|
|Smallest cell area (F^2)||6||140||10||5||22||45||16||4||36|
|Read time (ns)||<1||<0.3||<10||<50||<45||<20||<60||<50||0|
|Write/erase time (ns)||<0.5||<0.3||10^5||10^6||10||20||60||<250||1 ns (140 ps–5 ns)|
|Retention time (years)||Seconds||N/A||>10||>10||>10||>10||>10||>10||>10|
|Write op. voltage (V)||2.5||1||12||15||0.9-3.3||1.5||3||<3||<1|
|Read op. voltage (V)||1.8||1||2||2||0.9-3.3||1.5||3||<3||<1|
|Write energy (fJ/bit)||5||0.7||10||10||30||1.5 × 10^5||6 × 10^3||<50||<0.7|
|Voltage scaling||Fairly scalable||No||Poor||Promising||Promising|
|Highly scalable||Major technological barriers||Poor||Promising||Promising||Promising|
Ferroelectric RAM (FeRAM or FRAM) is one promising memory, which employs a ferroelectric gate cell instead of a poly-silicon cell and has been considered as a replacement for flash memory. In spite of some disadvantages such as lower density, higher cost, and poor scalability, it has some advantages over flash such as faster programming time, lower power usage, and higher endurance.
Magnetoresistive RAM (MRAM), which is built from magnetic storage elements, is another emerging technology. Each element is made of two ferromagnetic plates separated by an insulating layer, and each plate holds a magnetic field. One of the plates has a permanent magnetic field and the other has a variable one, which allows the element to store data. MRAM is as fast as SRAM and as dense as DRAM. It is also nonvolatile, like flash, and has high endurance. However, it suffers from large cell size, high write current, and poor scalability, which have hindered its wide commercialization.
Another nonvolatile memory is phase change memory (PCRAM). Similar to optical storage devices, it stores data in a chalcogenide glass. The state of the glass is changed to crystalline or amorphous when an electric current passes through a heating element and heats or quenches the glass. The main limitations of PCRAM are its high programming current and relatively long write/read time.
Memristor is considered to be one of the best candidates for future memory technologies. A memristor is a two-terminal nonvolatile memory that is based on resistance switching effects. The memristor has low write energy and high density due to its multilayer crossbar architecture. Since the memristor crossbar-based architecture is highly scalable, it is predicted to be the candidate of choice for future ultrahigh density memories. As no tunnel oxide is used in this device, the memristor has higher endurance than flash memory. In spite of its high density and endurance, its read time is considerably long. Memristors may easily replace flash memories; however, they are not an appropriate option for extremely fast system components.
Another promising candidate as a replacement for CMOS devices is the nanoelectromechanical (NEM) switch. The on/off state of NEM devices is determined by both electrical and mechanical forces between the gate (a movable beam) and source terminals. Unique characteristics of NEM relays that are not available in conventional CMOS, such as zero leakage current and a near-infinite subthreshold slope, make them ideal for designing highly energy-efficient applications. In spite of their low leakage current, NEM relays do not offer a turn-on current as high as that of CMOS transistors; moreover, they suffer from low endurance.
2.2. Nanoelectromechanical switches
Figure 1(a) and (b) show the simplest NEM switch, the 3T NEM switch, which consists of three terminals: a cantilever beam (which is connected to the source terminal), a gate, and a drain. The voltage difference between the gate and the movable terminal (VGS) controls the position of the beam and the state of operation. When VGS rises above a certain threshold value, called the pull-in voltage (Vpi), the electrostatic force exerted by the gate exceeds the elastic force of the beam and pulls the beam toward the drain until the beam collapses onto the drain and forms a conductive channel from the beam (source) to the drain, thus closing the switch (on-state; Figure 1c). In order to release the beam from the drain, VGS must decrease below the pull-out voltage (Vpo), at which the electrostatic force no longer exceeds the elastic force of the beam; at this moment, the beam is disconnected from the drain (off-state; Figure 1c). Owing to this abrupt on/off transition, NEM switches have zero off-state leakage, as there is no path for current to flow. Moreover, because of the surface adhesion force between the two contacting regions, Vpo is usually smaller than Vpi, and the I-V characteristics of NEM switches exhibit hysteresis, which enables NEM relays to be used as memory elements (Figure 1c).
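The pull-in/pull-out hysteresis described above can be illustrated with a small behavioral model. This is only a sketch: it ignores the mechanical dynamics of the beam, the class name `NemSwitch3T` is hypothetical, and the default thresholds reuse the Vpi = 0.8 V and Vpo = 0.2 V parameters quoted in the evaluation section.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class NemSwitch3T:
    """Behavioral 3T NEM relay: state changes only at the pull-in/pull-out thresholds."""
    v_pull_in: float = 0.8    # Vpi in volts
    v_pull_out: float = 0.2   # Vpo in volts
    closed: bool = False      # False = off-state (beam released), True = on-state

    def apply(self, v_gs: float) -> bool:
        """Apply a gate-to-source voltage and return the resulting switch state."""
        if v_gs > self.v_pull_in:
            self.closed = True        # electrostatic force exceeds the elastic force
        elif v_gs < self.v_pull_out:
            self.closed = False       # elastic force releases the beam
        # Between Vpo and Vpi the previous state is retained (hysteresis window).
        return self.closed


def sweep(switch: NemSwitch3T, voltages: Iterable[float]) -> List[bool]:
    """Apply a sequence of VGS values and record the state after each step."""
    return [switch.apply(v) for v in voltages]


if __name__ == "__main__":
    relay = NemSwitch3T()
    for v, state in zip([0.0, 0.5, 0.9, 0.5, 0.1], sweep(relay, [0.0, 0.5, 0.9, 0.5, 0.1])):
        print(f"VGS = {v:.1f} V -> {'on' if state else 'off'}")
```

The same VGS value (0.5 V in the sweep) yields different states depending on the history, which is the property that lets the relay act as a memory element.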
There are several implementations of NEM switches [26, 27]. In this work, we consider the 5-terminal (5T) NEM switch, which is illustrated in Figure 2(a) and (b). A suspended beam is anchored at the source. Gate terminal 1 and gate terminal 2 (Gate1 and Gate2) are located close to the beam. The beam connects to Drain1 or Drain2 (the two output nodes). The beam moves toward Gate1 or Gate2 because of the electrostatic attraction that arises from a voltage difference between Gate1 and Gate2 (Figure 2b), and then connects to Drain1 or Drain2, creating a conductive path between this drain and the source. An advantage of employing two gates is that the electrostatic force can be used for both pull-in and pull-out. Therefore, one does not have to rely only on the elastic restoring force of the beam. Hence, the scalability of the device and the operational margins are improved considerably. However, write operations in NEM switches take multiple clock cycles, because the mechanical movement of the beam is fairly slow, which is a property of the device technology.
2.3. Nonvolatile nanoelectromechanical switches
In this work, we choose the NEM switches which exhibit nonvolatile characteristic: once they are connected to a drain, they remain in this location until the beam is pulled out by electrostatic forces from the opposite gate. We select the NEM switch described in . Figure 2(c) demonstrates the two stable states of this NEM switch. As long as both gate1 and gate2 are at the same potential (wordline, WL = 0), the beam never suffers a net disturbing electrostatic force. Figure 2(d) shows the write operation of this switch.
2.4. Memory arrays based on nanoelectromechanical switches
Previous studies have evaluated the use of NEMs in memories [11, 30]. Some of them address the use of NEM switches to replace conventional memory arrays, such as SRAM. Chong et al.  replace the two pull-down transistors in a 6T SRAM cell with NEM switches to reduce leakage and area. Some of these studies also discuss nonvolatile memory arrays. The memory array structure disclosed in Ref.  is of particular interest to our proposal, as we explain in Section 3.
2.5. Configuration elements based on nanoelectromechanical switches
Recently, NEMs have been used for configuration tasks. Dong et al.  employed 3T NEMs as configuration memory components in FPGAs, replacing a routing switch with one NEM switch, or an LUT cell with two NEM switches. This design could be exploited for designing CAM cells; however, the proposed cell has several deficiencies: it relies only on the elastic restoring force for pull-out, it suffers from half-select conditions, it outputs only Out and not the complementary OutB, and it is volatile. The structure of the memory part of NEMsCAM, which we describe in Section 3, addresses these shortcomings.
2.6. Content-addressable memory
A content-addressable memory (CAM) concurrently compares the search data with all of its stored data and returns the address of the matching location in a single clock cycle. A typical CMOS-only CAM cell incorporates an SRAM cell to store the data bit and additional XOR circuits to compare the stored bit with the search data. CAMs are a popular solution for a wide variety of applications that require high search speeds, such as data compression, network routers, and lookup tables. However, the search operation in a CAM requires fully parallel comparison circuits to meet timing requirements. This results in high energy consumption and constrains the number of entries, which directly affects the effectiveness of the CAM.
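The search behavior described above can be sketched in software as follows. The class `Cam` and its methods are hypothetical names used only for illustration; the loop visits entries one after another, whereas the hardware compares all entries at the same time.

```python
from typing import List, Optional


class Cam:
    """Behavioral model of a binary CAM: search by content, get addresses back."""

    def __init__(self, num_entries: int, width_bits: int) -> None:
        self.mask = (1 << width_bits) - 1
        self.entries: List[Optional[int]] = [None] * num_entries  # None = invalid entry

    def write(self, address: int, word: int) -> None:
        """Store a word at a given address (conventional, address-based write)."""
        self.entries[address] = word & self.mask

    def search(self, key: int) -> List[int]:
        """Return the addresses of all valid entries whose content equals the key."""
        key &= self.mask
        matches = []
        for address, stored in enumerate(self.entries):
            if stored is None:
                continue
            mismatch_bits = stored ^ key   # per-bit XOR, as in the cell's compare circuit
            if mismatch_bits == 0:         # no bit differs: this row's matchline signals a match
                matches.append(address)
        return matches


if __name__ == "__main__":
    cam = Cam(num_entries=4, width_bits=8)
    cam.write(2, 0xA5)
    print(cam.search(0xA5), cam.search(0xA4))
```

The energy cost discussed in the text comes from the fact that every stored bit takes part in every search, regardless of how many rows end up matching.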
2.7. Translation lookaside buffer
Virtual memory simplifies programming by abstracting and managing the available physical memory in pages. To accelerate virtual memory, processors employ the translation lookaside buffer (TLB), which holds recently used virtual-to-physical translations. The processor searches the TLB on every memory operation using the virtual page number. In the case of a hit, the TLB returns the physical page number so that the memory operation can proceed with accessing the memory hierarchy. However, in case of a miss, the memory operation will not complete until the address translation is retrieved from the memory (page walk), which might take up to hundreds of cycles. The TLB is hence a crucial component for the performance of the processor.
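The hit and miss paths described above can be summarized with a small software model. This is a sketch under stated assumptions: 4 KiB pages, a FIFO replacement policy, and a caller-supplied `page_walk` function are all illustrative choices that are not specified in the chapter.

```python
from collections import OrderedDict
from typing import Callable

PAGE_SHIFT = 12  # assumed 4 KiB pages


class FullyAssociativeTlb:
    """Behavioral fully associative TLB with FIFO replacement (assumed policy)."""

    def __init__(self, num_entries: int, page_walk: Callable[[int], int]) -> None:
        self.num_entries = num_entries
        self.page_walk = page_walk                      # slow path: VPN -> PPN
        self.entries: "OrderedDict[int, int]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def translate(self, virtual_address: int) -> int:
        """Translate a virtual address, filling the TLB on a miss."""
        vpn = virtual_address >> PAGE_SHIFT
        offset = virtual_address & ((1 << PAGE_SHIFT) - 1)
        if vpn in self.entries:                         # hit: CAM search finds the VPN
            self.hits += 1
            ppn = self.entries[vpn]
        else:                                           # miss: page walk, then TLB write
            self.misses += 1
            ppn = self.page_walk(vpn)
            if len(self.entries) >= self.num_entries:
                self.entries.popitem(last=False)        # evict the oldest entry
            self.entries[vpn] = ppn
        return (ppn << PAGE_SHIFT) | offset


if __name__ == "__main__":
    tlb = FullyAssociativeTlb(64, page_walk=lambda vpn: vpn + 0x100)
    print(hex(tlb.translate(0x1234)), hex(tlb.translate(0x1FFF)), tlb.hits, tlb.misses)
```

Note that the TLB is written only on the miss path, which is the property that makes a slower write acceptable for this structure.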
3. Design of NEMsCAM cell
In this section, we present the circuit details of our proposed NEMsCAM cell. We use the memory structure proposed in  to implement the storage part of NEMsCAM. That memory structure provides full-select behavior, which is necessary to build a CAM; it also employs electrostatic pull-in and pull-out and does not need a cell selector component in the write path. The nonvolatile memory is designed based on the NEM switch proposed in , which eliminates any net disturbing electrostatic force. Figure 3(a) depicts the configuration of the NEMsCAM cell. Out and OutB, the outputs, are connected to the transistors of the comparison part. We select CMOS for the comparison part to avoid the long delay of the NEMs, which is caused by the mechanical movement of the beams and would slow down the search operation. Figure 4(a) presents the schematic of our NEM memory cell while it is being programmed to state “1.” Figure 4(b) shows its switch model, and Figure 4(c) shows a simplified NEM Verilog-A model between the BL (source), Out (drain), and WL nodes, which we employ in our circuit analysis. The other switches of the NEM memory cell follow the same switch model.
3.1. Circuit operations
3.1.1. Write biasing scheme
Figure 3(b) describes how the storage circuit of the NEMsCAM is written. When the WL (wordline) is activated, all beams on the row are sensitized. For the columns whose cells are to be programmed to 1, the bitlines are set to zero (BL = BLB = 0), and for the columns whose cells are to be programmed to 0, BL = BLB = 1 is applied.
No cell suffers half-disturb situations, and since BLB and BL are always at the same potential during switching, there is no risk of short-circuit current running through the switches. This is critical because high currents through the contact between the drain and the beam can be a source of failure. During typical operation, BLB is held at 0 and BL at 1. Cells whose beams are in state 0 hence have OutB = 1 and Out = 0. Keep in mind that there is no separate read operation in this memory design and there is no mechanical switching latency in the read path.
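The biasing rules above can be captured in a short logical model. This is a sketch only: the function names are hypothetical, signal levels are abstracted to 0 and 1, and the outputs for state 1 (Out = 1, OutB = 0) are an assumption derived from the complementary nature of the two switches, since the text only spells out state 0.

```python
from typing import List, Sequence, Tuple


def bitline_bias_for_write(data_bit: int) -> Tuple[int, int]:
    """Return the (BL, BLB) levels applied while WL is active to program data_bit."""
    if data_bit not in (0, 1):
        raise ValueError("data_bit must be 0 or 1")
    level = 0 if data_bit == 1 else 1   # programming 1 uses BL = BLB = 0, programming 0 uses 1
    return (level, level)               # both bitlines share one potential: no short-circuit path


def write_row(row: Sequence[int], data: Sequence[int], wl_active: bool) -> List[int]:
    """Program one row; with WL inactive the beams see no net force and keep their state."""
    if not wl_active:
        return list(row)
    new_row = []
    for bit in data:
        bl, _blb = bitline_bias_for_write(bit)
        new_row.append(1 if bl == 0 else 0)
    return new_row


def cell_outputs(stored_bit: int, bl: int = 1, blb: int = 0) -> Tuple[int, int]:
    """Return (Out, OutB) during typical operation (BL = 1, BLB = 0)."""
    if stored_bit == 0:
        return (blb, bl)   # stated in the text: Out = 0, OutB = 1
    return (bl, blb)       # assumed complement for state 1


if __name__ == "__main__":
    row = write_row([0, 0, 1], [1, 0, 0], wl_active=True)
    print(row, [cell_outputs(bit) for bit in row])
```

Because the outputs are simply the bitline levels routed through the closed contacts, reading them does not involve any beam movement.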
3.2. Cell architecture
Figure 5 presents a three-dimensional view of two neighboring NEMsCAM cells placed in the same column of the array. As NEM switches can be fully integrated with CMOS devices, they are located on top of the CMOS layer in this work, which considerably decreases the layout area. The searchline (SL) wires run parallel to the BL wires, whereas the matchlines (ML) and wordlines (WL) run orthogonally to the BLs. With vertical NEM switches, the need for a long beam has a negligible effect on the layout area, since the beam is out of the plane. The two Gate1s are aligned and connected to their related WL, while the two Gate2s are connected to zero.
The drains are coupled from opposite directions and form a cross configuration. The WL and BL wires can be merged with the actual device terminals, resulting in a compact layout. Finally, vias connect OutB and Out to the CMOS layer, which is placed under the NEMs layer. Because of this arrangement, our proposed NEMsCAM cell decreases the wire length, which, along with the near-zero leakage behavior of NEM switches, considerably reduces the power consumption.
4. A use case for NEMsCAM: TLB
As mentioned before, we leverage the design of the proposed NEMsCAM cell to build a fully associative translation lookaside buffer (TLB), called NEMsCAM TLB. In this section, we first elaborate on the motivation behind it and then we describe the design details and the circuit operations.
Because of the importance of the TLB to the system’s efficiency, processor designers have adopted a two-level TLB structure. The first-level TLB is fully associative and small and provides a very fast search operation, while the second-level TLB is large and holds as many translations as possible. To further improve system performance, processors provide separate TLBs for instructions and data.
The TLB hierarchy accounts for a substantial percentage of the power consumed in the chip [34, 35]. Intel recently reported that 13% of the total core energy comes from the TLBs for memory-intensive workloads. Based on our evaluation setup (Section 5), we find that, in terms of accesses across the TLB hierarchy, the TLB power consumption is overwhelmingly dominated by the first-level TLBs (Figure 6). Moreover, by breaking down the power in the first-level TLBs, we find that the CAM component contributes 94%. In order to decrease this source of power consumption without diminishing the performance, we leverage our proposed NEMsCAM cell to design energy-efficient first-level TLBs.
We design the NEMsCAM TLB with our proposed CAM cell and with typical SRAM memory circuits (Figure 7). The CAM part (Figure 7a) consists of the NEMsCAM cells and the necessary peripheral circuitry optimized for both search and write operations. Similarly, the SRAM cells (Figure 7b) and the associated circuits are designed with CMOS technology. The control signal unit consists of the necessary inverter chains that generate the signals to control the TLB circuits so that the search and the write operations are performed correctly.
The address decoder, the write circuits, and the data-in drivers are used only for the write operation; however, the rest of the circuits are designed to be used during the search operation as well. BL and BLB are driven with predefined signals according to the operations. The control circuit unit is added to generate the necessary Gate1 and Gate2 signals during the search and write operations.
4.2.1. Search operation
During the search operation, WLen3 becomes high and the WLM lines are connected to the ML lines. At the beginning of the search operation, all ML lines are first placed in the precharged state, as in a CMOS-only TLB. The search cycle begins when MLpre (the precharge signal) becomes high, driving the ML to zero. At the same time, the SLs (search lines) are driven to their respective data values; with this method, there is no need for a separate SL precharge phase. After the precharge phase completes, the ENB signal becomes low and connects the current source to the ML. During the evaluation phase, the stored bits of the CAM cells are compared against the data provided on the corresponding SLs.
In case of a match (TLB hit), the current source enabled by ENB pulls the ML up and the ML voltage changes to the high state. The state of each ML row is sensed and amplified by an ML sense amplifier. The sense amplifier consists of an nMOS transistor and a half-latch circuit that stores the output data (Figure 7a). We choose the current-race scheme among the various matchline-sensing techniques due to its simplicity and its low average ML energy consumption. Alternatively, in case of a mismatch (TLB miss), the cell(s) that cause the mismatch counteract the current source and keep the ML close to ground level. In the match case, the matchline trips the half-latch circuit when it is charged to a voltage slightly higher than the threshold voltage of Msense, whereas in the mismatch case it remains at a much lower voltage and leaves the latch in its initial state [5, 27]. Finally, the ML sense amplifiers feed the wordline buffers, mapping the match location to its corresponding encoded address as stored in the SRAM cells (Figure 7a). Figure 9 summarizes the signal behavior of the matching case for a cell of the NEMsCAM TLB.
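The precharge and evaluation phases of the current-race scheme can be sketched as a purely behavioral model. The voltage levels used here (`v_trip`, `v_charged`, `v_held_low`) are hypothetical placeholders chosen only to show the decision logic; they are not values from the chapter or from any circuit simulation.

```python
from typing import List, Sequence, Tuple


def evaluate_matchline(
    stored: Sequence[int],
    search: Sequence[int],
    v_trip: float = 0.4,       # hypothetical trip point of the sense transistor
    v_charged: float = 0.5,    # hypothetical ML level reached when no cell pulls down
    v_held_low: float = 0.05,  # hypothetical ML level when mismatching cells fight the source
) -> Tuple[float, bool]:
    """Return (final ML voltage, latch tripped) for one row of CAM cells."""
    if len(stored) != len(search):
        raise ValueError("stored word and search word must have the same width")
    ml_voltage = 0.0  # precharge phase: MLpre high drives the matchline to ground
    mismatching_cells = sum(s ^ k for s, k in zip(stored, search))
    # Evaluation phase: ENB low enables the current source on the matchline.
    if mismatching_cells == 0:
        ml_voltage = v_charged     # match: nothing counteracts the current source
    else:
        ml_voltage = v_held_low    # mismatch: pull-down paths keep ML near ground
    latch_tripped = ml_voltage > v_trip
    return ml_voltage, latch_tripped


def matching_rows(rows: Sequence[Sequence[int]], key: Sequence[int]) -> List[int]:
    """Indices of the rows whose half-latch trips for the given search word."""
    return [i for i, row in enumerate(rows) if evaluate_matchline(row, key)[1]]


if __name__ == "__main__":
    rows = [[1, 0, 1], [1, 1, 1], [0, 0, 1]]
    print(matching_rows(rows, [1, 1, 1]))
```

Only matching rows charge their matchline in this scheme, which is consistent with the low average ML energy mentioned in the text, since most rows mismatch on a typical search.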
4.2.2. Write operation
During the write operation, WLen1 and WLen2 are high, the WL generated by the address decoder is routed to the CAM and SRAM parts, and the data is written into the corresponding cells.
5. Experimental evaluation
In this section, we first describe our methodology to evaluate the NEMsCAM TLB, and then we present the results.
We design NEMsCAM TLBs for DTLB (data) and ITLB (instruction) accesses based on  applying the TLB organization of a modern AMD server-oriented processor  (Table 2). For both the CMOS-only and the NEMsCAM TLB, we write transistor-level netlists that include the equivalent resistances and capacitances of the wires and all essential circuitry. We simulate and optimize both TLB designs with Cadence Spectre using the 16 nm Predictive Technology Model at T = 25°C and a 2 GHz processor frequency. As mentioned before, for the NEM switches, we apply a simple Verilog-A model (Figure 4) with the following parameters: Vpo = 0.2 V, Vpi = 0.8 V, Cgs-off = 15 aF, Cgs-on = 20 aF, tmech = 3 ns. We optimize the TLB circuits to minimize the energy consumption. We verify that the search and write operations are performed correctly and meet the timing requirements. We also assess the energy consumption per write and search operation and in standby mode. Moreover, we plan the layouts, measure the wire lengths, and optimize the wire capacitances in the netlists. Finally, in order to evaluate the effect of the NEMsCAM TLBs on system performance, we use the Sniper simulator with the configuration of Table 2 and run the TLB-intensive workloads from SPEC 2006 with the reference input set for one billion instructions.
|Per-core TLB organization|
|Level 1||Data (DTLB)||64 entries, fully assoc.|
|Instruction (ITLB)||48 entries, fully assoc.|
|Level 2||Data (L2-DTLB)||1024 entries, 4-way assoc.|
|Instruction (L2-ITLB)||512 entries, 4-way assoc.|
5.2.1. Energy and area
Table 3 presents the simulation results for both the CMOS-only and the NEMsCAM TLBs. We observe that the area decreases by 40.5% for the DTLB (Figure 8). The unique design of the NEMsCAM cell is the reason for this improvement. Furthermore, we find that the energy consumption per write operation, per search operation, and in standby mode decreases by 41.9%, 27%, and 53.7%, respectively, for the DTLB. This is due to the smaller dimensions of the circuit, which lead to lower parasitic wire resistances and capacitances on the matchlines and the searchlines, which in turn need fewer driving buffers. Also, the energy consumption is further reduced by the near-zero leakage current that NEMs provide. Similar results are achieved for the ITLB as well.
|TLB||Operation||CMOS-only||NEMsCAM||Reduction (%)|
|DTLB 64 entries||Search operation (pJ)||4.529||3.308||27.0|
|Write operation (pJ)||0.148||0.086||41.9|
|Standby mode (pJ)||0.141||0.065||53.7|
|ITLB 48 entries||Search operation (pJ)||3.658||2.805||23.3|
|Write operation (pJ)||0.187||0.107||42.8|
|Standby mode (pJ)||0.106||0.046||55.9|
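The reduction percentages for the search and write rows of Table 3 can be reproduced from the two energy columns. The sketch below assumes that the first numeric column is the CMOS-only TLB and the second is the NEMsCAM TLB, which is an interpretation of the table rather than something its layout states. The standby rows are left out because recomputing them from the rounded entries is sensitive to that rounding.

```python
from typing import Dict, Tuple


def reduction_percent(cmos_only_pj: float, nemscam_pj: float) -> float:
    """Relative energy reduction of the NEMsCAM design versus CMOS-only, in percent."""
    return 100.0 * (cmos_only_pj - nemscam_pj) / cmos_only_pj


# (CMOS-only energy in pJ, NEMsCAM energy in pJ), copied from Table 3.
TABLE3_PJ: Dict[Tuple[str, str], Tuple[float, float]] = {
    ("DTLB", "search"): (4.529, 3.308),
    ("DTLB", "write"): (0.148, 0.086),
    ("ITLB", "search"): (3.658, 2.805),
    ("ITLB", "write"): (0.187, 0.107),
}

if __name__ == "__main__":
    for (tlb, operation), (cmos, nems) in TABLE3_PJ.items():
        print(f"{tlb} {operation}: {reduction_percent(cmos, nems):.1f}% reduction")
```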
Figure 9 shows the simulation waveform of the matching case for one cell of the NEMsCAM DTLB (data TLB) during the search operation. The waveform confirms that the design meets the target timing requirement of one clock cycle per search operation. On the other hand, the write operation takes six cycles in the NEMsCAM TLB (based on Ref. ), whereas it takes two cycles in the CMOS-only TLB. This slowdown is because of the long mechanical delay of the NEM switches. However, this latency barely affects the processor performance as shown next.
Figure 10 displays the energy reduction in the first-level TLBs due to the use of NEMsCAM for several workloads. We observe that the search operation dominates the energy breakdown for both the ITLB and the DTLB and that the NEMsCAM TLBs reduce the energy spent by 28.7% on average. Taking into account that 13% of the total core energy comes from the TLBs, the NEMsCAM cell can considerably help in reducing the chip’s total energy consumption. Figure 11 shows the evaluated execution overhead due to the use of the NEMsCAM TLBs. This overhead occurs because of the increased latency of the write operation in NEMsCAM. However, the write operation (a) occurs only after TLB misses, which are rare compared to TLB hits, and (b) adds latency to an already slow operation, i.e., the L2-TLB access (~7 cycles), including potentially the penalty of an L2-TLB miss (hundreds of cycles). Therefore, the NEMsCAM TLBs have an insignificant effect on the execution time for most applications (0.27% on average) while significantly decreasing the energy spent in the TLB hierarchy.
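The six-cycle write latency is consistent with the mechanical delay used in the simulations. The sketch below converts the tmech = 3 ns parameter into clock cycles at 2 GHz; it assumes that the mechanical delay alone sets the write latency and ignores any electrical delay, and the helper name is hypothetical. Integer units are used to avoid floating-point rounding in the ceiling.

```python
def latency_in_cycles(delay_ps: int, clock_mhz: int) -> int:
    """Number of whole clock cycles needed to cover delay_ps at the given clock rate."""
    # cycles = ceil(delay * frequency); with ps and MHz the product is scaled by 1e6.
    return -(-(delay_ps * clock_mhz) // 1_000_000)


if __name__ == "__main__":
    T_MECH_PS = 3_000   # tmech = 3 ns, from the NEM switch model parameters
    CLOCK_MHZ = 2_000   # 2 GHz processor frequency
    print(f"Mechanical delay spans {latency_in_cycles(T_MECH_PS, CLOCK_MHZ)} clock cycles")
```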
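A rough upper bound on the execution overhead follows from the argument above: extra cycles are added only when a first-level TLB entry is written. The sketch below assumes that every extra write cycle stalls execution (a pessimistic assumption) and uses purely hypothetical cycle and fill counts; it is not the Sniper-based methodology used for Figure 11.

```python
def write_latency_overhead_percent(
    baseline_cycles: int,
    l1_tlb_fills: int,
    nems_write_cycles: int = 6,   # NEMsCAM TLB write latency from the text
    cmos_write_cycles: int = 2,   # CMOS-only TLB write latency from the text
) -> float:
    """Upper-bound slowdown in percent if every extra write cycle is fully exposed."""
    extra_cycles = l1_tlb_fills * (nems_write_cycles - cmos_write_cycles)
    return 100.0 * extra_cycles / baseline_cycles


if __name__ == "__main__":
    # Hypothetical workload: 1e9 baseline cycles and one TLB fill per 1000 cycles.
    overhead = write_latency_overhead_percent(1_000_000_000, 1_000_000)
    print(f"Upper-bound overhead: {overhead:.2f}%")
```

In practice part of the extra latency overlaps with the L2-TLB access or the page walk, so the exposed overhead would be expected to be lower than this bound.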
In this chapter, we describe the NEMsCAM cell design that combines both NEMs and CMOS to design low power and highly efficient processor structures such as TLBs. Our analysis shows that the NEMsCAM TLB exhibits significant benefits over the CMOS-only TLB in terms of energy consumption and area. However, the limited write endurance of current NEMs may delay their adoption until the technology improves.