# 國立交通大學電信工程學系 碩士論文 一個針對維特比解碼器設計之全客戶式殘存記憶單元 與低功率脈衝邊緣觸發閂鎖暫存器 A Fully Custom Survivor Memory Unit for Viterbi Decoder with Low Power Pulsed Edge-Triggered Latches 研究生:蘇韋力 指導教授: 闕河鳴 博士 中華民國九十五年七月 # 一個針對維特比解碼器設計之全客戶式殘存記憶單元 與低功率脈衝邊緣觸發閂鎖暫存器 #### **A Fully Custom Survivor Memory Unit** #### for Viterbi Decoder with Low Power Pulsed #### **Edge-Triggered Latches** 研 究 生:蘇韋力 Student: Wei-Li Su 指導教授:闕河鳴 博士 Advisor: Herming Chiueh #### A Thesis Submitted to Department of Communication Engineering College of Electrical and Computer Engineering National Chiao Tung University in Partial Fulfillment of the Requirements for the Degree of Master of Science in Communication Engineering July 2006 Hsinchu, Taiwan 中華民國九十五年七月 # 一個針對維特比解碼器設計之全客戶式殘存記憶單元 與低功率脈衝邊緣觸發閂鎖暫存器 研究生:蘇韋力 指導教授: 闕河鳴 博士 國立交通大學 電信工程學系碩士班 #### 摘要 維特比解碼器被認為在提供優異性能的同時,維持了合理的編碼複雜度與適當的運算資源而被廣泛地使用在現今的通訊系統錯誤更正機制中卷積碼的處理;然而由於電池壽命有限,低功率訴求在現今通訊領域的超大型積體電路設計之中成為了比過往更重要的議題,因此在本論文,我們針對維特比解碼器設計一個由大量暫存器組成,基於暫存器交換架構之帶有低功率特徵的全客戶式殘存記憶單元。首先本論文提出一個新的暫存器架構:低功率低擺幅靜態邊緣觸發門鎖暫存器;由於該暫存器具有較低之時脈節點電容及較少電晶體數目兩樣特徵,在功率以及面積方面的考量都非常設合於本殘存記憶單元的設計。此外除了零點一三微米互補式金氧半導體製程下的模擬,我們特別於九十奈米製程的模擬中觀察了漏電流功率的現象。藉由全客戶式設計流程的輔助,本論文完成了全客戶式殘存記憶單元的實體電路佈局實作,且在特定的測試樣本之下,所消耗的功率僅是電路輔助設計軟體所合成之殘存記憶單元的百分之二十七點三。本論文電路佈局實作的環境是台積電零點一三微米互補式金氧半導體製程。 #### **A Fully Custom Survivor Memory Unit** #### for Viterbi Decoder with Low Power Pulsed #### **Edge-Triggered Latches** Student: Wei-Li Su Advisor: Herming Chiueh SoC Design Lab, Department of Communication Engineering, College of Electrical and Computer Engineering, National Chiao Tung University Hsin-Chu 30050, Taiwan #### Abstract Viterbi decoders are widely used in communication systems as decoding convolutional codes which provide a superior error correction capacity while maintaining a reasonable coding complexity and computing resource. However, power dissipation has become a critical issue in modern communication systems which have emphasized low power features due to shortage of battery life. In this thesis, a fully custom register-exchange based SMU hard macro which is composed of a lot number of low power registers is presented for Viterbi decoders. We first propose our low power low swing static edge-triggered latch (ETL) which is very suitable for the register-exchange SMU design due to its less clock loading and fewer transistors number at power and area domain. We not only simulate proposed low swing static ETL in 0.13 um CMOS process, but also in 90 nm CMOS process especially interesting in leakage power dissipation. Then we based on the proposed static ETL to construct our low power SMU hard macro by means of the fully custom design flow, and the power consumption under the specified test pattern is merely about 27.3% as compared with the synthesized SMU from EDA synthesizer. The Layout implementation environment of this work is in TSMC 0.13 um CMOS process. ### Acknowledgement 本篇碩士論文得以順利完成,首先要感謝的是我的指導教授關河鳴博士。老師總在研究遇到挫折時給予寶貴的指引,使我獲益良多。此外老師平日樂於培養學生獨立思考與分析問題的能力,更使我在學習上能建立積極正面的態度。 接著,我要感謝偉閔、漳源、庭瑋、明崇和芳如等五位學長姐以及紹宣和佐昇兩位同窗,在我研究生活中給予諸多的建議與協助。尤其是偉閔與紹宣更是指點我許多研究方法的良師益友。另外再感謝的是交大電子所博士班的林建青、黄柏蒼學長,給予本研究許多理論和實驗的寶貴建議。此外我還要特別感謝交通大學計網中心技術發展組劉大川組長,在我研究所生活上的照顧以及支援。 再來,我要感謝父母親對我的愛護與包容,全力支持我完成研究所學業,在 我低潮的時候給予我鼓勵,您們的養育之恩我終生難忘;還要感謝我的姐姐與姐 夫,無時不刻關心我的生活起居跟飲食健康,一方面又負擔起照顧雙親的重擔, 您們對我的栽培之情,我也是銘記在心。 最後,我要感謝所有建國中學樂旗隊十五屆同學以及台北樂府室外樂旗藝術團六到十二季團員,你們豐富了我人生的旅程;特別要提到我的高中同學晟豪、彥荃、珣力、政逸、凱鈞,以及樂府的夥伴姿佑、群芯、欣燕、明慶、正仁,在我大學與研究所的生涯中,給予我莫大的支持以及鼓勵,沒有你們,這條路我必定走得備加艱辛。 要感謝的人太多了,上述難免遺漏,我誠心感謝所有提攜過我或幫助過我的你們,謝謝大家並祝福大家。 蘇韋力 中華民國九十五年七月於新竹 ## Content | 中文摘要 | | |--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--| | English Abstract | | | Acknowledgment | | | Content | | | List of Tables | | | List of Figures | | | 中文摘要 English Abstract Acknowledgment Content List of Tables List of Figures Chapter 1 Introduction 1.1 Motivation 1.2 Organization Chapter 2 Overview of Low Power Registers 2.1 Relative Low Power Registers 2.1.1 Master-Slave Flip-Flops 2.1.2 Pulse Registers 2.1.3 Sense-Amplifier-Based Registers 2.1.4 Summary 2.2 Dual-Rail Pulse Registers 2.2.1 Dual-Rail Dynamic Hybrid Latch Flip-Flop. 2.2.2 Dual-Rail Static Edge-Triggered Latch 2.3 Summary Chapter 3 Design of Proposed Low Power Pulse Register 3.1 Survivor Memory Unit 3.1.1 Stage-Reduced SMU 3.1.2 Summary 3.2 Proposed Low Swing Static Edge-Triggered Latch 3.2.1 Low Swing Static ETL in 0.13 um CMOS Process 3.2.2 Low Swing Static ETL in 90 nm CMOS Process | | | Chapter 1 Introduction | | | 1.1 Motivation | | | | | | | | | Chapter 2 Overview of Low Power Registers | | | | | | 2.1.1 Master-Slave Flip-Flops | | | 2.1.2 Pulse Registers | | | 2.1.3 Sense-Amplifier-Based Registers | | | 2.1.4 Summary | | | 2.2 Dual-Rail Pulse Registers | | | | | | 2.2.2 Dual-Rail Static Edge-Triggered Latch | | | 2.3 Summary | | | Chapter 3 Design of Proposed Low Power Pulse Register | | | | | | | | | | | | | | | | | | | | | 3.2.3 Low Swing Static ETL in 90 nm MTCMOS Process | | | 3.3 Summary | | | <b>Chapter 4 Implementation of Proposed Survivor Memory Unit</b> | <b>37</b> | |------------------------------------------------------------------|-----------| | 4.1 Architecture of Proposed Survivor Memory Unit | 37 | | 4.1.1 Floorplan and Folded Last Three Stages' SMU | 38 | | 4.1.2 Clock Tree Network Design | 40 | | 4.1.3 Design Flow | 41 | | 4.2 Implementation Results | 44 | | 4.2.1 Physical Layout Descriptions | 44 | | 4.2.2 Post-Layout Simulation Results | 54 | | 4.3 Summary | 58 | | Chapter 5 Conclusion and Future Works | 59 | | 5.1 Conclusion | 59 | | 5.2 Future Works | 60 | | Rihliography | 61 | # List of Tables | Table 1.1 | Overview of representative Viterbi decoders | 4 | |-----------|-------------------------------------------------------------------|----| | Table 3.1 | Comparison results of different registers based on TSMC 0.13 um | 24 | | | CMOS process | | | Table 3.2 | Comparison results of different registers based on UMC 90 nm | 28 | | | CMOS process | | | Table 3.3 | Comparison results of proposed design in different 90 nm CMOS | 31 | | | processes | | | Table 3.4 | Comparison results of proposed Design with and without | 33 | | | MTCMOS technique in 90 nm CMOC process | | | Table 4.1 | Simulation results of proposed low swing static ETL in pre-layout | 54 | | | simulation and post-layout simulation | | | Table 4.2 | Simulation results of the 16 and 13 stages SMU block in | 55 | | | pre-layout simulation and post-layout simulation | | | Table 4.3 | Simulation results of the 29 stages SMU block with and without | 56 | | | I/O buffers and clock tree network in post-layout simulation | | | Table 4.4 | Simulation results of the 29 stages SMU block from EDA | 56 | | | synthesis tool and this work in post-layout simulation | | | Table 4.5 | Simulation results of this SMU hard macro in different operating | 57 | | | voltages for reducing power consumption | | | Table 4.5 | Comparison of this work with recent Viterbi decoders | 58 | # List of Figures | Figure 1.1 | Three major evaluated terms in VLSI designs | 1 | |-------------|---------------------------------------------------------------------|----| | Figure 1.2 | Block diagram of Viterbi Decoder | 3 | | Figure 2.1 | Multiplexer-based flip-flop | 7 | | Figure 2.2 | Transmission-gate flip-flop (TGFF) | 7 | | Figure 2.3 | Glitch latch-timing generation | 8 | | Figure 2.4 | Illustrated pulse registers of the HLFF family | 10 | | Figure 2.5 | Illustrated sense-amplifier-based register | 12 | | Figure 2.6 | Dual-rail dynamic hybrid latch flip-flop (DHLFF) | 13 | | Figure 2.7 | Dual-rail static edge-triggered latch (ETL) | 14 | | Figure 2.8 | Low swing conditional capture flip-flop | 15 | | Figure 3.1 | Block diagram of this survivor memory unit (SMU) | 18 | | Figure 3.2 | Basic D element | 18 | | Figure 3.3 | Interconnection information of this SMU | 19 | | Figure 3.4 | Stage-reducing SMU | 20 | | Figure 3.5 | Proposed low swing static ETL type 1 and its timing diagram | 22 | | Figure 3.6 | Proposed low swing static ETL type 2 | 23 | | Figure 3.7 | The simulation environment of this work | 25 | | Figure 3.8 | Vertical stacked bar charts of different switching activities in | 27 | | | 0.13 um CMOS process | | | Figure 3.9 | Vertical stacked bar charts of different switching activities in 90 | 30 | | | nm CMOS process | | | Figure 3.10 | Proposed low swing static ETL in MTCMOS | 32 | | Figure 3.11 | Vertical stacked bar charts of different switching activities in 90 | 35 | | | nm CMOS low leakage and MTCMOS processes | | | Figure 4.1 | Basic architecture of proposed SMU | 38 | | Figure 4.2 | The floorplan of proposed SMU | 39 | | Figure 4.3 | The D elements-saving in last 3 stages | 40 | | Figure 4.4 | Illustrated clock tree network | 41 | | Figure 4.5 | Local block diagram of clock tree network in floorplan | 41 | | Figure 4.6 | The design flow of this work | 43 | | Figure 4.7 | The layouts of low swing static ETL and 4-to-1 multiplexer | 45 | | Figure 4.8 | The layout of the basic D element (4 * 20 um <sup>2</sup> ) | 46 | | Figure 4.9 | The layout of a partial one stage | 47 | | Figure 4.10 | The interconnection routing between each stage | 48 | | Figure 4.11 | The layout of folded last three stages and their interconnection | 49 | |-------------|---------------------------------------------------------------------|----| | | routing channel as referenced to Figure 4.3 | | | Figure 4.12 | The layout of the whole SMU block (450 * 1540 um²) as | 50 | | | referenced to the floorplan in Figure 4.2 | | | Figure 4.13 | The layout of the SMU hard macro (570 * 1660 um <sup>2</sup> ) with | 51 | | | power ring | | | Figure 4.14 | The partial clock tree network of this work including pre-buffer | 52 | | | and last five stage-buffers as referenced to Figure 4.5 | | | Figure 4.15 | The layout of wire-group power rings in this work | 53 | #### Chapter 1 #### Introduction #### 1.1 Motivation Performance, area and power are always three major evaluated terms in VLSI designs as illustrated in Figure 1.1. With process technology shrinking, the improvements of performance and area are obvious. However, Power dissipation issue in modern VLSI designs is more important than the past especially in System-on-Chip era. The ever increasing on-chip integrations in recent decade have enabled a dramatically increase in system performance and scale. Unfortunately, accompanied with the performance and area improvements, a significant increase in power dissipation and heat density is introduced [1]. In modern VLSI circuitry of mobile systems, especially for handheld audio and video applications, low power considerations are becoming an important issue as battery life and geometry of mobile systems are limited [2-3]. Otherwise, due to the high cost of packaging and cooling requirement below deep submicron CMOS technology, the demand for reliability design will require designers to seek out new technologies and circuit techniques to maintain high performance and long operational lifetimes. Therefore, power and thermal issues have become the major limitation of such systems. To the root of the matter, low power circuitry designs have become more important in modern VLSI and System-on-Chip implementation. **Figure 1.1** Three major evaluated terms in VLSI designs. #### 1.1.1 Modern Digital Communication System In modern digital communication systems, information is required to be transmitted at high data rates especially in wireless local area network (WLAN) [4]. It will result in increasing power dissipation and system complexity. Besides, for enhancing system performance, an efficient error-control code is often employed. Convolutional codes that have been exploited widely in communication system provide a superior error correction capacity while maintaining a reasonable coding complexity. Viterbi algorithm is an optimal solution for decoding convolutional codes with the modest computing resource [5-6]. As the requirement of high transmission rate increasing power dissipation of the system, the error-control mechanism also becomes an additional part of power dissipation in system implementation. In modern digital communication very large-scale integration (VLSI) design, however, power dissipation issue has been more important than the past because of two reasons: one is the limited battery life of portable mobile systems and the other is the high cost of packaging and cooling requirement for reliability in deep submicron technology. Thus, it's suggested that we build systems with low power feature [7]. The Viterbi decoder is constructed from three major units [8]: transition metric unit (TMU), add-compare-select unit (ACSU), and survivor memory unit (SMU) as illustrated in Figure 1.2. Figure 1.2 Block diagram of Viterbi Decoder. TMU calculates the transition metrics (TM) from the input data. ACSU accumulates transition metrics recursively as path metrics (PM), and makes decisions to select the most likely state transition sequence. Finally, SMU traces the decisions to extract this sequence. There are two main different ways to build SMU: traceback and register exchange [9]. The former is built of embedded memory element such as static random-access memory (SRAM), and the later is composed of many registers and multiplexers. The traceback approach is a power efficient solution, but not suitable for high speed application because of the limited bandwidth in embedded memory [10]. The register-exchange approach is more direct and intuitional to trace the most likely state transition sequence and easier to operate at higher speed. But its power consumption is proportional to its size and will increase as data throughput increasing. In [11], the SMU's area of Viterbi accelerator is about 73% of whole chip in physical dimension. Therefore, it is meaning to improve the SMU of a Viterbi decoder. However, due to speed restriction of the limited bandwidth, we decide to choose register-exchange topology for our low power high speed SMU design. In addition, because latches and registers are widely used in sequential circuit design due to their characteristic of data storage, low power latches and registers' researches are very beneficial for modern VLSI design especially in low power demands and have a lot of applications. Table 1.1 shows the overview of the physical implementation information of Viterbi decoders in different generations. The design trend of Viterbi decoder is having lower power consumption and smaller area size while maintaining high performance. 1896 **Table 1.1** Overview of representative Viterbi decoders. | | CMOS<br>Process | Operating<br>Voltage | Power @ Performance | Area | | |-------------|-----------------|----------------------|---------------------|----------------------|--| | [12] | | 3.0 V | 386 mW @ 110MHz | 2 02 2 | | | [12] | 0.6 um | 3.0 V | 776 mW @ 200 MHz | 2.92 mm <sup>2</sup> | | | [13] | 0.25 um | 2.25 V | 570 mW @ 550 MHz | 0.92 mm <sup>2</sup> | | | [4] 0.18 um | | 1.64 V | 4.5 mW @ 6 Mb/s | 2 | | | [4] | 0.16 uiii | 1.98 V | 68 mW @ 54 Mb/s | 5.61 mm | | | [1.4] | 0.13 um | 1.2 V | 970 mW @ 2 Gb/s | 2 | | | [14] | 0.13 um | 1.5 V | 2200 mW @ 2.8 Gb/s | 0.52 mm | | | [15] | 00 nm | 0.7 V | 5 mW @ 54 Mb/s | 2 | | | | 90 nm | 1.2 V | 40 mW @ 500 Mb/s | 0.18 mm | | However, there is no low power standard cell in our available standard cell libraries to be used in general cell-based design flow. Because the register-exchange based SMU contains a lot of registers, if we can design a better register topology especially suitable for low power design in physical implementation to replace the standard register cell, the power reduction effect will be remarkable. Then we decide to follow the fully custom design flow in physical implementation from basic low power register design to whole low power SMU design in this work. Therefore, by means of low power registers used in this SMU design we can achieve the low power demand for Viterbi decoders. In this thesis, we focus on the register-exchange SMU of Viterbi decoders and a low power pulsed edge-triggered latch is proposed both in 0.13 um CMOS process and 90 nm CMOS process for low power high speed SMU of Viterbi decoder. We design this work by bottom-up analysis and implement it following fully custom design flow. The final physical implementation result of the SMU design is given in the end of this thesis. Our long term goal is integrating the fully custom SMU hard macro with other Viterbi decoders' components synthesized from EDA tools. #### 1.2 Organization At the beginning of Chapter 2, the overview of relative low power registers is introduced. Afterwards, the dual-rail dynamic hybrid latch flip-flop and the dual-rail static edge-triggered latch which are chosen for our fundamental topologies are presented. Finally, reasons and features of our design choices of low power registers are summarized in the end of this chapter. In Chapter 3, our low power pulse registers is proposed in this chapter. The register-exchange SMU architecture is introduced first, and stage-reducing SMU is also presented after SMU architecture. Afterwards, the designs and simulation results of our proposed low power static edge-triggered latch based on both TSMC 0.13 um and UMC 90 nm CMOS processes are presented. Finally, the summary is given in the end of this chapter. At the beginning of Chapter 4, we describe the information about physical implementation such like detailed architecture, floorplan and clock tree network design. Afterwards, we show the pre-layout simulation result of proposed SMU design as compared with the post-layout simulation results and its physical layout information. Our fully customed SMU hard macro for Viterbi decoders is implemented in TSMC 0.13 um CMOS process. Conclusions and future works are addressed in Chapter 5. #### Chapter 2 #### Overview of Low Power Registers This chapter begins with several low power registers which are introduced to realize their characteristics. Afterwards, the dual-rail dynamic hybrid latch flip-flop and the dual-rail static edge-triggered latch which are chosen for our fundamental topologies are presented. Finally, reasons and features of our design choices of low power registers are summarized in the end of this chapter. #### 2.1 Relative Low Power Registers Latches and registers are widely used in sequential circuit design due to their characteristic of data storage. The main difference between them is their timing properties. A latch is a level-sensitive device; a register is an edge-triggered storage element and an edge-triggered register is often referred to as a flip-flop as well [16]. There are three major types of registers: master-slave flip-flop, pulse register and sense-amplifier-based register. #### 2.1.1 Master-Slave Flip-Flops Master-slave flip-flops consist of cascading a negative latch (master) with a positive latch (slave) to trigger at clock edge. The multiplexer-based flip-flop (MBFF) in Figure 2.1 is the most common used master-slave flip-flop in VLSI designs especially for standard cell in EDA cell-based design flow. For fairly comparing with other registers, we remove the front and the end invertors for reducing power consumption. However, its' driving ability is also very well as to other registers due to the "keeper" invertors (i.e., cascading two invertors like internal SRAM cell). Figure 2.1 Multiplexer-based flip-flop. Besides the most common used MBFF, there is a modified low power register named transmission-gate flip-flop (TGFF) [17] as illustrated in Figure 2.2. TGFF splits "keeper" inverters to two different feedback paths to maintain data in remained half cycle. Because of the stacking effect [18] in these two feedback loops, a few amount of power consumption could be reduced. **Figure 2.2** Transmission-gate flip-flop (TGFF). However, classical CMOS master-slave flip-flops employ two cascaded transparent latches controlled by true and inverting clocks [19]. Because of this, they are compared unfavorably with transparent latches in terms of power and area. #### 2.1.2 Pulse Registers The idea of pulse registers is to construct a short pulse around the rising (or falling) edge of the clock as illustrated in Figure 2.3 and this pulse acts as the clock input to a latch, sampling the input only in a short window. The advantage of pulse registers is the reduced clock loading and the smaller number of transistors required [16]. In our design experience, only combing a pulse generator with a latch to form a pulse register is not good enough to reduce total power consumption. Although the outer clock loading of pulse register is reduced, the internal node capacitance of latch's clock ports (i.e., the output of pulse generator) is not decreasing. It is the reason why the pulse generator will consume more power than we expecting, and the total power dissipation of pulse register will not be less than expectance. (a) Glitch generation. (b) Glitch clock. **Figure 2.3** Glitch latch-timing generation. Recently, other types of pulse register called edge-triggered latch (ETL) [20] have been designed. Instead of using pulse generator, pulsed clock signals are generated in registers locally, which are used in triggering the transparent latch. The latches are transparent only during a small pulse window and they effectively act as edge-triggered flip-flop. Hybrid latch flip-flop (HLFF) is a famous one of proposed ETLs, and the pulse registers of the HLFF family shown in Figure 2.4 in [21] was discussed very well. However, in additional to their transistor numbers, sizing their transistor sizes for racing problem results in larger cell area in our experience. (a) Hybrid latch flip-flip (HLFF). (b) Modified hybrid latch flip-flip (MHLFF). (c) Double-edge-triggered modified hybrid latch flip-flip (DMHLFF). Figure 2.4 Illustrated pulse registers of the HLFF family. #### 2.1.3 Sense-Amplifier-Based Registers In a sense, the sense-amplifier-based registers illustrated in Figure 2.5 are similar in operation to the pulse registers (i.e., the first stage generates the pulse, and the second latches it). However, sense-amplifier-based registers are used extensively in memory cores and in low-swing bus drivers to sense small input signals and amplify them to generate rail-to-rail swings [16]. However, our survivor memory unit (SMU) of Viterbi decoder contains a lot of registers and multiplexers which is discussed later in detail. If the register cell is more complex and has more transistor number, the area of our SMU will be larger and not suitable for our design goal. Besides, due to the simple architecture of this register-exchange SMU, it doesn't need to amplify signals for speeding the data rate. In [22] and [23], their complex circuit schemes and more transistor number are in the opposite direction to our expectance, so we do not discuss sense-amplifier-based registers in this SMU design. #### **2.1.4 Summary** There are three major types of registers: master-slave flip-flop, pulse register and sense-amplifier-based register. Master-slave flip-flops and sense-amplifier-based registers are not appropriate for register-exchange SMU designs because of their more transistor numbers and larger cell sizes. The larger cell size, the more power will be dissipated from charging and discharging the internal nodes' capacitance. It violates our design purpose: low power high speed SMU. Although pulse registers in HLFF family should be sized very carefully for the racing problem, their smaller clock loads and less transistor numbers are very suitable for our low power high speed SMU design which is discussed in detail in Chapter 3, and the original dual-rail pulse registers' topologies we chosen for modifying our design are introduced in the next section. Figure 2.5 Illustrated sense-amplifier-based register. #### 2.2 Dual-Rail Pulse Registers In this section, we focus on dual-rail pulse registers which are relative to HLFF family. We introduce two dual-rail pulse registers: dual-rail dynamic hybrid latch flip-flop [23] and dual-rail static edge-triggered latch [20]. Based on these two circuits, our modified low power static edge-triggered latch is presented in next chapter. #### 2.2.1 Dual-Rail Dynamic Hybrid Latch Flip-Flop Figure 2.6 illustrates the dual-rail dynamic hybrid latch flip-flop (DHLFF). Referring to the figure, both DHLFF outputs are pre-discharged to the ground when the clock is low. At the rising edge of the clock and depending on the state of data at D, either Q or QB will be asserted. The outputs are held statically as long as the clock remains high. However, the pre-discharge and output-held networks increase the total cell size and the extra power dissipation. Figure 2.6 Dual-rail dynamic hybrid latch flip-flop (DHLFF). #### 2.2.2 Dual-Rail Static Edge-Triggered Latch The simplest type of dual-rail static edge-triggered latches (ETLs) is illustrated in Figure 2.7. It is a simplified DHLFF without the pre-discharge and output-held networks and can be viewed as a static differential cascode voltage switch (DCVS) logic latch which may switch only at the short clock pulse window. Without the pre-discharge network, it is slower than the DHLFF due to the hysteresis associated with toggling the load device [20]. However, it has less power consumption. Figure 2.7 Dual-rail static edge-triggered latch (ETL). Another low power pulsed edge-triggered latch based on dual-rail static ETL is shown in Figure 2.8 and named as low swing conditional capture flip-flop [24]. It is a modified dual-rail static ETL for detecting input switching activity, and it uses NMOS transistors cascoded on the top of inverter chain to restrict voltage swing at $(V_{DD} - V_T)$ for saving power. By using the XOR gate to generate control signal Vg for header NMOS sleep transistor, this power gating device can save power dissipation when next input and current output being the same (i.e., switching activity is zero). **Figure 2.8** Low swing conditional capture flip-flop. Dual-rail pulse registers in this section have two benefits: less clock loads and transistor numbers which are helpful for our low power high speed SMU design. Using only two half invertors (i.e., two PMOS transistors) for keeping data is another reason of reducing power dissipation and cell size. We based on dual-rail pulse registers and proposed our modified edition in Chapter 3. Also the detail timing diagram and operation principle of dual-rail pulse registers are given there. #### 2.3 Summary There are three major types of registers: master-slave flip-flop, pulse register and sense-amplifier-based register. Master-slave flip-flops and sense-amplifier-based registers are not appropriate for register-exchange SMU designs because of their more transistor numbers and larger cell sizes. The larger cell size, the more power will be dissipated from charging and discharging the internal nodes' capacitance. Although pulse registers in HLFF family should be sized very carefully for the racing problem, their smaller clock loads and less transistor numbers are very suitable for our low power high speed SMU design. Dual-rail pulse registers maintain two benefits of pulse registers: less clock loads and transistor numbers which are helpful for our low power high speed SMU design. Otherwise, using only two PMOS transistors keeping data is another reason of reducing power dissipation and cell size. We base on this topology and propose our modified edition in Chapter 3. #### Chapter 3 #### Design of Proposed Low Power Pulse Register Our low power pulse registers is proposed in this chapter. The register-exchange SMU's architecture we adopted is introduced first, and stage-reducing SMU is coming after SMU's architecture. The design concept of our low power SMU is also pointed out here. Afterwards, the design and simulation results of our low power static edge-triggered latch based on both 0.13 um and 90 nm CMOS processes are presented. Finally, the summary is given in the end of this chapter. #### 3.1 Survivor Memory Unit The radix-4 SMU of Viterbi decoder based on register-exchange method is illustrated in Figure 3.1. The primary inputs of this SMU are 64 2-bit constants and 64 2-bit select signals (i.e., S00 to S63) calculated from add-compare-select unit (ACSU). These 64 2-bit constants are composed of sixteen 00s, 01s, 10s and 11s of each in sequence (i.e., {16[00], 16[01], 16[10], 16[11]}). The truncation length is how many stages we use to trace the decisions to extract expected sequence, and is dependant on system specification and transmitted channel property. In our system requirement, the truncation length is 16 or 32 for practical implementation. We use one column to represent one stage (i.e., 16 or 32 columns in our system) and each column in SMU contains 64 basic D elements as shown in Figure 3.2. Each of D elements has one 2-bit 4-to-1 multiplexer and one 2-bit register. Finally, the primary output of this SMU is the output of the first D element in the last stage. Figure 3.1 Block diagram of this survivor memory unit (SMU). Figure 3.2 Basic D element. The interconnection information of this SMU is shown in Figure 3.3. As mentioned above, there are 64 basic D elements in one column. For example, $1\_D_{02}$ identifies the D element in the first column and the third row. The numbers in blocks between two columns of D elements indicate the relative connected port. For instance, $1\_D_{02}$ 's output port is connected to the input ports of $2\_D_{00}$ , $2\_D_{16}$ , $2D\__{32}$ and $2D\__{48}$ . Alternatively, $2D\__{03}$ 's four input ports are one by one linked to the output ports of $1D\__{12}$ , $1D\__{13}$ , $1D\__{14}$ and $1D\__{15}$ . By select signals generated from the ACSU, the 4-to-1 multiplexer in the D element will choose the desired input decision passing to the output to extract targeted sequence. Figure 3.3 Interconnection information of this SMU. #### 3.1.1 Stage-Reduced SMU After tracing the sequence, we found there are a relation between primary input signals and select signals as illustrated in Figure 3.4. Base on the property of primary inputs and register-exchange method, the output patterns of the first three stages are keeping in the same order despite the select signals from ACSU. Therefore we can cancel the first three stages. The select signals from ACSU become the primary inputs of stage-reducing SMU in a specified order. For example of the stage-reduced SMU in our simulation, the truncation length of 16 could be reduced to 13. It results in saving area and power proportional to reduced stages. Figure 3.4 Stage-reducing SMU. #### **3.1.2 Summary** Because the register-exchange based SMU contains a lot of registers and all of them need outer clock signal to work up. If we can reduce the input capacitance of clock port in registers, the clock loading of whole SMU could be scaled down and the total power consumption of the SMU could be also reduced. In the Section 2 of the Chapter 3, we will focus on which type of low power registers could have less clock loading and also less transistor number to restrict the area size of whole SMU. Because this SMU contains a lot of registers, if we can have registers with less transistor number, the total area of the SMU could obviously be reduced as the power consumption at the same time. #### 3.2 Proposed Low Swing Static Edge-Triggered Latch In this section, we proposed our static edge-triggered latch (ETL) and the simulation results are both given here in TSMC 0.13 um and UMC 90 nm CMOS process. As summarized in Chapter 2, dual-rail pulse registers maintain two benefits of pulse registers: less clock loads and transistor numbers which are helpful for our low power high speed SMU design. Then we focus on the dual-rail static edge-triggered latch [20], which is relative to the dual-rail dynamic HLFF in [23]. Low swing conditional capture flip-flop presented in [24] is a modified dual-rail static ETL for detecting input switching activity, and it uses NMOS transistors cascoded on the top of inverter chain to restrict voltage swing at (V<sub>DD</sub> - V<sub>T</sub>) for saving power. However, the data switching activity in the Viterbi decoder is channel-dependent and we decide not to add conditional capture component for saving peripheral XOR gate. #### 3.2.1 Low Swing Static ETL in 0.13 um CMOS Process The timing information and operation principles of dual-rail pulse registers are almost the same with each other and are explained first in this section. We use our proposed low swing static ETL type 1 as an example illustrated in Figure 3.5. The signal CLKB is inverse to the signal CLK and has a little delay time because of the inverter chain. In the usual time, the CLK and CLKB are inverse and the pull down network of the ETL turns off. Therefore, the change in input signal D will not cause any data variation in the register. When both CLK and CLKB are logic level one of short duration, this timing overlap can turn on both transistors M1 and M2. At this moment, the node X is connected to ground directly thus transistors M4 and M5 can sense the input signal D if it changes or not and storage the change after this narrow sampling window until the next overlap of CLK and CLKB. (b) Timing diagram of the low swing static ETL type 1. **Figure 3.5** Proposed low swing static ETL type 1 and its timing diagram. For reducing cell area, we use only one diode-connected NMOS M3 to limit voltage swing of the inverter chain in Figure 3.5 and also only one output buffer inverter. In the static ETL type registers, the pull down network especially M1 and M2, needs to be sized carefully of poor driving ability due to the hysteresis associated with toggling the load device (i.e., M4 and M5) and the series connection of transistors M1 and M2. Furthermore, the gate terminal of M2 is weak logic level one $(V_{DD}-V_T)$ which results in weaker pull down driving ability of ETL. That is the reason why we exchange the gate signals between M1 and M2 from original static ETL to our low swing static ETL for making the bottom transistor M1 have stronger driving ability. Since we do not need the conditional capture component, we move the diode-connected NMOS M3 to the bottom of inverter chain just like footer power gating device in Figure 3.6. Therefore, both M1 and M2 have equal driving ability and pull down network is stronger than Fig. 3.5. After sizing gate width of pull down network M1 and M2 in Figure 3.6, we get smaller cell size of type 2 than type 1. Not only cell size, but also power consumption could be reduced due to smaller driving current in pull down network of our low swing static ETL type 2. Figure 3.6 Proposed low swing static ETL type 2. The caparison result table of registers is listed in Table 3.1. The simulation environment of this work includes the register, input signal buffer, clock buffer and output capacitance 0.7 fF for a single inverter input loading as shown in Figure 3.7. The clock frequency of all registers is at 1 GHz, and the process technology is TSMC 0.13 um CMOS logical process. The operating voltage is 1.2 V. **Table 3.1** Comparison results of different registers based on TSMC 0.13 um CMOS process. | Register | Mux-Based FF | | <b>Trans-Gate</b> | | Proposed | | | Proposed | | | | | |----------------------|--------------|-------|-------------------|---------------|----------|-----------|--------|-----------|-------|--------|-------|-------| | Type | | | | $\mathbf{FF}$ | | | Type 1 | | | Type 2 | | | | <b>Total Gate</b> | 4.05 | | | 4.05 | | | 3.3 | | | 2.7 | | | | Width (um) | | | | | | | | | | | | | | Clock Load | 1.76 | | 1.76 | | | 1.07 | | | 0.75 | | | | | (fF) | 1.76 | | | | | | | | | | | | | Switching | IDLE | 50% | 100% | IDLE | 50% | 100% | IDLE | 50% | 100% | IDLE | 50% | 100% | | Activity | IDEL | 3070 | 10070 | | 30% | 100/3 | E C | 3070 | 10070 | IDLL | 3070 | 10070 | | <b>Core Power</b> | 3.00 | 6.93 | 10.82 | 2,98 | 6.36 | 9.70 | 4.11 | 7.18 | 10.10 | 3.68 | 6.42 | 9.12 | | (uW) | 3.00 | 0.73 | 10.02 | 2.90 | | 5.70 | | 7.10 | 10.10 | 3.00 | 0.42 | 9.12 | | Peripheral | 6.35 | 7.82 | 9.33 | 6.21 | 7.79 | 9.36 | 5.75 | 6.89 | 8.09 | 4.84 | 6.06 | 7.28 | | Power (uW) | 0.33 | 7.62 | 9.33 | 177 | | 0.89 8.09 | | 4.04 0.00 | | 7.20 | | | | <b>Total Power</b> | 9.35 | 14.75 | 20.15 | 9.19 | 14.15 | 19.06 | 9.86 | 14.07 | 18.19 | 8.52 | 12.48 | 16.44 | | (uW) | 7.33 | 14.73 | 20.13 | 7.17 | 14.13 | 17.00 | 7.00 | 14.07 | 10.17 | 0.52 | 12.40 | 10.44 | | Average | | | | | | | | | | | | | | C-Q Delay | 96 | | | 102.5 | | | 124 | | | 101.5 | | | | (ps) | | | | | | | | | | | | | | <b>Power-Delay</b> | | | | | | | | | | | | | | Product | 0.898 | 1.416 | 1.934 | 0.942 | 1.450 | 1.953 | 1.223 | 1.745 | 2.256 | 0.865 | 1.267 | 1.669 | | (10 <sup>-15</sup> ) | | | | | | | | | | | | | **Figure 3.7** The simulation environment of this work. As mentioned in Table 3.1, the total power is the sum of the core power and the peripheral power. The peripheral power which is the power dissipated by input and clock buffers and the output loading is used here to stand for especially showing the importance of reducing system clock load. Because of the stacking effect, the transmission-gate flip-flop (TGFF) has the least core power when data switching activity being not so high. However, TGFF's total gate width and clock load are equal to the multiplexer-based flip-flop (MBFF) that results in large area and peripheral power despite its lower core power. Our proposed type 1 low swing static edge-triggered latch (ETL) proves the benefit of reduced clock load in the peripheral power term. However, due to sizing the pull down network of the proposed type 1 latch for the hysteresis of the pull up network, our type 1 latch consumes greatest core power when data switching activity being low. In addition, the proposed type 1 latch has the greatest average C-Q delay. After changing the diode-connected NMOS to the bottom of the delay chain as the footer power gating device, the transistor sizes of M1 and M2 in Figure 3.6 could be scaled down. Thus, the clock load of our type 2 low swing static ETL is the least one in all of the above and it results in the least peripheral power. The smaller transistors M1 and M2 also bring about the quite low core power dissipation. It is obvious our proposed type 2 low swing static ETL has the lowest total power consumption, the least peripheral power, and the smallest power-delay product as maintaining the comparable speed (i.e., average C-Q delay). Otherwise, our proposed type 2 design also has the smallest cell size and clock load. Vertical stacked bar charts of different registers' power consumption data in 0.13 um CMOS process of different switching activities are given in Figure 3.8 as a short summary. It is clear that our low swing static latch has lower power consumption. (b) Switching activity is 50 %. (c) Switching activity is 100 %. **Figure 3.8** Vertical stacked bar charts of different switching activities in 0.13 um CMOS process. # 3.2.2 Low Swing Static ETL in 90 nm CMOS Process The comparison table based on 90 nm CMOS process is given in Table 3.2. In this 90 nm CMOS process simulation condition, we do not considerate our proposed type 1 low swing static ETL as a beta edition in Section 3.2.1. Otherwise, the simulation environment is the same as illustrated in Figure 3.7. The clock frequency of all registers is at 1 GHz, and the process technology is UMC 90 nm CMOS logic and mixed-mode 1.0V standard performance process. **Table 3.2** Comparison results of different registers based on UMC 90 nm CMOS process. | Register | Mux | -Base | d FF | Tra | Trans-Gate | | Proposed | | | |----------------------|-------|--------|-------|---------------|------------|-----------|----------|-------|-------------| | Type | | | | $\mathbf{FF}$ | | Type 2 | | | | | <b>Total Gate</b> | | 3.24 | | 2.24 | | 2.16 | | | | | Width (um) | | 3.24 | | 3.24 | | 2.10 | | | | | Clock Load | | 1.17 | | | 1 17 | | 0.57 | | | | (fF) | | 1.17 | | 1.17 | | 0.57 | | | | | Switching | IDLE | 50% | 100% | IDLE | 50% | 100% | IDLE | 50% | 100% | | Activity | IDEL | 3070 | 10070 | IDLL | 3070 | 10070 | IDEL | 3070 | 10070 | | <b>Core Power</b> | 1.75 | 4.02 | 6.26 | 1.75 | 3.78 | 5.76 | 2.43 | 4.08 | 5.71 | | (uW) | 1./5 | 4.02 | 0.20 | 1./3 | 3.76 | 5.70 | 2.43 | 4.00 | <b>5171</b> | | Peripheral | 3.50 | 4.49 | 5.50 | 3.49 4.5 | 4.50 | 4.50 5.51 | 2.92 3.5 | 3.55 | 5 4.20 | | Power (uW) | 3.30 | 4.49 | 5.50 | 3.49 | 4.50 | 3.31 | 2.92 | 3.33 | 4.20 | | <b>Total Power</b> | 5.25 | Q 51 🐗 | 11.76 | 5.24 | 8.28 | 11.27 | 5.35 | 7.63 | 9.91 | | (uW) | 3.23 | 6.51 | 1.70 | 3.24<br>1 FA | 6.26 | 11.27 | 3.33 | 7.03 | 9.91 | | Average | | Ē | | P) | 3 6 | | | | | | C-Q Delay | | 52.5 | 1/2 | | 53.7 | | | 40.1 | | | (ps) | | 3 | 18 | 1896 | Ş | | | | | | <b>Power-Delay</b> | | 3 | 2000 | - | T. B. | | | | | | Product | 0.276 | 0.447 | 0.617 | 0.281 | 0.445 | 0.605 | 0.215 | 0.305 | 0.397 | | (10 <sup>-15</sup> ) | | | | | | | | | | Unlike the trend of simulation results in 0.13 um CMOS process, our low swing static ETL has the smallest average C-Q delay to all of the above registers. Although our design still has the smallest cell size and the least clock load, its core power is grater than other two master-slave flip-flops. Because of the always turn-on diodeconnected NMOS M3 in the bottom of delay chain in Figure 3.6 and the always turn-on pull up networks (i.e., the hysteresis of dual-rail topology), the leakage power consumption becomes a major problem of our design. Vertical stacked bar chart of all the above comparisons are given in Figure 3.9 as a sort summary. It is obvious that our low swing static ETL in 90 nm CMOS process has to deal with static leakage power consumption when switching activity is idle. #### (a) Switching activity is IDLE. (b) Switching activity is 50 %. (c) Switching activity is 100 %. **Figure 3.9** Vertical stacked bar charts of different switching activities in 90 nm CMOS process. Next, we simulated our low swing static ETL in UMC 90 nm CMOS logic and mixed-mode 1.2 V low leakage with high threshold voltage process. The operating voltages are 1.2 V and 1.0 V, but other simulation conditions are all the same with the above. As listed in Table 3.3, with low leakage devices the core power decreased. But the legal operating voltage of the low leakage process is 1.2 V, according to the Formula 2.1 in Chapter 2, it resulted in more power dissipation when the switching activity going to be high. Then we reduced the operating voltage from 1.2 V to 1.0 V for saving peripheral power dissipation, and we get lower power consumption data. However, the total power consumption and static leakage power dissipation could be reduced by means of low leakage process; it is noticeable that the speeds of circuits degenerate very much as shown in Table 3.3. The power-delay products in low leakage process are much larger than in standard process due to the performance degeneration. Vertical stacked bar charts of power consumption in different switching activities are given in the end of Section 3.2.3 with other low leakage technique as a short summary. **Table 3.3** Comparison results of proposed design in different 90 nm CMOS processes | Process | Standard | | | Low Leakage | | Low Leakage | | | | |--------------------|-------------|-------|-------|-------------|-------|-------------|-------|-------|-------| | Type | Performance | | | in 1.2 V | | in 1.0 V | | | | | <b>Total Gate</b> | | 2.16 | | 2.16 | | 2.38 | | | | | Width (um) | | 2.10 | | | 2.10 | | 2.38 | | | | Clock Load | | 0.57 | | | 0.40 | | 0.56 | | | | (fF) | | 0.57 | | 0.49 | | | 0.56 | | | | Switching | IDLE | IDLE | IDLE | IDLE | 50% | 100% | IDLE | 50% | 100% | | Activity | IDLE | IDLE | IDLE | IDLE | 3070 | 100% | IDLE | 30% | 10070 | | <b>Core Power</b> | 2.43 | 4.08 | 5.71 | 0.51 | 3.85 | 7.00 | 0.07 | 3.24 | 6.20 | | (uW) | 2.43 | 4.06 | 3./1 | 0.51 | 3.63 | 7.00 | 0.07 | 3.24 | 0.20 | | Peripheral | 2.92 | 3.55 | 4.20 | 3.39 | 4.25 | 5.20 | 2.50 | 3.23 | 3.67 | | Power (uW) | 2.92 | 3.33 | 4.20 | 3.39 | 4.23 | 3.20 | 2.30 | 3.23 | 3.07 | | <b>Total Power</b> | 5.35 | 7.63 | 9.91 | 3.90 | 8.10 | 12.20 | 2.57 | 6.47 | 9.87 | | (uW) | 5.55 | 7.03 | 9.91 | 3.90 | 8.10 | 12.20 | 2.57 | 0.47 | 9.07 | | Average | | 3 | | | F | | | | | | C-Q Delay | 40.1 | | 159.0 | | 319.5 | | | | | | (ps) | | = | 18 | 1896 | S | | | | | | Power-Delay | | - 2 | 2112 | - | TIN . | | | | | | Product | 0.215 | 0.305 | 0.397 | 0.620 | 1.288 | 1.940 | 0.821 | 2.067 | 3.153 | | $(10^{-15})$ | | | | | | | | | | ## 3.2.3 Low Swing Static ETL in 90 nm MTCMOS Process For saving the leakage power dissipation while maintain enough performance, we redesign our low swing static ETL by means of the MTCMOS [19] technique. As described in Chapter 2, the leakage power of gate leakage and sub-threshold leakage consumes the greater part of the static power in deep sub-micron technology. The leakage current is increased because the threshold voltage is lowed down to increase the operation speed. In order to retain the high performance and reduce the sub-threshold leakage current in the same time, the MTCMOS technique uses two kinds of transistors: high V<sub>T</sub> transistor and low V<sub>T</sub> transistor. The low V<sub>T</sub> transistor could be used in the critical path of the circuit to retain the operation speed, and the high $V_T$ transistor is used in the non-critical path to reduce the leakage current. The MTCMOS technique is used to reduce the static power with no overhead in the performance. After utilizing MTCMOS technique, our low swing static ETL is illustrated in Figure 3.10. We use the low threshold voltage device in the critical path to increase the operation speed, and use the high threshold voltage device in the non-critical path and leakage path to reduce both the dynamic power and static power consumption. According to the design issues, the pull up network (i.e., the pre-charged PMOS transistor pair), M4 and M5, diode-connected NMOS M3, and the delay chain are used the high $V_T$ device to reduce the leakage power as shown in Figure 3.10. In our design experience, the pull down network of our design could not be replaced by high $V_T$ device because the diving ability will degenerate and it will result in error operation. **Figure 3.10** Proposed low swing static ETL in MTCMOS. The comparison table based on UMC 90 nm CMOS logic and mixed-mode 1.0V standard performance process with and without MTCMOS technique is given in Table 3.4. Otherwise, the simulation environment is the same as illustrated in Figure 3.7. The clock frequency of all registers is at 1 GHz. As shown in Table 3.4 in Page 36, the low swing static ETL with MTCMOS technique has all advantages to non-MTCMOS low swing static ETL in every compared term. After using MTCMOS technique to replace the delay chain and the pull up network, the core power consumption is extra low than before. Especially in average C-Q delay, due to the weaker pull up network, M4 and M5, the pull down network is much easier to resist the hysteresis of dual-rail topology. Therefore, the low swing static ETL with MTCMOS is faster than it without MTCMOS and it brings about the incredible small power-delay product. **Table 3.4** Comparison results of proposed Design with and without MTCMOS technique in 90 nm CMOC process. | Process | Without | | | With | | | |----------------------|---------|-------|-------|--------|-------|-------| | Type | MTCMOS | | | MTCMOS | | | | <b>Total Gate</b> | | 2.16 | A | 2.16 | | | | Width (um) | | 2.10 | 15 | E | 2.10 | | | Clock Load | | 0.57 | S. | No. | 0.53 | | | (fF) | | 0.57 | 96 | 100 | 0.53 | | | Switching | IDLE | 500/ | 100% | IDLE | 50% | 100% | | Activity | IDLE | 50% | 100% | IDLE | 30% | 100% | | <b>Core Power</b> | 2.43 | 4.08 | 5.71 | 0.12 | 1.45 | 2.79 | | (uW) | 2.43 | 4.08 | 5./1 | 0.12 | 1.45 | 2.19 | | Peripheral | 2.92 | 3.55 | 4.20 | 2.87 | 3.55 | 4.21 | | Power (uW) | 2.92 | 3.33 | 4.20 | 2.87 | 3.33 | 4.21 | | <b>Total Power</b> | 5.35 | 7.63 | 9.91 | 2.99 | 5.00 | 7.00 | | (uW) | 3.33 | 7.03 | 9.91 | 2.99 | 5.00 | 7.00 | | Average | | | | | | | | C-Q Delay | 40.1 | | | | 26.8 | | | (ps) | | | | | | | | Power-Delay | | | | | | | | Product | 0.179 | 0.256 | 0.332 | 0.065 | 0.109 | 0.153 | | (10 <sup>-15</sup> ) | | | | | | | Vertical stacked bar charts of different registers' power consumption data in 90 nm CMOS low leakage and MTCMOS processes of different switching activities are given in Figure 3.11 as a short summary. It is apparent that our low swing static latch with MTCMOS technique has lower power consumption except for idle condition. (c) Switching activity is 50 %. (c) Switching activity is 100 %. **Figure 3.11** Vertical stacked bar charts of different switching activities in 90 nm CMOS low leakage and MTCMOS processes. In this section, we introduce our modified dual-rail static ETL named low swing static ETL. Based on simulating in both 0.13 um CMOS process and 90 nm CMOS process, our design has the lowest total power consumption and peripheral power dissipation, the smallest cell size and clock load and the incredible small power-delay product especially in 90 nm CMOS process with MTCMOS technique. The smaller cell size and less clock load are suitable for our register-exchange SMU design and we base on our low swing static ETL to build the low power high speed SMU of Viterbi decoder in Chapter 4. ### 3.3 Summary In Chapter 3, we began with introducing the architecture of our register-exchange SMU of Viterbi decoder. Because the register-exchange based SMU contains a lot of registers and all of them need outer clock signal to stimulus. If we can reduce the input capacitance of clock port in registers, the clock loading of whole SMU could be scaled down and the total power consumption of the SMU could be also reduced. Otherwise, because this SMU contains a lot of registers, if we can have registers with less transistor number, the total area of the SMU could obviously be reduced as the power consumption at the same time. Therefore, pulse registers which have two benefits: less clock loads and transistor numbers become our design choice of registers. Next, we propose our modified dual-rail static ETL named low swing static ETL. According to the simulation results in 0.13 um CMOS process and 90 nm CMOS process, our low swing static ETL has the lowest total power consumption and peripheral power dissipation, the smallest cell size and clock load and the incredible small power-delay product especially in 90 nm CMOS process with MTCMOS technique. We will base on our low swing static ETL to construct the low power high speed SMU of Viterbi decoder in Chapter 4. # Chapter 4 # Implementation of Proposed Survivor Memory Unit The physical implementation description and post-layout simulation results of our proposed survivor memory unit are presented in this chapter. We first begin with detailed architecture of our register-exchange based SMU design in Section 4.1. The floorplan and clock tree network design of proposed SMU are also given in this section. Next, Section 4.2 demonstrates our fully custom layouts of basic cells and the whole SMU block. Afterwards, the comparison results between pre-layout simulation and post-layout simulation are available in the end of this chapter. These post-layout simulation results are based on TSMC 0.13 um CMOS technology. # 4.1 Architecture of Proposed Survivor Memory Unit In this section, first we introduce the detailed architecture of our proposed radix-4 SMU in Figure 4.1. Although the non stage-reduced register-exchange SMU with 16 stages is enough to trace the expected sequence in general case, we use 32 stages' architecture for more accuracy in physical implementation. After using stage-reducing technique as mentioned in Chapter 3, the basic architecture of our SMU design has 29 stages (i.e., stage 00 to stage 28) as in Figure 4.1. The primary input d0 to d63 are 64 2-bit select signals from add-compare-select unit (ACSU), and the interconnection information between each stage is the same as described in Chapter 3. The consideration for physical implementation such like floorplan and clock tree network design are discussed later. Also for saving area of whole SMU, the folded last three stages' SMU is introduced. In the end of this section is the design flow of this work. Figure 4.1 Basic architecture of proposed SMU. # 4.1.1 Floorplan and Folded Last Three Stages' SMU The floorplan of this proposed SMU is shown in Figure 4.2. In our radix-4 architecture of SMU, the data are all 2-bit signals. For reducing the complexity of metal line interconnection between each stages of SMU for physical route, we divide whole memory core into two parts as shown in Figure 4.2. Each 1-bit SMU block handles 1-bit input signal and combining these two blocks' output we can get desired 2-bit output data. Then we set the clock buffer in the middle of the floorplan to separate those two 1-bit SMU blocks. Because the clock network of a memory module is very important in VLSI design, according to this floorplan, the clock network of proposed SMU could distribute the clock signal to each stage in SMU evenly. The detailed clock network design of proposed SMU is described in next sub-section later. Finally, the input buffers are set to the right side of the floorplan as common seen and the output buffers are included in those two 1-bit SMU blocks. **Figure 4.2** The floorplan of proposed SMU. As shown in Figure 4.1, the primary output port of our SMU is the output of the first D element in the last stages (i.e., stage 28). It is obvious that other 63 D elements from D28\_1 to D28\_63 can be eliminate from our SMU block in physical implementation. Due to the interconnection information mentioned in Chapter 3, not only the last stages has redundant D elements could be removed, but also the stage 27 and stage 26 have redundant D elements could be eliminated in physical design. The detailed D elements-saving in last 3 stages is illustrated in Figure 4.3. Besides the stage 28 only needs one D element, the stage 27 needs four D elements and stage 26 needs sixteen ones. In this rectangle figure, the bottom part of stage 26, stage 27 and stage 28 shows a lot of area could be saved in physical implementation if we can integrate the last three stages 26, 27 and 28 in one physical column domain. We call this concept "Folded Last Three Stages' SMU," and the physical layout of the folded last three stages' SMU is given in the Section 4.2.1. **Figure 4.3** The D elements-saving in last 3 stages. # 4.1.2 Clock Tree Network Design The clock network with fan-out information of this work is illustrated in Figure 4.4. After calculating load capacitance, the pre-buffer of clock signal is an inverter chain composed of four inverters which are 1:1:4:16 in aspect ratios. The outputs of pre-buffer are connected to twenty six stage-buffers for stage 00 to stage 25 and the folded last three stage (i.e., stage 26, 27 and 28). Each stage-buffer in Figure 4.4 is composed of four inverters which are 1:4:16:64 in aspect ratios for driving each stage. As mentioned in Section 4.1.1 we setting the clock buffer in the middle of the floorplan to distribute the clock signal in SMU evenly, the local detailed block diagram of clock tree network is shown in Figure 4.5 as reference to Figure 4.2. The most interesting thing is that the clock tree network is fed with the clock signal from the right side of the floorplan as shown in Figure 4.2 which does not usually occur as input signals. The reason why we feeding the clock tree network by this way is making sure to avoid positive clock skew in our SMU design. Because the register-exchange based SMU acts like a lot of shift registers with multiplexers, the delay between two registers (i.e., the delay of the multiplexer) is too small to keep away from hold time violation. If we feed the clock tree network from the right side of the network, the negative clock skew to the whole clock tree network is helpful for avoiding hold time violation in physical implementation. Figure 4.5 Local block diagram of clock tree network in floorplan. #### 4.1.3 Design Flow In order to accomplish the implementation of proposed SMU design especially focusing on low power and small area specifications, we choose our low power low swing static edge-triggered latch (ETL) as a basic register cell which is different to traditional master-slave flip-flops using in today's cell-based design flow. With the purpose of integrating our proposed low swing static ETL in this SMU design, we decide to take the fully custom solution to implement this hard memory macro. The design flow of this work is described in this sub-section and illustrated in Figure 4.6. According to the architecture conforming to the specification, we first establish the netlist file of proposed SMU design in Hspice format in order to run pre-layout simulation by means of Synopsys Hspice simulator. After making sure there being no time violation and the function of proposed SMU being correct, we start to convert the schematic of this design to physical layout by using layout editor Cadence Virtuoso. When finish the layout of this work, there are three major steps need to be done before running the post-layout simulation. The first two steps of above are usually regarded as post-layout verification steps. One is the design rule check (DRC), and the other is the layout versus schematic (LVS). DRC checks the data of physical layouts against the design rules of fabrication, because manufacturing processes have inherent limitations in accuracy and the design rule documents specify geometry of masks which will provide reasonable yields. LVS checks the identity of physical layouts to its relative schematic netlist files. As shown in Figure 4.6, after DRC checking, we use the original Hspice netlist to campare with the physical layout information from Virtuoso in LVS step. There might be a lot of iterations to fix the physical layouts to obey the DRC documents and to match the original netlists if there is any error in these two post-layout verification steps. After confirming the identity of netlists and physical layouts, we then start to run the step of layout parameter extraction (LPE) preparing for post-layout simulation. LPE can extract layout parameters including parameters of transistors, parasitic capacitors and resistors, and the extracted netlist could be used in post-layout transistor-level simulation. All of above these three steps, DRC, LVS and LPE, could be completed by the aid of Mentor Calibre. Finally we get the extracted netlist file after LPE, and then we can run the post-layout simulation in Synopsys Nanosim environment as a fast Spice simulator especially suitable for digital VLSI design. Figure 4.6 The design flow of this work. #### **4.2 Implementation Results** In this section, fully custom physical layouts of this work are introduced first. Not only the basic D element's layout is demonstrated, but also the whole SMU block's layout is shown later. The post-layout simulation results calculated from Synopsys Nanosim simulator are given and also compared with the pre-layout simulation data in Section 4.2.2. In this section, all the physical implementations are based on TSMC 0.13 um CMOS technology. ### **4.2.1 Physical Layout Descriptions** The physical layouts of proposed low swing static ETL and the 4-to-1 multiplexer is shown in Figure 4.7. The standard cell sizes of each one are the same because of their similar transistor numbers and the convenience for integrating. The height of the standard cell is about 4 um and the width is about 10 um. (a) The layout of proposed low swing static ETL $(4 * 10 \text{ um}^2)$ . (b) The layout of the 4-to-1 multiplexer $(4 * 10 \text{ um}^2)$ . **Figure 4.7** The layouts of low swing static ETL and 4-to-1 multiplexer. The layout of the basic D element given in Figure 4.8 is 4\*20 um² in dimension and it is obvious that the height of the standard cell is restricted by the 4-to-1 multiplexer's four input ports. That is why the area utilization of our low swing static ETL is not as small as possible. The Figure 4.9 shows the layout of partial one stage constructed of 64 basic D elements. The interconnection routing between each stage is illustrated in Figure 4.10. Because the interconnection is complex due to a lot of input ports of each basic D element, the area for interconnection routing dominates the whole SMU design in physical area utilization. The folded last three stages' layout is shown in Figure 4.11 for explaining how we saving area in physical implementation as referenced to Figure 4. 3. The whole SMU block including I/O buffers and the clock tree network based on the floorplan in Section 4.1.1 and Section 4.1.2 is presented in Figure 4.12 as referenced to Figure 4.2. The core area of the whole SMU block is 450 \* 1540 um². Finally, the completed SMU hard macro including whole SMU block in Figure 4.13 with the power ring of itself is given in Figure 4.14. After adding power ring, the area of proposed SMU hard macro is 570 \* 1660 um². The partial clock tree network is illustrated in Figure 4.14 as referenced to Figure 4.5, and the layout of wire-group power rings for distributing supply current evenly in whole hard macro in this design is demonstrated in Figure 4.15. **Figure 4.8** The layout of the basic D element $(4 * 20 \text{ um}^2)$ . **Figure 4.9** The layout of a partial one stage. Figure 4.10 The interconnection routing between each stage. **Figure 4.11** The layout of folded last three stages and their interconnection routing channel as referenced to Figure 4.3. **Figure 4.12** The layout of the whole SMU block (450 \* 1540 um<sup>2</sup>) as referenced to the floorplan in Figure 4.2. **Figure 4.13** The layout of the SMU hard macro (570 \* 1660 um<sup>2</sup>) with power ring. **Figure 4.14** The partial clock tree network of this work including pre-buffer and last five stage-buffers as referenced to Figure 4.5. Figure 4.15 The layout of wire-group power rings in this work. #### **4.2.2 Post-Layout Simulation Results** Before we show the post-layout simulation result of this SMU hard macro, we give the post-layout simulation result of proposed low swing static ETL compared with the pre-layout simulation result in Chapter 3 first. After Calibre LPE extraction, the post-layout simulation shows the degeneration as compared with the pre-layout simulation in power and speed domain due to the parasitic capacitors and resistors in physical layout. The Table 4.1 shows the comparison results of pre-layout and post-layout simulations. The peripheral powers in these two conditions are almost the same, but the core power of post-layout simulation is slightly higher than pre-layout simulation. The average C-Q delay has the same situation in this comparison, and they result in larger power-delay product in post-layout simulation. **Table 4.1** Simulation results of proposed low swing static ETL in pre-layout simulation and post-layout simulation. | ميالللان. | | | | | | | |------------------------------------------|-------|-----------------|--------|---------------------------|-------|-------| | Simulation<br>Type | 30/ | e-Lay<br>nulati | ALC: V | Post-Layout<br>Simulation | | | | Total Gate<br>Width (um) | 2.7 | | | 2.7 | | | | Clock Load<br>(fF) | 15 | 0.75 | MIN | 0.748 | | | | Switching<br>Activity | IDLE | IDLE | IDLE | IDLE | 50% | 100% | | Core Power (uW) | 3.68 | 6.42 | 9.12 | 4.41 | 8.14 | 11.80 | | Peripheral<br>Power (uW) | 4.84 | 6.06 | 7.28 | 4.93 | 6.10 | 7.30 | | Total Power (uW) | 8.52 | 12.48 | 16.44 | 9.34 | 14.24 | 19.10 | | Average<br>C-Q Delay<br>(ps) | 101.5 | | | 104.8 | | | | Power-Delay Product (10 <sup>-15</sup> ) | 0.865 | 1.267 | 1.669 | 0.979 | 1.492 | 2.002 | Because in whole SMU block pre-layout simulation we use only 16 and 13 stages to verify the functionality for reducing simulation complexity, so we first show the 13 stages SMU post-layout simulation results for comparing in Table 4.2. After stage-reducing, the pre-layout simulation result of 13 stages SMU block has lower power consumption than 16 stages SMU and the saved power is proportional to stages which are removed. In post-layout simulation result, we can see that the power consumption is about 2 mW higher than pre-layout simulation due to the parasitic capacitors and resistors in the layout. Besides, in this comparison we only focus on the functionality is correct or not. The simulation circuit includes only SMU core without I/O buffers and the clock tree network. We will show the post-layout simulation results of whole 29 stages SMU block (i.e., 32 stages after stage-reducing) with I/O buffers and the clock tree network later in implementation for practical applications. **Table 4.2** Simulation results of the 16 and 13 stages SMU block in pre-layout simulation and post-layout simulation. | Simulation | 16 Stages | 13 Stages | | | |-------------|-------------|-------------|-------------|--| | Target | SMU Block | SMU | Block | | | Simulation | Pre-Layout | Pre-Layout | Post-Layout | | | Type | Simulation | Simulation | Simulation | | | Operation | 1.2 V | 1.2 V | 1.2 V | | | Voltage | 1.2 V | 1.2 V | 1.2 V | | | Clock | 1 GHz | 1 GHz | 1 GHz | | | Frequency | 1 GHZ | 1 GHZ | I GHZ | | | Power | 15.26 mW | 13.53 mW | 15.42 mW | | | Consumption | 13.20 III W | 13.33 III W | 15.42 mW | | The post-layout simulation results of whole SMU block with and without I/O buffers and the clock tree network are given in Table 4.3. The total power consumption of the completed SMU hard macro including I/O buffers and the clock tree network is 41.10 mW operating in 1.2 V at 1 GHz. It is clear that the I/O buffers and the clock tree network consume about 23.9% of total power consumption. Without our proposed low swing static ETL, the system clock loading of traditional master-slave flip-flops is double to this work and the total power consumption will be more than this number obviously. **Table 4.3** Simulation results of the 29 stages SMU block with and without I/O buffers and clock tree network in post-layout simulation. | Simulation | 29 Stages SMU Block | 29 Stages SMU Block | | | |-------------|-------------------------|----------------------|--|--| | | without I/O Buffers and | with I/O Buffers and | | | | Target | Clock Tree Network | Clock Tree Network | | | | Simulation | Post-Layout | Cimulation | | | | Type | Fost-Layout | Simulation | | | | Operation | 1.2 V | 1.2 V | | | | Voltage | 1.2 V | 1.2 V | | | | Clock | 1 GHz | 1 GHz | | | | Frequency | T GHZ | I UIIZ | | | | Power | 31.28 mW | 41.10 mW | | | | Consumption | 31.20 III W | 41.10 IIIW | | | The comparison result between SMU designs in cell-based design flow and fully custom design flow is given in Table 4.4, and in this operation condition the operating frequency 250 MHz is fast enough to meet our system specification. The power consumption of the SMU synthesized from EDA tool is 37.61 mW, and our fully custom SMU hard macro consumes only 10.28 mW under the equal test pattern. Our fully custom SMU hard macro's power consumption is merely about 27.3% when compared to the power consumption of the synthesized SMU. **Table 4.4** Simulation results of the 29 stages SMU block from EDA synthesis tool and this work in post-layout simulation. | Simulation | 29 Stages SMU from | Fully Custom 29 Stages | | | | |-------------|------------------------|------------------------|--|--|--| | Target | EDA Synthesis Tool | SMU Hard Macro | | | | | Simulation | Post-Layout Simulation | | | | | | Type | | | | | | | Operation | 1.2 V | 1.2 V | | | | | Voltage | 1.2 V | 1.2 V | | | | | Clock | 250 MHz | 250 MHz | | | | | Frequency | 250 WITZ | 2.50 WIIIZ | | | | | Power | 37.61 mW | 10.28 mW | | | | | Consumption | 57.01 III W | 10.28 III W | | | | According to Equation 2.1 in Chapter 2, the most direct way for reducing dynamic power is lowering the operating voltage because of its square property. In our SMU design we have our own local power ring and it is convenient if we can implement our Viterbi decoder in multi-level voltage environment such like the voltage islands in [25-27], the operating voltage of this SMU hard macro could decrease and so as the power consumption. Table 4.5 shows the simulation result after lowering the operating voltage to 1.0 V, the power consumption of this work is 6.63 mW in 0.13 um CMOS process. It evidences that reducing supply voltage level is powerful in low power applications and the reduced part of power consumption is almost proportional to the calculated answer by using Equation 2.1. **Table 4.5** Simulation results of this SMU hard macro in different operating voltages for reducing power consumption. | Simulation<br>Target | Fully Custom 29 Stages SMU Hard Macro | | | | | | |----------------------|---------------------------------------|---------|--|--|--|--| | Simulation<br>Type | Post-Layout Simulation | | | | | | | Operation<br>Voltage | 1.2 V | 1.0 V | | | | | | Clock<br>Frequency | 250 MHz | 250 MHz | | | | | | Power<br>Consumption | 10.28 mW | 6.63 mW | | | | | In the end of this section, we compare the power data of this work with recent Viterbi decoders. After gate-level simulating of other Viterbi decoder's components, the power consumption of transition metric unit (TMU) and add-compare-select unit (ACSU) are 13.78 and 43.80 mW at 250 MHz respectively. Table 4.6 shows the ratio of the power consumption to the clock rate. Although the power data of this work is get from gate-level simulation and [17-18] are physical measurements on chip, the advantage of this work to reference [17] is noticeable enough even if there is having degeneration after on chip measurement. Finally, the reference [18] is our next design goal in the next 90 nm process generation. **Table 4.6** Comparison of this work with recent Viterbi decoders. | | Process | Operating<br>Voltage | Power @ Performance | |---------------|---------|----------------------|---------------------| | This | 0.13 um | 1.2 V | | | Work | CMOS | 1.2 V | 67.86 mW @ 500 Mb/s | | [1 <b>7</b> ] | 0.13 um | 1.2 V | 970 mW @ 2 Gb/s | | [17] | CMOS | 1.5 V | 2200 mW @ 2.8 Gb/s | | Г1 <b>0</b> 1 | 90 nm | 0.7 V | 5 mW @ 54 Mb/s | | [18] | CMOS | 1.2 V | 40 mW @ 500 Mb/s | #### 4.3 Summary In the beginning of Chapter 4, we start at describing the architecture of our proposed SMU clearly. We adopt the 29 stages register-exchange based SMU architecture as our design for real implementation after stage-reducing. In our 2-bit radix-4 SMU floorplan, for reducing the interconnection complexity in local routing we split this 2-bit SMU core into two parts in physical implementation and each part deals with 1-bit signals. From calculating node capacitances, we design the appropriate clock tree network including clock buffers carefully. Also for distributing the clock signal evenly in this design, we insert the clock tree network between these two 1-bit SMU cores in floorplan. And we introduced the concept "the folded last three stages' SMU" for saving area of redundant basic D elements in last three stages in this SMU design. In the end of Section 4.1 are the fully custom design flow and the relative EDA tools used in this work. In the second part in Chapter 4, we demonstrate several fully custom layouts including basic cells and the completed SMU block used in this work. After post-layout simulation, the proposed completed SMU hard macro in our specified test patterns consumes 41.10 mW at 1 GHz clock rate and the operating voltage is 1.2 V. The area of this work is 570 \* 1660 um², and these power and area data are very outstanding as reference to synthesized SMU designs from cell-based design flow. Then, the concept of "Voltage islands" is used to reduce total power consumption of the whole Viterbi decoder. After lowering the operating voltage to 1.0 V, the power consumption of this work is 6.63 mW at 250 MHz under the same test pattern. The comparison of this work with recent Viterbi decoders is in the end of this chapter. # Chapter 5 ## Conclusion and Future Works #### **5.1 Conclusion** Viterbi decoders are widely used in communication systems as decoding convolutional codes which provide a superior error correction capacity while maintaining a reasonable coding complexity and computing resource. However, power dissipation in modern VLSI designs has become a more critical issue in communication systems and System-on-Chip (SoC) era which have emphasized low power features due to the limited battery life. For this reason, the goal of this work is to implement a low power survivor memory unit (SMU) hard macro for Viterbi decoders. In this thesis, a fully custom SMU hard macro, which is composed of a lot of registers, with low power feature for Viterbi decoders is presented. We choose the register-exchange based SMU as our topology. After introducing relative low power registers, we proposed our low power low swing static edge-triggered latch (ETL) which is very suitable for the register-exchange SMU design due to its less clock loading and fewer transistors number at power and area domain. We not only simulate proposed low swing static ETL in 0.13 um CMOS process, but also in 90 nm CMOS process especially interesting in leakage power dissipation. The detailed simulation results and discussions are addressed in Chapter 3. Then we based on our proposed low swing static ETL to construct the low power SMU hard macro by means of the fully custom design flow, and the implementation environment of this work is TSMC 0.13 um CMOS process. The area of this SMU hard macro including the power ring is 570 \* 1660 um², and the power consumption under the specified test pattern is 10.28 mW operating in 1.2 V at 250 MHz and is merely about 27.3% when compared with the synthesized SMU from EDA synthesizer. Lastly, after lowering the operating voltage to 1.0 V, the total power consumption of this work is 6.63 mW at 250 MHz. The detailed physical layout information and simulation data are given in Chapter 4. Finally, with the soft behavior model this SMU hard macro becomes a completed hard intellectual property (IP). #### **5.2 Future Works** Although we have completed a low power SMU hard macro design for Viterbi decoders, to integrate this work with other synthesized blocks of Viterbi decoders by means of EDA tools is another challenge. Besides, the data-gating and clock-gating techniques which are very powerful in low power applications could be utilized in implementation of SMU for saving input and clock buffers' power dissipation. In modern VLSI designs the demand of design for testability (DFT) is more and more important to the past because the larger chip size and complexity are disadvantageous to the yield of manufacture. Adding scan chain is an effective and widely used way in cell-based design flow for DFT. However, in our fully custom SMU hard macro we can not add scan mechanism easily. The DFT issue in this work needs to be solved in the future. Besides, the complex interconnect routing between each stages in proposed SMU design occupied more area than stages itself. Besides, the fully custom routing takes a lot of time and is not an efficient solution while the system specifications of Viterbi decoders change. If we build the proposed low power low swing static ETL into the standard cell library, not only the complex interconnect routing could be done efficiently but also the DFT problem could be solved easily by means of convenient EDA tools in the cell-based design flow. # **Bibliography** - [1] M. Powell, S.-H. Yang, B. Falsafi, K. Roy, and T. N. Vijaykumar, "Gated-Vdd: A Circuit Technique to Reduce Leakage in Deep-Submicron Cache Memories," in *International Symposium on Low Power Electronics and Design*, 2000. - [2] M. Nakai, S. Akui, K. Seno, T. Meguro, T. Seki, T. Kondo, A. Hashiguchi, H. Kawahara, K. Kumano, and M. Shimura, "Dynamic Voltage and Frequency Management for a Low-Power Embedded Microprocessor," *IEEE Journal of Solid-State Circuits*, vol. 40, pp. 28-35, January 2005. - [3] J. T. Kao, M. Miyazaki, and A. P. Chandrakasan, "A 175-mV Multiply-Accumulate Unit Using an Adaptive Supply Voltage and Body Bias Architecture," *IEEE Journal of Solid-State Circuits*, vol. 37, pp. 1545-1554, November 2002. - [4] C.C. Lin, Y.H. Shih, H.C. Chang, and C.Y. Lee, "Design of a Power-Reduction Viterbi Decoder for WLAN Applications," IEEE Transactions on Circuits and Systems, vol. 52, no. 6, JUNE 2005. - [5] A. J. Viterbi, "Error bounds for convolutional codes and asymptotically optimum decoding algorithm," IEEE Trans. Inf. Theory, vol. IT-13, no. 2, pp. 260–269, Apr. 1967. - [6] J. G. D. Forney, "The Viterbi algorithm," Proc. IEEE, vol. 61, no. 3, pp.268–278, Mar. 1973. - [7] W.M. Chan; Herming Chiueh, "A block-level optimization of comprehensive thermal-aware power management for SoC integration in nano-scale CMOS technology," Circuits and Systems, 2005. 48th Midwest Symposium on, August 7-10, 2005. - [8] G. Fettweis and H. Meyr, "A 100 MBit/s Viterbi decoder chip: Novel architecture and its relization," in Proc. IEEE Int. Conf. Communications (ICC), vol. 2, Aug. 1990, pp. 463–467. - [9] Ranpara, S.; Dong Sam Ha, "A low-power Viterbi decoder design for wireless communications applications" ASIC/SOC Conference, 1999. Proceedings. Twelfth Annual IEEE International, 15-18 Sept. 1999 - [10] C. M. Rader, "Memory management in a Viterbi decoder," IEEE Trans. Commun., vol. 29, no. 9, pp. 1399–1401, Sep. 1981. - [11] M. Anders, S. Mathew, R. Krishnamurthy and S. Borkar, "A 64-state 2GHz SOOMbps 40mW Viterbi Accelerator in 90nm CMOS," 2004 Symposium On VLSI Circuits Digest of Technical Papers - [12] S. Sridharan and L. R. Carley, "A 110 MHz 350 mW 0.6 m CMOS 16-State Generalized-Target Viterbi Detector for Disk Drive Read Channels," IEEE TRANSACTIONS ON SOLID-STATE CIRCUITS, VOL. 35, NO. 3, MARCH 2000. - [13] V. S. Gierenz, O. Weiß, T. G. Noll, I. Carew, J. Ashley and R. Karabed, "A 550 Mb/s Radix-4 Bit-Level Pipelined 16-State 0.25-μm CMOS Viterbi Decoder," Application-Specific Systems, Architectures, and Processors, 2000. Proceedings. IEEE International Conference on. - [14] N. Bruels, E. Sicheneder, M. Loew, A. Schackow, J. Gliese and C. Sauer, "A 2.8 Gb/s, 32-State, Radix-4 Viterbi Decoder Add-Compare-Select Unit," 2004 Symposium On VLSI Circuits Digest of Technical Papers. - [15] M. Anders, S. Mathew, R. Krishnamurthy and S. Borkar, "A 64-state 2GHz SOOMbps 40mW Viterbi Accelerator in 90nm CMOS," 2004 Symposium On VLSI Circuits Digest of Technical Papers. - [16] J.M. Rabaey, "Digital Integrated Circuits: A Design Perspective," Prentice Hall, New Jersey, 1996. - [17] G. Gerosa et al., "A 2.2 W, 80 MHz superscalar RISC microprocessor," IEEE J.Solide-State Circuits, vol. 29, DEC 1994. - [18] V. De, Y. Ye, A. Keshavarzi, S. Narendra, J. Kao, D. Somasekhar, R. Nair, and S. Borkar, "Techniques for leakage power reduction," in Design of High-Performance Microprocessor Circuits, A. Chandrakasan, W. Bowhill, and F. Fox, Eds. Piscataway, NJ: IEEE, 2001, ch. 3, pp. 52–55. - [19] Ding Li, P. Mazumder, N. Srinivas, "A dual-rail static edge-triggered latch". ISCAS'01, pp.645-648 vol. 2 - [20] S. H. Rasouli, A. Khademzadeh, A. Afzali-Kusha, and M. Nourani, "Low power single- and double-edge-triggered flip-flops for high speed applications," IEE Proc.Circuits Devices Syst. Vol. 152, no. 2, April 2005 - [21] S.-D. Shin and B.-S. Kong, "Variable sampling window flip-flop for low power high-speed VLSI," IEE Proc.Circuits Devices Syst. Vol. 152, no. 3, June 2005. - [22] H. Zhang, P. Mazumber, "Design of a new sense amplifier flip-flop with improved power-delay product," Circuits and Systems, 2005. ISCAS 2005. IEEE International Symposium on 23-26 May 2005 Page(s):1262 1265 Vol. 2 - [23] H. Partovi, et al., "Flow-Through Latch and Edge-Triggered Flip-Flop Hybrid Elements," IEEE biteniatiorzal Solid-State Circuits Corlfereiice, pp. 138-139, Feb. 1996. - [24] Chi-Ken Tsai , Po-Tsang Huang and Wei Hwang, "Low Power Pulsed Edge-Triggered Latches Design", 16th VLSI Design/CAD Symposium, 2005 - [25] K. Usami and M. Horowitz, "Clustered Voltage Scaling Technique for Low-Power Design," in *International Symposium on Low Power Electronics and Design*, 1995. - [26] D. E. Lackey, P. S. Zuchowski, T. R. Bednar, D. W. Stout, S. W. Gould, and J. M. Cohn, "Managing Power and Performance for System-on-Chip Designs Using Voltage Islands," in *International Conference on Computer Aided Design*, 2002. - [27] R. Puri, L. Stok, J. Cohn, D. Kung, D. Pan, D. Sylvester, A. Srivastava, and S. Kulkarni, "Pushing ASIC Performance in a Power Envelope," in *40th Design Automation Conference*, 2003.