## Springer Series in Advanced Microelectronics 34

## Philip Teichmann

## Adiabatic Logic

Future Trend and System Level Perspective

Springer

## Springer Series in

## ADVANCED MICROELECTRONICS

Series Editors: K. Itoh T.H. Lee T. Sakurai W.M.C. Sansen D. Schmitt-Landsiedel

The Springer Series in Advanced Microelectronics provides systematic information on all the topics relevant for the design, processing, and manufacturing of microelectronic devices. The books, each prepared by leading researchers or engineers in their fields, cover the basic and advanced aspects of topics such as wafer processing, materials, device design, device technologies, circuit design, VLSI implementation, and subsystem technology. The series forms a bridge between physics and engineering and the volumes will appeal to practicing engineers as well as research scientists.

Philip Teichmann

## Adiabatic Logic

Future Trend and System Level Perspective

Dr.-Ing. Philip Teichmann<br>Lehrstuhl für Technische Elektronik<br>Technische Universität München<br>Arcisstrasse 21<br>80333 Munich<br>Germany<br>Teichmann@tum.de

## Series Editors:

Dr. Kiyoo Itoh
Hitachi Ltd., Central Research Laboratory, 1-280 Higashi-Koigakubo, Kokubunji-shi, Tokyo 185-8601, Japan

Professor Thomas H. Lee
Department of Electrical Engineering, Stanford University, 420 Via Palou Mall, CIS-205, Stanford, CA 94305-4070, USA

Professor Takayasu Sakurai
Center for Collaborative Research, University of Tokyo, 7-22-1 Roppongi, Minato-ku, Tokyo 106-8558, Japan

Professor Willy M.C. Sansen
ESAT-MICAS, Katholieke Universiteit Leuven, Kasteelpark Arenberg 10, 3001 Leuven, Belgium

Professor Doris Schmitt-Landsiedel
Lehrstuhl für Technische Elektronik, Technische Universität München, Theresienstrasse 90, Gebäude N3, 80290 Munich, Germany

ISSN 1437-0387 Springer Series in Advanced Microelectronics
ISBN 978-94-007-2344-3 e-ISBN 978-94-007-2345-0
DOI 10.1007/978-94-007-2345-0
Springer Dordrecht Heidelberg London New York
Library of Congress Control Number: 2011941857
© Springer Science+Business Media B.V. 2012
No part of this work may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, microfilming, recording or otherwise, without written permission from the Publisher, with the exception of any material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work.

Cover design: VTeX UAB, Lithuania
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)

## Preface

Adiabatic Logic is a potential successor to static CMOS circuit design when it comes to ultra-low-power operation. Future developments, such as the evolutionary shrinking of the minimum feature size and revolutionary novel transistor concepts, will change the gate-level savings gained by Adiabatic Logic. In addition, the impact of worsening degradation effects has to be considered in the design of adiabatic circuits. The impact of technology trends on the figures of merit of Adiabatic Logic, namely the energy saving potential and the optimum operating frequency, is investigated, as are degradation-related issues. Adiabatic Logic benefits from future devices, is not susceptible to Hot Carrier Injection, and is less affected by Bias Temperature Instability than static CMOS circuits. A major interest also lies in the efficient generation of the applied power-clock signal. This oscillating power supply can be used to save energy in short idle times by disconnecting circuits. An efficient way to generate the power-clock is the synchronous 2N2P LC oscillator, which is also robust with respect to pattern-induced capacitive variations. An easy-to-implement but powerful Power-Clock Gating supplement is proposed that gates the synchronization signals. Diverse implementations to shut down the system are presented and rated for their applicability and for other aspects such as energy reduction capability and data retention. Advantageous use of Adiabatic Logic requires compact and efficient arithmetic structures. A broad variety of adder structures and a Coordinate Rotation Digital Computer are compared and rated according to energy consumption and area usage, and the resulting energy saving potential against static CMOS proves the ultra-low-power capability of Adiabatic Logic. In the end, a new circuit topology also has to compete with static CMOS in productivity. On a 130 nm test chip, a large-scale test vehicle containing an FIR filter was implemented in Adiabatic Logic using a standard, library-based design flow. It was fabricated, measured, and compared to simulations of a static CMOS counterpart, and the measured saving factors agree with the values obtained by simulation. This leads to the conclusion that Adiabatic Logic is ready for productive design, as it is compatible not only with CMOS technology but also with the electronic design automation (EDA) tools developed for static CMOS system design.

Munich, Germany

## Acknowledgements

The data presented in this work are a result of my employment as a research assistant at the Lehrstuhl für Technische Elektronik (LTE) at the Technische Universität München. PhD theses at the LTE dealing with Adiabatic Logic were previously published by Ettore Amirante and Jürgen Fischer. Ettore and Jürgen supplied the basis for my work on Adiabatic Logic. Jürgen was my roommate for a couple of years and we had many fruitful discussions.

The opportunity to work at the LTE was offered to me by my supervisor and head of the institute, Professor Doris Schmitt-Landsiedel. She supported the work on Adiabatic Logic with a lot of enlightening input and personal effort. I would like to thank her for giving me the chance to be a part of the team at LTE. I always enjoyed being a member of a team that is composed of committed and inspiring people. A lot of input has come from the colleagues working on diverse fields of research at the institute.

Furthermore I want to use the opportunity to thank my parents for their support and their patience along this long academic journey. Veronika, you managed to keep me motivated during almost all stages of this work.

The Deutsche Forschungsgemeinschaft (DFG) supported the research on Adiabatic Logic in the Schwerpunktprogramm VIVA. Within the VIVA project I worked together with Prof. Jürgen Götze from the Technische Universität Dortmund. Prof. Jürgen Götze agreed to be the second supervisor of this thesis, and I am very grateful for his effort.

Philip Teichmann

## Contents

1 Introduction
1.1 Motivation for This Work
1.2 A Brief History of Reversible Computation and Adiabatic Logic
2 Fundamentals of Adiabatic Logic
2.1 The Charging Process in Adiabatic Logic Compared to Static CMOS
2.1.1 The Definition of the Energy Saving Factor (ESF)
2.2 An Adiabatic System
2.2.1 Introducing Adiabatic Logic Families Used in This Work
2.2.2 The Four-Phase Power-Clock
2.3 Loss Mechanisms in Adiabatic Logic
2.3.1 Impact of Process Variations on the Losses in Adiabatic Logic
2.4 Voltage Scaling: A Comparison of Static CMOS and Adiabatic Logic
2.5 Properties of Adiabatic Logic and Resultant Design Considerations
2.5.1 Dual-Rail Encoded Signals
2.5.2 Inherent Pipelining
2.5.3 Delay Considerations in Adiabatic Logic
2.5.4 The Power Supply Net in Adiabatic Logic: Crosstalk, $i R$-drop, $L \frac{d i}{d t}$-drop, Electromigration
2.6 General Simulation Setup
3 Future Trend in Adiabatic Logic
3.1 Scaling Trends for Sub 90 nm Transistors
3.2 Adiabatic Logic with Novel Devices
3.2.1 What Should an Ideal (Novel) Device for Adiabatic Logic Look Like?
3.2.2 Adiabatic Logic with Carbon Nanotubes (CNT)
3.2.3 Adiabatic Logic with the Vertical Slit Field Effect Transistor (VESFET)
3.3 (Negative) Bias Temperature Instability ((N)BTI) and Hot Carrier Injection (HCI) in Adiabatic Logic
3.3.1 Impact of NBTI on the Energy Dissipation of Adiabatic Logic Circuits
3.3.2 Comparison of the Stress Due to the Permanent NBTI in Static CMOS and AL
3.3.3 How Will Positive Bias Temperature Instability (PBTI) Impact Adiabatic Logic?
4 Generation of the Power-Clock
4.1 Introduction
4.2 Topologies of Inductor-Based Power-Clock Generators
4.3 Impact of Pattern-Induced Capacitive Variations on the Energy Dissipation of the Synchronized 2N2P LC-oscillator
4.3.1 Impact of Pattern-Induced Variations on the Dissipation of a Discrete-Cosine Transformation (DCT) System
4.4 Generation of the Synchronization Signals
4.4.1 Synchronous Versus Asynchronous Generation of the Control Signals for the Oscillator
4.4.2 Partitions of the Energy Losses Within an Adiabatic System
5 Power-Clock Gating
5.1 Introduction to Power-Clock Gating
5.2 The Theory of Power-Clock Gating
5.3 Gating Topologies for PCG
5.3.1 Cut-off with Power-down Transistors
5.3.2 Power-down of the Power-Clock Oscillator
5.4 Power-down Mode for the Synchronous 2N2P LC-oscillator
6 Arithmetic Structures in Adiabatic Logic
6.1 Design of Arithmetic Structures
6.1.1 Framework for the Estimation of $E_{\text {diss }}$ and $A_{\text {active }}$
6.1.2 Ripple-Carry Adder (RCA)
6.1.3 Parallel-Prefix Adders (PPA)
6.2 Overhead Reduction by Applying Complex Gates
6.2.1 Impact of Increased Input Stack on the Energy Dissipation
6.2.2 Case Study: Energy, Latency and Area Reduction by Applying Complex Gates in the RCA Structure
6.3 Multi-operand Adders and the CORDIC Algorithm
6.3.1 Nested RCA Structure
6.3.2 The Carry-Save Adder (CSA) Structure
6.3.3 A CORDIC-Based Discrete Cosine Transformation (DCT)
7 Measurement Results of an Adiabatic FIR Filter
7.1 Structure of the Adiabatic FIR Filter
7.2 Measurement Results and Comparison to Static CMOS
8 Conclusions
Bibliography
Index

## Abbreviations

| Symbol | Meaning |
| :---: | :---: |
| $\alpha$ | signal activity factor |
| $\alpha^{\prime}$ | velocity saturation factor (alpha-power law) |
| $\overline{\mathrm{A}}, \overline{\mathrm{B}}$ | inverted input signals of logic gates |
| A | cross section area |
| $A^{*}$ | active gate area |
| $a_{i}, b_{i}$ | input bit $i$ |
| A,B | input signals of logic gate |
| AL | adiabatic logic |
| AST | adiabatic signal test |
| BK | Brent-Kung PPA |
| BTI | bias temperature instability |
| $\vec{C}_{h}$ | chiral vector |
| C | capacitance |
| $C_{F}, C_{\bar{F}}$ | capacitance of logic block $F$ and $\bar{F}$ |
| $c_{i}$ | carry bit $i$ |
| $c_{i}$ | filter coefficient $i$ |
| $c_{i}^{l}$ | bit $l$ of filter coefficient $i$ |
| $C_{L}$ | load capacitance |
| $C_{R}$ | replacement capacitance |
| $C_{S}$ | switch capacitance |
| $C_{T}$ | tunable capacitance |
| $C_{\text {OX }}$ | specific oxide capacitance |
| CLA | carry-lookahead adder |
| CMOS | complementary metal oxide semiconductor |
| CNT | carbon nanotube |
| CNTFET | carbon nanotube field effect transistor |
| CORDIC | coordinate rotation digital computer |
| CSA | carry-save adder |
| CSEA | carry-select adder |
| CVSL | cascode voltage switch logic |
| $\Delta C_{A L}$ | deviation of the adiabatic load capacitance |
| $\Delta D_{P}$ | permanent part of BTI |
| $\Delta D_{R}$ | recoverable part of BTI |
| $\Delta f_{\text {res }}$ | deviation in the resonance frequency |
| $\Delta N$ | number of gates disconnected by PCG |
| $\Delta v$ | voltage drop |
| $\Delta V_{\text {th,rel }}$ | relative shift in the threshold voltage |
| $\Delta V_{\text {th }}$ | absolute shift in the threshold voltage |
| D | pipeline depth |
| DC | stress duty cycle for BTI |
| DCT | discrete cosine transformation |
| DIBL | drain-induced barrier lowering |
| DPM | dynamic power management |
| $\epsilon_{0}$ | vacuum permittivity |
| $\epsilon_{r}$ | relative permittivity |
| $\epsilon_{o x}$ | permittivity of oxide material |
| $\eta_{O s c}$ | oscillator efficiency |
| $\eta_{\text {System }}$ | system efficiency |
| $\overline{E_{P C G}}$ | mean energy consumption with PCG |
| E | energy consumption |
| $E_{0}$ | energy consumption without PCG |
| $E_{A L}$ | energy consumption of adiabatic logic |
| $E_{\text {Buf }}$ | energy consumption in the buffers of the oscillator |
| $E_{\text {CMOS }}$ | energy consumption in static CMOS |
| $E_{\text {diss }}$ | dissipated energy |
| $E_{\text {gap }}$ | bandgap energy |
| $E_{\text {leak }}$ | energy consumption due to leakage |
| $E_{\text {Line }}$ | energy consumption in the power-supply line |
| $E_{\text {min }}$ | minimum energy consumption |
| $E_{\text {non-adia }}$ | energy consumption due to non-adiabatic losses |
| $E_{\text {off }}$ | energy consumption in off-state |
| $E_{\text {OH,rel }}$ | relative energy overhead due to PCG |
| $E_{\text {on }}$ | energy consumption in on-state |
| $E_{\text {Osc }}$ | energy consumption in the power-clock oscillator |
| $E_{\text {SOH }}$ | energy consumption due to switching overhead |
| $E_{S y n c}$ | energy consumption in the synchronization signal generator |
| $E_{V_{D D}}$ | energy consumption from the power-supply $V_{D D}$ |
| ESF | energy saving factor |
| $E S F_{\text {max }}$ | maximum ESF |
| E | evaluate interval of power-clock signal |
| ECRL | efficient charge recovery logic |
| ERF | energy reduction factor |
| F | logic block of adiabatic logic gate |
| $\overline{\mathrm{F}}$ | dual logic block of adiabatic logic gate |
| $f$ | frequency |
| $f_{\text {opt }}$ | frequency for minimum energy dissipation in AL |
| $f_{\text {res }}$ | resonance frequency |
| $f_{\text {sync }}$ | synchronization frequency |
| $F O_{\text {max }}$ | maximum fanout |
| FA | full-adder |
| FD-SOI | fully-depleted silicon-on-insulator |
| FET | field effect transistor |
| FIR | finite impulse response (filter) |
| FSM | finite state machine |
| $\Gamma$ | ratio of capacitive load disconnected |
| $G_{i}$ | generate signal at bit position $i$ |
| GND | ground potential |
| $h$ | device height in VESFET device |
| H | hold interval of power-clock signal |
| HC | Han-Carlson PPA |
| HCI | hot carrier injection |
| $\bar{I}$ | mean current |
| $i(t)$ | time-variant current |
| $I_{D S, s a t}$ | saturation drain current |
| $I_{D}$ | drain current |
| $I_{l e a k}$ | leakage current |
| ITRS | International Technology Roadmap for Semiconductors |
| $J$ | current density |
| $k$ | Boltzmann's constant |
| $k$ | fractions within the CSEA |
| $K_{P}$ | fitting parameter for permanent part of BTI |
| $K_{R}$ | fitting parameter for recoverable part of BTI |
| $\lambda$ | channel-length modulation |
| $L$ | channel length of transistor |
| $L$ | inductance |
| $L$ | line length |
| $L_{\text {ch }}$ | physical channel length |
| $L_{D D}$ | drain extension |
| $L_{S S}$ | source extension |
| LSB | least significant bit |
| $\mu$ | carrier mobility |
| M | fraction count of power line |
| MOS | metal oxide semiconductor |
| MOSFET | metal oxide semiconductor field effect transistor |
| MTF | median time to failure |
| MUGFET | multi-gate field effect transistor |
| $\left(n_{1}, n_{2}\right)$ | chiral number |
| $N$ | bit width of arithmetic structure |
| $N$ | number of steps |
| $n$ | ideality factor for diode current; counting variable |
| $N_{A}$ | acceptor doping concentration |
| $N_{D}$ | donor doping concentration |
| $N_{\text {AND }}$ | number of AND gates |
| $N_{\text {Buf }}$ | number of buffers |
| $N_{F A}$ | number of full-adder cells |
| $N_{\text {Mux }}$ | number of multiplexers |
| $n_{\text {sub }}$ | substrate doping concentration |
| $N_{\text {XOR }}$ | number of XOR gates |
| NAND3 | 3-input NAND gate |
| NMOS | n-channel metal oxide semiconductor (field effect transistor) |
| OESF | overall energy saving factor |
| $\phi$ | power-clock |
| $\phi$ | rotation angle |
| $\phi_{i}$ | phase $i$ of the power-clock $\phi$ |
| $\varphi$ | activation energy |
| $p(t)$ | time-variant power |
| $P_{i}$ | propagate signal at bit $i$ |
| PCG | power-clock gating |
| PFAL | positive feedback adiabatic logic |
| PMOS | p-channel metal oxide semiconductor (field effect transistor) |
| PPA | parallel-prefix adder |
| PTM | predictive technology model |
| $Q$ | electrical charge |
| $Q$ | quality factor of the coil |
| $q$ | elementary charge |
| $R$ | resistance |
| $r$ | device radius in VESFET device |
| $R_{L}$ | load resistance |
| $R_{p}$ | resistance of the PMOS charging device |
| $R_{S}$ | switch resistance |
| $R_{\text {eval }}$ | resistance value in the evaluate interval |
| $R_{F / \bar{F}}$ | resistance of the $F / \bar{F}$ logic block |
| $R_{\text {on }}$ | on-resistance of transistor |
| $R_{\text {reco }}$ | resistance in the recover interval |
| $R_{S, n}, R_{S, p}$ | on resistance of NMOS/PMOS switch |
| $R_{S, T G}$ | on resistance of transmission gate switch |
| R | recover interval of power-clock signal |
| RCA | ripple-carry adder |
| RISC | reduced instruction set computer |
| $s_{i}$ | sum bit $i$ |
| $s_{i}$ | synchronization signal for phase $i$ |
| SCE | short channel effects |
| Static CCNT | static CMOS carbon nanotube |
| Static CMOS | static CMOS circuit style |
| SWNT | single walled carbon nanotube |
| $\theta$ | rotation operator |
| $\vec{T}$ | one-dimensional translation vector |
| $T$ | absolute temperature |
| $T_{\phi}$ | period of power-clock |
| $t_{\tau}$ | input slope |
| $t_{p}$ | propagation delay |
| $t_{r}$ | relaxation time |
| $t_{s}$ | stress time |
| $T_{M P D}$ | minimum power-down time |
| $T_{o f f}$ | time in off-state |
| $T_{\text {on }}$ | time in on-state |
| $T_{p w}$ | pulse-width of synchronization pulse |
| TESF | technology energy saving factor |
| V | time-invariant voltage |
| $v(t)$ | time-variant voltage |
| $V_{g}$ | BTI stress voltage at gate |
| $V_{T}$ | temperature voltage |
| $V_{D D}$ | supply voltage |
| $V_{D S}$ | drain to source voltage |
| $V_{G D}$ | gate to drain voltage |
| $V_{G S}$ | gate to source voltage |
| $V_{O V}$ | overdrive voltage |
| $V_{t h, n}$ | threshold voltage of n-channel device |
| $V_{t h, p}$ | threshold voltage of p-channel device |
| $V_{\text {th }}$ | threshold voltage of transistor |
| $V_{\text {tune }}$ | tuning voltage |
| VESFET | vertical slit field effect transistor |
| VMA | vector-merging adder |
| VTRO | $V_{t h}$ roll-off |
| $W_{L}$ | line width |
| $w_{S}$ | switch width multiplier with respect to $W_{\text {min }}$ |
| $W_{L, \text { min }}$ | minimum line width |
| W | channel width of transistor |
| W | wait interval of power-clock signal |
| $[x y]^{T}$ | input vector |
| $x_{d}$ | under diffusion length |
| $x_{i}, y_{i}$ | input vector $x / y$-value after rotation step $i$ |
| $\mathrm{x}[\mathrm{n}]$ | discrete input vector |
| $\mathrm{y}[\mathrm{n}]$ | discrete output vector |
| $\zeta$ | position of the switch for PCG |

## Chapter 1 Introduction

### 1.1 Motivation for this Work

Increasing demand for portable electronic devices and the ongoing shrinking of the minimum feature size in integrated circuits call for circuit topologies with low power consumption. On the one hand, power consumption limits the operating time of battery-driven devices; on the other hand, increasing power densities require more effort to cool integrated circuits. Besides active losses, passive losses due to the short-channel behavior of nanoscale transistors also have a growing impact on the overall power consumption.

Various concepts on all hierarchical levels have been proposed in the past to decrease active losses in static CMOS circuits. Active losses are described by $\alpha C V_{D D}^{2}$, where $V_{D D}$ is the supply voltage, $C$ the capacitance of the circuit, and $\alpha$ the activity factor. Because the supply voltage enters quadratically, decreasing it is the most effective way to decrease active losses; on the downside, the gate delay increases. The capacitance is determined by the intrinsic capacitance of the devices in the applied technology and by the parasitic capacitances of the interconnects, which can be minimized by a compact layout. Activity can be decreased by avoiding unnecessary switching due to glitching, but also by choosing algorithms that require fewer power-consuming operations. In synchronous designs, the clock tree adds further power consumption. These losses have a high activity factor and can add up to $50 \%$ to the overall losses for high-performance microprocessors [1]. Clock gating [1-5] helps to avoid those losses: flip-flops and clock branches of inactive circuits are disconnected from the clock tree.

Passive losses due to leakage currents gain importance with the ongoing shrinking of microelectronic circuits. Power gating cuts off unused circuits from the power supply. Non-critical paths within a complex system can be equipped with higher-$V_{t h}$ devices, trading speed for lower passive losses. Besides these circuit-level methods, new transistor concepts have been presented to cope with leakage losses. The Vertical Slit Field Effect Transistor (VESFET) [6-8] as well as the Carbon Nanotube (CNT) based field effect transistor are promising concepts to push the limits of miniaturization in integrated circuit design further.

Adiabatic Logic has proven to be a circuit technique with the potential to dramatically decrease the energy consumed per operation. Many investigations on the gate level have shown that a major reduction of losses with respect to static CMOS is gained. Previous work concluded that, out of the broad variety of adiabatic gate topologies, only a few are promising successors to static CMOS: those that are compatible with a static CMOS design flow, are robust with respect to PVT variations, and apply a manageable number of clocked power supplies that can be generated in an energy-efficient manner [9, 10]. But how will these gate topologies behave with shrinking transistor sizes and with revolutionary transistor concepts like VESFET and CNT-based FET? How will this impact the energy saving potential of Adiabatic Logic?

Energy is decreased by an order of magnitude compared to static CMOS on gate level in a 130 nm CMOS technology [11]. Rules for the optimal use of the inherent properties of Adiabatic Logic have to be derived in order to construct ultra-low-power adiabatic systems. An efficient generation of the power-clock signals is also crucial to gain high savings on system level. Switching losses and leakage losses in Adiabatic Logic can be prevented by disconnecting the system from the power-clock supply signals. Of course this leads to deviations in the capacitive load seen by the LC oscillator in charge of deriving the power-clock from a DC source. This may lead to increased losses; therefore, deeper investigations are needed to implement efficient shut-down methods. Thus the overall efficiency of Adiabatic Logic systems depends strongly on the systematic exploration of how to use the inherent properties of Adiabatic Logic advantageously, and on the evolution of Adiabatic Logic with future technology nodes and future devices.

### 1.2 A Brief History of Reversible Computation and Adiabatic Logic

Landauer stated in 1961 [12] that energy in computation must be dissipated only if information is erased. Each erased bit is associated with a theoretical lower bound of $k T \ln 2$ of energy dissipated as heat to the surroundings. Accessible energy is thereby transformed into inaccessible energy. In 1973 Bennett [13] proposed a logically reversible Turing machine that circumvents this lower bound. The history is stored in the Turing machine in order to retrace the input from the output state. In conventional logic gates like AND, NAND, and OR, information is erased, as there is no unique mapping from the inputs to the output states, i.e. the inputs cannot be reconstructed from the outputs. The NOT operation and the Identity Gate (that mirrors the input to the output) are reversible by construction, but of course more complex Boolean functions are needed to set up arithmetic units. Fredkin and Toffoli [14] invented gates that perform diverse logical operations without erasing information. The Toffoli and the Fredkin gates are both three-input, three-output gates, where the Toffoli gate is a controlled NOT operation and the Fredkin gate implements a swap of two inputs controlled by the third input. All these gates suffer from a large transistor count and therefore an excessive area usage.

Different implementations have been presented to allow for reversible systems. The retractile cascades [15] restore the whole energy from the outputs by keeping the input information valid during the recover process. Due to the way retractile cascades are operated, they suffer from a lack of performance and from a supply clock scheme that depends on the number of cascaded gates. In order to restore the charge from the output capacitor, the inputs do not necessarily have to be known. If the charging path (or any other path) is kept conducting during the recover process, charge can be recovered and is not discharged to ground. This means that a gate can work energetically reversibly without the need to be logically reversible. Koller and Athas [16] introduced an adiabatic latch circuit that stores the information about which node was charged. Without the need to keep or restore the input signals, the energy can be recovered from the output nodes. But as these circuits use MOSFET devices as latching elements, the recovery works only until the threshold voltage is reached, so a fraction of the stored energy cannot be recovered. This circuit type is called quasi-adiabatic, due to the unavoidable non-adiabatic losses. A variety of quasi-adiabatic families [17-25] have been introduced in the past years. In previous works, these different topologies were assessed and two of them were identified as most suitable for CMOS integration in deep-sub-micron technologies. Thus, investigations presented in this work are carried out with the Efficient Charge Recovery Logic (ECRL, [19]) and the Positive Feedback Adiabatic Logic (PFAL, [18]).
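The notion of logical reversibility used in this section can be illustrated with a small Python sketch. It assumes the common textbook formulation of the two gates (Toffoli: the third bit is inverted if both control bits are 1; Fredkin: two bits are swapped if the control bit is 1). The function names are chosen for illustration only and do not stem from this work.

```python
from itertools import product


def toffoli(a, b, c):
    """Controlled-controlled NOT: c is inverted only if a and b are both 1."""
    return a, b, c ^ (a & b)


def fredkin(ctrl, a, b):
    """Controlled swap: a and b are exchanged only if ctrl is 1."""
    return (ctrl, b, a) if ctrl else (ctrl, a, b)


def and_gate(a, b):
    """Conventional AND gate with a single output bit."""
    return (a & b,)


def is_reversible(gate, n_inputs):
    """A gate is logically reversible if distinct inputs yield distinct outputs."""
    outputs = [gate(*bits) for bits in product((0, 1), repeat=n_inputs)]
    return len(set(outputs)) == len(outputs)


if __name__ == "__main__":
    for name, gate, n in (("Toffoli", toffoli, 3),
                          ("Fredkin", fredkin, 3),
                          ("AND", and_gate, 2)):
        print(name, "reversible:", is_reversible(gate, n))
```

Both three-bit gates map the eight input patterns onto eight distinct output patterns, so the inputs can always be reconstructed. The AND gate maps three of its four input patterns onto the same output, which is the erasure of information described above.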

## Chapter 2 Fundamentals of Adiabatic Logic

### 2.1 The Charging Process in Adiabatic Logic Compared to Static CMOS

First, the energy dissipation caused by the switching of a simple CMOS inverter as shown in Fig. 2.1 is considered. The capacitor $C$ at the output of the gate represents the input capacitance of the following gates. Depending on the input signal, in steady state either the PMOS device or the NMOS device is on, while the other one is off. If an input transition from 1 to 0 occurs, energy is transferred from the voltage source to charge the output capacitor to the voltage $V_{D D}$. A charge of $Q=C V_{D D}$ is taken from the voltage source, so an energy quantum of

$$
\begin{equation*}
E_{V_{D D}}=Q V_{D D}=C V_{D D}^{2} \tag{2.1}
\end{equation*}
$$

is withdrawn from the voltage source. The energy stored on the capacitor at the voltage $V_{D D}$ is equal to

$$
\begin{equation*}
E_{C}=\frac{1}{2} C V_{D D}^{2} \tag{2.2}
\end{equation*}
$$

The difference between the delivered energy and the stored energy is dissipated in the PMOS switch. Now, if the input switches from 0 to 1 , in steady-state condition the NMOS channel is on, the PMOS off. Charge stored on the output capacitance is then dissipated via the NMOS device to ground. The energy dissipation of a switching event in a static CMOS gate is given as

$$
\begin{equation*}
E_{C M O S}=\alpha \frac{1}{2} C V_{D D}^{2} \tag{2.3}
\end{equation*}
$$

where $\alpha$ is the switching probability, as there is no dissipation (except leakage losses) in static CMOS gates if there is no switching event at all. Different approaches are useful to reduce the energy dissipation in static CMOS. Reducing the number of transitions needed for the computation of a certain task can be done on algorithmic, on structural and on circuit level [26]. Reducing the capacitive load $C$ is strongly limited by the technology and its intrinsic device capacitance. But wiring capacitance can be reduced by choosing a proper architecture and a carefully designed layout. Reducing the supply voltage $V_{D D}$ is a very powerful method to reduce the power dissipation, but as a downside the performance is degraded. Nevertheless, (2.3) is the lower bound for the dissipation per switching event.

Fig. 2.1 Schematic of a static CMOS inverter

Fig. 2.2 An ECRL buffer and an exemplary scheme of the signals in the gate in operation
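The energy bookkeeping of (2.1) to (2.3) can be made concrete with a short Python sketch. The load capacitance of 10 fF and the supply voltage of 1.2 V are assumed example values and are not data from this work; the function name is hypothetical.

```python
def static_cmos_energies(c_load, v_dd, alpha=1.0):
    """Evaluate (2.1)-(2.3) for one static CMOS gate.

    c_load: load capacitance in farad (assumed example value below)
    v_dd:   supply voltage in volt (assumed example value below)
    alpha:  switching probability
    """
    e_supply = c_load * v_dd ** 2        # (2.1) energy drawn from the source
    e_stored = 0.5 * c_load * v_dd ** 2  # (2.2) energy stored on the capacitor
    e_pmos = e_supply - e_stored         # dissipated in the PMOS while charging
    e_cmos = alpha * e_stored            # (2.3) dissipation per switching event
    return {"supply": e_supply, "stored": e_stored,
            "pmos": e_pmos, "cmos": e_cmos}


if __name__ == "__main__":
    energies = static_cmos_energies(c_load=10e-15, v_dd=1.2, alpha=0.5)
    for name, value in energies.items():
        print(f"{name:>6}: {value * 1e15:.2f} fJ")
```

Half of the energy drawn from the source is dissipated in the PMOS device during charging, independent of its on-resistance. The other half is stored on the capacitor and is dissipated in the NMOS device when the output is discharged.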

In contrast, Adiabatic Logic does not switch abruptly from 0 to $V_{D D}$ (and vice versa); instead, a voltage ramp is used to charge the output and to recover the energy from it. The principle of operating an adiabatic gate is presented for a buffer gate in the Efficient Charge Recovery Logic (ECRL, [19]) in Fig. 2.2. The gate consists of two cross-coupled PMOS devices that are used to store the information. The logic function is constructed via two NMOS devices. Cascaded gates are operated by a four-phase power-clock signal that is presented in Sect. 2.2.2. Input signals for the ECRL gate in Fig. 2.2 are shifted by $90^{\circ}$ with respect to the applied power-clock signal.

As an example, assume that the input in is at logic one and the dual input $\overline{\text{in}}$ is at zero. Then the NMOS device N1 will conduct and connect $\overline{\text{out}}$ to ground, while N2 is disabled. As soon as the power-clock $\phi$, ramped from 0 to $V_{D D}$, reaches the threshold voltage $V_{t h, p}$ of the PMOS device, P2 will be turned on. Thus the output signal out will follow the power-clock $\phi$. The gate voltage of device P1 is then equal to the supply voltage and its gate-to-source voltage is zero, so this device stays disabled. As soon as $\phi$ reaches the maximum level $V_{D D}$, the input signals are ramped down, as the preceding gate recovers the energy at this time. The PMOS devices take care of storing the information while both NMOS devices are disabled. Then the power-clock descends from $V_{D D}$ to 0. While $\phi$ is above $V_{t h, p}$, charge from the output out is restored to $\phi$. A certain fraction of energy $\frac{1}{2} C_{\text {out }} V_{t h, p}^{2}$ remains on the corresponding output capacitance and is dissipated or reused in the next cycle, depending on the succeeding input signals.

Fig. 2.3 Equivalent circuit to determine the losses by adiabatically loading a capacitance

To calculate the energy consumed by charging a capacitance adiabatically, the equivalent circuit in Fig. 2.3 for an adiabatic gate is used. $R$ is the resistance in the charging path of the circuit, consisting of the on-resistance of the transistors in the charging path and the sheet resistance of the signal line. For the observations of the energy dissipation, $R$ is considered to be constant. The voltage is ramped from 0 to $V_{D D}$ within $T$, slowly enough that $v_{C}(t)$ is able to follow the signal $v(t)$ almost instantly, so $v_{C}(t) \approx v(t)$. Therefore the current into the circuit can be determined by

$$
\begin{equation*}
i(t)=C \frac{d v(t)}{d t}=\frac{C V_{D D}}{T} . \tag{2.4}
\end{equation*}
$$

The energy for a charging event is calculated by integrating the power $p(t)$ during the transition time $T$ :

$$
\begin{equation*}
E=\int_{0}^{T} p(t) d t=\int_{0}^{T} v(t) \cdot i(t) d t=\int_{0}^{T}\left(v_{R}(t)+v_{C}(t)\right) \cdot i(t) d t . \tag{2.5}
\end{equation*}
$$

The integral of $v_{C}(t) \cdot i(t)$ over one clock cycle is zero, as no energy is dissipated in the capacitance. Replacing the voltage $v_{R}(t)$ in (2.5) with $i(t) \cdot R$ and inserting (2.4) into (2.5) therefore results in

$$
\begin{equation*}
E=\int_{0}^{T} R \frac{C^{2} V_{D D}^{2}}{T^{2}} d t=\frac{R C}{T} C V_{D D}^{2} \tag{2.6}
\end{equation*}
$$

A whole cycle consists of charging and recovery. As the recovery process leads to the same amount of energy dissipation, the overall dissipation in Adiabatic Logic (AL) is

$$
\begin{equation*}
E_{A L}=2 \frac{R C}{T} C V_{D D}^{2} \tag{2.7}
\end{equation*}
$$

Equation (2.7) shows that the operating speed affects the energy dissipation: the more slowly the circuit is charged, the less energy is dissipated. As in static CMOS, the consumption can be reduced further by scaling the supply voltage or by reducing the capacitive load. In contrast to static CMOS, the size of the switch transistor also affects the energy dissipation, as $R$ appears in (2.7). Comparing (2.3) and (2.7) yields a minimum transition time $T$ above which adiabatic circuits are more energy efficient than static CMOS circuits: $T>4 \frac{R C}{\alpha}$. In static CMOS, during one cycle the gate's output either stays constant, switches from 0 to 1, or switches from 1 to 0. The activity factor $\alpha$ in the expression $T>4 \frac{R C}{\alpha}$ reveals that applications with a moderate to high activity factor are suitable for operation with AL. Otherwise static CMOS is superior, as it does not suffer from losses in a steady input state as long as leakage losses are neglected.
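The break-even condition can be checked numerically. The following is a minimal sketch, not part of the original derivation: it assumes that (2.3), which is not repeated here, has the form $E_{CMOS}=\alpha \frac{1}{2} C V_{DD}^{2}$ (the form consistent with $T>4 \frac{R C}{\alpha}$), and all component values are purely illustrative.

```python
def adiabatic_energy(R, C, V_dd, T):
    """Energy per full charge/recover cycle according to (2.7)."""
    return 2.0 * (R * C / T) * C * V_dd**2


def cmos_energy(C, V_dd, alpha):
    """Assumed form of (2.3): alpha * 1/2 * C * V_dd^2 per cycle."""
    return alpha * 0.5 * C * V_dd**2


def break_even_ramp_time(R, C, alpha):
    """Ramp time at which (2.7) equals the assumed (2.3): T = 4 R C / alpha."""
    return 4.0 * R * C / alpha


# Illustrative values only (assumptions, not taken from the text).
R = 10e3      # charging-path resistance in ohm
C = 10e-15    # load capacitance in farad
V_DD = 1.2    # supply voltage in volt
ALPHA = 0.5   # activity factor

T_min = break_even_ramp_time(R, C, ALPHA)
for T in (0.5 * T_min, T_min, 10.0 * T_min):
    ratio = cmos_energy(C, V_DD, ALPHA) / adiabatic_energy(R, C, V_DD, T)
    print(f"T = {T:.2e} s  E_CMOS / E_AL = {ratio:.2f}")
```

Under these assumptions the ratio $E_{CMOS}/E_{AL}$ equals $T/T_{min}$, so it grows linearly with the ramp time.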

### 2.1.1 The Definition of the Energy Saving Factor (ESF)

Comparing static CMOS and Adiabatic Logic with respect to their energy dissipation calls for the definition of the Energy Saving Factor (ESF). It measures how much more energy is used in a static CMOS gate or system than in its Adiabatic Logic counterpart. The precise definition of the ESF depends on the hierarchical level considered. If the efficiency of an Adiabatic Logic family is compared with static CMOS, the ESF compares the losses in a single gate. On system level, the generation of the supply voltage in static CMOS, the generation of the power-clock in AL, and losses due to layout parasitics also have to be included in the calculation of the ESF. A general definition of the ESF is

$$
\begin{equation*}
E S F=\frac{\sum_{C M O S} E}{\sum_{A L} E} \tag{2.8}
\end{equation*}
$$

All the energy dissipation fractions under consideration have to be summed up. An explanation has to be given at the time where the ESF is used, whether gate level comparison or a comparison on system level is performed.

### 2.2 An Adiabatic System

Each adiabatic system consists of two main parts: the digital core made up of adiabatic gates and the generator of the power-clock signals. Two adiabatic families are used in this work; both are briefly introduced in Sect. 2.2.1. Design considerations resulting from the inherent properties of these Adiabatic Logic families are explained in detail in Sect. 2.5. Power-clock generation is a very important topic in adiabatic systems, as efficient generation of the four phases making up the power-clock is essential for high overall saving factors. The four-phase power-clock used by the adiabatic families in this work is presented in Sect. 2.2.2.

### 2.2.1 Introducing Adiabatic Logic Families Used in This Work

Two Adiabatic Logic families are used in the investigations in the presented work. One is the Positive Feedback Adiabatic Logic (PFAL [18]), and the other is the Efficient Charge Recovery Logic (ECRL [19]). Both share the property that they are operated with a four-phase power-clock. PFAL contains a latch element formed by two cross-coupled inverters to store the output state when the input signals are ramped down. ECRL, based on the Cascode Voltage Switch Logic (CVSL [27]), uses a cross-coupled PMOS pair as latching element. Logic blocks constructed of NMOS transistors only are used for PFAL and ECRL. As both families use identical function blocks, design procedures presented for CVSL [28] can be used for ECRL and also for PFAL. Logic blocks are connected from the power-clock $\phi$ to the output nodes for PFAL, and from the output to GND for ECRL.

Fig. 2.4 Inverter circuit in the (a) PFAL and (b) ECRL family

As an example, an inverter is sketched for PFAL (Fig. 2.4a) and ECRL (Fig. 2.4b). For more complex gates the logic block transistors $N_{F}$ and $N_{\bar{F}}$ are replaced by logic function blocks. If, for example, a NAND gate has to be constructed, a series connection of two transistors is used instead of $\mathrm{N}_{\mathrm{F}}$, with A and B as inputs. A dual block composed of two parallel transistors, with $\overline{\mathrm{A}}$ and $\overline{\mathrm{B}}$ as inputs, is connected at the position of $\mathrm{N}_{\overline{\mathrm{F}}}$.

### 2.2.2 The Four-Phase Power-Clock

Adiabatic Logic circuits are operated with an oscillating power supply, the so-called power-clock. Depending on the adiabatic family, more than one power-clock signal is used to operate a system consisting of Adiabatic Logic gates. The adiabatic families employed in this work use a four-phase power-clock $\phi_{0}-\phi_{3}$ (Fig. 2.5).

Each power-clock cycle consists of four intervals. In the evaluate (E) interval, the outputs are evaluated from the stable input signals. During the hold (H) interval, the outputs are kept stable to supply the subsequent gate with a stable input signal. Energy is recovered in the recover (R) interval. For symmetry reasons a wait (W) interval is inserted, as symmetric signals can be generated more easily and more efficiently. Data in adiabatic systems is processed in a pipeline fashion and handed over as shown in Fig. 2.5. Valid data words 1, 2, 3 and 4 are sketched in phase $\phi_{0}$. Data word 1 is transferred during the H interval of $\phi_{0}$, while $\phi_{1}$ is in E.

Fig. 2.5 Scheme of the four-phase power-clock


It is processed by the logical function given in the succeeding gate and is valid at the outputs as $1^{*}$ for further processing in the next gates. As mentioned before, signals have to be kept constant during E; this requires a $90^{\circ}$ phase shift between subsequent phases. In a pipeline, subsequent gates have to be connected to the right phases in order to guarantee a transfer of valid input data.
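The hand-over between phases can be illustrated with a small sketch. It assumes that the four intervals have equal length, that $\phi_{0}$ starts its evaluate interval at $t=0$, and that $\phi_{k}$ lags $\phi_{k-1}$ by a quarter period; the function name `interval` is hypothetical.

```python
import math

INTERVALS = ("E", "H", "R", "W")  # evaluate, hold, recover, wait


def interval(phase, t, period=1.0):
    """Return the interval that power-clock phi_<phase> is in at time t.

    Assumption: phi_0 enters E at t = 0 and phi_k lags phi_(k-1)
    by a quarter period (90 degrees).
    """
    quarter = math.floor(4.0 * t / period) - phase
    return INTERVALS[quarter % 4]


# Sample the middle of each quarter period and list all four phases.
for q in range(4):
    t = (q + 0.5) / 4.0
    print(f"quarter {q}:", [interval(p, t) for p in range(4)])
```

In the second quarter of the period the sketch reports $\phi_{0}$ in H and $\phi_{1}$ in E, which is the situation in which data word 1 is handed over.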

### 2.3 Loss Mechanisms in Adiabatic Logic

In an ideal adiabatic system losses are expected to follow (2.7), but shrinking devices into the sub-$\mu\mathrm{m}$ regime and the non-existence of zero-$V_{th}$ transistors lead to additional loss mechanisms. These effects can dominate the energy consumption and also impose a lower bound on the energy dissipation. With ongoing shrinking, leakage currents gain more impact on the overall dissipation of static CMOS gates. One of the dominant leakage currents is the so-called sub-threshold current. It is expressed by [29]

$$
\begin{equation*}
I_{D}=I_{D 0} e^{\frac{V_{G S}-V_{t h}}{n V_{T}}}\left(1-e^{\frac{-V_{D S}}{V_{T}}}\right), \tag{2.9}
\end{equation*}
$$

where $V_{T}$ is the thermal voltage, $V_{th}$ is the threshold voltage of the device, and $V_{GS}$ and $V_{DS}$ are the terminal voltages. As long as $V_{DS}$ is zero, no leakage current flows. Once $V_{DS}$ reaches a few multiples of the thermal voltage, the leakage approaches its maximum value. Besides that, a junction leakage exists, and in state-of-the-art CMOS processes leakage currents tunnel through the thin gate oxide.
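The dependence of (2.9) on $V_{DS}$ can be made concrete with a short sketch. The prefactor $I_{D0}=1$, the slope factor $n=1.5$, the thermal voltage of about 25.9 mV and the bias values are assumptions chosen only for illustration.

```python
import math


def subthreshold_current(v_gs, v_ds, v_th, i_d0=1.0, n=1.5, v_t=0.0259):
    """Sub-threshold drain current according to (2.9)."""
    gate_term = math.exp((v_gs - v_th) / (n * v_t))
    drain_term = 1.0 - math.exp(-v_ds / v_t)
    return i_d0 * gate_term * drain_term


V_T = 0.0259  # assumed thermal voltage near room temperature, in volt

# Reference: V_DS much larger than V_T, so the drain term is practically 1.
i_max = subthreshold_current(0.0, 1.0, 0.3)

for k in (0, 1, 3, 5):
    frac = subthreshold_current(0.0, k * V_T, 0.3) / i_max
    print(f"V_DS = {k} * V_T -> {100.0 * frac:.1f} % of the maximum")
```

The drain term $1-e^{-V_{DS}/V_{T}}$ alone sets this fraction, so the result does not depend on the assumed values of $I_{D0}$, $n$ or $V_{th}$.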

Fig. 2.6 Adiabatic losses $E_{AL}$ are proportional to the frequency, leakage losses $E_{\text{leak}}$ are inversely proportional to the frequency, and the non-adiabatic losses are independent of the frequency. An optimum frequency exists for Adiabatic Logic circuits, as can be seen from the overall losses $\sum$

In Adiabatic Logic, during evaluation, hold and recovery, leakage currents flow from the voltage supply to ground, leading to a dissipation of charge that cannot be recovered. All leakage mechanisms can be summarized in a mean current $\overline{I_{\text{leak}}}$ that leads to the energy consumption per cycle of

$$
\begin{equation*}
E_{l e a k}=V_{D D} \overline{I_{l e a k}} \frac{1}{f} \tag{2.10}
\end{equation*}
$$

Leakage-related dissipation increases for lower frequencies, as leakage losses are accumulated over a longer time interval.

Discharging a gate in PFAL and ECRL leaves a residual voltage at the output node that is in the range of the threshold voltage $V_{th,p}$ of the PMOS device. In ECRL, as long as the gate evaluates the same input in the next cycle, the residual charge is reused in the next cycle; otherwise it is discharged to ground. In PFAL, this charge is dissipated when the output signal changes, as the output is then connected to ground via the NMOS device in the latch in the evaluate interval. If the output state remains the same, the charge is dissipated in the W interval, as the input transistors are turned on and connect the output to the power-clock (which is at ground potential in the W interval). Besides that, in ECRL the output cannot instantly follow the rising power-clock. Only when the power-clock has reached at least $\left|V_{th,p}\right|$ is the charging path over the PMOS device opened. Then the output voltage follows the power-clock abruptly, leading to a dynamic loss. All these losses are related to the threshold voltage and lead to a non-adiabatic dissipation of

$$
\begin{equation*}
E_{\text {non-adia }}=\frac{1}{2} C V_{t h, p}^{2} . \tag{2.11}
\end{equation*}
$$

Non-adiabatic losses are independent of the operating frequency, leading to an offset in the energy dissipation over the whole frequency range. Thus, three loss mechanisms contribute to the overall losses in Adiabatic Logic. Adiabatic losses (equation (2.7)) and leakage losses (equation (2.10)) depend on the operating frequency $f$. Figure 2.6 shows the three loss mechanisms as a function of the frequency; the overall dissipation is obtained by summing all three components.

A minimum dissipation of the energy at a certain frequency is observed. Therefore an optimum frequency exists in Adiabatic Logic, where energy consumed per cycle is minimized.
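The position of this optimum can be estimated from the three loss terms. The sketch below assumes that the ramp time in (2.7) is one quarter of the power-clock period, $T=\frac{1}{4 f}$, which is in line with the four-phase scheme but is not stated as such in the derivation above; all numerical values are illustrative assumptions.

```python
import math


def energy_per_cycle(f, R, C, V_dd, V_thp, I_leak):
    """Sum of (2.7), (2.10) and (2.11), assuming a ramp time T = 1 / (4 f)."""
    e_adia = 2.0 * R * C * (4.0 * f) * C * V_dd**2   # (2.7) with T = 1/(4f)
    e_leak = V_dd * I_leak / f                       # (2.10)
    e_non_adia = 0.5 * C * V_thp**2                  # (2.11)
    return e_adia + e_leak + e_non_adia


def optimum_frequency(R, C, V_dd, I_leak):
    """Frequency at which adiabatic and leakage losses are equal."""
    return math.sqrt(I_leak / (8.0 * R * C**2 * V_dd))


# Illustrative values only (assumptions, not simulation results).
R, C, V_DD, V_THP, I_LEAK = 10e3, 10e-15, 1.2, 0.3, 1e-9

f_opt = optimum_frequency(R, C, V_DD, I_LEAK)
print(f"f_opt = {f_opt:.3e} Hz")
for f in (0.1 * f_opt, f_opt, 10.0 * f_opt):
    e = energy_per_cycle(f, R, C, V_DD, V_THP, I_LEAK)
    print(f"f = {f:.3e} Hz  E = {e:.3e} J")
```

With $E(f)=a f+b / f+c$, setting the derivative to zero gives $f_{opt}=\sqrt{b / a}$; the non-adiabatic offset $c$ shifts the curve but not the optimum.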

### 2.3.1 Impact of Process Variations on the Losses in Adiabatic Logic

As process variations are a major concern in today's CMOS technologies, circuit designers are confronted with new challenges in designing robust circuits. In static CMOS, process variations can induce functional errors in circuits operated at high speed. If single transistors are too slow or too fast, timing constraints are violated, leading to system failures. A lot of effort is put into methods to deal with these variations. As adiabatic circuits are operated at a relatively low frequency, timing issues are not of concern. In Adiabatic Logic, however, the variations affect the energy consumption of the circuit [9, 30].

According to (2.7), the effective charging path resistance $R$, which is composed of the on-resistance of the MOS device charging the output and the resistance of the interconnects, affects the energy dissipation. As the charging MOS device is operated in the linear region most of the time, the charging path resistance can be estimated via the equation for the drain current in the linear region [31]:

$$
\begin{equation*}
I_{D}=k_{n}^{\prime} \frac{W}{L}\left(\left(V_{G S}-V_{t h}\right) V_{D S}-\frac{V_{D S}^{2}}{2}\right) \tag{2.12}
\end{equation*}
$$

The factor $k_{n}^{\prime}$ summarizes the mobility of the majority carriers $\mu$ and the specific oxide capacitance $C_{O X}$. To operate an adiabatic circuit efficiently, the frequency has to be slow enough to allow the output to follow the power-clock such that a very small $V_{D S}$ will appear. The on-resistance $R_{o n}=\frac{V_{D S}}{I_{D}}$ can therefore be approximated by

$$
\begin{equation*}
R_{o n}=\frac{L}{k_{n}^{\prime} W}\left(V_{G S}-V_{t h}\right)^{-1} . \tag{2.13}
\end{equation*}
$$

The impact of the threshold voltage $V_{t h}$ on the on-resistance $R_{o n}$ is determined via (2.13). As the gate overdrive voltage $V_{G S}-V_{t h}$ is affected by process variations, an increased or decreased on-resistance is observed, and therefore the energy dissipation will be changed.
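The sensitivity expressed by (2.13) can be sketched numerically. The transconductance parameter, the $W/L$ ratio and the bias point below are assumptions chosen only to illustrate the trend.

```python
def on_resistance(v_gs, v_th, k_n=200e-6, w_over_l=10.0):
    """On-resistance in the linear region for small V_DS, see (2.13)."""
    return 1.0 / (k_n * w_over_l * (v_gs - v_th))


V_GS, V_TH = 1.2, 0.3  # illustrative bias point in volt (assumption)

r_nom = on_resistance(V_GS, V_TH)
for shift in (-0.05, 0.0, 0.05):
    r = on_resistance(V_GS, V_TH + shift)
    print(f"dV_th = {shift:+.2f} V -> R_on / R_on,nom = {r / r_nom:.3f}")
```

Since the adiabatic loss in (2.7) is proportional to $R$, the same relative change carries over to $E_{AL}$ as long as the transistor on-resistance dominates the charging path.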

For the leakage losses, the impact of variations on the current in the sub-threshold region (see (2.9)) is considered. As (2.9) depends exponentially on $V_{th}$, the dissipation caused by leakage currents depends exponentially on a shift in $V_{th}$. A shift in $V_{th}$ also changes the non-adiabatic losses according to (2.11); these depend quadratically on the threshold voltage.

Summarizing, the $V_{t h}$-shift induced by process variations has the strongest impact in the frequency regime where leakage currents dominate the overall losses in Adiabatic Logic. A shift of the optimum frequency to higher values can be observed if $V_{t h}$ is shifted to lower absolute values [10].

In Fig. 2.7, simulation results of a buffer circuit in the Positive Feedback Adiabatic Logic (PFAL, [32]) in a 130 nm low-$V_{th}$ CMOS technology show the impact of process variations on the energy dissipation versus the frequency. Nominal simulations and simulations for the slow and fast corners are plotted. Process variations affect all regimes: the leakage-dominated regime in the lower frequency region, the adiabatic regime in the high frequency region, and also the non-adiabatic losses, which are independent of the frequency. As leakage currents are more sensitive to parameter variation, the highest deviation is seen in the low frequency range. The slow corner has a raised $V_{th}$ with respect to the nominal value; leakage is therefore reduced, but the on-resistance in the charging path is increased, resulting in higher adiabatic losses. For the fast corner $V_{th}$ is reduced, leading to a reduced on-resistance and therefore to reduced adiabatic losses, but on the downside leakage is increased. The optimum frequency is shifted from 10 MHz for the nominal $V_{th}$ to 3 MHz in the slow corner and to 50 MHz for the fast corner parameters.

Fig. 2.7 In the low frequency regime, the leakage currents are the main contributor to the energy dissipation. These losses are exponentially dependent on variations in $V_{th}$. Adiabatic losses are less impacted by process variations. The optimum operating frequency is shifted to higher frequencies when going from the slow corner to the fast corner

### 2.4 Voltage Scaling-A Comparison of Static CMOS and Adiabatic Logic

An easy and powerful way to reduce losses in static CMOS is by reducing the voltage supply $V_{D D}$ [33]. Equation (2.3) reveals a quadratic dependence of the energy dissipation on $V_{D D}$ due to dynamic losses:

$$
\begin{equation*}
E_{C M O S} \propto V_{D D}^{2} \tag{2.14}
\end{equation*}
$$

The limiting factor for voltage scaling is the propagation delay $t_{p}$, that is increased while the voltage is decreased according to [34]

$$
\begin{equation*}
t_{p}=\left(\frac{\frac{V_{t h}}{V_{D D}}+\alpha^{\prime}}{1+\alpha^{\prime}}-\frac{1}{2}\right) t_{\tau}+\frac{C_{L} V_{D D}}{2 I_{D 0}}, \tag{2.15}
\end{equation*}
$$

where $t_{\tau}$ is the input slope, $\alpha^{\prime}$ is the velocity saturation parameter, and $I_{D0}$ is the drain current for $V_{GS}=V_{DS}=V_{DD}$. The impact of the input slope decreases with ongoing miniaturization [34]; therefore the first term of (2.15) can be neglected for state-of-the-art CMOS technologies. Using $I_{D} \propto\left(V_{GS}-V_{th}\right)^{\alpha^{\prime}}$ in saturation [35], the dependence of the delay on the supply voltage is found:

$$
\begin{equation*}
t_{p} \propto \frac{V_{D D}}{\left(V_{D D}-V_{t h}\right)^{\alpha^{\prime}}} . \tag{2.16}
\end{equation*}
$$

A trade-off exists between speed and power consumption, therefore the voltage can only be reduced to a level where no timing constraints in the design are violated. The critical path in a static CMOS design determines the maximum degree to which the voltage can be reduced. In designs where only a few critical paths exist, but many paths have a positive slack after reducing the supply voltage, the gain from globally reducing the supply voltage is not satisfying. To make voltage scaling more effective one can try to break up the critical paths to allow further reduction of voltage and thus power, and also using different voltage domains for fast and slow paths could increase the benefits of scaling [36].

Delay is not a concern for Adiabatic Logic circuits, as the maximum possible frequency is far above the optimum frequency for an energy-efficient operation of gates and systems. In the frequency regime where adiabatic losses dominate the energy consumption of Adiabatic Logic, a reduction of the supply voltage is expected to reduce the energy consumption. At first sight, (2.7) suggests a dependence on $V_{DD}^{2}$, but the on-resistance of the transistor in the charging path is also a function of the supply voltage. If the overdrive voltage $V_{GS}-V_{th}$ is reduced by reducing the supply voltage, the resistance is increased. As long as $V_{DD}$ is far above $V_{th}$, the dissipated energy is [30]

$$
\begin{equation*}
E_{A L} \propto V_{D D}\left(1+\frac{V_{t h}}{V_{D D}}\right) \tag{2.17}
\end{equation*}
$$

Thus, Adiabatic Logic also gains from voltage scaling, but the ESF on gate level will decrease if voltage reduction is applied:

$$
\begin{equation*}
E S F \propto \frac{V_{D D}}{1+\frac{V_{t h}}{V_{D D}}} . \tag{2.18}
\end{equation*}
$$

Leakage losses are also impacted by reducing the supply voltage. As long as the leakage losses are negligible compared to the dynamic losses in static CMOS, and as long as the adiabatic circuit is not operated in the leakage dominated regime, and if non-adiabatic losses are negligible, the impact of voltage scaling on the ESF can be estimated by (2.18).
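The trend expressed by (2.14), (2.17) and (2.18) can be tabulated with a few lines of code. This is only a sketch of the proportionalities; the threshold voltage of 0.3 V and the chosen supply voltages are assumptions, and the values are normalized to the highest supply voltage.

```python
def e_cmos_rel(v_dd):
    """Relative dynamic energy of static CMOS, see (2.14)."""
    return v_dd**2


def e_al_rel(v_dd, v_th):
    """Relative adiabatic energy for V_DD far above V_th, see (2.17)."""
    return v_dd * (1.0 + v_th / v_dd)


def esf_rel(v_dd, v_th):
    """Relative gate-level ESF, see (2.18)."""
    return e_cmos_rel(v_dd) / e_al_rel(v_dd, v_th)


V_TH = 0.3   # assumed threshold voltage in volt
V_REF = 1.2  # reference supply voltage used for normalization (assumption)

for v_dd in (1.2, 1.0, 0.8):
    cmos = e_cmos_rel(v_dd) / e_cmos_rel(V_REF)
    al = e_al_rel(v_dd, V_TH) / e_al_rel(V_REF, V_TH)
    esf = esf_rel(v_dd, V_TH) / esf_rel(V_REF, V_TH)
    print(f"V_DD = {v_dd:.1f} V  E_CMOS: {cmos:.2f}  E_AL: {al:.2f}  ESF: {esf:.2f}")
```

Both energies drop when $V_{DD}$ is lowered, but the static CMOS energy drops faster, so the normalized ESF falls below one.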

The lower bound for $V_{DD}$ in static CMOS is mainly limited by timing constraints, including margins for variations in the process and fluctuations in the temperature and supply voltage. Supply voltage reduction in Adiabatic Logic is not limited by timing constraints. But a functional limit for ECRL and PFAL is observed when reducing $V_{DD}$. Minimum supply voltages are given by [9]:

$$
\begin{align*}
& V_{D D, \min } @ E C R L=\max \left(V_{t h, n},\left|V_{t h, p}\right|\right),  \tag{2.19}\\
& V_{D D, \min } @ P F A L=2 V_{t h, n} .
\end{align*}
$$

Below this lower bound, circuits constructed of ECRL and PFAL gates malfunction. In ECRL, the NMOS device is responsible for keeping one output node at ground potential, and the PMOS device charges the dual output node. Thus in ECRL the supply voltage has to be higher than the highest absolute threshold voltage value. In the PFAL gate, the output node has to be charged to at least $V_{th,n}$ to make the NMOS device in the latch conductive, which is responsible for keeping the dual output node at ground. The source node of the input device is connected to the output node, which is expected to be at least at $V_{th,n}$. Thus the gate of the input device needs a voltage greater than $2 \cdot V_{th,n}$ for the device to conduct.
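As a small illustration, the limits in (2.19) can be evaluated for a given pair of threshold voltages. The threshold values used here are assumptions and do not refer to a particular technology.

```python
def v_dd_min_ecrl(v_th_n, v_th_p):
    """Minimum supply voltage for ECRL according to (2.19)."""
    return max(v_th_n, abs(v_th_p))


def v_dd_min_pfal(v_th_n):
    """Minimum supply voltage for PFAL according to (2.19)."""
    return 2.0 * v_th_n


V_TH_N = 0.30   # assumed NMOS threshold voltage in volt
V_TH_P = -0.35  # assumed PMOS threshold voltage in volt

print(f"ECRL: V_DD,min = {v_dd_min_ecrl(V_TH_N, V_TH_P):.2f} V")
print(f"PFAL: V_DD,min = {v_dd_min_pfal(V_TH_N):.2f} V")
```

For comparable threshold voltages the functional limit of PFAL is therefore higher than that of ECRL.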

Finally the reduction in the voltage levels will degrade the noise margin for static CMOS as well as for Adiabatic Logic. Energy reduction via supply voltage scaling will thus be a trade-off between energy and robustness of the design.

### 2.5 Properties of Adiabatic Logic and Resultant Design Considerations

Based on the way PFAL and ECRL are constructed and operated, properties exist that need to be considered when designing adiabatic systems. The dual-rail signaling is due to the differential constitution of PFAL and ECRL, whereas delay and inherent micropipelining are implications of the four-phase power-clock.

### 2.5.1 Dual-Rail Encoded Signals

Differential logic styles like PFAL and ECRL generate dual output signals. But in contrast to differential static CMOS styles like CVSL, differential Adiabatic Logic styles are not always differential in a physical sense. As the power-clock ramps down to 0 each cycle, both outputs will go to 0 during the W interval. Only during the H interval, differential Adiabatic Logic gates are also physically differential.

Although the two outputs out and $\overline{\text{out}}$ are generated, the area consumption in terms of transistor count is comparable to static CMOS. Consider a NAND gate, which needs 2 NMOS and 2 PMOS devices in static CMOS. The ECRL gate consists of 4 NMOS and 2 PMOS devices, whereas PFAL uses 2 additional NMOS devices in the latch. But, compared to static CMOS, the dual-rail adiabatic gate performs both a NAND and an AND function, as both signals, out $=\overline{A \& B}$ and $\overline{\text{out}}=A \& B$, are generated. The AND gate in static CMOS needs an additional inverter circuit, consisting of 1 PMOS and 1 NMOS device. Implementing more complex functions further reduces the overhead introduced by the latch devices. Dual-rail signals can help to simplify functions, and common sub-blocks in functions can be shared for $F$ and $\overline{\mathrm{F}}$ [28], as demonstrated for an ECRL XOR/XNOR gate in Fig. 2.8.

Fig. 2.8 The ECRL XOR gate (a) without and (b) with reusing transistors in the logic blocks

Table 2.1 $A^{*}$ for XOR implementation of static CMOS [29], PFAL and ECRL. For the calculation of $A^{*}$ a ratio of transistor widths $\frac{W_{P}}{W_{N}}=2$ is assumed. Values in brackets are for static CMOS XOR without input inverting gates

| CMOS |  | PFAL |  |  |  | ECRL |  |  |  |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| \#n | \#p | \#n | \#p | $\frac{\Sigma_{P F A L}}{\Sigma_{C M O S}}$ | $\frac{A_{P F A L}^{*}}{A_{C M O S}^{*}}$ | \#n | \#p | $\frac{\Sigma_{E C R L}}{\Sigma_{C M O S}}$ | $\frac{A_{E C R L}^{*}}{A_{C M O S}^{*}}$ |
| 6 (4) | 6 (4) | 8 | 2 | 83\% (125\%) | 66\% (100\%) | 6 | 2 | 66\% (100\%) | 55\% (83\%) |

For the XOR, the transistor count for ECRL versus static CMOS is 8 compared to 12. The active gate area is a rough measure of the area consumption. If a symmetric rise and fall time is required for the static CMOS gate, the PMOS devices have to be sized larger than the NMOS devices, due to the reduced mobility of the majority carriers (holes) in the PMOS device.

In Table 2.1 the XOR implementations for static CMOS, PFAL and ECRL are compared. For the active gate area $A^{*}$ a ratio $\frac{W_{P}}{W_{N}}=2$ has been assumed for all three gates to compensate for the smaller carrier mobility of holes compared to electrons. Already for the pure transistor count ratio $\frac{\Sigma_{A L}}{\Sigma_{C M O S}}$ (with $\Sigma=\# n+\# p$, and $A L$ standing for PFAL and ECRL respectively), the XOR gates in Adiabatic Logic are smaller than the corresponding gate in static CMOS if the input inverters in the static CMOS gate are taken into account. The ratio of the active gate areas $\frac{A_{A L}^{*}}{A_{C M O S}^{*}}$ is even better for Adiabatic Logic. Even if the input inverters in the static CMOS gates are not included in the transistor count, the ECRL and PFAL gates are comparable in transistor count and active gate area, as indicated by the values in brackets in Table 2.1.
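The entries of Table 2.1 can be reproduced with a short calculation. The sketch assumes that all NMOS devices have the same gate area and that each PMOS device has twice that area, so that $A^{*} \propto \# n+2 \cdot \# p$; the dictionary keys are names chosen for this example.

```python
def active_area(n_count, p_count, wp_over_wn=2.0):
    """Active gate area A* in units of one NMOS gate area (assumed model)."""
    return n_count + wp_over_wn * p_count


# (NMOS count, PMOS count) as listed in Table 2.1
GATES = {
    "CMOS with inverters": (6, 6),
    "CMOS without inverters": (4, 4),
    "PFAL": (8, 2),
    "ECRL": (6, 2),
}


def ratios(gate, reference):
    """Return (transistor count ratio, active area ratio) of gate vs. reference."""
    n, p = GATES[gate]
    n_ref, p_ref = GATES[reference]
    count_ratio = (n + p) / (n_ref + p_ref)
    area_ratio = active_area(n, p) / active_area(n_ref, p_ref)
    return count_ratio, area_ratio


for gate in ("PFAL", "ECRL"):
    for ref in ("CMOS with inverters", "CMOS without inverters"):
        count_ratio, area_ratio = ratios(gate, ref)
        print(f"{gate} vs {ref}: count {100 * count_ratio:.1f} %, "
              f"area {100 * area_ratio:.1f} %")
```

The computed ratios agree with the table when its percentages are read as truncated values (for example 66.7 % is listed as 66 %).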

Fig. 2.9 The arrangement of static CMOS gates in (a) cannot be directly translated into Adiabatic Logic. Due to the micropipelining, signal A has to be buffered (b) to be synchronous to the output X of the AND gate in phase $\phi_{1}$

Adders and subtractors are components used in various arithmetic structures. If two numbers $A$ and $B$ are subtracted, the subtraction is carried out by adding the 2's complement of $B$ [29]:

$$
\begin{equation*}
A-B=A+\bar{B}+1 \tag{2.20}
\end{equation*}
$$

As dual-rail gates offer inverted outputs, $\bar{B}$ is generated without additional inverter gates, which saves inverters in larger designs. The latency of Adiabatic Logic systems rises with the number of cascaded gates. Dual-rail signaling makes it possible to skip inverter stages and thus to reduce the number of cascaded gates, which decreases both the energy consumption and the latency in adiabatic systems.

### 2.5.2 Inherent Pipelining

In Fig. 2.5 the transport of information in an adiabatic circuit is sketched in the power-clock scheme. A cascade of adiabatic gates forms a pipeline. Each gate consists of a storage element and the logic blocks, so a gate acts like a latch in static CMOS with integrated logical functionality. Pipelining is thus inherent in Adiabatic Logic. In some cases pipelining eases the construction of a system. A critical path does not exist in Adiabatic Logic, as each path consists of one gate only. The power-clock itself enforces that input signals are valid as soon as a gate starts to evaluate its outputs. It is guaranteed that the succeeding gate starts to evaluate only after its inputs are stable. So no care has to be taken to avoid setup time or hold time failures; they are excluded by construction in the design of adiabatic circuits. On the other hand, care has to be taken that signals are synchronous at the time when they are further processed. An example (Fig. 2.9) shows the difference between static CMOS design and Adiabatic Logic if two signal paths converge. To synchronize input A and the output signal X of the AND gate, a buffer has to be inserted in the adiabatic implementation of the design example in Fig. 2.9.

Especially if arithmetic units are designed, carefully selecting suitable topologies is of great importance to avoid overhead due to synchronization stages.
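The synchronization overhead can be estimated by assigning a pipeline stage to every signal. The following sketch is one possible way to do this and is not a procedure taken from this work: it assumes that every gate adds exactly one stage, that primary inputs arrive at stage 0, and that the AND gate of Fig. 2.9 has inputs A and B; the output name Y of the second gate is hypothetical.

```python
def assign_stages(gates, primary_inputs):
    """Return the stage of every signal and the buffers needed per gate input.

    gates maps an output signal name to the list of its input signal names
    and must be given in topological order. Each gate adds one pipeline stage.
    """
    stage = {name: 0 for name in primary_inputs}
    buffers = {}
    for out, inputs in gates.items():
        latest = max(stage[s] for s in inputs)
        for s in inputs:
            if stage[s] < latest:
                # This input arrives early and must be delayed by buffers.
                buffers[(s, out)] = latest - stage[s]
        stage[out] = latest + 1
    return stage, buffers


# Assumed structure of Fig. 2.9: X = AND(A, B); a second gate combines A and X.
gates = {"X": ["A", "B"], "Y": ["A", "X"]}
stage, buffers = assign_stages(gates, ["A", "B"])
print("stages:", stage)
print("buffers:", buffers)
```

For this example a single buffer on A in front of the second gate is reported, which corresponds to the buffer shown in Fig. 2.9b.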

Fig. 2.10 Capacitive coupling between adjacent lines leads to crosstalk


### 2.5.3 Delay Considerations in Adiabatic Logic

The delay characterizes the time a signal needs to propagate through a path (path delay) or through a gate. In static CMOS it is crucial for high speed designs to observe critical paths and gate delays in order to avoid timing errors. The delay is determined by the ability of the driving transistors to source or sink current, and by the capacitance that has to be charged or discharged. It is approximated by a current source (if the transistor is in saturation) and the load capacitance. To operate an adiabatic gate in an energy efficient manner, the voltage drop between the rising or falling power-clock and the active output node has to be very small ($V_{DS} \approx 0$). Therefore, an operating frequency is chosen that is well below the maximum frequency allowed for a correct function of the gates. Thus, the gate delay in Adiabatic Logic is fixed: the full swing output signal is valid after $\frac{1}{4 f}$, where $f$ is the frequency of the power-clock.

### 2.5.4 The Power Supply Net in Adiabatic Logic: Crosstalk, $iR$-drop, $L \frac{d i}{d t}$-drop, Electromigration

**Crosstalk** Adjacent lines A and B (Fig. 2.10) will experience changes in the voltage level if a transition occurs on the neighboring line [29, 37].

The relation of the capacitance between the lines $C_{12}$ and the line capacitances $C_{1}$ and $C_{2}$, and the voltage swing of the transition determine the change in the voltage seen by the impacted line. If a voltage transition occurs on line A , and the swing is $\Delta v_{A}$, the change on line B is

$$
\begin{equation*}
\Delta v_{B}=\frac{C_{12}}{C_{2}+C_{12}} \Delta v_{A} \tag{2.21}
\end{equation*}
$$

if signal line B is a floating line. In static CMOS and Adiabatic Logic the disturbed lines will most likely not be floating. As soon as line B is actively driven (Fig. 2.11), the driver counteracts the deviation due to crosstalk and brings the voltage level on line B back to its original value. In [37] the equation for the deviation is given for the case that line B is connected to a driver:

$$
\begin{equation*}
\Delta v_{B}=\frac{C_{12}}{C_{T}} \frac{R_{1} C_{T}}{t_{r A}}\left(1-e^{-\frac{t_{r A}}{R_{1} C_{T}}}\right) \tag{2.22}
\end{equation*}
$$

Here $C_{T}=C_{1}+C_{2}$, $t_{r A}$ is the transition time of the signal swing $\Delta v_{A}$, and $R_{1}$ is the on-resistance of the driver connected to line B.

Fig. 2.11 Scheme for crosstalk if the disturbed line is actively driven


If the transition time $t_{r A}$ of the disturbing signal is increased, the voltage peak induced in the neighboring line is reduced: less charge is transferred per unit time, so the driver has more time to counteract the disturbance. In static CMOS the transition time is determined by the slew rate of the gate driving line A and the capacitances that have to be driven by this gate. In Adiabatic Logic the transition time $t_{r A}$ of a line is determined by the power-clock. If the adiabatic gate is operated in the frequency regime with the lowest dissipation per cycle, the transition time will be much longer than in static CMOS. Thus it can be concluded from (2.22) that Adiabatic Logic is less affected by crosstalk-induced voltage deviations.
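The influence of the transition time can be illustrated by evaluating the right-hand side of (2.22) as printed above, read here as a relative deviation. The capacitance and resistance values as well as the transition times are assumptions chosen only to show the trend.

```python
import math


def crosstalk_factor(c12, c_t, r1, t_ra):
    """Right-hand side of (2.22) as printed, for an actively driven line B."""
    tau = r1 * c_t
    return (c12 / c_t) * (tau / t_ra) * (1.0 - math.exp(-t_ra / tau))


# Illustrative values only (assumptions, not taken from the text).
C12 = 5e-15   # coupling capacitance in farad
C_T = 20e-15  # total capacitance C_T in farad
R1 = 10e3     # on-resistance of the driver of line B in ohm

for t_ra in (50e-12, 1e-9, 25e-9):
    factor = crosstalk_factor(C12, C_T, R1, t_ra)
    print(f"t_rA = {t_ra:.1e} s -> factor = {factor:.4f}")
```

For very short transitions the factor approaches $C_{12} / C_{T}$, whereas for transitions much longer than $R_{1} C_{T}$ it decays roughly as $R_{1} C_{12} / t_{r A}$.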

**$iR$-drop** Power supply lines have a non-zero resistance (Fig. 2.12), so a voltage drop occurs if a current is drawn from the power supply. In static CMOS circuits, current peaks occur when the registers are clocked, as then many gates switch simultaneously. Not only peak currents but also average currents lead to $iR$-drop; however, as the peak current is assumed to dominate over the average, these peaks determine the safety margins required for safe operation of electronic systems in static CMOS. A reduced supply voltage due to the $iR$-drop leads to an increased gate delay, and thus critical paths can fail to process data in time. An inverter that switches from low to high will first draw a current from the power supply that is the saturation current of the PMOS device:

$$
\begin{equation*}
I_{D S, s a t}=-\frac{k_{p}^{\prime}}{2} \frac{W}{L}\left(-V_{G S}+V_{t h, p}\right)^{2}\left(1-\lambda V_{D S}\right) \tag{2.23}
\end{equation*}
$$

At the beginning of the charging process, the maximum current will be drawn:

$$
\begin{equation*}
I_{p e a k, C M O S}=I_{D S, s a t}\left(V_{D D}\right)=-\frac{k_{p}^{\prime}}{2} \frac{W}{L}\left(V_{D D}+V_{t h, p}\right)^{2}\left(1-\lambda V_{D D}\right) \tag{2.24}
\end{equation*}
$$

Due to different paths within the gates composed in the logic core, the current waveform will get broader, and the peak will thus be reduced compared to the case where all gates switch simultaneously. The duration of such peaks with respect to the cycle time will show a great impact on the critical path delay. Even if the peak voltage drop is very short, and time is left where the regular $V_{D D}$ is seen by the gates, paths with a very critical timing can still fail.

Fig. 2.12 Voltage drops appear due to $iR$ and $L \frac{d i}{d t}$ on the supply lines of a circuit. Current peaks on the power line appear when the circuit's output is switched from the low to the high state

Adiabatic Logic circuits operate with relatively small currents (the transistors are in the linear region with a small voltage $\left|V_{DS}\right|$), and due to the four-phase power-clock, not all gates switch at once. Each phase has its own power line that only sees the current profile of a single phase. During the evaluate interval, an almost constant current is delivered to the gate. In the recover interval, the charge is recovered at a constant rate (in the ideal case) during the whole interval. In real ECRL gates there are non-adiabatic effects, during which transistors operate in the saturation region. This happens when the power-clock reaches the threshold voltage of the PMOS device and the gate's output abruptly rises to the present voltage of the power-clock. In ECRL, the maximum peak current is determined by this effect to be

$$
\begin{equation*}
I_{p e a k}=\frac{\Delta q}{t_{x}}=\frac{C_{L}\left(\left|V_{t h, p}+V_{0}\right|\right)}{t_{x}} \tag{2.25}
\end{equation*}
$$

where $t_{x}$ is the time the output takes to rise from the initial voltage level $V_{0}$ on $C_{L}$ to the voltage level $\left|V_{t h, p}\right|$. In (2.25) it is assumed that the charge $\Delta q$ is transferred via a constant current.
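To make (2.24) and (2.25) easier to compare, the following minimal Python sketch evaluates both peak currents. The function names and all numeric values are hypothetical and chosen only to exercise the formulas; they are not taken from the simulations in this work.

```python
def i_peak_cmos(k_p, w_over_l, v_dd, v_th_p, lam):
    """Magnitude of the static CMOS peak current, following (2.24)."""
    overdrive = v_dd + v_th_p  # V_th,p is negative for a PMOS device
    return abs(-0.5 * k_p * w_over_l * overdrive ** 2 * (1.0 - lam * v_dd))


def i_peak_ecrl(c_load, v_th_p, v_0, t_x):
    """ECRL peak current, following (2.25): charge C_L*|V_th,p + V_0| moved in t_x."""
    return c_load * abs(v_th_p + v_0) / t_x


if __name__ == "__main__":
    # Hypothetical device and load values (assumptions, not data from the text).
    cmos = i_peak_cmos(k_p=50e-6, w_over_l=4.0, v_dd=1.2, v_th_p=-0.4, lam=0.1)
    ecrl = i_peak_ecrl(c_load=10e-15, v_th_p=-0.4, v_0=0.0, t_x=1e-9)
    print(f"static CMOS peak: {cmos:.3e} A, ECRL peak: {ecrl:.3e} A")
```

The ECRL value depends strongly on the assumed $t_{x}$, so the sketch should be read as a way to explore the formulas rather than as a quantitative comparison.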

Both current peaks, the one in static CMOS as well as the one in (2.25), are saturation currents and are therefore proportional to the square of the overdrive voltage $\left(-V_{G S}+V_{t h, p}\right)^{2}$. In contrast to static CMOS, where the maximum overdrive voltage is applied, i.e. $V_{D D}+V_{t h, p}$ as in (2.24), in ECRL only a very small overdrive is seen at the PMOS device at the beginning of the evaluate interval; the peak current is thus only a small fraction of the current in static CMOS. Additionally, Adiabatic Logic circuits are not operated at a critical timing. Even larger fractions of the $i R$-drop will not impact the functionality of AL.

**$L \frac{d i}{d t}$-drop** If steep current peaks occur in a circuit design, inductance (Fig. 2.12) may play an important role, as a voltage drop of $\Delta V=L \frac{d i}{d t}$ is induced in the inductor. Added on top of the voltage drop due to the resistance of the line, this further decreases the power supply voltage at the circuit, and the delay of critical paths is further increased. As soon as inductances are in a regime where a noticeable voltage drop can be observed for the $\frac{d i}{d t}$ slopes seen in the circuit, this has to be accounted for in the safety margin for $V_{D D}$. In static CMOS, when instantaneous switching occurs, steep slopes of the current are expected. Adiabatic circuits do not draw such high peak currents, so the slopes $\frac{d i}{d t}$ are small compared to those in static CMOS.
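As a rough illustration of how the two contributions add up, the sketch below estimates the supply drop for a current that ramps linearly to its peak. The linear ramp, the function name and the numeric values are assumptions made for this example; treating the peak current and the maximum slope as coinciding is a pessimistic simplification.

```python
def supply_drop(i_peak, r_line, l_line, t_rise):
    """Estimate iR + L*di/dt for a current ramp reaching i_peak within t_rise."""
    di_dt = i_peak / t_rise            # slope of the assumed linear ramp
    return i_peak * r_line + l_line * di_dt


if __name__ == "__main__":
    # Hypothetical supply line: 2 Ohm, 1 nH, 10 mA peak current.
    fast = supply_drop(i_peak=10e-3, r_line=2.0, l_line=1e-9, t_rise=1e-9)
    slow = supply_drop(i_peak=10e-3, r_line=2.0, l_line=1e-9, t_rise=10e-9)
    print(f"steep edge: {fast * 1e3:.1f} mV, gentle edge: {slow * 1e3:.1f} mV")
```

With the same peak current, the gentler edge leaves only the resistive part of the drop significant, which is the qualitative argument made for Adiabatic Logic above.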

**Electromigration** Electromigration is a wear-out process on lines carrying currents with a high current density [29]. The effect is more likely to occur in lines where a strong unidirectional current flows, i.e. power supply lines in static CMOS circuits. Black [38] presents a relationship for the median time to failure (MTF):

$$
\begin{equation*}
\frac{1}{M T F}=A J^{2} e^{-\frac{\varphi}{k T}} \tag{2.26}
\end{equation*}
$$

The constant $A$ (which involves the cross-section area of the line), the current density $J$, the activation energy $\varphi$, and the temperature of the line $T$ determine the $M T F$. If the power is supplied in a bidirectional fashion, as in Adiabatic Logic, where charge is provided to the circuit and later recovered into the power supply on the same line, electromigration is strongly reduced [39]. To limit electromigration, line widths (or thicknesses) have to be increased in order to reduce the current density. The power supply lines of Adiabatic Logic are less likely to fail due to electromigration and can be sized smaller, which also results in less capacitance of the power supply net. Similarly, stepwise charging has been proposed in [40] for SRAM cells in order to reduce electromigration and Hot Carrier Injection.
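The dependence expressed by (2.26) can be explored with a short Python sketch. It only covers the $J^{2}$ and temperature terms of Black's relationship; the additional benefit of bidirectional current flow reported in [39] is not modeled. The function name and the numeric values are hypothetical.

```python
import math

K_BOLTZMANN_EV = 8.617333262e-5  # Boltzmann constant in eV/K


def median_time_to_failure(a_const, j, phi_ev, temp_k):
    """Invert (2.26): 1/MTF = A * J^2 * exp(-phi / (k*T))."""
    inverse_mtf = a_const * j ** 2 * math.exp(-phi_ev / (K_BOLTZMANN_EV * temp_k))
    return 1.0 / inverse_mtf


if __name__ == "__main__":
    # Hypothetical values; A is set to 1 because only ratios are of interest here.
    ref = median_time_to_failure(1.0, 1.0e10, 0.7, 350.0)
    half_j = median_time_to_failure(1.0, 0.5e10, 0.7, 350.0)
    print(f"MTF gain when J is halved: {half_j / ref:.2f}x")
```

Because $A$ cancels in the ratio, the sketch is suitable for comparing two sizing options of the same line without knowing the process constant.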

Due to the properties of Adiabatic Logic and the power-clock, Adiabatic Logic suffers less from cross-talk, $i R$-drop, $L \frac{d i}{d t}$-drop and electromigration. In static CMOS, due to the high peak currents, the power supply lines exhibit a higher peak $i R$-drop and a stronger voltage bounce due to the $L \frac{d i}{d t}$-drop; electromigration is also significantly higher due to the unidirectional current flow and the high current peaks. Adiabatic Logic thus allows for the design of a voltage supply network with fewer constraints than in static CMOS.

### 2.6 General Simulation Setup

Adiabatic Logic in this work is supplied with a trapezoidal waveform in most of the circuit simulations. To characterize a gate, a simulation environment is established that reproduces the conditions in a real system. The energy dissipated by static CMOS gates depends on the slope of the input signal and on the capacitive load at the output. In a real system, a gate sees input signals that are shaped by previous gates, and the output load is formed by the connected gates. Therefore two gates are connected at the inputs of the device under test to shape the input signal, and two gates are used at the output to provide a load.

In Fig. 2.13 such a simulation setup is displayed. A general Device Under Test (DUT) is fed with $N$ input signals shaped by two driver stages each. Idealized signals $r$ are inserted at the front interface of the simulation setup. In static CMOS, the two driver gates are used to provide an input signal with a realistic slope to the DUT. A realistic imbalance between rising and falling edge is introduced as well. The output signals of adiabatic gates differ from an ideal trapezoidal waveform for several reasons, i.e. non-adiabatic steps in the voltage, remaining charge on the nodes, and the voltage drop over the loading path during charging of the output. The $M$ outputs of the gate are each connected to two gates in series. The energy is measured for the DUT by integrating over the power $p(t)$ dissipated in the gate. Due to the energy transfer observed in PFAL gates [9], the energy introduced via the inputs or the outputs can also be taken into account by measuring the energy flow via those ports.
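The energy measurement by integration over $p(t)$ can be sketched as a simple post-processing step on sampled waveforms. The example assumes that the simulator exports time, voltage and current samples of one supply port as plain lists and that $p(t)=v(t) i(t)$ at that port; the function name and the data are hypothetical. For an adiabatic gate, recovered charge appears as negative power, so integrating over a full cycle yields the net dissipated energy.

```python
def dissipated_energy(t, v, i):
    """Integrate p(t) = v(t)*i(t) over the samples with the trapezoidal rule."""
    if not (len(t) == len(v) == len(i)):
        raise ValueError("t, v and i must have the same number of samples")
    energy = 0.0
    for k in range(1, len(t)):
        p_prev = v[k - 1] * i[k - 1]
        p_curr = v[k] * i[k]
        energy += 0.5 * (p_prev + p_curr) * (t[k] - t[k - 1])
    return energy


if __name__ == "__main__":
    # Hypothetical samples: 1 V and 1 mA held constant for 1 microsecond.
    t = [0.0, 0.5e-6, 1.0e-6]
    v = [1.0, 1.0, 1.0]
    i = [1e-3, 1e-3, 1e-3]
    print(f"energy: {dissipated_energy(t, v, i):.3e} J")
```

The same routine can be applied to the input and output ports of the DUT to account for the energy flow via those ports.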


Fig. 2.13 Simulation setup for an $N \times M$ ($N$ inputs, $M$ outputs) static CMOS or Adiabatic Logic gate characterization. The ideal input signals $r_{1}, \ldots, r_{N}$ are converted to realistic signals by two inverter/buffer gates. Connecting the outputs of the Device Under Test (DUT) to two further inverter/buffer gates allows for determining the energy dissipation with a realistic load

If not stated otherwise, gates are characterized with such a setup. Also for the simulation of larger systems, signal shaping is used to provide a realistic signal to the inputs.

## Chapter 3: Future Trend in Adiabatic Logic

Where does the road lead? The ongoing miniaturization according to Moore's Law [41] is leading to smaller, faster and more energy-efficient (with respect to energy per switching event) CMOS technologies. However, many new physical effects accompany this evolution and lead to problems that circuit designers have to be aware of. Leakage is becoming more and more dominant, so the threshold voltage does not decrease as fast as the nominal voltage from generation to generation. The gate overdrive voltage $V_{G S}-V_{t h}$ therefore decreases with ongoing miniaturization. In static CMOS this limits the performance improvement; in Adiabatic Logic it reduces the fraction of the charging process that is done adiabatically and increases the on-resistance of the transistors.

Novel devices have the potential to improve the ESF. One candidate is the Vertical Slit Field Effect Transistor (VESFET) [6], which allows for building highly regular designs. Due to its construction, it presents a smaller capacitance to the driving transistors. A comparison of $E_{A L}$ (see (2.7)) and $E_{C M O S}$ (see (2.3)) shows that Adiabatic Logic benefits more from a reduced capacitance than static CMOS does. A lot of literature can be found on carbon nanotubes (CNT), which promise ballistic transport due to one-dimensional carrier transport and thus reduced scattering. For gate lengths below 1 micron, single-walled carbon nanotubes (SWNT) show ballistic behavior and can reach the fundamental resistance limit of $G^{-1}=6.5 \mathrm{k} \Omega$ [42]. The chirality (the way a CNT is rolled) determines whether a CNT is metallic or semiconducting. CNTs offer a resistivity comparable to the best metal candidates and a mobility much higher than that of semiconductors, so decreased adiabatic losses can be expected. A lot of effort has been spent to build CNT-based Field Effect Transistors (CNTFET) with SWNTs [43-46]. Though first devices show promising behavior, many barriers to fabricating large-scale systems with CNTs have to be overcome.

Additional effects causing changes in the parameters (predominantly in $V_{t h}$) of semiconductor devices over time, caused by high electric fields and high operating temperatures, are a major concern for designers. Circuits degrade due to Hot Carrier Injection (HCI) [47] and Bias Temperature Instability (BTI) [48-54], and circuit performance declines over time, possibly leading to specification violations after a certain operating time. In contrast to static CMOS, where degradation will lead to malfunction in digital designs with critical timing, in Adiabatic Logic these effects will more or less influence only the energy dissipation. If the design is limited by a certain energy budget, this could also make the design fail after a couple of years with respect to the energy limits. But Adiabatic Logic gates are powered down each cycle due to the transient behavior of the power-clock, and thus experience lower stress. In a worst-case scenario in static CMOS, a device could be under maximum stress for a long time, leading to breakdown of the whole system.

Table 3.1 Nominal voltages for the 65 nm CMOS process and the PTM models [58]

| Node | 65 nm | 45 nm | 32 nm | 22 nm | 16 nm |
| :--- | :--- | :--- | :--- | :--- | :--- |
| $V_{D D, \text { nom }}[\mathrm{V}]$ | 1.1 | 1.0 | 0.9 | 0.8 | 0.7 |

### 3.1 Scaling Trends for Sub 90 nm Transistors

The International Technology Roadmap for Semiconductors (ITRS) [55] predicts that in high-performance applications the gate length of transistors reaches 24 nm in 2010. In [10] a scaling trend for Adiabatic Logic is presented that shows the development of the ESF based on the predictions in the ITRS. It is concluded that the relative change of the ESF will be almost constant from 2003 up to the 65 nm node in 2009, and that the optimum frequency will increase by $10 \%$ each year. The exception is 2004, where an improvement in the optimum frequency of almost $60 \%$ is gained, but on the downside the ESF is reduced by more than $15 \%$ [10].

Here a prediction based on simulations with model parameters from 65 nm down to 16 nm is performed. The 65 nm node is simulated with industrial low-power (LP) CMOS parameters. The LP process is not the best process option for Adiabatic Logic, as a high-performance (HP) process with a low $V_{t h}$ allows the adiabatic and the non-adiabatic losses to be reduced further. The optimum operating frequency is shifted to a lower frequency region due to the lower leakage losses in the LP process. The Predictive Technology Model (PTM) [56-58] of Arizona State University offers models from 45 nm down to 16 nm that are used to evaluate the savings of Adiabatic Logic with ongoing scaling. In order to operate the circuit with minimum energy consumption at a high frequency of around 100 MHz and above, the high-performance (HP) models are chosen.

Devices are chosen to be of minimum length $L=L_{\text {min }}$, and the widths of the PMOS and NMOS devices are related according to $W_{p}=\frac{3}{2} W_{n}$ to obtain balanced rise and fall times in static CMOS. The devices in the Adiabatic Logic counterparts are sized according to the same rule for comparison reasons. The nominal supply voltages for the process nodes considered are listed in Table 3.1. Simulations are performed for static CMOS, PFAL and ECRL inverter gates.

Fig. 3.1 Dissipated energy for the industrial 65 nm low-power (LP) CMOS process


Fig. 3.2 ESF for the industrial 65 nm low-power (LP) CMOS process


**Industrial 65 nm Low-Power (LP) Process** The energy consumption in Fig. 3.1 shows that no distinct leakage behavior in CMOS and ECRL can be observed in the investigated frequency regime. The losses of static CMOS are more or less constant over the whole simulated frequency range. PFAL is superior to ECRL with respect to the optimum frequency $\left(f_{\text {opt }, P F A L} \approx 10 \cdot f_{\text {opt }, E C R L}\right)$ as well as the minimum dissipated energy ($E_{\text {min, } E C R L} \approx 2.65 \cdot E_{\min , P F A L}$). In the adiabatic regime, PFAL dissipates less than the ECRL counterpart, due to the assisted charging via the logic blocks. At 100 MHz the energy dissipation of ECRL is approximately three times that of PFAL. Glancing at the ESF in Fig. 3.2, it is obvious that PFAL is the Adiabatic Logic family of choice at the 65 nm node. At 10 MHz an ESF of around 10 is observed for PFAL, whereas ECRL's maximum ESF appears at a lower frequency and is less than 5. At an operating frequency of 100 MHz, $E S F_{P F A L}=5.55$ and $E S F_{E C R L}=1.86$ are found. The low-power process is not optimum for Adiabatic Logic, as with high-performance transistors the losses in the adiabatic regime can be further reduced, and thus the ESF can be further improved.

Fig. 3.3 Dissipated energy for the PTM 45 nm high-performance (HP) process


Fig. 3.4 ESF for the PTM 45 nm high-performance (HP) process


**PTM 45 nm High-Performance (HP) Process** Further shrinking according to the predictions in the PTM models shows what can be expected for the future of Adiabatic Logic. Strongly increased leakage currents exist at the 45 nm node because, in contrast to the 65 nm process models used, these are parameters for a high-performance process. A clear shift of the optimum frequency can be observed in Fig. 3.3 to around 100 MHz for PFAL and approximately 50 MHz for ECRL. Minimum energy consumption is achieved with the PFAL adiabatic family $\left(E_{\min , E C R L} \approx 2.19 \cdot E_{\min , P F A L}\right)$. Because of the shift of $f_{\text {opt }}$ to higher frequencies, the maximum ESF for ECRL and PFAL are closer to 100 MHz compared to the 65 nm node. In Fig. 3.4 the maximum ESF $\left(E S F_{\max , E C R L}=2.39\right.$ and $\left.E S F_{\max , P F A L}=5.25\right)$ as well as the ESF at 100 MHz $\left(E S F_{E C R L} @ 100 \mathrm{MHz}=2.18\right.$ and $\left.E S F_{P F A L} @ 100 \mathrm{MHz}=5.08\right)$ can be determined. In the leakage-dominated regime below 3 MHz, static CMOS shows a better leakage behavior with reduced energy dissipation compared to AL. This may be due to model insufficiencies, as investigations have shown that Adiabatic Logic with the four-phase power-clock is by definition less impacted by leakage currents [10].

Fig. 3.5 Dissipated energy for the PTM 32 nm high-performance (HP) model


Fig. 3.6 ESF for the PTM 32 nm high-performance (HP) model


**PTM 32 nm High-Performance (HP) Process** Adiabatic Logic shows a reduced $f_{\text {opt }}$ for 32 nm compared to the results with the PTM 45 nm HP process, and no dominant leakage current dissipation is seen for static CMOS and ECRL for the investigated frequency range in Fig. 3.5. For ECRL and PFAL the respective optimum frequencies are $f_{\text {opt }, E C R L}=5 \mathrm{MHz}$ and $f_{\text {opt }, P F A L}=10 \mathrm{MHz}$. Figure 3.6 shows the ESFs for ECRL and PFAL. ECRL at the optimum frequency consumes approximately twice the minimum energy of PFAL. The maximum ESF for ECRL is $E S F_{\max , E C R L}=4.64$, and for PFAL it is $E S F_{\max , P F A L}=7.45$. At 100 MHz ECRL and PFAL have ESFs of $E S F_{E C R L} @ 100 \mathrm{MHz}=2.11$ and $E S F_{P F A L} @ 100 \mathrm{MHz}=4.74$.

**PTM 22 nm High-Performance (HP) Process** For the 22 nm node of the PTM, too, the leakage current does not dominate the losses of CMOS and ECRL in the examined frequency regime in Fig. 3.7. What can be observed in this plot is the functional limit of the PFAL family. At frequencies above 500 MHz a clear reduction of the energy dissipation is observed in PFAL. This indicates a malfunction of the circuit caused by the very low supply voltage of 0.8 V at the 22 nm node.

Fig. 3.7 Dissipated energy for the PTM 22 nm high-performance (HP) model


Fig. 3.8 ESF for the PTM 22 nm high-performance (HP) model


No shift of the optimum operating frequency can be observed going from 32 nm to 22 nm; $f_{\text {opt }}$ is still around 5 MHz for ECRL and 10 MHz for PFAL. A ratio of $E_{\min , E C R L}=1.91 \cdot E_{\min , P F A L}$ is found at this technology node. According to Fig. 3.8, $E S F_{\max , E C R L}=3.62$ and, at 100 MHz, $E S F_{E C R L}=1.84$ are found. For PFAL the corresponding values are $E S F_{\max , P F A L}=6.68$ and, at 100 MHz, $E S F_{P F A L}=3.78$.

**PTM 16 nm High-Performance (HP) Process** The excessive growth of leakage currents in the circuits with the PTM 16 nm (HP) model allows no prediction for the evolution of the energy dissipation and the ESF for Adiabatic Logic. As soon as dynamic losses are negligible compared to the leakage losses, reducing the dynamic losses will not reduce the energy consumption significantly. However, the excessive growth of the leakage current will very likely not be seen in industry technologies. According to the ITRS 2009 roadmap [59] (High-Performance Logic Technology Requirements), the leakage current $I_{D S, \text { leak }}$ will stay more or less constant over the different technology nodes. No conclusions can be drawn for the development of Adiabatic Logic with the PTM 16 nm high-performance parameters.

Table 3.2 Overview of optimum frequency $f_{\text {opt }}$, energy minimum $E_{\min }$, the maximum energy saving factor $E S F_{\max }$ and the $E S F$ at the operating frequency of 100 MHz. The development due to scaling according to the predictions in the PTM models can be observed

|  | ECRL |  |  |  |
| :--- | :--- | :--- | :--- | :--- |
|  | $f_{\text {opt }}[\mathrm{MHz}]$ | $E_{\text {min }}[\mathrm{aJ}]$ | $E S F_{\text {max }}$ | $E S F @ 100 \mathrm{MHz}$ |
| industrial 65 nm (LP) | 1-2 | 93 | 4.55 | 1.86 |
| PTM 45 nm (HP) | 20-50 | 134 | 2.39 | 2.18 |
| PTM 32 nm (HP) | 5 | 43 | 4.64 | 2.11 |
| PTM 22 nm (HP) | 5 | 23 | 3.62 | 1.84 |

|  | PFAL |  |  |  |
| :--- | :--- | :--- | :--- | :--- |
|  | $f_{\text {opt }}[\mathrm{MHz}]$ | $E_{\text {min }}[\mathrm{aJ}]$ | $E S F_{\text {max }}$ | $E S F @ 100 \mathrm{MHz}$ |
| industrial 65 nm (LP) | 5-10 | 35 | 10.29 | 5.55 |
| PTM 45 nm (HP) | 50-100 | 61 | 5.25 | 5.08 |
| PTM 32 nm (HP) | 10-20 | 22 | 7.45 | 4.74 |
| PTM 22 nm (HP) | 10-20 | 12 | 6.68 | 3.78 |

**Overview of Scaling Trend** In Table 3.2, the characteristic values for the scaling results are summarized. The optimum frequency $f_{\text {opt }}$ is the frequency where the minimum in the energy dissipation $E_{\text {min }}$ is found in Adiabatic Logic. The maximum energy saving factor $E S F_{\max }$ and the energy saving factor at 100 MHz are given. According to the simulation results with the PTM models, the optimum frequency decreases from the 45 nm node to the 32 nm node for ECRL and PFAL. Simulations with the industrial 65 nm process parameters show that the minimum energy per cycle $E_{\min }$ is less than that of the PTM model for the 45 nm node. A high threshold voltage yields the lowest energy dissipation due to the exponential decrease of leakage losses with increasing $V_{t h}$. But within the adiabatic regime, losses are higher than in the corresponding high-performance process. Thus it can be expected that with a high-performance 65 nm process $E_{\min }$ increases and $E S F_{\max }$ is reduced, while $f_{\text {opt }}$ and the ESF @ 100 MHz are improved. Even though the maximum ESF for ECRL and PFAL fluctuates over the different nodes, the ESF @ 100 MHz stays almost constant down to the PTM 32 nm (HP) node. Only for the PTM 22 nm (HP) node does the ESF @ 100 MHz degrade for PFAL.
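The energy ratios between ECRL and PFAL quoted for the individual nodes can be cross-checked against the $E_{\min }$ columns of Table 3.2 with a few lines of Python. The sketch assumes that the table entries are given in attojoule; the dictionary layout and function name are choices made for this example.

```python
# E_min values read from Table 3.2, interpreted as attojoule (assumption).
E_MIN_AJ = {
    "industrial 65 nm (LP)": {"ECRL": 93.0, "PFAL": 35.0},
    "PTM 45 nm (HP)": {"ECRL": 134.0, "PFAL": 61.0},
    "PTM 32 nm (HP)": {"ECRL": 43.0, "PFAL": 22.0},
    "PTM 22 nm (HP)": {"ECRL": 23.0, "PFAL": 12.0},
}


def emin_ratios(table):
    """Return E_min,ECRL / E_min,PFAL for every technology node."""
    return {node: values["ECRL"] / values["PFAL"] for node, values in table.items()}


if __name__ == "__main__":
    for node, ratio in emin_ratios(E_MIN_AJ).items():
        print(f"{node}: E_min,ECRL / E_min,PFAL = {ratio:.2f}")
```

The computed ratios agree with the factors of roughly 2.65, 2.19, 2 and 1.91 stated in the text to within the rounding of the table entries.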

The Adiabatic Logic families ECRL and PFAL prove their low power dissipation even at strongly scaled device dimensions, though the expected maximum saving factors are degraded. Based on the predictions of the PTM, the ESF at the frequency of interest $(100 \mathrm{MHz})$ stays more or less constant, with a noticeable decrease only for PFAL with the PTM 22 nm (HP) model parameters. Circuits remain functional into the GHz regime, with one exception: PFAL fails for frequencies above 500 MHz at the PTM 22 nm (HP) node. As technologies have always been adjusted during their development for production, it is to be expected that these results for 22 nm and 16 nm will change when mature processes are available.

### 3.2 Adiabatic Logic with Novel Devices

Novel devices are proposed to complement or replace planar bulk MOS devices in order to overcome barriers in integration and performance as well as Short Channel Effects (SCE). SCEs like the $V_{t h}$ Roll-off (VTRO) and the Drain-induced Barrier Lowering (DIBL) lead to a reduced threshold voltage and thus to increased leakage currents. Fully-depleted SOI (FD-SOI) and multi-gate FETs (MUGFET) are novel transistor concepts to cope with SCE. Non-silicon based devices like Carbon Nanotubes (CNT) promise advantageous electrical characteristics like a high conductivity and a high current transport capability, as well as a beneficial thermal behavior. Adiabatic Logic with CNT-based transistors is inspected in Sect. 3.2.2. The Vertical Slit Field Effect Transistor (VESFET) is a highly regular structure that allows for a regular circuit design. In contrast to the MOSFET, it is based on the modulation of depletion widths within a bulk, which controls the current carried by majority carriers. How Adiabatic Logic performs with the VESFET device is explained in Sect. 3.2.3. But first, based on the equations describing losses in AL, it is derived what an ideal device for Adiabatic Logic should look like.

### 3.2.1 What Should an Ideal (Novel) Device for Adiabatic Logic Look Like?

A first glance at the characteristic equation for adiabatic losses reveals the factors that will impact the energy consumption of Adiabatic Logic within the frequency regime, where adiabatic losses dominate the overall consumption. In Sect. 2.1 adiabatic losses are derived as

$$
\begin{equation*}
E_{A L}=2 \frac{R C}{T} C V_{D D}^{2}=8 R C f C V_{D D}^{2} \tag{3.1}
\end{equation*}
$$

Basically it is seen that the energy is decreased if the on-resistance $R$, the capacitive load $C$, the supply voltage $V_{D D}$ or the operating frequency $f=\frac{1}{4 T}$ are decreased. In Sect. 2.4 voltage scaling is discussed, which is also applicable in Adiabatic Logic, where scaling of the supply voltage has a linear impact on the energy dissipation. Trading speed (reduced frequency) for a lowered energy dissipation is of course the principle of Adiabatic Logic. Two parameters, that are evidently impacted by the device are the on-resistance and the capacitive load. Whether the intrinsic on-resistance or capacitance of the device plays a role in the overall consumption is dependent on technological parameters as well as on circuit and layout. Let $R_{\text {gate }}$ be the on-resistance of the gate and $R_{i c}$ be the resistance due to interconnects of the power-clock, of inter- and intragate connections. Then the overall resistance value is

$$
\begin{equation*}
R=R_{\text {gate }}+R_{i c} . \tag{3.2}
\end{equation*}
$$

The overall capacitance seen by the power-clock can also be subdivided into different fractions. Let $C_{\text {gate }}$ be the intrinsic gate capacitance that also covers the input capacitance of the devices driven by the gate under consideration, and let the interconnect capacitances be summed up in $C_{i c}$. The overall capacitance seen by the gate therefore is

$$
\begin{equation*}
C=C_{g a t e}+C_{i c} . \tag{3.3}
\end{equation*}
$$

Only if the term $R C^{2}$ in (3.1) is dominated by the device's intrinsic parasitics $R_{\text {gate }}$ or $C_{\text {gate }}$ will improvements of the device impact the energy consumption in the adiabatic frequency regime.
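The statement about the dominance of the intrinsic parasitics can be illustrated by evaluating (3.1) with the decompositions (3.2) and (3.3). In the following Python sketch the function name and all numeric values are hypothetical; it merely shows how much of a device improvement reaches the adiabatic loss when interconnect resistance is present.

```python
def adiabatic_energy(r_gate, r_ic, c_gate, c_ic, f, v_dd):
    """Adiabatic loss per (3.1) with R = R_gate + R_ic and C = C_gate + C_ic."""
    r = r_gate + r_ic
    c = c_gate + c_ic
    return 8.0 * r * c * f * c * v_dd ** 2


if __name__ == "__main__":
    # Hypothetical case: interconnect resistance equal to the gate resistance.
    base = adiabatic_energy(1e3, 1e3, 1e-15, 0.5e-15, 100e6, 1.0)
    improved = adiabatic_energy(0.5e3, 1e3, 1e-15, 0.5e-15, 100e6, 1.0)
    print(f"halving R_gate leaves {improved / base:.0%} of the adiabatic loss")
```

In this assumed case, halving $R_{\text {gate }}$ lowers the loss by only a quarter, because the unchanged $R_{i c}$ contributes half of the total resistance.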

**The Device's On-resistance** Adiabatic Logic is mainly operated in the linear region. The equation describing the on-resistance of an NMOS device is given as

$$
\begin{equation*}
R_{o n}=\frac{V_{D S}}{I_{D S}}=\frac{1}{k_{n}^{\prime} \frac{W}{L}\left(V_{G S}-V_{t h, n}\right)}, \tag{3.4}
\end{equation*}
$$

with $k_{n}^{\prime}=\mu_{o x} C_{o x}$ and $C_{o x}=\epsilon_{o x} / t_{o x}$. Thus the on-resistance is

$$
\begin{equation*}
R_{o n}=\frac{t_{o x}}{\mu_{o x} \epsilon_{o x}} \frac{L}{W}\left(V_{G S}-V_{t h, n}\right)^{-1} \tag{3.5}
\end{equation*}
$$

The first term of (3.5) is dependent on the process technology. Width $W$ and length $L$ of the device are accessible to the designer. The supply voltage determines $V_{G S}$ and the threshold voltage is on the one hand dependent on the process technology, the device option (selection of a certain device with low- or high- $V_{t h}$ ) and also is subject to the designer due to the $V_{t h}$ Roll-off. The focus here is set on new devices and/or materials, and structures of devices that apply a MIS-structure. Thus $C_{o x}$ can be increased in order to improve the on-resistance. But this will also lead to an increased intrinsic gate capacitance per unit area. New materials and/or devices are under development, that will decrease the on-resistance by improving the mobility of the carriers. III-V semiconductor compounds have a 50 to 100 times higher electron mobility compared to silicon [60]. Thus they allow for very high speed NMOS devices. Some major issues have to be overcome before compound semiconductors can be used in digital circuits. One major issue is the improvement of the hole mobility in III-V compounds [60] to offer also high speed PMOS FETs for static CMOS and Adiabatic Logic configurations.
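The individual dependencies in (3.5) can be tried out with a small Python function. The symbol `mu` stands for the mobility written as $\mu_{o x}$ in (3.5); the function name, the relative permittivity of the oxide and all numeric values are assumptions for this example.

```python
EPS_0 = 8.8541878128e-12  # vacuum permittivity in F/m


def on_resistance(t_ox, mu, eps_r_ox, w, l, v_gs, v_th):
    """Linear-region on-resistance following (3.5)."""
    if v_gs <= v_th:
        raise ValueError("(3.5) is only meaningful for a positive overdrive voltage")
    c_ox = EPS_0 * eps_r_ox / t_ox          # oxide capacitance per unit area
    return (l / w) / (mu * c_ox * (v_gs - v_th))


if __name__ == "__main__":
    # Hypothetical device: 2 nm oxide, mobility 0.03 m^2/Vs, W = 200 nm, L = 65 nm.
    r_ref = on_resistance(2e-9, 0.03, 3.9, 200e-9, 65e-9, 1.1, 0.4)
    r_mu = on_resistance(2e-9, 0.06, 3.9, 200e-9, 65e-9, 1.1, 0.4)
    print(f"reference: {r_ref:.0f} Ohm, doubled mobility: {r_mu:.0f} Ohm")
```

Doubling the mobility halves the on-resistance without touching $C_{o x}$, which is why mobility enhancement is attractive compared to increasing the oxide capacitance.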

A candidate investigated in this work in Sect. 3.2.2 is the Carbon Nanotube (CNT), which allows for the assembly of FET devices with a high mobility and a good scalability. Here a major issue is the controlled and structured assembly of devices to make large-scale integration possible.

**The Device's Capacitance** The capacitance in a MOSFET device is given by different sandwich and junction capacitances. The gate stack forms overlap capacitances $C_{G S 0}=C_{G D 0}\left(=C_{o x} x_{d} W\right)$ to source and drain, which are independent of the applied voltages and arise from the under-diffusion of source and drain by $x_{d}$. The gate capacitance formed by the gate stack with the bulk material is dependent on the terminal voltages at the device. Additionally, junction capacitances between source/drain and the bulk material add to the capacitive load within a gate.

As $C_{o x}$ is fundamental to the working principle of a MIS structure, it can only be reduced (not avoided) to diminish capacitive loading. But this will also affect the on-resistance (see (3.5)), as $C_{o x}$ is responsible for the charge in the channel. If the specific oxide capacitance $C_{o x}$ is smaller, the channel cannot be controlled to the same extent; effectively this is an increase in the threshold voltage $V_{t h}$. The supply voltage $V_{D D}$ then has to be raised to increase the overdrive voltage $V_{G S}-V_{t h, n}$ and thereby decrease the on-resistance, which in turn increases the energy consumption due to $V_{D D}^{2}$. Nevertheless, the on-resistance scales as $R \propto \frac{1}{C_{o x}}$ and the overall capacitance as $C \propto C_{o x}$. As $E_{D i s s, A L} \propto R C^{2}$, reducing the capacitance value $C_{o x}$ and accepting the consequences for the on-resistance could overall decrease the energy consumption in the adiabatic regime. Junction capacitances are formed by the pn-junctions at the source and drain diffusion areas of a MOSFET device. The junction capacitance of a pn-junction with $N_{A}$ and $N_{D}$ as acceptor and donor density, respectively, is given by

$$
\begin{equation*}
C_{j}=\frac{\epsilon_{0} \epsilon_{r} A}{\sqrt{\frac{2 \epsilon}{e}\left(V_{D}-V\right)\left(\frac{1}{N_{A}}+\frac{1}{N_{D}}\right)}} . \tag{3.6}
\end{equation*}
$$

Reducing the junction capacitance can be done by adapting technological parameters but also by decreasing the area $A$. The area is controlled by the device engineer as well as by the circuit designer. The circuit designer can use minimum width devices, as the area is proportional to the width $W$ of the device. On device level, silicon-on-insulator (SOI) is a concept that promises a performance increase due to decreased junction capacitances. An insulator directly underlies the source and drain regions, thus the bottom areas of the source and drain diffusions are insulated from the bulk, avoiding the formation of junction capacitances. Another interesting device, the Vertical Slit FET (VESFET), is investigated with Adiabatic Logic in Sect. 3.2.3. Due to the absence of junction capacitances, the device is inherently superior to planar bulk CMOS devices with respect to intrinsic device capacitance at the source and drain regions. Additionally, the series connection of gate and depletion capacitance of the device reduces the effective gate capacitance.
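The influence of the area and of the applied voltage on (3.6) can be checked numerically. The sketch below assumes $\epsilon=\epsilon_{0} \epsilon_{r}$ in the square root of (3.6); the function name, doping densities and geometry are hypothetical values picked for illustration.

```python
import math

EPS_0 = 8.8541878128e-12      # vacuum permittivity in F/m
Q_ELECTRON = 1.602176634e-19  # elementary charge in C


def junction_capacitance(area, eps_r, v_d, v, n_a, n_d):
    """pn-junction capacitance following (3.6), densities in 1/m^3."""
    if v >= v_d:
        raise ValueError("(3.6) requires V < V_D")
    eps = EPS_0 * eps_r
    depletion_width = math.sqrt(
        2.0 * eps / Q_ELECTRON * (v_d - v) * (1.0 / n_a + 1.0 / n_d)
    )
    return eps * area / depletion_width


if __name__ == "__main__":
    # Hypothetical silicon junction: 0.1 um^2 area, V_D = 0.8 V, zero bias.
    c_j = junction_capacitance(1e-13, 11.7, 0.8, 0.0, 1e23, 1e26)
    print(f"C_j = {c_j * 1e15:.3f} fF")
```

Halving the area halves $C_{j}$, whereas the applied voltage enters only through the square root, so geometry is the more direct lever for the circuit designer.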

**Impact of Reducing On-resistance and/or Capacitance** Reducing either the on-resistance or the capacitive load affects the energy dissipation in different frequency regimes. According to the equations presented in Sect. 2.3, the on-resistance reduction impacts only the adiabatic losses. This statement is based on the assumption that the on-resistance can be reduced without changing the threshold voltage $V_{t h}$. When the threshold voltage is altered to gain a reduced on-resistance, the non-adiabatic losses will be impacted according to (2.11). For static CMOS, reducing the on-resistance has no direct impact on the energy consumption in the frequency regime dominated by dynamic losses. But it will improve the performance and allow the device to be run with a reduced $V_{D D}$.

In contrast, a reduced capacitance will affect both static CMOS and Adiabatic Logic. The energy consumption in static CMOS scales linearly with $C$. In Adiabatic Logic two loss mechanisms are affected by a reduced capacitance, namely the non-adiabatic $(\propto C)$ and the adiabatic losses $\left(\propto C^{2}\right)$.

Inserting the different loss mechanisms into the definition of the ESF in (2.8) leads to

$$
\begin{equation*}
E S F=\frac{V_{D D} I_{\text {leak }} f^{-1}+\frac{1}{2} C V_{D D}^{2}}{V_{D D} \overline{I_{\text {leak }}} f^{-1}+\frac{1}{2} C V_{t h}^{2}+8 R C f C V_{D D}^{2}} . \tag{3.7}
\end{equation*}
$$

Short-circuit currents for static CMOS have been neglected in this equation. $I_{\text {leak }}$ is the leakage current in the static CMOS gate, that is constant during the whole cycle. $\overline{I_{l e a k}}$ is the mean value of the leakage current in Adiabatic Logic, as due to the power-clock it is a function of time. In the leakage dominated regime (3.7) is reduced to

$$
\begin{equation*}
E S F_{\text {leakage }}=\frac{I_{\text {leak }}}{\overline{I_{\text {leak }}}} \tag{3.8}
\end{equation*}
$$

The ratio is dependent only on the ratio of the mean leakage currents, and is determined by the shape of the power-clock signal [61]. For the high frequency regime, leakage losses and non-adiabatic losses are negligible. Equation (3.7) is reduced to

$$
\begin{equation*}
E S F_{\text {adiabatic }}=\frac{1}{16 R C f} \tag{3.9}
\end{equation*}
$$

Improvements in the ESF within the adiabatic regime are achieved when the loading path resistance $R$, the load capacitance $C$ and the frequency (with the restriction that the frequency is limited to the regime where adiabatic losses dominate) are reduced. The optimum frequency $f_{\text {opt }}$ for Adiabatic Logic is derived by equating the adiabatic losses and the leakage losses, as the minimum of their sum is found at the intersection of both.

$$
\begin{equation*}
V_{D D} \overline{I_{l e a k}} f^{-1} \equiv 8 R C f C V_{D D}^{2} \tag{3.10}
\end{equation*}
$$

As frequency is positive, the optimum frequency is found at

$$
\begin{equation*}
f_{o p t}=\sqrt{\frac{\overline{I_{l e a k}}}{8 R C^{2} V_{D D}}} . \tag{3.11}
\end{equation*}
$$

All three parameters, $R$, $C$ and the supply voltage $V_{DD}$, affect $f_{opt}$; in addition, $V_{DD}$ directly affects $\overline{I_{\text{leak}}}$.
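The balance condition (3.10) and its solution (3.11) can be checked numerically. The sketch below is only an illustration: the function names and all parameter values are assumptions introduced here, and the dependence of the mean leakage current on $V_{DD}$ is ignored.

```python
import math

# Assumed placeholder values, not taken from the text.
V_DD = 1.0             # supply voltage in V
C = 1e-15              # load capacitance in F
R = 10e3               # charging path resistance in Ohm
I_LEAK_MEAN = 0.5e-9   # mean leakage current in Adiabatic Logic in A


def f_opt(r, c, v_dd, i_leak_mean):
    """Optimum frequency following (3.11)."""
    return math.sqrt(i_leak_mean / (8 * r * c**2 * v_dd))


def leakage_energy(f, v_dd, i_leak_mean):
    """Left-hand side of (3.10): leakage energy per cycle."""
    return v_dd * i_leak_mean / f


def adiabatic_energy(f, r, c, v_dd):
    """Right-hand side of (3.10): adiabatic energy per cycle."""
    return 8 * r * c * f * c * v_dd**2


def leak_plus_adiabatic(f):
    """Sum of the two frequency-dependent loss terms for the assumed values."""
    return leakage_energy(f, V_DD, I_LEAK_MEAN) + adiabatic_energy(f, R, C, V_DD)


f0 = f_opt(R, C, V_DD, I_LEAK_MEAN)
print(f"f_opt = {f0:.3e} Hz")
print(f"leakage energy   = {leakage_energy(f0, V_DD, I_LEAK_MEAN):.3e} J")
print(f"adiabatic energy = {adiabatic_energy(f0, R, C, V_DD):.3e} J")
```

At `f0` both energy terms coincide, and the sum of the two terms is larger at any other frequency, e.g. at `0.5 * f0` or `2 * f0`.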

Next, the maximum energy saving factor $ESF_{\max}$ is calculated. It is presumed that leakage losses in static CMOS are negligible at the frequency $f_{\text{opt}}$, where $ESF_{\max}$ is expected. At $ESF_{\max}$, the leakage losses and the adiabatic losses in Adiabatic Logic are equal. Thus, from (3.7) it follows that

$$
\begin{equation*}
E S F_{\max }=\frac{1}{\left(\frac{V_{t h}}{V_{D D}}\right)^{2}+32 R C f_{o p t}} \tag{3.12}
\end{equation*}
$$

Fig. 3.9 A reduced charging path resistance only impacts the adiabatic losses. Static CMOS does not benefit directly in its energy consumption from a change in the on-resistance, but its performance is increased


Inserting (3.11) into (3.12) leads to

$$
\begin{equation*}
E S F_{\max }=\frac{1}{\left(\frac{V_{\text {th }}}{V_{D D}}\right)^{2}+32 R \sqrt{\frac{\bar{I}_{\text {leak }}}{8 R V_{D D}}}} \tag{3.13}
\end{equation*}
$$

The capacitance has no impact on the maximum achievable savings, whereas reducing the resistance increases $ESF_{\max}$. Within the adiabatic regime, a reduction of the capacitance of course leads to a higher ESF according to (3.9), and according to (3.11) it shifts $f_{\text{opt}}$ to higher frequencies.
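The independence of $ESF_{\max}$ from the capacitance can be illustrated with a few lines of Python. This is a minimal sketch under assumed parameter values (all numbers and function names are hypothetical); it evaluates (3.12) at the optimum frequency from (3.11) and keeps the mean leakage current fixed.

```python
import math


def f_opt(r, c, v_dd, i_leak_mean):
    """Optimum frequency following (3.11)."""
    return math.sqrt(i_leak_mean / (8 * r * c**2 * v_dd))


def esf_max(r, c, v_dd, v_th, i_leak_mean):
    """Maximum ESF: (3.12) evaluated at f_opt, which is equivalent to (3.13)."""
    f = f_opt(r, c, v_dd, i_leak_mean)
    return 1.0 / ((v_th / v_dd)**2 + 32 * r * c * f)


# Assumed placeholder values, not taken from the text.
base = dict(r=10e3, c=1e-15, v_dd=1.0, v_th=0.3, i_leak_mean=0.5e-9)
half_c = dict(base, c=0.5 * base["c"])    # capacitance halved
fifth_r = dict(base, r=0.2 * base["r"])   # on-resistance reduced to 20 %

for name, p in (("baseline", base), ("C halved", half_c), ("R to 20 %", fifth_r)):
    f = f_opt(p["r"], p["c"], p["v_dd"], p["i_leak_mean"])
    print(f"{name:10s} f_opt = {f:.3e} Hz  ESF_max = {esf_max(**p):.3f}")
```

Halving `c` doubles `f_opt` but leaves `esf_max` unchanged, whereas the reduced `r` raises both values.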

A MATLAB calculation is performed to verify these results. For this purpose, leakage losses and dynamic losses are calculated for static CMOS, and leakage losses, non-adiabatic losses, and adiabatic losses for Adiabatic Logic. All energy losses are summed up for the respective family and then the ESF is calculated. In the first example the on-resistance $R$ is reduced (without any impact on $V_{th}$, e.g. by mobility enhancement) to only $20 \%$ of the original value. An impact on the energy dissipation is observed in Fig. 3.9 only for Adiabatic Logic. Static CMOS only experiences increased performance. Looking at the ESF in Fig. 3.10, it is seen that both $ESF_{\max}$ and $f_{\text{opt}}$ are increased by altering $R$.

In contrast, the lowered energy dissipation due to a reduction of the capacitance by $\frac{1}{3}$, presented in Fig. 3.11, is seen in the overall consumption of static CMOS as well as Adiabatic Logic. As expected from (3.13), no improvement is observed for $ESF_{\max}$, but the optimum frequency $f_{opt}$ is shifted to higher frequencies in Fig. 3.12.

Sensitivity of $f_{\text{opt}}$ to Changes in the On-resistance and Capacitance As both resistive and capacitive changes impact the optimum frequency, the quantitative impact is of interest: which parameter has the stronger impact? From (3.11), the sensitivities of $f_{\text{opt}}$ to $R$ and $C$ are derived as

$$
\frac{\delta f_{opt}}{\delta R}=-\frac{1}{2 C} \sqrt{\frac{\overline{I_{\text{leak}}}}{8 V_{DD}}} R^{-1.5},
$$

$$
\frac{\delta f_{opt}}{\delta C}=-\frac{1}{\sqrt{R}} \sqrt{\frac{\overline{I_{\text{leak}}}}{8 V_{DD}}} C^{-2}.
$$

Fig. 3.10 The $ESF$ is increased by decreasing $R$. With a reduced $R$ the optimum operating frequency is affected as well as the maximum value of the ESF

Fig. 3.11 Impact of a reduced capacitance on the energy dissipation of static CMOS and AL. Both are affected by the decreased capacitive load

Fig. 3.12 The ESF is impacted by the reduced capacitance value. Savings are shifted to higher frequencies, the maximum energy saving is not affected

Fig. 3.13 If the capacitance is decreased $(X=C)$, the optimum frequency is more impacted when compared to a resistive change ( $X=R$ )

The MATLAB plot in Fig. 3.13 shows that reducing the capacitance impacts the optimum frequency $f_{\text{opt}}$, referred to $f_{\text{opt},0}$ (the optimum frequency for the initial values of $R$ and $C$), much more than reducing the resistance. If the optimum frequency is the criterion of interest, reducing the capacitive load is much more effective. On the other hand, a reduction of $R$ also affects the maximum energy saving factor $ESF_{\max}$.
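The different sensitivities follow directly from (3.11), where $f_{opt} \propto R^{-1/2} C^{-1}$. The following Python sketch, which is not the MATLAB script used for Fig. 3.13, computes the normalized optimum frequency when either $R$ or $C$ is scaled by a common factor. The function names and the default parameter values are assumptions made for illustration only.

```python
import math


def f_opt(r, c, v_dd, i_leak_mean):
    """Optimum frequency following (3.11)."""
    return math.sqrt(i_leak_mean / (8 * r * c**2 * v_dd))


def f_opt_ratio(scale, parameter, r=10e3, c=1e-15, v_dd=1.0, i_leak_mean=0.5e-9):
    """Return f_opt / f_opt,0 when R or C is multiplied by `scale`.

    The default values of r, c, v_dd and i_leak_mean are assumed placeholders;
    the ratio itself does not depend on them.
    """
    f0 = f_opt(r, c, v_dd, i_leak_mean)
    if parameter == "R":
        return f_opt(scale * r, c, v_dd, i_leak_mean) / f0
    if parameter == "C":
        return f_opt(r, scale * c, v_dd, i_leak_mean) / f0
    raise ValueError("parameter must be 'R' or 'C'")


for scale in (1.0, 0.8, 0.6, 0.4, 0.2):
    print(f"X scaled to {scale:.1f}:  "
          f"X=R -> {f_opt_ratio(scale, 'R'):.2f}   "
          f"X=C -> {f_opt_ratio(scale, 'C'):.2f}")
```

For every scale factor below one, the ratio for `X=C` exceeds the ratio for `X=R`, since the capacitance enters (3.11) with twice the exponent of the resistance.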

Both the capacitance and the resistance play a role when it comes to devising a device that is suitable for Adiabatic Logic. Put differently, new technologies can be rated by looking at the development of their intrinsic device capacitance and resistance. Two novel devices are simulated in the remainder of this section. First, Carbon Nanotubes (CNT) are considered, which mainly benefit from a strongly increased carrier mobility $\mu$ and thus a reduced on-resistance. Then the Vertical Slit Field Effect Transistor (VESFET) is considered, which has a reduced intrinsic capacitance compared to its bulk CMOS counterpart.

### 3.2.2 Adiabatic Logic with Carbon Nanotubes (CNT)

Aggressive scaling of bulk MOSFETs has led to problems caused by increased short channel effects (SCE), leakage currents, and fabrication limitations. One potential future replacement for bulk MOSFETs is the so-called Carbon Nanotube (CNT) FET. Formed from a single-atom-thick carbon sheet wrapped into a cylinder, CNTs promise excellent electrical as well as mechanical behavior. CNTs can be metallic or semiconducting (small-gap or moderate-gap), depending on the way they are rolled and on the circumference of the tube [62]. Thus both active devices and interconnects can be made out of CNTs. Single Wall Carbon Nanotubes (SWCNT) have a high mobility in the order of $100000 \mathrm{~cm}^{2} / \mathrm{V} \mathrm{s}$ and a conductivity of up to $400000 \mathrm{~S} / \mathrm{cm}$ [42], owing to the 1-D transport and the resulting reduced phase space for scattering, which opens up the possibility of ballistic transport [63]. The covalent bonding of the carbon atoms results in mechanical and thermal stability of CNTs [44].

Though these facts are very convincing, many barriers have to be overcome before MOSFETs made out of SWCNTs can be produced on a large scale. One limitation is the Schottky Barrier (SB) that is formed at any metal-semiconductor contact due to the different work functions [45]. It limits the on-state current and thus the performance of the device. It additionally leads to an ambipolar characteristic [64]: the on-current increases when $\left|V_{GS}\right|$ is increased. In digital circuits, this causes problems in transistor stacks, where negative gate-to-source voltages appear and thus turn on a device that has a "0" input [65]. One way to overcome this limitation is to contact the nanotubes via palladium, a metal with a high work function [46], so that the barrier can be greatly reduced or eliminated. Chirality cannot be controlled so far; to date, nanotubes are a mix of $1 / 3$ metallic and $2 / 3$ semiconducting tubes [42]. The controlled assembly of nanotubes is another barrier that has to be overcome in order to allow for the integration of systems consisting of CNT devices with traditional metal interconnects, or of CNTs only (semiconducting for devices and metallic for interconnects). Derycke et al. [66] propose a methodology to selectively produce p-type and n-type CNTs for inter-nanotube inverters. One of the tubes is covered with polymethylmethacrylate (PMMA); after thermal annealing and subsequent exposure to oxygen, one of the tubes is n-type and the other is p-type.

Nevertheless, due to their superior electrical characteristics, in particular the high conductivity resulting from (near-)ballistic transport, carbon nanotubes are interesting candidates for future large-scale integrated circuits; their capabilities allow for very low energy consumption in Adiabatic Logic. In the following, the principle of chirality and its meaning for the properties of CNTs is briefly presented, and afterwards simulations are carried out with the Stanford CNT HSPICE simulation model [67-69] to see how Adiabatic Logic performs with those future devices.

### 3.2.2.1 The Chirality of a CNT and the CNTFET

As mentioned earlier, a CNT can be either metallic or semiconducting. The chiral vector describes the wrapping angle and the circumference of the nanotube and determines whether a metallic or a semiconducting nanotube is formed. It is defined as $\vec{C}_{h}=n_{1} \vec{a}_{1}+n_{2} \vec{a}_{2}$. A graphical representation of the chiral vector is given in Fig. 3.14. It connects the origin with the point on the grid that coincides with the origin after wrapping up the carbon sheet. With the chiral number $\left(n_{1}, n_{2}\right)$, the circumference of the nanotube is determined as $\left|\vec{C}_{h}\right|=a \sqrt{n_{1}^{2}+n_{2}^{2}+n_{1} n_{2}}$, where $a=2.49 \AA$ is the length of the unit vectors $\vec{a}_{1}$ and $\vec{a}_{2}$, i.e. the lattice constant of the honeycomb lattice. The electrical characteristics are also determined by the chiral number. A nanotube with $\left(n_{1}, n_{2}\right)$ is a metal if $n_{1}-n_{2}=3 l$ $(l=1,2,3, \ldots)$ and a semiconductor if $n_{1}-n_{2} \neq 3 l$ [70]; the bandgap energy $E_{gap}$ additionally depends on the circumference of the tube. Furthermore, tiny-gap and large-gap semiconducting CNTs exist, which are likewise determined by the chiral number $\left(n_{1}, n_{2}\right)$ [62]. Two constellations are pictured in Fig. 3.14: the chiral number $(3,0)$ represents a metallic nanotube, whereas the $(4,2)$ constellation is a semiconducting nanotube. For tiny-gap and large-gap semiconducting CNTs, the bandgap energy is proportional to $\frac{1}{R^{2}}$ and $\frac{1}{R}$, respectively [70], where $R$ is the radius of the tube.

Fig. 3.14 The chiral vector $\vec{C}_{h}$ is defined via the unit vectors $\vec{a}_{1}$ and $\vec{a}_{2}$ on the carbon honeycomb lattice. It defines the wrapping of the nanotube, the circumference, and also the electrical properties. As examples, the metallic $(3,0)$ and a semiconducting $(4,2)$ orientation are presented. Perpendicular to the chiral vector the one-dimensional translation vector $\vec{T}$ is found, which is the axis of the obtained nanotube

Fig. 3.15 Schematic cross-section of a Carbon Nanotube Field Effect Transistor (CNTFET) proposed in [71]. A gate is separated via an insulating layer (oxide) from the CNT. Through the CNT a connecting channel between source and drain can be established
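The chirality rules quoted in this subsection can be turned into a short Python sketch. The function names are introduced here for illustration only. The modulo test also classifies tubes with $n_1 = n_2$ (i.e. $l = 0$) as metallic, which goes beyond the rule as quoted from [70] and should be read as an assumption of this sketch; the distinction between tiny-gap and large-gap semiconductors is not modeled.

```python
import math

A_LATTICE = 2.49  # value of a in Angstrom, as quoted in the text


def circumference(n1, n2, a=A_LATTICE):
    """Circumference |C_h| = a * sqrt(n1^2 + n2^2 + n1*n2), in Angstrom."""
    return a * math.sqrt(n1**2 + n2**2 + n1 * n2)


def is_metallic(n1, n2):
    """Metallic if n1 - n2 is a multiple of 3.

    Assumption of this sketch: n1 == n2 (l = 0) is accepted as well,
    although the rule quoted in the text lists l = 1, 2, 3, ... only.
    """
    return (n1 - n2) % 3 == 0


for n1, n2 in ((3, 0), (4, 2), (19, 0), (25, 0)):
    kind = "metallic" if is_metallic(n1, n2) else "semiconducting"
    c_h = circumference(n1, n2)
    print(f"({n1},{n2}): {kind}, |C_h| = {c_h:.2f} A, "
          f"diameter = {c_h / math.pi:.2f} A")
```

The sketch reproduces the classification of the two examples in Fig. 3.14 and shows that the $(19,0)$ and $(25,0)$ tubes used in the simulations of this chapter fall into the semiconducting class, with the $(25,0)$ tube having the larger circumference.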

The cross-section of a carbon nanotube field effect transistor is comparable to the cross-section of a planar MOSFET device [71]. The source and drain regions are connected by a semiconducting CNT. As in the bulk MOSFET device, the control gate of the CNTFET is separated from the semiconductor (in the case of the CNTFET, a semiconducting CNT) by an insulating layer, namely the oxide. Figure 3.15 shows a schematic cross-section through a CNTFET device.

### 3.2.2.2 Simulation Results

Simulations are carried out with the HSPICE CNT compact model provided by Deng et al. [67-69] from the Stanford University Nanoelectronic group. The model includes non-idealities like near-ballistic transport, scattering, effects of the source/drain extension, and charge screening between tubes when more than one tube is used for a CNTFET [69]. By inclusion of a transcapacitance network, good predictions of the transient behavior are expected [69].

Table 3.3 Parameter list for the applied Stanford CNT simulation model

| Parameter | Value | Explanation |
| :--- | :--- | :--- |
| $V_{DD}$ | 1.2 V | supply voltage |
| $\left(n_{1}, n_{2}\right)$ | $(19,0)$ | chirality of nanotube(s) |
| tubes | 3 | \# of parallel tubes within a device |
| pitch | 20 nm | distance between adjacent CNTs |
| $L_{ch}$ | 32 nm | physical channel length |
| $L_{SS}, L_{DD}$ | 42 nm | doped source/drain extension |
| $t_{ox}$ | 4 nm | oxide thickness |
| $k_{ox}$ |  | dielectric constant of the oxide |

Based on the Stanford CNT model, static CMOS-like (static CCNT) inverters as well as adiabatic inverters are implemented and compared with respect to the energy consumed per performed calculation. The standard parameters summarized in Table 3.3 are used if not stated otherwise. Three tubes in parallel are used in each CNTFET device, each a moderate-gap semiconducting tube with chiral number (19, 0). The pitch between adjacent CNTs within one device is 20 nm. A channel of 32 nm physical length is separated from the gate by a 4 nm oxide. Both devices, n-type and p-type, are parametrized with the same values according to Table 3.3. A chain of inverters is used to determine the energy consumption and, from it, the resulting ESF. The static CCNT structure is used as a reference and compared to ECRL and PFAL adiabatic inverters. In this investigation, the gates were not optimized to improve the energy behavior of either static CCNT or Adiabatic Logic. By properly choosing the PMOS devices in PFAL, the energy consumption can be halved [10]; thus further improvement of the PFAL values appears possible.

Energy dissipation and ESF versus frequency are shown in Figs. 3.16 and 3.17. At first sight, the same dissipation mechanisms are found for CNTFET-based circuits. A leakage-dominated region exists in the lower frequency regime, and for the static CCNT inverter the energy dissipation is independent of the frequency above 50 MHz. Adiabatic losses proportional to the frequency are observed, leading to a minimum energy dissipation at a frequency of 100 MHz for ECRL and 500 MHz for PFAL. These are also the frequencies at which the maximum energy saving factors $ESF_{ECRL,\max}=12$ and $ESF_{PFAL,\max}=19$ are reached. These values are higher than the maximum ESF in the PTM 32 nm technology node, but it has to be considered that the supply voltage for the CNTFET is higher than the one used in the simulation of the PTM 32 nm node ($V_{DD}=0.9 \mathrm{~V}$ for PTM 32 nm).

If the chiral vector is changed, the bandgap of the semiconductor is influenced. A reduced bandgap energy leads to a reduced threshold voltage $V_{th}$, as the barrier height from source/drain to the channel is reduced. The on-current is increased, but the off-current is increased as well. In Fig. 3.16 the simulated energy dissipation of ECRL and static CCNT inverters is plotted not only for the chiral number $(19,0)$ but also for a transistor with a $(25,0)$-nanotube. The static CCNT is only simulated at the frequency where the minimum energy dissipation of its $(25,0)$ ECRL counterpart appears. Due to the reduced $V_{th}$ two impacts are observed: on the one hand, the leakage currents are increased drastically; on the other hand, the adiabatic losses are reduced, as the gate overdrive $\left(V_{DD}-V_{th}\right)$ is increased and thus the on-resistance is decreased. The strongly increased leakage currents lead to an intersection of leakage losses and adiabatic losses for the $(25,0)$ ECRL configuration that shows an increased minimum of $E_{\text{Diss}}$ when compared to the $(19,0)$ chirality. The reduced adiabatic losses shift the optimum frequency $f_{\text{opt}}$ to 500 MHz, and at this frequency the dissipated energy is reduced when compared to the $(19,0)$ counterpart. Thus the optimum frequency can be adjusted by changing the chirality of the nanotubes in the CNTFET devices. The static CCNT inverter was also simulated at 500 MHz with the $(25,0)$ chiral vector. An increase in the energy is observed, which can be explained by the increased leakage currents. For static CCNT circuits, performance is improved, but dynamic losses are not directly impacted by the chirality.

Fig. 3.16 Energy versus the frequency for a static CCNT, ECRL, and PFAL inverter using CNTFETs

Values are derived from Fig. 3.17: $f_{\text{opt}}$ is raised from 100 MHz to 500 MHz. $ESF_{\max}$ decreases from 12 to 10 if only the $(19,0)$-tubes in ECRL are replaced with $(25,0)$-tubes, and it increases from 12 to 13.6 if $(25,0)$-tubes are used in both ECRL and static CCNT. PFAL in the $(19,0)$ setup already has a very high energy saving potential, with a maximum value at 500 MHz. Changing the chirality for PFAL to $(25,0)$ leads, as expected, to increased leakage losses, but within the adiabatic regime no changes are observed. Hence changing the chirality from $(19,0)$ to $(25,0)$ shows no benefit for PFAL; it rather worsens its energy behavior and is therefore not presented in the figures.

All previous investigations are carried out with a supply voltage of $V_{DD}=1.2 \mathrm{~V}$. In Sect. 2.4 it is derived that Adiabatic Logic gains less from voltage scaling. Thus decreasing the supply voltage leads to a decreased ESF. Figure 3.18 shows the impact of a reduced voltage on the energy dissipation of static CCNT as well as of ECRL and PFAL with $(19,0)$-tubes. The optimum frequency points of 100 MHz for ECRL and 500 MHz for PFAL are simulated and compared to the static CCNT values at 100 MHz (as the energy dissipation of the static CCNT circuit is expected to stay constant over the frequency). When the supply voltage is scaled, a quadratic dependence of the energy dissipation is observed for static CCNT and a linear reduction for Adiabatic Logic. The gains through voltage scaling are linearly dependent on the reduction of the supply voltage.

Fig. 3.17 The ESF gained by applying CNTFETs for ECRL

Fig. 3.18 Impact of voltage scaling on the dissipation of a CNT-based inverter circuit

The corresponding ESFs are presented in Fig. 3.19. The ESF is reduced by $34 \%$ and $39 \%$ when going from 1.1 V to 0.8 V for ECRL and PFAL, respectively. It has to be mentioned, however, that the performance of static CMOS also decreases with the supply voltage. As cascaded gates are used, the overall performance loss either decreases the maximum operating frequency or increases the effort for pipelining. The first fact determines a lower limit for voltage scaling (where safety margins for variations also have to be considered), whereas the second leads to a trade-off between speed and energy.

A high fan-out of a gate makes it necessary to scale the devices of the driving gate. Comparable to the width in planar CMOS technology, the number of tubes is a measure of the driving capability of a CNTFET. Equipping a gate with a higher number of tubes per device impacts the energy dissipation. Figure 3.20 shows the simulation results for $E_{\text{Diss}}$ when different numbers of tubes are used within the inverter chain. All gates in the chain are sized, and in each gate both devices, NCNT and PCNT. Again, ECRL and static CCNT are simulated at a frequency of 100 MHz and PFAL at 500 MHz, all with a supply voltage of $V_{DD}=1.2 \mathrm{~V}$. The increased number of tubes obviously impacts $E_{\text{Diss}}$ for static CCNT and for the Adiabatic Logic families. In Fig. 3.21 the corresponding ESF values are plotted for ECRL and PFAL. While the ESF for PFAL stays constant up to five tubes in parallel and increases from 20 to 22 when ten tubes are used, the ESF for ECRL increases continuously from 9 for three tubes to 12 for ten tubes. This is explained by the increased capacitance and the reduced on-resistance due to the parallel connection of more tubes. In static CMOS, the increased capacitance increases the energy dissipation according to $E_{CMOS} \propto C$. In AL, the reduced on-resistance also impacts the energy dissipation, with $E_{AL} \propto R C^{2}$. If the reduced on-resistance compensates a large part of the quadratic increase with the capacitance, the energy $E_{AL}$ rises less than $E_{CMOS}$.

Fig. 3.19 Development of the ESF of an inverter circuit under the impact of voltage scaling

Fig. 3.20 Energy dissipation with respect to the number of parallel tubes used in a CNTFET device

If carbon nanotubes become successors to planar CMOS transistors, they offer tremendous energy savings and improved performance in Adiabatic Logic. Due to their superior carrier transport they offer a small on-resistance and are thus exceptionally well suited for Adiabatic Logic.

Fig. 3.21 ESF with respect to the number of parallel tubes used in a CNTFET device

### 3.2.3 Adiabatic Logic with the Vertical Slit Field Effect Transistor (VESFET)

The Vertical Slit Field Effect Transistor (VESFET) is a novel device proposed by W. Maly [6]. Its geometry is based on cylindrical elements of radius $r$, as pictured in Fig. 3.22. It has four terminals (source, drain, and two gates) and is regarded as a hybrid of a JFET and a MOSFET device [72]. Majority carriers are responsible for the transport of charge through the bulk. The two gates control depletion regions extending into the bulk, and thereby the channel width. In contrast to a JFET, the gates form a MOS structure, i.e. they are separated from the bulk by an oxide. Gate leakage currents are expected to be negligible compared to a JFET device [72], as only tunneling currents account for the gate leakage, in contrast to junction leakage in JFET devices. The parameters for adjusting the device are the thickness $t_{ox}$ of the gate oxide and the doping concentration in the channel region.

Different configurations exist for the VESFET device, as the two gates can be controlled independently. Two configurations are distinguished: the tied configuration, in which both gates are connected and thus controlled by the same signal, and the independent-gate configuration, which allows the depletion regions to be controlled from both sides independently. The independent gate control allows the non-trivial logic functions AND and OR to be integrated within one device.

The intrinsic capacitance of a VESFET (with a device height of $h=400 \mathrm{~nm}$) is about $50 \%$ smaller than that of a minimum-sized 65 nm MOSFET [73]. This is due to the fact that gate oxide capacitance and depletion capacitance are connected in series, together with the absence of junction and overlap capacitances. In contrast, the on-resistance of the VESFET device is about $30 \%$ higher than that of the 65 nm device [73]. Based on the equations for the energy dissipation caused by dynamic losses in static CMOS (see (2.3)) and by adiabatic losses in Adiabatic Logic (see (2.7)), an improved ESF is expected in the adiabatic regime due to the reduced intrinsic capacitances, which impact adiabatic losses in AL more than dynamic losses in static CMOS. This conclusion is only valid when the overall capacitance is dominated by the intrinsic capacitance. If the wire capacitance becomes dominant, adiabatic losses are increased by the introduction of the VESFET, as its on-resistance is higher. For the following investigations it is assumed that the intrinsic device capacitance is the dominant part seen by the gate.

Fig. 3.22 Basic structure of a Vertical Slit Field Effect Transistor (VESFET) in a 3D and the top view [72]. The design is based on cylindrical elements of radius $r$ and height $h$, which form the terminals and also the shape of the Poly-Si and the Bulk-Si. Via the gates the depletion region is modulated, and therefore the effective channel width is modulated

### 3.2.3.1 Simulation Results

Simulations are carried out to see what impact the introduction of the VESFET has on the energy consumption of static CMOS as well as PFAL gates. As the CMOS technology for comparison, the Predictive Technology Model 65 nm node (PTM 65 nm) is used [58], since the 65 nm device is the bulk CMOS transistor corresponding to the VESFET used here. Additionally, the 45 nm node is considered for the inverter circuit, using the PTM 45 nm high-performance (HP) model. The parameters of the VESFET and static CMOS devices are summarized in Table 3.4.

All devices are minimum sized, for the static CMOS gates as well as for PFAL. As the VESFET Spice models are designed for a supply voltage of $V_{DD}=0.8 \mathrm{~V}$, this voltage is also used for the PTM models. Due to these prerequisites, the ESF values derived in the following are worst-case values. Both the very low voltage and the minimum-sized devices are disadvantageous for Adiabatic Logic [10].

Table 3.4 Parameters for PTM 45 nm HP, PTM 65 nm, and VESFET devices used in the simulations. $t_{ox}$ is the physical oxide thickness, $L$ and $W$ are related to the PTM model, whereas $r$ and $h$ are the device geometry parameters for the VESFET device

| Model | $t_{ox}$ | $L$ or $r$ | $W$ or $h$ | $n_{\text{sub}}$ |
| :--- | :--- | :--- | :--- | :--- |
| PTM 45 nm HP ${ }^{\text{a}}$ (n/p) | 1 nm | 45 nm | 80 nm | b |
| PTM 65 nm (n/p) | 1.2 nm | 65 nm | 110 nm | b |
| VESFET (n/p) | 6 nm | 50 nm | 200 nm | $5 \cdot 10^{17} \mathrm{~cm}^{-3}$ |

${ }^{\text{a}}$ For 45 nm, PTM offers a high-performance (HP) model, which is advantageous for Adiabatic Logic
${ }^{\mathrm{b}}$ Substrate doping parameter not given in model cards

Definition of the Investigated Energy Saving Factors In Sect. 2.1 the Energy Saving Factor (ESF) is defined. For all further investigations, not only the switch from static CMOS to the adiabatic PFAL family is investigated, but also the switch from a CMOS process to the VESFET process. Thus saving factors in four directions are determined. In order to distinguish between switching the circuit style, which is covered by the classical definition of the ESF, and switching from the CMOS process to the VESFET process, the Technology Energy Saving Factor (TESF) is defined as:

$$
\begin{equation*}
T E S F=\frac{E_{C M O S}}{E_{V E S F E T}} \tag{3.14}
\end{equation*}
$$

Four combinations for the ESF and the TESF result from the variety of used circuit styles and processes:

$$
\begin{align*}
E S F_{C M O S} & =\frac{E_{\text {StaticCMOS/CMOS}}}{E_{P F A L / C M O S}}, \\
E S F_{V E S F E T} & =\frac{E_{\text {StaticCMOS/VESFET}}}{E_{P F A L / V E S F E T}},  \tag{3.15}\\
T E S F_{S t a t i c C M O S} & =\frac{E_{\text {StaticCMOS/CMOS }}}{E_{\text {StaticCMOS/VESFET }}}, \\
T E S F_{P F A L} & =\frac{E_{P F A L / C M O S}}{E_{P F A L / V E S F E T}} .
\end{align*}
$$

One additional trace is defined, that gives the saving factor when going from static CMOS in the CMOS process (staticCMOS/CMOS) to PFAL in the VESFET process (PFAL/VESFET). This saving factor is named Overall Energy Saving Factor (OESF) and is defined as

$$
\begin{equation*}
O E S F=\frac{E_{\text {StaticCMOS/CMOS }}}{E_{P F A L / V E S F E T}} \tag{3.16}
\end{equation*}
$$
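The relation between the five saving factors in (3.15) and (3.16) can be made explicit with a small Python sketch. The function name and the four energy values are hypothetical and serve only to illustrate the definitions; all four energies are assumed to be taken at the same frequency.

```python
def saving_factors(e_static_cmos, e_pfal_cmos, e_static_vesfet, e_pfal_vesfet):
    """Return ESF, TESF and OESF values following (3.15) and (3.16).

    Arguments are energies per operation (any common unit) of the four
    combinations of circuit style (static CMOS, PFAL) and process
    (CMOS, VESFET).
    """
    return {
        "ESF_CMOS": e_static_cmos / e_pfal_cmos,
        "ESF_VESFET": e_static_vesfet / e_pfal_vesfet,
        "TESF_StaticCMOS": e_static_cmos / e_static_vesfet,
        "TESF_PFAL": e_pfal_cmos / e_pfal_vesfet,
        "OESF": e_static_cmos / e_pfal_vesfet,
    }


# Hypothetical energies in fJ, not simulation results from this chapter.
factors = saving_factors(e_static_cmos=1.2, e_pfal_cmos=0.4,
                         e_static_vesfet=0.8, e_pfal_vesfet=0.2)
for name, value in factors.items():
    print(f"{name:16s} = {value:.2f}")
```

By construction, the OESF at a given frequency equals both $ESF_{CMOS} \cdot TESF_{PFAL}$ and $TESF_{StaticCMOS} \cdot ESF_{VESFET}$, i.e. the two possible paths from static CMOS in the CMOS process to PFAL in the VESFET process yield the same overall factor.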

Simulation Results for an Inverter Circuit A chain of five inverters (see Fig. 3.23 for the PFAL inverter implementation with tied-gate VESFET devices) is simulated according to Sect. 2.6. The energy dissipation per cycle as well as the Energy Saving Factors are derived. First, the energy dissipation in static CMOS and PFAL is examined for the different technologies, including the PTM 45 nm HP model. Figure 3.24 shows the impact of the different circuit styles (static CMOS and PFAL) as well as of the technology (PTM 45 nm, PTM 65 nm, and VESFET). Going from 65 nm to 45 nm in static CMOS, a reduction in the energy dissipation is seen, as expected, for frequencies above 10 MHz, e.g. a $30 \%$ reduction at 200 MHz. Leakage currents grow, as the threshold voltage is reduced (see the energy dissipation for frequencies below 10 MHz). The VESFET model gives the highest leakage currents, and in the frequency regime above 30 MHz its losses approximately match those of the 45 nm inverter circuit. Compared to the 65 nm model, the introduction of the VESFET pays off in static CMOS, whereas no benefit is gained with respect to the 45 nm node.

Fig. 3.23 A sketch of a VESFET-based PFAL inverter circuit using the schematic symbols for the tied-gate VESFET proposed in [72]

For PFAL, using the 45 nm instead of the 65 nm model also improves the optimum frequency, which is shifted from around 100 MHz to 200 MHz. Additionally, the minimum energy dissipation is decreased by $16 \%$. Even compared to the 45 nm model, the VESFET PFAL inverter leads to an additional gain in performance as well as in power dissipation. The optimum frequency is shifted to 500 MHz, and the minimum energy value is reduced by $59 \%$ and $51 \%$ compared to the 65 nm node and the 45 nm node, respectively. As mentioned earlier, the 65 nm node is the corresponding CMOS technology with respect to area consumption. Although the 45 nm node consumes less area than the VESFET device used here, a reduction in the energy consumption is observed if the VESFET is used with PFAL, whereas static CMOS does not noticeably benefit from replacing the 45 nm transistor with the VESFET device.

Next, ESF, TESF and OESF are calculated. The $ESF_{VESFET}$ line is superior to the $ESF_{PTM65nm}$ line for almost all frequencies. PFAL gains from the introduction of the VESFET; the ESF is improved.

Simulation Results for a NAND Circuit A NAND gate is implemented in static CMOS and PFAL, using both the PTM 65 nm models and the VESFET. The simulation results are shown in Fig. 3.25. Again, introducing the VESFET saves energy in static CMOS as well as in PFAL. At 500 MHz, inserting the VESFET device into static CMOS reduces the energy dissipation by around $43 \%$. For PFAL at 500 MHz, the reduction in the dissipated energy is $66 \%$. The minimum energy in PFAL is shifted from 100 MHz to 500 MHz. The ESF values are slightly reduced when compared to the results of the inverter circuit, but PFAL still benefits from the VESFET device, as $ESF_{VESFET}$ is again superior to $ESF_{PTM65nm}$ over the whole frequency range. By switching from CMOS technology to VESFET technology, PFAL gains more than static CMOS.

Fig. 3.24 Energy dissipation and $(T/O)ESF$ for the inverter circuit. In the energy dissipation graph, also the PTM 45 nm model is plotted

Simulation Results for a Four-Input NAND (NAND4) Circuit The four-input NAND (NAND4) is implemented and simulated for the various combinations. For the energy consumption in Fig. 3.26, the trend of reduced energy consumption through the introduction of the VESFET device is observed again. At 500 MHz the energy is reduced by $40 \%$ and $62 \%$ for static CMOS and PFAL, respectively. Again the PFAL gate gains more from the introduction of the VESFET. The shift of the minimum in the energy dissipation is smaller for the NAND4 gate than for the previous gates. According to Fig. 3.26, the optimum frequency stays approximately the same for CMOS and VESFET in PFAL. At 100 MHz an outlier appears in the graph for PFAL/VESFET due to numerical effects caused by the model.

When compared to the NAND gate, the $(T/O)ESF$s are increased for the NAND4 gate. $ESF_{VESFET}$ is superior to $ESF_{PTM65nm}$, and $TESF_{PFAL}$ is superior to $TESF_{StaticCMOS}$, over all frequencies.

Fig. 3.25 Energy dissipation and $(T / O) E S F$ for the NAND circuit


In Table 3.5 the results for the maximum saving factors as well as the minimum dissipation (for PFAL only) are summarized. In general, the energy dissipated in PFAL gates is lowered if the VESFET device is used. The minimum point of energy is also shifted towards higher frequencies. Only for the NAND4 gate is the result of $\min\left\{E_{PFAL/VESFET}\right\}$ uncertain, due to the outlier at 100 MHz. The VESFET allows for a higher ESF ($ESF_{VESFET}$): the savings from static CMOS to PFAL can be improved by employing the VESFET device. The saving factors ($ESF$ as well as $TESF$ and $OESF$) are all reduced if the complexity of the gate grows. This can be explained by the higher capacitive load: according to (2.3) and (2.7), increasing the capacitance impacts the energy dissipation of static CMOS less than that of Adiabatic Logic.
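As a small worked example, the relative reduction of the minimum PFAL energy can be recomputed from the two bottom rows of Table 3.5. The helper name `reduction_percent` and the dictionary layout are introduced here for illustration; the energy values are those listed in the table. Note that the minima of the two processes occur at different frequencies, and that the NAND4 value for the VESFET is marked as uncertain in the text.

```python
# Minimum PFAL energies in fJ from Table 3.5:
# (PTM 65 nm process, VESFET process) per gate.
MIN_E_PFAL_FJ = {
    "INV": (0.37, 0.15),
    "NAND": (0.62, 0.27),
    "NAND4": (0.92, 0.39),
}


def reduction_percent(e_ref, e_new):
    """Relative energy reduction in percent when going from e_ref to e_new."""
    return 100.0 * (1.0 - e_new / e_ref)


for gate, (e_ptm65, e_vesfet) in MIN_E_PFAL_FJ.items():
    print(f"{gate:6s} minimum PFAL energy reduced by "
          f"{reduction_percent(e_ptm65, e_vesfet):.0f} %")
```

For the inverter, the result agrees with the reduction of $59 \%$ relative to the 65 nm node that is quoted in the discussion of the inverter circuit.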

Simulation Results for an Inverter Circuit with a Fan-Out of Four (FO4) In [9] it is mentioned that the ESF depends on the fan-out (FO) of a gate. For PFAL an increase in the ESF is observed for a ratio $C_{\text {Load }} / C_{\text {in }}$ of 4 up to 6 at 100 MHz [9]. To find the ( $T / O$ )ESF for an increased FO, an inverter circuit is loaded with a fan-out of four (FO4). Four identical inverter pairs are driven by the device under test (see Fig. 3.27). The simulations only concentrate on the frequency range

Fig. 3.26 Energy dissipation and $(T / O) E S F$ for the NAND4 circuit



Table 3.5 Overview of saving factors and energy dissipation per operation

|  | INV | NAND | NAND4 |
| :--- | :--- | :--- | :--- |
| $\max \left\{E S F_{P T M 65 n m}\right\}$ | $3.26 @ 100 \mathrm{MHz}$ | $2.04 @ 100 \mathrm{MHz}$ | $2.07 @ 100 \mathrm{MHz}$ |
| $\max \left\{E S F_{V E S F E T}\right\}$ | $4.52 @ 200 \mathrm{MHz}$ | $2.62 @ 20 \mathrm{MHz}$ | $3.61 @ 50 \mathrm{MHz}$ |
| $\max \left\{T E S F_{\text {StaticCMOS }}\right\}$ | $1.58 @ 1 \mathrm{GHz}^{\mathrm{a}}$ | $1.77 @ 1 \mathrm{GHz}^{\mathrm{a}}$ | $1.67 @ 500 \mathrm{MHz}$ |
| $\max \left\{T E S F_{P F A L}\right\}$ | $3.19 @ 500 \mathrm{MHz}$ | $2.97 @ 500 \mathrm{MHz}$ | $2.66 @ 1 \mathrm{GHz}$ |
| $\max \{O E S F\}$ | $6.54 @ 200 \mathrm{MHz}$ | $4.38 @ 200 \mathrm{MHz}$ | $5.05 @ 50 \mathrm{MHz}$ |
| $\min \left\{E_{P F A L / P T M 65 n m}\right\}$ | $0.37 \mathrm{fJ} @ 100 \mathrm{MHz}$ | $0.62 \mathrm{fJ} @ 100 \mathrm{MHz}$ | $0.92 \mathrm{fJ} @ 100 \mathrm{MHz}$ |
| $\min \left\{E_{P F A L / V E S F E T}\right\}$ | $0.15 \mathrm{fJ} @ 500 \mathrm{MHz}$ | $0.27 \mathrm{fJ} @ 500 \mathrm{MHz}$ | $0.39 \mathrm{fJ} @ 200 \mathrm{MHz}$ |

Fig. 3.27 In the fan-out of four (FO4) simulation setup, four equal inverters are connected to the output of the device under test (DUT)


Fig. 3.28 Energy dissipation and $(T / O) E S F$ for the inverter FO4 circuit


between 100 MHz to 500 MHz , where the lowest energy consumption is expected for the PFAL family.

Figure 3.28 summarizes the simulation results for the FO4 circuit. At 500 MHz the energy savings by the transfer from the PTM 65 nm to the VESFET device are $29 \%$ and $54 \%$ for static CMOS and PFAL, respectively. A shift to higher frequencies is seen for the optimum frequency in PFAL. $E S F_{V E S F E T}$ and $T E S F_{P F A L}$ are superior to their CMOS and static CMOS counterparts. The ( $T / O$ )ESF values this

Fig. 3.29 NBTI conditions for the PMOS device. Due to the high lateral field over the oxide, SiH-bonds break up at the semiconductor-oxide interface leaving traps behind

time are higher compared to the inverter, NAND and NAND4 gates, leading to an OESF higher than 10.

In summary, these results show that both families benefit from the reduced intrinsic capacitance of the VESFET device. As expected, due to the quadratic impact of the capacitance reduction on the dissipated energy in Adiabatic Logic, AL benefits more from the VESFET's properties with respect to the energy dissipation. Thus, with the VESFET device, Adiabatic Logic circuits can be pushed even further into ultra-low-power dissipation.
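The linear versus quadratic scaling can be checked roughly against the savings quoted above. The following sketch assumes, as a deliberate simplification, that the complete energy gain stems from one capacitance scaling factor $k$, with the static CMOS energy scaling with $k$ and the adiabatic energy with $k^{2}$. The function name and the single-factor model are illustrative assumptions and not part of the simulation setup; other device properties (on-resistance, leakage) are ignored.

```python
def predicted_adiabatic_saving(cmos_saving: float) -> float:
    """Predict the adiabatic energy saving from the static CMOS saving.

    Assumption (illustrative only): both savings stem from one capacitance
    scaling factor k, with E_CMOS ~ k and E_AL ~ k**2.
    """
    k = 1.0 - cmos_saving   # remaining fraction of the capacitance
    return 1.0 - k ** 2     # remaining adiabatic energy is k**2


# Static CMOS savings at 500 MHz quoted in the text, together with the
# simulated PFAL savings for comparison.
cases = {"NAND4": (0.40, 0.62), "INV FO4": (0.29, 0.54)}

for gate, (cmos, pfal_simulated) in cases.items():
    pfal_predicted = predicted_adiabatic_saving(cmos)
    print(f"{gate}: predicted {pfal_predicted:.0%}, simulated {pfal_simulated:.0%}")
```

Under this crude model the predicted PFAL savings are about 64 % for the NAND4 and about 50 % for the FO4 inverter, which is close to the simulated 62 % and 54 %.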

## 3.3 (Negative) Bias Temperature Instability ((N)BTI) and Hot Carrier Injection (HCI) in Adiabatic Logic

NBTI [51-54, 74] and HCI [47] both impact $V_{t h}$ and lead to a shift in the threshold voltage of the devices. In today's technologies (without high-k in the gate stack), BTI appears appreciably only in PMOS devices, where it is called NBTI (due to the negative biasing voltage), but in the future PBTI (Positive Bias Temperature Instability) will also influence NMOS devices [50]. BTI appears while static conditions are applied to the transistor (Fig. 3.29), when $V_{D} \approx V_{S}$ and a high $\left|V_{G S}\right|$ voltage is applied. HCI appears in static CMOS during switching transitions. If current flows at a high drain-to-source voltage while a considerable gate-to-source voltage is applied, carriers with high energy are deflected and injected into the oxide. HCI in Adiabatic Logic is strongly reduced [40], as the voltage difference between drain and source is vanishingly low by definition. Thus, only BTI is regarded in the following.

First only NBTI is considered, as only this effect is already observable in state-of-the-art CMOS technologies without high-k gate stacks. NBTI consists of a permanent and a recoverable part: The recoverable part $\Delta D_{R}$ is explained with trapping and detrapping of holes in the oxide [54]. A portion of the overall threshold voltage shift $\Delta D_{P}$ will be permanent. This is explained by breaking Si-H bonds at the oxide-semiconductor interface [54].

$$
\begin{align*}
& \Delta D_{R} \propto K_{R} V_{g}^{c}\left(1+\frac{t_{s} \tau_{e}}{t_{r} \tau_{c}}\right)  \tag{3.17}\\
& \Delta D_{P} \propto K_{P} V_{g}^{a} t_{s}^{b} \tag{3.18}
\end{align*}
$$

Equations (3.17) and (3.18) show the dependence of the recoverable part $\Delta D_{R}$ and the permanent part $\Delta D_{P}$ of NBTI on the gate voltage $V_{g}$, the stress time $t_{s}$, and the recovery time $t_{r}$. Due to the stress duty cycles in most digital products, the permanent part of NBTI will dominate the long term degradation [53]. In Adiabatic Logic, because of the hold interval of the power-clock, a periodically appearing, intrinsic recovery phase is integrated, and thus the permanent part of NBTI is also expected to dominate the long term degradation in AL. NBTI will lead to a shift in the threshold voltage. First, the impact of a $V_{t h}$-shift on the operating behavior and the energy dissipation of Adiabatic Logic circuits is inspected. Then duty-cycle based investigations in Adiabatic Logic and static CMOS are performed and compared. At the end of the NBTI investigation an outlook on the impact of PBTI on Adiabatic Logic is given, as this effect will also impact the circuits in the near future.

### 3.3.1 Impact of NBTI on the Energy Dissipation of Adiabatic Logic Circuits

Shifting $V_{t h}$ to higher absolute levels will, at first sight, increase the loading path resistance in AL and thus lead to a higher energy consumption. Both logic families, ECRL and PFAL, charge the outputs via a PMOS device, thus both suffer from the shift in the threshold voltage. Leakage currents are decreased, whereas non-adiabatic as well as adiabatic losses are increased by NBTI. As long as the non-adiabatic losses lie underneath the crossing-point of leakage losses and adiabatic losses, the maximum saving factor is improved, but the optimum frequency is shifted to lower operating frequencies. If a system has a fixed operating frequency, NBTI will raise the energy consumption of the gate, whereas a system that is allowed to adapt its frequency to new conditions can counteract the increased energy dissipation by operating at a lower frequency. In contrast, NBTI will lead to functional errors if a static CMOS system is operated close to the maximum frequency and a $V_{t h}$-shift is introduced. On the other hand, no impact on the dynamic losses will be observed in static CMOS circuits. Due to the higher $V_{t h}$, a reduced leakage is observed in all circuit families.

In a PFAL gate, the charging path is constructed by the $n$-channel logic block and a parallel p-channel MOSFET. The logic block is not impacted by BTI, as long as PBTI is still a negligible effect. Impact by NBTI affects the p-channel MOSFET and increases the effective charging path resistance. The overall charging path resistance is

$$
\begin{equation*}
R_{\text {eval }}=R_{F / \bar{F}} \| R_{p}, \tag{3.19}
\end{equation*}
$$

in the evaluation phase, where $R_{F / \bar{F}}$ is either the equivalent resistance of the logic block $F$ or $\bar{F}$. Only the p-channel device is responsible for recovering the charge stored on the node during the recover interval, thus

$$
\begin{equation*}
R_{\text {reco }}=R_{p} . \tag{3.20}
\end{equation*}
$$

These two equations show that NBTI affects both phases, namely evaluate and recover, but affects the recovery of charge more, as there the logic block does not support the PMOS device. The on-resistance $R_{p}$ in the linear region is expressed by

$$
\begin{equation*}
R_{p}=k_{p}^{\prime} \frac{L}{W} \frac{1}{-V_{G S, p}+V_{t h, p}} \tag{3.21}
\end{equation*}
$$

and NBTI will increase the value of the threshold voltage $V_{t h, p}$ by $\Delta V_{t h, p}$. The modified on-resistance is

$$
\begin{align*}
R_{p} & =k_{p}^{\prime} \frac{L}{W} \frac{1}{-V_{G S, p}+V_{t h, p}+\Delta V_{t h, p}} \\
& =k_{p}^{\prime} \frac{L}{W} \frac{1}{-V_{G S, p}+V_{t h, p}\left(1+\Delta V_{t h, p, r e l}\right)}, \tag{3.22}
\end{align*}
$$

with the relative change of the threshold voltage $\Delta V_{t h, p, r e l}=\frac{\Delta V_{t h, p}}{V_{t h, p}}$. Inserting (3.22) into (3.19) and (3.20) gives the on-resistance in evaluation and recover phase modified by NBTI:

$$
\begin{align*}
R_{\text {eval }} & =\frac{R_{F / \bar{F}} k_{p}^{\prime} \frac{L}{W}\left(-V_{G S, p}+V_{t h, p}\left(1+\Delta V_{t h, p, \text { rel }}\right)\right)^{-1}}{R_{F / \bar{F}}+k_{p}^{\prime} \frac{L}{W}\left(-V_{G S, p}+V_{t h, p}\left(1+\Delta V_{t h, p, \text { rel }}\right)\right)^{-1}} \\
& =\frac{R_{F / \bar{F}}}{R_{F / \bar{F}} \frac{W}{k_{p}^{\prime} L}\left(-V_{G S, p}+V_{t h, p}\left(1+\Delta V_{t h, p, \text { rel }}\right)\right)+1}  \tag{3.23}\\
R_{\text {reco }} & =k_{p}^{\prime} \frac{L}{W}\left(-V_{G S, p}+V_{t h, p}\left(1+\Delta V_{t h, p, \text { rel }}\right)\right)^{-1} \tag{3.24}
\end{align*}
$$

For ECRL, both charging and discharging of the output node happen via the p-channel device, therefore

$$
\begin{equation*}
R_{\text {eval }}=R_{\text {reco }}=k_{p}^{\prime} \frac{L}{W}\left(-V_{G S, p}+V_{\text {th,p }}\left(1+\Delta V_{\text {th,p,rel }}\right)\right)^{-1} \tag{3.25}
\end{equation*}
$$
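Equations (3.19) to (3.25) can be evaluated numerically to see how a threshold shift propagates into the path resistances. The following is a minimal sketch; the function names and all numerical values (the product $k_{p}^{\prime} L / W$, $V_{G S, p}$, $V_{t h, p}$, $R_{F / \bar{F}}$) are placeholder assumptions for illustration and are not taken from the 130 nm process. The PMOS quantities are entered with negative sign, so an NBTI shift is a negative $\Delta V_{t h, p}$, in line with the notation of Fig. 3.39.

```python
def r_p(k_lw: float, v_gs: float, v_th: float, dv_th: float = 0.0) -> float:
    """On-resistance of the p-channel device, Eqs. (3.21) and (3.22).

    k_lw  : product k'_p * L / W
    v_gs  : gate-to-source voltage (negative for a conducting PMOS)
    v_th  : threshold voltage (negative for a PMOS)
    dv_th : NBTI-induced shift (negative, so |V_th| grows)
    """
    return k_lw / (-v_gs + v_th + dv_th)


def parallel(r1: float, r2: float) -> float:
    """Equivalent resistance of two resistors in parallel."""
    return r1 * r2 / (r1 + r2)


def pfal_resistances(r_f, k_lw, v_gs, v_th, dv_th=0.0):
    """Return (R_eval, R_reco) for PFAL, Eqs. (3.23) and (3.24)."""
    rp = r_p(k_lw, v_gs, v_th, dv_th)
    return parallel(r_f, rp), rp


def ecrl_resistances(k_lw, v_gs, v_th, dv_th=0.0):
    """Return (R_eval, R_reco) for ECRL, Eq. (3.25)."""
    rp = r_p(k_lw, v_gs, v_th, dv_th)
    return rp, rp


# Placeholder values (assumptions): k'_p L/W = 900 Ohm*V, V_GS,p = -1.2 V,
# V_th,p = -0.3 V, logic block resistance 2 kOhm, NBTI shift of -50 mV.
fresh = pfal_resistances(2000.0, 900.0, -1.2, -0.3)
aged = pfal_resistances(2000.0, 900.0, -1.2, -0.3, dv_th=-0.05)

print(f"PFAL R_eval grows by {aged[0] / fresh[0] - 1:.1%}")
print(f"PFAL R_reco grows by {aged[1] / fresh[1] - 1:.1%}")
```

With these placeholder values the recovery path resistance grows more than the evaluation path resistance, because the unaffected logic block only helps during evaluation.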

An increase in the loading path resistance does impact the energy consumption in the adiabatic circuit, but not in static CMOS. As mentioned before, static CMOS suffers only from a performance decrease due to NBTI (and PBTI in future technologies). The energy saving factor is determined by

$$
\begin{equation*}
E S F=\frac{\frac{1}{2} C V_{D D}^{2}}{4 \frac{R_{\text {eval }} C}{T} C V_{D D}^{2}+4 \frac{R_{\text {reco }} C}{T} C V_{D D}^{2}}=\frac{T}{8 C\left(R_{\text {eval }}+R_{\text {reco }}\right)} \tag{3.26}
\end{equation*}
$$

If $R_{\text {eval }}$ and/or $R_{\text {reco }}$ are increased, the ESF is decreased. In summary, these results lead to the expectation of increased energy dissipation in the adiabatic frequency regime due to the $V_{t h}$-shift caused by NBTI. In PFAL, the non-affected n-channel

Fig. 3.30 Energy dissipation of static CMOS inverter for a toggling input sequence

logic block will support the p-channel device during evaluation. The recovery takes place only via the p-channel device. In ECRL, both evaluation and recovery are performed via the p-channel device. This leads to the assumption that NBTI will affect ECRL more than PFAL.
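The expectation derived from (3.26) can be illustrated with a few lines of code. The sketch below evaluates the ESF for a fresh and an aged gate; the cycle time, the load capacitance, and the resistance values are placeholder assumptions chosen only for illustration. Since (3.26) covers adiabatic losses only, the absolute ESF values are far higher than simulated ones, and only the relative change is of interest here.

```python
def esf(period: float, c_load: float, r_eval: float, r_reco: float) -> float:
    """Energy saving factor according to Eq. (3.26), adiabatic losses only."""
    return period / (8.0 * c_load * (r_eval + r_reco))


T = 10e-9    # cycle time for 100 MHz
C = 10e-15   # load capacitance (assumed placeholder)

# Placeholder resistances in Ohm as (R_eval, R_reco), fresh and NBTI-aged.
gates = {
    "ECRL": ((1000.0, 1000.0), (1059.0, 1059.0)),
    "PFAL": ((667.0, 1000.0), (692.0, 1059.0)),
}

for name, (fresh, aged) in gates.items():
    change = esf(T, C, *aged) / esf(T, C, *fresh) - 1.0
    print(f"{name}: ESF changes by {change:.1%}")
```

For these placeholder values the first-order model predicts a somewhat larger ESF reduction for ECRL than for PFAL, which matches the assumption stated above. The model does not include the voltage swing on the output node.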

### 3.3.1.1 Simulation Results for PFAL and ECRL

A voltage source at the gates has been introduced to simulate the shift of the threshold voltage for the p-channel devices in a CMOS, a PFAL, and an ECRL inverter gate. The simulations are performed in the 130 nm process used throughout this work, and a stepwise increase of $V_{t h}$ from 10 mV to 50 mV is investigated. This shows the general impact of BTI on AL. In 130 nm, BTI is not yet a problem, but starting from the 90 nm node it becomes increasingly important. The simulation is carried out with an inverter chain consisting of 5 inverters; the device under test is the third inverter in this row. All gates in the chain are assumed to be altered by the same $V_{t h}$-shift. A symmetrical shift is assumed for the Adiabatic Logic families, i.e. both p-channel devices are altered with the same shift in $V_{t h}$.

The result for the CMOS inverter in Fig. 3.30 shows a decreased consumption in the leakage dominated regime ( $f<100 \mathrm{MHz}$ ) when $V_{t h, p}$ is increased. Within the frequency regime where dynamic losses dominate the overall losses, no impact due to NBTI is expected. However, short-circuit currents also contribute to the overall losses, and these are decreased due to the weaker p-channel device in the inverter circuit. The high-to-low transition is not impacted, as the n-channel device is responsible for this transition. The low-to-high transition, in contrast, depends on the driver strength of the p-channel device. Short-circuit currents appear when both devices are conducting. In Fig. 3.31 a high-to-low transition is assumed at the input of the static CMOS inverter, which leads to a low-to-high transition at the output. This transition is performed by the degraded p-channel device. If the input transition is fast in comparison to the output slope, the n-channel device of the inverter is in cut-off before a noteworthy drain-to-source voltage is applied.


Fig. 3.31 The n-channel device in the CMOS inverter is in cut-off as soon as the input transition reaches $V_{t h, n}$. If the output transition is slow, the drain-to-source voltage across the $n$-channel device is low during the time it is conducting. Thus, short-circuit currents during the low-to-high transition are decreased, when the p-channel device is degraded


Fig. 3.32 The voltage swing during the evaluate and recover interval in ECRL is reduced when the output signal stays constant. NBTI increases $\left|V_{t h, p}\right|$ to higher levels, thus the voltage swing is reduced by NBTI, leading to a decreased energy consumption

The energy dissipation nevertheless can be seen as independent of NBTI in the frequency regime dominated by dynamic losses. For a frequency of 100 MHz the energy dissipation is only reduced by about $1.3 \%$ when a shift in the threshold voltage of 50 mV is applied.

For the Adiabatic Logic families two cases are distinguished: one with a constant input signal, the other with toggling inputs. In ECRL a voltage on the order of $\left|V_{t h, p}\right|$ remains on the output node, thus the voltage swing is $V_{D D}-\left|V_{t h, p}\right|$. If the input signal is not changed, the evaluation voltage swing in the next cycle is also $V_{D D}-\left|V_{t h, p}\right|$, as pictured in Fig. 3.32. According to (2.7), a reduced voltage swing leads to a reduced energy dissipation, whereas an increase of the on-resistance increases the energy dissipation. NBTI will raise the threshold voltage and thus decrease the voltage swing, but increase the on-resistance.

Overall the energy dissipation is reduced, as the reduced voltage swing has a stronger impact on the energy dissipation. This can be observed in Fig. 3.33. At 100 MHz the energy dissipation is decreased by $17 \%$, if the threshold voltage is shifted by 50 mV .

For the toggling input sequence, the remaining charge is discharged during the wait interval. An increased energy dissipation is expected due to the raised on-resistance, but in Fig. 3.34 it is observed that the energy dissipation is more or less independent of the threshold voltage shift up to 50 mV.

Fig. 3.33 Energy dissipation for an ECRL inverter in case of a constant input signal


Fig. 3.34 Energy dissipation for ECRL inverter for toggling input sequence


In PFAL, charge also remains on the output node, but it is discharged if the input signal stays constant, as the n-channel device in the logic block is turned on during the wait interval by the preceding gate. The energy dissipation in the PFAL gate is almost independent of the input sequence [9]. Comparing Figs. 3.35 and 3.36, it is seen that the deviation in the energy dissipation is also almost identical for constant and toggling inputs. At 100 MHz the energy dissipation is increased by $14 \%$ for a threshold voltage shift of $\left|\Delta V_{t h, p}\right|=50 \mathrm{mV}$.

According to these results the Energy Saving Factors for ECRL (Fig. 3.37) and PFAL (Fig. 3.38) are derived. For ECRL the ESF is increased for increased threshold voltage induced by NBTI, whereas for PFAL the ESF is decreased.

In Fig. 3.39 the relative deviation in the ESF can be observed, showing the deviation by a 50 mV shift with respect to the virgin circuit. ECRL's saving potential is increased by around $5 \%$ at 100 MHz , whereas that of PFAL is decreased by around $14 \%$.

Fig. 3.35 Energy dissipation for a PFAL inverter for a constant input signal

Fig. 3.36 Energy dissipation for a PFAL inverter for a toggling input sequence

Fig. 3.37 Energy Saving Factor of an ECRL inverter under the impact of NBTI


Fig. 3.38 Energy Saving Factor of a PFAL inverter under the impact of NBTI


Fig. 3.39 Relative change of the ESF in ECRL and PFAL from the virgin circuit $\left(E S F_{0}=\operatorname{ESF}\left(\Delta V_{t h, p}=0\right)\right)$ to one with an NBTI-induced $V_{t h}$-shift of 50 mV $\left(E S F\left(\Delta V_{t h, p}=-50 \mathrm{mV}\right)\right)$


### 3.3.2 Comparison of the Stress Due to the Permanent NBTI in Static CMOS and AL

The permanent part of NBTI accumulates during stress times, and is not changed during intervals of relaxation. Thus to calculate the permanent stress induced due to NBTI, the stress time $t_{s}$ has to be determined. A device is exposed to stress conditions during stress times and thus gains some offset shift due to NBTI. Due to the four-phase power-clock in Adiabatic Logic, a logical one is not represented by a $V_{D D}$ signal during the whole cycle time $T$. It is composed by a ramp, that applies the maximum peak of $V_{D D}$ for one fourth of the clock cycle, if a trapezoidal waveform is expected.

The physical overall stress time of the CMOS circuit is determined as

$$
\begin{equation*}
t_{s, C M O S}=N \cdot T \cdot D C \tag{3.27}
\end{equation*}
$$

Fig. 3.40 Even if the gates do output the same states in a logical sense, the physical representation in static CMOS and Adiabatic Logic does differ

$D C$ is the stress duty cycle, which indicates the fraction of the overall cycles during which the gate is under stress conditions. This value is connected to the logical signals. For example, for an inverter gate in static CMOS, NBTI stress conditions are seen when the output is at a one value. If the circuit outputs a one for half of the operating time, the $D C$ value is 0.5.

Figure 3.40 shows the mapping from logical value to the physical representation in static CMOS and in Adiabatic Logic. From there it is seen, that in AL the maximum stress is only applied for one fourth of the cycle time. In Adiabatic Logic, the physical representation of the logic value is only valid during the hold interval, thus, the stress time is reduced by the factor 4 . This is a slight underestimation, as also during the evaluate and the recover interval, reduced stress conditions will occur. Due to the dependence of the stress in (3.18) on the voltage, these intervals will have a minor impact on the overall stress. Thus, the overall stress time in AL will be expressed as:

$$
\begin{equation*}
t_{s, A L, \text { out }}=\frac{1}{4} N \cdot T \cdot D C . \tag{3.28}
\end{equation*}
$$

Two transistors in ECRL as well as in PFAL are exposed to NBTI stress: the p-channel device connected to the output out and the counterpart on the opposite side connected to $\overline{\text { out }}$. With this dual-rail encoding, in each cycle one of the two devices will be stressed, leading to the stress time of the PMOS device on the $\overline{o u t}$ side of the gate

$$
\begin{equation*}
t_{s, A L, \overline{\text { out }}}=\frac{1}{4} N \cdot T(1-D C) . \tag{3.29}
\end{equation*}
$$
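The stress times (3.27) to (3.29) are simple enough to be tabulated directly. The sketch below is a minimal illustration; the function names and the example values for $N$, $T$, and $D C$ are assumptions chosen for demonstration.

```python
def stress_time_cmos(n_cycles: float, period: float, dc: float) -> float:
    """Overall stress time in static CMOS, Eq. (3.27)."""
    return n_cycles * period * dc


def stress_time_al(n_cycles: float, period: float, dc: float):
    """Stress times of the two PMOS devices in an adiabatic gate.

    Returns (t_out, t_out_bar) according to Eqs. (3.28) and (3.29).
    Only the hold interval (one fourth of the cycle) counts as stress.
    """
    t_out = 0.25 * n_cycles * period * dc
    t_out_bar = 0.25 * n_cycles * period * (1.0 - dc)
    return t_out, t_out_bar


# Example (assumed values): 1e9 cycles at 100 MHz, output high 25 % of the time.
N, T, DC = 1e9, 10e-9, 0.25
print("CMOS:", stress_time_cmos(N, T, DC), "s")
print("AL (out, out_bar):", stress_time_al(N, T, DC), "s")
```

For a low duty cycle such as this one, the device on the $\overline{o u t}$ side accumulates more stress than the device on the out side, which is why the worst case of both has to be considered.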

A $V_{t h}$-shift is introduced in the PMOS devices in accordance with the power-law time function given in (3.18). Let us assume the general case that $\Delta V_{t h} \propto t_{s}^{b}$.

Fig. 3.41 Induced voltage shift in static CMOS compared to Adiabatic Logic with respect to $b$ and $D C$


Then the shift induced in static CMOS can be related to the shift induced in AL via

$$
\begin{align*}
\frac{\Delta V_{\text {th }, \text { CMOS }}}{\Delta V_{\text {th }, A L}} & =\frac{(N \cdot T \cdot D C)^{b}}{\left(\max \left\{t_{s, A L, \text { out }}, t_{s, A L, \overline{\text { out }}}\right\}\right)^{b}} \\
& =\left(\frac{D C}{\max \left\{\frac{1}{4} D C, \frac{1}{4}(1-D C)\right\}}\right)^{b} . \tag{3.30}
\end{align*}
$$

The maximum stress time is responsible for the worst impact caused by NBTI. To estimate the worst case in AL, the side of the adiabatic gate that experiences the maximum accumulated stress is considered, selected via the maximum function. Depending on the $D C$ value, either the stress of the PMOS device connected to the out node or that of the one on the opposite side $\overline{o u t}$ is decisive. For a $D C$ value of less than 0.5, $t_{s, A L, \overline{\text { out }}}$ is chosen; for a $D C$ value of more than 0.5, $t_{s, A L, \text { out }}$ is chosen. Equation (3.30) can be rewritten as:

$$
\frac{\Delta V_{t h, C M O S}}{\Delta V_{t h, A L}}= \begin{cases}\left(\frac{D C}{\frac{1}{4}(1-D C)}\right)^{b}, & D C<0.5  \tag{3.31}\\ 4^{b}, & D C \geq 0.5\end{cases}
$$

In Fig. 3.41 a plot of (3.31) is presented. The relation of the $V_{t h}$-shifts is plotted over $D C$ and $b$. For $D C \geq 0.5$ the result is independent of $D C$. For a $D C<0.5$ the relation $\frac{\Delta V_{t h, C M O S}}{\Delta V_{t h, A L}}$ is dependent on the stress duty cycle $D C$. Proceeding to $D C=0$ it is observed that no stress is introduced into the static CMOS gate, whereas due to dual-rail signaling in AL NBTI will introduce a $V_{t h}$-shift. The impact of the parameter $b$ is also observed here and it is shown, that the effect of varying $b$ within the given range is low. In literature, the exponent $b$ is given as 0.27 [53, 54].

Equidistant lines of $\frac{\Delta V_{t h, C M O S}}{\Delta V_{t h, A L}}$ are plotted in Fig. 3.42. The unity line is found for a $D C=0.2$. On the right side of this line it can be seen, that AL is less impacted by NBTI, only for very small $D C$ values, static CMOS sees less NBTI due to the short overall stress time. Again for $D C \geq 0.5$ the relation in the shifts is independent of $D C$. If a gate is under stress for more than $50 \%$ of the operating time $N \cdot T$, for an

Fig. 3.42 $\frac{\Delta V_{t h, C M O S}}{\Delta V_{t h, A L}}$ due to the threshold shift induced by NBTI

exponent $b=0.27$, static CMOS $V_{t h}$-shift is more than 1.4 times higher than that in AL.
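The characteristic points of (3.31) can be reproduced with a short script. This is a minimal sketch; the function name is an assumption, and the default exponent is the literature value $b=0.27$ quoted above.

```python
def vth_shift_ratio(dc: float, b: float = 0.27) -> float:
    """Ratio of the V_th-shift in static CMOS to that in AL, Eqs. (3.30)/(3.31)."""
    if not 0.0 <= dc <= 1.0:
        raise ValueError("duty cycle must lie in [0, 1]")
    # Worst-case side of the dual-rail gate, normalized to N * T.
    worst_al = max(0.25 * dc, 0.25 * (1.0 - dc))
    return (dc / worst_al) ** b


for dc in (0.0, 0.1, 0.2, 0.5, 0.9):
    print(f"DC = {dc:.1f}: ratio = {vth_shift_ratio(dc):.3f}")
```

The script yields a ratio of 1 at $D C=0.2$ and a constant value of $4^{0.27} \approx 1.45$ for $D C \geq 0.5$, in agreement with the unity line and the factor of more than 1.4 stated above.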

Summarizing the section on HCI and NBTI it is remarked that Adiabatic Logic is not suffering from HCI at all. In case of NBTI the four-phase power-clock is advantageous compared to the constant voltage-supply used in static CMOS. The overall stress time $t_{s}$ in Adiabatic Logic is less under certain stress duty cycles. An inherent relaxation phase is also introduced, as the power-clock is shut-down during each cycle for a quarter of a period. Thus, Adiabatic Logic in addition to its ultra-low-power capability is also quite suitable for the construction of reliable digital signal processing systems.

### 3.3.3 How Will Positive Bias Temperature Instability (PBTI) Impact Adiabatic Logic?

In future technology nodes also the NMOS devices will suffer from BTI [50]. Positive BTI (PBTI) will increase the threshold voltage of the NMOS devices. In the charging path of ECRL no NMOS devices are used. PBTI will impact the logic blocks, a weaker pull-down network is the result. This will not impact the adiabatic losses, but the residual charge due to capacitive coupling in the recover interval will be increased, non-adiabatic losses are therefore increased. Leakage losses will be decreased.

In PFAL, on the one hand, the NMOS devices in the latch will be weakened. On the other hand, the logic blocks assisting the charging process in the evaluation interval will suffer from a higher on-resistance. A higher on-resistance will lead to an increased energy consumption. In Sect. 2.4 it is mentioned, that the minimum supply voltage in PFAL is dependent only on the threshold voltage of the NMOS devices. This means, that due to PBTI PFAL will also suffer from a raised minimum supply voltage.

By means of simulation of a 130 nm technology, the impact of PBTI only and composite stress due to NBTI and PBTI shall be investigated. ECRL and

Fig. 3.43 Energy dissipation per cycle $E_{\text {diss }}$ versus the frequency for an ECRL inverter under the influence of NBTI and PBTI. Four constellations are shown, each with a threshold voltage shift of $\left(\Delta V_{t h, P}, \Delta V_{t h, N}\right)$


Fig. 3.44 Impact factors of NBTI and PBTI over the operating frequency for an ECRL inverter under NBTI- and PBTI-induced threshold voltage shift. Factors are derived by $E_{d i s s}\left(\Delta V_{t h, P}, \Delta V_{t h, N}\right) / E_{\text {diss }}(0,0)$ for the three constellations


PFAL inverters are simulated. The threshold voltage of the devices is altered by connecting a voltage source in series with the gate node of each device. All NMOS/PMOS devices share the same threshold shift caused by PBTI/NBTI. Figures 3.43 and 3.45 show $E_{\text {diss }}$ versus the frequency for the ECRL and the PFAL inverter. Four ( $\Delta V_{t h, P}, \Delta V_{t h, N}$ ) constellations are pictured. One is without any stress-induced threshold voltage shift. Two further setups apply a shift with a magnitude of 50 mV caused by PBTI only or by NBTI only. The last setup shows what happens if both effects are combined. Figures 3.44 and 3.46 show the impact factors at various frequencies, i.e. the altered value $E_{d i s s}\left(\Delta V_{t h, P}, \Delta V_{t h, N}\right)$ relative to the case $E_{d i s s}(0,0)$. An input pattern sequence has been chosen that toggles the input signal every second cycle. Thus the input signal combines the toggling and the constant input sequence.

In Sect. 3.3.1 it is concluded from the simulation results, that for ECRL an increased absolute threshold voltage of the PMOS device will decrease the consumption within the higher frequency regime. This can be observed in Fig. 3.43 for the case of $(-50 \mathrm{mV}, 0)$. An increase in $\Delta V_{t h, N}$ leads to increased losses in the high frequency regime. Whenever a voltage shift is introduced, the leakage regime shows reduced losses. Both, PMOS as well as NMOS devices are involved in the leakage

Fig. 3.45 Energy dissipation per cycle $E_{\text {diss }}$ versus the frequency for a PFAL inverter under the influence of NBTI and PBTI. Four constellations are shown, each with a threshold voltage shift of $\left(\Delta V_{t h, P}, \Delta V_{t h, N}\right)$


Fig. 3.46 Impact factors of NBTI and PBTI over the operating frequency for a PFAL inverter under NBTI- and PBTI-induced threshold voltage shift. Factors are derived by $E_{\text {diss }}\left(\Delta V_{t h, P}, \Delta V_{t h, N}\right) / E_{\text {diss }}(0,0)$ for the three constellations

regime. In the case $(-50 \mathrm{mV}, 50 \mathrm{mV})$ the losses in the leakage regime are reduced most. From Fig. 3.44 it is seen that in the case of $(-50 \mathrm{mV}, 50 \mathrm{mV})$ at a frequency of 100 MHz an energy increase of around $20 \%$ is observed. On the other hand, in the leakage regime, the losses are reduced by almost $80 \%$ (at 100 kHz ). During active operation at a reasonably high frequency, leakage losses do not contribute noticeably to the overall losses. As soon as a system allows for a long power-down, leakage losses will contribute to the mean overall losses. In this case, a system's overall energy consumption figure could benefit from the reduced leakage losses.

The PFAL inverter shows an almost identical energy consumption in the higher frequency regime when only the NMOS device threshold voltage is increased by 50 mV. The main contribution to the on-resistance in the evaluation and the recover interval is due to the PMOS device. As soon as the PMOS device has an increased absolute threshold voltage, the losses in the adiabatic frequency regime are increased. Leakage losses are only slightly decreased in the case of $(-50 \mathrm{mV}, 0)$, but are greatly reduced as soon as the NMOS devices are weakened by a 50 mV shift. According to Fig. 3.46, at 100 MHz the energy dissipation is increased by less than $20 \%$ for $(-50 \mathrm{mV}, 50 \mathrm{mV})$ and the leakage is decreased by around $75 \%$ (at 100 kHz ).

## Chapter 4 <br> Generation of the Power-Clock

### 4.1 Introduction

For the operation of the Efficient Charge Recovery Logic (ECRL) and the Positive Feedback Adiabatic Logic (PFAL), a four-phase power-clock is needed (see Sect. 2.2.2). The design of the power supply has a crucial impact on the overall savings of an adiabatic system, as a bad conversion efficiency could flatten the savings gained by using Adiabatic Logic in the logic circuit. Generally the four-phase power-clock can be generated via inductor-free [75] or inductor-based [16, 76-79] clock generators.

Avoiding the inductors can be achieved by stepwise charging, where the charging of the output to $V_{D D}$ is not done abruptly, but is divided into $N$ steps. Figure 4.1 shows a scheme of a stepwise charging circuit. It consists of $N$ voltage supplies with a voltage of $i \frac{V_{D D}}{N}$, where $i=1,2, \ldots, N$. Multiple voltage sources are not favored in integrated circuits. Only $V_{D D}$ has to be supplied, all other voltage sources can be substituted by a tank capacitor.

To load the output capacitance switch 1 is closed, loading $C_{L}$ to $\frac{V_{D D}}{N}$. Then switch 1 is opened and switch 2 is closed, $C_{L}$ is charged to $2 \frac{V_{D D}}{N}$. Proceeding with the switches up to switch $N$, the capacitor $C_{L}$ at the end is loaded to $V_{D D}$. To unload $C_{L}$, first switch $N-1$ is closed. If the tank capacitor is much larger than the load capacitance, $C_{L}$ will be unloaded to $(N-1) \frac{V_{D D}}{N}$. Consecutively, all switches down to switch 1 are closed, the charge is transferred to the according tank capacitors. Only the last portion of charge $\frac{V_{D D}}{N}$ is dissipated to ground via switch 0 . The energy dissipated via stepwise charging $C_{L}$ is [75]

$$
E_{S W C}=\frac{C_{L} V_{D D}^{2}}{N}
$$

for a whole cycle of charging and discharging the capacitance. The more steps introduced, the less energy is consumed. But more switches have to be used, when the amount of steps is increased. These switches have to be controlled by static CMOS signals leading to more losses as the effort in the control logic increases. Switches 1 to $N$ have to be closed and opened again during the charging process, and switch

Fig. 4.1 This stepwise charging scheme shows how the load capacitance $C_{L}$ is charged to $V_{D D}$ via $N$ steps. Switches 1 to $N$ are used to charge the capacitor from 0 to $V_{D D}$. Discharging the capacitor also implies switch 0 , as the last portion of charge has to be dismissed to the GND potential

$N-1$ down to 0 are closed and opened during the discharging process. Overall losses $E_{S W}$ due to controlling the switches are given by [75]

$$
E_{S W}=\left(\sum_{i=1}^{N} C_{i}+\sum_{i=0}^{N-1} C_{i}\right) V_{D D}^{2}
$$

with the switch capacitance $C_{i}$ of switch $i$. Each switch has to be able to deliver the charge $q=C_{L} \frac{V_{D D}}{N}$ within the time $\frac{T}{N}$. Therefore, if a higher $N$ is chosen, the number of switches on the one hand is increased, but each switch has to deliver a smaller fraction of charge. Additionally, when a constant time $T$ for charging the voltage from 0 to $V_{D D}$ is assumed, the time for each step $\frac{T}{N}$ to settle to the according voltage level is also decreased when $N$ is increased. Thus the energy due to $E_{S W C}$ will be decreased for higher values of $N$, but the losses due to controlling the switches will increase. It also has to be considered, that the drain and source potentials at switches closer to $V_{D D}$ are at a higher level. So the gate overdrive voltage, if all switches are controlled by the same voltage $V_{\text {ctrl }}$, will be decreased.
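The trade-off between $E_{S W C}$ and $E_{S W}$ can be made concrete with a small numerical sketch. It assumes, as a simplification not made in the text, that all switches have the same capacitance $C_{i}=C_{\text {switch }}$ independent of $N$, so that $E_{S W}=2 N C_{\text {switch }} V_{D D}^{2}$. Function names and component values are illustrative assumptions.

```python
def e_swc(c_load: float, v_dd: float, n: int) -> float:
    """Energy dissipated by stepwise charging and discharging C_L in N steps."""
    return c_load * v_dd ** 2 / n


def e_sw(c_switch: float, v_dd: float, n: int) -> float:
    """Control losses for the switches, assuming identical switch capacitances.

    Switches 1..N toggle while charging and N-1..0 while discharging,
    which gives 2 * N switch events per cycle.
    """
    return 2 * n * c_switch * v_dd ** 2


def best_n(c_load: float, c_switch: float, v_dd: float, n_max: int = 64) -> int:
    """Number of steps that minimizes the sum of both loss contributions."""
    return min(
        range(1, n_max + 1),
        key=lambda n: e_swc(c_load, v_dd, n) + e_sw(c_switch, v_dd, n),
    )


# Assumed example: 1 pF load, 5 fF per switch, V_DD = 1.2 V.
n_opt = best_n(1e-12, 5e-15, 1.2)
print("optimum number of steps:", n_opt)
```

Under this assumption the continuous optimum lies at $N=\sqrt{C_{L} /\left(2 C_{\text {switch }}\right)}$, which is 10 for the example values. In a real design the switches can be sized smaller for larger $N$, and the reduced gate overdrive of the upper switches has to be taken into account, so the result is only a first orientation.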

To use stepwise charging with the Adiabatic Logic families presented in Sect. 2.2.1, four phases have to be generated to operate the system. Each switch has to be implemented four times, once for each phase, and the control signals for the $90^{\circ}, 180^{\circ}$, and $270^{\circ}$ power-clock phases have to be generated additionally. Thus, the effort for the control logic and the area consumption for the tank capacitors will be strongly increased. Stepwise charging is therefore not a suitable way to generate the power-clock for the four-phase adiabatic families used in this work.

Adiabatically charging a capacitance can also be accomplished by using resonant loading via an $L C$ constellation. If only one capacitance has to be charged, e.g. the capacitance of a long line, one inductor can be combined with a large tank capacitor as seen in Fig. 4.2.

The tank capacitor is charged to $\frac{V_{D D}}{2}$ while switches B are closed, and the load capacitance is shorted to ground. Then switches B are opened and switch A is closed. Via the inductor, charge is resonantly transferred from $C_{T}$ to $C_{L}$. $C_{T}$ has to be chosen large enough that the charge transferred to the load $C_{L}$ does not greatly influence the voltage level of the tank capacitor. Switch A is opened when the peak voltage is reached, so that the output level is kept constant. In order to recover the energy stored on $C_{L}$, switch A is closed, and the charge is resonantly transferred back to the tank capacitor. When the switches B are closed, the voltage level on $C_{T}$ is refreshed to $\frac{V_{D D}}{2}$ and residual charge on $C_{L}$ is discharged to ground.

Fig. 4.2 Driving a load $C_{L}$ adiabatically can be done with the resonant line driver proposed in [80]. The charge is transferred from the tank capacitor $C_{T}$ to the load capacitance $C_{L}$ via a resonant charging process over the inductance $L$

Basically, a similar approach is used for Adiabatic Logic circuits with a four-phase power-clock. The tank capacitor is replaced by the capacitance of the circuit connected to the phase shifted by $180^{\circ}$, and charge is cycled between two phases of the power-clock. In this work, inductor-based power-clock generators are used for supplying the adiabatic circuit, as those show a high conversion efficiency and the effort for the generation of the control signals is relatively low.

### 4.2 Topologies of Inductor-Based Power-Clock Generators

Different topologies have been proposed in the past. In [81, 82] an inductor-based generator for the four-phase power-clock is presented. It is composed of four $90^{\circ}$ phase-shifting elements, each realized with a resonant LC circuit. In order to generate a trapezoidal waveform, Schottky diodes are introduced to cut off the peaks of the sinusoidal oscillation, which causes additional losses. However, a sinusoidal waveform is also suitable for the power-clock [10].

A variety of oscillators, including the proposal in [81, 82], are compared in [83]. The synchronous 2N2P oscillator uses two inductances compared to the four inductances in the oscillator presented in [81, 82]. A disadvantage compared to asynchronous (self-timed) topologies is the additional circuitry needed to generate the control signals for gating the transistors in the synchronous 2N2P oscillator. The asynchronous 2N2P oscillator avoids control signals, but the efficiency is strongly impacted and will decrease by more than $35 \%$ [83]. Also, the asynchronous oscillator is more susceptible to fluctuations in the load, as the operating frequency is shifted if deviations in the capacitance occur. The synchronous oscillator is timed by external control signals and thus will be pinned to the desired frequency.

As integrated inductors will lead to a major area consumption in a design, the proposal presented in [84] is also interesting. Here only one inductor is used, which is switched between the different phases by a matrix of transmission gates. A proper way of synchronizing the switching process to the current flow in the inductor is of great importance. In the comparison presented in [83] a poor conversion efficiency is given for the one-coil oscillator.

Fig. 4.3 Phases $\Phi_{0}$ and $\Phi_{2}$ are generated via a synchronous 2N2P oscillator. The corresponding synchronization signals $s_{0}, \overline{s_{0}}, s_{2}$ and $\overline{s_{2}}$ are used to synchronize the oscillator and to inject energy compensating for the losses. $C_{0}$ and $C_{2}$ are the equivalent capacitances of the adiabatic system connected to phase $\Phi_{0}$ and $\Phi_{2}$, respectively. A second oscillator is used for the phases $\Phi_{1}$ and $\Phi_{3}$, where the synchronization signals are shifted by $90^{\circ}$

According to the investigation in [83], the synchronous 2N2P oscillator is the best choice with respect to conversion efficiency and charge recovery capability. It consists of two n-channel and two p-channel transistors, each having a separate control signal to connect the phases to ground or $V_{D D}$, respectively. In Fig. 4.3 the oscillator is pictured, and the synchronization signals $s_{0}, \overline{s_{0}}, s_{2}$ and $\overline{s_{2}}$ are given. Four phases are generated using two oscillators and applying synchronization signals to the second oscillator that are shifted by $90^{\circ}$.

In Fig. 4.3 gates connected to phase $\phi_{0}$ are represented by $R_{0}$ and $C_{0}$, and those connected to phase $\phi_{2}$ by $R_{2}$ and $C_{2}$. The $L C$ tank is thus mainly composed of the capacitances of the adiabatic circuit and the inductance $L$.

The resonance frequency of the LC tank is given by

$$
\begin{equation*}
f_{\text {res }}=\frac{1}{2 \pi \sqrt{L C}} \tag{4.1}
\end{equation*}
$$

In order to achieve a high efficiency of the oscillator, the frequency of the synchronization signals $f_{\text {sync }}$ has to match the resonance frequency of the LC tank $f_{\text {res }}$. In contrast to an asynchronous oscillator, which adapts to the resonance frequency $f_{\text {res }}$, a synchronized oscillator is forced to oscillate at a desired frequency within a certain range. Therefore, if the LC tank is detuned due to a fluctuation in the value of $C$ or $L$, the frequency of the oscillation stays at $f_{\text {sync }}$. But the efficiency will be decreased by such fluctuations, induced e.g. by the varying pattern of the data in the digital circuit.
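As a small numerical illustration of (4.1), the following Python sketch computes the resonance frequency, solves (4.1) for the inductance at a given target frequency, and reports how far $f_{\text {res }}$ moves away from $f_{\text {sync }}$ when the capacitance changes. The function names and the capacitance value are hypothetical; only the 100 MHz target frequency is taken from this chapter.

```python
import math


def resonance_frequency(inductance, capacitance):
    """f_res = 1 / (2*pi*sqrt(L*C)), Eq. (4.1)."""
    return 1.0 / (2.0 * math.pi * math.sqrt(inductance * capacitance))


def inductance_for(frequency, capacitance):
    """Solve Eq. (4.1) for L at a given target frequency and capacitance."""
    return 1.0 / ((2.0 * math.pi * frequency) ** 2 * capacitance)


def relative_detuning(f_sync, inductance, capacitance):
    """(f_res - f_sync) / f_sync for the given tank values."""
    return resonance_frequency(inductance, capacitance) / f_sync - 1.0


if __name__ == "__main__":
    f_sync = 100e6       # synchronization frequency in Hz
    c_nominal = 10e-12   # hypothetical tank capacitance in F
    l_tank = inductance_for(f_sync, c_nominal)
    # A 10 % larger capacitance lowers f_res by a little less than 5 %.
    print(relative_detuning(f_sync, l_tank, 1.1 * c_nominal))
```

Because $f_{\text {res }}$ scales with $1 / \sqrt{C}$, a relative change in capacitance shifts the resonance by roughly half that relative amount for small deviations.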

### 4.3 Impact of Pattern-Induced Capacitive Variations on the Energy Dissipation of the Synchronized 2N2P LC-oscillator

Temporal fluctuations can occur in the capacitance value seen by the oscillator, as the digital data processed inside the logic circuit will cause different capacitive loads. The capacitance $C$ of a gate is composed of a capacitance $C_{\text {Latch }}$, due to the transistors that store the information in the gate while the input transistors are turned off, and the capacitances $C_{\mathrm{F}}(\vec{x})$ and $C_{\overline{\mathrm{F}}}(\vec{x})$ of the logic blocks. While the capacitance of the latch is constant due to its symmetrical structure, the capacitance of the logic blocks is a function of the input vector $\vec{x}$, as in general the charging paths through the logic blocks differ for different input patterns. Thus the overall capacitance is also dependent on the state of the input vector and

$$
\begin{equation*}
C(\vec{x})=C_{\text {Latch }}+C_{\mathrm{F}}(\vec{x})+C_{\overline{\mathrm{F}}}(\vec{x}) . \tag{4.2}
\end{equation*}
$$

Adiabatic losses are dependent on the capacitance and also on the resistive elements in the charging path. Losses in the adiabatic frequency regime are dependent on the input vector via

$$
\begin{equation*}
E_{D i s s, A L}(\vec{x})=16 \frac{R(\vec{x}) C(\vec{x})}{T} C(\vec{x}) V_{D D}^{2} \tag{4.3}
\end{equation*}
$$

The input pattern also selects a certain charging path through the logic blocks. Not only the capacitance is a function of the input vector, but also the charging path resistance $R$. $E_{\text {Diss,AL }}$ denotes the losses in the adiabatic circuit, which vary with the $R C$ value of the path used for charging the output. Besides the losses in the adiabatic circuit, the deviation in the capacitance $C$ also leads to a deviation of the resonance frequency of the oscillator. If $f_{\text {res }}$ deviates from the synchronization frequency $f_{\text {sync }}$, the losses $E_{O s c}$ in the oscillator rise. Overall, a superposition of two energy losses that are influenced by the input pattern is seen, leading to an energy drawn from the power supply $V_{D D}$ according to

$$
\begin{equation*}
E_{V_{D D}}(\vec{x})=E_{O s c}(\vec{x})+E_{D i s s, A L}(\vec{x}) . \tag{4.4}
\end{equation*}
$$
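Equations (4.2) to (4.4) can be evaluated with a few lines of Python. The sketch below is only meant to show how the input pattern enters through both $R(\vec{x})$ and $C(\vec{x})$; the per-pattern resistance and capacitance values are hypothetical and do not come from the simulations in this chapter.

```python
def gate_capacitance(c_latch, c_f, c_f_bar):
    """C(x) = C_Latch + C_F(x) + C_Fbar(x), Eq. (4.2)."""
    return c_latch + c_f + c_f_bar


def adiabatic_loss(r_path, c_gate, period, vdd):
    """E_Diss,AL(x) = 16 * (R*C/T) * C * V_DD^2, Eq. (4.3)."""
    return 16.0 * (r_path * c_gate / period) * c_gate * vdd**2


def supply_energy(e_osc, e_diss_al):
    """E_VDD(x) = E_Osc(x) + E_Diss,AL(x), Eq. (4.4)."""
    return e_osc + e_diss_al


if __name__ == "__main__":
    vdd, period = 1.2, 10e-9  # 1.2 V supply, 100 MHz power-clock
    # Hypothetical per-pattern values: (R in ohm, C_F in F, C_Fbar in F).
    patterns = {
        (0, 0, 0): (1.0e3, 2e-15, 6e-15),
        (1, 1, 1): (3.0e3, 6e-15, 2e-15),
    }
    for x, (r, c_f, c_f_bar) in patterns.items():
        c = gate_capacitance(5e-15, c_f, c_f_bar)
        print(x, adiabatic_loss(r, c, period, vdd))
```

In this example both patterns present the same total capacitance, so the oscillator is not detuned, yet the adiabatic loss differs by the ratio of the path resistances.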

The more asymmetric the two logic blocks F and $\overline{\mathrm{F}}$ are, the more deviations occur due to $\vec{x}$. A NAND3 gate implemented in a 130 nm CMOS technology is used to investigate the impact of the pattern-induced variations as seen in Fig. 4.4. To get the information on the worst case, an equivalent of ten thousand NAND3 gates is used, all connected to the same input. Instead of connecting ten thousand gates in the simulation, and therefore increasing the simulation time, only one gate with transistor widths sized by the factor of ten thousand is used.

The synchronous 2N2P oscillator in Fig. 4.3 is used to power the adiabatic circuit. The width of the driver transistors and the inductance value are determined by simulation in such a way that the overall losses are minimized for the input pattern $\vec{x}=(0,0,0)$. All other patterns will lead to a deviation, as they present a different capacitive load to the oscillator, and thus the $L C$ tank will be detuned.


Fig. 4.4 A PFAL gate consists of the latch (N1, N2, P1 and P2) and the logic block F and the inverse logic block $\overline{\mathrm{F}}$. As example the blocks F and $\overline{\mathrm{F}}$ for a 3-input NAND (NAND3) gate are shown

Table 4.1 $E_{V_{D D}}(\vec{x})$ of an equivalent of ten thousand parallel NAND3 gates for the different input patterns with respect to $E_{V_{D D}}(0,0,0)$

| Input pattern $\vec{x}$ | $(0,0,0)$ | $(0,0,1)$ | $(0,1,0)$ | $(0,1,1)$ | $(1,0,0)$ | $(1,0,1)$ | $(1,1,0)$ | $(1,1,1)$ |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| $E_{V_{D D}}(\vec{x}) / E_{V_{D D}}(0,0,0)-1$ | $0.0 \%$ | $-20.7 \%$ | $-12.1 \%$ | $-33.9 \%$ | $13.5 \%$ | $-16.9 \%$ | $26.5 \%$ | $13.5 \%$ |

In Table 4.1 the results of the simulation for all input patterns are summarized. All results give the relative overhead with respect to pattern $\vec{x}=(0,0,0)$. Deviations due to the input pattern lead to a major overhead in energy consumption of up to $26.5 \%$ (for $\vec{x}=(1,1,0)$), but can also reduce the overall energy consumption by up to $33.9 \%$ (for $\vec{x}=(0,1,1)$). A deviation in the capacitance value will lead to increased energy consumption in the oscillator. A reduced overall value is nevertheless possible due to a reduction of the adiabatic losses $E_{\text {Diss,AL }}(\vec{x})$ in (4.4). This is the case when the capacitive load seen by the oscillator is not greatly influenced by the input pattern, but the charging path resistance is reduced. Then the dissipation in the logic gates $E_{\text {Diss,AL }}(\vec{x})$ is reduced, while the dissipation in the oscillator itself $E_{O s c}(\vec{x})$ stays more or less constant.
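The figures quoted from Table 4.1 can be checked mechanically. The short Python sketch below stores the table values and extracts the patterns with the largest overhead and the largest saving. The mean over all patterns is shown only as an illustration and assumes that all eight patterns are equally likely, which is an assumption made here and not a statement from the simulation.

```python
# Relative deviation of E_VDD from the (0,0,0) case in percent (Table 4.1).
TABLE_4_1 = {
    (0, 0, 0): 0.0,
    (0, 0, 1): -20.7,
    (0, 1, 0): -12.1,
    (0, 1, 1): -33.9,
    (1, 0, 0): 13.5,
    (1, 0, 1): -16.9,
    (1, 1, 0): 26.5,
    (1, 1, 1): 13.5,
}


def extremes(table):
    """Return (pattern with largest overhead, pattern with largest saving)."""
    worst = max(table, key=table.get)
    best = min(table, key=table.get)
    return worst, best


def mean_deviation(table):
    """Mean deviation in percent, assuming equally likely patterns."""
    return sum(table.values()) / len(table)


if __name__ == "__main__":
    worst, best = extremes(TABLE_4_1)
    print("largest overhead:", worst, TABLE_4_1[worst])
    print("largest saving:", best, TABLE_4_1[best])
    print("mean deviation in percent:", mean_deviation(TABLE_4_1))
```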

In a real system smaller deviations than in the NAND3 gate simulation setup will occur. A system is composed of a variety of digital gates, so different logic blocks will be used. Some of them will be more symmetric than the NAND3 gate. For example, an inverter or buffer in Adiabatic Logic does not lead to any asymmetry, and its adiabatic losses and capacitive load are independent of the input state. Furthermore, the different input signals of the individual gates lead to a superposition of the contributions of all logic gates, and thus the deviation will be reduced due to canceling. A digital system composed of adiabatic gates will present a capacitance $C_{\text {System }}(\vec{X})$ to the oscillator. The system state vector $\vec{X}$ holds all the input vectors $\vec{x}_{n}$ of the $N$ gates the system is composed of.

$$
\begin{align*}
C_{\text {System }}(\vec{X}) & =\sum_{n=1}^{N} C_{n}\left(\vec{x}_{n}\right) \\
& =N C_{\text {Latch }}+\sum_{n=1}^{N} C_{\mathrm{F}_{n}}\left(\vec{x}_{n}\right)+\sum_{n=1}^{N} C_{\overline{\mathrm{F}}_{n}}\left(\vec{x}_{n}\right) . \tag{4.5}
\end{align*}
$$

As the latches in PFAL and ECRL are symmetric, the system capacitance consists of a constant capacitance $N C_{\text {Latch }}$ due to the latches and two further contributions introduced by the logic blocks. The logic block capacitances are a function of the input vector. The impact of pattern-induced variations is simulated using a real system in the following.
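The canceling effect behind (4.5) can be made plausible with a simple statistical sketch. It assumes $N$ identical gates whose input patterns are equally likely and independent from gate to gate, so that the variances of the individual gate capacitances add up. These assumptions and the capacitance values are hypothetical simplifications; a real system contains different gate types with correlated inputs.

```python
import math
import statistics


def relative_spread(pattern_caps, n_gates):
    """Relative standard deviation of C_System for n_gates identical gates.

    pattern_caps lists C(x) = C_Latch + C_F(x) + C_Fbar(x) of one gate for
    every input pattern. Patterns are assumed equally likely and
    independent from gate to gate, so the variances add up.
    """
    mean = statistics.mean(pattern_caps)
    sigma = statistics.pstdev(pattern_caps)
    return sigma / (mean * math.sqrt(n_gates))


if __name__ == "__main__":
    # Hypothetical gate capacitances in fF for four input patterns.
    caps = [10.0, 12.0, 14.0, 12.0]
    for n in (1, 100, 10000):
        print(n, relative_spread(caps, n))
```

Under these assumptions the relative spread shrinks with $1 / \sqrt{N}$. The NAND3 experiment above is the opposite extreme: all ten thousand gates share the same input, so their deviations are fully correlated and no averaging takes place.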

### 4.3.1 Impact of Pattern-Induced Variations on the Dissipation of a Discrete-Cosine Transformation (DCT) System

As a test vehicle, the Discrete-Cosine Transformation (DCT) system pictured in Fig. 6.26 is used. It is composed of transistors in a 130 nm CMOS technology. Simulations are carried out with the nominal $V_{D D}$ of 1.2 V. The target frequency is 100 MHz. First of all, the oscillator's driving transistors and the inductance have to be sized to fit the median operating conditions that the DCT presents to the oscillator. To find the right value for the inductor, the DCT is simulated with a trapezoidal input signal, and the current $I_{l}$ is measured. The input pattern used for the simulation is random, and a mean value $\bar{I}_{l}$ is used to determine the capacitance $C_{l}$ according to

$$
\begin{equation*}
C_{l}=\frac{Q_{\mathrm{C}}}{V_{D D}}=\frac{T \overline{I_{l}}}{4 V_{D D}} . \tag{4.6}
\end{equation*}
$$

Inserting $C_{l}$ into (4.1) results in the inductance value $L$. The optimum width for the driver transistors in the oscillator is found by simulation over different widths. Generally, the drivers have to be strong enough, to deliver the energy lost in one cycle within the synchronization time. If the transistors are sized too large, they will on the one hand increase the capacitance value (but this also helps to decrease the relative impact induced by the patterns) and on the other hand, leakage losses due to the transistors will increase.
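The sizing procedure, (4.6) followed by (4.1) solved for $L$, can be sketched in Python as follows. The supply voltage and the target frequency are taken from the text; the mean current is a hypothetical placeholder, since the measured value is not stated here.

```python
import math


def load_capacitance(period, mean_current, vdd):
    """C_l = T * I_mean / (4 * V_DD), Eq. (4.6)."""
    return period * mean_current / (4.0 * vdd)


def tank_inductance(f_target, capacitance):
    """Eq. (4.1) solved for L: L = 1 / ((2*pi*f)^2 * C)."""
    return 1.0 / ((2.0 * math.pi * f_target) ** 2 * capacitance)


if __name__ == "__main__":
    f_target = 100e6      # target frequency in Hz
    vdd = 1.2             # nominal supply voltage in V
    mean_current = 1e-3   # hypothetical mean current in A
    c_l = load_capacitance(1.0 / f_target, mean_current, vdd)
    print("C_l in F:", c_l)
    print("L in H:", tank_inductance(f_target, c_l))
```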

By simulating 2000 random patterns, the impact of the deviations on the dissipation of the oscillator is investigated with the DCT system connected to it. Figure 4.5 shows the deviation of $E_{V_{D D}}$ with respect to the mean value $\overline{E_{V_{D D}}}$. Almost all input patterns lead to a deviation of less than $5 \%$ with respect to the mean value. So the impact in a real system is relatively small compared to the results of the NAND3 gate. A certain amount of capacitance will additionally be attached to the oscillator by the layout parasitics. These are not considered in the simulations, but will in the end help to stabilize the oscillator's frequency. When carefully designed, capacitances of the clock distribution network will not considerably contribute to the overall losses. Only inter- and intragate interconnects will lead to an increase in the energy dissipation, as those capacitances are charged via the charging path transistors of the gate.

Fig. 4.5 $\quad E_{V_{D D}}$ with respect to the corresponding mean value $\overline{E_{V_{D D}}}$ for the 2000 random input patterns of the simulated DCT system

So large systems do not significantly suffer from pattern-induced variations. The impact is less than $5 \%$ and therefore, no countermeasures to stabilize the capacitive value have to be taken. The 2N2P LC-oscillator is an efficient and robust four-phase power-clock generator circuit for Adiabatic Logic systems.

### 4.4 Generation of the Synchronization Signals

A further source of energy consumption is the generation of the synchronization signals for the synchronous 2N2P LC-oscillator. In general, the synchronization signals have to be derived from a reference signal. This generation also causes losses that additionally contribute to the overall losses of the adiabatic system. Equation (4.4) is therefore extended by the term $E_{S y n c}$, the dissipation caused by generating the synchronization signals for the oscillator, which yields

$$
\begin{equation*}
E_{V_{D D}}=E_{O s c}+E_{D i s s, A L}+E_{S y n c} . \tag{4.7}
\end{equation*}
$$

To preserve the energy savings gained by using Adiabatic Logic, the portions $E_{O s c}$ and $E_{S y n c}$ have to be kept as small as possible. In general it cannot be expected that the synchronization signals are generated without additional energy consumption. They have to be derived from a clock signal supplied by an internal or external clock source. Two methodologies to generate the signals will be introduced here. One is a synchronous approach, using a state machine to generate the signals. The other approach is asynchronous, using a delay element to shift a reference signal by $90^{\circ}$. Both methodologies are investigated with respect to their energy consumption and their robustness when parameter variations influence the design.

Besides the generation of the signals, there are also losses in the buffers driving the gates of the oscillator's transistors. The gate capacitance $C_{G}$ of the oscillator transistors has to be driven, leading to a loss of

$$
\begin{equation*}
E_{B u f}=C_{G} V_{x}^{2} . \tag{4.8}
\end{equation*}
$$

Variable $V_{x}$ is used in (4.8), as boosting the gates can be applied to reduce the on-resistance of the oscillator's transistors. Equation (4.7) is extended by $E_{B u f}$ to

$$
\begin{equation*}
E_{V_{D D}}=E_{O s c}+E_{D i s s, A L}+E_{S y n c}+E_{B u f} . \tag{4.9}
\end{equation*}
$$

In (4.9) only $E_{S y n c}$ is independent of the size of the system connected to the oscillator. If a larger system is connected, the capacitive load of the oscillator is raised. Thus, the inductance value has to be reduced. According to (2.7), losses in the adiabatic system will be increased, as more gates consume more energy. The oscillator has to be adapted accordingly. As more energy is dissipated in the system, the oscillator has to deliver more energy during the synchronization and therefore needs larger transistors, leading to higher losses in the oscillator. The gate capacitances of the oscillator transistors are increased as well, thus losses in the driver will also be increased according to (4.8). Only the generation of the synchronization signals is not impacted by the size of the adiabatic system. The relative overhead due to $E_{S y n c}$ is reduced if the system size is increased.
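The shrinking relative overhead of $E_{S y n c}$ can be illustrated with a rough Python sketch based on (4.9). It uses the two dissipation values listed later in Table 4.3 for $E_{S y n c}$ and lumps $E_{O s c}+E_{D i s s, A L}+E_{B u f}$ into a single per-gate energy that is assumed to scale linearly with the number of gates. Both this linear scaling and the per-gate value are hypothetical simplifications.

```python
E_SYNC_DELAY_ELEMENT = 7.25e-14  # J, Table 4.3
E_SYNC_STATE_MACHINE = 3.75e-13  # J, Table 4.3


def sync_share(e_sync, e_per_gate, n_gates):
    """Fraction of E_VDD in Eq. (4.9) that is caused by E_Sync.

    e_per_gate lumps E_Osc + E_Diss,AL + E_Buf per gate and is assumed
    to scale linearly with the gate count (a simplification).
    """
    return e_sync / (e_sync + e_per_gate * n_gates)


if __name__ == "__main__":
    e_per_gate = 1e-15  # hypothetical lumped energy per gate in J
    for n_gates in (100, 1000, 10000):
        print(n_gates,
              round(sync_share(E_SYNC_DELAY_ELEMENT, e_per_gate, n_gates), 3),
              round(sync_share(E_SYNC_STATE_MACHINE, e_per_gate, n_gates), 3))
```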

### 4.4.1 Synchronous Versus Asynchronous Generation of the Control Signals for the Oscillator

Synchronous Generation Synchronous generation of the control signals for the oscillator transistors is achieved by using a state machine. Clocked with a reference signal, the state machine will step through the states. It has to be clocked with at least four times the frequency $f_{\text {sync }}$, as this is required to subdivide the period into $90^{\circ}$ sections. Figure 4.6 shows the schematic of such a state machine. To adapt the output signals to the pulse width $T_{p w}$, pulse shrinkers are inserted. The same could also be achieved by a higher input clock frequency and by enhancing the state machine with intermediate steps. But then the number of transitions per time unit is increased, as well as the amount of hardware. Both will lead to increased losses. Thus it is desirable to use the reference signal with the minimum number of transitions per time unit, which is the reference signal with the frequency $f_{\text {ref }}=4 f_{\text {sync }}$, and to connect pulse shrinkers to the outputs of the state machine.

The state machine consists of two parts, a 2-bit synchronous counter and a one-hot decoder. Four states are cycled through continuously, synchronous to the reference signal with frequency $f_{\text {ref }}$. Each state is associated with one synchronization signal, and a one-hot decoder is used to generate the mapping in Table 4.2. A schematic view of the counter and the decoder is given in Fig. 4.7.

Pulses are formed with a duration of $1 / f_{\text {ref }}=1 /\left(4 f_{\text {sync }}\right)$. These pulses cannot be connected to the oscillator in this form. The pulse width has to be shrunk in order to fit it to the desired synchronization pulse width $T_{p w}$. After shaping the signals, four inverters are used to generate the inverted signals $\overline{s_{0}}, \overline{s_{1}}, \overline{s_{2}}$, and $\overline{s_{3}}$.


Fig. 4.6 A state machine with four states is needed to generate the synchronization signals $0^{\circ}, 90^{\circ}$, $180^{\circ}$, and $270^{\circ}$. Synchronous to the reference signal with a frequency of $f_{r e f}$, the four states are cycled through. The output vector $\left(0^{\circ}, 90^{\circ}, 180^{\circ}, 270^{\circ}\right)$ is set in the states, such that the waveform at the right is seen at the outputs. Without any further pulse shapers, the pulse width $T_{p w}$ is $\frac{1}{f_{r e f}}$. If a shorter pulse is needed, pulse shrinkers will shorten the pulses. Each of the two oscillators needs two signals $s_{0}$ and $s_{2}$ or $s_{1}$ and $s_{3}$ respectively, and the according inverted signals

Table 4.2 This truth table describes the counter and the decoding logic. The current counter state $\left(c_{1}[n], c_{0}[n]\right)$ is mapped to the outputs $0^{\circ}, 90^{\circ}, 180^{\circ}$, and $270^{\circ}$ by a one-hot decoder. The two rightmost columns give the counter's next state

| $c_{1}[n]$ | $c_{0}[n]$ | $0^{\circ}$ | $90^{\circ}$ | $180^{\circ}$ | $270^{\circ}$ | $c_{1}[n+1]$ | $c_{0}[n+1]$ |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
| 1 | 0 | 0 | 0 | 1 | 0 | 1 | 1 |
| 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
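The behavior described by Table 4.2 can be reproduced with a small behavioral model in Python. The next-state equations used here ($c_{0}$ toggles every clock, $c_{1}$ is XORed with $c_{0}$) are an assumption that is consistent with Table 4.2 and with a counter built from two flip-flops, one inverter and an XOR gate; the function names are hypothetical, and pulse shrinkers are not modeled.

```python
def next_state(c1, c0):
    """2-bit synchronous counter: c0 toggles, c1 flips when c0 is 1."""
    return c1 ^ c0, c0 ^ 1


def one_hot(c1, c0):
    """Decode the state into the outputs (0, 90, 180, 270 degrees)."""
    index = 2 * c1 + c0
    return tuple(1 if i == index else 0 for i in range(4))


def run(cycles, state=(0, 0)):
    """Return the decoder outputs for a number of reference clock cycles."""
    outputs = []
    for _ in range(cycles):
        outputs.append(one_hot(*state))
        state = next_state(*state)
    return outputs


if __name__ == "__main__":
    for step, out in enumerate(run(8)):
        print(step, out)
```

Each output is active for exactly one reference clock period out of four, which corresponds to the unshaped pulse width of $1 / f_{\text {ref }}$.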



Fig. 4.7 Schematic of the state machine. The synchronous counter is basically constructed from two flip-flops, one inverter and an XOR gate. Outputs of the counter are $c_{0}$ and $c_{1}$, which are then decoded one-hot to derive the synchronization signals


Fig. 4.8 The reference clock is delayed by $\frac{\pi}{2}$ to generate a signal with a $90^{\circ}$ phase shift. Pulses are shaped by pulse shrinkers to adapt the synchronization signals to the required pulse width $T_{p w}$

Asynchronous Generation Asynchronous generation allows the use of a reference clock at the target frequency $f_{\text {sync }}$ instead of a four times faster clock signal. A delay element is used to generate the $90^{\circ}$ phase shift. Figure 4.8 shows how all synchronization signals are generated from a reference clock. If the reference signal with frequency $f_{r e f}$ is taken as the signal with $0^{\circ}$ phase shift, the $180^{\circ}$ signal is derived by using an inverter circuit. To generate $90^{\circ}$ and $270^{\circ}$, a delay element ($\frac{\pi}{2}$-element) is used to delay the incoming signal by $\frac{1}{4 f_{\text {ref }}}$, and an inverter derives a $270^{\circ}$ shifted output from the delayed signal. After that, the signals are also shaped using pulse shrinkers to fit the pulse width of the synchronization signals to the desired pulse width $T_{p w}$, and the inverted signals are generated.

Fig. 4.9 Delay $\Delta t$ of the $\frac{\pi}{2}$-element with respect to the nominal delay of 2.5 ns after 1000 Monte Carlo runs with statistical parameters of a 130 nm CMOS process

Simulation results in Fig. 4.9 show the impact of process variations on the delay $\Delta t$ generated by the $\frac{\pi}{2}$-element. A Monte Carlo simulation with 1000 runs is performed in a 130 nm CMOS process at a supply voltage of $V_{D D}=1.2 \mathrm{~V}$ and for a target frequency of 100 MHz. The results are referenced to the nominal delay of 2.5 ns. The majority of simulation runs lie within a $\pm 10 \%$ range around the nominal value. According to [10], PFAL and ECRL circuits exhibit an energy increase of less than $10 \%$ for a shift of the power-clock signals within a range of $\pm 10 \%$. Countermeasures, e.g. in the layout, will allow a further reduction of the impact of local process variations.
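The relation between the delay of the $\frac{\pi}{2}$-element and the resulting phase shift is simple arithmetic and can be sketched in Python. The function names are hypothetical; the 100 MHz reference frequency and the $\pm 10 \%$ delay range are taken from the text.

```python
def quarter_period_delay(f_ref):
    """Nominal delay of the pi/2 element: 1 / (4 * f_ref)."""
    return 1.0 / (4.0 * f_ref)


def phase_shift_deg(delay, f_ref):
    """Phase shift in degrees produced by a delay at frequency f_ref."""
    return 360.0 * delay * f_ref


if __name__ == "__main__":
    f_ref = 100e6
    nominal = quarter_period_delay(f_ref)
    print("nominal delay in s:", nominal)
    for factor in (0.9, 1.0, 1.1):
        print(factor, phase_shift_deg(factor * nominal, f_ref))
```

A delay deviation of $\pm 10 \%$ around the nominal 2.5 ns therefore corresponds to a phase error of about $\pm 9^{\circ}$ around the intended $90^{\circ}$.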

Comparison Results of Synchronous and Asynchronous Synchronization Pulse Generation Both methodologies offer advantages as well as disadvantages. For Adiabatic Logic the major focus lies on the energy consumption. As a deviation between the synchronization signals of up to $\pm 10^{\circ}$ can be accepted in PFAL and ECRL circuits without a major penalty in the energy consumption, the asynchronous approach is the preferred methodology due to its lower hardware effort and the lower frequency of the reference clock. The higher activity in the synchronous implementation due to the increased reference frequency leads to a noticeably higher energy consumption compared to the asynchronous generator. In Fig. 4.10 the transient energy consumption of both versions can be observed. The state machine and the delay element are simulated with 130 nm CMOS technology parameters, at $V_{D D}=1.2 \mathrm{~V}$ and for a target frequency of $f_{\text {sync }}=100 \mathrm{MHz}$.

The $\frac{\pi}{2}$-element based generation of the synchronization signals is superior to the state machine with respect to energy consumption. Ignoring the losses in the pulse shrinkers and in the inverters that generate the inverted control signals, the state machine approach consumes more than 5 times the energy. But as it is a synchronous approach and operates at a moderate frequency of 100 MHz, it is not as susceptible to variations. In the delay element, variations are translated into a deviation of the delay time, thus a direct impact is seen in the generated waveforms.

Fig. 4.10 Energy consumption $E_{\text {sync }}$ of state machine and $\frac{\pi}{2}$-element based synchronization signal generator


What also has to be kept in mind is that the energy consumption of the synchronization signal generator is constant for all system sizes. If a system is large, the losses due to the generation are negligible in comparison to the overall losses $\left(E_{s y n c} \ll E_{D i s s, A L}+E_{O s c}+E_{b u f}\right)$. Thus, if a large adiabatic circuit is supplied, the variation-tolerant state machine approach can be applied without noticeably affecting the overall consumption.

### 4.4.2 Partitions of the Energy Losses Within an Adiabatic System

The overall consumption of an adiabatic system is composed of the energy fractions $E_{D i s s, A L}$ in the logic, $E_{O s c}$ in the oscillator, $E_{S y n c}$ in the generator for the synchronization signals and $E_{b u f}$ due to the losses in driving the gates of the oscillator transistors. A system is simulated with industrial 130 nm CMOS technology parameters to determine the energy fractions. The system in Fig. 4.11 comprises a load of two PFAL INV gates per phase, each sized to an equivalent load of 2000 minimum-sized PFAL INV gates. The overall load per phase is thus 4000 PFAL INV gates. The oscillator is equipped with an inductance of $L=335 \mathrm{nH}$, which is the value for operation of the adiabatic circuit at 100 MHz with minimum energy dissipation. The quality factor of the coil is assumed to be $Q=100$. Though quality factors for integrated coils are nowadays still poor, a lot of effort is put into optimizing integrated inductors [85, 86], the invention of new topologies [87] and the integration of new materials [88-90]. A supply voltage $V_{D D}=1.2 \mathrm{~V}$ is used. Two driving stages are used to amplify the synchronization signals, each sized to four times the width of the corresponding minimum-size transistors in a CMOS INV with a fan-out of one. The $\frac{\pi}{2}$-element is used to generate the synchronization signals, as according to Table 4.3 it shows a lower energy consumption than the synchronous generation via a state machine.


Fig. 4.11 Each portion of energy dissipation is extracted via the simulation setup pictured here. Losses in the adiabatic gates are determined by measuring the energy drawn from the phases by the adiabatic gates

Table 4.3 Summarizing comparison of the two approaches to generate the synchronization signals. With respect to energy consumption, the $\frac{\pi}{2}$-element based approach is superior. Due to its synchronous nature, the state machine is more variation tolerant. The reference clock frequency for the state machine is four times that of the delay-element based circuit

|  | $f_{\text {ref }}$ | $E_{\text {Diss }}$ [J] | Comment |
| :--- | :--- | :--- | :--- |
| State machine | $4 \cdot f_{\text {sync }}$ | $3.75 \times 10^{-13}$ | variation tolerant |
| Delay element | $f_{\text {sync }}$ | $7.25 \times 10^{-14}$ | - |

The definitions of the efficiency values used to characterize the oscillator and the system are

$$
\begin{align*}
\eta_{O s c} & =\frac{E_{D i s s, A L}}{E_{D i s s, A L}+E_{O s c}},  \tag{4.10}\\
\eta_{\text {System }} & =\frac{E_{\text {Diss }, A L}}{E_{\text {Diss }, A L}+E_{O s c}+E_{s y n c}+E_{\text {buf }}} . \tag{4.11}
\end{align*}
$$

The conversion efficiency of the oscillator $\eta_{O s c}$ reveals how much of the energy delivered to the oscillator is consumed in the Adiabatic Logic gates. The system efficiency shows how much of the energy delivered to the whole system, containing the adiabatic gates, the oscillator, the generation of the synchronization signals and the drivers for the transistors in the oscillator, is dissipated in the adiabatic gates. A high efficiency on system level is desired, as most of the energy should be spent on the calculations performed in the adiabatic gates.
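Definitions (4.10) and (4.11) translate directly into code. In the following Python sketch the function names and the energy values are hypothetical and serve only to show how the two efficiencies relate to each other.

```python
def oscillator_efficiency(e_diss_al, e_osc):
    """eta_Osc = E_Diss,AL / (E_Diss,AL + E_Osc), Eq. (4.10)."""
    return e_diss_al / (e_diss_al + e_osc)


def system_efficiency(e_diss_al, e_osc, e_sync, e_buf):
    """eta_System = E_Diss,AL / (E_Diss,AL + E_Osc + E_sync + E_buf), Eq. (4.11)."""
    return e_diss_al / (e_diss_al + e_osc + e_sync + e_buf)


if __name__ == "__main__":
    # Hypothetical energy fractions per cycle in arbitrary units.
    e_diss_al, e_osc, e_sync, e_buf = 6.0, 4.0, 1.0, 4.0
    print("eta_Osc:", oscillator_efficiency(e_diss_al, e_osc))
    print("eta_System:", system_efficiency(e_diss_al, e_osc, e_sync, e_buf))
```

Since $E_{s y n c}$ and $E_{b u f}$ only appear in the denominator of (4.11), the system efficiency can never exceed the oscillator efficiency.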

Besides looking at the energy values it has to be ensured that the circuit is still operating with the generated waveforms. The oscillator is expected to deliver a peak-to-peak voltage of around 1.2 V. This is observed by measuring the maximum and minimum voltage in steady-state condition of the oscillator. Finally the number of cycles to reach a stable oscillation with at least $95 \%$ of the desired peak-to-peak voltage is extracted.

Fig. 4.12 Fractions of the energy consumption in an adiabatic system in steady-state in dependence of the width of the transistors in the oscillator. The efficiency of the oscillator decreases for small widths, but the overall energy is still decreased

Fig. 4.13 Fractions of the energy consumption in an adiabatic system during the start-up phase of the oscillator. If the width of the transistors in the oscillator is too small, the energy consumption of the oscillator will increase dramatically during the start-up

Simulation results are presented in Figs. 4.12-4.16. In Fig. 4.12 the fractions of energy per cycle consumed in steady-state within the different parts of the system are presented in relation to the size of the transistors in the oscillator (where $W_{p}=1.5 W_{n}$ and $W_{n}=$ width). Great reductions of the energy dissipation in the buffers are gained by reducing the width of the oscillator transistors. No other fraction is noticeably influenced by a reduced width of the driver transistors. Only the oscillator losses are increased for transistor widths less than $30 \mu \mathrm{~m}$. Even though the system efficiency decreases when the width is reduced to values less than $30 \mu \mathrm{~m}$ (see Fig. 4.14), the overall losses are further reduced. Therefore the minimum energy operation point in steady-state operation is found for the smallest transistor width in Fig. 4.12. All these observations are for steady-state operation of the oscillator.

The start-up behavior shows a different trend in Fig. 4.13. If the width is reduced to a value smaller than $40 \mu \mathrm{~m}$, the energy dissipation in the oscillator is strongly increased. For a system that offers a shut-down, this fact has to be considered, especially if the circuit is switched on and off frequently.

Fig. 4.14 Efficiency of the oscillator and of the overall system in steady-state. While the oscillator has the highest efficiency of around $62 \%$ at a width of $85 \mu \mathrm{~m}$, the overall efficiency is further increased when the width of the driver transistors is decreased. This is because $E_{b u f}$ is further reduced by decreasing the width of the oscillator's transistors

Fig. 4.15 The maximum and minimum value of the generated sinusoidal waveform $\phi_{0}$ in steady-state operation

Figure 4.14 shows the efficiency values for the oscillator and the system in steady-state operation. By reducing the width of the transistors, the efficiency values are increased up to a certain point. The best oscillator efficiency is gained for a value around $70 \mu \mathrm{~m}$; the system efficiency increases further with decreasing width. For the oscillator the maximum efficiency is more than 60 percent. For the overall system, the maximum efficiency is around 40 percent. Although the maximum overall efficiency is gained for the design where the major fraction of the overall consumption is due to losses in the adiabatic circuit, it can be seen in Fig. 4.12 that the overall consumption is still reduced for a width smaller than the width for the maximum system efficiency. Thus, designing for the highest system efficiency yields a sub-optimum energy consumption.

The peak-to-peak voltage is plotted in Fig. 4.15. For a small width of the oscillator transistors the peak-to-peak voltage will noticeably deviate from the nominal voltage of 1.2 V. This is due to overshoots in the LC tank that are not compensated by the transistors within the synchronization time. A larger peak-to-peak voltage results in increased adiabatic losses.

Fig. 4.16 Number of cycles until the peak-to-peak voltage is $95 \%$ of the steady-state peak-to-peak voltage. For a width smaller than $100 \mu \mathrm{~m}$ the peak-to-peak voltage takes more than one cycle to settle. For a width smaller than $40 \mu \mathrm{~m}$ a strongly increased settling time is observed

The number of cycles necessary to gain a stable oscillation of $95 \%$ of $V_{D D}$ is also an important design parameter. According to the results in Fig. 4.16 the oscillator will gain a stable oscillation within one cycle for a transistor width greater than $100 \mu \mathrm{~m}$. Between $100 \mu \mathrm{~m}$ and $40 \mu \mathrm{~m}$ the oscillator will reach the stable state within the first two cycles. If the width is reduced below $40 \mu \mathrm{~m}$, up to 5 cycles are necessary until the stable oscillation is reached. If a circuit is operated in on-state for a long time, the start-up cycles needed until stable operation is gained will not noticeably contribute to the overall consumption. But in operation modes where the oscillator is frequently turned on and off, the idle cycles until stable operation is achieved can lead to a penalty in the overall energy dissipation. Cycles with a peak-to-peak voltage smaller than a certain limit cannot be used for performing operations, as a stable output is not guaranteed. Nevertheless, energy is consumed during these idle cycles, and this energy adds to the overall consumption.

## Chapter 5: Power-Clock Gating

### 5.1 Introduction to Power-Clock Gating

Scaling CMOS technology leads to a performance increase and more functionality with the same chip size. The downside of scaling is an increased power density. Breaking down the power in a high-performance CPU shows that almost $50 \%$ of the power is consumed in the clock net [1]. Each new technology node will introduce increased leakage losses. Leakage losses are exponentially dependent on the threshold voltage (compare (2.9)). Standby losses will contribute more and more to the overall power consumption in high-performance digital circuits. In the following, approaches to cope with the high power density are presented.

Clock gating is used to decrease the capacitance of the clock net by disconnecting idle circuit parts from the clock network [1-5]. Benini [5] presents an algorithm to automatically and with fine granularity recognize wait states in FSMs and use clock gating to prevent switching losses. Up to $25 \%$ savings are reported with an area overhead of only $5 \%$. Pham et al. [4] present measurement results for a RISC microprocessor with clock gating. The Dynamic Power Management (DPM) decides on a cycle-by-cycle basis if blocks are needed. Silicon measurements show savings between $12 \%$ and $30 \%$. Also leakage losses are positively influenced by a low-power operation of the circuit. Leakage currents are strongly dependent on the temperature, thus, a reduced temperature increase will lead to lower leakage currents.

As long as dynamic losses dominate the overall losses, and circuits are in idle mode only for moderate times, leakage losses will not considerably contribute to the overall power consumption. As soon as the cumulative power consumption due to leakage losses in idle modes is no longer negligible, disconnecting the power supply voltage from the leaky circuit blocks by power gating [91-93] will help to cut down losses. A so-called sleep transistor is used, which has a higher threshold voltage and thus low leakage. Sizing of the sleep transistor is crucial to find the right trade-off between circuit speed, power reduction and area overhead [91, 94].

In Adiabatic Logic, the voltage supply and the clock signal are combined within the power-clock. Gating or switching off the power-clock allows the voltage supply and the clock signal to be powered down with only one gating mechanism. In static CMOS those two gating mechanisms are separated from each other, each with its own control circuitry and gating devices. Each leads to overhead due to power consumption in the control as well as area overhead due to separate devices.

Adiabatic gates used in this work are charged and discharged continuously, even if no data is processed. By disconnecting the gates, these losses are prevented. Leakage losses are reduced by inserting switches with low-leakage behavior (high-$V_{th}$ devices). Even if switches with a low threshold voltage are inserted, the inherent stacking will lead to decreased losses [95]. In [96] we proposed Power-Clock Gating (PCG) in Adiabatic Logic circuits for the first time. An additional path resistance is introduced by the gating transistor, thus sizing is important for a small impact on the energy dissipation. Different gating transistor topologies are presented. PCG is further modified in [97-99]. Here boot-strapping is used in order to achieve a higher overdrive voltage for the NMOS gating transistor.

If an LC tank based oscillator is used for the generation of the power-clock, shutting down fractions of the system will lead to deviations of the LC value. This is discussed in [100], where also the proposal to use a synchronized oscillator with a shut-down mode is presented for the first time. In [101] an analysis of the robustness of the 2N2P LC oscillator under the impact of fluctuations introduced by the processed data is presented.

Below, two methodologies to implement PCG are investigated. The first, treated in Sect. 5.3.1, uses a power-down transistor that cuts off the circuit from the power-clock [96]. The second gates the power-clock generator [100, 101], i.e. the 2N2P oscillator presented in Sect. 4.

### 5.2 The Theory of Power-Clock Gating

In the PCG scheme in Fig. 5.1 the increased on-state energy consumption $E_{\text {on }}$ that appears if Power-Clock Gating is introduced is illustrated. This is due to the raised on-resistance in the charging path if a power-down transistor is inserted to shut down the circuit. Additional circuitry is needed to control the power-down mode. This also adds to the energy dissipation in the on-state. Shut-down and power-on introduce some amount of energy $E_{S O H}$ due to dissipating energy at shut-down or by supplying energy at power-on. For an overall time of $T_{g e s}$ the system is in on-state for $T_{o n}$ and in off-state for $T_{o f f}$. During the off-state only a small amount of energy $E_{\text {off }}$ is consumed. The mean energy consumption $\overline{E_{P C G}}$ is calculated as

$$
\begin{equation*}
\overline{E_{P C G}}=\frac{1}{T_{g e s}} \int_{0}^{T_{g e s}} E(t) d t . \tag{5.1}
\end{equation*}
$$

Fig. 5.1 Scheme of a Power-Clock Gating event. The area bounded by $E_{0}$ and the thick line indicates the overhead due to raised energy consumption in on-state, and the switching overhead. To gain from power-clock gating, accumulated energy overhead in the on-state and due to switching has to be compensated in the off-state

As the energies $E_{0}, E_{o f f}$, and $E_{S O H}$ are assumed to be constant, the integral can be simplified into a sum of energies multiplied with the corresponding time interval. The switching overhead is assumed to be much higher than the energy dissipation in off-state $\left(E_{S O H} \gg E_{o f f}\right)$ and is assumed to take place within one clock period $T_{\phi}$.

$$
\begin{equation*}
\overline{E_{P C G}}=\frac{1}{T_{g e s}}\left(T_{o n} E_{o n}+T_{o f f} E_{o f f}+T_{\phi} E_{S O H}\right) \tag{5.2}
\end{equation*}
$$

Power-Clock Gating only pays off if the mean energy $\overline{E_{P C G}}$ is less than $E_{0}$, the energy of a system implemented without PCG. Otherwise, applying PCG wastes more energy than it saves. The following criterion has to be fulfilled to beneficially integrate PCG:

$$
\begin{equation*}
\left(E_{o n}-E_{0}\right) T_{o n}+\left(E_{S O H}+E_{\text {off }}-E_{0}\right) T_{\phi}<\left(E_{0}-E_{o f f}\right)\left(T_{o f f}-T_{\phi}\right) . \tag{5.3}
\end{equation*}
$$
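As a rough numerical illustration of (5.2) and (5.3), the following Python sketch evaluates the mean energy and the break-even criterion. It assumes $T_{ges}=T_{on}+T_{off}$, which is the reading under which (5.3) follows from (5.2). The function names and the normalized numbers in the example are hypothetical and are not taken from the results of this work.

```python
def mean_energy_pcg(t_on, t_off, t_clk, e_on, e_off, e_soh):
    """Mean energy with PCG following (5.2), assuming T_ges = T_on + T_off."""
    t_ges = t_on + t_off
    return (t_on * e_on + t_off * e_off + t_clk * e_soh) / t_ges


def pcg_is_beneficial(t_on, t_off, t_clk, e_0, e_on, e_off, e_soh):
    """Break-even criterion (5.3): True if the overhead is outweighed by the savings."""
    overhead = (e_on - e_0) * t_on + (e_soh + e_off - e_0) * t_clk
    savings = (e_0 - e_off) * (t_off - t_clk)
    return overhead < savings


if __name__ == "__main__":
    # Hypothetical, normalized values (E_0 = 1, T_phi = 1); not measured data.
    common = dict(t_on=100.0, t_clk=1.0, e_on=1.1, e_off=0.01, e_soh=20.0)
    for t_off in (10.0, 100.0):
        mean = mean_energy_pcg(t_off=t_off, **common)
        ok = pcg_is_beneficial(t_off=t_off, e_0=1.0, **common)
        print(f"T_off={t_off:6.1f}  mean energy={mean:.3f}  beneficial={ok}")
```

With these illustrative values a short off-time leaves the mean energy above $E_{0}$, whereas a sufficiently long off-time brings it below $E_{0}$, in agreement with the criterion.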

It is assumed that $E_{S O H} \gg E_{\text {off }}$ and also $E_{0} \gg E_{\text {off }}$. Inequality (5.3) thus simplifies to

$$
\begin{equation*}
\left(E_{\text {on }}-E_{0}\right) T_{o n}+\left(E_{S O H}-E_{0}\right) T_{\phi}<E_{0}\left(T_{o f f}-T_{\phi}\right) \tag{5.4}
\end{equation*}
$$

Now the off-time $T_{\text {off }}$ that fulfills this equation is extracted:

$$
\begin{equation*}
T_{o f f}>\frac{E_{\text {on }}-E_{0}}{E_{0}} T_{o n}+\frac{E_{S O H}}{E_{0}} T_{\phi} . \tag{5.5}
\end{equation*}
$$

Equation (5.5) is the off-time requirement for the applicability of PCG. The Minimum Power-Down Time $T_{M P D}$ is introduced as a border condition for PCG. Only if the off-time $T_{o f f}$ is at least $T_{M P D}$ a benefit by the introduction of PCG is gained:

$$
\begin{equation*}
T_{M P D}=E_{O H, r e l} T_{o n}+T_{\phi} \frac{E_{S O H}}{E_{0}} \tag{5.6}
\end{equation*}
$$

The relative overhead $E_{O H, r e l}$ is defined as

$$
\begin{equation*}
E_{O H, r e l}=\frac{E_{\text {on }}}{E_{0}}-1 . \tag{5.7}
\end{equation*}
$$

The second term in (5.6) is a constant value, whereas $E_{O H, r e l} T_{\text {on }}$ is dependent on the time the circuit is in on-state. Each cycle a certain amount $\left(E_{o n}-E_{0}\right) T_{\phi}$ is additionally wasted due to PCG, so a higher accumulated overhead has to be compensated for the longer the circuit is in on-state; hence $T_{M P D}$ is dependent on the on-time $T_{o n}$.
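The Minimum Power-Down Time of (5.6) with the relative overhead of (5.7) can be evaluated in a few lines. The sketch below is only an illustration; the function names and the normalized example values are hypothetical.

```python
def relative_overhead(e_on, e_0):
    """Relative on-state overhead E_OH,rel according to (5.7)."""
    return e_on / e_0 - 1.0


def minimum_power_down_time(t_on, t_clk, e_0, e_on, e_soh):
    """Minimum Power-Down Time T_MPD according to (5.6)."""
    return relative_overhead(e_on, e_0) * t_on + t_clk * e_soh / e_0


if __name__ == "__main__":
    # Hypothetical, normalized values: E_0 = 1 and T_phi = 1 (times in clock cycles).
    for t_on in (10.0, 100.0, 1000.0):
        t_mpd = minimum_power_down_time(t_on, t_clk=1.0, e_0=1.0, e_on=1.1, e_soh=20.0)
        print(f"T_on={t_on:7.1f} cycles -> T_MPD={t_mpd:7.1f} cycles")
```

The constant term $T_{\phi} E_{SOH}/E_{0}$ sets a floor for $T_{MPD}$, while the term proportional to $T_{on}$ grows with the time spent in on-state.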

In a system that is already equipped with Power-Clock Gating, or if it does not suffer from a noticeable overhead in on-state ( $E_{o n} \approx E_{0}$ ) due to PCG, the equation for $T_{M P D}$ is modified to

$$
\begin{equation*}
T_{M P D}\left(E_{o n} \approx E_{0}\right)=T_{\phi} \frac{E_{S O H}}{E_{0}} \tag{5.8}
\end{equation*}
$$

Thus $T_{M P D}$ only has to be long enough to compensate for the switching overhead $E_{S O H}$.

### 5.3 Gating Topologies for PCG

### 5.3.1 Cut-off with Power-down Transistors

The straightforward implementation of PCG is to add switches in the power-clock line and cut off the power-clock from the Adiabatic Logic circuit when in idle state. A scheme is presented in Fig. 5.2, where NMOS devices are gated by the $\overline{\mathrm{PCG}}$ signal. If PCG is activated, the phases $\phi_{i}$ are disconnected from the system. This is comparable to the cut-off with a sleep transistor in static CMOS [91], which powers down the system into a sleep mode. As mentioned in the introduction of this chapter, adiabatic losses are prevented when $\overline{\text { PCG }}$ is active. Carefully selecting the switch is of great importance. Using high-$V_{t h}$ devices will allow for large savings in the sleep mode, as those devices also have a low leakage behavior. But in the on-state, they present a higher on-resistance compared to a low-$V_{t h}$ device with the same area and gate voltage. If a low-$V_{t h}$ device is used, the overhead in the on-state $\left(E_{\text {on }}-E_{0}\right)$ is decreased. The stack effect $[95,102]$ will also allow for the reduction of leakage currents even for low-$V_{t h}$ devices.

In all equations derived in this section it is assumed, that each node's RC constant is small enough to allow the node potential to follow the power-clock signal instantly. Thus, all node currents are derived by the equation $i_{x}=\frac{d \phi_{i}}{d t} C_{x}$, where $i=0,1,2,3$ selects the according power-clock and $C_{x}$ is the capacitance connected to branch $x$.

The equivalent circuit of one power-clock phase of the adiabatic circuit, including the power line, is sketched in Fig. 5.3. The adiabatic circuit's equivalent consists of a charging resistance $R_{A L}$ and the capacitance $C_{A L}$; the line equivalent is a lumped model with resistor $R_{L}$ and capacitor $C_{L}$.

The energy dissipated per cycle $E_{0}$ is determined by

$$
\begin{equation*}
E_{0}=8 \frac{V_{D D}^{2}}{T}\left(R_{A L} C_{A L}^{2}+R_{L}\left(C_{A L}+C_{L}\right)^{2}\right) \tag{5.9}
\end{equation*}
$$

Losses given in (2.7) are extended by a term for the losses in the line resistance. Now a switch is inserted for power down. To find the right place for insertion of the switch, three positions are considered. The switch can be inserted close to the Adiabatic Logic, i.e. at the end of the power-clock line. It can alternatively be inserted at the beginning of the line. As the general case, the insertion of the switch at a certain fraction of the line is considered.

Fig. 5.2 A $\overline{\text { PCG }}$ signal is used to cut off the Adiabatic Logic system from the power-clock signals

Fig. 5.3 Equivalent circuit for an adiabatic system and the power line for the power-clock signal for one power-clock phase

Fig. 5.4 A switch is introduced at the interface of adiabatic circuit and power-clock line

### 5.3.1.1 Switch Inserted at the End of the Line

If an equivalent circuit of the switch is introduced at the interface of the power line and the adiabatic circuit, an additional resistance $R_{S}$ and a switch capacitance $C_{S}$ are inserted into the equivalent circuit. It is assumed, that half of the switch capacitance is seen at the input of the switch, and the second half is located at the output. The equivalent circuit in Fig. 5.4 is valid for the switch turned-on.

This leads to a higher energy consumption compared to the circuit without a switch:

$$
\begin{equation*}
E_{\text {on }}=8 \frac{V_{D D}^{2}}{T}(\underbrace{R_{A L} C_{A L}^{2}}_{A L}+\underbrace{R_{S}\left(\frac{C_{S}}{2}+C_{A L}\right)^{2}}_{\text {switch }}+\underbrace{R_{L}\left(C_{A L}+C_{S}+C_{L}\right)^{2}}_{\text {line }}) . \tag{5.10}
\end{equation*}
$$

The first term is not affected by the switch, as $R_{A L}$ only has to conduct the current to charge $C_{A L}$. The charging current for $C_{A L}$ and for the half of the switch capacitance at the output of the switch has to cross the switch resistance. The line resistance now also has to carry the current to load the capacitance of the switch. If the switch is turned off, the adiabatic load is disconnected from the power-clock. Therefore only the line capacitance and the off-state switch capacitance on the input side $\frac{C_{S, \text { off }}}{2}$ have to be charged via the line resistance. The losses are reduced to

$$
\begin{equation*}
E_{o f f}=8 \frac{V_{D D}^{2}}{T}\left(R_{L}\left(\frac{C_{S, o f f}}{2}+C_{L}\right)^{2}\right) \tag{5.11}
\end{equation*}
$$

Leakage currents are neglected in this consideration, as the losses in the line will dominate over the leakage losses. If on- and off-state are compared, the Energy Reduction Factor (ERF) can be derived, that is a measure of how efficiently PCG can reduce the energy consumption:

$$
\begin{equation*}
E R F=\frac{E_{\text {on }}}{E_{\text {off }}}=\frac{R_{A L} C_{A L}^{2}+R_{S}\left(\frac{C_{S}}{2}+C_{A L}\right)^{2}+R_{L}\left(C_{A L}+C_{S}+C_{L}\right)^{2}}{R_{L}\left(\frac{C_{S, o f f}}{2}+C_{L}\right)^{2}} \tag{5.12}
\end{equation*}
$$

Reducing the denominator of (5.12) will lead to a better ratio between on- and off-state. To do so, the switch should be placed at the beginning of the line, as then the line capacitance $C_{L}$ is also disconnected in the off-state.
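The lumped model of (5.9) to (5.12) is easy to evaluate numerically. The following Python sketch is one possible transcription; the function names and all component values are hypothetical placeholders chosen only to exercise the formulas (the 1.2 V supply and the 10 ns period corresponding to 100 MHz are the operating point used later in this section).

```python
def e_without_switch(v_dd, t, r_al, c_al, r_l, c_l):
    """Energy per cycle without a switch, (5.9)."""
    return 8 * v_dd**2 / t * (r_al * c_al**2 + r_l * (c_al + c_l) ** 2)


def e_on_end_of_line(v_dd, t, r_al, c_al, r_l, c_l, r_s, c_s):
    """On-state energy with the switch at the end of the line, (5.10)."""
    al = r_al * c_al**2
    switch = r_s * (c_s / 2 + c_al) ** 2
    line = r_l * (c_al + c_s + c_l) ** 2
    return 8 * v_dd**2 / t * (al + switch + line)


def e_off_end_of_line(v_dd, t, r_l, c_l, c_s_off):
    """Off-state energy with the switch at the end of the line, (5.11)."""
    return 8 * v_dd**2 / t * r_l * (c_s_off / 2 + c_l) ** 2


def erf_end_of_line(v_dd, t, r_al, c_al, r_l, c_l, r_s, c_s, c_s_off):
    """Energy Reduction Factor E_on / E_off, (5.12)."""
    e_on = e_on_end_of_line(v_dd, t, r_al, c_al, r_l, c_l, r_s, c_s)
    return e_on / e_off_end_of_line(v_dd, t, r_l, c_l, c_s_off)


if __name__ == "__main__":
    # Hypothetical component values, for illustration only.
    p = dict(v_dd=1.2, t=10e-9, r_al=50.0, c_al=20e-12, r_l=5.0, c_l=0.2e-12)
    e_0 = e_without_switch(**p)
    e_on = e_on_end_of_line(r_s=10.0, c_s=1e-12, **p)
    print(f"E_0 = {e_0:.3e} J, E_on = {e_on:.3e} J")
    print(f"ERF = {erf_end_of_line(r_s=10.0, c_s=1e-12, c_s_off=0.5e-12, **p):.1f}")
```

Setting `r_s` and `c_s` to zero reproduces (5.9), which is a quick sanity check that the switch terms are the only additions in (5.10).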

### 5.3.1.2 Switch at the Beginning of the Line

If the switch is inserted prior to the power line, the equivalent circuit looks as pictured in Fig. 5.5. The on-state energy consumption for this configuration is given in (5.13). If the switch is in off-state, only gate-drain and drain-bulk capacitance are connected to the power-clock generator. In case the generator has an intrinsic resistance of zero, the off-state energy consumption (5.14) is also zero.

$$
\begin{align*}
E_{\text {on }} & =8 \frac{V_{D D}^{2}}{T}(\underbrace{R_{A L} C_{A L}^{2}}_{A L}+\underbrace{R_{S}\left(\frac{C_{S}}{2}+C_{A L}+C_{L}\right)^{2}}_{\text {switch }}+\underbrace{R_{L}\left(C_{L}+C_{A L}\right)^{2}}_{\text {line }}) \tag{5.13}\\
E_{\text {off }} & =0 \tag{5.14}
\end{align*}
$$

Fig. 5.5 Equivalent circuit when switch is introduced prior to the power-clock line

Fig. 5.6 $E_{o n}$ for the switches prior or post to the power-clock line with respect to the switch width


As the energy in off-state is zero, the $E R F$ cannot be defined for the case where the switch is at the beginning of the line. In reality, there will be a small amount of energy consumed due to resistances connecting the transistor to the power supply. In general the $E R F$ will be increased significantly compared to the positioning of the switch at the end of the line, as long as the line is not very short.

In Fig. 5.6 the on-state energy dissipation per cycle $E_{\text {on }}$ is sketched for the switch at the beginning (prior line) and at the end (post line) of the power-clock line. The impact of the switch width can be observed. For small widths of the switch it can be seen that the insertion of the switch at the end of the line can be advantageous for the configuration shown here (equivalent number of gates of 10 k gates and line length of 1 mm). The terms in (5.10) and (5.13) are marked with $A L$, switch and line. The $A L$ term is dissipated in the Adiabatic Logic, the switch term is dissipated in the switch resistance and the line term is dissipated in the resistance of the power-clock line. Both equations share the same $A L$ term, but have differing switch and line terms. A wider switch affects both the switch and the line term in (5.10) (post line). The switch term depends on the width, as the resistance value decreases in inverse proportion to $W$ while the capacitive value $C_{S}$ increases linearly with $W$. Due to the quadratic impact of the capacitance, the switch term will increasingly contribute to the overall losses with $W$. The line term is quadratically increased with the width of the switch. In (5.13) only the switch term is impacted by a raised width. In Fig. 5.6 the "post line" configuration shows a large increase in energy dissipation for large widths of the switch due to the increased $C_{S}$ that is loaded via the line resistance.

Fig. 5.7 Equivalent circuit when the switch is introduced at a certain position $\zeta$ in the line

### 5.3.1.3 Switch at a Certain Fraction of the Line

Also a general case can be imagined, where the switch is positioned at a certain point $L \zeta$, where $L$ is the line length and $\zeta$ is the position of the switch with $0 \leq \zeta \leq 1$. The equivalent circuit for such a configuration with a switch in the on-state is sketched in Fig. 5.7.

The energy dissipation in the on-state is then given by (5.15).

$$
\begin{align*}
E_{\text {on }}= & 8 \frac{V_{D D}^{2}}{T}\left(R_{A L} C_{A L}^{2}+R_{L}(1-\zeta)\left(C_{L}(1-\zeta)+C_{A L}\right)^{2}\right. \\
& \left.+R_{S}\left(\frac{C_{S}}{2}+C_{L}(1-\zeta)+C_{A L}\right)^{2}+R_{L} \zeta\left(C_{L}+C_{S}+C_{A L}\right)^{2}\right) \tag{5.15}
\end{align*}
$$

The switch position and the optimum width are determined graphically in Fig. 5.8 for line lengths of 1 mm and 5 mm, when the power-clock line is routed on the top metal layer of a four-metal process. The load $R_{A L}$ and $C_{A L}$ is assumed to be concentrated at the end of the line with an equivalent of ten thousand gates, the supply voltage is 1.2 V and the operating frequency is 100 MHz. For short lines, the impact of the position is negligible; little or no effect can be seen by placing the switch at a certain position $L \zeta$. This is due to the fact that the overall dissipation is mainly due to the adiabatic load. Energy dissipation is reduced if the switch size is increased, as the path resistance of the switch is reduced. But the capacitive load is also influenced: a larger switch leads to a larger $C_{S}$, and therefore beyond a certain width the energy dissipation rises again if $W$ is further increased. In case of the short line of 1 mm it can be seen that the switch is best inserted at the beginning of the line in order to decrease the energy dissipation. If the line length is increased from 1 mm to 5 mm, an optimum insertion point can be found that is different from the result for the short line, due to the increasing impact of the line resistance $R_{L}$ and capacitance $C_{L}$. By moving the switch to a certain position $\zeta$, the optimum sizing of the switch is altered. If the switch is moved to the end of the line, a relatively small switch delivers the minimum overall energy dissipation, leading to a minimization of the term $R_{L} \zeta\left(C_{L}+C_{S}+C_{A L}\right)^{2}$ in (5.15), whereas for the opposite configuration at the beginning of the line, a large switch leads to the minimum energy configuration as then the term $R_{S}\left(\frac{C_{S}}{2}+C_{L}(1-\zeta)+C_{A L}\right)^{2}$ is minimized (under the assumption that $C_{L}$ and $C_{A L}$ dominate over $C_{S}$). Nevertheless at $\zeta=0$, the switch's width impacts the energy noticeably only in the regime of very small transistor widths. This is due to the dominant reduction of energy dissipation via the on-resistance.

Fig. 5.8 The switch is inserted at the position $L \zeta$ in the line, the switch width $W$ is varied. A certain point can be found, where the energy overhead is minimum. The plots show the relative energy dissipation $E_{\text {on, rel }}$ for two different line lengths, i.e. 1 mm (top) and 5 mm (bottom)

The off-state consumption is also affected by the switch positioning, as more or less capacitive load is connected to the oscillator in off-state. Also the path resistance $R_{L} \zeta$ in off-state is modulated by the switch position $\zeta$.

$$
\begin{equation*}
E_{o f f}=8 \frac{V_{D D}^{2}}{T}\left(R_{L} \zeta\left(C_{L} \zeta+\frac{C_{S}}{2}\right)^{2}\right) \tag{5.16}
\end{equation*}
$$

Thus the switch's position will influence the off-state energy consumption and also the $E_{\text {on }} / E_{\text {off }}$ ratio. Decreased off-state energy consumption according to (5.16) is gained when the switch is moved to the beginning of the line.
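The position-dependent model of (5.15) and (5.16) can be checked against the two limiting cases: $\zeta=1$ should reproduce (5.10) and $\zeta=0$ should reproduce (5.13). The Python sketch below does this and scans $\zeta$ for the lowest on-state energy. The class name and all component values are hypothetical and serve only as an illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LumpedModel:
    """Lumped parameters of load, line and switch (hypothetical container)."""
    v_dd: float
    t: float
    r_al: float
    c_al: float
    r_l: float
    c_l: float
    r_s: float
    c_s: float

    def e_on(self, zeta):
        """On-state energy with the switch at position zeta (0..1), (5.15)."""
        al = self.r_al * self.c_al**2
        line_after = self.r_l * (1 - zeta) * (self.c_l * (1 - zeta) + self.c_al) ** 2
        switch = self.r_s * (self.c_s / 2 + self.c_l * (1 - zeta) + self.c_al) ** 2
        line_before = self.r_l * zeta * (self.c_l + self.c_s + self.c_al) ** 2
        return 8 * self.v_dd**2 / self.t * (al + line_after + switch + line_before)

    def e_off(self, zeta):
        """Off-state energy with the switch at position zeta, (5.16)."""
        return 8 * self.v_dd**2 / self.t * self.r_l * zeta * (self.c_l * zeta + self.c_s / 2) ** 2


if __name__ == "__main__":
    # Hypothetical values; not extracted from the technology used in this work.
    m = LumpedModel(v_dd=1.2, t=10e-9, r_al=50.0, c_al=20e-12,
                    r_l=20.0, c_l=1e-12, r_s=10.0, c_s=1e-12)
    zetas = [k / 20 for k in range(21)]
    best = min(zetas, key=m.e_on)
    print(f"lowest E_on at zeta = {best:.2f}: {m.e_on(best):.3e} J")
    print(f"E_off at zeta = 0: {m.e_off(0.0):.3e} J, at zeta = 1: {m.e_off(1.0):.3e} J")
```

Because (5.16) grows monotonically with $\zeta$, the off-state energy always favors a switch near the beginning of the line, while the on-state optimum depends on the relative size of the line and switch parasitics.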

Fig. 5.9 An adiabatic system is split into $M$ fractions distributed along a line with length $l$. The equivalent circuit for fraction $i$ is sketched. At each resistor $R_{L, i}, R_{S, i}$, and $R_{A L, i}$ a certain fraction of the overall energy is dissipated


### 5.3.1.4 Switch Dimensioning for a Distributed Load Along the Power-Clock Line

In large systems the load will not be connected at one point of the power line, but will be distributed along the line. In the following investigations it is assumed that the load is split into $M$ equal fractions and is distributed along the power line in equidistant points $l / M$, where $l$ is the length of the whole power line. Then in the fraction $i$ of the line the following energy is dissipated in the different resistors inside the basic cell shown in Fig. 5.9.

$$
\begin{align*}
E_{A L, i} & =8 \frac{R_{A L, i} C_{A L, i}^{2}}{T} V_{D D}^{2},  \tag{5.17}\\
E_{S, i} & =8 \frac{R_{S, i}\left(\frac{C_{S, i}}{2}+C_{A L, i}\right)^{2}}{T} V_{D D}^{2},  \tag{5.18}\\
E_{L, i} & =8 \frac{R_{L, i}\left(C_{L, i}+C_{S, i}+C_{A L, i}+C_{o u t, i}\right)^{2}}{T} V_{D D}^{2},  \tag{5.19}\\
C_{A L, i} & =\frac{1}{M} C_{A L}  \tag{5.20}\\
R_{A L, i} & =M R_{A L}  \tag{5.21}\\
R_{L, i} & =\frac{R_{L}}{M}=\frac{R_{L}\left(W_{L, \min }\right)}{M} \frac{W_{L, \min }}{W_{L}}, \tag{5.22}
\end{align*}
$$

$$
\begin{align*}
C_{L, i} & =\frac{C_{L}}{M}=\frac{C_{L}\left(W_{L, \min }\right)}{M} \frac{W_{L}}{W_{L, \min }}  \tag{5.23}\\
R_{S, i} & =R_{S}\left(W_{S, \min }\right) \frac{W_{S, \min }}{W_{S, i}}  \tag{5.24}\\
C_{S, i} & =C_{S}\left(W_{S, \min }\right) \frac{W_{S, i}}{W_{S, \min }} \tag{5.25}
\end{align*}
$$

$R_{A L}$ and $C_{A L}$ are the equivalent resistance and capacitance of the overall adiabatic system and $R_{L}$ and $C_{L}$ are the overall values for the line. The line can be adapted to a certain width $W_{L}$. Each stage consumes a fraction $E_{A L, i}$ within the Adiabatic Logic. This energy portion is constant for all $i$, as equal parts of the system are assumed to be connected here. Losses in the line resistance $E_{L, i}$ and the switch resistance $E_{S, i}$ are dependent on $i$, as the switches can be dimensioned differently $\left(\mathbf{W}_{S}=\left\{W_{S, 1}, W_{S, 2}, \ldots, W_{S, M}\right\}\right)$ along the line, and the accumulated output capacitance $C_{\text {out }, i}$ at the output of each fraction is dependent on the position $i$. The lower the index $i$, the higher the output capacitance $C_{\text {out }, i}$. It is calculated via

$$
\begin{equation*}
C_{o u t, i}=\sum_{j=i+1}^{M}\left(C_{A L, j}+C_{S, j}+C_{L, j}\right) \tag{5.26}
\end{equation*}
$$

The overall consumption is calculated by

$$
\begin{equation*}
E=\sum_{i=1}^{M}\left(E_{A L, i}+E_{S, i}+E_{L, i}\right) \tag{5.27}
\end{equation*}
$$

The optimum parameters for the switches $\mathbf{W}_{S}$, the line width $W_{L}$, and also the optimum granularity $M$ can be found by evaluation of (5.27). All three parameters are restricted by a certain limit. Widths $\mathbf{W}_{S}$ can be increased on the cost of area overhead. The power-line is stacked on top of the circuit, thus it can be increased to a certain degree without affecting the overall area. And a circuit can only be split into a finite number of fractions. Within these limits, a set of parameters can be found that minimizes the power consumption.

Results of a MATLAB simulation are presented in Fig. 5.10. The width of the switches is altered, and the resulting minimum energy dissipation $E_{\min }$ is plotted over $C_{S} / C_{A L}$, a measure for the area overhead by the switch. A system equivalent of ten thousand ECRL inverter gates is assumed as adiabatic load. This load is split into $M$ parts and distributed along a $100 \mu \mathrm{~m}$ long line. In the legend of the plot two parameters are given. The first is the line width $W_{L}$ and the numbers in square brackets hold the information of the switch sizing. The first number is the relative size of the switch in fraction 1, and the second number is the relative size of the switch in fraction $M$. Switches for fractions 2 to $M-1$ are sized according to a linear interpolation between $W_{S, 1}$ and $W_{S, M}$. Each constellation is simulated for three different granularities. Numbers in the plot show whether $M=5$ (1), $M=15$ (2) or $M=25$ (3).
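The distributed model of (5.17) to (5.27) can be transcribed compactly. The Python sketch below is not the MATLAB code behind Fig. 5.10; it is a minimal illustration in which the function names, the argument layout (widths given relative to their minimum values) and all numerical values are assumptions. The output capacitance of fraction $i$ is taken as the sum over the downstream fractions $j>i$, which is how (5.26) is read here.

```python
def linear_widths(first, last, m):
    """Relative switch widths, linearly interpolated from fraction 1 to fraction M."""
    if m == 1:
        return [float(first)]
    return [first + (last - first) * k / (m - 1) for k in range(m)]


def distributed_energy(m, v_dd, t, r_al, c_al, r_l_min, c_l_min, wl_ratio,
                       r_s_min, c_s_min, ws_ratios):
    """Overall energy per cycle (5.27) for M equal load fractions along the line.

    wl_ratio is W_L / W_L,min; ws_ratios[i] is W_S,i / W_S,min.
    """
    if len(ws_ratios) != m:
        raise ValueError("one relative switch width per fraction is required")
    c_al_i = c_al / m                         # (5.20)
    r_al_i = m * r_al                         # (5.21)
    r_l_i = r_l_min / m / wl_ratio            # (5.22)
    c_l_i = c_l_min / m * wl_ratio            # (5.23)
    r_s = [r_s_min / w for w in ws_ratios]    # (5.24)
    c_s = [c_s_min * w for w in ws_ratios]    # (5.25)
    pre = 8 * v_dd**2 / t
    total = 0.0
    for i in range(m):
        # (5.26): capacitance of all fractions downstream of fraction i
        c_out = sum(c_al_i + c_s[j] + c_l_i for j in range(i + 1, m))
        e_al = pre * r_al_i * c_al_i**2                              # (5.17)
        e_s = pre * r_s[i] * (c_s[i] / 2 + c_al_i) ** 2              # (5.18)
        e_l = pre * r_l_i * (c_l_i + c_s[i] + c_al_i + c_out) ** 2   # (5.19)
        total += e_al + e_s + e_l
    return total


if __name__ == "__main__":
    # Hypothetical parameter values, for illustration only.
    base = dict(v_dd=1.2, t=10e-9, r_al=50.0, c_al=20e-12,
                r_l_min=20.0, c_l_min=0.1e-12, r_s_min=500.0, c_s_min=0.02e-12)
    for m in (5, 15, 25):
        widths = linear_widths(2.0, 1.0, m)   # the "[2 1]" sizing from the legend
        e = distributed_energy(m, wl_ratio=10.0, ws_ratios=widths, **base)
        print(f"M={m:2d}: E={e:.3e} J")
```

For $M=1$ the expression collapses to the single-switch result (5.10), which gives a simple consistency check on the transcription.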


Fig. 5.10 Minimum energy consumption $E_{\min }$ versus $C_{S} / C_{A L}$ for a system with distributed load. The system is connected via a line of $100 \mu \mathrm{~m}$ length, and consists of ten thousand ECRL inverter equivalents. Results for different constellations for the line width multiplier $W_{L} / W_{L, \min }$, granularity $M$ and switch sizings are plotted

In the upper left region the data points for $W_{L} / W_{L, \min }=1$ are found. As here the losses in the line resistance dominate the overall losses, switches are sized small for $E_{\min }$. Switches introduce a small area overhead $C_{S} / C_{A L}$ at $E_{\min }$. Sizing the switches [2 1] or [4 1] has either no large impact or will even introduce a larger $E_{\min }$. The granularity of the circuit will have an impact on the overall losses. Increasing $M$ from 5 to 15 will decrease the energy consumption in the case of $W_{L} / W_{L, \min }=1$, [1 1] by about $8 \%$ at the cost of area: $C_{S} / C_{A L}$ is increased by $8 \%$. As an adiabatic circuit with dominating losses in the power-clock net indicates a bad design concerning the power-clock lines, the width of the lines has to be increased in such cases.

If the line width is scaled to larger values of $W_{L} / W_{L, \min }$, the line resistance will drop such that losses in the switch and in the adiabatic circuit dominate the overall losses. In such cases, the dissipation in each block has to be minimized. As there are $M$ equal blocks, each switch is then sized with the same width. Looking at the point where $W_{L} / W_{L, \min }=10$ and where the switches are equally sized ([1 1]), increasing the granularity $M$ only allows for small reductions of $E_{\min }$ at the cost of a greatly increased area overhead $C_{S} / C_{A L}$. An alternative sizing of the switches of [2 1] will increase the area consumption, but will barely decrease $E_{\min }$.

In Fig. 5.11 the energy consumption for $W_{L} / W_{L, \min }=1$, 5 and 10 is plotted over $C_{S} / C_{A L}$. The minimum energy dissipation points are labeled. What can be seen here is the possibility of trading area against energy. In case of $W_{L} / W_{L, \min }=5$ and $W_{L} / W_{L, \min }=10$ the energy consumption for the same area overhead as $E_{\min }\left(W_{L} / W_{L, \min }=1\right)$ is labeled as $E_{\text {redArea }}\left(W_{L} / W_{L, \min }\right)$. Area overhead can be greatly reduced without affecting the energy consumption too much. So for $W_{L} / W_{L, \min }=5$, area is reduced by more than $50 \%$ with an energy penalty of less than $6 \%$. In the case of $W_{L} / W_{L, \min }=10$, area is reduced by almost $65 \%$, but with a slightly higher energy penalty of $10 \%$. Even for the reduced area energy point, the widest used width $W_{L} / W_{L, \min }=10$ of the power-line leads to the lowest overall energy consumption.

Fig. 5.11 Energy consumption for three line widths over the relative switch size and the minimum energy and area reduced configurations

From this investigation it is concluded that increasing the width of the power line $W_{L}$ is fundamental to gain low energy consumption. Separating the circuit into $M$ blocks has a small impact on the energy consumption, but will increase the area overhead. If the losses in the power-line resistances are negligible (and this should be the case in a reasonably designed system), and if the circuit is split into equal fractions, the switches should be equally sized.

### 5.3.1.5 Switch Topologies

NMOS and PMOS switches with overdrive voltage $V_{O V}$ and a transmission gate are investigated as switches for PCG (Fig. 5.12). In the linear region $\left(\left|V_{G S}\right|>\left|V_{t h}\right|\right.$ and $\left.\left|V_{G D}\right|>\left|V_{t h}\right|\right)$ the resistance of the NMOS and PMOS switch with overdrive gate voltage as a function of the clock line voltage level $\phi$ is

$$
\begin{align*}
R_{S, n} & =\frac{1}{\mu_{n} C_{O X} \frac{W}{L}\left(V_{D D}+V_{O V}-\Phi-V_{t h, n}\right)}  \tag{5.28}\\
R_{S, p} & =\frac{1}{\mu_{p} C_{O X} \frac{W}{L}\left(V_{O V}+\Phi+V_{t h, p}\right)} . \tag{5.29}
\end{align*}
$$

Both $V_{O V}$ and the threshold voltage $V_{t h}$ are absolute values. To operate only in the linear region, $V_{O V}>\max \left\{V_{t h, n},-V_{t h, p}\right\}$ must hold. This is also the condition for keeping the switch conducting over the whole range of the voltage ramp of the power-clock signal $\phi$. The NMOS device exhibits its lowest on-resistance at the beginning of the ramp-up of $\phi$. In contrast, the PMOS device handles power-clock voltages close to $V_{D D}$ better.
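The behavior of (5.28) and (5.29) over the power-clock ramp can be sketched numerically. The following Python fragment is only an illustration: the threshold voltages and the lumped factors `K_N` and `K_P` (standing for $\mu C_{O X} W / L$) are assumed values, not taken from the investigated technology, and $V_{t h, p}$ is treated as a signed, negative quantity as in the linear-region condition.

```python
# Hypothetical device parameters, chosen only for illustration.
V_DD = 1.2      # supply voltage in V
V_OV = 0.4      # gate overdrive in V
V_TH_N = 0.3    # assumed NMOS threshold in V
V_TH_P = -0.3   # assumed PMOS threshold in V (signed, negative)
K_N = 400e-6    # assumed mu_n * C_ox * W / L in A/V^2
K_P = 200e-6    # assumed mu_p * C_ox * W / L in A/V^2


def r_nmos(phi, v_ov=V_OV):
    """On-resistance of the boosted NMOS switch after Eq. (5.28)."""
    return 1.0 / (K_N * (V_DD + v_ov - phi - V_TH_N))


def r_pmos(phi, v_ov=V_OV):
    """On-resistance of the boosted PMOS switch after Eq. (5.29)."""
    return 1.0 / (K_P * (v_ov + phi + V_TH_P))


def linear_region_ok(v_ov=V_OV):
    """Condition V_OV > max{V_th,n, -V_th,p} for the full ramp."""
    return v_ov > max(V_TH_N, -V_TH_P)


# Sweep the power-clock voltage from 0 to V_DD.
phi_values = [V_DD * i / 120 for i in range(121)]
r_n = [r_nmos(phi) for phi in phi_values]
r_p = [r_pmos(phi) for phi in phi_values]

print("linear region over the full ramp:", linear_region_ok())
print(f"R_n: {r_n[0]:.0f} ohm at phi = 0, {r_n[-1]:.0f} ohm at phi = V_DD")
print(f"R_p: {r_p[0]:.0f} ohm at phi = 0, {r_p[-1]:.0f} ohm at phi = V_DD")
```

With these assumed numbers the NMOS resistance is smallest at the start of the ramp and the PMOS resistance is smallest near $V_{D D}$, which mirrors the qualitative statement above.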

Fig. 5.12 Three implementations of switches to cut off AL are investigated. In case of NMOS and PMOS switch, overdrive voltage $V_{O V}$ is applied to gain full-swing operation



A transmission gate combines NMOS and PMOS devices. The NMOS device will take part in the charging process during the low voltage levels of $\phi$ and the PMOS device for the high voltage levels. For the transmission gate the overall resistance is calculated by the equation for parallel connected resistors

$$
\begin{equation*}
R_{S, T G}=\frac{R_{S, n} \cdot R_{S, p}}{R_{S, n}+R_{S, p}} \tag{5.30}
\end{equation*}
$$

Until the power-clock reaches $-V_{t h, p}$ the PMOS device will only contribute with its leakage currents to the charging process. The equivalent on-resistance in this regime is much higher than that of the NMOS device ( $R_{S, p} \gg R_{S, n}$ ). Up to $V_{D D}-$ $V_{t h, n}$ both devices are in the linear region. Above this level, the NMOS device is in the sub-threshold regime. Here the overall on-resistance $R_{S, T G}$ is dominated by the PMOS device. Equation (5.30) is split into three voltage regimes and expressed by

$$
R_{S, T G}= \begin{cases}R_{S, n}, & \phi<-V_{t h, p}  \tag{5.31}\\ \frac{R_{S, n} \cdot R_{S, p}}{R_{S, n}+R_{S, p}}, & -V_{t h, p} \leq \phi \leq V_{D D}-V_{t h, n} \\ R_{S, p}, & \phi>V_{D D}-V_{t h, n}\end{cases}
$$

Insertion of the equations for the on-resistance of $R_{S, n}$ and $R_{S, p}$ leads to

$$
R_{S, T G}= \begin{cases}\frac{1}{\mu_{n} C_{O X} \frac{W_{n}}{L}\left(V_{D D}-\phi-V_{t h, n}\right)}, & \phi<-V_{t h, p},  \tag{5.32}\\ \frac{1}{\left(\mu_{n}+\mu_{p}\right) C_{O X} \frac{\left(W_{n}+W_{p}\right)}{L}\left(V_{D D}-\left(V_{t h, n}-V_{t h, p}\right)\right)}, & -V_{t h, p} \leq \phi \leq V_{D D}-V_{t h, n}, \\ \frac{1}{\mu_{p} C_{O X} \frac{W_{p}}{L}\left(+\phi+V_{t h, p}\right)}, & \phi>V_{D D}-V_{t h, n} .\end{cases}
$$

In a MATLAB script the on-resistances of the switches according to (5.28), (5.29), and (5.32) are determined over the whole voltage range of the power-clock $\phi$. For the transmission gate, the width of the PMOS is $W_{p}=2 W_{n}$ to compensate for the differing values of the mobility. The relative overhead $E_{O H, \text { rel }}$ is then calculated. $V_{O V}$ is 400 mV for the boosted NMOS and PMOS devices. The following assumptions are made for the calculation of the relative overhead: the switch is inserted at the end of a line (compare (5.10)) with a length of $100 \mu \mathrm{~m}$, and 16 equivalents of an ECRL inverter gate are attached as load at the end of the line. Results are shown in Fig. 5.13.
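The original evaluation script is not reproduced here. As a rough sketch of the transmission-gate part, the Python fragment below evaluates the parallel connection (5.30) with the three regimes of (5.31), written in terms of conductances so that a device in cutoff simply contributes zero (leakage is neglected). All numerical parameters are assumptions; in particular $\mu_{n}=2 \mu_{p}$ is assumed so that $W_{p}=2 W_{n}$ gives equal lumped factors for both devices.

```python
# Hypothetical parameters, chosen only for illustration.
V_DD = 1.2      # supply voltage in V
V_TH_N = 0.3    # assumed NMOS threshold in V
V_TH_P = -0.3   # assumed PMOS threshold in V (signed, negative)
K_N = 400e-6    # assumed mu_n * C_ox * W_n / L in A/V^2
K_P = 400e-6    # assumed mu_p * C_ox * W_p / L with W_p = 2 W_n, mu_n = 2 mu_p


def g_nmos(phi):
    """NMOS conductance without gate boost; zero once the device is off."""
    return max(0.0, K_N * (V_DD - phi - V_TH_N))


def g_pmos(phi):
    """PMOS conductance without gate boost; zero while the device is off."""
    return max(0.0, K_P * (phi + V_TH_P))


def r_tg(phi):
    """Transmission-gate on-resistance.

    Equivalent to Eq. (5.30)/(5.31): NMOS only for phi < -V_th,p,
    both devices in parallel in the middle regime, PMOS only for
    phi > V_DD - V_th,n.
    """
    return 1.0 / (g_nmos(phi) + g_pmos(phi))


phi_values = [V_DD * i / 120 for i in range(121)]
r_values = [r_tg(phi) for phi in phi_values]

print(f"R_TG between {min(r_values):.0f} and {max(r_values):.0f} ohm")
```

Under these assumptions the resistance is constant in the middle regime and varies by less than a factor of two over the whole ramp, in line with the small variation reported for the transmission gate.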

Fig. 5.13 Results of the evaluation of the on-resistance and the relative overhead introduced by the different switch topologies


Within the voltage range of $\phi$, the on-resistance of the transmission gate changes relatively little compared with the two-decade change in the on-resistance of the NMOS and PMOS devices (both minimum sized) with gate overdrive. In terms of area, the transmission gate is three times larger than the single-device switches. But as an increased voltage has to be provided for boosting the gate, the additional circuitry for the boosted NMOS or PMOS transistor will also consume area and power.

In the plot of $E_{O H, \text { rel }}$ versus $W / W_{\text {min }}$ the impact of sizing the switch is demonstrated. An increased width leads to a decreased on-resistance, thus the losses within the switch are reduced. But the switch capacitance $C_{S} \propto W \cdot L$ is increased as well, and an increased switch capacitance leads to increased adiabatic losses within the switch and the line resistance, as observable in (5.10). The width $W / W_{\min }$ is the overall width ( $W_{n}+W_{p}$ in the case of the transmission gate) of the switch. The boosted NMOS device allows for the highest savings, closely followed by the transmission gate with the same overall size. In contrast to the NMOS device, no boost circuitry is needed for the transmission gate. The PMOS device, due to the reduced mobility, is not suitable as a power gating switch. Only if the overdrive voltage is increased can the PMOS switch performance be improved. An optimum in the sizing of the switch can be found. For small widths the on-resistance is the dominant contributor to the losses, and the overhead in the plot follows a $\frac{1}{W / W_{\min }}$ dependence. If the switch width is further increased, the impact of the capacitance $C_{S} \propto W / W_{\text {min }}$ can be seen.

Fig. 5.14 Boosted NMOS switch for Power-Clock Gating: energy consumption for an adiabatic system with 16 ECRL (top) or 16 PFAL (bottom) inverter gates

In simulations the impact of the overdrive voltage and the sizing of the NMOS switch is investigated. ECRL and PFAL are simulated with 16 inverter circuits at a frequency of 100 MHz. Both are connected to the power-clock via a $100 \mu \mathrm{~m}$ line. The overdrive voltage $V_{O V}$ and the multiplier for the switch width $w_{S}$ $\left(w_{S}=W_{n} / W_{\min }\right)$ are varied. The switch is placed at the end of the line. In Figs. 5.14, 5.15 and 5.16 simulation results for the energy consumption, the relative energy overhead, and the output voltage with respect to $V_{D D}=1.2 \mathrm{~V}$ are plotted. With a size of $w_{S}=1$ and $V_{O V}=0$, PCG adds a large overhead of $30 \%$ and $330 \%$ for ECRL and PFAL, respectively. The large overhead in PFAL is due to the small reference value $E_{0}$ in PFAL. For this constellation, the overall consumption is dominated by the switch resistance $\left(R_{S} \gg R_{\text {on,ECRL }}>R_{\text {on, PFAL }}\right)$. Due to the cutoff at $V_{D D}-V_{t h, n}+V_{O V}$, the output voltage is reduced in the gates. At $w_{S}=1$ the switch has to be boosted with $V_{O V}=0.5 \mathrm{~V}$ to get an output signal peak of at least $0.95 V_{D D}$. If the switch width is increased, the energy consumption and thus the energy overhead is greatly reduced. The output signal is increased for a constant $V_{O V}$. If $V_{O V}=0.4 \mathrm{~V}$ for a $w_{S}>10$, the output voltage peak is at least $0.95 V_{D D}$.

Fig. 5.15 Boosted NMOS switch for Power-Clock Gating: relative energy overhead for an adiabatic system with 16 ECRL (top) or 16 PFAL (bottom) inverter gates

A higher switch resistance leads to a higher consumption in the system. For the ECRL gates in Fig. 5.14 it is found that for lower $V_{O V}$ the energy dissipation is decreased. Here the reduced output voltage peak leads to a decreased current and thus to decreased losses. Within the PFAL plot, the energy is increased when $V_{O V}$ is decreased.

Fig. 5.16 Boosted NMOS switch for Power-Clock Gating: voltage at the output of the adiabatic circuit with comparison to $V_{D D}$ for an adiabatic system with 16 ECRL (top) or 16 PFAL (bottom) inverter gates

Power-Clock Gating System Case Study: A 130 nm 16 bit ECRL Carry-Lookahead Adder

A Carry-Lookahead Adder (CLA) in ECRL is used as a test device for PCG in the configuration shown in Fig. 5.17. Lines of $100 \mu \mathrm{~m}$ length are used to connect the trapezoidal power-clock signal to the gates. Simulations are carried out for a target frequency of 100 MHz and a supply voltage of $V_{D D}=1.2 \mathrm{~V}$. Boosted NMOS switches are compared to transmission gates as PCG switches. Switches with a width of 15 times the minimum width are inserted at the end of the line. This corresponds to an area penalty of less than $12 \%$. The system is simulated for cases where $T_{o f f}=T_{o n}$. Equation (5.2) then simplifies to

$$
\begin{equation*}
\overline{E_{P C G}}=\frac{1}{2}\left(E_{o n}+E_{o f f}+\frac{T_{\Phi}}{T_{o f f}} E_{S O H}\right) . \tag{5.33}
\end{equation*}
$$

The simulation results in Fig. 5.17 show $\overline{E_{P C G}}$ for different off-times $T_{o f f}$, where $T_{o f f}$ is given relative to the clock period $T_{\phi}$. Switching losses due to driving the gates of the switches are not included in the results. The Minimum Power-Down Time $T_{M P D}$ can be found at the crossing of each line with the reference (no switch). $T_{M P D}$ is only around 13 clock cycles for the transmission gate switch and approximately 19 clock cycles for the boosted NMOS switch. From (5.6) it can be seen that the transmission gate used here introduces either a lower switching overhead $E_{S O H}$ or a lower overhead in the on-state $E_{O H}$. The overhead introduced by the switch in the on-state can be seen for relatively long $T_{o f f}$, where according to (5.33) the switching overhead can be neglected and $\overline{E_{P C G}} \approx \frac{1}{2} E_{\text {on }}$. The boosted NMOS switch accounts for $21 \%$ energy overhead in the on-state and the transmission gate for $24 \%$. Thus it can be concluded that the switching overhead $E_{S O H}$ is smaller in the transmission gate.
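The crossing point that defines $T_{M P D}$ follows directly from (5.33): setting $\overline{E_{P C G}}$ equal to the reference energy $E_{0}$ of the system without a switch and solving for $T_{o f f} / T_{\phi}$. The short Python sketch below illustrates this for the case $T_{o f f}=T_{o n}$. The energy values in the example call are hypothetical and normalized to $E_{0}$; they are not the simulated values of the case study.

```python
def mean_energy_pcg(e_on, e_off, e_soh, t_off_cycles):
    """Mean energy per cycle after Eq. (5.33), valid for T_on = T_off.

    t_off_cycles is the off-time in units of the clock period T_phi.
    """
    return 0.5 * (e_on + e_off + e_soh / t_off_cycles)


def min_power_down_cycles(e_0, e_on, e_off, e_soh):
    """Off-time (in clock cycles) at which Eq. (5.33) equals e_0."""
    margin = 2.0 * e_0 - e_on - e_off
    if margin <= 0.0:
        raise ValueError("PCG never breaks even for these values")
    return e_soh / margin


# Hypothetical, normalized example values (E_0 = 1).
E_0, E_ON, E_OFF, E_SOH = 1.0, 1.2, 0.05, 8.0

t_mpd = min_power_down_cycles(E_0, E_ON, E_OFF, E_SOH)
print(f"T_MPD = {t_mpd:.2f} clock cycles")
print(f"mean energy at 2 * T_MPD: "
      f"{mean_energy_pcg(E_ON, E_OFF, E_SOH, 2.0 * t_mpd):.3f} * E_0")
```

For off-times longer than the computed value the mean energy drops below the reference, which is the region where PCG pays off.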

Fig. 5.17 Mean energy dissipation $\overline{E_{P C G}}$ with PCG and energy for a system without a switch for an ECRL 16 bit CLA. A boosted NMOS switch $\left(V_{O V}=0.4 \mathrm{~V}\right)$ and a transmission gate are compared as switch


Fig. 5.18 If a part of the circuit is disconnected from the oscillator, the overall capacitance will be decreased by a fraction $\Delta C_{A L}$


This case study shows that, with a moderate area overhead of $12 \%$, energy is saved compared to the system without power-clock gating even for very short power-down times of less than 20 cycles.

### 5.3.2 Power-down of the Power-Clock Oscillator

All gating techniques that switch off a circuit via a switch [96-99] reduce the capacitance connected to the power-clock. If the power-clock is generated by a resonant generator, PCG therefore shifts the resonance frequency $f_{\text {res }}$. When the LC oscillator presented in Sect. 4 is used, the frequency is forced to the value $f_{s y n c}$ as long as $f_{\text {res }}$ stays within a certain range. But a mismatch of $f_{\text {sync }}$ and $f_{\text {res }}$ means a decreased efficiency. The resonance frequency is defined in (4.1) as

$$
\begin{equation*}
f_{\text {res }}=\frac{1}{2 \pi \sqrt{L C}} \tag{5.34}
\end{equation*}
$$

Fig. 5.19 Plot of $\Delta f / f_{0}$ versus $\Gamma$: If e.g. $30 \%$ ( $\Gamma=0.3$ ) of the capacitive load is switched off by PCG, the frequency is increased by $20 \%$

Disconnecting a part of the system changes the overall capacitance by $\Delta C_{A L}$. Therefore, if only a part of the system is shut off, the active part of the circuit will see an increased frequency. Figure 5.18 shows a schematic of such a constellation. In reality a system consists of four phases and a four-phase oscillator, and two coils are used in a real setup. Nevertheless the frequency shift due to $\Delta C_{A L}$ is expressed by

$$
\begin{equation*}
\Delta f_{r e s}=f_{P C G}-f_{0}=\frac{1}{2 \pi \sqrt{L}}\left(\frac{1}{\sqrt{C-\Delta C_{A L}}}-\frac{1}{\sqrt{C}}\right) \tag{5.35}
\end{equation*}
$$

In $C$ the overall capacitance when PCG is inactive is summarized. With $\Delta C_{A L}=$ $\Gamma C$ and $0 \leq \Gamma \leq 1$ the equation above can be rewritten as

$$
\begin{equation*}
\frac{\Delta f_{r e s}}{f_{0}}=\frac{1}{\sqrt{1-\Gamma}}-1 \tag{5.36}
\end{equation*}
$$

Equation (5.36) is plotted in Fig. 5.19.

The capacitance introduced by the power-grid adds a stabilizing capacitive load to the oscillator. If properly designed, the power lines will not noticeably contribute to the energy consumption of the adiabatic system, as $R_{L} C_{L}^{2} \ll R_{A L} C_{A L}^{2}$. Low-ohmic resistances connect $C_{L}$ to the oscillator. Thus this capacitance will not perceivably contribute to the overall losses, but if $C_{L}$ is of the same magnitude as $C_{A L}$ it will stabilize $f_{\text {res }}$. Nevertheless the deviation will lead to a mismatch of $f_{\text {sync }}$ and $f_{\text {res }}$ and thus to a reduced efficiency of the oscillator.
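Equation (5.36) and the stabilizing effect of the line capacitance can be checked with a few lines of Python. The function `rel_freq_shift_with_line` is an illustrative helper that is not part of the text: it assumes that the line capacitance $C_{L}$ always stays connected, so that switching off a fraction of $C_{A L}$ corresponds to a smaller fraction $\Gamma$ of the total capacitance. The capacitance values in the example are arbitrary.

```python
import math


def rel_freq_shift(gamma):
    """Relative resonance frequency shift after Eq. (5.36)."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must satisfy 0 <= gamma < 1")
    return 1.0 / math.sqrt(1.0 - gamma) - 1.0


def rel_freq_shift_with_line(gamma_al, c_al, c_line):
    """Shift when a fraction gamma_al of C_AL is gated off and the
    line capacitance c_line remains connected (illustrative helper)."""
    gamma_total = gamma_al * c_al / (c_al + c_line)
    return rel_freq_shift(gamma_total)


# 30 % of the load switched off, no line capacitance considered.
print(f"gamma = 0.3: {100.0 * rel_freq_shift(0.3):.1f} % shift")

# Same gated fraction, line capacitance equal to C_AL (arbitrary values).
shift = rel_freq_shift_with_line(0.3, c_al=10e-12, c_line=10e-12)
print(f"with C_L = C_AL: {100.0 * shift:.1f} % shift")
```

The first value reproduces the roughly 20 % shift quoted for Fig. 5.19; the second shows how a line capacitance of the same magnitude as $C_{A L}$ reduces the shift without removing it.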

A simulation of a system composed of a synchronized 2N2P LC oscillator and an equivalent load of 10k ECRL buffer gates per phase shows the impact of the deviation on the energy consumed. A $100 \mu \mathrm{~m}$ long line is used to connect the oscillator to the adiabatic equivalents. The line adds a capacitance in the range of the adiabatic load's capacitance. Transistors for the oscillator are taken from an industrial 130 nm technology. A fraction $\Delta N$ of the overall gates $N$ is shut off. The capacitive load caused by the adiabatic load is reduced by $\Delta C_{A L}$. Figure 5.20 shows the dissipated energy $E$ with respect to the deviation $\frac{\Delta C_{A L}}{C_{A L}+\Delta C_{A L}}$. Energy values are related to $E_{0}=E\left(\Delta C_{A L}=0\right)$. Instead of saving energy by applying PCG, a large overhead is introduced as the resonance frequency of the LC tank is detuned and forced to $f_{\text {sync }}$ by the synchronization signals. The energy within the remaining part of the adiabatic system is increased as well. This is mainly caused by increased dynamic losses occurring at the time when the oscillator is synchronized, as the voltage of the internal nodes is abruptly recharged to $V_{D D}$. No energy is saved with PCG in this case; in fact the energy is increased if no countermeasures are introduced to stabilize the resonance frequency.

Fig. 5.20 A system consisting of a synchronized 2N2P oscillator with an equivalent load of 10k ECRL buffer gates per phase and a $100 \mu \mathrm{~m}$ long power line: Energy dissipation fractions with respect to the switched capacitance if no replacement capacitance is attached, where $E_{V_{D D}}$ is the overall energy drawn from the power supply $V_{D D}$

Fig. 5.21 A system consisting of a synchronized 2N2P oscillator with an equivalent load of 10k ECRL buffer gates per phase and a $100 \mu \mathrm{~m}$ long power line: Fractions of the energy dissipation with respect to the switched capacitance when a replacement capacitance is attached

In the following, countermeasures are proposed that help to keep the $L C$ tank's resonance frequency constant when the capacitive load presented by the adiabatic system varies. The proposals are explained and reviewed for their capability.

Counteracting the reduction of the capacitance by switching off a part of the adiabatic system could be done by connecting a replacement capacitance. In Fig. 5.21 simulation results with a replacement cap added are presented. In the simulation, a replacement cap $C_{R}=\Delta C_{A L}$ is added. This will keep the resonance frequency constant and savings are gained by PCG.

Two proposals for replacement caps are presented here. The first is a capacitance that is additionally integrated on the chip and connected via a switch when PCG is active (Fig. 5.22). The replacement capacitance consumes some area, which increases the overall area of the design. MOS capacitors have the highest capacitance per unit area and are thus preferable. $C_{R}$ has to be equal to $\Delta C_{A L}$, and $\Delta C_{A L}$ is composed of the active gate area of the circuit and the wiring capacitance. A more compact layout can be achieved for the replacement cap, but it still adds area overhead of the same order of magnitude as the size of the gated circuit. Care has to be taken in sizing the switch, as the replacement cap has to be connected to the oscillator via a low-ohmic connection. In PCG mode, the energy consumption of the switched-off circuit is cut off, but charging the replacement cap via the switch adds a certain amount. Again, controlling the switch also adds dynamic losses $\frac{1}{2} C_{S} V_{D D}^{2}$. Switching losses add to $E_{S O H}$ and thus increase $T_{M P D}$.

Fig. 5.22 A capacitance $C_{R}$ is added to compensate for the reduced capacitance caused by switching off the adiabatic system

Fig. 5.23 Comparable to the proposal in Fig. 5.22, a tunable capacitor could be used to compensate the change in the overall capacitance

A tunable capacitance $C_{T}$ (Fig. 5.23) makes it possible to avoid the switch. It is permanently connected to the oscillator and, in PCG mode, tuned via a control voltage $V_{\text {tune }}$ to a higher capacitance value. This can be realized by switching a MOS capacitance from depletion mode, with a low capacitance value, to inversion mode, with a high capacitance value. The permanently connected part $C_{A L}$ is increased by the capacitance $C_{T}$. Losses $\frac{1}{2} C_{T} \Delta V_{\text {tune }}^{2}$ due to switching to the tuning voltage will also increase $T_{M P D}$ in this proposal.

Nevertheless, both proposals lead to an increased area consumption, thus a shut-down of the oscillator is proposed in the following. The system is divided into subsystems, each equipped with its own oscillator. All subsystems that have long enough idle times are equipped with oscillators that can be disconnected from the power supply $V_{D D}$. A common synchronization signal $f_{\text {sync }}$ is connected to all oscillators to guarantee correct phase conditions between the interfaces. In the example sketched in Fig. 5.24 two subsystems are used, one of which is equipped with PCG.

Fig. 5.24 If two oscillators are used, a part of the system is cut off by removing the power supply $V_{D D}$ from the oscillator. If the subsystems exchange signals, both oscillators have to be synchronized in order to operate properly at their interfaces

Both oscillators have their own inductance value to meet the desired resonance frequency.

$$
\begin{align*}
f_{\text {res }, A L} & =\frac{1}{2 \pi \sqrt{L_{1}\left(C_{L, 1}+C_{A L}\right)}}  \tag{5.37}\\
f_{r e s, \Delta A L} & =\frac{1}{2 \pi \sqrt{L_{2}\left(C_{L, 2}+\Delta C_{A L}\right)}} \tag{5.38}
\end{align*}
$$

If the same frequency $f_{\text {res }}$ has to be met, a lower capacitive load means, that a higher inductance value has to be chosen. Consequently the inductance values for the subsystems are higher. For a planar, single-layer, square inductor the inductance value is calculated in [103] by

$$
\begin{align*}
L & =\frac{\mu_{0}}{2 \pi} l\left(\ln \frac{l}{n(w+t)}-0.2\right)  \tag{5.39}\\
l & =(4 n+1) d_{i n}+\left(4 N_{i}+1\right) N_{i}(w+s) \tag{5.40}
\end{align*}
$$

Here $w$ is the width, $s$ the spacing, $t$ the thickness of the metal, $d_{i n}$ the inner diameter of the inductor, $n$ the winding count and $N_{i}=\operatorname{integer}(n)$. The area occupied by such an inductor can be roughly estimated by

$$
\begin{equation*}
A=\left(d_{i n}+2 N_{i}(w+s)\right)^{2} . \tag{5.41}
\end{equation*}
$$

For an integrated coil with the parameters $d_{i n}=1 \mathrm{~mm}, w=s=t=200 \mathrm{~nm}$ the inductance value $L$ and the area consumed $A$ are plotted in Fig. 5.25.
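The estimate can be reproduced by transcribing (5.39) to (5.41) directly. The Python sketch below does so for the coil parameters stated above; the function names are illustrative, the winding counts in the loop are arbitrary sample points, and no claim is made that the numbers match the curves of Fig. 5.25 exactly.

```python
import math

MU_0 = 4.0e-7 * math.pi  # vacuum permeability in H/m


def wire_length(n, d_in, w, s):
    """Conductor length l after Eq. (5.40)."""
    n_i = int(n)
    return (4 * n + 1) * d_in + (4 * n_i + 1) * n_i * (w + s)


def inductance(n, d_in, w, s, t):
    """Inductance of the planar square coil after Eq. (5.39)."""
    length = wire_length(n, d_in, w, s)
    return MU_0 / (2.0 * math.pi) * length * (
        math.log(length / (n * (w + t))) - 0.2
    )


def coil_area(n, d_in, w, s):
    """Rough area estimate after Eq. (5.41)."""
    n_i = int(n)
    return (d_in + 2 * n_i * (w + s)) ** 2


# Parameters from the text: d_in = 1 mm, w = s = t = 200 nm.
D_IN, W, S, T = 1e-3, 200e-9, 200e-9, 200e-9

for n in (1, 2, 5, 10):  # sample winding counts
    l_henry = inductance(n, D_IN, W, S, T)
    a_mm2 = coil_area(n, D_IN, W, S) * 1e6
    print(f"n = {n:2d}: L = {l_henry * 1e9:6.1f} nH, A = {a_mm2:.4f} mm^2")
```

Because the inner diameter dominates the area term, the area hardly changes with the winding count for these dimensions, while the inductance grows roughly in proportion to the conductor length.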


Fig. 5.25 Estimation results of the inductance value and the area consumed by a planar, single-layer, square inductor according to [103]

As the inner diameter for this coil is already very large, additional windings will not add too much area overhead. Even though the increased inductance value will not add too much overhead, the amount of inductors will do so. If the setup with only one inductor is used, and the overall capacitance is connected to the oscillator, the inductance $L$ is calculated via

$$
\begin{equation*}
L=\frac{1}{\omega_{0}^{2}\left(C_{L}+C_{A L}+\Delta C_{A L}\right)} \tag{5.42}
\end{equation*}
$$

with $\omega_{0}=2 \pi f_{\text {res }}$. The system with two oscillators will need two inductances calculated via

$$
\begin{align*}
L_{1} & =\frac{1}{\omega_{0}^{2}\left(C_{L, 1}+C_{A L}\right)}  \tag{5.43}\\
L_{2} & =\frac{1}{\omega_{0}^{2}\left(C_{L, 2}+\Delta C_{A L}\right)} \tag{5.44}
\end{align*}
$$

If it is assumed that $\left(C_{L, 1}+C_{A L}\right)$ and $\left(C_{L, 2}+\Delta C_{A L}\right)$ are both smaller than $\left(C_{L}+C_{A L}+\Delta C_{A L}\right)$, then $L_{1}$ and $L_{2}$ have to be larger than $L$. Even if no area increase is expected for $L_{1}$ and $L_{2}$ with respect to $L$, i.e. $A_{L} \approx A_{L 1} \approx A_{L 2}$, the overall area is doubled $\left(A_{L 1}+A_{L 2} \approx 2 A_{L}\right)$. This means that splitting into subsystems increases the saving potential of PCG without any additionally introduced capacitances. The drawback is the increased area consumed by the coils in such divided systems. The lower limit of the area overhead for a division into $M$ sub-blocks can be expressed by $\sum_{i=1}^{M} A_{L i}=M A_{L}$.
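A small numerical example illustrates the relation between the single-oscillator and the two-oscillator setup. It uses the form of (5.42) for all three inductances. The capacitance values and the split of the line capacitance between the two subsystems are purely hypothetical; only the 100 MHz operating frequency is taken from the simulations in this chapter.

```python
import math


def tank_inductance(f_res, c_total):
    """Inductance needed to resonate c_total at f_res, as in Eq. (5.42)."""
    omega_0 = 2.0 * math.pi * f_res
    return 1.0 / (omega_0 ** 2 * c_total)


F_RES = 100e6          # resonance frequency in Hz
C_AL = 30e-12          # hypothetical always-on adiabatic load
DELTA_C_AL = 10e-12    # hypothetical gated part of the load
C_LINE = 10e-12        # hypothetical total line capacitance
C_LINE_MAIN = 5e-12    # hypothetical line share of the always-on part
C_LINE_GATED = 5e-12   # hypothetical line share of the gated part

l_single = tank_inductance(F_RES, C_LINE + C_AL + DELTA_C_AL)
l_main = tank_inductance(F_RES, C_LINE_MAIN + C_AL)
l_gated = tank_inductance(F_RES, C_LINE_GATED + DELTA_C_AL)

print(f"single oscillator:   L = {l_single * 1e9:6.1f} nH")
print(f"always-on subsystem: L = {l_main * 1e9:6.1f} nH")
print(f"gated subsystem:     L = {l_gated * 1e9:6.1f} nH")
```

Both subsystem inductances come out larger than the inductance of the single-oscillator system, since each tank sees only part of the total capacitance.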

Nevertheless, PCG is a powerful method to prevent energy consumption in adiabatic circuit blocks that do not process data. As soon as the whole adiabatic circuit can be switched off, PCG by shutting down the oscillator is a powerful way to prevent energy losses in an idle system. One could imagine an adiabatic system that is integrated into a static CMOS environment as a sub-block that is not used during the whole operation time. Such a system is then equipped with a single oscillator that can be shut off.

### 5.4 Power-down Mode for the Synchronous 2N2P LC-oscillator

The synchronous 2N2P LC-oscillator presented in Sect. 4 generates the power-clock without degrading the system efficiency of AL too much. Synchronization signals gate the transistors in the oscillator periodically. On the one hand, this compensates for the energy dissipated in one cycle in the gates of the Adiabatic Logic system; on the other hand, it synchronizes the oscillator to the frequency $f_{\text {sync }}$ of the synchronization signals. By gating the synchronization signals, PCG can be implemented in this oscillator without the need for additional gating transistors. Different power-down modes are conceivable. The constraint is that no voltage drop may be applied across the inductor, as otherwise a current will be induced according to $\frac{d i}{d t}=\frac{u}{L}$. Three modes that conform to this constraint are presented subsequently. They are rated with respect to their applicability and their energy efficiency. Schematics of all three modes are presented in Figs. 5.26, 5.27 and 5.28. In power-down mode 1 all transistors are off, and the oscillator dissipates the stored energy in a damped oscillation. In power-down modes 2 and 3, either the NMOS devices or the PMOS devices are held in the on-state during PCG. Thus the output signals are connected either to ground or to $V_{D D}$. In mode 2 the stored energy is dissipated abruptly. In mode 3, which is also called SRAM mode, the gates are connected to $V_{D D}$ and thus active signals can be stored.

The three modes behave differently with respect to retention of data values, shut-down and power-on. Mode 1 leads to a damped sinusoidal signal; the energy stored in the LC tank is dissipated via adiabatic charging events, leakage losses and losses in the inductor. Mode 2 switches the signals abruptly to ground, mode 3 switches them to $V_{D D}$. At power-on, the dissipated energy has to be resupplied to the LC tank, thus at the end of the PCG phase an increased energy consumption is seen in all three modes.

In mode 3 the supply rails to the adiabatic system are all at $V_{D D}$. Thus, at first sight, the data in the gates that are in the hold interval can be stored. But in this mode the gates are no longer operated in a pipelined fashion, and data at a deeper level of the pipeline will be corrupted by the feed-through of data signals. Let us assume that five inverters are operated in a row, and the first power-clock phase $\phi_{0}$ is in the hold interval. Then the first gate stores data that was fed to the input in the present cycle, and the fifth gate stores the data signal from the previous cycle. Now the intermediate gates (two to four) are also connected to $V_{D D}$. The output of gate two will switch to output states according to its inputs, i.e. the output signals of the first gate. Then gates three and four will also be switched. Gate five will be influenced by the output signals of gate four. Storing data is thus not guaranteed in this mode. A hybrid mode is introduced that allows for storing data signals during PCG. It is composed of modes 2 and 3: one of the oscillators is connected to ground, the other is connected to $V_{D D}$. Feed-through of signals is thus prevented.

Fig. 5.26 Power-down mode 1: All transistors are turned off

Fig. 5.27 Power-down mode 2: Both NMOS transistors are on and connect AL gates to ground

Fig. 5.28 Power-down mode 3 (SRAM mode): Connecting AL gates to $V_{D D}$ via the PMOS transistors

A simulation shows the energy consumption of the different modes in on-state and during active PCG, as well as the overhead due to switching into the off-state and into the on-state. According to (5.2) the energy consumption in on-state $E_{\text {on }}$ and in off-state $E_{o f f}$ are energy per cycle values. $E_{S O H}$ is assumed to take place in one cycle, thus all overhead due to switching is accumulated and summarized in $E_{S O H}$. A system consisting of PFAL inverters is used as load. Synchronized 2N2P oscillators provide the oscillating power-clock. The synchronization signals are gated with static CMOS NAND and NOR gates (depending on the mode), which are buffered to meet the required driver strength for the transistors within the oscillators. The system is in on-state for 25 periods and is then switched to PCG for 25 cycles. The energy consumption of the four modes is presented in Fig. 5.29.

Fig. 5.29 Energy dissipation versus time for $T_{\text {on }}=T_{\text {off }}=25 T$. Mode 1 and mode 2 consume increased energy when PCG is deactivated, mode 3 consumes increased energy mostly when PCG is activated. The hybrid mode is a mix of mode 2 and mode 3, increased energy can be seen at both transitions of PCG

Table 5.1 Overview of saving potential and switching overhead for all power-down modes

|  | $E_{\text {on }} / E_{\text {on,mode } 1}$ | $E R F$ | $E_{S O H} / E_{\text {on }}$ |
| :--- | :--- | ---: | :---: |
| mode 1 | 1 | 4280 | 4.39 |
| mode 2 | 1 | 2404 | 9.11 |
| mode 3 | 1 | 11 | 8.37 |
| hybrid mode | 1 | 25 | 10.44 |

Mode 1 shows the lowest $E_{S O H}$. Signals will float to a voltage level $0<V_{x}<$ $V_{D D}$ dependent on the ratio between NMOS and PMOS devices in the oscillator, and leakage paths within the adiabatic circuit. When switched on again, dynamic losses are smaller compared to mode 2 , where voltage levels have to be recovered from ground level. In mode 3 all levels are increased abruptly to $V_{D D}$, energy is dissipated when PCG is activated. Here the energy overhead during switching is comparable to mode 2. The hybrid mode consumes energy when PCG is activated, and also when PCG is deactivated. Compared to all other modes, the hybrid mode shows the highest overall consumption at the end of the test case. But data is retained and additional circuitry or overhead due to emptying the pipeline prior to PCG is avoided.

In Table 5.1 the simulated values are summarized. All values are relative values. In the on-state, as expected, all modes lead to the same energy consumption. Mode 1 and mode 2 offer a high ratio $E R F=E_{\text {on }} / E_{\text {off }}$. Mode 3 reduces the energy consumption in PCG by a factor of 11, the hybrid mode offers a slightly better $E_{\text {on }} / E_{\text {off }}$ ratio of 25. The differing $E_{\text {on }} / E_{\text {off }}$ ratios are due to leakage. Whereas in mode 1 the power-clock lines drift to a voltage of approximately $V_{D D} / 2$ in the simulated case, in mode 2 the power-clock lines are held at ground. The PMOS devices see the whole voltage drop of $V_{D D}$ in mode 2, and thus an increased leakage. In mode 3 the power-clock lines are at $V_{D D}$. The oscillator and the adiabatic circuit are exposed to the full voltage and thus an increased leakage is observed. Due to the mix of mode 2 and mode 3 in the hybrid mode, $E_{o n} / E_{\text {off }}$ is increased compared to mode 3.

Fig. 5.30 Energy saving potential by switching off the circuit for two on times and the selected power down modes

No overhead in on-state is introduced, as no additional resistance is introduced into the loading path. Therefore these values can be used to evaluate (5.8). This equation determines the minimum power down time $T_{M P D}\left(E_{o n} \approx E_{0}\right)$. Column $E_{S O H} / E_{\text {on }}$ is equal to the $T_{M P D}\left(E_{\text {on }} \approx E_{0}\right) / T_{\phi}$. Mode 1 is superior to all other modes with respect to minimum allowable power-down times.
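The reading of Table 5.1 described above can be written down in a few lines. The Python sketch below takes the table values as given, interprets the column $E_{S O H} / E_{\text {on }}$ as $T_{M P D} / T_{\phi}$, and derives the relative off-state energy from $E R F$. Rounding up to whole clock cycles is an added convenience for illustration and is not part of the original evaluation.

```python
import math

# Values from Table 5.1: (ERF = E_on / E_off, E_SOH / E_on).
TABLE_5_1 = {
    "mode 1": (4280.0, 4.39),
    "mode 2": (2404.0, 9.11),
    "mode 3": (11.0, 8.37),
    "hybrid mode": (25.0, 10.44),
}


def summarize(table):
    """Return per-mode T_MPD in cycles, whole cycles, and E_off / E_on."""
    summary = {}
    for mode, (erf, e_soh_rel) in table.items():
        summary[mode] = {
            "t_mpd_cycles": e_soh_rel,             # T_MPD / T_phi
            "whole_cycles": math.ceil(e_soh_rel),  # rounded up
            "e_off_rel": 1.0 / erf,                # E_off / E_on
        }
    return summary


summary = summarize(TABLE_5_1)
best_mode = min(summary, key=lambda m: summary[m]["t_mpd_cycles"])

for mode, values in summary.items():
    print(f"{mode:12s} T_MPD ~ {values['t_mpd_cycles']:5.2f} cycles, "
          f"E_off/E_on = {values['e_off_rel']:.5f}")
print("shortest minimum power-down time:", best_mode)
```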

Different constellations of $T_{o n}$ and $T_{o f f}$ can be investigated if the values extracted in Table 5.1 are inserted in (5.2). Results are presented in Fig. 5.30 for mode 1, mode 2 and the hybrid mode, and for two different $T_{\text {on }}$.

All three modes allow for a reduction of the energy consumption $\overline{E_{P C G}}$ of the system by applying PCG, with respect to the energy consumption $E_{\text {on }}$ of a system that is always in on-state. Due to the floating internal nodes in mode 1, dynamic losses are reduced at power-on of the system, thus the lowest energy consumption is achieved with mode 1 for both $T_{o n}$ times. The hybrid mode shows the highest energy consumption as it drains energy from the power supply at both shut-down and power-on. For longer times $T_{\text {on }}$ the switching overhead has less impact. This is observed in the plots: for $T_{o n}=100 \cdot T_{\phi}$ all three dissipation values are closer to each other than for $T_{o n}=25 \cdot T_{\phi}$. For $T_{o n}=T_{o f f}$ the possible reductions are between $10 \%$ and $20 \%$ for $T_{o n}=25 \cdot T_{\phi}$ and roughly between $70 \%$ and $80 \%$ for $T_{\text {on }}=100 \cdot T_{\phi}$.

If data has to be retained, mode 1 and mode 2 will cause an extended on-time, as PCG can only be activated after all data has been retrieved from the pipeline. The pipeline depth $\mathcal{D}$ is a measure of the extent to which the off-time is decreased. $T_{\text {clear }}$ is the time for clearing the pipeline; it is a function of $\mathcal{D}$. In this case modified times $T_{o n}^{*}$ and $T_{o f f}^{*}$ are introduced that are related to $T_{o n}$ and $T_{o f f}$ via

$$
\begin{aligned}
T_{o n}^{*} & =T_{\text {on }}+T_{\text {clear }}(\mathcal{D}) \\
T_{\text {off }}^{*} & =T_{\text {off }}-T_{\text {clear }}(\mathcal{D})
\end{aligned}
$$

Thus in cases where a long $T_{\text {clear }}$ degrades the savings gained by applying mode 1 or mode 2 for PCG, it can be advantageous to use the hybrid mode. If mode 1 or mode 2 is used, it is advantageous to decrease the pipeline depth $\mathcal{D}$. Arithmetic structures have to be used that allow for a small pipeline depth, which is also desirable with respect to buffer overhead.
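The effect of the pipeline depth on the usable off-time can be illustrated with a minimal Python sketch. The dependence of $T_{\text {clear }}$ on $\mathcal{D}$ is not specified above; the sketch assumes, purely for illustration, a four-phase system in which each pipeline stage takes a quarter of a clock period, i.e. $T_{\text {clear }}=\mathcal{D} / 4$ clock cycles. The on- and off-times and pipeline depths in the example are hypothetical; the minimum power-down time is the mode 1 value from Table 5.1.

```python
def clear_time_cycles(depth, stages_per_cycle=4):
    """Assumed model: clearing takes depth / 4 clock cycles."""
    return depth / stages_per_cycle


def modified_times(t_on, t_off, depth):
    """Return (T_on*, T_off*) in clock cycles for pipeline depth `depth`."""
    t_clear = clear_time_cycles(depth)
    return t_on + t_clear, t_off - t_clear


def pcg_pays_off(t_off_star, t_mpd):
    """PCG saves energy only if the remaining off-time exceeds T_MPD."""
    return t_off_star > t_mpd


T_MPD_MODE_1 = 4.39  # cycles, column E_SOH / E_on of Table 5.1

for depth in (8, 32, 96):  # hypothetical pipeline depths
    t_on_star, t_off_star = modified_times(25, 25, depth)
    print(f"D = {depth:2d}: T_on* = {t_on_star:5.1f}, "
          f"T_off* = {t_off_star:5.1f}, "
          f"pays off: {pcg_pays_off(t_off_star, T_MPD_MODE_1)}")
```

Under this assumed model a deep pipeline can consume almost the entire idle interval, in which case a data-retaining mode or a shallower arithmetic structure is the better choice.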

The synchronized 2N2P oscillator allows for a power-down mode with only minor modifications of the circuit. By different schemes for gating the transistors in the oscillator, various modes for power-down of the oscillator can be implemented that differ with respect to their saving potential and the ability for data retention.

## Chapter 6: Arithmetic Structures in Adiabatic Logic

Digital signal processing tasks use a variety of arithmetic operations. Most of the advanced operations, like multiplication, are based on shift operations and binary addition [104]. Thus binary adders are a fundamental research field in high-speed and low-power digital design. The ripple-carry adder (RCA) structure is the straightforward implementation of a binary adder. But as the carry chain grows with the bit width $N$ of the adder, the delay in static CMOS also grows with $N$. Speed improvements are achieved by breaking up or modifying the carry chain. An RCA can be subdivided into fractions, each of which calculates a group propagate signal and the results based on the inputs of the sub-fraction. Thus the carry can take a bypass path and does not have to ripple through the whole full-adder chain. Carry-select and conditional-sum adders improve the carry propagation at the cost of additional hardware. Again the adder is subdivided, and each fraction calculates the outputs for both possible values of the carry-in. Then the carry selects the corresponding output via a multiplexer and also the carry-in for the next stage. The conditional-sum adder additionally improves the carry propagation by pre-computation, but enlarges the overhead due to multiplexers. Based on the carry-select scheme, a carry-increment adder can reduce the massive hardware overhead. Here the result is calculated for a zero carry input, and a consecutive stage decides whether the output is incremented by the carry or not.
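The ripple-carry and carry-select principles described above can be sketched at the bit level. The following Python model is a behavioral illustration only: it says nothing about delay, area or the adiabatic implementation, and the block size of four bits is an arbitrary choice.

```python
def full_adder(a, b, c_in):
    """One-bit full adder: returns (sum, carry_out)."""
    s = a ^ b ^ c_in
    c_out = (a & b) | (c_in & (a ^ b))
    return s, c_out


def ripple_carry_add(x, y, n_bits, c_in=0):
    """N-bit ripple-carry adder: the carry ripples through all stages."""
    carry = c_in
    result = 0
    for i in range(n_bits):
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result, carry


def carry_select_add(x, y, n_bits, block=4):
    """Carry-select adder: each block pre-computes the result for
    carry-in 0 and carry-in 1, the incoming carry selects one of them."""
    carry = 0
    result = 0
    for low in range(0, n_bits, block):
        width = min(block, n_bits - low)
        mask = (1 << width) - 1
        xa, ya = (x >> low) & mask, (y >> low) & mask
        sum0, c0 = ripple_carry_add(xa, ya, width, 0)
        sum1, c1 = ripple_carry_add(xa, ya, width, 1)
        s, carry = (sum1, c1) if carry else (sum0, c0)  # multiplexer
        result |= s << low
    return result, carry


print(ripple_carry_add(0x00FF, 0x0001, 16))
print(carry_select_add(0xFFFF, 0x0001, 16))
```

Both functions return the $N$-bit sum together with the carry-out, so they can be compared directly against ordinary integer addition.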

Carry-save adders (CSA) do not propagate the carry signal within one stage, but do forward a second vector consisting of all carry outputs to the next stage [105]. These adders are primarily advantageous, when consecutive add operations are performed. Parallel prefix adders (PPA) pre-compute group propagates and thus decrease the logical depth of binary adders, accompanied by increased parallel effort and thus power consumption. A broad spectrum of PPA adder designs was proposed in the past [106-109]. Area consumption, speed, power, and layout implications are the design trade-offs in the selection of the adder architecture of choice.

Multipliers are constructed based on adder designs. The partial products can be generated on bit level with simple AND gates and are then added up. For the multiplier, either iterative algorithms or a field of cascaded adder structures can be used. CSAs are well suited to form cascaded fields of adders. After the last adding stage two vectors are left, which have to be combined into the final sum vector by a vector merging adder (VMA).

Three properties of Adiabatic Logic presented in Sect. 2.5 make it necessary to investigate arithmetic structures in Adiabatic Logic. First of all, the ultra-low power dissipation of Adiabatic Logic allows arithmetic units with ultra-low energy consumption to be assembled. Second, dual-rail encoding implicitly delivers the inverted output signal. Thus arithmetic operations like subtraction, which imply the generation of the 2's complement, can be built without the need for additional inverters. Third, micropipelining allows large systems to be designed without any worries about timing constraints. But careful selection of the arithmetic topology is essential, as otherwise overhead in area and energy due to synchronization buffers is induced. In the following part, the importance of careful design in Adiabatic Logic is highlighted. Decreasing the pipeline depth $\mathcal{D}$ by means of a proper choice of the topology and the application of complex gates is demonstrated. Then the energy savings of complex systems compared to their static CMOS counterparts are presented in design examples.

### 6.1 Design of Arithmetic Structures

Advanced operations like subtraction and multiplication are all based on adders. Subtraction of numbers presented in the 2's complement is simply done by addition. The negative number $-\mathbf{X}$ in the 2's complement representation is calculated via $\overline{\mathbf{X}}+1$. Subtraction of the type $\mathbf{Y}-\mathbf{X}$ is thus accomplished with $\mathbf{Y}+\overline{\mathbf{X}}+1$. Multipliers can be built by arrays of adders that perform the summing of all partial products. Hence, an elaboration of adders is the base for all other investigations.
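The identity $\mathbf{Y}-\mathbf{X}=\mathbf{Y}+\overline{\mathbf{X}}+1$ can be checked with a few lines of Python. This is only an illustrative sketch: the function names and the fixed word width `n_bits` are assumptions made for the example and do not appear in the text.

```python
def subtract_twos_complement(y: int, x: int, n_bits: int) -> int:
    """Return (y - x) modulo 2**n_bits, computed as y + ~x + 1."""
    mask = (1 << n_bits) - 1
    x_inverted = ~x & mask  # bit-wise complement of x within n_bits
    return (y + x_inverted + 1) & mask


def to_signed(value: int, n_bits: int) -> int:
    """Interpret an n_bits wide word as a 2's complement number."""
    sign_bit = 1 << (n_bits - 1)
    return value - (1 << n_bits) if value & sign_bit else value


raw = subtract_twos_complement(5, 9, n_bits=8)
print(raw, to_signed(raw, 8))
```

In a dual-rail logic family the complement $\overline{\mathbf{X}}$ is already available at the gate outputs, and the added 1 can be applied as the carry in of the adder.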

As the focus in Adiabatic Logic is primarily on lowest energy consumption, reducing the number of gates involved in the addition operation is the primary goal. It is thus crucial to find a design that on the one hand uses a low count of energy consuming gates and on the other hand does not call for many buffer stages for synchronization. Because the sums are calculated in parallel in the sub-fractions and the results are selected or incremented in a later stage depending on the carries, the carry select, conditional sum and carry increment adders are suitable for Adiabatic Logic only to a limited extent. The carry select scheme could lead to savings, as the pipeline depth in the circuit is decreased. Conditional sum adders only pay off in static CMOS, as they speed up the carry propagation at the cost of increased hardware effort. The increment adder will lead to a lower hardware overhead compared to carry skip and conditional sum. Only one adder is used per sub-fraction, complemented by an incrementer and additional circuitry for the group propagate signals.

The ripple-carry adder, as the simplest implementation, is taken as the reference design. Parallel-prefix adders are, due to their parallel processing, particularly suitable for Adiabatic Logic. In the following, the ripple-carry adder is discussed first (including RCA-based carry select adders), followed by different implementations of PPAs. The structures are then compared with respect to their energy consumption and active gate area.

But before any structures are discussed, the framework for the estimation of energy consumption as well as for the active gate area shall be described.

Table 6.1 Values for the energy and area consumption estimations of the investigated adder structures. All results are related to the energy and area of the BUF/INV circuit. Logic equations of the gates are given in the second column. HA, FA and GP gates consist of two parallel PFAL gates, that compute the outputs within one cycle

|  | Logic equation(s) | $\bar{E} / \overline{E_{B U F}}$ | $A / A_{B U F}$ |
| :--- | :--- | :--- | :--- |
| BUF/INV | $Y=X / Y=\bar{X}$ | 1.0 | 1.0 |
| AND $(\&)$ | $Y=A \cdot B$ | 1.9 | 1.3 |
| XOR $(\oplus)$ | $Y=A \oplus B$ | 3.7 | 1.6 |
| HA | $S=A \oplus B$ | 5.7 | 2.9 |
|  | $C=A \cdot B$ |  |  |
| FA | $S=A \oplus B \oplus C_{i n}$ | 13.4 | 4.2 |
|  | $C=A \cdot B+C_{i n} \cdot(A+B)$ |  |  |
| MUX | $Y=A \cdot S+B \cdot \bar{S}$ | 4.3 | 1.9 |
| $\bullet($ GP $)$ | $G=G_{1}+P_{1} G_{0}$ | 5.4 | 2.9 |
| G0 | $P=P_{1} \cdot P_{2}$ |  |  |

### 6.1.1 Framework for the Estimation of $E_{\text {diss }}$ and $A_{\text {active }}$

All further investigations are estimated for a 130 nm industrial CMOS process. Estimations are performed for adiabatic gates in the PFAL logic family. All gates are simulated for a voltage of $V_{D D}=1.2 \mathrm{~V}$ in the simulation setup in Sect. 2.6 and an input pattern suited for the determination of the mean energy dissipation. Table 6.1 shows the energy dissipation $\overline{E_{\text {diss }}}$ and active gate area $A_{\text {active }}$ with respect to a PFAL BUF/INV gate.

### 6.1.2 Ripple-Carry Adder (RCA)

The ripple-carry adder is the straight-forward implementation of adder topologies based on full adder gates. The sum bit $s_{i}$ and the carry $c_{i+1}$ are generated according to

$$
\begin{align*}
s_{i} & =a_{i} \oplus b_{i} \oplus c_{i},  \tag{6.1}\\
c_{i+1} & =a_{i} \cdot b_{i}+c_{i}\left(a_{i}+b_{i}\right) . \tag{6.2}
\end{align*}
$$

Fig. 6.1 A massive overhead is introduced into the RCA scheme in Adiabatic Logic, as carry signals are forwarded to the next power-clock phase. Thus incoming signals and outgoing signals have to be synchronized to the according phase via buffers. For small bit widths, the overhead is negligible, as can be seen here for the 3 bit RCA structure

In static CMOS the delay of such a structure is determined by the critical path, i.e. the carry path of the adder. As the carry path grows with the bit width $N$, the delay also increases with $N$. In Adiabatic Logic, in contrast, the pipeline depth $\mathcal{D}$ is determined by $N$. In case of the ripple-carry adder $\mathcal{D}=N$. This leads to a vast amount of buffers, as incoming signals of higher bit positions have to be delayed until they are used in the calculation and results have to be delayed until the overall computation is finished. For smaller bit widths, this overhead can still be neglected, but with rising bit widths the consumption of the buffers will dominate over the energy consumed by the full adder cells. Figure 6.1 is a scheme of a 3 bit RCA in Adiabatic Logic; it has a pipeline depth of $\mathcal{D}=3$. The input buffers have to delay signals $a_{1}$ and $b_{1}$ for one phase and $a_{2}$ and $b_{2}$ for two phases. Outgoing sums $s_{0}$ and $s_{1}$ are delayed by two and one buffer, respectively. Only a minor overhead of nine buffers is introduced in case of the 3 bit RCA. In the general case of an $N$ bit adder, the number of full adders $N_{F A}$, the number of buffers $N_{B u f}$ and the pipeline depth for the adiabatic implementation of an RCA are:

$$
\begin{align*}
N_{F A} & =N,  \tag{6.3}\\
N_{B u f} & =\frac{3}{2}\left(N^{2}-N\right),  \tag{6.4}\\
\mathcal{D} & =N . \tag{6.5}
\end{align*}
$$
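The closed form (6.4) can be cross-checked by counting the buffers bit by bit, as done for the 3 bit example. The sketch below assumes that the pattern of the 3 bit RCA generalizes directly: inputs $a_{i}$ and $b_{i}$ each wait $i$ phases, and sum bit $s_{i}$ waits $N-1-i$ phases. The function names are chosen for illustration only.

```python
def rca_buffer_count(n_bits: int) -> int:
    """Count the synchronization buffers of an adiabatic RCA by enumeration."""
    # Full adder i is evaluated in phase i, so a_i and b_i each wait i phases.
    input_buffers = sum(2 * i for i in range(n_bits))
    # Sum bit s_i waits until the last full adder has finished.
    output_buffers = sum(n_bits - 1 - i for i in range(n_bits))
    return input_buffers + output_buffers


def rca_buffer_formula(n_bits: int) -> int:
    """Closed form (6.4): 3/2 * (N^2 - N)."""
    return 3 * (n_bits * n_bits - n_bits) // 2


for n in (3, 8, 32):
    print(n, rca_buffer_count(n), rca_buffer_formula(n))
```

Both counts agree because the input side contributes $N(N-1)$ buffers and the output side $\frac{1}{2} N(N-1)$.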

For RCAs of higher bit width, the buffer overhead will dominate the overall energy consumption, and the area will also be mainly determined by the buffers. Above a certain bit width, implementing a carry select adder (CSEA) with RCAs as sketched in Fig. 6.2 pays off. If the RCA is split into $k$ fractions, $1+2(k-1)$ RCAs with a bit width of $\frac{N}{k}$ are assembled in the design.

The hardware amount for the $1+2(k-1)$ RCA blocks is determined via:

$$
\begin{align*}
N_{F A} & =(1+2(k-1)) \frac{N}{k}  \tag{6.6}\\
N_{B u f}^{\prime} & =(1+2(k-1)) \frac{3}{2}\left(\left(\frac{N}{k}\right)^{2}-\frac{N}{k}\right),  \tag{6.7}\\
\mathcal{D}^{\prime} & =\frac{N}{k} \tag{6.8}
\end{align*}
$$

Fig. 6.2 Scheme of an $N$ bit carry select adder with $k$ fractions. Each sub-adder in this case is an RCA of width $\frac{N}{k}$. Input vectors are applied at the top of the scheme and data is processed from top to bottom. Each adder block except the first is implemented twice, once for each possible carry input. The multiplexer of block 2 then selects the right output with respect to the carry of block 1 and so on. Thus the outputs of blocks 3 to $k$ have to be synchronized with two buffers, one for each possible output sum. Buffers for the carry of each block are neglected in this scheme. All outputs processed by blocks 1 up to $k-1$ have to be de-skewed by buffers, so that after the last multiplexer decision is performed, all signals of the according input value are available synchronously at the output

As the multiplexer output also selects the carry for the subsequent multiplexer, buffers synchronize the data outputs of the adder blocks. This spans a gate array at the outputs of dimension $(k-1) N$, where $(k-1) \frac{N}{k}$ gates are the basic buffer cells. The number of single buffers can be calculated via $(k-1) \frac{N}{2}$ and the number of double buffers is $(k-1) \cdot\left(\frac{N}{2}-\frac{N}{k}\right)$. Hardware used for the multiplex and synchronization stage is calculated via

$$
\begin{align*}
N_{M u x} & =(k-1) \frac{N}{k},  \tag{6.9}\\
N_{B u f}^{\prime \prime} & =(k-1)\left(\frac{3 N}{2}-\frac{2 N}{k}\right),  \tag{6.10}\\
\mathcal{D}^{\prime \prime} & =k-1 . \tag{6.11}
\end{align*}
$$

Overall, the CSEA consumes $N_{B u f}^{\prime}+N_{B u f}^{\prime \prime}$ buffers and the pipeline depth is $\mathcal{D}^{\prime}+\mathcal{D}^{\prime \prime}$. Estimations are performed for different bit widths. Equations (6.3) and (6.4) are evaluated for the RCA and (6.6), (6.7), (6.9), and (6.10) for the CSEA. The corresponding dissipation values and active areas of the full adder (FA) and the multiplexer (MUX), relative to the buffer (BUF) circuit, are taken from Table 6.1. Results for $E_{C S E A} / E_{R C A}$ and $A_{C S E A} / A_{R C A}$ are presented in Table 6.2.

Table 6.2 Results of the comparison between CSEA and RCA for different values of $N$ and $k$. Each entry gives the energy relation $E_{C S E A} / E_{R C A}$, followed in parentheses by the active area relation $A_{C S E A} / A_{R C A}$, both in %. The optimum values of $k$ for $N \geq 32$ are highlighted in the table

| $N$ | $k=2$ | $k=4$ | $k=8$ | $k=16$ | $k=32$ |
| ---: | :--- | :--- | :--- | :--- | :--- |
| 4 | $139.7(114.9)$ |  |  |  |  |
| 8 | $123.5(98.6)$ | $135.3(98.0)$ |  |  |  |
| 16 | $107.3(88.0)$ | $104.7(73.6)$ | $112.8(79.0)$ |  |  |
| 32 | $94.4(81.9)$ | $80.3(59.5)$ | $\mathbf{77.0}(\mathbf{52.7})$ | $89.5(66.0)$ |  |
| 64 | $85.8(78.5)$ | $64.0(51.8)$ | $\mathbf{53.2}(\mathbf{38.5})$ | $55.1(39.8)$ | $72.4(58.4)$ |
| 128 | $80.7(76.8)$ | $54.5(47.8)$ | $39.2(31.1)$ | $\mathbf{34.8}(\mathbf{26.1})$ | $41.2(32.6)$ |
| 256 | $77.9(75.9)$ | $49.3(45.8)$ | $31.5(27.3)$ | $\mathbf{23.8}(\mathbf{19.2})$ | $24.4(19.5)$ |
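The estimation behind Table 6.2 can be retraced with a short script. The sketch below evaluates (6.3), (6.4), (6.6), (6.7), (6.9) and (6.10) with the FA, MUX and BUF weights of Table 6.1. The function and dictionary names are illustrative assumptions, and the script is a plain re-evaluation of the printed equations rather than the original estimation tool, so small rounding differences against the table are to be expected.

```python
# Relative energy and area per gate from Table 6.1 (BUF = 1.0).
ENERGY = {"FA": 13.4, "MUX": 4.3, "BUF": 1.0}
AREA = {"FA": 4.2, "MUX": 1.9, "BUF": 1.0}


def rca_counts(n):
    """Gate counts of an N bit RCA, (6.3) and (6.4)."""
    return {"FA": n, "MUX": 0, "BUF": 1.5 * (n * n - n)}


def csea_counts(n, k):
    """Gate counts of an N bit CSEA with k fractions, (6.6), (6.7), (6.9), (6.10)."""
    width = n / k
    blocks = 1 + 2 * (k - 1)
    buffers_in_blocks = blocks * 1.5 * (width * width - width)
    buffers_at_outputs = (k - 1) * (1.5 * n - 2 * n / k)
    return {
        "FA": blocks * width,
        "MUX": (k - 1) * width,
        "BUF": buffers_in_blocks + buffers_at_outputs,
    }


def cost(counts, weights):
    """Weighted sum of gate counts, in units of one buffer."""
    return sum(counts[gate] * weights[gate] for gate in counts)


def csea_vs_rca(n, k):
    """Return (energy ratio, area ratio) of CSEA to RCA in percent."""
    rca, csea = rca_counts(n), csea_counts(n, k)
    return (100 * cost(csea, ENERGY) / cost(rca, ENERGY),
            100 * cost(csea, AREA) / cost(rca, AREA))


for n, k in [(4, 2), (32, 8), (256, 16)]:
    energy, area = csea_vs_rca(n, k)
    print(f"N={n:3d} k={k:2d}: E={energy:.1f}%  A={area:.1f}%")
```

For the three sampled entries the script lands within about 0.1 percentage points of the tabulated values, which suggests that the table follows directly from the gate count equations.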

Results for $k \geq N$ do not appear in the table, as $k=N$ corresponds to a carry save adder and values of $k>N$ do not make sense. For small bit widths $(N<32)$ the RCA implementation consumes less energy. Here the overhead due to the buffers in the RCA is small compared to the energy consumed in the full adders, so that the overhead of the CSEA structure does not justify its application. For $N=16$, however, the active gate area of the CSEA is $12-26 \%$ smaller than that of the RCA, while the energy is increased by only $5-13 \%$. On the one hand, the larger number of full adders in the CSEA circuit degrades the energy saving, as a full adder consumes 13.4 times more energy than a buffer gate. On the other hand, as the ratio of the active areas of FA and BUF is not as excessive, area savings are observed. Here a trade-off between energy and area consumption is offered.

As soon as bit widths $N \geq 32$ are implemented, energy and area benefits are observed for the CSEA structure. Basically the savings increase with the bit width $N$. This is due to the excessive overhead of buffers in the RCA. The breakdown of the circuit into $k$ fractions also impacts the savings greatly. The optimum of the investigated values of $k$ for each bit width is highlighted in the table. For $N$ of 32 or 64 bit, $k=8$ delivers the largest savings. For 128 and 256 bit, 16 fractions provide the highest savings. The increase in the energy consumption for high values of $k$ can be explained by taking a look at the multiplexing circuit at the outputs of the RCAs. If $k=N$, each RCA fraction collapses to a single full adder gate. Thus each multiplexer will only consist of two single gates that switch the sum and carry combination for the according input value. The pipeline depth $\mathcal{D}^{\prime \prime}$ will increase to $N-1$ and so will the buffers (which dominate the consumption in the RCA for the assumed bit widths $N \geq 32$) that are used in the synchronization stages before and after the multiplexers. So the amount of buffers will be the same as in the RCA, but the count of FAs will be almost twice as high, and additionally the MUXes will consume energy and add area. With the optimum $k$, the CSEA consumes $77.0 \%, 53.2 \%, 34.8 \%$, and $23.8 \%$ of the energy of the RCA for $N=32$ with $k=8, N=64$ with $k=8, N=128$ with $k=16$, and $N=256$ with $k=16$, respectively.

### 6.1.3 Parallel-Prefix Adders (PPA)

In the case of an $N$ bit adder, each output $s_{i}(i=0,1, \ldots, N-1)$ depends on all input signals $a_{j}, b_{j}$ and the generated carries $c_{j}$ for $j=0,1,2, \ldots, i$. Two signals are defined in the calculation of the sum in a basic full adder cell. Either a carry is generated at bit position $i$ or an incoming carry is forwarded to carry out. Generation $G_{i}$ and propagation $P_{i}$ signals at bit position $i$ are defined as

$$
\begin{align*}
G_{0} & =a_{0} b_{0}+c_{i n}\left(a_{0}+b_{0}\right),  \tag{6.12}\\
G_{i} & =a_{i} b_{i} \quad \text { with } i=1,2, \ldots, N-1,  \tag{6.13}\\
P_{i} & =a_{i} \oplus b_{i} \quad \text { with } i=0,1, \ldots, N-1 . \tag{6.14}
\end{align*}
$$

The sum output at bit position $i$ is then calculated from $P_{i}$ and the incoming carry $c_{i}$, and the outgoing carry $c_{i+1}$ is given by a group generate signal:

$$
\begin{align*}
s_{i} & =P_{i} \oplus c_{i},  \tag{6.15}\\
c_{i+1} & =G_{i: 0} . \tag{6.16}
\end{align*}
$$

Group generate $G_{i: j}$ and group propagate $P_{i: j}$ signals [105] are introduced. $G_{i: j}$ indicates if a carry is generated in one of the bit positions $j, \ldots, i(i \geq j)$ and propagates to bit position $i$. The group propagate signal $P_{i: j}$ indicates if a carry applied at position $j$ is propagated through the whole group to position $i$. The signals are grouped into ( $G_{i: j}, P_{i: j}$ ) and the $\bullet(\mathrm{GP})$ operator is defined as

$$
\begin{align*}
\left(G_{i: j}, P_{i: j}\right) & =\left(G_{i: k}, P_{i: k}\right) \bullet\left(G_{k: j}, P_{k: j}\right) \\
& =\left(G_{i: k}+P_{i: k} G_{k: j}, P_{i: k} P_{k: j}\right) \tag{6.17}
\end{align*}
$$

with $i \geq k \geq j$. Thus the dot operator $\bullet$ defines group generate and group propagate. A carry is generated in the group ($G_{i: j}$) if a carry is generated in the upper subgroup ($G_{i: k}$), or if a carry is generated in the lower subgroup ($G_{k: j}$) and then propagated from $k$ to $i$. A carry-in at bit position $j$ is propagated to position $i$ only if the carry is propagated through both subgroups. According to (6.15) and (6.16), all group generate and propagate signals $G_{i: 0}$ and $P_{i: 0}$ for $0 \leq i<N$ have to be calculated to find the resultant sum vector.

This can be formulated as a prefix problem [106], where all products $x_{0} \diamond \cdots \diamond x_{k}$ for $k=1,2, \ldots, N-1$ of a binary word with $N$ inputs $x_{0}, x_{1}$, $\ldots, x_{N-1}$ and an associative operator $\diamond$ shall be computed. Different implementations of grouping generate and propagate signals are presented in [107-110]. These implementations will be discussed briefly and then evaluated with respect to the energy dissipation and the active gate area consumption in Adiabatic Logic. All PPAs are compared with the RCA structure, as for small bit widths an adiabatic RCA will probably be the more energy efficient design, since its buffer overhead is still small.
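To make the prefix formulation concrete, the following sketch implements the $\bullet$ operator of (6.17) and solves the prefix problem serially with `itertools.accumulate`. It is a behavioural model only and says nothing about gate level implementation. The helper names (`dot`, `prefix_add`, `to_bits`, `from_bits`) and the little-endian bit lists are assumptions made for the example.

```python
from itertools import accumulate


def dot(upper, lower):
    """GP operator of (6.17): combine an upper group with the adjacent lower group."""
    g_up, p_up = upper
    g_lo, p_lo = lower
    return (g_up | (p_up & g_lo), p_up & p_lo)


def to_bits(value, n_bits):
    """Little-endian list of the n_bits lowest bits of value."""
    return [(value >> i) & 1 for i in range(n_bits)]


def from_bits(bits):
    """Inverse of to_bits."""
    return sum(bit << i for i, bit in enumerate(bits))


def prefix_add(a_bits, b_bits, c_in=0):
    """Add two little-endian bit lists via a serial prefix scan."""
    gp = [(a & b, a ^ b) for a, b in zip(a_bits, b_bits)]  # (6.13), (6.14)
    a0, b0 = a_bits[0], b_bits[0]
    gp[0] = ((a0 & b0) | (c_in & (a0 | b0)), a0 ^ b0)      # (6.12)
    # Prefix problem: (G_{i:0}, P_{i:0}) for every bit position i.
    groups = list(accumulate(gp, lambda low, up: dot(up, low)))
    carries = [c_in] + [g for g, _ in groups]               # (6.16)
    sums = [p ^ c for (_, p), c in zip(gp, carries)]        # (6.15)
    return sums, carries[-1]


sum_bits, c_out = prefix_add(to_bits(11, 4), to_bits(6, 4))
print(from_bits(sum_bits) + (c_out << 4))
```

Because $\bullet$ is associative, the serial scan used here can be regrouped into tree shaped networks. The prefix adder schemes discussed in this section differ only in how they perform that regrouping.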

Fig. 6.3 Common circuit block in case of an 8 bit PPA. $G P$ gates compute the propagate $P_{i}$ and generate signals $G_{i}$ from the input vectors $\mathbf{a}$ and $\mathbf{b}$, where at position 0 the carry in signal $c_{\text {in }}$ is included and thus (6.12) is implemented for $G_{0}$


### 6.1.3.1 Common Circuit Blocks to All PPA Structures

All parallel prefix adders share a common structure. Bit-wise propagate $P_{i}$ and generate $G_{i}$ signals are computed in the first stage according to (6.12), (6.13) and (6.14). Final sum generation is also common to all PPA structures. XOR gates are used to calculate the sum outputs (6.15). In Fig. 6.3 the common circuit block for an 8 bit PPA is sketched. At the outputs, the XOR gate generates the sums. Equation (6.15) shows, that the propagate signals $P_{i}$ are needed for this operation. So buffer chains have to be inserted in adiabatic PPA circuits, to synchronize the $P_{i}$ signals generated with the $G P$ gates to the outputs. The number of consecutive buffers is dependent on the pipeline depth $\mathcal{D}_{P P A}$ of the corresponding PPA structure. The carry out $c_{\text {out }}$ is identical to the highest group generate $G_{(N-1): 0}$. The effort for the common block for a bit width of $N$ is

$$
\begin{align*}
N_{G 0} & =1,  \tag{6.18}\\
N_{A N D} & =N-1,  \tag{6.19}\\
N_{X O R} & =2 N,  \tag{6.20}\\
N_{B u f} & =\mathcal{D}_{P P A} \cdot N+1 . \tag{6.21}
\end{align*}
$$

One buffer results from $c_{o u t}$, that has to be synchronized to the outputs of the final sum, that are calculated via an XOR operation. This is common to all prefix schemes, so that this amount has to be added on top of the prefix effort, to compare the according PPAs to the RCA implementation. Now four different prefix algorithms are examined.

### 6.1.3.2 Sklansky PPA

Fig. 6.4 Scheme of the parallel prefix graph for an 8 bit Sklansky PPA

The Sklansky PPA scheme [110] sketched in Fig. 6.4 offers the lowest possible depth of $\mathcal{D}=\left\lceil\log _{2} N\right\rceil$. A disadvantage is that the maximum fan out $\mathrm{FO}_{\max }$ increases with the number of stages. Gates have to be inserted that are able to drive the high fan out branches. For simplicity, the possibility to use gates with higher driver strengths is neglected here. The number of $\bullet$ gates is $\frac{1}{2} N$ in each stage. Operators $\circ$ do not alter the input data. In static CMOS those gates can be skipped, but due to the micropipeline in Adiabatic Logic, two buffers are used for each $\circ$. Effort, maximum fan out and pipeline depth for the Sklansky prefix adder are

$$
\begin{align*}
N_{\bullet} & =\frac{1}{2} N\left\lceil\log _{2} N\right\rceil,  \tag{6.22}\\
N_{B u f} & =N\left\lceil\log _{2} N\right\rceil,  \tag{6.23}\\
\mathrm{FO}_{\max } & =\frac{1}{2} N,  \tag{6.24}\\
\mathcal{D} & =\left\lceil\log _{2} N\right\rceil . \tag{6.25}
\end{align*}
$$
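A behavioural sketch of the Sklansky network reproduces the operator count (6.22) and the maximum fan out (6.24) for power-of-two bit widths. The indexing scheme is the usual textbook formulation of the Sklansky graph and is not taken from Fig. 6.4. Fan out is counted here as the number of $\bullet$ operators reading one node, which is an assumption about the counting convention.

```python
def sklansky_prefix(gp):
    """Compute all (G_{i:0}, P_{i:0}) with a Sklansky network.

    gp is a list of bit-wise (G_i, P_i) pairs whose length is a power of two.
    Returns (prefix results, number of GP operators, maximum fan out).
    """
    n = len(gp)
    stages = n.bit_length() - 1          # log2(n) for a power of two
    nodes = list(gp)
    operators, max_fan_out = 0, 0
    for d in range(stages):
        readers = {}
        updated = list(nodes)
        for i in range(n):
            if (i >> d) & 1:             # i lies in the upper half of its block
                j = ((i >> d) << d) - 1  # top node of the lower half
                g_up, p_up = nodes[i]
                g_lo, p_lo = nodes[j]
                updated[i] = (g_up | (p_up & g_lo), p_up & p_lo)
                operators += 1
                readers[j] = readers.get(j, 0) + 1
        max_fan_out = max(max_fan_out, max(readers.values()))
        nodes = updated
    return nodes, operators, max_fan_out


a, b = 0b10110101, 0b01101011
bitwise = [((a >> i) & (b >> i) & 1, ((a >> i) ^ (b >> i)) & 1) for i in range(8)]
_, n_ops, fan_out = sklansky_prefix(bitwise)
print(n_ops, fan_out)
```

For $N=8$ the model uses $\frac{1}{2} \cdot 8 \cdot 3=12$ operators, and the most heavily loaded node drives $\frac{1}{2} N=4$ operators in the last stage.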

### 6.1.3.3 Brent-Kung (BK) PPA

To limit the maximum fan out to 2, Brent and Kung proposed a structure [107] that is sketched for an 8 bit adder in Fig. 6.5. On the downside, the amount of buffers is increased, as the pipeline depth is increased to $\mathcal{D}=2\left\lceil\log _{2} N\right\rceil-1$. The amount of $\bullet$ gates is reduced compared to Sklansky. Summarizing, the following equations determine the effort for the Brent-Kung PPA:

$$
\begin{align*}
N_{\bullet} & =2 N-\left\lceil\log _{2} N\right\rceil-2,  \tag{6.26}\\
N_{B u f} & =2\left(2 N\left\lceil\log _{2} N\right\rceil+\left\lceil\log _{2} N\right\rceil-3 N+2\right),  \tag{6.27}\\
\mathrm{FO}_{\max } & =2,  \tag{6.28}\\
\mathcal{D} & =2\left\lceil\log _{2} N\right\rceil-1 . \tag{6.29}
\end{align*}
$$

### 6.1.3.4 Kogge-Stone (KS) PPA

Fig. 6.5 Scheme of the parallel prefix graph for an 8 bit Brent-Kung PPA

Fig. 6.6 Scheme of the parallel prefix graph for an 8 bit Kogge-Stone PPA

Kogge and Stone proposed a scheme that allows the minimum possible pipeline depth of $\mathcal{D}=\left\lceil\log _{2} N\right\rceil$ and a fixed fan out of 2 [109]. But this comes at the expense of a massive increase in $\bullet$ gates and a high amount of long wires, as also seen in the scheme in Fig. 6.6. Long wires will impact the layout and add a certain amount of capacitance that the gates have to drive. Additional capacitance leads to additional energy consumption. Overall this structure has a hardware effort, $\mathrm{FO}_{\max }$ and $\mathcal{D}$ of

$$
\begin{align*}
N_{\bullet} & =N \cdot\left\lceil\log _{2} N\right\rceil-N+1,  \tag{6.30}\\
N_{B u f} & =2(N-1),  \tag{6.31}\\
\mathrm{FO}_{\max } & =2,  \tag{6.32}\\
\mathcal{D} & =\left\lceil\log _{2} N\right\rceil . \tag{6.33}
\end{align*}
$$
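The operator count (6.30) can be observed in a behavioural model of the Kogge-Stone network. The sketch below uses the common formulation with a combining distance that doubles in every stage; the function name and the integer interface are assumptions for illustration, and the model does not capture wiring or fan out.

```python
def kogge_stone_add(a, b, n_bits, c_in=0):
    """Add two n_bits wide integers with a Kogge-Stone prefix network.

    Returns (sum, carry out, number of GP operators used).
    """
    g = [(a >> i) & (b >> i) & 1 for i in range(n_bits)]    # (6.13)
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n_bits)]  # (6.14)
    g[0] |= c_in & ((a | b) & 1)                            # (6.12)
    group_g, group_p = list(g), list(p)
    operators = 0
    distance = 1
    while distance < n_bits:
        new_g, new_p = list(group_g), list(group_p)
        for i in range(distance, n_bits):
            # (6.17): merge the group ending at i with the group ending at i - distance.
            new_g[i] = group_g[i] | (group_p[i] & group_g[i - distance])
            new_p[i] = group_p[i] & group_p[i - distance]
            operators += 1
        group_g, group_p = new_g, new_p
        distance *= 2
    carries = [c_in] + group_g                              # (6.16)
    total = sum((p[i] ^ carries[i]) << i for i in range(n_bits))  # (6.15)
    return total, carries[n_bits], operators


print(kogge_stone_add(200, 100, 8))
```

For $N=8$ the three stages contain $7+6+4=17$ operators, in agreement with $N \cdot\left\lceil\log _{2} N\right\rceil-N+1$.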

### 6.1.3.5 Han-Carlson (HC) PPA

The scheme (Fig. 6.7) proposed by Han and Carlson [108] achieves a good trade-off between pipeline depth $\mathcal{D}$ and the amount of $\bullet$ gates used, with a fixed fan out of 2. The Han-Carlson PPA is characterized with respect to hardware effort, $\mathrm{FO}_{\max }$ and $\mathcal{D}$ according to

$$
\begin{align*}
N_{\bullet} & =\frac{1}{2} N\left\lceil\log _{2} N\right\rceil,  \tag{6.34}\\
N_{B u f} & =N\left\lceil\log _{2} N\right\rceil+2 N,  \tag{6.35}\\
\mathrm{FO}_{\max } & =2,  \tag{6.36}\\
\mathcal{D} & =\left\lceil\log _{2} N\right\rceil+1 . \tag{6.37}
\end{align*}
$$
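The effort equations of the four prefix kernels can be collected in one helper and checked for internal consistency. The sketch assumes that the rule stated for the Sklansky scheme (two buffers for every position of the $N \times \mathcal{D}$ grid that is not occupied by a $\bullet$ operator) applies to all four schemes. This is an inference from the printed equations rather than a statement of the text, and the Brent-Kung depth is taken from (6.29).

```python
def ceil_log2(n):
    """Integer ceil(log2(n)) for n >= 1."""
    return (n - 1).bit_length()


def ppa_effort(n):
    """Prefix kernel effort per (6.22)-(6.37): (GP operators, buffers, depth)."""
    lg = ceil_log2(n)
    return {
        "Sklansky": (n * lg // 2, n * lg, lg),
        "Brent-Kung": (2 * n - lg - 2, 2 * (2 * n * lg + lg - 3 * n + 2), 2 * lg - 1),
        "Kogge-Stone": (n * lg - n + 1, 2 * (n - 1), lg),
        "Han-Carlson": (n * lg // 2, n * lg + 2 * n, lg + 1),
    }


def buffers_from_grid(n, operators, depth):
    """Two buffers for each grid position without a GP operator (assumed rule)."""
    return 2 * (n * depth - operators)


for n in (8, 32, 256):
    for name, (ops, bufs, depth) in ppa_effort(n).items():
        print(n, name, ops, bufs, depth, bufs == buffers_from_grid(n, ops, depth))
```

For power-of-two bit widths all four buffer counts coincide with $2\left(N \cdot \mathcal{D}-N_{\bullet}\right)$, which gives a quick plausibility check when the equations are transcribed.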

Fig. 6.7 Scheme of the parallel prefix graph for an 8 bit Han-Carlson PPA


Fig. 6.8 Estimation of the energy consumption for RCA, CSEA (opt) and diverse PPA implementations


### 6.1.3.6 Comparison of the Ripple-Carry Adder and Parallel-Prefix Adders

The implementations of the RCA and the PPAs are compared using the presented gate count equations. The overall gate count, including the common block structure, is summarized in Table 6.3. Energy dissipation for the gates is determined according to the framework presented in Sect. 6.1.1. As a measure of the area consumed by the different implementations, the active gate area is summed up. A CSEA implementation of an RCA structure is added as well. For this, the values for optimum energy and area are taken from Table 6.2. Thus a CSEA with $k=8$ is taken for the estimation for $N=32$ and 64, and one with $k=16$ for $N=128$ and 256. The corresponding values are indicated as CSEA (opt) in Figs. 6.8, 6.9, 6.10 and 6.11.

Table 6.3 Overview of RCA and diverse PPA implementations. $\mathcal{D}_{P P A}$ is the pipeline depth of the prefix kernel, and $\mathcal{D}$ is the overall pipeline depth. $\mathrm{FO}_{\max }$ is the maximum fan-out within the structure and $N_{x y}$ is the count for gate $x y$. According to these values based on the bit width $N$, estimations are carried out to allow for comparison of the various structures

Table 6.4 Energy consumption of the RCA relative to the PPAs and the CSEA (opt) for different bit widths. The best architecture for each bit width is highlighted in bold. For all bit widths the Sklansky PPA is superior with respect to energy consumption, closely followed by the Han-Carlson PPA structure

| $N$ | $\frac{E_{R C A}}{E_{S k l a n s k y}}$ | $\frac{E_{R C A}}{E_{B K}}$ | $\frac{E_{R C A}}{E_{K S}}$ | $\frac{E_{R C A}}{E_{H C}}$ | $\frac{E_{R C A}}{E_{C S E A(\text { opt })}}$ |
| ---: | :--- | :---: | :---: | :---: | :---: |
| 32 | $\mathbf{1.71}$ | 1.40 | 1.57 | 1.67 | 1.30 |
| 64 | $\mathbf{2.73}$ | 2.20 | 2.43 | 2.66 | 1.88 |
| 128 | $\mathbf{4.61}$ | 3.71 | 4.02 | 4.51 | 2.87 |
| 256 | $\mathbf{8.10}$ | 6.49 | 6.93 | 7.93 | 4.20 |

Figure 6.8 shows the energy dissipation of the adder structures. The RCA, due to its immense buffer overhead, shows a steep rise in energy consumption for high bit widths. A quadratic dependency on the bit width is observed for the RCA. Splitting the RCA into sub-fractions by means of a CSEA reduces the energy consumption remarkably. Though great savings compared to the RCA are gained, the CSEA consumes more energy than all investigated PPAs. PPA structures show a linear energy consumption versus the bit width. The different structures do differ in the overall energy consumption. The Brent-Kung structure shows the highest energy consumption, and the Sklansky structure the lowest. Factors with respect to the RCA are summarized in Table 6.4. In case of the area consumption, a strong correlation between the overall energy consumption in Fig. 6.8 and the area in Fig. 6.9 is seen.

This is expected, as the energy consumed depends on the overall capacitance, which is dominated by the active gate area capacitance as long as interconnect capacitances can be neglected. Due to the buffers in the RCA, the area consumed will also rise tremendously with the bit width. Statements derived for the energy consumption can be translated into statements on the consumed area. Because it has the deepest pipeline, the area consumption of the Brent-Kung structure is increased appreciably. A higher number of buffers is inserted in the Brent-Kung structure. This increases the area consumption, but does not affect the energy consumption to the same degree.

Fig. 6.9 Estimation of the area consumption for RCA, CSEA (opt) and diverse PPA implementations

Fig. 6.10 Estimation of the energy consumption for RCA, CSEA (opt) and diverse PPA implementations

For smaller bit widths, plots of the energy and area consumption are presented in Figs. 6.10 and 6.11, which magnify bit widths up to 32 bit. For small widths ($N<20$), the RCA structure performs comparably to all PPA structures with respect to energy and area consumption. At 8 bit, the RCA shows the lowest energy and area consumption; here the Sklansky PPA structure consumes $10 \%$ more energy than the RCA. At 16 bit, the RCA consumes roughly $20 \%$ more than the Sklansky PPA. Due to the highly regular layout of the RCA, adders up to 16 bit are preferably designed in the RCA structure.

Thus rules can be derived from these results as to which architecture should be chosen for a certain bit width. For adder structures of less than 20 bits, the RCA structure is a good choice, as the energy overhead due to the buffers then has little weight in the overall consumption. For higher bit widths, the structure of choice is the Sklansky or the Han-Carlson structure, if a PPA structure shall be used. Both structures differ only slightly with respect to energy and area consumption. But they do differ in the maximum fan-out and the pipeline depth $\mathcal{D}$. The fan-out is pinned to 2 for Han-Carlson, whereas it rises with $\frac{1}{2} N$ for Sklansky. The difference in $\mathcal{D}$ becomes insignificant for high bit widths. Thus for high bit widths, Han-Carlson is probably the structure of choice due to its maximum fan-out of 2. Alternatively, a CSEA architecture based on RCA blocks is a good replacement for the simple RCA structure, although it is still inferior to all PPA structures.

Fig. 6.11 Estimation of the area consumption for RCA, CSEA (opt) and diverse PPA implementations

**Comparison of a Han-Carlson PPA in PFAL and CMOS** The Han-Carlson structure is taken for the comparison of the adiabatic implementation in PFAL with that in static CMOS. Both are implemented as 8 bit and 16 bit adders in a 130 nm CMOS technology and simulated for two different voltages, in order to also investigate the impact of voltage scaling on the energy saving factor (ESF). Results of the simulations are shown in Fig. 6.12. Simulation results for the PFAL implementation are also compared to the estimation results obtained with the gate count equations presented above. Voltage scaling is performed down to 0.8 V for PFAL and down to 0.6 V for static CMOS. The estimated results for PFAL are also voltage scaled by multiplying the estimation values with a factor of $\frac{0.8 \mathrm{~V}}{1.2 \mathrm{~V}}$. Errors of the estimation are presented in Table 6.5 and the energy saving factors are summarized in Table 6.6. For the nominal supply voltage, the estimation matches the simulation well for both the 8 bit and the 16 bit Han-Carlson PFAL adder. Only a slight overestimation of less than $5 \%$ is observed, which justifies the estimation method presented. The voltage scaled estimation underestimates the energy dissipation to some extent. This is because the estimation is simply multiplied by the voltage scaling factor $\frac{0.8 \mathrm{~V}}{1.2 \mathrm{~V}}$. In reality, the scaling deviates from this rule due to effects that are not covered in this model. A better fit can be expected if the gates are characterized for the reduced supply voltage and the estimations are evaluated with those values.


Fig. 6.12 Simulation results of a Han-Carlson structure in static CMOS and the PFAL family. All results are referenced to the energy consumption of a single PFAL buffer. Estimation results for the PFAL implementation are given, where voltage scaling was used for the estimation according to: (HC PFAL@0.8 V (Est.)) $=\frac{0.8 \mathrm{~V}}{1.2 \mathrm{~V}}$ (HC PFAL@1.2 V (Est.)). Estimation errors and the ESF for both pairs of supply voltages are presented in Tables 6.5 and 6.6

Table 6.5 Errors of the estimation procedure for the two pairs of supply voltages in the simulation

|  | 8 bit | 16 bit |
| :--- | :--- | :--- |
| $\frac{E_{P F A L}(\text {Est.})}{E_{P F A L}(\text {Sim.})}-1$ @ 1.2 V | $+2.51 \%$ | $+4.95 \%$ |
| $\frac{E_{P F A L}(\text {Est.})}{E_{P F A L}(\text {Sim.})}-1$ @ 0.8 V | $-15.50 \%$ | $-16 \%$ |

Table 6.6 The ESF values for the two pairs of supply voltages in the simulation

|  | 8 bit | 16 bit |
| :--- | ---: | ---: |
| ESF@1.2 V, 1.2 V | 17.53 | 15.31 |
| ESF@0.8 V, 0.6 V | 4.99 | 4.23 |

High energy saving factors can be observed for both bit sizes when the nominal supply voltage of 1.2 V is applied to PFAL and static CMOS. From Sect. 2.4 it is clear, that voltage scaling will degrade the energy saving factor. PFAL is limited in its capability to handle low voltages, as the minimum supply voltage has to be at least $2 V_{t h, n}$.

Even for the aggressively reduced supply voltage of 0.6 V for the static CMOS implementation versus 0.8 V for AL, the adiabatic adder consumes only $20 \%$ and $24 \%$ of the energy consumed by the static CMOS design for 8 bit and 16 bit, respectively. In addition, the layout introduces timing issues in the static CMOS implementation that are not covered in the simulation. These will give rise to increased activity in the static CMOS adder circuit and will thus improve the ESF.

Fig. 6.13 The combination of 2-input XOR gates requires the insertion of a buffer circuit in AL. By introducing a 3-input XOR gate $\left(\oplus_{3}\right)$ that implements the same Boolean behavior, the buffer can be skipped; the latency is reduced in any case, and the overall energy consumption should be reduced as well

### 6.2 Overhead Reduction by Applying Complex Gates

Synchronization issues in Adiabatic Logic caused by the inherent pipelining can lead to overhead due to buffers, as seen in the RCA adder. On the one hand, the pipeline depth $\mathcal{D}$ is a measure of the overhead in the RCA structure, as a deeper pipeline calls for more synchronization effort. On the other hand, the micropipeline offers a design procedure that allows structures to be implemented without considering timing issues. The adiabatic principle enforces an operating frequency with a period that is much larger than the intrinsic delay of the gate. Intrinsic gate delay, and thus energy, can therefore be traded for a more compact design by applying complex gates. Complex gates could offer an overall energy reduction by reducing the buffer overhead. A simple example motivates their application. The arrangement of $a \oplus b \oplus c$ in Fig. 6.13 demands one buffer insertion if two 2-input XOR gates are used. To avoid the buffer, and thus the energy it consumes, a 3-input XOR gate $\left(\oplus_{3}\right)$ can be introduced. Its introduction only pays if $E_{\oplus_{3}}<2 \cdot E_{\oplus}+E_{B u f}$. In this example the pipeline depth is reduced from $\mathcal{D}=2$ to $\mathcal{D}=1$. If this substructure is embedded in a larger system, buffers could also be saved in the surrounding circuits. Two effects are thus gained from the application of complex gates: the overhead due to buffering is reduced, and the latency is reduced. Complex gates are therefore useful if the energy has to be reduced further and/or if a reduced latency is of interest.

XOR gates are basic building cells for adder structures and are thus also important for a variety of signal processing tasks. In Sect. 2.5 it is mentioned that dual-rail encoded signals in Adiabatic Logic allow for a compact assembly of XOR gates. A 2-input XOR ECRL gate is shown in Fig. 2.8. It can easily be adapted to an $N$-input XOR gate by extending it according to the scheme presented in Fig. 6.14. This procedure is also valid for the PFAL XOR gate.

Fig. 6.14 An $N$-input ECRL gate $\left(\text{out}=x_{1} \oplus x_{2} \oplus \cdots \oplus x_{n-1} \oplus x_{n}\right)$ is constructed by extending the circuit by 4 transistors for each additional input; a very compact implementation is thus guaranteed


### 6.2.1 Impact of Increased Input Stack on the Energy Dissipation

ECRL and PFAL gates are investigated with respect to the energy dissipation when the number of input transistors is increased. An XOR gate is symmetric: the output nodes out and $\overline{out}$ see the same structure. More complex logic blocks increase the capacitance connected to the output nodes for both families. In PFAL, the logic blocks support the PMOS transistors during the evaluation interval. Thus the loading path resistance is affected by a modification of the logic blocks. In ECRL no change is observed in the loading path resistance, as it consists of the PMOS channels only.

Gates are implemented in a 130 nm CMOS technology and are simulated for all possible input patterns and transitions. Results are presented in Fig. 6.15 for ECRL and in Fig. 6.16 for PFAL.

An increased number of inputs will lead to an increased load capacitance in the case of ECRL and PFAL, and additionally will modify the loading path resistance in PFAL.

Fig. 6.15 $N$-input ECRL XOR gate simulation in a 130 nm CMOS technology. The solid line shows the consumption of the complex $N$-input gate, the dashed line is the energy consumption if ($N-1$) 2-input gates are used. Energy consumption in both cases is given with respect to the energy consumed by the 2-input XOR gate

Fig. 6.16 $N$-input PFAL XOR gate simulation in a 130 nm CMOS technology. The solid line shows the consumption of the complex $N$-input gate, the dashed line is the energy consumption if ($N-1$) 2-input gates are used. Energy consumption in both cases is given with respect to the energy consumed by the 2-input XOR gate

One $N$-input XOR gate can replace $(N-1)$ 2-input XOR gates. Thus, for the complex XOR gate to pay, $E_{\oplus_{N}}<(N-1) \cdot E_{\oplus}+\sum E_{B u f}$ has to be fulfilled. In both Figs. 6.15 and 6.16 the solid line is the dissipation of the complex gate with $N$ inputs. The dashed line shows the cumulative energy when the complex gate is replaced by the corresponding number of 2-input XOR gates. As the solid line is below the dashed line for both ECRL and PFAL, applying complex XOR gates pays. Additional buffers are neglected in this observation. Different effects lead to the different results for ECRL and PFAL. On the one hand, ECRL suffers from undershoots in the recover interval due to capacitive coupling. This leads to non-adiabatic losses in the next evaluate interval. If the capacitance at the output node is increased, the absolute value of the undershoot is decreased. On the other hand, as mentioned earlier, the on-resistance in PFAL is modified by the increased stack in the logic blocks, leading to increased energy consumption. In any case the pipeline depth $\mathcal{D}$ is decreased. Whether complex gates can be applied beneficially has to be decided depending on the architectural structure and the constraints. As soon as the overall energy is reduced, complex gates pay. But even if complex gates led to an increased overall energy consumption, the circuit could still benefit from a decreased latency. In the following, the impact of complex gates on the energy consumption and the latency is investigated for the RCA structure.
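The break-even condition for replacing $(N-1)$ 2-input XOR gates by one $N$-input gate can be sketched in a few lines of Python. This is only an illustration: the function name `complex_xor_pays` and all energy values are hypothetical placeholders, not simulation results from this chapter.

```python
def complex_xor_pays(e_complex, e_xor2, n_inputs, e_buf=0.0, n_buffers=0):
    """Return True if one N-input XOR gate dissipates less energy than the
    (N-1) two-input XOR gates plus the buffers it replaces."""
    if n_inputs < 2:
        raise ValueError("an XOR gate needs at least two inputs")
    replaced = (n_inputs - 1) * e_xor2 + n_buffers * e_buf
    return e_complex < replaced


# Hypothetical energies, normalized to the 2-input XOR gate (not measured data).
E_XOR2 = 1.0
E_XOR3 = 2.2
E_BUF = 0.5

# Buffers neglected: 2.2 is not below 2 * 1.0.
print(complex_xor_pays(E_XOR3, E_XOR2, 3))
# One saved buffer taken into account: 2.2 is below 2 * 1.0 + 0.5.
print(complex_xor_pays(E_XOR3, E_XOR2, 3, E_BUF, n_buffers=1))
```

With these placeholder numbers the complex gate only pays once the saved buffer is counted, which mirrors the role of the $\sum E_{B u f}$ term in the inequality.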

### 6.2.2 Case Study: Energy, Latency and Area Reduction by Applying Complex Gates in the RCA Structure

Fig. 6.17 On the left, a 4 bit RCA structure is sketched that applies four FA-1 bit cells and 18 buffers for synchronization. Its latency is equal to $\mathcal{D}=4$. It can be replaced by the complex cell sketched on the right. Five gates make up the FA-4 bit cell: four gates compute the sum bits $s_{3}, s_{2}, s_{1}, s_{0}$ and one gate computes the carry output $c_{\text {out }}$. Latency is reduced to $\mathcal{D}=1$

Full-adders add up three bits with the same binary weight. An $N$-bit RCA has a pipeline depth of $\mathcal{D}=N$. Introducing a full-adder cell that adds $M$ bits of different weight within one cell will lead to a decreased latency, and is also expected to lead to decreased energy consumption and area. The $M$-bit full-adder (FA-$M$ bit) is a complex cell that has $(2 \cdot M+1)$ inputs ($a_{M-1}, a_{M-2}, \ldots, a_{0}, b_{M-1}, b_{M-2}, \ldots, b_{0}, c_{0}$) and produces $(M+1)$ outputs $\left(c_{M}, s_{M-1}, s_{M-2}, \ldots, s_{0}\right)$ within one cycle.

|  | $a_{M-1}$ | $a_{M-2}$ | $\cdots$ | $a_{1}$ | $a_{0}$ |
| :---: | :---: | :---: | :---: | :---: | :---: |
| + | $b_{M-1}$ | $b_{M-2}$ | $\cdots$ | $b_{1}$ | $b_{0}$ |
| + |  |  |  |  | $c_{0}$ |
| $c_{M}$ | $s_{M-1}$ | $s_{M-2}$ | $\cdots$ | $s_{1}$ | $s_{0}$ |

The sum at bit position $i$ is computed by $s_{i}=a_{i} \oplus b_{i} \oplus c_{i}$. Thus, all intermediate carries have to be computed within one cycle. A scheme for a 4 bit RCA that is compressed to an FA-4 bit complex cell is sketched in Fig. 6.17. It can be seen that the FA-4 bit complex cell, which is assembled from five gates, reduces the latency. Whether the energy and area are reduced depends on the energy consumption of the applied gates and their size.
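The function of an FA-$M$ bit cell, and of an $N$-bit RCA assembled from $\frac{N}{M}$ such cells, can be illustrated by a purely behavioural Python model. It is a sketch of the logic function only and says nothing about the transistor-level gates; the names `fa_m_bit` and `rca_add` are chosen here for illustration, and the intermediate carries are computed with the usual majority function.

```python
def fa_m_bit(a_bits, b_bits, c_in):
    """Behavioural model of an FA-M bit cell (bit lists are LSB first).

    Returns (sum_bits, carry_out) for M = len(a_bits)."""
    if len(a_bits) != len(b_bits):
        raise ValueError("operand slices must have the same width")
    sums, carry = [], c_in
    for a, b in zip(a_bits, b_bits):
        sums.append(a ^ b ^ carry)                   # s_i = a_i xor b_i xor c_i
        carry = (a & b) | (a & carry) | (b & carry)  # c_(i+1): majority of the three bits
    return sums, carry


def rca_add(a, b, n, m):
    """Add two n-bit integers with n/m chained FA-M bit cells.

    Returns (result, pipeline_depth); the result includes the carry out."""
    if n % m:
        raise ValueError("n must be a multiple of m")
    a_bits = [(a >> i) & 1 for i in range(n)]
    b_bits = [(b >> i) & 1 for i in range(n)]
    result, carry, depth = 0, 0, 0
    for lo in range(0, n, m):
        sums, carry = fa_m_bit(a_bits[lo:lo + m], b_bits[lo:lo + m], carry)
        for k, s in enumerate(sums):
            result |= s << (lo + k)
        depth += 1                                   # one FA-M bit cell per pipeline stage
    return result | (carry << n), depth


print(rca_add(45, 27, 6, 3))  # 6 bit RCA built from two FA-3 bit cells
```

The returned depth equals $\frac{N}{M}$, so the model also shows how the pipeline depth shrinks when more bits are combined in one cell.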

The sum gates and the gate that generates $c_{M}$ are implemented as complex gates. The general setup for the carry out and the sums of an FA-$M$ bit gate is presented in Fig. 6.18. The carry out gate can be extended by a symmetrical structure, which is encircled in the picture for the bit pair $a_{1}$ and $b_{1}$ (with the corresponding dual representations $\overline{a_{1}}$ and $\overline{b_{1}}$). $M$ sum cells are constructed that incorporate the logic block of the carry out gate for the corresponding bit position $i$, with $0 \leq i \leq(M-1)$. The higher $M$ is, the higher the transistor stack in the carry out cell and in the sum gate with the highest bit weight, $s_{(M-1)}$. This is the limiting factor for the construction of FA-$M$ bit cells, as stacking cannot be extended indefinitely without causing functional errors. Simulations have shown that a maximum of $M=4$ is possible for ECRL and PFAL.


Fig. 6.18 Construction of the carry out gate (left) and the sum gates (right) for FA-$M$ bit cells. A basic building block is encircled in the logic block of the carry out gate. One additional basic cell is stacked for an extension by one bit. $M$ sum gates are used in the FA-$M$ bit cell; each contains the logic block of the carry gate for bit position $i$

In the following, a 6 bit RCA and a 12 bit RCA with FA-1 bit gates are replaced by structures with FA-$M$ bit cells and rated with respect to the savings in latency, energy and area. All structures are simulated in a 130 nm CMOS technology for ECRL and PFAL at an operating frequency of 100 MHz. A scheme is presented in Fig. 6.19 for the 6 bit RCA and its simplification by applying FA-2 bit and FA-3 bit cells.

Obviously the overhead due to buffers is reduced, and the latency is reduced to $\frac{1}{M}$ of its original value, i.e. to $\frac{N}{M} \cdot \frac{T}{4}$, with $\mathcal{D}=\frac{N}{M}$. Area consumption is estimated via the following equations. The sizing of NMOS and PMOS devices leads to an active gate area relation of $A_{P M O S}=1.25 \cdot A_{N M O S}$. The area consumed by one FA-$M$ bit cell is determined by

$$
\begin{aligned}
A_{\mathrm{FA}-M \mathrm{bit}}(M) & =\sum_{i=0}^{M-1} A_{S_{i}}+A_{C_{M}} \\
& = \begin{cases}\left(3 M^{2}+15.5 M+4.5\right) A_{N M O S} & \text { for ECRL } \\
\left(3 M^{2}+17.5 M+6.5\right) A_{N M O S} & \text { for PFAL. }\end{cases}
\end{aligned}
$$


Fig. 6.19 The latency of the RCA structure can be reduced when complex cells are used. Here the basic RCA scheme is reduced via an FA-2 bit and an FA-3 bit cell. The latency is thus reduced by a factor of 2 and 3, respectively. Whether energy can be saved by the application of complex gates depends on the ratio of saved buffers to the increased consumption in the complex gates

Fig. 6.20 Energy and area consumption of a 6 bit RCA structure with FA-M bit cells in ECRL


Fig. 6.21 Energy and area consumption of a 12 bit RCA structure with FA- $M$ bit cells in ECRL



An $N$-bit RCA consists of $\frac{N}{M}$ FA-$M$ bit cells and a buffer overhead of $\frac{3}{2} N\left(\frac{N}{M}-1\right)$ buffers. The overall area consumption of the $N$-bit RCA composed of FA-$M$ bit cells is thus determined via

$$
\begin{equation*}
A_{R C A}(M)=\frac{N}{M} \cdot A_{\mathrm{FA}-M \mathrm{bit}}(M)+\frac{3}{2} N\left(\frac{N}{M}-1\right) \cdot A_{B u f} \tag{6.38}
\end{equation*}
$$

with

$$
A_{B u f}= \begin{cases}4.5 A_{N M O S} & \text { for ECRL } \\ 6.5 A_{N M O S} & \text { for PFAL }\end{cases}
$$
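The area estimate of (6.38) is easy to evaluate numerically. The following Python sketch uses the polynomial coefficients and buffer areas given above, all in units of $A_{N M O S}$; the function and dictionary names are hypothetical and chosen only for this illustration.

```python
# Coefficients of A_FA-Mbit(M) = q*M^2 + l*M + c and the buffer area A_Buf,
# both in units of A_NMOS, as given in the area equations of this section.
COEFFS = {
    "ECRL": {"fa": (3.0, 15.5, 4.5), "buf": 4.5},
    "PFAL": {"fa": (3.0, 17.5, 6.5), "buf": 6.5},
}


def area_fa(m, family):
    """Area of one FA-M bit cell in units of A_NMOS."""
    q, l, c = COEFFS[family]["fa"]
    return q * m * m + l * m + c


def area_rca(n, m, family):
    """Area of an n-bit RCA built from FA-M bit cells, following (6.38)."""
    if n % m:
        raise ValueError("n must be a multiple of m")
    cells = n // m
    buffers = 1.5 * n * (cells - 1)       # buffer overhead 3/2 * N * (N/M - 1)
    return cells * area_fa(m, family) + buffers * COEFFS[family]["buf"]


def area_ratio(n, m, family):
    """Area relative to the implementation with M = 1."""
    return area_rca(n, m, family) / area_rca(n, 1, family)


for family in ("ECRL", "PFAL"):
    for n, ms in ((6, (1, 2, 3)), (12, (1, 2, 3, 4))):
        print(family, n, [round(area_ratio(n, m, family), 2) for m in ms])
```

The printed ratios can be compared directly with the $\frac{A_{R C A}(M)}{A_{R C A}(1)}$ columns of the result tables for ECRL and PFAL.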

Simulation results and area estimation results after (6.38) are presented in Figs. 6.20 and 6.21 for the 6 bit RCA and the 12 bit RCA in ECRL, respectively. Results for PFAL are presented in Figs. 6.22 and 6.23.

Fig. 6.22 Energy and area consumption of a 6 bit RCA structure with FA-$M$ bit cells in PFAL

Fig. 6.23 Energy and area consumption of a 12 bit RCA structure with FA-$M$ bit cells in PFAL

The area consumption as well as the latency decrease continuously with increasing $M$ in the investigated constellations. For ECRL the energy consumption also decreases further when more bits are combined in the FA-$M$ bit cell. In PFAL, an energy optimum at $M=2$ is observed for both bit widths of the RCA. If $M$ is increased to 3 or 4 within the 12 bit RCA, the energy increases with respect to the preceding $M$ value. In case of the 6 bit RCA in PFAL, the energy consumed by the RCA with $M=3$ is even higher than the energy consumed by the structure with $M=1$. Results are summarized in Tables 6.7 and 6.8 for ECRL and PFAL, respectively. The pipeline depth $\mathcal{D}$ as a measure of latency, as well as energy and area values, are given in those tables and are compared to the values of the implementation with $M=1$.

Latency and area are reduced in all investigated cases. In case of PFAL, the energy consumption is increased by $6 \%$ for the 6 bit RCA with $M=3$. For the 12 bit RCA, the energy consumption is reduced compared to $E_{\text {diss }}(1)$ for all $M>1$, but the lowest overall energy consumption is obtained for $M=2$, where the energy is reduced by around $20 \%$. Latency reduction is the same in the ECRL and PFAL families. The absolute area consumed is lower for ECRL in any case. Though the absolute energy is lower for PFAL in case of $M=1$, for higher values of $M$ ECRL's energy consumption can be lower than that of PFAL. Which family has the minimum absolute energy consumption for the different combinations of $N$ and $M$ is listed in Table 6.9.

Table 6.7 Summary of results for the 6 bit and the 12 bit ECRL RCA if FA-$M$ bit cells are used

| ECRL | 6 bit RCA |  |  |  | 12 bit RCA |  |  |  |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| $M$ | $\mathcal{D}(M)$ | $\frac{\mathcal{D}(M)}{\mathcal{D}(1)}$ | $\frac{E_{\text {diss }}(M)}{E_{\text {diss }}(1)}$ | $\frac{A_{R C A}(M)}{A_{R C A}(1)}$ | $\mathcal{D}(M)$ | $\frac{\mathcal{D}(M)}{\mathcal{D}(1)}$ | $\frac{E_{\text {diss }}(M)}{E_{\text {diss }}(1)}$ | $\frac{A_{R C A}(M)}{A_{R C A}(1)}$ |
| 1 | 6 | 1 | 1 | 1 | 12 | 1 | 1 | 1 |
| 2 | 3 | 0.5 | 0.62 | 0.66 | 6 | 0.5 | 0.57 | 0.59 |
| 3 | 2 | $\frac{1}{3}$ | 0.51 | 0.58 | 4 | $\frac{1}{3}$ | 0.43 | 0.48 |
| 4 | - | - | - | - | 3 | 0.25 | 0.39 | 0.43 |

Table 6.8 Summary of results for the 6 bit and the 12 bit PFAL RCA if FA-$M$ bit cells are used

| PFAL | 6 bit RCA |  |  |  | 12 bit RCA |  |  |  |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| $M$ | $\mathcal{D}(M)$ | $\frac{\mathcal{D}(M)}{\mathcal{D}(1)}$ | $\frac{E_{\text {diss }}(M)}{E_{\text {diss }}(1)}$ | $\frac{A_{R C A}(M)}{A_{R C A}(1)}$ | $\mathcal{D}(M)$ | $\frac{\mathcal{D}(M)}{\mathcal{D}(1)}$ | $\frac{E_{\text {diss }}(M)}{E_{\text {diss }}(1)}$ | $\frac{A_{R C A}(M)}{A_{R C A}(1)}$ |
| 1 | 6 | 1 | 1 | 1 | 12 | 1 | 1 | 1 |
| 2 | 3 | 0.5 | 0.94 | 0.61 | 6 | 0.5 | 0.81 | 0.56 |
| 3 | 2 | $\frac{1}{3}$ | 1.06 | 0.51 | 4 | $\frac{1}{3}$ | 0.85 | 0.43 |
| 4 | - | - | - | - | 3 | 0.25 | 0.95 | 0.38 |

Table 6.9 Logic family with minimum energy consumption for 6 bit and 12 bit RCA with FA-$M$ bit cells

| $M$ | 6 bit | 12 bit |
| :--- | :--- | :--- |
| 1 | PFAL | PFAL |
| 2 | ECRL/PFAL | PFAL |
| 3 | ECRL | ECRL |
| 4 | - | ECRL |

By applying complex gates in Adiabatic Logic, latency, energy consumption and area consumption can be reduced. Arithmetic structures nevertheless call for a careful investigation before they can be implemented in an area and energy efficient way in AL.

### 6.3 Multi-operand Adders and the CORDIC Algorithm

### 6.3.1 Nested RCA Structure

RCA structures with large bit widths show an increased energy consumption due to buffering, as the number of cells that perform the calculation (full adders) rises with $N$ whereas the number of buffers rises with $\frac{3}{2}\left(N^{2}-N\right)$. The structure therefore does not seem suitable for arithmetic operations with higher bit widths if only one adding operation is performed. As soon as consecutive calculations are processed, the overhead due to buffers can be reduced by nesting RCA structures. If, for example, three 3 bit numbers are added, overhead can be reduced by nesting two adder structures as sketched in Fig. 6.24.

Fig. 6.24 Nested RCA structure for three 3 bit binary numbers. The relative overhead due to the buffers is reduced when more consecutive operations are performed

In the pictured case the full adder gate count is increased by 3, and 6 additional buffers are used. Each additional word that is added demands some buffers. Nevertheless, the overhead due to buffers relative to the number of gates performing calculations is reduced, even though buffers have to hold the bits $c_{0}, c_{1}$ and $c_{2}$ of the additional summand until they are processed.
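The reduction of the relative buffer overhead can be checked with a short calculation. The Python sketch below uses the buffer count $\frac{3}{2}\left(N^{2}-N\right)$ of a single $N$-bit RCA and the increments stated for the pictured 3 bit case (3 more full adders, 6 more buffers); the helper names are hypothetical.

```python
def rca_buffers(n):
    """Number of synchronization buffers in a single n-bit RCA: 3/2 * (n^2 - n)."""
    return 3 * (n * n - n) // 2


def relative_overhead(full_adders, buffers):
    """Buffers per full adder, used here as a simple overhead measure."""
    return buffers / full_adders


N = 3
single_fa, single_buf = N, rca_buffers(N)
# Nested structure for three 3 bit numbers: 3 more full adders, 6 more buffers.
nested_fa, nested_buf = single_fa + 3, single_buf + 6

print(relative_overhead(single_fa, single_buf))  # single 3 bit RCA
print(relative_overhead(nested_fa, nested_buf))  # nested RCA for three operands
```

For the single 3 bit RCA this gives 3 buffers per full adder, and 2.5 for the nested structure, so the relative overhead drops although the absolute buffer count grows.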

In the COordinate Rotation DIgital Computer (CORDIC) based Discrete Cosine Transformation presented in Sect. 6.3.3 the so-called butterfly structure is used, which can be composed of such nested RCA structures. The summands are results of computations that take place in parallel adder stages. Results are crossed over and used as inputs for other adding stages. Thus, nested RCAs might prove to be a suitable structure here. Especially if a CORDIC with a width of less than 20 bit is implemented, the RCA is a suitable structure compared to the parallel-prefix scheme.

### 6.3.2 The Carry-Save Adder (CSA) Structure

The carry-save adder (CSA) structure breaks up the carry propagation path in the RCA. Instead of a sum vector and a carry bit, two vectors are generated and forwarded to the next stage. A full adder gate accepts three inputs and produces two output bits, one is the sum, that has the same weight as the input bits and the other is the carry, that is of double weight. The full adder cell is also called 3-to-2 compressor, as it takes three input bits, that are compressed to two outputs. Three binary input vectors $\mathbf{d}, \mathbf{e}$ and $\mathbf{f}$ are compressed to the sum vector $\mathbf{s}$ and the carry vector $\mathbf{c}$ :

|  | $d_{3}$ | $d_{2}$ | $d_{1}$ | $d_{0}$ |
| :---: | :---: | :---: | :---: | :---: |
| + | $e_{3}$ | $e_{2}$ | $e_{1}$ | $e_{0}$ |
| + | $f_{3}$ | $f_{2}$ | $f_{1}$ | $f_{0}$ |
|  | $s_{3}$ | $s_{2}$ | $s_{1}$ | $s_{0}$ |
| $c_{4}$ | $c_{3}$ | $c_{2}$ | $c_{1}$ |  |

In the next adder stage one more vector can be added to the outputs of the previous stage. CSA structures are well suited for multipliers. Partial products are generated via AND combination of the input vector and each single bit of the multiplicand and then summed up in the CSA array [31]. After the last stage in the carry-save array, two vectors are still left that have to be combined to the final sum vector and the carry out bit. This can be done by RCA adders when timing is not an issue in static CMOS, or when the overhead due to the buffers or latency is not a concern in AL. Otherwise parallel-prefix structures offer a good way to improve the speed in static CMOS or the energy overhead and latency in AL.
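The 3-to-2 compression and the final merging step can be illustrated with a small behavioural model in Python, operating on non-negative integers as bit vectors. The function names are hypothetical, and the ordinary integer addition at the end merely stands in for the final adder (RCA or parallel-prefix) mentioned above.

```python
def carry_save(d, e, f):
    """3-to-2 compression of three bit vectors (given as non-negative integers).

    Returns (s, c): the bitwise sum vector and the carry vector, where the
    carry vector is shifted left by one because it has double weight."""
    s = d ^ e ^ f
    c = ((d & e) | (d & f) | (e & f)) << 1
    return s, c


def multi_operand_sum(operands):
    """Add several operands with one carry-save stage per operand and a
    single carry-propagating addition at the very end."""
    s, c = 0, 0
    for x in operands:
        s, c = carry_save(s, c, x)   # no carry propagation inside the array
    return s + c                     # final merge of sum and carry vector


s, c = carry_save(0b1011, 0b0110, 0b1101)
print(bin(s), bin(c), s + c)
print(multi_operand_sum([3, 5, 7, 9]))
```

In every stage the invariant $s + c = d + e + f$ holds, which is why the carry propagation can be postponed to the single final addition.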

### 6.3.3 A CORDIC-Based Discrete Cosine Transformation (DCT)

In 1959 Jack Volder presented the COordinate Rotation DIgital Computer (CORDIC) computing technique, which allows complex tasks such as coordinate rotation, multiplication, division or the conversion of binary into mixed-radix systems to be computed [111]. The CORDIC performs a vector rotation on $[x \; y]^{T}$ by an angle $\phi$ according to

$$
\begin{equation*}
\left[\begin{array}{l}
x^{\prime} \\
y^{\prime}
\end{array}\right]=\theta\left[\begin{array}{l}
x \\
y
\end{array}\right], \quad \theta \in \mathbb{R}^{2 \times 2} \tag{6.39}
\end{equation*}
$$

An orthogonal CORDIC will describe a plane rotation of a vector and $\theta$ is described by

$$
\theta=\left[\begin{array}{cc}
\cos \phi & \sin \phi  \tag{6.40}\\
-\sin \phi & \cos \phi
\end{array}\right] .
$$

The orthogonal type CORDIC can be operated in two modes. In the rotation mode, the angle $\phi$ is applied as an external parameter and the vector at the inputs is rotated by this angle. In the vector mode, the vector is rotated in such a way that the applied input vector is transformed to $\left[\begin{array}{ll}x^{\prime} & 0\end{array}\right]^{T}$. By this, $\phi$ and the norm of the input vector are determined. For an iterative implementation of an orthogonal type CORDIC the circuit block in Fig. 6.25 is applied.

Fig. 6.25 A transformation stage for an iterative, orthogonal type CORDIC consists of two adder and shift stages. Depending on the rotation angle $\phi$ and the rotation steps of the succeeding stages, the factors $\mu_{i}^{(12)}=-\mu_{i}^{(21)}$ are determined in each step $i$. In case of a fixed rotation angle, the factors are set to constant values

Its function is given by

$$
\begin{equation*}
\left[\begin{array}{l}
x_{i+1} \\
y_{i+1}
\end{array}\right]=\left[\begin{array}{cc}
\mu^{(11)} & \mu^{(12)} 2^{-i} \\
\mu^{(21)} 2^{-i} & \mu^{(22)}
\end{array}\right] \cdot\left[\begin{array}{c}
x_{i} \\
y_{i}
\end{array}\right] \tag{6.41}
\end{equation*}
$$

where the parameters $\mu^{(k l)} \in\{-1,0,1\}$ are determined depending on the mode, the input vectors, and the type of the CORDIC. If a rotation by a fixed angle $\pm \arctan \left(2^{-i}\right)$ is performed in rotation mode in an orthogonal type CORDIC, then the parameters are

$$
\begin{equation*}
\mu^{(11)}=\mu^{(22)}=1, \quad \mu^{(12)}=-\mu^{(21)} \in\{-1,1\} . \tag{6.42}
\end{equation*}
$$

After $n$ of these transformation stages the iterative algorithm results in a vector

$$
\left[\begin{array}{l}
x_{n}  \tag{6.43}\\
y_{n}
\end{array}\right]=f \cdot\left[\begin{array}{cc}
\cos \phi & \sin \phi \\
-\sin \phi & \cos \phi
\end{array}\right] \cdot\left[\begin{array}{l}
x_{0} \\
y_{0}
\end{array}\right],
$$

with

$$
\begin{equation*}
f=\prod_{i=0}^{n-1} \sqrt{1+2^{-2 i}}, \quad \phi=\sum_{i=0}^{n-1} \mu_{i}^{(12)} \arctan \left(2^{-i}\right) \tag{6.44}
\end{equation*}
$$

The factor $f$ describes the deviation of the vector's norm due to the performed rotation and thus scaling stages follow the $n$ transformation stages. One scaling stage computes

$$
\begin{align*}
& x_{i+1}=\left(1+\sigma_{i}^{1} \cdot 2^{-t_{i}}\right) x_{i}  \tag{6.45}\\
& y_{i+1}=\left(\sigma_{i}^{2} \cdot 2^{-t_{i}}+1\right) y_{i}
\end{align*}
$$

with $\sigma_{i}^{k} \in\{-1,0,1\}$ and $t_{i}=i-n$. After $m$ scaling stages the output signals are

$$
\left[\begin{array}{l}
x_{n+m}  \tag{6.46}\\
y_{n+m}
\end{array}\right]=f_{m} \cdot f \cdot\left[\begin{array}{cc}
\cos \phi & \sin \phi \\
-\sin \phi & \cos \phi
\end{array}\right] \cdot\left[\begin{array}{l}
x_{0} \\
y_{0}
\end{array}\right] .
$$

The parameters $\sigma_{i}^{k}$ are determined in such a way that $f_{m} \cdot f \approx 1$ [112]. In [113] Heyne et al. presented a computationally efficient and high-quality CORDIC-based Discrete Cosine Transformation. They optimize the high-quality Loeffler DCT [114], which incorporates 11 multiplications that are costly in hardware as well as in software implementations [113]. The CORDIC-based Loeffler DCT requires a hardware effort that is comparable to that of the binDCT-D5 [115] algorithm and has a quality comparable to that of the DCT presented by Loeffler [113, 116].
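The transformation stages of (6.41) with the parameter choice of (6.42) can be sketched in floating-point Python to check the relation between the iterated shift-and-add steps and the scaled rotation of (6.43). This is an illustrative model only: the function names are hypothetical, the signs $\mu_{i}^{(12)}$ are simply passed in as a list, and the stage gain is taken as $\sqrt{1+2^{-2 i}}$, which is the norm scaling of the stage matrix in (6.41) for $\mu^{(12)}=-\mu^{(21)}= \pm 1$.

```python
import math


def cordic_stages(x, y, mus):
    """Apply len(mus) transformation stages with mu11 = mu22 = 1 and
    mu12 = -mu21 = mus[i]; only shifts (powers of two) and additions occur."""
    for i, mu in enumerate(mus):
        shift = 2.0 ** -i
        x, y = x + mu * shift * y, y - mu * shift * x
    return x, y


def gain_and_angle(mus):
    """Norm scaling f and total rotation angle phi of the stage sequence."""
    f = math.prod(math.sqrt(1.0 + 2.0 ** (-2 * i)) for i in range(len(mus)))
    phi = sum(mu * math.atan(2.0 ** -i) for i, mu in enumerate(mus))
    return f, phi


mus = [1, -1, 1, 1, -1]          # hypothetical sign sequence for a fixed angle
x0, y0 = 1.0, 0.5
xn, yn = cordic_stages(x0, y0, mus)
f, phi = gain_and_angle(mus)

# Reference: scaled plane rotation of the input vector.
x_ref = f * (math.cos(phi) * x0 + math.sin(phi) * y0)
y_ref = f * (-math.sin(phi) * x0 + math.cos(phi) * y0)
print(abs(xn - x_ref) < 1e-12, abs(yn - y_ref) < 1e-12)
```

The norm of the output vector is $f$ times that of the input, which is the deviation the subsequent scaling stages compensate so that $f_{m} \cdot f \approx 1$.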

Fig. 6.26 Scheme of the CORDIC-based Loeffler DCT used for the estimations of AL and static CMOS implementations. The design consists of adders and subtractors mainly. The $\bigcirc$-operators (no operation) are synchronization buffers in AL but do not cause any hardware in static CMOS


### 6.3.3.1 CORDIC-Based Loeffler DCT: Estimation of the Energy Consumption and Comparison to Static CMOS

Based on the structure presented in [113, 116] a 12 bit CORDIC-based Loeffler DCT is implemented in Adiabatic Logic and compared to a variety of implementations in static CMOS. A scheme is shown in Fig. 6.26. Only adders (subtractors are constructed from adders) and synchronization stages (in case of AL) are used in the implementation. The static CMOS implementations use full adders, inverters and latches. Shift operations are hard-wired, and due to the fixed rotation angles, the parameters $\mu_{i}^{(12)}$ and $\mu_{i}^{(21)}$ are determined in the design phase. The target frequency for the DCT is 100 MHz and the nominal supply voltage is 1.2 V.

As the width is 12 bit, the RCA structure can be used in AL with a negligible overhead due to the buffers. A nested RCA structure is used in AL (AL RCA). For static CMOS, different implementations are considered. CMOS CPA is a static CMOS implementation of the CORDIC-based Loeffler DCT using pipelined carry-propagate adders. This static CMOS design corresponds to the AL reference design, as in AL the inherent pipelining transforms the RCA into a pipelined carry-propagate adder (CPA). CMOS CSA V1 uses a carry-save adder (CSA) structure. After the last stage a vector merging adder (VMA), which is not considered in the investigation, has to be used to calculate the final sum vector. In this implementation, a pipeline stage is inserted after every CSA stage. CMOS CSA V2 is the same architecture as CMOS CSA V1, but with pipelining after every second CSA stage, and CMOS CSA V3 uses no pipelining at all. Finally, a CMOS RCA structure is compared, which has the lowest gate count in static CMOS and uses latches only at the inputs and outputs of the DCT.

Numbers are represented in 2's complement. Subtractors are constructed by inverting the input signal and adding a one. Dual-rail encoding in AL allows the inversion of a signal to be implemented by crossing over the outputs of the adder gates. Gate count results for the various implementations are summarized in Table 6.10.

Table 6.10 Hardware effort of the AL reference circuit and diverse static CMOS implementations of the CORDIC-based Loeffler DCT

|  | \# of gates in respective implementation |  |  |  |  |
| :---: | :---: | :---: | :---: | :---: | :---: |
|  | Adiabatic |  | Static CMOS |  |  |
|  | $\#_{F A, A L}$ | $\#_{B u f, A L}$ | $\#_{F A, C M O S}$ | $\#_{I n v, C M O S}$ | $\#_{\text {Latch }, C M O S}$ |
| AL RCA | 456 | 1272 | - | - | - |
| CMOS CPA | - | - | 456 | 228 | 2184 |
| CMOS CSA V1 | - | - | 912 | 456 | 1344 |
| CMOS CSA V2 | - | - | 912 | 456 | 576 |
| CMOS CSA V3 | - | - | 912 | 456 | 0 |
| CMOS RCA | - | - | 456 | 228 | 192 |

Fig. 6.27 Estimated ESF of diverse static CMOS implementations with respect to the RCA design in AL


The energy consumption for the gates is characterized for the PFAL AL family and static CMOS, both for activity of $\alpha=0$ and $\alpha=1$. These values are used for interpolations of values $0<\alpha<1$ :

$$
\begin{equation*}
E(\alpha)=E(\alpha=0)+\alpha(E(\alpha=1)-E(\alpha=0)) . \tag{6.47}
\end{equation*}
$$

Estimations are carried out using interpolation values according to (6.47) multiplied with the gate count in the respective implementation:

$$
\begin{align*}
E_{A L}(\alpha)= & \#_{F A, A L} \cdot E_{F A, A L}(\alpha)+\#_{B u f, A L} \cdot E_{B u f, A L}(\alpha)  \tag{6.48}\\
E_{C M O S}(\alpha)= & \#_{F A, C M O S} \cdot E_{F A, C M O S}(\alpha)+\#_{I n v, C M O S} \cdot E_{I n v, C M O S}(\alpha) \\
& +\#_{\text {Latch }, C M O S} \cdot E_{\text {Latch }, C M O S}(\alpha) . \tag{6.49}
\end{align*}
$$

The ESF is calculated for all implementations, where the AL RCA design is the reference circuit. Figure 6.27 presents the ESF over the activity factor $\alpha$.
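The estimation procedure of (6.47) to (6.49) can be written down compactly in Python. The gate counts are those of Table 6.10, whereas all per-gate energies below are hypothetical placeholders in arbitrary units (the characterized values are not reproduced here), and the zero-activity energy of the static CMOS full adder and inverter is assumed to be zero, i.e. leakage is neglected. The ESF is computed as the ratio of the static CMOS energy to the AL energy.

```python
# Gate counts from Table 6.10: (full adders, inverters, latches).
CMOS_COUNTS = {
    "CMOS CPA": (456, 228, 2184),
    "CMOS CSA V1": (912, 456, 1344),
    "CMOS CSA V2": (912, 456, 576),
    "CMOS CSA V3": (912, 456, 0),
    "CMOS RCA": (456, 228, 192),
}
AL_COUNTS = (456, 1272)  # AL RCA: (full adders, buffers)

# Hypothetical per-gate energies (E at alpha = 0, E at alpha = 1), arbitrary units.
E_FA_AL, E_BUF_AL = (1.0, 1.2), (0.3, 0.35)
E_FA_CMOS, E_INV_CMOS, E_LATCH_CMOS = (0.0, 40.0), (0.0, 5.0), (4.0, 10.0)


def interpolate(e0, e1, alpha):
    """Linear interpolation between the characterized end points, as in (6.47)."""
    return e0 + alpha * (e1 - e0)


def energy_al(alpha):
    """Energy of the AL RCA design, as in (6.48)."""
    n_fa, n_buf = AL_COUNTS
    return n_fa * interpolate(*E_FA_AL, alpha) + n_buf * interpolate(*E_BUF_AL, alpha)


def energy_cmos(design, alpha):
    """Energy of a static CMOS design, as in (6.49)."""
    n_fa, n_inv, n_latch = CMOS_COUNTS[design]
    return (n_fa * interpolate(*E_FA_CMOS, alpha)
            + n_inv * interpolate(*E_INV_CMOS, alpha)
            + n_latch * interpolate(*E_LATCH_CMOS, alpha))


def esf(design, alpha):
    """Energy saving factor with the AL RCA design as the reference."""
    return energy_cmos(design, alpha) / energy_al(alpha)


for design in CMOS_COUNTS:
    print(design, [round(esf(design, a), 2) for a in (0.0, 0.3, 1.0)])
```

Because the numbers are placeholders, only qualitative properties of the output are meaningful, for instance that a design without latches yields an ESF of zero at $\alpha=0$ under the stated assumption.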

A higher activity factor leads to a higher energy consumption for all static CMOS designs; hence the ESF is reduced when $\alpha$ is reduced. Latches are connected to the clock signal, so losses are caused in the latches even if the activity factor is zero. The CMOS CSA V3 design is the only design without any latches, and only for this design an ESF of zero is obtained for zero activity. The more latches are involved in the static CMOS design, the higher the energy saving factor at $\alpha=0$. In case of high activities, the energy consumption of the full adder cells dominates the overall consumption. The CMOS CSA V1 design consumes more energy than the CMOS CPA design for $\alpha>0.75$, although the CMOS CPA design involves 1.6 times as many latches. Twice as many full adders are involved in the CMOS CSA V1 design, so the high activity provokes a tremendous energy increase. Decreasing the pipelining effort in the CMOS CSA designs V2 and V3 decreases the energy consumption and thus the ESF. Nevertheless, the hardware effort in the CMOS CSA designs is doubled, as in each stage not just two vectors are summed up, but two sum and two carry vectors. So a 4-to-2 compressor, constructed from two consecutive 3-to-2 compressor (full adder) stages, is required.

The CMOS RCA design, which applies the same number of full adder gates as the AL RCA and the CMOS CPA designs at a low latch count, results in the lowest overall energy consumption among the CMOS designs. Thus the ESF is the lowest for the CMOS RCA design, as long as $\alpha>0.1$ is assumed. Due to its ripple paths, this design allows only the lowest operating frequency of the investigated static CMOS designs. Voltage scaling is restricted, as a lowered supply voltage will increase the delay of the gates further. Recalling the inherent properties of AL, no effort has to be spent there on compliance with timing constraints. For an activity factor of $\alpha=0.3$, the energy saving factor is between 8 and 30, depending on the implementation in static CMOS. Glitches are not considered in the applied estimation method. They will increase the CMOS energy consumption, as they correspond to an increased activity factor $\alpha$. This shows that the adiabatic implementation of a CORDIC-based Loeffler DCT system saves about an order of magnitude in energy compared to static CMOS implementations.

### 6.3.3.2 CORDIC-Based Loeffler DCT: System Simulation and the Impact of Voltage Scaling

To verify the estimation results, the AL RCA and the CMOS RCA systems are implemented and simulated with parameters for a 130 nm CMOS technology. Input data is derived from a photograph, thus typical activity factors are expected. Both designs are simulated for the nominal voltage $V_{D D}=1.2 \mathrm{~V}$ and for two scaled supply voltages of 1.0 V and 0.8 V . ESF values are calculated for all three operating voltages and plotted against the supply voltage in Fig. 6.28. The ESF has also been calculated without the energy consumed by the latches in the CMOS RCA design.

The simulation results show a good match with the estimations. Although no prediction about the activity caused by the applied input sequence is made, the results match the estimations for $\alpha=0.5$. Even if the pattern itself does not cause an activity factor of 0.5, propagation delays cause glitches, and hence the activity factor is increased by a certain amount in static CMOS. No such effects exist in AL.

Fig. 6.28 The estimation results are verified by simulation for the static CMOS design with the lowest overall consumption. The impact of voltage scaling is presented for a reduced voltage down to $V_{D D}=0.8 \mathrm{~V}$


At nominal supply voltage, the ESF is 11; thus the adiabatic system consumes less than a tenth of the energy consumed by the static CMOS design. Voltage scaling degrades the ESF, as it reduces the energy consumption of the static CMOS design more than that of AL. For the scaled voltage of 0.8 V a saving factor of 7 is still left. Even when the energy consumption of the latches is excluded at the lowest investigated supply voltage, the conclusion holds that the CORDIC-based Loeffler DCT in Adiabatic Logic is an ultra-low power DCT signal processing unit. It still consumes less than $20 \%$ of the energy of its static CMOS counterpart.

Hence, the selection of an appropriate architecture allows ultra-low power signal processing units to be built in Adiabatic Logic. Additionally, in contrast to static CMOS designs with a low energy consumption, no care has to be taken about timing issues in Adiabatic Logic.

## Chapter 7: Measurement Results of an Adiabatic FIR Filter

A first test chip was fabricated in a 130 nm CMOS process and results are presented in [9-11]. As test structures, an 8 bit RCA structure and buffer gates for the PFAL and ECRL adiabatic families are implemented. Amirante et al. presented signal converters from the static CMOS single-rail domain into the adiabatic dual-rail domain, together with measurement results, in [11]. They showed a good agreement between measured results and the simulations carried out for frequencies of 5 MHz, 10 MHz, and 20 MHz. Relating the measurement results to simulations of a static CMOS 8 bit RCA system showed energy savings by a factor of almost 8 at a frequency of 20 MHz, and that savings are gained by the adiabatic implementation for frequencies up to 1 GHz [9, 11]. Voltage scaling measurements are also in good agreement with simulated values [10]. In [10] measurement results of the PFAL and ECRL buffer gates are presented as well, where the measured waveforms show the characteristic attributes predicted by simulations. Additionally, a functional test is carried out to determine the lower supply voltage limit for both families. The minimum supply voltages at a frequency of 800 kHz are $V_{D D, \text { min }}=0.3 \mathrm{~V}$ and $V_{D D, \text { min }}=0.425 \mathrm{~V}$ for ECRL and PFAL, respectively [10]. These results thus provide a functional proof and show the agreement of simulation and measurement with respect to the saving potential and the impact of the supply voltage.

In a second test chip, a large-scale adiabatic system was fabricated to prove the saving potential for larger systems and to show that a standard industry design flow is sufficient to implement adiabatic systems efficiently. As test structure, an FIR filter in the PFAL adiabatic family was implemented and embedded into a static CMOS interface to allow easy measurement with static CMOS compatible signals. FIR filters are used in a wide range of applications, and due to their transfer function they are inherently stable.

### 7.1 Structure of the Adiabatic FIR Filter

A finite impulse response (FIR) filter is characterized by its impulse response of finite length. The FIR filter takes $N$ consecutive input vectors $x[n]$, weights them, and sums them up according to

$$
\begin{equation*}
y[n]=\sum_{i=0}^{N-1} c_{i} x[n-i] \tag{7.1}
\end{equation*}
$$

where the $c_{i}$ are the $N$ weighting coefficients. Equation (7.1) reveals that the FIR filter has no feedback in its structure. Thus, the pulse response is of finite length and the group delay is independent of the input frequency. Stability is therefore guaranteed independent of the choice of the filter coefficients $c_{i}$, and a linear phase response is gained. A transposed direct form implementation of the FIR filter is given in Fig. 7.1.

Fig. 7.1 Transposed direct form FIR filter implementation
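The relation between Eq. (7.1) and the transposed direct form of Fig. 7.1 can be illustrated with a short behavioural sketch. The function names and the integer test data are assumptions made for this illustration only; the sketch models the arithmetic, not the adiabatic circuit. Inputs before the first sample are assumed to be zero.

```python
def fir_direct(coeffs, samples):
    """Weighted sum over the most recent inputs, one product per coefficient."""
    out = []
    for n in range(len(samples)):
        acc = 0
        for i, c in enumerate(coeffs):
            if n - i >= 0:  # inputs before the first sample count as zero
                acc += c * samples[n - i]
        out.append(acc)
    return out


def fir_transposed(coeffs, samples):
    """Transposed direct form: the input is broadcast to all multipliers and
    the partial sums travel through a chain of registers and adders."""
    taps = len(coeffs)
    regs = [0] * (taps - 1)  # registers between the adder stages
    out = []
    for x in samples:
        products = [c * x for c in coeffs]
        # The output adder combines the first product with the first register.
        y = products[0] + (regs[0] if regs else 0)
        # Each register takes the next product plus the old content of the
        # register behind it; the last register only stores its product.
        new_regs = []
        for k in range(taps - 1):
            following = regs[k + 1] if k + 1 < taps - 1 else 0
            new_regs.append(products[k + 1] + following)
        regs = new_regs
        out.append(y)
    return out
```

Both functions return the same output sequence for the same coefficients and samples, which is the property that allows the filter to be built in the transposed form.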

The intermediate bit width at the adder stages can be decreased by reordering the filter coefficient bits according to the Horner scheme. Here, the coefficient bits $c_{i}^{l}$ of all coefficients $c_{i}$ are grouped such that each group contains the bits of all coefficients with the same binary weight $l$. In this way the filter is implemented in a bitplane structure [117]. Each bitplane calculates the output word for the corresponding bit $l$ of the coefficients, and the adder stages compute the final result. A carry-save adder architecture is chosen for the multipliers. 4-to-2 compressors merge the intermediate sum and carry vectors of different weight gained from the bitplanes. The bitplanes are arranged in parallel, so additional buffers due to delay elements are avoided. The final sum output of the last compressor stage is calculated via a Sklansky carry-lookahead adder. A carry-overflow correction [118] is used in the bitplanes to keep the growth of the bit width during the calculations to a minimum. A detailed description of the bitplane structure and the filter design is presented in [10].
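The regrouping of coefficient bits into bitplanes can be checked arithmetically with the following sketch. It assumes unsigned coefficients (the number format is not stated in this section), and the function names are hypothetical. It only demonstrates that the bitplane sums, combined by their binary weights in Horner fashion, give the same result as the direct weighted sum; it does not model the parallel arrangement with compressors used on the chip.

```python
def direct_output(coeffs, window):
    """Direct weighted sum; window[i] holds the input word x[n - i]."""
    return sum(c * x for c, x in zip(coeffs, window))


def bitplane_output(coeffs, window, coeff_bits=8):
    """Same sum, evaluated per bitplane.

    Bitplane l adds up the input words of all coefficients whose bit l is set.
    The planes are combined by the Horner scheme: start with the most
    significant plane, double the running total, and add the next plane.
    Assumes unsigned coefficients smaller than 2**coeff_bits.
    """
    total = 0
    for l in reversed(range(coeff_bits)):
        plane = sum(x for c, x in zip(coeffs, window) if (c >> l) & 1)
        total = 2 * total + plane
    return total
```

With 8 coefficients of 8 bit each, the loop visits 8 bitplanes, and each plane adds at most 8 input words.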

For the FIR filter, the input signal $x[n]$ has a bit width of 16, and 8 coefficients with a width of 8 bit each are used; hence 8 bitplanes are arranged in parallel. The output signal $y[n]$ is also 16 bit wide. As a 16 bit Sklansky CLA is used as vector merging adder, the least significant bits of the output sum and carry vectors from the last compressor stage are truncated. This results in a maximum error of $1 \cdot \mathrm{LSB}$. Figure 7.2 shows the scheme of the adiabatic FIR filter core. Three compressor stages are used to compress the output vectors of the bitplanes, which differ in bit width [10]. The CLA is connected to the last compressor stage to calculate the final sum vector of the highest-order bits.

The filter core is embedded into a static CMOS framework (Fig. 7.3). Input cache and coefficient cache are connected to the adiabatic system via interface circuits as shown in Fig. 7.4. The dual-rail output of the FIR filter is directly sampled by the output cache. Sampling the non-inverted output synchronized to the hold interval leads to the desired conversion from the adiabatic into the static CMOS domain.

Fig. 7.2 Scheme of the adiabatic FIR filter core: carry-save representation of signals is indicated by grey arrows

Fig. 7.3 The adiabatic filter core is embedded into a static CMOS interface

Fig. 7.4 The input words and the coefficients are converted from the static CMOS domain into the adiabatic domain using a static CMOS inverter and a PFAL buffer

Fig. 7.5 Handshake scheme for the serial interface of the test-chip

A serial interface connects the filter core with the lowest possible number of input and output pins and with CMOS compatible input and output signals. The handshaking scheme is sketched in Fig. 7.5.

The interface consists of a cache for 16 input words (incache), a cache for the coefficient bits (coeffcache), and a 16 word output cache (outcache). In setup mode, the cache is connected serially and loaded via $s_{in}$ synchronously to the external clock signal $clk_{ext}$. 64 coefficient bits and $16 \times 16$ bits for the input words are thus loaded serially into the cache. When the system is switched into active mode with the signal ser_par, the words in the input cache are cyclically applied to the input of the FIR filter, and the resulting output words are stored in the output cache. After operation, the system is switched into readout mode. By clocking the cache, the contents of the output cache are transferred to $s_{\text{out}}$. A circuit connected to the outputs of the filter checks the integrity of the adiabatic signals. This circuit is called Adiabatic Signal Test (AST) and indicates that the dual-rail encoding at the output of the filter core is valid.

Fig. 7.6 Photograph of the packaged chip
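The AST circuit itself is not detailed in this section. As a purely behavioural sketch, the check it performs can be pictured as follows, assuming that "valid" means that the two rails of every output bit carry complementary logic values when sampled in the hold interval. The function name and the list representation of the rails are assumptions for illustration.

```python
def dual_rail_valid(out, out_bar):
    """Return True if every bit pair carries complementary logic values.

    out and out_bar are equally long sequences of 0/1 values sampled from
    the non-inverted and the inverted rail of the same output word.
    """
    if len(out) != len(out_bar):
        raise ValueError("both rails must have the same number of bits")
    return all(a != b for a, b in zip(out, out_bar))
```

A word in which any pair shows the same value on both rails is flagged as invalid.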

The first design step is the generation of a standard library for the filter core in the PFAL family. A bottom-up approach is used to design building blocks such as the bitplane slice, the compressors, and the CLA, which in the end are placed and connected via top-level metal interconnects. This approach is identical to the design of complex static CMOS circuits. Then the static CMOS surrounding is constructed from an industrial static CMOS standard library and interfaced to the filter core. As Adiabatic Logic is fabricated in CMOS technology, the whole design process is done with standard industry tools. Only the connection of the clock signals between two blocks needs some manual design overhead: because circuit blocks are encapsulated in the hierarchy, it is not obvious which phase the last stage of a block is connected to. Static CMOS circuits exist, however, that require a similarly detailed knowledge of the clock order within the hierarchy. If latches are used in a design, both the $clk$ and the $\overline{clk}$ signals are applied, and here too the appropriate connection of the right clock signal is a major issue. For this task, an automatic check could parse the netlist to guarantee the right connection of succeeding gates.
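Such an automatic check could look like the following minimal sketch. It assumes a four-phase power-clock in which a gate driven by phase $p$ must be fed only by gates driven by the preceding phase. The dictionary-based netlist format, the function name, and the gate names are hypothetical; a real tool would first have to extract this information from the netlist format of the design flow.

```python
def check_phase_order(gates, phases=4):
    """Return all (driver, receiver) pairs that violate the phase order.

    gates maps a gate name to a tuple (phase_index, list_of_driver_names).
    A receiver on phase p expects every driving gate on phase (p - 1) mod
    phases. Drivers that are not gates themselves (primary inputs) are
    skipped, since their timing is set by the interface circuits.
    """
    violations = []
    for name, (phase, drivers) in gates.items():
        for drv in drivers:
            if drv not in gates:
                continue  # primary input, not checked here
            drv_phase = gates[drv][0]
            if (drv_phase + 1) % phases != phase:
                violations.append((drv, name))
    return violations


# Hypothetical example: a chain of five gates plus one misconnected gate.
example = {
    "g1": (0, ["in"]),
    "g2": (1, ["g1"]),
    "g3": (2, ["g2"]),
    "g4": (3, ["g3"]),
    "g5": (0, ["g4"]),   # wraps around from phase 3 to phase 0
    "bad": (2, ["g1"]),  # skips a phase
}
```

Calling `check_phase_order(example)` reports the pair `("g1", "bad")`, while the wrap-around from the fourth phase back to the first is accepted.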

### 7.2 Measurement Results and Comparison to Static CMOS

The test chip in a 130 nm CMOS technology is bonded in a 40 pin DIL package. A photograph of the packaged chip is presented in Fig. 7.6. First, the function of the chip is verified. A pattern generator applies the handshake sequence and a data stream that is loaded into the cache. Two signal generators are used to generate the four phases of the power-clock signal; each signal generator outputs two phase-shifted power supply phases. The generators are synchronized via an external clock signal supplied by the pattern generator. After a certain period of operation, the output cache is read out and its contents are compared with data gained from a system-level simulation. In this way the functionality of the chip is verified.


Fig. 7.7 Setup for measuring the energy consumption of the chip

The energy is measured (measurement setup photograph in Fig. 7.7) by inserting a measurement resistor $R_{\text{meas}}$, which converts the current flowing into the circuit into a voltage. The voltage $u_{1}(t)$ at the connection of the clock generator to the measurement resistor and the voltage $u_{2}(t)$ at the input of the test chip are measured. The voltage drop $\Delta u(t)=u_{1}(t)-u_{2}(t)$ is proportional to the current, $i(t)=\Delta u(t) / R_{\text{meas}}$. The energy is calculated from these signals: the product $\Delta u(t) \cdot u_{2}(t)$ is proportional to the power $p(t)=u_{2}(t) \cdot i(t)$, and integration over time yields a signal proportional to the energy $e(t)$, with $R_{\text{meas}}$ as the common scaling factor. A plot of the signals displayed by the oscilloscope is shown in Fig. 7.8. The signals at both ends of the measurement resistor, as well as all calculation steps, are plotted. In the plot of the energy signal, the energy recovery of the adiabatic circuit can be observed. During the evaluate interval, energy is fed into the adiabatic circuit; in the hold interval, the energy transfer is strongly reduced; and in the recover interval, energy is returned. The energy dissipated by the circuit is measured between two hold intervals.
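The calculation steps of the measurement can be reproduced offline from sampled oscilloscope traces. The sketch below is a minimal illustration, not the procedure used on the oscilloscope: the function name, the list-based trace format, and the trapezoidal integration are assumptions. The caller is expected to pass traces that span the interval between two hold intervals.

```python
def dissipated_energy(t, u1, u2, r_meas):
    """Integrate p(t) = u2(t) * (u1(t) - u2(t)) / r_meas over the trace.

    t      sample times in seconds (ascending)
    u1     voltage at the generator side of the measurement resistor in volts
    u2     voltage at the input of the chip in volts
    r_meas measurement resistance in ohms
    Returns the net energy in joules that flowed into the chip. Negative
    power samples (energy flowing back to the generator) reduce the total.
    """
    if not (len(t) == len(u1) == len(u2)):
        raise ValueError("all traces must have the same length")
    power = [v2 * (v1 - v2) / r_meas for v1, v2 in zip(u1, u2)]
    energy = 0.0
    for k in range(1, len(t)):
        # trapezoidal rule between neighbouring samples
        energy += 0.5 * (power[k] + power[k - 1]) * (t[k] - t[k - 1])
    return energy
```

Because the sign of the voltage drop is kept, recovered energy is subtracted automatically, so the result over one full power-clock period is the dissipated energy.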

Pre-layout simulation results for the adiabatic FIR filter at $V_{DD}=1.2 \mathrm{~V}$ and the corresponding measurement results are presented in Fig. 7.9. For $1 \mathrm{MHz}$, $10 \mathrm{MHz}$, and $20 \mathrm{MHz}$, the energy consumption of three test chips is measured and the mean value is plotted. At a frequency of 1 MHz, the variation of the energy across the measured test chips is quite large. Here the circuit is operated in the frequency regime that is dominated by leakage currents. As leakage currents depend exponentially on variations in the threshold voltage, these deviations can be explained by die-to-die variations. Nevertheless, the mean value at 1 MHz matches the value gained by simulation. The three measurement values at 10 MHz and at 20 MHz agree well; the maximum deviation of a measurement from the mean value amounts to $3 \%$ and $8 \%$ for 10 MHz and 20 MHz, respectively. Good agreement is observed between simulation and measurements: measurement values are $26 \%$ and $34 \%$ above the simulation result for 10 MHz and 20 MHz, respectively. The higher measured values result from the parasitic capacitances due to the layout, which are not considered in the simulation.

Fig. 7.8 Plot of the oscilloscope signals

Fig. 7.9 Simulated values for the adiabatic FIR filter with a supply voltage $V_{DD}=1.2 \mathrm{~V}$. Mean measurement values are plotted. Measurement results fit the simulation values quite well. An increased energy dissipation is expected in the adiabatic frequency regime, as no interconnects are considered in the simulation

Fig. 7.10 Voltage scaling measurements for the test chip at an operating frequency of 10 MHz

Limited by the pulse generators and the measurement setup, values for $f>20 \mathrm{MHz}$ could not be obtained. Thus an extrapolation based on the energy measurement results is carried out for $f=100 \mathrm{MHz}$. For this purpose it is assumed that the energy consumption of the chip at 10 MHz and 20 MHz is determined mainly by adiabatic losses, so that the consumed energy depends linearly on $f$. A linear fit results in the extrapolated mean value plotted in Fig. 7.9. The mean value of the extrapolated measurement results is about $90 \%$ above the simulated energy dissipation.
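The extrapolation step can be sketched as an ordinary least-squares line through the measured points, evaluated at the target frequency. The exact fitting procedure is not specified in the text, so the choice of a line with intercept is an assumption, as are the function name and the energy values in the example, which are placeholders and not measurement data.

```python
def linear_extrapolate(freqs, energies, f_target):
    """Fit energy = intercept + slope * f by least squares and evaluate it.

    freqs and energies are equally long sequences; at least two different
    frequencies are required.
    """
    n = len(freqs)
    mean_f = sum(freqs) / n
    mean_e = sum(energies) / n
    sxx = sum((f - mean_f) ** 2 for f in freqs)
    if sxx == 0:
        raise ValueError("need at least two different frequencies")
    sxy = sum((f - mean_f) * (e - mean_e) for f, e in zip(freqs, energies))
    slope = sxy / sxx
    intercept = mean_e - slope * mean_f
    return intercept + slope * f_target


# Placeholder data: three chips at 10 MHz and three at 20 MHz, arbitrary units.
freqs_mhz = [10, 10, 10, 20, 20, 20]
energy_au = [1.0, 1.1, 0.9, 2.0, 2.1, 1.9]
estimate_100mhz = linear_extrapolate(freqs_mhz, energy_au, 100)
```

With only two measured frequencies the fit passes through the two mean values, so the extrapolated figure depends strongly on the assumption that adiabatic losses dominate at both points.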

The plot also gives simulation results for a comparable static CMOS FIR filter in the frequency regime of major interest. This design is also based on a CSA bitplane structure. A difference lies in the way intermediate carry and sum vectors are calculated: in contrast to the adiabatic filter core, the bitplanes are arranged serially, so no compressor stages are used. The final vector merging adder is a Han-Carlson structure. The filter specifications with respect to input, coefficient, and output bit widths are equal to those of the adiabatic design. Here, too, no layout parasitics are taken into account, so the simulation results are a lower bound. Additionally, investigations in [1] show that the clock network can contribute up to $50 \%$ to the overall consumption in high-performance designs due to parasitics and buffer insertion. For the static CMOS design, two supply voltages are used in the simulation, so that voltage scaling, which lowers the energy consumption, is also taken into account.

Voltage scaling is measured for the adiabatic test chip at a frequency of 10 MHz down to a supply voltage of 0.8 V. The function of the circuit is verified for this low supply voltage. The mean value of the measured energy of all three chips, referenced to the energy dissipation at $V_{DD}=1.2 \mathrm{~V}$, is plotted in Fig. 7.10. At a supply voltage of 0.8 V, the energy is reduced by $26 \%$ compared to the energy consumed at 1.2 V.

A summary of the energy saving factors gained from the comparison is presented in Table 7.1. Simulation values of the static CMOS implementation for $V_{DD}=0.8 \mathrm{~V}$ and $V_{DD}=1.2 \mathrm{~V}$ are compared to the simulation values of the Adiabatic Logic (AL) chip at $V_{DD}=1.2 \mathrm{~V}$ in the first row. Measurement values of AL at $V_{DD}=1.2 \mathrm{~V}$ and at $V_{DD}=0.8 \mathrm{~V}$ are compared in the second and third row. The voltage scaling values for AL at 20 MHz and 100 MHz (printed italic in Table 7.1) are derived from the measurement values at 1.2 V under the assumption that they are reduced by the same ratio as obtained from the voltage scaling measurements at 10 MHz. Results at 100 MHz are based on extrapolated measurement results and are marked with an asterisk ($*$).

Table 7.1 Overview of the ESF gained at different frequencies and voltage constellations. For comparison, a static CMOS filter was also designed and simulated. Mean values gained from measurements in AL are compared to the simulation results of the static CMOS filter. ESF values gained from extrapolation in AL are marked with the asterisk ($*$). Voltage scaling extrapolations are printed italic in the table

| Energy Saving Factor | AL $V_{DD}$ | Static CMOS (sim.), 0.8 V, 10 MHz | Static CMOS (sim.), 0.8 V, 20 MHz | Static CMOS (sim.), 0.8 V, 100 MHz | Static CMOS (sim.), 1.2 V, 10 MHz | Static CMOS (sim.), 1.2 V, 20 MHz | Static CMOS (sim.), 1.2 V, 100 MHz |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| AL (sim.) | 1.2 V | 4.5 | 4.6 | 3.3 | 11.1 | 10.8 | 8.1 |
| AL (meas.) | 1.2 V | 3.5 | 3.5 | 1.8* | 8.8 | 8.0 | 4.3* |
| AL (meas.) | 0.8 V | 5.0 | _4.67_ | _2.38*_ |  |  |  |

If both circuits are operated at $V_{DD}=1.2 \mathrm{~V}$, the simulation shows an ESF of 11.1, 10.8, and 8.1 for $10 \mathrm{MHz}$, $20 \mathrm{MHz}$, and $100 \mathrm{MHz}$, respectively. Comparing static CMOS simulations with AL measurements gives lower ESFs of 8.8, 8.0, and 4.3, respectively, as the measured values also include parasitics. As parasitics will also introduce additional losses in the static CMOS design, the real ESF will lie between the values gained by simulation only and those gained by comparing static CMOS simulations with AL measurements. With respect to the value extracted from measurements, the ESF at 100 MHz is 4.3 when both circuits operate with $V_{DD}=1.2 \mathrm{~V}$. If voltage scaling is introduced in static CMOS only, and its supply voltage is reduced to 0.8 V, the ESF is reduced. According to simulation and extrapolation of the measurement, at 100 MHz the static CMOS counterpart still consumes 1.8 times the energy. However, Adiabatic Logic is also functional at reduced supply voltages. The scaled measurement results of Adiabatic Logic lead to improved ESFs of 5.0, 4.67, and 2.38 at $10 \mathrm{MHz}$, $20 \mathrm{MHz}$, and $100 \mathrm{MHz}$, respectively.
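The relation between the simulated and the measured ESF rows can be retraced from the figures quoted in this section. Since the ESF is the static CMOS energy divided by the AL energy, an AL measurement that lies a fraction above the AL simulation lowers the ESF by the factor one plus that fraction. The sketch below applies the quoted excess values of 26 %, 34 %, and about 90 % to the simulated row. The variable and function names are assumptions, and the check is only a plausibility test on rounded published numbers.

```python
# ESF of simulated AL (1.2 V) versus simulated static CMOS, from Table 7.1.
# Keys: (static CMOS supply voltage in V, frequency in MHz).
ESF_SIM = {
    (0.8, 10): 4.5, (0.8, 20): 4.6, (0.8, 100): 3.3,
    (1.2, 10): 11.1, (1.2, 20): 10.8, (1.2, 100): 8.1,
}
# ESF of measured AL (1.2 V) versus simulated static CMOS, from Table 7.1.
ESF_MEAS = {
    (0.8, 10): 3.5, (0.8, 20): 3.5, (0.8, 100): 1.8,
    (1.2, 10): 8.8, (1.2, 20): 8.0, (1.2, 100): 4.3,
}
# Measured AL energy above the simulated value, per frequency in MHz.
EXCESS = {10: 0.26, 20: 0.34, 100: 0.90}


def esf_from_simulation(esf_sim, excess):
    """Rescale a simulated ESF by the excess energy seen in the measurement."""
    return esf_sim / (1.0 + excess)


def max_table_deviation():
    """Largest gap between the rescaled and the tabulated measured ESF."""
    return max(
        abs(esf_from_simulation(esf, EXCESS[freq]) - ESF_MEAS[(vdd, freq)])
        for (vdd, freq), esf in ESF_SIM.items()
    )
```

The rescaled values stay within 0.1 of the tabulated measured ESFs, a gap that is plausibly due to the rounding of the published figures.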

Finally, the adiabatic FIR test chip shows that, with industrial tools, a large-scale adiabatic system can be implemented within a design time comparable to static CMOS. For the correct sequencing of the clock signals within a large system, a program capable of parsing the netlist and checking the progression of the clocks would ease the design phase. The filter core is verified with respect to functionality, and measurement results are in good accordance with the simulation values for a supply voltage of $V_{DD}=1.2 \mathrm{~V}$. Results are compared to a static CMOS counterpart, resulting in an energy saving factor of more than 4 at $f=100 \mathrm{MHz}$. Even if the voltage is reduced to 0.8 V for both designs, the adiabatic filter core saves almost $60 \%$ of the energy consumed in the static CMOS filter at 100 MHz. All results are lower limits, as comparisons are based on simulations of the static CMOS core without layout parasitics. Overall, a large-scale adiabatic system was integrated with industrial design tools, and the ultra-low power consumption was verified by means of measurement results.

This proves that Adiabatic Logic is an ultra-low-power circuit topology with major savings compared to static CMOS designs. Due to the use of CMOS technology, Adiabatic Logic is designed and fabricated within an industrial design flow without any modifications.
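The closing figures of this chapter mix energy saving factors and percentage savings. Because the ESF is the ratio of static CMOS energy to AL energy, the two are linked by a simple conversion, sketched below with a hypothetical helper name.

```python
def saving_fraction(esf):
    """Fraction of the static CMOS energy that is saved for a given ESF."""
    if esf <= 0:
        raise ValueError("ESF must be positive")
    return 1.0 - 1.0 / esf
```

For the ESF of 2.38 at 100 MHz with both designs at 0.8 V, `saving_fraction(2.38)` evaluates to roughly 0.58, in line with the saving of almost 60 % stated above, and an ESF of 4.3 corresponds to a saving of more than 75 %.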

## Chapter 8 Conclusions

Adiabatic Logic's saving potential over static CMOS depends on the future development of devices and on the overall savings gained on system level. This work identifies several factors that determine the savings of Adiabatic Logic on circuit and system level and in future technologies.

Scaling Trend and Novel Devices. Future scaling of devices affects the saving potential as well as the optimum operating frequency $f_{\text{opt}}$ of adiabatic circuits. Comparisons are carried out between static CMOS and two adiabatic families, i.e. Efficient Charge Recovery Logic (ECRL) and Positive Feedback Adiabatic Logic (PFAL), for sub-90 nm technologies. For 65 nm an industrial process is used, whereas for 45 nm down to 16 nm the Predictive Technology Model (PTM) is used. Exemplary results show a maximum Energy Saving Factor (ESF) of 3.62 for ECRL and 6.68 for PFAL at the 22 nm PTM node. Besides the shrinking of devices, which will continue for the foreseeable future, novel device topologies will affect the applicability of Adiabatic Logic (AL). Theoretical considerations on the impact of reduced resistance and capacitance on the ESF and $f_{\text{opt}}$ lead to the conclusion that reducing the capacitance does not change the maximum ESF, but improves the savings in the adiabatic regime and shifts $f_{\text{opt}}$ to higher frequencies. In contrast, reducing the device's resistance improves both the maximum ESF and $f_{\text{opt}}$. Revolutionary transistor topologies such as the Carbon Nanotube (CNT) based transistor (CNTFET) and the Vertical Slit Field Effect Transistor (VESFET) were presented in recent years. Both show promising device characteristics for AL: the VESFET due to its decreased intrinsic capacitance, and the CNTFET due to quasi-ballistic transport and thus an on-resistance close to the physical limits. Both transistor concepts boost the ESF, and the CNTFET's superior on-resistance also shifts the optimum operating frequency in AL to higher frequency regimes, i.e. into the region of 500 MHz.
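The qualitative statements on resistance and capacitance can be retraced with a first-order energy model. The model below is a simplified sketch and not necessarily the formulation used in the earlier chapters: the constants $\kappa$ and $\beta$, the lumped quantities $R$, $C$, and $I_{\text{leak}}$, and the assumption that the leakage current does not change with $R$ or $C$ are introduced only for this illustration. The adiabatic loss per cycle is taken as proportional to $f$, the leakage energy per cycle as proportional to $1/f$, and the static CMOS energy per cycle as independent of $f$:

$$
E_{AL}(f) \approx \kappa R C^{2} V_{DD}^{2} f+\frac{V_{DD} I_{\text{leak}}}{f}, \qquad E_{CMOS} \approx \beta C V_{DD}^{2}
$$

Setting the derivative of $E_{AL}$ with respect to $f$ to zero gives the optimum frequency, the minimum energy, and the maximum ESF:

$$
f_{\text{opt}}=\frac{1}{C} \sqrt{\frac{I_{\text{leak}}}{\kappa R V_{DD}}}, \qquad E_{AL, \min }=2 C V_{DD} \sqrt{\kappa R V_{DD} I_{\text{leak}}}, \qquad ESF_{\max }=\frac{E_{CMOS}}{E_{AL, \min }}=\frac{\beta}{2} \sqrt{\frac{V_{DD}}{\kappa R I_{\text{leak}}}}
$$

Within this model, $C$ cancels in $ESF_{\max}$ while $f_{\text{opt}}$ scales with $1/C$, and a lower $R$ raises both $ESF_{\max}$ and $f_{\text{opt}}$. Well above $f_{\text{opt}}$ the ESF approaches $\beta /(\kappa R C f)$, so a smaller capacitance improves the savings in the adiabatic regime.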

Besides the desired effects of shrinking, some disadvantages are also observed. In addition to leakage, degradation effects worsen the reliability of integrated circuits. Hot Carrier Injection is not an issue in AL, as the working principle avoids current flow while a high drain-to-source voltage is applied to the devices. Intrinsically, the power-clock is shut down for a quarter of the power-clock period, so the effect of Bias Temperature Instability (BTI) is smaller in AL than in static CMOS. Due to degraded transistors, the energy consumption in the leakage regime is decreased, whereas in the adiabatic regime the increased on-resistance due to BTI increases the energy consumed. Overall the minimum energy is decreased, but $f_{\text{opt}}$ is shifted to somewhat lower frequencies. Finally, BTI will not lead to malfunction in AL as long as the circuit is operated at frequencies chosen for minimum energy operation.

Efficient Power-Clock Generation. Generating the power-clock is a main concern when the overall efficiency of the system is considered. A synchronous 2N2P oscillator shows the best efficiency values for generating the four-phase power-clock and proves to be tolerant to capacitive fluctuations caused by the digital patterns in the logic. The energy introduced by the logic generating the synchronizing signals is negligible. A main power loss is due to driving the transistors in the oscillator. It is seen that losses in the drivers can be greatly reduced without increasing the energy consumed by the oscillator too much. Overall efficiency values of more than $60 \%$ for AL systems are achievable with the synchronous 2N2P oscillator.

Cutting Down the Power in AL by a Shut-down Mode. Powering down Adiabatic Logic in idle times leads to further savings. First, the use of a power-down switch disconnecting the oscillator from the circuit is discussed. The optimum position for inserting the power switch depends on the load and on the wire length and width. Though this straightforward method leads to major savings, cutting off parts of the circuit from the oscillator also has disadvantages. The LC tank is changed and the resonance frequency is detuned. As a result, an asynchronous oscillator would change the operating frequency, while a synchronous oscillator experiences a mismatch between the resonance frequency of the LC tank and the synchronization frequency, leading to greatly increased energy consumption due to the forced oscillation. Shutting down single oscillators, connected either to a part of the system or to the whole system, shows promising results and is preferable to proposals that cause detuning of the LC tank. Different power-down modes are presented and rated according to the switching overhead, the achievable energy reduction, and the possibility to retain data in shut-down. Mode 1 dissipates the energy stored in the LC tank by connecting the power-down transistors in the oscillator to ground potential. It reduces the energy by more than $75 \%$ with respect to a system without shut-down when on and off times of $T_{on}=T_{off}=1 \mu \mathrm{s}$ are assumed. Even for very short power-down times of $T_{on}=T_{off}=250 \mathrm{~ns}$, an energy reduction of almost $50 \%$ is possible. A hybrid mode is proposed that, besides the energy savings due to the shut-down, also retains the data stored in the processing circuit. With this mode instant power-down is possible, as no data is lost while in power-down.

Designing Efficient Arithmetic Structures in Adiabatic Logic. Besides the efficient generation of the power-clock and mechanisms to save energy in idle states, the choice of the signal processing architecture is of great importance. Inherent micropipelining makes it possible to construct structures without any consideration of critical paths and correct timing. Arithmetic systems constructed from Adiabatic Logic are by construction pipelined on gate level. Nevertheless, care has to be taken when choosing an appropriate architecture. Ripple Carry Adder (RCA), Carry Select Adder (CSEA), and diverse Parallel Prefix Adders (PPA) are rated in area and energy consumption versus the desired bit width. RCAs are the most energy- and area-efficient structures for input bit widths below 20 bit. Higher bit widths call for PPA implementations in Adiabatic Logic. Han-Carlson and Sklansky PPA structures are superior to other PPA structures with respect to energy consumed and active gate area. Comparisons by means of simulations of a Han-Carlson PPA in static CMOS and AL, implemented with industrial 130 nm BSIM models, show that the static CMOS implementation consumes more than 15 times the energy of the AL implementation when both are operated with the nominal supply voltage of $V_{DD}=1.2 \mathrm{~V}$ and at an operating frequency of 100 MHz. Even with aggressive voltage scaling (static CMOS with $V_{DD}=0.6 \mathrm{~V}$ and AL with $V_{DD}=0.8 \mathrm{~V}$), static CMOS still consumes almost 5 times the energy. A $V_{DD}$ of 0.6 V for static CMOS leaves only tiny safety margins for the critical path to cope with PVT variations, and is thus a critical supply voltage scenario. Further advantages can be derived from the inherent properties of AL when arithmetic structures are constructed. A larger transistor stack can be implemented in AL without provoking timing errors. By designing complex gates, the latency, area, and energy consumption can be reduced all at once. An FA-$M$bit cell is proposed that combines $M$ bits of an input word into the corresponding sum bits and the carry bit in one step. In a 12 bit ECRL RCA structure, an FA-4bit cell reduces the latency, the energy, and the area to 0.25, 0.39, and 0.43, respectively, of the values of an RCA utilizing the regular FA (FA-1bit) cell. Very complex gates can thus be devised without timing errors, and major savings in latency, energy, and area are gained.

The COordinate Rotation DIgital Computer (CORDIC) is perfectly suited to Adiabatic Logic, as it combines nested adders without excessive buffer overhead. Comparisons for a 12 bit CORDIC-based Discrete Cosine Transform (DCT) in 130 nm CMOS, with a typical input pattern, show an ESF of more than 11 against the static CMOS implementation under nominal supply voltage ($V_{DD}=1.2 \mathrm{~V}$). Even for a reduced voltage of 0.8 V the ESF is more than 7.

Measurement Results. To support the statement that Adiabatic Logic systems can be designed with industrial static CMOS design flows, and to derive the ESF for a large-scale system from measurements, a test chip was designed and fabricated in a 130 nm CMOS technology. The chip contains an FIR filter built in a bitplane structure. The input width is 16 bit, and 8 coefficient words of 8 bit each are implemented. Measurements confirm that the chip is functional down to 0.8 V. Measurement data is compared to the simulated data and shows good agreement within the observed frequency regime. Extrapolation of the measured data shows increased energy consumption in the adiabatic regime with respect to the simulation data, which is explained by missing post-layout data in the simulations. Simulation data for a comparable static CMOS design are set against these results, and the ESF is derived for different supply voltages. At 10 MHz, measurement data in AL compared to simulation results in static CMOS result in an ESF of 8.8 at $V_{DD}=1.2 \mathrm{~V}$ and of 5 when both designs are supplied with $V_{DD}=0.8 \mathrm{~V}$. All values derived have to be considered lower bounds, as no layout parasitics are included in the simulation results of static CMOS. These measurements prove that Adiabatic Logic can be implemented with industrial tools developed for static CMOS, and demonstrate silicon-proven ESFs of at least 5 even with reduced supply voltage.

Final and Overall Conclusions. From the results of the investigations presented in this work, design rules for the efficient implementation of adiabatic systems can be derived. Energy-efficient generation of the power-clock, a methodology to shut down adiabatic blocks during idle times, and arithmetic structures that profit from the inherent properties of AL are fundamental to obtaining adiabatic systems that can not only compete with, but outperform state-of-the-art static CMOS circuit designs. The measurement results presented prove that the design and implementation of large-scale systems composed of ECRL or PFAL gates is achievable with industrial design automation tools developed for static CMOS, and that measured savings in large-scale systems comply with the expectations derived from gate and system level investigations. Finally, future technology nodes and devices perform quite well with Adiabatic Logic. Hence, Adiabatic Logic is prepared for a successful journey into the next decades.

## Bibliography

1. V. Tiwari, D. Singh, S. Rajgopal, G. Mehta, R. Patel, F. Baez, Reducing power in high-performance microprocessors, in Proceedings of the 35th Annual Design Automation Conference. San Francisco, CA (ACM, New York, 1998), pp. 732-737
2. R.K. Krishnamurthy, A. Alvandpour, S. Mathew, M. Anders, V. De, S. Borkar, High-performance, low-power, and leakage-tolerance challenges for sub-70nm microprocessor circuits, in Proceedings of the 28th European Solid-State Circuits Conference, 2002, pp. 315-321
3. R.K. Krishnamurthy, S.K. Mathew, M.A. Anders, S.K. Hsu, H. Kaul, S. Borkar, High-performance and low-voltage challenges for sub-45nm microprocessor circuits, in 6th International Conference on ASIC, vol. 1, 2005, pp. 283-286
4. D. Pham, M. Alexander, A. Arizpe, B. Burgess, C. Dietz, L. Eisen, R. El-Kareh, J. Eno, S. Gary, G. Gerosa, B. Goins, J. Golab, R. Golla, R. Harris, B. Ho, Y.-W. Ho, K. Hoover, C. Hunter, P. Ippolito, R. Jessani, J. Kahle, K.R. Kishore, B. Kuttanna, S. Litch, S. Mallick, T. Ngo, D. Ogden, C. Olson, S.-H. Park, R. Patel, M. Pham, J. Prado, S. Reeve, R. Reininger, H. Sanchez, M. Schiffli, J. Slaton, G. Thuraisingham, K. Torku, C. Tran, N. Vanderschaaf, P. Voldstad, A 3.0 W 75SPECint92 85SPECfp92 superscalar RISC microprocessor, in IEEE International Solid-State Circuits Conference, 1994, pp. 212-213
5. L. Benini, P. Siegel, G. De Micheli, Saving power by synthesizing gated clocks for sequential circuits. IEEE Design \& Test of Computers 11(4), 32-41 (1994)
6. W.P. Maly, Integrated circuit, device, system, and method of fabrication. US Patent PCT/US2007/011630, 2007
7. W. Maly, Y.-W. Lin, M. Marek-Sadowska, OPC-Free and Minimally Irregular IC Design Style, in Proc. 44th ACM/IEEE Design Automation Conference DAC '07, 4-8 June 2007, pp. 954-957
8. W. Maly, A. Pfitzner, Complementary vertical slit field effect transistors, Technical Report No. CSSI 08-02, CSSI, Carnegie Mellon University, January 2008
9. E. Amirante, Adiabatic Logic in Sub-quartermicron CMOS Technologies. Selected Topics of Electronics and Micromechatronics, vol. 13 (Shaker, Aachen, 2004)
10. J. Fischer, Adiabatische Schaltungen und Systeme in Deep-Submicron-CMOS-Technologien. Selected Topics of Electronics and Micromechatronics, vol. 24 (Shaker, Aachen, 2006)
11. E. Amirante, J. Fischer, M. Lang, A. Bargagli-Stoffi, J. Berthold, C. Heer, D. Schmitt-Landsiedel, An ultra low-power adiabatic adder embedded in a standard $0.13 \mu \mathrm{m}$ CMOS environment, in Proceedings of the 29th European Solid-State Circuits Conference, 2003, pp. 599-602
12. R. Landauer, Irreversibility and heat generation in the computing process. IBM J. Res. Dev. 5, 183-191 (1961)
13. C.H. Bennett, Logical reversibility of computation. IBM J. Res. Dev. 17(6), 525-532 (1973)
14. E. Fredkin, T. Toffoli, Conservative logic. Int. J. Theor. Phys. 21(3-4), 219-253 (1982)
15. J.S. Hall, An electroid switching model for reversible computer architectures, in Workshop on Physics and Computation, 1992, pp. 237-247
16. J.G. Koller, W.C. Athas, Adiabatic switching, low energy computing, and the physics of storing and erasing information, in Proc. Workshop on Physics and Computation, 1992, pp. 267-270
17. A. Kramer, J.S. Denker, B. Flower, J. Moroney, 2nd order adiabatic computation with 2N-2P and 2N-2N2P logic circuits, in Proceedings of the International Symposium on Low Power Design (ACM, New York, 1995), pp. 191-196
18. A. Vetuli, S.D. Pascoli, L.M. Reyneri, Positive feedback in adiabatic logic. Electron. Lett. 32(20), 1867-1869 (1996)
19. Y. Moon, D.-K. Jeong, An efficient charge recovery logic circuit. IEEE J. Solid-State Circuits 31(4), 514-522 (1996)
20. V.G. Oklobdzija, D. Maksimovic, F. Lin, Pass-transistor adiabatic logic using single power-clock supply. IEEE Trans. Circuits Syst. II, Analog Digit. Signal Process. 44(10), 842-846 (1997)
21. D. Maksimovic, V.G. Oklobdzija, B. Nikolic, K.W. Current, Clocked CMOS adiabatic logic with integrated single-phase power-clock supply: experimental results, in Proc. International Symposium on Low Power Electronics and Design, 1997, pp. 323-327
22. S. Kim, M.C. Papaefthymiou, True single-phase energy-recovering logic for low-power, high-speed VLSI, in Proc. International Symposium on Low Power Electronics and Design, 1998, pp. 167-172
23. C. Kim, S.-M. Yoo, S.-M.S. Kang, Low-power adiabatic computing with NMOS energy recovery logic. Electron. Lett. 36(16), 1349-1350 (2000)
24. H. Jianping, C. Lizhang, L. Xiao, A new type of low-power adiabatic circuit with complementary pass-transistor logic, in 5th International Conference on ASIC, vol. 2, 2003, pp. 1235-1238
25. V.S. Sathe, M.C. Papaefthymiou, C.H. Ziesler, A GHz-class charge recovery logic, in Proc. International Symposium on Low Power Electronics and Design, 2005, pp. 91-94
26. V.G. Moshnyaga, K. Tamaru, A comparative study of switching activity reduction techniques for design of low-power multipliers, in Proc. IEEE International Symposium on Circuits and Systems, vol. 3, 1995, pp. 1560-1563
27. L. Heller, W. Griffin, J. Davis, N. Thoma, Cascode voltage switch logic: A differential CMOS logic family, in Proc. IEEE International Solid-State Circuits Conference, 1984, pp. 16-17
28. K.M. Chu, D.L. Pulfrey, Design procedures for differential cascode voltage switch circuits. IEEE J. Solid-State Circuits 21(6), 1082-1087 (1986)
29. N. Weste, D. Harris, CMOS VLSI Design-A Circuits and Systems Perspective, 3rd edn. (Addison-Wesley, Reading, 2005)
30. J. Fischer, E. Amirante, T. Nirschl, P. Teichmann, D. Schmitt-Landsiedel, S. Henzler, Impact of process parameter variations on the energy dissipation in adiabatic logic, in Proc. European Conference on Circuit Theory and Design, vol. 3, 2005, pp. 429-432
31. J. Rabaey, A. Chandrakasan, B. Nikolic, Digital Integrated Circuits: A Design Perspective (Prentice Hall, Englewood Cliffs, 2003)
32. A. Blotti, S. Di Pascoli, R. Saletti, Simple model for positive-feedback adiabatic logic power consumption estimation. Electron. Lett. 36(2), 116-118 (2000)
33. A.P. Chandrakasan, S. Sheng, R.W. Brodersen, Low-power CMOS digital design. IEEE J. Solid-State Circuits 27(4), 473-484 (1992)
34. T. Sakurai, CMOS inverter delay and other formulas using $\alpha$-power law MOS model, in Proc. IEEE International Conference on Computer-Aided Design, 1988, pp. 74-77
35. T. Sakurai, A.R. Newton, Alpha-power law MOSFET model and its applications to CMOS inverter delay and other formulas. IEEE J. Solid-State Circuits 25(2), 584-594 (1990)
36. K. Usami, M. Horowitz, Clustered voltage scaling technique for low-power design, in Proceedings of the International Symposium on Low Power Design, Dana Point, CA (ACM, New York, 1995), pp. 3-8
37. F. Moll, E. Isern, E. Sicard, A. Rubio, Analysis of cross-talk effects on logic cell delays in CMOS integrated circuits, in Proceedings of the 34th Midwest Symposium on Circuits and Systems, vol. 1, 1991, pp. 387-390
38. J.R. Black, Electromigration: A brief survey and some recent results. IEEE Trans. Electron Devices 16(4), 338-347 (1969)
39. B.-K. Liew, N.W. Cheung, C. Hu, Projecting interconnect electromigration lifetime for arbitrary current waveforms. IEEE Trans. Electron Devices 37(5), 1343-1351 (1990)
40. S. Nakata, The stability of adiabatic reversible logic using asymmetric tank capacitors and its application to SRAM. IEICE Electron. Express 2(20), 512-518 (2005)
41. G.E. Moore, Cramming more components onto integrated circuits. Electronics 38(8), 114-117 (1965)
42. D. Hecht, Properties and applications of carbon nanotube films: a revolutionary material for transparent and flexible electronics, PhD thesis, University of California Los Angeles, 2007
43. P. Avouris, J. Appenzeller, V. Derycke, R. Martel, S. Wind, Carbon nanotube electronics, in Digest. International Electron Devices Meeting, 2002, pp. 281-284
44. P. Avouris, J. Appenzeller, R. Martel, S.J. Wind, Carbon nanotube electronics. Proc. IEEE 91(11), 1772-1784 (2003)
45. J. Appenzeller, J. Knoch, R. Martel, V. Derycke, S.J. Wind, P. Avouris, Carbon nanotube electronics. IEEE Trans. Nanotechnol. 1(4), 184-189 (2002)
46. A. Javey, J. Guo, Q. Wang, M. Lundstrom, H. Dai, Ballistic carbon nanotube field-effect transistors. Nature 424(6949), 654-657 (2003)
47. C. Hu, S.C. Tam, F.-C. Hsu, P.-K. Ko, T.-Y. Chan, K.W. Terrill, Hot-electron-induced MOSFET degradation model, monitor, and improvement. IEEE Trans. Electron Devices 32(2), 375-385 (1985)
48. K.O. Jeppson, C.M. Svensson, Negative bias stress of MOS devices at high electric fields and degradation of MNOS devices. J. Appl. Phys. 48(5), 2004-2014 (1977)
49. G. Bersuker, J. Sim, C.S. Park, C. Young, S. Nadkarni, R. Choi, B.H. Lee, Mechanism of electron trapping and characteristics of traps in $\mathrm{HfO}_{2}$ gate stacks. IEEE Trans. Device Mater. Reliab. 7(1), 138-145 (2007)
50. S. Pae, M. Agostinelli, M. Brazier, R. Chau, G. Dewey, T. Ghani, M. Hattendorf, J. Hicks, J. Kavalieros, K. Kuhn, M. Kuhn, J. Maiz, M. Metz, K. Mistry, C. Prasad, S. Ramey, A. Roskowski, J. Sandford, C. Thomas, J. Thomas, C. Wiegand, J. Wiedemer, BTI reliability of 45 nm high-K + metal-gate process technology, in IEEE International Reliability Physics Symposium, 2008, pp. 352-357
51. C. Schluender, R. Brederlow, P. Wieczorek, C. Dahl, J. Holz, M. Roehner, S. Kessel, V. Herold, K. Goser, W. Weber, R. Thewes, Trapping mechanisms in negative bias temperature stressed p-MOSFETs. Microelectron. Reliab. 39(6-7), 821-826 (1999)
52. C. Schluender, W. Heinrigs, W. Gustin, H. Reisinger, On the impact of the NBTI recovery phenomenon on lifetime prediction of modern p-MOSFETs, in IEEE International Integrated Reliability Workshop (Final Report), 2006, pp. 1-4
53. V. Huard, C. Parthasarathy, N. Rallet, C. Guerin, M. Mammase, D. Barge, C. Ouvrard, New characterization and modeling approach for NBTI degradation from transistor to product level, in IEEE International Electron Devices Meeting, 2007, pp. 797-800
54. V. Huard, C.R. Parthasarathy, A. Bravaix, T. Hugel, C. Guerin, E. Vincent, Design-in-reliability approach for NBTI and hot-carrier degradations in advanced nodes. IEEE Trans. Device Mater. Reliab. 7(4), 558-570 (2007)
55. International technology roadmap for semiconductors-2008 update, http://www.itrs.net/Links/2008ITRS/Home2008.htm
56. Y. Cao, T. Sato, M. Orshansky, D. Sylvester, C. Hu, New paradigm of predictive MOSFET and interconnect modeling for early circuit simulation, in Proc. of the IEEE Custom Integrated Circuits Conference, 2000, pp. 201-204
57. W. Zhao, Y. Cao, New generation of predictive technology model for sub-45nm design exploration, in Proc. 7th International Symposium on Quality Electronic Design, 2006, pp. 585-590
58. Arizona State University predictive technology model, http://www.eas.asu.edu/~ptm/
59. International technology roadmap for semiconductors-2009 edition (2009), http://www.itrs.net/Links/2009ITRS/Home2009.htm
60. R. Chau, S. Datta, A. Majumdar, Opportunities and challenges of III-V nanoelectronics for future high-speed, low-power logic applications, in IEEE Compound Semiconductor Integrated Circuit Symposium, 2005
61. P. Teichmann, J. Fischer, E. Amirante, S. Henzler, A. Bargagli-Stoffi, C. Otte, D. Schmitt-Landsiedel, Gate leakage reduction by clocked power supply of adiabatic logic circuits. Adv. Radio Sci. 3(14), 281-285 (2005)
62. N. Hamada, S.-i. Sawada, A. Oshiyama, New one-dimensional conductors: Graphitic microtubules. Phys. Rev. Lett. 68(10), 1579-1581 (1992)
63. A. Javey, H. Dai, Carbon nanotube electronics, in 19th International Conference on VLSI Design, 2006
64. R. Martel, V. Derycke, C. Lavoie, J. Appenzeller, K.K. Chan, J. Tersoff, P. Avouris, Ambipolar electrical transport in semiconducting single-wall carbon nanotubes. Phys. Rev. Lett. 87(25), 256805 (2001)
65. A. Raychowdhury, K. Roy, Carbon nanotube electronics: Design of high-performance and low-power digital circuits. IEEE Trans. Circuits Syst. I 54(11), 2391-2401 (2007)
66. V. Derycke, R. Martel, J. Appenzeller, P. Avouris, Carbon nanotube inter- and intramolecular logic gates. Nano Lett. 1(9), 453-456 (2001)
67. J. Deng, A. Lin, G.C. Wan, H.-S.P. Wong, Carbon nanotube transistor compact model for circuit design and performance optimization. J. Emerg. Technol. Comput. Syst. 4(2), 1-20 (2008)
68. J. Deng, H.-S.P. Wong, A compact SPICE model for carbon-nanotube field-effect transistors including nonidealities and its applications, Part I: Model of the intrinsic channel region. IEEE Trans. Electron Devices 54(12), 3186-3194 (2007)
69. J. Deng, H.-S.P. Wong, A compact SPICE model for carbon-nanotube field-effect transistors including nonidealities and its applications, Part II: Full device model and circuit performance benchmarking. IEEE Trans. Electron Devices 54(12), 3195-3205 (2007)
70. M.S. Dresselhaus, G. Dresselhaus, P. Avouris (eds.), Carbon Nanotubes: Synthesis, Structure, Properties, and Applications (Springer, Berlin, 2001)
71. J.M. Marulanda, Current transport modeling of carbon nanotubes: Concepts, analysis, and design, PhD thesis, Louisiana State University, 2008
72. M. Weis, A circuit design perspective for vertical slit field effect transistor (VESFET), in Selected Topics of Electronics and Micromechatronics, vol. 35 (Shaker, Aachen, 2010)
73. Y.-W. Lin, M. Marek-Sadowska, W. Maly, A. Pfitzner, D. Kasprowicz, Is there always performance overhead for regular fabric? in Proc. IEEE International Conference on Computer Design, 2008, pp. 557-562
74. Y. Miura, Y. Matukura, Investigation of silicon-silicon dioxide interface using MOS structure. Jpn. J. Appl. Phys. 5, 180 (1966)
75. L.J. Svensson, J.G. Koller, Driving a capacitive load without dissipating $fCV^{2}$, in Proc. IEEE Symposium Low Power Electronics, 1994, pp. 100-101
76. W.C. Athas, L.J. Svensson, N. Tzartzanis, A resonant signal driver for two-phase, almost-non-overlapping clocks, in Proc. IEEE International Symposium on Circuits and Systems, vol. 4, 1996, pp. 129-132
77. H. Mahmoodi-Meimand, A. Afzali-Kusha, M. Nourani, Adiabatic carry look-ahead adder with efficient power clock generator. IEE Proc., Circuits Devices Syst. 148(5), 229-234 (2001)
78. H. Mahmoodi-Meimand, A. Afzali-Kusha, Efficient power clock generation for adiabatic logic, in Proc. IEEE International Symposium on Circuits and Systems, vol. 4, 2001, pp. 642-645
79. J.G. Koller, L. Svensson, Adiabatic charging without inductors, in Proc. Int. Workshop Low Power Design, 1994
80. W.C. Athas, J.G. Koller, L.J. Svensson, An energy-efficient CMOS line driver using adiabatic switching, in Proceedings of the 4th Great Lakes Symposium on VLSI, 1994, pp. 196-199
81. A. Bargagli-Stoffi, G. Iannaccone, S. Di Pascoli, E. Amirante, D. Schmitt-Landsiedel, Four-phase power clock generator for adiabatic logic circuits. Electron. Lett. 38(14), 689-690 (2002)
82. A. Bargagli-Stoffi, E. Amirante, J. Fischer, G. Iannaccone, D. Schmitt-Landsiedel, Resonant 90 degree shifter generator for 4-phase trapezoidal adiabatic logic. Adv. Radio Sci. 1, 243-246 (2003)
83. M. Arsalan, M. Shams, Charge-recovery power clock generators for adiabatic logic circuits, in Proc. 18th International Conference on VLSI Design, 2005, pp. 171-174
84. A. Blotti, S. Borghese, R. Saletti, Single-inductor four-phase power-clock generator for positive-feedback adiabatic logic gates, in 9th International Conference on Electronics, Circuits and Systems, vol. 2, 2002, pp. 533-536
85. R.B. Merrill, T.W. Lee, Y. Hong, R. Rasmussen, L.A. Moberly, Optimization of high Q integrated inductors for multi-level metal CMOS, in International Electron Devices Meeting, 1995, pp. 983-986
86. C.P. Yue, S.S. Wong, On-chip spiral inductors with patterned ground shields for Si-based RF ICs. IEEE J. Solid-State Circuits 33(5), 743-752 (1998)
87. C.L. Chua, D.K. Fork, K. Van Schuylenbergh, J.-P. Lu, Out-of-plane high-Q inductors on low-resistance silicon. J. Microelectromech. Syst. 12(6), 989-995 (2003)
88. Y. Zhuang, M. Vroubel, B. Rejaei, J.N. Burghartz, Ferromagnetic RF inductors and transformers for standard CMOS/BiCMOS, in International Electron Devices Meeting, 2002, pp. 475-478
89. M. Yamaguchi, S. Bae, K.H. Kim, K. Tan, T. Kusumi, K. Yamakawa, Ferromagnetic RF integrated inductor with closed magnetic circuit structure, in International Microwave Symposium Digest, 2005
90. M. Yamaguchi, K. Yamada, K.H. Kim, Slit design consideration on the ferromagnetic RF integrated inductor. IEEE Trans. Magn. 42, 3341-3343 (2006)
91. S. Henzler, T. Nirschl, S. Skiathitis, J. Berthold, J. Fischer, P. Teichmann, F. Bauer, G. Georgakos, D. Schmitt-Landsiedel, Sleep transistor circuits for fine-grained power switch-off with short power-down times, in Proc. of the IEEE International Solid-State Circuits Conference, 2005, pp. 302-303, 600
92. S. Henzler, P. Teichmann, M. Koban, J. Berthold, G. Georgakos, D. Schmitt-Landsiedel, Impact of gate leakage on efficiency of circuit block switch-off scheme, in VLSI SoC: From Systems to Chips, ed. by M. Glesner, R. Reis, L. Indrusiak, V. Mooney, H. Eveking. IFIP International Federation for Information Processing, vol. 200 (Springer, Boston, 2006), pp. 229-245
93. S. Kim, Y. Shin, S. Kosonocky, W. Hwang, Long-term power minimization of dual-$V_{t}$ CMOS circuits, in 15th Annual IEEE International ASIC/SOC Conference, 2002, pp. 323-327
94. H. Jiang, M. Marek-Sadowska, S.R. Nassif, Benefits and costs of power-gating technique, in Proceedings of the IEEE International Conference on Computer Design, 2005, pp. 559-566
95. Y. Ye, S. Borkar, V. De, A new technique for standby leakage reduction in high-performance circuits, in Symposium on VLSI Circuits, 1998, pp. 40-41
96. P. Teichmann, J. Fischer, S. Henzler, E. Amirante, D. Schmitt-Landsiedel, Power-clock gating in adiabatic logic circuits, in Integrated Circuit and System Design. Lecture Notes in Computer Science (Springer, Berlin, 2005), pp. 638-646
97. J. Hu, D. Zhou, L. Wang, Power-gating adiabatic flip-flops and sequential logic circuits, in International Conference on Communications, Circuits and Systems, 2007, pp. 1016-1020
98. J. Hu, J. Fu, Leakage dissipation reduction of single-phase power-gating adiabatic sequential circuits using MTCMOS, in Asia Pacific Conference on Postgraduate Research in Microelectronics & Electronics, 2009, pp. 456-459
99. J. Hu, J. Dai, D. Zhou, A power-gating technique for adiabatic circuits using bootstrapped NMOS switches, in IEEE International Midwest Symposium on Circuits and Systems, vol. 1, 2006, pp. 89-93
100. P. Teichmann, J. Fischer, E. Amirante, D. Schmitt-Landsiedel, Power-Clock-Gating in adiabatischen Logikschaltungen, in Advances in Radio Science, vol. 6, 2006, pp. 275-280
101. P. Teichmann, J. Fischer, D. Schmitt-Landsiedel, A robust synchronized 2N2P LC oscillator with a shut-down mode for adiabatic logic circuits, in Proc. of the IEEE International Symposium on Circuits and Systems, 2009, pp. 241-244
102. S. Narendra, S. Borkar, V. De, D. Antoniadis, A. Chandrakasan, Scaling of stack effect and its application for leakage reduction, in International Symposium on Low Power Electronics and Design, 2001, pp. 195-200
103. S. Jenei, B.K.J.C. Nauwelaers, S. Decoutere, Physics-based closed-form inductance expression for compact modeling of integrated spiral inductors. IEEE J. Solid-State Circuits 37(1), 77-80 (2002)
104. R. Zimmermann, Binary adder architectures for cell-based VLSI and their synthesis, PhD thesis, ETH Zurich, 1997
105. I. Koren, Computer Arithmetic Algorithms, 2nd edn. (AK Peters, Wellesley, 2002)
106. R. Ladner, M. Fischer, Parallel prefix computation. J. Assoc. Comput. Mach. 27(4), 831-838 (1980)
107. R.P. Brent, H.T. Kung, A regular layout for parallel adders. IEEE Trans. Comput. C-31, 260-264 (1982)
108. T. Han, D.A. Carlson, Fast area-efficient VLSI adders, in Proceedings of the Symposium on Computer Arithmetic, 1987, pp. 49-56
109. P.M. Kogge, H.S. Stone, A parallel algorithm for the efficient solution of a general class of recurrence equations. IEEE Trans. Comput. C-22, 786-793 (1973)
110. J. Sklansky, Conditional-sum addition logic. IRE Trans. Electron. Comput. EC-9, 226-231 (1960)
111. J. Volder, The CORDIC computing technique, in Western Joint Computer Conference, San Francisco, CA (ACM, New York, 1959), pp. 257-261
112. P. Teichmann, M. Vollmer, J. Fischer, B. Heyne, J. Goetze, D. Schmitt-Landsiedel, Saving potentials of adiabatic logic on system level: A CORDIC-based adiabatic DCT, in Proceedings of the International Symposium on Integrated Circuits, 2009, pp. 105-108
113. B. Heyne, C.-C. Sun, J. Goetze, S.-J. Ruan, A computationally efficient high-quality CORDIC based Loeffler DCT, in European Signal Processing Conference, 2006
114. C. Loeffler, A. Ligtenberg, G.S. Moschytz, Practical fast 1-D DCT algorithms with 11 multiplications, in International Conference on Acoustics, Speech, and Signal Processing, vol. 2, 1989, pp. 988-991
115. T.D. Tran, A fast multiplier-less block transform for image and video compression, in Proceedings of the International Conference on Image Processing, vol. 3, 1999, pp. 822-826
116. C.-C. Sun, B. Heyne, S.-J. Ruan, J. Goetze, A low-power and high-quality CORDIC based Loeffler DCT, in International Symposium on VLSI Design, Automation and Test, 2006, pp. 1-4
117. E. De Man, M. Schulz, R. Schmidmaier, M. Schobinger, T.G. Noll, Architecture and circuit design of a 6-GOPS signal processor for QAM demodulator applications. IEEE J. Solid-State Circuits 30(3), 219-227 (1995)
118. T.G. Noll, Carry-save arithmetic for high-speed digital signal processing, in IEEE International Symposium on Circuits and Systems, vol. 2, 1990, pp. 982-986

## Index

## Symbols

2's complement, 17, 114
3-to-2 compressor, 137
4-to-2 compressor, 146

## A

Activity factor, 8
Adiabatic losses, 7, 69
Adiabatic signal test, 149
Adiabatic system, 8
Area consumption, 16
Asynchronous oscillator, 67, 73

## B

Ballistic transport, 36
Bias temperature instability, 23
Bitplane structure, 146
Brent-Kung PPA, 121
BTI, 23

## C

Carbon nanotubes, 23, 30, 36
Carrier mobility, 12
Carry-lookahead adder, 99
Carry-propagate adder, 140
Carry-save adder, 113, 137, 146
Carry-select adder, 113, 116
Cascode voltage switch logic, 9
Charging path resistance, 12
Chirality, 23, 37, 40
CLA, 99
CNT, 23, 30, 36
CNT-based field effect transistor, 23
CNTFET, 23, 36, 38
Complex gates, 128
Conditional-sum adder, 113
Conversion efficiency, 68
Coordinate rotation digital computer, 137, 138
CORDIC, 137, 138, 142
CPA, 140
Crosstalk, 18
CSA, 113, 137, 140
CSEA, 116, 123
CVSL, 9, 15

## D

DCT, 71, 137, 139, 142
Degradation, 23
Discrete cosine transformation, 71, 137, 139
Drain-induced barrier lowering, 30
Dual-rail encoding, 15, 145, 146

## E

ECRL, 6, 9
Efficiency, oscillator, 78
Efficiency, system, 78
Efficient charge recovery logic, 6, 9, 65
Electromigration, 20
Energy dissipation, 5, 7
Energy losses, 77
Energy reduction factor, 88
Energy saving factor, 8
Equivalent circuit, 7, 86
ERF, 88
ESF, 8, 33, 153
Evaluate interval, 9, 150

## F

Fan-out, 48
Finite impulse response filter, 145
FIR, 145
Four-phase power-clock, 65, 67
Fully-depleted SOI, 30

## H

Han-Carlson PPA, 122, 126, 152
HCI, 23, 51
Hold interval, 9, 150
Horner scheme, 146
Hot carrier injection, 23, 51

## I

Inherent pipelining, 17, 114, 128
IR-drop, 19

## J

JFET, 43
Junction FET, 43

## K

Kogge-Stone PPA, 121

## L

$L\,\mathrm{d}i/\mathrm{d}t$-drop, 20
Latency, 128, 130, 134
Leakage currents, 10
Leakage losses, 11, 83

## M

Minimum power-down time, 85
Minimum transition time, 7
Multi-gate FET, 30
Multiplier, 113

## N

NBTI, 51, 52, 58
Negative bias temperature instability, 51
Nested RCA, 136
Noise margin, 15
Non-adiabatic losses, 11
Novel devices, 23

## O

OESF, 45
On-resistance, 31, 32, 84, 130
Optimum frequency, 33, 34
Overall energy saving factor, 45

## P

Parallel-prefix adder, 113, 119, 123
PBTI, 51, 52, 61
PFAL, 8
Positive bias temperature instability, 51, 61
Positive feedback adiabatic logic, 8, 65
Power-clock, 9, 17, 65, 83
Power-down mode, 107
PPA, 113, 119, 120, 123
Predictive technology model, 24
Prefix problem, 119
PTM, 24

## R

RCA, 113, 116, 123, 132, 136, 137, 140, 145
Recover interval, 9, 150
Resonance frequency, 68, 69, 102
Resonant loading, 66
Ripple-carry adder, 113, 115, 123

## S

Scaling, 83
SCE, 30
Short channel effects, 30, 36
Single walled carbon nanotubes, 23, 36
Sklansky PPA, 120, 146
Stepwise charging, 65
Supply voltage, functional limit, 14
SWCNT, 36
Switching overhead, 84
Switching probability, 5
SWNT, 23
Synchronization frequency, 68, 69, 107
Synchronization losses, 72
Synchronization signals, 68
Synchronization signals, generation, 72
Synchronous oscillator, 67, 73, 107

## T

Tank capacitor, 66
Technology energy saving factor, 45
TESF, 45
Transition time, 7

## V

Vector merging adder, 140, 146
Vertical slit field effect transistor, 23, 30, 43
VESFET, 23, 30, 43
VMA, 140
Voltage scaling, 13, 126, 145, 152
Vth roll-off, 30

## W

Wait interval, 9

## X

XOR, 16, 128


[^0]:    $^{\mathrm{a}}$ 1 GHz is the upper frequency range of the investigation
    $^{\mathrm{b}}$ Minimum could be found at lower frequency, as outlier found at 100 MHz

