Superconducting Digital Signal Processor for Telecommunication

Anna Herr

Microtechnology and Nanoscience, Chalmers University of Technology
41296 Gothenburg, Sweden
e-mail: anna.herr@chalmers.se

Abstract- The ultra-fast switching speed of superconducting digital circuits makes possible Digital Signal Processors with performance unattainable by any other technology. Based on rapid-single-flux-quantum (RSFQ) logic, these integrated circuits are capable of delivering high computation capacity, up to 30 GOPS on a single processor, and a very short latency of 0.1 ns. There are two main applications of such hardware in practical telecommunication systems: filters for superconducting ADCs operating with digital RF data, and recursive filters at baseband. The latter allow functions such as multiuser detection for 3G WCDMA, equalization and channel pre-coding for 4G OFDM MIMO, and general blind detection. The performance gain is an increase in the cell capacity, quality of service, and transmitted data rate. The current status of the development of the RSFQ baseband DSP is discussed. Major components with an operating speed of 30 GHz have been developed. Designs, test results, and future development of the complete system, including cryopackaging and a CMOS interface, are reviewed.

Based on author’s talk presented at the Microcooler Workshop of EU Project “S Pulse (A24)” on April 8, 2008.

I. INTRODUCTION

Digital Signal Processors (DSPs) are microprocessors whose architecture is dedicated to and optimized for the computations typical of digital signal processing, such as discrete filters, Fast Fourier Transforms, and image processing. In recent decades, DSP use has increased rapidly, driven by the growth of the wireless telecommunications market. DSP performance influences the overall system architecture, product quality, and cost. Current demands on quality of service, unification of interfaces, and communication security place very tough requirements on the DSP's information processing rate and throughput.

The rationale for using superconducting Rapid Single Flux Quantum (RSFQ) technology for digital signal processors lies in the benefits of exceptionally high clock frequency in combination with low power dissipation, high integration density and scalability [1], [2]. RSFQ technology is especially suitable for DSPs because it can guarantee very high throughput in heavily pipelined architectures and fast interface to serial memory, which is essential for DSP performance.

Much work has been done in the field of RSFQ-based digital signal processing. It has focused mainly on two areas: RF DSP decimation filters fed by high-performance, oversampling superconducting Analog-to-Digital Converter (ADC) modulators, and baseband nonlinear recursive filters for interference rejection. In both cases, the speed of the RSFQ digital circuit is the major performance factor. In RF DSP, the clock speed of the digital filters is determined by the 20-40 GHz ADC sampling rate. In baseband recursive filtering, high speed is required to minimize the feedback computation time, which involves processing of massively parallel data. The nonlinear recursive algorithms demand not only overall performance in the 100 GOPS range (1 GOPS = $10^{9}$ fixed-point operations per second) but also low latency, with total computation time on a microsecond scale.

The main topic of this paper is the baseband RSFQ DSP. Initially, application of RSFQ baseband processing was introduced in the context of interference cancellation in the uplink of W-CDMA (wideband code division multiple access) systems [3]. A similar approach is applicable to a wide class of algorithms used in different telecommunication standards, both uplink and downlink. From the point of view of practical RSFQ realization, implementation of DSPs is a difficult but manageable task even at the current level of fabrication and packaging technology. However, demonstration of the first fully functional prototype is crucial for opening a potentially large commercial market.

II. RSFQ DSP APPLICATIONS

Due to the necessity of cooling to 4 K, application of RSFQ technology is justified only for tasks that are impossible or ineffective using semiconductors. For DSPs this translates to problems that cannot be effectively parallelized. The general class of DSP problems that falls into this category is adaptive filtering [4]. The general principle of adaptive filtering is shown in Fig. 1. The signal is applied simultaneously to the unknown transmitting system and to the adaptive filter, and the output signal is formed by subtracting the output of the adaptive filter from the output of the transmitting system. The adaptive algorithm is constructed to minimize the output error, thus ideally yielding a perfect model of the unknown system and perfect detection of the input signal.

![Fig. 1. Block diagram of adaptive filters for system identification and channel equalization.](image)
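
The system identification loop of Fig. 1 can be illustrated with a minimal sketch. It assumes the least-mean-squares (LMS) update, a common textbook adaptive algorithm that is swapped in here for illustration and is not necessarily the one used in this work. The channel taps, step size `mu`, and function name `lms_identify` are hypothetical.

```python
import random


def lms_identify(unknown_taps, n_samples=2000, mu=0.05, seed=0):
    """Estimate the taps of an unknown FIR system with the LMS update rule."""
    rng = random.Random(seed)
    n_taps = len(unknown_taps)
    weights = [0.0] * n_taps   # adaptive filter coefficients
    history = [0.0] * n_taps   # most recent input samples, newest first
    for _ in range(n_samples):
        # The same input sample feeds the unknown system and the adaptive filter.
        history = [rng.uniform(-1.0, 1.0)] + history[:-1]
        desired = sum(h * x for h, x in zip(unknown_taps, history))
        predicted = sum(w * x for w, x in zip(weights, history))
        # Output signal: unknown-system output minus adaptive-filter output.
        error = desired - predicted
        # Move each weight along the direction that reduces the squared error.
        weights = [w + mu * error * x for w, x in zip(weights, history)]
    return weights


true_taps = [0.5, -0.3, 0.2]  # hypothetical channel, chosen for illustration
estimated_taps = lms_identify(true_taps)
print([round(w, 4) for w in estimated_taps])
```

In this noise-free toy case the adaptive weights approach the taps of the unknown system, which is the sense in which the filter becomes a model of it.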

The applications of adaptive filters are numerous and include system identification, channel equalization, and signal prediction. For example, adaptive Kalman filters are used in radar, wherein tracking of a target with noisy location, speed, and acceleration readings is improved when the ballistics of the target are known [5]. Here we concentrate on the application of adaptive filtering to modern wireless communication systems: Successive Interference Cancellers (SIC) for 3G Code Division Multiple Access (CDMA) [6], channel equalizers and precoders for 4G Multiple Input Multiple Output (MIMO) systems [7], and general blind signal detectors [8].

Perhaps the most strongly advocated application for adaptive filtering in modern wireless communication systems is interference cancellation. The performance of the system is mainly limited by Multiple Access Interference (MAI). MAI affects both uplink and downlink capacity and results from imperfect separation of the signals at the receiver. In addition to MAI, multipaths within each radio channel cause intersymbol interference (ISI). Signal detection in an interference-limited system is exactly where currently existing cellular wireless networks show a disparity between what is theoretically possible and what is technically realizable using conventional technology.

The most general approach to interference cancellation is to compute the interference picture using available data about the propagation channels. As soon as the interference is known, it can be removed from the received signal or used to predistort the transmitted signal. If the transfer function of the propagation environment including all interference is denoted as $R$, applying the $R^{-1}$ filter at the receiver without noise would result in ideal detection of the transmitted signals. In practice, the efficiency of this process depends on how accurately the interference picture is computed, how quickly the interference picture changes in time, and on the properties of the algorithm that is used. Successive interference cancellers for CDMA systems and equalizers for MIMO systems are adaptive filters that can be viewed as recursive alternatives to the matrix inversion solution. In iterative algorithms, an initial rough estimate of the signals is improved with each additional iteration using newly received information on the propagation environment and the dependence between communication channels.
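
The idea of a recursive alternative to applying $R^{-1}$ can be sketched with a Gauss-Seidel sweep, a standard iterative solver that removes the current estimate of the other users' interference one user at a time. This is an illustrative stand-in under stated assumptions, not the specific canceller of [6]: the $3 \times 3$ matrix `R`, the transmitted symbols, and the function name are hypothetical, and noise is ignored.

```python
def successive_cancellation(R, y, n_iterations=25):
    """Solve R b = y iteratively, refining one user's estimate at a time."""
    n = len(y)
    b = [0.0] * n  # initial rough estimate of the transmitted signals
    for _ in range(n_iterations):
        for k in range(n):
            # Interference seen by user k, using the latest estimates of the others.
            interference = sum(R[k][j] * b[j] for j in range(n) if j != k)
            b[k] = (y[k] - interference) / R[k][k]
    return b


# Hypothetical cross-correlation matrix between three users (diagonally dominant).
R = [[1.0, 0.2, 0.1],
     [0.2, 1.0, 0.3],
     [0.1, 0.3, 1.0]]
sent = [1.0, -1.0, 1.0]
received = [sum(R[k][j] * sent[j] for j in range(3)) for k in range(3)]
estimate = successive_cancellation(R, received)
print([round(v, 6) for v in estimate])
```

Each sweep depends on the results of the previous one, which is why the feedback latency of the hardware, and not only its raw throughput, sets the total computation time.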

Similar iterations that add interference at the transmitter instead of subtracting it at the receiver can be used in Tomlinson-Harashima Precoding (THP) [9]. This considerably increases the resources of the down-link by keeping all the complexity at the base station as opposed to the mobile unit.
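
A minimal sketch of the precoding idea follows, assuming a single real-valued channel with one known intersymbol-interference tap and 4-PAM symbols. The tap value, the modulo range, and all function names are hypothetical choices for illustration. The transmitter pre-compensates the interference that the channel will add and folds the result into a bounded range; the receiver only needs the same modulo operation.

```python
def fold(value, half_range):
    """Modulo operation that maps value into [-half_range, half_range)."""
    return (value + half_range) % (2 * half_range) - half_range


def thp_precode(symbols, isi_tap, half_range):
    """Pre-subtract the interference from the previously sent sample."""
    sent, previous = [], 0.0
    for a in symbols:
        x = fold(a - isi_tap * previous, half_range)
        sent.append(x)
        previous = x
    return sent


def channel_and_detect(sent, isi_tap, half_range):
    """Apply the two-tap channel, then recover symbols with one modulo each."""
    detected, previous = [], 0.0
    for x in sent:
        r = x + isi_tap * previous  # channel adds the interference back
        detected.append(round(fold(r, half_range)))
        previous = x
    return detected


symbols = [3, -1, 1, -3, 3, 3, -1, 1]  # 4-PAM alphabet {-3, -1, 1, 3}
ISI_TAP = 0.5                           # assumed known at the transmitter
HALF_RANGE = 4                          # modulo interval [-4, 4)
precoded = thp_precode(symbols, ISI_TAP, HALF_RANGE)
detected = channel_and_detect(precoded, ISI_TAP, HALF_RANGE)
print(detected)
```

The receiver side contains no feedback loop at all, which reflects how the complexity stays at the transmitter.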

Both multiuser detectors and channel equalizers require channel estimation, i.e. knowledge about $R$. The receiver performs blind detection if the channel coefficients are unknown. In existing communication systems, channel estimation is assisted by a "pilot" or "reference" sequence. Unfortunately, it is often the case that part of the interference is caused by a signal with unknown properties. For example, the mobile transmitter may be on the border of the cell, with more than 50% of the incoming signal coming from adjacent base stations. The pilot signal also consumes considerable system frequency and time resources. Iterative blind detection is thus an important tool to avoid performance degradation.

### III. RSFQ DSP ARCHITECTURE

The main problem with practical implementation of the recursive algorithms in telecommunication is the required computational complexity. The mathematical theory behind them is computationally intensive. All computations are performed on complex matrices of the received signal, and the computational complexity grows as $N^3$ with the number of processed channels $N$. In practice, this means that the hardware should have a capacity of about 100 GOPS. Beyond the overall GOPS performance, the inherent feedback of the algorithms requires extremely low latency to make the total computation time significantly less than the duration of the transmitted symbols.

No existing hardware signal processing solution is able to satisfy both requirements on throughput and latency. Common approaches for boosting the throughput are extreme parallelism using DSP arrays, or extreme pipelining using stream processors. Both solutions provide high effective computation rates at the expense of latency of data propagation through the computational blocks.

RSFQ circuits are very well suited to fill the performance gap needed for implementation of adaptive filters for a variety of reasons. They offer:
- High clock rate on the order of 30 GHz for the current 1.5 μm fabrication process,
- High-performance signal processing capability through gate-level pipelining,
- Near speed-of-light delay between processors using superconducting interconnect.
These features are critical to minimize the algorithm’s inherent feedback time, and cannot be compensated using conventional DSPs. However, RSFQ technology also has several limitations. Some arise from the relative immaturity of the technology, and some are inherent. The major maturity-related problem is the achievable integration density. Although a number of complex circuits have been demonstrated, the integration density has been limited to roughly $10^4$ active elements per chip [10], [2]. Inherent limitations include flux trapping, bias current distribution, noise-induced jitter, and heavy use of active elements for signal propagation. Altogether, these factors limit the complexity of reliable circuits.

Our hybrid DSP architecture was chosen to take advantage of the strengths of superconducting logic, while sidestepping the weaknesses. It consists of both semiconductor and superconductor parts. The heart of the system is a high-speed superconductor RSFQ digital core enabling ultra-fast data feedback. The semiconductor part of the processor is responsible for programmability, mass input data storage, and the I/O interface.

The RSFQ core is based on two Multiply-Accumulate (MAC) units, shown in Fig. 2 [11]. The first MAC computes the filter matrix, $R$, and the second MAC performs the iterations. The second MAC receives elements of the filter matrix directly, without intermediate storage. This admits the possibility of adjusting filter coefficients during the iterations, as is required by the most general case of adaptive filtering. The DSP structure uses single multipliers with accumulators to implement the recursive filter sequentially.

For implementation of the recursive algorithms operating on matrices, the memory hierarchy is organized in the following way: input $N \times N$ matrices $A$ and $B$ are stored in the large CMOS memory. Each matrix is partitioned into $M$ submatrices. There are two serial shift-register-based RSFQ caches feeding the processor with a row and a column of the submatrix. Each bank of $A$ stores $K=N/M$ words of each column in $A$, for every row of $A$, and similarly for $B$. Because data remains in the cache memory only until used, additional data can be loaded from external memory to accommodate array dimensions larger than internal storage would allow.
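
The way small caches can serve matrices larger than on-chip storage can be sketched as a blocked matrix product driven by a single multiply-accumulate step. This is a software analogy under stated assumptions, not a model of the RSFQ circuit: the block size `block`, the list-based "caches", and the function names are hypothetical.

```python
def mac(accumulator, a, b):
    """One multiply-accumulate step."""
    return accumulator + a * b


def blocked_matmul(A, B, block):
    """Multiply square matrices while holding only one block of A and B at a time."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for i0 in range(0, n, block):
        for j0 in range(0, n, block):
            for k0 in range(0, n, block):
                # Load one submatrix of A and one of B from "external memory".
                cache_a = [row[k0:k0 + block] for row in A[i0:i0 + block]]
                cache_b = [row[j0:j0 + block] for row in B[k0:k0 + block]]
                for i, a_row in enumerate(cache_a):
                    for j in range(len(cache_b[0])):
                        acc = C[i0 + i][j0 + j]
                        for k, a in enumerate(a_row):
                            acc = mac(acc, a, cache_b[k][j])
                        C[i0 + i][j0 + j] = acc
    return C


def naive_matmul(A, B):
    """Reference product used only to check the blocked version."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
B = [[2, 0, 1, 3], [1, 1, 0, 2], [0, 4, 2, 1], [3, 2, 1, 0]]
print(blocked_matmul(A, B, block=2) == naive_matmul(A, B))
```

Each cached block is discarded as soon as its products have been accumulated, so the working set is fixed by the block size and not by the matrix dimension.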

![Fig. 2. RSFQ channel equalizer using two MACs.](image-url)

Table 1 itemizes the complexity required for implementation of the DSP core shown in Fig. 2. The table corresponds to a $10 \times 10$ parallel MAC with convergent rounding to 12 bits and an 18-bit accumulator, and the serial $12 \times 1$ MAC. The parallel MAC operates with caches A and B, where 10-bit parallel words of input channel data are stored. The length of caches A and B is set to 15 words, and each cache is duplicated for loading low-speed external data. The serial MAC operates with caches Y and P, where 12-bit initial data and iterative updates are stored in serial order. Inputs and outputs are two's complement numbers. The required precision is set in accordance with simulations of the fixed-point SIC algorithm performed on a complete channel model of W-CDMA wireless systems [12].
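
The fixed-point behaviour described here can be sketched in software. The sketch assumes that each 20-bit product of two 10-bit two's complement inputs is first rounded to 12 bits with convergent (round-half-to-even) rounding and then added into an 18-bit accumulator that wraps on overflow. The ordering of rounding and accumulation, the wrap-around policy, and all function names are assumptions made for illustration only.

```python
def wrap_twos_complement(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def convergent_round(value, drop_bits):
    """Drop low-order bits, rounding to nearest with ties going to even."""
    quotient = value >> drop_bits                 # floor division by 2**drop_bits
    remainder = value & ((1 << drop_bits) - 1)    # non-negative discarded part
    half = 1 << (drop_bits - 1)
    if remainder > half or (remainder == half and quotient & 1):
        quotient += 1
    return quotient


def fixed_point_mac(pairs, in_bits=10, rounded_bits=12, acc_bits=18):
    """Accumulate rounded products of two's complement input pairs."""
    drop = 2 * in_bits - rounded_bits  # 20-bit product reduced to 12 bits
    acc = 0
    for a, b in pairs:
        product = a * b
        acc = wrap_twos_complement(acc + convergent_round(product, drop), acc_bits)
    return acc


print(fixed_point_mac([(511, 511), (-512, 511), (100, -3)]))
```

Round-half-to-even is used because it avoids the systematic bias that always rounding ties upward would add over many accumulation steps.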

### Table 1 Integration density of the RSFQ DSP components.

<table>
<thead>
<tr>
<th>DSP component</th>
<th>JJ per bit</th>
<th>Total JJ</th>
<th>Bias current (A)</th>
</tr>
</thead>
<tbody>
<tr>
<td>parallel MAC</td>
<td></td>
<td>22510</td>
<td>3.43</td>
</tr>
<tr>
<td>multiplier</td>
<td>176</td>
<td>17600</td>
<td>2.5</td>
</tr>
<tr>
<td>accumulator LSB</td>
<td>79</td>
<td>316</td>
<td>0.05</td>
</tr>
<tr>
<td>accumulator MSB</td>
<td>51</td>
<td>1560</td>
<td>0.1</td>
</tr>
<tr>
<td>rounding block</td>
<td>41</td>
<td>3034</td>
<td>0.68</td>
</tr>
<tr>
<td>cache A</td>
<td>3</td>
<td>880</td>
<td>0.15</td>
</tr>
<tr>
<td>cache B</td>
<td>3</td>
<td>13200</td>
<td>2.33</td>
</tr>
<tr>
<td>serial MAC</td>
<td>193</td>
<td>2316</td>
<td>0.47</td>
</tr>
<tr>
<td>cache Y</td>
<td>3</td>
<td>3600</td>
<td>0.72</td>
</tr>
<tr>
<td>cache P</td>
<td>3</td>
<td>3600</td>
<td>0.72</td>
</tr>
<tr>
<td><strong>Total</strong></td>
<td></td>
<td><strong>68616</strong></td>
<td><strong>11</strong></td>
</tr>
</tbody>
</table>
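
The junction counts in Table 1 can be cross-checked with a few lines of arithmetic. The sketch below only re-adds the numbers printed in the table; the variable names are hypothetical. It shows that the parallel MAC entry is the sum of its four components, and that the printed grand total equals the sum of every listed row, that is, it includes the parallel MAC entry together with its components.

```python
# Junction counts copied from Table 1.
junctions = {
    "multiplier": 17600,
    "accumulator LSB": 316,
    "accumulator MSB": 1560,
    "rounding block": 3034,
    "cache A": 880,
    "cache B": 13200,
    "serial MAC": 2316,
    "cache Y": 3600,
    "cache P": 3600,
}
parallel_mac_parts = ["multiplier", "accumulator LSB",
                      "accumulator MSB", "rounding block"]

# Sum of the four blocks that make up the parallel MAC.
parallel_mac_total = sum(junctions[name] for name in parallel_mac_parts)
# Sum over distinct blocks only (each junction counted once).
distinct_total = sum(junctions.values())
# Sum over every row of the table, including the parallel MAC subtotal row.
all_rows_total = distinct_total + parallel_mac_total

print(parallel_mac_total, distinct_total, all_rows_total)
```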

### IV. CIRCUIT IMPLEMENTATION

The Multiply-Accumulate (MAC) unit consists of three parts: a parallel multiplier, a rounding block, and an accumulator. The RSFQ implementation of the parallel multiplier has the well-known structure of a parallel array of full adders. Data flows from one adder row to the next in the vertical direction, and there are no horizontal connections between adder bits of the same row. The clock is distributed in the first row and propagates along the parallel columns of the multiplier. A concurrent clock distribution scheme is used, and each cell is carefully designed to avoid accumulated jitter.

The complete MAC has been simulated in VHDL (Very-High-Speed Integrated Circuit Hardware Description Language). The circuit consists of a 10x10 parallel multiplier, a 10-bit combiner, an 18-bit accumulator, and a rounding block. The maximum clock frequency of the RSFQ parallel MAC is 26 GHz for the HYPRES 4.5 kA/cm² fabrication process and does not depend on the size of the circuit. The main limitation on clock speed comes from thermal jitter. The simulations were done with all critical timings of the gates widened by $9.3\sigma$, where $\sigma = 0.2$ ps is the standard deviation of the clock jitter [13]. This criterion corresponds to a bit error rate (BER) of $10^{-20}$, which is acceptable for DSP applications.
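
The link between the $9.3\sigma$ timing margin and a BER of order $10^{-20}$ can be checked numerically. The sketch assumes Gaussian jitter and counts a timing error whenever the jitter exceeds the margin on either side; both are modelling assumptions for illustration, and the function name is hypothetical.

```python
import math


def timing_error_probability(margin_in_sigma):
    """Probability that zero-mean Gaussian jitter exceeds the margin on either side."""
    return math.erfc(margin_in_sigma / math.sqrt(2.0))


SIGMA_PS = 0.2        # standard deviation of the clock jitter, from the text
MARGIN_SIGMAS = 9.3   # widening applied to critical timings, from the text
CLOCK_GHZ = 26.0      # maximum clock frequency of the parallel MAC, from the text

margin_ps = MARGIN_SIGMAS * SIGMA_PS   # absolute timing margin in ps
clock_period_ps = 1e3 / CLOCK_GHZ      # clock period in ps
ber = timing_error_probability(MARGIN_SIGMAS)

print(f"margin = {margin_ps:.2f} ps of a {clock_period_ps:.1f} ps clock period")
print(f"error probability per timing event = {ber:.1e}")
```

Under these assumptions the margin costs under 2 ps per critical timing, a small fraction of the clock period at 26 GHz.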

Several components have already been implemented in the HYPRES 4.5 kA/cm² process [14]. The 4x4 two's complement parallel multiplier (2800 JJ) and the 5-bit accumulator (350 JJ) were tested at low speed. The 4x4 two's complement parallel multiplier had margins on the DC bias voltage of ±2% for the exhaustive test pattern. Degradation of the margin is due to clock jitter, spread of fabrication parameters, and high mirrored ground current. A separate exhaustive test of a regular bit of the multiplier resulted in ±15% margins on the DC bias voltage. A similar test on the sign bit gave ±11% margins. The 5-bit accumulator had margins on the DC bias voltage of ±18% for the exhaustive test pattern.

Two test structures with 20x5 and 4x15 bits have been designed to test the functionality of memory blocks that are long enough in both directions. The 20x5 shift register has 700 Josephson junctions and dimensions of 2030x432.5 μm². The margins on the DC bias voltage in the low-speed exhaustive test were ±16.6% for individual channels and ±6% for all channels.

A simplified version of the complete MAC, consisting of a 4x4 Multiply-Accumulate Unit with rounding of the final product to 5 bits and two memory caches of 17x6 and 17x5 bits, has also been designed [15]. It contains 10,320 Josephson junctions with an occupied area of 1.9x4.5 mm² and an average design density of 1,200 JJ/mm² (Fig. 3). The circuit was fabricated on a 1x1 cm² chip and is currently in test.

![Fig. 3. Microphotograph of the 1 cm² chip with the RSFQ 4-bit parallel MAC with rounding to 5 bits and an integrated on-chip 17x6-bit serial cache.](image)

### V. PACKAGING

Implementation of the DSP core would benefit from partitioning onto several chips on a Multi-Chip-Module (MCM). Such packaging is a key technology for practical implementation of large scale circuits, as it allows increased yield of the final devices by reduction of the chip size, and functional flexibility without compromising individual chip performance.

An advantage of the systolic array architecture of the parallel MAC is that it is readily partitioned onto multiple chips. The parallel MAC is partitioned onto five chips: four chips with 4,869 Josephson junctions each, containing the multiplier and the accumulator, and one chip with 3,034 Josephson junctions, containing the rounding block. Cache A occupies one chip and contains 880 Josephson junctions. Cache B is partitioned onto three chips with approximately 10,000 Josephson junctions each. As a result, the system consists of nine chips, but the integration density is reduced by half and is within the limits of practicable RSFQ circuits.
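
The partition quoted above can be checked against the junction counts of Table 1. This is plain arithmetic on the figures given in the text; the variable names are hypothetical.

```python
# Junction counts from Table 1 for the blocks placed on the four identical chips.
multiplier = 17600
accumulator_lsb = 316
accumulator_msb = 1560

multiplier_and_accumulator = multiplier + accumulator_lsb + accumulator_msb
per_chip, remainder = divmod(multiplier_and_accumulator, 4)

# Chip count: parallel MAC (4 + 1 rounding chip), cache A (1), cache B (3).
parallel_mac_chips = 4 + 1
cache_a_chips = 1
cache_b_chips = 3
total_chips = parallel_mac_chips + cache_a_chips + cache_b_chips

print(per_chip, remainder, total_chips)
```

The multiplier and accumulator junctions divide evenly over the four chips, giving the stated 4,869 junctions each, and the listed chips add up to nine.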

The dimensions of the required DSP MCM are within 2x2 cm². The proposed DSP needs 24 inputs with a bandwidth of 2 Gbps for cache A and cache B, and 24 low-frequency lines for cache Y inputs and DSP output. In addition, one high-bandwidth (more than 30 GHz) input is required for the clock, and two 2 Gbps channels are needed for communication with external memory. As a result, the MCM will need about 50 signal pads.

Currently, the Chalmers group, in collaboration with HYPRES Inc., is in the last phase of completing the RSFQ DSP cryopackage mounted on the Sumitomo RDK-101DP cryocooler [16] (Fig. 4). The system has 60 DC bias wires with a maximal current of 2.5 A, two pairs of flex lines with 20 wires each for 2.5 Gbps data inputs, four 20 GHz wires, and four high-speed wires for the external clock with 40 GHz bandwidth. The system supports two interchangeable units for 1x1 cm² and 2x2 cm² MCMs. The cryopackage has been electrically and mechanically characterized and is currently under assembly at Chalmers University.

![Fig. 4. Photograph of the top view and inside wiring of the RSFQ DSP cryopackage mounted on the Sumitomo cryocooler RDK-101DP.](image)

VI. CONCLUSION

In conclusion, RSFQ technology opens a unique possibility for implementation of adaptive filtering algorithms operating with massively parallel data. The applications of adaptive filters are numerous and include system identification, channel equalization, and signal prediction. The algorithms are capable of delivering nearly optimum signal detection with excellent stability and convergence properties independent of the communication scenario. From a capacity point of view, the algorithm performs as well as the single-channel approach. The payoff is a saving of up to 30% in the total number of base stations required. It is important to note that adaptive filtering solves the central problem of signal detection, which simply cannot be addressed by other system improvements such as signal coding or advanced receiver hardware.

RSFQ DSPs even with the moderate integration density available in today's fabrication technology can execute complex recursive algorithms independent of the communication standard, for both uplink and downlink. The discussed architecture of the RSFQ DSP is scalable in various ways: memory, bit-precision and speed. The complexity of the system is well within the packaging capabilities of the superconducting electronics today.

ACKNOWLEDGMENTS

This work was supported in part by the Swedish Foundation for Strategic Research INGVAR grant "Gigahertz electronics for telecommunication", EU FP6 grant "RSFQubit", Swedish Research Council grant N2003-10990-17764-11, the Swedish Foundation for International Cooperation grant "Gigahertz electronics", and USA Office of Naval Research grants. Dr. A. Herr (Kidiyarova-Shevchenko) is supported by The Royal Swedish Academy of Sciences by a grant from the Knut and Alice Wallenberg Foundation.

REFERENCES


[14] www.hypres.com
