19, No. 3, Part 1, 657 - 660 (2009)

# Design, Implementation and On-Chip High-Speed Test of SFQ Half-Precision Floating-Point Multiplier

H. Hara, K. Obata, H. Park, Y. Yamanashi, K. Taketomi, N. Yoshikawa, M. Tanaka, A. Fujimaki, N. Takagi, K. Takagi, S. Nagasawa

Abstract: We developed a large-scale reconfigurable data path (LSRDP) using single-flux-quantum (SFQ) circuits as a technology that can fundamentally overcome the power-consumption and memory-wall problems of CMOS microprocessors in future high-end computing systems. An SFQ LSRDP is composed of several thousand SFQ floating-point units connected by reconfigurable SFQ network switches to achieve high performance with low power consumption. In this study, we designed and implemented an SFQ floating-point multiplier (FPM), which is one of the key components of the SFQ LSRDP. We designed a systolic-array bit-serial half-precision FPM using the 2.5 kA/cm² Nb process. The resultant circuit area and number of Josephson junctions are 6.22 mm × 3.78 mm and 11044, respectively. The designed clock frequency is 25 GHz. We tested the circuit and confirmed the correct operation of the FPM by on-chip high-speed tests.

Index Terms: floating-point units, LSRDP, multiplier, SFQ circuits, superconducting integrated circuits

## I. INTRODUCTION

Microprocessor performance has improved significantly to date. However, several problems impede further progress, notably the power-consumption problem and the memory-wall problem: at present, the performance of general-purpose microprocessors is limited by the memory access time. To overcome these problems, a large-scale reconfigurable data path (LSRDP) [1] was proposed as a processor architecture suitable for single-flux-quantum (SFQ) circuits [2]. The LSRDP consists of several thousand SFQ floating-point units (FPUs) and SFQ network switches.
Because the data are transferred directly between FPUs without memory accesses, high performance with low power consumption can be realized. A floating-point multiplier (FPM) is one of the basic components of a floating-point unit. In this study, we designed and implemented an SFQ half-precision FPM and measured it at high frequency by on-chip high-speed tests.

Manuscript received 19 August 2008. This research was supported by CREST, Japan Science and Technology Agency.

H. Hara, H. Park, Y. Yamanashi, K. Taketomi and N. Yoshikawa are with the Department of Electrical and Computer Engineering, Yokohama National University, Yokohama, 240-8501 Japan (phone: +81-45-339-4269; fax: +81-45-339-4269; e-mail: hara@yoshilab.dnj.ynu.ac.jp).

K. Obata, M. Tanaka, N. Takagi and K. Takagi are with the Department of Information Engineering, Nagoya University, Nagoya, 464-8603 Japan.

A. Fujimaki is with the Department of Quantum Engineering, Nagoya University, Nagoya, 464-8603 Japan.

S. Nagasawa is with ISTEC-SRL, Tsukuba, 305-8501 Japan.

## II. FLOATING-POINT NUMBER

A floating-point number is a representation of a real number in a computing system. It is represented as

$$(-1)^S \times F \times 2^E \tag{1}$$

where S is the sign, F is the fraction and E is the exponent. Table 1 lists the bit lengths for several formats of floating-point numbers specified in the IEEE standard. Note that the fraction lengths in Table 1 include the implicit leading bit, so the three fields of each format sum to one bit more than the format width. In this study, we designed a half-precision (16 bit) FPM. It should be noted that the exponent is biased by 15 to simplify the expression of the floating-point number.

TABLE 1
THE BIT LENGTHS OF SEVERAL FORMATS OF FLOATING-POINT NUMBERS SPECIFIED IN THE IEEE STANDARD

| | Sign | Exponent | Fraction |
|---------------------------|-------|----------|----------|
| Half-precision (16 bit) | 1 bit | 5 bit | 11 bit |
| Single-precision (32 bit) | 1 bit | 8 bit | 24 bit |
| Double-precision (64 bit) | 1 bit | 11 bit | 53 bit |

## III. DESIGN OF FLOATING-POINT MULTIPLIER

A block diagram of the FPM is shown in Fig. 1.
The FPM is divided into two parts: a fraction part and an exponent part. The fraction part primarily multiplies the fractions of the two floating-point operands. The exponent part adds the two exponents and determines the sign of the result. After the calculation in each part, the results are normalized in the normalizer if they are not in normalized scientific form.

We adopted a bit-serial architecture to reduce the circuit area and complexity [3] in the FPM design. Two bit-serial data paths corresponding to the fraction part and the exponent part were used, which are denoted as $(X_f, Y_f)$ and $(X_e, Y_e)$ in Fig. 1, respectively.

Fig. 1. A block diagram of an SFQ FPM. $(X_f, Y_f)$ and $(X_e, Y_e)$ are input signals for each calculation part. "LOAD" and "reset" are control signals for each part. Calculation results are output to $Z_f$ and $Z_e$.

Fig. 2. A circuit schematic of the processing element of the systolic-array bit-serial multiplier. D represents a delay flip-flop. ND represents a non-destructive read-out flip-flop.

Fig. 3. A block diagram of the systolic-array bit-serial multiplier, which is composed of a one-dimensional array of the processing elements (PE). NDROs are placed between PEs to improve the throughput of the multiplier.

### A. Fraction Part

The fraction part, which multiplies two fractions, was designed based on the systolic-array structure [4], which is one of the architectures for implementing multipliers [5]. In this architecture, processing elements (PEs), which perform simple calculations, are arranged regularly and multiplication is performed by pipelining. An n-bit multiplier can easily be made by placing n PEs in a row. Fig. 2 shows a circuit schematic of the PE. In this figure, $X = \{x_0, x_1, \ldots, x_n\}$ and $Y = \{y_0, y_1, \ldots, y_n\}$ are the fractions of the two floating-point numbers to be multiplied.
These fractions are input serially from the least significant bit (LSB) to the most significant bit (MSB) at 25 GHz. The multiplication is controlled by providing the "load" and "reset" signals at the proper timing [4]. Partial products are calculated by two NDROs, indicated by the ellipse in Fig. 2. A bit-serial adder adds the calculated partial products and the carry data, $S_{\mathrm{in}}$, from the previous PE, and outputs carry data, $S_{\mathrm{out}}$, to the next PE. The i-th PE calculates the partial products of X and $y_i$.

Fig. 4. A circuit diagram of the exponent part. "X\_e" and "Y\_e" are input data. "load" is a control signal.

### B. Improving the Throughput of the Multiplier

The most important performance indicator of a multiplier is its throughput. In conventional n-bit systolic-array multipliers, at least 2n clock cycles are needed per result to avoid data collisions in the PEs, because the result is 2n bits long. Therefore, new data can be input only every 2n clock cycles. Assuming an input frequency of 25 GHz and 11-bit fractions X and Y, the throughput of the multiplier is estimated to be 1.13 giga operations per second. This throughput is lower than that of SFQ floating-point adders [6], and hence the performance of the FPU is limited by the FPM.

To improve the performance of the FPM, we propose a novel architecture for the systolic-array multiplier. Although the result of an n-bit multiplication is 2n bits long, the useful output data in the floating-point calculation are only the upper (n+1) bits, which include the overflow bit. By eliminating the data in the lower (n-1) bits, the throughput of the floating-point multiplier can be improved. Fig. 3 shows a block diagram of the systolic-array bit-serial multiplier with improved throughput. NDROs are added between PEs to control the propagation of results calculated by the previous PE, so that the results in the lower (n-1) bits do not propagate to the next PEs.
As a result, data can be input every (n+1) clock cycles, and the throughput is improved to 2.08 giga operations per second for 25 GHz operation.

### C. Exponent Part

The exponent part adds the exponents of the two floating-point numbers to be multiplied. Fig. 4 shows a block diagram of the exponent part. The exponent part is composed of a bit-serial adder, an EXOR gate, a subtractor and a normalizer. The first bit of $X_e$ and $Y_e$ corresponds to the sign of the floating-point numbers. The EXOR gate determines the sign of the result. After the two exponents are added, 15 is subtracted from the sum because each exponent is biased by 15 in the half-precision format.

### D. Normalizer

When an overflow occurs in the fraction, the result must be normalized to fit the normalized scientific form. Fig. 5 shows a block diagram of the normalizer in the fraction part. In this circuit, DFFs store the calculation result of the fraction. DFF0 stores the overflow bit. The NDROs shown in Fig. 5 control the shift of the fraction according to the overflow bit. If an overflow occurs, the upper NDRO is enabled and the fraction is shifted left by 1 bit. At the same time, the circuit sends a control signal, shift\_ex, to the normalizer in the exponent part. The normalizer in the exponent part adds 1 to the exponent if it receives the control signal. If an overflow does not occur, the lower NDRO is enabled and the fraction data are not shifted.

Fig. 5. A circuit schematic of the normalizer in the fraction part. "Z\_in" represents the output from the fraction part. A clock tree for DFFs is not shown in the figure.

Fig. 6. Microphotograph of the test circuit of the SFQ half-precision FPM. The circuit includes the FPM and several circuits for the on-chip high-speed test.

TABLE 2
THE ESTIMATED CIRCUIT SCALE AND PERFORMANCE OF THE SFQ FPMS

| | # of JJs | Clock skew (ps) | Latency (clock) | Minimum interval (clock) |
|------------------|----------|-----------------|-----------------|--------------------------|
| PE | 639 | 234 | 1 | - |
| Half-precision | 8904 | 2834 | 23 | 12 |
| Single-precision | 20700 | 6300 | 49 | 25 |
| Double-precision | 44500 | 13100 | 107 | 54 |

## IV. DESIGN AND DEMONSTRATION OF FPM

The SFQ half-precision FPM was fabricated using the SRL 2.5 kA/cm² Nb process [7] and the CONNECT cell library [8]. A microphotograph of the FPM is shown in Fig. 6. The circuit area is 6.22 mm × 3.78 mm and the number of Josephson junctions is 11044. The target frequency is 25 GHz. This circuit includes shift registers and a ladder-type clock generator for on-chip testing [9]. The calculation blocks are connected by passive transmission lines composed of microstrip lines and striplines [10], [11]. The digital circuit simulation indicates that the DC bias margin of the FPM is ±20% at a 25 GHz input frequency.

Fig. 7. A test result of the half-precision FPM in a low-speed test. The data start with the LSB, and this data pattern causes an overflow in the fraction part. We confirmed the correct operation. The input data pattern is $X = -(1.1010110111)_2 \times \exp(11001)_2$, $Y = -(1.1001010011)_2 \times \exp(01101)_2$. The correct result, $S = (1.0101001110)_2 \times \exp(11000)_2$, is measured.

We estimated the circuit scale and performance of single- and double-precision FPMs based on the result of the designed half-precision FPM. Table 2 shows the estimated results. In the table, the clock skew is the time lag between the input and output of the clock signals for the FPM. Latency is the number of clock pulses required to produce the results.
The minimum interval is the minimum number of clock cycles between successive input data, which corresponds to the reciprocal of the maximum throughput.

We measured the DC bias margins of the FPM for several data patterns in a low-speed test and an on-chip high-speed test. Fig. 7 shows an example of the low-speed test results. In this test pattern, the input data are $X = -(1.1010110111)_2 \times \exp(11001)_2$ and $Y = -(1.1001010011)_2 \times \exp(01101)_2$. The first bit of exponent\_X and exponent\_Y in the figure is the sign bit. One can see that the correct answer, $S = (1.0101001110)_2 \times \exp(11000)_2$, is obtained. It should be noted that this test pattern also checks the overflow of the fraction and the normalization of the result. The DC bias margin is ±2.95% at low speed.

In the on-chip high-speed test, we could confirm the correct operation of only the fraction part in this chip. The maximum operating frequency was about 31.5 GHz. We also confirmed the correct operation of the exponent part in different chips, in which the maximum operating frequency was 25 GHz. Unfortunately, we did not confirm the correct high-speed operation of both the fraction and exponent parts in the same chip.

## V. CONCLUSION

We designed, implemented and demonstrated an SFQ bit-serial half-precision floating-point multiplier. In the design of this circuit, we adopted a systolic-array structure in the fraction part. We fabricated the FPM using the SRL 2.5 kA/cm² Nb standard process. The number of Josephson junctions is 11044 and the circuit area is 6.22 mm × 3.78 mm. We confirmed the correct operation of the FPM at low speed. In the on-chip high-speed test, we confirmed the correct operation of the fraction and the exponent parts separately by using different chips. The maximum operating frequencies were about 31.5 GHz in the fraction part and 25 GHz in the exponent part.
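
The arithmetic described in Sections III and IV can be cross-checked in software. The following Python sketch is a behavioral model only, not a description of the SFQ circuit. It assumes normalized operands, truncation of the lower (n-1) product bits, and no handling of special values or exponent overflow. The names `fpm_multiply` and `throughput_gops` are illustrative and do not appear in the design. The usage example takes the input pattern given in the Fig. 7 caption.

```python
"""Behavioral sketch of the half-precision FPM arithmetic (assumed model)."""

BIAS = 15  # exponent bias of the half-precision format
N = 11     # fraction length in bits, including the leading 1


def fpm_multiply(sign_x, exp_x, frac_x, sign_y, exp_y, frac_y, n=N, bias=BIAS):
    """Multiply two normalized operands.

    Each operand is (sign bit, biased exponent code, n-bit fraction), where
    the fraction is an integer whose MSB is the leading 1 of 1.f.
    Returns (sign, biased exponent code, n-bit fraction, overflow bit).
    """
    product = frac_x * frac_y        # full 2n-bit product
    upper = product >> (n - 1)       # keep upper (n+1) bits, drop lower (n-1)
    overflow = upper >> n            # top bit is set when the product is >= 2.0
    if overflow:
        frac_z = upper >> 1          # take the upper n of the (n+1) kept bits
    else:
        frac_z = upper & ((1 << n) - 1)  # take the lower n of the kept bits
    exp_z = exp_x + exp_y - bias + overflow  # remove one bias, adjust on overflow
    sign_z = sign_x ^ sign_y         # EXOR of the two sign bits
    return sign_z, exp_z, frac_z, overflow


def throughput_gops(clock_ghz, interval_clocks):
    """Operations per nanosecond when one input is accepted every interval."""
    return clock_ghz / interval_clocks


if __name__ == "__main__":
    # Input pattern from the Fig. 7 caption (both operands negative).
    s, e, f, ovf = fpm_multiply(1, 0b11001, 0b11010110111,
                                1, 0b01101, 0b11001010011)
    print(s, format(e, "05b"), format(f, "011b"), ovf)
```

Running the script prints `0 11000 10101001110 1`, that is, a positive sign, exponent code 11000 and fraction 1.0101001110 with the overflow bit set, in agreement with the measured result of Fig. 7. Under the same assumptions, `throughput_gops(25, 22)` and `throughput_gops(25, 12)` evaluate to about 1.136 and 2.083 giga operations per second, in line with the estimates for the conventional and improved multipliers in Section III-B.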
## ACKNOWLEDGMENT

The CONNECT cell library and tools were used in this study. The authors would like to thank all the CONNECT members, consisting of SRL-ISTEC, NiCT, Yokohama National University, and Nagoya University.

## REFERENCES

- [1] N. Takagi, K. Murakami, A. Fujimaki, N. Yoshikawa, H. Inoue, and H. Honda, "A processor with a large-scale reconfigurable data-path using rapid single flux quantum circuits," *IEICE Trans. Electron.*, vol. E91-C, pp. 350-355, Mar. 2008.
- [2] K. K. Likharev and V. K. Semenov, "RSFQ Logic/Memory Family: A New Josephson-Junction Technology for sub-terahertz-clock-frequency digital systems," *IEEE Trans. Appl. Supercond.*, vol. 1, pp. 3-28, Mar. 1991.
- [3] A. Fujimaki, Y. Takai, and N. Yoshikawa, "High-end server based on complexity-reduced architecture for superconductor technology," *IEICE Trans. Electron.*, vol. 85, pp. 612-616, Mar. 2002.
- [4] K. Obata, M. Tanaka, Y. Tashiro, Y. Kamiya, N. Irie, K. Takagi, N. Takagi, A. Fujimaki, N. Yoshikawa, H. Terai, and S. Yorozu, "Single-flux-quantum integer multiplier with systolic array structure," *Physica C*, vol. 445-448, pp. 1014-1019, 2006.
- [5] H. T. Kung, "Why systolic architectures?," *Computer*, vol. 15, pp. 37-46, Jan. 1982.
- [6] H. Park, Y. Yamanashi, K. Taketomi, N. Yoshikawa, M. Tanaka, K. Obata, Y. Itou, A. Fujimaki, N. Takagi, and S. Nagasawa, "Design and Implementation of SFQ Half-Precision Floating-Point Adders," in *Appl. Supercond. Conf.*, 4EB01, Aug. 2008.
- [7] S. Nagasawa, Y. Hashimoto, H. Numata, and S. Tahara, "A 380 ps, 9.5 mW Josephson 4 Kbit RAM operated at high bit yield," *IEEE Trans. Appl. Supercond.*, vol. 5, pp. 2447-2452, Jun. 1995.
- [8] S. Yorozu, Y. Kameda, H. Terai, A. Fujimaki, T. Yamada, and S. Tahara, "A single flux quantum standard logic cell library," *Physica C*, vol. 378-381, pp. 1471-1474, Sep. 2002.
- [9] Z. J. Deng, N. Yoshikawa, S. R. Whiteley, and T. Van Duzer, "Data-Driven Self-Timed RSFQ High Speed Test System," *IEEE Trans. Appl. Supercond.*, vol. 7, pp. 3830-3833, Dec. 1997.
- [10] T. Yamada, H. Ryoki, A. Fujimaki, and S. Yorozu, "Flexible superconducting passive interconnects with 50-Gb/s signal transmissions in single-flux-quantum circuits," *Jpn. J. Appl. Phys.*, vol. 45, pp. 752-757, Feb. 2006.
- [11] T. Yamada and A. Fujimaki, "A novel splitter with four fan-outs for ballistic signal distribution in single-flux-quantum circuits up to 50 Gb/s," *Jpn. J. Appl. Phys.*, vol. 45, pp. L262-264, Feb. 2006.