The Applied Superconductivity Conference 2016
4EOr3A-01

Fully Functional Operation of Low-Power 64-kb Josephson-CMOS Hybrid Memories

Gen Konno, Yuki Yamanashi, Nobuyuki Yoshikawa

Yokohama National University, Japan

- This work was supported in part by the VLSI Design and Education Center (VDEC), the University of Tokyo, in collaboration with Cadence Design Systems, Inc. and Synopsys, Inc. The CMOS VLSI chip in this study was fabricated through the chip fabrication program of VDEC in collaboration with Rohm Corporation and Toppan Printing Corporation.
- The Josephson circuits were fabricated in the Clean Room for Analog-digital superconductivity (CRAVITY) of the National Institute of Advanced Industrial Science and Technology (AIST) using the standard process 2 (STP2).
Outline of Today’s Talk

■ Introduction of previous study
  ➢ SFQ interface circuit
  ➢ CMOS interface circuit & 64-kb CMOS static RAMs
  ➢ Access time measurement of 64-kb hybrid memory

■ This study
  ➢ Low power consumption of 64-kb CMOS static RAMs
    • Miniaturization of memory cell
    • Improvement of data driver
    • Introduction of binary-tree decoder
  ➢ Measurement results of low-power 64-kb hybrid memory
    • Functional tests
    • Evaluation of bias margin
  ➢ Perspective of hybrid memory

■ Summary
CMOS circuits are now the mainstream of computer engineering. Their advantage is high integration density, but their drawbacks are heat dissipation and delay. Single-flux-quantum (SFQ) circuits are therefore attracting attention as a possible next-generation VLSI technology. SFQ circuits offer high-speed operation and low power consumption, but their weak point is memory (RAM). We are therefore investigating a Josephson-CMOS hybrid memory system that combines the high speed and high sensitivity of SFQ circuits with the high density and high drivability of CMOS circuits.
As for memories, the largest superconducting RAM successfully demonstrated so far is only 4 kb. Because of the low integration density and low driving capability of SFQ circuits, it is difficult to realize large-scale memories using SFQ circuits alone. The UC Berkeley group therefore proposed the Josephson-CMOS hybrid memory to overcome the memory bottleneck in SFQ digital systems, and this was the beginning of Josephson-CMOS hybrid memory. The group reported full function of their hybrid memory, the first fully functional memory operating at 4 K with a capacity beyond 4 kb. They stated that the Josephson-CMOS hybrid memory is an attractive near-term solution for Josephson memory in high-performance computing.
This slide shows the memory hierarchy in our SFQ digital system, with capacity growing and speed falling from top to bottom. Josephson-CMOS hybrid memory techniques are essential for building an SFQ digital system because they overcome its memory bottleneck. Compared with superconducting RAMs, the Josephson-CMOS hybrid memory has a somewhat lower clock rate and slightly higher power consumption, but a much larger capacity. Considering these factors, the Josephson-CMOS hybrid memory is suitable as the L2 cache memory of a future SFQ digital system, and we have been developing it for that purpose.
This slide shows the architecture of the hybrid memory. The Josephson-CMOS hybrid memory consists of three parts: a memory cell array and address decoder, both built from CMOS circuits; an SFQ/CMOS interface, built from Josephson latching drivers and CMOS differential amplifiers; and SFQ serial/parallel converters and SFQ current sensors.

The hybrid system operates as follows. When data and address pulse signals arrive from the SFQ circuits, serial/parallel converters convert the serial signals to parallel signals. The Josephson latching drivers (JLDs) transform the parallel pulse signals into 40 mV signal levels, and the CMOS differential amplifiers amplify them further to 1.8 V levels. The CMOS word and block decoders select the memory cells at one row and one column, and the data are written to the selected cells. In a read operation, the memory cells are specified by address pulse signals from the SFQ circuits, as in a write operation. The data stored in the specified memory cells are transformed into pulse signals by the SFQ current sensors (LDDS) and returned to the SFQ circuits. The merits of this system are the high speed and low power consumption provided by the SFQ current sensors, and the faster operation of CMOS circuits at cryogenic temperatures.
In this section, we briefly introduce our previous study toward fully functional operation of the hybrid memory. There are three topics. The first topic is “SFQ interface circuit”.

The first-stage amplifier of the Josephson-CMOS interface circuit is an AC-biased Josephson latching driver (JLD). It is composed of a Suzuki stack with two parallel 17/16-junction stacks of underdamped Josephson junctions. By placing a 4JL gate in front of the JLD as a pre-amplifier and carefully designing the resistance $R_o$ and inductance $L_o$ of the output load, we improved the input and output stability. We also addressed the problems caused by the AC bias, such as strong variations in the magnetic field and the influence of the large AC bias current on the SFQ circuits.
We covered the SFQ circuits, except the stack part, with the CTL layer as a magnetic shield to prevent disruption by external magnetic fields. The hatched pink areas are the magnetic shielding structure, where the SFQ circuits are entirely covered by the CTL layer. With this method, we confirmed correct operation of every JLD and obtained very wide bias margins.

We confirmed correct operation for every JLD.
We obtained very wide AC bias margins (about ±30%).

The second topic is “CMOS interface circuit & 64-kb CMOS static RAMs”.

The CMOS amplifier of the interface circuit is composed of a level shift circuit in the first stage and a differential amplifier in the second stage. The level shift circuit shifts the small input voltage (40 mV) to a voltage level appropriate for the differential amplifier in the second stage. Because of its self-biased architecture, this amplifier has higher noise immunity than conventional amplifiers. It is the most suitable for the interface circuit in terms of speed, power consumption, and robustness.

We used eight-transistor (8T) SRAM cells for the memory cell array for several reasons: they offer high-speed access, good read/write stability, and a large read-out current of 200 μA, which is sufficient to drive the following LDDS.
We demonstrated complete operation of the 64-kb CMOS static RAMs with 21-bit CMOS differential amplifiers at 4.2 K. The chips were fabricated in the Rohm 180 nm CMOS process. Sufficient margins were obtained for every amplifier (figure: "Bias margin of CMOS differential amplifiers"). We also measured the access time of the 64-kb CMOS static RAMs, which was 1.32 ns to 1.47 ns depending on the memory address (figure: "Histogram of access time of 4 K CMOS SRAM").

The third topic is “Access time measurement of 64-kb hybrid memory”.
We measured the access time of the 64-kb Josephson-CMOS hybrid memory for different memory addresses by using an SFQ time-to-digital converter (TDC). The block diagram on the left shows the access time measurement system. It is composed of the hybrid memory (serial/parallel converter, JLD, CMOS SRAM, SFQ current sensor) and the SFQ TDC (clock generator and SFQ pulse counter). The access time is obtained by counting the number of SFQ pulses from the clock generator. The time resolution of the measurement system, which is determined by the period of the clock generator, is 43 ps.
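
The counting principle of the TDC can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the function names are ours, the pulse counts in the loop are hypothetical values chosen to bracket the reported range, and we assume the counter simply counts whole clock periods.

```python
# Period of the SFQ clock generator, which sets the time resolution (43 ps in the talk).
CLOCK_PERIOD_PS = 43


def access_time_ps(pulse_count: int) -> int:
    """Convert a counted number of SFQ clock pulses into an access time in ps."""
    if pulse_count < 0:
        raise ValueError("pulse_count must be non-negative")
    return pulse_count * CLOCK_PERIOD_PS


def pulse_count_for(time_ps: float) -> int:
    """Whole clock periods that fit into a given time (assumption: counts round down)."""
    return int(time_ps // CLOCK_PERIOD_PS)


if __name__ == "__main__":
    # Hypothetical pulse counts, not measured data.
    for count in (33, 35, 37):
        print(count, "pulses ->", access_time_ps(count), "ps")
```

With a 43 ps period, the measured access times can only be resolved in steps of 43 ps, so a spread of a few hundred picoseconds corresponds to only a handful of distinct counts.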

The left graph in this slide shows the dependence of the access time on the word address. Access times were obtained for 59 of the 256 word addresses. They ranged from 1.43 ns to 1.60 ns, and we confirmed that the access time increases with the word address because the propagation length of the read-out data increases. The orange line in the graph represents the measurement results for each address, and it follows the same trend as the blue line, which represents the simulation results. These results confirm the expected dependence of the access time on the word address.

Next, we introduce this study.

This slide shows the purpose of this study. The bar graph shows the power consumption of the hybrid memory, which was estimated to be 56.5 mW for Write operation and 22.8 mW for Read operation at 1 GHz. The purple bars show the power consumption of the SFQ interfaces, and the others show that of the CMOS circuits. Most of the power is consumed by the CMOS circuits; in particular, the CMOS differential amplifiers and decoders dominate. Reducing their power consumption is essential for improving the hybrid memory, so we set out to reduce the power consumption of the 64-kb CMOS static RAMs. Because the speed and power consumption of the CMOS differential amplifier had already been optimized in the previous study, here we concentrate on the other CMOS circuit components.

Moreover, we aimed to demonstrate full function of the 64-kb Josephson-CMOS hybrid memory with the JLD, the low-power CMOS SRAM, and the high-sensitivity LDDS.
First, we introduce “Low power consumption of 64-kb CMOS static RAMs”. The first topic is “Miniaturization of memory cell”.

The schematic in this slide shows the 8T-SRAM cell. It is composed of the flip-flop transistors in the blue areas (M1-M4) and the write/read access transistors in the orange areas (M5-M8). In our previous design, the width of M1-M4 was 0.22 μm, the minimum width of the Rohm 180 nm process, and the width of M5-M8 was twice as large in order to obtain a read-out current of 200 μA, large enough for the SFQ current sensor (LDDS). The graph in this slide shows the operation margin of the LDDS; one can see that 200 μA is detectable by the LDDS. Most of the power consumption of the 8T-SRAM cells comes from the energy of the Read “1” operation.
In this study, we were able to decrease the width of M5-M8 to the minimum size by improving the sensitivity of the LDDS, which we achieved by optimizing its circuit parameters. The graph in this slide shows the operation margin of the LDDS, where the blue line shows the margin of the high-sensitivity LDDS. The modified 8T-SRAM cell provides a read-out current of 120 μA, and the high-sensitivity LDDS is sensitive enough to detect it. With this approach, the power consumption of the memory cells in the Read “1” operation was reduced from 1.99 mW to 1.29 mW, a 35% reduction.

The second topic is “Improvement of data driver”.
Next, we focused on the power consumption in Write operation. The left graph in this slide shows the power consumption of the 64-kb CMOS static RAMs. One can see that the power consumption in Write “1” is very large. It came from the data driver circuits, which are buffer circuits that provide data to the memory cells. The right figure shows the data-providing structure of the 64-kb CMOS static RAMs in the previous design. The data drivers provided the data signals to all 8 blocks × 256 memory cells, which means that they provided the data even to memory cells in unselected blocks. In the worst case, they provided the data to all memory cells shown in the figure. In addition, the data drivers needed a large driving ability because of the large load capacitance of all 8 blocks × 256 memory cells.
In our new design, the data drivers provide data signals only to the one memory block selected by the block decoder. The right figure in this slide shows the improved data-providing structure. Because the load capacitance falls to that of 1 block × 256 memory cells, we could decrease the size of the data drivers, which reduces their power consumption.
The right graph in this slide shows the power consumption of the 64-kb CMOS static RAMs with the improved design. One can see that the power consumption in the Write “1” operation is considerably lower. With this approach, the average power consumption in the Write “1” operation was reduced from 69.19 mW to 12.00 mW, an 83% reduction.
The third topic is “Introduction of binary-tree decoder”.
We employed a binary-tree decoder to decrease the power consumption of the decoder. The schematic in this slide shows a 4-bit binary-tree decoder as an example. The decoder is composed of 1-to-2 demultiplexers, each of which switches an input Enable signal to one of two output terminals depending on the Address signal. When the Address signal is “0”, the Enable signal propagates upward; when it is “1”, the signal propagates downward. If the address is “1111”, the demultiplexer in every stage propagates the Enable signal downward. When the Enable signal is “0”, the decoder does not operate.
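
The routing behavior described above can be sketched as a small behavioral model in Python. This is not the circuit itself: the function names are ours, and we assume that the first character of the address string drives the first stage and that outputs are listed from the top terminal to the bottom one.

```python
def demux_1to2(enable: int, address_bit: int):
    """1-to-2 demultiplexer: route enable upward (bit 0) or downward (bit 1)."""
    upper = enable if address_bit == 0 else 0
    lower = enable if address_bit == 1 else 0
    return upper, lower


def binary_tree_decode(address: str, enable: int = 1) -> list:
    """Return the decoder outputs, ordered from the top output to the bottom one."""
    if not address or set(address) - {"0", "1"}:
        raise ValueError("address must be a non-empty string of 0s and 1s")
    signals = [enable]
    for char in address:  # assumption: one address bit per stage, first bit first
        bit = int(char)
        next_signals = []
        for signal in signals:
            next_signals.extend(demux_1to2(signal, bit))
        signals = next_signals
    return signals


if __name__ == "__main__":
    outputs = binary_tree_decode("1111")
    print(outputs.index(1))                       # bottom output of 16 is selected
    print(binary_tree_decode("1111", enable=0))   # Enable "0": no output is active
```

In this model only the demultiplexers on the selected path ever see an active Enable signal, which is the property the tree structure relies on.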
Our previous decoder used a conventional architecture based on an array of multiple-input AND gates. Compared with it, the binary-tree decoder has lower power consumption mainly because the buffer circuits that drive the address lines can be smaller. For example, the signal line for “Address 1” is connected to only one demultiplexer. The load capacitance of the buffer circuits is further reduced by the decrease in the number of logic gates, which reduces the power consumption. The table in this slide compares the performance of 8-bit decoders based on the binary-tree and conventional architectures. With this approach, the power consumption of the decoder decreased from 3.52 mW to 2.67 mW, a 24% reduction. However, the access time deteriorated by 190 ps because of the increase in logic stages.
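
To make the address-line loading argument concrete, the following Python sketch counts how many gates each address line has to drive in the two architectures. It is an idealized count under our own assumptions: the tree has one address line per stage, and the conventional decoder is modeled as 2^n AND gates that each take the true or complemented form of every address bit, without predecoding. It ignores wiring capacitance and transistor sizing, so it does not predict the reported power figures.

```python
def tree_fanout(n_bits: int) -> list:
    """Demultiplexers driven by each address line in a binary-tree decoder.

    Stage k (0-based) contains 2**k demultiplexers sharing one address line.
    """
    return [2 ** k for k in range(n_bits)]


def and_array_fanout(n_bits: int) -> list:
    """Gate inputs driven by each line in an idealized AND-array decoder.

    Every bit has a true and a complemented line, and each of them
    feeds half of the 2**n_bits AND gates.
    """
    return [2 ** (n_bits - 1)] * (2 * n_bits)


if __name__ == "__main__":
    n = 8  # decoder width compared in the talk
    print("tree:     ", tree_fanout(n), "total", sum(tree_fanout(n)))
    print("AND array:", and_array_fanout(n), "total", sum(and_array_fanout(n)))
```

In this count the first address line of the tree drives a single demultiplexer, which matches the “Address 1” example above.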
The table in this slide compares the performance of the 64-kb CMOS static RAMs in the previous design and the present low-power design. By using the three low-power techniques, we decreased the power consumption of the 64-kb CMOS static RAMs by 54% in Write operation and by 8% in Read operation, to an estimated 24.85 mW for Write operation and 18.48 mW for Read operation. The access time increased because of the binary-tree decoder and was evaluated to be 1512 ps. One can see that most of the power is now consumed by the CMOS differential amplifiers. The figure in this slide shows the layout of the low-power 64-kb CMOS static RAMs, which we implemented in the Rohm 180 nm process.
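
As a quick arithmetic check, the reductions quoted for the three individual techniques can be recomputed from the before and after values given in this talk. The dictionary labels and the helper name are ours; the numbers are the reported ones.

```python
# (before, after) power in mW for each technique, as reported in the talk.
REPORTED_MW = {
    "Read 1, memory cells (cell miniaturization)": (1.99, 1.29),
    "Write 1 (improved data driver)": (69.19, 12.00),
    "8-bit decoder (binary-tree)": (3.52, 2.67),
}


def reduction_percent(before: float, after: float) -> float:
    """Relative reduction from before to after, in percent."""
    return 100.0 * (before - after) / before


if __name__ == "__main__":
    for name, (before, after) in REPORTED_MW.items():
        pct = reduction_percent(before, after)
        print(f"{name}: {before} mW -> {after} mW ({pct:.0f}% reduction)")
```

The rounded results are 35%, 83%, and 24%, in agreement with the figures stated for each technique.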
Second, we introduce “Measurement results of low-power 64-kb hybrid memory”.

The photograph on the left shows the Josephson-CMOS hybrid memory chip. The CMOS chip was fabricated in the Rohm 180 nm CMOS process, and the Josephson chips were fabricated in the AIST 2.5 kA/cm² standard process 2. A thinned CMOS chip and two Josephson chips were connected by Al bonding wires. In this study, we implemented a 64-kb Josephson-CMOS hybrid memory with 21-bit Josephson-CMOS interface circuits and 8-bit SFQ current sensors (LDDS). The 21-bit Josephson-CMOS interface circuits were used for an 8-bit word address, a 3-bit block address, 2-bit write/read control signals, and an 8-bit data input. In this measurement, 2 bits of the data input and the 2-bit write/read control signals were applied to the CMOS SRAM directly through the CMOS differential amplifiers. The inputs of the other 17 bits of the Josephson-CMOS interface circuits (address: 11 bits, data: 6 bits) were applied through the JLDs, and a 6-bit data output was detected from the LDDS (right figure in this slide).
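
The bit allocation of the 21-bit interface can be sketched in Python as a packed word. The field widths come from the description above; the field names, the bit ordering (word address in the most significant bits), and the helper functions are assumptions made only for this illustration.

```python
# Field widths of the 21-bit Josephson-CMOS interface (from the talk).
# The order of the fields inside the word is an assumption.
FIELDS = (("word_address", 8), ("block_address", 3), ("control", 2), ("data_in", 8))
TOTAL_BITS = sum(width for _, width in FIELDS)

# Split used in this measurement: inputs through the JLDs versus direct inputs.
JLD_BITS = 11 + 6      # 11 address bits + 6 data bits
DIRECT_BITS = 2 + 2    # 2 data bits + 2 write/read control bits


def pack(values: dict) -> int:
    """Pack the named fields into one integer, first field in the top bits."""
    word = 0
    for name, width in FIELDS:
        value = values[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        word = (word << width) | value
    return word


def unpack(word: int) -> dict:
    """Inverse of pack: split an integer back into the named fields."""
    values = {}
    for name, width in reversed(FIELDS):
        values[name] = word & ((1 << width) - 1)
        word >>= width
    return values


if __name__ == "__main__":
    example = {"word_address": 0, "block_address": 0b100, "control": 0b10, "data_in": 1}
    print(TOTAL_BITS, JLD_BITS + DIRECT_BITS)
    print(unpack(pack(example)) == example)
```

Both the field widths and the measurement split add up to 21 bits.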
The left figure shows an example of measured waveforms. The word address is fixed to “00000000” in this example. In the first test sequence (①), we examined the write/read function at a specified address. The data “1” is written to a memory cell of block address “100” by applying the “Data_in” and “Write” signals. When a “Read” signal is applied, transitions of the output signal, corresponding to a 120 μA-level current, are observed from the LDDS. Next, the data “0” is written to the memory cell at the same address. This time the LDDS output does not change because there is no output current, which shows correct Read “0” operation. In the second test sequence (②), we examined address selection. The data “1” is written to the memory cell of block address “000”, and then the data “0” is read out from the memory cell of block address “100”. This shows that the address is selected correctly.
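
The logic of the two test sequences can be replayed on a toy software model of the memory. This Python sketch only mirrors the expected behavior: the class and variable names are ours, one bit is stored per (block address, word address) pair, and cells are assumed to read “0” until written.

```python
class BlockMemory:
    """Toy model: one stored bit per (block address, word address) pair."""

    def __init__(self):
        self.cells = {}

    def write(self, block: str, word: str, bit: int) -> None:
        self.cells[(block, word)] = bit

    def read(self, block: str, word: str) -> int:
        # Assumption: unwritten cells read as 0.
        return self.cells.get((block, word), 0)


WORD = "00000000"  # word address fixed, as in the measured example
memory = BlockMemory()

# Test sequence 1: write/read at one address (block "100").
memory.write("100", WORD, 1)
first_read = memory.read("100", WORD)    # stored "1": output current expected
memory.write("100", WORD, 0)
second_read = memory.read("100", WORD)   # stored "0": no output current expected

# Test sequence 2: address selection.
memory.write("000", WORD, 1)
other_block_read = memory.read("100", WORD)  # block "100" must still hold "0"

if __name__ == "__main__":
    print(first_read, second_read, other_block_read)
```

If the address selection were faulty, the write to block “000” would disturb the cell in block “100” and the last read would no longer return “0”.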

We confirmed correct memory function for each address. We demonstrated correct operation for 20 bits of the Josephson-CMOS interface inputs, which correspond to the 8-bit word address, the 3-bit block address, the 2-bit write/read control signals, and 7 bits of the data input.
These graphs show the bias margins of each interface circuit. In this measurement, we could not confirm operation for the “Data 6” input because of a malfunction of the LDDS. One can see some scatter in the bias margins of the interface circuits. However, we obtained bias margins wider than 20% for every interface circuit except the CMOS differential amplifier of “Word 1”. From these results, we demonstrated nearly fully functional operation of the 64-kb Josephson-CMOS hybrid memory.
Finally, I introduce “Perspective of hybrid memory”.
In the demonstration of the hybrid memory, we used the Rohm 180 nm CMOS process and the 2.5 kA/cm² Josephson process. The total access time was evaluated to be 1718 ps and the power consumption was estimated to be 27.62 mW in Write operation and 21.25 mW in Read operation.

In this study, we also examined the performance of the hybrid memory when more advanced processes are used. First, we assumed the Starc 65 nm CMOS process and the AIST 10 kA/cm² Nb advanced process. This table compares the performance of the 64-kb Josephson-CMOS hybrid memory in the current processes and in the advanced processes. The comparison is based on simulation: we simulated the CMOS performance by using a simulation model for the Starc 65 nm process, with the behavior of a single CMOS inverter at 4 K as a reference. We also estimated the performance of the JLDs and LDDSs from circuits designed using the AIST standard process 2. The total power consumption was estimated to be 11.43 mW for Write operation and 9.18 mW for Read operation, and the total access time was evaluated to be 657 ps. These advanced processes would greatly improve the performance.
We also estimated the performance of large-scale Josephson-CMOS hybrid memories. This graph shows the dependence of the performance on the memory scale, assuming the same processes (Starc 65 nm process + AIST advanced process). For the access time, the CMOS decoder limits the performance of the hybrid memory: the decoder delay increases by about 25 ps for each fourfold increase in memory scale. For the power consumption, each fourfold increase in memory scale adds two bits to the Josephson-CMOS interface circuits and doubles the CMOS decoder and the data driver. Therefore, in a large-scale memory the share of the Josephson-CMOS interface circuits (blue and orange bars) becomes relatively small, and that of the decoder and the data driver (red bars) becomes dominant. For a 16-Mb Josephson-CMOS hybrid memory, the access time was evaluated to be 756 ps, and the power consumption was approximately three times that of the 64-kb hybrid memory.
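
The scaling rules stated above can be written as a short Python sketch. It is a rough extrapolation, not our simulation: it assumes exactly 25 ps of extra decoder delay and two extra interface bits per fourfold increase in capacity, starting from the 64-kb values of 657 ps and 21 bits, and the function names are ours. It does not estimate the total power, because that requires the per-component breakdown shown in the graph.

```python
BASE_CAPACITY_KB = 64       # reference memory size
BASE_ACCESS_PS = 657        # simulated access time of the 64-kb memory (advanced process)
DELAY_PER_4X_PS = 25        # "about 25 ps" per fourfold increase; treated as exact here
BASE_INTERFACE_BITS = 21    # interface width of the 64-kb memory
BITS_PER_4X = 2             # two more interface bits per fourfold increase


def quadruplings(capacity_kb: int) -> int:
    """Number of fourfold steps from 64 kb up to capacity_kb."""
    steps, size = 0, BASE_CAPACITY_KB
    while size < capacity_kb:
        size *= 4
        steps += 1
    if size != capacity_kb:
        raise ValueError("capacity must be 64 kb times a power of four")
    return steps


def estimated_access_time_ps(capacity_kb: int) -> int:
    return BASE_ACCESS_PS + DELAY_PER_4X_PS * quadruplings(capacity_kb)


def interface_bits(capacity_kb: int) -> int:
    return BASE_INTERFACE_BITS + BITS_PER_4X * quadruplings(capacity_kb)


if __name__ == "__main__":
    capacity_16mb = 16 * 1024  # 16 Mb expressed in kb
    print(quadruplings(capacity_16mb))
    print(estimated_access_time_ps(capacity_16mb), "ps")
    print(interface_bits(capacity_16mb), "bits")
```

Going from 64 kb to 16 Mb takes four fourfold steps, so this rule of thumb gives 757 ps, close to the simulated 756 ps.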
Finally, we estimated the performance of 16-Mb Josephson-CMOS hybrid memories assuming the use of future CMOS processes. These graphs show the predicted performance: the left graph shows the access time of the 16-Mb hybrid memory, and the right graph shows its power consumption. We estimated the performance from the ratios of the switching speed and switching energy of MOS transistors in each CMOS process [1]. For sub-22 nm processes, we assumed FinFET technology. One can see that the performance of the hybrid memory improves greatly. With a 7 nm FinFET process, the access time was evaluated to be about 200 ps and the power consumption was estimated to be about 5 mW.
Summary

- We reduced the power consumption of the 64-kb CMOS static RAMs by using three low-power techniques.
  - Write operation: 54% reduction
  - Read operation: 8% reduction

- We demonstrated the correct memory function for arbitrary address accesses.
  - 20-bit/21-bit correct operation
  - Large enough margins of Josephson-CMOS interfaces

- By using deeply-scaled processes such as FinFET, the Josephson-CMOS hybrid memory is expected to be an attractive large-scale memory for high-performance computing.

We reduced the power consumption of the 64-kb CMOS static RAMs by using three approaches: the miniaturization of the memory cell, the improvement of the data driver, and the introduction of the binary-tree decoder. As a result, we decreased the power consumption of the 64-kb CMOS static RAMs by 54% in Write operation and by 8% in Read operation. By using the low-power CMOS static RAMs, we demonstrated correct memory function of the hybrid memory for arbitrary address accesses. We confirmed that the margins of the Josephson-CMOS interfaces are large enough to realize a fully functional hybrid memory. Based on these results, we discussed future Josephson-CMOS hybrid memories by simulating advanced CMOS processes. By using deeply-scaled processes such as FinFET, the Josephson-CMOS hybrid memory is expected to be an attractive large-scale memory for high-performance computing.