### 1EOr2B-03: An adiabatic quantum-flux-parametron block permutation unit for a superconductor SHA-3 cryptoprocessor

### Christopher L. Ayala<sup>1</sup>, Tomoyuki Tanaka<sup>2</sup>, Ro Saito<sup>2</sup>, and Nobuyuki Yoshikawa<sup>1</sup>

<sup>1</sup>Institute of Advanced Sciences, Yokohama National University, Japan <sup>2</sup>Department of Electrical Engineering and Computer Engineering, Yokohama National University, Japan email: ayala-christopher-pz@ynu.ac.jp







#### 2

Trend of rising electricity demand of information and communications technology (ICT).

## Currently 10% of the total electric power worldwide as of 2020.

#### Facebook Data Center, Lulea, Sweden



Performance: 27-51 PFLOP/s Power 84 MW avg\* (120 MW max)

D.S. Holmes, ISS 2013, Tokyo, Japan. http://worldstopdatacenters.com/renewable-energy-output-rankings/



N. Jones, Nature, vol. 561, no. 7722, pp. 163–166, Sep. 2018.

## **Worst-case scenario:** ICT could use as much as 50% of global electricity by 2030.

A. S. G. Andrae and T. Edler, *Challenges*, vol. 6, no. 1, pp. 117–157, Jun. 2015.

## AQFP logic for computing

#### 3

F

□ Adiabatic quantum-flux-parametron (AQFP) logic

- Composed of a pair of Josephson junction (JJ) superconductor devices
- **Extremely small bit energy** <<  $I_c \Phi_0$ 
  - Very small switching energy due to adiabatic operation
  - 1.4 zJ per JJ at 4.2 K in experiment [1]
- High gain
  - 10-50x gain from µA's of input current
- High robustness
- Clock speeds on par with state-of-the-art CMOS logic (5-10GHz)



After cooling overhead [2], ~80x more efficient than 7nm FinFET with  $V_{DD}$  = 0.8V [3]

[1] N. Takeuchi et al., Appl. Phys. Lett., vol. 114, no. 4, p. 042602 (2019)
[2] D.S. Holmes et al., IEEE TAS, 23, no.3, (2013)
[3] A. Stillmaker et al., Integration. 58, pp. 74-81 (2017)

### AQFP logic a promising candidate for energy-efficient computing.

### MANA microarchitecture





### MANA – <u>Monolithic Adiabatic iNtegration Architecture</u>

- **Goal:** Demonstrate AQFP can do both logic and memory
- RISC-like datapath + dataflow-like control
- In-order, single-issue
- 4-bit data word size
- □ 16-bit instr. word
- Program branching
- □ 21,460 JJs in 1 x 1 cm<sup>2</sup> chip
- □ 30 fJ/op at RT @ 5 GHz
- 4-phase 5 GHz clock
- □ Latency: 108 clock phases or 27 cycles (5.4 ns @ 5 GHz)

Instruction Buffer, Decode, and Issue (IDI) 5,596 JJs 8 cycles (32 phases) Register File with external I/O (RFX) 8,142 JJs 8 cycles (32 phases) ALU-Shifter (EX) 2,238 JJs 9 cycles (36 phases)

Ctrl buffer, routing, write-back (WB) 5,484 JJs 17 cycles (68 phases) overlapped 2 cycles (8 phases) write-back

C. L. Ayala et al., "MANA: A Monolithic Adiabatic iNtegration Architecture Microprocessor Using 1.4zJ/op Unshunted Superconductor Josephson Junction Devices," IEEE JSSC, Apr. 2021.

### MANA featured in media

#### 5

Ţ

#### IEEE SPECTRUM Superconducting Microprocessors?

Turns Out They're Ultra-Efficient

The 2.5 GHz prototype uses 80 times less energy than its semiconductor counterpart, even accounting for cooling

By Michelle Hampson



The AQFP-based MANA microprocessor seated on a chip holder. The microprocessor die contains over 20,000 superconductor Josephson junctions. It is the first ever adiabatic superconducting microprocessor

Computers use a staggering amount of energy today. According to one recent estimate, data centers alone consume two percent of the world's electricity, a figure that's expected to climb to eight percent by the end of the decade. To buck that trend, though, perhaps the microprocessor, at the center of the computer universe, could be streamlined in entirely new ways.

One group of researchers in Japan have taken this idea to the limit, creating a superconducting microprocessor-one with zero electrical resistance. The new device, the first of its kind, is described in a study published last month in









FEBRUARY 4, 2021

4-bit Adiabatic Processor

r that uses super-time an adiabatic su 109/JSSC.2020.3041338). Their novel devi

made on comm

works, this figure could clim

tics that driv

about 10% of global energy

tructure or the eli

50% by 2030. With an eve

Device is Built into an AQFP Structure

The Yokohama 4-bit processor, dubbed the Monolithic Adiabatic iNtegration Architecture (MANA), is based on unshunted Josephson

[画像のクリックで拡大表示] づくマイクロプロセッサーの約100万分の1以

<sup>†</sup>MANA = Monolithic Adiabatic iNtegration Architecture Processor,

下と猛烈に低い。









### MANA featured in media



#### IEEE SPECTRUM Superconducting Microprocessors? Turns Out They're Ultra-Efficient

The 2.5 GHz prototype uses 80 times less energy than its semiconductor counter







### MANA: prototype, conceptual demo



Small instruction buffer, real-world programs would not run efficiently

Photo: Christopher Ayala The AQFP-based MANA microprocesso die contains over 20,000 superconductr adiabatic superconducting microproces

Computers use a staggering amount estimate, data centers alone consum figure that's expected to climb to eigh buck that trend, though, perhaps the computer universe, could be streaml

One group of researchers in Japan haves superconducting microprocessor—one with zero ele



Ongoing AQFP research on low-latency techniques, chip-level clocking, impedance matching of data interconnect, circuit density, etc.







ID

FX

## Cryptography: hashing

#### 7

F

## In what application can we leverage SCE technology today?

- How about cryptography hashing?
  - □ h:  $\{0, 1\}^* \rightarrow \{0, 1\}^n$
  - Input: "message" arbitrarily long binary input
  - Output: "digest" fixed length (*n*) binary output
    - Ideally a unique signature for the input message
  - **Similar inputs**  $\Rightarrow$  dissimilar outputs
  - Ideally difficult to reverse engineer input using output

#### Uses:

- O(1) data structure in programs (Hash Table)
- Digital signatures
- Encryption/cybersecurity
- Cryptocurrency

#### **Architecture implementation properties:**

- Data feedback is typically well-controlled
- Control is simple, usally defined as fixed rounds/iterations (counters)
- Easy to keep pipeline filled
- Usually no need for centralized memory during hashing



## Secure Hashing Algorithms (SHA)

| o |  |
|---|--|
| Ä |  |
| U |  |
|   |  |

| Algorithm           | Year | Output<br>Size | State<br>Size | Operations                               | Collisions<br>Found?                   |
|---------------------|------|----------------|---------------|------------------------------------------|----------------------------------------|
| SHA-0               | 1993 | 160-bit        | 160-bit       | AND, XOR, OR,<br>ROT, <mark>ADD32</mark> | Yes ( $\leq 2^{34}$ evaluations)       |
| SHA-1               | 1995 | 160-bit        | 160-bit       | AND, XOR, OR,<br>ROT, <mark>ADD32</mark> | Yes (< 2 <sup>63</sup><br>evaluations) |
| SHA-2<br>(SHA-256)  | 2001 | 256-bit        | 512-bit       | AND, XOR, OR,<br>ROT, SHR, ADD32         | No (2 <sup>128</sup><br>evaluations)   |
| SHA-3<br>(SHA3-256) | 2015 | 256-bit        | 1600-bit*     | ADD, XOR, ROT,<br>NOT                    | No (2 <sup>128</sup><br>evaluations)   |

SHA3/Keccak ("Ket-chak") algorithm won the NIST hash function competition in 2012

SHA-3 is paramterizable\*, simple, and modern – good candidate for implementation.



#### 16-bit AQFP Kogge-Stone adder component [1]

[1] T. Tanaka et al, "A 16-bit parallel prefix carry look-ahead Kogge-Stone adder implemented in adiabatic quantum-flux-parametron logic," IEICE Transactions on Electronics, vol. E105–C, no. 6, Jun. 2022.

## SHA3 overview

#### 9

=



## SHA3 overview

#### 10

=

### **Pseudo code of permute functions:**

Round[b](A,RC) { #  $\theta$  step C[x] = A[x,0] xor A[x,1] xor A[x,2] xor A[x,3] xor A[x,4], for x in 0...4 D[x] = C[x-1] xor rot(C[x+1],1), for x in 0...4 A[x,y] = A[x,y] xor D[x], for (x,y) in (0...4,0...4)

#  $\rho$  and  $\pi$  steps B[y,2\*x+3\*y] = rot(A[x,y], r[x,y]), for (x,y) in (0...4,0...4)

#  $\chi$  step A[x,y] = B[x,y] xor ((not B[x+1,y]) and B[x+2,y]), for (x,y) in (0...4,0...4)

#ιstep A[0,0] = A[0,0] xor RC

return A

# Just AND, XOR and NOT bitwise operators and ROT (rotate) operation

- □ "Sponge construction"
- □ State size (bits):  $b = 5 \times 5 \times 2^{l}$
- $\Box$  *l* = 0, 1, 2, 3, 4, 5, 6
- □ *b* = 25, 50, 100, 200, 400, 800, **1600**
- $\square \ b = c + r$ 
  - c is capacity, r is rate
- □ *c* determines security level
- □ *r* determines block size
- Permute functions:
   θ (theta), ρ (rho), π (pi), ι
   (iota)
- □ Permutation rounds: n = 12 + 2l

## Implementation of SHA3 permutation block



- □ Input pre-padded, non-standard parameters: l = 0, b = 25, r = 16, c = b r = 9
- Circuit divided into 3 clocked networks supplied by a resistor-based power divider
- Key components needed: Data rotator – similar to shifter, SerDes for I/O and debugging, sequential finite-state machine techniques, and improved impedance design power-clock lines

[1] C. L. Ayala, ISS 2017, Tokyo, Japan. [2] C. L. Ayala, ISS 2020, Online. [3] R. Saito et al., DOI: 10.1109/tasc.2021.3061636. [4] T. Yamae et al., DOI: 10.1109/TASC.2020.3044677. [5] N. Takeuchi et al., DOI: 10.1109/TASC.2021.3058080

## Measurement of permutation block





JJ count: 13,008 JJs Chip: 7 mm x 7 mm, 48 pad Active circuit area: 5.0 mm x 5.6 mm Maximum operation: 7 GHz AC margins: +16% / -12%



- □ First +10k JJ AQFP chip at GHz speeds 2 chips measured
- □ BER rather high on the permutation outputs (10<sup>-4</sup>) at 7 GHz
- □ BER on debug shift-registers (SR) reasonable (10<sup>-18</sup>) at 7 GHz

## Measurement of permutation block



## Summary and next steps

IEEE-CSC & ESAS SUPERCONDUCTIVITY NEWS FORUM (global edition), January 2023. Submitted October, 2022; Selected October, 2022. Presentation 1EOr2B-03 given at Applied Superconductivity Conference, Honolulu, HI, USA, October 24, 2022.
[1] F. Ke et al., 2021, DOI: 10.1109/TASC.2021.3059984
[2] L. Akçay and S. B. Ö. Yalçin, 2021, DOI: 10.3906/elk-2003-27

#### 14





b=200-bit SHA3 crypto chip (1 cm x 1 cm)

RSFQ 47.8 GHz FFT processor [1]

### Summary:

- First successful demo of an AQFP SHA3 permutation unit at 7 GHz (b=25 bits)
- Insight on BER consistent with ongoing research on interconnect

### Moving forward:

Measurement of 200-bit state SHA3 AQFP chip



AQFP circuits: TTA execution units and distributed registers RSFQ<->AQFP: Interface between SFQ/AQFP circuits RSFQ circuits: Transport network and routing

Candidate architecture for a future MANA-II processor

- Leverage FFT work for Number Theoretic Transform (NTT) for fully homomorphic encryption processing
- Consideration of Transport Triggered Architecture (TTA)
   -- performance/power advantage over RISC-V
   implementations for post-quantum cryptography [2]

## Summary and next steps

IEEE-CSC & ESAS SUPERCONDUCTIVITY NEWS FORUM (global edition), January 2023. Submitted October, 2022; Selected October, 2022. Presentation 1EOr2B-03 given at Applied Superconductivity Conference, Honolulu, HI, USA, October 24, 2022 [1] F. Ke et al., 2021, DOI: 10.1109/TASC.2021.3059984

[2] L. Akçay and S. B. Ö. Yalçin, 2021, DOI: 10.3906/elk-2003-27

#### 15





b=200-bit SHA3 crypto chip (1 cm x 1 cm)

RSFQ 47.8 GHz FFT processor [1]

### Summary:

- First successful demo of an AQFP SHA3 permutation unit at 7 GHz (b=25 bits)
- Insight on BER consistent with ongoing research on interconnect

### Moving forward:

Measurement of 200-bit state SHA3 AQFP chip



#### First steps towards hybrid AQFP-RSFQ systems:

T. Tanaka, 4EPo1E-07 [E40]: Demonstration of a hybrid superconductor logic computation system using single flux quantum and adiabatic quantum-flux-parametron logic families (Thurs. Oct. 27, 10:30AM)

AQFP circuits: TTA execution units and distributed registers RSFQ<->AQFP: Interface between SFQ/AQFP circuits RSFQ circuits: Transport network and routing

Candidate architecture for a future MANA-II processor

- Leverage FFT work for Number Theoretic Transform (NTT) for fully homomorphic encryption processing
- Consideration of Transport Triggered Architecture (TTA)
   -- performance/power advantage over RISC-V
   implementations for post-quantum cryptography [2]

### **Questions?**

## Thank You

This work was supported by the Grant-in-Aid for Scientific Research (S) No. 19H05614 and the Grant-in-Aid for Scientific Research (C) No. 21K04191 from the Japan Society for the Promotion of Science (JSPS). This work was also supported by the Office of the Directory of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA) via the US Army Research Office grants W911NF-17-1-0120 and W911NF-17-9-0001.

This work was also supported by the VLSI Design and Education Center (VDEC) of the University of Tokyo in collaboration with Cadence Design Systems, Inc.

The circuits were fabricated in the Clean Room for Analog-digital superconductiVITY (CRAVITY) of the National Institute of Advanced Industrial Science and Technology (AIST) using the high-speed standard process (HSTP). Circuits were also fabricated by MIT Lincoln Laboratory using the SFQ5ee process under the IARPA SuperTools program.