

## imec

Superconducting Array of Arrays for Acceleration of Transformers

Manu Perumkunnil, Kartik Lakshminarasimhan, Udara De Silva, Debjyoti Bhattacharjee, Trent Josephson, Quentin Herr, Anna Herr

IEEE-CSC, ESAS and CSSJ SUPERCONDUCTIVITY NEWS FORUM (global edition), Issue No. 56, Sept 2024. Presentation given at WOLTE-16 2024, June 2024, Cagliari, Italy.



#### **OVERVIEW**


- Introduction & Motivation
- Accelerating LLMs via Superconducting Digital Systems
- Scaling up
- Conclusions & Future Outlook

#### Introduction & Motivation

#### Introduction & Motivation Why Transformers/LLMs

- Increasingly common for a variety of tasks: text generation, vision, translation, code generation, classification, etc.
- Superior performance in the sequence-to-sequence family of ML algorithms compared to RNNs & LSTMs
- Exploit all kinds of parallelism: data, pipeline, tensor/model & expert
- Consist of encoder & decoder blocks




#### Introduction & Motivation Cost of AI Training

Figure: total training compute (FLOP, logarithmic axis from 1 FLOP to $10^{26}$ FLOP) of notable AI models versus publication date (1950 to 2025), from Theseus and Perceptron Mark I through AlexNet, AlphaGo Zero and the GPT models to Gemini. Training compute doubling time: 21.6 months in the pre deep learning era, 6.0 months in the deep learning era. Reference trends: Moore's Law 2x/2 years; modern DNN & Transformer era 750x/2 years.

Exponential rise in overall cost of training large AI models $\rightarrow$ Silicon, NRE, Infrastructure & Power/Thermals

Total compute used to train notable AI models, measured in total FLOP (floating-point operations) | Logarithmic

Computing Power and the Governance of Artificial Intelligence: https://doi.org/10.48550/arXiv.2402.08797



#### Introduction & Motivation Cost of AI Training

Soaring cost of Silicon (& design) as we hit process technology scaling limits



asml-investor-day-2021-technology-strategy---martin-van-den-brink.pdf


#### Introduction & Motivation Cost of AI Training

Power distribution & thermal management cost (including Infra)



#### Introduction & Motivation Cost of AI Training

- Cost to overcome memory & interconnect bottlenecks (infrastructure and otherwise) that can severely underutilize compute when scaling AI/HPC clusters





BaGuaLu: https://doi.org/10.1145/3503221.3508417


## Accelerating LLMs via Superconducting Digital Systems

## Accelerating LLMs via Superconducting Digital Systems Full stack solution

- Co-optimization across the stack is essential to fully exploit novel technology solutions while keeping cost in mind (including NRE, infrastructure & cost of adoption)
- Superconducting technology offers:
  - THz bandwidth interconnects
  - Quantum-accurate digital bits
  - Fast, low-energy logic & memory



## Accelerating LLMs via Superconducting Digital Systems

#### Superconducting technology





Enables THz bandwidth (optical-like) interconnects that require no signal amplification

High speed data link between digital superconductor chips: https://doi.org/10.1063/1.1473687



An 8-bit carry look-ahead adder with 150 ps latency and sub-microwatt power dissipation at 10 GHz; https://doi.org/10.1063/1.4776713

#### Meissner effect



Allows for quantum-accurate digital bits (fundamental limit for energy) with a large noise margin



#### Josephson effect

Enables fast, low-energy logic & memory devices that can be scaled and used like classical CMOS technology at ultra-high frequencies (~30 GHz)



## Accelerating LLMs

Superconducting digital logic and memory

- Superconducting digital Pulse-Conserving Logic (PCL), Josephson SRAM (JSRAM) & AC power distribution via resonant networks allow for classical digital systems that are scalable
- Fabrication stack allows for digital logic with 16 layers, achieving up to 400 MJJ/cm<sup>2</sup> device density (400 million Josephson junctions per cm<sup>2</sup>). The stack allows scaling beyond 28 nm lithography & is compatible with standard high-temperature CMOS processes.

Superconducting Pulse Conserving Logic and Josephson-SRAM; https://doi.org/10.1063/5.0148235

Scaling NbTiN-based ac-powered Josephson digital to 400M devices/cm<sup>2</sup>; https://doi.org/10.48550/arXiv.2303.16792



#### Gate Schematics for JJ based PCL

Gates shown: BUF, OA2, OMA3

#### JSRAM unit cell schematics & Layout



Attention Is All You Need: https://doi.org/10.48550/arXiv.1706.03762

Transformers from Scratch: https://e2eml.school/transformers.html

## Accelerating LLMs



LLM training (& inference) workloads are mainly composed of Matrix-Multiplication kernels
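As a rough illustration of this point (not taken from the presentation), the sketch below counts forward-pass FLOPs for one standard Transformer layer and splits them into matrix-multiplication work and everything else. The function names, the 4x feed-forward width, and the small per-element constants used for softmax, layer norm and activations are illustrative assumptions.

```python
def transformer_layer_flops(seq_len, d_model, n_heads, d_ff=None):
    """Rough forward-pass FLOP count for one Transformer layer, one sequence."""
    d_ff = 4 * d_model if d_ff is None else d_ff  # assumed feed-forward width
    s, d = seq_len, d_model
    # An (m x k) @ (k x n) matmul costs about 2*m*k*n FLOPs (multiply + add).
    matmul = {
        "qkv_projection": 3 * 2 * s * d * d,
        "attention_scores": 2 * s * s * d,   # Q @ K^T, summed over all heads
        "attention_values": 2 * s * s * d,   # softmax(.) @ V, summed over all heads
        "output_projection": 2 * s * d * d,
        "feed_forward": 2 * 2 * s * d * d_ff,
    }
    # Assumed small per-element costs for the non-matmul work (illustrative only).
    other = {
        "softmax": 5 * n_heads * s * s,
        "layer_norm": 2 * 5 * s * d,
        "activation_and_residual": s * d_ff + 2 * s * d,
    }
    return matmul, other


def matmul_fraction(seq_len, d_model, n_heads):
    """Fraction of the layer's FLOPs spent inside matrix multiplications."""
    matmul, other = transformer_layer_flops(seq_len, d_model, n_heads)
    m, o = sum(matmul.values()), sum(other.values())
    return m / (m + o)


if __name__ == "__main__":
    print(f"{matmul_fraction(seq_len=2048, d_model=4096, n_heads=32):.4f}")
```

Under these assumptions, a layer with `seq_len=2048`, `d_model=4096` and 32 heads spends more than 99% of its FLOPs in matrix multiplications, which is why the kernel choice dominates accelerator design.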

https://cloud.google.com/blog/products/ai-machine-learning/cloud-tpu-v4-mlperf-2-0-results

## Accelerating LLMs

Architecture

- Array architectures (like systolic arrays) are best suited to efficiently execute matrix-matrix multiplication (MMM) kernels: packing many more MACs onto a chip yields many more FLOPS per chip
- They also have minimal control overhead → reduces design complexity, NRE & cost of adoption for superconducting tech
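A minimal sketch of how such an array executes a matrix multiply may help here. It simulates a generic output-stationary systolic array cycle by cycle: each processing element (PE) keeps one output accumulator, operands of `A` flow left to right, and operands of `B` flow top to bottom. This is a textbook-style model chosen for illustration, not the array proposed in the presentation, and the function name `systolic_matmul` is hypothetical.

```python
def systolic_matmul(A, B):
    """Simulate an M x N output-stationary systolic array computing A @ B.

    A is M x K, B is K x N (lists of lists). Returns (C, cycles).
    """
    M, K, N = len(A), len(B), len(B[0])
    acc = [[0] * N for _ in range(M)]     # one accumulator per PE
    a_reg = [[0] * N for _ in range(M)]   # A operand currently held by PE(i, j)
    b_reg = [[0] * N for _ in range(M)]   # B operand currently held by PE(i, j)
    cycles = K + M + N - 2                # last operand reaches PE(M-1, N-1)

    for t in range(cycles):
        # Shift A one PE to the right and B one PE down. Iterating from the
        # far edge backwards moves every value exactly one hop per cycle.
        for i in range(M):
            for j in range(N - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
        for j in range(N):
            for i in range(M - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]
        # Feed skewed inputs at the left and top edges; zeros pad the skew.
        for i in range(M):
            k = t - i
            a_reg[i][0] = A[i][k] if 0 <= k < K else 0
        for j in range(N):
            k = t - j
            b_reg[0][j] = B[k][j] if 0 <= k < K else 0
        # Every PE performs one multiply-accumulate per cycle.
        for i in range(M):
            for j in range(N):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]
    return acc, cycles


if __name__ == "__main__":
    C, cycles = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
    print(C, cycles)
```

Note that the only control needed is the input skew and a cycle counter; the PEs themselves are identical and have no instruction stream, which is the low control overhead the bullet refers to.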




## Accelerating LLMs

Architecture

- Control units buffer the dataflow from DRAM to the MAC units
- However, simple systolic arrays incur pipeline bubbles during "FILL" & "DRAIN"
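The cost of these bubbles can be estimated with a simple model. The sketch below assumes a generic `rows x cols` output-stationary array that processes one tile with reduction depth `k_depth` and does not overlap consecutive tiles; under that assumption the tile takes `k_depth + rows + cols - 2` cycles, of which only `k_depth` cycles' worth of MAC slots do useful work. The model and the function name are illustrative assumptions, not figures from the presentation.

```python
def systolic_utilization(rows, cols, k_depth):
    """MAC utilization of one tile on a simple output-stationary array.

    Useful MACs: rows * cols * k_depth.
    Available MAC slots: rows * cols * (k_depth + rows + cols - 2),
    where the extra (rows + cols - 2) cycles are the FILL and DRAIN bubbles.
    """
    total_cycles = k_depth + rows + cols - 2
    return k_depth / total_cycles


if __name__ == "__main__":
    for k in (128, 1024, 4096):
        print(k, round(systolic_utilization(128, 128, k), 3))
```

In this model a 128 x 128 array with a reduction depth of only 128 is busy about a third of the time, and utilization only approaches 1 when the reduction depth is much larger than the array dimensions.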




## Accelerating LLMs Array Architecture



- The proposed custom superconducting array is a homogeneous version of regular arrays, built with superconducting wires.
- While this comes at the cost of more wires, the minimal wiring overhead (power & performance) incurred in superconducting digital technology makes this an ideal solution.


## Scaling the system



#### Scaling-Up

#### Array Architecture

- Scale-up (a single large PE array consisting of $N \times N$ MACs) vs scale-out ($X$ PE arrays, each consisting of $N/(X_2) \times N/(X_2)$ MACs)



## Scaling-Up

#### Superconducting Processing Stack

- Superconducting processing stack: multifunctional die stack, 3D-integrated via superconducting TSVs (≤30 µm pitch, 10 µm diameter)

Stacked superconducting integrated circuits with three dimensional resonant clock networks: US20220208726A1

- Provisioning of JSRAM on-chip memory vs main memory bandwidth





#### Scaling-Up

#### Superconducting Array of Arrays

- Shifting data across the chip stack array is necessary to enable distributed matrix multiplication: wrap-around at the edges  $\rightarrow$  2D torus topology
- Superconducting bumps ( $\leq 30 \mu m$  pitch,  $10 \mu m$  diameter) to facilitate inter-array communication
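One well-known scheme that multiplies matrices by shifting blocks around a 2D torus is Cannon's algorithm. The presentation does not name the algorithm it uses, so the sketch below is only an illustration of the general idea: each node of a `p x p` grid holds one block of `A` and one block of `B`, multiplies them locally, then passes its `A` block to the left neighbour and its `B` block to the upper neighbour, with wrap-around at the edges. All function names are hypothetical.

```python
def matmul(X, Y):
    """Plain dense matrix multiply on lists of lists."""
    return [[sum(X[i][t] * Y[t][j] for t in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


def add(X, Y):
    """Elementwise sum of two equally sized matrices."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def split_blocks(M, p):
    """Split an n x n matrix into a p x p grid of (n/p) x (n/p) blocks."""
    b = len(M) // p
    return [[[row[c * b:(c + 1) * b] for row in M[r * b:(r + 1) * b]]
             for c in range(p)] for r in range(p)]


def cannon_matmul(A, B, p):
    """Multiply n x n matrices on a simulated p x p torus of nodes."""
    n = len(A)
    assert n % p == 0, "matrix size must be divisible by the grid size"
    b = n // p
    a_blk, b_blk = split_blocks(A, p), split_blocks(B, p)
    # Initial skew: row i of A rotates left by i, column j of B rotates up by j.
    a_blk = [[a_blk[i][(j + i) % p] for j in range(p)] for i in range(p)]
    b_blk = [[b_blk[(i + j) % p][j] for j in range(p)] for i in range(p)]
    c_blk = [[[[0] * b for _ in range(b)] for _ in range(p)] for _ in range(p)]
    for _ in range(p):
        # Local multiply-accumulate on every node.
        for i in range(p):
            for j in range(p):
                c_blk[i][j] = add(c_blk[i][j], matmul(a_blk[i][j], b_blk[i][j]))
        # Nearest-neighbour shifts with wrap-around (the torus links).
        a_blk = [[a_blk[i][(j + 1) % p] for j in range(p)] for i in range(p)]
        b_blk = [[b_blk[(i + 1) % p][j] for j in range(p)] for i in range(p)]
    # Reassemble the block grid into one n x n result.
    return [[c_blk[i // b][j // b][i % b][j % b] for j in range(n)]
            for i in range(n)]
```

Every transfer in this scheme is between nearest neighbours, and the wrap-around links are what turn the 2D mesh into a torus; without them the blocks leaving one edge would have no path back to the opposite edge.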



#### Scaling-Up

#### High Level Metrics

- At 30 Gbps each for superconducting wires, TSVs & bumps  $\rightarrow$  intra-array & inter-array bandwidth: ~0.5 TBps/mm<sup>2</sup>
- At a JJ device dimension of 200 nm, a minimum metal pitch of 90 nm & a JSRAM density of ~4 MB/cm<sup>2</sup>  $\rightarrow$  AI compute density: ~17 TFlops/mm<sup>2</sup>
- At a cooling efficiency of 325 (4K regime) and 290 attojoules per JJ transition  $\rightarrow$  energy efficiency: >50 TOPs/Watt (~100x vs conventional systems)
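The headline numbers can be cross-checked with simple unit conversions. The sketch below is a back-of-the-envelope reading, not a calculation from the presentation. It assumes that the 290 aJ per JJ transition already includes the cooling overhead, that "TBps" means terabytes per second, and that "Gbps" means gigabits per second. The function names are hypothetical.

```python
ATTO = 1e-18


def energy_per_op_joules(tops_per_watt):
    """Energy budget per operation; 1 TOPS/W equals 1e12 operations per joule."""
    return 1.0 / (tops_per_watt * 1e12)


def jj_transitions_per_op(tops_per_watt, joules_per_transition):
    """How many JJ transitions fit in the per-operation energy budget."""
    return energy_per_op_joules(tops_per_watt) / joules_per_transition


def equivalent_links_per_mm2(bandwidth_tbytes_per_s, link_gbits_per_s):
    """Number of links per mm^2 needed to carry the stated areal bandwidth."""
    bits_per_s = bandwidth_tbytes_per_s * 1e12 * 8
    return bits_per_s / (link_gbits_per_s * 1e9)


if __name__ == "__main__":
    # Assumption: 290 aJ is the wall-plug energy per JJ transition.
    print(energy_per_op_joules(50))               # joules per operation
    print(jj_transitions_per_op(50, 290 * ATTO))  # transitions per operation
    print(equivalent_links_per_mm2(0.5, 30))      # 30 Gbps links per mm^2
```

Under these assumptions, 50 TOPs/Watt corresponds to 20 fJ per operation, or a budget of roughly 69 JJ transitions per operation, and ~0.5 TBps/mm<sup>2</sup> corresponds to about 133 links of 30 Gbps per mm<sup>2</sup>.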





## Conclusions & Future Outlook


- Path for realizing highly efficient Superconducting Array of Arrays for acceleration of next generation AI/ML algorithms like Transformers
  - Sneak peek at scaling towards supercomputing/post-exascale clusters: "A Datacenter in a shoebox", IEEE Spectrum, June 2024
- Towards high fidelity, calibrated and accurate simulations based on experimental data
- Connectivity to Main Memory (77K-regime), and Room Temperature links need to be investigated for a feasible appliance
- Utilizing open standards for the ISA and software stack to help wider community adoption
