# Hardware-friendliness of HyperPlonk

In this note we provide the hardware perspective on HyperPlonk. We focus on the main building block, the Multivariate SumCheck protocol and compare its computational and memory complexity to that of an NTT (Number Theoretic Transform).

## Background — HyperPlonk

Plonk is one of the most widely adopted SNARKs in the industry. In vanilla Plonk after arithmetization, the resulting execution trace is interpolated into univariate polynomials and thus the resulting protocol relies heavily on NTTs. HyperPlonk is a new adaptation of Plonk, where the execution trace is interpolated on a boolean hypercube. Thus the polynomial representation of the trace is a multivariate polynomial with linear degree in each variable. This is known as an MLE (MultiLinear Extension). A good overview of Hyperplonk can be found in Benedikt Bunz ZKSummit8 talk.

One key advantage of HyperPlonk is the elimination of large NTTs, a major computational bottleneck in Plonk over large-circuits. By moving to the boolean hypercube, we no longer need univariate polynomials. Instead, HyperPlonk relies on multivariate polynomial arithmetic. Section 3 in the Hyperplonk paper is devoted to developing the toolbox for working with multivariate polynomials. The figure below, taken from the paper, shows how HyperPlonk is built out of this toolbox. As can be seen, at the root of it all we have the classical SumCheck protocol, which is bound to become the main computational problem in HyperPlonk (polynomial commitments aside), replacing NTTs all together.

## A toy example

Consider an execution trace unrolled in the form of a vector. We illustrate the general idea using a vector of length 8, constituted of the polynomial evaluations {fᵢ}ᵢ⁷=₀’ where fᵢ ∈ F. We interpolate these values using a multivariate polynomial F(X₃,X₂,X₁) as follows

The elements in the first row are finite field elements, and the second row represents the interpolation idea. We interpolate the values using the Lagrange interpolation

where X and 1−X are Lagrange base polynomials defined in a binary field. This unique polynomial F(X₃,X₂,X₁) is known as the MultiLinear Extension of a vector. Let the sumcheck problem in this case be defined as follows

where C ∈ F. The protocol proceeds with the following steps, where in each round the prover computes and commits to a univariate polynomial (Linear), and receives a challenge from the verifier.

What we would like to estimate is the complexity of the prover in the evaluation of the polynomials in each round. The commitment of each of the linear univariate polynomials by the prover is not a computational bottleneck, since it is a simple EC addition or a small MSM.

After a short calculation we find that the structure of the polynomial at the end of round 1 is

It is obvious that

The prover algorithm in this case is presented graphically in the following figure

## General form

The general form of the algorithm is easily extrapolated from this simple example. We consider an execution trace of size T=2ⁿ (for simplicity we assume that the trace is always zero padded to a power of 2). This trace can be interpolated using a multivariate polynomial F(Xₙ,…,X₁). The prover algorithm is then as follows:

• In round 1 the prover computes the linear polynomials

and commits their sum

Which halves the number of required multiplications.

## Enter Fiat-Shamir

The Fiat-Shamir transformation takes an interactive public coin proof such as the sumcheck protocol, and converts it into a non-interactive proof. Instead of having the challenges αiαi be sent by the verifier, the transformation produces them at each round by computing a hash function of the previous transcript of the protocol. The transcript for which each hash function is calculated must include all of the verifiable results from the previous round, to avoid security vulnerabilities. An analysis of these vulnerabilities is presented here and in much more technical detail here. As a consequence, each round in the sumcheck protocol must be performed sequentially, with no parallelization of the calculations between rounds. As is shown below, this fact seriously reduces the ability to parallelize computations in HyperPlonk.

## SumCheck in Hardware

Based on the above presentation of non-interactive SumCheck, we can identify the following pattern for each round i:

Concretely, say we have 2³⁰ field elements. In order to progress to the next round with 2²⁹ elements, we must use (read from memory) all the 2³⁰ elements. We cannot proceed to the next round, with 2²⁸ elements, until we complete the round with the 2²⁹ elements. This hard stop means we need to return to the memory in each round.

So for example, for size 2³⁰ we must read and write from/to the memory 30 times. For small enough sizes, say 2¹⁸ this memory can be SRAM, which is fast and unlikely to form bottlenecks. However, for large SumCheck sizes, e.g. 2³⁰, which is what we expect to encounter in practice, there will be rounds that must work with the DRAM, i.e. DDR or HBM, which is slow and will become the bottleneck instead of the SumCheck computation. The figure below illustrates the data flow for this SumCheck.

Memory bottleneck such as the one we just described is what we usually want to avoid when designing hardware accelerators — simply because there is not much we can do to accelerate these types of computations. By its nature, the vanilla SumCheck round-structure is extremely serial, and therefore there is not much room for acceleration.

## NTT in Hardware

The obvious comparison here is to how NTT is handled in hardware. In this note we will not discuss NTT at the same level of depth we discussed the SumCheck. For the sake of comparison, we provide the following diagram:

NTT holds a good degree of parallelism such that multiple rounds can happen without returning to memory. Observe that for NTT we do not have a hard stop due to Fiat-Shamir — the output of one computation (partial result of a round) can immediately feed the input to a second computation (partial input of the next round). As a rule of thumb: We can compute a full NTT of size 1k elements with just one access to memory. This is 10x better than SumCheck! As an example — we can solve an NTT of size 1 trillion with only 3 round trips to memory, in this case reading and writing all elements.

This makes NTT computationally bottlenecked, and not memory bottlenecked. In fact, our implementations of NTT in GPUs and FPGAs managed to accelerate the NTT computation so much that for all relevant sizes even the compute is no longer a bottleneck! The next bottleneck in line, faced by these implementations is the PCIe bandwidth, used for communication with the host.

## Conclusions

After all — are we better off with Plonk and NTT rather than with HyperPlonk and MLE-SumCheck?

At least from the hardware point of view, NTT, while obviously not significantly parallel, is still way more hardware friendly than the MLE-SumCheck. We note that the same structure is found in FRI, for example, check out plonky2 FRI implementation. We assume that the problem with hardware acceleration of FRI was not raised before because FRIs in existing systems such as STARKs take only a small fraction of the overall computation, especially compared to the many, and much larger NTTs.

## Take this note with a grain of salt!

We focused here only on the “vanilla” SumCheck and its structure and did not address the solution space at all. The hardware acceleration of SumCheck is an understudied problem, especially compared to an NTT. It is likely that the techniques to improve the hardware-friendliness of the SumCheck exist or will be invented in the future: ranging from cryptography (Fiat-Shamir plays a role here), through new variants of SumCheck, and to ideas enabling parallelism in higher levels of the protocol design (some of them presented in the HyperPlonk paper, such as Batching).

Departing thought: Have we reached a point where ZK protocols must be co-designed with hardware in mind?

Thanks to Tomer, Kobi, Suyash, Miki, Karthik, Oren and Yuval