Revisiting Paradigm's “Hardware Acceleration for Zero Knowledge Proofs”

Intro

In April 2022, Paradigm.xyz released a blog post about ZK hardware acceleration. We revisit this piece to set the record straight on certain inaccuracies in its analysis, and to present new data points and arguments that enrich the original text, based on our experience working in the field.

Chapter 1: Why Do We Care

At the time of writing, almost 18 months after publication, the Paradigm blog post still carries a lot of weight. If you search Google for ZK hardware acceleration, you will still find it among the top results. This is not a testament to the quality of the work; it has more to do with the fact that the post is published under the Paradigm domain. This is simply how Google ranking works.

As our field is growing swiftly, it is important that newcomers begin their journey with the most updated, informative, and objective resources available.

Is our write-up long overdue? Yes, but better late than never.

Are we comfortable with writing this rebuttal? Absolutely! It’s probably the only thing we can talk about with confidence!


At Ingonyama we do ZK hardware acceleration

Let’s dive in!

Chapter 2: FPGA vs. GPU

Starting with an easy one: cost comparison.

The way Paradigm chose to make this comparison is by comparing the capital required for a single unit. Here is what they wrote in the blog:

There is no reference to where they got the 3x number from, but here is a counter-example. We compared:

Nvidia RTX 4090. The latest-generation gaming GPU, released in Q4 of 2022. This GPU is one of the best for ZK applications due to its high clock speed and disabled error correction.
Xilinx U55C FPGA. This FPGA belongs to the Alveo family and was used extensively in experiments and papers on ZK acceleration. Alveo is based on 16nm technology, and Xilinx already has a newer 7nm FPGA (Versal family) in mass production.

Here is the price history of the 4090:

taken from https://howmuch.one/product/average-nvidia-geforce-rtx-4090-24gb/price-history

And here is the product page for U55C, with a price tag of $4747

taken from https://www.xilinx.com/products/boards-and-kits/alveo/u55c.html , September 2023

As can be seen, and a bit ironically, the price is ~3x cheaper for a new 4090 GPU compared to the U55C.

There are some obvious disclaimers here:

There are other devices with different specs at other points on the pricing spectrum. An example supporting Paradigm's argument would be comparing the Nvidia H100 to the Xilinx C1100, if you really look for the most expensive GPU vs. the cheapest FPGA, but the two are not comparable in performance.
A discounted price can often be negotiated.
Prices fluctuate over time.

Bottom line: saying that a GPU costs 3x more than an FPGA is too general a statement and is not backed by data.

Power efficiency

First, we need to define what power efficiency is. Such a definition is missing from the original blog post and while general definitions can be found on the internet, we posit that there should be a standard for ZK hardware.

Inspired by standards for other domains, such as AI, we suggest the following definition:

Power efficiency = Power / (381-bit modular multiplications in a second)

Or in short form, W/GMM381, where the G stands for Giga (billion) operations. The lower this number is, the better the power efficiency of the device under test. We would like to see GMM381 become a standard, the way TOPS and FLOPS are in their own domains, but that is a task for another day.

One derivative of our definition is the GMM381 for X Watts (e.g. X=75 for PCIe card with passive cooling), which can be another way to compare apples to apples across different devices.
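As a minimal sketch of how the metric above could be computed from a benchmark run, assume we measure the average board power in watts and count the 381-bit modular multiplications completed per second. The function names and the sample figures below are hypothetical and chosen only for illustration; they are not measurements of any real device.

```python
GIGA = 1e9  # the "G" in GMM381: one billion operations


def watts_per_gmm381(avg_power_w: float, modmuls_per_sec: float) -> float:
    """W/GMM381: watts per billion 381-bit modular multiplications per second.

    Lower is better.
    """
    if avg_power_w <= 0 or modmuls_per_sec <= 0:
        raise ValueError("power and throughput must be positive")
    return avg_power_w / (modmuls_per_sec / GIGA)


def gmm381_at_power(power_budget_w: float, w_per_gmm381: float) -> float:
    """Derived metric: GMM381 throughput available within a power budget.

    Assumes efficiency stays constant across the power range, which is a
    simplification; real devices do not scale linearly with power.
    """
    if power_budget_w <= 0 or w_per_gmm381 <= 0:
        raise ValueError("power budget and efficiency must be positive")
    return power_budget_w / w_per_gmm381


# Hypothetical device: 300 W average draw, 150 billion modmuls per second.
efficiency = watts_per_gmm381(300.0, 150e9)
print(f"{efficiency:.2f} W/GMM381")
print(f"{gmm381_at_power(75.0, efficiency):.1f} GMM381 at 75 W")
```

Reporting both numbers for every device makes it possible to compare, say, a passively cooled 75 W PCIe card against a much larger board on equal terms.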

We’ve been running ZK on different generations and models of GPUs and FPGAs for quite some time now. Here is a quick sample comparison:

In blue we have Nvidia GPUs, and in yellow FPGAs. Lower W/GMM381 is better

The table teaches us that GPUs and FPGAs are competitive when it comes to power efficiency. For those interested, we maintain a PostgreSQL database with all our results across different devices and primitives. Ask for it on our Discord!

Here is what the original blog post says about the power efficiency comparison:

Putting aside the fact that FPGAs require a host device to operate as well, let's focus on the more important part: the 10x difference in power efficiency, contradicting our observation from above.

When we click on the >10x link, we are redirected to a whitepaper authored by Xilinx in 2017 (!) titled: Xilinx All Programmable Devices: A Superior Platform for Compute-Intensive Systems which is used as a reference.

However, we couldn’t find the 10x claim in the paper. The findings of the paper are summarized in Table 1, which we copied here for convenience:

We note that power efficiency is measured here as the inverse of how we measure it, so larger numbers are better. Yet the differences all seem to be within the same order of magnitude. It is also worth pointing out that, as footnote 1 in the table suggests, getting to 90% device utilization is basically impossible. It might still be that, as the whitepaper concludes, FPGAs provide best-in-class power efficiency for certain applications in deep learning, but it is very hard to conclude from this that ZK applications will be >10x more power efficient on FPGAs than on GPUs.

Before we move on to the next chapter, as part of the conclusion section, Paradigm added a second reference to the blog to show that cost and power efficiency are preferable for FPGAs. The link takes us to another blog, titled: FPGA vs GPU for Machine Learning Applications: Which one is better?

This one has no date on it, but according to the Internet Archive, it dates back to March 2020. As the name suggests, the write-up focuses strictly on machine learning. Costs are nowhere to be seen, and in general the blog takes a balanced approach, comparing the advantages and disadvantages of each platform as shown in various studies. The closest thing we found to a quantitative discussion of power efficiency is again very far from what the Paradigm blog presents:

copied from FPGA vs GPU for Machine Learning Applications: Which one is better?

Chapter 3: FPGA vs. ASIC

To better understand why Paradigm “expect that the winning players in the market will be companies that focus on FPGAs over ASICs” we need to take one step back to understand the logic behind it.

This goes all the way back to how we design ZK accelerators. The original blog correctly identifies MSM and NTT as two of the main math problems we have to accelerate in order to accelerate a ZK prover. However, Paradigm stops here, missing a few key details in its analysis:

First is Amdahl’s Law, which states that “the overall performance improvement gained by optimizing a single part of a system is limited by the fraction of time that the improved part is actually used”. Let's explain via a simple example: say that 80% of the ZK proof generation time is spent on NTTs and MSMs. Even if those were made infinitely fast, the theoretical upper bound on the acceleration would be 1 / (1 - 0.8) = 5x, since 20% of the prover time remains unoptimized. That is not really where we want to get with ZK acceleration.
Second, besides MSM and NTT, there are other extremely important primitives. In STARKs, for example, hashing is the main bottleneck.
Third, the 5x boost from the first point can be achieved only in theory, if all the MSMs and NTTs were ordered in a convenient way, requiring no data movement, no reads and writes from memory, and so on. In practice, compute is often not even the bottleneck. By analogy, the same is true of matrix multiplication in AI.
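The Amdahl's Law arithmetic from the first point can be checked with a short sketch. The function names are ours, and the 80% fraction is the illustrative figure used in the text, not a measurement of any particular prover.

```python
import math


def amdahl_speedup(accelerated_fraction: float, kernel_speedup: float) -> float:
    """End-to-end speedup when a fraction of the runtime is accelerated.

    accelerated_fraction: share of the original runtime spent in the
        accelerated kernels (e.g. MSM and NTT), between 0 and 1.
    kernel_speedup: how many times faster those kernels become.
    """
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be in [0, 1]")
    if kernel_speedup <= 0:
        raise ValueError("kernel_speedup must be positive")
    untouched = 1.0 - accelerated_fraction
    return 1.0 / (untouched + accelerated_fraction / kernel_speedup)


def amdahl_upper_bound(accelerated_fraction: float) -> float:
    """Limit of the speedup as the kernels become infinitely fast."""
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be in [0, 1]")
    if accelerated_fraction == 1.0:
        return math.inf
    return 1.0 / (1.0 - accelerated_fraction)


# With 80% of prover time in MSM/NTT, even huge kernel speedups
# leave the end-to-end gain below 5x.
for kernel_speedup in (10, 100, 1000):
    print(kernel_speedup, round(amdahl_speedup(0.8, kernel_speedup), 2))
print("upper bound:", round(amdahl_upper_bound(0.8), 2))
```

Note that this model ignores data movement entirely, so per the third point the real end-to-end gain can fall well short of even these figures.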

In Paradigm’s favor, we will say that back in April 2022 these insights were far from common knowledge. In fact, only well into 2023 did we identify a movement towards a different type of design, one which aims to (A) accelerate ZK end-to-end and (B) be flexible enough to support multiple ZK provers.

Here is Cysic learning the hard way, July 2023

ZPU

The original blog cites PipeZK as a hardware design that best addresses the challenges in both MSM and NTT computation. As an aside, the PipeZK architecture was actually simulated as an ASIC rather than implemented on an FPGA. More importantly:

PipeZK design is not end-to-end, e.g. MSM G2 is done in the CPU
PipeZK works exclusively on a narrow set of proving systems and parameters, e.g. one must fix the field size

Recently, we shared some data points on our ZPU ASIC design, which adheres to the principle of programmability in order to support different parameters and protocols in the ever-changing landscape of ZK proving. Our analysis shows that the performance hit taken by moving to this type of architecture, compared to PipeZK's fixed architecture, is minor. What we gain, however, is huge: an ASIC that, with the appropriate SDK, can be programmed to support any ZK protocol. Kinda like a GPU but for ZK, hence ZPU.

With this context in mind, we now have the necessary background to analyze Paradigm's claim that FPGA>ASIC.

First the argument of write-once (ASIC) vs. write-multiple-times (FPGA):

It is clear why this may be a disadvantage if we consider only a fixed architecture design, as in PipeZK. However, when we think of a programmable ASIC such as a ZPU or a GPU, it might in fact be easier to program than an FPGA (ease of programmability is a topic that requires a blog post of its own!).

The second point raised by Paradigm is about a “healthier supply chain”, see the excerpt below:

Now this is subtle, so let’s break it down.

First, time-to-market differs by an order of magnitude, sure. The second point is about scaling: here we would say that if you already have an ASIC, getting more is not necessarily harder than getting more FPGAs. In both cases someone needs to produce the chips, and if there is inventory it can be a simple matter of purchasing.

What is mostly lacking in Paradigm's analysis is a mention of the advantages of ASICs. Above all, the precious power efficiency discussed earlier is far better for an ASIC than for an FPGA. The Paradigm blog does not mention this, and we assume it was dismissed because it simply does not make sense to bet on an ASIC for a specific prover protocol and parameters while the space is evolving at a crazy pace, second only to AI, and unlike what we have with, say, Bitcoin. This is true for the fixed architecture design, but not so much for the programmable ASIC approach. If we have learned anything from AI ASICs, it is that dedicated chips do not always win over GPUs and FPGAs in an ever-changing market; however, there are always opportunities.

Chapter 4: Market Size / Conclusions

In both the introduction and conclusion, Paradigm hints at some equivalence between proof of work (PoW) mining and ZK proving in terms of market economics and market size. Are they right? The answer is that this is still unclear, even today, and a topic for active research.

This is also the lens through which we hope the reader will read our blog post. Paradigm made a great contribution by raising awareness of ZK hardware, and it is reasonable that 18 months after publication, with enough data points collected, we are able to offer an updated snapshot of where things stand. (That said, if you find any flaw in our logic, please don’t wait another 18 months and just ping us directly :) )

In this blog, we are not trying to counter Paradigm's equations: FPGA>ASIC and FPGA>10GPU for ZK. On the contrary, we’ve been working on FPGAs for ZK applications since inception and certainly acknowledge their benefits compared to GPUs and ASICs. Our goal was to provide an updated and more data-driven discussion, which perhaps as an industry we were not yet ready for back in April 2022. Finally, we hope to see something like GMM256 or similar take off as a standard; we are looking forward to community feedback.
