Originally published in ZKProof Blog
Zero Knowledge Proofs (ZKPs) are on the verge of going mainstream. Amid these growing pains, and with an eye on more mature technologies that have undergone this transition, we recognize the importance of having a clear framework for comparing different ZK tech stacks. As we demonstrate below, this remains a complex open question. We argue that a good first step would be to use ZK Score: upper-bounding the proofs-per-Joule at the hardware level.
The Big Sister: How We Benchmark AI
Some inspiration can be taken from how things have been done in AI. In 2023, AI is a much more mature technology than ZK. Because the compute used in AI training is outpacing Moore’s Law, AI advances as computing advances; the performance of AI systems is therefore heavily dependent on hardware acceleration. The contenders in the AI benchmarking race are hardware appliances and the accompanying software to run them. Often, the same chips (e.g. the Nvidia H100) are used in multiple distinct appliances and end up competing with one another.
The most widely known benchmarking tool is a suite of tests called MLPerf, developed by MLCommons (a consortium of AI leaders from academia, research labs, and industry). MLPerf is a well-maintained project that has evolved over time. The most recent 2023 edition measured Large Language Model (LLM) performance for the first time.
We assume that devising a fair, easily reproducible, and fast suite of benchmarking tests would be at least as hard for ZK as it was for AI. Let’s consider LLM training benchmarks for a moment. Reproducing them might cost millions of dollars, so clever measurement techniques had to be invented. Check out this awesome quote from the MLCommons philosophy page:
“ML and artificial intelligence have been around for decades, but even today the technology is fragmented, bespoke, and poorly understood. We believe that we can unlock the next stage of AI/ML adoption by creating useful measures of quality and performance.”
MLPerf is a competition composed of multiple tests, with a few tracks allowing for full coverage and various ways to compare ML systems.
- The basic unit of measurement is throughput: how many samples or queries can be processed per unit of time (second). The choice of samples vs. queries depends on two scenarios: offline vs. querying a server.
- There is a power measurement track, which wraps the throughput measurement.
- Submitters can also submit for a network track.
- The most popular track is for an apples-to-apples comparison (called “closed”), in which the same model and optimizations must be used. There is also an “open” track that allows teams to run whatever they want, as long as they get a correct result (above a threshold, or as otherwise defined).
A few takeaways. First, becoming an industry standard required MLPerf to gain support from dozens of organizations: startups small and large, universities, and giant corporations; as well as a large diversity of stakeholders: cloud computing platforms, semiconductor companies, researchers, and app developers. Second, MLPerf started in 2018 and only recently introduced its eighth edition, version 3.1. Remember that AI in scaled products is barely 10 years old. MLPerf reminds us that continuous effort and dedication are needed to maintain “industry standard Machine Learning (ML) system performance benchmarking in an architecture-neutral, representative, and reproducible manner”. The field keeps moving forward, and so do its benchmarking tools.
One difference to note when making an analogy to ZK is that in ZK the output is a verifiable computation. It is a binary outcome: either we achieve the required properties (e.g. soundness) for some security parameter, or we don’t. With AI, we build intelligence: we can measure how good a system is by how well it performs a certain task. For instance, depending on the goal, the task may be classification or a specialized holistic evaluation of language models. In a way, this makes our job easier, assuming we can normalize security strength.
Intro to ZK Hardware Performance
Today, most ZK provers run on CPUs. GPUs, FPGAs, and ASICs are all legitimate targets for ZK proof computation. We recommend watching this excellent talk to learn more about the differences between these devices.
Unfortunately, the hardware community within our own industry still struggles to align with established norms and standards. SuperScalar, a recent blog post about Aleo provers, illustrates this point. Aleo is a blockchain that incorporates privacy by design using ZKPs. It also employs ZKPs at the consensus level in a mining-like fashion. Simply put, the higher your proof throughput, the higher your probability of mining success and of subsequently reaping the rewards.
In the aforementioned blog post, the new K10 miner based on FPGA technology (referred to as the K10 miner going forward) is compared to the Nvidia RTX 3090 GPU (referred to as the 3090 GPU going forward).
The first question that comes to mind is, what is so special about the 3090 GPU? Why not compare the new K10 miner to the Nvidia RTX 4090 GPU, which is a newer and better GPU? The second question is, does it actually make sense to compare throughput alone? If throughput is the comparison metric, why not take a CPU host, connect it to 4×3090 GPUs for a total throughput of 28K PPS, and call it K12? A real-world FPGA ZK prover will look similar to GPU provers, with multiple PCIe cards plugged into a host CPU.
Before proceeding further, we’d like to briefly introduce an important term in hardware: total cost of ownership (TCO). Defined loosely, TCO is the cost of buying the device (CAPEX), plus the cost of operating it over its lifetime (OPEX), minus the revenue from selling it at the end of its lifetime.
TCO = CAPEX + OPEX – resale_price
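This loose definition can be sketched as a one-line helper; the numbers in the usage below are purely illustrative, not quotes for any real device.

```python
def total_cost_of_ownership(capex, opex, resale_price):
    """TCO = CAPEX + OPEX - resale_price (the loose definition above)."""
    return capex + opex - resale_price

# Hypothetical appliance: bought for $4,500, consumes $300/year of
# electricity over a 3-year lifetime, sold for $1,000 at end of life.
tco = total_cost_of_ownership(capex=4500, opex=300 * 3, resale_price=1000)
print(tco)  # 4400
```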
Returning to our example of the GPU-based vs. FPGA-based Aleo miners, we initially aim to equate the TCO of both appliances before comparing their performance. The cost of a single K10 miner is USD 4500 (based on a query to the supplier), and the cost of a single 3090 GPU is between USD 500 (used) and USD 700 (new), according to howmuch.one. So, one can buy roughly six to nine 3090 GPUs for the price of one K10 miner.
With respect to operational expenditure, we can assume the cost of electricity is fixed and the same for both. The SuperScalar post states that the K10 miner’s power consumption is 1200 Watts. Note that this number is application-dependent. The 3090 GPU’s thermal design power (TDP) is 350 Watts according to its technical specifications. That doesn’t mean this is the power that will be consumed when we implement the same application, but for the sake of this analysis let’s assume the worst-case scenario, i.e. that max power is needed.
Now, we can normalize by power: how many Joules (a Watt [W] is defined as one Joule per second) does it take to generate one proof?
For a K10 miner, we calculate: 24000 PPS/1200W = 20 proofs/Joule
For a 3090 GPU, we calculate: 7000 PPS/350W = 20 proofs/Joule
These calculations show that the K10 miner has no advantage over the 3090 GPU. In other words, since the energy cost per proof is the same for both, we pay the same amount to generate each proof, no matter how many K10 miners or 3090 GPUs we have.
Fun fact: for the newer Nvidia RTX 4090 GPU, which costs USD 1500 (at the time this was written), we calculate 16000 PPS/450W > 35 proofs/Joule. So many more proofs for your dollar!
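The per-proof energy comparison above reduces to a single division; here is a minimal sketch reproducing the three figures discussed (PPS and Watt numbers taken from the text above):

```python
def proofs_per_joule(pps, watts):
    # A Watt is one Joule per second, so (proofs/s) / (Joules/s) = proofs/Joule.
    return pps / watts

print(proofs_per_joule(24000, 1200))  # K10 miner: 20.0
print(proofs_per_joule(7000, 350))    # 3090 GPU:  20.0
print(proofs_per_joule(16000, 450))   # 4090 GPU:  ~35.6
```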
Finally, we take the resale price into consideration. Here, we can assume that the 3090 GPU will retain more of its value. We have concrete data showing that a used, one-year-old 3090 GPU has depreciated very little (comparing the October 2022 price to the October 2023 price), and we know that a 3090 GPU can easily be repurposed for gaming, science, and so on. The K10 miner, on the other hand, is a dedicated mining machine, and unfortunately we don’t have any data on its resale price. For this reason, it is untenable to compare resale values. Of course, relying on historical data is nothing like projecting a price into the future, which is what computing TCO requires.
Does this mean the 3090 GPU is better or worse than the K10 miner? Hard to tell, because TCO is involved and difficult to normalize. For example, what if one appliance is physically much bigger than the other? Rack real estate in a data center differs between them and thus has an impact on cost and TCO. It doesn’t end there, either; several other factors need to be taken into consideration, including but not limited to power infrastructure, maintenance, and reliability. Our point is that we are still a long way from a gold standard for measuring ZK performance in hardware products. Enter ZK Score.
Part 1: ZK Score
The primary value of ZK Score lies in its simplicity. Currently, the ZK space is deeply involved in middleware and infrastructure R&D, with very little happening at the application layer. Continuing our Aleo example: the foundation develops a prover to run applications, but we don’t yet know which applications will be killer applications. The same is true for ZKML. We know which ML we want to use, and the infrastructure keeps improving, but no real use cases are running at scale yet. This is why we think that, for the time being and as a first step, we should focus on infrastructure benchmarks, starting with hardware.
Already at this early stage, we can talk in terms of setups and systems that are designed to run ZK provers efficiently for a variety of use cases. Case in point: the K10 miner is an FPGA-based system that can support a limited set of ZK protocols. The Nvidia DGX is an off-the-shelf appliance with 8×H100 cards that can be programmed to run any ZK protocol, with varying efficiency. And we can take an AWS F1 instance (FPGA), connect it via a local area network (LAN) to another instance with an Nvidia T4 GPU, and call that a third system. Obviously, we can expect each system to perform differently, but how do we rank them against one another?
Understanding which system is more efficient can direct our efforts to improve ZK infrastructure in the years to come. For instance, if the K10 miner has the best ZK Score, we would hope to see more work focused on maximizing the potential of FPGAs: more middleware for the device, experimentation with K10 miner variants, and integration with more applications.
The most straightforward ZK Score is throughput/power, which is also best practice in AI benchmarking. The difference is that AI measures the throughput of end-to-end samples; the ZK equivalent would be to measure throughput in proofs-per-second. However, as we will show later, this adds complexity that makes ranking more troublesome.
As long as we don’t have a standardized suite of tests, the best we can achieve while introducing minimal bias is to directly measure ZK-related operations on the hardware itself. Even then, we must distinguish between high-level and low-level primitives, and lean towards the lower-level ones to reduce complexity. What we lose in this tradeoff is accuracy: we are left with an approximation of an upper bound on the throughput the system can achieve.
Low-level Primitives for ZK Hardware
We now take a more concrete look at ZK Score candidates. The arithmetic used in ZKPs (as in many other places in cryptography) is modular arithmetic, and modular multiplication is used more frequently than modular addition. Arithmetic on more complex primitives, specifically the elliptic curves used in some popular ZK protocols, can be broken down into a number of modular multiplications and additions. In fact, just as with any arithmetic, addition and multiplication form a complete functional language that can express any ZK protocol. There are two popular algorithms for modular multiplication: Barrett and Montgomery.
Together, they cover the majority of existing implementations. These algorithms have existed since the 1980s and have had decades of battle testing. Some active research is still happening around modular multipliers, but the results usually end up as extensions or variants of Barrett or Montgomery. Because these algorithms are so mature, it is safe to assume that further optimization at this layer is unlikely.
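As a concrete illustration (a textbook sketch, not the implementation used by any particular prover), here is Montgomery multiplication in Python. Real provers implement this with fixed-width limbs in hardware, but the algebra is the same:

```python
def montgomery_params(n, k):
    """Precompute for an odd modulus n and R = 2^k > n."""
    R = 1 << k
    n_prime = pow(-n, -1, R)  # n' such that n * n' ≡ -1 (mod R)
    return R, n_prime

def redc(T, n, k, n_prime):
    """Montgomery reduction: T * R^-1 mod n, using only shifts and masks."""
    mask = (1 << k) - 1
    m = ((T & mask) * n_prime) & mask
    t = (T + m * n) >> k
    return t - n if t >= n else t

def mont_mul(a, b, n, k, n_prime):
    """Multiply a, b given in Montgomery form; result stays in Montgomery form."""
    return redc(a * b, n, k, n_prime)

# Tiny example: modulus 97, R = 2^8.
p, k = 97, 8
R, p_prime = montgomery_params(p, k)
a_m, b_m = (5 * R) % p, (7 * R) % p                 # 5 and 7 in Montgomery form
product = redc(mont_mul(a_m, b_m, p, k, p_prime), p, k, p_prime)  # back to normal form
print(product)  # 35, i.e. (5 * 7) % 97
```

Note how the reduction avoids any division by the modulus itself; this is exactly what makes the algorithm attractive for hardware pipelines.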
One layer above, we have elliptic curve (EC) arithmetic, ranging from basic point addition to vector operations such as multiscalar multiplication (MSM). EC cryptography entered wide use in the early 2000s and has received growing attention ever since, including in blockchain cryptography. It is a broader domain than modular arithmetic: an EC group can belong to one of several families that share common properties, there are multiple formulations for EC point addition, and even representing an EC point can be done in various ways. EC cryptography acceleration is also a newer field; hardware algorithms for MSM are still being explored (see ZPRIZE). From here on up, higher layers are newer and leave an extensive design space for optimizations.
Our suggestion for the most accurate apples-to-apples comparison is to rely on modular multiplications (mod mults) as the unit we measure for throughput. Note that this behaves as a kind of upper bound: if a system under test can produce X mod mults per second, and an EC addition needs, say, 5 mod mults, then the ideal EC addition throughput is upper-bounded by X divided by 5. The same rule applies as we go up the stack.
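The bounding rule composes mechanically up the stack. A small sketch, keeping the same illustrative “5 mod mults per EC addition” figure from the paragraph above (the MSM cost of one million EC additions is likewise an assumption for illustration only):

```python
def bounded_throughput(mod_mults_per_sec, cost_in_mod_mults):
    """Ideal throughput of a higher-level op, bounded by mod-mult throughput."""
    return mod_mults_per_sec / cost_in_mod_mults

X = 1_000_000_000                                   # measured: 1G mod mults/sec (illustrative)
ec_add = bounded_throughput(X, 5)                   # EC additions/sec, at 5 mod mults each
msm = bounded_throughput(X, 5 * 1_000_000)          # MSMs/sec, assuming ~1M EC adds per MSM
print(ec_add, msm)  # 200000000.0 200.0
```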
Fields, Not a Field
Although we zoomed in on modular multiplication operations per second (MMOPS), a finer resolution is still needed. ZK systems can be implemented over different fields, depending on several factors and protocol requirements. Benchmarking MMOPS over a 31-bit or 64-bit field yields different results than over a 256-bit or 384-bit field. There is likely no middle ground here, so the best approach is to run benchmarks for all bit sizes of interest. Perhaps in the future there will be some clever way to weight these fields in order to produce an aggregate ZK Score.
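A rough way to see the field-size effect is to time modular multiplications at several widths. The moduli below are placeholders (2^k − 1 is generally not a field prime), and Python-bigint timings say nothing about dedicated hardware; this only illustrates the shape of a per-bit-size MMOPS report:

```python
import random
import time

def approx_mmops(bits, iters=20_000):
    """Crude MMOPS estimate for a given bit-width (placeholder modulus)."""
    p = (1 << bits) - 1  # NOT a real field prime; stands in for one of this width
    xs = [random.randrange(1, p) for _ in range(iters)]
    ys = [random.randrange(1, p) for _ in range(iters)]
    start = time.perf_counter()
    for x, y in zip(xs, ys):
        _ = (x * y) % p
    return iters / (time.perf_counter() - start)

for bits in (31, 64, 256, 384):
    print(f"{bits:>4}-bit: ~{approx_mmops(bits):,.0f} mod mults/sec")
```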
Normalizing by Power
As we saw in the TCO discussion, several elements must be scrutinized before diverse systems can be viewed as equals. Of the three main elements that comprise TCO, only the OPEX is measurable; CAPEX and resale prices can differ significantly for the same device and are hard to quantify or compare. The closest approximation of OPEX is the system’s electricity bill while it runs the workload it was designed to run. That is to say, we need to measure power consumption over the same unit of time, and its unit of measurement is Watts. As a result, we end up with a ZK Score definition of:
MMOPS/Watt == MMOP/Joule
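With both measurements in hand, ranking systems by ZK Score is then a one-liner; the system names and numbers here are hypothetical:

```python
def zk_score(mmops, watts):
    """MMOPS/Watt, i.e. modular multiplications per Joule."""
    return mmops / watts

# Hypothetical measurements: (name, measured MMOPS, measured Watts)
systems = [
    ("fpga-rig", 9.0e9, 1200),
    ("gpu-box", 4.2e9, 450),
    ("cpu-host", 0.4e9, 150),
]
ranked = sorted(systems, key=lambda s: zk_score(s[1], s[2]), reverse=True)
for name, mmops, watts in ranked:
    print(f"{name}: {zk_score(mmops, watts):,.0f} mod mults/Joule")
```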
As the ZK landscape continues to grow, the call for reliable benchmarking standards is clear. The evolution of ZKPs is a moving target full of intricacies, which makes apples-to-apples measurement challenging. Luckily, we can draw inspiration from how the AI community addressed this problem. With that in mind, we recommend ZK Score as a first step in tackling the problem of apples-to-apples comparison. Driven by simplicity and determinism, ZK Score lets us evaluate and compare ZK systems using an upper bound based on the throughput of modular multiplications a system can run over a given time frame, normalized by power consumption. In Part 2, we discuss existing alternatives to ZK Score.
Part 2: ZKP Comparison Method Pitfalls
To demonstrate the challenges of quantitatively analyzing and measuring ZKP performance, we cite some popular methods used in the literature and discuss the disadvantages of each.
Complexity Theory. Some ZK papers limit the comparison to the theoretical level. The motivations are clear. Complexity theory provides asymptotic performance, typically using big O notation to indicate how metrics, such as prover runtime, grow with input size. This form of comparison is very clean and allows for immediate conclusions: if one prover takes O(N) and the other takes O(N log N), it is obvious that the O(N) prover is better. There are a couple of issues with relying strictly on this approach; the root cause of both is that complexity theory does not reflect the real world well:
- When we compare two protocols with the same asymptotic complexity, there is no accounting for the constants that determine which one is better. For example, say each group element operation costs three field element operations, so the ratio between the first and second rows in the figure below is 3. But what if the actual cost of the first row, removing the big O notation, is N group operations (i.e. 3N field operations), while the actual cost of the second row is 50N field operations? Without this information, there is no way to tell which one is better in practice.
- ZK engineering is messy, and the complexity-theory view omits relevant implementation details. Sometimes it’s something big, like memory requirements (not to be confused with proof size; here we are referring to how much memory proof generation uses during computation). Other times it is some other hidden cost. Consider a figure from Lasso, and note the sentence: “In addition to the reported O(N) field operations, Hyrax and Dory require roughly O(N^½) cryptographic work to compute evaluation proofs.” Engineering is the art of trade-offs. When we implement a protocol, we are forced to make multiple decisions based on system requirements, available hardware, programming language, and so on. These things need to be taken into account.
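The constants argument in the first bullet is easy to make concrete. Under the purely illustrative assumption that one group operation costs three field operations, two provers that are both O(N) can still differ by a large constant factor:

```python
# Both provers are O(N) asymptotically; hidden constants decide the winner.
GROUP_TO_FIELD = 3  # assumed ratio: 1 group op ≈ 3 field ops (illustrative)

def prover_a_field_ops(N):
    return N * GROUP_TO_FIELD   # N group ops → 3N field ops

def prover_b_field_ops(N):
    return 50 * N               # 50N field ops

N = 2**20
print(prover_b_field_ops(N) / prover_a_field_ops(N))  # ≈ 16.7x gap at every N
```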
Adding PoC Implementation. A few papers take the additional step of implementing their ZK protocol. The system setup is described in the paper, and most times targets available hardware. There are two issues with this approach. First, there is the question of which system or protocol to compare against. Ideally, we’d want to compare one implementation to all other implementations. But in practice, the design space for proof generation today is so diverse that it is almost never an apples-to-apples comparison. Examining the table from the Hyperplonk paper below, we see that Hyperplonk is juxtaposed with Ark-Spartan. Yet the two systems use entirely dissimilar arithmetization techniques, namely Plonk+ and R1CS. Taking the second-to-last application (rollup), the R1CS circuit has 32X as many constraints as the Plonk+ circuit. Hence, the meaning of the prover run-time measurement is unclear, because the input test vectors differ in size. The second problem is the selection of applications. Looking at the last line, the zkEVM application was never expressed in R1CS format and therefore cannot be measured. By analogy to AI, this is like comparing a classification model with an LLM: in theory both can run a classification application, but one was not designed for it while the other is optimized for it. We expect different protocols to be optimized for, and suitable for, different applications, which makes it hard to justify comparing two protocols built for different applications.
End-to-End Benchmarking Tools. Recent efforts like EF zkalc, The Pantheon of ZKP, and ZK bench approach the problem from the direction exactly opposite to complexity theory: they try to run all existing ZK frameworks against each other, implementing any missing pieces. Through this engineering-oriented approach, they aim to understand the performance of various protocols with the input size as a variable. There are a couple of issues here as well. First, the implemented circuits might not be fully optimized, primarily because ZK DSLs are still in their infancy. Second, as in the previous case, it is hard to draw meaningful conclusions from the results; the graphs below make it seem as if all of the frameworks deliver roughly the same performance.
Although complexity theory, implementation data, and end-to-end benchmarking tools each attempt to codify a way to compare ZKPs, they all have pitfalls. Our motivation for writing this post isn’t to cast aspersions, but to catalyze the ZK community to formulate methods and standards, possibly resembling AI’s, that reflect ZKP performance more accurately. The end goal is a fair, reproducible, and fast way to conduct real-world comparative analyses. We submit ZK Score for your deliberation as a potential first step toward that goal.