TL;DR:
- We demonstrate integration of our Cloud-ZK FPGA acceleration by running Arkworks Groth16 over the BN254 curve
- This work can easily be extended to support other primitives in the Arkworks framework and various system/hardware configurations
- This is joint work with Geometry Research
Background
Cloud-ZK is our toolkit for developing ZKP acceleration using F1 FPGA instances on AWS. It currently supports Multi-Scalar Multiplication (MSM) acceleration based on algorithms from our pipeMSM paper. Number Theoretic Transform (NTT) support is in progress.
Cloud-ZK was designed to bring FPGA acceleration into ZK development frameworks without adding dev-ops overhead or capital requirements. In other words, the developer experience is exactly the same as running a regular CPU instance on AWS.
For relevant circuit sizes in common ZK workloads, the required MSMs turn out to be the main computational bottleneck. When we released Cloud-ZK, we estimated that MSM acceleration alone could significantly improve the resource allocation of a ZKP system and deliver meaningful latency and throughput gains. This work confirms that hypothesis.
Arkworks is an ecosystem for zkSNARK programming, written in Rust. The framework covers both the frontend, i.e. circuit building, arithmetization, and constraint generation, and the backend, i.e. the cryptographic constructions used for proof generation. Concretely, it is a collection of utility repositories for algebra, field, and polynomial arithmetic, plus repositories dedicated to higher-level proving protocols such as Groth16, Marlin, and Gemini. Arkworks-rs supplies one of the most complete toolchains for developers building ZKP applications. The Arkworks libraries are actively maintained and adopted by leading blockchain projects using ZK.
Arkworks did not support hardware acceleration… Until now :)
Anatomy of Groth16 — HW Challenges
We chose to focus on the Groth16 proving system over the BN254 elliptic curve. This choice is “Ethereum friendly”, as proofs can be efficiently verified on the Ethereum blockchain. The complexity of the prover is dominated by MSMs and NTTs. To be more specific: for a given witness of size m, with n = 2ᶜ the witness size rounded up to the nearest power of 2, the prover computes 1 G1 MSM of size n, 3 G1 MSMs and 1 G2 MSM all of (essentially) the witness size, and 7 NTTs of size n. NTT complexity grows as n log n, while MSM complexity is linear in n, so the NTTs become relatively more dominant as n increases.
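Spelled out, the per-proof operation count described above is roughly the following (the exact constants vary slightly between implementations; MSM and NTT denote the cost of one such operation at the given size):

```latex
% Rough Groth16 prover cost over BN254, with m = witness size and
% n = 2^{\lceil \log_2 m \rceil} as above.
\underbrace{\mathrm{MSM}_{\mathbb{G}_1}(n) + 3\,\mathrm{MSM}_{\mathbb{G}_1}(m) + \mathrm{MSM}_{\mathbb{G}_2}(m)}_{O(n)\ \text{group operations}}
\;+\;
\underbrace{7\,\mathrm{NTT}(n)}_{O(n \log n)\ \text{field operations}}
```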
The curve points in G2 are defined over a quadratic extension of the base field, completely analogous to how the complex numbers are a quadratic extension of the real numbers. Karatsuba multiplication of two elements of the quadratic extension field takes a minimum of three base-field multiplications, plus a handful of additions and subtractions. Thus the computational complexity of a single G2 elliptic-curve operation is at least three times that of a G1 operation. Note that G2 also doubles the storage needed per coordinate used to represent the curve points.
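To see where the roughly 3x factor comes from, here is a minimal, self-contained sketch of Karatsuba-style multiplication in a quadratic extension Fp[u]/(u² − β). It uses a toy modulus for readability; it is not the Arkworks Fq2 implementation, where the elements are 254-bit values in Montgomery form.

```rust
// Karatsuba-style multiplication in Fp[u]/(u^2 - beta): three base-field
// multiplications (v0, v1, v2) plus a handful of additions/subtractions.
// Toy sketch only: assumes a small modulus p so u128 arithmetic never overflows.
fn fp2_mul(a: (u128, u128), b: (u128, u128), beta: u128, p: u128) -> (u128, u128) {
    let (a0, a1) = a;
    let (b0, b1) = b;
    let v0 = a0 * b0 % p;                            // base-field mult 1
    let v1 = a1 * b1 % p;                            // base-field mult 2
    let v2 = ((a0 + a1) % p) * ((b0 + b1) % p) % p;  // base-field mult 3
    let c0 = (v0 + beta * v1 % p) % p;               // c0 = a0*b0 + beta*a1*b1
    let c1 = (v2 + 2 * p - v0 - v1) % p;             // c1 = (a0+a1)(b0+b1) - v0 - v1
    (c0, c1)
}

fn main() {
    // Example over F_7 with beta = -1 (i.e. 6): (1 + 2u)(3 + 4u) = 2 + 3u.
    assert_eq!(fp2_mul((1, 2), (3, 4), 6, 7), (2, 3));
    println!("ok");
}
```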
Sparkworks - a Library to Support Hardware with Epsilon Changes to Arkworks
We had two design goals in mind:
- Simplicity for ZK application developers. No changes to upstream Arkworks
- Keeping the Cloud-ZK interface clean. The hardware should be agnostic to whoever is using it
To demonstrate how this can be done, we focused only on the MSM computation. Arkworks uses a trait called VariableBaseMSM, which uses generics to compute MSMs over the specified elliptic curve. In our version of Arkworks, we only needed to introduce a new feature in Cargo.toml. When this feature is enabled, VariableBaseMSM is instantiated using a new struct we implemented in a new library called Sparkworks.
This struct, FPGAVariableBaseMSM, is a wrapper that identifies the elliptic curve and calls the relevant Cloud-ZK driver. Cloud-ZK then performs the MSM computation on the target hardware, in our case an AWS F1 FPGA.
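The snippet below is a minimal sketch of the idea, not the actual Sparkworks code: a feature-gated wrapper for BN254 G1 whose FPGA path forwards to a hypothetical cloud_zk driver call, and whose fallback uses the arkworks 0.4-style VariableBaseMSM entry point.

```rust
// Illustrative sketch only. `cloud_zk::msm_bn254_g1` is a hypothetical driver
// call; the CPU fallback assumes arkworks 0.4 (`ark-bn254`, `ark-ec`).
use ark_bn254::{Fr, G1Affine, G1Projective};
use ark_ec::VariableBaseMSM;

pub struct FpgaVariableBaseMsm;

impl FpgaVariableBaseMsm {
    /// FPGA path, enabled with `--features fpga`: ship the bases and scalars
    /// to the F1 card through the (hypothetical) Cloud-ZK driver.
    #[cfg(feature = "fpga")]
    pub fn msm_g1(bases: &[G1Affine], scalars: &[Fr]) -> G1Projective {
        cloud_zk::msm_bn254_g1(bases, scalars)
    }

    /// CPU fallback: arkworks' own Pippenger-based MSM.
    #[cfg(not(feature = "fpga"))]
    pub fn msm_g1(bases: &[G1Affine], scalars: &[Fr]) -> G1Projective {
        G1Projective::msm(bases, scalars).expect("bases and scalars must have equal length")
    }
}
```

Higher layers always call the same signature, so switching the Cargo feature on or off never touches circuit or prover code.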
Introducing Sparkworks is what allows us to meet both design goals. The library is in charge of processing requests from higher-level libraries and performing the necessary conversions and routing to the Rust driver that communicates with the hardware. Arkworks developers only need to make sure they have access to the hardware and turn on the feature. The hardware can be repurposed to support different programming ecosystems using the same driver/interface.
Introducing Sparkworks as middleware enables us to easily switch both the software stack above, say by moving to a different proving system/curve, and the hardware below, for example moving from FPGA to GPU.
Demo
We ran the benchmarks from the Groth16 repo in Arkworks, with small changes to switch from the hardcoded BLS curve to BN254. We ran the code on an AWS F1 instance with a 16-core CPU and two FPGA cards, using one card for G1 MSM acceleration and the other for G2 MSM acceleration. The G1 card runs at 250 MHz while the G2 card runs at 187 MHz.
We measured the latency with and without FPGA acceleration. As can be seen in the figure, we saw at least a 2x speedup across a broad range of circuit sizes.
Note that, unlike the CPU-only run, where most of the CPU cores are busy throughout, in the FPGA run the CPU sits idle while the FPGAs compute the MSMs. This means the FPGA scenario does not yet make full use of the system's resources: while the CPU is idle, its cores could already be working on the next proof. Measuring throughput, i.e. the number of proofs per second, would therefore show an even higher acceleration factor; a rough sketch of this pipelining idea follows.
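A toy sketch of that pipelining, with prove_one standing in for the full Groth16 prover invocation (this is not part of the demo code):

```rust
// Toy sketch of proof pipelining: keep several proofs in flight so CPU-bound
// stages (witness generation, NTTs) of one proof overlap with the FPGA-bound
// MSM stages of another. `prove_one` is a placeholder, not the real prover.
use std::thread;

fn prove_one(job: usize) {
    // Placeholder for: synthesize witness + NTTs on CPU, MSMs on the FPGA.
    println!("proof {job} done");
}

fn main() {
    let handles: Vec<_> = (0..4).map(|job| thread::spawn(move || prove_one(job))).collect();
    for h in handles {
        h.join().unwrap();
    }
}
```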
Future Work
Here we share details on possible future directions for this work.
First, if you are interested in exploring together how your Arkworks-based project can benefit from hardware acceleration — please reach out to [email protected].
Our proof of concept is curve-specific and will not currently work with other curves, although both Cloud-ZK and Arkworks support a multitude of other curves. The missing work is adding proper curve detection and routing mechanisms, as well as error handling. If you want to continue this work, please contact [email protected].
The use of two FPGAs is in fact redundant in the current implementation: since the G1 and G2 MSMs are called at different times, the same FPGA could serve both. That said, there is nothing stopping us from running the G1 and G2 MSMs in parallel, as they operate on two separate elliptic curves; in that case, two FPGAs will produce better results, roughly as sketched below.
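A skeleton of that parallel dispatch (the per-card driver calls are hypothetical placeholders, shown as comments, and not part of the current implementation):

```rust
// Sketch of dispatching the G1 and G2 MSMs to two FPGA cards concurrently.
use std::thread;

fn main() {
    thread::scope(|s| {
        let g1 = s.spawn(|| {
            // Card 0: G1 MSMs (hypothetical driver call).
            // cloud_zk::msm_g1_on(0, &g1_bases, &scalars)
        });
        let g2 = s.spawn(|| {
            // Card 1: G2 MSM (hypothetical driver call).
            // cloud_zk::msm_g2_on(1, &g2_bases, &scalars)
        });
        g1.join().unwrap();
        g2.join().unwrap();
    });
}
```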
Finally, we plan to add support for NTT acceleration in a similar way to MSM, while supporting multiple 256-bit fields. With NTT acceleration, another major bottleneck will be removed, contributing further to prover performance.
Follow our Journey
Twitter: https://twitter.com/Ingo_zk
Github: https://github.com/ingonyama-zk
YouTube: https://www.youtube.com/@ingo_zk
Join us: https://www.ingonyama.com/careers