Joint work with the RockawayX infrastructure team
Ingonyama is the developer of ICICLE, which aims to be for ZK proof generation what PyTorch is for AI. ICICLE is built on CUDA and therefore currently supports only Nvidia GPUs. ICICLE enables developer teams to integrate ZK tech into their products with ease, and saves them a massive amount of research effort on implementing accelerated ZK primitives.
Currently, however, there is no standard for deploying ZK applications, which is why we are introducing ZKContainers: the building block ZK developers can use to deploy and scale their ZK infrastructure.
But what about client-side ZK? What if I am a user today who wants to participate in a network that supports ZK verification?
Most relevant to 2024 is the use case of a company which wants to participate in a proof network with their available GPUs.
Let’s take Scroll, for example, a leading ZK proving network currently in production. To participate, there is a minimum hardware requirement: a 64-thread CPU, 400 GB of CPU RAM, and 2x RTX 3080 GPUs. We can imagine that, to participate in the Scroll proving network and be profitable, there is going to be a use case for ZK Datacenters where a miner can quickly scale their operation up or down.
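As a rough illustration, an operator deciding which machines can join could check each node against the minimums quoted above. The sketch below is hypothetical: `NodeSpec` and `missing_requirements` are names invented for this example and are not part of Scroll or ICICLE, and the GPU check is a deliberate simplification that only recognizes the exact model named in the requirement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeSpec:
    """Hypothetical description of one machine (not a Scroll or ICICLE type)."""
    cpu_threads: int
    ram_gb: int
    gpu_models: tuple  # one model-name string per installed GPU


# Minimums as quoted in the text above.
MIN_CPU_THREADS = 64
MIN_RAM_GB = 400
MIN_GPU_COUNT = 2
REQUIRED_GPU_MODEL = "RTX 3080"


def missing_requirements(node):
    """Return a list of unmet minimums (empty if the node qualifies)."""
    problems = []
    if node.cpu_threads < MIN_CPU_THREADS:
        problems.append(
            f"CPU threads: need {MIN_CPU_THREADS}, have {node.cpu_threads}")
    if node.ram_gb < MIN_RAM_GB:
        problems.append(f"RAM: need {MIN_RAM_GB} GB, have {node.ram_gb} GB")
    # Simplification: only cards whose name contains the quoted model count,
    # so a newer or faster card is not recognized by this sketch.
    matching = sum(1 for model in node.gpu_models if REQUIRED_GPU_MODEL in model)
    if matching < MIN_GPU_COUNT:
        problems.append(
            f"GPUs: need {MIN_GPU_COUNT}x {REQUIRED_GPU_MODEL}, have {matching}")
    return problems


if __name__ == "__main__":
    node = NodeSpec(cpu_threads=64, ram_gb=256,
                    gpu_models=("RTX 3080", "RTX 3080"))
    for problem in missing_requirements(node):
        print(problem)
```

Returning a list of problems rather than a single boolean makes it easier to see why a given machine was rejected when sorting through a large fleet.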
Further into the future, as ZK enters new industries, it is not hard to imagine service providers who operate a large fleet of hardware and want to add ZK as a feature (think OpenAI and ChatGPT accountability). In this example as well, there will be a need for an easy way to deploy, test, update, scale and experiment with different configurations of prover software. Here again, there is no solution on the market. ZK developers should be able to deploy their applications on a single machine or a cluster with ease, and enjoy convenient DevOps tools and management systems.
To support these exciting use cases and teams wishing to scale their ZK operations, we are offering a new set of tools alongside ICICLE. Let us introduce the ZKDC framework, a set of secure and optimized ZK Containers for ICICLE applications and a set of scripts for deploying basic ZK datacenters. In the rest of this post we will discuss mostly technical specifications of our solution. To get access to the code (still in alpha phase) or to schedule a demo, please contact firstname.lastname@example.org
As part of our market research we interviewed companies who run ZK at scale today. We wanted to learn if the industry somehow independently converged to some magic formula.
Unsurprisingly, this did not happen. ZK compute has unique characteristics, and since no one is offering a one-size-fits-all solution (think cloud for ZK), we have seen different approaches, from Kubernetes and Hadoop to customized software. Some companies chose on-prem, while others deployed using a cloud provider. ZK Datacenters are diverse, and tools and best practices are missing.
The logical first step, given that we already have ICICLE for ZK application development, is to package such ZK programs in a portable, ready-to-use fashion. Enter ZKContainers (ZKC for short).
Containers are a lightweight alternative to virtual machines. They allow developers to package software and its dependencies in an isolated unit, which can be easily deployed in any environment.
Containers are one of the foundational technologies used today in cloud computing. They are frequently used in AI, for example, to deploy and scale AI workloads. Containers also have many tools developed for them which allow scaling software automatically across datacenters, both on-prem and on cloud.
A ZKContainer is a Docker image which includes ICICLE and all dependencies required to run an optimized ICICLE application on Nvidia GPUs. Here is the feature list enabled using ZKCs:
- Fast prototyping: it takes a few command lines to switch between provers and hardware configurations
- Storage solutions: generating ZK proofs often requires many files to be generated; ZKCs make it easier to configure storage locations, caching, and cleanup after proving
- Fast scaling: start testing on a single cluster node or even a single GPU and easily scale based on demand. Containers also make load balancing, resource configuration, and optimizations simple to implement
- Diverse set of use cases built-in: such as running different prover protocols (Groth16, Plonk) and different implementations (Gnark, Halo2)
- Built-in logging solutions: which can connect to a central logging solution
- Common tooling: since we currently only support Nvidia GPUs, our containers work with Nvidia’s GPU datacenter tooling such as NVIDIA GPU Operator
- Security: ZKC images are scanned for common CVEs
- Constantly updated: ZKContainers are kept up-to-date with the latest ICICLE improvements
- Multi-container platform support: while Docker is currently our main focus, we also allow you to build containers for Singularity, CRI-O, and containerd
- They run anywhere: on-prem, bare-metal, cloud platforms, Kubernetes, virtual machines, and various architectures which support containers.
- Continuous testing: making sure ZKCs work out of the box on a wide range of systems.
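To make the fast-prototyping and storage points concrete, here is a minimal sketch of how switching GPU counts or proof storage locations could come down to changing one argument. It only assembles a standard `docker run` command line (`--rm`, `--gpus`, and `-v` are regular Docker CLI flags). The image name `zkcontainer-example:latest`, the paths, and the helper `docker_run_command` are placeholders invented for this example, not the actual ZKContainer interface.

```python
import shlex


def docker_run_command(image, gpus="all", proof_dir=None,
                       container_dir="/data/proofs"):
    """Build a `docker run` argument list for a GPU container.

    gpus: "all" or an integer GPU count, passed to Docker's --gpus flag.
    proof_dir: optional host directory to mount for proof artifacts.
    """
    cmd = ["docker", "run", "--rm", "--gpus", str(gpus)]
    if proof_dir is not None:
        # Bind-mount the host directory so generated files outlive the container.
        cmd += ["-v", f"{proof_dir}:{container_dir}"]
    cmd.append(image)
    return cmd


if __name__ == "__main__":
    # Hypothetical image name and host path, for illustration only.
    cmd = docker_run_command("zkcontainer-example:latest", gpus=2,
                             proof_dir="/mnt/proofs")
    print(shlex.join(cmd))
```

The function returns an argument list instead of a shell string so it can be handed to `subprocess.run` directly without shell quoting concerns.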
You can find our ZKContainer here (email us for access)
Demo Time — Aleo Prover
For this demo, RockawayX provided us with access to one of their GPU data centers and infrastructure support. Our goal with this collaboration was to deploy our Aleo prover across their GPU data center of over 6000 GPUs; each node in this cluster contains 8 GPUs. We aimed to maintain every node at 100% GPU utilization.
We first created a Docker ZKContainer for our Aleo consensus prover. This container included all required dependencies, tooling, and optimizations, and was created using Ingonyama’s CLI tool and base ZKContainer. The CLI tool allows you to locally configure your own custom container for your ZK application and deployment environment.
Configuring, Building and Deploying a ZKContainer
The architecture for this deployment is rather simple, since we don’t have any demand for autoscaling, external data sources or load balancing. The first stage of our deployment was to configure the nodes with correct software; the second stage was launching the containerized prover instances across all nodes.
For configuration of each node in the cluster we used Ansible, an open source IT Automation software. Ansible enables you to boot up machines, configure them, run scripts on them and deploy software automatically across many machines.
We then used Ansible scripts to deploy the containerized Aleo provers. Since the ZKContainer contains information about the number of GPUs and the resources it should consume on the machine, each prover launched automatically and started proving.
As expected, with ZKContainers this process was bug-free. Manually configuring 6000 GPUs would have been a nightmare, even if we had to debug only 1% of them for dependency or compatibility issues. Monitoring was also greatly simplified thanks to the ZKContainers’ built-in monitoring tooling.
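Ansible decides which machines to act on from an inventory file. As a small sketch of what feeding it a large cluster could look like, the snippet below renders a list of hosts into Ansible's INI inventory format (a `[group]` header followed by one host per line). The group name `provers` and the host names are made up for illustration; the actual inventory and playbooks used in this deployment are not shown in this post.

```python
def render_inventory(group, hosts):
    """Render hosts as a single-group Ansible INI inventory string."""
    if not hosts:
        raise ValueError("inventory needs at least one host")
    lines = [f"[{group}]"] + list(hosts)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical host names; a real cluster would list every node.
    hosts = [f"gpu-node-{i:03d}" for i in range(1, 4)]
    print(render_inventory("provers", hosts), end="")
```

A file produced this way can be passed to `ansible-playbook` with its `-i` option, alongside whatever playbook performs the actual container deployment.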
In the case of this deployment, each node was composed of 8 Nvidia GPUs, and each node was able to achieve an impressive ~56,000 PPS (Proofs per second). Unsurprisingly, total data center performance was exactly equal to the number of GPUs times the max throughput of a single GPU.
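The linear-scaling claim can be turned into a quick back-of-the-envelope estimate. The sketch below uses the figures quoted in this post (8 GPUs per node, ~56,000 PPS per node) and assumes exactly 6000 GPUs. Since the cluster has "over 6000" GPUs and the per-node figure is approximate, the result should be read as a rough lower bound, not a measured number.

```python
GPUS_PER_NODE = 8
NODE_PPS = 56_000  # approximate proofs per second per node, as quoted above


def per_gpu_pps():
    """Approximate throughput of one GPU, assuming an even split per node."""
    return NODE_PPS / GPUS_PER_NODE


def fleet_pps(total_gpus):
    """Estimated total throughput under perfectly linear scaling."""
    return total_gpus * per_gpu_pps()


if __name__ == "__main__":
    total_gpus = 6000  # assumption: the post only says "over 6000"
    print(f"per GPU: {per_gpu_pps():,.0f} PPS")
    print(f"nodes:   {total_gpus // GPUS_PER_NODE}")
    print(f"fleet:   {fleet_pps(total_gpus):,.0f} PPS")
```

Under these assumptions that works out to about 7,000 PPS per GPU, 750 nodes, and roughly 42 million PPS for the whole data center.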
Overall, the use of ZKContainers greatly simplified deployment and configuration. Future changes and updates will also be simplified, since any update can be rolled out by updating the container and redeploying. The nodes can be configured with a wide array of hardware (CPUs, RAM, GPUs, and interconnects) without massive software changes, just by updating the container. Most importantly, changes can be deployed to the full datacenter after testing on 1–2 nodes first.
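The "test on 1–2 nodes first" approach is a staged (canary) rollout. A minimal sketch of the idea, with an invented helper name and placeholder host names, is to split the node list into a small canary batch and the remainder, update the canary batch, and only continue once it looks healthy.

```python
def rollout_batches(hosts, canary_count=2):
    """Split hosts into (canary, remainder) for a staged rollout."""
    if canary_count < 0:
        raise ValueError("canary_count must be non-negative")
    hosts = list(hosts)
    return hosts[:canary_count], hosts[canary_count:]


if __name__ == "__main__":
    # Hypothetical host names, for illustration only.
    hosts = [f"gpu-node-{i:03d}" for i in range(1, 7)]
    canary, remainder = rollout_batches(hosts)
    print("update first:", canary)
    print("update after canary passes:", remainder)
```

How "healthy" is judged (for example, by comparing proofs per second against the previous container version) is left out here and would depend on the monitoring in place.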
What is Next
With time, we hope to see cloud providers and hyperscalers harnessing their experience and scalability to standardize access to ZK-capable hardware. Meanwhile, we operate from the other end: using our ZKContainer technology, we take our existing ZK GPU software and package it to match any hardware and cloud configuration. We will constantly be pushing this standard forward.
If you have idle GPU hardware (we are also looking at you, Ethereum proof-of-work miners :)), it’s time to put it to work. Reach out!