Fast Danksharding using ICICLE

In this blog we provide technical details on our Danksharding CUDA implementation

Background

Danksharding is the new sharding design proposed as part of Ethereum’s rollup-centric scaling roadmap. From an architectural perspective it mainly defines a new data type called blob-data. The blob-data is stored by the consensus layer (beacon chain) for a limited amount of time. It is not accessible by the execution layer (EVM), though its contents and availability can be verified by it. Danksharding utilizes a new actor type called a builder. Builders submit block proposals to proposers whose job it is to select the block with the highest bid. Only the selected block builder is then required to build the entire block. Once published, the selected block can be efficiently verified via data availability sampling.

‍

ICICLE is an open-source software project aimed at improving the performance and efficiency of zero-knowledge cryptographic operations using CUDA-enabled GPUs. ICICLE focus is on implementation of widely used ZK primitive functions using CUDA and providing API with Rust lang out-of-the-box. Currently, the library supports MSM and NTT, Elliptic Curve NTT (ECNTT) and some other helper utilities.

Danksharding — Bird’s Eye Architectural View

The raw blob-data block contains 256 blobs and serves as the builder’s input. Each blob is a sequence of 4096, 32-byte field-elements from the BLS12–381 scalar field. The raw data is typically organized into a matrix of 256 rows by 4096 columns.

The following list outlines the three distinct outputs required to be constructed by the builder:

Interpolated Data (D) — The raw data is treated as evaluations of a two-dimensional polynomial over a grid that is a multiplicative subgroup of the polynomial’s domain. This allows interpolating the raw data by a factor of two in each dimension resulting in a matrix of 512 rows by 8192 columns. Due to the linear nature of polynomial interpolation, each row (column) of the resulting matrix represents 8192 (512) samples of a 4095 (255) degree, one-dimensional polynomial. The main achievement of the interpolation is that data withholding (unavailability) by the builder is extremely improbable in comparison to the raw data which lacks redundancy. The interpolated data is treated as 512 oversampled blobs.
Commitments (K) — Each of the 512 interpolated blobs is committed-to using a KZG commitment that utilizes a common Structured Reference String (SRS). Each commitment binds the builder to a row of the interpolated data matrix D. It is worth noting that the commitments K maintain the linear relationship of the rows.
Proofs (P) — The builder constructs proofs for data verification. Each proof bundles 16 data samples from D. Each bundle of 16 samples can be verified with their associated proof and row commitment, using a KZG pairing equation. The proof generation is the dominant computationally intensive part of the builder’s building task.

The figure above depicts the builder’s block construction process. The right, middle, and left branches of the drawing describe the commitments, interpolation, and proofs tasks. Green boxes describe NTT, ECNTT operations, yellow boxes describe vector multiplication and blue boxes are for Multi-Scalar Multiplication (MSM). It can be immediately seen that the scheme is heavily dominated by Number Theoretic Transform (NTT) and Inverse NTT (INTT) operations. A detailed description of the process is provided in our danksharding math paper.

Implementing Danksharding Builder

Using ICICLE, we have implemented a flow that can run entirely in the GPU, with only the relevant outputs to be sent to a host machine and populated on the Ethereum network. Our work focuses on the smallest functions and building blocks that are relevant to the Danksharding builder. The CUDA functions that we designed, along with bindings to our Rust wrappers are capable of executing this process.

GPU data Flow

We designed the flow to be optimized in response time to the Ethereum network, by first splitting the input matrix D_in into two parallel execution branches:

Calculating the KZG commitments, shortest path to response
Interpolation matrix D and KZG proofs calculation

Accelerating the KZG commitment calculation and share it to the network takes highest priority. Therefore, we start with calculating the matrix C_rows = INTT_rows(D_in) which is a baseline for the rest of the process.

From this point, we call ICICLE to perform a series of computations of MSMs, ECNTTs and vector multiplication.

We finish the KZG commitment calculation by concatenating the MSM output vector with the result of the last ECNTT. This 512 element vector is pushed to the host CPU and sent to the Ethereum network.

The interpolation matrix D is calculated using a highly parallel sub-processes of INTT, vector multiplication and NTT to result 3 sub-matrices C_rows, C_cols and C_both. These sub-matrices, together with the input matrix C, construct the interpolation matrix D.

Since matrix D is used by both the CPU and the remaining part of KZG proofs, it is necessary to copy the matrix data to the host machine and store it.

Rust and Correctness

We have designed and implemented Danksharding scheme in Rust, with a simple API to call all individual building-block functions. The implementation follows the scheme and, at least for the first release, provides both an application that integrates ICICLE, and a ping-pong design between the host machine and the GPU device. This design leaves a lot of space for planned optimizations towards a vision of a full application that will run on accelerated hardware.

The design was tested for correctness using Ethereum research danksharding python reference code. Our test vectors and code to generate them can be found in the repo.

Don’t be a stranger

We are happy to officially open our Discord server. This is the place to meet and chat with the dev team, talk about features, bugs, redesign and basically everything around acceleration of ZKPs.