Diving Deeper into Blaze: NTT Module

Originally published on HackMD

In our previous blog post we described Blaze — a Rust library for ZK acceleration on Xilinx FPGAs.

Since the release of Blaze, we have been actively working on its architecture and on exposing an API for our NTT primitive implementation. Today we are ready to introduce a new module for working with NTT.

What is the NTT Module in a Nutshell?

Blaze architecture makes it easy to add new modules. In our introductory Blaze blog post we described the Poseidon hash function, and here we will describe the NTT module.

NTT, or Number Theoretic Transform, is the term used to describe a Discrete Fourier Transform (DFT) over finite fields. Our module provides an API for computing an NTT of size 2²⁷. To use it, pass the input elements as a byte vector in which each element is encoded as little-endian bytes. The result is a byte vector in the same format, with each element again encoded as little-endian bytes.
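To make the definition concrete, here is a minimal sketch of a naive O(n²) NTT over a tiny prime field. It is purely illustrative and is not how the FPGA module computes the transform; the function names `pow_mod` and `naive_ntt`, the modulus 17, and the size 4 are assumptions chosen only to keep the arithmetic easy to follow.

```rust
/// Modular exponentiation: base^exp mod modulus (small values only, no overflow handling).
fn pow_mod(mut base: u64, mut exp: u64, modulus: u64) -> u64 {
    let mut acc = 1u64;
    base %= modulus;
    while exp > 0 {
        if exp & 1 == 1 {
            acc = acc * base % modulus;
        }
        base = base * base % modulus;
        exp >>= 1;
    }
    acc
}

/// Naive NTT: out[k] = sum over j of input[j] * root^(j*k), all modulo `modulus`.
/// `root` must be a primitive n-th root of unity modulo `modulus`, where n = input.len().
fn naive_ntt(input: &[u64], root: u64, modulus: u64) -> Vec<u64> {
    let n = input.len() as u64;
    (0..n)
        .map(|k| {
            input.iter().enumerate().fold(0u64, |acc, (j, &x)| {
                // The exponent can be reduced modulo n because root^n == 1.
                let twiddle = pow_mod(root, (j as u64 * k) % n, modulus);
                (acc + (x % modulus) * twiddle) % modulus
            })
        })
        .collect()
}
```

For example, 4 is a primitive 4th root of unity modulo 17, so `naive_ntt(&[1, 2, 3, 4], 4, 17)` transforms a 4-element vector. Applying the same function with the inverse root 13 and scaling each output by the inverse of n (also 13 modulo 17) recovers the input.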

How is NTT Structured from a Developer’s Point of View?

In this brief blog post we will not go into depth on how and why the calculations are built. You can read about this in our series of previously published posts.

More information on this subject will be released as part of our upcoming NTT Webinar.

The important thing for us is that on-device memory (in our case we’re working with HBM) is divided into two buffers:

In one buffer, the host writes data (our input vector).
In the other buffer, the computation takes place. The two buffers then swap roles.

The main advantage of this design (especially having two buffers) is that it supports back-to-back NTT computation.

So our calculation involves the following steps:

Host writes input vector to card/device memory (can be HBM or DDR)
Previously written data is read to FPGA
Data is processed in FPGA
Processed data is written back to card/device memory
Host gets result from card/device memory

Our design supports writing a new input vector and reading the previous result in parallel.
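The steps above can be sketched with an in-memory stand-in for the card. Everything here is hypothetical: `MockDevice`, its methods, and the toy "computation" (reversing the bytes) exist only to show how two buffers let one vector be processed while the host works on the other buffer. In this sequential mock nothing actually overlaps; on real hardware the host-side write and read would run while the FPGA computes.

```rust
/// In-memory stand-in for the card: two buffers and a toy computation.
struct MockDevice {
    buffers: [Vec<u8>; 2],
}

impl MockDevice {
    fn new() -> Self {
        MockDevice { buffers: [Vec::new(), Vec::new()] }
    }

    /// Step 1: the host writes an input vector into one buffer.
    fn write(&mut self, buf: usize, data: &[u8]) {
        self.buffers[buf] = data.to_vec();
    }

    /// Steps 2 to 4: data is read, processed, and written back.
    /// The toy computation simply reverses the bytes in place.
    fn compute(&mut self, buf: usize) {
        self.buffers[buf].reverse();
    }

    /// Step 5: the host reads the result back.
    fn read(&self, buf: usize) -> Vec<u8> {
        self.buffers[buf].clone()
    }
}

/// Processes all inputs back to back, alternating the two buffers.
fn run_back_to_back(inputs: &[Vec<u8>]) -> Vec<Vec<u8>> {
    let mut dev = MockDevice::new();
    let (mut buf_host, mut buf_kernel) = (0usize, 1usize);
    let mut results = Vec::with_capacity(inputs.len());

    for (i, input) in inputs.iter().enumerate() {
        // Host writes the next vector (on hardware, the previous one is still computing).
        dev.write(buf_host, input);
        // The buffers swap roles: the freshly written buffer goes to the kernel.
        std::mem::swap(&mut buf_host, &mut buf_kernel);
        dev.compute(buf_kernel);
        // The previous result now sits in the host buffer and can be read.
        if i > 0 {
            results.push(dev.read(buf_host));
        }
    }

    // Drain the last result, which is still in the kernel buffer.
    if !inputs.is_empty() {
        std::mem::swap(&mut buf_host, &mut buf_kernel);
        results.push(dev.read(buf_host));
    }
    results
}
```

Calling `run_back_to_back` with three input vectors returns three results in input order, with each buffer alternating between the host role and the kernel role.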

Additionally, in the current version of the driver, the input byte vector must be divided into 16 segments, which we will call banks. The partitioning into banks is not sequential; it is determined by how the subsequent calculations are performed. At this stage of implementation, Blaze handles all required conversions, so no additional application integration or data manipulation is required from the end user.

A detailed description of partitioning can be found in Section 2.4.1 Data Organization in our White Paper.
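The white paper defines the real layout. Purely to illustrate the idea of a non-sequential split and its inverse, here is a sketch that distributes fixed-size elements over 16 banks round-robin. The round-robin order, the function names, and the `elem_size` parameter are assumptions for illustration and do not reproduce Blaze's actual partitioning.

```rust
const NOF_BANKS: usize = 16;

/// Splits a flat byte vector of fixed-size elements into banks, round-robin.
/// Assumes `data.len()` is a multiple of `elem_size`.
fn split_into_banks(data: &[u8], elem_size: usize) -> Vec<Vec<u8>> {
    let mut banks: Vec<Vec<u8>> = vec![Vec::new(); NOF_BANKS];
    for (i, elem) in data.chunks(elem_size).enumerate() {
        // Element i goes to bank i mod 16, so neighbours land in different banks.
        banks[i % NOF_BANKS].extend_from_slice(elem);
    }
    banks
}

/// Inverse of `split_into_banks`: rebuilds the flat vector in element order.
fn merge_banks(banks: &[Vec<u8>], elem_size: usize) -> Vec<u8> {
    let total: usize = banks.iter().map(|b| b.len()).sum();
    let mut out = Vec::with_capacity(total);
    for i in 0..total / elem_size {
        let bank = &banks[i % banks.len()];
        // Within its bank, element i is stored at position i / number_of_banks.
        let start = (i / banks.len()) * elem_size;
        out.extend_from_slice(&bank[start..start + elem_size]);
    }
    out
}
```

With 32 two-byte elements, each bank receives two elements, and `merge_banks(&split_into_banks(&data, 2), 2)` returns the original vector.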

Using Blaze

A full description of the tests, which include the binary loading process and the calculations, will be available in the latest release. The binary file for NTT will be located there as well.

Adding Blaze to an Existing Rust Project

First and foremost, let’s connect Blaze to your project. To do this, run the following cargo command:

cargo add ingo-blaze --git "https://github.com/ingonyama-zk/blaze.git"

After this, you will see Blaze in your dependencies:

[dependencies]
ingo-blaze = { git = "https://github.com/ingonyama-zk/blaze.git"}

Create Connection to FPGA using DriverClient

The Blaze architecture is designed so that we can load different drivers on the same FPGA. For this purpose, we separate connecting to the hardware from communicating with it (which goes directly through the module API).

To create a connection, you need to specify the slot and the type of card to work with. So far we support only the Xilinx C1100/U250 installed locally, but in the future we will add support for other cards as well as AWS F1 instances.

use ingo_blaze::driver_client::*;

let dclient = DriverClient::new("0",
DriverConfig::driver_client_cfg(CardType::U250));

Load program for NTT on FPGA

After opening the connection, let’s load our driver (a program that describes how to perform specific calculations on the FPGA). To do this we need to specify the path to our file and load it into memory:

let bin = ingo_blaze::utils::read_binary_file(&bin_fname)?;

Next we need to check if our FPGA is ready to load the driver, and then directly load it on the FPGA:

dclient.setup_before_load_binary()?;
dclient.load_binary(&bin)?;

An important note is that we can replace the loaded FPGA binary/image at run-time. That means you can reuse one connection for different versions of one driver or for other drivers (MSM for example). Keep in mind — only a single driver can be loaded at a time.

Create the client for NTT module

After we have successfully connected to our FPGA and set up the driver, we need to use this connection somewhere. As we mentioned before, each module must implement the DriverPrimitive trait based on the needs of a particular computation. So let’s further discuss what’s hidden under each trait function for NTT.

The first step is always the creation of the client module itself. To do this, we need to specify its type and pass an already open connection:

use ingo_blaze::ingo_ntt::*;
let driver = NTTClient::new(NTT::Ntt, dclient);

There is only one type for NTT for now, NTT::Ntt, but we can extend this module in the future.

If we look inside NTTClient, like other modules it is described by the following structure:

pub struct NTTClient {
    ntt_cfg: NTTConfig,
    pub driver_client: DriverClient,
}

Here driver_client holds the general addresses for the FPGA, and NTTConfig represents the memory address space specific to NTT:

pub(super) const NOF_BANKS: usize = 16;

pub(super) struct NTTAddrs {
    pub hbm_ss_baseaddr: u64,
    pub hbm_addrs: [u64; NOF_BANKS],
}

pub(super) struct NTTConfig {
    pub ntt_addrs: NTTAddrs,
}

Initialize the FPGA

For the NTT module, initialization currently lets us configure execution either as a full NTT computation or as a partial execution, which we use to debug NTT.

driver.initialize(NttInit{})?;

However, only full calculations are available to users. You can have a look inside the NTT initialize method.

Reading/Writing to the FPGA

NTT, like other modules, implements functions to write and read data from the FPGA.

// Writing to the FPGA
driver.set_data(NTTInput {
    buf_host,
    data: in_vec,
})?;

// Waiting and reading result from the FPGA
driver.wait_result()?;
let res = driver.result(Some(buf_host))?.unwrap();
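The `in_vec` passed to `set_data` above is a flat byte vector of little-endian elements. As a sketch of how an application might build and parse such a vector, the helpers below assume a hypothetical element representation of four 64-bit limbs (32 bytes, least significant limb first). The `Element` type, the element width, and both function names are assumptions, not part of the Blaze API.

```rust
/// Hypothetical element: four 64-bit limbs, least significant limb first.
type Element = [u64; 4];

/// Size of one serialized element in bytes (an assumption of this sketch).
const ELEMENT_BYTES: usize = 32;

/// Serializes elements into one flat little-endian byte vector.
fn elements_to_le_bytes(elems: &[Element]) -> Vec<u8> {
    elems
        .iter()
        .flat_map(|e| e.iter().flat_map(|limb| limb.to_le_bytes()))
        .collect()
}

/// Parses a flat little-endian byte vector back into elements.
/// Trailing bytes that do not fill a whole element are ignored.
fn le_bytes_to_elements(bytes: &[u8]) -> Vec<Element> {
    bytes
        .chunks_exact(ELEMENT_BYTES)
        .map(|chunk| {
            let mut elem = [0u64; 4];
            for (limb, limb_bytes) in elem.iter_mut().zip(chunk.chunks_exact(8)) {
                let mut buf = [0u8; 8];
                buf.copy_from_slice(limb_bytes);
                *limb = u64::from_le_bytes(buf);
            }
            elem
        })
        .collect()
}
```

The same parsing helper can be applied to the byte vector returned by `result`, since the output uses the same little-endian element format as the input.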

Let’s dive a bit into what happens to our original byte vector after we pass it to write.

After receiving the input, the NTTClient starts the preprocess computation. In this function the initial vector is distributed across the 16 banks in a particular order.

Next, each bank is written to the corresponding memory address.

fn set_data(&self, input: NTTInput) -> Result<()> {
    let data_banks = NTTBanks::preprocess(input.data);

    data_banks
        .banks
        .into_iter()
        .enumerate()
        .try_for_each(|(i, data_in)| {
            let offset = self.ntt_cfg.ntt_bank_start_addr(i, input.buf_host);
            self.driver_client.dma_write(
                self.driver_client.cfg.dma_baseaddr,
                offset,
                data_in.as_slice(),
            )
        })
}

You can see that the memory address depends on which memory buffer the host (buf_host) is working with:

pub(super) fn ntt_bank_start_addr(&self, bank_num: usize, buf_num: usize) -> u64 {
    self.hbm_bank_start_addr(bank_num) + (Self::NTT_BUFFER_SIZE * buf_num) as u64
}
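As a worked example of this address arithmetic, the standalone sketch below mirrors the formula with made-up numbers. The `NttAddrSketch` type, the bank stride of 0x1000_0000, and the buffer size of 0x0800_0000 are hypothetical values chosen for readability; the real addresses live in Blaze's NTTConfig.

```rust
const NOF_BANKS: usize = 16;

/// Standalone stand-in for the NTT address configuration (hypothetical values).
struct NttAddrSketch {
    hbm_addrs: [u64; NOF_BANKS],
    buffer_size: usize,
}

impl NttAddrSketch {
    /// Same shape as the formula above: bank base plus buffer_size * buf_num.
    fn bank_start_addr(&self, bank_num: usize, buf_num: usize) -> u64 {
        self.hbm_addrs[bank_num] + (self.buffer_size * buf_num) as u64
    }
}

/// Builds a layout where bank i starts at i * 0x1000_0000 (an assumption).
fn example_layout() -> NttAddrSketch {
    let mut hbm_addrs = [0u64; NOF_BANKS];
    for (i, addr) in hbm_addrs.iter_mut().enumerate() {
        *addr = i as u64 * 0x1000_0000;
    }
    NttAddrSketch {
        hbm_addrs,
        buffer_size: 0x0800_0000,
    }
}
```

With this layout, bank 3 in buffer 0 starts at 0x3000_0000 and in buffer 1 at 0x3800_0000, so switching buffers only shifts every bank address by one buffer size.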

When reading the result, the host does not actually receive a whole vector from the FPGA, but 16 banks that need to be processed:

fn result(&self, buf_num: Option<usize>) -> Result<Option<Vec<u8>>> {
    let mut res_banks: NTTBanks = Default::default();
    for i in 0..NOF_BANKS {
        let offset = self.ntt_cfg.ntt_bank_start_addr(i, buf_num.unwrap());
        res_banks.banks[i] = vec![0; NTTConfig::NTT_BUFFER_SIZE];
        self.driver_client.dma_read(
            self.driver_client.cfg.dma_baseaddr,
            offset,
            &mut res_banks.banks[i],
        )?;
    }

    let res = res_banks.postprocess();
    Ok(Some(res))
}

So just as with writing, we need to calculate each bank’s address again depending on the buffer. We then pass our banks to postprocess. You can see how the function is organized here.

Run computation

While our read and write data functions depend on the host buffer, the start of the computation process is tied directly to the FPGA. So by swapping the buf_host and buf_kernel values we choose which buffer to start the calculation on.

Starting the computation looks like this:

driver.start_process(Some(buf_kernel))?;
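One simple way to keep track of buf_host and buf_kernel between calls is a small helper that holds both indices and swaps them after each run. This is only a sketch: the `BufferRoles` type and the `schedule` function are hypothetical and not part of Blaze, and the initial assignment (host on buffer 0, kernel on buffer 1) is an assumption.

```rust
/// Which of the two device buffers each side is currently using.
#[derive(Debug, Clone, Copy, PartialEq)]
struct BufferRoles {
    /// Buffer the host writes input to and reads results from.
    buf_host: usize,
    /// Buffer the FPGA kernel computes on.
    buf_kernel: usize,
}

impl BufferRoles {
    /// Assumed starting point: host on buffer 0, kernel on buffer 1.
    fn new() -> Self {
        BufferRoles { buf_host: 0, buf_kernel: 1 }
    }

    /// Swaps the two roles, as done between consecutive computations.
    fn swap(&mut self) {
        std::mem::swap(&mut self.buf_host, &mut self.buf_kernel);
    }
}

/// Returns the roles used in each of `steps` consecutive runs.
fn schedule(steps: usize) -> Vec<BufferRoles> {
    let mut roles = BufferRoles::new();
    (0..steps)
        .map(|_| {
            let current = roles;
            roles.swap();
            current
        })
        .collect()
}
```

In an application, the values of `buf_host` and `buf_kernel` from such a helper would be the ones passed to `set_data`, `result`, and `start_process` for each run.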

Conclusion

We are excited to see what the community builds with Blaze! We welcome your contributions to the project on GitHub.