The rise of Artificial Intelligence has been nothing short of spectacular. From basic algorithms to its most advanced form — Large Language Models (LLMs) like ChatGPT and Copilot — AI is at the forefront of technological evolution.
As these models interact with users and process vast amounts of data and prompts, the question of data privacy comes into play. Amazon and Apple, among other major companies, have already curtailed their employees’ access to public AI services like ChatGPT over potential data leaks stemming from AI interactions. Additionally, it’s reasonable to anticipate that regulations will soon be introduced to mandate certain levels of user privacy.
How can we ensure that our interactions, questions, and data shared with these models remain private?
Enter Fully Homomorphic Encryption (FHE).
Fully Homomorphic Encryption (FHE) in Depth
In the realm of cryptography, Fully Homomorphic Encryption is a groundbreaking concept. Its allure lies in a unique ability: it allows computations on encrypted data without needing to decrypt it first, enabling private inference on sensitive information.
This capability ensures two important things: data remains secure during processing, and the model’s intellectual property (IP) is completely preserved.
Private Inference vs. IP protection
Today, privacy and user experience seem like conflicting goals. People often trust a third party with their information primarily for the sake of user experience. We believe these third parties can balance privacy with top-notch service, rather than forcing a choice between local solutions that may be more private but less feature-rich and services that sacrifice privacy for additional features.
FHE has the ability to provide private inference and still completely preserve the model’s IP. By enabling computations on encrypted data, it ensures that prompts remain entirely private while also safeguarding the unique and proprietary aspects of the LLM.
Traditional Encryption vs. FHE
To perform meaningful computation on encrypted data in conventional encryption schemes, one would first need to decrypt it. This decryption exposes the data, making it vulnerable, even if only for a brief moment. In contrast, FHE enables operations directly on the ciphertext, ensuring that sensitive information remains hidden throughout the computational process.
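To make the contrast concrete, here is a deliberately insecure toy scheme, invented purely for illustration (additive masking modulo N, nothing like a real FHE construction): two encrypted values are added without ever being decrypted, and only the key holder can read the result of the computation.

```python
import secrets

N = 2**32  # modulus for the toy scheme

def encrypt(m, key):
    # One-time-pad-style additive masking: the ciphertext alone
    # reveals nothing about m.
    return (m + key) % N

def decrypt(c, key):
    return (c - key) % N

k1, k2 = secrets.randbelow(N), secrets.randbelow(N)
c1, c2 = encrypt(7, k1), encrypt(35, k2)

# Add the ciphertexts directly -- no decryption happens here.
c_sum = (c1 + c2) % N

# Only the holder of the keys can recover the result: 7 + 35 = 42.
assert decrypt(c_sum, (k1 + k2) % N) == 42
```

This toy scheme is only additively homomorphic and is not secure if keys are reused; real FHE schemes support arbitrary computation under far stronger guarantees, but the principle of operating on ciphertexts is the same.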
Why is FHE Important?
The significance of FHE extends beyond theory. Imagine cloud computing services where your data can be processed without ever being exposed, or medical databases that can compute analytics without accessing sensitive patient details. FHE’s potential applications are vast and varied, from secure voting systems to private searches on encrypted databases.
FHE is built on the Learning With Errors (LWE) problem, a form of lattice-based cryptography that is considered secure even against quantum computers. In LWE, random noise is added to make the data unreadable to anyone who does not possess the secret key. Arithmetic operations can still be performed on the encrypted data; however, each operation typically increases the noise level. Perform too many of them in sequence and the data becomes unreadable to everyone, even the key holder.
This is sometimes called Somewhat Homomorphic Encryption (SHE). To convert SHE into FHE, one needs an operation that can lower the noise level. Such an operation is called bootstrapping, and it exists in several FHE schemes. In this text we will focus on one such scheme, called TFHE (Torus FHE), which utilizes the algebraic structure of the mathematical torus to make FHE possible.
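The noise mechanics can be sketched with a toy, LWE-flavored example, invented for this post (it omits the secret-key vector and everything else that makes real LWE secure): a bit is encoded in the high part of a ciphertext, a small noise term is added, and adding ciphertexts acts as XOR on the bits while the noise terms accumulate.

```python
import random

Q = 2**15          # ciphertext modulus
NOISE_BOUND = 400  # magnitude of fresh-ciphertext noise

def encrypt_bit(b, mask):
    # Toy encoding: the bit sits at Q/2, a small noise term is added.
    noise = random.randint(-NOISE_BOUND, NOISE_BOUND)
    return (mask + b * (Q // 2) + noise) % Q

def decrypt_bit(c, mask):
    # Decryption works only while the accumulated noise stays below Q/4.
    centered = (c - mask) % Q
    return 1 if Q // 4 <= centered < 3 * Q // 4 else 0

m1, m2 = random.randrange(Q), random.randrange(Q)
c1, c2 = encrypt_bit(1, m1), encrypt_bit(1, m2)

# Fresh ciphertexts decrypt correctly.
assert decrypt_bit(c1, m1) == 1

# Homomorphic addition acts as XOR (1 XOR 1 = 0), but the two noise
# terms add up: chain enough operations and decryption starts failing.
assert decrypt_bit((c1 + c2) % Q, (m1 + m2) % Q) == 0
```

Bootstrapping is precisely the operation that resets this accumulated noise back to a fresh level, which is what turns a somewhat homomorphic scheme into a fully homomorphic one.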
Advantages of TFHE:
While each scheme has its own strengths and weaknesses, TFHE presently boasts more efficient implementations in practical scenarios. Another significant advantage of TFHE lies in its programmable bootstrapping (PBS), which enhances the usual bootstrapping operation to include calculations of single-variable functions, such as activation functions which are pivotal in the realm of machine learning.
A weakness of TFHE is the requirement to perform a PBS operation for each arithmetic operation performed in the computation, while other schemes allow for batching of a few operations together between bootstrapping.
Assumptions and Approximations
So how much time would it take to run LLM inference with FHE? We make a few assumptions in order to estimate it:
- The number of arithmetic operations required for a single token is about 1–2 times the number of parameters in the model. This is a lower bound as the entire model is used for each token, and we will assume this bound is close enough to the actual requirement.
- Every arithmetic operation in the LLM can be mapped to an arithmetic operation in TFHE. This is essentially a statement about variable type sizes in the two settings. We assume INT4 variables will be sufficient for LLMs and feasible for TFHE.
- Every arithmetic operation in the LLM needs to be mapped to an arithmetic operation in FHE. This means we can’t run part of the model unencrypted. A recent post by Zama considers FHE inference without this assumption, where most of the model is executed locally by the user without any encryption, and only a small piece (a single attention head in their example) is run using FHE on the model company’s server. We posit that this method does not in fact preserve the IP of the model: the user can either simply run the model without the missing head at only a small loss of accuracy, as can be seen here, or perform relatively cheap training of the missing part to get results as good as those of the original model.
- Every arithmetic operation in TFHE requires a single PBS (programmable bootstrapping). PBS is the main bottleneck of TFHE computations.
- The state-of-the-art TFHE implementation is FPT. This is an FPGA implementation that computes PBS at a rate of 1 PBS every 35μs.
The Challenge with FHE and LLMs
With the latest advancements, the best existing implementation of FHE can execute an arithmetic operation in just 35 microseconds. However, when we consider a sophisticated model like GPT2, it requires an overwhelming 1.5 billion operations for a single token. This culminates in a processing time of around 52,000 seconds for each token.
To put this into perspective, in the context of language models, a token can represent anything from a single character to an entire word. Imagine interacting with a Language model, where the response time can take a week or two! Such delays are clearly infeasible for real-time communication or any practical application of the model.
Bridging the Gap
To make FHE usable in LLMs, here’s a potential roadmap:
1. Parallel Processing with Machines:
- Starting at 52,000 seconds/token.
- By deploying 10,000 parallel machines, we reduce the time to 5 seconds/token. Note that LLMs are indeed highly parallelizable and inference today is usually executed in parallel on thousands or more GPU cores.
2. Transition to Advanced Hardware:
- From the improved 5 seconds/token.
- Moving to GPUs or ASICs, we can achieve 0.1 seconds/token. While GPUs can offer more immediate gains in terms of speed, ASICs can offer both higher gains and significant savings in terms of power; one such example is the ZPU.
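The roadmap above amounts to dividing the baseline latency by the parallelism and the hardware gains. A sketch, where the ~50× GPU/ASIC speedup factor is our assumption chosen to land near the 0.1 seconds/token target rather than a measured figure:

```python
base = 52_500.0   # seconds/token for GPT2 on a single FPGA machine
machines = 10_000

per_machine = base / machines   # 5.25 s/token with 10,000 parallel machines

hw_speedup = 50                 # assumed GPU/ASIC gain over the FPGA baseline
final = per_machine / hw_speedup
print(round(final, 3))          # 0.105 -> roughly 0.1 s/token
```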
As shown, it is possible to run private inference of LLMs with FHE using available hardware acceleration techniques. A large but feasible initial investment in a sufficiently large datacenter can support this. However, this possibility remains marginal, and for larger LLMs such as Copilot (12B parameters) or GPT3 (175B parameters) there is still a gap to bridge.
A smaller token throughput will suffice for Copilot, as it answers with code output, which is typically more succinct than human language. If we lower the throughput requirement by a factor of 8, then Copilot also hits the target for feasibility.
The final gap can be bridged by combining larger parallelization, better implementations, and potentially more efficient algorithms for bootstrapping in FHE. At Ingonyama, we believe that an important part of bridging this gap is with algorithms, on which our team is currently focused.
The marriage of FHE’s security and LLMs’ computational capabilities can redefine AI interactions, ensuring both efficiency and privacy. While challenges exist, with continuous research and innovation, a future where our interactions with AI models like ChatGPT are both instantaneous and private is well within reach.
Join us: https://www.ingonyama.com/career