Honey I SNARKED the GPT

Note: This is a repost of Dante’s HackMd from June 2023.

Or rather how we got nanoGPT into Halo2-KZG circuits using EZKL.

This post is written in collaboration with Bianca Gănescu, a Master’s student at Imperial College London (supervised by Jonathan Passerat-Palmbach) who was instrumental in making this happen, and Jason Morton.

I am the lead developer on EZKL (stylized and pronounced Ezekiel), a library which enables users to convert computational graphs represented in the Open Neural Network Exchange (ONNX) format to a (Halo2-KZG) ZK-SNARK circuit. Since starting the project in August of 2022 we’ve come a long way in terms of the breadth of models we support, our proving performance, and also the community of folks building and researching on top of our tool!

Bianca, as part of her Master’s thesis, wanted to get nanoGPT into a ZK-SNARK using our tooling. This post is a high level overview of the steps it took to make that happen.

The sections below are somewhat technical and assume some knowledge of how the Halo2 API is structured.

Supporting arbitrary model sizes

A question we often get is how we manage to get large models into a Halo2-KZG circuit for (almost) any arbitrary trusted setup size.

The way we’ve been able to deal with this is by having operations “wrap” around advice columns. To illustrate this consider our dot product argument:

| a0  | a1  | m     | s_dot | 
|-----|-----|-------|-------|
| x_i | y_i |m_{i-1}| s_dot |
|     |     |m_i    |       |

where the constraints are such that $$x_i \cdot y_i + m_{i-1} = m_i,$$ and $i$ indexes over the elements of a vector or some other iterable structure.

Let’s say i exceeds the number of rows available in a single column (call that number M). We then start allocating the x_i, y_i with i > M to a new set of advice columns a_2, a_3. This requires some trickery: we also need to map the result at the bottom of a_0, a_1 to the top of a_2, a_3 (by way of copy constraints), such that the constraint with its rotation of -1 still holds.

In this manner we can support a relatively large number of constraints by increasing the number of columns rather than the number of rows. The way we get around the region shuffling that might break this is (drum roll)… we use a single region for the entire circuit and keep track of the current offset throughout.
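As a concrete (and heavily simplified) illustration of the wrapping, here is a plain-Python sketch, not EZKL’s actual layouter, that maps the i-th element of a long dot product onto (column, row) coordinates once the column height M is exhausted; all names and values are illustrative.

```python
# Illustrative only: wrap the cells of one long dot product across pairs of
# advice columns of height M, rather than demanding a taller column.
# (Not EZKL's real layouter. In the real circuit a copy constraint also links
# the accumulator m at the bottom of one column pair to the top of the next,
# so the gate with rotation -1 keeps holding across the seam.)

M = 4  # rows available per column in this toy example


def place(i: int, width: int = 2):
    """(column, row) coordinates of x_i; y_i sits in the next column over."""
    block = i // M   # which pair of advice columns (a_0/a_1, a_2/a_3, ...)
    row = i % M      # row within that pair
    return block * width, row


for i in range(10):
    col, row = place(i)
    print(f"x_{i} -> a_{col}[{row}],  y_{i} -> a_{col + 1}[{row}]")
```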

Folks are often surprised that our code looks quite different from other Halo2 examples, and this single-region strategy is why. As a side effect, it also has the nice property of leaking very little information about the architecture of the model, providing a kind of functional commitment.

Supporting (almost) all ONNX operations

The ONNX format can represent over 100 operations. Our first approach was to painstakingly implement circuit-constraint equivalents one operation at a time, dealing with high-dimensional tensor shape casting and creating custom constraints and gates for each one.

That is, until we discovered tract. Tract is a library developed by the Sonos team (and mainly the work of one savant, @kali), which reduces and prunes complex ONNX graphs into compositions of a small set of operations (a set that shrinks with each new tract release). Instead of implementing 100+ operations we only need to implement 20 or so as equivalent circuit constraints.

In particular, many linear operations such as matrix multiplication, column sums, outer products, dot products, transposition, transposed matrix multiplication, matrix-vector multiplication, Hadamard products, batched matrix multiplication, tensor contraction, and even bilinear transformations over multiple tensors are represented as a single operation: the Einstein summation (einsum) operation.

For those unfamiliar, the Einstein summation convention is a powerful notation for expressing operations over multi-dimensional tensors. For a primer on it, I recommend this article; but in brief, a succinct notational convention lets us express arbitrary summations and products of elements between multidimensional tensors.

For instance to express matrix multiplication we might write:

ik,kj->ij,

which in effect expresses:

we have two 2D objects, one with dimensionality ik, the other with dimensionality kj, and our operation should produce an output of dimensionality ij by reducing along the common axis k.
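Outside of any circuit, numpy’s einsum makes the convention concrete; this is purely an illustration of the notation, not EZKL code:

```python
import numpy as np

A = np.arange(6).reshape(2, 3)    # dimensionality "ik"
B = np.arange(12).reshape(3, 4)   # dimensionality "kj"

# "ik,kj->ij": multiply along the shared axis k and sum it away.
C = np.einsum("ik,kj->ij", A, B)

assert np.array_equal(C, A @ B)   # identical to ordinary matrix multiplication
print(C.shape)                    # (2, 4), i.e. dimensionality "ij"
```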

For fun, here’s another relationship this sort of notation can express:

bn,anm,bm->ba.

We leave this as an exercise for the reader to untangle what this represents 😘.

Luckily, the constraint equivalent of einsum is just a repeated application of the (generalized) dot product argument detailed above over the appropriate input axes. In follow-up posts we’ll release the exact construction of these constraints and arguments.
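To give a feel for why a repeated dot-product argument suffices, here is a plain-Python sketch (the exact circuit construction is deferred to those follow-up posts) that evaluates the same ik,kj->ij einsum using nothing but the running-sum recurrence from the dot-product table above:

```python
def dot_accumulate(xs, ys):
    """Running-sum form of a dot product: m_i = m_{i-1} + x_i * y_i.

    Each step is exactly the relation the dot-product gate constrains."""
    m = 0
    for x, y in zip(xs, ys):
        m = m + x * y
    return m


def einsum_ik_kj_ij(A, B):
    """Matrix multiplication expressed as one dot-product argument per output cell."""
    rows, inner = len(A), len(A[0])
    cols = len(B[0])
    return [
        [dot_accumulate(A[i], [B[k][j] for k in range(inner)]) for j in range(cols)]
        for i in range(rows)
    ]


A = [[1, 2, 3], [4, 5, 6]]
B = [[7, 8], [9, 10], [11, 12]]
print(einsum_ik_kj_ij(A, B))  # [[58, 64], [139, 154]]
```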

As such, once we started to use tract in the backend and implemented einsum, we were able to support a whole world of models out of the box, including LSTMs and the self-attention layers in transformer blocks (one of the constitutive elements of nanoGPT).

Allowing for parallel region layouts in Halo2

Our “single region” strategy for Halo2, detailed above, means we are particularly well positioned to lay out circuits in parallel. When dealing with 10 million+ constraints, even generating circuit constraints or the witness can take some time, and for the sake of user sanity / lack of patience we wanted to speed this process up.

Sadly the original implementation of Halo2 does not allow for parallel region layout, and in particular the creation of the copy constraints in the circuit is not deterministic in a multi-threaded setting. These copy constraints are expressed using “cycles”, effectively directed graphs whose constitutive elements are all the Halo2 cells that should have equal witness values (i.e. are copy-constrained).

Quoting the Halo2 book directly:

For example, suppose that we have a circuit that defines the following equality constraints: a≡b, a≡c, d≡e. From this we have the equality-constraint sets {a,b,c} and {d,e}. We want to construct the permutation: (a b c) (d e). This is expressed as two disjoint cycles (a b c) and (d e).

When two cells are copy-constrained, the cycles they currently belong to are merged. For example, given two disjoint cycles (diagrams taken from the Halo2 book): $$(A\;B\;C\;D) \quad \text{and} \quad (E\;F\;G\;H)$$

A +---> B
^       +
|       |
+       v
D <---+ C       E +---> F
                ^       +
                |       |
                +       v
                H <---+ G

After adding constraint $B≡E$:

A +---> B +-------------+
^                       |
|                       |
+                       v
D <---+ C <---+ E       F
                ^       +
                |       |
                +       v
                H <---+ G

In particular, referring to step 2 of the algorithm in the Halo2 Book for permutation arguments used to build copy-constraints:

  1. Otherwise, left and right belong to different cycles. Make left the larger cycle and right the smaller one, by swapping them iff sizes(aux(left))<sizes(aux(right)).

In a multi-threaded setting, which “cycle” counts as the smaller one can be non-deterministic, and as such the constructed copy-constraint labels can differ between key generation and proving. For an illustration of how the order in which the copy constraints are generated can create different cycle directions, see the diagram below (creds to me, the artist):

(Diagram: the same set of copy constraints, inserted in different orders, producing differently oriented cycles.)

A way to mitigate this is to make this label deterministic, which we do by using different data structures for representing copy-constraint cycles in this PR to PSE Halo2.
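To give a flavour of what “deterministic” means here, below is a toy sketch. It is not the PR’s actual data structure (the class, the cell tuples, and the tie-break rule are all hypothetical); it only shows the idea of choosing a cycle representative by a fixed ordering on cells rather than by whichever thread happened to merge first:

```python
# Toy illustration only, not the linked PR's implementation: represent each
# equality-constraint cycle by a canonical representative chosen with a fixed
# total order on cells, so the result does not depend on merge order.

class Cycles:
    def __init__(self):
        self.parent = {}

    def find(self, cell):
        # Path-compressing find; cells are e.g. (column, row) tuples.
        self.parent.setdefault(cell, cell)
        root = cell
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[cell] != root:
            self.parent[cell], cell = root, self.parent[cell]
        return root

    def copy_constrain(self, left, right):
        a, b = self.find(left), self.find(right)
        if a == b:
            return
        # Deterministic choice of representative: the smaller cell in the fixed
        # ordering wins, regardless of which constraint was added first.
        keep, absorb = (a, b) if a < b else (b, a)
        self.parent[absorb] = keep


cycles = Cycles()
cycles.copy_constrain(("a0", 3), ("a2", 0))
cycles.copy_constrain(("a2", 0), ("a5", 7))
print(cycles.find(("a5", 7)))  # ('a0', 3) regardless of insertion order
```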

Once we had this implemented we could generate the witness and lay out circuit constraints in parallel.

Rescaling fixed-point arithmetic

As Halo2 (and most ZK frameworks) operates over finite fields, we are somewhat constrained in the types of arithmetic we can easily (or cheaply) represent. For speed, convenience, and codebase simplicity we currently perform operations using fixed-point arithmetic.

Given a base b, a scale s, and a shift k, we represent a decimal number x as $$b^{s}x + k,$$ which we then map to an element of the field we are operating over. By default we choose b=2 and k=0, and leave s as a tunable parameter that can be set by the user.
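A minimal Python sketch of this representation (ignoring rounding subtleties and the final mapping into the field; the parameter values are illustrative, not EZKL defaults beyond b and k):

```python
BASE = 2    # b
SHIFT = 0   # k


def quantize(x: float, scale: int) -> int:
    """Represent the decimal x as the integer round(b^s * x) + k."""
    return round(BASE ** scale * x) + SHIFT


def dequantize(q: int, scale: int) -> float:
    return (q - SHIFT) / BASE ** scale


q = quantize(1.618, scale=7)
print(q, dequantize(q, scale=7))   # 207, 1.6171875
```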

The issue is that when composing many chained multiplicative operations, as in deep networks like multi-head self-attention layers, the underlying scale of the fixed-point representation can explode: the scale of a product is the sum of the scales of its inputs.

This poses a particular problem when used in conjunction with lookup tables, which we leverage (as fixed-column lookups) to represent non-linear functions such as ReLU or e^x. Ignoring cryptographic details, a lookup table is basically two advice columns and two fixed columns of the Halo2 “grid.” The fixed columns are pre-filled and committed at circuit definition time with all the allowed input-output pairs. The advice columns are filled at proving time, and an argument is made that every input-output pair in the witness appears among the pairs in the pre-filled fixed columns.

Using random linear combination ideas, the argument conceptually reduces to looking up one advice column in one fixed column. So when we write $$ReLU(x),$$ what this means in a circuit is that the provided input-output witness values $$(x, ReLU(x))$$ should be a row of the pre-committed lookup table.
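Conceptually (and setting aside the actual cryptographic lookup argument), the check amounts to membership of each witness pair in a pre-committed table. A toy illustration with a deliberately tiny range, which also shows what “falling out of range” means:

```python
# Toy illustration of a fixed-column lookup for ReLU over a small quantized range.
# The real argument is a Halo2 lookup, not a Python set-membership check.

RANGE = range(-8, 8)
RELU_TABLE = {(x, max(x, 0)) for x in RANGE}   # pre-committed (input, output) pairs


def check_relu_witness(pairs):
    """Every (x, ReLU(x)) pair the prover supplies must appear in the fixed table."""
    return all(pair in RELU_TABLE for pair in pairs)


print(check_relu_witness([(-3, 0), (5, 5)]))   # True: valid rows of the table
print(check_relu_witness([(9, 9)]))            # False: 9 is outside the table's range
```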

When the fixed-point scale explodes, witness pairs can quickly fall out of the range of our lookups. To keep the scale under control we introduce a heuristic rescaling method where we periodically “divide down” elements whose fixed-point scale has grown too large. This division is also performed using a lookup.
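A hedged sketch of the divide-down idea in plain Python (the threshold and target values are illustrative, and the heuristics EZKL actually uses, as well as the in-circuit lookup that performs the division, are not shown here):

```python
MAX_SCALE = 14   # illustrative threshold beyond which we rescale


def maybe_rescale(q: int, scale: int, target: int = 7, base: int = 2):
    """If the fixed-point scale has grown too large, divide the value back down.

    In the circuit this division is itself performed via a lookup table."""
    if scale <= MAX_SCALE:
        return q, scale
    shift = scale - target
    return q // base ** shift, target   # integer division stands in for the lookup


# 192 is 1.5 at scale 7; three chained multiplications push the scale to 21,
# at which point we divide the value back down to scale 7 before continuing.
q, s = maybe_rescale(192 * 192 * 192, 21)
print(q, s)   # 432, 7  ->  432 / 2**7 == 3.375 == 1.5**3
```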

nanoGPT in a SNARK

This combination of improvements means we were able to get nanoGPT, exported as an ONNX file, into a Halo2 circuit. Our latest benchmarks with public outputs and fixed model parameters and input data can be tracked here. In this setting we are proving a statement akin to:

I ran this public nanoGPT-like model on some private data and it produced an output which matches this publicly known value.

Note that in the fixed setting above we leverage a few facts to reduce the set of arguments in the circuit. Notably, if two fixed “constants” a and b are part of a constraint of the form

$$ a \circ b = c$$

then we can omit constraining this argument and instead propagate c as a fixed constant in the rest of the circuit (until it reaches a set of constraints that aren’t between two constants). We also simplify trivial constraints of the form $$x \circ 0 = y,$$ where x and y can be variables. If the operation is multiplicative then y can be fixed to the constant 0 and propagated as such. If the operation is additive then we can set y equal to x by way of a copy constraint.
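A rough sketch of these two simplifications as a pass over graph nodes (illustrative logic in plain Python, not EZKL’s actual graph rewriter; the node encoding is hypothetical):

```python
# Illustrative constant-folding pass over (op, lhs, rhs) nodes.
# A value is either ("const", c) or ("var", name).

def simplify(op, lhs, rhs):
    """Return either a folded constant, a copied variable, or the constraint to lay out."""
    # Two fixed constants: compute c now and propagate it, no constraint needed.
    if lhs[0] == "const" and rhs[0] == "const":
        c = lhs[1] * rhs[1] if op == "mul" else lhs[1] + rhs[1]
        return ("const", c)
    # x * 0: the output is the constant 0, no constraint needed.
    if op == "mul" and ("const", 0) in (lhs, rhs):
        return ("const", 0)
    # x + 0: the output is just x, enforced with a copy constraint instead of a gate.
    if op == "add" and ("const", 0) in (lhs, rhs):
        return lhs if rhs == ("const", 0) else rhs
    return ("constrain", op, lhs, rhs)


print(simplify("mul", ("const", 3), ("const", 4)))  # ('const', 12)
print(simplify("mul", ("var", "x"), ("const", 0)))  # ('const', 0)
print(simplify("add", ("var", "x"), ("const", 0)))  # ('var', 'x') via copy constraint
print(simplify("add", ("var", "x"), ("var", "y")))  # a real constraint remains
```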

We ran the benchmarks on the following spec:

CPU: AMD EPYC 7451 24-Core Processor

RAM: 256 GB DDR4 Synchronous Registered (Buffered) 2667 MHz (0.4 ns)

Results:

Tiny model (4 transformer layers ~ 250k parameters ~ 1 million rows ~ K=20):

| Description              | Time  |
|--------------------------|-------|
| verifying key generation | 192s  |
| proving key generation   | 212s  |
| proof time               | 237s  |
| verify time              | 0.34s |

Small model (20 transformer layers ~ 1 million parameters ~ 3 million rows ~ K=20) (column overflow):

| Description              | Time  |
|--------------------------|-------|
| verifying key generation | 474s  |
| proving key generation   | 569s  |
| proof time               | 966s  |
| verify time              | 0.42s |

We also ran the private-network setting on an Apple M1 Pro with 32 GB of RAM, meaning that we are proving a statement akin to:

I ran this private nanoGPT-like model on some private data and it produced an output which matches this publicly known value.

Note that the methods above for reducing the number of constraints in the fixed setting apply less often here, as the parameters are not fixed constants but variable advice, which can vary from proof to proof. Even though we are operating on a much smaller laptop (in terms of memory and number of cores), with more constraints, this operation still completes successfully for the tiny model! To reproduce this on your own laptop:

ezkl gen-settings -M network.onnx --num-inner-cols 3
ezkl calibrate-settings -M network.onnx -D input.json --target="resources"
# now manually set the calibrated settings.json rows to 20 by opening settings.json and changing run_args.logrows to 20
ezkl compile-circuit -M network.onnx --settings-path settings.json --compiled-circuit network.compiled
ezkl gen-srs --logrows=20 --srs-path=kzg.srs # note that you can also use get-srs to fetch a pre-generated public ceremony from AWS S3: ezkl get-srs -S settings.json
ezkl setup -M network.compiled --srs-path=kzg.srs
ezkl gen-witness -D input.json -M network.compiled
ezkl prove -M network.compiled -W witness.json --pk-path pk.key --srs-path kzg.srs
ezkl verify --proof-path proof.proof -S settings.json --vk-path=vk.key --srs-path=kzg.srs

Results:

Tiny model (4 transformer layers ~ 250k parameters ~ 6 million rows ~ K=20) (column overflow):

| Description              | Time  |
|--------------------------|-------|
| verifying key generation | 268s  |
| proving key generation   | 417s  |
| proof time               | 2189s |
| verify time              | 0.34s |

Future work

We’re excited to keep improving on these proving times. In particular, we’re keen to see where hardware acceleration can take us in terms of performance (creds to @weikengchen for a great discussion on this).