PQC and Keccak on Karu

Here at Manse Processors, we have been working on Post-Quantum Cryptography since 2015!

Well, I was actually in Belfast doing my final post-doc then, and thanks to Prof. Máire & Co I can now claim “10+ years of PQC experience”. Anyway, let me try to explain what RISC-V PQC TG currently plans to do about PQC first..

Note

TL;DR: The Keccak instruction vkeccak.vi proposed in PQC TG in RISC-V International is implemented in our karu64 core and makes standard lattice-based PQC algorithms go 50% faster. The more you optimize the rest, the bigger the Keccak share becomes and the greater the relative benefit of Keccak.

PQC Standards vs Keccak

It turns out that cryptographic hash functions (the actual subject matter of my 2009 PhD) are incredibly important for PQC. From a practical viewpoint, the two main PQC algorithms are:

FIPS 203 “Module-Lattice-Based Key-Encapsulation Mechanism Standard” (ML-KEM) is the recommended PQC key establishment method.
FIPS 204 “Module-Lattice-Based Digital Signature Standard” (ML-DSA) is the recommended PQC authentication and integrity protection method.

Basic quantitative analysis of ML-KEM and ML-DSA reveals that often more than 50% of their execution time is actually spent computing the KECCAK-p[1600, 24] permutation, the core of the SHAKE extensible-output function (XOF). SHAKE is specified as a part of the SHA-3 standard FIPS 202.

Intuitively, you wouldn’t necessarily expect this — symmetric cryptography “support” functions consume a negligible percentage of cycles in traditional asymmetric RSA and Elliptic Curve cryptography. For the most part, ML-KEM and ML-DSA are not using SHAKE to hash (“absorb”) input data, but to expand (“squeeze”) random bits for various purposes.

Naturally, the hash-based signature algorithms SLH-DSA (FIPS 205) and LMS & XMSS (SP 800-208) consist almost exclusively of hash computation, so any Keccak speed-up is proportionally reflected on the performance of the SHAKE parameter sets of these algorithms.

Keccak is Hard to Vectorize

Nicolas Brunie (SiFive) wrote a blog post about RVV Implementation of Keccak / SHA-3 with a subtitle “Leveraging RISC-V Vector to slow down SHA-3 software implementation”. Why?

The Keccak permutation has a very efficient mixing transformation; the downside is that the particular access pattern of words makes efficient vectorization difficult. Since all 25 state words fit into the register file simultaneously, fast scalar execution is generally faster.

Main steps of the Keccak permutation. — The Keccak permutation operates on a 5×5 = 25 words, each 64 bits; a total of 1600 bits. Each of 24 rounds is a composition of steps A_i+1 = χ(π(ρ(θ(A_i)))) ⊕ rc_i.

The linear mixing step Theta (θ) is just xors and the Rho (ρ) step is just rotations; that’s why the availability of scalar rori makes a lot of difference for scalar Keccak speed. The word permutation PI (π) is just scalar register re-indexing, while the only nonlinear step Chi (χ) is a logical and operation with other input inverted; the bitmanip instruction andn. The cost of a full 24-round permutation ends up being at 3000+ instructions; in practice, the compilers emit 4000+ instructions if rori and andn are available (bitmanip extension), and 6000+ if they are not (regular RV64GC).

One can, of course, parallelize VLEN/64 independent Keccak permutations with the vector unit. ML-KEM and ML-DSA can utilize such parallelism for some operations. The speedup is significantly reduced because vector instructions are generally not issued at the same rate as scalar instructions.

Keccak in Pure Hardware: Very, very fast

The structure is very fast in pure hardware, allowing 1 or 2 rounds per cycle even at very high operating frequencies. This is because the logical depth of each round is small; each round has only a handful of gates between input and output. Note that steps Rho (ρ) and PI (π) are just wiring; no gates.

While fast, the permutation is not very small; the cycle-per-round implementation in Karu is roughly 30 kGE, the size of a very small microcontroller core. However, for application processors, the exceptional, typically 100-fold, difference in hardware and software speed makes it sensible to do the entire permutation with a single instruction.

Hence, the main proposal from the RISC-V Post-Quantum Cryptography Task Group is a single instruction, Vector-Immediate Multi-Round Keccak, with preliminary mnemonic vkeccak.vi.

While the permutation itself is easily 24 cycles, that is not the latency of the permutation due to the way vector register files are organized. A typical implementation will spend several times as much time getting data to and from the VRF as on the permutation itself. However, even if the permutation takes 100 cycles, it’s still orders of magnitude faster than executing thousands of instructions.

`vkeccak`: Specification Status

The implemented variant of the instruction performs Keccak in place — in register vd. The 5-bit immediate value specifies the number of rounds. This allows both SHA-3 functions and variants such as KangarooTwelve and TurboSHAKE to be implemented.

vkeccak.vi vd, imm5

The main proposal is that the source/destination register vd specify a LMUL = 2048/VLEN sized register group. In Karu, we have VLEN=256, and hence LMUL=8. The 25-word Keccak state 00 .. 24 can be mapped into 4 possible locations (vector register groups of size LMUL=8).; vd can be { V0, V8, V16, V24 }:

 V0: [00 01 02 03]   V1: [04 05 06 07]   V2: [08 09 10 11]   V3: [12 13 14 15]
 V4: [16 17 18 19]   V5: [20 21 22 23]   V6: [24 -- -- --]   V7: [-- -- -- --]

 V8: [00 01 02 03]   V9: [04 05 06 07]  V10: [08 09 10 11]  V11: [12 13 14 15]
V12: [16 17 18 19]  V13: [20 21 22 23]  V14: [24 -- -- --]  V15: [-- -- -- --]

V16: [00 01 02 03]  V17: [04 05 06 07]  V18: [08 09 10 11]  V19: [12 13 14 15]
V20: [16 17 18 19]  V21: [20 21 22 23]  V22: [24 -- -- --]  V23: [-- -- -- --]

V24: [00 01 02 03]  V25: [04 05 06 07]  V26: [08 09 10 11]  V27: [12 13 14 15]
V28: [16 17 18 19]  V29: [20 21 22 23]  V30: [24 -- -- --]  V31: [-- -- -- --]

There is a draft specification: zvknhk.adoc, also rendered as Chapter 31 here: riscv-spec.pdf. Furthermore, a private repo keccak-xrv provides tests for the instruction that can be run with a patched version of the Spike golden model/simulator.

Open and Semi-Open Issues

There are some open issues regarding how the 1600-bit state is mapped to the vector register file across various physical vector register sizes (VLEN).

For VLEN=256 and higher, the situation is relatively straightforward; ceil(1600/VLEN) registers are required; 7 registers for VLEN=256 and 4 registers with VLEN=512, etc. These fit into normal register group sizes. However, VLEN=128 is problematic: 13 registers are required, exceeding the maximum register group size, LMUL=8. This could be resolved simply by considering the group size being implicit for the Keccak instruction. From an implementation viewpoint, the FSM (or similar) for accessing the VRF is likely unique to it in any case.
A further open question is whether any 5-bit number of rounds should be admissible. Allowing any intermediate can complicate highly optimized hardware realizations that computes double-rounds or even triple-rounds per cycle. In practice, the Keccak permutation is used only with 12 or 24 rounds, which could be expressed with a single bit (for “Keccak” and “TurboKeccak”), leaving an additional 4 bits reserved for other use.
There is also the question of whether vector register V0 should be avoided for some VLEN sizes, as it also serves as the mask register.

Benchmarking PQC Code

Our ML-KEM and ML-DSA implementations in KaruDeb have been derived from the original “ANSI C” Kyber and Dilithium code, with some modifications and options.

There is a central macro flag for the Keccak permutation that is implemented alternatively with the Keccak instruction (VK_KECCAK == 1) or with a reasonably fast scalar code (VK_KECCAK == 0).
The RISC-V Autovectorizer in recent versions of LLVM works really for some targets, but unfortunately it doesn’t know that our vector unit is relatively slow compared to the scalar ALU. The C code was compiled with the clang cross-compiler version 23.0.0git, which I built on June 1, 2026 (commit c07f4eef0945cf8e3b1d7480cbfcfa19f79d885f). There was no special target tuning; simply -O3 flag is used, and architecture flags specify which extensions are allowed.
Certain parts have hand-written vector intrinsics optimizations controlled by flags MLDSA_RVV == 1 and MLKEM_RVV == 1. This is especially relevant for the shuffles in the Number Theoretic Transforms (vrgather shuffle instruction) and for rejection sampling in the A matrix generation (vcompress). The code is currently limited to our VLEN = 256 size, and may not be the final word in ML-KEM and ML-DSA optimization, but it still beats LLVM & GCC autovectorization in most cases.
In these implementations, different parameter sets share the same execution paths, eliminating the need to re-instantiate multiple full versions of the algorithms for different security levels. This may make implementations slightly slower due to compiler optimizations (fewer opportunities to embed constants in instruction immediates, etc.), but it was originally done for code size and to simplify the build flow.
The cycle counts are read with the standard RISC-V rdcycles instruction. Especially in a multi-user or multi-core Linux system, this register may be filtered/managed. The KaruDeb has a little utility for that: perf_run.c opens Linux perf events and then execs the actual benchmark binary (and its parameters) from the command line.

Warning

Even though these implementations pass the basic NIST ACVP KAT and hence are functionally correct 99.9% of the time, their implementation assurance level is nowhere near projects such as PQ Code Package. So, at present, they are intended only for performance testing.

Results

Here we are giving some raw cycle numbers; you can see that Karu is not as fast as most other processors (its CPI — the average cycles/instruction number when running general benchmarks such as CoreMark — is pretty high). That wasn’t our goal with this single-issue in-order CPU that doesn’t even have proper caches; we wanted to achieve feature completeness while still fitting a full application vector processor in an FPGA. In absolute terms, these numbers tell nothing about “RISC-V” — if I built a processor like this with x86 or ARM ISA, it would have an equally bad CPI.

Hence, “speed” is computed as before-cycles / after-cycles on the same target, so 1.00x means the same speed, and values above 1.00x mean the second build is faster. Averages are arithmetic means over the nine top-level operations for each algorithm.

If you are interested in instruction counts rather than cycle counts (which are purely ISA dependent), you can obtain those offline using the spike simulator and the test matrix scripts for ML-KEM and ML-DSA.

Impact of Scalar Bitmanip Alone (+15%)

(Simply enabling the Zbb flag in the compiler. This is almost entirely thanks to RORI and ANDN making Keccak faster.)

ML-KEM: RV64GC vs RV64GC+Zbb

Parameter	Operation	Before cycles	After cycles	Speed
ML-KEM-512	KeyGen	2,295,030	1,966,678	1.17x
ML-KEM-512	Encaps	2,662,499	2,360,391	1.13x
ML-KEM-512	Decaps	3,380,772	3,054,179	1.11x
ML-KEM-768	KeyGen	3,820,038	3,300,756	1.16x
ML-KEM-768	Encaps	4,415,354	3,902,175	1.13x
ML-KEM-768	Decaps	5,364,660	4,828,285	1.11x
ML-KEM-1024	KeyGen	6,033,467	5,215,560	1.16x
ML-KEM-1024	Encaps	6,612,651	5,796,009	1.14x
ML-KEM-1024	Decaps	7,823,228	6,997,590	1.12x
ML-KEM average				1.14x

ML-DSA: RV64GC vs RV64GC+Zbb

Parameter	Operation	Before cycles	After cycles	Speed
ML-DSA-44	KeyGen	7,338,818	6,188,665	1.19x
ML-DSA-44	Sign	19,964,933	18,210,571	1.10x
ML-DSA-44	Verify	7,889,854	6,772,220	1.17x
ML-DSA-65	KeyGen	12,625,934	10,479,449	1.20x
ML-DSA-65	Sign	39,670,452	36,420,314	1.09x
ML-DSA-65	Verify	13,024,223	11,005,438	1.18x
ML-DSA-87	KeyGen	21,264,381	17,715,313	1.20x
ML-DSA-87	Sign	47,264,050	42,618,207	1.11x
ML-DSA-87	Verify	21,955,096	18,462,775	1.19x
ML-DSA average				1.16x

Impact of Keccak without Intrinsics (+40%)

(Autovectorization in use, no intrinsics. No modification to source code except addition of Keccak instruction.)

ML-KEM: RV64GCV+Zbb vs RV64GCV+Keccak

Parameter	Operation	Before cycles	After cycles	Speed
ML-KEM-512	KeyGen	2,253,970	1,569,681	1.44x
ML-KEM-512	Encaps	2,804,064	2,136,533	1.31x
ML-KEM-512	Decaps	3,701,594	3,028,184	1.22x
ML-KEM-768	KeyGen	3,858,323	2,762,166	1.40x
ML-KEM-768	Encaps	4,604,880	3,461,625	1.33x
ML-KEM-768	Decaps	5,801,337	4,665,072	1.24x
ML-KEM-1024	KeyGen	6,025,810	4,253,128	1.42x
ML-KEM-1024	Encaps	6,903,974	5,091,228	1.36x
ML-KEM-1024	Decaps	8,425,032	6,622,388	1.27x
ML-KEM average				1.33x

ML-DSA: RV64GCV+Zbb vs RV64GCV+Keccak

Parameter	Operation	Before cycles	After cycles	Speed
ML-DSA-44	KeyGen	6,819,445	4,253,936	1.60x
ML-DSA-44	Sign	23,335,515	19,444,154	1.20x
ML-DSA-44	Verify	7,876,425	5,447,570	1.45x
ML-DSA-65	KeyGen	11,369,397	6,740,834	1.69x
ML-DSA-65	Sign	46,010,689	38,785,842	1.19x
ML-DSA-65	Verify	12,515,502	8,232,686	1.52x
ML-DSA-87	KeyGen	18,805,211	10,811,598	1.74x
ML-DSA-87	Sign	52,663,308	42,433,698	1.24x
ML-DSA-87	Verify	20,369,042	12,719,756	1.60x
ML-DSA average				1.47x

Impact of Intrinsics alone (+25%)

(No Keccak instruction. Impact of my optimizations with vector intrinsics over autovectorization.)

ML-KEM: RV64GCV+Zbb vs RV64GCV+Zbb+Intrin

Parameter	Operation	Before cycles	After cycles	Speed
ML-KEM-512	KeyGen	2,253,970	1,815,728	1.24x
ML-KEM-512	Encaps	2,804,064	2,164,385	1.30x
ML-KEM-512	Decaps	3,701,594	2,811,828	1.32x
ML-KEM-768	KeyGen	3,858,323	2,996,183	1.29x
ML-KEM-768	Encaps	4,604,880	3,465,683	1.33x
ML-KEM-768	Decaps	5,801,337	4,328,674	1.34x
ML-KEM-1024	KeyGen	6,025,810	4,619,065	1.30x
ML-KEM-1024	Encaps	6,903,974	5,167,511	1.34x
ML-KEM-1024	Decaps	8,425,032	6,257,855	1.35x
ML-KEM average				1.31x

ML-DSA: RV64GCV+Zbb vs RV64GCV+Zbb+Intrin

Parameter	Operation	Before cycles	After cycles	Speed
ML-DSA-44	KeyGen	6,819,445	6,033,446	1.13x
ML-DSA-44	Sign	23,335,515	18,163,803	1.28x
ML-DSA-44	Verify	7,876,425	6,675,913	1.18x
ML-DSA-65	KeyGen	11,369,397	10,282,689	1.11x
ML-DSA-65	Sign	46,010,689	36,023,519	1.28x
ML-DSA-65	Verify	12,515,502	10,828,080	1.16x
ML-DSA-87	KeyGen	18,805,211	17,248,740	1.09x
ML-DSA-87	Sign	52,663,308	42,379,474	1.24x
ML-DSA-87	Verify	20,369,042	18,061,301	1.13x
ML-DSA average				1.18x

With Vector Intrinsics: Impact of Keccak (+52%)

(This is the headline number. With vector intrinsics the relative share of Keccak cycles increases, so the impact of the Keccak instruction is larger.)

ML-KEM: RV64GCV+Zbb+Intrin vs RV64GCV+Intrin+Keccak

Parameter	Operation	Before cycles	After cycles	Speed
ML-KEM-512	KeyGen	1,815,728	1,156,131	1.57x
ML-KEM-512	Encaps	2,164,385	1,516,204	1.43x
ML-KEM-512	Decaps	2,811,828	2,164,080	1.30x
ML-KEM-768	KeyGen	2,996,183	1,934,499	1.55x
ML-KEM-768	Encaps	3,465,683	2,384,597	1.45x
ML-KEM-768	Decaps	4,328,674	3,248,962	1.33x
ML-KEM-1024	KeyGen	4,619,065	2,925,985	1.58x
ML-KEM-1024	Encaps	5,167,511	3,439,038	1.50x
ML-KEM-1024	Decaps	6,257,855	4,550,838	1.38x
ML-KEM average				1.45x

ML-DSA: RV64GCV+Zbb+Intrin vs RV64GCV+Intrin+Keccak

Parameter	Operation	Before cycles	After cycles	Speed
ML-DSA-44	KeyGen	6,033,446	3,442,754	1.75x
ML-DSA-44	Sign	18,163,803	14,196,803	1.28x
ML-DSA-44	Verify	6,675,913	4,218,008	1.58x
ML-DSA-65	KeyGen	10,282,689	5,572,035	1.85x
ML-DSA-65	Sign	36,023,519	28,784,995	1.25x
ML-DSA-65	Verify	10,828,080	6,564,131	1.65x
ML-DSA-87	KeyGen	17,248,740	9,112,433	1.89x
ML-DSA-87	Sign	42,379,474	32,177,514	1.32x
ML-DSA-87	Verify	18,061,301	10,303,710	1.75x
ML-DSA average				1.59x