Skip to content
KaruCore
Go back

PQC and Keccak on Karu

By Markku-Juhani O. Saarinen

Here at Manse Processors, we have been working on Post-Quantum Cryptography since 2015!

Well, I was actually in Belfast doing my final post-doc then, and thanks to Prof. Máire & Co I can now claim “10+ years of PQC experience”. Anyway, let me try to explain what RISC-V PQC TG currently plans to do about PQC first..

Note

TL;DR: The Keccak instruction vkeccak.vi proposed in PQC TG in RISC-V International is implemented in our karu64 core and makes standard lattice-based PQC algorithms go 50% faster. The more you optimize the rest, the bigger the Keccak share becomes and the greater the relative benefit of Keccak.

PQC Standards vs Keccak

It turns out that cryptographic hash functions (the actual subject matter of my 2009 PhD) are incredibly important for PQC. From a practical viewpoint, the two main PQC algorithms are:

Basic quantitative analysis of ML-KEM and ML-DSA reveals that often more than 50% of their execution time is actually spent computing the KECCAK-p[1600, 24] permutation, the core of the SHAKE extensible-output function (XOF). SHAKE is specified as a part of the SHA-3 standard FIPS 202.

Intuitively, you wouldn’t necessarily expect this — symmetric cryptography “support” functions consume a negligible percentage of cycles in traditional asymmetric RSA and Elliptic Curve cryptography. For the most part, ML-KEM and ML-DSA are not using SHAKE to hash (“absorb”) input data, but to expand (“squeeze”) random bits for various purposes.

Naturally, the hash-based signature algorithms SLH-DSA (FIPS 205) and LMS & XMSS (SP 800-208) consist almost exclusively of hash computation, so any Keccak speed-up is proportionally reflected on the performance of the SHAKE parameter sets of these algorithms.

Keccak is Hard to Vectorize

Nicolas Brunie (SiFive) wrote a blog post about RVV Implementation of Keccak / SHA-3 with a subtitle “Leveraging RISC-V Vector to slow down SHA-3 software implementation”. Why?

The Keccak permutation has a very efficient mixing transformation; the downside is that the particular access pattern of words makes efficient vectorization difficult. Since all 25 state words fit into the register file simultaneously, fast scalar execution is generally faster.

Main steps of the Keccak permutation.
The Keccak permutation operates on a 5×5 = 25 words, each 64 bits; a total of 1600 bits. Each of 24 rounds is a composition of steps Ai+1 = χ(π(ρ(θ(Ai)))) ⊕ rci.

The linear mixing step Theta (θ) is just xors and the Rho (ρ) step is just rotations; that’s why the availability of scalar rori makes a lot of difference for scalar Keccak speed. The word permutation PI (π) is just scalar register re-indexing, while the only nonlinear step Chi (χ) is a logical and operation with other input inverted; the bitmanip instruction andn. The cost of a full 24-round permutation ends up being at 3000+ instructions; in practice, the compilers emit 4000+ instructions if rori and andn are available (bitmanip extension), and 6000+ if they are not (regular RV64GC).

One can, of course, parallelize VLEN/64 independent Keccak permutations with the vector unit. ML-KEM and ML-DSA can utilize such parallelism for some operations. The speedup is significantly reduced because vector instructions are generally not issued at the same rate as scalar instructions.

Keccak in Pure Hardware: Very, very fast

The structure is very fast in pure hardware, allowing 1 or 2 rounds per cycle even at very high operating frequencies. This is because the logical depth of each round is small; each round has only a handful of gates between input and output. Note that steps Rho (ρ) and PI (π) are just wiring; no gates.

While fast, the permutation is not very small; the cycle-per-round implementation in Karu is roughly 30 kGE, the size of a very small microcontroller core. However, for application processors, the exceptional, typically 100-fold, difference in hardware and software speed makes it sensible to do the entire permutation with a single instruction.

Hence, the main proposal from the RISC-V Post-Quantum Cryptography Task Group is a single instruction, Vector-Immediate Multi-Round Keccak, with preliminary mnemonic vkeccak.vi.

While the permutation itself is easily 24 cycles, that is not the latency of the permutation due to the way vector register files are organized. A typical implementation will spend several times as much time getting data to and from the VRF as on the permutation itself. However, even if the permutation takes 100 cycles, it’s still orders of magnitude faster than executing thousands of instructions.

vkeccak: Specification Status

The implemented variant of the instruction performs Keccak in place — in register vd. The 5-bit immediate value specifies the number of rounds. This allows both SHA-3 functions and variants such as KangarooTwelve and TurboSHAKE to be implemented.

vkeccak.vi vd, imm5

The main proposal is that the source/destination register vd specify a LMUL = 2048/VLEN sized register group. In Karu, we have VLEN=256, and hence LMUL=8. The 25-word Keccak state 00 .. 24 can be mapped into 4 possible locations (vector register groups of size LMUL=8).; vd can be { V0, V8, V16, V24 }:

 V0: [00 01 02 03]   V1: [04 05 06 07]   V2: [08 09 10 11]   V3: [12 13 14 15]
 V4: [16 17 18 19]   V5: [20 21 22 23]   V6: [24 -- -- --]   V7: [-- -- -- --]

 V8: [00 01 02 03]   V9: [04 05 06 07]  V10: [08 09 10 11]  V11: [12 13 14 15]
V12: [16 17 18 19]  V13: [20 21 22 23]  V14: [24 -- -- --]  V15: [-- -- -- --]

V16: [00 01 02 03]  V17: [04 05 06 07]  V18: [08 09 10 11]  V19: [12 13 14 15]
V20: [16 17 18 19]  V21: [20 21 22 23]  V22: [24 -- -- --]  V23: [-- -- -- --]

V24: [00 01 02 03]  V25: [04 05 06 07]  V26: [08 09 10 11]  V27: [12 13 14 15]
V28: [16 17 18 19]  V29: [20 21 22 23]  V30: [24 -- -- --]  V31: [-- -- -- --]

There is a draft specification: zvknhk.adoc, also rendered as Chapter 31 here: riscv-spec.pdf. Furthermore, a private repo keccak-xrv provides tests for the instruction that can be run with a patched version of the Spike golden model/simulator.

Open and Semi-Open Issues

There are some open issues regarding how the 1600-bit state is mapped to the vector register file across various physical vector register sizes (VLEN).

Benchmarking PQC Code

Our ML-KEM and ML-DSA implementations in KaruDeb have been derived from the original “ANSI C” Kyber and Dilithium code, with some modifications and options.

Warning

Even though these implementations pass the basic NIST ACVP KAT and hence are functionally correct 99.9% of the time, their implementation assurance level is nowhere near projects such as PQ Code Package. So, at present, they are intended only for performance testing.

Results

Here we are giving some raw cycle numbers; you can see that Karu is not as fast as most other processors (its CPI — the average cycles/instruction number when running general benchmarks such as CoreMark — is pretty high). That wasn’t our goal with this single-issue in-order CPU that doesn’t even have proper caches; we wanted to achieve feature completeness while still fitting a full application vector processor in an FPGA. In absolute terms, these numbers tell nothing about “RISC-V” — if I built a processor like this with x86 or ARM ISA, it would have an equally bad CPI.

Hence, “speed” is computed as before-cycles / after-cycles on the same target, so 1.00x means the same speed, and values above 1.00x mean the second build is faster. Averages are arithmetic means over the nine top-level operations for each algorithm.

If you are interested in instruction counts rather than cycle counts (which are purely ISA dependent), you can obtain those offline using the spike simulator and the test matrix scripts for ML-KEM and ML-DSA.

Impact of Scalar Bitmanip Alone (+15%)

(Simply enabling the Zbb flag in the compiler. This is almost entirely thanks to RORI and ANDN making Keccak faster.)

ML-KEM: RV64GC vs RV64GC+Zbb

ParameterOperationBefore cyclesAfter cyclesSpeed
ML-KEM-512KeyGen2,295,0301,966,6781.17x
ML-KEM-512Encaps2,662,4992,360,3911.13x
ML-KEM-512Decaps3,380,7723,054,1791.11x
ML-KEM-768KeyGen3,820,0383,300,7561.16x
ML-KEM-768Encaps4,415,3543,902,1751.13x
ML-KEM-768Decaps5,364,6604,828,2851.11x
ML-KEM-1024KeyGen6,033,4675,215,5601.16x
ML-KEM-1024Encaps6,612,6515,796,0091.14x
ML-KEM-1024Decaps7,823,2286,997,5901.12x
ML-KEM average1.14x

ML-DSA: RV64GC vs RV64GC+Zbb

ParameterOperationBefore cyclesAfter cyclesSpeed
ML-DSA-44KeyGen7,338,8186,188,6651.19x
ML-DSA-44Sign19,964,93318,210,5711.10x
ML-DSA-44Verify7,889,8546,772,2201.17x
ML-DSA-65KeyGen12,625,93410,479,4491.20x
ML-DSA-65Sign39,670,45236,420,3141.09x
ML-DSA-65Verify13,024,22311,005,4381.18x
ML-DSA-87KeyGen21,264,38117,715,3131.20x
ML-DSA-87Sign47,264,05042,618,2071.11x
ML-DSA-87Verify21,955,09618,462,7751.19x
ML-DSA average1.16x

Impact of Keccak without Intrinsics (+40%)

(Autovectorization in use, no intrinsics. No modification to source code except addition of Keccak instruction.)

ML-KEM: RV64GCV+Zbb vs RV64GCV+Keccak

ParameterOperationBefore cyclesAfter cyclesSpeed
ML-KEM-512KeyGen2,253,9701,569,6811.44x
ML-KEM-512Encaps2,804,0642,136,5331.31x
ML-KEM-512Decaps3,701,5943,028,1841.22x
ML-KEM-768KeyGen3,858,3232,762,1661.40x
ML-KEM-768Encaps4,604,8803,461,6251.33x
ML-KEM-768Decaps5,801,3374,665,0721.24x
ML-KEM-1024KeyGen6,025,8104,253,1281.42x
ML-KEM-1024Encaps6,903,9745,091,2281.36x
ML-KEM-1024Decaps8,425,0326,622,3881.27x
ML-KEM average1.33x

ML-DSA: RV64GCV+Zbb vs RV64GCV+Keccak

ParameterOperationBefore cyclesAfter cyclesSpeed
ML-DSA-44KeyGen6,819,4454,253,9361.60x
ML-DSA-44Sign23,335,51519,444,1541.20x
ML-DSA-44Verify7,876,4255,447,5701.45x
ML-DSA-65KeyGen11,369,3976,740,8341.69x
ML-DSA-65Sign46,010,68938,785,8421.19x
ML-DSA-65Verify12,515,5028,232,6861.52x
ML-DSA-87KeyGen18,805,21110,811,5981.74x
ML-DSA-87Sign52,663,30842,433,6981.24x
ML-DSA-87Verify20,369,04212,719,7561.60x
ML-DSA average1.47x

Impact of Intrinsics alone (+25%)

(No Keccak instruction. Impact of my optimizations with vector intrinsics over autovectorization.)

ML-KEM: RV64GCV+Zbb vs RV64GCV+Zbb+Intrin

ParameterOperationBefore cyclesAfter cyclesSpeed
ML-KEM-512KeyGen2,253,9701,815,7281.24x
ML-KEM-512Encaps2,804,0642,164,3851.30x
ML-KEM-512Decaps3,701,5942,811,8281.32x
ML-KEM-768KeyGen3,858,3232,996,1831.29x
ML-KEM-768Encaps4,604,8803,465,6831.33x
ML-KEM-768Decaps5,801,3374,328,6741.34x
ML-KEM-1024KeyGen6,025,8104,619,0651.30x
ML-KEM-1024Encaps6,903,9745,167,5111.34x
ML-KEM-1024Decaps8,425,0326,257,8551.35x
ML-KEM average1.31x

ML-DSA: RV64GCV+Zbb vs RV64GCV+Zbb+Intrin

ParameterOperationBefore cyclesAfter cyclesSpeed
ML-DSA-44KeyGen6,819,4456,033,4461.13x
ML-DSA-44Sign23,335,51518,163,8031.28x
ML-DSA-44Verify7,876,4256,675,9131.18x
ML-DSA-65KeyGen11,369,39710,282,6891.11x
ML-DSA-65Sign46,010,68936,023,5191.28x
ML-DSA-65Verify12,515,50210,828,0801.16x
ML-DSA-87KeyGen18,805,21117,248,7401.09x
ML-DSA-87Sign52,663,30842,379,4741.24x
ML-DSA-87Verify20,369,04218,061,3011.13x
ML-DSA average1.18x

With Vector Intrinsics: Impact of Keccak (+52%)

(This is the headline number. With vector intrinsics the relative share of Keccak cycles increases, so the impact of the Keccak instruction is larger.)

ML-KEM: RV64GCV+Zbb+Intrin vs RV64GCV+Intrin+Keccak

ParameterOperationBefore cyclesAfter cyclesSpeed
ML-KEM-512KeyGen1,815,7281,156,1311.57x
ML-KEM-512Encaps2,164,3851,516,2041.43x
ML-KEM-512Decaps2,811,8282,164,0801.30x
ML-KEM-768KeyGen2,996,1831,934,4991.55x
ML-KEM-768Encaps3,465,6832,384,5971.45x
ML-KEM-768Decaps4,328,6743,248,9621.33x
ML-KEM-1024KeyGen4,619,0652,925,9851.58x
ML-KEM-1024Encaps5,167,5113,439,0381.50x
ML-KEM-1024Decaps6,257,8554,550,8381.38x
ML-KEM average1.45x

ML-DSA: RV64GCV+Zbb+Intrin vs RV64GCV+Intrin+Keccak

ParameterOperationBefore cyclesAfter cyclesSpeed
ML-DSA-44KeyGen6,033,4463,442,7541.75x
ML-DSA-44Sign18,163,80314,196,8031.28x
ML-DSA-44Verify6,675,9134,218,0081.58x
ML-DSA-65KeyGen10,282,6895,572,0351.85x
ML-DSA-65Sign36,023,51928,784,9951.25x
ML-DSA-65Verify10,828,0806,564,1311.65x
ML-DSA-87KeyGen17,248,7409,112,4331.89x
ML-DSA-87Sign42,379,47432,177,5141.32x
ML-DSA-87Verify18,061,30110,303,7101.75x
ML-DSA average1.59x

Share this post:

Next Post
Benchmarking (Stock) OpenSSL on Karu