libntruprime: Internals

This document explains the internal structure of libntruprime and how to add new instruction sets and new implementations. The libntruprime infrastructure is adapted from the infrastructure used in lib25519 and libmceliece.

Code generation

Portions of the distributed libntruprime package were automatically generated by scripts in the autogen directory, using (among other things) source in the src directory.

For example, autogen/src converts the src/kem/sntrupP directory into crypto_kem/sntrup{653,761,857,953,1013,1277}, converts the src/core/multsntrupP directory into crypto_core/multsntrup{653,761,857,953,1013,1277}, etc. (The crypto_hash* directories are not auto-generated.) This structure means that source code is shared across all of the sntrup sizes, while allowing per-size specialization of the compiled code, reducing the in-memory code size for the typical case of an application using just one size.

The installation process (./configure etc.) does not run the autogen scripts. Developers have a choice of two development cycles:

Cycle 1: Modify files in, e.g., src/core/multsntrupP; run autogen/src; install; test; repeat.
Cycle 2: Modify files in, e.g., crypto_core/multsntrup653; install; test; repeat. Once crypto_core/multsntrup653 is working, generalize to src/core/multsntrupP and switch to cycle 1.

Primitives

The crypto_kem/sntrup* directories inside libntruprime are intended to compute exactly the KEM primitives defined by the NTRU Prime Sage reference implementation.

Internally, the implementations rely on lower-level subroutines defined in further crypto_*/* directories. This structuring provides smaller targets for optimization, testing, and verification.

The subroutines are intended to compute the primitives defined in Python inside libntruprime's autogen/test Python script. For example, the crypto_core/multsntrup* functions are intended to match the Python function core_multsntrup. The autogen/test script creates a test program, ntruprime-test, which checks libntruprime's subroutines against some outputs of the Python functions and against some SUPERCOP "checksums" (hashes of outputs for various inputs).
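As a rough illustration of what such a Python reference function might look like, here is a minimal sketch of multiplication modulo x^p−x−1 modulo q, following the input and output conventions described for crypto_core/multsntrup761 in the list of primitives. The names mult_sketch, small_from_byte, and center are hypothetical; the authoritative definition is core_multsntrup in autogen/test, which may be structured differently.

```python
P, Q = 761, 4591  # sntrup761 parameters from this document

def small_from_byte(b):
    # Bytes that reduce to 0, 1, 2, 3 modulo 4 mean 0, 1, 0, -1.
    return (0, 1, 0, -1)[b & 3]

def center(c, q=Q):
    # Reduce c modulo q into the range -(q-1)/2 through (q-1)/2.
    c %= q
    return c - q if c > (q - 1) // 2 else c

def mult_sketch(f_bytes, g_bytes, p=P, q=Q):
    # f: p int16 values in little-endian form; g: p small-coefficient bytes.
    f = [int.from_bytes(f_bytes[2 * i:2 * i + 2], "little", signed=True)
         for i in range(p)]
    g = [small_from_byte(b) for b in g_bytes[:p]]
    # Schoolbook product, degree at most 2p-2.
    fg = [0] * (2 * p - 1)
    for i in range(p):
        for j in range(p):
            fg[i + j] += f[i] * g[j]
    # Reduce modulo x^p - x - 1: x^k = x^(k-p+1) + x^(k-p) for k >= p.
    for k in range(2 * p - 2, p - 1, -1):
        fg[k - p] += fg[k]
        fg[k - p + 1] += fg[k]
    # Center each coefficient modulo q and emit 2 little-endian bytes.
    return b"".join(center(c, q).to_bytes(2, "little", signed=True)
                    for c in fg[:p])
```

Passing small values of p and q (for example p=3, q=7) makes it easy to check the reduction by hand: x² times x is x³, which reduces to x+1.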

As a concrete introduction to the subroutines, the list below describes specifically the primitives used for sntrup761. Various numbers appear in these descriptions. The sntrup761 parameters are p = 761, q = 4591, and w = 286. Related numbers appearing below are (p+3)/4 = 191; (q+2)/3 = 1531; (q−1)/2 = 2295; 1007, the number of bytes produced by the NTRU Prime Encode function for M = 761*[1531]; 1039 = 1007 + 32; and 1158, the number of bytes produced by Encode for M = 761*[4591]. The constant 32 is independent of the parameter set, as is the constant 10923 = 32769/3 appearing below. The 1039 and 1158 here are also the ciphertext size and public-key size for sntrup761. See autogen/test for the Encode function and a corresponding Decode function; these are copied from Figures 1 and 2 in the NTRU Prime specification.
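The numbers above can be cross-checked with a short script. The encode function below is a sketch of a recursive encoder written in the style of the specification's Encode (pairs of values are merged, and bytes are emitted while the merged modulus is at least 16384); it is a reconstruction for illustration, and the copy in autogen/test is authoritative. The output length depends only on M, so an all-zero R is enough to measure sizes.

```python
LIMIT = 16384

def encode(R, M):
    # Encode integers R[i], intended to satisfy 0 <= R[i] < M[i], as a list of bytes.
    if len(M) == 0:
        return []
    S = []
    if len(M) == 1:
        r, m = R[0], M[0]
        while m > 1:
            S.append(r % 256)
            r, m = r // 256, (m + 255) // 256
        return S
    R2, M2 = [], []
    for i in range(0, len(M) - 1, 2):
        # Merge two neighbors into one value modulo M[i]*M[i+1].
        m = M[i] * M[i + 1]
        r = R[i] + M[i] * R[i + 1]
        while m >= LIMIT:
            S.append(r % 256)
            r, m = r // 256, (m + 255) // 256
        R2.append(r)
        M2.append(m)
    if len(M) & 1:
        # An odd element at the end is carried to the next level unchanged.
        R2.append(R[-1])
        M2.append(M[-1])
    return S + encode(R2, M2)

p, q = 761, 4591
small_bytes = (p + 3) // 4                                  # 191
rounded_bytes = len(encode([0] * p, [(q + 2) // 3] * p))    # 1007
full_bytes = len(encode([0] * p, [q] * p))                  # 1158
ciphertext_bytes = rounded_bytes + 32                       # 1039
```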

Here are the primitives:

crypto_verify/1039: crypto_verify_1039(s,t) returns 0 when the 1039-byte arrays s and t are equal, otherwise -1.
crypto_decode/int16: crypto_decode_int16(x,s), where x is a uint16[1] array and s is a uint8[2] array, sets x[0] to the uint16 whose little-endian encoding is s[0],s[1].
crypto_decode/761xint16: crypto_decode_761xint16(x,s), where x is a uint16[761] array and s is a uint8[2*761] array, sets each x[i] to the uint16 whose little-endian encoding is s[2*i],s[2*i+1].
crypto_decode/761xint32: crypto_decode_761xint32(x,s), where x is a uint32[761] array and s is a uint8[4*761] array, sets each x[i] to the uint32 whose little-endian encoding is s[4*i],s[4*i+1],s[4*i+2],s[4*i+3].
crypto_decode/761x3: crypto_decode_761x3(x,s), where x is a uint8[761] array and s is a uint8[191] array, sets x[0] to (s[0]&3)-1, sets x[1] to ((s[0]>>2)&3)-1, sets x[2] to ((s[0]>>4)&3)-1, sets x[3] to ((s[0]>>6)&3)-1, sets x[4] to (s[1]&3)-1, etc.
crypto_decode/761x1531: crypto_decode_761x1531(x,s), where x is an int16[761] array and s is a uint8[1007] array, applies Decode to convert s into 761 integers between 0 and 1530 (i.e., the M input to the function is 761*[1531]), and then multiplies each integer by 3 and subtracts 2295 to obtain x.
crypto_decode/761x4591: crypto_decode_761x4591(x,s), where x is an int16[761] array and s is a uint8[1158] array, applies Decode to convert s into 761 integers between 0 and 4590 (i.e., the M input to the function is 761*[4591]), and then subtracts 2295 from each integer to obtain x.
crypto_encode/int16: crypto_encode_int16(s,x), where s is a uint8[2] array and x is a uint16[1] array, sets s[0],s[1] to the little-endian encoding of x[0].
crypto_encode/761x3: crypto_encode_761x3(s,x), where s is a uint8[191] array and x is a uint8[761] array, sets s[0] to (x[0]+1)+4*(x[1]+1)+16*(x[2]+1)+64*(x[3]+1), sets s[1] to (x[4]+1)+4*(x[5]+1)+16*(x[6]+1)+64*(x[7]+1), ..., sets s[189] to (x[756]+1)+4*(x[757]+1)+16*(x[758]+1)+64*(x[759]+1), and sets s[190] to x[760]+1.
crypto_encode/761xfreeze3: crypto_encode_761xfreeze3(s,x), where s is a uint8[761] array and x is an int16[761] array, sets each s[i] to x[i]-3*((10923*x[i]+16384)>>15).
crypto_encode/761x1531: crypto_encode_761x1531(s,x), where s is a uint8[1007] array and x is an int16[761] array, sets s to Encode(R,M), where M is 761*[1531] and R[i] is (((x[i]+2295)&16383)*10923)>>15. (The way this is used in sntrup761 has R[i] below M[i]; however, tests include larger R[i].)
crypto_encode/761x1531round: crypto_encode_761x1531round(s,x), where s is a uint8[1007] array and x is an int16[761] array, is the same as crypto_encode_761x1531(s,y), where y[i] = 3*((10923*x[i]+16384)>>15).
crypto_encode/761x4591: crypto_encode_761x4591(s,x), where s is a uint8[1158] array and x is an int16[761] array, sets s to Encode(R,M), where M is 761*[4591] and R[i] is (x[i]+2295)&16383. (The way this is used in sntrup761 has R[i] below M[i]; however, tests include larger R[i].)
crypto_sort/int32: crypto_sort_int32(x,n) sorts the int32 values x[0], x[1], ..., x[n-1].
crypto_sort/uint32: crypto_sort_uint32(x,n) sorts the uint32 values x[0], x[1], ..., x[n-1].
crypto_core/inv3sntrup761: crypto_core_inv3sntrup761(h,f,0,0) sets polynomial h to the reciprocal of polynomial f modulo x⁷⁶¹−x−1 modulo 3. The polynomial f is expressed as a uint8[761] array where bytes that reduce to 0, 1, 2, 3 modulo 4 are interpreted as the small integers 0, 1, 0, −1 respectively; these integers in {−1,0,1} are then interpreted as the coefficients of x⁰, x¹, etc. in that order. Coefficients −1, 0, 1 in 1/f modulo 3 are converted to bytes 255, 0, 1 in array h. There is then a final byte 0 indicating that the reciprocal exists, so h is a uint8[762] array. If f does not have a reciprocal then the h array is instead set to 761 bytes 0 followed by final byte 255.
crypto_core/invsntrup761: crypto_core_invsntrup761(h,f,0,0) sets polynomial h to the reciprocal of 3f modulo x⁷⁶¹−x−1 modulo 4591. The input polynomial f is expressed as an int8[761] array; the array entries in {−128,...,127} are interpreted as the coefficients of x⁰, x¹, etc. in that order. Each coefficient of the reciprocal of 3f is reduced modulo q to the range −(q−1)/2 through (q−1)/2 and then encoded as 2 bytes in little-endian form in h. There is then a final byte 0 indicating that the reciprocal exists, so h is a uint8[2*761+1] array. If 3f does not have a reciprocal then the h array is instead set to 2*761 bytes 0 followed by final byte 255.
crypto_core/mult3sntrup761: crypto_core_mult3sntrup761(h,f,g,0) sets h to the product of two small-coefficient polynomials f and g modulo x⁷⁶¹−x−1 modulo 3. The polynomial f is expressed as a uint8[761] array where bytes that reduce to 0, 1, 2, 3 modulo 4 are interpreted as the small integers 0, 1, 0, −1 respectively; these integers in {−1,0,1} are then interpreted as the coefficients of x⁰, x¹, etc. in that order. The polynomial g is expressed the same way. Each coefficient of the product modulo x⁷⁶¹−x−1 is reduced modulo 3 to the range −1, 0, 1, and then stored in the uint8[761] array h as byte 255, 0, 1 respectively.
crypto_core/multsntrup761: crypto_core_multsntrup761(h,f,g,0) sets h to the product of a polynomial f and a small-coefficient polynomial g modulo x⁷⁶¹−x−1 modulo q. The polynomial f is expressed as a uint8[2*761] array storing 761 int16 values in little-endian form. The polynomial g is expressed as a uint8[761] array, where bytes that reduce to 0, 1, 2, 3 modulo 4 are interpreted as the small integers 0, 1, 0, −1 respectively. Each coefficient of the product modulo x⁷⁶¹−x−1 is reduced modulo q to the range −(q−1)/2 through (q−1)/2, and then encoded as 2 bytes in little-endian form in h, which is a uint8[2*761] array.
crypto_core/scale3sntrup761: crypto_core_scale3sntrup761(h,f,0,0) transforms a polynomial f to a polynomial h. Each polynomial is expressed as a uint8[2*761] array storing 761 int16 values in little-endian form. Each f coefficient is transformed to the corresponding h coefficient with the following sequence of int16 operations (all intermediate results reduced to int16): multiply by 3; subtract 2296; if negative, add 4591; if negative, add 4591; subtract 2295.
crypto_core/weightsntrup761: crypto_core_weightsntrup761(y,x,0,0), where y is a uint8[2] array and x is a uint8[761] array, sets y[0],y[1] to the little-endian encoding of the sum of the 761 bits x[0]&1,...,x[760]&1.
crypto_core/wforcesntrup761: crypto_core_wforcesntrup761(y,x,0,0), where y is a uint8[761] array and x is a uint8[761] array, sets y to a copy of x if the sum of the 761 bits x[0]&1,...,x[760]&1 is 286. Otherwise it sets y to 286 bytes equal to 1 followed by 761−286 bytes equal to 0.
crypto_hashblocks/sha512: crypto_hashblocks_sha512(h,x,xlen) updates an intermediate SHA-512 hash h using all of the full 128-byte blocks at the beginning of the xlen-byte array x, and returns the number of bytes left over, namely xlen mod 128.
crypto_hash/sha512: crypto_hash_sha512(h,x,xlen) computes the SHA-512 hash h of the xlen-byte array x.
crypto_kem/sntrup761: crypto_kem_sntrup761_keypair(pk,sk) is key generation for sntrup761, and is provided by the stable API as sntrup761_keypair. Similar comments apply to enc and dec.
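Several of the smaller primitives in this list are easier to follow as code. The sketch below restates the arithmetic for the 761x3 encoding, the mod-3 rounding and freezing, scale3, weight, and wforce. All function names are hypothetical, and the functions operate on plain Python integers rather than byte arrays: the library stores −1 as byte 255, works with fixed-width integers, and avoids data-dependent branches, none of which this sketch attempts.

```python
def encode_x3(x):
    # Pack small integers in {-1, 0, 1}, four per byte, lowest bits first.
    s = []
    for i in range(0, len(x), 4):
        s.append(sum((c + 1) << (2 * j) for j, c in enumerate(x[i:i + 4])))
    return bytes(s)

def decode_x3(s, n):
    # Inverse of encode_x3 for n coefficients.
    return [((s[i // 4] >> (2 * (i % 4))) & 3) - 1 for i in range(n)]

def round3(c):
    # Nearest multiple of 3, using the multiply-and-shift from the list above.
    return 3 * ((10923 * c + 16384) >> 15)

def freeze3(c):
    # Representative of c modulo 3 in {-1, 0, 1}.
    return c - round3(c)

def scale3(c, q=4591):
    # Multiply by 3 modulo q, following the step sequence given for scale3.
    c = 3 * c - (q + 1) // 2      # multiply by 3; subtract 2296
    if c < 0:
        c += q                    # if negative, add 4591
    if c < 0:
        c += q                    # if negative, add 4591
    return c - (q - 1) // 2       # subtract 2295

def weight(x):
    # Sum of the low bits of all entries.
    return sum(b & 1 for b in x)

def wforce(x, w=286):
    # Keep x if its weight is w; otherwise substitute a fixed weight-w vector.
    if weight(x) == w:
        return bytes(x)
    return bytes([1] * w + [0] * (len(x) - w))
```

For 761 coefficients, encode_x3 produces 190 full bytes plus one final byte holding only x[760]+1, which accounts for the 191-byte size.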

The functions crypto_sort_int32(x,n) and crypto_sort_uint32(x,n) take time that depends on n but not on the contents of the x array. Similarly, the crypto_hash* functions take time that depends on input length but not on input contents. All other subroutines take constant time. There is one use of "declassification" in crypto_kem/sntrup*: a rejection-sampling loop at the beginning of key generation enforces invertibility mod 3.
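To make "depends on n but not on the contents" concrete, here is a minimal sketch of a sort whose sequence of compared positions is fixed by n alone. It uses odd-even transposition, chosen only because it is short; it is not the algorithm used by libntruprime and is far slower. The names comparison_schedule and sort_sketch are hypothetical. Python's min and max are of course not constant-time operations; the point is only the data-independent schedule.

```python
def comparison_schedule(n):
    # Odd-even transposition: n rounds, alternating even and odd neighbor pairs.
    # The list of index pairs depends only on n.
    return [(i, i + 1) for rnd in range(n) for i in range(rnd & 1, n - 1, 2)]

def sort_sketch(x):
    x = list(x)
    for i, j in comparison_schedule(len(x)):
        # Compare-exchange: smaller value to position i, larger to position j.
        x[i], x[j] = min(x[i], x[j]), max(x[i], x[j])
    return x
```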

As in SUPERCOP and NaCl, array lengths intentionally use long long, not size_t. In libntruprime, as in lib25519 and libmceliece, array lengths are signed.

Implementations

A single primitive can, and usually does, have multiple implementations. Each implementation is in its own subdirectory. The implementations are required to have exactly the same input-output behavior, and to some extent this is tested, although it is not yet formally verified (except for some components such as crypto_sort).

Different implementations typically offer different tradeoffs between portability, simplicity, and efficiency. For example, crypto_core/inv3sntrup761/bits64 is portable; crypto_core/inv3sntrup761/avx is faster and less portable.

Each unportable implementation has an architectures file. Each line in this file identifies a CPU instruction set (and ABI) where the implementation works. For example, crypto_core/inv3sntrup761/avx/architectures has two lines

amd64 avx2
x86 avx2

meaning that the implementation works on CPUs that have the Intel/AMD 64-bit or 32-bit instruction sets with the AVX2 instruction-set extension. The top-level compilers directory shows (among other things) the allowed instruction-set names such as avx2.
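The meaning of an architectures file can be sketched as a small compatibility check. This is only a model of the rule stated above, not the code libntruprime uses; parse_architectures and works_on are hypothetical names, and the CPU description passed in is assumed to come from elsewhere.

```python
def parse_architectures(text):
    # Each non-empty line: an instruction set/ABI name, then required extensions.
    entries = []
    for line in text.splitlines():
        words = line.split()
        if words:
            entries.append((words[0], frozenset(words[1:])))
    return entries

def works_on(entries, cpu_arch, cpu_features):
    # The implementation is usable if some line matches the CPU's base
    # instruction set and all extensions on that line are available.
    features = set(cpu_features)
    return any(arch == cpu_arch and required <= features
               for arch, required in entries)

entries = parse_architectures("amd64 avx2\nx86 avx2\n")
```

With these entries, works_on(entries, "amd64", {"sse3", "avx2"}) is true, while an amd64 CPU without avx2 is rejected.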

At run time, libntruprime checks the CPU where it is running, and selects an implementation where architectures is compatible with that CPU. Each primitive makes its own selection once per program startup, using the compiler's ifunc mechanism (or constructor on platforms that do not support ifunc). This type of run-time selection means, for example, that an amd64 CPU without AVX2 can share binaries with an amd64 CPU with AVX2. However, correctness requires instruction sets to be preserved by migration across cores via the OS kernel, VM migration, etc.

The compiler has a target mechanism that makes an ifunc selection based on CPU architectures. Instead of using the target mechanism, libntruprime uses a more sophisticated mechanism that also accounts for benchmarks collected in advance of compilation.
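The idea of combining CPU compatibility with benchmarks collected in advance can be sketched as follows. This is a simplified model under stated assumptions: the candidate records, their field names, and the select function are hypothetical, and the format of the data libntruprime actually consults is not described here.

```python
def select(candidates, cpu_arch, cpu_features):
    # candidates: dicts with keys "impl", "compiler", "arch", "requires", "cycles".
    # "arch" of None stands for a portable implementation (assumption of this sketch).
    features = set(cpu_features)
    usable = [c for c in candidates
              if c["arch"] in (None, cpu_arch) and set(c["requires"]) <= features]
    if not usable:
        return None
    # Among usable candidates, prefer the one that benchmarked fastest.
    return min(usable, key=lambda c: c["cycles"])

# Placeholder cycle counts, made up purely to exercise the function.
candidates = [
    {"impl": "portable", "compiler": "gcc", "arch": None, "requires": [], "cycles": 900},
    {"impl": "avx", "compiler": "clang", "arch": "amd64", "requires": ["avx2"], "cycles": 300},
]
```

A CPU with avx2 gets the avx candidate; a CPU without it falls back to the portable one, which is what lets both share one binary.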

Compilers

libntruprime tries different C compilers for each implementation. For example, compilers/default lists the following compilers:

clang -Wall -fPIC -fwrapv -Qunused-arguments -O2
gcc -Wall -fPIC -fwrapv -O3

Sometimes gcc produces better code, and sometimes clang produces better code.

As another example, compilers/amd64+sse3+ssse3+sse41+popcnt+avx+bmi1+bmi2+avx2+fma lists the following compilers:

clang -Wall -fPIC -fwrapv -Qunused-arguments -O2 -mmmx -msse -msse2 -msse3 -mssse3 -msse4.1 -msse4.2 -mavx -mbmi -mbmi2 -mpopcnt -mavx2 -mfma -mtune=skylake
gcc -Wall -fPIC -fwrapv -O3 -mmmx -msse -msse2 -msse3 -mssse3 -msse4.1 -msse4.2 -mavx -mbmi -mbmi2 -mpopcnt -mavx2 -mfma -mtune=skylake

The -mavx2 option tells these compilers that they are free to use the AVX2 instruction-set extension.

Code compiled using the compilers in compilers/amd64+sse3+ssse3+sse41+popcnt+avx+bmi1+bmi2+avx2+fma will be considered at run time by the libntruprime selection mechanism if the supports() function in compilers/amd64+sse3+ssse3+sse41+popcnt+avx+bmi1+bmi2+avx2+fma.c returns nonzero. This function checks whether the run-time CPU supports AVX2 (and SSE3 and so on, and OSXSAVE with XMM/YMM being saved; https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85100 says that all versions of gcc until 2018 handled this incorrectly in target). Similar comments apply to other compilers/* files.
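The decision made by a supports() function can be modeled roughly as below. The real function is C code that queries the CPU directly; this sketch only mirrors the logic described above, taking the CPU flags and the OS state-saving status as inputs. The names tags_from_name and supports_sketch are hypothetical, and requiring OS support for saved XMM/YMM state unconditionally is an assumption that fits the AVX2 example but not necessarily every compilers/* file.

```python
def tags_from_name(name):
    # "amd64+sse3+...+fma" -> ("amd64", ["sse3", ..., "fma"])
    arch, *tags = name.split("+")
    return arch, tags

def supports_sketch(name, cpu_arch, cpu_flags, os_saves_xmm_ymm):
    arch, tags = tags_from_name(name)
    # Every tag in the name must be present; nothing is implied by "levels".
    has_all = set(tags) <= set(cpu_flags)
    # Assumption: this target needs the OS to save XMM/YMM registers.
    return arch == cpu_arch and has_all and bool(os_saves_xmm_ymm)

NAME = "amd64+sse3+ssse3+sse41+popcnt+avx+bmi1+bmi2+avx2+fma"
```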

If some compilers fail (for example, clang is not installed, or the compiler version is too old to support the compiler options used in libntruprime), the libntruprime compilation process will try its best to produce a working library using the remaining compilers, even if this means lower performance.

Trimming

By default, to reduce the size of the compiled library, the libntruprime compilation process trims the library down to the implementations that are selected by libntruprime's selection mechanism.

For example, if the selection mechanism decides that CPUs with AVX2 should use invsntrup761/avx with clang and that other CPUs should use invsntrup761/portable with gcc, then trimming will remove invsntrup761/avx compiled with gcc and invsntrup761/portable compiled with clang.
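The example above amounts to a set difference, sketched here with the same four combinations. The trim function is a hypothetical model of the outcome, not of how the build system carries it out.

```python
def trim(built, selected):
    # built: (implementation, compiler) pairs that compiled successfully.
    # selected: pairs chosen by the selection mechanism for some class of CPUs.
    kept = built & selected
    removed = built - selected
    return kept, removed

built = {
    ("invsntrup761/avx", "clang"), ("invsntrup761/avx", "gcc"),
    ("invsntrup761/portable", "clang"), ("invsntrup761/portable", "gcc"),
}
selected = {("invsntrup761/avx", "clang"), ("invsntrup761/portable", "gcc")}
kept, removed = trim(built, selected)
```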

This trimming is handled at link time rather than compile time to increase the chance that, even if some implementations are broken by compiler "upgrades", the library will continue to build successfully.

To avoid this trimming, pass the --no-trim option to ./configure. All implementations that compile are then included in the library, tested by ntruprime-test, and measured by ntruprime-speed. You'll want to avoid trimming if you're adding new instruction sets or new implementations (see below), so that you can run tests and benchmarks of code that isn't selected yet.

How to recompile after changes

If you make changes under crypto_*, the fully supported recompilation mechanism is to run ./configure again to clean and repopulate the build directory, and then run make again to recompile everything.

This can be on the scale of seconds if you have enough cores, but maybe you're developing on a slower machine. Three options are currently available to accelerate the edit-compile cycle:

There is an experimental --no-clean option to ./configure that, for some simple types of changes, can produce a successful build without cleaning.
Running make without ./configure can work for some particularly simple types of changes. However, not all dependencies are currently expressed in Makefile, and some types of dependencies that ./configure understands would be difficult to express in the Makefile language.
You can disable the implementations you're not using by setting sticky bits on the source directories for those implementations: e.g., chmod +t crypto_*/*/avx.

Make sure to reenable all implementations and do a full clean build if you're collecting data to add to the source benchmarks directory.

How to add new instruction sets

Adding another file compilers/amd64+foo, along with a supports() implementation in compilers/amd64+foo.c, will support a new instruction set. Do not assume that the new foo instruction set implies support for older instruction sets (the idea of "levels" of instruction sets); instead make sure to include the older instruction sets in + tags, as illustrated by compilers/amd64+sse3+ssse3+sse41+popcnt+avx+bmi1+bmi2+avx2+fma.

In the compiler options, always make sure to include -fPIC to support shared libraries, and -fwrapv to switch to a slightly less dangerous version of C.

The foo tags don't have to be instruction sets. For example, if a CPU has the same instruction set but wants different optimizations because of differences in instruction timings, you can make a tag for those optimizations, using, e.g., CPU IDs or benchmarks in the corresponding supports() function to decide whether to enable those optimizations. Benchmarks tend to be more future-proof than a list of CPU IDs, but the time taken for benchmarks at program startup has to be weighed against the subsequent speedup from the resulting optimizations.

To see how well libntruprime performs with the new compilers, run ntruprime-speed on the target machine and look for the foo lines in the output. If the new performance is better than the performance shown on the selected lines:

Copy the ntruprime-speed output into a file in the benchmarks directory, typically named after the hostname of the target machine.
Run ./prioritize in the top-level directory to create priority files. These files tell libntruprime which implementations to select for any given architecture.
Reconfigure (again with --no-trim), recompile, rerun ntruprime-test, and rerun ntruprime-speed to check that the selected lines now use the foo compiler.

If the foo implementation is outperformed by other implementations, then these steps don't help except for documenting this fact. The same implementation might turn out to be useful for subsequent foo CPUs.

How to add new implementations

Taking full advantage of the foo instruction set usually requires writing new implementations. Sometimes there are also ideas for taking better advantage of existing instruction sets.

Structurally, adding a new implementation of a primitive is a simple matter of adding a new subdirectory with the code for that implementation. Most of the work is optimizing the use of foo intrinsics in .c files or foo instructions in .S files. Make sure to include an architectures file saying, e.g., amd64 avx2 foo.

Names of implementation directories can use letters, digits, dashes, and underscores. Do not use two implementation names that are the same when dashes and underscores are removed.
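The naming rules can be checked mechanically before adding a directory. The following is a small sketch with hypothetical helper names (valid_name, collisions); it assumes "letters" means ASCII letters, and it is not part of the libntruprime build.

```python
import re

def valid_name(name):
    # Letters, digits, dashes, and underscores only.
    return re.fullmatch(r"[A-Za-z0-9_-]+", name) is not None

def collisions(names):
    # Report pairs of names that coincide once dashes and underscores are removed.
    seen = {}
    clashes = []
    for name in names:
        key = name.replace("-", "").replace("_", "")
        if key in seen:
            clashes.append((seen[key], name))
        else:
            seen[key] = name
    return clashes
```

For example, collisions(["bits64", "bits_64", "avx"]) reports the first two names as clashing.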

All .c and .S files in the implementation directory are compiled and linked. There is no need to edit a separate list of these files. You can also use .h files via the C preprocessor.

If an implementation is actually more restrictive than indicated in architectures then the resulting compiled library will fail on some machines (although perhaps that implementation will not be used by default). Putting unnecessary restrictions into architectures will not create such failures, but can unnecessarily limit performance.

Some, but not all, mistakes in architectures will produce warnings from the checkinsns script that runs automatically when libntruprime is compiled. Running the ntruprime-test program tries all implementations, but only on the CPU where ntruprime-test is being run; also, ntruprime-test does not guarantee code coverage.

amd64 implies little-endian, and implies architectural support for unaligned loads and stores. Beware, however, that the Intel/AMD vectorized load/store intrinsics (and the underlying movdqa instruction) require alignment; if in doubt, use loadu/storeu (and movdqu). The ntruprime-test program checks unaligned inputs and outputs, but can miss issues with unaligned stack variables.

To test your implementation, compile everything, check for compiler warnings and errors, run ntruprime-test (or just ntruprime-test xof to test a crypto_xof implementation), and check for a line saying all tests succeeded. To use AddressSanitizer (for catching, at run time, buffer overflows in C code), add -fsanitize=address to the gcc and clang lines in compilers/*; you may also have to add return; at the beginning of the limits() function in command/limits.inc.

To see the performance of your implementation, run ntruprime-speed. If the new performance is better than the performance shown on the selected lines, follow the same steps as for a new instruction set: copy the ntruprime-speed output into a file in the benchmarks directory; run ./prioritize in the top-level directory to create priority files; reconfigure (again with --no-trim); recompile; rerun ntruprime-test; rerun ntruprime-speed; check that the selected lines now use the new implementation.

How to handle namespacing

As in SUPERCOP and NaCl, to call crypto_sort_int32(), you have to include crypto_sort_int32.h; but to write an implementation of crypto_sort_int32(), you have to instead include crypto_sort.h and define crypto_sort. Similar comments apply to other primitives.

The function name that's actually linked might end up as, e.g., libntruprime_sort_int32_avx2_C2 where avx2 indicates the implementation and C2 indicates the compiler. Don't try to build this name into your implementation.

If you have another global symbol x (for example, a non-static function in a .c file, or a non-static variable outside functions in a .c file), you have to replace it with CRYPTO_NAMESPACE(x), for example with #define x CRYPTO_NAMESPACE(x).

For global symbols in .S files and shared-*.c files, use CRYPTO_SHARED_NAMESPACE instead of CRYPTO_NAMESPACE. For .S files that define both x and _x to handle platforms where x in C is _x in assembly, use CRYPTO_SHARED_NAMESPACE(x) and _CRYPTO_SHARED_NAMESPACE(x); CRYPTO_SHARED_NAMESPACE(_x) is not sufficient.

libntruprime includes a mechanism to recognize files that are copied across implementations (possibly of different primitives) and to unify those into a file compiled only once, reducing the overall size of the compiled library and possibly improving cache utilization. To request this mechanism, include a line

// linker define x

for any global symbol x defined in the file, and a line

// linker use x

for any global symbol x used in the file from the same implementation (not crypto_* subroutines that you're calling, randombytes, etc.). This mechanism tries very hard, perhaps too hard, to avoid improperly unifying files: for example, even a slight difference in a .h file included by a file defining a used symbol will disable the mechanism.
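When auditing a file, it can help to list the symbols it declares through these comment lines. The sketch below extracts them from source text; linker_symbols is a hypothetical helper, and it assumes exactly one symbol per line in the form shown above, which may be stricter or looser than what the libntruprime scripts accept.

```python
import re

# Matches "// linker define x" and "// linker use x" at the start of a line.
LINKER_LINE = re.compile(r"^// linker (define|use) (\S+)\s*$")

def linker_symbols(source):
    result = {"define": [], "use": []}
    for line in source.splitlines():
        m = LINKER_LINE.match(line)
        if m:
            result[m.group(1)].append(m.group(2))
    return result
```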

Typical namespacing mistakes will produce either linker failures or warnings from the checknamespace script that runs automatically when libntruprime is compiled.

Version: This is version 2024.08.18 of the "Internals" web page.