This document explains the internal structure of libntruprime, and explains how to add new instruction sets and new implementations. The libntruprime infrastructure is adapted from the infrastructure used in lib25519 and libmceliece.
Code generation
Portions of the distributed libntruprime package were automatically
generated by scripts in the autogen directory, using (among other
things) source in the src directory.
For example, autogen/src converts the src/kem/sntrupP directory into
crypto_kem/sntrup{653,761,857,953,1013,1277}, converts the
src/core/multsntrupP directory into
crypto_core/multsntrup{653,761,857,953,1013,1277}, etc. (The
crypto_hash* directories are not auto-generated.) This structure means
that source code is shared across all of the sntrup sizes, while
allowing per-size specialization of the compiled code, reducing the
in-memory code size for the typical case of an application using just
one size.
The installation process (./configure etc.) does not run the autogen
scripts. Developers have a choice of two development cycles:
-
Cycle 1: Modify files in, e.g., src/core/multsntrupP; run autogen/src; install; test; repeat.
-
Cycle 2: Modify files in, e.g., crypto_core/multsntrup653; install; test; repeat. Once crypto_core/multsntrup653 is working, generalize to src/core/multsntrupP and switch to cycle 1.
Primitives
The crypto_kem/sntrup* directories inside libntruprime are intended to
compute exactly the KEM primitives defined by the NTRU Prime Sage
reference implementation.
Internally, the implementations rely on lower-level subroutines defined
in further crypto_*/* directories. This structuring provides smaller
targets for optimization, testing, and verification.
The subroutines are intended to compute the primitives defined in Python
inside libntruprime's autogen/test Python script. For example, the
crypto_core/multsntrup* functions are intended to match the Python
function core_multsntrup. The autogen/test script creates a test
program, ntruprime-test, which checks libntruprime's subroutines
against some outputs of the Python functions and against some SUPERCOP
"checksums" (hashes of outputs for various inputs).
As a concrete introduction to the subroutines, the list below describes
specifically the primitives used for sntrup761. Various numbers appear
in these descriptions. The sntrup761 parameters are p = 761, q = 4591,
and w = 286. Related numbers appearing below are (p+3)/4 = 191;
(q+2)/3 = 1531; (q−1)/2 = 2295; 1007, the number of bytes produced by
the NTRU Prime Encode function for M = 761*[1531]; 1039 = 1007 + 32;
and 1158, the number of bytes produced by Encode for M = 761*[4591].
The constant 32 is independent of the parameter set, as is the constant
10923 = 32769/3 appearing below. The 1039 and 1158 here are also the
ciphertext size and public-key size for sntrup761. See autogen/test
for the Encode function and a corresponding Decode function; these
are copied from Figures 1 and 2 in the NTRU Prime specification.
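The byte counts quoted above can be reproduced with a short Python transcription of the Encode and Decode recursions. This is a sketch written from the description in the NTRU Prime specification, not a copy of autogen/test, which remains the authoritative version; the constant name LIMIT is chosen here for illustration.

```python
LIMIT = 16384  # merged moduli are reduced below 2^14 before recursing

def Encode(R, M):
    """Encode integers R[i] in range(M[i]) as a list of bytes."""
    if len(M) == 0:
        return []
    S = []
    if len(M) == 1:
        r, m = R[0], M[0]
        while m > 1:
            S.append(r % 256)
            r, m = r // 256, (m + 255) // 256
        return S
    R2, M2 = [], []
    for i in range(0, len(M) - 1, 2):
        # Merge two neighbours into one integer modulo M[i]*M[i+1].
        m, r = M[i] * M[i + 1], R[i] + M[i] * R[i + 1]
        while m >= LIMIT:
            S.append(r % 256)
            r, m = r // 256, (m + 255) // 256
        R2.append(r)
        M2.append(m)
    if len(M) & 1:
        R2.append(R[-1])
        M2.append(M[-1])
    return S + Encode(R2, M2)

def Decode(S, M):
    """Inverse of Encode for the same list M of moduli."""
    if len(M) == 0:
        return []
    if len(M) == 1:
        return [sum(b << (8 * i) for i, b in enumerate(S)) % M[0]]
    k = 0
    bottom, M2 = [], []
    for i in range(0, len(M) - 1, 2):
        m, r, t = M[i] * M[i + 1], 0, 1
        while m >= LIMIT:
            r, t, k, m = r + S[k] * t, t * 256, k + 1, (m + 255) // 256
        bottom.append((r, t))
        M2.append(m)
    if len(M) & 1:
        M2.append(M[-1])
    R2 = Decode(S[k:], M2)
    R = []
    for i in range(0, len(M) - 1, 2):
        r, t = bottom[i // 2]
        r += t * R2[i // 2]
        R += [r % M[i], (r // M[i]) % M[i + 1]]
    if len(M) & 1:
        R.append(R2[-1])
    return R
```

With this sketch, len(Encode(761*[0], 761*[1531])) is 1007 and len(Encode(761*[0], 761*[4591])) is 1158, matching the sizes listed above. The output length depends only on M, not on R.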
Here are the primitives:
-
crypto_verify/1039: crypto_verify_1039(s,t) returns 0 when the 1039-byte arrays s and t are equal, otherwise -1.
-
crypto_decode/int16: crypto_decode_int16(x,s), where x is a uint16[1] array and s is a uint8[2] array, sets x[0] to the uint16 whose little-endian encoding is s[0],s[1].
-
crypto_decode/761xint16: crypto_decode_761xint16(x,s), where x is a uint16[761] array and s is a uint8[2*761] array, sets each x[i] to the uint16 whose little-endian encoding is s[2*i],s[2*i+1].
-
crypto_decode/761xint32: crypto_decode_761xint32(x,s), where x is a uint32[761] array and s is a uint8[4*761] array, sets each x[i] to the uint32 whose little-endian encoding is s[4*i],s[4*i+1],s[4*i+2],s[4*i+3].
-
crypto_decode/761x3: crypto_decode_761x3(x,s), where x is a uint8[761] array and s is a uint8[191] array, sets x[0] to (s[0]&3)-1, sets x[1] to ((s[0]>>2)&3)-1, sets x[2] to ((s[0]>>4)&3)-1, sets x[3] to ((s[0]>>6)&3)-1, sets x[4] to (s[1]&3)-1, etc.
-
crypto_decode/761x1531: crypto_decode_761x1531(x,s), where x is an int16[761] array and s is a uint8[1007] array, applies Decode to convert s into 761 integers between 0 and 1530 (i.e., the M input to the function is 761*[1531]), and then multiplies each integer by 3 and subtracts 2295 to obtain x.
-
crypto_decode/761x4591: crypto_decode_761x4591(x,s), where x is an int16[761] array and s is a uint8[1158] array, applies Decode to convert s into 761 integers between 0 and 4590 (i.e., the M input to the function is 761*[4591]), and then subtracts 2295 from each integer to obtain x.
-
crypto_encode/int16: crypto_encode_int16(s,x), where s is a uint8[2] array and x is a uint16[1] array, sets s[0],s[1] to the little-endian encoding of x[0].
-
crypto_encode/761x3: crypto_encode_761x3(s,x), where s is a uint8[191] array and x is a uint8[761] array, sets s[0] to (x[0]+1)+4*(x[1]+1)+16*(x[2]+1)+64*(x[3]+1), sets s[1] to (x[4]+1)+4*(x[5]+1)+16*(x[6]+1)+64*(x[7]+1), ..., sets s[189] to (x[756]+1)+4*(x[757]+1)+16*(x[758]+1)+64*(x[759]+1), and sets s[190] to x[760]+1.
-
crypto_encode/761xfreeze3: crypto_encode_761xfreeze3(s,x), where s is a uint8[761] array and x is an int16[761] array, sets each s[i] to x[i]-3*((10923*x[i]+16384)>>15).
-
crypto_encode/761x1531: crypto_encode_761x1531(s,x), where s is a uint8[1007] array and x is an int16[761] array, sets s to Encode(R,M), where M is 761*[1531] and R[i] is (((x[i]+2295)&16383)*10923)>>15. (The way this is used in sntrup761 has R[i] below M[i]; however, tests include larger R[i].)
-
crypto_encode/761x1531round: crypto_encode_761x1531round(s,x), where s is a uint8[1007] array and x is an int16[761] array, is the same as crypto_encode_761x1531(s,y), where y[i] = 3*((10923*x[i]+16384)>>15).
-
crypto_encode/761x4591: crypto_encode_761x4591(s,x), where s is a uint8[1158] array and x is an int16[761] array, sets s to Encode(R,M), where M is 761*[4591] and R[i] is (x[i]+2295)&16383. (The way this is used in sntrup761 has R[i] below M[i]; however, tests include larger R[i].)
-
crypto_sort/int32: crypto_sort_int32(x,n) sorts the int32 values x[0], x[1], ..., x[n-1].
-
crypto_sort/uint32: crypto_sort_uint32(x,n) sorts the uint32 values x[0], x[1], ..., x[n-1].
-
crypto_core/inv3sntrup761: crypto_core_inv3sntrup761(h,f,0,0) sets polynomial h to the reciprocal of polynomial f modulo x^761−x−1 modulo 3. The polynomial f is expressed as a uint8[761] array where bytes that reduce to 0, 1, 2, 3 modulo 4 are interpreted as the small integers 0, 1, 0, −1 respectively; these integers in {−1,0,1} are then interpreted as the coefficients of x^0, x^1, etc. in that order. Coefficients −1, 0, 1 in 1/f modulo 3 are converted to bytes 255, 0, 1 in array h. There is then a final byte 0 indicating that the reciprocal exists, so h is a uint8[762] array. If f does not have a reciprocal then the h array is instead set to 761 bytes 0 followed by final byte 255.
-
crypto_core/invsntrup761: crypto_core_invsntrup761(h,f,0,0) sets polynomial h to the reciprocal of 3f modulo x^761−x−1 modulo 4591. The input polynomial f is expressed as an int8[761] array; the array entries in {−128,...,127} are interpreted as the coefficients of x^0, x^1, etc. in that order. Each coefficient of the reciprocal of 3f is reduced modulo q to the range −(q−1)/2 through (q−1)/2 and then encoded as 2 bytes in little-endian form in h. There is then a final byte 0 indicating that the reciprocal exists, so h is a uint8[2*761+1] array. If 3f does not have a reciprocal then the h array is instead set to 2*761 bytes 0 followed by final byte 255.
-
crypto_core/mult3sntrup761: crypto_core_mult3sntrup761(h,f,g,0) sets h to the product of two small-coefficient polynomials f and g modulo x^761−x−1 modulo 3. The polynomial f is expressed as a uint8[761] array where bytes that reduce to 0, 1, 2, 3 modulo 4 are interpreted as the small integers 0, 1, 0, −1 respectively; these integers in {−1,0,1} are then interpreted as the coefficients of x^0, x^1, etc. in that order. The polynomial g is expressed the same way. Each coefficient of the product modulo x^761−x−1 is reduced modulo 3 to the range −1, 0, 1, and then stored in the uint8[761] array h as byte 255, 0, 1 respectively.
-
crypto_core/multsntrup761: crypto_core_multsntrup761(h,f,g,0) sets h to the product of a polynomial f and a small-coefficient polynomial g modulo x^761−x−1 modulo q. The polynomial f is expressed as a uint8[2*761] array storing 761 int16 values in little-endian form. The polynomial g is expressed as a uint8[761] array, where bytes that reduce to 0, 1, 2, 3 modulo 4 are interpreted as the small integers 0, 1, 0, −1 respectively. Each coefficient of the product modulo x^761−x−1 is reduced modulo q to the range −(q−1)/2 through (q−1)/2, and then encoded as 2 bytes in little-endian form in h, which is a uint8[2*761] array.
-
crypto_core/scale3sntrup761: crypto_core_scale3sntrup761(h,f,0,0) transforms a polynomial f to a polynomial h. Each polynomial is expressed as a uint8[2*761] array storing 761 int16 values in little-endian form. Each f coefficient is transformed to the corresponding h coefficient with the following sequence of int16 operations (all intermediate results reduced to int16): multiply by 3; subtract 2296; if negative, add 4591; if negative, add 4591; subtract 2295.
-
crypto_core/weightsntrup761: crypto_core_weightsntrup761(y,x,0,0), where y is a uint8[2] array and x is a uint8[761] array, sets y[0],y[1] to the little-endian encoding of the sum of the 761 bits x[0]&1,...,x[760]&1.
-
crypto_core/wforcesntrup761: crypto_core_wforcesntrup761(y,x,0,0), where y is a uint8[761] array and x is a uint8[761] array, sets y to a copy of x if the sum of the 761 bits x[0]&1,...,x[760]&1 is 286. Otherwise it sets y to 286 bytes equal to 1 followed by 761−286 bytes equal to 0.
-
crypto_hashblocks/sha512: crypto_hashblocks_sha512(h,x,xlen) updates an intermediate SHA-512 hash h using all of the full 128-byte blocks at the beginning of the xlen-byte array x, and returns the number of bytes left over, namely xlen mod 128.
-
crypto_hash/sha512: crypto_hash_sha512(h,x,xlen) computes the SHA-512 hash h of the xlen-byte array x.
-
crypto_kem/sntrup761: crypto_kem_sntrup761_keypair(pk,sk) is key generation for sntrup761, and is provided by the stable API as sntrup761_keypair. Similar comments apply to enc and dec.
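Several of the byte-level descriptions above are short enough to model directly in Python. The following is a minimal sketch for experimenting with the definitions, not the code in autogen/test: the function names are chosen here for illustration, values are plain Python integers rather than uint8 or int16 arrays (so −1 is not stored as byte 255 except where noted), and the p argument of mult3 exists only so that tiny cases can be checked by hand.

```python
P, Q, W = 761, 4591, 286  # sntrup761 parameters quoted in the text

def small(b):
    """Byte -> small integer: residues 0, 1, 2, 3 modulo 4 mean 0, 1, 0, -1."""
    return (0, 1, 0, -1)[b & 3]

def decode_x3(s, n=P):
    """Model of crypto_decode_761x3: unpack n 2-bit fields, subtracting 1."""
    return [((s[i // 4] >> (2 * (i % 4))) & 3) - 1 for i in range(n)]

def encode_x3(x):
    """Model of crypto_encode_761x3 for inputs in {-1,0,1}: pack x[i]+1."""
    s = [0] * ((len(x) + 3) // 4)
    for i, xi in enumerate(x):
        s[i // 4] += (xi + 1) << (2 * (i % 4))
    return s

def freeze3(xi):
    """Formula from crypto_encode_761xfreeze3, before storing as a byte."""
    return xi - 3 * ((10923 * xi + 16384) >> 15)

def scale3(c):
    """One coefficient of crypto_core_scale3sntrup761 (input in -2295..2295)."""
    c = 3 * c - 2296
    if c < 0:
        c += Q
    if c < 0:
        c += Q
    return c - 2295

def wforce(x):
    """Model of crypto_core_wforcesntrup761 on a list of bytes."""
    if sum(b & 1 for b in x) == W:
        return list(x)
    return [1] * W + [0] * (len(x) - W)

def mult3(f, g, p=P):
    """Model of crypto_core_mult3sntrup761: product modulo x^p-x-1 modulo 3,
    returned as bytes 255, 0, 1."""
    prod = [0] * (2 * p - 1)
    for i in range(p):
        for j in range(p):
            prod[i + j] += small(f[i]) * small(g[j])
    for k in range(2 * p - 2, p - 1, -1):
        # x^p = x + 1, so x^k contributes to x^(k-p) and x^(k-p+1).
        prod[k - p] += prod[k]
        prod[k - p + 1] += prod[k]
    return [(0, 1, 255)[c % 3] for c in prod[:p]]
```

For inputs of moderate size (for example |x| below 8192), freeze3(x) is the representative of x modulo 3 in {−1,0,1}, and for c between −2295 and 2295, scale3(c) is the representative of 3c modulo 4591 between −2295 and 2295. With the toy value p = 3, mult3([0,0,1],[0,1,0],p=3) multiplies x^2 by x and returns [1,1,0], i.e., x+1.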
The functions crypto_sort_int32(x,n) and crypto_sort_uint32(x,n) take
time that depends on n but not on the contents of the x array.
Similarly, the crypto_hash* functions take time that depends on input
length but not on input contents. All other subroutines take constant
time. There is one use of "declassification" in crypto_kem/sntrup: a
rejection-sampling loop at the beginning of key generation enforces
invertibility mod 3.
As in SUPERCOP and NaCl, array lengths intentionally use long long,
not size_t. In libntruprime, as in lib25519 and libmceliece, array
lengths are signed.
Implementations
A single primitive can, and usually does, have multiple implementations.
Each implementation is in its own subdirectory. The implementations are
required to have exactly the same input-output behavior, and to some
extent this is tested, although it is not yet formally verified (except
for some components such as crypto_sort).
Different implementations typically offer different tradeoffs between
portability, simplicity, and efficiency. For example,
crypto_core/inv3sntrup761/bits64 is portable;
crypto_core/inv3sntrup761/avx is faster and less portable.
Each unportable implementation has an architectures file. Each line in
this file identifies a CPU instruction set (and ABI) where the
implementation works. For example,
crypto_core/inv3sntrup761/avx/architectures has two lines
amd64 avx2
x86 avx2
meaning that the implementation works on CPUs that have the Intel/AMD
64-bit or 32-bit instruction sets with the AVX2 instruction-set
extension. The top-level compilers directory shows (among other
things) the allowed instruction-set names such as avx2.
At run time, libntruprime checks the CPU where it is running, and selects
an implementation where architectures is compatible with that CPU.
Each primitive makes its own selection once per program startup, using
the compiler's ifunc mechanism (or constructor on platforms that do
not support ifunc). This type of run-time selection means, for
example, that an amd64 CPU without AVX2 can share binaries with an
amd64 CPU with AVX2. However, correctness requires instruction sets to
be preserved by migration across cores via the OS kernel, VM migration,
etc.
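The compatibility check between an architectures file and a CPU can be pictured as follows. This is a sketch under assumptions, not libntruprime's selection code (which is C, uses the supports() functions described below, and also consults benchmark-derived priorities): it assumes that the first word on each line names the base instruction set and that the remaining words are required extensions, and the function names are hypothetical.

```python
def parse_architectures(text):
    """Return (base, required_extensions) pairs, one per nonblank line."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if fields:
            entries.append((fields[0], set(fields[1:])))
    return entries

def compatible(text, base, extensions):
    """True if some line matches the CPU's base instruction set and asks
    only for extensions that the CPU provides."""
    have = set(extensions)
    return any(b == base and need <= have
               for b, need in parse_architectures(text))

# Contents of the architectures file from the example above.
avx_architectures = "amd64 avx2\nx86 avx2\n"
```

Under these assumptions, compatible(avx_architectures, "amd64", {"sse3", "avx", "avx2"}) is true, while an amd64 CPU without avx2 in its extension set is rejected and falls back to another implementation.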
The compiler has a target mechanism that makes an ifunc selection
based on CPU architectures. Instead of using the target mechanism,
libntruprime uses a more sophisticated mechanism that also accounts for
benchmarks collected in advance of compilation.
Compilers
libntruprime tries different C compilers for each implementation. For
example, compilers/default lists the following compilers:
clang -Wall -fPIC -fwrapv -Qunused-arguments -O2
gcc -Wall -fPIC -fwrapv -O3
Sometimes gcc produces better code, and sometimes clang produces
better code.
As another example, compilers/amd64+sse3+ssse3+sse41+popcnt+avx+bmi1+bmi2+avx2+fma
lists the following compilers:
clang -Wall -fPIC -fwrapv -Qunused-arguments -O2 -mmmx -msse -msse2 -msse3 -mssse3 -msse4.1 -msse4.2 -mavx -mbmi -mbmi2 -mpopcnt -mavx2 -mfma -mtune=skylake
gcc -Wall -fPIC -fwrapv -O3 -mmmx -msse -msse2 -msse3 -mssse3 -msse4.1 -msse4.2 -mavx -mbmi -mbmi2 -mpopcnt -mavx2 -mfma -mtune=skylake
The -mavx2 option tells these compilers that they are free to use the
AVX2 instruction-set extension.
Code compiled using the compilers in
compilers/amd64+sse3+ssse3+sse41+popcnt+avx+bmi1+bmi2+avx2+fma
will be considered at run time by the libntruprime selection mechanism if
the supports() function in
compilers/amd64+sse3+ssse3+sse41+popcnt+avx+bmi1+bmi2+avx2+fma.c
returns nonzero. This function checks whether the run-time CPU supports
AVX2 (and SSE3 and so on, and OSXSAVE with XMM/YMM being saved;
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85100
says that all versions of gcc until 2018 handled this incorrectly in
target). Similar comments apply to other compilers/* files.
If some compilers fail (for example, clang is not installed, or the compiler version is too old to support the compiler options used in libntruprime), the libntruprime compilation process will try its best to produce a working library using the remaining compilers, even if this means lower performance.
Trimming
By default, to reduce size of the compiled library, the libntruprime compilation process trims the library down to the implementations that are selected by libntruprime's selection mechanism.
For example, if the selection mechanism decides that CPUs with AVX2
should use invsntrup761/avx with clang and that other CPUs should
use invsntrup761/portable with gcc, then trimming will remove
invsntrup761/avx compiled with gcc and invsntrup761/portable
compiled with clang.
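The effect of trimming in this example can be written out as a small filter. This is only an illustration of the outcome, with hypothetical names and data structures; the actual trimming happens inside the libntruprime build at link time.

```python
def trim(compiled, selected):
    """Keep only the (implementation, compiler) builds that the selection
    mechanism picked for at least one class of CPU."""
    keep = set(selected.values())
    return [build for build in compiled if build in keep]

# Every implementation is compiled with every compiler.
compiled = [(impl, cc)
            for impl in ("invsntrup761/avx", "invsntrup761/portable")
            for cc in ("clang", "gcc")]

# Hypothetical selection matching the example in the text.
selected = {
    "CPUs with AVX2": ("invsntrup761/avx", "clang"),
    "other CPUs": ("invsntrup761/portable", "gcc"),
}

kept = trim(compiled, selected)
```

Here kept contains invsntrup761/avx with clang and invsntrup761/portable with gcc; the other two builds are dropped.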
This trimming is handled at link time rather than compile time to increase the chance that, even if some implementations are broken by compiler "upgrades", the library will continue to build successfully.
To avoid this trimming, pass the --no-trim option to ./configure.
All implementations that compile are then included in the library,
tested by ntruprime-test, and measured by ntruprime-speed. You'll want
to avoid trimming if you're adding new instruction sets or new
implementations (see below), so that you can run tests and benchmarks of
code that isn't selected yet.
How to recompile after changes
If you make changes under crypto_*, the fully supported recompilation
mechanism is to run ./configure again to clean and repopulate the
build directory, and then run make again to recompile everything.
This can be on the scale of seconds if you have enough cores, but maybe you're developing on a slower machine. Three options are currently available to accelerate the edit-compile cycle:
-
There is an experimental --no-clean option to ./configure that, for some simple types of changes, can produce a successful build without cleaning.
-
Running make without ./configure can work for some particularly simple types of changes. However, not all dependencies are currently expressed in Makefile, and some types of dependencies that ./configure understands would be difficult to express in the Makefile language.
-
You can disable the implementations you're not using by setting sticky bits on the source directories for those implementations: e.g., chmod +t crypto_*/*/avx.
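Because a sticky bit is easy to forget, it can help to list which directories currently have one before collecting benchmarks. The helper below is a hypothetical convenience sketch, not part of libntruprime; it simply walks a source tree and reports directories whose mode includes the sticky bit.

```python
import os
import stat

def disabled_implementations(root):
    """Return the directories under root that have the sticky bit set,
    as sorted paths relative to root."""
    found = []
    for dirpath, dirnames, _ in os.walk(root):
        for name in dirnames:
            path = os.path.join(dirpath, name)
            if os.stat(path).st_mode & stat.S_ISVTX:
                found.append(os.path.relpath(path, root))
    return sorted(found)
```

Running disabled_implementations(".") from the top-level directory after chmod +t crypto_*/*/avx would be expected to list those avx directories; chmod -t on the same paths reenables them.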
Make sure to reenable all implementations and do a full clean build if
you're collecting data to add to the source benchmarks directory.
How to add new instruction sets
Adding another file compilers/amd64+foo, along with a supports()
implementation in compilers/amd64+foo.c, will support a new
instruction set. Do not assume that the new foo instruction set
implies support for older instruction sets (the idea of "levels" of
instruction sets); instead make sure to include the older instruction
sets in + tags, as illustrated by
compilers/amd64+sse3+ssse3+sse41+popcnt+avx+bmi1+bmi2+avx2+fma.
In the compiler options, always make sure to include -fPIC to support
shared libraries, and -fwrapv to switch to a slightly less dangerous
version of C.
The foo tags don't have to be instruction sets. For example, if a CPU
has the same instruction set but wants different optimizations because
of differences in instruction timings, you can make a tag for those
optimizations, using, e.g., CPU IDs or benchmarks in the corresponding
supports() function to decide whether to enable those optimizations.
Benchmarks tend to be more future-proof than a list of CPU IDs, but the
time taken for benchmarks at program startup has to be weighed against
the subsequent speedup from the resulting optimizations.
To see how well libntruprime performs with the new compilers, run
ntruprime-speed on the target machine and look for the foo lines in
the output. If the new performance is better than the performance shown
on the selected lines:
-
Copy the ntruprime-speed output into a file in the benchmarks directory, typically named after the hostname of the target machine.
-
Run ./prioritize in the top-level directory to create priority files. These files tell libntruprime which implementations to select for any given architecture.
-
Reconfigure (again with --no-trim), recompile, rerun ntruprime-test, and rerun ntruprime-speed to check that the selected lines now use the foo compiler.
If the foo implementation is outperformed by other implementations,
then these steps don't help except for documenting this fact. The same
implementation might turn out to be useful for subsequent foo CPUs.
How to add new implementations
Taking full advantage of the foo instruction set usually requires
writing new implementations. Sometimes there are also ideas for taking
better advantage of existing instruction sets.
Structurally, adding a new implementation of a primitive is a simple
matter of adding a new subdirectory with the code for that
implementation. Most of the work is optimizing the use of foo
intrinsics in .c files or foo instructions in .S files. Make sure
to include an architectures file saying, e.g., amd64 avx2 foo.
Names of implementation directories can use letters, digits, dashes, and underscores. Do not use two implementation names that are the same when dashes and underscores are removed.
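The naming rule can be checked mechanically before adding a directory. The sketch below is a hypothetical helper, not a script shipped with libntruprime, and it assumes that "letters" means ASCII letters.

```python
import re

def is_valid_name(name):
    """True if name uses only letters, digits, dashes, and underscores."""
    return re.fullmatch(r"[A-Za-z0-9_-]+", name) is not None

def clashes(names):
    """Group names that become identical once dashes and underscores are
    removed; return only the groups with more than one member."""
    groups = {}
    for name in names:
        key = name.replace("-", "").replace("_", "")
        groups.setdefault(key, []).append(name)
    return [group for group in groups.values() if len(group) > 1]
```

For example, clashes(["avx_v2", "avx-v2", "portable"]) reports avx_v2 and avx-v2 as a clashing pair, so one of them would need a different name.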
All .c and .S files in the implementation directory are compiled and
linked. There is no need to edit a separate list of these files. You can
also use .h files via the C preprocessor.
If an implementation is actually more restrictive than indicated in
architectures then the resulting compiled library will fail on some
machines (although perhaps that implementation will not be used by
default). Putting unnecessary restrictions into architectures will not
create such failures, but can unnecessarily limit performance.
Some, but not all, mistakes in architectures will produce warnings
from the checkinsns script that runs automatically when libntruprime is
compiled. Running the ntruprime-test program tries all implementations,
but only on the CPU where ntruprime-test is being run; also,
ntruprime-test does not guarantee code coverage.
amd64 implies little-endian, and implies architectural support for
unaligned loads and stores. Beware, however, that the Intel/AMD
vectorized load/store intrinsics (and the underlying movdqa
instruction) require alignment; if in doubt, use loadu/storeu (and
movdqu). The ntruprime-test program checks unaligned inputs and
outputs, but can miss issues with unaligned stack variables.
To test your implementation, compile everything, check for compiler
warnings and errors, run ntruprime-test (or just ntruprime-test xof to
test a crypto_xof implementation), and check for a line saying all
tests succeeded. To use AddressSanitizer (for catching, at run time,
buffer overflows in C code), add -fsanitize=address to the gcc and
clang lines in compilers/*; you may also have to add return; at
the beginning of the limits() function in command/limits.inc.
To see the performance of your implementation, run ntruprime-speed. If
the new performance is better than the performance shown on the
selected lines, follow the same steps as for a new instruction set:
copy the ntruprime-speed output into a file in the benchmarks
directory; run ./prioritize in the top-level directory to create
priority files; reconfigure (again with --no-trim); recompile; rerun
ntruprime-test; rerun ntruprime-speed; check that the selected lines
now use the new implementation.
How to handle namespacing
As in SUPERCOP and NaCl, to call crypto_sort_int32(), you have to
include crypto_sort_int32.h; but to write an implementation of
crypto_sort_int32(), you have to instead include crypto_sort.h and
define crypto_sort. Similar comments apply to other primitives.
The function name that's actually linked might end up as, e.g.,
libntruprime_sort_int32_avx2_C2 where avx2 indicates the
implementation and C2 indicates the compiler. Don't try to build this
name into your implementation.
If you have another global symbol x (for example, a non-static
function in a .c file, or a non-static variable outside functions in
a .c file), you have to replace it with CRYPTO_NAMESPACE(x), for
example with #define x CRYPTO_NAMESPACE(x).
For global symbols in .S files and shared-*.c files, use
CRYPTO_SHARED_NAMESPACE instead of CRYPTO_NAMESPACE. For .S files
that define both x and _x to handle platforms where x in C is _x
in assembly, use CRYPTO_SHARED_NAMESPACE(x) and
_CRYPTO_SHARED_NAMESPACE(x); CRYPTO_SHARED_NAMESPACE(_x) is not
sufficient.
libntruprime includes a mechanism to recognize files that are copied
across implementations (possibly of different primitives) and to unify
those into a file compiled only once, reducing the overall size of the
compiled library and possibly improving cache utilization. To request
this mechanism, include a line // linker define x for any global
symbol x defined in the file, and a line // linker use x for any
global symbol x used in the file from the same implementation (not
crypto_* subroutines that you're calling, randombytes, etc.). This
mechanism tries very hard, perhaps too hard, to avoid improperly
unifying files: for example, even a slight difference in a .h file
included by a file defining a used symbol will disable the mechanism.
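To make the annotation format concrete, the following sketch collects the symbols that a source file declares through these comment lines. It is an illustration only: the function name is hypothetical, the exact matching rules (for example, how whitespace is treated) are assumptions rather than the behavior of libntruprime's own scripts, and the symbol names in the sample are invented.

```python
def linker_annotations(source):
    """Return (defined, used) symbol lists taken from comment lines of the
    form '// linker define x' and '// linker use x'."""
    defined, used = [], []
    for line in source.splitlines():
        words = line.split()
        if len(words) == 4 and words[:2] == ["//", "linker"]:
            if words[2] == "define":
                defined.append(words[3])
            elif words[2] == "use":
                used.append(words[3])
    return defined, used

# Invented sample file: helper_reduce is defined here, and helper_table
# comes from another file of the same implementation.
sample = """\
// linker define helper_reduce
// linker use helper_table
#define helper_reduce CRYPTO_SHARED_NAMESPACE(helper_reduce)
#define helper_table CRYPTO_SHARED_NAMESPACE(helper_table)
"""
```

For this sample, linker_annotations(sample) returns (["helper_reduce"], ["helper_table"]).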
Typical namespacing mistakes will produce either linker failures or
warnings from the checknamespace script that runs automatically when
libntruprime is compiled.
Version: This is version 2024.08.18 of the "Internals" web page.