-r--r--r-- 34470 libntruprime-20240910/doc/html/internals.html raw<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<style type="text/css">
html{overflow-y:scroll;background-color:#004591}
body{font-family:"Noto Sans","Droid Sans","DejaVu Sans","Arial",sans-serif;line-height:1.5}
tt,code{background-color:#f0f0f0;font-family:"Noto Sans Mono","Droid Sans Mono","DejaVu Sans Mono","Courier New",monospace,sans-serif;font-size:1em;}
pre{margin-left:3em}
p,ul,ol,blockquote,pre{font-size:1.0em;line-height:1.6}
li p{font-size:1.0em}
blockquote p{font-size:1.0em}
h1{font-size:1.5em}
h2{font-size:1.3em}
h3{font-size:1.0em}
h1 a{text-decoration:none}
table{border-collapse:collapse}
th,td{border:1px solid black}
table a{text-decoration:none}
table tr{font-size:1.0em;line-height:1.6em}
table tr{font-size:1.0em;line-height:1.5}
tbody tr:nth-child(12n+1){background-color:#f0ffff}
tbody tr:nth-child(12n+2){background-color:#f0ffff}
tbody tr:nth-child(12n+3){background-color:#f0ffff}
tbody tr:nth-child(12n+4){background-color:#f0ffff}
tbody tr:nth-child(12n+5){background-color:#f0ffff}
tbody tr:nth-child(12n+6){background-color:#f0ffff}
tbody tr:nth-child(12n+7){background-color:#fffff0}
tbody tr:nth-child(12n+8){background-color:#fffff0}
tbody tr:nth-child(12n+9){background-color:#fffff0}
tbody tr:nth-child(12n+10){background-color:#fffff0}
tbody tr:nth-child(12n+11){background-color:#fffff0}
tbody tr:nth-child(12n+12){background-color:#fffff0}
.headline{padding:0;font-weight:bold;font-size:1.0em;vertical-align:top;padding-bottom:0.5em;color:#ffffff;background-color:#004591}
.navt{display:block;box-sizing:border-box;-moz-box-sizing:border-box;-webkit-box-sizing:border-box;margin:0;padding:0;vertical-align:center;font-size:1.0em}
.here{background-color:#004591}
.here{color:#ffffff}
.away{background-color:#004591}
.away a{text-decoration:none;display:block;color:#ffffff}
.away a:hover,.away a:active{text-decoration:underline}
.main{padding:5px}
.main{background-color:#ffffff}
.pagetitle{font-size:1.4em;font-weight:bold}
@media only screen and (min-width:512px) {
  .fixed{margin:0;padding:0;width:160px;height:100%;position:fixed;overflow:auto}
  .main{margin-left:170px}
}
</style>
<title>
libntruprime: Internals</title>
</head>
<body>
<div class=fixed>
<div class=headline>
libntruprime</div>
<div class="navt away"><a href=index.html>Intro</a>
</div><div class="navt away"><a href=download.html>Download</a>
</div><div class="navt away"><a href=install.html>Install</a>
</div><div class="navt away"><a href=test.html>Test</a>
</div><div class="navt away"><a href=api.html>API</a>
</div><div class="navt away"><a href=cli.html>CLI</a>
</div><div class="navt away"><a href=security.html>Security</a>
</div><div class="navt away"><a href=verification.html>Verification</a>
</div><div class="navt away"><a href=speed.html>Speed</a>
</div><div class="navt here">Internals
</div><div class="navt away"><a href=people.html>People</a>
</div><div class="navt away"><a href=license.html>License</a>
</div></div>
<div class=main>
<div class=pagetitle>libntruprime: Internals</div>
<p>This document explains the internal structure of libntruprime, and explains
how to add new instruction sets and new implementations. The libntruprime
infrastructure is adapted from the infrastructure used in lib25519 and
libmceliece.</p>
<h3>Code generation</h3>
<p>Portions of the distributed libntruprime package were automatically
generated by scripts in the <code>autogen</code> directory, using (among other
things) source in the <code>src</code> directory.</p>
<p>For example, <code>autogen/src</code> converts the <code>src/kem/sntrupP</code> directory into
<code>crypto_kem/sntrup{653,761,857,953,1013,1277}</code>, converts the
the <code>src/core/multsntrupP</code> directory into
<code>crypto_core/multsntrup{653,761,857,953,1013,1277}</code>, etc. (The
<code>crypto_hash*</code> directories are not auto-generated.) This structure
means that source code is shared across all of the <code>sntrup</code> sizes, while
allowing per-size specialization of the compiled code, reducing the
in-memory code size for the typical case of an application using just
one size.</p>
<p>The installation process (<code>./configure</code> etc.) does not run the <code>autogen</code>
scripts. Developers have a choice of two development cycles:</p>
<ul>
<li>
<p>Cycle 1: Modify files in, e.g., <code>src/core/multsntrupP</code>; run
  <code>autogen/src</code>; install; test; repeat.</p>
</li>
<li>
<p>Cycle 2: Modify files in, e.g., <code>crypto_core/multsntrup653</code>; install;
  test; repeat. Once <code>crypto_core/multsntrup653</code> is working, generalize
  to <code>src/core/multsntrupP</code> and switch to cycle 1.</p>
</li>
</ul>
<h3>Primitives</h3>
<p>The <code>crypto_kem/sntrup*</code> directories inside libntruprime are intended to
compute exactly the KEM primitives defined by the
<a href="https://ntruprime.cr.yp.to/software.html">NTRU Prime Sage reference implementation</a>.</p>
<p>Internally, the implementations rely on lower-level subroutines defined
in further <code>crypto_*/*</code> directories. This structuring provides smaller
targets for optimization, testing, and verification.</p>
<p>The subroutines are intended to compute the primitives defined in Python
inside libntruprime's <code>autogen/test</code> Python script. For example, the
<code>crypto_core/multsntrup*</code> functions are intended to match the Python
function <code>core_multsntrup</code>. The <code>autogen/test</code> script creates a test
program, <code>ntruprime-test</code>, which checks libntruprime's subroutines
against some outputs of the Python functions and against some SUPERCOP
"checksums" (hashes of outputs for various inputs).</p>
<p>As a concrete introduction to the subroutines, the list below describes
specifically the primitives used for <code>sntrup761</code>. Various numbers appear
in these descriptions. The <code>sntrup761</code> parameters are p = 761, q = 4591,
and w = 286. Related numbers appearing below are (p+3)/4 = 191;
(q+2)/3 = 1531; (q−1)/2 = 2295; 1007, the number of bytes produced by
the NTRU Prime <code>Encode</code> function for <code>M = 761*[1531]</code>; 1039 = 1007 + 32;
and 1158, the number of bytes produced by <code>Encode</code> for <code>M = 761*[4591]</code>.
The constant 32 is independent of the parameter set, as is the constant
10923 = 32769/3 appearing below. The 1039 and 1158 here are also the
ciphertext size and public-key size for <code>sntrup761</code>. See <code>autogen/test</code>
for the <code>Encode</code> function and a corresponding <code>Decode</code> function; these
are copied from Figures 1 and 2 in the NTRU Prime specification.</p>
<p>Here are the primitives:</p>
<ul>
<li>
<p><code>crypto_verify/1039</code>:
  <code>crypto_verify_1039(s,t)</code> returns 0 when the 1039-byte arrays <code>s</code> and
  <code>t</code> are equal, otherwise <code>-1</code>.</p>
</li>
<li>
<p><code>crypto_decode/int16</code>:
  <code>crypto_decode_int16(x,s)</code>, where <code>x</code> is a <code>uint16[1]</code> array and <code>s</code>
  is a <code>uint8[2]</code> array, sets <code>x[0]</code> to the <code>uint16</code> whose little-endian
  encoding is <code>s[0],s[1]</code>.</p>
</li>
<li>
<p><code>crypto_decode/761xint16</code>:
  <code>crypto_decode_761xint16(x,s)</code>, where <code>x</code> is a <code>uint16[761]</code> array and
  <code>s</code> is a <code>uint8[2*761]</code> array, sets each <code>x[i]</code> to the <code>uint16</code> whose
  little-endian encoding is <code>s[2*i],s[2*i+1]</code>.</p>
</li>
<li>
<p><code>crypto_decode/761xint32</code>:
  <code>crypto_decode_761xint16(x,s)</code>, where <code>x</code> is a <code>uint32[761]</code> array and
  <code>s</code> is a <code>uint8[4*761]</code> array, sets each <code>x[i]</code> to the <code>uint32</code> whose
  little-endian encoding is <code>s[4*i],s[4*i+1],s[4*i+2],s[4*i+3]</code>.</p>
</li>
<li>
<p><code>crypto_decode/761x3</code>:
  <code>crypto_decode_761x3(x,s)</code>, where <code>x</code> is a <code>uint8[761]</code> array and <code>s</code>
  is a <code>uint8[191]</code> array,
  sets <code>x[0]</code> to <code>(s[0]&amp;3)-1</code>,
  sets <code>x[1]</code> to <code>((s[0]&gt;&gt;2)&amp;3)-1</code>,
  sets <code>x[2]</code> to <code>((s[0]&gt;&gt;4)&amp;3)-1</code>,
  sets <code>x[3]</code> to <code>((s[0]&gt;&gt;6)&amp;3)-1</code>,
  sets <code>x[4]</code> to <code>(s[1]&amp;3)-1</code>,
  etc.</p>
</li>
<li>
<p><code>crypto_decode/761x1531</code>:
  <code>crypto_decode_761x1531(x,s)</code>, where <code>x</code> is an <code>int16[761]</code> array and
  <code>s</code> is a <code>uint8[1007]</code> array, applies <code>Decode</code> to convert <code>s</code> into 761
  integers between 0 and 1530 (i.e., the <code>M</code> input to the function is
  <code>761*[1531]</code>), and then multiplies each integer by 3 and subtracts
  2295 to obtain <code>x</code>.</p>
</li>
<li>
<p><code>crypto_decode/761x4591</code>:
  <code>crypto_decode_761x4591(x,s)</code>, where <code>x</code> is an <code>int16[761]</code> array and
  <code>s</code> is a <code>uint8[1158]</code> array, applies <code>Decode</code> to convert <code>s</code> into 761
  integers between 0 and 4590 (i.e., the <code>M</code> input to the function is
  <code>761*[4591]</code>), and then subtracts 2295 from each integer to obtain <code>x</code>.</p>
</li>
<li>
<p><code>crypto_encode/int16</code>:
  <code>crypto_encode_int16(s,x)</code>, where <code>s</code> is a <code>uint8[2]</code> array and <code>x</code>
  is a <code>uint16[1]</code> array, sets <code>s[0],s[1]</code> to the little-endian encoding
  of <code>x[0]</code>.</p>
</li>
<li>
<p><code>crypto_encode/761x3</code>:
  <code>crypto_encode_761x3(s,x)</code>, where <code>s</code> is a <code>uint8[191]</code> array and <code>x</code>
  is a <code>uint8[761]</code> array,
  sets <code>s[0]</code> to <code>(x[0]+1)+4*(x[1]+1)+16*(x[2]+1)+64*(x[3]+1)</code>,
  sets <code>s[1]</code> to <code>(x[4]+1)+4*(x[5]+1)+16*(x[6]+1)+64*(x[7]+1)</code>,
  ...,
  sets <code>s[189]</code> to <code>(x[756]+1)+4*(x[757]+1)+16*(x[758]+1)+64*(x[759]+1)</code>,
  and sets <code>s[190]</code> to <code>x[760]+1</code>.</p>
</li>
<li>
<p><code>crypto_encode/761xfreeze3</code>:
  <code>crypto_encode_761xfreeze3(s,x)</code>, where <code>s</code> is a <code>uint8[761]</code> array
  and <code>x</code> is an <code>int16[761]</code> array, sets each <code>s[i]</code> to
  <code>x[i]-3*((10923*x[i]+16384)&gt;&gt;15)</code>.</p>
</li>
<li>
<p><code>crypto_encode/761x1531</code>:
  <code>crypto_encode_761x1531(s,x)</code>, where <code>s</code> is a <code>uint8[1007]</code> array and
  <code>x</code> is an <code>int16[761]</code> array, sets <code>s</code> to <code>Encode(R,M)</code>, where <code>M</code> is
  <code>761*[1531]</code> and <code>R[i]</code> is <code>(((x[i]+2295)&amp;16383)*10923)&gt;&gt;15</code>. (The way
  this is used in <code>sntrup761</code> has <code>R[i]</code> below <code>M[i]</code>; however, tests
  include larger <code>R[i]</code>.)</p>
</li>
<li>
<p><code>crypto_encode/761x1531round</code>:
  <code>crypto_encode_761x1531round(s,x)</code>, where <code>s</code> is a <code>uint8[1007]</code> array
  and <code>x</code> is an <code>int16[761]</code> array, is the same as
  <code>crypto_encode_761x1531(s,y)</code>,
  where <code>y[i] = 3*((10923*x[i]+16384)&gt;&gt;15)</code>.</p>
</li>
<li>
<p><code>crypto_encode/761x4591</code>:
  <code>crypto_encode_761x4591(s,x)</code>, where <code>s</code> is a <code>uint8[1158]</code> array and
  <code>x</code> is an <code>int16[761]</code> array, sets <code>s</code> to <code>Encode(R,M)</code>, where <code>M</code> is
  <code>761*[4591]</code> and <code>R[i]</code> is <code>(x[i]+2295)&amp;16383</code>. (The way this is used
  in <code>sntrup761</code> has <code>R[i]</code> below <code>M[i]</code>; however, tests include larger
  <code>R[i]</code>.)</p>
</li>
<li>
<p><code>crypto_sort/int32</code>: <code>crypto_sort_int32(x,n)</code> sorts the <code>int32</code>
  values <code>x[0]</code>, <code>x[1]</code>, ..., <code>x[n-1]</code>.</p>
</li>
<li>
<p><code>crypto_sort/uint32</code>: <code>crypto_sort_uint32(x,n)</code> sorts the <code>uint32</code>
  values <code>x[0]</code>, <code>x[1]</code>, ..., <code>x[n-1]</code>.</p>
</li>
<li>
<p><code>crypto_core/inv3sntrup761</code>:
  <code>crypto_core_inv3sntrup761(h,f,0,0)</code> sets polynomial <code>h</code> to the
  reciprocal of polynomial <code>f</code> modulo x<sup>761</sup>−x−1 modulo 3. The
  polynomial <code>f</code> is expressed as a <code>uint8[761]</code> array where bytes that
  reduce to 0, 1, 2, 3 modulo 4 are interpreted as the small integers
  0, 1, 0, −1 respectively; these integers in {−1,0,1} are then
  interpreted as the coefficients of x<sup>0</sup>, x<sup>1</sup>, etc. in that order.
  Coefficients −1, 0, 1 in <code>1/f</code> modulo 3 are converted to bytes
  255, 0, 1 in array <code>h</code>. There is then a final byte 0 indicating that
  the reciprocal exists, so <code>h</code> is a <code>uint8[762]</code> array. If <code>f</code> does not
  have a reciprocal then the <code>h</code> array is instead set to 761 bytes 0
  followed by final byte 255.</p>
</li>
<li>
<p><code>crypto_core/invsntrup761</code>:
  <code>crypto_core_invsntrup761(h,f,0,0)</code> sets polynomial <code>h</code> to the
  reciprocal of <code>3f</code> modulo x<sup>761</sup>−x−1 modulo 4591. The input polynomial
  <code>f</code> is expressed as an <code>int8[761]</code> array; the array entries in
  {−128,...,127} are interpreted as the coefficients of x<sup>0</sup>, x<sup>1</sup>, etc.
  in that order. Each coefficient of the reciprocal of <code>3f</code> is reduced
  modulo q to the range −(q−1)/2 through (q−1)/2 and then encoded as 2
  bytes in little-endian form in <code>h</code>. There is then a final byte 0
  indicating that the reciprocal exists, so <code>h</code> is a <code>uint8[2*761+1]</code>
  array. If <code>3f</code> does not have a reciprocal then the <code>h</code> array is
  instead set to <code>2*761</code> bytes 0 followed by final byte 255.</p>
</li>
<li>
<p><code>crypto_core/mult3sntrup761</code>:
  <code>crypto_core_multsntrup761(h,f,g,0)</code> sets <code>h</code> to the product of two
  small-coefficient polynomials <code>f</code> and <code>g</code> modulo x<sup>761</sup>−x−1 modulo 3.
  The polynomial <code>f</code> is expressed as a <code>uint8[761]</code> array where bytes
  that reduce to 0, 1, 2, 3 modulo 4 are interpreted as the small
  integers 0, 1, 0, −1 respectively; these integers in {−1,0,1} are then
  interpreted as the coefficients of x<sup>0</sup>, x<sup>1</sup>, etc. in that order. The
  polynomial <code>g</code> is expressed the same way. Each coefficient of the
  product modulo x<sup>761</sup>−x−1 is reduced modulo 3 to the range −1, 0, 1,
  and then stored in the <code>uint8[761]</code> array <code>h</code> as byte 255, 0, 1
  respectively.</p>
</li>
<li>
<p><code>crypto_core/multsntrup761</code>:
  <code>crypto_core_multsntrup761(h,f,g,0)</code> sets <code>h</code> to the product of a
  polynomial <code>f</code> and a small-coefficient polynomial <code>g</code> modulo
  x<sup>761</sup>−x−1 modulo q. The polynomial <code>f</code> is expressed as a
  <code>uint8[2*761]</code> array storing 761 <code>int16</code> values in little-endian form.
  The polynomial <code>g</code> is expressed as a <code>uint8[761]</code> array, where bytes
  that reduce to 0, 1, 2, 3 modulo 4 are interpreted as the small
  integers 0, 1, 0, −1 respectively. Each coefficient of the product
  modulo x<sup>761</sup>−x−1 is reduced modulo q to the range −(q−1)/2 through
  (q−1)/2, and then encoded as 2 bytes in little-endian form in <code>h</code>,
  which is a <code>uint8[2*761]</code> array.</p>
</li>
<li>
<p><code>crypto_core/scale3sntrup761</code>:
  <code>crypto_core_scale3sntrup761(h,f,0,0)</code> transforms a polynomial <code>f</code> to
  a polynomial <code>h</code>. Each polynomial is expressed as a <code>uint8[2*761]</code>
  array storing 761 <code>int16</code> values in little-endian form. Each <code>f</code>
  coefficient is transformed to the corresponding <code>h</code> coefficient with
  the following sequence of <code>int16</code> operations (all intermediate results
  reduced to <code>int16</code>): multiply by 3; subtract 2296; if negative, add
  4591; if negative, add 4591; subtract 2295.</p>
</li>
<li>
<p><code>crypto_core/weightsntrup761</code>:
  <code>crypto_core_weightsntrup761(y,x,0,0)</code>, where <code>y</code> is a <code>uint8[2]</code>
  array and <code>x</code> is a <code>uint8[761]</code> array, sets <code>y[0],y[1]</code> to the
  little-endian encoding of the sum of the 761 bits <code>x[0]&amp;1,...,x[760]&amp;1</code>.</p>
</li>
<li>
<p><code>crypto_core/wforcesntrup761</code>:
  <code>crypto_core_wforcesntrup761(y,x,0,0)</code>, where <code>y</code> is a <code>uint8[761]</code>
  array and <code>x</code> is a <code>uint8[761]</code> array, sets <code>y</code> to a copy of <code>x</code> if
  the little-endian encoding of the sum of the 761 bits <code>x[0]&amp;1,...,x[760]&amp;1</code>
  is 286. Otherwise it sets <code>y</code> to 286 bytes equal to <code>1</code> followed by
  761−286 bytes equal to <code>0</code>.</p>
</li>
<li>
<p><code>crypto_hashblocks/sha512</code>: <code>crypto_hashblocks_sha512(h,x,xlen)</code>
  updates an intermediate SHA-512 hash <code>h</code> using all of the full
  128-byte blocks at the beginning of the <code>xlen</code>-byte array <code>x</code>, and
  returns the number of bytes left over, namely <code>xlen</code> mod 128.</p>
</li>
<li>
<p><code>crypto_hash/sha512</code>: <code>crypto_hash_sha512(h,x,xlen)</code> computes the
  SHA-512 hash <code>h</code> of the <code>xlen</code>-byte array <code>x</code>.</p>
</li>
<li>
<p><code>crypto_kem/sntrup761</code>:
  <code>crypto_kem_sntrup761_keypair(pk,sk)</code> is key generation for <code>sntrup761</code>,
  and is provided by the
  <a href="api.html">stable API</a>
  as <code>sntrup761_keypair</code>. Similar comments apply to <code>enc</code> and <code>dec</code>.</p>
</li>
</ul>
<p>The functions <code>crypto_sort_int32(x,n)</code> and <code>crypto_sort_uint32(x,n)</code>
take time that depends on <code>n</code> but not on the contents of the <code>x</code> array.
Similarly, the <code>crypto_hash*</code> functions take time that depends on input
length but not on input contents. All other subroutines take constant
time. There is one use of "declassification" in <code>crypto_kem/sntrup</code>: a
rejection-sampling loop at the beginning of key generation enforces
invertibility mod 3.</p>
<p>As in SUPERCOP and NaCl, array lengths intentionally use <code>long long</code>,
not <code>size_t</code>. In libntruprime, as in lib25519 and libmceliece, array
lengths are signed.</p>
<h3>Implementations</h3>
<p>A single primitive can, and usually does, have multiple implementations.
Each implementation is in its own subdirectory. The implementations are
required to have exactly the same input-output behavior, and to some
extent this is tested, although it is not yet formally verified (except
for some components such as <code>crypto_sort</code>).</p>
<p>Different implementations typically offer different tradeoffs between
portability, simplicity, and efficiency. For example,
<code>crypto_core/inv3sntrup761/bits64</code> is portable;
<code>crypto_core/inv3sntrup761/avx</code> is faster and less portable.</p>
<p>Each unportable implementation has an <code>architectures</code> file. Each line in
this file identifies a CPU instruction set (and ABI) where the
implementation works. For example,
<code>crypto_core/inv3sntrup761/avx/architectures</code>
has two lines</p>
<pre><code>amd64 avx2
x86 avx2
</code></pre>
<p>meaning that the implementation works on CPUs that have the Intel/AMD
64-bit or 32-bit instruction sets with the AVX2 instruction-set
extension. The top-level <code>compilers</code> directory shows (among other
things) the allowed instruction-set names such as <code>avx2</code>.</p>
<p>At run time, libntruprime checks the CPU where it is running, and selects
an implementation where <code>architectures</code> is compatible with that CPU.
Each primitive makes its own selection once per program startup, using
the compiler's <code>ifunc</code> mechanism (or <code>constructor</code> on platforms that do
not support <code>ifunc</code>). This type of run-time selection means, for
example, that an <code>amd64</code> CPU without AVX2 can share binaries with an
<code>amd64</code> CPU with AVX2. However, correctness requires instruction sets to
be preserved by migration across cores via the OS kernel, VM migration,
etc.</p>
<p>The compiler has a <code>target</code> mechanism that makes an <code>ifunc</code> selection
based on CPU architectures. Instead of using the <code>target</code> mechanism,
libntruprime uses a more sophisticated mechanism that also accounts for
benchmarks collected in advance of compilation.</p>
<h3>Compilers</h3>
<p>libntruprime tries different C compilers for each implementation. For
example, <code>compilers/default</code> lists the following compilers:</p>
<pre><code>clang -Wall -fPIC -fwrapv -Qunused-arguments -O2
gcc -Wall -fPIC -fwrapv -O3
</code></pre>
<p>Sometimes <code>gcc</code> produces better code, and sometimes <code>clang</code> produces
better code.</p>
<p>As another example, <code>compilers/amd64+sse3+ssse3+sse41+popcnt+avx+bmi1+bmi2+avx2+fma</code>
lists the following compilers:</p>
<pre><code>clang -Wall -fPIC -fwrapv -Qunused-arguments -O2 -mmmx -msse -msse2 -msse3 -mssse3 -msse4.1 -msse4.2 -mavx -mbmi -mbmi2 -mpopcnt -mavx2 -mfma -mtune=skylake
gcc -Wall -fPIC -fwrapv -O3 -mmmx -msse -msse2 -msse3 -mssse3 -msse4.1 -msse4.2 -mavx -mbmi -mbmi2 -mpopcnt -mavx2 -mfma -mtune=skylake
</code></pre>
<p>The <code>-mavx2</code> option tells these compilers that they are free to use the
AVX2 instruction-set extension.</p>
<p>Code compiled using the compilers in
<code>compilers/amd64+sse3+ssse3+sse41+popcnt+avx+bmi1+bmi2+avx2+fma</code>
will be considered at run time by the libntruprime selection mechanism if
the <code>supports()</code> function in
<code>compilers/amd64+sse3+ssse3+sse41+popcnt+avx+bmi1+bmi2+avx2+fma.c</code>
returns nonzero. This function checks whether the run-time CPU supports
AVX2 (and SSE3 and so on, and OSXSAVE with XMM/YMM being saved;
<a href="https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85100">https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85100</a>
says that all versions of gcc until 2018 handled this incorrectly in
<code>target</code>). Similar comments apply to other <code>compilers/*</code> files.</p>
<p>If some compilers fail (for example, clang is not installed, or the
compiler version is too old to support the compiler options used in
libntruprime), the libntruprime compilation process will try its best to
produce a working library using the remaining compilers, even if this
means lower performance.</p>
<h3>Trimming</h3>
<p>By default, to reduce size of the compiled library, the libntruprime
compilation process trims the library down to the implementations that
are selected by libntruprime's selection mechanism.</p>
<p>For example, if the selection mechanism decides that CPUs with AVX2
should use <code>invsntrup761/avx</code> with <code>clang</code> and that other CPUs should
use <code>invsntrup761/portable</code> with <code>gcc</code>, then trimming will remove
<code>invsntrup761/avx</code> compiled with <code>gcc</code> and <code>invsntrup761/portable</code>
compiled with <code>clang</code>.</p>
<p>This trimming is handled at link time rather than compile time to
increase the chance that, even if some implementations are broken by
compiler "upgrades", the library will continue to build successfully.</p>
<p>To avoid this trimming, pass the <code>--no-trim</code> option to <code>./configure</code>.
All implementations that compile are then included in the library,
tested by <code>ntruprime-test</code>, and measured by <code>ntruprime-speed</code>. You'll want
to avoid trimming if you're adding new instruction sets or new
implementations (see below), so that you can run tests and benchmarks of
code that isn't selected yet.</p>
<h3>How to recompile after changes</h3>
<p>If you make changes under <code>crypto_*</code>, the fully supported recompilation
mechanism is to run <code>./configure</code> again to clean and repopulate the
build directory, and then run <code>make</code> again to recompile everything.</p>
<p>This can be on the scale of seconds if you have enough cores, but maybe
you're developing on a slower machine. Three options are currently
available to accelerate the edit-compile cycle:</p>
<ul>
<li>
<p>There is an experimental <code>--no-clean</code> option to <code>./configure</code> that,
     for some simple types of changes, can produce a successful build
     without cleaning.</p>
</li>
<li>
<p>Running <code>make</code> without <code>./configure</code> can work for some particularly
     simple types of changes. However, not all dependencies are
     currently expressed in <code>Makefile</code>, and some types of dependencies
     that <code>./configure</code> understands would be difficult to express in the
     <code>Makefile</code> language.</p>
</li>
<li>
<p>You can disable the implementations you're not using by setting
     sticky bits on the source directories for those implementations:
     e.g., <code>chmod +t crypto_*/*/avx</code>.</p>
</li>
</ul>
<p>Make sure to reenable all implementations and do a full clean build if
you're collecting data to add to the source <code>benchmarks</code> directory.</p>
<h3>How to add new instruction sets</h3>
<p>Adding another file <code>compilers/amd64+foo</code>, along with a <code>supports()</code>
implementation in <code>compilers/amd64+foo.c</code>, will support a new
instruction set. Do not assume that the new <code>foo</code> instruction set
implies support for older instruction sets (the idea of "levels" of
instruction sets); instead make sure to include the older instruction
sets in <code>+</code> tags, as illustrated by
<code>compilers/amd64+sse3+ssse3+sse41+popcnt+avx+bmi1+bmi2+avx2+fma</code>.</p>
<p>In the compiler options, always make sure to include <code>-fPIC</code> to support
shared libraries, and <code>-fwrapv</code> to switch to a slightly less dangerous
version of C.</p>
<p>The <code>foo</code> tags don't have to be instruction sets. For example, if a CPU
has the same instruction set but wants different optimizations because
of differences in instruction timings, you can make a tag for those
optimizations, using, e.g., CPU IDs or benchmarks in the corresponding
<code>supports()</code> function to decide whether to enable those optimizations.
Benchmarks tend to be more future-proof than a list of CPU IDs, but the
time taken for benchmarks at program startup has to be weighed against
the subsequent speedup from the resulting optimizations.</p>
<p>To see how well libntruprime performs with the new compilers, run
<code>ntruprime-speed</code> on the target machine and look for the <code>foo</code> lines in
the output. If the new performance is better than the performance shown
on the <code>selected</code> lines:</p>
<ul>
<li>
<p>Copy the <code>ntruprime-speed</code> output into a file on the <code>benchmarks</code>
     directory, typically named after the hostname of the target
     machine.</p>
</li>
<li>
<p>Run <code>./prioritize</code> in the top-level directory to create <code>priority</code>
     files. These files tell libntruprime which implementations to select
     for any given architecture.</p>
</li>
<li>
<p>Reconfigure (again with <code>--no-trim</code>), recompile, rerun
     <code>ntruprime-test</code>, and rerun <code>ntruprime-speed</code> to check that the
     <code>selected</code> lines now use the <code>foo</code> compiler.</p>
</li>
</ul>
<p>If the <code>foo</code> implementation is outperformed by other implementations,
then these steps don't help except for documenting this fact. The same
implementation might turn out to be useful for subsequent <code>foo</code> CPUs.</p>
<h3>How to add new implementations</h3>
<p>Taking full advantage of the <code>foo</code> instruction set usually requires
writing new implementations. Sometimes there are also ideas for taking
better advantage of existing instruction sets.</p>
<p>Structurally, adding a new implementation of a primitive is a simple
matter of adding a new subdirectory with the code for that
implementation. Most of the work is optimizing the use of <code>foo</code>
intrinsics in <code>.c</code> files or <code>foo</code> instructions in <code>.S</code> files. Make sure
to include an <code>architectures</code> file saying, e.g., <code>amd64 avx2 foo</code>.</p>
<p>Names of implementation directories can use letters, digits, dashes, and
underscores. Do not use two implementation names that are the same when
dashes and underscores are removed.</p>
<p>All <code>.c</code> and <code>.S</code> files in the implementation directory are compiled and
linked. There is no need to edit a separate list of these files. You can
also use <code>.h</code> files via the C preprocessor.</p>
<p>If an implementation is actually more restrictive than indicated in
<code>architectures</code> then the resulting compiled library will fail on some
machines (although perhaps that implementation will not be used by
default). Putting unnecessary restrictions into <code>architectures</code> will not
create such failures, but can unnecessarily limit performance.</p>
<p>Some, but not all, mistakes in <code>architectures</code> will produce warnings
from the <code>checkinsns</code> script that runs automatically when libntruprime is
compiled. Running the <code>ntruprime-test</code> program tries all implementations,
but only on the CPU where <code>ntruprime-test</code> is being run;
also, <code>ntruprime-test</code> does not guarantee code coverage.</p>
<p><code>amd64</code> implies little-endian, and implies architectural support for
unaligned loads and stores. Beware, however, that the Intel/AMD
vectorized <code>load</code>/<code>store</code> intrinsics (and the underlying <code>movdqa</code>
instruction) require alignment; if in doubt, use <code>loadu</code>/<code>storeu</code> (and
<code>movdqu</code>). The <code>ntruprime-test</code> program checks unaligned inputs and
outputs, but can miss issues with unaligned stack variables.</p>
<p>To test your implementation, compile everything, check for compiler
warnings and errors, run <code>ntruprime-test</code> (or just <code>ntruprime-test xof</code> to
test a <code>crypto_xof</code> implementation), and check for a line saying <code>all
tests succeeded</code>. To use AddressSanitizer (for catching, at run time,
buffer overflows in C code), add <code>-fsanitize=address</code> to the <code>gcc</code> and
<code>clang</code> lines in <code>compilers/*</code>; you may also have to add <code>return;</code> at
the beginning of the <code>limits()</code> function in <code>command/limits.inc</code>.</p>
<p>To see the performance of your implementation, run <code>ntruprime-speed</code>. If
the new performance is better than the performance shown on the
<code>selected</code> lines, follow the same steps as for a new instruction set:
copy the <code>ntruprime-speed</code> output into a file on the <code>benchmarks</code>
directory; run <code>./prioritize</code> in the top-level directory to create
<code>priority</code> files; reconfigure (again with <code>--no-trim</code>); recompile; rerun
<code>ntruprime-test</code>; rerun <code>ntruprime-speed</code>; check that the <code>selected</code> lines
now use the new implementation.</p>
<h3>How to handle namespacing</h3>
<p>As in SUPERCOP and NaCl, to call <code>crypto_sort_int32()</code>, you have to
include <code>crypto_sort_int32.h</code>; but to write an implementation of
<code>crypto_sort_int32()</code>, you have to instead include <code>crypto_sort.h</code> and
define <code>crypto_sort</code>. Similar comments apply to other primitives.</p>
<p>The function name that's actually linked might end up as, e.g.,
<code>libntruprime_sort_int32_avx2_C2</code> where <code>avx2</code> indicates the
implementation and <code>C2</code> indicates the compiler. Don't try to build this
name into your implementation.</p>
<p>If you have another global symbol <code>x</code> (for example, a non-<code>static</code>
function in a <code>.c</code> file, or a non-<code>static</code> variable outside functions in
a <code>.c</code> file), you have to replace it with <code>CRYPTO_NAMESPACE(x)</code>, for
example with <code>#define x CRYPTO_NAMESPACE(x)</code>.</p>
<p>For global symbols in <code>.S</code> files and <code>shared-*.c</code> files, use
<code>CRYPTO_SHARED_NAMESPACE</code> instead of <code>CRYPTO_NAMESPACE</code>. For <code>.S</code> files
that define both <code>x</code> and <code>_x</code> to handle platforms where <code>x</code> in C is <code>_x</code>
in assembly, use <code>CRYPTO_SHARED_NAMESPACE(x)</code> and
<code>_CRYPTO_SHARED_NAMESPACE(x)</code>; <code>CRYPTO_SHARED_NAMESPACE(_x)</code> is not
sufficient.</p>
<p>libntruprime includes a mechanism to recognize files that are copied
across implementations (possibly of different primitives) and to unify
those into a file compiled only once, reducing the overall size of the
compiled library and possibly improving cache utilization. To request
this mechanism, include a line</p>
<pre><code>// linker define x
</code></pre>
<p>for any global
symbol <code>x</code> defined in the file, and a line</p>
<pre><code>// linker use x
</code></pre>
<p>for any
global symbol <code>x</code> used in the file from the same implementation (not
<code>crypto_*</code> subroutines that you're calling, <code>randombytes</code>, etc.). This
mechanism tries very hard, perhaps too hard, to avoid improperly
unifying files: for example, even a slight difference in a <code>.h</code> file
included by a file defining a used symbol will disable the mechanism.</p>
<p>Typical namespacing mistakes will produce either linker failures or
warnings from the <code>checknamespace</code> script that runs automatically when
libntruprime is compiled.</p><hr><font size=1><b>Version:</b>
This is version 2024.08.18 of the "Internals" web page.
</font>
</div>
</body>
</html>