Ripple: Accelerating Programmable Bootstraps for FHE with Wavelet Approximations⋆

Abstract. Homomorphic encryption can address key privacy challenges in cloud-based outsourcing by enabling potentially untrusted servers to perform meaningful computation directly on encrypted data. While most homomorphic encryption schemes natively offer addition and multiplication over ciphertexts, any non-linear function must be implemented as a costly polynomial approximation due to this restricted computational model. Nevertheless, the CGGI cryptosystem is capable of evaluating arbitrary univariate functions over ciphertexts in the form of lookup tables through the use of programmable bootstrapping. While promising, this procedure can quickly become costly when high degrees of precision are required. To address this challenge, we propose Ripple: a framework that introduces different approximation methodologies based on discrete wavelet transforms (DWT) to decrease the number of entries in homomorphic lookup tables while maintaining high accuracy. Our empirical evaluations demonstrate significant error reduction compared to plain quantization methods across multiple non-linear functions. Notably, Ripple improves runtime performance for several realistic benchmarks, such as logistic regression and cross-correlation, among others.

Keywords: Applied Cryptography · Homomorphic Encryption · Lookup Tables · Privacy-Enhancing Technologies · Encrypted Computation.

⋆ The first two authors have equal contribution and appear in alphabetical order. This work has been supported by NSF CAREER award #2239334.

1 Introduction

Cloud computing enables corporations to leverage powerful computational resources while avoiding the cost and upkeep associated with maintaining local computing infrastructure. Along with numerous benefits, this gives rise to privacy concerns over outsourced data: cloud service providers can possibly view data stored on their servers, and malicious actors have increasingly targeted cloud servers as they can be treasure troves of proprietary information from multiple clients [29,46]. While encryption techniques such as AES can be used to protect data confidentiality, the encrypted data must remain static and the cloud cannot apply meaningful processing to ciphertexts (aside from storage). Thus, if a client wants to modify their data, they have to download the ciphertexts, decrypt them, perform a computation to update the plaintext data, re-encrypt them, and re-upload the result to the cloud.

Homomorphic encryption (HE) is a powerful technique that helps address privacy concerns in cloud computing by allowing computation on encrypted data [19]. With HE, a client can encrypt sensitive data, upload the corresponding ciphertexts to the cloud, have the cloud apply an arbitrary algorithm such as image classification, and then receive a valid encryption of the results, which can only be decrypted with the client's secret key. In this way, the cloud learns nothing about the contents of the input data, the intermediate results, or the output of the encrypted computation.

Nevertheless, for many HE schemes, such as BGV [5], BFV [18], and CKKS [7], evaluating non-linear functions directly remains impossible, as only addition and multiplication can be executed over ciphertexts. Conversely, CGGI [9] supports encrypted lookup tables (LUTs), which allow non-linear functions to be computed exactly.
Unfortunately, this operation is quite costly and becomes significantly more expensive as the plaintext modulus increases. Likewise, both the time and memory required to generate the lookup tables scale exponentially with the input's precision, with LUT sizes of 32 bits and larger becoming impractical.³

One solution to this problem is to simply quantize the LUT inputs (i.e., reduce the bit width) to lower the number of entries in the lookup table, resulting in faster evaluation and LUT generation times. The reduced precision caused by quantization, however, can negatively impact a wide variety of applications that require high precision. For instance, quantization in deep neural networks can result in non-negligible accuracy loss [26] and thus lead to incorrect classifications. In particular, errors occurring due to quantization in early layers will propagate to subsequent layers, resulting in an avalanche effect where the errors compound in each layer. Indeed, this effect is not limited to privacy-preserving inference and can happen in any application.

In this work, we propose the Ripple framework, which offers new efficient techniques for encrypted LUT evaluation. Ripple provides all of the same benefits of quantization in terms of latency reduction while minimizing the accuracy loss resulting from reduced precision and bit widths. Our approach leverages the discrete wavelet transform (DWT) [42] to approximate non-linear functions and generate significantly smaller lookup tables while maintaining high accuracy. Moreover, Ripple employs multiple DWT families and introduces bespoke HE-friendly protocols tailored to each family to maximize accuracy and minimize latency. For instance, we find that some DWT families benefit from multiple LUTs while others need only a single LUT evaluation. We apply Ripple to a variety of non-linear functions that are widely used across several domains, from machine learning [6,15] to statistics [39], as well as multiple realistic applications for homomorphic encryption [44,24], such as logistic regression inference and edge detection. Our contributions can be summarized as follows:

– We introduce Ripple to construct smaller encrypted LUTs with wavelet techniques without sacrificing accuracy.
– We propose multiple protocols for evaluating wavelet-encoded LUTs in the encrypted domain.
– We implement a suite of commonly adopted non-linear functions optimized with Ripple, along with a set of representative benchmarks from various domains.

Roadmap: Section 2 provides a brief background on homomorphic encryption and the discrete wavelet transform. Sections 3 and 4 outline our encrypted lookup table methodology based on the DWT, as well as our novel optimizations, while Section 5 introduces a representative set of applications that can benefit from our methodologies. Section 6 presents our experimental evaluations for each application, demonstrating the efficacy of our approach, and Section 7 discusses related works and how they compare to Ripple. Lastly, our concluding remarks appear in Section 8.
³ We empirically observed that even 32-bit encrypted LUTs with the state-of-the-art TFHE-rs [47] HE library require approximately 515 GB of RAM and 65 minutes. For reference, 30-bit LUTs took almost 15 minutes and required over 120 GB.

2 Preliminaries

2.1 Homomorphic Encryption Overview

The key characteristic of all HE schemes is malleability, where ciphertexts can be manipulated to change the underlying plaintext data predictably. All HE schemes can be roughly divided into two categories depending on the primary computational domain: arithmetic-based schemes and Boolean-based schemes.

2.1.1 Arithmetic-Based Schemes

This category consists of schemes that are primarily used for addition and multiplication over ciphertexts that encrypt modular integers or floating point numbers. We note that partial HE schemes such as Paillier [37] and unpadded RSA [40] fall into this category, but only allow for one type of operation (i.e., addition or multiplication) and are therefore only used in niche applications. The other cryptosystems in this category encompass schemes that can be instantiated as both leveled and fully homomorphic encryption, depending on whether or not a bootstrapping mechanism is added. The BGV [5] and BFV [18] cryptosystems are used for encrypting integers modulo a configurable plaintext modulus. On the other hand, an approximate scheme called CKKS [7] encrypts either complex or floating point numbers, and the error that is inherent in homomorphic encryption is treated similarly to floating point arithmetic error.

In fact, this error is integral to all HE schemes that derive their security from the Learning With Errors (LWE) [38] or Ring-LWE (RLWE) [35] problems. Ciphertexts take the form of tuples of high-degree polynomials, and a small amount of random noise is injected into the coefficients upon encryption. The noise compounds as operations are conducted on the data: additively for ciphertext additions and multiplicatively for ciphertext multiplications. If the noise is allowed to grow past a certain threshold, it will begin to corrupt the underlying plaintext message. There are two primary mechanisms for noise reduction. The first is a computationally efficient procedure called modulus switching that reduces the size of the coefficients of the ciphertext polynomials; scaling down the coefficient modulus also scales down the noise. Aside from noise reduction, this technique speeds up subsequent operations, as the total size of the ciphertext is also reduced. Nevertheless, this technique can only be invoked a finite number of times, and eventually it will not be possible to reduce the modulus size further.

Implementations that incorporate a bootstrapping procedure can reduce the noise indefinitely by homomorphically evaluating the decryption circuit of the cryptosystem (resulting in a ciphertext with lower noise). Also, the bootstrapping procedure regenerates the ciphertext modulus, allowing for more modulus switching operations in between bootstraps. These FHE contexts can evaluate arbitrarily deep arithmetic circuits, but the cost of bootstrapping remains extremely high relative to other HE operations: the overall latency can range from several seconds to several minutes on a CPU, depending on the parameter sets used. Additionally, this class of schemes is not well-suited for evaluating non-linear operations, as the only supported operations are addition and multiplication. For instance, many activation functions in machine learning constructions, like ReLU and sigmoid, cannot be directly implemented. Instead, one must use a polynomial approximation that is balanced to yield an acceptable trade-off between accuracy and latency. A high-degree polynomial can closely mimic the behavior of a non-linear function but requires several homomorphic multiplications and additions to evaluate.
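To make this concrete, the plaintext Rust sketch below evaluates a low-degree polynomial approximation of the sigmoid function with Horner's rule, using only the additions and multiplications that arithmetic-based schemes support natively. The degree-3 coefficients are a commonly used illustrative fit, not values taken from this work:

```rust
/// Evaluate a polynomial with Horner's rule using only addition and
/// multiplication, the two operations arithmetic-based HE schemes support.
/// `coeffs[0]` holds the highest-degree coefficient.
fn horner(coeffs: &[f64], x: f64) -> f64 {
    coeffs.iter().fold(0.0, |acc, &c| acc * x + c)
}

fn main() {
    // Illustrative degree-3 fit of sigmoid around zero:
    // sigmoid(x) ~ 0.5 + 0.197*x - 0.004*x^3 (coefficients for exposition only).
    let coeffs = [-0.004, 0.0, 0.197, 0.5];
    for x in [-2.0, -1.0, 0.0, 1.0, 2.0] {
        let approx = horner(&coeffs, x);
        let exact = 1.0 / (1.0 + (-x).exp());
        println!("x = {x:+}: approx = {approx:.4}, exact = {exact:.4}");
    }
}
```

In an actual HE program, each multiplication and addition above would be a ciphertext operation, so the degree of the polynomial directly drives both latency and noise growth, and therefore the need for bootstrapping.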
2.1.2 Boolean-based Schemes

Instead of encrypting integers or floating point numbers, this class of schemes encrypts individual bits (or low-precision integers in certain cases). The addition and multiplication primitives are replaced by encrypted gate operations, such as AND, OR, and NOT gates. In practice, most logic gates are implemented as a series of linear operations between ciphertext polynomials followed by a functional bootstrap, which serves to scale the output to the expected value. Because Boolean schemes can support all standard logic gates, they are capable of executing arbitrary algorithms. The DM cryptosystem [16] was the first of these schemes and is meant to operate solely as an FHE construction. Compared to the relatively slow bootstrapping speeds of the arithmetic-based schemes, DM is capable of evaluating a bootstrap in less than a second. A successor to this scheme, known as CGGI [8], boasts an even faster bootstrapping speed of approximately 10 milliseconds on a CPU. Both of these schemes lack a mechanism to enable batching, but what they lack in throughput, they make up for with lower-latency operations.

Additionally, compared to the prior class of HE schemes, Boolean schemes do not need to utilize polynomial approximations. Indeed, non-linear operations can be implemented exactly as Boolean circuits. As an example, the ReLU activation function maps directly to a multi-bit comparator circuit followed by a multiplexer. However, this may require a large number of Boolean gates, and the majority of gate types require at least one bootstrapping operation, resulting in relatively high latency for large circuits. Prior frameworks, such as the Google Transpiler [21], ArctyrEX [22], and HELM [23], exploit the inherent circuit-level parallelism to reduce the latency of circuit evaluation. Even so, the performance of these frameworks is limited by the critical path (or greatest depth) of the homomorphic circuit. Additionally, all three approaches rely on logic and/or high-level synthesis methods to convert input programs into optimized Boolean circuits, which results in very high pre-processing costs for non-trivial applications.

Alternatively, with proper parameter selection, Boolean schemes can also encrypt low-precision integers instead of bits. In the case of CGGI, this encoding allows for ciphertext addition and multiplication by a public constant, but not multiplication between two ciphertexts (distinguishing this mode from arithmetic-based schemes). Notably, it still retains the functional bootstrap, which can be utilized to evaluate N:N lookup tables, as discussed in the following subsection. This approach combines some of the key strengths of both arithmetic-based schemes and the Boolean mode of operation: natively supported multi-bit arithmetic, as well as a mechanism for exactly evaluating non-linear functions. The primary challenge is the restriction on the size of the underlying plaintext space, but this can be overcome by representing high-precision plaintext values as vectors of ciphertexts, where each encryption encodes a low-precision chunk of the original message. This very methodology is employed in the TFHE-rs library [47] in two different ways: a Chinese remainder theorem (CRT) method and a radix method. The former involves generating multiple residues by reducing an input message modulo a series of co-prime bases and encrypting each residue as a separate ciphertext; upon decryption, the residues are combined to form the final higher-precision result. The second method involves decomposing the input data into a series of digits, each of which is encrypted individually. Ripple utilizes the latter approach as it provides a convenient way to truncate ciphertexts, which is an integral operation in the DWT protocols explained in Section 3. Truncating the digits of a ciphertext array encoded with the radix decomposition can be done with negligible latency overhead, as no FHE operations are required.
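As a plaintext illustration of the two encodings, the sketch below decomposes a 16-bit message both ways; the co-prime bases and the 2-bit digit size are arbitrary choices for exposition, not TFHE-rs defaults:

```rust
/// Decompose `x` into residues modulo pairwise co-prime bases (CRT encoding).
fn crt_decompose(x: u64, bases: &[u64]) -> Vec<u64> {
    bases.iter().map(|&b| x % b).collect()
}

/// Decompose `x` into little-endian base-2^bits digits (radix encoding).
fn radix_decompose(mut x: u64, bits: u32, num_digits: usize) -> Vec<u64> {
    let mask = (1u64 << bits) - 1;
    (0..num_digits)
        .map(|_| {
            let digit = x & mask;
            x >>= bits;
            digit
        })
        .collect()
}

fn main() {
    let msg = 0xBEEFu64; // 16-bit example message
    // Co-prime bases whose product exceeds 2^16.
    println!("CRT:   {:?}", crt_decompose(msg, &[7, 11, 13, 15, 16]));
    // Eight 2-bit digits, matching the radix configuration discussed later.
    println!("radix: {:?}", radix_decompose(msg, 2, 8));
}
```

In the encrypted setting, each residue or digit would be a separate ciphertext; the radix encoding is what allows Ripple to truncate by simply dropping digits.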
2.2 Programmable Bootstrapping (PBS)

A crucial feature of the DM and CGGI FHE cryptosystems is the functional bootstrap, which takes advantage of the programmability of the bootstrapping algorithm employed in these schemes. A polynomial with crafted coefficients that encodes the set of desired output messages is rotated by an encrypted value, and the first encrypted coefficient, corresponding to the constant term of the polynomial, is extracted. These two procedures, called blind rotation and extraction, form the core bootstrapping steps. By encoding chosen lookup table (LUT) entries in the coefficients of the polynomial to be rotated, one can evaluate a LUT T over a ciphertext. Essentially, this is done by rotating the LUT polynomial by an encrypted amount (corresponding to the input ciphertext) and extracting the entry corresponding to the constant term. The result is a valid encryption that encodes the mapping from a LUT input to the desired LUT output. Thus, it allows computing arbitrary univariate functions by evaluating a function in the plaintext domain across all possible inputs and encoding the outputs in the polynomial utilized during bootstrapping. This generalized bootstrapping technique is called programmable bootstrapping (PBS) [11,34]. Although the LUT needs to be relatively small to maintain efficient cryptographic parameter sets, it has two main advantages. First, it can encode any arbitrary univariate function, and second, it leads to a significant performance boost, as it replaces expensive operations that would otherwise require multiple additions and multiplications.
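The following toy model (plaintext Rust, with a public index) is meant only to convey the LUT-via-rotation intuition; it deliberately ignores encryption, noise, the negacyclic wrap-around of the real test polynomial, and the actual blind-rotation algorithm:

```rust
/// Toy plaintext model of PBS: the LUT entries are the coefficients of a
/// polynomial T(X); "blind rotation" multiplies by X^{-index}, bringing
/// coefficient `index` down to the constant term, and "extraction" reads
/// it out. Real PBS rotates by an *encrypted* index and must additionally
/// handle the negacyclic sign flips of the test polynomial.
fn pbs_model(lut: &[u64], index: usize) -> u64 {
    lut[index % lut.len()]
}

fn main() {
    // Encode the univariate function f(x) = x^2 mod 16 for 4-bit inputs.
    let lut: Vec<u64> = (0..16u64).map(|x| (x * x) % 16).collect();
    assert_eq!(pbs_model(&lut, 5), 9); // 5^2 = 25 = 9 (mod 16)
    println!("f(5) = {}", pbs_model(&lut, 5));
}
```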
2.3 The Discrete Wavelet Transform (DWT)

A discrete wavelet transform (DWT) is a process of splitting a discretely sampled signal into two parts: the approximation coefficients and the detail coefficients [42]. The former is half the size of the original signal but encompasses its most salient features, while the latter contains information about the error incurred by the approximation coefficients. By combining the detail coefficients with the approximation, one can reconstruct the original signal.

[Fig. 1: Two DWT iterations. The signal s can be represented by the approximation a^(s) (green) and the details d^(s) (gray).]

Fig. 1 illustrates the original signal on top and successive applications of the DWT. In the first level (L0), we deconstruct the signal on top into approximation (a_0^(s)) and detail (d_0^(s)) coefficients. Then, we can repeat the same process by treating the L0 approximation as a new signal, obtaining new approximation (a_1^(s)) and detail (d_1^(s)) coefficients at level 1, and so on. In Fig. 1, notice that the L2 approximation and detail coefficients, together with the L1 and L0 detail coefficients, are sufficient for recovering the original signal.

There exist multiple wavelet families, such as the Daubechies (Db) wavelets and the Biorthogonal wavelets. A special case of the Db wavelets is the Db-1, or Haar, wavelet. The core idea in all families is a matrix multiplication with a constant matrix W, where W differs based on the family. Haar uses an orthogonal matrix to obtain the approximation coefficients, which are calculated by averaging every two consecutive points of the original signal. Biorthogonal, on the other hand, relies on two different matrices, where the transpose of one matrix is the inverse of the other. We delve more into the details of both wavelet families in Section 3.

2.3.1 Haar DWT

Haar applies a linear transformation to the input signal and generates the approximation and detail coefficients. Starting with a signal s = [s_0, s_1, ..., s_{2N-1}] of length 2N, the Haar DWT generates the approximation coefficients a = [a_0, ..., a_{N-1}] and the detail coefficients d = [d_0, ..., d_{N-1}], each with N entries. More specifically, the approximations are generated as a_k = (s_{2k} + s_{2k+1})/2, while the details are generated as d_k = (s_{2k} - s_{2k+1})/2, for k ∈ [0, N). It is easy to see that when the Haar DWT is applied to a one-hot vector, it results in yet another one-hot vector, albeit scaled. The index of the non-zero value in the new vector is in fact the old index divided by 2, which is akin to removing the last bit of the index. Therefore, we can simply truncate the encrypted value to drop its least significant bit and obtain the index at which we want to perform the lookup. In effect, we have manually applied the Haar DWT to the one-hot vector using only truncation.
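A minimal plaintext sketch of one Haar analysis level, checking the one-hot observation above (the signal length and index are arbitrary):

```rust
/// One Haar DWT level: a_k = (s_{2k} + s_{2k+1})/2 and d_k = (s_{2k} - s_{2k+1})/2,
/// halving the signal length.
fn haar_level(s: &[f64]) -> (Vec<f64>, Vec<f64>) {
    let approx: Vec<f64> = s.chunks(2).map(|p| (p[0] + p[1]) / 2.0).collect();
    let detail: Vec<f64> = s.chunks(2).map(|p| (p[0] - p[1]) / 2.0).collect();
    (approx, detail)
}

fn main() {
    // One-hot vector with its non-zero entry at index 5 (binary 101).
    let mut onehot = vec![0.0; 8];
    onehot[5] = 1.0;
    let (approx, _detail) = haar_level(&onehot);
    // The result is again one-hot (scaled by 1/2) at index 5 >> 1 = 2,
    // i.e., the original index with its least significant bit truncated.
    println!("{approx:?}"); // [0.0, 0.0, 0.5, 0.0]
}
```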
2.3.2 Biorthogonal DWT

The linear transformation in Haar can be seen as a matrix multiplication with some public orthogonal matrix.⁴ The Biorthogonal DWT, on the other hand, requires two different matrices, where the transpose of one matrix is the inverse of the other, i.e., M_1^T = M_2^{-1}. In this case, M_1 is used for the decomposition of the signal, while M_2 is used for the reconstruction. In Haar, the points in the approximation coefficients are averages of the input signal s, and the details are complementary so that s can be recovered. Conversely, in Biorthogonal wavelets, the approximations and details are computed with weighted averages [42]. Since the Biorthogonal wavelets have more non-zero filters, when the transform is applied to a one-hot vector, the result is not a one-hot vector. Even so, we can manually calculate the resulting vector, as it is a weighted average of the two consecutive values starting at the most significant bits of the original index. The weights depend on the least significant bits of the original index; hence, by splitting the original index into its MSBs and LSBs, we can calculate the transformed vector.

⁴ An orthogonal matrix M is one whose transpose equals its inverse (M^T = M^{-1}), so that M M^T = I, where I is the identity matrix.

3 The Ripple Framework

As mentioned in Section 2.2, a key feature of the CGGI cryptosystem is the ability to evaluate a lookup table with the programmable bootstrapping mechanism. Complex non-linear functions can thus be encoded as LUTs and evaluated homomorphically, eliminating the need to perform expensive polynomial approximations. Unfortunately, as the size of the LUT grows, this technique becomes prohibitively expensive (recall footnote 3), and thus many non-linearities are impossible to evaluate in applications that require high precision.

We address this challenge by utilizing the DWT to reduce the size of LUTs without sacrificing correctness. Ripple is the first framework to explore wavelet approximations for FHE as a way to accelerate programmable bootstrapping. Our key observation is that if we apply the DWT to signals that represent smooth functions (e.g., logarithm, square root, sigmoid, etc.), then the detail coefficients are relatively small compared to the approximation coefficients. This means that the approximation is sufficient to represent the original signal, and we can completely disregard the detail coefficients while maintaining a minimal error relative to the original function. Utilizing this observation, we can zero out the details in Fig. 1, and by applying the DWT a single time, we halve the size of the LUT. The resulting table might still be quite big, so we can repeat the same process and halve the LUT size even further. Of course, as this is an approximation, the smaller the LUT size, the higher the potential error. With Ripple, however, these errors are marginal, as we show in Section 6.

Ripple's Key Observation. The core idea behind Ripple relies on the orthogonality or biorthogonality of the DWT transform applied to the inner product. Consider the fact that the inner product between two vectors v and u is equal to the product between the transpose of the first vector and the second, i.e., ⟨v,u⟩ = v^T · u. As described in Section 2.3, the DWT of both v and u is a multiplication with a matrix W, which we can view as DWT(v) = W · v. This results in a vector [a^(v); d^(v)], where a^(v) and d^(v) represent the approximation and detail coefficients of v, respectively. Similarly, from DWT(u) = W · u we get a^(u) and d^(u). Observe that, using orthogonality,

  ⟨DWT(v), DWT(u)⟩ = ⟨W·v, W·u⟩ = (W·v)^T · (W·u) = v^T · W^T · W · u = v^T · I · u = v^T · u = ⟨v,u⟩.   (1)

We remark that the inner product of two vectors equals the sum of the inner product of the approximation coefficients and the inner product of the detail coefficients of the vectors' DWT transforms:

  (W·v)^T · (W·u) = [a^(v); d^(v)]^T · [a^(u); d^(u)] = (a^(v))^T · a^(u) + (d^(v))^T · d^(u).

In Ripple, we represent ⟨v,u⟩ as ⟨a^(v), a^(u)⟩ + ⟨d^(v), d^(u)⟩, which is approximately equal to the inner product of the approximation coefficient vectors a^(v) and a^(u) alone. By dropping the detail coefficients, we obtain an approximation of the original inner product via the inner product of the approximation coefficients, which is key to our proposed lookup methodology. Thus, ⟨v,u⟩ = ⟨DWT(v), DWT(u)⟩ ≈ ⟨a^(v), a^(u)⟩. This works nicely for orthogonal DWTs, but for the Biorthogonal DWT we need extra considerations. Instead of applying the same transformation to both v and u, we have to apply the decomposition matrix to one, while applying the reconstruction matrix to the other. Since applying the reconstruction matrix is equivalent to applying the decomposition matrix with flipped filters, this can also be thought of as a decomposition. Then, by the biorthogonality of these matrices, the same observation holds and ⟨v,u⟩ = ⟨M_1 · v, M_2 · u⟩ ≈ ⟨a^(v), a^(u)⟩.
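The identity in Eq. (1) and the quality of the approximation can be sanity-checked numerically. The sketch below uses the orthonormal Haar scaling 1/√2, so that W^T·W = I holds exactly; the averaging convention of Section 2.3.1 differs from this only by a constant factor. The test vectors are arbitrary:

```rust
fn dot(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// One orthonormal Haar level (scaling 1/sqrt(2), so that W^T * W = I).
fn haar(s: &[f64]) -> (Vec<f64>, Vec<f64>) {
    let r = std::f64::consts::SQRT_2;
    let a: Vec<f64> = s.chunks(2).map(|p| (p[0] + p[1]) / r).collect();
    let d: Vec<f64> = s.chunks(2).map(|p| (p[0] - p[1]) / r).collect();
    (a, d)
}

fn main() {
    let v = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0];
    let u = [2.0, 7.0, 1.0, 8.0, 2.0, 8.0, 1.0, 8.0];
    let (av, dv) = haar(&v);
    let (au, du) = haar(&u);
    // Eq. (1): the transform preserves the inner product exactly...
    println!("<v,u>                 = {}", dot(&v, &u));
    println!("<a_v,a_u> + <d_v,d_u> = {}", dot(&av, &au) + dot(&dv, &du));
    // ...and dropping the details leaves an approximation whose error is
    // exactly <d_v,d_u>, which is small when both signals are smooth.
    println!("<a_v,a_u>             = {}", dot(&av, &au));
}
```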
Applying our Observation to the Encrypted Domain. In Eq. (1), we can view v as a one-hot vector whose non-zero value is at the index given by the ciphertext, and u as a public LUT T′. Then, the inner product ⟨v,u⟩ yields the value of table T′ at the non-zero index of v, which is the lookup value at the ciphertext. In particular, ⟨a^(v), a^(u)⟩ is an approximation of this lookup. Notably, the approximation of the lookup table is easy to calculate since it is in plaintext, while we need a way to efficiently calculate the approximation vector of the one-hot vector that represents the ciphertext. Fortunately, it turns out that this approximation vector can be calculated by a weighted sum of lookups.

Of course, the aforementioned technique is not directly practical in HE, but programmable bootstrapping (PBS) can be leveraged for this exact purpose. Starting with a ciphertext x′, we can evaluate the LUT T′ on x′ homomorphically and obtain the ciphertext y′ = T′(x′). The novelty of Ripple here lies in the fact that x′ has fewer bits than x and T′ has fewer entries than T, while y′ is a very close approximation of y = T(x). Ripple focuses on the two most popular DWT families, Haar and Biorthogonal, which are both viable and constitute a tradeoff between accuracy and latency (as discussed in Section 6). In the following subsections, we describe how Ripple formulates and evaluates encrypted lookup tables with each of the two DWT families.

[Fig. 2: Two scenarios of truncating the least significant bits of a radix ciphertext, where each digit encrypts two bits of data: (a) truncation of a multiple of 2 bits; (b) truncation of an odd number of bits. With this configuration, it is far more efficient to truncate an even number of bits, while truncating an odd number of bits is computationally expensive.]

Efficient Truncation. The speedups gained from the DWT techniques revolve around reducing the total size of the LUT to be evaluated. However, in the encrypted domain, it is impossible to reduce the number of LUT entries without also decreasing the underlying plaintext modulus. Simply shifting to isolate the MSBs and LSBs for LUT evaluation results in virtually no speedup, as the total ciphertext size and dimensions remain identical to the original and only the underlying message is manipulated. However, we can leverage the radix decomposition mode of TFHE-rs (where encrypted values are represented as vectors of ciphertexts) to effectively reduce the number of ciphertexts instead of shifting to isolate bits. An example of this approach is shown in Fig. 2a. If each digit of the encrypted value encodes two bits of information, a truncation can be performed by simply discarding digits; this approach is essentially free, as no HE operations are required. At the same time, the total number of ciphertexts comprising the encrypted value is reduced, and the overall number of PBS operations required to evaluate a full lookup table decreases. The key restriction of this approach is that the number of bits to be truncated must be a multiple of the bit width of each digit. If this is not the case, as in Fig. 2b, the entire ciphertext array must be shifted, requiring multiple PBS operations and non-negligible overhead. After the shift is done, the blocks that encode all zeros (i.e., the left-most block in Fig. 2b) can be discarded. We observe that it is possible to avoid this inefficient truncation through careful parameterization of the DWT, as discussed in each of the following subsections.
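The plaintext sketch below mirrors the two scenarios of Fig. 2 for 2-bit digits (the digit size and values are illustrative). In the encrypted domain, the first branch discards whole ciphertexts at no cost, whereas the second branch would require PBS operations to shift every digit:

```rust
/// Truncate `j` bits from a little-endian radix value with `bits`-bit digits.
/// If `j` is a multiple of the digit size, whole digits are discarded (free
/// in the encrypted domain, as in Fig. 2a); otherwise the digits must be
/// shifted first, which costs PBS operations on ciphertexts (Fig. 2b).
fn truncate(digits: &[u64], bits: u32, j: u32) -> Vec<u64> {
    if j % bits == 0 {
        // Cheap path: drop the (j / bits) least significant digits.
        digits[(j / bits) as usize..].to_vec()
    } else {
        // Expensive path (modeled in plaintext): reassemble, shift,
        // re-decompose, then discard the now-zero leading block.
        let value: u64 = digits.iter().rev().fold(0, |acc, &d| (acc << bits) | d);
        let shifted = value >> j;
        let n = digits.len() - (j / bits) as usize; // remaining digit count
        (0..n)
            .map(|i| (shifted >> (i as u32 * bits)) & ((1 << bits) - 1))
            .collect()
    }
}

fn main() {
    // 0b10_01_11_01 = 157 as four 2-bit digits, little-endian: [01, 11, 01, 10].
    let digits = vec![0b01, 0b11, 0b01, 0b10];
    println!("{:?}", truncate(&digits, 2, 2)); // drop one digit: [3, 1, 2]
    println!("{:?}", truncate(&digits, 2, 3)); // odd amount, shift: [3, 0, 1]
}
```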
[Fig. 3: Haar DWT LUT example approximating y = T(x) by computing y = T1(x′) ∥ T2(x′), where x′ = x ≫ W/2. The W-bit input x is truncated to the (W/2)-bit x′, which indexes the LSB table T1 and the MSB table T2.]

3.1 PBS with Haar DWT

Let W be the total bit width of our input radix-decomposed ciphertext vector. Naturally, we can represent the function that we want to approximate as a LUT T that maps k-bit inputs to k-bit outputs. For simplicity, we can assume that k = W. Creating a LUT of 2^W entries, however, is not always feasible. For instance, W = 64 requires generating and storing 2^64 LUT entries, which is completely impractical. A straightforward option is to truncate our input ciphertexts by J bits and limit T to only have 2^{W-J} entries, operating over (W-J)-bit inputs and outputs. As it turns out, this approximation incurs high errors and is not sufficient for most applications.

Ripple takes a different approach: first, we apply the Haar DWT over the public LUT T iteratively J times, dropping the detail coefficients, to end up with an approximation T′ of our original function. The new table T′ now operates over W-J bits. Observe, however, that our input ciphertext vector is still W bits long. To index T′, Ripple truncates the vector to encode W-J bits so it can be used during PBS to index the LUT. As long as W-J is a multiple of the size of our radix digits, the truncation is free. Conveniently, this is the index needed for the approximation vector a^(v). This results in an output ciphertext vector also encoding W-J bits.

As we started with a W-bit input ciphertext vector, after approximating the function we need to end up with a W-bit output as well. To do so, we repeat the same process twice by building two LUTs (T1 and T2): one for the LSBs and one for the MSBs. In both cases, the truncated (W-J)-bit input ciphertext vector is used to index the LUT. This is illustrated in Fig. 3, where J = W/2. Note that each of T1 and T2 outputs W/2 bits and thus takes approximately half the time to evaluate compared to a W-bit LUT (which is not even practical to create for large W). Additionally, both tables can be evaluated in parallel, and finally the two outputs are concatenated to get the final encrypted result encoding W bits of data. We remark that, for certain functions, we only need to evaluate the LSB table and can avoid evaluating T2 altogether. For instance, any function whose output fits in less than half the bit width, such as the square root or the sigmoid activation function, needs only a single LUT evaluation.
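Putting the pieces together, the following plaintext sketch mirrors Fig. 3 with illustrative parameters (W = 8, J = W/2 = 4, and a scaled square root as the target function; none of these choices come from this paper's evaluation). Iterating the Haar approximation J times amounts to averaging blocks of 2^J consecutive LUT entries:

```rust
/// Iterate the Haar approximation (dropping details) j times over a table:
/// j levels of pairwise averaging collapse to averaging blocks of 2^j entries
/// (up to integer rounding in this plaintext model).
fn haar_approx(table: &[u64], j: u32) -> Vec<u64> {
    let block = 1usize << j;
    table
        .chunks(block)
        .map(|c| c.iter().sum::<u64>() / block as u64)
        .collect()
}

fn main() {
    const W: u32 = 8; // input/output bit width (illustrative)
    const J: u32 = 4; // truncated bits; J = W/2 as in Fig. 3
    // Target function: f(x) = floor(sqrt(x) * 2^(W/2)), using all W output bits.
    let t: Vec<u64> = (0..(1u64 << W))
        .map(|x| ((x as f64).sqrt() * 16.0) as u64)
        .collect();
    let t_prime = haar_approx(&t, J); // 2^(W-J) = 16 entries instead of 256
    // Split the W-bit outputs into an LSB table T1 and an MSB table T2 (Fig. 3).
    let t1: Vec<u64> = t_prime.iter().map(|y| y & 0xF).collect();
    let t2: Vec<u64> = t_prime.iter().map(|y| y >> 4).collect();
    let x = 200u64;
    let x_prime = (x >> J) as usize;          // truncation: free on radix digits
    let y = (t2[x_prime] << 4) | t1[x_prime]; // y = T2(x') || T1(x')
    println!("approx f({x}) = {y}, exact = {}", t[x as usize]);
}
```

Note that for an unscaled square root over 8-bit inputs, every output already fits in W/2 bits, so T2 would be identically zero and could be skipped entirely, matching the remark above.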
