A New Approximation Method for Constant Weight Coding and Its Hardware Implementation

In this chapter, a more memory-efficient method for encoding binary information into words of prescribed length and weight is presented. Existing solutions rely on either complex floating-point arithmetic or extra memory overhead, which makes them demanding for resource-constrained computing platforms. The solution we propose avoids both problems and still achieves better coding efficiency. We also correct a crucial error in previous implementations of code-based cryptography by exploiting and tweaking the proposed encoder. At the time of writing, the design presented in this work is the most compact one for any code-based encryption scheme. We show, for instance, that our lightweight implementation of the Niederreiter encryption unit can encrypt approximately 1 million plaintexts per second on a Xilinx Virtex-6 FPGA, requiring 183 slices and 18 memory blocks.


Introduction
Most modern public-key cryptographic systems rely on either the integer factorization or the discrete logarithm problem, both of which are expected to be solvable on large-scale quantum computers using Shor's algorithm [1]. Recent breakthroughs in quantum computing have demonstrated its potential for solving these hard mathematical problems [2,3]. The cryptographic research community has recognized the urgency of the vulnerabilities rooted in these cryptosystems and has in recent years begun to base security on alternative hard problems, leading to multivariate-quadratic, lattice-based, and code-based cryptosystems [4]. In this chapter, we address the problem of encoding information into binary words of predefined length and Hamming weight in resource-constrained computing environments, e.g., reconfigurable hardware and embedded microcontroller systems. This is of particular interest for efficient implementations of McEliece's scheme [5,6] and Niederreiter's scheme [7], the most promising candidates for code-based cryptography.
In Niederreiter/McEliece encryption, the binary plaintext stream must be converted into constant weight words. Constant weight means that the binary word contains a fixed number of "1"s. Note that in hybrid Niederreiter encryption systems (KEM/DEM encryption) [8,9,10], KEMs are designed to exchange symmetric keys securely, and DEMs use these symmetric keys for transmitting long messages. This class of encryption techniques does not involve constant weight coding. Nevertheless, if we want to construct a complete code-based cryptosystem including standard public-key encryption, digital signature, and hybrid encryption, constant weight coding must be efficiently implemented, as it is required by both public-key encryption [11,12] and signature [13]. The exact solution [14] of constant weight coding needs to compute large binomial coefficients and has quadratic complexity, though it is assumed to be optimal as a source coding algorithm. In [15], Sendrier proposed the first solution of linear complexity by incorporating Huffman codes. Later, Sendrier further improved the coding efficiency to very close to 1 by means of Golomb's run-length encoding [16]. The new proposal is particularly easy to implement in hardware and has linear time complexity. Its disadvantage is that the encoding length is variable [17]. In the same paper, he also proposed an approximate version that restricts the value of d to a power of two. This approach significantly simplifies the encoding procedure and thus improves coding throughput.
Heyse et al. [18,19] continued this research and proposed to adapt the approximation method in [17] to embedded system applications. Their design is implemented on AVR microcontrollers and Xilinx FPGAs. However, we observe that this method is not applicable to large parameter sets for the Niederreiter scheme (see Table 1) [20,21,22]. The work in [18,19] keeps a lookup table of pre-stored data with space complexity $O(n)$ to encode input messages into constant weight words. The memory overhead of this table is still intolerable for small embedded systems, and therefore their design does not scale when n is large. The CFS signature scheme [20] exploits very large Goppa codes, and it requires compressing the lengthy constant weight signatures into a binary stream. MDPC-McEliece/Niederreiter encryption [23] uses very large n for practical security levels. For instance, n is set as large as 19,712 for 128-bit security and 65,536 for 256-bit security. Baldi et al. proposed a novel LDGM sparse syndrome signature scheme [13] with compact key size, which also requires constant weight coding with large parameters within the signature generation. This scheme was successfully attacked in 2016 by a French research group [24]. At present, we do not have a sufficiently lightweight yet efficient solution for constant weight encoding if we consider realizing such encoding in real-world applications.
The purpose of this work is to tweak Sendrier's approximation method [17] so that it is easy to implement for all secure parameter sets of the Niederreiter cryptosystem proposed in the literature while maintaining its efficiency. Our contributions include:

1. We propose a new approximation method for constant weight coding that is free from complicated floating-point arithmetic and heavy memory footprint. This method permits us to implement a compact yet fast constant weight encoder on resource-constrained computing platforms.

2. We improve the coding efficiency by fine-tuning the precision used for computing the value of d, in comparison with other approximation methods. Our experiments show that the performance of the new method is better than Heyse's approximate version [18] and even comparable to Sendrier's original proposal [17].

3. We integrate our design with the Niederreiter encryption and obtain a more compact result. We also fix a critical security flaw in the Niederreiter encryptor of Heyse et al. [19]. Our secure implementation of the Niederreiter encryptor can encrypt approximately 1 million plaintexts per second on a Xilinx Virtex-6 FPGA, requiring 183 slices and 18 memory blocks.

This chapter is organized as follows. Sendrier's proposal of constant weight coding and its approximation variant [17,18] are first revisited in Section 2. After analyzing the downsides of these schemes, we propose a new approximation method and fine-tune it for optimal source coding performance in Section 3. Our detailed implementations of the proposed constant weight encoder/decoder and the Niederreiter encryption unit on FPGAs are described in Section 4 and Section 5. We present our experimental results and compare them with the state of the art in Section 6. Finally, Section 7 summarizes this chapter.

Sendrier's methods for constant weight coding
Sendrier presented an algorithm for encoding binary information into words of prescribed length and weight [17]. His encoding algorithm returns a t-tuple $(\delta_1, \delta_2, \ldots, \delta_t)$ in which the $\delta_i$ are the lengths of the longest strings of consecutive "0"s. This method is easy to implement and has linear complexity with a small loss of information efficiency. In this work, we unfold the recursive encoding and decoding algorithms from [17] and rewrite them as Algorithm 1 and Algorithm 2 [26].

[Table 1: columns $(n, t)$ and security level.]
We use the same notation as [17,26] in the above two algorithms for consistency. For example, $\mathrm{read}(B, i)$ moves forward and reads $i$ bits in the stream $B$ and returns the integer whose binary decomposition has been read, most significant bit first; $\mathrm{write}(B, i)$ moves forward and writes the binary string $i$ into the stream $B$; $\mathrm{encodefd}(\delta_{index}, d)$ returns a binary string; and $\mathrm{decodefd}(d, B)$ returns an integer. These two functions are the run-length encoding and decoding methods proposed by Golomb [16,17]. $\mathrm{best\_d}(n, t)$ returns an integer such that $1 \le \mathrm{best\_d}(n, t) \le n - t$, and Sendrier suggested choosing it close to the number defined by Eq. (1). In fact, $\mathrm{best\_d}(n, t)$ can take any value in this range, though the efficiency is reduced if the value is too far from Eq. (1).

Sendrier also presented an approximation of the best $d$ in which the value of $d$ given by Eq. (1) is restricted to a power of two [17]. More precisely, $d$ is first computed via Eq. (1) and then rounded to $2^{\lceil \log_2 d \rceil}$. This approximation greatly simplifies the functions $\mathrm{encodefd}(\cdot)$ and $\mathrm{decodefd}(\cdot)$ and therefore improves speed, while the loss of coding efficiency is small. The simplified versions of $\mathrm{encodefd}(\cdot)$ and $\mathrm{decodefd}(\cdot)$ after approximation are described as follows [26]:

$$\mathrm{encodefd}(\delta, d) = \mathrm{base}_2(\delta, u), \qquad \mathrm{decodefd}(d, B) = \mathrm{read}(B, u),$$

where $\mathrm{base}_2(\delta, u)$ denotes the $u$ least significant bits of the integer $\delta$ written in base 2 and $u = \lceil \log_2 d \rceil$. For the above two equations, the minimum allowed value of $d$, namely $d = 1$, is noteworthy. In this case we have $u = 0$, and therefore we define on purpose that $\mathrm{base}_2(\delta, 0) = \mathrm{null}$ and $\mathrm{read}(B, 0) = 0$ to guarantee that our algorithm applies to all possible $u$.
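The encoding and decoding procedure described in this section can be sketched in Python as follows. This is a minimal iterative sketch under stated assumptions, not a reproduction of Algorithm 1 and Algorithm 2: the names `best_d_pow2`, `bin2cw`, and `cw2bin` are hypothetical, Eq. (1) is assumed to have the form $d \approx (n - (t-1)/2)(1 - 2^{-1/t})$ (the text does not reproduce it), and an exhausted input stream is assumed to be padded with zeros.

```python
import math
from typing import Iterable, List


def best_d_pow2(n: int, t: int) -> int:
    """Power-of-two d, clamped to [1, n - t]; requires n > t >= 1.

    ASSUMPTION: Eq. (1) is d ~ (n - (t - 1) / 2) * (1 - 2 ** (-1 / t)).
    """
    d = (n - (t - 1) / 2) * (1.0 - 2.0 ** (-1.0 / t))
    u = math.ceil(math.log2(d)) if d > 1.0 else 0
    while (1 << u) > n - t:            # keep d inside the allowed range
        u -= 1
    return 1 << u


def bin2cw(bits: Iterable[int], n: int, t: int) -> List[int]:
    """Map a bit stream to the t-tuple of zero-run lengths (delta_1..delta_t)."""
    stream = iter(bits)
    deltas: List[int] = []
    delta = 0
    while t > 0:
        if n <= t:                     # only "1"s remain: the next one is forced
            deltas.append(delta)
            delta, n, t = 0, n - 1, t - 1
            continue
        d = best_d_pow2(n, t)
        u = d.bit_length() - 1         # d == 2 ** u
        if next(stream, 0) == 1:       # "1": skip d zeros
            n -= d
            delta += d
        else:                          # "0": the next u bits give the offset, MSB first
            i = 0
            for _ in range(u):
                i = (i << 1) | next(stream, 0)
            deltas.append(delta + i)
            delta, n, t = 0, n - i - 1, t - 1
    return deltas


def cw2bin(deltas: List[int], n: int, t: int) -> List[int]:
    """Inverse of bin2cw: recover the bits that bin2cw consumed."""
    out: List[int] = []
    for delta in deltas:
        while n > t:
            d = best_d_pow2(n, t)
            u = d.bit_length() - 1
            if delta >= d:
                out.append(1)
                n -= d
                delta -= d
            else:
                out.append(0)
                out.extend((delta >> k) & 1 for k in reversed(range(u)))
                n, t = n - delta - 1, t - 1
                break
        else:                          # n <= t: forced "1", no bits were read
            n, t = n - 1, t - 1
    return out
```

Because the code is variable length, `cw2bin(bin2cw(bits, n, t), n, t)` returns only the prefix of `bits` that was actually consumed, which is why the round trip is checked against a prefix rather than the whole input.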
Recently, Heyse et al. implemented the Niederreiter encryption scheme on embedded microcontrollers, using a lookup table to compute the value of d for constant weight encoding [18]. Their method is based on the approximation method from [17]. One major contribution of their work is the observation that the last few bits of n can be ignored for constant weight encoding because these bits make little difference to the value of d. They do not keep $n \cdot t$ entries but instead n entries; the least significant $\lceil \log_2 t \rceil$ bits of n are not considered and are substituted by t. This method significantly reduces the size of the lookup table; according to our analysis, the table shrinks to roughly $O(n)$. It works well for small values of n, for example $n = 2^{11}$ in Goppa code-based McEliece or Niederreiter encryption schemes. However, we found that it does not work well when we implemented a Niederreiter signature scheme called the CFS signature. CFS requires an extremely large value of n, typically $n = 2^{18}$ or $n = 2^{20}$. On the one hand, the size of the lookup table increases linearly with n, which limits scalability. On the other hand, the coding efficiency drops dramatically as n increases, which lowers the throughput of the constant weight encoder. These downsides motivate us to find better ways of computing d. We describe and analyze our new method in the next section.

Reduce memory footprint and computational steps
The computation of the value of d is the most crucial step of constant weight encoding and decoding. As Eq. (1) suggests, it involves floating-point arithmetic. However, many embedded/hardware systems do not have dedicated floating-point units for such computations. Heyse et al. [19] proposed to replace the floating-point unit by a lookup table with predefined data for reconfigurable hardware. The problem with their method is that, for some large $(n, t)$, the lookup table can be sizeable. For example, $(n = 2^{16}, t = 9)$ requires a lookup table of 256 kb, which is obviously not a negligible memory overhead for embedded systems.
To solve this problem, we propose to eliminate the lookup table by computing $d$ directly using fixed-point arithmetic. We separate the computation of $d$ into two parts. In the first part, $\theta[t]$, the factor of Eq. (1) that depends only on $t$, is precomputed and stored in fixed-point format. In the second part, $d = \left(n - \frac{t-1}{2}\right)\theta[t]$ is computed. Furthermore, we substitute $n - \frac{t-1}{2}$ by $n$ due to the following observations [26]:
• $n \gg t$, such that $n - \frac{t-1}{2} \approx n$.
• Eventually $d$ must be rounded to an integer, and hence the difference between $\left(n - \frac{t-1}{2}\right)\theta[t]$ and $n\,\theta[t]$ can very likely be ignored.
This substitution removes the computation of $n - (t-1)/2$, and hence a faster and simpler realization of constant weight coding that uses a single integer multiplication is achievable.
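The size of the error introduced by this substitution can be bounded with a short calculation. The sketch below assumes that $\theta[t] = 1 - 2^{-1/t}$; the text does not spell out this closed form, so it is an assumption made here for illustration.

```latex
\begin{align*}
\Delta &= n\,\theta[t] - \left(n - \frac{t-1}{2}\right)\theta[t]
        = \frac{t-1}{2}\,\theta[t], \\
\theta[t] &= 1 - 2^{-1/t} = 1 - e^{-\ln 2 / t} \le \frac{\ln 2}{t}
        \qquad (\text{since } 1 - e^{-x} \le x), \\
\Delta &\le \frac{t-1}{2t}\,\ln 2 < \frac{\ln 2}{2} \approx 0.347 .
\end{align*}
```

Under this assumption, the substitution overestimates $d$ by less than 0.35 before rounding, independently of $n$, which is consistent with the observation that the difference can be ignored once $d$ is rounded.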
In summary, our new approximation of $d$ is as follows [26]:

$$d \approx n \cdot \theta[t],$$

where $\theta[t]$ is a function of $t$ and is precomputed. Our new approximation of $d$ is lightweight, requiring only one multiplication. In the following, we demonstrate that this method also permits reasonably high coding efficiency as a source coding algorithm.

Represent θ[t] in fixed-point format
As mentioned above, $\theta[t]$ is a vector of fractional numbers and should be stored in fixed-point format. Note that the integer part of $\theta[t]$ is always 0, and therefore we only need to store its fractional part. Hereafter, we denote our fixed-point format as fixed_0_i, where i indicates the number of bits used for storing the fractional part of $\theta[t]$.
In practice, we want to use fixed_0_i with the smallest i that still maintains the desired coding efficiency [26]. A smaller i means lower data storage and a simpler multiplication with smaller operand size, which is particularly advisable for resource-constrained computing platforms. Indeed, one of the key issues of this chapter is to determine the best fixed_0_i for $\theta[t]$. In the next section, we describe our experiments on exploring the optimal fixed_0_i.
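To make the fixed_0_i format concrete, the following sketch stores $\theta[t]$ as an i-bit integer and obtains the integer part of $n \cdot \theta[t]$ with one integer multiplication and a shift. The function names are hypothetical, and two details are assumptions of this sketch rather than statements from the text: the closed form $\theta[t] = 1 - 2^{-1/t}$, and truncation (rather than rounding) when quantizing.

```python
def theta_fixed(t: int, i: int) -> int:
    """theta[t] in fixed_0_i: an i-bit unsigned integer q with theta ~ q / 2**i.

    ASSUMPTION: theta[t] = 1 - 2 ** (-1 / t), truncated to i fractional bits.
    """
    theta = 1.0 - 2.0 ** (-1.0 / t)
    return int(theta * (1 << i))


def approx_d(n: int, t: int, i: int) -> int:
    """Integer part of n * theta[t], using a single integer multiplication."""
    return (n * theta_fixed(t, i)) >> i


if __name__ == "__main__":
    # Parameter sets mentioned in this chapter; the precisions are examples.
    for n, t in [(2**10, 38), (2**11, 27), (2**16, 9), (2**18, 9), (2**20, 8)]:
        row = {i: approx_d(n, t, i) for i in (16, 8, 5, 4, 2)}
        print(f"n=2^{n.bit_length() - 1}, t={t}: {row}")
```

With truncation, the stored value differs from $\theta[t]$ by less than $2^{-i}$, so the product $n \cdot \theta[t]$ is off by less than $n \cdot 2^{-i}$. This is why very small i eventually drives the stored value, and with it the product, to zero.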

Find the optimal precision for constant weight encoding
The purpose of the precision tuning is to find the lowest precision that still maintains a relatively high coding efficiency. A lower precision means we can use a multiplier of smaller operand size leading to better path delay and slice utilization. A higher coding efficiency means one can encode more bits from the source into a constant weight word. This is of interest, in particular, when someone encrypts a relatively large file using code-based crypto: It takes much less time for encryption if we have a high coding efficiency close to 1 [26].
To find the optimal precision of fixed_0_i, we studied the relationship between distinct values of i and their coding performance. In our experiments, all precisions from fixed_0_32 down to fixed_0_1 are investigated and compared with Sendrier's methods [17] and Heyse's approximate version [18].

Figure 1. The performance of different methods for choosing the optimal d. We list the five most frequently used sets of $(n, t)$ for the Niederreiter cryptosystem. We ran three experiments for each $(n, t)$, in which the input binary message contains "1" with probability p = 0.1, 0.5, and 0.9, respectively. The results of each experiment are obtained by running 10,000 different messages. The X-axis lists the different methods, including Sendrier's primitive [17], Sendrier's approximation [17], Heyse's approximation [19], and our n*fixed_0_16 through n*fixed_0_2. The Y-axis represents the average length (bits) of the input message read for a successful constant weight encoding.

We measured the coding efficiency by calculating the average coding length for a successful encoding, because a longer coding length indicates a better coding efficiency. Since constant weight coding is a variable-length code, we must consider how different input plaintexts affect the performance in order to determine which approximation is the best. To be thorough, different types of binary plaintexts, classified by the proportion of "1"s they contain, should be tested to evaluate the real performance of the different approximation methods. To simplify the model, we measure three particular types: "0"-dominated texts (p = 0.1, i.e., each bit is "1" with probability 0.1), balanced texts (p = 0.5), and "1"-dominated texts (p = 0.9) (Figure 1).

Figure 2 describes the coding performance when we adjust the precision of $\theta[t]$. Taken as a whole, the p = 0.1 group and the p = 0.5 group show a similar trend in the average message length encoded as the arithmetic precision decreases: the message length drops slightly and consistently from n*fixed_0_16 to n*fixed_0_2. In contrast, the p = 0.9 group behaves quite differently: the number of bits read for a single constant weight encoding first stays stable and then drops as the precision decreases. The numbers first stay stable because the loss of precision in $\theta[t]$ is comparatively small. If the precision drops too low, however, for instance with the fixed_0_2 representation, $\theta[t] = 0$ for $2 \le t \le 38$ and hence $d = 0$. This leads to a constant n and a small value of i in Algorithm 3, forcing us to read more bits of the input stream before the algorithm halts. According to the evaluation criteria mentioned in the last paragraph, we compute the average length over the three types of plaintexts and identify the best approximation of d from our proposal after analyzing the statistics obtained.

On the one hand, the n*fixed_0_5 group outperforms the others at $n = 2^{10}$. On the other hand, the n*fixed_0_4 group beats the others at $n = 2^{16}$ [17]. Additionally, our proposal even outperforms Sendrier's method at two of these parameter sets, $(n = 2^{10}, t = 38)$ and $(n = 2^{11}, t = 27)$, with improvements of 5.05% and 0.95%, respectively. It is also worth mentioning that for $n = 2^{16}$, the performance of our proposal falls slightly behind, with losses of 2.37%, 2.08%, and 2.22% compared with Sendrier's method; it nonetheless outperforms Sendrier's approximation and Heyse's approximation. In particular, the performance of Heyse's approximation becomes unfavorable, with a 23.56% loss at $(n = 2^{20}, t = 8)$; here we are pushing the limits of Heyse's method, as the lower bits of n are not negligible and cannot be dropped for such large n.
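The measurement procedure described in this section can be approximated in software. The sketch below is not the authors' test bench and does not reproduce the reported numbers. It counts how many input bits a Bin2CW-style loop reads when each bit is "1" with probability p and d is derived from fixed_0_i. All function names are hypothetical. The form $\theta[t] = 1 - 2^{-1/t}$, the rounding up to a power of two, and the clamping of d to $[1, n - t]$ are assumptions; in particular, the clamp means d never becomes 0, unlike the fixed_0_2 case discussed above.

```python
import random


def theta_fixed(t: int, i: int) -> int:
    # ASSUMPTION: theta[t] = 1 - 2**(-1/t), truncated to i fractional bits.
    return int((1.0 - 2.0 ** (-1.0 / t)) * (1 << i))


def best_d_fixed(n: int, t: int, i: int) -> int:
    """d ~ n * theta[t], rounded up to a power of two, clamped to [1, n - t]."""
    x = (n * theta_fixed(t, i)) >> i
    u = (x - 1).bit_length() if x > 1 else 0   # ceil(log2(x)) for x >= 1
    while (1 << u) > n - t:
        u -= 1
    return 1 << u


def bits_read(n: int, t: int, p: float, i: int, rng: random.Random) -> int:
    """Number of input bits consumed by one constant weight encoding."""
    count = 0
    while t > 0:
        if n <= t:                              # forced "1", nothing is read
            n, t = n - 1, t - 1
            continue
        d = best_d_fixed(n, t, i)
        u = d.bit_length() - 1
        count += 1
        if rng.random() < p:                    # read a "1": skip d positions
            n -= d
        else:                                   # read a "0", then u offset bits
            off = 0
            for _ in range(u):
                off = (off << 1) | int(rng.random() < p)
            count += u
            n, t = n - off - 1, t - 1
    return count


def average_bits(n: int, t: int, p: float, i: int,
                 trials: int = 1000, seed: int = 0) -> float:
    rng = random.Random(seed)
    return sum(bits_read(n, t, p, i, rng) for _ in range(trials)) / trials


if __name__ == "__main__":
    for p in (0.1, 0.5, 0.9):
        row = [round(average_bits(2**10, 38, p, i, trials=100), 1)
               for i in (16, 8, 5, 4, 2)]
        print(f"p={p}: {row}")
```

Fixing the seed makes runs repeatable, so different precisions can be compared on the same pseudo-random inputs by passing the same seed for each i.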

best_d module
The best_d module is the most critical arithmetic unit; it computes the best value of d according to the inputs n and t. Our computation of best_d consists of three stages, which perform the following tasks in sequence:
2. Compute $n \cdot \theta[t]$ via a fixed-point multiplier. Xilinx LogiCORE IP is configured to implement high-performance, optimized multipliers for the different pairs of n and t. The fractional part of the multiplication result is truncated, and its integer part is preserved for the next stage to process.
3. Output the values of d and u. Recall that the value of $n \cdot \theta[t]$ must be rounded to $d = 2^u$. Another priority encoder is utilized to decode the integer part of $n \cdot \theta[t]$. The detailed decoding process is structured as a lookup table mapping, illustrated in Table 4.

Figure 2 depicts our best_d unit. This unit works as a three-stage pipeline. It first computes $\theta[t]$ and then obtains $n \cdot \theta[t]$ using a multiplier. Finally, the value of d is determined by a priority decoder.

Figure 4 renders the architecture of the proposed constant weight decoder. A symmetric m-to-m bit FIFO is used to read the input t-tuple word by word. This logic is indeed the bottleneck of the constant weight decoder when compared with the encoder. Three registers are utilized to update the values of n, t, and δ, as the Bin2CW encoder does. The major difference is that the shift register here outputs the value of δ bit by bit, as step 9 of Algorithm 2 demands.
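The third stage of the best_d unit, which turns the integer part of $n \cdot \theta[t]$ into $d = 2^u$, can be modelled in software as a priority encoder. The sketch below is a behavioural model only. Table 4 is not reproduced in this text, so the mapping used here (round up to the next power of two, with 0 and 1 both mapping to d = 1) is an assumption, and the names `priority_encode` and `round_pow2` are hypothetical.

```python
from typing import Tuple


def priority_encode(x: int, width: int) -> int:
    """Index of the most significant set bit of x, or -1 if x == 0."""
    for k in reversed(range(width)):
        if (x >> k) & 1:
            return k
    return -1


def round_pow2(x: int, width: int = 20) -> Tuple[int, int]:
    """Return (u, d) with d = 2**u >= x.

    ASSUMPTION: rounding is upward, i.e. u = ceil(log2(x)), and x in {0, 1}
    maps to u = 0, d = 1 so that d stays at least 1.
    """
    if x <= 1:
        return 0, 1
    u = priority_encode(x - 1, width) + 1
    return u, 1 << u


if __name__ == "__main__":
    for x in (0, 1, 2, 3, 51, 64, 65):
        print(x, round_pow2(x))
```

Scanning `x - 1` rather than `x` keeps exact powers of two unchanged (64 maps to 64) while everything in between is rounded up, which matches the $2^{\lceil \log_2 d \rceil}$ rounding used in Sendrier's approximation.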

Integrating with the Niederreiter encryptor
In this section, we demonstrate that the proposed Bin2CW encoder can be integrated into the Niederreiter encryptor for data encryption, as shown in Algorithm 3.

Algorithm 3. Niederreiter Message Encryption, referenced from [25]
Input: message vector m, public key pk = {$\hat{H}$, t}, where $\hat{H}$ is an n by mt matrix
Output: ciphertext c
1 Bob encodes the message m as a binary matrix/vector of length n and weight at most t.
2 Bob computes the ciphertext as $c = \hat{H} m^T$, where $m^T$ is the transpose of matrix m.
The Bin2CW encoder performs the first step of Algorithm 3. Note that the Bin2CW encoder returns a t-tuple of integers $(\delta_1, \ldots, \delta_t)$, which represents the distances between consecutive "1"s in the string. Nevertheless, such a t-tuple cannot be used directly to compute the ciphertext. We believe that the way Heyse et al. [19] encrypt, $c = \hat{H} m^T$ with $m = (\delta_1, \ldots, \delta_t)$, is incorrect for two reasons [26]:
1. It is very likely that $\delta_i = \delta_j$ for some $i \neq j$, so that the number of errors is less than t, which is assumed to be insecure against cryptanalysis.
2. Each element of $(\delta_1, \ldots, \delta_t)$ is an integer ranging from 0 to $n - t$, but the coordinates of the constant weight word range from 0 to n. In other words, the last t rows of the public key $\hat{H}^T$ are never used.
To correct this weakness of [19], we propose to generate the "real" constant weight binary words of length n and Hamming weight t. Assume the constant weight word is represented by $(i_1, \ldots, i_t)$, the coordinates of the "1"s in ascending order; then $i_1 = \delta_1$ and $i_k = i_{k-1} + \delta_k + 1$ for $2 \le k \le t$. The ciphertext is the sum of the rows of $\hat{H}^T$ indexed by $i_1, \ldots, i_t$, which can be implemented as a $GF(2^{297})$ adder in the figure. It is also worth noting that the vector-matrix multiplication works concurrently with the CW encoding: whenever a valid $i_k$ has been computed, it is transferred immediately to the $GF(2^{297})$ adder for summing up the selected row. Once the last index $i_t$ has been computed, the last indexed row of $\hat{H}^T$ has also been accumulated into the sum. This sum, stored in the 297-bit register, is now available to be interpreted as the ciphertext.
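The conversion from the t-tuple to the coordinates of the "1"s, followed by the accumulation of the selected rows, can be sketched as follows. This is a software illustration under assumptions, not the hardware design: the function names are hypothetical, coordinates are taken to be zero-based with $i_1 = \delta_1$ and $i_k = i_{k-1} + \delta_k + 1$, and each row of $\hat{H}^T$ is modelled as a Python integer whose bits are the row entries, so that addition over GF(2) becomes XOR.

```python
from typing import List, Sequence


def deltas_to_positions(deltas: Sequence[int]) -> List[int]:
    """Coordinates of the "1"s, given the zero-run lengths between them."""
    positions: List[int] = []
    pos = -1
    for delta in deltas:
        pos += delta + 1          # skip `delta` zeros, then land on a "1"
        positions.append(pos)
    return positions


def encrypt(h_t_rows: Sequence[int], deltas: Sequence[int]) -> int:
    """XOR together the rows of H^T selected by the constant weight word.

    Each index is consumed as soon as it is known, mirroring the way the
    accumulation can run concurrently with the encoder.
    """
    acc = 0
    pos = -1
    for delta in deltas:
        pos += delta + 1
        acc ^= h_t_rows[pos]
    return acc


if __name__ == "__main__":
    import random

    rng = random.Random(0)
    n, width = 16, 8              # toy sizes, far below the 297-bit rows above
    h_t = [rng.getrandbits(width) for _ in range(n)]
    deltas = [0, 0, 2, 5]
    print(deltas_to_positions(deltas), format(encrypt(h_t, deltas), "08b"))
```

Because every step adds at least 1 to the running position, the resulting coordinates are strictly increasing, so equal values in the t-tuple no longer select the same row twice.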
Our final remark concerns side-channel attacks on code-based cryptography using constant weight encoding (CWE). Admittedly, we cannot give a satisfying answer at present, and we believe this is an open problem. For now, if users decide not to take the risk of timing attacks, we suggest forcing the CWE to run in constant time: the execution time can be fixed to the maximum (which occurs for an all-zero input) regardless of the input. The price, however, is a significant drop in timing performance.
We give here our analysis of timing attacks. An attacker can compromise the CWE if and only if he can analyze the timing differences among different inputs and use this information to recover the entire message. Unfortunately, the timing of reading a "1" differs from that of reading a "0": reading a "1" consumes only 1 clock cycle, whereas after reading a "0" the encoder continues to read $\lceil \log_2 d \rceil$ more bits from B, consuming $1 + \lceil \log_2 d \rceil$ cycles. These behaviors appear at first sight to be vulnerable to timing attacks: for different inputs, the execution time is slightly different, and the distinction between reading "0" and reading "1" is also significant. However, statistical approaches such as those introduced against RSA cryptosystems in [27,28] do not seem to be helpful here; in our view, the situation we encounter is much more difficult. (1) We need to recover the particular message under attack, but the timing differences among different messages that we collect do not leak any useful information about the targeted message. In contrast, in RSA we can use a large number of messages and compare the timing to recover the secret key bit by bit. (2) Note that d changes in each iteration according to the current state of n and t. That is to say, when reading a "0," the timing is variable, sensitive to how the "1"s and "0"s are arranged in the message, and thus difficult to predict. Most importantly, even if the CWE is somehow compromised, it does not reveal any information about the secret keys. In the case of decryption, the ciphertext is first decrypted by an error-correcting decoder (typically a Goppa-code or MDPC-code decoder), which holds the secret key. The result after error correction is a $GF(2^n)$ vector, and this vector is then processed by the CWE to recover the plaintext.
We can see the key point here: timing attacks should be mounted on error-correcting decoders rather than constant weight decoders to retrieve the secret keys. Perhaps a better strategy is to mount timing attacks on the CWE to recover the plaintext directly. This raises one more question: how do we distinguish or measure the particular timing of the CWE within the total execution time, given that error-correcting decoders also take nonconstant time for decoding? This is indeed a very exciting topic, which we will investigate in future work.

Results and comparisons
We captured our constant weight coding architecture in the Verilog language and prototyped our design on a Xilinx Virtex-6 FPGA. Implementations of code-based cryptography commonly target Virtex-6 or even lower-end devices [19,29,30,31,32]. The benefit of using Virtex-6, from our standpoint, is that we can fairly compare our design with others, given that most of them are also implemented on Virtex-6.

To the best of our knowledge, the only compact implementations of constant weight coding have been proposed by Heyse et al. [19]. Their lightweight architecture is generally identical to ours except for the design of the best_d module. Their best_d module works in two pipeline stages: in the first stage, it retrieves the value of u by table lookup; in the second stage, it outputs d according to the value of u using a simple decoder. By comparison, our best_d module has three pipeline stages, which leads to a lower throughput, but our architectures are smaller and improve the area-time tradeoff of the constant weight coding implementations proposed by Heyse et al. [19], as shown in Table 5. In particular, we use only one 18 kb memory block for all parameter sets in our experiments.

We also observe that in our designs the memory footprint does not increase and the high clock frequency is maintained as the parameters grow. This is because the main difference among encoders or decoders with distinct parameters n and t is the data width of the multiplier embedded in the best_d module, which increases logarithmically from 10 bit × 5 bit to 20 bit × 4 bit. On the other hand, the memory overhead of Heyse's implementations grows linearly with n and might introduce problems when n is large, as mentioned above. To verify this argument, we re-implemented Heyse's work for $(n = 2^{16}, t = 9)$, $(n = 2^{18}, t = 9)$, and $(n = 2^{20}, t = 8)$. The experimental results validate this point.

Additionally, another negative side effect of heavy memory overhead is that the working frequency of the circuits drops rapidly, as shown in Table 6. For the small parameter sets (a) and (b), the lookup table in Heyse's design can be built from distributed memory (LUTs) and therefore has little impact on frequency. However, for the large parameter sets (c), (d), and (e), such a lookup table can no longer be instantiated as LUTs because the Xilinx Virtex-6 distributed memory generator only allows a maximum data depth of 65,536. We instead use the block memory resources of the FPGA to construct the table, and this accordingly hinders speed performance due to relatively long and complicated routing. The usage of block memory is the real bottleneck of Heyse's work as n grows.

11, t = 30. We also put it here for reference.

We finally implemented the Niederreiter encryptor, a cryptographic application where constant weight coding is used exactly as described in Section 5. Table 6 compares our work with the state of the art [19,26]. It is seen that our new implementation is the most compact, with better area-time tradeoffs. Our design occupies the same amount of block memory as [19]: 16 × 36 kb + 1 × 18 kb RAMs are used to store the public-key matrix $\hat{H}$, and one 18 kb RAM is used for the 8-to-1 FIFO within the constant weight encoder.