High-Speed Area-Efficient Implementation of AES Algorithm on Reconfigurable Platform

Digital information is now very easy to process, but this also makes it easier for unauthorized users to access it. Cryptography is one of the most powerful and commonly used techniques for protecting information from unauthorized access. Among the various cryptographic algorithms, the advanced encryption standard (AES) is one of the most frequently used symmetric key algorithms. The main objective of this chapter is to implement a fast, secure, and area-efficient AES algorithm on a reconfigurable platform. In this chapter, the AES algorithm is designed using Xilinx system generator, implemented on a Nexys-4 DDR FPGA development board, and simulated using MATLAB Simulink. Synthesis results show that the implementation consumes 121 slice registers, and its maximum operating frequency is 1102.536 MHz. The throughput achieved by this implementation is 14.1125 Gbps.


Introduction
NIST initiated the development of a Federal Information Processing Standard (FIPS) for the AES algorithm as the replacement for the data encryption standard (DES) algorithm. This algorithm is also known as the Rijndael algorithm. The Rijndael algorithm has advantages such as resistance against all recognized attacks, speed and code compactness, and a simple design. Cryptography is a process in which the information to be sent is combined with a secret key so that the data can be transmitted securely to the destination. There are two types of cryptography based on the type of key applied: symmetric key cryptography and asymmetric key cryptography. In symmetric key cryptography, the same key is used for encryption and decryption, whereas in asymmetric key cryptography, different keys are used for encryption and decryption.
The AES algorithm is selected for implementation because it is secure and its components and design principles are completely specified. AES is a symmetric key block cipher. Its design is based on a substitution-permutation network, which combines a nonlinear byte substitution with linear transformations. Rijndael allows different block and key sizes to be selected, which was not possible in the DES algorithm: block and key sizes can each be chosen from 128/160/192/224/256 bits and need not be the same. The AES standard, however, fixes the block size at 128 bits and allows key sizes of 128/192/256 bits. The number of rounds depends on the key size: for key sizes of 128, 192, and 256 bits, the number of rounds is 10, 12, and 14, respectively.
The structure of the AES algorithm is shown in Figure 1. In this chapter, the algorithm is designed with a block size and a key size of 128 bits each, that is, AES generates 128 bits of cipher text for 128 bits of plaintext. After the initial round, the plaintext is processed through ten rounds. Each round contains processes like byte substitution, shift rows, mix columns, and add round key.

Byte substitution
The 16 input bytes are substituted using a fixed lookup table known as the s-box.
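The substitution described above can be sketched in software. The following is a minimal Python sketch, not the authors' system generator model: it assumes the standard AES s-box construction (multiplicative inverse in GF(2^8) followed by an affine transformation), and the names `gf_mul`, `gf_inverse`, `build_sbox`, and `sub_bytes` are hypothetical helpers introduced only for illustration.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def gf_inverse(a):
    """Multiplicative inverse in GF(2^8); 0 is mapped to 0 by convention."""
    if a == 0:
        return 0
    # a^254 equals a^-1 because the nonzero elements form a group of order 255.
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result


def rotl8(x, shift):
    """Rotate an 8-bit value left by `shift` bits."""
    return ((x << shift) | (x >> (8 - shift))) & 0xFF


def build_sbox():
    """Build the 256-entry s-box: field inverse followed by the affine map."""
    sbox = []
    for value in range(256):
        inv = gf_inverse(value)
        s = inv ^ rotl8(inv, 1) ^ rotl8(inv, 2) ^ rotl8(inv, 3) ^ rotl8(inv, 4)
        sbox.append(s ^ 0x63)
    return sbox


SBOX = build_sbox()


def sub_bytes(state):
    """Replace each of the 16 state bytes by its s-box entry."""
    return [SBOX[b] for b in state]
```

In hardware the table is normally stored as a fixed lookup (LUTs or BRAM) rather than computed; the computation here only shows where the table values come from.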

Shift row
Each row of the matrix generated by the byte substitution is cyclically shifted to the left. Any entry that drops off the left is reinserted on the right side. The first row is kept as it is, the second row is shifted by one byte position to the left, the third row is shifted by two byte positions to the left, and the fourth row is shifted by three byte positions to the left. The resultant matrix consists of the same 16 bytes but at different positions. Figure 4 shows shift row stage in AES algorithm.
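The row rotation described above can be illustrated with a short Python sketch. It assumes the state is held as a list of four rows of four bytes; the function name `shift_rows` and the sample values are hypothetical and chosen only so that each byte's original position is visible.

```python
def shift_rows(state):
    """Rotate row r of a 4x4 state left by r byte positions (row 0 unchanged)."""
    return [row[r:] + row[:r] for r, row in enumerate(state)]


# Sample state: high nibble = row index, low nibble = column index.
state = [
    [0x00, 0x01, 0x02, 0x03],
    [0x10, 0x11, 0x12, 0x13],
    [0x20, 0x21, 0x22, 0x23],
    [0x30, 0x31, 0x32, 0x33],
]
shifted = shift_rows(state)
```

Because the stage only moves bytes, a hardware version needs no logic at all, only rewiring of the byte lanes.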

Mix column
Each column of four bytes is now transformed using a special arithmetic function over the Galois field GF(2^8). This function takes the four bytes of a column as input and outputs four completely new bytes that replace the original four bytes. Figure 5 shows mix column stage in AES algorithm.
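As an illustration of the column transformation, the sketch below assumes the standard AES mix column matrix with coefficients 2, 3, 1, 1 and the reduction polynomial 0x11B. The helper names `xtime` and `mix_single_column` are hypothetical; the chapter's own implementation is the system generator model, not this code.

```python
def xtime(a):
    """Multiply a byte by 2 in GF(2^8), reducing by 0x11B on overflow."""
    a <<= 1
    if a & 0x100:
        a ^= 0x11B
    return a & 0xFF


def mix_single_column(col):
    """Apply the mix column matrix to one 4-byte column."""
    a0, a1, a2, a3 = col
    return [
        xtime(a0) ^ (xtime(a1) ^ a1) ^ a2 ^ a3,  # 2*a0 + 3*a1 + a2 + a3
        a0 ^ xtime(a1) ^ (xtime(a2) ^ a2) ^ a3,  # a0 + 2*a1 + 3*a2 + a3
        a0 ^ a1 ^ xtime(a2) ^ (xtime(a3) ^ a3),  # a0 + a1 + 2*a2 + 3*a3
        (xtime(a0) ^ a0) ^ a1 ^ a2 ^ xtime(a3),  # 3*a0 + a1 + a2 + 2*a3
    ]


mixed = mix_single_column([0xDB, 0x13, 0x53, 0x45])
```

Multiplication by 3 is written as `xtime(a) ^ a`, so the whole stage reduces to shifts and XORs, which is why it maps well onto FPGA logic.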

Add round key
The 16 bytes of the matrix generated by the mix column stage are treated as 128 bits. In the add round key stage, these 128 bits of state are bitwise XORed with the 128 bits of the round key. This is a column-wise operation between the four bytes of a state column and one word of the round key. If the result belongs to the last round, the output is the cipher text; otherwise, the resulting 128 bits are interpreted as 16 bytes, and another round starts with a new byte substitution. In the last round, there is no mix column step. Figure 6 shows add round key stage in AES algorithm.
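The add round key stage is a plain bitwise XOR, which the following Python sketch makes concrete. The function name `add_round_key` is hypothetical, and the sample state and key bytes are illustrative values only.

```python
def add_round_key(state, round_key):
    """XOR a 16-byte state with a 16-byte round key."""
    if len(state) != 16 or len(round_key) != 16:
        raise ValueError("state and round key must both be 16 bytes")
    return bytes(s ^ k for s, k in zip(state, round_key))


state = bytes.fromhex("00112233445566778899aabbccddeeff")
round_key = bytes.fromhex("000102030405060708090a0b0c0d0e0f")

mixed = add_round_key(state, round_key)
# Applying the same key a second time restores the original state.
restored = add_round_key(mixed, round_key)
```

The second call illustrates why the same stage can be reused unchanged during decryption.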
Decryption of the cipher text generated by AES encryption contains all the stages of encryption but in reverse order. AES decryption starts with the inverse initial round. The remaining nine rounds in decryption consist of processes like add round key, inverse shift rows, inverse byte substitution, and inverse mix columns.
Add round key: Add round key is its own inverse because XOR is its own inverse; the round keys must be applied in reverse order.
Inverse shift rows: Inverse shift rows works exactly like the shift row stage but in the opposite direction. The first row is kept as it is, the second row is shifted by one byte position to the right, the third row is shifted by two byte positions to the right, and the fourth row is shifted by three byte positions to the right. The resultant matrix consists of the same 16 bytes but at different positions. Figure 7 shows inverse shift row stage in AES algorithm.
Inverse byte substitution: Inverse byte substitution is done using predefined substitution table known as inverse s-box. Figure 8 shows inverse s-box in AES algorithm.
Inverse mix column: Transformation in inverse mix column is done using polynomials of degree less than 4 over the Galois field GF(2^8), in which the coefficients are the elements from the column of the state.
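To illustrate how an inverse stage undoes its forward counterpart, the sketch below pairs shift rows with inverse shift rows and checks the round trip. It assumes the state is a list of four rows of four bytes; the names `shift_rows` and `inv_shift_rows` are hypothetical.

```python
def shift_rows(state):
    """Rotate row r of a 4x4 state left by r byte positions."""
    return [row[r:] + row[:r] for r, row in enumerate(state)]


def inv_shift_rows(state):
    """Rotate row r of a 4x4 state right by r byte positions."""
    return [row[4 - r:] + row[:4 - r] for r, row in enumerate(state)]


state = [
    [0x00, 0x01, 0x02, 0x03],
    [0x10, 0x11, 0x12, 0x13],
    [0x20, 0x21, 0x22, 0x23],
    [0x30, 0x31, 0x32, 0x33],
]
right_shifted = inv_shift_rows(state)
round_trip = inv_shift_rows(shift_rows(state))
```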
The rest of the chapter is organized as follows: Section 2 presents the survey based on the various kinds of implementation of AES algorithm on reconfigurable platform. In Section 3, implementation of AES algorithm using the proposed approach is discussed. In Section 4, experimental results achieved using the proposed method along with the comparative analysis with existing methods are discussed.

Literature survey
This section focuses on the work done by various researchers on FPGA-based implementation of the AES algorithm. Various researchers have concentrated on either area optimization or speed optimization. Mulani and Mane [1] discussed integrating DWT and the AES algorithm for implementation of watermarking on FPGA. The design was implemented on xc6vcx75t-2ff484, and it utilizes 2117 slices at a maximum operating frequency of 228.064 MHz. Ratheesh and Narayanan [2] proposed an implementation of the AES algorithm with a low-power MUX LUT-based s-box on FPGA. This design achieved a total power dissipation of 0.55 W. Agarwal et al. [3] suggested an implementation of the AES algorithm using Verilog on a Spartan-3E FPGA. This design utilizes 1464 slices. Farooq and Faisal Aslam [4] discussed implementation of the AES algorithm on an FPGA device using five different techniques, which are suitable for area-critical and speed-critical applications. This design was implemented on a Spartan-6 FPGA device, and it utilizes 161 slices at a maximum operating frequency of 886.64 MHz. The throughput of this system is 113.5 Gbps. Sai Srinivas and Akramuddin [5] proposed a less complex hardware implementation of the AES Rijndael algorithm on a Xilinx Virtex-7 XC7VX90T FPGA. In the proposed design, the synthesis tool was set to optimize speed, area, and power. Mathur and Bansode [6] proposed a cryptosystem that combines the AES algorithm and ECC. This is a hybrid encryption scheme with a key size of 192 bits and 12 iterations. Kalaiselvi and Mangalam [7] proposed a low-power and high-throughput FPGA implementation of the AES algorithm using a key expansion technique. This design accepts a key size of 256 bits for both encryption and decryption. It utilizes 5493 slices, and its maximum operating frequency is 277.4 MHz. The throughput of this system is 0.06 Gbps.
Deshpande et al. [8] suggested a BRAM-based FPGA implementation of the AES algorithm. Because BRAMs are used for implementing the s-box, this design utilizes fewer slices. The design was implemented on XC3S1400AN, and it utilizes 3376 slices. Ibrahim [9] presented an FPGA implementation of an AES encryption core that is suitable for resource-limited applications. This design was implemented on Spartan-3, and it utilizes 150 slices at a maximum operating frequency of 90 MHz. Khose and Raut [10] proposed an implementation of the AES algorithm on FPGA in order to achieve high-speed data processing and to reduce the time for generating the key. This design utilizes 201 slices and 2 BRAMs at a maximum operating frequency of 70 MHz. Mulani and Mane [11] proposed an FPGA implementation of the DES algorithm. The design was implemented on XC2S200, and it utilizes 2118 slices and 97 IOBs. Yewale Minal and Sayyad [12] proposed implementation of AES encryption using VHSIC hardware description language (VHDL) and decryption using Visual Basic. With this approach, 1403 slices are utilized at a maximum operating frequency of 160.875 MHz, and the throughput is 2.059 Gbps. Deshpande et al. [13] discussed an FPGA-based optimized architecture that utilizes less area. This design was intended for a plaintext of 128 bits and a key of 128 bits. Tonde and Dhande [14] discussed an FPGA-based implementation of the AES algorithm using an iterative looping approach for 128 bits of block and key size. Varhade and Kasat [15] proposed an FPGA-based AES algorithm, which utilizes 1746 logic elements and 32,768 memory bits. This design was synthesized on Cyclone-II using Altera. Wadi and Zainal [16] proposed modifications such as decreasing the number of rounds and replacing the s-box with a new s-box to reduce hardware requirements and to enhance the performance of the AES algorithm in terms of ciphering time and pattern appearance.
Wang et al. [17] suggested a high-speed implementation of the AES algorithm on FPGA to transmit data securely using pipelining and parallel processing methods. Shylashree et al. [18] focused on various novel FPGA architectures of the AES algorithm. Borkar et al. [19] proposed an iterative design approach for FPGA implementation of the AES algorithm using VHDL. This design utilizes 1853 slices, and its operating frequency is 140.390 MHz. Deshpande et al. [20] presented a very low complexity FPGA-based architecture for an integrated AES encryptor and decryptor. This design is synthesized on a Spartan-3 XC3S400 FPGA. Kaur and Vig [21] suggested an efficient implementation of the AES algorithm on FPGA in which multiple rounds are processed simultaneously. This implementation increases speed but also increases area. The design utilizes 6279 slices and 5 BRAMs, and its operating frequency is 119.954 MHz. Samanta [22] proposed a fast and efficient reconfigurable platform-based implementation of the AES algorithm using pipelining. This design utilizes 1051 slices and 11 BRAMs, and its operating frequency is 76.699 MHz. Good and Benaissa [23] discussed hardware implementation of fastest and slowest AES algorithm, which utilizes 16,693 slices at a maximum operating frequency of 184.8 MHz.
From the literature survey, it is clear that many researchers have worked on optimizing either area or speed, while few have concentrated on optimizing both. An implementation of the AES algorithm that is optimized in both speed and area is discussed in this chapter.

Implementation of AES algorithm
The proposed design aims to achieve both area and speed optimization. In the proposed design, the keys for each round are generated beforehand using MATLAB code, and those keys are then used in the design. Because the key expansion is not implemented in hardware, the design occupies fewer slices and is faster than the conventional approach. The design is implemented using Xilinx system generator. Figure 9 shows Xilinx system generator-based model for AES algorithm.
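The chapter does not list the MATLAB code used to generate the round keys. As a rough illustration of what such an offline step computes, the following Python sketch follows the standard AES-128 key expansion (RotWord, SubWord, and round constants); it is not the authors' code, and the names `gf_mul`, `build_sbox`, and `expand_key_128` are hypothetical. The sample key is the example key from the AES standard.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo 0x11B."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def build_sbox():
    """Build the s-box: field inverse followed by the affine transformation."""
    sbox = [0] * 256
    for x in range(256):
        # Brute-force inverse; 0 has no inverse and is mapped to 0.
        inv = next((y for y in range(1, 256) if gf_mul(x, y) == 1), 0)
        s = inv
        for shift in range(1, 5):
            s ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        sbox[x] = s ^ 0x63
    return sbox


SBOX = build_sbox()


def expand_key_128(key):
    """Expand a 16-byte key into 11 round keys of 16 bytes each."""
    if len(key) != 16:
        raise ValueError("AES-128 key must be 16 bytes")
    words = [list(key[4 * i:4 * i + 4]) for i in range(4)]
    rcon = 1
    for i in range(4, 44):
        temp = list(words[i - 1])
        if i % 4 == 0:
            temp = temp[1:] + temp[:1]        # RotWord
            temp = [SBOX[b] for b in temp]    # SubWord
            temp[0] ^= rcon                   # round constant
            rcon = gf_mul(rcon, 2)
        words.append([w ^ t for w, t in zip(words[i - 4], temp)])
    return [bytes(sum(words[4 * r:4 * r + 4], [])) for r in range(11)]


round_keys = expand_key_128(bytes.fromhex("2b7e151628aed2a6abf7158809cf4f3c"))
```

Precomputing the eleven round keys this way means the hardware only needs to store them as constants, which is consistent with the area saving described above; the trade-off is that changing the key requires regenerating them.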

AES encryption
A 128-bit plaintext is processed through 10 rounds. Each round contains processes like byte substitution, shift rows, mix columns, and add round key. As the keys are generated using MATLAB code, only the remaining system generator-based models, namely byte substitution, shift rows, and mix columns, are discussed in this section.
The round function is one of the important processes in the AES algorithm. Figure 10 shows system generator-based model for implementing round0 function.
The round function consists of s-box, shift row, and mix column as shown in Figure 11. Figure 12 shows implementation of s-box. Figure 13 shows implementation of shift row. Figure 14 shows implementation of mix column. Mix column consists of group_1, group_2, group_3, and group_4. Figure 15 shows implementation of group. Further, each group consists of four multiplication blocks: mul_blk, mul_blk1, mul_blk2, and mul_blk3. Figure 16 shows implementation of multiplication block.

AES decryption
A 128-bit cipher text is processed through 10 inverse rounds. Each round contains processes like inverse byte substitution, inverse shift rows, inverse mix columns, and add round key. Figure 17 shows implementation of inverse round function. Inverse round function consists of inverse s-box, inverse shift row, and inverse mix column as shown in Figure 18. Figure 19 shows implementation of inverse mix column. Inverse mix column consists of four groups, i.e., group_1, group_2, group_3, and group_4. Figure 20 shows implementation of group. Each group consists of multiplication blocks like mul_blk, mul_blk1, mul_blk2, and mul_blk3. Figure 21 shows implementation of multiplication block.
Each multiplication block consists of three multipliers, mul_2, mul_4, and mul_8, and XOR operations. Figure 22 shows implementation of multipliers. Figure 24 shows implementation of inverse s-box.
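One way to see why three multipliers and XORs are enough is that the standard inverse mix column coefficients 9, 11, 13, and 14 are all sums of 1, 2, 4, and 8. The Python sketch below illustrates this decomposition. The names mul_2, mul_4, and mul_8 come from the text, but how the authors wire them inside each block is not given, so the helpers `mul_9`, `mul_11`, `mul_13`, `mul_14`, and `inv_mix_single_column` are assumptions made for illustration.

```python
def mul_2(a):
    """Multiply a byte by 2 in GF(2^8), reducing by 0x11B on overflow."""
    a <<= 1
    if a & 0x100:
        a ^= 0x11B
    return a & 0xFF


def mul_4(a):
    return mul_2(mul_2(a))


def mul_8(a):
    return mul_2(mul_4(a))


# Each inverse mix column coefficient is an XOR of the basic multiples.
def mul_9(a):
    return mul_8(a) ^ a                      # 9 = 8 + 1


def mul_11(a):
    return mul_8(a) ^ mul_2(a) ^ a           # 11 = 8 + 2 + 1


def mul_13(a):
    return mul_8(a) ^ mul_4(a) ^ a           # 13 = 8 + 4 + 1


def mul_14(a):
    return mul_8(a) ^ mul_4(a) ^ mul_2(a)    # 14 = 8 + 4 + 2


def inv_mix_single_column(col):
    """Apply the inverse mix column matrix to one 4-byte column."""
    a0, a1, a2, a3 = col
    return [
        mul_14(a0) ^ mul_11(a1) ^ mul_13(a2) ^ mul_9(a3),
        mul_9(a0) ^ mul_14(a1) ^ mul_11(a2) ^ mul_13(a3),
        mul_13(a0) ^ mul_9(a1) ^ mul_14(a2) ^ mul_11(a3),
        mul_11(a0) ^ mul_13(a1) ^ mul_9(a2) ^ mul_14(a3),
    ]


recovered = inv_mix_single_column([0x8E, 0x4D, 0xA1, 0xBC])
```

The input column used here is the mix column output for the column DB 13 53 45, so the result shows the inverse stage restoring the original bytes.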

Software utilized
For implementing the proposed design, MATLAB 2013a and Xilinx ISE Design Suite are used. MATLAB is used for generating the keys and also to get the results in terms of images, whereas Xilinx ISE Design Suite is used to get the synthesis result, RTL schematic, and throughput of this implementation. Figure 25 shows detailed RTL schematic of the proposed implementation of AES algorithm.

Synthesis result
The design is synthesized using Xilinx XST synthesizer. In the proposed design, an optimized and synthesizable very high speed integrated circuit (VHSIC) hardware description language (VHDL) code for the implementation of image as well as 128-bit data encryption is developed so as to utilize less area and increase the speed. Table 1 shows design utilization summary of the proposed design.
From the synthesis results of the proposed design, it is clear that this system utilizes only 121 slice registers, and its maximum operating frequency is 1102.536 MHz. The throughput of the system is calculated using the following formula:

$$\text{Throughput of the system} = \frac{128\ \text{bits} \times \text{Clock frequency}}{\text{Cycles per encrypted block}} \tag{1}$$

By substituting the values in Eq. (1), the throughput of the system is 14.1125 Gbps. Figure 26 shows simulation result when an image is applied as an input.
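Eq. (1) can be checked numerically. The chapter does not state the number of cycles per encrypted block, so the value of 10 used below is an assumption: it is the value for which the reported frequency reproduces the reported throughput, and it matches the ten rounds of AES-128. The function name `throughput_gbps` is hypothetical.

```python
def throughput_gbps(block_bits, clock_mhz, cycles_per_block):
    """Throughput in Gbps from block size, clock in MHz, and cycles per block."""
    bits_per_second = block_bits * clock_mhz * 1e6 / cycles_per_block
    return bits_per_second / 1e9


# Assumed: 10 cycles per encrypted block (not stated in the chapter).
reported = throughput_gbps(block_bits=128, clock_mhz=1102.536, cycles_per_block=10)
```

With these inputs the result agrees with the reported 14.1125 Gbps to within rounding.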

Performance analysis
Performance analysis is needed to compare the proposed implementation with existing methods. The performance is compared on the basis of area and operating frequency. To date, various researchers have worked on FPGA-based implementations of AES algorithm; some of them have optimized speed and some have optimized area. In the proposed system, both area and speed are optimized. Table 2 shows performance comparison of the proposed system with previous work.

Author details
Altaf O. Mulani* and Pradeep B. Mane
AISSMS Institute of Information Technology, Pune, Maharashtra, India
*Address all correspondence to: aksaltaaf@gmail.com

Conclusion
In this chapter, a fast, area-efficient, and secure implementation of the AES algorithm on FPGA is presented. From the literature survey, it is clear that Farooq and Faisal Aslam [4] achieved better performance in terms of speed, whereas Ibrahim [9] achieved better performance in terms of area. In this design, the Xilinx system generator-based approach optimizes both: the system utilizes only 121 slice registers at a maximum operating frequency of 1102.536 MHz, and the throughput of the proposed system is 14.1125 Gbps.

Conflict of interest
There is no conflict of interest.