Open access peer-reviewed chapter

Trans_Proc: A Processor to Implement the Linear Transformations on the Image and Signal Processing and Its Future Scope

Written By

Atri Sanyal and Amitabha Sinha

Submitted: 24 June 2021 Reviewed: 28 June 2021 Published: 26 October 2022

DOI: 10.5772/intechopen.99122

Abstract

We present Trans_Proc, a reconfigurable generic processor that can execute the operations of linear transformations such as FFT, FDCT or FDWT. A graph theoretic lemma is used to establish that such a processor can compute the parallel, flow-graph-based operations found in these linear transformations. The architecture-level design and the processing element (PE) level design are presented. The primitive instruction set and the control signals implementing it are proposed. A detailed simulation validating the correctness of the data calculation and routing operations at the PE level and the architecture level is carried out using Xilinx Vivado WebPACK. The results for size, power and timing requirements are presented.

Keywords

  • Transform processor
  • Graph Theoretic Concept
  • Design
  • Primitive Instruction Set
  • Simulation

1. Introduction

In this paper we propose an efficient architecture for implementing frequently used, computationally intensive linear transformations in signal and image processing. Linear transformations such as FFT, FDCT and FDWT are computationally intensive and critical to these processing applications. The papers proposing designs in this domain fall mainly into three categories. The first category proposes architectures that implement only a single type of linear transformation, such as FFT or FDCT [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]. Since the primary focus of these implementations is speed, they are mainly implemented on ASICs. They use a variety of algorithms to reduce the number of computationally intensive operations: multiplierless designs, high-speed pipelining, data forwarding and lifting-step techniques, among others, implement FFT or FDCT algorithms with greatly reduced computational complexity and increased speed. The second category proposes processors or architectures that can implement several general linear transformations such as FFT, FDCT and FDWT. Since these architectures consist of basic building blocks common to all these transformations and must reconfigure themselves before executing a different transformation, they are mainly implemented on reconfigurable platforms such as FPGAs [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]. Our paper proposes a processor of this category. The third category discusses the implementation of more generic image and signal applications [18, 19, 20].

Data flow graphs are used extensively in the literature to describe linear transforms. It was proved earlier in [21, 22], using graph theory and mathematical induction, that a MIMD processor whose processing elements are connected like a completely connected equi-vertex bipartite graph can reproduce any action shown in the flow graph of transformations such as FFT, FDCT and FDWT of any arbitrary size. This confirms that a processor with this type of architecture can execute transforms represented by the flow graph method. The architecture of the processing element and the overall architecture discussed in [21, 22] are described thoroughly here. The architecture of the control unit and the data exchange procedure between the main CPU and its memory on one side and this processor and its local memory on the other are discussed in detail. The instruction sets of the processing element and of the overall processor are described along with their corresponding control lines. Representative examples of each category of the instruction set are considered, and the step-wise control signals that implement them are discussed. The architecture as a whole must be reconfigurable, since it is meant to implement several transforms on its own. The architecture is then coded in VHDL, synthesized and simulated using Xilinx Vivado. The processor is simulated in three stages to verify its operations. First, the components inside the processing element (floating point adder and multiplier) are simulated and tested. Then the longest sequence of execution required in Loeffler's FDCT algorithm is tested for every processing element. Finally, the overall architecture and the data routing between different processing elements are simulated and tested. The synthesis results showing the size of the architecture at the LUT level, together with the power and timing results, are discussed.

The rest of the paper is organized as follows. Section 2 discusses the theoretical background of the architecture. Section 3 discusses the implementation of the processor in a modular way: the overall architecture of the processor and the implementable control unit (CU) are presented first, followed by the processing-element-level architecture, the instruction set, and the control signals implementing some representative examples of the instruction set. Section 4 discusses the step-by-step synthesis and the simulation results in terms of speed, timing and size. Finally, Section 5 discusses the conclusion and future scope of the work.

2. Proof of the architecture using graph theoretic approach

The theoretical proof using mathematical induction is given in detail in [21]. Here we present only a brief outline of the argument.

The flow graph [23] is a widely used method of calculating transformations such as FFT, FDCT and FDWT. In FFT or FDCT, the flow graph physically resembles an equi-vertex k-partite graph, where k is the number of stages, the vertices are processing elements and the edges are the connections among the processing elements. Since the stages are mutually exclusive, an equi-vertex k-partite architecture can be reproduced by a fully connected equi-vertex bipartite graph, provided there is a one-to-one mapping between the vertex set of every stage of the k-partite graph and the two parts of the bipartite graph. Any algorithm described by a flow graph of the first category can therefore be described by a graph of the latter category, since the vertex sets have the one-to-one mapping described above. From this argument it is clear that an architecture of the second category will be efficient as a transform processor, and its reconfigurability will make it easy to switch from one transform to another, making it a general transform processor. The original architecture requires a set of processing elements in each of the two parts and a fully connected bidirectional communication network between them. The hardware cost can be greatly reduced if we instead take one set of processing elements and one set of registers, with a fully connected feed-forward network from the registers to the processing elements and a single feedback network connecting each processing element to its corresponding register. The data exchange between two processing elements Pi→Pj can then be rewritten as Pi→Ri→Pj. This takes two clock pulses rather than one, but the hardware cost is significantly reduced.
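The folding argument can be illustrated with a short simulation. The following is a minimal sketch, not the authors' implementation: the function `run_folded`, the `Task` type and the 4-point, 2-stage butterfly network are hypothetical names and data chosen only for illustration. Each stage takes two steps, a feed-forward step from registers to PEs and a feedback step from each PE to its own register, mirroring Pi→Ri→Pj.

```python
from typing import Callable, List, Sequence, Tuple

# One PE task: (indices of the registers it reads, function applied to them).
Task = Tuple[Sequence[int], Callable[..., float]]


def run_folded(stages: List[List[Task]], data: List[float]) -> List[float]:
    """Run a k-stage flow graph on one PE set plus one register bank."""
    regs = list(data)  # register bank R, loaded once from memory
    for stage in stages:
        # Clock 1: feed-forward network, any register R_j can reach any PE P_i.
        outs = [fn(*(regs[j] for j in src)) for src, fn in stage]
        # Clock 2: feedback network, P_i writes back to its own register R_i.
        regs = outs
    return regs


def add(a: float, b: float) -> float:
    return a + b


def sub(a: float, b: float) -> float:
    return a - b


# Illustrative 4-point, 2-stage butterfly network (not taken from the chapter).
stages = [
    [((0, 2), add), ((1, 3), add), ((0, 2), sub), ((1, 3), sub)],
    [((0, 1), add), ((0, 1), sub), ((2, 3), add), ((2, 3), sub)],
]
result = run_folded(stages, [1.0, 2.0, 3.0, 4.0])
print(result)
```

The same register bank and PE set are reused for every stage, which is why the number of stages k does not affect the hardware size in this sketch, only the number of clock steps.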

3. Implementation of the architecture

3.1 Implementation of the overall architecture design

The fully connected feed-forward path described in the previous section is built from 8 multiplexers of size 8 x 3 (eight data inputs selected by three select bits). Each multiplexer can take its input from any of the eight registers and sends its output to one Processing Element (PE). The select lines of each multiplexer choose the input register whose value is loaded into the PE. This constitutes the simplest yet effective feed-forward communication path between the registers and the PEs. The feedback line is implemented by a pair consisting of a 1x2 demultiplexer and a 2x1 multiplexer, which directs the output of the PE to the input line of the corresponding register. The same pair can also load the data from memory at the beginning and store the results once the calculation is complete. The current design is examined with 8 such stages, chiefly with a view to implementing one stage of an FDCT algorithm. The architecture uses 8 bit register sets to latch values entering or leaving the processing elements.

3.2 Implementation of the processing element inside the processor

The processing element (PE) inside the processor is implemented with the type of operations needed to compute these transformations in mind. Most of the operations are floating point operations, so we use one floating point adder/subtractor and one floating point multiplier inside the PE. We have used commonly available floating point adder and multiplier designs in this PE, leaving the evaluation of state-of-the-art adder and multiplier designs to improve performance as future work. Two registers latch the source data of the adder, and two similar registers latch the source data of the multiplier. The results of the adder and the multiplier are stored in two further registers. The PE contains multiplexers and demultiplexers to route data from one internal register to another and to send and receive data to and from the registers outside the PE.

Table 1 below lists the routing control signals of the processor and the routing and activation control signals of the processing elements (PEs), together with their functions.

| Signal name | Select bits | Function |
|---|---|---|
| Inmux1–8 | 0/1 | 0 = select input data from outside memory (TP1–8); 1 = select input data from the feedback line (FB1–8) |
| Routmux1–8 | 000–111 | 000 = select data from Midreg1 to PE1–8; ...; 111 = select data from Midreg8 to PE1–8 |
| Outdemux1–8 | 0/1 | 0 = select output data from Outreg1–8 to FB1–8; 1 = select output data from Outreg1–8 to output |
| CMUXSEL | 000–111 | Select any of the constants C0-C7 based upon the select line |
| PECL1 | 0/1 | 0 = DEMUX1→MUX0; 1 = DEMUX1→MUX1 |
| PECL2 | 00–11 | 00 = direct load from outside to D0/D1; 01 = movement of data from D4 to D0/D1; 10 = movement of data from D5 to D0/D1; 11 = load constant data from C0-C7 |
| PECL3 | 00–11 | 00 = direct load from outside to D2/D3; 01 = movement of data from D5 to D2/D3; 10 = movement of data from D4 to D2/D3; 11 = load constant data from C0-C7 |
| PECL4 | 0/1 | 0 = enable bit for D0; 1 = enable bit for D1 |
| PECL5 | 0/1 | 0 = enable bit for D2; 1 = enable bit for D3 |
| PEEN1 | 0/1 | 1 = enable bit for D4 |
| PEEN2 | 0/1 | 1 = enable bit for D5 |
| PECL6 | 0/1 | 0 = DEMUX2→MUX0/MUX1; 1 = DEMUX2→MUX2 |
| PECL7 | 0/1 | 0 = DEMUX3→MUX0/MUX1; 1 = DEMUX3→MUX2 |
| PECL8 | 0/1 | 0 = select input from DEMUX2; 1 = select input from DEMUX3 |

Table 1.

Names of the control signals, their values and functions used in Trans_Proc.
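As an illustration of the processor-level routing signals, the sketch below models Inmux, Routmux and Outdemux behaviourally, using the select encodings listed in Table 1. It is a simplified software model under stated assumptions: the function names, the dictionary returned by `outdemux` and the sample register values are hypothetical and do not come from the VHDL design.

```python
def inmux(tp: float, fb: float, select: int) -> float:
    """Inmux: 0 selects the outside-memory line TP, 1 the feedback line FB."""
    return fb if select else tp


def routmux(midregs: list, select: int) -> float:
    """Routmux: 3-bit code 000..111 selects MIDREG1..MIDREG8 for one PE."""
    if len(midregs) != 8 or not 0 <= select <= 0b111:
        raise ValueError("expected 8 registers and a 3-bit select code")
    return midregs[select]


def outdemux(value: float, select: int) -> dict:
    """Outdemux: 0 steers OUTREG to the feedback line, 1 to the output."""
    return {"output": value} if select else {"feedback": value}


# Hypothetical register contents, chosen only to make the routing visible.
midregs = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0]
pe_input = routmux(midregs, 0b010)      # code 010 picks MIDREG3
fed_back = outdemux(pe_input * 2, 0)    # a PE result steered to the feedback line
midregs[0] = inmux(tp=1.0, fb=fed_back["feedback"], select=1)
print(pe_input, midregs[0])
```

With one `routmux` per PE, any register can reach any PE in a single step, which is the fully connected feed-forward property the architecture relies on.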

3.3 Primitive instruction set of the processor

The primitive instruction set formulated for the processor contains two main categories. Category A covers the instructions that implement routing operations of the processor outside the PEs, and category B covers the arithmetic calculation and data movement operations inside a PE.

  1. Data Loading/Routing Operations Outside PE:

    1. Load Direct MIDREG i: To load data from outside memory in the MIDREG i from TP i [i = 1…8]

    2. Load Feedback MIDREG i: To load data in MIDREG i from feedback line FB i [i = 1…8]

    3. Rout PE i,MIDREG j = Routing data from any MIDREG j to any PE i. [i,j = 1…8]

    4. Out OUTREG i = For storing the value from OUTREG i to outside memory. [i = 1….8]

  2. Data Loading/Movement and Mathematical Operations Inside PE:

    1. Load [D0-D3][PE i] = to load data in any of the registers D0-D3 from outside memory of PE i.

    2. Load [D0-D3], [C0-C7], [PE i] = to load data in any of the registers D0-D3 from any of the constant registers C0-C7 of PE i.

    3. Add [PE i] = to add the data present in D0 and D1 and keep it in D4 of PE i.

    4. Mul [PE i] = to multiply the data present in D2 and D3 and keep it in D5 of PE i.

    5. Move [D0-D3], [D4-D5], [PE i] = data movement operation from any of the output registers of D4-D5 to any of the input registers of D0-D3 of PE i.

    6. Out [D4-D5], [PE i] = Write back data from any of the output registers D4-D5 of PE i to OUTREG i of PE i.

Table 2 below gives the number of instructions per PE and for the overall architecture, for each group as well as in total:

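| Group name | Total number of instructions per PE | Total number of instructions |
|---|---|---|
| A1 | N/A | 8 |
| A2 | N/A | 8 |
| A3 | N/A | 64 |
| A4 | N/A | 8 |
| B1 | 4 | 32 |
| B2 | 32 | 256 |
| B3 | 1 | 8 |
| B4 | 1 | 8 |
| B5 | 8 | 64 |
| B6 | 2 | 16 |
| Total | 48 | 472 |

Table 2.

Total number of instructions in each group.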
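The counts in Table 2 follow directly from the operand ranges of each instruction group. The short sketch below recomputes them; the constant and dictionary names are illustrative, while the register counts (8 PEs, D0-D3, D4-D5, C0-C7) come from the instruction set above.

```python
N_PE = 8      # processing elements, each with one MIDREG and one OUTREG
N_IN = 4      # input registers D0..D3 inside a PE
N_OUT = 2     # result registers D4..D5 inside a PE
N_CONST = 8   # constant registers C0..C7 inside a PE

# Category A: routing instructions outside the PEs (counted once).
outside = {"A1": N_PE, "A2": N_PE, "A3": N_PE * N_PE, "A4": N_PE}

# Category B: instructions inside one PE (counted per PE).
per_pe = {
    "B1": N_IN,            # Load Dk from outside
    "B2": N_IN * N_CONST,  # Load Dk from constant Cm
    "B3": 1,               # Add
    "B4": 1,               # Mul
    "B5": N_IN * N_OUT,    # Move D4/D5 -> D0..D3
    "B6": N_OUT,           # Out D4/D5
}

outside_total = sum(outside.values())
per_pe_total = sum(per_pe.values())
grand_total = outside_total + N_PE * per_pe_total
print(outside_total, per_pe_total, grand_total)
```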

We can see that the total number of instructions is 472, of which 48 are for each PE and 88 are for routing outside the PEs. The control signals of the different components of the processor and their functions are specified in Table 1; from these we can specify the sequence of control signals that must be activated to implement each instruction of the instruction set. Table 3 below shows one representative instruction for each group, together with the corresponding control signals and their sequence of activation. The table listing all the instructions can be found in the appendix.

| Category A | Sequence of control signals |
|---|---|
| Load direct MIDREG 1 | 1. TP1→Data 2. INMUX1→0 3. EN-MIDREG1→1 |
| Load Feedback MIDREG 1 | 1. OUTDEMUX1→0 2. INMUX1→1 3. EN-MIDREG1→1 |
| Rout MIDREG3,PE5 | 1. EN-MIDREG3→0 2. ROUTMUX5→011 |
| Out OUTREG6 | 1. EN-OUTREG6→0 2. OUTDEMUX6→1 |

| Category B | Sequence of control signals |
|---|---|
| Load D0,PE 1 | 1. input_PE1→data 2. PECL1_PE1→0 3. PECL2_PE1→00 4. PECL4_PE1→1 |
| Load D0,C5,PE 4 | 1. EN-C5_PE4→1 2. CMUX_PE4→101 3. PECL2_PE4→11 4. PECL4_PE4→1 |
| Add PE3 | 1. PEEN1_PE3→1 |
| Mul PE2 | 1. PEEN2_PE2→1 |
| Out D4,PE7 | 1. PECL6_PE7→1 2. PECL8_PE7→0 3. data_PE7→output |
| Move D1,D5,PE 6 | 1. PECL7_PE6→0 2. PECL2_PE6→10 3. PECL4_PE6→0 |

Table 3.

Sequence of operations for implementing C1*X + C2*Y.

3.4 Implementation of operations using the instruction set of the architecture

If we take the flow graph of the FDCT algorithm as an example, we can see that the algorithm is divided into 4 stages and that each stage contains 8 PEs executing operations of three types: floating point addition/subtraction, floating point multiplication, and evaluation of floating point expressions of the type C1*X + C2*Y. Table 4 below gives the stage-wise operation schedule of the 8 PEs (specifying what each PE does in these 4 stages):

| PE | Stage 1 | Stage 2 | Stage 3 | Stage 4 |
|---|---|---|---|---|
| P1 | Reg0 + Reg7 | Reg0 + Reg3 | Reg0 + Reg1 | ---- |
| P2 | Reg1 + Reg6 | Reg1 + Reg2 | Reg0 - Reg1 | ---- |
| P3 | Reg2 + Reg5 | Reg1 - Reg2 | √2 C3π/8*Reg2 + S3π/8*Reg3 | ---- |
| P4 | Reg3 + Reg4 | Reg0 - Reg3 | -S3π/8*Reg2 + √2C3π/8*Reg3 | ---- |
| P5 | Reg3 - Reg4 | C3π/16*Reg4 + S3π/16*Reg7 | Reg4 + Reg6 | Reg4 - Reg7 |
| P6 | Reg2 - Reg5 | Cπ/16*Reg5 + Sπ/16*Reg6 | Reg5 - Reg7 | √2*Reg5 |
| P7 | Reg1 - Reg6 | -Sπ/16*Reg5 + Cπ/16*Reg6 | Reg4 - Reg6 | √2*Reg6 |
| P8 | Reg0 - Reg7 | -S3π/16*Reg4 + C3π/16*Reg7 | Reg5 + Reg7 | Reg4 + Reg7 |

Table 4.

Stage-wise operation schedule of the 8 PEs performing the FDCT algorithm.
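To make the schedule concrete, the sketch below evaluates the Stage 1 column of Table 4 in software and shows the paired C1*X + C2*Y pattern used by P5/P8 and P6/P7 in Stage 2. It is only an illustration: the function names are hypothetical, and reading Cθ and Sθ as cos θ and sin θ is an assumption based on the usual notation of Loeffler-style flow graphs, not something the table states.

```python
import math


def fdct_stage1(reg):
    """Stage 1 of Table 4: four sums (P1..P4) followed by four differences (P5..P8)."""
    if len(reg) != 8:
        raise ValueError("expected 8 register values")
    sums = [reg[i] + reg[7 - i] for i in range(4)]        # Reg0+Reg7 ... Reg3+Reg4
    diffs = [reg[3 - i] - reg[4 + i] for i in range(4)]   # Reg3-Reg4 ... Reg0-Reg7
    return sums + diffs


def rotate(x, y, theta):
    """Pair of C1*X + C2*Y operations, assuming C = cos(theta), S = sin(theta)."""
    c, s = math.cos(theta), math.sin(theta)
    return c * x + s * y, -s * x + c * y


stage1 = fdct_stage1([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
print(stage1)

# Stage 2, P5 and P8 operate on Reg4 and Reg7 with the angle 3*pi/16.
p5, p8 = rotate(stage1[4], stage1[7], 3 * math.pi / 16)
print(p5, p8)
```

Each list element corresponds to one PE, so every element of a stage can be computed in parallel once the required registers have been routed to it.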

We list the instructions required to execute three representative cases: (a) the stage 1 operation of PE 5, (b) the stage 4 operation of PE 7 and (c) the stage 2 operation of PE 6. These three cases exhibit the three categories of floating point operations described previously (Table 5).

| Case | Time unit | Instruction | Description |
|---|---|---|---|
| (a) | 1 | Load direct MIDREG3 | Load data from the TP3 line to MIDREG 3. |
| (a) | 2 | Rout PE[5],MIDREG[3] | Load data from MIDREG3 to input line of PE5 |
| (a) | 3 | Load [D0],[PE 5] | Load input data to D0 from input line of PE 5 |
| (a) | 4 | Load direct MIDREG4 | Load data from the TP4 line to MIDREG 4. |
| (a) | 5 | Rout PE[5],MIDREG[4] | Load data from MIDREG4 to input line of PE5 |
| (a) | 6 | Load [D1],[PE 5] | Load input data to D1 from input line of PE 5 |
| (a) | 7 | Add [PE5] | Add the content of D0 and 2’s complement value of D1 and store the value in D4 of PE5 |
| (a) | 8 | Out D4 | Output data from D4 to OUTREG 5 of PE 5 |
| (a) | 9 | Load Feedback MIDREG [5] | Load the data from the OUTREG 5 of PE5 to FB5 and then to MIDREG 5. |
| (b) | 1 | Rout PE[7],MIDREG[6] | Load data from MIDREG6 to input line of PE7 |
| (b) | 2 | Load [D2],[PE 7] | Load input data to D2 from input line of PE 7 |
| (b) | 3 | Load [D3],[C7],[PE 7] | Load D3 with constant from the constant register C7 selected by CMUX |
| (b) | 4 | Mul [PE7] | Multiply the content of D2 and D3 and store the value in D5 of PE 7 |
| (b) | 5 | Out D5 [PE7] | Output data from D5 to OUTREG 7 of PE 7 |
| (b) | 6 | Load Feedback MIDREG [7] | Load the data from the OUTREG 7 of PE7 to FB7 and then to MIDREG 7. |
| (c) | 1 | Rout PE[6],MIDREG[5] | Load data from MIDREG5 to input line of PE6 |
| (c) | 2 | Load [D2], [PE 6] | Load input data to D2 from input line of PE 6 |
| (c) | 3 | Load [D3],[C5],[PE 6] | Load D3 with constant from the constant register C5 selected by CMUX |
| (c) | 4 | Mul [PE 6] | Multiply the content of D2 and D3 and store the value in D5 of PE 6 |
| (c) | 5 | Mov [D0],[D5],[PE 6] | Move the content from D5 to D0 of PE6 |
| (c) | 6 | Rout PE[6],MIDREG[6] | Load data from MIDREG6 to input line of PE6 |
| (c) | 7 | Load [D2], [PE 6] | Load input data to D2 from input line of PE 6 |
| (c) | 8 | Load [D3],[C6],[PE 6] | Load D3 with constant from the constant register C6 selected by CMUX |
| (c) | 9 | Mul [PE 6] | Multiply the content of D2 and D3 and store the value in D5 of PE 6 |
| (c) | 10 | Mov [D1],[D5],[PE 6] | Move the content from D5 to D1 of PE6 |
| (c) | 11 | Add [PE 6] | Add the content of D0 and D1 and store the value in D4 of PE6 |
| (c) | 12 | OUT [D4], [PE 6] | Output data from D4 to OUTREG 6 of PE 6 |
| (c) | 13 | Load Feedback Data MIDREG[6] | Load the data from the OUTREG 6 of PE6 to FB6 and then to MIDREG 6. |

Table 5.

List of instructions for (a) the stage 1 operation of PE 5, (b) the stage 4 operation of PE 7 and (c) the stage 2 operation of PE 6.
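The 13-step sequence of case (c) can be traced with a small behavioural model of one PE. This is a sketch under stated assumptions, not the VHDL design: the `PE` class, the `run` interpreter, the opcode strings and the operand values (2.0, 4.0, 0.5, 8.0) are all hypothetical and chosen only so that the result is easy to check by hand. Register roles (D0/D1 feed the adder, D2/D3 feed the multiplier, D4/D5 hold the results) follow Section 3.2.

```python
class PE:
    """Toy processing element with registers D0..D5 and constants C0..C7."""

    def __init__(self, constants):
        self.c = list(constants)   # constant registers C0..C7
        self.d = [0.0] * 6         # D0,D1: adder in; D2,D3: multiplier in; D4,D5: results
        self.inp = 0.0             # input line driven by the routing multiplexer
        self.outreg = 0.0          # OUTREG of this PE


def run(program, pe, midreg, pe_index):
    """Execute (opcode, *operands) tuples, one per time unit; return the step count."""
    for op, *args in program:
        if op == "rout":           # Rout PE[i],MIDREG[j]
            pe.inp = midreg[args[0]]
        elif op == "load":         # Load [Dk],[PE i]
            pe.d[args[0]] = pe.inp
        elif op == "loadc":        # Load [Dk],[Cm],[PE i]
            pe.d[args[0]] = pe.c[args[1]]
        elif op == "mul":          # Mul [PE i]: D5 = D2 * D3
            pe.d[5] = pe.d[2] * pe.d[3]
        elif op == "add":          # Add [PE i]: D4 = D0 + D1
            pe.d[4] = pe.d[0] + pe.d[1]
        elif op == "mov":          # Mov [Ddst],[Dsrc],[PE i]
            pe.d[args[0]] = pe.d[args[1]]
        elif op == "out":          # Out [Dk],[PE i]
            pe.outreg = pe.d[args[0]]
        elif op == "feedback":     # Load Feedback MIDREG[i]
            midreg[pe_index] = pe.outreg
        else:
            raise ValueError(f"unknown opcode: {op}")
    return len(program)


# Case (c) of Table 5: C5*MIDREG5 + C6*MIDREG6 on PE 6.
CASE_C = [
    ("rout", 5), ("load", 2), ("loadc", 3, 5), ("mul",), ("mov", 0, 5),
    ("rout", 6), ("load", 2), ("loadc", 3, 6), ("mul",), ("mov", 1, 5),
    ("add",), ("out", 4), ("feedback",),
]

midreg = {i: 0.0 for i in range(1, 9)}
midreg[5], midreg[6] = 2.0, 4.0                       # hypothetical operands
pe6 = PE([0.0, 0.0, 0.0, 0.0, 0.0, 0.5, 8.0, 0.0])    # hypothetical C5, C6
steps = run(CASE_C, pe6, midreg, pe_index=6)
print(steps, midreg[6])
```

Because the multiplier result must be moved from D5 to an adder input before the second product is formed, the expression needs two Mov steps, which accounts for much of the difference between the 6 time units of case (b) and the 13 of case (c).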

3.5 Implementation of the control unit of the processor

A hardwired implementation of the correct control signals, their values and their sequencing for all 472 instructions is physically very difficult. In this work we have developed only the instructions required to prove the correctness of the design, which are of three types: (1) instructions inside the PE to perform floating point addition and multiplication; (2) instructions to implement the longest sequence of the FDCT algorithm, C1*X + C2*Y, inside one PE for a single stage; and (3) the same implementation of stage 2 for all PEs, with the output values routed randomly, to prove the correctness of the implementation. The control unit is therefore only partially developed. A programming-based approach is required to develop a full-fledged assembler that generates all the instructions. The design of Trans_Proc presented in this paper is thus incomplete, but it shows that, once the CU is completed, the processor can be used as a hardware coprocessor implemented in FPGA to execute all the instructions required by the different transforms.

4. Simulation and synthesis

The first two simulations show the correct implementation of the floating point multiplier and the floating point adder/subtractor. While the floating point multiplier leaves considerable scope for improvement, the floating point adder/subtractor is fairly state of the art.

Here we see the longest sequence of multiplications and addition inside a single PE: Pein1 x Cein5 + Pein1 x CEin6 = 2.0 x 0.5 + 4.0 x 8.0 = 33.0.

Here we see the routing correctness of all the PEs of Trans_Proc, tested according to the following flow graph, shown in tabular format:

PE1 = 1, PE2 = 2, PE3 = 3, PE4 = 4, PE5 = 5, PE6 = 6, PE7 = 7, PE8 = 8.

C1 = 2, C2 = 8.

PE1 = PE1x C1 + PE8xC2 = 66.

PE2 = PE2x C1 + PE7xC2 = 58.

PE3 = PE3x C1 + PE6xC2 = 50.

PE4 = PE4x C1 + PE5xC2 = 42.

PE5 = PE4x C2 + PE5xC1 = 34.

PE6 = PE3x C2 + PE6xC1 = 26.

PE7 = PE2x C2 + PE7xC1 = 1.

PE8 = PE1x C2 + PE8xC1 = 10.

This is how the routing correctness among the different PEs of the processor is tested, and we can see that it works.

Once the behavioral simulation has been shown to be correct, we next present the results of the synthesis of the entire processor carried out with Xilinx Vivado and comment on them (Tables 6–9).

Utilization report (Summary)

| Resource | Count |
|---|---|
| No. of LUT | 10897 |
| No. of FF | 6928 |
| No. of IOB | 562 |

Table 6.

Summary of utilization report.

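Utilization report (Primitive blocks)

| Primitive name | Number | Functional category |
|---|---|---|
| LUT6 | 6096 | LUT |
| LUT5 | 920 | LUT |
| LUT4 | 2984 | LUT |
| LUT3 | 576 | LUT |
| LUT2 | 536 | LUT |
| LUT1 | 1257 | LUT |
| FDCE | 3616 | Flop & Latch |
| FDRE | 3312 | Flop & Latch |
| MUXF7 | 320 | MuxFx |
| CARRY4 | 168 | Carry Logic |
| IBUF | 489 | IO |
| OBUF | 73 | IO |
| BUFG | 1 | Clock |

Table 7.

Utilization report of primitive blocks.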
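The summary figures of Table 6 can be cross-checked against the primitive counts of Table 7. The sketch below does this for flip-flops and I/O blocks, which match exactly. The per-primitive LUT counts add up to more than the summary LUT figure; one plausible explanation, which the chapter does not confirm, is that the tool packs some pairs of small LUTs into a single physical LUT. The dictionary names are illustrative.

```python
summary = {"LUT": 10897, "FF": 6928, "IOB": 562}

primitives = {
    "LUT6": 6096, "LUT5": 920, "LUT4": 2984, "LUT3": 576, "LUT2": 536, "LUT1": 1257,
    "FDCE": 3616, "FDRE": 3312,
    "MUXF7": 320, "CARRY4": 168,
    "IBUF": 489, "OBUF": 73, "BUFG": 1,
}

ff_total = primitives["FDCE"] + primitives["FDRE"]    # flip-flop primitives
iob_total = primitives["IBUF"] + primitives["OBUF"]   # input and output buffers
lut_primitive_total = sum(n for name, n in primitives.items() if name.startswith("LUT"))

print(ff_total == summary["FF"], iob_total == summary["IOB"])
print(lut_primitive_total, summary["LUT"])
```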

Power report (Summary)

| Quantity | Value |
|---|---|
| Total On-Chip Power | 0.417 W |
| Device Dynamic Power | 0.335 W |
| Device Static Power | 0.082 W |

Table 8.

Power report summary.

Timing report (Summary)

| Quantity | Value |
|---|---|
| Max Setup Time | 3.419 ns |
| Worst Pulse Width Slack | 4.650 ns |
| Avg CP required for FP operations inside PE | 4 |
| Max Clock Frequency | 292 MHz |

Table 9.

Timing report summary.

The overall utilization report gives an idea of the size of the processor, and the number of primitive blocks used in the processor is also given. Note that this study does not include the utilization of the CU, as the CU is incomplete; it will be treated as a separate design in a future study. The total on-chip power, with its dynamic and static components, also suggests an implementable design. The timing report shows a maximum setup time of 3.419 ns and a worst pulse width slack (WPWS) of 4.650 ns. We calculated by hand that a floating point operation inside the PE takes a maximum of 4 clock pulses. This gives a maximum clock frequency of 292 MHz.
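The timing and power figures can be related by simple arithmetic. The sketch below assumes that the 292 MHz figure is the reciprocal of the 3.419 ns maximum setup time, which is an interpretation since the chapter does not spell out the formula, and derives the latency of a 4-clock floating point operation from it. The variable names are illustrative.

```python
T_SETUP_NS = 3.419          # max setup time from Table 9
CYCLES_PER_FP_OP = 4        # clock pulses per floating point operation inside a PE
DYNAMIC_W, STATIC_W = 0.335, 0.082

# A period in nanoseconds corresponds to a frequency of 1000 / period in MHz.
f_max_mhz = 1e3 / T_SETUP_NS
fp_latency_ns = CYCLES_PER_FP_OP * T_SETUP_NS
total_power_w = DYNAMIC_W + STATIC_W

print(f"f_max ~ {f_max_mhz:.1f} MHz")
print(f"FP operation latency ~ {fp_latency_ns:.3f} ns")
print(f"total on-chip power = {total_power_w:.3f} W")
```

Under this assumption, one floating point operation inside a PE would take roughly 13.7 ns at the maximum clock frequency.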

5. A discussion on the memory and instruction exchange between the main processor and Trans_Proc

Here we outline the data transfer procedure between the main processor and Trans_Proc, which will be implemented as future work. The process uses a linear image RAM (LIRAM) to store the primary data. Two data registers serve as buffers for data going into and coming out of Trans_Proc. One counter counts the number of blocks sent to Trans_Proc, and one address register is used to store each block of the transformed image back to the LIRAM.
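Since this transfer scheme is still future work, the following is only a speculative sketch of how the described components could interact. The function `stream_blocks`, its parameters, the block size and the doubling stand-in for the transform are all hypothetical and are not part of the chapter's design.

```python
def stream_blocks(liram, block_size, transform):
    """Stream fixed-size blocks from a linear image RAM through a transform."""
    if block_size <= 0 or len(liram) % block_size != 0:
        raise ValueError("LIRAM length must be a positive multiple of block_size")
    block_counter = 0   # counts the blocks sent to Trans_Proc
    address = 0         # address register used for the write-back
    while address < len(liram):
        in_buffer = liram[address:address + block_size]    # input data register
        out_buffer = transform(in_buffer)                  # stand-in for Trans_Proc
        if len(out_buffer) != block_size:
            raise ValueError("transform must preserve the block size")
        liram[address:address + block_size] = out_buffer   # store transformed block
        block_counter += 1
        address += block_size
    return block_counter


# Hypothetical 16-sample image memory processed in blocks of 8 samples.
liram = [float(i) for i in range(16)]
blocks = stream_blocks(liram, 8, lambda block: [2.0 * x for x in block])
print(blocks, liram[:4])
```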

References

  1. Po-Chih Tseng et al, “Reconfigurable discrete cosine transform processor for object-based video signal processing”, in ISCAS '04. Proceedings of the 2004 International Symposium on Circuits and Systems, 2004.
  2. Po-Chih Tseng, Chao-Tsung Huang, Liang-Gee Chen, “Reconfigurable Discrete Wavelet Transform Processor for Heterogeneous Reconfigurable Multimedia Systems”, Journal of VLSI signal processing systems for signal, image and video technology, 2005.
  3. Gregory W. Donohoe, “The Fast Fourier Transform on a Reconfigurable Processor”, Proc. NASA Earth Sciences Technology Conference, Pasadena, CA, June 11-13, 2002.
  4. Srivatsava P S V, Sarada V, “Reconfigurable MDC Architecture Based FFT Processor”, International Journal of Engineering Research & Technology, 2014.
  5. K. Joe Hass, David F. Cox, “Transform Processing on a Reconfigurable Data Path Processor”, 7th NASA Symposium on VLSI Design, 1998.
  6. V. Sarada, T. Vigneswaran, “Reconfigurable FFT Processor – A Broader Perspective Survey”, International Journal of Engineering and Technology (IJET), 2013.
  7. Asadollah Shahbahrami, Mahmood Ahmadi, Stephan Wong, Koen Bertels, “A New Approach to Implement Discrete Wavelet Transform using Collaboration of Reconfigurable Elements”, Proc. 2009 International Conference on Reconfigurable Computing and FPGAs.
  8. Konstantinos E. Manolopoulos, Konstantinos G. Nakos, Dionysios I. Reisis and Nikolaos G. Vlassopoulos, “Reconfigurable Fast Fourier Transform Architecture for Orthogonal Frequency Division Multiplexing Systems”, 2003, available: https://pdfs.semanticscholar.org/dd5c/263725af00e5dd4d42d573c269f57d917c8d.pdf?_ga=2.84059166.640751657.1573804365-914446569.1569299704
  9. Amitabha Sinha, Mitrava Sarkar, Soumojit Acharyya, Suranjan Chakraborty, “A Novel Reconfigurable Architecture of a DSP Processor for Efficient Mapping of DSP Functions using Field Programmable DSP Arrays”, ACM SIGARCH Computer Architecture News, Vol. 41, No. 2, May 2013.
  10. Sumit Wadekar, Laxman P. Thakare, Dr. A.Y. Deshmukh, “Reconfigurable N-Point FFT Processor Design For OFDM System”, International Journal of Engineering Research and General Science, Volume 3, Issue 2, March-April, 2015.
  11. Alexey Petrovsky, Maxim Rodionov and Alexander Petrovsky, “Dynamic Reconfigurable on the Lifting Steps Wavelet Packet Processor with Frame-Based Psychoacoustic Optimized Time-Frequency Tiling for Real-Time Audio Applications”, Design and Architectures for Digital Signal Processing, available: http://www.intechopen.com/books/design-and-architectures-fordigital-signal-processing2013.
  12. Sharon Thomas & V Sarada, “Design of Reconfigurable FFT Processor With Reduced Area And Power”, ITSI Transactions on Electrical and Electronics Engineering (ITSI-TEEE), 2013.
  13. Uma Rajaram, “Design Of Fir Filter For Adaptive Noise Cancellation Using Context Switching Reconfigurable EHW Architecture”, Ph.D dissertation, Anna University, Chennai, 2009, available: https://shodhganga.inflibnet.ac.in/handle/10603/27245
  14. P. S. Reddy, S. Mopuri and A. Acharyya, “A Reconfigurable High Speed Architecture Design for Discrete Hilbert Transform,” in IEEE Signal Processing Letters, vol. 21, no. 11, pp. 1413-1417, Nov. 2014, doi: 10.1109/LSP.2014.2333745
  15. Atri Sanyal, Swapan Kumar Samaddar, Amitabha Sinha, “A Generalized Architecture for Linear Transform”, Proc. IEEE International Conference on CNC 2010, Oct 04-05, 2010, Calicut, Kerala, India, IEEE Computer society, pp. 55-60, ISBN: 97-0-7695-4209-6.
  16. A. Sanyal, S. K. Samaddar, “A Combined Architecture for FDCT Algorithm,” Proc. 2012 Third International Conference on Computer and Communication Technology, Allahabad, 2012, pp. 33-37, doi: 10.1109/ICCCT.2012.16
  17. Atri Sanyal, Saloni Kumari, Amitabha Sinha, “An Improved Combined Architecture of the Four FDCT Algorithms”, International Journal of Research in Electronics and Computer Engineering (IJRECE), Vol 6 Issue 4, December 2018, ISSN: 2348-2281.
  18. Davide Rossi, Fabio Campi, Simone Spolzino, Stefano Pucillo, Roberto Guerrieri, “A Heterogeneous Digital Signal Processor for Dynamically Reconfigurable Computing”, IEEE Journal of Solid-State Circuits, Volume: 45, Issue: 8, Aug. 2010.
  19. Sohan Purohit, Sai Rahul Chalamalasetti, Martin Margala, Wim Vanderbauwhede, “Throughput/Resource-Efficient Reconfigurable Processor for Multimedia Applications”, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Volume: 21, Issue: 7, July 2013.
  20. Vikram, K.N., Vasudevan, V., “Mapping Data-Parallel Tasks Onto Partially Reconfigurable Hybrid Processor Architectures”, IEEE Transactions On Very Large Scale Integration (VLSI) Systems, Vol. 14, No. 9, September 2006.
  21. Atri Sanyal, Amitabha Sinha, “A Reconfigurable Architecture to Implement Linear Transforms of Image Processing Applications”, International Conference on Frontiers in Computing and System (COMSYS 2020), Jalpaiguri, West Bengal, India, January 13-15, 2020.
  22. B. Heyne, C. C. Sun, J. Goetze, S. J. Ruan, “A Computationally Efficient High-Quality Cordic Based DCT”, 14th European Signal Processing Conference (EUSIPCO 2006), Florence, Italy, September 4-8, 2006.
  23. N. Deo, “Graph Theory with applications to engineering and computer science”, PHI, 2007.
