A NEW HARDWARE ARCHITECTURE FOR HIGH-PERFORMANCE PARALLEL TURBO DECODER

SUJATHA ELUKURU1*, SUBHAS CHENNAPALLI2 AND GIRIPRASAD MAHENDRA NANJAPPA2

1Department of Electronics and Communications, Sree Vidyanikethan Engineering College, Andhra Pradesh, India.  
2Department of Electronics and Communications, JNTUA College of Engineering, Andhra Pradesh, India.  
*Corresponding author: sujathaece88@gmail.com  
(Received: 1st January 2022; Accepted: 26th March 2022; Published on-line: 4th July 2022)

ABSTRACT: Recent wireless communications demand maximum achievable data rates without intervention. The channel decoder in the physical layer must support such high data rates with a flexible hardware structure. The turbo channel decoder offers a flexible hardware architecture and reliable decoding, but its design is complex, and its hardware architecture consumes considerable power and area in a communication system. Hence, an optimized high-performance turbo decoder architecture with a simplified QPP interleaver is needed to support various data rates. In this context, this article presents a new hardware architecture with a three-stage pipelined parallel turbo decoding process, in which each MAP decoder of the parallel turbo decoder uses a three-stage micro-pipeline process. The proposed structure reduces the circuit complexity and improves the throughput through the parallel pipelined decoding process. This article also presents a simplified semi-recursive QPP interleaver, which avoids complex ‘mod’ operations, for a high-performance turbo decoder. The design was simulated using ModelSim and the Xilinx Vivado design suite, and its estimated performance was observed on various 28 nm CMOS technology FPGAs and compared with conventional designs. The proposed design showed a throughput improvement of up to 55.6% and a reduction in power consumption of up to 43% compared to recently reported architectures.

ABSTRAK: Komunikasi tanpa wayar terkini menuntut kadar data maksimum yang boleh dicapai tanpa intervensi. Penyahkod saluran dalam lapisan fizikal akan menyokong kadar data yang tinggi dengan struktur perkakasan fleksibel. Penyahkod saluran turbo menawarkan seni bina perkakasan fleksibel dan penyahkodan yang boleh dipercayai. Tetapi, penyahkod turbo merupakan blok yang kompleks, lebih berkua dan menggunakan kawasan yang luas dalam sistem komunikasi. Oleh itu, seni bina penyahkod turbo optimum berprestasi tinggi dengan antara lemar QPP yang mudah diperlukan bagi menyokong pelbagai kadar data. Dalam konteks ini, kajian ini merupakan seni bina perkakas baru dengan proses penyahkod turbo selari bersama salur paip tiga peringkat dan setiap penyahkod MAP yang dicadangkan dalam penyahkod turbo selari bersama proses saluran paip mikro tiga peringkat dibentangkan. Struktur yang dicadangkan dapat mengurangkan kerumitan litar dan meningkatkan daya pemprosesan melalui penyahkodan saluran paip selari. Selain itu, kajian ini merupakan antara lemar mudah QPP rekursif, yang dapat mengelakkan operasi ‘mod’ yang kompleks bagi penyahkod turbo berprestasi tinggi. Analisis prestasi telah dilakukan menggunakan sim Model, reka bentuk suit Xilinx Vivado, dan analisis prestasi anggaran telah diperhatikan pada pelbagai teknologi FPGA CMOS 28 nm dan dibandingkan dengan reka bentuk konvensional. Analisis reka bentuk
yang dicadangkan menunjukkan peningkatan sepanjang 55.6% dan pengurangan penggunaan kuasa sehingga 43% berbanding seni bina laporan terkini.

**KEYWORDS:** turbo decoder; MAP decoder; VLSI; interleaver; FPGA

1. INTRODUCTION

Channel coding techniques are essential for a wireless communication system to achieve reliable, high-performance transmission between transmitter and receiver over a noisy channel. State-of-the-art iterative channel codes such as turbo codes [1], low-density parity-check (LDPC) codes [2], and polar codes [3] are often used. Turbo codes offer a more flexible encoder and decoder architecture than LDPC and polar codes. Turbo codes also achieve high diversity, reliable data transmission, and potentially large coding gain in fading channels.

The efficient hardware implementation of turbo codes under real-time constraints is an active area of research, and there is a need for innovation in the VLSI design of high-performance turbo decoders in terms of throughput, silicon area, and power efficiency. Hence, the present study aims to develop a high-throughput, low-area, and low-power turbo decoder by modifying the hardware architecture of the decoder, simplifying the mathematical computations involved in the decoding and interleaving processes, and applying optimization techniques. The maximum a-posteriori probability (MAP) algorithm introduced by Bahl, Cocke, Jelinek, and Raviv (BCJR) [4] for SISO decoders and its simplifications, Log-MAP and Max-Log-MAP [5], were studied. The Max-Log-MAP algorithm is adopted in the design and hardware implementation of the proposed turbo decoder because its complexity is lower than that of the Log-MAP algorithm.

To improve the throughput of the turbo decoder, the number of MAP decoders could be increased, with all of them operating in parallel, at the cost of degraded error-correcting performance, especially at higher code rates. Moreover, employing multiple decoders to increase the throughput does not solve the additional challenge of lower latency requirements. The throughput could also be increased by increasing the block size (from 40 to 6144), but this would increase the computational latency, area requirement, and power consumption. The trade-off among the performance parameters can best be balanced by effective hardware design and suitable optimization techniques [6]. The interleaver is an essential part of the turbo decoder and also influences the BER performance of decoding. The algebraic and contention-free properties of the QPP interleaver [7] guarantee contention-free access to memory for the generated addresses.

Also, one of the problems in the implementation of highly parallel decoders is memory contention during decoding, where all the sub-blocks of the MAP decoders simultaneously try to access the same memory bank when reading or writing the extrinsic information. To solve this problem, an efficient approach based on collision-free parallel interleavers is used, in which data is read or written within a sub-block as well as across sub-blocks of the MAP decoder, achieving a low-complexity architecture with no additional hardware resources.

Some benchmark research works on high-throughput turbo decoders are discussed here. A high-throughput turbo decoder with 8 and 64 parallel radix-2 MAP decoder architectures in 90 nm CMOS technology was proposed in [8]. That work proposed a new ungrouped backward recursion scheme and a new state metric normalization technique to enable retiming and pipelining in the architecture for performance improvement. It also adopted a fine-grain clock gating technique to address the power issue, and the throughput achieved was 301 Mbps at 272 mW of power. A highly parallel turbo decoder structure was reported in 2015 [9], achieving a throughput of 1.45 Gbps in 90 nm CMOS technology. That work aimed at improving the decoding efficiency by modifying the parallel window MAP decoding algorithm.

A fully parallel turbo decoding (FPTD) algorithm was reported in [10], which allows parallel processing to offer higher processing throughput. This FPTD algorithm reduced computational complexity by 50% and enhanced its suitability for FPGA implementations. It was concluded that the fully parallel turbo decoder with radix-2 and 6144 parallel MAP decoders achieved 14.8 Gbps, but this design consumed 9618 mW of power at a 100 MHz clock frequency. Various VLSI architectures were presented in [11] for the computing blocks of the turbo decoder, enabling the SISO decoder to support radix-2/4/8 modes. The design achieved throughput in the range of 80 Mbps to 270 Mbps and reduced power consumption by up to 61% compared to other state-of-the-art designs. A parallel turbo decoder with a reverse address generator in the interleaver, targeting low latency and high throughput with a double buffer technique, was proposed in [12] for effective utilization of FPGA resources in broadcasting systems. That work achieved a throughput of 2.12 Gbps at 250 MHz and a latency of 23.2 µs with 64 parallel MAP decoders.

A memory-reduced turbo decoder based on a reverse recalculation technique using the Log-MAP algorithm, with a focus on power reduction, was proposed in [13]. It was reported that the technique helped to reduce memory and power consumption compared to other conventional turbo decoder designs. The Vedic multiplier-based implementation presented in [14] could be preferred for branch metric calculations in Max-Log-MAP algorithms for low-latency turbo applications, but the implementation consumes more area. An optimized turbo decoder was proposed in [15], in which performance was improved through parallel computation of state metrics, memory reuse, and a single SISO decoder in the hardware implementation. A low-memory turbo decoder with reverse calculation techniques was reported in [16], where the trellis diagram was partitioned and the max* operator was simplified. The findings revealed that the architecture achieved a 65% reduction in state metric cache (SMC) capacity compared with other designs, as well as lower power dissipation. Motivated by these works, the present study focuses on developing a new hardware architecture for a parallel turbo decoder to achieve high performance and a balanced hardware implementation using optimization techniques.

2. TURBO DECODER DESIGN PERSPECTIVE

The general structure of a turbo decoder consists of two SISO decoders connected through an interleaver and de-interleaver to perform the iterative processing of soft bits and provide a-posteriori LLRs after the required number of iterations. The soft-demodulated values of transmitted bits are referred to as a-priori probability values and are fed to the constituent SISO decoders as input LLRs, as shown in Fig. 1 [17]. Each decoder operates on the systematic and parity bits associated with its constituent encoder and produces soft outputs of the original data bits in the form of a-posteriori probabilities. The extrinsic information is computed using a-posteriori probability values from the SISO decoder, interleaved/non-interleaved a-priori probability values, and interleaved/de-interleaved extrinsic information from the other SISO decoder. Such extrinsic information values are shuffled between the two SISO decoders and are iteratively processed along with a-priori probability values to produce error-free a-posteriori probabilities of the transmitted bits.

In the iterative process, the MAP algorithm decodes the probabilities for each bit correctly. The complexity of the MAP algorithm has been reduced by operating it in log-domain variants such as the Log-MAP and Max-Log-MAP algorithms. To realize a high-performance turbo decoder, the SISO decoders involved should provide high-speed data transmission without significant coding loss. The major tasks of the SISO decoder are the computation of branch metrics, state metrics, and LLRs to extract the final extrinsic information. However, the two SISO decoders do not work simultaneously in each half iteration to compute the state metrics. Hence, the present study uses a turbo decoder with a single SISO decoder for one complete iteration, as shown in Fig. 2.

The main objective of the present study is to design an efficient parallel turbo decoder that can support higher throughputs using streaming techniques. The QPP interleaver plays a vital role in turbo encoder/decoder error correction. The hardware design of the QPP interleaver involves complex mathematical functions and a dependency on previous computations. A semi-recursive QPP interleaver is proposed in the present study to overcome these disadvantages.

3. QPP INTERLEAVER

The QPP interleaver is an integral part of the turbo encoder and decoder, and it plays a critical role in turbo codes, especially in the turbo decoder, for achieving high-speed decoding. For each of the 188 block lengths, a different set of $f_1, f_2$ parameters is pre-defined in 3GPP LTE [17]. In the recent 3GPP LTE/LTE-A, the QPP interleaver is based on algebraic and contention-free properties, providing contention-free memory access for any specified code block size between 40 and 6144. The efficient design of a conflict-free reconfigurable QPP interleaver for the turbo encoder and turbo decoder is a pre-eminent task in a turbo channel coding scheme. The hardware implementation of the QPP interleaver/de-interleaver should support parallel interleaving for a high-performance parallel decoder. This research work proposes a reconfigurable semi-recursive QPP interleaver for the parallel and direct computation of the address locations of all the bits for the turbo decoder, using the semi-recursive computation approach explained below.

3.1 Semi Recursive QPP Interleaver

The mathematical complexity and the dependency of the current address location on the previous address location are addressed by the semi-recursive computation method. The address location of an interleaved bit, that is, the interleaving sequence value \( \pi(i) \) for the current symbol \( i \) in the QPP interleaver, is computed as

\[
\pi(i) = (f_1 i + f_2 i^2) \mod K \tag{1}
\]

In Eq. 1 [17], the parameters \( f_1 \) and \( f_2 \) depend on the block size \( K \), and all permitted values of \( K \) with their corresponding \( f_1 \) and \( f_2 \) are pre-defined. In a hardware implementation of Eq. 1, the address computation for the current index \( i \) depends recursively on previous computations; this recursive dependency creates high decoding latency and is not preferable for high-performance turbo decoders. The proposed design does not contain a mod operation, because the mod operator is complex to implement in the hardware design of the QPP interleaver; it is replaced by an Add-Compare-Select (ACS) unit. The ACS unit is composed only of arithmetic operators such as addition and subtraction. Replacing the mod operation by the ACS unit is called the modulo normalization technique.
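The idea of replacing `mod` with add, compare, and subtract steps can be sketched in Python. This is a minimal illustration of a generic recursive form of Eq. 1, not the authors' ACS circuit: it assumes \( 0 \le f_1, f_2 < K \), and the function names `cond_sub` and `qpp_addresses_recursive` are hypothetical. The values \( f_1 = 3 \), \( f_2 = 10 \) for \( K = 40 \) are those consistent with Table 1.

```python
def cond_sub(x, k):
    """Add-compare-select style reduction; valid only when 0 <= x < 2*k."""
    return x - k if x >= k else x


def qpp_addresses_recursive(k, f1, f2):
    """Generate pi(0), ..., pi(k-1) of Eq. 1 with additions, compares and subtractions only.

    Uses pi(i+1) = pi(i) + g(i) and g(i+1) = g(i) + 2*f2 (all modulo k),
    with g(0) = f1 + f2. Assumes 0 <= f1, f2 < k (an assumption of this sketch).
    """
    if not (0 <= f1 < k and 0 <= f2 < k):
        raise ValueError("this sketch assumes 0 <= f1, f2 < k")
    g = cond_sub(f1 + f2, k)      # first-order difference g(0)
    step = cond_sub(f2 + f2, k)   # constant second-order difference 2*f2
    pi = 0
    addresses = []
    for _ in range(k):
        addresses.append(pi)
        pi = cond_sub(pi + g, k)  # both operands are below k, so one subtraction suffices
        g = cond_sub(g + step, k)
    return addresses


if __name__ == "__main__":
    print(qpp_addresses_recursive(40, 3, 10)[:5])
```

Because every running value stays below \( K \), each sum is below \( 2K \) and a single conditional subtraction is enough to bring it back into range, which is what makes a comparator-plus-subtractor sufficient in this sketch.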

To simplify the complex interleaver computation and to avoid large storage requirements, the proposed semi recursive computation approach for parallel interleaver supports the independent parallel computation of interleaved addresses. The input sequence (Num) is denoted as Metric Weight (MW) and it is represented as MW (1, K+1) in the first column. The subsequent columns are defined as mentioned below.

**Case 1:** If mod(Num, 2) \( \neq 0 \) (Num is odd), Num is rounded up to Num + 1, so that

\[
\begin{aligned}
\text{MW}(2, K+1) &= (\text{Num}+1)/2; \\
\text{MW}(3, K+1) &= \text{MW}(2, K+1) - 1;
\end{aligned}
\]

**Case 2:** If mod(Num, 2) = 0 (Num is even), Num is left unchanged and

\[
\begin{aligned}
\text{MW}(2, K+1) &= \text{Num}/2; \\
\text{MW}(3, K+1) &= \text{Num} - \text{MW}(2, K+1);
\end{aligned}
\]

Then, Value (V) is defined as,

\[
V = \delta(0) \times \text{MW} (2, K+1) + \delta(1) \times \text{MW} (3, K+1),
\]

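where \( \delta(0) = f_1 + f_2 \) and \( \delta(1) = \delta(0) + 2f_2 \). The interleaved address is then obtained by reducing \( V \) modulo \( K \), as in the last column of Table 1.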
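The two cases and the definition of \( V \) can be written out as a short Python sketch. The function names are hypothetical, and \( f_1 = 3 \), \( f_2 = 10 \) are the values consistent with Table 1 for \( K = 40 \). The final `%` is used here only to check the result against Eq. 1. The check covers \( K = 40 \) only and makes no claim about other block sizes.

```python
def metric_weights(num):
    """Return (MW2, MW3) for index num, following Case 1 (odd) and Case 2 (even)."""
    if num % 2 != 0:           # Case 1: odd index (in hardware, a test of the LSB)
        mw2 = (num + 1) // 2
        mw3 = mw2 - 1
    else:                      # Case 2: even index
        mw2 = num // 2
        mw3 = num - mw2
    return mw2, mw3


def value(num, f1, f2):
    """Compute V = delta(0) * MW2 + delta(1) * MW3 for one index."""
    delta0 = f1 + f2
    delta1 = delta0 + 2 * f2
    mw2, mw3 = metric_weights(num)
    return delta0 * mw2 + delta1 * mw3


if __name__ == "__main__":
    K, F1, F2 = 40, 3, 10      # parameters consistent with Table 1
    for num in range(5):
        mw2, mw3 = metric_weights(num)
        v = value(num, F1, F2)
        print(num, mw2, mw3, v, v % K)
```

Each index is computed from `num` alone, without reference to the previous address, which is what allows several indices to be evaluated in the same clock cycle.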

It can be observed from Tables 1 and 2 that the address locations of 40 bits are computed independently within 5 clock cycles. This approach is proposed to minimize the computational complexity and avoid the storage of interleaver tables.

<table>
<thead>
<tr>
<th>MW (1, K+1), Num</th>
<th>MW (2, K+1)</th>
<th>MW (3, K+1)</th>
<th>Value (V)</th>
<th>\( \pi(i) = \text{mod}(V, 40) \)</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>1</td>
<td>1</td>
<td>0</td>
<td>13</td>
<td>13</td>
</tr>
<tr>
<td>2</td>
<td>1</td>
<td>1</td>
<td>46</td>
<td>6</td>
</tr>
<tr>
<td>3</td>
<td>2</td>
<td>1</td>
<td>59</td>
<td>19</td>
</tr>
<tr>
<td>4</td>
<td>2</td>
<td>2</td>
<td>92</td>
<td>12</td>
</tr>
</tbody>
</table>

Table 2: Proposed parallel computation of sub-blocks

<table>
<thead>
<tr>
<th>CLOCK</th>
<th>Sub-block1</th>
<th>Sub-block2</th>
<th>Sub-block3</th>
<th>Sub-block4</th>
<th>Sub-block5</th>
<th>Sub-block6</th>
<th>Sub-block7</th>
<th>Sub-block8</th>
</tr>
</thead>
<tbody>
<tr>
<td>Clock1</td>
<td>0</td>
<td>5</td>
<td>10</td>
<td>15</td>
<td>20</td>
<td>25</td>
<td>30</td>
<td>35</td>
</tr>
<tr>
<td>Clock2</td>
<td>1</td>
<td>6</td>
<td>11</td>
<td>16</td>
<td>21</td>
<td>26</td>
<td>31</td>
<td>36</td>
</tr>
<tr>
<td>Clock3</td>
<td>2</td>
<td>7</td>
<td>12</td>
<td>17</td>
<td>22</td>
<td>27</td>
<td>32</td>
<td>37</td>
</tr>
<tr>
<td>Clock4</td>
<td>3</td>
<td>8</td>
<td>13</td>
<td>18</td>
<td>23</td>
<td>28</td>
<td>33</td>
<td>38</td>
</tr>
<tr>
<td>Clock5</td>
<td>4</td>
<td>9</td>
<td>14</td>
<td>19</td>
<td>24</td>
<td>29</td>
<td>34</td>
<td>39</td>
</tr>
</tbody>
</table>

From Tables 1 and 2, it can be observed that the parallel computation of 40 bits has been done with 8 parallel operations. In the first clock cycle, bits 0, 5, 10, 15, 20, 25, 30 and 35 will be computed simultaneously. Similarly in the second, third, fourth, and fifth clock cycles, the parallel computation of the remaining bits is performed in the order shown in Table 2. The proposed method is most suitable for highly parallel turbo decoding architectures. The proposed design and FPGA implementation of a new hardware architecture for a high-performance turbo decoder using streaming techniques is presented below.
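The schedule in Table 2 can be reproduced with a few lines of Python. The sketch below also checks that, in each clock cycle, the eight interleaved addresses fall into eight different memory banks. The bank mapping (address divided by the sub-block length) is an assumption made for illustration, the function names are hypothetical, and \( f_1 = 3 \), \( f_2 = 10 \) are the values consistent with Table 1.

```python
def parallel_schedule(k, p):
    """Rows are clock cycles; columns are the indices handled by each sub-block."""
    if k % p != 0:
        raise ValueError("this sketch assumes K is divisible by the parallelism P")
    sub_len = k // p  # bits per sub-block, also the number of clock cycles
    return [[sb * sub_len + clk for sb in range(p)] for clk in range(sub_len)]


def qpp(i, k, f1, f2):
    """Direct evaluation of Eq. 1, used here only as a reference."""
    return (f1 * i + f2 * i * i) % k


def banks_per_clock(k, p, f1, f2):
    """Memory bank hit by each sub-block per clock (bank = address // sub_len, assumed)."""
    sub_len = k // p
    return [[qpp(i, k, f1, f2) // sub_len for i in row]
            for row in parallel_schedule(k, p)]


if __name__ == "__main__":
    for clk, row in enumerate(parallel_schedule(40, 8), start=1):
        print(f"Clock{clk}:", row)
    for row in banks_per_clock(40, 8, 3, 10):
        assert len(set(row)) == 8  # no two sub-blocks share a bank in this cycle
```

For \( K = 40 \) and 8 sub-blocks, every clock cycle touches each of the 8 assumed banks exactly once, which illustrates the contention-free behaviour the text relies on.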

4. PARALLEL TURBO DECODER

The parallel decoding approach of a turbo decoder with P parallel MAP decoders increases the decoding throughput by roughly a factor of ‘P’ compared to non-parallel turbo decoders. Modern parallel hardware architectures can use either spatial or functional parallelization to improve throughput. For a high-performance turbo decoder, this article proposes a new hardware architecture based on an 8-parallel MAP decoder structure. The proposed architecture is designed as a three-stage pipelined process.

In the first stage, the input LLRs are loaded in parallel into three buffers, namely the systematic buffer, the parity-1 buffer, and the parity-2 buffer. Here, the input LLRs could correspond to any of the 188 block sizes varying from 40 to 6144. In the second stage, the data of eight coded words are processed in parallel by the eight BCJR decoders, as shown in Fig. 3.

In the second stage, each BCJR decoder is further divided into three micro-pipeline stages. Two SISO decoders, named SISO-1 and SISO-2, are used: the first processes the systematic input, parity-1, and a-priori data, while the second processes the interleaved systematic input, parity-2, and interleaved a-priori data. The micro-pipeline stages are described below.

![Fig. 3: Block diagram of proposed three-stage pipeline parallel turbo decoder.](image)

In the first micro-pipeline stage, all the SISO decoders operate in parallel on the given two inputs and produce extrinsic information as output. In the second micro-pipeline stage, this output is passed to the interleaver/de-interleaver block. Finally, in the third micro-pipeline stage, the de-interleaved a-priori data is fed as the third input to the SISO decoder blocks to process the extrinsic information. This three-stage micro-pipeline process continues for 8 iterations, as depicted in Fig. 4. The third pipeline stage of the 8-parallel turbo decoder then continues until maximum convergence is achieved, and the output LLRs are written into the output buffer.

Fig. 4: Block diagram of the three-stage micro-pipeline turbo decoder.

4.1 Simplified Computation of Soft-output

The soft output $L$ can be computed from the state metrics and branch metrics as the difference of two maxima, as shown in Eq. 2 [18],

$$\begin{aligned} L = {} & \max \big( \alpha'_0 + \beta_0 + \gamma_{00},\ \alpha'_1 + \beta_4 + \gamma_{00},\ \alpha'_6 + \beta_7 + \gamma_{00},\ \alpha'_7 + \beta_3 + \gamma_{00}, \\ & \qquad\ \ \alpha'_2 + \beta_5 + \gamma_{01},\ \alpha'_3 + \beta_1 + \gamma_{01},\ \alpha'_4 + \beta_2 + \gamma_{01},\ \alpha'_5 + \beta_6 + \gamma_{01} \big) \\ & - \max \big( \alpha'_0 + \beta_4 + \gamma_{11},\ \alpha'_1 + \beta_0 + \gamma_{11},\ \alpha'_6 + \beta_3 + \gamma_{11},\ \alpha'_7 + \beta_7 + \gamma_{11}, \\ & \qquad\ \ \alpha'_2 + \beta_1 + \gamma_{10},\ \alpha'_3 + \beta_5 + \gamma_{10},\ \alpha'_4 + \beta_6 + \gamma_{10},\ \alpha'_5 + \beta_2 + \gamma_{10} \big) \end{aligned} \tag{2}$$

where $\alpha'_0$ to $\alpha'_7$ denote the forward state metrics, $\beta_0$ to $\beta_7$ denote the backward state metrics of the 8 states, and $\gamma_{00}$ to $\gamma_{11}$ denote the branch metrics.

Equation 2 is further simplified as Eq. 3 in our proposed simplification for computing soft output ($L$) with common $\gamma_{00}$ to $\gamma_{11}$,

$$L = \max \left( \max(s_0, s_1) + \gamma_{00},\ \max(s_2, s_3) + \gamma_{01} \right) - \max \left( \max(t_0, t_1) + \gamma_{11},\ \max(t_2, t_3) + \gamma_{10} \right) \tag{3}$$

where,

$$s_0 = \max \left( \alpha'_0 + \beta_0, \alpha'_1 + \beta_4 \right)$$

$$s_1 = \max \left( \alpha'_6 + \beta_7, \alpha'_7 + \beta_3 \right)$$

$$s_2 = \max \left( \alpha'_2 + \beta_5, \alpha'_3 + \beta_1 \right)$$

$$s_3 = \max \left( \alpha'_4 + \beta_2, \alpha'_5 + \beta_6 \right)$$

$$t_0 = \max \left( \alpha'_0 + \beta_4, \alpha'_1 + \beta_0 \right)$$

$$t_1 = \max \left( \alpha'_6 + \beta_3, \alpha'_7 + \beta_7 \right)$$

$$t_2 = \max \left( \alpha'_2 + \beta_1, \alpha'_3 + \beta_5 \right)$$

$$t_3 = \max \left( \alpha'_4 + \beta_6, \alpha'_5 + \beta_2 \right)$$
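The effect of sharing the branch metrics can be checked numerically. The Python sketch below compares the shared form built from \( s_0 \) to \( s_3 \) and \( t_0 \) to \( t_3 \) with a direct maximum over the eight \( \alpha' + \beta + \gamma \) sums formed from the same index pairs. The function names and the dictionary keys for \( \gamma \) are hypothetical, and integer metrics are used so the comparison is exact.

```python
import random


def soft_output_direct(a, b, g):
    """Maximum over all eight alpha' + beta + gamma sums for each hypothesis."""
    first = max(
        a[0] + b[0] + g["00"], a[1] + b[4] + g["00"],
        a[6] + b[7] + g["00"], a[7] + b[3] + g["00"],
        a[2] + b[5] + g["01"], a[3] + b[1] + g["01"],
        a[4] + b[2] + g["01"], a[5] + b[6] + g["01"],
    )
    second = max(
        a[0] + b[4] + g["11"], a[1] + b[0] + g["11"],
        a[6] + b[3] + g["11"], a[7] + b[7] + g["11"],
        a[2] + b[1] + g["10"], a[3] + b[5] + g["10"],
        a[4] + b[6] + g["10"], a[5] + b[2] + g["10"],
    )
    return first - second


def soft_output_shared(a, b, g):
    """Shared-gamma form: add each branch metric once, after the inner maxima."""
    s0 = max(a[0] + b[0], a[1] + b[4])
    s1 = max(a[6] + b[7], a[7] + b[3])
    s2 = max(a[2] + b[5], a[3] + b[1])
    s3 = max(a[4] + b[2], a[5] + b[6])
    t0 = max(a[0] + b[4], a[1] + b[0])
    t1 = max(a[6] + b[3], a[7] + b[7])
    t2 = max(a[2] + b[1], a[3] + b[5])
    t3 = max(a[4] + b[6], a[5] + b[2])
    return (max(max(s0, s1) + g["00"], max(s2, s3) + g["01"])
            - max(max(t0, t1) + g["11"], max(t2, t3) + g["10"]))


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(1000):
        alpha = [rng.randint(-64, 63) for _ in range(8)]
        beta = [rng.randint(-64, 63) for _ in range(8)]
        gamma = {key: rng.randint(-32, 31) for key in ("00", "01", "10", "11")}
        assert soft_output_direct(alpha, beta, gamma) == soft_output_shared(alpha, beta, gamma)
```

Counting operations in this sketch, the direct form uses 32 additions (two per sum) while the shared form uses 20 (sixteen for \( \alpha' + \beta \) and four for the branch metrics), since adding a common \( \gamma \) commutes with taking the maximum.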

The extrinsic information \( \lambda_{\text{out}}(k) \) can be calculated as in Eq. 4 [18], with the aid of \( L(k) \), \( x(k) \) and \( \lambda_{\text{in}}(k) \), as

\[ \lambda_{\text{out}}(k) = \frac{1}{2} L(k) - x(k) - \lambda_{\text{in}}(k) \tag{4} \]

where \( L(k) \) denotes the soft output, \( x(k) \) is the received soft systematic information, and \( \lambda_{\text{in}}(k) \) is the a-priori information.

4.2 Performance Analysis

The performance of a channel decoder can be analyzed in terms of its decoding delay (latency) and the throughput obtained. The performance of a digital hardware system, however, is measured by three parameters: power, area, and throughput. This analysis can be done once the proposed architecture is synthesized with a hardware design tool such as Xilinx ISE/Vivado.

For the proposed turbo decoder, the decoding delay \( D \) (in clock cycles) is calculated using Eq. 5 and 6 [17] for block sizes below 264 and from 264 to 6144, respectively:

\[ D = 26 + \left( 2f(K, N) + 14 \right) \cdot 2I, \quad \text{if } K < 264 \tag{5} \]

\[ D = 26 + \left( f(K, N) + 46 \right) \cdot 2I, \quad \text{if } K \geq 264 \tag{6} \]

where \( K \) denotes the block size, \( N \) denotes the number of decoders, \( I \) denotes the number of iterations, and

\[ f(K, N) = \begin{cases} \frac{K}{N} & \text{if } K \text{ is divisible by } N \\ \frac{K}{6} & \text{if } K \text{ is not divisible by } N \end{cases} \]

The decoding latency \( (L) \) is calculated as

\[ L = \frac{D}{f_{\text{max}}} \text{ sec} \tag{7} \]

The throughput \( (T) \) is calculated as

\[ T = \frac{K \cdot f_{\text{max}}}{D} \text{ bps} \tag{8} \]

where \( f_{\text{max}} \) denotes the maximum operating frequency, which affects both latency and throughput as in Eq. 7 and 8 [17].

For instance, if the operating frequency of this hardware is about 250 MHz, then the throughput for the block size of 40 bits is 24.38 Mbps and for block size of 6144 bits is 117.7 Mbps.
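The figures quoted above can be reproduced with a small Python sketch of the delay, latency, and throughput formulas. It assumes \( N = 8 \) decoders and \( I = 8 \) iterations, the values used for the proposed design, and it covers only the case where \( K \) is divisible by \( N \). The function names are hypothetical.

```python
def decoding_delay(k, n, iterations):
    """Decoding delay D in clock cycles for block size k, n decoders, and the iteration count."""
    if k % n != 0:
        raise ValueError("this sketch covers only block sizes divisible by N")
    f = k // n
    if k < 264:
        return 26 + (2 * f + 14) * 2 * iterations
    return 26 + (f + 46) * 2 * iterations


def latency_seconds(k, n, iterations, f_max_hz):
    """Latency L = D / f_max."""
    return decoding_delay(k, n, iterations) / f_max_hz


def throughput_bps(k, n, iterations, f_max_hz):
    """Throughput T = K * f_max / D."""
    return k * f_max_hz / decoding_delay(k, n, iterations)


if __name__ == "__main__":
    N, I = 8, 8  # assumed: 8 parallel decoders, 8 iterations
    for k in (40, 6144):
        t = throughput_bps(k, N, I, 250e6) / 1e6
        print(f"K={k}: D={decoding_delay(k, N, I)} cycles, T={t:.2f} Mbps at 250 MHz")
```

Under these assumptions the delay is 410 cycles for \( K = 40 \) and 13050 cycles for \( K = 6144 \), which gives approximately 24.4 Mbps and 117.7 Mbps at 250 MHz.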

5. RESULTS AND DISCUSSION

To obtain higher throughput and lower latency, the most commonly adopted design methodology is to increase the level of parallelism. A new architecture consisting of an 8-parallel decoder structure has been proposed in the present study for a high-performance turbo decoder. The proposed hardware architecture is organized into three-stage pipeline and three-stage micro-pipeline procedures for high performance. The high-level block diagram is shown in Fig. 5, and the simulation waveform and performance analysis of the proposed design are discussed below.

The three-stage pipeline and three-stage micro-pipeline procedures in the proposed parallel turbo decoder increase the speed of data processing in the whole structure, improving the throughput and reducing the latency. The proposed architecture has been designed and simulated in MATLAB and ModelSim for functional verification, and the simulation waveform is shown in Fig. 6. From the simulation diagram, the throughput latency, or the time taken to produce the first output for the given input, is 1.9 ns.

![Fig. 5: High level block diagram of proposed turbo decoder.](image)

![Fig. 6: Simulation waveform of proposed parallel turbo decoder.](image)

The RTL schematic of the proposed architecture, shown in Fig. 7, is then examined in detail for the hardware components utilized. The submodules of the proposed parallel decoder, such as the branch metric, parallel state metric, and LLR computations, are also run to find the maximum value of the computed a-posteriori LLRs and decide whether the decoded bit is “0” or “1”.

![Fig. 7: RTL schematic of proposed parallel turbo decoder using Xilinx VIVADO.](image)

The architecture is implemented in Xilinx Vivado for the 28 nm CMOS technology Kintex-7, Virtex-7, and Zynq-7000 Zed FPGA evaluation boards for performance analysis. The hardware utilization is summarized in Table 3. It can be observed from Table 3 that the proposed design with VLSI optimization techniques occupies far fewer logic cells and memory cells than the standard design. Because ACS units are used for metric computation instead of many arithmetic/logical units, the hardware resource utilization is reduced. It is also evident that hardware utilization is lower post-implementation than post-synthesis.

<table>
<thead>
<tr>
<th>Hardware Resource</th>
<th>Post-Synthesis Utilization (%)</th>
<th>Post-Implementation Utilization (%)</th>
<th>Available</th>
</tr>
</thead>
<tbody>
<tr>
<td>FF</td>
<td>809 (0.76%)</td>
<td>809 (0.76%)</td>
<td>106400</td>
</tr>
<tr>
<td>LUT</td>
<td>1072 (2.02%)</td>
<td>1059 (1.99%)</td>
<td>53200</td>
</tr>
<tr>
<td>I/O</td>
<td>38 (19%)</td>
<td>34 (17%)</td>
<td>200</td>
</tr>
<tr>
<td>BRAM</td>
<td>32 (22.86%)</td>
<td>32 (22.86%)</td>
<td>140</td>
</tr>
<tr>
<td>BUFG</td>
<td>1 (3.12%)</td>
<td>1 (3.12%)</td>
<td>32</td>
</tr>
<tr>
<td>Power consumption (in Watt)</td>
<td>0.157</td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

Once the functionality is proven, the netlist of the design is ready for further processing. The synthesized netlist has been placed, routed, and checked for timing violations. The timing report generated for the proposed design gave a critical path delay of 3.04 ns, and the corresponding maximum operating clock frequency was 329 MHz, as presented in Table 4.
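As a quick check, assuming the maximum clock frequency is simply the reciprocal of the critical path delay (written here as \( t_{\text{crit}} \), a symbol introduced only for this illustration):

\[ f_{\text{max}} = \frac{1}{t_{\text{crit}}} = \frac{1}{3.04 \times 10^{-9}\ \text{s}} \approx 328.9\ \text{MHz} \approx 329\ \text{MHz} \]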

Table 4: Throughput, latency, and power utilization of the proposed turbo decoder

<table>
<thead>
<tr>
<th>Platform</th>
<th>Critical path delay (ns)</th>
<th>Max. clock frequency $f_{\text{max}}$ (MHz)</th>
<th>Block size K</th>
<th>Latency (L) ($\mu$s)</th>
<th>Throughput (T) Mbps</th>
</tr>
</thead>
<tbody>
<tr>
<td>Kintex-7 28 nm CMOS</td>
<td>3.04</td>
<td>329</td>
<td>40</td>
<td>1.34</td>
<td>32</td>
</tr>
<tr>
<td>Kintex-7 28 nm CMOS</td>
<td>3.04</td>
<td>329</td>
<td>6144</td>
<td>39.67</td>
<td>155</td>
</tr>
</tbody>
</table>

On the 28 nm CMOS Xilinx Kintex-7 FPGA, the proposed parallel turbo decoder achieved throughputs of 155 Mbps and 32 Mbps and latencies of 39.67 $\mu$s and 1.34 $\mu$s for block lengths of 6144 and 40, respectively, at the maximum clock frequency $f_{\text{max}}$ of 329 MHz, as listed in Table 4.

The proposed parallel architecture with these techniques reduces energy consumption compared to the general architecture. The estimated performance of the proposed turbo decoder on various Xilinx FPGAs and a comparison of the results with other recent turbo decoder designs are shown in Table 5. It is observed that the present work provides a balanced design among the performance parameters of speed, area, and power. It is evident from the results that, for the same algorithm and block size and approximately the same number of iterations, the proposed turbo decoder gives a much better throughput.

Table 5: Comparison of the proposed Turbo decoders with other reported works

<table>
<thead>
<tr>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td>Target device/ FPGA family</td>
<td>130 nm CMOS</td>
<td>28 nm Virtex-7</td>
<td>28 nm Virtex-7</td>
<td>28 nm Zynq</td>
<td>28 nm Virtex-7</td>
<td>28 nm Virtex-7</td>
<td>28 nm Virtex-7/Zynq</td>
<td>Kintex-7</td>
</tr>
<tr>
<td>Parallelism/Radix</td>
<td>8/4</td>
<td>64</td>
<td>-</td>
<td>8/2</td>
<td>8</td>
<td>-</td>
<td>8</td>
<td>8</td>
</tr>
<tr>
<td>Algorithm</td>
<td>Max-Log MAP</td>
<td>Max-Log MAP</td>
<td>Max-log-MAP</td>
<td>Max-log-MAP</td>
<td>Max-log-MAP</td>
<td>Max-log-MAP</td>
<td>Max-log-MAP</td>
<td>Max-log-MAP</td>
</tr>
<tr>
<td>Block size</td>
<td>6144</td>
<td>6144</td>
<td>6144</td>
<td>6144</td>
<td>6144</td>
<td>6144</td>
<td>6144</td>
<td>6144</td>
</tr>
<tr>
<td>Number of iterations</td>
<td>5.5</td>
<td>8</td>
<td>5</td>
<td>8</td>
<td>8</td>
<td>8</td>
<td>8</td>
<td>8</td>
</tr>
<tr>
<td>Maximum clock rate (MHz)</td>
<td>290</td>
<td>250</td>
<td>270.9</td>
<td>276</td>
<td>86.3</td>
<td>86.3</td>
<td>252.5</td>
<td>329</td>
</tr>
<tr>
<td>Throughput (Mbps)</td>
<td>384.3</td>
<td>2120</td>
<td>5</td>
<td>80</td>
<td>86.3</td>
<td>10.7</td>
<td>118</td>
<td>155</td>
</tr>
</tbody>
</table>

6. CONCLUSIONS

The present study presents a new architecture with a three-stage pipelined parallel turbo decoder and a three-stage micro-pipelined MAP decoder. These techniques have improved the throughput and operating clock frequency through the pipelined parallel implementation of the turbo decoder and have shortened the critical path delay in the whole design. Algorithmic approximation and architectural optimizations such as pipelining and parallelization were used to minimize the critical path and attain a higher throughput. However, the hardware complexity grows linearly as the number of sub-blocks or iterations increases, and the recursions in the architecture of the MAP decoder normally limit the throughput of the turbo decoder. The estimated performance was observed by implementing the proposed parallel turbo decoder on a 28 nm CMOS technology Xilinx Kintex-7 FPGA, which achieved a maximum estimated throughput of 155 Mbps with 8 iterations, suitable for 3GPP LTE-Advanced as per its specification. The proposed design improved throughput by up to 55.6% compared to other recently reported designs.

From the performance analysis of the proposed turbo decoders and comparison with other recent turbo decoder designs, it is evident that the proposed architecture provides a balanced design among performance parameters, speed, and area. It can be concluded that throughput increases for the optimized turbo decoder and parallel turbo decoder architectures as compared to the standard design. However, the area requirement or power consumption increases proportionately with the throughput.

ACKNOWLEDGEMENT
The authors would like to thank the editors and anonymous reviewers for their insightful comments and constructive suggestions. This work was supported by the Department of Science and Technology, Government of India, under the Women Scientist Scheme-A (WOS-A) (SR/WOS-A/ET-72/2017), and the work was carried out at Sree Vidyanikethan Engineering College, Tirupati, Andhra Pradesh, India.

REFERENCES


