Low-Complexity and Hardware-Efficient Successive Cancellation Polar Decoder Architecture
Department of ECE,
Mepco Schlenk Engineering College, Sivakasi.
Polar codes have emerged as the first provably capacity-achieving error-correcting codes. Polar decoders are based on the successive cancellation (SC) algorithm. Likelihood and log-likelihood formulations are unsuitable for practical applications and incur high hardware complexity. We therefore adopt a reformulated log-likelihood-ratio (LLR) based successive cancellation algorithm, which reduces hardware complexity without any performance loss. Using this formulation, polar decoder architectures were implemented and their performance analyzed. Further, the high latency of these architectures is reduced by decoding two bits simultaneously. The resulting decoder, referred to as the 2b-SC decoder, reduces latency from (2n-2) to (1.5n-2) clock cycles without any degradation in error-correction performance.
Keywords: Polar codes, FEC, successive cancellation, log-likelihood forms
1. INTRODUCTION
Polar codes are a new family of error-correcting codes. For the past few decades, researchers have sought explicit code constructions that approach the Shannon limit. Polar codes offer excellent error-correction performance with an explicit construction and efficient encoding and decoding algorithms. Moreover, polar codes asymptotically achieve the capacity of an underlying symmetric, memoryless channel as the code length n grows. Hence polar codes have emerged as a major breakthrough in coding theory. Since their invention, several hardware architectures have been proposed and implemented. Arıkan employed a fast-Fourier-transform-like structure to efficiently reuse computations, and showed that the SC decoding algorithm can be implemented with complexity O(n log2 n), where n is the code length. Given the rapid progress in the theory of polar codes, our motivation is to obtain high-throughput, low-area SC decoder implementations. In this paper, building on the structure proposed by Arıkan, we show that SC decoding can actually be implemented with hardware complexity O(n). We also propose to increase the throughput by decoding several consecutive vectors at the same time. Since SC decoding has low intrinsic parallelism, complementary works have focused on increasing the throughput of SC decoders: look-ahead techniques have been used to reduce the decoding latency with limited extra hardware resources, a simplification of SC decoding has been proposed to reduce the number of computations without altering error-correction performance, and maximum-likelihood decoding of short constituent codes has been used to further speed up the decoding process. However, these low-latency decoders have not yet been implemented.
2. POLAR CODES
In information theory, a polar code is a linear block error-correcting code. Its construction is based on the multiple recursive concatenation of a short kernel code. The kernel code transforms the physical channel into virtual outer channels. As the number of recursions grows, the virtual channels tend toward either very high or very low reliability (they polarize), and the message bits are allocated to the most reliable channels. The construction was first described by Stolte, and later independently by Erdal Arıkan. Polar codes provably achieve the channel capacity of symmetric binary-input, discrete, memoryless channels (B-DMCs) with encoding and decoding complexity polynomial in the block length.
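The polarization effect described above can be illustrated with a short sketch. As an illustrative assumption (not from this paper), we take a binary erasure channel, whose Bhattacharyya parameter Z equals the erasure probability and obeys a simple one-step recursion under polarization:

```python
# Sketch: channel polarization on a BEC, tracked via the Bhattacharyya
# parameter Z (the erasure probability). One polarization step maps a
# channel with parameter z into a "bad" channel (2z - z^2) and a "good"
# channel (z^2).
def polarize(z_list):
    out = []
    for z in z_list:
        out.append(2 * z - z * z)  # degraded (less reliable) channel
        out.append(z * z)          # upgraded (more reliable) channel
    return out

zs = [0.5]            # BEC with erasure probability 0.5
for _ in range(3):    # m = 3 recursions -> n = 8 virtual channels
    zs = polarize(zs)

# Channels with small Z are reliable: information bits go there,
# frozen zeros go on channels with Z close to 1.
reliable = sorted(range(len(zs)), key=lambda i: zs[i])[:4]
```

After a few recursions the Z values cluster near 0 and 1, which is exactly the polarization used to select the k information positions.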
3. ENCODING PROCESS
Fig.1. Implementation of an n = 8 polar encoder
Polar codes differ from other block codes. An (n, k) polar code is generated in two steps. First, the k-bit source message is extended to an n-bit message u by padding (n-k) zero bits. Because the post-decoding reliability of the n bit positions of u can be precomputed, the k most reliable positions of u are assigned the k information bits, while the other (n-k) least reliable positions are frozen to 0. Then, the n-bit message u is multiplied by an n x n generator matrix G to produce the transmitted codeword x. Fig. 1 shows the implementation of a polar code encoder with n = 8.
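As a sketch of this encoding step, the multiplication u·G can be computed with the usual XOR butterfly network instead of building G explicitly. This assumes the natural-order form G = F^(x)m with kernel F = [[1,0],[1,1]], and omits the bit-reversal permutation:

```python
# Sketch of the polar encoding step x = u * G over GF(2), computed
# in-place with the butterfly network (one XOR stage per recursion level).
# Assumes natural bit order; the bit-reversal permutation is omitted.
def polar_encode(u):
    x = list(u)
    n = len(x)
    step = 1
    while step < n:
        for i in range(0, n, 2 * step):
            for j in range(i, i + step):
                x[j] ^= x[j + step]   # one XOR butterfly of the encoder graph
        step *= 2
    return x
```

For n = 8 this performs 3 stages of 4 XORs each, matching the log2(n) stages of the encoder in Fig. 1.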
4. SUCCESSIVE CANCELLATION DECODER IMPLEMENTATION
4.1 FFT STRUCTURE
SC decoding can be implemented on the factor graph of the code, which resembles the fast Fourier transform (FFT) structure. Figure 2 shows the graph of the SC decoder for n = 8. The λi denote the channel likelihood ratios (LRs) and the ûi the estimated bits. The SC decoder is composed of m = log2 n stages, each containing n nodes. Each node updates its output according to one of two update rules:
f(a, b) = (1 + ab) / (a + b)
g_ûs(a, b) = a^(1 - 2ûs) · b
The values a and b are likelihood ratios, while ûs is a bit representing the partial modulo-2 sum of previously estimated bits. The value of ûs determines whether the function g performs a multiplication or a division. Since both update rules involve multiplications and divisions, they are costly to implement in hardware. Fig. 2 highlights the path activated to decode the first bit û0. If each node processor memorizes its updated value in a register, some intermediate results can be reused.
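The two update rules can be sketched as plain floating-point helpers in the likelihood-ratio domain (hypothetical function names; a and b are LRs):

```python
# Behavioral sketch of the two SC node update rules in the LR domain.
# u_s is the partial modulo-2 sum of previously decoded bits; it flips
# the g rule between a multiplication and a division, as in the
# expression g(a, b) = a^(1 - 2*u_s) * b.
def f_node(a, b):
    return (1 + a * b) / (a + b)

def g_node(u_s, a, b):
    return b * a if u_s == 0 else b / a
```

The divisions and multiplications visible here are exactly why these rules are expensive as hardware operators, motivating the LLR-domain reformulation.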
Fig.2. FFT-like SC decoder architecture for n = 8
Despite providing this well-defined structure and FFT-like decoder scheduling, Arıkan does not address the resource sharing, memory management, or control generation required for a hardware implementation. The structure nevertheless suggests an implementation with n log2 n combinatorial node processors, with n registers between stages to memorize intermediate results. For storing the channel information, n extra registers are also included. The total complexity of such a decoder is
CT = (Cnp + Cr) n log2 n + nCr
where Cnp and Cr are the hardware complexities of a node processor and a register, respectively. It can be shown that this decoder, with right-to-left scheduling, takes 2n - 2 clock cycles to decode n bits. The throughput in bits per second is then
T = n / ((2n - 2) tnp)
where tnp is the propagation time in seconds through a node processor. It follows that each node processor is actually used only once every 2n - 2 clock cycles. This motivates a structure that merges several nodes into a single processing element.
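The latency and throughput expressions can be checked with a trivial sketch (tnp, in seconds, is a free parameter):

```python
# Latency and throughput of the FFT-like SC decoder:
# an n-bit decode takes 2n - 2 clock cycles, so the throughput is
# n / ((2n - 2) * t_np) bits per second.
def sc_latency_cycles(n):
    return 2 * n - 2

def sc_throughput(n, t_np):
    return n / (sc_latency_cycles(n) * t_np)
```

For n = 8 this gives the 14-cycle latency used throughout the later sections.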
4.2. PIPELINED TREE ARCHITECTURE
Analyzing the scheduling further, we find that whenever stage l is activated, only 2^l nodes are actually updated. For instance, in Figure 2, when stage 0 is enabled, only one node is updated; the n nodes of stage 0 can therefore be implemented with a single processing element (PE). A PE is a configurable element that can perform either function f or g.
Fig.3. Pipelined tree SC architecture for n = 8
Fig.3a. Processing element architecture
In general, stage l requires 2^l processing elements (PEs) to update its nodes. However, this resource sharing does not guarantee that the memories assigned to the merged nodes can also be merged. Table 1 shows the stage activations during the decoding of one vector y: for each clock cycle (CC), it indicates which function (f or g) is applied to the 2^l activated nodes of stage l. Every generated variable is used twice during the decoding.
Table 1. Schedule for the FFT-like and pipelined tree SC architectures (n = 8)
For example, the four variables generated in stage 2 at clock cycle 1 are used at clock cycles 2 and 5 in stage 1. This means that in stage 2, the four registers associated with the function f can be reused at clock cycle 8 to memorize the four values generated by the g function. This observation applies to any stage of the decoder. The resulting architecture is shown in Figure 3 for n = 8. n registers store the LRs λi. The decoder is built as a pipelined tree structure containing n - 1 PEs and n - 1 registers. A decision unit generates the estimated bit ûi, which is then fed back to every PE. The architecture also contains the ûs computation block, which updates the value of ûs with the last decoded bit ûi only when the control bit bl,j = 1. The selection between f and g is made by the control bit bl. Compared with the FFT-like structure, the pipelined tree architecture performs the same amount of computation with the same scheduling (see Table 1), but with fewer PEs and registers. Assuming that a PE (implementing both f and g) has twice the complexity of a node processor implementing a single function, the pipelined tree decoder complexity is
CT = (n - 1)(2Cnp + Cr) + nCr.
Moreover, the routing network is much simpler in the tree architecture than in the FFT-like structure. The connections between PEs are local, which lowers the risk of congestion during the wire-routing phase of an integrated-circuit design and helps increase the clock frequency, and thus the throughput.
4.3 LINE SC ARCHITECTURE
Despite the low complexity of the pipelined tree architecture, the number of PEs can be reduced further.
Fig.4. Line SC architecture for n = 8.
Table 1 shows that only one stage is activated at a time. In the worst case (activation of stage m - 1), n/2 PEs must be activated simultaneously; the same throughput can therefore be achieved with only n/2 PEs. The resulting architecture is shown in Figure 4 for n = 8. The processing elements Pj are arranged in a line, while the registers keep a tree structure. Registers and PEs are connected via multiplexers that emulate the tree structure. For example, since P2,0 and P1,0 (in Figure 3) are merged into P2 (in Figure 4), P2 must be able to write to either R2,0 or R1,0, and to read from the channel registers or from R2,0 and R2,1. The ûs computation block is moved out of Pj and kept close to the associated register, because ûs must also be forwarded to the PE. The overall complexity of the line SC architecture is
CT = (n - 1)(Cr + Cûs) + nCnp + (n/2 - 1) · 3Cmux + nCr
where Cmux is the complexity of a 2-input multiplexer and Cûs is the complexity of the ûs computation block. Despite the extra multiplexing logic required to route data through the PE line, the reduced number of PEs makes this SC decoder less complex than the pipelined tree architecture, while achieving the same throughput as given above. From Table 1, during the decoding of one vector, stage l is activated 2^(m-l) times. Consequently, in the line architecture of Figure 4, all n/2 PEs are active simultaneously only twice during the decoding of a vector, regardless of the code size. This architecture improves hardware efficiency at the cost of a small decrease in throughput. The line SC architecture can thus be viewed as a tree architecture whose complexity is reduced by merging some of the PEs.
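For a rough feel of how the three cost expressions compare, they can be evaluated with illustrative unit weights (placeholder values, not synthesis results):

```python
import math

# The three complexity expressions from the text, with illustrative
# unit weights Cnp = Cr = Cus = 1 and Cmux = 0.5 (placeholders only).
def c_fft(n, cnp=1, cr=1):
    # FFT-like: (Cnp + Cr) * n * log2(n) + n * Cr
    return (cnp + cr) * n * math.log2(n) + n * cr

def c_tree(n, cnp=1, cr=1):
    # Pipelined tree: (n - 1)(2Cnp + Cr) + n * Cr
    return (n - 1) * (2 * cnp + cr) + n * cr

def c_line(n, cnp=1, cr=1, cus=1, cmux=0.5):
    # Line: (n - 1)(Cr + Cus) + n * Cnp + (n/2 - 1) * 3Cmux + n * Cr
    return (n - 1) * (cr + cus) + n * cnp + (n // 2 - 1) * 3 * cmux + n * cr
```

The key point visible here is the asymptotic behavior: the FFT-like cost grows as O(n log n), while the tree and line costs grow as O(n).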
5. 2b-SC DECODING
The above architectures, however, incur high latency. To reduce it, the 2b-SC algorithm is developed from the reformulated scheme. Fig. 5 shows the 2b-SC decoding procedure for the n = 8 polar code of Fig. 2. Compared with the conventional SC scheme in Fig. 3, the 2b-SC algorithm replaces the f and g nodes at stage 3 with new p nodes. Each p node outputs the two successive bits û2i-1 and û2i simultaneously, so the overall latency is reduced. For example, the original latency of 14 cycles in Fig. 3 is reduced to 10 cycles in Fig. 5. Tables 1 and 2 give the clock-cycle schedules of the conventional SC and 2b-SC algorithms. The original SC algorithm requires n = 8 cycles to output the bit estimates at stage 3; by employing p nodes to compute two decoded bits at a time, n/2 = 4 cycles are saved. In general, compared with the original SC algorithm, the overall latency of the 2b-SC algorithm is reduced from (2n-2) to (1.5n-2) clock cycles.
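The cycle counts quoted above can be sketched as follows (even n assumed):

```python
# Latency of conventional SC vs. 2b-SC: merging the last stage's f and g
# updates into p nodes that emit two bits per activation saves n/2 cycles,
# reducing 2n - 2 to 1.5n - 2 (for even n).
def sc_cycles(n):
    return 2 * n - 2

def sc2b_cycles(n):
    return sc_cycles(n) - n // 2   # = 1.5n - 2
```

For n = 8 this reproduces the 14-to-10-cycle reduction stated in the text.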
Fig.5. The decoding procedure of 2b-SC algorithm with n=8.
5.2 HARDWARE ARCHITECTURES OF 2B-SC DECODER
In this section, the hardware architecture of the proposed 2b-SC algorithm is presented. According to Fig. 5, the 2b-SC decoder architecture primarily consists of three types of processing nodes: f, g, and p nodes. A processing element is needed for the f and g nodes: stage m uses the p nodes, while the remaining stages use the f and g nodes for computation. The processing element (PE) architecture is presented in Fig. 6. Here, the conversion from sign-magnitude form to 2's-complement form is performed by the S2C block, while the C2S block carries out the inverse conversion. An adder and a subtractor perform the addition and subtraction of the two inputs. Finally, a control signal on the multiplexer selects the output LLR, which is then propagated to the next stage.

CC        1   2   3   4   5   6   7   8   9   10
Stage 1   f                   g
Stage 2       f       g           f       g
Stage 3           p       p           p       p

Table 2. Schedule for the 2b-SC architecture (n = 8)
Fig. 6. The architecture of the PE for f and g nodes.
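Assuming the common min-sum LLR formulation for the PE (an assumption on our part; the paper's exact reformulated rules may differ), its two functions behave as follows, with the S2C/C2S conversions abstracted away:

```python
# Behavioral sketch of an LLR-domain PE under the min-sum assumption:
# f takes the sign product and magnitude minimum of the two input LLRs;
# g adds or subtracts depending on the partial sum u_s, i.e.
# g(a, b) = (1 - 2*u_s) * a + b. In hardware, the sign-magnitude inputs
# are converted to 2's complement (S2C) for the adder/subtractor and
# converted back (C2S) at the output.
def pe_f(a, b):
    sign = -1.0 if (a < 0) != (b < 0) else 1.0
    return sign * min(abs(a), abs(b))

def pe_g(u_s, a, b):
    return b + a if u_s == 0 else b - a
```

Note that both functions reduce to comparisons, additions, and subtractions, which is what makes the LLR reformulation hardware-friendly compared with the multiply/divide LR rules.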
Using the PE and the p node of Figs. 6 and 7, the overall 2b-SC decoder can be organized as a butterfly-like architecture (Fig. 5). However, this design is not hardware-efficient: in Fig. 5, nearly half of the nodes in each stage are idle throughout the decoding procedure.
Fig. 7. The architecture of p node.
Fig. 8. The tree-based 2b-SC architecture with n=8.
The architecture of a tree-based 2b-SC decoder with n = 8 is shown in Fig. 8. In this architecture, when a particular stage is activated, all the nodes in that stage are active. Therefore, only (n-2) PEs and a single p node are needed. A disadvantage of the tree-based architecture is that only the activated stage achieves 100% hardware utilization in each cycle.
Logic utilization    Pipelined SC decoder    Line SC decoder
Slice registers      26                      8
Slice LUTs           1471                    2571
LUT-FF pairs         21                      4
Delay                37.044 ns               36.732 ns

Table 3. Comparison between pipelined and line SC decoders
Logic utilization    Conventional SC decoder    2b-SC decoder
Slice registers      26                         46
Slice LUTs           1471                       52
LUT-FF pairs         21                         40
Delay                37.044 ns                  1.556 ns

Table 4. Comparison between conventional and 2b-SC decoders
6. CONCLUSION
In this paper, we showed that the architecture proposed by Arıkan can be improved by taking advantage of the scheduling in SC decoding. The pipelined tree architecture and the line architecture reach the same throughput while reducing the hardware complexity. Furthermore, the reformulated 2b-SC decoder reduces the latency while preserving error-correction performance.
REFERENCES
[1] E. Arikan, "Channel polarization: A method for constructing capacity-achieving codes for symmetric binary-input memoryless channels," IEEE Trans. Inform. Theory, vol. 55, no. 7, pp. 3051-3073, Jul. 2009.
[2] H. Mahdavifar and A. Vardy, "Achieving the secrecy capacity of wiretap channels using polar codes," in Proc. IEEE ISIT, Jun. 2010, pp. 913-917.
[3] I. Tal and A. Vardy, "How to construct polar codes," in Proc. IEEE ITW, Aug. 2010.
[4] N. Hussami, R. Urbanke, and S. B. Korada, "Performance of polar codes for channel and source coding," in Proc. IEEE ISIT, Jun. 2009, pp. 1488-1492.
[5] E. Arikan, "Polar codes: A pipelined implementation," in Proc. ISBC, Jul. 2010.
[6] M. P. C. Fossorier, M. Mihaljevic, and H. Imai, "Reduced complexity iterative decoding of low-density parity check codes based on belief propagation," IEEE Trans. Commun., vol. 47, no. 5, pp. 673-680, May 1999.
[7] C. Zhang, B. Yuan, and K. K. Parhi, "Low-latency SC decoder architectures for polar codes," arXiv:1111.0705, Nov. 2011.
[8] C. Zhang and K. K. Parhi, "Low-latency sequential and overlapped architectures for successive cancellation polar decoder," IEEE Trans. Signal Process., 2013.
[9] A. Alamdar-Yazdi and F. R. Kschischang, "A simplified successive-cancellation decoder for polar codes," IEEE Commun. Lett., Dec. 2011.
[10] G. Sarkis and W. J. Gross, "Increasing the throughput of polar decoders," IEEE Commun. Lett., 2013.