# An area-efficient 256-point FFT design for WiMAX systems 

Jian Yu*, Kyung-Ju Cho**


#### Abstract

This paper presents a low-area 256-point pipelined FFT architecture targeted at IEEE 802.16a WiMAX systems. The radix-$2^{4}$ algorithm and the single-path delay feedback (SDF) architecture are adopted to reduce the complexity of twiddle factor multiplication. New cascade canonical signed digit (CSD) complex multipliers are proposed for twiddle factor multiplication; they occupy less area and consume less power than conventional complex multipliers composed of 4 multipliers and 2 adders. The proposed cascade CSD multipliers also remove the look-up table that stores the twiddle factor coefficients. In a hardware implementation on a Cyclone 10LP FPGA, the proposed FFT design achieves about $62 \%$ reduction in gate count and $64 \%$ reduction in memory compared with previous schemes.


Key Words : FFT, pipelined, SDF, CSD, complex multiplier, twiddle factor

## 1. Introduction

The FFT is a very important technique in modern communications, especially in OFDM systems such as IEEE 802.11a/g/n, WiMAX, wireless personal area networks (WPANs), and long-term evolution (LTE) systems [1]. The FFT is one of the most computationally complex modules in the physical layer of OFDM systems. Thus, many FFT algorithms have been developed to reduce the computational complexity, including radix-2, radix-4, radix-$2^{2}$, radix-$2^{3}$, radix-$2^{4}$, and so on [2]-[5].
The radix-2 algorithm has long been popular for FFTs because of its simple butterfly unit. However, the number of complex multiplications it requires grows quickly as the FFT size increases. Although the radix-4 algorithm reduces the number of complex multiplications, it needs a relatively complex butterfly unit. The radix-$2^{k}$ algorithms presented in [3]-[4] reduce the number of complex multiplications while keeping the same butterfly unit as the radix-2 algorithm.
In general, FFT architectures can be divided into two types: memory-based and pipelined architectures. A memory-based architecture consists of a main processing element (butterfly unit), memory units, and control logic. This kind of architecture has low hardware cost and power consumption at the expense of low throughput and long latency. On the other hand, pipelined FFT architectures can satisfy real-time applications thanks to their high throughput, but they cost more hardware resources [6]-[7].
To compute the FFT, complex multipliers are required to multiply the input signal by the twiddle factors, and a look-up table is required to store the twiddle factors. These lead to large area and power consumption.

In this paper, to reduce hardware cost, we propose novel CSD constant complex multipliers for twiddle factor multiplication in place of the conventional complex multiplier, which is composed of 4 multipliers and 2 adders. With this approach, no look-up table is required for storing twiddle factors.

This paper was supported by Wonkwang University in 2017.
*Department of Electronic Engineering, Wonkwang University
**Corresponding Author: Department of Electronic Engineering, Wonkwang University (kjcho@wku.ac.kr)
Received May 11, 2018; Revised May 28, 2018; Accepted June 05, 2018

## 2. Design Consideration of FFT

The N-point discrete Fourier transform $X(k)$ of input sequence $x(n)$ is defined as

$$
\begin{equation*}
X(k)=\sum_{n=0}^{N-1} x(n) W_{N}^{n k}, \quad 0 \leq k \leq N-1, \tag{1}
\end{equation*}
$$

where the twiddle factor $W_{N}^{n k}=e^{-j 2 \pi n k / N}$.
Direct implementation of (1) requires large hardware and long computation time. Thus, the radix-2 FFT algorithm by Cooley and Tukey [2] was developed to speed up the computation and to reduce the hardware requirement. Later, the radix-$2^{k}$ algorithm showed an advantage over the radix-2 algorithm because it achieves a reduced number of complex multiplications while keeping a simple butterfly unit like that of the radix-2 algorithm. Thus, we consider the radix-$2^{k}$ algorithm in our design.
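As a reference point for the cost of direct evaluation, the following minimal sketch evaluates (1) term by term. The helper names `twiddle` and `dft` and the test input are illustrative assumptions, not part of the proposed design.

```python
import cmath

def twiddle(N, e):
    """Twiddle factor W_N^e = exp(-j*2*pi*e/N)."""
    return cmath.exp(-2j * cmath.pi * e / N)

def dft(x):
    """Direct evaluation of Eq. (1): N*N complex multiplications."""
    N = len(x)
    return [sum(x[n] * twiddle(N, n * k) for n in range(N)) for k in range(N)]

# Hypothetical test input: a unit impulse delayed by one sample,
# for which X(k) reduces to W_8^k.
x = [0, 1, 0, 0, 0, 0, 0, 0]
X = dft(x)
```

For a 256-point transform this direct form needs 65,536 complex multiplications, which is the cost that the FFT algorithms discussed here are designed to avoid.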

The 256-point FFT computation with the radix-$2^{k}$ algorithm consists of eight stages. The radix-$2^{k}$ algorithm retains the structure of the radix-2 algorithm and has the same butterfly structure regardless of $k$. However, the twiddle factor multiplications differ with $k$. Table 1 shows the twiddle factor sequence of the 256-point FFT at each stage for the radix-$2^{k}$ algorithms. From Table 1, the radix-$2^{4}$ algorithm is the best candidate for the 256-point FFT since it has the fewest complex multiplications and the lowest twiddle factor complexity, where $-j$ denotes a trivial multiplication.

Table 1. Base number of twiddle factors for 256-point at each stage

| Algorithm | Stage 1 | Stage 2 | Stage 3 | Stage 4 | Stage 5 | Stage 6 | Stage 7 | \#CM |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Radix-2 | $W_{256}$ | $W_{128}$ | $W_{64}$ | $W_{32}$ | $W_{16}$ | $W_{8}$ | - | 642 |
| Radix-$2^{2}$ | $-j$ | $W_{256}$ | - | $W_{64}$ | - | $W_{16}$ | - | 492 |
| Radix-$2^{3}$ | - | $W_{8}$ | $W_{256}$ | - | $W_{8}$ | $W_{32}$ | - | 504 |
| Radix-$2^{4}$ | $-j$ | $W_{16}$ | - | $W_{256}$ | - | $W_{16}$ | - | 480 |

Generally, two design styles exist in the pipelined FFT architectures: feedforward and feedback. Feedforward architectures can be divided into single-path delay commutator (SDC) and multi-path delay commutator (MDC). Feedback architectures can be divided into SDF and multi-path delay feedback (MDF) [7]. The control logic of feedforward styles is more complex than that of feedback styles, and MDF architecture requires more hardware cost than SDF architecture.

The demand for wireless devices is increasing rapidly, which calls for FFT architectures with less hardware and lower power [8]. Thus, the SDF FFT architecture is adopted in this paper since it requires fewer memory elements and its control unit is easy to design.

Fig. 1 shows the architecture of the radix-$2^{4}$ 256-point SDF FFT. Two types of butterfly (BF1 and BF2) and several delay buffers of various sizes are used for data shuffling to obtain the appropriate data at the butterfly inputs. The symbol $\otimes$ represents the complex multiplier. Control signals switch the butterfly types and also provide the appropriate control for twiddle factor multiplication.


Fig. 1. 256-point radix- $2^{4}$ SDF FFT architecture.

## 3. Proposed FFT Design

In this section, we present a method that uses CSD multipliers to reduce the hardware required for complex multiplication by twiddle factors.

### 3.1 Proposed $W_{16}^{i}$ CSD complex multiplier

The twiddle factors $W_{16}^{i}$ at stage 2 take only seven values: $W_{16}^{0}, W_{16}^{1}, W_{16}^{2}, W_{16}^{3}, W_{16}^{4}, W_{16}^{6}$, and $W_{16}^{9}$. Using $W_{16}^{4}=-j$ and the symmetry of the complex sinusoid, these factors can be expressed with three values, $\operatorname{Re}\left\{W_{16}^{1}\right\}, \operatorname{Re}\left\{W_{16}^{2}\right\}$, and $\operatorname{Re}\left\{W_{16}^{3}\right\}$, as shown in Table 2, where $\operatorname{Re}\{t\}$ denotes the real part of $t$. The three values $\operatorname{Re}\left\{W_{16}^{1}\right\}, \operatorname{Re}\left\{W_{16}^{2}\right\}$, and $\operatorname{Re}\left\{W_{16}^{3}\right\}$ equal 0.9239, 0.7071, and 0.3827, respectively.

In hardware design, CSD representation and common sub-expression (CSE) sharing are exploited to reduce the hardware resources. Table 3 shows the CSD representations of $\operatorname{Re}\left\{W_{16}^{1}\right\}, \operatorname{Re}\left\{W_{16}^{2}\right\}$, and $\operatorname{Re}\left\{W_{16}^{3}\right\}$. The CSE block '101' (or '-10-1') is enclosed by the blue ellipses. The CSD constant complex multiplier is therefore composed only of adders, shifters, and multiplexers, at a lower hardware cost than the conventional complex multiplier.

Table 2. Representation of $W_{16}^{i}$

| $W_{16}^{0}$ | $1$ | $W_{16}^{4}$ | $-j$ |
| :---: | :---: | :---: | :---: |
| $W_{16}^{1}$ | $\operatorname{Re}\left\{W_{16}^{1}\right\}-j \operatorname{Re}\left\{W_{16}^{3}\right\}$ | $W_{16}^{6}$ | $-\operatorname{Re}\left\{W_{16}^{2}\right\}-j \operatorname{Re}\left\{W_{16}^{2}\right\}$ |
| $W_{16}^{2}$ | $\operatorname{Re}\left\{W_{16}^{2}\right\}-j \operatorname{Re}\left\{W_{16}^{2}\right\}$ | $W_{16}^{9}$ | $-\operatorname{Re}\left\{W_{16}^{1}\right\}+j \operatorname{Re}\left\{W_{16}^{3}\right\}$ |
| $W_{16}^{3}$ | $\operatorname{Re}\left\{W_{16}^{3}\right\}-j \operatorname{Re}\left\{W_{16}^{1}\right\}$ |  |  |
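The representation in Table 2 can be checked numerically from the definition $W_{N}^{nk}=e^{-j 2 \pi n k / N}$. The sketch below is only an illustration; the helper `decompose` and the dictionary `C` are hypothetical names introduced here. It reports, for each non-trivial factor, the sign and the index $k$ of the constant $\operatorname{Re}\{W_{16}^{k}\}$ that forms its real and imaginary parts.

```python
import cmath
import math

# The three constants named in the text: Re{W16^1}, Re{W16^2}, Re{W16^3}.
C = {k: math.cos(2 * math.pi * k / 16) for k in (1, 2, 3)}

def decompose(i):
    """Express W16^i as s_re*C[k_re] + j*s_im*C[k_im] (hypothetical helper)."""
    w = cmath.exp(-2j * math.pi * i / 16)

    def match(v):
        # Find which of the three constants has the same magnitude as v.
        for k, c in C.items():
            if math.isclose(abs(v), c, abs_tol=1e-12):
                return (1 if v > 0 else -1, k)
        raise ValueError("trivial component (0 or +/-1)")

    return match(w.real), match(w.imag)

for i in (1, 2, 3, 6, 9):
    (s_re, k_re), (s_im, k_im) = decompose(i)
    print(f"W16^{i} = {s_re:+d}*Re{{W16^{k_re}}} {s_im:+d}j*Re{{W16^{k_im}}}")
```

The factors $W_{16}^{0}=1$ and $W_{16}^{4}=-j$ are trivial and are deliberately excluded from the loop.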

Table 3. CSD representation for $W_{16}^{i}$ with 12 bits

| $\operatorname{Re}\left\{W_{16}^{1}\right\}$ | 1 | 0 | 0 | 0 | -1 | 0 | -1 | 0 | 0 | 1 | 0 | 0 |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| $\operatorname{Re}\left\{W_{16}^{2}\right\}$ | 1 | 0 | -1 | 0 | -1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| $\operatorname{Re}\left\{W_{16}^{3}\right\}$ | 0 | 1 | 0 | -1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
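A CSD digit string such as those in Table 3 can be generated with the standard non-adjacent-form recoding. The sketch below assumes that the 12 digits carry weights $2^{0}$ down to $2^{-11}$ and that each constant is rounded to the nearest multiple of $2^{-11}$; the function name `to_csd` is hypothetical.

```python
import math

def to_csd(value, frac_bits=11):
    """CSD digits (MSB first, weights 2^0 .. 2^-frac_bits) of 0 <= value <= 1."""
    n = round(value * (1 << frac_bits))   # fixed-point integer
    digits = []                           # collected LSB first
    while n != 0:
        if n & 1:
            d = 2 - (n % 4)               # +1 if n = 1 (mod 4), -1 if n = 3 (mod 4)
            n -= d
        else:
            d = 0
        digits.append(d)
        n //= 2
    digits += [0] * (frac_bits + 1 - len(digits))   # pad to frac_bits + 1 digits
    return digits[::-1]

for k in (1, 2, 3):
    print(k, to_csd(math.cos(2 * math.pi * k / 16)))
```

Because no two adjacent digits are non-zero in this form, each constant needs at most one adder per non-zero digit after the first, which is what makes the shift-and-add realization compact.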

The three CSD multipliers can be obtained by simply using 6 shifters and 7 additions as

$$
\begin{align*}
& CSE=d+(d \gg 2) \\
& d \times \operatorname{Re}\left\{W_{16}^{1}\right\}=d-(CSE \gg 4)+(d \gg 9) \\
& d \times \operatorname{Re}\left\{W_{16}^{2}\right\}=d-(CSE \gg 2)+(CSE \gg 6)  \tag{2}\\
& d \times \operatorname{Re}\left\{W_{16}^{3}\right\}=(d \gg 1)-(d \gg 3)+(d \gg 7)
\end{align*}
$$

where $d$ is the multiplicand for the twiddle factors and $\gg t$ denotes a right shift by $t$ bits.
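A small integer model helps to check the shift-and-add structure. The sketch below follows the digit patterns of Table 3, so the $\operatorname{Re}\{W_{16}^{2}\}$ product includes the $2^{-8}$ term through a second use of the shared block. The function name `csd_products` is an assumption, and the parentheses are required because Python's `>>` binds more loosely than `+` and `-`.

```python
def csd_products(d):
    """Multiply integer d by Re{W16^1}, Re{W16^2}, Re{W16^3} with shifts and adds.

    Right shifts truncate low-order bits, as a hard-wired shifter would.
    """
    cse = d + (d >> 2)                    # shared pattern '101'
    p1 = d - (cse >> 4) + (d >> 9)        # 1 - 2^-4 - 2^-6 + 2^-9
    p2 = d - (cse >> 2) + (cse >> 6)      # 1 - 2^-2 - 2^-4 + 2^-6 + 2^-8
    p3 = (d >> 1) - (d >> 3) + (d >> 7)   # 2^-1 - 2^-3 + 2^-7
    return p1, p2, p3

# d = 2^11 keeps every shifted term exact, so the products equal the
# 12-bit fixed-point constants themselves.
print(csd_products(1 << 11))   # -> (1892, 1448, 784)
```

Dividing the three results by $2^{11}$ gives 0.9238, 0.7070, and 0.3828, which agree with the constants quoted in the text to within the 12-bit quantization.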

In order to select the two sets of twiddle factor multiplications in Table 2, 4-to-1 multiplexers are needed. Fig. 2 shows the detailed structure of the proposed constant complex multiplier $W_{16}^{i}$. Two signals sel1 and sel2 are needed to select the proper results.

### 3.2 Proposed $W_{256}^{i}$ CSD complex multiplier

As shown in Fig. 1, the butterfly outputs at stage 4 are multiplied by the twiddle factors $W_{256}^{i}$ ($i=0 \sim 255$). Writing $W_{256}^{i}=x+j y$, the index $i$ is divided into 8 regions. Only the constants of region A in Table 4, i.e., $W_{256}^{p}=x_{p}+j y_{p}$ with $p$ from 0 to $N / 8$, are needed, since the twiddle factors in the other regions can be obtained from region A by mapping [7]. Table 4 shows the corresponding mapping.

Table 4. Twiddle factors with corresponding mapping in 8 regions

| $i$ | Real | Imaginary | Region |
| :---: | :---: | :---: | :---: |
| $0 \leq i \leq N / 8$ | $x_{p}$ | $y_{p}$ | A |
| $N / 8<i<N / 4$ | $-y_{p}$ | $x_{p}$ | B |
| $N / 4<i \leq 3 N / 8$ | $y_{p}$ | $-x_{p}$ | C |
| $3 N / 8<i<N / 2$ | $-x_{p}$ | $y_{p}$ | D |
| $N / 2<i \leq 5 N / 8$ | $-x_{p}$ | $-y_{p}$ | E |
| $5 N / 8<i<3 N / 4$ | $y_{p}$ | $x_{p}$ | F |
| $3 N / 4<i \leq 7 N / 8$ | $-y_{p}$ | $x_{p}$ | G |
| $7 N / 8<i<N$ | $x_{p}$ | $-y_{p}$ | H |
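The octant symmetry behind this kind of mapping can be verified numerically. The sketch below is a hedged illustration: it derives the swaps and sign changes directly from the symmetry of $\cos$ and $\sin$ rather than from the table entries, and it stores the first-octant pair as $(c, s)=(\cos \phi, \sin \phi)$, so that $x_{p}=c$ and $y_{p}=-s$ in the notation of the text. The function name `twiddle_from_octant` is hypothetical.

```python
import cmath
import math

N = 256

def twiddle_from_octant(i):
    """Rebuild W_N^i from a first-octant pair (c, s) with 0 <= p <= N/8."""
    octant, r = divmod(i % N, N // 8)
    p = r if octant % 2 == 0 else N // 8 - r    # reflect in odd octants
    c = math.cos(2 * math.pi * p / N)           # stands in for a stored constant
    s = math.sin(2 * math.pi * p / N)
    # (cos(theta), sin(theta)) for theta = 2*pi*i/N, one entry per octant
    cos_t, sin_t = [( c,  s), ( s,  c), (-s,  c), (-c,  s),
                    (-c, -s), (-s, -c), ( s, -c), ( c, -s)][octant]
    return complex(cos_t, -sin_t)               # W = cos(theta) - j*sin(theta)

max_err = max(abs(twiddle_from_octant(i) - cmath.exp(-2j * math.pi * i / N))
              for i in range(N))
```

Only the 33 pairs with $0 \leq p \leq 32$ are ever evaluated, which is the reduction that the cascade structure then builds on.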



Fig. 2. Detailed structure of the CSD constant complex multiplier for $W_{16}^{i}$.

To reduce the hardware cost for multiplication of $W_{256}^{i}$, the cascade CSD constant complex multiplier structure is proposed as follows.

1. By using the $1 / 8$ symmetry property, the exponent $i$ ($i=0 \sim 255$) of $W_{256}^{i}$ can be reduced to $p$ ($0 \leq p \leq 32$).
2. To further reduce the number of twiddle factors, decompose $p$ into $p_{1}$ and $p_{2}$ as $p=4 p_{1}+p_{2}$ ($p_{1}=0 \sim 8$, $p_{2}=0 \sim 3$).
3. Tabulate the CSD representations of $W_{256}^{4 p_{1}}$ and $W_{256}^{p_{2}}$ and find the optimized CSE.

The proposed cascade CSD complex multiplier needs a two-stage complex multiplication to complete one twiddle factor multiplication:

$$
\begin{align*}
d \times W_{256}^{p}= & d \times W_{256}^{4 p_{1}+p_{2}}=d \times W_{256}^{4 p_{1}} \times W_{256}^{p_{2}} \\
= & d \times\left(\operatorname{Re}\left\{W_{256}^{4 p_{1}}\right\}+j \operatorname{Im}\left\{W_{256}^{4 p_{1}}\right\}\right)  \tag{3}\\
& \times\left(\operatorname{Re}\left\{W_{256}^{p_{2}}\right\}+j \operatorname{Im}\left\{W_{256}^{p_{2}}\right\}\right)
\end{align*}
$$

For example, $d \times W_{256}^{13}$ can be decomposed into $d \times W_{256}^{4 \times 3} \times W_{256}^{1}$ ($p_{1}=3, p_{2}=1$).
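The decomposition in (3) can be sketched in floating point as follows. This is only a numerical check of the identity, not a model of the CSD hardware; `w256`, `cascade`, and the sample value `d` are hypothetical.

```python
import cmath
import math

def w256(e):
    """Twiddle factor W_256^e."""
    return cmath.exp(-2j * math.pi * e / 256)

def cascade(d, p):
    """Multiply d by W256^p in two steps, with p = 4*p1 + p2."""
    p1, p2 = divmod(p, 4)          # p1 in 0..8, p2 in 0..3 for 0 <= p <= 32
    return d * w256(4 * p1) * w256(p2), (p1, p2)

d = 0.5 - 0.25j                    # hypothetical sample value
worst = max(abs(cascade(d, p)[0] - d * w256(p)) for p in range(33))

# Constants tabulated in the paper's count of 12:
# W256^(4*p1) for p1 = 1..8 and W256^(p2) for p2 = 0..3.
```

Calling `cascade(d, 13)` returns the index pair `(3, 1)`, matching the example above.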

Note that the 32 twiddle factors are further reduced to 12 different values. Table 5 lists the CSD representations of the 12 values. A multiplier-less realization of the twiddle factor multiplication by $W_{256}^{i}$ can be obtained by optimizing the CSE elimination over the 12 values. In Table 5, the pattern '101' (or '-10-1') is treated as a CSE block and is enclosed by blue ellipses; '10-1' (or '-101') and '1000-1' (or '-10001') are also treated as CSE blocks and are enclosed by red and purple ellipses, respectively.

Table 5. CSD representation of 12 values for composing twiddle factors with 12-bit

| $p_{1}$ | $\operatorname{Re}\left\{W_{256}^{4 p_{1}}\right\}$ |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 1 | 1 | 0 | 0 | 0 |  | 0 | 0 | 0 |  | $0<1$ | -1 |  |  |  |  |
| 2 | 1 | 0 | 0 | 0 |  | 0 | 0 | -1 | 10 | $0 \rightarrow$ | -1 0 | 0 | 0 | 0 |  |
| 3 | 1 | 0 | 0 | 0 |  | $-1$ | 0 | 1 |  | 01 | 10 | 0 | 0 | - | - |
| 4 | 1 | 0 | 0 | 0 |  | -1 | 0 | - 1 | $1)$ | 0 | 01 | 1 | 0 | 0 |  |
| 5 | 1 | 0 | 0 | - |  | 0 | 0 | 0 | 1 | 10 | 0 | 0 | -1 | 1 |  |
| 6 | 1 | 0 | $-1$ | 0 |  | 1 | 0 | 1 | $1)$ | 01 | 10 | 0 | $-1$ | 1 |  |
| 7 | 1 | 0 | -1 | 0 |  | 0 | 1 | 0 | $\bigcirc$ | -10 | 00 | 0 | 0 | - |  |
| 8 | 1 | 0 | -1 | 0 |  | -1 | 0 | C1 | 10 | 01 | 10 | 0 | 0 | 0 | 0 |
| $p_{2}$ | $\operatorname{Re}\left\{W_{256}^{p_{2}}\right\}$ |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| 0 | 1 | 0 | 0 | 0 |  | 0 | 0 | 0 |  | 0 | 0 | 0 | 0 | 0 |  |
| 1 | 1 | 0 | 0 | 0 |  | 0 | 0 | 0 | 0 | 00 | 0 | 0 | 0 | - |  |
| 2 | 1 | 0 | 0 | 0 |  | 0 | 0 | 0 | 0 | 0 | $0-$ |  | 0 | 1 |  |
| 3 |  | 0 | 0 | 0 |  | 0 | 0 | 0 | 0 | $0<1$ | -1 0 | 0 | 1 | 0 | 0 |
| $p_{1}$ | $-\operatorname{Im}\left\{W_{256}^{4 p_{1}}\right\}$ |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| 1 | 0 | 0 | 0 |  |  | 0 | -1 | 0 | 0 | 011 | 10 | 0 | 0 | O |  |
| 2 | 0 |  | 1 | 0 |  |  | 0 |  | 1 | 10 | 0 | 0 | 0 | - |  |
| 3 | 0 | 0 | 1 | 0 |  |  | 1 | 0 | 1 | 10 | 0 | 0 | 1 |  | 0 |
| 4 | 0 | 1 | 0 | 5 |  | 0 | 0 | 0 | 1 | 10 | 00 | 0 | 0 | - |  |
| 5 | 0 | 1 | 0 | 0 |  | 0 | -1 | 0 | 0 | 00 | 0 |  | 0 |  |  |
| 6 | 0 | 1 | 0 | 0 |  | 1 | 0 | 0 | - | 10 | 00 | 0 | 0 | 1 |  |
| 7 | 0 | 1 | 0 | 1 |  | 0 | 0 |  | -1 | 10 | 01 | 1 | 0 | - |  |
| 8 | 1 | 0 | -1 | 0 |  | -1 | 0 | 1 |  | 01 | $1) 0$ | 0 | 0 | 0 |  |
| $p_{2}$ | $-\operatorname{Im}\left\{W_{256}^{p_{2}}\right\}$ |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| 0 | 0 | 0 | 0 | 0 |  | 0 | 0 | 0 | 0 | 010 | 0 | 0 | 0 | 0 |  |
| 1 | 0 | 0 | 0 | 0 |  | 0 | 1 | 0 | - | -10 | 0 | 0 | 1 | 0 |  |
| 2 | 0 | 0 | 0 | 0 |  | 1 | 0 | -1 | 10 | 00 | 01 | 1 | 0 | 0 |  |
| 3 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |  | D 0 | $0<1$ | -1 0 | 0 | $-1$ |  |  |

The detailed architecture of the $W_{256}^{i}$ CSD constant multiplier is shown in Fig. 3. The CSE block in Fig. 3 consists of adders and shifters. The rectangular boxes represent right shifters, which are realized by simple hard-wired connections. Three 4-to-1 and six 8-to-1 multiplexers are used to select the appropriate results.

## 4. Results and Comparison

To evaluate the proposed CSD constant complex multipliers, the proposed and conventional complex multipliers with a 12-bit wordlength were designed in Verilog HDL and synthesized for a Cyclone 10LP device with the Quartus Prime design tool.

Table 6 shows the results. The proposed complex multipliers for $W_{16}^{i}$ and $W_{256}^{i}$ reduce the gate count by about $76 \%$ and $34 \%$, respectively, compared with the conventional complex multiplier using modified Booth multipliers.

In addition, the proposed FFT and previous 256-point FFT designs for IEEE 802.16a WiMAX systems were designed in Verilog HDL and synthesized for the Cyclone 10LP.
Table 7 compares the proposed scheme with the other schemes. The proposed design achieves a $62 \%$ gate count reduction and a $64 \%$ memory reduction compared with the conventional design.

The throughput of the implementation is suitable for WiMAX applications, whose maximum sample rate is 32 MHz.

Table 6. Hardware comparison for $W_{16}^{i}$ and $W_{256}^{i}$

| Methods | Logic elements (normalized) |
| :---: | :---: |
| Conventional complex mul. | 2,090 (1) |
| $W_{16}^{i}$ using CSD mul. | 501 (0.24) |
| $W_{256}^{i}$ using cascade CSD mul. | 1,369 (0.66) |

Table 7. Hardware comparison of 256-point FFT

| Methods | Logic elements | Registers | Memory bits |
| :---: | :---: | :---: | :---: |
| Radix-$2^{4}$ | 9,450 (1) | 509 (1) | 19,044 (1) |
| Radix-$2^{4}$ [1] | 8,763 (0.93) | 509 (1) | 13,640 (0.72) |
| Radix-$2^{4}$ [8] | 4,981 (0.53) | 489 (0.96) | 12,900 (0.68) |
| Proposed | 3,578 (0.38) | 489 (0.96) | 6,756 (0.36) |
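The normalized values and the quoted reductions follow directly from the raw counts in Tables 6 and 7. The short sketch below recomputes them; the dictionary layout and the helper `reduction` are illustrative assumptions.

```python
# Figures copied from Tables 6 and 7.
table6 = {"conventional": 2090, "W16 CSD": 501, "W256 cascade CSD": 1369}
table7 = {  # (logic elements, registers, memory bits)
    "Radix-2^4": (9450, 509, 19044),
    "Radix-2^4 [1]": (8763, 509, 13640),
    "Radix-2^4 [8]": (4981, 489, 12900),
    "Proposed": (3578, 489, 6756),
}

def reduction(new, ref):
    """Percentage reduction of `new` relative to `ref`."""
    return 100.0 * (1.0 - new / ref)

ref = table7["Radix-2^4"]
prop = table7["Proposed"]
logic_red = reduction(prop[0], ref[0])     # about 62.1 %
memory_red = reduction(prop[2], ref[2])    # about 64.5 %
w16_red = reduction(table6["W16 CSD"], table6["conventional"])            # about 76.0 %
w256_red = reduction(table6["W256 cascade CSD"], table6["conventional"])  # about 34.5 %
```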

## 5. Conclusion

In this paper, we proposed a hardware-efficient FFT design method for WiMAX systems using the radix-$2^{4}$ algorithm and the SDF architecture. To reduce the hardware cost, we proposed CSD constant complex multipliers that replace the conventional complex multiplier and remove the look-up table for storing twiddle factors. By simulation, it was shown that the proposed FFT design method achieves about $62 \%$ reduction in gate count and $64 \%$ reduction in memory size compared with the previous schemes.

Fig. 3. Detailed structure of the cascade CSD constant complex multiplier for $W_{256}^{i}$.

## REFERENCES

[1] J. H. Kim and I. C. Park, "Long-point FFT processing based on twiddle factor table reduction", IEICE Trans. Fundam. Electron. Commun. Comput. Sci., vol. E90-A, no. 11, pp. 2526-2532, 2007.
[2] J. W. Cooley and J. W. Tukey, "An algorithm for the machine calculation of complex Fourier series", Math. Comput., vol. 19, no. 90, pp. 297-301, 1965.
[3] S. He and M. Torkelson, "Designing pipeline FFT processor for OFDM (de)modulation", Proc. URSI Int. Symp. Signals, Syst., Electron., 1998, pp. 257-262.
[4] J. Y. Oh and M. S. Lim, "New radix-2 to the 4th power pipeline FFT processor", IEICE Trans. Electron., vol. E88-C, no. 8, pp. 1740-1746, 2005.
[5] C. Wang, Y. Yan, and X. Fu, "A high-throughput low-complexity radix-$2^{2}$-$2^{2}$-$2^{3}$ FFT/IFFT processor with parallel and normal input/output order for IEEE 802.11ad systems", IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 23, no. 11, pp. 2728-2732, 2015.
[6] S. J. Huang and S. G. Chen, "A high-throughput radix-16 FFT processor with parallel and normal input/output ordering for IEEE 802.15.3c systems", IEEE Trans. Circuits Syst. I: Reg. Papers, vol. 59, no. 8, pp. 1752-1765, 2012.
[7] G. K. Ganjikunta and S. K. Sahoo, "An area-efficient and low-power 64-point pipeline Fast Fourier Transform for OFDM applications", Integration, the VLSI Journal, vol. 57, pp. 125-131, 2017.
[8] C. P. Fan, M. S. Lee, and G. A. Su, "A low multiplier and multiplication costs 256-point FFT implementation with simplified radix-$2^{4}$ SDF architecture", Proc. IEEE APCCAS, 2006, pp. 1935-1938.

## Author Biography

Jian Yu [Member]

- Jun. 2001: Hebei Normal Univ., Electronic Engr., BA
- Mar. 2008: Tianjin Polytechnic Univ., Electronic Engr., MS
- Mar. 2016 ~ current: Wonkwang Univ., Dept. of Electronic Engr., PhD course

Research Interests: Low-power system, VLSI Design

Kyung-Ju Cho [Member]

Research Interests: Low-power system, VLSI Design, SOC

