# Low-Complexity and Low-Power MIMO Symbol Detector for Mobile Devices with Two TX/RX Antennas

Soohyun Jang<sup>1</sup>, Seongjoo Lee<sup>2,\*</sup>, and Yunho Jung<sup>1</sup>

Abstract-In this paper, a low-complexity and low-power soft-output multiple-input multiple-output (MIMO) symbol detector is proposed for mobile devices with two transmit and two receive antennas. The proposed symbol detector supports both the spatial multiplexing mode and the spatial diversity mode in a single hardware design and achieves the optimal maximum likelihood (ML) performance. By applying a multi-stage pipeline structure and using a complex multiplier based on polar coordinates, the complexity of the proposed architecture is dramatically decreased. In addition, by applying a clock-gating scheme to the internal modules for the MIMO modes, the power consumption is reduced. The proposed symbol detector was designed using a hardware description language (HDL) and implemented using a 65 nm CMOS standard cell library. With the proposed architecture, the MIMO detector occupies an area of approximately $0.31 \ mm^2$ with 183K equivalent gates and achieves a throughput of 150 Mbps. The power estimation results show that the proposed MIMO detector can reduce the power consumption by a maximum of 85% for the various test cases.

*Index Terms*—MIMO, ML, spatial diversity, spatial multiplexing, symbol detector

E-mail : yjung@kau.ac.kr

### **I. INTRODUCTION**

The demand for higher data rates and improved multimedia services through wireless internet access has continued to increase. Accordingly, mobile devices such as smartphones, portable media players (PMPs), laptops, digital cameras, and tablet PCs with built-in 3rd Generation Partnership Project (3GPP) long-term evolution/advanced (LTE/A) and IEEE 802.16e/m mobile worldwide interoperability for microwave access (WiMAX) are gaining in popularity [1, 2].

In order to increase the data rate and link reliability, 3GPP LTE/A and IEEE 802.16e/m WiMAX systems incorporate MIMO transmission schemes [3, 4]. Since the hardware complexity increases with the number of transmit data streams and mobile devices have limited physical dimensions, a MIMO system with two antennas at both the transmitter and the receiver (2×2) is considered to be a practical solution for mobile devices. For this reason, this paper focuses on the efficient design of a $2\times 2$ MIMO symbol detector.

MIMO techniques can basically be classified into spatial diversity (SD) schemes and spatial multiplexing (SM) schemes [5]. In an SM scheme, since independent data streams are transmitted from the individual transmit antennas, the overall data rate increases significantly as the number of transmit antennas increases. Meanwhile, since SD systems transmit multiple streams bearing the same information, link reliability is considerably improved by the spatial diversity gain even though there is no increase in data rate.

In an SD scheme, the optimal ML symbol detection can be easily accomplished with a simple linear combination at the receiver [6]. However, since the ML detection (MLD) for SM schemes requires an exhaustive search over all symbols transmitted from all transmit antennas, its complexity is proportional to $M^{N_T}$, where M is the constellation size and $N_T$ is the number of transmit antennas, and thus grows exponentially with $N_T$. Therefore, its real-time implementation is infeasible when a large number of antennas is used together with a large constellation such as 64QAM.

Manuscript received Jun. 11, 2014; accepted Mar. 10, 2015

<sup>1</sup>School of Electronics, Telecommunication and Computer Engineering, Korea Aerospace University, Goyang-si, South Korea.

<sup>2</sup>Department of Information and Communication Engineering, Sejong University, Seoul, South Korea.

As an alternative to the MLD, the sphere detection (SPD) algorithm [7] was introduced and further discussed in various publications [8-11]. In order to avoid the exponential complexity of the MLD, the search for the closest lattice point is restricted to the vector constellation points that fall within a certain search sphere. This approach allows the ML solution to be found with only polynomial complexity for a sufficiently high signal-to-noise ratio (SNR) [9]. However, SPD has the disadvantage that its computational complexity varies with the signals and channels. Hence, the detection throughput is not fixed, which is undesirable for real-time hardware implementation. To resolve this problem, an MLD with QR decomposition and the M-algorithm (QRM-MLD) [12, 13] was proposed. At each search layer in QRM-MLD, only the best M candidates are kept for the next-level search; therefore, it has a fixed complexity and throughput, which suits a pipelined hardware implementation. However, since these tree-search-based algorithms rely on the computation of many path metrics by using QR decomposition, the complexity still increases exponentially with the number of transmit antennas [12].

In order to solve these complexity problems, vigorous research has been conducted in recent decades [14-18]. Among the results, a modified ML (MML) detection algorithm [14], which reduces the complexity by a factor of M, was proposed and applied to several implementations such as [15] and [17]. Since recent communication systems mostly support two transmit and two receive antennas for incorporation into a mobile device, MML detection is suitable for the symbol detector of those systems because its complexity is then proportional to only M. Moreover, MML detection does not require complex matrix computations such as QR decomposition.

Although MML detection has a lower complexity than the classical ML detection, its complexity and power consumption are still too high for real-time implementation in mobile devices, especially when supporting 64QAM, because 64 complex calculations of the Euclidean distance (ED) must be performed in parallel. Also, since SD schemes such as space-time block coding (STBC) and space-frequency block coding (SFBC) must be supported together with the SM scheme in most systems, an efficient hardware architecture is essential for the MIMO symbol detector.

In this paper, we propose a low-complexity and low-power 2×2 MIMO symbol detector supporting both SD and SM modes, and present its design and implementation results. By fully sharing the common function blocks and applying multi-stage pipelining, the proposed detector is implemented with very low complexity. Also, by applying a clock-gating scheme to the internal modules that are used only for the SM mode, the average power consumption of the proposed detector is dramatically decreased.

This paper is organized as follows: In Section II, the MIMO system model is presented, and ML and MML symbol detection algorithms are introduced in Section III. The hardware architecture for the proposed symbol detector is described in Section IV, and the implementation results are presented in Section V. Finally, Section VI concludes the paper.

### **II. SYSTEM MODEL**

Fig. 1 depicts the MIMO system model with 2 transmit and 2 receive antennas. The receive signal vector is given by

$$\begin{aligned} \mathbf{y} &= \mathbf{H}\mathbf{X} + \mathbf{N} \\ &= \begin{bmatrix} \mathbf{h}_1 & \mathbf{h}_2 \end{bmatrix} \mathbf{X} + \mathbf{N} \\ &= \begin{bmatrix} h_{11} & h_{21} \\ h_{12} & h_{22} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} + \begin{bmatrix} n_1 \\ n_2 \end{bmatrix}, \end{aligned}$$
(1)

where $x_j$ (j = 1, 2) is the signal transmitted from the *j*-th transmit antenna, $y_i$ (i = 1, 2) is the signal received at the *i*-th receive antenna, and $h_{ji}$ is the fading channel coefficient between the *j*-th transmit antenna and the *i*-th receive antenna. Also, $n_i$ is independent and identically distributed (*i.i.d.*) complex zero-mean Gaussian noise with variance $\sigma^2$ per dimension.



Fig. 1. MIMO system model with 2 transmit and 2 receive antennas.

### **III. SYMBOL DETECTION ALGORITHM**

#### 1. Soft-ML Symbol Detection Algorithm

The soft-output symbol detector generates the a posteriori probability of the received bit being a 0 or 1, expressed as a log likelihood ratio (LLR). From the received signal vector $\mathbf{y}$, the soft information about each coded bit comprising the transmitted symbol vector $\mathbf{X}$ is defined in the form of an LLR by

$$LLR(b_{k,l} | \mathbf{y}) = \log\left(\frac{\Pr[b_{k,l} = 1 | \mathbf{y}]}{\Pr[b_{k,l} = 0 | \mathbf{y}]}\right),$$
(2)

where $b_{k,l}$ is the *l*-th bit from the *k*-th transmit antenna for $l=1, \dots, \log_2 M$ and $k=1, \dots, N_T$, and $\log(\cdot)$ represents the natural logarithmic function. $\Pr[b_{k,l}=a|\mathbf{y}]$ denotes the conditional probability that $b_{k,l}$ takes the value $a$ for a given $\mathbf{y}$. By applying Bayes' rule and the max-log approximation, the LLR values for the soft-output ML symbol detector with two transmit and two receive antennas can be expressed as (3) and (4).

$$\begin{aligned} LLR(b_{1,l}) &= \log \left( \frac{\sum_{c \in C_l^{1}} \sum_{x_2 \in C} \exp\left[-\frac{|\mathbf{y} - \mathbf{h}_1 c - \mathbf{h}_2 x_2|^2}{2\sigma^2}\right]}{\sum_{c \in C_l^{0}} \sum_{x_2 \in C} \exp\left[-\frac{|\mathbf{y} - \mathbf{h}_1 c - \mathbf{h}_2 x_2|^2}{2\sigma^2}\right]}\right) \\ &= \arg\min_{c \in C_l^{0}, x_2 \in C} |\mathbf{y} - \mathbf{h}_1 c - \mathbf{h}_2 x_2|^2 - \arg\min_{c \in C_l^{1}, x_2 \in C} |\mathbf{y} - \mathbf{h}_1 c - \mathbf{h}_2 x_2|^2 \end{aligned}$$
(3)

$$\begin{aligned} LLR(b_{2,l}) &= \log \left( \frac{\sum_{c \in C_l^{1}} \sum_{x_1 \in C} \exp\left[-\frac{|\mathbf{y} - \mathbf{h}_1 x_1 - \mathbf{h}_2 c|^2}{2\sigma^2}\right]}{\sum_{c \in C_l^{0}} \sum_{x_1 \in C} \exp\left[-\frac{|\mathbf{y} - \mathbf{h}_1 x_1 - \mathbf{h}_2 c|^2}{2\sigma^2}\right]}\right) \\ &= \arg\min_{c \in C_l^{0}, x_1 \in C} |\mathbf{y} - \mathbf{h}_1 x_1 - \mathbf{h}_2 c|^2 - \arg\min_{c \in C_l^{1}, x_1 \in C} |\mathbf{y} - \mathbf{h}_1 x_1 - \mathbf{h}_2 c|^2 \end{aligned}$$
(4)

where *C* denotes the set consisting of all the constellation points. Also, the sets $C_l^0$ and $C_l^1$ include all the symbols whose *l*-th bit is 0 and 1, respectively.

As shown in (3) and (4), calculating the LLR values for every transmitted bit requires a joint search over $x_1$ and $x_2$, and its complexity increases rapidly as the number of transmit antennas and the constellation size increase. For example, in the case of $N_T=2$ and 64QAM, 4096 (=64<sup>2</sup>) ED calculations are required for each received signal vector. Therefore, its real-time implementation is very difficult.
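The joint search in (3) can be illustrated with a short pure-Python sketch. It is only an illustration under stated assumptions: the natural-binary bit labelling and the names `qam_points`, `euclidean_distance`, and `soft_ml_llr` are hypothetical and not taken from the paper, and the LLR is formed from the two minimum distance values without the $1/(2\sigma^2)$ scaling.

```python
import itertools

def qam_points(bits_per_axis):
    """Square QAM points keyed by an integer bit label.

    The natural-binary labelling is an assumption made for illustration;
    practical systems typically use Gray labelling.
    """
    size = 2 ** bits_per_axis
    levels = [2 * i - (size - 1) for i in range(size)]   # e.g. -7, -5, ..., 7
    return {(i << bits_per_axis) | q: complex(levels[i], levels[q])
            for i in range(size) for q in range(size)}

def euclidean_distance(y, h1, h2, x1, x2):
    """Squared distance |y - h1*x1 - h2*x2|^2 over two receive antennas."""
    return sum(abs(y[i] - h1[i] * x1 - h2[i] * x2) ** 2 for i in range(2))

def soft_ml_llr(y, h1, h2, points, n_bits):
    """Max-log LLRs for the bits of x1 via a joint search over (x1, x2)."""
    best = [[float("inf"), float("inf")] for _ in range(n_bits)]
    evaluations = 0
    for (label, c), x2 in itertools.product(points.items(), points.values()):
        ed = euclidean_distance(y, h1, h2, c, x2)
        evaluations += 1
        for l in range(n_bits):
            bit = (label >> l) & 1
            best[l][bit] = min(best[l][bit], ed)
    # Minimum distance with bit = 0 minus minimum distance with bit = 1.
    return [b0 - b1 for b0, b1 in best], evaluations
```

For 64QAM (`bits_per_axis=3`, `n_bits=6`) the loop evaluates 64 × 64 = 4096 distances per received vector, which is the count quoted above.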

#### 2. Soft-MML Symbol Detection Algorithm

The LLR values of the soft-output MML algorithm [14] can be expressed as (5) and (6).

$$\begin{aligned} LLR(b_{1,l}) &= \log \left( \frac{\sum_{c \in C_l^{1}} \exp\left[-\frac{|\mathbf{y} - \mathbf{h}_1 c - \mathbf{h}_2 x_2(c)|^2}{2\sigma^2}\right]}{\sum_{c \in C_l^{0}} \exp\left[-\frac{|\mathbf{y} - \mathbf{h}_1 c - \mathbf{h}_2 x_2(c)|^2}{2\sigma^2}\right]}\right) \\ &= \arg\min_{c \in C_l^{0}} |\mathbf{y} - \mathbf{h}_1 c - \mathbf{h}_2 x_2(c)|^2 - \arg\min_{c \in C_l^{1}} |\mathbf{y} - \mathbf{h}_1 c - \mathbf{h}_2 x_2(c)|^2 \end{aligned}$$
(5)

$$\begin{aligned} LLR(b_{2,l}) &= \log \left( \frac{\sum_{c \in C_l^{1}} \exp\left[-\frac{|\mathbf{y} - \mathbf{h}_1 x_1(c) - \mathbf{h}_2 c|^2}{2\sigma^2}\right]}{\sum_{c \in C_l^{0}} \exp\left[-\frac{|\mathbf{y} - \mathbf{h}_1 x_1(c) - \mathbf{h}_2 c|^2}{2\sigma^2}\right]}\right) \\ &= \arg\min_{c \in C_l^{0}} |\mathbf{y} - \mathbf{h}_1 x_1(c) - \mathbf{h}_2 c|^2 - \arg\min_{c \in C_l^{1}} |\mathbf{y} - \mathbf{h}_1 x_1(c) - \mathbf{h}_2 c|^2 \end{aligned}$$
(6)

The ML estimates $x_1(c)$ and $x_2(c)$ corresponding to each *c* in the set *C* can be calculated directly as

$$x_2(c) = Q\left(\frac{\mathbf{h}_2^H}{\|\mathbf{h}_2\|^2} [\mathbf{y} - \mathbf{h}_1 c]\right)$$
(7)

$$x_1(c) = Q\left(\frac{\mathbf{h}_1^H}{\|\mathbf{h}_1\|^2} [\mathbf{y} - \mathbf{h}_2 c]\right)$$
(8)

where $Q(\cdot)$ represents a slicing (quantization) function. Once the argument of the slicing function is determined, its output can be determined without iterating over the constellation points. This means that the LLR values can be calculated without the joint search, unlike soft-MLD. Consequently, the number of metric computations required by MML is $M^{N_T-1}$, whereas that required by ML is $M^{N_T}$. Even though the soft-MML algorithm reduces the computational complexity significantly, its complexity is still too high for real-time implementation, especially when supporting 64QAM. Therefore, an efficient architecture design for real-time implementation is required.

Fig. 2. Block diagram of the proposed symbol detector for 2×2 MIMO systems.
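The soft-MML computation of (5) and (7) can be sketched in the same spirit. The sketch is illustrative only: the function names, the natural-binary bit labelling, and the per-axis slicer for square QAM are assumptions, not details given in the paper.

```python
def qam_points(levels):
    """Square QAM points keyed by an integer label (assumed natural-binary)."""
    bits = (len(levels) - 1).bit_length()
    return {(i << bits) | q: complex(re, im)
            for i, re in enumerate(levels) for q, im in enumerate(levels)}

def slice_axis(value, levels):
    """Slicing function Q on one axis: the nearest constellation level."""
    return min(levels, key=lambda a: abs(value - a))

def soft_mml_llr(y, h1, h2, levels):
    """Max-log LLRs for the bits of x1, with x2(c) obtained by slicing as in (7)."""
    points = qam_points(levels)
    n_bits = 2 * (len(levels) - 1).bit_length()
    h2_energy = sum(abs(h) ** 2 for h in h2)
    best = [[float("inf"), float("inf")] for _ in range(n_bits)]
    evaluations = 0
    for label, c in points.items():
        r = [y[i] - h1[i] * c for i in range(2)]                 # y - h1*c
        z = sum(h2[i].conjugate() * r[i] for i in range(2)) / h2_energy
        x2 = complex(slice_axis(z.real, levels), slice_axis(z.imag, levels))
        ed = sum(abs(r[i] - h2[i] * x2) ** 2 for i in range(2))  # one ED per c
        evaluations += 1
        for l in range(n_bits):
            bit = (label >> l) & 1
            best[l][bit] = min(best[l][bit], ed)
    return [b0 - b1 for b0, b1 in best], evaluations
```

With 64QAM the loop runs 64 times instead of 4096. For a square constellation, the sliced $x_2(c)$ minimizes the distance for each fixed $c$, so the minimum distances obtained here coincide with those of the joint search.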

### **IV. HARDWARE ARCHITECTURE DESIGN FOR THE PROPOSED MIMO SYMBOL DETECTOR**

An efficient hardware structure of the soft-output MIMO symbol detector that supports all MIMO transmission modes is presented in this section. In order to achieve more reliable performance and higher-rate data transmission, the latest wireless communication systems specify support for SD modes such as single-input multiple-output (SIMO), multiple-input single-output (MISO), STBC, and SFBC as well as the SM mode. Designing an independent symbol detector for each mode would be inefficient. Instead, by sharing the function blocks commonly used by all MIMO modes, the complexity of the proposed architecture is decreased.

Fig. 2 shows the proposed hardware structure of the $2 \times 2$ MIMO symbol detector, and its timing diagram is depicted in Fig. 3. Tables 1 and 2 summarize the SM and SD detection procedures, which are optimized for hardware architecture design. The proposed MIMO symbol detector is composed of an input preprocessor module (IPM), a parameter calculation module (PCM), a decision variable calculation module (DVCM), an X2C calculation module (X2CCM), a Euclidean distance calculation module (EDCM), a 1-dimensional LLR calculation module (1DLCM) for the SD mode, a 2-dimensional LLR calculation module (2DLCM) for the SM mode, an 8-bit quantization module (QM), and a gated-clock generation module (GCGM). For real-time verification with a microprocessor, a bus interface is integrated with the proposed detector.

Table 1. Algorithm steps for SM detection

| Step | Module | Operation |
|------|--------|-----------|
| -    | Input  | $\mathbf{H} = \begin{pmatrix} h_{11} & h_{21} \\ h_{12} & h_{22} \end{pmatrix}$, $\mathbf{y} = \begin{pmatrix} y_1 \\ y_2 \end{pmatrix}$<br>$h_{ji}$: channel between the $j$-th TX and $i$-th RX antennas<br>$y_i$: RX signal from the $i$-th RX antenna |
| 1    | IPM    | Inputs of the PCM are set as in Table 3 |
| 2    | PCM    | $p_1 = \mathbf{a}^H \mathbf{b}$, $p_2 = \mathbf{c}^H \mathbf{d}$, $p_3 = \lVert \mathbf{e} \rVert^2$ |
| 3    | X2CCM  | $x_2(c_m) = Q\left(p_1 - p_2 c_m, p_3\right)$, $(m = 1, 2, \dots, M)$ |
| 4    | EDCM   | $e_m = \lVert \mathbf{y} - \mathbf{h}_1 c_m - \mathbf{h}_2 x_2(c_m) \rVert^2$ |
| 5    | 2DLCM  | $LLR = \operatorname*{argmin}_{c_m \in C_l^0} (e_m) - \operatorname*{argmin}_{c_m \in C_l^1} (e_m)$ |
| 6    | QM     | LLR values are quantized into 8 bits |



Fig. 3. Timing diagram of the proposed symbol detector for 2×2 MIMO systems.

Table 2. Algorithm steps for SD detection

| Step | Module | Operation                                                                                                                                                                                                                                                                                                      |
|------|--------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| -    | Input  | $\mathbf{H} = \begin{pmatrix} h_{11} & h_{21} \\ h_{12} & h_{22} \end{pmatrix}$, $\mathbf{y} = \begin{pmatrix} y_{11} & y_{12} \\ y_{21} & y_{22} \end{pmatrix}$<br>$h_{ji}$: channel between the $j$-th TX and $i$-th RX antennas<br>$y_{ik}$: RX signal from the $i$-th RX antenna |
| 1    | IPM    | Inputs of the PCM are set as in Table 3 |
| 2    | PCM    | $p_1 = \mathbf{a}^H \mathbf{b}$, $p_2 = \mathbf{c}^H \mathbf{d}$, $p_3 = \lVert \mathbf{e} \rVert^2$ |
| 3    | DVCM   | SISO/SIMO/MISO: $z$ (decision variable) = $p_1$, CSI (channel state information) = $p_3$<br>STBC/SFBC: $z = p_1 + p_2$, CSI = $p_3$ |
| 4    | 1DLCM  | LLR values are calculated by simplified demapping scheme in [20]                                                                                                                                                                                                                                               |
| 5    | QM     | LLR values are quantized into 8bits                                                                                                                                                                                                                                                                            |
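As an illustration of the SD procedure in Table 2, the following sketch evaluates the MISO rows of Table 3 for a noise-free link with two transmit antennas and one receive antenna. It assumes the standard Alamouti code, in which $(x_1, x_2)$ is sent in the first time unit and $(-x_2^*, x_1^*)$ in the second. This encoding convention and all function names are assumptions made for illustration, since the text does not spell them out.

```python
def inner(a, b):
    """a^H b for two-element complex vectors."""
    return sum(x.conjugate() * y for x, y in zip(a, b))

def alamouti_receive(h11, h21, x1, x2):
    """Noise-free signals at RX antenna 1 over two time units (assumed encoding)."""
    y11 = h11 * x1 + h21 * x2
    y12 = -h11 * x2.conjugate() + h21 * x1.conjugate()
    return y11, y12

def miso_decision_variables(h11, h21, y11, y12):
    """Return {t: (z, csi)} with z = p1 and csi = p3 for the MISO_t rows of Table 3."""
    b = [y11, y12.conjugate()]
    rows = {
        1: [h11, h21.conjugate()],     # a = e for MISO_1
        2: [h21, -h11.conjugate()],    # a = e for MISO_2
    }
    result = {}
    for t, a in rows.items():
        p1 = inner(a, b)               # decision variable z
        p3 = inner(a, a).real          # channel state information
        result[t] = (p1, p3)
    return result
```

In this noise-free setting, $z$ divided by the CSI returns $x_1$ for $t = 1$ and $x_2$ for $t = 2$.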

#### 1. Input Preprocessor Module (IPM)

The IPM sets the input data of the PCM for each MIMO mode by reordering the estimated channel matrix and the received signal vector as listed in Table 3. In particular, column switching of the channel matrix $\mathbf{H}$ is performed for multi-stage pipelining in the SM mode. Since vertical coding [19] for the SM mode is specified in most recent wireless communication standards, the LLR values are generated sequentially by column switching in the IPM, and the hardware blocks are fully shared to reduce the complexity of the proposed architecture.

#### 2. Parameter Calculation Module (PCM)

As shown in Fig. 4, the PCM calculates the parameters $p_1$, $p_2$, and $p_3$, which are required in common by both the SD and SM modes. In the SD mode, $p_1$ and $p_2$ are used to calculate the decision variables in the DVCM, and $p_3$ is used as the channel state information (CSI). In the SM mode, all of the parameters are passed to the X2CCM as input data.

**Table 3.** Data mapping scheme for the input of the PCM. The index *t* of MISO<sub>t</sub>, SD<sub>t</sub>, and SM<sub>t</sub> denotes the *t*-th time unit (t = 1, 2)

| Mode | a | b | c | d | e |
|------|---|---|---|---|---|
| SISO | $\begin{bmatrix} h_{11} \\ 0 \end{bmatrix}$ | $\begin{bmatrix} y_{11} \\ y_{12} \end{bmatrix}$ | - | - | $\begin{bmatrix} h_{11} \\ 0 \end{bmatrix}$ |
| SIMO | $\begin{bmatrix} h_{11} \\ h_{12} \end{bmatrix}$ | $\begin{bmatrix} y_{11} \\ y_{12} \end{bmatrix}$ | - | - | $\begin{bmatrix} h_{11} \\ h_{12} \end{bmatrix}$ |
| MISO<sub>1</sub> | $\begin{bmatrix} h_{11} \\ h_{21}^* \end{bmatrix}$ | $\begin{bmatrix} y_{11} \\ y_{12}^* \end{bmatrix}$ | - | - | $\begin{bmatrix} h_{11} \\ h_{21}^* \end{bmatrix}$ |
| MISO<sub>2</sub> | $\begin{bmatrix} h_{21} \\ -h_{11}^* \end{bmatrix}$ | $\begin{bmatrix} y_{11} \\ y_{12}^* \end{bmatrix}$ | - | - | $\begin{bmatrix} h_{21} \\ -h_{11}^* \end{bmatrix}$ |
| SD<sub>1</sub> | $\begin{bmatrix} h_{11} \\ h_{21}^* \end{bmatrix}$ | $\begin{bmatrix} y_{11} \\ y_{12}^* \end{bmatrix}$ | $\begin{bmatrix} h_{12} \\ h_{22}^* \end{bmatrix}$ | $\begin{bmatrix} y_{21} \\ y_{22}^* \end{bmatrix}$ | $\begin{bmatrix} h_{11} \\ h_{21}^* \end{bmatrix}$ |
| SD<sub>2</sub> | $\begin{bmatrix} h_{21} \\ -h_{11}^* \end{bmatrix}$ | $\begin{bmatrix} y_{11} \\ y_{12}^* \end{bmatrix}$ | $\begin{bmatrix} h_{22} \\ -h_{12}^* \end{bmatrix}$ | $\begin{bmatrix} y_{21} \\ y_{22}^* \end{bmatrix}$ | $\begin{bmatrix} h_{21} \\ -h_{11}^* \end{bmatrix}$ |
| SM<sub>1</sub> | $\begin{bmatrix} h_{21} \\ h_{22} \end{bmatrix}$ | $\begin{bmatrix} y_1 \\ y_2 \end{bmatrix}$ | $\begin{bmatrix} h_{21} \\ h_{22} \end{bmatrix}$ | $\begin{bmatrix} h_{11} \\ h_{12} \end{bmatrix}$ | $\begin{bmatrix} h_{21} \\ h_{22} \end{bmatrix}$ |
| SM<sub>2</sub> | $\begin{bmatrix} h_{11} \\ h_{12} \end{bmatrix}$ | $\begin{bmatrix} y_1 \\ y_2 \end{bmatrix}$ | $\begin{bmatrix} h_{11} \\ h_{12} \end{bmatrix}$ | $\begin{bmatrix} h_{21} \\ h_{22} \end{bmatrix}$ | $\begin{bmatrix} h_{11} \\ h_{12} \end{bmatrix}$ |

Fig. 4. Block diagram of PCM.



Fig. 5. Block diagram of X2CCM.

#### 3. X2C Calculation Module (X2CCM)

As shown in Fig. 5, the X2CCM consists of the polar-coordinate-based multiplier (PBM) and the slicer (quantization) module (SCM). The SCM produces the output $x_2(c_m)$, $m=1,2,\dots,M$, and is implemented without division operations by using a scaled constellation, as in (9).

$$x_{2}(c_{m}) = Q\left(\frac{\mathbf{h}_{2}^{H}}{\|\mathbf{h}_{2}\|^{2}} \left[\mathbf{y} - \mathbf{h}_{1}c_{m}\right]\right) = Q\left(p_{1} - p_{2}c_{m}, p_{3}\right)$$
(9)
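One way to realize the division-free slicing in (9) is to compare $p_1 - p_2c_m$ with decision thresholds scaled by $p_3$ instead of dividing by $p_3$. The sketch below shows this for one axis. The function name, the odd-integer levels $\pm 1, \dots, \pm 7$, and the linear threshold scan are assumptions made for illustration and do not describe the actual SCM circuit.

```python
def slice_scaled(u, p3, max_level=7):
    """Nearest odd level to u / p3 on one axis, without dividing.

    u is the real or imaginary part of p1 - p2*c_m, and p3 > 0 scales the
    decision thresholds -6, -4, ..., 6 of the unscaled constellation.
    """
    level = -max_level
    for threshold in range(-max_level + 1, max_level, 2):
        if u > threshold * p3:      # compare against the scaled threshold
            level += 2
    return level

def slice_symbol(u, p3, max_level=7):
    """Apply the scaled slicer to both axes of a complex value."""
    return complex(slice_scaled(u.real, p3, max_level),
                   slice_scaled(u.imag, p3, max_level))
```

Because the thresholds are small even integers, each product `threshold * p3` reduces to shifts and adds of $p_3$ in a hardware realization.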

In order to calculate $p_2c_m$ in (9), *M* complex multiplications must be performed in parallel, which makes the X2CCM costly to implement. For example, 64QAM requires 64 complex multiplications.

In the proposed architecture, the complex multiplier is replaced by the PBM shown in Fig. 6, which can be implemented simply with a sign inverter, shifters, and adders because the constellation points $c_m$ are constant and symmetric. In particular, the PBM is designed with a 4-stage pipeline architecture to reduce the complexity by sharing hardware resources. As shown in Fig. 7, it first calculates the $p_2c_m$ values corresponding to the four symbols in A1. Next, the $p_2c_m$ values corresponding to the four symbols in each of A2, A3, and A4 are obtained easily by rotating the four results from A1 by $\pi/2$, $\pi$, and $3\pi/2$, which is equivalent to the trivial multiplications by $i$, $-1$, and $-i$, respectively. These operations are performed within the first clock cycle. Similarly, during the following three clock cycles, the $p_2c_m$ values corresponding to B, C, and D are obtained by applying the same scheme repeatedly. Although this pipelining may degrade the throughput of the detector, the effect is practically negligible because the throughput bottleneck of the baseband modem is mostly in the forward error correction (FEC) module such as the turbo decoder. For example, when the proposed detector is applied to an LTE/WiMAX baseband processor including a turbo decoder with six iterations, timing analysis verified that the 4-stage pipelining of the PBM does not degrade the throughput performance.

Fig. 6. Block diagram of PBM.

Fig. 7. The proposed PBM operation on 64QAM constellation.

Fig. 8. Block diagram of EDCM.
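The rotation idea behind the PBM can be sketched with integer arithmetic. Only the sixteen first-quadrant products are computed with shifts and adds, and each is rotated three times by swapping and negating its parts. The function names are hypothetical, and the way the first-quadrant symbols are split into four groups (one group per value of the real part) is an assumption, since the actual grouping into A, B, C, and D is defined by Fig. 7.

```python
def times_odd(x, k):
    """x * k for k in {1, 3, 5, 7} using only shifts and one add or subtract."""
    if k == 1:
        return x
    if k == 3:
        return (x << 1) + x
    if k == 5:
        return (x << 2) + x
    if k == 7:
        return (x << 3) - x
    raise ValueError("unsupported constellation level")

def pbm_products(p_re, p_im):
    """p * c_m for all 64QAM points, keyed by (Re c_m, Im c_m)."""
    products = {}
    for a in (1, 3, 5, 7):              # one assumed group per clock cycle
        for b in (1, 3, 5, 7):          # four first-quadrant symbols per group
            # First-quadrant product (p_re + j*p_im) * (a + j*b).
            re = times_odd(p_re, a) - times_odd(p_im, b)
            im = times_odd(p_im, a) + times_odd(p_re, b)
            c_re, c_im = a, b
            for _ in range(4):          # rotations by 0, pi/2, pi, 3*pi/2
                products[(c_re, c_im)] = (re, im)
                c_re, c_im = -c_im, c_re    # c     <- j * c
                re, im = -im, re            # p * c <- j * (p * c)
    return products
```

Each pass of the outer loop yields 4 × 4 = 16 products, so four passes cover all 64 constellation points without a general-purpose multiplier.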

#### 4. ED Calculation Module (EDCM)

The EDCM calculates the Euclidean distance,  $e_m$ , which is given by

$$e_m = \left\| \mathbf{y} - \mathbf{h}_1 c_m - \mathbf{h}_2 x_2(c_m) \right\|^2.$$
 (10)

As shown in Fig. 8,  $\mathbf{h}_1 c_m$  and  $\mathbf{h}_2 x_2(c_m)$  are also computed by the PBM. The norm calculation can be approximated to avoid costly complex multiplications [21]:

$$\|y_m\| \approx \frac{3}{8} \left( |\Re(y_m)| + |\Im(y_m)| \right) + \frac{5}{8} \max\left( |\Re(y_m)|, |\Im(y_m)| \right).$$
(11)

This approximation causes negligible performance degradation, as shown in Fig. 9. In this simulation, a 2×2 SM-MIMO system with 16QAM and 64QAM was considered. Each path of the MIMO channel was configured with the International Telecommunication Union (ITU) Pedestrian-B model, and the paths were assumed to be uncorrelated. A soft-decision turbo code with a code rate of 1/2 and a block size of 408 bits was applied. Turbo decoding was performed by the maximum-logarithmic-MAP (MAX-LOG-MAP) algorithm with six iterations. Fig. 9 also shows the fixed-point simulation results for the proposed detector with the word lengths defined in Table 4. The results show that the proposed detector achieves almost the same performance as ML and MML.
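The approximation in (11) is easy to check numerically. The sketch below applies it to a single complex value, which is an assumption about how (11) is used inside the EDCM, and the function names are hypothetical. Because $\tfrac{3}{8}(a+b) + \tfrac{5}{8}\max(a,b) = \max(a,b) + \tfrac{3}{8}\min(a,b)$, the expression needs only shifts, adds, and a comparison.

```python
import math

def approx_magnitude(re, im):
    """(3/8)(|re| + |im|) + (5/8) max(|re|, |im|), as in (11)."""
    a, b = abs(re), abs(im)
    return 0.375 * (a + b) + 0.625 * max(a, b)

def worst_relative_error(steps=3600):
    """Largest deviation from the exact magnitude over the unit circle."""
    worst = 0.0
    for k in range(steps):
        theta = 2.0 * math.pi * k / steps
        re, im = math.cos(theta), math.sin(theta)
        exact = math.hypot(re, im)          # equals 1 up to rounding
        worst = max(worst, abs(approx_magnitude(re, im) - exact) / exact)
    return worst
```

On the unit circle the largest deviation is about 6.8 %, reached near $\tan\theta = 3/8$.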



**Fig. 9.** Performance evaluation results of the proposed MIMO symbol detector.

 Table 4. Word-length and SQNR analysis results for the proposed MIMO symbol detector

| Block | Word-length (bit) | SQNR (dB) |
|-------|-------------------|-----------|
| IPM   | I : 16 / Q : 16   | Infinite  |
| PCM   | I : 33 / Q : 33   | 56        |
| X2CCM | I : 20 / Q : 20   | 54        |
| EDCM  | I : 28 / Q : 28   | 52        |
| DVCM  | I : 17 / Q : 17   | 53        |
| 1DLCM | I : 24 / Q : 24   | 51        |
| 2DLCM | 19                | 50        |

#### 5. Gated-Clock Generation Module (GCGM)

From the complexity analysis of the internal modules in the proposed detector, it was confirmed that the X2CCM, EDCM, and 2DLCM occupy about 79% of the total complexity, as shown in Table 5. Since these modules are used only for SM symbol detection, they are placed in a separate clock domain and the clock-gating scheme [26] is applied in order to reduce the power consumption. For example, CLK\_SM is gated (not toggled) during SD detection, whereas it runs during SM detection. With this clock-gating scheme, the average power consumption of the proposed detector is dramatically decreased.

### **V. IMPLEMENTATION AND VERIFICATION RESULTS**

The MIMO symbol detector supporting all MIMO modes with the proposed architecture was designed in HDL and implemented using a 65 nm 1-poly 9-metal (1P9M) 1.2 V CMOS standard cell library. Table 5 lists the logic synthesis results for a 100 MHz operating clock frequency. Fig. 10 shows the layout of the proposed MIMO symbol detector with the dual-port RAM used for verification. The proposed MIMO symbol detector includes about 183K logic gates, occupies a core area of 0.31 $mm^2$, and achieves a throughput of 150 Mbps.

Table 5. Logic synthesis results of the proposed MIMO symbol detector

|       | Gate Count (K) | Prop. (%) |
|-------|----------------|-----------|
| IPM   | 5.3            | 2.9       |
| PCM   | 15.0           | 8.2       |
| X2CCM | 14.4           | 7.7       |
| EDCM  | 119.1          | 64.5      |
| 1DLCM | 2.0            | 1.2       |
| 2DLCM | 13.0           | 7.1       |
| QM    | 0.5            | 0.2       |
| Etc.  | 13.7           | 8.2       |
| Total | 183            | 100       |

Fig. 10. Layout of the proposed MIMO symbol detector.

Table 6 compares our design with the existing 2×2 MIMO symbol detectors [16, 22-25]. For a fair comparison, four normalized metrics are considered: normalized area (NA), normalized power (NP), normalized hardware efficiency (NHE), and normalized power efficiency (NPE):

|                     | [16]                       | [22]             | [23]             | [24]               | [25]               | This work          |
|---------------------|----------------------------|------------------|------------------|--------------------|--------------------|--------------------|
| MIMO Mode           | SM/SD                      | SM               | SM               | SM                 | SM                 | SM/SD              |
| TX/RX antennas      | 2×2                        | 2×2              | 2×2-8×8          | 2×2                | 2×2                | 1×1-2×2            |
| Modulation          | QPSK<br>16/64QAM           | QPSK<br>16/64QAM | 16/64QAM         | 64QAM              | B/QPSK<br>16/64QAM | B/QPSK<br>16/64QAM |
| Detection algorithm | MFCSO                      | LORD             | SSFE             | K-Best             | SQRDML             | MML                |
| Soft/Hard output    | Soft                       | Soft             | Soft             | Soft               | Soft               | Soft               |
| Performance         | Near ML<br>(w/ strong FEC) | Near ML          | Near ML          | Not ML             | Not ML             | Equal to ML        |
| Process             | 65 nm                      | 65 nm            | 65 nm            | 130 nm             | 180 nm             | 65 nm              |
| Max. clock rate     | 300 MHz                    | 80 MHz           | 400 MHz          | 287 MHz            | 40 MHz             | 100 MHz            |
| Gate count          | 90 K                       | 408 K            | 63 K             | 24 K \*\*          | 279 K              | 183 K              |
| Throughput          | 225 Mbps                   | 240 Mbps         | 75 Mbps          | 107 Mbps           | N.A.               | 150 Mbps           |
| Area                | $0.37 \ mm^2$              | $0.64 \ mm^2$    | $0.09 \ mm^2$    | N.A.               | N.A.               | $0.31 \ mm^2$      |
| Power               | N.A.                       | 38 mW<br>@ 1.2 V | 9 mW<br>@ 1.08 V | 54.4 mW<br>@ 1.5 V | N.A.               | 12.2 mW<br>@ 1.2 V |
| NA                  | $0.37 \ mm^2 *$            | $0.64 \ mm^2$    | $0.09 \ mm^2$    | N.A.               | N.A.               | $0.31 \ mm^2$      |
| NP                  | N.A.                       | 38 mW            | 11.1 mW          | 17.4 mW            | N.A.               | 12.2 mW            |
| NHE [Mbps/kGE]      | 2.5                        | 0.58             | 1.19             | 8.92               | N.A.               | 0.81               |
| NPE [Mbps/mW]       | N.A.                       | 6.32             | 8.33             | 3.93               | N.A.               | 12.30              |

Table 6. Comparison results of the proposed detector and existing detectors

$$NA = Area \cdot \left(\frac{0.065}{Process}\right)^2, \tag{12}$$

$$NP = Power \cdot \left(\frac{1.2}{Voltage}\right)^2 \cdot \left(\frac{0.065}{Process}\right), \tag{13}$$

$$NHE = \frac{Throughput \cdot \left(\frac{Process}{0.065}\right)}{Gate\ Count}, \tag{14}$$

$$NPE = \frac{Throughput \cdot \left(\frac{Process}{0.065}\right)}{Power \cdot \left(\frac{1.2}{Voltage}\right)^2}. \tag{15}$$
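As a minimal sketch, the four normalized metrics can be evaluated directly from (12)-(15). The function names and unit choices below are assumptions made for illustration: the process is given in µm (so that 65 nm is 0.065), the voltage in V, the area in $mm^2$, the power in mW, the throughput in Mbps, and the gate count in kGE.

```python
REF_PROCESS_UM = 0.065  # reference process (65 nm), in micrometres
REF_VOLTAGE_V = 1.2     # reference supply voltage, in volts


def normalized_area(area_mm2, process_um):
    """NA as in (12): area scaled quadratically to the reference process."""
    return area_mm2 * (REF_PROCESS_UM / process_um) ** 2


def normalized_power(power_mw, voltage_v, process_um):
    """NP as in (13): power scaled by voltage squared and by process."""
    return (power_mw
            * (REF_VOLTAGE_V / voltage_v) ** 2
            * (REF_PROCESS_UM / process_um))


def normalized_hw_efficiency(throughput_mbps, gate_count_k, process_um):
    """NHE as in (14): process-scaled throughput per kilo-gate."""
    return throughput_mbps * (process_um / REF_PROCESS_UM) / gate_count_k


def normalized_power_efficiency(throughput_mbps, power_mw, voltage_v, process_um):
    """NPE as in (15): process-scaled throughput per voltage-scaled mW."""
    scaled_throughput = throughput_mbps * (process_um / REF_PROCESS_UM)
    scaled_power = power_mw * (REF_VOLTAGE_V / voltage_v) ** 2
    return scaled_throughput / scaled_power


if __name__ == "__main__":
    # Figures for the proposed detector, taken from Table 6.
    print(normalized_area(0.31, 0.065))
    print(normalized_power(12.2, 1.2, 0.065))
    print(normalized_hw_efficiency(150, 183, 0.065))
    print(normalized_power_efficiency(150, 12.2, 1.2, 0.065))
```

Because the proposed detector is already implemented in the reference technology, its scaling factors are all one, so NA and NP equal the raw area and power; the factors only change the figures of designs in other processes or at other supply voltages.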

Unlike the previous works, the proposed MIMO detector supports all MIMO modes, namely SM, SISO, SIMO, MISO, and SD (STBC/SFBC), with the optimal ML performance. It also shows the best power efficiency among the compared designs. Although MFCSO [16] and LORD [22] achieve near-ML performance, the proposed detector outperforms them with respect to area and power consumption. SSFE [23] also shows near-ML performance; however, its throughput and NPE are lower than those of the proposed detector. Even though the NHE of the detector in [24] is higher than that of the proposed detector, the results in [24] do not include the channel pre-processor. In addition, that detector does not achieve ML performance and has a lower NPE. SQRDML [25] requires a larger gate count without achieving ML performance.

In order to evaluate the power consumption of the proposed MIMO detector, 10 test scenarios containing varying numbers of SM slots are defined, as shown in Table 7. The test vector for each scenario consists of 10 slots, and each slot includes 48 symbols. Test scenario $k$ ($k$ = 0, ..., 9) contains $k$ SM slots; for example, test scenario 3 includes 3 SM slots. As shown in Fig. 11(a), with the non-gated clock scheme the SM blocks consume most of the power even when there is no SM symbol. With the gated clock scheme, however, the power consumption is dramatically reduced when the number of SM slots is small, as shown in Fig. 11(b). Table 8 summarizes the evaluation results: the MIMO detector with the gated clock scheme reduces the average power consumption by 4.17-85.35% compared with the detector without clock gating.

| Scenario No. | 0     | 1     | 2     | 3     | 4     | 5     | 6     | 7     | 8     | 9     |
|--------------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| Slot 0       | SD    | SD    | SD    | SM    | SM    | SM    | SD    | SM    | SM    | SM    |
|              | 64QAM |
| Slot 1       | SISO  | SISO  | SISO  | SISO  | SISO  | SISO  | SM    | SM    | SM    | SM    |
|              | QPSK  |
| Slot 2       | SIMO  | SM    |
|              | QPSK  | 16QAM | 16QAM | 16QAM | 16QAM | QPSK  | 16QAM | 16QAM | 16QAM | 16QAM |
| Slot 3       | SISO  | SM    |
|              | 64QAM |
| Slot 4       | SD    | SD    | SD    | SD    | SM    | SM    | SM    | SM    | SM    | SM    |
|              | 16QAM | 16QAM | 16QAM | 16QAM | 16QAM | QPSK  | 16QAM | 16QAM | 16QAM | 16QAM |
| Slot 5       | MISO  | SM    | MISO  |
|              | 16QAM | 16QAM | 16QAM | 16QAM | 16QAM | QPSK  | 16QAM | 16QAM | 16QAM | 16QAM |
| Slot 6       | MISO  | MISO  | SM    | MISO  | MISO  | MISO  | MISO  | SM    | MISO  | SM    |
|              | QPSK  | QPSK  | 64QAM | QPSK  | QPSK  | QPSK  | QPSK  | 64QAM | QPSK  | QPSK  |
| Slot 7       | SIMO  | SIMO  | SIMO  | SIMO  | SIMO  | SM    | SM    | SIMO  | SM    | SM    |
|              | 16QAM | 16QAM | 16QAM | 16QAM | 16QAM | QPSK  | 16QAM | 16QAM | 16QAM | 16QAM |
| Slot 8       | SD    | SD    | SD    | SM    | SM    | SD    | SM    | SM    | SM    | SM    |
|              | QPSK  |
| Slot 9       | SISO  | SISO  | SISO  | SISO  | SISO  | SM    | SM    | SM    | SM    | SM    |
|              | 16QAM | 16QAM | 16QAM | 16QAM | 16QAM | QPSK  | 16QAM | 16QAM | 16QAM | 16QAM |

Table 7. Test scenario 0 - 9 for estimating the average power consumption

Table 8. Comparison of average power consumption for test scenarios

| Scenario No.                | 0     | 1     | 2     | 3     | 4     | 5     | 6     | 7     | 8    | 9    |
|-----------------------------|-------|-------|-------|-------|-------|-------|-------|-------|------|------|
| Non-gated clock scheme (mW) | 12.3  | 13.7  | 15.4  | 16.9  | 18.5  | 20.2  | 21.6  | 23.3  | 24.9 | 26.4 |
| Gated clock scheme (mW)     | 1.77  | 4.8   | 7.04  | 9.51  | 12.2  | 14.9  | 17.4  | 20.1  | 22.7 | 25.3 |
| Reduction ratio (%)         | 85.35 | 64.96 | 54.29 | 43.73 | 34.05 | 26.24 | 19.44 | 13.73 | 8.84 | 4.17 |
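The reduction ratio in Table 8 appears to be the relative power saving of the gated clock scheme with respect to the non-gated one. The helper below is a minimal sketch under that assumption; the function name is hypothetical, and both powers are taken in mW.

```python
def reduction_ratio_percent(non_gated_mw, gated_mw):
    """Relative power saving of the gated scheme, in percent."""
    if non_gated_mw <= 0:
        raise ValueError("non-gated power must be positive")
    return 100.0 * (non_gated_mw - gated_mw) / non_gated_mw


if __name__ == "__main__":
    # Average powers for test scenarios 1 and 9, taken from Table 8.
    print(round(reduction_ratio_percent(13.7, 4.8), 2))
    print(round(reduction_ratio_percent(26.4, 25.3), 2))
```

The saving shrinks as the number of SM slots grows, because the SM-only modules must then be clocked for a larger share of the test vector.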



Fig. 11. Power consumption for test scenarios (a) Non-gated clock scheme, (b) Gated clock scheme.

#### **VI. CONCLUSIONS**

In this paper, a low-complexity and low-power hardware architecture is proposed for a soft-output MIMO symbol detector that supports all MIMO modes, including SD and SM. The implementation results show that the hardware complexity is significantly reduced by the proposed architecture, which uses multi-stage pipelining and simplified multiplication based on the polar coordinate. In addition, with the clock-gating scheme applied to the most complex modules, which are used only for SM detection, the power consumption is decreased by a maximum of 85.35%. Since recent wireless systems specify support for both SD and SM modes and must be implemented with low complexity and low power consumption, the proposed MIMO symbol detector is well suited to those systems.

#### **ACKNOWLEDGMENTS**

This work was supported by the Technology Innovation Program, 10049009, funded by the Ministry of Trade, Industry and Energy (MOTIE, Korea).

#### REFERENCES

- [1] Q. Li, et al., "MIMO techniques in WiMAX and LTE: A feature overview," *Communications, IEEE Magazine*, Vol.48, No.5, pp.86–92, May 2010.

- [2] S. Srikanth, P.A. Murugesa Pandian, and X. Fernando, "Orthogonal frequency division multiple access in WiMAX and LTE: a comparison," *Communications, IEEE Magazine*, Vol.50, No.9, pp.153–161, Sep. 2012.
- [3] 3GPP, "Evolved universal terrestrial radio access (E-UTRA); physical channels and modulation (Release 10)," TS 36.211, V10.2.0, Jun. 2011.
- [4] IEEE 802.16-2005, "Part 16: air interface for fixed and mobile broadband wireless access systems amendment 2: physical and medium access control layers for combined fixed and mobile operation in licensed bands and corrigendum 1," Feb. 2006.
- [5] A. J. Paulraj, D. A. Gore, R. U. Nabar, and H. Bölcskei, "An overview of MIMO communications —A key to gigabit wireless," *Proceedings of the IEEE*, Vol.92, No.2, pp.198–218, Feb. 2004.
- [6] J. Soler-Garrido, D. Milford, M. Sandell, and H. Vetter, "Implementation and evaluation of a high-performance MIMO detector for wireless LAN systems," *Consumer Electronics, IEEE Transactions on*, Vol.57, No.4, pp.1519–1527, Nov. 2011.
- [7] W. Zhao and G.B. Giannakis, "Sphere decoding algorithms with improved radius search," *Communications, IEEE Transactions on*, Vol.53, No.7, pp.1104–1109, Jul. 2005.
- [8] Z. Guo and P. Nilsson, "Algorithm and implementation of the K-best sphere decoding for MIMO detection," *Selected Areas in Communications, IEEE Journal of*, Vol.24, No.3, pp.491–503, Mar. 2006.
- [9] C. Studer, A. Burg, and H. Bölcskei, "Soft-output sphere decoding: Algorithms and VLSI implementation," *Selected Areas in Communications, IEEE Journal of*, Vol.26, No.2, pp.290–300, Feb. 2008.
- [10] L. Liu, J. Lofgren, and P. Nilsson, "Low-complexity likelihood information generation for spatial-multiplexing MIMO signal detection," *Vehicular Technology, IEEE Transactions on*, Vol.61, No.2, pp.607–617, Feb. 2012.
- [11] S. L. Shieh, R. D. Chiu, S. L. Feng, and P. N. Chen, "Low-complexity soft-output sphere decoding with modified repeated tree search strategy," *Communications Letters, IEEE*, Vol.17, No.1, pp.51–54, Jan. 2013.

- [12] H. Kawai, K. Higuchi, N. Maeda, and M. Sawahashi, "Adaptive control of surviving symbol replica candidates in QRD-MLD for OFDM MIMO multiplexing," *Selected Areas in Communications, IEEE Journal of*, Vol.24, No.6, pp.1130-1140, Jun. 2006.
- [13] H. Lee, M. Baek, J. Kim, and H. Song, "Efficient detection scheme in MIMO-OFDM for high speed wireless home network system," *Consumer Electronics, IEEE Transactions on*, Vol.55, No.2, pp.507-512, May 2009.
- [14] S. Yu, T. Im, C. Park, J. Kim, and Y. Cho, "An FPGA implementation of MML-DFE for spatially multiplexed MIMO systems," *Circuits and Systems II, IEEE Transactions on*, Vol.55, No.7, pp.705–709, Jul. 2008.
- [15] S. Jang and Y. Jung, "Efficient symbol detector for MIMO communication systems," *Wireless and Mobile Communications 2011, ICWMC 2011, IARIA International Conference on*, pp.182-187, Jun. 2011.
- [16] D. Wu, J. Eilert, R. Asghar, M. Ge, and D. Liu, "VLSI Implementation of a Multi-Standard MIMO Symbol Detector for 3GPP LTE and WiMAX," *Wireless Telecommunications Symposium 2010, WTS 2010, IEEE International Conference on*, pp.1-4, Apr. 2010.
- [17] C. Huang, C. Yu, and H. Ma, "A Power-Efficient Configurable Low-Complexity MIMO Detector," *Circuits and Systems I, IEEE Transactions on*, Vol.56, No.2, pp.485-496, Feb. 2009.
- [18] K. Kim, Y. Jung, S. Lee, and J. Kim, "Efficient list extension algorithm using multiple detection order for soft-output MIMO detection," *Communications, IEICE Transactions on*, Vol.E95-B, No.3, pp.898-912, Mar. 2012.
- [19] B. Vucetic and J. Yuan, *Space-Time Coding*, Wiley, 2003.
- [20] F. Tosato and P. Bisaglia, "Simplified soft-output demapper for binary interleaved COFDM with application to HIPERLAN/2," *Communications 2002, ICC 2002, IEEE International Conference on*, Vol.2, pp.664-668, May 2002.
- [21] A. Adjoudani, E. Beck, A. Burg, G.M. Djuknic, T. Gvoth, D. Haessig, S. Manji, M. Milbrodt, M. Rupp, D. Samardzija, A. Siegel, T. Sizer II, C. Tran, S. Walker, S.A. Wilkus, and P. Wolniansky, "Prototype experience for MIMO BLAST over third-generation wireless system," *Selected Areas in Communications, IEEE Journal of*, Vol.21, No.3, pp.440-451, Apr. 2003.

- [22] M. Arora, *The Art of Hardware Architecture: Design Methods and Techniques for Digital Circuits*, Springer, 2012.
- [23] T. Cupaiuolo, M. Siti, and A. Tomasoni, "Low-complexity high throughput VLSI architecture of soft-output ML MIMO detector," *Design, Automation & Test in Europe Conference & Exhibition 2010, DATE 2010, IEEE International Conference on*, pp.1396-1401, Mar. 2010.
- [24] R. Fasthuber, M. Li, D. Novo, P. Raghavan, L. Van Der Perre, and F. Catthoor, "Novel energy-efficient scalable soft output SSFE MIMO detection architectures," *Systems, Architectures, Modeling, and Simulation 2009, SAMOS 2009, IEEE International Conference on*, pp.20-23, Jul. 2009.
- [25] N. Moezzi-Madani, T. Thorolfsson, J. Crop, P. Chiang, and W.R. Davis, "An energy-efficient 64-QAM MIMO detector for emerging wireless standards," *Design, Automation & Test in Europe Conference & Exhibition 2011, DATE 2011, IEEE International Conference on*, pp.1-6, Mar. 2011.
- [26] J. Im, M. Cho, Y. Jung, Y. Jung, and J. Kim, "A Low-power and Low-complexity Baseband Processor for MIMO-OFDM WLAN Systems," *Signal Processing Systems, Springer Journal of*, Vol.68, No.1, pp.19-30, Jul. 2012.



**Soohyun Jang** received the B.S. and M.S. degrees in the School of Electronics, Telecommunication, and Computer Engineering from Korea Aerospace University, Goyang, Korea, in 2009 and 2011, respectively. He is currently working towards the Ph.D. degree in electronic engineering at Korea Aerospace University. His research interests include signal processing algorithms and VLSI implementation for wireless communication systems.



**Seongjoo Lee** received his B.S., M.S., and Ph.D. degrees in electrical and electronic engineering from Yonsei University, Seoul, Korea, in 1993, 1998, and 2002, respectively. From 1993 to 1996, he served as an officer in the Korean Air Force. From 2002 to 2003, he was a senior research engineer at the IT SoC Research Center and the ASIC Research Center, Yonsei University, Seoul, Korea. From 2003 to 2005, he was a senior engineer in the Core Tech Sector, Visual-Display Division, Samsung Electronics Co. Ltd., Suwon, Korea. He was a research professor at the IT Center and the IT SoC Research Center, Yonsei University, Seoul, Korea, from 2005 to 2006. He is currently an associate professor in the Department of Information and Communication Engineering at Sejong University, Seoul, Korea. His current research interests include PN code acquisition algorithms, cdma2000 modem SoC design, CDMA communication, and SoC design for image processing.



**Yunho Jung** received the B.S., M.S., and Ph.D. degrees in the Department of Electrical and Electronic Engineering from Yonsei University, Seoul, Korea, in 1998, 2000, and 2005, respectively. From 2005 to 2007, he was a senior engineer in the Wireless Device Solution Team, Communication Research Center, Telecommunication Network Division, Samsung Electronics Co. Ltd., Suwon, Korea. From 2007 to 2008, he was a research professor at the Institute of TMS Information Technology, Yonsei University, Seoul, Korea. He is currently an associate professor in the School of Electronics, Telecommunication, and Computer Engineering, Korea Aerospace University, Goyang, Korea. His research interests include signal processing algorithms and VLSI implementation for wireless communication systems and image processing systems.