
A Reinforcement Learning Framework for Autonomous Cell Activation and Customized Energy-Efficient Resource Allocation in C-RANs

  • Sun, Guolin (School of Computer Science and Engineering, University of Electronic Science and Technology of China) ;
  • Boateng, Gordon Owusu (School of Computer Science and Engineering, University of Electronic Science and Technology of China) ;
  • Huang, Hu (School of Computer Science and Engineering, University of Electronic Science and Technology of China) ;
  • Jiang, Wei (German Research Center for Artificial Intelligence (DFKI GmbH))
  • Received : 2018.12.12
  • Accepted : 2019.03.13
  • Published : 2019.08.31

Abstract

Cloud radio access networks (C-RANs) have been regarded in recent times as a promising concept for future 5G technologies, where all DSP processors are moved into a central baseband unit (BBU) pool in the cloud, and distributed remote radio heads (RRHs) compress and forward received radio signals from mobile users to the BBUs through radio links. In such a dynamic environment, automatic decision-making approaches, such as artificial-intelligence-based deep reinforcement learning (DRL), become imperative in designing new solutions. In this paper, we propose a generic framework of autonomous cell activation and customized physical resource allocation schemes for energy consumption and QoS optimization in wireless networks. We formulate the problem as fractional power control with bandwidth adaptation and full power and bandwidth allocation models and set up a Q-learning model to satisfy the QoS requirements of users and to achieve low energy consumption with the minimum number of active RRHs under varying traffic demand and network densities. Extensive simulations are conducted to show the effectiveness of our proposed solution compared to existing schemes.


1. Introduction

Research on the fifth-generation (5G) mobile cellular communication technology indicates that the traffic density in crowded cities or hotspot areas will reach 20 Tbps/km2 in the near future. It is expected that by 2020, the mobile internet will need to deliver 1 GB of personalized data per user per day. Furthermore, traffic by 2030 is predicted to be up to 10,000 times greater than in 2010, and 100 Mbps end-user services will have to be supported [1]. To support such demand, future mobile cellular networks are expected to be deployed in a very dense and multi-layered way. The ultra-dense small cell network (UDN) is considered one of the most promising approaches to meeting the traffic volume requirement of 5G. It is realized simply by the dense deployment of small cells in hotspots, where immense traffic is generated [1]. However, this triggers a proportional increase in energy consumption. From the perspective of network operators, the increasing energy costs cannot sustain future network operations. From the environmental point of view, "greenness" can be more meaningful with a comprehensive evaluation that includes both energy savings and network performance, which is the basis for energy efficiency (EE) metrics.

Cloud radio access networks (C-RANs) have been proposed and regarded as a promising concept in the information and communications technology (ICT) area, where baseband units (BBUs) and radios are separated [2]. All DSP processors are moved into a central BBU pool in the cloud, and the distributed remote radio heads (RRHs) take the responsibility of compressing and forwarding the received radio signals from mobile users to the BBUs through radio links. This reduces the overall capital and operational costs and makes large-scale, high-density network deployments possible. In particular, this centralized architecture makes it easy to collect and analyze runtime system statistics, which motivates us to seek autonomous schemes for network energy management.

Recently, reinforcement learning (RL) has been advocated as a viable technology to enhance resource utilization. RL is a machine learning technique in which a learning agent has no prior knowledge of the environment. To obtain low energy consumption on the RRHs and satisfy the QoS requirements of users under varying traffic demand and network densities, RL techniques are well suited to switching the RRHs on or off at defined time steps. While traditional network optimization solutions such as greedy linear programming and greedy search satisfy instantaneous requirements of the system, RL agents survey the entire network taking into account every possible state [3]. For dynamic systems whose conditions change periodically, the agent selects the most appropriate policy for allocating resources in real time. In the context of the C-RAN architecture, the agent can be trained through each learning stage and then uses the trained data to determine the state of each RRH in each decision epoch, implementing continuous control. This paper develops an energy-efficiency framework in which RL techniques determine the power consumption state (sleep or active) of each RRH. The idea is to develop an autonomous cell activation scheme and a customized physical resource allocation scheme that achieve an optimal network structure and reduce power consumption. The proposed framework is realized in two steps: first, we identify the active and inactive RRHs using a Q-learning based algorithm; second, we set up a flexible resource allocation module based on the active RRH set by optimizing power and bandwidth allocation and control.

Unlike other related works [4], [5], where power consumption is optimized over the current timeslot or time frame, we present a cell activation RL-based framework which makes a sequence of resource allocation decisions to minimize total energy consumption over the whole operational period. To solve this problem efficiently, we first use a Q-learning method to solve the cell activation problem and then formulate the resource allocation problem for users as a convex optimization problem. Our motivation is to achieve a balance between EE and QoS that satisfies both infrastructure providers (InPs) and mobile users via flexible power and bandwidth control or allocation, decoupled from cell activation techniques in a dense C-RAN system. In this paper, our main contributions can be summarized as:

  • We propose an autonomous energy management framework using cell activation techniques for the customized network. We design a Q-learning model with a reduced size of state space set considering varying resource demand and user population.
  • In this framework, we formulate the EE-QoS optimization as two models: fractional power control with bandwidth adaptation, and full power and bandwidth allocation.
  • Considering physical resource allocation for a customized network, we optimize power and bandwidth jointly. We formulate the problem as a convex optimization with the aim of satisfying QoS requirements of user equipment (UEs) with the minimum number of active RRHs.

The remainder of this paper is organized as follows. In Section II, we present related works. Section III presents the system model in terms of the network model, traffic model, energy model and utility model. Section IV provides the problem formulation and our proposed RL-based autonomous energy management framework. Simulation results and analysis are discussed in Section V. We conclude this work in Section VI.

2. Related Work

The EE and QoS performance metric has become a design goal as the discussion on energy consumption continues to grow across every field. It has become a requirement for network engineers and scientists to develop systems that manage energy efficiently. The authors in [6] studied energy-efficient wireless communications and identified energy-efficient resource allocation as one of the key challenges of 5G. In C-RAN, the baseband and processing functionality of a network are virtualized and shared among physical units. This architecture improves energy efficiency in the sense that the RRHs have fewer functions. In [7], the authors considered RRH selection and power minimization jointly as the resource allocation problem in group sparse beamforming for green Cloud-RAN. The authors extended their work in [8] to reduce the computational complexity of selecting RRHs using Lagrangian dual methods. In [9], the effect of optimizing data-sharing and compression on energy efficiency was studied in C-RAN. By minimizing the total power consumption in the network, they showed that higher energy efficiency depends on the user target rate.

Intuitively, a high density of active small cell base stations (sc-BSs) results in severe interference and inefficient energy consumption. Inter-cell or inter-tier interference mitigation is the key to improving EE performance. Therefore, Luo et al. in [4] proposed a joint downlink and uplink mobile user access point (MU-AP) association and beamforming design for interference management and energy minimization in C-RAN. The authors in [10] also proposed an enhanced soft fractional frequency reuse scheme, in which they formulated a joint optimization problem over resource block assignment and power allocation for interference mitigation in order to maximize EE performance in heterogeneous C-RAN. The joint rate allocation, routing, scheduling, power control and channel assignment problem was investigated in [5] with the aim of maximizing throughput and achieving fairness among users. Joint optimization via cell activation or cell coverage adjustment, user association, and sub-carrier allocation has been investigated in [11], [12], under the constraints of maintaining an average sum rate and rate fairness. The authors argued that energy consumption depends on both the spatio-temporal variations of traffic demand and the internal hardware components of the sc-BS.

Several recent studies also suggested a scheme known as multiple base station scheduling (MBSS) [13]. Due to the computational complexity of MBSS, the authors in [14] proposed a low-complexity flexible flow scheduling algorithm to compensate for the energy consumption caused by the increasing number of ultra-dense nodes. Trade-offs between QoS and EE for users with different traffic types were presented in [15]. In [16], the authors studied the user association problem aiming at maximizing the network energy efficiency for the downlink of heterogeneous networks (HetNets). The goal of minimizing the system energy consumption while maximizing the peak signal-to-noise ratio was considered in [17], but only for QoE-aware energy efficiency (QEE) and QoE-aware spectral efficiency (QSE).

Reinforcement learning can be widely utilized in many applications with different optimization objectives, such as resource allocation in data centers, residential smart grids, embedded system power management and autonomous control [18]. The work in [18] developed a framework for solving the overall resource allocation and power management problem in cloud computing systems using deep reinforcement learning. Shams et al. [19] proposed a Q-learning-based algorithm to achieve both energy efficiency and a high overall data rate. Xu et al. [20] proposed a framework which uses reinforcement learning to achieve an optimal solution for power-efficient resource allocation in the beamforming problem.

To the best of our knowledge, there is a lack of solutions that maximize EE performance in C-RANs where power and bandwidth are optimized jointly. The authors in [21] investigated energy-efficient power allocation and wireless backhaul bandwidth allocation in heterogeneous small cells. They formulated the problem as a non-convex non-linear programming problem and decomposed it into two sub-problems. Then, they proposed a sub-optimal low-complexity algorithm to solve the bandwidth allocation problem and a near-optimal iterative resource allocation algorithm. However, these algorithms are still model-based methods, which cannot autonomously produce optimal solutions over sequential time steps. In this paper, based on traffic load prediction results and current information, the power manager adopts the model-free RL technique to adaptively determine the suitable action for turning the RRHs on or off, simultaneously reducing power/energy consumption and improving QoS satisfaction.

3. System Model

The proposed framework of autonomous cell activation and customized physical resource allocation for energy consumption and QoS optimization is made up of RRHs, BBUs and UEs. The UEs are connected to the RRHs based on the execution of cell activation by the BBU pool. In this section, we present the network model, traffic model, energy model, and utility model.

3.1 Network model

The network model in this paper is based on the C-RAN architecture. In C-RAN, the BBUs are combined into a single resource pool, i.e., the BBU pool, and shared among the RRHs [22]. All functions of the RAN are partially or completely integrated into the BBU pool in the cloud. RRHs at different locations can access the functions from the virtual BBU pool. Let \(\mathcal{J}=\{1,2, \ldots, J\}\) be the set of infrastructure nodes called RRHs. For each node \(j \in \mathcal{J}\), a set of UEs is connected to it. We represent the set of mobile UEs connected to the RRHs as \(\mathfrak{T}=\{1,2, \ldots, I\}\). At each time interval, it is assumed that user \(i \in \mathfrak{T}\) is connected to a RRH j. The spectrum bandwidth of RRH j is \(W_{j}\) Hz and the maximum transmit power of RRH j is \(P_{j}^{\text{trans}}\) watts. We denote the fraction of resources allocated to UE i from RRH j as \(x_{ij} \in [0, 1]\), where \(x_{ij}=0\) means that UE i is not associated with RRH j and \(x_{ij} \neq 0\) means that UE i is allocated a bandwidth proportion \(x_{ij}\) from RRH j.

In our system model, the path-loss is calculated as follows:

\(\text { PathLoss }=20 * \log (F)+20 * \log (D)+32.4\)        (1)

where F is the frequency band and D is the distance between a UE and a RRH. Shadow fading is assumed to be a Gaussian random variable with zero mean and standard deviation \(\delta\) equal to 8 dB [23].

For the resource allocation, the signal-to-interference-plus-noise-ratio (SINR) experienced by each UE i associated with a RRH j is modeled as;

\(\chi_{i j}=g_{i j} P_{i j}^{\text {trans}} /\left(\sum_{k, k \neq j} g_{i k} P_{i k}^{\text {trans}}+\sigma\right)\)       (2)

where \(g_{ij}\) is the large-scale channel gain resulting from propagation loss and shadowing effects, \(\sigma\) is the power spectral density of additive white Gaussian noise, and \(P_{ij}^{\text {trans}}\) is the transmit power used by RRH j to serve UE i. Next, we use the Shannon capacity formula to calculate the spectral efficiency of UE i with RRH j as:

\(b_{i j}=\log _{2}\left(1+\chi_{i j}\right)\)       (3)

where \(\chi_{i j}\) is the SINR of UE i from RRH j.

With the fraction of the bandwidth resource allocated to UE i from RRH j being \(x_{ij}\) and its transmission rate denoted by \(\mathfrak{R}_{i j}\), we have \(\mathfrak{R}_{i j}=W_{j} x_{i j} b_{i j}\). Based on equation (3), the transmission rate of UE i obtained from RRH j can be written as:

\(\mathfrak{R}_{i j}=W_{j} x_{i j} \log _{2}\left(1+\chi_{i j}\right)\)       (4)

Since it is possible for a UE to associate with any RRH, the effective transmission rate \(\mathfrak{R}_{i}\) of UE i can be written as follows:

\(\mathfrak{R}_{i}=\sum_{j \in J}\left|x_{i j}\right|_{0} \mathfrak{R}_{i j}\)       (5)

where \(\left|x_{i j}\right|_{0}\) denotes an association indicator between UE i and RRH j. If \(\left|x_{i j}\right|_{0}\)= 0, there is no association between the UE and RRH; otherwise, \(\left|x_{i j}\right|_{0}\)= 1.
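As a concrete illustration, the short Python sketch below evaluates equations (1)-(5) numerically. It is not the authors' simulation code; the array layouts, the unit convention for equation (1) (F in MHz and D in km, the usual reading of the free-space formula) and the helper names are our assumptions.

```python
import numpy as np

def path_loss_db(f_mhz, d_km):
    """Equation (1): free-space path loss in dB (F in MHz, D in km assumed)."""
    return 20.0 * np.log10(f_mhz) + 20.0 * np.log10(d_km) + 32.4

def sinr(g, p_trans, sigma):
    """Equation (2): SINR of UE i served by RRH j.
    g[i, j]       -- large-scale channel gain between UE i and RRH j
    p_trans[i, j] -- transmit power used by RRH j toward UE i (W)
    sigma         -- noise power (W)
    """
    I, J = g.shape
    chi = np.zeros((I, J))
    for i in range(I):
        for j in range(J):
            interference = sum(g[i, k] * p_trans[i, k] for k in range(J) if k != j)
            chi[i, j] = g[i, j] * p_trans[i, j] / (interference + sigma)
    return chi

def rates(chi, x, W):
    """Equations (3)-(5): per-link rates and effective UE rates.
    x[i, j] -- bandwidth fraction of RRH j allocated to UE i (0 if not associated)
    W[j]    -- bandwidth of RRH j (Hz)
    """
    b = np.log2(1.0 + chi)        # spectral efficiency, eq. (3)
    R_ij = W[None, :] * x * b     # per-link rate, eq. (4)
    return R_ij.sum(axis=1)       # effective rate of each UE, eq. (5)
```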

3.2. Traffic model

In our scenario, we monitor the spatio-temporal traffic distribution in the network over a 24-hour period. The number of active users and the traffic demands vary over this period, which greatly increases the complexity of the traffic model on the network. The traffic profile is based on the on-site measurements from the EU FP EARTH project [14]. An ideal traffic profile is configured based on the trapezoidal traffic pattern, which is a simple example of a daily traffic pattern [14]. For a trapezoidal curve with a maximum value of one and different slopes, the traffic function is defined by the angular coefficient \(v\):

\(f(t)=\left\{\begin{array}{ll} 1-v t, & 0 \leq t \leq \frac{1}{v} \\ 0, & \frac{1}{v}<t<T-\frac{1}{v} \\ 1+v(t-T), & T-\frac{1}{v} \leq t \leq T \end{array}\right.\)       (6)

where T represents the 24-hour scan period, \(v\) represents the slope and f(t) is a normalized value between 0 and 1. If \(v\) is equal to 1/10, we shift the profile by 12 hours, which is close to the real scenario. If the slope \(v\) is equal to 1/8, the traffic profile changes. Since the traffic trend is closest to the real situation when \(v\) is equal to 1/10, we use the traffic function with \(v\) equal to 1/10.
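The trapezoidal profile of equation (6) can be sketched as follows; the mapping from the normalized profile to the 4-192 UE range of Section 5 and the application of the 12-hour shift are our interpretation of the text.

```python
def traffic_profile(t, v=0.1, T=24.0):
    """Trapezoidal daily traffic profile of equation (6), normalized to [0, 1].
    v is the slope (1/10 in the paper); T is the 24-hour scan period."""
    t = t % T
    if t <= 1.0 / v:
        return 1.0 - v * t
    if t >= T - 1.0 / v:
        return 1.0 + v * (t - T)
    return 0.0

def users_at_hour(t, n_min=4, n_max=192):
    """Number of active UEs at hour t, scaled to the 4-192 range of Section 5.
    The 12-hour shift places the traffic peak around midday (our assumption)."""
    f = traffic_profile((t + 12) % 24, v=0.1)
    return round(n_min + f * (n_max - n_min))
```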

3.3. Energy model

We define two power consumption states, sleep and active, for each RRH. In the active state, the power consumption is the sum of the transmit power and the RRH operating power. Power consumption of an RRH in the sleep state is negligible. Therefore, we define the total power model for each RRH as follows:

\(P_{j}^{\text {total}}=\left\{\begin{array}{ll} P_{j}^{\text {active}}+P_{j}^{\text {trans}}, & j \in \mathcal{J}_{A} \\ P_{j}^{\text {sleep}}, & j \in \mathcal{J}_{S} \end{array}\right.\)       (7)

where \(P_{j}^{\text {active}}\) denotes the essential power consumption of RRH j in the active state, which is necessary to maintain the basic operation of the RRH, and \(P_{j}^{\text {trans}}\) is the transmit power of the RRH, used to ensure data transmission for the user equipment (UE). If the RRH is not selected for transmission, it enters sleep mode. \(\mathcal{J}_{A} \subseteq \mathcal{J}\) and \(\mathcal{J}_{S} \subseteq \mathcal{J}\) denote the sets of active and sleep RRHs, respectively. The full set of RRHs is the union of the active and sleep sets, i.e., \(\mathcal{J}_{A} \cup \mathcal{J}_{S}=\mathcal{J}\).

Given time steps \(t \in \{1,2,3, \ldots, T\}\), a set of active RRHs \(\mathcal{J}_{A}\) and a set of sleep RRHs \(\mathcal{J}_{S}\), the total energy consumption of the RRHs over the entire period can be expressed as:

\(E=\sum_{t=1}^{T}\left(\sum_{j \in \mathcal{J}_{A}} P_{j}^{\text {active}}+\sum_{j \in \mathcal{J}_{A}} P_{j}^{\text {trans}}+\sum_{j \in \mathcal{J}_{S}} P_{j}^{\text {sleep}}\right)\)       (8)

In the C-RAN architecture, inactive RRHs are put to sleep in order to conserve energy. Our proposed Q-learning based cell activation scheme provides flexibility for managing energy consumption. The C-RAN control unit dynamically optimizes the total expected cumulative energy over the entire operational period instead of the instantaneous energy consumption in a single decision period.
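A minimal sketch of the energy model in equations (7) and (8) is given below, using the power figures reported in Section 5 (6.8 W active, 4.3 W idle, 1.0 W transmit); treating the idle figure as the sleep-state power and using a one-hour decision epoch are our assumptions.

```python
# Power figures taken from Section 5 of the paper (watts).
P_ACTIVE, P_SLEEP, P_TRANS = 6.8, 4.3, 1.0

def total_power(active_mask, p_trans=P_TRANS):
    """Equation (7) summed over RRHs for one decision epoch.
    active_mask[j] is True when RRH j is in the active state."""
    return sum(P_ACTIVE + p_trans if on else P_SLEEP for on in active_mask)

def total_energy(active_masks_per_hour, dt_seconds=3600.0):
    """Equation (8): energy accumulated over the whole operational period,
    assuming each decision epoch lasts dt_seconds."""
    return sum(total_power(mask) * dt_seconds for mask in active_masks_per_hour)
```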

3.4. Utility model

Based on the objective of the proposed scheme, the precondition for saving network energy is to ensure that the QoS requirements of the UEs are satisfied. In order to offer better QoS to UEs, the required transmission rate should be guaranteed. We measure the satisfaction of a UE with a sigmoid function, which can be expressed as in [24]:

\(\xi\left(\mathfrak{R}_{i}\right)=\frac{1}{1+e^{-\tau\left(\mathfrak{R}_{i}-\mathfrak{R}_{i}^{\min }\right)}}\)       (9)

where \(\mathfrak{R}_{i}^{\min}\) is the minimum rate demand required by UE i and \(\tau\) is a constant deciding the steepness of the satisfaction curve. In addition, \(\mathfrak{R}_{i}\) is the real transmission rate of UE i, which is determined by the network infrastructure, transmission power, noise, interference and many other related factors. It is easy to verify that: 1) \(\xi\left(\mathfrak{R}_{i}\right)\) is a monotonically increasing function of \(\mathfrak{R}_{i}\), because individual UEs feel more satisfied if they receive higher throughput above their minimum demand and vice versa; 2) \(\xi\left(\mathfrak{R}_{i}\right)\) of each UE i is scaled between 0 and 1, i.e., \(\xi\left(\mathfrak{R}_{i}\right) \in(0,1)\).

The analysis shows that using the linear utility function (5) for throughput maximization, as in [25], results in a trivial solution in which each RRH serves only its strongest user. While throughput is optimal, this is not a satisfactory solution for many reasons. Instead, we seek a utility function which naturally achieves load balancing and user fairness. A logarithmic function, in particular, is a very common choice of utility function. The resulting objective function with logarithmic utility is defined as:

\(U_{i}\left(\Re_{i}\right)=\log \left(\Re_{i}\right)\)       (10)
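For reference, the two utility functions can be sketched as follows; the steepness \(\tau\) is an illustrative value of our choosing, and the 1 Mbps minimum-rate demand is taken from Section 5.

```python
import math

def satisfaction(rate, rate_min=1e6, tau=1e-5):
    """Equation (9): sigmoid QoS satisfaction of a UE, scaled into (0, 1).
    rate and rate_min are in bps; tau controls the steepness (illustrative value)."""
    return 1.0 / (1.0 + math.exp(-tau * (rate - rate_min)))

def log_utility(rate):
    """Equation (10): logarithmic utility, which encourages fairness across UEs."""
    return math.log(rate)
```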

4. Problem Formulation

The framework of our proposed energy consumption optimization system has three hierarchies, as shown in Fig. 1. First, user association between UEs and RRHs is established through user admission control. Then, the RL agent executes cell activation using the Q-learning technique to select the active RRH set. The resource allocation module uses the active RRH set to execute radio resource allocation, satisfying the QoS requirements of the UEs. The result of radio resource allocation serves as the reward that is fed back into the Q-learning-based cell activation module. The RL agent dynamically monitors the changes in user population, distribution, QoS demand and resource utilization caused by the dynamics of the UEs' number and locations. Once learning is completed, the agent executes cell activation autonomously as the action of the Q-learning algorithm for minimizing energy consumption. The resource allocation module also performs energy management and QoS satisfaction based on the set of active RRHs obtained from cell activation. Admission control and association result in the association between UEs and RRHs. We describe the autonomous cell activation and customized resource allocation modules of the system framework in turn.

Fig. 1. System framework


4.1. Autonomous cell activation

In this section, we present the Q-learning-based autonomous cell activation framework, which minimizes the number of active RRHs to achieve low energy consumption while ensuring that the demand of each UE can be satisfied by the set of active RRHs. Unlike most previous works that presented algorithms optimizing a certain objective (such as power consumption) for the current timeslot (or time frame), our proposed Q-learning-based framework makes a sequence of cell activation decisions to minimize total energy consumption while satisfying the QoS demand of UEs over the whole operational period. In our framework, the RL agent can turn off some RRHs in order to minimize energy consumption if the available UEs can be satisfied by a small number of RRHs. It can also turn on some RRHs if the currently active RRHs cannot satisfy the requirements of some UEs. These on-off switching decisions are made by the RL agent deployed in the BBU cloud.

Q-learning: Reinforcement learning is a form of machine learning that does not require labeled data to make decisions. There are a number of reinforcement learning variations, such as Q-learning, deep Q-learning and double Q-learning. One of the most well-known and generally applicable implementations of reinforcement learning is Q-learning [26]. Q-learning is a model-free reinforcement learning algorithm whereby an agent interacts solely with an environment, without requiring additional information about the environment except awareness of the environment states, the possible (enabled) actions from its current state, and the rewards obtained after performing an action.

In Q-learning, we define a matrix-like Q-table of the form \(Q: S \times A \rightarrow R\), where S is the set of possible states of the environment, A is the set of actions possible in those states and R is the reward obtained after performing an action. The Q-table \(Q(s, a)\) with \(s \in S\) and \(a \in A\) maps state-action pairs to the maximum discounted future reward \(R'\) obtained when performing action \(a\) from state \(s\). The Q-value, which can be looked up in the Q-table, can be expressed as follows:

\(Q\left(s^{\prime}, a^{\prime}\right)=\max R^{\prime}\)       (11)

where \(a'\) is the action taken in the next state \(s'\) and \(R'\) is the discounted future reward.

The letter Q is derived from the word "quality," as the Q-function represents the quality score of performing an action in a certain state. The ideal policy for an agent to follow, to maximize the future (discounted) reward from state \(s\), is to always choose the action with the highest Q-value, as follows:

\(\pi_{i d e a l}(s)=\arg \max _{a \in A_{s}} Q(s, a)\)       (12)

where \(A_{s}\) is the set of actions that are enabled in state \(s\).

The idea of Q-learning is to iteratively approximate the Q-table. Consider a single transition performed by a reinforcement learning agent, \((s, a, r, s')\), where \(s\) denotes the previous state of the agent, \(a\) is the action chosen by the agent when in state \(s\), \(r\) is the reward obtained for performing action \(a\) in state \(s\), and \(s'\) represents the state the environment is in after the agent performed action \(a\). We can express the Q-value of the state-action pair \((s, a)\) in terms of the next state \(s'\) using the Bellman equation as:

\(Q(s, a)=r+\gamma \max _{a^{\prime} \epsilon A_{s^{\prime}}} Q\left(s^{\prime}, a^{\prime}\right)\)        (13)
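A generic tabular Q-learning agent implementing equations (11)-(13) could look like the following sketch; the learning rate, discount factor and epsilon-greedy exploration values are standard placeholders, since the paper does not list its exact hyper-parameters.

```python
import random
from collections import defaultdict

class QAgent:
    """Tabular Q-learning with epsilon-greedy exploration (hyper-parameters are placeholders)."""

    def __init__(self, actions, alpha=0.1, gamma=0.9, epsilon=0.1):
        self.q = defaultdict(float)   # Q(s, a), default value 0
        self.actions = actions        # enabled actions (assumed state-independent here)
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon

    def choose(self, s):
        """Epsilon-greedy version of the ideal policy in equation (12)."""
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(s, a)])

    def update(self, s, a, r, s_next):
        """One-step update toward r + gamma * max_a' Q(s', a'), per equation (13)."""
        target = r + self.gamma * max(self.q[(s_next, a2)] for a2 in self.actions)
        self.q[(s, a)] += self.alpha * (target - self.q[(s, a)])
```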

We begin formulating the Q-learning-based cell activation problem for our wireless network scenario by defining the network states, actions and reward in the context of the generic semi-Markov decision process (SMDP) framework [27].

State (s): As mentioned above, the purpose of Q-learning is to minimize the number of active RRHs while satisfying the QoS requirements of the UEs. Based on this, the state space needs to reflect the on-off states of the RRHs and their bandwidth occupancy. Therefore, we define the state space of the agent to include the on-off state of each RRH and the proportion of bandwidth resources occupied on it. Each RRH has two states, the active state and the sleep state. The two states of any RRH j can be expressed as \(M_{j} \in \{0, 1\}\): \(M_{j}=0\) indicates that RRH j is in the sleep state, while \(M_{j}=1\) indicates that RRH j is in the active state. We use \(\theta_{j}\) to represent the proportion of bandwidth resources of RRH j occupied by its UEs. Since \(\theta_{j}\) is a continuous value, it leads to infinite states. Therefore, in order to reduce the size of the state space in the Q-table, the range of \(\theta_{j}\) from zero to one is partitioned uniformly into eight non-overlapping subzones. In summary, the state space for the wireless network scenario can be expressed as \(s=\left[M_{1}, M_{2}, \ldots, M_{J}, \theta_{1}, \theta_{2}, \ldots, \theta_{J}\right]\), with \((2 \times 8)^{J}\) discrete combinational states.

Action (a): In this paper, the goal of Q-learning-based cell activation is to minimize the number of active RRHs by switching off some RRHs when a small number of RRHs can satisfy the demand of the UEs. The action to be performed is the switching decision made by the agent on the RRHs; that is, the agent makes a corresponding switching action on the RRHs according to the current state. Each RRH corresponds to two actions, switching on or off. For any RRH j, the two actions can be represented as \(N_{j} \in\{0,1\}\): \(N_{j}=0\) indicates switching RRH j off to put it to sleep, and \(N_{j}=1\) indicates switching RRH j on to make it active. The action space can be expressed as \(a=\left[N_{1}, N_{2}, \ldots, N_{J}\right]\).

Reward (r): The reward is the feedback received from the environment after performing an action in a certain state. Therefore, the reward needs to reflect the purpose of the Q-learning algorithm; in our case, the satisfaction of the users' service quality and the minimization of the energy consumption of the wireless network. Since the optimal strategy of Q-learning is to find the action with the largest value in the Q-table for each state, an action that satisfies user QoS and minimizes the energy consumption of the wireless network gives the agent the largest Q-value. We define the reward as follows:

\(R(s, a)=\frac{1}{E}+\omega * \xi(\cdot)\)       (14)

where \(\xi(\cdot) \in [0,1]\) is an indicator of the satisfaction of the UEs, with the utility function defined in (9), and \(\omega > 0\). After obtaining the set of active RRHs through Q-learning-based cell activation, we focus on how to allocate power and bandwidth to the UEs through the customized resource allocation module in the next sub-section.
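The state, action and reward of the cell activation agent defined above could be encoded as in the sketch below; the helper names, the hashable tuple encoding and the averaging of \(\xi(\cdot)\) over UEs are our assumptions.

```python
def encode_state(on_off, bw_occupancy, levels=8):
    """State s = [M_1..M_J, theta_1..theta_J], with each theta quantized into
    eight subzones as described in the text; returned as a hashable Q-table key."""
    thetas = tuple(min(int(theta * levels), levels - 1) for theta in bw_occupancy)
    return tuple(on_off) + thetas

def reward(total_energy, satisfactions, omega=1.0):
    """Equation (14): R(s, a) = 1/E + omega * xi(.), with xi(.) averaged over UEs here."""
    xi = sum(satisfactions) / len(satisfactions)
    return 1.0 / total_energy + omega * xi
```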

4.2. Customized resource allocation

In the resource allocation module, we set up different objective functions for different resource allocation schemes. In the first scheme, the QoS requirements of the UEs are satisfied with the limited amount of resources available; it is assumed that the proportion of bandwidth occupied by a UE equals its proportion of transmit power consumption. In the second scheme, the throughput of the UEs is maximized by fully utilizing the available resources.

4.2.1. Fractional power control with bandwidth allocation

In this resource allocation scheme, we assume that the transmit power per unit bandwidth is equal for all RRHs. The objective of this scheme is to minimize the usage of power and bandwidth while ensuring QoS satisfaction of the UEs. Based on this, we define the objective function as the sum of the occupied resources in the whole system, B, as follows:

\(\min B=\sum_{i=1}^{I} \sum_{j=1}^{J} x_{i j}\)        (15)

such that;

\(x_{i j} \epsilon[0,1], \quad \forall i \epsilon\{1,2, \ldots, I\}, \forall j \in\{1,2, \ldots, J\}\)       (16)

\(\sum_{j=1}^{J}\left|x_{i j}\right|_{0}=1, \quad \forall i \in\{1,2, \ldots, I\}\)       (17)

\(\sum_{i=1}^{I} x_{i j} \leq 1, \forall j \in\{1,2, \ldots, J\}\)       (18)

 

\(\sum_{j=1}^{J} x_{i j} W_{j} \log _{2}\left(1+\chi_{i j}\right) \geq d_{i}, \quad \forall i \in\{1,2, \ldots, I\}\)       (19)

where \(x_{ij}\) is the proportion of bandwidth occupied by UE i on RRH j and \(d_{i}\) is the demand of UE i. We assume all UEs have equal demand. Constraint (16) states that the fraction of resources allocated to a UE ranges from 0 to 1. Constraint (17) states that a UE can only associate with one RRH at a time, because we assume each UE has only one interface. \(\left|x_{i j}\right|_{0}\) denotes the association indicator between UE i and RRH j: if \(\left|x_{i j}\right|_{0}=0\), there is no association between the UE and the RRH; otherwise, \(\left|x_{i j}\right|_{0}=1\). Constraint (18) means that the proportion of bandwidth occupied by the UEs on each RRH should not exceed one, since the total occupied bandwidth cannot be more than the total bandwidth of the associated RRH. Constraint (19) indicates that the rate achieved by each UE with its allocated bandwidth should be no less than its QoS requirement. This optimization problem is a mixed non-linear integer programming problem and can be solved efficiently using the existing MATLAB solver YALMIP [28]. In this solver, mixed integer linear programming (MILP) is used and is appropriate for solving this resource allocation optimization problem.
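The following sketch expresses the model (15)-(19) with the open-source PuLP MILP modeler rather than YALMIP; the binary association variables y and the linking constraint \(x_{ij} \le y_{ij}\) are our linearization of the \(|x_{ij}|_0\) indicator, so this is only an illustrative reformulation.

```python
import numpy as np
import pulp

def allocate_bandwidth(b, W, d):
    """Sketch of (15)-(19).
    b[i, j] -- spectral efficiency log2(1 + chi_ij); W -- RRH bandwidth (Hz); d[i] -- demand (bps)."""
    I, J = b.shape
    prob = pulp.LpProblem("fractional_allocation", pulp.LpMinimize)
    x = [[pulp.LpVariable(f"x_{i}_{j}", 0, 1) for j in range(J)] for i in range(I)]
    y = [[pulp.LpVariable(f"y_{i}_{j}", cat="Binary") for j in range(J)] for i in range(I)]

    prob += pulp.lpSum(x[i][j] for i in range(I) for j in range(J))          # objective (15)
    for i in range(I):
        prob += pulp.lpSum(y[i][j] for j in range(J)) == 1                   # single association, (17)
        prob += pulp.lpSum(x[i][j] * W * b[i, j] for j in range(J)) >= d[i]  # QoS demand, (19)
        for j in range(J):
            prob += x[i][j] <= y[i][j]                                       # x_ij = 0 if not associated
    for j in range(J):
        prob += pulp.lpSum(x[i][j] for i in range(I)) <= 1                   # bandwidth budget, (18)

    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return np.array([[x[i][j].value() for j in range(J)] for i in range(I)])
```

In practice the resulting MILP is small (I·J continuous plus I·J binary variables for the scenarios considered here), so it solves quickly with the bundled CBC solver.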

4.2.2. Full power and bandwidth allocation

In this resource allocation scheme, our objective is to maximize the throughput of the network. Instead of fractional power control with bandwidth adaptation, the network allocates as much power and bandwidth as possible to the UEs to maximize throughput while observing user fairness. We define the objective function of this scheme as follows:

\(\max T=\sum_{i=1}^{I} U\left(\Re_{i}\right)\)       (20)

\(\sum_{i=1}^{I} P_{i j}^{\text {trans}}=P_{j}^{\text {trans}}, \quad \forall j \in\{1,2, \ldots, J\}\)       (21)

\(\sum_{i=1}^{I} x_{i j}=1, \quad \forall j \in\{1,2, \ldots, J\}\)       (22)

\(\sum_{j=1}^{J} \mathfrak{R}_{i j} \geq d_{i}, \quad \forall i \in\{1,2, \ldots, I\}\)       (23)

\(\sum_{j=1}^{J}\left|x_{i j}\right|_{0}=1, \quad \forall i \in\{1,2, \ldots, I\}\)       (24)

\(x_{i j} \epsilon[0,1], \forall i \in\{1,2, \ldots, I\}, \forall j \in\{1,2, \ldots, J\}\)       (25)

where \(P_{ij}^{\text{trans}}\) is the transmit power allocated to UE i from RRH j, \(P_{j}^{\text{trans}}\) is the maximum transmit power of RRH j, \(x_{ij}\) indicates the bandwidth resource proportion allocated to UE i from RRH j, and \(d_{i}\) is the rate demand of UE i. Constraint (21) indicates that each RRH should allocate all of its transmit power to its associated UEs to maximize throughput. Constraint (22) means that the sum of the bandwidth resources allocated to all UEs associated with RRH j should equal one, because maximizing throughput requires all of the bandwidth to be used. Constraint (23) states that the achieved throughput of each UE should be no less than its minimum QoS demand. Constraint (24) means that one UE can only associate with one RRH at a time. Constraint (25) states that the fraction of resources allocated to users ranges from 0 to 1. This is a convex optimization problem and can be solved efficiently using the existing MATLAB solver CVX [29].
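A simplified CVXPY sketch of the model (20)-(25) is shown below. To keep it convex, the user association and the spectral efficiencies (i.e., the SINR values) are treated as fixed inputs and only the bandwidth fractions are optimized; this simplification is ours and differs from the full joint model solved with CVX in the paper.

```python
import cvxpy as cp

def full_allocation(b, a, W, d, J):
    """Sketch of (20)-(25) with fixed association a[i] and spectral efficiency b[i].
    b[i] -- log2(1 + chi) of UE i on its serving RRH; W -- RRH bandwidth (Hz); d[i] -- min rate (bps)."""
    I = len(b)
    x = cp.Variable(I, nonneg=True)            # bandwidth fraction of UE i on RRH a[i]
    rate = cp.multiply(W * b, x)               # R_i = W * x_i * b_i, as in eq. (4)
    constraints = [rate >= d]                  # QoS demand, constraint (23)
    for j in range(J):
        users = [i for i in range(I) if a[i] == j]
        if users:                              # use the whole bandwidth of RRH j, constraint (22)
            constraints.append(cp.sum(cp.hstack([x[i] for i in users])) == 1)
    # Objective (20) with the logarithmic utility (10): proportional fairness
    prob = cp.Problem(cp.Maximize(cp.sum(cp.log(rate))), constraints)
    prob.solve()
    return x.value
```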

The proposed algorithm framework is summarized as follows. In step 1 (lines 1-2), the association between UEs and RRHs takes place before the UEs request resources. In step 2 (lines 3-9), the Q-table of the Q-learning algorithm is initialized and iterated for each decision epoch as the demand of the UEs changes. Actions are initially selected randomly while learning is ongoing; after some time, actions are selected based on the maximum Q-value to obtain the set of active RRHs. In step 3 (lines 10-12), we obtain the optimal energy consumption based on the customized set of active RRHs by solving the resource allocation models, using MILP for fractional power control with bandwidth adaptation or CVX for full power and bandwidth allocation. Lastly, we observe the reward and update the Q-table in step 4.

5. Performance Evaluation

5.1 Scenario configuration

To evaluate the performance of our proposed algorithm, we perform numerical simulations using MATLAB. Two solvers are used, as shown in the algorithm specification, to solve the physical resource allocation problem: the MILP solver [28] and the CVX solver [29]. The simulation parameters are based on LTE standards and listed in Table 1 below.

With the specified actors in the defined system model, we consider three RRHs as a cluster connected to one BBU pool in the network. In each cluster, the BBU takes the on-off action using the Q-table generated by the Q-learning agent. Inter-cluster resource allocation is controlled by a multiple-agent controller. The number of RRHs is set according to the use case scenario in the experiment. The two use cases defined in this paper are a fixed number of RRHs, for the performance evaluation with changing traffic demand, and a changing number of RRHs, for the performance evaluation with changing network density.

Table 1. Simulation Parameters

We also assume the maximum number of RRHs for each BBU cloud to be 3, and at most 18 in total in the whole system. To be more realistic, we set the system bandwidth of the RRHs at 20 MHz. The threshold of UE sensitivity is set at -120 dBm for edge UEs. As specified in the network model, the RRH coverage in the network is 200 m. The number of UEs ranges from 4 to 192 according to the traffic model [14]. The user demand of 1 Mbps is equal for all UEs. Each UE is considered to have only one interface. The energy consumption largely depends on the transceiver power settings, traffic load and the active duration of the RRH. As specified in the energy model, the power consumption of an RRH is set at 6.8 W in the active state and 4.3 W in the idle state [20]. The transmit power per RRH in the active state is 1.0 W, while in the sleep state it is negligible and therefore eliminated from our model. In addition, two utility functions are adopted, in terms of throughput/data rate in (10) and QoS satisfaction in (9).

In order to evaluate our proposed model and algorithms, we define four different schemes. Scheme I is simple on-off cell activation with nearest-RRH association, also called the simple on-off scheme. Scheme II is cell activation with the load ordering-based heuristic scheduling algorithm. In scheme III, the Q-learning based cell activation algorithm is used, and the fractional power control with bandwidth adaptation model is solved by the MILP solver; this is identified as Q-learning with MILP (Q-learning-MILP). Lastly, in scheme IV, we use Q-learning to make the cell activation decision, but the full power and bandwidth allocation model is solved by the CVX solver; this is identified as Q-learning with CVX (Q-learning-CVX). We compare our proposed algorithm with the simple on-off scheme and the heuristic scheduling algorithm for the following reasons: all three algorithms are model-free and take the dynamics of the traffic distribution into consideration. However, the simple on-off scheme is a baseline algorithm that prefers nearest association of UEs and RRHs; if there are no UEs near an RRH, the RRH is switched off, and vice versa. The difference between the heuristic scheduling algorithm and the proposed Q-learning algorithm is that the heuristic algorithm is based on a static policy, i.e., there is no feedback after scheduling, whereas the learning agent in the Q-learning based algorithm receives feedback in the form of a reward. As the traffic distribution changes, the learning agent selects an optimal solution to the problem.

The objective of this paper is to optimize the wireless network energy consumption and radio resource occupancy while satisfying the QoS requirements of the UEs. The simulation results are classified according to the following metrics: the number of active RRHs, transmit power cost, accumulated total energy consumption and average user QoS satisfaction. The normalized number of active RRHs is used to evaluate the effect of cell activation. For the Q-learning-MILP scheme, we assume that the used bandwidth proportion is equal to the transmit power cost proportion. The transmit power cost is used to evaluate the effect of our radio resource allocation model. The accumulated total energy consumption is used to evaluate the optimization effect on the entire wireless network's energy consumption. Since UEs only care about their QoS satisfaction, we use the QoS satisfaction metric to evaluate user satisfaction. For these four performance criteria, we develop two aspects of evaluation in the experiment. In the first, we observe the performance over the 24-hours-in-a-day traffic model to evaluate our proposed algorithm with changing traffic demand. In the second, we observe the performance with changing network density to evaluate the scalability of our proposed algorithm. We define the density as the number of UEs over the number of RRHs.

5.2. Performance evaluation with changing load

In this simulation, we configure 18 RRHs, which can be considered as 6 clusters of 3 RRHs each, in a coverage area of 400 m by 600 m. The demand of each individual UE does not change, but the total user demand changes based on the 24-hours-in-a-day traffic model. Considering one hour as a decision cycle, we observe the performance over 24 cycles/hours on the above-mentioned evaluation metrics. The result presented in Fig. 2 is the number of active RRHs against the 24-hours-in-a-day traffic model in terms of UE population. The traffic of each hour changes, leading to the states of the RRHs changing between active and sleep. The results in Fig. 2 show that the number of active RRHs correlates positively with the traffic volume.

Fig. 2. Normalized number of active RRHs

In all four schemes, a change in the traffic trend results in a roughly proportional change in the number of active RRHs. The cell activation based Q-learning schemes outperform scheme I and scheme II. The performance of Q-learning-MILP is nearly the same as that of Q-learning-CVX in cell activation. The gain is higher for heuristic scheduling than for the simple on-off scheme. While the simple on-off scheme increases the number of active RRHs up to 100%, the cell activation based Q-learning schemes increase it to less than 70% with the same traffic profile. It can be concluded that the cell activation based Q-learning schemes can operate with fewer active RRHs than the simple on-off and heuristic scheduling schemes, making them better in this scenario.

In Fig. 3, we consider the gains in transmit power cost. Based on the assumption in Q-learning-MILP that the used bandwidth proportion is equal to the transmit power cost proportion, the trend of the radio resource occupancy proportion is the same as that of the transmit power cost. From Fig. 3, it is observed that the transmit power cost of Q-learning-CVX is greater than that of the other three schemes, while in schemes I, II and III the transmit power cost is almost the same. This is because the objective of Q-learning-CVX is to maximize throughput; therefore, this scheme needs more radio resources, while the other three schemes allocate radio resources according to QoS satisfaction requirements in order to save radio resources. It can be deduced that when the network is resource-limited, Q-learning-MILP is more suitable, whereas when resources are abundant, Q-learning-CVX is more suitable.

Fig. 3. Normalized transmit power cost

To illustrate the total energy consumption achieved by the proposed algorithm, a simulation is done with the two schemes, Q-learning-MILP and Q-learning-CVX, compared with the existing heuristic scheduling and simple on-off schemes. The normalized total energy consumption is compared, with the result illustrated in Fig. 4. In this paper, we consider two components of energy cost: the energy cost of active RRHs and the transmit power cost. The total energy consumption is the sum of these two components. From Fig. 4, the normalized total energy consumption of the simple on-off scheme is just above 0.83, which corresponds to 1800 kJ in actual value, while it is about 0.67, corresponding to 1600 kJ, in the Q-learning-based schemes. The proposed Q-learning-MILP algorithm outperforms the others with the lowest total energy consumption, and Q-learning-CVX is close behind, because the transmit power cost of Q-learning-CVX is higher than that of Q-learning-MILP. The Q-learning based algorithms outperform scheme I and scheme II, while the heuristic scheduling scheme outperforms the simple on-off scheme.

Fig. 4. Normalized total energy consumption

In order to check whether our proposed algorithm can satisfy the QoS requirements of all UEs, a simulation is done to observe QoS satisfaction with the four schemes. The result is illustrated in Fig. 5. Obviously, the Q-learning-CVX scheme outperforms the other schemes in terms of satisfaction, because the Q-learning-CVX scheme uses as much of the available radio resources as possible to satisfy the UEs. In other words, Q-learning-CVX sacrifices its transmit power and bandwidth resources to achieve higher satisfaction and throughput.

Fig. 5. User satisfaction

The satisfaction rate of Q-learning-MILP stays at 50% for each hour, meaning the Q-learning-MILP scheme only cares about satisfying the minimum QoS requirements of the UEs. For the simple on-off and heuristic scheduling schemes, it is observed that the QoS requirements of all UEs are satisfied under light network load: at hours 1-19, the satisfaction rate is 50%. As the network load increases further, at hours 20-24, the satisfaction of the UEs begins to drop, to as low as 35% at hour 23. In summary, we observed the performance of the four schemes in terms of the number of active RRHs, transmit power cost, accumulated total energy consumption and QoS satisfaction. Based on the above discussion, we conclude that our proposed algorithm works better than the other schemes with changing traffic demand. To check the scalability of our proposed algorithm in other scenarios, we extend the evaluation to changing network density.

5.3. Performance evaluation with clusters

In this evaluation, we configure 6 density values based on the ratio of the number of UEs to the number of RRHs. We set the number of RRHs to range from 3 to 18, adding 3 RRHs for each density value change. In order to show the change of network load from light to heavy, we set the number of UEs to 4, 16, 36, 64, 100 and 144. The resulting density values are 4/3, 8/3, 12/3, 16/3, 20/3 and 24/3. Dividing each density by the largest value for normalization, the normalized densities are 1/6, 2/6, 3/6, 4/6, 5/6 and 1. The higher the network density, the heavier the network load.

The result presented in Fig. 6 is the number of active RRHs against the normalized network density from 1/6 to 1. A change in density leads to the status of the RRHs changing between active and sleep. The results in Fig. 6 show that the number of active RRHs correlates positively with the density. The Q-learning based cell activation schemes outperform schemes I and II, just as in the evaluation with changing traffic demand. The performance of Q-learning-MILP is nearly the same as that of Q-learning-CVX in cell activation. The gain is higher for heuristic scheduling than for the simple on-off scheme. However, when the network load is very heavy, for instance at a density of 1, the number of active RRHs is the same for all four schemes. In conclusion, the Q-learning-based schemes use fewer RRHs under light network load but rise to the same level as the other schemes under heavy load.

Fig. 6. Normalized number of active RRHs with density changing

In this simulation, an evaluation of the total energy consumption is done, with the result illustrated in Fig. 7. From Fig. 7, the proposed Q-learning-MILP algorithm outperforms the others, with Q-learning-CVX close behind, because the transmit power cost of Q-learning-CVX is higher than that of Q-learning-MILP. The Q-learning based algorithms outperform schemes I and II, while the heuristic scheduling scheme outperforms the simple on-off scheme. It is deduced that the Q-learning-based algorithms attain lower energy costs than the simple on-off and heuristic scheduling schemes even with increasing density.

Fig. 7. Normalized total energy cost with density changing

In this simulation, we check the ability of our proposed algorithm to satisfy the QoS requirements of all UEs. A simulation is done to observe QoS satisfaction with the four schemes. The result illustrated in Fig. 8 shows that the Q-learning-CVX scheme outperforms the other schemes in terms of satisfaction.

Fig. 8. Satisfaction with density changing

The satisfaction of Q-learning-MILP stays at 50% for each density change, implying that Q-learning-MILP can satisfy the minimum QoS requirements of all UEs without focusing on maximizing throughput. For the simple on-off and heuristic scheduling schemes, we observe that under light network load, i.e., at lower normalized densities such as 1/6, 2/6 and 3/6, the QoS requirements of all UEs are satisfied, the satisfaction level being 50%. As the network load increases with increasing density, these two schemes' satisfaction levels drop to as low as 40%. It can be concluded that the Q-learning based schemes attain higher satisfaction under heavy network loads than the simple on-off and heuristic scheduling schemes. In summary, we observed the performance of the four schemes in terms of four metrics, namely the number of active RRHs, transmit power cost, accumulated total energy consumption and QoS satisfaction, with changing density. Based on the above discussion, we conclude that our proposed algorithm performs better than the others even under the changing density scenario.

6. Conclusion

In this paper, we proposed a generic framework of autonomous cell activation and customized physical resource allocation schemes for energy consumption and QoS optimization in wireless networks. In the cell activation scheme, we set up a Q-learning model to satisfy the QoS requirements of users and to achieve low energy consumption with the minimum number of active RRHs under varying traffic demand. In the customized physical resource allocation scheme, we formulated the EE-QoS optimization problem as fractional power control with bandwidth adaptation and full power and bandwidth allocation models. Under the fractional power control with bandwidth adaptation model, we minimized bandwidth resource usage while satisfying user QoS with limited resources. Under the full power and bandwidth allocation model, we maximized the system throughput while keeping fairness among users by utilizing all available bandwidth resources. The proposed schemes, Q-learning-CVX and Q-learning-MILP, were compared with the existing simple on-off and heuristic scheduling schemes. Simulation results showed that the proposed Q-learning based schemes outperform the existing schemes in terms of energy consumption and user satisfaction.

Acknowledgment

This work is supported by the National Natural Science Foundation of China under grant no. 61771098, by the Science and Technology Planning Project of Sichuan Province, China, under grant no. 2016GZ0075, by the Fundamental Research Funds for the Central Universities under grant no. ZYGX2014J060, and by the ZTE Innovation Research Fund for Universities Program 2016.

References

  1. N. Bhushan, D. Malladi, J. Li and S. Geirhofer, "Network densification: the dominant theme for wireless evolution into 5G," IEEE Commun. Mag., vol. 52, no. 2, pp. 82-89, Feb. 2014. https://doi.org/10.1109/MCOM.2014.6736747
  2. China Mobile, "C-RAN: the road towards green RAN," White Paper, 2011.
  3. R. S. Sutton and A. G. Barto, "Reinforcement learning: an introduction," MIT Press, Cambridge, MA, Feb. 1998.
  4. S. Luo, R. Zhang and T. J. Lim, "Downlink and uplink energy minimization through user association and beam forming in C-RAN," IEEE Transactions on Wireless Communications, vol. 14, no. 1, pp. 494-508, Feb. 2015. https://doi.org/10.1109/TWC.2014.2352619
  5. J. Tang, G. Xue and W. Zhang, "Cross-layer optimization for end-to-end rate allocation in multi-radio wireless mesh networks," Wireless Networks, vol. 15, no. 1, pp. 53-64, Feb. 2009. https://doi.org/10.1007/s11276-007-0024-y
  6. S. Buzzi, C. L. I, T. E. Klein, H. V. Poor, C. Yang and A. Zappone, "A survey of energy-efficient techniques for 5G networks and challenges ahead," IEEE Journal on Selected Areas in Communications, vol. 34, no. 4, pp. 697-709, Apr. 2016. https://doi.org/10.1109/JSAC.2016.2550338
  7. Y. Shi, J. Zhang and K. B. Letaief, "Group sparse beamforming for green cloud-RAN," IEEE Transactions on Wireless Communication, vol. 13, no. 5, pp. 2809-2823, May 2014. https://doi.org/10.1109/TWC.2014.040214.131770
  8. Y. Shi, J. Zhang, W. Chen and K. B. Letaief, "Enhanced group sparse beamforming for green cloud-RAN: a random matrix approach," IEEE Transactions on Wireless Communications, vol. 17, no. 4, pp. 2511-2524, Nov. 2017. https://doi.org/10.1109/twc.2018.2797203
  9. B. Dai and W. Yu, "Energy efficiency of downlink transmission strategies for cloud radio access networks," IEEE Journal on Selected Areas in Communications, vol. 34, no. 4, pp. 1037-1050, Apr. 2016. https://doi.org/10.1109/JSAC.2016.2544459
  10. M. Peng, K. Zhang, J. Jiang, J. Wang and W. Wang, "Energy-efficient resource assignment and power allocation in heterogeneous cloud radio access networks," IEEE Transactions on Vehicular Technology, vol. 64, no. 11, pp. 5275-5287, Nov. 2015. https://doi.org/10.1109/TVT.2014.2379922
  11. Y. Lin, W. Bao, W. Yu and B. Liang, "Optimizing user association and spectrum allocation in HetNets: a utility perspective," IEEE Journal on Selected Areas in Communications, vol. 33, no. 6, pp. 1025-1039, Jun. 2015. https://doi.org/10.1109/JSAC.2015.2417011
  12. Wei-Sheng Lai , Tsung-Hui Chang and Ta-Sung Lee, "Joint power and admission control for spectral and energy efficiency maximization in heterogeneous OFDMA networks," IEEE Transactions on Wireless Communications, vol. 15, pp. 3531 - 3547, May 2016. https://doi.org/10.1109/TWC.2016.2522958
  13. G. P. Koudouridis, H. Gao and P. Legg, "A centralised approach to power on-off optimization for heterogeneous networks," in Proc. of IEEE Vehicular Technology Conference (VTC Fall), pp. 3-6, Sept. 2012.
  14. Sun, G., Addo, P. C., Wang, G., and Liu, G., "Energy efficient cell management by flow scheduling in ultra-dense networks," KSII Transactions on Internet and Information Systems, vol. 10, no. 9, pp. 4108-4122, Sept. 2016. https://doi.org/10.3837/tiis.2016.09.005
  15. X. Zhang, J. Zhang, Y. Huang and W. Wang, "On the study of fundamental trade-offs between QoE and energy efficiency in wireless networks," Transactions on Emerging Telecommunications Technologies, vol. 24, no. 3, pp. 259-265, Apr. 2013. https://doi.org/10.1002/ett.2640
  16. A. Mesodiakaki, F. Adelantado, L. Alonso and C. Verikoukis, "Energy-efficient context-aware user association for outdoor small cell heterogeneous networks," in Proc. of IEEE Int. Conf. on Commun. (ICC), pp. 1614-1619, Jun. 2014.
  17. Y. Xu, R. Hu, L. Wei and G. Wu, "QoE-aware mobile association and resource allocation over wireless heterogeneous networks," in Proc. of IEEE Global Commun. Conf. (GLOBECOM), pp. 4695-4701, Dec. 2014.
  18. H. Li, T. Wei, A. Ren, Q. Zhu and Y. Wang, "Deep reinforcement learning: framework, applications and embedded implementations," in Proc. of IEEE/ACM International Conference on Computer-aided Design (ICCAD), Irvine, CA, pp. 847-854, Oct. 2017.
  19. F. Shams, G. Bacci and M. Luise, "Energy-efficient power control for multiple-relay cooperative networks using Q-learning," IEEE Transactions on Wireless Communications, vol. 14, no. 3, pp. 1567-1580, Mar. 2015. https://doi.org/10.1109/TWC.2014.2370046
  20. X. Zhiyuan, Y. Wang and J. Tang, "A deep reinforcement learning based framework for power-efficient resource allocation in cloud RANs," in Proc. of IEEE International Conference on Communications (ICC), pp. 1 - 6, May 2017.
  21. H. Zhang, H. Liu, J. Cheng and V.C.M. Leung, "Downlink energy efficiency of power allocation and wireless backhaul bandwidth allocation in heterogeneous small cell networks," IEEE Transactions on Communications, vol. 66, no. 4, pp. 1705-1716, 2018. https://doi.org/10.1109/TCOMM.2017.2763623
  22. Checko A., Christiansen H. L., Yan Y., et al. "Cloud RAN for mobile networks-A technology overview," IEEE Communications Surveys & Tutorials, vol. 17, no. 1, pp. 405-426, Firstquarter 2015. https://doi.org/10.1109/COMST.2014.2355255
  23. A. Moubayed, A. Shami and H. Lutfiyya, "Wireless resource virtualization with device-to-device communication underlaying LTE Network," IEEE Transactions on Broadcasting, vol. 61, no. 4, pp. 734-740, Dec. 2015. https://doi.org/10.1109/TBC.2015.2492458
  24. C. Xu, T.Li, M. Sheng, et al, "Self-organized dynamic caching space sharing in virtualized wireless networks," in Proc. of IEEE Globecom Workshops (GC Wkshps), pp.1-6, Dec. 2016.
  25. Q. Ye, B. Rong, Y. Chen and M. AI-Shalash, "User association for load balancing in heterogeneous cellular networks," IEEE Transactions on Wireless Communications, vol. 12, no. 6, pp. 2706-2716, Jun. 2013. https://doi.org/10.1109/TWC.2013.040413.120676
  26. D. A. Duwaer, "On the deep reinforcement learning for data-driven traffic control," LD Software, Eindhoven, 2016.
  27. Baxter A. Laurence, "Markov decision processes: discrete stochastic dynamic programming," Technometrics, vol. 37, no. 3, pp.353, 1995. https://doi.org/10.2307/1269933
  28. J. Lofberg, "YALMIP: a toolbox for modeling and optimization in MATLAB," in Proc. of the IEEE International Symposium on Computer-Aided Control System Design (CACSD04), Taipei, Taiwan, Oct. 2004.
  29. Michael Grant and Stephen Boyd. "CVX: Matlab software for disciplined convex programming, version 2.0 beta," 2013.

Cited by

  1. Random Balance between Monte Carlo and Temporal Difference in off-policy Reinforcement Learning for Less Sample-Complexity vol.21, pp.5, 2019, https://doi.org/10.7472/jksii.2020.21.5.1