Soft Actor-Critic (SAC)

https://arxiv.org/abs/1801.01290

Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor

Model-free deep reinforcement learning (RL) algorithms have been demonstrated on a range of challenging decision making and control tasks. However, these methods typically suffer from two major challenges: very high sample complexity and brittle convergenc


Concepts

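SAC is a model-free RL method that improves on the conventional actor-critic approach. It applies to continuous state and action spaces, and by adopting a maximum entropy objective it learns to maximize both the expected return and the expected entropy of the policy, which makes learning more effective and stable. Its main components are as follows. (2018 PMLR)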

Actor-Critic architecture : uses separate policy and value function networks
Off-policy formulation : enables reuse of previously collected data
Entropy maximization : enables stability and exploration

Preliminaries

Notation

Consider an infinite-horizon MDP with continuous state and action spaces : $(\mathcal{S}, \mathcal{A}, p, r)$

$\mathcal{S}$ : State space
$\mathcal{A}$ : Action space
$p : \mathcal{S} \times \mathcal{S} \times \mathcal{A} \to \left [ 0, \infty \right )$ : unknown state transition probability density of the next state $\mathbf{s}_{t+1}$ given the current state $\mathbf{s}_t$ and action $\mathbf{a}_t$
$r : \mathcal{S} \times \mathcal{A} \to \left [ r_{min}, r_{max} \right ]$ : bounded reward emitted by environment
$\rho_{\pi}(\mathbf{s}_t)$ / $\rho_{\pi}(\mathbf{s}_t, \mathbf{a}_t)$ : state / state-action marginals of the trajectory distribution induced by policy $\pi(\mathbf{a}_t | \mathbf{s}_t)$

Maximum Entropy RL

SAC uses the general maximum entropy objective, which favors stochastic policies by augmenting the expected reward with the expected entropy of the policy over $\rho_{\pi}(\mathbf{s}_t)$.

$J(\pi) = \sum_{t=0}^T \mathbb{E}_{(\mathbf{s}_t, \mathbf{a}_t) \sim \rho_{\pi}} [ r(\mathbf{s}_t, \mathbf{a}_t) + \alpha \mathcal{H}(\pi(\cdot | \mathbf{s}_t))] \tag{1}$

Here, $\alpha$ is the temperature parameter that determines the relative importance of the entropy term against the reward, and thus controls the stochasticity of the optimal policy. An objective of this form has the following advantages.

The policy is incentivized to explore more widely.
The policy can capture multiple modes of near-optimal behavior.
Learning speed improves considerably.
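
To make the per-step term of Eqn.(1) concrete, here is a minimal sketch for a discrete policy. The action probabilities, the reward, and the temperature value are made up for illustration; only the formula $r + \alpha \mathcal{H}(\pi(\cdot | \mathbf{s}_t))$ comes from the text above.

```python
import math

def entropy(probs):
    """Shannon entropy H(pi(.|s)) = -sum_a pi(a|s) log pi(a|s), in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def soft_reward(reward, probs, alpha):
    """Per-step term of Eqn.(1): r(s, a) + alpha * H(pi(.|s))."""
    return reward + alpha * entropy(probs)

# Two hypothetical policies over four actions at the same state.
uniform = [0.25, 0.25, 0.25, 0.25]
peaked = [0.97, 0.01, 0.01, 0.01]
alpha = 0.2  # hypothetical temperature

print("entropy:", entropy(uniform), entropy(peaked))
print("soft reward:", soft_reward(1.0, uniform, alpha), soft_reward(1.0, peaked, alpha))
```

With the same environment reward, the more random policy receives the larger augmented reward, and setting `alpha = 0` recovers the conventional objective.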

Methods

Derivation of Soft Policy Iteration

Before presenting SAC itself, the paper first describes soft policy iteration. The following applies to the tabular setting and shows that running soft policy iteration converges to the optimal policy.

Policy evaluation

This step computes the value of a given policy $\pi$ under the objective in Eqn.(1). Starting from any soft Q-function $Q : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ , repeatedly applying the modified Bellman backup operator $\mathcal{T}^{\pi}$ below yields the soft Q-value of $\pi$ iteratively, and the iteration is guaranteed to converge (Lemma 1).

$\mathcal{T}^{\pi} Q(\mathbf{s}_t, \mathbf{a}_t) \triangleq r(\mathbf{s}_t, \mathbf{a}_t) + \gamma \mathbb{E}_{\mathbf{s}_{t+1} \sim p} [V(\mathbf{s}_{t+1})] \tag{2}$

Here, $V$ is the soft state value function defined below. It adds an entropy term (the expectation of $- \log \pi(\mathbf{a}_t | \mathbf{s}_t)$ ) to the standard form, so states where the policy has higher entropy receive higher value, which encourages exploration.

$V(\mathbf{s}_t) = \mathbb{E}_{\mathbf{a}_t \sim \pi} [Q(\mathbf{s}_t, \mathbf{a}_t) - \log \pi( \mathbf{a}_t | \mathbf{s}_t )] \tag{3}$

[Lemma 1] Soft Policy Evaluation
Consider the soft Bellman backup operator $\mathcal{T}^{\pi}$ in Eqn.(2) and a mapping
$Q^0 : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ with $| \mathcal{A} | < \infty$ , and define $Q^{k+1} = \mathcal{T}^{\pi} Q^k$ .
Then the sequence $Q^k$ will converge to the soft Q-value of $\pi$ as $k \to \infty$ .
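
Lemma 1 can be tried out numerically. The sketch below applies the backup of Eqn.(2) with the soft value of Eqn.(3) to a small tabular problem until the Q-table stops changing. The 2-state, 2-action MDP, the fixed policy `PI`, and the discount `GAMMA` are all hypothetical values chosen for illustration.

```python
import math

GAMMA = 0.9
# Hypothetical 2-state, 2-action MDP: P[s][a][s2] and R[s][a].
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.0, 1.0]]]
R = [[1.0, 0.0],
     [0.0, 2.0]]
PI = [[0.5, 0.5],
      [0.3, 0.7]]  # fixed stochastic policy pi(a|s)
S, A = 2, 2

def soft_value(Q, pi, s):
    """Eqn.(3): V(s) = E_{a~pi}[Q(s, a) - log pi(a|s)]."""
    return sum(p * (Q[s][a] - math.log(p)) for a, p in enumerate(pi[s]) if p > 0.0)

def soft_backup(Q, pi):
    """Eqn.(2): (T^pi Q)(s, a) = r(s, a) + gamma * E_{s'~p}[V(s')]."""
    V = [soft_value(Q, pi, s) for s in range(S)]
    return [[R[s][a] + GAMMA * sum(P[s][a][s2] * V[s2] for s2 in range(S))
             for a in range(A)] for s in range(S)]

def soft_policy_evaluation(pi, tol=1e-10, max_iters=10_000):
    """Iterate Q^{k+1} = T^pi Q^k from Q^0 = 0 until the largest change is below tol."""
    Q = [[0.0] * A for _ in range(S)]
    for k in range(max_iters):
        Q_next = soft_backup(Q, pi)
        gap = max(abs(Q_next[s][a] - Q[s][a]) for s in range(S) for a in range(A))
        Q = Q_next
        if gap < tol:
            break
    return Q, k + 1

Q_pi, iters = soft_policy_evaluation(PI)
print("iterations:", iters)
print("soft Q-values:", Q_pi)
```

Because the backup is a $\gamma$-contraction, the gap shrinks geometrically and the loop stops well before `max_iters`.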

Policy improvement

This step updates the policy using the newly obtained Q-function. To keep the policy in a form that is tractable in practice, it is restricted to a policy set $\Pi$ (e.g. a parameterized family of distributions such as Gaussians).

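To satisfy the constraint $\pi \in \Pi$ , the improved policy $\pi'$ is projected onto $\Pi$ . Any projection could be used in principle, but the information projection defined in terms of the Kullback-Leibler divergence turns out to be convenient. The partition function $Z^{\pi_{old}} (\mathbf{s}_t)$ normalizes the target distribution. The resulting $\pi_{new}$ always has a value at least as high as that of $\pi_{old}$ (Lemma 2).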

$\pi_{new} = \operatorname*{argmin}_{\pi' \in \Pi} D_{KL} \left ( \pi'(\cdot | \mathbf{s}_t) \middle \| \frac{\exp(Q^{\pi_{old}} (\mathbf{s}_t, \cdot))} {Z^{\pi_{old}} (\mathbf{s}_t)} \right ) \tag{4}$

[Lemma 2] Soft Policy Improvement
Let $\pi_{old} \in \Pi$ and let $\pi_{new}$ be the optimizer of the minimization problem defined in Eqn.(4).
Then $Q^{\pi_{new}} (\mathbf{s}_t, \mathbf{a}_t) \geq Q^{\pi_{old}} (\mathbf{s}_t, \mathbf{a}_t)$ for all $(\mathbf{s}_t, \mathbf{a}_t) \in \mathcal{S} \times \mathcal{A}$ with $| \mathcal{A} | < \infty$ .
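
As a small illustration of the projection in Eqn.(4) at a single state, the sketch below searches a restricted policy set for the member closest in KL divergence to $\exp(Q) / Z$ . The Q-values and the coarse grid `PI_SET` standing in for $\Pi$ are assumptions made for this example.

```python
import math

def softmax(q_values):
    """Target of Eqn.(4): exp(Q(s, .)) / Z(s), computed stably."""
    m = max(q_values)
    exps = [math.exp(q - m) for q in q_values]
    z = sum(exps)
    return [e / z for e in exps]

def kl(p, q):
    """D_KL(p || q) for discrete distributions with q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def project(q_values, policy_set):
    """Eqn.(4) at one state: the member of Pi closest in KL to softmax(Q)."""
    target = softmax(q_values)
    return min(policy_set, key=lambda pi: kl(pi, target))

q_old = [1.0, 2.0]  # hypothetical Q^{pi_old}(s, .) for two actions
# Hypothetical restricted set Pi: action-0 probability on a grid of step 0.05.
PI_SET = [[i / 20, 1 - i / 20] for i in range(1, 20)]

pi_new = project(q_old, PI_SET)
print("target:", softmax(q_old))
print("projected policy:", pi_new)
```

If $\Pi$ contained every distribution, the minimizer would be the softmax itself with zero KL divergence. With a restricted set, the projection returns the nearest representable policy.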

Policy Iteration

The full soft policy iteration alternates between the evaluation and improvement steps, and converges to the optimal maximum entropy policy within $\Pi$ .

[Lemma 3] Soft Policy Iteration
Repeated application of soft policy evaluation and soft policy improvement from any $\pi \in \Pi$ converges to a policy $\pi^*$ such that $Q^{\pi^*} (\mathbf{s}_t, \mathbf{a}_t) \geq Q^{\pi} (\mathbf{s}_t, \mathbf{a}_t)$ for all $\pi \in \Pi$ and $(\mathbf{s}_t, \mathbf{a}_t) \in \mathcal{S} \times \mathcal{A}$ , assuming $| \mathcal{A} | < \infty$ .
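
The alternation described by Lemma 3 can be sketched on a toy problem. The code below assumes a hypothetical 2-state, 2-action MDP and takes $\Pi$ to be the set of all stochastic policies, in which case the KL minimizer of Eqn.(4) is simply the softmax of the Q-values. It stores the Q-table after every evaluation so that the monotone improvement of Lemma 2 can be inspected.

```python
import math

GAMMA = 0.9
# Hypothetical 2-state, 2-action MDP: P[s][a][s2] and R[s][a].
P = [[[0.9, 0.1], [0.2, 0.8]], [[0.5, 0.5], [0.0, 1.0]]]
R = [[1.0, 0.0], [0.0, 2.0]]
S, A = 2, 2

def evaluate(pi, sweeps=500):
    """Soft policy evaluation: apply the backup of Eqn.(2) repeatedly."""
    Q = [[0.0] * A for _ in range(S)]
    for _ in range(sweeps):
        V = [sum(pi[s][a] * (Q[s][a] - math.log(pi[s][a])) for a in range(A))
             for s in range(S)]
        Q = [[R[s][a] + GAMMA * sum(P[s][a][s2] * V[s2] for s2 in range(S))
              for a in range(A)] for s in range(S)]
    return Q

def improve(Q):
    """Soft policy improvement, Eqn.(4), assuming Pi = all stochastic policies,
    so that the KL minimizer is exactly softmax(Q(s, .))."""
    pi = []
    for s in range(S):
        m = max(Q[s])
        exps = [math.exp(q - m) for q in Q[s]]
        z = sum(exps)
        pi.append([e / z for e in exps])
    return pi

pi = [[0.5, 0.5], [0.5, 0.5]]  # arbitrary starting policy
history = []                   # Q^{pi} after each evaluation step
for _ in range(30):
    Q = evaluate(pi)
    history.append(Q)
    pi = improve(Q)

print("final policy:", pi)
print("final soft Q-values:", history[-1])
```

Each entry of `history` should be elementwise no smaller than the previous one, up to floating-point error.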

However, this procedure can be carried out exactly only in the tabular case. With function approximation, running both steps to convergence would be too computationally expensive. Soft actor-critic was proposed to address this.

Soft Actor-Critic

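As noted above, learning in large continuous domains calls for function approximation rather than a tabular representation. Instead of running policy evaluation and improvement to convergence, SAC adopts an actor-critic scheme that alternates between optimizing the value and policy networks with stochastic gradient descent, using the following parameterization.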

State value function : $V_{\psi} (\mathbf{s}_t)$
Soft Q-function : $Q_{\theta} (\mathbf{s}_t, \mathbf{a}_t)$
Tractable policy : $\pi_{\phi} (\mathbf{a}_t | \mathbf{s}_t)$

In principle the value function needs no separate parameterization, since it can be estimated through Eqn.(3), but the paper reports that giving it its own parameters makes training more effective and stable.

State value function

The state value parameters $\psi$ are trained to minimize the squared residual error.

$J_V(\psi) = \mathbb{E}_{\mathbf{s}_t \sim \mathcal{D}} \left [ \frac{1}{2} ( V_{\psi}(\mathbf{s}_t) - \mathbb{E}_{\mathbf{a}_t \sim \pi_{\phi}} [ Q_{\theta}(\mathbf{s}_t, \mathbf{a}_t) - \log \pi_{\phi} (\mathbf{a}_t | \mathbf{s}_t) ] )^2 \right ] \tag{5}$

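Here, $\mathcal{D}$ is the distribution of previously sampled states and actions, usually called the replay buffer. The gradient of Eqn.(5) can be estimated with the following unbiased estimator, where the action is sampled from the current policy rather than from the replay buffer.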

$\hat{\nabla}_{\psi} J_V(\psi) = \nabla_{\psi} V_{\psi}(\mathbf{s}_t) ( V_{\psi}(\mathbf{s}_t) - Q_{\theta}(\mathbf{s}_t, \mathbf{a}_t) + \log \pi_{\phi} (\mathbf{a}_t | \mathbf{s}_t) ) \tag{6}$
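
The estimator in Eqn.(6) can be written out for a toy case. This is a minimal sketch assuming a scalar state and a hypothetical linear value function $V_{\psi}(s) = \psi_0 + \psi_1 s$ . The sample values of $Q_{\theta}$ and $\log \pi_{\phi}$ are made up, and the learning rate is arbitrary.

```python
def v(psi, s):
    """Hypothetical linear value function V_psi(s) = psi[0] + psi[1] * s."""
    return psi[0] + psi[1] * s

def value_grad(psi, s, q_sa, log_pi_a):
    """Eqn.(6): grad_psi V_psi(s) * (V_psi(s) - Q(s, a) + log pi(a|s))."""
    residual = v(psi, s) - q_sa + log_pi_a
    grad_v = [1.0, s]  # gradient of the linear V_psi with respect to psi
    return [g * residual for g in grad_v]

def sample_loss(psi, s, q_sa, log_pi_a):
    """Single-sample version of Eqn.(5), with the action drawn from pi_phi."""
    return 0.5 * (v(psi, s) - (q_sa - log_pi_a)) ** 2

psi = [0.1, -0.3]
s, q_sa, log_pi_a = 2.0, 1.5, -0.7  # made-up sample, with a ~ pi_phi(.|s)
grad = value_grad(psi, s, q_sa, log_pi_a)

lr = 0.05  # hypothetical learning rate
psi_new = [p - lr * g for p, g in zip(psi, grad)]

print("gradient:", grad)
print("loss before:", sample_loss(psi, s, q_sa, log_pi_a))
print("loss after: ", sample_loss(psi_new, s, q_sa, log_pi_a))
```

One gradient step moves $V_{\psi}(\mathbf{s}_t)$ toward the target $Q_{\theta} - \log \pi_{\phi}$ , so the single-sample loss decreases.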

Soft Q-function

The soft Q-function parameters $\theta$ are trained to minimize the soft Bellman residual.

$J_Q(\theta) = \mathbb{E}_{(\mathbf{s}_t, \mathbf{a}_t) \sim \mathcal{D}} \left [ \frac{1}{2} ( Q_{\theta}(\mathbf{s}_t, \mathbf{a}_t) - \hat{Q} (\mathbf{s}_t, \mathbf{a}_t) )^2 \right ] \tag{7}$

$\text{where,} \quad \hat{Q} (\mathbf{s}_t, \mathbf{a}_t) = r(\mathbf{s}_t, \mathbf{a}_t) + \gamma \mathbb{E}_{\mathbf{s}_{t+1} \sim p} [V_{\bar{\psi}}(\mathbf{s}_{t+1})] \tag{8}$

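Eqn.(7) can be optimized with the following stochastic gradient. Here, $V_{\bar{\psi}}$ is a target value network whose parameters $\bar{\psi}$ can be an exponentially moving average of the value network weights.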

$\hat{\nabla}_{\theta} J_Q(\theta) = \nabla_{\theta} Q_{\theta} (\mathbf{s}_t, \mathbf{a}_t) \left ( Q_{\theta} (\mathbf{s}_t, \mathbf{a}_t) - r(\mathbf{s}_t, \mathbf{a}_t) - \gamma V_{\bar{\psi}}(\mathbf{s}_{t+1}) \right ) \tag{9}$
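
A toy version of Eqn.(9) together with a moving-average target update might look like the following. It is only a sketch: the linear forms of $Q_{\theta}$ and $V_{\bar{\psi}}$ , the transition values, and the constants `GAMMA` and `TAU` are assumptions for illustration, and `soft_update` is a hypothetical helper name.

```python
GAMMA = 0.99
TAU = 0.005  # hypothetical smoothing coefficient for the target network

def q(theta, s, a):
    """Hypothetical linear Q-function Q_theta(s, a)."""
    return theta[0] + theta[1] * s + theta[2] * a

def v(psi, s):
    """Hypothetical linear value function, used here with target weights psi_bar."""
    return psi[0] + psi[1] * s

def q_grad(theta, psi_bar, s, a, r, s_next):
    """Eqn.(9): grad_theta Q * (Q(s, a) - r - gamma * V_psi_bar(s'))."""
    residual = q(theta, s, a) - r - GAMMA * v(psi_bar, s_next)
    return [g * residual for g in (1.0, s, a)]

def soft_update(psi_bar, psi, tau=TAU):
    """Moving average of the weights: psi_bar <- tau * psi + (1 - tau) * psi_bar."""
    return [tau * p + (1.0 - tau) * pb for p, pb in zip(psi, psi_bar)]

theta = [0.0, 0.0, 0.0]
psi = [0.0, 0.0]        # current value weights
psi_bar = [1.0, 0.5]    # target value weights
grad = q_grad(theta, psi_bar, s=1.0, a=2.0, r=0.5, s_next=2.0)
psi_bar = soft_update(psi_bar, psi)

print("Q gradient:", grad)
print("updated target weights:", psi_bar)
```

Keeping $\bar{\psi}$ slowly varying means the regression target in Eqn.(8) does not shift abruptly with every update of $\psi$ .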

Tractable Policy

The policy parameters $\phi$ are trained to minimize the expected KL-divergence.

$J_{\pi}(\phi) = \mathbb{E}_{\mathbf{s}_t \sim \mathcal{D}} \left [ D_{KL} \left ( \pi_{\phi} (\cdot | \mathbf{s}_t) \middle \| \frac{\exp(Q_{\theta} (\mathbf{s}_t, \cdot))} {Z_{\theta} (\mathbf{s}_t)} \right ) \right ] \tag{10}$

There are several ways to minimize $J_{\pi}$ , but here the target density is the Q-function, which is represented by a neural network and is therefore differentiable, so the policy is reparameterized as follows.

$\mathbf{a}_t = f_{\phi}(\epsilon_t; \mathbf{s}_t) \tag{11}$

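Here, $\epsilon_t$ is an input noise vector sampled from some fixed distribution. With the policy expressed as the output of a neural network in this way, Eqn.(10) can be rewritten as follows, where the partition function is dropped because it does not depend on $\phi$ .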

$J_{\pi}(\phi) = \mathbb{E}_{\mathbf{s}_t \sim \mathcal{D}, \epsilon_t \sim \mathcal{N}} \left [ \log \pi_{\phi} (f_{\phi} (\epsilon_t; \mathbf{s}_t) | \mathbf{s}_t) - Q_{\theta}(\mathbf{s}_t, f_{\phi} (\epsilon_t; \mathbf{s}_t)) \right ] \tag{12}$

That is, $\pi_{\phi}$ is defined implicitly in terms of $f_{\phi}$ , and Eqn.(12) is optimized with the following gradient.

$\hat{\nabla}_{\phi} J_{\pi}(\phi) = \nabla_{\phi} \log \pi_{\phi} (\mathbf{a}_t | \mathbf{s}_t) + (\nabla_{\mathbf{a}_t} \log \pi_{\phi} (\mathbf{a}_t | \mathbf{s}_t) - \nabla_{\mathbf{a}_t} Q_{\theta}(\mathbf{s}_t, \mathbf{a}_t) ) \nabla_{\phi} f_{\phi} ( \epsilon_t; \mathbf{s}_t) \tag{13}$
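
To see the reparameterized gradient at work, the sketch below evaluates Eqn.(13) term by term for a one-dimensional Gaussian policy. Everything specific here is an assumption for illustration: the state is ignored, $\phi = (\mu, \log \sigma)$ , $f_{\phi}(\epsilon) = \mu + \sigma \epsilon$ , and the critic is a fixed hypothetical function $Q(a) = -(a - C)^2$ , for which $\exp(Q) / Z$ is a Gaussian with mean $C$ and variance $1/2$ .

```python
import math
import random

C = 1.0  # hypothetical: Q(a) = -(a - C)^2, so exp(Q)/Z is N(C, 1/2)

def f(phi, eps):
    """Eqn.(11): a = f_phi(eps) = mu + sigma * eps, with phi = (mu, log sigma)."""
    return phi[0] + math.exp(phi[1]) * eps

def log_pi(phi, a):
    """Log-density of the Gaussian policy N(mu, sigma^2) at action a."""
    mu, sigma = phi[0], math.exp(phi[1])
    return -0.5 * ((a - mu) / sigma) ** 2 - phi[1] - 0.5 * math.log(2 * math.pi)

def objective(phi, eps):
    """Integrand of Eqn.(12) for one noise sample: log pi(a) - Q(a)."""
    a = f(phi, eps)
    return log_pi(phi, a) - (-(a - C) ** 2)

def policy_grad(phi, eps):
    """Eqn.(13), term by term, for the Gaussian policy above."""
    mu, sigma = phi[0], math.exp(phi[1])
    a = f(phi, eps)
    dlogpi_dphi = [(a - mu) / sigma ** 2, ((a - mu) / sigma) ** 2 - 1.0]  # a held fixed
    dlogpi_da = -(a - mu) / sigma ** 2
    dq_da = -2.0 * (a - C)
    df_dphi = [1.0, sigma * eps]
    return [dlogpi_dphi[i] + (dlogpi_da - dq_da) * df_dphi[i] for i in range(2)]

def train(steps=3000, batch=128, lr=0.02, seed=0):
    """Plain SGD on J_pi with minibatches of noise samples."""
    rng = random.Random(seed)
    phi = [-1.0, 0.0]
    for _ in range(steps):
        grads = [policy_grad(phi, rng.gauss(0.0, 1.0)) for _ in range(batch)]
        phi = [p - lr * sum(g[i] for g in grads) / batch for i, p in enumerate(phi)]
    return phi

phi = train()
print("learned mean:", phi[0])
print("learned variance:", math.exp(2 * phi[1]))
```

In this toy setting the policy family contains the target density, so the learned mean and variance should end up near $C$ and $1/2$ .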

SAC Algorithm

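Finally, the SAC algorithm alternates between collecting experience from the environment with the current policy and updating the function approximators with stochastic gradients computed on batches sampled from the replay buffer. To avoid positive bias in the policy improvement step, it uses two Q-functions and takes the minimum of the two for the value gradient in Eqn.(6) and the policy gradient in Eqn.(13).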
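
The effect of taking the minimum of two Q-functions can be illustrated with a toy experiment (this is not from the paper). The sketch assumes the true Q-value is 0 for every action and that each critic adds independent Gaussian noise. It then compares the average of the maximum over actions for one critic against the same quantity for the minimum of two critics.

```python
import random

def overestimation(num_actions=5, trials=5000, noise=1.0, seed=0):
    """True Q is 0 for every action, so any positive average returned is pure bias."""
    rng = random.Random(seed)
    single, double = 0.0, 0.0
    for _ in range(trials):
        q1 = [rng.gauss(0.0, noise) for _ in range(num_actions)]
        q2 = [rng.gauss(0.0, noise) for _ in range(num_actions)]
        single += max(q1)                                 # greedy value, one noisy critic
        double += max(min(a, b) for a, b in zip(q1, q2))  # greedy value, min of two critics
    return single / trials, double / trials

single_bias, double_bias = overestimation()
print("one critic:          ", single_bias)
print("min of two critics:  ", double_bias)
```

A policy that maximizes a noisy critic tends to exploit its overestimates, and the elementwise minimum of two independent critics reduces that upward bias in this toy setting.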
