[2017] Attention is all you need

#Transformer #Attention is all you need

0. Abstract

  • The dominant sequence transduction models are based on complex RNN or CNN architectures that include an encoder and a decoder.

  • We propose the Transformer, a simple new network architecture that dispenses with recurrence and convolution entirely and relies solely on attention mechanisms.

  • Experiments on two translation tasks show that these models are superior in quality, and that the Transformer generalizes well to other tasks.

1. Introduction

  • RNNs, LSTMs, and GRUs have been firmly established as state-of-the-art approaches to sequence modeling and transduction problems such as language modeling and machine translation, and numerous efforts have since continued to push the boundaries of recurrent language models and encoder-decoder architectures.

    • Recurrent models typically factor computation along the symbol positions of the input and output sequences. Aligning positions to computation steps, they generate a hidden state $h_t$ as a function of the previous hidden state $h_{t-1}$ and the input at position $t$.

    • This inherently sequential nature precludes parallelization, which becomes critical at longer sequence lengths, and memory constraints limit batching across examples.

    • Factorization tricks and conditional computation have significantly improved computational efficiency, and the latter also improved model performance, but the fundamental constraint of sequential computation remains.

  • Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing dependencies to be modeled without regard to their distance in the input or output sequences.

  • In this work we propose the Transformer, a model architecture that eschews recurrence and instead relies entirely on an attention mechanism to draw global dependencies between input and output.

    • The model allows for significantly more parallelization and reached state-of-the-art translation quality after being trained for as little as twelve hours on eight P100 GPUs.

2. Background

  • The goal of reducing sequential computation is also addressed by the Extended Neural GPU, ByteNet, and ConvS2S, all of which use CNNs.

    • In these models it is difficult to learn dependencies between distant input and output positions.

    • In the Transformer, Multi-Head Attention reduces this to a constant number of operations.

  • Self-attention, sometimes called intra-attention, is an attention mechanism that relates different positions of a single sequence in order to compute a representation of the sequence.

    • It has been used successfully in a variety of tasks including reading comprehension, abstractive summarization, textual entailment, and learning task-independent sentence representations.

  • End-to-End Memory Networks are based on a recurrent attention mechanism instead of sequence-aligned recurrence and have been shown to perform well on simple-language question answering and language modeling tasks.

  • The Transformer is the first transduction model relying entirely on Self-Attention to compute representations of its input and output without using sequence-aligned RNNs or CNNs.

3. Model Architecture

  • Most competitive neural sequence transduction models have an encoder-decoder structure.

  • The encoder maps an input sequence $(x_1, \cdots, x_n)$ to a sequence of continuous representations $z = (z_1, \cdots, z_n)$.

  • Given $z$, the decoder generates an output sequence $(y_1, \cdots, y_m)$ one element at a time.

  • At each step the model is auto-regressive, consuming the previously generated symbols as additional input when generating the next (a minimal decoding-loop sketch follows this list).

  • As shown in the left and right halves of Figure 1, the Transformer follows this structure for the encoder and decoder respectively, using Self-Attention and point-wise, fully connected layers.
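
A minimal sketch of the auto-regressive decoding loop described above, assuming a hypothetical `transformer(src_tokens, tgt_prefix)` function that returns next-token scores; the stand-in below only returns random scores for illustration.

```python
import numpy as np

def transformer(src_tokens, tgt_prefix, vocab_size=6):
    # Stand-in for the real model: returns arbitrary next-token scores (illustration only).
    rng = np.random.default_rng(len(tgt_prefix))
    return rng.normal(size=vocab_size)

def greedy_decode(src_tokens, bos_id=1, eos_id=2, max_len=10):
    tgt = [bos_id]                          # start from the begin-of-sequence symbol
    for _ in range(max_len):
        scores = transformer(src_tokens, tgt)
        next_id = int(np.argmax(scores))    # pick the most probable next symbol
        tgt.append(next_id)                 # previously generated symbols become extra input
        if next_id == eos_id:
            break
    return tgt

print(greedy_decode([5, 3, 4]))
```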

3.1 Encoder and Decoder Stacks

Encoder

  • The encoder is composed of a stack of N = 6 identical layers.

  • Each layer has two sub-layers.

    • The first is a Multi-Head Self-Attention mechanism.

    • The second is a position-wise fully connected FFN.

  • A residual connection is employed around each of the two sub-layers, followed by layer normalization.

  • The output of each sub-layer is $\mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))$, where $\mathrm{Sublayer}(x)$ is the function implemented by the sub-layer itself (a minimal sketch follows this list).

  • To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension $d_{model}=512$.
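
A minimal NumPy sketch of the residual connection plus layer normalization, $\mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))$; the toy `sublayer` below stands in for self-attention or the FFN and is an assumption for illustration.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each position's d_model-dimensional vector to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def sublayer_output(x, sublayer):
    # Output of every sub-layer: LayerNorm(x + Sublayer(x)).
    return layer_norm(x + sublayer(x))

d_model = 512
x = np.random.randn(10, d_model)              # 10 positions, each of dimension d_model
out = sublayer_output(x, lambda h: 0.1 * h)   # toy sub-layer standing in for attention/FFN
print(out.shape)                              # (10, 512)
```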

Decoder

  • The decoder is likewise composed of a stack of N = 6 identical layers.

  • In addition to the two sub-layers found in each encoder layer, a third sub-layer is inserted.

    • It performs Multi-Head Attention over the output of the encoder stack.

  • Residual connections are employed around each sub-layer, followed by layer normalization.

  • The self-attention sub-layer in the decoder is modified to prevent positions from attending to subsequent positions; the method used for this is masking.

    • When making a prediction at position i, attending to future positions is impossible, and the model depends only on that position and the positions before it.

3.2 Attention

  • An attention function maps a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors.

  • The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key (there are many such functions: dot-product, Bahdanau/additive, etc.).

3.2.1 Scaled Dot-Product Attention

  • The input of Scaled Dot-Product Attention consists of queries and keys of dimension $d_k$ and values of dimension $d_v$.

  • The dot products of the query with all keys are computed, each is divided by $\sqrt{d_k}$, and a softmax function is applied to obtain the weights on the values (sketched below).

  • The two most commonly used attention functions are additive attention and dot-product attention; dot-product attention is identical to the algorithm in this paper except for the scaling factor of $\frac{1}{\sqrt{d_k}}$.

  • Additive attention computes the compatibility function using a feed-forward network.

  • The two are similar in theoretical complexity, but dot-product attention is much faster and more space-efficient in practice.

    • For small values of $d_k$ the two mechanisms perform similarly, but additive attention outperforms dot-product attention for larger values of $d_k$.

    • For large $d_k$ the dot products grow large in magnitude, pushing the softmax into regions with extremely small gradients, so dot-product attention is scaled by $\frac{1}{\sqrt{d_k}}$.
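
A minimal NumPy sketch of scaled dot-product attention, $\mathrm{Attention}(Q,K,V) = \mathrm{softmax}(QK^T / \sqrt{d_k})\,V$; the shapes below are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)          # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)   # (len_q, len_k) compatibility scores
    if mask is not None:
        scores = np.where(mask, scores, -1e9)        # masked-out positions get ~ -inf
    weights = softmax(scores, axis=-1)               # attention weights over the keys
    return weights @ V                               # weighted sum of the values

Q = np.random.randn(5, 64)    # 5 query positions, d_k = 64
K = np.random.randn(7, 64)    # 7 key positions
V = np.random.randn(7, 64)    # d_v = 64
print(scaled_dot_product_attention(Q, K, V).shape)   # (5, 64)
```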

3.2.2 Multi-Head Attention

  • Instead of performing a single attention function with $d_{model}$-dimensional keys, values, and queries, it was found more effective to linearly project the queries, keys, and values $h$ times to $d_k$, $d_k$, and $d_v$ dimensions, respectively (each projection is learned separately).

  • The attention function is then applied in parallel to these projected queries, keys, and values, yielding $h$ outputs of dimension $d_v$. These are concatenated and linearly projected once more to obtain the final output vector (sketched below).

  • Parameters

    • $W^Q_i \in \mathbb{R}^{d_{model} \times d_k}$

    • $W^K_i \in \mathbb{R}^{d_{model} \times d_k}$

    • $W^V_i \in \mathbb{R}^{d_{model} \times d_v}$

    • $W^O \in \mathbb{R}^{hd_v \times d_{model}}$

    • In this work $h=8$ and $d_k=d_v=d_{model}/h=64$. Because the dimensionality of each head is reduced, the total computational cost is similar to that of single-head attention with full dimensionality.
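
A minimal NumPy sketch of multi-head attention, reusing the `scaled_dot_product_attention` sketch above; the random matrices stand in for the learned projections $W^Q_i$, $W^K_i$, $W^V_i$, $W^O$ and are assumptions for illustration.

```python
import numpy as np

d_model, h = 512, 8
d_k = d_v = d_model // h        # 64 per head

def multi_head_attention(Q, K, V):
    rng = np.random.default_rng(0)
    heads = []
    for _ in range(h):
        Wq = rng.normal(size=(d_model, d_k))     # W^Q_i (learned in the real model)
        Wk = rng.normal(size=(d_model, d_k))     # W^K_i
        Wv = rng.normal(size=(d_model, d_v))     # W^V_i
        heads.append(scaled_dot_product_attention(Q @ Wq, K @ Wk, V @ Wv))
    Wo = rng.normal(size=(h * d_v, d_model))     # W^O
    return np.concatenate(heads, axis=-1) @ Wo   # concatenate the h heads, then project

x = np.random.randn(10, d_model)                 # self-attention case: Q = K = V = x
print(multi_head_attention(x, x, x).shape)       # (10, 512)
```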

3.2.3 Applications of Attention in our Model

  • The Transformer uses Multi-Head Attention in three different ways.

  • Encoder-decoder attention layers

    • The queries come from the previous decoder layer.

    • The keys and values come from the output of the encoder.

      • Every position in the decoder can attend over all positions in the input sequence.

    • This mimics the typical attention mechanism in Seq2Seq models.

  • Encoder

    • The encoder contains self-attention layers.

    • The queries, keys, and values all come from the output of the previous encoder layer.

    • Each position in the encoder can attend to all positions in the previous layer of the encoder.

  • Decoder

    • Each position in the decoder can attend to all positions in the decoder up to and including that position.

    • To preserve the decoder's auto-regressive property, leftward information flow must be blocked (i.e., looking ahead at future words must not influence the choice of the current word).

      • This is implemented inside scaled dot-product attention by masking out (setting to $-\infty$) all values in the input of the softmax that correspond to illegal connections, so that those positions become 0 after the softmax (sketched below).
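
A minimal sketch of the decoder's causal mask, assuming the `scaled_dot_product_attention` sketch above, where entries that are `False` are pushed to ~$-\infty$ before the softmax.

```python
import numpy as np

def causal_mask(seq_len):
    # Position i may attend only to positions <= i (lower-triangular boolean mask).
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

print(causal_mask(4).astype(int))
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]

x = np.random.randn(4, 64)   # 4 decoder positions, d_k = 64
out = scaled_dot_product_attention(x, x, x, mask=causal_mask(4))
print(out.shape)             # (4, 64)
```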

3.3 Position-wise Feed-Forward Networks

  • Each layer of the encoder and decoder contains a fully connected feed-forward network.

    • It is applied to each position separately and identically.

    • It consists of two linear transformations with a ReLU activation in between.

  • The linear transformations are the same across positions, but each layer uses different parameters $(W_1, W_2)$ (a minimal sketch follows this list).

  • The input and output dimensionality is $d_{model}=512$.

  • The inner hidden layer of the FFN has dimensionality $d_{ff}=2048$.
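
A minimal NumPy sketch of the position-wise feed-forward network, $\mathrm{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$; the random weights stand in for the learned parameters and are assumptions for illustration.

```python
import numpy as np

d_model, d_ff = 512, 2048
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)     # first linear map (learned)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)  # second linear map (learned)

def ffn(x):
    # Applied to every position separately and identically: max(0, x W1 + b1) W2 + b2.
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

x = np.random.randn(10, d_model)   # 10 positions
print(ffn(x).shape)                # (10, 512)
```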

3.4 Embeddings and Softmax

  • Similarly to other sequence transduction models, learned embeddings are used.

    • They convert the input tokens and output tokens into vectors of dimension $d_{model}$.

  • The usual learned linear transformation and softmax function are used to convert the decoder output into predicted next-token probabilities.

    • The same weight matrix is shared between the two embedding layers and the pre-softmax linear transformation.

  • In the embedding layers, those weights are multiplied by $\sqrt{d_{model}}$ (see the sketch below).
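
A minimal sketch of the shared embedding / pre-softmax weight matrix, assuming a toy vocabulary; the single matrix `E` plays both roles, and the embedding output is scaled by $\sqrt{d_{model}}$.

```python
import numpy as np

vocab_size, d_model = 1000, 512
E = np.random.randn(vocab_size, d_model)       # the single shared weight matrix

def embed(token_ids):
    # Input/output token embedding, scaled by sqrt(d_model).
    return E[token_ids] * np.sqrt(d_model)

def output_logits(decoder_out):
    # Pre-softmax linear transformation reuses the same matrix (transposed).
    return decoder_out @ E.T

tokens = np.array([3, 17, 42])
print(embed(tokens).shape)                               # (3, 512)
print(output_logits(np.random.randn(3, d_model)).shape)  # (3, 1000)
```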

3.5 Positional Encoding

  • Since the Transformer uses no recurrence and no convolution, information about the positions of the tokens must be injected for the model to make use of the order of the sequence.

  • "Positional Encodings" are added to the input embeddings at the bottoms of the encoder and decoder stacks.

  • The positional encodings have the same dimension $d_{model}$ as the input embeddings, so the two can be summed.

  • There are many kinds of positional encodings; the Transformer uses sine and cosine functions of different frequencies.

  • ํŒŒ๋ผ๋ฏธํ„ฐ๋“ค

    • pospos : position์„ ์˜๋ฏธํ•จ

    • ii : ์ฐจ์›์„ ์˜๋ฏธํ•จ

  • positional encoding์˜ ๊ฐ ์ฐจ์›ii๋Š” Sine Curve์— ๋Œ€์‘๋จ.

  • ํŒŒ์žฅ์€ 2ฯ€2\pi์—์„œ 10000ร—2ฯ€10000\times2\pi๊นŒ์ง€ ๊ธฐํ•˜๊ธ‰์ˆ˜์ ์œผ๋กœ ๋Š˜์–ด๋‚จ.

  • ์œ„์˜ ํ•จ์ˆ˜๋ฅผ ์„ ํƒํ•œ ์ด์œ ๋Š” ์–ด๋–ค ๊ณ ์ •๋œ ์˜คํ”„์…‹ kk์— ๋Œ€ํ•ด,PEpos+kPE_{pos+k}๊ฐ€ PEposPE_{pos}์˜ ์„ ํ˜•ํ•จ์ˆ˜๋กœ ํ‘œํ˜„๋  ๊ฒƒ์ด๋ผ๋Š” ๊ฐ€์„ค๋•Œ๋ฌธ. (์ƒ๋Œ€์ ์œผ๋กœ ์œ„์น˜๋ฅผ ์‰ฝ๊ฒŒ ํ•™์Šตํ•  ๊ฒƒ์ด๋‹ค!)

  • ์‹คํ—˜๊ฒฐ๊ณผ ๋‘ ๋ฒ„์ „์ด ๋™์ผํ•œ ๊ฒฐ๊ณผ๋ฅผ ์‚ฐ์ถœํ•˜์˜€๊ณ  ๊ฒฐ๊ตญ Sine Curve๋ฅผ ์„ ํƒํ•จ.

    • ๋ชจ๋ธ์ด ํ›ˆ๋ จ์ค‘์— ์ ‘ํ•œ ๊ฒƒ๋ณด๋‹ค ๋” ๊ธด ์‹œํ€€์Šค ๊ธธ์ด์— ๋Œ€ํ•ด์„œ๋„ ์ถ”์ •ํ•  ์ˆ˜ ์žˆ๊ธฐ ๋•Œ๋ฌธ

4. Why Self-Attention

  • Self-attention is compared to recurrent and convolutional layers with respect to the following (summarized in the table after this list):

    • Total computational complexity per layer

    • The amount of computation that can be parallelized (measured by the minimum number of sequential operations required)

    • The path length between long-range dependencies in the network

      • The length of the forward and backward paths that signals have to traverse in the network is a key factor affecting the ability to learn such dependencies.

      • The shorter the paths between any combination of positions in the input and output sequences, the easier it is to learn long-range dependencies.

      • -> Hence the maximum path length between any input and output positions is compared.
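
For reference, the comparison as reported in Table 1 of the paper, where n is the sequence length, d the representation dimension, k the convolution kernel width, and r the size of the restricted self-attention neighborhood:

| Layer Type | Complexity per Layer | Sequential Operations | Maximum Path Length |
| --- | --- | --- | --- |
| Self-Attention | $O(n^2 \cdot d)$ | $O(1)$ | $O(1)$ |
| Recurrent | $O(n \cdot d^2)$ | $O(n)$ | $O(n)$ |
| Convolutional | $O(k \cdot n \cdot d^2)$ | $O(1)$ | $O(\log_k(n))$ |
| Self-Attention (restricted) | $O(r \cdot n \cdot d)$ | $O(1)$ | $O(n/r)$ |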

5. Training

5.1 Training Data and Batching

  • English-German

    • The WMT 2014 English-German dataset, consisting of about 4.5 million sentence pairs, was used.

    • Sentences were encoded with byte-pair encoding.

    • The shared source-target vocabulary has about 37,000 tokens.

  • English-French

    • The WMT 2014 English-French dataset, consisting of 36 million sentences, was used.

    • It uses a 32,000 word-piece vocabulary.

5.2 Hardware and Schedule

  • Eight NVIDIA P100 GPUs were used.

  • The base model was trained for 12 hours (100,000 steps).

  • The big model was trained for 3.5 days (300,000 steps).

5.3 Optimizer

  • The Adam optimizer was used (the warmup learning-rate schedule is sketched after this list).

    • $\beta_1=0.9$

    • $\beta_2=0.98$

    • $\epsilon=10^{-9}$

    • $warmup\_steps=4000$
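
A small sketch of the learning-rate schedule that $warmup\_steps$ refers to in the paper: $lrate = d_{model}^{-0.5} \cdot \min(step^{-0.5},\ step \cdot warmup\_steps^{-1.5})$.

```python
def lrate(step, d_model=512, warmup_steps=4000):
    # Linear warmup for the first warmup_steps steps, then decay proportional to 1/sqrt(step).
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

print(lrate(1))        # very small learning rate at the start of warmup
print(lrate(4000))     # peak around step = warmup_steps
print(lrate(100000))   # decayed by ~1/sqrt(step) afterwards
```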

5.4 Regularization

  • Residual Dropout

    • Dropout is applied to the output of each sub-layer, before it is added to the sub-layer input and normalized.

    • Dropout is also applied to the sums of the embeddings and the positional encodings in both the encoder and decoder stacks.

    • $P_{drop}=0.1$

  • Label Smoothing

    • During training, a label smoothing value of $\epsilon_{ls}=0.1$ was used (sketched below).

  • Although this makes the model more unsure, it improved accuracy and BLEU score.
