Trend-007, ReZero is All You Need: Fast Convergence at Large Depth (2020.03.10-Arxiv)

ReZero initializes an arbitrary layer as the identity map, using a single additional learned parameter per layer.

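With this, ReZero-Transformer networks are reported to be easy to train even with more than 100 layers.
In fact, applying ReZero to a 12-layer Transformer makes it train 56% faster on enwiki8.
ReZero can also be applied to other residual networks; applied to ResNet-56, it converges 43% faster when trained on CIFAR-10.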
The idea is simple: ReZero initializes each layer to perform the identity operation.
For each layer, we introduce a residual connection for the input signal x and one trainable parameter α that modulates the non-trivial transformation of the layer F(x):

x_{i+1} = x_i + α_i F(x_i)

Rather than propagating the signal through each of the non-trivial functions F[W_i] at initialization, we add a skip connection and rescale the function by L learnable parameters α_i (which we call residual weights) that are initialized to zero. The signal now propagates according to

x_{i+1} = x_i + α_i F[W_i](x_i)
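
The residual-weight update described above can be sketched in a few lines of plain Python. This is only an illustrative toy, not the authors' implementation: `make_layer`, `rezero_forward`, the tanh layer, and the depth and width values are all hypothetical choices made for the example. It shows that when every α_i is zero, a stack of 128 layers returns its input unchanged.

```python
import math
import random


def make_layer(width, rng):
    """Return a toy layer F(x) = tanh(W x) with a random square matrix W (assumed for illustration)."""
    weights = [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(width)]

    def layer(x):
        return [math.tanh(sum(w * v for w, v in zip(row, x))) for row in weights]

    return layer


def rezero_forward(x, layers, alphas):
    """Propagate x through the stack with x_{i+1} = x_i + alpha_i * F_i(x_i)."""
    for layer, alpha in zip(layers, alphas):
        fx = layer(x)
        x = [xi + alpha * fi for xi, fi in zip(x, fx)]
    return x


rng = random.Random(0)
depth, width = 128, 4
layers = [make_layer(width, rng) for _ in range(depth)]
alphas = [0.0] * depth  # residual weights are initialized to zero

x0 = [0.5, -1.0, 2.0, 0.0]
out = rezero_forward(x0, layers, alphas)
print(out == x0)  # True: at initialization the whole stack is the identity map
```

In a real model each α_i would be a trainable scalar updated by the optimizer together with the layer weights; here they are plain floats so that the identity-at-initialization property is easy to check.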
