[AI Paper Review - 06] Reformer: The Efficient Transformer
[Reformer: The Efficient Transformer]
Since the transformer was first introduced in 2017, it has been applied with great success to tasks where deep learning handles sequence data, most notably natural language processing.
Self-attention is its critical ingredient: it models dependencies among the tokens of a sequence without recurrent connections or convolutional kernels. Large transformer models routinely achieve state-of-the-art results on a number of tasks, but training them can be prohibitively costly, especially on long sequences. Recently, [Kitaev et al., 2020] proposed the Reformer, which makes the vanilla transformer far more efficient so that it scales to much longer sequences.
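For reference, here is a minimal numpy sketch of standard (dense) scaled dot-product attention; the T x T score matrix it materializes is exactly the quadratic cost in sequence length that the Reformer targets. The shapes and toy sizes are illustrative only.

```python
import numpy as np

def dot_product_attention(Q, K, V):
    """Standard scaled dot-product attention. The (T, T) score matrix is
    what makes memory and time grow as O(T^2) in the sequence length T."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                          # (T, T) -- quadratic in T
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)         # row-wise softmax
    return weights @ V

# toy example: sequence length T = 6, model dimension d = 8
rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(6, 8))
print(dot_product_attention(Q, K, V).shape)  # (6, 8)
```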
Two key ingredients of the Reformer are:
(1) LSH attention (in place of dense dot-product attention), which reduces the complexity from O(T^2) to O(T log T), where T is the sequence length (see the bucketing sketch after this list);
(2) reversible residual layers, which allow activations to be stored only once during training instead of once per layer (see the reversible-block sketch after this list).
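The first ingredient relies on angular locality-sensitive hashing: queries/keys are projected onto random rotations and bucketed, and attention is restricted to tokens falling in the same bucket. Below is a minimal numpy sketch of just the bucketing step described in the paper (the full LSH attention additionally sorts by bucket, chunks, and uses multiple hash rounds); the sizes used here are arbitrary toy values.

```python
import numpy as np

def lsh_bucket(vectors, n_buckets, rng):
    """Angular LSH bucketing: h(x) = argmax([xR; -xR]) for a random
    projection matrix R with n_buckets // 2 columns."""
    d = vectors.shape[-1]
    R = rng.normal(size=(d, n_buckets // 2))   # random rotations
    rotated = vectors @ R                      # (T, n_buckets // 2)
    return np.argmax(np.concatenate([rotated, -rotated], axis=-1), axis=-1)

# toy example: T = 8 tokens of dimension d = 16, hashed into 4 buckets
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))
buckets = lsh_bucket(x, n_buckets=4, rng=rng)
# attention is then computed only among tokens sharing a bucket,
# which (after sorting and chunking) yields the O(T log T) cost
print(buckets)
```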
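The second ingredient follows the RevNet idea: each block's inputs can be recomputed from its outputs, so intermediate activations need not be kept for backpropagation. A minimal sketch under toy assumptions is shown below; in the Reformer, f corresponds to the attention sublayer and g to the feed-forward sublayer, while here they are arbitrary stand-in functions.

```python
import numpy as np

def rev_block_forward(x1, x2, f, g):
    """One reversible residual block: y1 = x1 + f(x2), y2 = x2 + g(y1)."""
    y1 = x1 + f(x2)
    y2 = x2 + g(y1)
    return y1, y2

def rev_block_inverse(y1, y2, f, g):
    """Recover the inputs from the outputs alone, so activations
    do not have to be stored once per layer."""
    x2 = y2 - g(y1)
    x1 = y1 - f(x2)
    return x1, x2

# toy check with hypothetical sublayers f and g
rng = np.random.default_rng(0)
f = lambda x: np.tanh(x)           # stand-in for the attention sublayer
g = lambda x: np.maximum(x, 0.0)   # stand-in for the feed-forward sublayer
x1, x2 = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
y1, y2 = rev_block_forward(x1, x2, f, g)
rx1, rx2 = rev_block_inverse(y1, y2, f, g)
assert np.allclose(rx1, x1) and np.allclose(rx2, x2)
```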
Curious about the paper on the efficient transformer?
Link to the paper ↓
https://arxiv.org/pdf/2001.04451.pdf