File talk:Full GPT architecture.png
Source?
Where exactly does this graph come from?
The original GPT paper (Radford et al.), like the original Transformer paper, applied layer normalization after the attention/feedforward sublayers (post-normalization).
More recent implementations typically use pre-normalization instead.
Cf. "On Layer Normalization in the Transformer Architecture" (2020) by Xiong, Yang, He, K. Zheng, S. Zheng, Xing, Zhang, Lan, Wang, and Liu, https://arxiv.org/abs/2002.04745
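For clarity, the two variants differ only in where the normalization sits relative to the residual connection. A minimal NumPy sketch (function names are my own, not from either paper):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize the last dimension to zero mean and unit variance.
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def post_norm_block(x, sublayer):
    # Post-norm (original Transformer, GPT-1): normalize AFTER the residual add.
    return layer_norm(x + sublayer(x))

def pre_norm_block(x, sublayer):
    # Pre-norm (used by later GPT models): normalize the sublayer input;
    # the residual path itself stays unnormalized.
    return x + sublayer(layer_norm(x))
```

Here `sublayer` stands for either the attention or the feedforward sub-block; the distinction the Xiong et al. paper analyzes is exactly this ordering.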
--2A01:C23:6C81:B00:DE52:9AB2:6F6B:91E1 21:02, 16 January 2024 (UTC)