File talk:Full GPT architecture.png


Source?

Where exactly does this graph come from?

The original GPT paper (Radford et al., 2018), like the original Transformer paper (Vaswani et al., 2017), places the layer-normalization step after the attention/feed-forward sublayers (post-normalization).

Using pre-normalization, as this diagram does, seems to be the more modern choice.

Cf. On Layer Normalization in the Transformer Architecture (2020) by Xiong, Yang, He, K. Zheng, S. Zheng, Xing, Zhang, Lan, Wang, and Liu, https://arxiv.org/abs/2002.04745
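
For comparison, a minimal sketch of the two orderings in PyTorch-style code (the sizes and module names are illustrative, not taken from the diagram, and the causal attention mask is omitted for brevity):

import torch
import torch.nn as nn

d_model, n_heads = 768, 12  # illustrative sizes
attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                   nn.Linear(4 * d_model, d_model))
ln1, ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

def post_ln_block(x):
    # Original Transformer / GPT-1 ordering: sublayer, residual add, then LayerNorm.
    x = ln1(x + attn(x, x, x)[0])
    x = ln2(x + ff(x))
    return x

def pre_ln_block(x):
    # Pre-normalization (Xiong et al. 2020): LayerNorm first, then sublayer and residual add.
    h = ln1(x)
    x = x + attn(h, h, h)[0]
    x = x + ff(ln2(x))
    return x

x = torch.randn(1, 16, d_model)  # (batch, sequence, embedding)
print(post_ln_block(x).shape, pre_ln_block(x).shape)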

--2A01:C23:6C81:B00:DE52:9AB2:6F6B:91E1 21:02, 16 January 2024 (UTC)