nanoGPT repo reading notes


I personally haven't written PyTorch model code in a long time, and I found Andrej Karpathy's nanoGPT video a super helpful refresher.

He also released a repo with slightly more involved examples. I took some notes below as I went through it, learning about implementation details and PyTorch features. Hope they're useful to others as well.

Note: the repo was accessed around 5/18/2023.

Alternative implementation:

Distributed training
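The repo launches multi-GPU runs with torchrun and detects DDP by checking the env vars torchrun sets, so the same script works single- or multi-GPU. A minimal sketch of that pattern (variable names follow the repo; the `device` fallback logic is slightly simplified here):

```python
import os
import torch

# torchrun sets RANK/LOCAL_RANK/WORLD_SIZE; a plain `python train.py` run does
# not, so the presence of RANK is used to detect a distributed run.
ddp = int(os.environ.get("RANK", -1)) != -1
if ddp:
    torch.distributed.init_process_group(backend="nccl")
    ddp_rank = int(os.environ["RANK"])
    ddp_local_rank = int(os.environ["LOCAL_RANK"])
    device = f"cuda:{ddp_local_rank}"
    torch.cuda.set_device(device)
    master_process = ddp_rank == 0  # rank 0 does logging and checkpointing
else:
    master_process = True
    device = "cuda" if torch.cuda.is_available() else "cpu"

# later the model is wrapped so gradients are all-reduced across ranks:
# model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[ddp_local_rank])
```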


Wandb logging
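Wandb is kept as an optional dependency: a config flag gates both the import and the calls. A hedged sketch of that shape (the project/run names and metric values below are illustrative placeholders, not from the repo):

```python
# `wandb_log` is a config flag; flip to True (and `pip install wandb`) to enable.
wandb_log = False

if wandb_log:
    import wandb
    wandb.init(project="owt", name="run0", config={"lr": 6e-4})  # illustrative names

# inside the training loop, on the master process only:
iter_num, lossf, lr = 100, 3.21, 6e-4  # placeholder values
metrics = {"iter": iter_num, "train/loss": lossf, "lr": lr}
if wandb_log:
    wandb.log(metrics)
```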

Simulating larger batch
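Larger batches are simulated via gradient accumulation: run several micro-batches, scale each loss by `1/accum_steps` so the summed gradients match one big averaged batch, and only step the optimizer after the last micro-step. A toy sketch (the model and objective here are stand-ins):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 1, bias=False)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
accum_steps = 4

opt.zero_grad(set_to_none=True)
for micro_step in range(accum_steps):
    x = torch.randn(8, 4)          # one micro-batch
    loss = model(x).pow(2).mean()  # toy objective
    (loss / accum_steps).backward()  # scale so grads average over micro-batches
opt.step()  # one optimizer step per simulated large batch
```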

Auto mixed precision (AMP)
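The forward pass runs inside an autocast context: matmul-heavy ops execute in a lower-precision dtype while numerically sensitive ops stay in float32. The repo picks float16/bfloat16 on GPU; this sketch uses CPU + bfloat16 so it runs anywhere:

```python
import torch
import torch.nn as nn

# the repo builds this context once and reuses it around every forward pass
ctx = torch.autocast(device_type="cpu", dtype=torch.bfloat16)

model = nn.Linear(16, 16)
x = torch.randn(2, 16)
with ctx:
    y = model(x)  # the linear layer runs in bfloat16 under autocast
```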



Using scaler
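With float16, small gradients can underflow, so the loss is scaled up before backward and gradients are unscaled before the step. `GradScaler(enabled=False)` (used for bfloat16/float32) turns every call into a no-op, so the same code path works either way. A sketch with a stand-in model:

```python
import torch
import torch.nn as nn

# enabled only when training in float16; a no-op otherwise
scaler = torch.cuda.amp.GradScaler(enabled=False)

model = nn.Linear(8, 1)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss = model(torch.randn(4, 8)).pow(2).mean()

scaler.scale(loss).backward()  # backward on the (possibly) scaled loss
scaler.unscale_(opt)           # unscale so clipping sees true gradient magnitudes
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
scaler.step(opt)               # skips the step if grads overflowed
scaler.update()                # adjusts the scale factor for the next iteration
```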

Learning rate schedule

It is a linear warmup followed by a cosine decay down to a minimum learning rate.
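A reconstruction of the repo's `get_lr`: linear warmup, then cosine decay to `min_lr`, then a constant floor (the constants below are the repo's GPT-2 defaults, to the best of my recollection):

```python
import math

learning_rate = 6e-4
min_lr = 6e-5
warmup_iters = 2000
lr_decay_iters = 600000

def get_lr(it):
    if it < warmup_iters:   # 1) linear warmup
        return learning_rate * it / warmup_iters
    if it > lr_decay_iters: # 3) constant min_lr after decay finishes
        return min_lr
    # 2) cosine decay from learning_rate down to min_lr in between
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))  # goes 1 -> 0
    return min_lr + coeff * (learning_rate - min_lr)
```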

Batched MHA
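All heads are computed in one shot: a single `Linear` produces q, k, v for every head at once, then a reshape/transpose gives `(B, nh, T, hs)` so the attention matmuls run batched over heads. A sketch of the core (using the fused `scaled_dot_product_attention` path, available in PyTorch >= 2.0; the repo also has a manual fallback):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

B, T, n_embd, n_head = 2, 8, 32, 4
hs = n_embd // n_head  # head size

c_attn = nn.Linear(n_embd, 3 * n_embd)  # fused q, k, v projection
x = torch.randn(B, T, n_embd)
q, k, v = c_attn(x).split(n_embd, dim=2)
q = q.view(B, T, n_head, hs).transpose(1, 2)  # (B, nh, T, hs)
k = k.view(B, T, n_head, hs).transpose(1, 2)
v = v.view(B, T, n_head, hs).transpose(1, 2)

# causal scaled dot-product attention over all heads at once
y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
y = y.transpose(1, 2).contiguous().view(B, T, n_embd)  # re-merge heads
```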



Weight tying
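The token embedding and the final language-model head share one parameter tensor, so the `(vocab_size, n_embd)` matrix is stored (and trained) once:

```python
import torch
import torch.nn as nn

vocab_size, n_embd = 100, 32
wte = nn.Embedding(vocab_size, n_embd)
lm_head = nn.Linear(n_embd, vocab_size, bias=False)
lm_head.weight = wte.weight  # same Parameter object, not a copy
```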

Weight init
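Linear and Embedding weights are drawn from N(0, 0.02) with zero biases, and residual projections get a smaller std of `0.02 / sqrt(2 * n_layer)`, following GPT-2. A sketch (the repo selects residual projections by the `c_proj.weight` name suffix; the toy model below uses a stand-in name):

```python
import math
import torch
import torch.nn as nn

n_layer = 12

def init_weights(module):
    if isinstance(module, nn.Linear):
        nn.init.normal_(module.weight, mean=0.0, std=0.02)
        if module.bias is not None:
            nn.init.zeros_(module.bias)
    elif isinstance(module, nn.Embedding):
        nn.init.normal_(module.weight, mean=0.0, std=0.02)

model = nn.Sequential(nn.Linear(64, 64), nn.Linear(64, 64))
model.apply(init_weights)
# scaled init for residual projections, selected by parameter name:
for name, p in model.named_parameters():
    if name.endswith("1.weight"):  # stand-in for 'c_proj.weight' in this toy model
        nn.init.normal_(p, mean=0.0, std=0.02 / math.sqrt(2 * n_layer))
```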

Model output for train vs inference
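With targets given, the forward pass computes logits at every position plus the cross-entropy loss; without targets (inference), only the last position is needed for sampling, so the head runs on `x[:, [-1], :]` alone. A sketch of just that branch, with a stand-in `lm_head`:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, n_embd = 50, 16
lm_head = nn.Linear(n_embd, vocab_size, bias=False)

def head_forward(x, targets=None):
    if targets is not None:
        logits = lm_head(x)  # (B, T, vocab_size): all positions, for the loss
        loss = F.cross_entropy(logits.view(-1, vocab_size), targets.view(-1))
        return logits, loss
    # inference: only the last position's logits are needed for sampling
    logits = lm_head(x[:, [-1], :])  # (B, 1, vocab_size)
    return logits, None

x = torch.randn(2, 8, n_embd)
train_logits, loss = head_forward(x, targets=torch.randint(0, vocab_size, (2, 8)))
infer_logits, _ = head_forward(x)
```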

Copy weight from gpt2

Due to matching variable naming, the two implementations basically have the same parameter names, so most weights can be copied over directly.
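The one wrinkle: HF's GPT-2 stores attention and MLP weights as `Conv1D` (shape `(in, out)`), so those tensors are transposed on copy while everything else copies as-is. A hedged illustration of that check with fake tensors (the suffix list mirrors the one the repo checks, not the full copy loop):

```python
import torch

# Conv1D-style weights in the HF checkpoint that need a transpose on copy
transposed = ["attn.c_attn.weight", "attn.c_proj.weight",
              "mlp.c_fc.weight", "mlp.c_proj.weight"]

hf_param = torch.randn(768, 3 * 768)  # Conv1D layout (in, out) for c_attn
name = "h.0.attn.c_attn.weight"
if any(name.endswith(w) for w in transposed):
    my_param = hf_param.t().contiguous()  # nn.Linear expects (out, in)
else:
    my_param = hf_param.clone()
```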

Optimization group

and fused AdamW
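Parameters are split into two groups: 2D tensors (matmul weights, embeddings) get weight decay, 1D tensors (biases, LayerNorm) don't. Fused AdamW is used when the installed PyTorch exposes the `fused` kwarg and CUDA is available. A sketch with a toy model:

```python
import inspect
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 16), nn.LayerNorm(16))
decay = [p for p in model.parameters() if p.dim() >= 2]     # weights
no_decay = [p for p in model.parameters() if p.dim() < 2]   # biases, norms
groups = [
    {"params": decay, "weight_decay": 0.1},
    {"params": no_decay, "weight_decay": 0.0},
]

# feature-detect the fused kernel (only worthwhile on CUDA)
fused_available = "fused" in inspect.signature(torch.optim.AdamW).parameters
use_fused = fused_available and torch.cuda.is_available()
opt = torch.optim.AdamW(groups, lr=6e-4, betas=(0.9, 0.95),
                        **({"fused": True} if use_fused else {}))
```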

Model flops utilization
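MFU is achieved FLOPS divided by the hardware's promised peak (A100 bfloat16: 312 TFLOPS), with FLOPS per token estimated from the PaLM paper's appendix as `6*N + 12*L*H*Q*T`. A reconstruction with illustrative GPT-2 124M-ish numbers:

```python
N = 124e6                       # parameter count
L, H, Q, T = 12, 12, 64, 1024   # layers, heads, head dim, sequence length

flops_per_token = 6 * N + 12 * L * H * Q * T
flops_per_fwdbwd = flops_per_token * T
fwdbwd_per_iter = 40            # batch size x grad accum steps (illustrative)
dt = 1.0                        # measured seconds per iteration (illustrative)

flops_achieved = flops_per_fwdbwd * fwdbwd_per_iter / dt
mfu = flops_achieved / 312e12   # fraction of A100 bf16 peak
```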



A few things to note here:

  1. The sampled token is concatenated onto the sequence, and the input (idx_cond) is cropped once it goes beyond block_size
  2. Variable-length input is fine, because forward derives the sequence length t from idx.size()
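The points above can be sketched as the repo's generate loop (with a stub standing in for the real model; the crop/sample/concat structure follows the repo, top-k filtering omitted):

```python
import torch
import torch.nn.functional as F

block_size, vocab_size = 8, 20

def model(idx):
    # stub: uniform logits of shape (B, T, vocab_size) in place of the real net
    B, T = idx.shape
    return torch.zeros(B, T, vocab_size)

@torch.no_grad()
def generate(idx, max_new_tokens, temperature=1.0):
    for _ in range(max_new_tokens):
        # crop the running sequence once it exceeds block_size
        idx_cond = idx if idx.size(1) <= block_size else idx[:, -block_size:]
        logits = model(idx_cond)
        logits = logits[:, -1, :] / temperature      # last position only
        probs = F.softmax(logits, dim=-1)
        idx_next = torch.multinomial(probs, num_samples=1)
        idx = torch.cat((idx, idx_next), dim=1)      # append and continue
    return idx

out = generate(torch.zeros(1, 3, dtype=torch.long), max_new_tokens=10)
```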

This is quite a bit simpler than the more involved version in llama, which relies on pad_id to identify what was generated, and on the bos and eos tokens to decide when to stop decoding.