nanoGPT repo reading notes

I personally haven't used PyTorch or written model code in a long time, and I found Andrej Karpathy's nanoGPT video a super helpful refresher (https://www.youtube.com/watch?v=kCc8FmEb1nY)

He also released a repo with slightly more involved examples. I took some notes below as I went through it, learning about some of the implementation details and PyTorch features. Hope it's useful to others as well.

Note: the repo was accessed around 5/18/2023.

https://github.com/karpathy/nanoGPT

Alternative implementation:

https://github.com/pytorch/examples/blob/main/distributed/minGPT-ddp/mingpt/trainer.py

train.py

Distributed training

Ref https://pytorch.org/docs/stable/_modules/torch/nn/parallel/distributed.html#DistributedDataParallel.no_sync
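
Here's a minimal sketch of the DDP setup plus gradient accumulation using the no_sync() context manager from the docs above. It assumes a torchrun launch (so RANK/LOCAL_RANK etc. are set) and uses a toy linear model and made-up hyperparameters, not anything from train.py:

```python
# Minimal DDP sketch, assuming a torchrun launch and CUDA GPUs.
import contextlib
import os

import torch
import torch.nn as nn
from torch.distributed import destroy_process_group, init_process_group
from torch.nn.parallel import DistributedDataParallel as DDP

init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
device = f"cuda:{local_rank}"
torch.cuda.set_device(device)

model = DDP(nn.Linear(16, 16).to(device), device_ids=[local_rank])
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
grad_accum_steps = 4

for micro_step in range(grad_accum_steps):
    x = torch.randn(8, 16, device=device)
    # only all-reduce gradients on the last micro-step; no_sync() skips the sync
    ctx = (contextlib.nullcontext() if micro_step == grad_accum_steps - 1
           else model.no_sync())
    with ctx:
        loss = model(x).pow(2).mean()
        (loss / grad_accum_steps).backward()

optimizer.step()
optimizer.zero_grad(set_to_none=True)
destroy_process_group()
```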

Wandb logging
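
A minimal sketch of the logging calls, with made-up project/run names and metric keys:

```python
# wandb logging sketch; the project, run name, and logged keys are illustrative.
import wandb

wandb.init(project="nanogpt-notes", name="demo-run", config={"lr": 6e-4})
for step in range(3):
    wandb.log({"iter": step, "train/loss": 1.0 / (step + 1), "lr": 6e-4})
wandb.finish()
```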

Simulating a larger batch
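
The idea is gradient accumulation: run several micro-batches, scale each loss by the number of accumulation steps, and only call optimizer.step() at the end. A toy sketch (the model and sizes are placeholders):

```python
# Gradient accumulation sketch: N small micro-batches simulate one large batch.
import torch
import torch.nn as nn

model = nn.Linear(16, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
grad_accum_steps = 8  # effective batch = micro-batch size * grad_accum_steps

for micro_step in range(grad_accum_steps):
    x, y = torch.randn(4, 16), torch.randn(4, 1)
    loss = nn.functional.mse_loss(model(x), y)
    (loss / grad_accum_steps).backward()  # scale so the summed grads average out

optimizer.step()
optimizer.zero_grad(set_to_none=True)
```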

Auto mixed precision (AMP)

https://pytorch.org/tutorials/recipes/recipes/amp_recipe.html

Context

scaler

Using scaler
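
Putting the three pieces together, here is a minimal AMP sketch following the linked recipe, assuming a CUDA device and fp16 (the repo's exact dtype/device handling may differ): the autocast context, the GradScaler, and the scale / step / update sequence.

```python
# AMP sketch: autocast context + GradScaler; assumes a CUDA device and fp16.
import torch
import torch.nn as nn

device = "cuda"
model = nn.Linear(16, 1).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# Context: ops inside run in reduced precision where it is safe to do so
ctx = torch.amp.autocast(device_type="cuda", dtype=torch.float16)
# scaler: rescales the loss so fp16 gradients don't underflow
scaler = torch.cuda.amp.GradScaler(enabled=True)

x, y = torch.randn(8, 16, device=device), torch.randn(8, 1, device=device)
with ctx:
    loss = nn.functional.mse_loss(model(x), y)

# Using scaler: scale -> backward -> step -> update
scaler.scale(loss).backward()
scaler.step(optimizer)   # unscales grads, skips the step if infs/NaNs are found
scaler.update()
optimizer.zero_grad(set_to_none=True)
```

bfloat16 generally doesn't need the scaler (same exponent range as fp32), so in that case it can simply be constructed with enabled=False.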

Learning rate schedule

It looks like this
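
The schedule is a linear warmup followed by a cosine decay down to a minimum learning rate, then a flat tail. A sketch with illustrative constants:

```python
# Warmup + cosine-decay schedule sketch; constants are illustrative defaults.
import math

learning_rate = 6e-4
min_lr = 6e-5
warmup_iters = 2000
lr_decay_iters = 600000

def get_lr(it: int) -> float:
    if it < warmup_iters:                      # 1) linear warmup
        return learning_rate * it / warmup_iters
    if it > lr_decay_iters:                    # 2) after decay, hold at min_lr
        return min_lr
    # 3) cosine decay from learning_rate down to min_lr
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))
    return min_lr + coeff * (learning_rate - min_lr)
```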

model.py

Batched MHA
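
The trick is that a single c_attn Linear produces q, k, v for all heads at once, and the heads are then split out into their own batch-like dimension so they all attend in parallel. A sketch with illustrative dimensions:

```python
# Batched multi-head attention layout sketch.
import torch
import torch.nn as nn

B, T, C, n_head = 2, 8, 32, 4          # batch, time, channels, heads
c_attn = nn.Linear(C, 3 * C)           # one projection for q, k, v combined

x = torch.randn(B, T, C)
q, k, v = c_attn(x).split(C, dim=2)    # each (B, T, C)
# reshape to (B, n_head, T, head_size) so all heads attend in parallel
q = q.view(B, T, n_head, C // n_head).transpose(1, 2)
k = k.view(B, T, n_head, C // n_head).transpose(1, 2)
v = v.view(B, T, n_head, C // n_head).transpose(1, 2)
```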

scaled_dot_product_attention
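
PyTorch 2.0's fused kernel can replace the manual attention math; the hasattr check below mirrors the fallback idea. A sketch:

```python
# Fused attention via F.scaled_dot_product_attention, with a manual fallback.
import torch
import torch.nn.functional as F

q = k = v = torch.randn(2, 4, 8, 16)   # (B, n_head, T, head_size)
if hasattr(F, "scaled_dot_product_attention"):
    y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
else:
    att = (q @ k.transpose(-2, -1)) / (k.size(-1) ** 0.5)
    mask = torch.tril(torch.ones(q.size(-2), q.size(-2), dtype=torch.bool))
    att = att.masked_fill(~mask, float("-inf")).softmax(dim=-1)
    y = att @ v
```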

register_buffer

https://discuss.pytorch.org/t/what-is-the-difference-between-register-buffer-and-register-parameter-of-nn-module/32723
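
A small sketch of why the causal mask is a buffer rather than a parameter: it should move with the module (.to(device), state_dict) but the optimizer should never see it. The class below is just for illustration:

```python
# register_buffer sketch: module state that is not a trainable parameter.
import torch
import torch.nn as nn

class CausalMask(nn.Module):
    def __init__(self, block_size: int = 8):
        super().__init__()
        mask = torch.tril(torch.ones(block_size, block_size))
        self.register_buffer("bias", mask.view(1, 1, block_size, block_size))

m = CausalMask()
print(list(m.parameters()))   # [] -> the optimizer never sees the mask
print(m.state_dict().keys())  # odict_keys(['bias']) -> but it is saved/moved
```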

Weight tying
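
A sketch of the tying itself, with made-up sizes: the token embedding and the output projection share one weight tensor, so it is stored and trained only once.

```python
# Weight tying sketch: embedding and lm_head share the same Parameter.
import torch.nn as nn

vocab_size, n_embd = 100, 32
wte = nn.Embedding(vocab_size, n_embd)
lm_head = nn.Linear(n_embd, vocab_size, bias=False)

wte.weight = lm_head.weight        # both shapes are (vocab_size, n_embd)
assert wte.weight is lm_head.weight
```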

Weight init
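
My reading of the init scheme (GPT-2 style): normal(0, 0.02) for Linear and Embedding weights, zero biases, and a scaled-down std of 0.02/sqrt(2*n_layer) for the residual projections, selected by parameter name. A sketch on a toy model:

```python
# GPT-2 style weight init sketch; the rules here are my reading, not a spec.
import math
import torch.nn as nn

n_layer = 12

def init_weights(module: nn.Module) -> None:
    if isinstance(module, nn.Linear):
        nn.init.normal_(module.weight, mean=0.0, std=0.02)
        if module.bias is not None:
            nn.init.zeros_(module.bias)
    elif isinstance(module, nn.Embedding):
        nn.init.normal_(module.weight, mean=0.0, std=0.02)

model = nn.Sequential(nn.Embedding(100, 32), nn.Linear(32, 32))
model.apply(init_weights)  # .apply() walks every submodule

# residual projections get the scaled-down init, selected here by parameter name
for name, p in model.named_parameters():
    if name.endswith("c_proj.weight"):
        nn.init.normal_(p, mean=0.0, std=0.02 / math.sqrt(2 * n_layer))
```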

Model output for train vs inference
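
A sketch of the split: with targets we need logits and a cross-entropy loss over every position; at inference only the last position's logits are needed for sampling, so the lm_head is applied to just that slice. The lm_head and shapes below are illustrative:

```python
# Train vs. inference output sketch.
import torch
import torch.nn as nn
import torch.nn.functional as F

lm_head = nn.Linear(32, 100, bias=False)
x = torch.randn(2, 8, 32)                      # hidden states (B, T, C)
targets = torch.randint(0, 100, (2, 8))

def head(x, targets=None):
    if targets is not None:                    # training / eval with labels
        logits = lm_head(x)                    # (B, T, vocab)
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)),
                               targets.view(-1), ignore_index=-1)
        return logits, loss
    logits = lm_head(x[:, [-1], :])            # (B, 1, vocab), inference only
    return logits, None
```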

Copy weight from gpt2

Due to consistent variable naming, the two implementations have basically the same parameter names.
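
A sketch of the copy, assuming the repo's model.py is importable and using Hugging Face's GPT2LMHeadModel as the source; as far as I can tell a few weights also have to be transposed because HF uses Conv1D modules where nanoGPT uses Linear:

```python
# Weight-copy sketch; GPT/GPTConfig are assumed to come from the repo's model.py.
import torch
from transformers import GPT2LMHeadModel
from model import GPT, GPTConfig

hf_sd = GPT2LMHeadModel.from_pretrained("gpt2").state_dict()
my_gpt = GPT(GPTConfig(vocab_size=50257))   # match the checkpoint's vocab size
my_sd = my_gpt.state_dict()

# HF stores these as Conv1D, so the matrices are transposed relative to Linear
transposed = ("attn.c_attn.weight", "attn.c_proj.weight",
              "mlp.c_fc.weight", "mlp.c_proj.weight")

for k, v in hf_sd.items():
    if k not in my_sd:
        continue                            # e.g. HF-only buffers
    with torch.no_grad():
        my_sd[k].copy_(v.t() if k.endswith(transposed) else v)
```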

Optimizer param groups and fused AdamW
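
My reading: parameters with dim >= 2 (the matrices) get weight decay, everything else (biases, LayerNorm params) does not, and the fused AdamW kernel is used when the installed torch exposes it and we're on CUDA. A sketch with illustrative hyperparameters:

```python
# Two param groups + optional fused AdamW; thresholds and values are illustrative.
import inspect
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(nn.Linear(32, 32), nn.LayerNorm(32)).to(device)

params = [p for p in model.parameters() if p.requires_grad]
decay = [p for p in params if p.dim() >= 2]      # weights of Linear/Embedding
no_decay = [p for p in params if p.dim() < 2]    # biases, LayerNorm params

groups = [
    {"params": decay, "weight_decay": 0.1},
    {"params": no_decay, "weight_decay": 0.0},
]
use_fused = ("fused" in inspect.signature(torch.optim.AdamW).parameters
             and device == "cuda")
optimizer = torch.optim.AdamW(groups, lr=6e-4, betas=(0.9, 0.95),
                              **({"fused": True} if use_fused else {}))
```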

Model flops utilization

It is called from train.py when logging.
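
The estimate (PaLM-appendix style, as I understand it): flops per token is approximated as 6N + 12*L*H*Q*T, and the achieved flops/sec are compared against an A100's 312 TFLOPS bfloat16 peak. A sketch with illustrative numbers:

```python
# MFU estimate sketch; the formula is my reading, the numbers are illustrative.
def estimate_mfu(n_params, n_layer, n_head, head_dim, seq_len,
                 fwdbwd_per_iter, dt):
    flops_per_token = 6 * n_params + 12 * n_layer * n_head * head_dim * seq_len
    flops_per_iter = flops_per_token * seq_len * fwdbwd_per_iter
    flops_achieved = flops_per_iter / dt           # flops per second
    return flops_achieved / 312e12                 # fraction of A100 bf16 peak

# e.g. GPT-2 124M-ish numbers, purely illustrative
print(estimate_mfu(124e6, 12, 12, 64, 1024, fwdbwd_per_iter=40, dt=1.0))
```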

Generate

A few things to note here (see the sketch after this list):

  1. Concatenate the sampled output onto the sequence, and crop the input (idx_cond) once it grows beyond block_size
  2. Variable-length input is fine
  3. That works because forward derives the sequence length t from idx.size()
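
A paraphrase of that loop (not a verbatim copy of model.py; top-k sampling omitted):

```python
# Sampling loop sketch: crop to block_size, take the last position's logits,
# sample, append, repeat. Assumes the model returns (logits, loss).
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate(model, idx, max_new_tokens, block_size, temperature=1.0):
    for _ in range(max_new_tokens):
        # 1) crop the running sequence so it never exceeds block_size
        idx_cond = idx if idx.size(1) <= block_size else idx[:, -block_size:]
        logits, _ = model(idx_cond)              # model infers t from idx_cond
        logits = logits[:, -1, :] / temperature  # only the last position matters
        probs = F.softmax(logits, dim=-1)
        idx_next = torch.multinomial(probs, num_samples=1)
        idx = torch.cat((idx, idx_next), dim=1)  # 2) append and keep going
    return idx
```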

This is quite a bit different from the slightly more complicated version in llama, which relies on pad_id to identify what was generated, and on the bos/eos tokens to decide when decoding stops.

openwebtext/prepare.py

dataset.map
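
A sketch of the tokenization step, assuming tiktoken's GPT-2 encoding and an ids/len output schema like the repo's:

```python
# Parallel tokenization with datasets.map; column names are illustrative.
from datasets import load_dataset
import tiktoken

enc = tiktoken.get_encoding("gpt2")

def process(example):
    ids = enc.encode_ordinary(example["text"])   # ignores special tokens
    ids.append(enc.eot_token)                    # delimit documents
    return {"ids": ids, "len": len(ids)}

dataset = load_dataset("openwebtext")
tokenized = dataset.map(process, remove_columns=["text"], num_proc=8)
```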

dataset.shard

np.memmap
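
Continuing from the tokenized dataset above, a sketch of how shard and memmap combine to stream the token ids into one flat .bin file without holding everything in RAM; uint16 works because GPT-2's ~50k vocab fits in 16 bits:

```python
# Write the tokenized split to disk in chunks; builds on `tokenized` above.
import numpy as np

split = tokenized["train"]
arr_len = int(np.sum(split["len"], dtype=np.uint64))
arr = np.memmap("train.bin", dtype=np.uint16, mode="w+", shape=(arr_len,))

total_batches = 1024
idx = 0
for batch_idx in range(total_batches):
    # contiguous=True keeps each shard a contiguous slice of the split
    batch = split.shard(num_shards=total_batches, index=batch_idx,
                        contiguous=True).with_format("numpy")
    arr_batch = np.concatenate(batch["ids"])     # flat array of token ids
    arr[idx : idx + len(arr_batch)] = arr_batch
    idx += len(arr_batch)
arr.flush()
```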