This is a work sample adapted from Stanford CS336. The code in cs336_basics was written by me from scratch. The work sample is a simple transformer language model that tells stories based on user-input prompts.
For a full description of the assignment, see the assignment handout at cs336_spring2025_assignment1_basics.pdf
If you see any issues with the assignment handout or code, please feel free to raise a GitHub issue or open a pull request with a fix.
We manage our environments with uv to ensure reproducibility, portability, and ease of use.
Install uv by following its official instructions (recommended), or run `pip install uv` / `brew install uv`.
We recommend reading a bit about managing projects with uv in its documentation (you will not regret it!).
You can now run any code in the repo using

```sh
uv run <python_file_path>
```

and the environment will be automatically solved and activated when necessary.

Run the test suite with

```sh
uv run pytest
```

Initially, all tests should fail with `NotImplementedError`s.
To connect your implementation to the tests, complete the
functions in ./tests/adapters.py.
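As a sketch of the adapter pattern (the function and module names below are illustrative placeholders, not the actual signatures in ./tests/adapters.py), each adapter typically just forwards to your own implementation:

```python
import math

# Stand-in for an implementation living in cs336_basics (illustrative only):
def my_softmax(xs):
    m = max(xs)                                # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

# The adapter in tests/adapters.py then simply forwards to it
# (the real adapter names and signatures are defined in that file):
def run_softmax(xs):
    return my_softmax(xs)
```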
Download the TinyStories data and a subsample of OpenWebText:

```sh
mkdir -p data
cd data

wget https://huggingface.co/datasets/roneneldan/TinyStories/resolve/main/TinyStoriesV2-GPT4-train.txt
wget https://huggingface.co/datasets/roneneldan/TinyStories/resolve/main/TinyStoriesV2-GPT4-valid.txt

wget https://huggingface.co/datasets/stanford-cs336/owt-sample/resolve/main/owt_train.txt.gz
gunzip owt_train.txt.gz
wget https://huggingface.co/datasets/stanford-cs336/owt-sample/resolve/main/owt_valid.txt.gz
gunzip owt_valid.txt.gz

cd ..
```

The model source code lives in cs336_basics and passes all tests except the BPE time and memory limits. All function modules are built from scratch rather than imported from PyTorch!
First, download the datasets into ../data/ as described above. To train a byte-pair encoding (BPE) model on TinyStoriesV2-GPT4-train.txt as an example:

```sh
python -m cs336_basics.train_bpe
```
This will generate two files: ../results/vocab.json (the token vocabulary with IDs) and ../results/merges.txt (the learned token-pair merges, in order).
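A minimal sketch of how these two files might be read back, assuming vocab.json maps token strings to integer IDs and merges.txt lists one space-separated pair per line (the actual serialization in this repo may differ):

```python
import json

def load_bpe_files(vocab_path, merges_path):
    # vocab.json: {"token": id, ...} (assumed layout)
    with open(vocab_path, encoding="utf-8") as f:
        vocab = json.load(f)
    # merges.txt: one merge per line, e.g. "l l" -> ("l", "l"),
    # listed in the order the merges were learned
    merges = []
    with open(merges_path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split(" ")
            if len(parts) == 2:
                merges.append((parts[0], parts[1]))
    return vocab, merges
```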
To run a tokenizer on TinyStoriesV2-GPT4-train.txt:
```sh
python -m cs336_basics.tokenizer
```
This will encode the text file into token IDs.
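Conceptually, BPE encoding applies the learned merges to each pre-tokenized word in training order. A simplified sketch of that core step, ignoring the byte-level details and pre-tokenization regex the real tokenizer uses:

```python
def bpe_encode_word(word, merges):
    """Apply merges (in learned order) to a word's symbols -- simplified sketch."""
    symbols = list(word)
    for a, b in merges:
        i = 0
        merged = []
        while i < len(symbols):
            # Merge every adjacent (a, b) pair into a single symbol
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols
```

Each resulting symbol is then looked up in the vocabulary to produce a token ID.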
To train the transformer-based language model:
```sh
python -m cs336_basics.training_together --input "../results/tokens.npy" --checkpoint "../results/checkpoint.pt"
```
This will generate a trained model in the checkpoint file.
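During training, input/target batches are drawn from the token array. A sketch of that sampling step (function and parameter names here are illustrative; see the script's --help for its actual options):

```python
import numpy as np

def get_batch(tokens, batch_size, context_length, rng):
    """Sample (input, target) pairs of length context_length from a 1-D token array."""
    starts = rng.integers(0, len(tokens) - context_length, size=batch_size)
    x = np.stack([tokens[s : s + context_length] for s in starts])
    # Targets are the inputs shifted one position ahead (next-token prediction)
    y = np.stack([tokens[s + 1 : s + 1 + context_length] for s in starts])
    return x, y
```

In practice the array can be opened with `np.load(..., mmap_mode="r")` so the full dataset never has to fit in RAM.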
Here we use an example model with 4 transformer blocks and 16-head multi-head attention, along with classic pre-norm, SwiGLU activation, and rotary position embeddings (RoPE). For training, we use the AdamW optimizer with cosine-annealing learning-rate scheduling and gradient clipping. Users are encouraged to test different hyperparameters. To list all options, run `python -m cs336_basics.training_together --help`.
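The cosine-annealing schedule can be sketched as follows: linear warmup to the peak rate, then cosine decay down to a minimum (the exact hyperparameter names in training_together may differ):

```python
import math

def cosine_lr(step, lr_max, lr_min, warmup_steps, total_steps):
    """Learning rate at a given step: linear warmup, then cosine decay."""
    if step < warmup_steps:
        # Linear warmup from 0 to lr_max
        return lr_max * step / warmup_steps
    if step > total_steps:
        return lr_min
    # Cosine decay from lr_max at the end of warmup to lr_min at total_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))
```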
To generate text from an initial sentence using the model:
```sh
python -m cs336_basics.decoding --init "Once upon a time, " --checkpoint "../results/checkpoint.pt" --vocab "../results/vocab.json" --merges "../results/merges.txt"
```
This will print out a tiny story. Here is one example:
Once upon a time, there was a little boy named Tom. Tom liked to play outside with his friends. One day, Tom and his friends decided to have a big party. They invited all their friends and family. The party was very happy to go to the party for the guests.
At the party, they played games, ate yummy treats, and had lots of fun. Tom was not lazy anymore. He played all day long and had lots...
Not perfect but a good start!
To list all options, run `python -m cs336_basics.decoding --help`.
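Under the hood, decoding samples one token at a time from the model's output distribution. A sketch of softmax sampling with temperature (a simplified stand-in written in NumPy, not the script's actual decoding code):

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, rng=None):
    """Sample a token ID from logits after temperature scaling (illustrative)."""
    if rng is None:
        rng = np.random.default_rng()
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    scaled -= scaled.max()          # subtract max for numerical stability
    probs = np.exp(scaled)
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))
```

Lower temperatures sharpen the distribution toward the highest-logit token; higher temperatures make generation more varied.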