This is a work sample adapted from Stanford CS336. The code in cs336_basics was written by me from scratch. The work sample is a simple transformer language model that tells stories based on user-input prompts.
For a full description of the assignment, see the assignment handout at cs336_spring2025_assignment1_basics.pdf
If you see any issues with the assignment handout or code, please feel free to raise a GitHub issue or open a pull request with a fix.
We manage our environments with uv to ensure reproducibility, portability, and ease of use.
Install uv by following its official instructions (recommended), or run `pip install uv` / `brew install uv`.
We recommend reading a bit about managing projects with uv in its documentation (you will not regret it!).
You can now run any code in the repo using

```sh
uv run <python_file_path>
```

and the environment will be automatically solved and activated when necessary.

Run the test suite with

```sh
uv run pytest
```

Initially, all tests should fail with `NotImplementedError`s.
To connect your implementation to the tests, complete the
functions in ./tests/adapters.py.
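As a sketch of the adapter pattern (the function and module names below are illustrative placeholders, not the actual signatures in ./tests/adapters.py), each adapter typically just forwards to your own implementation:

```python
import math

# Stand-in for an implementation living in cs336_basics (illustrative only):
def my_softmax(xs):
    m = max(xs)                                # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

# The adapter in tests/adapters.py then simply forwards to it
# (the real adapter names and signatures are defined in that file):
def run_softmax(xs):
    return my_softmax(xs)
```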
Download the TinyStories data and a subsample of OpenWebText:

```sh
mkdir -p data
cd data

wget https://huggingface.co/datasets/roneneldan/TinyStories/resolve/main/TinyStoriesV2-GPT4-train.txt
wget https://huggingface.co/datasets/roneneldan/TinyStories/resolve/main/TinyStoriesV2-GPT4-valid.txt

wget https://huggingface.co/datasets/stanford-cs336/owt-sample/resolve/main/owt_train.txt.gz
gunzip owt_train.txt.gz
wget https://huggingface.co/datasets/stanford-cs336/owt-sample/resolve/main/owt_valid.txt.gz
gunzip owt_valid.txt.gz

cd ..
```

The model source code lives in cs336_basics and passes all tests except the BPE time and memory limits. All function modules are built from scratch rather than imported from PyTorch!
First, download the datasets into ../data/ as described above. To train a byte-pair encoding (BPE) model on TinyStoriesV2-GPT4-train.txt as an example:

```sh
python -m cs336_basics.train_bpe
```
This will generate two files: ../results/vocab.json (the token vocabulary with IDs) and ../results/merges.txt (the learned token-pair merges, in order).
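A minimal sketch of how these two files might be read back, assuming vocab.json maps token strings to integer IDs and merges.txt lists one space-separated pair per line (the actual serialization in this repo may differ):

```python
import json

def load_bpe_files(vocab_path, merges_path):
    # vocab.json: {"token": id, ...} (assumed layout)
    with open(vocab_path, encoding="utf-8") as f:
        vocab = json.load(f)
    # merges.txt: one merge per line, e.g. "l l" -> ("l", "l"),
    # listed in the order the merges were learned
    merges = []
    with open(merges_path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split(" ")
            if len(parts) == 2:
                merges.append((parts[0], parts[1]))
    return vocab, merges
```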
To run a tokenizer on TinyStoriesV2-GPT4-train.txt:
```sh
python -m cs336_basics.tokenizer
```
This will encode the text file into token IDs.
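Conceptually, BPE encoding applies the learned merges to each pre-tokenized word in training order. A simplified sketch of that core step, ignoring the byte-level details and pre-tokenization regex the real tokenizer uses:

```python
def bpe_encode_word(word, merges):
    """Apply merges (in learned order) to a word's symbols -- simplified sketch."""
    symbols = list(word)
    for a, b in merges:
        i = 0
        merged = []
        while i < len(symbols):
            # Merge every adjacent (a, b) pair into a single symbol
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols
```

Each resulting symbol is then looked up in the vocabulary to produce a token ID.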
To train the transformer-based language model:
```sh
python -m cs336_basics.training_together --input "../results/tokens.npy" --checkpoint "../results/checkpoint.pt"
```
This will generate a trained model in the checkpoint file.
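During training, input/target batches are drawn from the token array. A sketch of that sampling step (function and parameter names here are illustrative; see the script's --help for its actual options):

```python
import numpy as np

def get_batch(tokens, batch_size, context_length, rng):
    """Sample (input, target) pairs of length context_length from a 1-D token array."""
    starts = rng.integers(0, len(tokens) - context_length, size=batch_size)
    x = np.stack([tokens[s : s + context_length] for s in starts])
    # Targets are the inputs shifted one position ahead (next-token prediction)
    y = np.stack([tokens[s + 1 : s + 1 + context_length] for s in starts])
    return x, y
```

In practice the array can be opened with `np.load(..., mmap_mode="r")` so the full dataset never has to fit in RAM.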
Here we use an example model with 4 transformer blocks and 16-head multi-head attention, along with classic pre-norm, SwiGLU activation, and rotary position embeddings (RoPE). For training, we use the AdamW optimizer with cosine-annealing learning-rate scheduling and gradient clipping. Users are encouraged to test different hyperparameters. To list all options, run `python -m cs336_basics.training_together --help`.
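The cosine-annealing schedule can be sketched as follows: linear warmup to the peak rate, then cosine decay down to a minimum (the exact hyperparameter names in training_together may differ):

```python
import math

def cosine_lr(step, lr_max, lr_min, warmup_steps, total_steps):
    """Learning rate at a given step: linear warmup, then cosine decay."""
    if step < warmup_steps:
        # Linear warmup from 0 to lr_max
        return lr_max * step / warmup_steps
    if step > total_steps:
        return lr_min
    # Cosine decay from lr_max at the end of warmup to lr_min at total_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))
```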
To generate text from an initial sentence using the model:
```sh
python -m cs336_basics.decoding --init "Once upon a time, " --checkpoint "../results/checkpoint.pt" --vocab "../results/vocab.json" --merges "../results/merges.txt"
```
This will print out a tiny story. Here is one example:
Once upon a time, there was a little boy named Tom. Tom liked to play outside with his friends. One day, Tom and his friends decided to have a big party. They invited all their friends and family. The party was very happy to go to the party for the guests.
At the party, they played games, ate yummy treats, and had lots of fun. Tom was not lazy anymore. He played all day long and had lots...
Not perfect but a good start!
To list all options, run `python -m cs336_basics.decoding --help`.
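Under the hood, decoding samples one token at a time from the model's output distribution. A sketch of softmax sampling with temperature (a simplified stand-in written in NumPy, not the script's actual decoding code):

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, rng=None):
    """Sample a token ID from logits after temperature scaling (illustrative)."""
    if rng is None:
        rng = np.random.default_rng()
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    scaled -= scaled.max()          # subtract max for numerical stability
    probs = np.exp(scaled)
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))
```

Lower temperatures sharpen the distribution toward the highest-logit token; higher temperatures make generation more varied.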