This project involves implementing (from scratch) and experimenting with different parts of the transformer architecture. We work with speeches from three American politicians to build classification and language modeling systems.
The dataset consists of speeches from three American politicians:
- 0: Barack Obama
- 1: George W. Bush
- 2: George H. W. Bush
- Classification Task:
  - Train: `train_CLS.tsv` (tab-separated: label + tab + speech segment)
  - Test: `test_CLS.txt`
- Language Modeling Task:
  - Train: `train_LM.txt`
  - Test: `test_LM_obama.txt`, `test_LM_wbush.txt`, `test_LM_hbush.txt`
- PyTorch
- NLTK (for tokenization)
- Hugging Face `transformers` (for `BertTokenizer`)
- Standard Python libraries
- `dataset.py` - PyTorch Dataset classes for classification and language modeling
- `tokenizer.py` - Simple word-level tokenizer using NLTK
- `utilities.py` - Helper functions for attention matrix sanity checks and visualization
- `main.py` - Default parameters and example usage
- `transformer.py` - Empty file where you'll implement your transformer components
For the base implementation, use simple absolute positional embeddings:
- Two embedding tables: one for tokens, one for positions
- Add positional embeddings to token embeddings before feeding them into the transformer blocks (see the sketch below)
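A minimal sketch of this embedding setup, assuming the `n_embed = 64` and block size 32 defaults mentioned later in this README (the class and argument names here are illustrative, not the ones in `main.py`):

```python
import torch
import torch.nn as nn

class TokenAndPositionEmbedding(nn.Module):
    """Two embedding tables: one indexed by token id, one by position."""
    def __init__(self, vocab_size, n_embed=64, block_size=32):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, n_embed)
        self.pos_emb = nn.Embedding(block_size, n_embed)

    def forward(self, idx):
        # idx: (batch, seq_len) tensor of token ids
        B, T = idx.shape
        tok = self.token_emb(idx)                               # (B, T, n_embed)
        pos = self.pos_emb(torch.arange(T, device=idx.device))  # (T, n_embed)
        return tok + pos                                        # broadcast add over the batch
```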
We implemented a transformer encoder with a feedforward classifier for politician speech classification. We trained both components jointly from scratch without pretraining.
- Implemented transformer encoder following the hyperparameters in `main.py`
- Output: sequence of embeddings, one for each input word
- Used mean pooling across the sequence dimension to provide embeddings to the classifier
- Simple feedforward network with one hidden layer
- Input: mean-pooled encoder embeddings
- Output: predictions for which politician spoke the speech segment
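A minimal sketch of this head, assuming one hidden layer of size 100 and the `n_embed = 64` default used elsewhere in this README (the exact sizes live in `main.py`):

```python
import torch.nn as nn

class SpeechClassifier(nn.Module):
    """Feedforward head with one hidden layer over mean-pooled encoder output."""
    def __init__(self, n_embed=64, n_hidden=100, n_classes=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embed, n_hidden),
            nn.ReLU(),
            nn.Linear(n_hidden, n_classes),   # one logit per politician (0, 1, 2)
        )

    def forward(self, encoder_out):
        # encoder_out: (batch, seq_len, n_embed) from the transformer encoder
        pooled = encoder_out.mean(dim=1)      # mean pooling over the sequence dimension
        return self.net(pooled)               # (batch, n_classes) logits
```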
- Passed input through the encoder, generated embeddings, and fed them to the classifier
- Computed loss between predictions and true labels
- Updated both encoder and classifier weights via backpropagation
- Trained both components simultaneously
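One way this joint update can look, assuming a single optimizer built over both modules' parameters (a sketch, not the exact loop in `main.py`):

```python
import torch.nn.functional as F

def train_step(encoder, classifier, optimizer, x, labels):
    """One joint update; `optimizer` should cover both modules' parameters, e.g.
    torch.optim.Adam(list(encoder.parameters()) + list(classifier.parameters()))."""
    # x: (batch, seq_len) token ids; labels: (batch,) politician ids in {0, 1, 2}
    embeddings = encoder(x)                  # (batch, seq_len, n_embed)
    logits = classifier(embeddings)          # classifier mean-pools internally (see above)
    loss = F.cross_entropy(logits, labels)   # loss between predictions and true labels
    optimizer.zero_grad()
    loss.backward()                          # gradients flow into both encoder and classifier
    optimizer.step()
    return loss.item()
```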
- Used the `utilities.py` helper function to verify the attention implementation (see the sketch below for the kind of check)
- Checked that the attention matrix rows sum to 1
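The check itself can be as simple as the following (an illustrative version, not the exact helper in `utilities.py`):

```python
import torch

def check_attention_rows(attn_matrix, atol=1e-5):
    # attn_matrix: (..., seq_len, seq_len) post-softmax attention weights
    row_sums = attn_matrix.sum(dim=-1)
    if not torch.allclose(row_sums, torch.ones_like(row_sums), atol=atol):
        raise ValueError("Attention rows do not sum to 1 -- check the softmax dimension")
    return True
```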
Tested on `test_CLS.txt`.
Performance:
| Metric | Value |
|---|---|
| Number of Parameters | 2,158,155 |
| Vocabulary Size | 30,522 |
| Epoch 1 Train Loss | 1.0763 |
| Epoch 15 Train Loss | 0.1263 |
| Epoch 15 Test Accuracy | 86.13% |
We implemented a GPT-like transformer decoder for autoregressive language modeling and evaluated perplexity on speeches given by different politicians.
- Similar to encoder but used masked self-attention
- Prevented the model from seeing future tokens during training
- Feedforward specifications:
- Hidden dimensionality: 100
- Activation function: ReLU
- Input/output dimensionality: `n_embed = 64` (see the sketch below)
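Under these specifications, the feedforward sub-layer can be sketched as follows (names are illustrative):

```python
import torch.nn as nn

class FeedForward(nn.Module):
    """Position-wise feedforward block: n_embed -> 100 -> n_embed with ReLU."""
    def __init__(self, n_embed=64, n_hidden=100):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embed, n_hidden),   # hidden dimensionality: 100
            nn.ReLU(),                      # activation function: ReLU
            nn.Linear(n_hidden, n_embed),   # back to n_embed = 64
        )

    def forward(self, x):
        return self.net(x)
```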
- Task: Predict the next word given previous words in sequence
- Output: Probability distribution over vocabulary
- Loss: Cross-entropy between predictions and true next words
- Training limit: 500 iterations (batches) with batch size 16, block size 32
- Total tokens processed: ~256,000
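With batch size 16 and block size 32, one training iteration can be sketched as follows (assuming `data` is a 1-D tensor of token ids from `train_LM.txt`; this is illustrative, not the exact `main.py` loop):

```python
import torch
import torch.nn.functional as F

batch_size, block_size = 16, 32

def get_batch(data):
    # sample `batch_size` random windows of `block_size` tokens
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i + block_size] for i in ix])          # inputs
    y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])  # next-word targets
    return x, y

def lm_train_step(model, optimizer, data):
    x, y = get_batch(data)
    logits = model(x)                                              # (B, T, vocab_size)
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```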
- Used `utilities.py` to verify the attention implementation
Tested on all three politician test sets and reported perplexity.
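Perplexity is the exponential of the mean cross-entropy (negative log-likelihood) over the test tokens. A minimal sketch of that evaluation, assuming batches shaped like the training ones:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def perplexity(model, x, y):
    # x, y: (B, T) token ids, with y shifted one position ahead of x
    logits = model(x)
    nll = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
    return torch.exp(nll).item()   # perplexity = exp(mean NLL)
```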
Performance:
| Metric | Value |
|---|---|
| Number of Parameters | 863,243 |
| Vocabulary Size | 5,755 |
| Step 500 Train Perplexity | 169.3392 |
| Step 500 Obama Perplexity | 367.9337 |
| Step 500 H. Bush Perplexity | 419.1233 |
| Step 500 W. Bush Perplexity | 482.0752 |
- Multi-head attention mechanisms
- Layer normalization
- Feedforward networks
- Positional embeddings
- Masked attention (for decoder)
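As a reference point for these pieces, a minimal multi-head self-attention block with optional causal masking might look like this (an illustrative sketch, not the required implementation; `n_head = 2` is an assumption, and the sizes follow the `n_embed = 64` and block size 32 defaults above):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, n_embed=64, n_head=2, block_size=32, causal=False):
        super().__init__()
        assert n_embed % n_head == 0
        self.n_head = n_head
        self.qkv = nn.Linear(n_embed, 3 * n_embed)   # project to queries, keys, values
        self.proj = nn.Linear(n_embed, n_embed)
        self.causal = causal
        # lower-triangular mask, used only by the decoder
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).split(C, dim=-1)
        # reshape each to (B, n_head, T, head_dim)
        q, k, v = (t.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
                   for t in (q, k, v))
        att = (q @ k.transpose(-2, -1)) / (k.size(-1) ** 0.5)   # scaled dot-product scores
        if self.causal:
            # masked attention: block each position from attending to future tokens
            att = att.masked_fill(self.tril[:T, :T] == 0, float("-inf"))
        att = F.softmax(att, dim=-1)                             # rows sum to 1
        out = (att @ v).transpose(1, 2).contiguous().view(B, T, C)
        return self.proj(out)
```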
- Monitor attention matrix properties during development
- Verify the masking implementation by checking train perplexity behavior (if future tokens leak through the mask, train perplexity drops implausibly fast)
- Use the provided hyperparameters initially for comparable results
Instructions for running the project.
```
torch
numpy
matplotlib
nltk
transformers  # BertTokenizer
```
```bash
# Part 1: Encoder Trained With Classifier
python main.py --part 1

# Part 2: Pretraining Decoder Language Model
python main.py --part 2

# Part 3 [Exploration]: Decoder with Masked Multi-query Attention
python main.py --part 3

# Get visualization (the figures in the report) by appending the following option to the commands above
--visualization true
```