This project aims to train a language model from scratch, making it as functional and user-friendly as possible. It is an adaptation of https://github.com/karpathy/nanoGPT/tree/master with a focus on simplifying the process and adding basic "instruction-following" capabilities to the model.
DISCLAIMER - This is a reduced architecture on small datasets, allowing training on CPU. It allows you to mimic the training and fine-tuning process of a GPT2-style model, but in reality, the NanoGPT architecture (especially block_size) does not allow enough context to properly include the full QA stream. This means the model's performance will be pretty lousy. Bigger datasets (base + QA) on a bigger architecture (as suggested in model_config.json) will significantly improve the model's quality. With that being said, on the simple task of language modeling the small models do quite fine.
- Clone this repository and enter it:

```bash
git clone https://github.com/shaharoded/NanoChatGPT2.git
cd NanoChatGPT2
```

- Set up a virtual environment and activate it:

```bash
python -m venv venv
.\venv\Scripts\Activate
```

- Install the required packages:

```bash
pip install -r requirements.txt
```

This part focuses on training the initial text generation model on a chosen .txt file.
A first step is to load, tokenize, and encode the data. In the data\data_config.json file you'll find the link to access the designated training data. The data_load.py file holds the simple code to load the text and tokenize it with the chosen tokenizer (hardcoded to match OpenAI's). Run the following (mind the user prompts):
```bash
cd data
python data_load.py
```

This code will prompt you to choose a base text to train on. The size of the text affects the number of training iterations needed and the required model complexity. To later fine-tune the model on QA, an open-source QA dataset was also sourced and preprocessed here. After running this file you'll have 6 new files in this repository:
- `pretrain_input.txt` - The full readable .txt file, i.e., your entire pretraining data.
- `pretrain_train.bin` - The tokenized training data for the model, holding 90% of the dataset.
- `pretrain_val.bin` - The tokenized validation data for the model, holding 10% of the dataset.
- `qa_input.json` - The full .json file of the question / answer data, unprocessed.
- `qa_train.bin` - The tokenized QA training data for the model, holding 90% of the dataset.
- `qa_val.bin` - The tokenized QA validation data for the model, holding 10% of the dataset.
These files are later accessed by the training loops. This step should only be performed once to generate the data files, as new runs will overwrite the older files.
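For reference, the .bin files can be read back in a training loop with a memory-mapped array. Below is a minimal sketch following nanoGPT's convention of storing tokens as uint16; the actual batching code in pretrain.py may differ:

```python
import numpy as np
import torch

block_size = 64  # assumed small, CPU-friendly context size
batch_size = 8

def get_batch(split):
    # Tokens are stored as uint16 (nanoGPT convention); memmap avoids loading the whole file into RAM.
    data = np.memmap(f'pretrain_{split}.bin', dtype=np.uint16, mode='r')  # path relative to the data directory
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([torch.from_numpy(data[i:i + block_size].astype(np.int64)) for i in ix])
    y = torch.stack([torch.from_numpy(data[i + 1:i + 1 + block_size].astype(np.int64)) for i in ix])
    return x, y  # y is x shifted by one token: next-token prediction targets

xb, yb = get_batch('train')
```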
NOTE: The tokenizer is hardcoded in this module and imported by the other connected files.
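Since the tokenizer is hardcoded to match OpenAI's, it is presumably tiktoken's GPT-2 encoding (as in nanoGPT). A quick sketch of the encode/decode round trip, assuming tiktoken is the library in use:

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")  # OpenAI's GPT-2 BPE tokenizer
ids = enc.encode("Hello, NanoChatGPT2!")
print(ids)              # token ids, e.g. [15496, 11, ...]
print(enc.decode(ids))  # round-trips back to the original string
```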
The base GPT model is an nn.Module defined in gpt.py. After creating the datasets, this model can be trained using the pretrain.py module to create a language model capable of generating text based on the learned corpus. This is the base GPT model, later fine-tuned into an assistant.
Using this process will create an out\{model_name} directory with the model's best checkpoint. The best model will be saved under the name {model_name}_base_model.pt. This model will then be loaded to generate text as a POC. A good goal would be to train the nanoGPT model to val loss < 4, which took me about 4 hours (on CPU).
Use the following to run (mind the user prompts):
```bash
python pretrain.py
```

NOTE: This module sets torch.manual_seed(1337), meaning every random process is deterministic. This also causes the trained model to generate the same responses to reused test cases, which might contradict the expectation of some randomness in the responses.
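For a quick look at the resulting checkpoint, loading and sampling might look roughly like the sketch below. The class names, checkpoint layout, and model name here are assumptions borrowed from nanoGPT's conventions, not necessarily this repo's exact API:

```python
import torch
import tiktoken
from gpt import GPT, GPTConfig  # assumed class names, mirroring nanoGPT

torch.manual_seed(1337)  # same seed as pretrain.py -> identical samples on every run

# Assumption: the checkpoint stores the state_dict under 'model' and the
# architecture arguments under 'model_args', as nanoGPT's train.py does.
ckpt = torch.load('out/my_model/my_model_base_model.pt', map_location='cpu')
model = GPT(GPTConfig(**ckpt['model_args']))
model.load_state_dict(ckpt['model'])
model.eval()

enc = tiktoken.get_encoding("gpt2")
prompt = torch.tensor([enc.encode("Once upon a time")], dtype=torch.long)
with torch.no_grad():
    out = model.generate(prompt, max_new_tokens=100)  # generate() as in nanoGPT's GPT class
print(enc.decode(out[0].tolist()))
```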
As outlined in step #1, the QA data has already been prepared for fine-tuning. This module trains the base model to generate responses to a wide range of questions. The data is preprocessed into streams and batches based on the model's block_size configuration.
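To illustrate what such packing can look like, here is a toy sketch of concatenating QA pairs into a single token stream and cutting it into block_size chunks; the delimiters and helper are hypothetical, not taken from qa_finetune.py:

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")
block_size = 64  # assumed small context, per the disclaimer above

def pack_qa(pairs):
    """Concatenate Q/A pairs into one token stream, then cut into block_size chunks.
    With a small block_size, a long QA pair spills across chunks, so the model
    often trains on answers without their full question context."""
    stream = []
    for q, a in pairs:
        stream.extend(enc.encode(f"Q: {q}\nA: {a}\n"))  # hypothetical formatting
    # The last chunk may be shorter; a real loop would pad or drop it.
    return [stream[i:i + block_size] for i in range(0, len(stream), block_size)]

chunks = pack_qa([("What is the capital of France?", "Paris.")])
```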
Fine-tuning the nanoGPT model on ~88MB of QA data with the current configurations took ~4 hours (CPU), achieving val loss ~5.8 on the QA data. While the model could generate responses, they were incorrect, showcasing the limitations of a small architecture and limited data. This training loop was done without the full context of every QA pair, due to the small block_size, which further reduced the model's ability to handle the task.
With that being said, the designed training flow is based on SOTA flows for QA fine-tuning of decoder-only models, and I expect it to work much better on larger architectures.
Use the following to run (mind the user prompts):
```bash
python qa_finetune.py
```

The final stage of training a conversational bot like this typically involves Reinforcement Learning from Human Feedback (RLHF). This stage helps align the model's behavior with human preferences, such as being more helpful, accurate, or less biased. On the QA task, this stage would take (for example) 2 answers generated by the model and rank them as better / worse, which allows another round of tuning to be performed.
Currently, due to limitations in resources and time (lack of tagged responses and a reinforcement mechanism for feedback), this step is not implemented. However, feel free to extend the project and experiment with RLHF.
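If you do extend the project, the core of this step is typically a pairwise preference (reward model) objective over the ranked answers. A minimal sketch of the Bradley-Terry style loss, not code from this repo:

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen, reward_rejected):
    """Bradley-Terry pairwise loss: push the reward model to score the
    human-preferred answer above the rejected one."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy usage with dummy scalar reward scores for two ranked answers:
loss = preference_loss(torch.tensor([1.2]), torch.tensor([0.3]))
```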
This project also offers a playground.py module that allows you to send free-form prompts to a trained model of your choice (from the out repository, generated after training at least 1 model). This module also collects user feedback, which might later be used to fine-tune the model with RLHF, as described in #4. The feedback data will be saved in the data repository under the model's name.
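That saved feedback could serve as the tagged responses the RLHF step is missing. One possible record format, with hypothetical field names and path rather than the repo's actual schema:

```python
import json

feedback = {
    "model": "my_model",  # hypothetical model name
    "prompt": "What is a language model?",
    "response": "A model that predicts the next token...",
    "rating": 1,  # e.g. 1 = good, 0 = bad
}
# Hypothetical path; the repo saves feedback under the model's name in data/.
with open("data/my_model_feedback.jsonl", "a") as f:
    f.write(json.dumps(feedback) + "\n")
```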
In order to start:
```bash
python playground.py
```

To commit and push all changes to the repository, follow these steps:
```bash
git init
git remote add origin https://github.com/shaharoded/NanoChatGPT2.git
git fetch origin
git add .
git commit -m "Reasons for disrupting GIT (commit message)"
git branch -M main
git push -u origin main
```
*Note: Replace `main` with your branch name if you're not using the `main` branch.*