These are my args:
- run_cfg@_global_=llama2_7b_drope_qk_norm.yaml
- train_batch_size=512
- per_device_train_batch_size=4
Dataset: PrimeIntellect/fineweb-edu
The only thing I changed was reducing per_device_train_batch_size, since I am running on 40 x 8 GPUs.
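For context, here is a minimal sanity-check sketch (not from the repo) of the usual relation between these three values, assuming train_batch_size is the global batch, that the trainer derives gradient accumulation as global / (per_device x world_size), and that "40 x 8 GPUs" means 40 nodes with 8 GPUs each; the world_size and variable names here are hypothetical:

```python
# Hypothetical check, not part of the training code: verify the global batch
# divides evenly across devices and gradient-accumulation steps.
world_size = 40 * 8              # assumption: 40 nodes x 8 GPUs each
train_batch_size = 512           # global batch size from the args above
per_device_train_batch_size = 4  # reduced per-device batch from the args above

# Samples processed across all devices in one forward/backward pass
samples_per_pass = per_device_train_batch_size * world_size
grad_accum_steps, remainder = divmod(train_batch_size, samples_per_pass)

if remainder != 0:
    print(f"train_batch_size={train_batch_size} is not divisible by "
          f"per_device_train_batch_size * world_size ({samples_per_pass})")
else:
    print(f"gradient_accumulation_steps={grad_accum_steps}")
```

Under that node-count assumption the numbers do not divide evenly (4 x 320 = 1280 > 512), so the actual topology or the intended reading of "40 x 8" may differ from what I assumed here.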