This is the official repository for the paper:
Following the Autoregressive Nature of LLM Embeddings via Compression and Alignment
Overall framework of AutoRegEmbed. First, we perform the information compression task to inject key information from the context and instruction into the compressed tokens. Then, we align the probability distributions generated by the compressed tokens, aiming to minimize the distance between the distributions conditioned on the context and on positive samples while maximizing the distance from those conditioned on negative samples.
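For intuition, the alignment objective can be sketched as follows. This is a minimal illustration only, not the repository's implementation: the use of KL divergence as the distance, the contrastive loss form, and the tensor shapes are all assumptions.

```python
import torch
import torch.nn.functional as F

def alignment_loss(context_logits, pos_logits, neg_logits, temperature=1.0):
    """Illustrative distribution-alignment objective (assumptions noted above).

    Each *_logits tensor has shape (batch, vocab) and is the next-token
    distribution produced when the model is conditioned on the compressed
    tokens of the context / positive / negative sample.
    """
    ctx_logp = F.log_softmax(context_logits, dim=-1)
    pos_logp = F.log_softmax(pos_logits, dim=-1)
    neg_logp = F.log_softmax(neg_logits, dim=-1)

    # Distribution distance, here taken to be KL(context || sample).
    d_pos = F.kl_div(pos_logp, ctx_logp, log_target=True, reduction="none").sum(-1)
    d_neg = F.kl_div(neg_logp, ctx_logp, log_target=True, reduction="none").sum(-1)

    # Contrastive form: pull the positive distribution close, push the negative away.
    logits = torch.stack([-d_pos, -d_neg], dim=-1) / temperature
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, labels)
```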
The required core libraries are as follows:
```
torch==2.5.1
transformers==4.49.0
SentencePiece==0.2.0
mteb==1.12.93
```
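They can be installed with pip, for example:

```bash
pip install torch==2.5.1 transformers==4.49.0 SentencePiece==0.2.0 mteb==1.12.93
```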
Additionally, we recommend running our code on A100-80G GPUs.
Download the data file from this Hugging Face path and place it in the ./data directory.
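For example, a dataset hosted on the Hugging Face Hub can be fetched with huggingface_hub; the repository ID below is a placeholder, so substitute the path linked above:

```python
from huggingface_hub import snapshot_download

# "<dataset_repo_id>" is a placeholder; replace it with the Hugging Face path above.
snapshot_download(repo_id="<dataset_repo_id>", repo_type="dataset", local_dir="./data")
```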
As described in the paper, the training phase consists of two tasks: information compression and conditional distribution alignment.
Execute the following commands:
```bash
cd script
chmod +x run_information_compress.sh
./run_information_compress.sh
```
Note: modify the number of GPUs, the model path, and other parameters according to your server.
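For instance, assuming the script respects standard CUDA device settings (check the variables actually defined in run_information_compress.sh), you could restrict training to specific GPUs before launching it:

```bash
# Illustration only; any GPU/model-path variables inside the script take precedence.
export CUDA_VISIBLE_DEVICES=0,1,2,3
./run_information_compress.sh
```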
First, compute the reference score using the model produced by the information compression task.
```bash
cd ..
python compute_reference_score.py
```
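As a rough illustration of what such a reference score can be, the sketch below computes the average log-probability a frozen model assigns to a target sequence conditioned on an input (e.g., the context plus compressed tokens). The function, its signature, and the shapes are assumptions for illustration, not the script's actual API:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def sequence_logprob(model, input_ids, target_ids):
    """Hypothetical helper: mean log-probability of `target_ids` given `input_ids`.
    Shapes: input_ids (batch, prompt_len), target_ids (batch, target_len)."""
    ids = torch.cat([input_ids, target_ids], dim=1)
    # Logits at position i predict token i+1, so take the slice covering the targets.
    logits = model(ids).logits[:, input_ids.size(1) - 1 : -1, :]
    logp = F.log_softmax(logits, dim=-1)
    token_logp = torch.gather(logp, 2, target_ids.unsqueeze(-1)).squeeze(-1)
    return token_logp.mean(dim=1)
```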
Then run the corresponding script.
```bash
cd script
chmod +x run_condition_distribution_alignment.sh
./run_condition_distribution_alignment.sh
```
Please make sure the paths used here are consistent with those from the previous steps.
Execute the following command:
```bash
cd ..
python eval/eval_autoregembed.py --model_path 'your_trained_model_path' --save_path 'your_saved_evaluation_path' --mp_size 8 --field_template --dtype float16 --task_type STS --num_compress_token 5
```
View the evaluation results by running the following command:
```bash
python show_results.py -i 'your_saved_evaluation_path/no_model_name_available/no_revision_available'
```
Note that the evaluation results are saved in the no_model_name_available/no_revision_available folder under your save path.
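If you prefer to inspect the raw result files directly, a small sketch like the following works as well; it assumes the MTEB JSON layout with a top-level "scores" field keyed by split, which may differ across mteb versions:

```python
import json
import pathlib

# Adjust to your actual save path.
results_dir = pathlib.Path("your_saved_evaluation_path/no_model_name_available/no_revision_available")

for path in sorted(results_dir.rglob("*.json")):
    data = json.loads(path.read_text())
    for split, entries in data.get("scores", {}).items():
        for entry in entries:
            print(f"{path.stem} [{split}]: main_score = {entry.get('main_score')}")
```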