This is the official repository for the paper:
Following the Autoregressive Nature of LLM Embeddings via Compression and Alignment
Overall framework of AutoRegEmbed. First, we perform the information compression task to inject key information from the context and instruction into the compressed tokens. Then, we align the probability distributions generated by the compressed tokens, aiming to minimize the distance between the distributions conditioned on the context and on positive samples while maximizing the distance from those conditioned on negative samples.
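For intuition, the alignment objective can be sketched as follows. This is a minimal illustration only, not the repository's implementation: the use of KL divergence as the distance, the contrastive loss form, and the tensor shapes are all assumptions.

```python
import torch
import torch.nn.functional as F

def alignment_loss(context_logits, pos_logits, neg_logits, temperature=1.0):
    """Illustrative distribution-alignment objective (assumptions noted above).

    Each *_logits tensor has shape (batch, vocab) and is the next-token
    distribution produced when the model is conditioned on the compressed
    tokens of the context / positive / negative sample.
    """
    ctx_logp = F.log_softmax(context_logits, dim=-1)
    pos_logp = F.log_softmax(pos_logits, dim=-1)
    neg_logp = F.log_softmax(neg_logits, dim=-1)

    # Distribution distance, here taken to be KL(context || sample).
    d_pos = F.kl_div(pos_logp, ctx_logp, log_target=True, reduction="none").sum(-1)
    d_neg = F.kl_div(neg_logp, ctx_logp, log_target=True, reduction="none").sum(-1)

    # Contrastive form: pull the positive distribution close, push the negative away.
    logits = torch.stack([-d_pos, -d_neg], dim=-1) / temperature
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, labels)
```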
The required core libraries are as follows:
```
torch==2.5.1
transformers==4.49.0
SentencePiece==0.2.0
mteb==1.12.93
```
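They can be installed with pip, for example:

```bash
pip install torch==2.5.1 transformers==4.49.0 SentencePiece==0.2.0 mteb==1.12.93
```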
Additionally, we recommend running our code on A100-80G GPUs.
Download the data file from this Hugging Face path and place it in the ./data directory.
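For example, a dataset hosted on the Hugging Face Hub can be fetched with huggingface_hub; the repository ID below is a placeholder, so substitute the path linked above:

```python
from huggingface_hub import snapshot_download

# "<dataset_repo_id>" is a placeholder; replace it with the Hugging Face path above.
snapshot_download(repo_id="<dataset_repo_id>", repo_type="dataset", local_dir="./data")
```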
As described in the paper, the training phase consists of two tasks: information compression and conditional distribution alignment.
Execute the following commands:
```bash
cd script
chmod +x run_information_compress.sh
./run_information_compress.sh
```
Note: modify the number of GPUs, the model path, and other parameters according to your server.
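For instance, assuming the script respects standard CUDA device settings (check the variables actually defined in run_information_compress.sh), you could restrict training to specific GPUs before launching it:

```bash
# Illustration only; any GPU/model-path variables inside the script take precedence.
export CUDA_VISIBLE_DEVICES=0,1,2,3
./run_information_compress.sh
```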
First, compute the reference score using the model produced by the information compression task.
```bash
cd ..
python compute_reference_score.py
```
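As a rough illustration of what such a reference score can be, the sketch below computes the average log-probability a frozen model assigns to a target sequence conditioned on an input (e.g., the context plus compressed tokens). The function, its signature, and the shapes are assumptions for illustration, not the script's actual API:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def sequence_logprob(model, input_ids, target_ids):
    """Hypothetical helper: mean log-probability of `target_ids` given `input_ids`.
    Shapes: input_ids (batch, prompt_len), target_ids (batch, target_len)."""
    ids = torch.cat([input_ids, target_ids], dim=1)
    # Logits at position i predict token i+1, so take the slice covering the targets.
    logits = model(ids).logits[:, input_ids.size(1) - 1 : -1, :]
    logp = F.log_softmax(logits, dim=-1)
    token_logp = torch.gather(logp, 2, target_ids.unsqueeze(-1)).squeeze(-1)
    return token_logp.mean(dim=1)
```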
Then run the corresponding script.
```bash
cd script
chmod +x run_condition_distribution_alignment.sh
./run_condition_distribution_alignment.sh
```
Please make sure the paths used here are consistent with those from the previous steps.
Execute the following command:
```bash
cd ..
python eval/eval_autoregembed.py --model_path 'your_trained_model_path' --save_path 'your_saved_evaluation_path' --mp_size 8 --field_template --dtype float16 --task_type STS --num_compress_token 5
```
View the evaluation results by running the following command:
```bash
python show_results.py -i 'your_saved_evaluation_path/no_model_name_available/no_revision_available'
```
Note that the evaluation results are saved in the no_model_name_available/no_revision_available folder under your save path.
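If you prefer to inspect the raw result files directly, a small sketch like the following works as well; it assumes the MTEB JSON layout with a top-level "scores" field keyed by split, which may differ across mteb versions:

```python
import json
import pathlib

# Adjust to your actual save path.
results_dir = pathlib.Path("your_saved_evaluation_path/no_model_name_available/no_revision_available")

for path in sorted(results_dir.rglob("*.json")):
    data = json.loads(path.read_text())
    for split, entries in data.get("scores", {}).items():
        for entry in entries:
            print(f"{path.stem} [{split}]: main_score = {entry.get('main_score')}")
```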