Text2VQL: Teaching a Model Query Language to Open-Source Language Models with ChatGPT

This is the codebase of the paper "Text2VQL: Teaching a Model Query Language to Open-Source Language Models with ChatGPT".

Repository structure

The repository is structured as follows:

The dataset_construction folder contains all the procedures to generate the synthetic dataset of VQL-NL pairs.
The training folder contains all the scripts to fine-tune open-source LLMs using the synthetic dataset.
The results folder contains the results presented in the paper and the scripts needed to execute the testing framework.
The eclipse-rdp folder contains the Docker setup for the Java scripts.

Running everything

To reproduce all the results of the paper, run the phases in the following order:

Run the scripts in eclipse-rdp to build the Docker image and run the container.
Run the scripts in dataset_construction.
Run the scripts in training.
Run the scripts in results.

For each phase, change the working directory to the associated folder (e.g., cd eclipse-rdp). Each folder has its own README.md (e.g., eclipse-rdp/README.md, dataset_construction/README.md) explaining all the requirements and steps for that phase.
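The run order above can be sketched as a small helper that lists the phases in sequence and checks that each per-phase README.md is present. This is only an illustration: the function name phase_readmes and the repo_root parameter are assumptions made for this sketch, and it does not run any of the repository's scripts.

```python
from pathlib import Path

# Phase folders, in the order given in "Running everything".
PHASES = ["eclipse-rdp", "dataset_construction", "training", "results"]


def phase_readmes(repo_root="."):
    """Return (phase, README path, README exists) for each phase, in run order.

    `repo_root` is a hypothetical parameter: the path to a local clone.
    """
    root = Path(repo_root)
    rows = []
    for phase in PHASES:
        readme = root / phase / "README.md"
        rows.append((phase, readme, readme.is_file()))
    return rows


# Print a checklist; the actual steps are described in each README.md.
for step, (phase, readme, found) in enumerate(phase_readmes(), start=1):
    status = "found" if found else "missing"
    print(f"{step}. cd {phase}, then follow {readme} ({status})")
```

Run from the root of a clone, every README should be reported as found; a missing entry suggests the script was started from the wrong directory.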

The results folder contains not only the scripts used to answer all the RQs but also the data associated with the paper. Therefore, if you run all the scripts starting from the very first phase, you will eventually overwrite this data.
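Because a full run overwrites the published data, it may be worth copying the results folder first. The following is a minimal sketch of such a backup, not part of the repository: the function name backup_results and the backup folder naming scheme are assumptions chosen for illustration.

```python
import shutil
import time
from pathlib import Path


def backup_results(repo_root=".", backup_name=None):
    """Copy <repo_root>/results to a sibling backup folder and return its path.

    Both parameters are hypothetical; by default the backup folder gets a
    timestamped name such as results_backup_20240101_120000.
    """
    root = Path(repo_root)
    source = root / "results"
    if not source.is_dir():
        raise FileNotFoundError(f"No results folder found under {root}")
    if backup_name is None:
        backup_name = "results_backup_" + time.strftime("%Y%m%d_%H%M%S")
    destination = root / backup_name
    # copytree refuses to overwrite an existing folder (FileExistsError),
    # so an earlier backup is never silently replaced.
    shutil.copytree(source, destination)
    return destination
```

Calling backup_results() from the repository root before the first phase leaves the original results folder untouched and stores a copy next to it.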

Requirements

The hardware and software requirements are the following.

One GPU comparable or superior to an NVIDIA RTX A5000, for training and running the open-source models.
- CUDA version >= 12.1.
Ubuntu OS.
Conda.
- Linux installation guide.
- Getting started with conda (optional).
Docker engine.
- Ensure that you can run docker as a non-root user. See post-installation guide.
Each phase has its own environment.yml specifying the Python version and the required Python libraries. Conda reads these yml files to create a new environment with all the dependencies.
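The requirements above can be checked with a short sketch like the one below. It is an illustration under stated assumptions: the helper names are hypothetical, the CUDA version string has to be supplied by you (for example, as reported by your driver tools), and the conda command shown is the standard conda env create -f invocation, which the per-phase README.md files may refine.

```python
import shutil

REQUIRED_TOOLS = ("conda", "docker")  # command-line tools listed above
MIN_CUDA = (12, 1)                    # CUDA version >= 12.1


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools from `tools` that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def cuda_version_ok(version, minimum=MIN_CUDA):
    """Check a version string such as '12.4' against the minimum (12, 1)."""
    parts = version.split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    return (major, minor) >= minimum


def conda_env_create_command(env_file="environment.yml"):
    """Build the standard conda command that creates an environment from a file."""
    return ["conda", "env", "create", "-f", env_file]
```

For example, missing_tools() returns an empty list when both conda and docker are installed, and conda_env_create_command() gives the command to run inside a phase folder after changing into it.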

Licenses

The code in this repository is under the MIT License (LICENSE-CODE). The models and dataset associated with the paper (uploaded to HuggingFace) are under a research-only license (LICENSE-MODEL-DATA).

The repository includes a modified copy of refinery under eclipse-rdp/refinery which is provided under Eclipse Public License - v 2.0.

PELAB-LiU/Text2VQL
