StataAgent is an AI-powered agent designed for intuitive, question-based data exploration and analysis directly within Stata. Built upon the smolagents framework, StataAgent interprets both simple Stata commands and more complex analytical questions, converting them into executable Stata code. With an integrated understanding of dataset metadata—including variable names and labels—StataAgent streamlines analytical workflows, making data exploration seamless and efficient.
As of 11 March 2025, this is an experimental project.
- Natural Language Queries:
  - Easily handle questions like:
    - "What is the effect of x1 on y, controlling for x2?"
    - "What is the distribution of homeowners over time?"
- Command Execution:
  - Directly execute straightforward Stata commands like regress y on x1 and x2.
- Metadata Integration:
  - Automatically leverages dataset metadata to accurately interpret questions and commands.
- Powered by smolagents:
  - Robust AI framework optimized for minimal overhead and maximal efficiency.
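To make the metadata integration concrete, here is a minimal sketch of how variable names and labels could be turned into a text block for the agent's prompt. This is an illustration only, not StataAgent's actual code: the function name `format_metadata`, the layout of the output, and the example variables are all assumptions.

```python
def format_metadata(variables):
    """Render variable names and labels as a plain-text block for a prompt.

    `variables` maps each variable name to its label (hypothetical input format).
    """
    if not variables:
        return "Variables in the dataset: (none)"
    # Pad names to a common width so the labels line up.
    width = max(len(name) for name in variables)
    lines = ["Variables in the dataset:"]
    for name, label in variables.items():
        # Fall back to a placeholder when a variable has no label.
        shown = label if label else "(no label)"
        lines.append(f"  {name.ljust(width)}  {shown}")
    return "\n".join(lines)


# Hypothetical metadata, standing in for what would be read from a .dta file.
example = {
    "wage": "Hourly wage in USD",
    "education": "Years of schooling",
    "gender": "",
}
context = format_metadata(example)
print(context)
```

A block like this could be placed ahead of the user's question so the model can map words such as "wages" to the variable `wage`.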
- Stata (version 15 or newer recommended)
- Python 3.7 or newer
- Hugging Face transformers library and API key
Clone this repository and navigate to the project directory:
git clone https://github.com/yourusername/StataAgent.git
cd StataAgent
Set up a Python virtual environment (recommended):
python -m venv venv
source venv/bin/activate  # or on Windows: venv\Scripts\activate
Install required Python packages:
pip install -r requirements.txt
Ensure smolagents is properly installed and configured per the smolagents documentation.
Launch StataAgent with:
python main.py
Once running, interact with the agent using plain English queries or standard Stata commands.
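Because the agent accepts both plain English and Stata commands, something has to tell the two apart. The sketch below shows one simple heuristic for doing so. It is an assumption for illustration: StataAgent may well leave this decision to the language model, and both `KNOWN_COMMANDS` and `is_stata_command` are hypothetical names.

```python
# Hypothetical, deliberately incomplete set of Stata command names.
KNOWN_COMMANDS = {
    "regress", "reg", "tabulate", "tab", "summarize", "sum",
    "describe", "list", "generate", "gen",
}


def is_stata_command(text):
    """Guess whether the input is a Stata command rather than a question."""
    stripped = text.strip()
    # Empty input and anything ending in a question mark is treated as a query.
    if not stripped or stripped.endswith("?"):
        return False
    # Stata is case-sensitive, so the first word is compared without lowercasing.
    first_word = stripped.split()[0]
    return first_word in KNOWN_COMMANDS


print(is_stata_command("regress wage education experience gender"))
print(is_stata_command("What is the distribution of homeowners by year?"))
```

Input that passes this check could be sent to Stata unchanged, while everything else would go to the model for translation into Stata code.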
Natural Language Query:
What is the effect of education on wages, controlling for experience and gender?
Internally Executed Stata Command:
regress wage education experience gender
Natural Language Query:
What is the distribution of homeowners by year?
Internally Executed Stata Command:
tab homeowners year
The results of the executed command will be displayed in the console.
Note: There is currently no way to retain data in memory, so each query reloads the data from disk. Because this is computationally inefficient, StataAgent is not recommended for datasets with more than 100,000 observations.
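The reload-per-query behavior described in the note can be pictured with a short sketch. It assumes each query is turned into a small do-file that first loads the dataset and then runs one command. The function name `build_do_file` and the example path are hypothetical, and how the text is then handed to Stata is left open.

```python
def build_do_file(dataset_path, command):
    """Return do-file text that loads the dataset, then runs one command."""
    # `use ..., clear` replaces whatever data is in memory, so every
    # query starts from a fresh copy read from disk.
    return f'use "{dataset_path}", clear\n{command}\n'


# Hypothetical dataset path; the two commands come from the examples above.
queries = ["regress wage education experience gender", "tab homeowners year"]
for stata_command in queries:
    print(build_do_file("data/survey.dta", stata_command))
```

Each generated do-file begins with its own `use` line, which is why the cost of loading grows with every query on a large dataset.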
Contributions are welcome! Please:
- Fork this repository.
- Create a feature branch (git checkout -b feature/your-feature).
- Commit your changes (git commit -am 'Add feature').
- Push to the branch (git push origin feature/your-feature).
- Submit a Pull Request.
- smolagents: Lightweight AI agent framework.
- StataCorp for providing robust statistical analysis software.
Happy exploring with StataAgent!