FastAPI-based API for text generation using Hugging Face models.

## Features

- Text generation API using Hugging Face models
- Model management (loading/unloading)
- Response caching with Redis
- Asynchronous API endpoints
- Swagger documentation
- Support for GPU acceleration (CUDA and Apple Silicon MPS)
## Requirements

- Python 3.10+
- Poetry
- Redis (optional, for caching)
## Installation

1. Clone the repository
2. Install dependencies with Poetry:

   ```bash
   poetry install
   ```

3. Copy the `.env.example` file to `.env` and fill in the necessary configuration:

   ```bash
   cp .env.example .env
   ```

4. Update the environment variables in the `.env` file:
   - Set your Hugging Face API token
   - Configure Redis if using caching
   - Set `DEVICE` to `cuda` if using a GPU
## Running the Server

```bash
# Development mode
poetry run python -m src.main

# Production mode
poetry run uvicorn src.main:app --host 0.0.0.0 --port 8000
```

The API will be available at http://localhost:8000/api/v1/
## API Endpoints

- `POST /api/v1/text/generate` - Generate text from a prompt
  Example request:

  ```json
  {
    "prompt": "Once upon a time",
    "model": "gpt2",
    "max_length": 100,
    "temperature": 0.8,
    "num_return_sequences": 1
  }
  ```

- `GET /api/v1/text/models` - List available models
- `GET /api/v1/text/models/{model_id}` - Get information about a model
- `POST /api/v1/text/models/{model_id}/load` - Load a model
- `POST /api/v1/text/models/{model_id}/unload` - Unload a model
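A minimal Python client for the generate endpoint can be sketched as follows. `build_payload` and `generate` are hypothetical helper names, and the JSON shape of the response is an assumption (only the request schema is documented above):

```python
import json
from urllib import request

API_URL = "http://localhost:8000/api/v1/text/generate"

def build_payload(prompt: str, model: str = "gpt2", max_length: int = 100,
                  temperature: float = 0.8, num_return_sequences: int = 1) -> dict:
    # Field names and defaults mirror the example request above.
    return {
        "prompt": prompt,
        "model": model,
        "max_length": max_length,
        "temperature": temperature,
        "num_return_sequences": num_return_sequences,
    }

def generate(prompt: str, url: str = API_URL, **kwargs) -> dict:
    # POST the payload as JSON and decode the JSON response.
    data = json.dumps(build_payload(prompt, **kwargs)).encode("utf-8")
    req = request.Request(url, data=data,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.loads(resp.read())

# Requires a running server:
# print(generate("Once upon a time", max_length=50))
```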
API documentation is available at:
- Swagger UI:
http://localhost:8000/docs - ReDoc:
http://localhost:8000/redoc
## Configuration

All configuration is managed through environment variables in the `.env` file:

- `API_PREFIX` - API URL prefix
- `DEBUG` - Enable debug mode
- `SECRET_KEY` - Secret key for security
- `HOST` - Server host
- `PORT` - Server port
- `HF_API_TOKEN` - Hugging Face API token
- `DEFAULT_MODEL` - Default model to use
- `DEVICE` - Device to use (`cpu`, `cuda` for NVIDIA GPUs, or `mps` for Apple Silicon M1/M2/M3)
- `USE_CACHE` - Enable Redis caching
- `REDIS_HOST` - Redis host
- `REDIS_PORT` - Redis port
- `REDIS_PASSWORD` - Redis password
- `CACHE_EXPIRATION` - Cache expiration time in seconds
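As an illustration of how these variables might be read at startup, here is a plain-stdlib sketch; the repo's actual settings module may use pydantic instead, and all default values below are assumptions:

```python
import os

def load_settings() -> dict:
    # Illustrative only: reads the variables listed above with plausible
    # defaults. The project's real settings module may differ.
    return {
        "api_prefix": os.getenv("API_PREFIX", "/api/v1"),
        "debug": os.getenv("DEBUG", "false").lower() == "true",
        "host": os.getenv("HOST", "0.0.0.0"),
        "port": int(os.getenv("PORT", "8000")),
        "hf_api_token": os.getenv("HF_API_TOKEN", ""),
        "default_model": os.getenv("DEFAULT_MODEL", "gpt2"),
        "device": os.getenv("DEVICE", "cpu"),
        "use_cache": os.getenv("USE_CACHE", "false").lower() == "true",
        "redis_host": os.getenv("REDIS_HOST", "localhost"),
        "redis_port": int(os.getenv("REDIS_PORT", "6379")),
        "cache_expiration": int(os.getenv("CACHE_EXPIRATION", "3600")),
    }
```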
## GPU Acceleration

- **NVIDIA (CUDA):** set `DEVICE=cuda` in your `.env` file.
- **Apple Silicon (MPS):** set `DEVICE=mps` in your `.env` file. This requires PyTorch 1.12+ with MPS support.

The application automatically checks device availability and falls back to CPU if the requested device is not available.
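The fallback behavior described above can be sketched as follows; `resolve_device` is a hypothetical name, a simplified stand-in for the repo's actual device check:

```python
def resolve_device(requested: str) -> str:
    """Return the requested device if available, otherwise fall back to CPU."""
    try:
        import torch  # without PyTorch installed, only CPU makes sense
    except ImportError:
        return "cpu"
    if requested == "cuda" and torch.cuda.is_available():
        return "cuda"
    mps = getattr(torch.backends, "mps", None)  # MPS backend exists in PyTorch 1.12+
    if requested == "mps" and mps is not None and mps.is_available():
        return "mps"
    return "cpu"
```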
## License

MIT