MOUSSE (Metadata fOcUsed Semantic Search Engine) is a semantic search tool powered by Large Language Models (LLMs). Designed to streamline data discovery, MOUSSE focuses on metadata, making it well suited for navigating and unlocking the potential of vast, diverse datasets, even those without structured ontologies.
The Mousse platform consists of a React-powered UI and a FastAPI-based REST API. Metadata records are indexed in a PostgreSQL database, utilizing:
- PostGIS for spatial indexing.
- pgvector for embedding-based vector indexing.
- A time-range index for efficient temporal searches.
A key challenge addressed by this solution is efficiently applying spatiotemporal and semantic filtering to large datasets. This is achieved through a dynamic query builder that constructs SQL queries based on user input. However, PostGIS and pgvector indexes do not work together out of the box. To overcome this, a semantic index (pgvectorscale, by TimescaleDB) has been implemented on top of pgvector, enabling hybrid searches. pgvectorscale implements the DiskANN algorithm, leveraging SSD storage for indexing rather than relying solely on in-memory semantic indexes, which often come with high scalability costs.
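To make this concrete, the sketch below shows the kind of SQL such a builder might emit. The table and column names (`records`, `geom`, `valid_from`, `valid_to`, `embedding`) are illustrative assumptions, not the project's actual schema; parameters follow psycopg's named-placeholder style.

```python
# Minimal sketch of a dynamic hybrid query builder. Table and column
# names are hypothetical; the real schema lives in the project migrations.
def build_hybrid_query(has_bbox: bool, has_time: bool) -> str:
    """Compose spatial and temporal filters, then rank by vector distance."""
    clauses = []
    if has_bbox:
        # PostGIS bounding-box filter (west, south, east, north in EPSG:4326).
        clauses.append(
            "ST_Intersects(geom, ST_MakeEnvelope(%(w)s, %(s)s, %(e)s, %(n)s, 4326))"
        )
    if has_time:
        # Overlap test between the record's validity window and the query range.
        clauses.append("valid_from <= %(to)s AND valid_to >= %(from)s")
    where = f"WHERE {' AND '.join(clauses)}" if clauses else ""
    # pgvector's <=> operator is cosine distance; pgvectorscale's
    # DiskANN-based index keeps this ORDER BY fast under the filters above.
    return (
        "SELECT id, title FROM records "
        f"{where} "
        "ORDER BY embedding <=> %(query_embedding)s "
        "LIMIT %(k)s"
    )
```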
A standout feature of the platform is its automatic detection of spatial and temporal filters based on user queries. This is powered by a lightweight, fast LLM, which is prompted to function as a Named-Entity Recognition (NER) system, specifically tuned for the project's needs. The NER system:
- Identifies location- and datetime-related entities.
- Maps locations to their corresponding country or list of countries.
- Converts datetime entities into structured time ranges or recurring epochs.
Users can manually adjust the detected filters, extending, refining, or removing them as needed.
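As a rough illustration of this flow, the snippet below queries a TGI endpoint with an NER-style prompt and parses the structured reply. The prompt wording, JSON schema, and endpoint address are assumptions for the sketch, not the production prompt:

```python
# Sketch of calling the TGI-hosted LLM as an NER system.
# /generate is TGI's standard endpoint; the prompt and the expected JSON
# schema here are illustrative assumptions.
import json
import requests

TGI_URL = "http://localhost:8080/generate"  # hypothetical host/port

PROMPT = (
    "Extract location and datetime entities from the query below. Reply with "
    'JSON: {{"countries": [...], "time_range": {{"from": "...", "to": "..."}}}}\n'
    "Query: {query}"
)

def extract_filters(query: str) -> dict:
    payload = {
        "inputs": PROMPT.format(query=query),
        "parameters": {"max_new_tokens": 128},
    }
    resp = requests.post(TGI_URL, json=payload, timeout=30)
    resp.raise_for_status()
    # TGI responds with {"generated_text": "..."}; the prompt asks for JSON.
    return json.loads(resp.json()["generated_text"])
```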
When a query returns a large number of results (above a certain threshold), the platform offers a clustering view to provide users with a high-level overview. Clustering is performed on a lower-dimensional projection of the embedding space, which must currently be precomputed offline and uploaded to the platform.
The system uses the K-Means algorithm to group similar results into clusters, chosen for its simplicity and fast execution on large datasets, making it ideal for responsive, interactive use. Once clustering is triggered, the resulting clusters, along with the IDs of their member records, are cached in an in-memory store (Valkey), enabling fast and interactive exploration of the grouped data. Users can then browse the contents of each cluster, with cluster members presented in a paginated view to support smooth navigation through large result sets.
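A minimal sketch of this step, assuming the lower-dimensional vectors are already loaded as a NumPy array and using the valkey-py client (the key scheme and TTL are assumptions):

```python
# Sketch: K-Means over the precomputed low-dimensional projection, with
# per-cluster member IDs cached in Valkey for paginated browsing.
import numpy as np
from sklearn.cluster import KMeans
import valkey  # Valkey's Python client, API-compatible with redis-py

client = valkey.Valkey(host="localhost", port=6379)

def cluster_and_cache(result_ids: list[str], low_dim: np.ndarray, k: int = 10) -> None:
    labels = KMeans(n_clusters=k, n_init="auto").fit_predict(low_dim)
    for cluster in range(k):
        members = [rid for rid, lbl in zip(result_ids, labels) if lbl == cluster]
        if not members:
            continue
        key = f"clusters:{cluster}"  # hypothetical key scheme
        client.delete(key)
        client.rpush(key, *members)
        client.expire(key, 3600)  # assumed one-hour TTL
        # A page of cluster members can later be read with
        # client.lrange(key, offset, offset + page_size - 1).
```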
The overall architecture is illustrated in the following diagram:

```mermaid
architecture-beta
group api[API]
group front[Frontend]
group llm[LLM services] in api
group cache[Cache] in api
group db[Database] in api
service postgres(database)[PostgreSQL] in db
service disk2(disk)[DiskANN] in db
service server(server)[FastAPI] in api
service valkey(server)[Valkey] in cache
service triton(server)[NVIDIA Triton] in llm
service tgi(server)[HuggingFace TGI] in llm
service react(internet)[ReactJS] in front
service gateway(cloud)[nginx]
react:B -- T:gateway
gateway:R -- L:server
server:T -- L:triton
postgres:T -- B:server
tgi:L -- R:server
disk2:L -- R:postgres
valkey:R -- L:server
```
The Mousse platform consists of multiple interconnected components, each responsible for a specific part of the system's functionality. The architecture follows a microservices-based approach, where different services handle API requests, frontend interactions, database operations, and AI-powered processing.
- Frontend (ReactJS): The user interface is built with ReactJS, providing an interactive and dynamic experience. The Leaflet library is used for visualizing geospatial data, while Redux manages state and facilitates communication between components through a universal store. The frontend interacts with the backend via an API gateway.
- API Gateway (Nginx): Manages incoming requests and routes them to the appropriate backend service.
- Backend API (FastAPI): The core REST API, built with FastAPI, handles business logic, user queries, and database interactions.
- Database Layer (PostgreSQL + PostGIS + pgvector/pgvectorscale):
  - PostgreSQL: Stores metadata records and structured data.
  - PostGIS: Enables geospatial indexing for spatial queries.
  - pgvector: Supports semantic searches using vector embeddings.
  - DiskANN (Disk-based Approximate Nearest Neighbor): Implements efficient vector search indexing with SSD-based storage for scalability and hybrid searches.
- Caching Layer (Valkey): An in-memory store used so that users can efficiently browse pages of clustered results.
- Inference Server (NVIDIA Triton): Handles ML model inference; its role is to project a text query into the corresponding embedding (see the sketch after this list).
- LLM NER (Hugging Face TGI): A Text Generation Inference (TGI) server is used to parse and enhance user queries by extracting spatial and temporal information with a fine-tuned LLM-based NER system.
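For illustration, a query embedding could be requested from Triton with the official `tritonclient` package as below; the model name and tensor names (`embedder`, `TEXT`, `EMBEDDING`) are placeholders, since the real ones are defined by the deployed model configuration:

```python
# Sketch of an embedding request to Triton over HTTP.
# Model and tensor names are assumptions; the deployed model
# configuration defines the real ones.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

def embed(query: str) -> np.ndarray:
    text = np.array([query.encode("utf-8")], dtype=np.object_)
    inp = httpclient.InferInput("TEXT", list(text.shape), "BYTES")
    inp.set_data_from_numpy(text)
    out = httpclient.InferRequestedOutput("EMBEDDING")
    result = client.infer(model_name="embedder", inputs=[inp], outputs=[out])
    return result.as_numpy("EMBEDDING")
```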
This architecture ensures efficient handling of spatiotemporal and semantic queries, leveraging database indexing, ML inference, and a responsive API layer for seamless user interactions.
A recipe for the project deployment is defined in the docker-compose YAML files. To get started, first copy the contents of `.env.example` into `.env` and fill in the required information.
Then, build and start the deployment stack using the following commands:
```sh
docker compose -f docker-compose.yml -f docker-compose.production.yml build
```

and

```sh
docker compose -f docker-compose.yml -f docker-compose.production.yml up -d --remove-orphans
```

Once the system is running, the database must be updated to the latest migration state. This is handled by an ephemeral container, which can be executed with:

```sh
docker compose --profile manual run migrate
```

Data for ingestion should be stored in (partitioned) Parquet files before being imported into the system. Currently, only specific attribute names are allowed for the core dataset fields, as shown in the following table:
| Attribute | Description | Type |
|---|---|---|
| id | Unique id | str |
| title | Record title | str |
| description | Record description | str |
| format | Resource formats | str[] |
| type | Record type | Enum[simple, composed] |
| keyword | Associated keywords | str[] |
| when | Time range | obj<from, to> |
| where | Spatial extent | obj<east, west, north, south> |
| mean_embeddings | Record embedding | float[] |
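As a sketch, the expected layout could be declared with pyarrow as follows; the nested field names mirror the table above, while the physical types (e.g. timestamp resolution, embedding dimensionality) are assumptions that depend on the deployed models:

```python
# Sketch of a pyarrow schema matching the attribute table above.
# Struct/list layouts mirror the documented types; exact physical types
# are assumptions.
import pyarrow as pa

schema = pa.schema([
    ("id", pa.string()),
    ("title", pa.string()),
    ("description", pa.string()),
    ("format", pa.list_(pa.string())),
    ("type", pa.string()),  # "simple" or "composed"
    ("keyword", pa.list_(pa.string())),
    ("when", pa.struct([("from", pa.timestamp("s")), ("to", pa.timestamp("s"))])),
    ("where", pa.struct([
        ("east", pa.float64()), ("west", pa.float64()),
        ("north", pa.float64()), ("south", pa.float64()),
    ])),
    ("mean_embeddings", pa.list_(pa.float32())),
])
```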
To ingest data into the database, a second ephemeral container is provided. Assuming the Parquet files are located in a directory with the absolute path /path/to/parquet, you can start the ingestion process with:
```sh
docker compose --profile manual run -v /path/to/parquet:/data ingest bulk /data
```

To ingest the lower-dimensional vectors, run:

```sh
docker compose --profile manual run -v /path/to/parquet:/data ingest lower-dim /data
```

For development, hot reloading can be enabled by running:

```sh
docker compose up -d
```

This ensures that changes are automatically reflected without needing to restart the containers manually.