Skip to content

bluedotiya/web-crawler

Repository files navigation

Release & PublishPR Title Check

Web Crawler

A distributed, recursive web crawler built in Rust. Feed it a URL and it maps all linked websites at a chosen depth, storing the graph in Neo4j and visualizing it in a React frontend.

Dashboard

Features

  • Recursive crawling — follow links up to a configurable depth
  • Distributed workers — 8 feeder replicas process URLs in parallel
  • Graph storage — Neo4j stores URL nodes and link relationships
  • Real-time progress — WebSocket updates stream crawl status live
  • Interactive graph visualization — force-directed graph with color-coded node status
  • Crawl management — create, monitor, cancel, and review crawl statistics
  • Kubernetes-native — Helm chart deploys all services with a single command

Architecture

graph TD
    User([User]) -->|Port 30080| Frontend
    Frontend["Frontend<br/>(nginx + React SPA)"]
    Frontend -->|"/api/*" reverse proxy| Manager
    Manager["Manager<br/>(Axum REST API)"]
    Manager -->|Reads / Writes| Neo4j
    Neo4j[("Neo4j<br/>Graph Database")]
    Feeder["Feeder (x8)<br/>(Background Workers)"]
    Feeder -->|Claims / Updates| Neo4j
Loading
Service Tech Role
Frontend nginx + React/Vite/TypeScript SPA UI, API reverse proxy
Manager Rust + Axum REST API, crawl initiation, WebSocket
Feeder Rust Background URL processing workers
Neo4j Neo4j 5.x Graph database for crawl data

Quick Start

Prerequisites

Install

helm install web-crawler oci://ghcr.io/bluedotiya/web-crawler/charts/web-crawler \
  --version 1.0.0 -n web-crawler --create-namespace

Verify

kubectl rollout status statefulset/crawler-neo4j -n web-crawler
kubectl get pods -n web-crawler

Access the UI

# Get your node IP
minikube ip   # or use your cluster's external IP

# Open the frontend
open http://<NODE_IP>:30080

Start a crawl

Use the web UI at /new, or via the API:

curl -X POST http://<NODE_IP>:30080/api/v1/crawls \
  -H 'Content-Type: application/json' \
  -d '{"url": "https://example.com", "depth": 2}'

Screenshots

Dashboard
Crawl List
New Crawl
Crawl Progress
Graph Visualization
Statistics

Documentation

Document Description
Architecture System design, data flow, concurrency model, WebSocket protocol
Neo4j Graph Model Node labels, properties, relationships, indexes, example queries
API Reference REST endpoints, request/response schemas, WebSocket protocol
Deployment Guide Helm install, configuration reference, scaling, monitoring
Development Guide Local setup, testing, CI/CD pipeline, conventional commits

Tech Stack

Layer Technology
Backend Rust, Axum, Tokio, neo4rs, hickory-resolver
Frontend React, TypeScript, Vite, Tailwind CSS
Database Neo4j 5.x
Infra Docker, Kubernetes, Helm, GitHub Actions
Registry GitHub Container Registry (GHCR)

Contributing

  1. Fork the repository
  2. Create a feature branch from master
  3. Follow conventional commit format for PR titles
  4. Ensure all checks pass: cargo test --workspace && cargo clippy --workspace -- -D warnings
  5. For frontend changes: cd frontend && npm run lint && npm run type-check
  6. Open a pull request — PRs are squash-merged, and the title drives the version bump

See the Development Guide for detailed setup instructions.

Security

  • Report security vulnerabilities via GitHub Security Advisories
  • All services run as non-root users in containers
  • Neo4j credentials are stored in Kubernetes secrets

License

This project is licensed under the GPL-3.0 License.