Local-first anomaly detection for public welfare records.
Canary helps journalists, auditors, researchers, and concerned citizens analyze publicly available government data to spot patterns that may warrant further investigation: address clustering, payment outliers, ownership networks, campaign contribution timing, and more.
Like a canary in a coal mine, this tool provides early warnings, not conclusions.
All flagged items are potential anomalies requiring human review. This tool analyzes public data only and makes no accusations.
Keywords: fraud detection, childcare oversight, eldercare licensing, public records analysis, investigative journalism, government accountability, welfare program audit, anomaly detection, data journalism, OSINT
Public welfare programs disburse billions annually. The data to oversee them (licensing records, payment reports, violation histories) is often publicly available but scattered across state portals in formats that require technical skills to analyze.
Canary changes that:
- No coding required: Point-and-click interface built on Streamlit
- 100% local: All data stays on your machine. No cloud. No telemetry. No accounts.
- Privacy-first: Share findings by exporting your database file; you control what leaves your computer
- Extensible: Add scrapers for new states, customize detection rules, integrate your own AI
- Investigative journalists cross-referencing childcare provider licenses with campaign contributions
- Government auditors identifying unusual patterns in welfare provider networks
- Academic researchers studying public program administration and oversight
- Civic watchdogs monitoring local childcare and eldercare facility licensing
- Newsroom data teams building accountability stories from public records
| Feature | Description |
|---|---|
| Data Import | Upload CSV/Excel files or trigger built-in scrapers for supported states |
| Anomaly Detection | Rule-based flagging for address clustering, payment/capacity ratios, shared phones/owners |
| Risk Scoring | Configurable weighted scoring with enable/disable and custom weights per rule |
| Map Visualization | Geographic view with color-coded risk markers |
| Evidence Collection | Attach photos, documents, and notes to providers |
| Report Generation | Export PDF/HTML reports with AI-generated narratives |
| AI Assistant | Natural language queries, risk explanations, pattern discovery, investigation narratives |
| AI Integration | BYOK support for OpenAI, Anthropic, Google Gemini, xAI Grok, Ollama, and LM Studio |
- Python 3.11 or higher (download)
- Git (optional, for cloning)
Canary includes built-in data scrapers for childcare and eldercare licensing in:
Childcare: Alabama, Colorado, Illinois, Michigan, Minnesota, Virginia, Wisconsin
Eldercare: Alabama, Colorado, Illinois, Michigan, Minnesota, Virginia, Wisconsin
See Available Scrapers for detailed source information.
```shell
# Clone the repository
git clone https://github.com/canary-audit/canary.git
cd canary

# Create a virtual environment (recommended)
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Optional: Install Playwright browsers for full scraper support
playwright install
```

Note on Playwright: Some state websites (Illinois, Minnesota) use bot protection that requires browser automation. Without Playwright browsers installed, these scrapers will run in limited mode or return no data. Other scrapers (Colorado, Alabama, Virginia) work without Playwright.
```shell
streamlit run Home.py
```

This starts a local server and opens http://localhost:8501 in your browser.
1. Import data: Go to "Data Import" and either:
   - Upload a CSV/Excel file from a state portal
   - Run a built-in scraper (state scrapers available as contributed)
2. Review anomalies: The Dashboard shows top-scoring providers and recent clusters
3. Explore the map: See geographic patterns at a glance
4. Investigate: Click any provider to see details, add evidence, and generate reports
```
canary/
├── Home.py                    # Application entry point
├── pages/                     # Streamlit pages
│   ├── 1_Dashboard.py         # Summary metrics and top anomalies
│   ├── 2_Data_Import.py       # Upload files and run scrapers
│   ├── 3_Providers.py         # Browse and search providers (with AI explanations)
│   ├── 4_Map.py               # Geographic visualization
│   ├── 5_Cross_References.py  # Link external data sources
│   ├── 6_Network.py           # Relationship analysis
│   ├── 7_Timeline.py          # Temporal analysis
│   ├── 8_Evidence.py          # Attach documents and notes
│   ├── 9_Reports.py           # Generate reports with AI narratives
│   ├── 10_Settings.py         # Configure thresholds, AI, and rule management
│   └── 11_AI_Assistant.py     # Natural language queries and pattern discovery
├── src/
│   ├── analysis/              # Detection rules, scoring, and custom rules
│   ├── scrapers/              # State-specific data fetchers
│   ├── fetchers/              # External data fetchers (FEC, PPP, etc.)
│   ├── importers/             # CSV/Excel/PDF handlers
│   ├── exporters/             # Report and data export
│   ├── models/                # Data models
│   ├── ai/                    # AI providers, NL queries, reasoning, patterns, narratives
│   ├── ui/                    # Theme and UI components
│   └── utils/                 # Shared utilities (geocoding, etc.)
├── tests/                     # Test suite
├── data/                      # Local database and evidence (gitignored)
└── .streamlit/                # Streamlit configuration
```
Adjust sensitivity in Settings:
| Rule | Default | Description |
|---|---|---|
| Address clustering | 3+ providers | Flag addresses with multiple providers |
| Phone clustering | 2+ providers | Flag shared phone numbers |
| Owner clustering | 5+ facilities | Flag owners with many facilities |
| Payment/capacity ratio | $15K/slot/year | Flag unusually high payments |
| Missing address | Any | Flag providers with no street address on file |
| Political donations | $1,000+ | Flag provider owners with campaign contributions |
| Temporal correlation | 60 days | Flag donations shortly before license events |
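The rules above are simple threshold checks over normalized provider records. A minimal sketch of the first and fourth rules, assuming a flat record shape (the field names, thresholds as defaults, and sample data are illustrative, not Canary's actual implementation):

```python
from collections import Counter

# Thresholds mirror the defaults in the table above
ADDRESS_CLUSTER_MIN = 3        # flag addresses shared by 3+ providers
PAYMENT_PER_SLOT_MAX = 15_000  # flag payments above $15K per slot per year

def flag_address_clusters(providers, min_size=ADDRESS_CLUSTER_MIN):
    """Return addresses shared by min_size or more providers."""
    counts = Counter(p["address"] for p in providers if p.get("address"))
    return {addr for addr, n in counts.items() if n >= min_size}

def payment_ratio_flagged(provider, max_ratio=PAYMENT_PER_SLOT_MAX):
    """True when annual payments per licensed slot exceed the threshold."""
    capacity = provider.get("capacity") or 0
    return capacity > 0 and provider["annual_payments"] / capacity > max_ratio

providers = [
    {"name": "A", "address": "1 Main St", "capacity": 10, "annual_payments": 90_000},
    {"name": "B", "address": "1 Main St", "capacity": 8,  "annual_payments": 200_000},
    {"name": "C", "address": "1 Main St", "capacity": 12, "annual_payments": 100_000},
    {"name": "D", "address": "9 Oak Ave", "capacity": 5,  "annual_payments": 60_000},
]
print(flag_address_clusters(providers))                            # {'1 Main St'}
print([p["name"] for p in providers if payment_ratio_flagged(p)])  # ['B']
```

A flagged result only says "three providers share this address"; whether that is a shared office building or something worth investigating is the human reviewer's call.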
Canary supports bringing your own API keys for AI-powered features:
- Go to Settings → AI Configuration
- Select your provider and model (cheapest models selected by default)
- Enter your API key (stored securely in your system keyring)
| Provider | Default Model | Local/Cloud | Notes |
|---|---|---|---|
| OpenAI | gpt-4o-mini | Cloud | Cheapest at $0.15/1M tokens |
| Anthropic | claude-3-haiku | Cloud | Cheapest at $0.25/1M tokens |
| Google Gemini | gemini-2.0-flash | Cloud | Cheapest at $0.10/1M tokens |
| xAI Grok | grok-4-1-fast | Cloud | Cheapest at $0.20/1M tokens |
| Ollama | llama3.2 | Local | Free, privacy-preserving |
| LM Studio | (loaded model) | Local | Free, OpenAI-compatible API |
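Because Ollama and LM Studio expose OpenAI-compatible endpoints, one request shape can cover both cloud and local providers. A hedged sketch of building such a request (the base URLs and model name are common local defaults, not guarantees, and this is not Canary's actual client code):

```python
import json

def build_chat_request(base_url: str, model: str, prompt: str):
    """Return (url, body) for an OpenAI-style /chat/completions call."""
    url = f"{base_url.rstrip('/')}/chat/completions"
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return url, body

# LM Studio typically serves on http://localhost:1234/v1;
# Ollama's OpenAI-compatible endpoint is usually http://localhost:11434/v1
url, body = build_chat_request("http://localhost:1234/v1", "llama3.2",
                               "Why was this provider flagged?")
print(url)  # http://localhost:1234/v1/chat/completions
```

With local providers the request never leaves your machine, which is why they are recommended for sensitive investigations.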
| Feature | Location | Description |
|---|---|---|
| Natural Language Query | AI Assistant | Ask questions in plain English ("Show me Illinois providers with donations over $500") |
| Risk Explanations | Providers page | AI explains why a specific provider was flagged |
| Pattern Discovery | AI Assistant | AI analyzes data to suggest new detection rules |
| Investigation Narratives | Reports page | AI generates journalist-ready writeups |
Privacy Note: Cloud AI providers receive your query text. For sensitive investigations, use local providers (Ollama or LM Studio) which process everything on your machine.
```shell
# Install dev dependencies
pip install -r requirements-dev.txt

# Install pre-commit hooks
pre-commit install
```

```shell
# Run the app
streamlit run Home.py

# Run tests
pytest

# Run tests with coverage
pytest --cov=src --cov-report=html

# Type checking
mypy src/

# Linting
ruff check .

# Formatting
ruff format .
```

Canary uses auto-discovery; just drop a new scraper file in src/scrapers/ with the @register_scraper decorator and it works automatically. No registry edits needed.
```shell
# Quick start
cp docs/templates/scraper_template.py src/scrapers/your_state.py
# Edit the file, run the app, and your scraper appears in the UI
```

See CONTRIBUTING.md for step-by-step instructions, patterns for different data sources (API, HTML, Playwright), and testing templates.
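The auto-discovery pattern can be as small as a class decorator writing into a module-level registry. This sketch shows the general idea only; the names here (SCRAPERS, register_scraper, source_name, fetch) are illustrative, not Canary's actual internals:

```python
SCRAPERS = {}  # source_name -> scraper class

def register_scraper(cls):
    """Class decorator: record the scraper so the UI can list it."""
    SCRAPERS[cls.source_name] = cls
    return cls

@register_scraper
class YourStateScraper:
    source_name = "your_state_childcare"

    def fetch(self):
        # A real scraper would hit the state portal and normalize records
        return [{"name": "Example Provider", "address": "1 Main St"}]

print(sorted(SCRAPERS))  # ['your_state_childcare']
```

Importing every module in src/scrapers/ at startup is all it takes for the decorator to run and the scraper to appear, which is why no registry edits are needed.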
| State | Source | Stability | Records | Playwright |
|---|---|---|---|---|
| CO | CDEC Socrata API | 🟢 stable | ~6,000 | |
| CO | CDPHE ArcGIS | 🟢 stable | ~1,500 | |
| AL | DHR ASP.NET | 🟢 stable | ~3,000 | |
| AL | ADPH Directory | 🟡 beta | ~1,000 | ✅ |
| VA | DSS Facility Search | 🟢 stable | ~8,000 | |
| VA | DSS ALF Search | 🟢 stable | ~600 | |
| IL | DCFS Sunshine | 🟢 stable | ~8,600 | ✅ |
| IL | IDPH Socrata | 🟢 stable | ~2,000 | |
| MN | DHS Licensing | 🔴 experimental | ~10,000 | ✅ |
| MN | MDH Directory | 🟢 stable | ~5,000 | |
| WI | DCF Child Care | 🟢 stable | ~10,000 | |
| WI | DHS Eldercare | 🔴 experimental | ~3,000 | ✅ |
| MI | LARA Childcare | 🟢 stable | ~10,000 | |
| MI | LARA Eldercare | 🟢 stable | ~5,000 | |
| Source | Type | Stability | API Key |
|---|---|---|---|
| CMS Nursing Home | Medicare/Medicaid facilities | 🟢 stable | |
| FEC Contributions | Federal campaign donations | 🟢 stable | Optional |
| USASpending | Federal grants/contracts | 🟢 stable | |
| SBA PPP | Pandemic loans | 🟢 stable | |
| OIG LEIE | Medicare/Medicaid exclusions | 🟢 stable | |
| SAM.gov | Federal debarments | 🟢 stable | Required |
| ProPublica 990 | Nonprofit tax filings | 🟢 stable | |
| State | Source | Type | Stability |
|---|---|---|---|
| MN | DEED | State loans/grants | 🟢 stable |
| MN | CFB | Campaign contributions | 🟢 stable |
| CO | TRACER | Campaign contributions | 🟢 stable |
| IL | SBOE | Campaign contributions | 🟢 stable |
| VA | DOE | Campaign contributions | 🟢 stable |
| AL | FCPA | Campaign contributions | 🟢 stable |
| WI | CFIS | Campaign contributions | 🟢 stable |
| MI | BOE | Campaign contributions | 🟢 stable |
Note on campaign finance fetchers: These require manual CSV download from each state's website. See the fetcher docstrings for download instructions.
When to use Playwright: If a site uses JavaScript-heavy frameworks (DevExpress, DataTables with AJAX), bot protection (Radware, Cloudflare), or CAPTCHAs, you'll likely need Playwright. Try HTTP first; it's faster and simpler.
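One way to keep the HTTP-first habit honest is a small heuristic that decides when to fall back to a browser. The marker strings and size threshold below are illustrative guesses, not Canary's actual logic:

```python
def needs_browser(status: int, html: str) -> bool:
    """True when a plain HTTP fetch looks blocked and a
    Playwright-driven browser is probably required."""
    if status != 200:
        return True
    body = html.lower()
    # Bot-protection pages often return 200 with a JS challenge shell
    if "enable javascript" in body or "captcha" in body:
        return True
    return len(html) < 500  # suspiciously small for a listing page

print(needs_browser(200, "<html>" + "provider row " * 100 + "</html>"))  # False
print(needs_browser(200, "<html>Please enable JavaScript</html>"))       # True
```

If the heuristic fires, rerun the fetch through Playwright; otherwise keep the cheap HTTP path.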
Canary uses pytest with a focus on:
- Unit tests: Analysis rules, scoring, normalization
- Integration tests: Database operations, import/export pipelines
- Mocked scrapers: HTTP responses recorded for reproducibility
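The mocked-scraper idea in miniature: patch the network call with a recorded response so parsing is tested deterministically. The fixture HTML and function names below are illustrative, not Canary's real test suite:

```python
import re
from unittest.mock import patch

RECORDED_HTML = ('<td class="name">Sunny Days Care</td>'
                 '<td class="name">Oak Hill Elder Home</td>')

def fetch_listing(url):
    raise RuntimeError("network disabled in tests")  # real code does an HTTP GET

def parse_listing(html):
    return re.findall(r'<td class="name">([^<]+)</td>', html)

def scrape(url):
    return parse_listing(fetch_listing(url))

def test_scrape_uses_recorded_response():
    # Patch the fetch in this module so no real request is made
    with patch(f"{__name__}.fetch_listing", return_value=RECORDED_HTML):
        assert scrape("https://example.gov/providers") == [
            "Sunny Days Care", "Oak Hill Elder Home"]
```

Because the recorded response is checked in, the test stays green even when the live state site is slow, changed, or offline.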
Run the full suite:
```shell
pytest
```

Run specific tests:

```shell
pytest tests/test_analysis/   # Just analysis tests
pytest -k "test_address"      # Tests matching pattern
```

| Document | Description |
|---|---|
| CONTRIBUTING.md | How to add scrapers, fetchers, and tests |
| SECURITY.md | Security policy and operational security guidance |
| CODE_OF_CONDUCT.md | Community standards and responsible use guidelines |
| PRD.md | Product requirements, user stories, success metrics |
| DesignDoc.md | Technical architecture, data model, APIs |
Contributions are welcome! See CONTRIBUTING.md for detailed instructions.
- State scrapers: Each new state multiplies Canary's reach
- Detection rules: Novel anomaly patterns
- Documentation: Tutorials, troubleshooting guides
- Testing: Edge cases, integration tests
```shell
# Copy the template
cp docs/templates/scraper_template.py src/scrapers/your_state.py

# Edit the file with your scraper logic
# The @register_scraper decorator handles registration automatically

# Test your scraper
python -c "from src.scrapers.your_state import YourStateScraper; print(YourStateScraper().source_name)"

# Run tests
pytest tests/test_scrapers/
```

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/new-scraper`)
3. Write tests for your changes
4. Ensure all tests pass (`pytest`)
5. Submit a pull request
Please read CONTRIBUTING.md for code style and patterns.
"Module not found" errors
```shell
# Ensure you're in the virtual environment
source .venv/bin/activate
pip install -r requirements.txt
```

Scraper returns no data
- Check your internet connection
- The source site may have changed; open an issue
- For Illinois/Minnesota: Ensure Playwright browsers are installed (`playwright install`)
Playwright browser errors
```shell
# If you see "Executable doesn't exist" errors:
playwright install

# This downloads Chromium and other browsers (~400MB)
# Required for: Illinois (full mode), Minnesota
```

Map doesn't display
- Geocoding may have failed; check the provider's coordinates
- Try a different browser if tiles don't load
Database locked errors
- Close other applications that may have the database open
- Ensure only one instance of Canary is running
- Open an issue for bugs or feature requests
- Check existing issues before creating new ones
Canary analyzes publicly available data only. All outputs are preliminary findings that require human verification before any action.
- Flagged items indicate statistical anomalies, not wrongdoing
- Users are responsible for verifying findings through appropriate channels
- This tool makes no accusations and should not be used to harass individuals
- Consult legal counsel before publishing or acting on findings
Use responsibly.
MIT License β free to use, modify, and distribute.
See LICENSE for full text.
Built for transparency and accountability.
Inspired by the work of investigative journalists and government auditors who protect public resources.
- NICAR (National Institute for Computer-Assisted Reporting) - Data journalism training
- DocumentCloud - Document analysis platform
- OpenSecrets - Campaign finance data
- ProPublica Nonprofit Explorer - 990 tax filings
If Canary helps your work, consider starring the repository to help others find it.
To optimize GitHub discoverability, configure these settings:
Repository Description:
Local-first anomaly detection for childcare and eldercare public records. Helps journalists, auditors, and researchers spot patterns in licensing data.
Topics (Tags):
fraud-detection investigative-journalism public-records data-journalism childcare eldercare anomaly-detection government-accountability osint streamlit python civic-tech
Social Preview: Upload a 1280x640 image showing the dashboard or map view.