Canary

License: MIT · Python 3.11+ · Streamlit · Code style: ruff · Type checked: mypy

Local-first anomaly detection for public welfare records.

Canary helps journalists, auditors, researchers, and concerned citizens analyze publicly available government data to spot patterns that may warrant further investigation: address clustering, payment outliers, ownership networks, campaign contribution timing, and more.

Like a canary in a coal mine, this tool provides early warnings, not conclusions.

All flagged items are potential anomalies requiring human review. This tool analyzes public data only and makes no accusations.

Keywords: fraud detection, childcare oversight, eldercare licensing, public records analysis, investigative journalism, government accountability, welfare program audit, anomaly detection, data journalism, OSINT


Why Canary?

Public welfare programs disburse billions annually. The data to oversee them (licensing records, payment reports, violation histories) is often publicly available but scattered across state portals in formats that require technical skills to analyze.

Canary changes that:

  • No coding required: Point-and-click interface built on Streamlit
  • 100% local: All data stays on your machine. No cloud. No telemetry. No accounts.
  • Privacy-first: Share findings by exporting your database file; you control what leaves your computer
  • Extensible: Add scrapers for new states, customize detection rules, integrate your own AI

Use Cases

  • Investigative journalists cross-referencing childcare provider licenses with campaign contributions
  • Government auditors identifying unusual patterns in welfare provider networks
  • Academic researchers studying public program administration and oversight
  • Civic watchdogs monitoring local childcare and eldercare facility licensing
  • Newsroom data teams building accountability stories from public records

Features

| Feature | Description |
| --- | --- |
| Data Import | Upload CSV/Excel files or trigger built-in scrapers for supported states |
| Anomaly Detection | Rule-based flagging for address clustering, payment/capacity ratios, shared phones/owners |
| Risk Scoring | Configurable weighted scoring with enable/disable and custom weights per rule |
| Map Visualization | Geographic view with color-coded risk markers |
| Evidence Collection | Attach photos, documents, and notes to providers |
| Report Generation | Export PDF/HTML reports with AI-generated narratives |
| AI Assistant | Natural language queries, risk explanations, pattern discovery, investigation narratives |
| AI Integration | BYOK (bring your own key) support for OpenAI, Anthropic, Google Gemini, xAI Grok, Ollama, and LM Studio |
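The configurable weighted scoring above can be sketched roughly as follows. This is an illustration only; the rule names, weights, and data shapes are assumptions for demonstration, not Canary's actual configuration (see src/analysis/ for the real implementation).

```python
# Sketch of weighted risk scoring: each rule has a weight and can be
# enabled or disabled; a provider's score is the sum of weights for the
# enabled rules it triggered. Names and weights here are illustrative.
from dataclasses import dataclass, field


@dataclass
class RuleConfig:
    weight: float
    enabled: bool = True


@dataclass
class ScoringConfig:
    rules: dict[str, RuleConfig] = field(default_factory=dict)


def risk_score(flags: set[str], config: ScoringConfig) -> float:
    """Sum the weights of every enabled rule the provider triggered."""
    return sum(
        cfg.weight
        for name, cfg in config.rules.items()
        if cfg.enabled and name in flags
    )


config = ScoringConfig(rules={
    "address_cluster": RuleConfig(weight=3.0),
    "shared_phone": RuleConfig(weight=2.0),
    "payment_ratio": RuleConfig(weight=4.0, enabled=False),  # disabled rules are ignored
})
print(risk_score({"address_cluster", "payment_ratio"}, config))  # 3.0
```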

Quick Start

Prerequisites

  • Python 3.11 or higher
  • Git (optional, for cloning)

Supported States

Canary includes built-in data scrapers for childcare and eldercare licensing in:

Childcare: Alabama, Colorado, Illinois, Michigan, Minnesota, Virginia, Wisconsin

Eldercare: Alabama, Colorado, Illinois, Michigan, Minnesota, Virginia, Wisconsin

See Available Scrapers for detailed source information.

Installation

# Clone the repository
git clone https://github.com/canary-audit/canary.git
cd canary

# Create a virtual environment (recommended)
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Optional: Install Playwright browsers for full scraper support
playwright install

Note on Playwright: Some state websites (Illinois, Minnesota) use bot protection that requires browser automation. Without Playwright browsers installed, these scrapers will run in limited mode or return no data. Other scrapers (Colorado, Alabama, Virginia) work without Playwright.

Run

streamlit run Home.py

This starts a local server and opens http://localhost:8501 in your browser.

First Steps

  1. Import data: Go to "Data Import" and either:

    • Upload a CSV/Excel file from a state portal
    • Run a built-in scraper for a supported state
  2. Review anomalies: The Dashboard shows top-scoring providers and recent clusters

  3. Explore the map: See geographic patterns at a glance

  4. Investigate: Click any provider to see details, add evidence, and generate reports


Project Structure

canary/
├── Home.py             # Application entry point
├── pages/              # Streamlit pages
│   ├── 1_Dashboard.py      # Summary metrics and top anomalies
│   ├── 2_Data_Import.py    # Upload files and run scrapers
│   ├── 3_Providers.py      # Browse and search providers (with AI explanations)
│   ├── 4_Map.py            # Geographic visualization
│   ├── 5_Cross_References.py # Link external data sources
│   ├── 6_Network.py        # Relationship analysis
│   ├── 7_Timeline.py       # Temporal analysis
│   ├── 8_Evidence.py       # Attach documents and notes
│   ├── 9_Reports.py        # Generate reports with AI narratives
│   ├── 10_Settings.py      # Configure thresholds, AI, and rule management
│   └── 11_AI_Assistant.py  # Natural language queries and pattern discovery
├── src/
│   ├── analysis/       # Detection rules, scoring, and custom rules
│   ├── scrapers/       # State-specific data fetchers
│   ├── fetchers/       # External data fetchers (FEC, PPP, etc.)
│   ├── importers/      # CSV/Excel/PDF handlers
│   ├── exporters/      # Report and data export
│   ├── models/         # Data models
│   ├── ai/             # AI providers, NL queries, reasoning, patterns, narratives
│   ├── ui/             # Theme and UI components
│   └── utils/          # Shared utilities (geocoding, etc.)
├── tests/              # Test suite
├── data/               # Local database and evidence (gitignored)
└── .streamlit/         # Streamlit configuration

Configuration

Detection Thresholds

Adjust sensitivity in Settings:

| Rule | Default | Description |
| --- | --- | --- |
| Address clustering | 3+ providers | Flag addresses with multiple providers |
| Phone clustering | 2+ providers | Flag shared phone numbers |
| Owner clustering | 5+ facilities | Flag owners with many facilities |
| Payment/capacity ratio | $15K/slot/year | Flag unusually high payments |
| Missing address | Any | Flag providers with no street address on file |
| Political donations | $1,000+ | Flag provider owners with campaign contributions |
| Temporal correlation | 60 days | Flag donations shortly before license events |
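As a rough illustration of how a clustering rule like the address threshold above might work (a sketch only; the record shape and field names are assumptions, not Canary's actual schema):

```python
# Sketch of the address-clustering rule: flag any address shared by at
# least `threshold` providers (default 3+, matching the table above).
from collections import Counter


def flag_address_clusters(providers: list[dict], threshold: int = 3) -> set[str]:
    """Return the set of addresses shared by `threshold` or more providers."""
    counts = Counter(p["address"] for p in providers if p.get("address"))
    return {addr for addr, n in counts.items() if n >= threshold}


providers = [
    {"name": "A", "address": "1 Main St"},
    {"name": "B", "address": "1 Main St"},
    {"name": "C", "address": "1 Main St"},
    {"name": "D", "address": "9 Oak Ave"},
]
print(flag_address_clusters(providers))  # {'1 Main St'}
```

Raising the threshold in Settings simply raises `threshold` here, trading sensitivity for fewer false positives.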

AI Integration (Optional)

Canary supports bringing your own API keys for AI-powered features:

  1. Go to Settings → AI Configuration
  2. Select your provider and model (cheapest models selected by default)
  3. Enter your API key (stored securely in your system keyring)

Supported Providers

| Provider | Default Model | Local/Cloud | Notes |
| --- | --- | --- | --- |
| OpenAI | gpt-4o-mini | Cloud | Cheapest at $0.15/1M tokens |
| Anthropic | claude-3-haiku | Cloud | Cheapest at $0.25/1M tokens |
| Google Gemini | gemini-2.0-flash | Cloud | Cheapest at $0.10/1M tokens |
| xAI Grok | grok-4-1-fast | Cloud | Cheapest at $0.20/1M tokens |
| Ollama | llama3.2 | Local | Free, privacy-preserving |
| LM Studio | (loaded model) | Local | Free, OpenAI-compatible API |

AI Features

| Feature | Location | Description |
| --- | --- | --- |
| Natural Language Query | AI Assistant | Ask questions in plain English ("Show me Illinois providers with donations over $500") |
| Risk Explanations | Providers page | AI explains why a specific provider was flagged |
| Pattern Discovery | AI Assistant | AI analyzes data to suggest new detection rules |
| Investigation Narratives | Reports page | AI generates journalist-ready writeups |

Privacy Note: Cloud AI providers receive your query text. For sensitive investigations, use local providers (Ollama or LM Studio), which process everything on your machine.


Development

Setup

# Install dev dependencies
pip install -r requirements-dev.txt

# Install pre-commit hooks
pre-commit install

Commands

# Run the app
streamlit run Home.py

# Run tests
pytest

# Run tests with coverage
pytest --cov=src --cov-report=html

# Type checking
mypy src/

# Linting
ruff check .

# Formatting
ruff format .

Adding a Scraper

Canary uses auto-discovery: drop a new scraper file in src/scrapers/ with the @register_scraper decorator and it works automatically. No registry edits needed.

# Quick start
cp docs/templates/scraper_template.py src/scrapers/your_state.py
# Edit the file, run the app, and your scraper appears in the UI

See CONTRIBUTING.md for step-by-step instructions, patterns for different data sources (API, HTML, Playwright), and testing templates.
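A minimal sketch of what a registered scraper might look like. The @register_scraper decorator and source_name attribute come from the text above, but the exact base class and signatures are assumptions; a stand-in registry is defined here so the sketch runs standalone.

```python
# Stand-in registry so this sketch is self-contained; in Canary the real
# decorator lives in src/scrapers/ and auto-discovery picks the file up.
SCRAPERS: dict[str, type] = {}


def register_scraper(cls: type) -> type:
    """Record the scraper class under its source name and return it unchanged."""
    SCRAPERS[cls.source_name] = cls
    return cls


@register_scraper
class ExampleStateScraper:  # hypothetical scraper for a new state
    source_name = "EX Example Licensing"

    def fetch(self) -> list[dict]:
        # A real scraper would request the state portal here and normalize
        # each record to Canary's provider schema.
        return [{"name": "Example Provider", "address": "1 Main St"}]


print("EX Example Licensing" in SCRAPERS)  # True
```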

Available Scrapers

| State | Source | Stability | Records | Playwright |
| --- | --- | --- | --- | --- |
| CO | CDEC Socrata API | 🟢 stable | ~6,000 | |
| CO | CDPHE ArcGIS | 🟢 stable | ~1,500 | |
| AL | DHR ASP.NET | 🟢 stable | ~3,000 | |
| AL | ADPH Directory | 🟡 beta | ~1,000 | ✓ |
| VA | DSS Facility Search | 🟢 stable | ~8,000 | |
| VA | DSS ALF Search | 🟢 stable | ~600 | |
| IL | DCFS Sunshine | 🟢 stable | ~8,600 | ✓ |
| IL | IDPH Socrata | 🟢 stable | ~2,000 | |
| MN | DHS Licensing | 🔴 experimental | ~10,000 | ✓ |
| MN | MDH Directory | 🟢 stable | ~5,000 | |
| WI | DCF Child Care | 🟢 stable | ~10,000 | |
| WI | DHS Eldercare | 🔴 experimental | ~3,000 | ✓ |
| MI | LARA Childcare | 🟢 stable | ~10,000 | |
| MI | LARA Eldercare | 🟢 stable | ~5,000 | |

Available Fetchers

Federal Data Sources

| Source | Type | Stability | API Key |
| --- | --- | --- | --- |
| CMS Nursing Home | Medicare/Medicaid facilities | 🟢 stable | |
| FEC Contributions | Federal campaign donations | 🟢 stable | Optional |
| USASpending | Federal grants/contracts | 🟢 stable | |
| SBA PPP | Pandemic loans | 🟢 stable | |
| OIG LEIE | Medicare/Medicaid exclusions | 🟢 stable | |
| SAM.gov | Federal debarments | 🟢 stable | Required |
| ProPublica 990 | Nonprofit tax filings | 🟢 stable | |

State Data Sources

| State | Source | Type | Stability |
| --- | --- | --- | --- |
| MN | DEED | State loans/grants | 🟢 stable |
| MN | CFB | Campaign contributions | 🟢 stable |
| CO | TRACER | Campaign contributions | 🟢 stable |
| IL | SBOE | Campaign contributions | 🟢 stable |
| VA | DOE | Campaign contributions | 🟢 stable |
| AL | FCPA | Campaign contributions | 🟢 stable |
| WI | CFIS | Campaign contributions | 🟢 stable |
| MI | BOE | Campaign contributions | 🟢 stable |

Note on campaign finance fetchers: These require manual CSV download from each state's website. See the fetcher docstrings for download instructions.
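Once a campaign-finance CSV has been downloaded, ingesting it can look roughly like this sketch using only the standard library. The column names here are hypothetical; each state's export differs (see the fetcher docstrings for the actual layouts).

```python
# Sketch of loading a manually downloaded campaign-finance CSV and
# normalizing amounts for threshold checks. Column names are hypothetical.
import csv
import io

sample = """contributor,recipient,amount,date
Jane Doe,Cmte for Smith,500.00,2024-03-01
Acme Childcare LLC,Cmte for Smith,1500.00,2024-03-15
"""


def load_contributions(fileobj) -> list[dict]:
    """Parse rows and coerce the amount column to float."""
    rows = []
    for row in csv.DictReader(fileobj):
        row["amount"] = float(row["amount"])  # needed for numeric threshold rules
        rows.append(row)
    return rows


rows = load_contributions(io.StringIO(sample))
# Contributors at or above a $1,000 threshold:
print([r["contributor"] for r in rows if r["amount"] >= 1000])  # ['Acme Childcare LLC']
```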

When to use Playwright: If a site uses JavaScript-heavy frameworks (DevExpress, DataTables with AJAX), bot protection (Radware, Cloudflare), or CAPTCHAs, you'll likely need Playwright. Try HTTP first; it's faster and simpler.


Testing

Canary uses pytest with a focus on:

  • Unit tests: Analysis rules, scoring, normalization
  • Integration tests: Database operations, import/export pipelines
  • Mocked scrapers: HTTP responses recorded for reproducibility

Run the full suite:

pytest

Run specific tests:

pytest tests/test_analysis/  # Just analysis tests
pytest -k "test_address"     # Tests matching pattern

Documentation

| Document | Description |
| --- | --- |
| CONTRIBUTING.md | How to add scrapers, fetchers, and tests |
| SECURITY.md | Security policy and operational security guidance |
| CODE_OF_CONDUCT.md | Community standards and responsible use guidelines |
| PRD.md | Product requirements, user stories, success metrics |
| DesignDoc.md | Technical architecture, data model, APIs |

Contributing

Contributions are welcome! See CONTRIBUTING.md for detailed instructions.

Priority Areas

  1. State scrapers: Each new state multiplies Canary's reach
  2. Detection rules: Novel anomaly patterns
  3. Documentation: Tutorials, troubleshooting guides
  4. Testing: Edge cases, integration tests

Quick Start for Contributors

# Copy the template
cp docs/templates/scraper_template.py src/scrapers/your_state.py

# Edit the file with your scraper logic
# The @register_scraper decorator handles registration automatically

# Test your scraper
python -c "from src.scrapers.your_state import YourStateScraper; print(YourStateScraper().source_name)"

# Run tests
pytest tests/test_scrapers/

Process

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/new-scraper)
  3. Write tests for your changes
  4. Ensure all tests pass (pytest)
  5. Submit a pull request

Please read CONTRIBUTING.md for code style and patterns.


Troubleshooting

Common Issues

"Module not found" errors

# Ensure you're in the virtual environment
source .venv/bin/activate
pip install -r requirements.txt

Scraper returns no data

  • Check your internet connection
  • The source site may have changed; open an issue
  • For Illinois/Minnesota: Ensure Playwright browsers are installed (playwright install)

Playwright browser errors

# If you see "Executable doesn't exist" errors:
playwright install

# This downloads Chromium and other browsers (~400MB)
# Required for: Illinois (full mode), Minnesota

Map doesn't display

  • Geocoding may have failed; check the provider's coordinates
  • Try a different browser if tiles don't load

Database locked errors

  • Close other applications that may have the database open
  • Ensure only one instance of Canary is running

Getting Help

  • Open an issue for bugs or feature requests
  • Check existing issues before creating new ones

Legal Disclaimer

Canary analyzes publicly available data only. All outputs are preliminary findings that require human verification before any action.

  • Flagged items indicate statistical anomalies, not wrongdoing
  • Users are responsible for verifying findings through appropriate channels
  • This tool makes no accusations and should not be used to harass individuals
  • Consult legal counsel before publishing or acting on findings

Use responsibly.


License

MIT License: free to use, modify, and distribute.

See LICENSE for full text.


Acknowledgments

Built for transparency and accountability.

Inspired by the work of investigative journalists and government auditors who protect public resources.


Star History

If Canary helps your work, consider starring the repository to help others find it.


Repository Setup (for maintainers)

To optimize GitHub discoverability, configure these settings:

Repository Description:

Local-first anomaly detection for childcare and eldercare public records. Helps journalists, auditors, and researchers spot patterns in licensing data.

Topics (Tags): fraud-detection investigative-journalism public-records data-journalism childcare eldercare anomaly-detection government-accountability osint streamlit python civic-tech

Social Preview: Upload a 1280x640 image showing the dashboard or map view.
