Canary

License: MIT · Python 3.11+ · Streamlit · Code style: ruff · Type checked: mypy

Local-first anomaly detection for public welfare records.

Canary helps journalists, auditors, researchers, and concerned citizens analyze publicly available government data to spot patterns that may warrant further investigation: address clustering, payment outliers, ownership networks, campaign contribution timing, and more.

Like a canary in a coal mine, this tool provides early warnings, not conclusions.

All flagged items are potential anomalies requiring human review. This tool analyzes public data only and makes no accusations.

Keywords: fraud detection, childcare oversight, eldercare licensing, public records analysis, investigative journalism, government accountability, welfare program audit, anomaly detection, data journalism, OSINT


Why Canary?

Public welfare programs disburse billions annually. The data to oversee them (licensing records, payment reports, violation histories) is often publicly available but scattered across state portals in formats that require technical skills to analyze.

Canary changes that:

  • No coding required: Point-and-click interface built on Streamlit
  • 100% local: All data stays on your machine. No cloud. No telemetry. No accounts.
  • Privacy-first: Share findings by exporting your database file; you control what leaves your computer
  • Extensible: Add scrapers for new states, customize detection rules, integrate your own AI

Use Cases

  • Investigative journalists cross-referencing childcare provider licenses with campaign contributions
  • Government auditors identifying unusual patterns in welfare provider networks
  • Academic researchers studying public program administration and oversight
  • Civic watchdogs monitoring local childcare and eldercare facility licensing
  • Newsroom data teams building accountability stories from public records

Features

| Feature | Description |
| --- | --- |
| Data Import | Upload CSV/Excel files or trigger built-in scrapers for supported states |
| Anomaly Detection | Rule-based flagging for address clustering, payment/capacity ratios, shared phones/owners |
| Risk Scoring | Configurable weighted scoring with enable/disable and custom weights per rule |
| Map Visualization | Geographic view with color-coded risk markers |
| Evidence Collection | Attach photos, documents, and notes to providers |
| Report Generation | Export PDF/HTML reports with AI-generated narratives |
| AI Assistant | Natural language queries, risk explanations, pattern discovery, investigation narratives |
| AI Integration | BYOK (bring your own key) support for OpenAI, Anthropic, Google Gemini, xAI Grok, Ollama, and LM Studio |
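The configurable weighted scoring above can be sketched roughly as follows. This is an illustration only; the rule names, weights, and data shapes are assumptions for demonstration, not Canary's actual configuration (see src/analysis/ for the real implementation).

```python
# Sketch of weighted risk scoring: each rule has a weight and can be
# enabled or disabled; a provider's score is the sum of weights for the
# enabled rules it triggered. Names and weights here are illustrative.
from dataclasses import dataclass, field


@dataclass
class RuleConfig:
    weight: float
    enabled: bool = True


@dataclass
class ScoringConfig:
    rules: dict[str, RuleConfig] = field(default_factory=dict)


def risk_score(flags: set[str], config: ScoringConfig) -> float:
    """Sum the weights of every enabled rule the provider triggered."""
    return sum(
        cfg.weight
        for name, cfg in config.rules.items()
        if cfg.enabled and name in flags
    )


config = ScoringConfig(rules={
    "address_cluster": RuleConfig(weight=3.0),
    "shared_phone": RuleConfig(weight=2.0),
    "payment_ratio": RuleConfig(weight=4.0, enabled=False),  # disabled rules are ignored
})
print(risk_score({"address_cluster", "payment_ratio"}, config))  # 3.0
```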

Quick Start

Prerequisites

  • Python 3.11 or higher
  • Git (optional, for cloning)

Supported States

Canary includes built-in data scrapers for childcare and eldercare licensing in:

Childcare: Alabama, Colorado, Illinois, Michigan, Minnesota, Virginia, Wisconsin

Eldercare: Alabama, Colorado, Illinois, Michigan, Minnesota, Virginia, Wisconsin

See Available Scrapers for detailed source information.

Installation

# Clone the repository
git clone https://github.com/canary-audit/canary.git
cd canary

# Create a virtual environment (recommended)
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Optional: Install Playwright browsers for full scraper support
playwright install

Note on Playwright: Some state websites (Illinois, Minnesota) use bot protection that requires browser automation. Without Playwright browsers installed, these scrapers will run in limited mode or return no data. Other scrapers (Colorado, Alabama, Virginia) work without Playwright.

Run

streamlit run Home.py

This starts a local server and opens http://localhost:8501 in your browser.

First Steps

  1. Import data: Go to "Data Import" and either:

    • Upload a CSV/Excel file from a state portal
    • Run a built-in scraper for a supported state
  2. Review anomalies: The Dashboard shows top-scoring providers and recent clusters

  3. Explore the map: See geographic patterns at a glance

  4. Investigate: Click any provider to see details, add evidence, and generate reports


Project Structure

canary/
├── Home.py             # Application entry point
├── pages/              # Streamlit pages
│   ├── 1_Dashboard.py      # Summary metrics and top anomalies
│   ├── 2_Data_Import.py    # Upload files and run scrapers
│   ├── 3_Providers.py      # Browse and search providers (with AI explanations)
│   ├── 4_Map.py            # Geographic visualization
│   ├── 5_Cross_References.py # Link external data sources
│   ├── 6_Network.py        # Relationship analysis
│   ├── 7_Timeline.py       # Temporal analysis
│   ├── 8_Evidence.py       # Attach documents and notes
│   ├── 9_Reports.py        # Generate reports with AI narratives
│   ├── 10_Settings.py      # Configure thresholds, AI, and rule management
│   └── 11_AI_Assistant.py  # Natural language queries and pattern discovery
├── src/
│   ├── analysis/       # Detection rules, scoring, and custom rules
│   ├── scrapers/       # State-specific data fetchers
│   ├── fetchers/       # External data fetchers (FEC, PPP, etc.)
│   ├── importers/      # CSV/Excel/PDF handlers
│   ├── exporters/      # Report and data export
│   ├── models/         # Data models
│   ├── ai/             # AI providers, NL queries, reasoning, patterns, narratives
│   ├── ui/             # Theme and UI components
│   └── utils/          # Shared utilities (geocoding, etc.)
├── tests/              # Test suite
├── data/               # Local database and evidence (gitignored)
└── .streamlit/         # Streamlit configuration

Configuration

Detection Thresholds

Adjust sensitivity in Settings:

| Rule | Default | Description |
| --- | --- | --- |
| Address clustering | 3+ providers | Flag addresses with multiple providers |
| Phone clustering | 2+ providers | Flag shared phone numbers |
| Owner clustering | 5+ facilities | Flag owners with many facilities |
| Payment/capacity ratio | $15K/slot/year | Flag unusually high payments |
| Missing address | Any | Flag providers with no street address on file |
| Political donations | $1,000+ | Flag provider owners with campaign contributions |
| Temporal correlation | 60 days | Flag donations shortly before license events |
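As a rough illustration of how a clustering rule like the address threshold above might work (a sketch only; the record shape and field names are assumptions, not Canary's actual schema):

```python
# Sketch of the address-clustering rule: flag any address shared by at
# least `threshold` providers (default 3+, matching the table above).
from collections import Counter


def flag_address_clusters(providers: list[dict], threshold: int = 3) -> set[str]:
    """Return the set of addresses shared by `threshold` or more providers."""
    counts = Counter(p["address"] for p in providers if p.get("address"))
    return {addr for addr, n in counts.items() if n >= threshold}


providers = [
    {"name": "A", "address": "1 Main St"},
    {"name": "B", "address": "1 Main St"},
    {"name": "C", "address": "1 Main St"},
    {"name": "D", "address": "9 Oak Ave"},
]
print(flag_address_clusters(providers))  # {'1 Main St'}
```

Raising the threshold in Settings simply raises `threshold` here, trading sensitivity for fewer false positives.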

AI Integration (Optional)

Canary supports bringing your own API keys for AI-powered features:

  1. Go to Settings → AI Configuration
  2. Select your provider and model (cheapest models selected by default)
  3. Enter your API key (stored securely in your system keyring)

Supported Providers

| Provider | Default Model | Local/Cloud | Notes |
| --- | --- | --- | --- |
| OpenAI | gpt-4o-mini | Cloud | Cheapest at $0.15/1M tokens |
| Anthropic | claude-3-haiku | Cloud | Cheapest at $0.25/1M tokens |
| Google Gemini | gemini-2.0-flash | Cloud | Cheapest at $0.10/1M tokens |
| xAI Grok | grok-4-1-fast | Cloud | Cheapest at $0.20/1M tokens |
| Ollama | llama3.2 | Local | Free, privacy-preserving |
| LM Studio | (loaded model) | Local | Free, OpenAI-compatible API |

AI Features

| Feature | Location | Description |
| --- | --- | --- |
| Natural Language Query | AI Assistant | Ask questions in plain English ("Show me Illinois providers with donations over $500") |
| Risk Explanations | Providers page | AI explains why a specific provider was flagged |
| Pattern Discovery | AI Assistant | AI analyzes data to suggest new detection rules |
| Investigation Narratives | Reports page | AI generates journalist-ready writeups |

Privacy Note: Cloud AI providers receive your query text. For sensitive investigations, use local providers (Ollama or LM Studio), which process everything on your machine.


Development

Setup

# Install dev dependencies
pip install -r requirements-dev.txt

# Install pre-commit hooks
pre-commit install

Commands

# Run the app
streamlit run Home.py

# Run tests
pytest

# Run tests with coverage
pytest --cov=src --cov-report=html

# Type checking
mypy src/

# Linting
ruff check .

# Formatting
ruff format .

Adding a Scraper

Canary uses auto-discovery: drop a new scraper file in src/scrapers/ with the @register_scraper decorator and it works automatically. No registry edits needed.

# Quick start
cp docs/templates/scraper_template.py src/scrapers/your_state.py
# Edit the file, run the app, and your scraper appears in the UI

See CONTRIBUTING.md for step-by-step instructions, patterns for different data sources (API, HTML, Playwright), and testing templates.
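A minimal sketch of what a registered scraper might look like. The @register_scraper decorator and source_name attribute come from the text above, but the exact base class and signatures are assumptions; a stand-in registry is defined here so the sketch runs standalone.

```python
# Stand-in registry so this sketch is self-contained; in Canary the real
# decorator lives in src/scrapers/ and auto-discovery picks the file up.
SCRAPERS: dict[str, type] = {}


def register_scraper(cls: type) -> type:
    """Record the scraper class under its source name and return it unchanged."""
    SCRAPERS[cls.source_name] = cls
    return cls


@register_scraper
class ExampleStateScraper:  # hypothetical scraper for a new state
    source_name = "EX Example Licensing"

    def fetch(self) -> list[dict]:
        # A real scraper would request the state portal here and normalize
        # each record to Canary's provider schema.
        return [{"name": "Example Provider", "address": "1 Main St"}]


print("EX Example Licensing" in SCRAPERS)  # True
```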

Available Scrapers

| State | Source | Stability | Records | Playwright |
| --- | --- | --- | --- | --- |
| CO | CDEC Socrata API | 🟢 stable | ~6,000 | |
| CO | CDPHE ArcGIS | 🟢 stable | ~1,500 | |
| AL | DHR ASP.NET | 🟢 stable | ~3,000 | |
| AL | ADPH Directory | 🟡 beta | ~1,000 | ✓ |
| VA | DSS Facility Search | 🟢 stable | ~8,000 | |
| VA | DSS ALF Search | 🟢 stable | ~600 | |
| IL | DCFS Sunshine | 🟢 stable | ~8,600 | ✓ |
| IL | IDPH Socrata | 🟢 stable | ~2,000 | |
| MN | DHS Licensing | 🔴 experimental | ~10,000 | ✓ |
| MN | MDH Directory | 🟢 stable | ~5,000 | |
| WI | DCF Child Care | 🟢 stable | ~10,000 | |
| WI | DHS Eldercare | 🔴 experimental | ~3,000 | ✓ |
| MI | LARA Childcare | 🟢 stable | ~10,000 | |
| MI | LARA Eldercare | 🟢 stable | ~5,000 | |

Available Fetchers

Federal Data Sources

| Source | Type | Stability | API Key |
| --- | --- | --- | --- |
| CMS Nursing Home | Medicare/Medicaid facilities | 🟢 stable | |
| FEC Contributions | Federal campaign donations | 🟢 stable | Optional |
| USASpending | Federal grants/contracts | 🟢 stable | |
| SBA PPP | Pandemic loans | 🟢 stable | |
| OIG LEIE | Medicare/Medicaid exclusions | 🟢 stable | |
| SAM.gov | Federal debarments | 🟢 stable | Required |
| ProPublica 990 | Nonprofit tax filings | 🟢 stable | |

State Data Sources

| State | Source | Type | Stability |
| --- | --- | --- | --- |
| MN | DEED | State loans/grants | 🟢 stable |
| MN | CFB | Campaign contributions | 🟢 stable |
| CO | TRACER | Campaign contributions | 🟢 stable |
| IL | SBOE | Campaign contributions | 🟢 stable |
| VA | DOE | Campaign contributions | 🟢 stable |
| AL | FCPA | Campaign contributions | 🟢 stable |
| WI | CFIS | Campaign contributions | 🟢 stable |
| MI | BOE | Campaign contributions | 🟢 stable |

Note on campaign finance fetchers: These require manual CSV download from each state's website. See the fetcher docstrings for download instructions.
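Once a campaign-finance CSV has been downloaded, ingesting it can look roughly like this sketch using only the standard library. The column names here are hypothetical; each state's export differs (see the fetcher docstrings for the actual layouts).

```python
# Sketch of loading a manually downloaded campaign-finance CSV and
# normalizing amounts for threshold checks. Column names are hypothetical.
import csv
import io

sample = """contributor,recipient,amount,date
Jane Doe,Cmte for Smith,500.00,2024-03-01
Acme Childcare LLC,Cmte for Smith,1500.00,2024-03-15
"""


def load_contributions(fileobj) -> list[dict]:
    """Parse rows and coerce the amount column to float."""
    rows = []
    for row in csv.DictReader(fileobj):
        row["amount"] = float(row["amount"])  # needed for numeric threshold rules
        rows.append(row)
    return rows


rows = load_contributions(io.StringIO(sample))
# Contributors at or above a $1,000 threshold:
print([r["contributor"] for r in rows if r["amount"] >= 1000])  # ['Acme Childcare LLC']
```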

When to use Playwright: If a site uses JavaScript-heavy frameworks (DevExpress, DataTables with AJAX), bot protection (Radware, Cloudflare), or CAPTCHAs, you'll likely need Playwright. Try HTTP first; it's faster and simpler.


Testing

Canary uses pytest with a focus on:

  • Unit tests: Analysis rules, scoring, normalization
  • Integration tests: Database operations, import/export pipelines
  • Mocked scrapers: HTTP responses recorded for reproducibility

Run the full suite:

pytest

Run specific tests:

pytest tests/test_analysis/  # Just analysis tests
pytest -k "test_address"     # Tests matching pattern

Documentation

| Document | Description |
| --- | --- |
| CONTRIBUTING.md | How to add scrapers, fetchers, and tests |
| SECURITY.md | Security policy and operational security guidance |
| CODE_OF_CONDUCT.md | Community standards and responsible use guidelines |
| PRD.md | Product requirements, user stories, success metrics |
| DesignDoc.md | Technical architecture, data model, APIs |

Contributing

Contributions are welcome! See CONTRIBUTING.md for detailed instructions.

Priority Areas

  1. State scrapers: Each new state multiplies Canary's reach
  2. Detection rules: Novel anomaly patterns
  3. Documentation: Tutorials, troubleshooting guides
  4. Testing: Edge cases, integration tests

Quick Start for Contributors

# Copy the template
cp docs/templates/scraper_template.py src/scrapers/your_state.py

# Edit the file with your scraper logic
# The @register_scraper decorator handles registration automatically

# Test your scraper
python -c "from src.scrapers.your_state import YourStateScraper; print(YourStateScraper().source_name)"

# Run tests
pytest tests/test_scrapers/

Process

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/new-scraper)
  3. Write tests for your changes
  4. Ensure all tests pass (pytest)
  5. Submit a pull request

Please read CONTRIBUTING.md for code style and patterns.


Troubleshooting

Common Issues

"Module not found" errors

# Ensure you're in the virtual environment
source .venv/bin/activate
pip install -r requirements.txt

Scraper returns no data

  • Check your internet connection
  • The source site may have changed; open an issue
  • For Illinois/Minnesota: Ensure Playwright browsers are installed (playwright install)

Playwright browser errors

# If you see "Executable doesn't exist" errors:
playwright install

# This downloads Chromium and other browsers (~400MB)
# Required for: Illinois (full mode), Minnesota

Map doesn't display

  • Geocoding may have failed; check the provider's coordinates
  • Try a different browser if tiles don't load

Database locked errors

  • Close other applications that may have the database open
  • Ensure only one instance of Canary is running

Getting Help

  • Open an issue for bugs or feature requests
  • Check existing issues before creating new ones

Legal Disclaimer

Canary analyzes publicly available data only. All outputs are preliminary findings that require human verification before any action.

  • Flagged items indicate statistical anomalies, not wrongdoing
  • Users are responsible for verifying findings through appropriate channels
  • This tool makes no accusations and should not be used to harass individuals
  • Consult legal counsel before publishing or acting on findings

Use responsibly.


License

MIT License: free to use, modify, and distribute.

See LICENSE for full text.


Acknowledgments

Built for transparency and accountability.

Inspired by the work of investigative journalists and government auditors who protect public resources.


Star History

If Canary helps your work, consider starring the repository to help others find it.


Repository Setup (for maintainers)

To optimize GitHub discoverability, configure these settings:

Repository Description:

Local-first anomaly detection for childcare and eldercare public records. Helps journalists, auditors, and researchers spot patterns in licensing data.

Topics (Tags): fraud-detection investigative-journalism public-records data-journalism childcare eldercare anomaly-detection government-accountability osint streamlit python civic-tech

Social Preview: Upload a 1280x640 image showing the dashboard or map view.
