PubGuard is a lightweight, CPU-optimized document classifier that screens PDF text to determine whether it represents a genuine scientific publication. It rejects junk (flyers, invoices, non-scholarly PDFs) before expensive downstream processing.
It runs in 3.3 ms per document on CPU; no GPU is needed.
| Head | Classes | Accuracy | What it detects |
|---|---|---|---|
| doc_type | 4 | 99.7% | scientific_paper · poster · abstract_only · junk |
| ai_detect | 2 | 83.4% | human · ai_generated |
| toxicity | 2 | 84.7% | clean · toxic |
Each head is a single linear layer stored as a .npz file (8–12 KB). Inference is pure numpy — no torch needed.
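As a rough illustration of what "a single linear layer in pure numpy" amounts to, the sketch below applies a weight matrix and bias to an embedding and takes a softmax. The array names `W` and `b`, the helper `predict_head`, and the random weights are assumptions for illustration only; the actual key names and layout inside PubGuard's .npz files may differ.

```python
import io

import numpy as np


def softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def predict_head(emb: np.ndarray, weights: np.ndarray, bias: np.ndarray, labels: list) -> dict:
    """Apply one linear layer plus softmax and return the top label with its score."""
    probs = softmax(emb @ weights + bias)
    idx = int(np.argmax(probs))
    return {"label": labels[idx], "score": float(probs[idx])}


# Build a toy 2-class head and round-trip it through an in-memory .npz archive.
# "W" and "b" are hypothetical key names, and the weights are random, not trained.
rng = np.random.default_rng(0)
buffer = io.BytesIO()
np.savez(buffer, W=rng.normal(size=(512, 2)), b=np.zeros(2))
buffer.seek(0)
head = np.load(buffer)

labels = ["clean", "toxic"]
emb = rng.normal(size=512)  # stand-in for a 512-dimensional model2vec embedding
result = predict_head(emb, head["W"], head["b"], labels)
print(result)
```

Because each head is only a matrix multiply and a softmax, loading and running it needs nothing beyond numpy.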
pip install git+https://github.com/jimnoneill/pubguard.git

With training dependencies:

pip install "pubguard[train] @ git+https://github.com/jimnoneill/pubguard.git"

Or install locally for development:

git clone https://github.com/jimnoneill/pubguard.git
cd pubguard
pip install -e ".[train]"

from pubguard import PubGuard
guard = PubGuard()
guard.initialize()
verdict = guard.screen("Introduction: We present a novel deep learning approach...")
print(verdict)
# {
# 'doc_type': {'label': 'scientific_paper', 'score': 0.994},
# 'ai_generated': {'label': 'human', 'score': 0.875},
# 'toxicity': {'label': 'clean', 'score': 0.999},
# 'pass': True
# }

import fitz  # PyMuPDF
doc = fitz.open("paper.pdf")
text = " ".join(page.get_text() for page in doc)
doc.close()
verdict = guard.screen(text[:8000])
if verdict["pass"]:
    print("Valid scientific publication: proceed with analysis")
else:
    print(f"Rejected: {verdict['doc_type']['label']}")

verdicts = guard.screen_batch(["text1", "text2", "text3"])

Only scientific_paper passes the gate. Everything else (posters, standalone abstracts, junk) is blocked. The PubVerse pipeline processes publications only.
scientific_paper → ✅ PASS
poster → ❌ BLOCKED (classified, but not a publication)
abstract_only → ❌ BLOCKED
junk → ❌ BLOCKED
AI detection and toxicity are informational by default — reported but not blocking.
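The gate rule above can be sketched as a small function over a verdict dictionary shaped like the example output earlier. The function name `apply_gate` and the `block_ai` / `block_toxic` flags are hypothetical and only illustrate how informational heads could be made blocking; they are not PubGuard's documented configuration options.

```python
def apply_gate(verdict: dict, block_ai: bool = False, block_toxic: bool = False) -> bool:
    """Return True only if the document should pass the gate.

    By default only the doc_type head decides; the two flags are
    hypothetical switches that would make the other heads blocking.
    """
    passed = verdict["doc_type"]["label"] == "scientific_paper"
    if block_ai and verdict["ai_generated"]["label"] == "ai_generated":
        passed = False
    if block_toxic and verdict["toxicity"]["label"] == "toxic":
        passed = False
    return passed


# A paper flagged as AI-generated still passes unless AI detection is made blocking.
flagged_paper = {
    "doc_type": {"label": "scientific_paper", "score": 0.99},
    "ai_generated": {"label": "ai_generated", "score": 0.81},
    "toxicity": {"label": "clean", "score": 0.99},
}
poster = {
    "doc_type": {"label": "poster", "score": 0.97},
    "ai_generated": {"label": "human", "score": 0.90},
    "toxicity": {"label": "clean", "score": 0.99},
}

print(apply_gate(flagged_paper))                 # informational only
print(apply_gate(flagged_paper, block_ai=True))  # AI detection made blocking
print(apply_gate(poster))                        # blocked by doc_type
```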
Drop into any bash pipeline:
# Extract text from PDF
PDF_TEXT=$(python3 -c "import fitz; d=fitz.open('$PDF'); print(' '.join(p.get_text() for p in d)[:8000])")
# Screen it
echo "$PDF_TEXT" | python3 scripts/pubguard_gate.py 2>/dev/null
if [ $? -ne 0 ]; then
echo "REJECTED — not a valid scientific publication"
exit 1
fi

pip install -e ".[train]"
python scripts/train_pubguard.py --data-dir ./pubguard_data --n-per-class 15000

The training script downloads datasets from HuggingFace, embeds them with model2vec, and trains sklearn LogisticRegression heads. It completes in about 1 minute on CPU.
| Head | Sources |
|---|---|
| doc_type | armanc/scientific_papers, gfissore/arxiv-abstracts-2021, ag_news, poster-sentry-training-data |
| ai_detect | liamdugan/raid, NicolaiSivesind/ChatGPT-Research-Abstracts |
| toxicity | google/civil_comments, skg/toxigen-data |
The poster class uses real scientific poster text from the posters.science corpus via PosterSentry.
    ┌─────────────┐
    │  PDF text   │
    └──────┬──────┘
           │
    model2vec encode ──► emb ∈ R^512
           │
           ├─────────────────┬─────────────────┐
           ▼                 ▼                 ▼
     ┌───────────┐     ┌───────────┐     ┌───────────┐
     │ doc_type  │     │ ai_detect │     │ toxicity  │
     │ [emb+feat]│     │ [emb]     │     │ [emb]     │
     │ →softmax4 │     │ →softmax2 │     │ →softmax2 │
     └───────────┘     └───────────┘     └───────────┘
Same embedding backbone as the OpenAlex Topic Classifier — shares the cached model2vec weights.
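In the diagram, the doc_type head takes `[emb+feat]`, meaning the embedding joined with structural features, while the other two heads take the embedding alone. The sketch below shows one way such an input could be assembled. The four features are invented for illustration and are not the features computed in `text.py`; the function names are hypothetical too.

```python
import math
import re


def structural_features(text: str) -> list:
    """Compute a few illustrative structural signals (hypothetical, not PubGuard's set)."""
    lower = text.lower()
    words = text.split()
    citations = re.findall(r"\[\d+\]", text)  # bracketed numeric citations such as [1]
    return [
        1.0 if "abstract" in lower else 0.0,
        1.0 if "references" in lower else 0.0,
        len(citations) / max(len(words), 1),  # citation markers per word
        math.log1p(len(words)),               # compressed document length
    ]


def doc_type_input(emb: list, text: str) -> list:
    """Concatenate the embedding with the structural features: [emb + feat]."""
    return list(emb) + structural_features(text)


sample = "Abstract We study X [1]. References [1] Foo."
features = structural_features(sample)
vector = doc_type_input([0.0] * 512, sample)  # zero vector stands in for a real embedding
print(len(vector), features)
```

Under this assumption the doc_type head's weight matrix would have 512 plus the number of features as its input dimension, while the other heads keep 512.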
pubguard/
├── src/pubguard/
│ ├── __init__.py # PubGuard, PubGuardConfig exports
│ ├── classifier.py # PubGuard class — screen(), screen_batch()
│ ├── config.py # Configuration + model path resolution
│ ├── text.py # Text cleaning + structural feature extraction
│ ├── train.py # Training pipeline (sklearn LogisticRegression)
│ ├── data.py # Dataset download + preparation
│ ├── errors.py # PV-XXXX error code system
│ └── cli.py # CLI interface
├── scripts/
│ ├── pubguard_gate.py # Bash pipeline integration (exit 0/1)
│ └── train_pubguard.py # Training entry point
├── ERRORS.md # Error code reference guide
├── PubGuard.png # Logo
└── pyproject.toml # pip-installable package
| Resource | Link |
|---|---|
| Trained model | jimnoneill/pubguard-classifier |
| Training data | jimnoneill/pubguard-training-data |
MIT License — See LICENSE for details.
@software{pubguard_2026,
title = {PubGuard: Multi-Head Scientific Publication Gatekeeper},
author = {O'Neill, James},
year = {2026},
url = {https://github.com/jimnoneill/pubguard}
}