Lean, self-hosted HTML renderer that turns any HTTP(S) page into a cleaned SEO snapshot with injected JSON-LD. Built on Puppeteer and Express as a drop-in alternative to prerender.io / Rendertron.
- `GET /render?url=…` returns minimized HTML with JSON-LD; errors come back as `{ "error": "…" }`.
- Filters non-essential requests (fonts, stylesheets, media, xhr/ws/ping, common analytics and A/B-testing scripts) for repeatable renders.
- Waits for DOM stability via `MutationObserver`, then pulls markup directly with CDP `DOM.getOuterHTML` to avoid Puppeteer timing quirks.
- Cleans markup: strips scripts/styles/forms/nav/svg/etc., keeps only safe attributes, collapses empty div soup, normalizes whitespace, and enforces `<base>` + canonical link.
- Builds a JSON-LD graph (Organization, WebSite, derived WebPage type) from existing meta tags and canonical URLs.
- Tracks in-flight work in a file-backed flag (`tmp/process`) surfaced at `GET /progress`.
- Supports a custom `USER_AGENT` for pages that gate content; sanitizes snapshot filenames when snapshotting is wired in.
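
The request filtering described above could be expressed as a small predicate. This is a minimal sketch, not the project's actual code: `shouldBlock`, `BLOCKED_TYPES`, and the host list are hypothetical names, and the two analytics hosts are placeholders for whatever list the renderer really uses. The resource type strings are the ones Puppeteer reports from `request.resourceType()`.

```javascript
// Resource types that the feature list describes as non-essential.
const BLOCKED_TYPES = new Set([
  'font', 'stylesheet', 'media', 'xhr', 'websocket', 'ping',
]);

// Hypothetical stand-ins for "common analytics/AB scripts";
// the project's real list is not shown in this README.
const BLOCKED_SCRIPT_HOSTS = ['google-analytics.com', 'googletagmanager.com'];

function shouldBlock(resourceType, url) {
  if (BLOCKED_TYPES.has(resourceType)) return true;
  if (resourceType !== 'script') return false;
  let host;
  try {
    host = new URL(url).hostname;
  } catch {
    return false; // unparsable URL: do not block on our side
  }
  // Match the host itself or any subdomain of it.
  return BLOCKED_SCRIPT_HOSTS.some(
    (blocked) => host === blocked || host.endsWith('.' + blocked),
  );
}
```

With Puppeteer, a predicate like this would typically be called from a `page.on('request', …)` handler after `page.setRequestInterception(true)`, aborting the request when it returns `true` and continuing it otherwise.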
- Prereq: Node.js >= 18.18 (Puppeteer downloads Chromium on first install).
- Install: `npm install`
- Configure: `cp .env.example .env`, then tweak values (see below).
- Run: `npm start` (or `npm run dev` for watch mode, `npm run check` for syntax, `npm test` for Node's built-in tests if added).
Defined in .env (defaults from .env.example):
- `SERVER_HOST` / `SERVER_PORT`: bind address and port (default `127.0.0.1:51000`).
- `SERVER_TIMEOUT_MS`: overall request timeout in ms; also used as the max DOM-stability wait (default `60000`).
- `FETCH_HTML_TIMEOUT`: CDP outerHTML fetch timeout in ms (default `1000`).
- `STABLE_PAGE_TIMEOUT`: quiet period (ms) with no DOM mutations before snapshot (default `500`).
- `TMP_DIR`: progress flag directory (default `./tmp`).
- `LOG_DIR`, `LOG_FILE`, `LOG_LEVEL`: log destination and enabled levels (`log`, `info`, `warn`, `error`; inline `//` comments are ignored).
- `USER_AGENT`: optional custom UA applied to page requests; omit to use Puppeteer's default.
- `SNAPSHOT`: toggles the snapshot helper if you wire `PageRenderer.persistHtmlSnapshot` into the flow; filenames are URL-safe and truncated to 120 chars.
- `STRIP_CSS`: when `true`, remove `<link rel="stylesheet">` and `<style>` during cleaning; when `false`, keep them.
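
To illustrate how these variables and their documented defaults might be read, here is a minimal sketch. `loadConfig`, `parseLogLevels`, and `toInt` are hypothetical helpers, not the project's API. Two things are assumptions: that `LOG_LEVEL` entries are separated by commas or whitespace, and that `SNAPSHOT` and `STRIP_CSS` count as off when unset.

```javascript
const LEVELS = ['log', 'info', 'warn', 'error'];

// Drop an inline "// comment", then keep only known level names.
// Assumption: levels are separated by commas and/or whitespace.
function parseLogLevels(raw = '') {
  return raw
    .split('//')[0]
    .split(/[\s,]+/)
    .map((s) => s.toLowerCase())
    .filter((s) => LEVELS.includes(s));
}

// Parse an integer, falling back to the documented default when unset/invalid.
function toInt(raw, fallback) {
  const n = Number.parseInt(raw, 10);
  return Number.isFinite(n) ? n : fallback;
}

function loadConfig(env) {
  return {
    host: env.SERVER_HOST || '127.0.0.1',
    port: toInt(env.SERVER_PORT, 51000),
    serverTimeoutMs: toInt(env.SERVER_TIMEOUT_MS, 60000),
    fetchHtmlTimeoutMs: toInt(env.FETCH_HTML_TIMEOUT, 1000),
    stablePageTimeoutMs: toInt(env.STABLE_PAGE_TIMEOUT, 500),
    tmpDir: env.TMP_DIR || './tmp',
    logLevels: parseLogLevels(env.LOG_LEVEL),
    userAgent: env.USER_AGENT || undefined, // undefined: keep Puppeteer's default
    snapshot: env.SNAPSHOT === 'true',      // assumption: off unless exactly "true"
    stripCss: env.STRIP_CSS === 'true',     // assumption: off unless exactly "true"
  };
}
```

In a running server this would be called as `loadConfig(process.env)` after the `.env` file has been loaded.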
- `GET /render?url=ENCODED_HTTP_URL` → `text/html`
  - Validates that `url` is HTTP/HTTPS. Returns cleaned HTML with injected JSON-LD.
  - Failures return HTTP 4xx/5xx with JSON body `{ "error": "message" }`.
- `GET /progress` → `{ "progress": 0 | 1 }`
  - Reflects whether a render is running (the file-backed flag is reset even on errors).
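
The `url` validation for `/render` might look roughly like the following. This is a sketch under assumptions: `validateTargetUrl` is a hypothetical name, the error messages are invented for illustration, and the choice of status 400 is a guess (the README only promises 4xx/5xx with an `{ "error": … }` body).

```javascript
function validateTargetUrl(raw) {
  if (typeof raw !== 'string' || raw.trim() === '') {
    return { ok: false, status: 400, body: { error: 'Missing "url" query parameter' } };
  }
  let parsed;
  try {
    parsed = new URL(raw); // throws on anything that is not an absolute URL
  } catch {
    return { ok: false, status: 400, body: { error: 'Invalid URL' } };
  }
  if (parsed.protocol !== 'http:' && parsed.protocol !== 'https:') {
    return { ok: false, status: 400, body: { error: 'Only http and https URLs are supported' } };
  }
  return { ok: true, url: parsed.href };
}
```

In an Express handler the result could be used as `res.status(result.status).json(result.body)` on failure; Express already decodes the query string, so `req.query.url` can be passed in directly.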
- Launch headless Chromium with `--no-sandbox` and intercept requests to drop heavy/analytics resources.
- Await DOM stability (`MutationObserver` + quiet timer) within the global timeout, then pull the full document via CDP.
- Clean HTML (`src/reduce/index.js`): optionally strip CSS tags when `STRIP_CSS=true`, remove disallowed tags/attrs, keep meaningful classes, drop non-description meta tags, ensure `<base>` and canonical, collapse empty wrappers, normalize whitespace and nbsp.
- Generate JSON-LD (`src/ldgen/jsonLdBuilder.js`): Organization + WebSite + heuristically typed WebPage (ItemPage, CollectionPage, SearchResultsPage, etc.) based on existing meta and path heuristics; injects into `<head>`.
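
As a rough illustration of the JSON-LD step, the sketch below builds an Organization + WebSite + WebPage graph and picks the WebPage subtype from the URL. It is not the code in `src/ldgen/jsonLdBuilder.js`: the function names, the `@id` scheme, and the path rules are all assumptions chosen for the example.

```javascript
// Pick a schema.org WebPage subtype from the URL.
// These rules are illustrative; the project's heuristics may differ.
function guessPageType(url) {
  const { pathname, searchParams } = new URL(url);
  if (pathname.includes('/search') || searchParams.has('q')) return 'SearchResultsPage';
  if (/\/(category|categories|collections?|tags?)\//.test(pathname + '/')) return 'CollectionPage';
  if (/\/(products?|items?)\//.test(pathname)) return 'ItemPage';
  return 'WebPage';
}

function buildJsonLd({ url, title, description, siteName }) {
  const origin = new URL(url).origin;
  return {
    '@context': 'https://schema.org',
    '@graph': [
      { '@type': 'Organization', '@id': `${origin}/#organization`, name: siteName, url: `${origin}/` },
      { '@type': 'WebSite', '@id': `${origin}/#website`, name: siteName, url: `${origin}/`,
        publisher: { '@id': `${origin}/#organization` } },
      { '@type': guessPageType(url), '@id': `${url}#webpage`, url, name: title, description,
        isPartOf: { '@id': `${origin}/#website` } },
    ],
  };
}

function toScriptTag(jsonLd) {
  // Escape "<" so the payload cannot close the script element early.
  const json = JSON.stringify(jsonLd).replace(/</g, '\\u003c');
  return `<script type="application/ld+json">${json}</script>`;
}
```

The string returned by `toScriptTag` is what would be placed inside `<head>`; in this sketch `title`, `description`, and `siteName` are expected to come from the page's existing meta tags.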
- Logger (`src/services/logger.js`) writes `[ISO][LEVEL]` entries to `LOG_DIR/LOG_FILE`; falls back to console on write failures.
- Include `log` in `LOG_LEVEL` to trace every intercepted request.
- Snapshot helper (`PageRenderer.persistHtmlSnapshot`) is available for wiring when `SNAPSHOT=true`; filenames are sanitized from hostname/path and saved under `LOG_DIR`.
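
The filename sanitizing mentioned for snapshots could be approximated as below. This is only a sketch: `snapshotFileName` is a hypothetical helper, and the `.html` extension, the underscore replacement, and counting the extension toward the 120-character limit are all assumptions rather than documented behavior.

```javascript
const MAX_LENGTH = 120;

function snapshotFileName(pageUrl) {
  const { hostname, pathname } = new URL(pageUrl);
  const base = `${hostname}${pathname}`
    .toLowerCase()
    .replace(/[^a-z0-9._-]+/g, '_') // anything outside the safe set becomes "_"
    .replace(/^_+|_+$/g, '');       // trim leading/trailing underscores
  const ext = '.html';
  // Truncate so the full name, extension included, stays within the limit.
  return base.slice(0, MAX_LENGTH - ext.length) + ext;
}
```

Because every `/` is replaced, the result never contains a directory separator, so it can be joined onto `LOG_DIR` without escaping that directory.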
- Entrypoint: `src/server.js`, Express app in `src/app.js`, routes in `src/routes/renderRoute.js`.
- Progress tracking: `src/utils/processTracker.js` (file `tmp/process`).
- Node ESM; lint via `npm run check`; tests via `npm test` when present.