Skip to content

High-performance HTML-to-plain-text conversion for .NET. Optimised for speed, low allocations, and predictable output.

License

Notifications You must be signed in to change notification settings

pavlosmcg/Html2Text.Net

Repository files navigation

Html2Text.Net

Live Demo CI Benchmarks NuGet License

Just fast HTML -> plain text.

Lightweight, hand rolled, high-performance HTML to plain text conversion for .NET.

Check out the LIVE DEMO!

Use cases

  • Search / indexing pipelines: Strip HTML down to text for full-text search, indexing, classification, or deduping.
    • Example: convert HTML to text before indexing in Elasticsearch / OpenSearch
  • Batch processing: Convert large archives of HTML (docs, KB articles, CMS exports) into text efficiently.
  • Email & notification processing: Get a readable text version of HTML emails for previews, logs, or plain-text fallbacks.
  • LLM / NLP preprocessing: Normalize HTML into clean text before chunking, embedding, or extraction.
  • Logging / auditing: Store a text representation of HTML content for review or compliance.

Usage

Simple as possible:

using Html2Text;

string html = "<h1>Hello</h1><p>World</p>";

string text = Html2Text.Convert(html);

Output:

Hello

World

Install, build, test, contribute

NuGet

Install using NuGet (recommended):

dotnet add package Html2Text.Net

Supported frameworks

  • .Net 8+
  • .Net Framework 4.6.2+
  • .Net Standard 2.0 for compatibility with other frameworks, including .Net 5/6/7

For .Net Framework users, PackageReference style dependencies are recommended. Also ensure binding redirects are enabled.

Contributing

Contributions and pull requests are welcome! With .Net 10 SDK installed, to build locally:

dotnet build

To run unit and regression tests:

(windows): dotnet test
(linux/mac): dotnet test -f net10.0

To run the example console app:

dotnet build
dotnet run --project Html2Text.Example Samples/scottallen.html

How it works

Pipeline

HTML document -> Lexer (tokens) -> Parser (AST nodes) -> Renderer (string text)
  • Text nodes are emitted in document order.
  • Basic block separation is preserved (e.g., paragraphs/headings insert newlines).
  • Whitespace is normalized to produce readable plain text.

Minimal formatting is added to make the plain text output readable in only 4 cases:

  1. HTML tables are given cell separators | and horizontal lines --- under column headers:
| Chart                  | Record Holder     | Record       |
| ---------------------- | ----------------- | ------------ |
| Opening Days           | Avengers: Endgame | $157,461,641 |
| Top Single Day Grosses | Avengers: Endgame | $157,461,641 |
  1. Lists and nested lists are indented and given a leading - like so:
 - 1 Early life
 - 2 Enigma machine
 - 3 Solving the wiring
 - Toggle Solving the wiring subsection
   - 3.1 French help
 - 4 Solving daily settings
 - Toggle Solving daily settings subsection
   - 4.1 Early methods
   - 4.2 Bomba and sheets
   - 4.3 Allies informed
  1. In preformatted areas <pre> whitespace is preserved:
private int GetSmallestNonNegative(int x, int y) {
    return x < 0 && y < 0 ? 0
        : x < 0 ? y
        : y < 0 ? x
        : Math.Min(x, y);
}
  1. The <hr/> element adds a horizontal line of dashes ---.

Goals

This project is focused on:

  • High performance: designed for low allocations and fast throughput.
  • Text extraction only: get the words from the page/document.
  • No dependencies: Lightweight, not an embedded browser engine. No dependencies other than .NET itself.

Non-goals (by design)

The following are intentionally out of scope so the library can excel at the goals above:

  • Respecting CSS, computed styles, display:none, or visibility.
  • Pixel-accurate layout, whitespace mirroring, or browser-equivalent rendering.
  • Executing JavaScript or loading remote resources.

Performance notes

Benchmarks

High performance is a goal of this project. This library:

  • designed for converting many documents quickly (batch processing, indexing, search pipelines).
  • avoids DOM dependencies.
  • uses a lightweight, hand rolled lexer/parser/renderer pipeline.

Benchmarks are in Html2Text.PerfTests and can be run locally with:

dotnet run -c Release --project Html2Text.PerfTests

Or check out the latest automated perf test results here: https://pavlosmcg.github.io/Html2Text.Net/dev/bench/

perf-test-console perf-test-chart

Regression tests

Each file in the Samples/ directory acts as an acceptance/regression test. The results of converting these HTML files to plain text are saved in Html2Text.RegressionTests/*.verified.txt:

Samples/<file-name>.html -> Html2Text.Convert(<file-contents>) -> <file-name>.verified.txt

For example scottallen.html -> scottallen.verified.txt

Html2Text.RegressionTests uses Verify to make test assertions against verified output snapshots. If you need to update the outputs please see the Verify docs for snapshot management.

Projects in this repository

  • Html2Text/: core library
  • Html2Text.Example/: small example app
  • Html2Text.Tests/: unit tests
  • Html2Text.RegressionTests/: regression/acceptance tests
  • Html2Text.PerfTests/: performance benchmarking console app
  • Samples/: sample HTML files used during development and automated regression testing

Distributed under MPL-2.0 see LICENSE.txt

About

High-performance HTML-to-plain-text conversion for .NET. Optimised for speed, low allocations, and predictable output.

Resources

License

Stars

Watchers

Forks

Contributors 2

  •  
  •  

Languages