janreges/siteone-crawler-markdown-examples

SiteOne Crawler - Web to markdown conversion examples

This repository belongs to SiteOne Crawler and shows examples of converting entire websites to markdown.

Website crawler.siteone.io

  • Open markdown version of crawler.siteone.io - this website is built with Starlight.
  • The markdown version was generated by the command below.
  • For better performance, some parts of the page (DOM elements) have been removed with --markdown-exclude-selector.
  • The --ignore-regex options keep the crawler from following URLs that point to HTML reports or example exports, so the markdown keeps only absolute links to them.
  • The --disable-* options are here only to avoid downloading those file types unnecessarily. They do not affect the markdown output.
./crawler \
  --url=https://crawler.siteone.io/ \
  --ignore-regex='/^.*\/html\//' \
  --ignore-regex='/^.*\/examples\-exports\//' \
  --markdown-export-dir=tmp/crawler.siteone.io/ \
  --markdown-exclude-selector='header' \
  --markdown-exclude-selector='starlight-theme-select' \
  --markdown-exclude-selector='.isMobile' \
  --markdown-exclude-selector='#starlight__on-this-page--mobile' \
  --markdown-exclude-selector='.social-icons' \
  --disable-styles --disable-javascript --disable-fonts
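
To make the effect of the two --ignore-regex patterns in the command above easier to see, here is a minimal sketch that tests a few URLs against them with grep. It assumes the crawler matches each pattern against the full URL; the is_ignored helper and the example.com URLs are made up for illustration and are not part of SiteOne Crawler.

```shell
# Hypothetical helper (not part of the crawler): succeeds when a URL matches
# one of the two --ignore-regex patterns from the command above.
# The PCRE delimiters (/.../) and the escaped slashes are dropped because
# grep -E takes the bare expression.
is_ignored() {
  printf '%s\n' "$1" | grep -Eq '^.*/html/|^.*/examples-exports/'
}

# Illustrative URLs only; the paths are invented for this sketch.
for url in \
  "https://example.com/html/report.html" \
  "https://example.com/examples-exports/markdown/" \
  "https://example.com/features/overview/"; do
  if is_ignored "$url"; then
    echo "ignored: $url"
  else
    echo "crawled: $url"
  fi
done
```

Under that assumption, only the last URL would be followed; the first two would stay in the markdown as plain absolute links.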

Website react.dev

  • Open markdown version of react.dev.
  • The Markdown version was generated by the specific command below. For better performance, some parts of the page (DOM elements) have been removed.
  • I used --markdown-disable-images so that images are left out of the markdown.
  • I used --disable-all-assets only to avoid downloading assets (JS, CSS, etc.) unnecessarily. It does not affect the markdown output.
./crawler \
  --url=https://react.dev/ \
  --markdown-export-dir=tmp/react.dev/ \
  --markdown-disable-images \
  --disable-all-assets
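
One way to check that --markdown-disable-images did what the notes above describe is to count markdown image references in the export directory. This is only a sketch: count_image_refs is a hypothetical helper, and it assumes the export writes plain markdown files under the directory given to --markdown-export-dir.

```shell
# Hypothetical check (not part of the crawler): prints how many markdown
# image references of the form ![alt](src) appear in files under directory $1.
count_image_refs() {
  # -r: recurse, -h: no file names, -o: print each match on its own line
  grep -rho '!\[[^]]*]([^)]*)' "$1" | wc -l | tr -d ' '
}
```

After the export, `count_image_refs tmp/react.dev/` should print 0 if no image references remain. Note that the pattern also counts images wrapped in links.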

Website docs.astro.build

  • Open markdown version of docs.astro.build - this website is built with Starlight.
  • The markdown version was generated by the command below. For better performance, some parts of the page (DOM elements) have been removed.
  • The --disable-* options are here only to avoid downloading those file types unnecessarily. They do not affect the markdown output.
./crawler \
  --url=https://docs.astro.build/ \
  --markdown-export-dir=tmp/docs.astro.build/ \
  --markdown-exclude-selector='header' \
  --markdown-exclude-selector='starlight-theme-select' \
  --markdown-exclude-selector='.isMobile' \
  --markdown-exclude-selector='#starlight__on-this-page--mobile' \
  --markdown-exclude-selector='.social-icons' \
  --disable-styles --disable-javascript --disable-fonts
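
Both Starlight sites in this README use the same five exclude selectors, so the shared part can be wrapped in a small shell function. The sketch below is one possible way to do that; export_starlight_site and the CRAWLER variable are hypothetical names, and only options that already appear in the commands above are used.

```shell
# Hypothetical wrapper (not part of the crawler).
# $1 = site URL, $2 = markdown export directory.
# CRAWLER may point at another binary; it defaults to ./crawler.
export_starlight_site() {
  url="$1"
  out_dir="$2"

  # Rebuild the positional parameters as the list of exclude options,
  # using the selectors copied from the Starlight commands in this README.
  set --
  for selector in 'header' 'starlight-theme-select' '.isMobile' \
                  '#starlight__on-this-page--mobile' '.social-icons'; do
    set -- "$@" "--markdown-exclude-selector=$selector"
  done

  "${CRAWLER:-./crawler}" \
    --url="$url" \
    --markdown-export-dir="$out_dir" \
    "$@" \
    --disable-styles --disable-javascript --disable-fonts
}
```

With this in place, `export_starlight_site https://docs.astro.build/ tmp/docs.astro.build/` would run the same command as above. Setting `CRAWLER=echo` prints the assembled options instead of starting a crawl, which is handy for a dry run.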
