Diselorya/pdf2md

pdf2md

Convert PDF to Markdown and TXT, especially for use with Obsidian.

Because this project uses the pytesseract library, you must install the Tesseract-OCR software manually and add it to the system environment variables. For detailed steps, see: "Image text recognition with pytesseract" (基于pytesseract进行图片文字识别) on Zhihu (zhihu.com).
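As a rough illustration of the setup step above, the sketch below looks for the `tesseract` executable and points pytesseract at it. The helper names `find_tesseract` and `configure_pytesseract` are hypothetical and not part of this repository; `pytesseract.pytesseract.tesseract_cmd` is the attribute pytesseract documents for setting the executable path.

```python
import shutil
from pathlib import Path


def find_tesseract(explicit_path=None):
    """Return the path to the tesseract executable, or None if it is not found.

    If explicit_path is given, only that file is checked; otherwise the
    directories listed in the PATH environment variable are searched.
    """
    if explicit_path is not None:
        return explicit_path if Path(explicit_path).is_file() else None
    return shutil.which("tesseract")


def configure_pytesseract(explicit_path=None):
    """Point pytesseract at the located executable (hypothetical helper)."""
    import pytesseract  # third-party; imported lazily so find_tesseract works without it

    path = find_tesseract(explicit_path)
    if path is None:
        raise RuntimeError(
            "Tesseract-OCR was not found; install it and add it to PATH, "
            "or pass the full path to the executable."
        )
    pytesseract.pytesseract.tesseract_cmd = path
    return path
```

If Tesseract-OCR is installed outside `PATH`, passing the full path to the executable to `configure_pytesseract` avoids editing system variables.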

  • Get content:
    • Convert text-based PDFs.
    • Convert image-based (scanned) PDFs with OCR.
      • Higher OCR recognition accuracy.
    • Save images and embed them in the Markdown using Obsidian's syntax.
  • Fix broken sentences (most cases, but not all).
    • Needs tuning on more samples.
    • AI-assisted recognition of sentence breaks.
  • Add headings:
    • Convert PDF bookmarks to headings.
    • Use page numbers as headings for image-based PDFs.
      • Fetch the first sentence of each page for its page-number heading.
      • Compare page headers, the table of contents, and page numbers to identify heading levels.
  • Filename handling:
    • Fix unsupported characters in filenames.
    • Replace characters in filenames that conflict with Obsidian.
  • Character encoding handling:
    • Normalise characters that look the same but have different Unicode code points, which TTS cannot read.
  • Batch conversion.
  • Table of contents: replace the PDF's table of contents with an Obsidian-style one. (low priority)
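
Several of the items above (Unicode normalisation, filename cleanup, Obsidian image embeds, bookmark headings) can be sketched with small helpers. This is a minimal sketch, not the code in this repository: all function names are hypothetical, NFKC is one assumed choice of normalisation form, and the set of characters treated as conflicting with Obsidian (`#`, `^`, `[`, `]`) is an assumption.

```python
import re
import unicodedata

# Characters that common filesystems (notably Windows) reject in filenames.
_FS_FORBIDDEN = '<>:"/\\|?*'
# Characters assumed to clash with Obsidian link syntax (an assumption).
_OBSIDIAN_FORBIDDEN = "#^[]"
_BAD_CHARS = re.compile(
    "[" + re.escape(_FS_FORBIDDEN + _OBSIDIAN_FORBIDDEN) + r"\x00-\x1f]"
)


def normalize_text(text):
    """Fold compatibility variants (ligatures, full-width forms) via NFKC."""
    return unicodedata.normalize("NFKC", text)


def safe_filename(name, replacement="_"):
    """Replace forbidden characters and trim leading/trailing spaces and dots."""
    cleaned = _BAD_CHARS.sub(replacement, normalize_text(name)).strip(" .")
    return cleaned or "untitled"


def obsidian_embed(image_name):
    """Return an Obsidian-style embed for an image saved next to the note."""
    return f"![[{image_name}]]"


def bookmark_to_heading(level, title):
    """Turn a PDF bookmark (1-based depth, title) into a Markdown heading."""
    level = max(1, min(6, level))  # Markdown has six heading levels
    return "#" * level + " " + title.strip()


if __name__ == "__main__":
    print(safe_filename("Chapter 1: What is OCR?.pdf"))
    print(obsidian_embed("page-1.png"))
    print(bookmark_to_heading(2, "Introduction"))
```

NFKC is aggressive: it also rewrites characters such as superscripts and full-width punctuation, so a gentler form (for example NFC) may be preferable where the original glyphs matter.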
