Convert PDF files to Markdown and TXT, with output tailored for Obsidian.
This tool depends on the pytesseract library, so you must install the Tesseract-OCR software manually and add it to your system environment variables. For detailed steps, see the Zhihu article "Image text recognition with pytesseract" (基于pytesseract进行图片文字识别 - 知乎, zhihu.com).
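To confirm the setup before running a conversion, a quick check like the one below can help. It is only a sketch: it assumes the executable is named `tesseract` (on Windows, `shutil.which` also finds `tesseract.exe`), and `find_tesseract` is a hypothetical helper, not part of this project.

```python
import shutil
from typing import Optional


def find_tesseract(executable: str = "tesseract") -> Optional[str]:
    """Return the full path of the Tesseract binary, or None if it is not on PATH."""
    # shutil.which searches the directories listed in the PATH environment variable.
    return shutil.which(executable)


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("Tesseract-OCR was not found on PATH; install it and add its folder to PATH.")
    else:
        print(f"Tesseract-OCR found at: {path}")
```

If the check fails even though Tesseract-OCR is installed, the install folder is probably missing from the environment variables.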
- Get content:
  - Convert text-based PDFs.
  - Convert image-based (scanned) PDFs via OCR.
  - Higher OCR recognition accuracy.
  - Save images and embed them in the Markdown using Obsidian syntax.
  - Fix broken sentences (most, but not 100%).
  - Needs further optimization based on more samples.
  - AI-assisted recognition of sentence breaks.
- Add headings:
  - Convert PDF bookmarks to headings.
  - Use page numbers as headings for image-based PDFs.
  - Fetch the first sentence of each page for page-number headings.
  - Compare the headers, catalog, and page numbers to identify heading levels.
- Filename handling:
  - Fix unsupported characters in filenames.
  - Replace characters in filenames that conflict with Obsidian.
- Character encoding handling:
  - Normalize characters that look the same but have different Unicode code points, which TTS cannot read.
- Batch conversion.
- Catalog: convert the catalog to Obsidian style (of little significance).
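The "fix broken sentences" step could be approximated with a simple line-joining heuristic. The sketch below is not the project's actual implementation: `merge_broken_lines` and the `SENTENCE_END` set are hypothetical, and joining with a space assumes space-separated languages (Chinese text would need joining without a space).

```python
# Punctuation treated as the end of a sentence (an assumption for this sketch).
SENTENCE_END = (".", "!", "?", "。", "！", "？")


def merge_broken_lines(text: str) -> str:
    """Join lines that PDF extraction split in the middle of a sentence."""
    paragraphs = []
    current = ""
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            # A blank line closes the current paragraph.
            if current:
                paragraphs.append(current)
                current = ""
            continue
        if not current:
            current = line
        elif current.endswith("-") and line[0].islower():
            # A word hyphenated across two lines: drop the hyphen and join.
            current = current[:-1] + line
        elif current.endswith(SENTENCE_END):
            # The previous line ended a sentence: keep the line break.
            paragraphs.append(current)
            current = line
        else:
            # Otherwise assume the sentence continues on this line.
            current += " " + line
    if current:
        paragraphs.append(current)
    return "\n".join(paragraphs)
```

A rule-based pass like this inevitably misjudges some lines (abbreviations, headings without punctuation), which is consistent with the "most, but not 100%" caveat above.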
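For the filename handling step, one possible approach is to replace every problematic character with a placeholder. The character set below is an assumption: it combines the characters Windows forbids in filenames with `#`, `^`, `[` and `]`, which are used by Obsidian link syntax. `sanitize_filename` is a hypothetical name, not the project's API.

```python
import re

# Windows-forbidden characters, Obsidian link characters, and control characters.
_UNSAFE = re.compile(r'[<>:"/\\|?*#^\[\]\x00-\x1f]')


def sanitize_filename(name: str, replacement: str = "_") -> str:
    """Replace characters that are unsafe in filenames or Obsidian links."""
    cleaned = _UNSAFE.sub(replacement, name)
    # Collapse runs of whitespace, then trim spaces and dots from both ends.
    cleaned = re.sub(r"\s+", " ", cleaned).strip(" .")
    # Fall back to a fixed name if nothing usable is left.
    return cleaned or "untitled"
```

Replacing rather than deleting characters keeps the result close to the original title, so the note is still easy to recognize in the vault.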
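The character normalization step can be illustrated with Unicode compatibility normalization from the standard library. This is a sketch under the assumption that NFKC covers the problem characters; the project may use its own mapping, and `normalize_text` is a hypothetical name.

```python
import unicodedata


def normalize_text(text: str) -> str:
    """Map compatibility characters to their standard equivalents using NFKC."""
    return unicodedata.normalize("NFKC", text)


if __name__ == "__main__":
    # U+2F00 (Kangxi radical one) looks like U+4E00 (the ideograph for "one")
    # but is a different code point; NFKC maps the former to the latter.
    sample = "\u2f00"
    print(hex(ord(sample)), "->", hex(ord(normalize_text(sample))))
```

Note that NFKC also rewrites ligatures and full-width letters (for example `ﬁ` becomes `fi`), which is usually desirable for TTS but changes the text more than a targeted mapping would.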