A simple Java CLI tool for batch-converting PDF files to TXT format. Supports file filtering by filename wildcards and last modified date.

akoutsop1909/pdf-to-txt-converter

PDF-to-TXT Converter

A command-line Java tool for batch-converting PDF files to TXT format using Apache PDFBox. It supports filtering input files by filename wildcards (e.g., *.pdf, report_??.pdf) and last modified date ranges. Output .txt files are saved to a specified directory, which is created automatically if it doesn't exist.

Note

Initially built during my internship, tailored to the company's internal document needs.
Code shared with permission.

⚙️ System Requirements

  • Java 11 or later is required. You can download it from the official Oracle website.
  • Ensure that the java command is available in your system's PATH.
    You may also need to set the JAVA_HOME environment variable on some systems. Instructions here.

🚀 Getting Started

You can download the latest release from the Releases section of this repository. It includes the executable pdf2txt.jar and a run_pdf2txt.bat script for Windows users. The batch script prompts you for input and runs pdf2txt.jar with the parameters you provide. The script is optional; you can also run the JAR file directly from the command line.

Alternatively, you can clone the repository and build the JAR manually using a Java IDE like IntelliJ IDEA, or the jar command. Instructions here and here.

To run the converter directly from the command line, you must use the following format:

java -jar pdf2txt.jar [source] [dest] [minDate] [maxDate]

Arguments

  • [source] – Path to the PDF files. Supports wildcards (e.g., ./input/report_?.pdf).
  • [dest] – Directory to save converted TXT files (e.g., ./output/).
  • [minDate] – Minimum last modified date for PDFs (format: dd-MM-yyyy, e.g., 01-01-2022).
  • [maxDate] – Maximum last modified date for PDFs (format: dd-MM-yyyy, e.g., 01-01-2023).

All arguments are optional but must be provided in the order listed above. If one or more are omitted at the end, default values will be used for the missing ones:

  • [source] → Current directory.
  • [dest] → Current directory.
  • [minDate] → 01-01-1970.
  • [maxDate] → Current date.
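
The positional defaults above can be sketched in a few lines of Java. This is a minimal illustration, not the tool's actual code: the class name ArgsSketch and its fields are hypothetical, and only the argument order, the dd-MM-yyyy format, and the default values come from this README.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

/** Hypothetical sketch of positional arguments with trailing defaults. */
class ArgsSketch {
    // Date format documented in this README, e.g. 01-01-2022.
    static final DateTimeFormatter FORMAT = DateTimeFormatter.ofPattern("dd-MM-yyyy");

    final String source;
    final String dest;
    final LocalDate minDate;
    final LocalDate maxDate;

    ArgsSketch(String[] args) {
        // Each argument falls back to its default when it is missing at the end.
        source = args.length > 0 ? args[0] : ".";
        dest = args.length > 1 ? args[1] : ".";
        minDate = args.length > 2 ? LocalDate.parse(args[2], FORMAT) : LocalDate.of(1970, 1, 1);
        maxDate = args.length > 3 ? LocalDate.parse(args[3], FORMAT) : LocalDate.now();
    }

    public static void main(String[] args) {
        ArgsSketch parsed = new ArgsSketch(args);
        System.out.println("source=" + parsed.source + ", dest=" + parsed.dest
                + ", minDate=" + parsed.minDate + ", maxDate=" + parsed.maxDate);
    }
}
```

Because the arguments are positional, there is no way to skip an earlier one: to set [minDate], both [source] and [dest] must be given first.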

Limitations

  • Arguments must be provided in order: [source], [dest], [minDate], [maxDate].
  • Supports only wildcard patterns (* and ?), not full regular expressions.
  • Wildcards apply to filenames only, not directory names.
  • The * wildcard is greedy (matches as many characters as possible).
  • Date filtering is based on the file's last modified timestamp.
  • Recursive folder traversal is not supported.
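
To make the wildcard rules above concrete, here is one possible way to match a filename against a pattern by translating * and ? into a regular expression. It is a sketch under assumptions, not the converter's implementation: the class WildcardSketch and its methods are hypothetical, and case-sensitive matching is assumed because the README does not specify it.

```java
import java.util.regex.Pattern;

/** Hypothetical sketch: translate a filename wildcard into a regex. */
class WildcardSketch {
    static Pattern toRegex(String wildcard) {
        StringBuilder regex = new StringBuilder();
        StringBuilder literal = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*' || c == '?') {
                // Quote any pending literal text so characters like '.' match themselves.
                if (literal.length() > 0) {
                    regex.append(Pattern.quote(literal.toString()));
                    literal.setLength(0);
                }
                // '*' becomes a greedy "any characters"; '?' becomes exactly one character.
                regex.append(c == '*' ? ".*" : ".");
            } else {
                literal.append(c);
            }
        }
        if (literal.length() > 0) {
            regex.append(Pattern.quote(literal.toString()));
        }
        return Pattern.compile(regex.toString());
    }

    /** Returns true when the whole file name matches the wildcard (case-sensitive, an assumption). */
    static boolean matches(String wildcard, String fileName) {
        return toRegex(wildcard).matcher(fileName).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("report_?.pdf", "report_1.pdf"));
        System.out.println(matches("report_?.pdf", "report_10.pdf"));
    }
}
```

Since the whole filename must match, a greedy * accepts the same set of names a lazy one would. The sketch is applied to the filename only, in line with the limitation that wildcards do not apply to directory names.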

Example

java -jar pdf2txt.jar ./input/report_?.pdf ./output 01-01-2022 01-01-2023

This command converts the PDF files in the ./input directory that match report_?.pdf and were last modified between Jan 1, 2022, and Jan 1, 2023, into TXT files and saves them in the ./output directory. If no matching files are found, an appropriate message is displayed.
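
The selection step in this example (non-recursive listing, filename wildcard, last modified date range) could look roughly like the sketch below. It is an illustration only: DateFilterSketch and its methods are hypothetical names, it uses Java's built-in glob support in Files.newDirectoryStream rather than whatever matching the converter implements, and treating both date bounds as inclusive is an assumption the README does not confirm.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneId;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch: pick files in one directory by glob and last modified date. */
class DateFilterSketch {
    /** True when the timestamp's calendar day lies within [min, max] (inclusive bounds are an assumption). */
    static boolean inRange(Instant modified, LocalDate min, LocalDate max, ZoneId zone) {
        LocalDate day = modified.atZone(zone).toLocalDate();
        return !day.isBefore(min) && !day.isAfter(max);
    }

    /** Lists regular files directly inside dir (no recursion) that match the glob and the date range. */
    static List<Path> select(Path dir, String glob, LocalDate min, LocalDate max) throws IOException {
        List<Path> result = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, glob)) {
            for (Path p : stream) {
                Instant modified = Files.getLastModifiedTime(p).toInstant();
                if (Files.isRegularFile(p) && inRange(modified, min, max, ZoneId.systemDefault())) {
                    result.add(p);
                }
            }
        }
        Collections.sort(result);
        return result;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Paths.get("./input");
        if (!Files.isDirectory(dir)) {
            System.out.println("No such directory: " + dir);
            return;
        }
        List<Path> files = select(dir, "report_?.pdf", LocalDate.of(2022, 1, 1), LocalDate.of(2023, 1, 1));
        System.out.println(files.isEmpty() ? "No matching files found." : files.toString());
    }
}
```

Note that Java globs also accept [...] and {...} groups, which go beyond the * and ? wildcards this tool documents.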

Important

Your shell (e.g., Command Prompt, PowerShell, bash) may automatically expand wildcard characters before passing them to the converter, leading to unexpected behavior. To prevent this, you can either:

  • Run the JAR from a Java IDE, such as IntelliJ IDEA, which bypasses shell expansion.
  • Run the JAR from the command line, quoting the source path, though this may not fully prevent expansion.

For more details, you can run:

java -jar pdf2txt.jar --help  # Displays detailed usage and arguments guide.
java -jar pdf2txt.jar --about # Displays general information and limitations.

⌨️ Demo Run

pdf2txt demo

📂 Folder Structure

pdf-to-txt-converter/
├── lib/                     # Dependency JARs
│   └── pdfbox-app-2.0.30.jar
├── src/                     # Java source code
│   └── Pdf2Txt.java    
├── test/                    # JUnit tests
│   ├── ConversionTest.java
│   ├── PathParsingTest.java
│   └── WildcardTest.java
├── .gitignore               # Files/folders to ignore in Git
├── LICENSE                  # License file (MIT)
└── README.md                # This file