Example crawler service with some basic multithreading
The project requires Java 8 and Maven to be installed.
- First, install and build the project using
  mvn install
  or, if you have the wrapper configured for the project,
  mvnw install
- To test the application, run
  mvn test
- To build the jar, run
  mvn package
- To start the program, run
  java -jar target/crawler-0.0.1-SNAPSHOT.jar https://wiprodigital.com/ --crawler.threads=10
The command supports an optional flag, --crawler.threads, which sets the number of threads the crawler uses internally. It defaults to 10 threads.
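To illustrate how a configurable thread count can drive a crawl, here is a minimal sketch using a fixed-size thread pool. It is not the project's actual code: the class name PooledCrawlSketch, the linkExtractor function and the in-memory sampleSite map are all hypothetical stand-ins, and no real HTTP requests are made. It assumes a simple level-by-level traversal in which each batch of pages is processed in parallel.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

public class PooledCrawlSketch {

    // Mirrors the default of 10 threads mentioned above.
    static final int DEFAULT_THREADS = 10;

    /**
     * Crawls breadth-first from start, processing each level of pages in parallel.
     * linkExtractor is a hypothetical stand-in for "download the page and return its links".
     */
    static Set<String> crawl(String start, int threads,
                             Function<String, List<String>> linkExtractor) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        // Only the calling thread touches this set, so it needs no synchronization.
        Set<String> visited = new TreeSet<>();
        visited.add(start);
        List<String> frontier = Collections.singletonList(start);
        try {
            while (!frontier.isEmpty()) {
                List<Callable<List<String>>> tasks = new ArrayList<>();
                for (String url : frontier) {
                    tasks.add(() -> linkExtractor.apply(url));
                }
                List<String> next = new ArrayList<>();
                // invokeAll blocks until every page in this level has been processed.
                for (Future<List<String>> result : pool.invokeAll(tasks)) {
                    for (String link : result.get()) {
                        if (visited.add(link)) {
                            next.add(link);
                        }
                    }
                }
                frontier = next;
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Crawl interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("Page processing failed", e.getCause());
        } finally {
            pool.shutdown();
        }
        return visited;
    }

    /** Hypothetical in-memory site used instead of real HTTP requests. */
    static Map<String, List<String>> sampleSite() {
        Map<String, List<String>> site = new HashMap<>();
        site.put("/", Arrays.asList("/about", "/blog"));
        site.put("/about", Arrays.asList("/", "/team"));
        site.put("/blog", Collections.<String>emptyList());
        site.put("/team", Collections.<String>emptyList());
        return site;
    }

    public static void main(String[] args) {
        Map<String, List<String>> site = sampleSite();
        Set<String> pages = crawl("/", DEFAULT_THREADS,
                url -> site.getOrDefault(url, Collections.<String>emptyList()));
        System.out.println(pages); // prints [/, /about, /blog, /team]
    }
}
```

Merging results on the calling thread keeps the visited set free of locking, at the cost of waiting for the slowest page in each level before moving on.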
I wanted to create a fairly simple web crawler that at least uses some basic threading. There are a few things that could be done better, such as adding an option to limit the number of pages visited, and changing page processing to report any errors that occurred, resource sizes, and more detailed information. We could also add a flag to select the desired output format. As a bigger change, a few tweaks would turn it into a bean that could be used in a web app, a server, or a command-line app.
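Two of the ideas above, a visit limit and per-page error and size reporting, could look roughly like the sketch below. Everything in it is an assumption made for illustration: the names VisitLimitSketch, Page, Fetcher, PageReport and the visitsLimit parameter are hypothetical, the crawl is single-threaded to keep the example short, and the fetcher reads from an in-memory map rather than the network.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class VisitLimitSketch {

    /** Hypothetical fetched page: body size plus the links found on it. */
    static final class Page {
        final int sizeBytes;
        final List<String> links;

        Page(int sizeBytes, List<String> links) {
            this.sizeBytes = sizeBytes;
            this.links = links;
        }
    }

    /** Hypothetical fetcher; a real one would perform an HTTP request. */
    interface Fetcher {
        Page fetch(String url) throws Exception;
    }

    /** One report per visited URL: its size, or the error that occurred. */
    static final class PageReport {
        final String url;
        final int sizeBytes; // -1 when the fetch failed
        final String error;  // null when the fetch succeeded

        PageReport(String url, int sizeBytes, String error) {
            this.url = url;
            this.sizeBytes = sizeBytes;
            this.error = error;
        }
    }

    /** Breadth-first crawl that stops once visitsLimit pages have been attempted. */
    static List<PageReport> crawl(String start, int visitsLimit, Fetcher fetcher) {
        List<PageReport> reports = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        seen.add(start);
        queue.add(start);
        while (!queue.isEmpty() && reports.size() < visitsLimit) {
            String url = queue.poll();
            try {
                Page page = fetcher.fetch(url);
                reports.add(new PageReport(url, page.sizeBytes, null));
                for (String link : page.links) {
                    if (seen.add(link)) {
                        queue.add(link);
                    }
                }
            } catch (Exception e) {
                // Failed pages still count as a visit and are reported with their error.
                reports.add(new PageReport(url, -1, e.getMessage()));
            }
        }
        return reports;
    }

    /** Hypothetical in-memory site; "/missing" is linked but does not exist. */
    static Fetcher sampleFetcher() {
        Map<String, Page> site = new HashMap<>();
        site.put("/", new Page(120, Arrays.asList("/a", "/missing")));
        site.put("/a", new Page(80, Arrays.asList("/", "/b")));
        site.put("/b", new Page(40, Collections.<String>emptyList()));
        return url -> {
            Page page = site.get(url);
            if (page == null) {
                throw new IllegalStateException("404 for " + url);
            }
            return page;
        };
    }

    public static void main(String[] args) {
        // With a limit of 3, "/b" is discovered but never visited.
        for (PageReport r : crawl("/", 3, sampleFetcher())) {
            System.out.println(r.url + " size=" + r.sizeBytes + " error=" + r.error);
        }
    }
}
```

Counting failed fetches toward the limit is a design choice made here for simplicity; a limit on successful pages only would be equally reasonable.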