Look, this will make the git history look funny, but trimming unnecessary depth from the source tree is a very necessary sanity-preserving measure when dealing with a super-modularized codebase like this one. While it makes the project configuration a bit less conventional, it will save you several clicks every time you jump between modules. Which you'll do a lot, because it's *modul*ar. The src/main/java convention makes a lot of sense for a non-modular project though. This ain't that.
# Crawl Features
These are relatively isolated pieces of search-engine business logic that benefit from being kept separate from the rest of the crawling code.
- content-type - Content Type identification
- crawl-blocklist - IP and URL blocklists
- link-parser - Code for parsing and normalizing links
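To give a rough idea of what "parsing and normalizing links" can involve, here is a minimal sketch built only on `java.net.URI`. It is not the actual link-parser code: the class name `LinkNormalizer`, the method `normalize`, and the specific rules (http/https only, drop the fragment, drop default ports, lowercase the host) are assumptions chosen for illustration.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;
import java.util.Optional;

// Hypothetical helper for illustration; not the API of the link-parser module.
final class LinkNormalizer {

    /**
     * Resolves href against base and returns a normalized absolute URL,
     * or an empty Optional if the result is not a usable http(s) link.
     */
    static Optional<String> normalize(String base, String href) {
        try {
            // Resolve relative links and collapse "." and ".." path segments.
            URI resolved = new URI(base).resolve(href.trim()).normalize();

            String scheme = resolved.getScheme();
            String host = resolved.getHost();
            if (scheme == null || host == null) {
                return Optional.empty(); // e.g. mailto: links or malformed authorities
            }
            scheme = scheme.toLowerCase(Locale.ROOT);
            if (!scheme.equals("http") && !scheme.equals("https")) {
                return Optional.empty();
            }

            String path = resolved.getRawPath();
            if (path == null || path.isEmpty()) {
                path = "/";
            }

            StringBuilder url = new StringBuilder();
            url.append(scheme).append("://").append(host.toLowerCase(Locale.ROOT));

            // Keep the port only when it is not the default for the scheme.
            int port = resolved.getPort();
            boolean defaultPort = port == -1
                    || (port == 80 && scheme.equals("http"))
                    || (port == 443 && scheme.equals("https"));
            if (!defaultPort) {
                url.append(':').append(port);
            }

            url.append(path);
            if (resolved.getRawQuery() != null) {
                url.append('?').append(resolved.getRawQuery());
            }
            // The fragment is intentionally dropped: it does not change the fetched document.
            return Optional.of(url.toString());
        } catch (URISyntaxException | IllegalArgumentException e) {
            // Unparseable base or href (for example, an href containing raw spaces).
            return Optional.empty();
        }
    }
}
```

Under these assumed rules, a call such as `LinkNormalizer.normalize("https://Example.com/a/b/index.html", "../c/./d.html#top")` would resolve the relative path, lowercase the host, and drop the fragment. A real crawler likely needs more than this (query parameter cleanup, lenient handling of badly encoded URLs), which is part of why it helps to keep this logic in its own module.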