CatgirlIntelligenceAgency/code/features-crawl/link-parser
Viktor Lofgren 785d8deadd (crawler) Improve meta-tag redirect handling, add tests for redirects.
Wrote a new test to examine the redirect behavior of the crawler, ensuring that the redirect URL is the URL that is reported in the parquet file.  This works as intended.

While doing this, noticed that the crawler doesn't add links from meta-tag redirects to the crawl frontier. Added logic to handle this case and amended the test case to verify the new behavior. Also added the meta-redirect case to the HtmlDocumentProcessorPlugin, so that a meta redirect to another domain (unlikely, but possible) is counted as a link between documents.
2024-02-01 20:30:43 +01:00
src/main/java/nu/marginalia/link_parser (crawler) Improve meta-tag redirect handling, add tests for redirects. 2024-02-01 20:30:43 +01:00
build.gradle (build) Move unit test configuration to root build.gradle 2023-10-04 12:46:22 +02:00
readme.md The refactoring will continue until morale improves. 2023-03-12 11:42:07 +01:00

Link Parser

Deals with the various cases in link parsing, such as relative links, internal links, external links, pathological links, etc.
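
To make these cases concrete, here is a minimal sketch built only on `java.net.URI`. It is not the module's actual implementation: the class `LinkSketch`, its methods `resolve` and `classify`, and the rule that "internal" means "same host" are all assumptions made for illustration, and the real classes may draw these lines differently.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Optional;

// Hypothetical illustration; names and rules here are assumptions,
// not the API of the link-parser module.
final class LinkSketch {
    enum Kind { INTERNAL, EXTERNAL }

    /** Resolves an href against the page URL, or returns empty if the link is unusable. */
    static Optional<URI> resolve(URI base, String href) {
        if (href == null) return Optional.empty();

        // Drop the fragment; a pure "#anchor" link points back at the same document.
        String cleaned = href.trim();
        int hash = cleaned.indexOf('#');
        if (hash >= 0) cleaned = cleaned.substring(0, hash);
        if (cleaned.isEmpty()) return Optional.empty();

        try {
            // Relative links ("../c.html", "//host/x") are resolved against the base.
            URI resolved = base.resolve(new URI(cleaned));

            // Pathological cases: javascript:, mailto:, and anything without a host.
            String scheme = resolved.getScheme();
            boolean web = "http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme);
            if (!web || resolved.getHost() == null) return Optional.empty();

            return Optional.of(resolved);
        } catch (URISyntaxException e) {
            // Malformed hrefs (e.g. unescaped spaces) are discarded in this sketch.
            return Optional.empty();
        }
    }

    /** Assumed rule: a link is internal when it stays on the same host as the page. */
    static Kind classify(URI base, URI link) {
        return base.getHost().equalsIgnoreCase(link.getHost()) ? Kind.INTERNAL : Kind.EXTERNAL;
    }

    public static void main(String[] args) {
        URI base = URI.create("https://www.example.com/a/b.html");
        for (String href : new String[] { "../c.html", "//other.org/x", "#top", "javascript:void(0)" }) {
            System.out.println(href + " -> " + resolve(base, href).map(u -> u + " " + classify(base, u)));
        }
    }
}
```

A production parser would likely go further than this, for example by repairing rather than discarding badly encoded links, or by comparing registered domains instead of exact hosts.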

Central Classes