(crawling-model) Fix bug where CrawledDocument.getDomain() trimmed www-prefixes

This had the knock-on effect of breaking anchor tag loading in the processor for a lot of domains, since the lookup was made with the wrong domain name.
Viktor Lofgren 2023-12-17 13:53:31 +01:00
parent bcad6492d6
commit 4801c47273

@@ -52,7 +52,7 @@ public class CrawledDocument implements SerializableCrawlData {
         return EdgeUrl
                 .parse(url)
                 .map(EdgeUrl::getDomain)
-                .map(d -> d.domain)
+                .map(Object::toString)
                 .orElse(null);
     }
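
To illustrate why the two mappings disagree, here is a minimal sketch that does not use the project's EdgeUrl or EdgeDomain classes. It assumes only java.net.URI, and the class and method names (DomainSketch, strippedDomain, fullDomain) are hypothetical stand-ins: one returns the host with a leading "www." removed, roughly as the commit message describes the old behaviour, and the other returns the host unchanged.

```java
import java.net.URI;
import java.util.Locale;
import java.util.Optional;

// Hypothetical illustration only; not the project's EdgeUrl/EdgeDomain API.
public class DomainSketch {

    // Stand-in for the old behaviour: the host with a leading "www." trimmed.
    public static String strippedDomain(String url) {
        return host(url)
                .map(h -> h.startsWith("www.") ? h.substring(4) : h)
                .orElse(null);
    }

    // Stand-in for the fixed behaviour: the host exactly as it appears in the URL.
    public static String fullDomain(String url) {
        return host(url).orElse(null);
    }

    // Parses the URL and returns its lower-cased host, or empty if it cannot be parsed.
    private static Optional<String> host(String url) {
        if (url == null) {
            return Optional.empty();
        }
        try {
            return Optional.ofNullable(URI.create(url).getHost())
                    .map(h -> h.toLowerCase(Locale.ROOT));
        } catch (IllegalArgumentException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        String url = "https://www.example.com/index.html";
        System.out.println(strippedDomain(url));
        System.out.println(fullDomain(url));
    }
}
```

Any data keyed by the full host name would be missed by a lookup made with the stripped form, which is the kind of mismatch the commit message describes.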