(crawling-model) Fix bug where CrawledDocument.getDomain() trimmed www-prefixes

This had the knock-on effect of breaking anchor tag loading in the processor for many domains, since the anchor tags were looked up under the wrong domain name.
Viktor Lofgren 2023-12-17 13:53:31 +01:00
parent bcad6492d6
commit 4801c47273


@@ -52,7 +52,7 @@ public class CrawledDocument implements SerializableCrawlData {
         return EdgeUrl
                 .parse(url)
                 .map(EdgeUrl::getDomain)
-                .map(d -> d.domain)
+                .map(Object::toString)
                 .orElse(null);
     }
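
The effect of the one-line change can be illustrated with a minimal sketch. It assumes the domain object stores the subdomain and the remaining domain in separate fields and that toString() joins them back together. SimpleDomain and DomainDemo are hypothetical stand-ins written for this example, not the project's actual classes.

```java
import java.util.Optional;

// Hypothetical, simplified stand-in for a domain value object (assumption:
// the real class keeps the subdomain separate from the rest of the name).
final class SimpleDomain {
    final String subDomain; // "www", or "" when there is none
    final String domain;    // e.g. "example.com"

    SimpleDomain(String host) {
        // Naive split for illustration only: treat a leading "www." as the subdomain.
        if (host.startsWith("www.")) {
            this.subDomain = "www";
            this.domain = host.substring(4);
        } else {
            this.subDomain = "";
            this.domain = host;
        }
    }

    @Override
    public String toString() {
        // Rejoin both parts so the full host name is preserved.
        return subDomain.isEmpty() ? domain : subDomain + "." + domain;
    }
}

final class DomainDemo {
    // Behaviour before the fix: reading only the domain field drops "www.".
    static String domainOld(String host) {
        return Optional.ofNullable(host)
                .map(SimpleDomain::new)
                .map(d -> d.domain)
                .orElse(null);
    }

    // Behaviour after the fix: toString() keeps the subdomain.
    static String domainNew(String host) {
        return Optional.ofNullable(host)
                .map(SimpleDomain::new)
                .map(Object::toString)
                .orElse(null);
    }
}
```

Under these assumptions, domainOld("www.example.com") yields "example.com" while domainNew("www.example.com") yields "www.example.com", so data keyed by the full domain name is only found with the second form.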