(crawling-model) Fix bug where CrawledDocument.getDomain() trimmed www-prefixes
This had the knock-on effect of breaking anchor tag loading in the processor for many domains, since anchor tags were looked up under the wrong domain name.
parent bcad6492d6
commit 4801c47273
@@ -52,7 +52,7 @@ public class CrawledDocument implements SerializableCrawlData {
         return EdgeUrl
                 .parse(url)
                 .map(EdgeUrl::getDomain)
-                .map(d -> d.domain)
+                .map(Object::toString)
                 .orElse(null);
     }
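To illustrate why reading the `domain` field drops the www-prefix while `toString()` keeps it, here is a minimal, self-contained sketch. `SimpleDomain` is a hypothetical stand-in for `EdgeDomain`, and its parsing and `toString()` behaviour are assumptions made for the example, not the project's actual implementation.

```java
import java.util.Optional;

public class DomainDemo {
    // Hypothetical stand-in for EdgeDomain: a subdomain part plus the remaining domain.
    public static final class SimpleDomain {
        public final String subDomain;
        public final String domain;

        SimpleDomain(String subDomain, String domain) {
            this.subDomain = subDomain;
            this.domain = domain;
        }

        // Simplification (assumption): only a leading "www." is split off as the subdomain.
        public static Optional<SimpleDomain> parse(String host) {
            if (host == null || host.isEmpty()) {
                return Optional.empty();
            }
            if (host.startsWith("www.")) {
                return Optional.of(new SimpleDomain("www", host.substring(4)));
            }
            return Optional.of(new SimpleDomain("", host));
        }

        // Assumed to render the full host name, subdomain included.
        @Override
        public String toString() {
            return subDomain.isEmpty() ? domain : subDomain + "." + domain;
        }
    }

    // Old behaviour: reads only the `domain` field, so the "www." prefix is lost.
    public static String trimmedDomain(String host) {
        return SimpleDomain.parse(host).map(d -> d.domain).orElse(null);
    }

    // Fixed behaviour: uses toString(), which keeps the subdomain.
    public static String fullDomain(String host) {
        return SimpleDomain.parse(host).map(Object::toString).orElse(null);
    }

    public static void main(String[] args) {
        System.out.println(trimmedDomain("www.example.com")); // prints "example.com"
        System.out.println(fullDomain("www.example.com"));    // prints "www.example.com"
    }
}
```

Under these assumptions, any lookup keyed on the full host name would miss when fed the trimmed value, which matches the anchor tag loading failure described in the commit message.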