(crawling-model) Fix bug where CrawledDocument.getDomain() trimmed www-prefixes
This had the knock-on effect of breaking anchor tag loading in the processor for a lot of domains, since the lookup would be made under the wrong domain name (for example, without the www-prefix).
parent bcad6492d6
commit 4801c47273
1 changed file with 1 addition and 1 deletion
@@ -52,7 +52,7 @@ public class CrawledDocument implements SerializableCrawlData {
         return EdgeUrl
                 .parse(url)
                 .map(EdgeUrl::getDomain)
-                .map(d -> d.domain)
+                .map(Object::toString)
                 .orElse(null);
     }
 
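The effect of swapping the field access for toString() can be illustrated with a minimal, self-contained sketch. Everything in it is an assumption made for illustration: SimpleDomain is a hypothetical stand-in rather than the project's EdgeDomain class, the host is split naively on its last two labels, and the method names domainBefore and domainAfter are invented here. It only shows why mapping to the bare domain field loses the www-prefix while the full string form keeps it.

```java
import java.net.URI;
import java.util.Optional;

public class DomainDemo {

    /** Hypothetical stand-in for a parsed domain; NOT the project's real class. */
    static final class SimpleDomain {
        final String subDomain; // e.g. "www", or "" when there is none
        final String domain;    // e.g. "example.com"

        SimpleDomain(String host) {
            // Naive split (assumption): treat the last two labels as the domain.
            String[] parts = host.split("\\.");
            if (parts.length <= 2) {
                this.subDomain = "";
                this.domain = host;
            } else {
                this.domain = parts[parts.length - 2] + "." + parts[parts.length - 1];
                this.subDomain = host.substring(0, host.length() - this.domain.length() - 1);
            }
        }

        /** Full host name, including any subdomain. */
        @Override
        public String toString() {
            return subDomain.isEmpty() ? domain : subDomain + "." + domain;
        }
    }

    /** Parses the host out of a URL; empty if the URL is malformed or has no host. */
    static Optional<SimpleDomain> parseDomain(String url) {
        try {
            return Optional.ofNullable(URI.create(url).getHost()).map(SimpleDomain::new);
        } catch (IllegalArgumentException e) {
            return Optional.empty();
        }
    }

    /** Mirrors the old behaviour: only the bare domain field is returned. */
    public static String domainBefore(String url) {
        return parseDomain(url).map(d -> d.domain).orElse(null);
    }

    /** Mirrors the new behaviour: the full string form, subdomain included. */
    public static String domainAfter(String url) {
        return parseDomain(url).map(Object::toString).orElse(null);
    }

    public static void main(String[] args) {
        String url = "https://www.example.com/index.html";
        System.out.println(domainBefore(url)); // example.com
        System.out.println(domainAfter(url));  // www.example.com
    }
}
```

Under these assumptions, any consumer that keys its data on the full host name would miss entries for www.example.com when handed example.com, which matches the kind of mismatch the commit message describes.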