(warc) Update URL encoding in WarcProtocolReconstructor

The URI query string is now URL-encoded in WarcProtocolReconstructor. This ensures that special characters are encoded according to standard URL encoding rules and improves URL validity during crawling.
Viktor Lofgren 2023-12-29 19:41:37 +01:00
parent 68ac8d3e09
commit 0b112cb4d4

@@ -26,7 +26,7 @@ public class WarcProtocolReconstructor {
 requestStringBuilder.append(request.method()).append(" ").append(encodedURL);
 if (uri.getQuery() != null) {
-    requestStringBuilder.append("?").append(uri.getQuery());
+    requestStringBuilder.append("?").append(URLEncoder.encode(uri.getQuery(), StandardCharsets.UTF_8));
 }
 requestStringBuilder.append(" HTTP/1.1\r\n");
 requestStringBuilder.append("Host: ").append(uri.getHost()).append("\r\n");
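
To make the effect of the added call concrete, here is a minimal standalone sketch of what `URLEncoder.encode` produces when handed an entire query string, as in the changed line. The class name `QueryEncodingSketch`, the helper `encodeWholeQuery`, and the example URL are hypothetical and not part of the commit. Note that `URLEncoder` applies form encoding (`application/x-www-form-urlencoded`), so the `=` and `&` separators are escaped along with other special characters, and spaces become `+`.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class QueryEncodingSketch {
    // Hypothetical helper mirroring the call added in the diff:
    // the whole query string is encoded in a single pass.
    static String encodeWholeQuery(String query) {
        return URLEncoder.encode(query, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Example URL chosen for illustration only.
        URI uri = URI.create("https://www.example.com/search?q=hello%20world&page=2");

        // URI.getQuery() returns the decoded query component.
        String query = uri.getQuery();
        System.out.println(query);                   // q=hello world&page=2

        // Form encoding escapes '=' and '&' and turns the space into '+'.
        System.out.println(encodeWholeQuery(query)); // q%3Dhello+world%26page%3D2
    }
}
```

If the parameter separators need to survive in the reconstructed request line, one alternative (an assumption, not something this commit does) would be to encode each key and value separately, or to start from `uri.getRawQuery()`, which returns the query without decoding it.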