Commit Graph

266 Commits

Author SHA1 Message Date
Viktor Lofgren
ef02b712ad (build) Remove false dependency between icp and index-service
This dependency causes the executor service docker image to change when the index service docker image changes.
2024-01-05 13:22:13 +01:00
Viktor Lofgren
fdec565b34 (converter) Add upper 128KB limit to how much HTML we'll parse 2024-01-05 13:22:13 +01:00
Viktor Lofgren
33c2188c87 (feature) More trackers 2024-01-05 13:22:13 +01:00
Viktor Lofgren
b3c8fa74cc (feature) Add another doubleclick variant to the adtech trackers 2024-01-05 13:22:13 +01:00
Viktor Lofgren
e53bb70bef (converter) Penalize chatgpt content farm spam 2024-01-05 13:22:13 +01:00
Viktor Lofgren
9f7df59945 (sideload) Reduce quality assessment.
This will make these sideloaded results rank much better as there is a pretty harsh penalty for large low-q websites.
2024-01-02 12:35:59 +01:00
Viktor Lofgren
faa50bf578 (sideload) Just index based on first paragraph
This seems like it would make the wikipedia search result worse, but it drastically improves the result quality!

This is because wikipedia has a lot of articles that each talk about a lot of irrelevant concepts, and indexing the entire document means tangentially relevant results tend to displace the most relevant results.
2024-01-02 12:35:44 +01:00
Viktor Lofgren
00a974a721 (crawler) Fix bug in resynchronizer where it would fail to capture expected exceptions
This commit also improves the test coverage for this part of the code.
2023-12-27 20:02:17 +01:00
Viktor Lofgren
f811a29f87 (crawler) Fix resource leak in crawler
A 10 MB thread local buffer wasn't static.  Oops.
2023-12-27 16:32:17 +01:00
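
The kind of leak described in this commit can be illustrated with a minimal sketch. The class names `LeakyFetcher` and `FixedFetcher` are hypothetical and not taken from the crawler code; the sketch only assumes that the buffer was held in a `ThreadLocal` instance field. With an instance field, every new object gets its own per-thread buffer; with a static field, each thread has exactly one.

```java
class ThreadLocalBufferSketch {
    // 10 MB, the buffer size mentioned in the commit message
    static final int BUFFER_SIZE = 10 * 1024 * 1024;

    // Hypothetical leaky variant: the ThreadLocal is an instance field, so each
    // instance allocates a fresh 10 MB buffer for every thread that uses it.
    static class LeakyFetcher {
        private final ThreadLocal<byte[]> buffer =
                ThreadLocal.withInitial(() -> new byte[BUFFER_SIZE]);

        byte[] buffer() { return buffer.get(); }
    }

    // Hypothetical fixed variant: the ThreadLocal is static, so all instances
    // share a single buffer per thread.
    static class FixedFetcher {
        private static final ThreadLocal<byte[]> BUFFER =
                ThreadLocal.withInitial(() -> new byte[BUFFER_SIZE]);

        byte[] buffer() { return BUFFER.get(); }
    }
}
```

On a single thread, two `FixedFetcher` instances hand back the same array, while two `LeakyFetcher` instances each allocate their own.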
Viktor Lofgren
9e5fe71f5b (crawler) Switch hash function in crawler
Guava's hashers are a bit allocation hungry, and a big driver of GC churn in the crawler.   This switches to the modified Murmur hash function used throughout Marginalia.
2023-12-27 13:29:00 +01:00
Viktor Lofgren
3ea1ddae22 (crawler) Roll back switch to virtual thread pool in crawler
This seems to cause a resource leak; it seems the http library uses thread locals?
2023-12-26 19:37:34 +01:00
Viktor Lofgren
25d086c4e1 (crawler) Clean up stale warc files
We should probably have an option to keep them, but not by default!
2023-12-25 15:07:36 +01:00
Viktor Lofgren
f779f760c4 (crawler) Even more lenient resyncing 2023-12-25 01:44:18 +01:00
Viktor Lofgren
f18f82e229 (crawler) Write etags and last-modified on reference copy
This commit also fixes a test that broke with a previous change.
2023-12-25 01:40:13 +01:00
Viktor Lofgren
67ef2b45fa (crawler) Reduce logging 2023-12-25 01:10:03 +01:00
Viktor Lofgren
d72e871265 (warc) Fix resync 2023-12-25 01:03:03 +01:00
Viktor Lofgren
4c9bc13309 (warc) Reduce log spam 2023-12-25 00:58:31 +01:00
Viktor Lofgren
84563b0d46 (crawler) Be a bit more conservative about pulling etags and so on if the previous fetch wasn't OK 2023-12-25 00:55:05 +01:00
Viktor Lofgren
c5aab7e8db (warc) Fix NPE in WarcRecorder 2023-12-25 00:54:38 +01:00
Viktor Lofgren
1755b646b8 (warc) Fix NPE in WarcRecorder 2023-12-25 00:48:42 +01:00
Viktor Lofgren
e1a155a9c8 (crawler) Increase growth of crawl jobs
A number of crawl jobs get stuck at about 300 documents, or just under.  This seems to be because we fail to increase the crawl limit, which is based on MAX(200, 1.25 x GOOD_URLS) with a 1.5x modifier applied upon a recrawl.  GOOD_URLS is based on how many documents successfully process, which is typically fairly small.  Switching to KNOWN_URLS should let this grow faster.

The SQL query in the DbCrawlSpecProvider class has been updated; 'GOOD_URLS' has been replaced with 'KNOWN_URLS'. This update ensures the correct data is selected from the DOMAIN_METADATA table.

The floor is also increased to 250 from 200.
2023-12-23 13:22:10 +01:00
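
The arithmetic behind the new limit can be sketched as follows. This is a plain Java illustration, not the SQL query in DbCrawlSpecProvider; the class and method names are hypothetical, and it assumes the 1.5x recrawl modifier is applied to the result of the MAX expression, which the commit message does not spell out.

```java
class CrawlLimitSketch {
    static final int FLOOR = 250;                 // raised from 200 in this commit
    static final double GROWTH = 1.25;            // multiplier on KNOWN_URLS
    static final double RECRAWL_MODIFIER = 1.5;   // applied upon a recrawl

    // Assumed formula: MAX(250, 1.25 x KNOWN_URLS), times 1.5 on a recrawl.
    static int crawlLimit(int knownUrls, boolean isRecrawl) {
        double limit = Math.max(FLOOR, GROWTH * knownUrls);
        if (isRecrawl) {
            limit *= RECRAWL_MODIFIER;
        }
        return (int) limit;
    }
}
```

Under these assumptions, a domain with 100 known URLs stays at the floor of 250 (375 on a recrawl), while one with 400 known URLs gets a limit of 500 (750 on a recrawl).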
Viktor Lofgren
dc773c5c20 (adjacencies) Clean up AdjacenciesLoader
Makes JDBC batching more consistent and adds a test case for the loader.
2023-12-21 14:14:22 +01:00
Viktor Lofgren
b6253b03c2 (adjacencies) Fix bug in AdjacenciesLoader
This fixes a bug where a prepared statement was created before the table it was supposed to insert into was created.  This fails and does nothing.

Furthermore, added the logging that would have warned about this failure, had it been in place.
2023-12-21 13:12:31 +01:00
Viktor Lofgren
a5bc29245b (cleanup) Remove vestigial support for WARC crawl data streams 2023-12-20 15:46:21 +01:00
Viktor Lofgren
bfae478251 Refactor CrawlerRevisitor for better consistency 2023-12-20 15:21:49 +01:00
Viktor Lofgren
a7cd490593 (minor) Remove dead code. 2023-12-19 18:58:33 +01:00
Viktor Lofgren
dd8fb04886 (converter) Add sizeloadSizeAdvice field to several ProcessedDomain
Since the sideloaders don't populate the documents list in ProcessedDomain to keep the memory footprint manageable, the code that estimates knownUrls etc. will set them to zero, which has negative effects on their ranking.  This change will populate them with a bullshit value within a sane ballpark, ensuring that these domains show up in the rankings.
2023-12-19 18:37:51 +01:00
Viktor Lofgren
3a56a06c4f (warc) Add fields for etags and last-modified headers to the new crawl data formats
Make some temporary modifications to the CrawledDocument model to support both a "big string" style headers field like in the old formats, and explicit fields as in the new formats.  This is a bit awkward to deal with, but it's a necessity until we migrate off the old formats entirely.

The commit also adds a few tests to this logic.
2023-12-18 17:45:54 +01:00
Viktor Lofgren
126ac3816f (converter) Reduce queue size in ConverterWriter
The size of the ArrayBlockingQueue in ConverterWriter.java has been reduced from 4 to 1. This change aims to reduce the memory utilization by not having fully processed domains piling up in RAM.  This may cause the writer to go idle in waiting for new data, but that may be preferable to an OOM.
2023-12-18 13:42:40 +01:00
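
The effect of the smaller queue can be shown with a small sketch. The class and method names are hypothetical and the strings stand in for processed domains; only `ArrayBlockingQueue` and its capacity are taken from the commit message. The sketch uses the non-blocking `offer` to make the behaviour observable; a producer calling `put` would block at the same point instead.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class BoundedQueueSketch {
    // Returns whether a second item fits in the queue while the first is still
    // waiting to be consumed.
    static boolean secondOfferAccepted(int capacity) {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(capacity);
        queue.offer("domain-1");
        // offer() returns false immediately when the queue is full
        return queue.offer("domain-2");
    }
}
```

With a capacity of 4 the second item is accepted and sits in memory; with a capacity of 1 it is rejected until the consumer has taken the first, which is the back-pressure the commit relies on.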
Viktor Lofgren
d02bed1a55 (loader) Optimize DomainLoaderService for faster startups
Initialization parameters in DomainLoaderService and DomainIdRegistry have been updated to improve performance. This is done by adding sane default sizes to the hash tables involved, reducing GC churn, but also by setting a sensible fetch size to the queries used, and not fetching irrelevant information such as the domain name.
2023-12-18 13:15:10 +01:00
Viktor Lofgren
b7ed0ce537 (loader) Reset count after executing batch in DomainLoaderService
This should greatly speed up starting the loader process.
2023-12-18 12:43:53 +01:00
Viktor Lofgren
c422f0b9fb (geo-ip) Tidy up error handling 2023-12-17 16:06:51 +01:00
Viktor Lofgren
c92f1b8df8 (geo-ip) Revert removal of ip2location logic
We do both ip2location and ASN data.

The change also adds some keywords based on autonomous system information, on a somewhat experimental basis.  It would be neat to be able to e.g. exclude cloud services or just e.g. cloudflare from the search results.
2023-12-17 15:03:00 +01:00
Viktor Lofgren
bde68ba48b Merge branch 'master' into asn-info 2023-12-17 14:00:23 +01:00
Viktor Lofgren
bf44805e69 (*) Rename EdgeDomain$domain into topDomain
This variable had a very confusing name, and was dangerously easy to use in the wrong place with the result of getting something that only works as expected half the time.

Ideally this class needs an overhaul, the assumptions it makes about domain names aren't great.
2023-12-17 14:00:07 +01:00
Viktor Lofgren
edf9aa2c23 (*) Rename EdgeDomain$domain into topDomain
This variable had a very confusing name, and was dangerously easy to use in the wrong place with the result of getting something that only works as expected half the time.

Ideally this class needs an overhaul, the assumptions it makes about domain names aren't great.
2023-12-17 13:59:54 +01:00
Viktor Lofgren
bcad6492d6 (sideloader) Fix integration problems with sideloaders
In encyclopedia, add a class "mw-content-text" that the WikiSpecialization class is looking for during pruning to give the articles a more fair treatment.

Also add generator keywords based on the generator type provided, to ensure that these documents show up in appropriate filters.

Further, add a new document flag value 'Sideloaded' to be able to distinguish these entries.
2023-12-17 13:28:17 +01:00
Viktor Lofgren
d7bd540683 (*) Replace the ip2location IP geolocation data with ASN information from apnic.net.
Doesn't really make sense to use ip2location as a middle man for information that is already freely available...
2023-12-16 21:55:04 +01:00
Viktor Lofgren
3113b5a551 (warc) Filter WarcResponses based on X-Robots-Tag
There really is no fantastic place to put this logic, but we need to remove entries with an X-Robots-Tag header where that header indicates it doesn't want to be crawled by Marginalia.
2023-12-16 15:58:27 +01:00
Viktor Lofgren
54ed3b86ba (minor) Remove dead code. 2023-12-15 21:49:35 +01:00
Viktor Lofgren
2001d0f707 (converter) Add @Deprecated annotation to a few fields that should no longer be used. 2023-12-15 21:42:00 +01:00
Viktor Lofgren
0f9cd9c87d (warc) More accurate filtering of advisory records
Further create records for resources that were blocked due to robots.txt; as well as tests to verify this happens.
2023-12-15 21:37:02 +01:00
Viktor Lofgren
2e7db61808 (warc) More accurate filtering of advisory records
We want to mute some of these records so that they don't produce documents, but in some cases we want a document to be produced for accounting purposes.

Added improved tests that reach for known resources on www.marginalia.nu to test the behavior when encountering bad content type and 404s.

The commit also adds some safety try-catch:es around the charset handling, as it may sometimes explode when fed incorrect data, and we do be guessing...
2023-12-15 21:31:16 +01:00
Viktor Lofgren
5329968155 (crawler) Update CrawlingThenConvertingIntegrationTest
This commit updates CrawlingThenConvertingIntegrationTest with additional tests for invalid, redirecting, and blocked domains. Improvements have also been made to filter out irrelevant entries in ParquetSerializableCrawlDataStream.
2023-12-15 21:04:06 +01:00
Viktor Lofgren
2e536e3141 (crawler) Add timestamp to CrawledDocument records
This update includes the addition of timestamps to the parquet format for crawl data, as extracted from the Warc stream.

The parquet format stores the timestamp as a 64 bit long, seconds since unix epoch, without a logical type.  This is to avoid having to do format conversions when writing and reading the data.

This parquet field populates the timestamp field in CrawledDocument.
2023-12-15 20:23:27 +01:00
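
The timestamp representation described here amounts to a plain epoch-seconds round trip. A minimal sketch using `java.time.Instant`, with hypothetical class and method names; how the value is actually read from the Warc stream and written to parquet is not shown in the commit message and is left out.

```java
import java.time.Instant;

class EpochSecondsSketch {
    // Value as it would be stored: a 64 bit long, seconds since the unix epoch
    static long toStored(Instant timestamp) {
        return timestamp.getEpochSecond();
    }

    // Reading it back needs no format conversion beyond this call
    static Instant fromStored(long epochSeconds) {
        return Instant.ofEpochSecond(epochSeconds);
    }
}
```

One consequence of storing whole seconds is that any sub-second part of the original timestamp is dropped on the way in.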
Viktor Lofgren
cf935a5331 (converter) Read cookie information
Add an optional new field to CrawledDocument containing information about whether the domain has cookies.  This was previously on the CrawledDomain object, but since the WarcFormat requires us to write a WarcInfo object at the start of a crawl rather than at the end, this information is unobtainable when creating the CrawledDomain object.

Also fix a bug in the deduplication logic in the DomainProcessor class that caused a test to break.
2023-12-15 18:09:53 +01:00
Viktor Lofgren
fa81e5b8ee (warc) Use a non-standard WARC header to convey information about whether a website uses cookies
This information is then propagated to the parquet file as a boolean.

For documents that are copied from the reference, use whatever value we last saw.  This isn't 100% deterministic and may result in false negatives, but permits websites that used cookies but have stopped to repent and have the change reflect in the search engine more quickly.
2023-12-15 16:37:53 +01:00
Viktor Lofgren
9fea22b90d (warc) Further tidying
This commit includes mostly exception handling, error propagation, a few bug fixes and minor changes to log formatting. The CrawlDelayTimer, HTTP 429 responses and IOException responses are now more accurately handled.

A non-standard WarcXEntityRefused WARC record has also been introduced, essentially acting as a rejected 'response' with different semantics.

Besides these, several existing features have been refined, such as URL encoding, crawl depth incrementing and usage of Content-Length headers.
2023-12-15 15:38:23 +01:00
Viktor Lofgren
0889b6d247 (warc) Clean up parquet conversion
This commit further cleans up the warc->parquet conversion. It fixes issues with redirect handling in WarcRecorder and adds support for information about redirects and errors due to probe failure.

It also refactors the fetch result, body extraction and content type abstractions.
2023-12-14 20:39:40 +01:00
Viktor Lofgren
1328bc4938 (warc) Clean up parquet conversion
This commit cleans up the warc->parquet conversion.  Records with a http status other than 200 are now included.

The commit also fixes a bug where the robots.txt parser would be fed the full HTTP response (and choke), instead of the body.

The DocumentBodyExtractor code has also been cleaned up, and now offers a way of just getting the byte[] representation for later processing, as conversion to and from strings is a bit wasteful.
2023-12-14 16:05:48 +01:00