CatgirlIntelligenceAgency

Author	SHA1	Message	Date
Viktor Lofgren	3113b5a551	(warc) Filter WarcResponses based on X-Robots-Tags There really is no fantastic place to put this logic, but we need to remove entries with an X-Robots-Tags header where that header indicates it doesn't want to be crawled by Marginalia.	2023-12-16 15:58:27 +01:00
Viktor Lofgren	54ed3b86ba	(minor) Remove dead code.	2023-12-15 21:49:35 +01:00
Viktor Lofgren	fa81e5b8ee	(warc) Use a non-standard WARC header to convey information about whether a website uses cookies This information is then propagated to the parquet file as a boolean. For documents that are copied from the reference, use whatever value we last saw. This isn't 100% deterministic and may result in false negatives, but permits websites that used cookies but have stopped to repent and have the change reflect in the search engine more quickly.	2023-12-15 16:37:53 +01:00
Viktor Lofgren	9fea22b90d	(warc) Further tidying This commit includes mostly exception handling, error propagation, a few bug fixes and minor changes to log formatting. The CrawlDelayTimer, HTTP 429 responses and IOException responses are now more accurately handled. A non-standard WarcXEntityRefused WARC record has also been introduced, essentially acting as a rejected 'response' with different semantics. Besides these, several existing features have been refined, such as URL encoding, crawl depth incrementing and usage of Content-Length headers.	2023-12-15 15:38:23 +01:00
Viktor Lofgren	0889b6d247	(warc) Clean up parquet conversion This commit further cleans up the warc->parquet conversion. It fixes issues with redirect handling in WarcRecorder, adds support information about redirects and errors due to probe failure. It also refactors the fetch result, body extraction and content type abstractions.	2023-12-14 20:39:40 +01:00
Viktor Lofgren	1328bc4938	(warc) Clean up parquet conversion This commit cleans up the warc->parquet conversion. Records with a http status other than 200 are now included. The commit also fixes a bug where the robots.txt parser would be fed the full HTTP response (and choke), instead of the body. The DocumentBodyExtractor code has also been cleaned up, and now offers a way of just getting the byte[] representation for later processing, as conversion to and from strings is a bit wasteful.	2023-12-14 16:05:48 +01:00
Viktor Lofgren	440e097d78	(crawler) WIP integration of WARC files into the crawler and converter process. This commit is in a pretty rough state. It refactors the crawler fairly significantly to offer better separation of concerns. It replaces the zstd compressed json files used to store crawl data with WARC files entirely, and the converter is modified to be able to consume this data. This works, -ish. There appears to be some bug relating to reading robots.txt, and the X-Robots-Tag header is no longer processed either. A problem is that the WARC files are a bit too large. It will probably be likely to introduce a new format to store the crawl data long term, something like parquet; and use WARCs for intermediate storage to enable the crawler to be restarted without needing a recrawl.	2023-12-13 15:33:42 +01:00
Viktor Lofgren	b74a3ebd85	(crawler) WIP integration of WARC files into the crawler process. At this stage, the crawler will use the WARCs to resume a crawl if it terminates incorrectly. This is a WIP commit, since the warc files are not fully incorporated into the work flow, they are deleted after the domain is crawled. The commit also includes fairly invasive refactoring of the crawler classes, to accomplish better separation of concerns.	2023-12-11 19:32:58 +01:00
Viktor Lofgren	45987a1d98	Merge branch 'master' into warc	2023-12-11 14:32:35 +01:00
Viktor Lofgren	f655ec5a5c	(*) Refactor GeoIP-related code In this commit, GeoIP-related classes are refactored and relocated to a common library as they are shared across multiple services. The crawler is refactored to enable the GeoIpBlocklist to use the new GeoIpDictionary as the base of its decisions. The converter is modified ot query this data to add a geoip:-keyword to documents to permit limiting a search to the country of the hosting server. The commit also adds due BY-SA attribution in the search engine footer for the source of the IP geolocation data.	2023-12-10 17:30:43 +01:00
Viktor Lofgren	e6a1052ba7	Simplify CrawlerMain, removing the CrawlerLimiter and using a global HttpFetcher with a virtual thread pool dispatcher instead of the default.	2023-12-08 20:24:01 +01:00
Viktor Lofgren	968dce50fc	(crawler) Refactored IpInterceptingNetworkInterceptor for clarity.	2023-12-08 17:45:46 +01:00
Viktor Lofgren	3bbffd3c22	(crawler) Refactor HttpFetcher to integrate WarcRecorder Partially hook in the WarcRecorder into the crawler process. So far it's not read, but should record the crawled documents. The WarcRecorder and HttpFetcher classes were also refactored and broken apart to be easier to reason about.	2023-12-08 17:12:51 +01:00
Viktor Lofgren	072b5fcd12	Implement Warc-recording wrapper for OkHttp3 client This is a first step of using WARC as an intermediate flight recorder style step in the crawler, ultimately aimed at being able to resume crawls if the crawler is restarted. This component is currently not hooked into anything. The OkHttp3 client wrapper class 'WarcRecordingFetcherClient' was implemented for web archiving. This allows for the recording of HTTP requests and responses. New classes were introduced, 'WarcDigestBuilder', 'IpInterceptingNetworkInterceptor', and 'WarcProtocolReconstructor'. The JWarc dependency was added to the build.gradle file, and relevant unit tests were also introduced. Some HttpFetcher-adjacent structural changes were also done for better organization.	2023-12-08 13:49:16 +01:00
Viktor Lofgren	064265b0b9	(crawler) Move content type/charset sniffing to a separate microlibrary This functionality needs to be accessed by the WarcSideloader, which is in the converter. The resultant microlibrary is tiny, but I think in this case it's justifiable.	2023-12-07 15:16:37 +01:00
Viktor Lofgren	166a391eae	(docs) Improve architectural documentation for the crawler.	2023-11-30 21:30:57 +01:00
Viktor Lofgren	09917837d0	(process) Ensure construction exceptions are logged Wrapping these exceptions in a try-catch and logging them with slf4j will ensure they end up in the process logs. The way it worked using the default exception handler, they'd print on console (which nothing captures!), leading to a very annoying debugging experience.	2023-11-22 18:32:06 +01:00
Viktor Lofgren	7617b4cbc2	(crawler) Fix NPE in crawler caused by not having fetched the domains list yet	2023-11-06 18:16:38 +01:00
Viktor Lofgren	ebd10a5f28	(crawler) Integrate atags.parquet with the crawler so that "important" URLs are prioritized	2023-11-06 16:14:58 +01:00
Viktor Lofgren	8f74dbdbb4	(crawler) Set more lenient parameters for recrawl	2023-10-30 11:35:30 +01:00
Viktor Lofgren	fd5a7eac87	(crawler) Exit crawler retriever on thread interrupted	2023-10-30 11:34:16 +01:00
Viktor Lofgren	a497e4c920	(crawler) Terminate crawler after a few hours of no progress	2023-10-26 12:49:28 +02:00
Viktor Lofgren	81dd3809e9	(*) WIP Add node affinity to EC_DOMAIN Very messy commit due to fractalline yak shaving	2023-10-19 17:48:34 +02:00
Viktor Lofgren	4baf9527d7	() WIP Control GUI redesign, executor-service, multi-node mq This turned out to be very difficult to do in small isolated steps. Design overhaul of the control gui using bootstrap * Move the actors out of control-service into to a new executor-service, that can be run on multiple nodes * Add node-affinity to message queue	2023-10-14 12:08:43 +02:00
Viktor Lofgren	199c459697	(*) Add node-affinity to services, processes and file storage.	2023-10-10 12:32:22 +02:00
Viktor Lofgren	3889c4bdd9	(refactor) Remove features-search and update documentation	2023-10-09 15:12:30 +02:00
Viktor Lofgren	c51159672e	(build) Move unit test configuration to root build.gradle	2023-10-04 12:46:22 +02:00
Viktor Lofgren	dbe9235f3a	(*) Upgrade to JDK21 with preview enabled. ... also move some common configuration into the root build.gradle-file. Support for JDK21 in lombok is a bit sketchy at the moment, but it seems to work. This upgrade is kind of important as the new index construction really benefits from Arena based lifecycle control over off-heap memory.	2023-09-24 10:38:59 +02:00
Viktor Lofgren	d895f83520	(blocking-thread-pool) Move DumbThreadPool to its own micro-library Also rename it to SimpleBlockingThreadPool.	2023-09-20 10:11:49 +02:00
Viktor Lofgren	5c040f7a46	(crawl-spec) Parquetify crawl spec * Crawl-specs are now parquet files * Deprecate the crawl-job-extractor tool	2023-09-17 09:41:34 +02:00
Viktor Lofgren	eaeb23d41e	(refactor) Remove converting-model package completely	2023-09-14 11:21:44 +02:00
Viktor Lofgren	39c1857c61	(heartbeat, reverse-index) Better heartbeat mocking, improved heartbeats for reverse index construction.	2023-08-29 13:07:55 +02:00
Viktor Lofgren	ebc84c22fb	Upgrade antique lombok plugin This permits tests to run on JDK20 environments.	2023-08-23 14:34:32 +00:00
Viktor Lofgren	aa0d256d6a	Upgrade code to Java 20. * Change language version * Upgrade Lombok to a JDK20 compatible version	2023-08-23 13:37:49 +00:00
Viktor Lofgren	1d486bddee	(crawler) Reduce log spam	2023-08-16 11:12:09 +02:00
Viktor Lofgren	e7192a9cad	(mq) Refactor mq and actor library and move it to libraries out of common	2023-08-15 10:53:23 +02:00
Viktor Lofgren	251fc63b42	(*) Fix merge gore	2023-08-09 13:33:28 +02:00
Viktor	52e2ab45bf	Merge branch 'master' into master-control-program	2023-08-07 12:53:43 +02:00
Viktor Lofgren	c22feaf42e	(crawl) Make crawler limiter request a GC when throttling	2023-08-03 17:58:18 +02:00
Viktor Lofgren	e5c9791b14	(crawler) Fix rare ConcurrentModificationError due to HashSet	2023-08-01 17:28:29 +02:00
Viktor Lofgren	37c4cc68ed	TODO	2023-07-31 10:34:42 +02:00
Viktor Lofgren	5c071ce4d3	(crawler) Clean up the code and remove unnecessary logging	2023-07-30 16:53:39 +02:00
Viktor Lofgren	caf3d231a8	(crawler) Fix rare issue with NPEs if the crawl queue is empty	2023-07-30 16:53:13 +02:00
Viktor Lofgren	730e8f74e4	(crawler) Even more memory optimizations. * Fix minor resource leak in zstd streams * Use pools for zstd streams * Reduce the SSL session cache size	2023-07-30 14:19:55 +02:00
Viktor Lofgren	aba134284f	(crawler) Reduce log spam	2023-07-29 19:22:58 +02:00
Viktor Lofgren	2a6183f9e0	(crawler) Dynamic throttling of the number of active crawl jobs permitted to spawn; reduce queue size.	2023-07-29 19:20:09 +02:00
Viktor Lofgren	ee143bbc48	(crawler, converter) Fix so that DumbThreadPool actually waits for termination as intended.	2023-07-29 19:19:09 +02:00
Viktor Lofgren	05ba3bab96	(crawler) Make SitemapRetriever abort on too large sitemaps.	2023-07-29 19:18:12 +02:00
Viktor Lofgren	d2b6b2044c	(crawler) Reduce log spam in HttpFetcherImpl	2023-07-29 19:18:12 +02:00
Viktor Lofgren	7611b7900d	(crawler) Reduce long term memory allocation in DomainCrawlFrontier (crawler) Reduce long term memory allocation in DomainCrawlFrontier	2023-07-29 19:18:12 +02:00

1 2

85 Commits