CatgirlIntelligenceAgency

Author	SHA1	Message	Date
Viktor Lofgren	5329968155	(crawler) Update CrawlingThenConvertingIntegrationTest This commit updates CrawlingThenConvertingIntegrationTest with additional tests for invalid, redirecting, and blocked domains. Improvements have also been made to filter out irrelevant entries in ParquetSerializableCrawlDataStream.	2023-12-15 21:04:06 +01:00
Viktor Lofgren	2e536e3141	(crawler) Add timestamp to CrawledDocument records This update includes the addition of timestamps to the parquet format for crawl data, as extracted from the Warc stream. The parquet format stores the timestamp as a 64 bit long, seconds since unix epoch, without a logical type. This is to avoid having to do format conversions when writing and reading the data. This parquet field populates the timestamp field in CrawledDocument.	2023-12-15 20:23:27 +01:00
Viktor Lofgren	cf935a5331	(converter) Read cookie information Add an optional new field to CrawledDocument containing information about whether the domain has cookies. This was previously on the CrawledDomain object, but since the WarcFormat requires us to write a WarcInfo object at the start of a crawl rather than at the end, this information is unobtainable when creating the CrawledDomain object. Also fix a bug in the deduplication logic in the DomainProcessor class that caused a test to break.	2023-12-15 18:09:53 +01:00
Viktor Lofgren	fa81e5b8ee	(warc) Use a non-standard WARC header to convey information about whether a website uses cookies This information is then propagated to the parquet file as a boolean. For documents that are copied from the reference, use whatever value we last saw. This isn't 100% deterministic and may result in false negatives, but permits websites that used cookies but have stopped to repent and have the change reflect in the search engine more quickly.	2023-12-15 16:37:53 +01:00
Viktor Lofgren	9fea22b90d	(warc) Further tidying This commit includes mostly exception handling, error propagation, a few bug fixes and minor changes to log formatting. The CrawlDelayTimer, HTTP 429 responses and IOException responses are now more accurately handled. A non-standard WarcXEntityRefused WARC record has also been introduced, essentially acting as a rejected 'response' with different semantics. Besides these, several existing features have been refined, such as URL encoding, crawl depth incrementing and usage of Content-Length headers.	2023-12-15 15:38:23 +01:00
Viktor Lofgren	0889b6d247	(warc) Clean up parquet conversion This commit further cleans up the warc->parquet conversion. It fixes issues with redirect handling in WarcRecorder, and adds support for information about redirects and errors due to probe failure. It also refactors the fetch result, body extraction and content type abstractions.	2023-12-14 20:39:40 +01:00
Viktor Lofgren	1328bc4938	(warc) Clean up parquet conversion This commit cleans up the warc->parquet conversion. Records with a http status other than 200 are now included. The commit also fixes a bug where the robots.txt parser would be fed the full HTTP response (and choke), instead of the body. The DocumentBodyExtractor code has also been cleaned up, and now offers a way of just getting the byte[] representation for later processing, as conversion to and from strings is a bit wasteful.	2023-12-14 16:05:48 +01:00
Viktor Lofgren	787a20cbaa	(crawling-model) Implement a parquet format for crawl data This is not hooked into anything yet. The change also makes modifications to the parquet-floor library to support reading and writing of byte[] arrays. This is desirable since we may in the future want to support inputs that are not text-based, and codifying the assumption that each document is a string will definitely cause us grief down the line.	2023-12-13 16:22:19 +01:00
Viktor Lofgren	440e097d78	(crawler) WIP integration of WARC files into the crawler and converter process. This commit is in a pretty rough state. It refactors the crawler fairly significantly to offer better separation of concerns. It replaces the zstd compressed json files used to store crawl data with WARC files entirely, and the converter is modified to be able to consume this data. This works, -ish. There appears to be some bug relating to reading robots.txt, and the X-Robots-Tag header is no longer processed either. A problem is that the WARC files are a bit too large. It will probably be necessary to introduce a new format to store the crawl data long term, something like parquet, and use WARCs for intermediate storage to enable the crawler to be restarted without needing a recrawl.	2023-12-13 15:33:42 +01:00
Viktor Lofgren	b74a3ebd85	(crawler) WIP integration of WARC files into the crawler process. At this stage, the crawler will use the WARCs to resume a crawl if it terminates incorrectly. This is a WIP commit, since the warc files are not fully incorporated into the work flow, they are deleted after the domain is crawled. The commit also includes fairly invasive refactoring of the crawler classes, to accomplish better separation of concerns.	2023-12-11 19:32:58 +01:00
Viktor Lofgren	45987a1d98	Merge branch 'master' into warc	2023-12-11 14:32:35 +01:00
Viktor Lofgren	d0982e7ba5	(converter) Add error handling and lazy load external domain links The converter was not properly initiating the external links for each domain, causing an NPE in conversion. This needs to be loaded later since we don't know the domain we're processing until we've seen it in the crawl data. Also made some refactorings to make finding converter bugs easier, and finding the related domain less awkward from the SerializableCrawlData interface.	2023-12-09 12:33:39 +01:00
Viktor Lofgren	064265b0b9	(crawler) Move content type/charset sniffing to a separate microlibrary This functionality needs to be accessed by the WarcSideloader, which is in the converter. The resultant microlibrary is tiny, but I think in this case it's justifiable.	2023-12-07 15:16:37 +01:00
Viktor Lofgren	10fc489822	(converter) More robust filename resolution	2023-10-20 14:16:03 +02:00
Viktor Lofgren	2bf0c4497d	(*) Tool for unfcking old crawl data so that it aligns with the new style IDs	2023-10-19 17:48:34 +02:00
Viktor Lofgren	5c040f7a46	(crawl-spec) Parquetify crawl spec * Crawl-specs are now parquet files * Deprecate the crawl-job-extractor tool	2023-09-17 09:41:34 +02:00
Viktor	52e2ab45bf	Merge branch 'master' into master-control-program	2023-08-07 12:53:43 +02:00
Viktor Lofgren	5c071ce4d3	(crawler) Clean up the code and remove unnecessary logging	2023-07-30 16:53:39 +02:00
Viktor Lofgren	730e8f74e4	(crawler) Even more memory optimizations. * Fix minor resource leak in zstd streams * Use pools for zstd streams * Reduce the SSL session cache size	2023-07-30 14:19:55 +02:00
Viktor Lofgren	667b0ca0b0	(converter, WIP) Refactor CrawledDomainReader to not return iterators. Instead return a closable class SerializableCrawlDataStream.	2023-07-24 16:28:30 +02:00
Viktor Lofgren	a56953c798	(converter, WIP) Refactor converter to not have to load everything into RAM.	2023-07-24 15:25:09 +02:00
Viktor Lofgren	789e8eea85	(crawler) Clean up and refactor the code a bit	2023-07-23 19:08:38 +02:00
Viktor Lofgren	35b29e4f9e	(crawler) Clean up and refactor the code a bit	2023-07-23 19:06:37 +02:00
Viktor Lofgren	c069c8c182	(crawler) Clean up crawl data reference and recrawl logic	2023-07-22 18:42:21 +02:00
Viktor Lofgren	58f2f86ea8	(crawler) Don't read all the data into RAM when doing a refresh-crawl	2023-07-21 19:47:52 +02:00
Viktor Lofgren	f91d92cccb	(crawler) WIP	2023-07-20 21:05:16 +02:00
Viktor Lofgren	d7ab21fe34	(*) Refactor Control Service and processes	2023-07-17 21:20:31 +02:00
Viktor Lofgren	bca4bbb6c8	(*) Refactor MQ and MQSM	2023-07-17 13:57:32 +02:00
Viktor Lofgren	8b74e3aa0d	(*) File Storage WIP	2023-07-14 17:08:10 +02:00
Viktor Lofgren	74caf9e38a	(processes) Remove forEach-constructs in favor of iterators.	2023-07-12 17:47:36 +02:00
Viktor Lofgren	4c016b0318	Process monitoring * Also refactored the SQL tables a bit	2023-07-11 14:46:21 +02:00
Viktor Lofgren	dbb758d1a8	Minor: Better error handling in crawled domain reader	2023-07-10 18:58:43 +02:00
Viktor Lofgren	da8bcc6e24	Minor: Don't blow up the reader on a corrupted file	2023-07-10 18:58:43 +02:00
Viktor Lofgren	17db23c2c1	Minor: Better error handling in crawled domain reader	2023-07-07 19:48:32 +02:00
Viktor Lofgren	040bea1f75	Minor: Don't blow up the reader on a corrupted file	2023-07-07 19:48:11 +02:00
Viktor Lofgren	baff83912e	Small optimizations that shave an hour off processing time :D	2023-06-28 15:41:10 +02:00
Viktor Lofgren	fbdedf53de	Fix bug in CrawlerRetreiver ... where the root URL wasn't always added properly to the front of the crawl queue.	2023-06-27 15:50:38 +02:00
Viktor Lofgren	7d741ff499	Fix so crawl plan replay doesn't crash if a file is missing.	2023-06-27 10:57:54 +02:00
Viktor Lofgren	16e37672fc	Bugfix crawl plan, doesn't use rewrite() everywhere	2023-03-30 15:41:07 +02:00
Viktor Lofgren	449471a076	Yet more restructuring. Improved search result ranking.	2023-03-16 21:35:54 +01:00
Viktor Lofgren	d82532b7f1	More restructuring, big bug fixes in keyword extraction.	2023-03-13 17:39:53 +01:00

41 Commits
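
Commit 2e536e3141 above describes storing the crawl timestamp in parquet as a plain 64 bit long, seconds since the Unix epoch, with no logical type. A minimal sketch of what that round trip might look like in Java follows. The class and method names are hypothetical and are not taken from the repository; only `java.time.Instant` from the standard library is used.

```java
import java.time.Instant;

// Hypothetical illustration, not code from the repository.
public class EpochSecondsSketch {

    // Value that would be written to the 64 bit long column:
    // whole seconds since 1970-01-01T00:00:00Z (sub-second precision is dropped).
    static long toEpochSeconds(Instant timestamp) {
        return timestamp.getEpochSecond();
    }

    // Value that would be handed back when reading the column.
    static Instant fromEpochSeconds(long epochSeconds) {
        return Instant.ofEpochSecond(epochSeconds);
    }

    public static void main(String[] args) {
        // Example timestamp, assumed here to come from a WARC record date.
        Instant warcDate = Instant.parse("2023-12-15T19:23:27Z");

        long stored = toEpochSeconds(warcDate);
        Instant restored = fromEpochSeconds(stored);

        System.out.println(stored + " -> " + restored);
    }
}
```

Because the column is a bare long, the reader and writer only have to agree on the unit (seconds) and the epoch; no parquet-level timestamp conversion is involved, which appears to be the motivation given in the commit message.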