CatgirlIntelligenceAgency

Author	SHA1	Message	Date
Viktor Lofgren	e237df4a10	(converter) Use a dumb thread pool instead of Java's executor service.	2023-07-28 18:15:16 +02:00
Viktor Lofgren	667b0ca0b0	(converter, WIP) Refactor CrawledDomainReader to not return iterators. Instead return a closable class SerializableCrawlDataStream.	2023-07-24 16:28:30 +02:00
Viktor Lofgren	a56953c798	(converter, WIP) Refactor converter to not have to load everything into RAM.	2023-07-24 15:25:09 +02:00
Viktor Lofgren	35b29e4f9e	(crawler) Clean up and refactor the code a bit	2023-07-23 19:06:37 +02:00
Viktor Lofgren	69f333c0bf	(crawler) Clean up and refactor the code a bit	2023-07-23 18:59:14 +02:00
Viktor Lofgren	c069c8c182	(crawler) Clean up crawl data reference and recrawl logic	2023-07-22 18:42:21 +02:00
Viktor Lofgren	9e4aa7da7c	(crawler) Support for X-Robots-Tag	2023-07-22 18:42:21 +02:00
Viktor Lofgren	58f2f86ea8	(crawler) Don't read all the data into RAM when doing a refresh-crawl	2023-07-21 19:47:52 +02:00
Viktor Lofgren	f91d92cccb	(crawler) WIP	2023-07-20 21:05:16 +02:00
Viktor Lofgren	5deec63667	(work-log) Better tests	2023-07-12 18:04:06 +02:00
Viktor Lofgren	74caf9e38a	(processes) Remove forEach-constructs in favor of iterators.	2023-07-12 17:47:36 +02:00
Viktor Lofgren	4c016b0318	Process monitoring * Also refactored the SQL tables a bit	2023-07-11 14:46:21 +02:00
Viktor	cbbf60a599	Better fingerprinting (#35) * Better fingerprinting for server tech * Many more features in FeatureExtractor * Blog specialization * SiteType table	2023-07-10 18:58:43 +02:00
Viktor Lofgren	f03146de4b	(crawler) Fix bug: poor handling of duplicate ids * Also clean up the code a bit	2023-07-10 18:58:43 +02:00
Viktor Lofgren	b73fcc19fe	Fix so that crawler tests don't sometimes fetch real sitemaps when they're run.	2023-07-06 18:05:03 +02:00
Viktor Lofgren	24dce8c03b	Remove link filtering for mediawiki, it's too strict and not every site uses the /wiki/-pattern.	2023-07-01 19:32:25 +02:00
Viktor Lofgren	7d86586594	Remove annoying log spam in sitemap retriever	2023-06-30 17:08:35 +02:00
Viktor Lofgren	11c26e700e	Remove annoying log spam in crawler retriever	2023-06-30 17:08:24 +02:00
Viktor Lofgren	d71124961e	Better tests for crawling and processing.	2023-06-27 16:11:27 +02:00
Viktor Lofgren	fbdedf53de	Fix bug in CrawlerRetreiver ... where the root URL wasn't always added properly to the front of the crawl queue.	2023-06-27 15:50:38 +02:00
Viktor Lofgren	d167ad2017	Remove sitemap related log spam	2023-06-27 13:59:47 +02:00
Viktor Lofgren	f8f9f04158	Specialized logic for processing Lemmy-based websites.	2023-06-27 10:57:54 +02:00
Viktor Lofgren	b0c7480d06	Set default timeouts for java.net.URL-connections	2023-06-27 10:57:54 +02:00
Viktor Lofgren	e7af77e151	Tests for crawler specialization + testdata	2023-06-27 10:57:54 +02:00
Viktor Lofgren	ec940e36d0	Sitemap support, refined crawler specialization	2023-06-27 10:57:54 +02:00
Viktor Lofgren	ed373eef61	Refactor crawler and add special logic for some platforms * Break apart CrawlerRetreiver * Break apart HttpFetcher into an interface and impl for testing sanity * Add special logic for Lemmy, Mediawiki and Discourse to not waste requests on paths that aren't interesting.	2023-06-27 10:57:54 +02:00
Viktor Lofgren	e4372289a5	Use fixed buffers for BigString compression and decompression to reduce GC churn.	2023-06-19 17:58:19 +02:00
Viktor Lofgren	eb2ca942d5	Up the default crawl delay to 1 second.	2023-06-07 22:02:17 +02:00
Viktor Lofgren	e332faa07e	Fix test that broke when memex.marginalia.nu started redirecting to www.marginalia.nu.	2023-05-28 13:46:24 +02:00
Viktor Lofgren	2eb972dea1	Remove unrelated code, break tools into their own directory.	2023-03-17 16:03:11 +01:00
Viktor Lofgren	449471a076	Yet more restructuring. Improved search result ranking.	2023-03-16 21:35:54 +01:00
Viktor Lofgren	d82532b7f1	More restructuring, big bug fixes in keyword extraction.	2023-03-13 17:39:53 +01:00

32 Commits