CatgirlIntelligenceAgency

Author	SHA1	Message	Date
Viktor Lofgren	440e097d78	(crawler) WIP integration of WARC files into the crawler and converter process. This commit is in a pretty rough state. It refactors the crawler fairly significantly to offer better separation of concerns. It replaces the zstd compressed json files used to store crawl data with WARC files entirely, and the converter is modified to be able to consume this data. This works, -ish. There appears to be some bug relating to reading robots.txt, and the X-Robots-Tag header is no longer processed either. A problem is that the WARC files are a bit too large. It will probably be necessary to introduce a new format to store the crawl data long term, something like parquet; and use WARCs for intermediate storage to enable the crawler to be restarted without needing a recrawl.	2023-12-13 15:33:42 +01:00
Viktor Lofgren	0152004c42	Initial Commit Anchor Tags * Added new (optional) model file in $WMSA_HOME/data/atags.parquet * Converter gets a component for creating a projection of its domains onto the full atags parquet file * New WordFlag ExternalLink * These terms are also for now flagged as title words * Fixed a bug where Title words aliased with UrlDomain words * Fixed a bug in the encyclopedia sideloader that gave everything too high topology ranking	2023-11-04 14:24:17 +01:00
Viktor Lofgren	81bfd7e5fb	(experiment) Utility for exporting atags	2023-10-31 16:10:21 +01:00
Viktor Lofgren	f6fcb04817	(experiment) Repair the experiment runner	2023-10-27 16:16:50 +02:00
Viktor Lofgren	c51159672e	(build) Move unit test configuration to root build.gradle	2023-10-04 12:46:22 +02:00
Viktor Lofgren	dbe9235f3a	(*) Upgrade to JDK21 with preview enabled. ... also move some common configuration into the root build.gradle-file. Support for JDK21 in lombok is a bit sketchy at the moment, but it seems to work. This upgrade is kind of important as the new index construction really benefits from Arena based lifecycle control over off-heap memory.	2023-09-24 10:38:59 +02:00
Viktor Lofgren	5c040f7a46	(crawl-spec) Parquetify crawl spec * Crawl-specs are now parquet files * Deprecate the crawl-job-extractor tool	2023-09-17 09:41:34 +02:00
Viktor Lofgren	eaeb23d41e	(refactor) Remove converting-model package completely	2023-09-14 11:21:44 +02:00
Viktor Lofgren	c68d17d482	(keyword-extraction) Fix bug leading to position data missing on some keywords. This was due to a discrepancy between the KeywordPositionBitmask and WordsTfIdfCounts' concept of a keyword.	2023-09-02 14:48:55 +02:00
Viktor Lofgren	ebc84c22fb	Upgrade antique lombok plugin. This permits tests to run on JDK20 environments.	2023-08-23 14:34:32 +00:00
Viktor Lofgren	aa0d256d6a	Upgrade code to Java 20. * Change language version * Upgrade Lombok to a JDK20 compatible version	2023-08-23 13:37:49 +00:00
Viktor Lofgren	a56953c798	(converter, WIP) Refactor converter to not have to load everything into RAM.	2023-07-24 15:25:09 +02:00
Viktor Lofgren	8b74e3aa0d	(*) File Storage WIP	2023-07-14 17:08:10 +02:00
Viktor Lofgren	74caf9e38a	(processes) Remove forEach-constructs in favor of iterators.	2023-07-12 17:47:36 +02:00
Viktor	cbbf60a599	Better fingerprinting (#35) * Better fingerprinting for server tech * Many more features in FeatureExtractor * Blog specialization * SiteType table	2023-07-10 18:58:43 +02:00
Viktor Lofgren	e7af77e151	Tests for crawler specialization + testdata	2023-06-27 10:57:54 +02:00
Viktor Lofgren	ed373eef61	Refactor crawler and add special logic for some platforms * Break apart CrawlerRetreiver * Break apart HttpFetcher into an interface and impl for testing sanity * Add special logic for Lemmy, Mediawiki and Discourse to not waste requests on paths that aren't interesting.	2023-06-27 10:57:54 +02:00
Viktor Lofgren	a9fabba407	Tell experiment runner to only process some domains. Updated the experiment runner, as well as the script.	2023-06-20 14:14:01 +02:00
Viktor Lofgren	d0c72ceb7e	Improve experiment runner, convenient start script.	2023-03-30 15:40:31 +02:00
Viktor Lofgren	8f51345a1d	Add experiment runner tool and remove the experiments module in processes.	2023-03-28 16:58:46 +02:00

20 Commits