CatgirlIntelligenceAgency

Author	SHA1	Message	Date
Viktor Lofgren	e49ba887e9	(crawl data) Add compatibility layer for old crawl data format The new converter logic assumes that the crawl data is ordered where the domain record comes first, and then a sequence of document records. This is true for the new parquet format, but not for the old zstd/gson format. To make the new converter compatible with the old format, a specialized reader is introduced that scans for the domain record before running through the sequence of document records, presenting them in the new order. This is slower than just reading the file beginning to end, so in order to retain performance when this ordering isn't necessary, a CompatibilityLevel flag is added to CrawledDomainReader, permitting the caller to decide how compatible the data needs to be. Down the line when all the old data is purged, this should be removed, as it amounts to technical debt.	2024-01-08 19:16:49 +01:00
Viktor Lofgren	24051fec03	(converter) WIP Run sideload-style processing for large domains The processor normally retains the domain data in memory after processing to be able to do additional site-wide analysis. This works well, except there are a number of outlier websites that have an absurd number of documents that can rapidly fill up the heap of the process. These websites now receive a simplified treatment. This is executed in the converter batch writer thread. This is slower, but the documents will not be persisted in memory.	2023-12-27 18:20:03 +01:00
Viktor Lofgren	440e097d78	(crawler) WIP integration of WARC files into the crawler and converter process. This commit is in a pretty rough state. It refactors the crawler fairly significantly to offer better separation of concerns. It replaces the zstd compressed json files used to store crawl data with WARC files entirely, and the converter is modified to be able to consume this data. This works, -ish. There appears to be some bug relating to reading robots.txt, and the X-Robots-Tag header is no longer processed either. A problem is that the WARC files are a bit too large. It will probably be necessary to introduce a new format to store the crawl data long term, something like parquet; and use WARCs for intermediate storage to enable the crawler to be restarted without needing a recrawl.	2023-12-13 15:33:42 +01:00
Viktor Lofgren	0152004c42	Initial Commit Anchor Tags * Added new (optional) model file in $WMSA_HOME/data/atags.parquet * Converter gets a component for creating a projection of its domains onto the full atags parquet file * New WordFlag ExternalLink * These terms are also for now flagged as title words * Fixed a bug where Title words aliased with UrlDomain words * Fixed a bug in the encyclopedia sideloader that gave everything too high topology ranking	2023-11-04 14:24:17 +01:00
Viktor Lofgren	3047e2dd7c	(screenshot-capture-tool) Make screenshot-capture-tool cooperate with docker	2023-11-01 16:38:55 +01:00
Viktor Lofgren	81bfd7e5fb	(experiment) Utility for exporting atags	2023-10-31 16:10:21 +01:00
Viktor Lofgren	f6fcb04817	(experiment) Repair the experiment runner	2023-10-27 16:16:50 +02:00
Viktor Lofgren	7b5ec6b98f	(executor-service) Embed dist/ in executor-service's docker image	2023-10-19 17:48:34 +02:00
Viktor Lofgren	2bf0c4497d	(*) Tool for unfcking old crawl data so that it aligns with the new style IDs	2023-10-19 17:48:34 +02:00
Viktor Lofgren	5dd55c7cad	(refactor) Rename satellite services to application services This is a better descriptor, since they now all implement different applications on top of the core services' APIs.	2023-10-09 13:45:45 +02:00
Viktor Lofgren	c51159672e	(build) Move unit test configuration to root build.gradle	2023-10-04 12:46:22 +02:00
Viktor Lofgren	dbe9235f3a	(*) Upgrade to JDK21 with preview enabled. ... also move some common configuration into the root build.gradle-file. Support for JDK21 in lombok is a bit sketchy at the moment, but it seems to work. This upgrade is kind of important as the new index construction really benefits from Arena based lifecycle control over off-heap memory.	2023-09-24 10:38:59 +02:00
Viktor Lofgren	5b0a6d7ec1	(stackexchange-converter) Create tool for converting stackexchange 7z-files to digestible sqlite db:s	2023-09-20 15:15:13 +02:00
Viktor Lofgren	5c040f7a46	(crawl-spec) Parquetify crawl spec * Crawl-specs are now parquet files * Deprecate the crawl-job-extractor tool	2023-09-17 09:41:34 +02:00
Viktor Lofgren	eaeb23d41e	(refactor) Remove converting-model package completely	2023-09-14 11:21:44 +02:00
Viktor Lofgren	c68d17d482	(keyword-extraction) Fix bug leading to position data missing on some keywords. This was due to a discrepancy between the KeywordPositionBitmask and WordsTfIdfCounts' concept of a keyword.	2023-09-02 14:48:55 +02:00
Viktor Lofgren	39c1857c61	(heartbeat, reverse-index) Better heartbeat mocking, improved heartbeats for reverse index construction.	2023-08-29 13:07:55 +02:00
Viktor Lofgren	1e6800565a	(system) Remove EdgeId<T> and similar objects They seemed like a good idea at the time, but in practice they're wasting resources and not really providing the clarity I had hoped.	2023-08-24 17:46:02 +02:00
Viktor Lofgren	ebc84c22fb	Upgrade antique lombok plugin This permits tests to run on JDK20 environments.	2023-08-23 14:34:32 +00:00
Viktor Lofgren	aa0d256d6a	Upgrade code to Java 20. * Change language version * Upgrade Lombok to a JDK20 compatible version	2023-08-23 13:37:49 +00:00
Viktor Lofgren	46d761f34f	(language) fasttext based language filter	2023-08-16 15:48:12 +02:00
Viktor Lofgren	a5d980ee56	(converter) Hook crawl job extractor and adjacencies calculator into control service.	2023-07-26 15:46:22 +02:00
Viktor Lofgren	a56953c798	(converter, WIP) Refactor converter to not have to load everything into RAM.	2023-07-24 15:25:09 +02:00
Viktor Lofgren	c069c8c182	(crawler) Clean up crawl data reference and recrawl logic	2023-07-22 18:42:21 +02:00
Viktor Lofgren	f91d92cccb	(crawler) WIP	2023-07-20 21:05:16 +02:00
Viktor Lofgren	8b74e3aa0d	(*) File Storage WIP	2023-07-14 17:08:10 +02:00
Viktor Lofgren	480abfe966	(minor) Add limit to poll count in MqPersistence, fix test	2023-07-12 18:16:23 +02:00
Viktor Lofgren	74caf9e38a	(processes) Remove forEach-constructs in favor of iterators.	2023-07-12 17:47:36 +02:00
Viktor	cbbf60a599	Better fingerprinting (#35) * Better fingerprinting for server tech * Many more features in FeatureExtractor * Blog specialization * SiteType table	2023-07-10 18:58:43 +02:00
Viktor Lofgren	dbb758d1a8	Minor: Better error handling in crawled domain reader	2023-07-10 18:58:43 +02:00
Viktor Lofgren	e7af77e151	Tests for crawler specialization + testdata	2023-06-27 10:57:54 +02:00
Viktor Lofgren	ed373eef61	Refactor crawler and add special logic for some platforms * Break apart CrawlerRetreiver * Break apart HttpFetcher into an interface and impl for testing sanity * Add special logic for Lemmy, Mediawiki and Discourse to not waste requests on paths that aren't interesting.	2023-06-27 10:57:54 +02:00
Viktor Lofgren	a9fabba407	Tell experiment runner to only process some domains. Updated the experiment runner, as well as the script.	2023-06-20 14:14:01 +02:00
Viktor Lofgren	4fc0ddbc45	Improved crawl-job-extractor. Let crawl-job-extractor run offline and allow it to read domains from file. Improved docs.	2023-06-20 11:37:52 +02:00
Viktor Lofgren	7ed3306be3	Make the adjacency calculator behave like it used to in the past, when it gave better results.	2023-06-07 22:03:06 +02:00
Viktor Lofgren	2afbdc2269	Adjust the logic for the crawl job extractor to set a relatively low visit limit for websites that are new in the index or have not yielded many good documents previously.	2023-06-07 22:01:35 +02:00
Viktor	5a5cdaf70e	Improvements to the adjacency calculator and screenshots tool (#13) * WIP: Improvements to website adjacencies loader tool. * Improving screenshots capture bot.	2023-04-18 22:21:49 +02:00
Viktor Lofgren	4d298cd5fa	Improving screenshots capture bot.	2023-04-17 18:04:22 +02:00
Viktor Lofgren	fbbaf584ba	Adjustments to screenshot capture tool.	2023-04-16 08:55:57 +02:00
Viktor Lofgren	3e9b37c264	Refactor website screenshot tool and website adjacencies calculator into code/tools.	2023-04-11 16:20:27 +02:00
Viktor Lofgren	fe419b12b4	Better handling of quote terms, fix bug in handling of longer queries. ... where some terms may previously have been ignored. The latter bug was due to the handling of QueryHeads with AnyOf-style predicates interacting poorly with alreadyConsideredTerms in SearchIndex.java	2023-04-10 13:11:40 +02:00
Viktor Lofgren	716ab35b4e	Search ranking debuggability improvements.	2023-04-02 13:43:24 +02:00
Viktor Lofgren	affcf8cf41	Load test tool	2023-04-02 09:43:43 +02:00
Viktor Lofgren	d0c72ceb7e	Improve experiment runner, convenient start script.	2023-03-30 15:40:31 +02:00
Viktor Lofgren	8f51345a1d	Add experiment runner tool and get rid of experiments module in processes.	2023-03-28 16:58:46 +02:00
Viktor	ac1ac3ea57	Move database to a separate module * Move database to a separate project, break apart sql file into separate entities. * Fix front page news listing.	2023-03-25 15:26:17 +01:00
Viktor Lofgren	2eb972dea1	Remove unrelated code, break tools into their own directory.	2023-03-17 16:03:11 +01:00

47 Commits