CatgirlIntelligenceAgency

Author	SHA1	Message	Date
Viktor Lofgren	e5cee1f46d	(sideload) Fix sideloading so that it doesn't get disproportionately good rankings Also add type flags so that e.g. wikipedia shows up in the wikis filter.	2023-11-12 14:57:57 +01:00
Viktor Lofgren	e0c769fd19	(converter) Integrate atags.parquet with the encyclopedia sideloader Also clean up stackexchange and dirtree a bit.	2023-11-06 18:03:01 +01:00
Viktor Lofgren	ebd10a5f28	(crawler) Integrate atags.parquet with the crawler so that "important" URLs are prioritized	2023-11-06 16:14:58 +01:00
Viktor Lofgren	2b77184281	(converter) Integrate atags with the topology field	2023-11-06 13:46:44 +01:00
Viktor Lofgren	0152004c42	Initial Commit Anchor Tags * Added new (optional) model file in $WMSA_HOME/data/atags.parquet * Converter gets a component for creating a projection of its domains onto the full atags parquet file * New WordFlag ExternalLink * These terms are also for now flagged as title words * Fixed a bug where Title words aliased with UrlDomain words * Fixed a bug in the encyclopedia sideloader that gave everything too high topology ranking	2023-11-04 14:24:17 +01:00
Viktor Lofgren	f613f4f2df	(array) Fix spurious search results This was caused by a bug in the binary search algorithm causing it to sometimes return positive values when encoding a search miss. It was also necessary to get rid of the vestiges of the old LongArray and IntArray classes to make this fix doable.	2023-10-26 15:27:02 +02:00
Viktor Lofgren	d7686b665e	Refactoring * Encyclopedia sideloader; permit providing base URL. * Storage base shows node id in GUI * ProcessLivenessMonitorActor restarts automatically * Clean-up of outbox code	2023-10-25 18:51:02 +02:00
Viktor Lofgren	e06a8c1de2	(converter) Put upper limit on number of worker threads.	2023-10-22 14:03:09 +02:00
Viktor Lofgren	81dd3809e9	(*) WIP Add node affinity to EC_DOMAIN Very messy commit due to fractalline yak shaving	2023-10-19 17:48:34 +02:00
Viktor Lofgren	4baf9527d7	() WIP Control GUI redesign, executor-service, multi-node mq This turned out to be very difficult to do in small isolated steps. Design overhaul of the control gui using bootstrap * Move the actors out of control-service into a new executor-service, that can be run on multiple nodes * Add node-affinity to message queue	2023-10-14 12:08:43 +02:00
Viktor Lofgren	199c459697	(*) Add node-affinity to services, processes and file storage.	2023-10-10 12:32:22 +02:00
Viktor Lofgren	8375237de5	(converter) Add special keyword for websites with a tilde url.	2023-10-09 17:02:32 +02:00
Viktor Lofgren	c51159672e	(build) Move unit test configuration to root build.gradle	2023-10-04 12:46:22 +02:00
Viktor Lofgren	e0cd3cd991	(converter) Alter StackexchangeSideloader's summary length to align with the rest of the system.	2023-09-26 12:19:43 +02:00
Viktor Lofgren	81ae501e73	(converter) Use ThreadLocalSentenceExtractorProvider for PlainText plugin as well	2023-09-25 18:28:34 +02:00
Viktor Lofgren	f797a92f87	(converter, minor) Use domain name in task heartbeat progress	2023-09-25 18:27:04 +02:00
Viktor Lofgren	a433bbbe45	(converter) Fix rare sentence extractor bug It was caused by non-thread safe concurrent memory access in SentenceExtractor.	2023-09-24 19:39:48 +02:00
Viktor Lofgren	8ca20f184d	(keyword-extraction) Chasing my tail looking for a bug	2023-09-24 19:39:48 +02:00
Viktor Lofgren	dbe9235f3a	(*) Upgrade to JDK21 with preview enabled. ... also move some common configuration into the root build.gradle-file. Support for JDK21 in lombok is a bit sketchy at the moment, but it seems to work. This upgrade is kind of important as the new index construction really benefits from Arena based lifecycle control over off-heap memory.	2023-09-24 10:38:59 +02:00
Viktor Lofgren	ad660cf420	(converter) Bugfix: Don't try to Path.of() on optional field	2023-09-21 13:27:09 +02:00
Viktor Lofgren	70aa04c047	(converter, stackexchange-xml) Add the ability to sideload stackexchange data	2023-09-21 12:48:33 +02:00
Viktor Lofgren	d895f83520	(blocking-thread-pool) Move DumbThreadPool to its own micro-library Also rename it to SimpleBlockingThreadPool.	2023-09-20 10:11:49 +02:00
Viktor Lofgren	f6b9e8c5eb	(converter) JavadocSpecialization should truncate its summary if it gets too long	2023-09-17 16:25:33 +02:00
Viktor Lofgren	98bcdf6028	(converter) DirtreeSideloader now trims /index.html from the URL if present This is a crawler artifact in 9 cases out of 10, and may lead to bad URLs.	2023-09-17 16:08:16 +02:00
Viktor Lofgren	9b385ec7cc	(converter) Make it possible to sideload documents from a directory tree	2023-09-17 14:35:06 +02:00
Viktor Lofgren	5c040f7a46	(crawl-spec) Parquetify crawl spec * Crawl-specs are now parquet files * Deprecate the crawl-job-extractor tool	2023-09-17 09:41:34 +02:00
Viktor Lofgren	c67d95c00f	(converter) Write dummy processor log when sideloading	2023-09-14 14:13:03 +02:00
Viktor Lofgren	5e5aaf9a7e	(converter, control) Re-enable sideloading encyclopedia data	2023-09-14 12:12:07 +02:00
Viktor Lofgren	eaeb23d41e	(refactor) Remove converting-model package completely	2023-09-14 11:21:44 +02:00
Viktor Lofgren	c71f6ad417	(converter) Add heartbeats to the loader processes and execute the tasks in parallel for a ~2X speedup	2023-09-14 10:11:57 +02:00
Viktor Lofgren	4799dd769e	(converting) WIP begin to remove converting-model and the old InstructionsCompiler	2023-09-13 19:18:58 +02:00
Viktor Lofgren	24b4606f96	(converter,loader) Converter outputs parquet files instead of compressed json.	2023-09-13 16:13:41 +02:00
Viktor Lofgren	39c1857c61	(heartbeat, reverse-index) Better heartbeat mocking, improved heartbeats for reverse index construction.	2023-08-29 13:07:55 +02:00
Viktor Lofgren	3101b74580	(index) Move to a lexicon-free index design This is a system-wide change. The index used to have a lexicon, mapping words to wordIds using a large in-memory hash table. This made index-construction easier, but it also added a fairly significant RAM penalty to both the index service and the loader. The new design moves to 64 bit word identifiers calculated using the murmur hash of the keyword, and an index construction based on merging smaller indices. It also became necessary half-way through to upgrade guice as its error reporting wasn't quite compatible with JDK20.	2023-08-28 14:02:23 +02:00
Viktor Lofgren	194a6057dd	(index,control) Recoverable index backups	2023-08-25 14:57:43 +02:00
Viktor Lofgren	e710e057e2	(db) Remove EC_URL and EC_PAGE_DATA from mariadb database	2023-08-25 13:45:03 +02:00
Viktor Lofgren	6a04cdfddf	(loader) Implement new linkdb in loader Deprecate the LoadUrl instruction entirely. We no longer need to be told upfront about which URLs to expect, as IDs are generated from the domain id and document ordinal. For now, we no longer store new URLs in different domains. We need to re-implement this somehow, probably in a different job or as a different output.	2023-08-24 13:07:54 +02:00
Viktor Lofgren	ebc84c22fb	Upgrade antique lombok plugin This permits tests to run on JDK20 environments.	2023-08-23 14:34:32 +00:00
Viktor Lofgren	aa0d256d6a	Upgrade code to Java 20. * Change language version * Upgrade Lombok to a JDK20 compatible version	2023-08-23 13:37:49 +00:00
Viktor Lofgren	bf92c270dc	(language) Rollback language filter change a bit. It appears to lead to too much junk in the lexicon.	2023-08-23 10:16:57 +02:00
Viktor Lofgren	e507844616	(language) Rollback language filter change a bit. It appears to lead to too much junk in the lexicon.	2023-08-23 10:03:25 +02:00
Viktor Lofgren	704de50a9b	(forward-index, valuator) HTML features in valuator Put it in the forward index for easy access during index-side valuation.	2023-08-18 11:54:56 +02:00
Viktor Lofgren	fcfe07fb7d	(valuator) Clean up code	2023-08-18 11:26:56 +02:00
Viktor Lofgren	ccf4990add	(minor) Clean up code	2023-08-18 11:26:39 +02:00
Viktor Lofgren	f2638dd845	(feature-extractor) More adtech nonsense	2023-08-18 11:26:19 +02:00
Viktor Lofgren	239980ecae	(minor) Improve comment	2023-08-18 11:26:05 +02:00
Viktor Lofgren	bee815b1c4	(converter) Add monsterinsights as an adtech tracker	2023-08-17 17:44:11 +02:00
Viktor Lofgren	e296b02649	(converter) Optimize LSH based within-domain deduplication	2023-08-17 17:43:46 +02:00
Viktor Lofgren	46d761f34f	(language) fasttext based language filter	2023-08-16 15:48:12 +02:00
Viktor Lofgren	4598c7f40f	(valuation) Penalize wordpress style kebab case urls	2023-08-16 13:11:24 +02:00

114 Commits