CatgirlIntelligenceAgency

Author	SHA1	Message	Date
Viktor Lofgren	fab36d6e63	(converter) Loader for reddit data Adds experimental sideloading support for pushshift.io style reddit data. This dataset is limited to data older than 2023, due to licensing changes making large-scale data extraction difficult. Since the median post quality on reddit is not very good, the sideloader will only load a subset of self-texts and top-level comments that have sufficiently many upvotes. Empirically this appears to mostly return good matches, even if it probably could index more. Tests were written for this, but all require local reddit data which can't be distributed with the source code. If these cannot be found, the tests will short-circuit as OK. They're mostly there for debugging, and it's fine if they don't always run. The change also refactors the sideloading a bit since it was a bit messy.	2024-02-14 17:35:44 +01:00
Viktor Lofgren	1b8b97b8ec	(sample-exporter) Add some limits on sizes and lengths Tar files will reject entries with filenames over 100b, so we need a limit there. Also added a maximum size limit to keep the file sizes reasonable.	2024-01-25 11:51:53 +01:00
Viktor Lofgren	805afad4fe	(control) New GUI for exporting crawl data samples Not going to win any beauty pageants, but this is pretty peripheral functionality.	2024-01-23 17:08:21 +01:00
Viktor Lofgren	0081328aca	(converter) Adjust which flags are set by anchor text keywords It's a mistake to let it bleed into Title, as this is a high quality signal. We'll co-opt Site and SiteAdjacent instead to reinforce the ExternalLink when count is high.	2024-01-23 11:54:00 +01:00
Viktor Lofgren	40c9d2050f	(control) Fully automatic conversion Removed the need to have to run an external tool to pre-process the data in order to load stackexchange-style data into the search engine. Removed the tool itself. This stirred up some issues with the dependencies, that were due to both third-party:ing xz and importing it as a dependency. This has been fixed, and :third-party:xz was removed.	2024-01-22 13:03:24 +01:00
Viktor Lofgren	c41e68aaab	(control) New export actions for RSS/Atom feeds and term frequency data This commit also refactors the executor a bit, and introduces a new converter-feature called data-extractors for this class of jobs.	2024-01-15 14:54:26 +01:00
Viktor Lofgren	e46e174b59	(keyword-extractor) Add another test for Name-extractor	2024-01-01 15:21:51 +01:00
Viktor Lofgren	1cbf23e7e7	(test) Don't fail test if atags.parquet is not in ~vlofgren	2023-11-15 09:11:38 +01:00
Viktor Lofgren	e0c769fd19	(converter) Integrate atags.parquet with the encyclopedia sideloader Also clean up stackexchange and dirtree a bit.	2023-11-06 18:03:01 +01:00
Viktor Lofgren	ebd10a5f28	(crawler) Integrate atags.parquet with the crawler so that "important" URLs are prioritized	2023-11-06 16:14:58 +01:00
Viktor Lofgren	2b77184281	(converter) Integrate atags with the topology field	2023-11-06 13:46:44 +01:00
Viktor Lofgren	72afa0341f	duckdb connection may need to be synchronized?	2023-11-04 14:30:25 +01:00
Viktor Lofgren	0152004c42	Initial Commit Anchor Tags * Added new (optional) model file in $WMSA_HOME/data/atags.parquet * Converter gets a component for creating a projection of its domains onto the full atags parquet file * New WordFlag ExternalLink * These terms are also for now flagged as title words * Fixed a bug where Title words aliased with UrlDomain words * Fixed a bug in the encyclopedia sideloader that gave everything too high topology ranking	2023-11-04 14:24:17 +01:00
Viktor Lofgren	4415f52e18	(keyword-extraction) Fix broken test	2023-10-27 12:19:33 +02:00
Viktor Lofgren	c51159672e	(build) Move unit test configuration to root build.gradle	2023-10-04 12:46:22 +02:00
Viktor Lofgren	93dc80000c	(bugfix) Fix NPE in KeywordExtractor due to bad SoftReference handling	2023-09-26 17:16:41 +02:00
Viktor Lofgren	9b781f8404	(keyword-extractor) Address very rare race condition in memoization logic	2023-09-25 18:28:04 +02:00
Viktor Lofgren	8ca20f184d	(keyword-extraction) Chasing my tail looking for a bug	2023-09-24 19:39:48 +02:00
Viktor Lofgren	dbe9235f3a	(*) Upgrade to JDK21 with preview enabled. ... also move some common configuration into the root build.gradle-file. Support for JDK21 in lombok is a bit sketchy at the moment, but it seems to work. This upgrade is kind of important as the new index construction really benefits from Arena based lifecycle control over off-heap memory.	2023-09-24 10:38:59 +02:00
Viktor Lofgren	70aa04c047	(converter, stackexchange-xml) Add the ability to sideload stackexchange data	2023-09-21 12:48:33 +02:00
Viktor Lofgren	5b0a6d7ec1	(stackexchange-converter) Create tool for converting stackexchange 7z-files to digestible sqlite db:s	2023-09-20 15:15:13 +02:00
Viktor Lofgren	3b4d08f52b	(stackexchange-integration) Add better comments	2023-09-20 14:43:06 +02:00
Viktor Lofgren	6bbf40d7d2	(stackexchange-integration) Tools for reading stackexchange xml files	2023-09-20 14:17:33 +02:00
Viktor Lofgren	eaeb23d41e	(refactor) Remove converting-model package completely	2023-09-14 11:21:44 +02:00
Viktor Lofgren	c68d17d482	(keyword-extraction) Fix bug leading to position data missing on some keywords. This was due to a discrepancy between the KeywordPositionBitmask and WordsTfIdfCounts' concept of a keyword.	2023-09-02 14:48:55 +02:00
Viktor Lofgren	676e7c7947	(keywords) Add Serializable properties that went missing as the record became a class	2023-09-02 09:52:01 +02:00
Viktor Lofgren	5f427d2b4c	(keywords) Clean up leaky abstractions, clean up tests	2023-09-01 13:52:00 +02:00
Viktor Lofgren	3101b74580	(index) Move to a lexicon-free index design This is a system-wide change. The index used to have a lexicon, mapping words to wordIds using a large in-memory hash table. This made index-construction easier, but it also added a fairly significant RAM penalty to both the index service and the loader. The new design moves to 64 bit word identifiers calculated using the murmur hash of the keyword, and an index construction based on merging smaller indices. It also became necessary half-way through to upgrade guice as its error reporting wasn't quite compatible with JDK20.	2023-08-28 14:02:23 +02:00
Viktor Lofgren	ebc84c22fb	Upgrade antique lombok plugin This permits tests to run on JDK20 environments.	2023-08-23 14:34:32 +00:00
Viktor Lofgren	aa0d256d6a	Upgrade code to Java 20. * Change language version * Upgrade Lombok to a JDK20 compatible version	2023-08-23 13:37:49 +00:00
Viktor Lofgren	db0216936e	(summary) Reduce the chance of expensive operations	2023-08-16 15:48:34 +02:00
Viktor Lofgren	46d761f34f	(language) fasttext based language filter	2023-08-16 15:48:12 +02:00
Viktor Lofgren	4ab1cd9502	(*) last touches	2023-08-07 12:57:44 +02:00
Viktor Lofgren	77d5e39fe0	Make processed data Serializable	2023-07-28 18:11:19 +02:00
Viktor Lofgren	baff83912e	Small optimizations that shave an hour off processing time :D	2023-06-28 15:41:10 +02:00
Viktor Lofgren	f8f9f04158	Specialized logic for processing Lemmy-based websites.	2023-06-27 10:57:54 +02:00
Viktor Lofgren	7326ba74fe	Tweaks to pub date heuristics to make it mostly get the 'historyofphilosophy.net' case right. Use HTML standard for plausibility checks in the more guesswork-like heuristics. Added more class names to look for date strings.	2023-06-20 14:15:05 +02:00
Viktor Lofgren	32a6735d03	Undo change in requirements for counting as a high tf-idf word	2023-06-19 17:58:19 +02:00
Viktor Lofgren	f0b4acb358	Better logic for summarization.	2023-06-19 17:58:19 +02:00
Viktor Lofgren	9579cdd151	Improved heuristic for which words are considered important in selecting the summary text.	2023-06-19 17:58:19 +02:00
Viktor Lofgren	443cf0cf1e	Expose additional functionality through WordsTfIdfCounts. Bump requirements for being flagged as high TF-IDF from 2 occurrences to 3.	2023-06-19 17:58:19 +02:00
Viktor Lofgren	4138233ddf	Truncate repeated strings of any non-alnum symbols in SummaryExtractor	2023-06-19 17:58:19 +02:00
Viktor Lofgren	2979f4703e	Allocation-free text utility	2023-06-19 17:58:19 +02:00
Viktor Lofgren	379bccc1a3	Disable AdblockSimulator since it's slow and doesn't really work. Just wasting CPU cycles until it's fixed.	2023-06-19 17:58:19 +02:00
Viktor Lofgren	21125206b4	Fix some bugs in JSON+LD-heuristics for pub date.	2023-06-19 17:58:19 +02:00
Viktor Lofgren	88399e30e2	Consider keyword relevance signals when creating the document summary using the DOM walker.	2023-06-19 17:58:19 +02:00
Viktor Lofgren	d82a858491	Don't consider slash to be a sentence separator.	2023-05-31 16:54:30 +02:00
Viktor	7694a15f62	Fix kale's unreasonably high weighting factor	2023-04-22 20:55:09 +02:00
Viktor Lofgren	619fb8ba80	(converter) Adjust the pub-date sniffing heuristics' order. Doing HTML5 tags too early puts some sites too early. Also expanded support for JSON+LD.	2023-04-19 15:28:50 +02:00
Viktor Lofgren	810515c08d	Clean up artifact extractor.	2023-04-10 13:07:54 +02:00

66 Commits