CatgirlIntelligenceAgency

Author	SHA1	Message	Date
Viktor Lofgren	30bc3f9281	(converter) Use the prefix ip: instead of geopip: for country codes This is the same as the prefix for the IP address, but I don't think that substantially matters, the as two have such different namespaces there can be no confusion.	2023-12-11 13:59:23 +01:00
Viktor Lofgren	f655ec5a5c	(*) Refactor GeoIP-related code In this commit, GeoIP-related classes are refactored and relocated to a common library as they are shared across multiple services. The crawler is refactored to enable the GeoIpBlocklist to use the new GeoIpDictionary as the base of its decisions. The converter is modified ot query this data to add a geoip:-keyword to documents to permit limiting a search to the country of the hosting server. The commit also adds due BY-SA attribution in the search engine footer for the source of the IP geolocation data.	2023-12-10 17:30:43 +01:00
Viktor Lofgren	5c46af0edb	(converter) Refactor EncyclopediaMarginaliaNuSideloader to use ProcessingIterator Refactored the getDocumentsStream method in EncyclopediaMarginaliaNuSideloader to use the newly extracted ProcessingIterator class that encapsulates processing a stream of results from e.g a database query in parallel and returning the computed results as an iterator. The iterator was also improved on to be more reliable, previous versions of the logic would sometimes deadlock due to false positives in hasMore().	2023-12-09 15:20:53 +01:00
Viktor Lofgren	b6511fbfe2	(converter) Add AnchorTextKeywords to EncyclopediaMarginaliaNuSideloader processing The commit updates EncyclopediaMarginaliaNuSideloader to include the AnchorTextKeywords in processing documents, aiding search result relevance. It also removes old test-related functionality and a large but fairly useless test previously used to debug a specific problem, to the detriment of the overall code quality.	2023-12-09 15:20:52 +01:00
Viktor Lofgren	d0982e7ba5	(converter) Add error handling and lazy load external domain links The converter was not properly initiating the external links for each domain, causing an NPE in conversion. This needs to be loaded later since we don't know the domain we're processing until we've seen it in the crawl data. Also made some refactorings to make finding converter bugs easier, and finding the related domain less awkward from the SerializableCrawlData interface.	2023-12-09 12:33:39 +01:00
Viktor Lofgren	fc30da0d48	(converter) Add academia recognition to DomainProcessor The code now includes an additional function in the DomainProcessor class that checks if a domain is associated with academia. An academic domain is identified by the ".edu" TLD, or fits a specific regex pattern matching domains like .ac.ccTld or .edu.ccTld. If these conditions are met, the search term "special:academia" is added to the domain. The existing academia search filter uses personalized pagerank to select academia-adjacent domains, but it isn't working very well. The hope is that filtering on domain names will be more effective, and that it can supplant the ranking-based approach.	2023-12-08 20:31:34 +01:00
Viktor Lofgren	f615cf2391	(convert) Loosen up the rules enforcement for documents that have external links.	2023-12-01 17:44:29 +01:00
Viktor Lofgren	5fb24bb27f	(docs) Improve architectural documentation for the converter.	2023-11-30 20:43:22 +01:00
Viktor Lofgren	5a5430b383	(convert) Wiki specialization that should do a better job at removing junk keywords and providing a useful summary.	2023-11-30 20:04:46 +01:00
Viktor Lofgren	09917837d0	(process) Ensure construction exceptions are logged Wrapping these exceptions in a try-catch and logging them with slf4j will ensure they end up in the process logs. The way it worked using the default exception handler, they'd print on console (which nothing captures!), leading to a very annoying debugging experience.	2023-11-22 18:32:06 +01:00
Viktor Lofgren	5de37cb820	(converter) Set feature flags appropriately on stackexchange posts	2023-11-12 15:48:08 +01:00
Viktor Lofgren	e5cee1f46d	(sideload) Fix sideloading so that it doesn't get disproportionately good rankings Also add type flags so that e.g. wikipedia shows up in the wikis filter.	2023-11-12 14:57:57 +01:00
Viktor Lofgren	e0c769fd19	(converter) Integrate atags.parquet with the encyclopedia sideloader Also clean up stackexchange and dirtree a bit.	2023-11-06 18:03:01 +01:00
Viktor Lofgren	ebd10a5f28	(crawler) Integrate atags.parquet with the crawler so that "important" URLs are prioritized	2023-11-06 16:14:58 +01:00
Viktor Lofgren	2b77184281	(converter) Integrate atags with the topology field	2023-11-06 13:46:44 +01:00
Viktor Lofgren	0152004c42	Initial Commit Anchor Tags * Added new (optional) model file in $WMSA_HOME/data/atags.parquet * Converter gets a component for creating a projection of its domains onto the full atags parquet file * New WordFlag ExternalLink * These terms are also for now flagged as title words * Fixed a bug where Title words aliased with UrlDomain words * Fixed a bug in the encyclopedia sideloader that gave everything too high topology ranking	2023-11-04 14:24:17 +01:00
Viktor Lofgren	f613f4f2df	(array) Fix spurious search results This was caused by a bug in the binary search algorithm causing it to sometimes return positive values when encoding a search miss. It was also necessary to get rid of the vestiges of the old LongArray and IntArray classes to make this fix doable.	2023-10-26 15:27:02 +02:00
Viktor Lofgren	d7686b665e	Refactoring * Encyclopedia sideloader; permit providing base URL. * Storage base shows node id in GUI * ProcessLivenessMonitorActor restarts automatically * Clean-up of outbox code	2023-10-25 18:51:02 +02:00
Viktor Lofgren	e06a8c1de2	(converter) Put upper limit on number of worker threads.	2023-10-22 14:03:09 +02:00
Viktor Lofgren	81dd3809e9	(*) WIP Add node affinity to EC_DOMAIN Very messy commit due to fractalline yak shaving	2023-10-19 17:48:34 +02:00
Viktor Lofgren	4baf9527d7	() WIP Control GUI redesign, executor-service, multi-node mq This turned out to be very difficult to do in small isolated steps. Design overhaul of the control gui using bootstrap * Move the actors out of control-service into to a new executor-service, that can be run on multiple nodes * Add node-affinity to message queue	2023-10-14 12:08:43 +02:00
Viktor Lofgren	199c459697	(*) Add node-affinity to services, processes and file storage.	2023-10-10 12:32:22 +02:00
Viktor Lofgren	8375237de5	(converter) Add special keyword for websites with a tilde url.	2023-10-09 17:02:32 +02:00
Viktor Lofgren	c51159672e	(build) Move unit test configuration to root build.gradle	2023-10-04 12:46:22 +02:00
Viktor Lofgren	e0cd3cd991	(converter) Alter StackexchangeSideloader's summary length to align with the rest of the system.	2023-09-26 12:19:43 +02:00
Viktor Lofgren	81ae501e73	(converter) Use ThreadLocalSentenceExtractorProvider for PlainText plugin as well	2023-09-25 18:28:34 +02:00
Viktor Lofgren	f797a92f87	(converter, minor) Use domain name in task heartbeat progress	2023-09-25 18:27:04 +02:00
Viktor Lofgren	a433bbbe45	(converter) Fix rare sentence extractor bug It was caused by non-thread safe concurrent memory access in SentenceExtractor.	2023-09-24 19:39:48 +02:00
Viktor Lofgren	8ca20f184d	(keyword-extraction) Chasing my tail looking for a bug	2023-09-24 19:39:48 +02:00
Viktor Lofgren	dbe9235f3a	(*) Upgrade to JDK21 with preview enabled. ... also move some common configuration into the root build.gradle-file. Support for JDK21 in lombok is a bit sketchy at the moment, but it seems to work. This upgrade is kind of important as the new index construction really benefits from Arena based lifecycle control over off-heap memory.	2023-09-24 10:38:59 +02:00
Viktor Lofgren	ad660cf420	(converter) Bugfix: Don't try to Path.of() on optional field	2023-09-21 13:27:09 +02:00
Viktor Lofgren	70aa04c047	(converter, stackexchange-xml) Add the ability to sideload stackexchange data	2023-09-21 12:48:33 +02:00
Viktor Lofgren	d895f83520	(blocking-thread-pool) Move DumbThreadPool to its own micro-library Also rename it to SimpleBlockingThreadPool.	2023-09-20 10:11:49 +02:00
Viktor Lofgren	f6b9e8c5eb	(converter) JavadocSpecialization should truncate its summary if it gets too long	2023-09-17 16:25:33 +02:00
Viktor Lofgren	98bcdf6028	(converter) DirtreeSideloader now trims /index.html from the URL if present This is a crawler artifact in 9 cases out of 10, and may lead to bad URLs.	2023-09-17 16:08:16 +02:00
Viktor Lofgren	9b385ec7cc	(converter) Make it possible to sideload documents from a directory tree	2023-09-17 14:35:06 +02:00
Viktor Lofgren	5c040f7a46	(crawl-spec) Parquetify crawl spec * Crawl-specs are now parquet files * Deprecate the crawl-job-extractor tool	2023-09-17 09:41:34 +02:00
Viktor Lofgren	c67d95c00f	(converter) Write dummy processor log when sideloading	2023-09-14 14:13:03 +02:00
Viktor Lofgren	5e5aaf9a7e	(converter, control) Re-enable sideloading encyclopedia data	2023-09-14 12:12:07 +02:00
Viktor Lofgren	eaeb23d41e	(refactor) Remove converting-model package completely	2023-09-14 11:21:44 +02:00
Viktor Lofgren	c71f6ad417	(converter) Add heartbeats to the loader processes and execute the tasks in parallel for a ~2X speedup	2023-09-14 10:11:57 +02:00
Viktor Lofgren	4799dd769e	(converting) WIP begin to remove converting-model and the old InstructionsCompiler	2023-09-13 19:18:58 +02:00
Viktor Lofgren	24b4606f96	(converter,loader) Converter outputs parquet files instead of compressed json.	2023-09-13 16:13:41 +02:00
Viktor Lofgren	39c1857c61	(heartbeat, reverse-index) Better heartbeat mocking, improved heartbeats for reverse index construction.	2023-08-29 13:07:55 +02:00
Viktor Lofgren	3101b74580	(index) Move to a lexicon-free index design This is a system-wide change. The index used to have a lexicon, mapping words to wordIds using a large in-memory hash table. This made index-construction easier, but it also added a fairly significant RAM penalty to both the index service and the loader. The new design moves to 64 bit word identifiers calculated using the murmur hash of the keyword, and an index construction based on merging smaller indices. It also became necessary half-way through to upgrade guice as its error reporting wasn't quite compatible with JDK20.	2023-08-28 14:02:23 +02:00
Viktor Lofgren	194a6057dd	(index,control) Recoverable index backups	2023-08-25 14:57:43 +02:00
Viktor Lofgren	e710e057e2	(db) Remove EC_URL and EC_PAGE_DATA from mariadb database	2023-08-25 13:45:03 +02:00
Viktor Lofgren	6a04cdfddf	(loader) Implement new linkdb in loader Deprecate the LoadUrl instruction entirely. We no longer need to be told upfront about which URLs to expect, as IDs are generated from the domain id and document ordinal. For now, we no longer store new URLs in different domains. We need to re-implement this somehow, probably in a different job or a as a different output.	2023-08-24 13:07:54 +02:00
Viktor Lofgren	ebc84c22fb	Upgrade antique lombok plugin This permits tests to run on JDK20 environments.	2023-08-23 14:34:32 +00:00
Viktor Lofgren	aa0d256d6a	Upgrade code to Java 20. * Change language version * Upgrade Lombok to a JDK20 compatible version	2023-08-23 13:37:49 +00:00

1 2 3

125 Commits