CatgirlIntelligenceAgency

Author	SHA1	Message	Date
Viktor Lofgren	b74a3ebd85	(crawler) WIP integration of WARC files into the crawler process. At this stage, the crawler will use the WARCs to resume a crawl if it terminates incorrectly. This is a WIP commit: since the WARC files are not yet fully incorporated into the workflow, they are deleted after the domain is crawled. The commit also includes fairly invasive refactoring of the crawler classes, to accomplish better separation of concerns.	2023-12-11 19:32:58 +01:00
Viktor Lofgren	45987a1d98	Merge branch 'master' into warc	2023-12-11 14:32:35 +01:00
Viktor Lofgren	30bc3f9281	(converter) Use the prefix ip: instead of geoip: for country codes This is the same as the prefix for the IP address, but I don't think that substantially matters, as the two have such different namespaces there can be no confusion.	2023-12-11 13:59:23 +01:00
Viktor Lofgren	f655ec5a5c	(*) Refactor GeoIP-related code In this commit, GeoIP-related classes are refactored and relocated to a common library as they are shared across multiple services. The crawler is refactored to enable the GeoIpBlocklist to use the new GeoIpDictionary as the base of its decisions. The converter is modified to query this data to add a geoip:-keyword to documents to permit limiting a search to the country of the hosting server. The commit also adds due BY-SA attribution in the search engine footer for the source of the IP geolocation data.	2023-12-10 17:30:43 +01:00
Viktor Lofgren	84b4158555	(minor) Fix broken test	2023-12-10 14:39:20 +01:00
Viktor Lofgren	91dd45cf64	(search) IP and IP geolocation in site info view This commit also fixes a bug in the loader where the IP field wouldn't always populate as intended, and refactors the DomainInformationService to use significantly fewer SQL queries.	2023-12-09 20:06:55 +01:00
Viktor Lofgren	5c46af0edb	(converter) Refactor EncyclopediaMarginaliaNuSideloader to use ProcessingIterator Refactored the getDocumentsStream method in EncyclopediaMarginaliaNuSideloader to use the newly extracted ProcessingIterator class that encapsulates processing a stream of results from e.g. a database query in parallel and returning the computed results as an iterator. The iterator was also made more reliable; previous versions of the logic would sometimes deadlock due to false positives in hasMore().	2023-12-09 15:20:53 +01:00
Viktor Lofgren	b6511fbfe2	(converter) Add AnchorTextKeywords to EncyclopediaMarginaliaNuSideloader processing The commit updates EncyclopediaMarginaliaNuSideloader to include the AnchorTextKeywords in processing documents, aiding search result relevance. It also removes old test-related functionality and a large but fairly useless test previously used to debug a specific problem, to the detriment of the overall code quality.	2023-12-09 15:20:52 +01:00
Viktor Lofgren	d0982e7ba5	(converter) Add error handling and lazy load external domain links The converter was not properly initiating the external links for each domain, causing an NPE in conversion. This needs to be loaded later since we don't know the domain we're processing until we've seen it in the crawl data. Also made some refactorings to make finding converter bugs easier, and finding the related domain less awkward from the SerializableCrawlData interface.	2023-12-09 12:33:39 +01:00
Viktor Lofgren	fc30da0d48	(converter) Add academia recognition to DomainProcessor The code now includes an additional function in the DomainProcessor class that checks if a domain is associated with academia. An academic domain is identified by the ".edu" TLD, or by a specific regex pattern matching domains like .ac.ccTld or .edu.ccTld. If either condition is met, the search term "special:academia" is added to the domain. The existing academia search filter uses personalized pagerank to select academia-adjacent domains, but it isn't working very well. The hope is that filtering on domain names will be more effective, and that it can supplant the ranking-based approach.	2023-12-08 20:31:34 +01:00
Viktor Lofgren	e6a1052ba7	Simplify CrawlerMain, removing the CrawlerLimiter and using a global HttpFetcher with a virtual thread pool dispatcher instead of the default.	2023-12-08 20:24:01 +01:00
Viktor Lofgren	968dce50fc	(crawler) Refactored IpInterceptingNetworkInterceptor for clarity.	2023-12-08 17:45:46 +01:00
Viktor Lofgren	3bbffd3c22	(crawler) Refactor HttpFetcher to integrate WarcRecorder Partially hook in the WarcRecorder into the crawler process. So far it's not read, but should record the crawled documents. The WarcRecorder and HttpFetcher classes were also refactored and broken apart to be easier to reason about.	2023-12-08 17:12:51 +01:00
Viktor Lofgren	072b5fcd12	Implement Warc-recording wrapper for OkHttp3 client This is a first step of using WARC as an intermediate flight recorder style step in the crawler, ultimately aimed at being able to resume crawls if the crawler is restarted. This component is currently not hooked into anything. The OkHttp3 client wrapper class 'WarcRecordingFetcherClient' was implemented for web archiving. This allows for the recording of HTTP requests and responses. New classes were introduced, 'WarcDigestBuilder', 'IpInterceptingNetworkInterceptor', and 'WarcProtocolReconstructor'. The JWarc dependency was added to the build.gradle file, and relevant unit tests were also introduced. Some HttpFetcher-adjacent structural changes were also done for better organization.	2023-12-08 13:49:16 +01:00
Viktor Lofgren	fabffa80f0	(warc) Integrate the crawler's content type parsing and charset logic into the WarcSideloader	2023-12-07 15:26:01 +01:00
Viktor Lofgren	064265b0b9	(crawler) Move content type/charset sniffing to a separate microlibrary This functionality needs to be accessed by the WarcSideloader, which is in the converter. The resultant microlibrary is tiny, but I think in this case it's justifiable.	2023-12-07 15:16:37 +01:00
Viktor Lofgren	2d5d11645d	(warc) Refactor WarcSideloaderTest to not rely on specific test files on the computer	2023-12-06 19:00:29 +01:00
Viktor Lofgren	cc813a5624	(convert) Add basic support for Warc file sideloading This update includes the integration of the jwarc library and implements support for Warc file sideloading, as a first trial integration with this library.	2023-12-06 18:43:55 +01:00
Viktor Lofgren	f615cf2391	(convert) Loosen up the rules enforcement for documents that have external links.	2023-12-01 17:44:29 +01:00
Viktor Lofgren	166a391eae	(docs) Improve architectural documentation for the crawler.	2023-11-30 21:30:57 +01:00
Viktor Lofgren	5fb24bb27f	(docs) Improve architectural documentation for the converter.	2023-11-30 20:43:22 +01:00
Viktor Lofgren	5a5430b383	(convert) Wiki specialization that should do a better job at removing junk keywords and providing a useful summary.	2023-11-30 20:04:46 +01:00
Viktor Lofgren	1dafa0c74d	(mqapi/control) Repair repartition endpoint, deprecate notify endpoints. The repartition endpoint was mis-addressing its mqapi notifications, omitting the proper nodeId. In fixing this, it became apparent that having both @MqRequest and @MqNotification is a serious footgun, and the two should be unified into a single API where the caller isn't burdened with knowledge of the remote end's implementation specifics.	2023-11-27 16:01:12 +01:00
Viktor Lofgren	09917837d0	(process) Ensure construction exceptions are logged Wrapping these exceptions in a try-catch and logging them with slf4j will ensure they end up in the process logs. The way it worked using the default exception handler, they'd print on console (which nothing captures!), leading to a very annoying debugging experience.	2023-11-22 18:32:06 +01:00
Viktor Lofgren	f58a9f46be	(loader) Don't truncate the entire links table on load This behavior is an old vestige from the days of only having a single loader process. We'd truncate the links table because doing inserts/updates was too slow. This was also important because we had 32-bit IDs, and there are a lot of links between domains to go around... Instead we delete the rows associated with the current node with a stored procedure PURGE_LINKS_TABLE. We also update the PRIMARY KEY to a BIGINT. We'll need to load the data in excess of a billion times to hit an ID rollover, so it'll be fine.	2023-11-16 10:30:12 +01:00
Viktor Lofgren	5de37cb820	(converter) Set feature flags appropriately on stackexchange posts	2023-11-12 15:48:08 +01:00
Viktor Lofgren	e5cee1f46d	(sideload) Fix sideloading so that it doesn't get disproportionately good rankings Also add type flags so that e.g. wikipedia shows up in the wikis filter.	2023-11-12 14:57:57 +01:00
Viktor Lofgren	7617b4cbc2	(crawler) Fix NPE in crawler caused by not having fetched the domains list yet	2023-11-06 18:16:38 +01:00
Viktor Lofgren	e0c769fd19	(converter) Integrate atags.parquet with the encyclopedia sideloader Also clean up stackexchange and dirtree a bit.	2023-11-06 18:03:01 +01:00
Viktor Lofgren	ebd10a5f28	(crawler) Integrate atags.parquet with the crawler so that "important" URLs are prioritized	2023-11-06 16:14:58 +01:00
Viktor Lofgren	2b77184281	(converter) Integrate atags with the topology field	2023-11-06 13:46:44 +01:00
Viktor Lofgren	1847845151	Revert "(loader) Optimize INSERT statements" This reverts commit `7cb92195d1`.	2023-11-04 19:32:02 +01:00
Viktor Lofgren	7cb92195d1	(loader) Optimize INSERT statements INSERT IGNORE is too slow.	2023-11-04 17:43:55 +01:00
Viktor Lofgren	0152004c42	Initial Commit Anchor Tags * Added new (optional) model file in $WMSA_HOME/data/atags.parquet * Converter gets a component for creating a projection of its domains onto the full atags parquet file * New WordFlag ExternalLink * These terms are also for now flagged as title words * Fixed a bug where Title words aliased with UrlDomain words * Fixed a bug in the encyclopedia sideloader that gave everything too high topology ranking	2023-11-04 14:24:17 +01:00
Viktor Lofgren	8f74dbdbb4	(crawler) Set more lenient parameters for recrawl	2023-10-30 11:35:30 +01:00
Viktor Lofgren	fd5a7eac87	(crawler) Exit crawler retriever on thread interrupted	2023-10-30 11:34:16 +01:00
Viktor Lofgren	f613f4f2df	(array) Fix spurious search results This was caused by a bug in the binary search algorithm causing it to sometimes return positive values when encoding a search miss. It was also necessary to get rid of the vestiges of the old LongArray and IntArray classes to make this fix doable.	2023-10-26 15:27:02 +02:00
Viktor Lofgren	a497e4c920	(crawler) Terminate crawler after a few hours of no progress	2023-10-26 12:49:28 +02:00
Viktor Lofgren	d7686b665e	Refactoring * Encyclopedia sideloader; permit providing base URL. * Storage base shows node id in GUI * ProcessLivenessMonitorActor restarts automatically * Clean-up of outbox code	2023-10-25 18:51:02 +02:00
Viktor Lofgren	313cc2965c	(index-creation) Print whether full or prio is created Previous state of saying reverse index for both was pretty confusing.	2023-10-24 16:23:10 +02:00
Viktor Lofgren	e06a8c1de2	(converter) Put upper limit on number of worker threads.	2023-10-22 14:03:09 +02:00
Viktor Lofgren	1d75b974b5	(loader bugfix) Set DOMAIN_METADATA appropriately	2023-10-20 13:03:27 +02:00
Viktor Lofgren	7b5ec6b98f	(executor-service) Embed dist/ in executor-service's docker image	2023-10-19 17:48:34 +02:00
Viktor Lofgren	81dd3809e9	(*) WIP Add node affinity to EC_DOMAIN Very messy commit due to fractalline yak shaving	2023-10-19 17:48:34 +02:00
Viktor Lofgren	16e0738731	(*) Get multi-node routing working.	2023-10-15 18:38:30 +02:00
Viktor Lofgren	4baf9527d7	(*) WIP Control GUI redesign, executor-service, multi-node mq This turned out to be very difficult to do in small isolated steps. * Design overhaul of the control gui using bootstrap * Move the actors out of control-service into a new executor-service, that can be run on multiple nodes * Add node-affinity to message queue	2023-10-14 12:08:43 +02:00
Viktor Lofgren	199c459697	(*) Add node-affinity to services, processes and file storage.	2023-10-10 12:32:22 +02:00
Viktor Lofgren	8375237de5	(converter) Add special keyword for websites with a tilde url.	2023-10-09 17:02:32 +02:00
Viktor Lofgren	3889c4bdd9	(refactor) Remove features-search and update documentation	2023-10-09 15:12:30 +02:00
Viktor Lofgren	5dd55c7cad	(refactor) Rename satellite services to application services This is a better descriptor, since they now all implement different applications on top of the core services' APIs.	2023-10-09 13:45:45 +02:00

215 Commits