CatgirlIntelligenceAgency

Author	SHA1	Message	Date
Viktor Lofgren	c92f1b8df8	(geo-ip) Revert removal of ip2location logic We do both ip2location and ASN data. The change also adds some keywords based on autonomous system information, on a somewhat experimental basis. It would be neat to be able to e.g. exclude cloud services or just e.g. cloudflare from the search results.	2023-12-17 15:03:00 +01:00
Viktor Lofgren	bde68ba48b	Merge branch 'master' into asn-info	2023-12-17 14:00:23 +01:00
Viktor Lofgren	bf44805e69	(*) Rename EdgeDomain$domain into topDomain This variable had a very confusing name, and was dangerously easy to use in the wrong place with the result of getting something that only works as expected half the time. Ideally this class needs an overhaul, the assumptions it makes about domain names aren't great.	2023-12-17 14:00:07 +01:00
Viktor Lofgren	d7bd540683	(*) Replace the ip2location IP geolocation data with ASN information from apnic.net. Doesn't really make sense to use ip2location as a middle man for information that is already freely available...	2023-12-16 21:55:04 +01:00
Viktor Lofgren	722b56c8ca	(index) Fix rare bug in the index-switching logic This is caused by a resource contention with the query code. The proper way to fix this is to use some form of synchronization, but that will slow the code down. So we just hammer it a few times and let the GC deal with the problem if it fails. Not optimal, but fast.	2023-12-16 18:57:35 +01:00
Viktor Lofgren	f3f12058dc	(assistant) Fix logic error in filtering related domains	2023-12-16 18:45:53 +01:00
Viktor Lofgren	3da38d0483	(assistant) Fix logic error in filtering related domains	2023-12-16 18:44:25 +01:00
Viktor Lofgren	e13fa25e11	(assistant) Clean up the site info related domains view by filtering viable domains	2023-12-16 18:37:09 +01:00
Viktor Lofgren	34d4834ff6	(assistant) Clean up the site info related domains view by filtering viable domains	2023-12-16 18:27:24 +01:00
Viktor Lofgren	440e097d78	(crawler) WIP integration of WARC files into the crawler and converter process. This commit is in a pretty rough state. It refactors the crawler fairly significantly to offer better separation of concerns. It replaces the zstd compressed json files used to store crawl data with WARC files entirely, and the converter is modified to be able to consume this data. This works, -ish. There appears to be some bug relating to reading robots.txt, and the X-Robots-Tag header is no longer processed either. A problem is that the WARC files are a bit too large. It will probably be necessary to introduce a new format to store the crawl data long term, something like parquet; and use WARCs for intermediate storage to enable the crawler to be restarted without needing a recrawl.	2023-12-13 15:33:42 +01:00
Viktor Lofgren	45987a1d98	Merge branch 'master' into warc	2023-12-11 14:32:35 +01:00
Viktor Lofgren	f655ec5a5c	(*) Refactor GeoIP-related code In this commit, GeoIP-related classes are refactored and relocated to a common library as they are shared across multiple services. The crawler is refactored to enable the GeoIpBlocklist to use the new GeoIpDictionary as the base of its decisions. The converter is modified to query this data to add a geoip:-keyword to documents to permit limiting a search to the country of the hosting server. The commit also adds due BY-SA attribution in the search engine footer for the source of the IP geolocation data.	2023-12-10 17:30:43 +01:00
Viktor Lofgren	91dd45cf64	(search) IP and IP geolocation in site info view This commit also fixes a bug in the loader where the IP field wouldn't always populate as intended, and refactors the DomainInformationService to use significantly fewer SQL queries.	2023-12-09 20:06:55 +01:00
Viktor Lofgren	e3ebb0c5bb	(*) Rename the search filter 'RETRO' into 'POPULAR' This will make the terminology more consistent between the GUI and the code. The rankings yaml still uses 'retro' though, to retain compatibility.	2023-12-09 20:06:54 +01:00
Viktor Lofgren	8ef34883a8	(search) Move site information out of the search service and into assistant. This reduces the impact of restarting the search service, as the site information takes a few minutes to load during which it's not available. It also permits exposing this information via API in the future if there is interest in this. The assistant service was also modified to do a late load of the suggestions trie, as this is a major contributor to its start-up time. Finally, some changes were made to the client library, a new get() method was added that takes a TypeToken to allow deserialization of generics such as List<Foo>, and the scheduler was also modified to use virtual threads.	2023-12-09 16:30:06 +01:00
Viktor Lofgren	eccb12b366	(control) Fix spurious state detection in control-side actors A race condition was found where precession actors would sometimes skip a step, because when invoking ExecutorRemoteActor.getState(), it would get the last 'OK' actor state from a previous run of the actor! To avoid this, the trigger method was changed from returning a boolean to the message ID, negative if an error occurred, to be passed to getState to select only messages that pertain to the present or future runs.	2023-12-09 12:50:05 +01:00
Viktor Lofgren	cc813a5624	(convert) Add basic support for Warc file sideloading This update includes the integration of the jwarc library and implements support for Warc file sideloading, as a first trial integration with this library.	2023-12-06 18:43:55 +01:00
Viktor Lofgren	01621c6344	(renderer) Make helpers configurable on a by-service basis.	2023-12-02 17:06:40 +01:00
Viktor Lofgren	c7934342a6	(control) Automatic recrawl	2023-12-02 17:06:24 +01:00
Viktor Lofgren	f5c324c06b	(minor) Fix broken test	2023-12-01 17:44:39 +01:00
Viktor Lofgren	67a1e1c874	(control) GUI for triggering control-side actors	2023-11-29 15:31:14 +01:00
Viktor Lofgren	4155fbe94c	(control) Reprocess-all actor	2023-11-28 17:58:48 +01:00
Viktor Lofgren	347fe6b7be	(control) Reindex-all actor	2023-11-28 16:41:09 +01:00
Viktor Lofgren	ff3ceb981e	(control) Button for removing a stale 'NEW' status If a process is violently terminated, the associated file storage may get stuck in the ephemeral 'NEW' state, preventing future operations on the associated data. To remedy this without having to dig through the database, a button was added to reset the state. It's a band-aid, but the situation is rare enough that I think it's fine.	2023-11-28 15:18:24 +01:00
Viktor Lofgren	1dafa0c74d	(mqapi/control) Repair repartition endpoint, deprecate notify endpoints. The repartition endpoint was mis-addressing its mqapi notifications, omitting the proper nodeId. In fixing this, it became apparent that having both @MqRequest and @MqNotification is a serious footgun, and the two should be unified into a single API where the caller isn't burdened with knowledge of the remote end's implementation specifics.	2023-11-27 16:01:12 +01:00
Viktor Lofgren	dd9406d0ac	(control) Make storage type tabs consistent This had fallen off in the Create New Specification view, it lacked Exports.	2023-11-17 11:26:45 +01:00
Viktor Lofgren	e9a01caa5c	(index) Fix broken metrics	2023-11-11 12:53:47 +01:00
Viktor Lofgren	858357a246	(metrics) Get prometheus up out of disrepair * Fix bad labels * Add nodeId where appropriate * Hopefully fix histogram buckets for index query times	2023-11-08 14:01:28 +01:00
Viktor Lofgren	0152004c42	Initial Commit Anchor Tags * Added new (optional) model file in $WMSA_HOME/data/atags.parquet * Converter gets a component for creating a projection of its domains onto the full atags parquet file * New WordFlag ExternalLink * These terms are also for now flagged as title words * Fixed a bug where Title words aliased with UrlDomain words * Fixed a bug in the encyclopedia sideloader that gave everything too high topology ranking	2023-11-04 14:24:17 +01:00
Viktor Lofgren	8e9698c9a0	(control/search) Add ability to suggest removing a site from random exploration This is what most complaints have been about.	2023-11-02 15:29:49 +01:00
Viktor Lofgren	3047e2dd7c	(screenshot-capture-tool) Make screenshot-capture-tool cooperate with docker	2023-11-01 16:38:55 +01:00
Viktor Lofgren	a8b9d21f2d	(executor) Refine atag export logic * Remove obviously uninteresting tags * Omit URL schema for more sensible sorting * Change the column order to put the source domain last	2023-11-01 13:23:14 +01:00
Viktor Lofgren	c77a5b7cb6	(control) GUI for atags export	2023-10-31 17:55:47 +01:00
Viktor Lofgren	23f2068e33	(executor) Actor for exporting anchor tag data from crawl data	2023-10-31 17:32:34 +01:00
Viktor Lofgren	ffadfb4149	(control) Use a partial template for the storage types tabs.	2023-10-31 17:12:14 +01:00
Viktor Lofgren	b7e38cfbae	(control) Add exports view	2023-10-31 17:08:48 +01:00
Viktor Lofgren	659743b39c	(executor) Export Data actor allocates its own storage	2023-10-31 17:04:07 +01:00
Viktor Lofgren	69758c5859	(control) Nicer redirects acknowledging actions	2023-10-31 16:26:29 +01:00
Viktor Lofgren	2871a326e6	(ctrl/exe) Clean up UX and code	2023-10-29 14:09:39 +01:00
Viktor Lofgren	abb42f0f36	(crawler) Fix bug in SQL statement Arguments were in the wrong order when inserting sites submitted to be crawled	2023-10-29 13:19:17 +01:00
Viktor Lofgren	88f49834fd	(docs) Update documentation	2023-10-27 12:45:39 +02:00
Viktor Lofgren	c7cb6664b4	(control) Indicate missing services with danger-color instead of having a distracting and constantly updating last-seen number	2023-10-26 18:05:22 +02:00
Viktor Lofgren	79adba9284	(index) Fix bug in dealing with quoted search terms	2023-10-26 16:28:23 +02:00
Viktor Lofgren	f613f4f2df	(array) Fix spurious search results This was caused by a bug in the binary search algorithm causing it to sometimes return positive values when encoding a search miss. It was also necessary to get rid of the vestiges of the old LongArray and IntArray classes to make this fix doable.	2023-10-26 15:27:02 +02:00
Viktor Lofgren	abbadc92a0	(executor) Prevent TriggerAdjacencyCalculationActor from showing up in the actions tab when it isn't running	2023-10-25 21:25:07 +02:00
Viktor Lofgren	97fcbdd6d9	(control) Move storage actions into the actions tab * Also disable annoying CSS animations	2023-10-25 21:23:56 +02:00
Viktor Lofgren	d7686b665e	Refactoring * Encyclopedia sideloader; permit providing base URL. * Storage base shows node id in GUI * ProcessLivenessMonitorActor restarts automatically * Clean-up of outbox code	2023-10-25 18:51:02 +02:00
Viktor Lofgren	84cdac83d6	(control) Move message queue monitor to control	2023-10-24 16:44:28 +02:00
Viktor Lofgren	313cc2965c	(index-creation) Print whether full or prio is created Previous state of saying reverse index for both was pretty confusing.	2023-10-24 16:23:10 +02:00
Viktor Lofgren	95f74c5ea7	(control) Filter out heartbeats that are stopped	2023-10-24 16:09:28 +02:00
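
Commit f613f4f2df describes a binary search that sometimes returned positive values when encoding a search miss. The log does not say which encoding the project uses, so the following is only a minimal sketch of one common convention, the one used by the JDK's `Arrays.binarySearch`: a miss is returned as `-(insertionPoint + 1)`, which is strictly negative even when the insertion point is 0. The class and method names (`MissEncodingSearch`, `find`, `isHit`, `insertionPoint`) are hypothetical and do not come from the repository.

```java
// Hypothetical illustration; not the project's array library.
public final class MissEncodingSearch {
    private MissEncodingSearch() {}

    /**
     * Binary search over a sorted array.
     * Returns the index of key on a hit, or -(insertionPoint + 1) on a miss.
     * The "+ 1" keeps a miss at insertion point 0 distinguishable from a hit at index 0.
     */
    public static int find(long[] sorted, long key) {
        int low = 0;
        int high = sorted.length - 1;
        while (low <= high) {
            int mid = (low + high) >>> 1; // unsigned shift avoids int overflow
            long value = sorted[mid];
            if (value < key) {
                low = mid + 1;
            } else if (value > key) {
                high = mid - 1;
            } else {
                return mid; // hit: always >= 0
            }
        }
        return -(low + 1); // miss: always < 0
    }

    /** True if the result of find() denotes a hit. */
    public static boolean isHit(int result) {
        return result >= 0;
    }

    /** Decodes the insertion point from a miss result of find(). */
    public static int insertionPoint(int result) {
        return -(result + 1);
    }
}
```

With this convention a caller only has to test the sign of the result. A miss that leaks out as a non-negative number would be read as a valid index, which is one way spurious results of the kind the commit mentions could arise.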

255 Commits
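
Commit eccb12b366 describes a race in which `getState()` could pick up the last 'OK' state from a previous run, and a fix where the trigger method returns a message ID that is then passed to `getState`. The sketch below illustrates that idea in isolation, assuming message IDs increase monotonically. `ActorStateInbox`, `post`, and the `reachable` flag are hypothetical names invented for this example, and the real `ExecutorRemoteActor` API may look quite different.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Hypothetical illustration; not the project's ExecutorRemoteActor.
public final class ActorStateInbox {
    private static final class Message {
        final long id;
        final String state;

        Message(long id, String state) {
            this.id = id;
            this.state = state;
        }
    }

    private final List<Message> messages = new ArrayList<>();
    private long nextId = 1;

    /** Appends a state message and returns its id; ids increase monotonically. */
    public synchronized long post(String state) {
        long id = nextId++;
        messages.add(new Message(id, state));
        return id;
    }

    /** Starts a new run. Returns the trigger message id, or -1 if an error occurred. */
    public synchronized long trigger(boolean reachable) {
        if (!reachable) {
            return -1; // callers must check for a negative id before polling
        }
        return post("TRIGGERED");
    }

    /**
     * Returns the most recent state posted after the given trigger id.
     * States from earlier runs have smaller ids and are ignored.
     */
    public synchronized Optional<String> getState(long triggerId) {
        if (messages.isEmpty()) {
            return Optional.empty();
        }
        Message last = messages.get(messages.size() - 1);
        return last.id > triggerId ? Optional.of(last.state) : Optional.empty();
    }
}
```

A caller would keep the id returned by `trigger` and poll `getState(id)` until a state appears. A stale 'OK' from an earlier run carries a smaller id, so it is never mistaken for the result of the current run.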