CatgirlIntelligenceAgency

Author	SHA1	Message	Date
Viktor Lofgren	085137ca63	* Extract the index functionality	2024-02-22 17:31:25 +01:00
Viktor Lofgren	3fd2a83184	* Extract the search-query function	2024-02-22 15:27:39 +01:00
Viktor Lofgren	66c1281301	(zk-registry) epic yak shaving WIP Cleaning out a lot of old junk from the code, and one thing led to another... * Build is improved, now constructing docker images with 'jib'. Clean build went from 3 minutes to 50 seconds. * The ProcessService's spawning is smarter. Will now just spawn a java process instead of relying on the application plugin's generated outputs. * Project is migrated to GraalVM * gRPC clients are re-written with a neat fluent/functional style. e.g. ```channelPool.call(grpcStub::method) .async(executor) /* <-- optional */ .run(argument); ``` This change is primarily to allow handling ManagedChannel errors, but it turned out to be a pretty clean API overall. * For now the project is all in on zookeeper * Service discovery is now based on APIs and not services. Theoretically means we could ship the same code as either a monolith or a service mesh. * To this end, began modularizing a few of the APIs so that they aren't strongly "living" in a service. WIP! Missing is documentation and testing, and some more breaking apart of code.	2024-02-22 14:01:23 +01:00
Viktor Lofgren	73947d9eca	(zk-registry) Filter out phantom addresses in the registry The change adds a hostname validation step to remove endpoints from the ZkServiceRegistry when they do not resolve. This is a scenario that primarily happens when running in docker, and the entire system is started and stopped.	2024-02-20 18:09:11 +01:00
Viktor Lofgren	a69c0b2718	(grpc-client) Fix warmup crash The warmup would sometimes crash during a cold start-up, because it could not get an API. Changed the warmup to just create a GrpcSingleNodeChannelPool for the node.	2024-02-20 18:03:57 +01:00
Viktor Lofgren	6c764bceeb	(doc) Update documentation for `service-discovery`	2024-02-20 16:09:49 +01:00
Viktor Lofgren	273aeb7bae	(doc) Update documentation with new gRPC service setup	2024-02-20 16:06:05 +01:00
Viktor Lofgren	d185858266	(minor) Add missing query parameter to ServiceEndpoint.toURL	2024-02-20 15:49:43 +01:00
Viktor Lofgren	453bd6064b	(minor) Add warm-up to GrpcMultiNodeChannelPool to speed up the initial messages Without doing this, connections would be created lazily, which is probably never desirable.	2024-02-20 15:45:16 +01:00
Viktor Lofgren	904f2587cd	(minor) Add default ZOOKEEPER_HOSTS to service.env	2024-02-20 15:44:26 +01:00
Viktor Lofgren	14172312dc	(query-client) Fix query client The query service delegates and aggregates IndexDomainLinksApiGrpc messages to the index services. The query client was accidentally also doing this, instead of talking to the query service. Fixed so it correctly talks to the query service and nothing else.	2024-02-20 15:44:07 +01:00
Viktor Lofgren	c600d7aa47	(refac) Inject ServiceRegistry into WebsiteAdjacenciesCalculator	2024-02-20 15:42:32 +01:00
Viktor Lofgren	3c9234078a	(refac) Propagate ZOOKEEPER_HOSTS to spawned processes	2024-02-20 15:42:16 +01:00
Viktor Lofgren	ee8e0497ae	(refac) Move service discovery injection to a separate guice module	2024-02-20 15:41:04 +01:00
Viktor Lofgren	fd5d121648	(minor) Add WMSA_IN_DOCKER to all docker files	2024-02-20 15:39:46 +01:00
Viktor Lofgren	30bdb4b4e9	(config) Clean up service configuration for IP addresses Adds new ways to configure the bind and external IP addresses for a service. Notably, if the environment variable WMSA_IN_DOCKER is present, the system will grab the HOSTNAME variable and announce that as the external address in the service registry. The default bind address is also changed to be 0.0.0.0 only if WMSA_IN_DOCKER is present, otherwise 127.0.0.1; as this is a more secure default.	2024-02-20 14:22:48 +01:00
Viktor Lofgren	2ee492fb74	(gRPC) Bind gRPC services to an interface By default, gRPC magically decides on an interface. The change will explicitly tell it what to use.	2024-02-20 14:22:47 +01:00
Viktor Lofgren	36a5c8b44c	(cleanup) Clean up code	2024-02-20 14:22:47 +01:00
Viktor Lofgren	07b625c58d	(query-client) Add support for fault-tolerant requests to single node services Adding a method importantCall that will retry a failing request on each route until it succeeds or the routes run out.	2024-02-20 14:16:05 +01:00
Viktor Lofgren	746a865106	(client) Fix handling of channel refreshes The previous code made an incorrect assumption that all routes refer to the same node, and would overwrite the route list on each update. This led to storms of closing and opening channels whenever an update was received. The new code is correctly aware that we may talk to multiple nodes.	2024-02-20 14:14:09 +01:00
Viktor	f85ec28a16	Merge branch 'master' into service-discovery	2024-02-20 11:44:12 +01:00
Viktor Lofgren	0307c55f9f	(refac) Zookeeper for service-discovery, kill service-client lib (WIP) To avoid having to either hard-code or manually configure service addresses (possibly several dozen), and to reduce the project's dependency on docker to deal with routing and discovery, the option to use [Zookeeper](https://zookeeper.apache.org/) to manage services and discovery has been added. A service registry interface was added, with a Zookeeper implementation and a basic implementation that only works on docker and hard-codes everything. The last remaining REST service, the assistant-service, has been migrated to gRPC. This also proved a good time to clear out primordial technical debt from the root of the codebase. The 'service-client' library has been taken behind the barn and given a last farewell. It's replaced by a small library for managing gRPC channels. Since it's no longer used by anything, RxJava has been removed as a dependency from the project. Although the current state seems reasonably stable, this is a work-in-progress commit.	2024-02-20 11:41:14 +01:00
Viktor	d05c916491	Merge pull request #80 from MarginaliaSearch/ranking-algorithms Clean up domain ranking code	2024-02-18 09:52:34 +01:00
Viktor Lofgren	c73e43f5c9	(recrawl) Mitigate recrawl-before-load footgun In the scenario where an operator * Performs a new crawl from spec * Doesn't load the data into the index * Recrawls the data The recrawl will not find the domains in the database, and the crawl log will be overwritten with an empty file, irrecoverably losing the crawl log and making it impossible to load! To mitigate the impact of similar problems, the change saves a backup of the old crawl log, as well as complains about this happening. More specifically to this exact scenario however, the parquet-loaded domains are also preemptively inserted into the domain database at the start of the crawl. This should help the DbCrawlSpecProvider to find them regardless of loaded state. This may seem a bit redundant, but losing crawl data is arguably the worst type of disaster scenario for this software, so it's arguably merited.	2024-02-18 09:23:20 +01:00
Viktor Lofgren	e61e7f44b9	(blacklist) Delay startup of blacklist To help services start faster, the blacklist will no longer block until it's loaded. If such a behavior is desirable, a method was added to explicitly wait for the data.	2024-02-18 09:23:20 +01:00
Viktor Lofgren	f9b6ac03c6	(api) Clean up incorrect error handling in GrpcChannelPool	2024-02-18 08:45:35 +01:00
Viktor Lofgren	296ccc5f8e	(blacklist) Clean up blacklist impl The domain blacklist blocked the start-up of each process that injected it, adding like 30 seconds to the start-up time in prod. This change moves the loading to a separate thread entirely. For threads or processes that require the blacklist to be definitely loaded, a helper method was added that blocks until that time.	2024-02-18 08:16:48 +01:00
Viktor Lofgren	8cb5825617	(search) Temporarily disable the Popular filter This filter currently does not distinguish itself very much from the unfiltered results, and lends the impression that the filters don't "do anything". It may come back in some shape or form in the future, with some additional tweaking of the rankings...	2024-02-18 08:02:01 +01:00
Viktor Lofgren	cee707abd8	(crawler) Implement domain shuffling in DbCrawlSpecProvider Modified the DbCrawlSpecProvider to shuffle domains after loading to ensure a good mix for each crawl. This change prevents overload of crawling the same server in parallel from different subdomains or crawling big domains all at once.	2024-02-17 17:47:38 +01:00
Viktor Lofgren	92717a4832	(client) Refactor GrpcStubPool to handle error states Refactored the GRPC Stub Pool for better handling of channel SHUTDOWN state. Any disconnected channels are now re-created before returning the stub. The class was also renamed to GrpcChannelPool, as we no longer pool the stubs.	2024-02-17 14:42:26 +01:00
Viktor Lofgren	37a7296759	(sideload) Clean up the sideloading code Clean up the sideloading code a bit, making the Reddit sideloader use the more sophisticated SideloaderProcessing approach to sideloading, instead of mimicking StackexchangeSideloader's cruder approach. The reddit sideloader now uses the SideloaderProcessing class. It also properly sets js-attributes for the sideloaded documents. The control GUI now also filters the upload directory items based on name, and disables the items that do not have appropriate filenames.	2024-02-17 14:32:36 +01:00
Viktor Lofgren	ebbe49d17b	(sideload) Fix sideloading of explicitly selected stackexchange files Fix a bug where sideloading stackexchange files by explicitly selecting the 7z file would fail, since the 7z file would be passed along to the converter rather than the path to the pre-converted .db file.	2024-02-17 13:24:04 +01:00
Viktor Lofgren	b7e330855f	(control) Update descriptive text in the control GUI	2024-02-16 20:32:31 +01:00
Viktor Lofgren	ac89224fb0	(domain-ranking) Remove lingering mentions of the algorithms field from the GUI	2024-02-16 20:28:37 +01:00
Viktor Lofgren	9ec262ae00	(domain-ranking) Integrate new ranking logic The change deprecates the 'algorithm' field from the domain ranking set configuration. Instead, the algorithm will be chosen based on whether influence domains are provided, and whether similarity data is present.	2024-02-16 20:22:01 +01:00
Viktor Lofgren	64acdb5f2a	(domain-ranking) Clean up domain ranking The domain ranking code was admittedly a bit of a clown fiesta; at the same time buggy, fragile and inscrutable. Migrating over to use JGraphT to store the link graph when doing rankings, and using their PageRank implementation. Also added a modified version that does PersonalizedPageRank.	2024-02-16 18:04:58 +01:00
Viktor Lofgren	a175b36382	(search) Correct accidental regression of the SmallWeb filter	2024-02-15 18:16:56 +01:00
Viktor Lofgren	16526d283c	(search) Correct accidental regression of the Vintage filter	2024-02-15 18:13:34 +01:00
Viktor Lofgren	752e677555	(search) Expose getSearchTitle in DecoratedSearchResults	2024-02-15 13:56:44 +01:00
Viktor Lofgren	f796af1ae8	(search) Fix failed refactoring	2024-02-15 13:53:19 +01:00
Viktor Lofgren	2515993536	(search) Fix issue where searchTitle setting gets lost when searching again It's important that the field names in SearchParameters match the fields referenced in search-form.hdb, otherwise they will get lost in transit.	2024-02-15 13:52:11 +01:00
Viktor Lofgren	66b3e71e56	(search) Expose more search options This change set updates the query APIs to enable the search service to add additional criteria, such as QueryStrategy and TemporalBias. The QueryStrategy makes it possible to e.g. require that a match is in the title of a result, and TemporalBias enables penalizing results that are not within a particular time period. These options are added to the search interface. The old 'recent results' is modified to use TemporalBias, and a new filter 'Search In Title' is added as well. The vintage filter is modified to add a temporal bias for the past.	2024-02-15 13:39:51 +01:00
Viktor Lofgren	652d151373	(process-models) Improve documentation	2024-02-15 12:21:12 +01:00
Viktor Lofgren	300b1a1b84	(index-query) Add some tests for the QueryFilter code	2024-02-15 12:03:30 +01:00
Viktor Lofgren	6c3b49417f	(index-query) Improve documentation and code quality	2024-02-15 11:33:50 +01:00
Viktor Lofgren	dcc5cfb7c0	(index-journal) Improve documentation and code quality	2024-02-15 10:51:49 +01:00
Viktor	d970836605	Merge pull request #79 from MarginaliaSearch/reddit (converter) Loader for reddit data Adds experimental sideloading support for pushshift.io style reddit data. This dataset is limited to data older than 2023, due to licensing changes making large-scale data extraction difficult. Since the median post quality on reddit is not very good, the sideloader will only load a subset of self-texts and top-level comments that have sufficiently many upvotes. Empirically this appears to mostly return good matches, even if it probably could index more. Tests were written for this, but all require local reddit data which can't be distributed with the source code. If these cannot be found, the tests will short-circuit as OK. They're mostly there for debugging, and it's fine if they don't always run. The change also refactors the sideloading a bit since it was a bit messy, and improves the sideload UX a tiny bit.	2024-02-15 09:17:56 +01:00
Viktor Lofgren	8021bd0aae	(control) Sort upload listing results Improve the UX of the sideload GUI by sorting the results in a sensible fashion, first by whether it's a directory, then by its filename. The change also changes the timestamp rendering to a more human-readable format than full ISO-8601.	2024-02-15 09:13:40 +01:00
Viktor Lofgren	8f91156d80	(control) Improve sideload UX The sideload forms didn't properly set the label 'for' property, meaning that while label tags existed, they weren't appropriately clickable. Also removed unnecessary limits on the sideload target being a directory for stackexchange and warc. It's been possible to directly load a particular file for a while, but not allowed due to GUI limits.	2024-02-14 18:38:20 +01:00
Viktor Lofgren	fab36d6e63	(converter) Loader for reddit data Adds experimental sideloading support for pushshift.io style reddit data. This dataset is limited to data older than 2023, due to licensing changes making large-scale data extraction difficult. Since the median post quality on reddit is not very good, the sideloader will only load a subset of self-texts and top-level comments that have sufficiently many upvotes. Empirically this appears to mostly return good matches, even if it probably could index more. Tests were written for this, but all require local reddit data which can't be distributed with the source code. If these cannot be found, the tests will short-circuit as OK. They're mostly there for debugging, and it's fine if they don't always run. The change also refactors the sideloading a bit since it was a bit messy.	2024-02-14 17:35:44 +01:00
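
The (config) commit 30bdb4b4e9 in the listing above describes how the bind and external addresses depend on WMSA_IN_DOCKER. A minimal sketch of that decision might look like the following. The class and method names are hypothetical and not taken from the repository, and the non-docker external address is an assumption, since the commit message does not state it.

```java
import java.util.Map;

/** Hypothetical sketch of the address selection described in commit 30bdb4b4e9. */
class BindAddressSketch {

    /** Bind to all interfaces only inside docker, otherwise to loopback (the safer default). */
    static String bindAddress(Map<String, String> env) {
        return env.containsKey("WMSA_IN_DOCKER") ? "0.0.0.0" : "127.0.0.1";
    }

    /** Address announced to the service registry. */
    static String externalAddress(Map<String, String> env) {
        if (env.containsKey("WMSA_IN_DOCKER")) {
            // Inside docker, the HOSTNAME variable is announced.
            // Assumption: fall back to loopback if HOSTNAME is unset.
            return env.getOrDefault("HOSTNAME", "127.0.0.1");
        }
        // Assumption: outside docker the loopback address is announced.
        return "127.0.0.1";
    }

    public static void main(String[] args) {
        Map<String, String> env = System.getenv();
        System.out.println("bind=" + bindAddress(env)
                + " external=" + externalAddress(env));
    }
}
```

Passing the environment in as a map, rather than reading System.getenv() inside the methods, keeps the decision easy to exercise without changing the real process environment.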

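Commit 07b625c58d in the listing above mentions a method importantCall that retries a failing request on each route until it succeeds or the routes run out. The sketch below illustrates that idea only. It is not the project's implementation, and the names callOnAnyRoute and RetryOverRoutesSketch, as well as the route strings, are hypothetical.

```java
import java.util.List;
import java.util.function.Function;

/** Hypothetical sketch of retrying a request across routes, as described in commit 07b625c58d. */
class RetryOverRoutesSketch {

    /**
     * Tries the call on each route in order and returns the first successful result.
     * Throws IllegalStateException if every route fails or no routes are given.
     */
    static <R, T> T callOnAnyRoute(List<R> routes, Function<R, T> call) {
        RuntimeException last = null;
        for (R route : routes) {
            try {
                return call.apply(route);
            } catch (RuntimeException e) {
                last = e; // remember the failure and move on to the next route
            }
        }
        throw new IllegalStateException("All routes failed", last);
    }

    public static void main(String[] args) {
        // Made-up routes; the first one simulates an unreachable node.
        List<String> routes = List.of("node-1:8080", "node-2:8080");
        String reply = callOnAnyRoute(routes, route -> {
            if (route.startsWith("node-1")) {
                throw new RuntimeException("unreachable: " + route);
            }
            return "reply from " + route;
        });
        System.out.println(reply);
    }
}
```

Keeping the last exception as the cause of the final failure preserves some diagnostic context when every route is down.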
1776 Commits