CatgirlIntelligenceAgency

Author	SHA1	Message	Date
Viktor Lofgren	22c8fb3f59	(crawler) Fix a bug where reference copies of crawl data were written without etag and last-modified. This commit also adds a band-aid to ParquetSerializableCrawlDataStream to fetch this from the 304-entity. This can be removed in a few months.	2024-01-18 16:02:27 +01:00
Viktor Lofgren	964419803a	Fix broken test	2024-01-18 15:42:01 +01:00
Viktor Lofgren	6271d5d544	(mq) Add relation tracking between MQ messages for easier tracking and debugging. The change adds a new column to the MESSAGE_QUEUE table called AUDIT_RELATED_ID. This field is populated transparently, using a dictionary mapping Thread IDs to Message IDs, populated by the inbox handlers. The existing RELATED_ID field has too many semantics associated with it; among other things, the FSM code uses this field in tracking state changes. The change set also improves the consistency of inbox names. The IndexClient was buggy and populated its outbox with a UUID. This is fixed. All Service2Service outboxes are now prefixed with 'pp:' to make them even easier to differentiate.	2024-01-18 15:08:27 +01:00
Viktor Lofgren	175bd310f5	(control) Message queue UX improvements	2024-01-18 13:05:50 +01:00
Viktor Lofgren	67ee6f4126	(control) Clean up filtering UX in Events table	2024-01-18 12:35:39 +01:00
Viktor Lofgren	01b312f14c	(*) Make new index nodes accept queries by default It's a confusing default behavior. This was off for nodes n>1 before as a bandaid since querying indices with no data caused delays and errors. This has been fixed now, so there's no need to do this anymore!	2024-01-18 12:05:37 +01:00
Viktor Lofgren	18638c62de	(control) Rephrase text	2024-01-18 11:53:10 +01:00
Viktor Lofgren	753d000788	(control) Add toggle for automatic loading of processed data	2024-01-18 11:52:58 +01:00
Viktor Lofgren	19e781b104	(control) Add basic input validation to node actions Will present a simple error message when required fields aren't populated, instead of a cryptic HTTP status error.	2024-01-18 11:52:49 +01:00
Viktor Lofgren	aa2df327db	(index) Prevent index from attempting to answer queries when no index data is loaded This improves query times, and gets rid of exceptions in the logs when one of the index nodes doesn't have any data loaded, yet is configured to answer queries.	2024-01-18 11:05:45 +01:00
Viktor Lofgren	321fa94b8f	(crawler) Fix rare exception in content type handling due to improper length checking of a split() array	2024-01-18 11:05:45 +01:00
Viktor Lofgren	41cdb8f71b	(control) Fix broken update button in the update-domain-ranking-set form id property was on the wrong element.	2024-01-17 18:21:09 +01:00
Viktor Lofgren	304d4c9acf	(control) Fix result ordering in the file storage listing view In some scenarios, such as when restoring storage items from json-manifest on db failure, the file storage view would present the items in a non-chronological order. Added a sort() operation to mitigate this.	2024-01-17 10:56:30 +01:00
Viktor Lofgren	7fd4c092e3	(control) Clean up UX and accessibility for new domain ranking sets. The change also adds basic support for error messages in the GUI.	2024-01-17 10:47:14 +01:00
Viktor Lofgren	2fe5705542	(control) GUI for ranking sets Still missing is some polish, forms don't have proper labels, validation is inconsistent, no error messages, etc.	2024-01-16 17:10:09 +01:00
Viktor Lofgren	e968365858	(index) Use new DomainRankingSets to configure ranking algos in index svc	2024-01-16 12:43:32 +01:00
Viktor Lofgren	36ad4c7466	(db) Add a new configuration object 'domain ranking set' for storing ranking parameters	2024-01-16 12:34:00 +01:00
Viktor Lofgren	5a62b3058f	(query-api) Make the search set identifier a string value in the API This will free the core marginalia search engine to use arbitrary search set definitions, while the app can use its hardcoded defaults.	2024-01-16 10:55:24 +01:00
Viktor Lofgren	a1df9e886a	(control) Also clean up stale 'NEW' messages	2024-01-15 16:14:02 +01:00
Viktor Lofgren	fd1eec99b5	(cleanup) Fix broken tests	2024-01-15 15:44:33 +01:00
Viktor Lofgren	e162406d40	(control) New control-side actors for cleaning up stale service heartbeats and message queue entries	2024-01-15 15:44:23 +01:00
Viktor Lofgren	c41e68aaab	(control) New export actions for RSS/Atom feeds and term frequency data This commit also refactors the executor a bit, and introduces a new converter-feature called data-extractors for this class of jobs.	2024-01-15 14:54:26 +01:00
Viktor Lofgren	4665af6c42	(control) Move export data endpoint to actions controller	2024-01-15 11:06:22 +01:00
Viktor Lofgren	c0b15427fe	(control) New crawl view should use radio buttons as multiple specs aren't supported	2024-01-15 11:03:47 +01:00
Viktor Lofgren	f29a9d972d	(control) Move 'new crawl spec' to /node/:id/actions, out of /node/:id/storage	2024-01-15 11:02:00 +01:00
Viktor Lofgren	b192373ae7	(control) Highlight unavailable items (creating, deleting) in node actions views	2024-01-15 10:47:54 +01:00
Viktor Lofgren	c042650382	(docs) Improve query service documentation	2024-01-13 21:16:45 +01:00
Viktor Lofgren	07a916a720	(search) Give the swipe hint on mobile a nicer finish	2024-01-13 18:51:54 +01:00
Viktor Lofgren	5134044530	(assistant) Make assistant client more robust to the service going down This is especially important for the non-essential functions, like website similarities...	2024-01-13 18:29:30 +01:00
Viktor Lofgren	4c62065e74	(install) Add two separate templates for the install script One template is for the full Marginalia Search style install, and the other is for a barebones install with no Marginalia-related fluff.	2024-01-13 18:27:42 +01:00
Viktor Lofgren	d28fc99119	(MainClass) ensure logging isn't loaded before service name is known. Loading it too early causes all logs to have names like ${sys:service-name}, instead of the service name...	2024-01-13 18:19:50 +01:00
Viktor Lofgren	c9fb45c85f	(search) Fix control.hideMarginaliaApp handling	2024-01-13 17:24:15 +01:00
Viktor Lofgren	7c6e18f7a7	(*) Overhaul settings and properties. Use a system.properties file to configure the system. This is loaded statically by MainClass or ProcessMainClass. Update the property names to be more consistent, and update the documentation to reflect the changes.	2024-01-13 17:12:18 +01:00
Viktor Lofgren	176b9c9666	(convert) Add sizeHints to legacy serializable crawl data stream. This reduces the maximum memory usage when processing legacy crawl data	2024-01-13 15:50:36 +01:00
Viktor Lofgren	ecd9c35233	(control) Clean up the event log * Generate fewer uninteresting event messages. * Display fewer irrelevant fields in the overview table.	2024-01-13 13:28:02 +01:00
Viktor Lofgren	71e32c57d9	(control) Add better timestamps for the events and message queue views Adjust display precision based on distance into the past, full ms-accurate timestamps available via hover-action.	2024-01-13 13:04:56 +01:00
Viktor Lofgren	2fefd0e4e3	(control) Add better timestamps for the events and message queue views Adjust display precision based on distance into the past, full ms-accurate timestamps available via hover-action.	2024-01-13 13:03:52 +01:00
Viktor Lofgren	81eaf79a25	(control) UX polish	2024-01-13 12:31:13 +01:00
Viktor Lofgren	8dea7217a6	(control) UX fixes, node GUI doesn't break when an executor service goes offline.	2024-01-13 12:17:30 +01:00
Viktor Lofgren	c0fb9e17e8	(control) Add filter dropdown to message queue table. This makes inspecting the queues for processes much easier, as otherwise these important messages are often drowned out by FSM chatter.	2024-01-12 18:46:17 +01:00
Viktor Lofgren	83776a8dce	(control) Wean the ExportDataActor off EC_DOMAIN_LINK The EC_DOMAIN_LINK table is deprecated and slated for removal, use QueryClient.getAllDomainLinks() instead. The ExportDataActor now uses the QueryClient appropriately. The CSV format was also changed to quote the values, to prevent e.g. Excel from interpreting the comma as a decimal separator when previewing the file. Finally the form for triggering an export was overhauled.	2024-01-12 17:09:11 +01:00
Viktor Lofgren	98c0972619	(control) Add a summary table for Actors in the Node overview	2024-01-12 16:32:15 +01:00
Viktor Lofgren	56d832d661	(control) Adjust the margins of the headings to be consistent	2024-01-12 16:16:57 +01:00
Viktor Lofgren	de3a350afe	(control) Disable broken actions and mark the actions view as WIP	2024-01-12 16:16:39 +01:00
Viktor Lofgren	708a741960	(test) Clean up test usage of migrations. Several tests were manually running migrations in a large copy-paste blob of code. This makes the tests less useful, as it's possible to break the code while keeping the tests green by introducing a new migration that never gets run in the tests, and it's also difficult to reason about what the tests are doing. A new test helper library is introduced with a TestMigrationLoader that can either run all Flyway migrations, or load specific migrations in the cases where a specific set of migrations needs to be loaded. Existing tests are migrated to use the new code.	2024-01-12 15:55:50 +01:00
Viktor Lofgren	0caef1b307	(warc) Toggle for saving WARC data Add a toggle for saving the WARC data generated by the search engine's crawler. Normally this is discarded, but for debugging or archival purposes, retaining it may be of interest. The warc files are concatenated into larger archives, up to about 1 GB each. An index is also created containing filenames, domain names, offsets and sizes to help navigate these larger archives. The warc data is saved in a directory warc/ under the crawl data storage.	2024-01-12 13:45:14 +01:00
Viktor Lofgren	264e2db539	(control) UX-improvements for control service This commit overhauls a lot of the UX for the control service, adding a new actions menu to the nodes views. It has many small tweaks to make the work flow better. It also adds a new /uploads directory in each index node, from which sideloaded data can be selected. This is a bit of a breaking change, as this directory needs to exist in each index node.	2024-01-12 12:33:05 +01:00
Viktor Lofgren	734996002c	(*) install script for deploying Marginalia outside the codebase The changeset also makes the control service responsible for flyway migrations. This helps reduce the number of places the database configuration needs to be spread out. These automatic migrations can be disabled with -DdisableFlyway=true. The commit also adds curl to the docker container, to enable docker health checks and interdependencies.	2024-01-11 12:40:03 +01:00
Viktor Lofgren	a0f28a7f9b	(*) Add a barebones configuration This adds a docker-compose file 'docker-compose-barebones.yml' which will only start the minimal number of services needed to run a whitelabel Marginalia Search-style search engine, with none of the surrounding frills. The change also adds a minimal search GUI to the query service, which is also available with JSON results if the appropriate Accept header is provided.	2024-01-10 20:23:51 +01:00
Viktor Lofgren	14b7680328	(loader) Update the size of the keyword files created by the loader Previously these ended up being about 200 Mb each, which is wastefully small. Increasing the size of these files makes the index construction faster.	2024-01-10 17:09:19 +01:00
Viktor Lofgren	f44222ce53	(control) Add a 'cancel' button to the process list This is a very nice QoL improvement, since it means you don't have to dig in the Actors view to terminate processes.	2024-01-10 15:02:42 +01:00
Viktor Lofgren	f310ad8d98	(control) Actor terminations work better Improves jank in the abort actor action, which would sometimes cause actors to hang or restart.	2024-01-10 14:18:49 +01:00
Viktor Lofgren	d56b394bcc	(control) GUI for loading external WARC files	2024-01-10 12:13:30 +01:00
Viktor Lofgren	55c9501e57	(search) Serve proper content type for static resources	2024-01-10 10:46:51 +01:00
Viktor	fad9575154	Merge pull request #69 from MarginaliaSearch/converter-optimizations Refactor the DomainProcessor to take advantage of the new crawl data format	2024-01-10 09:46:54 +01:00
Viktor Lofgren	97e11e1ac9	(search) Fix acknowledgement page for domain complaints rendering as plain text This was caused by incorrect usage of the renderInto() function, which was always buggy and should never be used. This method is removed with this change.	2024-01-10 09:37:40 +01:00
Viktor Lofgren	e6a1e164b2	(search) Swap swipe direction for more consistent experience	2024-01-10 09:37:40 +01:00
Viktor Lofgren	e4f8f81e89	(search) Mobile UX improvements. Swipe right to show filter menu. Fix CSS bug that caused parts of the menu to not have a background.	2024-01-10 09:37:39 +01:00
Viktor Lofgren	176b3bb526	(search) Toggle for showing recent results Actually persist the value of the toggle between searches too...	2024-01-10 09:37:39 +01:00
Viktor Lofgren	b07752fa9b	(search) Toggle for showing recent results Will by default show results from the last 2 years. May need to tune this later.	2024-01-10 09:37:39 +01:00
Viktor Lofgren	68fd0efbde	(search) Clean up search results template Rendering is very slow. Let's see if this has a measurable effect on latency.	2024-01-10 09:37:39 +01:00
Viktor Lofgren	c80d3eb812	(search) Remove dead code	2024-01-10 09:37:35 +01:00
Viktor Lofgren	f9320995d6	(search) When clicking asn-links, show results from the unfiltered view...	2024-01-10 09:37:13 +01:00
Viktor Lofgren	f592c9f04d	(search) Fix acknowledgement page for domain complaints rendering as plain text This was caused by incorrect usage of the renderInto() function, which was always buggy and should never be used. This method is removed with this change.	2024-01-10 09:26:34 +01:00
Viktor Lofgren	bd7970fb1f	(search) Swap swipe direction for more consistent experience	2024-01-09 13:38:40 +01:00
Viktor Lofgren	c47730f2cc	(search) Mobile UX improvements. Swipe right to show filter menu. Fix CSS bug that caused parts of the menu to not have a background.	2024-01-09 13:30:30 +01:00
Viktor Lofgren	41cccfd2aa	(search) Toggle for showing recent results Actually persist the value of the toggle between searches too...	2024-01-09 11:36:49 +01:00
Viktor Lofgren	aff690f7d6	(search) Toggle for showing recent results Will by default show results from the last 2 years. May need to tune this later.	2024-01-09 11:28:36 +01:00
Viktor Lofgren	d4b0539d39	(search) Clean up search results template Rendering is very slow. Let's see if this has a measurable effect on latency.	2024-01-08 20:57:40 +01:00
Viktor Lofgren	cb55273769	(search) When clicking asn-links, show results from the unfiltered view...	2024-01-08 20:02:19 +01:00
Viktor Lofgren	fbad625126	(linkdb) Add delegating implementation of DomainLinkDb This facilitates switching between SQL and File-backed implementations on the fly while migrating from one to the other.	2024-01-08 19:56:33 +01:00
Viktor Lofgren	e49ba887e9	(crawl data) Add compatibility layer for old crawl data format The new converter logic assumes that the crawl data is ordered where the domain record comes first, and then a sequence of document records. This is true for the new parquet format, but not for the old zstd/gson format. To make the new converter compatible with the old format, a specialized reader is introduced that scans for the domain record before running through the sequence of document records; and presenting them in the new order. This is slower than just reading the file beginning to end, so in order to retain performance when this ordering isn't necessary, a CompatibilityLevel flag is added to CrawledDomainReader, permitting the caller to decide how compatible the data needs to be. Down the line when all the old data is purged, this should be removed, as it amounts to technical debt.	2024-01-08 19:16:49 +01:00
Viktor Lofgren	edc1acbb7e	(*) Replace EC_DOMAIN_LINK table with files and in-memory caching The EC_DOMAIN_LINK MariaDB table stores links between domains. This is problematic, as both updating and querying this table is very slow in relation to how small the data is (~10 GB). This slowness is largely caused by the database enforcing ACID guarantees we don't particularly need. This changeset replaces the EC_DOMAIN_LINK table with a file in each index node containing 32 bit integer pairs corresponding to links between two domains. This file is loaded in memory in each node, and can be queried via the Query Service. A migration step is needed before this file is created in each node. Until that happens, the actual data is loaded from the EC_DOMAIN_LINK table, but accessed as though it was a file. The changeset also migrates/renames the links.db file to documents.db to avoid naming confusion between the two.	2024-01-08 15:53:13 +01:00
Viktor Lofgren	ef02b712ad	(build) Remove false dependency between icp and index-service. This dependency causes the executor service docker image to change when the index service docker image changes.	2024-01-05 13:22:13 +01:00
Viktor Lofgren	aca217cf9a	(qs) Better metrics for QS	2024-01-05 13:22:13 +01:00
Viktor Lofgren	9e3386dbbb	(search) Fetch fewer results per page This is a test to evaluate how this impacts load times.	2024-01-05 13:22:13 +01:00
Viktor Lofgren	fdec565b34	(converter) Add upper 128KB limit to how much HTML we'll parse	2024-01-05 13:22:13 +01:00
Viktor Lofgren	33c2188c87	(feature) More trackers	2024-01-05 13:22:13 +01:00
Viktor Lofgren	b3c8fa74cc	(feature) Add another doubleclick variant to the adtech trackers	2024-01-05 13:22:13 +01:00
Viktor Lofgren	e53bb70bef	(converter) Penalize chatgpt content farm spam	2024-01-05 13:22:13 +01:00
Viktor Lofgren	109bec372c	(index) Adjust BM25 parameters	2024-01-05 13:21:52 +01:00
Viktor Lofgren	5c2561d05d	(search) Add query strategy requiring link	2024-01-05 13:21:52 +01:00
Viktor Lofgren	0e970b8037	(valuation) Tweaking penalties a bit	2024-01-05 13:21:52 +01:00
Viktor Lofgren	1694b4d6ef	(valuation) Increase the penalty for adtech a bit	2024-01-05 13:21:34 +01:00
Viktor Lofgren	396299c1db	(index) Reduce the value of site and site-adjacent in BM25P calculations	2024-01-05 13:21:33 +01:00
Viktor Lofgren	71d789aab0	(index) Tweak result valuation renormalization	2024-01-05 13:21:33 +01:00
Viktor Lofgren	6d2e14a656	(build) Remove false dependency between icp and index-service. This dependency causes the executor service docker image to change when the index service docker image changes.	2024-01-05 13:17:29 +01:00
Viktor Lofgren	4078708aea	(qs) Better metrics for QS	2024-01-04 13:27:14 +01:00
Viktor Lofgren	343ea9c6d8	(search) Fetch fewer results per page This is a test to evaluate how this impacts load times.	2024-01-04 13:18:07 +01:00
Viktor Lofgren	60361f88ed	(converter) Add upper 128KB limit to how much HTML we'll parse	2024-01-03 23:14:03 +01:00
Viktor Lofgren	f7560cb1d8	(feature) More trackers	2024-01-03 17:31:02 +01:00
Viktor Lofgren	1f66568d59	(feature) More trackers	2024-01-03 17:27:25 +01:00
Viktor Lofgren	7af07cef95	(feature) Add another doubleclick variant to the adtech trackers	2024-01-03 17:21:12 +01:00
Viktor Lofgren	41a540a629	(converter) Penalize chatgpt content farm spam	2024-01-03 17:04:38 +01:00
Viktor Lofgren	f599944942	(converter) Penalize chatgpt content farm spam	2024-01-03 16:51:26 +01:00
Viktor Lofgren	1e06aee6a2	(index) Adjust BM25 parameters	2024-01-03 16:30:46 +01:00
Viktor Lofgren	7bbaedef97	(search) Add query strategy requiring link	2024-01-03 16:23:00 +01:00
Viktor Lofgren	87048511fe	(valuation) Tweaking penalties a bit	2024-01-03 16:02:25 +01:00
Viktor Lofgren	c770f0b68b	(valuation) Tweaking penalties a bit	2024-01-03 15:59:21 +01:00
Viktor Lofgren	78c00ad512	(valuation) Tweaking penalties a bit	2024-01-03 15:52:57 +01:00
Viktor Lofgren	a19879d494	(valuation) Tweaking penalties a bit	2024-01-03 15:32:33 +01:00
Viktor Lofgren	ac1aca36b0	(valuation) Increase the penalty for adtech a bit	2024-01-03 15:20:38 +01:00
Viktor Lofgren	1f3b89cf28	(index) Reduce the value of site and site-adjacent in BM25P calculations	2024-01-03 15:20:18 +01:00
Viktor Lofgren	f732f6ae6f	(index) Tweak result valuation renormalization	2024-01-03 14:53:53 +01:00
Viktor Lofgren	0b9f3d1751	(*) Remove accidental commit of debug logging	2024-01-03 14:32:00 +01:00
Viktor Lofgren	0806aa6dfe	(language-processing) Add maximum length limit for text input in SentenceExtractor Added a new constant, MAX_TEXT_LENGTH, to the SentenceExtractor class. If the length of the text input exceeds this limit, the text is truncated to fit within the limit. This modification is designed to prevent excessive resource usage for unusually long text inputs.	2024-01-03 14:27:47 +01:00
Viktor Lofgren	32436d099c	(language-processing) Add maximum length limit for text input in SentenceExtractor Added a new constant, MAX_TEXT_LENGTH, to the SentenceExtractor class. If the length of the text input exceeds this limit, the text is truncated to fit within the limit. This modification is designed to prevent excessive resource usage for unusually long text inputs.	2024-01-03 14:27:47 +01:00
Viktor Lofgren	4ce692ccaf	(converter) Use SimpleBlockingThreadPool in ProcessingIterator	2024-01-03 14:27:47 +01:00
Viktor Lofgren	3caa4eed75	Merge branch 'master' into converter-optimizations	2024-01-02 17:13:25 +01:00
Viktor Lofgren	c70f508ae8	(prometheus) Saner histogram buckets	2024-01-02 17:13:14 +01:00
Viktor Lofgren	9e64d7aaf9	Merge branch 'master' into converter-optimizations	2024-01-02 15:46:24 +01:00
Viktor Lofgren	72b773f06d	(search) fix search metrics labeling	2024-01-02 15:46:14 +01:00
Viktor Lofgren	5f978b865b	Merge branch 'master' into converter-optimizations	2024-01-02 15:41:48 +01:00
Viktor Lofgren	57a4f92722	(api) fix missing metrics label in api service	2024-01-02 15:41:38 +01:00
Viktor Lofgren	87351e89ca	Merge branch 'master' into converter-optimizations	2024-01-02 15:17:02 +01:00
Viktor Lofgren	192e356169	(prometheus) Add instrumentation to the api service	2024-01-02 15:12:44 +01:00
Viktor Lofgren	31232e49fb	(prometheus) Add instrumentation to the search, qs and index services.	2024-01-02 15:02:29 +01:00
Viktor Lofgren	9d93a31755	Merge branch 'master' into converter-optimizations	2024-01-02 12:36:16 +01:00
Viktor Lofgren	9f7df59945	(sideload) Reduce quality assessment. This will make these sideloaded results rank much better as there is a pretty harsh penalty for large low-q websites.	2024-01-02 12:35:59 +01:00
Viktor Lofgren	d2418521a7	(index) Further ranking adjustments	2024-01-02 12:35:59 +01:00
Viktor Lofgren	9330b5b1d9	(index) Adjust rank weightings to fix bad wikipedia results. There was a bug where if the input of ResultValuator.normalize() was negative, it was truncated to zero. This meant that "bad" results always ranked the same. The penalty factor "overallPart" was moved outside of the function and was re-weighted to accomplish a better normalization. Some of the weights were also re-adjusted based on what appears to produce better results. Needs evaluation.	2024-01-02 12:35:44 +01:00
Viktor Lofgren	faa50bf578	(sideload) Just index based on first paragraph This seems like it would make the wikipedia search result worse, but it drastically improves the result quality! This is because wikipedia has a lot of articles that each talk about a lot of irrelevant concepts, and indexing the entire document means tangentially relevant results tend to displace the most relevant results.	2024-01-02 12:35:44 +01:00
Viktor Lofgren	f0d9618dfc	(sideload) Reduce quality assessment. This will make these sideloaded results rank much better as there is a pretty harsh penalty for large low-q websites.	2024-01-02 12:34:58 +01:00
Viktor Lofgren	310a880fa8	(index) Further ranking adjustments	2024-01-02 12:24:52 +01:00
Viktor Lofgren	fc6e3b6da0	(index) Further ranking adjustments	2024-01-01 18:51:03 +01:00
Viktor Lofgren	50771045d0	(index) Further ranking adjustments	2024-01-01 18:43:17 +01:00
Viktor Lofgren	8f522470ed	(index) Adjust rank weightings to fix bad wikipedia results. There was a bug where if the input of ResultValuator.normalize() was negative, it was truncated to zero. This meant that "bad" results always ranked the same. The penalty factor "overallPart" was moved outside of the function and was re-weighted to accomplish a better normalization. Some of the weights were also re-adjusted based on what appears to produce better results. Needs evaluation.	2024-01-01 17:16:29 +01:00
Viktor Lofgren	dc90c9ac65	(sideload) Just index based on first paragraph This seems like it would make the wikipedia search result worse, but it drastically improves the result quality! This is because wikipedia has a lot of articles that each talk about a lot of irrelevant concepts, and indexing the entire document means tangentially relevant results tend to displace the most relevant results.	2024-01-01 16:19:38 +01:00
Viktor Lofgren	e46e174b59	(keyword-extractor) Add another test for Name-extractor	2024-01-01 15:21:51 +01:00
Viktor Lofgren	7f3f3f577c	(backup) Add task heartbeats to the backup service	2024-01-01 15:20:57 +01:00
Viktor Lofgren	75d87c73d1	(crawler) Disable Java's infinite DNS caching	2023-12-31 16:59:08 +01:00
Viktor Lofgren	0fe44c9bf2	(crawler) Fix broken test A necessary step was accidentally deleted when cleaning up these tests previously.	2023-12-30 13:56:44 +01:00
Viktor Lofgren	7a1d20ed0a	(converter) Better use of ProcessingIterator. Modify ProcessingIterator to be constructed via a factory, to enable re-use of its backing executor service. This reduces thread churn in the converter's sideloader-style processing of regular crawl data.	2023-12-30 13:53:55 +01:00
Viktor Lofgren	70c83b60a1	(converter) Clean up fullProcessing() This function made some very flimsy-looking assumptions about the order of an iterable. These are still made, but more explicitly so.	2023-12-30 13:36:18 +01:00
Viktor Lofgren	7ba296ccdf	(converter) Route sizeHint to SideloadProcessing Route the sizeHint from the input parquet file to SideloadProcessing, so that it can set sideloadSizeAdvice appropriately, instead of using a fixed "large" number. This is necessary to populate the KNOWN_URL column in the domain data table, which is important as it is used in e.g. calculating how far to re-crawl the site in the future.	2023-12-30 13:05:10 +01:00
Viktor Lofgren	0b112cb4d4	(warc) Update URL encoding in WarcProtocolReconstructor The URI query string is now URL encoded in the WarcProtocolReconstructor. This change ensures proper encoding of special characters as per the standard URL encoding rules and improves URL validity during the crawling process.	2023-12-29 19:41:37 +01:00
Viktor Lofgren	68ac8d3e09	(search) Fetch fewer linking and similar domains. Showing a total of 200 connected domains is not very informative.	2023-12-29 16:37:27 +01:00
Viktor Lofgren	f6fa8bd722	(search) Fetch fewer linking and similar domains. Showing a total of 200 connected domains is not very informative.	2023-12-29 16:37:00 +01:00
Viktor Lofgren	6aee27a3f1	(*) Fix bug in EdgeDomain where it would permit domains with a trailing period, DNS style.	2023-12-29 16:36:01 +01:00
Viktor Lofgren	401568033c	Merge branch 'master' into converter-optimizations	2023-12-29 15:55:57 +01:00
Viktor Lofgren	ea73be6831	(search) Remove the ugly placeholder screenshots from the site info view.	2023-12-29 15:55:46 +01:00
Viktor Lofgren	ba8a75c84b	Fix bug in ProcessingIterator where it would run the tasks in only one single thread instead of using the pool	2023-12-29 15:10:32 +01:00
Viktor Lofgren	a1f3ccdd6d	Fix bug in ProcessingIterator where it would run the tasks in only one single thread instead of using the pool	2023-12-29 14:59:39 +01:00
Viktor Lofgren	647d38007f	Reduce queue polling time in ProcessingIterator Updated ProcessingIterator's queue polling from one second to 50 milliseconds for improved performance. This facilitates faster document processing across more cores, reducing bottlenecks and slow single-threaded processing.	2023-12-29 14:27:58 +01:00
Viktor Lofgren	e7dd28b926	(converter) Optimize sideload-loading Use ProcessingIterator to fan out processing of documents across more cores, instead of doing all of it in the writer thread blocking everything else with slow single-threaded processing.	2023-12-29 14:25:48 +01:00
Viktor Lofgren	b5fc9673d9	Merge branch 'master' into converter-optimizations	2023-12-29 14:04:43 +01:00
Viktor Lofgren	a065040323	(search) Don't inject arbitrary HTML into the site info view xD	2023-12-29 14:04:26 +01:00
Viktor Lofgren	dec3b1092d	(converter) Fix bugs in conversion This commit adds a safety check that the URL of the document is from the correct domain. It also adds a sizeHint() method to SerializableCrawlDataStream which may provide an indication if the stream is very large and benefits from sideload-style processing (which is slow). It furthermore addresses a bug where the ProcessedDomain.write() invoked the wrong method on ConverterBatchWriter and only wrote the domain metadata, not the rest...	2023-12-29 13:58:08 +01:00
Viktor Lofgren	407915a86e	(converter) Fix NPEs in converter due to the new data format	2023-12-28 22:54:53 +01:00
Viktor Lofgren	c488599879	(converter) Fix NPE in converter	2023-12-28 19:52:26 +01:00
Viktor Lofgren	bcecc93e39	(converter) Swallow errors in parquet data stream	2023-12-28 19:45:35 +01:00
Viktor Lofgren	ff7d1a250e	Merge branch 'master' into converter-optimizations	2023-12-28 19:35:00 +01:00
Viktor Lofgren	70f338c3de	(search) Fix NPE in layout selection	2023-12-28 19:34:46 +01:00
Viktor Lofgren	c847d83011	(converter) Add size hint to converter sideload processing	2023-12-28 19:14:16 +01:00
Viktor Lofgren	5ce46a61d4	Merge branch 'master' into converter-optimizations	2023-12-28 13:26:19 +01:00
Viktor	775974d5ec	Merge pull request #67 from MarginaliaSearch/rss-feeds-in-site-info Add RSS Feeds to site info (WIP)	2023-12-28 13:25:38 +01:00
Viktor Lofgren	c7af40c368	(search) Change layout balance when feeds/samples are present	2023-12-28 13:16:10 +01:00
Viktor Lofgren	00a974a721	(crawler) Fix bug in resynchronizer where it would fail to capture expected exceptions This commit also improves the test coverage for this part of the code.	2023-12-27 20:02:17 +01:00
Viktor Lofgren	7428ba2dd7	(converter) Basic test coverage for sideloading-style processing	2023-12-27 19:29:26 +01:00
Viktor Lofgren	b37223c053	(converter) Basic test coverage for sideloading-style processing	2023-12-27 18:33:16 +01:00
Viktor Lofgren	24051fec03	(converter) WIP Run sideload-style processing for large domains The processor normally retains the domain data in memory after processing to be able to do additional site-wide analysis. This works well, except there are a number of outlier websites that have an absurd number of documents that can rapidly fill up the heap of the process. These websites now receive a simplified treatment. This is executed in the converter batch writer thread. This is slower, but the documents will not be persisted in memory.	2023-12-27 18:20:03 +01:00
Viktor Lofgren	f811a29f87	(crawler) Fix resource leak in crawler A 10 MB thread local buffer wasn't static. Oops.	2023-12-27 16:32:17 +01:00
Viktor Lofgren	acf7bcc7a6	(converter) Refactor the DomainProcessor for new format of crawl data With the new crawler modifications, the crawl data comes in a slightly different order, and a result of this is that we can optimize the converter. This is a breaking change that will be incompatible with the old style of crawl data, hence it will linger as a branch for a while. The first step is to move stuff out of the domain processor into the document processor.	2023-12-27 13:57:59 +01:00
Viktor Lofgren	9707366348	(test) Fix a few slow tests that broke due to domainCount	2023-12-27 13:29:59 +01:00
Viktor Lofgren	9e5fe71f5b	(crawler) Switch hash function in crawler Guava's hashers are a bit allocation hungry, and a big driver of GC churn in the crawler. This switches to the modified Murmur hash function used throughout Marginalia.	2023-12-27 13:29:00 +01:00
Viktor Lofgren	5d1b7da728	Updated site info feed and search service Modified site info feed template to secure the description field against injected code. Also adjusted search service by extracting samples within the correct scope and including them in the returned site info. This improves the quality and security of the displayed information.	2023-12-26 22:06:01 +01:00
Viktor Lofgren	3ea1ddae22	(crawler) Roll back switch to virtual thread pool in crawler This seems to cause a resource leak, it seems the http library uses thread locals?	2023-12-26 19:37:34 +01:00
Viktor Lofgren	1694e9c78c	(search) Add RSS Feeds to site info This change integrates the Feedlot RSS Bot with Marginalia's site info view to offer a preview of the latest updates. The change introduces a new tiny feature that is a feedlot-client based on Java's HttpClient.	2023-12-26 16:21:40 +01:00
Viktor Lofgren	4763077b76	(search/index) Add a new keyword "count" This is for filtering results on how many times the term appears on the domain. The intent is to be beneficial in creating e.g. a domain search feature. It's also very helpful when tracking down spammy domains.	2023-12-25 20:38:29 +01:00
Viktor Lofgren	c0eaca220c	(search) Add convenient link for AS search to the search view	2023-12-25 15:07:58 +01:00
Viktor Lofgren	25d086c4e1	(crawler) Clean up stale warc files We should probably have an option to keep them, but not by default!	2023-12-25 15:07:36 +01:00
Viktor Lofgren	88551043cd	(crawler) Even more lenient resyncing	2023-12-25 01:48:11 +01:00
Viktor Lofgren	f779f760c4	(crawler) Even more lenient resyncing	2023-12-25 01:44:18 +01:00
Viktor Lofgren	f18f82e229	(crawler) Write etags and last-modified on reference copy This commit also fixes a test that broke with a previous change.	2023-12-25 01:40:13 +01:00
Viktor Lofgren	67ef2b45fa	(crawler) Reduce logging	2023-12-25 01:10:03 +01:00
Viktor Lofgren	d72e871265	(warc) Fix resync	2023-12-25 01:03:03 +01:00
Viktor Lofgren	4c9bc13309	(warc) Reduce log spam	2023-12-25 00:58:31 +01:00
Viktor Lofgren	84563b0d46	(crawler) Be a bit more conservative about pulling etags and so on if the previous fetch wasn't OK	2023-12-25 00:55:05 +01:00
Viktor Lofgren	c5aab7e8db	(warc) Fix NPE in WarcRecorder	2023-12-25 00:54:38 +01:00
Viktor Lofgren	1755b646b8	(warc) Fix NPE in WarcRecorder	2023-12-25 00:48:42 +01:00
Viktor Lofgren	85f906ea53	(executor) Fix removal of stale process heartbeats	2023-12-23 13:49:24 +01:00
Viktor Lofgren	e1a155a9c8	(crawler) Increase growth of crawl jobs A number of crawl jobs get stuck at about 300 documents, or just under. This seems to be because we fail to increase the crawl limit, which is based on MAX(200, 1.25 x GOOD_URLS) with a 1.5x modifier applied upon a recrawl. GOOD_URLS is based on how many documents successfully process, which is typically fairly small. Switching to KNOWN_URLS should let this grow faster. The SQL query in the DbCrawlSpecProvider class has been updated; 'GOOD_URLS' has been replaced with 'KNOWN_URLS'. This update ensures the correct data is selected from the DOMAIN_METADATA table. The floor is also increased to 250 from 200.	2023-12-23 13:22:10 +01:00
Viktor Lofgren	0454447e41	(executor) Implement process removal for long-absent heartbeats Added functionality to remove processes from listing that have not checked in for over a day. A 'removeProcessHeartbeat' function was created to delete the respective entry from the PROCESS_HEARTBEAT table in case heartbeats are absent for more than one day.	2023-12-23 13:18:21 +01:00
Viktor Lofgren	7b40c0bbee	(assistant) Clean up similar websites' results	2023-12-22 14:07:01 +01:00
Viktor Lofgren	dc773c5c20	(adjacencies) Clean up AdjacenciesLoader Make JDBC batching more consistent, also adds a test case for the loader.	2023-12-21 14:14:22 +01:00
Viktor Lofgren	b6253b03c2	(adjacencies) Fix bug in AdjacenciesLoader This fixes a bug where a prepared statement was created before the table it was supposed to insert into was created. This fails and does nothing. Furthermore, added the logging that would have warned about this failure, had it been in place.	2023-12-21 13:12:31 +01:00
Viktor Lofgren	a5bc29245b	(cleanup) Remove vestigial support for WARC crawl data streams	2023-12-20 15:46:21 +01:00
Viktor Lofgren	bfae478251	Refactor CrawlerRevisitor for better consistency	2023-12-20 15:21:49 +01:00
Viktor Lofgren	a7cd490593	(minor) Remove dead code.	2023-12-19 18:58:33 +01:00
Viktor Lofgren	dd8fb04886	(converter) Add sizeloadSizeAdvice field to several ProcessedDomain Since the sideloaders don't populate the documents list in ProcessedDomain to keep the memory footprint manageable, the code that estimates knownUrls etc. will set them to zero, which has negative effects on their ranking. This change will populate them with a bullshit value within a sane ballpark, ensuring that these domains show up in the rankings.	2023-12-19 18:37:51 +01:00
Viktor	5bd3934d22	Merge pull request #64 from dreimolo/macos_AS_fix Macos apple silicon fix, and slight improvements to sample downloader	2023-12-18 18:29:14 +01:00
Viktor Lofgren	3a56a06c4f	(warc) Add fields for etags and last-modified headers to the new crawl data formats Make some temporary modifications to the CrawledDocument model to support both a "big string" style headers field like in the old formats, and explicit fields as in the new formats. This is a bit awkward to deal with, but it's a necessity until we migrate off the old formats entirely. The commit also adds a few tests to this logic.	2023-12-18 17:45:54 +01:00
Viktor Lofgren	126ac3816f	(converter) Reduce queue size in ConverterWriter The size of the ArrayBlockingQueue in ConverterWriter.java has been reduced from 4 to 1. This change aims to reduce the memory utilization by not having fully processed domains piling up in RAM. This may cause the writer to go idle in waiting for new data, but that may be preferable to an OOM.	2023-12-18 13:42:40 +01:00
Viktor Lofgren	d02bed1a55	(loader) Optimize DomainLoaderService for faster startups Initialization parameters in DomainLoaderService and DomainIdRegistry have been updated to improve performance. This is done by adding sane default sizes to the hash tables involved, reducing GC churn, but also by setting a sensible fetch size to the queries used, and not fetching irrelevant information such as the domain name.	2023-12-18 13:15:10 +01:00
Viktor Lofgren	b7ed0ce537	(loader) Reset count after executing batch in DomainLoaderService This should greatly speed up starting the loader process.	2023-12-18 12:43:53 +01:00
Viktor Lofgren	a742503508	(search) Add view for showing mutual links between two websites	2023-12-17 17:50:44 +01:00
Viktor Lofgren	33312ab09e	(geo-ip) Update readme	2023-12-17 16:08:33 +01:00
Viktor Lofgren	c422f0b9fb	(geo-ip) Tidy up error handling	2023-12-17 16:06:51 +01:00
Viktor Lofgren	c92f1b8df8	(geo-ip) Revert removal of ip2location logic We do both ip2location and ASN data. The change also adds some keywords based on autonomous system information, on a somewhat experimental basis. It would be neat to be able to e.g. exclude cloud services or just e.g. cloudflare from the search results.	2023-12-17 15:03:00 +01:00
Viktor Lofgren	bde68ba48b	Merge branch 'master' into asn-info	2023-12-17 14:00:23 +01:00

1079 Commits