CatgirlIntelligenceAgency

Author	SHA1	Message	Date
Viktor Lofgren	467ba5be20	(index-construction) Split repartition into two actions This change splits the previous 'repartition' action into two steps, one for recalculating the domain rankings, and one for recalculating the other ranking sets. Since only the first is necessary before the index construction, the rest can be delayed until after... To avoid issues in handling the shotgun blast of MqNotifications, Service was switched over to use a synchronous message queue instead of an asynchronous one. The change also modifies the behavior so that only node 1 will push the changes to the EC_DOMAIN database table, to avoid unnecessary db locks and contention with the loader. Additionally, the change fixes a bug where the index construction code wasn't actually picking up the rankings data. Since the index construction used to be performed by the index-service, merely saving the data to memory was enough for it to be accessible within the index-construction logic, but since it's been broken out into a separate process, the new process just injected an empty DomainRankings object instead. To fix this, DomainRankings can now be persisted to disk, and a pre-loaded version of the object is injected into the index-construction process.	2024-02-06 17:20:07 +01:00
Viktor Lofgren	29ddf9e61d	(doc) Update docs	2024-02-06 16:29:55 +01:00
Viktor Lofgren	92e119cab3	(doc) Update docs	2024-02-06 12:43:42 +01:00
Viktor Lofgren	92049ba8e4	(doc) Update docs	2024-02-06 12:41:28 +01:00
Viktor Lofgren	54330b9921	(*) Remove dead code	2024-02-06 12:41:13 +01:00
Viktor Lofgren	d1aeb030f2	(doc) Update RandomWriteFunnel documentation	2024-02-06 12:35:24 +01:00
Viktor Lofgren	f89274d1ea	(minor) Fix broken test Fallout from changes in endianness made in `d986f90074`	2024-02-06 12:12:26 +01:00
Viktor Lofgren	7286596fb4	(deps) Remove monkey patched GSON The codebase used to have a monkey patched version of gson that made special optimizations for the unusually large JSON files that used to store e.g. crawl data. Since JSON is no longer used in this fashion, the GSON fork is not needed anymore.	2024-02-06 12:11:39 +01:00
Viktor Lofgren	a2fc83d94e	(control) Add configurable border styling To help distinguish between environments, a system property 'control.appBorder' is added that is injected as a body element border property in the control GUI stylesheets.	2024-02-06 12:05:02 +01:00
Viktor Lofgren	2161799cc3	(sideload) Fix filename error in dealing with stackoverflow files	2024-02-06 11:18:00 +01:00
Viktor Lofgren	c88f132057	(sideload) Fix filename error in dealing with stackoverflow files	2024-02-06 11:10:03 +01:00
Viktor Lofgren	c6313a5906	(sideload) Fix filename error in dealing with stackoverflow files	2024-02-06 11:06:36 +01:00
Viktor Lofgren	eadcdb5bed	(minor) Improve error handling, naming logging in IndexResultDecorator	2024-02-05 21:05:44 +01:00
Viktor Lofgren	6e7649b5f7	(loader) Mitigate fragile paging behavior IndexJournalWriterPagingImpl was modified to not page on number of entries written, but number of (equivalent uncompressed) bytes written. Since the failure mode if too much data is written per file is quiet corruption of the index, the former behavior was extremely fragile. The new behavior should consistently ensure that the data is sufficiently small to not cause any integer rollovers. The change in `6dcc20038c` was reverted, as there is really no sane reason to have this configurable in software.	2024-02-05 21:05:03 +01:00
Viktor Lofgren	d986f90074	(index) Fix consistency between RandomFileAssembler implementations The RandomFileAssembler implementations, introduced in commit `53c575db3f` were all acting subtly differently. The RWF implementation wrote BigEndian longs instead of the native endianness used by the other implementations (and expected by the index construction code), further the mmap implementation exposed a bug in LongArray.write() that caused it to create a larger file than necessary. A test was built to ensure the output of these implementations is equivalent.	2024-02-05 21:01:32 +01:00
Viktor Lofgren	53c575db3f	(index-construction) Make random-write file strategy configurable To cope with writing large files out of order, the system needs some form of strategy to avoid writing them directly to disk, as this causes insane amounts of disk thrashing. By default, the data is just buffered in RAM. This works well on a large server, but smaller systems struggle. To help systems with small RAM process large amounts of data, the old RandomWriteFunnel is brought back if the system property 'system.conserve-memory' is set to true. RandomWriteFunnel is buffering the construction by creating a series of small files that pigeonhole the writes into rough neighborhoods, and then it goes over the files one by one to construct one area of the file at a time. This is relatively slow and uses more than twice the disk size. A new interface RandomFileAssembler is introduced as an abstraction for this operation. A third strategy, direct mmaps, is also introduced if the file is very small (less than 1 GB). In this domain, disk thrashing is unlikely since it will comfortably fit in RAM.	2024-02-05 12:31:15 +01:00
Viktor Lofgren	6dcc20038c	(index-journal) Make index journal page size configurable Adds a new system property loader.journal-page-size to configure this setting.	2024-02-05 11:26:05 +01:00
Viktor Lofgren	fa145f632b	(sideload) Add special handling for sideloaded wiki documents This update enhances the SideloaderProcessing and DocumentClass modules to specially handle sideloaded wiki documents. Wiki content is generally truncated to the first paragraph, which generally tends to be too short to be included independently. An additional DocumentClass (SIDELOAD) has been introduced to suppress the length check in this case.	2024-02-02 21:22:07 +01:00
Viktor Lofgren	785d8deadd	(crawler) Improve meta-tag redirect handling, add tests for redirects. Wrote a new test to examine the redirect behavior of the crawler, ensuring that the redirect URL is the URL that is reported in the parquet file. This works as intended. Noticed in the course of this that the crawler doesn't add links from meta-tag redirects to the crawl frontier. Added logic to handle this case, amended the test case to verify the new behavior. Added the meta-redirect case to the HtmlDocumentProcessorPlugin as well, so that we consider it a link between documents in the unlikely case that a meta redirect is to another domain.	2024-02-01 20:30:43 +01:00
Viktor Lofgren	93a2d5afbf	(*) Fix poorly named test Likely old refactoring gore.	2024-02-01 20:08:15 +01:00
Viktor Lofgren	d60c6b18d4	(doc) Update the readmes for the crawler, as they've grown stale.	2024-02-01 18:10:55 +01:00
Viktor Lofgren	d1e02569f4	(language-processing) Add a system property for configuring which language detection model to use The flag is `system.languageDetectionModelVersion`. * If negative, no model is used. * If 0, both models are used. * If 1, the old crappy model is used. * If 2, the new fasttext model is used.	2024-01-31 13:02:33 +01:00
Viktor Lofgren	9ce67029ca	(language-processing) Add a system property for configuring which language detection model to use The flag is `system.languageDetectionModelVersion`. * If negative, no model is used. * If 0, both models are used. * If 1, the old crappy model is used. * If 2, the new fasttext model is used.	2024-01-31 13:02:16 +01:00
Viktor Lofgren	98f3382cea	(minor) Fix test and improve error message	2024-01-31 11:53:41 +01:00
Viktor Lofgren	52a0255814	() Add flag for disabling ASCII flattening The production configuration assumes all content of interest is 7 bit ASCII, and makes a series of optimizations based on this. This assumption holds poorly in the wild. Adding an experimental* system property 'system.noFlattenUnicode', that when set to TRUE, will disable this behavior. IMPORTANT!! The index needs to be re-constructed when this flag is changed, as different hash functions are selected for the keyword->identifier mappings.	2024-01-31 11:50:59 +01:00
Viktor Lofgren	eb59ac8535	(index-ranking) Adjust the BM25P factors a bit Since the bleed-flags set by the anchor tag logic have been changed to Site and SiteAdjacent, give them a bit more importance when set together with ExternalLink. UrlDomain and UrlPath are also more consistently rewarded only once.	2024-01-30 21:27:29 +01:00
Viktor Lofgren	acc2b4e10f	(*) Update the readme with a link to the demo video	2024-01-26 13:49:41 +01:00
Viktor Lofgren	6f830f0e08	(*) Update the readme with a link to the demo video	2024-01-26 13:48:47 +01:00
Viktor Lofgren	6edc318597	(control) Fix typo in URL linking to new-crawl-specs	2024-01-26 10:43:10 +01:00
Viktor Lofgren	182c0cf28e	(control) Add warnings about domain data contamination	2024-01-25 18:26:15 +01:00
Viktor Lofgren	0b105b5986	(converter) Update hyperlink text for new crawl spec creation. Fix minor typo.	2024-01-25 18:05:11 +01:00
Viktor Lofgren	081c7d22bc	Fix typo in install.sh	2024-01-25 17:08:18 +01:00
Viktor Lofgren	6aee896657	(*) Add single-node barebones configuration This adds a single-node barebones configuration to the install script. It also moves the log4j configuration into system.properties, and sets assertions to disabled by default.	2024-01-25 16:40:28 +01:00
Viktor Lofgren	cae1bad274	(*) Add download-sample action, refactor file storage This changeset adds an action for downloading a set of sample data from downloads.marginalia.nu. It also refactors some leaky abstractions out of FileStorageService. allocateTemporaryStorage has been renamed allocateStorage. The storage was never temporary in any scenario... It also doesn't take a storage base, as there was always only one valid option for this input. The allocateStorage method finds the appropriate base itself.	2024-01-25 13:36:30 +01:00
Viktor Lofgren	1b8b97b8ec	(sample-exporter) Add some limits on sizes and lengths Tar files will reject entries with filenames over 100b, so we need a limit there. Also added a maximum size limit to keep the file sizes reasonable.	2024-01-25 11:51:53 +01:00
Viktor Lofgren	0846606b12	(doc) Add ide quick-start guide	2024-01-24 14:39:33 +01:00
Viktor Lofgren	245ebcdfc6	(doc) Add ide quick-start guide	2024-01-24 14:37:58 +01:00
Viktor Lofgren	1b1e711c93	(doc) Add ide quick-start guide	2024-01-24 14:36:44 +01:00
Viktor Lofgren	c088c25b09	(*) Fix broken test, clean up code	2024-01-24 12:50:41 +01:00
Viktor Lofgren	958d64720e	(control) Add a view for restarting aborted processes This will avoid having to dig in the message queue to perform this relatively common task. The control service was also refactored to extract common timestamp formatting logic out of the data objects and into the rendering.	2024-01-24 12:47:10 +01:00
Viktor Lofgren	805afad4fe	(control) New GUI for exporting crawl data samples Not going to win any beauty pageants, but this is pretty peripheral functionality.	2024-01-23 17:08:21 +01:00
Viktor Lofgren	400f4840ad	(*) Fix broken code in jmh	2024-01-23 17:08:21 +01:00
Viktor Lofgren	ee7792596d	(*) Fix broken test Probably shouldn't have tests depending on external data like this...	2024-01-23 12:03:47 +01:00
Viktor Lofgren	0081328aca	(converter) Adjust which flags are set by anchor text keywords It's a mistake to let it bleed into Title, as this is a high quality signal. We'll co-opt Site and SiteAdjacent instead to reinforce the ExternalLink when count is high.	2024-01-23 11:54:00 +01:00
Viktor Lofgren	3fff7f6878	(converter) Fix issue where quality limits were no longer enforced	2024-01-23 11:42:17 +01:00
Viktor Lofgren	f15dd06473	(index) Delayed close() of SearchIndexReader This avoids concurrent access errors. This is especially important when using Unsafe-based LongArrays, since we have concurrent access to the underlying memory-mapped file. If we pull the rug from under the caller by closing the file, we'll get a SIGSEGV. Even with a "safe" MemorySegment, we'll get ugly stacktraces if we close the file while a thread is still accessing it. So we spin up a thread that sleeps for a minute before actually unmapping the file, allowing any ongoing requests to wrap up. This is 100% a hack, but it lets us get away with doing this without adding locks to the index readers. Since this is "just" mmapped data, and this operation happens optimistically once a month, it should be safe if the call gets lost.	2024-01-23 11:08:41 +01:00
Viktor Lofgren	dd26819d66	(actor) Try to fix rare data race where a finished job is considered dead.	2024-01-22 21:22:38 +01:00
Viktor Lofgren	562012fb22	(doc) Migrate documentation https://docs.marginalia.nu/	2024-01-22 19:40:08 +01:00
Viktor Lofgren	a6d257df5b	(converter) Update Stackexchange sideload instruction The sideload instruction in the stackexchange template was updated. The instruction now states that stackexchange data will be loaded from a directory on the server and directs users to a new documentation url for more detailed information.	2024-01-22 18:29:20 +01:00
Viktor Lofgren	41d896ba3e	(converter) Refactor content type check in PlainTextDocumentProcessorPlugin The method `isApplicable` in the `PlainTextDocumentProcessorPlugin` was refactored to handle a wider range of content types beyond merely "text/plain". It now also handles any content type that starts with "text/plain;", to accommodate content types that append a charset as well.	2024-01-22 17:52:14 +01:00
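
The content type check described in commit `41d896ba3e` can be sketched roughly as below. This is not the project's actual code: the class name `ContentTypeCheckSketch` is hypothetical, the method signature is assumed, and the trimming and lower-casing step is an added assumption that the commit message does not mention.

```java
import java.util.Locale;

// Hypothetical stand-in for the check described in the commit message;
// the real PlainTextDocumentProcessorPlugin may look quite different.
class ContentTypeCheckSketch {

    // Accepts "text/plain" on its own, or followed by parameters such as
    // "; charset=UTF-8". Normalizing case and whitespace is an assumption.
    static boolean isApplicable(String contentType) {
        if (contentType == null) {
            return false;
        }
        String normalized = contentType.trim().toLowerCase(Locale.ROOT);
        return normalized.equals("text/plain")
            || normalized.startsWith("text/plain;");
    }

    public static void main(String[] args) {
        System.out.println(isApplicable("text/plain"));
        System.out.println(isApplicable("text/plain; charset=UTF-8"));
        System.out.println(isApplicable("text/html"));
    }
}
```

Matching on the `text/plain;` prefix rather than the bare `text/plain` prefix keeps unrelated types that merely share the leading characters from being accepted.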

1700 Commits