CatgirlIntelligenceAgency

Author	SHA1	Message	Date
Viktor Lofgren	93a2d5afbf	(*) Fix poorly named test. Likely old refactoring gore.	2024-02-01 20:08:15 +01:00
Viktor Lofgren	d60c6b18d4	(doc) Update the readmes for the crawler, as they've grown stale.	2024-02-01 18:10:55 +01:00
Viktor Lofgren	d1e02569f4	(language-processing) Add a system property for configuring which language detection model to use. The flag is `system.languageDetectionModelVersion`: if negative, no model is used; if 0, both models are used; if 1, the old crappy model is used; if 2, the new fasttext model is used.	2024-01-31 13:02:33 +01:00
Viktor Lofgren	9ce67029ca	(language-processing) Add a system property for configuring which language detection model to use. The flag is `system.languageDetectionModelVersion`: if negative, no model is used; if 0, both models are used; if 1, the old crappy model is used; if 2, the new fasttext model is used.	2024-01-31 13:02:16 +01:00
Viktor Lofgren	98f3382cea	(minor) Fix test and improve error message	2024-01-31 11:53:41 +01:00
Viktor Lofgren	52a0255814	(*) Add flag for disabling ASCII flattening. The production configuration assumes all content of interest is 7-bit ASCII, and makes a series of optimizations based on this. This assumption holds poorly in the wild. Adding an *experimental* system property 'system.noFlattenUnicode' that, when set to TRUE, will disable this behavior. IMPORTANT!! The index needs to be re-constructed when this flag is changed, as different hash functions are selected for the keyword->identifier mappings.	2024-01-31 11:50:59 +01:00
Viktor Lofgren	eb59ac8535	(index-ranking) Adjust the BM25P factors a bit. Since the bleed-flags set by the anchor tag logic have been changed to Site and SiteAdjacent, give them a bit more importance when set together with ExternalLink. UrlDomain and UrlPath are also more consistently rewarded only once.	2024-01-30 21:27:29 +01:00
Viktor Lofgren	acc2b4e10f	(*) Update the readme with a link to the demo video	2024-01-26 13:49:41 +01:00
Viktor Lofgren	6f830f0e08	(*) Update the readme with a link to the demo video	2024-01-26 13:48:47 +01:00
Viktor Lofgren	6edc318597	(control) Fix typo in URL linking to new-crawl-specs	2024-01-26 10:43:10 +01:00
Viktor Lofgren	182c0cf28e	(control) Add warnings about domain data contamination	2024-01-25 18:26:15 +01:00
Viktor Lofgren	0b105b5986	(converter) Update hyperlink text for new crawl spec creation. Fix minor typo.	2024-01-25 18:05:11 +01:00
Viktor Lofgren	081c7d22bc	Fix typo in install.sh	2024-01-25 17:08:18 +01:00
Viktor Lofgren	6aee896657	(*) Add single-node barebones configuration. This adds a single-node barebones configuration to the install script. It also moves the log4j configuration into system.properties, and sets assertions to disabled by default.	2024-01-25 16:40:28 +01:00
Viktor Lofgren	cae1bad274	(*) Add download-sample action, refactor file storage. This changeset adds an action for downloading a set of sample data from downloads.marginalia.nu. It also refactors some leaky abstractions out of FileStorageService. allocateTemporaryStorage has been renamed to allocateStorage. The storage was never temporary in any scenario... It also no longer takes a storage base, as there was always only one valid option for this input. The allocateStorage method finds the appropriate base itself.	2024-01-25 13:36:30 +01:00
Viktor Lofgren	1b8b97b8ec	(sample-exporter) Add some limits on sizes and lengths. Tar files will reject entries with filenames over 100 bytes, so we need a limit there. Also added a maximum size limit to keep the file sizes reasonable.	2024-01-25 11:51:53 +01:00
Viktor Lofgren	0846606b12	(doc) Add ide quick-start guide	2024-01-24 14:39:33 +01:00
Viktor Lofgren	245ebcdfc6	(doc) Add ide quick-start guide	2024-01-24 14:37:58 +01:00
Viktor Lofgren	1b1e711c93	(doc) Add ide quick-start guide	2024-01-24 14:36:44 +01:00
Viktor Lofgren	c088c25b09	(*) Fix broken test, clean up code	2024-01-24 12:50:41 +01:00
Viktor Lofgren	958d64720e	(control) Add a view for restarting aborted processes. This will avoid having to dig in the message queue to perform this relatively common task. The control service was also refactored to extract common timestamp formatting logic out of the data objects and into the rendering.	2024-01-24 12:47:10 +01:00
Viktor Lofgren	805afad4fe	(control) New GUI for exporting crawl data samples. Not going to win any beauty pageants, but this is pretty peripheral functionality.	2024-01-23 17:08:21 +01:00
Viktor Lofgren	400f4840ad	(*) Fix broken code in jmh	2024-01-23 17:08:21 +01:00
Viktor Lofgren	ee7792596d	(*) Fix broken test. Probably shouldn't have tests depending on external data like this...	2024-01-23 12:03:47 +01:00
Viktor Lofgren	0081328aca	(converter) Adjust which flags are set by anchor text keywords. It's a mistake to let it bleed into Title, as this is a high quality signal. We'll co-opt Site and SiteAdjacent instead to reinforce the ExternalLink when count is high.	2024-01-23 11:54:00 +01:00
Viktor Lofgren	3fff7f6878	(converter) Fix issue where quality limits were no longer enforced	2024-01-23 11:42:17 +01:00
Viktor Lofgren	f15dd06473	(index) Delayed close() of SearchIndexReader. This avoids concurrent access errors. This is especially important when using Unsafe-based LongArrays, since we have concurrent access to the underlying memory-mapped file. If we pull the rug from under the caller by closing the file, we'll get a SIGSEGV. Even with a "safe" MemorySegment, we'll get ugly stacktraces if we close the file while a thread is still accessing it. So we spin up a thread that sleeps for a minute before actually unmapping the file, allowing any ongoing requests to wrap up. This is 100% a hack, but it lets us get away with doing this without adding locks to the index readers. Since this is "just" mmapped data, and this operation happens optimistically once a month, it should be safe if the call gets lost.	2024-01-23 11:08:41 +01:00
Viktor Lofgren	dd26819d66	(actor) Try to fix rare data race where a finished job is considered dead.	2024-01-22 21:22:38 +01:00
Viktor Lofgren	562012fb22	(doc) Migrate documentation to https://docs.marginalia.nu/	2024-01-22 19:40:08 +01:00
Viktor Lofgren	a6d257df5b	(converter) Update Stackexchange sideload instruction. The sideload instruction in the stackexchange template was updated. The instruction now states that stackexchange data will be loaded from a directory on the server and directs users to a new documentation url for more detailed information.	2024-01-22 18:29:20 +01:00
Viktor Lofgren	41d896ba3e	(converter) Refactor content type check in PlainTextDocumentProcessorPlugin. The method `isApplicable` in the `PlainTextDocumentProcessorPlugin` was refactored to handle a wider range of content types beyond merely "text/plain". It now also handles any content type that starts with "text/plain;", to accommodate contentTypes that append a charset as well.	2024-01-22 17:52:14 +01:00
Viktor Lofgren	51cdf46645	(control) Improve accessibility in search-to-ban template. This update enhances accessibility by associating labels with the corresponding checkboxes in the search-to-ban template.	2024-01-22 15:01:00 +01:00
Viktor Lofgren	1eb0adf6d3	(array) Add sun.misc.Unsafe variant of LongArray	2024-01-22 13:38:42 +01:00
Viktor Lofgren	40c9d2050f	(control) Fully automatic conversion. Removed the need to run an external tool to pre-process the data in order to load stackexchange-style data into the search engine. Removed the tool itself. This stirred up some issues with the dependencies, which were due to both third-party:ing xz and importing it as a dependency. This has been fixed, and :third-party:xz was removed.	2024-01-22 13:03:24 +01:00
Viktor Lofgren	3a325845c7	(mq) Add better error handling in fsm and mq. java.lang.Error:s were not handled properly, leading to mismatch in the bookkeeping of the FSMs. These are now caught, acted on, and re-thrown. MqSynchronousInbox also no longer assumes all exceptions are InterruptedException.	2024-01-22 13:03:24 +01:00
Viktor Lofgren	6a1bfd6270	(array) Remove unused 'madvise' code and 3rd party dependency on 'uppend'. This wasn't actually hooked in anywhere. Removing the dependency and code. If it turns out we need madvise in the future, we'll re-introduce it.	2024-01-22 13:01:57 +01:00
Viktor Lofgren	b91ea1d7ca	(control) Re-add gui for sideloading dirtrees	2024-01-20 18:09:40 +01:00
Viktor Lofgren	c5760cd535	(test) Fix broken test	2024-01-20 13:39:40 +01:00
Viktor Lofgren	91c7960800	(crawler) Extract additional configuration properties. This commit extracts several previously hardcoded configuration properties, and makes them available through system.properties. The documentation is updated to reflect the change. Dead code was also removed in the process. CrawlSpecGenerator is left feeling a bit over-engineered still, since it's built for a more general case, where all other implementations but the current one are removed, but we'll leave it like this for now as it's fairly readable still.	2024-01-20 10:36:04 +01:00
Viktor Lofgren	2079a5574b	(control) Update heading in restore backup template. Changed the heading in the partial restore backup page from "Load" to "Restore Backup".	2024-01-19 21:46:53 +01:00
Viktor Lofgren	27ffb8fa8a	(converter) Integrate zim->db conversion into automatic encyclopedia processing workflow. Previously, in order to load encyclopedia data into the search engine, it was necessary to use the encyclopedia.marginalia.nu converter to first create a .db-file. This isn't very ergonomic, so parts of that code-base were lifted in as a 3rd party library, and conversion from .zim to .db is now done automatically. The output file name is based on the original filename, plus a crc32 hash and a .db-ending, to ensure we can recycle the data on repeat loads.	2024-01-19 13:59:03 +01:00
Viktor Lofgren	22c8fb3f59	(crawler) Fix a bug where reference copies of crawl data were written without etag and last-modified. This commit also adds a band-aid to ParquetSerializableCrawlDataStream to fetch this from the 304-entity. This can be removed in a few months.	2024-01-18 16:02:27 +01:00
Viktor Lofgren	964419803a	Fix broken test	2024-01-18 15:42:01 +01:00
Viktor Lofgren	6271d5d544	(mq) Add relation tracking between MQ messages for easier tracking and debugging. The change adds a new column to the MESSAGE_QUEUE table called AUDIT_RELATED_ID. This field is populated transparently, using a dictionary mapping Thread IDs to Message IDs, populated by the inbox handlers. The existing RELATED_ID field has too many semantics associated with it; among other things, the FSM code uses this field in tracking state changes. The change set also improves the consistency of inbox names. The IndexClient was buggy and populated its outbox with a UUID. This is fixed. All Service2Service outboxes are now prefixed with 'pp:' to make them even easier to differentiate.	2024-01-18 15:08:27 +01:00
Viktor Lofgren	175bd310f5	(control) Message queue UX improvements	2024-01-18 13:05:50 +01:00
Viktor Lofgren	67ee6f4126	(control) Clean up filtering UX in Events table	2024-01-18 12:35:39 +01:00
Viktor Lofgren	01b312f14c	(*) Make new index nodes accept queries by default. It's a confusing default behavior. This was off for nodes n>1 before as a bandaid since querying indices with no data caused delays and errors. This has been fixed now, so there's no need to do this anymore!	2024-01-18 12:05:37 +01:00
Viktor Lofgren	18638c62de	(control) Rephrase text	2024-01-18 11:53:10 +01:00
Viktor Lofgren	753d000788	(control) Add toggle for automatic loading of processed data	2024-01-18 11:52:58 +01:00
Viktor Lofgren	19e781b104	(control) Add basic input validation to node actions. Will present a simple error message when required fields aren't populated, instead of a cryptic HTTP status error.	2024-01-18 11:52:49 +01:00
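
Commit 41d896ba3e describes a content type check that accepts "text/plain" as well as any value starting with "text/plain;". A minimal sketch of such a check might look like the following. The class and method names here are hypothetical and this is not the project's actual `isApplicable` implementation; the trimming and lower-casing are added assumptions that the commit message does not mention.

```java
import java.util.Locale;

public class PlainTextContentTypeCheck {

    // Hypothetical helper, not the project's actual code: accepts the bare
    // "text/plain" type and any value that starts with "text/plain;",
    // which is how a charset parameter is appended.
    static boolean isPlainText(String contentType) {
        if (contentType == null) {
            return false;
        }
        // Assumption: trimming and lower-casing are added here for robustness;
        // the commit message does not say whether the real check does either.
        String normalized = contentType.trim().toLowerCase(Locale.ROOT);
        return normalized.equals("text/plain") || normalized.startsWith("text/plain;");
    }

    public static void main(String[] args) {
        System.out.println(isPlainText("text/plain"));
        System.out.println(isPlainText("text/plain; charset=UTF-8"));
        System.out.println(isPlainText("text/html"));
        System.out.println(isPlainText("text/plainish"));
    }
}
```

Matching on the "text/plain;" prefix rather than on "text/plain" alone keeps unrelated types that merely share the leading characters from being accepted.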

1831 Commits