CatgirlIntelligenceAgency

Author	SHA1	Message	Date
Viktor Lofgren	d60c6b18d4	(doc) Update the readmes for the crawler, as they've grown stale.	2024-02-01 18:10:55 +01:00
Viktor Lofgren	52a0255814	() Add flag for disabling ASCII flattening The production configuration assumes all content of interest is 7 bit ASCII, and makes a series of optimizations based on this. This assumption holds poorly in the wild. Adding an experimental* system property 'system.noFlattenUnicode', that when set to TRUE, will disable this behavior. IMPORTANT!! The index needs to be re-constructed when this flag is changed, as different hash functions are selected for the keyword->identifier mappings.	2024-01-31 11:50:59 +01:00
Viktor Lofgren	3fff7f6878	(converter) Fix issue where quality limits were no longer enforced	2024-01-23 11:42:17 +01:00
Viktor Lofgren	41d896ba3e	(converter) Refactor content type check in PlainTextDocumentProcessorPlugin The method `isApplicable` in the `PlainTextDocumentProcessorPlugin` was refactored to handle a wider range of content types beyond merely "text/plain". It now also handles any content type that starts with "text/plain;", to accommodate contentTypes that append a charset as well.	2024-01-22 17:52:14 +01:00
Viktor Lofgren	40c9d2050f	(control) Fully automatic conversion Removed the need to have to run an external tool to pre-process the data in order to load stackexchange-style data into the search engine. Removed the tool itself. This stirred up some issues with the dependencies, that were due to both third-party:ing xz and importing it as a dependency. This has been fixed, and :third-party:xz was removed.	2024-01-22 13:03:24 +01:00
Viktor Lofgren	91c7960800	(crawler) Extract additional configuration properties This commit extracts several previously hardcoded configuration properties, and makes them available through system.properties. The documentation is updated to reflect the change. Dead code was also removed in the process. CrawlSpecGenerator is left feeling a bit over-engineered still, since it's built for a more general case, where all other implementations but the current one are removed, but we'll leave it like this for now as it's fairly readable still.	2024-01-20 10:36:04 +01:00
Viktor Lofgren	27ffb8fa8a	(converter) Integrate zim->db conversion into automatic encyclopedia processing workflow Previously, in order to load encyclopedia data into the search engine, it was necessary to use the encyclopedia.marginalia.nu converter to first create a .db-file. This isn't very ergonomic, so parts of that code-base were lifted in as a 3rd party library, and conversion from .zim to .db is now done automatically. The output file name is based on the original filename, plus a crc32 hash and a .db-ending, to ensure we can recycle the data on repeat loads.	2024-01-19 13:59:03 +01:00
Viktor Lofgren	22c8fb3f59	(crawler) Fix a bug where reference copies of crawl data were written without etag and last-modified This commit also adds a band-aid to ParquetSerializableCrawlDataStream to fetch this from the 304-entity. This can be removed in a few months.	2024-01-18 16:02:27 +01:00
Viktor Lofgren	fd1eec99b5	(cleanup) Fix broken tests	2024-01-15 15:44:33 +01:00
Viktor Lofgren	c41e68aaab	(control) New export actions for RSS/Atom feeds and term frequency data This commit also refactors the executor a bit, and introduces a new converter-feature called data-extractors for this class of jobs.	2024-01-15 14:54:26 +01:00
Viktor Lofgren	7c6e18f7a7	(*) Overhaul settings and properties Use a system.properties file to configure the system. This is loaded statically by MainClass or ProcessMainClass. Update the property names to be more consistent, and update the documentation to reflect the changes.	2024-01-13 17:12:18 +01:00
Viktor Lofgren	708a741960	(test) Clean up test usage of migrations Several tests were manually running migrations in a large copy-paste blob of code. This makes the tests less useful as it's possible to break the code while keeping the tests green by introducing a new migration that never gets run in the tests, and it's also difficult to reason about what the tests are doing. A new test helper library is introduced with a TestMigrationLoader that can either run Flyway migrations, or load specific migrations in the cases where a specific set of migrations needs to be loaded. Existing tests are migrated to use the new code.	2024-01-12 15:55:50 +01:00
Viktor Lofgren	0caef1b307	(warc) Toggle for saving WARC data Add a toggle for saving the WARC data generated by the search engine's crawler. Normally this is discarded, but for debugging or archival purposes, retaining it may be of interest. The warc files are concatenated into larger archives, up to about 1 GB each. An index is also created containing filenames, domain names, offsets and sizes to help navigate these larger archives. The warc data is saved in a directory warc/ under the crawl data storage.	2024-01-12 13:45:14 +01:00
Viktor Lofgren	734996002c	(*) install script for deploying Marginalia outside the codebase The changeset also makes the control service responsible for flyway migrations. This helps reduce the number of places the database configuration needs to be spread out. These automatic migrations can be disabled with -DdisableFlyway=true. The commit also adds curl to the docker container, to enable docker health checks and interdependencies.	2024-01-11 12:40:03 +01:00
Viktor Lofgren	14b7680328	(loader) Update the size of the keyword files created by the loader Previously these ended up being about 200 Mb each, which is wastefully small. Increasing the size of these files makes the index construction faster.	2024-01-10 17:09:19 +01:00
Viktor Lofgren	d56b394bcc	(control) GUI for loading external WARC files	2024-01-10 12:13:30 +01:00
Viktor Lofgren	fbad625126	(linkdb) Add delegating implementation of DomainLinkDb This facilitates switching between SQL and File-backed implementations on the fly while migrating from one to the other.	2024-01-08 19:56:33 +01:00
Viktor Lofgren	e49ba887e9	(crawl data) Add compatibility layer for old crawl data format The new converter logic assumes that the crawl data is ordered where the domain record comes first, and then a sequence of document records. This is true for the new parquet format, but not for the old zstd/gson format. To make the new converter compatible with the old format, a specialized reader is introduced that scans for the domain record before running through the sequence of document records; and presenting them in the new order. This is slower than just reading the file beginning to end, so in order to retain performance when this ordering isn't necessary, a CompatibilityLevel flag is added to CrawledDomainReader, permitting the caller to decide how compatible the data needs to be. Down the line when all the old data is purged, this should be removed, as it amounts to technical debt.	2024-01-08 19:16:49 +01:00
Viktor Lofgren	edc1acbb7e	(*) Replace EC_DOMAIN_LINK table with files and in-memory caching The EC_DOMAIN_LINK MariaDB table stores links between domains. This is problematic, as both updating and querying this table is very slow in relation to how small the data is (~10 GB). This slowness is largely caused by the database enforcing ACID guarantees we don't particularly need. This changeset replaces the EC_DOMAIN_LINK table with a file in each index node containing 32 bit integer pairs corresponding to links between two domains. This file is loaded in memory in each node, and can be queried via the Query Service. A migration step is needed before this file is created in each node. Until that happens, the actual data is loaded from the EC_DOMAIN_LINK table, but accessed as though it was a file. The changeset also migrates/renames the links.db file to documents.db to avoid naming confusion between the two.	2024-01-08 15:53:13 +01:00
Viktor Lofgren	6d2e14a656	(build) Remove false dependency between icp and index-service This dependency causes the executor service docker image to change when the index service docker image changes.	2024-01-05 13:17:29 +01:00
Viktor Lofgren	60361f88ed	(converter) Add upper 128KB limit to how much HTML we'll parse	2024-01-03 23:14:03 +01:00
Viktor Lofgren	f7560cb1d8	(feature) More trackers	2024-01-03 17:31:02 +01:00
Viktor Lofgren	1f66568d59	(feature) More trackers	2024-01-03 17:27:25 +01:00
Viktor Lofgren	7af07cef95	(feature) Add another doubleclick variant to the adtech trackers	2024-01-03 17:21:12 +01:00
Viktor Lofgren	41a540a629	(converter) Penalize chatgpt content farm spam	2024-01-03 17:04:38 +01:00
Viktor Lofgren	f599944942	(converter) Penalize chatgpt content farm spam	2024-01-03 16:51:26 +01:00
Viktor Lofgren	4ce692ccaf	(converter) Use SimpleBlockingThreadPool in ProcessingIterator	2024-01-03 14:27:47 +01:00
Viktor Lofgren	f0d9618dfc	(sideload) Reduce quality assessment. This will make these sideloaded results rank much better as there is a pretty harsh penalty for large low-q websites.	2024-01-02 12:34:58 +01:00
Viktor Lofgren	dc90c9ac65	(sideload) Just index based on first paragraph This seems like it would make the wikipedia search result worse, but it drastically improves the result quality! This is because wikipedia has a lot of articles that each talk about a lot of irrelevant concepts, and indexing the entire document means tangentially relevant results tend to displace the most relevant results.	2024-01-01 16:19:38 +01:00
Viktor Lofgren	75d87c73d1	(crawler) Disable Java's infinite DNS caching	2023-12-31 16:59:08 +01:00
Viktor Lofgren	0fe44c9bf2	(crawler) Fix broken test A necessary step was accidentally deleted when cleaning up these tests previously.	2023-12-30 13:56:44 +01:00
Viktor Lofgren	7a1d20ed0a	(converter) Better use of ProcessingIterator Modify ProcessingIterator to be constructed via a factory, to enable re-use of its backing executor service. This reduces thread churn in the converter sideloader style processing of regular crawl data.	2023-12-30 13:53:55 +01:00
Viktor Lofgren	70c83b60a1	(converter) Clean up fullProcessing() This function made some very flimsy-looking assumptions about the order of an iterable. These are still made, but more explicitly so.	2023-12-30 13:36:18 +01:00
Viktor Lofgren	7ba296ccdf	(converter) Route sizeHint to SideloadProcessing Route the sizeHint from the input parquet file to SideloadProcessing, so that it can set sideloadSizeAdvice appropriately, instead of using a fixed "large" number. This is necessary to populate the KNOWN_URL column in the domain data table, which is important as it is used in e.g. calculating how far to re-crawl the site in the future.	2023-12-30 13:05:10 +01:00
Viktor Lofgren	0b112cb4d4	(warc) Update URL encoding in WarcProtocolReconstructor The URI query string is now URL encoded in the WarcProtocolReconstructor. This change ensures proper encoding of special characters as per the standard URL encoding rules and improves URL validity during the crawling process.	2023-12-29 19:41:37 +01:00
Viktor Lofgren	e7dd28b926	(converter) Optimize sideload-loading Use ProcessingIterator to fan out processing of documents across more cores, instead of doing all of it in the writer thread blocking everything else with slow single-threaded processing.	2023-12-29 14:25:48 +01:00
Viktor Lofgren	dec3b1092d	(converter) Fix bugs in conversion This commit adds a safety check that the URL of the document is from the correct domain. It also adds a sizeHint() method to SerializableCrawlDataStream which may provide an indication if the stream is very large and benefits from sideload-style processing (which is slow). It furthermore addresses a bug where the ProcessedDomain.write() invoked the wrong method on ConverterBatchWriter and only wrote the domain metadata, not the rest...	2023-12-29 13:58:08 +01:00
Viktor Lofgren	c488599879	(converter) Fix NPE in converter	2023-12-28 19:52:26 +01:00
Viktor Lofgren	c847d83011	(converter) Add size hint to converter sideload processing	2023-12-28 19:14:16 +01:00
Viktor Lofgren	5ce46a61d4	Merge branch 'master' into converter-optimizations	2023-12-28 13:26:19 +01:00
Viktor Lofgren	00a974a721	(crawler) Fix bug in resynchronizer where it would fail to capture expected exceptions This commit also improves the test coverage for this part of the code.	2023-12-27 20:02:17 +01:00
Viktor Lofgren	7428ba2dd7	(converter) Basic test coverage for sideloading-style processing	2023-12-27 19:29:26 +01:00
Viktor Lofgren	b37223c053	(converter) Basic test coverage for sideloading-style processing	2023-12-27 18:33:16 +01:00
Viktor Lofgren	24051fec03	(converter) WIP Run sideload-style processing for large domains The processor normally retains the domain data in memory after processing to be able to do additional site-wide analysis. This works well, except there are a number of outlier websites that have an absurd number of documents that can rapidly fill up the heap of the process. These websites now receive a simplified treatment. This is executed in the converter batch writer thread. This is slower, but the documents will not be persisted in memory.	2023-12-27 18:20:03 +01:00
Viktor Lofgren	f811a29f87	(crawler) Fix resource leak in crawler A 10 MB thread local buffer wasn't static. Oops.	2023-12-27 16:32:17 +01:00
Viktor Lofgren	acf7bcc7a6	(converter) Refactor the DomainProcessor for new format of crawl data With the new crawler modifications, the crawl data comes in a slightly different order, and a result of this is that we can optimize the converter. This is a breaking change that will be incompatible with the old style of crawl data, hence it will linger as a branch for a while. The first step is to move stuff out of the domain processor into the document processor.	2023-12-27 13:57:59 +01:00
Viktor Lofgren	9e5fe71f5b	(crawler) Switch hash function in crawler Guava's hashers are a bit allocation hungry, and a big driver of GC churn in the crawler. This switches to the modified Murmur hash function used throughout Marginalia.	2023-12-27 13:29:00 +01:00
Viktor Lofgren	3ea1ddae22	(crawler) Roll back switch to virtual thread pool in crawler This seems to cause a resource leak, it seems the http library uses thread locals?	2023-12-26 19:37:34 +01:00
Viktor Lofgren	25d086c4e1	(crawler) Clean up stale warc files We should probably have an option to keep them, but not by default!	2023-12-25 15:07:36 +01:00
Viktor Lofgren	f779f760c4	(crawler) Even more lenient resyncing	2023-12-25 01:44:18 +01:00

303 Commits
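
Note on 41d896ba3e: the content type check described in that commit message (accept "text/plain" itself, plus anything starting with "text/plain;") can be sketched roughly as below. This is a minimal illustration, not the project's code: the class name `PlainTextContentTypes` and the method name `isPlainText` are hypothetical, and the null check, trimming and lower-casing are assumptions added for robustness that the commit message does not mention.

```java
import java.util.Locale;

// Hypothetical helper; names are illustrative and not taken from the repository.
final class PlainTextContentTypes {

    // True for "text/plain" itself and for "text/plain;..." with parameters
    // (such as a charset) appended.
    static boolean isPlainText(String contentType) {
        if (contentType == null) {
            return false;
        }
        // Assumption: normalizing whitespace and case is an addition of this sketch;
        // the commit message only describes the exact match and the prefix match.
        String normalized = contentType.trim().toLowerCase(Locale.ROOT);
        return normalized.equals("text/plain") || normalized.startsWith("text/plain;");
    }

    public static void main(String[] args) {
        System.out.println(isPlainText("text/plain"));
        System.out.println(isPlainText("text/plain; charset=UTF-8"));
        System.out.println(isPlainText("text/html"));
    }
}
```

Matching on the prefix "text/plain;" rather than "text/plain" keeps unrelated types that merely share the leading characters from being treated as plain text.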