CatgirlIntelligenceAgency

Author	SHA1	Message	Date
Viktor Lofgren	ac1aca36b0	(valuation) Increase the penalty for adtech a bit	2024-01-03 15:20:38 +01:00
Viktor Lofgren	1f3b89cf28	(index) Reduce the value of site and site-adjacent in BM25P calculations	2024-01-03 15:20:18 +01:00
Viktor Lofgren	f732f6ae6f	(index) Tweak result valuation renormalization	2024-01-03 14:53:53 +01:00
Viktor Lofgren	0b9f3d1751	(*) Remove accidental commit of debug logging	2024-01-03 14:32:00 +01:00
Viktor Lofgren	0806aa6dfe	(language-processing) Add maximum length limit for text input in SentenceExtractor Added a new constant, MAX_TEXT_LENGTH, to the SentenceExtractor class. If the length of the text input exceeds this limit, the text is truncated to fit within the limit. This modification is designed to prevent excessive resource usage for unusually long text inputs.	2024-01-03 14:27:47 +01:00
Viktor Lofgren	32436d099c	(language-processing) Add maximum length limit for text input in SentenceExtractor Added a new constant, MAX_TEXT_LENGTH, to the SentenceExtractor class. If the length of the text input exceeds this limit, the text is truncated to fit within the limit. This modification is designed to prevent excessive resource usage for unusually long text inputs.	2024-01-03 14:27:47 +01:00
Viktor Lofgren	4ce692ccaf	(converter) Use SimpleBlockingThreadPool in ProcessingIterator	2024-01-03 14:27:47 +01:00
Viktor Lofgren	3caa4eed75	Merge branch 'master' into converter-optimizations	2024-01-02 17:13:25 +01:00
Viktor Lofgren	c70f508ae8	(prometheus) Saner histogram buckets	2024-01-02 17:13:14 +01:00
Viktor Lofgren	9e64d7aaf9	Merge branch 'master' into converter-optimizations	2024-01-02 15:46:24 +01:00
Viktor Lofgren	72b773f06d	(search) fix search metrics labeling	2024-01-02 15:46:14 +01:00
Viktor Lofgren	5f978b865b	Merge branch 'master' into converter-optimizations	2024-01-02 15:41:48 +01:00
Viktor Lofgren	57a4f92722	(api) fix missing metrics label in api service	2024-01-02 15:41:38 +01:00
Viktor Lofgren	87351e89ca	Merge branch 'master' into converter-optimizations	2024-01-02 15:17:02 +01:00
Viktor Lofgren	192e356169	(prometheus) Add instrumentation to the api service	2024-01-02 15:12:44 +01:00
Viktor Lofgren	31232e49fb	(prometheus) Add instrumentation to the search, qs and index services.	2024-01-02 15:02:29 +01:00
Viktor Lofgren	9d93a31755	Merge branch 'master' into converter-optimizations	2024-01-02 12:36:16 +01:00
Viktor Lofgren	9f7df59945	(sideload) Reduce quality assessment. This will make these sideloaded results rank much better as there is a pretty harsh penalty for large low-q websites.	2024-01-02 12:35:59 +01:00
Viktor Lofgren	d2418521a7	(index) Further ranking adjustments	2024-01-02 12:35:59 +01:00
Viktor Lofgren	9330b5b1d9	(index) Adjust rank weightings to fix bad wikipedia results There was as bug where if the input of ResultValuator.normalize() was negative, it was truncated to zero. This meant that "bad" results always rank the same. The penalty factor "overallPart" was moved outside of the function and was re-weighted to accomplish a better normalization. Some of the weights were also re-adjusted based on what appears to produce better results. Needs evaluation.	2024-01-02 12:35:44 +01:00
Viktor Lofgren	faa50bf578	(sideload) Just index based on first paragraph This seems like it would make the wikipedia search result worse, but it drastically improves the result quality! This is because wikipedia has a lot of articles that each talk about a lot of irrelevant concepts, and indexing the entire document means tangentially relevant results tend to displace the most relevant results.	2024-01-02 12:35:44 +01:00
Viktor Lofgren	f0d9618dfc	(sideload) Reduce quality assessment. This will make these sideloaded results rank much better as there is a pretty harsh penalty for large low-q websites.	2024-01-02 12:34:58 +01:00
Viktor Lofgren	310a880fa8	(index) Further ranking adjustments	2024-01-02 12:24:52 +01:00
Viktor Lofgren	fc6e3b6da0	(index) Further ranking adjustments	2024-01-01 18:51:03 +01:00
Viktor Lofgren	50771045d0	(index) Further ranking adjustments	2024-01-01 18:43:17 +01:00
Viktor Lofgren	8f522470ed	(index) Adjust rank weightings to fix bad wikipedia results There was as bug where if the input of ResultValuator.normalize() was negative, it was truncated to zero. This meant that "bad" results always rank the same. The penalty factor "overallPart" was moved outside of the function and was re-weighted to accomplish a better normalization. Some of the weights were also re-adjusted based on what appears to produce better results. Needs evaluation.	2024-01-01 17:16:29 +01:00
Viktor Lofgren	dc90c9ac65	(sideload) Just index based on first paragraph This seems like it would make the wikipedia search result worse, but it drastically improves the result quality! This is because wikipedia has a lot of articles that each talk about a lot of irrelevant concepts, and indexing the entire document means tangentially relevant results tend to displace the most relevant results.	2024-01-01 16:19:38 +01:00
Viktor Lofgren	e46e174b59	(keyword-extractor) Add another test for Name-extractor	2024-01-01 15:21:51 +01:00
Viktor Lofgren	7f3f3f577c	(backup) Add task heartbeats to the backup service	2024-01-01 15:20:57 +01:00
Viktor Lofgren	75d87c73d1	(crawler) Disable Java's infinite DNS caching	2023-12-31 16:59:08 +01:00
Viktor Lofgren	0fe44c9bf2	(crawler) Fix broken test A necessary step was accidentally deleted when cleaning up these tests previously.	2023-12-30 13:56:44 +01:00
Viktor Lofgren	7a1d20ed0a	(converter) Better use of ProcessingIterator Modify processingiterator to be constructed via a factory, to enable re-use of its backing executor service. This reduces thread churn in the converter sideloader style processing of regular crawl data.	2023-12-30 13:53:55 +01:00
Viktor Lofgren	70c83b60a1	(converter) Clean up fullProcessing() This function made some very flimsy-looking assumptions about the order of an iterable. These are still made, but more explicitly so.	2023-12-30 13:36:18 +01:00
Viktor Lofgren	7ba296ccdf	(converter) Route sizeHint to SideloadProcessing Route the sizeHint from the input parquet file to SideloadProcessing, so that it can set sideloadSizeAdvice appropriately, instead of using a fixed "large" number. This is necessary to populate the KNOWN_URL column in the domain data table, which is important as it is used in e.g. calculating how far to re-crawl the site in the future.	2023-12-30 13:05:10 +01:00
Viktor Lofgren	0b112cb4d4	(warc) Update URL encoding in WarcProtocolReconstructor The URI query string is now URL encoded in the WarcProtocolReconstructor. This change ensures proper encoding of special characters as per the standard URL encoding rules and improves URL validity during the crawling process.	2023-12-29 19:41:37 +01:00
Viktor Lofgren	68ac8d3e09	(search) Fetch fewer linking and similar domains. Showing a total of 200 connected domains is not very informative.	2023-12-29 16:37:27 +01:00
Viktor Lofgren	f6fa8bd722	(search) Fetch fewer linking and similar domains. Showing a total of 200 connected domains is not very informative.	2023-12-29 16:37:00 +01:00
Viktor Lofgren	6aee27a3f1	(*) Fix bug in EdgeDomain where it would permit domains with a trailing period, DNS style.	2023-12-29 16:36:01 +01:00
Viktor Lofgren	401568033c	Merge branch 'master' into converter-optimizations	2023-12-29 15:55:57 +01:00
Viktor Lofgren	ea73be6831	(search) Remove the ugly placeholder screenshots from the site info view.	2023-12-29 15:55:46 +01:00
Viktor Lofgren	ba8a75c84b	Fix bug in ProcessingIterator where it would run the tasks in only one single thread instead of using the pool	2023-12-29 15:10:32 +01:00
Viktor Lofgren	a1f3ccdd6d	Fix bug in ProcessingIterator where it would run the tasks in only one single thread instead of using the pool	2023-12-29 14:59:39 +01:00
Viktor Lofgren	647d38007f	Reduce queue polling time in ProcessingIterator Updated ProcessingIterator's queue polling from one second to 50 milliseconds for improved performance. This facilitates faster document processing across more cores, reducing bottlenecks and slow single-threaded processing.	2023-12-29 14:27:58 +01:00
Viktor Lofgren	e7dd28b926	(converter) Optimize sideload-loading Use ProcessingIterator to fan out processing of documents across more cores, instead of doing all of it in the writer thread blocking everything else with slow single-threaded processing.	2023-12-29 14:25:48 +01:00
Viktor Lofgren	b5fc9673d9	Merge branch 'master' into converter-optimizations	2023-12-29 14:04:43 +01:00
Viktor Lofgren	a065040323	(search) Don't inject arbitrary HTML into the site info view xD	2023-12-29 14:04:26 +01:00
Viktor Lofgren	dec3b1092d	(converter) Fix bugs in conversion This commit adds a safety check that the URL of the document is from the correct domain. It also adds a sizeHint() method to SerializableCrawlDataStream which may provide an indication if the stream is very large and benefits from sideload-style processing (which is slow). It furthermore addresses a bug where the ProcessedDomain.write() invoked the wrong method on ConverterBatchWriter and only wrote the domain metadata, not the rest...	2023-12-29 13:58:08 +01:00
Viktor Lofgren	407915a86e	(converter) Fix NPEs in converter due to the new data format	2023-12-28 22:54:53 +01:00
Viktor Lofgren	c488599879	(converter) Fix NPE in converter	2023-12-28 19:52:26 +01:00
Viktor Lofgren	bcecc93e39	(converter) Swallow errors in parquet data stream	2023-12-28 19:45:35 +01:00
Viktor Lofgren	ff7d1a250e	Merge branch 'master' into converter-optimizations	2023-12-28 19:35:00 +01:00
Viktor Lofgren	70f338c3de	(search) Fix NPE in layout selection	2023-12-28 19:34:46 +01:00
Viktor Lofgren	c847d83011	(converter) Add size hint to converter sideload processing	2023-12-28 19:14:16 +01:00
Viktor Lofgren	5ce46a61d4	Merge branch 'master' into converter-optimizations	2023-12-28 13:26:19 +01:00
Viktor	775974d5ec	Merge pull request #67 from MarginaliaSearch/rss-feeds-in-site-info Add RSS Feeds to site info (WIP)	2023-12-28 13:25:38 +01:00
Viktor Lofgren	c7af40c368	(search) Change layout balance when feeds/samples are present	2023-12-28 13:16:10 +01:00
Viktor Lofgren	00a974a721	(crawler) Fix bug in resynchronizer where it would fail to capture expected exceptions This commit also improves the test coverage for this part of the code.	2023-12-27 20:02:17 +01:00
Viktor Lofgren	7428ba2dd7	(converter) Basic test coverage for sideloading-style processing	2023-12-27 19:29:26 +01:00
Viktor Lofgren	b37223c053	(converter) Basic test coverage for sideloading-style processing	2023-12-27 18:33:16 +01:00
Viktor Lofgren	24051fec03	(converter) WIP Run sideload-style processing for large domains The processor normally retains the domain data in memory after processing to be able to do additional site-wide analysis. This works well, except there are a number of outlier websites that have an absurd number of documents that can rapidly fill up the heap of the process. These websites now receive a simplified treatment. This is executed in the converter batch writer thread. This is slower, but the documents will not be persisted in memory.	2023-12-27 18:20:03 +01:00
Viktor Lofgren	f811a29f87	(crawler) Fix resource leak in crawler A 10 MB thread local buffer wasn't static. Oops.	2023-12-27 16:32:17 +01:00
Viktor Lofgren	acf7bcc7a6	(converter) Refactor the DomainProcessor for new format of crawl data With the new crawler modifications, the crawl data comes in a slightly different order, and a result of this is that we can optimize the converter. This is a breaking change that will be incompatible with the old style of crawl data, hence it will linger as a branch for a while. The first step is to move stuff out of the domain processor into the document processor.	2023-12-27 13:57:59 +01:00
Viktor Lofgren	9707366348	(test) Fix a few slow tests that broke due to domainCount	2023-12-27 13:29:59 +01:00
Viktor Lofgren	9e5fe71f5b	(crawler) Switch hash function in crawler Guava's hashers are a bit allocation hungry, and a big driver of GC churn in the crawler. This switches to the modified Murmur hash function used throughout Marginalia.	2023-12-27 13:29:00 +01:00
Viktor Lofgren	5d1b7da728	Updated site info feed and search service Modified site info feed template to secure the description field against injected code. Also adjusted search service by extracting samples within the correct scope and including them in the returned site info. This improves the quality and security of the displayed information.	2023-12-26 22:06:01 +01:00
Viktor Lofgren	3ea1ddae22	(crawler) Roll back switch to virtual thread pool in crawler This seems to cause a resource leak, it seems the http library uses thread locals?	2023-12-26 19:37:34 +01:00
Viktor Lofgren	1694e9c78c	(search) Add RSS Feeds to site info This change integrates the Feedlot RSS Bot with Marginalia's site info view to offer a preview of the latest updates. The change introduces a new tiny feature that is a feedlot-client based on Java's HttpClient.	2023-12-26 16:21:40 +01:00
Viktor Lofgren	4763077b76	(search/index) Add a new keyword "count" This is for filtering results on how many times the term appears on the domain. The intent is to be beneficial in creating e.g. a domain search feature. It's also very helpful when tracking down spammy domains.	2023-12-25 20:38:29 +01:00
Viktor Lofgren	c0eaca220c	(search) Add convenient link for AS search to the search view	2023-12-25 15:07:58 +01:00
Viktor Lofgren	25d086c4e1	(crawler) Clean up stale warc files We should probably have an option to keep them, but not by default!	2023-12-25 15:07:36 +01:00
Viktor Lofgren	88551043cd	(crawler) Even more lenient resyncing	2023-12-25 01:48:11 +01:00
Viktor Lofgren	f779f760c4	(crawler) Even more lenient resyncing	2023-12-25 01:44:18 +01:00
Viktor Lofgren	f18f82e229	(crawler) Write etags and last-modified on reference copy This commit also fixes a test that broke with a previous change.	2023-12-25 01:40:13 +01:00
Viktor Lofgren	67ef2b45fa	(crawler) Reduce logging	2023-12-25 01:10:03 +01:00
Viktor Lofgren	d72e871265	(warc) Fix resync	2023-12-25 01:03:03 +01:00
Viktor Lofgren	4c9bc13309	(warc) Reduce log spam	2023-12-25 00:58:31 +01:00
Viktor Lofgren	84563b0d46	(crawler) Be a bit more conservative about pulling etags and so on if the previous fetch wasn't OK	2023-12-25 00:55:05 +01:00
Viktor Lofgren	c5aab7e8db	(warc) Fix NPE in WarcRecorder	2023-12-25 00:54:38 +01:00
Viktor Lofgren	1755b646b8	(warc) Fix NPE in WarcRecorder	2023-12-25 00:48:42 +01:00
Viktor Lofgren	85f906ea53	(executor) Fix removal of stale process heartbeats	2023-12-23 13:49:24 +01:00
Viktor Lofgren	e1a155a9c8	(crawler) Increase growth of crawl jobs A number of crawl jobs get stuck at about 300 documents, or just under. This seems to be because we fail to increase the crawl limit, which is based on MAX(200, 1.25 x GOOD_URLS) with a 1.5x modifier applied upon a recrawl. GOOD_URLS is based on how many documents successfully process, which is typically fairly small. Switching to KNOWN_URLS should let this grow faster. The SQL query in the DbCrawlSpecProvider class has been updated; 'GOOD_URLS' has been replaced with 'KNOWN_URLS'. This update ensures the correct data is selected from the DOMAIN_METADATA table. The floor is also increased to 250 from 200.	2023-12-23 13:22:10 +01:00
Viktor Lofgren	0454447e41	(executor) Implement process removal for long-absent heartbeats Added functionality to remove processes from listing that have not checked in for over a day. A 'removeProcessHeartbeat' function was created to delete the respective entry from the PROCESS_HEARTBEAT table in case heartbeats are absent for more than one day.	2023-12-23 13:18:21 +01:00
Viktor Lofgren	7b40c0bbee	(assistant) Clean up similar websites' results	2023-12-22 14:07:01 +01:00
Viktor Lofgren	dc773c5c20	(adjacencies) Clean up AdjacenciesLoader Make JDBC batching more consistent, also adds a test case for the loader.	2023-12-21 14:14:22 +01:00
Viktor Lofgren	b6253b03c2	(adjacencies) Fix bug in AdjacenciesLoader This fixes a bug where a prepared statement was created before the table it was supposed to insert into was created. This fails and does nothing. Furthermore, added the logging that would have warned about this failure, had it been in place.	2023-12-21 13:12:31 +01:00
Viktor Lofgren	a5bc29245b	(cleanup) Remove vestigial support for WARC crawl data streams	2023-12-20 15:46:21 +01:00
Viktor Lofgren	bfae478251	Refactor CrawlerRevisitor for better consistency	2023-12-20 15:21:49 +01:00
Viktor Lofgren	a7cd490593	(minor) Remove dead code.	2023-12-19 18:58:33 +01:00
Viktor Lofgren	dd8fb04886	(converter) Add sizeloadSizeAdvice field to several ProcessedDomain Since the sideloaders don't populate the documents list in ProcessedDomain to keep the memory footprint manageable, the code that estimates knownUrls etc. will set them to zero, which has negative effects on their ranking. This change will populate them with a bullshit value within a sane ballpark, ensuring that these domains show up in the rankings.	2023-12-19 18:37:51 +01:00
Viktor	5bd3934d22	Merge pull request #64 from dreimolo/macos_AS_fix Macos apple silicon fix, and slight improvements to sample downloader	2023-12-18 18:29:14 +01:00
Viktor Lofgren	3a56a06c4f	(warc) Add a fields for etags and last-modified headers to the new crawl data formats Make some temporary modifications to the CrawledDocument model to support both a "big string" style headers field like in the old formats, and explicit fields as in the new formats. This is a bit awkward to deal with, but it's a necessity until we migrate off the old formats entirely. The commit also adds a few tests to this logic.	2023-12-18 17:45:54 +01:00
Viktor Lofgren	126ac3816f	(converter) Reduce queue size in ConverterWriter The size of the ArrayBlockingQueue in ConverterWriter.java has been reduced from 4 to 1. This change aims to reduce the memory utilization by not having fully processed domains piling up in RAM. This may cause the writer to go idle in waiting for new data, but that may be preferable to an OOM.	2023-12-18 13:42:40 +01:00
Viktor Lofgren	d02bed1a55	(loader) Optimize DomainLoaderService for faster startups Initialization parameters in DomainLoaderService and DomainIdRegistry have been updated to improve performance. This is done by adding sane default sizes to the hash tables involved, reducing GC churn, but also by setting a sensible fetch size to the queries used, and not fetching irrelevant information such as the domain name.	2023-12-18 13:15:10 +01:00
Viktor Lofgren	b7ed0ce537	(loader) Reset count after executing batch in DomainLoaderService This should greatly speed up starting the loader process.	2023-12-18 12:43:53 +01:00
Viktor Lofgren	a742503508	(search) Add view for showing mutual links between two websites	2023-12-17 17:50:44 +01:00
Viktor Lofgren	33312ab09e	(geo-ip) Update readme	2023-12-17 16:08:33 +01:00
Viktor Lofgren	c422f0b9fb	(geo-ip) Tidy up error handling	2023-12-17 16:06:51 +01:00
Viktor Lofgren	c92f1b8df8	(geo-ip) Revert removal of ip2location logic We do both ip2location and ASN data. The change also adds some keywords based on autonomous system information, on a somewhat experimental basis. It would be neat to be able to e.g. exclude cloud services or just e.g. cloudflare from the search results.	2023-12-17 15:03:00 +01:00
Viktor Lofgren	bde68ba48b	Merge branch 'master' into asn-info	2023-12-17 14:00:23 +01:00
Viktor Lofgren	bf44805e69	(*) Rename EdgeDomain$domain into topDomain This variable had a very confusing name, and was dangerously easy to use in the wrong place with the result of getting something that only works as expected half the time. Ideally this class needs an overhaul, the assumptions it makes about domain names aren't great.	2023-12-17 14:00:07 +01:00
Viktor Lofgren	edf9aa2c23	(*) Rename EdgeDomain$domain into topDomain This variable had a very confusing name, and was dangerously easy to use in the wrong place with the result of getting something that only works as expected half the time. Ideally this class needs an overhaul, the assumptions it makes about domain names aren't great.	2023-12-17 13:59:54 +01:00
Viktor Lofgren	4801c47273	(crawling-model) Fix bug where CrawledDocument.getDomain() trimmed www-prefixes This had the knock-on effect of breaking the anchor tag loading in the processor for a lot of domains, since they'd grab domains for the wrong domain name.	2023-12-17 13:53:31 +01:00
Viktor Lofgren	bcad6492d6	(sideloader) Fix integration problems with sideloaders In encyclopedia, add a class "mw-content-text" that the WikiSpecialization class is looking for during pruning to give the articles a more fair treatment. Also add generator keywords based on the generator type provided, to ensure that these documents show up in appropriate filters. Further, add a new document flag value 'Sideloaded' to be able to distinguish these entries.	2023-12-17 13:28:17 +01:00
Viktor Lofgren	5ab2a22e88	(search) Fix result count back down to 1 per domain	2023-12-17 13:14:23 +01:00
Viktor Lofgren	d7bd540683	(*) Replace the ip2location IP geolocation data with ASN information from apnic.net. Doesn't really make sense to use ip2location as a middle man for information that is already freely available...	2023-12-16 21:55:04 +01:00
Viktor Lofgren	722b56c8ca	(index) Fix rare bug in the index-switching logic This is caused by a resource contention with the query code. The proper way to fix this is to use some form of synchronization, but that will slow the code down. So we just hammer it a few times and let the GC deal with the problem if it fails. Not optimal, but fast.	2023-12-16 18:57:35 +01:00
Viktor Lofgren	f3f12058dc	(assistant) Fix logic error in filtering related domains	2023-12-16 18:45:53 +01:00
Viktor Lofgren	3da38d0483	(assistant) Fix logic error in filtering related domains	2023-12-16 18:44:25 +01:00
Viktor Lofgren	d715b1f9ca	(search) Improve error handling in search parameters parsing The code now intercepts and deals with potential exceptions during the parsing of search parameters. This is in response to constant bad requests from bots which were cluttering the logs. A catch clause is added that suppresses these errors and redirects to the base URL.	2023-12-16 18:42:13 +01:00
Viktor Lofgren	e13fa25e11	(assistant) Clean up the site info related domains view by filtering viable domains	2023-12-16 18:37:09 +01:00
Viktor Lofgren	34d4834ff6	(assistant) Clean up the site info related domains view by filtering viable domains	2023-12-16 18:27:24 +01:00
Viktor Lofgren	117ddd17d7	(assistant) Fix bugs in IP flag emoji generation	2023-12-16 17:07:17 +01:00
Viktor Lofgren	6f2bf38f0e	(index) Fix off-by-1 error in the domain count limiter	2023-12-16 16:57:05 +01:00
Viktor Lofgren	320882c34a	(site-info) Try to discover the schema of the website with a site:-query The site info view can't blindly assume that every website supports https. To figure out which schema to use when linking to a site, execute a single-result search for site:domain.name and then grab the schema off the result. To allow this, a count parameter is introduced to doSiteSearch() in SearchOperator.	2023-12-16 16:34:53 +01:00
Viktor Lofgren	3113b5a551	(warc) Filter WarcResponses based on X-Robots-Tags There really is no fantastic place to put this logic, but we need to remove entries with an X-Robots-Tags header where that header indicates it doesn't want to be crawled by Marginalia.	2023-12-16 15:58:27 +01:00
dreimolo	c0cc05177f	corrects protobuf.plugins.grpc	2023-12-16 14:24:41 +01:00
dreimolo	0b34d43804	workaround for failing mac on apple silicon deps	2023-12-16 14:22:11 +01:00
Viktor Lofgren	54ed3b86ba	(minor) Remove dead code.	2023-12-15 21:49:35 +01:00
Viktor Lofgren	2001d0f707	(converter) Add @Deprecated annotation to a few fields that should no longer be used.	2023-12-15 21:42:00 +01:00
Viktor Lofgren	0f9cd9c87d	(warc) More accurate filering of advisory records Further create records for resources that were blocked due to robots.txt; as well as tests to verify this happens.	2023-12-15 21:37:02 +01:00
Viktor Lofgren	2e7db61808	(warc) More accurate filering of advisory records We want to mute some of these records so that they don't produce documents, but in some cases we want a document to be produced for accounting purposes. Added improved tests that reach for known resources on www.marginalia.nu to test the behavior when encountering bad content type and 404s. The commit also adds some safety try-catch:es around the charset handling, as it may sometimes explode when fed incorrect data, and we do be guessing...	2023-12-15 21:31:16 +01:00
Viktor Lofgren	5329968155	(crawler) Update CrawlingThenConvertingIntegrationTest This commit updates CrawlingThenConvertingIntegrationTest with additional tests for invalid, redirecting, and blocked domains. Improvements have also been made to filter out irrelevant entries in ParquetSerializableCrawlDataStream.	2023-12-15 21:04:06 +01:00
Viktor Lofgren	2e536e3141	(crawler) Add timestamp to CrawledDocument records This update includes the addition of timestamps to the parquet format for crawl data, as extracted from the Warc stream. The parquet format stores the timestamp as a 64 bit long, seconds since unix epoch, without a logical type. This is to avoid having to do format conversions when writing and reading the data. This parquet field populates the timestamp field in CrawledDocument.	2023-12-15 20:23:27 +01:00
Viktor Lofgren	cf935a5331	(converter) Read cookie information Add an optional new field to CrawledDocument containing information about whether the domain has cookies. This was previously on the CrawledDomain object, but since the WarcFormat requires us to write a WarcInfo object at the start of a crawl rather than at the end, this information is unobtainable when creating the CrawledDomain object. Also fix a bug in the deduplication logic in the DomainProcessor class that caused a test to break.	2023-12-15 18:09:53 +01:00
Viktor Lofgren	fa81e5b8ee	(warc) Use a non-standard WARC header to convey information about whether a website uses cookies This information is then propagated to the parquet file as a boolean. For documents that are copied from the reference, use whatever value we last saw. This isn't 100% deterministic and may result in false negatives, but permits websites that used cookies but have stopped to repent and have the change reflect in the search engine more quickly.	2023-12-15 16:37:53 +01:00
Viktor Lofgren	9fea22b90d	(warc) Further tidying This commit includes mostly exception handling, error propagation, a few bug fixes and minor changes to log formatting. The CrawlDelayTimer, HTTP 429 responses and IOException responses are now more accurately handled. A non-standard WarcXEntityRefused WARC record has also been introduced, essentially acting as a rejected 'response' with different semantics. Besides these, several existing features have been refined, such as URL encoding, crawl depth incrementing and usage of Content-Length headers.	2023-12-15 15:38:23 +01:00
Viktor Lofgren	0889b6d247	(warc) Clean up parquet conversion This commit further cleans up the warc->parquet conversion. It fixes issues with redirect handling in WarcRecorder, adds support information about redirects and errors due to probe failure. It also refactors the fetch result, body extraction and content type abstractions.	2023-12-14 20:39:40 +01:00
Viktor Lofgren	1328bc4938	(warc) Clean up parquet conversion This commit cleans up the warc->parquet conversion. Records with a http status other than 200 are now included. The commit also fixes a bug where the robots.txt parser would be fed the full HTTP response (and choke), instead of the body. The DocumentBodyExtractor code has also been cleaned up, and now offers a way of just getting the byte[] representation for later processing, as conversion to and from strings is a bit wasteful.	2023-12-14 16:05:48 +01:00
Viktor Lofgren	787a20cbaa	(crawling-model) Implement a parquet format for crawl data This is not hooked into anything yet. The change also makes modifications to the parquet-floor library to support reading and writing of byte[] arrays. This is desirable since we may in the future want to support inputs that are not text-based, and codifying the assumption that each document is a string will definitely cause us grief down the line.	2023-12-13 16:22:19 +01:00
Viktor Lofgren	440e097d78	(crawler) WIP integration of WARC files into the crawler and converter process. This commit is in a pretty rough state. It refactors the crawler fairly significantly to offer better separation of concerns. It replaces the zstd compressed json files used to store crawl data with WARC files entirely, and the converter is modified to be able to consume this data. This works, -ish. There appears to be some bug relating to reading robots.txt, and the X-Robots-Tag header is no longer processed either. A problem is that the WARC files are a bit too large. It will probably be likely to introduce a new format to store the crawl data long term, something like parquet; and use WARCs for intermediate storage to enable the crawler to be restarted without needing a recrawl.	2023-12-13 15:33:42 +01:00
Viktor Lofgren	b74a3ebd85	(crawler) WIP integration of WARC files into the crawler process. At this stage, the crawler will use the WARCs to resume a crawl if it terminates incorrectly. This is a WIP commit, since the warc files are not fully incorporated into the work flow, they are deleted after the domain is crawled. The commit also includes fairly invasive refactoring of the crawler classes, to accomplish better separation of concerns.	2023-12-11 19:32:58 +01:00
Viktor Lofgren	45987a1d98	Merge branch 'master' into warc	2023-12-11 14:32:35 +01:00
Viktor Lofgren	8f0950fc44	(geoip) Fix incorrect synchronization.	2023-12-11 14:01:39 +01:00
Viktor Lofgren	30bc3f9281	(converter) Use the prefix ip: instead of geopip: for country codes This is the same as the prefix for the IP address, but I don't think that substantially matters, the as two have such different namespaces there can be no confusion.	2023-12-11 13:59:23 +01:00
Viktor Lofgren	f655ec5a5c	(*) Refactor GeoIP-related code In this commit, GeoIP-related classes are refactored and relocated to a common library as they are shared across multiple services. The crawler is refactored to enable the GeoIpBlocklist to use the new GeoIpDictionary as the base of its decisions. The converter is modified ot query this data to add a geoip:-keyword to documents to permit limiting a search to the country of the hosting server. The commit also adds due BY-SA attribution in the search engine footer for the source of the IP geolocation data.	2023-12-10 17:30:43 +01:00
Viktor Lofgren	84b4158555	(minor) Fix broken test	2023-12-10 14:39:20 +01:00
Viktor Lofgren	91dd45cf64	(search) IP and IP geolocation in site info view This commit also fixes a bug in the loader where the IP field wouldn't always populate as intended, and refactors the DomainInformationService to use significantly fewer SQL queries.	2023-12-09 20:06:55 +01:00
Viktor Lofgren	37af60254f	(search) Better recipe filter Tune the recipe filter to give better results, by using the 'popular' domains set along with excluding results with heavy tracking.	2023-12-09 20:06:55 +01:00
Viktor Lofgren	f0e736d4ea	(search) Update the search profile 'Academia' to strictly filter on academic tlds The previous version used a personalized pagerank centering on a few academic domains, but this didn't work very well and most results were not very academia-centric.	2023-12-09 20:06:55 +01:00
Viktor Lofgren	e3ebb0c5bb	(*) Rename the search filter 'RETRO' into 'POPULAR' This will make the terminology more consistent between the GUI and the code. The rankings yaml still uses 'retro' though, for to retain compatibility.	2023-12-09 20:06:54 +01:00
Viktor Lofgren	6382f779c3	(search) Revert back to using 'Popular' as the default search filter Unfiltered is a bit too ... unfiltered, and gives a bad first impression for many queries.	2023-12-09 16:34:12 +01:00
Viktor Lofgren	8ef34883a8	(search) Move site information out of the search service and into assistant. This reduces the impact of restarting the search service, as the site information takes a few minutes to load during which it's not available. It also permits exposing this information via API in the future if there is interest in this. The assistant service was also modified to do a late load of the suggestions trie, as this is a major contributor to its start-up time. Finally, some changes were made to the client library, a new get() method was added that takes a TypeToken to allow deserialization of generics such as List<Foo>, and the scheduler was also modified to use virtual threads.	2023-12-09 16:30:06 +01:00
Viktor Lofgren	5c46af0edb	(converter) Refactor EncyclopediaMarginaliaNuSideloader to use ProcessingIterator Refactored the getDocumentsStream method in EncyclopediaMarginaliaNuSideloader to use the newly extracted ProcessingIterator class that encapsulates processing a stream of results from e.g a database query in parallel and returning the computed results as an iterator. The iterator was also improved on to be more reliable, previous versions of the logic would sometimes deadlock due to false positives in hasMore().	2023-12-09 15:20:53 +01:00
Viktor Lofgren	b6511fbfe2	(converter) Add AnchorTextKeywords to EncyclopediaMarginaliaNuSideloader processing The commit updates EncyclopediaMarginaliaNuSideloader to include the AnchorTextKeywords in processing documents, aiding search result relevance. It also removes old test-related functionality and a large but fairly useless test previously used to debug a specific problem, to the detriment of the overall code quality.	2023-12-09 15:20:52 +01:00
Viktor Lofgren	eccb12b366	(control) Fix spurious state detection in control-side actors A race condition was found where precession actors would sometimes skip a step, because when invoking ExecutorRemoteActor.getState(), it would get the last 'OK' actor state from a previous run of the actor! To avoid this, the trigger method was changed from returning a boolean to the message ID, negative if an error occurred, to be passed to getState to select only messages that pertain to the present or future runs.	2023-12-09 12:50:05 +01:00
Viktor Lofgren	d0982e7ba5	(converter) Add error handling and lazy load external domain links The converter was not properly initiating the external links for each domain, causing an NPE in conversion. This needs to be loaded later since we don't know the domain we're processing until we've seen it in the crawl data. Also made some refactorings to make finding converter bugs easier, and finding the related domain less awkward from the SerializableCrawlData interface.	2023-12-09 12:33:39 +01:00
Viktor Lofgren	fc30da0d48	(converter) Add academia recognition to DomainProcessor The code now includes an additional function in the DomainProcessor class that checks if a domain is associated with academia. An academic domain is identified by the ".edu" TLD, or fits a specific regex pattern matching domains like .ac.ccTld or .edu.ccTld. If these conditions are met, the search term "special:academia" is added to the domain. The existing academia search filter uses personalized pagerank to select academia-adjacent domains, but it isn't working very well. The hope is that filtering on domain names will be more effective, and that it can supplant the ranking-based approach.	2023-12-08 20:31:34 +01:00
Viktor Lofgren	e6a1052ba7	Simplify CrawlerMain, removing the CrawlerLimiter and using a global HttpFetcher with a virtual thread pool dispatcher instead of the default.	2023-12-08 20:24:01 +01:00
Viktor Lofgren	968dce50fc	(crawler) Refactored IpInterceptingNetworkInterceptor for clarity.	2023-12-08 17:45:46 +01:00
Viktor Lofgren	3bbffd3c22	(crawler) Refactor HttpFetcher to integrate WarcRecorder Partially hook in the WarcRecorder into the crawler process. So far it's not read, but should record the crawled documents. The WarcRecorder and HttpFetcher classes were also refactored and broken apart to be easier to reason about.	2023-12-08 17:12:51 +01:00

1 2 3 4 5 ...

928 Commits