CatgirlIntelligenceAgency

Author	SHA1	Message	Date
Viktor Lofgren	d2418521a7	(index) Further ranking adjustments	2024-01-02 12:35:59 +01:00
Viktor Lofgren	9330b5b1d9	(index) Adjust rank weightings to fix bad wikipedia results There was a bug where if the input of ResultValuator.normalize() was negative, it was truncated to zero. This meant that "bad" results always ranked the same. The penalty factor "overallPart" was moved outside of the function and was re-weighted to accomplish a better normalization. Some of the weights were also re-adjusted based on what appears to produce better results. Needs evaluation.	2024-01-02 12:35:44 +01:00
Viktor Lofgren	faa50bf578	(sideload) Just index based on first paragraph This seems like it would make the wikipedia search result worse, but it drastically improves the result quality! This is because wikipedia has a lot of articles that each talk about a lot of irrelevant concepts, and indexing the entire document means tangentially relevant results tend to displace the most relevant results.	2024-01-02 12:35:44 +01:00
Viktor Lofgren	f6fa8bd722	(search) Fetch fewer linking and similar domains. Showing a total of 200 connected domains is not very informative.	2023-12-29 16:37:00 +01:00
Viktor Lofgren	6aee27a3f1	(*) Fix bug in EdgeDomain where it would permit domains with a trailing period, DNS style.	2023-12-29 16:36:01 +01:00
Viktor Lofgren	ea73be6831	(search) Remove the ugly placeholder screenshots from the site info view.	2023-12-29 15:55:46 +01:00
Viktor Lofgren	a065040323	(search) Don't inject arbitrary HTML into the site info view xD	2023-12-29 14:04:26 +01:00
Viktor Lofgren	70f338c3de	(search) Fix NPE in layout selection	2023-12-28 19:34:46 +01:00
Viktor	775974d5ec	Merge pull request #67 from MarginaliaSearch/rss-feeds-in-site-info Add RSS Feeds to site info (WIP)	2023-12-28 13:25:38 +01:00
Viktor Lofgren	c7af40c368	(search) Change layout balance when feeds/samples are present	2023-12-28 13:16:10 +01:00
Viktor Lofgren	00a974a721	(crawler) Fix bug in resynchronizer where it would fail to capture expected exceptions This commit also improves the test coverage for this part of the code.	2023-12-27 20:02:17 +01:00
Viktor Lofgren	f811a29f87	(crawler) Fix resource leak in crawler A 10 MB thread local buffer wasn't static. Oops.	2023-12-27 16:32:17 +01:00
Viktor Lofgren	9707366348	(test) Fix a few slow tests that broke due to domainCount	2023-12-27 13:29:59 +01:00
Viktor Lofgren	9e5fe71f5b	(crawler) Switch hash function in crawler Guava's hashers are a bit allocation hungry, and a big driver of GC churn in the crawler. This switches to the modified Murmur hash function used throughout Marginalia.	2023-12-27 13:29:00 +01:00
Viktor Lofgren	5d1b7da728	Updated site info feed and search service Modified site info feed template to secure the description field against injected code. Also adjusted search service by extracting samples within the correct scope and including them in the returned site info. This improves the quality and security of the displayed information.	2023-12-26 22:06:01 +01:00
Viktor Lofgren	3ea1ddae22	(crawler) Roll back switch to virtual thread pool in crawler This seems to cause a resource leak, it seems the http library uses thread locals?	2023-12-26 19:37:34 +01:00
Viktor Lofgren	1694e9c78c	(search) Add RSS Feeds to site info This change integrates the Feedlot RSS Bot with Marginalia's site info view to offer a preview of the latest updates. The change introduces a new tiny feature that is a feedlot-client based on Java's HttpClient.	2023-12-26 16:21:40 +01:00
Viktor Lofgren	4763077b76	(search/index) Add a new keyword "count" This is for filtering results on how many times the term appears on the domain. The intent is to be beneficial in creating e.g. a domain search feature. It's also very helpful when tracking down spammy domains.	2023-12-25 20:38:29 +01:00
Viktor Lofgren	c0eaca220c	(search) Add convenient link for AS search to the search view	2023-12-25 15:07:58 +01:00
Viktor Lofgren	25d086c4e1	(crawler) Clean up stale warc files We should probably have an option to keep them, but not by default!	2023-12-25 15:07:36 +01:00
Viktor Lofgren	88551043cd	(crawler) Even more lenient resyncing	2023-12-25 01:48:11 +01:00
Viktor Lofgren	f779f760c4	(crawler) Even more lenient resyncing	2023-12-25 01:44:18 +01:00
Viktor Lofgren	f18f82e229	(crawler) Write etags and last-modified on reference copy This commit also fixes a test that broke with a previous change.	2023-12-25 01:40:13 +01:00
Viktor Lofgren	67ef2b45fa	(crawler) Reduce logging	2023-12-25 01:10:03 +01:00
Viktor Lofgren	d72e871265	(warc) Fix resync	2023-12-25 01:03:03 +01:00
Viktor Lofgren	4c9bc13309	(warc) Reduce log spam	2023-12-25 00:58:31 +01:00
Viktor Lofgren	84563b0d46	(crawler) Be a bit more conservative about pulling etags and so on if the previous fetch wasn't OK	2023-12-25 00:55:05 +01:00
Viktor Lofgren	c5aab7e8db	(warc) Fix NPE in WarcRecorder	2023-12-25 00:54:38 +01:00
Viktor Lofgren	1755b646b8	(warc) Fix NPE in WarcRecorder	2023-12-25 00:48:42 +01:00
Viktor Lofgren	85f906ea53	(executor) Fix removal of stale process heartbeats	2023-12-23 13:49:24 +01:00
Viktor Lofgren	e1a155a9c8	(crawler) Increase growth of crawl jobs A number of crawl jobs get stuck at about 300 documents, or just under. This seems to be because we fail to increase the crawl limit, which is based on MAX(200, 1.25 x GOOD_URLS) with a 1.5x modifier applied upon a recrawl. GOOD_URLS is based on how many documents successfully process, which is typically fairly small. Switching to KNOWN_URLS should let this grow faster. The SQL query in the DbCrawlSpecProvider class has been updated; 'GOOD_URLS' has been replaced with 'KNOWN_URLS'. This update ensures the correct data is selected from the DOMAIN_METADATA table. The floor is also increased to 250 from 200.	2023-12-23 13:22:10 +01:00
Viktor Lofgren	0454447e41	(executor) Implement process removal for long-absent heartbeats Added functionality to remove processes from listing that have not checked in for over a day. A 'removeProcessHeartbeat' function was created to delete the respective entry from the PROCESS_HEARTBEAT table in case heartbeats are absent for more than one day.	2023-12-23 13:18:21 +01:00
Viktor Lofgren	7b40c0bbee	(assistant) Clean up similar websites' results	2023-12-22 14:07:01 +01:00
Viktor Lofgren	dc773c5c20	(adjacencies) Clean up AdjacenciesLoader Make JDBC batching more consistent, also adds a test case for the loader.	2023-12-21 14:14:22 +01:00
Viktor Lofgren	b6253b03c2	(adjacencies) Fix bug in AdjacenciesLoader This fixes a bug where a prepared statement was created before the table it was supposed to insert into was created. This fails and does nothing. Furthermore, added the logging that would have warned about this failure, had it been in place.	2023-12-21 13:12:31 +01:00
Viktor Lofgren	a5bc29245b	(cleanup) Remove vestigial support for WARC crawl data streams	2023-12-20 15:46:21 +01:00
Viktor Lofgren	bfae478251	Refactor CrawlerRevisitor for better consistency	2023-12-20 15:21:49 +01:00
Viktor Lofgren	a7cd490593	(minor) Remove dead code.	2023-12-19 18:58:33 +01:00
Viktor Lofgren	283d2caa81	Merge remote-tracking branch 'origin/master'	2023-12-19 18:38:01 +01:00
Viktor Lofgren	dd8fb04886	(converter) Add sizeloadSizeAdvice field to several ProcessedDomain Since the sideloaders don't populate the documents list in ProcessedDomain to keep the memory footprint manageable, the code that estimates knownUrls etc. will set them to zero, which has negative effects on their ranking. This change will populate them with a bullshit value within a sane ballpark, ensuring that these domains show up in the rankings.	2023-12-19 18:37:51 +01:00
Viktor	ce8dca7659	Update Additional Contributors.md	2023-12-19 12:22:01 +01:00
Viktor	5bd3934d22	Merge pull request #64 from dreimolo/macos_AS_fix Macos apple silicon fix, and slight improvements to sample downloader	2023-12-18 18:29:14 +01:00
Viktor Lofgren	128f550ee5	(run) Download to a temporary file to avoid corruption from aborted downloads	2023-12-18 18:28:17 +01:00
Viktor Lofgren	3a56a06c4f	(warc) Add fields for etags and last-modified headers to the new crawl data formats Make some temporary modifications to the CrawledDocument model to support both a "big string" style headers field like in the old formats, and explicit fields as in the new formats. This is a bit awkward to deal with, but it's a necessity until we migrate off the old formats entirely. The commit also adds a few tests to this logic.	2023-12-18 17:45:54 +01:00
Viktor Lofgren	126ac3816f	(converter) Reduce queue size in ConverterWriter The size of the ArrayBlockingQueue in ConverterWriter.java has been reduced from 4 to 1. This change aims to reduce the memory utilization by not having fully processed domains piling up in RAM. This may cause the writer to go idle in waiting for new data, but that may be preferable to an OOM.	2023-12-18 13:42:40 +01:00
Viktor Lofgren	d02bed1a55	(loader) Optimize DomainLoaderService for faster startups Initialization parameters in DomainLoaderService and DomainIdRegistry have been updated to improve performance. This is done by adding sane default sizes to the hash tables involved, reducing GC churn, but also by setting a sensible fetch size to the queries used, and not fetching irrelevant information such as the domain name.	2023-12-18 13:15:10 +01:00
Viktor Lofgren	b7ed0ce537	(loader) Reset count after executing batch in DomainLoaderService This should greatly speed up starting the loader process.	2023-12-18 12:43:53 +01:00
Viktor Lofgren	a742503508	(search) Add view for showing mutual links between two websites	2023-12-17 17:50:44 +01:00
Viktor Lofgren	33312ab09e	(geo-ip) Update readme	2023-12-17 16:08:33 +01:00
Viktor Lofgren	c422f0b9fb	(geo-ip) Tidy up error handling	2023-12-17 16:06:51 +01:00
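
Commit e1a155a9c8 above describes the crawl limit as a floor combined with a 1.25x growth factor on the URL count, with a 1.5x modifier on recrawl. The arithmetic can be sketched roughly as follows. This is only an illustration under assumptions: the class and method names are hypothetical, the real logic lives in a SQL query in DbCrawlSpecProvider, and how exactly the 1.5x modifier is combined with the base value is a guess (here it multiplies the result).

```java
// Hypothetical sketch of the crawl limit arithmetic described in e1a155a9c8.
// Not the actual Marginalia code; names and the way the recrawl modifier is
// applied are assumptions made for illustration.
public final class CrawlLimitSketch {

    // Floor after the change (previously 200 according to the commit message).
    static final int FLOOR = 250;

    /**
     * @param knownUrls value of KNOWN_URLS for the domain (the commit switched
     *                  from GOOD_URLS to KNOWN_URLS so the limit grows faster)
     * @param recrawl   whether the 1.5x recrawl modifier applies
     */
    public static int crawlLimit(int knownUrls, boolean recrawl) {
        int base = Math.max(FLOOR, (int) (1.25 * knownUrls));
        // Assumption: the modifier scales the whole base value.
        return recrawl ? (int) (1.5 * base) : base;
    }

    public static void main(String[] args) {
        System.out.println(crawlLimit(100, false)); // small domain: the floor applies
        System.out.println(crawlLimit(400, false)); // larger domain: growth factor applies
        System.out.println(crawlLimit(400, true));  // recrawl: modifier on top
    }
}
```

With a small successfully-processed count, the floor dominates and the limit never grows, which matches the stall the commit message describes; feeding in the larger known-URL count lets the 1.25x term take over.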


1479 Commits
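
Commit 128f550ee5 in the log above downloads to a temporary file so that an aborted download cannot leave a corrupt file at the final path. The change itself is to the run scripts and is not shown here; the following is a minimal Java sketch of the same general technique, with a hypothetical class and method name. It assumes the temporary file is created in the destination's directory, so that the final move is typically a plain rename.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical sketch of "write to a temp file, then move into place".
// Not taken from the Marginalia repository.
public final class SafeDownloadSketch {

    public static void saveViaTempFile(InputStream source, Path destination) throws IOException {
        Path dir = destination.toAbsolutePath().getParent();
        // Temp file lives next to the destination so the move stays on one file system.
        Path tmp = Files.createTempFile(dir, "download-", ".part");
        try {
            // createTempFile already created the file, so it must be overwritten.
            Files.copy(source, tmp, StandardCopyOption.REPLACE_EXISTING);
            // Only a fully written file is ever moved to the final name.
            Files.move(tmp, destination, StandardCopyOption.REPLACE_EXISTING);
        } finally {
            // Cleans up the partial file if the copy or move failed; no-op after a successful move.
            Files.deleteIfExists(tmp);
        }
    }
}
```

If the copy is interrupted, the destination path is left untouched and only the `.part` file is discarded.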