There was a bug where, if the input of ResultValuator.normalize() was negative, it was truncated to zero. This meant that "bad" results always ranked the same. The penalty factor "overallPart" has been moved outside of the function and re-weighted to accomplish a better normalization.
Some of the weights were also re-adjusted based on what appears to produce better results. Needs evaluation.
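As a sketch of the before/after behavior (illustrative names and weights, not the actual ResultValuator code):

```java
// Hypothetical sketch of the bug and the fix; names, weights, and structure
// are illustrative, not the actual ResultValuator code.
class NormalizationSketch {
    // Before: the penalty was folded into normalize()'s input, and negative
    // values were truncated to zero, so every sufficiently bad result
    // normalized to the same score.
    static double normalizeOld(double value, double overallPart) {
        return Math.max(0, value - overallPart);
    }

    // After: normalize() only clamps the base value...
    static double normalize(double value) {
        return Math.max(0, value);
    }

    // ...and the re-weighted penalty is applied outside, so worse results
    // keep ranking below merely mediocre ones.
    static double score(double value, double overallPart) {
        return normalize(value) - 0.5 * overallPart; // 0.5 is an illustrative weight
    }
}
```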
This seems like it would make the Wikipedia search results worse, but it drastically improves result quality!
This is because Wikipedia has a lot of articles that each touch on a lot of loosely related concepts, and indexing the entire document means tangentially relevant results tend to displace the most relevant ones.
Guava's hashers are a bit allocation-hungry and a big driver of GC churn in the crawler. This switches to the modified Murmur hash function used throughout Marginalia.
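For illustration: a call like Guava's Hashing.murmur3_128().hashString(...) allocates a Hasher and a HashCode object per invocation, whereas a plain loop over the characters produces no garbage. A sketch of the idea (the constants are MurmurHash3's fmix64 finalizer, but this is not necessarily the exact function Marginalia uses):

```java
class HashSketch {
    // Illustrative allocation-free 64-bit string hash in the Murmur style.
    static long hash(CharSequence s) {
        long h = 0;
        for (int i = 0; i < s.length(); i++) {
            h ^= s.charAt(i);
            h *= 0xFF51AFD7ED558CCDL; // fmix64 multiply
            h ^= h >>> 33;            // fmix64 shift-xor
        }
        return h;
    }
}
```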
Modified the site info feed template to protect the description field against injected code. Also adjusted the search service to extract samples within the correct scope and include them in the returned site info. This improves both the quality and the security of the displayed information.
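A minimal sketch of the kind of hardening meant here, assuming plain HTML-escaping of the description before it reaches the template; the actual template mechanism may differ:

```java
class EscapeSketch {
    // Hypothetical sketch: escape the feed description before rendering so
    // injected markup displays as inert text.
    static String escapeDescription(String raw) {
        StringBuilder sb = new StringBuilder(raw.length());
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            switch (c) {
                case '&'  -> sb.append("&amp;");
                case '<'  -> sb.append("&lt;");
                case '>'  -> sb.append("&gt;");
                case '"'  -> sb.append("&quot;");
                case '\'' -> sb.append("&#39;");
                default   -> sb.append(c);
            }
        }
        return sb.toString();
    }
}
```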
This change integrates the Feedlot RSS Bot with Marginalia's site info view to offer a preview of the latest updates.
The change also introduces a small new feedlot-client based on Java's HttpClient.
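A rough sketch of what such a client might look like; the endpoint path and response format here are assumptions, not the actual Feedlot API:

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Hypothetical sketch of a feedlot-client on java.net.http.HttpClient.
public class FeedlotClient {
    private final HttpClient client = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(5))
            .build();
    private final String baseUrl;

    public FeedlotClient(String baseUrl) {
        this.baseUrl = baseUrl;
    }

    /** Fetch the latest feed items for a domain, as a raw JSON string. */
    public String fetchFeedItems(String domain) throws IOException, InterruptedException {
        var request = HttpRequest.newBuilder()
                .uri(URI.create(baseUrl + "/feed/" + domain)) // assumed endpoint
                .timeout(Duration.ofSeconds(10))
                .GET()
                .build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }
}
```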
This allows filtering results by how many times the term appears on the domain. The intent is to support e.g. a domain search feature; it's also very helpful when tracking down spammy domains.
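As an illustration of the idea, a sketch that keeps only results from domains where the term matched at least a minimum number of documents; all names here are hypothetical:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class DomainCountFilter {
    record SearchResult(String domain, String url) {} // hypothetical result type

    // Keep only results whose domain contributed at least minCount matches.
    static List<SearchResult> filter(List<SearchResult> results, int minCount) {
        Map<String, Long> hitsPerDomain = results.stream()
                .collect(Collectors.groupingBy(SearchResult::domain, Collectors.counting()));
        return results.stream()
                .filter(r -> hitsPerDomain.get(r.domain()) >= minCount)
                .toList();
    }
}
```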
A number of crawl jobs get stuck at about 300 documents, or just under. This seems to be because we fail to increase the crawl limit, which is based on MAX(200, 1.25 x GOOD_URLS) with a 1.5x modifier applied upon a recrawl. GOOD_URLS counts how many documents processed successfully, which is typically a fairly small number, so the limit grows slowly. Switching to KNOWN_URLS should let it grow faster.
The SQL query in the DbCrawlSpecProvider class has been updated; 'GOOD_URLS' has been replaced with 'KNOWN_URLS'. This update ensures the correct data is selected from the DOMAIN_METADATA table.
The floor is also raised from 200 to 250.
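Putting the pieces together, the effective limit now behaves roughly like this sketch (illustrative code; the real logic is split between the SQL query and the crawler):

```java
class CrawlLimitSketch {
    // Illustrative computation of the crawl limit after this change.
    static int crawlLimit(int knownUrls, boolean isRecrawl) {
        double limit = Math.max(250, 1.25 * knownUrls); // floor raised from 200 to 250
        if (isRecrawl) {
            limit *= 1.5; // recrawls get extra headroom to grow
        }
        return (int) limit;
    }
}
```

Since KNOWN_URLS grows as new links are discovered, even before their documents process successfully, the limit should escape the ~300 plateau on subsequent crawls.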
Added functionality to remove processes from the listing if they have not checked in for over a day. A 'removeProcessHeartbeat' function was created to delete the corresponding entry from the PROCESS_HEARTBEAT table when heartbeats have been absent for more than one day.
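A JDBC-flavored sketch of what removeProcessHeartbeat might do; the column names are assumptions based on the description, not the actual schema:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

class HeartbeatCleanupSketch {
    // Hypothetical sketch; PROCESS_NAME and HEARTBEAT_TIME are assumed columns.
    void removeProcessHeartbeat(Connection conn, String processName) throws SQLException {
        try (PreparedStatement stmt = conn.prepareStatement("""
                DELETE FROM PROCESS_HEARTBEAT
                WHERE PROCESS_NAME = ?
                  AND HEARTBEAT_TIME < NOW() - INTERVAL 1 DAY
                """)) {
            stmt.setString(1, processName);
            stmt.executeUpdate();
        }
    }
}
```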