Viktor Lofgren
e53bb70bef
(converter) Penalize chatgpt content farm spam
2024-01-05 13:22:13 +01:00
Viktor Lofgren
109bec372c
(index) Adjust BM25 parameters
2024-01-05 13:21:52 +01:00
Viktor Lofgren
5c2561d05d
(search) Add query strategy requiring link
2024-01-05 13:21:52 +01:00
Viktor Lofgren
0e970b8037
(valuation) Tweaking penalties a bit
2024-01-05 13:21:52 +01:00
Viktor Lofgren
1694b4d6ef
(valuation) Increase the penalty for adtech a bit
2024-01-05 13:21:34 +01:00
Viktor Lofgren
396299c1db
(index) Reduce the value of site and site-adjacent in BM25P calculations
2024-01-05 13:21:33 +01:00
Viktor Lofgren
71d789aab0
(index) Tweak result valuation renormalization
2024-01-05 13:21:33 +01:00
Viktor Lofgren
41ca50ff0e
(build) Enable reproducible builds in build.gradle
...
Settings for enabling reproducible builds for all subprojects were added to improve build consistency. This includes preserving file timestamps and ordering files reproducibly.
This is primarily of help for docker, since it uses hashes to determine if a file or image layer has changed.
2024-01-05 13:19:59 +01:00
Viktor Lofgren
6d2e14a656
(build) Remove false dependency between icp and index-service
...
This dependency causes the executor service docker image to change when the index service docker image changes.
2024-01-05 13:17:29 +01:00
Viktor Lofgren
4078708aea
(qs) Better metrics for QS
2024-01-04 13:27:14 +01:00
Viktor Lofgren
343ea9c6d8
(search) Fetch fewer results per page
...
This is a test to evaluate how this impacts load times.
2024-01-04 13:18:07 +01:00
Viktor Lofgren
60361f88ed
(converter) Add upper 128KB limit to how much HTML we'll parse
2024-01-03 23:14:03 +01:00
Viktor Lofgren
f7560cb1d8
(feature) More trackers
2024-01-03 17:31:02 +01:00
Viktor Lofgren
1f66568d59
(feature) More trackers
2024-01-03 17:27:25 +01:00
Viktor Lofgren
7af07cef95
(feature) Add another doubleclick variant to the adtech trackers
2024-01-03 17:21:12 +01:00
Viktor Lofgren
41a540a629
(converter) Penalize chatgpt content farm spam
2024-01-03 17:04:38 +01:00
Viktor Lofgren
f599944942
(converter) Penalize chatgpt content farm spam
2024-01-03 16:51:26 +01:00
Viktor Lofgren
1e06aee6a2
(index) Adjust BM25 parameters
2024-01-03 16:30:46 +01:00
Viktor Lofgren
7bbaedef97
(search) Add query strategy requiring link
2024-01-03 16:23:00 +01:00
Viktor Lofgren
87048511fe
(valuation) Tweaking penalties a bit
2024-01-03 16:02:25 +01:00
Viktor Lofgren
c770f0b68b
(valuation) Tweaking penalties a bit
2024-01-03 15:59:21 +01:00
Viktor Lofgren
78c00ad512
(valuation) Tweaking penalties a bit
2024-01-03 15:52:57 +01:00
Viktor Lofgren
a19879d494
(valuation) Tweaking penalties a bit
2024-01-03 15:32:33 +01:00
Viktor Lofgren
ac1aca36b0
(valuation) Increase the penalty for adtech a bit
2024-01-03 15:20:38 +01:00
Viktor Lofgren
1f3b89cf28
(index) Reduce the value of site and site-adjacent in BM25P calculations
2024-01-03 15:20:18 +01:00
Viktor Lofgren
f732f6ae6f
(index) Tweak result valuation renormalization
2024-01-03 14:53:53 +01:00
Viktor Lofgren
0b9f3d1751
(*) Remove accidental commit of debug logging
2024-01-03 14:32:00 +01:00
Viktor Lofgren
0806aa6dfe
(language-processing) Add maximum length limit for text input in SentenceExtractor
...
Added a new constant, MAX_TEXT_LENGTH, to the SentenceExtractor class. If the length of the text input exceeds this limit, the text is truncated to fit within the limit. This modification is designed to prevent excessive resource usage for unusually long text inputs.
2024-01-03 14:27:47 +01:00
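The truncation described in the commit above can be sketched roughly as follows. This is a minimal illustration, not the actual SentenceExtractor code: the value of MAX_TEXT_LENGTH and the helper name clampLength are assumptions, since the commit message names only the constant.

```java
public class SentenceExtractorSketch {
    // Hypothetical value; the commit message does not state the real limit.
    static final int MAX_TEXT_LENGTH = 65536;

    // Hypothetical helper: returns the text unchanged if it fits,
    // otherwise only its first MAX_TEXT_LENGTH characters.
    static String clampLength(String text) {
        if (text.length() > MAX_TEXT_LENGTH) {
            return text.substring(0, MAX_TEXT_LENGTH);
        }
        return text;
    }

    public static void main(String[] args) {
        String longText = "a".repeat(MAX_TEXT_LENGTH + 100);
        System.out.println(clampLength(longText).length()); // 65536
        System.out.println(clampLength("short").length());  // 5
    }
}
```

Truncating before any sentence splitting bounds the work done per document, at the cost of dropping the tail of unusually long inputs.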
Viktor Lofgren
32436d099c
(language-processing) Add maximum length limit for text input in SentenceExtractor
...
Added a new constant, MAX_TEXT_LENGTH, to the SentenceExtractor class. If the length of the text input exceeds this limit, the text is truncated to fit within the limit. This modification is designed to prevent excessive resource usage for unusually long text inputs.
2024-01-03 14:27:47 +01:00
Viktor Lofgren
4ce692ccaf
(converter) Use SimpleBlockingThreadPool in ProcessingIterator
2024-01-03 14:27:47 +01:00
Viktor Lofgren
3caa4eed75
Merge branch 'master' into converter-optimizations
2024-01-02 17:13:25 +01:00
Viktor Lofgren
c70f508ae8
(prometheus) Saner histogram buckets
2024-01-02 17:13:14 +01:00
Viktor Lofgren
9e64d7aaf9
Merge branch 'master' into converter-optimizations
2024-01-02 15:46:24 +01:00
Viktor Lofgren
72b773f06d
(search) fix search metrics labeling
2024-01-02 15:46:14 +01:00
Viktor Lofgren
5f978b865b
Merge branch 'master' into converter-optimizations
2024-01-02 15:41:48 +01:00
Viktor Lofgren
57a4f92722
(api) fix missing metrics label in api service
2024-01-02 15:41:38 +01:00
Viktor Lofgren
87351e89ca
Merge branch 'master' into converter-optimizations
2024-01-02 15:17:02 +01:00
Viktor
7920c67a48
Merge pull request #71 from MarginaliaSearch/metrics
...
Add Prometheus Instrumentation
2024-01-02 15:13:53 +01:00
Viktor Lofgren
192e356169
(prometheus) Add instrumentation to the api service
2024-01-02 15:12:44 +01:00
Viktor Lofgren
31232e49fb
(prometheus) Add instrumentation to the search, qs and index services.
2024-01-02 15:02:29 +01:00
Viktor Lofgren
116595d218
(prometheus) Add in-docker prometheus instance to exfiltrate metrics from the docker-based services
2024-01-02 14:28:53 +01:00
Viktor Lofgren
9d93a31755
Merge branch 'master' into converter-optimizations
2024-01-02 12:36:16 +01:00
Viktor Lofgren
9f7df59945
(sideload) Reduce quality assessment.
...
This will make these sideloaded results rank much better as there is a pretty harsh penalty for large low-q websites.
2024-01-02 12:35:59 +01:00
Viktor Lofgren
d2418521a7
(index) Further ranking adjustments
2024-01-02 12:35:59 +01:00
Viktor Lofgren
9330b5b1d9
(index) Adjust rank weightings to fix bad wikipedia results
...
There was a bug where if the input of ResultValuator.normalize() was negative, it was truncated to zero. This meant that "bad" results always ranked the same. The penalty factor "overallPart" was moved outside of the function and was re-weighted to accomplish a better normalization.
Some of the weights were also re-adjusted based on what appears to produce better results. Needs evaluation.
2024-01-02 12:35:44 +01:00
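The clamping problem described in the commit above can be illustrated with a toy model. Everything here is hypothetical: the normalize formula, the method names, and the convention that a higher score is better are assumptions made for the sketch, not the real ResultValuator code. It only shows why folding a penalty into a clamped input makes all heavily penalized results tie, while applying the penalty outside keeps them ordered.

```java
public class NormalizeSketch {
    // Hypothetical normalizer: maps [0, inf) into [0, 1) and clamps
    // negative inputs to zero, as the commit message describes.
    static double normalize(double value) {
        if (value < 0) {
            value = 0;
        }
        return value / (value + 1.0);
    }

    // Penalty folded into the input: once the penalty exceeds the
    // relevance, every result collapses to the same score of 0.
    static double scoreClamped(double relevance, double penalty) {
        return normalize(relevance - penalty);
    }

    // Penalty applied outside the normalizer: differently penalized
    // results keep distinct scores.
    static double scoreSeparated(double relevance, double penalty) {
        return normalize(relevance) - penalty;
    }

    public static void main(String[] args) {
        System.out.println(scoreClamped(1.0, 2.0) == scoreClamped(1.0, 5.0));    // true
        System.out.println(scoreSeparated(1.0, 2.0) > scoreSeparated(1.0, 5.0)); // true
    }
}
```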
Viktor Lofgren
faa50bf578
(sideload) Just index based on first paragraph
...
This seems like it would make the wikipedia search result worse, but it drastically improves the result quality!
This is because wikipedia has a lot of articles that each talk about a lot of irrelevant concepts, and indexing the entire document means tangentially relevant results tend to displace the most relevant results.
2024-01-02 12:35:44 +01:00
Viktor Lofgren
f0d9618dfc
(sideload) Reduce quality assessment.
...
This will make these sideloaded results rank much better as there is a pretty harsh penalty for large low-q websites.
2024-01-02 12:34:58 +01:00
Viktor Lofgren
310a880fa8
(index) Further ranking adjustments
2024-01-02 12:24:52 +01:00
Viktor Lofgren
fc6e3b6da0
(index) Further ranking adjustments
2024-01-01 18:51:03 +01:00
Viktor Lofgren
50771045d0
(index) Further ranking adjustments
2024-01-01 18:43:17 +01:00