Commit Graph

846 Commits

Author SHA1 Message Date
Viktor Lofgren
fbad625126 (linkdb) Add delegating implementation of DomainLinkDb
This facilitates switching between SQL and File-backed implementations on the fly while migrating from one to the other.
2024-01-08 19:56:33 +01:00
Viktor Lofgren
e49ba887e9 (crawl data) Add compatibility layer for old crawl data format
The new converter logic assumes that the crawl data is ordered where the domain record comes first, and then a sequence of document records.  This is true for the new parquet format, but not for the old zstd/gson format.

To make the new converter compatible with the old format, a specialized reader is introduced that scans for the domain record before running through the sequence of document records; and presenting them in the new order.

This is slower than just reading the file beginning to end, so in order to retain performance when this ordering isn't necessary, a CompatibilityLevel flag is added to CrawledDomainReader, permitting the caller to decide how compatible the data needs to be.

Down the line when all the old data is purged, this should be removed, as it amounts to technical debt.
2024-01-08 19:16:49 +01:00
Viktor Lofgren
edc1acbb7e (*) Replace EC_DOMAIN_LINK table with files and in-memory caching
The EC_DOMAIN_LINK MariaDB table stores links between domains.  This is problematic, as both updating and querying this table is very slow in relation to how small the data is (~10 GB).  This slowness is largely caused by the database enforcing ACID guarantees we don't particularly need.

This changeset replaces the EC_DOMAIN_LINK table with a file in each index node containing 32 bit integer pairs corresponding to links between two domains.  This file is loaded in memory in each node, and can be queried via the Query Service.

A migration step is needed before this file is created in each node.   Until that happens, the actual data is loaded from the EC_DOMAIN_LINK table, but accessed as though it was a file.

The changeset also migrates/renames the links.db file to documents.db to avoid naming confusion between the two.
2024-01-08 15:53:13 +01:00
Viktor Lofgren
6d2e14a656 (build) Remove false depdencency between icp and index-service
This dependency causes the executor service docker image to change when the index service docker image changes.
2024-01-05 13:17:29 +01:00
Viktor Lofgren
4078708aea (qs) Better metrics for QS 2024-01-04 13:27:14 +01:00
Viktor Lofgren
343ea9c6d8 (search) Fetch fewer results per page
This is a test to evaluate how this impacts load times.
2024-01-04 13:18:07 +01:00
Viktor Lofgren
60361f88ed (converter) Add upper 128KB limit to how much HTML we'll parse 2024-01-03 23:14:03 +01:00
Viktor Lofgren
f7560cb1d8 (feature) More trackers 2024-01-03 17:31:02 +01:00
Viktor Lofgren
1f66568d59 (feature) More trackers 2024-01-03 17:27:25 +01:00
Viktor Lofgren
7af07cef95 (feature) Add another doubleclick variant to the adtech trackers 2024-01-03 17:21:12 +01:00
Viktor Lofgren
41a540a629 (converter) Penalize chatgpt content farm spam 2024-01-03 17:04:38 +01:00
Viktor Lofgren
f599944942 (converter) Penalize chatgpt content farm spam 2024-01-03 16:51:26 +01:00
Viktor Lofgren
1e06aee6a2 (index) Adjust BM25 parameters 2024-01-03 16:30:46 +01:00
Viktor Lofgren
7bbaedef97 (search) Add query strategy requiring link 2024-01-03 16:23:00 +01:00
Viktor Lofgren
87048511fe (valuation) Tweaking penalties a bit 2024-01-03 16:02:25 +01:00
Viktor Lofgren
c770f0b68b (valuation) Tweaking penalties a bit 2024-01-03 15:59:21 +01:00
Viktor Lofgren
78c00ad512 (valuation) Tweaking penalties a bit 2024-01-03 15:52:57 +01:00
Viktor Lofgren
a19879d494 (valuation) Tweaking penalties a bit 2024-01-03 15:32:33 +01:00
Viktor Lofgren
ac1aca36b0 (valuation) Increase the penalty for adtech a bit 2024-01-03 15:20:38 +01:00
Viktor Lofgren
1f3b89cf28 (index) Reduce the value of site and site-adjacent in BM25P calculations 2024-01-03 15:20:18 +01:00
Viktor Lofgren
f732f6ae6f (index) Tweak result valuation renormalization 2024-01-03 14:53:53 +01:00
Viktor Lofgren
0b9f3d1751 (*) Remove accidental commit of debug logging 2024-01-03 14:32:00 +01:00
Viktor Lofgren
0806aa6dfe (language-processing) Add maximum length limit for text input in SentenceExtractor
Added a new constant, MAX_TEXT_LENGTH, to the SentenceExtractor class. If the length of the text input exceeds this limit, the text is truncated to fit within the limit. This modification is designed to prevent excessive resource usage for unusually long text inputs.
2024-01-03 14:27:47 +01:00
Viktor Lofgren
32436d099c (language-processing) Add maximum length limit for text input in SentenceExtractor
Added a new constant, MAX_TEXT_LENGTH, to the SentenceExtractor class. If the length of the text input exceeds this limit, the text is truncated to fit within the limit. This modification is designed to prevent excessive resource usage for unusually long text inputs.
2024-01-03 14:27:47 +01:00
Viktor Lofgren
4ce692ccaf (converter) Use SimpleBlockingThreadPool in ProcessingIterator 2024-01-03 14:27:47 +01:00
Viktor Lofgren
3caa4eed75 Merge branch 'master' into converter-optimizations 2024-01-02 17:13:25 +01:00
Viktor Lofgren
c70f508ae8 (prometheus) Saner histogram buckets 2024-01-02 17:13:14 +01:00
Viktor Lofgren
9e64d7aaf9 Merge branch 'master' into converter-optimizations 2024-01-02 15:46:24 +01:00
Viktor Lofgren
72b773f06d (search) fix search metrics labeling 2024-01-02 15:46:14 +01:00
Viktor Lofgren
5f978b865b Merge branch 'master' into converter-optimizations 2024-01-02 15:41:48 +01:00
Viktor Lofgren
57a4f92722 (api) fix missing metrics label in api service 2024-01-02 15:41:38 +01:00
Viktor Lofgren
87351e89ca Merge branch 'master' into converter-optimizations 2024-01-02 15:17:02 +01:00
Viktor Lofgren
192e356169 (prometheus) Add instrumentation to the api service 2024-01-02 15:12:44 +01:00
Viktor Lofgren
31232e49fb (prometheus) Add instrumentation to the search, qs and index services. 2024-01-02 15:02:29 +01:00
Viktor Lofgren
9d93a31755 Merge branch 'master' into converter-optimizations 2024-01-02 12:36:16 +01:00
Viktor Lofgren
9f7df59945 (sideload) Reduce quality assessment.
This will make these sideloaded results rank much better as there is a pretty harsh penalty for large low-q websites.
2024-01-02 12:35:59 +01:00
Viktor Lofgren
d2418521a7 (index) Further ranking adjustments 2024-01-02 12:35:59 +01:00
Viktor Lofgren
9330b5b1d9 (index) Adjust rank weightings to fix bad wikipedia results
There was as bug where if the input of ResultValuator.normalize() was negative, it was truncated to zero.  This meant that "bad" results always rank the same.  The penalty factor "overallPart" was moved outside of the function and was re-weighted to accomplish a better normalization.

Some of the weights were also re-adjusted based on what appears to produce better results.  Needs evaluation.
2024-01-02 12:35:44 +01:00
Viktor Lofgren
faa50bf578 (sideload) Just index based on first paragraph
This seems like it would make the wikipedia search result worse, but it drastically improves the result quality!

This is because wikipedia has a lot of articles that each talk about a lot of irrelevant concepts, and indexing the entire document means tangentially relevant results tend to displace the most relevant results.
2024-01-02 12:35:44 +01:00
Viktor Lofgren
f0d9618dfc (sideload) Reduce quality assessment.
This will make these sideloaded results rank much better as there is a pretty harsh penalty for large low-q websites.
2024-01-02 12:34:58 +01:00
Viktor Lofgren
310a880fa8 (index) Further ranking adjustments 2024-01-02 12:24:52 +01:00
Viktor Lofgren
fc6e3b6da0 (index) Further ranking adjustments 2024-01-01 18:51:03 +01:00
Viktor Lofgren
50771045d0 (index) Further ranking adjustments 2024-01-01 18:43:17 +01:00
Viktor Lofgren
8f522470ed (index) Adjust rank weightings to fix bad wikipedia results
There was as bug where if the input of ResultValuator.normalize() was negative, it was truncated to zero.  This meant that "bad" results always rank the same.  The penalty factor "overallPart" was moved outside of the function and was re-weighted to accomplish a better normalization.

Some of the weights were also re-adjusted based on what appears to produce better results.  Needs evaluation.
2024-01-01 17:16:29 +01:00
Viktor Lofgren
dc90c9ac65 (sideload) Just index based on first paragraph
This seems like it would make the wikipedia search result worse, but it drastically improves the result quality!

This is because wikipedia has a lot of articles that each talk about a lot of irrelevant concepts, and indexing the entire document means tangentially relevant results tend to displace the most relevant results.
2024-01-01 16:19:38 +01:00
Viktor Lofgren
e46e174b59 (keyword-extractor) Add another test for Name-extractor 2024-01-01 15:21:51 +01:00
Viktor Lofgren
7f3f3f577c (backup) Add task heartbeats to the backup service 2024-01-01 15:20:57 +01:00
Viktor Lofgren
75d87c73d1 (crawler) Disable Java's infinite DNS caching 2023-12-31 16:59:08 +01:00
Viktor Lofgren
0fe44c9bf2 (crawler) Fix broken test
A necessary step was accidentally deleted when cleaning up these tests previously.
2023-12-30 13:56:44 +01:00
Viktor Lofgren
7a1d20ed0a (converter) Better use of ProcessingIterator
Modify processingiterator to be constructed via a factory, to enable re-use of its backing executor service.

This reduces thread churn in the converter sideloader style processing of regular crawl data.
2023-12-30 13:53:55 +01:00