Commit Graph

1049 Commits

Author SHA1 Message Date
Viktor Lofgren
460998d512 (index) Move index construction to separate process.
This provides a much cleaner separation of concerns, and makes it possible to get rid of a lot of the gunkier parts of the index service.  It will also permit lowering the Xmx on the index service a fair bit, so we can get CompressedOOps again :D
2023-08-25 12:52:54 +02:00
Viktor Lofgren
e741301417 (search) Remove endpoint flush-search-caches
It's not necessary anymore with the new linkdb.
2023-08-25 09:51:06 +02:00
Viktor Lofgren
5ed5298409 (converter) Update confusing state description
SWAP_LEXICON doesn't instruct the index service to do anything.  It just moves the file.
2023-08-24 18:56:49 +02:00
Viktor Lofgren
b911665691 (index) Clean up and optimize valuator 2023-08-24 18:34:06 +02:00
Viktor Lofgren
56eb83319d (index) Clean up result domain deduplicator 2023-08-24 18:24:55 +02:00
Viktor Lofgren
1e6800565a (system) Remove EdgeId<T> and similar objects
They seemed like a good idea at the time, but in practice they're wasting resources and not really providing the clarity I had hoped.
2023-08-24 17:46:02 +02:00
Viktor Lofgren
c909120ae1 (search) Basic working integration of linkdb in search service 2023-08-24 17:24:56 +02:00
Viktor Lofgren
9894f37412 (index) Implement new URL ID coding scheme.
Also refactor along the way.  Really needs an additional pass, these tests are very hairy.
2023-08-24 16:44:27 +02:00
Viktor Lofgren
6a04cdfddf (loader) Implement new linkdb in loader
Deprecate the LoadUrl instruction entirely. We no longer need to be told upfront about which URLs to expect, as IDs are generated from the domain id and document ordinal.

For now, we no longer store new URLs in different domains.  We need to re-implement this somehow, probably in a different job or as a different output.
2023-08-24 13:07:54 +02:00
Viktor Lofgren
c70670bacb (common) New UrlIdCodec class
Have a single class responsible for encoding and decoding URL ids, as it's a bit finicky and used all over.
2023-08-24 11:41:07 +02:00
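
To illustrate the idea in the commits above (IDs derived from the domain id and document ordinal, with one class owning the encoding), here is a minimal sketch. The class name UrlIdSketch, the method names, and the 26-bit ordinal width are assumptions for illustration only; the log does not state the actual bit layout used by UrlIdCodec.

```java
// Hypothetical sketch: NOT the project's UrlIdCodec. The bit layout is assumed.
final class UrlIdSketch {
    // Assumption: the low 26 bits hold the document ordinal,
    // and the bits above them hold the domain id.
    private static final int ORDINAL_BITS = 26;
    private static final long ORDINAL_MASK = (1L << ORDINAL_BITS) - 1;

    private UrlIdSketch() {}

    /** Packs a domain id and a per-domain document ordinal into one long. */
    static long encode(int domainId, int documentOrdinal) {
        if (domainId < 0) {
            throw new IllegalArgumentException("negative domain id");
        }
        if (documentOrdinal < 0 || documentOrdinal > ORDINAL_MASK) {
            throw new IllegalArgumentException("ordinal out of range");
        }
        return ((long) domainId << ORDINAL_BITS) | documentOrdinal;
    }

    /** Extracts the domain id from an encoded id. */
    static int domainId(long encoded) {
        return (int) (encoded >>> ORDINAL_BITS);
    }

    /** Extracts the document ordinal from an encoded id. */
    static int documentOrdinal(long encoded) {
        return (int) (encoded & ORDINAL_MASK);
    }

    public static void main(String[] args) {
        long id = encode(1234, 56);
        System.out.println(domainId(id) + " " + documentOrdinal(id));
    }
}
```

Keeping the shifts and masks in one place means callers never need to know the layout, which is presumably the motivation for a single codec class.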
Viktor Lofgren
7bb3e44a76 (common) Deprecate EdgeId and similar 2023-08-24 11:16:28 +02:00
Viktor Lofgren
b958acb76a (file-storage) New File Storage type for linkdb 2023-08-24 09:06:13 +02:00
Viktor Lofgren
b22f4fbb72 (linkdb) New Module for sqlite-backed document db 2023-08-24 09:06:13 +02:00
Viktor Lofgren
e8c0648e04 Fix missing vol/ss dir in setup.sh 2023-08-23 17:59:40 +02:00
Viktor Lofgren
ebc84c22fb Upgrade antique lombok plugin
This permits tests to run on JDK20 environments.
2023-08-23 14:34:32 +00:00
Viktor Lofgren
8bd9a00c38 Amend setup instructions with command 2023-08-23 14:02:21 +00:00
Viktor Lofgren
972d03efdf Fix error in run/readme where it suggested local dev environment uses HTTPS 2023-08-23 13:47:39 +00:00
Viktor Lofgren
aa0d256d6a Upgrade code to Java 20.
* Change language version
* Upgrade Lombok to a JDK20 compatible version
2023-08-23 13:37:49 +00:00
Viktor Lofgren
4d75fa2908 Upgrade gradle and docker plugin to support native JDK20 environments 2023-08-23 13:30:55 +00:00
Viktor Lofgren
1a05cba60a (keyword lexicon) Use three hash tables to increase the possible number of keywords to 2^31 from 0.75 x 2^30. 2023-08-23 11:25:20 +02:00
Viktor Lofgren
bf92c270dc (language) Rollback language filter change a bit.
It appears to lead to too much junk in the lexicon.
2023-08-23 10:16:57 +02:00
Viktor Lofgren
e507844616 (language) Rollback language filter change a bit.
It appears to lead to too much junk in the lexicon.
2023-08-23 10:03:25 +02:00
Viktor Lofgren
ca12dd59f7 (loader) Fix Cleaner resource leak
Apparently Cleaners have an associated native thread, so the way to use them is to have a single static cleaner.
2023-08-22 18:05:00 +02:00
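
The single-static-cleaner pattern mentioned in the commit above can be sketched with java.lang.ref.Cleaner as follows. Cleaner.create() starts a dedicated daemon thread, so creating one Cleaner per object leaks threads. The class NativeBufferSketch and the AtomicBoolean standing in for a real resource are hypothetical and are not taken from the loader code.

```java
import java.lang.ref.Cleaner;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch: the AtomicBoolean stands in for a real native resource.
final class NativeBufferSketch implements AutoCloseable {
    // One Cleaner, and therefore one cleaner thread, shared by every instance.
    private static final Cleaner CLEANER = Cleaner.create();

    // The cleanup action is a static nested class so that it cannot
    // capture the owning object and keep it reachable.
    private static final class ReleaseAction implements Runnable {
        private final AtomicBoolean released;

        ReleaseAction(AtomicBoolean released) {
            this.released = released;
        }

        @Override
        public void run() {
            released.set(true); // a real action would free the resource here
        }
    }

    private final AtomicBoolean released = new AtomicBoolean(false);
    private final Cleaner.Cleanable cleanable;

    NativeBufferSketch() {
        this.cleanable = CLEANER.register(this, new ReleaseAction(released));
    }

    boolean isReleased() {
        return released.get();
    }

    @Override
    public void close() {
        cleanable.clean(); // runs the action at most once
    }

    public static void main(String[] args) {
        try (NativeBufferSketch buffer = new NativeBufferSketch()) {
            System.out.println("released inside block: " + buffer.isReleased());
        }
    }
}
```

Explicit close() remains the primary release path here; the Cleaner only acts as a safety net if the object becomes unreachable without being closed.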
Viktor Lofgren
6f222b9800 (search) Add refresh link to explore mode.
This is a QOL improvement for mobile users, who otherwise would have to scroll all the way up to refresh.

Also removed the confusing "this is a random set of domains"-message when viewing adjacent websites, as it's not random.
2023-08-22 12:43:44 +02:00
Viktor Lofgren
fca62f261e (mq) Down-tune polling intervals in MQ
Polling 10 times a second across dozens of queues is a bit too aggressive and wasteful.
2023-08-22 11:49:30 +02:00
Viktor Lofgren
c7f0276005 (control) Don't spin on process output printing
This is the "correct" way of copying stdout and stderr to the current process's output.
2023-08-22 11:48:54 +02:00
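
The log does not show what the commit above changed, so the following is only one standard-library way to forward a child process's output without a polling loop: a helper thread blocks in InputStream.transferTo until the stream ends. The class OutputPumpSketch and its pump method are hypothetical names. ProcessBuilder.inheritIO() is another option when no processing of the output is needed.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical sketch: not the control service's actual implementation.
final class OutputPumpSketch {
    private OutputPumpSketch() {}

    /**
     * Starts a daemon thread that copies everything from 'from' to 'to'.
     * transferTo blocks while waiting for data instead of busy-waiting,
     * and returns once the source reaches end of stream.
     */
    static Thread pump(InputStream from, OutputStream to) {
        Thread thread = new Thread(() -> {
            try {
                from.transferTo(to);
                to.flush();
            } catch (IOException e) {
                // The source or sink was closed; stop copying.
            }
        }, "output-pump");
        thread.setDaemon(true);
        thread.start();
        return thread;
    }

    public static void main(String[] args) throws Exception {
        // Assumes a 'java' executable is on the PATH.
        Process process = new ProcessBuilder("java", "-version").start();
        Thread out = pump(process.getInputStream(), System.out);
        Thread err = pump(process.getErrorStream(), System.err);
        process.waitFor();
        out.join();
        err.join();
    }
}
```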
Viktor Lofgren
46409c4c2d (loader) Use the correct interface for InstructionCounter 2023-08-22 11:11:36 +02:00
Viktor Lofgren
46df58d28b (control-service) Use default value for WMSA_HOME if it is not set 2023-08-22 11:11:01 +02:00
Viktor Lofgren
15912f31d0 (control-service) Basic GUI for deleting bad links from exploration mode 2023-08-21 18:35:26 +02:00
Viktor
dd380a5fb3 (doc) Add control-service to conceptual overview
Not adding every interaction as it would turn into a rat king.
2023-08-20 13:28:32 +02:00
Viktor Lofgren
93f49f1fb3 (search-service) RSS feed for the news feed 2023-08-20 12:58:34 +02:00
Viktor Lofgren
b83bb5a48a (docker) Upgrade to jdk20 image to fix weird mojibake problems.
Super weird encoding bug that only arises on versions below jdk18 causing crawl data to be read incorrectly.

Seems possibly related to the new standard charset of UTF-8. Maybe some library (unknown which) is attempting to be backwards compatible in a way that totally breaks?
2023-08-19 10:58:47 +02:00
Viktor Lofgren
704de50a9b (forward-index, valuator) HTML features in valuator
Put it in the forward index for easy access during index-side valuation.
2023-08-18 11:54:56 +02:00
Viktor Lofgren
fcfe07fb7d (valuator) Clean up code 2023-08-18 11:26:56 +02:00
Viktor Lofgren
ccf4990add (minor) Clean up code 2023-08-18 11:26:39 +02:00
Viktor Lofgren
f2638dd845 (feature-extractor) More adtech nonsense 2023-08-18 11:26:19 +02:00
Viktor Lofgren
239980ecae (minor) Improve comment 2023-08-18 11:26:05 +02:00
Viktor Lofgren
6cb784df75 (minor) Improve comment 2023-08-18 11:25:36 +02:00
Viktor Lofgren
efee904531 (search) Use the adtech bit instead of ads for ads flag 2023-08-18 11:24:59 +02:00
Viktor Lofgren
bee815b1c4 (converter) Add monsterinsights as an adtech tracker 2023-08-17 17:44:11 +02:00
Viktor Lofgren
e296b02649 (converter) Optimize LSH based within-domain deduplication 2023-08-17 17:43:46 +02:00
Viktor Lofgren
2656fcfe2c (conf) Remove unnecessary JVM flags for processes 2023-08-17 17:42:47 +02:00
Viktor Lofgren
c019a029ec (flags) Documentation and preventative bugfix 2023-08-17 17:42:31 +02:00
Viktor Lofgren
db0216936e (summary) Reduce the chance of expensive operations 2023-08-16 15:48:34 +02:00
Viktor Lofgren
46d761f34f (language) fasttext based language filter 2023-08-16 15:48:12 +02:00
Viktor Lofgren
4598c7f40f (valuation) Penalize wordpress style kebab case urls 2023-08-16 13:11:24 +02:00
Viktor Lofgren
1d486bddee (crawler) Reduce log spam 2023-08-16 11:12:09 +02:00
Viktor Lofgren
606db54dc8 (docs) Fix dead links to message-queue after moving it to libraries 2023-08-15 19:26:40 +02:00
Viktor Lofgren
d8073f0dde (feature-extractor) Add mail.ru counter to non-adtech trackers 2023-08-15 19:10:43 +02:00
Viktor Lofgren
df85468c01 (control) Action for refreshing the blogs definition. 2023-08-15 11:38:52 +02:00