Commit Graph

308 Commits

Author SHA1 Message Date
Viktor Lofgren
84cdac83d6 (control) Move message queue monitor to control 2023-10-24 16:44:28 +02:00
Viktor Lofgren
313cc2965c (index-creation) Print whether full or prio is created
Previous state of saying reverse index for both was pretty confusing.
2023-10-24 16:23:10 +02:00
Viktor Lofgren
95f74c5ea7 (control) Filter out heartbeats that are stopped 2023-10-24 16:09:28 +02:00
Viktor Lofgren
0406e76889 (api) Remove logging cruft 2023-10-24 13:39:05 +02:00
Viktor Lofgren
c2b28c0f8d (api) Trial streaming API 2023-10-24 13:26:46 +02:00
Viktor Lofgren
a860f8f1a8 (index/qs) GRPC API for better query performance 2023-10-24 11:38:07 +02:00
Viktor Lofgren
2ed2f35a9b (actor) Rewrite of the actor prototype class using record pattern matching 2023-10-23 10:18:20 +02:00
Viktor Lofgren
119151cad3 (converter) Separation of concerns 2023-10-22 14:35:33 +02:00
Viktor Lofgren
758f9b5aa5 (converter) Get UUID pips out of the models
Rendering concerns shouldn't be in the models, it's poor separation of concerns and very difficult to follow.
2023-10-22 14:24:52 +02:00
Viktor Lofgren
eb4158df0b (control) Fix start/stop FSM endpoints 2023-10-22 14:03:09 +02:00
Viktor Lofgren
12fda1a36b (control) Temporarily re-writing the data balancer to get it to work in prod
Need to clean this up later.
2023-10-22 14:03:09 +02:00
Viktor Lofgren
e927f99777 (control) JSON serializes Map<Integer> to Map<Double> and Java gets confused 2023-10-21 16:24:20 +02:00
Viktor Lofgren
044bcf55bd (control) Fix SQL in rebalance actor 2023-10-21 16:13:37 +02:00
Viktor Lofgren
e475af9f49 (control) Initialize controlActorService 2023-10-21 16:06:53 +02:00
Viktor Lofgren
c6abcd91fa (control) Better use of FS states, fix bug with start/stop actors 2023-10-20 16:37:49 +02:00
Viktor Lofgren
d76d926c38 (control/executor) Add new configuration options for node
It's now possible to configure prod instance to not retain processed data.
2023-10-20 14:05:19 +02:00
Viktor Lofgren
2b3c167845 (controller) Additional configuration options for node 2023-10-20 13:13:36 +02:00
Viktor Lofgren
584bb3a648 (fs) interface cleanup 2023-10-20 12:24:18 +02:00
Viktor Lofgren
7b5ec6b98f (executor-service) Embed dist/ in executor-service's docker image 2023-10-19 17:48:34 +02:00
Viktor Lofgren
23526f6d1a (executor) Executor service now pulls DomainType list for CRAWL on "recrawl"
This is an automatic integration with the submit-site repo on github and also
crawl-queue.
2023-10-19 17:48:34 +02:00
Viktor Lofgren
809b3ee023 (control) Update GUI for crawl specs. They are now less important than they were before. 2023-10-19 17:48:34 +02:00
Viktor Lofgren
23f0c79fba (control) GUI for data sets/domain types. 2023-10-19 17:48:34 +02:00
Viktor Lofgren
81dd3809e9 (*) WIP Add node affinity to EC_DOMAIN
Very messy commit due to fractalline yak shaving
2023-10-19 17:48:34 +02:00
Viktor Lofgren
978550f809 (executor-service) Retire features-convert and move the corresponding packages into the executor service. 2023-10-16 15:43:46 +02:00
Viktor Lofgren
84fea0fd05 (node) Nodes auto-start their monitor actors. 2023-10-16 15:33:22 +02:00
Viktor Lofgren
2df3e0f881 (node) Nodes auto-configure on start-up instead of requiring manual configuration. 2023-10-16 14:46:35 +02:00
Viktor Lofgren
ede5d1f890 (actor) Give process spawners more easily recognizable names. 2023-10-16 14:19:00 +02:00
Viktor Lofgren
39911e3acd (control) Fix incorrect storage base and clean up GUI for data 2023-10-16 13:30:26 +02:00
Viktor Lofgren
8dafd13cd7 (client) Fix executor tests 2023-10-16 12:02:57 +02:00
Viktor Lofgren
c245f7ce3a (control) Bootstrapify review-domains and search-to-ban views. 2023-10-15 22:04:23 +02:00
Viktor Lofgren
607d647483 (control) Remove services listing view 2023-10-15 21:48:55 +02:00
Viktor Lofgren
9a38a455c9 (control/exec) File listings in control GUI 2023-10-15 19:15:44 +02:00
Viktor Lofgren
16e0738731 (*) Get multi-node routing working. 2023-10-15 18:38:30 +02:00
Viktor Lofgren
eacbf87979 (control) New list and form for index nodes. 2023-10-14 21:46:52 +02:00
Viktor Lofgren
108b4cb648 (service) Keep disabled multi-noded services dormant when they are configured to be disabled. 2023-10-14 20:58:55 +02:00
Viktor Lofgren
6308a8dfcd (control) Node configuration 2023-10-14 16:47:52 +02:00
Viktor Lofgren
4baf9527d7 (*) WIP Control GUI redesign, executor-service, multi-node mq
This turned out to be very difficult to do in small isolated steps.

* Design overhaul of the control gui using bootstrap
* Move the actors out of control-service into a new executor-service, that can be run on multiple nodes
* Add node-affinity to message queue
2023-10-14 12:08:43 +02:00
Viktor Lofgren
199c459697 (*) Add node-affinity to services, processes and file storage. 2023-10-10 12:32:22 +02:00
Viktor Lofgren
61288c5e68 (service, client) First steps towards multiple nodedness 2023-10-09 22:13:27 +02:00
Viktor Lofgren
6319b8ef51 (api-service) Improved testability, always set content type to application/json 2023-10-09 15:39:34 +02:00
Viktor Lofgren
397a85eaa4 (query-service) Apply blacklisting to search results 2023-10-09 15:18:53 +02:00
Viktor Lofgren
3889c4bdd9 (refactor) Remove features-search and update documentation 2023-10-09 15:12:30 +02:00
Viktor Lofgren
c899f1cb85 (docs) Update documentation to reflect new query service 2023-10-09 14:56:59 +02:00
Viktor Lofgren
d8956c51d0 (refactor) Remove api:search-api
Application services should not have an API, but purely act as clients
to the core services (which should always have an API).
2023-10-09 14:42:33 +02:00
Viktor Lofgren
c0e61d4c87 (refactor) Move search service into services-satellite 2023-10-09 13:40:01 +02:00
Viktor Lofgren
97e17282ab (query-service) Move query parsing from search-service to the new query service. 2023-10-09 13:27:44 +02:00
Viktor Lofgren
94c882af7d (query-service) Provide delegate of IndexApi's query functionality.
This is an intermediate step in the process of introducing the query-service as a proxy between search and index.
2023-10-08 22:22:26 +02:00
Viktor Lofgren
89c6d85f2f (query-service) Create new empty 'query-service' service 2023-10-08 17:31:50 +02:00
Viktor Lofgren
cf366c602f (search) Refactor SearchQueryIndexService in preparation for feature extraction.
Prefer working on DecoratedSearchResultItem in favor of UrlDetails.
2023-10-08 17:15:41 +02:00
Viktor Lofgren
77ccab7d80 (index) Move linkdb to index from search.
This makes index complete in the sense that you can deploy an index instance and build a complete separate application on top of it, without having to go through the Marginalia-laden search service.
2023-10-08 16:48:35 +02:00
Viktor Lofgren
f51ba63742 (search) Remove dead file 2023-10-07 21:05:06 +02:00
Viktor Lofgren
9044518be5 (search) Fix broken link to git repo 2023-10-07 19:43:22 +02:00
Viktor Lofgren
9e0367eef4 (search) Filter blacklisted items in API query service as well 2023-10-07 16:16:04 +02:00
Viktor Lofgren
235bb6c1b9 (control) Administrative QOL improvement, GUI for banning spam 2023-10-07 15:45:50 +02:00
Viktor Lofgren
49344d7ea8 (control) Administrative QOL improvement, GUI for banning spam 2023-10-07 15:43:18 +02:00
Viktor Lofgren
1b418d77ff (search) We got some new IP ranges to work with for the crawler 2023-10-07 13:41:55 +02:00
Viktor Lofgren
80cc302627 (search) We can't claim to be on PC hardware anymore... 2023-10-07 11:49:29 +02:00
Viktor
8e1abc3f10
(index-reverse) Parallel construction of the reverse indexes. (#52)
* (index-reverse) Parallel construction of the reverse indexes.

* (array) Remove wasteful calculation of numDistinct before merging two sorted arrays.

* (index-reverse)  Force changes to disk on close, reduce logging.

* (index-reverse)  Clean up merging process and add back logging

* (run)  Add a conservative default for INDEX_CONSTRUCTION_PROCESS_OPTS's parallelism as it eats a lot of RAM

* (index-reverse)  Better logging during processing

* (array) 2GB+ compatible write() function

* (array) 2GB+ compatible write() function

* (index-reverse) We are logging like Bolsonaro and I will not have it.

* (reverse-index) Self-diagnostics

* (btree) Fix bug in btree reader to do with large data sizes
2023-10-07 10:00:00 +02:00
Viktor Lofgren
c51159672e (build) Move unit test configuration to root build.gradle 2023-10-04 12:46:22 +02:00
Viktor Lofgren
405300b4b2 (control) Fix bug where finishing one process ad hoc task would remove all other tasks from the db 2023-10-04 11:44:31 +02:00
Viktor Lofgren
40768e935b (test) Removing /tmp-guardrails as it doesn't hold in CI 2023-10-02 16:52:59 +02:00
Viktor Lofgren
d160954080 (index) Two useful debug endpoints 2023-09-24 19:39:48 +02:00
Viktor Lofgren
14372e0ef0 (index) Slightly reduce alloc churn 2023-09-24 19:36:14 +02:00
Viktor Lofgren
03bffa27ac (search) Add combined id to the search result HTML 2023-09-24 19:34:35 +02:00
Viktor Lofgren
028b5a4f0d (minor performance) Reduce GC churn in index 2023-09-24 12:12:08 +02:00
Viktor Lofgren
1bd146fb8e (minor) Remove dead code 2023-09-24 10:55:20 +02:00
Viktor Lofgren
5f6c3da7a4 (index) Add close methods on the index readers so they clean up their mmaps 2023-09-24 10:54:23 +02:00
Viktor Lofgren
dbe9235f3a (*) Upgrade to JDK21 with preview enabled.
... also move some common configuration into the root build.gradle-file.

Support for JDK21 in lombok is a bit sketchy at the moment, but it seems to work.  This upgrade is kind of important as the new index construction really benefits from Arena based lifecycle control over off-heap memory.
2023-09-24 10:38:59 +02:00
Viktor Lofgren
d78569986b (backups) Fix bug where backup service would zero the linkdb when restoring. 2023-09-22 18:34:34 +02:00
Viktor Lofgren
95323e6caa (backups) Support restore multi-source load data 2023-09-22 18:34:17 +02:00
Viktor Lofgren
f809d22fc6 (loader) Support simultaneous loading of multiple processed data sets 2023-09-22 13:14:58 +02:00
Viktor Lofgren
70aa04c047 (converter, stackexchange-xml) Add the ability to sideload stackexchange data 2023-09-21 12:48:33 +02:00
Viktor Lofgren
f8050816ac (search) Don't run LSH deduplication on details with zero lsh to support not calculating this hash. 2023-09-21 12:47:02 +02:00
Viktor Lofgren
9b385ec7cc (converter) Make it possible to sideload documents from a directory tree 2023-09-17 14:35:06 +02:00
Viktor Lofgren
5c040f7a46 (crawl-spec) Parquetify crawl spec
* Crawl-specs are now parquet files
* Deprecate the crawl-job-extractor tool
2023-09-17 09:41:34 +02:00
Viktor Lofgren
5e5aaf9a7e (converter, control) Re-enable sideloading encyclopedia data 2023-09-14 12:12:07 +02:00
Viktor Lofgren
07d7507ac6 (control-service) Move Actions up in storage-details
Papercut fix. If a file storage area has a lot of files, you have to scroll down a long way to get to the actions otherwise.
2023-09-02 15:41:55 +02:00
Viktor Lofgren
9e185e80ce (control-service) Add timestamp to file storages. 2023-09-02 14:01:04 +02:00
Viktor Lofgren
d31d8ec5b0 (index) Log keyword ids in hex format 2023-09-01 15:40:24 +02:00
Viktor Lofgren
2b00cd632d (process) Propagate environment JVM params to the index constructor 2023-09-01 15:39:42 +02:00
Viktor Lofgren
764e7d1315 (index) Add more comprehensive integration tests for the index service. 2023-08-30 10:37:24 +02:00
Viktor Lofgren
e4d7958379 (control) ProcessLivenessMonitorActor shouldn't reap tasks based on service instance liveness 2023-08-29 18:19:04 +02:00
Viktor Lofgren
3f288e264b (minor) Clean up dead endpoints 2023-08-29 17:04:54 +02:00
Viktor Lofgren
dd593c292c (loader) Minor optimizations and bugfixes.
* Reduce memory churn in LoaderIndexJournalWriter, fix bug with keyword mappings as well
* Remove remains of OldDomains
* Ensure LOADER_PROCESS_OPTS gets fed to the processes
* LinkdbStatusWriter won't execute batch after each added item post 100 items
2023-08-29 15:37:52 +02:00
Viktor Lofgren
39c1857c61 (heartbeat, reverse-index) Better heartbeat mocking, improved heartbeats for reverse index construction. 2023-08-29 13:07:55 +02:00
Viktor Lofgren
c57a2d0dc3 (control-service) Remove old index journal files when restoring a backup. 2023-08-29 11:58:01 +02:00
Viktor Lofgren
6525b16e1f (minor) Improved logging and error messages 2023-08-28 19:53:55 +02:00
Viktor Lofgren
b6a92506d1 (index) Hook in missing DocIdRewriter
This enables documents to be ranked properly.
2023-08-28 19:53:43 +02:00
Viktor Lofgren
3101b74580 (index) Move to a lexicon-free index design
This is a system-wide change.  The index used to have a lexicon, mapping words to wordIds using a large in-memory hash table.   This made index-construction easier, but it
also added a fairly significant RAM penalty to both the index service and the loader.

The new design moves to 64 bit word identifiers calculated using the murmur hash of the keyword, and an index construction based on merging smaller indices.

It also became necessary half-way through to upgrade guice as its error reporting wasn't *quite* compatible with JDK20.
2023-08-28 14:02:23 +02:00
Viktor Lofgren
194a6057dd (index,control) Recoverable index backups 2023-08-25 14:57:43 +02:00
Viktor Lofgren
e710e057e2 (db) Remove EC_URL and EC_PAGE_DATA from mariadb database 2023-08-25 13:45:03 +02:00
Viktor Lofgren
28188a6e59 (control) Simplify ConvertAndLoadActor 2023-08-25 13:30:20 +02:00
Viktor Lofgren
70a5df96c8 (control) Display progress of process tasks 2023-08-25 13:05:21 +02:00
Viktor Lofgren
460998d512 (index) Move index construction to separate process.
This provides a much cleaner separation of concerns, and makes it possible to get rid of a lot of the gunkier parts of the index service.  It will also permit lowering the Xmx on the index service a fair bit, so we can get CompressedOOps again :D
2023-08-25 12:52:54 +02:00
Viktor Lofgren
e741301417 (search) Remove endpoint flush-search-caches
It's not necessary anymore with the new linkdb.
2023-08-25 09:51:06 +02:00
Viktor Lofgren
5ed5298409 (converter) Update confusing state description
SWAP_LEXICON doesn't instruct the index service to do anything.  It just moves the file.
2023-08-24 18:56:49 +02:00
Viktor Lofgren
b911665691 (index) Clean up and optimize valuator 2023-08-24 18:34:06 +02:00
Viktor Lofgren
56eb83319d (index) Clean up result domain deduplicator 2023-08-24 18:24:55 +02:00
Viktor Lofgren
1e6800565a (system) Remove EdgeId<T> and similar objects
They seemed like a good idea at the time, but in practice they're wasting resources and not really providing the clarity I had hoped.
2023-08-24 17:46:02 +02:00
Viktor Lofgren
c909120ae1 (search) Basic working integration of linkdb in search service 2023-08-24 17:24:56 +02:00
Viktor Lofgren
9894f37412 (index) Implement new URL ID coding scheme.
Also refactor along the way.  Really needs an additional pass, these tests are very hairy.
2023-08-24 16:44:27 +02:00
Viktor Lofgren
ebc84c22fb Upgrade antique lombok plugin
This permits tests to run on JDK20 environments.
2023-08-23 14:34:32 +00:00
Viktor Lofgren
aa0d256d6a Upgrade code to Java 20.
* Change language version
* Upgrade Lombok to a JDK20 compatible version
2023-08-23 13:37:49 +00:00
Viktor Lofgren
4d75fa2908 Upgrade gradle and docker plugin to support native JDK20 environments 2023-08-23 13:30:55 +00:00
Viktor Lofgren
6f222b9800 (search) Add refresh link to explore mode.
This is a QOL improvement for mobile users, who otherwise would have to scroll all the way up to refresh.

Also removed the confusing "this is a random set of domains"-message when viewing adjacent websites, as it's not random.
2023-08-22 12:43:44 +02:00
Viktor Lofgren
c7f0276005 (control) Don't spin on process output printing
This is the "correct" way of copying stdout and stderr to the current process' output.
2023-08-22 11:48:54 +02:00
Viktor Lofgren
46df58d28b (control-service) Use default value for WMSA_HOME if it is not set 2023-08-22 11:11:01 +02:00
Viktor Lofgren
15912f31d0 (control-service) Basic GUI for deleting bad links from exploration mode 2023-08-21 18:35:26 +02:00
Viktor Lofgren
93f49f1fb3 (search-service) RSS feed for the news feed 2023-08-20 12:58:34 +02:00
Viktor Lofgren
704de50a9b (forward-index, valuator) HTML features in valuator
Put it in the forward index for easy access during index-side valuation.
2023-08-18 11:54:56 +02:00
Viktor Lofgren
efee904531 (search) Use the adtech bit instead of ads for ads flag 2023-08-18 11:24:59 +02:00
Viktor Lofgren
46d761f34f (language) fasttext based language filter 2023-08-16 15:48:12 +02:00
Viktor Lofgren
4598c7f40f (valuation) Penalize wordpress style kebab case urls 2023-08-16 13:11:24 +02:00
Viktor Lofgren
606db54dc8 (docs) Fix dead links to message-queue after moving it to libraries 2023-08-15 19:26:40 +02:00
Viktor Lofgren
df85468c01 (control) Action for refreshing the blogs definition. 2023-08-15 11:38:52 +02:00
Viktor Lofgren
e7192a9cad (mq) Refactor mq and actor library and move it to libraries out of common 2023-08-15 10:53:23 +02:00
Viktor Lofgren
019b61b330 (control) Remove message queue listing from actors view. 2023-08-13 13:50:04 +02:00
Viktor Lofgren
f997707049 (control) Move event log out of plumbing 2023-08-13 13:40:50 +02:00
Viktor Lofgren
c56ee10185 (control) Separate [Process] and [Process and Load] actions for crawl data; all SLOW data is deletable. 2023-08-13 13:39:59 +02:00
Viktor Lofgren
8210e49b4e (control) Helpful tooltips for the Actor table. 2023-08-13 12:55:56 +02:00
Viktor Lofgren
a8f2e9ee2c (control) Tidy up empty tables, remove actors from index view 2023-08-12 15:18:14 +02:00
Viktor Lofgren
a91b909103 (control) Event log on stop actor 2023-08-12 15:02:53 +02:00
Viktor Lofgren
99e031c529 (control) Remove broken pagination from events and message queue; new "light" events table for some views 2023-08-12 14:57:55 +02:00
Viktor Lofgren
998f239ed9 (control) Filterable event log view 2023-08-12 14:43:11 +02:00
Viktor Lofgren
0961f627b1 (control) Pretty up the nav bar 2023-08-12 14:42:42 +02:00
Viktor Lofgren
4f8048be31 (blacklist) Blacklist management 2023-08-10 15:40:07 +02:00
Viktor Lofgren
ce293029c7 (converter) Treat adtech tracking as advertisement. 2023-08-09 14:23:53 +02:00
Viktor Lofgren
251fc63b42 (*) Fix merge gore 2023-08-09 13:33:28 +02:00
Viktor Lofgren
47f3855a4b (control) More informative readme.md 2023-08-09 12:42:23 +02:00
Viktor Lofgren
71dfe9f33e (control) Clean up the ControlService, move mq-related endpoints to MessageQueueService. 2023-08-09 12:42:01 +02:00
Viktor Lofgren
4ab1cd9502 (*) last touches 2023-08-07 12:57:44 +02:00
Viktor Lofgren
be444f9172 (control) New actions view, re-arrange navigation menu 2023-08-05 14:45:04 +02:00
Viktor Lofgren
bf37a3eb25 (search-service) Make flushCaches endpoint a notice and not a request 2023-08-05 14:45:04 +02:00
Viktor Lofgren
00eb8b90dc (control) Message Queue GUI 2023-08-04 22:05:29 +02:00
Viktor Lofgren
912129311d (control) Message Queue GUI 2023-08-04 17:54:18 +02:00
Viktor Lofgren
624b78ec3a (heartbeat) Task heartbeats 2023-08-04 14:40:06 +02:00
Viktor Lofgren
1d0cea1d55 (converter) GUI for dealing with user complaints 2023-08-03 17:59:57 +02:00
Viktor Lofgren
f01f608474 (blacklist) Support blacklists with subdomain 2023-08-03 17:58:52 +02:00
Viktor Lofgren
63e857f7cd (control) Add basic api key management 2023-08-02 20:14:03 +02:00
Viktor Lofgren
9979c9defe (search/index) Add blogosphere filter 2023-08-02 20:13:30 +02:00
Viktor Lofgren
8de3e6ab80 (control) Fix bug where CrawlActor and RecrawlActor would steal each others' mail 2023-08-01 22:33:30 +02:00
Viktor Lofgren
867410c66b (file-storage) Automatic file storage discovery via manifest file 2023-08-01 18:05:43 +02:00
Viktor Lofgren
36a23707c1 (control) Control service should be a core service. 2023-08-01 15:49:50 +02:00
Viktor Lofgren
e22e65eee4 (index) Fix bug related to debug print statements 2023-07-22 14:33:58 +02:00
Viktor Lofgren
d7ab21fe34 (*) Refactor Control Service and processes 2023-07-17 21:20:31 +02:00
Viktor Lofgren
8b74e3aa0d (*) File Storage WIP 2023-07-14 17:08:10 +02:00
Viktor Lofgren
88b9ec70c6 (control, WIP) Run reconvert-load from converter :D 2023-07-11 18:05:37 +02:00
Viktor
cbbf60a599 Better fingerprinting (#35)
* Better fingerprinting for server tech
* Many more features in FeatureExtractor
* Blog specialization
* SiteType table
2023-07-10 18:58:43 +02:00
Viktor Lofgren
96eecc6ea5 Minor: Readability. 2023-07-10 18:58:43 +02:00
Viktor Lofgren
d9e6c4f266 Trial integration of MQ-FSM into index service. 2023-07-06 18:04:16 +02:00