Viktor Lofgren
e5cee1f46d
(sideload) Fix sideloading so that it doesn't get disproportionately good rankings
...
Also add type flags so that e.g. wikipedia shows up in the wikis filter.
2023-11-12 14:57:57 +01:00
Viktor Lofgren
7617b4cbc2
(crawler) Fix NPE in crawler caused by not having fetched the domains list yet
2023-11-06 18:16:38 +01:00
Viktor Lofgren
e0c769fd19
(converter) Integrate atags.parquet with the encyclopedia sideloader
...
Also clean up stackexchange and dirtree a bit.
2023-11-06 18:03:01 +01:00
Viktor Lofgren
ebd10a5f28
(crawler) Integrate atags.parquet with the crawler so that "important" URLs are prioritized
2023-11-06 16:14:58 +01:00
Viktor Lofgren
2b77184281
(converter) Integrate atags with the topology field
2023-11-06 13:46:44 +01:00
Viktor Lofgren
1847845151
Revert "(loader) Optimize INSERT statements"
...
This reverts commit 7cb92195d1
.
2023-11-04 19:32:02 +01:00
Viktor Lofgren
7cb92195d1
(loader) Optimize INSERT statements
...
INSERT IGNORE is too slow.
2023-11-04 17:43:55 +01:00
Viktor Lofgren
0152004c42
Initial Commit Anchor Tags
...
* Added new (optional) model file in $WMSA_HOME/data/atags.parquet
* Converter gets a component for creating a projection of its domains onto the full atags parquet file
* New WordFlag ExternalLink
* These terms are also for now flagged as title words
* Fixed a bug where Title words aliased with UrlDomain words
* Fixed a bug in the encyclopedia sideloader that gave everything too high topology ranking
2023-11-04 14:24:17 +01:00
Viktor Lofgren
8f74dbdbb4
(crawler) Set more lenient parameters for recrawl
2023-10-30 11:35:30 +01:00
Viktor Lofgren
fd5a7eac87
(crawler) Exit crawler retriever on thread interrupted
2023-10-30 11:34:16 +01:00
Viktor Lofgren
f613f4f2df
(array) Fix spurious search results
...
This was caused by a bug in the binary search algorithm causing it to sometimes return positive values when encoding a search miss.
It was also necessary to get rid of the vestiges of the old LongArray and IntArray classes to make this fix doable.
2023-10-26 15:27:02 +02:00
Viktor Lofgren
a497e4c920
(crawler) Terminate crawler after a few hours of no progress
2023-10-26 12:49:28 +02:00
Viktor Lofgren
d7686b665e
Refactoring
...
* Encyclopedia sideloader; permit providing base URL.
* Storage base shows node id in GUI
* ProcessLivenessMonitorActor restarts automatically
* Clean-up of outbox code
2023-10-25 18:51:02 +02:00
Viktor Lofgren
313cc2965c
(index-creation) Print whether full or prio is created
...
Previous state of saying reverse index for both was pretty confusing.
2023-10-24 16:23:10 +02:00
Viktor Lofgren
e06a8c1de2
(converter) Put upper limit on number of worker threads.
2023-10-22 14:03:09 +02:00
Viktor Lofgren
1d75b974b5
(loader bugfix) Set DOMAIN_METADATA appropriately
2023-10-20 13:03:27 +02:00
Viktor Lofgren
7b5ec6b98f
(executor-service) Embed dist/ in executor-service's docker image
2023-10-19 17:48:34 +02:00
Viktor Lofgren
81dd3809e9
(*) WIP Add node affinity to EC_DOMAIN
...
Very messy commit due to fractalline yak shaving
2023-10-19 17:48:34 +02:00
Viktor Lofgren
16e0738731
(*) Get multi-node routing working.
2023-10-15 18:38:30 +02:00
Viktor Lofgren
4baf9527d7
(*) WIP Control GUI redesign, executor-service, multi-node mq
...
This turned out to be very difficult to do in small isolated steps.
* Design overhaul of the control gui using bootstrap
* Move the actors out of control-service into to a new executor-service, that can be run on multiple nodes
* Add node-affinity to message queue
2023-10-14 12:08:43 +02:00
Viktor Lofgren
199c459697
(*) Add node-affinity to services, processes and file storage.
2023-10-10 12:32:22 +02:00
Viktor Lofgren
8375237de5
(converter) Add special keyword for websites with a tilde url.
2023-10-09 17:02:32 +02:00
Viktor Lofgren
3889c4bdd9
(refactor) Remove features-search and update documentation
2023-10-09 15:12:30 +02:00
Viktor Lofgren
5dd55c7cad
(refactor) Rename satellite services to application services
...
This is a better descriptor, since they now all implement different applications on top of the core services' APIs.
2023-10-09 13:45:45 +02:00
Viktor Lofgren
c0e61d4c87
(refactor) Move search service into services-satellite
2023-10-09 13:40:01 +02:00
Viktor
8e1abc3f10
(index-reverse) Parallel construction of the reverse indexes. ( #52 )
...
* (index-reverse) Parallel construction of the reverse indexes.
* (array) Remove wasteful calculation of numDistinct before merging two sorted arrays.
* (index-reverse) Force changes to disk on close, reduce logging.
* (index-reverse) Clean up merging process and add back logging
* (run) Add a conservative default for INDEX_CONSTRUCTION_PROCESS_OPTS's parallelism as it eats a lot of RAM
* (index-reverse) Better logging during processing
* (array) 2GB+ compatible write() function
* (array) 2GB+ compatible write() function
* (index-reverse) We are logging like Bolsonaro and I will not have it.
* (reverse-index) Self-diagnostics
* (btree) Fix bug in btree reader to do with large data sizes
2023-10-07 10:00:00 +02:00
Viktor Lofgren
c51159672e
(build) Move unit test configuration to root build.gradle
2023-10-04 12:46:22 +02:00
Viktor Lofgren
e0cd3cd991
(converter) Alter StackexchangeSideloader's summary length to align with the rest of the system.
2023-09-26 12:19:43 +02:00
Viktor Lofgren
81ae501e73
(converter) Use ThreadLocalSentenceExtractorProvider for PlainText plugin as well
2023-09-25 18:28:34 +02:00
Viktor Lofgren
f797a92f87
(converter, minor) Use domain name in task heartbeat progress
2023-09-25 18:27:04 +02:00
Viktor Lofgren
a433bbbe45
(converter) Fix rare sentence extractor bug
...
It was caused by non-thread safe concurrent memory access in SentenceExtractor.
2023-09-24 19:39:48 +02:00
Viktor Lofgren
8ca20f184d
(keyword-extraction) Chasing my tail looking for a bug
2023-09-24 19:39:48 +02:00
Viktor Lofgren
dbe9235f3a
(*) Upgrade to JDK21 with preview enabled.
...
... also move some common configuration into the root build.gradle-file.
Support for JDK21 in lombok is a bit sketchy at the moment, but it seems to work. This upgrade is kind of important as the new index construction really benefits from Arena based lifecycle control over off-heap memory.
2023-09-24 10:38:59 +02:00
Viktor Lofgren
f809d22fc6
(loader) Support simultaneous loading of multiple processed data sets
2023-09-22 13:14:58 +02:00
Viktor Lofgren
ad660cf420
(converter) Bugfix: Don't try to Path.of() on optional field
2023-09-21 13:27:09 +02:00
Viktor Lofgren
70aa04c047
(converter, stackexchange-xml) Add the ability to sideload stackexchange data
2023-09-21 12:48:33 +02:00
Viktor Lofgren
d895f83520
(blocking-thread-pool) Move DumbThreadPool to its own micro-library
...
Also rename it to SimpleBlockingThreadPool.
2023-09-20 10:11:49 +02:00
Viktor Lofgren
f6b9e8c5eb
(converter) JavadocSpecialization should truncate its summary if it gets too long
2023-09-17 16:25:33 +02:00
Viktor Lofgren
98bcdf6028
(converter) DirtreeSideloader now trims /index.html from the URL if present
...
This is a crawler artifact in 9 cases out of 10, and may lead to bad URLs.
2023-09-17 16:08:16 +02:00
Viktor Lofgren
9b385ec7cc
(converter) Make it possible to sideload documents from a directory tree
2023-09-17 14:35:06 +02:00
Viktor Lofgren
5c040f7a46
(crawl-spec) Parquetify crawl spec
...
* Crawl-specs are now parquet files
* Deprecate the crawl-job-extractor tool
2023-09-17 09:41:34 +02:00
Viktor Lofgren
c67d95c00f
(converter) Write dummy processor log when sideloading
2023-09-14 14:13:03 +02:00
Viktor Lofgren
5e5aaf9a7e
(converter, control) Re-enable sideloading encyclopedia data
2023-09-14 12:12:07 +02:00
Viktor Lofgren
eaeb23d41e
(refactor) Remove converting-model package completely
2023-09-14 11:21:44 +02:00
Viktor Lofgren
c71f6ad417
(converter) Add heartbeats to the loader processes and execute the tasks in parallel for a ~2X speedup
2023-09-14 10:11:57 +02:00
Viktor Lofgren
4799dd769e
(converting) WIP begin to remove converting-model and the old InstructionsCompiler
2023-09-13 19:18:58 +02:00
Viktor Lofgren
24b4606f96
(converter,loader) Converter outputs parquet files instead of compressed json.
2023-09-13 16:13:41 +02:00
Viktor Lofgren
9e185e80ce
(control-service) Add timestamp to file storages.
2023-09-02 14:01:04 +02:00
Viktor Lofgren
5f427d2b4c
(keywords) Clean up leaky abstractions, clean up tests
2023-09-01 13:52:00 +02:00
Viktor Lofgren
8c0ce4fc1d
(index journal; minor) Clean up
2023-09-01 11:32:24 +02:00