Viktor Lofgren
f809d22fc6
(loader) Support simultaneous loading of multiple processed data sets
2023-09-22 13:14:58 +02:00
Viktor
763d61db8d
Create Additional Contributors.md
2023-09-21 15:38:19 +02:00
Viktor Lofgren
10cad3abb2
(dating) Implementing @samstorment's fantastic design polish
2023-09-21 15:19:50 +02:00
Viktor Lofgren
9338f35cd8
(doc) Remove confusingly outdated ER-diagrams
2023-09-21 15:08:27 +02:00
Viktor Lofgren
ead6fa9daa
(doc) Update conceptual-overview.svg to reflect the removal of the lexicon
2023-09-21 13:47:05 +02:00
Viktor Lofgren
ad660cf420
(converter) Bugfix: Don't try to Path.of() on optional field
2023-09-21 13:27:09 +02:00
Viktor Lofgren
75f8ae2815
(file-storage) Use human-readable timestamps in the names of file storage directories
2023-09-21 13:22:53 +02:00
Viktor Lofgren
70aa04c047
(converter, stackexchange-xml) Add the ability to sideload stackexchange data
2023-09-21 12:48:33 +02:00
Viktor Lofgren
4aa47e87f2
(blocking-thread-pool) Add isTerminated convenience function
2023-09-21 12:47:41 +02:00
Viktor Lofgren
f8050816ac
(search) Don't run LSH deduplication on details with zero lsh to support not calculating this hash.
2023-09-21 12:47:02 +02:00
Viktor Lofgren
5b0a6d7ec1
(stackexchange-converter) Create tool for converting stackexchange 7z-files to digestible sqlite db:s
2023-09-20 15:15:13 +02:00
Viktor Lofgren
3b4d08f52b
(stackexchange-integration) Add better comments
2023-09-20 14:43:06 +02:00
Viktor Lofgren
6bbf40d7d2
(stackexchange-integration) Tools for reading stackexchange xml files
2023-09-20 14:17:33 +02:00
Viktor Lofgren
d895f83520
(blocking-thread-pool) Move DumbThreadPool to its own micro-library
...
Also rename it to SimpleBlockingThreadPool.
2023-09-20 10:11:49 +02:00
Viktor Lofgren
f6b9e8c5eb
(converter) JavadocSpecialization should truncate its summary if it gets too long
2023-09-17 16:25:33 +02:00
Viktor Lofgren
98bcdf6028
(converter) DirtreeSideloader now trims /index.html from the URL if present
...
This is a crawler artifact in 9 cases out of 10, and may lead to bad URLs.
2023-09-17 16:08:16 +02:00
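A minimal sketch of the trimming described in 98bcdf6028, written in Java. The class and method names are hypothetical and this is not the actual DirtreeSideloader code. It reads the commit subject literally and drops the whole `/index.html` suffix; whether the real implementation keeps a trailing slash is not stated in the log.

```java
// Hypothetical helper, not taken from the repository.
final class IndexHtmlTrimSketch {
    private static final String SUFFIX = "/index.html";

    // Returns the URL without a trailing "/index.html"; other URLs pass through unchanged.
    static String trimIndexHtml(String url) {
        if (url.endsWith(SUFFIX)) {
            return url.substring(0, url.length() - SUFFIX.length());
        }
        return url;
    }

    public static void main(String[] args) {
        System.out.println(trimIndexHtml("https://example.com/docs/index.html"));
        System.out.println(trimIndexHtml("https://example.com/docs/page.html"));
    }
}
```

Only an exact suffix match is trimmed here, so a URL such as `https://example.com/docs/page.html` is left as-is.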
Viktor Lofgren
9b385ec7cc
(converter) Make it possible to sideload documents from a directory tree
2023-09-17 14:35:06 +02:00
Viktor Lofgren
5c040f7a46
(crawl-spec) Parquetify crawl spec
...
* Crawl-specs are now parquet files
* Deprecate the crawl-job-extractor tool
2023-09-17 09:41:34 +02:00
Viktor
46232c7fd4
Merge pull request #48 from MarginaliaSearch/parquet
...
Converter-Loader communicates via Parquet files
2023-09-15 13:32:06 +02:00
Viktor Lofgren
c67d95c00f
(converter) Write dummy processor log when sideloading
2023-09-14 14:13:03 +02:00
Viktor Lofgren
5e5aaf9a7e
(converter, control) Re-enable sideloading encyclopedia data
2023-09-14 12:12:07 +02:00
Viktor Lofgren
35996d0adb
(docs) Update the documentation with up-to-date information
2023-09-14 11:33:36 +02:00
Viktor Lofgren
eaeb23d41e
(refactor) Remove converting-model package completely
2023-09-14 11:21:44 +02:00
Viktor Lofgren
c71f6ad417
(converter) Add heartbeats to the loader processes and execute the tasks in parallel for a ~2X speedup
2023-09-14 10:11:57 +02:00
Viktor Lofgren
87a8593291
(work-log) Fix bug where items weren't added to the current batch on logItem
2023-09-14 10:11:04 +02:00
Viktor Lofgren
4799dd769e
(converting) WIP begin to remove converting-model and the old InstructionsCompiler
2023-09-13 19:18:58 +02:00
Viktor Lofgren
24b4606f96
(converter,loader) Converter outputs parquet files instead of compressed json.
2023-09-13 16:13:41 +02:00
Viktor Lofgren
9f672a0cf4
(parquet-floor) Modify the parquet library to permit list-fields.
2023-09-13 15:56:35 +02:00
Viktor Lofgren
064bc5ee76
(processed-data) New parquet-serializable models for converter output
2023-09-11 14:08:40 +02:00
Viktor Lofgren
a52d78c8ee
(work-log) New batching work log
2023-09-11 14:08:08 +02:00
Viktor Lofgren
a00cabe223
(parquet-floor) Patch in support for writing and reading repeated values
2023-09-11 14:06:43 +02:00
Viktor Lofgren
dbe974f510
(parquet) Use ZSTD compression by default.
2023-09-11 09:02:58 +02:00
Viktor Lofgren
a284682deb
(parquet) Add parquet library
...
This small library, while great, will require some modifications to fit the project's needs, so it goes into third-party directly.
2023-09-05 10:38:51 +02:00
Viktor Lofgren
07d7507ac6
(control-service) Move Actions up in storage-details
...
Papercut fix. If a file storage area has a lot of files, you have to scroll down a long way to get to the actions otherwise.
2023-09-02 15:41:55 +02:00
Viktor Lofgren
c68d17d482
(keyword-extraction) Fix bug leading to position data missing on some keywords.
...
This was due to a discrepancy between the KeywordPositionBitmask and WordsTfIdfCounts' concept of a keyword.
2023-09-02 14:48:55 +02:00
Viktor Lofgren
9e185e80ce
(control-service) Add timestamp to file storages.
2023-09-02 14:01:04 +02:00
Viktor Lofgren
676e7c7947
(keywords) Add Serializable properties that went missing as the record became a class
2023-09-02 09:52:01 +02:00
Viktor Lofgren
04212b2cef
(btree) Add more consistent asserts on sortedness
2023-09-01 15:45:02 +02:00
Viktor Lofgren
bafc2a1f30
(reverse-index) Force() final docs after being written
...
Unlikely to be a problem, but we want to ensure it's on disk before we go read it later.
2023-09-01 15:43:53 +02:00
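To illustrate the idea behind bafc2a1f30, here is a small Java sketch of writing data and forcing it to storage before the file is read again. It assumes a plain `FileChannel`; the project's own array abstraction is not shown in this log, and the class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper, not taken from the repository.
final class ForcedWriteSketch {
    // Writes the values as 8-byte longs, then asks the OS to flush them to the storage device.
    static void writeAndForce(Path file, long[] docs) {
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            ByteBuffer buf = ByteBuffer.allocate(docs.length * Long.BYTES);
            for (long doc : docs) {
                buf.putLong(doc);
            }
            buf.flip();
            while (buf.hasRemaining()) {
                ch.write(buf);
            }
            // force(true) flushes both file content and metadata before returning
            ch.force(true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A memory-mapped file would use `MappedByteBuffer.force()` for the same purpose.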
Viktor Lofgren
563e388a45
(reverse-index) Fix parallel documents sorting bug
...
Bug was caused by parallel sorting capturing the iterator rather than the offsets to sort.
2023-09-01 15:42:45 +02:00
Viktor Lofgren
d31d8ec5b0
(index) Log keyword ids in hex format
2023-09-01 15:40:24 +02:00
Viktor Lofgren
2b00cd632d
(process) Propagate environment JVM params to the index constructor
2023-09-01 15:39:42 +02:00
Viktor Lofgren
5f427d2b4c
(keywords) Clean up leaky abstractions, clean up tests
2023-09-01 13:52:00 +02:00
Viktor Lofgren
8c0ce4fc1d
(index journal; minor) Clean up
2023-09-01 11:32:24 +02:00
Viktor Lofgren
10a74f45ea
(index journal; minor) Even cleaner separation of concerns.
2023-09-01 11:28:02 +02:00
Viktor Lofgren
320dad7f1a
(index journal) Fix leaky abstraction in IndexJournalReader.
...
The caller shouldn't be required to know the on-disk layout of the file to make use of the data in a performant way.
2023-09-01 11:18:13 +02:00
Viktor Lofgren
88ac72c8eb
(journal/reverse index) Working WIP fix for over-allocation of documents
2023-08-31 20:16:02 +02:00
Viktor Lofgren
f74b9df0a7
(array) Don't use paging arrays when mapping small files for writing
2023-08-31 20:15:10 +02:00
Viktor Lofgren
a6f1335375
(loader) Fix bug where the loader would omit some meta and words.
2023-08-31 17:48:43 +02:00
Viktor Lofgren
f321fa5ad3
(array) Override to Paging...Array$range()
...
This is a big performance boost in array.range().get().
Without an override, each access will go through pages[page].get(...) for each get()-operation. This adds up very quickly. BTreeReader does a bunch of get():s on a range()'d array during traversal in the queryData... methods.
2023-08-31 13:52:29 +02:00
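The effect described in f321fa5ad3 can be sketched in Java roughly as follows. This is an illustration under assumptions, not the project's paging array: the class, the `Range` interface, and the single-page fast path are all hypothetical. The point is that a range confined to one page can resolve the page once, instead of repeating the page lookup on every `get()`.

```java
// Hypothetical paged array, not taken from the repository.
final class PagedLongArraySketch {
    private final long[][] pages;
    private final int pageSize;

    PagedLongArraySketch(int size, int pageSize) {
        this.pageSize = pageSize;
        int pageCount = (size + pageSize - 1) / pageSize;
        this.pages = new long[pageCount][pageSize];
    }

    // Every access pays for a page lookup plus an offset computation.
    long get(int idx) {
        return pages[idx / pageSize][idx % pageSize];
    }

    void set(int idx, long value) {
        pages[idx / pageSize][idx % pageSize] = value;
    }

    interface Range {
        long get(int idx);
    }

    // View over [start, end); indices passed to the view are relative to start.
    Range range(int start, int end) {
        if (start < end && start / pageSize == (end - 1) / pageSize) {
            // Fast path: the whole range lives in one page, so resolve it once.
            long[] page = pages[start / pageSize];
            int offset = start % pageSize;
            return idx -> page[offset + idx];
        }
        // Fallback: the range spans pages, so delegate to the paged lookup.
        return idx -> get(start + idx);
    }
}
```

With a page size of 4, `range(4, 8)` stays inside one page and takes the fast path, while `range(2, 7)` crosses a page boundary and falls back to the per-access lookup.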