Viktor Lofgren
|
95323e6caa
|
(backups) Support restoring multi-source load data
|
2023-09-22 18:34:17 +02:00 |
|
Viktor Lofgren
|
f809d22fc6
|
(loader) Support simultaneous loading of multiple processed data sets
|
2023-09-22 13:14:58 +02:00 |
|
Viktor
|
763d61db8d
|
Create Additional Contributors.md
|
2023-09-21 15:38:19 +02:00 |
|
Viktor Lofgren
|
10cad3abb2
|
(dating) Implementing @samstorment's fantastic design polish
|
2023-09-21 15:19:50 +02:00 |
|
Viktor Lofgren
|
9338f35cd8
|
(doc) Remove confusingly outdated ER-diagrams
|
2023-09-21 15:08:27 +02:00 |
|
Viktor Lofgren
|
ead6fa9daa
|
(doc) Update conceptual-overview.svg to reflect the removal of the lexicon
|
2023-09-21 13:47:05 +02:00 |
|
Viktor Lofgren
|
ad660cf420
|
(converter) Bugfix: Don't try to Path.of() on optional field
|
2023-09-21 13:27:09 +02:00 |
|
Viktor Lofgren
|
75f8ae2815
|
(file-storage) Use human-readable timestamps in the names of file storage directories
|
2023-09-21 13:22:53 +02:00 |
|
Viktor Lofgren
|
70aa04c047
|
(converter, stackexchange-xml) Add the ability to sideload stackexchange data
|
2023-09-21 12:48:33 +02:00 |
|
Viktor Lofgren
|
4aa47e87f2
|
(blocking-thread-pool) Add isTerminated convenience function
|
2023-09-21 12:47:41 +02:00 |
|
Viktor Lofgren
|
f8050816ac
|
(search) Don't run LSH deduplication on details with zero lsh to support not calculating this hash.
|
2023-09-21 12:47:02 +02:00 |
|
Viktor Lofgren
|
5b0a6d7ec1
|
(stackexchange-converter) Create tool for converting stackexchange 7z-files to digestible sqlite DBs
|
2023-09-20 15:15:13 +02:00 |
|
Viktor Lofgren
|
3b4d08f52b
|
(stackexchange-integration) Add better comments
|
2023-09-20 14:43:06 +02:00 |
|
Viktor Lofgren
|
6bbf40d7d2
|
(stackexchange-integration) Tools for reading stackexchange xml files
|
2023-09-20 14:17:33 +02:00 |
|
Viktor Lofgren
|
d895f83520
|
(blocking-thread-pool) Move DumbThreadPool to its own micro-library
Also rename it to SimpleBlockingThreadPool.
|
2023-09-20 10:11:49 +02:00 |
|
Viktor Lofgren
|
f6b9e8c5eb
|
(converter) JavadocSpecialization should truncate its summary if it gets too long
|
2023-09-17 16:25:33 +02:00 |
|
Viktor Lofgren
|
98bcdf6028
|
(converter) DirtreeSideloader now trims /index.html from the URL if present
This is a crawler artifact in 9 cases out of 10, and may lead to bad URLs.
|
2023-09-17 16:08:16 +02:00 |
|
Viktor Lofgren
|
9b385ec7cc
|
(converter) Make it possible to sideload documents from a directory tree
|
2023-09-17 14:35:06 +02:00 |
|
Viktor Lofgren
|
5c040f7a46
|
(crawl-spec) Parquetify crawl spec
* Crawl-specs are now parquet files
* Deprecate the crawl-job-extractor tool
|
2023-09-17 09:41:34 +02:00 |
|
Viktor
|
46232c7fd4
|
Merge pull request #48 from MarginaliaSearch/parquet
Converter-Loader communicates via Parquet files
|
2023-09-15 13:32:06 +02:00 |
|
Viktor Lofgren
|
c67d95c00f
|
(converter) Write dummy processor log when sideloading
|
2023-09-14 14:13:03 +02:00 |
|
Viktor Lofgren
|
5e5aaf9a7e
|
(converter, control) Re-enable sideloading encyclopedia data
|
2023-09-14 12:12:07 +02:00 |
|
Viktor Lofgren
|
35996d0adb
|
(docs) Update the documentation with up-to-date information
|
2023-09-14 11:33:36 +02:00 |
|
Viktor Lofgren
|
eaeb23d41e
|
(refactor) Remove converting-model package completely
|
2023-09-14 11:21:44 +02:00 |
|
Viktor Lofgren
|
c71f6ad417
|
(converter) Add heartbeats to the loader processes and execute the tasks in parallel for a ~2X speedup
|
2023-09-14 10:11:57 +02:00 |
|
Viktor Lofgren
|
87a8593291
|
(work-log) Fix bug where items weren't added to the current batch on logItem
|
2023-09-14 10:11:04 +02:00 |
|
Viktor Lofgren
|
4799dd769e
|
(converting) WIP begin to remove converting-model and the old InstructionsCompiler
|
2023-09-13 19:18:58 +02:00 |
|
Viktor Lofgren
|
24b4606f96
|
(converter,loader) Converter outputs parquet files instead of compressed json.
|
2023-09-13 16:13:41 +02:00 |
|
Viktor Lofgren
|
9f672a0cf4
|
(parquet-floor) Modify the parquet library to permit list-fields.
|
2023-09-13 15:56:35 +02:00 |
|
Viktor Lofgren
|
064bc5ee76
|
(processed-data) New parquet-serializable models for converter output
|
2023-09-11 14:08:40 +02:00 |
|
Viktor Lofgren
|
a52d78c8ee
|
(work-log) New batching work log
|
2023-09-11 14:08:08 +02:00 |
|
Viktor Lofgren
|
a00cabe223
|
(parquet-floor) Patch in support for writing and reading repeated values
|
2023-09-11 14:06:43 +02:00 |
|
Viktor Lofgren
|
dbe974f510
|
(parquet) Use ZSTD compression by default.
|
2023-09-11 09:02:58 +02:00 |
|
Viktor Lofgren
|
a284682deb
|
(parquet) Add parquet library
This small library, while great, will require some modifications to fit the project's needs, so it goes into third-party directly.
|
2023-09-05 10:38:51 +02:00 |
|
Viktor Lofgren
|
07d7507ac6
|
(control-service) Move Actions up in storage-details
Papercut fix. If a file storage area has a lot of files, you have to scroll down a long way to get to the actions otherwise.
|
2023-09-02 15:41:55 +02:00 |
|
Viktor Lofgren
|
c68d17d482
|
(keyword-extraction) Fix bug leading to position data missing on some keywords.
This was due to a discrepancy between the KeywordPositionBitmask and WordsTfIdfCounts' concept of a keyword.
|
2023-09-02 14:48:55 +02:00 |
|
Viktor Lofgren
|
9e185e80ce
|
(control-service) Add timestamp to file storages.
|
2023-09-02 14:01:04 +02:00 |
|
Viktor Lofgren
|
676e7c7947
|
(keywords) Add Serializable properties that went missing as the record became a class
|
2023-09-02 09:52:01 +02:00 |
|
Viktor Lofgren
|
04212b2cef
|
(btree) Add more consistent asserts on sortedness
|
2023-09-01 15:45:02 +02:00 |
|
Viktor Lofgren
|
bafc2a1f30
|
(reverse-index) Force() final docs after being written
Unlikely to be a problem, but we want to ensure it's on disk before we go read it later.
|
2023-09-01 15:43:53 +02:00 |
|
Viktor Lofgren
|
563e388a45
|
(reverse-index) Fix parallel documents sorting bug
Bug was caused by parallel sorting capturing the iterator rather than the offsets to sort.
|
2023-09-01 15:42:45 +02:00 |
|
Viktor Lofgren
|
d31d8ec5b0
|
(index) Log keyword ids in hex format
|
2023-09-01 15:40:24 +02:00 |
|
Viktor Lofgren
|
2b00cd632d
|
(process) Propagate environment JVM params to the index constructor
|
2023-09-01 15:39:42 +02:00 |
|
Viktor Lofgren
|
5f427d2b4c
|
(keywords) Clean up leaky abstractions, clean up tests
|
2023-09-01 13:52:00 +02:00 |
|
Viktor Lofgren
|
8c0ce4fc1d
|
(index journal; minor) Clean up
|
2023-09-01 11:32:24 +02:00 |
|
Viktor Lofgren
|
10a74f45ea
|
(index journal; minor) Even cleaner separation of concerns.
|
2023-09-01 11:28:02 +02:00 |
|
Viktor Lofgren
|
320dad7f1a
|
(index journal) Fix leaky abstraction in IndexJournalReader.
The caller shouldn't be required to know the on-disk layout of the file to make use of the data in a performant way.
|
2023-09-01 11:18:13 +02:00 |
|
Viktor Lofgren
|
88ac72c8eb
|
(journal/reverse index) Working WIP fix for over-allocation of documents
|
2023-08-31 20:16:02 +02:00 |
|
Viktor Lofgren
|
f74b9df0a7
|
(array) Don't use paging arrays when mapping small files for writing
|
2023-08-31 20:15:10 +02:00 |
|
Viktor Lofgren
|
a6f1335375
|
(loader) Fix bug where the loader would omit some meta and words.
|
2023-08-31 17:48:43 +02:00 |
|