53c575db3f
Writing large files out of order directly to disk causes severe disk thrashing, so the system needs a strategy to avoid it. By default, the data is buffered in RAM. This works well on a large server, but smaller systems struggle. To help systems with little RAM process large amounts of data, the old RandomWriteFunnel is brought back and is used if the system property 'system.conserve-memory' is set to true. RandomWriteFunnel buffers the construction by creating a series of small files that pigeonhole the writes into rough neighborhoods, and then goes over those files one by one to construct one area of the output file at a time. This is relatively slow and uses more than twice the disk space of the final file. A new interface, RandomFileAssembler, is introduced as an abstraction for this operation. A third strategy, direct mmaps, is also introduced for files that are small (less than 1 GB). At that size, disk thrashing is unlikely since the file will comfortably fit in RAM.
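The choice between the three assembly strategies described in the commit message above could be sketched roughly as follows. This is only an illustration: the enum, the class name, the exact byte value of the 1 GB limit, and the precedence between the small-file case and the 'system.conserve-memory' property are assumptions, not taken from the actual code.

```java
/** Hypothetical labels for the three strategies; only RandomWriteFunnel is a name from the source. */
enum AssemblyStrategy { DIRECT_MMAP, RAM_BUFFER, WRITE_FUNNEL }

final class AssemblyStrategySelector {
    // "less than 1 GB" comes from the commit message; using 2^30 bytes is an assumption.
    static final long SMALL_FILE_LIMIT = 1L << 30;

    /** Picks a strategy from the expected file size and the conserve-memory setting. */
    static AssemblyStrategy choose(long fileSizeBytes, boolean conserveMemory) {
        // Assumption: the small-file case is checked first, since such a file fits in RAM anyway.
        if (fileSizeBytes < SMALL_FILE_LIMIT) {
            return AssemblyStrategy.DIRECT_MMAP;
        }
        // Low-memory systems trade speed and disk space for a smaller RAM footprint.
        if (conserveMemory) {
            return AssemblyStrategy.WRITE_FUNNEL;
        }
        // Default: buffer the out-of-order writes in RAM.
        return AssemblyStrategy.RAM_BUFFER;
    }

    /** Reads the setting from the system property named in the commit message. */
    static AssemblyStrategy choose(long fileSizeBytes) {
        return choose(fileSizeBytes, Boolean.getBoolean("system.conserve-memory"));
    }
}
```

With this sketch, starting the JVM with `-Dsystem.conserve-memory=true` would select the funnel for any file of 1 GB or more.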
Reverse Index
The reverse index contains a mapping from word to document id.
There are two tiers of this index.
- A priority index which only indexes terms that are flagged with priority flags [1].
- A full index that indexes all terms.
The full index also provides access to term-level metadata, while the priority index is a binary index that only offers information about which documents contain a specific word.
[1] See WordFlags in common/model and KeywordMetadata in features-convert/keyword-extraction.
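To make the difference between the two tiers concrete, here is a small in-memory model. It is a sketch under assumptions, not the on-disk format: the class and method names are hypothetical, and metadata is modelled as a single `long` per (word, document) pair.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.OptionalLong;
import java.util.TreeMap;
import java.util.TreeSet;

final class TwoTierIndexSketch {
    // Full tier: wordId -> (docId -> term metadata), for every term.
    private final Map<Long, TreeMap<Long, Long>> full = new HashMap<>();
    // Priority tier: wordId -> docIds only, and only for priority-flagged terms.
    private final Map<Long, TreeSet<Long>> priority = new HashMap<>();

    void add(long wordId, long docId, long metadata, boolean hasPriorityFlag) {
        full.computeIfAbsent(wordId, k -> new TreeMap<>()).put(docId, metadata);
        if (hasPriorityFlag) {
            priority.computeIfAbsent(wordId, k -> new TreeSet<>()).add(docId);
        }
    }

    /** Binary answer: does the priority tier list this document for this word? */
    boolean priorityContains(long wordId, long docId) {
        TreeSet<Long> docs = priority.get(wordId);
        return docs != null && docs.contains(docId);
    }

    /** The full tier can additionally return the term-level metadata. */
    OptionalLong fullMetadata(long wordId, long docId) {
        TreeMap<Long, Long> docs = full.get(wordId);
        if (docs == null) {
            return OptionalLong.empty();
        }
        Long metadata = docs.get(docId);
        return metadata == null ? OptionalLong.empty() : OptionalLong.of(metadata);
    }
}
```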
Construction
The reverse index is constructed by first building a series of preindexes. Preindexes consist of a Segment and a Documents object. The segment contains information about which word identifiers are present and how many, and the documents contain information about in which documents the words can be found.
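The relationship between the segment and the documents can be illustrated with plain arrays. The sketch below is an assumption about the general shape only: `PreindexSketch`, its fields, and the `{wordId, docId}` input format are hypothetical and do not mirror the real ReversePreindex classes.

```java
import java.util.Arrays;

final class PreindexSketch {
    final long[] wordIds;   // segment: distinct word ids, ascending
    final int[] counts;     // segment: number of documents per word id
    final long[] documents; // documents: doc ids grouped by word, in segment order

    PreindexSketch(long[] wordIds, int[] counts, long[] documents) {
        this.wordIds = wordIds;
        this.counts = counts;
        this.documents = documents;
    }

    /** Builds a preindex from records of the form {wordId, docId}. */
    static PreindexSketch fromRecords(long[][] records) {
        long[][] sorted = records.clone();
        // Sort by word id, then by document id.
        Arrays.sort(sorted, (a, b) ->
                a[0] != b[0] ? Long.compare(a[0], b[0]) : Long.compare(a[1], b[1]));

        long[] words = new long[sorted.length];
        int[] counts = new int[sorted.length];
        long[] docs = new long[sorted.length];
        int n = 0;
        for (int i = 0; i < sorted.length; i++) {
            if (n == 0 || words[n - 1] != sorted[i][0]) {
                words[n++] = sorted[i][0]; // first occurrence of a new word id
            }
            counts[n - 1]++;
            docs[i] = sorted[i][1];
        }
        return new PreindexSketch(Arrays.copyOf(words, n), Arrays.copyOf(counts, n), docs);
    }

    /** Looks up the documents for a word by summing the counts that precede it. */
    long[] documentsFor(long wordId) {
        int idx = Arrays.binarySearch(wordIds, wordId);
        if (idx < 0) {
            return new long[0];
        }
        int start = 0;
        for (int i = 0; i < idx; i++) {
            start += counts[i];
        }
        return Arrays.copyOfRange(documents, start, start + counts[idx]);
    }
}
```

The linear scan over `counts` in `documentsFor` is there to keep the sketch short; a finalized index would need a faster lookup structure.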
The full set of data would typically not fit in RAM, so the index journal is paged and each preindex is constructed small enough to fit in memory; the preindexes are then merged. Merging sorted arrays is a very fast operation that does not require additional RAM.
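The merge step is essentially a two-pointer merge of sorted arrays. The sketch below merges only the segment part (word ids and counts) of two preindexes and sums the counts of word ids present in both. The names are hypothetical, and the output is allocated on the heap for simplicity, whereas a real implementation would presumably write to disk-backed storage.

```java
import java.util.Arrays;

final class SegmentMergeSketch {
    /** Result of a merge: distinct word ids (ascending) and their summed counts. */
    static final class Merged {
        final long[] wordIds;
        final int[] counts;

        Merged(long[] wordIds, int[] counts) {
            this.wordIds = wordIds;
            this.counts = counts;
        }
    }

    /** Merges two segments whose word ids are sorted ascending and distinct. */
    static Merged merge(long[] idsA, int[] countsA, long[] idsB, int[] countsB) {
        long[] ids = new long[idsA.length + idsB.length];
        int[] counts = new int[ids.length];
        int i = 0, j = 0, n = 0;

        // Single linear pass: always emit the smaller head, combine equal heads.
        while (i < idsA.length && j < idsB.length) {
            if (idsA[i] < idsB[j]) {
                ids[n] = idsA[i];
                counts[n++] = countsA[i++];
            } else if (idsA[i] > idsB[j]) {
                ids[n] = idsB[j];
                counts[n++] = countsB[j++];
            } else {
                ids[n] = idsA[i];
                counts[n++] = countsA[i++] + countsB[j++];
            }
        }
        // Copy whatever remains of either input.
        while (i < idsA.length) {
            ids[n] = idsA[i];
            counts[n++] = countsA[i++];
        }
        while (j < idsB.length) {
            ids[n] = idsB[j];
            counts[n++] = countsB[j++];
        }
        return new Merged(Arrays.copyOf(ids, n), Arrays.copyOf(counts, n));
    }
}
```

Each input element is visited exactly once, so the cost is linear in the combined size of the two segments.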
Once merged into one large preindex, indexes are added to the preindex data to form a finalized reverse index.
Central Classes
- ReversePreindex: intermediate reverse index state.
- ReverseIndexConstructor: constructs the index.
- ReverseIndexReader: interrogates the index.