(doc) Documentation corrections
parent 929caed0b9
commit ba26f6ce84
@@ -1,7 +1,19 @@
 # Forward Index
 
-The forward index contains a mapping from document id to word id. It also provides document-level
-metadata, and a document-to-domain mapping.
+The forward index contains a mapping from document id to various forms of document metadata.
 
+In practice, the forward index consists of two files, an `id` file and a `data` file.
+
+The `id` file contains a list of sorted document ids, and the `data` file contains
+metadata for each document id, in the same order as the `id` file, with a fixed
+size record containing data associated with each document id.
+
+Each record contains a binary encoded [DocumentMetadata](../../common/model/src/main/java/nu/marginalia/model/idx/DocumentMetadata.java) object,
+as well as a [HtmlFeatures](../../common/model/src/main/java/nu/marginalia/model/crawl/HtmlFeature.java) bitmask.
+
+Unlike the reverse index, the forward index is not split into two tiers, and the data is in the same
+order as it is in the source data, and the cardinality of the document IDs is assumed to fit in memory,
+so it's relatively easy to construct.
+
 ## Central Classes
 
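The layout described in the README hunk (a sorted `id` file plus a `data` file of fixed-size records in the same order) suggests a lookup by binary search followed by an offset calculation. The following is a minimal sketch of that idea, not the project's actual reader: the class name `ForwardIndexSketch`, the record size of two longs, and the field offsets are all assumptions made for illustration, and plain in-memory arrays stand in for the two files.

```java
import java.util.Arrays;

/** Hypothetical sketch of a forward index lookup; not the project's real class. */
class ForwardIndexSketch {
    // Assumed record layout: two longs per document.
    static final int ENTRY_SIZE = 2;       // assumption: longs per record
    static final int METADATA_OFFSET = 0;  // assumption: encoded document metadata
    static final int FEATURES_OFFSET = 1;  // assumption: HTML feature bitmask

    private final long[] ids;   // sorted document ids (stands in for the `id` file)
    private final long[] data;  // fixed-size records, same order as ids (the `data` file)

    ForwardIndexSketch(long[] sortedIds, long[] data) {
        if (data.length != sortedIds.length * ENTRY_SIZE) {
            throw new IllegalArgumentException("data must hold one record per id");
        }
        this.ids = sortedIds;
        this.data = data;
    }

    /** Returns the start of the record for docId in the data array, or -1 if absent. */
    private int recordOffset(long docId) {
        int pos = Arrays.binarySearch(ids, docId);
        return pos < 0 ? -1 : pos * ENTRY_SIZE;
    }

    /** Encoded metadata for docId, or 0 if the document is not in the index. */
    long metadata(long docId) {
        int offset = recordOffset(docId);
        return offset < 0 ? 0L : data[offset + METADATA_OFFSET];
    }

    /** Feature bitmask for docId, or 0 if the document is not in the index. */
    long features(long docId) {
        int offset = recordOffset(docId);
        return offset < 0 ? 0L : data[offset + FEATURES_OFFSET];
    }
}
```

Because the records have a fixed size, the position of a document id in the sorted id list is enough to locate its record, so no separate offset table is needed. Returning 0 for unknown ids is a choice made for this sketch only.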
@@ -20,8 +20,8 @@ import static nu.marginalia.index.forward.ForwardIndexParameters.*;
  * and a mapping between document identifiers to the index into the
  * data array.
  * <p/>
- * Since the total data is relatively small, this is attempted to be
- * kept in memory to reduce the amount of disk thrashing.
+ * Since the total data is relatively small, this is kept in memory to
+ * reduce the amount of disk thrashing.
  * <p/>
  * The metadata is a binary encoding of {@see nu.marginalia.idx.DocumentMetadata}
  */
6
third-party/openzim/readme.md
vendored
@@ -1,11 +1,7 @@
 # OpenZIM
 
-[OpenZIM](https://github.com/openzim/libzim) - GPL-2.0
+[OpenZIM](https://github.com/openzim/libzim) - GPL-2.0+
 
 OpenZIM is a ZIM file reader. This code has been modified in a fairly crude manner
 to be much faster than the original code base which seems quite antique. It also
 supports XZ compression.
-
-**Important Note** the license is incompatible with AGPL 3, so we can't link Marginalia
-directly to this. It's still very useful for building tools that deal with
-wikipedia data which would be stand-alone.