(doc) Documentation corrections

Viktor Lofgren 2024-02-10 14:16:01 +01:00
parent 929caed0b9
commit ba26f6ce84
3 changed files with 17 additions and 9 deletions


@@ -1,7 +1,19 @@
# Forward Index
-The forward index contains a mapping from document id to word id. It also provides document-level
-metadata, and a document-to-domain mapping.
+The forward index contains a mapping from document id to various forms of document metadata.
+In practice, the forward index consists of two files, an `id` file and a `data` file.
+The `id` file contains a list of sorted document ids, and the `data` file contains
+metadata for each document id, in the same order as the `id` file, with a fixed
+size record containing data associated with each document id.
+Each record contains a binary encoded [DocumentMetadata](../../common/model/src/main/java/nu/marginalia/model/idx/DocumentMetadata.java) object,
+as well as a [HtmlFeatures](../../common/model/src/main/java/nu/marginalia/model/crawl/HtmlFeature.java) bitmask.
+Unlike the reverse index, the forward index is not split into two tiers, and the data is in the same
+order as it is in the source data, and the cardinality of the document IDs is assumed to fit in memory,
+so it's relatively easy to construct.
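
The two-file layout described in this readme hunk can be illustrated with a minimal in-memory sketch. Everything named here is hypothetical and not taken from the Marginalia code base: the class `ForwardIndexSketch`, the record width of two 64-bit slots (one for the encoded metadata, one for the feature bitmask), and the choice to return 0 for unknown ids are all assumptions made for illustration.

```java
import java.util.Arrays;

/**
 * Hypothetical sketch of the forward index layout: a sorted id array
 * plus a data array of fixed-size records stored in the same order.
 */
final class ForwardIndexSketch {
    // Assumed record layout: two 64-bit slots per document.
    static final int ENTRY_SIZE = 2;
    static final int METADATA_OFFSET = 0;
    static final int FEATURES_OFFSET = 1;

    private final long[] ids;   // stands in for the `id` file (sorted document ids)
    private final long[] data;  // stands in for the `data` file (fixed-size records)

    ForwardIndexSketch(long[] sortedIds, long[] data) {
        if (data.length != sortedIds.length * ENTRY_SIZE) {
            throw new IllegalArgumentException("data must hold one record per id");
        }
        this.ids = sortedIds.clone();
        this.data = data.clone();
    }

    /** Returns the encoded metadata for docId, or 0 if the id is unknown. */
    long getMetadata(long docId) {
        return read(docId, METADATA_OFFSET);
    }

    /** Returns the feature bitmask for docId, or 0 if the id is unknown. */
    long getFeatures(long docId) {
        return read(docId, FEATURES_OFFSET);
    }

    private long read(long docId, int fieldOffset) {
        // The ids are sorted, so the position of an id in the id array is
        // also the record number in the data array.
        int recordNumber = Arrays.binarySearch(ids, docId);
        if (recordNumber < 0) {
            return 0L;
        }
        return data[recordNumber * ENTRY_SIZE + fieldOffset];
    }
}
```

Because the records are fixed size and share the ordering of the ids, a lookup needs only a binary search and one multiplication, with no per-record offset table.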
## Central Classes


@@ -20,8 +20,8 @@ import static nu.marginalia.index.forward.ForwardIndexParameters.*;
* and a mapping between document identifiers to the index into the
* data array.
* <p/>
- * Since the total data is relatively small, this is attempted to be
- * kept in memory to reduce the amount of disk thrashing.
+ * Since the total data is relatively small, this is kept in memory to
+ * reduce the amount of disk thrashing.
* <p/>
* The metadata is a binary encoding of {@see nu.marginalia.idx.DocumentMetadata}
*/


@@ -1,11 +1,7 @@
# OpenZIM
-[OpenZIM](https://github.com/openzim/libzim) - GPL-2.0
+[OpenZIM](https://github.com/openzim/libzim) - GPL-2.0+
OpenZIM is a ZIM file reader. This code has been modified in a fairly crude manner
to be much faster than the original code base which seems quite antique. It also
supports XZ compression.
-**Important Note** the license is incompatible with AGPL 3, so we can't link Marginalia
-directly to this. It's still very useful for building tools that deal with
-wikipedia data which would be stand-alone.