diff --git a/code/features-index/index-forward/readme.md b/code/features-index/index-forward/readme.md
index 22b4ed96..545fbf1e 100644
--- a/code/features-index/index-forward/readme.md
+++ b/code/features-index/index-forward/readme.md
@@ -1,7 +1,19 @@
 # Forward Index
 
-The forward index contains a mapping from document id to word id. It also provides document-level
-metadata, and a document-to-domain mapping.
+The forward index contains a mapping from document id to various forms of document metadata.
+
+In practice, the forward index consists of two files, an `id` file and a `data` file.
+
+The `id` file contains a list of sorted document ids, and the `data` file contains
+metadata for each document id, in the same order as the `id` file, with a fixed
+size record containing data associated with each document id.
+
+Each record contains a binary encoded [DocumentMetadata](../../common/model/src/main/java/nu/marginalia/model/idx/DocumentMetadata.java) object,
+as well as a [HtmlFeatures](../../common/model/src/main/java/nu/marginalia/model/crawl/HtmlFeature.java) bitmask.
+
+Unlike the reverse index, the forward index is not split into two tiers, and the data is in the same
+order as it is in the source data, and the cardinality of the document IDs is assumed to fit in memory,
+so it's relatively easy to construct.
 
 ## Central Classes
 
diff --git a/code/features-index/index-forward/src/main/java/nu/marginalia/index/forward/ForwardIndexReader.java b/code/features-index/index-forward/src/main/java/nu/marginalia/index/forward/ForwardIndexReader.java
index 04e4fce1..5d26de82 100644
--- a/code/features-index/index-forward/src/main/java/nu/marginalia/index/forward/ForwardIndexReader.java
+++ b/code/features-index/index-forward/src/main/java/nu/marginalia/index/forward/ForwardIndexReader.java
@@ -20,8 +20,8 @@ import static nu.marginalia.index.forward.ForwardIndexParameters.*;
  * and a mapping between document identifiers to the index into the
  * data array.
  *
- * Since the total data is relatively small, this is attempted to be
- * kept in memory to reduce the amount of disk thrashing.
+ * Since the total data is relatively small, this is kept in memory to
+ * reduce the amount of disk thrashing.
  *
  * The metadata is a binary encoding of {@see nu.marginalia.idx.DocumentMetadata}
  */
diff --git a/third-party/openzim/readme.md b/third-party/openzim/readme.md
index ee47e601..df9af456 100644
--- a/third-party/openzim/readme.md
+++ b/third-party/openzim/readme.md
@@ -1,11 +1,7 @@
 # OpenZIM
 
-[OpenZIM](https://github.com/openzim/libzim) - GPL-2.0
+[OpenZIM](https://github.com/openzim/libzim) - GPL-2.0+
 
 OpenZIM is a ZIM file reader. This code has been modified in a fairly crude manner
 to be much faster than the original code base which seems quite antique. It also
 supports XZ compression.
-
-**Important Note** the license is incompatible with AGPL 3, so we can't link Marginalia
-directly to this. It's still very useful for building tools that deal with
-wikipedia data which would be stand-alone.
\ No newline at end of file
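
The two-file layout described in the forward index readme hunk above (a sorted `id` list plus fixed-size records in the same order) can be illustrated with a small lookup sketch. This is not the actual `ForwardIndexReader`. The class name `ForwardIndexSketch`, the record size of two longs, the field offsets, and the "return 0 when missing" convention are all assumptions made for illustration, and both files are modelled as in-memory `long[]` arrays.

```java
import java.util.Arrays;

/** Hypothetical sketch of a forward index lookup; not the project's real reader. */
final class ForwardIndexSketch {
    // Assumed record layout: two longs per document.
    static final int ENTRY_SIZE = 2;
    static final int METADATA_OFFSET = 0; // encoded document metadata (assumed position)
    static final int FEATURES_OFFSET = 1; // HTML feature bitmask (assumed position)

    private final long[] ids;  // stands in for the "id" file: sorted document ids
    private final long[] data; // stands in for the "data" file: fixed-size records

    ForwardIndexSketch(long[] sortedIds, long[] data) {
        if (data.length != sortedIds.length * ENTRY_SIZE) {
            throw new IllegalArgumentException("data must hold one record per id");
        }
        this.ids = sortedIds;
        this.data = data;
    }

    /** Returns the start of the record for docId in data, or -1 if the id is absent. */
    private int recordStart(long docId) {
        // The ids are sorted, so the position in the id list is found by binary search.
        int pos = Arrays.binarySearch(ids, docId);
        // Records have a fixed size and share the id order, so the offset is a multiplication.
        return pos < 0 ? -1 : pos * ENTRY_SIZE;
    }

    /** Encoded metadata for docId, or 0 if the document is unknown (assumed convention). */
    long getMetadata(long docId) {
        int start = recordStart(docId);
        return start < 0 ? 0L : data[start + METADATA_OFFSET];
    }

    /** Feature bitmask for docId, or 0 if the document is unknown (assumed convention). */
    long getFeatures(long docId) {
        int start = recordStart(docId);
        return start < 0 ? 0L : data[start + FEATURES_OFFSET];
    }
}
```

Because the records are fixed-size and stored in id order, no separate offset table is needed: the position of an id in the sorted list determines where its record starts.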