(doc) Documentation corrections
parent 929caed0b9
commit ba26f6ce84
@@ -1,7 +1,19 @@
 # Forward Index
 
-The forward index contains a mapping from document id to word id. It also provides document-level
-metadata, and a document-to-domain mapping.
+The forward index contains a mapping from document id to various forms of document metadata.
+
+In practice, the forward index consists of two files, an `id` file and a `data` file.
+
+The `id` file contains a list of sorted document ids, and the `data` file contains
+metadata for each document id, in the same order as the `id` file, with a fixed
+size record containing data associated with each document id.
+
+Each record contains a binary encoded [DocumentMetadata](../../common/model/src/main/java/nu/marginalia/model/idx/DocumentMetadata.java) object,
+as well as a [HtmlFeatures](../../common/model/src/main/java/nu/marginalia/model/crawl/HtmlFeature.java) bitmask.
+
+Unlike the reverse index, the forward index is not split into two tiers, and the data is in the same
+order as it is in the source data, and the cardinality of the document IDs is assumed to fit in memory,
+so it's relatively easy to construct.
 
 ## Central Classes
 
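The `id`/`data` layout described in the README text above can be sketched roughly as follows. This is a minimal in-memory illustration, not the code touched by this commit: the class name `ForwardIndexSketch`, the record width of two longs, the offsets, and the use of binary search over the sorted ids are all assumptions made for the example.

```java
import java.util.Arrays;

/** Hypothetical sketch of the id/data layout; not the actual Marginalia classes. */
class ForwardIndexSketch {
    // Assumed record layout: one long of encoded metadata, one long of feature bits
    static final int ENTRY_SIZE = 2;
    static final int METADATA_OFFSET = 0;
    static final int FEATURES_OFFSET = 1;

    private final long[] ids;   // sorted document ids, as in the `id` file
    private final long[] data;  // fixed-size records, in the same order as ids

    public ForwardIndexSketch(long[] ids, long[] data) {
        if (data.length != ids.length * ENTRY_SIZE) {
            throw new IllegalArgumentException("data must hold one record per id");
        }
        this.ids = ids;
        this.data = data;
    }

    /** Returns the start of the record for docId in the data array, or -1 if absent. */
    private int recordOffset(long docId) {
        int idx = Arrays.binarySearch(ids, docId);
        return idx < 0 ? -1 : idx * ENTRY_SIZE;
    }

    /** Encoded metadata for docId, or 0 if the document is not in the index. */
    public long getMetadata(long docId) {
        int offset = recordOffset(docId);
        return offset < 0 ? 0L : data[offset + METADATA_OFFSET];
    }

    /** Feature bitmask for docId, or 0 if the document is not in the index. */
    public long getFeatures(long docId) {
        int offset = recordOffset(docId);
        return offset < 0 ? 0L : data[offset + FEATURES_OFFSET];
    }
}
```

Because every record has the same width, the position of an id in the sorted id list is enough to compute where its record starts; no per-document offset table is needed.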
@@ -20,8 +20,8 @@ import static nu.marginalia.index.forward.ForwardIndexParameters.*;
  * and a mapping between document identifiers to the index into the
  * data array.
  * <p/>
- * Since the total data is relatively small, this is attempted to be
- * kept in memory to reduce the amount of disk thrashing.
+ * Since the total data is relatively small, this is kept in memory to
+ * reduce the amount of disk thrashing.
  * <p/>
  * The metadata is a binary encoding of {@see nu.marginalia.idx.DocumentMetadata}
  */
6
third-party/openzim/readme.md
vendored
@@ -1,11 +1,7 @@
 # OpenZIM
 
-[OpenZIM](https://github.com/openzim/libzim) - GPL-2.0
+[OpenZIM](https://github.com/openzim/libzim) - GPL-2.0+
 
 OpenZIM is a ZIM file reader. This code has been modified in a fairly crude manner
 to be much faster than the original code base which seems quite antique. It also
 supports XZ compression.
-
-**Important Note** the license is incompatible with AGPL 3, so we can't link Marginalia
-directly to this. It's still very useful for building tools that deal with
-wikipedia data which would be stand-alone.