# Converting Process

The converting process reads crawl data and extracts the information that is fed into the index,
such as keywords, metadata, URLs, and descriptions.

The converter reads crawl data in the form of Parquet files, and writes the extracted data to Parquet
files in a different format.  These files are then passed to the loader process, which does the additional
processing needed to feed the data into the index.

The work is split into two processes because the heavier converting process can be terminated
and restarted without losing progress, while the lighter loader process needs to run in a single
go (and must be restarted if it crashes or is terminated).
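
The restartable behaviour described above can be illustrated with a minimal sketch. This is not the converter's actual code: `ResumableBatch`, its `run` method and the in-memory `finished` set are hypothetical stand-ins, and a real implementation would presumably keep its journal on disk so that it survives a restart.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of resumable batch work; not the actual converter code.
final class ResumableBatch {
    // Stand-in for a journal that would be persisted on disk between runs
    private final Set<String> finished = new LinkedHashSet<>();

    /**
     * Processes up to maxItems unfinished items and returns the ones handled in this run.
     * The limit simulates the process being terminated part-way through.
     */
    List<String> run(List<String> domains, int maxItems) {
        List<String> handled = new ArrayList<>();
        for (String domain : domains) {
            if (handled.size() >= maxItems) break;    // simulated termination
            if (finished.contains(domain)) continue;  // already done in an earlier run
            // ... the expensive conversion would happen here ...
            finished.add(domain);
            handled.add(domain);
        }
        return handled;
    }
}
```

A second call to `run` with the same input only handles the items the first call did not reach, which is the property that makes an interrupted run cheap to resume.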

The converter output is also generally more portable and can be used for other tasks, whereas the
loader's output is heavily tailored to the index and of little use for anything else.

## Structure

Most information is extracted from the document itself within `DocumentProcessor`, but some information is extracted from the
context of the document, such as other documents on the same domain.  This is done in `DomainProcessor`.

To support multiple document formats, the converting process is pluggable. Each plugin is responsible for
converting a single document format, such as HTML or plain text.  
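
As a rough illustration of the plugin idea, the sketch below dispatches a document to the first plugin that accepts its content type. All names here (`FormatPlugin`, `HtmlPlugin`, `PlainTextPlugin`, `PluginDispatcher`) are hypothetical and do not reflect the real plugin interface; the tag stripping is a crude placeholder for actual HTML processing.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical sketch; these names are illustrative, not the project's real API.
interface FormatPlugin {
    /** True if this plugin can handle the given content type. */
    boolean accepts(String contentType);

    /** Extracts a (very simplified) text representation from the raw document body. */
    String convert(String body);
}

final class HtmlPlugin implements FormatPlugin {
    public boolean accepts(String contentType) {
        return contentType.startsWith("text/html");
    }

    public String convert(String body) {
        // Crude tag stripping, standing in for real HTML processing
        return body.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
    }
}

final class PlainTextPlugin implements FormatPlugin {
    public boolean accepts(String contentType) {
        return contentType.startsWith("text/plain");
    }

    public String convert(String body) {
        // Only normalizes whitespace
        return body.replaceAll("\\s+", " ").trim();
    }
}

final class PluginDispatcher {
    private final List<FormatPlugin> plugins;

    PluginDispatcher(List<FormatPlugin> plugins) {
        this.plugins = plugins;
    }

    /** Returns the converted document, or empty if no plugin accepts the format. */
    Optional<String> process(String contentType, String body) {
        return plugins.stream()
                .filter(p -> p.accepts(contentType))
                .findFirst()
                .map(p -> p.convert(body));
    }
}
```

With this shape, supporting a new format means adding one more `FormatPlugin` implementation to the list passed to the dispatcher, without touching the existing plugins.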

Further, the HTML plugin supports specializations, which refine the conversion process for specific
server software, such as Javadoc, MediaWiki, and phpBB.  This improves the processing of
common types of websites, since it is hard to build a one-size-fits-all heuristic that decides
which parts of a document are important and still does justice to every website.
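
A minimal sketch of how a specialization might be chosen is shown below. It assumes the server software is guessed from the value of the page's `<meta name="generator">` tag, which is an assumption made for illustration and not necessarily how `HtmlProcessorSpecializations` decides. `SiteSoftware` and `SpecializationSelector` are hypothetical names.

```java
import java.util.Locale;

// Hypothetical sketch; the names and the selection rule are illustrative assumptions.
enum SiteSoftware { MEDIAWIKI, PHPBB, JAVADOC, GENERIC }

final class SpecializationSelector {
    /**
     * Picks a specialization from the generator string (assumed here to come from
     * a meta generator tag), falling back to GENERIC when nothing matches.
     */
    static SiteSoftware select(String generator) {
        if (generator == null) {
            return SiteSoftware.GENERIC;
        }
        String g = generator.toLowerCase(Locale.ROOT);
        if (g.contains("mediawiki")) return SiteSoftware.MEDIAWIKI;
        if (g.contains("phpbb")) return SiteSoftware.PHPBB;
        if (g.contains("javadoc")) return SiteSoftware.JAVADOC;
        return SiteSoftware.GENERIC;
    }
}
```

The important part is the fallback: pages that match no known software still go through the general-purpose heuristic.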

## Anchor Text

The converting process also supports supplementing the data with external information, such as anchor texts.
This is done automatically if `atags.parquet` is available in the `data/` directory.  `atags.parquet` can be
downloaded from [downloads.marginalia.nu/exports](https://downloads.marginalia.nu/exports/).
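
The "use it if it is there" behaviour can be sketched as follows. Only the file name and the `data/` directory come from the text above; `AnchorTagsLocator` and its `find` method are hypothetical, and the real converter may locate the file differently.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;

// Hypothetical helper; only the file name and the data/ directory come from this readme.
final class AnchorTagsLocator {
    static final String FILE_NAME = "atags.parquet";

    /** Returns the path to atags.parquet if it exists in dataDir, otherwise empty. */
    static Optional<Path> find(Path dataDir) {
        Path candidate = dataDir.resolve(FILE_NAME);
        return Files.isRegularFile(candidate) ? Optional.of(candidate) : Optional.empty();
    }
}
```

A caller would then enable anchor text processing only when `AnchorTagsLocator.find(Path.of("data"))` returns a value, and otherwise convert without it.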

The rationale for doing this, as well as the details of how the file is generated, are described in this blog post:
https://www.marginalia.nu/log/93_atags/

## Central Classes

* [ConverterMain](src/main/java/nu/marginalia/converting/ConverterMain.java) orchestrates the conversion process.
* [DocumentProcessor](src/main/java/nu/marginalia/converting/processor/DocumentProcessor.java) converts a single document.
  * [HtmlDocumentProcessorPlugin](src/main/java/nu/marginalia/converting/processor/plugin/HtmlDocumentProcessorPlugin.java)
    contains HTML-specific logic for documents and keywords, and identifies features such as whether the document uses JavaScript.
    * [HtmlProcessorSpecializations](src/main/java/nu/marginalia/converting/processor/plugin/specialization/HtmlProcessorSpecializations.java)
    * [XenForoSpecialization](src/main/java/nu/marginalia/converting/processor/plugin/specialization/XenForoSpecialization.java) ...
  * [PlainTextDocumentProcessorPlugin](src/main/java/nu/marginalia/converting/processor/plugin/PlainTextDocumentProcessorPlugin.java)
    has plain text-specific logic related to a document...

* [DomainProcessor](src/main/java/nu/marginalia/converting/processor/DomainProcessor.java) converts each document and 
generates domain-wide metadata such as link graphs.

## See Also

* [features-convert](../../features-convert/)