# Processes
## 1. Crawl Process
The [crawling-process](crawling-process/) fetches website contents and saves them
as compressed JSON models described in [crawling-model](../process-models/crawling-model/).
What gets crawled is determined by a crawl job specification, which [tools/crawl-job-extractor](../tools/crawl-job-extractor/)
generates from the content in the database.
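
As a rough illustration of what one specification record could look like, here is a minimal Python sketch based on the `ID, Domain, Urls[]` fields shown in the overview diagram. The field names, the one-record-per-line layout, and the use of gzip are assumptions made for the example; the actual format is whatever the crawl job extractor and the crawling-model define.

```python
import gzip
import io
import json
from dataclasses import dataclass, asdict
from typing import List


@dataclass
class CrawlSpecRecord:
    # Hypothetical field names, modelled on "ID, Domain, Urls[]"
    id: str
    domain: str
    urls: List[str]


def write_specs(records, stream):
    """Write one JSON object per line, gzip-compressed (assumed layout)."""
    with gzip.GzipFile(fileobj=stream, mode="wb") as gz:
        for record in records:
            gz.write((json.dumps(asdict(record)) + "\n").encode("utf-8"))


def read_specs(stream):
    """Read the records back from a gzip-compressed stream."""
    with gzip.GzipFile(fileobj=stream, mode="rb") as gz:
        for line in gz:
            yield CrawlSpecRecord(**json.loads(line))


specs = [
    CrawlSpecRecord("0001", "www.example.com", ["https://www.example.com/"]),
    CrawlSpecRecord("0002", "example.org", ["https://example.org/a", "https://example.org/b"]),
]

# Round-trip through an in-memory buffer instead of a real file.
buffer = io.BytesIO()
write_specs(specs, buffer)
buffer.seek(0)
restored = list(read_specs(buffer))
```

Streaming one record per line means a reader never has to hold the whole specification in memory, which matters when there are many domains.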
## 2. Converting Process
The [converting-process](converting-process/) reads the crawl data produced by the crawling step,
extracts keywords and metadata, and saves the results as parquet files
described in [processed-data](../process-models/processed-data/).
## 3. Loading Process
The [loading-process](loading-process/) reads the processed data.
It creates an [index journal](../features-index/index-journal),
a [link database](../common/linkdb),
and loads domains and domain-links
into the [MariaDB database](../common/db).
## 4. Index Construction Process
The [index-constructor-process](index-constructor-process/) constructs indices from
the data generated by the loader.
## Overview
Schematically, the pipeline from crawling to index construction looks like this:
```
//====================\\
|| Compressed JSON:   ||  Specifications
|| ID, Domain, Urls[] ||  File
|| ID, Domain, Urls[] ||
|| ID, Domain, Urls[] ||
|| ...                ||
\\====================//
      |
+-----------+
| CRAWLING  |  Fetch each URL and
|   STEP    |  output to file
+-----------+
      |
//========================\\
|| Compressed JSON:       ||  Crawl
|| Status, HTML[], ...    ||  Files
|| Status, HTML[], ...    ||
|| Status, HTML[], ...    ||
|| ...                    ||
\\========================//
      |
+------------+
| CONVERTING |  Analyze HTML and
|    STEP    |  extract keywords
+------------+  features, links, URLs
      |
//==================\\
|| Parquet:         ||  Processed
|| Documents[]      ||  Files
|| Domains[]        ||
|| Links[]          ||
\\==================//
      |
+------------+  Insert domains into mariadb
|  LOADING   |  Insert URLs, titles in link DB
|    STEP    |  Insert keywords in Index
+------------+
      |
+------------+
| CONSTRUCT  |  Make the data searchable
|   INDEX    |
+------------+
```
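
The flow in the diagram can be mimicked end to end with a toy Python sketch. Everything here is hypothetical and in-memory: the function names, the dictionary fields, and the canned `PAGES` stand-in for fetching are made up for illustration, and the real processes are separate programs that exchange files rather than function calls.

```python
import re
from collections import defaultdict

# Stand-in for the network: canned HTML keyed by URL (illustrative only).
PAGES = {
    "https://example.com/": "<html><title>Cats</title><body>cats purr</body></html>",
    "https://example.com/dogs": "<html><title>Dogs</title><body>dogs bark</body></html>",
}


def crawl(spec):
    """Crawling step: 'fetch' each URL in the spec and emit a crawl record."""
    return [
        {"url": url, "status": "OK" if url in PAGES else "ERROR", "html": PAGES.get(url, "")}
        for url in spec["urls"]
    ]


def convert(crawl_records):
    """Converting step: skip failed fetches, extract a title and keywords."""
    documents = []
    for record in crawl_records:
        if record["status"] != "OK":
            continue
        title = re.search(r"<title>(.*?)</title>", record["html"])
        text = re.sub(r"<[^>]+>", " ", record["html"])  # crude tag stripping
        documents.append({
            "url": record["url"],
            "title": title.group(1) if title else "",
            "keywords": sorted(set(text.lower().split())),
        })
    return documents


def load(documents):
    """Loading step: split documents into link data and a keyword journal."""
    link_db = {doc["url"]: doc["title"] for doc in documents}
    journal = [(keyword, doc["url"]) for doc in documents for keyword in doc["keywords"]]
    return link_db, journal


def construct_index(journal):
    """Index construction: turn the journal into a keyword -> URLs lookup."""
    index = defaultdict(set)
    for keyword, url in journal:
        index[keyword].add(url)
    return index


spec = {"id": "0001", "domain": "example.com",
        "urls": list(PAGES) + ["https://example.com/missing"]}
link_db, journal = load(convert(crawl(spec)))
index = construct_index(journal)
```

Each function consumes only the output of the one before it, which mirrors how every step in the diagram reads the files written by the previous step.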