# Processes

## 1. Crawl Process

The [crawling-process](crawling-process/) fetches website contents and saves them
as compressed JSON models described in [crawling-model](../process-models/crawling-model/).

The operation is specified by a crawl job specification. This is generated by [tools/crawl-job-extractor](../tools/crawl-job-extractor/)
based on the content in the database.
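As a sketch of the general shape of such a specification, assuming gzip-compressed JSON records with `id`, `domain` and `urls` fields (the actual layout is defined by crawl-job-extractor and the crawling model, not here):

```python
# Hypothetical sketch of a crawl job specification file: gzip-compressed
# JSON records, one per domain. Field names are assumptions for
# illustration, not the project's actual schema.
import gzip
import json

def write_spec(path, records):
    """Write crawl job records as gzip-compressed JSON lines."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")

def read_spec(path):
    """Yield crawl job records back from a compressed specification file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

records = [{"id": "1", "domain": "www.example.com",
            "urls": ["https://www.example.com/"]}]
write_spec("spec.json.gz", records)
assert list(read_spec("spec.json.gz")) == records
```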

## 2. Converting Process

The [converting-process](converting-process/) reads crawl data from the crawling step and
processes it, extracting keywords and metadata, and saves the results as parquet files
described in [processed-data](../process-models/processed-data/).
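To illustrate the kind of analysis the converting step performs, here is a toy keyword extractor built on the Python standard library; the real converter does far more (metadata, features, link extraction) and is not implemented like this:

```python
# Illustrative only: a toy version of keyword extraction from HTML,
# using just the standard library's html.parser and collections.
from collections import Counter
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text content, skipping script and style tags."""
    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.chunks = []
    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1
    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth > 0:
            self.skip_depth -= 1
    def handle_data(self, data):
        if self.skip_depth == 0:
            self.chunks.append(data)

def top_keywords(html, n=5):
    """Return the n most frequent words in the document's text."""
    parser = TextExtractor()
    parser.feed(html)
    words = "".join(parser.chunks).lower().split()
    return [w for w, _ in Counter(words).most_common(n)]

html = "<html><body><p>search engine search index</p></body></html>"
print(top_keywords(html, 2))  # ['search', 'engine']
```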

## 3. Loading Process

The [loading-process](loading-process/) reads the processed data.

It creates an [index journal](../features-index/index-journal),
a [link database](../common/linkdb),
and loads domains and domain-links
into the [MariaDB database](../common/db).
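As an illustration of the loading step's database work, here is a sketch using sqlite3 in place of MariaDB, with an invented two-table schema; the actual schema lives in common/db and differs from this:

```python
# Stand-in sketch of loading domains and domain-links, using sqlite3
# instead of MariaDB. Table and column names are made up for
# illustration, not the project's real schema.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE domain (id INTEGER PRIMARY KEY, name TEXT UNIQUE)")
conn.execute("""CREATE TABLE domain_link (
    source INTEGER REFERENCES domain(id),
    dest   INTEGER REFERENCES domain(id))""")

def load_domain(name):
    """Insert a domain if it is new, returning its id either way."""
    conn.execute("INSERT OR IGNORE INTO domain (name) VALUES (?)", (name,))
    return conn.execute("SELECT id FROM domain WHERE name = ?",
                        (name,)).fetchone()[0]

def load_link(source, dest):
    """Record a link between two domains, creating them as needed."""
    conn.execute("INSERT INTO domain_link VALUES (?, ?)",
                 (load_domain(source), load_domain(dest)))

load_link("www.example.com", "other.example.com")
count = conn.execute("SELECT COUNT(*) FROM domain").fetchone()[0]
print(count)  # 2
```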

## 4. Index Construction Process

The [index-construction-process](index-constructor-process/) constructs indices from
the data generated by the loader.
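The core idea of index construction is inverting document-to-keyword records into keyword-to-document posting lists. A minimal sketch, with the journal modeled as in-memory pairs rather than the real on-disk index journal format:

```python
# Toy sketch of index construction: turning (doc_id, keywords) records
# into an inverted keyword -> sorted-doc-ids index. The real process
# builds compact on-disk structures; this shows only the inversion.
from collections import defaultdict

def construct_index(journal):
    """journal: iterable of (doc_id, [keyword, ...]) pairs."""
    index = defaultdict(list)
    for doc_id, keywords in journal:
        for keyword in set(keywords):
            index[keyword].append(doc_id)
    # Sorted posting lists allow efficient intersection at query time.
    return {kw: sorted(docs) for kw, docs in index.items()}

journal = [(1, ["bird", "watching"]), (2, ["bird", "feeder"])]
index = construct_index(journal)
print(index["bird"])  # [1, 2]
```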

## Overview

Schematically the crawling and loading process looks like this:

```
//====================\\
|| Compressed JSON:   ||  Specifications
|| ID, Domain, Urls[] ||  File
|| ID, Domain, Urls[] ||
|| ID, Domain, Urls[] ||
|| ...                ||
\\====================//

    +-----------+
    | CRAWLING  |  Fetch each URL and
    |   STEP    |  output to file
    +-----------+

//========================\\
|| Compressed JSON:       ||  Crawl
|| Status, HTML[], ...    ||  Files
|| Status, HTML[], ...    ||
|| Status, HTML[], ...    ||
|| ...                    ||
\\========================//

    +------------+
    | CONVERTING |  Analyze HTML and
    |    STEP    |  extract keywords,
    +------------+  features, links, URLs

//==================\\
|| Parquet:         ||  Processed
||  Documents[]     ||  Files
||  Domains[]       ||
||  Links[]         ||
\\==================//

    +------------+  Insert domains into MariaDB
    |  LOADING   |  Insert URLs, titles in link DB
    |    STEP    |  Insert keywords in index
    +------------+

    +------------+
    | CONSTRUCT  |  Make the data searchable
    |   INDEX    |
    +------------+
```