# Crawl
## 1. Crawl Job Extractor
The [crawl-job-extractor-process](crawl-job-extractor-process/) creates a crawl job specification
based on the contents of the database.
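Each entry in the specification pairs a domain with the list of URLs to fetch (see the diagram
under Overview). As a minimal sketch with hypothetical field names, the shape of one entry
might be modeled as:

```java
import java.util.List;

// Illustrative only: the actual model lives in the crawling-model module,
// and its field names may differ from this sketch.
record CrawlJobSpec(String id, String domain, List<String> urls) { }
```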
## 2. Crawl Process
The [crawling-process](crawling-process/) fetches website contents and saves them
as compressed JSON models described in [crawling-model](crawling-model/).
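As a rough illustration of the storage pattern, here is a minimal sketch of appending one JSON
record per fetched document to a compressed file. It assumes gzip compression and Gson for
serialization; the actual crawler's format and compression choice may differ:

```java
import com.google.gson.Gson;
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.zip.GZIPOutputStream;

class CrawlDataWriter implements AutoCloseable {
    private final Writer out;
    private final Gson gson = new Gson();

    CrawlDataWriter(Path file) throws IOException {
        // One gzip stream per output file; records are written as JSON lines
        out = new OutputStreamWriter(
                new GZIPOutputStream(Files.newOutputStream(file)),
                StandardCharsets.UTF_8);
    }

    /** Append one fetched document as a single JSON line. */
    void write(Object record) throws IOException {
        out.write(gson.toJson(record));
        out.write('\n');
    }

    @Override
    public void close() throws IOException {
        out.close();
    }
}
```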
## 3. Converting Process
The [converting-process](converting-process/) reads the crawl data from the crawling step and
processes it, extracting keywords and metadata, and saves the results as compressed JSON models
described in [converting-model](converting-model/).
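Conceptually, the analysis parses each stored HTML document and pulls out the visible text and
outgoing links. A much-simplified sketch using jsoup (the real keyword extraction is far more
involved than this naive tokenization):

```java
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import java.util.*;

class DocumentAnalyzer {
    /** Crude stand-in for keyword extraction: tokenize the visible text. */
    List<String> extractKeywords(Document doc) {
        return Arrays.stream(doc.text().toLowerCase().split("\\W+"))
                .filter(word -> word.length() > 2)
                .distinct()
                .toList();
    }

    /** Collect absolute outgoing links from anchor tags. */
    List<String> extractLinks(Document doc) {
        List<String> links = new ArrayList<>();
        for (Element a : doc.select("a[href]"))
            links.add(a.absUrl("href"));
        return links;
    }
}
```

A document would first be parsed with `Jsoup.parse(html, baseUrl)` before being handed to these
methods, so that relative links can be resolved to absolute URLs.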
## 4. Loading Process
The [loading-process](loading-process/) reads the processed data, creates an index journal
and lexicon, and loads domains and addresses into the MariaDB database.
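On the database side, the loading boils down to bulk inserts. A hedged sketch using JDBC
batching, with illustrative table and column names rather than the actual schema:

```java
import java.sql.*;
import java.util.List;

class DomainLoader {
    /** Batch-insert domains; table and column names here are illustrative only. */
    void loadDomains(Connection conn, List<String> domains) throws SQLException {
        String sql = "INSERT IGNORE INTO EC_DOMAIN (DOMAIN_NAME) VALUES (?)";
        try (PreparedStatement stmt = conn.prepareStatement(sql)) {
            for (String domain : domains) {
                stmt.setString(1, domain);
                stmt.addBatch();
            }
            // Execute all queued inserts in one round trip
            stmt.executeBatch();
        }
    }
}
```

`INSERT IGNORE` is MariaDB/MySQL syntax for skipping rows that would violate a unique key,
which keeps the load idempotent if a domain has already been inserted.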
## Overview
Schematically the crawling and loading process looks like this:
```
    //====================\\
    ||  Compressed JSON:  ||  Specifications
    ||  ID, Domain, Urls[]||  File
    ||  ID, Domain, Urls[]||
    ||  ID, Domain, Urls[]||
    ||  ...               ||
    \\====================//
              |
        +-----------+
        |  CRAWLING |  Fetch each URL and
        |    STEP   |  output to file
        +-----------+
              |
    //========================\\
    ||  Compressed JSON:      ||  Crawl
    ||  Status, HTML[], ...   ||  Files
    ||  Status, HTML[], ...   ||
    ||  Status, HTML[], ...   ||
    ||  ...                   ||
    \\========================//
              |
        +------------+
        | CONVERTING |  Analyze HTML and
        |    STEP    |  extract keywords
        +------------+  features, links, URLs
              |
    //==================\\
    ||  Compressed JSON:||  Processed
    ||  URLs[]          ||  Files
    ||  Domains[]       ||
    ||  Links[]         ||
    ||  Keywords[]      ||
    ||  ...             ||
    ||  URLs[]          ||
    ||  Domains[]       ||
    ||  Links[]         ||
    ||  Keywords[]      ||
    ||  ...             ||
    \\==================//
              |
        +------------+
        |  LOADING   |  Insert URLs in DB
        |    STEP    |  Insert keywords in Index
        +------------+
```