Processes
1. Crawl Process
The crawling-process fetches website contents and saves them as compressed JSON models described in crawling-model.
The operation is specified by a crawl specification, which can be created in the control GUI.
2. Converting Process
The converting-process reads the crawl data produced by the crawling step, extracts keywords and metadata, and saves the results as Parquet files, as described in processed-data.
3. Loading Process
The loading-process reads the processed data.
It creates an index journal and a link database, and loads domains and domain-links into the MariaDB database.
4. Index Construction Process
The index-construction-process constructs indices from the data generated by the loader.
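As a rough illustration of the hand-off between the crawl and converting steps, the sketch below writes a few crawl records as compressed JSON, reads them back, and pulls keywords out of the HTML. It is not the project's code: the gzip-compressed JSON-lines layout, the record fields (`url`, `status`, `html`), and the function names are all assumptions made for the example. The real models are described in crawling-model and processed-data.

```python
import gzip
import io
import json
import re
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def write_crawl_records(records):
    """Serialize records as gzip-compressed JSON lines (assumed layout)."""
    buf = io.BytesIO()
    with gzip.open(buf, "wt", encoding="utf-8") as out:
        for record in records:
            out.write(json.dumps(record) + "\n")
    return buf.getvalue()


def read_crawl_records(blob):
    """Yield one record per non-empty line of the compressed blob."""
    with gzip.open(io.BytesIO(blob), "rt", encoding="utf-8") as src:
        for line in src:
            if line.strip():
                yield json.loads(line)


def extract_keywords(html, top_n=5):
    """Return the most frequent words of three or more letters in the visible text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    words = re.findall(r"[a-z]{3,}", " ".join(parser.parts).lower())
    return [word for word, _ in Counter(words).most_common(top_n)]
```

A converter built along these lines would iterate over `read_crawl_records(...)`, call `extract_keywords(record["html"])` for each successfully fetched document, and write the result to the processed-data files.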
Overview
Schematically, the crawling and loading process looks like this:
    //====================\\
    || Compressed JSON:   ||  Specifications
    || ID, Domain, Urls[] ||  File
    || ID, Domain, Urls[] ||
    || ID, Domain, Urls[] ||
    ||      ...           ||
    \\====================//
          |
    +-----------+
    |  CRAWLING |  Fetch each URL and
    |    STEP   |  output to file
    +-----------+
          |
    //========================\\
    ||  Compressed JSON:      ||  Crawl
    ||  Status, HTML[], ...   ||  Files
    ||  Status, HTML[], ...   ||
    ||  Status, HTML[], ...   ||
    ||     ...                ||
    \\========================//
          |
    +------------+
    | CONVERTING |  Analyze HTML and
    |    STEP    |  extract keywords
    +------------+  features, links, URLs
          |
    //==================\\
    || Parquet:         ||  Processed
    ||  Documents[]     ||  Files
    ||  Domains[]       ||
    ||  Links[]         ||
    \\==================//
          |
    +------------+  Insert domains into mariadb
    |  LOADING   |  Insert URLs, titles in link DB
    |    STEP    |  Insert keywords in Index
    +------------+
          |
    +------------+
    | CONSTRUCT  |  Make the data searchable
    |   INDEX    |
    +------------+
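
The data flow in the diagram can be mimicked with four small functions, each consuming the output of the previous one. This is only a toy model under stated assumptions: everything lives in memory, the dictionaries stand in for the compressed JSON, Parquet, and MariaDB storage, regular expressions replace real HTML analysis, and all function and field names are hypothetical.

```python
import re
from collections import defaultdict


def crawl(spec, fetch):
    """Crawling step: fetch every URL listed in the specification."""
    return [
        {"domain": entry["domain"], "url": url, "status": "OK", "html": fetch(url)}
        for entry in spec
        for url in entry["urls"]
    ]


def convert(crawl_records):
    """Converting step: derive documents, domains, and links from the HTML."""
    documents, domains, links = [], set(), []
    for rec in crawl_records:
        # Crude tag stripping; a stand-in for real HTML analysis.
        text = re.sub(r"<[^>]+>", " ", rec["html"])
        keywords = sorted(set(re.findall(r"[a-z]{3,}", text.lower())))
        documents.append({"url": rec["url"], "keywords": keywords})
        domains.add(rec["domain"])
        for target in re.findall(r'href="([^"]+)"', rec["html"]):
            links.append((rec["url"], target))
    return {"documents": documents, "domains": sorted(domains), "links": links}


def load(processed):
    """Loading step: produce a domain list, a link list, and a keyword journal."""
    journal = [
        (doc["url"], keyword)
        for doc in processed["documents"]
        for keyword in doc["keywords"]
    ]
    return {
        "domains": processed["domains"],
        "links": processed["links"],
        "journal": journal,
    }


def construct_index(loaded):
    """Index construction: turn the journal into a keyword -> URLs lookup."""
    index = defaultdict(list)
    for url, keyword in loaded["journal"]:
        index[keyword].append(url)
    return dict(index)
```

Given a specification such as `[{"id": 1, "domain": "example.com", "urls": [...]}]` and a `fetch` callable that returns HTML for a URL, the whole chain is `construct_index(load(convert(crawl(spec, fetch))))`. Keeping each stage a function of the previous stage's output mirrors how the real processes communicate only through the files shown between the boxes.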