Crawl
1. Crawl Job Extractor
The crawl-job-extractor-process creates a crawl job specification based on the content in the database (a sketch of this compressed-JSON hand-off follows the list).
2. Crawl Process
The crawling-process fetches website contents and saves them as compressed JSON models described in crawling-model.
3. Converting Process
The converting-process reads the crawl data from the crawling step, processes it by extracting keywords and metadata, and saves the results as compressed JSON models described in converting-model.
4. Loading Process
The loading-process reads the processed data, creates an index journal and lexicon, and loads domains and addresses into the MariaDB database.
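
As a concrete illustration of the compressed-JSON hand-off, the sketch below writes crawl job specifications as gzip-compressed, line-delimited JSON. It is a minimal sketch only: the CrawlJobSpec record and its field names are assumptions inferred from the diagram in the overview, and the authoritative model classes live in crawling-model.

    import com.google.gson.Gson;

    import java.io.*;
    import java.nio.charset.StandardCharsets;
    import java.util.List;
    import java.util.zip.GZIPOutputStream;

    // Hypothetical shape inferred from the diagram in the overview;
    // the authoritative model lives in crawling-model.
    record CrawlJobSpec(String id, String domain, List<String> urls) {}

    class CrawlJobSpecWriter {
        private static final Gson GSON = new Gson();

        // Write one JSON object per line into a gzip-compressed file,
        // mirroring the compressed-JSON hand-off between pipeline steps.
        static void write(File out, List<CrawlJobSpec> specs) throws IOException {
            try (var writer = new OutputStreamWriter(
                    new GZIPOutputStream(new FileOutputStream(out)),
                    StandardCharsets.UTF_8)) {
                for (CrawlJobSpec spec : specs) {
                    writer.write(GSON.toJson(spec));
                    writer.write('\n');
                }
            }
        }
    }

Writing one JSON object per line keeps the file streamable, so the next step can consume specifications without loading the whole file into memory.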
Overview
Schematically, the crawling and loading process looks like this:
    //====================\\
    ||  Compressed JSON:  ||       Specifications
    ||  ID, Domain, Urls[]||       File
    ||  ID, Domain, Urls[]||
    ||  ID, Domain, Urls[]||
    ||  ...               ||
    \\====================//
              |
        +-----------+
        |  CRAWLING |     Fetch each URL and
        |    STEP   |     output to file
        +-----------+
              |
    //========================\\
    ||  Compressed JSON:      ||   Crawl
    ||  Status, HTML[], ...   ||   Files
    ||  Status, HTML[], ...   ||
    ||  Status, HTML[], ...   ||
    ||  ...                   ||
    \\========================//
              |
        +------------+
        | CONVERTING |    Analyze HTML and
        |    STEP    |    extract keywords,
        +------------+    features, links, URLs
              |
    //==================\\
    ||  Compressed JSON:||     Processed
    ||  URLs[]          ||     Files
    ||  Domains[]       ||
    ||  Links[]         ||
    ||  Keywords[]      ||
    ||  ...             ||
    ||  URLs[]          ||
    ||  Domains[]       ||
    ||  Links[]         ||
    ||  Keywords[]      ||
    ||  ...             ||
    \\==================//
              |
        +------------+
        |  LOADING   |    Insert URLs in DB
        |    STEP    |    Insert keywords in Index
        +------------+
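
To make the loading step concrete, the sketch below reads a processed file and batch-inserts its URLs over JDBC. Everything project-specific here is invented for illustration: the ProcessedDomain record merely mirrors the Processed Files box above, and the urls table with its url column stands in for the real MariaDB schema. Gson 2.10 or later is assumed, since earlier versions cannot deserialize records.

    import com.google.gson.Gson;

    import java.io.*;
    import java.nio.charset.StandardCharsets;
    import java.sql.Connection;
    import java.util.List;
    import java.util.zip.GZIPInputStream;

    // Hypothetical shape matching the "Processed Files" box above;
    // the authoritative model lives in converting-model.
    record ProcessedDomain(List<String> urls, List<String> domains,
                           List<String> links, List<String> keywords) {}

    class UrlLoader {
        private static final Gson GSON = new Gson(); // 2.10+ deserializes records

        // Read gzip-compressed, line-delimited JSON and batch-insert each URL.
        // The table and column names are illustrative, not the project's schema.
        static void load(File processedFile, Connection conn) throws Exception {
            try (var reader = new BufferedReader(new InputStreamReader(
                     new GZIPInputStream(new FileInputStream(processedFile)),
                     StandardCharsets.UTF_8));
                 var stmt = conn.prepareStatement(
                     "INSERT INTO urls (url) VALUES (?)")) {
                String line;
                while ((line = reader.readLine()) != null) {
                    ProcessedDomain domain = GSON.fromJson(line, ProcessedDomain.class);
                    for (String url : domain.urls()) {
                        stmt.setString(1, url);
                        stmt.addBatch();
                    }
                }
                stmt.executeBatch();
            }
        }
    }

Batching the inserts with addBatch/executeBatch avoids a database round-trip per URL, which matters at crawl scale.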