Processes
1. Crawl Process
The crawling-process fetches website contents and saves them as compressed JSON models described in crawling-model.
The crawl is driven by a crawl job specification, which tools/crawl-job-extractor generates from the contents of the database.
2. Converting Process
The converting-process reads the crawl data produced by the crawling step, extracts keywords and metadata from it, and saves the results as compressed JSON models described in converting-model.
3. Loading Process
The loading-process reads the processed data, creates an index journal and lexicon, and loads domains and addresses into the MariaDB database.
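The three processes hand data to each other as compressed JSON files. The following is a minimal sketch of that hand-off, not the project's actual code: it assumes gzip compression with one JSON object per line, and the helper names `write_records` and `read_records` are hypothetical. The real models are the ones described in crawling-model and converting-model.

```python
import gzip
import json


def write_records(path, records):
    """Write one JSON object per line to a gzip-compressed file.

    Assumption: gzip and a line-per-record layout; the README only
    says "compressed JSON".
    """
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def read_records(path):
    """Yield the JSON objects back in the order they were written."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)
```

A specification record in this sketch would look like `{"id": "1", "domain": "example.com", "urls": ["https://example.com/"]}`, mirroring the ID, Domain, Urls[] fields of the specifications file. Streaming one record at a time keeps memory use flat regardless of file size.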
Overview
Schematically, the crawling, converting, and loading processes look like this:
    //====================\\
    || Compressed JSON:   ||  Specifications
    || ID, Domain, Urls[] ||  File
    || ID, Domain, Urls[] ||
    || ID, Domain, Urls[] ||
    || ...                ||
    \\====================//
          |
    +-----------+
    | CRAWLING  |  Fetch each URL and
    | STEP      |  output to file
    +-----------+
          |
    //========================\\
    || Compressed JSON:       ||  Crawl
    || Status, HTML[], ...    ||  Files
    || Status, HTML[], ...    ||
    || Status, HTML[], ...    ||
    || ...                    ||
    \\========================//
          |
    +------------+
    | CONVERTING |  Analyze HTML and
    | STEP       |  extract keywords
    +------------+  features, links, URLs
          |
    //==================\\
    || Compressed JSON: ||  Processed
    || URLs[]           ||  Files
    || Domains[]        ||
    || Links[]          ||
    || Keywords[]       ||
    || ...              ||
    || URLs[]           ||
    || Domains[]        ||
    || Links[]          ||
    || Keywords[]       ||
    || ...              ||
    \\==================//
          |
    +------------+
    | LOADING    |  Insert URLs in DB
    | STEP       |  Insert keywords in Index
    +------------+
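
The flow in the diagram can be sketched as three chained functions. This is an illustrative toy under stated assumptions, not the project's implementation: the functions `crawl`, `convert`, and `load`, the record field names, the caller-supplied `fetch` function, and the plain dictionaries standing in for the database and the index are all hypothetical. Keyword extraction here is a naive lowercase word split.

```python
import re
from html.parser import HTMLParser


class _PageParser(HTMLParser):
    """Collects text nodes and href targets from one HTML document."""

    def __init__(self):
        super().__init__()
        self.text = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        self.text.append(data)


def crawl(spec, fetch):
    """Crawling step: fetch each URL named in a specification record.

    `fetch` is a caller-supplied function returning (status, html).
    """
    docs = []
    for url in spec["urls"]:
        status, html = fetch(url)
        docs.append({"url": url, "status": status, "html": html})
    return {"domain": spec["domain"], "docs": docs}


def convert(crawled):
    """Converting step: extract keywords and links from fetched HTML."""
    processed = []
    for doc in crawled["docs"]:
        if doc["status"] != 200:
            continue  # skip documents that were not fetched successfully
        parser = _PageParser()
        parser.feed(doc["html"])
        words = re.findall(r"[a-z]+", " ".join(parser.text).lower())
        processed.append({
            "url": doc["url"],
            "domain": crawled["domain"],
            "links": parser.links,
            "keywords": sorted(set(words)),
        })
    return processed


def load(processed, url_table, index):
    """Loading step: record URLs per domain and build an inverted index."""
    for doc in processed:
        url_table.setdefault(doc["domain"], []).append(doc["url"])
        for keyword in doc["keywords"]:
            index.setdefault(keyword, set()).add(doc["url"])
```

Calling `load(convert(crawl(spec, fetch)), url_table, index)` runs one specification record through all three steps. Passing `fetch` in as a parameter keeps the sketch runnable without network access.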