# Processes

## 1. Crawl Process

The [crawling-process](crawling-process/) fetches website contents and saves them
as compressed JSON models described in [crawling-model](../process-models/crawling-model/).

The operation is specified by a crawl job specification. This is generated by
[tools/crawl-job-extractor](../tools/crawl-job-extractor/) based on the content in the database.

## 2. Converting Process

The [converting-process](converting-process/) reads the crawl data produced by the crawling step,
extracts keywords and metadata, and saves the results as compressed JSON models
described in [converting-model](../process-models/converting-model/).

## 3. Loading Process

The [loading-process](loading-process/) reads the processed data, creates an index journal
and lexicon, and loads domains and addresses into the MariaDB database.

## Overview

Schematically, the crawling, converting, and loading steps look like this:

```
    //====================\\
    || Compressed JSON:   || Specifications
    || ID, Domain, Urls[] || File
    || ID, Domain, Urls[] ||
    || ID, Domain, Urls[] ||
    ||      ...           ||
    \\====================//
          |
    +-----------+
    |  CRAWLING |  Fetch each URL and
    |    STEP   |  output to file
    +-----------+
          |
    //========================\\
    ||  Compressed JSON:      || Crawl
    ||  Status, HTML[], ...   || Files
    ||  Status, HTML[], ...   ||
    ||  Status, HTML[], ...   ||
    ||     ...                ||
    \\========================//
          |
    +------------+
    | CONVERTING |  Analyze HTML and
    |    STEP    |  extract keywords,
    +------------+  features, links, URLs
          |
    //==================\\
    || Compressed JSON: || Processed
    ||  URLs[]          || Files
    ||  Domains[]       ||
    ||  Links[]         ||
    ||  Keywords[]      ||
    ||    ...           ||
    ||  URLs[]          ||
    ||  Domains[]       ||
    ||  Links[]         ||
    ||  Keywords[]      ||
    ||    ...           ||
    \\==================//
          |
    +------------+
    |  LOADING   | Insert URLs in DB
    |    STEP    | Insert keywords in Index
    +------------+
```
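
To make the data flow between the three steps concrete, here is a minimal Python sketch. It is not the project's implementation, only an illustration under several assumptions: gzip stands in for the unspecified compression format, records are stored one JSON object per line, and all field names (`domain`, `urls`, `status`, `html`, `keywords`) and function names are hypothetical. A plain list and dict stand in for the MariaDB database and the index, and `fake_fetch` replaces a real HTTP client so the example runs offline.

```python
import gzip
import json
import re
from pathlib import Path
from tempfile import TemporaryDirectory


def write_records(path, records):
    """Write one JSON object per line to a gzip-compressed file (assumed format)."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def read_records(path):
    """Yield JSON objects from a gzip-compressed, line-delimited file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


def crawl(spec_path, crawl_path, fetch):
    """Crawling step: fetch each URL in the specification and store the result."""
    results = []
    for spec in read_records(spec_path):
        for url in spec["urls"]:
            status, html = fetch(url)
            results.append({"domain": spec["domain"], "url": url,
                            "status": status, "html": html})
    write_records(crawl_path, results)


def convert(crawl_path, processed_path):
    """Converting step: extract keywords from successfully fetched documents."""
    processed = []
    for doc in read_records(crawl_path):
        if doc["status"] != 200:
            continue  # skip failed fetches
        text = re.sub(r"<[^>]+>", " ", doc["html"])  # crude tag stripping
        keywords = sorted(set(re.findall(r"[a-z]+", text.lower())))
        processed.append({"domain": doc["domain"], "url": doc["url"],
                          "keywords": keywords})
    write_records(processed_path, processed)


def load(processed_path):
    """Loading step: in-memory stand-ins for the URL database and keyword index."""
    urls, index = [], {}
    for doc in read_records(processed_path):
        urls.append(doc["url"])
        for keyword in doc["keywords"]:
            index.setdefault(keyword, []).append(doc["url"])
    return urls, index


def fake_fetch(url):
    """Hypothetical stand-in for an HTTP client."""
    pages = {"https://example.com/": (200, "<p>Hello search engine</p>")}
    return pages.get(url, (404, ""))


def run_pipeline():
    """Run the three steps in order, passing files between them."""
    with TemporaryDirectory() as tmp:
        spec = Path(tmp) / "spec.json.gz"
        crawled = Path(tmp) / "crawl.json.gz"
        processed = Path(tmp) / "processed.json.gz"
        write_records(spec, [{"id": 1, "domain": "example.com",
                              "urls": ["https://example.com/",
                                       "https://example.com/missing"]}])
        crawl(spec, crawled, fake_fetch)
        convert(crawled, processed)
        return load(processed)


if __name__ == "__main__":
    urls, index = run_pipeline()
    print(urls)
    print(index)
```

Each step communicates with the next only through files on disk, which mirrors the diagram: a step can be rerun or replaced without touching the others, as long as the file format between them stays the same.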