# Processes

## 1. Crawl Process

The [crawling-process](crawling-process/) fetches website contents and saves them
as compressed JSON models described in [crawling-model](../process-models/crawling-model/).

The operation is specified by a [crawl specification](../process-models/crawl-spec), which can be created in the control GUI.

## 2. Converting Process

The [converting-process](converting-process/) reads the crawl data produced by the crawling step,
extracts keywords and metadata, and saves the results as Parquet files
described in [processed-data](../process-models/processed-data/).

## 3. Loading Process

The [loading-process](loading-process/) reads the processed data. It creates an
[index journal](../features-index/index-journal) and a [link database](../common/linkdb),
and loads domains and domain-links into the [MariaDB database](../common/db).

## 4. Index Construction Process

The [index-construction-process](index-constructor-process/) constructs indices from
the data generated by the loader.

## Overview

Schematically, the crawling and loading process looks like this:

```
    //====================\\
    || Compressed JSON:   || Specifications
    || ID, Domain, Urls[] || File
    || ID, Domain, Urls[] ||
    || ID, Domain, Urls[] ||
    ||      ...           ||
    \\====================//
          |
    +-----------+
    |  CRAWLING |  Fetch each URL and
    |    STEP   |  output to file
    +-----------+
          |
    //========================\\
    ||  Compressed JSON:      || Crawl
    || Status, HTML[], ...    || Files
    || Status, HTML[], ...    ||
    || Status, HTML[], ...    ||
    ||      ...               ||
    \\========================//
          |
    +------------+
    | CONVERTING |  Analyze HTML and
    |    STEP    |  extract keywords
    +------------+  features, links, URLs
          |
    //==================\\
    || Parquet:         || Processed
    ||  Documents[]     || Files
    ||  Domains[]       ||
    ||  Links[]         ||
    \\==================//
          |
    +------------+ Insert domains into mariadb
    |  LOADING   | Insert URLs, titles in link DB
    |    STEP    | Insert keywords in Index
    +------------+
          |
    +------------+
    | CONSTRUCT  | Make the data searchable
    |   INDEX    |
    +------------+
```
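
To make the data flow concrete, here is a minimal Python sketch of the four steps chained together. It is only an illustration and not the project's code: every function name, field name, and record layout below is hypothetical, the processed data is a plain dict rather than Parquet, the link database is left out, and the "index" is a simple in-memory keyword-to-URL map.

```python
import gzip
import json
import re


def crawl(spec, fetch):
    """Crawling step: fetch each URL in the spec, return compressed JSON."""
    records = [{"url": url, "status": "OK", "html": fetch(url)}
               for url in spec["urls"]]
    payload = {"domain": spec["domain"], "records": records}
    return gzip.compress(json.dumps(payload).encode("utf-8"))


def convert(crawl_blob):
    """Converting step: analyze the HTML and extract keywords per document."""
    data = json.loads(gzip.decompress(crawl_blob).decode("utf-8"))
    documents = []
    for rec in data["records"]:
        # Stand-in for real HTML analysis: drop tags, lowercase, split on whitespace.
        words = re.sub(r"<[^>]+>", " ", rec["html"]).lower().split()
        documents.append({"url": rec["url"], "keywords": sorted(set(words))})
    return {"domain": data["domain"], "documents": documents}


def load(processed, domains, journal):
    """Loading step: record the domain and append keywords to a journal."""
    domains.add(processed["domain"])  # stand-in for the MariaDB insert
    for doc in processed["documents"]:
        for keyword in doc["keywords"]:
            journal.append((keyword, doc["url"]))


def construct_index(journal):
    """Index construction step: turn the journal into a keyword -> URLs map."""
    index = {}
    for keyword, url in journal:
        index.setdefault(keyword, []).append(url)
    return index


# Hypothetical input: one specification entry and a fake fetcher backed by a dict.
spec = {"id": 1, "domain": "example.com", "urls": ["https://example.com/"]}
pages = {"https://example.com/": "<p>Hello search</p>"}

domains, journal = set(), []
load(convert(crawl(spec, pages.get)), domains, journal)
index = construct_index(journal)
```

Each step only consumes what the previous one produced, which mirrors the diagram: the steps communicate through files (here, return values) rather than by calling each other directly.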