CatgirlIntelligenceAgency/code/processes
Viktor Lofgren 02dd5c5853 (converter) Look at properties when deciding pool size
Look at whether the property 'system.conserveProperty' is enabled when deciding the default pool size for the converter.

If true, a much more conservative default is used, limiting the risk of running out of memory.
2024-02-12 16:24:19 +01:00
converting-process (converter) Look at properties when deciding pool size 2024-02-12 16:24:19 +01:00
crawling-process (warc) Minor code clean-up. 2024-02-10 18:30:33 +01:00
index-constructor-process (index-construction) Split repartition into two actions 2024-02-06 17:20:07 +01:00
loading-process (*) Add flag for disabling ASCII flattening 2024-01-31 11:50:59 +01:00
test-data (convert) Wiki specialization that should do a better job at removing junk keywords and providing a useful summary. 2023-11-30 20:04:46 +01:00
website-adjacencies-calculator (*) Overhaul settings and properties 2024-01-13 17:12:18 +01:00
readme.md (doc) Update the readme's the crawler, as they've grown stale. 2024-02-01 18:10:55 +01:00

Processes

1. Crawl Process

The crawling-process fetches website contents, temporarily saves them as WARC files, and then converts them into parquet models. Both formats are described in crawling-model.

The operation is optionally defined by a crawl specification, which can be created in the control GUI.

2. Converting Process

The converting-process reads the crawl data produced by the crawling step, extracts keywords and metadata from it, and saves the result as parquet files, described in processed-data.

3. Loading Process

The loading-process reads the processed data.

It creates an index journal and a link database, and loads domains and domain-links into the MariaDB database.
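
As a rough illustration of that fan-out, the sketch below routes processed documents to three in-memory lists standing in for the MariaDB domain table, the link database, and the index journal. Every class, field, and record layout here is hypothetical and chosen only for illustration; the real loader's types and storage formats are not described in this readme.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the loading step. The three lists are in-memory
// stand-ins; none of these names come from the actual code base.
final class LoaderSketch {

    // One processed document: its domain, URL, title and extracted keywords.
    static final class Document {
        final String domain;
        final String url;
        final String title;
        final List<String> keywords;

        Document(String domain, String url, String title, List<String> keywords) {
            this.domain = domain;
            this.url = url;
            this.title = title;
            this.keywords = keywords;
        }
    }

    final List<String> domainTable = new ArrayList<>();  // stands in for MariaDB
    final List<String> linkDb = new ArrayList<>();       // stands in for the link database
    final List<String> indexJournal = new ArrayList<>(); // stands in for the index journal

    void load(List<Document> documents) {
        for (Document doc : documents) {
            // Each domain is recorded once, however many documents it has.
            if (!domainTable.contains(doc.domain)) {
                domainTable.add(doc.domain);
            }
            // URL and title go to the link database.
            linkDb.add(doc.url + "\t" + doc.title);
            // Each keyword becomes one journal entry pointing back at the URL.
            for (String keyword : doc.keywords) {
                indexJournal.add(keyword + "\t" + doc.url);
            }
        }
    }
}
```

The point of the sketch is only the shape of the step: one input stream, three separate outputs, with the index journal left for a later step to turn into something searchable.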

4. Index Construction Process

The index-constructor-process constructs indices from the data generated by the loader.

Overview

Schematically, the path from crawling to a searchable index looks like this:

    +-----------+  
    |  CRAWLING |  Fetch each URL and 
    |    STEP   |  output to file
    +-----------+
          |
    //========================\\
    ||  Parquet:              || Crawl
    ||  Status, HTML[], ...   || Files
    ||  Status, HTML[], ...   ||
    ||  Status, HTML[], ...   ||
    ||     ...                ||
    \\========================//
          |
    +------------+
    | CONVERTING |  Analyze HTML and 
    |    STEP    |  extract keywords,
    +------------+  features, links, URLs
          |
    //==================\\
    || Parquet:         ||  Processed
    ||  Documents[]     ||  Files
    ||  Domains[]       ||
    ||  Links[]         ||  
    \\==================//
          |
    +------------+ Insert domains into MariaDB
    |  LOADING   | Insert URLs, titles in link DB
    |    STEP    | Insert keywords in Index
    +------------+    
          |
    +------------+
    | CONSTRUCT  | Make the data searchable
    |   INDEX    | 
    +------------+
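
The same chain can be written down as a small data model, which makes the hand-off between steps explicit. This is a minimal sketch: the enum, its fields, and the artifact labels are hypothetical names invented for illustration, not types from the code base.

```java
import java.util.Optional;

// Hypothetical model of the four steps in the diagram above.
// Each step names what it consumes and what it produces.
enum PipelineStep {
    CRAWLING("URLs", "crawl files"),
    CONVERTING("crawl files", "processed files"),
    LOADING("processed files", "loader output"),
    CONSTRUCT_INDEX("loader output", "searchable index");

    final String input;
    final String output;

    PipelineStep(String input, String output) {
        this.input = input;
        this.output = output;
    }

    // The step that runs after this one, or empty for the last step.
    Optional<PipelineStep> next() {
        PipelineStep[] all = values();
        int i = ordinal() + 1;
        return i < all.length ? Optional.of(all[i]) : Optional.empty();
    }
}

final class PipelineSketch {
    // True if every step consumes exactly what the previous step produced.
    static boolean isConsistent() {
        for (PipelineStep step : PipelineStep.values()) {
            Optional<PipelineStep> next = step.next();
            if (next.isPresent() && !step.output.equals(next.get().input)) {
                return false;
            }
        }
        return true;
    }

    // One line per step, in execution order.
    static String describe() {
        StringBuilder sb = new StringBuilder();
        for (PipelineStep step : PipelineStep.values()) {
            sb.append(step.name()).append(": ")
              .append(step.input).append(" -> ").append(step.output)
              .append('\n');
        }
        return sb.toString();
    }
}
```

Calling `PipelineSketch.describe()` lists the steps in the order the diagram shows them, and `isConsistent()` checks that no step expects an input the previous step does not produce.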