Processes

1. Crawl Process

The crawling-process fetches website contents and saves them as compressed JSON models described in crawling-model.

The operation is specified by a crawl specification, which can be created in the control GUI.

2. Converting Process

The converting-process reads the crawl data produced by the crawling step and processes it, extracting keywords and metadata, and saves the results as Parquet files described in processed-data.

3. Loading Process

The loading-process reads the processed data.

It creates an index journal and a link database, and loads domains and domain-links into the MariaDB database.

4. Index Construction Process

The index-constructor-process constructs indices from the data generated by the loader.

Overview

Schematically, the crawling, converting, loading and index construction steps look like this:

    //====================\\
    || Compressed JSON:   ||  Specifications
    || ID, Domain, Urls[] ||  File
    || ID, Domain, Urls[] ||
    || ID, Domain, Urls[] ||
    ||      ...           ||
    \\====================//
          |
    +-----------+  
    |  CRAWLING |  Fetch each URL and 
    |    STEP   |  output to file
    +-----------+
          |
    //========================\\
    ||  Compressed JSON:      || Crawl
    ||  Status, HTML[], ...   || Files
    ||  Status, HTML[], ...   ||
    ||  Status, HTML[], ...   ||
    ||     ...                ||
    \\========================//
          |
    +------------+
    | CONVERTING |  Analyze HTML and
    |    STEP    |  extract keywords,
    +------------+  features, links, URLs
          |
    //==================\\
    || Parquet:         ||  Processed
    ||  Documents[]     ||  Files
    ||  Domains[]       ||
    ||  Links[]         ||  
    \\==================//
          |
    +------------+ Insert domains into MariaDB
    |  LOADING   | Insert URLs, titles in link DB
    |    STEP    | Insert keywords in Index
    +------------+    
          |
    +------------+
    | CONSTRUCT  | Make the data searchable
    |   INDEX    | 
    +------------+
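
The data flow in the diagram can also be read as a chain of functions, where each step consumes the artifact produced by the previous one. The following is a minimal in-memory sketch of that idea, not the project's actual code: every type and method name (PipelineSketch, CrawlSpec, CrawlFile, ProcessedFile, JournalEntry, crawl, convert, load, constructIndex) is hypothetical, the fields are simplified, and the real steps exchange compressed JSON and Parquet files on disk rather than Java objects. The keyword extraction is a deliberately naive stand-in.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.function.Function;

// Hypothetical sketch: none of these types or methods exist in the project.
class PipelineSketch {
    // One record type per data box in the diagram (fields are simplified assumptions).
    public record CrawlSpec(String id, String domain, List<String> urls) {}
    public record CrawlFile(String domain, List<String> html) {}
    public record ProcessedFile(String domain, List<String> keywords) {}
    public record JournalEntry(String keyword, String domain) {}

    // CRAWLING STEP: fetch each URL. The fetcher is injected so the sketch needs no network.
    public static CrawlFile crawl(CrawlSpec spec, Function<String, String> fetcher) {
        List<String> html = new ArrayList<>();
        for (String url : spec.urls()) {
            html.add(fetcher.apply(url));
        }
        return new CrawlFile(spec.domain(), html);
    }

    // CONVERTING STEP: naive keyword extraction (strip tags, lowercase, split on non-letters).
    public static ProcessedFile convert(CrawlFile crawled) {
        List<String> keywords = new ArrayList<>();
        for (String doc : crawled.html()) {
            String text = doc.replaceAll("<[^>]*>", " ").toLowerCase(Locale.ROOT);
            for (String word : text.split("[^a-z]+")) {
                if (!word.isEmpty() && !keywords.contains(word)) {
                    keywords.add(word);
                }
            }
        }
        return new ProcessedFile(crawled.domain(), keywords);
    }

    // LOADING STEP: flatten the processed data into journal entries (database inserts omitted).
    public static List<JournalEntry> load(ProcessedFile processed) {
        List<JournalEntry> journal = new ArrayList<>();
        for (String keyword : processed.keywords()) {
            journal.add(new JournalEntry(keyword, processed.domain()));
        }
        return journal;
    }

    // CONSTRUCT INDEX: group journal entries into a sorted keyword -> domains map.
    public static Map<String, Set<String>> constructIndex(List<JournalEntry> journal) {
        Map<String, Set<String>> index = new TreeMap<>();
        for (JournalEntry entry : journal) {
            index.computeIfAbsent(entry.keyword(), k -> new TreeSet<>()).add(entry.domain());
        }
        return index;
    }

    public static void main(String[] args) {
        CrawlSpec spec = new CrawlSpec("1", "example.com", List.of("https://example.com/"));
        // Canned HTML stands in for a real HTTP fetch.
        CrawlFile crawled = crawl(spec, url -> "<html><p>Hello search engine</p></html>");
        Map<String, Set<String>> index = constructIndex(load(convert(crawled)));
        System.out.println(index);
    }
}
```

Passing the fetcher in as a function keeps the crawl step testable without network access. In the same way, each later step depends only on the output type of the step before it, which mirrors how the real processes can be run separately against files on disk.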