CatgirlIntelligenceAgency/crawl
Viktor Lofgren 549d323f6d Code cleanup
2023-03-07 16:37:05 +01:00

Crawl

1. Crawl Job Extractor

The crawl-job-extractor-process creates a crawl job specification based on the content in the database.
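As a rough illustration of what such a specification might contain, the sketch below writes one JSON record per domain to a compressed file and reads it back. The fields (`id`, `domain`, `urls`) follow the overview diagram; the function names, the choice of gzip, and the one-record-per-line layout are assumptions made for this example and may not match the project's actual format.

```python
import gzip
import json
import os
import tempfile


def write_specs(path, specs):
    """Write one JSON record per line to a gzip-compressed file."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for spec in specs:
            f.write(json.dumps(spec) + "\n")


def read_specs(path):
    """Read the records back, one JSON object per line."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical example data; real jobs would be derived from the database.
example_specs = [
    {"id": "0001", "domain": "www.example.com",
     "urls": ["https://www.example.com/"]},
    {"id": "0002", "domain": "docs.example.org",
     "urls": ["https://docs.example.org/a", "https://docs.example.org/b"]},
]

if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "crawl-spec.jsonl.gz")
    write_specs(path, example_specs)
    for spec in read_specs(path):
        print(spec["id"], spec["domain"], len(spec["urls"]))
```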

2. Crawl Process

The crawling-process fetches website contents and saves them as compressed JSON models described in crawling-model.
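A minimal sketch of this step, assuming each specification record is a dictionary with `id`, `domain` and `urls` keys: every URL is fetched and the status and HTML are collected into one record per domain, which is then saved as compressed JSON. All names here (`crawl_domain`, `save_crawl`, the `documents` field) are hypothetical, gzip is a stand-in for whatever compression the project uses, and a real crawler would also need robots.txt handling, rate limiting and retries.

```python
import gzip
import json
import urllib.request


def fetch_with_urllib(url):
    """Fetch a URL and return (status, body). Requires network access."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.status, resp.read().decode("utf-8", errors="replace")


def crawl_domain(spec, fetch=fetch_with_urllib):
    """Fetch every URL in one specification record and collect the results."""
    documents = []
    for url in spec["urls"]:
        try:
            status, html = fetch(url)
            documents.append({"url": url, "status": status, "html": html})
        except OSError as e:
            # urllib's URLError and HTTPError are subclasses of OSError.
            documents.append({"url": url, "status": None, "html": None,
                              "error": str(e)})
    return {"id": spec["id"], "domain": spec["domain"], "documents": documents}


def save_crawl(path, crawled):
    """Save the crawl result for one domain as gzip-compressed JSON."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        json.dump(crawled, f)
```

The `fetch` parameter is there so the fetching logic can be swapped out, for instance with a stub that returns canned pages when testing without network access.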

3. Converting Process

The converting-process reads the crawl data produced by the crawling step and processes it, extracting keywords and metadata, and saves the results as compressed JSON models described in converting-model.
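To make the idea of keyword and link extraction concrete, here is a deliberately simple sketch using only the Python standard library. It collects link targets and picks the most frequent words as keywords. The names (`TextAndLinks`, `convert_document`) and the frequency-based keyword choice are assumptions for illustration; the project's own extraction logic is not described here and is likely more elaborate.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class TextAndLinks(HTMLParser):
    """Collect visible words and link targets from an HTML document."""

    def __init__(self):
        super().__init__()
        self.words = []
        self.links = []
        self._skip = 0  # depth inside <script>/<style> elements

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip > 0:
            self._skip -= 1

    def handle_data(self, data):
        # Ignore text that belongs to scripts and stylesheets.
        if self._skip == 0:
            self.words.extend(re.findall(r"[a-z0-9]+", data.lower()))


def convert_document(url, html, max_keywords=5):
    """Return the links and the most frequent words of one document."""
    parser = TextAndLinks()
    parser.feed(html)
    parser.close()
    counts = Counter(w for w in parser.words if len(w) > 2)
    keywords = [w for w, _ in counts.most_common(max_keywords)]
    return {"url": url, "links": parser.links, "keywords": keywords}
```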

4. Loading Process

The loading-process reads the processed data, creates an index journal and lexicon, and loads domains and addresses into the MariaDB database.
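The sketch below shows the general shape of such a loading step under heavy simplification: `sqlite3` stands in for MariaDB so that the example is runnable on its own, and a plain dictionary mapping keywords to URL ids stands in for the index journal and lexicon. The table layout, the function name `load_processed` and the record fields are all hypothetical.

```python
import sqlite3
from collections import defaultdict


def load_processed(conn, index, processed_docs):
    """Insert domains and URLs into the database and keywords into the index."""
    conn.execute("CREATE TABLE IF NOT EXISTS domain "
                 "(id INTEGER PRIMARY KEY, name TEXT UNIQUE)")
    conn.execute("CREATE TABLE IF NOT EXISTS url "
                 "(id INTEGER PRIMARY KEY, domain_id INTEGER, url TEXT UNIQUE)")
    for doc in processed_docs:
        conn.execute("INSERT OR IGNORE INTO domain (name) VALUES (?)",
                     (doc["domain"],))
        (domain_id,) = conn.execute("SELECT id FROM domain WHERE name = ?",
                                    (doc["domain"],)).fetchone()
        conn.execute("INSERT OR IGNORE INTO url (domain_id, url) VALUES (?, ?)",
                     (domain_id, doc["url"]))
        (url_id,) = conn.execute("SELECT id FROM url WHERE url = ?",
                                 (doc["url"],)).fetchone()
        # Stand-in for the index: keyword -> set of URL ids.
        for keyword in doc["keywords"]:
            index[keyword].add(url_id)
    conn.commit()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    index = defaultdict(set)
    load_processed(conn, index, [
        {"domain": "www.example.com", "url": "https://www.example.com/",
         "keywords": ["search", "engine"]},
    ])
    print(dict(index))
```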

Overview

Schematically, the crawling and loading process looks like this:

    //====================\\
    || Compressed JSON:   ||  Specifications
    || ID, Domain, Urls[] ||  File
    || ID, Domain, Urls[] ||
    || ID, Domain, Urls[] ||
    ||      ...           ||
    \\====================//
          |
    +-----------+  
    |  CRAWLING |  Fetch each URL and 
    |    STEP   |  output to file
    +-----------+
          |
    //========================\\
    ||  Compressed JSON:      || Crawl
    ||  Status, HTML[], ...   || Files
    ||  Status, HTML[], ...   ||
    ||  Status, HTML[], ...   ||
    ||     ...                ||
    \\========================//
          |
    +------------+
    | CONVERTING |  Analyze HTML and 
    |    STEP    |  extract keywords,
    +------------+  features, links, URLs
          |
    //==================\\
    || Compressed JSON: ||  Processed
    ||  URLs[]          ||  Files
    ||  Domains[]       ||
    ||  Links[]         ||  
    ||  Keywords[]      ||
    ||    ...           ||
    ||  URLs[]          ||
    ||  Domains[]       ||
    ||  Links[]         ||    
    ||  Keywords[]      ||
    ||    ...           ||
    \\==================//
          |
    +------------+
    |  LOADING   | Insert URLs in DB
    |    STEP    | Insert keywords in Index
    +------------+