CatgirlIntelligenceAgency/code/services-core
Viktor Lofgren 440e097d78 (crawler) WIP integration of WARC files into the crawler and converter process.
This commit is in a pretty rough state.  It refactors the crawler fairly significantly to offer better separation of concerns.  It replaces the zstd compressed json files used to store crawl data with WARC files entirely, and the converter is modified to be able to consume this data.  This works, -ish.

There appears to be some bug relating to reading robots.txt, and the X-Robots-Tag header is no longer processed either.

A problem is that the WARC files are a bit too large.  It will probably be necessary to introduce a new format to store the crawl data long term, something like parquet, and use WARCs for intermediate storage so that the crawler can be restarted without needing a recrawl.
2023-12-13 15:33:42 +01:00
assistant-service (*) Refactor GeoIP-related code 2023-12-10 17:30:43 +01:00
control-service (control) Fix spurious state detection in control-side actors 2023-12-09 12:50:05 +01:00
executor-service (crawler) WIP integration of WARC files into the crawler and converter process. 2023-12-13 15:33:42 +01:00
index-service Merge branch 'master' into warc 2023-12-11 14:32:35 +01:00
query-service Refactoring 2023-10-25 18:51:02 +02:00
readme.md (refactor) Move search service into services-satellite 2023-10-09 13:40:01 +02:00

Core Services

The core services constitute the main functionality of the search engine, and are relatively agnostic to the Marginalia application.

  • The index-service contains the indexes; it answers questions about which documents contain which terms.

  • The query-service interprets queries and delegates work to index-service.

  • The control-service provides an operator's user interface, and is responsible for orchestrating the various processes of the system.

  • The assistant-service helps the search service with spelling suggestions and other peripheral functionality.