
# Crawl Job Extractor

The crawl job extractor creates a file containing a list of domains along with known URLs.

This is consumed by `processes/crawling-process`.

## Usage

The crawl job extractor has three modes of operation:

```shell
# 1  grab domains from the database
./crawl-job-extractor file.out

# 2  grab domains from a file
./crawl-job-extractor file.out -f domains.txt

# 3  grab domains from the command line
./crawl-job-extractor file.out domain1 domain2 ...
```

* When only a single argument is passed (the name of the file to write to), it will create a complete list of domains and URLs known to the system, based on the already-indexed domains as well as the domains in the CRAWL_QUEUE table in the database.
* When invoked like `./crawl-job-extractor output-file -f domains.txt`, domains will be read from the non-blank, non-comment lines of the file (see the sketch below).
* In other cases, each argument from the second onward will be interpreted as a domain name.
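
For illustration, a `domains.txt` file might look like the sketch below. The `#` comment syntax is an assumption; this readme doesn't specify exactly which lines count as comments.

```
# domains to crawl, one per line ('#' as comment marker is an assumption)
www.example.com
example.org
```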

In the last two modes, if the crawl-job-extractor is able to connect to the database, it will use information from the link database to populate the list of URLs for each domain; otherwise it will create a spec with only the domain name and the index address, and the crawler will have to figure out the rest on its own.

The crawl specification is zstd-compressed JSON.

## Tricks

### Joining two specifications

Two or more specifications can be joined with a shell command of the form

```shell
$ zstdcat file1 file2 | zstd -o new-file
```
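
Here `zstdcat` decompresses each input in turn, so the re-compressed output contains the concatenated contents of both files. Note that `zstd` will refuse to overwrite an existing output file by default; add `-f` if `new-file` already exists.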

### Inspection

The file can also be inspected with `zstdless`, or with combinations like `zstdcat file | jq`.
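
As a minimal sketch, assuming the decompressed output is JSON where each record carries a top-level `domain` field (a hypothetical field name, not confirmed by this readme), one could inspect it like so:

```shell
# pretty-print the start of the file without assuming the record layout
$ zstdcat file.out | jq . | head -n 40

# list the domains, assuming a top-level "domain" field (hypothetical)
$ zstdcat file.out | jq '.domain'
```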