CatgirlIntelligenceAgency/code/tools/crawl-job-extractor/readme.md

# Crawl Job Extractor

The crawl job extractor creates a crawl specification: a file containing
a list of domains along with their known URLs.

This is consumed by [processes/crawling-process](../../processes/crawling-process).

## Usage


The crawl job extractor has three modes of operation:

```shell
# 1  grab domains from the database
./crawl-job-extractor file.out

# 2  grab domains from a file
./crawl-job-extractor file.out -f domains.txt

# 3  grab domains from the command line
./crawl-job-extractor file.out domain1 domain2 ...
```

* When only a single argument (the output file name) is passed, the extractor creates a complete list of domains
  and URLs known to the system, drawn from the already indexed domains
  as well as the domains in the `CRAWL_QUEUE` table in the database.
* When invoked as `./crawl-job-extractor output-file -f domains.txt`,
  domains are read from the non-blank, non-comment lines of the file.
* In all other cases, the second argument onward is interpreted as domain names.
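
As a rough illustration of the second mode, the sketch below shows which lines of a domains file
would presumably be picked up. It assumes that comment lines start with `#`, which is not stated above;
the file contents and the `effective_domains` helper are made up for the example and are not part of the tool.

```shell
# Hypothetical input file; the domains are placeholders.
cat > domains.txt <<'EOF'
# seed domains
example.com

www.example.org
EOF

# Print the lines that would presumably be treated as domains:
# drop blank lines and lines whose first non-space character is '#'.
# (The '#' comment marker is an assumption, not confirmed by the tool's docs.)
effective_domains() {
  grep -v -E '^[[:space:]]*(#|$)' "$1"
}

effective_domains domains.txt
```

Running a filter like this before invoking `./crawl-job-extractor file.out -f domains.txt` is a cheap way to check what the file contributes.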

In the last two modes, if the crawl-job-extractor can connect to the database, it uses
information from the link database to populate the list of URLs for each domain. Otherwise it
creates a spec with only the domain name and the index address, leaving the crawler to discover
the rest.

The crawl specification is zstd-compressed JSON.

## Tricks

### Joining two specifications

Two or more specifications can be joined with a shell command of the form

```shell
$ zstdcat file1 file2 | zstd -o new-file
```
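
A joined file can be sanity-checked by comparing its decompressed contents against the inputs.
The sketch below is one possible way to do that, not something the tool provides. The file names
and the JSON records are placeholders (the real specification schema is not shown here), and it
assumes `zstd`, `zstdcat` and `cksum` are on the `PATH`.

```shell
# Create two tiny stand-in specifications (placeholder records, not the real schema).
printf '{"domain":"example.com"}\n' | zstd -q -f -o file1
printf '{"domain":"example.org"}\n' | zstd -q -f -o file2

# Join them as described above.
zstdcat file1 file2 | zstd -q -f -o new-file

# Check that the joined archive is a valid zstd stream.
zstd -q -t new-file

# The decompressed joined file should be byte-identical to the
# concatenation of the decompressed inputs.
expected=$(zstdcat file1 file2 | cksum)
actual=$(zstdcat new-file | cksum)
if [ "$expected" = "$actual" ]; then
  echo "join OK"
else
  echo "join mismatch" >&2
fi
```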

### Inspection

The file can also be inspected with `zstdless`,
or with combinations like `zstdcat file | jq`.