(doc) Update Crawling Docs
Still a WIP, but now more accurately reflects the new GUI, with screenshots to boot!
This commit is contained in: parent fd1eec99b5, commit b9445d4f62

This document is a draft.

## WARNING

Please don't run the crawler unless you intend to actually operate a public-facing
search engine! For testing, use crawl sets from [downloads.marginalia.nu](https://downloads.marginalia.nu/) instead;
or, if you wish to play with the crawler, crawl a small set of domains whose owners are
OK with it: use your own, your friends', or any subdomain of marginalia.nu.

See the documentation in run/ for more information on how to load sample data!

Reckless crawling annoys webmasters and makes it harder to run an independent search engine.
Crawling from a domestic IP address is also likely to put you on a greylist
of probable bots. You will solve CAPTCHAs for almost every website you visit
for weeks, and may be permanently blocked from a few IPs.

## Prerequisites

You probably want to run a local bind resolver to speed up DNS lookups and reduce the amount of
DNS traffic.
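
As an illustration, a minimal caching resolver with bind9 might be configured like this
(a sketch assuming a Debian-style bind9 install; paths and file names vary by distribution):

```
// /etc/bind/named.conf.options -- minimal local caching resolver (sketch)
options {
    directory "/var/cache/bind";

    recursion yes;                // act as a recursive, caching resolver
    listen-on { 127.0.0.1; };     // only listen on loopback
    allow-query { 127.0.0.1; };   // only answer queries from this machine

    dnssec-validation auto;
};
```

Point `/etc/resolv.conf` at `127.0.0.1` afterwards so the crawler's lookups go through the cache.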

These processes require a lot of disk space. It's strongly recommended to use a dedicated disk for
the index storage subdirectory; it doesn't need to be extremely fast, but it should be a few terabytes in size.
It should be mounted with `noatime`. It may be a good idea to format the disk with a block size of
4096 bytes. This will reduce the amount of disk space used by the crawler.
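
For example (a sketch assuming ext4 and a hypothetical device and mount point; adapt both
to your system):

```
# Format the dedicated disk with a 4096-byte block size (ext4 shown as an example)
mkfs.ext4 -b 4096 /dev/sdb1

# Mount with noatime; the equivalent /etc/fstab entry might look like:
#   /dev/sdb1  /data/marginalia  ext4  defaults,noatime  0  2
mount -o noatime /dev/sdb1 /data/marginalia
```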

Make sure you configure the user-agent properly. This will be used to identify the crawler,
and is matched against the robots.txt file. The crawler will not crawl sites that don't allow it.

See [wiki://Robots_exclusion_standard](https://en.wikipedia.org/wiki/Robots_exclusion_standard) for more information
about robots.txt; the user agent can be configured in conf/properties/system.properties; see the
[system-properties](system-properties.md) documentation for more information.
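
As a sketch, the configuration might look like the following (the property keys and values
here are assumptions for illustration; the [system-properties](system-properties.md)
documentation has the authoritative names):

```
# conf/properties/system.properties (illustrative values, assumed key names)
# The full user-agent string sent with each request:
crawler.userAgentString=search.example.com-bot
# The token matched against User-agent lines in robots.txt:
crawler.userAgentIdentifier=examplebot
```

With that identifier, a site whose robots.txt contains `User-agent: examplebot` followed by
`Disallow: /` would not be crawled.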

## Setup

Ensure that the system is running and go to https://localhost:8081.

With the default test configuration, the system is configured to
store data in `node-1/storage`.

## Fresh Crawl

While a running search engine can use the link database to figure out which websites to visit, a clean
system does not know of any links. To bootstrap a crawl, a crawl specification needs to be created to
seed the domain database.

Go to `Nodes->Node 1->Actions->New Crawl`

![img](images/new_crawl.png)

Click the link that says 'New Spec' to arrive at a form for creating a new specification:

![img](images/new_spec.png)

Fill out the form with a description and a link to the domain list.
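
The domain list is a plain text file with one domain name per line; blank lines and
comments starting with `#` are ignored. Make it available over HTTP(S). For example:

```
# seed domains, one per line
marginalia.nu
search.marginalia.nu

www.example.com
```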

## Crawling

If you aren't redirected there automatically, go back to the `New Crawl` page under Node 1 -> Actions.
Your new specification should now be listed.

Check the box next to it, and click `[Trigger New Crawl]`.

![img](images/new_crawl2.png)

This will start the crawling process. Crawling may take a while, depending on the size
of the domain list and the size of the websites.

![img](images/crawl_in_progress.png)

Eventually a progress bar will show up, and the crawl will start. When it reaches 100%, the crawl is done.
You can also monitor the `Events Summary` table on the same page to see what happened after the fact.

It is expected that the crawl will stall out toward the end of the process; this is a statistical effect, since
the largest websites take the longest to finish and tend to be the ones lingering at 99% or so completion. The
crawler has a timeout of 5 hours: if no new domains finish crawling in that time, it will stop, to prevent crawler
traps from stalling the crawl indefinitely.

**Be sure to read the section on re-crawling!**

## Converting

Binary files added in this commit:

* doc/images/crawl_in_progress.png (new file, 42 KiB)
* doc/images/new_crawl.png (new file, 18 KiB)
* doc/images/new_crawl2.png (new file, 13 KiB)
* doc/images/new_spec.png (new file, 33 KiB)