(control) Add warnings about domain data contamination

Viktor Lofgren 2024-01-25 18:26:15 +01:00
parent 0b105b5986
commit 182c0cf28e
3 changed files with 13 additions and 1 deletion

@@ -1,8 +1,15 @@
<h1 class="my-3">Download Sample Data</h1>
<div class="my-3 p-3 border bg-light">
This will download sample crawl data from <a href="https://downloads.marginalia.nu">downloads.marginalia.nu</a> onto Node {{node.id}}.
<p>This will download sample crawl data from <a href="https://downloads.marginalia.nu">downloads.marginalia.nu</a> onto Node {{node.id}}.
This is a sample of real crawl data. It is intended for demo, testing and development purposes. Several sets are available.
</p>
<p>
<span class="text-danger">Warning</span> While processing the sample data, the domains associated with it will be loaded
into the domain database. This means that if you run the re-crawl action on this machine, regardless of which crawl data
is specified, the domains in the sample data will be crawled!
</p>
</div>
<form method="post" action="actions/download-sample-data">

@@ -6,6 +6,9 @@
If you are just looking to test the software, feel free to use <a href="https://downloads.marginalia.nu/domain-list-test.txt">this
short list of marginalia-related websites</a>, that are safe to crawl repeatedly without causing any problems.
</p>
<p><span class="text-danger">Warning</span> Ensure <a href="?view=download-sample-data">downloaded sample data</a> has not been loaded onto this instance
before performing this action, otherwise those domains will also be crawled while re-crawling in the future!</p>
</div>
<form method="post" action="actions/new-crawl-specs">

@@ -18,6 +18,8 @@
crawl spec. If the document has changed, it will be re-crawled. If it has not changed, it will be skipped,
and the previous data will be retained. This is both faster and easier on the target server.
</p>
<p><span class="text-danger">Warning</span> Ensure <a href="?view=download-sample-data">downloaded sample data</a>
has not been loaded onto this instance before performing this action, otherwise those domains will also be crawled!</p>
</div>
<form method="post" action="actions/recrawl">