(crawler) Increase growth of crawl jobs
A number of crawl jobs get stuck at about 300 documents, or just under. This appears to be because the crawl limit fails to grow: it is computed as MAX(200, 1.25 x GOOD_URLS), with a 1.5x modifier applied on recrawl. GOOD_URLS counts how many documents processed successfully, which is typically fairly small, so the limit stalls at 1.5 x 200 = 300. Switching to KNOWN_URLS should let the limit grow faster. The SQL query in the DbCrawlSpecProvider class has been updated to select KNOWN_URLS instead of GOOD_URLS from the DOMAIN_METADATA table, and the floor has been raised from 200 to 250.
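As a sketch of why the ceiling lands at exactly 300, here is the formula from the message above expressed as a small standalone Java program. The `crawlLimit` helper is hypothetical (the real code inlines this logic), assuming limit = max(floor, 1.25 x urls) with a 1.5x recrawl modifier:

```java
public class CrawlLimitDemo {
    // Hypothetical helper mirroring the formula described in the commit message:
    // base limit = max(floor, 1.25 * urls), multiplied by 1.5 on recrawl.
    static int crawlLimit(int urls, int floor, boolean recrawl) {
        int base = Math.max(floor, (int) (1.25 * urls));
        return recrawl ? (int) (1.5 * base) : base;
    }

    public static void main(String[] args) {
        // GOOD_URLS is typically small, so the old floor of 200 dominates:
        System.out.println(crawlLimit(40, 200, true));   // 1.5 * 200 = 300, the observed ceiling
        // KNOWN_URLS is typically much larger, so the limit can actually grow:
        System.out.println(crawlLimit(2000, 250, true)); // 1.5 * 2500 = 3750
    }
}
```

With a small success count the 1.25x term never exceeds the floor, so every recrawl re-derives the same 300-document cap; a larger input count (or a higher floor) is what lets the limit compound.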
This commit is contained in:
parent
0454447e41
commit
e1a155a9c8
@@ -36,7 +36,7 @@ public class DbCrawlSpecProvider implements CrawlSpecProvider {
         try (var conn = dataSource.getConnection();
              var query = conn.prepareStatement("""
-                     SELECT DOMAIN_NAME, COALESCE(GOOD_URLS, 0)
+                     SELECT DOMAIN_NAME, COALESCE(KNOWN_URLS, 0)
                      FROM EC_DOMAIN
                      LEFT JOIN DOMAIN_METADATA ON EC_DOMAIN.ID=DOMAIN_METADATA.ID
                      WHERE NODE_AFFINITY=?
@@ -48,7 +48,7 @@ public class DbCrawlSpecProvider implements CrawlSpecProvider {
             while (rs.next()) {
                 domains.add(new CrawlSpecRecord(
                         rs.getString(1),
-                        Math.clamp((int) (1.25 * rs.getInt(2)), 200, 10_000),
+                        Math.clamp((int) (1.25 * rs.getInt(2)), 250, 10_000),
                         List.of()
                 ));
             }