Commit Graph

1042 Commits

Author SHA1 Message Date
Viktor Lofgren
8e25cfff4f Update README and CONTRIBUTING. 2023-06-27 18:32:47 +02:00
Viktor Lofgren
b7dc748942 Update README to reflect external funding. 2023-06-27 18:20:55 +02:00
Viktor Lofgren
d71124961e Better tests for crawling and processing. 2023-06-27 16:11:27 +02:00
Viktor Lofgren
fbdedf53de Fix bug in CrawlerRetreiver
... where the root URL wasn't always added properly to the front of the crawl queue.
2023-06-27 15:50:38 +02:00
Viktor Lofgren
a6a66c6d8a Improve site info for unknown domains:
* Placeholder screenshot should work
* Add a link to git-repo for submitting the site for crawling
2023-06-27 15:32:11 +02:00
Viktor Lofgren
d167ad2017 Remove sitemap related log spam 2023-06-27 13:59:47 +02:00
Viktor Lofgren
7d741ff499 Fix so crawl plan replay doesn't crash if a file is missing. 2023-06-27 10:57:54 +02:00
Viktor Lofgren
f8f9f04158 Specialized logic for processing Lemmy-based websites. 2023-06-27 10:57:54 +02:00
Viktor Lofgren
b0c7480d06 Set default timeouts for java.net.URL-connections 2023-06-27 10:57:54 +02:00
Viktor Lofgren
e7af77e151 Tests for crawler specialization + testdata 2023-06-27 10:57:54 +02:00
Viktor Lofgren
ec940e36d0 Sitemap support, refined crawler specialization 2023-06-27 10:57:54 +02:00
Viktor Lofgren
f92d8a0975 EdgeUrl conversion to/from java.net.URL 2023-06-27 10:57:54 +02:00
Viktor Lofgren
ed373eef61 Refactor crawler and add special logic for some platforms
* Break apart CrawlerRetreiver
* Break apart HttpFetcher into an interface and impl for testing sanity
* Add special logic for Lemmy, Mediawiki and Discourse to not waste requests on paths that aren't interesting.
2023-06-27 10:57:54 +02:00
Viktor Lofgren
5abaf13192 Fix serialization bug with CompressedBigString 2023-06-27 10:57:54 +02:00
Viktor Lofgren
d86e8522e2 Add search profiles for wiki, forum and docs. 2023-06-24 12:17:35 +02:00
Viktor Lofgren
bd2c3855ed Add bits and keywords for generator classes (docs, forum, wiki). 2023-06-23 21:35:28 +02:00
Viktor Lofgren
4c627d0e1d Improvements to crawling.md 2023-06-22 18:01:43 +02:00
Viktor Lofgren
c8dd45e37d First draft for crawling documentation. 2023-06-22 17:44:24 +02:00
Viktor Lofgren
54c2be893b TRIVIAL: Remove unused import. 2023-06-22 17:21:47 +02:00
Viktor Lofgren
55c65f0935 Use document generator to complement the document selection.
Will let through e.g. a modern SSG in the small web filter.
2023-06-22 17:21:33 +02:00
Viktor Lofgren
b5ef67ed28 Categorize generators by type
This is a great quality signal!
Add the type as document bitflags by category.
2023-06-22 16:04:37 +02:00
Viktor Lofgren
f140e7d7c7 Use a default tag for unset or invalid generators. 2023-06-21 17:30:14 +02:00
Viktor Lofgren
a9a2960e86 New synthetic keyword for document generator meta tag. 2023-06-20 16:25:49 +02:00
Viktor Lofgren
7326ba74fe Tweaks to pub date heuristics to make it mostly get the 'historyofphilosophy.net' case right.
Use HTML standard for plausibility checks in the more guesswork-like heuristics. Added more class names to look for date strings.
2023-06-20 14:15:05 +02:00
Viktor Lofgren
a9fabba407 Tell experiment runner to only process some domains.
Updated the experiment runner, as well as the script.
2023-06-20 14:14:01 +02:00
Viktor Lofgren
5d862d119c Bump dependency versions. 2023-06-20 12:03:12 +02:00
Viktor Lofgren
4fc0ddbc45 Improved crawl-job-extractor.
Let crawl-job-extractor run offline and allow it to read domains from file.
Improved docs.
2023-06-20 11:37:52 +02:00
Viktor Lofgren
9455100907 Throw a custom exception when WMSA_HOME isn't found 2023-06-20 11:37:52 +02:00
Viktor Lofgren
32a6735d03 Undo change in requirements for counting as a high tf-idf word 2023-06-19 17:58:19 +02:00
Viktor Lofgren
f0b4acb358 Better logic for summarization. 2023-06-19 17:58:19 +02:00
Viktor Lofgren
67c15a34e6 Reduce the amount of expensive operations in HtmlDocumentProcessorPlugin. 2023-06-19 17:58:19 +02:00
Viktor Lofgren
9579cdd151 Improved heuristic for which words are considered important in selecting the summary text. 2023-06-19 17:58:19 +02:00
Viktor Lofgren
443cf0cf1e Expose additional functionality through WordsTfIdfCounts.
Bump requirements for being flagged as high TF-IDF from 2 occurrences to 3.
2023-06-19 17:58:19 +02:00
Viktor Lofgren
4138233ddf Truncate repeated strings of any non-alnum symbols in SummaryExtractor 2023-06-19 17:58:19 +02:00
Viktor Lofgren
2979f4703e Allocation-free text utility 2023-06-19 17:58:19 +02:00
Viktor Lofgren
77f2ca51af Optimize SentenceExtractor.
Remove String pool because it's not doing much.
Break out constant.
Use a shared RdrPosTagger.
2023-06-19 17:58:19 +02:00
Viktor Lofgren
ffcbc6c1c9 Reduce the odds of re-allocation by AsciiFlattener 2023-06-19 17:58:19 +02:00
Viktor Lofgren
186a02acfd Optimize RDRPosTagger to use integer comparisons instead of string comparisons.
Also reduce the cache-thrashing by deconstructing the tree's nodes into arrays.
2023-06-19 17:58:19 +02:00
Viktor Lofgren
6f2a7977c1 (Minor) Remove character debris in build.gradle 2023-06-19 17:58:19 +02:00
Viktor Lofgren
266ad2e4de Re-introduce monkey patched GSON to make converter run better.
2023-06-19 17:58:19 +02:00
Viktor Lofgren
d1a004bea6 (minor) Clean up StringPool 2023-06-19 17:58:19 +02:00
Viktor Lofgren
e4372289a5 Use fixed buffers for BigString compression and decompression to reduce GC churn.
2023-06-19 17:58:19 +02:00
Viktor Lofgren
379bccc1a3 Disable AdblockSimulator since it's slow and doesn't really work. Just wasting CPU cycles until it's fixed. 2023-06-19 17:58:19 +02:00
Viktor Lofgren
21125206b4 Fix some bugs in JSON-LD heuristics for pub date. 2023-06-19 17:58:19 +02:00
Viktor Lofgren
44b1fe0e6d Move list-conversion into getDescription method. 2023-06-19 17:58:19 +02:00
Viktor Lofgren
88399e30e2 Consider keyword relevance signals when creating the document summary using the DOM walker. 2023-06-19 17:58:19 +02:00
Viktor Lofgren
7ed3306be3 Make the adjacency calculator behave like it used to in the past, when it gave better results. 2023-06-07 22:03:06 +02:00
Viktor Lofgren
eb2ca942d5 Up the default crawl delay to 1 second. 2023-06-07 22:02:17 +02:00
Viktor Lofgren
2afbdc2269 Adjust the logic for the crawl job extractor to set a relatively low visit limit for websites that are new in the index or have not yielded many good documents previously. 2023-06-07 22:01:35 +02:00
Viktor Lofgren
d82a858491 Don't consider slash to be a sentence separator. 2023-05-31 16:54:30 +02:00