Adrthegamedev
5ce894564c
(an attempt to) Add wikidot to wiki generators list
2023-07-03 13:31:42 +02:00
Viktor Lofgren
813fa08bdd
Better wordpress fingerprinting
2023-07-03 11:29:27 +02:00
Viktor Lofgren
e5792ba8b3
Bugfix where DocumentGeneratorExtractor out of bounded for generators starting with 'microsoft' or 'adobe' but having no followup string.
2023-07-03 11:06:39 +02:00
Viktor Lofgren
42375f0e53
Specialization for javadocs
2023-07-01 20:16:56 +02:00
Viktor Lofgren
24dce8c03b
Remove link filtering for mediawiki, it's too strict and not every site uses the /wiki/-pattern.
2023-07-01 19:32:25 +02:00
Viktor Lofgren
eda615de0f
Add generator fingerprint for invision.
2023-07-01 14:47:57 +02:00
Viktor Lofgren
a000256223
Add generator fingerprint for xenforo.
...
Also clean up the specializations logic a bit, and add a barebones specialization for phpbb that cleans out paths we aren't interested in but doesn't touch pruning or summarizing logic for now.
2023-07-01 14:43:49 +02:00
Viktor Lofgren
9bd0e3ce58
Add generator fingerprint for xenforo.
2023-07-01 14:04:48 +02:00
Viktor Lofgren
b4d1e0e81e
Add generator fingerprints for phpBB and flarum.
2023-07-01 13:44:42 +02:00
Viktor Lofgren
d2fdaafc7a
Big brain web developers were using onload and onerror handlers to load JS without script tags...
2023-06-30 17:10:25 +02:00
Viktor Lofgren
7d86586594
Remove annoying log spam in sitemap retriever
2023-06-30 17:08:35 +02:00
Viktor Lofgren
11c26e700e
Remove annoying log spam in crawler retriever
2023-06-30 17:08:24 +02:00
Viktor Lofgren
8274e8a953
JVM flags for disabling black and block-lists.
2023-06-30 17:07:47 +02:00
Viktor Lofgren
0f34beb1aa
Update search front page
2023-06-29 17:14:27 +02:00
Viktor Lofgren
baff83912e
Small optimizations that shave an hour of processing time :D
2023-06-28 15:41:10 +02:00
Viktor Lofgren
d71124961e
Better tests for crawling and processing.
2023-06-27 16:11:27 +02:00
Viktor Lofgren
fbdedf53de
Fix bug in CrawlerRetreiver
...
... where the root URL wasn't always added properly to the front of the crawl queue.
2023-06-27 15:50:38 +02:00
Viktor Lofgren
a6a66c6d8a
Improve site info for unknown domains:
...
* Placeholder screenshot should work
* Add a link to git-repo for submitting the site for crawling
2023-06-27 15:32:11 +02:00
Viktor Lofgren
d167ad2017
Remove sitemap related log spam
2023-06-27 13:59:47 +02:00
Viktor Lofgren
7d741ff499
Fix so crawl plan replay doesn't crash if a file is missing.
2023-06-27 10:57:54 +02:00
Viktor Lofgren
f8f9f04158
Specialized logic for processing Lemmy-based websites.
2023-06-27 10:57:54 +02:00
Viktor Lofgren
b0c7480d06
Set default timeouts for java.net.URL-connections
2023-06-27 10:57:54 +02:00
Viktor Lofgren
e7af77e151
Tests for crawler specialization + testdata
2023-06-27 10:57:54 +02:00
Viktor Lofgren
ec940e36d0
Sitemap support, refined crawler specialization
2023-06-27 10:57:54 +02:00
Viktor Lofgren
f92d8a0975
EdgeUrl conversion to/from java.net.URL
2023-06-27 10:57:54 +02:00
Viktor Lofgren
ed373eef61
Refactor crawler and add special logic for some platforms
...
* Break apart CrawlerRetreiver
* Break apart HttpFetcher into an interface and impl for testing sanity
* Add special logic for Lemmy, Mediawiki and Discourse to not waste requests on paths that aren't interesting.
2023-06-27 10:57:54 +02:00
Viktor Lofgren
5abaf13192
Fix serialization bug with CompressedBigString
2023-06-27 10:57:54 +02:00
Viktor Lofgren
d86e8522e2
Add search profiles for wiki, forum and docs.
2023-06-24 12:17:35 +02:00
Viktor Lofgren
bd2c3855ed
Add bits and keywords for generator classes (docs, forum, wiki).
2023-06-23 21:35:28 +02:00
Viktor Lofgren
54c2be893b
TRIVIAL: Remove unused import.
2023-06-22 17:21:47 +02:00
Viktor Lofgren
55c65f0935
Use document generator to complement the document selection.
...
Will let through e.g. a modern SSG in the small web filter.
2023-06-22 17:21:33 +02:00
Viktor Lofgren
b5ef67ed28
Categorize generators by type
...
This is a great quality signal!
Add the type as document bitflags by category.
2023-06-22 16:04:37 +02:00
Viktor Lofgren
f140e7d7c7
Use a default tag for unset or invalid generators.
2023-06-21 17:30:14 +02:00
Viktor Lofgren
a9a2960e86
New synthetic keyword for document generator meta tag.
2023-06-20 16:25:49 +02:00
Viktor Lofgren
7326ba74fe
Tweaks to pub date heuristics to make it mostly get the 'historyofphilosophy.net' case right.
...
Use HTML standard for plausibility checks in the more guesswork-like heuristics. Added more class names to look for date strings.
2023-06-20 14:15:05 +02:00
Viktor Lofgren
a9fabba407
Tell experiment runner to only process some domains.
...
Updated the experiment runner, as well as the script.
2023-06-20 14:14:01 +02:00
Viktor Lofgren
4fc0ddbc45
Improved crawl-job-extractor.
...
Let crawl-job-extractor run offline and allow it to read domains from file.
Improved docs.
2023-06-20 11:37:52 +02:00
Viktor Lofgren
9455100907
Throw a custom exception when WMSA_HOME isn't found
2023-06-20 11:37:52 +02:00
Viktor Lofgren
32a6735d03
Undo change in requirements for counting as a high tf-idf word
2023-06-19 17:58:19 +02:00
Viktor Lofgren
f0b4acb358
Better logic for summarization.
2023-06-19 17:58:19 +02:00
Viktor Lofgren
67c15a34e6
Reduce the amount of expensive operations in HtmlDocumentProcessorPlugin.
2023-06-19 17:58:19 +02:00
Viktor Lofgren
9579cdd151
Improved heuristic for which words are considered important in selecting the summary text.
2023-06-19 17:58:19 +02:00
Viktor Lofgren
443cf0cf1e
Expose additional functionality through WordsTfIdfCounts.
...
Bump requirements for being flagged as high TF-IDF from 2 occurences to 3.
2023-06-19 17:58:19 +02:00
Viktor Lofgren
4138233ddf
Truncate repeated strings of any non-alnum symbols in SummaryExtractor
2023-06-19 17:58:19 +02:00
Viktor Lofgren
2979f4703e
Allocation-free text utility
2023-06-19 17:58:19 +02:00
Viktor Lofgren
77f2ca51af
Optimize SentenceExtractor.
...
Remove String pool because it's not doing much.
Break out constant.
Use a shared RdrPosTagger.
2023-06-19 17:58:19 +02:00
Viktor Lofgren
ffcbc6c1c9
Reduce the odds of re-allocation by AsciiFlattener
2023-06-19 17:58:19 +02:00
Viktor Lofgren
266ad2e4de
Re-introduce monkey patched GSON to make converter run better.
...
fixup! Re-introduce monkey patched GSON to make converter run better.
fixup! Re-introduce monkey patched GSON to make converter run better.
2023-06-19 17:58:19 +02:00
Viktor Lofgren
d1a004bea6
(minor) Clean up StringPool
2023-06-19 17:58:19 +02:00
Viktor Lofgren
e4372289a5
Use fixed buffers for BigString compression and decompression to reduce GC churn.
...
fixup! Use fixed buffers for BigString compression and decompression to reduce GC churn.
2023-06-19 17:58:19 +02:00