828 Commits

Author SHA1 Message Date
Viktor Lofgren
fba466d6e2 (crawler) Update URL blocklist
* Don't crawl MDN mirrors
* More mailing list variants
2023-07-10 18:58:43 +02:00
Viktor
cbbf60a599 Better fingerprinting (#35)
* Better fingerprinting for server tech
* Many more features in FeatureExtractor
* Blog specialization
* SiteType table
2023-07-10 18:58:43 +02:00
Viktor Lofgren
c125d8ab48 (search) Fix a bug where space-like characters weren't normalized in query processing. 2023-07-10 18:58:43 +02:00
Viktor Lofgren
f03146de4b (crawler) Fix poor handling of duplicate ids
* Also clean up the code a bit
2023-07-10 18:58:43 +02:00
Viktor Lofgren
dbb758d1a8 Minor: Better error handling in crawled domain reader 2023-07-10 18:58:43 +02:00
Viktor Lofgren
da8bcc6e24 Minor: Don't blow up the reader on a corrupted file 2023-07-10 18:58:43 +02:00
Viktor Lofgren
96eecc6ea5 Minor: Readability. 2023-07-10 18:58:43 +02:00
Viktor Lofgren
98d1898610 Bugfix: Don't run the xenforo specialization on phpBB. 2023-07-06 18:12:26 +02:00
Viktor Lofgren
b73fcc19fe Fix so that crawler tests don't sometimes fetch real sitemaps when they're run. 2023-07-06 18:05:03 +02:00
Viktor Lofgren
d9e6c4f266 Trial integration of MQ-FSM into index service. 2023-07-06 18:04:16 +02:00
Viktor Lofgren
34653f03a2 Temporary bugfix, need to find source 2023-07-06 14:13:03 +02:00
Viktor Lofgren
f0a8ca440f MQFSM Usability WIP 2023-07-06 13:33:11 +02:00
Viktor Lofgren
d89db10645 MQFSM Usability WIP 2023-07-06 13:02:16 +02:00
Viktor
413dc6ced4 Update FUNDING.yml 2023-07-05 18:03:36 +02:00
Adrthegamedev
78f21dd19a (an attempt to) Add wikidot to wiki generators list 2023-07-05 18:03:36 +02:00
Viktor Lofgren
2cb209ae9c Better wordpress fingerprinting 2023-07-05 18:03:36 +02:00
Viktor Lofgren
979a620ead Bugfix where DocumentGeneratorExtractor went out of bounds for generators starting with 'microsoft' or 'adobe' but having no follow-up string. 2023-07-05 18:03:36 +02:00
Viktor Lofgren
7a17933c65 Control service owns message queue garbage collection. 2023-07-04 19:52:30 +02:00
Viktor Lofgren
097a163cf5 Getting a skeleton in place for the control service. 2023-07-04 18:25:42 +02:00
Viktor Lofgren
2ae0b8c159 Message queue based state machine 2023-07-04 17:42:06 +02:00
Viktor Lofgren
31ae71c7d6 Message queue WIP 2023-07-04 14:28:14 +02:00
Viktor Lofgren
62cc9df206 Embryo of new control process
* New events and heartbeat tables in mariadb
* Refactored to a cleaner Service interface
2023-07-03 10:40:32 +02:00
Viktor Lofgren
42375f0e53 Specialization for javadocs 2023-07-01 20:16:56 +02:00
Viktor Lofgren
24dce8c03b Remove link filtering for mediawiki, it's too strict and not every site uses the /wiki/-pattern. 2023-07-01 19:32:25 +02:00
Viktor Lofgren
eda615de0f Add generator fingerprint for invision. 2023-07-01 14:47:57 +02:00
Viktor Lofgren
a000256223 Add generator fingerprint for xenforo.
Also clean up the specializations logic a bit, and add a barebones specialization for phpbb that cleans out paths we aren't interested in but doesn't touch pruning or summarizing logic for now.
2023-07-01 14:43:49 +02:00
Viktor Lofgren
9bd0e3ce58 Add generator fingerprint for xenforo. 2023-07-01 14:04:48 +02:00
Viktor Lofgren
b4d1e0e81e Add generator fingerprints for phpBB and flarum. 2023-07-01 13:44:42 +02:00
Viktor Lofgren
d2fdaafc7a Big brain web developers were using onload and onerror handlers to load JS without script tags... 2023-06-30 17:10:25 +02:00
Viktor Lofgren
7d86586594 Remove annoying log spam in sitemap retriever 2023-06-30 17:08:35 +02:00
Viktor Lofgren
11c26e700e Remove annoying log spam in crawler retriever 2023-06-30 17:08:24 +02:00
Viktor Lofgren
8274e8a953 JVM flags for disabling black and block-lists. 2023-06-30 17:07:47 +02:00
Viktor Lofgren
42afe490b7 Update README with version info 2023-06-30 11:49:17 +02:00
Viktor Lofgren
0f34beb1aa Update search front page 2023-06-29 17:14:27 +02:00
Viktor Lofgren
e853483ef3 Bump Crawler Commons version 2023-06-29 14:14:18 +02:00
Viktor Lofgren
baff83912e Small optimizations that shave an hour off processing time :D 2023-06-28 15:41:10 +02:00
Viktor Lofgren
8e25cfff4f Update README and CONTRIBUTING. 2023-06-27 18:32:47 +02:00
Viktor Lofgren
b7dc748942 Update README to reflect external funding. 2023-06-27 18:20:55 +02:00
Viktor Lofgren
d71124961e Better tests for crawling and processing. 2023-06-27 16:11:27 +02:00
Viktor Lofgren
fbdedf53de Fix bug in CrawlerRetreiver
... where the root URL wasn't always added properly to the front of the crawl queue.
2023-06-27 15:50:38 +02:00
Viktor Lofgren
a6a66c6d8a Improve site info for unknown domains:
* Placeholder screenshot should work
* Add a link to git-repo for submitting the site for crawling
2023-06-27 15:32:11 +02:00
Viktor Lofgren
d167ad2017 Remove sitemap related log spam 2023-06-27 13:59:47 +02:00
Viktor Lofgren
7d741ff499 Fix so crawl plan replay doesn't crash if a file is missing. 2023-06-27 10:57:54 +02:00
Viktor Lofgren
f8f9f04158 Specialized logic for processing Lemmy-based websites. 2023-06-27 10:57:54 +02:00
Viktor Lofgren
b0c7480d06 Set default timeouts for java.net.URL-connections 2023-06-27 10:57:54 +02:00
Viktor Lofgren
e7af77e151 Tests for crawler specialization + testdata 2023-06-27 10:57:54 +02:00
Viktor Lofgren
ec940e36d0 Sitemap support, refined crawler specialization 2023-06-27 10:57:54 +02:00
Viktor Lofgren
f92d8a0975 EdgeUrl conversion to/from java.net.URL 2023-06-27 10:57:54 +02:00
Viktor Lofgren
ed373eef61 Refactor crawler and add special logic for some platforms
* Break apart CrawlerRetreiver
* Break apart HttpFetcher into an interface and impl for testing sanity
* Add special logic for Lemmy, Mediawiki and Discourse to not waste requests on paths that aren't interesting.
2023-06-27 10:57:54 +02:00
Viktor Lofgren
5abaf13192 Fix serialization bug with CompressedBigString 2023-06-27 10:57:54 +02:00