Viktor Lofgren
7087ab5f07
(run) Reduce nginx access log noise for local setup
2023-07-11 23:11:34 +02:00
Viktor Lofgren
0b0cf48849
(control) Better looking UUIDs
2023-07-11 23:11:02 +02:00
Viktor Lofgren
00d9773b44
(control) Better looking progress bar
2023-07-11 21:37:32 +02:00
Viktor Lofgren
88b9ec70c6
(control, WIP) Run reconvert-load from converter :D
2023-07-11 18:05:37 +02:00
Viktor Lofgren
77261a38cd
(control, WIP) MQFSM and ProcessService are sitting in a tree
...
We're spawning processes from the MQFSM in control service now!
2023-07-11 17:08:43 +02:00
Viktor Lofgren
3c7c77fe21
(minor) Bugfix in Path handling
2023-07-11 17:06:52 +02:00
Viktor Lofgren
4ee3f6ba3f
(minor) Refactor ControlService
2023-07-11 14:51:51 +02:00
Viktor Lofgren
4c016b0318
Process monitoring
...
* Also refactored the SQL tables a bit
2023-07-11 14:46:21 +02:00
Viktor Lofgren
f59cab300e
(minor) Javadoc comments for MqPersistance and MqMessageState
2023-07-10 21:59:51 +02:00
Viktor Lofgren
ec7826659a
(minor) Javadoc comments for MqPersistance and MqMessageState
2023-07-10 21:52:25 +02:00
Viktor Lofgren
98b5f22104
(control) WIP control service
...
* Set messages to OK when received so they're cleaned up properly.
2023-07-10 21:33:57 +02:00
Viktor Lofgren
2283ceb77d
(control) WIP control service
2023-07-10 18:58:43 +02:00
Viktor Lofgren
fba466d6e2
(crawler) Update URL blocklist
...
* Don't crawl MDN mirrors
* More mailing list variants
2023-07-10 18:58:43 +02:00
Viktor
cbbf60a599
Better fingerprinting ( #35 )
...
* Better fingerprinting for server tech
* Many more features in FeatureExtractor
* Blog specialization
* SiteType table
2023-07-10 18:58:43 +02:00
Viktor Lofgren
c125d8ab48
(search) Fix a bug where space-like characters weren't normalized in query processing.
2023-07-10 18:58:43 +02:00
Viktor Lofgren
f03146de4b
(crawler) Fix poor handling of duplicate ids
...
* Also clean up the code a bit
2023-07-10 18:58:43 +02:00
Viktor Lofgren
dbb758d1a8
Minor: Better error handling in crawled domain reader
2023-07-10 18:58:43 +02:00
Viktor Lofgren
da8bcc6e24
Minor: Don't blow up the reader on a corrupted file
2023-07-10 18:58:43 +02:00
Viktor Lofgren
96eecc6ea5
Minor: Readability.
2023-07-10 18:58:43 +02:00
Viktor Lofgren
98d1898610
Bugfix: Don't run the xenforo specialization on phpBB.
2023-07-06 18:12:26 +02:00
Viktor Lofgren
b73fcc19fe
Fix so that crawler tests don't sometimes fetch real sitemaps when they're run.
2023-07-06 18:05:03 +02:00
Viktor Lofgren
d9e6c4f266
Trial integration of MQ-FSM into index service.
2023-07-06 18:04:16 +02:00
Viktor Lofgren
34653f03a2
Temporary bugfix, need to find source
2023-07-06 14:13:03 +02:00
Viktor Lofgren
f0a8ca440f
MQFSM Usability WIP
2023-07-06 13:33:11 +02:00
Viktor Lofgren
d89db10645
MQFSM Usability WIP
2023-07-06 13:02:16 +02:00
Viktor
413dc6ced4
Update FUNDING.yml
2023-07-05 18:03:36 +02:00
Adrthegamedev
78f21dd19a
(an attempt to) Add wikidot to wiki generators list
2023-07-05 18:03:36 +02:00
Viktor Lofgren
2cb209ae9c
Better wordpress fingerprinting
2023-07-05 18:03:36 +02:00
Viktor Lofgren
979a620ead
Bugfix where DocumentGeneratorExtractor went out of bounds for generators starting with 'microsoft' or 'adobe' but having no follow-up string.
2023-07-05 18:03:36 +02:00
Viktor Lofgren
7a17933c65
Control service owns message queue garbage collection.
2023-07-04 19:52:30 +02:00
Viktor Lofgren
097a163cf5
Getting a skeleton in place for the control service.
2023-07-04 18:25:42 +02:00
Viktor Lofgren
2ae0b8c159
Message queue based state machine
2023-07-04 17:42:06 +02:00
Viktor Lofgren
31ae71c7d6
Message queue WIP
2023-07-04 14:28:14 +02:00
Viktor Lofgren
62cc9df206
Embryo of new control process
...
* New events and heartbeat tables in mariadb
* Refactored to a cleaner Service interface
2023-07-03 10:40:32 +02:00
Viktor Lofgren
42375f0e53
Specialization for javadocs
2023-07-01 20:16:56 +02:00
Viktor Lofgren
24dce8c03b
Remove link filtering for mediawiki, it's too strict and not every site uses the /wiki/-pattern.
2023-07-01 19:32:25 +02:00
Viktor Lofgren
eda615de0f
Add generator fingerprint for invision.
2023-07-01 14:47:57 +02:00
Viktor Lofgren
a000256223
Add generator fingerprint for xenforo.
...
Also clean up the specializations logic a bit, and add a barebones specialization for phpbb that cleans out paths we aren't interested in but doesn't touch pruning or summarizing logic for now.
2023-07-01 14:43:49 +02:00
Viktor Lofgren
9bd0e3ce58
Add generator fingerprint for xenforo.
2023-07-01 14:04:48 +02:00
Viktor Lofgren
b4d1e0e81e
Add generator fingerprints for phpBB and flarum.
2023-07-01 13:44:42 +02:00
Viktor Lofgren
d2fdaafc7a
Big brain web developers were using onload and onerror handlers to load JS without script tags...
2023-06-30 17:10:25 +02:00
Viktor Lofgren
7d86586594
Remove annoying log spam in sitemap retriever
2023-06-30 17:08:35 +02:00
Viktor Lofgren
11c26e700e
Remove annoying log spam in crawler retriever
2023-06-30 17:08:24 +02:00
Viktor Lofgren
8274e8a953
JVM flags for disabling black and block-lists.
2023-06-30 17:07:47 +02:00
Viktor Lofgren
42afe490b7
Update README with version info
2023-06-30 11:49:17 +02:00
Viktor Lofgren
0f34beb1aa
Update search front page
2023-06-29 17:14:27 +02:00
Viktor Lofgren
e853483ef3
Bump Crawler Commons version
2023-06-29 14:14:18 +02:00
Viktor Lofgren
baff83912e
Small optimizations that shave an hour off processing time :D
2023-06-28 15:41:10 +02:00
Viktor Lofgren
8e25cfff4f
Update README and CONTRIBUTING.
2023-06-27 18:32:47 +02:00
Viktor Lofgren
b7dc748942
Update README to reflect external funding.
2023-06-27 18:20:55 +02:00