Commit Graph

1105 Commits

Author SHA1 Message Date
Viktor Lofgren
74caf9e38a (processes) Remove forEach-constructs in favor of iterators. 2023-07-12 17:47:36 +02:00
Viktor Lofgren
7087ab5f07 (run) Reduce nginx access log noise for local setup 2023-07-11 23:11:34 +02:00
Viktor Lofgren
0b0cf48849 (control) Better looking UUIDs 2023-07-11 23:11:02 +02:00
Viktor Lofgren
00d9773b44 (control) Better looking progress bar 2023-07-11 21:37:32 +02:00
Viktor Lofgren
ac2d7034db (minor) Bugfix in Path handling 2023-07-11 21:24:29 +02:00
Viktor Lofgren
88b9ec70c6 (control, WIP) Run reconvert-load from converter :D 2023-07-11 18:05:37 +02:00
Viktor Lofgren
77261a38cd (control, WIP) MQFSM and ProcessService are sitting in a tree
We're spawning processes from the MQFSM in control service now!
2023-07-11 17:08:43 +02:00
Viktor Lofgren
3c7c77fe21 (minor) Bugfix in Path handling 2023-07-11 17:06:52 +02:00
Viktor Lofgren
4ee3f6ba3f (minor) Refactor ControlService 2023-07-11 14:51:51 +02:00
Viktor Lofgren
4c016b0318 Process monitoring
* Also refactored the SQL tables a bit
2023-07-11 14:46:21 +02:00
Viktor Lofgren
f59cab300e (minor) Javadoc comments for MqPersistance and MqMessageState 2023-07-10 21:59:51 +02:00
Viktor Lofgren
ec7826659a (minor) Javadoc comments for MqPersistance and MqMessageState 2023-07-10 21:52:25 +02:00
Viktor Lofgren
98b5f22104 (control) WIP control service
* Set messages to OK when received so they're cleaned up properly.
2023-07-10 21:33:57 +02:00
Viktor Lofgren
2283ceb77d (control) WIP control service 2023-07-10 18:58:43 +02:00
Viktor Lofgren
fba466d6e2 (crawler) Update URL blocklist
* Don't crawl MDN mirrors
* More mailing list variants
2023-07-10 18:58:43 +02:00
Viktor
cbbf60a599 Better fingerprinting (#35)
* Better fingerprinting for server tech
* Many more features in FeatureExtractor
* Blog specialization
* SiteType table
2023-07-10 18:58:43 +02:00
Viktor Lofgren
c125d8ab48 (search) Fix a bug where space-like characters weren't normalized in query processing. 2023-07-10 18:58:43 +02:00
Viktor Lofgren
f03146de4b (crawler) Fix bug poor handling of duplicate ids
* Also clean up the code a bit
2023-07-10 18:58:43 +02:00
Viktor Lofgren
dbb758d1a8 Minor: Better error handling in crawled domain reader 2023-07-10 18:58:43 +02:00
Viktor Lofgren
da8bcc6e24 Minor: Don't blow up the reader on a corrupted file 2023-07-10 18:58:43 +02:00
Viktor Lofgren
96eecc6ea5 Minor: Readability. 2023-07-10 18:58:43 +02:00
Viktor Lofgren
74644d59f3 (crawler) Update URL blocklist
* Don't crawl MDN mirrors
* More mailing list variants
2023-07-10 18:04:43 +02:00
Viktor
0f9b90eb1c Better fingerprinting (#35)
* Better fingerprinting for server tech
* Many more features in FeatureExtractor
* Blog specialization
* SiteType table
2023-07-10 17:36:12 +02:00
Viktor Lofgren
ae9537b68e (search) Fix a bug where space-like characters weren't normalized in query processing. 2023-07-07 20:02:05 +02:00
Viktor Lofgren
2619d196bb (crawler) Fix bug poor handling of duplicate ids
* Also clean up the code a bit
2023-07-07 19:56:14 +02:00
Viktor Lofgren
17db23c2c1 Minor: Better error handling in crawled domain reader 2023-07-07 19:48:32 +02:00
Viktor Lofgren
040bea1f75 Minor: Don't blow up the reader on a corrupted file 2023-07-07 19:48:11 +02:00
Viktor Lofgren
dc8277223a Minor: Readability. 2023-07-06 19:50:13 +02:00
Viktor Lofgren
98d1898610 Bugfix: Don't run the xenforo specialization on phpBB. 2023-07-06 18:12:26 +02:00
Viktor Lofgren
1400fb4a9b Bugfix: Don't run the xenforo specialization on phpBB. 2023-07-06 18:11:19 +02:00
Viktor Lofgren
647bbfa617 Fix so that crawler tests don't sometimes fetch real sitemaps when they're run. 2023-07-06 18:05:23 +02:00
Viktor Lofgren
b73fcc19fe Fix so that crawler tests don't sometimes fetch real sitemaps when they're run. 2023-07-06 18:05:03 +02:00
Viktor Lofgren
d9e6c4f266 Trial integration of MQ-FSM into index service. 2023-07-06 18:04:16 +02:00
Viktor Lofgren
34653f03a2 Temporary bugfix, need to find source 2023-07-06 14:13:03 +02:00
Viktor Lofgren
f0a8ca440f MQFSM Usability WIP 2023-07-06 13:33:11 +02:00
Viktor Lofgren
d89db10645 MQFSM Usability WIP 2023-07-06 13:02:16 +02:00
Viktor
413dc6ced4 Update FUNDING.yml 2023-07-05 18:03:36 +02:00
Adrthegamedev
78f21dd19a (an attempt to) Add wikidot to wiki generators list 2023-07-05 18:03:36 +02:00
Viktor Lofgren
2cb209ae9c Better wordpress fingerprinting 2023-07-05 18:03:36 +02:00
Viktor Lofgren
979a620ead Bugfix where DocumentGeneratorExtractor out of bounded for generators starting with 'microsoft' or 'adobe' but having no followup string. 2023-07-05 18:03:36 +02:00
Viktor Lofgren
7a17933c65 Control service owns message queue garbage collection. 2023-07-04 19:52:30 +02:00
Viktor
019fa763cd Update FUNDING.yml 2023-07-04 18:46:58 +02:00
Viktor Lofgren
097a163cf5 Getting a skeleton in place for the control service. 2023-07-04 18:25:42 +02:00
Viktor Lofgren
2ae0b8c159 Message queue based state machine 2023-07-04 17:42:06 +02:00
Viktor Lofgren
31ae71c7d6 Message queue WIP 2023-07-04 14:28:14 +02:00
Adrthegamedev
5ce894564c (an attempt to) Add wikidot to wiki generators list 2023-07-03 13:31:42 +02:00
Viktor Lofgren
813fa08bdd Better wordpress fingerprinting 2023-07-03 11:29:27 +02:00
Viktor Lofgren
e5792ba8b3 Bugfix where DocumentGeneratorExtractor out of bounded for generators starting with 'microsoft' or 'adobe' but having no followup string. 2023-07-03 11:06:39 +02:00
Viktor Lofgren
62cc9df206 Embryo of new control process
* New events and heartbeat tables in mariadb
* Refactored to a cleaner Service interface
2023-07-03 10:40:32 +02:00
Viktor Lofgren
42375f0e53 Specialization for javadocs 2023-07-01 20:16:56 +02:00