Commit Graph

149 Commits

Author SHA1 Message Date
Viktor Lofgren
fd44e09ebd (loader) Don't delete the entire link database when the loader runs 2023-07-24 18:37:35 +02:00
Viktor Lofgren
667b0ca0b0 (converter, WIP) Refactor CrawledDomainReader to not return iterators.
Instead return a closable class SerializableCrawlDataStream.
2023-07-24 16:28:30 +02:00
Viktor Lofgren
a56953c798 (converter, WIP) Refactor converter to not have to load everything into RAM. 2023-07-24 15:25:09 +02:00
Viktor Lofgren
f91d92cccb (crawler) WIP 2023-07-20 21:05:16 +02:00
Viktor Lofgren
d7ab21fe34 (*) Refactor Control Service and processes 2023-07-17 21:20:31 +02:00
Viktor Lofgren
bca4bbb6c8 (*) Refactor MQ and MQSM 2023-07-17 13:57:32 +02:00
Viktor Lofgren
e618aa34e9 (control) Name change process->fsm, new fsm:s
* FSM for spawning processes when messages appear for them
* FSM for removing data flagged for purging
2023-07-17 12:27:27 +02:00
Viktor Lofgren
8b74e3aa0d (*) File Storage WIP 2023-07-14 17:08:10 +02:00
Viktor Lofgren
74caf9e38a (processes) Remove forEach-constructs in favor of iterators. 2023-07-12 17:47:36 +02:00
Viktor Lofgren
ac2d7034db (minor) Bugfix in Path handling 2023-07-11 21:24:29 +02:00
Viktor Lofgren
3c7c77fe21 (minor) Bugfix in Path handling 2023-07-11 17:06:52 +02:00
Viktor Lofgren
4c016b0318 Process monitoring
* Also refactored the SQL tables a bit
2023-07-11 14:46:21 +02:00
Viktor
cbbf60a599 Better fingerprinting (#35)
* Better fingerprinting for server tech
* Many more features in FeatureExtractor
* Blog specialization
* SiteType table
2023-07-10 18:58:43 +02:00
Viktor
0f9b90eb1c
Better fingerprinting (#35)
* Better fingerprinting for server tech
* Many more features in FeatureExtractor
* Blog specialization
* SiteType table
2023-07-10 17:36:12 +02:00
Viktor Lofgren
98d1898610 Bugfix: Don't run the xenforo specialization on phpBB. 2023-07-06 18:12:26 +02:00
Viktor Lofgren
1400fb4a9b Bugfix: Don't run the xenforo specialization on phpBB. 2023-07-06 18:11:19 +02:00
Adrthegamedev
78f21dd19a (an attempt to) Add wikidot to wiki generators list 2023-07-05 18:03:36 +02:00
Viktor Lofgren
2cb209ae9c Better wordpress fingerprinting 2023-07-05 18:03:36 +02:00
Viktor Lofgren
979a620ead Bugfix where DocumentGeneratorExtractor out of bounded for generators starting with 'microsoft' or 'adobe' but having no followup string. 2023-07-05 18:03:36 +02:00
Adrthegamedev
5ce894564c (an attempt to) Add wikidot to wiki generators list 2023-07-03 13:31:42 +02:00
Viktor Lofgren
813fa08bdd Better wordpress fingerprinting 2023-07-03 11:29:27 +02:00
Viktor Lofgren
e5792ba8b3 Bugfix where DocumentGeneratorExtractor out of bounded for generators starting with 'microsoft' or 'adobe' but having no followup string. 2023-07-03 11:06:39 +02:00
Viktor Lofgren
42375f0e53 Specialization for javadocs 2023-07-01 20:16:56 +02:00
Viktor Lofgren
eda615de0f Add generator fingerprint for invision. 2023-07-01 14:47:57 +02:00
Viktor Lofgren
a000256223 Add generator fingerprint for xenforo.
Also clean up the specializations logic a bit, and add a barebones specialization for phpbb that cleans out paths we aren't interested in but doesn't touch pruning or summarizing logic for now.
2023-07-01 14:43:49 +02:00
Viktor Lofgren
9bd0e3ce58 Add generator fingerprint for xenforo. 2023-07-01 14:04:48 +02:00
Viktor Lofgren
b4d1e0e81e Add generator fingerprints for phpBB and flarum. 2023-07-01 13:44:42 +02:00
Viktor Lofgren
d2fdaafc7a Big brain web developers were using onload and onerror handlers to load JS without script tags... 2023-06-30 17:10:25 +02:00
Viktor Lofgren
baff83912e Small optimizations that shave an hour of processing time :D 2023-06-28 15:41:10 +02:00
Viktor Lofgren
d71124961e Better tests for crawling and processing. 2023-06-27 16:11:27 +02:00
Viktor Lofgren
f8f9f04158 Specialized logic for processing Lemmy-based websites. 2023-06-27 10:57:54 +02:00
Viktor Lofgren
bd2c3855ed Add bits and keywords for generator classes (docs, forum, wiki). 2023-06-23 21:35:28 +02:00
Viktor Lofgren
b5ef67ed28 Categorize generators by type
This is a great quality signal!
Add the type as document bitflags by category.
2023-06-22 16:04:37 +02:00
Viktor Lofgren
f140e7d7c7 Use a default tag for unset or invalid generators. 2023-06-21 17:30:14 +02:00
Viktor Lofgren
a9a2960e86 New synthetic keyword for document generator meta tag. 2023-06-20 16:25:49 +02:00
Viktor Lofgren
7326ba74fe Tweaks to pub date heuristics to make it mostly get the 'historyofphilosophy.net' case right.
Use HTML standard for plausibility checks in the more guesswork-like heuristics. Added more class names to look for date strings.
2023-06-20 14:15:05 +02:00
Viktor Lofgren
67c15a34e6 Reduce the amount of expensive operations in HtmlDocumentProcessorPlugin. 2023-06-19 17:58:19 +02:00
Viktor Lofgren
266ad2e4de Re-introduce monkey patched GSON to make converter run better.
fixup! Re-introduce monkey patched GSON to make converter run better.

fixup! Re-introduce monkey patched GSON to make converter run better.
2023-06-19 17:58:19 +02:00
Viktor Lofgren
44b1fe0e6d Move list-conversion into getDescription method. 2023-06-19 17:58:19 +02:00
Viktor Lofgren
88399e30e2 Consider keyword relevance signals when creating the document summary using the DOM walker. 2023-06-19 17:58:19 +02:00
Viktor Lofgren
a9f7b4c457 Add synthetic keywords for same-site files linked from a document (e.g. file:png). Also add category keywords, like file:image or file:document. 2023-04-30 19:29:13 +02:00
Viktor Lofgren
2ab26f37b8 Bug fix for document metadata encoding that breaks year based queries. 2023-04-14 16:56:49 +02:00
Viktor Lofgren
cc4e089a5d Consider average sentence length when selecting search results. This promotes proses over code listings, tabular data, etc. 2023-03-30 15:46:15 +02:00
Viktor Lofgren
4d05be4095 Refactor InternalLinkGraph 2023-03-30 15:44:23 +02:00
Viktor Lofgren
03bd892b95 Improve document processing in conversion.
* Add flags for long and short documents.
* Break out common length logic from plugins.
* Cleaning up of related code.
2023-03-28 16:38:00 +02:00
Viktor Lofgren
ca22c287a5 Make use of DocumentFlags' flags 2023-03-21 16:03:15 +01:00
Viktor Lofgren
2eb972dea1 Remove unrelated code, break tools into their own directory. 2023-03-17 16:03:11 +01:00
Viktor Lofgren
449471a076 Yet more restructuring. Improved search result ranking. 2023-03-16 21:35:54 +01:00
Viktor Lofgren
d82532b7f1 More restructuring, big bug fixes in keyword extraction. 2023-03-13 17:39:53 +01:00