Viktor Lofgren
fd44e09ebd
(loader) Don't delete the entire link database when the loader runs
2023-07-24 18:37:35 +02:00
Viktor Lofgren
667b0ca0b0
(converter, WIP) Refactor CrawledDomainReader to not return iterators.
...
Instead return a closable class SerializableCrawlDataStream.
2023-07-24 16:28:30 +02:00
Viktor Lofgren
a56953c798
(converter, WIP) Refactor converter to not have to load everything into RAM.
2023-07-24 15:25:09 +02:00
Viktor Lofgren
f91d92cccb
(crawler) WIP
2023-07-20 21:05:16 +02:00
Viktor Lofgren
d7ab21fe34
(*) Refactor Control Service and processes
2023-07-17 21:20:31 +02:00
Viktor Lofgren
bca4bbb6c8
(*) Refactor MQ and MQSM
2023-07-17 13:57:32 +02:00
Viktor Lofgren
e618aa34e9
(control) Name change process->fsm, new fsm:s
...
* FSM for spawning processes when messages appear for them
* FSM for removing data flagged for purging
2023-07-17 12:27:27 +02:00
Viktor Lofgren
8b74e3aa0d
(*) File Storage WIP
2023-07-14 17:08:10 +02:00
Viktor Lofgren
74caf9e38a
(processes) Remove forEach-constructs in favor of iterators.
2023-07-12 17:47:36 +02:00
Viktor Lofgren
ac2d7034db
(minor) Bugfix in Path handling
2023-07-11 21:24:29 +02:00
Viktor Lofgren
3c7c77fe21
(minor) Bugfix in Path handling
2023-07-11 17:06:52 +02:00
Viktor Lofgren
4c016b0318
Process monitoring
...
* Also refactored the SQL tables a bit
2023-07-11 14:46:21 +02:00
Viktor
cbbf60a599
Better fingerprinting ( #35 )
...
* Better fingerprinting for server tech
* Many more features in FeatureExtractor
* Blog specialization
* SiteType table
2023-07-10 18:58:43 +02:00
Viktor
0f9b90eb1c
Better fingerprinting ( #35 )
...
* Better fingerprinting for server tech
* Many more features in FeatureExtractor
* Blog specialization
* SiteType table
2023-07-10 17:36:12 +02:00
Viktor Lofgren
98d1898610
Bugfix: Don't run the xenforo specialization on phpBB.
2023-07-06 18:12:26 +02:00
Viktor Lofgren
1400fb4a9b
Bugfix: Don't run the xenforo specialization on phpBB.
2023-07-06 18:11:19 +02:00
Adrthegamedev
78f21dd19a
(an attempt to) Add wikidot to wiki generators list
2023-07-05 18:03:36 +02:00
Viktor Lofgren
2cb209ae9c
Better wordpress fingerprinting
2023-07-05 18:03:36 +02:00
Viktor Lofgren
979a620ead
Bugfix where DocumentGeneratorExtractor out of bounded for generators starting with 'microsoft' or 'adobe' but having no followup string.
2023-07-05 18:03:36 +02:00
Adrthegamedev
5ce894564c
(an attempt to) Add wikidot to wiki generators list
2023-07-03 13:31:42 +02:00
Viktor Lofgren
813fa08bdd
Better wordpress fingerprinting
2023-07-03 11:29:27 +02:00
Viktor Lofgren
e5792ba8b3
Bugfix where DocumentGeneratorExtractor out of bounded for generators starting with 'microsoft' or 'adobe' but having no followup string.
2023-07-03 11:06:39 +02:00
Viktor Lofgren
42375f0e53
Specialization for javadocs
2023-07-01 20:16:56 +02:00
Viktor Lofgren
eda615de0f
Add generator fingerprint for invision.
2023-07-01 14:47:57 +02:00
Viktor Lofgren
a000256223
Add generator fingerprint for xenforo.
...
Also clean up the specializations logic a bit, and add a barebones specialization for phpbb that cleans out paths we aren't interested in but doesn't touch pruning or summarizing logic for now.
2023-07-01 14:43:49 +02:00
Viktor Lofgren
9bd0e3ce58
Add generator fingerprint for xenforo.
2023-07-01 14:04:48 +02:00
Viktor Lofgren
b4d1e0e81e
Add generator fingerprints for phpBB and flarum.
2023-07-01 13:44:42 +02:00
Viktor Lofgren
d2fdaafc7a
Big brain web developers were using onload and onerror handlers to load JS without script tags...
2023-06-30 17:10:25 +02:00
Viktor Lofgren
baff83912e
Small optimizations that shave an hour of processing time :D
2023-06-28 15:41:10 +02:00
Viktor Lofgren
d71124961e
Better tests for crawling and processing.
2023-06-27 16:11:27 +02:00
Viktor Lofgren
f8f9f04158
Specialized logic for processing Lemmy-based websites.
2023-06-27 10:57:54 +02:00
Viktor Lofgren
bd2c3855ed
Add bits and keywords for generator classes (docs, forum, wiki).
2023-06-23 21:35:28 +02:00
Viktor Lofgren
b5ef67ed28
Categorize generators by type
...
This is a great quality signal!
Add the type as document bitflags by category.
2023-06-22 16:04:37 +02:00
Viktor Lofgren
f140e7d7c7
Use a default tag for unset or invalid generators.
2023-06-21 17:30:14 +02:00
Viktor Lofgren
a9a2960e86
New synthetic keyword for document generator meta tag.
2023-06-20 16:25:49 +02:00
Viktor Lofgren
7326ba74fe
Tweaks to pub date heuristics to make it mostly get the 'historyofphilosophy.net' case right.
...
Use HTML standard for plausibility checks in the more guesswork-like heuristics. Added more class names to look for date strings.
2023-06-20 14:15:05 +02:00
Viktor Lofgren
67c15a34e6
Reduce the amount of expensive operations in HtmlDocumentProcessorPlugin.
2023-06-19 17:58:19 +02:00
Viktor Lofgren
266ad2e4de
Re-introduce monkey patched GSON to make converter run better.
...
fixup! Re-introduce monkey patched GSON to make converter run better.
fixup! Re-introduce monkey patched GSON to make converter run better.
2023-06-19 17:58:19 +02:00
Viktor Lofgren
44b1fe0e6d
Move list-conversion into getDescription method.
2023-06-19 17:58:19 +02:00
Viktor Lofgren
88399e30e2
Consider keyword relevance signals when creating the document summary using the DOM walker.
2023-06-19 17:58:19 +02:00
Viktor Lofgren
a9f7b4c457
Add synthetic keywords for same-site files linked from a document (e.g. file:png). Also add category keywords, like file:image or file:document.
2023-04-30 19:29:13 +02:00
Viktor Lofgren
2ab26f37b8
Bug fix for document metadata encoding that breaks year based queries.
2023-04-14 16:56:49 +02:00
Viktor Lofgren
cc4e089a5d
Consider average sentence length when selecting search results. This promotes proses over code listings, tabular data, etc.
2023-03-30 15:46:15 +02:00
Viktor Lofgren
4d05be4095
Refactor InternalLinkGraph
2023-03-30 15:44:23 +02:00
Viktor Lofgren
03bd892b95
Improve document processing in conversion.
...
* Add flags for long and short documents.
* Break out common length logic from plugins.
* Cleaning up of related code.
2023-03-28 16:38:00 +02:00
Viktor Lofgren
ca22c287a5
Make use of DocumentFlags' flags
2023-03-21 16:03:15 +01:00
Viktor Lofgren
2eb972dea1
Remove unrelated code, break tools into their own directory.
2023-03-17 16:03:11 +01:00
Viktor Lofgren
449471a076
Yet more restructuring. Improved search result ranking.
2023-03-16 21:35:54 +01:00
Viktor Lofgren
d82532b7f1
More restructuring, big bug fixes in keyword extraction.
2023-03-13 17:39:53 +01:00