Viktor Lofgren
|
5abaf13192
|
Fix serialization bug with CompressedBigString
|
2023-06-27 10:57:54 +02:00 |
|
Viktor Lofgren
|
d86e8522e2
|
Add search profiles for wiki, forum and docs.
|
2023-06-24 12:17:35 +02:00 |
|
Viktor Lofgren
|
bd2c3855ed
|
Add bits and keywords for generator classes (docs, forum, wiki).
|
2023-06-23 21:35:28 +02:00 |
|
Viktor Lofgren
|
4c627d0e1d
|
Improvements to crawling.md
|
2023-06-22 18:01:43 +02:00 |
|
Viktor Lofgren
|
c8dd45e37d
|
First draft for crawling documentation.
|
2023-06-22 17:44:24 +02:00 |
|
Viktor Lofgren
|
54c2be893b
|
TRIVIAL: Remove unused import.
|
2023-06-22 17:21:47 +02:00 |
|
Viktor Lofgren
|
55c65f0935
|
Use document generator to complement the document selection.
Will let through e.g. a modern SSG in the small web filter.
|
2023-06-22 17:21:33 +02:00 |
|
Viktor Lofgren
|
b5ef67ed28
|
Categorize generators by type
This is a great quality signal!
Add the type as document bitflags by category.
|
2023-06-22 16:04:37 +02:00 |
|
Viktor Lofgren
|
f140e7d7c7
|
Use a default tag for unset or invalid generators.
|
2023-06-21 17:30:14 +02:00 |
|
Viktor Lofgren
|
a9a2960e86
|
New synthetic keyword for document generator meta tag.
|
2023-06-20 16:25:49 +02:00 |
|
Viktor Lofgren
|
7326ba74fe
|
Tweaks to pub date heuristics to make it mostly get the 'historyofphilosophy.net' case right.
Use HTML standard for plausibility checks in the more guesswork-like heuristics. Added more class names to look for date strings.
|
2023-06-20 14:15:05 +02:00 |
|
Viktor Lofgren
|
a9fabba407
|
Tell experiment runner to only process some domains.
Updated the experiment runner, as well as the script.
|
2023-06-20 14:14:01 +02:00 |
|
Viktor Lofgren
|
5d862d119c
|
Bump dependency versions.
|
2023-06-20 12:03:12 +02:00 |
|
Viktor Lofgren
|
4fc0ddbc45
|
Improved crawl-job-extractor.
Let crawl-job-extractor run offline and allow it to read domains from file.
Improved docs.
|
2023-06-20 11:37:52 +02:00 |
|
Viktor Lofgren
|
9455100907
|
Throw a custom exception when WMSA_HOME isn't found
|
2023-06-20 11:37:52 +02:00 |
|
Viktor Lofgren
|
32a6735d03
|
Undo change in requirements for counting as a high tf-idf word
|
2023-06-19 17:58:19 +02:00 |
|
Viktor Lofgren
|
f0b4acb358
|
Better logic for summarization.
|
2023-06-19 17:58:19 +02:00 |
|
Viktor Lofgren
|
67c15a34e6
|
Reduce the amount of expensive operations in HtmlDocumentProcessorPlugin.
|
2023-06-19 17:58:19 +02:00 |
|
Viktor Lofgren
|
9579cdd151
|
Improved heuristic for which words are considered important in selecting the summary text.
|
2023-06-19 17:58:19 +02:00 |
|
Viktor Lofgren
|
443cf0cf1e
|
Expose additional functionality through WordsTfIdfCounts.
Bump requirements for being flagged as high TF-IDF from 2 occurrences to 3.
|
2023-06-19 17:58:19 +02:00 |
|
Viktor Lofgren
|
4138233ddf
|
Truncate repeated strings of any non-alnum symbols in SummaryExtractor
|
2023-06-19 17:58:19 +02:00 |
|
Viktor Lofgren
|
2979f4703e
|
Allocation-free text utility
|
2023-06-19 17:58:19 +02:00 |
|
Viktor Lofgren
|
77f2ca51af
|
Optimize SentenceExtractor.
Remove String pool because it's not doing much.
Break out constant.
Use a shared RdrPosTagger.
|
2023-06-19 17:58:19 +02:00 |
|
Viktor Lofgren
|
ffcbc6c1c9
|
Reduce the odds of re-allocation by AsciiFlattener
|
2023-06-19 17:58:19 +02:00 |
|
Viktor Lofgren
|
186a02acfd
|
Optimize RDRPosTagger to use integer comparisons instead of string comparisons.
Also reduce the cache-thrashing by deconstructing the tree's nodes into arrays.
|
2023-06-19 17:58:19 +02:00 |
|
Viktor Lofgren
|
6f2a7977c1
|
(Minor) Remove character debris in build.gradle
|
2023-06-19 17:58:19 +02:00 |
|
Viktor Lofgren
|
266ad2e4de
|
Re-introduce monkey patched GSON to make converter run better.
|
2023-06-19 17:58:19 +02:00 |
|
Viktor Lofgren
|
d1a004bea6
|
(minor) Clean up StringPool
|
2023-06-19 17:58:19 +02:00 |
|
Viktor Lofgren
|
e4372289a5
|
Use fixed buffers for BigString compression and decompression to reduce GC churn.
|
2023-06-19 17:58:19 +02:00 |
|
Viktor Lofgren
|
379bccc1a3
|
Disable AdblockSimulator since it's slow and doesn't really work. Just wasting CPU cycles until it's fixed.
|
2023-06-19 17:58:19 +02:00 |
|
Viktor Lofgren
|
21125206b4
|
Fix some bugs in JSON-LD heuristics for pub date.
|
2023-06-19 17:58:19 +02:00 |
|
Viktor Lofgren
|
44b1fe0e6d
|
Move list-conversion into getDescription method.
|
2023-06-19 17:58:19 +02:00 |
|
Viktor Lofgren
|
88399e30e2
|
Consider keyword relevance signals when creating the document summary using the DOM walker.
|
2023-06-19 17:58:19 +02:00 |
|
Viktor Lofgren
|
7ed3306be3
|
Make the adjacency calculator behave like it used to in the past, when it gave better results.
|
2023-06-07 22:03:06 +02:00 |
|
Viktor Lofgren
|
eb2ca942d5
|
Up the default crawl delay to 1 second.
|
2023-06-07 22:02:17 +02:00 |
|
Viktor Lofgren
|
2afbdc2269
|
Adjust the logic for the crawl job extractor to set a relatively low visit limit for websites that are new in the index or have not yielded many good documents previously.
|
2023-06-07 22:01:35 +02:00 |
|
Viktor Lofgren
|
d82a858491
|
Don't consider slash to be a sentence separator.
|
2023-05-31 16:54:30 +02:00 |
|
Viktor Lofgren
|
e332faa07e
|
Fix test that broke when memex.marginalia.nu started redirecting to www.marginalia.nu.
|
2023-05-28 13:46:24 +02:00 |
|
Viktor Lofgren
|
4e9e79454f
|
Fix broken transformation functions in the PagingArray classes.
|
2023-05-28 13:31:05 +02:00 |
|
Viktor Lofgren
|
b0bc07b4e7
|
Insertion sort was *super* busted I don't even know how it worked
|
2023-05-28 12:17:50 +02:00 |
|
Viktor Lofgren
|
2cda57355a
|
More word metadata tests
|
2023-05-28 11:57:06 +02:00 |
|
Viktor Lofgren
|
fd192d2791
|
Fix putative overflow error with a large dictionary
|
2023-05-28 11:57:06 +02:00 |
|
Viktor Lofgren
|
6814c90625
|
Fix N-width sorting bug
|
2023-05-28 11:57:06 +02:00 |
|
Viktor
|
a57ab427b3
|
Update useful-resources.md
|
2023-05-27 12:01:45 +02:00 |
|
Viktor Lofgren
|
1e184a8372
|
(search) Make exploration mode more random
|
2023-05-25 17:40:28 +02:00 |
|
Viktor Lofgren
|
6fae51a8ef
|
Stopgap fix for a bug in dealing with quote terms containing stop words.
|
2023-05-02 19:38:59 +02:00 |
|
Viktor Lofgren
|
a9f7b4c457
|
Add synthetic keywords for same-site files linked from a document (e.g. file:png). Also add category keywords, like file:image or file:document.
|
2023-04-30 19:29:13 +02:00 |
|
Viktor Lofgren
|
1e3b6934bb
|
Reduce log noise during loading. Bad URLs don't need to be loaded, they can be grepped from the instructions.
|
2023-04-30 18:36:44 +02:00 |
|
Viktor
|
0a5e85be8f
|
Update README.md
|
2023-04-22 21:02:25 +02:00 |
|
Viktor
|
7694a15f62
|
Fix kale's unreasonably high weighting factor
|
2023-04-22 20:55:09 +02:00 |
|