Viktor Lofgren
|
704de50a9b
|
(forward-index, valuator) HTML features in valuator
Put it in the forward index for easy access during index-side valuation.
|
2023-08-18 11:54:56 +02:00 |
|
Viktor
|
52e2ab45bf
|
Merge branch 'master' into master-control-program
|
2023-08-07 12:53:43 +02:00 |
|
Viktor Lofgren
|
2f8488610a
|
(loader) Fix bug where trailing deferred domain meta inserts weren't executed
|
2023-07-31 14:23:23 +02:00 |
|
Viktor Lofgren
|
5c071ce4d3
|
(crawler) Clean up the code and remove unnecessary logging
|
2023-07-30 16:53:39 +02:00 |
|
Viktor Lofgren
|
730e8f74e4
|
(crawler) Even more memory optimizations.
* Fix minor resource leak in zstd streams
* Use pools for zstd streams
* Reduce the SSL session cache size
|
2023-07-30 14:19:55 +02:00 |
|
Viktor Lofgren
|
d3f01bd171
|
(crawler, converter) Remove monkey patched gson from dependencies
|
2023-07-29 19:18:12 +02:00 |
|
Viktor Lofgren
|
77d5e39fe0
|
Make processed data Serializable
|
2023-07-28 18:11:19 +02:00 |
|
Viktor Lofgren
|
667b0ca0b0
|
(converter, WIP) Refactor CrawledDomainReader to not return iterators.
Instead return a closable class SerializableCrawlDataStream.
|
2023-07-24 16:28:30 +02:00 |
|
Viktor Lofgren
|
a56953c798
|
(converter, WIP) Refactor converter to not have to load everything into RAM.
|
2023-07-24 15:25:09 +02:00 |
|
Viktor Lofgren
|
789e8eea85
|
(crawler) Clean up and refactor the code a bit
|
2023-07-23 19:08:38 +02:00 |
|
Viktor Lofgren
|
35b29e4f9e
|
(crawler) Clean up and refactor the code a bit
|
2023-07-23 19:06:37 +02:00 |
|
Viktor Lofgren
|
c069c8c182
|
(crawler) Clean up crawl data reference and recrawl logic
|
2023-07-22 18:42:21 +02:00 |
|
Viktor Lofgren
|
58f2f86ea8
|
(crawler) Don't read all the data into RAM when doing a refresh-crawl
|
2023-07-21 19:47:52 +02:00 |
|
Viktor Lofgren
|
f91d92cccb
|
(crawler) WIP
|
2023-07-20 21:05:16 +02:00 |
|
Viktor Lofgren
|
d7ab21fe34
|
(*) Refactor Control Service and processes
|
2023-07-17 21:20:31 +02:00 |
|
Viktor Lofgren
|
bca4bbb6c8
|
(*) Refactor MQ and MQSM
|
2023-07-17 13:57:32 +02:00 |
|
Viktor Lofgren
|
8b74e3aa0d
|
(*) File Storage WIP
|
2023-07-14 17:08:10 +02:00 |
|
Viktor Lofgren
|
74caf9e38a
|
(processes) Remove forEach-constructs in favor of iterators.
|
2023-07-12 17:47:36 +02:00 |
|
Viktor Lofgren
|
4c016b0318
|
Process monitoring
* Also refactored the SQL tables a bit
|
2023-07-11 14:46:21 +02:00 |
|
Viktor Lofgren
|
dbb758d1a8
|
Minor: Better error handling in crawled domain reader
|
2023-07-10 18:58:43 +02:00 |
|
Viktor Lofgren
|
da8bcc6e24
|
Minor: Don't blow up the reader on a corrupted file
|
2023-07-10 18:58:43 +02:00 |
|
Viktor Lofgren
|
17db23c2c1
|
Minor: Better error handling in crawled domain reader
|
2023-07-07 19:48:32 +02:00 |
|
Viktor Lofgren
|
040bea1f75
|
Minor: Don't blow up the reader on a corrupted file
|
2023-07-07 19:48:11 +02:00 |
|
Viktor Lofgren
|
baff83912e
|
Small optimizations that shave an hour off processing time :D
|
2023-06-28 15:41:10 +02:00 |
|
Viktor Lofgren
|
fbdedf53de
|
Fix bug in CrawlerRetreiver
... where the root URL wasn't always added properly to the front of the crawl queue.
|
2023-06-27 15:50:38 +02:00 |
|
Viktor Lofgren
|
7d741ff499
|
Fix so crawl plan replay doesn't crash if a file is missing.
|
2023-06-27 10:57:54 +02:00 |
|
Viktor Lofgren
|
bd2c3855ed
|
Add bits and keywords for generator classes (docs, forum, wiki).
|
2023-06-23 21:35:28 +02:00 |
|
Viktor Lofgren
|
b5ef67ed28
|
Categorize generators by type
This is a great quality signal!
Add the type as document bitflags by category.
|
2023-06-22 16:04:37 +02:00 |
|
Viktor Lofgren
|
7326ba74fe
|
Tweak pub date heuristics so they mostly get the 'historyofphilosophy.net' case right.
Use the HTML standard for plausibility checks in the more guesswork-like heuristics. Add more class names to look for date strings.
|
2023-06-20 14:15:05 +02:00 |
|
Viktor Lofgren
|
266ad2e4de
|
Re-introduce monkey patched GSON to make converter run better.
|
2023-06-19 17:58:19 +02:00 |
|
Viktor Lofgren
|
16e37672fc
|
Bugfix crawl plan, doesn't use rewrite() everywhere
|
2023-03-30 15:41:07 +02:00 |
|
Viktor Lofgren
|
7c58ddce81
|
readme.md
|
2023-03-22 15:10:30 +01:00 |
|
Viktor Lofgren
|
2eb972dea1
|
Remove unrelated code, break tools into their own directory.
|
2023-03-17 16:03:11 +01:00 |
|
Viktor Lofgren
|
449471a076
|
Yet more restructuring. Improved search result ranking.
|
2023-03-16 21:35:54 +01:00 |
|
Viktor Lofgren
|
d82532b7f1
|
More restructuring, big bug fixes in keyword extraction.
|
2023-03-13 17:39:53 +01:00 |
|