Viktor Lofgren
|
5c040f7a46
|
(crawl-spec) Parquetify crawl spec
* Crawl-specs are now parquet files
* Deprecate the crawl-job-extractor tool
|
2023-09-17 09:41:34 +02:00 |
|
Viktor
|
52e2ab45bf
|
Merge branch 'master' into master-control-program
|
2023-08-07 12:53:43 +02:00 |
|
Viktor Lofgren
|
5c071ce4d3
|
(crawler) Clean up the code and remove unnecessary logging
|
2023-07-30 16:53:39 +02:00 |
|
Viktor Lofgren
|
730e8f74e4
|
(crawler) Even more memory optimizations.
* Fix minor resource leak in zstd streams
* Use pools for zstd streams
* Reduce the SSL session cache size
|
2023-07-30 14:19:55 +02:00 |
|
Viktor Lofgren
|
667b0ca0b0
|
(converter, WIP) Refactor CrawledDomainReader to not return iterators.
Instead return a closable class SerializableCrawlDataStream.
|
2023-07-24 16:28:30 +02:00 |
|
Viktor Lofgren
|
a56953c798
|
(converter, WIP) Refactor converter to not have to load everything into RAM.
|
2023-07-24 15:25:09 +02:00 |
|
Viktor Lofgren
|
789e8eea85
|
(crawler) Clean up and refactor the code a bit
|
2023-07-23 19:08:38 +02:00 |
|
Viktor Lofgren
|
35b29e4f9e
|
(crawler) Clean up and refactor the code a bit
|
2023-07-23 19:06:37 +02:00 |
|
Viktor Lofgren
|
c069c8c182
|
(crawler) Clean up crawl data reference and recrawl logic
|
2023-07-22 18:42:21 +02:00 |
|
Viktor Lofgren
|
58f2f86ea8
|
(crawler) Don't read all the data into RAM when doing a refresh-crawl
|
2023-07-21 19:47:52 +02:00 |
|
Viktor Lofgren
|
f91d92cccb
|
(crawler) WIP
|
2023-07-20 21:05:16 +02:00 |
|
Viktor Lofgren
|
d7ab21fe34
|
(*) Refactor Control Service and processes
|
2023-07-17 21:20:31 +02:00 |
|
Viktor Lofgren
|
bca4bbb6c8
|
(*) Refactor MQ and MQSM
|
2023-07-17 13:57:32 +02:00 |
|
Viktor Lofgren
|
8b74e3aa0d
|
(*) File Storage WIP
|
2023-07-14 17:08:10 +02:00 |
|
Viktor Lofgren
|
74caf9e38a
|
(processes) Remove forEach-constructs in favor of iterators.
|
2023-07-12 17:47:36 +02:00 |
|
Viktor Lofgren
|
4c016b0318
|
Process monitoring
* Also refactored the SQL tables a bit
|
2023-07-11 14:46:21 +02:00 |
|
Viktor Lofgren
|
dbb758d1a8
|
Minor: Better error handling in crawled domain reader
|
2023-07-10 18:58:43 +02:00 |
|
Viktor Lofgren
|
da8bcc6e24
|
Minor: Don't blow up the reader on a corrupted file
|
2023-07-10 18:58:43 +02:00 |
|
Viktor Lofgren
|
17db23c2c1
|
Minor: Better error handling in crawled domain reader
|
2023-07-07 19:48:32 +02:00 |
|
Viktor Lofgren
|
040bea1f75
|
Minor: Don't blow up the reader on a corrupted file
|
2023-07-07 19:48:11 +02:00 |
|
Viktor Lofgren
|
baff83912e
|
Small optimizations that shave an hour off processing time :D
|
2023-06-28 15:41:10 +02:00 |
|
Viktor Lofgren
|
fbdedf53de
|
Fix bug in CrawlerRetreiver
... where the root URL wasn't always added properly to the front of the crawl queue.
|
2023-06-27 15:50:38 +02:00 |
|
Viktor Lofgren
|
7d741ff499
|
Fix so crawl plan replay doesn't crash if a file is missing.
|
2023-06-27 10:57:54 +02:00 |
|
Viktor Lofgren
|
16e37672fc
|
Bugfix: crawl plan didn't use rewrite() everywhere
|
2023-03-30 15:41:07 +02:00 |
|
Viktor Lofgren
|
449471a076
|
Yet more restructuring. Improved search result ranking.
|
2023-03-16 21:35:54 +01:00 |
|
Viktor Lofgren
|
d82532b7f1
|
More restructuring, big bug fixes in keyword extraction.
|
2023-03-13 17:39:53 +01:00 |
|