Viktor Lofgren
01476577b8
(loader) Speed up loading back to original speeds with a cascading DELETE FROM EC_URL rather than EC_PAGE_DATA.
...
* Also clean up code and have proper rollbacks for transactions.
2023-07-28 22:00:07 +02:00
Viktor Lofgren
e237df4a10
(converter) Use a dumb thread pool instead of Java's executor service.
2023-07-28 18:15:16 +02:00
Viktor Lofgren
f11103d31d
(WIP) Make it possible to sideload encyclopedia data.
...
This is mostly a pilot track for sideloading other large websites.
Also change converter to produce a more compact output (Java serialization instead of JSON).
2023-07-28 18:14:43 +02:00
Viktor Lofgren
507f26ad47
(converter) Refactor converter to not keep instructions list in RAM.
2023-07-25 22:06:46 +02:00
Viktor Lofgren
fd44e09ebd
(loader) Don't delete the entire link database when the loader runs
2023-07-24 18:37:35 +02:00
Viktor Lofgren
667b0ca0b0
(converter, WIP) Refactor CrawledDomainReader to not return iterators.
...
Instead return a closable class SerializableCrawlDataStream.
2023-07-24 16:28:30 +02:00
Viktor Lofgren
a56953c798
(converter, WIP) Refactor converter to not have to load everything into RAM.
2023-07-24 15:25:09 +02:00
Viktor Lofgren
35b29e4f9e
(crawler) Clean up and refactor the code a bit
2023-07-23 19:06:37 +02:00
Viktor Lofgren
69f333c0bf
(crawler) Clean up and refactor the code a bit
2023-07-23 18:59:14 +02:00
Viktor Lofgren
c069c8c182
(crawler) Clean up crawl data reference and recrawl logic
2023-07-22 18:42:21 +02:00
Viktor Lofgren
9e4aa7da7c
(crawler) Support for X-Robots-Tag
2023-07-22 18:42:21 +02:00
Viktor Lofgren
58f2f86ea8
(crawler) Don't read all the data into RAM when doing a refresh-crawl
2023-07-21 19:47:52 +02:00
Viktor Lofgren
f91d92cccb
(crawler) WIP
2023-07-20 21:05:16 +02:00
Viktor Lofgren
d7ab21fe34
(*) Refactor Control Service and processes
2023-07-17 21:20:31 +02:00
Viktor Lofgren
bca4bbb6c8
(*) Refactor MQ and MQSM
2023-07-17 13:57:32 +02:00
Viktor Lofgren
e618aa34e9
(control) Name change process->fsm, new FSMs
...
* FSM for spawning processes when messages appear for them
* FSM for removing data flagged for purging
2023-07-17 12:27:27 +02:00
Viktor Lofgren
8b74e3aa0d
(*) File Storage WIP
2023-07-14 17:08:10 +02:00
Viktor Lofgren
5deec63667
(work-log) Better tests
2023-07-12 18:04:06 +02:00
Viktor Lofgren
74caf9e38a
(processes) Remove forEach-constructs in favor of iterators.
2023-07-12 17:47:36 +02:00
Viktor Lofgren
77261a38cd
(control, WIP) MQFSM and ProcessService are sitting in a tree
...
We're spawning processes from the MQFSM in control service now!
2023-07-11 17:08:43 +02:00
Viktor Lofgren
3c7c77fe21
(minor) Bugfix in Path handling
2023-07-11 17:06:52 +02:00
Viktor Lofgren
4c016b0318
Process monitoring
...
* Also refactored the SQL tables a bit
2023-07-11 14:46:21 +02:00
Viktor
cbbf60a599
Better fingerprinting (#35)
...
* Better fingerprinting for server tech
* Many more features in FeatureExtractor
* Blog specialization
* SiteType table
2023-07-10 18:58:43 +02:00
Viktor Lofgren
f03146de4b
(crawler) Fix bug: poor handling of duplicate ids
...
* Also clean up the code a bit
2023-07-10 18:58:43 +02:00
Viktor Lofgren
98d1898610
Bugfix: Don't run the xenforo specialization on phpBB.
2023-07-06 18:12:26 +02:00
Viktor Lofgren
b73fcc19fe
Fix so that crawler tests don't sometimes fetch real sitemaps when they're run.
2023-07-06 18:05:03 +02:00
Viktor Lofgren
34653f03a2
Temporary bugfix, need to find source
2023-07-06 14:13:03 +02:00
Adrthegamedev
78f21dd19a
(an attempt to) Add wikidot to wiki generators list
2023-07-05 18:03:36 +02:00
Viktor Lofgren
2cb209ae9c
Better wordpress fingerprinting
2023-07-05 18:03:36 +02:00
Viktor Lofgren
979a620ead
Bugfix where DocumentGeneratorExtractor went out of bounds for generators starting with 'microsoft' or 'adobe' but having no follow-up string.
2023-07-05 18:03:36 +02:00
Viktor Lofgren
42375f0e53
Specialization for javadocs
2023-07-01 20:16:56 +02:00
Viktor Lofgren
24dce8c03b
Remove link filtering for mediawiki, it's too strict and not every site uses the /wiki/-pattern.
2023-07-01 19:32:25 +02:00
Viktor Lofgren
eda615de0f
Add generator fingerprint for invision.
2023-07-01 14:47:57 +02:00
Viktor Lofgren
a000256223
Add generator fingerprint for xenforo.
...
Also clean up the specializations logic a bit, and add a barebones specialization for phpbb that cleans out paths we aren't interested in but doesn't touch pruning or summarizing logic for now.
2023-07-01 14:43:49 +02:00
Viktor Lofgren
9bd0e3ce58
Add generator fingerprint for xenforo.
2023-07-01 14:04:48 +02:00
Viktor Lofgren
b4d1e0e81e
Add generator fingerprints for phpBB and flarum.
2023-07-01 13:44:42 +02:00
Viktor Lofgren
d2fdaafc7a
Big brain web developers were using onload and onerror handlers to load JS without script tags...
2023-06-30 17:10:25 +02:00
Viktor Lofgren
7d86586594
Remove annoying log spam in sitemap retriever
2023-06-30 17:08:35 +02:00
Viktor Lofgren
11c26e700e
Remove annoying log spam in crawler retriever
2023-06-30 17:08:24 +02:00
Viktor Lofgren
baff83912e
Small optimizations that shave an hour off processing time :D
2023-06-28 15:41:10 +02:00
Viktor Lofgren
d71124961e
Better tests for crawling and processing.
2023-06-27 16:11:27 +02:00
Viktor Lofgren
fbdedf53de
Fix bug in CrawlerRetreiver
...
... where the root URL wasn't always added properly to the front of the crawl queue.
2023-06-27 15:50:38 +02:00
Viktor Lofgren
d167ad2017
Remove sitemap related log spam
2023-06-27 13:59:47 +02:00
Viktor Lofgren
f8f9f04158
Specialized logic for processing Lemmy-based websites.
2023-06-27 10:57:54 +02:00
Viktor Lofgren
b0c7480d06
Set default timeouts for java.net.URL-connections
2023-06-27 10:57:54 +02:00
Viktor Lofgren
e7af77e151
Tests for crawler specialization + testdata
2023-06-27 10:57:54 +02:00
Viktor Lofgren
ec940e36d0
Sitemap support, refined crawler specialization
2023-06-27 10:57:54 +02:00
Viktor Lofgren
ed373eef61
Refactor crawler and add special logic for some platforms
...
* Break apart CrawlerRetreiver
* Break apart HttpFetcher into an interface and impl for testing sanity
* Add special logic for Lemmy, Mediawiki and Discourse to not waste requests on paths that aren't interesting.
2023-06-27 10:57:54 +02:00
Viktor Lofgren
bd2c3855ed
Add bits and keywords for generator classes (docs, forum, wiki).
2023-06-23 21:35:28 +02:00
Viktor Lofgren
b5ef67ed28
Categorize generators by type
...
This is a great quality signal!
Add the type as document bitflags by category.
2023-06-22 16:04:37 +02:00