Viktor Lofgren
fcfe07fb7d
(valuator) Clean up code
2023-08-18 11:26:56 +02:00
Viktor Lofgren
ccf4990add
(minor) Clean up code
2023-08-18 11:26:39 +02:00
Viktor Lofgren
f2638dd845
(feature-extractor) More adtech nonsense
2023-08-18 11:26:19 +02:00
Viktor Lofgren
239980ecae
(minor) Improve comment
2023-08-18 11:26:05 +02:00
Viktor Lofgren
bee815b1c4
(converter) Add monsterinsights as an adtech tracker
2023-08-17 17:44:11 +02:00
Viktor Lofgren
e296b02649
(converter) Optimize LSH based within-domain deduplication
2023-08-17 17:43:46 +02:00
Viktor Lofgren
46d761f34f
(language) fasttext based language filter
2023-08-16 15:48:12 +02:00
Viktor Lofgren
4598c7f40f
(valuation) Penalize wordpress style kebab case urls
2023-08-16 13:11:24 +02:00
Viktor Lofgren
d8073f0dde
(feature-extractor) Add mail.ru counter to non-adtech trackers
2023-08-15 19:10:43 +02:00
Viktor Lofgren
e7192a9cad
(mq) Refactor mq and actor library and move it to libraries out of common
2023-08-15 10:53:23 +02:00
Viktor Lofgren
ce293029c7
(converter) Treat adtech tracking as advertisement.
2023-08-09 14:23:53 +02:00
Viktor Lofgren
4ab1cd9502
(*) last touches
2023-08-07 12:57:44 +02:00
Viktor
52e2ab45bf
Merge branch 'master' into master-control-program
2023-08-07 12:53:43 +02:00
Viktor Lofgren
2f8488610a
(loader) Fix bug where trailing deferred domain meta inserts weren't executed
2023-07-31 14:23:23 +02:00
Viktor Lofgren
37c4cc68ed
TODO
2023-07-31 10:34:42 +02:00
Viktor Lofgren
1c948eb3d8
(minor) Alter DumbThreadPool in Converter to not claim the threads are crawlers.
2023-07-31 10:33:15 +02:00
Viktor Lofgren
cd90ca820f
YAGNI filter over ConverterDomainTypes
2023-07-31 10:32:47 +02:00
Viktor Lofgren
6f4e767a04
(minor) Re-enable monkey-patch-json for converter
2023-07-31 10:31:46 +02:00
Viktor Lofgren
730e8f74e4
(crawler) Even more memory optimizations.
...
* Fix minor resource leak in zstd streams
* Use pools for zstd streams
* Reduce the SSL session cache size
2023-07-30 14:19:55 +02:00
Viktor Lofgren
ee143bbc48
(crawler, converter) Fix so that DumbThreadPool actually waits for termination as intended.
2023-07-29 19:19:09 +02:00
Viktor Lofgren
d3f01bd171
(crawler, converter) Remove monkey patched gson from dependencies
2023-07-29 19:18:12 +02:00
Viktor Lofgren
f11103d31d
(WIP) Make it possible to sideload encyclopedia data.
...
This is mostly a pilot track for sideloading other large websites.
Also change converter to produce a more compact output (Java serialization instead of JSON).
2023-07-28 18:14:43 +02:00
Viktor Lofgren
507f26ad47
(converter) Refactor converter to not keep instructions list in RAM.
2023-07-25 22:06:46 +02:00
Viktor Lofgren
fd44e09ebd
(loader) Don't delete the entire link database when the loader runs
2023-07-24 18:37:35 +02:00
Viktor Lofgren
667b0ca0b0
(converter, WIP) Refactor CrawledDomainReader to not return iterators.
...
Instead return a closable class SerializableCrawlDataStream.
2023-07-24 16:28:30 +02:00
Viktor Lofgren
a56953c798
(converter, WIP) Refactor converter to not have to load everything into RAM.
2023-07-24 15:25:09 +02:00
Viktor Lofgren
f91d92cccb
(crawler) WIP
2023-07-20 21:05:16 +02:00
Viktor Lofgren
d7ab21fe34
(*) Refactor Control Service and processes
2023-07-17 21:20:31 +02:00
Viktor Lofgren
bca4bbb6c8
(*) Refactor MQ and MQSM
2023-07-17 13:57:32 +02:00
Viktor Lofgren
e618aa34e9
(control) Name change process->fsm, new FSMs
...
* FSM for spawning processes when messages appear for them
* FSM for removing data flagged for purging
2023-07-17 12:27:27 +02:00
Viktor Lofgren
8b74e3aa0d
(*) File Storage WIP
2023-07-14 17:08:10 +02:00
Viktor Lofgren
74caf9e38a
(processes) Remove forEach-constructs in favor of iterators.
2023-07-12 17:47:36 +02:00
Viktor Lofgren
ac2d7034db
(minor) Bugfix in Path handling
2023-07-11 21:24:29 +02:00
Viktor Lofgren
3c7c77fe21
(minor) Bugfix in Path handling
2023-07-11 17:06:52 +02:00
Viktor Lofgren
4c016b0318
Process monitoring
...
* Also refactored the SQL tables a bit
2023-07-11 14:46:21 +02:00
Viktor
cbbf60a599
Better fingerprinting (#35)
...
* Better fingerprinting for server tech
* Many more features in FeatureExtractor
* Blog specialization
* SiteType table
2023-07-10 18:58:43 +02:00
Viktor
0f9b90eb1c
Better fingerprinting (#35)
...
* Better fingerprinting for server tech
* Many more features in FeatureExtractor
* Blog specialization
* SiteType table
2023-07-10 17:36:12 +02:00
Viktor Lofgren
98d1898610
Bugfix: Don't run the xenforo specialization on phpBB.
2023-07-06 18:12:26 +02:00
Viktor Lofgren
1400fb4a9b
Bugfix: Don't run the xenforo specialization on phpBB.
2023-07-06 18:11:19 +02:00
Adrthegamedev
78f21dd19a
(an attempt to) Add wikidot to wiki generators list
2023-07-05 18:03:36 +02:00
Viktor Lofgren
2cb209ae9c
Better wordpress fingerprinting
2023-07-05 18:03:36 +02:00
Viktor Lofgren
979a620ead
Bugfix where DocumentGeneratorExtractor went out of bounds for generators starting with 'microsoft' or 'adobe' but having no followup string.
2023-07-05 18:03:36 +02:00
Adrthegamedev
5ce894564c
(an attempt to) Add wikidot to wiki generators list
2023-07-03 13:31:42 +02:00
Viktor Lofgren
813fa08bdd
Better wordpress fingerprinting
2023-07-03 11:29:27 +02:00
Viktor Lofgren
e5792ba8b3
Bugfix where DocumentGeneratorExtractor went out of bounds for generators starting with 'microsoft' or 'adobe' but having no followup string.
2023-07-03 11:06:39 +02:00
Viktor Lofgren
42375f0e53
Specialization for javadocs
2023-07-01 20:16:56 +02:00
Viktor Lofgren
eda615de0f
Add generator fingerprint for invision.
2023-07-01 14:47:57 +02:00
Viktor Lofgren
a000256223
Add generator fingerprint for xenforo.
...
Also clean up the specializations logic a bit, and add a barebones specialization for phpbb that cleans out paths we aren't interested in but doesn't touch pruning or summarizing logic for now.
2023-07-01 14:43:49 +02:00
Viktor Lofgren
9bd0e3ce58
Add generator fingerprint for xenforo.
2023-07-01 14:04:48 +02:00
Viktor Lofgren
b4d1e0e81e
Add generator fingerprints for phpBB and flarum.
2023-07-01 13:44:42 +02:00