Viktor Lofgren
10a74f45ea
(index journal; minor) Even cleaner separation of concerns.
2023-09-01 11:28:02 +02:00
Viktor Lofgren
320dad7f1a
(index journal) Fix leaky abstraction in IndexJournalReader.
...
The caller shouldn't be required to know the on-disk layout of the file to make use of the data in a performant way.
2023-09-01 11:18:13 +02:00
Viktor Lofgren
a6f1335375
(loader) Fix bug where the loader would omit some meta and words.
2023-08-31 17:48:43 +02:00
Viktor Lofgren
3f288e264b
(minor) Clean up dead endpoints
2023-08-29 17:04:54 +02:00
Viktor Lofgren
dd593c292c
(loader) Minor optimizations and bugfixes.
...
* Reduce memory churn in LoaderIndexJournalWriter, fix bug with keyword mappings as well
* Remove remains of OldDomains
* Ensure LOADER_PROCESS_OPTS gets fed to the processes
* LinkdbStatusWriter won't execute batch after each added item post 100 items
2023-08-29 15:37:52 +02:00
Viktor Lofgren
39c1857c61
(heartbeat, reverse-index) Better heartbeat mocking, improved heartbeats for reverse index construction.
2023-08-29 13:07:55 +02:00
Viktor Lofgren
ba4513e82c
(loader) Revert accidental experimental changes that slipped by in an earlier commit
2023-08-28 19:54:56 +02:00
Viktor Lofgren
b6a92506d1
(index) Hook in missing DocIdRewriter
...
This enables documents to be ranked properly.
2023-08-28 19:53:43 +02:00
Viktor Lofgren
3101b74580
(index) Move to a lexicon-free index design
...
This is a system-wide change. The index used to have a lexicon, mapping words to wordIds using a large in-memory hash table. This made index construction easier, but it also added a fairly significant RAM penalty to both the index service and the loader.
The new design moves to 64 bit word identifiers calculated using the murmur hash of the keyword, and an index construction based on merging smaller indices.
It also became necessary half-way through to upgrade guice as its error reporting wasn't *quite* compatible with JDK20.
2023-08-28 14:02:23 +02:00
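The lexicon-free idea in the commit above can be illustrated with a small sketch: instead of looking a keyword up in a shared in-memory table, the 64 bit identifier is computed directly from the keyword's bytes. This is not the project's code. The commit names murmur hash; to stay dependency-free the sketch swaps in FNV-1a (64 bit) as a stand-in, and the class and method names (`WordIdSketch`, `wordId`) are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Sketch only: derives a 64 bit word id from the keyword itself,
// so no word -> id lexicon has to be kept in RAM.
// ASSUMPTION: FNV-1a stands in for the murmur hash named in the commit.
final class WordIdSketch {
    private static final long FNV_OFFSET_BASIS = 0xcbf29ce484222325L;
    private static final long FNV_PRIME = 0x100000001b3L;

    // Hash the UTF-8 bytes of the keyword; the same keyword always yields the same id.
    public static long wordId(String keyword) {
        long hash = FNV_OFFSET_BASIS;
        for (byte b : keyword.getBytes(StandardCharsets.UTF_8)) {
            hash ^= (b & 0xff);   // mix in one byte
            hash *= FNV_PRIME;    // multiply modulo 2^64 (long overflow wraps)
        }
        return hash;
    }

    public static void main(String[] args) {
        // Any process can compute the id independently, without shared state.
        System.out.println(Long.toHexString(wordId("marginalia")));
    }
}
```

Because the id is a pure function of the keyword, separately built partial indices agree on identifiers and can be merged without consulting a central table. The trade-off is a small, nonzero chance of hash collisions between distinct keywords.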
Viktor Lofgren
194a6057dd
(index,control) Recoverable index backups
2023-08-25 14:57:43 +02:00
Viktor Lofgren
e710e057e2
(db) Remove EC_URL and EC_PAGE_DATA from mariadb database
2023-08-25 13:45:03 +02:00
Viktor Lofgren
460998d512
(index) Move index construction to separate process.
...
This provides a much cleaner separation of concerns, and makes it possible to get rid of a lot of the gunkier parts of the index service. It will also permit lowering the Xmx on the index service a fair bit, so we can get CompressedOOps again :D
2023-08-25 12:52:54 +02:00
Viktor Lofgren
1e6800565a
(system) Remove EdgeId<T> and similar objects
...
They seemed like a good idea at the time, but in practice they're wasting resources and not really providing the clarity I had hoped.
2023-08-24 17:46:02 +02:00
Viktor Lofgren
c909120ae1
(search) Basic working integration of linkdb in search service
2023-08-24 17:24:56 +02:00
Viktor Lofgren
6a04cdfddf
(loader) Implement new linkdb in loader
...
Deprecate the LoadUrl instruction entirely. We no longer need to be told upfront about which URLs to expect, as IDs are generated from the domain id and document ordinal.
For now, we no longer store new URLs in different domains. We need to re-implement this somehow, probably in a different job or as a different output.
2023-08-24 13:07:54 +02:00
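As a rough illustration of generating IDs "from the domain id and document ordinal", the two numbers can be packed into a single long. The sketch below is hypothetical: the bit split (`ORDINAL_BITS = 26`), the class name, and the method names are assumptions made for the example, not the layout the loader actually uses.

```java
// Sketch only: packs a domain id and a per-domain document ordinal into one long.
// ASSUMPTION: 26 bits for the ordinal is an arbitrary choice for illustration.
final class DocIdSketch {
    static final int ORDINAL_BITS = 26;
    static final long ORDINAL_MASK = (1L << ORDINAL_BITS) - 1;

    public static long encode(int domainId, int ordinal) {
        if (domainId < 0) {
            throw new IllegalArgumentException("domainId must be non-negative");
        }
        if (ordinal < 0 || ordinal > ORDINAL_MASK) {
            throw new IllegalArgumentException("ordinal out of range");
        }
        // High bits hold the domain, low bits hold the ordinal.
        return ((long) domainId << ORDINAL_BITS) | ordinal;
    }

    public static int domainId(long docId) {
        return (int) (docId >>> ORDINAL_BITS);
    }

    public static int ordinal(long docId) {
        return (int) (docId & ORDINAL_MASK);
    }

    public static void main(String[] args) {
        long id = encode(5, 7);
        System.out.println(domainId(id) + " / " + ordinal(id));
    }
}
```

With a scheme like this the ID of a document is known as soon as its domain and position are known, which is why nothing needs to be announced upfront.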
Viktor Lofgren
ebc84c22fb
Upgrade antique lombok plugin
...
This permits tests to run on JDK20 environments.
2023-08-23 14:34:32 +00:00
Viktor Lofgren
aa0d256d6a
Upgrade code to Java 20.
...
* Change language version
* Upgrade Lombok to a JDK20 compatible version
2023-08-23 13:37:49 +00:00
Viktor Lofgren
bf92c270dc
(language) Rollback language filter change a bit.
...
It appears to lead to too much junk in the lexicon.
2023-08-23 10:16:57 +02:00
Viktor Lofgren
e507844616
(language) Rollback language filter change a bit.
...
It appears to lead to too much junk in the lexicon.
2023-08-23 10:03:25 +02:00
Viktor Lofgren
ca12dd59f7
(loader) Fix Cleaner resource leak
...
Apparently Cleaners have an associated native thread, so the way to use them is to have a single static cleaner.
2023-08-22 18:05:00 +02:00
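The pattern described in the commit above, one shared static `java.lang.ref.Cleaner` rather than one per object, might look roughly like this. It is a minimal sketch, not the loader's code: `SharedCleanerSketch`, `NativeBuffer`, and `State` are hypothetical names, and the "resource" is just a flag.

```java
import java.lang.ref.Cleaner;
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch only: every instance registers with ONE static Cleaner.
// Calling Cleaner.create() per object would start a thread per object.
final class SharedCleanerSketch {
    private static final Cleaner CLEANER = Cleaner.create();

    static final class NativeBuffer implements AutoCloseable {
        // The cleanup action must not reference the NativeBuffer itself,
        // otherwise the buffer could never become unreachable.
        private static final class State implements Runnable {
            final AtomicBoolean released = new AtomicBoolean(false);

            @Override
            public void run() {
                released.set(true); // stand-in for freeing a real resource
            }
        }

        private final State state = new State();
        private final Cleaner.Cleanable cleanable = CLEANER.register(this, state);

        public boolean isReleased() {
            return state.released.get();
        }

        @Override
        public void close() {
            cleanable.clean(); // runs the action at most once
        }
    }

    public static void main(String[] args) {
        try (NativeBuffer buffer = new NativeBuffer()) {
            System.out.println("released before close: " + buffer.isReleased());
        }
    }
}
```

Explicit `close()` remains the primary release path here; the cleaner only acts as a safety net if an instance is dropped without being closed.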
Viktor Lofgren
46409c4c2d
(loader) Use the correct interface for InstructionCounter
2023-08-22 11:11:36 +02:00
Viktor Lofgren
704de50a9b
(forward-index, valuator) HTML features in valuator
...
Put it in the forward index for easy access during index-side valuation.
2023-08-18 11:54:56 +02:00
Viktor Lofgren
fcfe07fb7d
(valuator) Clean up code
2023-08-18 11:26:56 +02:00
Viktor Lofgren
ccf4990add
(minor) Clean up code
2023-08-18 11:26:39 +02:00
Viktor Lofgren
f2638dd845
(feature-extractor) More adtech nonsense
2023-08-18 11:26:19 +02:00
Viktor Lofgren
239980ecae
(minor) Improve comment
2023-08-18 11:26:05 +02:00
Viktor Lofgren
bee815b1c4
(converter) Add monsterinsights as an adtech tracker
2023-08-17 17:44:11 +02:00
Viktor Lofgren
e296b02649
(converter) Optimize LSH based within-domain deduplication
2023-08-17 17:43:46 +02:00
Viktor Lofgren
46d761f34f
(language) fasttext based language filter
2023-08-16 15:48:12 +02:00
Viktor Lofgren
4598c7f40f
(valuation) Penalize wordpress style kebab case urls
2023-08-16 13:11:24 +02:00
Viktor Lofgren
1d486bddee
(crawler) Reduce log spam
2023-08-16 11:12:09 +02:00
Viktor Lofgren
d8073f0dde
(feature-extractor) Add mail.ru counter to non-adtech trackers
2023-08-15 19:10:43 +02:00
Viktor Lofgren
e7192a9cad
(mq) Refactor mq and actor library and move it to libraries out of common
2023-08-15 10:53:23 +02:00
Viktor Lofgren
ce293029c7
(converter) Treat adtech tracking as advertisement.
2023-08-09 14:23:53 +02:00
Viktor Lofgren
251fc63b42
(*) Fix merge gore
2023-08-09 13:33:28 +02:00
Viktor Lofgren
4ab1cd9502
(*) last touches
2023-08-07 12:57:44 +02:00
Viktor
52e2ab45bf
Merge branch 'master' into master-control-program
2023-08-07 12:53:43 +02:00
Viktor Lofgren
c22feaf42e
(crawl) Make crawler limiter request a GC when throttling
2023-08-03 17:58:18 +02:00
Viktor Lofgren
e5c9791b14
(crawler) Fix rare ConcurrentModificationException due to HashSet
2023-08-01 17:28:29 +02:00
Viktor Lofgren
58556af6c7
(db) Use flyway for database migrations.
2023-08-01 17:08:42 +02:00
Viktor Lofgren
ea66195b97
(loader) Optimize loader by using zstd's direct streaming writer and the Murmur3_128 string hash
2023-08-01 15:02:13 +02:00
Viktor Lofgren
8f0cbf267b
(loader) Perform instruction reads in a separate thread for extra vroom vroom
2023-07-31 14:24:08 +02:00
Viktor Lofgren
2f8488610a
(loader) Fix bug where trailing deferred domain meta inserts weren't executed
2023-07-31 14:23:23 +02:00
Viktor Lofgren
37c4cc68ed
TODO
2023-07-31 10:34:42 +02:00
Viktor Lofgren
1c948eb3d8
(minor) Alter DumbThreadPool in Converter to not claim the threads are crawlers.
2023-07-31 10:33:15 +02:00
Viktor Lofgren
cd90ca820f
YAGNI filter over ConverterDomainTypes
2023-07-31 10:32:47 +02:00
Viktor Lofgren
6f4e767a04
(minor) Re-enable monkey-patch-json for converter
2023-07-31 10:31:46 +02:00
Viktor Lofgren
5c071ce4d3
(crawler) Clean up the code and remove unnecessary logging
2023-07-30 16:53:39 +02:00
Viktor Lofgren
caf3d231a8
(crawler) Fix rare issue with NPEs if the crawl queue is empty
2023-07-30 16:53:13 +02:00
Viktor Lofgren
730e8f74e4
(crawler) Even more memory optimizations.
...
* Fix minor resource leak in zstd streams
* Use pools for zstd streams
* Reduce the SSL session cache size
2023-07-30 14:19:55 +02:00
Viktor Lofgren
aba134284f
(crawler) Reduce log spam
2023-07-29 19:22:58 +02:00
Viktor Lofgren
2a6183f9e0
(crawler) Dynamic throttling of the number of active crawl jobs permitted to spawn; reduce queue size.
2023-07-29 19:20:09 +02:00
Viktor Lofgren
ee143bbc48
(crawler, converter) Fix so that DumbThreadPool actually waits for termination as intended.
2023-07-29 19:19:09 +02:00
Viktor Lofgren
d3f01bd171
(crawler, converter) Remove monkey patched gson from dependencies
2023-07-29 19:18:12 +02:00
Viktor Lofgren
05ba3bab96
(crawler) Make SitemapRetriever abort on too large sitemaps.
2023-07-29 19:18:12 +02:00
Viktor Lofgren
d2b6b2044c
(crawler) Reduce log spam in HttpFetcherImpl
2023-07-29 19:18:12 +02:00
Viktor Lofgren
7611b7900d
(crawler) Reduce long term memory allocation in DomainCrawlFrontier
2023-07-29 19:18:12 +02:00
Viktor Lofgren
01476577b8
(loader) Speed up loading back to original speeds with a cascading DELETE FROM EC_URL rather than EC_PAGE_DATA.
...
* Also clean up code and have proper rollbacks for transactions.
2023-07-28 22:00:07 +02:00
Viktor Lofgren
e237df4a10
(converter) Use a dumb thread pool instead of Java's executor service.
2023-07-28 18:15:16 +02:00
Viktor Lofgren
f11103d31d
(WIP) Make it possible to sideload encyclopedia data.
...
This is mostly a pilot track for sideloading other large websites.
Also change converter to produce a more compact output (java serialization instead of json).
2023-07-28 18:14:43 +02:00
Viktor Lofgren
507f26ad47
(converter) Refactor converter to not keep instructions list in RAM.
2023-07-25 22:06:46 +02:00
Viktor Lofgren
fd44e09ebd
(loader) Don't delete the entire link database when the loader runs
2023-07-24 18:37:35 +02:00
Viktor Lofgren
667b0ca0b0
(converter, WIP) Refactor CrawledDomainReader to not return iterators.
...
Instead return a closable class SerializableCrawlDataStream.
2023-07-24 16:28:30 +02:00
Viktor Lofgren
a56953c798
(converter, WIP) Refactor converter to not have to load everything into RAM.
2023-07-24 15:25:09 +02:00
Viktor Lofgren
35b29e4f9e
(crawler) Clean up and refactor the code a bit
2023-07-23 19:06:37 +02:00
Viktor Lofgren
69f333c0bf
(crawler) Clean up and refactor the code a bit
2023-07-23 18:59:14 +02:00
Viktor Lofgren
c069c8c182
(crawler) Clean up crawl data reference and recrawl logic
2023-07-22 18:42:21 +02:00
Viktor Lofgren
9e4aa7da7c
(crawler) Support for X-Robots-Tag
2023-07-22 18:42:21 +02:00
Viktor Lofgren
58f2f86ea8
(crawler) Don't read all the data into RAM when doing a refresh-crawl
2023-07-21 19:47:52 +02:00
Viktor Lofgren
f91d92cccb
(crawler) WIP
2023-07-20 21:05:16 +02:00
Viktor Lofgren
d7ab21fe34
(*) Refactor Control Service and processes
2023-07-17 21:20:31 +02:00
Viktor Lofgren
bca4bbb6c8
(*) Refactor MQ and MQSM
2023-07-17 13:57:32 +02:00
Viktor Lofgren
e618aa34e9
(control) Rename process to fsm; add new FSMs
...
* FSM for spawning processes when messages appear for them
* FSM for removing data flagged for purging
2023-07-17 12:27:27 +02:00
Viktor Lofgren
8b74e3aa0d
(*) File Storage WIP
2023-07-14 17:08:10 +02:00
Viktor Lofgren
5deec63667
(work-log) Better tests
2023-07-12 18:04:06 +02:00
Viktor Lofgren
74caf9e38a
(processes) Remove forEach-constructs in favor of iterators.
2023-07-12 17:47:36 +02:00
Viktor Lofgren
ac2d7034db
(minor) Bugfix in Path handling
2023-07-11 21:24:29 +02:00
Viktor Lofgren
77261a38cd
(control, WIP) MQFSM and ProcessService are sitting in a tree
...
We're spawning processes from the MQFSM in control service now!
2023-07-11 17:08:43 +02:00
Viktor Lofgren
3c7c77fe21
(minor) Bugfix in Path handling
2023-07-11 17:06:52 +02:00
Viktor Lofgren
4c016b0318
Process monitoring
...
* Also refactored the SQL tables a bit
2023-07-11 14:46:21 +02:00
Viktor
cbbf60a599
Better fingerprinting ( #35 )
...
* Better fingerprinting for server tech
* Many more features in FeatureExtractor
* Blog specialization
* SiteType table
2023-07-10 18:58:43 +02:00
Viktor Lofgren
f03146de4b
(crawler) Fix bug with poor handling of duplicate ids
...
* Also clean up the code a bit
2023-07-10 18:58:43 +02:00
Viktor
0f9b90eb1c
Better fingerprinting ( #35 )
...
* Better fingerprinting for server tech
* Many more features in FeatureExtractor
* Blog specialization
* SiteType table
2023-07-10 17:36:12 +02:00
Viktor Lofgren
2619d196bb
(crawler) Fix bug with poor handling of duplicate ids
...
* Also clean up the code a bit
2023-07-07 19:56:14 +02:00
Viktor Lofgren
98d1898610
Bugfix: Don't run the xenforo specialization on phpBB.
2023-07-06 18:12:26 +02:00
Viktor Lofgren
1400fb4a9b
Bugfix: Don't run the xenforo specialization on phpBB.
2023-07-06 18:11:19 +02:00
Viktor Lofgren
647bbfa617
Fix so that crawler tests don't sometimes fetch real sitemaps when they're run.
2023-07-06 18:05:23 +02:00
Viktor Lofgren
b73fcc19fe
Fix so that crawler tests don't sometimes fetch real sitemaps when they're run.
2023-07-06 18:05:03 +02:00
Viktor Lofgren
34653f03a2
Temporary bugfix, need to find source
2023-07-06 14:13:03 +02:00
Adrthegamedev
78f21dd19a
(an attempt to) Add wikidot to wiki generators list
2023-07-05 18:03:36 +02:00
Viktor Lofgren
2cb209ae9c
Better wordpress fingerprinting
2023-07-05 18:03:36 +02:00
Viktor Lofgren
979a620ead
Bugfix where DocumentGeneratorExtractor went out of bounds for generators starting with 'microsoft' or 'adobe' but having no followup string.
2023-07-05 18:03:36 +02:00
Adrthegamedev
5ce894564c
(an attempt to) Add wikidot to wiki generators list
2023-07-03 13:31:42 +02:00
Viktor Lofgren
813fa08bdd
Better wordpress fingerprinting
2023-07-03 11:29:27 +02:00
Viktor Lofgren
e5792ba8b3
Bugfix where DocumentGeneratorExtractor went out of bounds for generators starting with 'microsoft' or 'adobe' but having no followup string.
2023-07-03 11:06:39 +02:00
Viktor Lofgren
42375f0e53
Specialization for javadocs
2023-07-01 20:16:56 +02:00
Viktor Lofgren
24dce8c03b
Remove link filtering for mediawiki, it's too strict and not every site uses the /wiki/-pattern.
2023-07-01 19:32:25 +02:00
Viktor Lofgren
eda615de0f
Add generator fingerprint for invision.
2023-07-01 14:47:57 +02:00
Viktor Lofgren
a000256223
Add generator fingerprint for xenforo.
...
Also clean up the specializations logic a bit, and add a barebones specialization for phpbb that cleans out paths we aren't interested in but doesn't touch pruning or summarizing logic for now.
2023-07-01 14:43:49 +02:00
Viktor Lofgren
9bd0e3ce58
Add generator fingerprint for xenforo.
2023-07-01 14:04:48 +02:00