CatgirlIntelligenceAgency

Author	SHA1	Message	Date
Viktor Lofgren	d0982e7ba5	(converter) Add error handling and lazy load external domain links The converter was not properly initiating the external links for each domain, causing an NPE in conversion. This needs to be loaded later since we don't know the domain we're processing until we've seen it in the crawl data. Also made some refactorings to make finding converter bugs easier, and finding the related domain less awkward from the SerializableCrawlData interface.	2023-12-09 12:33:39 +01:00
Viktor Lofgren	fc30da0d48	(converter) Add academia recognition to DomainProcessor The code now includes an additional function in the DomainProcessor class that checks if a domain is associated with academia. An academic domain is identified by the ".edu" TLD, or fits a specific regex pattern matching domains like .ac.ccTld or .edu.ccTld. If these conditions are met, the search term "special:academia" is added to the domain. The existing academia search filter uses personalized pagerank to select academia-adjacent domains, but it isn't working very well. The hope is that filtering on domain names will be more effective, and that it can supplant the ranking-based approach.	2023-12-08 20:31:34 +01:00
Viktor Lofgren	e6a1052ba7	Simplify CrawlerMain, removing the CrawlerLimiter and using a global HttpFetcher with a virtual thread pool dispatcher instead of the default.	2023-12-08 20:24:01 +01:00
Viktor Lofgren	968dce50fc	(crawler) Refactored IpInterceptingNetworkInterceptor for clarity.	2023-12-08 17:45:46 +01:00
Viktor Lofgren	3bbffd3c22	(crawler) Refactor HttpFetcher to integrate WarcRecorder Partially hook in the WarcRecorder into the crawler process. So far it's not read, but should record the crawled documents. The WarcRecorder and HttpFetcher classes were also refactored and broken apart to be easier to reason about.	2023-12-08 17:12:51 +01:00
Viktor Lofgren	072b5fcd12	Implement Warc-recording wrapper for OkHttp3 client This is a first step of using WARC as an intermediate flight recorder style step in the crawler, ultimately aimed at being able to resume crawls if the crawler is restarted. This component is currently not hooked into anything. The OkHttp3 client wrapper class 'WarcRecordingFetcherClient' was implemented for web archiving. This allows for the recording of HTTP requests and responses. New classes were introduced, 'WarcDigestBuilder', 'IpInterceptingNetworkInterceptor', and 'WarcProtocolReconstructor'. The JWarc dependency was added to the build.gradle file, and relevant unit tests were also introduced. Some HttpFetcher-adjacent structural changes were also done for better organization.	2023-12-08 13:49:16 +01:00
Viktor Lofgren	fabffa80f0	(warc) Integrate the crawler's content type parsing and charset logic into the WarcSideloader	2023-12-07 15:26:01 +01:00
Viktor Lofgren	064265b0b9	(crawler) Move content type/charset sniffing to a separate microlibrary This functionality needs to be accessed by the WarcSideloader, which is in the converter. The resultant microlibrary is tiny, but I think in this case it's justifiable.	2023-12-07 15:16:37 +01:00
Viktor Lofgren	2d5d11645d	(warc) Refactor WarcSideloaderTest to not rely on specific test files on the computer	2023-12-06 19:00:29 +01:00
Viktor Lofgren	cc813a5624	(convert) Add basic support for Warc file sideloading This update includes the integration of the jwarc library and implements support for Warc file sideloading, as a first trial integration with this library.	2023-12-06 18:43:55 +01:00
Viktor Lofgren	f615cf2391	(convert) Loosen up the rules enforcement for documents that have external links.	2023-12-01 17:44:29 +01:00
Viktor Lofgren	166a391eae	(docs) Improve architectural documentation for the crawler.	2023-11-30 21:30:57 +01:00
Viktor Lofgren	5fb24bb27f	(docs) Improve architectural documentation for the converter.	2023-11-30 20:43:22 +01:00
Viktor Lofgren	5a5430b383	(convert) Wiki specialization that should do a better job at removing junk keywords and providing a useful summary.	2023-11-30 20:04:46 +01:00
Viktor Lofgren	1dafa0c74d	(mqapi/control) Repair repartition endpoint, deprecate notify endpoints. The repartition endpoint was mis-addressing its mqapi notifications, omitting the proper nodeId. In fixing this, it became apparent that having both @MqRequest and @MqNotification is a serious footgun, and the two should be unified into a single API where the caller isn't burdened with knowledge of the remote end's implementation specifics.	2023-11-27 16:01:12 +01:00
Viktor Lofgren	09917837d0	(process) Ensure construction exceptions are logged Wrapping these exceptions in a try-catch and logging them with slf4j will ensure they end up in the process logs. The way it worked using the default exception handler, they'd print on console (which nothing captures!), leading to a very annoying debugging experience.	2023-11-22 18:32:06 +01:00
Viktor Lofgren	f58a9f46be	(loader) Don't truncate the entire links table on load This behavior is an old vestige from the days of only having a single loader process. We'd truncate the links table because doing inserts/updates was too slow. This was also important because we had 32 bit IDs, and there's a lot of links between domains to go around... Instead we delete the rows associated with the current node with a stored procedure PURGE_LINKS_TABLE. We also update the PRIMARY KEY to a BIGINT. We'll need to load the data in excess of a billion times to hit an ID rollover, so it'll be fine.	2023-11-16 10:30:12 +01:00
Viktor Lofgren	5de37cb820	(converter) Set feature flags appropriately on stackexchange posts	2023-11-12 15:48:08 +01:00
Viktor Lofgren	e5cee1f46d	(sideload) Fix sideloading so that it doesn't get disproportionately good rankings Also add type flags so that e.g. wikipedia shows up in the wikis filter.	2023-11-12 14:57:57 +01:00
Viktor Lofgren	7617b4cbc2	(crawler) Fix NPE in crawler caused by not having fetched the domains list yet	2023-11-06 18:16:38 +01:00
Viktor Lofgren	e0c769fd19	(converter) Integrate atags.parquet with the encyclopedia sideloader Also clean up stackexchange and dirtree a bit.	2023-11-06 18:03:01 +01:00
Viktor Lofgren	ebd10a5f28	(crawler) Integrate atags.parquet with the crawler so that "important" URLs are prioritized	2023-11-06 16:14:58 +01:00
Viktor Lofgren	2b77184281	(converter) Integrate atags with the topology field	2023-11-06 13:46:44 +01:00
Viktor Lofgren	1847845151	Revert "(loader) Optimize INSERT statements" This reverts commit `7cb92195d1`.	2023-11-04 19:32:02 +01:00
Viktor Lofgren	7cb92195d1	(loader) Optimize INSERT statements INSERT IGNORE is too slow.	2023-11-04 17:43:55 +01:00
Viktor Lofgren	0152004c42	Initial Commit Anchor Tags * Added new (optional) model file in $WMSA_HOME/data/atags.parquet * Converter gets a component for creating a projection of its domains onto the full atags parquet file * New WordFlag ExternalLink * These terms are also for now flagged as title words * Fixed a bug where Title words aliased with UrlDomain words * Fixed a bug in the encyclopedia sideloader that gave everything too high topology ranking	2023-11-04 14:24:17 +01:00
Viktor Lofgren	8f74dbdbb4	(crawler) Set more lenient parameters for recrawl	2023-10-30 11:35:30 +01:00
Viktor Lofgren	fd5a7eac87	(crawler) Exit crawler retriever on thread interrupted	2023-10-30 11:34:16 +01:00
Viktor Lofgren	f613f4f2df	(array) Fix spurious search results This was caused by a bug in the binary search algorithm causing it to sometimes return positive values when encoding a search miss. It was also necessary to get rid of the vestiges of the old LongArray and IntArray classes to make this fix doable.	2023-10-26 15:27:02 +02:00
Viktor Lofgren	a497e4c920	(crawler) Terminate crawler after a few hours of no progress	2023-10-26 12:49:28 +02:00
Viktor Lofgren	d7686b665e	Refactoring * Encyclopedia sideloader; permit providing base URL. * Storage base shows node id in GUI * ProcessLivenessMonitorActor restarts automatically * Clean-up of outbox code	2023-10-25 18:51:02 +02:00
Viktor Lofgren	313cc2965c	(index-creation) Print whether full or prio is created Previous state of saying reverse index for both was pretty confusing.	2023-10-24 16:23:10 +02:00
Viktor Lofgren	e06a8c1de2	(converter) Put upper limit on number of worker threads.	2023-10-22 14:03:09 +02:00
Viktor Lofgren	1d75b974b5	(loader bugfix) Set DOMAIN_METADATA appropriately	2023-10-20 13:03:27 +02:00
Viktor Lofgren	7b5ec6b98f	(executor-service) Embed dist/ in executor-service's docker image	2023-10-19 17:48:34 +02:00
Viktor Lofgren	81dd3809e9	(*) WIP Add node affinity to EC_DOMAIN Very messy commit due to fractalline yak shaving	2023-10-19 17:48:34 +02:00
Viktor Lofgren	16e0738731	(*) Get multi-node routing working.	2023-10-15 18:38:30 +02:00
Viktor Lofgren	4baf9527d7	(*) WIP Control GUI redesign, executor-service, multi-node mq This turned out to be very difficult to do in small isolated steps. Design overhaul of the control gui using bootstrap * Move the actors out of control-service into a new executor-service, that can be run on multiple nodes * Add node-affinity to message queue	2023-10-14 12:08:43 +02:00
Viktor Lofgren	199c459697	(*) Add node-affinity to services, processes and file storage.	2023-10-10 12:32:22 +02:00
Viktor Lofgren	8375237de5	(converter) Add special keyword for websites with a tilde url.	2023-10-09 17:02:32 +02:00
Viktor Lofgren	3889c4bdd9	(refactor) Remove features-search and update documentation	2023-10-09 15:12:30 +02:00
Viktor Lofgren	5dd55c7cad	(refactor) Rename satellite services to application services This is a better descriptor, since they now all implement different applications on top of the core services' APIs.	2023-10-09 13:45:45 +02:00
Viktor Lofgren	c0e61d4c87	(refactor) Move search service into services-satellite	2023-10-09 13:40:01 +02:00
Viktor	8e1abc3f10	(index-reverse) Parallel construction of the reverse indexes. (#52) * (index-reverse) Parallel construction of the reverse indexes. * (array) Remove wasteful calculation of numDistinct before merging two sorted arrays. * (index-reverse) Force changes to disk on close, reduce logging. * (index-reverse) Clean up merging process and add back logging * (run) Add a conservative default for INDEX_CONSTRUCTION_PROCESS_OPTS's parallelism as it eats a lot of RAM * (index-reverse) Better logging during processing * (array) 2GB+ compatible write() function * (index-reverse) We are logging like Bolsonaro and I will not have it. * (reverse-index) Self-diagnostics * (btree) Fix bug in btree reader to do with large data sizes	2023-10-07 10:00:00 +02:00
Viktor Lofgren	c51159672e	(build) Move unit test configuration to root build.gradle	2023-10-04 12:46:22 +02:00
Viktor Lofgren	e0cd3cd991	(converter) Alter StackexchangeSideloader's summary length to align with the rest of the system.	2023-09-26 12:19:43 +02:00
Viktor Lofgren	81ae501e73	(converter) Use ThreadLocalSentenceExtractorProvider for PlainText plugin as well	2023-09-25 18:28:34 +02:00
Viktor Lofgren	f797a92f87	(converter, minor) Use domain name in task heartbeat progress	2023-09-25 18:27:04 +02:00
Viktor Lofgren	a433bbbe45	(converter) Fix rare sentence extractor bug It was caused by non-thread safe concurrent memory access in SentenceExtractor.	2023-09-24 19:39:48 +02:00
Viktor Lofgren	8ca20f184d	(keyword-extraction) Chasing my tail looking for a bug	2023-09-24 19:39:48 +02:00
Viktor Lofgren	dbe9235f3a	(*) Upgrade to JDK21 with preview enabled. ... also move some common configuration into the root build.gradle-file. Support for JDK21 in lombok is a bit sketchy at the moment, but it seems to work. This upgrade is kind of important as the new index construction really benefits from Arena based lifecycle control over off-heap memory.	2023-09-24 10:38:59 +02:00
Viktor Lofgren	f809d22fc6	(loader) Support simultaneous loading of multiple processed data sets	2023-09-22 13:14:58 +02:00
Viktor Lofgren	ad660cf420	(converter) Bugfix: Don't try to Path.of() on optional field	2023-09-21 13:27:09 +02:00
Viktor Lofgren	70aa04c047	(converter, stackexchange-xml) Add the ability to sideload stackexchange data	2023-09-21 12:48:33 +02:00
Viktor Lofgren	d895f83520	(blocking-thread-pool) Move DumbThreadPool to its own micro-library Also rename it to SimpleBlockingThreadPool.	2023-09-20 10:11:49 +02:00
Viktor Lofgren	f6b9e8c5eb	(converter) JavadocSpecialization should truncate its summary if it gets too long	2023-09-17 16:25:33 +02:00
Viktor Lofgren	98bcdf6028	(converter) DirtreeSideloader now trims /index.html from the URL if present This is a crawler artifact in 9 cases out of 10, and may lead to bad URLs.	2023-09-17 16:08:16 +02:00
Viktor Lofgren	9b385ec7cc	(converter) Make it possible to sideload documents from a directory tree	2023-09-17 14:35:06 +02:00
Viktor Lofgren	5c040f7a46	(crawl-spec) Parquetify crawl spec * Crawl-specs are now parquet files * Deprecate the crawl-job-extractor tool	2023-09-17 09:41:34 +02:00
Viktor Lofgren	c67d95c00f	(converter) Write dummy processor log when sideloading	2023-09-14 14:13:03 +02:00
Viktor Lofgren	5e5aaf9a7e	(converter, control) Re-enable sideloading encyclopedia data	2023-09-14 12:12:07 +02:00
Viktor Lofgren	eaeb23d41e	(refactor) Remove converting-model package completely	2023-09-14 11:21:44 +02:00
Viktor Lofgren	c71f6ad417	(converter) Add heartbeats to the loader processes and execute the tasks in parallel for a ~2X speedup	2023-09-14 10:11:57 +02:00
Viktor Lofgren	4799dd769e	(converting) WIP begin to remove converting-model and the old InstructionsCompiler	2023-09-13 19:18:58 +02:00
Viktor Lofgren	24b4606f96	(converter,loader) Converter outputs parquet files instead of compressed json.	2023-09-13 16:13:41 +02:00
Viktor Lofgren	9e185e80ce	(control-service) Add timestamp to file storages.	2023-09-02 14:01:04 +02:00
Viktor Lofgren	5f427d2b4c	(keywords) Clean up leaky abstractions, clean up tests	2023-09-01 13:52:00 +02:00
Viktor Lofgren	8c0ce4fc1d	(index journal; minor) Clean up	2023-09-01 11:32:24 +02:00
Viktor Lofgren	10a74f45ea	(index journal; minor) Even cleaner separation of concerns.	2023-09-01 11:28:02 +02:00
Viktor Lofgren	320dad7f1a	(index journal) Fix leaky abstraction in IndexJournalReader. The caller shouldn't be required to know the on-disk layout of the file to make use of the data in a performant way.	2023-09-01 11:18:13 +02:00
Viktor Lofgren	a6f1335375	(loader) Fix bug where the loader would omit some meta and words.	2023-08-31 17:48:43 +02:00
Viktor Lofgren	3f288e264b	(minor) Clean up dead endpoints	2023-08-29 17:04:54 +02:00
Viktor Lofgren	dd593c292c	(loader) Minor optimizations and bugfixes. * Reduce memory churn in LoaderIndexJournalWriter, fix bug with keyword mappings as well * Remove remains of OldDomains * Ensure LOADER_PROCESS_OPTS gets fed to the processes * LinkdbStatusWriter won't execute batch after each added item post 100 items	2023-08-29 15:37:52 +02:00
Viktor Lofgren	39c1857c61	(heartbeat, reverse-index) Better heartbeat mocking, improved heartbeats for reverse index construction.	2023-08-29 13:07:55 +02:00
Viktor Lofgren	ba4513e82c	(loader) Revert accidental experimental changes that slipped by in an earlier commit	2023-08-28 19:54:56 +02:00
Viktor Lofgren	b6a92506d1	(index) Hook in missing DocIdRewriter This enables documents to be ranked properly.	2023-08-28 19:53:43 +02:00
Viktor Lofgren	3101b74580	(index) Move to a lexicon-free index design This is a system-wide change. The index used to have a lexicon, mapping words to wordIds using a large in-memory hash table. This made index-construction easier, but it also added a fairly significant RAM penalty to both the index service and the loader. The new design moves to 64 bit word identifiers calculated using the murmur hash of the keyword, and an index construction based on merging smaller indices. It also became necessary half-way through to upgrade guice as its error reporting wasn't quite compatible with JDK20.	2023-08-28 14:02:23 +02:00
Viktor Lofgren	194a6057dd	(index,control) Recoverable index backups	2023-08-25 14:57:43 +02:00
Viktor Lofgren	e710e057e2	(db) Remove EC_URL and EC_PAGE_DATA from mariadb database	2023-08-25 13:45:03 +02:00
Viktor Lofgren	460998d512	(index) Move index construction to separate process. This provides a much cleaner separation of concerns, and makes it possible to get rid of a lot of the gunkier parts of the index service. It will also permit lowering the Xmx on the index service a fair bit, so we can get CompressedOOps again :D	2023-08-25 12:52:54 +02:00
Viktor Lofgren	1e6800565a	(system) Remove EdgeId<T> and similar objects They seemed like a good idea at the time, but in practice they're wasting resources and not really providing the clarity I had hoped.	2023-08-24 17:46:02 +02:00
Viktor Lofgren	c909120ae1	(search) Basic working integration of linkdb in search service	2023-08-24 17:24:56 +02:00
Viktor Lofgren	6a04cdfddf	(loader) Implement new linkdb in loader Deprecate the LoadUrl instruction entirely. We no longer need to be told upfront about which URLs to expect, as IDs are generated from the domain id and document ordinal. For now, we no longer store new URLs in different domains. We need to re-implement this somehow, probably in a different job or as a different output.	2023-08-24 13:07:54 +02:00
Viktor Lofgren	ebc84c22fb	Upgrade antique lombok plugin This permits tests to run on JDK20 environments.	2023-08-23 14:34:32 +00:00
Viktor Lofgren	aa0d256d6a	Upgrade code to Java 20. * Change language version * Upgrade Lombok to a JDK20 compatible version	2023-08-23 13:37:49 +00:00
Viktor Lofgren	bf92c270dc	(language) Rollback language filter change a bit. It appears to lead to too much junk in the lexicon.	2023-08-23 10:16:57 +02:00
Viktor Lofgren	e507844616	(language) Rollback language filter change a bit. It appears to lead to too much junk in the lexicon.	2023-08-23 10:03:25 +02:00
Viktor Lofgren	ca12dd59f7	(loader) Fix Cleaner resource leak Apparently Cleaners have an associated native thread, so the way to use them is to have a single static cleaner.	2023-08-22 18:05:00 +02:00
Viktor Lofgren	46409c4c2d	(loader) Use the correct interface for InstructionCounter	2023-08-22 11:11:36 +02:00
Viktor Lofgren	704de50a9b	(forward-index, valuator) HTML features in valuator Put it in the forward index for easy access during index-side valuation.	2023-08-18 11:54:56 +02:00
Viktor Lofgren	fcfe07fb7d	(valuator) Clean up code	2023-08-18 11:26:56 +02:00
Viktor Lofgren	ccf4990add	(minor) Clean up code	2023-08-18 11:26:39 +02:00
Viktor Lofgren	f2638dd845	(feature-extractor) More adtech nonsense	2023-08-18 11:26:19 +02:00
Viktor Lofgren	239980ecae	(minor) Improve comment	2023-08-18 11:26:05 +02:00
Viktor Lofgren	bee815b1c4	(converter) Add monsterinsights as an adtech tracker	2023-08-17 17:44:11 +02:00
Viktor Lofgren	e296b02649	(converter) Optimize LSH based within-domain deduplication	2023-08-17 17:43:46 +02:00
Viktor Lofgren	46d761f34f	(language) fasttext based language filter	2023-08-16 15:48:12 +02:00
Viktor Lofgren	4598c7f40f	(valuation) Penalize wordpress style kebab case urls	2023-08-16 13:11:24 +02:00
Viktor Lofgren	1d486bddee	(crawler) Reduce log spam	2023-08-16 11:12:09 +02:00
Viktor Lofgren	d8073f0dde	(feature-extractor) Add mail.ru counter to non-adtech trackers	2023-08-15 19:10:43 +02:00
Viktor Lofgren	e7192a9cad	(mq) Refactor mq and actor library and move it to libraries out of common	2023-08-15 10:53:23 +02:00
Viktor Lofgren	ce293029c7	(converter) Treat adtech tracking as advertisement.	2023-08-09 14:23:53 +02:00
Viktor Lofgren	251fc63b42	(*) Fix merge gore	2023-08-09 13:33:28 +02:00
Viktor Lofgren	4ab1cd9502	(*) last touches	2023-08-07 12:57:44 +02:00
Viktor	52e2ab45bf	Merge branch 'master' into master-control-program	2023-08-07 12:53:43 +02:00
Viktor Lofgren	c22feaf42e	(crawl) Make crawler limiter request a GC when throttling	2023-08-03 17:58:18 +02:00
Viktor Lofgren	e5c9791b14	(crawler) Fix rare ConcurrentModificationException due to HashSet	2023-08-01 17:28:29 +02:00
Viktor Lofgren	58556af6c7	(db) Use Flyway for database migrations.	2023-08-01 17:08:42 +02:00
Viktor Lofgren	ea66195b97	(loader) Optimize loader by using zstd's direct streaming writer and the Murmur3_128 string hash	2023-08-01 15:02:13 +02:00
Viktor Lofgren	8f0cbf267b	(loader) Perform instruction reads in a separate thread for extra vroom vroom	2023-07-31 14:24:08 +02:00
Viktor Lofgren	2f8488610a	(loader) Fix bug where trailing deferred domain meta inserts weren't executed	2023-07-31 14:23:23 +02:00
Viktor Lofgren	37c4cc68ed	TODO	2023-07-31 10:34:42 +02:00
Viktor Lofgren	1c948eb3d8	(minor) Alter DumbThreadPool in Converter to not claim the threads are crawlers.	2023-07-31 10:33:15 +02:00
Viktor Lofgren	cd90ca820f	YAGNI filter over ConverterDomainTypes	2023-07-31 10:32:47 +02:00
Viktor Lofgren	6f4e767a04	(minor) Re-enable monkey-patch-json for converter	2023-07-31 10:31:46 +02:00
Viktor Lofgren	5c071ce4d3	(crawler) Clean up the code and remove unnecessary logging	2023-07-30 16:53:39 +02:00
Viktor Lofgren	caf3d231a8	(crawler) Fix rare issue with NPEs if the crawl queue is empty	2023-07-30 16:53:13 +02:00
Viktor Lofgren	730e8f74e4	(crawler) Even more memory optimizations. * Fix minor resource leak in zstd streams * Use pools for zstd streams * Reduce the SSL session cache size	2023-07-30 14:19:55 +02:00
Viktor Lofgren	aba134284f	(crawler) Reduce log spam	2023-07-29 19:22:58 +02:00
Viktor Lofgren	2a6183f9e0	(crawler) Dynamic throttling of the number of active crawl jobs permitted to spawn; reduce queue size.	2023-07-29 19:20:09 +02:00
Viktor Lofgren	ee143bbc48	(crawler, converter) Fix so that DumbThreadPool actually waits for termination as intended.	2023-07-29 19:19:09 +02:00
Viktor Lofgren	d3f01bd171	(crawler, converter) Remove monkey patched gson from dependencies	2023-07-29 19:18:12 +02:00
Viktor Lofgren	05ba3bab96	(crawler) Make SitemapRetriever abort on too large sitemaps.	2023-07-29 19:18:12 +02:00
Viktor Lofgren	d2b6b2044c	(crawler) Reduce log spam in HttpFetcherImpl	2023-07-29 19:18:12 +02:00
Viktor Lofgren	7611b7900d	(crawler) Reduce long term memory allocation in DomainCrawlFrontier	2023-07-29 19:18:12 +02:00
Viktor Lofgren	01476577b8	(loader) Speed up loading back to original speeds with a cascading DELETE FROM EC_URL rather than EC_PAGE_DATA. * Also clean up code and have proper rollbacks for transactions.	2023-07-28 22:00:07 +02:00
Viktor Lofgren	e237df4a10	(converter) Use a dumb thread pool instead of Java's executor service.	2023-07-28 18:15:16 +02:00
Viktor Lofgren	f11103d31d	(WIP) Make it possible to sideload encyclopedia data. This is mostly a pilot track for sideloading other large websites. Also change converter to produce a more compact output (java serialization instead of json).	2023-07-28 18:14:43 +02:00
Viktor Lofgren	507f26ad47	(converter) Refactor converter to not keep instructions list in RAM.	2023-07-25 22:06:46 +02:00
Viktor Lofgren	fd44e09ebd	(loader) Don't delete the entire link database when the loader runs	2023-07-24 18:37:35 +02:00
Viktor Lofgren	667b0ca0b0	(converter, WIP) Refactor CrawledDomainReader to not return iterators. Instead return a closable class SerializableCrawlDataStream.	2023-07-24 16:28:30 +02:00
Viktor Lofgren	a56953c798	(converter, WIP) Refactor converter to not have to load everything into RAM.	2023-07-24 15:25:09 +02:00
Viktor Lofgren	35b29e4f9e	(crawler) Clean up and refactor the code a bit	2023-07-23 19:06:37 +02:00
Viktor Lofgren	69f333c0bf	(crawler) Clean up and refactor the code a bit	2023-07-23 18:59:14 +02:00
Viktor Lofgren	c069c8c182	(crawler) Clean up crawl data reference and recrawl logic	2023-07-22 18:42:21 +02:00
Viktor Lofgren	9e4aa7da7c	(crawler) Support for X-Robots-Tag	2023-07-22 18:42:21 +02:00
Viktor Lofgren	58f2f86ea8	(crawler) Don't read all the data into RAM when doing a refresh-crawl	2023-07-21 19:47:52 +02:00
Viktor Lofgren	f91d92cccb	(crawler) WIP	2023-07-20 21:05:16 +02:00
Viktor Lofgren	d7ab21fe34	(*) Refactor Control Service and processes	2023-07-17 21:20:31 +02:00
Viktor Lofgren	bca4bbb6c8	(*) Refactor MQ and MQSM	2023-07-17 13:57:32 +02:00
Viktor Lofgren	e618aa34e9	(control) Name change process->fsm, new fsm:s * FSM for spawning processes when messages appear for them * FSM for removing data flagged for purging	2023-07-17 12:27:27 +02:00
Viktor Lofgren	8b74e3aa0d	(*) File Storage WIP	2023-07-14 17:08:10 +02:00
Viktor Lofgren	5deec63667	(work-log) Better tests	2023-07-12 18:04:06 +02:00
Viktor Lofgren	74caf9e38a	(processes) Remove forEach-constructs in favor of iterators.	2023-07-12 17:47:36 +02:00
Viktor Lofgren	ac2d7034db	(minor) Bugfix in Path handling	2023-07-11 21:24:29 +02:00
Viktor Lofgren	77261a38cd	(control, WIP) MQFSM and ProcessService are sitting in a tree We're spawning processes from the MQFSM in control service now!	2023-07-11 17:08:43 +02:00
Viktor Lofgren	3c7c77fe21	(minor) Bugfix in Path handling	2023-07-11 17:06:52 +02:00
Viktor Lofgren	4c016b0318	Process monitoring * Also refactored the SQL tables a bit	2023-07-11 14:46:21 +02:00
Viktor	cbbf60a599	Better fingerprinting (#35) * Better fingerprinting for server tech * Many more features in FeatureExtractor * Blog specialization * SiteType table	2023-07-10 18:58:43 +02:00
Viktor Lofgren	f03146de4b	(crawler) Fix poor handling of duplicate ids * Also clean up the code a bit	2023-07-10 18:58:43 +02:00
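
Commit fc30da0d48 in the log above describes the academia check only in prose: a domain counts as academic if it uses the ".edu" TLD or looks like .ac.ccTld or .edu.ccTld. A minimal sketch of such a check in Java follows. The class and method names (AcademiaDomainCheck, isAcademic) and the exact regex are assumptions made for illustration; the pattern actually used in DomainProcessor is not shown in the log and may differ.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper, not the project's actual DomainProcessor code.
final class AcademiaDomainCheck {
    // Assumption: a ccTld is exactly two ASCII letters, and "ac" or "edu"
    // appears as the label directly in front of it, e.g. ox.ac.uk.
    private static final Pattern ACADEMIA_CC_TLD =
            Pattern.compile(".*\\.(ac|edu)\\.[a-z]{2}");

    static boolean isAcademic(String domain) {
        String d = domain.toLowerCase(Locale.ROOT);
        // Either the plain .edu TLD, or the .ac.ccTld / .edu.ccTld shape
        return d.endsWith(".edu") || ACADEMIA_CC_TLD.matcher(d).matches();
    }
}
```

A caller would add the "special:academia" search term to the domain when isAcademic returns true. Matching on the name alone is cheap and deterministic, which fits the commit's stated hope of supplanting the ranking-based filter.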

307 Commits
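
Commit f613f4f2df in the log above traces spurious search results to a binary search that sometimes returned positive values when encoding a miss. The sketch below illustrates that kind of convention in Java, using the same encoding as java.util.Arrays.binarySearch: a miss is returned as -(insertionPoint) - 1, so it is always negative. The class name MissEncodingSearch and the choice of this particular encoding are assumptions; the log does not say how the project's array library encodes a miss.

```java
// Hypothetical illustration, not the project's array library.
final class MissEncodingSearch {
    /**
     * Returns the index of key in the sorted array, or
     * -(insertionPoint) - 1 when the key is absent.
     * The result is negative if and only if the search missed.
     */
    static int binarySearch(long[] sorted, long key) {
        int low = 0;
        int high = sorted.length - 1;
        while (low <= high) {
            int mid = (low + high) >>> 1; // unsigned shift avoids int overflow
            long value = sorted[mid];
            if (value < key) {
                low = mid + 1;
            } else if (value > key) {
                high = mid - 1;
            } else {
                return mid; // hit
            }
        }
        return -(low + 1); // miss: low is the insertion point
    }

    static boolean isHit(int result) {
        return result >= 0;
    }
}
```

With this encoding, a miss at insertion point 0 is -1 rather than 0, so it cannot be confused with a hit at index 0. An encoding that can yield a non-negative value for a miss would produce exactly the class of false positives the commit describes.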