CatgirlIntelligenceAgency

Author	SHA1	Message	Date
Viktor Lofgren	440e097d78	(crawler) WIP integration of WARC files into the crawler and converter process. This commit is in a pretty rough state. It refactors the crawler fairly significantly to offer better separation of concerns. It replaces the zstd compressed json files used to store crawl data with WARC files entirely, and the converter is modified to be able to consume this data. This works, -ish. There appears to be some bug relating to reading robots.txt, and the X-Robots-Tag header is no longer processed either. A problem is that the WARC files are a bit too large. It will probably be necessary to introduce a new format to store the crawl data long term, something like parquet; and use WARCs for intermediate storage to enable the crawler to be restarted without needing a recrawl.	2023-12-13 15:33:42 +01:00
Viktor Lofgren	b74a3ebd85	(crawler) WIP integration of WARC files into the crawler process. At this stage, the crawler will use the WARCs to resume a crawl if it terminates incorrectly. This is a WIP commit, since the warc files are not fully incorporated into the work flow, they are deleted after the domain is crawled. The commit also includes fairly invasive refactoring of the crawler classes, to accomplish better separation of concerns.	2023-12-11 19:32:58 +01:00
Viktor Lofgren	45987a1d98	Merge branch 'master' into warc	2023-12-11 14:32:35 +01:00
Viktor Lofgren	8f0950fc44	(geoip) Fix incorrect synchronization.	2023-12-11 14:01:39 +01:00
Viktor Lofgren	30bc3f9281	(converter) Use the prefix ip: instead of geoip: for country codes This is the same as the prefix for the IP address, but I don't think that substantially matters, as the two have such different namespaces there can be no confusion.	2023-12-11 13:59:23 +01:00
Viktor Lofgren	f655ec5a5c	(*) Refactor GeoIP-related code In this commit, GeoIP-related classes are refactored and relocated to a common library as they are shared across multiple services. The crawler is refactored to enable the GeoIpBlocklist to use the new GeoIpDictionary as the base of its decisions. The converter is modified to query this data to add a geoip:-keyword to documents to permit limiting a search to the country of the hosting server. The commit also adds due BY-SA attribution in the search engine footer for the source of the IP geolocation data.	2023-12-10 17:30:43 +01:00
Viktor Lofgren	84b4158555	(minor) Fix broken test	2023-12-10 14:39:20 +01:00
Viktor Lofgren	91dd45cf64	(search) IP and IP geolocation in site info view This commit also fixes a bug in the loader where the IP field wouldn't always populate as intended, and refactors the DomainInformationService to use significantly fewer SQL queries.	2023-12-09 20:06:55 +01:00
Viktor Lofgren	37af60254f	(search) Better recipe filter Tune the recipe filter to give better results, by using the 'popular' domains set along with excluding results with heavy tracking.	2023-12-09 20:06:55 +01:00
Viktor Lofgren	f0e736d4ea	(search) Update the search profile 'Academia' to strictly filter on academic tlds The previous version used a personalized pagerank centering on a few academic domains, but this didn't work very well and most results were not very academia-centric.	2023-12-09 20:06:55 +01:00
Viktor Lofgren	e3ebb0c5bb	(*) Rename the search filter 'RETRO' into 'POPULAR' This will make the terminology more consistent between the GUI and the code. The rankings yaml still uses 'retro' though, to retain compatibility.	2023-12-09 20:06:54 +01:00
Viktor Lofgren	6382f779c3	(search) Revert back to using 'Popular' as the default search filter Unfiltered is a bit too ... unfiltered, and gives a bad first impression for many queries.	2023-12-09 16:34:12 +01:00
Viktor Lofgren	8ef34883a8	(search) Move site information out of the search service and into assistant. This reduces the impact of restarting the search service, as the site information takes a few minutes to load during which it's not available. It also permits exposing this information via API in the future if there is interest in this. The assistant service was also modified to do a late load of the suggestions trie, as this is a major contributor to its start-up time. Finally, some changes were made to the client library, a new get() method was added that takes a TypeToken to allow deserialization of generics such as List<Foo>, and the scheduler was also modified to use virtual threads.	2023-12-09 16:30:06 +01:00
Viktor Lofgren	5c46af0edb	(converter) Refactor EncyclopediaMarginaliaNuSideloader to use ProcessingIterator Refactored the getDocumentsStream method in EncyclopediaMarginaliaNuSideloader to use the newly extracted ProcessingIterator class that encapsulates processing a stream of results from e.g. a database query in parallel and returning the computed results as an iterator. The iterator was also improved to be more reliable; previous versions of the logic would sometimes deadlock due to false positives in hasMore().	2023-12-09 15:20:53 +01:00
Viktor Lofgren	b6511fbfe2	(converter) Add AnchorTextKeywords to EncyclopediaMarginaliaNuSideloader processing The commit updates EncyclopediaMarginaliaNuSideloader to include the AnchorTextKeywords in processing documents, aiding search result relevance. It also removes old test-related functionality and a large but fairly useless test previously used to debug a specific problem, to the detriment of the overall code quality.	2023-12-09 15:20:52 +01:00
Viktor Lofgren	eccb12b366	(control) Fix spurious state detection in control-side actors A race condition was found where precession actors would sometimes skip a step, because when invoking ExecutorRemoteActor.getState(), it would get the last 'OK' actor state from a previous run of the actor! To avoid this, the trigger method was changed from returning a boolean to the message ID, negative if an error occurred, to be passed to getState to select only messages that pertain to the present or future runs.	2023-12-09 12:50:05 +01:00
Viktor Lofgren	d0982e7ba5	(converter) Add error handling and lazy load external domain links The converter was not properly initiating the external links for each domain, causing an NPE in conversion. This needs to be loaded later since we don't know the domain we're processing until we've seen it in the crawl data. Also made some refactorings to make finding converter bugs easier, and finding the related domain less awkward from the SerializableCrawlData interface.	2023-12-09 12:33:39 +01:00
Viktor Lofgren	fc30da0d48	(converter) Add academia recognition to DomainProcessor The code now includes an additional function in the DomainProcessor class that checks if a domain is associated with academia. An academic domain is identified by the ".edu" TLD, or fits a specific regex pattern matching domains like .ac.ccTld or .edu.ccTld. If these conditions are met, the search term "special:academia" is added to the domain. The existing academia search filter uses personalized pagerank to select academia-adjacent domains, but it isn't working very well. The hope is that filtering on domain names will be more effective, and that it can supplant the ranking-based approach.	2023-12-08 20:31:34 +01:00
Viktor Lofgren	e6a1052ba7	Simplify CrawlerMain, removing the CrawlerLimiter and using a global HttpFetcher with a virtual thread pool dispatcher instead of the default.	2023-12-08 20:24:01 +01:00
Viktor Lofgren	968dce50fc	(crawler) Refactored IpInterceptingNetworkInterceptor for clarity.	2023-12-08 17:45:46 +01:00
Viktor Lofgren	3bbffd3c22	(crawler) Refactor HttpFetcher to integrate WarcRecorder Partially hook in the WarcRecorder into the crawler process. So far it's not read, but should record the crawled documents. The WarcRecorder and HttpFetcher classes were also refactored and broken apart to be easier to reason about.	2023-12-08 17:12:51 +01:00
Viktor Lofgren	072b5fcd12	Implement Warc-recording wrapper for OkHttp3 client This is a first step of using WARC as an intermediate flight recorder style step in the crawler, ultimately aimed at being able to resume crawls if the crawler is restarted. This component is currently not hooked into anything. The OkHttp3 client wrapper class 'WarcRecordingFetcherClient' was implemented for web archiving. This allows for the recording of HTTP requests and responses. New classes were introduced, 'WarcDigestBuilder', 'IpInterceptingNetworkInterceptor', and 'WarcProtocolReconstructor'. The JWarc dependency was added to the build.gradle file, and relevant unit tests were also introduced. Some HttpFetcher-adjacent structural changes were also done for better organization.	2023-12-08 13:49:16 +01:00
Viktor Lofgren	fabffa80f0	(warc) Integrate the crawler's content type parsing and charset logic into the WarcSideloader	2023-12-07 15:26:01 +01:00
Viktor Lofgren	064265b0b9	(crawler) Move content type/charset sniffing to a separate microlibrary This functionality needs to be accessed by the WarcSideloader, which is in the converter. The resultant microlibrary is tiny, but I think in this case it's justifiable.	2023-12-07 15:16:37 +01:00
Viktor Lofgren	2d5d11645d	(warc) Refactor WarcSideloaderTest to not rely on specific test files on the computer	2023-12-06 19:00:29 +01:00
Viktor Lofgren	cc813a5624	(convert) Add basic support for Warc file sideloading This update includes the integration of the jwarc library and implements support for Warc file sideloading, as a first trial integration with this library.	2023-12-06 18:43:55 +01:00
Viktor Lofgren	156c067f79	(search) Fix mobile issues with browse feature	2023-12-05 21:28:50 +01:00
Viktor Lofgren	b33b013d41	(search) Fix broken script tag Apparently it can't be called suggestions.js...?	2023-12-05 20:29:13 +01:00
Viktor Lofgren	e74e2f705f	(search) Fix broken script tag suggestions.js became something else.	2023-12-05 20:20:07 +01:00
Viktor Lofgren	2e438847fc	(search) Optimize related domains queries In the future this logic probably needs to move into a separate service, as it's still quite slow to load. But this fixes the response times and DOS potential of the previous version.	2023-12-05 20:12:03 +01:00
Viktor Lofgren	9301c47d93	(search) Optimize related domains queries	2023-12-05 14:42:03 +01:00
Viktor Lofgren	20ec58b07f	(search) Remove layout-breakingly long URLs from the similar domains view. They're almost all .onion URLs anyway, not really the space we're looking to peer into.	2023-12-05 13:58:15 +01:00
Viktor Lofgren	98983c1015	(search) Hopefully fix race condition that leaves the response with no Content-type header	2023-12-05 13:52:36 +01:00
Viktor Lofgren	67195592c6	(search) Hopefully fix race condition that leaves the response with no Content-type header	2023-12-05 13:48:42 +01:00
Viktor	21abfc6424	Merge pull request #61 from MarginaliaSearch/new-look Design Revamp For search.marginalia.nu	2023-12-05 13:28:54 +01:00
Viktor Lofgren	d1e88df71e	(search) Cleaning up the code a bit	2023-12-05 13:26:05 +01:00
Viktor Lofgren	f36cfe34ab	(search) Hackery to get a more balanced view	2023-12-04 22:50:39 +01:00
Viktor Lofgren	8a1934008c	(search) Merge similar sites results with the info view. WIP: This commit needs to be cleaned up.	2023-12-04 22:10:24 +01:00
Viktor Lofgren	b41bb9cfcf	(search) Use a Ξ for mobile button title instead of "Filters". Makes it easier to distinguish from the search button.	2023-12-03 16:33:25 +01:00
Viktor Lofgren	d58324bbef	(search) Clean up filters menu a bit, improve accessibility.	2023-12-02 18:05:30 +01:00
Viktor Lofgren	cbbd45d3e5	(search) Clean up filters menu a bit, improve accessibility.	2023-12-02 18:01:03 +01:00
Viktor Lofgren	b89633ae4b	(search) Don't render a filter button on mobile when there are no filters to be presented.	2023-12-02 17:23:45 +01:00
Viktor Lofgren	96357e9bfd	(search) Fix typeahead suggestions, as well as improve mobile and desktop UX in small ways.	2023-12-02 17:06:40 +01:00
Viktor Lofgren	d530c3096f	(search) GUI tweaks to make the new interface not fall apart on mobile/chrome	2023-12-02 17:06:40 +01:00
Viktor Lofgren	ae0c1c3f2d	(control) Adjust search result margins for better visual density	2023-12-02 17:06:40 +01:00
Viktor Lofgren	0cc2564380	(search) CSS tweaks	2023-12-02 17:06:40 +01:00
Viktor Lofgren	38d20022ad	(search) Fix script loading for mobile support	2023-12-02 17:06:40 +01:00
Viktor Lofgren	280132dad0	(search) Fix script loading for mobile support	2023-12-02 17:06:40 +01:00
Viktor Lofgren	61de4e2789	(search) Retain filter options when performing a new search from the input field	2023-12-02 17:06:40 +01:00
Viktor Lofgren	f9d3455320	(search) Reduce visual weight of search results	2023-12-02 17:06:40 +01:00

1541 Commits