Commit Graph

  • bfae478251 Refactor CrawlerRevisitor for better consistency Viktor Lofgren 2023-12-20 15:21:49 +0100
  • a7cd490593 (minor) Remove dead code. Viktor Lofgren 2023-12-19 18:46:11 +0100
  • 283d2caa81 Merge remote-tracking branch 'origin/master' Viktor Lofgren 2023-12-19 18:38:01 +0100
  • dd8fb04886 (converter) Add sizeloadSizeAdvice field to several ProcessedDomain Viktor Lofgren 2023-12-19 18:37:51 +0100
  • ce8dca7659 Update Additional Contributors.md Viktor 2023-12-19 12:22:01 +0100
  • 5bd3934d22 Merge pull request #64 from dreimolo/macos_AS_fix Viktor 2023-12-18 18:29:14 +0100
  • 128f550ee5 (run) Download to a temporary file to avoid corruption from aborted downloads Viktor Lofgren 2023-12-18 18:28:17 +0100
  • 3a56a06c4f (warc) Add a fields for etags and last-modified headers to the new crawl data formats Viktor Lofgren 2023-12-18 17:45:54 +0100
  • 126ac3816f (converter) Reduce queue size in ConverterWriter Viktor Lofgren 2023-12-18 13:42:40 +0100
  • d02bed1a55 (loader) Optimize DomainLoaderService for faster startups Viktor Lofgren 2023-12-18 13:15:10 +0100
  • b7ed0ce537 (loader) Reset count after executing batch in DomainLoaderService Viktor Lofgren 2023-12-18 12:43:53 +0100
  • a742503508 (search) Add view for showing mutual links between two websites Viktor Lofgren 2023-12-17 17:50:44 +0100
  • 33312ab09e (geo-ip) Update readme Viktor Lofgren 2023-12-17 16:08:33 +0100
  • c422f0b9fb (geo-ip) Tidy up error handling Viktor Lofgren 2023-12-17 15:26:57 +0100
  • 35a555b134 (geo-ip) Tidy up error handling asn-info Viktor Lofgren 2023-12-17 15:26:57 +0100
  • 7797de80e3 Merge pull request #65 from MarginaliaSearch/asn-info Viktor 2023-12-17 15:04:29 +0100
  • c92f1b8df8 (geo-ip) Revert removal of ip2location logic Viktor Lofgren 2023-12-17 15:03:00 +0100
  • bde68ba48b Merge branch 'master' into asn-info Viktor Lofgren 2023-12-17 14:00:23 +0100
  • bf44805e69 (*) Rename EdgeDomain$domain into topDomain Viktor Lofgren 2023-12-17 14:00:07 +0100
  • edf9aa2c23 (*) Rename EdgeDomain$domain into topDomain Viktor Lofgren 2023-12-17 13:59:54 +0100
  • 4801c47273 (crawling-model) Fix bug where CrawledDocument.getDomain() trimmed www-prefixes Viktor Lofgren 2023-12-17 13:53:31 +0100
  • bcad6492d6 (sideloader) Fix integration problems with sideloaders Viktor Lofgren 2023-12-17 13:28:17 +0100
  • 5ab2a22e88 (search) Fix result count back down to 1 per domain Viktor Lofgren 2023-12-17 13:14:23 +0100
  • d7bd540683 (*) Replace the ip2location IP geolocation data with ASN information from apnic.net. Viktor Lofgren 2023-12-16 21:55:04 +0100
  • 62954f98de adds xl to help output dreimolo 2023-12-16 19:41:41 +0100
  • 722b56c8ca (index) Fix rare bug in the index-switching logic Viktor Lofgren 2023-12-16 18:57:35 +0100
  • f3f12058dc (assistant) Fix logic error in filtering related domains Viktor Lofgren 2023-12-16 18:45:53 +0100
  • 3da38d0483 (assistant) Fix logic error in filtering related domains Viktor Lofgren 2023-12-16 18:44:25 +0100
  • d715b1f9ca (search) Improve error handling in search parameters parsing Viktor Lofgren 2023-12-16 18:42:13 +0100
  • e13fa25e11 (assistant) Clean up the site info related domains view by filtering viable domains Viktor Lofgren 2023-12-16 18:37:09 +0100
  • 34d4834ff6 (assistant) Clean up the site info related domains view by filtering viable domains Viktor Lofgren 2023-12-16 18:27:24 +0100
  • 117ddd17d7 (assistant) Fix bugs in IP flag emoji generation Viktor Lofgren 2023-12-16 17:07:17 +0100
  • 6f2bf38f0e (index) Fix off-by-1 error in the domain count limiter Viktor Lofgren 2023-12-16 16:57:05 +0100
  • 320882c34a (site-info) Try to discover the schema of the website with a site:-query Viktor Lofgren 2023-12-16 16:34:53 +0100
  • 8bbb533c9a Merge pull request #62 from MarginaliaSearch/warc Viktor 2023-12-16 16:02:46 +0100
  • 3113b5a551 (warc) Filter WarcResponses based on X-Robots-Tags warc Viktor Lofgren 2023-12-16 15:57:10 +0100
  • c0cc05177f corrects protobuf.plugins.grpc dreimolo 2023-12-16 14:24:41 +0100
  • 0b34d43804 workaround for failing mac on apple silicon deps dreimolo 2023-12-16 14:22:11 +0100
  • 6c7d7427bf Adds check for wget and curl, and valid sample archives dreimolo 2023-12-16 14:14:58 +0100
  • 54ed3b86ba (minor) Remove dead code. Viktor Lofgren 2023-12-15 21:49:35 +0100
  • 2001d0f707 (converter) Add @Deprecated annotation to a few fields that should no longer be used. Viktor Lofgren 2023-12-15 21:42:00 +0100
  • 0f9cd9c87d (warc) More accurate filering of advisory records Viktor Lofgren 2023-12-15 21:37:02 +0100
  • 2e7db61808 (warc) More accurate filering of advisory records Viktor Lofgren 2023-12-15 21:31:16 +0100
  • 5329968155 (crawler) Update CrawlingThenConvertingIntegrationTest Viktor Lofgren 2023-12-15 21:04:06 +0100
  • 2e536e3141 (crawler) Add timestamp to CrawledDocument records Viktor Lofgren 2023-12-15 20:23:27 +0100
  • cf935a5331 (converter) Read cookie information Viktor Lofgren 2023-12-15 18:09:53 +0100
  • fa81e5b8ee (warc) Use a non-standard WARC header to convey information about whether a website uses cookies Viktor Lofgren 2023-12-15 16:37:53 +0100
  • 9fea22b90d (warc) Further tidying Viktor Lofgren 2023-12-15 15:38:23 +0100
  • 0889b6d247 (warc) Clean up parquet conversion Viktor Lofgren 2023-12-14 20:39:40 +0100
  • 1328bc4938 (warc) Clean up parquet conversion Viktor Lofgren 2023-12-14 16:05:48 +0100
  • 787a20cbaa (crawling-model) Implement a parquet format for crawl data Viktor Lofgren 2023-12-13 16:22:19 +0100
  • a73f1ab0ac Merge branch 'master' into warc Viktor Lofgren 2023-12-13 15:35:29 +0100
  • 30c0dad3ae (gradle) Bump gradle-wrapper version to 8.5 Viktor Lofgren 2023-12-13 15:35:01 +0100
  • 440e097d78 (crawler) WIP integration of WARC files into the crawler and converter process. Viktor Lofgren 2023-12-13 15:33:42 +0100
  • b74a3ebd85 (crawler) WIP integration of WARC files into the crawler process. Viktor Lofgren 2023-12-11 19:32:58 +0100
  • 45987a1d98 Merge branch 'master' into warc Viktor Lofgren 2023-12-11 14:30:20 +0100
  • 8f0950fc44 (geoip) Fix incorrect synchronization. Viktor Lofgren 2023-12-11 14:01:39 +0100
  • 30bc3f9281 (converter) Use the prefix ip: instead of geopip: for country codes Viktor Lofgren 2023-12-11 13:59:23 +0100
  • f655ec5a5c (*) Refactor GeoIP-related code Viktor Lofgren 2023-12-10 17:30:43 +0100
  • 84b4158555 (minor) Fix broken test Viktor Lofgren 2023-12-10 14:29:18 +0100
  • 91dd45cf64 (search) IP and IP geolocation in site info view Viktor Lofgren 2023-12-09 20:04:27 +0100
  • 37af60254f (search) Better recipe filter Viktor Lofgren 2023-12-09 16:53:06 +0100
  • f0e736d4ea (search) Update the search profile 'Academia' to strictly filter on academic tlds Viktor Lofgren 2023-12-09 16:46:51 +0100
  • e3ebb0c5bb (*) Rename the search filter 'RETRO' into 'POPULAR' Viktor Lofgren 2023-12-09 16:39:46 +0100
  • 6382f779c3 (search) Revert back to using 'Popular' as the default search filter Viktor Lofgren 2023-12-09 16:34:12 +0100
  • 8ef34883a8 (search) Move site information out of the search service and into assistant. Viktor Lofgren 2023-12-09 16:29:35 +0100
  • 5c46af0edb (converter) Refactor EncyclopediaMarginaliaNuSideloader to use ProcessingIterator Viktor Lofgren 2023-12-09 15:20:31 +0100
  • b6511fbfe2 (converter) Add AnchorTextKeywords to EncyclopediaMarginaliaNuSideloader processing Viktor Lofgren 2023-12-09 13:23:21 +0100
  • eccb12b366 (control) Fix spurious state detection in control-side actors Viktor Lofgren 2023-12-09 12:50:05 +0100
  • d0982e7ba5 (converter) Add error handling and lazy load external domain links Viktor Lofgren 2023-12-09 12:33:39 +0100
  • fc30da0d48 (converter) Add academia recognition to DomainProcessor Viktor Lofgren 2023-12-08 20:31:34 +0100
  • e6a1052ba7 Simplify CrawlerMain, removing the CrawlerLimiter and using a global HttpFetcher with a virtual thread pool dispatcher instead of the default. Viktor Lofgren 2023-12-08 20:24:01 +0100
  • 968dce50fc (crawler) Refactored IpInterceptingNetworkInterceptor for clarity. Viktor Lofgren 2023-12-08 17:45:46 +0100
  • 3bbffd3c22 (crawler) Refactor HttpFetcher to integrate WarcRecorder Viktor Lofgren 2023-12-08 17:12:51 +0100
  • 072b5fcd12 Implement Warc-recording wrapper for OkHttp3 client Viktor Lofgren 2023-12-08 13:49:16 +0100
  • fabffa80f0 (warc) Integrate the crawler's content type parsing and charset logic into the WarcSideloader Viktor Lofgren 2023-12-07 15:26:01 +0100
  • 064265b0b9 (crawler) Move content type/charset sniffing to a separate microlibrary Viktor Lofgren 2023-12-07 15:16:37 +0100
  • 2d5d11645d (warc) Refactor WarcSideloaderTest to not rely on specific test files on the computer Viktor Lofgren 2023-12-06 19:00:29 +0100
  • cc813a5624 (convert) Add basic support for Warc file sideloading Viktor Lofgren 2023-12-06 18:43:55 +0100
  • 156c067f79 (search) Fix mobile issues with browse feature Viktor Lofgren 2023-12-05 21:28:50 +0100
  • b33b013d41 (search) Fix broken script tag Viktor Lofgren 2023-12-05 20:29:13 +0100
  • e74e2f705f (search) Fix broken script tag Viktor Lofgren 2023-12-05 20:20:07 +0100
  • 2e438847fc (search) Optimize related domains queries Viktor Lofgren 2023-12-05 15:47:21 +0100
  • 9301c47d93 (search) Optimize related domains queries Viktor Lofgren 2023-12-05 14:42:03 +0100
  • 20ec58b07f (search) Remove layout-breakingly long URLs from the similar domains view. Viktor Lofgren 2023-12-05 13:58:15 +0100
  • 98983c1015 (search) Hopefully fix race condition that leaves the response with no Content-type header Viktor Lofgren 2023-12-05 13:52:36 +0100
  • 67195592c6 (search) Hopefully fix race condition that leaves the response with no Content-type header Viktor Lofgren 2023-12-05 13:48:42 +0100
  • 21abfc6424 Merge pull request #61 from MarginaliaSearch/new-look Viktor 2023-12-05 13:28:54 +0100
  • d1e88df71e (search) Cleaning up the code a bit new-look Viktor Lofgren 2023-12-05 13:26:05 +0100
  • f36cfe34ab (search) Hackery to get a more balanced view Viktor Lofgren 2023-12-04 22:50:39 +0100
  • 8a1934008c (search) Merge similar sites results with the info view. Viktor Lofgren 2023-12-04 22:10:24 +0100
  • b41bb9cfcf (search) Use a Ξ for mobile button title instead of "Filters". Viktor Lofgren 2023-12-03 16:33:25 +0100
  • d58324bbef (search) Clean up filters menu a bit, improve accessibility. Viktor Lofgren 2023-12-02 18:05:30 +0100
  • cbbd45d3e5 (search) Clean up filters menu a bit, improve accessibility. Viktor Lofgren 2023-12-02 18:01:03 +0100
  • b89633ae4b (search) Don't render a filter button on mobile when there are no filters to be presented. Viktor Lofgren 2023-12-02 17:23:45 +0100
  • 96357e9bfd (search) Fix typeahead suggestions, as well as improve mobile and desktop UX in small ways. Viktor Lofgren 2023-12-01 16:36:45 +0100
  • d530c3096f (search) GUI tweaks to make the new interface not fall apart on mobile/chrome Viktor Lofgren 2023-11-30 16:23:54 +0100
  • ae0c1c3f2d (control) Adjust search result margins for better visual density Viktor Lofgren 2023-11-29 14:11:40 +0100
  • 0cc2564380 (search) CSS tweaks Viktor Lofgren 2023-11-28 15:03:25 +0100
  • 38d20022ad (search) Fix script loading for mobile support Viktor Lofgren 2023-11-28 14:08:18 +0100