Commit Graph

  • 60361f88ed (converter) Add upper 128KB limit to how much HTML we'll parse Viktor Lofgren 2024-01-03 23:14:03 +0100
  • f7560cb1d8 (feature) More trackers Viktor Lofgren 2024-01-03 17:31:02 +0100
  • 1f66568d59 (feature) More trackers Viktor Lofgren 2024-01-03 17:27:25 +0100
  • 7af07cef95 (feature) Add another doubleclick variant to the adtech trackers Viktor Lofgren 2024-01-03 17:21:12 +0100
  • 41a540a629 (converter) Penalize chatgpt content farm spam Viktor Lofgren 2024-01-03 17:04:38 +0100
  • f599944942 (converter) Penalize chatgpt content farm spam Viktor Lofgren 2024-01-03 16:51:26 +0100
  • 1e06aee6a2 (index) Adjust BM25 parameters Viktor Lofgren 2024-01-03 16:30:46 +0100
  • 7bbaedef97 (search) Add query strategy requiring link Viktor Lofgren 2024-01-03 16:23:00 +0100
  • 87048511fe (valuation) Tweaking penalties a bit Viktor Lofgren 2024-01-03 16:02:25 +0100
  • c770f0b68b (valuation) Tweaking penalties a bit Viktor Lofgren 2024-01-03 15:59:21 +0100
  • 78c00ad512 (valuation) Tweaking penalties a bit Viktor Lofgren 2024-01-03 15:52:57 +0100
  • a19879d494 (valuation) Tweaking penalties a bit Viktor Lofgren 2024-01-03 15:32:33 +0100
  • ac1aca36b0 (valuation) Increase the penalty for adtech a bit Viktor Lofgren 2024-01-03 15:20:38 +0100
  • 1f3b89cf28 (index) Reduce the value of site and site-adjacent in BM25P calculations Viktor Lofgren 2024-01-03 15:20:18 +0100
  • f732f6ae6f (index) Tweak result valuation renormalization Viktor Lofgren 2024-01-03 14:53:53 +0100
  • 0b9f3d1751 (*) Remove accidental commit of debug logging Viktor Lofgren 2024-01-03 14:32:00 +0100
  • 0806aa6dfe (language-processing) Add maximum length limit for text input in SentenceExtractor Viktor Lofgren 2024-01-03 13:59:05 +0100
  • 32436d099c (language-processing) Add maximum length limit for text input in SentenceExtractor Viktor Lofgren 2024-01-03 13:49:39 +0100
  • 4ce692ccaf (converter) Use SimpleBlockingThreadPool in ProcessingIterator Viktor Lofgren 2024-01-03 13:40:44 +0100
  • 3caa4eed75 Merge branch 'master' into converter-optimizations Viktor Lofgren 2024-01-02 17:13:25 +0100
  • c70f508ae8 (prometheus) Saner histogram buckets Viktor Lofgren 2024-01-02 17:13:14 +0100
  • 9e64d7aaf9 Merge branch 'master' into converter-optimizations Viktor Lofgren 2024-01-02 15:46:24 +0100
  • 72b773f06d (search) fix search metrics labeling Viktor Lofgren 2024-01-02 15:46:14 +0100
  • 5f978b865b Merge branch 'master' into converter-optimizations Viktor Lofgren 2024-01-02 15:41:48 +0100
  • 57a4f92722 (api) fix missing metrics label in api service Viktor Lofgren 2024-01-02 15:41:38 +0100
  • 87351e89ca Merge branch 'master' into converter-optimizations Viktor Lofgren 2024-01-02 15:17:02 +0100
  • 7920c67a48 Merge pull request #71 from MarginaliaSearch/metrics Viktor 2024-01-02 15:13:53 +0100
  • 192e356169 (prometheus) Add instrumentation to the api service Viktor Lofgren 2024-01-02 15:12:44 +0100
  • 31232e49fb (prometheus) Add instrumentation to the search, qs and index services. Viktor Lofgren 2024-01-02 15:02:29 +0100
  • 116595d218 (prometheus) Add in-docker prometheus instance to exfiltrate metrics from the docker-based services Viktor Lofgren 2024-01-01 17:16:29 +0100
  • 9d93a31755 Merge branch 'master' into converter-optimizations Viktor Lofgren 2024-01-02 12:36:16 +0100
  • 9f7df59945 (sideload) Reduce quality assessment. Viktor Lofgren 2024-01-02 12:34:58 +0100
  • d2418521a7 (index) Further ranking adjustments Viktor Lofgren 2024-01-01 18:43:17 +0100
  • 9330b5b1d9 (index) Adjust rank weightings to fix bad wikipedia results Viktor Lofgren 2024-01-01 17:16:29 +0100
  • faa50bf578 (sideload) Just index based on first paragraph Viktor Lofgren 2024-01-01 16:19:38 +0100
  • f0d9618dfc (sideload) Reduce quality assessment. Viktor Lofgren 2024-01-02 12:34:58 +0100
  • 310a880fa8 (index) Further ranking adjustments Viktor Lofgren 2024-01-02 12:24:52 +0100
  • fc6e3b6da0 (index) Further ranking adjustments Viktor Lofgren 2024-01-01 18:51:03 +0100
  • 50771045d0 (index) Further ranking adjustments Viktor Lofgren 2024-01-01 18:43:17 +0100
  • 8f522470ed (index) Adjust rank weightings to fix bad wikipedia results Viktor Lofgren 2024-01-01 17:16:29 +0100
  • dc90c9ac65 (sideload) Just index based on first paragraph Viktor Lofgren 2024-01-01 16:19:38 +0100
  • e46e174b59 (keyword-extractor) Add another test for Name-extractor Viktor Lofgren 2024-01-01 15:21:51 +0100
  • 7f3f3f577c (backup) Add task heartbeats to the backup service Viktor Lofgren 2024-01-01 15:20:57 +0100
  • 75d87c73d1 (crawler) Disable Java's infinite DNS caching Viktor Lofgren 2023-12-30 13:56:44 +0100
  • 0fe44c9bf2 (crawler) Fix broken test Viktor Lofgren 2023-12-30 13:56:44 +0100
  • 7a1d20ed0a (converter) Better use of ProcessingIterator Viktor Lofgren 2023-12-30 13:53:55 +0100
  • 70c83b60a1 (converter) Clean up fullProcessing() Viktor Lofgren 2023-12-30 13:36:18 +0100
  • 7ba296ccdf (converter) Route sizeHint to SideloadProcessing Viktor Lofgren 2023-12-30 13:05:10 +0100
  • 0b112cb4d4 (warc) Update URL encoding in WarcProtocolReconstructor Viktor Lofgren 2023-12-29 19:41:37 +0100
  • 68ac8d3e09 (search) Fetch fewer linking and similar domains. Viktor Lofgren 2023-12-29 16:37:00 +0100
  • f6fa8bd722 (search) Fetch fewer linking and similar domains. Viktor Lofgren 2023-12-29 16:37:00 +0100
  • 6aee27a3f1 (*) Fix bug in EdgeDomain where it would permit domains with a trailing period, DNS style. Viktor Lofgren 2023-12-29 16:36:01 +0100
  • 401568033c Merge branch 'master' into converter-optimizations Viktor Lofgren 2023-12-29 15:55:57 +0100
  • ea73be6831 (search) Remove the ugly placeholder screenshots from the site info view. Viktor Lofgren 2023-12-29 15:55:46 +0100
  • ba8a75c84b Fix bug in ProcessingIterator where it would run the tasks in only one single thread instead of using the pool Viktor Lofgren 2023-12-29 15:10:32 +0100
  • a1f3ccdd6d Fix bug in ProcessingIterator where it would run the tasks in only one single thread instead of using the pool Viktor Lofgren 2023-12-29 14:59:39 +0100
  • 647d38007f Reduce queue polling time in ProcessingIterator Viktor Lofgren 2023-12-29 14:27:58 +0100
  • e7dd28b926 (converter) Optimize sideload-loading Viktor Lofgren 2023-12-29 14:25:48 +0100
  • b5fc9673d9 Merge branch 'master' into converter-optimizations Viktor Lofgren 2023-12-29 14:04:43 +0100
  • a065040323 (search) Don't inject arbitrary HTML into the site info view xD Viktor Lofgren 2023-12-29 14:04:26 +0100
  • dec3b1092d (converter) Fix bugs in conversion Viktor Lofgren 2023-12-29 13:58:08 +0100
  • 407915a86e (converter) Fix NPEs in converter due to the new data format Viktor Lofgren 2023-12-28 19:52:26 +0100
  • c488599879 (converter) Fix NPE in converter Viktor Lofgren 2023-12-28 19:52:26 +0100
  • bcecc93e39 (converter) Swallow errors in parquet data stream Viktor Lofgren 2023-12-28 19:45:35 +0100
  • ff7d1a250e Merge branch 'master' into converter-optimizations Viktor Lofgren 2023-12-28 19:35:00 +0100
  • 70f338c3de (search) Fix NPE in layout selection Viktor Lofgren 2023-12-28 19:34:46 +0100
  • c847d83011 (converter) Add size hint to converter sideload processing Viktor Lofgren 2023-12-28 19:14:16 +0100
  • 5ce46a61d4 Merge branch 'master' into converter-optimizations Viktor Lofgren 2023-12-28 13:26:19 +0100
  • 775974d5ec Merge pull request #67 from MarginaliaSearch/rss-feeds-in-site-info Viktor 2023-12-28 13:25:38 +0100
  • c7af40c368 (search) Change layout balance when feeds/samples are present Viktor Lofgren 2023-12-28 13:16:10 +0100
  • 00a974a721 (crawler) Fix bug in resynchronizer where it would fail to capture expected exceptions Viktor Lofgren 2023-12-27 20:02:17 +0100
  • 7428ba2dd7 (converter) Basic test coverage for sideloading-style processing Viktor Lofgren 2023-12-27 19:29:26 +0100
  • b37223c053 (converter) Basic test coverage for sideloading-style processing Viktor Lofgren 2023-12-27 18:33:16 +0100
  • 24051fec03 (converter) WIP Run sideload-style processing for large domains Viktor Lofgren 2023-12-27 18:20:03 +0100
  • f811a29f87 (crawler) Fix resource leak in crawler Viktor Lofgren 2023-12-27 16:32:17 +0100
  • acf7bcc7a6 (converter) Refactor the DomainProcessor for new format of crawl data Viktor Lofgren 2023-12-27 13:57:59 +0100
  • 9707366348 (test) Fix a few slow tests that broke due to domainCount Viktor Lofgren 2023-12-27 13:29:59 +0100
  • 9e5fe71f5b (crawler) Switch hash function in crawler Viktor Lofgren 2023-12-27 13:29:00 +0100
  • 5d1b7da728 Updated site info feed and search service Viktor Lofgren 2023-12-26 22:06:01 +0100
  • 3ea1ddae22 (crawler) Roll back switch to virtual thread pool in crawler Viktor Lofgren 2023-12-26 19:37:34 +0100
  • 1694e9c78c (search) Add RSS Feeds to site info Viktor Lofgren 2023-12-26 16:21:40 +0100
  • 4763077b76 (search/index) Add a new keyword "count" Viktor Lofgren 2023-12-25 20:38:29 +0100
  • c0eaca220c (search) Add convenient link for AS search to the search view Viktor Lofgren 2023-12-25 15:07:58 +0100
  • 25d086c4e1 (crawler) Clean up stale warc files Viktor Lofgren 2023-12-25 15:07:36 +0100
  • 88551043cd (crawler) Even more lenient resyncing Viktor Lofgren 2023-12-25 01:48:11 +0100
  • f779f760c4 (crawler) Even more lenient resyncing Viktor Lofgren 2023-12-25 01:44:18 +0100
  • f18f82e229 (crawler) Write etags and last-modified on reference copy Viktor Lofgren 2023-12-25 01:40:13 +0100
  • 67ef2b45fa (crawler) Reduce logging Viktor Lofgren 2023-12-25 01:10:03 +0100
  • d72e871265 (warc) Fix resync Viktor Lofgren 2023-12-25 01:03:03 +0100
  • 4c9bc13309 (warc) Reduce log spam Viktor Lofgren 2023-12-25 00:58:31 +0100
  • 84563b0d46 (crawler) Be a bit more conservative about pulling etags and so on if the previous fetch wasn't OK Viktor Lofgren 2023-12-25 00:55:05 +0100
  • c5aab7e8db (warc) Fix NPE in WarcRecorder Viktor Lofgren 2023-12-25 00:54:38 +0100
  • 1755b646b8 (warc) Fix NPE in WarcRecorder Viktor Lofgren 2023-12-25 00:48:42 +0100
  • 85f906ea53 (executor) Fix removal of stale process heartbeats Viktor Lofgren 2023-12-23 13:49:24 +0100
  • e1a155a9c8 (crawler) Increase growth of crawl jobs Viktor Lofgren 2023-12-23 13:22:10 +0100
  • 0454447e41 (executor) Implement process removal for long-absent heartbeats Viktor Lofgren 2023-12-23 13:18:21 +0100
  • 7b40c0bbee (assistant) Clean up similar websites' results Viktor Lofgren 2023-12-22 14:07:01 +0100
  • dc773c5c20 (adjacencies) Clean up AdjacenciesLoader Viktor Lofgren 2023-12-21 14:14:22 +0100
  • b6253b03c2 (adjacencies) Fix bug in AdjacenciesLoader Viktor Lofgren 2023-12-21 13:12:31 +0100
  • a5bc29245b (cleanup) Remove vestigial support for WARC crawl data streams Viktor Lofgren 2023-12-20 15:46:21 +0100