Commit Graph

688 Commits

Author SHA1 Message Date
Viktor Lofgren
8f0950fc44 (geoip) Fix incorrect synchronization. 2023-12-11 14:01:39 +01:00
Viktor Lofgren
30bc3f9281 (converter) Use the prefix ip: instead of geoip: for country codes
This is the same as the prefix for the IP address, but I don't think that substantially matters, as the two have such different namespaces that there can be no confusion.
2023-12-11 13:59:23 +01:00
Viktor Lofgren
f655ec5a5c (*) Refactor GeoIP-related code
In this commit, GeoIP-related classes are refactored and relocated to a common library as they are shared across multiple services.

The crawler is refactored to enable the GeoIpBlocklist to use the new GeoIpDictionary as the base of its decisions.

The converter is modified to query this data to add a geoip:-keyword to documents to permit limiting a search to the country of the hosting server.

The commit also adds due BY-SA attribution in the search engine footer for the source of the IP geolocation data.
2023-12-10 17:30:43 +01:00
Viktor Lofgren
84b4158555 (minor) Fix broken test 2023-12-10 14:39:20 +01:00
Viktor Lofgren
91dd45cf64 (search) IP and IP geolocation in site info view
This commit also fixes a bug in the loader where the IP field wouldn't always populate as intended, and refactors the DomainInformationService to use significantly fewer SQL queries.
2023-12-09 20:06:55 +01:00
Viktor Lofgren
37af60254f (search) Better recipe filter
Tune the recipe filter to give better results, by using the 'popular' domains set along with excluding results with heavy tracking.
2023-12-09 20:06:55 +01:00
Viktor Lofgren
f0e736d4ea (search) Update the search profile 'Academia' to strictly filter on academic tlds
The previous version used a personalized pagerank centering on a few academic domains, but this didn't work very well and most results were not very academia-centric.
2023-12-09 20:06:55 +01:00
Viktor Lofgren
e3ebb0c5bb (*) Rename the search filter 'RETRO' into 'POPULAR'
This will make the terminology more consistent between the GUI and the code. The rankings yaml still uses 'retro' though, to retain compatibility.
2023-12-09 20:06:54 +01:00
Viktor Lofgren
6382f779c3 (search) Revert back to using 'Popular' as the default search filter
Unfiltered is a bit too ... unfiltered, and gives a bad first impression for many queries.
2023-12-09 16:34:12 +01:00
Viktor Lofgren
8ef34883a8 (search) Move site information out of the search service and into assistant.
This reduces the impact of restarting the search service, as the site information takes a few minutes to load during which it's not available.  It also permits exposing this information via API in the future if there is interest in this.

The assistant service was also modified to do a late load of the suggestions trie, as this is a major contributor to its start-up time.

Finally, some changes were made to the client library: a new get() method was added that takes a TypeToken to allow deserialization of generics such as List<Foo>, and the scheduler was modified to use virtual threads.
2023-12-09 16:30:06 +01:00
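As a rough illustration of why the get() method in 8ef34883a8 needs a type token for something like List<Foo>: Java erases generic type arguments at runtime, so List.class cannot say what the list contains. The sketch below is a minimal, hand-rolled token under that assumption. TypeRef is a hypothetical name, and this is not the TypeToken class the client library actually uses.

```java
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;

// Hypothetical minimal type token; not the class used by the client library.
abstract class TypeRef<T> {
    final Type type;

    protected TypeRef() {
        // An anonymous subclass such as `new TypeRef<List<String>>() {}` records
        // its generic superclass, which survives erasure and can be read back here.
        ParameterizedType superType =
                (ParameterizedType) getClass().getGenericSuperclass();
        this.type = superType.getActualTypeArguments()[0];
    }
}
```

A caller would write `new TypeRef<List<String>>() {}.type` and hand the resulting Type to a deserializer, which can then see both the raw type List and its element type String.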
Viktor Lofgren
5c46af0edb (converter) Refactor EncyclopediaMarginaliaNuSideloader to use ProcessingIterator
Refactored the getDocumentsStream method in EncyclopediaMarginaliaNuSideloader to use the newly extracted ProcessingIterator class that encapsulates processing a stream of results from e.g. a database query in parallel and returning the computed results as an iterator.

The iterator was also made more reliable; previous versions of the logic would sometimes deadlock due to false positives in hasMore().
2023-12-09 15:20:53 +01:00
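The general idea in 5c46af0edb, processing inputs in parallel and exposing the results as an iterator, could be sketched roughly as below. This is not the actual ProcessingIterator. ParallelMapIterator, its constructor arguments and the END sentinel are all hypothetical, and the queue is unbounded for brevity. The point it illustrates is one way to avoid a false positive: hasNext() only returns true once an element has actually been taken from the queue.

```java
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical sketch; results arrive in completion order, not input order.
final class ParallelMapIterator<I, O> implements Iterator<O> {
    private static final Object END = new Object(); // marks "no more results"
    private final BlockingQueue<Object> queue = new LinkedBlockingQueue<>();
    private Object next; // element fetched by hasNext(), or END

    // fn must not return null; a task that throws simply yields no result.
    ParallelMapIterator(List<I> inputs, Function<I, O> fn, int threads) {
        if (inputs.isEmpty()) {
            queue.add(END);
            return;
        }
        AtomicInteger remaining = new AtomicInteger(inputs.size());
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (I input : inputs) {
            pool.execute(() -> {
                try {
                    queue.add(fn.apply(input));
                } finally {
                    // The last task to finish enqueues END after its own result.
                    if (remaining.decrementAndGet() == 0) queue.add(END);
                }
            });
        }
        pool.shutdown(); // worker threads exit once the submitted tasks are done
    }

    @Override
    public boolean hasNext() {
        if (next == null) {
            try {
                next = queue.take(); // blocks until a result or END arrives
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                next = END;
            }
        }
        return next != END;
    }

    @Override
    public O next() {
        if (!hasNext()) throw new NoSuchElementException();
        @SuppressWarnings("unchecked") O out = (O) next;
        next = null;
        return out;
    }
}
```

Because END is only enqueued after every task has finished, hasNext() either holds a real element or knows the stream is exhausted, and next() never blocks on an element that will not arrive.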
Viktor Lofgren
b6511fbfe2 (converter) Add AnchorTextKeywords to EncyclopediaMarginaliaNuSideloader processing
The commit updates EncyclopediaMarginaliaNuSideloader to include the AnchorTextKeywords in processing documents, aiding search result relevance.

It also removes old test-related functionality and a large but fairly useless test previously used to debug a specific problem, to the detriment of the overall code quality.
2023-12-09 15:20:52 +01:00
Viktor Lofgren
eccb12b366 (control) Fix spurious state detection in control-side actors
A race condition was found where precession actors would sometimes skip a step, because when invoking ExecutorRemoteActor.getState(), it would get the last 'OK' actor state from a previous run of the actor!

To avoid this, the trigger method was changed from returning a boolean to the message ID, negative if an error occurred, to be passed to getState to select only messages that pertain to the present or future runs.
2023-12-09 12:50:05 +01:00
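The fix described in eccb12b366 can be illustrated with a small in-memory model. This is only a sketch under assumptions: ActorStateLog, post() and the state strings are hypothetical stand-ins, and the real code talks to a remote actor rather than a local map. What it shows is the principle of passing the trigger's message ID to getState so that states from earlier runs are ignored.

```java
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;

// Hypothetical in-memory stand-in for an actor's message log.
final class ActorStateLog {
    private final TreeMap<Long, String> statesById = new TreeMap<>();
    private long nextId = 1;

    // Records a state message and returns its message ID.
    synchronized long post(String state) {
        long id = nextId++;
        statesById.put(id, state);
        return id;
    }

    // Starts a run and returns the trigger's message ID.
    // (In the commit, a negative value signals an error; this sketch never fails.)
    synchronized long trigger() {
        return post("TRIGGERED");
    }

    // Returns the latest state posted after the given trigger, if any.
    // Messages from earlier runs have smaller IDs and are never considered.
    synchronized Optional<String> getState(long triggerId) {
        if (triggerId < 0) return Optional.empty(); // the trigger itself failed
        Map.Entry<Long, String> latest = statesById.tailMap(triggerId, false).lastEntry();
        return latest == null ? Optional.empty() : Optional.of(latest.getValue());
    }
}
```

With this shape, a stale 'OK' left over from a previous run cannot be mistaken for the result of the run that was just triggered, because its ID is lower than the trigger's.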
Viktor Lofgren
d0982e7ba5 (converter) Add error handling and lazy load external domain links
The converter was not properly initiating the external links for each domain, causing an NPE in conversion.  This needs to be loaded later since we don't know the domain we're processing until we've seen it in the crawl data.

Also made some refactorings to make finding converter bugs easier, and finding the related domain less awkward from the SerializableCrawlData interface.
2023-12-09 12:33:39 +01:00
Viktor Lofgren
fc30da0d48 (converter) Add academia recognition to DomainProcessor
The code now includes an additional function in the DomainProcessor class that checks if a domain is associated with academia. A domain is considered academic if it has the ".edu" TLD, or if it fits a specific regex pattern matching domains like *.ac.ccTld or *.edu.ccTld.

If these conditions are met, the search term "special:academia" is added to the domain.

The existing academia search filter uses personalized pagerank to select academia-adjacent domains, but it isn't working very well.  The hope is that filtering on domain names will be more effective, and that it can supplant the ranking-based approach.
2023-12-08 20:31:34 +01:00
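A check along the lines described in fc30da0d48 might look like the sketch below. The class name, method name and exact regex are assumptions made for illustration, not the code in DomainProcessor. In particular, the sketch assumes a ccTLD is exactly two ASCII letters.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper; names and regex are not taken from the commit.
final class AcademiaDomains {
    // Matches hosts ending in .ac.<ccTLD> or .edu.<ccTLD>,
    // e.g. ox.ac.uk or unam.edu.mx (ccTLD assumed to be two letters).
    private static final Pattern CC_ACADEMIC =
            Pattern.compile(".*\\.(ac|edu)\\.[a-z]{2}$");

    static boolean isAcademic(String domain) {
        String d = domain.toLowerCase(Locale.ROOT);
        return d.endsWith(".edu") || CC_ACADEMIC.matcher(d).matches();
    }
}
```

A domain for which isAcademic returns true would then receive the "special:academia" search term, as the commit message describes.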
Viktor Lofgren
156c067f79 (search) Fix mobile issues with browse feature 2023-12-05 21:28:50 +01:00
Viktor Lofgren
b33b013d41 (search) Fix broken script tag
Apparently it can't be called suggestions.js...?
2023-12-05 20:29:13 +01:00
Viktor Lofgren
e74e2f705f (search) Fix broken script tag
suggestions.js became something else.
2023-12-05 20:20:07 +01:00
Viktor Lofgren
2e438847fc (search) Optimize related domains queries
In the future this logic probably needs to move into a separate
service, as it's still quite slow to load.  But this fixes response
times and DOS potential of the previous version.
2023-12-05 20:12:03 +01:00
Viktor Lofgren
9301c47d93 (search) Optimize related domains queries 2023-12-05 14:42:03 +01:00
Viktor Lofgren
20ec58b07f (search) Remove layout-breakingly long URLs from the similar domains view.
They're almost all .onion URLs anyway, not really the space we're looking to peer into.
2023-12-05 13:58:15 +01:00
Viktor Lofgren
98983c1015 (search) Hopefully fix race condition that leaves the response with no Content-type header 2023-12-05 13:52:36 +01:00
Viktor Lofgren
67195592c6 (search) Hopefully fix race condition that leaves the response with no Content-type header 2023-12-05 13:48:42 +01:00
Viktor Lofgren
d1e88df71e (search) Cleaning up the code a bit 2023-12-05 13:26:05 +01:00
Viktor Lofgren
f36cfe34ab (search) Hackery to get a more balanced view 2023-12-04 22:50:39 +01:00
Viktor Lofgren
8a1934008c (search) Merge similar sites results with the info view.
WIP: This commit needs to be cleaned up.
2023-12-04 22:10:24 +01:00
Viktor Lofgren
b41bb9cfcf (search) Use a &Xi; for mobile button title instead of "Filters".
Makes it easier to distinguish from the search button.
2023-12-03 16:33:25 +01:00
Viktor Lofgren
d58324bbef (search) Clean up filters menu a bit, improve accessibility. 2023-12-02 18:05:30 +01:00
Viktor Lofgren
cbbd45d3e5 (search) Clean up filters menu a bit, improve accessibility. 2023-12-02 18:01:03 +01:00
Viktor Lofgren
b89633ae4b (search) Don't render a filter button on mobile when there are no filters to be presented. 2023-12-02 17:23:45 +01:00
Viktor Lofgren
96357e9bfd (search) Fix typeahead suggestions, as well as improve mobile and desktop UX in small ways. 2023-12-02 17:06:40 +01:00
Viktor Lofgren
d530c3096f (search) GUI tweaks to make the new interface not fall apart on mobile/chrome 2023-12-02 17:06:40 +01:00
Viktor Lofgren
ae0c1c3f2d (control) Adjust search result margins for better visual density 2023-12-02 17:06:40 +01:00
Viktor Lofgren
0cc2564380 (search) CSS tweaks 2023-12-02 17:06:40 +01:00
Viktor Lofgren
38d20022ad (search) Fix script loading for mobile support 2023-12-02 17:06:40 +01:00
Viktor Lofgren
280132dad0 (search) Fix script loading for mobile support 2023-12-02 17:06:40 +01:00
Viktor Lofgren
61de4e2789 (search) Retain filter options when performing a new search from the input field 2023-12-02 17:06:40 +01:00
Viktor Lofgren
f9d3455320 (search) Reduce visual weight of search results 2023-12-02 17:06:40 +01:00
Viktor Lofgren
2ff64c3c12 (search) New toggle for reducing tracking 2023-12-02 17:06:40 +01:00
Viktor Lofgren
902f235b5b (search) Integrate 'similar' tab in site info. 2023-12-02 17:06:40 +01:00
Viktor Lofgren
97d43a6fa2 (search) Revamp browse results with new look. 2023-12-02 17:06:40 +01:00
Viktor Lofgren
9bc65ff0ca (search) Desaturate search result titles according to rank 2023-12-02 17:06:40 +01:00
Viktor Lofgren
6cd6a615fd (search) Add data-filter to body as a data attribute
For future shenanigans ;D
2023-12-02 17:06:40 +01:00
Viktor Lofgren
5639f0653d (search) Rename SearchProfile.name into filterId
Avoid foot-gun caused by name clash with the Enumeration method name(), which returns the Java name of the enumeration value.
2023-12-02 17:06:40 +01:00
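The foot-gun mentioned in 5639f0653d can be shown with a small enum. This is a sketch only: SearchProfileSketch, its constants and the ID strings are hypothetical and do not reflect the real SearchProfile values.

```java
// Hypothetical enum; constants and IDs are illustrative only.
enum SearchProfileSketch {
    POPULAR("popular"),
    ACADEMIA("academia");

    // Called filterId rather than name, so it cannot be confused with
    // Enum.name(), which returns the Java identifier (e.g. "ACADEMIA").
    final String filterId;

    SearchProfileSketch(String filterId) {
        this.filterId = filterId;
    }

    // Looks a profile up by its filter ID, falling back to POPULAR.
    static SearchProfileSketch byFilterId(String id) {
        for (SearchProfileSketch p : values()) {
            if (p.filterId.equals(id)) return p;
        }
        return POPULAR;
    }
}
```

With a field called name, code that reads `profile.name` and code that calls `profile.name()` would silently get different strings. The distinct field name removes that ambiguity.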
Viktor Lofgren
251174c9a2 (search) Update front page with new look 2023-12-02 17:06:40 +01:00
Viktor Lofgren
42ea87d637 (search) Update conversion results, error page, and dictionary results with new CSS. 2023-12-02 17:06:40 +01:00
Viktor Lofgren
7c8a60b8cf (search) Site info view is mostly done
Also optimize the rendering a bit to avoid having to allocate huge string buffers, writing directly to Spark's response instead.
2023-12-02 17:06:40 +01:00
Viktor Lofgren
2f4500be5a (search) New frontend look 2023-12-02 17:06:40 +01:00
Viktor Lofgren
fa7534a362 (search) Remove dead code 2023-12-02 17:06:40 +01:00
Viktor Lofgren
a258f0af7a (search) Refactor search parameters to include query 2023-12-02 17:06:40 +01:00