Commit Graph

353 Commits

Author SHA1 Message Date
Viktor Lofgren
c7f0276005 (control) Don't spin on process output printing
This is the "correct" way of copying stdout and stderr to the current process' output.
2023-08-22 11:48:54 +02:00
Viktor Lofgren
46df58d28b (control-service) Use default value for WMSA_HOME if it is not set 2023-08-22 11:11:01 +02:00
Viktor Lofgren
15912f31d0 (control-service) Basic GUI for deleting bad links from exploration mode 2023-08-21 18:35:26 +02:00
Viktor Lofgren
93f49f1fb3 (search-service) RSS feed for the news feed 2023-08-20 12:58:34 +02:00
Viktor Lofgren
704de50a9b (forward-index, valuator) HTML features in valuator
Put it in the forward index for easy access during index-side valuation.
2023-08-18 11:54:56 +02:00
Viktor Lofgren
efee904531 (search) Use the adtech bit instead of ads for ads flag 2023-08-18 11:24:59 +02:00
Viktor Lofgren
46d761f34f (language) fasttext based language filter 2023-08-16 15:48:12 +02:00
Viktor Lofgren
4598c7f40f (valuation) Penalize wordpress style kebab case urls 2023-08-16 13:11:24 +02:00
Viktor Lofgren
606db54dc8 (docs) Fix dead links to message-queue after moving it to libraries 2023-08-15 19:26:40 +02:00
Viktor Lofgren
df85468c01 (control) Action for refreshing the blogs definition. 2023-08-15 11:38:52 +02:00
Viktor Lofgren
e7192a9cad (mq) Refactor mq and actor library and move it to libraries out of common 2023-08-15 10:53:23 +02:00
Viktor Lofgren
019b61b330 (control) Remove message queue listing from actors view. 2023-08-13 13:50:04 +02:00
Viktor Lofgren
f997707049 (control) Move event log out of plumbing 2023-08-13 13:40:50 +02:00
Viktor Lofgren
c56ee10185 (control) Separate [Process] and [Process and Load] actions for crawl data; all SLOW data is deletable. 2023-08-13 13:39:59 +02:00
Viktor Lofgren
8210e49b4e (control) Helpful tooltips for the Actor table. 2023-08-13 12:55:56 +02:00
Viktor Lofgren
a8f2e9ee2c (control) Tidy up empty tables, remove actors from index view 2023-08-12 15:18:14 +02:00
Viktor Lofgren
a91b909103 (control) Event log on stop actor 2023-08-12 15:02:53 +02:00
Viktor Lofgren
99e031c529 (control) Remove broken pagination from events and message queue; new "light" events table for some views 2023-08-12 14:57:55 +02:00
Viktor Lofgren
998f239ed9 (control) Filterable event log view 2023-08-12 14:43:11 +02:00
Viktor Lofgren
0961f627b1 (control) Pretty up the nav bar 2023-08-12 14:42:42 +02:00
Viktor Lofgren
4f8048be31 (blacklist) Blacklist management 2023-08-10 15:40:07 +02:00
Viktor Lofgren
ce293029c7 (converter) Treat adtech tracking as advertisement. 2023-08-09 14:23:53 +02:00
Viktor Lofgren
251fc63b42 (*) Fix merge gore 2023-08-09 13:33:28 +02:00
Viktor Lofgren
47f3855a4b (control) More informative readme.md 2023-08-09 12:42:23 +02:00
Viktor Lofgren
71dfe9f33e (control) Clean up the ControlService, move mq-related endpoints to MessageQueueService. 2023-08-09 12:42:01 +02:00
Viktor Lofgren
4ab1cd9502 (*) last touches 2023-08-07 12:57:44 +02:00
Viktor Lofgren
be444f9172 (control) New actions view, re-arrange navigation menu 2023-08-05 14:45:04 +02:00
Viktor Lofgren
bf37a3eb25 (search-service) Make flushCaches endpoint a notice and not a request 2023-08-05 14:45:04 +02:00
Viktor Lofgren
00eb8b90dc (control) Message Queue GUI 2023-08-04 22:05:29 +02:00
Viktor Lofgren
912129311d (control) Message Queue GUI 2023-08-04 17:54:18 +02:00
Viktor Lofgren
624b78ec3a (heartbeat) Task heartbeats 2023-08-04 14:40:06 +02:00
Viktor Lofgren
1d0cea1d55 (converter) GUI for dealing with user complaints 2023-08-03 17:59:57 +02:00
Viktor Lofgren
f01f608474 (blacklist) Support blacklists with subdomain 2023-08-03 17:58:52 +02:00
Viktor Lofgren
63e857f7cd (control) Add basic api key management 2023-08-02 20:14:03 +02:00
Viktor Lofgren
9979c9defe (search/index) Add blogosphere filter 2023-08-02 20:13:30 +02:00
Viktor Lofgren
8de3e6ab80 (control) Fix bug where CrawlActor and RecrawlActor would steal each others' mail 2023-08-01 22:33:30 +02:00
Viktor Lofgren
867410c66b (file-storage) Automatic file storage discovery via manifest file 2023-08-01 18:05:43 +02:00
Viktor Lofgren
36a23707c1 (control) Control service should be a core service. 2023-08-01 15:49:50 +02:00
Viktor Lofgren
e22e65eee4 (index) Fix bug related to debug print statements 2023-07-22 14:33:58 +02:00
Viktor Lofgren
d7ab21fe34 (*) Refactor Control Service and processes 2023-07-17 21:20:31 +02:00
Viktor Lofgren
8b74e3aa0d (*) File Storage WIP 2023-07-14 17:08:10 +02:00
Viktor Lofgren
88b9ec70c6 (control, WIP) Run reconvert-load from converter :D 2023-07-11 18:05:37 +02:00
Viktor
cbbf60a599 Better fingerprinting (#35)
* Better fingerprinting for server tech
* Many more features in FeatureExtractor
* Blog specialization
* SiteType table
2023-07-10 18:58:43 +02:00
Viktor Lofgren
96eecc6ea5 Minor: Readability. 2023-07-10 18:58:43 +02:00
Viktor Lofgren
d9e6c4f266 Trial integration of MQ-FSM into index service. 2023-07-06 18:04:16 +02:00
Viktor Lofgren
62cc9df206 Embryo of new control process
* New events and heartbeat tables in mariadb
* Refactored to a cleaner Service interface
2023-07-03 10:40:32 +02:00
Viktor Lofgren
0f34beb1aa Update search front page 2023-06-29 17:14:27 +02:00
Viktor Lofgren
a6a66c6d8a Improve site info for unknown domains:
* Placeholder screenshot should work
* Add a link to git-repo for submitting the site for crawling
2023-06-27 15:32:11 +02:00
Viktor Lofgren
d86e8522e2 Add search profiles for wiki, forum and docs. 2023-06-24 12:17:35 +02:00
Viktor Lofgren
bd2c3855ed Add bits and keywords for generator classes (docs, forum, wiki). 2023-06-23 21:35:28 +02:00
Viktor Lofgren
55c65f0935 Use document generator to complement the document selection.
Will let through e.g. a modern SSG in the small web filter.
2023-06-22 17:21:33 +02:00
Viktor Lofgren
fd192d2791 Fix putative overflow error with a large dictionary 2023-05-28 11:57:06 +02:00
Viktor Lofgren
1e184a8372 (search) Make exploration mode more random 2023-05-25 17:40:28 +02:00
Viktor Lofgren
6fae51a8ef Stopgap fix for a bug in dealing with quote terms containing stop words. 2023-05-02 19:38:59 +02:00
Viktor Lofgren
bb587ca47f Reformulate search-header.hdb, s/Support/Donate/. The old wording was apparently confusing some people into thinking they could get support on this page. 2023-04-18 17:04:24 +02:00
Viktor Lofgren
df1850bd45 Fix bug in index service where tld: and links:-queries wouldn't work. 2023-04-15 18:39:16 +02:00
Viktor Lofgren
502713f7a8 Reduce memory churn 2023-04-10 16:51:17 +02:00
Viktor Lofgren
e19256a6b6 Tune settings to retrieve more results. 2023-04-10 15:39:20 +02:00
Viktor Lofgren
ccc41d1717 Clean up of the index query handling related code. 2023-04-10 14:50:57 +02:00
Viktor Lofgren
e49b1dd155 Better handling of quote terms, fix bug in handling of longer queries.
... where some terms may previously have been ignored. The latter bug was due to the handling of QueryHeads with AnyOf-style predicates interacting poorly with alreadyConsideredTerms in SearchIndex.java
2023-04-10 13:20:40 +02:00
Viktor Lofgren
fe419b12b4 Better handling of quote terms, fix bug in handling of longer queries.
... where some terms may previously have been ignored. The latter bug was due to the handling of QueryHeads with AnyOf-style predicates interacting poorly with alreadyConsideredTerms in SearchIndex.java
2023-04-10 13:11:40 +02:00
Viktor Lofgren
535a51a621 Repair broken year query test. 2023-04-08 12:04:09 +02:00
Viktor
a278fc6296 Increase search result relevance (#8)
* Increase accuracy of the position bits.
* Increase their width to 56.
* Use a rolling position scheme for bits 16-56 to increase the average accuracy.
* Result ranking overhaul
* Optimized queries
* BM25 in the index service's ranking
* Make gui less jank
* Javadocs for ranking parameters.
2023-04-07 20:18:08 +02:00
Viktor Lofgren
716ab35b4e Search ranking debuggability improvements. 2023-04-02 13:43:24 +02:00
Viktor Lofgren
3fb249758e Adjust result ordering. 2023-04-02 12:05:22 +02:00
Viktor Lofgren
f7a6ef2179 Smarter queries, better logging. 2023-04-02 12:05:09 +02:00
Viktor Lofgren
105d93cd85 Index query builder automatically ignores redundant predicates. 2023-04-02 12:04:26 +02:00
Viktor Lofgren
1e4157017d More helpful descriptions of index queries. 2023-04-02 12:03:58 +02:00
Viktor Lofgren
5fb75adaae Remove antique result scoring adjustment that makes no sense anymore. 2023-04-02 11:58:04 +02:00
Viktor Lofgren
cc4e089a5d Consider average sentence length when selecting search results. This promotes prose over code listings, tabular data, etc. 2023-03-30 15:46:15 +02:00
Viktor Lofgren
dcf6218cdb Fix bugs related to search result selection in the case with multiple search terms.
* A deduplication filter step ran too early, and removed many good results on the basis that they partially, but not fully, fit another set of search terms.
* Altered the query creation process to prefer documents where multiple terms appear in the priority index.
2023-03-29 15:18:52 +02:00
Viktor Lofgren
17ca4f9eea Permit search results that are all synthetic to pass relevancy check. 2023-03-27 17:27:35 +02:00
Viktor Lofgren
7fb3db3249 Fix bug where link on front page news listing wouldn't work.
... also changed order of date and source to make the UI more consistent.
2023-03-27 17:26:46 +02:00
Viktor Lofgren
862e925d7c "-Dsmall-ram=TRUE" no longer does anything. Remove references to the flag, which previously reduced the memory footprint of the loader and index service. 2023-03-26 21:37:11 +02:00
Viktor Lofgren
a0027ad32b Fix broken diagram links after doc/ restructuring. 2023-03-25 16:32:10 +01:00
Viktor
ac1ac3ea57 Move database to a separate module
* Move database to a separate project, break apart sql file into separate entities.
* Fix front page news listing.
2023-03-25 15:26:17 +01:00
Viktor Lofgren
3464ca514b Fix typeahead suggestions 2023-03-25 10:20:52 +01:00
Viktor
e3675d2fa9 Update readme.md 2023-03-22 17:02:03 +01:00
Viktor
c4a6bf7672 Update readme.md 2023-03-22 17:01:34 +01:00
Viktor
cb6865924e Update readme.md 2023-03-22 16:59:38 +01:00
Viktor Lofgren
964014860a Get suggestions working again 2023-03-22 15:11:22 +01:00
Viktor Lofgren
46f81aca2f Break apart reverse index into a separate full index and priority index. It did this before using the same code. This will make the priority index about half as big since it no longer needs to keep metadata. 2023-03-21 16:12:31 +01:00
Viktor Lofgren
72115e490f Put news into a database table instead of keeping them hardcoded, request counter on front page. 2023-03-19 12:54:58 +01:00
Viktor Lofgren
bdd2b4a43e Put news into a database table instead of keeping them hardcoded. 2023-03-19 11:46:13 +01:00
Viktor Lofgren
6a20b2b678 Trivial reformatting of code. 2023-03-17 22:11:14 +01:00
Viktor Lofgren
3675c7a090 The search-service doesn't speak REST. 2023-03-17 16:21:52 +01:00
Viktor Lofgren
2eb972dea1 Remove unrelated code, break tools into their own directory. 2023-03-17 16:03:11 +01:00
Viktor Lofgren
449471a076 Yet more restructuring. Improved search result ranking. 2023-03-16 21:35:54 +01:00
Viktor Lofgren
0ecab53635 Yet more restructuring. 2023-03-13 23:40:26 +01:00
Viktor Lofgren
d82532b7f1 More restructuring, big bug fixes in keyword extraction. 2023-03-13 17:39:53 +01:00
Viktor Lofgren
73eaa0865d The refactoring will continue until morale improves. 2023-03-12 10:50:31 +01:00
Viktor Lofgren
616effdb3c The refactoring will continue until morale improves. 2023-03-12 10:04:48 +01:00
Viktor Lofgren
4cec89da91 Fix bug where results would sometimes be presented solely based on the fact that the document is important on the site in general, regardless of whether it's important to the document. 2023-03-11 14:20:32 +01:00
Viktor Lofgren
6d939175b1 Additional code restructuring to get rid of util and misc-style packages. 2023-03-11 13:48:40 +01:00
Viktor Lofgren
73e412ea5b Clean up search-service and index-api 2023-03-11 12:26:12 +01:00
Viktor Lofgren
0532e8c40e Tidy up. 2023-03-11 11:35:08 +01:00
Viktor Lofgren
919b80b9ab Gradle shouldn't generate dist zips. Zipping jar files is slow and also just ridiculous when you realize jar files are zip files and you can't compress a file twice using the same algo. 2023-03-11 11:34:51 +01:00
Viktor Lofgren
a62015d5f3 Fix broken test, compiler warning. 2023-03-10 17:12:12 +01:00
Viktor Lofgren
722ff3bffb Word feature bit for words that appear in the URL, new search profile for plain text files, better plain text titles. 2023-03-10 16:46:56 +01:00
Viktor Lofgren
efb46cc703 Remove count from WordMetadata entirely. 2023-03-09 18:14:14 +01:00
Viktor Lofgren
9ece07d559 Chasing a result ranking bug 2023-03-09 17:52:35 +01:00
Viktor Lofgren
1252f95da5 Fix for valuation bug in index code that wouldn't sort bad-ish items properly. 2023-03-07 21:26:04 +01:00
Viktor Lofgren
ad1be7c835 Move all code to a code directory. 2023-03-07 17:14:32 +01:00