Viktor Lofgren
f809d22fc6
(loader) Support simultaneous loading of multiple processed data sets
2023-09-22 13:14:58 +02:00
Viktor Lofgren
70aa04c047
(converter, stackexchange-xml) Add the ability to sideload stackexchange data
2023-09-21 12:48:33 +02:00
Viktor Lofgren
f8050816ac
(search) Don't run LSH deduplication on details with zero lsh to support not calculating this hash.
2023-09-21 12:47:02 +02:00
Viktor Lofgren
9b385ec7cc
(converter) Make it possible to sideload documents from a directory tree
2023-09-17 14:35:06 +02:00
Viktor Lofgren
5c040f7a46
(crawl-spec) Parquetify crawl spec
...
* Crawl-specs are now parquet files
* Deprecate the crawl-job-extractor tool
2023-09-17 09:41:34 +02:00
Viktor Lofgren
5e5aaf9a7e
(converter, control) Re-enable sideloading encyclopedia data
2023-09-14 12:12:07 +02:00
Viktor Lofgren
07d7507ac6
(control-service) Move Actions up in storage-details
...
Papercut fix. If a file storage area has a lot of files, you have to scroll down a long way to get to the actions otherwise.
2023-09-02 15:41:55 +02:00
Viktor Lofgren
9e185e80ce
(control-service) Add timestamp to file storages.
2023-09-02 14:01:04 +02:00
Viktor Lofgren
d31d8ec5b0
(index) Log keyword ids on hex format
2023-09-01 15:40:24 +02:00
Viktor Lofgren
2b00cd632d
(process) Propagate environment JVM params to the index constructor
2023-09-01 15:39:42 +02:00
Viktor Lofgren
764e7d1315
(index) Add more comprehensive integration tests for the index service.
2023-08-30 10:37:24 +02:00
Viktor Lofgren
e4d7958379
(control) ProcessLivenessMonitorActor shouldn't reap tasks based on service instance liveness
2023-08-29 18:19:04 +02:00
Viktor Lofgren
3f288e264b
(minor) Clean up dead endpoints
2023-08-29 17:04:54 +02:00
Viktor Lofgren
dd593c292c
(loader) Minor optimizations and bugfixes.
...
* Reduce memory churn in LoaderIndexJournalWriter, fix bug with keyword mappings as well
* Remove remains of OldDomains
* Ensure LOADER_PROCESS_OPTS gets fed to the processes
* LinkdbStatusWriter won't execute batch after each added item post 100 items
2023-08-29 15:37:52 +02:00
Viktor Lofgren
39c1857c61
(heartbeat, reverse-index) Better heartbeat mocking, improved heartbeats for reverse index construction.
2023-08-29 13:07:55 +02:00
Viktor Lofgren
c57a2d0dc3
(control-service) Remove old index journal files when restoring a backup.
2023-08-29 11:58:01 +02:00
Viktor Lofgren
6525b16e1f
(minor) Improved logging and error messages
2023-08-28 19:53:55 +02:00
Viktor Lofgren
b6a92506d1
(index) Hook in missing DocIdRewriter
...
This enables documents to be ranked properly.
2023-08-28 19:53:43 +02:00
Viktor Lofgren
3101b74580
(index) Move to a lexicon-free index design
...
This is a system-wide change. The index used to have a lexicon, mapping words to wordIds using a large in-memory hash table. This made index-construction easier, but it
also added a fairly significant RAM penalty to both the index service and the loader.
The new design moves to 64 bit word identifiers calculated using the murmur hash of the keyword, and an index construction based on merging smaller indices.
It also became necessary half-way through to upgrade guice as its error reporting wasn't *quite* compatible with JDK20.
2023-08-28 14:02:23 +02:00
Viktor Lofgren
194a6057dd
(index,control) Recoverable index backups
2023-08-25 14:57:43 +02:00
Viktor Lofgren
e710e057e2
(db) Remove EC_URL and EC_PAGE_DATA from mariadb database
2023-08-25 13:45:03 +02:00
Viktor Lofgren
28188a6e59
(control) Simplify ConvertAndLoadActor
2023-08-25 13:30:20 +02:00
Viktor Lofgren
70a5df96c8
(control) Display progress of process tasks
2023-08-25 13:05:21 +02:00
Viktor Lofgren
460998d512
(index) Move index construction to separate process.
...
This provides a much cleaner separation of concerns, and makes it possible to get rid of a lot of the gunkier parts of the index service. It will also permit lowering the Xmx on the index service a fair bit, so we can get CompressedOOps again :D
2023-08-25 12:52:54 +02:00
Viktor Lofgren
e741301417
(search) Remove endpoint flush-search-caches
...
It's not necessary anymore with the new linkdb.
2023-08-25 09:51:06 +02:00
Viktor Lofgren
5ed5298409
(converter) Update confusing state description
...
SWAP_LEXICON doesn't instruct the index service to do anything. It just moves the file.
2023-08-24 18:56:49 +02:00
Viktor Lofgren
b911665691
(index) Clean up and optimize valuator
2023-08-24 18:34:06 +02:00
Viktor Lofgren
56eb83319d
(index) Clean up result domain deduplicator
2023-08-24 18:24:55 +02:00
Viktor Lofgren
1e6800565a
(system) Remove EdgeId<T> and similar objects
...
They seemed like a good idea at the time, but in practice they're wasting resources and not really providing the clarity I had hoped.
2023-08-24 17:46:02 +02:00
Viktor Lofgren
c909120ae1
(search) Basic working integration of linkdb in search service
2023-08-24 17:24:56 +02:00
Viktor Lofgren
9894f37412
(index) Implement new URL ID coding scheme.
...
Also refactor along the way. Really needs an additional pass, these tests are very hairy.
2023-08-24 16:44:27 +02:00
Viktor Lofgren
ebc84c22fb
Upgrade antique lombok plugin
...
This permits tests to run on JDK20 environments.
2023-08-23 14:34:32 +00:00
Viktor Lofgren
aa0d256d6a
Upgrade code to Java 20.
...
* Change language version
* Upgrade Lombok to a JDK20 compatible version
2023-08-23 13:37:49 +00:00
Viktor Lofgren
4d75fa2908
Upgrade gradle and docker plugin to support native JDK20 environments
2023-08-23 13:30:55 +00:00
Viktor Lofgren
6f222b9800
(search) Add refresh link to explore mode.
...
This is a QOL improvement for mobile users, who otherwise would have to scroll all the way up to refresh.
Also removed the confusing "this is a random set of domains"-message when viewing adjacent websites, as it's not random.
2023-08-22 12:43:44 +02:00
Viktor Lofgren
c7f0276005
(control) Don't spin on process output printing
...
This is the "correct" way of copying stdout and stderr to the curren't process' output.
2023-08-22 11:48:54 +02:00
Viktor Lofgren
46df58d28b
(control-service) Use default value for WMSA_HOME if it is not set
2023-08-22 11:11:01 +02:00
Viktor Lofgren
15912f31d0
(control-service) Basic GUI for deleting bad links from exploration mode
2023-08-21 18:35:26 +02:00
Viktor Lofgren
93f49f1fb3
(search-service) RSS feed for the news feed
2023-08-20 12:58:34 +02:00
Viktor Lofgren
704de50a9b
(forward-index, valuator) HTML features in valuator
...
Put it in the forward index for easy access during index-side valuation.
2023-08-18 11:54:56 +02:00
Viktor Lofgren
efee904531
(search) Use the adtech bit instead of ads for ads flag
2023-08-18 11:24:59 +02:00
Viktor Lofgren
46d761f34f
(language) fasttext based language filter
2023-08-16 15:48:12 +02:00
Viktor Lofgren
4598c7f40f
(valuation) Penalize wordpress style kebab case urls
2023-08-16 13:11:24 +02:00
Viktor Lofgren
606db54dc8
(docs) Fix dead links to message-queue after moving it to libraries
2023-08-15 19:26:40 +02:00
Viktor Lofgren
df85468c01
(control) Action for refreshing the blogs definition.
2023-08-15 11:38:52 +02:00
Viktor Lofgren
e7192a9cad
(mq) Refactor mq and actor library and move it to libraries out of common
2023-08-15 10:53:23 +02:00
Viktor Lofgren
019b61b330
(control) Remove message queue listing from actors view.
2023-08-13 13:50:04 +02:00
Viktor Lofgren
f997707049
(control) Move event log out of plumbing
2023-08-13 13:40:50 +02:00
Viktor Lofgren
c56ee10185
(control) Separate [Process] and [Process and Load] actions for crawl data; all SLOW data is deletable.
2023-08-13 13:39:59 +02:00
Viktor Lofgren
8210e49b4e
(control) Helpful tooltips for the Actor table.
2023-08-13 12:55:56 +02:00
Viktor Lofgren
a8f2e9ee2c
(control) Tidy up empty tables, remove actors from index view
2023-08-12 15:18:14 +02:00
Viktor Lofgren
a91b909103
(control) Event log on stop actor
2023-08-12 15:02:53 +02:00
Viktor Lofgren
99e031c529
(control) Remove broken pagination from events and message queue; new "light" events table for some views
2023-08-12 14:57:55 +02:00
Viktor Lofgren
998f239ed9
(control) Filterable event log view
2023-08-12 14:43:11 +02:00
Viktor Lofgren
0961f627b1
(control) Pretty up the nav bar
2023-08-12 14:42:42 +02:00
Viktor Lofgren
4f8048be31
(blacklist) Blacklist management
2023-08-10 15:40:07 +02:00
Viktor Lofgren
ce293029c7
(converter) Treat adtech tracking as advertisement.
2023-08-09 14:23:53 +02:00
Viktor Lofgren
251fc63b42
(*) Fix merge gore
2023-08-09 13:33:28 +02:00
Viktor Lofgren
47f3855a4b
(control) More informative readme.md
2023-08-09 12:42:23 +02:00
Viktor Lofgren
71dfe9f33e
(control) Clean up the ControlService, move mq-related endpoints to MessageQueueService.
2023-08-09 12:42:01 +02:00
Viktor Lofgren
4ab1cd9502
(*) last touches
2023-08-07 12:57:44 +02:00
Viktor Lofgren
be444f9172
(control) New actions view, re-arrange navigation menu
2023-08-05 14:45:04 +02:00
Viktor Lofgren
bf37a3eb25
(search-service) Make flushCaches endpoint a notice and not a request
2023-08-05 14:45:04 +02:00
Viktor Lofgren
00eb8b90dc
(control) Message Queue GUI
2023-08-04 22:05:29 +02:00
Viktor Lofgren
912129311d
(control) Message Queue GUI
2023-08-04 17:54:18 +02:00
Viktor Lofgren
624b78ec3a
(heartbeat) Task heartbeats
2023-08-04 14:40:06 +02:00
Viktor Lofgren
1d0cea1d55
(converter) GUI for dealing with user complaints
2023-08-03 17:59:57 +02:00
Viktor Lofgren
f01f608474
(blacklist) Support blacklists with subdomain
2023-08-03 17:58:52 +02:00
Viktor Lofgren
63e857f7cd
(control) Add basic api key management
2023-08-02 20:14:03 +02:00
Viktor Lofgren
9979c9defe
(search/index) Add blogosphere filter
2023-08-02 20:13:30 +02:00
Viktor Lofgren
8de3e6ab80
(control) Fix bug where CrawlActor and RecrawlActor would steal each others' mail
2023-08-01 22:33:30 +02:00
Viktor Lofgren
867410c66b
(file-storage) Automatic file storage discovery via manifest file
2023-08-01 18:05:43 +02:00
Viktor Lofgren
36a23707c1
(control) Control service should be a core service.
2023-08-01 15:49:50 +02:00
Viktor Lofgren
e22e65eee4
(index) Fix bug related to debug print statements
2023-07-22 14:33:58 +02:00
Viktor Lofgren
d7ab21fe34
(*) Refactor Control Service and processes
2023-07-17 21:20:31 +02:00
Viktor Lofgren
8b74e3aa0d
(*) File Storage WIP
2023-07-14 17:08:10 +02:00
Viktor Lofgren
88b9ec70c6
(control, WIP) Run reconvert-load from converter :D
2023-07-11 18:05:37 +02:00
Viktor
cbbf60a599
Better fingerprinting ( #35 )
...
* Better fingerprinting for server tech
* Many more features in FeatureExtractor
* Blog specialization
* SiteType table
2023-07-10 18:58:43 +02:00
Viktor Lofgren
96eecc6ea5
Minor: Readability.
2023-07-10 18:58:43 +02:00
Viktor Lofgren
d9e6c4f266
Trial integration of MQ-FSM into index service.
2023-07-06 18:04:16 +02:00
Viktor Lofgren
62cc9df206
Embryo of new control process
...
* New events and heartbeat tables in mariadb
* Refactored to a cleaner Service interface
2023-07-03 10:40:32 +02:00
Viktor Lofgren
0f34beb1aa
Update search front page
2023-06-29 17:14:27 +02:00
Viktor Lofgren
a6a66c6d8a
Improve site info for unknown domains:
...
* Placeholder screenshot should work
* Add a link to git-repo for submitting the site for crawling
2023-06-27 15:32:11 +02:00
Viktor Lofgren
d86e8522e2
Add search profiles for wiki, forum and docs.
2023-06-24 12:17:35 +02:00
Viktor Lofgren
bd2c3855ed
Add bits and keywords for generator classes (docs, forum, wiki).
2023-06-23 21:35:28 +02:00
Viktor Lofgren
55c65f0935
Use document generator to complement the document selection.
...
Will let through e.g. a modern SSG in the small web filter.
2023-06-22 17:21:33 +02:00
Viktor Lofgren
fd192d2791
Fix putative overflow error with a large dictionary
2023-05-28 11:57:06 +02:00
Viktor Lofgren
1e184a8372
(search) Make exploration mode more random
2023-05-25 17:40:28 +02:00
Viktor Lofgren
6fae51a8ef
Stopgap fix for a bug in dealing with quote terms containing stop words.
2023-05-02 19:38:59 +02:00
Viktor Lofgren
bb587ca47f
Reformulate search-header.hdb, s/Support/Donate/ the formulation was apparently confusing some people thinking they could get support on this page.
2023-04-18 17:04:24 +02:00
Viktor Lofgren
df1850bd45
Fix bug in index service where tld: and links:-queries wouldn't work.
2023-04-15 18:39:16 +02:00
Viktor Lofgren
502713f7a8
Reduce memory churn
2023-04-10 16:51:17 +02:00
Viktor Lofgren
e19256a6b6
Tune settings to retrieve more results.
2023-04-10 15:39:20 +02:00
Viktor Lofgren
ccc41d1717
Clean up of the index query handling related code.
2023-04-10 14:50:57 +02:00
Viktor Lofgren
e49b1dd155
Better handling of quote terms, fix bug in handling of longer queries.
...
... where some terms may previously have been ignored. The latter bug was due to the handling of QueryHeads with AnyOf-style predicates interacting poorly with alreadyConsideredTerms in SearchIndex.java
2023-04-10 13:20:40 +02:00
Viktor Lofgren
fe419b12b4
Better handling of quote terms, fix bug in handling of longer queries.
...
... where some terms may previously have been ignored. The latter bug was due to the handling of QueryHeads with AnyOf-style predicates interacting poorly with alreadyConsideredTerms in SearchIndex.java
2023-04-10 13:11:40 +02:00
Viktor Lofgren
535a51a621
Repair broken year query test.
2023-04-08 12:04:09 +02:00
Viktor
a278fc6296
Increase search result relevance ( #8 )
...
* Increase accuracy of the position bits.
* Increase their width to 56.
* Use a rolling position scheme for bits 16-56 to increase the average accuracy.
* Result ranking overhaul
* Optimized queries
* BM25 in the index service's ranking
* Make gui less jank
* Javadocs for ranking parameters.
2023-04-07 20:18:08 +02:00
Viktor Lofgren
716ab35b4e
Search ranking debuggability improvements.
2023-04-02 13:43:24 +02:00
Viktor Lofgren
3fb249758e
Adjust result ordering.
2023-04-02 12:05:22 +02:00