Viktor Lofgren
8375237de5
(converter) Add special keyword for websites with a tilde url.
2023-10-09 17:02:32 +02:00
Viktor Lofgren
6319b8ef51
(api-service) Improved testability, always set content type to application/json
2023-10-09 15:39:34 +02:00
Viktor Lofgren
397a85eaa4
(query-service) Apply blacklisting to search results
2023-10-09 15:18:53 +02:00
Viktor Lofgren
3889c4bdd9
(refactor) Remove features-search and update documentation
2023-10-09 15:12:30 +02:00
Viktor Lofgren
c899f1cb85
(docs) Update documentation to reflect new query service
2023-10-09 14:56:59 +02:00
Viktor Lofgren
d8956c51d0
(refactor) Remove api:search-api
...
Application services should not have an API, but purely act as clients
to the core services (which should always have an API).
2023-10-09 14:42:33 +02:00
Viktor Lofgren
5dd55c7cad
(refactor) Rename satellite services to application services
...
This is a better descriptor, since they now all implement different applications on top of the core services' APIs.
2023-10-09 13:45:45 +02:00
Viktor Lofgren
c0e61d4c87
(refactor) Move search service into services-satellite
2023-10-09 13:40:01 +02:00
Viktor Lofgren
97e17282ab
(query-service) Move query parsing from search-service to the new query service.
2023-10-09 13:27:44 +02:00
Viktor Lofgren
94c882af7d
(query-service) Provide delegate of IndexApi's query functionality.
...
This is an intermediate step in the process of introducing the query-service as a proxy between search and index.
2023-10-08 22:22:26 +02:00
Viktor Lofgren
89c6d85f2f
(query-service) Create new empty 'query-service' service
2023-10-08 17:31:50 +02:00
Viktor Lofgren
cf366c602f
(search) Refactor SearchQueryIndexService in preparation for feature extraction.
...
Prefer working on DecoratedSearchResultItem in favor of UrlDetails.
2023-10-08 17:15:41 +02:00
Viktor Lofgren
77ccab7d80
(index) Move linkdb to index from search.
...
This makes index complete in the sense that you can deploy an index instance and build a complete separate application on top of it, without having to go through the Marginalia-laden search service.
2023-10-08 16:48:35 +02:00
Viktor Lofgren
f51ba63742
(search) Remove dead file
2023-10-07 21:05:06 +02:00
Viktor Lofgren
9044518be5
(search) Fix broken link to git repo
2023-10-07 19:43:22 +02:00
Viktor Lofgren
9e0367eef4
(search) Filter blacklisted items in API query service as well
2023-10-07 16:16:04 +02:00
Viktor Lofgren
235bb6c1b9
(control) Administrative QOL improvement, GUI for banning spam
2023-10-07 15:45:50 +02:00
Viktor Lofgren
49344d7ea8
(control) Administrative QOL improvement, GUI for banning spam
2023-10-07 15:43:18 +02:00
Viktor Lofgren
1b418d77ff
(search) We got some new IP ranges to work with for the crawler
2023-10-07 13:41:55 +02:00
Viktor Lofgren
80cc302627
(search) We can't in claim to be on PC hardware anymore...
2023-10-07 11:49:29 +02:00
Viktor
8e1abc3f10
(index-reverse) Parallel construction of the reverse indexes. ( #52 )
...
* (index-reverse) Parallel construction of the reverse indexes.
* (array) Remove wasteful calculation of numDistinct before merging two sorted arrays.
* (index-reverse) Force changes to disk on close, reduce logging.
* (index-reverse) Clean up merging process and add back logging
* (run) Add a conservative default for INDEX_CONSTRUCTION_PROCESS_OPTS's parallelism as it eats a lot of RAM
* (index-reverse) Better logging during processing
* (array) 2GB+ compatible write() function
* (array) 2GB+ compatible write() function
* (index-reverse) We are logging like Bolsonaro and I will not have it.
* (reverse-index) Self-diagnostics
* (btree) Fix bug in btree reader to do with large data sizes
2023-10-07 10:00:00 +02:00
Viktor Lofgren
e498c6907a
(forward-index) Don't leak off heap memory
2023-10-05 21:22:13 +02:00
Viktor Lofgren
08e8fc6736
(index-journal) Thread safe IndexJournalReadEntry
2023-10-05 19:39:09 +02:00
Viktor Lofgren
f6e9ef6de9
(array) Fix transferFrom() so it survives larger than 2 GB transfers
2023-10-04 13:57:36 +02:00
Viktor Lofgren
c51159672e
(build) Move unit test configuration to root build.gradle
2023-10-04 12:46:22 +02:00
Viktor Lofgren
233b51e29e
(test) flag DomainTypesTest as Slow to exclude from regular CI
2023-10-04 12:23:10 +02:00
Viktor Lofgren
54c8e13a68
(term-frequency-dict) Fix memory leak in TermFrequencyDict
2023-10-04 11:55:11 +02:00
Viktor Lofgren
405300b4b2
(control) Fix bug where finishing one process ad hoc task would remove all other tasks from the db
2023-10-04 11:44:31 +02:00
Viktor Lofgren
40768e935b
(test) Removing /tmp-guardrails as it doesn't hold in CI
2023-10-02 16:52:59 +02:00
Viktor Lofgren
13ee31770a
(file storage) Make it possible to override the value returned by getFileStorage(type) with a JVM property.
2023-10-01 12:57:53 +02:00
Viktor Lofgren
93dc80000c
(bugfix) Fix NPE in KeywordExtractor due to bad SoftReference handling
2023-09-26 17:16:41 +02:00
Viktor Lofgren
e0cd3cd991
(converter) Alter StackexchangeSideloader's summary length to align with the rest of the system.
2023-09-26 12:19:43 +02:00
Viktor Lofgren
81ae501e73
(converter) Use ThreadLocalSentenceExtractorProvider for PlainText plugin as well
2023-09-25 18:28:34 +02:00
Viktor Lofgren
9b781f8404
(keyoword-extractor) Address very rare race condition in memoization logic
2023-09-25 18:28:04 +02:00
Viktor Lofgren
f797a92f87
(converter, minor) Use domain name in task heartbeat progress
2023-09-25 18:27:04 +02:00
Viktor Lofgren
ec6c9bca62
(common) Fix factual error in comments
2023-09-24 19:40:19 +02:00
Viktor Lofgren
a433bbbe45
(converter) Fix rare sentence extractor bug
...
It was caused by non-thread safe concurrent memory access in SentenceExtractor.
2023-09-24 19:39:48 +02:00
Viktor Lofgren
8ca20f184d
(keyword-extraction) Chasing my tail looking for a bug
2023-09-24 19:39:48 +02:00
Viktor Lofgren
d160954080
(index) Two useful debug endpoints
2023-09-24 19:39:48 +02:00
Viktor Lofgren
14372e0ef0
(index) Slightly reduce alloc churn
2023-09-24 19:36:14 +02:00
Viktor Lofgren
03bffa27ac
(search) Add combined id to the search result HTML
2023-09-24 19:34:35 +02:00
Viktor Lofgren
028b5a4f0d
(minor performance) Reduce GC churn in index
2023-09-24 12:12:08 +02:00
Viktor Lofgren
cd12f49fc0
(long-array) Return slices SegmentLongArray of itself for range() &c
2023-09-24 11:31:54 +02:00
Viktor Lofgren
1bd146fb8e
(minor) Remove dead code
2023-09-24 10:55:20 +02:00
Viktor Lofgren
5f6c3da7a4
(index) Add close methods on the index readers so they clean up their mmaps
2023-09-24 10:54:23 +02:00
Viktor Lofgren
d0aa754252
(long-array) Implement java.lang.foreign.Arena based lifecycle control for LongArray.
...
Further de-ByteBuffer:ing of these classes is to be done, but this is the smallest most urgently needed benefit.
This commit is a WIP but in a fully working state, pushing due to the importance of the changes to offer lifecycle control over mmaps.
2023-09-24 10:40:06 +02:00
Viktor Lofgren
dbe9235f3a
(*) Upgrade to JDK21 with preview enabled.
...
... also move some common configuration into the root build.gradle-file.
Support for JDK21 in lombok is a bit sketchy at the moment, but it seems to work. This upgrade is kind of important as the new index construction really benefits from Arena based lifecycle control over off-heap memory.
2023-09-24 10:38:59 +02:00
Viktor Lofgren
d78569986b
(backups) Fix bug where backup service would zero the linkdb when restoring.
2023-09-22 18:34:34 +02:00
Viktor Lofgren
95323e6caa
(backups) Support restore multi-source load data
2023-09-22 18:34:17 +02:00
Viktor Lofgren
f809d22fc6
(loader) Support simultaneous loading of multiple processed data sets
2023-09-22 13:14:58 +02:00
Viktor Lofgren
10cad3abb2
(dating) Implementing @samstorment's fantastic design polish
2023-09-21 15:19:50 +02:00
Viktor Lofgren
9338f35cd8
(doc) Remove confusingly outdated ER-diagrams
2023-09-21 15:08:27 +02:00
Viktor Lofgren
ad660cf420
(converter) Bugfix: Don't try to Path.of() on optional field
2023-09-21 13:27:09 +02:00
Viktor Lofgren
75f8ae2815
(file-storage) Use human-readable timestamps in the names of file storage directories
2023-09-21 13:22:53 +02:00
Viktor Lofgren
70aa04c047
(converter, stackexchange-xml) Add the ability to sideload stackexchange data
2023-09-21 12:48:33 +02:00
Viktor Lofgren
4aa47e87f2
(blocking-thread-pool) Add isTerminated convenience function
2023-09-21 12:47:41 +02:00
Viktor Lofgren
f8050816ac
(search) Don't run LSH deduplication on details with zero lsh to support not calculating this hash.
2023-09-21 12:47:02 +02:00
Viktor Lofgren
5b0a6d7ec1
(stackexchange-converter) Create tool for converting stackexchange 7z-files to digestible sqlite db:s
2023-09-20 15:15:13 +02:00
Viktor Lofgren
3b4d08f52b
(stackexchange-integration) Add better comments
2023-09-20 14:43:06 +02:00
Viktor Lofgren
6bbf40d7d2
(stackexchange-integration) Tools for reading stackexchange xml files
2023-09-20 14:17:33 +02:00
Viktor Lofgren
d895f83520
(blocking-thread-pool) Move DumbThreadPool to its own micro-library
...
Also rename it to SimpleBlockingThreadPool.
2023-09-20 10:11:49 +02:00
Viktor Lofgren
f6b9e8c5eb
(converter) JavadocSpecialization should truncate its summary if it gets too long
2023-09-17 16:25:33 +02:00
Viktor Lofgren
98bcdf6028
(converter) DirtreeSideloader now trims /index.html from the URL if present
...
This is a crawler artifact in 9 cases out of 10, and may lead to bad URLs.
2023-09-17 16:08:16 +02:00
Viktor Lofgren
9b385ec7cc
(converter) Make it possible to sideload documents from a directory tree
2023-09-17 14:35:06 +02:00
Viktor Lofgren
5c040f7a46
(crawl-spec) Parquetify crawl spec
...
* Crawl-specs are now parquet files
* Deprecate the crawl-job-extractor tool
2023-09-17 09:41:34 +02:00
Viktor Lofgren
c67d95c00f
(converter) Write dummy processor log when sideloading
2023-09-14 14:13:03 +02:00
Viktor Lofgren
5e5aaf9a7e
(converter, control) Re-enable sideloading encyclopedia data
2023-09-14 12:12:07 +02:00
Viktor Lofgren
35996d0adb
(docs) Update the documentation up-to-date information
2023-09-14 11:33:36 +02:00
Viktor Lofgren
eaeb23d41e
(refactor) Remove converting-model package completely
2023-09-14 11:21:44 +02:00
Viktor Lofgren
c71f6ad417
(converter) Add heartbeats to the loader processes and execute the tasks in parallel for a ~2X speedup
2023-09-14 10:11:57 +02:00
Viktor Lofgren
87a8593291
(work-log) Fix bug where items weren't added to the current batch on logItem
2023-09-14 10:11:04 +02:00
Viktor Lofgren
4799dd769e
(converting) WIP begin to remove converting-model and the old InstructionsCompiler
2023-09-13 19:18:58 +02:00
Viktor Lofgren
24b4606f96
(converter,loader) Converter outputs parquet files instead of compressed json.
2023-09-13 16:13:41 +02:00
Viktor Lofgren
064bc5ee76
(processed-data) New parquet-serializable models for converter output
2023-09-11 14:08:40 +02:00
Viktor Lofgren
a52d78c8ee
(work-log) New batching work log
2023-09-11 14:08:08 +02:00
Viktor Lofgren
07d7507ac6
(control-service) Move Actions up in storage-details
...
Papercut fix. If a file storage area has a lot of files, you have to scroll down a long way to get to the actions otherwise.
2023-09-02 15:41:55 +02:00
Viktor Lofgren
c68d17d482
(keyword-extraction) Fix bug leading to position data missing on some keywords.
...
This was due to a discrepancy between the KeywordPositionBitmask and WordsTfIdfCounts' concept of a keyword.
2023-09-02 14:48:55 +02:00
Viktor Lofgren
9e185e80ce
(control-service) Add timestamp to file storages.
2023-09-02 14:01:04 +02:00
Viktor Lofgren
676e7c7947
(keywords) Add Serializable properties that went missing as the record became a class
2023-09-02 09:52:01 +02:00
Viktor Lofgren
04212b2cef
(btree) Add more consistent asserts on sortedness
2023-09-01 15:45:02 +02:00
Viktor Lofgren
bafc2a1f30
(reverse-index) Force() final docs after being written
...
Unlikely to be a problem, but we want to ensure it's on dsik before we go read it later.
2023-09-01 15:43:53 +02:00
Viktor Lofgren
563e388a45
(reverse-index) Fix parallel documents sorting bug
...
Bug was caused by parallel sorting capturing the iterator rather than the offsets to sort.
2023-09-01 15:42:45 +02:00
Viktor Lofgren
d31d8ec5b0
(index) Log keyword ids on hex format
2023-09-01 15:40:24 +02:00
Viktor Lofgren
2b00cd632d
(process) Propagate environment JVM params to the index constructor
2023-09-01 15:39:42 +02:00
Viktor Lofgren
5f427d2b4c
(keywords) Clean up leaky abstractions, clean up tests
2023-09-01 13:52:00 +02:00
Viktor Lofgren
8c0ce4fc1d
(index journal; minor) Clean up
2023-09-01 11:32:24 +02:00
Viktor Lofgren
10a74f45ea
(index journal; minor) Even cleaner separation of concerns.
2023-09-01 11:28:02 +02:00
Viktor Lofgren
320dad7f1a
(index journal) Fix leaky abstraction in IndexJournalReader.
...
The caller shouldn't be required to know the on-disk layout of the file to make use of the data in a performant way.
2023-09-01 11:18:13 +02:00
Viktor Lofgren
88ac72c8eb
(journal/reverse index) Working WIP fix over-allocation of documents
2023-08-31 20:16:02 +02:00
Viktor Lofgren
f74b9df0a7
(array) Don't use paging arrays when mapping small files for writing
2023-08-31 20:15:10 +02:00
Viktor Lofgren
a6f1335375
(loader) Fix bugfix where the loader would omit some meta and words.
2023-08-31 17:48:43 +02:00
Viktor Lofgren
f321fa5ad3
(array) Override to Paging...Array$range()
...
This is a big performance boost in array.range().get().
Without an override, each access will go through pages[page].get(...) for each get()-operation. This adds up very quickly. BTreeReader does a bunch of get():s on a range()'d array during traversal in the queryData... methods.
2023-08-31 13:52:29 +02:00
Viktor Lofgren
03d999444d
(ldb) Re-add accidentally removed stmt.addBatch that breaks
2023-08-31 12:06:30 +02:00
Viktor Lofgren
763ed260c3
(ldb) Better handling of null pubYear
2023-08-30 23:08:27 +02:00
Viktor Lofgren
764e7d1315
(index) Add more comprehensive integration tests for the index service.
2023-08-30 10:37:24 +02:00
Viktor Lofgren
048f685073
(ldb) add OR IGNORE to insert status query
...
Otherwise it will sometimes fail because documents may appear more than once in error scenarios.
2023-08-30 10:34:01 +02:00
Viktor Lofgren
e4d7958379
(control) ProcessLivenessMonitorActor shouldn't reap tasks based on service instance liveness
2023-08-29 18:19:04 +02:00
Viktor Lofgren
3f288e264b
(minor) Clean up dead endpoints
2023-08-29 17:04:54 +02:00
Viktor Lofgren
dd593c292c
(loader) Minor optimizations and bugfixes.
...
* Reduce memory churn in LoaderIndexJournalWriter, fix bug with keyword mappings as well
* Remove remains of OldDomains
* Ensure LOADER_PROCESS_OPTS gets fed to the processes
* LinkdbStatusWriter won't execute batch after each added item post 100 items
2023-08-29 15:37:52 +02:00
Viktor Lofgren
39c1857c61
(heartbeat, reverse-index) Better heartbeat mocking, improved heartbeats for reverse index construction.
2023-08-29 13:07:55 +02:00
Viktor Lofgren
c57a2d0dc3
(control-service) Remove old index journal files when restoring a backup.
2023-08-29 11:58:01 +02:00
Viktor Lofgren
a2e6616100
(index-reverse) Add documentation and clean up code.
2023-08-29 11:35:54 +02:00
Viktor Lofgren
ba4513e82c
(loader) Revert accidental experimental changes that slipped by in an earlier commit
2023-08-28 19:54:56 +02:00
Viktor Lofgren
6525b16e1f
(minor) Improved logging and error messages
2023-08-28 19:53:55 +02:00
Viktor Lofgren
b6a92506d1
(index) Hook in missing DocIdRewriter
...
This enables documents to be ranked properly.
2023-08-28 19:53:43 +02:00
Viktor Lofgren
ffa0366deb
(minor) Fix typo in ActorStateMachine's logging
2023-08-28 16:11:52 +02:00
Viktor Lofgren
00c4686ef0
(reverse-index) Fix over-allocation of the count array in merging
2023-08-28 14:36:28 +02:00
Viktor Lofgren
3101b74580
(index) Move to a lexicon-free index design
...
This is a system-wide change. The index used to have a lexicon, mapping words to wordIds using a large in-memory hash table. This made index-construction easier, but it
also added a fairly significant RAM penalty to both the index service and the loader.
The new design moves to 64 bit word identifiers calculated using the murmur hash of the keyword, and an index construction based on merging smaller indices.
It also became necessary half-way through to upgrade guice as its error reporting wasn't *quite* compatible with JDK20.
2023-08-28 14:02:23 +02:00
Viktor Lofgren
194a6057dd
(index,control) Recoverable index backups
2023-08-25 14:57:43 +02:00
Viktor Lofgren
e710e057e2
(db) Remove EC_URL and EC_PAGE_DATA from mariadb database
2023-08-25 13:45:03 +02:00
Viktor Lofgren
28188a6e59
(control) Simplify ConvertAndLoadActor
2023-08-25 13:30:20 +02:00
Viktor Lofgren
70a5df96c8
(control) Display progress of process tasks
2023-08-25 13:05:21 +02:00
Viktor Lofgren
460998d512
(index) Move index construction to separate process.
...
This provides a much cleaner separation of concerns, and makes it possible to get rid of a lot of the gunkier parts of the index service. It will also permit lowering the Xmx on the index service a fair bit, so we can get CompressedOOps again :D
2023-08-25 12:52:54 +02:00
Viktor Lofgren
e741301417
(search) Remove endpoint flush-search-caches
...
It's not necessary anymore with the new linkdb.
2023-08-25 09:51:06 +02:00
Viktor Lofgren
5ed5298409
(converter) Update confusing state description
...
SWAP_LEXICON doesn't instruct the index service to do anything. It just moves the file.
2023-08-24 18:56:49 +02:00
Viktor Lofgren
b911665691
(index) Clean up and optimize valuator
2023-08-24 18:34:06 +02:00
Viktor Lofgren
56eb83319d
(index) Clean up result domain deduplicator
2023-08-24 18:24:55 +02:00
Viktor Lofgren
1e6800565a
(system) Remove EdgeId<T> and similar objects
...
They seemed like a good idea at the time, but in practice they're wasting resources and not really providing the clarity I had hoped.
2023-08-24 17:46:02 +02:00
Viktor Lofgren
c909120ae1
(search) Basic working integration of linkdb in search service
2023-08-24 17:24:56 +02:00
Viktor Lofgren
9894f37412
(index) Implement new URL ID coding scheme.
...
Also refactor along the way. Really needs an additional pass, these tests are very hairy.
2023-08-24 16:44:27 +02:00
Viktor Lofgren
6a04cdfddf
(loader) Implement new linkdb in loader
...
Deprecate the LoadUrl instruction entirely. We no longer need to be told upfront about which URLs to expect, as IDs are generated from the domain id and document ordinal.
For now, we no longer store new URLs in different domains. We need to re-implement this somehow, probably in a different job or a as a different output.
2023-08-24 13:07:54 +02:00
Viktor Lofgren
c70670bacb
(common) New UrlIdCodec class
...
Have a single class responsible for encoding and decoding URL ids, as it's a bit finicky and used all over.
2023-08-24 11:41:07 +02:00
Viktor Lofgren
7bb3e44a76
(common) Deprecate EdgeId and similar
2023-08-24 11:16:28 +02:00
Viktor Lofgren
b958acb76a
(file-storage) New File Storage type for linkdb
2023-08-24 09:06:13 +02:00
Viktor Lofgren
b22f4fbb72
(linkdb) New Module for sqlite-backed document db
2023-08-24 09:06:13 +02:00
Viktor Lofgren
ebc84c22fb
Upgrade antique lombok plugin
...
This permits tests to run on JDK20 environments.
2023-08-23 14:34:32 +00:00
Viktor Lofgren
aa0d256d6a
Upgrade code to Java 20.
...
* Change language version
* Upgrade Lombok to a JDK20 compatible version
2023-08-23 13:37:49 +00:00
Viktor Lofgren
4d75fa2908
Upgrade gradle and docker plugin to support native JDK20 environments
2023-08-23 13:30:55 +00:00
Viktor Lofgren
1a05cba60a
(keyword lexicon) Use three hash tables to increase the possible number of keywords to 2^31 from 0.75 x 2^30.
2023-08-23 11:25:20 +02:00
Viktor Lofgren
bf92c270dc
(language) Rollback language filter change a bit.
...
It appears to lead to too much junk in the lexicon.
2023-08-23 10:16:57 +02:00
Viktor Lofgren
e507844616
(language) Rollback language filter change a bit.
...
It appears to lead to too much junk in the lexicon.
2023-08-23 10:03:25 +02:00
Viktor Lofgren
ca12dd59f7
(loader) Fix Cleaner resource leak
...
Apparently Cleaners have an associated native thread, so the way to use them is to have a single static cleaner.
2023-08-22 18:05:00 +02:00
Viktor Lofgren
6f222b9800
(search) Add refresh link to explore mode.
...
This is a QOL improvement for mobile users, who otherwise would have to scroll all the way up to refresh.
Also removed the confusing "this is a random set of domains"-message when viewing adjacent websites, as it's not random.
2023-08-22 12:43:44 +02:00
Viktor Lofgren
fca62f261e
(mq) Down-tune polling intervals in MQ
...
Polling 10 times a second across dozens of queues is a bit too aggressive and wasteful.
2023-08-22 11:49:30 +02:00
Viktor Lofgren
c7f0276005
(control) Don't spin on process output printing
...
This is the "correct" way of copying stdout and stderr to the curren't process' output.
2023-08-22 11:48:54 +02:00
Viktor Lofgren
46409c4c2d
(loader) Use the correct interface for InstructionCounter
2023-08-22 11:11:36 +02:00
Viktor Lofgren
46df58d28b
(control-service) Use default value for WMSA_HOME if it is not set
2023-08-22 11:11:01 +02:00
Viktor Lofgren
15912f31d0
(control-service) Basic GUI for deleting bad links from exploration mode
2023-08-21 18:35:26 +02:00
Viktor Lofgren
93f49f1fb3
(search-service) RSS feed for the news feed
2023-08-20 12:58:34 +02:00
Viktor Lofgren
704de50a9b
(forward-index, valuator) HTML features in valuator
...
Put it in the forward index for easy access during index-side valuation.
2023-08-18 11:54:56 +02:00
Viktor Lofgren
fcfe07fb7d
(valuator) Clean up code
2023-08-18 11:26:56 +02:00
Viktor Lofgren
ccf4990add
(minor) Clean up code
2023-08-18 11:26:39 +02:00
Viktor Lofgren
f2638dd845
(feature-extractor) More adtech nonsense
2023-08-18 11:26:19 +02:00
Viktor Lofgren
239980ecae
(minor) Improve comment
2023-08-18 11:26:05 +02:00
Viktor Lofgren
6cb784df75
(minor) Improve comment
2023-08-18 11:25:36 +02:00
Viktor Lofgren
efee904531
(search) Use the adtech bit instead of ads for ads flag
2023-08-18 11:24:59 +02:00
Viktor Lofgren
bee815b1c4
(converter) Add monsterinsights as an adtech tracker
2023-08-17 17:44:11 +02:00
Viktor Lofgren
e296b02649
(converter) Optimize LSH based within-domain deduplication
2023-08-17 17:43:46 +02:00
Viktor Lofgren
c019a029ec
(flags) Documentation and preventative bugfix
2023-08-17 17:42:31 +02:00
Viktor Lofgren
db0216936e
(summary) Reduce the chance of expensive operations
2023-08-16 15:48:34 +02:00
Viktor Lofgren
46d761f34f
(language) fasttext based language filter
2023-08-16 15:48:12 +02:00
Viktor Lofgren
4598c7f40f
(valuation) Penalize wordpress style kebab case urls
2023-08-16 13:11:24 +02:00
Viktor Lofgren
1d486bddee
(crawler) Reduce log spam
2023-08-16 11:12:09 +02:00
Viktor Lofgren
606db54dc8
(docs) Fix dead links to message-queue after moving it to libraries
2023-08-15 19:26:40 +02:00
Viktor Lofgren
d8073f0dde
(feature-extractor) Add mail.ru counter to non-adtech trackers
2023-08-15 19:10:43 +02:00
Viktor Lofgren
df85468c01
(control) Action for refreshing the blogs definition.
2023-08-15 11:38:52 +02:00
Viktor Lofgren
4404ad98ae
(mq) Fix missing @Inject that broke everything in control-service
2023-08-15 11:22:12 +02:00
Viktor Lofgren
e7192a9cad
(mq) Refactor mq and actor library and move it to libraries out of common
2023-08-15 10:53:23 +02:00
Viktor Lofgren
019b61b330
(control) Remove message queue listing from actors view.
2023-08-13 13:50:04 +02:00
Viktor Lofgren
f997707049
(control) Move event log out of plumbing
2023-08-13 13:40:50 +02:00
Viktor Lofgren
c56ee10185
(control) Separate [Process] and [Process and Load] actions for crawl data; all SLOW data is deletable.
2023-08-13 13:39:59 +02:00
Viktor Lofgren
8210e49b4e
(control) Helpful tooltips for the Actor table.
2023-08-13 12:55:56 +02:00
Viktor Lofgren
a8f2e9ee2c
(control) Tidy up empty tables, remove actors from index view
2023-08-12 15:18:14 +02:00
Viktor Lofgren
a91b909103
(control) Event log on stop actor
2023-08-12 15:02:53 +02:00
Viktor Lofgren
d6b8b38955
(db) Add indices on SERVICE_EVENTLOG
2023-08-12 15:00:15 +02:00
Viktor Lofgren
99e031c529
(control) Remove broken pagination from events and message queue; new "light" events table for some views
2023-08-12 14:57:55 +02:00
Viktor Lofgren
998f239ed9
(control) Filterable event log view
2023-08-12 14:43:11 +02:00
Viktor Lofgren
0961f627b1
(control) Pretty up the nav bar
2023-08-12 14:42:42 +02:00
Viktor Lofgren
6483308bb0
(sql) Update default value for DOMAIN_SELECTION_TYPE
2023-08-11 14:01:15 +02:00
Viktor Lofgren
7440da240d
(blacklist) Fix broken SQL migration
2023-08-11 13:33:35 +02:00
Viktor Lofgren
4f8048be31
(blacklist) Blacklist management
2023-08-10 15:40:07 +02:00
Viktor Lofgren
807fb2d052
(service) Task heartbeat creates event log entries
2023-08-09 15:15:16 +02:00
Viktor Lofgren
ce293029c7
(converter) Treat adtech tracking as advertisement.
2023-08-09 14:23:53 +02:00
Viktor Lofgren
b5ed21be21
(mq) MqPersistence no longer relies on autoCommit being enabled
2023-08-09 14:23:22 +02:00
Viktor Lofgren
251fc63b42
(*) Fix merge gore
2023-08-09 13:33:28 +02:00
Viktor Lofgren
47f3855a4b
(control) More informative readme.md
2023-08-09 12:42:23 +02:00
Viktor Lofgren
71dfe9f33e
(control) Clean up the ControlService, move mq-related endpoints to MessageQueueService.
2023-08-09 12:42:01 +02:00
Viktor Lofgren
afad4f5ebb
(*) last touches
2023-08-07 12:59:33 +02:00
Viktor Lofgren
4ab1cd9502
(*) last touches
2023-08-07 12:57:44 +02:00
Viktor
52e2ab45bf
Merge branch 'master' into master-control-program
2023-08-07 12:53:43 +02:00
Viktor Lofgren
be444f9172
(control) New actions view, re-arrange navigation menu
2023-08-05 14:45:04 +02:00
Viktor Lofgren
715d61dfea
(mq) Fix bug in notice handling where they were registered on the wrong name
2023-08-05 14:45:04 +02:00
Viktor Lofgren
bf37a3eb25
(search-service) Make flushCaches endpoint a notice and not a request
2023-08-05 14:45:04 +02:00
Viktor Lofgren
c2b45bec8d
(mq) Rename notify to sendNotice to avoid name clash with the java object function
2023-08-05 14:45:04 +02:00
Viktor Lofgren
cdfe284f9a
(file storage) File Storage Type for EXPORT data
...
(file storage) File Storage Type for EXPORT data
2023-08-05 14:45:03 +02:00
Viktor Lofgren
08eed17e66
(api-service) Mq endpoint for flushing caches
2023-08-05 14:42:16 +02:00
Viktor Lofgren
00eb8b90dc
(control) Message Queue GUI
2023-08-04 22:05:29 +02:00
Viktor Lofgren
912129311d
(control) Message Queue GUI
2023-08-04 17:54:18 +02:00
Viktor Lofgren
624b78ec3a
(heartbeat) Task heartbeats
2023-08-04 14:40:06 +02:00
Viktor Lofgren
1d0cea1d55
(converter) GUI for dealing with user complaints
2023-08-03 17:59:57 +02:00
Viktor Lofgren
f01f608474
(blacklist) Support blacklists with subdomain
2023-08-03 17:58:52 +02:00
Viktor Lofgren
c22feaf42e
(crawl) Make crawler limiter request a GC when throttling
2023-08-03 17:58:18 +02:00
Viktor Lofgren
63e857f7cd
(control) Add basic api key management
2023-08-02 20:14:03 +02:00
Viktor Lofgren
9979c9defe
(search/index) Add blogosphere filter
2023-08-02 20:13:30 +02:00
Viktor Lofgren
7763df0715
(docs) Add control-service to the main readme.md
2023-08-01 22:52:41 +02:00
Viktor Lofgren
8de3e6ab80
(control) Fix bug where CrawlActor and RecrawlActor would steal each others' mail
2023-08-01 22:33:30 +02:00
Viktor Lofgren
659d2134ba
(file-storage) Deprecate mustClean flag
2023-08-01 22:32:30 +02:00
Viktor Lofgren
867410c66b
(file-storage) Automatic file storage discovery via manifest file
2023-08-01 18:05:43 +02:00
Viktor Lofgren
e5c9791b14
(crawler) Fix rare ConcurrentModificationError due to HashSet
2023-08-01 17:28:29 +02:00
Viktor Lofgren
58556af6c7
(db) Use flwyay for database migrations.
2023-08-01 17:08:42 +02:00
Viktor Lofgren
2e29038ecd
(db) Fix broken insert statement, move file storage defaults to a separate file.
2023-08-01 15:50:08 +02:00
Viktor Lofgren
36a23707c1
(control) Control service should be a core service.
2023-08-01 15:49:50 +02:00
Viktor Lofgren
c1ea60b399
(db) Default values for storage base
2023-08-01 15:05:04 +02:00
Viktor Lofgren
b08e302dd5
(lexicon) Optimize lexicon by using Murmur3_128's hash function
2023-08-01 15:02:13 +02:00
Viktor Lofgren
ea66195b97
(loader) Optimize loader by using zstd's direct streaming writer and the Murmur3_128 string hash
2023-08-01 15:02:13 +02:00
Viktor Lofgren
8f0cbf267b
(loader) Perform instruction reads in a separate thread for extra vroom vroom
2023-07-31 14:24:08 +02:00
Viktor Lofgren
2f8488610a
(loader) Fix bug where trailing deferred domain meta inserts weren't executed
2023-07-31 14:23:23 +02:00
Viktor Lofgren
d95f01b701
(control) Reduce log spam in control svc
2023-07-31 14:21:06 +02:00
Viktor Lofgren
c9d7635370
(control) Aborting an actor that waits on a process request terminates the running job.
...
(control) Aborting an actor that waits on a process request terminates the running job.
2023-07-31 14:21:06 +02:00
Viktor Lofgren
6b5fb0f841
(control) Disable the start button for actors that aren't directly initializable.
...
(control) Disable the start button for actors that aren't directly initializable.
2023-07-31 14:21:00 +02:00
Viktor Lofgren
12bd74d4f3
Clean up ProcessService
2023-07-31 10:56:16 +02:00
Viktor Lofgren
37c4cc68ed
TODO
2023-07-31 10:34:42 +02:00
Viktor Lofgren
1c948eb3d8
(minor) Alter DumbThreadPool in Converter to not claim the threads are crawlers.
2023-07-31 10:33:15 +02:00
Viktor Lofgren
cd90ca820f
YAGNI filter over ConverterDomainTypes
2023-07-31 10:32:47 +02:00
Viktor Lofgren
6f4e767a04
(minor) Re-enable monkey-patch-json for converter
2023-07-31 10:31:46 +02:00
Viktor Lofgren
5411950b87
(minor) Tidy up EdgeDomain class a bit, no functional difference
2023-07-31 10:31:29 +02:00
Viktor Lofgren
6ff7e9648f
(crawler) Use and pass the proper environment variables to the processes.
2023-07-30 16:54:02 +02:00
Viktor Lofgren
5c071ce4d3
(crawler) Clean up the code and remove unnecessary logging
2023-07-30 16:53:39 +02:00
Viktor Lofgren
caf3d231a8
(crawler) Fix rare issue with NPEs if the crawl queue is empty
2023-07-30 16:53:13 +02:00
Viktor Lofgren
730e8f74e4
(crawler) Even more memory optimizations.
...
* Fix minor resource leak in zstd streams
* Use pools for zstd streams
* Reduce the SSL session cache size
2023-07-30 14:19:55 +02:00
Viktor Lofgren
aba134284f
(crawler) Reduce log spam
2023-07-29 19:22:58 +02:00
Viktor Lofgren
2a6183f9e0
(crawler) Dynamic throttling of the number of active crawl jobs permitted to spawn; reduce queue size.
2023-07-29 19:20:09 +02:00
Viktor Lofgren
ee143bbc48
(crawler, converter) Fix so that DumbThreadPool actually waits for termination as intended.
2023-07-29 19:19:09 +02:00
Viktor Lofgren
d3f01bd171
(crawler, converter) Remove monkey patched gson from dependencies
2023-07-29 19:18:12 +02:00
Viktor Lofgren
05ba3bab96
(crawler) Make SitemapRetriever abort on too large sitemaps.
2023-07-29 19:18:12 +02:00
Viktor Lofgren
d2b6b2044c
(crawler) Reduce log spam in HttpFetcherImpl
2023-07-29 19:18:12 +02:00
Viktor Lofgren
7611b7900d
(crawler) Reduce long term memory allocation in DomainCrawlFrontier
...
(crawler) Reduce long term memory allocation in DomainCrawlFrontier
2023-07-29 19:18:12 +02:00
Viktor Lofgren
9ad32ee9c7
(control) Be more clear about when a process exits and why.
2023-07-29 19:16:00 +02:00
Viktor Lofgren
866db6c63f
(control) Dialog for updating message state; clean up file view.
2023-07-28 22:02:05 +02:00
Viktor Lofgren
01476577b8
(loader) Speed up loading back to original speeds with a cascading DELETE FROM EC_URL rather than EC_PAGE_DATA.
...
* Also clean up code and have proper rollbacks for transactions.
2023-07-28 22:00:07 +02:00
Viktor Lofgren
e237df4a10
(converter) Use a dumb thread pool instead of Java's executor service.
2023-07-28 18:15:16 +02:00
Viktor Lofgren
f11103d31d
(WIP) Make it possible to sideload encyclopedia data.
...
This is mostly a pilot track for sideloading other large websites.
Also change coverter to produce a more compact output (java serialization instead of json).
2023-07-28 18:14:43 +02:00
Viktor Lofgren
9288d311d4
Add buffering to index journal writer
2023-07-28 18:11:19 +02:00
Viktor Lofgren
77d5e39fe0
Make processed data Serializable
2023-07-28 18:11:19 +02:00
Viktor Lofgren
27e781761d
(mq single shot inbox) Flag messages as OK if there is no recipient
2023-07-28 12:04:23 +02:00
Viktor Lofgren
92cac52813
(mq) Add indexes to MESSAGE_QUEUE
2023-07-28 12:03:51 +02:00
Viktor Lofgren
66bb12e55a
(converter) File listing and download for file storage
2023-07-26 21:59:35 +02:00
Viktor Lofgren
a5d980ee56
(converter) Hook crawl job extractor and adjacencies calculator into control service.
2023-07-26 15:46:22 +02:00
Viktor Lofgren
19c2ceec9b
(converter) Use Marginalia Yellow for control service
2023-07-26 11:50:23 +02:00
Viktor Lofgren
507f26ad47
(converter) Refactor converter to not keep instructions list in RAM.
...
(converter) Refactor converter to not keep instructions list in RAM.
(converter) Refactor converter to not keep instructions list in RAM.
2023-07-25 22:06:46 +02:00
Viktor Lofgren
fd44e09ebd
(loader) Don't delete the entire link database when the loader runs
2023-07-24 18:37:35 +02:00
Viktor Lofgren
09fd0a1d0e
(converter) Automatically clean stale file storage records if they disappear on disk
2023-07-24 17:04:42 +02:00
Viktor Lofgren
667b0ca0b0
(converter, WIP) Refactor CrawledDomainReader to not return iterators.
...
Instead return a closable class SerializableCrawlDataStream.
2023-07-24 16:28:30 +02:00
Viktor Lofgren
a56953c798
(converter, WIP) Refactor converter to not have to load everything into RAM.
2023-07-24 15:25:09 +02:00
Viktor Lofgren
7470c170b1
(minor) EdgeUrl.parse() should deal with null
2023-07-24 15:06:57 +02:00
Viktor Lofgren
bc330acfc9
(control) Better refresh script that doesn't cause weird artifacts
2023-07-23 19:26:16 +02:00
Viktor Lofgren
789e8eea85
(crawler) Clean up and refactor the code a bit
2023-07-23 19:08:38 +02:00
Viktor Lofgren
35b29e4f9e
(crawler) Clean up and refactor the code a bit
2023-07-23 19:06:37 +02:00
Viktor Lofgren
69f333c0bf
(crawler) Clean up and refactor the code a bit
2023-07-23 18:59:14 +02:00
Viktor Lofgren
c069c8c182
(crawler) Clean up crawl data reference and recrawl logic
2023-07-22 18:42:21 +02:00