Viktor Lofgren
70aa04c047
(converter, stackexchange-xml) Add the ability to sideload stackexchange data
2023-09-21 12:48:33 +02:00
Viktor Lofgren
4aa47e87f2
(blocking-thread-pool) Add isTerminated convenience function
2023-09-21 12:47:41 +02:00
Viktor Lofgren
f8050816ac
(search) Don't run LSH deduplication on details with zero lsh to support not calculating this hash.
2023-09-21 12:47:02 +02:00
Viktor Lofgren
5b0a6d7ec1
(stackexchange-converter) Create tool for converting stackexchange 7z-files to digestible sqlite db:s
2023-09-20 15:15:13 +02:00
Viktor Lofgren
3b4d08f52b
(stackexchange-integration) Add better comments
2023-09-20 14:43:06 +02:00
Viktor Lofgren
6bbf40d7d2
(stackexchange-integration) Tools for reading stackexchange xml files
2023-09-20 14:17:33 +02:00
Viktor Lofgren
d895f83520
(blocking-thread-pool) Move DumbThreadPool to its own micro-library
...
Also rename it to SimpleBlockingThreadPool.
2023-09-20 10:11:49 +02:00
Viktor Lofgren
f6b9e8c5eb
(converter) JavadocSpecialization should truncate its summary if it gets too long
2023-09-17 16:25:33 +02:00
Viktor Lofgren
98bcdf6028
(converter) DirtreeSideloader now trims /index.html from the URL if present
...
This is a crawler artifact in 9 cases out of 10, and may lead to bad URLs.
2023-09-17 16:08:16 +02:00
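Note on the commit above: a minimal sketch of the trimming it describes, assuming the suffix is dropped but the trailing slash is kept. The class and method names are hypothetical and are not taken from DirtreeSideloader itself.

```java
final class IndexHtmlTrimSketch {
    private static final String INDEX_FILE = "index.html";

    /**
     * Returns the URL without a trailing "index.html" when the URL ends in
     * "/index.html"; any other URL is returned unchanged.
     * Keeping the trailing slash is an assumption of this sketch.
     */
    static String trimIndexHtml(String url) {
        if (url.endsWith("/" + INDEX_FILE)) {
            return url.substring(0, url.length() - INDEX_FILE.length());
        }
        return url;
    }
}
```

A URL that merely contains index.html somewhere in the middle of its path is left alone, since only the suffix is checked.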
Viktor Lofgren
9b385ec7cc
(converter) Make it possible to sideload documents from a directory tree
2023-09-17 14:35:06 +02:00
Viktor Lofgren
5c040f7a46
(crawl-spec) Parquetify crawl spec
...
* Crawl-specs are now parquet files
* Deprecate the crawl-job-extractor tool
2023-09-17 09:41:34 +02:00
Viktor Lofgren
c67d95c00f
(converter) Write dummy processor log when sideloading
2023-09-14 14:13:03 +02:00
Viktor Lofgren
5e5aaf9a7e
(converter, control) Re-enable sideloading encyclopedia data
2023-09-14 12:12:07 +02:00
Viktor Lofgren
35996d0adb
(docs) Update the documentation with up-to-date information
2023-09-14 11:33:36 +02:00
Viktor Lofgren
eaeb23d41e
(refactor) Remove converting-model package completely
2023-09-14 11:21:44 +02:00
Viktor Lofgren
c71f6ad417
(converter) Add heartbeats to the loader processes and execute the tasks in parallel for a ~2X speedup
2023-09-14 10:11:57 +02:00
Viktor Lofgren
87a8593291
(work-log) Fix bug where items weren't added to the current batch on logItem
2023-09-14 10:11:04 +02:00
Viktor Lofgren
4799dd769e
(converting) WIP begin to remove converting-model and the old InstructionsCompiler
2023-09-13 19:18:58 +02:00
Viktor Lofgren
24b4606f96
(converter,loader) Converter outputs parquet files instead of compressed json.
2023-09-13 16:13:41 +02:00
Viktor Lofgren
064bc5ee76
(processed-data) New parquet-serializable models for converter output
2023-09-11 14:08:40 +02:00
Viktor Lofgren
a52d78c8ee
(work-log) New batching work log
2023-09-11 14:08:08 +02:00
Viktor Lofgren
07d7507ac6
(control-service) Move Actions up in storage-details
...
Papercut fix. If a file storage area has a lot of files, you have to scroll down a long way to get to the actions otherwise.
2023-09-02 15:41:55 +02:00
Viktor Lofgren
c68d17d482
(keyword-extraction) Fix bug leading to position data missing on some keywords.
...
This was due to a discrepancy between the KeywordPositionBitmask and WordsTfIdfCounts' concept of a keyword.
2023-09-02 14:48:55 +02:00
Viktor Lofgren
9e185e80ce
(control-service) Add timestamp to file storages.
2023-09-02 14:01:04 +02:00
Viktor Lofgren
676e7c7947
(keywords) Add Serializable properties that went missing as the record became a class
2023-09-02 09:52:01 +02:00
Viktor Lofgren
04212b2cef
(btree) Add more consistent asserts on sortedness
2023-09-01 15:45:02 +02:00
Viktor Lofgren
bafc2a1f30
(reverse-index) Force() final docs after being written
...
Unlikely to be a problem, but we want to ensure it's on disk before we go read it later.
2023-09-01 15:43:53 +02:00
Viktor Lofgren
563e388a45
(reverse-index) Fix parallel documents sorting bug
...
Bug was caused by parallel sorting capturing the iterator rather than the offsets to sort.
2023-09-01 15:42:45 +02:00
Viktor Lofgren
d31d8ec5b0
(index) Log keyword ids in hex format
2023-09-01 15:40:24 +02:00
Viktor Lofgren
2b00cd632d
(process) Propagate environment JVM params to the index constructor
2023-09-01 15:39:42 +02:00
Viktor Lofgren
5f427d2b4c
(keywords) Clean up leaky abstractions, clean up tests
2023-09-01 13:52:00 +02:00
Viktor Lofgren
8c0ce4fc1d
(index journal; minor) Clean up
2023-09-01 11:32:24 +02:00
Viktor Lofgren
10a74f45ea
(index journal; minor) Even cleaner separation of concerns.
2023-09-01 11:28:02 +02:00
Viktor Lofgren
320dad7f1a
(index journal) Fix leaky abstraction in IndexJournalReader.
...
The caller shouldn't be required to know the on-disk layout of the file to make use of the data in a performant way.
2023-09-01 11:18:13 +02:00
Viktor Lofgren
88ac72c8eb
(journal/reverse index) Working WIP fix over-allocation of documents
2023-08-31 20:16:02 +02:00
Viktor Lofgren
f74b9df0a7
(array) Don't use paging arrays when mapping small files for writing
2023-08-31 20:15:10 +02:00
Viktor Lofgren
a6f1335375
(loader) Fix bug where the loader would omit some meta and words.
2023-08-31 17:48:43 +02:00
Viktor Lofgren
f321fa5ad3
(array) Override to Paging...Array$range()
...
This is a big performance boost in array.range().get().
Without an override, each access will go through pages[page].get(...) for each get()-operation. This adds up very quickly. BTreeReader does a bunch of get():s on a range()'d array during traversal in the queryData... methods.
2023-08-31 13:52:29 +02:00
Viktor Lofgren
03d999444d
(ldb) Re-add accidentally removed stmt.addBatch that breaks
2023-08-31 12:06:30 +02:00
Viktor Lofgren
763ed260c3
(ldb) Better handling of null pubYear
2023-08-30 23:08:27 +02:00
Viktor Lofgren
764e7d1315
(index) Add more comprehensive integration tests for the index service.
2023-08-30 10:37:24 +02:00
Viktor Lofgren
048f685073
(ldb) add OR IGNORE to insert status query
...
Otherwise it will sometimes fail because documents may appear more than once in error scenarios.
2023-08-30 10:34:01 +02:00
Viktor Lofgren
e4d7958379
(control) ProcessLivenessMonitorActor shouldn't reap tasks based on service instance liveness
2023-08-29 18:19:04 +02:00
Viktor Lofgren
3f288e264b
(minor) Clean up dead endpoints
2023-08-29 17:04:54 +02:00
Viktor Lofgren
dd593c292c
(loader) Minor optimizations and bugfixes.
...
* Reduce memory churn in LoaderIndexJournalWriter, fix bug with keyword mappings as well
* Remove remains of OldDomains
* Ensure LOADER_PROCESS_OPTS gets fed to the processes
* LinkdbStatusWriter won't execute batch after each added item post 100 items
2023-08-29 15:37:52 +02:00
Viktor Lofgren
39c1857c61
(heartbeat, reverse-index) Better heartbeat mocking, improved heartbeats for reverse index construction.
2023-08-29 13:07:55 +02:00
Viktor Lofgren
c57a2d0dc3
(control-service) Remove old index journal files when restoring a backup.
2023-08-29 11:58:01 +02:00
Viktor Lofgren
a2e6616100
(index-reverse) Add documentation and clean up code.
2023-08-29 11:35:54 +02:00
Viktor Lofgren
ba4513e82c
(loader) Revert accidental experimental changes that slipped by in an earlier commit
2023-08-28 19:54:56 +02:00
Viktor Lofgren
6525b16e1f
(minor) Improved logging and error messages
2023-08-28 19:53:55 +02:00
Viktor Lofgren
b6a92506d1
(index) Hook in missing DocIdRewriter
...
This enables documents to be ranked properly.
2023-08-28 19:53:43 +02:00
Viktor Lofgren
ffa0366deb
(minor) Fix typo in ActorStateMachine's logging
2023-08-28 16:11:52 +02:00
Viktor Lofgren
00c4686ef0
(reverse-index) Fix over-allocation of the count array in merging
2023-08-28 14:36:28 +02:00
Viktor Lofgren
3101b74580
(index) Move to a lexicon-free index design
...
This is a system-wide change. The index used to have a lexicon, mapping words to wordIds using a large in-memory hash table. This made index-construction easier, but it
also added a fairly significant RAM penalty to both the index service and the loader.
The new design moves to 64 bit word identifiers calculated using the murmur hash of the keyword, and an index construction based on merging smaller indices.
It also became necessary half-way through to upgrade guice as its error reporting wasn't *quite* compatible with JDK20.
2023-08-28 14:02:23 +02:00
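Note on the commit above: a minimal sketch of deriving a 64 bit word identifier directly from the keyword, so that no in-memory lexicon lookup is needed. The commit names the murmur hash; to stay within the standard library, this sketch swaps in 64 bit FNV-1a as a stand-in, and the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

final class WordIdSketch {
    // Standard 64 bit FNV-1a parameters (stand-in for the murmur hash in the commit)
    private static final long FNV_OFFSET_BASIS = 0xcbf29ce484222325L;
    private static final long FNV_PRIME = 0x100000001b3L;

    /** Computes a deterministic 64 bit id from the UTF-8 bytes of the keyword. */
    static long wordId(String keyword) {
        long hash = FNV_OFFSET_BASIS;
        for (byte b : keyword.getBytes(StandardCharsets.UTF_8)) {
            hash ^= (b & 0xff);   // mix in the next byte
            hash *= FNV_PRIME;    // multiply, wrapping modulo 2^64
        }
        return hash;
    }
}
```

Because the id is a pure function of the keyword, separate processes can agree on identifiers without sharing state, at the cost of a small collision probability.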
Viktor Lofgren
194a6057dd
(index,control) Recoverable index backups
2023-08-25 14:57:43 +02:00
Viktor Lofgren
e710e057e2
(db) Remove EC_URL and EC_PAGE_DATA from mariadb database
2023-08-25 13:45:03 +02:00
Viktor Lofgren
28188a6e59
(control) Simplify ConvertAndLoadActor
2023-08-25 13:30:20 +02:00
Viktor Lofgren
70a5df96c8
(control) Display progress of process tasks
2023-08-25 13:05:21 +02:00
Viktor Lofgren
460998d512
(index) Move index construction to separate process.
...
This provides a much cleaner separation of concerns, and makes it possible to get rid of a lot of the gunkier parts of the index service. It will also permit lowering the Xmx on the index service a fair bit, so we can get CompressedOOps again :D
2023-08-25 12:52:54 +02:00
Viktor Lofgren
e741301417
(search) Remove endpoint flush-search-caches
...
It's not necessary anymore with the new linkdb.
2023-08-25 09:51:06 +02:00
Viktor Lofgren
5ed5298409
(converter) Update confusing state description
...
SWAP_LEXICON doesn't instruct the index service to do anything. It just moves the file.
2023-08-24 18:56:49 +02:00
Viktor Lofgren
b911665691
(index) Clean up and optimize valuator
2023-08-24 18:34:06 +02:00
Viktor Lofgren
56eb83319d
(index) Clean up result domain deduplicator
2023-08-24 18:24:55 +02:00
Viktor Lofgren
1e6800565a
(system) Remove EdgeId<T> and similar objects
...
They seemed like a good idea at the time, but in practice they're wasting resources and not really providing the clarity I had hoped.
2023-08-24 17:46:02 +02:00
Viktor Lofgren
c909120ae1
(search) Basic working integration of linkdb in search service
2023-08-24 17:24:56 +02:00
Viktor Lofgren
9894f37412
(index) Implement new URL ID coding scheme.
...
Also refactor along the way. Really needs an additional pass, these tests are very hairy.
2023-08-24 16:44:27 +02:00
Viktor Lofgren
6a04cdfddf
(loader) Implement new linkdb in loader
...
Deprecate the LoadUrl instruction entirely. We no longer need to be told upfront about which URLs to expect, as IDs are generated from the domain id and document ordinal.
For now, we no longer store new URLs in different domains. We need to re-implement this somehow, probably in a different job or as a different output.
2023-08-24 13:07:54 +02:00
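Note on the commit above: a minimal sketch of generating a document id from a domain id and a document ordinal by bit packing. The 26 bit ordinal width and all names here are assumptions made for illustration; they are not taken from the actual UrlIdCodec.

```java
final class UrlIdSketch {
    // Assumed layout: the low 26 bits hold the document ordinal, the bits above hold the domain id
    private static final int ORDINAL_BITS = 26;
    private static final long ORDINAL_MASK = (1L << ORDINAL_BITS) - 1;

    /** Packs a non-negative domain id and document ordinal into a single long. */
    static long encode(int domainId, int documentOrdinal) {
        if (domainId < 0) {
            throw new IllegalArgumentException("domainId must be non-negative");
        }
        if (documentOrdinal < 0 || documentOrdinal > ORDINAL_MASK) {
            throw new IllegalArgumentException("documentOrdinal out of range");
        }
        return ((long) domainId << ORDINAL_BITS) | documentOrdinal;
    }

    /** Extracts the domain id from an encoded id. */
    static int domainId(long id) {
        return (int) (id >>> ORDINAL_BITS);
    }

    /** Extracts the document ordinal from an encoded id. */
    static int documentOrdinal(long id) {
        return (int) (id & ORDINAL_MASK);
    }
}
```

With a scheme like this, the id of a document follows from where it sits within its domain, so nothing needs to be told upfront which URLs to expect.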
Viktor Lofgren
c70670bacb
(common) New UrlIdCodec class
...
Have a single class responsible for encoding and decoding URL ids, as it's a bit finicky and used all over.
2023-08-24 11:41:07 +02:00
Viktor Lofgren
7bb3e44a76
(common) Deprecate EdgeId and similar
2023-08-24 11:16:28 +02:00
Viktor Lofgren
b958acb76a
(file-storage) New File Storage type for linkdb
2023-08-24 09:06:13 +02:00
Viktor Lofgren
b22f4fbb72
(linkdb) New Module for sqlite-backed document db
2023-08-24 09:06:13 +02:00
Viktor Lofgren
ebc84c22fb
Upgrade antique lombok plugin
...
This permits tests to run on JDK20 environments.
2023-08-23 14:34:32 +00:00
Viktor Lofgren
aa0d256d6a
Upgrade code to Java 20.
...
* Change language version
* Upgrade Lombok to a JDK20 compatible version
2023-08-23 13:37:49 +00:00
Viktor Lofgren
4d75fa2908
Upgrade gradle and docker plugin to support native JDK20 environments
2023-08-23 13:30:55 +00:00
Viktor Lofgren
1a05cba60a
(keyword lexicon) Use three hash tables to increase the possible number of keywords to 2^31 from 0.75 x 2^30.
2023-08-23 11:25:20 +02:00
Viktor Lofgren
bf92c270dc
(language) Rollback language filter change a bit.
...
It appears to lead to too much junk in the lexicon.
2023-08-23 10:16:57 +02:00
Viktor Lofgren
e507844616
(language) Rollback language filter change a bit.
...
It appears to lead to too much junk in the lexicon.
2023-08-23 10:03:25 +02:00
Viktor Lofgren
ca12dd59f7
(loader) Fix Cleaner resource leak
...
Apparently Cleaners have an associated native thread, so the way to use them is to have a single static cleaner.
2023-08-22 18:05:00 +02:00
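Note on the commit above: a minimal sketch of the single static Cleaner pattern, using the standard java.lang.ref.Cleaner API. The NativeBufferSketch class and its State are hypothetical; the point is only that every instance registers with one shared Cleaner rather than creating its own.

```java
import java.lang.ref.Cleaner;

final class NativeBufferSketch implements AutoCloseable {
    // One Cleaner for the whole class, so only one cleaner thread is started
    private static final Cleaner CLEANER = Cleaner.create();

    // The cleanup state must not reference the owning object,
    // otherwise the owner could never become unreachable
    private static final class State implements Runnable {
        volatile boolean released = false;

        @Override
        public void run() {
            released = true; // a real resource would be freed here
        }
    }

    private final State state = new State();
    private final Cleaner.Cleanable cleanable;

    NativeBufferSketch() {
        cleanable = CLEANER.register(this, state);
    }

    boolean isReleased() {
        return state.released;
    }

    @Override
    public void close() {
        cleanable.clean(); // runs the cleanup action at most once
    }
}
```

Calling close() explicitly releases the resource right away; the shared Cleaner only acts as a safety net for instances that are dropped without being closed.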
Viktor Lofgren
6f222b9800
(search) Add refresh link to explore mode.
...
This is a QOL improvement for mobile users, who otherwise would have to scroll all the way up to refresh.
Also removed the confusing "this is a random set of domains"-message when viewing adjacent websites, as it's not random.
2023-08-22 12:43:44 +02:00
Viktor Lofgren
fca62f261e
(mq) Down-tune polling intervals in MQ
...
Polling 10 times a second across dozens of queues is a bit too aggressive and wasteful.
2023-08-22 11:49:30 +02:00
Viktor Lofgren
c7f0276005
(control) Don't spin on process output printing
...
This is the "correct" way of copying stdout and stderr to the current process's output.
2023-08-22 11:48:54 +02:00
Viktor Lofgren
46409c4c2d
(loader) Use the correct interface for InstructionCounter
2023-08-22 11:11:36 +02:00
Viktor Lofgren
46df58d28b
(control-service) Use default value for WMSA_HOME if it is not set
2023-08-22 11:11:01 +02:00
Viktor Lofgren
15912f31d0
(control-service) Basic GUI for deleting bad links from exploration mode
2023-08-21 18:35:26 +02:00
Viktor Lofgren
93f49f1fb3
(search-service) RSS feed for the news feed
2023-08-20 12:58:34 +02:00
Viktor Lofgren
704de50a9b
(forward-index, valuator) HTML features in valuator
...
Put it in the forward index for easy access during index-side valuation.
2023-08-18 11:54:56 +02:00
Viktor Lofgren
fcfe07fb7d
(valuator) Clean up code
2023-08-18 11:26:56 +02:00
Viktor Lofgren
ccf4990add
(minor) Clean up code
2023-08-18 11:26:39 +02:00
Viktor Lofgren
f2638dd845
(feature-extractor) More adtech nonsense
2023-08-18 11:26:19 +02:00
Viktor Lofgren
239980ecae
(minor) Improve comment
2023-08-18 11:26:05 +02:00
Viktor Lofgren
6cb784df75
(minor) Improve comment
2023-08-18 11:25:36 +02:00
Viktor Lofgren
efee904531
(search) Use the adtech bit instead of ads for ads flag
2023-08-18 11:24:59 +02:00
Viktor Lofgren
bee815b1c4
(converter) Add monsterinsights as an adtech tracker
2023-08-17 17:44:11 +02:00
Viktor Lofgren
e296b02649
(converter) Optimize LSH based within-domain deduplication
2023-08-17 17:43:46 +02:00
Viktor Lofgren
c019a029ec
(flags) Documentation and preventative bugfix
2023-08-17 17:42:31 +02:00
Viktor Lofgren
db0216936e
(summary) Reduce the chance of expensive operations
2023-08-16 15:48:34 +02:00
Viktor Lofgren
46d761f34f
(language) fasttext based language filter
2023-08-16 15:48:12 +02:00
Viktor Lofgren
4598c7f40f
(valuation) Penalize wordpress style kebab case urls
2023-08-16 13:11:24 +02:00
Viktor Lofgren
1d486bddee
(crawler) Reduce log spam
2023-08-16 11:12:09 +02:00
Viktor Lofgren
606db54dc8
(docs) Fix dead links to message-queue after moving it to libraries
2023-08-15 19:26:40 +02:00
Viktor Lofgren
d8073f0dde
(feature-extractor) Add mail.ru counter to non-adtech trackers
2023-08-15 19:10:43 +02:00
Viktor Lofgren
df85468c01
(control) Action for refreshing the blogs definition.
2023-08-15 11:38:52 +02:00
Viktor Lofgren
4404ad98ae
(mq) Fix missing @Inject that broke everything in control-service
2023-08-15 11:22:12 +02:00
Viktor Lofgren
e7192a9cad
(mq) Refactor mq and actor library and move it to libraries out of common
2023-08-15 10:53:23 +02:00
Viktor Lofgren
019b61b330
(control) Remove message queue listing from actors view.
2023-08-13 13:50:04 +02:00
Viktor Lofgren
f997707049
(control) Move event log out of plumbing
2023-08-13 13:40:50 +02:00
Viktor Lofgren
c56ee10185
(control) Separate [Process] and [Process and Load] actions for crawl data; all SLOW data is deletable.
2023-08-13 13:39:59 +02:00
Viktor Lofgren
8210e49b4e
(control) Helpful tooltips for the Actor table.
2023-08-13 12:55:56 +02:00
Viktor Lofgren
a8f2e9ee2c
(control) Tidy up empty tables, remove actors from index view
2023-08-12 15:18:14 +02:00
Viktor Lofgren
a91b909103
(control) Event log on stop actor
2023-08-12 15:02:53 +02:00
Viktor Lofgren
d6b8b38955
(db) Add indices on SERVICE_EVENTLOG
2023-08-12 15:00:15 +02:00
Viktor Lofgren
99e031c529
(control) Remove broken pagination from events and message queue; new "light" events table for some views
2023-08-12 14:57:55 +02:00
Viktor Lofgren
998f239ed9
(control) Filterable event log view
2023-08-12 14:43:11 +02:00
Viktor Lofgren
0961f627b1
(control) Pretty up the nav bar
2023-08-12 14:42:42 +02:00
Viktor Lofgren
6483308bb0
(sql) Update default value for DOMAIN_SELECTION_TYPE
2023-08-11 14:01:15 +02:00
Viktor Lofgren
7440da240d
(blacklist) Fix broken SQL migration
2023-08-11 13:33:35 +02:00
Viktor Lofgren
4f8048be31
(blacklist) Blacklist management
2023-08-10 15:40:07 +02:00
Viktor Lofgren
807fb2d052
(service) Task heartbeat creates event log entries
2023-08-09 15:15:16 +02:00
Viktor Lofgren
ce293029c7
(converter) Treat adtech tracking as advertisement.
2023-08-09 14:23:53 +02:00
Viktor Lofgren
b5ed21be21
(mq) MqPersistence no longer relies on autoCommit being enabled
2023-08-09 14:23:22 +02:00
Viktor Lofgren
251fc63b42
(*) Fix merge gore
2023-08-09 13:33:28 +02:00
Viktor Lofgren
47f3855a4b
(control) More informative readme.md
2023-08-09 12:42:23 +02:00
Viktor Lofgren
71dfe9f33e
(control) Clean up the ControlService, move mq-related endpoints to MessageQueueService.
2023-08-09 12:42:01 +02:00
Viktor Lofgren
afad4f5ebb
(*) last touches
2023-08-07 12:59:33 +02:00
Viktor Lofgren
4ab1cd9502
(*) last touches
2023-08-07 12:57:44 +02:00
Viktor
52e2ab45bf
Merge branch 'master' into master-control-program
2023-08-07 12:53:43 +02:00
Viktor Lofgren
be444f9172
(control) New actions view, re-arrange navigation menu
2023-08-05 14:45:04 +02:00
Viktor Lofgren
715d61dfea
(mq) Fix bug in notice handling where they were registered on the wrong name
2023-08-05 14:45:04 +02:00
Viktor Lofgren
bf37a3eb25
(search-service) Make flushCaches endpoint a notice and not a request
2023-08-05 14:45:04 +02:00
Viktor Lofgren
c2b45bec8d
(mq) Rename notify to sendNotice to avoid name clash with the java object function
2023-08-05 14:45:04 +02:00
Viktor Lofgren
cdfe284f9a
(file storage) File Storage Type for EXPORT data
2023-08-05 14:45:03 +02:00
Viktor Lofgren
08eed17e66
(api-service) Mq endpoint for flushing caches
2023-08-05 14:42:16 +02:00
Viktor Lofgren
00eb8b90dc
(control) Message Queue GUI
2023-08-04 22:05:29 +02:00
Viktor Lofgren
912129311d
(control) Message Queue GUI
2023-08-04 17:54:18 +02:00
Viktor Lofgren
624b78ec3a
(heartbeat) Task heartbeats
2023-08-04 14:40:06 +02:00
Viktor Lofgren
1d0cea1d55
(converter) GUI for dealing with user complaints
2023-08-03 17:59:57 +02:00
Viktor Lofgren
f01f608474
(blacklist) Support blacklists with subdomain
2023-08-03 17:58:52 +02:00
Viktor Lofgren
c22feaf42e
(crawl) Make crawler limiter request a GC when throttling
2023-08-03 17:58:18 +02:00
Viktor Lofgren
63e857f7cd
(control) Add basic api key management
2023-08-02 20:14:03 +02:00
Viktor Lofgren
9979c9defe
(search/index) Add blogosphere filter
2023-08-02 20:13:30 +02:00
Viktor Lofgren
7763df0715
(docs) Add control-service to the main readme.md
2023-08-01 22:52:41 +02:00
Viktor Lofgren
8de3e6ab80
(control) Fix bug where CrawlActor and RecrawlActor would steal each other's mail
2023-08-01 22:33:30 +02:00
Viktor Lofgren
659d2134ba
(file-storage) Deprecate mustClean flag
2023-08-01 22:32:30 +02:00
Viktor Lofgren
867410c66b
(file-storage) Automatic file storage discovery via manifest file
2023-08-01 18:05:43 +02:00
Viktor Lofgren
e5c9791b14
(crawler) Fix rare ConcurrentModificationError due to HashSet
2023-08-01 17:28:29 +02:00
Viktor Lofgren
58556af6c7
(db) Use flyway for database migrations.
2023-08-01 17:08:42 +02:00
Viktor Lofgren
2e29038ecd
(db) Fix broken insert statement, move file storage defaults to a separate file.
2023-08-01 15:50:08 +02:00
Viktor Lofgren
36a23707c1
(control) Control service should be a core service.
2023-08-01 15:49:50 +02:00
Viktor Lofgren
c1ea60b399
(db) Default values for storage base
2023-08-01 15:05:04 +02:00
Viktor Lofgren
b08e302dd5
(lexicon) Optimize lexicon by using Murmur3_128's hash function
2023-08-01 15:02:13 +02:00
Viktor Lofgren
ea66195b97
(loader) Optimize loader by using zstd's direct streaming writer and the Murmur3_128 string hash
2023-08-01 15:02:13 +02:00
Viktor Lofgren
8f0cbf267b
(loader) Perform instruction reads in a separate thread for extra vroom vroom
2023-07-31 14:24:08 +02:00
Viktor Lofgren
2f8488610a
(loader) Fix bug where trailing deferred domain meta inserts weren't executed
2023-07-31 14:23:23 +02:00
Viktor Lofgren
d95f01b701
(control) Reduce log spam in control svc
2023-07-31 14:21:06 +02:00
Viktor Lofgren
c9d7635370
(control) Aborting an actor that waits on a process request terminates the running job.
2023-07-31 14:21:06 +02:00
Viktor Lofgren
6b5fb0f841
(control) Disable the start button for actors that aren't directly initializable.
2023-07-31 14:21:00 +02:00
Viktor Lofgren
12bd74d4f3
Clean up ProcessService
2023-07-31 10:56:16 +02:00
Viktor Lofgren
37c4cc68ed
TODO
2023-07-31 10:34:42 +02:00
Viktor Lofgren
1c948eb3d8
(minor) Alter DumbThreadPool in Converter to not claim the threads are crawlers.
2023-07-31 10:33:15 +02:00
Viktor Lofgren
cd90ca820f
YAGNI filter over ConverterDomainTypes
2023-07-31 10:32:47 +02:00
Viktor Lofgren
6f4e767a04
(minor) Re-enable monkey-patch-json for converter
2023-07-31 10:31:46 +02:00
Viktor Lofgren
5411950b87
(minor) Tidy up EdgeDomain class a bit, no functional difference
2023-07-31 10:31:29 +02:00
Viktor Lofgren
6ff7e9648f
(crawler) Use and pass the proper environment variables to the processes.
2023-07-30 16:54:02 +02:00
Viktor Lofgren
5c071ce4d3
(crawler) Clean up the code and remove unnecessary logging
2023-07-30 16:53:39 +02:00
Viktor Lofgren
caf3d231a8
(crawler) Fix rare issue with NPEs if the crawl queue is empty
2023-07-30 16:53:13 +02:00
Viktor Lofgren
730e8f74e4
(crawler) Even more memory optimizations.
...
* Fix minor resource leak in zstd streams
* Use pools for zstd streams
* Reduce the SSL session cache size
2023-07-30 14:19:55 +02:00
Viktor Lofgren
aba134284f
(crawler) Reduce log spam
2023-07-29 19:22:58 +02:00
Viktor Lofgren
2a6183f9e0
(crawler) Dynamic throttling of the number of active crawl jobs permitted to spawn; reduce queue size.
2023-07-29 19:20:09 +02:00
Viktor Lofgren
ee143bbc48
(crawler, converter) Fix so that DumbThreadPool actually waits for termination as intended.
2023-07-29 19:19:09 +02:00
Viktor Lofgren
d3f01bd171
(crawler, converter) Remove monkey patched gson from dependencies
2023-07-29 19:18:12 +02:00
Viktor Lofgren
05ba3bab96
(crawler) Make SitemapRetriever abort on too large sitemaps.
2023-07-29 19:18:12 +02:00
Viktor Lofgren
d2b6b2044c
(crawler) Reduce log spam in HttpFetcherImpl
2023-07-29 19:18:12 +02:00
Viktor Lofgren
7611b7900d
(crawler) Reduce long term memory allocation in DomainCrawlFrontier
2023-07-29 19:18:12 +02:00
Viktor Lofgren
9ad32ee9c7
(control) Be more clear about when a process exits and why.
2023-07-29 19:16:00 +02:00
Viktor Lofgren
866db6c63f
(control) Dialog for updating message state; clean up file view.
2023-07-28 22:02:05 +02:00
Viktor Lofgren
01476577b8
(loader) Speed up loading back to original speeds with a cascading DELETE FROM EC_URL rather than EC_PAGE_DATA.
...
* Also clean up code and have proper rollbacks for transactions.
2023-07-28 22:00:07 +02:00
Viktor Lofgren
e237df4a10
(converter) Use a dumb thread pool instead of Java's executor service.
2023-07-28 18:15:16 +02:00
Viktor Lofgren
f11103d31d
(WIP) Make it possible to sideload encyclopedia data.
...
This is mostly a pilot track for sideloading other large websites.
Also change converter to produce a more compact output (java serialization instead of json).
2023-07-28 18:14:43 +02:00
Viktor Lofgren
9288d311d4
Add buffering to index journal writer
2023-07-28 18:11:19 +02:00
Viktor Lofgren
77d5e39fe0
Make processed data Serializable
2023-07-28 18:11:19 +02:00
Viktor Lofgren
27e781761d
(mq single shot inbox) Flag messages as OK if there is no recipient
2023-07-28 12:04:23 +02:00
Viktor Lofgren
92cac52813
(mq) Add indexes to MESSAGE_QUEUE
2023-07-28 12:03:51 +02:00
Viktor Lofgren
66bb12e55a
(converter) File listing and download for file storage
2023-07-26 21:59:35 +02:00
Viktor Lofgren
a5d980ee56
(converter) Hook crawl job extractor and adjacencies calculator into control service.
2023-07-26 15:46:22 +02:00
Viktor Lofgren
19c2ceec9b
(converter) Use Marginalia Yellow for control service
2023-07-26 11:50:23 +02:00
Viktor Lofgren
507f26ad47
(converter) Refactor converter to not keep instructions list in RAM.
2023-07-25 22:06:46 +02:00
Viktor Lofgren
fd44e09ebd
(loader) Don't delete the entire link database when the loader runs
2023-07-24 18:37:35 +02:00
Viktor Lofgren
09fd0a1d0e
(converter) Automatically clean stale file storage records if they disappear on disk
2023-07-24 17:04:42 +02:00
Viktor Lofgren
667b0ca0b0
(converter, WIP) Refactor CrawledDomainReader to not return iterators.
...
Instead return a closable class SerializableCrawlDataStream.
2023-07-24 16:28:30 +02:00
Viktor Lofgren
a56953c798
(converter, WIP) Refactor converter to not have to load everything into RAM.
2023-07-24 15:25:09 +02:00
Viktor Lofgren
7470c170b1
(minor) EdgeUrl.parse() should deal with null
2023-07-24 15:06:57 +02:00
Viktor Lofgren
bc330acfc9
(control) Better refresh script that doesn't cause weird artifacts
2023-07-23 19:26:16 +02:00
Viktor Lofgren
789e8eea85
(crawler) Clean up and refactor the code a bit
2023-07-23 19:08:38 +02:00
Viktor Lofgren
35b29e4f9e
(crawler) Clean up and refactor the code a bit
2023-07-23 19:06:37 +02:00
Viktor Lofgren
69f333c0bf
(crawler) Clean up and refactor the code a bit
2023-07-23 18:59:14 +02:00
Viktor Lofgren
c069c8c182
(crawler) Clean up crawl data reference and recrawl logic
2023-07-22 18:42:21 +02:00
Viktor Lofgren
9e4aa7da7c
(crawler) Support for X-Robots-Tag
2023-07-22 18:42:21 +02:00
Viktor Lofgren
e22e65eee4
(index) Fix bug related to debug print statements
2023-07-22 14:33:58 +02:00
Viktor Lofgren
cb55c76664
(index) Fix bug related to debug print statements
2023-07-22 14:20:52 +02:00
Viktor Lofgren
d6b07e4d01
(controller) Improve the storage interface
2023-07-21 19:56:16 +02:00
Viktor Lofgren
995657c6ce
(big-string) Make big-string disable:able
2023-07-21 19:50:35 +02:00
Viktor Lofgren
58f2f86ea8
(crawler) Don't read all the data into RAM when doing a refresh-crawl
2023-07-21 19:47:52 +02:00
Viktor Lofgren
7bc1cff286
(minor) code cleanup
2023-07-21 14:28:37 +02:00
Viktor Lofgren
8f455f3b6d
(control) Aborting a process spawner actor cancels the message to the actor.
2023-07-21 14:12:32 +02:00
Viktor Lofgren
f91d92cccb
(crawler) WIP
2023-07-20 21:05:16 +02:00
Viktor Lofgren
08ca6399ec
(converter) WIP
2023-07-19 17:14:45 +02:00
Viktor Lofgren
c0b5ea0e7d
Revert "Less spammy default log settings"
...
This reverts commit f6e2216b87.
2023-07-18 19:28:42 +02:00
Viktor Lofgren
f21a3983aa
Abortable processes
2023-07-18 18:40:12 +02:00
Viktor Lofgren
f6e2216b87
Less spammy default log settings
2023-07-17 21:42:13 +02:00
Viktor Lofgren
92ed513e4f
Less spammy default log settings
2023-07-17 21:41:56 +02:00
Viktor Lofgren
d7ab21fe34
(*) Refactor Control Service and processes
2023-07-17 21:20:31 +02:00
Viktor Lofgren
bca4bbb6c8
(*) Refactor MQ and MQSM
2023-07-17 13:57:32 +02:00
Viktor Lofgren
e618aa34e9
(control) Name change process->fsm, new fsm:s
...
* FSM for spawning processes when messages appear for them
* FSM for removing data flagged for purging
2023-07-17 12:27:27 +02:00
Viktor Lofgren
6e41e78f36
(control) Highlight missing processes
2023-07-16 12:03:32 +02:00
Viktor Lofgren
c4dd9a0547
(control) Use MQFSMs to monitor and spawn processes when messages are sent to them
2023-07-16 11:58:47 +02:00
Viktor Lofgren
5ec10634d8
(mqfsm) Abortable state machine
2023-07-15 14:12:16 +02:00
Viktor Lofgren
cdae74d395
(control) Working redirects
2023-07-15 14:11:59 +02:00
Viktor Lofgren
8b74e3aa0d
(*) File Storage WIP
2023-07-14 17:08:10 +02:00
Viktor Lofgren
23169ad818
(db) Model for file storage areas
2023-07-14 11:40:05 +02:00
Viktor Lofgren
d36e36c8fd
(mq) Bugfix lastNMessages; use Lists.reverse properly
2023-07-14 11:39:15 +02:00
Viktor Lofgren
948d4d5f08
(control) Clean up the number of GUI views, abortable FSM tasks
2023-07-13 17:24:21 +02:00
Viktor Lofgren
0960e18f8e
(control) Auto-refreshing tables
2023-07-13 15:44:36 +02:00
Viktor Lofgren
825fd10efa
(control) Clean up the MQ ui a bit
2023-07-13 15:14:04 +02:00
Viktor Lofgren
1ec6f9cde2
(mq) More robust resume and recovery logic, protection against spurious state changes, minor bugfixes
2023-07-13 14:55:45 +02:00
Viktor Lofgren
a5118fe8f1
(minor) clean-up
2023-07-12 22:46:14 +02:00
Viktor Lofgren
6c88f00a9d
(mqsm) guard against spurious transitions from unexpected messages
2023-07-12 22:44:05 +02:00
Viktor Lofgren
bf783dad7a
(converter) NPE fix
2023-07-12 20:13:01 +02:00
Viktor Lofgren
8a53e107fa
(mq) Synchronous and Asynchronous inboxes.
2023-07-12 20:12:52 +02:00
Viktor Lofgren
0ed938545b
(mq) Add single-shot inbox
2023-07-12 18:41:27 +02:00
Viktor Lofgren
480abfe966
(minor) Add limit to poll count in MqPersistence, fix test
2023-07-12 18:16:23 +02:00
Viktor Lofgren
89e4343fdb
(minor) Fix test
2023-07-12 18:15:50 +02:00
Viktor Lofgren
8c16a2aede
(work-log, minor) Clean up code
2023-07-12 18:10:05 +02:00
Viktor Lofgren
5deec63667
(work-log) Better tests
2023-07-12 18:04:06 +02:00
Viktor Lofgren
363368b150
(converter) Remove auto-refresh.
2023-07-12 17:48:37 +02:00
Viktor Lofgren
74caf9e38a
(processes) Remove forEach-constructs in favor of iterators.
2023-07-12 17:47:36 +02:00
Viktor Lofgren
0b0cf48849
(control) Better looking UUIDs
2023-07-11 23:11:02 +02:00
Viktor Lofgren
00d9773b44
(control) Better looking progress bar
2023-07-11 21:37:32 +02:00
Viktor Lofgren
ac2d7034db
(minor) Bugfix in Path handling
2023-07-11 21:24:29 +02:00
Viktor Lofgren
88b9ec70c6
(control, WIP) Run reconvert-load from converter :D
2023-07-11 18:05:37 +02:00
Viktor Lofgren
77261a38cd
(control, WIP) MQFSM and ProcessService are sitting in a tree
...
We're spawning processes from the MQFSM in control service now!
2023-07-11 17:08:43 +02:00
Viktor Lofgren
3c7c77fe21
(minor) Bugfix in Path handling
2023-07-11 17:06:52 +02:00
Viktor Lofgren
4ee3f6ba3f
(minor) Refactor ControlService
2023-07-11 14:51:51 +02:00
Viktor Lofgren
4c016b0318
Process monitoring
...
* Also refactored the SQL tables a bit
2023-07-11 14:46:21 +02:00
Viktor Lofgren
f59cab300e
(minor) Javadoc comments for MqPersistence and MqMessageState
2023-07-10 21:59:51 +02:00
Viktor Lofgren
ec7826659a
(minor) Javadoc comments for MqPersistence and MqMessageState
2023-07-10 21:52:25 +02:00
Viktor Lofgren
98b5f22104
(control) WIP control service
...
* Set messages to OK when received so they're cleaned up properly.
2023-07-10 21:33:57 +02:00
Viktor Lofgren
2283ceb77d
(control) WIP control service
2023-07-10 18:58:43 +02:00
Viktor Lofgren
fba466d6e2
(crawler) Update URL blocklist
...
* Don't crawl MDN mirrors
* More mailing list variants
2023-07-10 18:58:43 +02:00
Viktor
cbbf60a599
Better fingerprinting ( #35 )
...
* Better fingerprinting for server tech
* Many more features in FeatureExtractor
* Blog specialization
* SiteType table
2023-07-10 18:58:43 +02:00
Viktor Lofgren
c125d8ab48
(search) Fix a bug where space-like characters weren't normalized in query processing.
2023-07-10 18:58:43 +02:00
Viktor Lofgren
f03146de4b
(crawler) Fix poor handling of duplicate ids
...
* Also clean up the code a bit
2023-07-10 18:58:43 +02:00
Viktor Lofgren
dbb758d1a8
Minor: Better error handling in crawled domain reader
2023-07-10 18:58:43 +02:00
Viktor Lofgren
da8bcc6e24
Minor: Don't blow up the reader on a corrupted file
2023-07-10 18:58:43 +02:00
Viktor Lofgren
96eecc6ea5
Minor: Readability.
2023-07-10 18:58:43 +02:00
Viktor Lofgren
74644d59f3
(crawler) Update URL blocklist
...
* Don't crawl MDN mirrors
* More mailing list variants
2023-07-10 18:04:43 +02:00
Viktor
0f9b90eb1c
Better fingerprinting ( #35 )
...
* Better fingerprinting for server tech
* Many more features in FeatureExtractor
* Blog specialization
* SiteType table
2023-07-10 17:36:12 +02:00
Viktor Lofgren
ae9537b68e
(search) Fix a bug where space-like characters weren't normalized in query processing.
2023-07-07 20:02:05 +02:00
Viktor Lofgren
2619d196bb
(crawler) Fix poor handling of duplicate ids
...
* Also clean up the code a bit
2023-07-07 19:56:14 +02:00
Viktor Lofgren
17db23c2c1
Minor: Better error handling in crawled domain reader
2023-07-07 19:48:32 +02:00
Viktor Lofgren
040bea1f75
Minor: Don't blow up the reader on a corrupted file
2023-07-07 19:48:11 +02:00
Viktor Lofgren
dc8277223a
Minor: Readability.
2023-07-06 19:50:13 +02:00
Viktor Lofgren
98d1898610
Bugfix: Don't run the xenforo specialization on phpBB.
2023-07-06 18:12:26 +02:00
Viktor Lofgren
1400fb4a9b
Bugfix: Don't run the xenforo specialization on phpBB.
2023-07-06 18:11:19 +02:00
Viktor Lofgren
647bbfa617
Fix so that crawler tests don't sometimes fetch real sitemaps when they're run.
2023-07-06 18:05:23 +02:00
Viktor Lofgren
b73fcc19fe
Fix so that crawler tests don't sometimes fetch real sitemaps when they're run.
2023-07-06 18:05:03 +02:00
Viktor Lofgren
d9e6c4f266
Trial integration of MQ-FSM into index service.
2023-07-06 18:04:16 +02:00
Viktor Lofgren
34653f03a2
Temporary bugfix, need to find source
2023-07-06 14:13:03 +02:00
Viktor Lofgren
f0a8ca440f
MQFSM Usability WIP
2023-07-06 13:33:11 +02:00
Viktor Lofgren
d89db10645
MQFSM Usability WIP
2023-07-06 13:02:16 +02:00
Adrthegamedev
78f21dd19a
(an attempt to) Add wikidot to wiki generators list
2023-07-05 18:03:36 +02:00
Viktor Lofgren
2cb209ae9c
Better wordpress fingerprinting
2023-07-05 18:03:36 +02:00
Viktor Lofgren
979a620ead
Bugfix where DocumentGeneratorExtractor went out of bounds for generators starting with 'microsoft' or 'adobe' but having no followup string.
2023-07-05 18:03:36 +02:00
Viktor Lofgren
7a17933c65
Control service owns message queue garbage collection.
2023-07-04 19:52:30 +02:00
Viktor Lofgren
097a163cf5
Getting a skeleton in place for the control service.
2023-07-04 18:25:42 +02:00
Viktor Lofgren
2ae0b8c159
Message queue based state machine
2023-07-04 17:42:06 +02:00
Viktor Lofgren
31ae71c7d6
Message queue WIP
2023-07-04 14:28:14 +02:00
Adrthegamedev
5ce894564c
(an attempt to) Add wikidot to wiki generators list
2023-07-03 13:31:42 +02:00
Viktor Lofgren
813fa08bdd
Better wordpress fingerprinting
2023-07-03 11:29:27 +02:00
Viktor Lofgren
e5792ba8b3
Bugfix where DocumentGeneratorExtractor went out of bounds for generators starting with 'microsoft' or 'adobe' but having no followup string.
2023-07-03 11:06:39 +02:00
Viktor Lofgren
62cc9df206
Embryo of new control process
...
* New events and heartbeat tables in mariadb
* Refactored to a cleaner Service interface
2023-07-03 10:40:32 +02:00
Viktor Lofgren
42375f0e53
Specialization for javadocs
2023-07-01 20:16:56 +02:00
Viktor Lofgren
24dce8c03b
Remove link filtering for mediawiki, it's too strict and not every site uses the /wiki/-pattern.
2023-07-01 19:32:25 +02:00
Viktor Lofgren
eda615de0f
Add generator fingerprint for invision.
2023-07-01 14:47:57 +02:00
Viktor Lofgren
a000256223
Add generator fingerprint for xenforo.
...
Also clean up the specializations logic a bit, and add a barebones specialization for phpbb that cleans out paths we aren't interested in but doesn't touch pruning or summarizing logic for now.
2023-07-01 14:43:49 +02:00
Viktor Lofgren
9bd0e3ce58
Add generator fingerprint for xenforo.
2023-07-01 14:04:48 +02:00
Viktor Lofgren
b4d1e0e81e
Add generator fingerprints for phpBB and flarum.
2023-07-01 13:44:42 +02:00
Viktor Lofgren
d2fdaafc7a
Big brain web developers were using onload and onerror handlers to load JS without script tags...
2023-06-30 17:10:25 +02:00
Viktor Lofgren
7d86586594
Remove annoying log spam in sitemap retriever
2023-06-30 17:08:35 +02:00
Viktor Lofgren
11c26e700e
Remove annoying log spam in crawler retriever
2023-06-30 17:08:24 +02:00
Viktor Lofgren
8274e8a953
JVM flags for disabling black and block-lists.
2023-06-30 17:07:47 +02:00
Viktor Lofgren
0f34beb1aa
Update search front page
2023-06-29 17:14:27 +02:00
Viktor Lofgren
baff83912e
Small optimizations that shave an hour off processing time :D
2023-06-28 15:41:10 +02:00
Viktor Lofgren
d71124961e
Better tests for crawling and processing.
2023-06-27 16:11:27 +02:00
Viktor Lofgren
fbdedf53de
Fix bug in CrawlerRetreiver
...
... where the root URL wasn't always added properly to the front of the crawl queue.
2023-06-27 15:50:38 +02:00
Viktor Lofgren
a6a66c6d8a
Improve site info for unknown domains:
...
* Placeholder screenshot should work
* Add a link to git-repo for submitting the site for crawling
2023-06-27 15:32:11 +02:00
Viktor Lofgren
d167ad2017
Remove sitemap related log spam
2023-06-27 13:59:47 +02:00
Viktor Lofgren
7d741ff499
Fix so crawl plan replay doesn't crash if a file is missing.
2023-06-27 10:57:54 +02:00
Viktor Lofgren
f8f9f04158
Specialized logic for processing Lemmy-based websites.
2023-06-27 10:57:54 +02:00
Viktor Lofgren
b0c7480d06
Set default timeouts for java.net.URL-connections
2023-06-27 10:57:54 +02:00
Viktor Lofgren
e7af77e151
Tests for crawler specialization + testdata
2023-06-27 10:57:54 +02:00
Viktor Lofgren
ec940e36d0
Sitemap support, refined crawler specialization
2023-06-27 10:57:54 +02:00
Viktor Lofgren
f92d8a0975
EdgeUrl conversion to/from java.net.URL
2023-06-27 10:57:54 +02:00
Viktor Lofgren
ed373eef61
Refactor crawler and add special logic for some platforms
...
* Break apart CrawlerRetreiver
* Break apart HttpFetcher into an interface and impl for testing sanity
* Add special logic for Lemmy, Mediawiki and Discourse to not waste requests on paths that aren't interesting.
2023-06-27 10:57:54 +02:00
Viktor Lofgren
5abaf13192
Fix serialization bug with CompressedBigString
2023-06-27 10:57:54 +02:00
Viktor Lofgren
d86e8522e2
Add search profiles for wiki, forum and docs.
2023-06-24 12:17:35 +02:00
Viktor Lofgren
bd2c3855ed
Add bits and keywords for generator classes (docs, forum, wiki).
2023-06-23 21:35:28 +02:00
Viktor Lofgren
54c2be893b
TRIVIAL: Remove unused import.
2023-06-22 17:21:47 +02:00
Viktor Lofgren
55c65f0935
Use document generator to complement the document selection.
...
Will let through e.g. a modern SSG in the small web filter.
2023-06-22 17:21:33 +02:00
Viktor Lofgren
b5ef67ed28
Categorize generators by type
...
This is a great quality signal!
Add the type as document bitflags by category.
2023-06-22 16:04:37 +02:00
Viktor Lofgren
f140e7d7c7
Use a default tag for unset or invalid generators.
2023-06-21 17:30:14 +02:00
Viktor Lofgren
a9a2960e86
New synthetic keyword for document generator meta tag.
2023-06-20 16:25:49 +02:00
Viktor Lofgren
7326ba74fe
Tweaks to pub date heuristics to make it mostly get the 'historyofphilosophy.net' case right.
...
Use HTML standard for plausibility checks in the more guesswork-like heuristics. Added more class names to look for date strings.
2023-06-20 14:15:05 +02:00
Viktor Lofgren
a9fabba407
Tell experiment runner to only process some domains.
...
Updated the experiment runner, as well as the script.
2023-06-20 14:14:01 +02:00
Viktor Lofgren
4fc0ddbc45
Improved crawl-job-extractor.
...
Let crawl-job-extractor run offline and allow it to read domains from file.
Improved docs.
2023-06-20 11:37:52 +02:00
Viktor Lofgren
9455100907
Throw a custom exception when WMSA_HOME isn't found
2023-06-20 11:37:52 +02:00
Viktor Lofgren
32a6735d03
Undo change in requirements for counting as a high tf-idf word
2023-06-19 17:58:19 +02:00
Viktor Lofgren
f0b4acb358
Better logic for summarization.
2023-06-19 17:58:19 +02:00
Viktor Lofgren
67c15a34e6
Reduce the amount of expensive operations in HtmlDocumentProcessorPlugin.
2023-06-19 17:58:19 +02:00
Viktor Lofgren
9579cdd151
Improved heuristic for which words are considered important in selecting the summary text.
2023-06-19 17:58:19 +02:00
Viktor Lofgren
443cf0cf1e
Expose additional functionality through WordsTfIdfCounts.
...
Bump requirements for being flagged as high TF-IDF from 2 occurrences to 3.
2023-06-19 17:58:19 +02:00
Viktor Lofgren
4138233ddf
Truncate repeated strings of any non-alnum symbols in SummaryExtractor
2023-06-19 17:58:19 +02:00
Viktor Lofgren
2979f4703e
Allocation-free text utility
2023-06-19 17:58:19 +02:00
Viktor Lofgren
77f2ca51af
Optimize SentenceExtractor.
...
Remove String pool because it's not doing much.
Break out constant.
Use a shared RdrPosTagger.
2023-06-19 17:58:19 +02:00
Viktor Lofgren
ffcbc6c1c9
Reduce the odds of re-allocation by AsciiFlattener
2023-06-19 17:58:19 +02:00
Viktor Lofgren
266ad2e4de
Re-introduce monkey patched GSON to make converter run better.
2023-06-19 17:58:19 +02:00
Viktor Lofgren
d1a004bea6
(minor) Clean up StringPool
2023-06-19 17:58:19 +02:00
Viktor Lofgren
e4372289a5
Use fixed buffers for BigString compression and decompression to reduce GC churn.
2023-06-19 17:58:19 +02:00
Viktor Lofgren
379bccc1a3
Disable AdblockSimulator since it's slow and doesn't really work. Just wasting CPU cycles until it's fixed.
2023-06-19 17:58:19 +02:00
Viktor Lofgren
21125206b4
Fix some bugs in JSON+LD-heuristics for pub date.
2023-06-19 17:58:19 +02:00
Viktor Lofgren
44b1fe0e6d
Move list-conversion into getDescription method.
2023-06-19 17:58:19 +02:00
Viktor Lofgren
88399e30e2
Consider keyword relevance signals when creating the document summary using the DOM walker.
2023-06-19 17:58:19 +02:00
Viktor Lofgren
7ed3306be3
Make the adjacency calculator behave like it used to in the past, when it gave better results.
2023-06-07 22:03:06 +02:00
Viktor Lofgren
eb2ca942d5
Up the default crawl delay to 1 second.
2023-06-07 22:02:17 +02:00
Viktor Lofgren
2afbdc2269
Adjust the logic for the crawl job extractor to set a relatively low visit limit for websites that are new in the index or have not yielded many good documents previously.
2023-06-07 22:01:35 +02:00
Viktor Lofgren
d82a858491
Don't consider slash to be a sentence separator.
2023-05-31 16:54:30 +02:00
Viktor Lofgren
e332faa07e
Fix test that broke when memex.marginalia.nu started redirecting to www.marginalia.nu.
2023-05-28 13:46:24 +02:00
Viktor Lofgren
4e9e79454f
Fix broken transformation functions in the PagingArray classes.
2023-05-28 13:31:05 +02:00
Viktor Lofgren
b0bc07b4e7
Insertion sort was *super* busted, I don't even know how it worked
2023-05-28 12:17:50 +02:00
Viktor Lofgren
2cda57355a
More word metadata tests
2023-05-28 11:57:06 +02:00
Viktor Lofgren
fd192d2791
Fix putative overflow error with a large dictionary
2023-05-28 11:57:06 +02:00
Viktor Lofgren
6814c90625
Fix N-width sorting bug
2023-05-28 11:57:06 +02:00
Viktor Lofgren
1e184a8372
(search) Make exploration mode more random
2023-05-25 17:40:28 +02:00
Viktor Lofgren
6fae51a8ef
Stopgap fix for a bug in dealing with quote terms containing stop words.
2023-05-02 19:38:59 +02:00
Viktor Lofgren
a9f7b4c457
Add synthetic keywords for same-site files linked from a document (e.g. file:png). Also add category keywords, like file:image or file:document.
2023-04-30 19:29:13 +02:00
Viktor Lofgren
1e3b6934bb
Reduce log noise during loading. Bad URLs don't need to be loaded, they can be grepped from the instructions.
2023-04-30 18:36:44 +02:00
Viktor
7694a15f62
Fix kale's unreasonably high weighting factor
2023-04-22 20:55:09 +02:00
Viktor
d72da01a92
Update readme.md
2023-04-22 16:05:57 +02:00
Viktor
112f43b3a1
Api service response cache ( #16 )
...
* Add response caching to the API service to help SearXNG
* Clean up the code a bit.
* Add an endpoint without a terminating slash for getLicense.
* Add tests for API service.
2023-04-22 15:42:32 +02:00
Viktor Lofgren
f12c6fd57e
Add a ranking parameter for biasing toward recent or old content.
2023-04-20 16:00:59 +02:00
Viktor
96bac70b85
Tools for merging sorted lists, and merging btrees. ( #14 )
...
* Utilities for merging BTrees of entity size 1 and 2.
* Isolate and clean up sorting algorithms.
* Functions for keeping distinct items in a LongArray
2023-04-20 15:28:09 +02:00
Viktor Lofgren
619fb8ba80
(converter) Adjust the pub-date sniffing heuristics' order. Doing HTML5 tags too early puts some sites too early. Also expanded support for JSON+LD.
2023-04-19 15:28:50 +02:00
Viktor
5a5cdaf70e
Improvements to the adjacency calculator and screenshots tool ( #13 )
...
* WIP: Improvements to website adjacencies loader tool.
* Improving screenshots capture bot.
2023-04-18 22:21:49 +02:00
Viktor Lofgren
bb587ca47f
Reformulate search-header.hdb, s/Support/Donate/; the formulation was apparently confusing some people into thinking they could get support on this page.
2023-04-18 17:04:24 +02:00
Viktor Lofgren
4d298cd5fa
Improving screenshots capture bot.
2023-04-17 18:04:22 +02:00
Viktor Lofgren
fbbaf584ba
Adjustments to screenshot capture tool.
2023-04-16 08:55:57 +02:00
Viktor Lofgren
df1850bd45
Fix bug in index service where tld: and links:-queries wouldn't work.
2023-04-15 18:39:16 +02:00
Viktor Lofgren
d42ab19166
Issue 5: Fix bug where some IPv6 addresses blew up domain loading.
2023-04-15 14:11:08 +02:00
Viktor Lofgren
2ab26f37b8
Bug fix for document metadata encoding that breaks year based queries.
2023-04-14 16:56:49 +02:00
Viktor
ec7ce7b0b3
Update readme.md
2023-04-11 16:31:11 +02:00
Viktor Lofgren
3e9b37c264
Refactor website screenshot tool and website adjacencies calculator into code/tools.
2023-04-11 16:20:27 +02:00
Viktor Lofgren
502713f7a8
Reduce memory churn
2023-04-10 16:51:17 +02:00
Viktor Lofgren
e19256a6b6
Tune settings to retrieve more results.
2023-04-10 15:39:20 +02:00
Viktor Lofgren
ccc41d1717
Clean up of the index query handling related code.
2023-04-10 14:50:57 +02:00
Viktor Lofgren
e49b1dd155
Better handling of quote terms, fix bug in handling of longer queries.
...
... where some terms may previously have been ignored. The latter bug was due to the handling of QueryHeads with AnyOf-style predicates interacting poorly with alreadyConsideredTerms in SearchIndex.java
2023-04-10 13:20:40 +02:00
Viktor Lofgren
fe419b12b4
Better handling of quote terms, fix bug in handling of longer queries.
...
... where some terms may previously have been ignored. The latter bug was due to the handling of QueryHeads with AnyOf-style predicates interacting poorly with alreadyConsideredTerms in SearchIndex.java
2023-04-10 13:11:40 +02:00
Viktor Lofgren
810515c08d
Clean up artifact extractor.
2023-04-10 13:07:54 +02:00
Viktor Lofgren
535a51a621
Repair broken year query test.
2023-04-08 12:04:09 +02:00
Viktor
a278fc6296
Increase search result relevance ( #8 )
...
* Increase accuracy of the position bits.
* Increase their width to 56.
* Use a rolling position scheme for bits 16-56 to increase the average accuracy.
* Result ranking overhaul
* Optimized queries
* BM25 in the index service's ranking
* Make gui less jank
* Javadocs for ranking parameters.
2023-04-07 20:18:08 +02:00
Viktor Lofgren
716ab35b4e
Search ranking debuggability improvements.
2023-04-02 13:43:24 +02:00
Viktor Lofgren
3fb249758e
Adjust result ordering.
2023-04-02 12:05:22 +02:00
Viktor Lofgren
f7a6ef2179
Smarter queries, better logging.
2023-04-02 12:05:09 +02:00
Viktor Lofgren
105d93cd85
Index query builder automatically ignores redundant predicates.
2023-04-02 12:04:26 +02:00
Viktor Lofgren
1e4157017d
More helpful descriptions of index queries.
2023-04-02 12:03:58 +02:00
Viktor Lofgren
5fb75adaae
Remove antique result scoring adjustment that makes no sense anymore.
2023-04-02 11:58:04 +02:00
Viktor Lofgren
affcf8cf41
Load test tool
2023-04-02 09:43:43 +02:00
Viktor Lofgren
cc4e089a5d
Consider average sentence length when selecting search results. This promotes prose over code listings, tabular data, etc.
2023-03-30 15:46:15 +02:00
Viktor Lofgren
32b9c2e671
Fix SentenceExtractor jank
2023-03-30 15:45:04 +02:00
Viktor Lofgren
4d05be4095
Refactor InternalLinkGraph
2023-03-30 15:44:23 +02:00
Viktor Lofgren
137adb9c3c
Bitmask calculation improvement. Take sentence length into consideration, not all lines are equal.
2023-03-30 15:42:06 +02:00
Viktor Lofgren
16e37672fc
Bugfix crawl plan, doesn't use rewrite() everywhere
2023-03-30 15:41:07 +02:00
Viktor Lofgren
d0c72ceb7e
Improve experiment runner, convenient start script.
2023-03-30 15:40:31 +02:00
Viktor Lofgren
0fcb2b534c
Polish Names
2023-03-29 16:51:47 +02:00
Viktor Lofgren
dcf6218cdb
Fix bugs related to search result selection in the case with multiple search terms.
...
* A deduplication filter step ran too early, and removed many good results on the basis that they partially, but not fully, fit another set of search terms.
* Altered the query creation process to prefer documents where multiple terms appear in the priority index.
2023-03-29 15:18:52 +02:00
Viktor Lofgren
8f51345a1d
Add experiment runner tool and get rid of the experiments module in processes.
2023-03-28 16:58:46 +02:00
Viktor Lofgren
03bd892b95
Improve document processing in conversion.
...
* Add flags for long and short documents.
* Break out common length logic from plugins.
* Cleaning up of related code.
2023-03-28 16:38:00 +02:00
Viktor Lofgren
30584887f9
DictionaryMap changes.
...
Add new flag to change the default size to make prod index boot faster. Remove option to select OffHeapDictionaryHashMap.
2023-03-27 17:28:39 +02:00
Viktor Lofgren
17ca4f9eea
Permit search results that are all synthetic to pass relevancy check.
2023-03-27 17:27:35 +02:00
Viktor Lofgren
7fb3db3249
Fix bug where link on front page news listing wouldn't work.
...
... also changed order of date and source to make the UI more consistent.
2023-03-27 17:26:46 +02:00
Viktor Lofgren
862e925d7c
"-Dsmall-ram=TRUE" no longer does anything. Remove references to the flag, which previously reduced the memory footprint of the loader and index service.
2023-03-26 21:37:11 +02:00
Viktor Lofgren
a0027ad32b
Fix broken diagram links after doc/ restructuring.
2023-03-25 16:32:10 +01:00
Viktor Lofgren
c5f4cb34bf
Documentation for DB
2023-03-25 16:14:16 +01:00
Viktor
be3ba3ef37
Update readme.md
2023-03-25 15:27:11 +01:00
Viktor
ac1ac3ea57
Move database to a separate module
...
* Move database to a separate project, break apart sql file into separate entities.
* Fix front page news listing.
2023-03-25 15:26:17 +01:00
Viktor
0b505939ed
Update features-convert/readme.md
2023-03-25 12:43:58 +01:00
Viktor
d2a9e1b644
Add processes link to readme.md for code/common
2023-03-25 12:42:44 +01:00
Viktor Lofgren
3464ca514b
Fix typeahead suggestions
2023-03-25 10:20:52 +01:00
Viktor Lofgren
2f2c86a9f5
Fix bug where WmsaHome wouldn't look in /var/lib/wmsa as a fallback
2023-03-25 10:20:52 +01:00
Viktor
45dd9fea25
Update readme.md
2023-03-22 17:15:36 +01:00
Viktor
c974d72e7e
Update readme.md
2023-03-22 17:09:48 +01:00
Viktor
e3675d2fa9
Update readme.md
2023-03-22 17:02:03 +01:00