3687: Allow to disable specialized tokenizations (again) r=Kerollmops a=jirutka
In PR #2773, I added the `chinese`, `hebrew`, `japanese` and `thai` feature flags to allow melisearch to be built without huge specialed tokenizations that took up 90% of the melisearch binary size. Unfortunately, due to some recent changes, this doesn't work anymore. The problem lies in excessive use of the `default` feature flag, which infects the dependency graph.
Instead of adding `default-features = false` here and there, it's easier and more future-proof to not declare `default` in `milli` and `meilisearch-types`. I've renamed it to `all-tokenizers`, which also makes it a bit clearer what it's about.
Co-authored-by: Jakub Jirutka <jakub@jirutka.cz>
In PR #2773, I added the `chinese`, `hebrew`, `japanese` and `thai`
feature flags to allow melisearch to be built without huge specialed
tokenizations that took up 90% of the melisearch binary size.
Unfortunately, due to some recent changes, this doesn't work anymore.
The problem lies in excessive use of the `default` feature flag, which
infects the dependency graph.
Instead of adding `default-features = false` here and there, it's easier
and more future-proof to not declare `default` in `milli` and
`meilisearch-types`. I've renamed it to `all-tokenizers`, which also
makes it a bit clearer what it's about.
3570: Get documents by filter r=irevoire a=dureuill
# Pull Request
## Related issue
Associated spec: https://github.com/meilisearch/specifications/pull/234
None really, this is more of an extension of #3477: since after this issue we'll be able to delete documents by filter, it makes sense to also be able to get documents by filter.
## What does this PR do?
### User standpoint
- Add a new `filter` URL parameter to `GET /indexes/{:indexUid}/documents` and a new `POST /indexes/{:indexUid}/documents/fetch` route with the same `offset, limit, fields, filter`
### Implementation standpoint
- Add a new `Index::iter_documents` method to iterate on a set of documents rather than return a vector of these documents.
- Rewrite the other `Index::*documents` methods to use the new `Index::iter_documents` method.
## Usage
<details>
<summary>
Sample request and response
</summary>
```
curl -X POST 'http://localhost:7700/indexes/index-1101/documents/fetch' -H 'Content-Type: application/json' --data-binary '{ "filter": "genres = Comedy", "limit": 3, "offset": 8000}' | jsonxf
```
```json
{
"results": [
{
"id": 326126,
"title": "Bad Exorcists",
"overview": "A trio of awkward teens intend to win a horror festival by making their own movie, but wind up getting their actress possessed in the process.",
"genres": [
"Horror",
"Comedy"
],
"poster": "https://image.tmdb.org/t/p/w500/lwd65kPbjFacAw3QSXiwSsW6cFU.jpg",
"release_date": 1425081600
},
{
"id": 326215,
"title": "Ooops! Noah is Gone...",
"overview": "It's the end of the world. A flood is coming. Luckily for Dave and his son Finny, a couple of clumsy Nestrians, an Ark has been built to save all animals. But as it turns out, Nestrians aren't allowed. Sneaking on board with the involuntary help of Hazel and her daughter Leah, two Grymps, they think they're safe. Until the curious kids fall off the Ark. Now Finny and Leah struggle to survive the flood and hungry predators and attempt to reach the top of a mountain, while Dave and Hazel must put aside their differences, turn the Ark around and save their kids. It's definitely not going to be smooth sailing.",
"genres": [
"Animation",
"Adventure",
"Comedy",
"Family"
],
"poster": "https://image.tmdb.org/t/p/w500/gEJXHgpiKh89Vwjc4XUY5CIgUdB.jpg",
"release_date": 1427328000
},
{
"id": 326241,
"title": "For Here or to Go?",
"overview": "An aspiring Indian tech entrepreneur in the Silicon Valley finds himself unexpectedly battling the bizarre American immigration system to keep his dream alive or prepare to return home forever.",
"genres": [
"Drama",
"Comedy"
],
"poster": "https://image.tmdb.org/t/p/w500/ff8WaA7ItBgl36kdT232i0d0Fnq.jpg",
"release_date": 1490918400
}
],
"offset": 8000,
"limit": 3,
"total": 9331
}
```
<img width="1348" alt="Capture d’écran 2023-03-08 à 10 09 04" src="https://user-images.githubusercontent.com/41078892/223670905-6932b79b-f9b8-4a41-b59e-be2171705b7d.png">
</details>
# Draft status
- [ ] Route naming: having one route be `GET /indexes/{:indexUid}/documents` and the other `POST /indexes/{:indexUid}/documents/fetch` is suboptimal (also, technically a breaking change for documents with `fetch` as uid?), but `POST /indexes/{:indexUid}/documents` is already used to insert documents.
Co-authored-by: Louis Dureuil <louis@meilisearch.com>
Co-authored-by: Tamo <tamo@meilisearch.com>
3550: Delete documents by filter r=irevoire a=dureuill
# Prototype `prototype-delete-by-filter-0`
Usage:
A new route is available under `POST /indexes/{index_uid}/documents/delete` that allows you to delete your documents by filter.
The expected payload looks like that:
```json
{
"filter": "doggo = bernese",
}
```
It'll then enqueue a task in your task queue that'll delete all the documents matching this filter once it's processed.
Here is an example of the associated details;
```json
"details": {
"deletedDocuments": 53,
"originalFilter": "\"doggo = bernese\""
}
```
----------
# Pull Request
## Related issue
Related to https://github.com/meilisearch/meilisearch/issues/3477
## What does this PR do?
### User standpoint
- Modifies the `/indexes/{:indexUid}/documents/delete-batch` route to accept either the existing array of documents ids, or a JSON object with a `filter` field representing a filter to apply. If that latter variant is used, any document matching the filter will be deleted.
### Implementation standpoint
- (processing time version) Adds a new BatchKind that is not autobatchable and that performs the delete by filter
- Reuse the `documentDeletion` task with a new `originalFilter` detail that replaces the `providedIds` detail.
## Example
<details>
<summary>Sample request, response and task result</summary>
Request:
```
curl \
-X POST 'http://localhost:7700/indexes/index-10/documents/delete-batch' \
-H 'Content-Type: application/json' \
--data-binary '{ "filter" : "mass = 600"}'
```
Response:
```
{
"taskUid": 3902,
"indexUid": "index-10",
"status": "enqueued",
"type": "documentDeletion",
"enqueuedAt": "2023-02-28T20:50:31.667502Z"
}
```
Task log:
```json
{
"uid": 3906,
"indexUid": "index-12",
"status": "succeeded",
"type": "documentDeletion",
"canceledBy": null,
"details": {
"deletedDocuments": 3,
"originalFilter": "\"mass = 600\""
},
"error": null,
"duration": "PT0.001819S",
"enqueuedAt": "2023-03-07T08:57:20.11387Z",
"startedAt": "2023-03-07T08:57:20.115895Z",
"finishedAt": "2023-03-07T08:57:20.117714Z"
}
```
</details>
## Draft status
- [ ] Error handling
- [ ] Analytics
- [ ] Do we want to reuse the `delete-batch` route in this way, or create a new route instead?
- [ ] Should the filter be applied at request time or when the deletion task is processed?
- The first commit in this PR applies the filter at request time, meaning that even if a document is modified in a way that no longer matches the filter in a later update, it will be deleted as long as the deletion task is processed after that update.
- The other commits in this PR apply the filter only when the asynchronous deletion task is processed, meaning that documents that match the filter at processing time are deleted even if they didn't match the filter at request time.
- [ ] If keeping the filter at request time, find a more elegant way to recover the user document ids from the internal document ids. The current way implemented in the first commit of this PR involves getting all the documents matching the filter, looking for the value of their primary key, and turning it into a string by copy-pasting routines found in milli...
- [ ] Security consideration, if any
- [ ] Fix the tests (but waiting until product questions are resolved)
- [ ] Add delete by filter specific tests
Co-authored-by: Louis Dureuil <louis@meilisearch.com>
Co-authored-by: Tamo <tamo@meilisearch.com>
3639: Add a dedicated error variant for planned failures in index scheduler tests r=Kerollmops a=Sufflope
# Pull Request
## Related issue
Fixes#3086
## What does this PR do?
- Add a dedicated test variant in test cfg to avoid reusing a misleading existing error
## PR checklist
Please check if your PR fulfills the following requirements:
- [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
- [x] Have you read the contributing guidelines?
- [x] Have you made sure that the title is accurate and descriptive of the changes?
Thank you so much for contributing to Meilisearch!
Co-authored-by: Jean-Sébastien Bour <jean-sebastien@bour.name>
3693: Implement the auto deletion of tasks r=dureuill a=irevoire
Fixes https://github.com/meilisearch/meilisearch/issues/3622
This PR should be the definite fix for #3622.
It adds a limit (1M) to the maximum number of tasks the task queue can hold.
Once the task queue reaches this limit (1M of tasks are in the task queue, whatever their status is), meilisearch will schedule a task deletion that tries to delete the oldest 100k tasks.
If meilisearch can't delete 100k tasks because some of them are not yet finished, it will delete as many tasks as possible.
Once the limit is reached, you're still able to register new tasks. The engine will only stop you from adding new tasks once [the other hard limit](https://github.com/meilisearch/meilisearch/pull/3659) of 10GiB of tasks is reached (that's between 5M and 15M of tasks depending on your workflow).
-------
Technically;
- We only try to schedule our task deletion when calling the tick function but before creating a new batch. This means we never enqueue a task we're not going to process ~right away.
- If our task deletion doesn't delete anything, we don't enqueue it and log a warn the user that the engine is not working properly
Co-authored-by: Tamo <tamo@meilisearch.com>
Co-authored-by: Louis Dureuil <louis@meilisearch.com>
3542: Refactor of the search algorithms r=dureuill a=loiclec
This PR refactors a large part of the search logic (related to https://github.com/meilisearch/meilisearch/issues/3547)
- The "query tree" is replaced by a "query graph", which describes the different ways in which the search query can be interpreted and precomputes the word derivations for each query term. Example:
<img width="1162" alt="Screenshot 2023-02-27 at 10 26 50" src="https://user-images.githubusercontent.com/6040237/221525270-87917cc0-60d1-473f-847f-2c5a7de9e370.png">
- The control flow between the ~criterions~ ranking rules is managed in a single place instead of being independently implemented by each ranking rule.
- The set of document candidates is determined greedily from the beginning. It is often referred as the "universe" in the code.
- The ranking rules `proximity`, `attribute`, `typo`, and (maybe) `exactness` are or will be implemented using a K-shortest path graph algorithm. This minimises the number of database and bitmap operations we need to do to compute each ranking rule bucket. It also simplifies the code a lot since a lot of ranking rules will share a large part of their implementation.
- Pointers to database values are stored in a cache to avoid searching in the LMDB databases needlessly.
- The result of some roaring bitmap operations are also stored in a cache, although we'll need to measure the memory pressure this puts on the system and maybe deactivate this cache later on.
- Search requests can be visually logged and debugged in tests.
TODO:
- [ ] Reintroduce search benchmarks
- [x] Implement `disableOnWords` and `disableOnAttributes` settings of typo tolerance
- [x] Implement "exhaustive number of hits
- [x] Implement `attribute` ranking rule
- [x] Indexing changes: split into `word_fid_docids` and `word_position_docids` (with bucketed position)
- [x] Ranking rule implementations
- [ ] Implement `exactness` ranking rule
- [x] Initial implementation
- [ ] Correct implementation when followed by `Words`
- [ ] Implement `geosort` ranking rule
- [ ] Add tests
- [x] Typo tolerance `disableOnWords`/`disableOnAttributes`
- [ ] Geosort
- [x] Exactness
- [ ] Attribute/Position
- [ ] Interactions between ranking rules:
- [x] Typo/Proximity/Attribute not preceded by Words
- [x] Exactness not preceded by Words
- [x] Exactness -> Words (+ check universe correctness)
- [x] Exactness -> Typo, etc.
- [ ] Sort -> Words (performance tests)
- [ ] Attribute/Position -> Typo
- [ ] Attribute/Position -> Proximity
- [x] Typo -> Exactness
- [x] Typo -> Proximity
- [x] Proximity -> Typo
- [x] Words
- [x] Typo
- [x] Proximity
- [x] Sort
- [x] Ngrams
- [x] Split words
- [x] Ngram + Split Words
- [x] Term matching strategy
- [x] Distinct attribute
- [x] Phrase Search
- [x] Placeholder search
- [x] Highlighter
- [x] Limit the number of word derivations in a search query
- [x] Compute the initial universe correctly according to the terms matching strategy
- [x] Implement placeholder search
- [x] Get the list of ranking rules from the settings
- [x] Implement `distinct`
- [x] Determine what to do when one of `attribute`, `proximity`, `typo`, or `exactness` is placed before `words`
- [x] Make sure the correct number of allowed typos is used for each word, including the prefix one
- [x] Make sure stop words are treated correctly (e.g. correct position in query graph), including in phrases
- [x] Support phrases correctly
- [x] Support synonyms
- [x] Support split words
- [x] Support combination of ngram + split-words (e.g. `whiteh orse` -> `"white horse"`)
- [x] Implement `typo` ranking rule
- [x] Implement `sort` ranking rule
- [x] Use existing `Search` interface to use the new search algorithms
- [x] Remove old code
Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com>