3525: Fix phrase search containing stop words r=ManyTheFish a=ManyTheFish
# Summary
A search with a phrase containing only stop words was returning an HTTP error 500,
this PR filters the phrase containing only stop words dropping them before the search starts, a query with a phrase containing only stop words now behaves like a placeholder search.
fixes https://github.com/meilisearch/meilisearch/issues/3521
related v1.0.2 PR on milli: https://github.com/meilisearch/milli/pull/779
Co-authored-by: ManyTheFish <many@meilisearch.com>
3347: Enhance language detection r=irevoire a=ManyTheFish
## Summary
Some completely unrelated Languages can share the same characters, in Meilisearch we detect the Languages using `whatlang`, which works well on large texts but fails on small search queries leading to a bad segmentation and normalization of the query.
This PR now stores the Languages detected during the indexing in order to reduce the Languages list that can be detected during the search.
## Detail
- Create a 19th database mapping the scripts and the Languages detected with the documents where the Language is detected
- Fill the newly created database during indexing
- Create an allow-list with this database and pass it to Charabia
- Add a test ensuring that a Japanese request containing kanjis only is detected as Japanese and not Chinese
## Related issues
Fixes#2403Fixes#3513
Co-authored-by: f3r10 <frledesma@outlook.com>
Co-authored-by: ManyTheFish <many@meilisearch.com>
Co-authored-by: Many the fish <many@meilisearch.com>
3505: Csv delimiter r=irevoire a=irevoire
Fixes https://github.com/meilisearch/meilisearch/issues/3442
Closes https://github.com/meilisearch/meilisearch/pull/2803
Specified in https://github.com/meilisearch/specifications/pull/221
This PR is a reimplementation of https://github.com/meilisearch/meilisearch/pull/2803, on the new engine. Thanks for your idea and initial PR `@MixusMinimax;` sorry I couldn’t update/merge your PR. Way too many changes happened on the engine in the meantime.
**Attention to reviewer**; I had to update deserr to implement the support of deserializing `char`s
-------
It introduces four new error messages;
- Invalid value in parameter csvDelimiter: expected a string of one character, but found an empty string
- Invalid value in parameter csvDelimiter: expected a string of one character, but found the following string of 5 characters: doggo
- csv delimiter must be an ascii character. Found: 🍰
- The Content-Type application/json does not support the use of a csv delimiter. The csv delimiter can only be used with the Content-Type text/csv.
And one error code;
- `invalid_index_csv_delimiter`
The `invalid_content_type` error code is now also used when we encounter the `csvDelimiter` query parameter with a non-csv content type.
Co-authored-by: Tamo <tamo@meilisearch.com>
3319: Transparently resize indexes on MaxDatabaseSizeReached errors r=Kerollmops a=dureuill
# Pull Request
## Related issue
Related to https://github.com/meilisearch/meilisearch/discussions/3280, depends on https://github.com/meilisearch/milli/pull/760
## What does this PR do?
### User standpoint
- Meilisearch no longer fails tasks that encounter the `milli::UserError(MaxDatabaseSizeReached)` error.
- Instead, these tasks are retried after increasing the maximum size allocated to the index where the failure occurred.
### Implementation standpoint
- Add `Batch::index_uid` to get the `index_uid` of a batch of task if there is one
- `IndexMapper::create_or_open_index` now takes an additional `size` argument that allows to (re)open indexes with a size different from the base `IndexScheduler::index_size` field
- `IndexScheduler::tick` now returns a `Result<TickOutcome>` instead of a `Result<usize>`. This offers more explicit control over what the behavior should be wrt the next tick.
- Add `IndexStatus::BeingResized` that contains a handle that a thread can use to await for the resize operation to complete and the index to be available again.
- Add `IndexMapper::resize_index` to increase the size of an index.
- In `IndexScheduler::tick`, intercept task batches that failed due to `MaxDatabaseSizeReached` and resize the index that caused the error, then request a new tick that will eventually handle the still enqueued task.
## Testing the PR
The following diff can be applied to this branch to make testing the PR easier:
<details>
```diff
diff --git a/index-scheduler/src/index_mapper.rs b/index-scheduler/src/index_mapper.rs
index 553ab45a..022b2f00 100644
--- a/index-scheduler/src/index_mapper.rs
+++ b/index-scheduler/src/index_mapper.rs
`@@` -228,13 +228,15 `@@` impl IndexMapper {
drop(lock);
+ std:🧵:sleep_ms(2000);
+
let current_size = index.map_size()?;
let closing_event = index.prepare_for_closing();
- log::info!("Resizing index {} from {} to {} bytes", name, current_size, current_size * 2);
+ log::error!("Resizing index {} from {} to {} bytes", name, current_size, current_size * 2);
closing_event.wait();
- log::info!("Resized index {} from {} to {} bytes", name, current_size, current_size * 2);
+ log::error!("Resized index {} from {} to {} bytes", name, current_size, current_size * 2);
let index_path = self.base_path.join(uuid.to_string());
let index = self.create_or_open_index(&index_path, None, 2 * current_size)?;
`@@` -268,8 +270,10 `@@` impl IndexMapper {
match index {
Some(Available(index)) => break index,
Some(BeingResized(ref resize_operation)) => {
+ log::error!("waiting for resize end");
// Deadlock: no lock taken while doing this operation.
resize_operation.wait();
+ log::error!("trying our luck again!");
continue;
}
Some(BeingDeleted) => return Err(Error::IndexNotFound(name.to_string())),
diff --git a/index-scheduler/src/lib.rs b/index-scheduler/src/lib.rs
index 11b17d05..242dc095 100644
--- a/index-scheduler/src/lib.rs
+++ b/index-scheduler/src/lib.rs
`@@` -908,6 +908,7 `@@` impl IndexScheduler {
///
/// Returns the number of processed tasks.
fn tick(&self) -> Result<TickOutcome> {
+ log::error!("ticking!");
#[cfg(test)]
{
*self.run_loop_iteration.write().unwrap() += 1;
diff --git a/meilisearch/src/main.rs b/meilisearch/src/main.rs
index 050c825a..63f312f6 100644
--- a/meilisearch/src/main.rs
+++ b/meilisearch/src/main.rs
`@@` -25,7 +25,7 `@@` fn setup(opt: &Opt) -> anyhow::Result<()> {
#[actix_web::main]
async fn main() -> anyhow::Result<()> {
- let (opt, config_read_from) = Opt::try_build()?;
+ let (mut opt, config_read_from) = Opt::try_build()?;
setup(&opt)?;
`@@` -56,6 +56,8 `@@` We generated a secure master key for you (you can safely copy this token):
_ => (),
}
+ opt.max_index_size = byte_unit::Byte::from_str("1MB").unwrap();
+
let (index_scheduler, auth_controller) = setup_meilisearch(&opt)?;
#[cfg(all(not(debug_assertions), feature = "analytics"))]
```
</details>
Mainly, these debug changes do the following:
- Set the default index size to 1MiB so that index resizes are initially frequent
- Turn some logs from info to error so that they can be displayed with `--log-level ERROR` (hiding the other infos)
- Add a long sleep between the beginning and the end of the resize so that we can observe the `BeingResized` index status (otherwise it would never come up in my tests)
## Open questions
- Is the growth factor of x2 the correct solution? For a `Vec` in memory it makes sense, but here we're manipulating quantities that are potentially in the order of 500GiBs. For bigger indexes it may make more sense to add at most e.g. 100GiB on each resize operation, avoiding big steps like 500GiB -> 1TiB.
## PR checklist
Please check if your PR fulfills the following requirements:
- [ ] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
- [ ] Have you read the contributing guidelines?
- [ ] Have you made sure that the title is accurate and descriptive of the changes?
Thank you so much for contributing to Meilisearch!
3470: Autobatch addition and deletion r=irevoire a=irevoire
This PR adds the capability to meilisearch to batch document addition and deletion together.
Fix https://github.com/meilisearch/meilisearch/issues/3440
--------------
Things to check before merging;
- [x] What happens if we delete multiple time the same documents -> add a test
- [x] If a documentDeletion gets batched with a documentAddition but the index doesn't exist yet? It should not work
Co-authored-by: Louis Dureuil <louis@meilisearch.com>
Co-authored-by: Tamo <tamo@meilisearch.com>
3499: Use the workspace inheritance r=Kerollmops a=irevoire
Use the workspace inheritance [introduced in rust 1.64](https://blog.rust-lang.org/2022/09/22/Rust-1.64.0.html#cargo-improvements-workspace-inheritance-and-multi-target-builds).
It allows us to define the version of meilisearch once in the main `Cargo.toml` and let all the other `Cargo.toml` uses this version.
`@curquiza` I added you as a reviewer because I had to patch some CI scripts
And `@Kerollmops,` I had to bump the `cargo_toml` crates because our version was getting old and didn't support the feature yet.
Also, in another PR, I would like to unify some of our dependencies to ensure we always stay in sync between all our crates.
Co-authored-by: Tamo <tamo@meilisearch.com>
3490: Fix attributes set candidates r=curquiza a=ManyTheFish
# Pull Request
Fix attributes set candidates for v1.1.0
## details
The attribute criterion was not returning the remaining candidates when its internal algorithm was been exhausted.
We had a loss of candidates by the attribute criterion leading to the bug reported in the issue linked below.
After some investigation, it seems that it was the only criterion that had this behavior.
We are now returning the remaining candidates instead of an empty bitmap.
## Related issue
Fixes#3483
PR on milli for v1.0.1: https://github.com/meilisearch/milli/pull/777
Co-authored-by: ManyTheFish <many@meilisearch.com>
3492: Bump deserr r=Kerollmops a=irevoire
Bump deserr to the latest version;
- We now use the default actix-web extractors that deserr provides (which were copy/pasted from meilisearch)
- We also use the default `JsonError` message provided by deserr instead of defining our own in meilisearch
- Finally, we get the new `did you mean?` error message. Fix#3493
Co-authored-by: Tamo <tamo@meilisearch.com>
3461: Bring v1 changes into main r=curquiza a=Kerollmops
Also bring back changes in milli (the remote repository) into main done during the pre-release
Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com>
Co-authored-by: bors[bot] <26634292+bors[bot]@users.noreply.github.com>
Co-authored-by: curquiza <curquiza@users.noreply.github.com>
Co-authored-by: Tamo <tamo@meilisearch.com>
Co-authored-by: Philipp Ahlner <philipp@ahlner.com>
Co-authored-by: Kerollmops <clement@meilisearch.com>
776: Reduce incremental indexing time of `words_prefix_position_docids` DB r=curquiza a=loiclec
Fixes partially https://github.com/meilisearch/milli/issues/605
The `words_prefix_position_docids` can easily contain millions of entries. Thus, iterating
over it can be very expensive. But we do so needlessly for every document addition tasks.
It can sometimes cause indexing performance issues when :
- a user sends many `documentAdditionOrUpdate` tasks that cannot be all batched together (for example if they are interspersed with `documentDeletion` tasks)
- the documents contain long, diverse text fields, thus increasing the number of entries in `words_prefix_position_docids`
- the index has accumulated many soft-deleted documents, further increasing the size of `words_prefix_position_docids`
- the machine running Meilisearch does not have great IO performance (e.g. slow SSD, or quota-limited by the cloud provider)
Note, before approving the PR: the only changed file should be `milli/src/update/words_prefix_position_docids.rs`.
Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com>
774: Update version for the next release (v0.41.1) in Cargo.toml files r=curquiza a=meili-bot
⚠️ This PR is automatically generated. Check the new version is the expected one before merging.
Co-authored-by: curquiza <curquiza@users.noreply.github.com>
This database can easily contain millions of entries. Thus, iterating
over it can be very expensive.
For regular `documentAdditionOrUpdate` tasks, `del_prefix_fst_words`
will always be empty. Thus, we can save a significant amount of time
by adding this `if !del_prefix_fst_words.is_empty()` condition.
The code's behaviour remains completely unchanged.
763: Fixes error message when lat and lng are unparseable r=loiclec a=ahlner
# Pull Request
## Related issue
Fixes partially [#3007](https://github.com/meilisearch/meilisearch/issues/3007)
## What does this PR do?
- Changes function validate_geo_from_json to return a BadLatitudeAndLongitude if lat or lng is a string and not parseable to f64
- implemented some unittests
- Derived PartialEq for GeoError to use assert_eq! in tests
## PR checklist
Please check if your PR fulfills the following requirements:
- [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
- [x] Have you read the contributing guidelines?
- [x] Have you made sure that the title is accurate and descriptive of the changes?
Thank you so much for contributing to Meilisearch!
Co-authored-by: Philipp Ahlner <philipp@ahlner.com>
767: Update version for the next release (v0.39.2) in Cargo.toml files r=curquiza a=meili-bot
⚠️ This PR is automatically generated. Check the new version is the expected one before merging.
Co-authored-by: curquiza <curquiza@users.noreply.github.com>