736: Update charabia r=curquiza a=ManyTheFish
Update Charabia to the last version.
> We are now Romanizing Chinese characters into Pinyin.
> Note that we keep the accent because they are in fact never typed directly by the end-user, moreover, changing an accent leads to a different Chinese character, and I don't have sufficient knowledge to forecast the impact of removing accents in this context.
Co-authored-by: ManyTheFish <many@meilisearch.com>
Displays log message in the form:
```
[2022-12-21T09:19:42Z INFO milli::update::index_documents::enrich] Primary key was not specified in index. Inferred to 'id'
```
By creating snapshots and updating the format of the existing
snapshots. The next commit will apply the fix, which will show
its effects cleanly on the old and new snapshot tests
706: Limit the reindexing caused by updating settings when not needed r=curquiza a=GregoryConrad
## What does this PR do?
When updating index settings using `update::Settings`, sometimes a `reindex` of `update::Settings` is triggered when it doesn't need to be. This PR aims to prevent those unnecessary `reindex` calls.
For reference, here is a snippet from the current `execute` method in `update::Settings`:
```rust
// ...
if stop_words_updated
|| faceted_updated
|| synonyms_updated
|| searchable_updated
|| exact_attributes_updated
{
self.reindex(&progress_callback, &should_abort, old_fields_ids_map)?;
}
```
- [x] `faceted_updated` - looks good as-is ✅
- [x] `stop_words_updated` - looks good as-is ✅
- [x] `synonyms_updated` - looks good as-is ✅
- [x] `searchable_updated` - fixed in this PR
- [x] `exact_attributes_updated` - fixed in this PR
## PR checklist
Please check if your PR fulfills the following requirements:
- [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
- [x] Have you read the contributing guidelines?
- [x] Have you made sure that the title is accurate and descriptive of the changes?
Thank you so much for contributing to Meilisearch!
Co-authored-by: Gregory Conrad <gregorysconrad@gmail.com>
708: Reduce memory usage of the MatchingWords structure r=ManyTheFish a=loiclec
# Pull Request
## Related issue
Fixes (partially) https://github.com/meilisearch/meilisearch/issues/3115
## What does this PR do?
1. Reduces the memory usage caused by the creation of a 10-word query tree by 20x.
This is done by deduplicating the `MatchingWord` values, which are heavy because of their inner DFA. The deduplication works by wrapping each `MatchingWord` in a reference-counted box and using a hash map to determine whether a `MatchingWord` DFA already exists for a certain signature, or whether a new one needs to be built.
2. Avoid the worst-case scenario of creating a `MatchingWord` for extremely long words that cannot be indexed by milli.
Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com>
697: Fix bug in prefix DB indexing r=loiclec a=loiclec
Where the batch's information was not properly updated in cases where only the proximity changed between two consecutive word pair proximities.
Closes partially https://github.com/meilisearch/meilisearch/issues/3043
Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com>
1. Handle keys with variable length correctly
This fixes https://github.com/meilisearch/meilisearch/issues/3042 and
is easily reproducible with the updated fuzz tests, which now generate
keys with variable lengths.
2. Prevent adding facets to the database if their encoded value does
not satisfy `valid_lmdb_key`.
This fixes an indexing failure when a document had a filterable
attribute containing a value whose length is higher than ~500 bytes.