meilisearch

mirror of https://github.com/meilisearch/meilisearch.git synced 2024-11-23 10:37:41 +08:00

Author	SHA1	Message	Date
Loïc Lecrenier	998d46ac10	Add support for search offset and limit	2023-03-20 09:41:56 +01:00
Loïc Lecrenier	6c85c0d95e	Fix more bugs + visual empty path cache logging	2023-03-20 09:41:56 +01:00
Loïc Lecrenier	0e1fbbf7c6	Fix bugs in query graph's "remove word" and "cheapest paths" algos	2023-03-20 09:41:56 +01:00
Loïc Lecrenier	6806640ef0	Fix d2 description of paths map	2023-03-20 09:41:56 +01:00
Loïc Lecrenier	173e37584c	Improve the visual/detailed search logger	2023-03-20 09:41:55 +01:00
Loïc Lecrenier	6ba4d5e987	Add a search logger	2023-03-20 09:41:55 +01:00
Loïc Lecrenier	dd12d44134	Support swapped word pairs in new proximity ranking rule impl	2023-03-20 09:41:55 +01:00
Loïc Lecrenier	c8e251bf24	Remove noise in codebase	2023-03-20 09:41:55 +01:00
Loïc Lecrenier	a938fbde4a	Use a cache when resolving the query graph	2023-03-20 09:41:55 +01:00
Loïc Lecrenier	dcf3f1d18a	Remove EdgeIndex and NodeIndex types, prefer u32 instead	2023-03-20 09:41:55 +01:00
Loïc Lecrenier	66d0c63694	Add some documentation and use bitmaps instead of hashmaps when possible	2023-03-20 09:41:55 +01:00
Loïc Lecrenier	132191360b	Introduce the sort ranking rule working with the new search structures	2023-03-20 09:41:55 +01:00
Loïc Lecrenier	345c99d5bd	Introduce the words ranking rule working with the new search structures	2023-03-20 09:41:55 +01:00
Loïc Lecrenier	89d696c1e3	Introduce the proximity ranking rule as a graph-based ranking rule	2023-03-20 09:41:55 +01:00
Loïc Lecrenier	c645853529	Introduce a generic graph-based ranking rule	2023-03-20 09:41:55 +01:00
Loïc Lecrenier	a70ab8b072	Introduce a function to find the K shortest paths in a graph	2023-03-20 09:41:55 +01:00
Loïc Lecrenier	48aae76b15	Introduce a function to find the docids of a set of paths in a graph	2023-03-20 09:41:55 +01:00
Loïc Lecrenier	23bf572dea	Introduce cache structures used with ranking rule graphs	2023-03-20 09:41:55 +01:00
Loïc Lecrenier	864f6410ed	Introduce a structure to represent a set of graph paths efficiently	2023-03-20 09:41:55 +01:00
Loïc Lecrenier	c9bf6bb2fa	Introduce a structure to implement ranking rules with graph algorithms	2023-03-20 09:41:55 +01:00
Loïc Lecrenier	46249ea901	Implement a function to find a QueryGraph's docids	2023-03-20 09:41:55 +01:00
Loïc Lecrenier	ce0d1e0e13	Introduce a common way to manage the coordination between ranking rules	2023-03-20 09:41:55 +01:00
Loïc Lecrenier	5065d8b0c1	Introduce a DatabaseCache to memorize the addresses of LMDB values	2023-03-20 09:41:55 +01:00
Loïc Lecrenier	a83007c013	Introduce structure to represent search queries as graphs	2023-03-20 09:41:55 +01:00
Loïc Lecrenier	79e0a6dd4e	Introduce a new search module, eventually meant to replace the old one The code here does not compile, because I am merely splitting one giant commit into smaller ones where each commit explains a single file.	2023-03-20 09:41:55 +01:00
Loïc Lecrenier	2d88089129	Remove unused term matching strategies	2023-03-20 09:41:55 +01:00
Loïc Lecrenier	6c659dc12f	Use MiMalloc in milli tests	2023-03-20 09:41:37 +01:00
Tamo	0f33a65468	makes kero happy	2023-03-13 16:51:11 +01:00
bors[bot]	fb1260ee88	Merge #3568 #3569 3568: CI: Fix `publish-aarch64` job that still uses ubuntu-18.04 r=Kerollmops a=curquiza Fixes #3563 Main change - add the usage of the `ubuntu-18.04` container instead of the native `ubuntu-18.04` of GitHub actions: I had to install docker in the container. Small additional changes - remove useless `fail-fast` and unused/irrelevant matrix inputs (`build`, `linker`, `os`, `use-cross`...) - Remove useless step in job Proof of work with this CI triggered on this current branch: https://github.com/meilisearch/meilisearch/actions/runs/4366233882 3569: Enhance Japanese language detection r=dureuill a=ManyTheFish # Pull Request This PR is a prototype and can be tested by downloading [the dedicated docker image](https://hub.docker.com/layers/getmeili/meilisearch/prototype-better-language-detection-0/images/sha256-a12847de00e21a71ab797879fd09777dadcb0881f65b5f810e7d1ed434d116ef?context=explore): ```bash $ docker pull getmeili/meilisearch:prototype-better-language-detection-0 ``` ## Context Some Languages are harder to detect than others, this miss-detection leads to bad tokenization making some words or even documents completely unsearchable. Japanese is the main Language affected and can be detected as Chinese which has a completely different way of tokenization. A [first iteration has been implemented for v1.1.0](https://github.com/meilisearch/meilisearch/pull/3347) but is an insufficient enhancement to make Japanese work. This first implementation was detecting the Language during the indexing to avoid bad detections during the search. Unfortunately, some documents (shorter ones) can be wrongly detected as Chinese running bad tokenization for these documents and making possible the detection of Chinese during the search because it has been detected during the indexing. For instance, a Japanese document `{"id": 1, "name": "東京スカパラダイスオーケストラ"}` is detected as Japanese during indexing, during the search the query `東京` will be detected as Japanese because only Japanese documents have been detected during indexing despite the fact that v1.0.2 would detect it as Chinese. However if in the dataset there is at least one document containing a field with only Kanjis like: _A document with only 1 field containing only Kanjis:_ ```json { "id":4, "name": "東京特許許可局" } ``` _A document with 1 field containing only Kanjis and 1 field containing several Japanese characters:_ ```json { "id":105, "name": "東京特許許可局", "desc": "日経平均株価は26日に約8カ月ぶりに2万4000円の心理的な節目を上回った。株高を支える材料のひとつは、自民党総裁選で3選を決めた安倍晋三首相の経済政策への期待だ。恩恵が見込まれるとされる人材サービスや建設株の一角が買われている。ただ思惑が先行して資金が集まっている面は否めない。実際に政策効果を取り込む企業はどこか、なお未知数だ。" } ``` Then, in both cases, the field `name` will be detected as Chinese during indexing allowing the search to detect Chinese in queries. Therefore, the query `東京` will be detected as Chinese and only the two last documents will be retrieved by Meilisearch. ## Technical Approach The current PR partially fixes these issues by: 1) Adding a check over potential miss-detections and rerunning the extraction of the document forcing the tokenization over the main Languages detected in it. > 1) run a first extraction allowing the tokenizer to detect any Language in any Script > 2) generate a distribution of tokens by Script and Languages (`script_language`) > 3) if for a Script we have a token distribution of one of the Language that is under the threshold, then we rerun the extraction forbidding the tokenizer to detect the marginal Languages > 4) the tokenizer will fall back on the other available Languages to tokenize the text. For example, if the Chinese were marginally detected compared to the Japanese on the CJ script, then the second extraction will force Japanese tokenization for CJ text in the document. however, the text on another script like Latin will not be impacted by this restriction. 2) Adding a filtering threshold during the search over Languages that have been marginally detected in documents ## Limits This PR introduces 2 arbitrary thresholds: 1) during the indexing, a Language is considered miss-detected if the number of detected tokens of this Language is under 10% of the tokens detected in the same Script (Japanese and Chinese are 2 different Languages sharing the "same" script "CJK"). 2) during the search, a Language is considered marginal if less than 5% of documents are detected as this Language. This PR only partially fixes these issues: - ✅ the query `東京` now find Japanese documents if less than 5% of documents are detected as Chinese. - ✅ the document with the id `105` containing the Japanese field `desc` but the miss-detected field `name` is now completely detected and tokenized as Japanese and is found with the query `東京`. - ❌ the document with the id `4` no longer breaks the search Language detection but continues to be detected as a Chinese document and can't be found during the search. ## Related issue Fixes #3565 ## Possible future enhancements - Change or contribute to the Library used to detect the Language - the related issue on Whatlang: https://github.com/greyblake/whatlang-rs/issues/122 Co-authored-by: curquiza <clementine@meilisearch.com> Co-authored-by: ManyTheFish <many@meilisearch.com> Co-authored-by: Many the fish <many@meilisearch.com>	2023-03-09 15:34:35 +00:00
ManyTheFish	2f8eb4f54a	last PR fixes	2023-03-09 15:34:36 +01:00
Clément Renault	175e8a8495	Fix a diacritic issue	2023-03-09 14:57:47 +01:00
Tamo	eddefb0e0f	refactor the error type of the milli::document thing silence a warning	2023-03-09 13:03:14 +01:00
ManyTheFish	5deea631ea	fix clippy too many arguments	2023-03-09 11:19:13 +01:00
Tamo	c5f22be6e1	add boolean support for csv documents	2023-03-09 11:12:49 +01:00
ManyTheFish	b4b859ec8c	Fix typos	2023-03-09 10:58:35 +01:00
ManyTheFish	24c0775c67	Change indexing threshold	2023-03-08 12:36:04 +01:00
ManyTheFish	3092cf0448	Fix clippy errors	2023-03-08 10:53:42 +01:00
ManyTheFish	37d4551e8e	Add a threshold filtering the Languages allowed to be detected at search time	2023-03-07 19:38:01 +01:00
ManyTheFish	da48506f15	Rerun extraction when language detection might have failed	2023-03-07 18:35:26 +01:00
bors[bot]	4f1ccbc495	Merge #3525 3525: Fix phrase search containing stop words r=ManyTheFish a=ManyTheFish # Summary A search with a phrase containing only stop words was returning an HTTP error 500, this PR filters the phrase containing only stop words dropping them before the search starts, a query with a phrase containing only stop words now behaves like a placeholder search. fixes https://github.com/meilisearch/meilisearch/issues/3521 related v1.0.2 PR on milli: https://github.com/meilisearch/milli/pull/779 Co-authored-by: ManyTheFish <many@meilisearch.com>	2023-03-02 10:55:37 +00:00
ManyTheFish	37489fd495	Return an internal error in the case of matching word is invalid	2023-03-01 19:05:16 +01:00
Louis Dureuil	5822764be9	Skip computing index budget in tests	2023-02-23 11:23:39 +01:00
bors[bot]	ac5a1e4c4b	Merge #3423 3423: Add min and max facet stats r=dureuill a=dureuill # Pull Request ## Related issue Fixes #3426 ## What does this PR do? ### User standpoint - When using a `facets` parameter in search, the facets that have numeric values are displayed in a new section of the response called `facetStats` that contains, per facet, the numeric min and max value of the hits returned by the search. <details> <summary> Sample request/response </summary> ```json ❯ curl \ -X POST 'http://localhost:7700/indexes/meteorites/search?facets=mass' \ -H 'Content-Type: application/json' \ --data-binary '{ "q": "LL6", "facets":["mass", "recclass"], "limit": 5 }' \| jsonxf { "hits": [ { "name": "Niger (LL6)", "id": "16975", "nametype": "Valid", "recclass": "LL6", "mass": 3.3, "fall": "Fell" }, { "name": "Appley Bridge", "id": "2318", "nametype": "Valid", "recclass": "LL6", "mass": 15000, "fall": "Fell", "_geo": { "lat": 53.58333, "lng": -2.71667 } }, { "name": "Athens", "id": "4885", "nametype": "Valid", "recclass": "LL6", "mass": 265, "fall": "Fell", "_geo": { "lat": 34.75, "lng": -87.0 } }, { "name": "Bandong", "id": "4935", "nametype": "Valid", "recclass": "LL6", "mass": 11500, "fall": "Fell", "_geo": { "lat": -6.91667, "lng": 107.6 } }, { "name": "Benguerir", "id": "30443", "nametype": "Valid", "recclass": "LL6", "mass": 25000, "fall": "Fell", "_geo": { "lat": 32.25, "lng": -8.15 } } ], "query": "LL6", "processingTimeMs": 15, "limit": 5, "offset": 0, "estimatedTotalHits": 42, "facetDistribution": { "mass": { "110000": 1, "11500": 1, "1161": 1, "12000": 1, "1215.5": 1, "127000": 1, "15000": 1, "1676": 1, "1700": 1, "1710.5": 1, "18000": 1, "19000": 1, "220000": 1, "2220": 1, "22300": 1, "25000": 2, "265": 1, "271000": 1, "2840": 1, "3.3": 1, "3000": 1, "303": 1, "32000": 1, "34000": 1, "36.1": 1, "45000": 1, "460": 1, "478": 1, "483": 1, "5500": 2, "600": 1, "6000": 1, "67.8": 1, "678": 1, "680.5": 1, "6930": 1, "8": 1, "8300": 1, "840": 1, "8400": 1 }, "recclass": { "L/LL6": 3, "LL6": 39 } }, "facetStats": { "mass": { "min": 3.3, "max": 271000.0 } } } ``` </details> ## PR checklist Please check if your PR fulfills the following requirements: - [ ] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)? - [ ] Have you read the contributing guidelines? - [ ] Have you made sure that the title is accurate and descriptive of the changes? Thank you so much for contributing to Meilisearch! Co-authored-by: Louis Dureuil <louis@meilisearch.com>	2023-02-22 13:06:43 +00:00
ManyTheFish	900bae3d9d	keep phrases that has at least one word	2023-02-21 18:16:51 +01:00
ManyTheFish	28b7d73d4a	Remove an unefficient part of a test on milli	2023-02-21 18:16:51 +01:00
bors[bot]	39407885c2	Merge #3347 3347: Enhance language detection r=irevoire a=ManyTheFish ## Summary Some completely unrelated Languages can share the same characters, in Meilisearch we detect the Languages using `whatlang`, which works well on large texts but fails on small search queries leading to a bad segmentation and normalization of the query. This PR now stores the Languages detected during the indexing in order to reduce the Languages list that can be detected during the search. ## Detail - Create a 19th database mapping the scripts and the Languages detected with the documents where the Language is detected - Fill the newly created database during indexing - Create an allow-list with this database and pass it to Charabia - Add a test ensuring that a Japanese request containing kanjis only is detected as Japanese and not Chinese ## Related issues Fixes #2403 Fixes #3513 Co-authored-by: f3r10 <frledesma@outlook.com> Co-authored-by: ManyTheFish <many@meilisearch.com> Co-authored-by: Many the fish <many@meilisearch.com>	2023-02-21 10:52:13 +00:00
ManyTheFish	bbecab8948	fix clippy	2023-02-21 10:18:44 +01:00
ManyTheFish	8aa808d51b	Merge branch 'main' into enhance-language-detection	2023-02-20 18:14:34 +01:00
bors[bot]	1e9ac00800	Merge #3505 3505: Csv delimiter r=irevoire a=irevoire Fixes https://github.com/meilisearch/meilisearch/issues/3442 Closes https://github.com/meilisearch/meilisearch/pull/2803 Specified in https://github.com/meilisearch/specifications/pull/221 This PR is a reimplementation of https://github.com/meilisearch/meilisearch/pull/2803, on the new engine. Thanks for your idea and initial PR `@MixusMinimax;` sorry I couldn’t update/merge your PR. Way too many changes happened on the engine in the meantime. Attention to reviewer; I had to update deserr to implement the support of deserializing `char`s ------- It introduces four new error messages; - Invalid value in parameter csvDelimiter: expected a string of one character, but found an empty string - Invalid value in parameter csvDelimiter: expected a string of one character, but found the following string of 5 characters: doggo - csv delimiter must be an ascii character. Found: 🍰 - The Content-Type application/json does not support the use of a csv delimiter. The csv delimiter can only be used with the Content-Type text/csv. And one error code; - `invalid_index_csv_delimiter` The `invalid_content_type` error code is now also used when we encounter the `csvDelimiter` query parameter with a non-csv content type. Co-authored-by: Tamo <tamo@meilisearch.com>	2023-02-20 17:01:36 +00:00
bors[bot]	b08a49a16e	Merge #3319 #3470 3319: Transparently resize indexes on MaxDatabaseSizeReached errors r=Kerollmops a=dureuill # Pull Request ## Related issue Related to https://github.com/meilisearch/meilisearch/discussions/3280, depends on https://github.com/meilisearch/milli/pull/760 ## What does this PR do? ### User standpoint - Meilisearch no longer fails tasks that encounter the `milli::UserError(MaxDatabaseSizeReached)` error. - Instead, these tasks are retried after increasing the maximum size allocated to the index where the failure occurred. ### Implementation standpoint - Add `Batch::index_uid` to get the `index_uid` of a batch of task if there is one - `IndexMapper::create_or_open_index` now takes an additional `size` argument that allows to (re)open indexes with a size different from the base `IndexScheduler::index_size` field - `IndexScheduler::tick` now returns a `Result<TickOutcome>` instead of a `Result<usize>`. This offers more explicit control over what the behavior should be wrt the next tick. - Add `IndexStatus::BeingResized` that contains a handle that a thread can use to await for the resize operation to complete and the index to be available again. - Add `IndexMapper::resize_index` to increase the size of an index. - In `IndexScheduler::tick`, intercept task batches that failed due to `MaxDatabaseSizeReached` and resize the index that caused the error, then request a new tick that will eventually handle the still enqueued task. ## Testing the PR The following diff can be applied to this branch to make testing the PR easier: <details> ```diff diff --git a/index-scheduler/src/index_mapper.rs b/index-scheduler/src/index_mapper.rs index 553ab45a..022b2f00 100644 --- a/index-scheduler/src/index_mapper.rs +++ b/index-scheduler/src/index_mapper.rs `@@` -228,13 +228,15 `@@` impl IndexMapper { drop(lock); + std:🧵:sleep_ms(2000); + let current_size = index.map_size()?; let closing_event = index.prepare_for_closing(); - log::info!("Resizing index {} from {} to {} bytes", name, current_size, current_size * 2); + log::error!("Resizing index {} from {} to {} bytes", name, current_size, current_size * 2); closing_event.wait(); - log::info!("Resized index {} from {} to {} bytes", name, current_size, current_size * 2); + log::error!("Resized index {} from {} to {} bytes", name, current_size, current_size * 2); let index_path = self.base_path.join(uuid.to_string()); let index = self.create_or_open_index(&index_path, None, 2 * current_size)?; `@@` -268,8 +270,10 `@@` impl IndexMapper { match index { Some(Available(index)) => break index, Some(BeingResized(ref resize_operation)) => { + log::error!("waiting for resize end"); // Deadlock: no lock taken while doing this operation. resize_operation.wait(); + log::error!("trying our luck again!"); continue; } Some(BeingDeleted) => return Err(Error::IndexNotFound(name.to_string())), diff --git a/index-scheduler/src/lib.rs b/index-scheduler/src/lib.rs index 11b17d05..242dc095 100644 --- a/index-scheduler/src/lib.rs +++ b/index-scheduler/src/lib.rs `@@` -908,6 +908,7 `@@` impl IndexScheduler { /// /// Returns the number of processed tasks. fn tick(&self) -> Result<TickOutcome> { + log::error!("ticking!"); #[cfg(test)] { *self.run_loop_iteration.write().unwrap() += 1; diff --git a/meilisearch/src/main.rs b/meilisearch/src/main.rs index 050c825a..63f312f6 100644 --- a/meilisearch/src/main.rs +++ b/meilisearch/src/main.rs `@@` -25,7 +25,7 `@@` fn setup(opt: &Opt) -> anyhow::Result<()> { #[actix_web::main] async fn main() -> anyhow::Result<()> { - let (opt, config_read_from) = Opt::try_build()?; + let (mut opt, config_read_from) = Opt::try_build()?; setup(&opt)?; `@@` -56,6 +56,8 `@@` We generated a secure master key for you (you can safely copy this token): _ => (), } + opt.max_index_size = byte_unit::Byte::from_str("1MB").unwrap(); + let (index_scheduler, auth_controller) = setup_meilisearch(&opt)?; #[cfg(all(not(debug_assertions), feature = "analytics"))] ``` </details> Mainly, these debug changes do the following: - Set the default index size to 1MiB so that index resizes are initially frequent - Turn some logs from info to error so that they can be displayed with `--log-level ERROR` (hiding the other infos) - Add a long sleep between the beginning and the end of the resize so that we can observe the `BeingResized` index status (otherwise it would never come up in my tests) ## Open questions - Is the growth factor of x2 the correct solution? For a `Vec` in memory it makes sense, but here we're manipulating quantities that are potentially in the order of 500GiBs. For bigger indexes it may make more sense to add at most e.g. 100GiB on each resize operation, avoiding big steps like 500GiB -> 1TiB. ## PR checklist Please check if your PR fulfills the following requirements: - [ ] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)? - [ ] Have you read the contributing guidelines? - [ ] Have you made sure that the title is accurate and descriptive of the changes? Thank you so much for contributing to Meilisearch! 3470: Autobatch addition and deletion r=irevoire a=irevoire This PR adds the capability to meilisearch to batch document addition and deletion together. Fix https://github.com/meilisearch/meilisearch/issues/3440 -------------- Things to check before merging; - [x] What happens if we delete multiple time the same documents -> add a test - [x] If a documentDeletion gets batched with a documentAddition but the index doesn't exist yet? It should not work Co-authored-by: Louis Dureuil <louis@meilisearch.com> Co-authored-by: Tamo <tamo@meilisearch.com>	2023-02-20 15:00:19 +00:00
Many the fish	119e6d8811	Update milli/src/search/mod.rs Co-authored-by: Tamo <tamo@meilisearch.com>	2023-02-20 15:33:10 +01:00
ManyTheFish	cb8d5f2d4b	Update Charabia to 0.7.1	2023-02-20 14:00:31 +01:00
Louis Dureuil	eb28d4c525	add facet test	2023-02-20 13:52:28 +01:00
Louis Dureuil	9ac981d025	Remove some clippy type complexity warns by deboxing iters	2023-02-20 13:52:27 +01:00
Louis Dureuil	74859ecd61	Add min and max facet stats	2023-02-20 13:52:27 +01:00
Louis Dureuil	8ae441a4db	Update usage of iterators	2023-02-20 13:52:27 +01:00
Louis Dureuil	042d86cbb3	facet sort ascending/descending now also return the values	2023-02-20 13:52:27 +01:00
Tamo	18796d6e6a	Consider null as a valid geo object	2023-02-20 13:45:51 +01:00
bors[bot]	28961b2ad1	Merge #3499 3499: Use the workspace inheritance r=Kerollmops a=irevoire Use the workspace inheritance [introduced in rust 1.64](https://blog.rust-lang.org/2022/09/22/Rust-1.64.0.html#cargo-improvements-workspace-inheritance-and-multi-target-builds). It allows us to define the version of meilisearch once in the main `Cargo.toml` and let all the other `Cargo.toml` uses this version. `@curquiza` I added you as a reviewer because I had to patch some CI scripts And `@Kerollmops,` I had to bump the `cargo_toml` crates because our version was getting old and didn't support the feature yet. Also, in another PR, I would like to unify some of our dependencies to ensure we always stay in sync between all our crates. Co-authored-by: Tamo <tamo@meilisearch.com>	2023-02-17 09:52:29 +00:00
Tamo	895ab2906c	apply review suggestions	2023-02-16 18:42:47 +01:00
Tamo	8c074f5028	implements the csv delimiter without tests Co-authored-by: Maxi Barmetler <maxi.barmetler@gmail.com>	2023-02-16 17:35:36 +01:00
bors[bot]	143e3cf948	Merge #3490 3490: Fix attributes set candidates r=curquiza a=ManyTheFish # Pull Request Fix attributes set candidates for v1.1.0 ## details The attribute criterion was not returning the remaining candidates when its internal algorithm was been exhausted. We had a loss of candidates by the attribute criterion leading to the bug reported in the issue linked below. After some investigation, it seems that it was the only criterion that had this behavior. We are now returning the remaining candidates instead of an empty bitmap. ## Related issue Fixes #3483 PR on milli for v1.0.1: https://github.com/meilisearch/milli/pull/777 Co-authored-by: ManyTheFish <many@meilisearch.com>	2023-02-15 17:38:07 +00:00
Tamo	74d1a67a99	Use the workspace inheritance feature of rust 1.64	2023-02-15 13:51:07 +01:00
bors[bot]	91ce8a5e67	Merge #3492 3492: Bump deserr r=Kerollmops a=irevoire Bump deserr to the latest version; - We now use the default actix-web extractors that deserr provides (which were copy/pasted from meilisearch) - We also use the default `JsonError` message provided by deserr instead of defining our own in meilisearch - Finally, we get the new `did you mean?` error message. Fix #3493 Co-authored-by: Tamo <tamo@meilisearch.com>	2023-02-15 10:05:05 +00:00
Tamo	a43765d454	use the pre-defined deserr extractors	2023-02-14 20:05:30 +01:00
Tamo	8fb7b1d10f	bump deserr	2023-02-14 20:04:30 +01:00
Tamo	74dcfe9676	Fix a bug when you update a document that was already present in the db, deleted and then inserted again in the same transform	2023-02-14 19:09:40 +01:00
Tamo	1b1703a609	make a small optimization to merge obkvs a little bit faster	2023-02-14 18:32:41 +01:00
Tamo	fb5e4957a6	fix and test the early exit in case a grenad ends with a deletion	2023-02-14 18:23:57 +01:00
Tamo	8de3c9f737	Update milli/src/update/index_documents/transform.rs Co-authored-by: Clément Renault <clement@meilisearch.com>	2023-02-14 17:57:14 +01:00
Tamo	43a19d0709	document the operation enum + the grenads	2023-02-14 17:55:26 +01:00
Filip Bachul	a53536836b	fmt	2023-02-14 17:04:22 +01:00
Filip Bachul	d7ad39ad77	fix: clippy error	2023-02-14 00:15:35 +01:00
Filip Bachul	849de089d2	add thiserror for AscDescError	2023-02-14 00:15:35 +01:00
filip	7f25007d31	Update milli/src/asc_desc.rs Co-authored-by: Tamo <irevoire@protonmail.ch>	2023-02-14 00:15:35 +01:00
Filip Bachul	c810af3ebf	implement From<ParseGeoError> for AscDescError	2023-02-14 00:15:35 +01:00
Filip Bachul	c0b77773ba	fmt asc_desc	2023-02-14 00:15:35 +01:00
Filip Bachul	7481559e8b	move BadGeo to FilterError	2023-02-14 00:15:35 +01:00
Filip Bachul	83c765ce6c	implement From<ParseGeoError> for FilterError	2023-02-14 00:15:35 +01:00
Filip Bachul	4c91037602	use ParseGeoError in sort parser	2023-02-14 00:15:35 +01:00
Filip Bachul	825923f6fc	export ParseGeoError	2023-02-14 00:15:35 +01:00
Filip Bachul	e405702733	chore: introduce new error ParseGeoError type	2023-02-14 00:15:35 +01:00
ManyTheFish	6fa877efb0	Fix attributes set candidates	2023-02-13 17:49:52 +01:00
Tamo	746b31c1ce	makes clippy happy	2023-02-09 12:23:01 +01:00
Tamo	93db755d57	add a test to ensure we handle correctly a deletion of multiple time the same document	2023-02-08 21:03:34 +01:00
Tamo	93f130a400	fix all warnings	2023-02-08 20:57:35 +01:00
Tamo	421a9cf05e	provide a new method on the transform to remove documents	2023-02-08 16:06:09 +01:00
Tamo	8f64fba1ce	rewrite the current transform to handle a new byte specifying the kind of operation it's merging	2023-02-08 12:53:38 +01:00
bors[bot]	c88c3637b4	Merge #3461 3461: Bring v1 changes into main r=curquiza a=Kerollmops Also bring back changes in milli (the remote repository) into main done during the pre-release Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com> Co-authored-by: bors[bot] <26634292+bors[bot]@users.noreply.github.com> Co-authored-by: curquiza <curquiza@users.noreply.github.com> Co-authored-by: Tamo <tamo@meilisearch.com> Co-authored-by: Philipp Ahlner <philipp@ahlner.com> Co-authored-by: Kerollmops <clement@meilisearch.com>	2023-02-07 11:27:27 +00:00
bors[bot]	97fd9ac493	Merge #3405 3405: Implement geo bounding box r=irevoire a=curquiza Following https://github.com/meilisearch/milli/pull/672 (work from `@gmourier)` Fixes #2761 Co-authored-by: Guillaume Mourier <guillaume@meilisearch.com> Co-authored-by: Louis Dureuil <louis@meilisearch.com> Co-authored-by: Tamo <tamo@meilisearch.com>	2023-02-07 09:55:20 +00:00
bors[bot]	821d92b5d0	Merge #3407 3407: Add Cargo feature for LMDB's POSIX semaphores r=dureuill a=GregoryConrad See https://github.com/meilisearch/milli/pull/757 Co-authored-by: Gregory Conrad <gregorysconrad@gmail.com>	2023-02-07 08:25:20 +00:00
Tamo	42114325cd	Apply suggestions from code review Co-authored-by: Louis Dureuil <louis@meilisearch.com>	2023-02-06 18:07:00 +01:00
Tamo	7a38fe624f	throw an error if the top left corner is found below the bottom right corner	2023-02-06 17:50:47 +01:00
Tamo	1b005f697d	update the syntax of the geoboundingbox filter to uses brackets instead of parens around lat and lng	2023-02-06 16:50:27 +01:00
Kerollmops	fbec48f56e	Merge remote-tracking branch 'milli/main' into bring-v1-changes	2023-02-06 16:48:10 +01:00
Tamo	3ebc99473f	Apply suggestions from code review Co-authored-by: Louis Dureuil <louis@meilisearch.com>	2023-02-06 13:29:37 +01:00
Tamo	d27007005e	comments the geoboundingbox + forbid the usage of the lexeme method which could introduce bugs	2023-02-06 11:36:49 +01:00
Tamo	fcb09ccc3d	add tests on the geoBoundingBox	2023-02-02 18:19:56 +01:00
Louis Dureuil	ae8660e585	Add Token::original_span rather than making Token::span pub	2023-02-02 15:03:34 +01:00
Guillaume Mourier	b297b5deb0	cargo fmt	2023-02-02 12:34:49 +01:00
Guillaume Mourier	0d71c80ba6	add tests	2023-02-02 12:31:27 +01:00
Guillaume Mourier	65a3086cf1	fix test	2023-02-02 12:27:58 +01:00
Guillaume Mourier	426d63b01b	Update insta test suite	2023-02-02 12:27:56 +01:00
Guillaume Mourier	b078477d80	Add error handling and earth lap collision with bounding box	2023-02-02 12:17:38 +01:00
ManyTheFish	0bc1a18f52	Use Languages list detected during indexing at search time	2023-02-01 18:57:43 +01:00
ManyTheFish	643d99e0f9	Add expectancy test	2023-02-01 18:39:54 +01:00
ManyTheFish	064158e4e2	Update test	2023-02-01 15:34:01 +01:00
ManyTheFish	77d32d0ee8	Fix codec deserialization	2023-02-01 15:26:26 +01:00
ManyTheFish	f4569b04ad	Update Charabia version	2023-02-01 15:26:26 +01:00
bors[bot]	758b4acea7	Merge #776 776: Reduce incremental indexing time of `words_prefix_position_docids` DB r=curquiza a=loiclec Fixes partially https://github.com/meilisearch/milli/issues/605 The `words_prefix_position_docids` can easily contain millions of entries. Thus, iterating over it can be very expensive. But we do so needlessly for every document addition tasks. It can sometimes cause indexing performance issues when : - a user sends many `documentAdditionOrUpdate` tasks that cannot be all batched together (for example if they are interspersed with `documentDeletion` tasks) - the documents contain long, diverse text fields, thus increasing the number of entries in `words_prefix_position_docids` - the index has accumulated many soft-deleted documents, further increasing the size of `words_prefix_position_docids` - the machine running Meilisearch does not have great IO performance (e.g. slow SSD, or quota-limited by the cloud provider) Note, before approving the PR: the only changed file should be `milli/src/update/words_prefix_position_docids.rs`. Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com>	2023-01-31 15:52:28 +00:00
bors[bot]	a4e8158239	Merge #774 774: Update version for the next release (v0.41.1) in Cargo.toml files r=curquiza a=meili-bot ⚠️ This PR is automatically generated. Check the new version is the expected one before merging. Co-authored-by: curquiza <curquiza@users.noreply.github.com>	2023-01-31 11:51:42 +00:00
Loïc Lecrenier	a2690ea8d4	Reduce incremental indexing time of `words_prefix_position_docids` DB This database can easily contain millions of entries. Thus, iterating over it can be very expensive. For regular `documentAdditionOrUpdate` tasks, `del_prefix_fst_words` will always be empty. Thus, we can save a significant amount of time by adding this `if !del_prefix_fst_words.is_empty()` condition. The code's behaviour remains completely unchanged.	2023-01-31 11:42:24 +01:00
f3r10	2922c5c899	Fix code format	2023-01-31 11:28:05 +01:00
f3r10	7681be5367	Format code	2023-01-31 11:28:05 +01:00
f3r10	50bc156257	Fix tests	2023-01-31 11:28:05 +01:00
f3r10	d8207356f4	Skip script,language insertion if language is undetected	2023-01-31 11:28:05 +01:00
f3r10	2d58b28f43	Improve script language codec	2023-01-31 11:28:05 +01:00
f3r10	fd60a39f1c	Format code	2023-01-31 11:28:05 +01:00
f3r10	369c05732e	Add test checking if from script_language_docids database were removed deleted docids	2023-01-31 11:28:05 +01:00
f3r10	34d04f3d3f	Filter from script_language_docids database soft deleted documents	2023-01-31 11:28:05 +01:00
f3r10	a27f329e3a	Add tests for checking that detected script and language associated with document(s) were stored during indexing	2023-01-31 11:28:05 +01:00
f3r10	b216ddba63	Delete and clear data from the new database	2023-01-31 11:28:05 +01:00
f3r10	d97fb6117e	Extract and index data	2023-01-31 11:28:05 +01:00
f3r10	c45d1e3610	Create a new database on index and add a specialized codec for it	2023-01-31 11:28:05 +01:00
Louis Dureuil	20f05efb3c	clippy: needless_lifetimes	2023-01-31 11:12:59 +01:00
Louis Dureuil	cbf029f64c	clippy: --fix	2023-01-31 11:12:59 +01:00
curquiza	bffabf9cc6	Update version for the next release (v0.41.1) in Cargo.toml files	2023-01-31 09:56:22 +00:00
Louis Dureuil	3296cf7ae6	clippy: remove needless lifetimes	2023-01-31 09:32:40 +01:00
Louis Dureuil	89675e5f15	clippy: Replace seek 0 by rewind	2023-01-31 09:32:40 +01:00
Tamo	55e8046551	bump milli	2023-01-24 13:52:21 +01:00
Tamo	de3c4f1986	throw an error on unknown fields specified in the _geo field	2023-01-24 12:23:24 +01:00
Gregory Conrad	3f69dd6450	feat: add Cargo feature for LMDB's POSIX semaphores	2023-01-19 12:08:38 -05:00
bors[bot]	1c4b1b3b2d	Merge #770 770: Update deserr v0.3.0 r=irevoire a=ManyTheFish related to https://github.com/meilisearch/meilisearch/issues/3391 Co-authored-by: Many the fish <many@meilisearch.com>	2023-01-19 17:05:56 +00:00
curquiza	abd65d9307	Update version for the next release (v0.40.0) in Cargo.toml files	2023-01-19 16:43:45 +00:00
Many the fish	30fc376713	Update deserr v0.3.0	2023-01-19 17:37:30 +01:00
bors[bot]	3521a3a0b2	Merge #763 763: Fixes error message when lat and lng are unparseable r=loiclec a=ahlner # Pull Request ## Related issue Fixes partially [#3007](https://github.com/meilisearch/meilisearch/issues/3007) ## What does this PR do? - Changes function validate_geo_from_json to return a BadLatitudeAndLongitude if lat or lng is a string and not parseable to f64 - implemented some unittests - Derived PartialEq for GeoError to use assert_eq! in tests ## PR checklist Please check if your PR fulfills the following requirements: - [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)? - [x] Have you read the contributing guidelines? - [x] Have you made sure that the title is accurate and descriptive of the changes? Thank you so much for contributing to Meilisearch! Co-authored-by: Philipp Ahlner <philipp@ahlner.com>	2023-01-19 15:15:46 +00:00
bors[bot]	40a53f8824	Merge #767 767: Update version for the next release (v0.39.2) in Cargo.toml files r=curquiza a=meili-bot ⚠️ This PR is automatically generated. Check the new version is the expected one before merging. Co-authored-by: curquiza <curquiza@users.noreply.github.com>	2023-01-19 14:48:12 +00:00
Philipp Ahlner	f5ca421227	Superfluous test removed	2023-01-19 15:39:21 +01:00
curquiza	3f048927a0	Update version for the next release (v0.39.2) in Cargo.toml files	2023-01-19 14:29:09 +00:00
Louis Dureuil	4fd6fd9bef	Indicate filterable attributes when the user set a non filterable attribute in facet distributions	2023-01-19 12:25:18 +01:00
Philipp Ahlner	a2cd7214f0	Fixes error message when lat/lng are unparseable	2023-01-19 10:10:26 +01:00
ManyTheFish	d1fc42b53a	Use compatibility decomposition normalizer in facets	2023-01-18 15:02:13 +01:00
ManyTheFish	e64571a881	Add test sorting string with diacritics	2023-01-18 14:43:38 +01:00
Philipp Ahlner	497187083b	Add test for bug #3007 : Wrong error message Adds a test for #3007: Wrong error message when lat and lng are unparseable	2023-01-18 13:24:26 +01:00
Clément Renault	1d507c84b2	Fix the formatting	2023-01-17 18:25:55 +01:00
Clément Renault	1b78231e18	Make clippy happy	2023-01-17 18:25:54 +01:00
bors[bot]	0c7d1f761e	Merge #765 765: Update version for the next release (v0.39.1) in Cargo.toml files r=curquiza a=meili-bot ⚠️ This PR is automatically generated. Check the new version is the expected one before merging. Co-authored-by: curquiza <curquiza@users.noreply.github.com>	2023-01-17 11:04:26 +00:00
curquiza	e3d30e28ef	Update version for the next release (v0.39.1) in Cargo.toml files	2023-01-17 10:50:29 +00:00
bors[bot]	63af1e9f28	Merge #764 764: Update deserr to latest version r=irevoire a=loiclec Update deserr to 0.1.5, which changes the `DeserializeFromValue` trait, getting rid of the `default()` method. Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com>	2023-01-17 10:39:36 +00:00
Loïc Lecrenier	f073a86387	Update deserr to latest version	2023-01-17 11:28:19 +01:00
Kerollmops	97005dd505	Bump the milli-imported crates to v1.0.0	2023-01-16 16:29:12 +01:00
Kerollmops	ebb2494879	Add a README to the milli crate	2023-01-16 16:25:12 +01:00
curquiza	9e32ac7cb2	Update version for the next release (v0.39.0) in Cargo.toml files	2023-01-11 15:05:06 +00:00
bors[bot]	302d6cccd7	Merge #761 761: Integrate deserr r=irevoire a=loiclec 1. `Setting<T>` now implements `DeserializeFromValue` 2. The settings now store ranking rules as strongly typed `Criterion` instead of `String`, since the validation of the ranking rules will be done on meilisearch's side from now on Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com>	2023-01-11 14:35:15 +00:00
bors[bot]	21b7d709ad	Merge #759 759: Change primary key inference error messages r=Kerollmops a=dureuill # Pull Request ## Related issue Milli part of https://github.com/meilisearch/meilisearch/issues/3301 ## What does this PR do? - Change error message strings ## PR checklist Please check if your PR fulfills the following requirements: - [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)? - [x] Have you read the contributing guidelines? - [x] Have you made sure that the title is accurate and descriptive of the changes? Thank you so much for contributing to Meilisearch! Co-authored-by: Louis Dureuil <louis@meilisearch.com>	2023-01-11 14:04:25 +00:00
Loïc Lecrenier	02fd06ea0b	Integrate deserr	2023-01-11 13:56:47 +01:00
Louis Dureuil	00746b32c0	Add Index::map_size	2023-01-10 11:16:51 +01:00
Louis Dureuil	be9786bed9	Change primary key inference error messages	2023-01-05 10:40:09 +01:00
bors[bot]	c3f4835e8e	Merge #733 733: Avoid a prefix-related worst-case scenario in the proximity criterion r=loiclec a=loiclec # Pull Request ## Related issue Somewhat fixes (until merged into meilisearch) https://github.com/meilisearch/meilisearch/issues/3118 ## What does this PR do? When a query ends with a word and a prefix, such as: ``` word pr ``` Then we first determine whether `pre` could possibly be in the proximity prefix database before querying it. There are then three possibilities: 1. `pr` is not in any prefix cache because it is not the prefix of many words. We don't query the proximity prefix database. Instead, we list all the word derivations of `pre` through the FST and query the regular proximity databases. 2. `pr` is in the prefix cache but cannot be found in the proximity prefix databases. In this case, we partially disable the proximity ranking rule for the pair `word pre`. This is done as follows: 1. Only find the documents where `word` is in proximity to `pre` exactly (no derivations) 2. Otherwise, assume that their proximity in all the documents in which they coexist is >= 8 3. `pr` is in the prefix cache and can be found in the proximity prefix databases. In this case we simply query the proximity prefix databases. Note that if a prefix is longer than 2 bytes, then it cannot be in the proximity prefix databases. Also, proximities larger than 4 are not present in these databases either. Therefore, the impact on relevancy is: 1. For common prefixes of one or two letters: we no longer distinguish between proximities from 4 to 8 2. For common prefixes of more than two letters: we no longer distinguish between any proximities 3. For uncommon prefixes: nothing changes Regarding (1), it means that these two documents would be considered equally relevant according to the proximity rule for the query `heard pr` (IF `pr` is the prefix of more than 200 words in the dataset): ```json [ { "text": "I heard there is a faster proximity criterion" }, { "text": "I heard there is a faster but less relevant proximity criterion" } ] ``` Regarding (2), it means that two documents would be considered equally relevant according to the proximity rule for the query "faster pro": ```json [ { "text": "I heard there is a faster but less relevant proximity criterion" } { "text": "I heard there is a faster proximity criterion" }, ] ``` But the following document would be considered more relevant than the two documents above: ```json { "text": "I heard there is a faster swimmer who is competing in the pro section of the competition " } ``` Note, however, that this change of behaviour only occurs when using the set-based version of the proximity criterion. In cases where there are fewer than 1000 candidate documents when the proximity criterion is called, this PR does not change anything. --- ## Performance I couldn't use the existing search benchmarks to measure the impact of the PR, but I did some manual tests with the `songs` benchmark dataset. ``` 1. 10x 'a': - 640ms ⟹ 630ms = no significant difference 2. 10x 'b': - set-based: 4.47s ⟹ 7.42 = bad, ~2x regression - dynamic: 1s ⟹ 870 ms = no significant difference 3. 'Someone I l': - set-based: 250ms ⟹ 12 ms = very good, x20 speedup - dynamic: 21ms ⟹ 11 ms = good, x2 speedup 4. 'billie e': - set-based: 623ms ⟹ 2ms = very good, x300 speedup - dynamic: ~4ms ⟹ 4ms = no difference 5. 'billie ei': - set-based: 57ms ⟹ 20ms = good, ~2x speedup - dynamic: ~4ms ⟹ ~2ms. = no significant difference 6. 'i am getting o' - set-based: 300ms ⟹ 60ms = very good, 5x speedup - dynamic: 30ms ⟹ 6ms = very good, 5x speedup 7. 'prologue 1 a 1: - set-based: 3.36s ⟹ 120ms = very good, 30x speedup - dynamic: 200ms ⟹ 30ms = very good, 6x speedup 8. 'prologue 1 a 10': - set-based: 590ms ⟹ 18ms = very good, 30x speedup - dynamic: 82ms ⟹ 35ms = good, ~2x speedup ``` Performance is often significantly better, but there is also one regression in the set-based implementation with the query `b b b b b b b b b b`. Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com>	2023-01-04 09:00:50 +00:00
bors[bot]	49f58b2c47	Merge #732 732: Interpret synonyms as phrases r=loiclec a=loiclec # Pull Request ## Related issue Fixes (when merged into meilisearch) https://github.com/meilisearch/meilisearch/issues/3125 ## What does this PR do? We now map multi-word synonyms to phrases instead of loose words. Such that the request: ``` btw I am going to nyc soon ``` is interpreted as (when the synonym interpretation is chosen for both `btw` and `nyc`): ``` "by the way" I am going to "New York City" soon ``` instead of: ``` by the way I am going to New York City soon ``` This prevents queries containing multi-word synonyms to exceed to word length limit and degrade the search performance. In terms of relevancy, there is a debate to have. I personally think this could be considered an improvement, since it would be strange for a user to search for: ``` good DIY project ``` and have a result such as: ``` { "text": "whether it is a good project to do, you'll have to decide for yourself" } ``` However, for synonyms such as `NYC -> New York City`, then we will stop matching documents where `New York` is separated from `City`. This is however solvable by adding an additional mapping: `NYC -> New York`. ## Performance With the old behaviour, some long search requests making heavy uses of synonyms could take minutes to be executed. This is no longer the case, these search requests now take an average amount of time to be resolved. Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com>	2023-01-04 08:34:18 +00:00
bors[bot]	6a10e85707	Merge #736 736: Update charabia r=curquiza a=ManyTheFish Update Charabia to the last version. > We are now Romanizing Chinese characters into Pinyin. > Note that we keep the accent because they are in fact never typed directly by the end-user, moreover, changing an accent leads to a different Chinese character, and I don't have sufficient knowledge to forecast the impact of removing accents in this context. Co-authored-by: ManyTheFish <many@meilisearch.com>	2023-01-03 15:44:41 +00:00
bors[bot]	9519e60f97	Merge #709 709: Optimise the `ExactWords` sub-criterion within `Exactness` r=loiclec a=loiclec # Pull Request ## Related issue Fixes (partially) https://github.com/meilisearch/meilisearch/issues/3116 ## What does this PR do? 1. Reduces the algorithmic complexity of finding the documents containing N exact words from something that is exponential to something that is polynomial. 2. Cache intermediary results between different calls to the `exactness` criterion. ## Performance Results On the `smol_songs.csv` dataset, a request containing 10 common words now takes about 60ms instead of 5 seconds to execute. For example, this is the case with this (admittedly nonsensical) request: `Rock You Hip Hop Folk World Country Electronic Love The`. Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com>	2023-01-02 12:28:30 +00:00
Loïc Lecrenier	b5df889dcb	Apply review suggestions: simplify implementation of exactness criterion	2023-01-02 13:11:47 +01:00
Loïc Lecrenier	8d36570958	Add explicit criterion impl strategy to proximity search tests	2023-01-02 10:37:01 +01:00
Loïc Lecrenier	32c6062e65	Optimise exactness criterion 1. Cache some results between calls to next() 2. Compute the combinations of exact words more efficiently	2022-12-22 12:28:45 +01:00
Loïc Lecrenier	f097aafa1c	Add unit test for prefix handling by the proximity criterion	2022-12-22 12:08:00 +01:00
Loïc Lecrenier	777b387dc4	Avoid a prefix-related worst-case scenario in the proximity criterion	2022-12-22 12:08:00 +01:00
Loïc Lecrenier	b0f3dc2c06	Interpret synonyms as phrases	2022-12-22 12:07:51 +01:00
Louis Dureuil	4b166bea2b	Add primary_key_inference test	2022-12-21 15:13:38 +01:00
Louis Dureuil	5943100754	Fix existing tests	2022-12-21 15:13:38 +01:00
Louis Dureuil	b24def3281	Add logging when inference took place. Displays log message in the form: ``` [2022-12-21T09:19:42Z INFO milli::update::index_documents::enrich] Primary key was not specified in index. Inferred to 'id' ```	2022-12-21 15:13:38 +01:00
Louis Dureuil	402dcd6b2f	Simplify primary key inference	2022-12-21 15:13:38 +01:00
Louis Dureuil	13c95d25aa	Remove uses of UserError::MissingPrimaryKey not related to inference	2022-12-21 15:13:36 +01:00
bors[bot]	a8defb585b	Merge #742 742: Add a "Criterion implementation strategy" parameter to Search r=irevoire a=loiclec Add a parameter to search requests which determines the implementation strategy of the criteria. This can be either `set-based`, `iterative`, or `dynamic` (ie choosing between set-based or iterative at search time). See https://github.com/meilisearch/milli/issues/755 for more context about this change. Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com>	2022-12-21 12:18:49 +00:00
Loïc Lecrenier	339a4b0789	Make clippy happy	2022-12-21 12:49:34 +01:00
Loïc Lecrenier	229405aeb9	Choose implementation strategy of criterion at runtime	2022-12-21 09:29:39 +01:00
Loïc Lecrenier	fc0e7382fe	Fix hard-deletion of an external id that was soft-deleted	2022-12-20 15:33:31 +01:00
bors[bot]	97fb64e40e	Merge #747 747: Soft-deletion computation no longer depends on the mapsize r=irevoire a=dureuill # Pull Request ## Related issue Related to https://github.com/meilisearch/meilisearch/issues/3231: After removing `--max-index-size`, the `mapsize` will always be unrelated to the actual max size the user wants for their DB, so it doesn't make sense to use these values any longer. This implements solution 2.3 from https://github.com/meilisearch/meilisearch/issues/3231#issuecomment-1348628824 ## What does this PR do? ### User-visible - Soft-deleted are no longer deleted when there is less than 10% of the mapsize available or when they take more than 10% of the mapsize - Instead, they are deleted when they are more soft deleted than regular documents, or when they take more than 1GiB disk space (estimated). ### Implementation standpoint 1. Adds a `DeletionStrategy` struct to replace the boolean `disable_soft_deletion` that we had up until now. This enum allows us to specify that we want "always hard", "always soft", or to use the dynamic soft-deletion strategy (default). 2. Uses the current strategy when deleting documents, with the new heuristics being used in the `DeletionStrategy::Dynamic` variant. 3. Updates the tests to use the appropriate DeletionStrategy whenever needed (one of `AlwaysHard` or `AlwaysSoft` depending on the test) Note to reviewers: this PR is optimized for a commit-by-commit review. ## PR checklist Please check if your PR fulfills the following requirements: - [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)? - [x] Have you read the contributing guidelines? - [x] Have you made sure that the title is accurate and descriptive of the changes? Thank you so much for contributing to Meilisearch! Co-authored-by: Louis Dureuil <louis@meilisearch.com> Co-authored-by: Tamo <tamo@meilisearch.com>	2022-12-19 17:46:18 +00:00
Tamo	69edbf9f6d	Update milli/src/update/delete_documents.rs	2022-12-19 18:23:50 +01:00
curquiza	c72535531b	Update version for the next release (v0.38.0) in Cargo.toml files	2022-12-19 16:35:38 +00:00
Louis Dureuil	916c23e7be	Tests: rename snapshots	2022-12-19 10:07:17 +01:00
Louis Dureuil	ad9937c755	Fix tests after adding DeletionStrategy	2022-12-19 10:07:17 +01:00
Louis Dureuil	171c942282	Soft-deletion computation no longer takes into account the mapsize Implemented solution 2.3 from https://github.com/meilisearch/meilisearch/issues/3231#issuecomment-1348628824	2022-12-19 10:07:17 +01:00
Louis Dureuil	e2ae3b24aa	Hard or soft delete according to the deletion strategy	2022-12-19 10:00:13 +01:00
Louis Dureuil	fc7618d49b	Add DeletionStrategy	2022-12-19 09:49:58 +01:00
ManyTheFish	7f88c4ff2f	Fix #1714 test	2022-12-15 18:22:28 +01:00
ManyTheFish	96d4242b93	Update charabia	2022-12-15 18:22:22 +01:00
bors[bot]	5114686394	Merge #743 743: Fix finite pagination with placeholder search r=Kerollmops a=ManyTheFish this bug is reproducible on real datasets and is hard to isolate in a simple test. related to: https://github.com/meilisearch/meilisearch/issues/3200 poke `@curquiza` Co-authored-by: ManyTheFish <many@meilisearch.com>	2022-12-15 09:31:47 +00:00
ManyTheFish	3322018c06	Fix placeholder search	2022-12-14 20:09:47 +01:00
bors[bot]	0276d5212a	Merge #728 728: Add some integration tests on the sort criterion r=ManyTheFish a=loiclec This is simply an integration test ensuring that the sort criterion works properly. However, only one version of the algorithm is tested here (the iterative one). To test the version that uses the facet DB, one has to manually set the `CANDIDATES_THRESHOLD` constant to `0`. I have done that and ensured that the test still succeeds. However, in the future, we will probably want to have an option to force which algorithm is used at runtime, for testing purposes. Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com>	2022-12-14 09:27:12 +00:00
bors[bot]	e2ffc3d69a	Merge #741 741: Add test reproducing the bug fixed by #737 r=Kerollmops a=ManyTheFish related to #737 Co-authored-by: ManyTheFish <many@meilisearch.com>	2022-12-13 15:02:19 +00:00
ManyTheFish	739da9fd4d	Add test	2022-12-13 15:54:43 +01:00
bors[bot]	406ee31d1a	Merge #737 737: Fix typo initial candidates computation r=Kerollmops a=ManyTheFish When `Typo` criterion was after a different criterion than `Words` and the previous criterion wasn't returning any candidates at the first iteration of the bucket sort, then the `initial_candidates` were lost. Now, `Typo`ensure to keep the `initial_candidates` between iterations. related to https://github.com/meilisearch/meilisearch/issues/3200#issuecomment-1345179578 related to https://github.com/meilisearch/meilisearch/issues/3228 Co-authored-by: ManyTheFish <many@meilisearch.com>	2022-12-13 10:29:28 +00:00
ManyTheFish	2d8d0af1a6	Rename short name bc by ic for initial_candidates	2022-12-13 10:56:38 +01:00
Loïc Lecrenier	be3b00350c	Apply review suggestions: naming and documentation	2022-12-13 10:15:22 +01:00
ManyTheFish	80d34a4169	Fix typo initial candiddates computation	2022-12-12 19:02:48 +01:00
Loïc Lecrenier	e3ee553dcc	Remove soft deleted ids from ExternalDocumentIds during document import If the document import replaces a document using hard deletion	2022-12-12 14:16:09 +01:00
Loïc Lecrenier	bebd050961	Add new test for bug 3021	2022-12-08 19:19:40 +01:00
ManyTheFish	55724f2412	Introduce an initial candidates set that makes the difference between an exhaustive count and an estimation	2022-12-08 09:41:34 +01:00
ManyTheFish	6d50ea0830	add tests	2022-12-08 08:56:57 +01:00
Loïc Lecrenier	f37c86e0b2	Add some integration tests on the sort criterion	2022-12-07 15:59:33 +01:00
Loïc Lecrenier	d38cc73630	Add one more filter "integration" test	2022-12-07 14:38:25 +01:00
Loïc Lecrenier	e688581c36	Add tests for facet range search on different field ids	2022-12-07 14:38:21 +01:00
Loïc Lecrenier	4ac8f96342	Simplify implementation of equality condition in filters	2022-12-07 14:38:18 +01:00
Loïc Lecrenier	1c9555566e	Fix bug in facet range search	2022-12-07 14:38:14 +01:00
Loïc Lecrenier	303d740245	Prepare fix within facet range search By creating snapshots and updating the format of the existing snapshots. The next commit will apply the fix, which will show its effects cleanly on the old and new snapshot tests	2022-12-07 14:38:10 +01:00
bors[bot]	0a301b5f88	Merge #723 723: Fix bug in handling of soft deleted documents when updating settings r=Kerollmops a=loiclec # Pull Request ## Related issue Fixes (partially, until merged into meilisearch) https://github.com/meilisearch/meilisearch/issues/3021 ## What does this PR do? This PR fixes the bug where a `missing key in documents database` internal error message could appear when indexing documents. When updating the settings, before clearing the database and before creating the transform output, we now modify the `ExternalDocumentsIds` structure to get rid of all references to soft deleted document ids in its FSTs. It used to be that updating the settings would clear the soft-deleted document ids, but keep the original `ExternalDocumentsIds` structure. As a consequence of this, when processing a future document addition, we could wrongly believe that a document was being replaced when, in fact, it was a completely new document. See the tests `bug_3021_first`, `bug_3021_second`, and `bug_3021` for a minimal test case that would have reproduced the issue. We need to take special care to: - evaluate how users should update to v0.30.1 (containing this fix): dump? reimporting all documents from scratch? - understand IF/HOW this bug could have caused duplicate documents to be returned - and evaluate the correctness of the fix, of course :) Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com>	2022-12-06 14:37:38 +00:00
Loïc Lecrenier	a993b68684	Cargo fmt >:-(	2022-12-06 15:22:10 +01:00
Loïc Lecrenier	80c7a00567	Fix compilation error in tests of settings update	2022-12-06 15:19:26 +01:00
Loïc Lecrenier	67d8cec209	Fix bug in handling of soft deleted documents when updating settings	2022-12-06 15:09:19 +01:00
bors[bot]	2a846aaae7	Merge #719 719: Add more members of `filter_parser` to `milli::` & `From<&str>` implementation for `Token` r=Kerollmops a=GregoryConrad ## What does this PR do? The current `milli::Filter` and `milli::FilterCondition` APIs require working with some members of `filter_parser` directly that `milli::` does not re-export to its users (at least when not parsing input using `parse`). Also, using `filter_parser` does not make sense when using milli from an embedded context where there is no query to parse. Instead of reworking `milli::Filter` and `milli::FilterCondition`, this PR adds two non-breaking changes that ease the use of milli: - Re-exports more members of the dependent version of `filter_parser` in `milli` - Implements `From<&str>` for `filter_parser::Token` - This will also allow some basic tests that need to create a `Token` from a string to avoid some boilerplate. In conjunction, both of these will allow milli users to easily create a `Token` from a `&str` without needing to add `filter_parser` as an extra dependency. Note: I wanted to use `FromStr` for the `From` implementation; however, it requires returning a `Result` which is not needed for the conversion. Thus, I just left it as `From<&str>`. Co-authored-by: Gregory Conrad <gregorysconrad@gmail.com>	2022-12-06 10:36:00 +00:00
Tamo	212dbfa3b5	Update milli/src/search/facet/filter.rs	2022-12-05 20:56:21 +01:00
amab8901	456da5de9c	Geosearch for zero radius	2022-12-05 20:11:46 +01:00
Loïc Lecrenier	cda4ba2bb6	Add document import tests	2022-12-05 12:02:49 +01:00
Loïc Lecrenier	ae59d37b75	Improve insta-snap of the external document ids	2022-12-05 10:51:02 +01:00
Loïc Lecrenier	f2cf981641	Add more tests and allow disabling of soft-deletion outside of tests Also allow disabling soft-deletion in the IndexDocumentsConfig	2022-12-05 10:51:01 +01:00
Gregory Conrad	50954d31fa	feat: Re-export Span and Token to milli::	2022-12-03 13:37:33 -05:00
bors[bot]	d3731dda48	Merge #706 706: Limit the reindexing caused by updating settings when not needed r=curquiza a=GregoryConrad ## What does this PR do? When updating index settings using `update::Settings`, sometimes a `reindex` of `update::Settings` is triggered when it doesn't need to be. This PR aims to prevent those unnecessary `reindex` calls. For reference, here is a snippet from the current `execute` method in `update::Settings`: ```rust // ... if stop_words_updated \|\| faceted_updated \|\| synonyms_updated \|\| searchable_updated \|\| exact_attributes_updated { self.reindex(&progress_callback, &should_abort, old_fields_ids_map)?; } ``` - [x] `faceted_updated` - looks good as-is ✅ - [x] `stop_words_updated` - looks good as-is ✅ - [x] `synonyms_updated` - looks good as-is ✅ - [x] `searchable_updated` - fixed in this PR - [x] `exact_attributes_updated` - fixed in this PR ## PR checklist Please check if your PR fulfills the following requirements: - [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)? - [x] Have you read the contributing guidelines? - [x] Have you made sure that the title is accurate and descriptive of the changes? Thank you so much for contributing to Meilisearch! Co-authored-by: Gregory Conrad <gregorysconrad@gmail.com>	2022-12-01 13:58:02 +00:00
bors[bot]	5e754b3ee0	Merge #708 708: Reduce memory usage of the MatchingWords structure r=ManyTheFish a=loiclec # Pull Request ## Related issue Fixes (partially) https://github.com/meilisearch/meilisearch/issues/3115 ## What does this PR do? 1. Reduces the memory usage caused by the creation of a 10-word query tree by 20x. This is done by deduplicating the `MatchingWord` values, which are heavy because of their inner DFA. The deduplication works by wrapping each `MatchingWord` in a reference-counted box and using a hash map to determine whether a `MatchingWord` DFA already exists for a certain signature, or whether a new one needs to be built. 2. Avoid the worst-case scenario of creating a `MatchingWord` for extremely long words that cannot be indexed by milli. Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com>	2022-11-30 17:47:34 +00:00
bors[bot]	e1612fcb01	Merge #712 712: Fix bulk facet indexing bug r=Kerollmops a=loiclec # Pull Request ## Related issue Fixes (partially, until merged into meilisearch) https://github.com/meilisearch/meilisearch/issues/3165 ## What does this PR do? Fixes a bug where indexing certain numbers of filterable attribute values in bulk led to corrupted facet databases. This was due to a lossy integer conversion which would ultimately prevent entire levels of the facet database to be written into LMDB. More specifically, this change was made: ```diff - if cur_writer_len as u8 >= self.min_level_size { + if cur_writer_len >= self.min_level_size as usize { ``` I also checked other comparisons to `min_level_size` and other conversions such as `x as u8` in this part of the codebase. Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com>	2022-11-30 16:51:48 +00:00
Loïc Lecrenier	9dd4b33a9a	Fix bulk facet indexing bug	2022-11-30 14:27:36 +01:00
Gregory Conrad	87e2bc3bed	fix(reindex): reindex in a few more cases Cases: whenever searchable_fields OR user_defined_searchable_fields is modified	2022-11-28 13:12:19 -05:00
Loïc Lecrenier	61b58b115a	Don't create partial matching words for synonyms in ngrams	2022-11-28 16:32:28 +01:00
Gregory Conrad	d3182f3830	refactor: Change return type to keep consistency with others	2022-11-28 10:02:03 -05:00
Loïc Lecrenier	f70856bab1	Remove memory usage test that fails when many tests are run in parallel	2022-11-28 12:55:28 +01:00
Loïc Lecrenier	e2ebed62b1	Don't create partial matching words for synonyms, split words, phrases	2022-11-28 10:20:13 +01:00
Loïc Lecrenier	8284bd760f	Relax memory ordering of operations within the test CountingAlloc	2022-11-28 10:20:13 +01:00
Loïc Lecrenier	8d0ace2d64	Avoid creating a MatchingWord for words that exceed the length limit	2022-11-28 10:20:13 +01:00
Loïc Lecrenier	86c34a996b	Deduplicate matching words	2022-11-28 10:20:13 +01:00
Gregory Conrad	e0d24104a3	refactor: Rewrite another method chain to be more readable	2022-11-26 13:33:19 -05:00
Gregory Conrad	2db738dbac	refactor: rewrite method chain to be more readable	2022-11-26 13:26:39 -05:00
Gregory Conrad	935a724c57	revert: Revert pass by reference API change	2022-11-24 10:08:23 -05:00
Gregory Conrad	ed29cceae9	perf: Prevent reindex in searchable set case when not needed	2022-11-23 22:33:06 -05:00
Gregory Conrad	bb9e33bf85	perf: Prevent reindex in searchable reset case when not needed	2022-11-23 22:01:46 -05:00
Gregory Conrad	7c0e544839	feat: Add all_obkv_to_json function	2022-11-23 21:18:58 -05:00
Gregory Conrad	d19c8672bb	perf: limit reindex to when exact_attributes changes	2022-11-23 15:50:53 -05:00
bors[bot]	57c9f03e51	Merge #697 697: Fix bug in prefix DB indexing r=loiclec a=loiclec Where the batch's information was not properly updated in cases where only the proximity changed between two consecutive word pair proximities. Closes partially https://github.com/meilisearch/meilisearch/issues/3043 Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com>	2022-11-17 15:22:01 +00:00
curquiza	cd5aaa3a9f	Update version for the next release (v0.37.0) in Cargo.toml files	2022-11-17 12:50:07 +00:00
Loïc Lecrenier	777eb3fa00	Add insta-snaps for test of bug 3043	2022-11-17 12:21:27 +01:00
Loïc Lecrenier	0caadedd3b	Make clippy happy	2022-11-17 12:17:53 +01:00
Loïc Lecrenier	ac3baafbe8	Truncate facet values that are too long before indexing them	2022-11-17 11:29:42 +01:00
Loïc Lecrenier	990a861241	Add test for indexing a document with a long facet value	2022-11-17 11:29:42 +01:00
Loïc Lecrenier	d95d02cb8a	Fix Facet Indexing bugs 1. Handle keys with variable length correctly This fixes https://github.com/meilisearch/meilisearch/issues/3042 and is easily reproducible with the updated fuzz tests, which now generate keys with variable lengths. 2. Prevent adding facets to the database if their encoded value does not satisfy `valid_lmdb_key`. This fixes an indexing failure when a document had a filterable attribute containing a value whose length is higher than ~500 bytes.	2022-11-17 11:29:42 +01:00
Loïc Lecrenier	f00108d2ec	Fix name of bug in reproduction test	2022-11-17 11:29:18 +01:00
Loïc Lecrenier	f7c8730d09	Fix bug in prefix DB indexing Where the batch's information was not properly updated in cases where only the proximity changed between two consecutive word pair proximities. Closes https://github.com/meilisearch/meilisearch/issues/3043	2022-11-17 11:29:18 +01:00
Louis Dureuil	6dc6a5d874	Force using vendored version of LMDB - don't use lmdb master3 branch anymore	2022-11-14 17:17:51 +01:00
Kerollmops	d00d2aab3f	Update version for the next release (v0.36.0) in Cargo.toml files	2022-11-09 11:03:09 +00:00
bors[bot]	f46a8ab2e2	Merge #693 693: use the lmdb-master.3 branch r=Kerollmops a=irevoire After investigating https://github.com/meilisearch/meilisearch/issues/3017, we found out that it was due to lmdb and that, without any code change on our side, bumping using the lmdb-master-3 branch fix our issues. But, we’re not really confident about what changed between the `mdb.master` and `mdb.master3` branches; thus this is a temporary change, and we hope we’ll be able to move to the new version of heed asap (either before the end of the pre-release or for the next release). -------- The bug is hard to reproduce; I can reproduce it 100% of the time on my archlinux personal computer. But on a scaleway archlinux bare-metal machine, it doesn’t reproduce. It’s flaky on our test suite, but `@loiclec` was able to write a minimal test that reproduces it every time on macOS. Basically, what happens is when there are multiple threads opening databases in a different directory at the same time. If there are 10 or more threads running at the same time, lmdb starts throwing the `Invalid argument (os error 22)` error for no reason, we believe. I would like to submit an issue to lmdb, but I don’t really have the time to write a test in C without heed currently. `@hyc,` if you want to take a look at it, here is the repo that reproduces the issue on macOS: https://github.com/irevoire/heed-bug Co-authored-by: Irevoire <tamo@meilisearch.com>	2022-11-09 09:42:38 +00:00
Irevoire	c7711daca3	use the lmdb-master.3 branch	2022-11-08 16:28:01 +01:00
Kerollmops	bd12989610	Update version for the next release (v0.35.1) in Cargo.toml files	2022-11-08 14:31:39 +00:00

... 3 4 5 6 7 ...

1618 Commits