meilisearch

mirror of https://github.com/meilisearch/meilisearch.git synced 2025-01-18 08:48:32 +08:00

Author	SHA1	Message	Date
bors[bot]	cd3bca06e9	Merge #699 699: Force vendoring of LMDB even if a system version is available r=Kerollmops a=dureuill # Pull Request ## Related issue Related to https://github.com/meilisearch/meilisearch/issues/3017: will fix once ported to milli and meilisearch. ## What does this PR do? - Force using vendored version of LMDB - don't use lmdb master3 branch anymore: this is a bit of a side effect of using a tag instead of branch for heed as a dependency, but it is wanted anyway for now as lmdb master3 was more of an experiment - modifies CI to run `cargo check` on the release rather than the debug artifacts. This is an attempt to reduce the necessary disk space and avoid "out of space" failures. ## PR checklist Please check if your PR fulfills the following requirements: - [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)? - [x] Have you read the contributing guidelines? - [x] Have you made sure that the title is accurate and descriptive of the changes? Thank you so much for contributing to Meilisearch! Co-authored-by: Louis Dureuil <louis@meilisearch.com>	2022-11-15 11:09:20 +00:00
Louis Dureuil	87576cf26c	Perform cargo check on the release artifacts	2022-11-15 10:25:02 +01:00
Louis Dureuil	6dc6a5d874	Force using vendored version of LMDB - don't use lmdb master3 branch anymore	2022-11-14 17:17:51 +01:00
bors[bot]	e75829aded	Merge #694 694: Update version for the next release (v0.36.0) in Cargo.toml files r=curquiza a=meili-bot ⚠️ This PR is automatically generated. Check the new version is the expected one before merging. Co-authored-by: Kerollmops <Kerollmops@users.noreply.github.com>	2022-11-09 11:22:24 +00:00
Kerollmops	d00d2aab3f	Update version for the next release (v0.36.0) in Cargo.toml files	2022-11-09 11:03:09 +00:00
bors[bot]	f46a8ab2e2	Merge #693 693: use the lmdb-master.3 branch r=Kerollmops a=irevoire After investigating https://github.com/meilisearch/meilisearch/issues/3017, we found out that it was due to lmdb and that, without any code change on our side, bumping using the lmdb-master-3 branch fix our issues. But, we’re not really confident about what changed between the `mdb.master` and `mdb.master3` branches; thus this is a temporary change, and we hope we’ll be able to move to the new version of heed asap (either before the end of the pre-release or for the next release). -------- The bug is hard to reproduce; I can reproduce it 100% of the time on my archlinux personal computer. But on a scaleway archlinux bare-metal machine, it doesn’t reproduce. It’s flaky on our test suite, but `@loiclec` was able to write a minimal test that reproduces it every time on macOS. Basically, what happens is when there are multiple threads opening databases in a different directory at the same time. If there are 10 or more threads running at the same time, lmdb starts throwing the `Invalid argument (os error 22)` error for no reason, we believe. I would like to submit an issue to lmdb, but I don’t really have the time to write a test in C without heed currently. `@hyc,` if you want to take a look at it, here is the repo that reproduces the issue on macOS: https://github.com/irevoire/heed-bug Co-authored-by: Irevoire <tamo@meilisearch.com>	2022-11-09 09:42:38 +00:00
bors[bot]	c3b75bbe5d	Merge #691 691: Update version for the next release (v0.35.1) in Cargo.toml files r=curquiza a=meili-bot ⚠️ This PR is automatically generated. Check the new version is the expected one before merging. Co-authored-by: Kerollmops <Kerollmops@users.noreply.github.com>	2022-11-08 15:31:50 +00:00
Irevoire	c7711daca3	use the lmdb-master.3 branch	2022-11-08 16:28:01 +01:00
bors[bot]	f18a4581f1	Merge #692 692: Update CONTRIBUTING.md r=Kerollmops a=curquiza Co-authored-by: Clémentine Urquizar - curqui <clementine@meilisearch.com>	2022-11-08 14:51:30 +00:00
Clémentine Urquizar - curqui	8ce8bbcdfc	Update CONTRIBUTING.md	2022-11-08 15:49:45 +01:00
Kerollmops	bd12989610	Update version for the next release (v0.35.1) in Cargo.toml files	2022-11-08 14:31:39 +00:00
bors[bot]	24a298a83c	Merge #690 690: Fix soft deleted bug settings r=ManyTheFish a=Kerollmops Co-authored-by: Kerollmops <clement@meilisearch.com>	2022-11-08 13:45:10 +00:00
bors[bot]	d85cd9bf1a	Merge #689 689: Handle non-finite floats consistently in filters r=irevoire a=dureuill # Pull Request ## Related issue Related meilisearch/meilisearch#3000 ## What does this PR do? ### User - Filters using `field = inf`, (or `infinite`, `NaN`) now match the value as a string rather than returning an internal error. - Filters using `field < inf` (or other comparison operators) now return an invalid_filter error rather than returning an internal error, much like when using `field < aaa`. ### Implementation - Add new `NonFiniteFloat` error variants to the filter-parser errors - Add `Token::parse_as_finite_float` that can fail both when the string is not a float and when the float is not finite - Refactor `Filter::inner_evaluate` to always use `parse_as_finite_float` instead of just `parse` - Add corresponding tests ## PR checklist Please check if your PR fulfills the following requirements: - [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)? - [x] Have you read the contributing guidelines? - [x] Have you made sure that the title is accurate and descriptive of the changes? Thank you so much for contributing to Meilisearch! Co-authored-by: Louis Dureuil <louis@meilisearch.com>	2022-11-08 13:24:38 +00:00
Kerollmops	37b3c5c323	Fix transform to use all_documents and ignore soft_deleted documents	2022-11-08 14:23:16 +01:00
Kerollmops	1b1ad1923b	Add a test to check that we take care of soft deleted documents	2022-11-08 14:23:14 +01:00
Louis Dureuil	a836b8e703	tests: Tests filter with non-finite floats	2022-11-08 13:56:55 +01:00
Louis Dureuil	3328560788	fix: allow filters on = inf, = NaN, return InvalidFilter for < inf, < NaN Fixes meilisearch/meilisearch#3000	2022-11-08 13:27:15 +01:00
bors[bot]	cf76ec7b37	Merge #673 673: Add clippy job r=ManyTheFish a=unvalley # Pull Request ## Related issue Fixes #231 ## What does this PR do? - fix some clippy errors remain - add clippy job to CI (I set `nightly` as toolchain) ## PR checklist Please check if your PR fulfills the following requirements: - [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)? - [x] Have you read the contributing guidelines? - [x] Have you made sure that the title is accurate and descriptive of the changes? Co-authored-by: unvalley <kirohi.code@gmail.com>	2022-11-08 09:43:26 +00:00
unvalley	abf1cf9cd5	Fix clippy errors	2022-11-04 09:27:46 +09:00
unvalley	b09676779d	Use nightly for clippy and remove conflict mistake	2022-11-04 09:13:01 +09:00
unvalley	70465aa5ce	Execute cargo fmt	2022-11-04 08:59:58 +09:00
unvalley	3009981d31	Fix clippy errors Add clippy job Add clippy job to CI	2022-11-04 08:58:14 +09:00
unvalley	401e956128	Add clippy job Add clippy job to CI	2022-11-04 08:58:12 +09:00
bors[bot]	6add470805	Merge #659 659: Fix clippy error to add clippy job on Ci r=Kerollmops a=unvalley ## Related PR This PR is for #673 ## What does this PR do? - ~~add `Run Clippy` job to CI (rust.yml)~~ - apply `cargo clippy --fix` command - fix some `cargo clippy` error manually (but warnings still remain on tests) ## PR checklist Please check if your PR fulfills the following requirements: - [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)? - [x] Have you read the contributing guidelines? - [x] Have you made sure that the title is accurate and descriptive of the changes? Co-authored-by: unvalley <kirohi.code@gmail.com> Co-authored-by: unvalley <38400669+unvalley@users.noreply.github.com>	2022-11-03 15:24:38 +00:00
unvalley	13175f2339	refactor: match for filterCondition	2022-11-03 17:34:33 +09:00
bors[bot]	1a1ad8a792	Merge #679 679: Bump Swatinem/rust-cache from 2.0.0 to 2.0.1 r=curquiza a=dependabot[bot] Bumps [Swatinem/rust-cache](https://github.com/Swatinem/rust-cache) from 2.0.0 to 2.0.1. <details> <summary>Release notes</summary> <p><em>Sourced from <a href="https://github.com/Swatinem/rust-cache/releases">Swatinem/rust-cache's releases</a>.</em></p> <blockquote> <h2>v2.0.1</h2> <ul> <li>Primarily just updating dependencies to fix GitHub deprecation notices.</li> </ul> </blockquote> </details> <details> <summary>Changelog</summary> <p><em>Sourced from <a href="https://github.com/Swatinem/rust-cache/blob/master/CHANGELOG.md">Swatinem/rust-cache's changelog</a>.</em></p> <blockquote> <h2>2.0.1</h2> <ul> <li>Primarily just updating dependencies to fix GitHub deprecation notices.</li> </ul> </blockquote> </details> <details> <summary>Commits</summary> <ul> <li><a href="`22c9328bcb`"><code>22c9328</code></a> 2.0.1</li> <li><a href="`d4d463bd9b`"><code>d4d463b</code></a> bump deps and rebuild</li> <li><a href="`c4652c677c`"><code>c4652c6</code></a> Update <code>`@actions/core</code>` (<a href="https://github-redirect.dependabot.com/Swatinem/rust-cache/issues/83">#83</a>)</li> <li><a href="`76686c56f2`"><code>76686c5</code></a> docs: Fix github workflows directory (<a href="https://github-redirect.dependabot.com/Swatinem/rust-cache/issues/79">#79</a>)</li> <li><a href="`1b43d2f2c3`"><code>1b43d2f</code></a> remove outdated versioning note</li> <li><a href="`20b9201e8a`"><code>20b9201</code></a> bump cargo hash</li> <li><a href="`0d72e5f9a0`"><code>0d72e5f</code></a> revert explicit dir close</li> <li><a href="`86531941c2`"><code>8653194</code></a> Merge branch 'master' of <a href="https://github.com/Swatinem/rust-cache">https://github.com/Swatinem/rust-cache</a></li> <li><a href="`be4be3720d`"><code>be4be37</code></a> explicitly close dir handles, add more logging, cleanups</li> <li><a href="`213334cd98`"><code>213334c</code></a> cargo update</li> <li>Additional commits viewable in <a href="https://github.com/Swatinem/rust-cache/compare/v2.0.0...v2.0.1">compare view</a></li> </ul> </details> <br /> [![Dependabot compatibility score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=Swatinem/rust-cache&package-manager=github_actions&previous-version=2.0.0&new-version=2.0.1)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores) You can trigger a rebase of this PR by commenting ``@dependabot` rebase`. [//]: # (dependabot-automerge-start) [//]: # (dependabot-automerge-end) --- <details> <summary>Dependabot commands and options</summary> <br /> You can trigger Dependabot actions by commenting on this PR: - ``@dependabot` rebase` will rebase this PR - ``@dependabot` recreate` will recreate this PR, overwriting any edits that have been made to it - ``@dependabot` merge` will merge this PR after your CI passes on it - ``@dependabot` squash and merge` will squash and merge this PR after your CI passes on it - ``@dependabot` cancel merge` will cancel a previously requested merge and block automerging - ``@dependabot` reopen` will reopen this PR if it is closed - ``@dependabot` close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually - ``@dependabot` ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself) - ``@dependabot` ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself) - ``@dependabot` ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself) </details> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2022-11-02 12:38:05 +00:00
dependabot[bot]	4492605a78	Bump Swatinem/rust-cache from 2.0.0 to 2.0.1 Bumps [Swatinem/rust-cache](https://github.com/Swatinem/rust-cache) from 2.0.0 to 2.0.1. - [Release notes](https://github.com/Swatinem/rust-cache/releases) - [Changelog](https://github.com/Swatinem/rust-cache/blob/master/CHANGELOG.md) - [Commits](https://github.com/Swatinem/rust-cache/compare/v2.0.0...v2.0.1) --- updated-dependencies: - dependency-name: Swatinem/rust-cache dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com>	2022-11-01 10:19:45 +00:00
bors[bot]	fe5a0219e1	Merge #677 677: run the tests in all workspaces r=curquiza a=irevoire With #676 I noticed the tests were not running in any of our sub crates. Most of our sub crates didn't includes any tests though. But the filter-parser did and we're lucky we never broke these one without noticing 😁 Co-authored-by: Irevoire <tamo@meilisearch.com>	2022-10-31 18:05:04 +00:00
Irevoire	5ff066c3e7	run the tests in all workspaces	2022-10-31 18:38:48 +01:00
bors[bot]	6770eb2a87	Merge #676 676: chore: added `IN`,`NOT IN` to `invalid_filter` msg r=irevoire a=Pranav-yadav # Pull Request ## Related issue `Fixes` https://github.com/meilisearch/meilisearch/issues/3004 ## What does this PR do? - Improves correct error msg in response ## PR checklist Please check if your PR fulfills the following requirements: - [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)? - [x] Have you read the contributing guidelines? - [x] Have you made sure that the title is accurate and descriptive of the changes? Thank you so much for contributing to Meilisearch! Co-authored-by: Pranav Yadav <Pranavyadav3912@gmail.com>	2022-10-31 17:29:24 +00:00
unvalley	0d43ddbd85	Update filter-parser/src/lib.rs Co-authored-by: Tamo <irevoire@protonmail.ch>	2022-11-01 01:32:54 +09:00
Pranav Yadav	3950ec8d3c	chore: update tests for `invalid_filter` msg	2022-10-31 15:41:49 +00:00
Pranav Yadav	3b35ebda50	chore: added `IN`,`NOT IN` to `invalid_filter` msg	2022-10-31 15:01:14 +00:00
bors[bot]	4bcfd14a45	Merge #675 675: Deleted empty files r=Kerollmops a=SKVKPandey # Pull Request ## Related issue Fixes #674 ## What does this PR do? Delete empty files: - `milli/src/heed_codec/facet/facet_string_level_zero_value_codec.rs` - `milli/src/heed_codec/facet/facet_string_zero_bounds_value_codec.rs` ## PR checklist Please check if your PR fulfills the following requirements: - [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)? - [x] Have you read the contributing guidelines? - [x] Have you made sure that the title is accurate and descriptive of the changes? Thank you so much for contributing to Meilisearch! Co-authored-by: Shashank Kashyap <50551759+SKVKPandey@users.noreply.github.com>	2022-10-30 07:09:30 +00:00
Shashank Kashyap	a07f0a4a43	Delete facet_string_zero_bounds_value_codec.rs	2022-10-30 08:59:04 +05:30
Shashank Kashyap	2dec6e86e9	Delete facet_string_level_zero_value_codec.rs	2022-10-30 08:58:36 +05:30
bors[bot]	c965200010	Merge #664 664: Fix phrase search containing stop words r=ManyTheFish a=Samyak2 # Pull Request This a WIP draft PR I wanted to create to let other potential contributors know that I'm working on this issue. I'll be completing this in a few hours from opening this. ## Related issue Fixes #661 and towards fixing meilisearch/meilisearch#2905 ## What does this PR do? - [x] Change Phrase Operation to use a `Vec<Option<String>>` instead of `Vec<String>` where `None` corresponds to a stop word - [x] Update all other uses of phrase operation - [x] Update `resolve_phrase` - [x] Update `create_primitive_query`? - [x] Add test ## PR checklist Please check if your PR fulfills the following requirements: - [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)? - [x] Have you read the contributing guidelines? - [x] Have you made sure that the title is accurate and descriptive of the changes? Co-authored-by: Samyak S Sarnayak <samyak201@gmail.com> Co-authored-by: Samyak Sarnayak <samyak201@gmail.com>	2022-10-29 13:42:52 +00:00
unvalley	d55f0e2e53	Execute cargo fmt	2022-10-28 23:42:23 +09:00
unvalley	d53a80b408	Fix clippy error	2022-10-28 23:41:35 +09:00
Samyak Sarnayak	ecb88143f9	Run cargo fmt	2022-10-28 19:37:02 +05:30
Samyak Sarnayak	03eb5d87c1	Only call plane_sweep on subgroups when 2 or more are present	2022-10-28 19:32:05 +05:30
unvalley	a1d7ed1258	fix clippy error and remove clippy job from ci Remove clippy job Fix clippy error type_complexity Restore ambiguous change	2022-10-28 22:33:50 +09:00
unvalley	f3c0b05ae8	Fix rust fmt	2022-10-28 09:32:31 +09:00
unvalley	f4ec1abb9b	Fix all clippy error after conflicts	2022-10-27 23:58:13 +09:00
Samyak S Sarnayak	d35afa0cf5	Change consecutive phrase search grouping logic Co-authored-by: ManyTheFish <many@meilisearch.com>	2022-10-26 23:10:48 +05:30
Samyak S Sarnayak	752d031010	Update phrase search to use new `execute` method	2022-10-26 23:07:20 +05:30
unvalley	c7322f704c	Fix cargo clippy errors Dont apply clippy for tests for now Fix clippy warnings of filter-parser package parent 8352febd646ec4bcf56a44161e5c4dce0e55111f author unvalley <38400669+unvalley@users.noreply.github.com> 1666325847 +0900 committer unvalley <kirohi.code@gmail.com> 1666791316 +0900 Update .github/workflows/rust.yml Co-authored-by: Clémentine Urquizar - curqui <clementine@meilisearch.com> Allow clippy lint too_many_argments Allow clippy lint needless_collect Allow clippy lint too_many_arguments and type_complexity Fix for clippy warnings comparison_chains Fix for clippy warnings vec_init_then_push Allow clippy lint should_implement_trait Allow clippy lint drop_non_drop Fix lifetime clipy warnings in filter-paprser Execute cargo fmt Fix clippy remaining warnings Fix clippy remaining warnings again and allow lint on each place	2022-10-27 01:04:23 +09:00
unvalley	811f156031	Execute cargo clippy --fix	2022-10-27 01:00:00 +09:00
unvalley	d8fed1f7a9	Add clippy job Add Run Clippy to bors.toml	2022-10-27 01:00:00 +09:00
bors[bot]	2e539249cb	Merge #619 619: Refactor the Facets databases to enable incremental indexing r=curquiza a=loiclec # Pull Request ## What does this PR do? Party fixes https://github.com/meilisearch/milli/issues/605 by making the indexing of the facet databases (i.e. `facet_id_f64_docids` and `facet_id_string_docids`) incremental. It also closes #327 and https://github.com/meilisearch/meilisearch/issues/2820 . Two more untracked bugs were also fixed: 1. The facet distribution algorithm did not respect the `maxFacetValues` parameter when there were only a few candidate document ids. 2. The structure of the levels > 0 of the facet databases were not updated following the deletion of documents ## How to review this PR First, read this comment to get an overview of the changes. Then, based on this comment, raise any concerns you might have about: 1. the new structure of the databases 2. the algorithms for sort, facet distribution, and range search 3. the new/removed heed codecs Then, weigh in on the following concerns: 1. adding `fuzzcheck` as a fuzz-only dependency may add too much complexity for the benefits it provides 2. the `ByteSliceRef` and `StrRefCodec` are misnamed or should not exist 3. the new behaviour of facet distributions can be considered incorrect 4. incremental deletion is useless given that documents are always deleted in bulk ## What's left for me to do 1. Re-read everything once to make sure I haven't forgotten anything 2. Wait for the results of the benchmarks and see if (1) they provide enough information (2) there was any change in performance, especially for search queries. Then, maybe, spend some time optimising the code. 3. Test whether the `info`/`http-ui` crates survived the refactor ## Old structure of the `facet_id_f64_docids` and `facet_id_string_docids` databases Previously, these two databases had different but conceptually similar structures. For each field id, the facet number database had the following format: ``` ┌───────────────────────────────┬───────────────────────────────┬───────────────┐ ┌───────┐ │ 1.2 – 2 │ 3.4 – 100 │ 102 – 104 │ │Level 2│ │ │ │ │ └───────┘ │ a, b, d, f, z │ c, d, e, f, g │ u, y │ ├───────────────┬───────────────┼───────────────┬───────────────┼───────────────┤ ┌───────┐ │ 1.2 – 1.3 │ 1.6 – 2 │ 3.4 – 12 │ 12.3 – 100 │ 102 – 104 │ │Level 1│ │ │ │ │ │ │ └───────┘ │ a, b, d, z │ a, b, f │ c, d, g │ e, f │ u, y │ ├───────┬───────┼───────┬───────┼───────┬───────┼───────┬───────┼───────┬───────┤ ┌───────┐ │ 1.2 │ 1.3 │ 1.6 │ 2 │ 3.4 │ 12 │ 12.3 │ 100 │ 102 │ 104 │ │Level 0│ │ │ │ │ │ │ │ │ │ │ │ └───────┘ │ a, b │ d, z │ b, f │ a, f │ c, d │ g │ e │ e, f │ y │ u │ └───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┘ ``` where the first line is the key of the database, consisting of : - the field id - the level height - the left and right bound of the group and the second line is the value of the database, consisting of: - a bitmap of all the docids that have a facet value within the bounds The `facet_id_string_docids` had a similar structure: ``` ┌───────────────────────────────┬───────────────────────────────┬───────────────┐ ┌───────┐ │ 0 – 3 │ 4 – 7 │ 8 – 9 │ │Level 2│ │ │ │ │ └───────┘ │ a, b, d, f, z │ c, d, e, f, g │ u, y │ ├───────────────┬───────────────┼───────────────┬───────────────┼───────────────┤ ┌───────┐ │ 0 – 1 │ 2 – 3 │ 4 – 5 │ 6 – 7 │ 8 – 9 │ │Level 1│ │ "ab" – "ac" │ "ba" – "bac" │ "gaf" – "gal" │"form" – "wow" │ "woz" – "zz" │ └───────┘ │ a, b, d, z │ a, b, f │ c, d, g │ e, f │ u, y │ ├───────┬───────┼───────┬───────┼───────┬───────┼───────┬───────┼───────┬───────┤ ┌───────┐ │ "ab" │ "ac" │ "ba" │ "bac" │ "gaf" │ "gal" │ "form"│ "wow" │ "woz" │ "zz" │ │Level 0│ │ "AB" │ " Ac" │ "ba " │ "Bac" │ " GAF"│ "gal" │ "Form"│ " wow"│ "woz" │ "ZZ" │ └───────┘ │ a, b │ d, z │ b, f │ a, f │ c, d │ g │ e │ e, f │ y │ u │ └───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┘ ``` where, at level 0, the key is: * the normalised facet value (string) and the value is: * the original facet value (string) * a bitmap of all the docids that have this normalised string facet value At level 1, the key is: * the left bound of the range as an index in level 0 * the right bound of the range as an index in level 0 and the value is: * the left bound of the range as a normalised string * the right bound of the range as a normalised string * a bitmap of all the docids that have a string facet value within the bounds At level > 1, the key is: * the left bound of the range as an index in level 0 * the right bound of the range as an index in level 0 and the value is: * a bitmap of all the docids that have a string facet value within the bounds ## New structure of the `facet_id_f64_docids` and `facet_id_string_docids` databases Now both the `facet_id_f64_docids` and `facet_id_string_docids` databases have the exact same structure: ``` ┌───────────────────────────────┬───────────────────────────────┬───────────────┐ ┌───────┐ │ "ab" (2) │ "gaf" (2) │ "woz" (1) │ │Level 2│ │ │ │ │ └───────┘ │ [a, b, d, f, z] │ [c, d, e, f, g] │ [u, y] │ ├───────────────┬───────────────┼───────────────┬───────────────┼───────────────┤ ┌───────┐ │ "ab" (2) │ "ba" (2) │ "gaf" (2) │ "form" (2) │ "woz" (2) │ │Level 1│ │ │ │ │ │ │ └───────┘ │ [a, b, d, z] │ [a, b, f] │ [c, d, g] │ [e, f] │ [u, y] │ ├───────┬───────┼───────┬───────┼───────┬───────┼───────┬───────┼───────┬───────┤ ┌───────┐ │ "ab" │ "ac" │ "ba" │ "bac" │ "gaf" │ "gal" │ "form"│ "wow" │ "woz" │ "zz" │ │Level 0│ │ │ │ │ │ │ │ │ │ │ │ └───────┘ │ [a, b]│ [d, z]│ [b, f]│ [a, f]│ [c, d]│ [g] │ [e] │ [e, f]│ [y] │ [u] │ └───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┘ ``` where for all levels, the key is a `FacetGroupKey<T>` containing: * the field id (`u16`) * the level height (`u8`) * the left bound of the range (`T`) and the value is a `FacetGroupValue` containing: * the number of elements from the level below that are part of the range (`u8`, =0 for level 0) * a bitmap of all the docids that have a facet value within the bounds (`RoaringBitmap`) The right bound of the range is now implicit, it is equal to `Excluded(next_left_bound)`. In the code, the key is always encoded using `FacetGroupKeyCodec<C>` where `C` is the codec used to encode the facet value (either `OrderedF64Codec` or `StrRefCodec`) and the value is encoded with `FacetGroupValueCodec`. Since both databases share the same structure, we can implement almost all operations only once by treating the facet value as a byte slice (i.e. `FacetGroupKey<&[u8]>` encoded as `FacetGroupKeyCodec<ByteSliceRef>`). This is, in my opinion, a big simplification. The reason for changing the structure of the databases is to make it possible to incrementally add a facet value to an existing database. Since the `facet_id_string_docids` used to store indices to `level 0` in all levels > 0, adding an element to level 0 would potentially invalidate all the indices. Note that the original string value of a facet is no longer stored in this database. ## Incrementally adding a facet value Here I describe how we can add a facet value to the new database incrementally. If we want to add the document with id `z` and facet value `gap`., then we want to add/modify the elements highlighted below in pink: <img width="946" alt="Screenshot 2022-09-12 at 10 14 54" src="https://user-images.githubusercontent.com/6040237/189605532-fe4b0f52-e13d-4b3c-92d9-10c705953e3d.png"> which results in: <img width="662" alt="Screenshot 2022-09-12 at 10 23 29" src="https://user-images.githubusercontent.com/6040237/189607015-c3a37588-b825-43c2-878a-f8f85c000b94.png"> * one element was added in level 0 * one key/value was modified in level 1 * one value was modified in level 2 Adding this element was easy since we could simply add it to level 0 and then increase the `group_size` part of the value for the level above. However, in order to keep the structure balanced, we can't always do this. If the group size reaches a threshold (`max_group_size`), then we split the node into two. For example, let's imagine that `max_group_size` is `4` and we add the docid `y` with facet value `gas`. First, we add it in level 0: <img width="904" alt="Screenshot 2022-09-12 at 10 30 40" src="https://user-images.githubusercontent.com/6040237/189608391-531f9df1-3424-4f1f-8344-73eb194570e5.png"> Then, we realise that the group size of its parent is going to reach the maximum group size (=4) and thus we split the parent into two nodes: <img width="919" alt="Screenshot 2022-09-12 at 10 33 16" src="https://user-images.githubusercontent.com/6040237/189608884-66f87635-1fc6-41d2-a459-87c995491ac4.png"> and since we inserted an element in level 1, we also update level 2 accordingly, by increasing the group size of the parent: <img width="915" alt="Screenshot 2022-09-12 at 10 34 42" src="https://user-images.githubusercontent.com/6040237/189609233-d4a893ff-254a-48a7-a5ad-c0dc337f23ca.png"> We also have two other parameters: * `group_size` is the default group size when building the database from scratch * `min_level_size` is the minimum number of elements that a level should contain When the highest level size is greater than `group_size * min_level_size`, then we create an additional level above it. There is one more edge case for the insertion algorithm. While we normally don't modify the existing left bounds of a key, we have to do it if the facet value being inserted is smaller than the first left bound. For example, inserting `"aa"` with the docid `w` would change the database to: <img width="756" alt="Screenshot 2022-09-12 at 10 41 56" src="https://user-images.githubusercontent.com/6040237/189610637-a043ef71-7159-4bf1-b4fd-9903134fc095.png"> The root of the code for incremental indexing is the `FacetUpdateIncremental` builder. ## Incrementally removing a facet value TODO: the algorithm was implemented and works, but its current API is: `fn delete(self, facet_value, single_docid)`. It removes the given document id from all keys containing the given facet value. I don't think it is the right way to implement it anymore. Perhaps a bitmap of docids should be given instead. This is fairly easy to do. But since we batch document deletions together (because of soft deletion), it's not clear to me anymore that incremental deletion should be implemented at all. ## Bulk insertion While it's faster to incrementally add a single facet value to the database, it is sometimes slower to repeatedly add facet values one-by-one instead of doing it in bulk. For example, during initial indexing, we'd like to build the database from a list of facet values and associated document ids in one go. The `FacetUpdateBulk` builder provides a way to do so. It works by: 1. clearing all levels > 0 from the DB 2. adding all new elements in level 0 3. rebuilding the higher levels from scratch The algorithm for bulk insertion is the same as the previous one. ## Choosing between incremental and bulk insertion On my computer, I measured that is about 50x slower to add N facet values incrementally than it is to re-build a database with N facet values in level 0. Therefore, we dynamically choose to use either incremental insertion or bulk insertion based on (1) the number of existing elements in level 0 of the database and (2) the number of facet values from the new documents. This is imprecise but is mainly aimed at avoiding the worst-case scenario where the incremental insertion method is used repeatedly millions of times. ## Fuzz-testing Potentially controversial: I fuzz-tested incremental addition and deletion using fuzzcheck, which found many bugs. The fuzz-test consists of inserting/deleting facet values and docids in succession, each operation is processed with different parameters for `group_size`, `max_group_size`, and `min_level_size`. After all the operations are processed, the content of level 0 is compared to the content of an equivalent structure with a simple and easily-checked implementation. Furthermore, we check that the database has a correct structure (all groups from levels > 0 correctly combine the content of their children). I also visualised the code coverage found by the fuzz-test. It covered 100% of the relevant code except for `unreachable/panic` statements and errors returned by `heed`. The fuzz-test and the fuzzcheck dependency are only compiled when `cargo fuzzcheck` is used. For now, the dependency is from a local path on my computer, but it can be changed to a crate version if we decide to keep it. ## Algorithms operating on the facet databases There are four important algorithms making use of the facet databases: 1. Sort, ascending 2. Sort, descending 3. Facet distribution 4. Range search Previously, the implementation of all four algorithms was based on a number of iterators specific to each database kind (number or string): `FacetNumberRange`, `FacetNumberRevRange`, `FacetNumberIter` (with a reversed and reducing/non-reducing option), `FacetStringGroupRange`, `FacetStringGroupRevRange`, `FacetStringLevel0Range`, `FacetStringLevel0RevRange`, and `FacetStringIter` (reversed + reducing/non-reducing). Now, all four algorithms have a unique implementation shared by both the string and number databases. There are four functions: 1. `ascending_facet_sort` in `search/facet/facet_sort_ascending.rs` 2. `descending_facet_sort` in `search/facet/facet_sort_descending.rs` 3. `iterate_over_facet_distribution` in `search/facet/facet_distribution_iter.rs` 4. `find_docids_of_facet_within_bounds` in `search/facet/facet_range_search.rs` I have tried to test them with some snapshot tests but more testing could still be done. I don't think that the performance of these algorithms regressed, but that will need to be confirmed by benchmarks. ## Change of behaviour for facet distributions Previously, the original string value of a facet was stored in the level 0 of `facet_id_string_docids `. This is no longer the case. The original string value was used in the implementation of the facet distribution algorithm. Now, to recover it, we pick a random document id which contains the normalised string value and look up the original one in `field_id_docid_facet_strings`. As a consequence, it may be that the string value returned in the field distribution does not appear in any of the candidates. For example, ```json { "id": 0, "colour": "RED" } { "id": 1, "colour": "red" } ``` Facet distribution for the `colour` field among the candidates `[1]`: ``` { "RED": 1 } ``` Here, "RED" was given as the original facet value even though it does not appear in the document id `1`. ## Heed codecs A number of heed codecs related to the facet databases were removed: * `FacetLevelValueF64Codec` * `FacetLevelValueU32Codec` * `FacetStringLevelZeroCodec` * `StringValueCodec` * `FacetStringZeroBoundsValueCodec` * `FacetValueStringCodec` * `FieldDocIdFacetStringCodec` * `FieldDocIdFacetF64Codec` They were replaced by: * `FacetGroupKeyCodec<C>` (replaces all key codecs for the facet databases) * `FacetGroupValueCodec` (replaces all value codecs for the facet databases) * `FieldDocIdFacetCodec<C>` (replaces `FieldDocIdFacetStringCodec` and `FieldDocIdFacetF64Codec`) Since the associated encoded item of `FacetGroupKeyCodec<C>` is `FacetKey<T>` and we often work with `FacetKey<&[u8]>` and `FacetKey<&str>`, then we need to have codecs that encode values of type `&str` and `&[u8]`. The existing `ByteSlice` and `Str` codecs do not work for that purpose (their `EItem` are `[u8]` and `str`), I have also created two new codecs: * `ByteSliceRef` is a codec with a `EItem = DItem = &[u8]` * `StrRefCodec` is a codec with a `EItem = DItem = &str` I have also factored out the code used to encode an ordered f64 into its own `OrderedF64Codec`. Co-authored-by: Loïc Lecrenier <loic@meilisearch.com>	2022-10-26 15:04:53 +00:00

1 2 3 4 5 ...

2250 Commits