meilisearch

mirror of https://github.com/meilisearch/meilisearch.git synced 2024-11-30 09:04:59 +08:00

Author	SHA1	Message	Date
Clément Renault	aadbe88048	Return an internal error when a field id is missing	2023-06-28 15:01:50 +02:00
Clément Renault	f36de2115f	Make clippy happy	2023-06-28 15:01:50 +02:00
Clément Renault	702041b7e1	Improve the returned errors from the facet-search route	2023-06-28 15:01:48 +02:00
Clément Renault	a05074e675	Fix the max number of facets to be returned to 100	2023-06-28 14:58:42 +02:00
Clément Renault	93f30e65a9	Return the correct response JSON object from the facet-search route	2023-06-28 14:58:42 +02:00
Clément Renault	e81809aae7	Make the search for facet work	2023-06-28 14:58:41 +02:00
Kerollmops	ce7e7f12c8	Introduce the facet search route	2023-06-28 14:58:41 +02:00
Kerollmops	addb21f110	Restrict the number of facet search results to 1000	2023-06-28 14:58:41 +02:00
Kerollmops	c34de05106	Introduce the SearchForFacetValue struct	2023-06-28 14:58:41 +02:00
Clément Renault	15a4c05379	Store the facet string values in multiple FSTs	2023-06-28 14:58:41 +02:00
meili-bors[bot]	d4f10800f2	Merge #3834 3834: Define searchable fields at runtime r=Kerollmops a=ManyTheFish ## Summary This feature allows the end-user to search in one or multiple attributes using the search parameter `attributesToSearchOn`: ```json { "q": "Captain Marvel", "attributesToSearchOn": ["title"] } ``` This feature acts like a filter, forcing Meilisearch to only return the documents containing the requested words in the attributes-to-search-on. Note that, with the matching strategy `last`, Meilisearch will only ensure that the first word is in the attributes-to-search-on, but the retrieved documents will be ordered taking into account the words contained in the attributes-to-search-on. ## Trying the prototype A dedicated docker image has been released for this feature: #### last prototype version: ```bash docker pull getmeili/meilisearch:prototype-define-searchable-fields-at-search-time-1 ``` #### other prototype versions: ```bash docker pull getmeili/meilisearch:prototype-define-searchable-fields-at-search-time-0 ``` ## Technical Detail The attributes-to-search-on list is given to the search context, then the search context uses the `fid_word_docids` database using only the allowed field ids instead of the global `word_docids` database. This is the same for the prefix databases. The database cache is updated with the merged values, meaning that the union of the field-id-database values is only made if the requested key is missing from the cache. ### Relevancy limits Almost all ranking rules behave as expected when ordering the documents. Only `proximity` could misorder documents if all the searched words are in the restricted attribute but a better proximity is found in an ignored attribute in a document that should be ranked lower. 
Below is a failing test showing it: ```rust #[actix_rt::test] async fn proximity_ranking_rule_order() { let server = Server::new().await; let index = index_with_documents( &server, &json!([ { "title": "Captain super mega cool. A Marvel story", // Perfect distance between words in an ignored attribute "desc": "Captain Marvel", "id": "1", }, { "title": "Captain America from Marvel", "desc": "a Shazam ersatz", "id": "2", }]), ) .await; // Document 2 should appear before document 1. index .search(json!({"q": "Captain Marvel", "attributesToSearchOn": ["title"], "attributesToRetrieve": ["id"]}), |response, code| { assert_eq!(code, 200, "{}", response); assert_eq!( response["hits"], json!([ {"id": "2"}, {"id": "1"}, ]) ); }) .await; } ``` Fixing this would force us to create the `fid_word_pair_proximity_docids` and `fid_word_prefix_pair_proximity_docids` databases, which may multiply the keys of `word_pair_proximity_docids` and `word_prefix_pair_proximity_docids` by the number of attributes in the searchable_attributes list. If we think we should fix this test, I suggest doing it in another PR. ## Related Fixes #3772 Co-authored-by: Tamo <tamo@meilisearch.com> Co-authored-by: ManyTheFish <many@meilisearch.com>	2023-06-28 08:19:23 +00:00
Clément Renault	30741d17fa	Change the TODO message	2023-06-27 12:32:43 +02:00
Clément Renault	ebad1f396f	Remove the useless euclidean distance implementation	2023-06-27 12:32:43 +02:00
Clément Renault	29d8268c94	Fix the vector query part by using the correct universe	2023-06-27 12:32:43 +02:00
Clément Renault	63bfe1cee2	Ignore when there are too many vectors	2023-06-27 12:32:43 +02:00
Kerollmops	7c2f5f77b8	Make clippy and fmt happy	2023-06-27 12:32:42 +02:00
Kerollmops	66b8cfd8c8	Introduce a way to store the HNSW on multiple LMDB entries	2023-06-27 12:32:42 +02:00
Kerollmops	ff3664431f	Make rustfmt happy	2023-06-27 12:32:42 +02:00
Kerollmops	531748c536	Return a user error when the _vectors type is invalid	2023-06-27 12:32:41 +02:00
Kerollmops	7aa1275337	Display the _semanticSimilarity even if the `_vectors` field is not displayed	2023-06-27 12:32:41 +02:00
Kerollmops	737aec1705	Expose an _semanticSimilarity as a dot product in the documents	2023-06-27 12:32:41 +02:00
Kerollmops	3e3c743392	Make Rustfmt happy	2023-06-27 12:32:41 +02:00
Kerollmops	5c5a4e075d	Make clippy happy	2023-06-27 12:32:41 +02:00
Kerollmops	ab9f2269aa	Normalize the vectors during indexation and search	2023-06-27 12:32:41 +02:00
Kerollmops	321ec5f3fa	Accept multiple vectors by documents using the _vectors field	2023-06-27 12:32:40 +02:00
Kerollmops	717d4fddd4	Remove the unused distance	2023-06-27 12:32:40 +02:00
Kerollmops	a7e0f0de89	Introduce a new error message for invalid vector dimensions	2023-06-27 12:32:40 +02:00
Kerollmops	3b560ef7d0	Make clippy happy	2023-06-27 12:32:40 +02:00
Kerollmops	2cf747cb89	Fix the tests	2023-06-27 12:32:40 +02:00
Kerollmops	3c31e1cdd1	Support more pages but in an ugly way	2023-06-27 12:32:39 +02:00
Kerollmops	23eaaf1001	Change the name of the distance module	2023-06-27 12:32:39 +02:00
Kerollmops	c2a402f3ae	Implement an ugly deletion of values in the HNSW	2023-06-27 12:32:39 +02:00
Kerollmops	436a10bef4	Replace the euclidean with a dot product	2023-06-27 12:32:39 +02:00
Kerollmops	8debf6fe81	Use a basic euclidean distance function	2023-06-27 12:32:39 +02:00
Kerollmops	c79e82c62a	Move back to the hnsw crate This reverts commit 7a4b6c065482f988b01298642f4c18775503f92f.	2023-06-27 12:32:39 +02:00
Kerollmops	aca305bb77	Log more to make sure we insert vectors in the hgg data-structure	2023-06-27 12:32:38 +02:00
Kerollmops	5816008139	Introduce an optimized version of the euclidean distance function	2023-06-27 12:32:38 +02:00
Kerollmops	268a9ef416	Move to the hgg crate	2023-06-27 12:32:38 +02:00
Clément Renault	642b0f3a1b	Expose a new vector field on the search route	2023-06-27 12:32:38 +02:00
Clément Renault	4571e512d2	Store the vectors in an HNSW in LMDB	2023-06-27 12:32:38 +02:00
Clément Renault	7ac2f1489d	Extract the vectors from the documents	2023-06-27 12:32:37 +02:00
Clément Renault	34349faeae	Create a new _vector extractor	2023-06-27 12:32:37 +02:00
ManyTheFish	63ca25290b	Take into account small Review requests	2023-06-26 14:56:19 +02:00
ManyTheFish	59f64a5256	Return an error when an attribute is not searchable	2023-06-26 14:56:19 +02:00
ManyTheFish	42709ea9a5	Fix clippy warnings	2023-06-26 14:55:57 +02:00
ManyTheFish	fb8fa07169	Restrict field ids in search context	2023-06-26 14:55:57 +02:00
ManyTheFish	0ccf1e2e40	Allow the search cache to store owned values	2023-06-26 14:55:57 +02:00
ManyTheFish	9680e1e41f	Introduce a BytesDecodeOwned trait in heed_codecs	2023-06-26 14:55:14 +02:00
ManyTheFish	461b5118bd	Add API search setting	2023-06-26 14:55:14 +02:00
Tamo	a3716c5678	add the new parameter to the search builder of milli	2023-06-26 14:55:14 +02:00
meili-bors[bot]	2d34005965	Merge #3821 3821: Add normalized and detailed scores to documents returned by a query r=dureuill a=dureuill # Pull Request ## Related issue Fixes #3771 ## What does this PR do? ### User standpoint <details> <summary>Request ranking score</summary> ``` echo '{ "q": "Badman dark knight returns", "showRankingScore": true, "limit": 10, "attributesToRetrieve": ["title"] }' | mieli search -i index-word-count-10-count ``` </details> <details> <summary>Response</summary> ```json { "hits": [ { "title": "Batman: The Dark Knight Returns, Part 1", "_rankingScore": 0.947520325203252 }, { "title": "Batman: The Dark Knight Returns, Part 2", "_rankingScore": 0.947520325203252 }, { "title": "Batman Unmasked: The Psychology of the Dark Knight", "_rankingScore": 0.6657594086021505 }, { "title": "Legends of the Dark Knight: The History of Batman", "_rankingScore": 0.6654905913978495 }, { "title": "Angel and the Badman", "_rankingScore": 0.2196969696969697 }, { "title": "Angel and the Badman", "_rankingScore": 0.2196969696969697 }, { "title": "Batman", "_rankingScore": 0.11553030303030302 }, { "title": "Batman Begins", "_rankingScore": 0.11553030303030302 }, { "title": "Batman Returns", "_rankingScore": 0.11553030303030302 }, { "title": "Batman Forever", "_rankingScore": 0.11553030303030302 } ], "query": "Badman dark knight returns", "processingTimeMs": 12, "limit": 10, "offset": 0, "estimatedTotalHits": 46 } ``` </details> - When the `showRankingScore` parameter is added to the search query, documents returned by a search now contain an additional field `_rankingScore` that is a float greater than 0 and less than or equal to 1.0. This field represents the relevancy of the document, relative to the search query and the settings of the index, with 1.0 meaning "perfect match" and 0 meaning "not matching the query" (Meilisearch should never return documents not matching the query at all). 
- The `sort` and `geosort` ranking rules do not influence the `_rankingScore`. <details> <summary>Request detailed ranking scores</summary> ``` echo '{ "q": "Badman dark knight returns", "showRankingScoreDetails": true, "limit": 5, "attributesToRetrieve": ["title"] }' | mieli search -i index-word-count-10-count ``` </details> <details> <summary>Response</summary> ```json { "hits": [ { "title": "Batman: The Dark Knight Returns, Part 1", "_rankingScoreDetails": { "words": { "order": 0, "matchingWords": 4, "maxMatchingWords": 4, "score": 1.0 }, "typo": { "order": 1, "typoCount": 1, "maxTypoCount": 4, "score": 0.8 }, "proximity": { "order": 2, "score": 0.9545454545454546 }, "attribute": { "order": 3, "attributes_ranking_order": 1.0, "attributes_query_word_order": 0.926829268292683, "score": 0.926829268292683 }, "exactness": { "order": 4, "matchType": "noExactMatch", "score": 0.26666666666666666 } } }, { "title": "Batman: The Dark Knight Returns, Part 2", "_rankingScoreDetails": { "words": { "order": 0, "matchingWords": 4, "maxMatchingWords": 4, "score": 1.0 }, "typo": { "order": 1, "typoCount": 1, "maxTypoCount": 4, "score": 0.8 }, "proximity": { "order": 2, "score": 0.9545454545454546 }, "attribute": { "order": 3, "attributes_ranking_order": 1.0, "attributes_query_word_order": 0.926829268292683, "score": 0.926829268292683 }, "exactness": { "order": 4, "matchType": "noExactMatch", "score": 0.26666666666666666 } } }, { "title": "Batman Unmasked: The Psychology of the Dark Knight", "_rankingScoreDetails": { "words": { "order": 0, "matchingWords": 3, "maxMatchingWords": 4, "score": 0.75 }, "typo": { "order": 1, "typoCount": 1, "maxTypoCount": 3, "score": 0.75 }, "proximity": { "order": 2, "score": 0.6666666666666666 }, "attribute": { "order": 3, "attributes_ranking_order": 1.0, "attributes_query_word_order": 0.8064516129032258, "score": 0.8064516129032258 }, "exactness": { "order": 4, "matchType": "noExactMatch", "score": 0.25 } } }, { "title": "Legends of the Dark Knight: The History of Batman", "_rankingScoreDetails": { "words": { "order": 0, "matchingWords": 3, "maxMatchingWords": 4, "score": 0.75 }, "typo": { "order": 1, "typoCount": 1, "maxTypoCount": 3, "score": 0.75 }, "proximity": { "order": 2, "score": 0.6666666666666666 }, "attribute": { "order": 3, "attributes_ranking_order": 1.0, "attributes_query_word_order": 0.7419354838709677, "score": 0.7419354838709677 }, "exactness": { "order": 4, "matchType": "noExactMatch", "score": 0.25 } } }, { "title": "Angel and the Badman", "_rankingScoreDetails": { "words": { "order": 0, "matchingWords": 1, "maxMatchingWords": 4, "score": 0.25 }, "typo": { "order": 1, "typoCount": 0, "maxTypoCount": 1, "score": 1.0 }, "proximity": { "order": 2, "score": 1.0 }, "attribute": { "order": 3, "attributes_ranking_order": 1.0, "attributes_query_word_order": 0.8181818181818182, "score": 0.8181818181818182 }, "exactness": { "order": 4, "matchType": "noExactMatch", "score": 0.3333333333333333 } } } ], "query": "Badman dark knight returns", "processingTimeMs": 9, "limit": 5, "offset": 0, "estimatedTotalHits": 46 } ``` </details> - When the `showRankingScoreDetails` parameter is added to the search query, the returned documents will now contain an additional `_rankingScoreDetails` field that is a JSON object containing one field per ranking rule that was applied, whose value is a JSON object with the following fields: - `order`: a number indicating the order in which this rule was applied (0 is the first applied ranking rule) - `score` (except for `sort` and `geosort`): a float indicating how the document matched this particular rule. - other fields that are specific to the rule, indicating for example how many words matched for a document and how many typos were counted in a matching document. - If the `displayableAttributes` list is defined in the settings of the index, any ranking rule using an attribute not part of that list will be marked as `<hidden-rule>` in the `_rankingScoreDetails`. 
- Search queries that are part of a `multi-search` request are modified in the same way and each of the queries can take the `showRankingScore` and `showRankingScoreDetails` parameters independently. The results are still returned in separate lists and providing a unified list of results between multiple queries is not in the scope of this PR (but is unblocked by this PR and can be done manually by using the scores of the various documents). ### Implementation standpoint - Fix a difference in how the position of terms was computed at indexing time and query time: this difference meant that a query containing a hard separator would fail the exactness check. - Fix the id reported by the sort ranking rule (very minor) - Change how the cost of removing words is computed. After this change the cost no longer works for any other ranking rule than `words`. Also made `words` have a cost of 0 such that the entire cost of `words` is given by the termRemovalStrategy. The new cost computation makes it so the score is computed in a way consistent with the number of words in the query. Additionally, the words that appear in phrases in the query are also counted as matching words. - When any score computation is requested through `showRankingScore` or `showRankingScoreDetails`, remove the optimization where ranking rules are not executed on buckets of a single document: this is important to allow the computation of an accurate score. - Add virtual conditions to fid and position to always have the max cost: this ensures that the score is independent from the dataset - The Position ranking rule now takes into account the distance to the position of the word in the query instead of the distance to the position 0. - Modified the proximity ranking rule cost calculation so that the cost is 0 for documents that are perfectly matching the query - Add a new `milli::score_details` module containing all the types that are involved in score computation. 
- Make it so a bucket of result now contains a `ScoreDetails` and changed the ranking rules to produce their `ScoreDetails`. - Expose the scores in the REST API. - Add very light analytics for scoring. - Update the search tests to add the expected scores. Co-authored-by: Louis Dureuil <louis@meilisearch.com>	2023-06-26 09:32:43 +00:00
meili-bors[bot]	040b5a5b6f	Merge #3842 3842: fix some typos r=dureuill a=cuishuang # Pull Request ## Related issue Fixes #<issue_number> ## What does this PR do? - fix some typos ## PR checklist Please check if your PR fulfills the following requirements: - [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)? - [x] Have you read the contributing guidelines? - [x] Have you made sure that the title is accurate and descriptive of the changes? Thank you so much for contributing to Meilisearch! Co-authored-by: cui fliter <imcusg@gmail.com>	2023-06-22 18:01:10 +00:00
cui fliter	530a3e2df3	fix some typos Signed-off-by: cui fliter <imcusg@gmail.com>	2023-06-22 21:59:00 +08:00
Louis Dureuil	d26e9a96ec	Add score details to new search tests	2023-06-22 12:39:14 +02:00
Louis Dureuil	49c8bc4de6	Fix tests	2023-06-22 12:39:14 +02:00
Louis Dureuil	da833eb095	Expose the scores and detailed scores in the API	2023-06-22 12:39:14 +02:00
Louis Dureuil	701d44bd91	Store the scores for each bucket Remove optimization where ranking rules are not executed on buckets of a single document when the score needs to be computed	2023-06-22 12:39:14 +02:00
Louis Dureuil	c621a250a7	Score for graph based ranking rules Count phrases in matchingWords and maxMatchingWords	2023-06-22 12:39:14 +02:00
Louis Dureuil	8939e85f60	Add rank_to_score for graph based ranking rules	2023-06-22 12:39:14 +02:00
Louis Dureuil	fa41d2489e	Score for sort	2023-06-22 12:39:14 +02:00
Louis Dureuil	59c5b992c2	Score for geosort	2023-06-22 12:39:14 +02:00
Louis Dureuil	2ea8194c18	Score for exact_attributes	2023-06-22 12:39:14 +02:00
Louis Dureuil	421df64602	RankingRuleOutput now contains a Score	2023-06-22 12:39:14 +02:00
Louis Dureuil	c0fca6f884	Add score_details	2023-06-22 12:39:14 +02:00
Louis Dureuil	f050634b1e	add virtual conditions to fid and position to always have the max cost	2023-06-20 10:07:18 +02:00
Louis Dureuil	becf1f066a	Change how the cost of removing words is computed	2023-06-20 09:45:43 +02:00
Louis Dureuil	701d299369	Remove out-of-date comment	2023-06-20 09:45:42 +02:00
Louis Dureuil	a20e4d447c	Position now takes into account the distance to the position of the word in the query; it used to be based on the distance to the position 0	2023-06-20 09:45:42 +02:00
Louis Dureuil	af57c3c577	Proximity costs 0 for documents that are perfectly matching	2023-06-20 09:45:42 +02:00
Louis Dureuil	0c40ef6911	Fix sort id	2023-06-20 09:45:42 +02:00
meili-bors[bot]	45636d315c	Merge #3670 3670: Fix addition deletion bug r=irevoire a=irevoire The first commit of this PR is a revert of https://github.com/meilisearch/meilisearch/pull/3667. It re-enables the auto-batching of addition and deletion of tasks. No new changes have been introduced outside of `milli`. So all the changes you see on the autobatcher have actually already been reviewed. It fixes https://github.com/meilisearch/meilisearch/issues/3440. ### What was happening? The issue was that the `external_documents_ids` generated in the `transform` were used in a very strange way that wasn’t compatible with the deletion of documents. Instead of doing a clear merge between the external document IDs of the DB and the ones returned by the transform + writing it on disk, we were doing some weird tricks with the soft-deleted to avoid writing the fst on disk as much as possible. The new algorithm may be a bit slower but is way more straightforward and doesn’t change depending on whether the soft deletion was used or not. Here is a list of the changes introduced: 1. We now make a clear distinction between the `new_external_documents_ids` coming from the transform and only held in RAM and the `external_documents_ids` coming from the DB. 2. The `new_external_documents_ids` (coming out of the transform) are now represented as an `fst`. We don't need to struggle with the hard, soft distinction + the soft_deleted => That's easier to understand 3. When indexing documents, we merge the `external_documents_ids` coming from the DB and the `new_external_documents_ids` coming from the transform. ### Other things introduced in this PR Since we constantly have to write small, very specialized fuzzers for this kind of bug, we decided to push the one used to reproduce this bug. It's not perfect, but it's easy to improve in the future. It'll also run for as long as possible on every merge on the main branch. 
Co-authored-by: Tamo <tamo@meilisearch.com> Co-authored-by: Loïc Lecrenier <loic.lecrenier@icloud.com>	2023-06-19 09:09:30 +00:00
meili-bors[bot]	cb9d78fc7f	Merge #3835 3835: Add more documentation to graph-based ranking rule algorithms + comment cleanup r=Kerollmops a=loiclec In addition to documenting the `cheapest_path.rs` file, this PR cleans up a few outdated comments as well as some TODOs. These TODOs have been moved to https://github.com/meilisearch/meilisearch/issues/3776 Co-authored-by: Loïc Lecrenier <loic.lecrenier@icloud.com>	2023-06-15 15:30:24 +00:00
Louis Dureuil	e0c4682758	Fix tests	2023-06-14 13:30:52 +02:00
Louis Dureuil	d9b4b39922	Add trailing pipe to the snapshots so it doesn't end with trailing whitespace	2023-06-14 13:30:52 +02:00
Loïc Lecrenier	2da86b31a6	Remove comments and add documentation	2023-06-14 12:39:42 +02:00
Louis Dureuil	a2a3b8c973	Fix offset difference between query and indexing for hard separators	2023-06-08 12:07:12 +02:00
Louis Dureuil	9f37b61666	DB BREAKING: raise limit of word count from 10 to 30.	2023-06-08 12:07:12 +02:00
Louis Dureuil	c15c076da9	DB BREAKING: Count the number of words in field_id_word_count_docids	2023-06-08 12:07:11 +02:00
Loïc Lecrenier	8628a0c856	Remove docid_word_positions_db + fix deletion bug That would happen when a word was deleted from all exact attributes but not all regular attributes.	2023-06-07 10:52:50 +02:00
Clémentine U. - curqui	f3e2f79290	Merge branch 'main' into tmp-release-v1.2.0	2023-06-05 18:36:28 +02:00
Kerollmops	da04edff8c	Better use deserialize_unchecked_from to reduce the deserialization time	2023-05-30 14:58:30 +02:00
Tamo	23a5b45ebf	drop the old fuzz file	2023-05-29 14:02:37 +02:00
Tamo	6c6387d05e	move the fuzzer to its own crate	2023-05-29 12:27:39 +02:00
Louis Dureuil	1dfc4038ab	Add test that fails before PR and passes now	2023-05-29 11:58:26 +02:00
Louis Dureuil	73198179f1	Consistently use wrapping add to avoid overflow in debug when query starts with a separator	2023-05-29 11:54:12 +02:00
meili-bors[bot]	2e49d6aec1	Merge #3768 3768: Fix bugs in graph-based ranking rules + make `words` a graph-based ranking rule r=dureuill a=loiclec This PR contains three changes: ## 1. Don't call the `words` ranking rule if the term matching strategy is `All` This is because the purpose of `words` is only to remove nodes from the query graph. It would never do any useful work when the matching strategy was `All`. Remember that the universe was already computed before by computing all the docids corresponding to the "maximally reduced" query graph, which, in the case of `All`, is equal to the original graph. ## 2. The `words` ranking rule is replaced by a graph-based ranking rule. This is for three reasons: 1. performance: graph-based ranking rules benefit from a lot of optimisations by default, which ensures that they are never too slow. The previous implementation of `words` could call `compute_query_graph_docids` many times if some words had to be removed from the query, which would be quite expensive. I was especially worried about its performance in cases where it is placed right after the `sort` ranking rule. Furthermore, `compute_query_graph_docids` would clone a lot of bitmaps many times unnecessarily. 2. consistency: every other ranking rule (except `sort`) is graph-based. It makes sense to implement `words` like that as well. It will automatically benefit from all the features, optimisations, and bug fixes that all the other ranking rules get. 3. surfacing bugs: as the first ranking rule to be called (most of the time), I'd like `words` to behave the same as the other ranking rules so that we can quickly detect bugs in our graph algorithms. This actually already happened, which is why this PR also contains a bug fix. ## 3. Fix the `update_all_costs_before_nodes` function It is a bit difficult to explain what was wrong, but I'll try. 
The bug happened when we had graphs like: <img width="730" alt="Screenshot 2023-05-16 at 10 58 57" src="https://github.com/meilisearch/meilisearch/assets/6040237/40db1a68-d852-4e89-99d5-0d65757242a7"> and we gave the node `is` as argument. Then, we'd walk backwards from the node breadth-first. We'd update the costs of: 1. `sun` 2. `thesun` 3. `start` 4. `the` which is an incorrect order. The correct order is: 1. `sun` 2. `thesun` 3. `the` 4. `start` That is, we can only update the cost of a node when all of its successors have either already been visited or were not affected by the update to the node passed as argument. To solve this bug, I factored out the graph-traversal logic into a `traverse_breadth_first_backward` function. Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com> Co-authored-by: Louis Dureuil <louis@meilisearch.com>	2023-05-23 13:28:08 +00:00
Louis Dureuil	51043f78f0	Remove trailing whitespace	2023-05-23 15:27:25 +02:00
Louis Dureuil	a490a11325	Add explanatory comment on the way we're recomputing costs	2023-05-23 15:24:24 +02:00
Tamo	002f42875f	fix the fuzzer	2023-05-23 11:42:40 +02:00
Tamo	22213dc604	push the fuzzer	2023-05-23 09:14:26 +02:00
Tamo	602ad98cb8	improve the way we handle the fsts	2023-05-22 11:15:14 +02:00
Tamo	7f619ff0e4	get rid of the now unused soft_deletion_used parameter	2023-05-22 10:33:49 +02:00
Tamo	4391cba6ca	fix the addition + deletion bug	2023-05-17 18:28:57 +02:00
meili-bors[bot]	101f5a20d2	Merge #3757 3757: Adjust the cost of edges in the `position` ranking rule by bucketing positions more aggressively r=loiclec a=loiclec This PR significantly improves the performance of the `position` ranking rule when: 1. a query contains many words 2. the `position` ranking rule needs to be called many times 3. the score of the documents according to `position` is high These conditions greatly increase: 1. the number of edge traversals that are needed to find a valid path from the `start` node to the `end` node 2. the number of edges that need to be deleted from the graph, and therefore the number of times that we need to recompute all the possible costs from START to END As a result, a majority of the search time is spent in `visit_condition`, `visit_node`, and `update_all_costs_before_node`. This is frustrating because it often happens when the "universe" given to the rule consists of only a handful of document ids. By limiting the number of possible edges between two nodes from `20` to `10`, we: 1. reduce the number of possible costs from START to END 2. reduce the number of edges that will be deleted 3. make it faster to update the costs after deleting an edge 4. reduce the number of buckets that need to be computed In terms of relevancy, I don't think we lose or gain much. We still prefer terms that are in lower positions, with decreasing precision as we go further. The previous choice of bucketing wasn't chosen in a principled way, and neither is this one. They both "feel" right to me. Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com> Co-authored-by: meili-bors[bot] <89034592+meili-bors[bot]@users.noreply.github.com>	2023-05-17 11:43:59 +00:00
Loïc Lecrenier	ec8f685d84	Fix bug in cheapest path algorithm	2023-05-16 17:01:30 +02:00
Loïc Lecrenier	5758268866	Don't compute split_words for phrases	2023-05-16 17:01:18 +02:00
Loïc Lecrenier	3e19702de6	Update snapshot tests	2023-05-16 12:22:46 +02:00
meili-bors[bot]	1e762d151f	Merge #3755 3755: Re-add final dot r=curquiza a=ManyTheFish I removed the final dot of the error message in my last PR; this one re-adds it. Related to https://github.com/meilisearch/meilisearch/pull/3749 > Oops 😬 Co-authored-by: ManyTheFish <many@meilisearch.com>	2023-05-16 10:10:58 +00:00
Loïc Lecrenier	f6524a6858	Adjust costs of edges in position ranking rule To ensure good performance	2023-05-16 11:28:56 +02:00
meili-bors[bot]	65ad8cce36	Merge #3741 3741: Add ngram support to the highlighter r=ManyTheFish a=loiclec This PR fixes a bug introduced by the search refactor, where ngrams were not highlighted. The solution was to add the ngrams to the vector of `LocatedQueryTerm` that is given to the `MatchingWords` structure. Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com>	2023-05-16 09:03:31 +00:00
ManyTheFish	42650f82e8	Re-add final dot	2023-05-16 10:57:26 +02:00
Loïc Lecrenier	a37da36766	Implement `words` as a graph-based ranking rule and fix some bugs	2023-05-16 10:42:11 +02:00
Loïc Lecrenier	85d96d35a8	Highlight ngram matches as well	2023-05-16 10:39:36 +02:00
meili-bors[bot]	bf66e97b48	Merge #3749 3749: Fix back: sort error message r=ManyTheFish a=ManyTheFish This PR reintroduces the error message modified in https://github.com/meilisearch/milli/pull/375. However, this added double-quotes around `sort` in the message. I don't think another message contains double-quotes, so I have added a separate commit replacing the double-quotes with back-ticks, which seems more consistent with the other error messages, this last change can be reverted easily. ## Detailed changes #### v1.2-rc0 ``` The sort ranking rule must be specified in the ranking rules settings to use the sort parameter at search time. ``` #### [Reintroduce fix (previous and expected behavior)](`23d1c86825`) ``` You must specify where "sort" is listed in the rankingRules setting to use the sort parameter at search time ``` #### [Replace double-quotes with back-ticks (my suggestion)](`4d691d071a`) ``` You must specify where `sort` is listed in the rankingRules setting to use the sort parameter at search time ``` ## Related Fixes #3722 ## Reviewers - technical review: `@irevoire` - to validate the replacement: `@macraig` Co-authored-by: ManyTheFish <many@meilisearch.com>	2023-05-15 14:55:51 +00:00
Kerollmops	1a79fd0c3c	Use the new heed v0.12.6	2023-05-15 11:42:30 +02:00
Kerollmops	f759ec7fad	Expose a flag to enable the MDB_WRITEMAP flag	2023-05-15 11:38:43 +02:00
ManyTheFish	4d691d071a	Change double-quotes by back-ticks in sort error message	2023-05-15 11:10:36 +02:00
ManyTheFish	23d1c86825	Re-introduce the sort error message fix	2023-05-15 11:07:23 +02:00
Kerollmops	c4a40e7110	Use the writemap flag to reduce the memory usage	2023-05-15 10:15:33 +02:00
Loïc Lecrenier	4d352a21ac	Compute split words derivations of terms that don't accept typos	2023-05-10 13:31:19 +02:00
Loïc Lecrenier	3625389057	Highlight ngram matches as well	2023-05-08 15:35:41 +02:00
meili-bors[bot]	eace6df91b	Merge #3726 3726: Fix prefix highlighting r=loiclec a=ManyTheFish The prefix queries were not properly highlighted; this PR now highlights only the start of a word when it matches a prefix. Co-authored-by: ManyTheFish <many@meilisearch.com> Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com>	2023-05-08 07:46:46 +00:00
Loïc Lecrenier	83ab8cf4e5	Remove dbg!(..) expression in highlighter tests	2023-05-08 09:45:23 +02:00
ManyTheFish	cd2573fcc3	Fix prefix highlighting	2023-05-04 16:53:50 +02:00
meili-bors[bot]	9f7981df28	Merge #3687 3687: Allow to disable specialized tokenizations (again) r=Kerollmops a=jirutka In PR #2773, I added the `chinese`, `hebrew`, `japanese` and `thai` feature flags to allow Meilisearch to be built without huge specialized tokenizations that took up 90% of the Meilisearch binary size. Unfortunately, due to some recent changes, this doesn't work anymore. The problem lies in excessive use of the `default` feature flag, which infects the dependency graph. Instead of adding `default-features = false` here and there, it's easier and more future-proof to not declare `default` in `milli` and `meilisearch-types`. I've renamed it to `all-tokenizers`, which also makes it a bit clearer what it's about. Co-authored-by: Jakub Jirutka <jakub@jirutka.cz>	2023-05-04 14:48:01 +00:00
Jakub Jirutka	e615fa5ec6	Fix unused_imports warning in milli when japanese is not enabled	2023-05-04 15:46:11 +02:00
Jakub Jirutka	13f1277637	Allow to disable specialized tokenizations (again) In PR #2773, I added the `chinese`, `hebrew`, `japanese` and `thai` feature flags to allow Meilisearch to be built without the huge specialized tokenizations that took up 90% of the Meilisearch binary size. Unfortunately, due to some recent changes, this doesn't work anymore. The problem lies in excessive use of the `default` feature flag, which infects the dependency graph. Instead of adding `default-features = false` here and there, it's easier and more future-proof to not declare `default` in `milli` and `meilisearch-types`. I've renamed it to `all-tokenizers`, which also makes it a bit clearer what it's about.	2023-05-04 15:45:40 +02:00
Louis Dureuil	a35d3fc708	Add Index::iter_documents	2023-05-04 15:31:54 +02:00
Louis Dureuil	732c52093d	Processing time without autobatching implementation	2023-05-03 17:41:48 +02:00
Louis Dureuil	f8f190cd40	Update exactness tests following charabia camelCase tokenization	2023-05-03 14:45:09 +02:00
Louis Dureuil	3a408e8287	Increase map size for tests following charabia camelCase tokenization	2023-05-03 14:44:48 +02:00
Louis Dureuil	d3e5b10e23	fix nb of dbs	2023-05-03 14:11:20 +02:00
Louis Dureuil	1aaf24ccbf	Cargo fmt	2023-05-03 12:21:58 +02:00
Louis Dureuil	90bc230820	Merge remote-tracking branch 'origin/main' into search-refactor. Conflict resolutions: Cargo.lock: added mimalloc; Cargo.toml: took origin/main version; milli/src/search/criteria/exactness.rs: deleted after checking it was only clippy changes; milli/src/search/query_tree.rs: deleted after checking it was only clippy changes	2023-05-03 12:19:06 +02:00
Louis Dureuil	342c4ff85d	geosort: Remove rtree unwrap	2023-05-03 09:52:16 +02:00
Tamo	c85392ce40	make the descendent geosort fast	2023-05-03 09:13:12 +02:00
Tamo	8875d24a48	deserialize the rtree only when it's needed, and keep it in memory once it has been deserialized	2023-05-03 09:13:12 +02:00
Tamo	c470b67fa2	revamp the test to use execute_iterative_and_rtree_returns_the_same	2023-05-03 09:13:12 +02:00
meili-bors[bot]	c0e081cd98	Merge #3702 #3710 3702: Update charabia v0.7.2 r=curquiza a=ManyTheFish fixes #3701 fixes #3689 fixes #3285 3710: Updated messages pointing to the docs website r=curquiza a=roy9495 # Pull Request Fixes partially #3668 ## What does this PR do? - Any messages referencing the docs site https://docs.meilisearch.com have been changed to reference https://meilisearch.com/docs . Thanks. ## PR checklist Please check if your PR fulfills the following requirements: - [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)? - [x] Have you read the contributing guidelines? - [x] Have you made sure that the title is accurate and descriptive of the changes? Thank you so much for contributing to Meilisearch! Co-authored-by: ManyTheFish <many@meilisearch.com> Co-authored-by: TATHAGATA ROY <98920199+roy9495@users.noreply.github.com>	2023-05-02 17:27:57 +00:00
Louis Dureuil	b60840ebff	Remove self.iterating from words	2023-05-02 18:54:23 +02:00
Louis Dureuil	fdc1763838	Use MultiOps for resolve_query_graph	2023-05-02 18:54:09 +02:00
Louis Dureuil	75819bc940	Remove too many arguments on resolve_maximally_reduced_query_graph	2023-05-02 18:53:40 +02:00
Louis Dureuil	7b8cc25625	rename located_query_terms_from_string -> located_query_terms_from_tokens	2023-05-02 18:53:01 +02:00
Loïc Lecrenier	aa63091752	Fix bug in exact_attribute	2023-05-02 10:48:32 +02:00
Loïc Lecrenier	58735d6d8f	Fix outdated relevancy test	2023-05-02 10:48:32 +02:00
Loïc Lecrenier	1b514517f5	Fix bug in computation of query term at a position	2023-05-02 10:48:32 +02:00
Loïc Lecrenier	11f814821d	Minor cleanup	2023-05-02 10:48:32 +02:00
Loïc Lecrenier	30fb1153cc	Speed up graph based ranking rule when a lot of different costs exist	2023-05-02 09:59:42 +02:00
Loïc Lecrenier	3b2c8b9f25	Improve performance of position rr	2023-05-02 09:59:42 +02:00
Loïc Lecrenier	2a7f9adf78	Build query graph more correctly from paths Update snapshots	2023-05-02 09:59:42 +02:00
Loïc Lecrenier	608ceea440	Fix bug in position rr	2023-05-02 09:59:42 +02:00
Loïc Lecrenier	79001b9c97	Improve performance of the cheapest path finder algorithm	2023-05-02 09:59:42 +02:00
Loïc Lecrenier	59b12fca87	Fix errors, clippy warnings, and add review comments	2023-04-29 11:48:11 +02:00
Loïc Lecrenier	48f5bb1693	Implements the geo-sort ranking rule	2023-04-29 11:02:16 +02:00
Loïc Lecrenier	93188b3c88	Fix indexing of word_prefix_fid_docids	2023-04-29 10:56:48 +02:00
Loïc Lecrenier	bc4efca611	Add more tests for the attribute ranking rule	2023-04-29 10:56:48 +02:00
bors[bot]	414b3fae89	Merge #3571 3571: Introduce two filters to select documents with `null` and empty fields r=irevoire a=Kerollmops # Pull Request ## Related issue This PR implements the `X IS NULL`, `X IS NOT NULL`, `X IS EMPTY`, `X IS NOT EMPTY` filters that [this comment](https://github.com/meilisearch/product/discussions/539#discussioncomment-5115884) is describing in a very detailed manner. ## What does this PR do? ### `IS NULL` and `IS NOT NULL` This PR will be exposed as a prototype for now. Below is the copy/pasted version of a spec that defines this filter. - `IS NULL` matches fields that `EXISTS` AND `= IS NULL` - `IS NOT NULL` matches fields that `NOT EXISTS` OR `!= IS NULL` 1. `{"name": "A", "price": null}` 2. `{"name": "A", "price": 10}` 3. `{"name": "A"}` `price IS NULL` would match 1 `price IS NOT NULL` or `NOT price IS NULL` would match 2,3 `price EXISTS` would match 1, 2 `price NOT EXISTS` or `NOT price EXISTS` would match 3 common query : `(price EXISTS) AND (price IS NOT NULL)` would match 2 ### `IS EMPTY` and `IS NOT EMPTY` - `IS EMPTY` matches Array `[]`, Object `{}`, or String `""` fields that `EXISTS` and are empty - `IS NOT EMPTY` matches fields that `NOT EXISTS` OR are not empty. 1. `{"name": "A", "tags": null}` 2. `{"name": "A", "tags": [null]}` 3. `{"name": "A", "tags": []}` 4. `{"name": "A", "tags": ["hello","world"]}` 5. `{"name": "A", "tags": [""]}` 6. `{"name": "A"}` 7. `{"name": "A", "tags": {}}` 8. `{"name": "A", "tags": {"t1":"v1"}}` 9. `{"name": "A", "tags": {"t1":""}}` 10. `{"name": "A", "tags": ""}` `tags IS EMPTY` would match 3,7,10 `tags IS NOT EMPTY` or `NOT tags IS EMPTY` would match 1,2,4,5,6,8,9 `tags IS NULL` would match 1 `tags IS NOT NULL` or `NOT tags IS NULL` would match 2,3,4,5,6,7,8,9,10 `tags EXISTS` would match 1,2,3,4,5,7,8,9,10 `tags NOT EXISTS` or `NOT tags EXISTS` would match 6 common query : `(tags EXISTS) AND (tags IS NOT NULL) AND (tags IS NOT EMPTY)` would match 2,4,5,8,9 ## What should the reviewer do? 
- Check that I tested the filters - Check that I deleted the ids of the documents when deleting documents Co-authored-by: Clément Renault <clement@meilisearch.com> Co-authored-by: Kerollmops <clement@meilisearch.com>	2023-04-27 13:14:00 +00:00
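The `IS NULL`, `IS EMPTY` and `EXISTS` semantics spelled out in the description of #3571 can be checked against its ten `tags` documents with a small sketch. This is only an illustration of the spec quoted in the PR, not milli's implementation: the `Value` enum, the predicate functions, and `matching_ids` are hypothetical names introduced here, and a missing field is modelled as `None`.

```rust
/// Minimal stand-in for a JSON value (hypothetical, not a Meilisearch type).
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Null,
    Str(String),
    Array(Vec<Value>),
    Object(Vec<(String, Value)>),
}

/// `field` is `None` when the document does not contain the key at all.
fn exists(field: Option<&Value>) -> bool {
    field.is_some()
}

/// `IS NULL`: the field exists and its value is `null`.
fn is_null(field: Option<&Value>) -> bool {
    matches!(field, Some(Value::Null))
}

/// `IS EMPTY`: the field exists and is `""`, `[]` or `{}`.
fn is_empty(field: Option<&Value>) -> bool {
    match field {
        Some(Value::Str(s)) => s.is_empty(),
        Some(Value::Array(a)) => a.is_empty(),
        Some(Value::Object(o)) => o.is_empty(),
        _ => false,
    }
}

/// Returns the 1-based ids of the documents whose field satisfies `pred`.
fn matching_ids(docs: &[Option<Value>], pred: impl Fn(Option<&Value>) -> bool) -> Vec<usize> {
    let mut ids = Vec::new();
    for (i, doc) in docs.iter().enumerate() {
        if pred(doc.as_ref()) {
            ids.push(i + 1);
        }
    }
    ids
}

/// The `tags` field of documents 1 to 10 from the PR description.
fn tags_examples() -> Vec<Option<Value>> {
    use Value::*;
    vec![
        Some(Null),                                                  // 1: null
        Some(Array(vec![Null])),                                     // 2: [null]
        Some(Array(vec![])),                                         // 3: []
        Some(Array(vec![Str("hello".into()), Str("world".into())])), // 4
        Some(Array(vec![Str("".into())])),                           // 5: [""]
        None,                                                        // 6: no `tags` key
        Some(Object(vec![])),                                        // 7: {}
        Some(Object(vec![("t1".into(), Str("v1".into()))])),         // 8
        Some(Object(vec![("t1".into(), Str("".into()))])),           // 9
        Some(Str("".into())),                                        // 10: ""
    ]
}
```

The negated filters are plain negations of these predicates, which is why `tags IS NOT NULL` and `tags IS NOT EMPTY` both match document 6, the one without a `tags` key.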
Loïc Lecrenier	899baa0ea5	Update forgotten snapshot from previous commit	2023-04-27 13:43:04 +02:00
Loïc Lecrenier	374095d42c	Add tests for stop words and fix a couple of bugs	2023-04-27 13:30:09 +02:00
Louis Dureuil	b41a6cbd7a	Check sort criteria also in placeholder search	2023-04-26 16:28:17 +02:00
Louis Dureuil	c8af572697	Add tests for exact words and exact attributes	2023-04-26 16:13:01 +02:00
ManyTheFish	249053e514	Update feature flags	2023-04-26 14:59:25 +02:00
ManyTheFish	ff2cf2a5ae	Update charabia in milli	2023-04-26 14:56:54 +02:00
Loïc Lecrenier	b448aca49c	Add more tests for exactness rr	2023-04-26 11:04:18 +02:00
Loïc Lecrenier	55bad07c16	Fix bug in exact_attribute rr implementation	2023-04-26 10:40:05 +02:00
Loïc Lecrenier	3421125a55	Prevent the `exactness` ranking rule from removing random words Make it strictly follow the term matching strategy	2023-04-26 09:09:19 +02:00
Clément Renault	14293f6c8f	Make rustfmt happy	2023-04-25 16:55:39 +02:00
Loïc Lecrenier	d3a94e8b25	Fix bugs and add tests to exactness ranking rule	2023-04-25 16:49:08 +02:00
Clément Renault	cfd1b2cc97	Fix the clippy warnings	2023-04-25 16:40:32 +02:00
Kerollmops	a109802d45	Upgrade the incompatible versions of the dependencies	2023-04-24 17:50:57 +02:00
Kerollmops	47b66e49b8	Upgrade the compatible versions of the dependencies	2023-04-24 17:50:52 +02:00
Loïc Lecrenier	8f2e971879	Add tests for "exactness" rr, make correct universe computation	2023-04-24 16:57:34 +02:00
Loïc Lecrenier	d1fdbb63da	Make all search tests pass, fix distinctAttribute bug	2023-04-24 12:12:08 +02:00
Loïc Lecrenier	a7a0891210	Update examples	2023-04-24 10:07:49 +02:00
Loïc Lecrenier	84d9c731f8	Fix bug in encoding of word_position_docids and word_fid_docids	2023-04-24 09:59:30 +02:00
Loïc Lecrenier	bd9aba4d77	Add "position" part of the attribute ranking rule	2023-04-13 10:46:09 +02:00
Loïc Lecrenier	8edad8291b	Add logger to attribute rr, fix a bug	2023-04-13 10:25:00 +02:00
Kerollmops	d9cebff61c	Add a simple test to check that attributes are ranking correctly	2023-04-13 08:27:09 +02:00
Loïc Lecrenier	30f7bd03f6	Fix compiler warning/errors caused by previous merge	2023-04-13 08:27:09 +02:00
Kerollmops	df0d9bb878	Introduce the attribute ranking rule in the list of ranking rules	2023-04-13 08:27:09 +02:00
Kerollmops	5230ddb3ea	Resolve the attribute ranking rule conditions	2023-04-13 08:27:09 +02:00
Kerollmops	d6a7c28e4d	Implement the attribute ranking rule edge computation	2023-04-13 08:27:09 +02:00
Kerollmops	e55efc419e	Introduce a new cache for the words fids	2023-04-13 08:27:09 +02:00
Loïc Lecrenier	644e136aee	Merge branch 'search-refactor-typo-attributes' into search-refactor	2023-04-13 08:26:56 +02:00
Louis Dureuil	38b7b31beb	Decide to use prefix DB if the word is not an ngram	2023-04-12 16:45:38 +02:00
Louis Dureuil	7a01f20df7	Use word_prefix_docids, make get_word_prefix_docids private	2023-04-12 16:45:38 +02:00
Louis Dureuil	c20c38a7fa	Add SearchContext::word_prefix_docids() method	2023-04-12 16:44:43 +02:00
Louis Dureuil	5ab46324c4	Everyone uses the SearchContext::word_docids instead of get_db_word_docids make get_db_word_docids private	2023-04-12 16:44:43 +02:00
Louis Dureuil	325f17488a	Add SearchContext::word_docids() method	2023-04-12 16:37:05 +02:00
Louis Dureuil	e7ff987c46	Update call sites	2023-04-12 16:36:38 +02:00
Louis Dureuil	244003e36f	Refactor DB cache to return Roaring Bitmaps directly instead of byte slices	2023-04-12 16:35:48 +02:00
Loïc Lecrenier	1f813a6f3b	Simplify implementation of the detailed (=visual) logger	2023-04-12 16:32:53 +02:00
Loïc Lecrenier	96183e804a	Simplify the logger	2023-04-12 16:32:53 +02:00
Loïc Lecrenier	7ab48ed8c7	Matching words fixes	2023-04-12 16:21:43 +02:00
Loïc Lecrenier	e7bb8c940f	Merge branch 'search-refactor-highlighter' into search-refactor-highlighter-merged	2023-04-11 12:22:34 +02:00
Loïc Lecrenier	8cb85294ef	Remove unused import warning	2023-04-07 11:09:30 +02:00
Loïc Lecrenier	d0e9d65025	Fix distinct attribute bugs	2023-04-07 11:09:01 +02:00
Loïc Lecrenier	540a396e49	Fix indexing bug in words_prefix_position	2023-04-07 11:08:39 +02:00
Loïc Lecrenier	a81165f0d8	Merge remote-tracking branch 'origin/main' into search-refactor	2023-04-07 10:15:55 +02:00
Loïc Lecrenier	d6585eb10b	Avoid splitting ngrams into their original component words	2023-04-07 10:13:49 +02:00
Loïc Lecrenier	f7d90ad19f	Merge remote-tracking branch 'origin/search-refactor-tests-doc' into search-refactor	2023-04-07 10:13:18 +02:00
Louis Dureuil	31630c85d0	exactness graph rr: Add important TODO/FIXME after review	2023-04-06 17:50:39 +02:00
Louis Dureuil	ab09dc0167	exact_attributes: Add TODOs and additional check after review	2023-04-06 17:50:39 +02:00
Louis Dureuil	618c54915d	exact_attribute: dedup nodes after sorting them	2023-04-06 17:50:39 +02:00
Loïc Lecrenier	130d2061bd	Fix indexing of word_position_docid and fid	2023-04-06 17:50:39 +02:00
Louis Dureuil	66ddee4390	Fix word_position_docids indexing	2023-04-06 17:50:39 +02:00
Louis Dureuil	90a6c01495	Use correct codec in proximity	2023-04-06 17:50:39 +02:00
Louis Dureuil	e58426109a	Fix panics and issues in exactness graph ranking rule	2023-04-06 17:50:39 +02:00
Louis Dureuil	f513cf930a	Exact attribute with state	2023-04-06 17:50:39 +02:00
Louis Dureuil	8a13ed7e3f	Add exactness ranking rules	2023-04-06 17:50:39 +02:00
Louis Dureuil	1b8e4d0301	Add ExactTerm and helper method	2023-04-06 17:50:39 +02:00
Louis Dureuil	996619b22a	Increase position by 8 on hard separator when building query terms	2023-04-06 17:50:39 +02:00
Louis Dureuil	2c9822a337	Rename `is_multiple_words` to `is_ngram` and `zero_typo` to `exact`	2023-04-06 17:50:39 +02:00
Louis Dureuil	7276deee0a	Add new db caches	2023-04-06 17:50:39 +02:00
ManyTheFish	f7e7f438f8	Patch prefix match	2023-04-06 17:22:31 +02:00
ManyTheFish	ba8dcc2d78	Fix clippy	2023-04-06 15:50:47 +02:00
Loïc Lecrenier	7ca91ebb71	Merge branch 'search-refactor-exactness' into search-refactor-tests-doc	2023-04-06 15:16:35 +02:00
ManyTheFish	47f6a3ad3d	Take into account that a logger need the search context	2023-04-06 15:02:23 +02:00
bors[bot]	b4c01581cd	Merge #3641 3641: Bring back changes from `release v1.1.0` into `main` after v1.1.0 release r=curquiza a=curquiza Replace https://github.com/meilisearch/meilisearch/pull/3637 since we don't want to pull commits from `main` into `release-v1.1.0` when fixing git conflicts Co-authored-by: ManyTheFish <many@meilisearch.com> Co-authored-by: bors[bot] <26634292+bors[bot]@users.noreply.github.com> Co-authored-by: Charlotte Vermandel <charlottevermandel@gmail.com> Co-authored-by: Tamo <tamo@meilisearch.com> Co-authored-by: Louis Dureuil <louis@meilisearch.com> Co-authored-by: curquiza <clementine@meilisearch.com> Co-authored-by: Clément Renault <clement@meilisearch.com> Co-authored-by: Many the fish <many@meilisearch.com>	2023-04-06 12:37:54 +00:00
ManyTheFish	ae17c62e24	Remove warnings	2023-04-06 14:07:18 +02:00
ManyTheFish	a1148c09c2	remove old matcher	2023-04-06 14:00:21 +02:00
ManyTheFish	9c5f64769a	Integrate the new Highlighter in the search	2023-04-06 13:58:56 +02:00
ManyTheFish	ebe23b04c9	Make the matcher consume the search context	2023-04-06 12:28:28 +02:00
ManyTheFish	13b7c826c1	add new highlighter	2023-04-06 12:15:37 +02:00
Loïc Lecrenier	5440f43fd3	Fix indexing of word_position_docid and fid	2023-04-05 18:14:00 +02:00
Louis Dureuil	d9460a76f4	Fix word_position_docids indexing	2023-04-05 18:14:00 +02:00
Louis Dureuil	d1ddaa223d	Use correct codec in proximity	2023-04-05 18:14:00 +02:00
Louis Dureuil	f7ecea142e	Fix panics and issues in exactness graph ranking rule	2023-04-05 18:13:46 +02:00
Louis Dureuil	337e75b0e4	Exact attribute with state	2023-04-05 18:12:46 +02:00
Loïc Lecrenier	b5691802a3	Add new tests and fix construction of query graph from paths	2023-04-05 16:31:10 +02:00
Loïc Lecrenier	6e50f23896	Add more search tests	2023-04-05 13:33:23 +02:00
Tamo	597d57bf1d	Merge branch 'main' into bring-back-changes-v1.1.0	2023-04-05 11:32:14 +02:00
Loïc Lecrenier	4c8a0179ba	Add more search tests	2023-04-05 11:30:49 +02:00
Loïc Lecrenier	c69cbec64a	Add more search tests	2023-04-05 11:20:04 +02:00
Loïc Lecrenier	ce328c329d	Move bucket sort function to its own module and fix a bug	2023-04-04 18:03:08 +02:00
Loïc Lecrenier	959e4607bb	Add more search tests	2023-04-04 18:02:46 +02:00
Louis Dureuil	4b4ffb8ec9	Add exactness ranking rules	2023-04-04 17:12:07 +02:00
Louis Dureuil	3951fe22ab	Add ExactTerm and helper method	2023-04-04 17:09:32 +02:00
Louis Dureuil	4d5bc9df4c	Increase position by 8 on hard separator when building query terms	2023-04-04 17:07:26 +02:00
Louis Dureuil	ec2f8e8040	Rename `is_multiple_words` to `is_ngram` and `zero_typo` to `exact`	2023-04-04 17:06:07 +02:00
Louis Dureuil	406b8bd248	Add new db caches	2023-04-04 17:04:46 +02:00
Loïc Lecrenier	62b9c6fbee	Add search tests	2023-04-04 16:18:22 +02:00
Loïc Lecrenier	b439d36807	Split query_term module into multiple submodules	2023-04-04 15:38:30 +02:00
Loïc Lecrenier	faceb661e3	Add note that a part of the code needs fixing	2023-04-04 15:02:01 +02:00
Loïc Lecrenier	4129d657e2	Simplify query_term module a bit	2023-04-04 15:01:42 +02:00
Filip Bachul	1e6fe71a67	fix clippy warning	2023-04-03 20:18:26 +02:00
Filip Bachul	fddfb37f1f	remove unnecessary FilterError:ReservedGeo and FilterError:ReservedGeo	2023-04-03 20:18:26 +02:00
Loïc Lecrenier	3f13608002	Fix computation of ngram derivations	2023-04-03 15:27:49 +02:00
Loïc Lecrenier	4708d9b016	Fix compiler warnings/errors	2023-04-03 10:09:27 +02:00
Clément Renault	0d2e7bcc13	Implement the previous way for the exhaustive distinct candidates	2023-04-03 10:08:10 +02:00
Loïc Lecrenier	55fbfb6124	Merge branch 'search-refactor-located-query-terms' into search-refactor	2023-04-03 10:04:36 +02:00
Loïc Lecrenier	58fe260c72	Allow removing all the terms from a query if it contains a phrase	2023-04-03 09:18:02 +02:00
Loïc Lecrenier	24e5f6f7a9	Don't remove phrases with "last" term matching strategy	2023-04-03 09:17:33 +02:00
Louis Dureuil	9b87c36200	Limit the number of derivations for a single word.	2023-03-31 09:19:18 +02:00
Filip Bachul	1861c69964	fmt	2023-03-30 23:37:26 +02:00
Filip Bachul	cb2b5eb38e	handle _geoDistance(x,x) sort error	2023-03-30 23:21:23 +02:00
Filip Bachul	53aa0a1b54	handle _geo(x,x) sort error	2023-03-30 23:17:34 +02:00
Loïc Lecrenier	12b26cd54e	Don't remove phrases from the query with term matching strategy Last	2023-03-30 14:54:08 +02:00
Loïc Lecrenier	061b1e6d7c	Tiny refactor of query graph remove_nodes method	2023-03-30 14:49:25 +02:00
Loïc Lecrenier	0d6e8b5c31	Fix phrase search bug when the phrase has only one word	2023-03-30 14:48:12 +02:00
Loïc Lecrenier	d48cdc67a0	Fix term matching strategy bugs	2023-03-30 14:01:52 +02:00
Loïc Lecrenier	35c16ad047	Use new term matching strategy logic in words ranking rule	2023-03-30 13:15:43 +02:00
Loïc Lecrenier	2997d1f186	Use new term matching strategy logic in resolve_maximally_reduced_...	2023-03-30 13:12:51 +02:00
Loïc Lecrenier	2a5997fb20	Avoid expensive assert! in bucket sort function	2023-03-30 13:07:17 +02:00
Loïc Lecrenier	ee8a9e0bad	Remove outdated sentence in documentation	2023-03-30 12:22:24 +02:00
Loïc Lecrenier	3b0737a092	Fix detailed logger	2023-03-30 12:20:44 +02:00
Loïc Lecrenier	fdd02105ac	Graph-based ranking rule + term matching strategy support	2023-03-30 12:19:21 +02:00
Loïc Lecrenier	aa9592455c	Refactor the paths_of_cost algorithm Support conditions that require certain nodes to be skipped	2023-03-30 12:11:11 +02:00
Loïc Lecrenier	01e24dd630	Rewrite proximity ranking rule	2023-03-30 11:59:06 +02:00
Loïc Lecrenier	ae6bb1ce17	Update the ConditionDocidsCache after change to RankingRuleGraphTrait	2023-03-30 11:41:20 +02:00
Loïc Lecrenier	5fd28620cd	Build ranking rule graph correctly after changes to trait definition	2023-03-30 11:32:55 +02:00
Loïc Lecrenier	728710d63a	Update typo ranking rule to use new query term structure	2023-03-30 11:32:19 +02:00
Loïc Lecrenier	fa81381865	Update the trait requirements of ranking-rule graphs	2023-03-30 11:19:45 +02:00
Loïc Lecrenier	b96a682f16	Update resolve_graph module to work with lazy query terms	2023-03-30 11:10:38 +02:00
Loïc Lecrenier	d0f048c068	Simplify the API of the DatabaseCache	2023-03-30 11:08:17 +02:00
Loïc Lecrenier	223e82a10d	Update QueryGraph to use new lazy query terms + build from paths	2023-03-30 11:06:02 +02:00
Loïc Lecrenier	9507ff5e31	Update query term structure to allow for laziness	2023-03-30 11:06:02 +02:00
Louis Dureuil	c2b025946a	`located_query_terms_from_string`: use u16 for positions, hard limit number of iterated tokens. - Refactor phrase logic to reduce number of possible states	2023-03-30 11:04:14 +02:00
Loïc Lecrenier	3a818c5e87	Add more functionality to interners	2023-03-30 09:56:23 +02:00
Louis Dureuil	d74134ce3a	Check sort criteria	2023-03-29 15:21:54 +02:00
Louis Dureuil	5ac129bfa1	Mark geosearch as currently unimplemented for sort rule	2023-03-29 15:20:42 +02:00
ManyTheFish	efea1e5837	Fix facet normalization	2023-03-29 12:02:24 +02:00
Louis Dureuil	abb4522f76	Small comment on ignored rules for placeholder search	2023-03-29 09:11:06 +02:00
Louis Dureuil	ef084ef042	SmallBitmap: Consistently panic on incoherent universe lengths	2023-03-29 08:45:38 +02:00
Louis Dureuil	3524bd1257	SmallBitmap: Add documentation	2023-03-29 08:44:11 +02:00
Tamo	a50b058557	update the geoBoundingBox feature Now instead of using the (top_left, bottom_right) corners of the bounding box it's using the (top_right, bottom_left) corners.	2023-03-28 18:26:18 +02:00
Louis Dureuil	d4f6216966	Resolve rule time sort criteria	2023-03-28 16:42:02 +02:00
Louis Dureuil	77acafe534	Resolve search time sort criteria for placeholder search	2023-03-28 16:41:03 +02:00
Louis Dureuil	53afda3237	Update search usage in example	2023-03-28 16:35:46 +02:00
Louis Dureuil	abb19d368d	Initialize query time ranking rule for query search	2023-03-28 12:40:52 +02:00
Louis Dureuil	b4a52a622e	BoxRankingRule	2023-03-28 12:39:42 +02:00
Louis Dureuil	8d7d8cdc2f	Clean-up index example	2023-03-27 18:34:10 +02:00
Louis Dureuil	626a93b348	Search example: panic when missing the index path	2023-03-27 18:18:01 +02:00
Louis Dureuil	af65fe201a	Clean-up search example	2023-03-27 17:49:43 +02:00
Louis Dureuil	9b83b1deb0	Expose SearchLogger trait	2023-03-27 17:49:18 +02:00
Louis Dureuil	e9eb271499	Remove empty attribute_rule mod	2023-03-27 11:08:03 +02:00
Louis Dureuil	3281a88d08	SmallBitmap: don't expose internal items	2023-03-27 11:04:43 +02:00
Louis Dureuil	5a644054ab	Removed unused search impl	2023-03-27 11:04:27 +02:00
Louis Dureuil	16fefd364e	Add TODO notes	2023-03-27 11:04:04 +02:00
Gregory Conrad	e7994cdeb3	feat: check to see if the PK changed before erroring out Previously, if the primary key was set and a Settings update contained a primary key, an error would be returned. However, this error is not needed if the new PK == the current PK. This commit just checks to see if the PK actually changes before raising an error.	2023-03-26 12:18:39 -04:00
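The check described in e7994cdeb3 amounts to comparing the requested primary key with the current one before raising an error. A minimal sketch of that decision follows; `PkUpdate` and `check_primary_key_update` are hypothetical names and do not reflect the actual Settings code in milli.

```rust
/// Outcome of a primary-key settings update (hypothetical type).
#[derive(Debug, PartialEq)]
enum PkUpdate {
    /// No primary key was set yet, so the requested one is accepted.
    Set,
    /// The requested key equals the current one: nothing to do, no error.
    Unchanged,
    /// A different key is already set: this is the case that errors out.
    Rejected,
}

fn check_primary_key_update(current: Option<&str>, requested: &str) -> PkUpdate {
    match current {
        None => PkUpdate::Set,
        Some(cur) if cur == requested => PkUpdate::Unchanged,
        Some(_) => PkUpdate::Rejected,
    }
}
```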
Loïc Lecrenier	00bad8c716	Add comments suggesting performance improvements	2023-03-23 10:18:24 +01:00
Loïc Lecrenier	862714a18b	Remove criterion_implementation_strategy param of Search	2023-03-23 09:44:12 +01:00
Loïc Lecrenier	d18ebe4f3a	Remove more warnings	2023-03-23 09:41:18 +01:00
Loïc Lecrenier	7169d85115	Remove old query_tree code and make clippy happy	2023-03-23 09:39:16 +01:00
Loïc Lecrenier	f5f5f03ec0	Remove old criteria code	2023-03-23 09:35:53 +01:00
Loïc Lecrenier	9b2653427d	Split position DB into fid and relative position DB	2023-03-23 09:22:01 +01:00
Loïc Lecrenier	56b7209f26	Make clippy happy	2023-03-23 09:16:17 +01:00
Loïc Lecrenier	9b1f439a91	WIP	2023-03-23 09:12:35 +01:00
Loïc Lecrenier	01c7d2de8f	Add example targets to the milli crate	2023-03-22 14:50:41 +01:00
Loïc Lecrenier	a86aeba411	WIP	2023-03-22 14:43:08 +01:00
Loïc Lecrenier	384fdc2df4	Fix two bugs in proximity ranking rule	2023-03-21 11:43:25 +01:00
Loïc Lecrenier	83e5b4ed0d	Compute edges of proximity graph lazily	2023-03-21 10:44:40 +01:00
Loïc Lecrenier	272cd7ebbd	Small cleanup	2023-03-20 13:39:19 +01:00
Loïc Lecrenier	c63c7377e6	Switch order of MappedInterner generic params	2023-03-20 09:41:56 +01:00
Loïc Lecrenier	5b50e49522	cargo fmt	2023-03-20 09:41:56 +01:00
Loïc Lecrenier	65474c8de5	Update new sort ranking rule after rebasing	2023-03-20 09:41:56 +01:00
Loïc Lecrenier	fbb1ba3de0	Cargo fmt	2023-03-20 09:41:56 +01:00
Loïc Lecrenier	a59ca28e2c	Add forgotten file	2023-03-20 09:41:56 +01:00
Loïc Lecrenier	825f742000	Simplify graph-based ranking rule impl	2023-03-20 09:41:56 +01:00
Loïc Lecrenier	dd491320e5	Simplify graph-based ranking rule impl	2023-03-20 09:41:56 +01:00
Loïc Lecrenier	c6ff97a220	Rewrite the dead-ends cache to detect more dead-ends	2023-03-20 09:41:56 +01:00
Loïc Lecrenier	49240c367a	Fix bug in cost of typo conditions	2023-03-20 09:41:56 +01:00
Loïc Lecrenier	1e6e624078	Fix bug in SmallBitmap	2023-03-20 09:41:56 +01:00
Loïc Lecrenier	8b4e07e1a3	WIP	2023-03-20 09:41:56 +01:00
Loïc Lecrenier	2853009987	Renaming Edge -> Condition	2023-03-20 09:41:56 +01:00
Loïc Lecrenier	aa59c3bc2c	Replace EdgeCondition with an Option<..> + other code cleanup	2023-03-20 09:41:56 +01:00
Loïc Lecrenier	7b1d8f4c6d	Make PathSet strongly typed	2023-03-20 09:41:56 +01:00
Loïc Lecrenier	a49ddec9df	Prune the query graph after executing a ranking rule	2023-03-20 09:41:56 +01:00
Loïc Lecrenier	05fe856e6e	Merge forward and backward proximity conditions in proximity graph	2023-03-20 09:41:56 +01:00
Loïc Lecrenier	c0cdaf9f53	Fix bug in the proximity ranking rule for queries with ngrams	2023-03-20 09:41:56 +01:00
Loïc Lecrenier	e9cf58d584	Refactor of the Interner	2023-03-20 09:41:56 +01:00
Loïc Lecrenier	31628c5cd4	Merge Phrase and WordDerivations into one structure	2023-03-20 09:41:56 +01:00
Loïc Lecrenier	3004e281d7	Support ngram typos + splitwords and splitwords+synonyms in proximity	2023-03-20 09:41:56 +01:00
Loïc Lecrenier	14e8d0aaa2	Rename lifetime	2023-03-20 09:41:56 +01:00
Loïc Lecrenier	1c58cf8426	Intern ranking rule graph edge conditions as well	2023-03-20 09:41:56 +01:00
Loïc Lecrenier	5155fd2bf1	Reorganise initialisation of ranking rules + rename PathsMap -> PathSet	2023-03-20 09:41:56 +01:00
Loïc Lecrenier	9ec9c204d3	Small code cleanup	2023-03-20 09:41:56 +01:00
Loïc Lecrenier	78b9304d52	Implement distinct attribute	2023-03-20 09:41:56 +01:00
Loïc Lecrenier	0465ba4a05	Intern more values	2023-03-20 09:41:56 +01:00
Loïc Lecrenier	2099991dd1	Continue documenting and cleaning up the code	2023-03-20 09:41:56 +01:00
Loïc Lecrenier	c232cdabf5	Add documentation	2023-03-20 09:41:56 +01:00
Loïc Lecrenier	4e266211bf	Small code reorganisation	2023-03-20 09:41:56 +01:00
Loïc Lecrenier	57fa689131	Cargo fmt	2023-03-20 09:41:56 +01:00
Loïc Lecrenier	10626dddfc	Add a few more optimisations to new search algorithms	2023-03-20 09:41:56 +01:00
Loïc Lecrenier	9051065c22	Apply a few optimisations for graph-based ranking rules	2023-03-20 09:41:56 +01:00
Loïc Lecrenier	e8c76cf7bf	Intern all strings and phrases in the search logic	2023-03-20 09:41:56 +01:00
Loïc Lecrenier	3f1729a17f	Update new search test	2023-03-20 09:41:56 +01:00
Loïc Lecrenier	cab2b6bcda	Fix: computation of initial universe, code organisation	2023-03-20 09:41:56 +01:00
Loïc Lecrenier	c4979a2fda	Fix code visibility issue + unimplemented detail in proximity rule	2023-03-20 09:41:56 +01:00
Loïc Lecrenier	23931f8a4f	Fix small bug in visual logger of search algo	2023-03-20 09:41:56 +01:00
Loïc Lecrenier	aa414565bb	Fix proximity graph edge builder to include all proximities	2023-03-20 09:41:56 +01:00
Loïc Lecrenier	1db152046e	WIP on split words and synonyms support	2023-03-20 09:41:56 +01:00
Loïc Lecrenier	c27ea2677f	Rewrite cheapest path algorithm and empty path cache It is now much simpler and has much better performance.	2023-03-20 09:41:56 +01:00
Loïc Lecrenier	caa1e1b923	Add typo ranking rule to new search impl	2023-03-20 09:41:56 +01:00
Loïc Lecrenier	71f18e4379	Add sort ranking rule to new search impl	2023-03-20 09:41:56 +01:00
Loïc Lecrenier	600e3dd1c5	Remove warnings	2023-03-20 09:41:56 +01:00
Loïc Lecrenier	362eb0de86	Add support for filters	2023-03-20 09:41:56 +01:00
Loïc Lecrenier	998d46ac10	Add support for search offset and limit	2023-03-20 09:41:56 +01:00
Loïc Lecrenier	6c85c0d95e	Fix more bugs + visual empty path cache logging	2023-03-20 09:41:56 +01:00
Loïc Lecrenier	0e1fbbf7c6	Fix bugs in query graph's "remove word" and "cheapest paths" algos	2023-03-20 09:41:56 +01:00
Loïc Lecrenier	6806640ef0	Fix d2 description of paths map	2023-03-20 09:41:56 +01:00
Loïc Lecrenier	173e37584c	Improve the visual/detailed search logger	2023-03-20 09:41:55 +01:00
Loïc Lecrenier	6ba4d5e987	Add a search logger	2023-03-20 09:41:55 +01:00
Loïc Lecrenier	dd12d44134	Support swapped word pairs in new proximity ranking rule impl	2023-03-20 09:41:55 +01:00
Loïc Lecrenier	c8e251bf24	Remove noise in codebase	2023-03-20 09:41:55 +01:00
Loïc Lecrenier	a938fbde4a	Use a cache when resolving the query graph	2023-03-20 09:41:55 +01:00
Loïc Lecrenier	dcf3f1d18a	Remove EdgeIndex and NodeIndex types, prefer u32 instead	2023-03-20 09:41:55 +01:00
Loïc Lecrenier	66d0c63694	Add some documentation and use bitmaps instead of hashmaps when possible	2023-03-20 09:41:55 +01:00
Loïc Lecrenier	132191360b	Introduce the sort ranking rule working with the new search structures	2023-03-20 09:41:55 +01:00
Loïc Lecrenier	345c99d5bd	Introduce the words ranking rule working with the new search structures	2023-03-20 09:41:55 +01:00
Loïc Lecrenier	89d696c1e3	Introduce the proximity ranking rule as a graph-based ranking rule	2023-03-20 09:41:55 +01:00
Loïc Lecrenier	c645853529	Introduce a generic graph-based ranking rule	2023-03-20 09:41:55 +01:00
Loïc Lecrenier	a70ab8b072	Introduce a function to find the K shortest paths in a graph	2023-03-20 09:41:55 +01:00
Loïc Lecrenier	48aae76b15	Introduce a function to find the docids of a set of paths in a graph	2023-03-20 09:41:55 +01:00
Loïc Lecrenier	23bf572dea	Introduce cache structures used with ranking rule graphs	2023-03-20 09:41:55 +01:00
Loïc Lecrenier	864f6410ed	Introduce a structure to represent a set of graph paths efficiently	2023-03-20 09:41:55 +01:00
Loïc Lecrenier	c9bf6bb2fa	Introduce a structure to implement ranking rules with graph algorithms	2023-03-20 09:41:55 +01:00
Loïc Lecrenier	46249ea901	Implement a function to find a QueryGraph's docids	2023-03-20 09:41:55 +01:00
Loïc Lecrenier	ce0d1e0e13	Introduce a common way to manage the coordination between ranking rules	2023-03-20 09:41:55 +01:00
Loïc Lecrenier	5065d8b0c1	Introduce a DatabaseCache to memorize the addresses of LMDB values	2023-03-20 09:41:55 +01:00
Loïc Lecrenier	a83007c013	Introduce structure to represent search queries as graphs	2023-03-20 09:41:55 +01:00
Loïc Lecrenier	79e0a6dd4e	Introduce a new search module, eventually meant to replace the old one The code here does not compile, because I am merely splitting one giant commit into smaller ones where each commit explains a single file.	2023-03-20 09:41:55 +01:00
Loïc Lecrenier	2d88089129	Remove unused term matching strategies	2023-03-20 09:41:55 +01:00
Loïc Lecrenier	6c659dc12f	Use MiMalloc in milli tests	2023-03-20 09:41:37 +01:00
Clément Renault	cf34d1c95f	Fix a test that forgot to match a Null value	2023-03-15 17:17:19 +01:00
Clément Renault	1a9c58a7ab	Fix a bug with the new flattening rules	2023-03-15 16:56:44 +01:00
Clément Renault	64571c8288	Improve the testing of the filters	2023-03-15 14:57:17 +01:00
Clément Renault	ea016d97af	Implementing an IS EMPTY filter	2023-03-15 14:12:34 +01:00
Clément Renault	fa2ea4a379	Update the test to accept the new IS syntax	2023-03-14 10:31:27 +01:00
Tamo	0f33a65468	makes kero happy	2023-03-13 16:51:11 +01:00
bors[bot]	fb1260ee88	Merge #3568 #3569 3568: CI: Fix `publish-aarch64` job that still uses ubuntu-18.04 r=Kerollmops a=curquiza Fixes #3563 Main change - add the usage of the `ubuntu-18.04` container instead of the native `ubuntu-18.04` of GitHub actions: I had to install docker in the container. Small additional changes - remove useless `fail-fast` and unused/irrelevant matrix inputs (`build`, `linker`, `os`, `use-cross`...) - Remove useless step in job Proof of work with this CI triggered on this current branch: https://github.com/meilisearch/meilisearch/actions/runs/4366233882 3569: Enhance Japanese language detection r=dureuill a=ManyTheFish # Pull Request This PR is a prototype and can be tested by downloading [the dedicated docker image](https://hub.docker.com/layers/getmeili/meilisearch/prototype-better-language-detection-0/images/sha256-a12847de00e21a71ab797879fd09777dadcb0881f65b5f810e7d1ed434d116ef?context=explore): ```bash $ docker pull getmeili/meilisearch:prototype-better-language-detection-0 ``` ## Context Some Languages are harder to detect than others, this miss-detection leads to bad tokenization making some words or even documents completely unsearchable. Japanese is the main Language affected and can be detected as Chinese which has a completely different way of tokenization. A [first iteration has been implemented for v1.1.0](https://github.com/meilisearch/meilisearch/pull/3347) but is an insufficient enhancement to make Japanese work. This first implementation was detecting the Language during the indexing to avoid bad detections during the search. Unfortunately, some documents (shorter ones) can be wrongly detected as Chinese running bad tokenization for these documents and making possible the detection of Chinese during the search because it has been detected during the indexing. 
For instance, a Japanese document `{"id": 1, "name": "東京スカパラダイスオーケストラ"}` is detected as Japanese during indexing; during the search, the query `東京` will be detected as Japanese because only Japanese documents have been detected during indexing, despite the fact that v1.0.2 would detect it as Chinese. However, if the dataset contains at least one document with a field containing only Kanjis, like: _A document with only 1 field containing only Kanjis:_ ```json { "id":4, "name": "東京特許許可局" } ``` _A document with 1 field containing only Kanjis and 1 field containing several Japanese characters:_ ```json { "id":105, "name": "東京特許許可局", "desc": "日経平均株価は26日に約8カ月ぶりに2万4000円の心理的な節目を上回った。株高を支える材料のひとつは、自民党総裁選で3選を決めた安倍晋三首相の経済政策への期待だ。恩恵が見込まれるとされる人材サービスや建設株の一角が買われている。ただ思惑が先行して資金が集まっている面は否めない。実際に政策効果を取り込む企業はどこか、なお未知数だ。" } ``` Then, in both cases, the field `name` will be detected as Chinese during indexing, allowing the search to detect Chinese in queries. Therefore, the query `東京` will be detected as Chinese and only the two last documents will be retrieved by Meilisearch. ## Technical Approach The current PR partially fixes these issues by: 1) Adding a check over potential misdetections and rerunning the extraction of the document, forcing the tokenization over the main Languages detected in it. > 1) run a first extraction allowing the tokenizer to detect any Language in any Script > 2) generate a distribution of tokens by Script and Languages (`script_language`) > 3) if, for a Script, the token share of one of the Languages is under the threshold, then we rerun the extraction forbidding the tokenizer to detect the marginal Languages > 4) the tokenizer will fall back on the other available Languages to tokenize the text. For example, if the Chinese were marginally detected compared to the Japanese on the CJ script, then the second extraction will force Japanese tokenization for CJ text in the document. 
However, the text on another script like Latin will not be impacted by this restriction. 2) Adding a filtering threshold during the search over Languages that have been marginally detected in documents ## Limits This PR introduces 2 arbitrary thresholds: 1) during the indexing, a Language is considered misdetected if the number of detected tokens of this Language is under 10% of the tokens detected in the same Script (Japanese and Chinese are 2 different Languages sharing the "same" script "CJK"). 2) during the search, a Language is considered marginal if fewer than 5% of documents are detected as this Language. This PR only partially fixes these issues: - ✅ the query `東京` now finds Japanese documents if fewer than 5% of documents are detected as Chinese. - ✅ the document with the id `105` containing the Japanese field `desc` but the misdetected field `name` is now completely detected and tokenized as Japanese and is found with the query `東京`. - ❌ the document with the id `4` no longer breaks the search Language detection but continues to be detected as a Chinese document and can't be found during the search. ## Related issue Fixes #3565 ## Possible future enhancements - Change or contribute to the Library used to detect the Language - the related issue on Whatlang: https://github.com/greyblake/whatlang-rs/issues/122 Co-authored-by: curquiza <clementine@meilisearch.com> Co-authored-by: ManyTheFish <many@meilisearch.com> Co-authored-by: Many the fish <many@meilisearch.com>	2023-03-09 15:34:35 +00:00
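The indexing-time check described in #3569 above (a Language is treated as marginal when its share of the tokens detected in one Script falls under a threshold) can be sketched roughly as follows. This is only an illustration under stated assumptions: `marginal_languages`, the string labels used for Languages, and the `threshold` parameter are hypothetical names, not milli's actual API, and the token counts are assumed to have been gathered for a single Script already.

```rust
use std::collections::HashMap;

/// Returns the languages whose share of the tokens detected in one script
/// is strictly below `threshold` (for example 0.10 for a 10% threshold).
/// Hypothetical helper: names and types are illustrative, not milli's API.
fn marginal_languages(tokens_by_language: &HashMap<&str, usize>, threshold: f64) -> Vec<String> {
    let total: usize = tokens_by_language.values().sum();
    if total == 0 {
        // Nothing was detected, so nothing can be marginal.
        return Vec::new();
    }
    let mut marginal: Vec<String> = tokens_by_language
        .iter()
        // Keep only the languages whose token share is under the threshold.
        .filter(|entry| (*entry.1 as f64) / (total as f64) < threshold)
        .map(|entry| entry.0.to_string())
        .collect();
    marginal.sort(); // deterministic order for callers
    marginal
}
```

With 95 tokens detected as Japanese and 5 as Chinese in the same Script, only Chinese would be reported as marginal at a 10% threshold, and a second extraction could then forbid it.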
ManyTheFish	2f8eb4f54a	last PR fixes	2023-03-09 15:34:36 +01:00
Clément Renault	175e8a8495	Fix a diacritic issue	2023-03-09 14:57:47 +01:00
Clément Renault	df48ac8803	Add one more test for the NULL operator	2023-03-09 13:53:37 +01:00
Clément Renault	ff86073288	Add a snapshot for the NULL facet database	2023-03-09 13:32:27 +01:00
Clément Renault	0ad53784e7	Create a new struct to reduce the type complexity	2023-03-09 13:21:21 +01:00
Clément Renault	e064c52544	Rename an internal facet deletion method	2023-03-09 13:08:02 +01:00
Clément Renault	e106b16148	Fix a typo in a variable Co-authored-by: Louis Dureuil <louis@meilisearch.com> aaa	2023-03-09 13:08:02 +01:00
Tamo	eddefb0e0f	refactor the error type of the milli::document thing silence a warning	2023-03-09 13:03:14 +01:00
ManyTheFish	5deea631ea	fix clippy too many arguments	2023-03-09 11:19:13 +01:00
Tamo	c5f22be6e1	add boolean support for csv documents	2023-03-09 11:12:49 +01:00
ManyTheFish	b4b859ec8c	Fix typos	2023-03-09 10:58:35 +01:00
Clément Renault	b1d61f5a02	Add more tests for the NULL filter	2023-03-09 10:04:27 +01:00
Clément Renault	7dc04747fd	Make clippy happy	2023-03-08 17:37:08 +01:00
Clément Renault	7c0cd7172d	Introduce the NULL and NOT value NULL operator	2023-03-08 17:14:34 +01:00
Clément Renault	43ff236df8	Write the NULL facet values in the database	2023-03-08 16:49:53 +01:00
Clément Renault	19ab4d1a15	Classify the NULL fields values in the facet extractor	2023-03-08 16:49:31 +01:00
Clément Renault	9287858997	Introduce a new facet_id_is_null_docids database in the index	2023-03-08 16:14:00 +01:00
ManyTheFish	24c0775c67	Change indexing threshold	2023-03-08 12:36:04 +01:00
ManyTheFish	3092cf0448	Fix clippy errors	2023-03-08 10:53:42 +01:00
ManyTheFish	37d4551e8e	Add a threshold filtering the Languages allowed to be detected at search time	2023-03-07 19:38:01 +01:00
ManyTheFish	da48506f15	Rerun extraction when language detection might have failed	2023-03-07 18:35:26 +01:00
bors[bot]	4f1ccbc495	Merge #3525 3525: Fix phrase search containing stop words r=ManyTheFish a=ManyTheFish # Summary A search with a phrase containing only stop words was returning an HTTP error 500. This PR drops phrases containing only stop words before the search starts, so a query whose phrase contains only stop words now behaves like a placeholder search. Fixes https://github.com/meilisearch/meilisearch/issues/3521 Related v1.0.2 PR on milli: https://github.com/meilisearch/milli/pull/779 Co-authored-by: ManyTheFish <many@meilisearch.com>	2023-03-02 10:55:37 +00:00
ManyTheFish	37489fd495	Return an internal error in the case of matching word is invalid	2023-03-01 19:05:16 +01:00
Louis Dureuil	5822764be9	Skip computing index budget in tests	2023-02-23 11:23:39 +01:00
bors[bot]	ac5a1e4c4b	Merge #3423 3423: Add min and max facet stats r=dureuill a=dureuill # Pull Request ## Related issue Fixes #3426 ## What does this PR do? ### User standpoint - When using a `facets` parameter in search, the facets that have numeric values are displayed in a new section of the response called `facetStats` that contains, per facet, the numeric min and max value of the hits returned by the search. <details> <summary> Sample request/response </summary> ```json ❯ curl \ -X POST 'http://localhost:7700/indexes/meteorites/search?facets=mass' \ -H 'Content-Type: application/json' \ --data-binary '{ "q": "LL6", "facets":["mass", "recclass"], "limit": 5 }' | jsonxf { "hits": [ { "name": "Niger (LL6)", "id": "16975", "nametype": "Valid", "recclass": "LL6", "mass": 3.3, "fall": "Fell" }, { "name": "Appley Bridge", "id": "2318", "nametype": "Valid", "recclass": "LL6", "mass": 15000, "fall": "Fell", "_geo": { "lat": 53.58333, "lng": -2.71667 } }, { "name": "Athens", "id": "4885", "nametype": "Valid", "recclass": "LL6", "mass": 265, "fall": "Fell", "_geo": { "lat": 34.75, "lng": -87.0 } }, { "name": "Bandong", "id": "4935", "nametype": "Valid", "recclass": "LL6", "mass": 11500, "fall": "Fell", "_geo": { "lat": -6.91667, "lng": 107.6 } }, { "name": "Benguerir", "id": "30443", "nametype": "Valid", "recclass": "LL6", "mass": 25000, "fall": "Fell", "_geo": { "lat": 32.25, "lng": -8.15 } } ], "query": "LL6", "processingTimeMs": 15, "limit": 5, "offset": 0, "estimatedTotalHits": 42, "facetDistribution": { "mass": { "110000": 1, "11500": 1, "1161": 1, "12000": 1, "1215.5": 1, "127000": 1, "15000": 1, "1676": 1, "1700": 1, "1710.5": 1, "18000": 1, "19000": 1, "220000": 1, "2220": 1, "22300": 1, "25000": 2, "265": 1, "271000": 1, "2840": 1, "3.3": 1, "3000": 1, "303": 1, "32000": 1, "34000": 1, "36.1": 1, "45000": 1, "460": 1, "478": 1, "483": 1, "5500": 2, "600": 1, "6000": 1, "67.8": 1, "678": 1, "680.5": 1, "6930": 1, "8": 1, "8300": 1, "840": 1, "8400": 1 }, 
"recclass": { "L/LL6": 3, "LL6": 39 } }, "facetStats": { "mass": { "min": 3.3, "max": 271000.0 } } } ``` </details> ## PR checklist Please check if your PR fulfills the following requirements: - [ ] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)? - [ ] Have you read the contributing guidelines? - [ ] Have you made sure that the title is accurate and descriptive of the changes? Thank you so much for contributing to Meilisearch! Co-authored-by: Louis Dureuil <louis@meilisearch.com>	2023-02-22 13:06:43 +00:00
ManyTheFish	900bae3d9d	keep phrases that have at least one word	2023-02-21 18:16:51 +01:00
ManyTheFish	28b7d73d4a	Remove an inefficient part of a test on milli	2023-02-21 18:16:51 +01:00
bors[bot]	39407885c2	Merge #3347 3347: Enhance language detection r=irevoire a=ManyTheFish ## Summary Some completely unrelated Languages can share the same characters, in Meilisearch we detect the Languages using `whatlang`, which works well on large texts but fails on small search queries leading to a bad segmentation and normalization of the query. This PR now stores the Languages detected during the indexing in order to reduce the Languages list that can be detected during the search. ## Detail - Create a 19th database mapping the scripts and the Languages detected with the documents where the Language is detected - Fill the newly created database during indexing - Create an allow-list with this database and pass it to Charabia - Add a test ensuring that a Japanese request containing kanjis only is detected as Japanese and not Chinese ## Related issues Fixes #2403 Fixes #3513 Co-authored-by: f3r10 <frledesma@outlook.com> Co-authored-by: ManyTheFish <many@meilisearch.com> Co-authored-by: Many the fish <many@meilisearch.com>	2023-02-21 10:52:13 +00:00
ManyTheFish	bbecab8948	fix clippy	2023-02-21 10:18:44 +01:00
ManyTheFish	8aa808d51b	Merge branch 'main' into enhance-language-detection	2023-02-20 18:14:34 +01:00
bors[bot]	1e9ac00800	Merge #3505 3505: Csv delimiter r=irevoire a=irevoire Fixes https://github.com/meilisearch/meilisearch/issues/3442 Closes https://github.com/meilisearch/meilisearch/pull/2803 Specified in https://github.com/meilisearch/specifications/pull/221 This PR is a reimplementation of https://github.com/meilisearch/meilisearch/pull/2803, on the new engine. Thanks for your idea and initial PR `@MixusMinimax;` sorry I couldn’t update/merge your PR. Way too many changes happened on the engine in the meantime. Attention to reviewer; I had to update deserr to implement the support of deserializing `char`s ------- It introduces four new error messages; - Invalid value in parameter csvDelimiter: expected a string of one character, but found an empty string - Invalid value in parameter csvDelimiter: expected a string of one character, but found the following string of 5 characters: doggo - csv delimiter must be an ascii character. Found: 🍰 - The Content-Type application/json does not support the use of a csv delimiter. The csv delimiter can only be used with the Content-Type text/csv. And one error code; - `invalid_index_csv_delimiter` The `invalid_content_type` error code is now also used when we encounter the `csvDelimiter` query parameter with a non-csv content type. Co-authored-by: Tamo <tamo@meilisearch.com>	2023-02-20 17:01:36 +00:00
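The validation rules behind the `csvDelimiter` error messages listed in #3505 above (exactly one character, and that character must be ASCII) can be sketched as follows. This is only an illustration: `parse_csv_delimiter` is a hypothetical helper, not the deserr-based code the PR actually adds, and it returns plain `String` errors rather than Meilisearch error codes.

```rust
/// Validates a `csvDelimiter` value: exactly one character, and ASCII.
/// Hypothetical helper; the error strings mirror the messages listed in the PR.
fn parse_csv_delimiter(value: &str) -> Result<u8, String> {
    let mut chars = value.chars();
    match (chars.next(), chars.next()) {
        // No character at all.
        (None, _) => Err(
            "Invalid value in parameter csvDelimiter: expected a string of one character, but found an empty string"
                .to_string(),
        ),
        // Exactly one character, and it fits in a single byte.
        (Some(c), None) if c.is_ascii() => Ok(c as u8),
        // Exactly one character, but not ASCII.
        (Some(c), None) => Err(format!("csv delimiter must be an ascii character. Found: {c}")),
        // Two or more characters.
        (Some(_), Some(_)) => Err(format!(
            "Invalid value in parameter csvDelimiter: expected a string of one character, but found the following string of {} characters: {}",
            value.chars().count(),
            value
        )),
    }
}
```

Returning a `u8` reflects the ASCII restriction stated in the error message; a caller could hand that byte to whatever CSV reader it uses.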
bors[bot]	b08a49a16e	Merge #3319 #3470 3319: Transparently resize indexes on MaxDatabaseSizeReached errors r=Kerollmops a=dureuill # Pull Request ## Related issue Related to https://github.com/meilisearch/meilisearch/discussions/3280, depends on https://github.com/meilisearch/milli/pull/760 ## What does this PR do? ### User standpoint - Meilisearch no longer fails tasks that encounter the `milli::UserError(MaxDatabaseSizeReached)` error. - Instead, these tasks are retried after increasing the maximum size allocated to the index where the failure occurred. ### Implementation standpoint - Add `Batch::index_uid` to get the `index_uid` of a batch of task if there is one - `IndexMapper::create_or_open_index` now takes an additional `size` argument that allows to (re)open indexes with a size different from the base `IndexScheduler::index_size` field - `IndexScheduler::tick` now returns a `Result<TickOutcome>` instead of a `Result<usize>`. This offers more explicit control over what the behavior should be wrt the next tick. - Add `IndexStatus::BeingResized` that contains a handle that a thread can use to await for the resize operation to complete and the index to be available again. - Add `IndexMapper::resize_index` to increase the size of an index. - In `IndexScheduler::tick`, intercept task batches that failed due to `MaxDatabaseSizeReached` and resize the index that caused the error, then request a new tick that will eventually handle the still enqueued task. 
## Testing the PR The following diff can be applied to this branch to make testing the PR easier: <details> ```diff diff --git a/index-scheduler/src/index_mapper.rs b/index-scheduler/src/index_mapper.rs index 553ab45a..022b2f00 100644 --- a/index-scheduler/src/index_mapper.rs +++ b/index-scheduler/src/index_mapper.rs @@ -228,13 +228,15 @@ impl IndexMapper { drop(lock); + std::thread::sleep_ms(2000); + let current_size = index.map_size()?; let closing_event = index.prepare_for_closing(); - log::info!("Resizing index {} from {} to {} bytes", name, current_size, current_size * 2); + log::error!("Resizing index {} from {} to {} bytes", name, current_size, current_size * 2); closing_event.wait(); - log::info!("Resized index {} from {} to {} bytes", name, current_size, current_size * 2); + log::error!("Resized index {} from {} to {} bytes", name, current_size, current_size * 2); let index_path = self.base_path.join(uuid.to_string()); let index = self.create_or_open_index(&index_path, None, 2 * current_size)?; @@ -268,8 +270,10 @@ impl IndexMapper { match index { Some(Available(index)) => break index, Some(BeingResized(ref resize_operation)) => { + log::error!("waiting for resize end"); // Deadlock: no lock taken while doing this operation. resize_operation.wait(); + log::error!("trying our luck again!"); continue; } Some(BeingDeleted) => return Err(Error::IndexNotFound(name.to_string())), diff --git a/index-scheduler/src/lib.rs b/index-scheduler/src/lib.rs index 11b17d05..242dc095 100644 --- a/index-scheduler/src/lib.rs +++ b/index-scheduler/src/lib.rs @@ -908,6 +908,7 @@ impl IndexScheduler { /// /// Returns the number of processed tasks. 
fn tick(&self) -> Result<TickOutcome> { + log::error!("ticking!"); #[cfg(test)] { *self.run_loop_iteration.write().unwrap() += 1; diff --git a/meilisearch/src/main.rs b/meilisearch/src/main.rs index 050c825a..63f312f6 100644 --- a/meilisearch/src/main.rs +++ b/meilisearch/src/main.rs @@ -25,7 +25,7 @@ fn setup(opt: &Opt) -> anyhow::Result<()> { #[actix_web::main] async fn main() -> anyhow::Result<()> { - let (opt, config_read_from) = Opt::try_build()?; + let (mut opt, config_read_from) = Opt::try_build()?; setup(&opt)?; @@ -56,6 +56,8 @@ We generated a secure master key for you (you can safely copy this token): _ => (), } + opt.max_index_size = byte_unit::Byte::from_str("1MB").unwrap(); + let (index_scheduler, auth_controller) = setup_meilisearch(&opt)?; #[cfg(all(not(debug_assertions), feature = "analytics"))] ``` </details> Mainly, these debug changes do the following: - Set the default index size to 1MiB so that index resizes are initially frequent - Turn some logs from info to error so that they can be displayed with `--log-level ERROR` (hiding the other infos) - Add a long sleep between the beginning and the end of the resize so that we can observe the `BeingResized` index status (otherwise it would never come up in my tests) ## Open questions - Is the growth factor of x2 the correct solution? For a `Vec` in memory it makes sense, but here we're manipulating quantities that are potentially in the order of 500GiBs. For bigger indexes it may make more sense to add at most e.g. 100GiB on each resize operation, avoiding big steps like 500GiB -> 1TiB. ## PR checklist Please check if your PR fulfills the following requirements: - [ ] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)? - [ ] Have you read the contributing guidelines? - [ ] Have you made sure that the title is accurate and descriptive of the changes? Thank you so much for contributing to Meilisearch! 
3470: Autobatch addition and deletion r=irevoire a=irevoire This PR lets Meilisearch batch document additions and deletions together. Fix https://github.com/meilisearch/meilisearch/issues/3440 -------------- Things to check before merging: - [x] What happens if we delete the same documents multiple times -> add a test - [x] What if a documentDeletion gets batched with a documentAddition but the index doesn't exist yet? It should not work Co-authored-by: Louis Dureuil <louis@meilisearch.com> Co-authored-by: Tamo <tamo@meilisearch.com>	2023-02-20 15:00:19 +00:00
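The open question raised in #3319 above (whether doubling is still reasonable for very large indexes, or whether each resize should add at most a fixed amount) can be made concrete with a small sketch. The PR itself reopens the index with `2 * current_size`; the capped variant below is only the alternative floated in the description, and `next_index_size`, `max_step` and `GIB` are hypothetical names introduced for illustration.

```rust
/// One gibibyte, in bytes (illustrative constant).
const GIB: u64 = 1 << 30;

/// Doubles the index size, but never adds more than `max_step` bytes at once.
/// Hypothetical helper, not part of the index scheduler.
fn next_index_size(current: u64, max_step: u64) -> u64 {
    // Doubling means adding `current` bytes; cap that increment.
    let step = current.min(max_step);
    current.saturating_add(step)
}
```

With a 100 GiB cap, a 1 GiB index still doubles to 2 GiB, while a 500 GiB index grows to 600 GiB instead of jumping to 1 TiB.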
Many the fish	119e6d8811	Update milli/src/search/mod.rs Co-authored-by: Tamo <tamo@meilisearch.com>	2023-02-20 15:33:10 +01:00
ManyTheFish	cb8d5f2d4b	Update Charabia to 0.7.1	2023-02-20 14:00:31 +01:00
Louis Dureuil	eb28d4c525	add facet test	2023-02-20 13:52:28 +01:00
Louis Dureuil	9ac981d025	Remove some clippy type complexity warns by deboxing iters	2023-02-20 13:52:27 +01:00
Louis Dureuil	74859ecd61	Add min and max facet stats	2023-02-20 13:52:27 +01:00
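As described in the merge commit for #3423 above, `facetStats` reports, per numeric facet, the min and max value among the hits returned by the search. A minimal sketch of that computation for one facet follows; `facet_min_max` is a hypothetical helper operating on an already collected slice of values, not the iterator-based code the commit introduces.

```rust
/// Computes the (min, max) of the numeric values of one facet.
/// Returns `None` when the facet has no numeric value.
/// Hypothetical helper, for illustration only.
fn facet_min_max(values: &[f64]) -> Option<(f64, f64)> {
    let mut iter = values.iter().copied();
    // Seed both bounds with the first value, if any.
    let first = iter.next()?;
    Some(iter.fold((first, first), |(min, max), v| (min.min(v), max.max(v))))
}
```

Applied to the `mass` values of the five hits displayed in the sample response, this yields 3.3 and 25000; the sample's `facetStats` shows a larger max because it covers all hits matched by the search, not only the five displayed ones.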
Louis Dureuil	8ae441a4db	Update usage of iterators	2023-02-20 13:52:27 +01:00
Louis Dureuil	042d86cbb3	facet sort ascending/descending now also return the values	2023-02-20 13:52:27 +01:00
Tamo	18796d6e6a	Consider null as a valid geo object	2023-02-20 13:45:51 +01:00
bors[bot]	28961b2ad1	Merge #3499 3499: Use the workspace inheritance r=Kerollmops a=irevoire Use the workspace inheritance [introduced in rust 1.64](https://blog.rust-lang.org/2022/09/22/Rust-1.64.0.html#cargo-improvements-workspace-inheritance-and-multi-target-builds). It allows us to define the version of meilisearch once in the main `Cargo.toml` and lets all the other `Cargo.toml` files use this version. `@curquiza` I added you as a reviewer because I had to patch some CI scripts. And `@Kerollmops`, I had to bump the `cargo_toml` crate because our version was getting old and didn't support the feature yet. Also, in another PR, I would like to unify some of our dependencies to ensure we always stay in sync between all our crates. Co-authored-by: Tamo <tamo@meilisearch.com>	2023-02-17 09:52:29 +00:00
Tamo	895ab2906c	apply review suggestions	2023-02-16 18:42:47 +01:00
Tamo	8c074f5028	implements the csv delimiter without tests Co-authored-by: Maxi Barmetler <maxi.barmetler@gmail.com>	2023-02-16 17:35:36 +01:00
bors[bot]	143e3cf948	Merge #3490 3490: Fix attributes set candidates r=curquiza a=ManyTheFish # Pull Request Fix attributes set candidates for v1.1.0 ## details The attribute criterion was not returning the remaining candidates when its internal algorithm had been exhausted. We had a loss of candidates by the attribute criterion leading to the bug reported in the issue linked below. After some investigation, it seems that it was the only criterion that had this behavior. We are now returning the remaining candidates instead of an empty bitmap. ## Related issue Fixes #3483 PR on milli for v1.0.1: https://github.com/meilisearch/milli/pull/777 Co-authored-by: ManyTheFish <many@meilisearch.com>	2023-02-15 17:38:07 +00:00
Tamo	74d1a67a99	Use the workspace inheritance feature of rust 1.64	2023-02-15 13:51:07 +01:00
bors[bot]	91ce8a5e67	Merge #3492 3492: Bump deserr r=Kerollmops a=irevoire Bump deserr to the latest version; - We now use the default actix-web extractors that deserr provides (which were copy/pasted from meilisearch) - We also use the default `JsonError` message provided by deserr instead of defining our own in meilisearch - Finally, we get the new `did you mean?` error message. Fix #3493 Co-authored-by: Tamo <tamo@meilisearch.com>	2023-02-15 10:05:05 +00:00
Tamo	a43765d454	use the pre-defined deserr extractors	2023-02-14 20:05:30 +01:00
Tamo	8fb7b1d10f	bump deserr	2023-02-14 20:04:30 +01:00
Tamo	74dcfe9676	Fix a bug when you update a document that was already present in the db, deleted and then inserted again in the same transform	2023-02-14 19:09:40 +01:00
Tamo	1b1703a609	make a small optimization to merge obkvs a little bit faster	2023-02-14 18:32:41 +01:00
Tamo	fb5e4957a6	fix and test the early exit in case a grenad ends with a deletion	2023-02-14 18:23:57 +01:00
Tamo	8de3c9f737	Update milli/src/update/index_documents/transform.rs Co-authored-by: Clément Renault <clement@meilisearch.com>	2023-02-14 17:57:14 +01:00
Tamo	43a19d0709	document the operation enum + the grenads	2023-02-14 17:55:26 +01:00
Filip Bachul	a53536836b	fmt	2023-02-14 17:04:22 +01:00
Filip Bachul	d7ad39ad77	fix: clippy error	2023-02-14 00:15:35 +01:00
Filip Bachul	849de089d2	add thiserror for AscDescError	2023-02-14 00:15:35 +01:00
filip	7f25007d31	Update milli/src/asc_desc.rs Co-authored-by: Tamo <irevoire@protonmail.ch>	2023-02-14 00:15:35 +01:00
Filip Bachul	c810af3ebf	implement From<ParseGeoError> for AscDescError	2023-02-14 00:15:35 +01:00
Filip Bachul	c0b77773ba	fmt asc_desc	2023-02-14 00:15:35 +01:00
Filip Bachul	7481559e8b	move BadGeo to FilterError	2023-02-14 00:15:35 +01:00
Filip Bachul	83c765ce6c	implement From<ParseGeoError> for FilterError	2023-02-14 00:15:35 +01:00
Filip Bachul	4c91037602	use ParseGeoError in sort parser	2023-02-14 00:15:35 +01:00
Filip Bachul	825923f6fc	export ParseGeoError	2023-02-14 00:15:35 +01:00
Filip Bachul	e405702733	chore: introduce new error ParseGeoError type	2023-02-14 00:15:35 +01:00
ManyTheFish	6fa877efb0	Fix attributes set candidates	2023-02-13 17:49:52 +01:00
Tamo	746b31c1ce	makes clippy happy	2023-02-09 12:23:01 +01:00
Tamo	93db755d57	add a test to ensure we correctly handle deleting the same document multiple times	2023-02-08 21:03:34 +01:00
Tamo	93f130a400	fix all warnings	2023-02-08 20:57:35 +01:00
Tamo	421a9cf05e	provide a new method on the transform to remove documents	2023-02-08 16:06:09 +01:00
Tamo	8f64fba1ce	rewrite the current transform to handle a new byte specifying the kind of operation it's merging	2023-02-08 12:53:38 +01:00
bors[bot]	c88c3637b4	Merge #3461 3461: Bring v1 changes into main r=curquiza a=Kerollmops Also bring back changes in milli (the remote repository) into main done during the pre-release Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com> Co-authored-by: bors[bot] <26634292+bors[bot]@users.noreply.github.com> Co-authored-by: curquiza <curquiza@users.noreply.github.com> Co-authored-by: Tamo <tamo@meilisearch.com> Co-authored-by: Philipp Ahlner <philipp@ahlner.com> Co-authored-by: Kerollmops <clement@meilisearch.com>	2023-02-07 11:27:27 +00:00
bors[bot]	97fd9ac493	Merge #3405 3405: Implement geo bounding box r=irevoire a=curquiza Following https://github.com/meilisearch/milli/pull/672 (work from `@gmourier)` Fixes #2761 Co-authored-by: Guillaume Mourier <guillaume@meilisearch.com> Co-authored-by: Louis Dureuil <louis@meilisearch.com> Co-authored-by: Tamo <tamo@meilisearch.com>	2023-02-07 09:55:20 +00:00
bors[bot]	821d92b5d0	Merge #3407 3407: Add Cargo feature for LMDB's POSIX semaphores r=dureuill a=GregoryConrad See https://github.com/meilisearch/milli/pull/757 Co-authored-by: Gregory Conrad <gregorysconrad@gmail.com>	2023-02-07 08:25:20 +00:00
Tamo	42114325cd	Apply suggestions from code review Co-authored-by: Louis Dureuil <louis@meilisearch.com>	2023-02-06 18:07:00 +01:00
Tamo	7a38fe624f	throw an error if the top left corner is found below the bottom right corner	2023-02-06 17:50:47 +01:00
Tamo	1b005f697d	update the syntax of the geoboundingbox filter to use brackets instead of parens around lat and lng	2023-02-06 16:50:27 +01:00
Kerollmops	fbec48f56e	Merge remote-tracking branch 'milli/main' into bring-v1-changes	2023-02-06 16:48:10 +01:00
Tamo	3ebc99473f	Apply suggestions from code review Co-authored-by: Louis Dureuil <louis@meilisearch.com>	2023-02-06 13:29:37 +01:00
Tamo	d27007005e	comments the geoboundingbox + forbid the usage of the lexeme method which could introduce bugs	2023-02-06 11:36:49 +01:00
Tamo	fcb09ccc3d	add tests on the geoBoundingBox	2023-02-02 18:19:56 +01:00
Louis Dureuil	ae8660e585	Add Token::original_span rather than making Token::span pub	2023-02-02 15:03:34 +01:00
Guillaume Mourier	b297b5deb0	cargo fmt	2023-02-02 12:34:49 +01:00
Guillaume Mourier	0d71c80ba6	add tests	2023-02-02 12:31:27 +01:00
Guillaume Mourier	65a3086cf1	fix test	2023-02-02 12:27:58 +01:00
Guillaume Mourier	426d63b01b	Update insta test suite	2023-02-02 12:27:56 +01:00
Guillaume Mourier	b078477d80	Add error handling and earth lap collision with bounding box	2023-02-02 12:17:38 +01:00
ManyTheFish	0bc1a18f52	Use Languages list detected during indexing at search time	2023-02-01 18:57:43 +01:00
ManyTheFish	643d99e0f9	Add expectancy test	2023-02-01 18:39:54 +01:00
ManyTheFish	064158e4e2	Update test	2023-02-01 15:34:01 +01:00
ManyTheFish	77d32d0ee8	Fix codec deserialization	2023-02-01 15:26:26 +01:00
ManyTheFish	f4569b04ad	Update Charabia version	2023-02-01 15:26:26 +01:00
bors[bot]	758b4acea7	Merge #776 776: Reduce incremental indexing time of `words_prefix_position_docids` DB r=curquiza a=loiclec Fixes partially https://github.com/meilisearch/milli/issues/605 The `words_prefix_position_docids` can easily contain millions of entries. Thus, iterating over it can be very expensive. But we do so needlessly for every document addition task. It can sometimes cause indexing performance issues when: - a user sends many `documentAdditionOrUpdate` tasks that cannot all be batched together (for example if they are interspersed with `documentDeletion` tasks) - the documents contain long, diverse text fields, thus increasing the number of entries in `words_prefix_position_docids` - the index has accumulated many soft-deleted documents, further increasing the size of `words_prefix_position_docids` - the machine running Meilisearch does not have great IO performance (e.g. slow SSD, or quota-limited by the cloud provider) Note, before approving the PR: the only changed file should be `milli/src/update/words_prefix_position_docids.rs`. Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com>	2023-01-31 15:52:28 +00:00
bors[bot]	a4e8158239	Merge #774 774: Update version for the next release (v0.41.1) in Cargo.toml files r=curquiza a=meili-bot ⚠️ This PR is automatically generated. Check the new version is the expected one before merging. Co-authored-by: curquiza <curquiza@users.noreply.github.com>	2023-01-31 11:51:42 +00:00
Loïc Lecrenier	a2690ea8d4	Reduce incremental indexing time of `words_prefix_position_docids` DB This database can easily contain millions of entries. Thus, iterating over it can be very expensive. For regular `documentAdditionOrUpdate` tasks, `del_prefix_fst_words` will always be empty. Thus, we can save a significant amount of time by adding this `if !del_prefix_fst_words.is_empty()` condition. The code's behaviour remains completely unchanged.	2023-01-31 11:42:24 +01:00
f3r10	2922c5c899	Fix code format	2023-01-31 11:28:05 +01:00
f3r10	7681be5367	Format code	2023-01-31 11:28:05 +01:00
f3r10	50bc156257	Fix tests	2023-01-31 11:28:05 +01:00
f3r10	d8207356f4	Skip script,language insertion if language is undetected	2023-01-31 11:28:05 +01:00
f3r10	2d58b28f43	Improve script language codec	2023-01-31 11:28:05 +01:00
f3r10	fd60a39f1c	Format code	2023-01-31 11:28:05 +01:00
f3r10	369c05732e	Add test checking if from script_language_docids database were removed deleted docids	2023-01-31 11:28:05 +01:00
f3r10	34d04f3d3f	Filter from script_language_docids database soft deleted documents	2023-01-31 11:28:05 +01:00
f3r10	a27f329e3a	Add tests for checking that detected script and language associated with document(s) were stored during indexing	2023-01-31 11:28:05 +01:00
f3r10	b216ddba63	Delete and clear data from the new database	2023-01-31 11:28:05 +01:00
f3r10	d97fb6117e	Extract and index data	2023-01-31 11:28:05 +01:00
f3r10	c45d1e3610	Create a new database on index and add a specialized codec for it	2023-01-31 11:28:05 +01:00
Louis Dureuil	20f05efb3c	clippy: needless_lifetimes	2023-01-31 11:12:59 +01:00
Louis Dureuil	cbf029f64c	clippy: --fix	2023-01-31 11:12:59 +01:00
curquiza	bffabf9cc6	Update version for the next release (v0.41.1) in Cargo.toml files	2023-01-31 09:56:22 +00:00
Louis Dureuil	3296cf7ae6	clippy: remove needless lifetimes	2023-01-31 09:32:40 +01:00
Louis Dureuil	89675e5f15	clippy: Replace seek 0 by rewind	2023-01-31 09:32:40 +01:00
Tamo	55e8046551	bump milli	2023-01-24 13:52:21 +01:00
Tamo	de3c4f1986	throw an error on unknown fields specified in the _geo field	2023-01-24 12:23:24 +01:00
Gregory Conrad	3f69dd6450	feat: add Cargo feature for LMDB's POSIX semaphores	2023-01-19 12:08:38 -05:00
bors[bot]	1c4b1b3b2d	Merge #770 770: Update deserr v0.3.0 r=irevoire a=ManyTheFish related to https://github.com/meilisearch/meilisearch/issues/3391 Co-authored-by: Many the fish <many@meilisearch.com>	2023-01-19 17:05:56 +00:00
curquiza	abd65d9307	Update version for the next release (v0.40.0) in Cargo.toml files	2023-01-19 16:43:45 +00:00
Many the fish	30fc376713	Update deserr v0.3.0	2023-01-19 17:37:30 +01:00
bors[bot]	3521a3a0b2	Merge #763 763: Fixes error message when lat and lng are unparseable r=loiclec a=ahlner # Pull Request ## Related issue Fixes partially [#3007](https://github.com/meilisearch/meilisearch/issues/3007) ## What does this PR do? - Changes function validate_geo_from_json to return a BadLatitudeAndLongitude if lat or lng is a string and not parseable to f64 - implemented some unittests - Derived PartialEq for GeoError to use assert_eq! in tests ## PR checklist Please check if your PR fulfills the following requirements: - [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)? - [x] Have you read the contributing guidelines? - [x] Have you made sure that the title is accurate and descriptive of the changes? Thank you so much for contributing to Meilisearch! Co-authored-by: Philipp Ahlner <philipp@ahlner.com>	2023-01-19 15:15:46 +00:00
bors[bot]	40a53f8824	Merge #767 767: Update version for the next release (v0.39.2) in Cargo.toml files r=curquiza a=meili-bot ⚠️ This PR is automatically generated. Check the new version is the expected one before merging. Co-authored-by: curquiza <curquiza@users.noreply.github.com>	2023-01-19 14:48:12 +00:00
Philipp Ahlner	f5ca421227	Superfluous test removed	2023-01-19 15:39:21 +01:00
curquiza	3f048927a0	Update version for the next release (v0.39.2) in Cargo.toml files	2023-01-19 14:29:09 +00:00
Louis Dureuil	4fd6fd9bef	Indicate filterable attributes when the user set a non filterable attribute in facet distributions	2023-01-19 12:25:18 +01:00
Philipp Ahlner	a2cd7214f0	Fixes error message when lat/lng are unparseable	2023-01-19 10:10:26 +01:00
ManyTheFish	d1fc42b53a	Use compatibility decomposition normalizer in facets	2023-01-18 15:02:13 +01:00
ManyTheFish	e64571a881	Add test sorting string with diacritics	2023-01-18 14:43:38 +01:00
Philipp Ahlner	497187083b	Add test for bug #3007 : Wrong error message Adds a test for #3007: Wrong error message when lat and lng are unparseable	2023-01-18 13:24:26 +01:00
Clément Renault	1d507c84b2	Fix the formatting	2023-01-17 18:25:55 +01:00
Clément Renault	1b78231e18	Make clippy happy	2023-01-17 18:25:54 +01:00
bors[bot]	0c7d1f761e	Merge #765 765: Update version for the next release (v0.39.1) in Cargo.toml files r=curquiza a=meili-bot ⚠️ This PR is automatically generated. Check the new version is the expected one before merging. Co-authored-by: curquiza <curquiza@users.noreply.github.com>	2023-01-17 11:04:26 +00:00
curquiza	e3d30e28ef	Update version for the next release (v0.39.1) in Cargo.toml files	2023-01-17 10:50:29 +00:00
bors[bot]	63af1e9f28	Merge #764 764: Update deserr to latest version r=irevoire a=loiclec Update deserr to 0.1.5, which changes the `DeserializeFromValue` trait, getting rid of the `default()` method. Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com>	2023-01-17 10:39:36 +00:00
Loïc Lecrenier	f073a86387	Update deserr to latest version	2023-01-17 11:28:19 +01:00
Kerollmops	97005dd505	Bump the milli-imported crates to v1.0.0	2023-01-16 16:29:12 +01:00
Kerollmops	ebb2494879	Add a README to the milli crate	2023-01-16 16:25:12 +01:00
curquiza	9e32ac7cb2	Update version for the next release (v0.39.0) in Cargo.toml files	2023-01-11 15:05:06 +00:00
bors[bot]	302d6cccd7	Merge #761 761: Integrate deserr r=irevoire a=loiclec 1. `Setting<T>` now implements `DeserializeFromValue` 2. The settings now store ranking rules as strongly typed `Criterion` instead of `String`, since the validation of the ranking rules will be done on meilisearch's side from now on Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com>	2023-01-11 14:35:15 +00:00
bors[bot]	21b7d709ad	Merge #759 759: Change primary key inference error messages r=Kerollmops a=dureuill # Pull Request ## Related issue Milli part of https://github.com/meilisearch/meilisearch/issues/3301 ## What does this PR do? - Change error message strings ## PR checklist Please check if your PR fulfills the following requirements: - [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)? - [x] Have you read the contributing guidelines? - [x] Have you made sure that the title is accurate and descriptive of the changes? Thank you so much for contributing to Meilisearch! Co-authored-by: Louis Dureuil <louis@meilisearch.com>	2023-01-11 14:04:25 +00:00
Loïc Lecrenier	02fd06ea0b	Integrate deserr	2023-01-11 13:56:47 +01:00
Louis Dureuil	00746b32c0	Add Index::map_size	2023-01-10 11:16:51 +01:00
Louis Dureuil	be9786bed9	Change primary key inference error messages	2023-01-05 10:40:09 +01:00
bors[bot]	c3f4835e8e	Merge #733 733: Avoid a prefix-related worst-case scenario in the proximity criterion r=loiclec a=loiclec # Pull Request ## Related issue Somewhat fixes (until merged into meilisearch) https://github.com/meilisearch/meilisearch/issues/3118 ## What does this PR do? When a query ends with a word and a prefix, such as: ``` word pr ``` Then we first determine whether `pre` could possibly be in the proximity prefix database before querying it. There are then three possibilities: 1. `pr` is not in any prefix cache because it is not the prefix of many words. We don't query the proximity prefix database. Instead, we list all the word derivations of `pre` through the FST and query the regular proximity databases. 2. `pr` is in the prefix cache but cannot be found in the proximity prefix databases. In this case, we partially disable the proximity ranking rule for the pair `word pre`. This is done as follows: 1. Only find the documents where `word` is in proximity to `pre` exactly (no derivations) 2. Otherwise, assume that their proximity in all the documents in which they coexist is >= 8 3. `pr` is in the prefix cache and can be found in the proximity prefix databases. In this case we simply query the proximity prefix databases. Note that if a prefix is longer than 2 bytes, then it cannot be in the proximity prefix databases. Also, proximities larger than 4 are not present in these databases either. Therefore, the impact on relevancy is: 1. For common prefixes of one or two letters: we no longer distinguish between proximities from 4 to 8 2. For common prefixes of more than two letters: we no longer distinguish between any proximities 3. 
For uncommon prefixes: nothing changes Regarding (1), it means that these two documents would be considered equally relevant according to the proximity rule for the query `heard pr` (IF `pr` is the prefix of more than 200 words in the dataset): ```json [ { "text": "I heard there is a faster proximity criterion" }, { "text": "I heard there is a faster but less relevant proximity criterion" } ] ``` Regarding (2), it means that two documents would be considered equally relevant according to the proximity rule for the query "faster pro": ```json [ { "text": "I heard there is a faster but less relevant proximity criterion" } { "text": "I heard there is a faster proximity criterion" }, ] ``` But the following document would be considered more relevant than the two documents above: ```json { "text": "I heard there is a faster swimmer who is competing in the pro section of the competition " } ``` Note, however, that this change of behaviour only occurs when using the set-based version of the proximity criterion. In cases where there are fewer than 1000 candidate documents when the proximity criterion is called, this PR does not change anything. --- ## Performance I couldn't use the existing search benchmarks to measure the impact of the PR, but I did some manual tests with the `songs` benchmark dataset. ``` 1. 10x 'a': - 640ms ⟹ 630ms = no significant difference 2. 10x 'b': - set-based: 4.47s ⟹ 7.42 = bad, ~2x regression - dynamic: 1s ⟹ 870 ms = no significant difference 3. 'Someone I l': - set-based: 250ms ⟹ 12 ms = very good, x20 speedup - dynamic: 21ms ⟹ 11 ms = good, x2 speedup 4. 'billie e': - set-based: 623ms ⟹ 2ms = very good, x300 speedup - dynamic: ~4ms ⟹ 4ms = no difference 5. 'billie ei': - set-based: 57ms ⟹ 20ms = good, ~2x speedup - dynamic: ~4ms ⟹ ~2ms. = no significant difference 6. 'i am getting o' - set-based: 300ms ⟹ 60ms = very good, 5x speedup - dynamic: 30ms ⟹ 6ms = very good, 5x speedup 7. 
'prologue 1 a 1': - set-based: 3.36s ⟹ 120ms = very good, 30x speedup - dynamic: 200ms ⟹ 30ms = very good, 6x speedup 8. 'prologue 1 a 10': - set-based: 590ms ⟹ 18ms = very good, 30x speedup - dynamic: 82ms ⟹ 35ms = good, ~2x speedup ``` Performance is often significantly better, but there is also one regression in the set-based implementation with the query `b b b b b b b b b b`. Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com>	2023-01-04 09:00:50 +00:00
bors[bot]	49f58b2c47	Merge #732 732: Interpret synonyms as phrases r=loiclec a=loiclec # Pull Request ## Related issue Fixes (when merged into meilisearch) https://github.com/meilisearch/meilisearch/issues/3125 ## What does this PR do? We now map multi-word synonyms to phrases instead of loose words, so that the request: ``` btw I am going to nyc soon ``` is interpreted as (when the synonym interpretation is chosen for both `btw` and `nyc`): ``` "by the way" I am going to "New York City" soon ``` instead of: ``` by the way I am going to New York City soon ``` This prevents queries containing multi-word synonyms from exceeding the word length limit and degrading search performance. In terms of relevancy, there is a debate to have. I personally think this could be considered an improvement, since it would be strange for a user to search for: ``` good DIY project ``` and have a result such as: ``` { "text": "whether it is a good project to do, you'll have to decide for yourself" } ``` However, for synonyms such as `NYC -> New York City`, we will stop matching documents where `New York` is separated from `City`. This is however solvable by adding an additional mapping: `NYC -> New York`. ## Performance With the old behaviour, some long search requests making heavy use of synonyms could take minutes to be executed. This is no longer the case: these search requests now take an average amount of time to be resolved. Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com>	2023-01-04 08:34:18 +00:00
bors[bot]	6a10e85707	Merge #736 736: Update charabia r=curquiza a=ManyTheFish Update Charabia to the last version. > We are now Romanizing Chinese characters into Pinyin. > Note that we keep the accent because they are in fact never typed directly by the end-user, moreover, changing an accent leads to a different Chinese character, and I don't have sufficient knowledge to forecast the impact of removing accents in this context. Co-authored-by: ManyTheFish <many@meilisearch.com>	2023-01-03 15:44:41 +00:00
bors[bot]	9519e60f97	Merge #709 709: Optimise the `ExactWords` sub-criterion within `Exactness` r=loiclec a=loiclec # Pull Request ## Related issue Fixes (partially) https://github.com/meilisearch/meilisearch/issues/3116 ## What does this PR do? 1. Reduces the algorithmic complexity of finding the documents containing N exact words from something that is exponential to something that is polynomial. 2. Cache intermediary results between different calls to the `exactness` criterion. ## Performance Results On the `smol_songs.csv` dataset, a request containing 10 common words now takes about 60ms instead of 5 seconds to execute. For example, this is the case with this (admittedly nonsensical) request: `Rock You Hip Hop Folk World Country Electronic Love The`. Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com>	2023-01-02 12:28:30 +00:00
Loïc Lecrenier	b5df889dcb	Apply review suggestions: simplify implementation of exactness criterion	2023-01-02 13:11:47 +01:00
Loïc Lecrenier	8d36570958	Add explicit criterion impl strategy to proximity search tests	2023-01-02 10:37:01 +01:00
Loïc Lecrenier	32c6062e65	Optimise exactness criterion 1. Cache some results between calls to next() 2. Compute the combinations of exact words more efficiently	2022-12-22 12:28:45 +01:00
Loïc Lecrenier	f097aafa1c	Add unit test for prefix handling by the proximity criterion	2022-12-22 12:08:00 +01:00
Loïc Lecrenier	777b387dc4	Avoid a prefix-related worst-case scenario in the proximity criterion	2022-12-22 12:08:00 +01:00
Loïc Lecrenier	b0f3dc2c06	Interpret synonyms as phrases	2022-12-22 12:07:51 +01:00
Louis Dureuil	4b166bea2b	Add primary_key_inference test	2022-12-21 15:13:38 +01:00
Louis Dureuil	5943100754	Fix existing tests	2022-12-21 15:13:38 +01:00
Louis Dureuil	b24def3281	Add logging when inference took place. Displays log message in the form: ``` [2022-12-21T09:19:42Z INFO milli::update::index_documents::enrich] Primary key was not specified in index. Inferred to 'id' ```	2022-12-21 15:13:38 +01:00
Louis Dureuil	402dcd6b2f	Simplify primary key inference	2022-12-21 15:13:38 +01:00
Louis Dureuil	13c95d25aa	Remove uses of UserError::MissingPrimaryKey not related to inference	2022-12-21 15:13:36 +01:00
bors[bot]	a8defb585b	Merge #742 742: Add a "Criterion implementation strategy" parameter to Search r=irevoire a=loiclec Add a parameter to search requests which determines the implementation strategy of the criteria. This can be either `set-based`, `iterative`, or `dynamic` (ie choosing between set-based or iterative at search time). See https://github.com/meilisearch/milli/issues/755 for more context about this change. Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com>	2022-12-21 12:18:49 +00:00
Loïc Lecrenier	339a4b0789	Make clippy happy	2022-12-21 12:49:34 +01:00
Loïc Lecrenier	229405aeb9	Choose implementation strategy of criterion at runtime	2022-12-21 09:29:39 +01:00
Loïc Lecrenier	fc0e7382fe	Fix hard-deletion of an external id that was soft-deleted	2022-12-20 15:33:31 +01:00
bors[bot]	97fb64e40e	Merge #747 747: Soft-deletion computation no longer depends on the mapsize r=irevoire a=dureuill # Pull Request ## Related issue Related to https://github.com/meilisearch/meilisearch/issues/3231: After removing `--max-index-size`, the `mapsize` will always be unrelated to the actual max size the user wants for their DB, so it doesn't make sense to use these values any longer. This implements solution 2.3 from https://github.com/meilisearch/meilisearch/issues/3231#issuecomment-1348628824 ## What does this PR do? ### User-visible - Soft-deleted documents are no longer deleted when there is less than 10% of the mapsize available or when they take more than 10% of the mapsize - Instead, they are deleted when there are more soft-deleted documents than regular documents, or when they take more than 1GiB of disk space (estimated). ### Implementation standpoint 1. Adds a `DeletionStrategy` enum to replace the boolean `disable_soft_deletion` that we had up until now. This enum allows us to specify that we want "always hard", "always soft", or to use the dynamic soft-deletion strategy (default). 2. Uses the current strategy when deleting documents, with the new heuristics being used in the `DeletionStrategy::Dynamic` variant. 3. Updates the tests to use the appropriate DeletionStrategy whenever needed (one of `AlwaysHard` or `AlwaysSoft` depending on the test) Note to reviewers: this PR is optimized for a commit-by-commit review. ## PR checklist Please check if your PR fulfills the following requirements: - [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)? - [x] Have you read the contributing guidelines? - [x] Have you made sure that the title is accurate and descriptive of the changes? Thank you so much for contributing to Meilisearch! Co-authored-by: Louis Dureuil <louis@meilisearch.com> Co-authored-by: Tamo <tamo@meilisearch.com>	2022-12-19 17:46:18 +00:00
Tamo	69edbf9f6d	Update milli/src/update/delete_documents.rs	2022-12-19 18:23:50 +01:00
curquiza	c72535531b	Update version for the next release (v0.38.0) in Cargo.toml files	2022-12-19 16:35:38 +00:00
Louis Dureuil	916c23e7be	Tests: rename snapshots	2022-12-19 10:07:17 +01:00
Louis Dureuil	ad9937c755	Fix tests after adding DeletionStrategy	2022-12-19 10:07:17 +01:00
Louis Dureuil	171c942282	Soft-deletion computation no longer takes into account the mapsize Implemented solution 2.3 from https://github.com/meilisearch/meilisearch/issues/3231#issuecomment-1348628824	2022-12-19 10:07:17 +01:00
Louis Dureuil	e2ae3b24aa	Hard or soft delete according to the deletion strategy	2022-12-19 10:00:13 +01:00
Louis Dureuil	fc7618d49b	Add DeletionStrategy	2022-12-19 09:49:58 +01:00
ManyTheFish	7f88c4ff2f	Fix #1714 test	2022-12-15 18:22:28 +01:00
ManyTheFish	96d4242b93	Update charabia	2022-12-15 18:22:22 +01:00
bors[bot]	5114686394	Merge #743 743: Fix finite pagination with placeholder search r=Kerollmops a=ManyTheFish this bug is reproducible on real datasets and is hard to isolate in a simple test. related to: https://github.com/meilisearch/meilisearch/issues/3200 poke `@curquiza` Co-authored-by: ManyTheFish <many@meilisearch.com>	2022-12-15 09:31:47 +00:00
ManyTheFish	3322018c06	Fix placeholder search	2022-12-14 20:09:47 +01:00
bors[bot]	0276d5212a	Merge #728 728: Add some integration tests on the sort criterion r=ManyTheFish a=loiclec This is simply an integration test ensuring that the sort criterion works properly. However, only one version of the algorithm is tested here (the iterative one). To test the version that uses the facet DB, one has to manually set the `CANDIDATES_THRESHOLD` constant to `0`. I have done that and ensured that the test still succeeds. However, in the future, we will probably want to have an option to force which algorithm is used at runtime, for testing purposes. Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com>	2022-12-14 09:27:12 +00:00
bors[bot]	e2ffc3d69a	Merge #741 741: Add test reproducing the bug fixed by #737 r=Kerollmops a=ManyTheFish related to #737 Co-authored-by: ManyTheFish <many@meilisearch.com>	2022-12-13 15:02:19 +00:00
ManyTheFish	739da9fd4d	Add test	2022-12-13 15:54:43 +01:00
bors[bot]	406ee31d1a	Merge #737 737: Fix typo initial candidates computation r=Kerollmops a=ManyTheFish When the `Typo` criterion came after a criterion other than `Words` and the previous criterion wasn't returning any candidates at the first iteration of the bucket sort, the `initial_candidates` were lost. Now, `Typo` ensures that the `initial_candidates` are kept between iterations. related to https://github.com/meilisearch/meilisearch/issues/3200#issuecomment-1345179578 related to https://github.com/meilisearch/meilisearch/issues/3228 Co-authored-by: ManyTheFish <many@meilisearch.com>	2022-12-13 10:29:28 +00:00
ManyTheFish	2d8d0af1a6	Rename short name bc by ic for initial_candidates	2022-12-13 10:56:38 +01:00
Loïc Lecrenier	be3b00350c	Apply review suggestions: naming and documentation	2022-12-13 10:15:22 +01:00
ManyTheFish	80d34a4169	Fix typo initial candidates computation	2022-12-12 19:02:48 +01:00
Loïc Lecrenier	e3ee553dcc	Remove soft deleted ids from ExternalDocumentIds during document import If the document import replaces a document using hard deletion	2022-12-12 14:16:09 +01:00
Loïc Lecrenier	bebd050961	Add new test for bug 3021	2022-12-08 19:19:40 +01:00
ManyTheFish	55724f2412	Introduce an initial candidates set that makes the difference between an exhaustive count and an estimation	2022-12-08 09:41:34 +01:00
ManyTheFish	6d50ea0830	add tests	2022-12-08 08:56:57 +01:00
Loïc Lecrenier	f37c86e0b2	Add some integration tests on the sort criterion	2022-12-07 15:59:33 +01:00
Loïc Lecrenier	d38cc73630	Add one more filter "integration" test	2022-12-07 14:38:25 +01:00
Loïc Lecrenier	e688581c36	Add tests for facet range search on different field ids	2022-12-07 14:38:21 +01:00
Loïc Lecrenier	4ac8f96342	Simplify implementation of equality condition in filters	2022-12-07 14:38:18 +01:00
Loïc Lecrenier	1c9555566e	Fix bug in facet range search	2022-12-07 14:38:14 +01:00
Loïc Lecrenier	303d740245	Prepare fix within facet range search By creating snapshots and updating the format of the existing snapshots. The next commit will apply the fix, which will show its effects cleanly on the old and new snapshot tests	2022-12-07 14:38:10 +01:00
bors[bot]	0a301b5f88	Merge #723 723: Fix bug in handling of soft deleted documents when updating settings r=Kerollmops a=loiclec # Pull Request ## Related issue Fixes (partially, until merged into meilisearch) https://github.com/meilisearch/meilisearch/issues/3021 ## What does this PR do? This PR fixes the bug where a `missing key in documents database` internal error message could appear when indexing documents. When updating the settings, before clearing the database and before creating the transform output, we now modify the `ExternalDocumentsIds` structure to get rid of all references to soft deleted document ids in its FSTs. It used to be that updating the settings would clear the soft-deleted document ids, but keep the original `ExternalDocumentsIds` structure. As a consequence of this, when processing a future document addition, we could wrongly believe that a document was being replaced when, in fact, it was a completely new document. See the tests `bug_3021_first`, `bug_3021_second`, and `bug_3021` for a minimal test case that would have reproduced the issue. We need to take special care to: - evaluate how users should update to v0.30.1 (containing this fix): dump? reimporting all documents from scratch? - understand IF/HOW this bug could have caused duplicate documents to be returned - and evaluate the correctness of the fix, of course :) Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com>	2022-12-06 14:37:38 +00:00
Loïc Lecrenier	a993b68684	Cargo fmt >:-(	2022-12-06 15:22:10 +01:00
Loïc Lecrenier	80c7a00567	Fix compilation error in tests of settings update	2022-12-06 15:19:26 +01:00
Loïc Lecrenier	67d8cec209	Fix bug in handling of soft deleted documents when updating settings	2022-12-06 15:09:19 +01:00
bors[bot]	2a846aaae7	Merge #719 719: Add more members of `filter_parser` to `milli::` & `From<&str>` implementation for `Token` r=Kerollmops a=GregoryConrad ## What does this PR do? The current `milli::Filter` and `milli::FilterCondition` APIs require working with some members of `filter_parser` directly that `milli::` does not re-export to its users (at least when not parsing input using `parse`). Also, using `filter_parser` does not make sense when using milli from an embedded context where there is no query to parse. Instead of reworking `milli::Filter` and `milli::FilterCondition`, this PR adds two non-breaking changes that ease the use of milli: - Re-exports more members of the dependent version of `filter_parser` in `milli` - Implements `From<&str>` for `filter_parser::Token` - This will also allow some basic tests that need to create a `Token` from a string to avoid some boilerplate. In conjunction, both of these will allow milli users to easily create a `Token` from a `&str` without needing to add `filter_parser` as an extra dependency. Note: I wanted to use `FromStr` for the `From` implementation; however, it requires returning a `Result` which is not needed for the conversion. Thus, I just left it as `From<&str>`. Co-authored-by: Gregory Conrad <gregorysconrad@gmail.com>	2022-12-06 10:36:00 +00:00
Tamo	212dbfa3b5	Update milli/src/search/facet/filter.rs	2022-12-05 20:56:21 +01:00
amab8901	456da5de9c	Geosearch for zero radius	2022-12-05 20:11:46 +01:00
Loïc Lecrenier	cda4ba2bb6	Add document import tests	2022-12-05 12:02:49 +01:00
Loïc Lecrenier	ae59d37b75	Improve insta-snap of the external document ids	2022-12-05 10:51:02 +01:00
Loïc Lecrenier	f2cf981641	Add more tests and allow disabling of soft-deletion outside of tests Also allow disabling soft-deletion in the IndexDocumentsConfig	2022-12-05 10:51:01 +01:00
Gregory Conrad	50954d31fa	feat: Re-export Span and Token to milli::	2022-12-03 13:37:33 -05:00
bors[bot]	d3731dda48	Merge #706 706: Limit the reindexing caused by updating settings when not needed r=curquiza a=GregoryConrad ## What does this PR do? When updating index settings using `update::Settings`, sometimes a `reindex` of `update::Settings` is triggered when it doesn't need to be. This PR aims to prevent those unnecessary `reindex` calls. For reference, here is a snippet from the current `execute` method in `update::Settings`: ```rust // ... if stop_words_updated || faceted_updated || synonyms_updated || searchable_updated || exact_attributes_updated { self.reindex(&progress_callback, &should_abort, old_fields_ids_map)?; } ``` - [x] `faceted_updated` - looks good as-is ✅ - [x] `stop_words_updated` - looks good as-is ✅ - [x] `synonyms_updated` - looks good as-is ✅ - [x] `searchable_updated` - fixed in this PR - [x] `exact_attributes_updated` - fixed in this PR ## PR checklist Please check if your PR fulfills the following requirements: - [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)? - [x] Have you read the contributing guidelines? - [x] Have you made sure that the title is accurate and descriptive of the changes? Thank you so much for contributing to Meilisearch! Co-authored-by: Gregory Conrad <gregorysconrad@gmail.com>	2022-12-01 13:58:02 +00:00
bors[bot]	5e754b3ee0	Merge #708 708: Reduce memory usage of the MatchingWords structure r=ManyTheFish a=loiclec # Pull Request ## Related issue Fixes (partially) https://github.com/meilisearch/meilisearch/issues/3115 ## What does this PR do? 1. Reduces the memory usage caused by the creation of a 10-word query tree by 20x. This is done by deduplicating the `MatchingWord` values, which are heavy because of their inner DFA. The deduplication works by wrapping each `MatchingWord` in a reference-counted box and using a hash map to determine whether a `MatchingWord` DFA already exists for a certain signature, or whether a new one needs to be built. 2. Avoid the worst-case scenario of creating a `MatchingWord` for extremely long words that cannot be indexed by milli. Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com>	2022-11-30 17:47:34 +00:00
bors[bot]	e1612fcb01	Merge #712 712: Fix bulk facet indexing bug r=Kerollmops a=loiclec # Pull Request ## Related issue Fixes (partially, until merged into meilisearch) https://github.com/meilisearch/meilisearch/issues/3165 ## What does this PR do? Fixes a bug where indexing certain numbers of filterable attribute values in bulk led to corrupted facet databases. This was due to a lossy integer conversion which would ultimately prevent entire levels of the facet database from being written into LMDB. More specifically, this change was made: ```diff - if cur_writer_len as u8 >= self.min_level_size { + if cur_writer_len >= self.min_level_size as usize { ``` I also checked other comparisons to `min_level_size` and other conversions such as `x as u8` in this part of the codebase. Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com>	2022-11-30 16:51:48 +00:00
Loïc Lecrenier	9dd4b33a9a	Fix bulk facet indexing bug	2022-11-30 14:27:36 +01:00
Gregory Conrad	87e2bc3bed	fix(reindex): reindex in a few more cases Cases: whenever searchable_fields OR user_defined_searchable_fields is modified	2022-11-28 13:12:19 -05:00
Loïc Lecrenier	61b58b115a	Don't create partial matching words for synonyms in ngrams	2022-11-28 16:32:28 +01:00
Gregory Conrad	d3182f3830	refactor: Change return type to keep consistency with others	2022-11-28 10:02:03 -05:00
Loïc Lecrenier	f70856bab1	Remove memory usage test that fails when many tests are run in parallel	2022-11-28 12:55:28 +01:00
Loïc Lecrenier	e2ebed62b1	Don't create partial matching words for synonyms, split words, phrases	2022-11-28 10:20:13 +01:00
Loïc Lecrenier	8284bd760f	Relax memory ordering of operations within the test CountingAlloc	2022-11-28 10:20:13 +01:00
Loïc Lecrenier	8d0ace2d64	Avoid creating a MatchingWord for words that exceed the length limit	2022-11-28 10:20:13 +01:00
Loïc Lecrenier	86c34a996b	Deduplicate matching words	2022-11-28 10:20:13 +01:00
Gregory Conrad	e0d24104a3	refactor: Rewrite another method chain to be more readable	2022-11-26 13:33:19 -05:00
Gregory Conrad	2db738dbac	refactor: rewrite method chain to be more readable	2022-11-26 13:26:39 -05:00
Gregory Conrad	935a724c57	revert: Revert pass by reference API change	2022-11-24 10:08:23 -05:00
Gregory Conrad	ed29cceae9	perf: Prevent reindex in searchable set case when not needed	2022-11-23 22:33:06 -05:00
Gregory Conrad	bb9e33bf85	perf: Prevent reindex in searchable reset case when not needed	2022-11-23 22:01:46 -05:00
Gregory Conrad	7c0e544839	feat: Add all_obkv_to_json function	2022-11-23 21:18:58 -05:00
Gregory Conrad	d19c8672bb	perf: limit reindex to when exact_attributes changes	2022-11-23 15:50:53 -05:00
bors[bot]	57c9f03e51	Merge #697 697: Fix bug in prefix DB indexing r=loiclec a=loiclec Where the batch's information was not properly updated in cases where only the proximity changed between two consecutive word pair proximities. Closes partially https://github.com/meilisearch/meilisearch/issues/3043 Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com>	2022-11-17 15:22:01 +00:00
curquiza	cd5aaa3a9f	Update version for the next release (v0.37.0) in Cargo.toml files	2022-11-17 12:50:07 +00:00
Loïc Lecrenier	777eb3fa00	Add insta-snaps for test of bug 3043	2022-11-17 12:21:27 +01:00
Loïc Lecrenier	0caadedd3b	Make clippy happy	2022-11-17 12:17:53 +01:00
Loïc Lecrenier	ac3baafbe8	Truncate facet values that are too long before indexing them	2022-11-17 11:29:42 +01:00
Loïc Lecrenier	990a861241	Add test for indexing a document with a long facet value	2022-11-17 11:29:42 +01:00
Loïc Lecrenier	d95d02cb8a	Fix Facet Indexing bugs 1. Handle keys with variable length correctly This fixes https://github.com/meilisearch/meilisearch/issues/3042 and is easily reproducible with the updated fuzz tests, which now generate keys with variable lengths. 2. Prevent adding facets to the database if their encoded value does not satisfy `valid_lmdb_key`. This fixes an indexing failure when a document had a filterable attribute containing a value whose length is higher than ~500 bytes.	2022-11-17 11:29:42 +01:00
Loïc Lecrenier	f00108d2ec	Fix name of bug in reproduction test	2022-11-17 11:29:18 +01:00
Loïc Lecrenier	f7c8730d09	Fix bug in prefix DB indexing Where the batch's information was not properly updated in cases where only the proximity changed between two consecutive word pair proximities. Closes https://github.com/meilisearch/meilisearch/issues/3043	2022-11-17 11:29:18 +01:00
Louis Dureuil	6dc6a5d874	Force using vendored version of LMDB - don't use lmdb master3 branch anymore	2022-11-14 17:17:51 +01:00
Kerollmops	d00d2aab3f	Update version for the next release (v0.36.0) in Cargo.toml files	2022-11-09 11:03:09 +00:00
bors[bot]	f46a8ab2e2	Merge #693 693: use the lmdb-master.3 branch r=Kerollmops a=irevoire After investigating https://github.com/meilisearch/meilisearch/issues/3017, we found out that it was due to lmdb and that, without any code change on our side, bumping to the lmdb-master-3 branch fixes our issues. But, we’re not really confident about what changed between the `mdb.master` and `mdb.master3` branches; thus this is a temporary change, and we hope we’ll be able to move to the new version of heed asap (either before the end of the pre-release or for the next release). -------- The bug is hard to reproduce; I can reproduce it 100% of the time on my archlinux personal computer. But on a scaleway archlinux bare-metal machine, it doesn’t reproduce. It’s flaky on our test suite, but `@loiclec` was able to write a minimal test that reproduces it every time on macOS. Basically, it happens when multiple threads open databases in different directories at the same time. If there are 10 or more threads running at the same time, lmdb starts throwing the `Invalid argument (os error 22)` error for no reason, we believe. I would like to submit an issue to lmdb, but I don’t really have the time to write a test in C without heed currently. `@hyc`, if you want to take a look at it, here is the repo that reproduces the issue on macOS: https://github.com/irevoire/heed-bug Co-authored-by: Irevoire <tamo@meilisearch.com>	2022-11-09 09:42:38 +00:00
Irevoire	c7711daca3	use the lmdb-master.3 branch	2022-11-08 16:28:01 +01:00
Kerollmops	bd12989610	Update version for the next release (v0.35.1) in Cargo.toml files	2022-11-08 14:31:39 +00:00
bors[bot]	24a298a83c	Merge #690 690: Fix soft deleted bug settings r=ManyTheFish a=Kerollmops Co-authored-by: Kerollmops <clement@meilisearch.com>	2022-11-08 13:45:10 +00:00
bors[bot]	d85cd9bf1a	Merge #689 689: Handle non-finite floats consistently in filters r=irevoire a=dureuill # Pull Request ## Related issue Related meilisearch/meilisearch#3000 ## What does this PR do? ### User - Filters using `field = inf`, (or `infinite`, `NaN`) now match the value as a string rather than returning an internal error. - Filters using `field < inf` (or other comparison operators) now return an invalid_filter error rather than returning an internal error, much like when using `field < aaa`. ### Implementation - Add new `NonFiniteFloat` error variants to the filter-parser errors - Add `Token::parse_as_finite_float` that can fail both when the string is not a float and when the float is not finite - Refactor `Filter::inner_evaluate` to always use `parse_as_finite_float` instead of just `parse` - Add corresponding tests ## PR checklist Please check if your PR fulfills the following requirements: - [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)? - [x] Have you read the contributing guidelines? - [x] Have you made sure that the title is accurate and descriptive of the changes? Thank you so much for contributing to Meilisearch! Co-authored-by: Louis Dureuil <louis@meilisearch.com>	2022-11-08 13:24:38 +00:00
Kerollmops	37b3c5c323	Fix transform to use all_documents and ignore soft_deleted documents	2022-11-08 14:23:16 +01:00
Kerollmops	1b1ad1923b	Add a test to check that we take care of soft deleted documents	2022-11-08 14:23:14 +01:00
Louis Dureuil	a836b8e703	tests: Tests filter with non-finite floats	2022-11-08 13:56:55 +01:00
Louis Dureuil	3328560788	fix: allow filters on = inf, = NaN, return InvalidFilter for < inf, < NaN Fixes meilisearch/meilisearch#3000	2022-11-08 13:27:15 +01:00
unvalley	abf1cf9cd5	Fix clippy errors	2022-11-04 09:27:46 +09:00
unvalley	70465aa5ce	Execute cargo fmt	2022-11-04 08:59:58 +09:00
unvalley	3009981d31	Fix clippy errors Add clippy job Add clippy job to CI	2022-11-04 08:58:14 +09:00
bors[bot]	6add470805	Merge #659 659: Fix clippy error to add clippy job on Ci r=Kerollmops a=unvalley ## Related PR This PR is for #673 ## What does this PR do? - ~~add `Run Clippy` job to CI (rust.yml)~~ - apply `cargo clippy --fix` command - fix some `cargo clippy` error manually (but warnings still remain on tests) ## PR checklist Please check if your PR fulfills the following requirements: - [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)? - [x] Have you read the contributing guidelines? - [x] Have you made sure that the title is accurate and descriptive of the changes? Co-authored-by: unvalley <kirohi.code@gmail.com> Co-authored-by: unvalley <38400669+unvalley@users.noreply.github.com>	2022-11-03 15:24:38 +00:00
unvalley	13175f2339	refactor: match for filterCondition	2022-11-03 17:34:33 +09:00
Shashank Kashyap	a07f0a4a43	Delete facet_string_zero_bounds_value_codec.rs	2022-10-30 08:59:04 +05:30
Shashank Kashyap	2dec6e86e9	Delete facet_string_level_zero_value_codec.rs	2022-10-30 08:58:36 +05:30
bors[bot]	c965200010	Merge #664 664: Fix phrase search containing stop words r=ManyTheFish a=Samyak2 # Pull Request This is a WIP draft PR I wanted to create to let other potential contributors know that I'm working on this issue. I'll complete it within a few hours of opening it. ## Related issue Fixes #661 and works towards fixing meilisearch/meilisearch#2905 ## What does this PR do? - [x] Change Phrase Operation to use a `Vec<Option<String>>` instead of `Vec<String>` where `None` corresponds to a stop word - [x] Update all other uses of phrase operation - [x] Update `resolve_phrase` - [x] Update `create_primitive_query`? - [x] Add test ## PR checklist Please check if your PR fulfills the following requirements: - [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)? - [x] Have you read the contributing guidelines? - [x] Have you made sure that the title is accurate and descriptive of the changes? Co-authored-by: Samyak S Sarnayak <samyak201@gmail.com> Co-authored-by: Samyak Sarnayak <samyak201@gmail.com>	2022-10-29 13:42:52 +00:00
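The phrase representation from #664 can be sketched roughly as follows: each stop word becomes `None`, so the remaining words keep their original positions. The function names `build_phrase` and `positioned_words` are made up for this illustration and are not milli's API.

```rust
use std::collections::HashSet;

/// Builds a phrase in which every stop word is replaced by `None`,
/// so the positions of the remaining words are preserved.
pub fn build_phrase(words: &[&str], stop_words: &HashSet<&str>) -> Vec<Option<String>> {
    words
        .iter()
        .map(|w| {
            if stop_words.contains(w) {
                None
            } else {
                Some(w.to_string())
            }
        })
        .collect()
}

/// Returns the non-stop words together with their original positions.
/// Enumerating before filtering keeps the original indices.
pub fn positioned_words(phrase: &[Option<String>]) -> Vec<(usize, &str)> {
    phrase
        .iter()
        .enumerate()
        .filter_map(|(i, w)| w.as_deref().map(|w| (i, w)))
        .collect()
}
```

Keeping the gaps lets later steps compute the distance between two surviving words even when a stop word sat between them.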
unvalley	d55f0e2e53	Execute cargo fmt	2022-10-28 23:42:23 +09:00
unvalley	d53a80b408	Fix clippy error	2022-10-28 23:41:35 +09:00
Samyak Sarnayak	ecb88143f9	Run cargo fmt	2022-10-28 19:37:02 +05:30
Samyak Sarnayak	03eb5d87c1	Only call plane_sweep on subgroups when 2 or more are present	2022-10-28 19:32:05 +05:30
unvalley	a1d7ed1258	fix clippy error and remove clippy job from ci Remove clippy job Fix clippy error type_complexity Restore ambiguous change	2022-10-28 22:33:50 +09:00
unvalley	f3c0b05ae8	Fix rust fmt	2022-10-28 09:32:31 +09:00
unvalley	f4ec1abb9b	Fix all clippy error after conflicts	2022-10-27 23:58:13 +09:00
Samyak S Sarnayak	d35afa0cf5	Change consecutive phrase search grouping logic Co-authored-by: ManyTheFish <many@meilisearch.com>	2022-10-26 23:10:48 +05:30
Samyak S Sarnayak	752d031010	Update phrase search to use new `execute` method	2022-10-26 23:07:20 +05:30
unvalley	c7322f704c	Fix cargo clippy errors Don't apply clippy for tests for now Fix clippy warnings of filter-parser package parent 8352febd646ec4bcf56a44161e5c4dce0e55111f author unvalley <38400669+unvalley@users.noreply.github.com> 1666325847 +0900 committer unvalley <kirohi.code@gmail.com> 1666791316 +0900 Update .github/workflows/rust.yml Co-authored-by: Clémentine Urquizar - curqui <clementine@meilisearch.com> Allow clippy lint too_many_arguments Allow clippy lint needless_collect Allow clippy lint too_many_arguments and type_complexity Fix for clippy warnings comparison_chains Fix for clippy warnings vec_init_then_push Allow clippy lint should_implement_trait Allow clippy lint drop_non_drop Fix lifetime clippy warnings in filter-parser Execute cargo fmt Fix remaining clippy warnings Fix remaining clippy warnings again and allow lint on each place	2022-10-27 01:04:23 +09:00
unvalley	811f156031	Execute cargo clippy --fix	2022-10-27 01:00:00 +09:00
Samyak S Sarnayak	488d31ecdf	Run cargo fmt	2022-10-26 19:09:45 +05:30
Samyak S Sarnayak	af33d22f25	Consecutive is false when at least 1 stop word is surrounded by words	2022-10-26 19:09:45 +05:30
Samyak S Sarnayak	f1da623af3	Add test for phrase search with stop words and all criteria at once Moved the actual test into a separate function used by both the existing test and the new test.	2022-10-26 19:09:44 +05:30
Samyak S Sarnayak	77f1ff019b	Simplify stop word checking in create_primitive_query	2022-10-26 19:09:44 +05:30
Samyak S Sarnayak	2aa11afb87	Fix panic when phrase contains only one stop word and nothing else	2022-10-26 19:09:42 +05:30
Samyak S Sarnayak	bb9ce3c5c5	Run cargo fmt	2022-10-26 19:09:03 +05:30
Samyak S Sarnayak	d187b32a28	Fix snapshots to use new phrase type	2022-10-26 19:09:03 +05:30
Samyak S Sarnayak	c8c666c6a6	Use resolve_phrase in exactness and typo criteria	2022-10-26 19:09:01 +05:30
Samyak S Sarnayak	3e190503e6	Search for closest non-stop words in proximity criteria	2022-10-26 19:08:34 +05:30
Samyak S Sarnayak	709ab3c14c	Increment position even when it's a stop word in exactness criteria	2022-10-26 19:08:33 +05:30
Samyak S Sarnayak	ef13c6a5b6	Perform filter after enumerate to keep origin indices	2022-10-26 19:08:33 +05:30
Samyak S Sarnayak	6a10b679ca	Add test for phrase search with stop words Originally written by ManyTheFish here: https://gist.github.com/ManyTheFish/f840e37cb2d2e029ce05396b4d540762 Co-authored-by: ManyTheFish <many@meilisearch.com>	2022-10-26 19:08:32 +05:30
Samyak S Sarnayak	62816dddde	[WIP] Fix phrase search containing stop words Fixes #661 and meilisearch/meilisearch#2905	2022-10-26 19:08:06 +05:30
Loïc Lecrenier	54c0cf93fe	Merge remote-tracking branch 'origin/main' into facet-levels-refactor	2022-10-26 15:13:34 +02:00
bors[bot]	365f44c39b	Merge #668 668: Fix many Clippy errors part 2 r=ManyTheFish a=ehiggs This brings us a step closer to enforcing clippy on each build. # Pull Request ## Related issue This does not fix any issue outright, but it is a second round of fixes for clippy after https://github.com/meilisearch/milli/pull/665. This should contribute to fixing https://github.com/meilisearch/milli/pull/659. ## What does this PR do? Satisfies many issues for clippy. The complaints are mostly: * Passing reference where a variable is already a reference. * Using clone where a struct already implements `Copy` * Using `ok_or_else` when it is a closure that returns a value instead of using the closure to call function (hence we use `ok_or`) * Unambiguous lifetimes don't need names, so we can just use `'_` * Using `return` when it is not needed as we are on the last expression of a function. ## PR checklist Please check if your PR fulfills the following requirements: - [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)? - [x] Have you read the contributing guidelines? - [x] Have you made sure that the title is accurate and descriptive of the changes? Thank you so much for contributing to Meilisearch! Co-authored-by: Ewan Higgs <ewan.higgs@gmail.com>	2022-10-26 12:16:24 +00:00
Loïc Lecrenier	631e9910da	Depend on released version of fuzzcheck from crates.io	2022-10-26 14:06:59 +02:00
Loïc Lecrenier	2741756248	Merge remote-tracking branch 'origin/main' into facet-levels-refactor	2022-10-26 14:03:23 +02:00
bors[bot]	d3f95e6c69	Merge #671 671: Update version for the next release (v0.35.0) in Cargo.toml files r=Kerollmops a=meili-bot ⚠️ This PR is automatically generated. Check the new version is the expected one before merging. Co-authored-by: curquiza <curquiza@users.noreply.github.com>	2022-10-26 11:58:05 +00:00
Loïc Lecrenier	b7f2428961	Fix formatting and warning after rebasing from main	2022-10-26 13:49:33 +02:00
Loïc Lecrenier	3b1f908e5e	Revert behaviour of facet distribution to what it was before Where the docid that is used to get the original facet string value definitely belongs to the candidates	2022-10-26 13:48:01 +02:00
Loïc Lecrenier	14ca8048a8	Add some documentation on how to run the facet db fuzzer	2022-10-26 13:48:01 +02:00
Loïc Lecrenier	206a3e00e5	cargo fmt	2022-10-26 13:48:01 +02:00
Loïc Lecrenier	f198b20c42	Add facet deletion tests that use both the incremental and bulk methods + update deletion snapshots to the new database format	2022-10-26 13:47:46 +02:00
Loïc Lecrenier	e3ba1fc883	Make deletion tests for both soft-deletion and hard-deletion	2022-10-26 13:47:46 +02:00
Loïc Lecrenier	ab5e56fd16	Add document deletion snapshot tests and tests for hard-deletion	2022-10-26 13:47:46 +02:00
Loïc Lecrenier	d885de1600	Add option to avoid soft deletion of documents	2022-10-26 13:47:46 +02:00
Loïc Lecrenier	2295e0e3ce	Use real delete function in facet indexing fuzz tests By deleting multiple docids at once instead of one-by-one	2022-10-26 13:47:46 +02:00
Loïc Lecrenier	acc8caebe6	Add link to GitHub PR to document of update/facet module	2022-10-26 13:47:46 +02:00
Loïc Lecrenier	a034a1e628	Move StrRefCodec and ByteSliceRefCodec to their own files	2022-10-26 13:47:46 +02:00
Loïc Lecrenier	1165ba2171	Make facet deletion incremental	2022-10-26 13:47:04 +02:00
Loïc Lecrenier	0ade699873	Don't crash when failing to decode using StrRef codec	2022-10-26 13:47:04 +02:00
Loïc Lecrenier	d0109627b9	Fix a bug in facet_range_search and add documentation	2022-10-26 13:47:04 +02:00
Loïc Lecrenier	a2270b7432	Change fuzzcheck dependency to point to git repository	2022-10-26 13:47:04 +02:00
Loïc Lecrenier	1ecd3bb822	Fix bug in FieldDocIdFacetCodec	2022-10-26 13:47:04 +02:00
Loïc Lecrenier	51961e1064	Polish some details	2022-10-26 13:47:04 +02:00
Loïc Lecrenier	cb8442a119	Further unify facet databases of f64s and strings	2022-10-26 13:47:04 +02:00
Loïc Lecrenier	3baa34d842	Fix compiler errors/warnings	2022-10-26 13:47:04 +02:00
Loïc Lecrenier	86d9f50b9c	Fix bugs in incremental facet indexing with variable parameters e.g. add one facet value incrementally with a group_size = X and then add another one with group_size = Y It is not actually possible to do so with the public API of milli, but I wanted to make sure the algorithm worked well in those cases anyway. The bugs were found by fuzzing the code with fuzzcheck, which I've added to milli as a conditional dev-dependency. But it can be removed later.	2022-10-26 13:47:04 +02:00
Loïc Lecrenier	de52a9bf75	Improve documentation of some facet-related algorithms	2022-10-26 13:47:04 +02:00
Loïc Lecrenier	985a94adfc	cargo fmt	2022-10-26 13:47:04 +02:00
Loïc Lecrenier	b1ab09196c	Remove outdated TODOs	2022-10-26 13:47:04 +02:00
Loïc Lecrenier	3d7ed3263f	Fix bug in string facet distribution with few candidates	2022-10-26 13:47:04 +02:00
Loïc Lecrenier	fca4577e23	Return original string in facet distributions, work on facet tests	2022-10-26 13:47:04 +02:00
Loïc Lecrenier	27454e9828	Document and refine facet indexing algorithms	2022-10-26 13:47:04 +02:00
Loïc Lecrenier	bee3c23b45	Add comparison benchmark between bulk and incremental facet indexing	2022-10-26 13:47:04 +02:00
Loïc Lecrenier	b2f01ad204	Refactor facet database tests	2022-10-26 13:47:04 +02:00
Loïc Lecrenier	9026867d17	Give same interface to bulk and incremental facet indexing types + cargo fmt, oops, sorry for the bad history :(	2022-10-26 13:47:04 +02:00
Loïc Lecrenier	330c9eb1b2	Rename facet codecs and refine FacetsUpdate API	2022-10-26 13:47:04 +02:00
Loïc Lecrenier	485a72306d	Refactor facet-related codecs	2022-10-26 13:47:04 +02:00
Loïc Lecrenier	9b55e582cd	Add FacetsUpdate type that wraps incremental and bulk indexing methods	2022-10-26 13:47:04 +02:00
Loïc Lecrenier	3d145d7f48	Merge the two <facettype>_faceted_documents_ids methods into one	2022-10-26 13:47:04 +02:00
Loïc Lecrenier	982efab88f	Fix encoding bugs in facet databases	2022-10-26 13:47:04 +02:00
Loïc Lecrenier	079ed4a992	Add more snapshots	2022-10-26 13:47:04 +02:00
Loïc Lecrenier	afdf87f6f7	Fix bugs in asc/desc criterion and facet indexing	2022-10-26 13:47:04 +02:00
Loïc Lecrenier	a7201ece04	cargo fmt	2022-10-26 13:47:04 +02:00
Loïc Lecrenier	36296bbb20	Add facet incremental indexing snapshot tests + fix bug	2022-10-26 13:47:04 +02:00
Loïc Lecrenier	07ff92c663	Add more snapshots from facet tests	2022-10-26 13:47:04 +02:00
Loïc Lecrenier	61252248fb	Fix some facet indexing bugs	2022-10-26 13:47:04 +02:00
Loïc Lecrenier	68cbcdf08b	Fix compile errors/warnings in http-ui and infos	2022-10-26 13:47:04 +02:00
Loïc Lecrenier	85824ee203	Try to make facet indexing incremental	2022-10-26 13:47:04 +02:00
Loïc Lecrenier	d30c89e345	Fix compile error+warnings in new tests	2022-10-26 13:46:46 +02:00
Loïc Lecrenier	e8a156d682	Reorganise facets database indexing code	2022-10-26 13:46:46 +02:00
Loïc Lecrenier	fb8d23deb3	Reintroduce db_snap! for facet databases	2022-10-26 13:46:14 +02:00
Loïc Lecrenier	e570c23153	Reintroduce asc/desc functionality	2022-10-26 13:46:14 +02:00
Loïc Lecrenier	bd2c0e1ab6	Remove unused code	2022-10-26 13:46:14 +02:00
Loïc Lecrenier	39a4a0a362	Reintroduce filter range search and facet extractors	2022-10-26 13:46:14 +02:00
Loïc Lecrenier	22d80eeaf9	Reintroduce facet deletion functionality	2022-10-26 13:46:14 +02:00
Loïc Lecrenier	6cc91824c1	Remove unused heed codec files	2022-10-26 13:46:14 +02:00
Loïc Lecrenier	5a904cf29d	Reintroduce facet distribution functionality	2022-10-26 13:46:14 +02:00
Loïc Lecrenier	b8a1caad5e	Add range search and incremental indexing algorithm	2022-10-26 13:46:14 +02:00
Loïc Lecrenier	63ef0aba18	Start porting facet distribution and sort to new database structure	2022-10-26 13:46:14 +02:00
Loïc Lecrenier	7913d6365c	Update Facets indexing to be compatible with new database structure	2022-10-26 13:46:14 +02:00
Loïc Lecrenier	c3f49f766d	Prepare refactor of facets database	2022-10-26 13:46:14 +02:00
curquiza	e883bccc76	Update version for the next release (v0.35.0) in Cargo.toml files	2022-10-26 11:43:54 +00:00
bors[bot]	c8f16530d5	Merge #616 616: Introduce an indexation abortion function when indexing documents r=Kerollmops a=Kerollmops Co-authored-by: Kerollmops <clement@meilisearch.com> Co-authored-by: Clément Renault <clement@meilisearch.com>	2022-10-26 11:41:18 +00:00
Ewan Higgs	9d27ac8a2e	Ignore too many arguments to functions.	2022-10-25 21:22:53 +02:00
Ewan Higgs	42cdc38c7b	Allow weird ranges like 1..=0 to pass clippy. Everything else is just a warning and exit code will be 0.	2022-10-25 21:12:59 +02:00
Ewan Higgs	2ce025a906	Fixes after rebase to fix new issues.	2022-10-25 20:58:31 +02:00
Ewan Higgs	17f7922bfc	Remove unneeded lifetimes.	2022-10-25 20:49:04 +02:00
Ewan Higgs	6b2fe94192	Fixes for clippy bringing us down to 18 remaining issues. This brings us a step closer to enforcing clippy on each build.	2022-10-25 20:49:02 +02:00
Loïc Lecrenier	36bd66281d	Add method to create a new Index with specific creation dates	2022-10-25 14:37:56 +02:00
Loïc Lecrenier	9a569d73d1	Minor code style change	2022-10-24 15:30:43 +02:00
Loïc Lecrenier	be302fd250	Remove outdated workaround for duplicate words in phrase search	2022-10-24 15:27:06 +02:00
Loïc Lecrenier	d76d0cb1bf	Merge branch 'main' into word-pair-proximity-docids-refactor	2022-10-24 15:23:00 +02:00
curquiza	f3874d58b9	Update version for the next release (v0.34.0) in Cargo.toml files	2022-10-24 10:13:25 +00:00
Loïc Lecrenier	a983129613	Apply suggestions from code review	2022-10-20 09:49:37 +02:00
bors[bot]	f11a4087da	Merge #665 665: Fixing piles of clippy errors. r=ManyTheFish a=ehiggs ## Related issue No issue fixed. Simply cleaning up some code for clippy on the march towards a clean build when #659 is merged. ## What does this PR do? Most of these are calling clone when the struct supports Copy. Many are using & and &mut on `self` when the function they are called from already has an immutable or mutable borrow so this isn't needed. I tried to stay away from actual changes or places where I'd have to name fresh variables. ## PR checklist Please check if your PR fulfills the following requirements: - [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)? - [x] Have you read the contributing guidelines? - [x] Have you made sure that the title is accurate and descriptive of the changes? Co-authored-by: Ewan Higgs <ewan.higgs@gmail.com>	2022-10-20 07:19:46 +00:00
Loïc Lecrenier	176ffd23f5	Fix compile error after rebasing wppd-refactor	2022-10-18 10:40:26 +02:00
Loïc Lecrenier	ab2f6f3aa4	Refine some details in word_prefix_pair_proximity indexing code	2022-10-18 10:37:34 +02:00
Loïc Lecrenier	e6e76fbefe	Improve performance of resolve_phrase at the cost of some relevancy	2022-10-18 10:37:34 +02:00
Loïc Lecrenier	178d00f93a	Cargo fmt	2022-10-18 10:37:34 +02:00
Loïc Lecrenier	830a7c0c7a	Use `resolve_phrase` function for exactness criteria as well	2022-10-18 10:37:34 +02:00
Loïc Lecrenier	18d578dfc4	Adjust some algorithms using DBs of word pair proximities	2022-10-18 10:37:34 +02:00
Loïc Lecrenier	072b576514	Fix proximity value in keys of prefix_word_pair_proximity_docids	2022-10-18 10:37:34 +02:00
Loïc Lecrenier	6c3a5d69e1	Update snapshots	2022-10-18 10:37:34 +02:00
Loïc Lecrenier	a7de4f5b85	Don't add swapped word pairs to the word_pair_proximity_docids db	2022-10-18 10:37:34 +02:00
Loïc Lecrenier	264a04922d	Add prefix_word_pair_proximity database Similar to the word_prefix_pair_proximity one but instead the keys are: (proximity, prefix, word2)	2022-10-18 10:37:34 +02:00
Loïc Lecrenier	1dbbd8694f	Rename StrStrU8Codec to U8StrStrCodec and reorder its fields	2022-10-18 10:37:34 +02:00
Loïc Lecrenier	bdeb47305e	Change encoding of word_pair_proximity DB to (proximity, word1, word2) Same for word_prefix_pair_proximity	2022-10-18 10:37:34 +02:00
Many the fish	81919a35a2	Update milli/src/search/criteria/initial.rs Co-authored-by: Clément Renault <clement@meilisearch.com>	2022-10-17 18:23:20 +02:00
Many the fish	516e838eb4	Update milli/src/search/criteria/initial.rs Co-authored-by: Clément Renault <clement@meilisearch.com>	2022-10-17 18:23:15 +02:00
Clément Renault	fc03e53615	Add a test to check that we can abort an indexation	2022-10-17 17:28:03 +02:00
Kerollmops	6603437cb1	Introduce an indexation abortion function when indexing documents	2022-10-17 17:28:03 +02:00
ManyTheFish	6f55e7844c	Add some code comments	2022-10-17 14:41:57 +02:00
ManyTheFish	cf203b7fde	Take the filter into account when computing the page candidates	2022-10-17 14:13:44 +02:00
ManyTheFish	d71bc1e69f	Compute an exact count when using distinct	2022-10-17 14:13:44 +02:00
ManyTheFish	a396806343	Add settings to force milli to exhaustively compute the total number of hits	2022-10-17 14:13:44 +02:00
Loïc Lecrenier	4c481a8947	Upgrade all dependencies	2022-10-17 13:05:56 +02:00
Ewan Higgs	beb987d3d1	Fixing piles of clippy errors. Most of these are calling clone when the struct supports Copy. Many are using & and &mut on `self` when the function they are called from already has an immutable or mutable borrow so this isn't needed. I tried to stay away from actual changes or places where I'd have to name fresh variables.	2022-10-13 22:02:54 +02:00
bors[bot]	f30979d021	Merge #662 662: Enhance word splitting strategy r=ManyTheFish a=akki1306 # Pull Request ## Related issue Fixes #648 ## What does this PR do? - Change [split_best_frequency](`55d889522b/milli/src/search/query_tree.rs (L282-L301)`) to use the frequency of word pairs that appear next to each other (proximity value of 1) instead of the frequency of the individual words. The word pair with the highest frequency is chosen. ## PR checklist Please check if your PR fulfills the following requirements: - [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)? - [x] Have you read the contributing guidelines? - [x] Have you made sure that the title is accurate and descriptive of the changes? Thank you so much for contributing to Meilisearch! Co-authored-by: Akshay Kulkarni <akshayk.gj@gmail.com>	2022-10-13 08:14:22 +00:00
Akshay Kulkarni	85f3028317	remove underscore and introduce back word_documents_count	2022-10-13 13:21:59 +05:30
Akshay Kulkarni	8195fc6141	revert removal of word_documents_count method	2022-10-13 13:14:27 +05:30
Akshay Kulkarni	32f825d442	move default implementation of word_pair_frequency to TestContext	2022-10-13 12:57:50 +05:30
Akshay Kulkarni	ff8b2d4422	formatting	2022-10-13 12:44:08 +05:30
Akshay Kulkarni	6cb8b46900	use word_pair_frequency and remove word_documents_count	2022-10-13 12:43:11 +05:30
Akshay Kulkarni	8c9245149e	format file	2022-10-12 15:27:56 +05:30
Akshay Kulkarni	63e79a9039	update comment	2022-10-12 13:36:48 +05:30
Akshay Kulkarni	7f9680f0a0	Enhance word splitting strategy	2022-10-12 13:18:23 +05:30
Loïc Lecrenier	6fbf5dac68	Simplify documents! macro to reduce compile times	2022-10-12 09:22:05 +02:00
msvaljek	762e320c35	Add proximity calculation for the same word	2022-10-07 12:59:12 +02:00
vishalsodani	00c02d00f3	Add missing logging timer to extractors	2022-09-30 22:17:06 +05:30
bors[bot]	d94339a858	Merge #636 636: Remove unused `infos`, `http-ui`, and `milli/fuzz`, crates r=ManyTheFish a=loiclec We haven't used the `infos/`, `http-ui/` and `milli/fuzz/` crates in a long time. They are not properly maintained and probably do not work correctly anymore. This PR removes these crates entirely from the workspace to reduce the amount of code we need to maintain. Co-authored-by: Loïc Lecrenier <loic@meilisearch.com>	2022-09-14 12:39:57 +00:00
bors[bot]	15d478cf4d	Merge #635 635: Use an unstable algorithm for `grenad::Sorter` when possible r=Kerollmops a=loiclec # Pull Request ## What does this PR do? Use an unstable algorithm to sort the internal vector used by `grenad::Sorter` whenever possible to speed up indexing. In practice, every time the merge function creates a `RoaringBitmap`, we use an unstable sort. For every other merge function, such as `keep_first`, `keep_last`, etc., a stable sort is used. Co-authored-by: Loïc Lecrenier <loic@meilisearch.com>	2022-09-14 12:00:52 +00:00
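The reasoning in #635 can be illustrated with the standard library alone: when the merge function is order-independent (a set union, standing in here for a `RoaringBitmap` union), entries with equal keys may be reordered freely, so an unstable sort is safe. An order-dependent merge such as keep-first needs a stable sort. This is only a sketch under those assumptions; `sort_entries`, `merge_union` and `merge_keep_first` are hypothetical names and do not reflect `grenad`'s API.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Sorts `(key, value)` entries by key only.
/// `order_matters` must be true when the merge function depends on the
/// insertion order of equal keys (keep-first, keep-last, ...).
pub fn sort_entries(entries: &mut [(String, u32)], order_matters: bool) {
    if order_matters {
        // Stable: equal keys keep their insertion order.
        entries.sort_by(|a, b| a.0.cmp(&b.0));
    } else {
        // Unstable: usually faster, equal keys may be reordered.
        entries.sort_unstable_by(|a, b| a.0.cmp(&b.0));
    }
}

/// Order-independent merge: the union of all values seen for each key.
pub fn merge_union(entries: &[(String, u32)]) -> BTreeMap<String, BTreeSet<u32>> {
    let mut merged: BTreeMap<String, BTreeSet<u32>> = BTreeMap::new();
    for (key, docid) in entries {
        merged.entry(key.clone()).or_default().insert(*docid);
    }
    merged
}

/// Order-dependent merge: only the first value seen for each key is kept.
pub fn merge_keep_first(entries: &[(String, u32)]) -> BTreeMap<String, u32> {
    let mut merged = BTreeMap::new();
    for (key, value) in entries {
        merged.entry(key.clone()).or_insert(*value);
    }
    merged
}
```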
Loïc Lecrenier	add96f921b	Remove unused infos/ http-ui/ and fuzz/ crates	2022-09-14 06:55:01 +02:00
curquiza	753e76d451	Update version for the next release (v0.33.4) in Cargo.toml files	2022-09-13 13:55:50 +00:00
Loïc Lecrenier	3794962330	Use an unstable algorithm for grenad::Sorter when possible	2022-09-13 14:49:53 +02:00
Kerollmops	d4d7c9d577	We avoid skipping errors in the indexing pipeline	2022-09-13 14:03:00 +02:00
Vincent Herlemont	8cd5200f48	Make charabia languages configurable	2022-09-08 12:21:43 +02:00
Vincent Herlemont	5e07ea79c2	Make charabia default feature optional	2022-09-07 20:54:31 +02:00
curquiza	077dcd2002	Update version for the next release (v0.33.3) in Cargo.toml files	2022-09-07 15:48:53 +00:00
Kerollmops	fe3973a51c	Make sure that long words are correctly skipped	2022-09-07 15:03:32 +02:00
Kerollmops	c83c3cd796	Add a test to make sure that long words are correctly skipped	2022-09-07 14:12:36 +02:00
ManyTheFish	bf750e45a1	Fix word removal issue	2022-09-01 12:10:47 +02:00
ManyTheFish	a38608fe59	Add test mixing phrased and no-phrased words	2022-09-01 12:02:10 +02:00
ManyTheFish	97a04887a3	Update version for next release (v0.33.2) in Cargo.toml	2022-09-01 11:47:23 +02:00
bors[bot]	17d020e996	Merge #618 618: Update version for next release (v0.33.1) in Cargo.toml r=Kerollmops a=curquiza No breaking for this release Co-authored-by: Clémentine Urquizar <clementine@meilisearch.com>	2022-08-31 10:43:45 +00:00
Clémentine Urquizar	c3363706c5	Update version for next release (v0.33.1) in Cargo.toml	2022-08-31 11:37:27 +02:00
Clément Renault	7f92116b51	Accept again integers as document ids	2022-08-31 10:56:39 +02:00
Irevoire	f6024b3269	Remove the artifacts of the past	2022-08-23 16:10:38 +02:00
bors[bot]	a79ff8a1a9	Merge #611 611: Upgrade charabia v0.6.0 r=curquiza a=ManyTheFish # Pull Request ## What does this PR do? - Update `log` - Upgrade `charabia` related to https://github.com/meilisearch/meilisearch/issues/2686 Co-authored-by: ManyTheFish <many@meilisearch.com>	2022-08-23 10:17:29 +00:00
Clémentine Urquizar	9ed7324995	Update version for next release (v0.33.0)	2022-08-23 11:47:48 +02:00
bors[bot]	18886dc6b7	Merge #598 598: Matching query terms policy r=Kerollmops a=ManyTheFish ## Summary Implement several optional-word strategies. ## Content Replace the `optional_words` boolean with an enum containing several term matching strategies: ```rust pub enum TermsMatchingStrategy { // remove last word first Last, // remove first word first First, // remove more frequent word first Frequency, // remove smallest word first Size, // only one of the words is mandatory Any, // all words are mandatory All, } ``` All strategies implemented during the prototype are kept, but only `Last` and `All` will be published by Meilisearch in the `v0.29.0` release. ## Related spec: https://github.com/meilisearch/specifications/pull/173 prototype discussion: https://github.com/meilisearch/meilisearch/discussions/2639#discussioncomment-3447699 Co-authored-by: ManyTheFish <many@meilisearch.com>	2022-08-22 15:51:37 +00:00
ManyTheFish	5391e3842c	replace optional_words by term_matching_strategy	2022-08-22 17:47:19 +02:00
ManyTheFish	ba5ca8a362	Upgrade charabia v0.6.0	2022-08-22 14:38:00 +02:00
Irevoire	e7624abe63	share heed between all sub-crates	2022-08-19 11:23:41 +02:00
ManyTheFish	993aa1321c	Fix query tree building	2022-08-18 17:56:06 +02:00
ManyTheFish	bff9653050	Fix remove count	2022-08-18 17:36:30 +02:00
ManyTheFish	9640976c79	Rename TermMatchingPolicies	2022-08-18 17:36:08 +02:00
bors[bot]	afc10acd19	Merge #596 596: Filter operators: NOT + IN[..] r=irevoire a=loiclec # Pull Request ## What does this PR do? Implements the changes described in https://github.com/meilisearch/meilisearch/issues/2580 It is based on top of #556 Co-authored-by: Loïc Lecrenier <loic@meilisearch.com>	2022-08-18 11:24:32 +00:00
Loïc Lecrenier	9b6602cba2	Avoid cloning FilterCondition in filter array parsing	2022-08-18 13:06:57 +02:00
Loïc Lecrenier	c51dcad51b	Don't recompute filterable fields in evaluation of IN[] filter	2022-08-18 10:59:21 +02:00
Irevoire	4aae07d5f5	expose the size methods	2022-08-17 17:07:38 +02:00
Irevoire	e96b852107	bump heed	2022-08-17 17:05:50 +02:00
bors[bot]	087da5621a	Merge #587 587: Word prefix pair proximity docids indexation refactor r=Kerollmops a=loiclec # Pull Request ## What does this PR do? Refactor the code of `WordPrefixPairProximityDocIds` to make it much faster, fix a bug, and add a unit test. ## Why is it faster? Because we avoid using a sorter to insert the (`word1`, `prefix`, `proximity`) keys and their associated bitmaps, and thus we don't have to sort a potentially very big set of data. I have also added a couple of other optimisations: 1. reusing allocations 2. using a prefix trie instead of an array of prefixes to get all the prefixes of a word 3. inserting directly into the database instead of putting the data in an intermediary grenad when possible. Also avoid checking for pre-existing values in the database when we know for certain that they do not exist. ## What bug was fixed? When reindexing, the `new_prefix_fst_words` prefixes may look like: ``` ["ant", "axo", "bor"] ``` which we group by first letter: ``` [["ant", "axo"], ["bor"]] ``` Later in the code, if we have the word2 "axolotl", we try to find which subarray of prefixes contains its prefixes. This check is done with `word2.starts_with(subarray_prefixes[0])`, but `"axolotl".starts_with("ant")` is false, and thus we wrongly think that there are no prefixes in `new_prefix_fst_words` that are prefixes of `axolotl`. ## StrStrU8Codec I had to change the encoding of `StrStrU8Codec` to make the second string null-terminated as well. I don't think this should be a problem, but I may have missed some nuances about the impacts of this change. ## Requests when reviewing this PR I have explained what the code does in the module documentation of `word_pair_proximity_prefix_docids`. It would be nice if someone could read it and give their opinion on whether it is a clear explanation or not. 
I also have a couple questions regarding the code itself: - Should we clean up and factor out the `PrefixTrieNode` code to try and make broader use of it outside this module? For now, the prefixes undergo a few transformations: from FST, to array, to prefix trie. It seems like it could be simplified. - I wrote a function called `write_into_lmdb_database_without_merging`. (1) Are we okay with such a function existing? (2) Should it be in `grenad_helpers` instead? ## Benchmark Results We reduce the time it takes to index about 8% in most cases, but it varies between -3% and -20%. ``` group indexing_main_ce90fc62 indexing_word-prefix-pair-proximity-docids-refactor_cbad2023 ----- ---------------------- ------------------------------------------------------------ indexing/-geo-delete-facetedNumber-facetedGeo-searchable- 1.00 1893.0±233.03µs ? ?/sec 1.01 1921.2±260.79µs ? ?/sec indexing/-movies-delete-facetedString-facetedNumber-searchable- 1.05 9.4±3.51ms ? ?/sec 1.00 9.0±2.14ms ? ?/sec indexing/-movies-delete-facetedString-facetedNumber-searchable-nested- 1.22 18.3±11.42ms ? ?/sec 1.00 15.0±5.79ms ? ?/sec indexing/-songs-delete-facetedString-facetedNumber-searchable- 1.00 41.4±4.20ms ? ?/sec 1.28 53.0±13.97ms ? ?/sec indexing/-wiki-delete-searchable- 1.00 285.6±18.12ms ? ?/sec 1.03 293.1±16.09ms ? ?/sec indexing/Indexing geo_point 1.03 60.8±0.45s ? ?/sec 1.00 58.8±0.68s ? ?/sec indexing/Indexing movies in three batches 1.14 16.5±0.30s ? ?/sec 1.00 14.5±0.24s ? ?/sec indexing/Indexing movies with default settings 1.11 13.7±0.07s ? ?/sec 1.00 12.3±0.28s ? ?/sec indexing/Indexing nested movies with default settings 1.10 10.6±0.11s ? ?/sec 1.00 9.6±0.15s ? ?/sec indexing/Indexing nested movies without any facets 1.11 9.4±0.15s ? ?/sec 1.00 8.5±0.10s ? ?/sec indexing/Indexing songs in three batches with default settings 1.18 66.2±0.39s ? ?/sec 1.00 56.0±0.67s ? ?/sec indexing/Indexing songs with default settings 1.07 58.7±1.26s ? ?/sec 1.00 54.7±1.71s ? 
?/sec indexing/Indexing songs without any facets 1.08 53.1±0.88s ? ?/sec 1.00 49.3±1.43s ? ?/sec indexing/Indexing songs without faceted numbers 1.08 57.7±1.33s ? ?/sec 1.00 53.3±0.98s ? ?/sec indexing/Indexing wiki 1.06 1051.1±21.46s ? ?/sec 1.00 989.6±24.55s ? ?/sec indexing/Indexing wiki in three batches 1.20 1184.8±8.93s ? ?/sec 1.00 989.7±7.06s ? ?/sec indexing/Reindexing geo_point 1.04 67.5±0.75s ? ?/sec 1.00 64.9±0.32s ? ?/sec indexing/Reindexing movies with default settings 1.12 13.9±0.17s ? ?/sec 1.00 12.4±0.13s ? ?/sec indexing/Reindexing songs with default settings 1.05 60.6±0.84s ? ?/sec 1.00 57.5±0.99s ? ?/sec indexing/Reindexing wiki 1.07 1725.0±17.92s ? ?/sec 1.00 1611.4±9.90s ? ?/sec ``` Co-authored-by: Loïc Lecrenier <loic@meilisearch.com>	2022-08-17 14:06:12 +00:00
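The bug described in #587 can be reproduced in a few lines. The sketch below groups prefixes by first letter, shows why checking only `subarray_prefixes[0]` misses `"axo"` for the word `"axolotl"`, and shows one possible correct lookup. All function names are hypothetical, and the real fix in the PR uses a prefix trie instead.

```rust
/// Groups sorted prefixes by their first byte, e.g.
/// ["ant", "axo", "bor"] -> [["ant", "axo"], ["bor"]].
pub fn group_by_first_letter<'a>(prefixes: &[&'a str]) -> Vec<Vec<&'a str>> {
    let mut groups: Vec<Vec<&'a str>> = Vec::new();
    for &p in prefixes {
        let same_group = groups
            .last()
            .map_or(false, |g| g[0].as_bytes().first() == p.as_bytes().first());
        if same_group {
            groups.last_mut().unwrap().push(p);
        } else {
            groups.push(vec![p]);
        }
    }
    groups
}

/// Buggy lookup: assumes the word must start with the FIRST prefix of its group.
pub fn find_group_buggy<'g, 'a>(
    groups: &'g [Vec<&'a str>],
    word: &str,
) -> Option<&'g Vec<&'a str>> {
    groups.iter().find(|g| word.starts_with(g[0]))
}

/// Corrected lookup (one possible fix): select the group by first byte only.
pub fn find_group_fixed<'g, 'a>(
    groups: &'g [Vec<&'a str>],
    word: &str,
) -> Option<&'g Vec<&'a str>> {
    groups
        .iter()
        .find(|g| g[0].as_bytes().first() == word.as_bytes().first())
}

/// Returns every prefix in `group` that is actually a prefix of `word`.
pub fn prefixes_of<'a>(group: &[&'a str], word: &str) -> Vec<&'a str> {
    group.iter().copied().filter(|p| word.starts_with(*p)).collect()
}
```

With the prefixes `["ant", "axo", "bor"]`, the buggy lookup finds no group for `"axolotl"` because `"axolotl".starts_with("ant")` is false, while the corrected lookup selects `["ant", "axo"]` and `prefixes_of` then yields `["axo"]`.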
bors[bot]	fb95e67a2a	Merge #608 608: Fix soft deleted documents r=ManyTheFish a=ManyTheFish When we replaced or updated some documents, the indexing was skipping the replaced documents. Related to https://github.com/meilisearch/meilisearch/issues/2672 Co-authored-by: ManyTheFish <many@meilisearch.com>	2022-08-17 13:38:10 +00:00
bors[bot]	e4a52e6e45	Merge #594 594: Fix(Search): Fix phrase search candidates computation r=Kerollmops a=ManyTheFish This is an old bug that was hidden by the proximity criterion: phrase searches always returned an empty candidates list when the proximity criterion was deactivated. Before the fix, we were trying to find any words[n] near words[n] instead of any words[n] near words[n+1]. For example, for the phrase search '"Hello world"' we first searched for "hello" near "hello" instead of "hello" near "world". Co-authored-by: ManyTheFish <many@meilisearch.com>	2022-08-17 13:22:52 +00:00
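The off-by-one described in #594 comes down to which word pairs are looked up. A minimal sketch, with hypothetical function names that are not taken from milli:

```rust
/// Buggy pairing: every word is paired with itself (words[n] near words[n]).
pub fn pairs_buggy<'a>(words: &[&'a str]) -> Vec<(&'a str, &'a str)> {
    words.iter().map(|&w| (w, w)).collect()
}

/// Fixed pairing: every word is paired with its successor
/// (words[n] near words[n + 1]).
pub fn pairs_fixed<'a>(words: &[&'a str]) -> Vec<(&'a str, &'a str)> {
    words.windows(2).map(|w| (w[0], w[1])).collect()
}
```

For the phrase `"hello world"`, the buggy version looks up `("hello", "hello")` first, while the fixed version looks up `("hello", "world")`.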
ManyTheFish	8c3f1a9c39	Remove useless lifetime declaration	2022-08-17 15:20:43 +02:00
ManyTheFish	e9e2349ce6	Fix typo in comment	2022-08-17 15:09:48 +02:00
ManyTheFish	2668f841d1	Fix update indexing	2022-08-17 15:03:37 +02:00
ManyTheFish	7384650d85	Update test to showcase the bug	2022-08-17 15:03:08 +02:00
bors[bot]	39869be23b	Merge #590

590: Optimise facets indexing r=Kerollmops a=loiclec

# Pull Request

## What does this PR do?

Fixes #589

## Notes

I added documentation for the whole module which attempts to explain the shape of the databases and their purpose. However, I realise there is already some documentation about this, so I am not sure if we want to keep it.

## Benchmarks

We get a ~1.15x speed up on the geo_point benchmark.

```
group                                                                   indexing_main_57042355            indexing_optimise-facets-indexation_5728619a
-----                                                                   ----------------------            --------------------------------------------
indexing/-geo-delete-facetedNumber-facetedGeo-searchable-               1.00  1862.7±294.45µs  ? ?/sec    1.58  2.9±1.32ms       ? ?/sec
indexing/-movies-delete-facetedString-facetedNumber-searchable-         1.11  8.9±2.44ms       ? ?/sec    1.00  8.0±1.42ms       ? ?/sec
indexing/-movies-delete-facetedString-facetedNumber-searchable-nested-  1.00  12.8±3.32ms      ? ?/sec    1.32  16.9±6.98ms      ? ?/sec
indexing/-songs-delete-facetedString-facetedNumber-searchable-          1.09  43.8±4.78ms      ? ?/sec    1.00  40.3±3.79ms      ? ?/sec
indexing/-wiki-delete-searchable-                                       1.08  287.4±28.72ms    ? ?/sec    1.00  264.9±9.46ms     ? ?/sec
indexing/Indexing geo_point                                             1.14  61.2±0.39s       ? ?/sec    1.00  53.8±0.57s       ? ?/sec
indexing/Indexing movies in three batches                               1.00  16.6±0.12s       ? ?/sec    1.00  16.5±0.10s       ? ?/sec
indexing/Indexing movies with default settings                          1.00  14.1±0.30s       ? ?/sec    1.00  14.0±0.28s       ? ?/sec
indexing/Indexing nested movies with default settings                   1.10  10.9±0.50s       ? ?/sec    1.00  10.0±0.10s       ? ?/sec
indexing/Indexing nested movies without any facets                      1.01  9.6±0.23s        ? ?/sec    1.00  9.5±0.06s        ? ?/sec
indexing/Indexing songs in three batches with default settings          1.07  66.3±0.55s       ? ?/sec    1.00  61.8±0.63s       ? ?/sec
indexing/Indexing songs with default settings                           1.03  58.8±0.82s       ? ?/sec    1.00  57.1±1.22s       ? ?/sec
indexing/Indexing songs without any facets                              1.00  53.6±1.09s       ? ?/sec    1.01  54.0±0.58s       ? ?/sec
indexing/Indexing songs without faceted numbers                         1.02  58.0±1.29s       ? ?/sec    1.00  57.1±1.43s       ? ?/sec
indexing/Indexing wiki                                                  1.00  1064.1±21.20s    ? ?/sec    1.00  1068.0±20.49s    ? ?/sec
indexing/Indexing wiki in three batches                                 1.00  1182.5±9.62s     ? ?/sec    1.01  1191.2±10.96s    ? ?/sec
indexing/Reindexing geo_point                                           1.12  68.0±0.21s       ? ?/sec    1.00  60.5±0.82s       ? ?/sec
indexing/Reindexing movies with default settings                        1.01  14.1±0.21s       ? ?/sec    1.00  14.0±0.26s       ? ?/sec
indexing/Reindexing songs with default settings                         1.04  61.6±0.57s       ? ?/sec    1.00  59.2±0.87s       ? ?/sec
indexing/Reindexing wiki                                                1.00  1734.0±11.38s    ? ?/sec    1.01  1746.6±22.48s    ? ?/sec
```

Co-authored-by: Loïc Lecrenier <loic@meilisearch.com>	2022-08-17 11:46:55 +00:00
Loïc Lecrenier	6cc975704d	Add some documentation to facets.rs	2022-08-17 12:59:52 +02:00

2531 Commits