meilisearch

mirror of https://github.com/meilisearch/meilisearch.git synced 2024-11-23 02:27:40 +08:00

Author	SHA1	Message	Date
ManyTheFish	9c485f8563	Make the search and the indexing work	2023-07-24 18:35:20 +02:00
Kerollmops	691a536893	Implement the facet search with the normalized index	2023-07-24 17:56:17 +02:00
ManyTheFish	d8d12d5979	Be able to set and reset settings	2023-07-24 17:00:18 +02:00
Clément Renault	df528b41d8	Normalize for the search the facets values	2023-07-20 17:57:07 +02:00
ManyTheFish	0497f93494	Update Charabia to the last version	2023-07-19 15:19:32 +02:00
Kerollmops	eef95de30e	First iteration on exposing puffin profiling	2023-07-18 17:38:13 +02:00
Kerollmops	d383afc82b	Fix the geo sort when lat and lng are strings	2023-07-17 18:28:04 +02:00
meili-bors[bot]	7745cc9d3c	Merge #3921 3921: Deactivate camel case segmentation r=dureuill a=ManyTheFish # Pull Request This PR deactivates the camel case segmentation to retrieve the possibility to accept typos over camel-cased words ## Related issue Fixes #3869 Fixes #3818 ## What does this PR do? - deactivates camelcase segmentation related to #3919 Co-authored-by: ManyTheFish <many@meilisearch.com>	2023-07-13 11:00:14 +00:00
ManyTheFish	c106906f8f	deactivate camelCase segmentation	2023-07-13 12:06:27 +02:00
Louis Dureuil	4310928803	Fixes #3912	2023-07-12 10:08:56 +02:00
Louis Dureuil	74315b4ea8	Fixes #3911	2023-07-12 10:08:29 +02:00
Louis Dureuil	40fa59d64c	Sort by lexicographic order after normalization	2023-07-10 09:26:59 +02:00
Louis Dureuil	55cd7738b9	Update snapshots	2023-07-04 16:31:01 +02:00
Louis Dureuil	48409c9183	Add missing exactness.matchingWords, exactness.maxMatchingWords	2023-07-04 16:31:01 +02:00
Kerollmops	a442af6a7c	Update the features of the either dependency to compile milli successfully	2023-07-03 18:51:43 +02:00
Louis Dureuil	324d448236	Format let-else ❤️ 🎉	2023-07-03 10:20:28 +02:00
meili-bors[bot]	661d1f90dc	Merge #3866 3866: Update charabia v0.8.0 r=dureuill a=ManyTheFish # Pull Request Update Charabia: - enhance Japanese segmentation - enhance Latin Tokenization - words containing `_` are now properly segmented into several words - brackets `{([])}` are no more considered as context separators so word separated by brackets are now considered near together for the proximity ranking rule - fixes #3815 - fixes #3778 - fixes [product#151](https://github.com/meilisearch/product/discussions/151) > Important note: now the float numbers are segmented around the `.` so `3.22` is segmented as [`3`, `.`, `22`] but the middle dot isn't considered as a hard separator, which means that if we search `3.22` we find documents containing `3.22` Co-authored-by: ManyTheFish <many@meilisearch.com>	2023-06-29 15:24:36 +00:00
ManyTheFish	6ec7541026	Update inta snapshots	2023-06-29 17:18:39 +02:00
ManyTheFish	a82c49ab08	Update test	2023-06-29 15:56:36 +02:00
ManyTheFish	84845de9ef	Update Charabia	2023-06-29 15:56:32 +02:00
Clément Renault	7c157fc442	Document that the LevelEntry fields order is important	2023-06-29 14:33:32 +02:00
Clément Renault	0b97596c93	Replace unwraps with ?	2023-06-29 14:33:32 +02:00
Clément Renault	a0e0fce677	Simplify a Rust lifetime trick	2023-06-29 14:33:32 +02:00
Clément Renault	b951830461	Add more tests	2023-06-29 14:33:32 +02:00
Kerollmops	b132e859f7	Make clippy happy	2023-06-29 14:33:31 +02:00
Kerollmops	9917bf046a	Move the sortFacetValuesBy in the faceting settings	2023-06-29 14:33:31 +02:00
Kerollmops	d9fea0143f	Make Clippy happy	2023-06-29 14:33:31 +02:00
Kerollmops	a385642ec3	Replace the BTreeMap by an IndexMap to return values in order	2023-06-29 14:33:31 +02:00
Kerollmops	34b2e98fe9	Expose a sortFacetValuesBy parameter to the user	2023-06-29 14:33:00 +02:00
Kerollmops	80bbd4b6f3	Clean and make the facet order configurable internally	2023-06-29 14:31:17 +02:00
Kerollmops	f42bef2f66	Make the search to always return the facets ordered by count	2023-06-29 14:31:17 +02:00
Kerollmops	bd3c026406	First to-test version of the algorithm	2023-06-29 14:31:17 +02:00
Kerollmops	84f8938f33	Rename facet distribution to be explicit on the order to find them	2023-06-29 14:31:15 +02:00
Clément Renault	efbe7ce78b	Clean the facet string FSTs when we clear the documents	2023-06-28 15:36:32 +02:00
Kerollmops	26f0fa678d	Change the error message when a facet is not searchable	2023-06-28 15:06:09 +02:00
Kerollmops	60ddd53439	Return one of the original facet values when doing a facet search	2023-06-28 15:06:09 +02:00
Kerollmops	2bcd8d2983	Make sure the facet queries are normalized	2023-06-28 15:06:09 +02:00
Kerollmops	41760a9306	Introduce a new invalid_facet_search_facet_name error code	2023-06-28 15:06:07 +02:00
Kerollmops	e9a3029c30	Use the right field id to write the string facet values FST	2023-06-28 15:01:51 +02:00
Kerollmops	ed0ff47551	Return an empty list of results if attribute is set as filterable	2023-06-28 15:01:51 +02:00
Clément Renault	e1b8fb48ee	Use the minWordSizeForTypos index settings	2023-06-28 15:01:51 +02:00
Clément Renault	87e22e436a	Fix compilation issues	2023-06-28 15:01:51 +02:00
Clément Renault	0252cfe8b6	Simplify the placeholder search of the facet-search route	2023-06-28 15:01:50 +02:00
Clément Renault	f35ad96afa	Use the disableOnAttributes parameter on the facet-search route	2023-06-28 15:01:50 +02:00
Clément Renault	2ceb781c73	Use the disableOnWords parameter on the facet-search route	2023-06-28 15:01:50 +02:00
Clément Renault	7bd67543dd	Support the typoTolerant.enabled parameter	2023-06-28 15:01:50 +02:00
Clément Renault	8e86eb91bb	Log an error when a facet value is missing from the database	2023-06-28 15:01:50 +02:00
Clément Renault	55c17aa38b	Rename the SearchForFacetValues struct	2023-06-28 15:01:50 +02:00
Clément Renault	aadbe88048	Return an internal error when a field id is missing	2023-06-28 15:01:50 +02:00
Clément Renault	f36de2115f	Make clippy happy	2023-06-28 15:01:50 +02:00
Clément Renault	702041b7e1	Improve the returned errors from the facet-search route	2023-06-28 15:01:48 +02:00
Clément Renault	a05074e675	Fix the max number of facets to be returned to 100	2023-06-28 14:58:42 +02:00
Clément Renault	93f30e65a9	Return the correct response JSON object from the facet-search route	2023-06-28 14:58:42 +02:00
Clément Renault	e81809aae7	Make the search for facet work	2023-06-28 14:58:41 +02:00
Kerollmops	ce7e7f12c8	Introduce the facet search route	2023-06-28 14:58:41 +02:00
Kerollmops	addb21f110	Restrict the number of facet search results to 1000	2023-06-28 14:58:41 +02:00
Kerollmops	c34de05106	Introduce the SearchForFacetValue struct	2023-06-28 14:58:41 +02:00
Clément Renault	15a4c05379	Store the facet string values in multiple FSTs	2023-06-28 14:58:41 +02:00
meili-bors[bot]	d4f10800f2	Merge #3834 3834: Define searchable fields at runtime r=Kerollmops a=ManyTheFish ## Summary This feature allows the end-user to search in one or multiple attributes using the search parameter `attributesToSearchOn`: ```json { "q": "Captain Marvel", "attributesToSearchOn": ["title"] } ``` This feature act like a filter, forcing Meilisearch to only return the documents containing the requested words in the attributes-to-search-on. Note that, with the matching strategy `last`, Meilisearch will only ensure that the first word is in the attributes-to-search-on, but, the retrieved documents will be ordered taking into account the word contained in the attributes-to-search-on. ## Trying the prototype A dedicated docker image has been released for this feature: #### last prototype version: ```bash docker pull getmeili/meilisearch:prototype-define-searchable-fields-at-search-time-1 ``` #### others prototype versions: ```bash docker pull getmeili/meilisearch:prototype-define-searchable-fields-at-search-time-0 ``` ## Technical Detail The attributes-to-search-on list is given to the search context, then, the search context uses the `fid_word_docids`database using only the allowed field ids instead of the global `word_docids` database. This is the same for the prefix databases. The database cache is updated with the merged values, meaning that the union of the field-id-database values is only made if the requested key is missing from the cache. ### Relevancy limits Almost all ranking rules behave as expected when ordering the documents. Only `proximity` could miss-order documents if all the searched words are in the restricted attribute but a better proximity is found in an ignored attribute in a document that should be ranked lower. I put below a failing test showing it: ```rust #[actix_rt::test] async fn proximity_ranking_rule_order() { let server = Server::new().await; let index = index_with_documents( &server, &json!([ { "title": "Captain super mega cool. A Marvel story", // Perfect distance between words in an ignored attribute "desc": "Captain Marvel", "id": "1", }, { "title": "Captain America from Marvel", "desc": "a Shazam ersatz", "id": "2", }]), ) .await; // Document 2 should appear before document 1. index .search(json!({"q": "Captain Marvel", "attributesToSearchOn": ["title"], "attributesToRetrieve": ["id"]}), \|response, code\| { assert_eq!(code, 200, "{}", response); assert_eq!( response["hits"], json!([ {"id": "2"}, {"id": "1"}, ]) ); }) .await; } ``` Fixing this would force us to create a `fid_word_pair_proximity_docids` and a `fid_word_prefix_pair_proximity_docids` databases which may multiply the keys of `word_pair_proximity_docids` and `word_prefix_pair_proximity_docids` by the number of attributes in the searchable_attributes list. If we think we should fix this test, I'll suggest doing it in another PR. ## Related Fixes #3772 Co-authored-by: Tamo <tamo@meilisearch.com> Co-authored-by: ManyTheFish <many@meilisearch.com>	2023-06-28 08:19:23 +00:00
Clément Renault	30741d17fa	Change the TODO message	2023-06-27 12:32:43 +02:00
Clément Renault	ebad1f396f	Remove the useless euclidean distance implementation	2023-06-27 12:32:43 +02:00
Clément Renault	29d8268c94	Fix the vector query part by using the correct universe	2023-06-27 12:32:43 +02:00
Clément Renault	63bfe1cee2	Ignore when there are too many vectors	2023-06-27 12:32:43 +02:00
Kerollmops	7c2f5f77b8	Make clippy and fmt happy	2023-06-27 12:32:42 +02:00
Kerollmops	66b8cfd8c8	Introduce a way to store the HNSW on multiple LMDB entries	2023-06-27 12:32:42 +02:00
Kerollmops	ff3664431f	Make rustfmt happy	2023-06-27 12:32:42 +02:00
Kerollmops	531748c536	Return a user error when the _vectors type is invalid	2023-06-27 12:32:41 +02:00
Kerollmops	7aa1275337	Display the _semanticSimilarity even if the `_vectors` field is not displayed	2023-06-27 12:32:41 +02:00
Kerollmops	737aec1705	Expose an _semanticSimilarity as a dot product in the documents	2023-06-27 12:32:41 +02:00
Kerollmops	3e3c743392	Make Rustfmt happy	2023-06-27 12:32:41 +02:00
Kerollmops	5c5a4e075d	Make clippy happy	2023-06-27 12:32:41 +02:00
Kerollmops	ab9f2269aa	Normalize the vectors during indexation and search	2023-06-27 12:32:41 +02:00
Kerollmops	321ec5f3fa	Accept multiple vectors by documents using the _vectors field	2023-06-27 12:32:40 +02:00
Kerollmops	717d4fddd4	Remove the unused distance	2023-06-27 12:32:40 +02:00
Kerollmops	a7e0f0de89	Introduce a new error message for invalid vector dimensions	2023-06-27 12:32:40 +02:00
Kerollmops	3b560ef7d0	Make clippy happy	2023-06-27 12:32:40 +02:00
Kerollmops	2cf747cb89	Fix the tests	2023-06-27 12:32:40 +02:00
Kerollmops	3c31e1cdd1	Support more pages but in an ugly way	2023-06-27 12:32:39 +02:00
Kerollmops	23eaaf1001	Change the name of the distance module	2023-06-27 12:32:39 +02:00
Kerollmops	c2a402f3ae	Implement an ugly deletion of values in the HNSW	2023-06-27 12:32:39 +02:00
Kerollmops	436a10bef4	Replace the euclidean with a dot product	2023-06-27 12:32:39 +02:00
Kerollmops	8debf6fe81	Use a basic euclidean distance function	2023-06-27 12:32:39 +02:00
Kerollmops	c79e82c62a	Move back to the hnsw crate This reverts commit 7a4b6c065482f988b01298642f4c18775503f92f.	2023-06-27 12:32:39 +02:00
Kerollmops	aca305bb77	Log more to make sure we insert vectors in the hgg data-structure	2023-06-27 12:32:38 +02:00
Kerollmops	5816008139	Introduce an optimized version of the euclidean distance function	2023-06-27 12:32:38 +02:00
Kerollmops	268a9ef416	Move to the hgg crate	2023-06-27 12:32:38 +02:00
Clément Renault	642b0f3a1b	Expose a new vector field on the search route	2023-06-27 12:32:38 +02:00
Clément Renault	4571e512d2	Store the vectors in an HNSW in LMDB	2023-06-27 12:32:38 +02:00
Clément Renault	7ac2f1489d	Extract the vectors from the documents	2023-06-27 12:32:37 +02:00
Clément Renault	34349faeae	Create a new _vector extractor	2023-06-27 12:32:37 +02:00
ManyTheFish	63ca25290b	Take into account small Review requests	2023-06-26 14:56:19 +02:00
ManyTheFish	59f64a5256	Return an error when an attribute is not searchable	2023-06-26 14:56:19 +02:00
ManyTheFish	42709ea9a5	Fix clippy warnings	2023-06-26 14:55:57 +02:00
ManyTheFish	fb8fa07169	Restrict field ids in search context	2023-06-26 14:55:57 +02:00
ManyTheFish	0ccf1e2e40	Allow the search cache to store owned values	2023-06-26 14:55:57 +02:00
ManyTheFish	9680e1e41f	Introduce a BytesDecodeOwned trait in heed_codecs	2023-06-26 14:55:14 +02:00
ManyTheFish	461b5118bd	Add API search setting	2023-06-26 14:55:14 +02:00
Tamo	a3716c5678	add the new parameter to the search builder of milli	2023-06-26 14:55:14 +02:00
meili-bors[bot]	2d34005965	Merge #3821 3821: Add normalized and detailed scores to documents returned by a query r=dureuill a=dureuill # Pull Request ## Related issue Fixes #3771 ## What does this PR do? ### User standpoint <details> <summary>Request ranking score</summary> ``` echo '{ "q": "Badman dark knight returns", "showRankingScore": true, "limit": 10, "attributesToRetrieve": ["title"] }' \| mieli search -i index-word-count-10-count ``` </details> <details> <summary>Response</summary> ```json { "hits": [ { "title": "Batman: The Dark Knight Returns, Part 1", "_rankingScore": 0.947520325203252 }, { "title": "Batman: The Dark Knight Returns, Part 2", "_rankingScore": 0.947520325203252 }, { "title": "Batman Unmasked: The Psychology of the Dark Knight", "_rankingScore": 0.6657594086021505 }, { "title": "Legends of the Dark Knight: The History of Batman", "_rankingScore": 0.6654905913978495 }, { "title": "Angel and the Badman", "_rankingScore": 0.2196969696969697 }, { "title": "Angel and the Badman", "_rankingScore": 0.2196969696969697 }, { "title": "Batman", "_rankingScore": 0.11553030303030302 }, { "title": "Batman Begins", "_rankingScore": 0.11553030303030302 }, { "title": "Batman Returns", "_rankingScore": 0.11553030303030302 }, { "title": "Batman Forever", "_rankingScore": 0.11553030303030302 } ], "query": "Badman dark knight returns", "processingTimeMs": 12, "limit": 10, "offset": 0, "estimatedTotalHits": 46 } ``` </details> - If adding a `showRankingScore` parameter to the search query, then documents returned by a search now contain an additional field `_rankingScore` that is a float bigger than 0 and lower or equal to 1.0. This field represents the relevancy of the document, relatively to the search query and the settings of the index, with 1.0 meaning "perfect match" and 0 meaning "not matching the query" (Meilisearch should never return documents not matching the query at all). - The `sort` and `geosort` ranking rules do not influence the `_rankingScore`. <details> <summary>Request detailed ranking scores</summary> ``` echo '{ "q": "Badman dark knight returns", "showRankingScoreDetails": true, "limit": 5, "attributesToRetrieve": ["title"] }' \| mieli search -i index-word-count-10-count ``` </details> <details> <summary>Response</summary> ```json { "hits": [ { "title": "Batman: The Dark Knight Returns, Part 1", "_rankingScoreDetails": { "words": { "order": 0, "matchingWords": 4, "maxMatchingWords": 4, "score": 1.0 }, "typo": { "order": 1, "typoCount": 1, "maxTypoCount": 4, "score": 0.8 }, "proximity": { "order": 2, "score": 0.9545454545454546 }, "attribute": { "order": 3, "attributes_ranking_order": 1.0, "attributes_query_word_order": 0.926829268292683, "score": 0.926829268292683 }, "exactness": { "order": 4, "matchType": "noExactMatch", "score": 0.26666666666666666 } } }, { "title": "Batman: The Dark Knight Returns, Part 2", "_rankingScoreDetails": { "words": { "order": 0, "matchingWords": 4, "maxMatchingWords": 4, "score": 1.0 }, "typo": { "order": 1, "typoCount": 1, "maxTypoCount": 4, "score": 0.8 }, "proximity": { "order": 2, "score": 0.9545454545454546 }, "attribute": { "order": 3, "attributes_ranking_order": 1.0, "attributes_query_word_order": 0.926829268292683, "score": 0.926829268292683 }, "exactness": { "order": 4, "matchType": "noExactMatch", "score": 0.26666666666666666 } } }, { "title": "Batman Unmasked: The Psychology of the Dark Knight", "_rankingScoreDetails": { "words": { "order": 0, "matchingWords": 3, "maxMatchingWords": 4, "score": 0.75 }, "typo": { "order": 1, "typoCount": 1, "maxTypoCount": 3, "score": 0.75 }, "proximity": { "order": 2, "score": 0.6666666666666666 }, "attribute": { "order": 3, "attributes_ranking_order": 1.0, "attributes_query_word_order": 0.8064516129032258, "score": 0.8064516129032258 }, "exactness": { "order": 4, "matchType": "noExactMatch", "score": 0.25 } } }, { "title": "Legends of the Dark Knight: The History of Batman", "_rankingScoreDetails": { "words": { "order": 0, "matchingWords": 3, "maxMatchingWords": 4, "score": 0.75 }, "typo": { "order": 1, "typoCount": 1, "maxTypoCount": 3, "score": 0.75 }, "proximity": { "order": 2, "score": 0.6666666666666666 }, "attribute": { "order": 3, "attributes_ranking_order": 1.0, "attributes_query_word_order": 0.7419354838709677, "score": 0.7419354838709677 }, "exactness": { "order": 4, "matchType": "noExactMatch", "score": 0.25 } } }, { "title": "Angel and the Badman", "_rankingScoreDetails": { "words": { "order": 0, "matchingWords": 1, "maxMatchingWords": 4, "score": 0.25 }, "typo": { "order": 1, "typoCount": 0, "maxTypoCount": 1, "score": 1.0 }, "proximity": { "order": 2, "score": 1.0 }, "attribute": { "order": 3, "attributes_ranking_order": 1.0, "attributes_query_word_order": 0.8181818181818182, "score": 0.8181818181818182 }, "exactness": { "order": 4, "matchType": "noExactMatch", "score": 0.3333333333333333 } } } ], "query": "Badman dark knight returns", "processingTimeMs": 9, "limit": 5, "offset": 0, "estimatedTotalHits": 46 } ``` </details> - If adding a `showRankingScoreDetails` parameter to the search query, then the returned documents will now contain an additional `_rankingScoreDetails` field that is a JSON object containing one field per ranking rule that was applied, whose value is a JSON object with the following fields: - `order`: a number indicating the order this rule was applied (0 is the first applied ranking rule) - `score` (except for `sort` and `geosort`): a float indicating how the document matched this particular rule. - other fields that are specific to the rule, indicating for example how many words matched for a document and how many typos were counted in a matching document. - If the `displayableAttributes` list is defined in the settings of the index, any ranking rule using an attribute not part of that list will be marked as `<hidden-rule>` in the `_rankingScoreDetails`. - Search queries that are part of a `multi-search` requests are modified in the same way and each of the queries can take the `showRankingScore` and `showRankingScoreDetails` parameters independently. The results are still returned in separate lists and providing a unified list of results between multiple queries is not in the scope of this PR (but is unblocked by this PR and can be done manually by using the scores of the various documents). ### Implementation standpoint - Fix difference in how the position of terms were computed at indexing time and query time: this difference meant that a query containing a hard separator would fail the exactness check. - Fix the id reported by the sort ranking rule (very minor) - Change how the cost of removing words is computed. After this change the cost no longer works for any other ranking rule than `words`. Also made `words` have a cost of 0 such that the entire cost of `words` is given by the termRemovalStrategy. The new cost computation makes it so the score is computed in a way consistent with the number of words in the query. Additionally, the words that appear in phrases in the query are also counted as matching words. - When any score computation is requested through `showRankingScore` or `showRankingScoreDetails`, remove optimization where ranking rules are not executed on buckets of a single document: this is important to allow the computation of an accurate score. - add virtual conditions to fid and position to always have the max cost: this ensures that the score is independent from the dataset - the Position ranking rule now takes into account the distance to the position of the word in the query instead of the distance to the position 0. - modified proximity ranking rule cost calculation so that the cost is 0 for documents that are perfectly matching the query - Add a new `milli::score_details` module containing all the types that are involved in score computation. - Make it so a bucket of result now contains a `ScoreDetails` and changed the ranking rules to produce their `ScoreDetails`. - Expose the scores in the REST API. - Add very light analytics for scoring. - Update the search tests to add the expected scores. Co-authored-by: Louis Dureuil <louis@meilisearch.com>	2023-06-26 09:32:43 +00:00
meili-bors[bot]	040b5a5b6f	Merge #3842 3842: fix some typos r=dureuill a=cuishuang # Pull Request ## Related issue Fixes #<issue_number> ## What does this PR do? - fix some typos ## PR checklist Please check if your PR fulfills the following requirements: - [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)? - [x] Have you read the contributing guidelines? - [x] Have you made sure that the title is accurate and descriptive of the changes? Thank you so much for contributing to Meilisearch! Co-authored-by: cui fliter <imcusg@gmail.com>	2023-06-22 18:01:10 +00:00

1 2 3 4 5 ...

1879 Commits