meilisearch

mirror of https://github.com/meilisearch/meilisearch.git synced 2024-11-23 02:27:40 +08:00

Author	SHA1	Message	Date
Louis Dureuil	337e75b0e4	Exact attribute with state	2023-04-05 18:12:46 +02:00
Loïc Lecrenier	b5691802a3	Add new tests and fix construction of query graph from paths	2023-04-05 16:31:10 +02:00
Loïc Lecrenier	6e50f23896	Add more search tests	2023-04-05 13:33:23 +02:00
Tamo	597d57bf1d	Merge branch 'main' into bring-back-changes-v1.1.0	2023-04-05 11:32:14 +02:00
Loïc Lecrenier	4c8a0179ba	Add more search tests	2023-04-05 11:30:49 +02:00
Loïc Lecrenier	c69cbec64a	Add more search tests	2023-04-05 11:20:04 +02:00
Loïc Lecrenier	ce328c329d	Move bucket sort function to its own module and fix a bug	2023-04-04 18:03:08 +02:00
Loïc Lecrenier	959e4607bb	Add more search tests	2023-04-04 18:02:46 +02:00
Louis Dureuil	4b4ffb8ec9	Add exactness ranking rules	2023-04-04 17:12:07 +02:00
Louis Dureuil	3951fe22ab	Add ExactTerm and helper method	2023-04-04 17:09:32 +02:00
Louis Dureuil	4d5bc9df4c	Increase position by 8 on hard separator when building query terms	2023-04-04 17:07:26 +02:00
Louis Dureuil	ec2f8e8040	Rename `is_multiple_words` to `is_ngram` and `zero_typo` to `exact`	2023-04-04 17:06:07 +02:00
Louis Dureuil	406b8bd248	Add new db caches	2023-04-04 17:04:46 +02:00
Loïc Lecrenier	62b9c6fbee	Add search tests	2023-04-04 16:18:22 +02:00
Loïc Lecrenier	b439d36807	Split query_term module into multiple submodules	2023-04-04 15:38:30 +02:00
Loïc Lecrenier	faceb661e3	Add note that a part of the code needs fixing	2023-04-04 15:02:01 +02:00
Loïc Lecrenier	4129d657e2	Simplify query_term module a bit	2023-04-04 15:01:42 +02:00
Filip Bachul	1e6fe71a67	fix clippy warning	2023-04-03 20:18:26 +02:00
Filip Bachul	fddfb37f1f	remove unnecessary FilterError:ReservedGeo and FilterError:ReservedGeo	2023-04-03 20:18:26 +02:00
Loïc Lecrenier	3f13608002	Fix computation of ngram derivations	2023-04-03 15:27:49 +02:00
Loïc Lecrenier	4708d9b016	Fix compiler warnings/errors	2023-04-03 10:09:27 +02:00
Clément Renault	0d2e7bcc13	Implement the previous way for the exhaustive distinct candidates	2023-04-03 10:08:10 +02:00
Loïc Lecrenier	55fbfb6124	Merge branch 'search-refactor-located-query-terms' into search-refactor	2023-04-03 10:04:36 +02:00
Loïc Lecrenier	58fe260c72	Allow removing all the terms from a query if it contains a phrase	2023-04-03 09:18:02 +02:00
Loïc Lecrenier	24e5f6f7a9	Don't remove phrases with "last" term matching strategy	2023-04-03 09:17:33 +02:00
Louis Dureuil	9b87c36200	Limit the number of derivations for a single word.	2023-03-31 09:19:18 +02:00
Filip Bachul	1861c69964	fmt	2023-03-30 23:37:26 +02:00
Filip Bachul	cb2b5eb38e	handle _geoDistance(x,x) sort error	2023-03-30 23:21:23 +02:00
Filip Bachul	53aa0a1b54	handle _geo(x,x) sort error	2023-03-30 23:17:34 +02:00
Loïc Lecrenier	12b26cd54e	Don't remove phrases from the query with term matching strategy Last	2023-03-30 14:54:08 +02:00
Loïc Lecrenier	061b1e6d7c	Tiny refactor of query graph remove_nodes method	2023-03-30 14:49:25 +02:00
Loïc Lecrenier	0d6e8b5c31	Fix phrase search bug when the phrase has only one word	2023-03-30 14:48:12 +02:00
Loïc Lecrenier	d48cdc67a0	Fix term matching strategy bugs	2023-03-30 14:01:52 +02:00
Loïc Lecrenier	35c16ad047	Use new term matching strategy logic in words ranking rule	2023-03-30 13:15:43 +02:00
Loïc Lecrenier	2997d1f186	Use new term matching strategy logic in resolve_maximally_reduced_...	2023-03-30 13:12:51 +02:00
Loïc Lecrenier	2a5997fb20	Avoid expensive assert! in bucket sort function	2023-03-30 13:07:17 +02:00
Loïc Lecrenier	ee8a9e0bad	Remove outdated sentence in documentation	2023-03-30 12:22:24 +02:00
Loïc Lecrenier	3b0737a092	Fix detailed logger	2023-03-30 12:20:44 +02:00
Loïc Lecrenier	fdd02105ac	Graph-based ranking rule + term matching strategy support	2023-03-30 12:19:21 +02:00
Loïc Lecrenier	aa9592455c	Refactor the paths_of_cost algorithm Support conditions that require certain nodes to be skipped	2023-03-30 12:11:11 +02:00
Loïc Lecrenier	01e24dd630	Rewrite proximity ranking rule	2023-03-30 11:59:06 +02:00
Loïc Lecrenier	ae6bb1ce17	Update the ConditionDocidsCache after change to RankingRuleGraphTrait	2023-03-30 11:41:20 +02:00
Loïc Lecrenier	5fd28620cd	Build ranking rule graph correctly after changes to trait definition	2023-03-30 11:32:55 +02:00
Loïc Lecrenier	728710d63a	Update typo ranking rule to use new query term structure	2023-03-30 11:32:19 +02:00
Loïc Lecrenier	fa81381865	Update the trait requirements of ranking-rule graphs	2023-03-30 11:19:45 +02:00
Loïc Lecrenier	b96a682f16	Update resolve_graph module to work with lazy query terms	2023-03-30 11:10:38 +02:00
Loïc Lecrenier	d0f048c068	Simplify the API of the DatabaseCache	2023-03-30 11:08:17 +02:00
Loïc Lecrenier	223e82a10d	Update QueryGraph to use new lazy query terms + build from paths	2023-03-30 11:06:02 +02:00
Loïc Lecrenier	9507ff5e31	Update query term structure to allow for laziness	2023-03-30 11:06:02 +02:00
Louis Dureuil	c2b025946a	`located_query_terms_from_string`: use u16 for positions, hard limit number of iterated tokens. - Refactor phrase logic to reduce number of possible states	2023-03-30 11:04:14 +02:00
Loïc Lecrenier	3a818c5e87	Add more functionality to interners	2023-03-30 09:56:23 +02:00
Louis Dureuil	d74134ce3a	Check sort criteria	2023-03-29 15:21:54 +02:00
Louis Dureuil	5ac129bfa1	Mark geosearch as currently unimplemented for sort rule	2023-03-29 15:20:42 +02:00
ManyTheFish	efea1e5837	Fix facet normalization	2023-03-29 12:02:24 +02:00
Louis Dureuil	abb4522f76	Small comment on ignored rules for placeholder search	2023-03-29 09:11:06 +02:00
Louis Dureuil	ef084ef042	SmallBitmap: Consistently panic on incoherent universe lengths	2023-03-29 08:45:38 +02:00
Louis Dureuil	3524bd1257	SmallBitmap: Add documentation	2023-03-29 08:44:11 +02:00
Tamo	a50b058557	update the geoBoundingBox feature Now instead of using the (top_left, bottom_right) corners of the bounding box it s using the (top_right, bottom_left) corners.	2023-03-28 18:26:18 +02:00
Louis Dureuil	d4f6216966	Resolve rule time sort criteria	2023-03-28 16:42:02 +02:00
Louis Dureuil	77acafe534	Resolve search time sort criteria for placeholder search	2023-03-28 16:41:03 +02:00
Louis Dureuil	53afda3237	Update search usage in example	2023-03-28 16:35:46 +02:00
Louis Dureuil	abb19d368d	Initialize query time ranking rule for query search	2023-03-28 12:40:52 +02:00
Louis Dureuil	b4a52a622e	BoxRankingRule	2023-03-28 12:39:42 +02:00
Louis Dureuil	8d7d8cdc2f	Clean-up index example	2023-03-27 18:34:10 +02:00
Louis Dureuil	626a93b348	Search example: panic when missing the index path	2023-03-27 18:18:01 +02:00
Louis Dureuil	af65fe201a	Clean-up search example	2023-03-27 17:49:43 +02:00
Louis Dureuil	9b83b1deb0	Expose SearchLogger trait	2023-03-27 17:49:18 +02:00
Louis Dureuil	e9eb271499	Remove empty attribute_rule mod	2023-03-27 11:08:03 +02:00
Louis Dureuil	3281a88d08	SmallBitmap: don't expose internal items	2023-03-27 11:04:43 +02:00
Louis Dureuil	5a644054ab	Removed unused search impl	2023-03-27 11:04:27 +02:00
Louis Dureuil	16fefd364e	Add TODO notes	2023-03-27 11:04:04 +02:00
Gregory Conrad	e7994cdeb3	feat: check to see if the PK changed before erroring out Previously, if the primary key was set and a Settings update contained a primary key, an error would be returned. However, this error is not needed if the new PK == the current PK. This commit just checks to see if the PK actually changes before raising an error.	2023-03-26 12:18:39 -04:00
Loïc Lecrenier	00bad8c716	Add comments suggesting performance improvements	2023-03-23 10:18:24 +01:00
Loïc Lecrenier	862714a18b	Remove criterion_implementation_strategy param of Search	2023-03-23 09:44:12 +01:00
Loïc Lecrenier	d18ebe4f3a	Remove more warnings	2023-03-23 09:41:18 +01:00
Loïc Lecrenier	7169d85115	Remove old query_tree code and make clippy happy	2023-03-23 09:39:16 +01:00
Loïc Lecrenier	f5f5f03ec0	Remove old criteria code	2023-03-23 09:35:53 +01:00
Loïc Lecrenier	9b2653427d	Split position DB into fid and relative position DB	2023-03-23 09:22:01 +01:00
Loïc Lecrenier	56b7209f26	Make clippy happy	2023-03-23 09:16:17 +01:00
Loïc Lecrenier	9b1f439a91	WIP	2023-03-23 09:12:35 +01:00
Loïc Lecrenier	01c7d2de8f	Add example targets to the milli crate	2023-03-22 14:50:41 +01:00
Loïc Lecrenier	a86aeba411	WIP	2023-03-22 14:43:08 +01:00
Loïc Lecrenier	384fdc2df4	Fix two bugs in proximity ranking rule	2023-03-21 11:43:25 +01:00
Loïc Lecrenier	83e5b4ed0d	Compute edges of proximity graph lazily	2023-03-21 10:44:40 +01:00
Loïc Lecrenier	272cd7ebbd	Small cleanup	2023-03-20 13:39:19 +01:00
Loïc Lecrenier	c63c7377e6	Switch order of MappedInterner generic params	2023-03-20 09:41:56 +01:00
Loïc Lecrenier	5b50e49522	cargo fmt	2023-03-20 09:41:56 +01:00
Loïc Lecrenier	65474c8de5	Update new sort ranking rule after rebasing	2023-03-20 09:41:56 +01:00
Loïc Lecrenier	fbb1ba3de0	Cargo fmt	2023-03-20 09:41:56 +01:00
Loïc Lecrenier	a59ca28e2c	Add forgotten file	2023-03-20 09:41:56 +01:00
Loïc Lecrenier	825f742000	Simplify graph-based ranking rule impl	2023-03-20 09:41:56 +01:00
Loïc Lecrenier	dd491320e5	Simplify graph-based ranking rule impl	2023-03-20 09:41:56 +01:00
Loïc Lecrenier	c6ff97a220	Rewrite the dead-ends cache to detect more dead-ends	2023-03-20 09:41:56 +01:00
Loïc Lecrenier	49240c367a	Fix bug in cost of typo conditions	2023-03-20 09:41:56 +01:00
Loïc Lecrenier	1e6e624078	Fix bug in SmallBitmap	2023-03-20 09:41:56 +01:00
Loïc Lecrenier	8b4e07e1a3	WIP	2023-03-20 09:41:56 +01:00
Loïc Lecrenier	2853009987	Renaming Edge -> Condition	2023-03-20 09:41:56 +01:00
Loïc Lecrenier	aa59c3bc2c	Replace EdgeCondition with an Option<..> + other code cleanup	2023-03-20 09:41:56 +01:00
Loïc Lecrenier	7b1d8f4c6d	Make PathSet strongly typed	2023-03-20 09:41:56 +01:00
Loïc Lecrenier	a49ddec9df	Prune the query graph after executing a ranking rule	2023-03-20 09:41:56 +01:00
Loïc Lecrenier	05fe856e6e	Merge forward and backward proximity conditions in proximity graph	2023-03-20 09:41:56 +01:00
Loïc Lecrenier	c0cdaf9f53	Fix bug in the proximity ranking rule for queries with ngrams	2023-03-20 09:41:56 +01:00
Loïc Lecrenier	e9cf58d584	Refactor of the Interner	2023-03-20 09:41:56 +01:00
Loïc Lecrenier	31628c5cd4	Merge Phrase and WordDerivations into one structure	2023-03-20 09:41:56 +01:00
Loïc Lecrenier	3004e281d7	Support ngram typos + splitwords and splitwords+synonyms in proximity	2023-03-20 09:41:56 +01:00
Loïc Lecrenier	14e8d0aaa2	Rename lifetime	2023-03-20 09:41:56 +01:00
Loïc Lecrenier	1c58cf8426	Intern ranking rule graph edge conditions as well	2023-03-20 09:41:56 +01:00
Loïc Lecrenier	5155fd2bf1	Reorganise initialisation of ranking rules + rename PathsMap -> PathSet	2023-03-20 09:41:56 +01:00
Loïc Lecrenier	9ec9c204d3	Small code cleanup	2023-03-20 09:41:56 +01:00
Loïc Lecrenier	78b9304d52	Implement distinct attribute	2023-03-20 09:41:56 +01:00
Loïc Lecrenier	0465ba4a05	Intern more values	2023-03-20 09:41:56 +01:00
Loïc Lecrenier	2099991dd1	Continue documenting and cleaning up the code	2023-03-20 09:41:56 +01:00
Loïc Lecrenier	c232cdabf5	Add documentation	2023-03-20 09:41:56 +01:00
Loïc Lecrenier	4e266211bf	Small code reorganisation	2023-03-20 09:41:56 +01:00
Loïc Lecrenier	57fa689131	Cargo fmt	2023-03-20 09:41:56 +01:00
Loïc Lecrenier	10626dddfc	Add a few more optimisations to new search algorithms	2023-03-20 09:41:56 +01:00
Loïc Lecrenier	9051065c22	Apply a few optimisations for graph-based ranking rules	2023-03-20 09:41:56 +01:00
Loïc Lecrenier	e8c76cf7bf	Intern all strings and phrases in the search logic	2023-03-20 09:41:56 +01:00
Loïc Lecrenier	3f1729a17f	Update new search test	2023-03-20 09:41:56 +01:00
Loïc Lecrenier	cab2b6bcda	Fix: computation of initial universe, code organisation	2023-03-20 09:41:56 +01:00
Loïc Lecrenier	c4979a2fda	Fix code visibility issue + unimplemented detail in proximity rule	2023-03-20 09:41:56 +01:00
Loïc Lecrenier	23931f8a4f	Fix small bug in visual logger of search algo	2023-03-20 09:41:56 +01:00
Loïc Lecrenier	aa414565bb	Fix proximity graph edge builder to include all proximities	2023-03-20 09:41:56 +01:00
Loïc Lecrenier	1db152046e	WIP on split words and synonyms support	2023-03-20 09:41:56 +01:00
Loïc Lecrenier	c27ea2677f	Rewrite cheapest path algorithm and empty path cache It is now much simpler and has much better performance.	2023-03-20 09:41:56 +01:00
Loïc Lecrenier	caa1e1b923	Add typo ranking rule to new search impl	2023-03-20 09:41:56 +01:00
Loïc Lecrenier	71f18e4379	Add sort ranking rule to new search impl	2023-03-20 09:41:56 +01:00
Loïc Lecrenier	600e3dd1c5	Remove warnings	2023-03-20 09:41:56 +01:00
Loïc Lecrenier	362eb0de86	Add support for filters	2023-03-20 09:41:56 +01:00
Loïc Lecrenier	998d46ac10	Add support for search offset and limit	2023-03-20 09:41:56 +01:00
Loïc Lecrenier	6c85c0d95e	Fix more bugs + visual empty path cache logging	2023-03-20 09:41:56 +01:00
Loïc Lecrenier	0e1fbbf7c6	Fix bugs in query graph's "remove word" and "cheapest paths" algos	2023-03-20 09:41:56 +01:00
Loïc Lecrenier	6806640ef0	Fix d2 description of paths map	2023-03-20 09:41:56 +01:00
Loïc Lecrenier	173e37584c	Improve the visual/detailed search logger	2023-03-20 09:41:55 +01:00
Loïc Lecrenier	6ba4d5e987	Add a search logger	2023-03-20 09:41:55 +01:00
Loïc Lecrenier	dd12d44134	Support swapped word pairs in new proximity ranking rule impl	2023-03-20 09:41:55 +01:00
Loïc Lecrenier	c8e251bf24	Remove noise in codebase	2023-03-20 09:41:55 +01:00
Loïc Lecrenier	a938fbde4a	Use a cache when resolving the query graph	2023-03-20 09:41:55 +01:00
Loïc Lecrenier	dcf3f1d18a	Remove EdgeIndex and NodeIndex types, prefer u32 instead	2023-03-20 09:41:55 +01:00
Loïc Lecrenier	66d0c63694	Add some documentation and use bitmaps instead of hashmaps when possible	2023-03-20 09:41:55 +01:00
Loïc Lecrenier	132191360b	Introduce the sort ranking rule working with the new search structures	2023-03-20 09:41:55 +01:00
Loïc Lecrenier	345c99d5bd	Introduce the words ranking rule working with the new search structures	2023-03-20 09:41:55 +01:00
Loïc Lecrenier	89d696c1e3	Introduce the proximity ranking rule as a graph-based ranking rule	2023-03-20 09:41:55 +01:00
Loïc Lecrenier	c645853529	Introduce a generic graph-based ranking rule	2023-03-20 09:41:55 +01:00
Loïc Lecrenier	a70ab8b072	Introduce a function to find the K shortest paths in a graph	2023-03-20 09:41:55 +01:00
Loïc Lecrenier	48aae76b15	Introduce a function to find the docids of a set of paths in a graph	2023-03-20 09:41:55 +01:00
Loïc Lecrenier	23bf572dea	Introduce cache structures used with ranking rule graphs	2023-03-20 09:41:55 +01:00
Loïc Lecrenier	864f6410ed	Introduce a structure to represent a set of graph paths efficiently	2023-03-20 09:41:55 +01:00
Loïc Lecrenier	c9bf6bb2fa	Introduce a structure to implement ranking rules with graph algorithms	2023-03-20 09:41:55 +01:00
Loïc Lecrenier	46249ea901	Implement a function to find a QueryGraph's docids	2023-03-20 09:41:55 +01:00
Loïc Lecrenier	ce0d1e0e13	Introduce a common way to manage the coordination between ranking rules	2023-03-20 09:41:55 +01:00
Loïc Lecrenier	5065d8b0c1	Introduce a DatabaseCache to memorize the addresses of LMDB values	2023-03-20 09:41:55 +01:00
Loïc Lecrenier	a83007c013	Introduce structure to represent search queries as graphs	2023-03-20 09:41:55 +01:00
Loïc Lecrenier	79e0a6dd4e	Introduce a new search module, eventually meant to replace the old one The code here does not compile, because I am merely splitting one giant commit into smaller ones where each commit explains a single file.	2023-03-20 09:41:55 +01:00
Loïc Lecrenier	2d88089129	Remove unused term matching strategies	2023-03-20 09:41:55 +01:00
Loïc Lecrenier	6c659dc12f	Use MiMalloc in milli tests	2023-03-20 09:41:37 +01:00
Clément Renault	cf34d1c95f	Fix a test that forget to match a Null value	2023-03-15 17:17:19 +01:00
Clément Renault	1a9c58a7ab	Fix a bug with the new flattening rules	2023-03-15 16:56:44 +01:00
Clément Renault	64571c8288	Improve the testing of the filters	2023-03-15 14:57:17 +01:00
Clément Renault	ea016d97af	Implementing an IS EMPTY filter	2023-03-15 14:12:34 +01:00
Clément Renault	fa2ea4a379	Update the test to accept the new IS syntax	2023-03-14 10:31:27 +01:00
Tamo	0f33a65468	makes kero happy	2023-03-13 16:51:11 +01:00
bors[bot]	fb1260ee88	Merge #3568 #3569 3568: CI: Fix `publish-aarch64` job that still uses ubuntu-18.04 r=Kerollmops a=curquiza Fixes #3563 Main change - add the usage of the `ubuntu-18.04` container instead of the native `ubuntu-18.04` of GitHub actions: I had to install docker in the container. Small additional changes - remove useless `fail-fast` and unused/irrelevant matrix inputs (`build`, `linker`, `os`, `use-cross`...) - Remove useless step in job Proof of work with this CI triggered on this current branch: https://github.com/meilisearch/meilisearch/actions/runs/4366233882 3569: Enhance Japanese language detection r=dureuill a=ManyTheFish # Pull Request This PR is a prototype and can be tested by downloading [the dedicated docker image](https://hub.docker.com/layers/getmeili/meilisearch/prototype-better-language-detection-0/images/sha256-a12847de00e21a71ab797879fd09777dadcb0881f65b5f810e7d1ed434d116ef?context=explore): ```bash $ docker pull getmeili/meilisearch:prototype-better-language-detection-0 ``` ## Context Some Languages are harder to detect than others, this miss-detection leads to bad tokenization making some words or even documents completely unsearchable. Japanese is the main Language affected and can be detected as Chinese which has a completely different way of tokenization. A [first iteration has been implemented for v1.1.0](https://github.com/meilisearch/meilisearch/pull/3347) but is an insufficient enhancement to make Japanese work. This first implementation was detecting the Language during the indexing to avoid bad detections during the search. Unfortunately, some documents (shorter ones) can be wrongly detected as Chinese running bad tokenization for these documents and making possible the detection of Chinese during the search because it has been detected during the indexing. 
For instance, a Japanese document `{"id": 1, "name": "東京スカパラダイスオーケストラ"}` is detected as Japanese during indexing, during the search the query `東京` will be detected as Japanese because only Japanese documents have been detected during indexing despite the fact that v1.0.2 would detect it as Chinese. However if in the dataset there is at least one document containing a field with only Kanjis like: _A document with only 1 field containing only Kanjis:_ ```json { "id":4, "name": "東京特許許可局" } ``` _A document with 1 field containing only Kanjis and 1 field containing several Japanese characters:_ ```json { "id":105, "name": "東京特許許可局", "desc": "日経平均株価は26日に約8カ月ぶりに2万4000円の心理的な節目を上回った。株高を支える材料のひとつは、自民党総裁選で3選を決めた安倍晋三首相の経済政策への期待だ。恩恵が見込まれるとされる人材サービスや建設株の一角が買われている。ただ思惑が先行して資金が集まっている面は否めない。実際に政策効果を取り込む企業はどこか、なお未知数だ。" } ``` Then, in both cases, the field `name` will be detected as Chinese during indexing allowing the search to detect Chinese in queries. Therefore, the query `東京` will be detected as Chinese and only the two last documents will be retrieved by Meilisearch. ## Technical Approach The current PR partially fixes these issues by: 1) Adding a check over potential miss-detections and rerunning the extraction of the document forcing the tokenization over the main Languages detected in it. > 1) run a first extraction allowing the tokenizer to detect any Language in any Script > 2) generate a distribution of tokens by Script and Languages (`script_language`) > 3) if for a Script we have a token distribution of one of the Language that is under the threshold, then we rerun the extraction forbidding the tokenizer to detect the marginal Languages > 4) the tokenizer will fall back on the other available Languages to tokenize the text. For example, if the Chinese were marginally detected compared to the Japanese on the CJ script, then the second extraction will force Japanese tokenization for CJ text in the document. 
however, the text on another script like Latin will not be impacted by this restriction. 2) Adding a filtering threshold during the search over Languages that have been marginally detected in documents ## Limits This PR introduces 2 arbitrary thresholds: 1) during the indexing, a Language is considered miss-detected if the number of detected tokens of this Language is under 10% of the tokens detected in the same Script (Japanese and Chinese are 2 different Languages sharing the "same" script "CJK"). 2) during the search, a Language is considered marginal if less than 5% of documents are detected as this Language. This PR only partially fixes these issues: - ✅ the query `東京` now find Japanese documents if less than 5% of documents are detected as Chinese. - ✅ the document with the id `105` containing the Japanese field `desc` but the miss-detected field `name` is now completely detected and tokenized as Japanese and is found with the query `東京`. - ❌ the document with the id `4` no longer breaks the search Language detection but continues to be detected as a Chinese document and can't be found during the search. ## Related issue Fixes #3565 ## Possible future enhancements - Change or contribute to the Library used to detect the Language - the related issue on Whatlang: https://github.com/greyblake/whatlang-rs/issues/122 Co-authored-by: curquiza <clementine@meilisearch.com> Co-authored-by: ManyTheFish <many@meilisearch.com> Co-authored-by: Many the fish <many@meilisearch.com>	2023-03-09 15:34:35 +00:00
ManyTheFish	2f8eb4f54a	last PR fixes	2023-03-09 15:34:36 +01:00
Clément Renault	175e8a8495	Fix a diacritic issue	2023-03-09 14:57:47 +01:00
Clément Renault	df48ac8803	Add one more test for the NULL operator	2023-03-09 13:53:37 +01:00
Clément Renault	ff86073288	Add a snapshot for the NULL facet database	2023-03-09 13:32:27 +01:00
Clément Renault	0ad53784e7	Create a new struct to reduce the type complexity	2023-03-09 13:21:21 +01:00
Clément Renault	e064c52544	Rename an internal facet deletion method	2023-03-09 13:08:02 +01:00
Clément Renault	e106b16148	Fix a typo in a variable Co-authored-by: Louis Dureuil <louis@meilisearch.com> aaa	2023-03-09 13:08:02 +01:00
Tamo	eddefb0e0f	refactor the error type of the milli::document thing silence a warning	2023-03-09 13:03:14 +01:00
ManyTheFish	5deea631ea	fix clippy too many arguments	2023-03-09 11:19:13 +01:00
Tamo	c5f22be6e1	add boolean support for csv documents	2023-03-09 11:12:49 +01:00
ManyTheFish	b4b859ec8c	Fix typos	2023-03-09 10:58:35 +01:00
Clément Renault	b1d61f5a02	Add more tests for the NULL filter	2023-03-09 10:04:27 +01:00
Clément Renault	7dc04747fd	Make clippy happy	2023-03-08 17:37:08 +01:00
Clément Renault	7c0cd7172d	Introduce the NULL and NOT value NULL operator	2023-03-08 17:14:34 +01:00
Clément Renault	43ff236df8	Write the NULL facet values in the database	2023-03-08 16:49:53 +01:00
Clément Renault	19ab4d1a15	Classify the NULL fields values in the facet extractor	2023-03-08 16:49:31 +01:00
Clément Renault	9287858997	Introduce a new facet_id_is_null_docids database in the index	2023-03-08 16:14:00 +01:00
ManyTheFish	24c0775c67	Change indexing threshold	2023-03-08 12:36:04 +01:00
ManyTheFish	3092cf0448	Fix clippy errors	2023-03-08 10:53:42 +01:00
ManyTheFish	37d4551e8e	Add a threshold filtering the Languages allowed to be detected at search time	2023-03-07 19:38:01 +01:00
ManyTheFish	da48506f15	Rerun extraction when language detection might have failed	2023-03-07 18:35:26 +01:00
bors[bot]	4f1ccbc495	Merge #3525 3525: Fix phrase search containing stop words r=ManyTheFish a=ManyTheFish # Summary A search with a phrase containing only stop words was returning an HTTP error 500, this PR filters the phrase containing only stop words dropping them before the search starts, a query with a phrase containing only stop words now behaves like a placeholder search. fixes https://github.com/meilisearch/meilisearch/issues/3521 related v1.0.2 PR on milli: https://github.com/meilisearch/milli/pull/779 Co-authored-by: ManyTheFish <many@meilisearch.com>	2023-03-02 10:55:37 +00:00
ManyTheFish	37489fd495	Return an internal error in the case of matching word is invalid	2023-03-01 19:05:16 +01:00
Louis Dureuil	5822764be9	Skip computing index budget in tests	2023-02-23 11:23:39 +01:00
bors[bot]	ac5a1e4c4b	Merge #3423 3423: Add min and max facet stats r=dureuill a=dureuill # Pull Request ## Related issue Fixes #3426 ## What does this PR do? ### User standpoint - When using a `facets` parameter in search, the facets that have numeric values are displayed in a new section of the response called `facetStats` that contains, per facet, the numeric min and max value of the hits returned by the search. <details> <summary> Sample request/response </summary> ```json ❯ curl \ -X POST 'http://localhost:7700/indexes/meteorites/search?facets=mass' \ -H 'Content-Type: application/json' \ --data-binary '{ "q": "LL6", "facets":["mass", "recclass"], "limit": 5 }' | jsonxf { "hits": [ { "name": "Niger (LL6)", "id": "16975", "nametype": "Valid", "recclass": "LL6", "mass": 3.3, "fall": "Fell" }, { "name": "Appley Bridge", "id": "2318", "nametype": "Valid", "recclass": "LL6", "mass": 15000, "fall": "Fell", "_geo": { "lat": 53.58333, "lng": -2.71667 } }, { "name": "Athens", "id": "4885", "nametype": "Valid", "recclass": "LL6", "mass": 265, "fall": "Fell", "_geo": { "lat": 34.75, "lng": -87.0 } }, { "name": "Bandong", "id": "4935", "nametype": "Valid", "recclass": "LL6", "mass": 11500, "fall": "Fell", "_geo": { "lat": -6.91667, "lng": 107.6 } }, { "name": "Benguerir", "id": "30443", "nametype": "Valid", "recclass": "LL6", "mass": 25000, "fall": "Fell", "_geo": { "lat": 32.25, "lng": -8.15 } } ], "query": "LL6", "processingTimeMs": 15, "limit": 5, "offset": 0, "estimatedTotalHits": 42, "facetDistribution": { "mass": { "110000": 1, "11500": 1, "1161": 1, "12000": 1, "1215.5": 1, "127000": 1, "15000": 1, "1676": 1, "1700": 1, "1710.5": 1, "18000": 1, "19000": 1, "220000": 1, "2220": 1, "22300": 1, "25000": 2, "265": 1, "271000": 1, "2840": 1, "3.3": 1, "3000": 1, "303": 1, "32000": 1, "34000": 1, "36.1": 1, "45000": 1, "460": 1, "478": 1, "483": 1, "5500": 2, "600": 1, "6000": 1, "67.8": 1, "678": 1, "680.5": 1, "6930": 1, "8": 1, "8300": 1, "840": 1, "8400": 1 }, 
"recclass": { "L/LL6": 3, "LL6": 39 } }, "facetStats": { "mass": { "min": 3.3, "max": 271000.0 } } } ``` </details> ## PR checklist Please check if your PR fulfills the following requirements: - [ ] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)? - [ ] Have you read the contributing guidelines? - [ ] Have you made sure that the title is accurate and descriptive of the changes? Thank you so much for contributing to Meilisearch! Co-authored-by: Louis Dureuil <louis@meilisearch.com>	2023-02-22 13:06:43 +00:00
ManyTheFish	900bae3d9d	keep phrases that has at least one word	2023-02-21 18:16:51 +01:00
ManyTheFish	28b7d73d4a	Remove an unefficient part of a test on milli	2023-02-21 18:16:51 +01:00
bors[bot]	39407885c2	Merge #3347 3347: Enhance language detection r=irevoire a=ManyTheFish ## Summary Some completely unrelated Languages can share the same characters, in Meilisearch we detect the Languages using `whatlang`, which works well on large texts but fails on small search queries leading to a bad segmentation and normalization of the query. This PR now stores the Languages detected during the indexing in order to reduce the Languages list that can be detected during the search. ## Detail - Create a 19th database mapping the scripts and the Languages detected with the documents where the Language is detected - Fill the newly created database during indexing - Create an allow-list with this database and pass it to Charabia - Add a test ensuring that a Japanese request containing kanjis only is detected as Japanese and not Chinese ## Related issues Fixes #2403 Fixes #3513 Co-authored-by: f3r10 <frledesma@outlook.com> Co-authored-by: ManyTheFish <many@meilisearch.com> Co-authored-by: Many the fish <many@meilisearch.com>	2023-02-21 10:52:13 +00:00
ManyTheFish	bbecab8948	fix clippy	2023-02-21 10:18:44 +01:00
ManyTheFish	8aa808d51b	Merge branch 'main' into enhance-language-detection	2023-02-20 18:14:34 +01:00
bors[bot]	1e9ac00800	Merge #3505 3505: Csv delimiter r=irevoire a=irevoire Fixes https://github.com/meilisearch/meilisearch/issues/3442 Closes https://github.com/meilisearch/meilisearch/pull/2803 Specified in https://github.com/meilisearch/specifications/pull/221 This PR is a reimplementation of https://github.com/meilisearch/meilisearch/pull/2803, on the new engine. Thanks for your idea and initial PR @MixusMinimax; sorry I couldn’t update/merge your PR. Way too many changes happened on the engine in the meantime. Attention to reviewer; I had to update deserr to implement the support of deserializing `char`s ------- It introduces four new error messages; - Invalid value in parameter csvDelimiter: expected a string of one character, but found an empty string - Invalid value in parameter csvDelimiter: expected a string of one character, but found the following string of 5 characters: doggo - csv delimiter must be an ascii character. Found: 🍰 - The Content-Type application/json does not support the use of a csv delimiter. The csv delimiter can only be used with the Content-Type text/csv. And one error code; - `invalid_index_csv_delimiter` The `invalid_content_type` error code is now also used when we encounter the `csvDelimiter` query parameter with a non-csv content type. Co-authored-by: Tamo <tamo@meilisearch.com>	2023-02-20 17:01:36 +00:00
bors[bot]	b08a49a16e	Merge #3319 #3470 3319: Transparently resize indexes on MaxDatabaseSizeReached errors r=Kerollmops a=dureuill # Pull Request ## Related issue Related to https://github.com/meilisearch/meilisearch/discussions/3280, depends on https://github.com/meilisearch/milli/pull/760 ## What does this PR do? ### User standpoint - Meilisearch no longer fails tasks that encounter the `milli::UserError(MaxDatabaseSizeReached)` error. - Instead, these tasks are retried after increasing the maximum size allocated to the index where the failure occurred. ### Implementation standpoint - Add `Batch::index_uid` to get the `index_uid` of a batch of task if there is one - `IndexMapper::create_or_open_index` now takes an additional `size` argument that allows to (re)open indexes with a size different from the base `IndexScheduler::index_size` field - `IndexScheduler::tick` now returns a `Result<TickOutcome>` instead of a `Result<usize>`. This offers more explicit control over what the behavior should be wrt the next tick. - Add `IndexStatus::BeingResized` that contains a handle that a thread can use to await for the resize operation to complete and the index to be available again. - Add `IndexMapper::resize_index` to increase the size of an index. - In `IndexScheduler::tick`, intercept task batches that failed due to `MaxDatabaseSizeReached` and resize the index that caused the error, then request a new tick that will eventually handle the still enqueued task. 
## Testing the PR The following diff can be applied to this branch to make testing the PR easier: <details> ```diff diff --git a/index-scheduler/src/index_mapper.rs b/index-scheduler/src/index_mapper.rs index 553ab45a..022b2f00 100644 --- a/index-scheduler/src/index_mapper.rs +++ b/index-scheduler/src/index_mapper.rs @@ -228,13 +228,15 @@ impl IndexMapper { drop(lock); + std::thread::sleep_ms(2000); + let current_size = index.map_size()?; let closing_event = index.prepare_for_closing(); - log::info!("Resizing index {} from {} to {} bytes", name, current_size, current_size * 2); + log::error!("Resizing index {} from {} to {} bytes", name, current_size, current_size * 2); closing_event.wait(); - log::info!("Resized index {} from {} to {} bytes", name, current_size, current_size * 2); + log::error!("Resized index {} from {} to {} bytes", name, current_size, current_size * 2); let index_path = self.base_path.join(uuid.to_string()); let index = self.create_or_open_index(&index_path, None, 2 * current_size)?; @@ -268,8 +270,10 @@ impl IndexMapper { match index { Some(Available(index)) => break index, Some(BeingResized(ref resize_operation)) => { + log::error!("waiting for resize end"); // Deadlock: no lock taken while doing this operation. resize_operation.wait(); + log::error!("trying our luck again!"); continue; } Some(BeingDeleted) => return Err(Error::IndexNotFound(name.to_string())), diff --git a/index-scheduler/src/lib.rs b/index-scheduler/src/lib.rs index 11b17d05..242dc095 100644 --- a/index-scheduler/src/lib.rs +++ b/index-scheduler/src/lib.rs @@ -908,6 +908,7 @@ impl IndexScheduler { /// /// Returns the number of processed tasks. 
fn tick(&self) -> Result<TickOutcome> { + log::error!("ticking!"); #[cfg(test)] { *self.run_loop_iteration.write().unwrap() += 1; diff --git a/meilisearch/src/main.rs b/meilisearch/src/main.rs index 050c825a..63f312f6 100644 --- a/meilisearch/src/main.rs +++ b/meilisearch/src/main.rs @@ -25,7 +25,7 @@ fn setup(opt: &Opt) -> anyhow::Result<()> { #[actix_web::main] async fn main() -> anyhow::Result<()> { - let (opt, config_read_from) = Opt::try_build()?; + let (mut opt, config_read_from) = Opt::try_build()?; setup(&opt)?; @@ -56,6 +56,8 @@ We generated a secure master key for you (you can safely copy this token): _ => (), } + opt.max_index_size = byte_unit::Byte::from_str("1MB").unwrap(); + let (index_scheduler, auth_controller) = setup_meilisearch(&opt)?; #[cfg(all(not(debug_assertions), feature = "analytics"))] ``` </details> Mainly, these debug changes do the following: - Set the default index size to 1MiB so that index resizes are initially frequent - Turn some logs from info to error so that they can be displayed with `--log-level ERROR` (hiding the other infos) - Add a long sleep between the beginning and the end of the resize so that we can observe the `BeingResized` index status (otherwise it would never come up in my tests) ## Open questions - Is the growth factor of x2 the correct solution? For a `Vec` in memory it makes sense, but here we're manipulating quantities that are potentially in the order of 500GiBs. For bigger indexes it may make more sense to add at most e.g. 100GiB on each resize operation, avoiding big steps like 500GiB -> 1TiB. ## PR checklist Please check if your PR fulfills the following requirements: - [ ] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)? - [ ] Have you read the contributing guidelines? - [ ] Have you made sure that the title is accurate and descriptive of the changes? Thank you so much for contributing to Meilisearch! 
3470: Autobatch addition and deletion r=irevoire a=irevoire This PR adds the capability to meilisearch to batch document addition and deletion together. Fix https://github.com/meilisearch/meilisearch/issues/3440 -------------- Things to check before merging; - [x] What happens if we delete multiple time the same documents -> add a test - [x] If a documentDeletion gets batched with a documentAddition but the index doesn't exist yet? It should not work Co-authored-by: Louis Dureuil <louis@meilisearch.com> Co-authored-by: Tamo <tamo@meilisearch.com>	2023-02-20 15:00:19 +00:00
Many the fish	119e6d8811	Update milli/src/search/mod.rs Co-authored-by: Tamo <tamo@meilisearch.com>	2023-02-20 15:33:10 +01:00
ManyTheFish	cb8d5f2d4b	Update Charabia to 0.7.1	2023-02-20 14:00:31 +01:00
Louis Dureuil	eb28d4c525	add facet test	2023-02-20 13:52:28 +01:00
Louis Dureuil	9ac981d025	Remove some clippy type complexity warns by deboxing iters	2023-02-20 13:52:27 +01:00
Louis Dureuil	74859ecd61	Add min and max facet stats	2023-02-20 13:52:27 +01:00


1713 Commits