meilisearch

mirror of https://github.com/meilisearch/meilisearch.git synced 2024-11-26 20:15:07 +08:00

Author	SHA1	Message	Date
meili-bors[bot]	abd954755d	Merge #4476 4476: Make the `/facet-search` route use the `sortFacetValuesBy` setting r=irevoire a=Kerollmops This PR fixes #4423 by ensuring that the `/facet-search` route uses the `sortFacetValuesBy` setting. Note for the documentation team (to be moved in the tracking issue): Using the new `sortFacetValuesBy` setting can slow down the facet-search requests as Meilisearch iterates over the whole list of facet values and computes the count of documents on every entry. That is hardly or even impossible to optimize correctly. ### TODO - [x] Create a custom HashMap wrapper for the facet `OrderBy` settings. This wrapper will return the `OrderBy` setting of the facet, if not defined will use the default `*` one, and if not there either (strange) will fall back on the lexicographic one. - [x] Create a `ValuesCollection` wrapper that implements the logic for the lexicographic and count order by. - [x] Use it when there is no search query. - [x] Use it when there is a search query with and without allowed typos. - [x] Do not change the original logic, only use a wrapper. - [x] Add tests Co-authored-by: Clément Renault <clement@meilisearch.com>	2024-03-13 14:36:14 +00:00
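The fallback chain described in the TODO list above (facet-specific setting, then the `*` default, then lexicographic) can be sketched roughly as follows. This is a minimal illustration, not the code merged in #4476: the `OrderBy` variants, the inner `HashMap`, and the `get` method are assumptions made for the example.

```rust
use std::collections::HashMap;

/// How facet values are ordered. The variant names are illustrative.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OrderBy {
    Lexicographic,
    Count,
}

/// Hypothetical wrapper around the per-facet `sortFacetValuesBy` settings.
pub struct OrderByMap(HashMap<String, OrderBy>);

impl OrderByMap {
    pub fn new(settings: HashMap<String, OrderBy>) -> Self {
        OrderByMap(settings)
    }

    /// Resolve the order for `facet`: its own entry first, then the `*`
    /// wildcard entry, and finally the lexicographic order.
    pub fn get(&self, facet: &str) -> OrderBy {
        self.0
            .get(facet)
            .or_else(|| self.0.get("*"))
            .copied()
            .unwrap_or(OrderBy::Lexicographic)
    }
}
```

With a map containing only `"*" => Count`, a lookup for any facet name resolves to `Count`; with an empty map, every lookup resolves to `Lexicographic`.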
meili-bors[bot]	5ed7b6a0b2	Merge #4456 4456: Add Ollama as an embeddings provider r=dureuill a=jakobklemm # Pull Request ## Related issue [Related Discord Thread](https://discord.com/channels/1006923006964154428/1211977150316683305) ## What does this PR do? - Adds Ollama as a provider of Embeddings besides HuggingFace and OpenAI under the name `ollama` - Adds the environment variable `MEILI_OLLAMA_URL` to set the embeddings URL of an Ollama instance with a default value of `http://localhost:11434/api/embeddings` if no variable is set - Changes some of the structs and functions in `openai.rs` to be public so that they can be shared. - Added more error variants for Ollama specific errors - It uses the model `nomic-embed-text` as default, but any string value is allowed, however it won't automatically check if the model actually exists or is an embedding model Tested against Ollama version `v0.1.27` and the `nomic-embed-text` model. ## PR checklist Please check if your PR fulfills the following requirements: - [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)? - [x] Have you read the contributing guidelines? - [x] Have you made sure that the title is accurate and descriptive of the changes? Co-authored-by: Jakob Klemm <jakob@jeykey.net> Co-authored-by: Louis Dureuil <louis.dureuil@gmail.com>	2024-03-13 08:48:47 +00:00
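The environment-variable behaviour described in #4456 amounts to "use `MEILI_OLLAMA_URL` if set, otherwise the documented default". A small sketch of that lookup, assuming nothing about how the real embedder stores or validates the URL; both function names are hypothetical.

```rust
use std::env;

/// Default endpoint quoted in the PR description.
const DEFAULT_OLLAMA_URL: &str = "http://localhost:11434/api/embeddings";

/// Pick the override when one is given, otherwise fall back to the default.
pub fn resolve_ollama_url(override_url: Option<String>) -> String {
    override_url.unwrap_or_else(|| DEFAULT_OLLAMA_URL.to_string())
}

/// Read `MEILI_OLLAMA_URL` from the environment (hypothetical helper).
pub fn ollama_url_from_env() -> String {
    resolve_ollama_url(env::var("MEILI_OLLAMA_URL").ok())
}
```

Splitting the pure fallback from the environment read keeps the fallback testable without touching process-wide state.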
Jakob Klemm	88bc9556a9	Add Ollama dimension inference and add clearer errors Instead of the user manually specifying the model dimensions it will now automatically get determined Just like with hf.rs the word "test" gets embedded to determine the dimensions of the output Add a dedicated error type for if the model doesn't exist (don't automatically pull it though) and set the fault of that error to be the user	2024-03-12 19:59:11 +01:00
Clément Renault	ca4876fd10	Do not reindex when modifying unknown faceted field	2024-03-12 16:18:58 +01:00
Clément Renault	d3a95ea2f6	Introduce a new OrderByMap struct to simplify the sort by usage	2024-03-12 13:56:56 +01:00
meili-bors[bot]	ee3076d5ba	Merge #4462 4462: Divide threshold by ten r=dureuill a=ManyTheFish Change the facet incremental vs bulk indexing threshold to better fit our user needs, it might be changed in the future if we have more insights Co-authored-by: ManyTheFish <many@meilisearch.com>	2024-03-06 13:05:38 +00:00
Louis Dureuil	b11df7ec34	Meilisearch: fix some wrong spans	2024-03-05 10:11:43 +01:00
ManyTheFish	eada6de261	Divide threshold by ten	2024-03-04 18:02:54 +01:00
Jakob Klemm	d3004d8040	Implemented Ollama as an embeddings provider Initial prototype of Ollama embeddings actually working, error handling / retries still missing. Allow model to be any String and require dimensions parameter Fixed rustfmt formatting issues There were some formatting issues in the initial PR and this should now make the changes comply with the Rust style guidelines Because I accidentally didn't follow the style guide for commits in my commit messages I squashed them into one to comply	2024-03-04 15:09:43 +01:00
ManyTheFish	5e83bac448	Fix PR comments	2024-02-26 15:40:15 +01:00
ManyTheFish	a493a50825	Fix clippy	2024-02-22 14:53:33 +01:00
ManyTheFish	9d1f489a37	Fix facet incremental indexing	2024-02-21 18:42:16 +01:00
ManyTheFish	03bb6372af	Change is_batchable_with by mergeable_with	2024-02-14 11:50:22 +01:00
ManyTheFish	3beda8833d	Fix and add logs	2024-02-14 11:46:30 +01:00
ManyTheFish	48026aa75c	fix PR comments	2024-02-13 15:19:01 +01:00
Many the fish	e5e811e2c9	Update milli/src/update/index_documents/extract/mod.rs Co-authored-by: Clément Renault <clement@meilisearch.com>	2024-02-13 14:22:21 +01:00
Many the fish	55de96f74e	Update milli/src/update/facet/mod.rs Co-authored-by: Clément Renault <clement@meilisearch.com>	2024-02-13 14:22:10 +01:00
ManyTheFish	39c83cb3d9	fix clippy	2024-02-12 09:12:54 +01:00
Louis Dureuil	7efb1cae11	yield in loop when the channel is not disconnected	2024-02-12 09:12:54 +01:00
Louis Dureuil	7877788510	fix logs	2024-02-12 09:12:54 +01:00
ManyTheFish	be1b054b05	Compute chunk size based on the input data size and the number of indexing threads	2024-02-08 17:28:37 +01:00
meili-bors[bot]	023c2d755f	Merge #4391 4391: Tracing r=dureuill a=irevoire # Pull Request - [ ] Hide the parameters of the process batch - [x] Make actix-web trace every call on every route - [x] Remove all `env_logger`/`logs` dependencies - [x] Be able to enable or disable the memory measurement using the `/logs` route parameters See the following product discussion: https://github.com/orgs/meilisearch/discussions/721 Supersedes https://github.com/meilisearch/meilisearch/pull/4338 ## Related issue Fixes https://github.com/meilisearch/meilisearch/issues/4317 ## What does this PR do? Update the format of the logs from: ``` [2024-02-06T14:54:11Z INFO actix_server::builder] starting 10 workers ``` to ``` 2024-02-06T13:58:14.710803Z INFO actix_server::builder: 200: starting 10 workers ``` First, run meilisearch with the route enabled via the feature flag: - `cargo run --experimental-enable-logs-route` - Or at runtime by sending the following payload: ``` curl \ -X PATCH 'http://localhost:7700/experimental-features/' \ -H 'Content-Type: application/json' \ --data-binary '{ "logsRoute": true }' ``` Then gather data from meilisearch by calling for example: ``` curl \ -X POST http://localhost:7700/logs \ -H 'Content-Type: application/json' \ --data-binary '{ "mode": "fmt", "target": "milli=trace" }' ``` Once your operation is over, tell meilisearch to stop the route: ``` curl \ -X DELETE http://localhost:7700/logs ``` ---- In the case you’re profiling code, you will be interested by the next command that converts the output of the route to a format that the firefox profiler can understand. ```bash cargo run --release --bin trace-to-firefox -- 2024-01-17_17:07:55-indexing-trace.json ``` Then go to https://profiler.firefox.com and load it. Note that we can also share the profiles using the https://share.firefox.dev website. 
Co-authored-by: Louis Dureuil <louis@meilisearch.com> Co-authored-by: Clément Renault <clement@meilisearch.com> Co-authored-by: Tamo <tamo@meilisearch.com>	2024-02-08 14:16:56 +00:00
Louis Dureuil	407ad753ed	rust fmt	2024-02-08 15:11:42 +01:00
Tamo	bf43a3f60a	fix typo	2024-02-08 15:04:06 +01:00
Tamo	1502382316	use debug instead of debug_span	2024-02-08 15:04:06 +01:00
Tamo	08af0e690c	Structures a bunch of logs	2024-02-08 15:04:06 +01:00
Louis Dureuil	db722d201a	Write entries into database downgraded to trace level	2024-02-08 15:04:05 +01:00
Tamo	e773dfa9ba	get rid of log in milli and add logs for the bucket sort	2024-02-08 15:04:05 +01:00
Louis Dureuil	5d7061682e	Add tracing to milli	2024-02-08 15:03:31 +01:00
meili-bors[bot]	72ebac1fbb	Merge #4388 4388: Cap the maximum memory of the grenad sorters r=curquiza a=Kerollmops This PR clamps the memory usage of the grenad sorters to a reasonable maximum. Grenad sorters are opened on multiple threads at a time. This can result in higher memory usage than expected, even though it shouldn't consume more than the memory available. Fixes #4152. Co-authored-by: Clément Renault <clement@meilisearch.com>	2024-02-08 13:19:28 +00:00
Louis Dureuil	88d03c56ab	Don't accept dimensions of 0 (ever) or dimensions greater than the default dimensions of the model	2024-02-07 11:52:09 +01:00
Louis Dureuil	517f5332d6	Allow actually passing `dimensions` for OpenAI source -> make sure the settings change is rejected or the settings task fails when the specified model doesn't support overriding `dimensions` and the passed `dimensions` differs from the model's default dimensions.	2024-02-07 11:51:44 +01:00
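The two commits above describe three rejection rules for a user-supplied `dimensions` value: never zero, never above the model's default, and no override at all when the model does not support one. A sketch of those checks under that reading; the error type, the function name, and the `supports_override` flag are all hypothetical and do not reflect the actual settings-validation code.

```rust
/// Illustrative error type for rejected `dimensions` values.
#[derive(Debug, PartialEq, Eq)]
pub enum DimensionsError {
    Zero,
    TooLarge { requested: usize, max: usize },
    OverrideUnsupported { requested: usize, default: usize },
}

/// Validate a requested dimension count against a model's default.
pub fn validate_dimensions(
    requested: usize,
    model_default: usize,
    supports_override: bool,
) -> Result<usize, DimensionsError> {
    // A dimension of 0 is never accepted.
    if requested == 0 {
        return Err(DimensionsError::Zero);
    }
    // Nor is anything larger than the model's default.
    if requested > model_default {
        return Err(DimensionsError::TooLarge { requested, max: model_default });
    }
    // Models that cannot override only accept their default value.
    if !supports_override && requested != model_default {
        return Err(DimensionsError::OverrideUnsupported {
            requested,
            default: model_default,
        });
    }
    Ok(requested)
}
```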
Clément Renault	053306c0e7	Try with 500MiB	2024-02-07 11:24:43 +01:00
Clément Renault	9eeb75d501	Clamp the max memory of the grenad sorters to a reasonable maximum	2024-02-06 10:47:04 +01:00
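Clamping a sorter's memory budget, as the two commits above describe, is essentially a `min` against a fixed cap. The sketch below takes 500 MiB from the "Try with 500MiB" commit message. Whether that value is the one finally merged, and where the clamp is applied, are assumptions; the constant and function names are made up for the example.

```rust
/// Cap taken from the "Try with 500MiB" commit message (assumed, not verified).
const MAX_SORTER_MEMORY: usize = 500 * 1024 * 1024;

/// Clamp an optional per-sorter memory budget (in bytes) to the cap.
/// `None` (no explicit budget) is passed through unchanged.
pub fn clamped_sorter_memory(requested: Option<usize>) -> Option<usize> {
    requested.map(|bytes| bytes.min(MAX_SORTER_MEMORY))
}
```

Because several sorters are opened on different threads at once, a per-sorter cap bounds the total at roughly the cap times the number of concurrent sorters.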
Louis Dureuil	fbf5f2a392	Don't use a runtime in extract_embedder, use it only for OpenAI	2024-02-01 10:33:27 +01:00
Tamo	9f8f3105d5	make clippy happy	2024-02-01 10:33:27 +01:00
Tamo	318843aacd	add a bunch of tests and fix the error message when adding the geosearch as filterable/sortable while there are malformed documents in the DB	2024-02-01 10:33:27 +01:00
Tamo	c1bf33a112	Revert "Remove panic on the geosearch"	2024-01-25 18:51:19 +01:00
Tamo	0887186ecf	make clippy happy	2024-01-17 16:07:10 +01:00
Tamo	7d190d8078	add a bunch of tests and fix the error message when adding the geosearch as filterable/sortable while there are malformed documents in the DB	2024-01-17 15:51:52 +01:00
Clément Renault	01e2c3d6bb	Bump arroy to v0.2.0	2024-01-16 16:45:55 +01:00
Clément Renault	9f9ad4cc05	Fix Clippy warnings	2024-01-16 15:27:24 +01:00
Clément Renault	3ee7682fa7	Fix some integer comparisons	2024-01-16 15:22:23 +01:00
Tamo	54ae6951eb	fix warning	2024-01-02 15:19:30 +01:00
Louis Dureuil	6ff81de401	Fix tests	2023-12-20 17:16:46 +01:00
Louis Dureuil	9123370e90	Validate fused settings in settings task after fusing with existing setting	2023-12-20 17:16:46 +01:00
Louis Dureuil	e249e4db7b	Change Setting::apply function signature	2023-12-20 17:15:24 +01:00
Many the fish	9e1b458010	Merge branch 'main' into change-proximity-precision-settings	2023-12-18 09:08:47 +01:00
ManyTheFish	6425996e36	Change the naming of attributeScale and wordScale into byAttribute and byWord	2023-12-14 16:31:00 +01:00
Louis Dureuil	87bba98bd8	Various changes - fixed seed for arroy - check vector dimensions as soon as it is provided to search - don't embed whitespace	2023-12-14 16:08:42 +01:00
Louis Dureuil	b8e4709dfa	Remove prompt strategy and fallback	2023-12-14 16:08:41 +01:00
Louis Dureuil	806e5b6899	Tests pass	2023-12-14 16:08:41 +01:00
Louis Dureuil	e0cc775dc4	Various changes - DistributionShift in Search object (to be set from model in embed?) - Fix issue where embedder index wasn't computed at search time - Accept as default embedder either the "default" one, or the only embedder when there is only one	2023-12-14 16:08:41 +01:00
Louis Dureuil	12940d79a9	WIP - manual embedder - multi embedders OK - clippy + tests OK	2023-12-14 16:08:41 +01:00
Louis Dureuil	922a640188	WIP multi embedders fixed template bugs	2023-12-14 16:08:41 +01:00
Louis Dureuil	65e49b7092	Remove stuff, add distribution shift (WIP)	2023-12-14 16:08:38 +01:00
Louis Dureuil	e56f160032	Actually pass embedders on reindex	2023-12-14 16:07:49 +01:00
Louis Dureuil	687d92f217	prompt bifluor+	2023-12-14 16:07:49 +01:00
Louis Dureuil	fb539f61fe	WIP	2023-12-14 16:07:49 +01:00
Louis Dureuil	cb4ebe163e	WIP	2023-12-14 16:07:49 +01:00
Louis Dureuil	dde3a04679	WIP arroy integration	2023-12-14 16:07:49 +01:00
Louis Dureuil	13c2c6c16b	Small commit to add hybrid search and autoembedding	2023-12-14 16:07:48 +01:00
ManyTheFish	467b49153d	Implement proximityPrecision setting on milli side	2023-12-06 15:49:02 +01:00
ManyTheFish	bddc168d83	List TODOs	2023-12-06 14:59:23 +01:00
Clément Renault	d32eb11329	Move to the v0.20.0-alpha.9 of heed	2023-11-27 11:52:22 +01:00
Clément Renault	0dbf1a16ff	Make clippy happy	2023-11-23 14:11:38 +01:00
Clément Renault	462b4c0080	Fix the tests	2023-11-23 12:07:35 +01:00
Clément Renault	0d4482625a	Make the changes to use heed v0.20-alpha.6	2023-11-23 11:43:58 +01:00
ManyTheFish	d3575fb028	Make into_del_add_obkv parameters more human readable	2023-11-20 16:10:39 +01:00
ManyTheFish	39cbb499c2	Small fixes	2023-11-20 10:20:39 +01:00
ManyTheFish	ebef6bc24d	Simplify documents database writing	2023-11-20 10:14:57 +01:00
ManyTheFish	d59b7db8d0	remove unused code	2023-11-20 10:10:45 +01:00
ManyTheFish	263e825619	Fix typos in comments	2023-11-20 10:06:29 +01:00
Many the fish	b0adc73ce6	Merge pull request #4207 from meilisearch/diff-indexing-prefix-databases Diff indexing prefix databases	2023-11-14 16:04:05 +01:00
Louis Dureuil	772964125d	Factor removal of document from DB	2023-11-13 13:51:22 +01:00
Louis Dureuil	264b10ec20	Fixup documentation	2023-11-09 16:23:20 +01:00
Louis Dureuil	3053e01c05	Batch::remove_documents_from_db_no_batch	2023-11-09 14:23:02 +01:00
Louis Dureuil	9cef800b2a	Enrich uses the new type	2023-11-09 14:22:05 +01:00
ManyTheFish	882ab9cc85	remove warnings	2023-11-09 11:35:33 +01:00
ManyTheFish	5a9c96e1db	Compute word integer prefix cache	2023-11-09 11:34:26 +01:00
ManyTheFish	70ce40828c	Compute word docids prefix cache	2023-11-08 17:01:00 +01:00
ManyTheFish	688266c83e	Remove word pair proximity prefix cache and compute it at search time	2023-11-08 14:16:01 +01:00
ManyTheFish	6dab826908	Reactivate prefix databases	2023-11-08 13:58:01 +01:00
ManyTheFish	1e2fbc6a42	revert "REVERT ME: ignore prefix pair databases tests" This reverts commit `1b2ea6cf19`.	2023-11-08 11:50:52 +01:00
Louis Dureuil	cbaa54cafd	Fix clippy issues	2023-11-06 11:19:31 +01:00
Louis Dureuil	1bccf2079e	Correctly mark non-tests as non-tests	2023-11-06 11:03:56 +01:00
ManyTheFish	1b2ea6cf19	REVERT ME: ignore prefix pair databases tests	2023-11-06 10:46:22 +01:00
Louis Dureuil	1ad1fcc8c8	Remove all warnings	2023-11-06 10:31:14 +01:00
ManyTheFish	87610a5f98	Don't try to delete a document that is not in the database	2023-11-02 16:49:03 +01:00
Clément Renault	ff522c919d	Fix the vector extractions for the diff indexing	2023-11-02 15:58:08 +01:00
ManyTheFish	bf0651f23c	Implement iter method on ExternalDocumentsIds	2023-11-02 15:38:00 +01:00
ManyTheFish	5b20e625f3	fix merge	2023-11-02 15:31:37 +01:00
ManyTheFish	bc51d6157a	Fix transform reindexing path	2023-11-02 15:26:20 +01:00
ManyTheFish	1b4ff991c0	update typed chunks	2023-11-02 15:26:20 +01:00
ManyTheFish	4b64c33aa2	update vector extractor	2023-11-02 15:26:20 +01:00
ManyTheFish	12323d610e	Change the original document sorter key from the internal docid to a concatenation of the internal and the external docid	2023-11-02 15:26:20 +01:00
Clément Renault	4d864f0702	Always sort internal Sorter entries in parallel	2023-11-02 14:47:43 +01:00
Clément Renault	c71b1d33ae	Sort entries using rayon in the transform sorters	2023-11-01 11:07:16 +01:00
Clément Renault	0fc446c62f	Add more timing logs to the Transform	2023-11-01 11:07:16 +01:00
Louis Dureuil	0fb6acefc3	Add snapshots for facets	2023-10-31 17:11:08 +01:00
Louis Dureuil	b1d1355b69	remove tests on soft-deleted	2023-10-31 16:36:27 +01:00
Louis Dureuil	f19332466e	Extract field value as values instead of Option<Value>	2023-10-31 16:36:27 +01:00
Louis Dureuil	03ddb4f310	use deladd in facet update tests	2023-10-31 16:36:27 +01:00
Louis Dureuil	da0503ef80	Fix document count	2023-10-31 16:36:27 +01:00
Louis Dureuil	b40253bf18	update snapshots	2023-10-31 10:30:48 +01:00
Louis Dureuil	d8bf3f3fc2	Remove unused snapshots	2023-10-31 10:12:49 +01:00
Louis Dureuil	9d59e8011a	fix some tests	2023-10-31 10:08:36 +01:00
Louis Dureuil	dad78cbf8d	Bulk facet remove deletes keys from DB when value empty	2023-10-31 09:53:55 +01:00
Louis Dureuil	4e91707a06	Rename test	2023-10-31 09:41:17 +01:00
Louis Dureuil	de10f20732	Fix field distribution again	2023-10-30 17:47:22 +01:00
Louis Dureuil	be395c7944	Change order of arguments to tokenizer_builder	2023-10-30 16:26:29 +01:00
Louis Dureuil	9fedd8101a	Fix tests	2023-10-30 15:11:07 +01:00
Louis Dureuil	54d07a8da3	Update field distribution taking into account both deletions and additions	2023-10-30 14:47:51 +01:00
Louis Dureuil	58690dfb19	Fix tests compilation after changes to ExternalDocumentsIds API	2023-10-30 13:34:07 +01:00
Louis Dureuil	abf424ebfc	Remove unused FromIterator	2023-10-30 11:41:56 +01:00
Clément Renault	dfab6293c9	Use an LMDB database to store the external documents ids	2023-10-30 11:41:23 +01:00
Louis Dureuil	fdf3f7f627	Fix facet distribution test	2023-10-30 11:41:23 +01:00
Louis Dureuil	6260cff65f	Actually delete documents from DB when the merge function says so	2023-10-30 11:41:22 +01:00
Louis Dureuil	8e0d9c9a5e	Recover delete_documents tests that were too eagerly deleted	2023-10-30 11:41:22 +01:00
Louis Dureuil	a35988550c	Fix some snapshots	2023-10-30 11:41:22 +01:00
Louis Dureuil	e78281785c	Actually execute the transform even if there are only documents to delete	2023-10-30 11:41:22 +01:00
Louis Dureuil	290e773d23	remove more warnings and fix some tests	2023-10-30 11:41:22 +01:00
Louis Dureuil	113527f466	Remove soft-deleted related methods from Index	2023-10-30 11:41:22 +01:00
Louis Dureuil	c534a1b687	Stop using delete documents pipeline in batch runner	2023-10-30 11:41:22 +01:00
Louis Dureuil	2263dff02b	Stop using removed delete pipelines almost everywhere	2023-10-30 11:41:22 +01:00
Louis Dureuil	d651b3ef01	Remove delete documents files	2023-10-30 11:41:20 +01:00
ManyTheFish	762b0b47e6	Use deladd merging function in chunks mergers	2023-10-30 11:40:20 +01:00
Louis Dureuil	01d5eedf2f	Remove some warnings	2023-10-30 11:40:20 +01:00
Louis Dureuil	073f89db79	Fix facet tests	2023-10-30 11:40:20 +01:00
Louis Dureuil	85f42fbc03	Handle external to internal id mapping from TypedChunk::Documents	2023-10-30 11:40:20 +01:00
Louis Dureuil	c6b3c18c85	WIP: Comment out document deletion in other pipelines than update TODO: fix calls to DELETE route	2023-10-30 11:40:20 +01:00
Louis Dureuil	946c762d28	WIP: reset documents in TypedChunk::Documents	2023-10-30 11:40:20 +01:00
Louis Dureuil	cda6ca1ee6	Remove TypedChunk::NewDocumentIds	2023-10-30 11:40:18 +01:00
Louis Dureuil	696fcf4d18	Fix document insertion into LMDB	2023-10-30 11:39:31 +01:00
ManyTheFish	476e4d3dbe	Use value buffer instead of the initial value when writing the final result in the sorter	2023-10-30 11:39:31 +01:00
Clément Renault	576fa9c6da	Remove useless comment	2023-10-30 11:39:31 +01:00
Kerollmops	77dcbff6b2	Remove and Insert the DelAdd geo points	2023-10-30 11:39:31 +01:00
Kerollmops	544440c363	Ignore geo fields when the Del and Add content is the same	2023-10-30 11:39:31 +01:00
Clément Renault	a3dae4db9b	Extract the geo fields DelAdd and generate a new DelAdd obkv with it	2023-10-30 11:39:31 +01:00
ManyTheFish	ba90a5ec0e	update extract fid word count docids	2023-10-30 11:39:31 +01:00
Louis Dureuil	b26dc9aabe	Explanatory code comment	2023-10-30 11:39:31 +01:00
Louis Dureuil	66abac9364	Use specialized `KvReaderDelAdd` type Co-authored-by: Clément Renault <clement@meilisearch.com>	2023-10-30 11:39:31 +01:00
Louis Dureuil	59f88c14b3	Simplify facet update after removing `Index::faceted_documents_ids`	2023-10-30 11:39:29 +01:00
Louis Dureuil	14832cb324	Remove Index::faceted_documents_ids	2023-10-30 11:37:32 +01:00
Louis Dureuil	04ec293024	Facet Incremental update	2023-10-30 11:37:30 +01:00
Louis Dureuil	f67ff3a738	Facets Bulk update	2023-10-30 11:36:40 +01:00
Clément Renault	560e8f5613	Introduce the CboRoaringBitmapCodec merge_deladd_into and use it	2023-10-30 11:34:55 +01:00
Clément Renault	2d3f15f82c	Introduce a function to only serialize the Add side of a DelAdd obkv	2023-10-30 11:34:55 +01:00
Clément Renault	40186bf403	Rename FieldIdWordCountDocids correctly	2023-10-30 11:34:50 +01:00
ManyTheFish	87e3d27878	update extract word pair proximity to support deladd obkvs	2023-10-30 11:34:02 +01:00
ManyTheFish	6bcf8b4f8c	update extract word position docids	2023-10-30 11:34:02 +01:00
ManyTheFish	46aa75abdb	update extract word docids	2023-10-30 11:34:02 +01:00
ManyTheFish	2597bbd107	Make script language docids map taking a tuple of roaring bitmaps expressing the deletions and the additions	2023-10-30 11:34:00 +01:00
Clément Renault	e2bc054604	Update extract_facet_string_docids to support deladd obkvs	2023-10-30 11:32:36 +01:00
Clément Renault	fcd3a1434d	Update extract_facet_number_docids to support deladd obkvs	2023-10-30 11:31:04 +01:00
Clément Renault	a82dee21e0	Rename docid_fid into fid_docid	2023-10-30 11:31:02 +01:00
Clément Renault	bc45c1206d	Implement all the facet extraction paths and simplify them	2023-10-30 11:29:08 +01:00
Clément Renault	6ae4100f07	Generate the DelAdd for is_null, is_empty, and exists	2023-10-30 11:29:08 +01:00
Clément Renault	0c47defeee	Work on fid docid facet values rewrite	2023-10-30 11:29:06 +01:00
ManyTheFish	313b16bec2	Support diff indexing on extract_docid_word_positions	2023-10-30 11:24:19 +01:00
ManyTheFish	1dd97578a8	Make the transform struct return diff-based documents obkvs	2023-10-30 11:22:07 +01:00
ManyTheFish	f5ef69293b	deactivate prefix dbs	2023-10-30 11:22:07 +01:00
ManyTheFish	1c5705c164	clean PR warnings	2023-10-30 11:22:05 +01:00
ManyTheFish	66c2c82a18	Split wpp in several sorters	2023-10-30 11:15:02 +01:00
ManyTheFish	28a8d0ccda	Fix word pair proximity	2023-10-30 11:15:02 +01:00
ManyTheFish	96be85396d	Use a vecDeque in wpp database	2023-10-30 11:15:02 +01:00
ManyTheFish	df9e5c8651	Generalize usage of CboRoaringBitmap codec to ease the use	2023-10-30 11:15:02 +01:00
ManyTheFish	b541d48847	Add buffer to the obkv writer	2023-10-30 11:15:02 +01:00
ManyTheFish	8ccf32d1a0	Compute word_fid_docids before word_docids and exact_word_docids	2023-10-30 11:15:02 +01:00
ManyTheFish	db1ca21231	add puffin in sorter into reader function	2023-10-30 11:15:00 +01:00
ManyTheFish	11ea5acff9	Fix	2023-10-30 11:13:10 +01:00
ManyTheFish	8d77736a67	Fix fid_word_docids	2023-10-30 11:13:10 +01:00
ManyTheFish	748b333161	Add useful debug assert before key insertion in database	2023-10-30 11:13:10 +01:00
ManyTheFish	17b647dfe5	Wip	2023-10-30 11:13:08 +01:00
meili-bors[bot]	5e0485d8dd	Merge #4131 4131: Reduce proximity range from 7 to 3 r=Kerollmops a=ManyTheFish ## Summary This PR aims to reduce the impact of the proximity databases on the indexing time and on the database size by reducing the maximum distance between two words to be indexed in the proximity database. ## Stats ### Impact on database size and indexing time ![Impact on datasets](https://github.com/meilisearch/meilisearch/assets/6482087/28ed3d96-bdde-41c1-bdac-e90c1b1dbb23) ### Impact on search relevancy <details> \| dataset_name \| host_name \| Relevancy rate (Precision) \| completion_rate 25.00% \| completion_rate 50.00% \| completion_rate 75.00% \| completion_rate 100.00% \| \|--------------\|------------------\|------------------------------------\|-----------------\|-----------------\|-----------------\|-----------------\| \| FBIS \| 1_4_0 \| percentile-10 \| 0.00% \| 0.00% \| 0.00% \| 0.00% \| \| FBIS \| 1_4_0 \| percentile-25 \| 0.00% \| 0.00% \| 0.00% \| 0.00% \| \| FBIS \| 1_4_0 \| percentile-50 \| 0.00% \| 0.00% \| 5.00% \| 5.56% \| \| FBIS \| 1_4_0 \| percentile-75 \| 0.00% \| 12.50% \| 35.00% \| 45.00% \| \| FBIS \| 1_4_0 \| percentile-90 \| 20.00% \| 40.00% \| \| 100.00% \| \| FBIS \| 1_4_0 \| average \| 5.78% \| 11.16% \| 21.90% \| 26.29% \| \| FBIS \| reduce_proximity \| percentile-10 \| 0.00% \| 0.00% \| 0.00% \| 0.00% \| \| FBIS \| reduce_proximity \| percentile-25 \| 0.00% \| 0.00% \| 0.00% \| 0.00% \| \| FBIS \| reduce_proximity \| percentile-50 \| 0.00% \| 0.00% \| 5.00% \| 5.56% \| \| FBIS \| reduce_proximity \| percentile-75 \| 0.00% \| 15.00% \| 35.00% \| 40.00% \| \| FBIS \| reduce_proximity \| percentile-90 \| 20.00% \| 40.00% \| 85.00% \| 100.00% \| \| FBIS \| reduce_proximity \| average \| 5.55% \| 11.34% \| 21.75% \| 26.14% \| \| FR94 \| 1_4_0 \| percentile-10 \| 0.00% \| 0.00% \| 0.00% \| 0.00% \| \| FR94 \| 1_4_0 \| percentile-25 \| 0.00% \| 0.00% \| 0.00% \| 0.00% \| \| FR94 \| 1_4_0 \| percentile-50 \| 0.00% \| 0.00% \| 
0.00% \| 0.00% \| \| FR94 \| 1_4_0 \| percentile-75 \| 0.00% \| 5.00% \| 15.00% \| 42.11% \| \| FR94 \| 1_4_0 \| percentile-90 \| 15.00% \| 54.55% \| 100.00% \| 100.00% \| \| FR94 \| 1_4_0 \| average \| 5.95% \| 12.07% \| 18.70% \| 25.57% \| \| FR94 \| reduce_proximity \| percentile-10 \| 0.00% \| 0.00% \| 0.00% \| 0.00% \| \| FR94 \| reduce_proximity \| percentile-25 \| 0.00% \| 0.00% \| 0.00% \| 0.00% \| \| FR94 \| reduce_proximity \| percentile-50 \| 0.00% \| 0.00% \| 0.00% \| 0.00% \| \| FR94 \| reduce_proximity \| percentile-75 \| 0.00% \| 5.00% \| 15.00% \| 42.11% \| \| FR94 \| reduce_proximity \| percentile-90 \| 15.00% \| 54.55% \| 100.00% \| 100.00% \| \| FR94 \| reduce_proximity \| average \| 5.79% \| 12.00% \| 18.70% \| 25.53% \| \| FT \| 1_4_0 \| percentile-10 \| 0.00% \| 0.00% \| 0.00% \| 0.00% \| \| FT \| 1_4_0 \| percentile-25 \| 0.00% \| 0.00% \| 0.00% \| 0.00% \| \| FT \| 1_4_0 \| percentile-50 \| 0.00% \| 0.00% \| 5.00% \| 10.00% \| \| FT \| 1_4_0 \| percentile-75 \| 0.00% \| 15.00% \| 30.00% \| 40.00% \| \| FT \| 1_4_0 \| percentile-90 \| 20.00% \| 50.00% \| 65.00% \| 100.00% \| \| FT \| 1_4_0 \| average \| 5.08% \| 12.58% \| 20.00% \| 25.49% \| \| FT \| reduce_proximity \| percentile-10 \| 0.00% \| 0.00% \| 0.00% \| 0.00% \| \| FT \| reduce_proximity \| percentile-25 \| 0.00% \| 0.00% \| 0.00% \| 0.00% \| \| FT \| reduce_proximity \| percentile-50 \| 0.00% \| 0.00% \| 5.00% \| 10.00% \| \| FT \| reduce_proximity \| percentile-75 \| 0.00% \| 15.00% \| 30.00% \| 40.00% \| \| FT \| reduce_proximity \| percentile-90 \| 10.00% \| 45.00% \| 60.00% \| 100.00% \| \| FT \| reduce_proximity \| average \| 5.01% \| 12.64% \| 20.10% \| 25.53% \| \| LAT \| 1_4_0 \| percentile-10 \| 0.00% \| 0.00% \| 0.00% \| 0.00% \| \| LAT \| 1_4_0 \| percentile-25 \| 0.00% \| 0.00% \| 0.00% \| 0.00% \| \| LAT \| 1_4_0 \| percentile-50 \| 0.00% \| 0.00% \| 5.00% \| 5.00% \| \| LAT \| 1_4_0 \| percentile-75 \| 5.00% \| 15.00% \| 30.00% \| 30.00% \| \| LAT \| 1_4_0 \| 
percentile-90 \| 15.00% \| 45.00% \| 60.00% \| 80.00% \| \| LAT \| 1_4_0 \| average \| 4.80% \| 11.80% \| 17.88% \| 21.62% \| \| LAT \| reduce_proximity \| percentile-10 \| 0.00% \| 0.00% \| 0.00% \| 0.00% \| \| LAT \| reduce_proximity \| percentile-25 \| 0.00% \| 0.00% \| 0.00% \| 0.00% \| \| LAT \| reduce_proximity \| percentile-50 \| 0.00% \| 0.00% \| 5.00% \| 5.00% \| \| LAT \| reduce_proximity \| percentile-75 \| 0.00% \| 11.11% \| 25.00% \| 35.00% \| \| LAT \| reduce_proximity \| percentile-90 \| 15.00% \| 45.00% \| 55.00% \| 80.00% \| \| LAT \| reduce_proximity \| average \| 4.43% \| 11.23% \| 17.32% \| 21.45% \| </details> ### Impact on Search time \| dataset_name \| host_name \| 25.00% \| 50.00% \| 75.00% \| 100.00% \| Average \| \|--------------\|------------------\|------------:\|------------:\|------------:\|------------:\|-------------\| \| FBIS \| 1_4_0 \| 3.45 \| 7.446666667 \| 9.773489933 \| 9.620300752 \| 7.572614338 \| \| FBIS \| reduce_proximity \| 2.983333333 \| 5.316666667 \| 6.911073826 \| 7.637218045 \| 5.712072968 \| \| FR94 \| 1_4_0 \| 2.236666667 \| 4.45 \| 5.523489933 \| 4.560150376 \| 4.192576744 \| \| FR94 \| reduce_proximity \| 2.09 \| 3.991666667 \| 4.981543624 \| 4.266917293 \| 3.832531896 \| \| FT \| 1_4_0 \| 5.956666667 \| 9.656666667 \| 13.86912752 \| 10.83270677 \| 10.0787919 \| \| FT \| reduce_proximity \| 4.51 \| 5.981666667 \| 7.701342282 \| 6.766917293 \| 6.23998156 \| \| LAT \| 1_4_0 \| 5.856666667 \| 9.233333333 \| 12.98322148 \| 10.78759398 \| 9.715203865 \| \| LAT \| reduce_proximity \| 6.91 \| 6.706666667 \| 8.463087248 \| 8.265037594 \| 7.586197877 \| ## Technical approach - Ensure the MAX_DISTANCE constant is used everywhere needed - Reduce the MAX_DISTANCE from 8 to 4 ## Related TBD Co-authored-by: ManyTheFish <many@meilisearch.com>	2023-10-18 14:56:08 +00:00
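To illustrate why lowering `MAX_DISTANCE` shrinks the proximity database, the sketch below enumerates the word pairs that would be recorded for a sequence of words. It is a deliberately simplified model: it assumes the bound is exclusive (so `MAX_DISTANCE = 4` keeps proximities 1 to 3, matching the PR title), measures distance as a plain index difference, and uses a hypothetical function name. The real extractor works on token positions and is more involved.

```rust
/// Exclusive upper bound on the recorded distance (value from the PR text).
const MAX_DISTANCE: usize = 4;

/// Return `(left, right, distance)` for every ordered pair of words whose
/// index difference is below `MAX_DISTANCE`.
pub fn word_pair_proximities<'a>(words: &[&'a str]) -> Vec<(&'a str, &'a str, usize)> {
    let mut pairs = Vec::new();
    for (i, &left) in words.iter().enumerate() {
        for (offset, &right) in words[i + 1..].iter().enumerate() {
            let distance = offset + 1;
            if distance >= MAX_DISTANCE {
                // Words further to the right are even more distant.
                break;
            }
            pairs.push((left, right, distance));
        }
    }
    pairs
}
```

In this model each word contributes at most `MAX_DISTANCE - 1` pairs, so halving the constant from 8 to 4 cuts the per-word upper bound from 7 pairs to 3.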
ManyTheFish	27eec21415	Fix tests	2023-10-18 16:03:22 +02:00
Clément Renault	62dfd09dc6	Add more puffin logs to the deletion functions	2023-10-13 13:11:09 +02:00
Tamo	c0f2724c2d	get rid of the newly introduced error code in favor of an io::Error	2023-10-10 15:12:23 +02:00
Tamo	d772073dfa	use a bufreader every time there is a grenad<file>	2023-10-10 15:00:30 +02:00
meili-bors[bot]	487d493f49	Merge #4043 4043: Bring back hotfixes from v1.3.3 into v1.4.0 r=Kerollmops a=curquiza Co-authored-by: curquiza <curquiza@users.noreply.github.com> Co-authored-by: meili-bors[bot] <89034592+meili-bors[bot]@users.noreply.github.com> Co-authored-by: Kerollmops <clement@meilisearch.com> Co-authored-by: curquiza <clementine@meilisearch.com>	2023-09-11 12:27:34 +00:00
meili-bors[bot]	256cf33bca	Merge #4039 4039: Fix multiple vectors dimensions r=ManyTheFish a=Kerollmops This PR fixes #4035, making providing multiple vectors in documents possible. This is fixed by extracting the vectors from the non-flattened version of the documents. Co-authored-by: Kerollmops <clement@meilisearch.com>	2023-09-07 09:25:58 +00:00
Kerollmops	679c0b0f97	Extract the vectors from the non-flattened version of the documents	2023-09-06 12:26:00 +02:00
Kerollmops	e02d0064bd	Add a test case scenario	2023-09-06 12:26:00 +02:00
meili-bors[bot]	dc3d9c90d9	Merge #3994 3994: Fix synonyms with separators r=Kerollmops a=ManyTheFish # Pull Request ## Related issue Fixes #3977 ## Available prototype ``` $ docker pull getmeili/meilisearch:prototype-fix-synonyms-with-separators-0 ``` ## What does this PR do? - add a new test - filter the empty synonyms after normalization Co-authored-by: ManyTheFish <many@meilisearch.com>	2023-09-05 14:42:46 +00:00
ManyTheFish	66aa6d5871	Ignore tokens with empty normalized value during indexing process	2023-09-05 15:44:14 +02:00
Kerollmops	8ac5b765bc	Fix synonyms normalization	2023-09-04 16:12:48 +02:00
Kerollmops	085aad0a94	Add a test	2023-09-04 14:39:33 +02:00
meili-bors[bot]	ccf3ba3f32	Merge #4019 4019: Bringing back changes from `v1.3.2` onto `main` r=irevoire a=Kerollmops Co-authored-by: Kerollmops <clement@meilisearch.com> Co-authored-by: meili-bors[bot] <89034592+meili-bors[bot]@users.noreply.github.com> Co-authored-by: irevoire <irevoire@users.noreply.github.com> Co-authored-by: Clément Renault <clement@meilisearch.com>	2023-08-28 12:14:11 +00:00
Kerollmops	c53841e166	Accept the null JSON value as the value of _vectors	2023-08-14 16:03:55 +02:00
meili-bors[bot]	e4e49e63d0	Merge #3993 3993: Bringing back changes from v1.3.1 to `main` r=irevoire a=curquiza Co-authored-by: irevoire <irevoire@users.noreply.github.com> Co-authored-by: meili-bors[bot] <89034592+meili-bors[bot]@users.noreply.github.com> Co-authored-by: Tamo <tamo@meilisearch.com> Co-authored-by: ManyTheFish <many@meilisearch.com>	2023-08-10 14:30:02 +00:00
ManyTheFish	5a7c1bde84	Fix clippy	2023-08-10 11:27:56 +02:00
ManyTheFish	6b2d671be7	Fix PR comments	2023-08-10 10:44:07 +02:00
Many the fish	43c13faeda	Update milli/src/update/index_documents/extract/extract_docid_word_positions.rs Co-authored-by: Tamo <tamo@meilisearch.com>	2023-08-10 10:05:03 +02:00
meili-bors[bot]	44c1900f36	Merge #3986 3986: Fix geo bounding box with strings r=ManyTheFish a=irevoire # Pull Request When sending a document with one geofield of type string (e.g. `{ "_geo": { "lat": 12, "lng": "13" }}`), the geobounding box would exclude this document. This PR fixes this issue by automatically parsing the string value in case we're working on a geofield. ## Related issue Fixes https://github.com/meilisearch/meilisearch/issues/3973 ## What does this PR do? - Automatically parse the facet value if and only if we're working on a geofield. - Make insta work with snapshots in loops or closures executed multiple times. (you may need to update your cli if it panics after this PR: `cargo install cargo-insta`). - Add one integration test in milli and in meilisearch to ensure it works forever. - Add three snapshots for the dump that mysteriously disappeared I don't know how Co-authored-by: Tamo <tamo@meilisearch.com>	2023-08-09 07:58:15 +00:00
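The fix described above comes down to accepting a geo coordinate whether it arrives as a JSON number or as a string holding a number. A dependency-free sketch of that idea; `GeoValue` stands in for a JSON value and `geo_coordinate` is a hypothetical helper, neither taken from the actual patch.

```rust
/// Stand-in for a JSON value found under `_geo.lat` or `_geo.lng`.
pub enum GeoValue {
    Number(f64),
    Text(String),
}

/// Convert a geo field to a float, parsing strings such as "13".
/// Non-numeric or non-finite inputs yield `None`.
pub fn geo_coordinate(value: &GeoValue) -> Option<f64> {
    let parsed = match value {
        GeoValue::Number(n) => Some(*n),
        GeoValue::Text(s) => s.trim().parse::<f64>().ok(),
    };
    parsed.filter(|v| v.is_finite())
}
```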
ManyTheFish	8dc5acf998	Try fix	2023-08-08 16:52:36 +02:00
ManyTheFish	35758db9ec	Truncate the normalized long facets used in search for facet value	2023-08-08 16:38:30 +02:00
Tamo	9d061cec26	automatically parse the filterable attribute to float if it's a geo field	2023-08-08 16:28:07 +02:00
ManyTheFish	4a21fecf67	Merge branch 'main' into settings-customizing-tokenization	2023-08-08 16:08:16 +02:00
ManyTheFish	b45c36cd71	Merge branch 'main' into tmp-release-v1.3.0	2023-08-01 15:05:17 +02:00
ManyTheFish	9d5e3457e5	Fix clippy	2023-07-27 14:21:19 +02:00
ManyTheFish	b0c1a9504a	ensure the synonyms are updated when the tokenizer settings are changed	2023-07-26 09:33:42 +02:00
meili-bors[bot]	be72be7c0d	Merge #3942 3942: Normalize for the search the facets values r=ManyTheFish a=Kerollmops This PR improves and fixes the search for facet values feature. Searching for _bre_ wasn't returning facet values like _brévent_ or _brô_. The issue was related to the fact that facets are normalized but not in the same way as the `searchableAttributes` are. We decided to normalize them further and add another intermediate database where the key is the normalized facet value, and the value is a set of the non-normalized facets. We then use these non-normalized ones to get the correct counts by fetching the associated databases. ### What's missing in this PR? - [x] Apply the change to the whole set of `SearchForFacetValue::execute` conditions. - [x] Factorize the code that does an intermediate normalized value fetch in a function. - [x] Add or modify the search for facet value test. Co-authored-by: Clément Renault <clement@meilisearch.com> Co-authored-by: Kerollmops <clement@meilisearch.com>	2023-07-25 14:37:17 +00:00
ManyTheFish	d57026cd96	Support synonyms synergies	2023-07-25 15:01:42 +02:00
Kerollmops	29ab54b259	Replace the hnsw crate by the instant-distance one	2023-07-25 12:37:35 +02:00
ManyTheFish	d4ff59fcf5	Fix clippy	2023-07-24 18:42:26 +02:00
ManyTheFish	9c485f8563	Make the search and the indexing work	2023-07-24 18:35:20 +02:00
ManyTheFish	d8d12d5979	Be able to set and reset settings	2023-07-24 17:00:18 +02:00
Clément Renault	df528b41d8	Normalize for the search the facets values	2023-07-20 17:57:07 +02:00
Kerollmops	eef95de30e	First iteration on exposing puffin profiling	2023-07-18 17:38:13 +02:00
Louis Dureuil	40fa59d64c	Sort by lexicographic order after normalization	2023-07-10 09:26:59 +02:00
Louis Dureuil	324d448236	Format let-else ❤️ 🎉	2023-07-03 10:20:28 +02:00
meili-bors[bot]	661d1f90dc	Merge #3866 3866: Update charabia v0.8.0 r=dureuill a=ManyTheFish # Pull Request Update Charabia: - enhance Japanese segmentation - enhance Latin Tokenization - words containing `_` are now properly segmented into several words - brackets `{([])}` are no more considered as context separators so word separated by brackets are now considered near together for the proximity ranking rule - fixes #3815 - fixes #3778 - fixes [product#151](https://github.com/meilisearch/product/discussions/151) > Important note: now the float numbers are segmented around the `.` so `3.22` is segmented as [`3`, `.`, `22`] but the middle dot isn't considered as a hard separator, which means that if we search `3.22` we find documents containing `3.22` Co-authored-by: ManyTheFish <many@meilisearch.com>	2023-06-29 15:24:36 +00:00
ManyTheFish	a82c49ab08	Update test	2023-06-29 15:56:36 +02:00
ManyTheFish	84845de9ef	Update Charabia	2023-06-29 15:56:32 +02:00
Kerollmops	9917bf046a	Move the sortFacetValuesBy in the faceting settings	2023-06-29 14:33:31 +02:00
Clément Renault	efbe7ce78b	Clean the facet string FSTs when we clear the documents	2023-06-28 15:36:32 +02:00
Kerollmops	e9a3029c30	Use the right field id to write the string facet values FST	2023-06-28 15:01:51 +02:00
Clément Renault	f36de2115f	Make clippy happy	2023-06-28 15:01:50 +02:00
Kerollmops	c34de05106	Introduce the SearchForFacetValue struct	2023-06-28 14:58:41 +02:00
Clément Renault	15a4c05379	Store the facet string values in multiple FSTs	2023-06-28 14:58:41 +02:00
meili-bors[bot]	d4f10800f2	Merge #3834 3834: Define searchable fields at runtime r=Kerollmops a=ManyTheFish ## Summary This feature allows the end-user to search in one or multiple attributes using the search parameter `attributesToSearchOn`: ```json { "q": "Captain Marvel", "attributesToSearchOn": ["title"] } ``` This feature act like a filter, forcing Meilisearch to only return the documents containing the requested words in the attributes-to-search-on. Note that, with the matching strategy `last`, Meilisearch will only ensure that the first word is in the attributes-to-search-on, but, the retrieved documents will be ordered taking into account the word contained in the attributes-to-search-on. ## Trying the prototype A dedicated docker image has been released for this feature: #### last prototype version: ```bash docker pull getmeili/meilisearch:prototype-define-searchable-fields-at-search-time-1 ``` #### others prototype versions: ```bash docker pull getmeili/meilisearch:prototype-define-searchable-fields-at-search-time-0 ``` ## Technical Detail The attributes-to-search-on list is given to the search context, then, the search context uses the `fid_word_docids`database using only the allowed field ids instead of the global `word_docids` database. This is the same for the prefix databases. The database cache is updated with the merged values, meaning that the union of the field-id-database values is only made if the requested key is missing from the cache. ### Relevancy limits Almost all ranking rules behave as expected when ordering the documents. Only `proximity` could miss-order documents if all the searched words are in the restricted attribute but a better proximity is found in an ignored attribute in a document that should be ranked lower. 
I put below a failing test showing it: ```rust #[actix_rt::test] async fn proximity_ranking_rule_order() { let server = Server::new().await; let index = index_with_documents( &server, &json!([ { "title": "Captain super mega cool. A Marvel story", // Perfect distance between words in an ignored attribute "desc": "Captain Marvel", "id": "1", }, { "title": "Captain America from Marvel", "desc": "a Shazam ersatz", "id": "2", }]), ) .await; // Document 2 should appear before document 1. index .search(json!({"q": "Captain Marvel", "attributesToSearchOn": ["title"], "attributesToRetrieve": ["id"]}), |response, code| { assert_eq!(code, 200, "{}", response); assert_eq!( response["hits"], json!([ {"id": "2"}, {"id": "1"}, ]) ); }) .await; } ``` Fixing this would force us to create a `fid_word_pair_proximity_docids` and a `fid_word_prefix_pair_proximity_docids` databases which may multiply the keys of `word_pair_proximity_docids` and `word_prefix_pair_proximity_docids` by the number of attributes in the searchable_attributes list. If we think we should fix this test, I'll suggest doing it in another PR. ## Related Fixes #3772 Co-authored-by: Tamo <tamo@meilisearch.com> Co-authored-by: ManyTheFish <many@meilisearch.com>	2023-06-28 08:19:23 +00:00
Clément Renault	30741d17fa	Change the TODO message	2023-06-27 12:32:43 +02:00
Clément Renault	63bfe1cee2	Ignore when there are too many vectors	2023-06-27 12:32:43 +02:00
Kerollmops	ff3664431f	Make rustfmt happy	2023-06-27 12:32:42 +02:00
Kerollmops	531748c536	Return a user error when the _vectors type is invalid	2023-06-27 12:32:41 +02:00
Kerollmops	7aa1275337	Display the _semanticSimilarity even if the `_vectors` field is not displayed	2023-06-27 12:32:41 +02:00
Kerollmops	3e3c743392	Make Rustfmt happy	2023-06-27 12:32:41 +02:00
Kerollmops	ab9f2269aa	Normalize the vectors during indexation and search	2023-06-27 12:32:41 +02:00
Kerollmops	321ec5f3fa	Accept multiple vectors by documents using the _vectors field	2023-06-27 12:32:40 +02:00
Kerollmops	a7e0f0de89	Introduce a new error message for invalid vector dimensions	2023-06-27 12:32:40 +02:00
Kerollmops	c2a402f3ae	Implement an ugly deletion of values in the HNSW	2023-06-27 12:32:39 +02:00
Kerollmops	c79e82c62a	Move back to the hnsw crate This reverts commit 7a4b6c065482f988b01298642f4c18775503f92f.	2023-06-27 12:32:39 +02:00
Kerollmops	aca305bb77	Log more to make sure we insert vectors in the hgg data-structure	2023-06-27 12:32:38 +02:00
Kerollmops	268a9ef416	Move to the hgg crate	2023-06-27 12:32:38 +02:00
Clément Renault	4571e512d2	Store the vectors in an HNSW in LMDB	2023-06-27 12:32:38 +02:00
Clément Renault	7ac2f1489d	Extract the vectors from the documents	2023-06-27 12:32:37 +02:00
Clément Renault	34349faeae	Create a new _vector extractor	2023-06-27 12:32:37 +02:00
ManyTheFish	fb8fa07169	Restrict field ids in search context	2023-06-26 14:55:57 +02:00
ManyTheFish	0ccf1e2e40	Allow the search cache to store owned values	2023-06-26 14:55:57 +02:00
meili-bors[bot]	040b5a5b6f	Merge #3842 3842: fix some typos r=dureuill a=cuishuang # Pull Request ## Related issue Fixes #<issue_number> ## What does this PR do? - fix some typos ## PR checklist Please check if your PR fulfills the following requirements: - [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)? - [x] Have you read the contributing guidelines? - [x] Have you made sure that the title is accurate and descriptive of the changes? Thank you so much for contributing to Meilisearch! Co-authored-by: cui fliter <imcusg@gmail.com>	2023-06-22 18:01:10 +00:00
cui fliter	530a3e2df3	fix some typos Signed-off-by: cui fliter <imcusg@gmail.com>	2023-06-22 21:59:00 +08:00
meili-bors[bot]	45636d315c	Merge #3670 3670: Fix addition deletion bug r=irevoire a=irevoire The first commit of this PR is a revert of https://github.com/meilisearch/meilisearch/pull/3667. It re-enable the auto-batching of addition and deletion of tasks. No new changes have been introduced outside of `milli`. So all the changes you see on the autobatcher have actually already been reviewed. It fixes https://github.com/meilisearch/meilisearch/issues/3440. ### What was happening? The issue was that the `external_documents_ids` generated in the `transform` were used in a very strange way that wasn’t compatible with the deletion of documents. Instead of doing a clear merge between the external document IDs of the DB and the one returned by the transform + writing it on disk, we were doing some weird tricks with the soft-deleted to avoid writing the fst on disk as much as possible. The new algorithm may be a bit slower but is way more straightforward and doesn’t change depending on if the soft deletion was used or not. Here is a list of the changes introduced: 1. We now do a clear distinction between the `new_external_documents_ids` coming from the transform and only held on RAM and the `external_documents_ids` coming from the DB. 2. The `new_external_documents_ids` (coming out of the transform) are now represented as an `fst`. We don't need to struggle with the hard, soft distinction + the soft_deleted => That's easier to understand 3. When indexing documents, we merge the `external_documents_ids` coming from the DB and the `new_external_documents_ids` coming from the transform. ### Other things introduced in this PR Since we constantly have to write small, very specialized fuzzers for this kind of bug, we decided to push the one used to reproduce this bug. It's not perfect, but it's easy to improve in the future. It'll also run for as long as possible on every merge on the main branch. 
Co-authored-by: Tamo <tamo@meilisearch.com> Co-authored-by: Loïc Lecrenier <loic.lecrenier@icloud.com>	2023-06-19 09:09:30 +00:00
Louis Dureuil	9f37b61666	DB BREAKING: raise limit of word count from 10 to 30.	2023-06-08 12:07:12 +02:00
Louis Dureuil	c15c076da9	DB BREAKING: Count the number of words in field_id_word_count_docids	2023-06-08 12:07:11 +02:00
Loïc Lecrenier	8628a0c856	Remove docid_word_positions_db + fix deletion bug That would happen when a word was deleted from all exact attributes but not all regular attributes.	2023-06-07 10:52:50 +02:00
Tamo	602ad98cb8	improve the way we handle the fsts	2023-05-22 11:15:14 +02:00
Tamo	7f619ff0e4	get rid of the now unused soft_deletion_used parameter	2023-05-22 10:33:49 +02:00
Tamo	4391cba6ca	fix the addition + deletion bug	2023-05-17 18:28:57 +02:00
Kerollmops	c4a40e7110	Use the writemap flag to reduce the memory usage	2023-05-15 10:15:33 +02:00
Jakub Jirutka	13f1277637	Allow to disable specialized tokenizations (again) In PR #2773, I added the `chinese`, `hebrew`, `japanese` and `thai` feature flags to allow meilisearch to be built without huge specialized tokenizations that took up 90% of the meilisearch binary size. Unfortunately, due to some recent changes, this doesn't work anymore. The problem lies in excessive use of the `default` feature flag, which infects the dependency graph. Instead of adding `default-features = false` here and there, it's easier and more future-proof to not declare `default` in `milli` and `meilisearch-types`. I've renamed it to `all-tokenizers`, which also makes it a bit clearer what it's about.	2023-05-04 15:45:40 +02:00
Louis Dureuil	90bc230820	Merge remote-tracking branch 'origin/main' into search-refactor Conflicts | resolution ----------|----------- Cargo.lock | added mimalloc Cargo.toml | took origin/main version milli/src/search/criteria/exactness.rs | deleted after checking it was only clippy changes milli/src/search/query_tree.rs | deleted after checking it was only clippy changes	2023-05-03 12:19:06 +02:00
Loïc Lecrenier	93188b3c88	Fix indexing of word_prefix_fid_docids	2023-04-29 10:56:48 +02:00
bors[bot]	414b3fae89	Merge #3571 3571: Introduce two filters to select documents with `null` and empty fields r=irevoire a=Kerollmops # Pull Request ## Related issue This PR implements the `X IS NULL`, `X IS NOT NULL`, `X IS EMPTY`, `X IS NOT EMPTY` filters that [this comment](https://github.com/meilisearch/product/discussions/539#discussioncomment-5115884) is describing in a very detailed manner. ## What does this PR do? ### `IS NULL` and `IS NOT NULL` This PR will be exposed as a prototype for now. Below is the copy/pasted version of a spec that defines this filter. - `IS NULL` matches fields that `EXISTS` AND `= IS NULL` - `IS NOT NULL` matches fields that `NOT EXISTS` OR `!= IS NULL` 1. `{"name": "A", "price": null}` 2. `{"name": "A", "price": 10}` 3. `{"name": "A"}` `price IS NULL` would match 1 `price IS NOT NULL` or `NOT price IS NULL` would match 2,3 `price EXISTS` would match 1, 2 `price NOT EXISTS` or `NOT price EXISTS` would match 3 common query : `(price EXISTS) AND (price IS NOT NULL)` would match 2 ### `IS EMPTY` and `IS NOT EMPTY` - `IS EMPTY` matches Array `[]`, Object `{}`, or String `""` fields that `EXISTS` and are empty - `IS NOT EMPTY` matches fields that `NOT EXISTS` OR are not empty. 1. `{"name": "A", "tags": null}` 2. `{"name": "A", "tags": [null]}` 3. `{"name": "A", "tags": []}` 4. `{"name": "A", "tags": ["hello","world"]}` 5. `{"name": "A", "tags": [""]}` 6. `{"name": "A"}` 7. `{"name": "A", "tags": {}}` 8. `{"name": "A", "tags": {"t1":"v1"}}` 9. `{"name": "A", "tags": {"t1":""}}` 10. `{"name": "A", "tags": ""}` `tags IS EMPTY` would match 3,7,10 `tags IS NOT EMPTY` or `NOT tags IS EMPTY` would match 1,2,4,5,6,8,9 `tags IS NULL` would match 1 `tags IS NOT NULL` or `NOT tags IS NULL` would match 2,3,4,5,6,7,8,9,10 `tags EXISTS` would match 1,2,3,4,5,7,8,9,10 `tags NOT EXISTS` or `NOT tags EXISTS` would match 6 common query : `(tags EXISTS) AND (tags IS NOT NULL) AND (tags IS NOT EMPTY)` would match 2,4,5,8,9 ## What should the reviewer do? 
- Check that I tested the filters - Check that I deleted the ids of the documents when deleting documents Co-authored-by: Clément Renault <clement@meilisearch.com> Co-authored-by: Kerollmops <clement@meilisearch.com>	2023-04-27 13:14:00 +00:00
Clément Renault	cfd1b2cc97	Fix the clippy warnings	2023-04-25 16:40:32 +02:00
Loïc Lecrenier	d1fdbb63da	Make all search tests pass, fix distinctAttribute bug	2023-04-24 12:12:08 +02:00
Loïc Lecrenier	84d9c731f8	Fix bug in encoding of word_position_docids and word_fid_docids	2023-04-24 09:59:30 +02:00
Loïc Lecrenier	8cb85294ef	Remove unused import warning	2023-04-07 11:09:30 +02:00
Loïc Lecrenier	540a396e49	Fix indexing bug in words_prefix_position	2023-04-07 11:08:39 +02:00
Loïc Lecrenier	a81165f0d8	Merge remote-tracking branch 'origin/main' into search-refactor	2023-04-07 10:15:55 +02:00
Loïc Lecrenier	130d2061bd	Fix indexing of word_position_docid and fid	2023-04-06 17:50:39 +02:00
Louis Dureuil	66ddee4390	Fix word_position_docids indexing	2023-04-06 17:50:39 +02:00
Louis Dureuil	e58426109a	Fix panics and issues in exactness graph ranking rule	2023-04-06 17:50:39 +02:00
Louis Dureuil	996619b22a	Increase position by 8 on hard separator when building query terms	2023-04-06 17:50:39 +02:00
Tamo	597d57bf1d	Merge branch 'main' into bring-back-changes-v1.1.0	2023-04-05 11:32:14 +02:00
ManyTheFish	efea1e5837	Fix facet normalization	2023-03-29 12:02:24 +02:00
Gregory Conrad	e7994cdeb3	feat: check to see if the PK changed before erroring out Previously, if the primary key was set and a Settings update contained a primary key, an error would be returned. However, this error is not needed if the new PK == the current PK. This commit just checks to see if the PK actually changes before raising an error.	2023-03-26 12:18:39 -04:00
Loïc Lecrenier	d18ebe4f3a	Remove more warnings	2023-03-23 09:41:18 +01:00
Loïc Lecrenier	9b2653427d	Split position DB into fid and relative position DB	2023-03-23 09:22:01 +01:00
Clément Renault	1a9c58a7ab	Fix a bug with the new flattening rules	2023-03-15 16:56:44 +01:00
Clément Renault	64571c8288	Improve the testing of the filters	2023-03-15 14:57:17 +01:00
Clément Renault	ea016d97af	Implementing an IS EMPTY filter	2023-03-15 14:12:34 +01:00
ManyTheFish	2f8eb4f54a	last PR fixes	2023-03-09 15:34:36 +01:00
Clément Renault	df48ac8803	Add one more test for the NULL operator	2023-03-09 13:53:37 +01:00
Clément Renault	0ad53784e7	Create a new struct to reduce the type complexity	2023-03-09 13:21:21 +01:00
Clément Renault	e064c52544	Rename an internal facet deletion method	2023-03-09 13:08:02 +01:00
Clément Renault	e106b16148	Fix a typo in a variable Co-authored-by: Louis Dureuil <louis@meilisearch.com> aaa	2023-03-09 13:08:02 +01:00
ManyTheFish	5deea631ea	fix clippy too many arguments	2023-03-09 11:19:13 +01:00
ManyTheFish	b4b859ec8c	Fix typos	2023-03-09 10:58:35 +01:00
Clément Renault	7dc04747fd	Make clippy happy	2023-03-08 17:37:08 +01:00
Clément Renault	43ff236df8	Write the NULL facet values in the database	2023-03-08 16:49:53 +01:00
Clément Renault	19ab4d1a15	Classify the NULL fields values in the facet extractor	2023-03-08 16:49:31 +01:00
Clément Renault	9287858997	Introduce a new facet_id_is_null_docids database in the index	2023-03-08 16:14:00 +01:00
ManyTheFish	24c0775c67	Change indexing threshold	2023-03-08 12:36:04 +01:00
ManyTheFish	3092cf0448	Fix clippy errors	2023-03-08 10:53:42 +01:00
ManyTheFish	da48506f15	Rerun extraction when language detection might have failed	2023-03-07 18:35:26 +01:00
Louis Dureuil	5822764be9	Skip computing index budget in tests	2023-02-23 11:23:39 +01:00
ManyTheFish	bbecab8948	fix clippy	2023-02-21 10:18:44 +01:00
ManyTheFish	8aa808d51b	Merge branch 'main' into enhance-language-detection	2023-02-20 18:14:34 +01:00
bors[bot]	b08a49a16e	Merge #3319 #3470 3319: Transparently resize indexes on MaxDatabaseSizeReached errors r=Kerollmops a=dureuill # Pull Request ## Related issue Related to https://github.com/meilisearch/meilisearch/discussions/3280, depends on https://github.com/meilisearch/milli/pull/760 ## What does this PR do? ### User standpoint - Meilisearch no longer fails tasks that encounter the `milli::UserError(MaxDatabaseSizeReached)` error. - Instead, these tasks are retried after increasing the maximum size allocated to the index where the failure occurred. ### Implementation standpoint - Add `Batch::index_uid` to get the `index_uid` of a batch of task if there is one - `IndexMapper::create_or_open_index` now takes an additional `size` argument that allows to (re)open indexes with a size different from the base `IndexScheduler::index_size` field - `IndexScheduler::tick` now returns a `Result<TickOutcome>` instead of a `Result<usize>`. This offers more explicit control over what the behavior should be wrt the next tick. - Add `IndexStatus::BeingResized` that contains a handle that a thread can use to await for the resize operation to complete and the index to be available again. - Add `IndexMapper::resize_index` to increase the size of an index. - In `IndexScheduler::tick`, intercept task batches that failed due to `MaxDatabaseSizeReached` and resize the index that caused the error, then request a new tick that will eventually handle the still enqueued task. 
## Testing the PR The following diff can be applied to this branch to make testing the PR easier: <details> ```diff diff --git a/index-scheduler/src/index_mapper.rs b/index-scheduler/src/index_mapper.rs index 553ab45a..022b2f00 100644 --- a/index-scheduler/src/index_mapper.rs +++ b/index-scheduler/src/index_mapper.rs @@ -228,13 +228,15 @@ impl IndexMapper { drop(lock); + std::thread::sleep_ms(2000); + let current_size = index.map_size()?; let closing_event = index.prepare_for_closing(); - log::info!("Resizing index {} from {} to {} bytes", name, current_size, current_size * 2); + log::error!("Resizing index {} from {} to {} bytes", name, current_size, current_size * 2); closing_event.wait(); - log::info!("Resized index {} from {} to {} bytes", name, current_size, current_size * 2); + log::error!("Resized index {} from {} to {} bytes", name, current_size, current_size * 2); let index_path = self.base_path.join(uuid.to_string()); let index = self.create_or_open_index(&index_path, None, 2 * current_size)?; @@ -268,8 +270,10 @@ impl IndexMapper { match index { Some(Available(index)) => break index, Some(BeingResized(ref resize_operation)) => { + log::error!("waiting for resize end"); // Deadlock: no lock taken while doing this operation. resize_operation.wait(); + log::error!("trying our luck again!"); continue; } Some(BeingDeleted) => return Err(Error::IndexNotFound(name.to_string())), diff --git a/index-scheduler/src/lib.rs b/index-scheduler/src/lib.rs index 11b17d05..242dc095 100644 --- a/index-scheduler/src/lib.rs +++ b/index-scheduler/src/lib.rs @@ -908,6 +908,7 @@ impl IndexScheduler { /// /// Returns the number of processed tasks. 
fn tick(&self) -> Result<TickOutcome> { + log::error!("ticking!"); #[cfg(test)] { *self.run_loop_iteration.write().unwrap() += 1; diff --git a/meilisearch/src/main.rs b/meilisearch/src/main.rs index 050c825a..63f312f6 100644 --- a/meilisearch/src/main.rs +++ b/meilisearch/src/main.rs @@ -25,7 +25,7 @@ fn setup(opt: &Opt) -> anyhow::Result<()> { #[actix_web::main] async fn main() -> anyhow::Result<()> { - let (opt, config_read_from) = Opt::try_build()?; + let (mut opt, config_read_from) = Opt::try_build()?; setup(&opt)?; @@ -56,6 +56,8 @@ We generated a secure master key for you (you can safely copy this token): _ => (), } + opt.max_index_size = byte_unit::Byte::from_str("1MB").unwrap(); + let (index_scheduler, auth_controller) = setup_meilisearch(&opt)?; #[cfg(all(not(debug_assertions), feature = "analytics"))] ``` </details> Mainly, these debug changes do the following: - Set the default index size to 1MiB so that index resizes are initially frequent - Turn some logs from info to error so that they can be displayed with `--log-level ERROR` (hiding the other infos) - Add a long sleep between the beginning and the end of the resize so that we can observe the `BeingResized` index status (otherwise it would never come up in my tests) ## Open questions - Is the growth factor of x2 the correct solution? For a `Vec` in memory it makes sense, but here we're manipulating quantities that are potentially in the order of 500GiBs. For bigger indexes it may make more sense to add at most e.g. 100GiB on each resize operation, avoiding big steps like 500GiB -> 1TiB. ## PR checklist Please check if your PR fulfills the following requirements: - [ ] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)? - [ ] Have you read the contributing guidelines? - [ ] Have you made sure that the title is accurate and descriptive of the changes? Thank you so much for contributing to Meilisearch! 
3470: Autobatch addition and deletion r=irevoire a=irevoire This PR adds the capability to meilisearch to batch document addition and deletion together. Fix https://github.com/meilisearch/meilisearch/issues/3440 -------------- Things to check before merging; - [x] What happens if we delete multiple time the same documents -> add a test - [x] If a documentDeletion gets batched with a documentAddition but the index doesn't exist yet? It should not work Co-authored-by: Louis Dureuil <louis@meilisearch.com> Co-authored-by: Tamo <tamo@meilisearch.com>	2023-02-20 15:00:19 +00:00
Tamo	18796d6e6a	Consider null as a valid geo object	2023-02-20 13:45:51 +01:00
Tamo	895ab2906c	apply review suggestions	2023-02-16 18:42:47 +01:00
Tamo	8fb7b1d10f	bump deserr	2023-02-14 20:04:30 +01:00
Tamo	74dcfe9676	Fix a bug when you update a document that was already present in the db, deleted and then inserted again in the same transform	2023-02-14 19:09:40 +01:00
Tamo	1b1703a609	make a small optimization to merge obkvs a little bit faster	2023-02-14 18:32:41 +01:00
Tamo	fb5e4957a6	fix and test the early exit in case a grenad ends with a deletion	2023-02-14 18:23:57 +01:00
Tamo	8de3c9f737	Update milli/src/update/index_documents/transform.rs Co-authored-by: Clément Renault <clement@meilisearch.com>	2023-02-14 17:57:14 +01:00
Tamo	43a19d0709	document the operation enum + the grenads	2023-02-14 17:55:26 +01:00
Tamo	746b31c1ce	makes clippy happy	2023-02-09 12:23:01 +01:00
Tamo	93db755d57	add a test to ensure we handle correctly a deletion of multiple time the same document	2023-02-08 21:03:34 +01:00
Tamo	93f130a400	fix all warnings	2023-02-08 20:57:35 +01:00
Tamo	421a9cf05e	provide a new method on the transform to remove documents	2023-02-08 16:06:09 +01:00
Tamo	8f64fba1ce	rewrite the current transform to handle a new byte specifying the kind of operation it's merging	2023-02-08 12:53:38 +01:00
Kerollmops	fbec48f56e	Merge remote-tracking branch 'milli/main' into bring-v1-changes	2023-02-06 16:48:10 +01:00
ManyTheFish	064158e4e2	Update test	2023-02-01 15:34:01 +01:00
Loïc Lecrenier	a2690ea8d4	Reduce incremental indexing time of `words_prefix_position_docids` DB This database can easily contain millions of entries. Thus, iterating over it can be very expensive. For regular `documentAdditionOrUpdate` tasks, `del_prefix_fst_words` will always be empty. Thus, we can save a significant amount of time by adding this `if !del_prefix_fst_words.is_empty()` condition. The code's behaviour remains completely unchanged.	2023-01-31 11:42:24 +01:00
f3r10	7681be5367	Format code	2023-01-31 11:28:05 +01:00
f3r10	50bc156257	Fix tests	2023-01-31 11:28:05 +01:00
f3r10	d8207356f4	Skip script,language insertion if language is undetected	2023-01-31 11:28:05 +01:00
f3r10	fd60a39f1c	Format code	2023-01-31 11:28:05 +01:00
f3r10	369c05732e	Add test checking if from script_language_docids database were removed deleted docids	2023-01-31 11:28:05 +01:00
f3r10	a27f329e3a	Add tests for checking that detected script and language associated with document(s) were stored during indexing	2023-01-31 11:28:05 +01:00
f3r10	b216ddba63	Delete and clear data from the new database	2023-01-31 11:28:05 +01:00
f3r10	d97fb6117e	Extract and index data	2023-01-31 11:28:05 +01:00
Louis Dureuil	20f05efb3c	clippy: needless_lifetimes	2023-01-31 11:12:59 +01:00
Louis Dureuil	cbf029f64c	clippy: --fix	2023-01-31 11:12:59 +01:00
Louis Dureuil	3296cf7ae6	clippy: remove needless lifetimes	2023-01-31 09:32:40 +01:00
Louis Dureuil	89675e5f15	clippy: Replace seek 0 by rewind	2023-01-31 09:32:40 +01:00
Tamo	de3c4f1986	throw an error on unknown fields specified in the _geo field	2023-01-24 12:23:24 +01:00
Philipp Ahlner	f5ca421227	Superfluous test removed	2023-01-19 15:39:21 +01:00
Philipp Ahlner	a2cd7214f0	Fixes error message when lat/lng are unparseable	2023-01-19 10:10:26 +01:00
ManyTheFish	d1fc42b53a	Use compatibility decomposition normalizer in facets	2023-01-18 15:02:13 +01:00
Clément Renault	1b78231e18	Make clippy happy	2023-01-17 18:25:54 +01:00
Loïc Lecrenier	f073a86387	Update deserr to latest version	2023-01-17 11:28:19 +01:00
Loïc Lecrenier	02fd06ea0b	Integrate deserr	2023-01-11 13:56:47 +01:00
bors[bot]	c3f4835e8e	Merge #733 733: Avoid a prefix-related worst-case scenario in the proximity criterion r=loiclec a=loiclec # Pull Request ## Related issue Somewhat fixes (until merged into meilisearch) https://github.com/meilisearch/meilisearch/issues/3118 ## What does this PR do? When a query ends with a word and a prefix, such as: ``` word pr ``` Then we first determine whether `pre` could possibly be in the proximity prefix database before querying it. There are then three possibilities: 1. `pr` is not in any prefix cache because it is not the prefix of many words. We don't query the proximity prefix database. Instead, we list all the word derivations of `pre` through the FST and query the regular proximity databases. 2. `pr` is in the prefix cache but cannot be found in the proximity prefix databases. In this case, we partially disable the proximity ranking rule for the pair `word pre`. This is done as follows: 1. Only find the documents where `word` is in proximity to `pre` exactly (no derivations) 2. Otherwise, assume that their proximity in all the documents in which they coexist is >= 8 3. `pr` is in the prefix cache and can be found in the proximity prefix databases. In this case we simply query the proximity prefix databases. Note that if a prefix is longer than 2 bytes, then it cannot be in the proximity prefix databases. Also, proximities larger than 4 are not present in these databases either. Therefore, the impact on relevancy is: 1. For common prefixes of one or two letters: we no longer distinguish between proximities from 4 to 8 2. For common prefixes of more than two letters: we no longer distinguish between any proximities 3. 
For uncommon prefixes: nothing changes Regarding (1), it means that these two documents would be considered equally relevant according to the proximity rule for the query `heard pr` (IF `pr` is the prefix of more than 200 words in the dataset): ```json [ { "text": "I heard there is a faster proximity criterion" }, { "text": "I heard there is a faster but less relevant proximity criterion" } ] ``` Regarding (2), it means that two documents would be considered equally relevant according to the proximity rule for the query "faster pro": ```json [ { "text": "I heard there is a faster but less relevant proximity criterion" }, { "text": "I heard there is a faster proximity criterion" } ] ``` But the following document would be considered more relevant than the two documents above: ```json { "text": "I heard there is a faster swimmer who is competing in the pro section of the competition " } ``` Note, however, that this change of behaviour only occurs when using the set-based version of the proximity criterion. In cases where there are fewer than 1000 candidate documents when the proximity criterion is called, this PR does not change anything. --- ## Performance I couldn't use the existing search benchmarks to measure the impact of the PR, but I did some manual tests with the `songs` benchmark dataset. ``` 1. 10x 'a': - 640ms ⟹ 630ms = no significant difference 2. 10x 'b': - set-based: 4.47s ⟹ 7.42s = bad, ~2x regression - dynamic: 1s ⟹ 870 ms = no significant difference 3. 'Someone I l': - set-based: 250ms ⟹ 12 ms = very good, x20 speedup - dynamic: 21ms ⟹ 11 ms = good, x2 speedup 4. 'billie e': - set-based: 623ms ⟹ 2ms = very good, x300 speedup - dynamic: ~4ms ⟹ 4ms = no difference 5. 'billie ei': - set-based: 57ms ⟹ 20ms = good, ~2x speedup - dynamic: ~4ms ⟹ ~2ms. = no significant difference 6. 'i am getting o' - set-based: 300ms ⟹ 60ms = very good, 5x speedup - dynamic: 30ms ⟹ 6ms = very good, 5x speedup 7. 
'prologue 1 a 1: - set-based: 3.36s ⟹ 120ms = very good, 30x speedup - dynamic: 200ms ⟹ 30ms = very good, 6x speedup 8. 'prologue 1 a 10': - set-based: 590ms ⟹ 18ms = very good, 30x speedup - dynamic: 82ms ⟹ 35ms = good, ~2x speedup ``` Performance is often significantly better, but there is also one regression in the set-based implementation with the query `b b b b b b b b b b`. Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com>	2023-01-04 09:00:50 +00:00
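The three possibilities listed in the description of #733 can be read as a small decision function. The sketch below is only an illustration of that reasoning: the enum, the function, and the constant names are hypothetical and are not milli's actual API. The only facts taken from the PR text are the three cases and the rule that a prefix longer than 2 bytes cannot be in the proximity prefix databases.

```rust
/// Hypothetical names: this enum does not exist in milli.
#[derive(Debug, PartialEq, Eq)]
enum PrefixProximityStrategy {
    /// Case 1: the prefix is not in any prefix cache, so list its word
    /// derivations through the FST and query the regular proximity databases.
    ExpandDerivations,
    /// Case 2: the prefix is cached but absent from the proximity prefix
    /// databases, so only the exact word is checked and every other
    /// co-occurrence is assumed to have a proximity >= 8.
    ExactWordOnly,
    /// Case 3: the prefix is cached and present in the proximity prefix
    /// databases, so those databases are queried directly.
    QueryProximityPrefixDb,
}

/// Per the PR text, longer prefixes cannot be in the proximity prefix databases.
const MAX_PREFIX_BYTES: usize = 2;

/// Assumed inputs: the caller already knows whether the prefix is in the
/// prefix cache and whether a lookup in the proximity prefix databases succeeded.
fn choose_strategy(
    prefix: &str,
    in_prefix_cache: bool,
    found_in_proximity_prefix_db: bool,
) -> PrefixProximityStrategy {
    if !in_prefix_cache {
        PrefixProximityStrategy::ExpandDerivations
    } else if prefix.len() > MAX_PREFIX_BYTES || !found_in_proximity_prefix_db {
        PrefixProximityStrategy::ExactWordOnly
    } else {
        PrefixProximityStrategy::QueryProximityPrefixDb
    }
}
```

Under these assumptions, a common three-letter prefix such as `pro` always falls into the second case, which matches the stated relevancy impact for common prefixes of more than two letters.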
bors[bot]	6a10e85707	Merge #736 736: Update charabia r=curquiza a=ManyTheFish Update Charabia to the latest version. > We are now Romanizing Chinese characters into Pinyin. > Note that we keep the accents because they are in fact never typed directly by the end-user; moreover, changing an accent leads to a different Chinese character, and I don't have sufficient knowledge to forecast the impact of removing accents in this context. Co-authored-by: ManyTheFish <many@meilisearch.com>	2023-01-03 15:44:41 +00:00
Loïc Lecrenier	777b387dc4	Avoid a prefix-related worst-case scenario in the proximity criterion	2022-12-22 12:08:00 +01:00
Louis Dureuil	4b166bea2b	Add primary_key_inference test	2022-12-21 15:13:38 +01:00
Louis Dureuil	5943100754	Fix existing tests	2022-12-21 15:13:38 +01:00
Louis Dureuil	b24def3281	Add logging when inference took place. Displays log message in the form: ``` [2022-12-21T09:19:42Z INFO milli::update::index_documents::enrich] Primary key was not specified in index. Inferred to 'id' ```	2022-12-21 15:13:38 +01:00
Louis Dureuil	402dcd6b2f	Simplify primary key inference	2022-12-21 15:13:38 +01:00
Louis Dureuil	13c95d25aa	Remove uses of UserError::MissingPrimaryKey not related to inference	2022-12-21 15:13:36 +01:00
Loïc Lecrenier	fc0e7382fe	Fix hard-deletion of an external id that was soft-deleted	2022-12-20 15:33:31 +01:00
Tamo	69edbf9f6d	Update milli/src/update/delete_documents.rs	2022-12-19 18:23:50 +01:00
Louis Dureuil	916c23e7be	Tests: rename snapshots	2022-12-19 10:07:17 +01:00
Louis Dureuil	ad9937c755	Fix tests after adding DeletionStrategy	2022-12-19 10:07:17 +01:00
Louis Dureuil	171c942282	Soft-deletion computation no longer takes into account the mapsize Implemented solution 2.3 from https://github.com/meilisearch/meilisearch/issues/3231#issuecomment-1348628824	2022-12-19 10:07:17 +01:00
Louis Dureuil	e2ae3b24aa	Hard or soft delete according to the deletion strategy	2022-12-19 10:00:13 +01:00
Louis Dureuil	fc7618d49b	Add DeletionStrategy	2022-12-19 09:49:58 +01:00
ManyTheFish	7f88c4ff2f	Fix #1714 test	2022-12-15 18:22:28 +01:00
Loïc Lecrenier	be3b00350c	Apply review suggestions: naming and documentation	2022-12-13 10:15:22 +01:00
Loïc Lecrenier	e3ee553dcc	Remove soft deleted ids from ExternalDocumentIds during document import If the document import replaces a document using hard deletion	2022-12-12 14:16:09 +01:00
Loïc Lecrenier	303d740245	Prepare fix within facet range search By creating snapshots and updating the format of the existing snapshots. The next commit will apply the fix, which will show its effects cleanly on the old and new snapshot tests	2022-12-07 14:38:10 +01:00
Loïc Lecrenier	a993b68684	Cargo fmt >:-(	2022-12-06 15:22:10 +01:00
Loïc Lecrenier	80c7a00567	Fix compilation error in tests of settings update	2022-12-06 15:19:26 +01:00
Loïc Lecrenier	67d8cec209	Fix bug in handling of soft deleted documents when updating settings	2022-12-06 15:09:19 +01:00
Loïc Lecrenier	cda4ba2bb6	Add document import tests	2022-12-05 12:02:49 +01:00
Loïc Lecrenier	f2cf981641	Add more tests and allow disabling of soft-deletion outside of tests Also allow disabling soft-deletion in the IndexDocumentsConfig	2022-12-05 10:51:01 +01:00
bors[bot]	d3731dda48	Merge #706 706: Limit the reindexing caused by updating settings when not needed r=curquiza a=GregoryConrad ## What does this PR do? When updating index settings using `update::Settings`, sometimes a `reindex` of `update::Settings` is triggered when it doesn't need to be. This PR aims to prevent those unnecessary `reindex` calls. For reference, here is a snippet from the current `execute` method in `update::Settings`: ```rust // ... if stop_words_updated || faceted_updated || synonyms_updated || searchable_updated || exact_attributes_updated { self.reindex(&progress_callback, &should_abort, old_fields_ids_map)?; } ``` - [x] `faceted_updated` - looks good as-is ✅ - [x] `stop_words_updated` - looks good as-is ✅ - [x] `synonyms_updated` - looks good as-is ✅ - [x] `searchable_updated` - fixed in this PR - [x] `exact_attributes_updated` - fixed in this PR ## PR checklist Please check if your PR fulfills the following requirements: - [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)? - [x] Have you read the contributing guidelines? - [x] Have you made sure that the title is accurate and descriptive of the changes? Thank you so much for contributing to Meilisearch! Co-authored-by: Gregory Conrad <gregorysconrad@gmail.com>	2022-12-01 13:58:02 +00:00
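The idea behind #706 can be illustrated with a minimal sketch: a flag such as `searchable_updated` should be true only when the requested value differs from the stored one, so that re-sending identical settings does not trigger a reindex. The struct and helper below are hypothetical and only reuse the flag names quoted in the PR description; the PR itself does not show how milli computes these flags.

```rust
/// Hypothetical container for the five flags named in the PR description.
#[derive(Default)]
struct UpdateFlags {
    stop_words_updated: bool,
    faceted_updated: bool,
    synonyms_updated: bool,
    searchable_updated: bool,
    exact_attributes_updated: bool,
}

impl UpdateFlags {
    /// Mirrors the condition quoted from `execute`: reindex if any flag is set.
    fn needs_reindex(&self) -> bool {
        self.stop_words_updated
            || self.faceted_updated
            || self.synonyms_updated
            || self.searchable_updated
            || self.exact_attributes_updated
    }
}

/// Assumed comparison: a setting counts as updated only if its value changed.
fn setting_changed<T: PartialEq>(stored: &T, requested: &T) -> bool {
    stored != requested
}
```

With this shape, applying the same searchable attributes twice leaves every flag false, and `needs_reindex` returns false.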
bors[bot]	5e754b3ee0	Merge #708 708: Reduce memory usage of the MatchingWords structure r=ManyTheFish a=loiclec # Pull Request ## Related issue Fixes (partially) https://github.com/meilisearch/meilisearch/issues/3115 ## What does this PR do? 1. Reduces the memory usage caused by the creation of a 10-word query tree by 20x. This is done by deduplicating the `MatchingWord` values, which are heavy because of their inner DFA. The deduplication works by wrapping each `MatchingWord` in a reference-counted box and using a hash map to determine whether a `MatchingWord` DFA already exists for a certain signature, or whether a new one needs to be built. 2. Avoid the worst-case scenario of creating a `MatchingWord` for extremely long words that cannot be indexed by milli. Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com>	2022-11-30 17:47:34 +00:00
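The deduplication described in #708 (a reference-counted box per value plus a hash map keyed by a signature) can be sketched as follows. `HeavyWord` is a stand-in for a `MatchingWord` with its inner DFA, and the signature used here (word, allowed typos, prefix flag) is an assumption for illustration; the PR text does not say what the real signature contains.

```rust
use std::collections::HashMap;
use std::rc::Rc;

/// Stand-in for an expensive value such as a `MatchingWord` and its DFA.
#[derive(Debug)]
struct HeavyWord {
    word: String,
    typos: u8,
    is_prefix: bool,
}

/// Assumed signature: values with equal signatures are interchangeable.
type Signature = (String, u8, bool);

/// Hypothetical cache that hands out shared references instead of rebuilding.
#[derive(Default)]
struct WordCache {
    entries: HashMap<Signature, Rc<HeavyWord>>,
    /// Number of values actually built, to make the saving observable.
    builds: usize,
}

impl WordCache {
    fn get_or_build(&mut self, word: &str, typos: u8, is_prefix: bool) -> Rc<HeavyWord> {
        let key = (word.to_owned(), typos, is_prefix);
        if let Some(existing) = self.entries.get(&key) {
            // Same signature seen before: share the existing allocation.
            return Rc::clone(existing);
        }
        // New signature: build once, remember it, and return a shared handle.
        self.builds += 1;
        let built = Rc::new(HeavyWord { word: word.to_owned(), typos, is_prefix });
        self.entries.insert(key, Rc::clone(&built));
        built
    }
}
```

Requesting the same signature twice returns two handles to one allocation, so the cost of the heavy value is paid once per distinct signature rather than once per occurrence in the query tree.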


1095 Commits