meilisearch

mirror of https://github.com/meilisearch/meilisearch.git synced 2024-11-30 09:04:59 +08:00

Author	SHA1	Message	Date
Clément Renault	3f3462ab62	Limit the number of values returned by the facet search	2024-01-10 16:54:08 +01:00
Tamo	54ae6951eb	fix warning	2024-01-02 15:19:30 +01:00
Louis Dureuil	0bf879fb88	Fix warning on rust stable	2023-12-20 17:48:09 +01:00
Louis Dureuil	6ff81de401	Fix tests	2023-12-20 17:16:46 +01:00
Louis Dureuil	9123370e90	Validate fused settings in settings task after fusing with existing setting	2023-12-20 17:16:46 +01:00
Louis Dureuil	14b396d302	Add new errors	2023-12-20 17:16:45 +01:00
Louis Dureuil	393216bf30	Flatten embedders settings	2023-12-20 17:16:43 +01:00
Louis Dureuil	e249e4db7b	Change Setting::apply function signature	2023-12-20 17:15:24 +01:00
Louis Dureuil	333ce12eb2	Fixed issue where the default revision is always the one we picked for the default model	2023-12-20 10:17:49 +01:00
Many the fish	9e1b458010	Merge branch 'main' into change-proximity-precision-settings	2023-12-18 09:08:47 +01:00
ManyTheFish	6425996e36	Change the naming of attributeScale and wordScale into byAttribute and byWord	2023-12-14 16:31:00 +01:00
Louis Dureuil	eb5cb91da2	Switch default from hf to openai	2023-12-14 16:19:46 +01:00
Louis Dureuil	87bba98bd8	Various changes - fixed seed for arroy - check vector dimensions as soon as it is provided to search - don't embed whitespace	2023-12-14 16:08:42 +01:00
Louis Dureuil	217105b7da	hybrid search uses semantic ratio, error handling	2023-12-14 16:08:42 +01:00
ManyTheFish	9991152bbe	Add TODOs	2023-12-14 16:08:42 +01:00
Louis Dureuil	a4536b1381	Small adjustments to respect the spec	2023-12-14 16:08:42 +01:00
Louis Dureuil	5b51cb04af	Remove some settings	2023-12-14 16:08:42 +01:00
Louis Dureuil	b8e4709dfa	Remove prompt strategy and fallback	2023-12-14 16:08:41 +01:00
Louis Dureuil	806e5b6899	Tests pass	2023-12-14 16:08:41 +01:00
Louis Dureuil	e0cc775dc4	Various changes - DistributionShift in Search object (to be set from model in embed?) - Fix issue where embedder index wasn't computed at search time - Accept as default embedder either the "default" one, or the only embedder when there is only one	2023-12-14 16:08:41 +01:00
Louis Dureuil	12940d79a9	WIP - manual embedder - multi embedders OK - clippy + tests OK	2023-12-14 16:08:41 +01:00
Louis Dureuil	922a640188	WIP multi embedders fixed template bugs	2023-12-14 16:08:41 +01:00
Louis Dureuil	d4715e0c4d	Fix same vector sort bug	2023-12-14 16:08:41 +01:00
Louis Dureuil	11e2a2c1aa	Fix geosort bug	2023-12-14 16:08:41 +01:00
Louis Dureuil	65e49b7092	Remove stuff, add distribution shift (WIP)	2023-12-14 16:08:38 +01:00
Louis Dureuil	e56f160032	Actually pass embedders on reindex	2023-12-14 16:07:49 +01:00
Louis Dureuil	687d92f217	prompt bifluor+	2023-12-14 16:07:49 +01:00
Louis Dureuil	fb539f61fe	WIP	2023-12-14 16:07:49 +01:00
Louis Dureuil	cb4ebe163e	WIP	2023-12-14 16:07:49 +01:00
Louis Dureuil	dde3a04679	WIP arroy integration	2023-12-14 16:07:49 +01:00
Louis Dureuil	13c2c6c16b	Small commit to add hybrid search and autoembedding	2023-12-14 16:07:48 +01:00
Clément Renault	56571f762a	Merge remote-tracking branch 'origin/main' into tmp-release-v1.5.1	2023-12-13 11:57:01 +01:00
ManyTheFish	467b49153d	Implement proximityPrecision setting on milli side	2023-12-06 15:49:02 +01:00
ManyTheFish	bddc168d83	List TODOs	2023-12-06 14:59:23 +01:00
ManyTheFish	3b3fa38f27	Put the restrict list in a sub-struct	2023-11-28 18:37:57 +01:00
ManyTheFish	d6c2ee15a9	Filter on attributes before computing the docids when attribute restriction is on	2023-11-28 14:55:29 +01:00
Clément Renault	ec9b52d608	Rename copy_to_path to copy_to_file	2023-11-28 14:32:30 +01:00
Clément Renault	34c67ac389	Remove the possibility to fail fetching the env info	2023-11-28 14:31:23 +01:00
Clément Renault	d050c9b4ae	Only remap the main database once	2023-11-28 14:27:30 +01:00
Clément Renault	7dd1226faf	Clarify an unreachable unwrap	2023-11-28 14:26:31 +01:00
Clément Renault	548c8247c2	Create and use real error types in the codecs	2023-11-28 10:11:17 +01:00
Clément Renault	d32eb11329	Move to the v0.20.0-alpha.9 of heed	2023-11-27 11:52:22 +01:00
Clément Renault	58dac8af42	Remove the panics and unwraps	2023-11-23 15:00:48 +01:00
Clément Renault	0dbf1a16ff	Make clippy happy	2023-11-23 14:11:38 +01:00
Clément Renault	462b4c0080	Fix the tests	2023-11-23 12:07:35 +01:00
Clément Renault	0d4482625a	Make the changes to use heed v0.20-alpha.6	2023-11-23 11:43:58 +01:00
Clément Renault	7cb7e37ba8	Merge branch 'main' into tmp-release-v1.5.0	2023-11-21 16:30:46 +01:00
ManyTheFish	d3575fb028	Make into_del_add_obkv parameters more human readable	2023-11-20 16:10:39 +01:00
ManyTheFish	39cbb499c2	Small fixes	2023-11-20 10:20:39 +01:00
ManyTheFish	ebef6bc24d	Simplify documents database writing	2023-11-20 10:14:57 +01:00
ManyTheFish	d59b7db8d0	remove unused code	2023-11-20 10:10:45 +01:00
ManyTheFish	263e825619	Fix typos in comments	2023-11-20 10:06:29 +01:00
Many the fish	b0adc73ce6	Merge pull request #4207 from meilisearch/diff-indexing-prefix-databases Diff indexing prefix databases	2023-11-14 16:04:05 +01:00
Louis Dureuil	772964125d	Factor removal of document from DB	2023-11-13 13:51:22 +01:00
Louis Dureuil	378deb0bef	Rename trait	2023-11-13 13:38:36 +01:00
ManyTheFish	1f36410541	Update tests	2023-11-13 13:36:39 +01:00
Louis Dureuil	8c649d8061	Throw error when the vector search is sent with the wrong size	2023-11-13 09:57:42 +01:00
Louis Dureuil	264b10ec20	Fixup documentation	2023-11-09 16:23:20 +01:00
Louis Dureuil	3053e01c05	Batch::remove_documents_from_db_no_batch	2023-11-09 14:23:02 +01:00
Louis Dureuil	b11c2afac0	Index::external_id_of	2023-11-09 14:22:43 +01:00
Louis Dureuil	9cef800b2a	Enrich uses the new type	2023-11-09 14:22:05 +01:00
Louis Dureuil	db2fb86b8b	Extract PrimaryKey logic to a type	2023-11-09 14:19:16 +01:00
ManyTheFish	882ab9cc85	remove warnings	2023-11-09 11:35:33 +01:00
ManyTheFish	5a9c96e1db	Compute word integer prefix cache	2023-11-09 11:34:26 +01:00
ManyTheFish	70ce40828c	Compute word docids prefix cache	2023-11-08 17:01:00 +01:00
ManyTheFish	688266c83e	Remove word pair proximity prefix cache and compute it at search time	2023-11-08 14:16:01 +01:00
ManyTheFish	6dab826908	Reactivate prefix databases	2023-11-08 13:58:01 +01:00
ManyTheFish	1e2fbc6a42	revert "REVERT ME: ignore prefix pair databases tests" This reverts commit `1b2ea6cf19`.	2023-11-08 11:50:52 +01:00
Louis Dureuil	cbaa54cafd	Fix clippy issues	2023-11-06 11:19:31 +01:00
Louis Dureuil	1bccf2079e	Correctly mark non-tests as non-tests	2023-11-06 11:03:56 +01:00
ManyTheFish	1b2ea6cf19	REVERT ME: ignore prefix pair databases tests	2023-11-06 10:46:22 +01:00
Louis Dureuil	1ad1fcc8c8	Remove all warnings	2023-11-06 10:31:14 +01:00
ManyTheFish	87610a5f98	Don't try to delete a document that is not in the database	2023-11-02 16:49:03 +01:00
Clément Renault	ff522c919d	Fix the vector extractions for the diff indexing	2023-11-02 15:58:08 +01:00
ManyTheFish	bf0651f23c	Implement iter method on ExternalDocumentsIds	2023-11-02 15:38:00 +01:00
ManyTheFish	5b20e625f3	fix merge	2023-11-02 15:31:37 +01:00
ManyTheFish	bc51d6157a	Fix transform reindexing path	2023-11-02 15:26:20 +01:00
ManyTheFish	1b4ff991c0	update typed chunks	2023-11-02 15:26:20 +01:00
ManyTheFish	4b64c33aa2	update vector extractor	2023-11-02 15:26:20 +01:00
ManyTheFish	12323d610e	Change the original document sorter key from the internal docid to a concatenation of the internal and the external docid	2023-11-02 15:26:20 +01:00
Clément Renault	4d864f0702	Always sort internal Sorter entries in parallel	2023-11-02 14:47:43 +01:00
Clément Renault	c71b1d33ae	Sort entries using rayon in the transform sorters	2023-11-01 11:07:16 +01:00
Clément Renault	0fc446c62f	Add more timing logs to the Transform	2023-11-01 11:07:16 +01:00
Louis Dureuil	0fb6acefc3	Add snapshots for facets	2023-10-31 17:11:08 +01:00
Louis Dureuil	b1d1355b69	remove tests on soft-deleted	2023-10-31 16:36:27 +01:00
Louis Dureuil	f19332466e	Extract field value as values instead of Option<Value>	2023-10-31 16:36:27 +01:00
Louis Dureuil	03ddb4f310	use deladd in facet update tests	2023-10-31 16:36:27 +01:00
Louis Dureuil	c855cc2721	Remove unused test	2023-10-31 16:36:27 +01:00
Louis Dureuil	da0503ef80	Fix document count	2023-10-31 16:36:27 +01:00
ManyTheFish	94206b0055	Update tests	2023-10-31 13:48:47 +01:00
Louis Dureuil	b40253bf18	update snapshots	2023-10-31 10:30:48 +01:00
Louis Dureuil	d8bf3f3fc2	Remove unused snapshots	2023-10-31 10:12:49 +01:00
Louis Dureuil	9d59e8011a	fix some tests	2023-10-31 10:08:36 +01:00
Louis Dureuil	dad78cbf8d	Bulk facet remove deletes keys from DB when value empty	2023-10-31 09:53:55 +01:00
Louis Dureuil	4e91707a06	Rename test	2023-10-31 09:41:17 +01:00
Louis Dureuil	de10f20732	Fix field distribution again	2023-10-30 17:47:22 +01:00
Louis Dureuil	be395c7944	Change order of arguments to tokenizer_builder	2023-10-30 16:26:29 +01:00
Louis Dureuil	9fedd8101a	Fix tests	2023-10-30 15:11:07 +01:00
Louis Dureuil	54d07a8da3	Update field distribution taking into account both deletions and additions	2023-10-30 14:47:51 +01:00
Louis Dureuil	58690dfb19	Fix tests compilation after changes to ExternalDocumentsIds API	2023-10-30 13:34:07 +01:00
Louis Dureuil	abf424ebfc	Remove unused FromIterator	2023-10-30 11:41:56 +01:00
Clément Renault	dfab6293c9	Use an LMDB database to store the external documents ids	2023-10-30 11:41:23 +01:00
Louis Dureuil	fdf3f7f627	Fix facet distribution test	2023-10-30 11:41:23 +01:00
Louis Dureuil	6260cff65f	Actually delete documents from DB when the merge function says so	2023-10-30 11:41:22 +01:00
Louis Dureuil	8e0d9c9a5e	Recover delete_documents tests that were too eagerly deleted	2023-10-30 11:41:22 +01:00
Louis Dureuil	ae4ec8ea55	Add delete_document_using_wtxn to TempIndex	2023-10-30 11:41:22 +01:00
Louis Dureuil	9a2dccc3bc	Add iterator to find external ids of a bitmap of internal ids	2023-10-30 11:41:22 +01:00
Louis Dureuil	a35988550c	Fix some snapshots	2023-10-30 11:41:22 +01:00
Louis Dureuil	e78281785c	Actually execute the transform even if there are only documents to delete	2023-10-30 11:41:22 +01:00
Louis Dureuil	3c15881818	Add simple delete test	2023-10-30 11:41:22 +01:00
Louis Dureuil	73c06d31d9	snapshot always display stuff in consistent order	2023-10-30 11:41:22 +01:00
Louis Dureuil	290e773d23	remove more warnings and fix some tests	2023-10-30 11:41:22 +01:00
Louis Dureuil	fa6c7f65ca	Add TmpIndex::delete_documents	2023-10-30 11:41:22 +01:00
Louis Dureuil	113527f466	Remove soft-deleted related methods from Index	2023-10-30 11:41:22 +01:00
Louis Dureuil	c534a1b687	Stop using delete documents pipeline in batch runner	2023-10-30 11:41:22 +01:00
Louis Dureuil	2263dff02b	Stop using removed delete pipelines almost everywhere	2023-10-30 11:41:22 +01:00
Louis Dureuil	d651b3ef01	Remove delete documents files	2023-10-30 11:41:20 +01:00
ManyTheFish	762b0b47e6	Use deladd merging function in chunks mergers	2023-10-30 11:40:20 +01:00
Louis Dureuil	01d5eedf2f	Remove some warnings	2023-10-30 11:40:20 +01:00
Louis Dureuil	073f89db79	Fix facet tests	2023-10-30 11:40:20 +01:00
Louis Dureuil	8370fbc92b	Fix snaps	2023-10-30 11:40:20 +01:00
Louis Dureuil	85f42fbc03	Handle external to internal id mapping from TypedChunk::Documents	2023-10-30 11:40:20 +01:00
Louis Dureuil	c6b3c18c85	WIP: Comment out document deletion in other pipelines than update TODO: fix calls to DELETE route	2023-10-30 11:40:20 +01:00
Louis Dureuil	bafeb892a7	Modify Index after changes to ExternalDocumentsIds	2023-10-30 11:40:20 +01:00
Louis Dureuil	8fb221dae3	Refactor ExternalDocumentsIds - Remove soft deleted - Add apply method that takes a list of operations to encapsulate modifications to the external -> internal mapping	2023-10-30 11:40:20 +01:00
Louis Dureuil	946c762d28	WIP: reset documents in TypedChunk::Documents	2023-10-30 11:40:20 +01:00
Louis Dureuil	cda6ca1ee6	Remove TypedChunk::NewDocumentIds	2023-10-30 11:40:18 +01:00
Louis Dureuil	696fcf4d18	Fix document insertion into LMDB	2023-10-30 11:39:31 +01:00
ManyTheFish	476e4d3dbe	Use value buffer instead of the initial value when writing the final result in the sorter	2023-10-30 11:39:31 +01:00
Clément Renault	576fa9c6da	Remove useless comment	2023-10-30 11:39:31 +01:00
Kerollmops	77dcbff6b2	Remove and Insert the DelAdd geo points	2023-10-30 11:39:31 +01:00
Kerollmops	544440c363	Ignore geo fields when the Del and Add content is the same	2023-10-30 11:39:31 +01:00
Clément Renault	a3dae4db9b	Extract the geo fields DelAdd and generate a new DelAdd obkv with it	2023-10-30 11:39:31 +01:00
ManyTheFish	ba90a5ec0e	update extract fid word count docids	2023-10-30 11:39:31 +01:00
Louis Dureuil	b26dc9aabe	Explanatory code comment	2023-10-30 11:39:31 +01:00
Louis Dureuil	66abac9364	Use specialized `KvReaderDelAdd` type Co-authored-by: Clément Renault <clement@meilisearch.com>	2023-10-30 11:39:31 +01:00
Louis Dureuil	59f88c14b3	Simplify facet update after removing `Index::faceted_documents_ids`	2023-10-30 11:39:29 +01:00
Louis Dureuil	14832cb324	Remove Index::faceted_documents_ids	2023-10-30 11:37:32 +01:00
Louis Dureuil	04ec293024	Facet Incremental update	2023-10-30 11:37:30 +01:00
Louis Dureuil	f67ff3a738	Facets Bulk update	2023-10-30 11:36:40 +01:00
Clément Renault	560e8f5613	Introduce the CboRoaringBitmapCodec merge_deladd_into and use it	2023-10-30 11:34:55 +01:00
Clément Renault	2d3f15f82c	Introduce a function to only serialize the Add side of a DelAdd obkv	2023-10-30 11:34:55 +01:00
Clément Renault	40186bf403	Rename FieldIdWordCountDocids correctly	2023-10-30 11:34:50 +01:00
ManyTheFish	87e3d27878	update extract word pair proximity to support deladd obkvs	2023-10-30 11:34:02 +01:00
ManyTheFish	6bcf8b4f8c	update extract word position docids	2023-10-30 11:34:02 +01:00
ManyTheFish	46aa75abdb	update extract word docids	2023-10-30 11:34:02 +01:00
ManyTheFish	2597bbd107	Make script language docids map taking a tuple of roaring bitmaps expressing the deletions and the additions	2023-10-30 11:34:00 +01:00
Clément Renault	e2bc054604	Update extract_facet_string_docids to support deladd obkvs	2023-10-30 11:32:36 +01:00
Clément Renault	fcd3a1434d	Update extract_facet_number_docids to support deladd obkvs	2023-10-30 11:31:04 +01:00
Clément Renault	a82dee21e0	Rename docid_fid into fid_docid	2023-10-30 11:31:02 +01:00
Clément Renault	bc45c1206d	Implement all the facet extraction paths and simplify them	2023-10-30 11:29:08 +01:00
Clément Renault	6ae4100f07	Generate the DelAdd for is_null, is_empty, and exists	2023-10-30 11:29:08 +01:00
Clément Renault	0c47defeee	Work on fid docid facet values rewrite	2023-10-30 11:29:06 +01:00
ManyTheFish	313b16bec2	Support diff indexing on extract_docid_word_positions	2023-10-30 11:24:19 +01:00
ManyTheFish	1dd97578a8	Make the transform struct return diff-based documents obkvs	2023-10-30 11:22:07 +01:00
ManyTheFish	f5ef69293b	deactivate prefix dbs	2023-10-30 11:22:07 +01:00
ManyTheFish	1c5705c164	clean PR warnings	2023-10-30 11:22:05 +01:00
ManyTheFish	66c2c82a18	Split wpp in several sorters	2023-10-30 11:15:02 +01:00
ManyTheFish	28a8d0ccda	Fix word pair proximity	2023-10-30 11:15:02 +01:00
ManyTheFish	96be85396d	Use a vecDeque in wpp database	2023-10-30 11:15:02 +01:00
ManyTheFish	df9e5c8651	Generalize usage of CboRoaringBitmap codec to ease the use	2023-10-30 11:15:02 +01:00
ManyTheFish	b541d48847	Add buffer to the obkv writer	2023-10-30 11:15:02 +01:00
ManyTheFish	8ccf32d1a0	Compute word_fid_docids before word_docids and exact_word_docids	2023-10-30 11:15:02 +01:00
ManyTheFish	db1ca21231	add puffin in sorter into reader function	2023-10-30 11:15:00 +01:00
ManyTheFish	11ea5acff9	Fix	2023-10-30 11:13:10 +01:00
ManyTheFish	8d77736a67	Fix fid_word_docids	2023-10-30 11:13:10 +01:00
ManyTheFish	748b333161	Add useful debug assert before key insertion in database	2023-10-30 11:13:10 +01:00
ManyTheFish	17b647dfe5	Wip	2023-10-30 11:13:08 +01:00
Tamo	e7244aa485	fix warnings	2023-10-30 11:00:46 +01:00
Louis Dureuil	2bae9550c8	Add explanatory comment	2023-10-23 12:06:28 +02:00
Vivek Kumar	5fe7c4545a	compute all candidates correctly when skipping	2023-10-23 12:02:45 +02:00
meili-bors[bot]	5e0485d8dd	Merge #4131

4131: Reduce proximity range from 7 to 3 r=Kerollmops a=ManyTheFish

## Summary

This PR aims to reduce the impact of the proximity databases on the indexing time and on the database size by reducing the maximum distance between two words to be indexed in the proximity database.

## Stats

### Impact on database size and indexing time

![Impact on datasets](https://github.com/meilisearch/meilisearch/assets/6482087/28ed3d96-bdde-41c1-bdac-e90c1b1dbb23)

### Impact on search relevancy

<details>

| dataset_name | host_name | Relevancy rate (Precision) | completion_rate 25.00% | completion_rate 50.00% | completion_rate 75.00% | completion_rate 100.00% |
|--------------|------------------|------------------------------------|-----------------|-----------------|-----------------|-----------------|
| FBIS | 1_4_0 | percentile-10 | 0.00% | 0.00% | 0.00% | 0.00% |
| FBIS | 1_4_0 | percentile-25 | 0.00% | 0.00% | 0.00% | 0.00% |
| FBIS | 1_4_0 | percentile-50 | 0.00% | 0.00% | 5.00% | 5.56% |
| FBIS | 1_4_0 | percentile-75 | 0.00% | 12.50% | 35.00% | 45.00% |
| FBIS | 1_4_0 | percentile-90 | 20.00% | 40.00% | | 100.00% |
| FBIS | 1_4_0 | average | 5.78% | 11.16% | 21.90% | 26.29% |
| FBIS | reduce_proximity | percentile-10 | 0.00% | 0.00% | 0.00% | 0.00% |
| FBIS | reduce_proximity | percentile-25 | 0.00% | 0.00% | 0.00% | 0.00% |
| FBIS | reduce_proximity | percentile-50 | 0.00% | 0.00% | 5.00% | 5.56% |
| FBIS | reduce_proximity | percentile-75 | 0.00% | 15.00% | 35.00% | 40.00% |
| FBIS | reduce_proximity | percentile-90 | 20.00% | 40.00% | 85.00% | 100.00% |
| FBIS | reduce_proximity | average | 5.55% | 11.34% | 21.75% | 26.14% |
| FR94 | 1_4_0 | percentile-10 | 0.00% | 0.00% | 0.00% | 0.00% |
| FR94 | 1_4_0 | percentile-25 | 0.00% | 0.00% | 0.00% | 0.00% |
| FR94 | 1_4_0 | percentile-50 | 0.00% | 0.00% | 0.00% | 0.00% |
| FR94 | 1_4_0 | percentile-75 | 0.00% | 5.00% | 15.00% | 42.11% |
| FR94 | 1_4_0 | percentile-90 | 15.00% | 54.55% | 100.00% | 100.00% |
| FR94 | 1_4_0 | average | 5.95% | 12.07% | 18.70% | 25.57% |
| FR94 | reduce_proximity | percentile-10 | 0.00% | 0.00% | 0.00% | 0.00% |
| FR94 | reduce_proximity | percentile-25 | 0.00% | 0.00% | 0.00% | 0.00% |
| FR94 | reduce_proximity | percentile-50 | 0.00% | 0.00% | 0.00% | 0.00% |
| FR94 | reduce_proximity | percentile-75 | 0.00% | 5.00% | 15.00% | 42.11% |
| FR94 | reduce_proximity | percentile-90 | 15.00% | 54.55% | 100.00% | 100.00% |
| FR94 | reduce_proximity | average | 5.79% | 12.00% | 18.70% | 25.53% |
| FT | 1_4_0 | percentile-10 | 0.00% | 0.00% | 0.00% | 0.00% |
| FT | 1_4_0 | percentile-25 | 0.00% | 0.00% | 0.00% | 0.00% |
| FT | 1_4_0 | percentile-50 | 0.00% | 0.00% | 5.00% | 10.00% |
| FT | 1_4_0 | percentile-75 | 0.00% | 15.00% | 30.00% | 40.00% |
| FT | 1_4_0 | percentile-90 | 20.00% | 50.00% | 65.00% | 100.00% |
| FT | 1_4_0 | average | 5.08% | 12.58% | 20.00% | 25.49% |
| FT | reduce_proximity | percentile-10 | 0.00% | 0.00% | 0.00% | 0.00% |
| FT | reduce_proximity | percentile-25 | 0.00% | 0.00% | 0.00% | 0.00% |
| FT | reduce_proximity | percentile-50 | 0.00% | 0.00% | 5.00% | 10.00% |
| FT | reduce_proximity | percentile-75 | 0.00% | 15.00% | 30.00% | 40.00% |
| FT | reduce_proximity | percentile-90 | 10.00% | 45.00% | 60.00% | 100.00% |
| FT | reduce_proximity | average | 5.01% | 12.64% | 20.10% | 25.53% |
| LAT | 1_4_0 | percentile-10 | 0.00% | 0.00% | 0.00% | 0.00% |
| LAT | 1_4_0 | percentile-25 | 0.00% | 0.00% | 0.00% | 0.00% |
| LAT | 1_4_0 | percentile-50 | 0.00% | 0.00% | 5.00% | 5.00% |
| LAT | 1_4_0 | percentile-75 | 5.00% | 15.00% | 30.00% | 30.00% |
| LAT | 1_4_0 | percentile-90 | 15.00% | 45.00% | 60.00% | 80.00% |
| LAT | 1_4_0 | average | 4.80% | 11.80% | 17.88% | 21.62% |
| LAT | reduce_proximity | percentile-10 | 0.00% | 0.00% | 0.00% | 0.00% |
| LAT | reduce_proximity | percentile-25 | 0.00% | 0.00% | 0.00% | 0.00% |
| LAT | reduce_proximity | percentile-50 | 0.00% | 0.00% | 5.00% | 5.00% |
| LAT | reduce_proximity | percentile-75 | 0.00% | 11.11% | 25.00% | 35.00% |
| LAT | reduce_proximity | percentile-90 | 15.00% | 45.00% | 55.00% | 80.00% |
| LAT | reduce_proximity | average | 4.43% | 11.23% | 17.32% | 21.45% |

</details>

### Impact on Search time

| dataset_name | host_name | 25.00% | 50.00% | 75.00% | 100.00% | Average |
|--------------|------------------|------------:|------------:|------------:|------------:|-------------|
| FBIS | 1_4_0 | 3.45 | 7.446666667 | 9.773489933 | 9.620300752 | 7.572614338 |
| FBIS | reduce_proximity | 2.983333333 | 5.316666667 | 6.911073826 | 7.637218045 | 5.712072968 |
| FR94 | 1_4_0 | 2.236666667 | 4.45 | 5.523489933 | 4.560150376 | 4.192576744 |
| FR94 | reduce_proximity | 2.09 | 3.991666667 | 4.981543624 | 4.266917293 | 3.832531896 |
| FT | 1_4_0 | 5.956666667 | 9.656666667 | 13.86912752 | 10.83270677 | 10.0787919 |
| FT | reduce_proximity | 4.51 | 5.981666667 | 7.701342282 | 6.766917293 | 6.23998156 |
| LAT | 1_4_0 | 5.856666667 | 9.233333333 | 12.98322148 | 10.78759398 | 9.715203865 |
| LAT | reduce_proximity | 6.91 | 6.706666667 | 8.463087248 | 8.265037594 | 7.586197877 |

## Technical approach

- Ensure the MAX_DISTANCE constant is used everywhere needed
- Reduce the MAX_DISTANCE from 8 to 4

## Related

TBD

Co-authored-by: ManyTheFish <many@meilisearch.com>	2023-10-18 14:56:08 +00:00
ManyTheFish	27eec21415	Fix tests	2023-10-18 16:03:22 +02:00
Clément Renault	62dfd09dc6	Add more puffin logs to the deletion functions	2023-10-13 13:11:09 +02:00
meili-bors[bot]	f343ef5f2f	Merge #4108 4108: Fix bug where search with distinct attribute and no ranking, returns offset+limit hits r=curquiza a=vivek-26 # Pull Request ## Related issue Fixes #4078 ## What does this PR do? This PR - - Fixes bug where search with distinct attribute and no ranking, returns offset+limit hits. - Adds unit and integration tests. ## PR checklist Please check if your PR fulfills the following requirements: - [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)? - [x] Have you read the contributing guidelines? - [x] Have you made sure that the title is accurate and descriptive of the changes? Thank you so much for contributing to Meilisearch! Co-authored-by: Vivek Kumar <vivek.26@outlook.com>	2023-10-12 07:51:29 +00:00
Vivek Kumar	d4da06ff47	fix bug where distinct search with no ranking returns offset+limit hits	2023-10-11 19:02:16 +05:30
Tamo	c0f2724c2d	get rid of the newly introduced error code in favor of an io::Error	2023-10-10 15:12:23 +02:00
Tamo	d772073dfa	use a bufreader every time there is a grenad<file>	2023-10-10 15:00:30 +02:00
ManyTheFish	43989fe2e4	Reduce proximity range from 7 to 3	2023-10-03 12:16:48 +02:00
meili-bors[bot]	487d493f49	Merge #4043 4043: Bring back hotfixes from v1.3.3 into v1.4.0 r=Kerollmops a=curquiza Co-authored-by: curquiza <curquiza@users.noreply.github.com> Co-authored-by: meili-bors[bot] <89034592+meili-bors[bot]@users.noreply.github.com> Co-authored-by: Kerollmops <clement@meilisearch.com> Co-authored-by: curquiza <clementine@meilisearch.com>	2023-09-11 12:27:34 +00:00
Vivek Kumar	abfa7ded25	use a new temp index in the test	2023-09-08 12:32:47 +05:30
Vivek Kumar	f2837aaec2	add another test case	2023-09-08 11:39:54 +05:30
Vivek Kumar	11df155598	fix highlighting bug when searching for a phrase with cropping	2023-09-08 11:39:52 +05:30
meili-bors[bot]	256cf33bca	Merge #4039 4039: Fix multiple vectors dimensions r=ManyTheFish a=Kerollmops This PR fixes #4035, making providing multiple vectors in documents possible. This is fixed by extracting the vectors from the non-flattened version of the documents. Co-authored-by: Kerollmops <clement@meilisearch.com>	2023-09-07 09:25:58 +00:00
Kerollmops	679c0b0f97	Extract the vectors from the non-flattened version of the documents	2023-09-06 12:26:00 +02:00
Kerollmops	e02d0064bd	Add a test case scenario	2023-09-06 12:26:00 +02:00
meili-bors[bot]	dc3d9c90d9	Merge #3994 3994: Fix synonyms with separators r=Kerollmops a=ManyTheFish # Pull Request ## Related issue Fixes #3977 ## Available prototype ``` $ docker pull getmeili/meilisearch:prototype-fix-synonyms-with-separators-0 ``` ## What does this PR do? - add a new test - filter the empty synonyms after normalization Co-authored-by: ManyTheFish <many@meilisearch.com>	2023-09-05 14:42:46 +00:00
ManyTheFish	66aa6d5871	Ignore tokens with empty normalized value during indexing process	2023-09-05 15:44:14 +02:00
Kerollmops	8ac5b765bc	Fix synonyms normalization	2023-09-04 16:12:48 +02:00
Kerollmops	085aad0a94	Add a test	2023-09-04 14:39:33 +02:00
meili-bors[bot]	ccf3ba3f32	Merge #4019 4019: Bringing back changes from `v1.3.2` onto `main` r=irevoire a=Kerollmops Co-authored-by: Kerollmops <clement@meilisearch.com> Co-authored-by: meili-bors[bot] <89034592+meili-bors[bot]@users.noreply.github.com> Co-authored-by: irevoire <irevoire@users.noreply.github.com> Co-authored-by: Clément Renault <clement@meilisearch.com>	2023-08-28 12:14:11 +00:00
Clément Renault	8c0ebd1331	Update milli/src/search/new/bucket_sort.rs Co-authored-by: Louis Dureuil <louis@meilisearch.com>	2023-08-23 16:40:39 +02:00
Kerollmops	5130e06b41	Temporarily disable an assert in the ranking rules	2023-08-23 16:11:54 +02:00
meili-bors[bot]	914b125c5f	Merge #3945 3945: Do not leak field information on error r=Kerollmops a=vivek-26 # Pull Request ## Related issue Fixes #3865 ## What does this PR do? This PR ensures that `InvalidSortableAttribute` and `InvalidFacetSearchFacetName` errors do not leak field information, i.e. fields which are not part of `displayedAttributes` in the settings are hidden from the error message. ## PR checklist Please check if your PR fulfills the following requirements: - [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)? - [x] Have you read the contributing guidelines? - [x] Have you made sure that the title is accurate and descriptive of the changes? Thank you so much for contributing to Meilisearch! Co-authored-by: Vivek Kumar <vivek.26@outlook.com>	2023-08-22 18:55:27 +00:00
Kerollmops	c53841e166	Accept the null JSON value as the value of _vectors	2023-08-14 16:03:55 +02:00
meili-bors[bot]	e4e49e63d0	Merge #3993 3993: Bringing back changes from v1.3.1 to `main` r=irevoire a=curquiza Co-authored-by: irevoire <irevoire@users.noreply.github.com> Co-authored-by: meili-bors[bot] <89034592+meili-bors[bot]@users.noreply.github.com> Co-authored-by: Tamo <tamo@meilisearch.com> Co-authored-by: ManyTheFish <many@meilisearch.com>	2023-08-10 14:30:02 +00:00
ManyTheFish	5a7c1bde84	Fix clippy	2023-08-10 11:27:56 +02:00
ManyTheFish	6b2d671be7	Fix PR comments	2023-08-10 10:44:07 +02:00
Many the fish	43c13faeda	Update milli/src/update/index_documents/extract/extract_docid_word_positions.rs Co-authored-by: Tamo <tamo@meilisearch.com>	2023-08-10 10:05:03 +02:00
meili-bors[bot]	44c1900f36	Merge #3986 3986: Fix geo bounding box with strings r=ManyTheFish a=irevoire # Pull Request When sending a document with one geofield of type string (i.e.: `{ "_geo": { "lat": 12, "lng": "13" }}`), the geobounding box would exclude this document. This PR fixes this issue by automatically parsing the string value in case we're working on a geofield. ## Related issue Fixes https://github.com/meilisearch/meilisearch/issues/3973 ## What does this PR do? - Automatically parse the facet value iif we're working on a geofield. - Make insta works with snapshots in loops or closure executed multiple times. (you may need to update your cli if it panics after this PR: `cargo install cargo-insta`). - Add one integration test in milli and in meilisearch to ensure it works forever. - Add three snapshots for the dump that mysteriously disappeared I don't know how Co-authored-by: Tamo <tamo@meilisearch.com>	2023-08-09 07:58:15 +00:00
ManyTheFish	8dc5acf998	Try fix	2023-08-08 16:52:36 +02:00
ManyTheFish	35758db9ec	Truncate the normalized long facets used in search for facet value	2023-08-08 16:38:30 +02:00
Tamo	4988199bb9	ensure the geoboundingbox works with strings and int geofields in milli and meilisearch	2023-08-08 16:29:25 +02:00
Tamo	9d061cec26	automatically parse the filterable attribute to float if it's a geo field	2023-08-08 16:28:07 +02:00
ManyTheFish	4a21fecf67	Merge branch 'main' into settings-customizing-tokenization	2023-08-08 16:08:16 +02:00
Vivek Kumar	dd57873f8e	hide fields not in the displayedAttributes list from errors	2023-08-05 16:03:10 +05:30
ManyTheFish	b45c36cd71	Merge branch 'main' into tmp-release-v1.3.0	2023-08-01 15:05:17 +02:00
ManyTheFish	9d5e3457e5	Fix clippy	2023-07-27 14:21:19 +02:00
meili-bors[bot]	939b2fc6fd	Merge #3949 3949: Fix score details casing r=Kerollmops a=ManyTheFish # Pull Request Fixes #3941 Co-authored-by: ManyTheFish <many@meilisearch.com>	2023-07-26 14:14:59 +00:00
ManyTheFish	b0c1a9504a	ensure the synonyms are updated when the tokenizer settings are changed	2023-07-26 09:33:42 +02:00
meili-bors[bot]	be72be7c0d	Merge #3942 3942: Normalize for the search the facets values r=ManyTheFish a=Kerollmops This PR improves and fixes the search for facet values feature. Searching for _bre_ wasn't returning facet values like _brévent_ or _brô_. The issue was related to the fact that facets are normalized but not in the same way as the `searchableAttributes` are. We decided to normalize them further and add another intermediate database where the key is the normalized facet value, and the value is a set of the non-normalized facets. We then use these non-normalized ones to get the correct counts by fetching the associated databases. ### What's missing in this PR? - [x] Apply the change to the whole set of `SearchForFacetValue::execute` conditions. - [x] Factorize the code that does an intermediate normalized value fetch in a function. - [x] Add or modify the search for facet value test. Co-authored-by: Clément Renault <clement@meilisearch.com> Co-authored-by: Kerollmops <clement@meilisearch.com>	2023-07-25 14:37:17 +00:00
ManyTheFish	88559a2d54	Fix score details casing	2023-07-25 15:49:33 +02:00
ManyTheFish	d57026cd96	Support synonyms synergies	2023-07-25 15:01:42 +02:00
Kerollmops	29ab54b259	Replace the hnsw crate by the instant-distance one	2023-07-25 12:37:35 +02:00
ManyTheFish	d4ff59fcf5	Fix clippy	2023-07-24 18:42:26 +02:00
ManyTheFish	9c485f8563	Make the search and the indexing work	2023-07-24 18:35:20 +02:00
Kerollmops	691a536893	Implement the facet search with the normalized index	2023-07-24 17:56:17 +02:00
ManyTheFish	d8d12d5979	Be able to set and reset settings	2023-07-24 17:00:18 +02:00
Clément Renault	df528b41d8	Normalize for the search the facets values	2023-07-20 17:57:07 +02:00
Kerollmops	eef95de30e	First iteration on exposing puffin profiling	2023-07-18 17:38:13 +02:00
Kerollmops	d383afc82b	Fix the geo sort when lat and lng are strings	2023-07-17 18:28:04 +02:00
Louis Dureuil	4310928803	Fixes #3912	2023-07-12 10:08:56 +02:00
Louis Dureuil	74315b4ea8	Fixes #3911	2023-07-12 10:08:29 +02:00
Louis Dureuil	40fa59d64c	Sort by lexicographic order after normalization	2023-07-10 09:26:59 +02:00
Louis Dureuil	55cd7738b9	Update snapshots	2023-07-04 16:31:01 +02:00
Louis Dureuil	48409c9183	Add missing exactness.matchingWords, exactness.maxMatchingWords	2023-07-04 16:31:01 +02:00
Louis Dureuil	324d448236	Format let-else ❤️ 🎉	2023-07-03 10:20:28 +02:00
meili-bors[bot]	661d1f90dc	Merge #3866 3866: Update charabia v0.8.0 r=dureuill a=ManyTheFish # Pull Request Update Charabia: - enhance Japanese segmentation - enhance Latin Tokenization - words containing `_` are now properly segmented into several words - brackets `{([])}` are no more considered as context separators so word separated by brackets are now considered near together for the proximity ranking rule - fixes #3815 - fixes #3778 - fixes [product#151](https://github.com/meilisearch/product/discussions/151) > Important note: now the float numbers are segmented around the `.` so `3.22` is segmented as [`3`, `.`, `22`] but the middle dot isn't considered as a hard separator, which means that if we search `3.22` we find documents containing `3.22` Co-authored-by: ManyTheFish <many@meilisearch.com>	2023-06-29 15:24:36 +00:00
ManyTheFish	6ec7541026	Update inta snapshots	2023-06-29 17:18:39 +02:00
ManyTheFish	a82c49ab08	Update test	2023-06-29 15:56:36 +02:00
ManyTheFish	84845de9ef	Update Charabia	2023-06-29 15:56:32 +02:00
Clément Renault	7c157fc442	Document that the LevelEntry fields order is important	2023-06-29 14:33:32 +02:00
Clément Renault	0b97596c93	Replace unwraps with ?	2023-06-29 14:33:32 +02:00
Clément Renault	a0e0fce677	Simplify a Rust lifetime trick	2023-06-29 14:33:32 +02:00
Clément Renault	b951830461	Add more tests	2023-06-29 14:33:32 +02:00
Kerollmops	b132e859f7	Make clippy happy	2023-06-29 14:33:31 +02:00
Kerollmops	9917bf046a	Move the sortFacetValuesBy in the faceting settings	2023-06-29 14:33:31 +02:00
Kerollmops	d9fea0143f	Make Clippy happy	2023-06-29 14:33:31 +02:00
Kerollmops	a385642ec3	Replace the BTreeMap by an IndexMap to return values in order	2023-06-29 14:33:31 +02:00
Kerollmops	34b2e98fe9	Expose a sortFacetValuesBy parameter to the user	2023-06-29 14:33:00 +02:00
Kerollmops	80bbd4b6f3	Clean and make the facet order configurable internally	2023-06-29 14:31:17 +02:00
Kerollmops	f42bef2f66	Make the search to always return the facets ordered by count	2023-06-29 14:31:17 +02:00
Kerollmops	bd3c026406	First to-test version of the algorithm	2023-06-29 14:31:17 +02:00
Kerollmops	84f8938f33	Rename facet distribution to be explicit on the order to find them	2023-06-29 14:31:15 +02:00
Clément Renault	efbe7ce78b	Clean the facet string FSTs when we clear the documents	2023-06-28 15:36:32 +02:00
Kerollmops	26f0fa678d	Change the error message when a facet is not searchable	2023-06-28 15:06:09 +02:00
Kerollmops	60ddd53439	Return one of the original facet values when doing a facet search	2023-06-28 15:06:09 +02:00
Kerollmops	2bcd8d2983	Make sure the facet queries are normalized	2023-06-28 15:06:09 +02:00
Kerollmops	41760a9306	Introduce a new invalid_facet_search_facet_name error code	2023-06-28 15:06:07 +02:00
Kerollmops	e9a3029c30	Use the right field id to write the string facet values FST	2023-06-28 15:01:51 +02:00
Kerollmops	ed0ff47551	Return an empty list of results if attribute is set as filterable	2023-06-28 15:01:51 +02:00
Clément Renault	e1b8fb48ee	Use the minWordSizeForTypos index settings	2023-06-28 15:01:51 +02:00
Clément Renault	87e22e436a	Fix compilation issues	2023-06-28 15:01:51 +02:00
Clément Renault	0252cfe8b6	Simplify the placeholder search of the facet-search route	2023-06-28 15:01:50 +02:00
Clément Renault	f35ad96afa	Use the disableOnAttributes parameter on the facet-search route	2023-06-28 15:01:50 +02:00
Clément Renault	2ceb781c73	Use the disableOnWords parameter on the facet-search route	2023-06-28 15:01:50 +02:00
Clément Renault	7bd67543dd	Support the typoTolerant.enabled parameter	2023-06-28 15:01:50 +02:00
Clément Renault	8e86eb91bb	Log an error when a facet value is missing from the database	2023-06-28 15:01:50 +02:00
Clément Renault	55c17aa38b	Rename the SearchForFacetValues struct	2023-06-28 15:01:50 +02:00
Clément Renault	aadbe88048	Return an internal error when a field id is missing	2023-06-28 15:01:50 +02:00
Clément Renault	f36de2115f	Make clippy happy	2023-06-28 15:01:50 +02:00
Clément Renault	702041b7e1	Improve the returned errors from the facet-search route	2023-06-28 15:01:48 +02:00
Clément Renault	a05074e675	Fix the max number of facets to be returned to 100	2023-06-28 14:58:42 +02:00
Clément Renault	93f30e65a9	Return the correct response JSON object from the facet-search route	2023-06-28 14:58:42 +02:00
Clément Renault	e81809aae7	Make the search for facet work	2023-06-28 14:58:41 +02:00
Kerollmops	ce7e7f12c8	Introduce the facet search route	2023-06-28 14:58:41 +02:00
Kerollmops	addb21f110	Restrict the number of facet search results to 1000	2023-06-28 14:58:41 +02:00
Kerollmops	c34de05106	Introduce the SearchForFacetValue struct	2023-06-28 14:58:41 +02:00
Clément Renault	15a4c05379	Store the facet string values in multiple FSTs	2023-06-28 14:58:41 +02:00
meili-bors[bot]	d4f10800f2	Merge #3834 3834: Define searchable fields at runtime r=Kerollmops a=ManyTheFish ## Summary This feature allows the end-user to search in one or multiple attributes using the search parameter `attributesToSearchOn`: ```json { "q": "Captain Marvel", "attributesToSearchOn": ["title"] } ``` This feature act like a filter, forcing Meilisearch to only return the documents containing the requested words in the attributes-to-search-on. Note that, with the matching strategy `last`, Meilisearch will only ensure that the first word is in the attributes-to-search-on, but, the retrieved documents will be ordered taking into account the word contained in the attributes-to-search-on. ## Trying the prototype A dedicated docker image has been released for this feature: #### last prototype version: ```bash docker pull getmeili/meilisearch:prototype-define-searchable-fields-at-search-time-1 ``` #### others prototype versions: ```bash docker pull getmeili/meilisearch:prototype-define-searchable-fields-at-search-time-0 ``` ## Technical Detail The attributes-to-search-on list is given to the search context, then, the search context uses the `fid_word_docids`database using only the allowed field ids instead of the global `word_docids` database. This is the same for the prefix databases. The database cache is updated with the merged values, meaning that the union of the field-id-database values is only made if the requested key is missing from the cache. ### Relevancy limits Almost all ranking rules behave as expected when ordering the documents. Only `proximity` could miss-order documents if all the searched words are in the restricted attribute but a better proximity is found in an ignored attribute in a document that should be ranked lower. 
I put below a failing test showing it: ```rust #[actix_rt::test] async fn proximity_ranking_rule_order() { let server = Server::new().await; let index = index_with_documents( &server, &json!([ { "title": "Captain super mega cool. A Marvel story", // Perfect distance between words in an ignored attribute "desc": "Captain Marvel", "id": "1", }, { "title": "Captain America from Marvel", "desc": "a Shazam ersatz", "id": "2", }]), ) .await; // Document 2 should appear before document 1. index .search(json!({"q": "Captain Marvel", "attributesToSearchOn": ["title"], "attributesToRetrieve": ["id"]}), |response, code| { assert_eq!(code, 200, "{}", response); assert_eq!( response["hits"], json!([ {"id": "2"}, {"id": "1"}, ]) ); }) .await; } ``` Fixing this would force us to create a `fid_word_pair_proximity_docids` and a `fid_word_prefix_pair_proximity_docids` databases which may multiply the keys of `word_pair_proximity_docids` and `word_prefix_pair_proximity_docids` by the number of attributes in the searchable_attributes list. If we think we should fix this test, I'll suggest doing it in another PR. ## Related Fixes #3772 Co-authored-by: Tamo <tamo@meilisearch.com> Co-authored-by: ManyTheFish <many@meilisearch.com>	2023-06-28 08:19:23 +00:00
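The "Technical Detail" paragraph of the #3834 merge above (read from `fid_word_docids` for the allowed field ids only, and cache the merged result) can be sketched roughly as follows. This is an assumption-laden illustration: `RestrictedSearchContext` is a hypothetical type, a `HashMap` stands in for the database, and `BTreeSet<u32>` stands in for whatever document-id set type the engine actually uses.

```rust
use std::collections::{BTreeSet, HashMap};

type FieldId = u16;
type DocId = u32;

/// Simplified, hypothetical stand-in for a search context restricted to some field ids.
struct RestrictedSearchContext {
    /// Stand-in for the `fid_word_docids` database: (field id, word) -> document ids.
    fid_word_docids: HashMap<(FieldId, String), BTreeSet<DocId>>,
    /// Field ids the query is allowed to search in.
    allowed_fids: Vec<FieldId>,
    /// Cache of merged results, filled lazily.
    cache: HashMap<String, BTreeSet<DocId>>,
}

impl RestrictedSearchContext {
    /// Returns the documents containing `word` in at least one allowed field.
    /// The union over field ids is computed only when the word is missing from the cache.
    fn word_docids(&mut self, word: &str) -> &BTreeSet<DocId> {
        if !self.cache.contains_key(word) {
            let mut merged = BTreeSet::new();
            for fid in &self.allowed_fids {
                if let Some(docids) = self.fid_word_docids.get(&(*fid, word.to_string())) {
                    merged.extend(docids.iter().copied());
                }
            }
            self.cache.insert(word.to_string(), merged);
        }
        &self.cache[word]
    }
}
```

With only the `title` field allowed, a word that appears solely in `desc` resolves to an empty set, which is what makes the parameter behave like a filter.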
Clément Renault	30741d17fa	Change the TODO message	2023-06-27 12:32:43 +02:00
Clément Renault	ebad1f396f	Remove the useless euclidean distance implementation	2023-06-27 12:32:43 +02:00
Clément Renault	29d8268c94	Fix the vector query part by using the correct universe	2023-06-27 12:32:43 +02:00
Clément Renault	63bfe1cee2	Ignore when there are too many vectors	2023-06-27 12:32:43 +02:00
Kerollmops	7c2f5f77b8	Make clippy and fmt happy	2023-06-27 12:32:42 +02:00
Kerollmops	66b8cfd8c8	Introduce a way to store the HNSW on multiple LMDB entries	2023-06-27 12:32:42 +02:00
Kerollmops	ff3664431f	Make rustfmt happy	2023-06-27 12:32:42 +02:00
Kerollmops	531748c536	Return a user error when the _vectors type is invalid	2023-06-27 12:32:41 +02:00
Kerollmops	7aa1275337	Display the _semanticSimilarity even if the `_vectors` field is not displayed	2023-06-27 12:32:41 +02:00
Kerollmops	737aec1705	Expose an _semanticSimilarity as a dot product in the documents	2023-06-27 12:32:41 +02:00
Kerollmops	3e3c743392	Make Rustfmt happy	2023-06-27 12:32:41 +02:00
Kerollmops	5c5a4e075d	Make clippy happy	2023-06-27 12:32:41 +02:00
Kerollmops	ab9f2269aa	Normalize the vectors during indexation and search	2023-06-27 12:32:41 +02:00
Kerollmops	321ec5f3fa	Accept multiple vectors by documents using the _vectors field	2023-06-27 12:32:40 +02:00
Kerollmops	717d4fddd4	Remove the unused distance	2023-06-27 12:32:40 +02:00
Kerollmops	a7e0f0de89	Introduce a new error message for invalid vector dimensions	2023-06-27 12:32:40 +02:00
Kerollmops	3b560ef7d0	Make clippy happy	2023-06-27 12:32:40 +02:00
Kerollmops	2cf747cb89	Fix the tests	2023-06-27 12:32:40 +02:00
Kerollmops	3c31e1cdd1	Support more pages but in an ugly way	2023-06-27 12:32:39 +02:00
Kerollmops	23eaaf1001	Change the name of the distance module	2023-06-27 12:32:39 +02:00
Kerollmops	c2a402f3ae	Implement an ugly deletion of values in the HNSW	2023-06-27 12:32:39 +02:00
Kerollmops	436a10bef4	Replace the euclidean with a dot product	2023-06-27 12:32:39 +02:00
Kerollmops	8debf6fe81	Use a basic euclidean distance function	2023-06-27 12:32:39 +02:00
Kerollmops	c79e82c62a	Move back to the hnsw crate This reverts commit 7a4b6c065482f988b01298642f4c18775503f92f.	2023-06-27 12:32:39 +02:00
Kerollmops	aca305bb77	Log more to make sure we insert vectors in the hgg data-structure	2023-06-27 12:32:38 +02:00
Kerollmops	5816008139	Introduce an optimized version of the euclidean distance function	2023-06-27 12:32:38 +02:00
Kerollmops	268a9ef416	Move to the hgg crate	2023-06-27 12:32:38 +02:00
Clément Renault	642b0f3a1b	Expose a new vector field on the search route	2023-06-27 12:32:38 +02:00
Clément Renault	4571e512d2	Store the vectors in an HNSW in LMDB	2023-06-27 12:32:38 +02:00
Clément Renault	7ac2f1489d	Extract the vectors from the documents	2023-06-27 12:32:37 +02:00
Clément Renault	34349faeae	Create a new _vector extractor	2023-06-27 12:32:37 +02:00
ManyTheFish	63ca25290b	Take into account small Review requests	2023-06-26 14:56:19 +02:00
ManyTheFish	59f64a5256	Return an error when an attribute is not searchable	2023-06-26 14:56:19 +02:00
ManyTheFish	42709ea9a5	Fix clippy warnings	2023-06-26 14:55:57 +02:00
ManyTheFish	fb8fa07169	Restrict field ids in search context	2023-06-26 14:55:57 +02:00
ManyTheFish	0ccf1e2e40	Allow the search cache to store owned values	2023-06-26 14:55:57 +02:00
ManyTheFish	9680e1e41f	Introduce a BytesDecodeOwned trait in heed_codecs	2023-06-26 14:55:14 +02:00
ManyTheFish	461b5118bd	Add API search setting	2023-06-26 14:55:14 +02:00
Tamo	a3716c5678	add the new parameter to the search builder of milli	2023-06-26 14:55:14 +02:00
meili-bors[bot]	2d34005965	Merge #3821 3821: Add normalized and detailed scores to documents returned by a query r=dureuill a=dureuill # Pull Request ## Related issue Fixes #3771 ## What does this PR do? ### User standpoint <details> <summary>Request ranking score</summary> ``` echo '{ "q": "Badman dark knight returns", "showRankingScore": true, "limit": 10, "attributesToRetrieve": ["title"] }' \| mieli search -i index-word-count-10-count ``` </details> <details> <summary>Response</summary> ```json { "hits": [ { "title": "Batman: The Dark Knight Returns, Part 1", "_rankingScore": 0.947520325203252 }, { "title": "Batman: The Dark Knight Returns, Part 2", "_rankingScore": 0.947520325203252 }, { "title": "Batman Unmasked: The Psychology of the Dark Knight", "_rankingScore": 0.6657594086021505 }, { "title": "Legends of the Dark Knight: The History of Batman", "_rankingScore": 0.6654905913978495 }, { "title": "Angel and the Badman", "_rankingScore": 0.2196969696969697 }, { "title": "Angel and the Badman", "_rankingScore": 0.2196969696969697 }, { "title": "Batman", "_rankingScore": 0.11553030303030302 }, { "title": "Batman Begins", "_rankingScore": 0.11553030303030302 }, { "title": "Batman Returns", "_rankingScore": 0.11553030303030302 }, { "title": "Batman Forever", "_rankingScore": 0.11553030303030302 } ], "query": "Badman dark knight returns", "processingTimeMs": 12, "limit": 10, "offset": 0, "estimatedTotalHits": 46 } ``` </details> - If adding a `showRankingScore` parameter to the search query, then documents returned by a search now contain an additional field `_rankingScore` that is a float bigger than 0 and lower or equal to 1.0. This field represents the relevancy of the document, relatively to the search query and the settings of the index, with 1.0 meaning "perfect match" and 0 meaning "not matching the query" (Meilisearch should never return documents not matching the query at all). 
- The `sort` and `geosort` ranking rules do not influence the `_rankingScore`. <details> <summary>Request detailed ranking scores</summary> ``` echo '{ "q": "Badman dark knight returns", "showRankingScoreDetails": true, "limit": 5, "attributesToRetrieve": ["title"] }' \| mieli search -i index-word-count-10-count ``` </details> <details> <summary>Response</summary> ```json { "hits": [ { "title": "Batman: The Dark Knight Returns, Part 1", "_rankingScoreDetails": { "words": { "order": 0, "matchingWords": 4, "maxMatchingWords": 4, "score": 1.0 }, "typo": { "order": 1, "typoCount": 1, "maxTypoCount": 4, "score": 0.8 }, "proximity": { "order": 2, "score": 0.9545454545454546 }, "attribute": { "order": 3, "attributes_ranking_order": 1.0, "attributes_query_word_order": 0.926829268292683, "score": 0.926829268292683 }, "exactness": { "order": 4, "matchType": "noExactMatch", "score": 0.26666666666666666 } } }, { "title": "Batman: The Dark Knight Returns, Part 2", "_rankingScoreDetails": { "words": { "order": 0, "matchingWords": 4, "maxMatchingWords": 4, "score": 1.0 }, "typo": { "order": 1, "typoCount": 1, "maxTypoCount": 4, "score": 0.8 }, "proximity": { "order": 2, "score": 0.9545454545454546 }, "attribute": { "order": 3, "attributes_ranking_order": 1.0, "attributes_query_word_order": 0.926829268292683, "score": 0.926829268292683 }, "exactness": { "order": 4, "matchType": "noExactMatch", "score": 0.26666666666666666 } } }, { "title": "Batman Unmasked: The Psychology of the Dark Knight", "_rankingScoreDetails": { "words": { "order": 0, "matchingWords": 3, "maxMatchingWords": 4, "score": 0.75 }, "typo": { "order": 1, "typoCount": 1, "maxTypoCount": 3, "score": 0.75 }, "proximity": { "order": 2, "score": 0.6666666666666666 }, "attribute": { "order": 3, "attributes_ranking_order": 1.0, "attributes_query_word_order": 0.8064516129032258, "score": 0.8064516129032258 }, "exactness": { "order": 4, "matchType": "noExactMatch", "score": 0.25 } } }, { "title": "Legends of the Dark 
Knight: The History of Batman", "_rankingScoreDetails": { "words": { "order": 0, "matchingWords": 3, "maxMatchingWords": 4, "score": 0.75 }, "typo": { "order": 1, "typoCount": 1, "maxTypoCount": 3, "score": 0.75 }, "proximity": { "order": 2, "score": 0.6666666666666666 }, "attribute": { "order": 3, "attributes_ranking_order": 1.0, "attributes_query_word_order": 0.7419354838709677, "score": 0.7419354838709677 }, "exactness": { "order": 4, "matchType": "noExactMatch", "score": 0.25 } } }, { "title": "Angel and the Badman", "_rankingScoreDetails": { "words": { "order": 0, "matchingWords": 1, "maxMatchingWords": 4, "score": 0.25 }, "typo": { "order": 1, "typoCount": 0, "maxTypoCount": 1, "score": 1.0 }, "proximity": { "order": 2, "score": 1.0 }, "attribute": { "order": 3, "attributes_ranking_order": 1.0, "attributes_query_word_order": 0.8181818181818182, "score": 0.8181818181818182 }, "exactness": { "order": 4, "matchType": "noExactMatch", "score": 0.3333333333333333 } } } ], "query": "Badman dark knight returns", "processingTimeMs": 9, "limit": 5, "offset": 0, "estimatedTotalHits": 46 } ``` </details> - If adding a `showRankingScoreDetails` parameter to the search query, then the returned documents will now contain an additional `_rankingScoreDetails` field that is a JSON object containing one field per ranking rule that was applied, whose value is a JSON object with the following fields: - `order`: a number indicating the order this rule was applied (0 is the first applied ranking rule) - `score` (except for `sort` and `geosort`): a float indicating how the document matched this particular rule. - other fields that are specific to the rule, indicating for example how many words matched for a document and how many typos were counted in a matching document. - If the `displayableAttributes` list is defined in the settings of the index, any ranking rule using an attribute not part of that list will be marked as `<hidden-rule>` in the `_rankingScoreDetails`. 
- Search queries that are part of a `multi-search` requests are modified in the same way and each of the queries can take the `showRankingScore` and `showRankingScoreDetails` parameters independently. The results are still returned in separate lists and providing a unified list of results between multiple queries is not in the scope of this PR (but is unblocked by this PR and can be done manually by using the scores of the various documents). ### Implementation standpoint - Fix difference in how the position of terms were computed at indexing time and query time: this difference meant that a query containing a hard separator would fail the exactness check. - Fix the id reported by the sort ranking rule (very minor) - Change how the cost of removing words is computed. After this change the cost no longer works for any other ranking rule than `words`. Also made `words` have a cost of 0 such that the entire cost of `words` is given by the termRemovalStrategy. The new cost computation makes it so the score is computed in a way consistent with the number of words in the query. Additionally, the words that appear in phrases in the query are also counted as matching words. - When any score computation is requested through `showRankingScore` or `showRankingScoreDetails`, remove optimization where ranking rules are not executed on buckets of a single document: this is important to allow the computation of an accurate score. - add virtual conditions to fid and position to always have the max cost: this ensures that the score is independent from the dataset - the Position ranking rule now takes into account the distance to the position of the word in the query instead of the distance to the position 0. - modified proximity ranking rule cost calculation so that the cost is 0 for documents that are perfectly matching the query - Add a new `milli::score_details` module containing all the types that are involved in score computation. 
- Make it so a bucket of result now contains a `ScoreDetails` and changed the ranking rules to produce their `ScoreDetails`. - Expose the scores in the REST API. - Add very light analytics for scoring. - Update the search tests to add the expected scores. Co-authored-by: Louis Dureuil <louis@meilisearch.com>	2023-06-26 09:32:43 +00:00
meili-bors[bot]	040b5a5b6f	Merge #3842 3842: fix some typos r=dureuill a=cuishuang # Pull Request ## Related issue Fixes #<issue_number> ## What does this PR do? - fix some typos ## PR checklist Please check if your PR fulfills the following requirements: - [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)? - [x] Have you read the contributing guidelines? - [x] Have you made sure that the title is accurate and descriptive of the changes? Thank you so much for contributing to Meilisearch! Co-authored-by: cui fliter <imcusg@gmail.com>	2023-06-22 18:01:10 +00:00
cui fliter	530a3e2df3	fix some typos Signed-off-by: cui fliter <imcusg@gmail.com>	2023-06-22 21:59:00 +08:00
Louis Dureuil	d26e9a96ec	Add score details to new search tests	2023-06-22 12:39:14 +02:00
Louis Dureuil	49c8bc4de6	Fix tests	2023-06-22 12:39:14 +02:00
Louis Dureuil	da833eb095	Expose the scores and detailed scores in the API	2023-06-22 12:39:14 +02:00
Louis Dureuil	701d44bd91	Store the scores for each bucket Remove optimization where ranking rules are not executed on buckets of a single document when the score needs to be computed	2023-06-22 12:39:14 +02:00
Louis Dureuil	c621a250a7	Score for graph based ranking rules Count phrases in matchingWords and maxMatchingWords	2023-06-22 12:39:14 +02:00
Louis Dureuil	8939e85f60	Add rank_to_score for graph based ranking rules	2023-06-22 12:39:14 +02:00
Louis Dureuil	fa41d2489e	Score for sort	2023-06-22 12:39:14 +02:00
Louis Dureuil	59c5b992c2	Score for geosort	2023-06-22 12:39:14 +02:00
Louis Dureuil	2ea8194c18	Score for exact_attributes	2023-06-22 12:39:14 +02:00
Louis Dureuil	421df64602	RankingRuleOutput now contains a Score	2023-06-22 12:39:14 +02:00
Louis Dureuil	c0fca6f884	Add score_details	2023-06-22 12:39:14 +02:00
Louis Dureuil	f050634b1e	add virtual conditions to fid and position to always have the max cost	2023-06-20 10:07:18 +02:00
Louis Dureuil	becf1f066a	Change how the cost of removing words is computed	2023-06-20 09:45:43 +02:00
Louis Dureuil	701d299369	Remove out-of-date comment	2023-06-20 09:45:42 +02:00
Louis Dureuil	a20e4d447c	Position now takes into account the distance to the position of the word in the query it used to be based on the distance to the position 0	2023-06-20 09:45:42 +02:00
Louis Dureuil	af57c3c577	Proximity costs 0 for documents that are perfectly matching	2023-06-20 09:45:42 +02:00
Louis Dureuil	0c40ef6911	Fix sort id	2023-06-20 09:45:42 +02:00
meili-bors[bot]	45636d315c	Merge #3670 3670: Fix addition deletion bug r=irevoire a=irevoire The first commit of this PR is a revert of https://github.com/meilisearch/meilisearch/pull/3667. It re-enable the auto-batching of addition and deletion of tasks. No new changes have been introduced outside of `milli`. So all the changes you see on the autobatcher have actually already been reviewed. It fixes https://github.com/meilisearch/meilisearch/issues/3440. ### What was happening? The issue was that the `external_documents_ids` generated in the `transform` were used in a very strange way that wasn’t compatible with the deletion of documents. Instead of doing a clear merge between the external document IDs of the DB and the one returned by the transform + writing it on disk, we were doing some weird tricks with the soft-deleted to avoid writing the fst on disk as much as possible. The new algorithm may be a bit slower but is way more straightforward and doesn’t change depending on if the soft deletion was used or not. Here is a list of the changes introduced: 1. We now do a clear distinction between the `new_external_documents_ids` coming from the transform and only held on RAM and the `external_documents_ids` coming from the DB. 2. The `new_external_documents_ids` (coming out of the transform) are now represented as an `fst`. We don't need to struggle with the hard, soft distinction + the soft_deleted => That's easier to understand 3. When indexing documents, we merge the `external_documents_ids` coming from the DB and the `new_external_documents_ids` coming from the transform. ### Other things introduced in this PR Since we constantly have to write small, very specialized fuzzers for this kind of bug, we decided to push the one used to reproduce this bug. It's not perfect, but it's easy to improve in the future. It'll also run for as long as possible on every merge on the main branch. 
Co-authored-by: Tamo <tamo@meilisearch.com> Co-authored-by: Loïc Lecrenier <loic.lecrenier@icloud.com>	2023-06-19 09:09:30 +00:00
meili-bors[bot]	cb9d78fc7f	Merge #3835 3835: Add more documentation to graph-based ranking rule algorithms + comment cleanup r=Kerollmops a=loiclec In addition to documenting the `cheapest_path.rs` file, this PR cleans up a few outdated comments as well as some TODOs. These TODOs have been moved to https://github.com/meilisearch/meilisearch/issues/3776 Co-authored-by: Loïc Lecrenier <loic.lecrenier@icloud.com>	2023-06-15 15:30:24 +00:00
Louis Dureuil	e0c4682758	Fix tests	2023-06-14 13:30:52 +02:00
Louis Dureuil	d9b4b39922	Add trailing pipe to the snapshots so it doesn't end with trailing whitespace	2023-06-14 13:30:52 +02:00
Loïc Lecrenier	2da86b31a6	Remove comments and add documentation	2023-06-14 12:39:42 +02:00
Louis Dureuil	a2a3b8c973	Fix offset difference between query and indexing for hard separators	2023-06-08 12:07:12 +02:00
Louis Dureuil	9f37b61666	DB BREAKING: raise limit of word count from 10 to 30.	2023-06-08 12:07:12 +02:00
Louis Dureuil	c15c076da9	DB BREAKING: Count the number of words in field_id_word_count_docids	2023-06-08 12:07:11 +02:00
Loïc Lecrenier	8628a0c856	Remove docid_word_positions_db + fix deletion bug That would happen when a word was deleted from all exact attributes but not all regular attributes.	2023-06-07 10:52:50 +02:00
Clémentine U. - curqui	f3e2f79290	Merge branch 'main' into tmp-release-v1.2.0	2023-06-05 18:36:28 +02:00
Kerollmops	da04edff8c	Better use deserialize_unchecked_from to reduce the deserialization time	2023-05-30 14:58:30 +02:00
Louis Dureuil	1dfc4038ab	Add test that fails before PR and passes now	2023-05-29 11:58:26 +02:00
Louis Dureuil	73198179f1	Consistently use wrapping add to avoid overflow in debug when query starts with a separator	2023-05-29 11:54:12 +02:00
meili-bors[bot]	2e49d6aec1	Merge #3768 3768: Fix bugs in graph-based ranking rules + make `words` a graph-based ranking rule r=dureuill a=loiclec This PR contains three changes: ## 1. Don't call the `words` ranking rule if the term matching strategy is `All` This is because the purpose of `words` is only to remove nodes from the query graph. It would never do any useful work when the matching strategy was `All`. Remember that the universe was already computed before by computing all the docids corresponding to the "maximally reduced" query graph, which, in the case of `All`, is equal to the original graph. ## 2. The `words` ranking rule is replaced by a graph-based ranking rule. This is for three reasons: 1. performance: graph-based ranking rules benefit from a lot of optimisations by default, which ensures that they are never too slow. The previous implementation of `words` could call `compute_query_graph_docids` many times if some words had to be removed from the query, which would be quite expensive. I was especially worried about its performance in cases where it is placed right after the `sort` ranking rule. Furthermore, `compute_query_graph_docids` would clone a lot of bitmaps many times unnecessarily. 2. consistency: every other ranking rule (except `sort`) is graph-based. It makes sense to implement `words` like that as well. It will automatically benefit from all the features, optimisations, and bug fixes that all the other ranking rules get. 3. surfacing bugs: as the first ranking rule to be called (most of the time), I'd like `words` to behave the same as the other ranking rules so that we can quickly detect bugs in our graph algorithms. This actually already happened, which is why this PR also contains a bug fix. ## 3. Fix the `update_all_costs_before_nodes` function It is a bit difficult to explain what was wrong, but I'll try. 
The bug happened when we had graphs like: <img width="730" alt="Screenshot 2023-05-16 at 10 58 57" src="https://github.com/meilisearch/meilisearch/assets/6040237/40db1a68-d852-4e89-99d5-0d65757242a7"> and we gave the node `is` as argument. Then, we'd walk backwards from the node breadth-first. We'd update the costs of: 1. `sun` 2. `thesun` 3. `start` 4. `the` which is an incorrect order. The correct order is: 1. `sun` 2. `thesun` 3. `the` 4. `start` That is, we can only update the cost of a node when all of its successors have either already been visited or were not affected by the update to the node passed as argument. To solve this bug, I factored out the graph-traversal logic into a `traverse_breadth_first_backward` function. Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com> Co-authored-by: Louis Dureuil <louis@meilisearch.com>	2023-05-23 13:28:08 +00:00
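The ordering constraint stated in the #3768 merge above (a node's cost may only be updated once all of its affected successors have been visited) can be illustrated with a small sketch. This is not the actual `traverse_breadth_first_backward` function: `backward_update_order` is a hypothetical helper that assumes an acyclic graph given as plain adjacency lists and simply returns the visiting order.

```rust
use std::collections::{HashSet, VecDeque};

/// Returns the ancestors of `target` in an order where a node appears only after
/// all of its successors that can reach `target` have appeared.
/// `successors[n]` lists the nodes that `n` points to; the graph must be acyclic.
fn backward_update_order(successors: &[Vec<usize>], target: usize) -> Vec<usize> {
    let node_count = successors.len();

    // Build the reverse adjacency list.
    let mut predecessors: Vec<Vec<usize>> = vec![Vec::new(); node_count];
    for (node, succs) in successors.iter().enumerate() {
        for &succ in succs {
            predecessors[succ].push(node);
        }
    }

    // Nodes affected by an update to `target`: the target and everything that can reach it.
    let mut affected = HashSet::from([target]);
    let mut stack = vec![target];
    while let Some(node) = stack.pop() {
        for &pred in &predecessors[node] {
            if affected.insert(pred) {
                stack.push(pred);
            }
        }
    }

    // For each node, count its affected successors that have not been visited yet.
    let mut pending: Vec<usize> = successors
        .iter()
        .map(|succs| succs.iter().filter(|&&s| affected.contains(&s)).count())
        .collect();

    // Walk backward, releasing a node only when its pending count drops to zero.
    let mut order = Vec::new();
    let mut queue = VecDeque::from([target]);
    while let Some(node) = queue.pop_front() {
        if node != target {
            order.push(node);
        }
        for &pred in &predecessors[node] {
            pending[pred] -= 1;
            if pending[pred] == 0 {
                queue.push_back(pred);
            }
        }
    }
    order
}
```

Numbering the nodes of the example as `start` = 0, `the` = 1, `sun` = 2, `thesun` = 3, `is` = 4, with edges `start → the → sun → is` and `start → thesun → is`, calling the helper on `is` gives `sun`, `thesun`, `the`, `start`: `start` is held back until both `the` and `thesun` have been handled.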
Louis Dureuil	51043f78f0	Remove trailing whitespace	2023-05-23 15:27:25 +02:00
Louis Dureuil	a490a11325	Add explanatory comment on the way we're recomputing costs	2023-05-23 15:24:24 +02:00
Tamo	602ad98cb8	improve the way we handle the fsts	2023-05-22 11:15:14 +02:00
Tamo	7f619ff0e4	get rids of the now unused soft_deletion_used parameter	2023-05-22 10:33:49 +02:00
Tamo	4391cba6ca	fix the addition + deletion bug	2023-05-17 18:28:57 +02:00
meili-bors[bot]	101f5a20d2	Merge #3757 3757: Adjust the cost of edges in the `position` ranking rule by bucketing positions more aggressively r=loiclec a=loiclec This PR significantly improves the performance of the `position` ranking rule when: 1. a query contains many words 2. the `position` ranking rule needs to be called many times 3. the score of the documents according to `position` is high These conditions greatly increase: 1. the number of edge traversals that are needed to find a valid path from the `start` node to the `end` node 2. the number of edges that need to be deleted from the graph, and therefore the number of times that we need to recompute all the possible costs from START to END As a result, a majority of the search time is spent in `visit_condition`, `visit_node`, and `update_all_costs_before_node`. This is frustrating because it often happens when the "universe" given to the rule consists of only a handful of document ids. By limiting the number of possible edges between two nodes from `20` to `10`, we: 1. reduce the number of possible costs from START to END 2. reduce the number of edges that will be deleted 3. make it faster to update the costs after deleting an edge 4. reduce the number of buckets that need to be computed In terms of relevancy, I don't think we lose or gain much. We still prefer terms that are in a lower positions, with decreasing precision as we go further. The previous choice of bucketing wasn't chosen in a principled way, and neither is this one. They both "feel" right to me. Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com> Co-authored-by: meili-bors[bot] <89034592+meili-bors[bot]@users.noreply.github.com>	2023-05-17 11:43:59 +00:00
Loïc Lecrenier	ec8f685d84	Fix bug in cheapest path algorithm	2023-05-16 17:01:30 +02:00

2085 Commits