meilisearch

mirror of https://github.com/meilisearch/meilisearch.git synced 2024-11-26 20:15:07 +08:00

Author	SHA1	Message	Date
dependabot[bot]	0177d66149	Bump Swatinem/rust-cache from 2.2.0 to 2.2.1 Bumps [Swatinem/rust-cache](https://github.com/Swatinem/rust-cache) from 2.2.0 to 2.2.1. - [Release notes](https://github.com/Swatinem/rust-cache/releases) - [Changelog](https://github.com/Swatinem/rust-cache/blob/master/CHANGELOG.md) - [Commits](https://github.com/Swatinem/rust-cache/compare/v2.2.0...v2.2.1) --- updated-dependencies: - dependency-name: Swatinem/rust-cache dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com>	2023-04-01 17:58:46 +00:00
Filip Bachul	1861c69964	fmt	2023-03-30 23:37:26 +02:00
Filip Bachul	cb2b5eb38e	handle _geoDistance(x,x) sort error	2023-03-30 23:21:23 +02:00
Filip Bachul	53aa0a1b54	handle _geo(x,x) sort error	2023-03-30 23:17:34 +02:00
bors[bot]	950f73b8bb	Merge #3623 3623: Update mini-dashboard to version v0.2.7 r=curquiza a=bidoubiwa ## Changes * Retrieve the API Key from the url parameters (#416) `@qdequele` ## 🐛 Bug Fixes * Fix show more button not displaying all fields (#419) `@bidoubiwa` Thanks again to `@bidoubiwa,` and `@qdequele!` 🎉 Co-authored-by: Charlotte Vermandel <charlottevermandel@gmail.com>	2023-03-30 08:31:29 +00:00
bors[bot]	7871d12025	Merge #3624 3624: Reduce the time to import a dump r=irevoire a=irevoire When importing a dump, this PR does multiple things; - Stops committing the changes between each task import - Stop deserializing + serializing every bitmap for every task Pros: Importing 1M tasks in a dump went from 3m36 on my computer to 6s Cons: We use slightly more memory, but since we’re using roaring bitmaps, that really shouldn’t be noticeable. Fixes #3620 Co-authored-by: Tamo <tamo@meilisearch.com>	2023-03-29 13:40:25 +00:00
Charlotte Vermandel	e7153e0a97	Update mini-dashboard to version V0.2.7	2023-03-29 14:49:39 +02:00
bors[bot]	37a24a4a05	Merge #3621 3621: Fix facet normalization r=Kerollmops a=ManyTheFish # Pull Request Make sure the facet normalization is the same between indexing and search. ## Related issue Fixes #3599 Co-authored-by: ManyTheFish <many@meilisearch.com>	2023-03-29 12:47:20 +00:00
Tamo	3fb67f94f7	Reduce the time to import a dump by caching some data With this commit, for a dump containing 1M tasks we went from 1m02 to 6s	2023-03-29 14:44:15 +02:00
ManyTheFish	6592746337	Fix other unrelated tests	2023-03-29 14:36:17 +02:00
Tamo	cf5145b542	Reduce the time to import a dump With this commit, for a dump containing 1M tasks we went from 3m36s to import the task queue down to 1m02s	2023-03-29 14:27:40 +02:00
ManyTheFish	efea1e5837	Fix facet normalization	2023-03-29 12:02:24 +02:00
ManyTheFish	b744f33530	Add test	2023-03-29 12:01:52 +02:00
bors[bot]	31bb61ba99	Merge #3608 3608: In a settings update, check to see if the primary key actually changes before erroring out r=irevoire a=GregoryConrad Previously, if the primary key was set and a Settings update contained a primary key, an error would be returned. However, this error is not needed if the new PK == the current PK. This PR just checks to see if the PK actually changes before raising an error. I came across this slight hiccup in https://github.com/GregoryConrad/mimir/issues/156#issuecomment-1484128654 Co-authored-by: Gregory Conrad <gregorysconrad@gmail.com>	2023-03-29 09:07:51 +00:00
bors[bot]	d4f54fc55e	Merge #3617 3617: update the geoBoundingBox feature r=dureuill a=irevoire Closing #3616 Implementing this change in the spec: `38a715c072` Now instead of using the (top_left, bottom_right) corners of the bounding box, it’s using the (top_right, bottom_left) corners. Co-authored-by: Tamo <tamo@meilisearch.com>	2023-03-29 07:01:17 +00:00
Tamo	a50b058557	update the geoBoundingBox feature Now instead of using the (top_left, bottom_right) corners of the bounding box it's using the (top_right, bottom_left) corners.	2023-03-28 18:26:18 +02:00
Gregory Conrad	e7994cdeb3	feat: check to see if the PK changed before erroring out Previously, if the primary key was set and a Settings update contained a primary key, an error would be returned. However, this error is not needed if the new PK == the current PK. This commit just checks to see if the PK actually changes before raising an error.	2023-03-26 12:18:39 -04:00
bors[bot]	514b60f8c8	Merge #3597 3597: ensure that the task queue is correctly imported r=irevoire a=irevoire ## Related issue Fixes #3596 I updated all the dump's integration tests to ensure that we're effectively able to query the tasks Co-authored-by: Tamo <tamo@meilisearch.com>	2023-03-21 17:31:26 +00:00
Tamo	a2b151e877	ensure that the task queue is correctly imported reduce the size of the snapshots file	2023-03-21 14:41:46 +01:00
bors[bot]	70c906d4b4	Merge #3576 3576: Add boolean support for csv documents r=irevoire a=irevoire Fixes https://github.com/meilisearch/meilisearch/issues/3572 ## What does this PR do? Add support for the boolean types in csv documents. The type definition is `boolean` and the possible values are - `true` for true - `false` for false - ` ` for null Here is an example: ```csv #id,cute:boolean 0,true 1,false 2, ``` Co-authored-by: Tamo <tamo@meilisearch.com>	2023-03-14 12:28:12 +00:00
Tamo	0f33a65468	makes kero happy	2023-03-13 16:51:11 +01:00
bors[bot]	7c9a8b1e1b	Merge #3587 3587: Enable cache again in test suite CI r=curquiza a=curquiza Following the change in this PR introduced in v1.1: https://github.com/meilisearch/meilisearch/pull/3422 The cache was removed due to failures (lack of space). Now the binary is smaller (from 250Mb to 50Mb) we want to try to enable the cache again. Indeed, without the cache step, the CIs are wayyyy slower (45min instead of 20-30min). For later: Rust 1.68 introduced a new way to fetch crates. Updating the rust version might also help in the future! Co-authored-by: curquiza <clementine@meilisearch.com>	2023-03-13 13:51:32 +00:00
curquiza	f45daf8031	Enable cache again in test suite CI	2023-03-13 14:24:15 +01:00
bors[bot]	fb1260ee88	Merge #3568 #3569 3568: CI: Fix `publish-aarch64` job that still uses ubuntu-18.04 r=Kerollmops a=curquiza Fixes #3563 Main change - add the usage of the `ubuntu-18.04` container instead of the native `ubuntu-18.04` of GitHub actions: I had to install docker in the container. Small additional changes - remove useless `fail-fast` and unused/irrelevant matrix inputs (`build`, `linker`, `os`, `use-cross`...) - Remove useless step in job Proof of work with this CI triggered on this current branch: https://github.com/meilisearch/meilisearch/actions/runs/4366233882 3569: Enhance Japanese language detection r=dureuill a=ManyTheFish # Pull Request This PR is a prototype and can be tested by downloading [the dedicated docker image](https://hub.docker.com/layers/getmeili/meilisearch/prototype-better-language-detection-0/images/sha256-a12847de00e21a71ab797879fd09777dadcb0881f65b5f810e7d1ed434d116ef?context=explore): ```bash $ docker pull getmeili/meilisearch:prototype-better-language-detection-0 ``` ## Context Some Languages are harder to detect than others, this miss-detection leads to bad tokenization making some words or even documents completely unsearchable. Japanese is the main Language affected and can be detected as Chinese which has a completely different way of tokenization. A [first iteration has been implemented for v1.1.0](https://github.com/meilisearch/meilisearch/pull/3347) but is an insufficient enhancement to make Japanese work. This first implementation was detecting the Language during the indexing to avoid bad detections during the search. Unfortunately, some documents (shorter ones) can be wrongly detected as Chinese running bad tokenization for these documents and making possible the detection of Chinese during the search because it has been detected during the indexing. 
For instance, a Japanese document `{"id": 1, "name": "東京スカパラダイスオーケストラ"}` is detected as Japanese during indexing, during the search the query `東京` will be detected as Japanese because only Japanese documents have been detected during indexing despite the fact that v1.0.2 would detect it as Chinese. However if in the dataset there is at least one document containing a field with only Kanjis like: _A document with only 1 field containing only Kanjis:_ ```json { "id":4, "name": "東京特許許可局" } ``` _A document with 1 field containing only Kanjis and 1 field containing several Japanese characters:_ ```json { "id":105, "name": "東京特許許可局", "desc": "日経平均株価は26日に約8カ月ぶりに2万4000円の心理的な節目を上回った。株高を支える材料のひとつは、自民党総裁選で3選を決めた安倍晋三首相の経済政策への期待だ。恩恵が見込まれるとされる人材サービスや建設株の一角が買われている。ただ思惑が先行して資金が集まっている面は否めない。実際に政策効果を取り込む企業はどこか、なお未知数だ。" } ``` Then, in both cases, the field `name` will be detected as Chinese during indexing allowing the search to detect Chinese in queries. Therefore, the query `東京` will be detected as Chinese and only the two last documents will be retrieved by Meilisearch. ## Technical Approach The current PR partially fixes these issues by: 1) Adding a check over potential miss-detections and rerunning the extraction of the document forcing the tokenization over the main Languages detected in it. > 1) run a first extraction allowing the tokenizer to detect any Language in any Script > 2) generate a distribution of tokens by Script and Languages (`script_language`) > 3) if for a Script we have a token distribution of one of the Language that is under the threshold, then we rerun the extraction forbidding the tokenizer to detect the marginal Languages > 4) the tokenizer will fall back on the other available Languages to tokenize the text. For example, if the Chinese were marginally detected compared to the Japanese on the CJ script, then the second extraction will force Japanese tokenization for CJ text in the document. 
however, the text on another script like Latin will not be impacted by this restriction. 2) Adding a filtering threshold during the search over Languages that have been marginally detected in documents ## Limits This PR introduces 2 arbitrary thresholds: 1) during the indexing, a Language is considered miss-detected if the number of detected tokens of this Language is under 10% of the tokens detected in the same Script (Japanese and Chinese are 2 different Languages sharing the "same" script "CJK"). 2) during the search, a Language is considered marginal if less than 5% of documents are detected as this Language. This PR only partially fixes these issues: - ✅ the query `東京` now finds Japanese documents if less than 5% of documents are detected as Chinese. - ✅ the document with the id `105` containing the Japanese field `desc` but the miss-detected field `name` is now completely detected and tokenized as Japanese and is found with the query `東京`. - ❌ the document with the id `4` no longer breaks the search Language detection but continues to be detected as a Chinese document and can't be found during the search. ## Related issue Fixes #3565 ## Possible future enhancements - Change or contribute to the Library used to detect the Language - the related issue on Whatlang: https://github.com/greyblake/whatlang-rs/issues/122 Co-authored-by: curquiza <clementine@meilisearch.com> Co-authored-by: ManyTheFish <many@meilisearch.com> Co-authored-by: Many the fish <many@meilisearch.com>	2023-03-09 15:34:35 +00:00
bors[bot]	48a51e5cd6	Merge #3577 3577: Avoid fetching an LMDB value with an empty string r=ManyTheFish a=Kerollmops # Pull Request ## Related issue Fixes #3574 ## What does this PR do? This PR fixes a bug where an empty key fetches an entry in the database. LMDB throws an error if an empty or too-long key is used to fetch an entry. This empty string seems to have been generated by the Charabia tokenizer. Co-authored-by: Clément Renault <clement@meilisearch.com>	2023-03-09 14:35:25 +00:00
ManyTheFish	2f8eb4f54a	last PR fixes	2023-03-09 15:34:36 +01:00
Many the fish	dea101e3d9	Update meilisearch/src/routes/indexes/mod.rs Co-authored-by: Louis Dureuil <louis@meilisearch.com>	2023-03-09 15:17:03 +01:00
Clément Renault	175e8a8495	Fix a diacritic issue	2023-03-09 14:57:47 +01:00
Clément Renault	6da54d0cb6	Add a test to fix a diacritic issue	2023-03-09 14:57:38 +01:00
bors[bot]	667bb87e35	Merge #3541 3541: Add cache on the indexes stats r=dureuill a=irevoire Fix https://github.com/meilisearch/meilisearch/issues/3540 Co-authored-by: Tamo <tamo@meilisearch.com> Co-authored-by: Louis Dureuil <louis@meilisearch.com>	2023-03-09 13:32:52 +00:00
bors[bot]	7935bef4cd	Merge #3567 3567: Clean CI file names r=curquiza a=curquiza Make the CI names more consistent to ease the Gillian's onboarding 😇 No impact for the users or the developers of the team Co-authored-by: curquiza <clementine@meilisearch.com>	2023-03-09 12:20:18 +00:00
Tamo	eddefb0e0f	refactor the error type of the milli::document thing silence a warning	2023-03-09 13:03:14 +01:00
ManyTheFish	dff2715ef3	Try removing needless collect	2023-03-09 11:28:10 +01:00
ManyTheFish	5deea631ea	fix clippy too many arguments	2023-03-09 11:19:13 +01:00
Tamo	c5f22be6e1	add boolean support for csv documents	2023-03-09 11:12:49 +01:00
ManyTheFish	b4b859ec8c	Fix typos	2023-03-09 10:58:35 +01:00
curquiza	febc8d1b52	Clean CI file names	2023-03-08 19:12:33 +01:00
curquiza	b99ef3d336	Update CI to still use ubuntu-18	2023-03-08 17:11:36 +01:00
ManyTheFish	7e2fd82e41	Use Language allow list in the highlighter	2023-03-08 12:44:16 +01:00
ManyTheFish	24c0775c67	Change indexing threshold	2023-03-08 12:36:04 +01:00
ManyTheFish	3092cf0448	Fix clippy errors	2023-03-08 10:53:42 +01:00
ManyTheFish	37d4551e8e	Add a threshold filtering the Languages allowed to be detected at search time	2023-03-07 19:38:01 +01:00
ManyTheFish	da48506f15	Rerun extraction when language detection might have failed	2023-03-07 18:35:26 +01:00
Louis Dureuil	2f5b9fbbd8	Restore contribution of the index sizes to the db size - the index size now contributes to the db size even if the index is not authorized	2023-03-07 14:05:27 +01:00
Louis Dureuil	7faa9a22f6	Pass IndexStat by ref in store_stats_of	2023-03-07 14:00:54 +01:00
bors[bot]	370d88f626	Merge #3561 3561: Fix the snapshots permissions on unix system r=irevoire a=irevoire # Pull Request ## Related issue Fixes https://github.com/meilisearch/meilisearch/issues/3507 The snapshot permissions were wrong after the v0.30 and the huge refacto of the index scheduler. Fix this issue + add a test on the permissions on unix Co-authored-by: Tamo <tamo@meilisearch.com>	2023-03-07 08:51:38 +00:00
Tamo	d34faa8f9c	put back the sleep as it was and fix the from	2023-03-06 18:09:09 +01:00
Tamo	e5d0bef6d8	update a comment	2023-03-06 17:04:24 +01:00
Louis Dureuil	76288fad72	Fix snapshots	2023-03-06 16:57:31 +01:00
Louis Dureuil	076a3d371c	Eagerly compute stats as fallback to the cache. - Refactor all around to avoid spawning indexes more times than necessary	2023-03-06 16:57:31 +01:00


7631 Commits
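
The indexing-time rule described in the merge message for #3569 (a Language is considered "miss-detected" when it accounts for under 10% of the tokens detected in its Script) can be illustrated with a minimal sketch. This is not the code from the PR: the function name `marginal_languages`, the constant name, and the use of plain strings for scripts and languages are all assumptions made for illustration. Only the 10% figure comes from the PR text.

```rust
use std::collections::HashMap;

/// Denominator of the indexing threshold from the PR text: 1/10 = 10%.
/// The constant name is hypothetical.
const INDEXING_THRESHOLD_DENOMINATOR: usize = 10;

/// Given token counts as `(script, language, tokens)` triples, returns the
/// `(script, language)` pairs whose share of their script's tokens is under
/// 10%. Hypothetical helper: plain strings stand in for whatever types the
/// real tokenizer uses.
pub fn marginal_languages<'a>(
    counts: &[(&'a str, &'a str, usize)],
) -> Vec<(&'a str, &'a str)> {
    // First pass: total number of tokens detected per script.
    let mut totals: HashMap<&'a str, usize> = HashMap::new();
    for &(script, _, n) in counts {
        *totals.entry(script).or_insert(0) += n;
    }

    // Second pass: keep the languages strictly under 10% of their script.
    // Integer form of `n / total < 1/10`, which avoids floating point.
    counts
        .iter()
        .filter(|&&(script, _, n)| n * INDEXING_THRESHOLD_DENOMINATOR < totals[script])
        .map(|&(script, language, _)| (script, language))
        .collect()
}
```

Under this sketch, a document whose CJ tokens are 95 Japanese and 5 Chinese would report Chinese as marginal, which corresponds to the case where the PR reruns the extraction with that Language forbidden. Tokens in another script, such as Latin, are counted separately and are not affected.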