meilisearch

mirror of https://github.com/meilisearch/meilisearch.git synced 2025-03-03 12:24:40 +08:00

3568: CI: Fix `publish-aarch64` job that still uses ubuntu-18.04 r=Kerollmops a=curquiza

Fixes #3563 

Main change
- use the `ubuntu-18.04` container instead of the native `ubuntu-18.04` runner of GitHub Actions, which required installing Docker in the container.

Small additional changes
- remove useless `fail-fast` and unused/irrelevant matrix inputs (`build`, `linker`, `os`, `use-cross`...)
- Remove useless step in job

Proof of work with this CI triggered on this current branch: https://github.com/meilisearch/meilisearch/actions/runs/4366233882

3569: Enhance Japanese language detection r=dureuill a=ManyTheFish

# Pull Request

This PR is a prototype and can be tested by downloading [the dedicated docker image](https://hub.docker.com/layers/getmeili/meilisearch/prototype-better-language-detection-0/images/sha256-a12847de00e21a71ab797879fd09777dadcb0881f65b5f810e7d1ed434d116ef?context=explore):

```bash
$ docker pull getmeili/meilisearch:prototype-better-language-detection-0
```

## Context
Some Languages are harder to detect than others. A misdetection leads to bad tokenization, which can make some words or even whole documents unsearchable. Japanese is the main Language affected: it can be detected as Chinese, which is tokenized in a completely different way.

A [first iteration has been implemented for v1.1.0](https://github.com/meilisearch/meilisearch/pull/3347), but it is not enough to make Japanese work. That first implementation detected the Language during indexing in order to avoid bad detections during the search.
Unfortunately, some documents (shorter ones) can still be wrongly detected as Chinese. These documents are then badly tokenized, and because Chinese has been detected during indexing, it can also be detected during the search.

For instance, a Japanese document `{"id": 1, "name": "東京スカパラダイスオーケストラ"}` is detected as Japanese during indexing. During the search, the query `東京` is then detected as Japanese because only Japanese documents were detected during indexing, even though v1.0.2 would detect it as Chinese.
However, suppose the dataset contains at least one document with a field containing only Kanji, such as one of the following:
_A document with only 1 field containing only Kanji:_
```json
{
 "id":4,
 "name": "東京特許許可局"
}
```
_A document with 1 field containing only Kanji and 1 field containing several Japanese characters:_
```json
{
 "id":105,
 "name": "東京特許許可局",
 "desc": "日経平均株価は26日 に約8カ月ぶりに2万4000円の心理的な節目を上回った。株高を支える材料のひとつは、自民党総裁選で3選を決めた安倍晋三首相の経済政策への期待だ。恩恵が見込まれるとされる人材サービスや建設株の一角が買われている。ただ思惑が先行して資金が集まっている面 は否めない。実際に政策効果を取り込む企業はどこか、なお未知数だ。"
}
```

Then, in both cases, the field `name` is detected as Chinese during indexing, which allows the search to detect Chinese in queries. Therefore, the query `東京` is detected as Chinese, and only the last two documents are retrieved by Meilisearch.

## Technical Approach

The current PR partially fixes these issues by:
1) Adding a check for potential misdetections and rerunning the extraction of the document, forcing tokenization with the main Languages detected in it.
 >  1) run a first extraction allowing the tokenizer to detect any Language in any Script
 >  2) generate a distribution of tokens by Script and Language (`script_language`)
 >  3) if, for a given Script, the share of tokens of one of the Languages is under the threshold, rerun the extraction while forbidding the tokenizer from detecting the marginal Languages
 >  4) the tokenizer then falls back on the other available Languages to tokenize the text. For example, if Chinese was marginally detected compared to Japanese on the CJ Script, the second extraction forces Japanese tokenization for CJ text in the document. However, text in another Script, like Latin, is not impacted by this restriction.

2) Adding a filtering threshold during the search that excludes Languages that have only been marginally detected in documents
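
The indexing-side check in step 1 can be sketched roughly as follows. This is a minimal illustration in Python, not the code of this PR: the function name `marginal_languages`, the `(script, language)` pair representation, and the token counts are all assumptions made for the example, and the threshold is passed in as a parameter.

```python
from collections import Counter, defaultdict


def marginal_languages(tokens, threshold):
    """Return, per script, the languages whose share of tokens is under `threshold`.

    `tokens` is an iterable of (script, language) pairs, one per token, as
    produced by a hypothetical first extraction pass.
    """
    counts = Counter(tokens)

    # Total number of tokens detected in each script.
    per_script = defaultdict(int)
    for (script, _language), n in counts.items():
        per_script[script] += n

    # A language is marginal when its share within its own script is too small.
    marginal = defaultdict(set)
    for (script, language), n in counts.items():
        if n / per_script[script] < threshold:
            marginal[script].add(language)
    return dict(marginal)


# Hypothetical token counts: a short Kanji-only field misdetected as Chinese
# next to a long Japanese field, plus a few Latin tokens.
mixed_doc = [("CJ", "Chinese")] * 3 + [("CJ", "Japanese")] * 97 + [("Latin", "English")] * 5
# Hypothetical document whose only field is detected as Chinese.
kanji_only_doc = [("CJ", "Chinese")] * 3

print(marginal_languages(mixed_doc, 0.10))
print(marginal_languages(kanji_only_doc, 0.10))
```

In this sketch the mixed document would be re-extracted with Chinese forbidden on the CJ script, while the Latin tokens are unaffected. The Kanji-only document has no marginal Language because Chinese accounts for all of its CJ tokens, so a second extraction would not change it.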

## Limits
This PR introduces 2 arbitrary thresholds:
1) during indexing, a Language is considered misdetected if the number of tokens detected as this Language is under 10% of the tokens detected in the same Script (Japanese and Chinese are 2 different Languages sharing the "same" Script, "CJK").
2) during the search, a Language is considered marginal if less than 5% of documents are detected as this Language.
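
The search-time threshold can be illustrated with a similar sketch. Again, this is only an assumption-laden example and not the PR's implementation: `searchable_languages` and its inputs (a per-Language document count and a total document count) are hypothetical names chosen for the illustration.

```python
def searchable_languages(docs_per_language, total_docs, threshold=0.05):
    """Return the languages that the search may detect in queries.

    `docs_per_language` maps a language name to the number of documents
    detected as that language during indexing. Languages detected in less
    than `threshold` of all documents are treated as marginal and dropped.
    """
    if total_docs == 0:
        return set()
    return {
        language
        for language, n in docs_per_language.items()
        if n / total_docs >= threshold
    }


# Hypothetical index: 2 documents out of 100 were detected as Chinese.
print(searchable_languages({"Japanese": 98, "Chinese": 2}, 100))
# Hypothetical index where Chinese is no longer marginal.
print(searchable_languages({"Japanese": 90, "Chinese": 10}, 100))
```

In the first case Chinese is filtered out, so a query such as `東京` can only be detected as Japanese. In the second case both Languages remain candidates during the search.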

This PR only partially fixes these issues:
- ✅ the query `東京` now finds Japanese documents if less than 5% of documents are detected as Chinese.
- ✅ the document with the id `105`, containing the Japanese field `desc` and the misdetected field `name`, is now completely detected and tokenized as Japanese and is found with the query `東京`.
- ❌ the document with the id `4` no longer breaks the search Language detection but continues to be detected as a Chinese document and can't be found during the search.

## Related issue
Fixes #3565

## Possible future enhancements
- Change or contribute to the Library used to detect the Language
  - the related issue on Whatlang: https://github.com/greyblake/whatlang-rs/issues/122

Co-authored-by: curquiza <clementine@meilisearch.com>
Co-authored-by: ManyTheFish <many@meilisearch.com>
Co-authored-by: Many the fish <many@meilisearch.com>

2023-03-09 15:34:35 +00:00

.github

Update CI to still use ubuntu-18

2023-03-08 17:11:36 +01:00

assets

Add a README to the milli crate

2023-01-16 16:25:12 +01:00

benchmarks

Return an internal error in the case of matching word is invalid

2023-03-01 19:05:16 +01:00

dump

Use the workspace inheritance feature of rust 1.64

2023-02-15 13:51:07 +01:00

file-store

fix a bug where the filestore could try to parse its own tmp file and fail

2023-02-23 16:52:41 +01:00

filter-parser

Use the workspace inheritance feature of rust 1.64

2023-02-15 13:51:07 +01:00

flatten-serde-json

Use the workspace inheritance feature of rust 1.64

2023-02-15 13:51:07 +01:00

grafana-dashboards

Add suffix describing the unit when needed; Replace MeiliSearch by Meilisearch; Precised some metrics name

2022-08-23 17:09:27 +02:00

index-scheduler

Merge #3541

2023-03-09 13:32:52 +00:00

json-depth-checker

Use the workspace inheritance feature of rust 1.64

2023-02-15 13:51:07 +01:00

meili-snap

Use the workspace inheritance feature of rust 1.64

2023-02-15 13:51:07 +01:00

meilisearch

Merge #3568 #3569

2023-03-09 15:34:35 +00:00

meilisearch-auth

Authentication: AuthFilter::allow_index_creation both check that the index is authorized and the IndexCreate action

2023-02-22 16:37:13 +01:00

meilisearch-types

Update migration link to the docs

2023-02-23 18:36:30 +01:00

milli

Merge #3568 #3569

2023-03-09 15:34:35 +00:00

permissive-json-pointer

Use the workspace inheritance feature of rust 1.64

2023-02-15 13:51:07 +01:00

.dockerignore

import .git to docker to fix vergen

2021-07-28 19:12:40 +02:00

.gitignore

edit gitignore to ignore .idea and .vscode folders

2023-02-10 11:42:19 +04:00

.rustfmt.toml

Introduce a rustfmt file

2022-10-27 11:35:05 +02:00

bors.toml

Remove macos-latest and windows-latest usages

2022-12-20 11:10:09 +01:00

Cargo.lock

Update version for the next release (v1.1.0) in Cargo.toml

2023-03-06 13:52:54 +00:00

Cargo.toml

Update version for the next release (v1.1.0) in Cargo.toml

2023-03-06 13:52:54 +00:00

CODE_OF_CONDUCT.md

Create CODE_OF_CONDUCT.md

2020-04-30 20:16:02 +02:00

config.toml

config: case experimental_enable_metrics in snake_case

2023-02-27 17:14:06 +01:00

CONTRIBUTING.md

Update contributing.md

2023-02-16 10:53:14 +01:00

Cross.toml

Cross build with action-rs

2021-10-10 02:21:30 +08:00

Dockerfile

Change Dockerfile to also pass the VERGEN_GIT_SEMVER_LIGHTWEIGHT when building

2023-02-16 10:53:14 +01:00

download-latest.sh

Update download-latest.sh

2022-11-30 16:55:32 +01:00

LICENSE

Update LICENSE

2022-02-15 15:54:45 +01:00

README.md

Merge #3399

2023-02-01 14:34:55 +00:00

SECURITY.md

docs(security): Fix Supported

2022-05-31 14:21:34 -05:00

README.md

Website | Roadmap | Blog | Documentation | FAQ | Discord

⚡ A lightning-fast search engine that fits effortlessly into your apps, websites, and workflow 🔍

Meilisearch helps you shape a delightful search experience in a snap, offering features that work out-of-the-box to speed up your workflow.

🔥 Try it! 🔥

✨ Features

Search-as-you-type: find search results in less than 50 milliseconds
Typo tolerance: get relevant matches even when queries contain typos and misspellings
Filtering and faceted search: enhance your user's search experience with custom filters and build a faceted search interface in a few lines of code
Sorting: sort results based on price, date, or pretty much anything else your users need
Synonym support: configure synonyms to include more relevant content in your search results
Geosearch: filter and sort documents based on geographic data
Extensive language support: search datasets in any language, with optimized support for Chinese, Japanese, Hebrew, and languages using the Latin alphabet
Security management: control which users can access what data with API keys that allow fine-grained permissions handling
Multi-Tenancy: personalize search results for any number of application tenants
Highly Customizable: customize Meilisearch to your specific needs or use our out-of-the-box and hassle-free presets
RESTful API: integrate Meilisearch in your technical stack with our plugins and SDKs
Easy to install, deploy, and maintain

📖 Documentation

You can consult Meilisearch's documentation at https://docs.meilisearch.com.

🚀 Getting started

For basic instructions on how to set up Meilisearch, add documents to an index, and search for documents, take a look at our Quick Start guide.

You may also want to check out Meilisearch 101 for an introduction to some of Meilisearch's most popular features.

☁️ Meilisearch cloud

Let us manage your infrastructure so you can focus on integrating a great search experience. Try Meilisearch Cloud today.

🧰 SDKs & integration tools

Install one of our SDKs in your project for seamless integration between Meilisearch and your favorite language or framework!

Take a look at the complete Meilisearch integration list.

⚙️ Advanced usage

Experienced users will want to keep our API Reference close at hand.

We also offer a wide range of dedicated guides to all Meilisearch features, such as filtering, sorting, geosearch, API keys, and tenant tokens.

Finally, for more in-depth information, refer to our articles explaining fundamental Meilisearch concepts such as documents and indexes.

📊 Telemetry

Meilisearch collects anonymized data from users to help us improve our product. You can deactivate this whenever you want.

To request deletion of collected data, please write to us at privacy@meilisearch.com. Don't forget to include your Instance UID in the message, as this helps us quickly find and delete your data.

If you want to know more about the kind of data we collect and what we use it for, check the telemetry section of our documentation.

📫 Get in touch!

Meilisearch is a search engine created by Meili, a software development company based in France and with team members all over the world. Want to know more about us? Check out our blog!

🗞 Subscribe to our newsletter if you don't want to miss any updates! We promise we won't clutter your mailbox: we only send one edition every two months.

💌 Want to make a suggestion or give feedback? Here are some of the channels where you can reach us:

For feature requests, please visit our product repository
Found a bug? Open an issue!
Want to be part of our Discord community? Join us!
For everything else, please check this page listing some of the other places where you can find us

Thank you for your support!

👩‍💻 Contributing

Meilisearch is, and will always be, open-source! If you want to contribute to the project, please take a look at our contribution guidelines.

📦 Versioning

Meilisearch releases and their associated binaries are available on this GitHub page.

The binaries are versioned following SemVer conventions. To know more, read our versioning policy.

Unlike the binaries, crates in this repository are not currently available on crates.io and do not follow SemVer conventions.