mega-mini-indexer

A prototype of concurrent indexing that only stores postings ids

Introduction

This engine is a prototype; do not use it in production. It is one of the most advanced search engines I have worked on, and it currently only supports the proximity criterion.

Compile all the binaries

cargo build --release --bins
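This builds the two binaries used in the rest of this README, indexer and serve. As a quick sanity check you can ask each one for its help message (assuming they expose the conventional --help flag, which this README does not document):

./target/release/indexer --help
./target/release/serve --help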

Indexing

It can index masses of documents quickly; I already managed to index:

  • 109m songs (song and artist names) in 21 min, taking 29 GB on disk.
  • 12m cities (name, timezone and country ID) in 3 min 13 s, taking 3.3 GB on disk.

All of that on a $39/month machine with 4 cores.

Index your documents

You first need to split your CSV yourself; the engine is currently not able to split it itself. The bigger each split is, the faster the engine will index your documents, but the higher the RAM usage will be too.

Here we use the awesome xsv tool to split our big dataset.

cat my-data.csv | xsv split -s 2000000 my-data-split/
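If xsv is not available, here is a coreutils-only sketch that does the same thing while copying the CSV header into every chunk. The file names and the 2,000,000-line chunk size are illustrative, and note that naive line splitting breaks CSV records containing embedded newlines, which is why xsv is the safer choice:

head -n 1 my-data.csv > header.csv
mkdir -p my-data-split
tail -n +2 my-data.csv | split -l 2000000 - my-data-split/part-
for f in my-data-split/part-*; do cat header.csv "$f" > "$f.csv" && rm "$f"; done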

Once your data is ready you can feed the engine with it; it will spawn one thread per CSV part, up to the number of cores.

./target/release/indexer --db my-data.mmdb ../my-data-split/*

Querying

The engine is designed to handle very frequent words just like words of any other frequency. This is why you can search for "asia dubai" (the most common timezone) in the cities dataset in no time (59 ms), even with 12m documents.

We haven't modified the algorithm to handle queries that are scattered over multiple attributes; this is an open issue (#4).

Exposing a website to request the database

Once you've indexed the dataset you will be able to access it with your browser.

./target/release/serve -l 0.0.0.0:8700 --db my-data.mmdb
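You can also query the server directly over HTTP. The exact route and parameter names are not documented in this README, so the following curl call is a hypothetical example only; adapt it to whatever the front page actually exposes:

curl 'http://127.0.0.1:8700/query?q=asia+dubai'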

Gaps

There are many ways to make the engine search for too long and consume too much CPU, for example by querying it for "the best of the do" on the songs and subreddits datasets.

There are plenty of ways to improve the algorithms, and there are, and will be, new issues describing potential improvements.