From 1718fe3d742921e234255d5f3c7f5984b2699afa Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Cl=C3=A9ment=20Renault?=
Date: Mon, 2 Nov 2020 18:06:10 +0100
Subject: [PATCH] Update the README to be up to date with the recent updates

---
 README.md | 49 ++++++++++++++++++-------------------------------
 1 file changed, 18 insertions(+), 31 deletions(-)

diff --git a/README.md b/README.md
index d06493a54..6090b71b9 100644
--- a/README.md
+++ b/README.md
@@ -2,7 +2,7 @@
   the milli logo
 </p>

-A concurrent indexer combined with fast and relevant search algorithms.
+a concurrent indexer combined with fast and relevant search algorithms
 
 ## Introduction
 
@@ -10,46 +10,33 @@
 This engine is a prototype, do not use it in production.
 This is one of the most advanced search engine I have worked on.
 It currently only supports the proximity criterion.
 
-### Compile all the binaries
+### Compile and run the server
+
+You can specify the number of threads to use to index documents, along with many other settings.
 
 ```bash
-cargo build --release --bins
+cargo run --release -- serve --db my-database.mdb -vvv --indexing-jobs 8
 ```
 
-## Indexing
-
-It can index mass documents in no much time, I already achieved to index:
- - 109m songs (song and artist name) in 21min and take 29GB on disk.
- - 12m cities (name, timezone and country ID) in 3min13s and take 3.3GB on disk.
-
-All of that on a 39$/month machine with 4cores.
-
 ### Index your documents
 
-You can feed the engine with your CSV data:
+It can index a massive amount of documents quickly; I have already managed to index:
+ - 115m songs (song and artist names) in ~1h, taking 107GB on disk.
+ - 12m cities (name, timezone and country ID) in 15min, taking 10GB on disk.
+
+All of that on a $39/month machine with 4 cores.
+
+You can feed the engine with your CSV (comma-separated, yes) data like this:
 
 ```bash
-./target/release/indexer --db my-data.mmdb ../my-data.csv
+printf "name,age\nhello,32\nkiki,24\n" | http POST 127.0.0.1:9700/documents content-type:text/csv
 ```
 
-## Querying
+Here, ids are automatically generated as UUID v4 when they are missing from some or all documents.
 
-The engine is designed to handle very frequent words like any other word frequency.
-This is why you can search for "asia dubai" (the most common timezone) in the countries datasets in no time (59ms) even with 12m documents.
+Note that it also supports JSON and JSON streaming: you can send them to the engine by using
+the `content-type:application/json` and `content-type:application/x-ndjson` headers respectively (see the sketch after this patch).
 
-We haven't modified the algorithm to handle queries that are scattered over multiple attributes, this is an open issue (#4).
+### Querying the engine via the website
 
-### Exposing a website to request the database
-
-Once you've indexed the dataset you will be able to access it with your brwoser.
-
-```bash
-./target/release/serve -l 0.0.0.0:8700 --db my-data.mmdb
-```
-
-## Gaps
-
-There is many ways to make the engine search for too long and consume too much CPU.
-This can for example be achieved by querying the engine for "the best of the do" on the songs and subreddits datasets.
-
-There is plenty of way to improve the algorithms and there is and will be new issues explaining potential improvements.
+You can query the engine by opening [the HTML page it serves](http://127.0.0.1:9700) in your browser.
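For the JSON and JSON-streaming support the patch mentions, here is a minimal sketch of what those two requests could look like, reusing the `127.0.0.1:9700/documents` route and the HTTPie style of the patch's own CSV example; the payload shapes (a JSON array of documents, and one JSON object per line for the stream) are assumptions, not something this patch confirms.

```bash
# Assumption: a JSON payload is an array of documents.
echo '[{"name": "hello", "age": 32}, {"name": "kiki", "age": 24}]' \
  | http POST 127.0.0.1:9700/documents content-type:application/json

# Assumption: a JSON stream (NDJSON) sends one JSON object per line.
printf '{"name": "hello", "age": 32}\n{"name": "kiki", "age": 24}\n' \
  | http POST 127.0.0.1:9700/documents content-type:application/x-ndjson
```

As with the CSV example, documents without an id would get a UUID v4 generated for them by the engine.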