Meilisearch: Document not fully indexed

Created on 27 Mar 2020 · 3 Comments · Source: meilisearch/MeiliSearch

Maybe I'm doing something wrong, but I found that some large documents don't seem to get fully indexed. This is with MeiliSearch 0.9.0. Basically, indexing seems to stop after a certain number of words.

Steps to reproduce

curl -LO https://github.com/meilisearch/MeiliSearch/releases/download/v0.9.0/meilisearch-linux-amd64
chmod +x meilisearch-linux-amd64
./meilisearch-linux-amd64 --db-path testfoo.ms

# In another terminal:

curl \
  -X POST 'http://localhost:7700/indexes' \
  --data '{
  "name": "testfoo",
  "uid": "testfoo"
}'

# Get large test document; contains the entirety of Tolstoy's "War and Peace"
# in English (3.3 MB) in the field `full_text`
curl -O https://x.unix.se/etc/testfoo.json

curl -i -X POST 'http://127.0.0.1:7700/indexes/testfoo/documents' \
  --header 'content-type: application/json' \
  --data-binary @testfoo.json

# Verify that status is processed
curl -s 'http://127.0.0.1:7700/indexes/testfoo/updates/0'|jq

# Search #1
curl -s 'http://127.0.0.1:7700/indexes/testfoo/search?attributesToCrop=full_text&cropLength=50&q=prince'|jq ''|tail -n 13
# Search #2
curl -s 'http://127.0.0.1:7700/indexes/testfoo/search?attributesToCrop=full_text&cropLength=50&q=faithful'|jq ''|tail -n 13

What happens

~ curl \
        -X POST 'http://localhost:7700/indexes' \
        --data '{
        "name": "testfoo",
        "uid": "testfoo"
      }'
{"name":"testfoo","uid":"testfoo","createdAt":"2020-03-27T19:22:17.374411443Z","updatedAt":"2020-03-27T19:22:17.374451926Z","primaryKey":null}

~ curl -i -X POST 'http://127.0.0.1:7700/indexes/testfoo/documents' \
        --header 'content-type: application/json' \
        --data-binary @testfoo.json
HTTP/1.1 100 Continue

HTTP/1.1 202 Accepted
content-type: application/json
access-control-allow-origin: *
transfer-encoding: chunked
date: Fri, 27 Mar 2020 19:37:16 GMT

{"updateId":0}

~ curl -s 'http://127.0.0.1:7700/indexes/testfoo/updates/0'|jq
{
  "status": "processed",
  "updateId": 0,
  "type": {
    "name": "DocumentsAddition",
    "number": 1
  },
  "duration": 0.007595999,
  "enqueuedAt": "2020-03-27T19:37:16.820200414Z",
  "processedAt": "2020-03-27T19:37:16.866531267Z"
}

After front matter and a long table of contents, War and Peace starts with:

“Well, Prince, so Genoa and Lucca are now just family estates of the
Buonapartes. But I warn you, if you don’t tell me that this means war,
if you still try to defend the infamies and horrors perpetrated by that
Antichrist—I really believe he is Antichrist—I will have nothing
more to do with you and you are no longer my friend, no longer my
‘faithful slave,’ as you call yourself! But how do you do? I see I
have frightened you—sit down and tell me all the news.”

Searching for "prince" works:

~ curl -s 'http://127.0.0.1:7700/indexes/testfoo/search?attributesToCrop=full_text&cropLength=50&q=prince'|jq ''|tail -n 13
      "_formatted": {
        "id": 1,
        "title": "War and Peace",
        "author": "Leo Tolstoy",
        "full_text": "I\n\n\n\n\n\n\n\n\n\n\nBOOK ONE: 1805\n\n\n\n\n\nCHAPTER I\n\n“Well, Prince, so Genoa and Lucca are now just family est"
      }
    }
  ],
  "offset": 0,
  "limit": 20,
  "processingTimeMs": 1,
  "query": "prince"
}

But somewhere in the middle of the second sentence, it stops; I get a hit for "warn" but not "defend" or "horrors" or anything else thereafter:

~ curl -s 'http://127.0.0.1:7700/indexes/testfoo/search?attributesToCrop=full_text&cropLength=50&q=defend'|jq ''|tail -n 13
{
  "hits": [],
  "offset": 0,
  "limit": 20,
  "processingTimeMs": 0,
  "query": "defend"
}

Any ideas? (I've verified that the entire text is indeed stored in the database)

bug meilisearch-core meilisearch-tokenizer question

All 3 comments

Hey @andersju,

I thought I had mentioned this in the documentation (will do), but the engine currently only accepts a maximum of 2^16 (65536) characters per attribute. In a future version of MeiliSearch we will probably change this limit to 1000 words per attribute.

This is needed to keep good performance and good relevancy.
MeiliSearch is designed to handle many small documents.
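
To see which attributes of a document would run into the limit described above, a small check along the following lines may help. This is only a sketch: it assumes the limit counts characters (not bytes or words) exactly as stated in this comment, and the helper names `oversized_attributes` and `text_around_limit` are made up for illustration and are not part of MeiliSearch.

```python
# Assumption: the limit is 2^16 characters per attribute, as stated above.
MAX_CHARS_PER_ATTRIBUTE = 2 ** 16  # 65536


def oversized_attributes(document, limit=MAX_CHARS_PER_ATTRIBUTE):
    """Return {attribute: length} for string attributes longer than `limit`."""
    return {
        name: len(value)
        for name, value in document.items()
        if isinstance(value, str) and len(value) > limit
    }


def text_around_limit(text, limit=MAX_CHARS_PER_ATTRIBUTE, context=40):
    """Return the text just before and just after the assumed cut-off."""
    before = text[max(0, limit - context):limit]
    after = text[limit:limit + context]
    return before, after


if __name__ == "__main__":
    # Hypothetical stand-in for a large document such as testfoo.json.
    doc = {"id": 1, "title": "War and Peace", "full_text": "word " * 20000}
    print(oversized_attributes(doc))  # {'full_text': 100000}
```

Running `text_around_limit` on the real `full_text` field shows roughly where indexing would be expected to stop, which can be compared with the words that do and do not return hits.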

If you need to index big documents like articles, it is recommended to split them by paragraph and add a distinct rule on the article id, for example:

[
  {
    "id": 3456,
    "article-id": 321,
    "content": "this is the FIRST paragraph of the article 321"
  },
  {
    "id": 4567,
    "article-id": 432,
    "content": "this is the FIRST paragraph of the article 432"
  },
  {
    "id": 5678,
    "article-id": 321,
    "content": "this is the SECOND paragraph of the article 321"
  }
]

By doing so you will get precise results on where a query matched in a document, keep great performance, and be able to search inside the text of all the documents.
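
The splitting suggested above could be sketched as follows. This is an illustration under stated assumptions, not an official recipe: paragraphs are assumed to be separated by blank lines, the function name `split_into_paragraph_documents` and the `first_id` parameter are hypothetical, and the field names simply mirror the JSON example in this comment.

```python
import re


def split_into_paragraph_documents(article_id, text, first_id=1):
    """Turn one long text into one small document per non-empty paragraph."""
    # Assumption: paragraphs are separated by one or more blank lines.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    return [
        {
            "id": first_id + offset,      # unique id per paragraph document
            "article-id": article_id,     # shared by all paragraphs of one article
            "content": paragraph,
        }
        for offset, paragraph in enumerate(paragraphs)
    ]
```

The resulting list can be serialized with `json.dumps` and sent to the same `/indexes/testfoo/documents` endpoint used in the reproduction steps. Setting up the distinct rule on `article-id` is not shown here, and a single very long paragraph could still exceed the per-attribute limit, so it may need further splitting.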

@Kerollmops,

Ah! I figured I hit some kind of limit, but I couldn't find anything about it in the docs, so I wasn't sure. Will try splitting, thanks for the advice!

Hey @meilisearch/doc-team, would it be possible to create a documentation page explaining that MeiliSearch has limitations and that big documents must be cropped, so that results are more accurate and target the right portion of text?

https://github.com/meilisearch/MeiliSearch/issues/556#issuecomment-605333369

Thank you very much :)
