Elasticsearch: Finding documents with empty string as value

Created on 29 Aug 2014 · 8Comments · Source: elastic/elasticsearch

Hello,

I've been trying to find all documents that contains a field with an empty string value (_textContent) inside my index using the missing filter. But I think that this filter doesn't treat empty strings as a null value.

Is this a bug, or the intended behavior for this filter? If it is like this by design, may I suggest adding an empty filter, if that's even possible?

As a plus, I'll post the mapping that I'm currently using, the document that contains the empty string field and the query that I'm trying to run:

Mapping:

{
  "documents": {
    "mappings": {
      "document": {
        "properties": {
          "_contratante": {
            "type": "string"
          },
          "_dateFields": {
            "type": "nested",
            "properties": {
              "id": {
                "type": "string",
                "index": "not_analyzed"
              },
              "value": {
                "type": "date",
                "format": "dateOptionalTime"
              }
            }
          },
          "_indexadoPor": {
            "type": "string"
          },
          "_textContent": {
            "type": "string"
          },
          "_textFields": {
            "type": "nested",
            "properties": {
              "id": {
                "type": "string",
                "index": "not_analyzed"
              },
              "value": {
                "type": "string"
              }
            }
          },
          "_tipoDocumento": {
            "type": "string"
          }
        }
      }
    }
  }
}

Document:

{
  "_index": "documents",
  "_type": "document",
  "_id": "xxx",
  "_version": 1,
  "found": true,
  "_source": {
    "_id": "xxx",
    "_contratante": "xxx",
    "_tipoDocumento": "xxx",
    "_indexadoPor": "xxx",
    "_dateFields": [
      {
        "id": "538730ece4b0d13600208d7a:2",
        "value": 1404183600000
      }
    ],
    "_textFields": [
      {
        "id": "538730ece4b0d13600208d7a:0",
        "value": "xxx"
      },
      {
        "id": "538730ece4b0d13600208d7a:1",
        "value": "xxx"
      },
      {
        "id": "538730ece4b0d13600208d7a:3",
        "value": ""
      },
      {
        "id": "538730ece4b0d13600208d7a:4",
        "value": "xxxx"
      }
    ],
    "_textContent": ""
  }
}

Query:

{
  "query": {
    "filtered": {
      "filter": {
        "missing": {
          "field": "_textContent"
        }
      }
    }
  }
}

Thanks a lot for the amazing tool developed here!

Source

paulovictorv

Most helpful comment

Here's a workaround:

PUT t/t/1
{
  "textContent": ""
}

PUT t/t/2
{
  "textContent": "foo"
}

GET t/t/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "exists": {
            "field": "textContent"
          }
        }
      ],
      "must_not": [
        {
          "wildcard": {
            "textContent": "*"
          }
        }
      ]
    }
  }
}

Not optimal, but it works

clintongormley on 21 Nov 2015

👍25

All 8 comments

Hi Paulo and everyone,

I just ran a few tests and also have same questions about it.
The situation can be produced by the gist: https://gist.github.com/hxuanji/d941c21fc75648ce7ba4

On the ES 1.3.2, the empty string cannot be find and the gist above can only find one document {"name": null}. And its explanation is
ConstantScore(cache(NotFilter(cache(BooleanFilter(_field_names:name)))))

But on the ES 1.0.1, the final result of the gist can both find the empty-string and null-value documents. Also its explanation is
ConstantScore(cache(NotFilter(cache(BooleanFilter(name:[* TO *]))))).

It seems the parsing rule changed, _field_names seems not consider some special null cases on the ES 1.3.2 ?!
And this _field_names added from #5659.

Any ideas?

hxuanji on 1 Sep 2014

I think this is a duplicate of https://github.com/elasticsearch/elasticsearch/issues/7348 ?

jpountz on 1 Sep 2014

Hi @jprante,

Yes, I think so.

hxuanji on 1 Sep 2014

Does adding an empty filter sounds silly? I think there's some good use cases for this.

paulovictorv on 1 Sep 2014

👍1

Here's a workaround:

PUT t/t/1
{
  "textContent": ""
}

PUT t/t/2
{
  "textContent": "foo"
}

GET t/t/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "exists": {
            "field": "textContent"
          }
        }
      ],
      "must_not": [
        {
          "wildcard": {
            "textContent": "*"
          }
        }
      ]
    }
  }
}

Not optimal, but it works

clintongormley on 21 Nov 2015

👍25

You're a life saver @clintongormley

davidhesson on 9 Jan 2018

I want to do the inverse of this:

"query": {
  "bool": {
    "filter": {
      "wildcard": {
      "name": "*"
    }
  }
}

And it appears to work. Any document with a name matches and any doc with null or '' name does not match. But I don't know why it works! I thought that the wildcard filter expands * to some number of terms so that if I have, say, 9999 terms in the name field, if I index a document with name = "zzzzzzzzzzzzebra" then that will be well past the term expansion and it shouldn't match. But it does. Why?

Somehow I guess it's doing the constant_score thing here and it's using the variant that doesn't hit the 1024 clause error. But how does that work? How can you look up anything in the index w/o having a should clause with all the terms inside of it?

@clintongormley ?

JnBrymn-EB on 9 Nov 2018

Now it's 2020. Still not found a better solution for "field is not the empty string".
Used the following solution, but is this the really good practice?