Mongoengine: Should ListField support Nones for non-required sub-fields?

Created on 26 Dec 2016 · 18Comments · Source: MongoEngine/mongoengine

I haven't fully thought this through and I'm just logging an issue @JohnAD brought up in https://github.com/MongoEngine/mongoengine/issues/1106#issuecomment-269027633. We should think through all the use cases and discuss what's the best approach.

Right now, you can store None in any non-required field, e.g.

In [46]: class EmbeddedDoc(EmbeddedDocument):
    text = StringField()
   ....:

In [47]: class Doc(Document):
    name = StringField()
    num = IntField()
    embedded = EmbeddedDocumentField(EmbeddedDoc)
   ....:

In [48]: Doc.objects.create(name=None, num=None, embedded=None)
Out[48]: <Doc: Doc object>

Of course, that capability goes away once you define a field as required:

In [50]: Doc.objects.create(name=None, num=None, embedded=None)
---------------------------------------------------------------------------
ValidationError                           Traceback (most recent call last)
<ipython-input-50-eb46633914b2> in <module>()
----> 1 Doc.objects.create(name=None, num=None, embedded=None)

/Users/wojcikstefan/Repos/temp/mongoengine/mongoengine/queryset/base.pyc in create(self, **kwargs)
    286         .. versionadded:: 0.4
    287         """
--> 288         return self._document(**kwargs).save()
    289
    290     def first(self):

/Users/wojcikstefan/Repos/temp/mongoengine/mongoengine/document.pyc in save(self, force_insert, validate, clean, write_concern, cascade, cascade_kwargs, _refs, save_condition, signal_kwargs, **kwargs)
    318
    319         if validate:
--> 320             self.validate(clean=clean)
    321
    322         if write_concern is None:

/Users/wojcikstefan/Repos/temp/mongoengine/mongoengine/base/document.py in validate(self, clean)
    400                 pk = self._instance.pk
    401             message = 'ValidationError (%s:%s) ' % (self._class_name, pk)
--> 402             raise ValidationError(message, errors=errors)
    403
    404     def to_json(self, *args, **kwargs):

ValidationError: ValidationError (Doc:None) (Field is required: ['num', 'name', 'embedded'])

Now, I can imagine that we'd expect the same behavior from particular items within the ListField, namely that lists of non-required strings/ints/embedded docs would allow Nones in the list. That is not the case right now:

In [55]: class Doc(Document):
    names = ListField(StringField())
    nums = ListField(IntField())
    embedded_docs = ListField(EmbeddedDocumentField(EmbeddedDoc))
   ....:

In [56]: Doc.objects.create(names=[None], nums=[1], embedded_docs=[EmbeddedDoc(text='aaa')])
---------------------------------------------------------------------------
ValidationError                           Traceback (most recent call last)
<ipython-input-56-c349daa935c2> in <module>()
----> 1 Doc.objects.create(names=[None], nums=[1], embedded_docs=[EmbeddedDoc(text='aaa')])

/Users/wojcikstefan/Repos/temp/mongoengine/mongoengine/queryset/base.pyc in create(self, **kwargs)
    286         .. versionadded:: 0.4
    287         """
--> 288         return self._document(**kwargs).save()
    289
    290     def first(self):

/Users/wojcikstefan/Repos/temp/mongoengine/mongoengine/document.pyc in save(self, force_insert, validate, clean, write_concern, cascade, cascade_kwargs, _refs, save_condition, signal_kwargs, **kwargs)
    318
    319         if validate:
--> 320             self.validate(clean=clean)
    321
    322         if write_concern is None:

/Users/wojcikstefan/Repos/temp/mongoengine/mongoengine/base/document.py in validate(self, clean)
    400                 pk = self._instance.pk
    401             message = 'ValidationError (%s:%s) ' % (self._class_name, pk)
--> 402             raise ValidationError(message, errors=errors)
    403
    404     def to_json(self, *args, **kwargs):

ValidationError: ValidationError (Doc:None) (StringField only accepts string values: ['names'])

In [57]: Doc.objects.create(names=['name'], nums=[None], embedded_docs=[EmbeddedDoc(text='aaa')])
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-57-d6ac608e67d9> in <module>()
----> 1 Doc.objects.create(names=['name'], nums=[None], embedded_docs=[EmbeddedDoc(text='aaa')])

/Users/wojcikstefan/Repos/temp/mongoengine/mongoengine/queryset/base.pyc in create(self, **kwargs)
    286         .. versionadded:: 0.4
    287         """
--> 288         return self._document(**kwargs).save()
    289
    290     def first(self):

/Users/wojcikstefan/Repos/temp/mongoengine/mongoengine/base/document.py in __init__(self, *args, **values)
    113                         field = self._fields.get(key)
    114                         if field and not isinstance(field, FileField):
--> 115                             value = field.to_python(value)
    116                     setattr(self, key, value)
    117                 else:

/Users/wojcikstefan/Repos/temp/mongoengine/mongoengine/base/fields.pyc in to_python(self, value)
    304             self.field._auto_dereference = self._auto_dereference
    305             value_dict = {key: self.field.to_python(item)
--> 306                           for key, item in value.items()}
    307         else:
    308             Document = _import_class('Document')

/Users/wojcikstefan/Repos/temp/mongoengine/mongoengine/base/fields.pyc in <dictcomp>((key, item))
    304             self.field._auto_dereference = self._auto_dereference
    305             value_dict = {key: self.field.to_python(item)
--> 306                           for key, item in value.items()}
    307         else:
    308             Document = _import_class('Document')

/Users/wojcikstefan/Repos/temp/mongoengine/mongoengine/fields.py in to_python(self, value)
    179     def to_python(self, value):
    180         try:
--> 181             value = int(value)
    182         except ValueError:
    183             pass

TypeError: int() argument must be a string or a number, not 'NoneType'

In [58]: Doc.objects.create(names=['name'], nums=[1], embedded_docs=[None])
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-58-e79991fbb4fe> in <module>()
----> 1 Doc.objects.create(names=['name'], nums=[1], embedded_docs=[None])

/Users/wojcikstefan/Repos/temp/mongoengine/mongoengine/queryset/base.pyc in create(self, **kwargs)
    286         .. versionadded:: 0.4
    287         """
--> 288         return self._document(**kwargs).save()
    289
    290     def first(self):

/Users/wojcikstefan/Repos/temp/mongoengine/mongoengine/base/document.py in __init__(self, *args, **values)
    113                         field = self._fields.get(key)
    114                         if field and not isinstance(field, FileField):
--> 115                             value = field.to_python(value)
    116                     setattr(self, key, value)
    117                 else:

/Users/wojcikstefan/Repos/temp/mongoengine/mongoengine/base/fields.pyc in to_python(self, value)
    304             self.field._auto_dereference = self._auto_dereference
    305             value_dict = {key: self.field.to_python(item)
--> 306                           for key, item in value.items()}
    307         else:
    308             Document = _import_class('Document')

/Users/wojcikstefan/Repos/temp/mongoengine/mongoengine/base/fields.pyc in <dictcomp>((key, item))
    304             self.field._auto_dereference = self._auto_dereference
    305             value_dict = {key: self.field.to_python(item)
--> 306                           for key, item in value.items()}
    307         else:
    308             Document = _import_class('Document')

/Users/wojcikstefan/Repos/temp/mongoengine/mongoengine/fields.py in to_python(self, value)
    544     def to_python(self, value):
    545         if not isinstance(value, self.document_type):
--> 546             return self.document_type._from_son(value, _auto_dereference=self._auto_dereference)
    547         return value
    548

/Users/wojcikstefan/Repos/temp/mongoengine/mongoengine/base/document.py in _from_son(cls, son, _auto_dereference, only_fields, created)
    681         # Get the class name from the document, falling back to the given
    682         # class if unavailable
--> 683         class_name = son.get('_cls', cls._class_name)
    684
    685         # Convert SON to a dict, making sure each key is a string

AttributeError: 'NoneType' object has no attribute 'get'

As you can see, embedded document is completely broken (trying to call _from_son w/ None), but even the non-required strings and ints don't validate None properly...

Whatever we do here, we should consider performance implications, backward compatibility, etc. before making the final decision.

Discussion

Source

wojcikstefan

Most helpful comment

As it happens, when I started using mongoengine, I was bit confused by 'ListField' and it's naming.

In fact, I was also confused by some of MongoDB/JSON naming conventions. Technically, an array is an ordered sequence of homogeneous items. A list is an ordered sequence of items of any kind. In other words, a list can mix strings, numbers, nulls, booleans, whatever. Whereas an array forbids this. An array must have all elements be of the same type.

So, an array is always also a list. But the opposite is not always true.

Both Python and C/C++ get this correct.

But unfortunately, JSON (and thus MongoDB) get it wrong. They allow different types in what they call an array. But by doing so, it clearly violates the CompSci meaning of the word array. It's really a list.

So, when I saw the ListField I thought: "Oh, mongoengine has properly named this so-called-'array' as a list". So mongoengine will allow both StringFields and IntFields into the ListField. Except...no actually. mongoengine is requiring all of the elements of the list to be the same. So, it is treating it as an array. but calling it a list. I've been wondering why it wasn't being called an ArrayField to match it's array-like behavior.

TL:DR:
C/C++ has arrays that are arrays
Python has lists that are lists
JSON has arrays that are really lists
MongoEngine has lists that are really arrays

Way off-topic I suppose. I'm not suggesting we change anything.

(Also, I agree that ListField should forbid mixing types. It should be behaving as an array. All is good.)

JohnAD on 13 Jan 2017

👍2

All 18 comments

I have been thinking about this for a while. Please forgive my wordiness:

In addition to handling None, we also need to deal with invalid data. For example, if a document called d in the Doc container was:

    {
        "_id": {"$oid": "586329e585f4820be722fe25"},
        "names": ["a", 0, "c"],
        "nums": [1, 2, 3]
    }

We can see that any iterative string operations on the d.names attribute would generate errors on the second entry (0).

So, the bigger question could be: how should invalid data be handled?

I see 6 possible options:

INVALID-DATA-IGNORED OPTION

Perhaps we could have an option to ignore any invalid data, including None. So, if the ListField was declared with ignoreInvalid=True:

    class Doc(me.Document):
        names = me.ListField(me.StringField(), ignoreInvalid=True)
        nums = me.ListField(me.IntField())
        embedded_docs = me.EmbeddedDocumentListField(EmbeddedDoc)

Then the document would be 'seen' as:

{
    "_id": {"$oid": "586329e585f4820be722fe25"},
    "names": ["a", "c"], 
    "nums": [1, 2, 3]
}

Any null entry would be, of course, simply another form of invalid data.

When read from MongoDB, the names field would be pre-marked as changed so that a subsequent d.save() would change d.names to a corrected list.

In fact, I suggest we make ignoreInvalid=True the default, since I suspect that is what most users would select in most circumstances.

DEFAULT OPTION

Sometimes the ordering and placement of each item is critical in a list. In that case, removing an invalid element would cause damage. In this example, if "c" must be the third item in the list, then something must be placed into the second element.

We could support a simple 'defaultForInvalidEntry' option in the ListField. So, for example:

    class Doc(me.Document):
        names = me.ListField(me.StringField(), defaultForInvalidEntry="invalid")
        nums = me.ListField(me.IntField())
        embedded_docs = me.EmbeddedDocumentListField(EmbeddedDoc)

The document would be 'seen' as:

{
    "_id": {"$oid": "586329e585f4820be722fe25"},
    "names": ["a", "invalid", "c"], 
    "nums": [1, 2, 3]
}

This would be very similar to the 'default' option of other fields.

Again, null is simply another form of invalid data.

For EmbeddedDocs, the default could be a dict. Of course, if that dict has lists and those lists have dicts....

This option would be consistent with other fields. But... there is large part of me that is uncomfortable with this technique. It screams "duct-tape for error handling". I'm not a fan of this option; but that is just my opinion.

INVALID-FIELD-HANDLER OPTION

Another way to have an invalid field handled is by calling a function in the class for it's handling. This allows for a great deal of flexibility.

For example, having an 'onInvalidField' option:

    class Doc(me.Document):
        names = me.ListField(me.StringField(), onInvalidField=badNameHandler)
        nums = me.ListField(me.IntField())
        embedded_docs = me.EmbeddedDocumentListField(EmbeddedDoc)

        def badNameHandler(self, item):
            if item is None:
                return ""
            elif str(item)=="-1":
                return "xyz"
            else:
                return str("error, was: {}".format(repr(item)))

In this example, I've written a function that handles invalid data such as null/None or anything else. Since it is a handler for a bad string, it returns a string.

In this scenario, the programmer has full control over the error handling. Handling a None or invalid embedded document could look this:

    class Doc(me.Document):
        names = me.ListField(me.StringField())
        nums = me.ListField(me.IntField())
        embedded_docs = me.EmbeddedDocumentListField(EmbeddedDoc, onInvalidField=badEDHandler)

        def badEDHandler(self, item):
            ed = EmbeddedDoc()
            ed.text = ""
            try:
                ed.text = str(item["text"])
            except Exception:
                pass
            return ed

IGNORE-THE-DOCUMENT OPTION

For this option, a new meta field allows the document to be ignored as non-existent if it fails to load the ListField.

    class Doc(me.Document):
        names = me.ListField(me.StringField())
        nums = me.ListField(me.IntField())
        embedded_docs = me.EmbeddedDocumentListField(EmbeddedDoc)

        meta = {ignoreInvalidDocs=True}

I would probably rarely ever use this option myself. But, it is much faster than the other options.

And, IMO, it certainly should not be a default since it only 'seems' like the document does not exist. In practice, that could cause invalid document counts and scenarios that cause 'upsert'-style loops.

A MIX OF OPTIONS THREE AND FOUR.

Rather than have both 'defaultForInvalidEntry' and 'onInvalidField', perhaps we could have a combined field called something else such as 'handleInvalidEntry'. If set to a value, it is used as the default. If set to a function pointer, it calls that function.

    class Doc(me.Document):
        names = me.ListField(me.StringField(), handleInvalidEntry=badName)
        nums = me.ListField(me.IntField(), handleInvalidEntry=0)
        embedded_docs = me.EmbeddedDocumentListField(EmbeddedDoc)

        def badName(self, item):
            return str(item)[:1]

So, if a name entry is invalid, an attempt is made to return the first character of a str version of it, (N on a null None). If a nums entry is invalid, it is set to integer 0.

'LET IT BE' OPTION

Simply:

Allow null fields to be imported as None. I believe this is how MongoEngine 0.11.0 behaved.
Allow other invalid fields to be read and automatically interpreted into other field types; hope the programmer corrects them prior to saving, otherwise an error is generated.

This is, I believe, the behavior of MongoEngine 0.11.0.

This option does not solve lists of embedded documents problem. But it 'solves' things for the other fields.

And, since this as behavior I've seen in the past, it is certainly the easier way out.

THOUGHTS SO FAR

So, those are my thoughts. Personally, I see quite a bit of value to options 1 and 5.

Option 6 could also work, but I'm not in favor of it. I realize that options 1 and 5 will have performance impacts, IMO such impacts are expected. If I was strongly oriented to performance, I'd simply be using PyMongo. But I want the oversight to prevent me, the programmer, from making and distributing data errors and I'm willing to live with performance hits.

Esp. if that impact is of a O(n) scale. If of O(n^2) impact, that might be a bigger issue. I'm not expert like you all on the internals of this library, so let me know if these options would scale badly.

JohnAD on 28 Dec 2016

Adding a bit to the handleInvalidEntry option. We could also pass the index in the array/list, was well as the item. So:

    def badNameHandler(self, item, index):
        ...

BTW, if setup as an instance function, as seen above, it would have access to the other fields. And while that sounds great, that might introduce complications since the document is not fully parsed yet. Instead, we might make it strictly a class method:

    @classmethod
    def badNameHandler(item, index):
        ...

JohnAD on 28 Dec 2016

Reviewing the BaseField __init__ parameters, perhaps a better name for handleInvalidEntry is simply entry_default or invalid_entry_default and document that it can be a callable. The parameter would only apply to ComplexBaseField class; so it might be handled as an addition to __init__ for that class.

    def __init__(self, *args, **kwargs):
        super(ComplexBaseField, self).__init__(*args, **kwargs)
        self.invalid_entry_default = kwargs.get("invalid_entry_default", None)
        if self.invalid_entry_default is None:
            self.ignore_invalid = kwargs.get("ignore_invalid", True)
        else:
            self.ignore_invalid = False

I should not be writing code during a discussion. Sorry 'bout that. :)

JohnAD on 28 Dec 2016

It has been 9 days since last comment. Okay, admittedly, most of them have been by me. :)

Would everyone be okay I attempted the following?:

Add an invalid_entry_default=None parameter to ComplexBaseField so that it applies to:

'ListField'
'DictField'
'MapField'
'EmbeddedDocumentListField'
'SortedListField'

The effect of this option, if set, is that if a list item (or key/value) is pulled into the MongoDB document is not valid, the 'invalid_entry_default' is used as a replacement. Specifically:

if invalid_entry_default is a literal value or field instance, then that value/field is used as a replacement.
if invalid_entry_default is a callable, then that function is called to get the replacement value/field. A check is made to ensure that the function is _not_ an instance function. The parameters passed are (index, item) if a list and (key, value) if a dict.
if invalid_entry_default is None then the entry (or key/value) is skipped. This is the default behavior.

If the invalid_entry_default's behavior is invoked during the reading of a document, then the field is marked as changed. Otherwise, this parameter only effects the reading of the document.

In some ways, this parameter helps ensure the Robustness Principle .

Example:

class Doc(me.Document):
    # default any None, strings, or another other kind of invalid number, to 0:
    nums = me.ListField(me.IntField(), invalid_entry_default=0)
    # default to a hand-written handler for bad strings:
    names = me.ListField(me.StringField(), invalid_entry_default=badName)
    # default to skipping bad names:
    more_names = me.ListField(me.StringField(), invalid_entry_default=None)
    # default to a field instance:
    embedded_docs = me.EmbeddedDocumentListField(EmbeddedDoc, invalid_entry_default=EmbeddedDoc(text="hello"))

    @classmethod
    def badName(index, item):
        return "bad name: {} at index {}".format(str(item), index)"

To finish all this:

Flush out and debug code.
Write the test cases for all five of the fields.
Update the documentation to explain this lightly in API reference.
Flush out section 2.3.3.2. List fields

And possibly, if there is time:

Flush out section 2.3.3.4. Dictionary Fields
Do a YouTube video explaining the change.

(BTW, I'm happy to support DictFields, but really recommend against using them. MapFields seem okay.)

JohnAD on 6 Jan 2017

Hi @JohnAD sorry to be stalling on this one, have gotten a bit swamped. I'll respond within the next few days though!

wojcikstefan on 6 Jan 2017

Hm, let me push back a bit on this one ;)

What you're essentially proposing is validation upon read. That could have disastrous effects on performance and IMHO, we're already doing way too much processing of the data coming in.

If you (meaning the user of this package) truly need to validate data after fetching it, you can always call the validate() method. If you need something more custom then that, I'd much rather have you write your case-specific logic than try to cover some of the possible scenarios in the library itself. I think the cost of such an approach in terms of complexity and future maintenance would be huge.

Finally, if we'd want to do that for lists and dicts, why wouldn't we do it for regular fields? For example, consider this:

In [15]: class Doc(Document):
    ...:     txt = StringField()
    ...:

In [16]: Doc.drop_collection()

In [17]: Doc._get_collection().insert({ 'txt': 1 })  # invalid type - int instead of str
Out[17]: ObjectId('58781640c6e47b544835677b')

In [18]: doc = Doc.objects.first()

In [19]: doc.txt
Out[19]: 1

In [20]: type(doc.txt)
Out[20]: int

In [21]: doc.validate()  # raises a validation error
---------------------------------------------------------------------------
ValidationError                           Traceback (most recent call last)
<ipython-input-21-35785ada099a> in <module>()
----> 1 doc.validate()  # raises a validation error

/Users/wojcikstefan/Repos/temp/mongoengine/mongoengine/base/document.pyc in validate(self, clean)
    400                 pk = self._instance.pk
    401             message = 'ValidationError (%s:%s) ' % (self._class_name, pk)
--> 402             raise ValidationError(message, errors=errors)
    403
    404     def to_json(self, *args, **kwargs):

ValidationError: ValidationError (Doc:58781640c6e47b544835677b) (StringField only accepts string values: ['txt'])

I understand the temptation to provide (somewhat) flexible validation upon reading, but I'd much rather have MongoEngine stick to the validate-upon-save (or on demand via validate()) approach.

wojcikstefan on 13 Jan 2017

And to bring it back to the original issue from the description, StringField() allows None or an instance of basestring as a valid value. It does not allow e.g. an int. Hence I believe ListField(StringField()) should allow a mix of Nones and basestrings (again, validated upon save).

wojcikstefan on 13 Jan 2017

@wojcikstefan I should have read your response prior to creating issue #1468 . Sorry about that. Essentially it is talking about this very thing on other fields.

I agree that validation does create considerable overhead, especially when pulling in a large list of documents and each document might, in turn, have a large group of lists (and lists within lists.)

As such, I can totally see wanting it to be strictly optional.

In a large production system involving multiple software agents (not all of them in python), it is, IMO, generally assumed that bad data (in documents) is common. That is never desired, of course, but handling bad data should be both resilient and as self-evident as possible. For me personally, having validation on reading would be very valuable as I only do it in a haphazard fashion now and do regret it sometimes.

I'd like to brainstorm some ideas:

idea 1: optional on per-doc or per-field basis

Keep the read-validation optional on a per-field or per-document basis. If the parameter is never invoked, then it is never used. Put in the documentation that such validation has a high cost.

idea 2: allow the parameters, but ignore them on reading, but use them with a `filter` method:

# without filter
x = SomeDoc.objects.first()
# with filter:
x = SomeDoc.objects.filter().first()

idea 3: do this as merged library

Essentially, create filteredmongoengine which uses mongoengine as a library.

idea 4: enhance validate to make field correction more fluid.

.validate() raises error and does mention the failed condition and field name in the error message. But turning that around to invoking additional response code seems difficult. I might be missing something though. Perhaps a variant of that method make such response easier.

Just thinking out loud. Looking this over, perhaps "idea 2" has a certain appeal.

JohnAD on 13 Jan 2017

As it happens, when I started using mongoengine, I was bit confused by 'ListField' and it's naming.

So, an array is always also a list. But the opposite is not always true.

Both Python and C/C++ get this correct.

TL:DR:
C/C++ has arrays that are arrays
Python has lists that are lists
JSON has arrays that are really lists
MongoEngine has lists that are really arrays

Way off-topic I suppose. I'm not suggesting we change anything.

(Also, I agree that ListField should forbid mixing types. It should be behaving as an array. All is good.)

JohnAD on 13 Jan 2017

👍2

@wojcikstefan wrote:

And to bring it back to the original issue from the description, StringField() allows None or an instance of basestring as a valid value. It does not allow e.g. an int.

We might want start a separate issue for this. But here is the challenge regarding None, IMO:

We could have two different Doc documents (using your code from earlier):

{
    "_id": {
        "$oid": "5877f4e29a9cb569cc6b4e49"
    }, 
    "txt": null
}

and

{
    "_id": {
        "$oid": "5877f4e29a9cb569cc6b4e49"
    }
}

Generally speaking, we are using None to mean not there. But "not there" and null are really different things. In a standalone StringField it might be rare that we really need to see the difference, but I strongly suspect that is not the case in a ListField. This is somewhat akin to comparing "" and None in python. Most of the time one can safely ignore the difference, but when it is important, one can write:

if myvar is None:
    something()
elif myvar=="":
    somethingelse()

rather than the generic truthiness check:

if myvar:

I've not found a way in mongoengine to handle this. If we were to treat null as None, then that implies that the entry was never there; even though it is. This further complicates the picture.

Ideally, IMO, it would be good to have some equivalent of:

if doc.txt is None:
    somethingmissing()
elif doc.txt=="":
    somethingempty()
elif doc.txt==mongoengine.null:
    somethingnull()

Or for a list of strings:

{
    "_id": {
        "$oid": "5877f4e29a9cb569cc6b4e49"
    },
    "strings": [
        "a",
        null,
        "c",
    ]
}

would become:

["a", null, "b"]

But I suspect handling that is a very non-trivial change.

For a bug fix in 0.11.x, I recommend we insert None entries into lists. Since it does not allow null entries currently, we don't need to worry about legacy code. We could then add a better or different solution in later versions.

JohnAD on 13 Jan 2017

I do apologize for writing so much...I can get wordy sometimes.

Inspired by the idea of a separate refine filter function, I might write a quick proof-of-concept branch this weekend. That way any validation on reading is done explicity, separately, and with knowledge of its cost.

import mongoengine as me

class Post(me.Document):
    title = me.StringField(max_length=120, required=True, correction=str)
    tags = me.ListField(me.StringField(max_length=30), correction=me.skip)
    author = me.StringField(required=True, correction="unknown", validation=my_routine)

post_list = Post.objects(tags='mongodb').refine
other_list = Post.objects(tags='other').refine(limit_check=['title'])

JohnAD on 13 Jan 2017

Generally speaking, we are using None to mean not there. But "not there" and null are really different things.

Generally speaking, you're right. However, that's not the case in MongoDB. As you can see in the following snippet, both "not there" and null are returned when you're looking for null:

> db.test.insert({ whatever: 1 })
> db.test.find({ aaa: null })
{
  "_id": ObjectId("587c717499a6938456f321ae"),
  "whatever": 1
}
Fetched 1 record(s) in 1ms
> db.test.insert({ aaa: null })
> db.test.find({ aaa: null })
{
  "_id": ObjectId("587c717499a6938456f321ae"),
  "whatever": 1
}
{
  "_id": ObjectId("587c71f399a6938456f321af"),
  "aaa": null
}
Fetched 2 record(s) in 3ms

wojcikstefan on 16 Jan 2017

Hi again @JohnAD, sorry for a delay! I'm afraid I'm still very skeptical of this built-in validate-on-read approach... There's a thousand different ways one might want to "refine" corrupted or outdated data depending on a use case, and exposing a few of the possible options to deal with it in the schema seems like a complex way to kinda solve some of the potential problems... And even if something is "completely optional", once it's part of the core library, the cost of supporting it in the future is IMHO insurmountable.

What you can do now:

If you had a one-time incident where your data became corrupted, write a migration to clean it up. No need to involve the schema.
If some part of your system is writing incorrect data into MongoDB, fix that part of the system.
If for some reason you can't fix it bcos e.g. it's a 3rd party proprietary system (which seems like an obscure case), perhaps your schema should change to accept these values... After all, the schema is supposed to be the representation of the data in the database. Once you load that data up into a MongoEngine doc, you can have a custom serializer/set of getters that refine the data before passing the values to another system. IMO, there's plenty of cleaner and more explicit ways than adding auto-magical kwargs to your schema's fields.

wojcikstefan on 16 Jan 2017

@wojcikstefan, when writing the early code for refine, I discovered that pyMongo does the same. So, your pointing out that MongoDB itself treats null and missing as identical makes even more sense. So, officially, None is both null and an indication of missing.

That being case, allowing the ListField to contain either the Field type or None makes a lot of sense. I'll see what it takes to make that work.

As to validate-on-read, I'll assume for now that my needs are odd and I'll write a private library. Something possibly heritable akin to:

from mongoengine import Document, StringField, ListField
from mongoenginerefine import Refine, Skip

class Post(Document, Refine):
    title = StringField(max_length=120, required=True, correction=str)
    tags = ListField(StringField(max_length=30), allow_null=False, correction=Skip)
    author = StringField(required=True, correction="unknown", validation=my_routine)

post_list = Post.objects(tags='mongodb').refine
other_list = Post.objects(tags='other').refine(limit_check=['title'])

JohnAD on 17 Jan 2017

👍1

Modified code for getting both EmbeddedDocumentField and EmbeddedDocumentListField to allow None. It looks like this will have little impact on performance.

Looks pretty straightforward. I'll get the remaining fields fixed and add test cases. I'm guessing it will be about a week before I'm ready with a pull request.

As a bonus, the required kwarg actually means something now for EmbeddedDocumentField.

JohnAD on 19 Jan 2017

👍1

Awesome @JohnAD! Thank you very much for contributing so much! Could you open a work-in-progress PR so that it's easy to track the development and potentially offer help while working out some kinks?

wojcikstefan on 19 Jan 2017

If you use an empty Object, then MongoEngine 0.13.0 will fill in the object with default values defined in the EmbeddedDocument. This could be a workaround until the pull request #1470 is accepted.

mjmonroe on 2 May 2017

Hi all, I think I've just hit up against this in v0.22.1. It seems like the same issue as here but while there are a few other tickets discussing similar problems I can't see if there has been a resolution?

mylist = ListField(StringField(null=True))

Setting mydoc.mylist = [None, 'hello', None] gives the error StringField only accepts string values.
Am I doing something wrong here or is there a workaround to get this working? Obviously I could store '' and convert when loading, but thought there might be a neater solution?