Marshmallow: Performance of serializing nested collections is poor

Created on 28 Dec 2013  ·  7 Comments  ·  Source: marshmallow-code/marshmallow

I worked up a quick test using the nose timed decorator.

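import random
import string
import unittest

from nose.tools import timed

# User, Blog, UserSerializer and BlogSerializer are assumed to be defined elsewhere.

class TestSerializerTime(unittest.TestCase):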

    def setUp(self):
        self.users = []
        self.blogs = []
        letters = list(string.ascii_letters)

        for i in range(500):
            self.users.append(User(''.join(random.sample(letters, 15)),
                email='[email protected]', age=random.randint(10, 50)))

        for i in range(500):
            self.blogs.append(Blog(''.join(random.sample(letters, 50)),
                user=random.choice(self.users)))

    @timed(.2)
    def test_small_blog_set(self):
        res = BlogSerializer(self.blogs[:20], many=True)

    @timed(.4)
    def test_medium_blog_set(self):
        res = BlogSerializer(self.blogs[:250], many=True)

    @timed(1)
    def test_large_blog_set(self):
        res = BlogSerializer(self.blogs, many=True)

    @timed(.1)
    def test_small_user_set(self):
        res = UserSerializer(self.users[:20], many=True)

    @timed(.2)
    def test_medium_user_set(self):
        res = UserSerializer(self.users[:250], many=True)

    @timed(.5)
    def test_large_user_set(self):
        res = UserSerializer(self.users, many=True)

The user tests all pass, but the medium and large blog tests do not. They might pass on a faster machine, but serialization is still rather slow.
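For readers who don't have nose installed, the timing check used in the tests above can be approximated with a small decorator. This is a minimal stand-in written for illustration, not nose's own implementation; it only assumes that the check means "fail if the call takes longer than the limit in seconds".

```python
import functools
import time


def timed(limit):
    """Fail the wrapped callable if it runs longer than `limit` seconds."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            result = func(*args, **kwargs)
            elapsed = time.perf_counter() - start
            # The call is allowed to finish; the limit is checked afterwards.
            if elapsed > limit:
                raise AssertionError(
                    f"{func.__name__} took {elapsed:.3f}s, limit is {limit}s"
                )
            return result
        return wrapper
    return decorator
```

Because the limits are wall-clock times, results depend on the machine, which is why a failing test here is a hint rather than proof of a regression.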

I did a little more testing with profile. Serializing the whole blog collection took between 5 and 6 seconds.

The bottleneck appears to be the deepcopy call in serializer.py, and it doesn't look like that call can be removed or replaced with a pickle/unpickle round trip.
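A rough sketch of how a profile like this can be collected with the standard library. The `serialize_all` function is a hypothetical stand-in that performs one `copy.deepcopy` per item; it is not marshmallow code, and the real numbers from the thread are not reproduced here.

```python
import copy
import cProfile
import io
import pstats


def serialize_all(items):
    # Hypothetical workload: one deepcopy of a small template per item.
    template = {"fields": {"name": [1, 2, 3], "email": [4, 5, 6]}}
    return [(copy.deepcopy(template), item) for item in items]


def profile_call(func, *args):
    """Run func under cProfile and return the top entries as text."""
    profiler = cProfile.Profile()
    profiler.enable()
    func(*args)
    profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so expensive call trees appear first.
    stats.sort_stats("cumulative").print_stats(10)
    return buffer.getvalue()


report = profile_call(serialize_all, range(500))
print(report)
```

In the printed report, a hot spot such as `deepcopy` shows up near the top of the cumulative-time column.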

I'm going to keep digging to see what I can do. If you have any insight, I'd appreciate the help. Thanks!

All 7 comments

The deepcopy operation is expensive, but necessary, so that serializers can store errors from nested serializers.

I did a little work with cProfile and your code above (the gist of the script is here) and found 2 significant speedups:

  • Passing an _instance_--not a class--into a Nested field.

Example:

collaborators = fields.Nested(UserSerializer(), many=True)

instead of

collaborators = fields.Nested(UserSerializer, many=True)

This avoids repeating the initialization code (including the deepcopy) for each collaborator. In the future, it'll be better to cache the nested serializer object, or disallow passing classes altogether.

  • Overriding the __deepcopy__ method of field objects so they are only shallow-copied (9c0f062). Even though the declared_fields dictionary must be deep-copied, the field objects themselves don't need to be.

These two modifications decreased the execution time of the above script by almost half.
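The second point can be illustrated with a small sketch. The `Field` class below is a hypothetical stand-in, not the code from commit 9c0f062; it only shows the general technique of defining `__deepcopy__` so that `copy.deepcopy` on a containing dictionary creates new field objects without recursing into their attributes.

```python
import copy


class Field:
    """Hypothetical field object with some attributes worth sharing."""

    def __init__(self, default=None, validators=None):
        self.default = default
        self.validators = validators or []

    def __deepcopy__(self, memo):
        # Return a shallow copy: a new Field whose attributes are shared
        # with the original instead of being copied recursively.
        return copy.copy(self)


declared_fields = {
    "name": Field(validators=[str.strip]),
    "age": Field(default=0),
}

# The dictionary is still deep-copied, but each Field is only shallow-copied.
copied = copy.deepcopy(declared_fields)
```

The trade-off is that mutable attributes (such as the `validators` list here) are shared between copies, so this only makes sense when fields are not mutated after declaration.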

Thanks for reporting this. I will continue to do more profiling and see where performance can be improved even further.

I underestimated the effect of passing an instance into a Nested field: doing this for both the "user" and the "collaborators" fields reduces the total runtime of the profiling script from ~5.7s to ~1.6s.

class BlogSerializer(Serializer):
    title = fields.String()
    user = fields.Nested(UserSerializer())
    collaborators = fields.Nested(UserSerializer(), many=True)
    categories = fields.List(fields.String)
    id = fields.String()
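To see why the instance form helps, here is a toy model of the behaviour described in this thread. `ToySerializer` and `ToyNested` are hypothetical classes, not marshmallow internals; the sketch assumes, as stated above, that a nested field given a class repeats the initialization (including a deepcopy) for every item.

```python
import copy


class ToySerializer:
    init_calls = 0
    declared_fields = {"name": {"type": "string"}, "email": {"type": "string"}}

    def __init__(self):
        type(self).init_calls += 1
        # Per-instance copy of the declared fields, mimicking the costly setup.
        self.fields = copy.deepcopy(self.declared_fields)

    def dump(self, obj):
        return {key: obj[key] for key in self.fields}


class ToyNested:
    def __init__(self, nested):
        self.nested = nested  # either a serializer class or an instance

    def serialize(self, items):
        out = []
        for item in items:
            # A class is instantiated once per item; an instance is reused.
            if isinstance(self.nested, type):
                serializer = self.nested()
            else:
                serializer = self.nested
            out.append(serializer.dump(item))
        return out


users = [{"name": f"user{i}", "email": f"user{i}@example.com"} for i in range(500)]

ToySerializer.init_calls = 0
ToyNested(ToySerializer).serialize(users)
calls_with_class = ToySerializer.init_calls

ToySerializer.init_calls = 0
ToyNested(ToySerializer()).serialize(users)
calls_with_instance = ToySerializer.init_calls

print(calls_with_class, calls_with_instance)
```

In this model the class form runs the setup once per user, while the instance form runs it once in total.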

That's great! Very interesting...

Thanks again!

Glad I could help.

I've made some further improvements so that performance should be good whether you pass in a Serializer class or a Serializer object into a Nested field.
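The thread does not describe what those improvements were. One plausible approach, following the earlier remark about caching the nested serializer object, is to build the instance lazily and reuse it. The sketch below uses hypothetical names (`CachingNested`, `CountingSerializer`) and is not taken from the marshmallow source.

```python
class CountingSerializer:
    """Hypothetical serializer that counts how often it is constructed."""
    created = 0

    def __init__(self):
        CountingSerializer.created += 1

    def dump(self, obj):
        return dict(obj)


class CachingNested:
    def __init__(self, nested):
        self.nested = nested  # a serializer class or an instance
        self._serializer = None

    @property
    def serializer(self):
        # Build the serializer on first use only, then reuse it.
        if self._serializer is None:
            if isinstance(self.nested, type):
                self._serializer = self.nested()
            else:
                self._serializer = self.nested
        return self._serializer

    def serialize(self, items):
        return [self.serializer.dump(item) for item in items]


field = CachingNested(CountingSerializer)
result = field.serialize([{"name": "a"}, {"name": "b"}, {"name": "c"}])
```

With this design, passing a class costs one instantiation per field rather than one per item, so the two spellings behave almost identically.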

I have the same serialization performance problem, which is how I ran into this issue.

I think you should update the documentation (http://marshmallow.readthedocs.org/en/latest/nesting.html) to recommend passing an _instance_ instead of a _class_.

@sloria, just to be clear-- performance should now be similar in both of the following scenarios:

fields.Nested(UserSerializer(), many=True)
fields.Nested(UserSerializer, many=True)

so if nested fields are running slow for me it's just my shitty code?

@mgd722 I haven't compared the two usages in a while, but they should be similar. If you're serializing ORM objects, I'd first look into your relationship loading technique and make sure you're not running into the n+1 problem.
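For anyone unfamiliar with the n+1 problem mentioned above: it occurs when loading N parent rows triggers one extra query per row to fetch a related object. The sketch below uses the standard-library `sqlite3` module rather than an ORM, and the table layout and helper names are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE blogs (
        id INTEGER PRIMARY KEY,
        title TEXT,
        user_id INTEGER REFERENCES users(id)
    );
""")
conn.executemany(
    "INSERT INTO users (id, name) VALUES (?, ?)",
    [(i, f"user{i}") for i in range(1, 11)],
)
conn.executemany(
    "INSERT INTO blogs (id, title, user_id) VALUES (?, ?, ?)",
    [(i, f"blog{i}", (i % 10) + 1) for i in range(1, 51)],
)

counter = {"queries": 0}


def run(sql, params=()):
    """Execute a query and count it."""
    counter["queries"] += 1
    return conn.execute(sql, params).fetchall()


def load_n_plus_one():
    # One query for the blogs, then one more query per blog for its user.
    result = []
    for _blog_id, title, user_id in run("SELECT id, title, user_id FROM blogs ORDER BY id"):
        (name,) = run("SELECT name FROM users WHERE id = ?", (user_id,))[0]
        result.append({"title": title, "user": {"name": name}})
    return result


def load_with_join():
    # A single query that fetches each blog together with its user.
    rows = run(
        "SELECT b.title, u.name FROM blogs b "
        "JOIN users u ON u.id = b.user_id ORDER BY b.id"
    )
    return [{"title": title, "user": {"name": name}} for title, name in rows]


counter["queries"] = 0
slow = load_n_plus_one()
n_plus_one_queries = counter["queries"]

counter["queries"] = 0
fast = load_with_join()
join_queries = counter["queries"]

print(n_plus_one_queries, join_queries)
```

Both functions return the same data, but the first issues a query per blog. With an ORM, the equivalent fix is usually an eager-loading option on the relationship, so it is worth checking before blaming the serializer.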
