Realm-cocoa: Deleting large numbers of objects is extremely slow

Created on 23 Oct 2017  路  8Comments  路  Source: realm/realm-cocoa

Goals

I have a data model that represents sensor samples from devices. Samples have an associated device, a timestamp, a value and a unit (UInt8 enum). I've extracted the relevant classes in the code sample below.

The samples accumulate over time, so they are reduced periodically when they get older to obtain an upper bound for the number of samples stored for each device (currently around 200k).

The reduction process executes in a single write transaction, and objects that will be deleted are first accumulated into an array and then passed to realm.delete(objects).

The problem I'm having is that the reduction process involves deleting large numbers of samples, and this takes extremely long time, in the order of several minutes. Right now I'm trying to delete around 125K samples from a total of 2M samples, and it's been running for half an hour.

Clearly I'm doing something wrong.

Any advice on doing this kind of process with Realm? Any hints on how to manage large data efficiently would be greatly appreciated.

Code Sample

class Device: Object {
// ...
    let samples = LinkingObjects(fromType: Sample.self, property: "device")
// ...
}

class Sample: Object {
    @objc dynamic var timestamp: Date = Date(timeIntervalSince1970: 0)
    @objc dynamic var value: Double = 0
    @objc dynamic var unit: Unit = .T
    @objc dynamic var device: Device?

//...
    override static func indexedProperties() -> [String] {
        return [ "timestamp", "unit" ]
    }
// ...
}

Version of Realm and Tooling

I'm using RealmSwift 2.10.2, iOS 11.0.3

T-Help

Most helpful comment

I put together a simple bit of code which I could run locally which tries to mimick what your code is doing:

@interface Sample : RLMObject
@property (nonatomic) int value;
@end
@implementation Sample
@end

void prepare(RLMRealm *realm) {
    for (int i = 0; i < 100; ++i) {
        @autoreleasepool {
            [realm beginWriteTransaction];
            for (int j = 0; j < 2000; ++j) {
                [Sample createInRealm:realm withValue:@[@(j)]];
            }
            [realm commitWriteTransaction];
        }
    }
}

void run(RLMRealm *realm) {
    srand(0); // use a fixed seed

    NSLog(@"begin");
    RLMResults *queryResult = [[Sample objectsInRealm:realm where:@"value > 50"] sortedResultsUsingKeyPath:@"value" ascending:YES];
    NSMutableArray *candidates = [NSMutableArray new];
    for (Sample *sample in queryResult) {
        [candidates addObject:sample];
    }

    NSLog(@"pick candidates");
    NSMutableArray *samplesToPrune = [NSMutableArray new];
    Sample *previousSample;
    for (Sample *candidate in candidates) {
        if (rand() % 8 == 0) {
            [samplesToPrune addObject:candidate];
        }
        previousSample = candidate;
    }

    NSLog(@"delete");
    [realm beginWriteTransaction];
    [realm deleteObjects:samplesToPrune];
    [realm commitWriteTransaction];

    NSLog(@"end");
}

This inserts 200k objects, and then deletes approximately 10k of them. On my mac, the write transaction which deletes the objects takes about a minute.

The most important change to make here is to wrap the code which selects which objects to delete in an explicit autoreleasepool:

var samplesToPrune = [Sample]()

autoreleasepool {
    let candidates = database.findSamples(forDevice: device, with: unit,
                                          startDate: Date.distantPast,
                                          endDate: timestamp.addingTimeInterval(-maxTtlOfNextRate))
    for candidate in candidates {
        if ... {
            samplesToPrune.add(candidate)
        }
    }
}

database.deleteSamples(samplesToPrune)

In my reduced sample, this speeds up the deletion from 54 seconds to 3 seconds. The reason this is so much faster is that deleting objects in Realm can change the internal row numbers of other objects. Each live accessor object stores the row number it points to, and that potentially has to be updated, so having large numbers of objects alive when deleting objects hurts performance.

After making this change, the next easy optimization is to skip copying the objects from the Results to an array. In the code you've shown there's no reason to do this, and doing it takes about a quarter of the total runtime of the process.

Scaling this up to deleting 125k objects out of 2M starts to run into trouble again, though. At this point, there's enough object accessors for just the objects you're deleting to cause problems. I tried running this to see how long it'd take, but killed it after a few minutes. A fix for this is to do the deletion in two stages: first, you flag all of the objects which should be deleted, and then you use a realm query to do the actual deletion:

class Sample: Object {
    @objc dynamic var timestamp: Date = Date(timeIntervalSince1970: 0)
    @objc dynamic var value: Double = 0
    @objc dynamic var unit: Unit = .T
    @objc dynamic var device: Device?
    @objc dynamic var shouldDelete: Bool = false
// ...
}

autoreleasepool {
    try! realm.write {
        let candidates = database.findSamples(forDevice: device, with: unit,
                                              startDate: Date.distantPast,
                                              endDate: timestamp.addingTimeInterval(-maxTtlOfNextRate))
        for candidate in candidates {
            if ... {
                candidate.shouldDelete = true
            }
        }
    }
}

try! realm.write {
    realm.delete(realm.objects(Sample.self).filter("shouldDelete = true"))
}

With this change flagging the objects to delete takes 5 seconds, and then the deletion itself takes well under a second.

All 8 comments

@suzukieng Could you share your code around where you are going to delete objects? I suspect that copying objects into an array is a bottleneck instead of deleting. Also, you should pass Results < Sample > instead of Array < Sample > to the delete() method for performance. In my environment, I can delete 150K objects in just less than a second on iPhone 7.

Dear @kishikawakatsumi -san,

Many thanks for getting back to me so quickly.
I can't share the code as-is, but here is what I do, step for step.... :-)

I obtain a list of candidates for reduction in a time range (start, end), by filtering the Sample's by timestamp, device and unit and copying the result into an array.

let candidates = database.findSamples(forDevice: device, with: unit,
                                       startDate: Date.distantPast,
                                       endDate: timestamp.addingTimeInterval(-maxTtlOfNextRate))

// ...
    func findSamples(forDevice device: Device, with unit: Unit,
                     startDate: Date, endDate: Date) -> [Sample] {
        var result = [Sample]()
        let queryResult = realm.objects(Sample.self)
            .filter("device == %@ AND timestamp >= %@ AND timestamp < %@ AND unit == %@",
                    device, startDate, endDate, unit.rawValue)
            .sorted(byKeyPath: "timestamp", ascending: true)
        for sample in queryResult {
            result.append(sample)
        }
        return result
    }

Then I perform the reduction operation on this array: I will do a full iteration over the array, and add Sample's that are "pending deletion" to another array (samplesToPrune).

            var samplesToPrune = [Sample]()
            var previousSample: Sample?
            for candidate in candidates {
                if let previousSample = previousSample {
                    if /* some logic based on timestamp of sample and relation to previous sample */ {
                        samplesToPrune.append(candidate)
                        numDeleted += 1
                        continue // keep previous sample
                    }
                }
                previousSample = candidate
            }

Then finally I delete all objects in the samplesToPrune array.

database.deleteSamples(samples: samplesToPrune)
//...
    func deleteSamples(samples: [Sample]) {
        try! ensureInWriteTransaction()
        realm.delete(samples)
    }

You mention that I should pass Results to Realm for deletion, but how would do that if I can not express which samples I want to delete in the form of a Realm query, i.e. need to do some in-memory processing first?

I put together a simple bit of code which I could run locally which tries to mimick what your code is doing:

@interface Sample : RLMObject
@property (nonatomic) int value;
@end
@implementation Sample
@end

void prepare(RLMRealm *realm) {
    for (int i = 0; i < 100; ++i) {
        @autoreleasepool {
            [realm beginWriteTransaction];
            for (int j = 0; j < 2000; ++j) {
                [Sample createInRealm:realm withValue:@[@(j)]];
            }
            [realm commitWriteTransaction];
        }
    }
}

void run(RLMRealm *realm) {
    srand(0); // use a fixed seed

    NSLog(@"begin");
    RLMResults *queryResult = [[Sample objectsInRealm:realm where:@"value > 50"] sortedResultsUsingKeyPath:@"value" ascending:YES];
    NSMutableArray *candidates = [NSMutableArray new];
    for (Sample *sample in queryResult) {
        [candidates addObject:sample];
    }

    NSLog(@"pick candidates");
    NSMutableArray *samplesToPrune = [NSMutableArray new];
    Sample *previousSample;
    for (Sample *candidate in candidates) {
        if (rand() % 8 == 0) {
            [samplesToPrune addObject:candidate];
        }
        previousSample = candidate;
    }

    NSLog(@"delete");
    [realm beginWriteTransaction];
    [realm deleteObjects:samplesToPrune];
    [realm commitWriteTransaction];

    NSLog(@"end");
}

This inserts 200k objects, and then deletes approximately 10k of them. On my mac, the write transaction which deletes the objects takes about a minute.

The most important change to make here is to wrap the code which selects which objects to delete in an explicit autoreleasepool:

var samplesToPrune = [Sample]()

autoreleasepool {
    let candidates = database.findSamples(forDevice: device, with: unit,
                                          startDate: Date.distantPast,
                                          endDate: timestamp.addingTimeInterval(-maxTtlOfNextRate))
    for candidate in candidates {
        if ... {
            samplesToPrune.add(candidate)
        }
    }
}

database.deleteSamples(samplesToPrune)

In my reduced sample, this speeds up the deletion from 54 seconds to 3 seconds. The reason this is so much faster is that deleting objects in Realm can change the internal row numbers of other objects. Each live accessor object stores the row number it points to, and that potentially has to be updated, so having large numbers of objects alive when deleting objects hurts performance.

After making this change, the next easy optimization is to skip copying the objects from the Results to an array. In the code you've shown there's no reason to do this, and doing it takes about a quarter of the total runtime of the process.

Scaling this up to deleting 125k objects out of 2M starts to run into trouble again, though. At this point, there's enough object accessors for just the objects you're deleting to cause problems. I tried running this to see how long it'd take, but killed it after a few minutes. A fix for this is to do the deletion in two stages: first, you flag all of the objects which should be deleted, and then you use a realm query to do the actual deletion:

class Sample: Object {
    @objc dynamic var timestamp: Date = Date(timeIntervalSince1970: 0)
    @objc dynamic var value: Double = 0
    @objc dynamic var unit: Unit = .T
    @objc dynamic var device: Device?
    @objc dynamic var shouldDelete: Bool = false
// ...
}

autoreleasepool {
    try! realm.write {
        let candidates = database.findSamples(forDevice: device, with: unit,
                                              startDate: Date.distantPast,
                                              endDate: timestamp.addingTimeInterval(-maxTtlOfNextRate))
        for candidate in candidates {
            if ... {
                candidate.shouldDelete = true
            }
        }
    }
}

try! realm.write {
    realm.delete(realm.objects(Sample.self).filter("shouldDelete = true"))
}

With this change flagging the objects to delete takes 5 seconds, and then the deletion itself takes well under a second.

@tgoyne Thank you so much for this extremely useful answer. I like the mark/sweep approach, it's like a garbage collector. I will attempt to refactor my code and see how it performs, and post back here.

@tgoyne @kishikawakatsumi I refactored my code to do the deletion in two phases, first by marking and then deleting. I ran the tests again and get a significant improvement, thanks so much! But I still see O(n^2) time for deletion vs. O(n) for inserts. In the test, the majority of the samples are deleted. This is not normally the case, as the cleanup runs periodically and the batches are smaller, but still good to know. Here are the numbers:

realmbenchmark

Is there any way to work around the O(n^2) for deletion?
Does it make sense for instance to put a cap on the number of objects marked/deleted in one pass, and then performing multiple passes? I see that you batched your insert into 2K objects, that's why I'm asking.

Numbers are from the iPhone 7 Simulator running on a Late 2013 2.2 Ghz MacBook Pro.

The deletion itself is normally linear time. The part of the process which can be superlinear is updating Results and Object instances of the type being deleted.

Some examples of various things and their asymptotic performance:

// Assuming no other Realm objects on the same thread

// O(1)
// Clearing a table is special-cased
realm.delete(realm.objects(Sample.self))

// O(N)
// Still using the table-clear special case, but now needs to detach all the
// accessors in the array
autoreleasepool {
  let objects = Array(realm.objects(Sample.self))
  realm.delete(realm.objects(Sample.self))
}

// O(N)
// Now loops over the objects and checks if they match the query
autoreleasepool {
  realm.delete(realm.objects(Sample.self).filter("TRUEPREDICATE"))
}

// O(N^2)
// Loops over the objects in the array to delete them, and then for each one
// loops over every object in the array to check if they're the object being
// deleted
autoreleasepool {
  let objects = Array(realm.objects(Sample.self))
  realm.delete(objects)
}

// O(N^2)
// Slightly better constant factor than the above but basically the same thing being done
autoreleasepool {
  let objects = Array(realm.objects(Sample.self))
  realm.delete(realm.objects(Sample.self).filter("TRUEPREDICATE"))
}

// O(N)
// Object accessors are destroyed before the deletion is done, so no need to
// update them
autoreleasepool {
  let objects = Array(realm.objects(Sample.self))
}
autoreleasepool {
  realm.delete(realm.objects(Sample.self).filter("TRUEPREDICATE"))
}

One implication of this is that deleting items in smaller chunks is unlikely to help you except by coincidence, as the time spent is proportional to number of objects being deleted times number of object accessors (including ones not being deleted).

(As a side note, we're working on some architectural changes which would happen to make this problem go away entirely, but that won't be done any time soon.)

@tgoyne Thank you, very helpful again. I need to go over my code again then. It is implemented in the fashion you describe at the end... object accessors are destroyed by wrapping the mark phase in an autoreleasepool before deletion, so I should be able to achieve O(N) time but currently seeing O(N^2).

It's been two weeks so I'm closing this ticket. Feel free to create a new one if you require further assistance, and thanks for using Realm!

Was this page helpful?
0 / 5 - 0 ratings

Related issues

carvalho-oak picture carvalho-oak  路  3Comments

duribreux picture duribreux  路  3Comments

dennisgec picture dennisgec  路  3Comments

jpsim picture jpsim  路  3Comments

yangmeyer picture yangmeyer  路  3Comments