Yarp: bufferedPort data corruption

Created on 19 Apr 2017  路  24Comments  路  Source: robotology/yarp

I use a bufferedPort to send 'events' from the event-cameras on the purple robot to other processing modules. The events are time-stamped chronologically on the icub-zynq board inside the robot. Sometimes I would notice the events received at a processing module were not in chronological order. The attached picture shows an example of a bottle of events with timestamps that are larger than those received afterwards.
bottlesenderror

The data in the bottle was identical to one received two bottles after (there are some differences that can be accounted for), but the bottle number in the envelope was still in the correct order. I therefore concluded the data in the bottle was changing after asking the bottle to be sent and before the bottle actually being sent. A bottle is sent as:

ev::vBottleMimic &vbm = portvBottle.prepare();
vbm.setdata((const char *)data.data(), nBytesRead);
vStamp.update();
portvBottle.setEnvelope(vStamp);
portvBottle.write(strict);

and the ev::vBottleMimic is a yarp::os::Portable with the following write function:

    virtual bool write(yarp::os::ConnectionWriter& connection) {

        connection.appendBlock((const char *)header1.data(),
                                       header1.size() * sizeof(YARP_INT32));
        connection.appendBlock((const char *)header2.data(),
                                       header2.size() * sizeof(char));
        connection.appendBlock((const char *)header3.data(),
                                       header3.size() * sizeof(YARP_INT32));
        connection.appendBlock(datablock, datalength);

        return !connection.isError();
    }

Indeed, I can prevent the error from occurring by adding portvBottle.waitForWrite(); after portvBottle.write(strict); but in doing so the bufferedPort is acting like a regular port as the thread is blocked until the message is sent.

After I call portvBottle.write(strict); I almost immediately start over-writing the same memory location that was given to the port to write (i.e. data.data()) and therefore it could be plausible that the data is being changed. However, my understanding of the connection.appendBlock() would cause a copy of the data to be made and therefore I should be safe to modify the original memory (as opposed to connection.appendExternalBlock()). All memory copying functions I could find associated with appendBlock() seemed to be in the same thread. If my assumption is correct, perhaps the memory copy in appendBlock() needs to be flushed (if it is possible?) before returning.

  1. I can operate using portvBottle.waitForWrite(); but this issue might also be occurring for others, and is especially undesirable given I am using a strict writing procedure. The system is most probably pushing the limits in order to send as many events as possible with minimal latency.
  2. Perhaps there is another cause? Could it be possible that portvBottle.prepare(); returns a bottle that has not yet fully been sent? or a bug in my code?
Documentation QSupport Answered

All 24 comments

Isn't a vbm.clear() missing after portvBottle.prepare()?
I don't know how your class works, but if you call prepare() on bufferedPort<Bottle>, it does not return an empty bottle: it returns the last bottle used! (you are thus appending data to a bottle that increase its size every iteration, resulting in a catastrophic memory leak)

Thanks for the comment @randaz81. The way the ev::vmb functions means it doesn't have the problem of an increasing size of the bottle. The data is not appended as with a yarp::os::Bottle. However you are probably right in that there could be a memory leak if yarp::os::ConnectionWriter::appendBlock() does a memory copy and doesn't have an automatic way to clear it. I can't find a method to clear the block within the yarp::os::ConnectionWriter class so I assumed it must be internally cleared after writing.

Initially I was using a yarp::os::Port and using yarp::os::ConnectionWriter::appendExternalBlock() in which case the memory isn't owned by the bottle and therefore your _shouldn't_ clear it.

If you know the correct way I should be clearing data within the yarp::os::ConnectionWriter class I can try it to see if it is related to the problem I am having. Grazie.

Maybe the problem is at the receiver side, if it has not been fixed there used to be a race condition in the read. If I remember correctly reading with a callback is not affected by the race condition. Can you try that out? @drdanz knows more.

My understanding was the race condition was between the envelope and the bottle contents. If that is the same issue you are mentioning. Currently I am using a callback to receive the data. Thinking if it is possible that the problem is at the receiver end - I don't think it could be: the 'out of order' packet is not an exact copy of the data, but rather a filtered version of it. The filtering happens at the sender side of communications. Thanks @lornat75 .

From the graph it looks like something that might not be related to YARP... How often do you get this kind of error?
In an interval of 500 events the time is still monotonic, the number of samples is too large to be related to the BufferedPort buffer or some other race condition. I suspect that this might be related to the system clock. I'm not sure about the clock we are using but depending on the clock used, there might be some adjustments (for example when ntp updates the system time).

@drdanz, speaking of system clock, yesterday @barbalberto and @massimoregoli noticed that the purple setup had some problem with the NTP daemon and, in particular, it was not able to properly synch to the correct world time (@barbalberto knows detail about it).
Taking this issue into account, my two cents here is that, as @drdanz said, the problem is related to the system clock and error in the synchronization mechanism.

@drdanz - the error is not periodic, but occurs on the order of once every several seconds. Just to make sure you have interpreted the figure correctly. The event timestamp (y axis) is created by an FPGA counter and is not tied at all to the system clock (and doesn't need to be synchronised). The xaxis is the event number with the yellow lines indicating the packet separation. So it shows three packets being sent, with the timescale being under 1 millisecond.

Sorry, I think I'm not understanding the problem... Probably I'm missing something, if the timestamp is generated on the FPGA and several events are sent as one single bottle, why should the envelope be responsible of all the timestamps of the events in the bottle?

From my perspective the envelope has nothing to do with the problem I am reporting. The envelope is correctly increasing for every packet I receive and appears to be working correctly.

The problem is that I am missing a data packet - which is replaced by an (almost) identical data packet that is then correctly sent again two packets later. My belief is that sometime between when I ask to send a data packet and when it is actually sent I can overwrite that data and the new data is sent. I think this shouldn't be possible if I am using appendBlock() on my connection writer as it should make a copy of the data that is to be sent. If the data I ask to be sent is copied it shouldn't be possible to overwrite.

Sorry for mentioning envelopes and making confusion I now realize the problem is unrelated.

Are you sure nobody is writing to ev::vBottleMimic &vbm after the call write() and the next call prepare()? just checking you are using the port correctly.

@lornat75 - I double checked that there are no calls to ev::vBottleMimic &vbm between write() and prepare().

I do, however, alter the data.data() which is the payload of the packet between write() and prepare(). If I was using ConnectionWriter::appendExternalBlock() I would expect that the data I send to be altered by changing data.data(). However with ConnectionWriter::appendBlock() I expect the initial data.data() to be copied before sending. This is where I _think_ might be the problem, and at this point I get a little lost in the YARP code.

I can prevent the problem by using BufferedPort::waitForWrite() directly after BufferedPort::write() to ensure the data is sent before altering the contents of data.data(). Can you suggest any debugging steps to validate whether the problem is indeed as I think?

I think the race happens anyway because the BufferedPort will access the object concurrently with your thread.

I would solve the problem by removing the race condition in your code. Why don't you use two (or more) objects like data and you alternate which pointer you pass to:

vbm.setdata();

to avoid the race condition?

Okay, I would have hoped that the bufferedPort would access the copied version of the data and thus prevent the race condition. Indeed, if I don't use the the copying method (and use ConnectionWriter::appendExternalBlock() instead) the data is completely corrupted due to the same problem, so at least ConnectionWriter::appendBlock() is working as intended, but for some other reason doesn't work 100% of the time.

Unfortunately I am already using a double-buffer system which is why the corrupted packet resembles the packet that is advanced by 2. It's possible that a third buffer could solve the problem (or just make it less likely).

As for a practical work-around, I believe that I can use a standard port without too much problem. I already have a multiple threaded environment and so waiting for the port to send data doesn't largely impact the collection of data from the sensor itself.

It could be a problem that affects other modules or users, but maybe less likely if they aren't trying to push through data at the same high rate that I am.

Can you point as to the full version of your code to better see what's going on and to create a small example for regression tests?

Thanks all for the help, after talking with @lornat75 I now understand the problem. I was under the assumption that when I call yarp::os::BufferedPort::write() that ev::vBottleMimic::write() would be called. However, it is the BufferedPort thread which envokes the ev::vBottleMimic::write() itself when it has a chance to run, which makes a lot of sense. Therefore there can be a race condition between calling yarp::os::BufferedPort::write() and the data being copied in the ev::vBottleMimic::write() function. Which is why yarp::os::BufferedPort::waitForWrite() managed to solve the problem. Thanks, and I can close the issue.

My two cents here: is the result of this issue worth being write to Q&A?
It may be useful to many other people in the future.

@arrenglover @lornat75 @pattacini

@claudiofantacci I agree with you that this discussion might be helpful to understand the gut of yarp::os::BufferedPort; however, I would first try to clean it a bit from the details that tend to be too tied to the particular implementation @arrenglover has developed.

@pattacini I agree with you, let's see if @arrenglover can take very core of this discussion and write a small Q&A on the BufferPort.
Would be very much appreciated 馃槃 !

As the secrets of BufferedPort have caused a lot of pain and misunderstandings (now and in the past) I have written this page:

http://www.yarp.it/yarp_buffering.html

it covers some of the problems discussed in this issue. I hope it helps.

@lornat75 I was referring to more about when to use yarp::os::BufferedPort::waitForWrite(), which I think fits well in a Q&A post and it is the actual solution that solved the issue for @arrenglover 馃槃

The new page for BufferPort and messages life-cycle is definitely 馃憤

@claudiofantacci - yarp::os::BufferedPort::waitForWrite() is not really a solution to the problem. It forces the BufferedPort to act like a standard yarp::os::Port in that it blocks after every write call until the packet has been sent to the network layer. I was mentioning it as it stopped the problem from occurring and could perhaps lead to an understanding of where the problem was. I wouldn't suggest it as a final solution to anyone that has the same problem.

Mmmm, I see. Then maybe there is no need for a Q&A, but at the same time I do not understand whether it is need to fix something, and keep this issue open, or not.

There is nothing to fix in terms of code (except my own). I just misunderstood the way the BufferedPort was working. In terms of documentation - I'm not sure. The problem only arises because I have my own portable class which only points to a memory location that I want to send. I reuse the same memory locations to avoid calls to malloc (to push the speed on a not so powerful cpu). I can't imagine it is a common problem?

My final solution as discussed with Lorenzo could be to add a third buffer to the system that would remove the race condition. Alternatively, I can switch to a standard Port given I already observe my system that uses waitForWrite() seems to be working satisfactorily.

Now I understand the problem and the specific situation.
Thanks for explaining it (maybe once more) 馃槃

Was this page helpful?
0 / 5 - 0 ratings

Related issues

diegoferigo picture diegoferigo  路  3Comments

diegoferigo picture diegoferigo  路  3Comments

drdanz picture drdanz  路  3Comments

Nicogene picture Nicogene  路  4Comments

Giulero picture Giulero  路  3Comments