In my production environment, slave replication always lags by many hours!
127.0.0.1:6379> info replication
role:master
connected_slaves:2
slave0:ip=10.xxx.xxx.xxx,port=6379,state=online,offset=416543935501,lag=1
slave1:ip=10.xxx.xxx.xxx,port=6379,state=online,offset=416543965574,lag=1
master_repl_offset:416543969598
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:416542921023
repl_backlog_histlen:1048576
The slave offsets and master_repl_offset differ by a lot.
How can I solve or reduce this problem?
repl_backlog_size is equal to repl_backlog_histlen, so the buffer size (default 1048576 bytes = 1 MB) is too small.
The values here all make sense, they just aren't clearly explained and have some weird names.
The repl_backlog fields are only used for PSYNC, so they could just as well be named psync_buffer.
repl_backlog_size is the capacity of a buffer holding data for PSYNC. repl_backlog_histlen is how much actual data is in the PSYNC buffer. They will usually be equal since repl_backlog_histlen can only grow as big as repl_backlog_size.
Also notice how the distance from the backlog first byte offset (repl_backlog_first_byte_offset) to master_repl_offset matches the maximum PSYNC buffer size (repl_backlog_size), which is also equal to the currently populated PSYNC buffer data (repl_backlog_histlen). So, master_repl_offset - repl_backlog_first_byte_offset = repl_backlog_size: 416543969598 - 416542921023 = 1048575 (the one-byte difference is most likely because both offsets are inclusive, so the byte count is the difference plus one).
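A quick check of that arithmetic in Python, using the numbers from the INFO output above. The `+ 1` step rests on an assumption that both offsets are inclusive (first and last byte held in the backlog), which would account for the one-byte difference:

```python
# Values copied from the INFO replication output above.
master_repl_offset = 416543969598
repl_backlog_first_byte_offset = 416542921023
repl_backlog_histlen = 1048576

# Plain difference between the two offsets.
span = master_repl_offset - repl_backlog_first_byte_offset
print(span)  # 1048575

# Assumption: both offsets are inclusive, so the byte count is span + 1.
print(span + 1 == repl_backlog_histlen)  # True
```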
The actual lag is the difference between each slave offset and the master_repl_offset. So, in this case, slave0 is 416543969598 - 416543935501 = 34097 bytes (about 34 KB) behind the master and slave1 is 416543969598 - 416543965574 = 4024 bytes (about 4 KB) behind the master.
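Since INFO does not report this directly, here is a minimal sketch that derives the per-slave byte lag from the raw `info replication` text. The function name `replica_lag_bytes` is made up for illustration, and the parsing assumes the `slaveN:key=value,...` line format shown in the output above:

```python
import re

def replica_lag_bytes(info_text):
    """Return {slave_name: bytes behind master} from INFO replication text."""
    master_offset = None
    slave_offsets = {}
    for line in info_text.splitlines():
        line = line.strip()
        if line.startswith("master_repl_offset:"):
            master_offset = int(line.split(":", 1)[1])
            continue
        # Matches lines such as "slave0:ip=...,offset=...,lag=1".
        m = re.match(r"(slave\d+):(.*)", line)
        if m:
            fields = dict(kv.split("=", 1) for kv in m.group(2).split(","))
            slave_offsets[m.group(1)] = int(fields["offset"])
    if master_offset is None:
        raise ValueError("master_repl_offset not found")
    return {name: master_offset - off for name, off in slave_offsets.items()}

sample = """\
role:master
connected_slaves:2
slave0:ip=10.xxx.xxx.xxx,port=6379,state=online,offset=416543935501,lag=1
slave1:ip=10.xxx.xxx.xxx,port=6379,state=online,offset=416543965574,lag=1
master_repl_offset:416543969598
"""
print(replica_lag_bytes(sample))
# {'slave0': 34097, 'slave1': 4024}
```

The sketch works on pasted text so it can be run without a Redis connection; the same function could be fed the output of `redis-cli info replication`.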
The _actual_ replication lag could be reported nicer in the INFO output, but... it isn't. :-\
Useful explanation @mattsta, would be great to have it on http://redis.io/commands/info
@antirez Would you be willing to accept a patchset that adds a field with the difference between slavex.offset and master_repl_offset to be able to see at a glance the number of bytes that each slave lags behind the master?