Neo: Some consensus node only proposal one miner transaction when there are transactions in the network

Created on 12 Jun 2019  Â·  43Comments  Â·  Source: neo-project/neo

Recently we find that some consensus nodes only proposal one minter transaction. you can find details on neoscan from 3846424 to 3856930. During this time, there only 2 out of 7 consensus nodes proposal non-miner transactions.
And there is an old problem that sometimes one certain consensus node won't proposal transaction except miner transaction. This issue has been discussed before at #474 . But now seems not solved.
I will do some test first to find the reason.
First, I will make certain whether the transactions arrived in the mempool of the problem cn.
Do you have any idea about this? Is a solution to this currently in progress?
@vncoelho @igormcoelho @jsolman

bug critical

Most helpful comment

I have tested on testnet.

  1. I use a tool keep sending transaction to seed9.ngd.network:20333 via p2p.
  2. When I find ngd consensus node produce empty block, I get mempool of the node via rpc and most
    of the time get nothing. But sometimes get 1 or 2 transaction. I use show state check the peers. There are more than 10 connected.
  3. Then I use my tool to send tx to the consensus node directly via p2p. The node works again and relay
    transactions too and never produce empty block.
  4. Then I stop sending directly and send transactions to seed9.ngd.network:20333, the issue reproduce.
  5. Then I send transaction to seed1 testnet and seed5 testnet, the issue still exists.
  6. I restart my consensus node. It works again.

All in all, I think it seems we have route problem?
Once a node started, its peers keep unchanging, right?
Do you guys have any idea, why the cn can't get transactions from its peers?

All 43 comments

Nice, my friend, good innitiave.

Their mempool will be probably empty, it is a problem in the tx rebroadcasting.
It should be P2P optimizations with extra filter details problems.

I have tested on testnet.

  1. I use a tool keep sending transaction to seed9.ngd.network:20333 via p2p.
  2. When I find ngd consensus node produce empty block, I get mempool of the node via rpc and most
    of the time get nothing. But sometimes get 1 or 2 transaction. I use show state check the peers. There are more than 10 connected.
  3. Then I use my tool to send tx to the consensus node directly via p2p. The node works again and relay
    transactions too and never produce empty block.
  4. Then I stop sending directly and send transactions to seed9.ngd.network:20333, the issue reproduce.
  5. Then I send transaction to seed1 testnet and seed5 testnet, the issue still exists.
  6. I restart my consensus node. It works again.

All in all, I think it seems we have route problem?
Once a node started, its peers keep unchanging, right?
Do you guys have any idea, why the cn can't get transactions from its peers?

Good work!
Now we need to know if the problem is with the node itself, or it's peers. It's funny because it should affect non-CN nodes too, right?
If it always produces blocks when you send tx to it, it means they are not "processing" these transactions.
It is weird because we know it is receiving network messages 🤔 Maybe it is discarding these transactions?

Did you manage to log the messages received 'per node'? We now need confirmation that:

  • TXs messages are not arriving; OR
  • TXs are being discarded.

Is there another possible cause? I can only think about these two considering the evidence you provided.

Edit: If you have to reboot the node to make it work, is it evidence that the error is in the node itself, and not with other peers?

@KickSeason, it is this. This was one of the hypotheses, now you tested and found a fact. Nice work.
In general, the classes connection, peer, taskmanager and protocolhandle need a better organization and possible some improvements.

The class remotenode has an strange think related to the ack state, in which it is set to false every time data is sent on the TCP, correct, @erikzhang ?
https://github.com/neo-project/neo/blob/70bf2f51c148930ea027332a1f610ea0cdd811ba/neo/Network/P2P/RemoteNode.cs#L230-L234

After that, it is again set to true OnAck:
https://github.com/neo-project/neo/blob/70bf2f51c148930ea027332a1f610ea0cdd811ba/neo/Network/P2P/RemoteNode.cs#L106-L110

Which is a message usually created from here: https://github.com/neo-project/neo/blob/70bf2f51c148930ea027332a1f610ea0cdd811ba/neo/Network/P2P/Connection.cs#L131-L144

I believe we need to better arrange the timeouts and this requirement set of false all the time looks strange.

@lock9, aligned with this we have the priorities on messages, which might be one of the reasons in which they still participate in the consensus but "do not have time" to receive the other types of TCP data.

Interesting observations... perhaps message is received but not internally passed on Akka?

I will continue testing and record all p2p messages. If the node received transaction message, we can be sure it's akka problem or node internal problem.
@lock9 NGD have tried to reproduce on privatenet but not easy. I will carry on test in testnet.

any news @KickSeason ?

@shargon, can you take a look at me last message above about the ack flag?

no progress. I'm testing on testnet. I can only access one consensus node. I have to wait it reproducing this issue.

I built a p2p-plugin and filter tx inv and tx message:
p2ptx.log
my code:

    public bool OnP2PMessage(Message message)  
    {  
        if (message.Command == "inv") {  
            InvPayload inv = message.GetPayload<InvPayload>();  
            if (inv.Type == InventoryType.TX) {  
                var hs = inv.Hashes.Where(h => Blockchain.Singleton.GetTransaction(h) == null && !Blockchain.Singleton.MemPool.ContainsKey(h)).ToList();  
                if (0 < hs.Count())  
                {  
                    string hashes = "[";
                    foreach (var hash in hs)
                    {
                        if (1 < hashes.Length)
                        {
                            hashes += ",";
                        }
                        hashes += hash.ToString();
                    }
                    hashes += "]";
                    log("inv", $"receive transaction inv, type: tx, count: {hs.Count()}, hashs: {hashes}");
                }                    
            }
        }
        if (message.Command == "tx") {
            Transaction tx = message.GetTransaction();
            string txid = tx.Hash.ToString();
            log("tx", $"receive transaction {txid}");
        }
        return true;
    }

I wonder why there are so many hashes that cn didn't get tx from? I searched those hashes. They aren't on blockchain.
Do you guys have any idea?

Those hashes are on know hashes but no received by inv or tx ?

[inv] With this tag, It is an inv message.
[tx] With this, It is a tx message.
I only log the tx inv and tx message.
If you guys need consensus log, I will add.

If there are many nodes without SimplePolicy, the network will be full of illegal tx. I think this may have a negative effect.

image
Look. Although consensus nodes are producing block, 2 out of them pack more than 10 every time. Others only a few.

Ngd have set up another testnet for repoduce the issue.
I add log into v2.10.2 and deploy on testnet as consensus node to record p2p messages and tasks info. Hope I will help.

Could NGD test it too with the last patch https://github.com/neo-project/neo/pull/865?

We just set up another testnet to try reproducing this issue. But it haven't happened yet.
If we can't reproduce, we can't tell whether it works or not.
We will test this patch a few days later.

Thanks for all @KickSeason !

@shargon It's already happened on Testnet No.2, we're making analysis for consensus log.

Good news, the first step is "reproduce it"

hello guys. I add log in neo based on v2.10.2 and reprodued the issue. Here is the log file:

2019-07-02.log
and neo-scan of this net.

At height 31922, this node is Primary. It proposal empty block.
There are a few inv tx hashes received. But this node only rqeust to get one tx data.
it's tx 0x6e5ff1844dc4050e5fd85d3be495c6a200e491b3f8da12fd7c5f415f92d844de. Which is confirmed at 31914. But you cann't find inv of this tx. Maybe it's long time before.

@shargon @vncoelho @jsolman @igormcoelho
I took a quick view at the file just now. Now we can be sure the node received tx hash. But it didn't try to get them.
We should do more deep investigate.
If you need more information I will add more log and reproduce

I think there are so many high priority messages in TaskManagerMailbox that no new task for tx.
Am I right?

I will add mailbox queue count logs.

And the block messages. This node can handle block messages. Block message is also hight priority letter.
Poor network node can only handle cosensus and block messages. It may have no more capacity for more work.

I took a quick view at the file just now. Now we can be sure the node received tx hash. But it didn't try to get them.

It is like a knownhash problem?

If you need more information I will add more log and reproduce

Add more logs on memorypool please

Hmm, if it receives the transaction but doesn’t think it is valid, but adds it in known hashes, but it later becomes valid, that could be a problem if knownhashes won’t allow it to receive it again. We should check if this is what is happening.

However that would not cause all blocks by that CN to be completely empty of all but the miner transaction. So that may not explain what is happening.

There are knownhashes count in log file. About 154000 hashes, it's around 4MB.

Will add mempool log and test

Could you test in parallel, with and without patch?

You mean the knownhashes fix patch?

Yes, just for confirm if this solve the problem, at least in the same time frame

Btw
```
private void OnInvMessageReceived(InvPayload payload)
{
UInt256[] hashes = payload.Hashes.Where(p => knownHashes.Add(p)).ToArray();
if (hashes.Length == 0) return;
switch (payload.Type)
{
case InventoryType.Block:
using (Snapshot snapshot = Blockchain.Singleton.GetSnapshot())
hashes = hashes.Where(p => !snapshot.ContainsBlock(p)).ToArray();
Logger.Info("ProtocolHandler", $"receive inv, type: block, hashes: {hashes.HashesToString()}");
break;
case InventoryType.TX:
using (Snapshot snapshot = Blockchain.Singleton.GetSnapshot())
hashes = hashes.Where(p => !snapshot.ContainsTransaction(p)).ToArray();
Logger.Info("ProtocolHandler", $"receive inv, type: tx, hashes: {hashes.HashesToString()}");
break;
}
if (hashes.Length == 0) return;
system.TaskManager.Tell(new TaskManager.NewTasks { Payload = InvPayload.Create(payload.Type, hashes) }, Context.Parent);
}
````
This is where I add inv tx log. I think it is nothing to do with knownHases in ProtocolHandler if there is a print log.

I think we make a great progress. I add more logs and reproduced again in our testnet2.
2017-07-07.log. This all logs of a fault node.
consensus.log. Consensus log I filtered out.
TaskManagerMailbox.log. TaskManagerMailbox log I filtered out.
you can have a look where I add log here: https://github.com/KickSeason/neo/tree/test/log

you can look at the TaskManagerMailbox.log first. It's is crowding by a lot of low priority messages.
image
@vncoelho @igormcoelho @shargon @jsolman

13.300 low priority messages????

yes

TaskManagerMailbox.log. TaskManagerMailbox log I filtered out.

Is the same url, is this right?

Sorry, mistake.

I would like to know how many duplicates, we should check the hash before append to the queue

How many hours until the issue appear?

3 days after restart.Only 1 node behave like this.
Now only queue messages count, if you need detail, I will add extra log.

I would like to know the type of the messages in the queue, something like
{ "GetBlock": 1123, "GetPeers": 999 ... }

is that possible?

Thanks for your work!

Will do.

We should cache the consensus messages on the CN nodes in case they arrive out of order. Other nodes should cache also for a time and drop duplicates and prevent storms of consensus messages.

Yes, we can have a consensus message pool

Also, any consensus message from a height less than the current block can just be dropped immediately and not forwarded. This alone will probably fix the issue.

I will lose this because #900 #899 ?

Was this page helpful?
0 / 5 - 0 ratings

Related issues

erikzhang picture erikzhang  Â·  72Comments

Tommo-L picture Tommo-L  Â·  66Comments

EdgeDLT picture EdgeDLT  Â·  59Comments

erikzhang picture erikzhang  Â·  43Comments

lock9 picture lock9  Â·  35Comments