Author: Xin Zhang

Background:

During the AElf single-node testing phase, testers found that the node suddenly went offline. The logs showed that all Workers (the transaction execution processes) had dropped and transaction execution had halted, which caused the node to crash.

Preliminary diagnostics:

This problem was puzzling: since the node and all Workers run on the same server, network communication should not be the issue. Further diagnostics revealed that the main node, all Workers, and the Lighthouses dropped offline at almost the same time. Continuing to troubleshoot, we found the cause through Zabbix: at one point the server's RAM usage came close to capacity, and the timestamp coincided with the time at which the node malfunctioned.

Reproducing the problem:

We focused our testing on memory usage. The tests found that as the node runs for a long time, the process occupies more and more of the server's memory. Memory usage grows significantly after large numbers of transactions are sent, and the memory is not released even after the transactions stop.

How we reproduced the problem:

First, let me introduce the server environment: Ubuntu 16.04.5 LTS, with .NET Core version 2.1.402.

When the node is running alone, its memory usage is about 90 MB.

Then, by continuously sending a large number of transactions to the node, we can watch the node's transaction pool accumulate and execute them. This can be observed from the image below.

After monitoring the system for some time, memory usage reached 1 GB.

At this point, we stopped sending transactions. As seen below, all transactions already in the transaction pool have been executed.

We continued to observe the memory footprint and found that even after some time, memory usage had not decreased; it held steady at about 1 GB.
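This is the classic symptom of live references pinning memory: a garbage-collected runtime cannot reclaim objects that are still reachable, no matter how long ago they were last used. The effect can be sketched in a few lines of Java (purely illustrative; the AElf node itself is C#, but the managed-memory behavior is the same):

```java
import java.util.ArrayList;
import java.util.List;

public class RetainedMemoryDemo {
    // Currently used heap = total heap minus the free portion.
    static long usedBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    public static void main(String[] args) {
        // Simulate a transaction pool that keeps every payload reachable.
        List<byte[]> pool = new ArrayList<>();
        for (int i = 0; i < 50; i++) {
            pool.add(new byte[1024 * 1024]); // 1 MB per "transaction"
        }
        System.gc();
        // While the pool still references the payloads, used heap cannot
        // drop below their total size (~50 MB).
        System.out.println("retained, used >= 50 MB: "
                + (usedBytes() >= 50L * 1024 * 1024));

        pool.clear(); // drop the references; only now is the memory reclaimable
        System.gc();  // a hint, not a guarantee, so we don't assert on the result
        System.out.println("after clear, used: " + usedBytes() + " bytes");
    }
}
```

Only dropping the references makes the memory eligible for collection; "finishing" the work is not enough if a long-lived structure still points at the objects.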

Problem Analysis:

Next, we use lldb to analyze the node.

This is done by first installing lldb on the server:

sudo apt-get install lldb

Find the location of the local libsosplugin.so:

find /usr -name libsosplugin.so

Start lldb and attach it to the node process (here PID 13067):

sudo lldb -p 13067

Load libsosplugin.so and set the CLR path:

plugin load /usr/share/dotnet/shared/Microsoft.NETCore.App/2.1.4/libsosplugin.so

setclrpath /usr/share/dotnet/shared/Microsoft.NETCore.App/2.1.4/

Then dump statistics for the objects on the managed heap:

dumpheap -stat

We can see that there are large numbers of the following objects:

AElf.Kernel.TransactionHolder

System.String

AElf.Common.Address

System.Collections.Concurrent.ConcurrentDictionary`2+Node[[AElf.Common.Hash,AElf.Common],[AElf.Kernel.TransactionHolder,AElf.Kernel.TxHub]]

AElf.Kernel.Transaction

AElf.Common.Hash

Google.Protobuf.ByteString

System.Byte[]

Let’s look at objects larger than 1024 bytes.

We can see that there are 4 relatively large objects of the same type:

System.Collections.Concurrent.ConcurrentDictionary`2+Node[[AElf.Common.Hash, AElf.Common],[AElf.Kernel.TransactionHolder, AElf.Kernel.TxHub]][]

Looking further at the objects corresponding to this MethodTable, we can see that there are 8 objects, 4 of which are large. Picking one of them to view its object information, we found that 573,437 values are stored in it.

Based on the above analysis, we checked the corresponding source code and located the class AElf.Kernel.TxHub, whose main role is to store the transaction pool's transaction data. This class contains 8 ConcurrentDictionary<Hash, TransactionHolder> instances for storing transaction data, while the TransactionHolder class stores the Hash, the Transaction, and so on, which is consistent with the results of the memory analysis above. Looking at the internal logic, we found that all transactions remain stored in TxHub after they enter the transaction pool and are never released. From this we were able to locate the core of the problem.
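The leak pattern, and its fix, can be sketched in Java. TxPool, TxHolder, and the method names below are illustrative stand-ins rather than the actual AElf classes, with ConcurrentHashMap playing the role of C#'s ConcurrentDictionary<Hash, TransactionHolder>:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Minimal sketch of the leak: a long-lived pool that keeps every
// transaction reachable. Names are illustrative, not AElf's.
public class TxPool {
    // Pairs a transaction hash with its payload, mirroring the
    // ConcurrentDictionary<Hash, TransactionHolder> seen in the heap dump.
    static final class TxHolder {
        final String hash;
        final byte[] payload;
        TxHolder(String hash, byte[] payload) {
            this.hash = hash;
            this.payload = payload;
        }
    }

    private final Map<String, TxHolder> pool = new ConcurrentHashMap<>();

    public void receive(String hash, byte[] payload) {
        pool.put(hash, new TxHolder(hash, payload));
    }

    // Buggy version: execute but never remove, so the pool grows without
    // bound and pins every transaction for the life of the process.
    public void executeLeaky(String hash) {
        TxHolder h = pool.get(hash);
        // ... run the transaction using h ...
    }

    // Fixed version: release the holder once the transaction is done,
    // so the GC can reclaim it.
    public void executeAndRelease(String hash) {
        TxHolder h = pool.remove(hash);
        // ... run the transaction using h ...
    }

    public int size() { return pool.size(); }

    public static void main(String[] args) {
        TxPool leaky = new TxPool();
        TxPool fixedPool = new TxPool();
        for (int i = 0; i < 1000; i++) {
            leaky.receive("tx" + i, new byte[64]);
            fixedPool.receive("tx" + i, new byte[64]);
        }
        for (int i = 0; i < 1000; i++) {
            leaky.executeLeaky("tx" + i);
            fixedPool.executeAndRelease("tx" + i);
        }
        System.out.println("leaky pool size: " + leaky.size());     // 1000
        System.out.println("fixed pool size: " + fixedPool.size()); // 0
    }
}
```

The one-line difference (get vs. remove) is exactly the kind of change that resolves this class of leak: executed transactions must be evicted from the long-lived dictionary, not merely marked as done.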

After resolving this issue, we repeated the above steps for verification, and the effect was obvious: after the transactions in the transaction pool had executed, memory usage dropped significantly. The final memory usage is as follows:

Then, by looking at the contents of the objects in memory, we can see that the total number of objects has also dropped significantly.

However, memory still grows slightly, and some large objects still reside in memory; we will analyze this problem further.
