Memory Hunt. Practical guidelines for avoiding memory leaks with Node.js
Back when computers were large and memory was scarce, developers fought for every bit of precious circuitry. The programming languages of that time required the authors of programs to know the low-level features of a particular architecture and, accordingly, to be extremely careful about allocating and releasing the already limited resources.
Since then, time has passed: memory is now measured in numbers with nine or more zeros, and tools have been abstracted away from architectural details and have acquired helper functionality, frameworks, and techniques. There is a lot of sense in this, because nowadays developers can focus on the domain problem their code solves instead of spending time on the intricacies of a particular hardware architecture. Unfortunately, it is impossible to abstract the hardware away completely, and one of the clearest examples of this is the presence of Out of Memory (OOM) errors in the vast majority of modern programming languages, and possibly in all of them.
Working with pointers is impossible without knowing how memory is addressed and allocated, which forced our team to master Zen. Then came Java, Python, and Node.js, which no longer forced us to scrutinize the code for a free call after each malloc, and which greatly reduced the number of SIGSEGVs caused by accessing nonexistent memory cells.
For the last five years, we have been working on the open-source project Appium as core maintainers. It is a server-side framework written in Node.js that implements a REST API compatible with Selenium WebDriver to automate functional testing. Appium is usually associated with mobile testing, but thanks to its driver concept it supports a wide variety of platforms, including Windows, Mac OS, Raspberry Pi, and more. It is in this project that we have faced (and continue to face) interesting problems and found the practical recipes that we would like to talk about. Also, while working on the project and fixing errors, we have worked out a few rules for ourselves, including rules for dealing with limited resources, that are useful not only in the context of Node.js development.
A little boring theory
The garbage collector implementation in Node.js is quite advanced. It detects circular references and deals with closures, but it is powerless against the classic developer mistake of not releasing resources. Without going too deep, it is enough to know that a newly created object is considered "alive" and cannot be handed over to garbage collection as long as at least one entity in an active scope references it. If such "living" objects accumulate until their total size reaches the heap limit, V8 will not be able to allocate memory for the next new entity, and it will "crash" the entire process (FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed).
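To make the reachability rule concrete, here is a minimal sketch. `SessionRegistry` and its methods are hypothetical names invented for this illustration; they are not part of Appium or Node.js. The point is only that an object stays "alive" while a long-lived structure references it, and becomes eligible for collection once the last reference is dropped.

```javascript
// Hypothetical long-lived registry: it lives as long as the process does,
// so everything it references is "alive" from the collector's point of view.
class SessionRegistry {
  constructor() {
    this.sessions = new Map();
  }

  start(id, payload) {
    // The Map now references `payload`, so it cannot be collected.
    this.sessions.set(id, payload);
  }

  end(id) {
    // Dropping the last reference makes the payload eligible for collection.
    // Forgetting this call is the classic "missing release" leak.
    return this.sessions.delete(id);
  }

  get size() {
    return this.sessions.size;
  }
}

const registry = new SessionRegistry();
for (let i = 0; i < 1000; i++) {
  registry.start(i, { log: 'x'.repeat(1024) });
}
const retainedBeforeCleanup = registry.size; // all payloads are still reachable

for (let i = 0; i < 1000; i++) {
  registry.end(i);
}
const retainedAfterCleanup = registry.size; // nothing references the payloads now
```

If the second loop were missing, every payload would stay in the heap for the lifetime of the process, no matter how advanced the collector is.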
Investigating the problem
We use Chrome Inspector / Remote Debugger to diagnose memory usage problems in Node.js in Appium. There is plenty of information on the Internet about how to use this tool. Debugging can be done in real time or by exploring heap snapshots. In our opinion, the latter method is more effective, especially if the problem occurs on the computer of a user to whom we have no physical access. In Node.js versions before 12, you need to add the heapdump module and send the SIGUSR2 signal to the Node.js process so that it writes a snapshot to the working directory of the active process (we are not sure this works on Windows, because there are no signals there). Starting with version 12, this functionality is included in the main Node.js codebase, so you no longer need the module. The heap snapshot can then be loaded into the same inspector for study. If the snapshot is taken at the right moment, the inspector shows which entities occupy the most heap space and the call stack at the time they were created, which in many cases is enough to locate the error in the code. After making the necessary corrections, repeat the procedure to make sure that the problem is gone and that there is no regression.
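As an illustration of taking a snapshot on demand, here is a small sketch that assumes Node.js 12 or newer, where `v8.writeHeapSnapshot()` is available in core. The helper names `writeSnapshot` and `installSnapshotSignal`, the file name pattern, and the choice of the temporary directory are assumptions made for this example, not something Appium prescribes.

```javascript
const v8 = require('v8');
const os = require('os');
const path = require('path');

// Writes a heap snapshot into `dir` and returns the path of the created file.
// v8.writeHeapSnapshot() is synchronous and blocks the event loop while it runs.
function writeSnapshot(dir = os.tmpdir()) {
  const file = path.join(dir, `heap-${process.pid}-${Date.now()}.heapsnapshot`);
  return v8.writeHeapSnapshot(file);
}

// Registers a handler that writes a snapshot whenever `signal` arrives.
// Returns a function that removes the handler again.
function installSnapshotSignal(signal = 'SIGUSR2', dir = os.tmpdir()) {
  const handler = () => {
    const file = writeSnapshot(dir);
    console.error(`Heap snapshot written to ${file}`);
  };
  process.on(signal, handler);
  return () => process.removeListener(signal, handler);
}
```

With the handler installed, `kill -USR2 <pid>` on a POSIX system should produce a `.heapsnapshot` file that can be loaded into the inspector. Node.js 12 and newer also accept the `--heapsnapshot-signal=SIGUSR2` command-line flag, which gives a similar result without any code changes.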
Unfortunately, these problems are almost impossible to solve without snapshots or at least a sufficiently detailed description of the steps to reproduce them. The latter option is rather unreliable, as sometimes the whole environment has to be recreated to solve the problem. Also, depending on the nature of the error, reproducing it may take a long time (sometimes hours, or even days).
Double blow. It turns out that logging may not be that safe
Logging is a common practice for all more or less serious applications. Logs can sometimes be the only thread that helps untangle the mysterious behavior of an application in a particular environment (read: on a user's computer).
But in our case, it was the logging itself that "consumed" the available memory and caused the OOM.
Appium uses a thin wrapper around the npmlog module for logging. An analysis of the heap snapshots from the report showed that a large part of the heap was occupied by strings or, to be more precise, by logged lines. In addition, it turned out that most of them occupied several megabytes each (!), although the log file itself was smaller than 1 MB.
Further analysis of the npmlog module revealed that it has an internal circular buffer that stores the N most recent log entries. The default N is 10,000. Assuming that an average line takes 1 MB (and some were much longer), this gives a total size of at least 10,000 MB, or 10 GB, which is far beyond the available memory. Therefore, we first reduced the maximum buffer size to 3000. This brought some improvement but still did not solve the problem: the buffer contained very long lines that were not in the log (there they were truncated to the first 300 characters), and at first we could not understand the reason for this behavior. A journey through the innards of the Node.js implementation, which lasted several nights, finally helped to find the answer: string interning and a corresponding bug report on the subject.
It turns out that calls to substring, trim, split, and similar methods that "cut" strings do not create a new copy of the data; the result keeps a pointer to a slice of the original string. By storing such a result in an array, we add one more reference to the original (long) string, which then prevents the garbage collector from removing it. Our fix for this problem is not elegant, but we haven't found a better one yet. The idea is simple: force the interpreter to store a copy of the shortened string in the log buffer instead of a reference to the original.
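The idea can be sketched as follows. This is not the actual Appium or npmlog code: `detachString` and `BoundedLogBuffer` are hypothetical names, and the byte round trip through `Buffer` is just one possible way to force a copy, chosen here because it does not depend on how V8 represents sliced strings internally. In npmlog itself the record limit is configured through `log.maxRecordSize`.

```javascript
// Returns a copy of `s` built from freshly decoded bytes, so the result
// does not point into the (possibly huge) string it was cut from.
// Caveat: a lone surrogate (half of a split emoji) is replaced with U+FFFD.
function detachString(s) {
  return Buffer.from(s, 'utf8').toString('utf8');
}

// Hypothetical bounded log buffer: caps both the number of records
// and the length of every stored line.
class BoundedLogBuffer {
  constructor(maxRecords = 3000, maxLineLength = 300) {
    this.maxRecords = maxRecords;
    this.maxLineLength = maxLineLength;
    this.records = [];
  }

  push(line) {
    const text = String(line);
    // slice() alone may keep the whole original string alive, so copy the result.
    this.records.push(detachString(text.slice(0, this.maxLineLength)));
    if (this.records.length > this.maxRecords) {
      this.records.shift(); // drop the oldest record
    }
  }
}

const buffer = new BoundedLogBuffer(3, 10);
buffer.push('a'.repeat(1024 * 1024)); // a 1 MB line is stored as 10 characters
buffer.push('second');
buffer.push('third');
buffer.push('fourth'); // evicts the oldest record
```

With both limits in place, the worst-case size of the buffer is roughly `maxRecords * maxLineLength` characters, regardless of how long the incoming lines are.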
- Don't rely on third-party modules without fully understanding their APIs and features beforehand.
- Don't be afraid to look inside these modules if you are unsure of their functionality.
- Learn to estimate by eye the approximate maximum size of dynamic data structures (lists, hashes, and trees) when writing code. Pay special attention to structures that live for the whole application lifecycle or much of it (ideally, there should be as few such structures as possible). If the approximate maximum size of a structure reaches at least a tenth of all available memory, consider limiting the number of elements in it and/or the size of those elements.
- Be aware of the implementation details of your programming language. It may require additional effort, but it will help you avoid problems that are difficult to diagnose in the future.
- In a large project, almost any line of code can turn out to be the real cause of a memory leak. Trying to find it by chance and fix the problem without the right materials for detailed research is a waste of time.
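The "tenth of available memory" rule of thumb above can be turned into a quick check. This is only a back-of-the-envelope sketch: the helper names `worstCaseBytes` and `exceedsBudget` are invented for this example, and using the V8 heap size limit as "all available memory" is an assumption that may not fit every deployment.

```javascript
const v8 = require('v8');

// Rough worst-case size of a collection, in bytes.
function worstCaseBytes(maxElements, avgElementBytes) {
  return maxElements * avgElementBytes;
}

// True when the structure could take more than `fraction` of the memory limit.
// `limitBytes` defaults to the V8 heap size limit of the current process.
function exceedsBudget(
  maxElements,
  avgElementBytes,
  fraction = 0.1,
  limitBytes = v8.getHeapStatistics().heap_size_limit
) {
  return worstCaseBytes(maxElements, avgElementBytes) > fraction * limitBytes;
}

const MB = 1024 * 1024;
const GB = 1024 * MB;

// The log buffer from the story: 10,000 records of about 1 MB each.
const logBufferBytes = worstCaseBytes(10000, MB); // roughly 10 GB
const tooBig = exceedsBudget(10000, MB, 0.1, 4 * GB); // far over a tenth of 4 GB
const fine = exceedsBudget(3000, 300, 0.1, 4 * GB); // 3000 lines of 300 bytes fit easily
```

A check like this takes a minute to write down during code review and would have flagged the default log buffer size long before the first OOM report.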
Made a Promise? Keep it. Made none? Stay strong
Another interesting case can be found here.
According to the snapshot, the heap was full of strings, but it was unclear how they got there and why they remained. The stack trace showed that the strings were held by Bluebird promises. But why these promises accumulated in memory and were not removed from it, no one could say. After several attempts, it became clear that the problem would be hard to solve without a radical decision. One of the methods whose signature was at the top of the stack contained such a sophisticated implementation of a queue that the comment describing it took more space than the code itself. On top of that, it was all "seasoned" with hacks that made it possible to cancel promises. Replacing the queue with async locks (async-lock) eliminated the memory leak and made the existing hacks unnecessary.
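To show what "replacing a hand-written queue with a lock" means in practice, here is a minimal sketch of the technique. `SimpleAsyncLock` is a hypothetical class written for this example; it is not the async-lock module and not the code that went into Appium, and it leaves out keys, timeouts, and other features a real library provides.

```javascript
// Hypothetical minimal async lock: tasks passed to acquire() run strictly
// one after another, in the order they were submitted.
class SimpleAsyncLock {
  constructor() {
    this.tail = Promise.resolve();
  }

  acquire(task) {
    const result = this.tail.then(() => task());
    // Swallow errors on the internal chain only, so one failed task
    // does not block the tasks queued behind it.
    this.tail = result.catch(() => {});
    return result;
  }
}

const lock = new SimpleAsyncLock();
const events = [];
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function command(name, ms) {
  events.push(`start ${name}`);
  await sleep(ms);
  events.push(`end ${name}`);
  return name;
}

// Both commands are submitted at once, but the second waits for the first.
const done = Promise.all([
  lock.acquire(() => command('a', 20)),
  lock.acquire(() => command('b', 5)),
]);
```

The lock keeps a reference only to the most recent promise in the chain, so settled promises and whatever they captured can be collected, and there is no need for cancellation hacks to keep the queue consistent.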
- If you cannot understand a certain part of the code, or its sacred secrets reveal themselves only after many hours of meditation, and on top of that the code misbehaves at run time, it may be a sign that it should be rewritten. Of course, this decision should be made by the team, but the presence of clever hacks and TODO or FIXME comments is a strong indicator that you are on the right track.
- Always try to implement the functionality with standard, documented tools first. Use hacks only when standard solutions do not work. Using hidden APIs is usually the easiest way to shoot yourself in the foot, especially if you are not sure what they do.
Where things are simple, one lives to be a hundred
The third and final case in this review.
The problem was once again in promises that (it was unclear how) remained in memory. We applied the strategy from the previous case and rewrote another tricky hack using more "established" methods. It did not help, and memory was still leaking. But the call stack in the snapshot was now much simpler, and finding the "culprit" was no longer a super-difficult problem.
- Sooner or later, any project grows to a size at which no one is able to fully understand its architecture. Simplifying and localizing problem areas are always great "friends" when solving problems.
- Think about memory management when you write code and when you review your peers' commits. Having a garbage collector, even an advanced one, does not protect the code from possible memory leaks.
- Add memory monitoring to your continuous integration process. There are tools that can automatically generate alerts if the memory usage of the application "deviates" too far from the median.
- Use static code analyzers.
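The memory monitoring rule above can be approximated even without a dedicated tool. The sketch below compares the heap usage of the current run with the median of earlier runs; the function names, the 25 % tolerance, and the sample numbers are all made up for illustration, and how the history is stored between CI runs is left open.

```javascript
// Median of a non-empty list of numbers.
function median(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

// True when `current` exceeds the median of `history` by more than `tolerance`
// (0.25 means 25 % above the median).
function isMemoryRegression(history, current, tolerance = 0.25) {
  if (history.length === 0) return false; // nothing to compare against yet
  return current > median(history) * (1 + tolerance);
}

// Heap usage of the current process in bytes, e.g. sampled at the end of a test run.
function sampleHeapUsed() {
  return process.memoryUsage().heapUsed;
}

const MB = 1024 * 1024;
const previousRuns = [118 * MB, 120 * MB, 125 * MB, 119 * MB]; // made-up sample data
const stable = isMemoryRegression(previousRuns, 130 * MB); // within tolerance
const leaking = isMemoryRegression(previousRuns, 200 * MB); // well above the median
```

A CI job could call `sampleHeapUsed()` after the test suite finishes and fail the build, or at least raise an alert, when `isMemoryRegression` returns true.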
The more you know, the more you doubt
In our opinion, memory problems are among the most difficult software problems to solve. They are invisible to compilers and interpreters and, under certain conditions, may not occur at all or may occur very seldom, at long intervals. At the system level, such a problem is treated as a critical error and leads to the "crash" of the whole process. This makes memory leaks both dangerous and challenging from a developer's perspective.