Friday, July 20, 2007

Distant Memory

At some point we had enough of it. The main process was constantly attacked by the OOM killer - the Linux feature that kills the process using the most memory when there is not enough free memory on the machine. Which was all perfectly logical, except for one point: at the moment of the OOM kill, our 4 GB machines usually had between 1 and 2.5 gigabytes of RAM free.

So how come 1610612736 bytes of memory are not enough?

The system we are talking about was a very large software router, keeping up with tens of thousands of active network connections and running on Fedora Core Linux, which never added stability. Further investigation revealed that during each crash the system was experiencing very high incoming load.

The interesting thing about the OOM killer is that not every allocation attempt will trigger it. If I have a system with very little free memory and I request 3 GB of RAM via malloc, the allocation will fail. Well, it should fail, and it would fail on any system but Linux, which, by default, will pretend that the allocation succeeded. On Linux my process will get into trouble only later, when it actually tries to access that memory.
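
To see this behaviour in isolation, here is a minimal user-space sketch (assuming the default overcommit settings; the exact outcome depends on the machine and kernel configuration):

    /* Ask for far more memory than is free and then actually touch it.
     * With default Linux overcommit the malloc() itself tends to succeed. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
        size_t size = 3UL * 1024 * 1024 * 1024;   /* 3 GB */
        char *p = malloc(size);

        if (p == NULL) {
            puts("malloc failed up front");        /* what one might expect */
            return 1;
        }
        puts("malloc 'succeeded'");                /* overcommit at work */

        /* Touching the pages forces the kernel to actually back them with
         * physical memory; on a loaded box this is where the trouble starts. */
        memset(p, 1, size);
        puts("survived touching the memory");
        free(p);
        return 0;
    }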

So we figured that it was not a user process, but rather the kernel itself. The kernel, due to its distinct duties, sometimes requires more control over the behaviour of allocations - something more elaborate than plain fail-now/fail-later semantics. And the kernel indeed has it.

The primary memory allocation routine in the kernel, get_free_pages, takes an additional flag, called priority. The priority can be "normal", meaning that if there is not enough memory the calling process will sleep, or "atomic", meaning that the memory is needed in an interrupt handler and therefore the caller cannot sleep. So if there is some free memory, an "atomic" request will get it first.
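
In current kernels these two priorities are spelled GFP_KERNEL and GFP_ATOMIC. A kernel-style sketch of the two cases (__get_free_pages() and the flags are the real API; the wrapper functions are illustrative only):

    #include <linux/gfp.h>
    #include <linux/mm.h>

    /* Process context: the allocator is allowed to sleep while memory
     * is being reclaimed. */
    static unsigned long alloc_may_sleep(void)
    {
        return __get_free_pages(GFP_KERNEL, 0);   /* order 0: a single page */
    }

    /* Interrupt handler: sleeping is forbidden, so the request is served
     * from memory that is free right now, or it fails (returns 0). */
    static unsigned long alloc_no_sleep(void)
    {
        return __get_free_pages(GFP_ATOMIC, 0);
    }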

What happens if there is not enough memory? A quick look into the kernel code answers that:

The memory allocation call propagates to the function __alloc_pages which, when needed, may call the function out_of_memory. That function, in turn, invokes the OOM killer.

So, "atomic" request could, in theory, cause a OOM to be invoked.

Now we had two possible solutions:
One was to disable the OOM killer. I argued that this was a bad idea, because if we disabled it and memory did run out, we would be left with a system that could not be accessed via the network and was not capable of doing its main task - processing packets.

The other solution required understanding why we were reaching this situation in the first place, and preventing it.

Further investigation showed that during the crash, outgoing traffic was experiencing a peak. Lots of data had been sent via TCP but was stuck in the OS buffers, waiting for the sliding window algorithm to allow further packet emission. This shed some light on the problem.

Enter the zoned allocator.

Memory in Linux is not very uniform. Physical memory is divided into three zones: the DMA zone - memory that is directly accessible for (legacy) DMA operations - the normal zone, and the high zone.

The difference between the normal and high zones is how the kernel maps them into its own address space. Normal-zone memory is permanently mapped at a fixed offset, so translating a kernel virtual address in the normal zone to a real (machine) address is as fast as applying a constant offset to it.
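
In kernel terms this is the direct mapping; a tiny illustration (__pa() is the real kernel macro, the wrapper is illustrative only):

    #include <linux/mm.h>   /* __pa(): virtual-to-physical for directly mapped memory */

    /* For a directly mapped (normal-zone) kernel address the translation is
     * just a constant offset - no page-table walk is required. */
    static unsigned long to_machine_address(void *kernel_vaddr)
    {
        return __pa(kernel_vaddr);
    }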

For a page in the high zone, however, the process is much more complicated. The kernel has no permanent mapping for it, so before the CPU can touch the page the kernel must set up a temporary mapping, which means modifying the page tables on the fly. Compared to the speed of the CPU, these operations take ages.
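
A kernel-style sketch of what touching a high-zone page involves (alloc_page(), kmap_atomic() and kunmap_atomic() are the real primitives, in their modern single-argument form; the surrounding function is illustrative only):

    #include <linux/gfp.h>
    #include <linux/highmem.h>
    #include <linux/mm.h>
    #include <linux/string.h>

    /* Allocate a page that may live in the high zone and zero it. */
    static void touch_a_highmem_page(void)
    {
        struct page *page = alloc_page(GFP_HIGHUSER);   /* may be high memory */
        void *vaddr;

        if (!page)
            return;

        vaddr = kmap_atomic(page);     /* temporary mapping: edits page tables */
        memset(vaddr, 0, PAGE_SIZE);   /* only now can the CPU address the page */
        kunmap_atomic(vaddr);          /* tear the mapping down again */

        __free_page(page);
    }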

This is the reason why non-sleeping "atomic" allocations happen from the normal zone. When all you have is a few nanoseconds to allocate a page, there is really no reason to go for the high zone. Save the high zone for less time-critical allocations.

It turned out that network allocations for outgoing packets happen from the "normal" zone. Therefore, it was possible to exhaust the normal zone by sending LOTS of outgoing traffic - enough to fill up the outgoing buffers with packets that could not leave yet.
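
For reference, a sketch of the point being made (alloc_skb() is the real kernel function; the wrapper and its comment are illustrative only):

    #include <linux/gfp.h>
    #include <linux/skbuff.h>

    /* Whatever flags are passed, the skb's data buffer comes from kmalloc(),
     * i.e. from the directly mapped normal zone - never from the high zone. */
    static struct sk_buff *grab_packet_buffer(unsigned int len, gfp_t flags)
    {
        return alloc_skb(len, flags);
    }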

The solution came from TCP itself. TCP incorporates a congestion control mechanism which, in short, avoids congestion by drastically reducing throughput whenever congestion is detected.

Since in our case detecting congestion was synonymous with crashing, we opted to estimate it instead, by keeping track of our estimated sending capacity. Once the "virtual congestion" condition was met, we slowed down our sending rate.
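
One way this can be done, sketched below with made-up names and constants (not the real code), is a TCP-like additive-increase/multiplicative-decrease limiter:

    #include <stddef.h>

    /* Hypothetical sender-side limiter implementing "virtual congestion". */
    struct send_limiter {
        size_t capacity_estimate;   /* bytes we believe may safely be in flight */
        size_t in_flight;           /* bytes currently sitting in OS buffers    */
    };

    /* Ask permission before handing another packet to the kernel. */
    static int may_send(const struct send_limiter *sl, size_t pkt_len)
    {
        return sl->in_flight + pkt_len <= sl->capacity_estimate;
    }

    static void on_packet_queued(struct send_limiter *sl, size_t pkt_len)
    {
        sl->in_flight += pkt_len;
        sl->capacity_estimate += 64;   /* additive increase while things look healthy */
    }

    static void on_packet_drained(struct send_limiter *sl, size_t pkt_len)
    {
        sl->in_flight -= pkt_len;
    }

    /* "Virtual congestion": the buffers fill up faster than they drain. */
    static void on_virtual_congestion(struct send_limiter *sl)
    {
        sl->capacity_estimate /= 2;    /* multiplicative decrease, like TCP */
    }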

Q.E.D.


