Tuesday, July 24, 2007

How to effectively abuse code reviews

"No delusion is greater than the notion that method of industry can make up for lack of mother-wit, either in science or practical life. (Thomas Huxley) "

In order to make sure that developers produce buggy code on a constant basis, follow these simple rules:

1. No commit is done without a code review.

2. The responsibility for working code falls on the person who performs the code review.

3. Only the differences between versions are reviewed.


These steps will bring quick success in the risky business of ruining programmers' productivity, and here's why:

The first rule will ensure that there will almost always be some poor soul looking for a potential reviewer. Chances are that he will pick someone who is not busy. All worthy developers are busy creating great code, so guess who will be the reviewer? Correct: the weakest developer will do most of the code reviews.

The second rule will ensure that the poor reviewer will enforce his design and implementation decisions on the original developer. As he is the weakest developer, this may, and eventually will, have a bad effect on code quality.


Just to make sure that the poor reviewer does not learn the arcane skill of chasing bugs before they reveal their presence - and thereby ruin your dark scheme of sabotaging code quality - the third rule comes and rips all context out of the review. There is no chance to perform a global refactoring, as the scope is limited. And if only four out of the five places that had to be changed actually were, well - that's great news.


But if your goal is not ruining the quality of the code, then maybe the following simple consideration will help.

It is all about getting closer to the ideal of keeping in the source control repository only the code that serves your goals best. So as long as both people - the developer and the reviewer - understand these goals, they can proceed to make sure the goals are met. If necessary, they can review only the differences. Sometimes a look from a higher level will bring more results. It's all about the attitude.

P.S. Just don't let your goals become your gaol.

Friday, July 20, 2007

Distant Memory

At some point we had enough of it. The main process was constantly attacked by the OOM killer - the Linux feature that kills the process using the most memory when there is not enough free memory on the machine. Which was all logical, except for one point. At the moment the OOM killer struck, our 4 GB machines usually had between 1 and 2.5 gigabytes of RAM free.

So how come 1610612736 bytes of memory were not enough?

The system we are talking about was a very large software router, keeping up with tens of thousands of active network connections. It ran on Fedora Core Linux, which never added stability. Further investigation revealed that during the crash the system was experiencing very high incoming load.

The interesting thing about the OOM killer is that not every allocation attempt will trigger it. If I have a system with very little free memory and I request 3 GB of RAM via malloc, the allocation will fail. Well, it should fail, and it would fail on any system but Linux, which by default will pretend that the allocation succeeded. On Linux my process will get a segmentation violation signal when it tries to access the memory.
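
To see that behaviour in isolation, here is a minimal user-space sketch - my own illustration, not code from the router. On a default Linux configuration the huge malloc below is likely to "succeed", and the trouble only starts when the pages are actually touched.

    /* Illustrative overcommit experiment: on a default Linux setup the
     * malloc below will usually "succeed" even if 3 GB are not really
     * available; the process only gets into trouble when it touches
     * the pages. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
        size_t size = 3UL * 1024 * 1024 * 1024;   /* 3 GB */
        char *p = malloc(size);

        if (p == NULL) {
            /* The behaviour one would expect on a strict-accounting system. */
            printf("malloc failed up front\n");
            return 1;
        }

        printf("malloc \"succeeded\", now touching the pages...\n");
        memset(p, 0, size);   /* this is where the crash happens */
        printf("survived\n");
        free(p);
        return 0;
    }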

So we figured that it was not any user process, but rather the kernel itself. The kernel, due to its distinct duties, sometimes requires more control over the behaviour of allocations - something more elaborate than plain fail-now/fail-later semantics. And the kernel indeed has it.

The primary memory allocation routine in the kernel, get_free_pages, takes an additional flag, called priority. The priority can be "normal", meaning that if there is not enough memory the process will sleep, or "atomic", meaning that the memory is needed in an interrupt handler and therefore the caller cannot sleep. So if there is some free memory, an "atomic" request will get it first.
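
In today's kernels these priorities are expressed as GFP flags passed to __get_free_pages(). A hedged sketch of what the two kinds of request look like - illustrative only, not code from our system:

    #include <linux/gfp.h>

    static void allocation_priorities_demo(void)
    {
        /* "normal" priority: the caller may sleep while the kernel
         * reclaims memory. */
        unsigned long normal_page = __get_free_pages(GFP_KERNEL, 0); /* order 0 = one page */

        /* "atomic" priority: used from interrupt context, must not sleep -
         * either a free page is available right now, or the call fails. */
        unsigned long atomic_page = __get_free_pages(GFP_ATOMIC, 0);

        if (normal_page)
            free_pages(normal_page, 0);
        if (atomic_page)
            free_pages(atomic_page, 0);
    }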

What happens if there is not enough memory? A quick look into the kernel code gives the answer:

The memory allocation call propagates to the function __alloc_pages which, in case of need, may call the function out_of_memory. That function, in turn, will invoke the OOM killer.

So, "atomic" request could, in theory, cause a OOM to be invoked.

Now we had two possible solutions:
One possible solution was to disable the OOM killer. I argued that this was a bad idea: if we disabled it, we would be left with a system that could not be accessed via the network and was not capable of doing its main task - processing packets.

The other solution required understanding why we reached this situation, and preventing it.

Further investigation showed that during the crash, outgoing packets were experiencing a peak. Lots of data was sent via TCP but was stuck in the OS buffers, waiting for the sliding window algorithm to allow further packet emission. This shed some light on the problem.

Enter the zoned allocator.

Memory in Linux is not very uniform: it is divided into three zones - the DMA zone (memory that is directly accessible for DMA operations), the normal zone, and the high zone.

The difference between the normal and high zones is in the structure of the virtual memory mapping. Normal zone addresses have their four upper bits set to zero, and translating a virtual address that belongs to the normal zone into a real (machine) address is as fast as adding a constant base address to it.

For a high zone virtual address, however, the process is much more complicated. It requires going through the MMU to resolve the virtual address to the real address, and in fact it requires temporary modifications to the MMU tables. Compared to the speed of the CPU, these operations take ages.

This is the reason why non-sleeping "atomic" allocations always happen from the normal zone. When all you have is a few nanoseconds to allocate a page, there is really no need to go for the high zone. Save the high zone for less real-time applications.
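
A hedged sketch of what the difference looks like from kernel code, assuming the standard alloc_page/kmap interfaces (again, not code from our system): a page from the normal zone already has a usable kernel address, while a highmem page has to be temporarily mapped first.

    #include <linux/gfp.h>
    #include <linux/mm.h>
    #include <linux/highmem.h>

    static void zones_demo(void)
    {
        struct page *low  = alloc_page(GFP_KERNEL);    /* comes from the normal (or DMA) zone */
        struct page *high = alloc_page(GFP_HIGHUSER);  /* may come from the high zone */
        void *p;

        if (low) {
            /* Normal zone: direct mapping, virtual = physical + constant. */
            p = page_address(low);
            /* ... use p ... */
            __free_page(low);
        }

        if (high) {
            /* High zone: must be mapped into the kernel address space first -
             * the "temporary modification of the MMU tables" mentioned above. */
            p = kmap(high);
            /* ... use p ... */
            kunmap(high);
            __free_page(high);
        }
    }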

It turned out that network allocations for sending packets happen from the "normal" zone. Therefore it was possible to exhaust the normal zone by sending LOTS of outgoing traffic - enough to fill the outgoing buffers.

The solution came from TCP itself. TCP incorporates a congestion control mechanism which, in short, avoids congestion by reducing throughput drastically when congestion is detected.

Since in our case discovering congestion was synonymous with crashing, we opted to estimate it instead, by keeping track of our estimated capacity. Once the "virtual congestion" condition was met, we slowed down our sending rate.
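
A simplified user-space sketch of that idea - the names and constants are mine, not the original implementation: track how many bytes sit in the OS buffers against an estimated capacity, back off multiplicatively when the estimate is exceeded, and grow it slowly while things go well, in the spirit of TCP congestion control driven by a "virtual" congestion signal.

    #include <stddef.h>

    struct virtual_congestion {
        size_t capacity_estimate;   /* bytes we believe may safely be in flight */
        size_t in_flight;           /* bytes currently sitting in the OS buffers */
    };

    /* Returns nonzero if `len` more bytes may be sent right now. */
    static int may_send(struct virtual_congestion *vc, size_t len)
    {
        if (vc->in_flight + len > vc->capacity_estimate) {
            /* "virtual congestion" hit: back off multiplicatively, like TCP */
            vc->capacity_estimate /= 2;
            if (vc->capacity_estimate < 64 * 1024)
                vc->capacity_estimate = 64 * 1024;   /* keep a sane floor */
            return 0;
        }
        vc->in_flight += len;
        return 1;
    }

    /* Called once previously queued bytes have actually left the OS buffers. */
    static void on_drained(struct virtual_congestion *vc, size_t len)
    {
        vc->in_flight -= len;
        vc->capacity_estimate += 1024;   /* additive increase while all is well */
    }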

Q.E.D.



Wednesday, July 18, 2007

Moore's law is still with us.

True, single-core systems are being phased out, but the performance of a single computing system will still double within eighteen months from today.

Note that it's a computer system, not a chip.

And rest assured that the exact meanings of "double" and "eighteen months" will change.

This implies that any given software system designed to last for eighteen months or more can count on running on hardware that is twice as fast.

At S., the company I worked for from its beginning, we had to accommodate the immense load of a modern network environment using off-the-shelf PC computers running Fedora Core Linux. A year and a half later, we had upgraded to the latest 64-bit multi-core machines.

In this blog I will reflect on some of the architectural insights we learned from the experience of building these super-fast networking appliances.