Handling large connection/second rates on Amazon EC2

Some people say Amazon EC2, aka “the Cloud”, is the answer to every technical problem one can have.
Just add more servers and things will go faster. Unfortunately, this is not always true on Amazon EC2.

Part of our EC2 infrastructure handles about 2500 requests/s on average. These requests are very small and put little stress on the web server, yet we used to run a lot of instances for that traffic, most of them idling at around 5% CPU usage.
After migrating from us-east to eu-west and reducing the number of servers, everything died. The ELB was reporting:

"HTTP/1.1 503 Service Unavailable: Back-end server is at capacity"

When we checked, the load was about 5-10% and everything looked good.

The Problem:
netstat showed a lot of connections in TIME_WAIT:

# netstat -tan | grep ':80 ' | awk '{print $6}' | sort | uniq -c
 23816 TIME_WAIT
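
Before digging into the cause, it can help to see where those sockets point. The following is a minimal sketch, not something from our original setup: the helper name tw_by_remote is made up for illustration, and it assumes the usual `netstat -tan` column layout (foreign address in column 5, state in column 6).

```shell
# tw_by_remote: read `netstat -tan` output on stdin and print the number of
# TIME_WAIT sockets per remote address:port, busiest endpoint first.
tw_by_remote() {
  awk '$6 == "TIME_WAIT" { n[$5]++ } END { for (r in n) print n[r], r }' | sort -rn
}

# Usage on a live machine:
#   netstat -tan | tw_by_remote | head
```

If one or two remote endpoints account for nearly all of the TIME_WAIT sockets, the machine is most likely running out of local ports towards those endpoints.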

Why does the count stall at a few tens of thousands of connections when there are over 60k ports?
It has to do with the way Linux chooses the source port for an outgoing connection:

$ cat /proc/sys/net/ipv4/ip_local_port_range
 32768 61000


So with the default Linux kernel configuration there are only about 28k port numbers that can be used for outgoing connections, not more.
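
The arithmetic behind that number can be sketched in a few lines of shell. The function names port_count and max_conn_rate are hypothetical, and the 60-second figure assumes the usual Linux TIME_WAIT duration. The rate is a rough upper bound for new connections to a single destination address and port, not a measured value.

```shell
# port_count LOW HIGH: number of ports in the inclusive range LOW..HIGH.
port_count() {
  echo $(( $2 - $1 + 1 ))
}

# max_conn_rate LOW HIGH SECONDS: new connections per second that can be
# opened to one destination if every port sits in TIME_WAIT for SECONDS.
max_conn_rate() {
  echo $(( ($2 - $1 + 1) / $3 ))
}

port_count 32768 61000         # prints 28233
max_conn_rate 32768 61000 60   # prints 470
```

On a live machine the current range can be passed in directly with `port_count $(cat /proc/sys/net/ipv4/ip_local_port_range)`. Roughly 470 new connections per second to one backend is far below 2500 requests/s if most requests open a fresh outgoing connection.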


There are a few possible solutions to the problem. The fastest and safest is to set:

net.ipv4.tcp_tw_reuse = 1

in sysctl
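
One way to apply and persist the setting is sketched below. The helper set_sysctl_conf is hypothetical (it is not part of any sysctl tooling), and /etc/sysctl.conf is an assumed location that may differ per distribution. Taking the file as a parameter makes it possible to try the function on a scratch copy first.

```shell
# set_sysctl_conf FILE KEY VALUE
# Write "KEY = VALUE" into FILE, replacing an existing assignment of KEY.
set_sysctl_conf() {
  file=$1 key=$2 value=$3
  tmp=$(mktemp) || return 1
  if [ -f "$file" ]; then
    # Keep every line whose left-hand side is not exactly KEY.
    awk -v k="$key" '{ split($0, p, "="); gsub(/[ \t]/, "", p[1]); if (p[1] != k) print }' "$file" > "$tmp"
  fi
  printf '%s = %s\n' "$key" "$value" >> "$tmp"
  # Overwrite in place so the ownership and mode of FILE are kept.
  cat "$tmp" > "$file" && rm -f "$tmp"
}

# As root (assumed path, adjust for your distribution):
#   set_sysctl_conf /etc/sysctl.conf net.ipv4.tcp_tw_reuse 1
#   sysctl -p                        # reload /etc/sysctl.conf
#   sysctl net.ipv4.tcp_tw_reuse     # verify the running value
# For the running kernel only, without persisting:
#   sysctl -w net.ipv4.tcp_tw_reuse=1
```

Running the function twice leaves a single line for the key, so it is safe to call from a provisioning script.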

From the kernel documentation:

tcp_tw_reuse - BOOLEAN
Allow to reuse TIME-WAIT sockets for new connections when it is
safe from protocol viewpoint. Default value is 0.
It should not be changed without advice/request of technical
experts.

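This allows the kernel to reuse sockets that are in TIME_WAIT state for new outgoing connections when this is safe from the protocol point of view. It relies on TCP timestamps (net.ipv4.tcp_timestamps), which are enabled by default.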

There is also another tuning parameter:

tcp_tw_recycle - BOOLEAN
Enable fast recycling TIME-WAIT sockets. Default value is 0.
It should not be changed without advice/request of technical
experts.

This option forcefully recycles TIME_WAIT sockets. The side effect is that it breaks connections from clients behind NAT, and the option was removed entirely in Linux 4.12, so tcp_tw_reuse is the one to prefer.


