Ticket #805 (closed defect: fixed)

Opened 10 years ago

Last modified 9 years ago

advertise server flakiness: timeouts, failures, or outages

Reported by: jsamuel
Owned by: armon
Priority: major
Milestone:
Component: repy
Version: 0.1o
Severity: Medium
Keywords:
Cc: justinc, armon, monzum
Blocking:
Blocked By:

Description

The advertise server has been flaky since mid-December. This corresponds to when we last rebuilt it from trunk. The most obvious change in that rebuild was r3175, Armon's change to sockettimeout.repy to have it use willblock instead of timers (this change was made 11/11, but the advertise server wasn't rebuilt until 12/16).
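
For context, the two timeout strategies differ roughly as follows. This is a simplified plain-Python sketch, not the actual sockettimeout.repy code; repy's willblock() is approximated here with a non-blocking recv, and all names are illustrative.

# Simplified sketch of the two timeout strategies (NOT the real
# sockettimeout.repy code; names and mechanics are illustrative only).
import socket
import threading
import time

def recv_with_timer(sock, size, timeout):
    # Timer-based approach (pre-r3175): arm a timer that tears the socket
    # down if recv() hasn't returned in time; the blocked recv() then fails.
    killer = threading.Timer(timeout, sock.close)
    killer.start()
    try:
        return sock.recv(size)
    finally:
        killer.cancel()

def recv_with_polling(sock, size, timeout):
    # willblock-style approach (r3175): poll a non-blocking socket until
    # data is available or the deadline passes, instead of using a timer.
    sock.setblocking(False)
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            return sock.recv(size)
        except BlockingIOError:
            time.sleep(0.01)    # nothing readable yet; back off briefly
    raise socket.timeout("recv() timed out!")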

The flakiness can be seen in the graphs of advertisements for both the production and beta testbeds. There does appear to be some correlation in the timing of the failures, but it's not complete (of course, the advertise lookups from these two systems don't happen at exactly the same time when the graph data is gathered).

 https://seattlegeni.cs.washington.edu/stats/seattlegeni/

 https://betabox.cs.washington.edu/stats/seattlegeni/

Monzur reported on IRC a few days after the rebuild that some central advertise tests that had previously been very reliable were occasionally failing:

 https://seattle.cs.washington.edu/irclogs/2009-12/%23seattle.2009-12-22.log

I don't see anything in the advertise server's stderr log. The stdout log shows expected entries such as:

recv() timed out!
Socket closed
recv() timed out!
recv() timed out!
Socket closed
recv() timed out!
recv() timed out!

I'm inclined to think that the problem is timeouts happening earlier than intended.

Unless people have other suggestions, I will probably revert the version of sockettimeout.repy used in the advertise server and see if the problem goes away.

Monzur: what is being reported when the tests fail?

Change History

Changed 10 years ago by jsamuel

I've now changed the advertiseserver back to using the previous version of sockettimeout.repy. I didn't update the rest from trunk, so the only change is the version of sockettimeout.

If this solves the problem, I'll probably rename this ticket and assign it to Armon to decide whether he wants to revert the version in svn or debug the issue further. It's quite possible that this isn't a bug in Armon's change to sockettimeout at all, but rather a symptom of a deeper bug.

Changed 10 years ago by justinc

  • owner changed from jsamuel to justinc
  • status changed from new to assigned

Changed 10 years ago by monzum

I have set up two versions of the advertise server on blackbox.cs: one with the latest version running on port 20001, and one from r3308 running on port 20002. So far I've not been able to replicate the timeout problems seen on the actual central server running on satya.cs.washington.edu. I've been running 100 announces followed by lookups, and on both blackbox advertise servers the test runs smoothly and very fast. When I test the advertise server on satya, the test takes two to three times longer to finish, and occasionally some of the announces or lookups time out.

I've tried running the tests locally on blackbox.cs and also externally from betabox.cs and I get the same results from both machines.

Changed 10 years ago by monzum

Ran a test today with different revisions from trunk, using 100 threads announcing/looking up; each thread announces 100 keys and then looks up all the keys that were successfully announced. Here are the results:

Revision 3368(current version):
1st run:
announce succeeded: 9061 announce failed: 939
lookup succeeded: 7976 lookup failed: 1085

2nd run:
announce succeeded: 9123 announce failed: 877
lookup succeeded: 7947 lookup failed: 1176

Revision 3170(right before sockettimeout was changed with willblock):
announce succeeded: 8734 announce failed: 1266
lookup succeeded: 7621 lookup failed: 1113

Revision 3100:
announce succeeded: 7933 announce failed: 2067
lookup succeeded: 6035 lookup failed: 1898

So far it seems that the current version is doing slightly better than the older versions. I'm going to play around with the timeouts a bit to see if I can improve anything.
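
For reference, here is a rough sketch of what this kind of announce/lookup stress test looks like. It is not the actual test script; the announce() and lookup() placeholders stand in for the real advertise client calls and just use an in-memory dict.

# Sketch of the stress test described above: N threads, each announcing
# KEYS_PER_THREAD keys and then looking up only the keys that were
# successfully announced.
import threading

NUM_THREADS = 100
KEYS_PER_THREAD = 100

results = {"announce ok": 0, "announce fail": 0,
           "lookup ok": 0, "lookup fail": 0}
results_lock = threading.Lock()
_store = {}

def announce(key, value):   # placeholder for the real advertise client call
    _store[key] = value

def lookup(key):            # placeholder for the real advertise client call
    return _store[key]

def worker(thread_id):
    announced = []
    for i in range(KEYS_PER_THREAD):
        key = "stress-%d-%d" % (thread_id, i)
        try:
            announce(key, "value")
            announced.append(key)
            outcome = "announce ok"
        except Exception:
            outcome = "announce fail"
        with results_lock:
            results[outcome] += 1
    # Only look up the keys that were successfully announced.
    for key in announced:
        try:
            lookup(key)
            outcome = "lookup ok"
        except Exception:
            outcome = "lookup fail"
        with results_lock:
            results[outcome] += 1

threads = [threading.Thread(target=worker, args=(t,)) for t in range(NUM_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)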

Changed 10 years ago by monzum

After logging some information from the advertise server, it seems that all the timeout failures are due to the server not seeing the requests from the client. Most likely the client requests are timing out, or the requests are not getting through to be handled by the server.

Next step is to check the network to see if there are any network problems.

Changed 9 years ago by monzum

After running more tests with the server-side timeout set to 10 seconds and the client-side timeout set to 30 seconds, there were roughly 50-60 timeouts in total across 5000 announces and lookups of the successful announces. All the timeouts were happening on the client side. The advertise server (from r3368) seems to be running more or less normally. At least I think this is the right behavior.

I checked the network connections using netstat on both seattle.cs.washington.edu and blackbox.cs.washington.edu (where the testing advertise server is) to see the number of connections being made.

seattle.cs:
port: 10101
number of connections: 900-1000 roughly

blackbox.cs:
port: 20001
number of connections: 1600-1700
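
For what it's worth, a quick sketch of how those connection counts can be gathered (not the exact check used; assumes a Linux host where netstat is available):

# Rough sketch: count established TCP connections on a given local port
# by parsing "netstat -tn" output.
import subprocess

def count_connections(port):
    output = subprocess.check_output(["netstat", "-tn"]).decode()
    needle = ":%d " % port          # local address column ends in ":<port> "
    return sum(1 for line in output.splitlines()
               if needle in line and "ESTABLISHED" in line)

print("port 20001: %d connections" % count_connections(20001))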

Changed 9 years ago by monzum

I increased the timeout on the client side to check whether it reduces the number of announce and lookup failures. After increasing the timeout to roughly 4-5 minutes, there were 0 announce and lookup failures with 100 threads and 100 announces and lookups per thread.

Next I used tcpdump to check how many SYN and FIN packets are sent by the client and received by the server. I used the command:

tcpdump 'dst port 20001 and tcp[tcpflags] & (tcp-syn|tcp-fin) != 0'

All the timeouts occur on the client side.
Using 100 threads and 50 announces/lookups per thread, here are the results.

Timeout=30 sec:
number of announce succeeded: 4889 (111 fails)
number of lookup succeeded: 4840 (49 fails)

client-side: testbed-opensuse.cs.washington.edu
SYN packets sent: 13217 packets
FIN packets sent: 9729 packets

server-side: blackbox.cs.washington.edu
SYN packets received: 9765 packets
FIN packets received: 9729 packets

Timeout=300 sec:
number of announce succeeded: 5000 (0 fails)
number of lookup succeeded: 5000 (0 fails)

client-side: testbed-opensuse.cs.washington.edu
SYN packets sent: 11776 packets
FIN packets sent: 10000 packets

server-side: blackbox.cs.washington.edu
SYN packets received: 10011 packets
FIN packets received: 10000 packets
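
For reference, per-side counts like these can be tallied from a saved tcpdump text capture with something like the sketch below (the flag text differs between tcpdump versions, so the matching may need adjusting):

# Rough sketch: tally SYN and FIN packets from a tcpdump text capture,
# e.g. one saved with:
#   tcpdump 'dst port 20001 and tcp[tcpflags] & (tcp-syn|tcp-fin) != 0' > capture.txt
# Older tcpdump prints a bare "S"/"F" flag after the destination address;
# newer versions print "Flags [S]" / "Flags [F.]".
syn = fin = 0
with open("capture.txt") as capture:
    for line in capture:
        if ": S " in line or "Flags [S" in line:
            syn += 1
        elif ": F " in line or "Flags [F" in line:
            fin += 1
print("SYN packets: %d  FIN packets: %d" % (syn, fin))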

Changed 9 years ago by justinc

When I run tcpdump on both the client and server, I see that all packets that the client sends are seen by the server (even when connections time out).

Changed 9 years ago by justinc

Failed connections seem to send a few SYN packets and then time out (note the 3-second and then 6-second gaps between the retransmitted SYNs below, which is the normal TCP SYN retransmission backoff)...

justinc@blackbox:~$ grep testbed-opensuse.cs.washington.edu.18307 new.advertiseserver
09:44:44.123216 IP testbed-opensuse.cs.washington.edu.18307 > testbed-ubuntu.cs.washington.edu.40001: S 3996116232:3996116232(0) win 5840 <mss 1460,sackOK,timestamp 2420608031 0,nop,wscale 5>
09:44:47.123070 IP testbed-opensuse.cs.washington.edu.18307 > testbed-ubuntu.cs.washington.edu.40001: S 3996116232:3996116232(0) win 5840 <mss 1460,sackOK,timestamp 2420608781 0,nop,wscale 5>
09:44:53.123562 IP testbed-opensuse.cs.washington.edu.18307 > testbed-ubuntu.cs.washington.edu.40001: S 3996116232:3996116232(0) win 5840 <mss 1460,sackOK,timestamp 2420610281 0,nop,wscale 5>
justinc@blackbox:~$

Changed 9 years ago by justinc

  • owner changed from justinc to armon
  • priority changed from critical to major
  • component changed from advertise server to repy

The problem gets substantially worse when I delay the acceptance of new connections. Adding a 0.1-second sleep increases the failure rate from about 1% to about 15% (assuming 50 threads with 10 lookups / puts each).

It seems plausible that the thread creation time is limiting the throughput. This problem could have come up more recently because Armon's non-blocking changes alter the time it takes to do an accept on a socket.

I imagine that using a thread pool will fix this, which means the problem should solve itself in Repy V2.

Let's close this ticket once we have verified that the throughput for Repy V2 is adequate for the sort of load that is seen by the advertiseserver.

Changed 9 years ago by justinc

I've written a thread pool version of emulcomm. It's a hack job and isn't suitable for general use because it circumvents some types of accounting.

It has the main thread accept a connection and then put it into a list for further processing, roughly as sketched below. This reduces connection dropping by more than half. I've also tried doing the accept in the thread pool (only indicating the need to accept), and the results are about the same. I'm using the first technique because it's simpler to reason about.
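
A plain-Python sketch of that accept-then-queue pattern (not the actual emulcomm change; the handler and pool size are placeholders):

# Sketch of the accept-then-queue pattern: the main thread only accepts
# connections and hands them to a fixed pool of worker threads through a
# queue, so slow request handling never blocks accept().
import socket
import threading
import queue

SIZE_OF_THREAD_POOL = 16          # illustrative; the deployed server used 150
pending = queue.Queue()

def handle_connection(conn, addr):
    # placeholder for the real advertise request handling
    conn.close()

def worker():
    while True:
        conn, addr = pending.get()
        try:
            handle_connection(conn, addr)
        finally:
            pending.task_done()

for _ in range(SIZE_OF_THREAD_POOL):
    threading.Thread(target=worker, daemon=True).start()

listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("", 20001))        # port of the blackbox test server
listener.listen(128)
while True:
    pending.put(listener.accept())    # accept quickly, process in the pool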

I've deployed the code for the time being.

jsamuel noticed the following behavior:

Exception in thread SocketSelector:
Traceback (most recent call last):
  File "/usr/lib/python2.5/threading.py", line 486, in __bootstrap_inner
  File "/home/geni/advertiseserver_deployed/emulcomm.py", line 688, in run
  File "/home/geni/advertiseserver_deployed/nonportable.py", line 196, in getruntime
  File "/home/geni/advertiseserver_deployed/linux_api.py", line 239, in get_system_uptime
IOError: [Errno 24] Too many open files: '/proc/uptime'

I believe this is a sign that the thread pool is set too high (currently SIZE_OF_THREAD_POOL = 150). We can lower this value and re-run if this reoccurs.

Changed 9 years ago by armon

A minor point is that the "Too Many Open Files" doesn't necessarily mean actual files. Linux has a maximum number of kernel handles per process, and when those are exhausted we will start to see this error. Most likely there are too many sockets open. We should be able to inspect this with "lsof" to see which handles the program is actually using.
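
For example (a sketch, assuming a Linux /proc layout; lsof or a quick /proc listing both work), counting the process's open descriptors directly shows whether sockets are piling up:

# Count the open file descriptors (sockets, files, pipes, ...) of a process
# on Linux; roughly equivalent to "ls /proc/<pid>/fd | wc -l" or "lsof -p <pid>".
import os
import sys

pid = sys.argv[1] if len(sys.argv) > 1 else "self"
fds = os.listdir("/proc/%s/fd" % pid)
print("%s: %d open descriptors" % (pid, len(fds)))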

Changed 9 years ago by justinc

Logging was turned on, and this caused the process to go over its disk quota around 9:30 PM on Feb 11th. I turned it off and restarted the advertise server.

Changed 9 years ago by justinc

  • status changed from assigned to closed
  • resolution set to fixed

I fixed this a long while back. I think this was actually the bug Conrad created where nodes were spamming the advertise server with data.
