Ticket #1052 (closed defect: fixed)

Opened 7 years ago

Last modified 7 years ago

Nodemanager unexpectedly dying.

Reported by: monzum Owned by: monzum
Priority: critical Milestone:
Component: node manager Version: 0.1t
Severity: Medium Keywords:
Cc: gpress, sebass63 Blocking:
Blocked By:

Description

We noticed recently that the Seattle node count has been dropping rapidly over the last week or two. To find out what is happening, I did a node count to see how many PlanetLab nodes had Seattle installed (that is, had Seattle in their home directory). A total of 441 nodes had Seattle present in their directory; however, far fewer nodes were seen advertising.

Then I checked how many of these nodes were running both the Nodemanager and the Software Updater. I found that only 228 of these nodes were running both, 31 nodes were running neither, 7 nodes were running only the Nodemanager, and 185 nodes were running only the Software Updater.

After looking at the Nodemanager logs on the nodes that were only running the Software Updater, there didn't seem to be any indication of why the Nodemanager had died. The logs seemed normal for the handful of nodes I checked. Below is a tail of one of the Nodemanager logs.

"""
1330851961.93:PID-27110:[INFO]: node manager is alive...
1330852622.76:PID-27110:[INFO]: node manager is alive...
1330853075.42:PID-27110:Traceback (most recent call last):

File "/home/uw_seattle/seattle/seattle_repy/nmrequesthandler.py", line 75, in handle_request
File "nodemanager.repyhelpercache/session_repy.py", line 83, in session_recvmessage

IndexError: string index out of range

1330853283.64:PID-27110:[INFO]: node manager is alive...
1330853531.94:PID-27110:Traceback (most recent call last):

File "/home/uw_seattle/seattle/seattle_repy/nmrequesthandler.py", line 75, in handle_request
File "nodemanager.repyhelpercache/session_repy.py", line 83, in session_recvmessage

IndexError: string index out of range

1330853944.47:PID-27110:[INFO]: node manager is alive...
1330854605.3:PID-27110:[INFO]: node manager is alive...
1330855266.12:PID-27110:[INFO]: node manager is alive...
"""

It seems that the Nodemanager died without much indication as to why. The IndexError is related to ticket #1051.

Change History

Changed 7 years ago by justinc

  • cc gpress, sebass63 added

I think I may know what is causing this. Or at least I know one issue that crashes the node state transition scripts and is the likely cause.

Look at  https://seattle.cs.washington.edu/browser/seattle/trunk/seattlelib/udpcentralizedadvertise.repy at lines 124 and 134. They perform a stopcomm. Somehow, sometimes the result is a KeyError that is raised in a separate thread and crashes the process. This occurs on the del line in emulcomm's cleanup, despite what I believe is correct race-condition detection.

I'm not sure what is happening. I was hoping someone could help by producing a test case, as small as possible, that replicates the issue.

Traceback follows:

---
Uncaught exception! Following is a full traceback, and a user traceback.
The user traceback excludes non-user modules. The most recent call is displayed last.

Full debugging traceback:
  "/tmp/deploy/seattle/emulcomm.py", line 685, in run
  "/tmp/deploy/seattle/udpcentralizedadvertise_repy.py", line 150, in _listenformessage
  "/tmp/deploy/seattle/emulcomm.py", line 965, in stopcomm
  "/tmp/deploy/seattle/emulcomm.py", line 1028, in cleanup

User traceback:

Exception (with type 'exceptions.KeyError'): '_COMMH:_EVENT:Thread:145:406'
---
Terminated

Changed 7 years ago by sebass63

Monzur and I have confirmed that this KeyError could be causing the Nodemanager crashes. Because the KeyError is raised in a separate thread, it escapes repy try-except blocks entirely and crashes the invoking program. This poses a real threat to the Nodemanager, and although the bug is relatively unusual, it tends to show up once every 50-100 advertisements.
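The "escapes try-except" behavior above is standard Python threading semantics: an exception raised inside a worker thread never propagates into the thread that started it, so the caller's try-except can't see it. A minimal sketch (hypothetical names; plain Python rather than repy, which additionally terminates the whole process on an uncaught thread exception):

```python
import threading

def advertise_worker():
    # Hypothetical stand-in for the stopcomm/cleanup path: the handle
    # entry is already gone, so this del raises KeyError *inside the
    # worker thread*. Python prints a traceback to stderr for it, but
    # the exception does not propagate to the starting thread.
    handles = {}
    del handles["_COMMH:_EVENT:Thread:145:406"]

result = {"caught": False}

def run():
    try:
        t = threading.Thread(target=advertise_worker)
        t.start()
        t.join()
    except KeyError:
        # Never reached: the KeyError lives and dies in the worker
        # thread, so this except clause in the caller cannot fire.
        result["caught"] = True

run()
print(result["caught"])  # False
```

Under repy's runtime the equivalent uncaught worker-thread exception kills the process outright, which would match the Nodemanager silently disappearing.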

Changed 7 years ago by justinc

But why is this happening? Can you replicate this with a standalone program? Can you make an example that is as small as possible?

Changed 7 years ago by monzum

Sebastian wrote up a small program that replicates this problem using just udpcentralizedadvertise. We concluded that a certain race condition was occurring in emulcomm.py at line 1028. However, looking at the function where it fails, we could not determine why it happens, as the cleanup() function (where the error occurs) seems to be thread-safe.
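One way a cleanup function that "seems" thread-safe can still raise KeyError is a check-then-delete window: the membership test and the del are two separate operations, and another thread can remove the key in between. A hedged sketch of the pattern (hypothetical names, not the actual emulcomm code):

```python
import threading

handles = {"commhandle": object()}
lock = threading.Lock()

def cleanup_unsafe(key):
    # Looks safe in isolation, but the check and the del are not
    # atomic: another thread can del the same key between the two
    # lines, and this del then raises KeyError.
    if key in handles:
        del handles[key]

def cleanup_safe(key):
    # Holding one lock across the check-and-delete (or simply using
    # dict.pop with a default) closes the race window.
    with lock:
        handles.pop(key, None)

cleanup_safe("commhandle")
cleanup_safe("commhandle")  # second call is a harmless no-op
print("commhandle" in handles)  # False
```

If emulcomm's race detection guards the check but not the del with the same lock, two concurrent stopcomm calls could interleave exactly this way.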

Changed 7 years ago by justinc

Please try to make the smallest test case that replicates the problem. Ideally, something that fits on a screen.

Changed 7 years ago by justinc

  • owner changed from justinc to monzum
  • status changed from new to assigned

Changed 7 years ago by monzum

  • status changed from assigned to closed
  • resolution set to fixed

This problem has gone away since we removed the UDP advertise service and reverted session.repy to its previous version.

Note: See TracTickets for help on using tickets.