[Clusterusers] hex issues

Lee Spector lspector at hampshire.edu
Tue Dec 14 20:41:45 EST 2004


Fellow cluster users,

I've recently resumed doing some runs on hex and have been having some 
puzzling problems.

The most puzzling is that my script to start Lisp runs on all of the 
nodes usually gets interrupted at some seemingly random point, so that 
it starts my processes only on nodes n01-n??, where n?? is occasionally 
23 but sometimes 19, 15, or something else. One possibility is that 
the processes really are being started but abort before they can do 
anything. I thought this might happen if the nodes were heavily loaded, 
but when I poked around with w and top I didn't see much happening on 
the failing nodes. Besides, it seems unlikely that somebody is running 
short, intense processes on the highest-numbered nodes, always in 
sequence from n23 down.
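
One way to narrow this down might be to record what each launch attempt 
returns, so that the point where things stop is logged rather than 
inferred. The following is only a minimal sketch, not the actual launch 
script: it assumes the nodes are named n01 through n23 and are reached 
with rsh, and the remote command ./start-run.sh, the helper names, and 
the pause parameter are all hypothetical.

```python
import subprocess
import time


def node_names(first=1, last=23):
    """Return node names such as n01..n23 (numbering assumed from this message)."""
    return [f"n{i:02d}" for i in range(first, last + 1)]


def rsh_command(node):
    # Hypothetical remote command; substitute the real launch line.
    return ["rsh", node, "./start-run.sh"]


def launch_all(nodes, build_command=rsh_command, pause=0.0):
    """Run one launch command per node and record what each attempt returned.

    Returns a list of (node, exit_status, stderr_text) tuples. exit_status is
    None when the local command could not be started at all.
    """
    results = []
    for node in nodes:
        try:
            proc = subprocess.run(
                build_command(node), capture_output=True, text=True
            )
            results.append((node, proc.returncode, proc.stderr.strip()))
        except OSError as exc:
            # For example, the rsh binary is not installed on this machine.
            results.append((node, None, str(exc)))
        # Optional delay between launches, to test whether timing matters.
        time.sleep(pause)
    return results
```

Printing the result of launch_all(node_names(), pause=1.0) would show, 
node by node, whether the launch command reported an error, and varying 
the pause would show whether the cutoff point depends on how quickly 
the connections are opened.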

In a possibly related problem I've been getting the following message 
printed to my shell on the master node from time to time:

      protocol failure in circuit setup

I've googled this but only learned that it's an rsh error, which makes 
sense in relation to the previous problem.

Another problem is that n02 is unresponsive, and seemingly in a way 
that's somehow worse than a normal crashed node. Normally a crashed 
node causes only a hiccup and an error message ("unreachable node" or 
something like that) from my scripts, but with n02 some of them hang 
instead. The scripts that hang include some that I use for diagnostics 
(and for killing my processes!), which makes it hard to mess around 
much.
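
A possible workaround for the hanging scripts is to put a time limit on 
each per-node command, so that a node that never answers is reported 
and skipped. This is a sketch under assumptions rather than a 
description of the existing scripts: it assumes rsh access, uses the 
remote command "true" as a harmless probe, and the function names and 
the 10-second default are hypothetical.

```python
import subprocess


def probe(node, build_command=None, timeout=10.0):
    """Classify a node as 'ok', 'failed', or 'timeout'."""
    if build_command is None:
        # Assumption: nodes are reached with rsh; 'true' does nothing remotely.
        command = ["rsh", node, "true"]
    else:
        command = build_command(node)
    try:
        proc = subprocess.run(
            command,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,  # the child is killed if it runs longer than this
        )
    except subprocess.TimeoutExpired:
        return "timeout"
    except OSError:
        # The local command could not be started at all.
        return "failed"
    return "ok" if proc.returncode == 0 else "failed"


def responsive_nodes(nodes, build_command=None, timeout=10.0):
    """Return only the nodes whose probe finished successfully in time."""
    return [n for n in nodes if probe(n, build_command, timeout) == "ok"]
```

Diagnostic and cleanup scripts could then loop over responsive_nodes(...) 
instead of the full node list, which would keep one stuck node from 
blocking the rest.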

Any leads or diagnostic suggestions?

Thanks,

  -Lee

--
Lee Spector
Dean, Cognitive Science + Professor, Computer Science
Cognitive Science, Hampshire College
893 West Street, Amherst, MA 01002-3359
lspector at hampshire.edu, http://hampshire.edu/lspector/
Phone: 413-559-5352, Fax: 413-559-5438




