[Clusterusers] hex issues
Lee Spector
lspector at hampshire.edu
Tue Dec 14 20:41:45 EST 2004
Fellow cluster users,
I've recently resumed doing some runs on hex and have been having some
puzzling problems.
The most puzzling is that my script to start lisp runs on all of the
nodes is usually interrupted at some seemingly random point, so it
starts my processes only on nodes n01-n??, where n?? is occasionally 23
but sometimes 19, 15, or something else. One possibility is that the
processes really are being started but abort before they can do
anything. I thought this might happen if the nodes were heavily loaded,
but when I poked around with w and top I didn't see much happening on
the failing nodes. Besides, it seems unlikely that somebody is running
short, intense processes on the highest-numbered nodes, always in
sequence from n23 down.
In a possibly related problem I've been getting the following message
printed to my shell on the master node from time to time:
    protocol failure in circuit setup
I've googled this but only learned that it's an rsh error, which makes
sense in relation to the previous problem.
Another problem is that n02 is unresponsive, and seemingly in a way
that's somehow worse than a normal crashed node. Normally a crashed
node causes only a hiccup and an error message ("unreachable node" or
something like that) from my scripts, but with n02 some of them are
hanging instead. Incidentally, some of the scripts I use to do
diagnostics (and kill my processes!) are hanging this way, making it
hard to mess around much.
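A diagnostic that cannot hang on a node like n02 might look like the
following minimal sketch. It is only an illustration under assumptions:
the function names are hypothetical, the node names n01-n23 are taken
from the description above, and using rsh with "uptime" as the probe
command is a guess at what such a script would run, not what my scripts
actually do.

```python
import subprocess


def probe_node(cmd, timeout_s=5.0):
    """Run cmd (a list of argv strings) and classify the outcome.

    Never blocks longer than timeout_s: if the command has not finished
    by then, subprocess.run kills it and raises TimeoutExpired.
    """
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return "timeout"   # command hung, e.g. an unresponsive node
    except OSError:
        return "error"     # command could not be started at all
    # Exit status of the local command; for rsh this reflects rsh itself,
    # not necessarily the status of the remote command.
    return "ok" if result.returncode == 0 else "failed"


def probe_nodes(nodes, make_cmd, timeout_s=5.0):
    """Probe each node in order; return {node: status}."""
    return {node: probe_node(make_cmd(node), timeout_s) for node in nodes}


def rsh_uptime(node):
    # Assumption: rsh is on PATH and "uptime" is a harmless probe command.
    return ["rsh", node, "uptime"]
```

Calling probe_nodes(["n%02d" % i for i in range(1, 24)], rsh_uptime)
would report one status per node, so a hung node shows up as "timeout"
instead of stalling the whole loop, and the point at which a run over
n01-n23 stops succeeding becomes visible.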
Any leads or diagnostic suggestions?
Thanks,
-Lee
--
Lee Spector
Dean, Cognitive Science + Professor, Computer Science
Cognitive Science, Hampshire College
893 West Street, Amherst, MA 01002-3359
lspector at hampshire.edu, http://hampshire.edu/lspector/
Phone: 413-559-5352, Fax: 413-559-5438