[Clusterusers] now something seems really strange on fly
Lee Spector
lspector at hampshire.edu
Sun Oct 14 23:26:26 EDT 2007
Josiah and clusterusers,
I tried earlier today to kill all of my cluster jobs that had been
running for the previous day or so. I thought it had worked, but I just
noticed that I had gotten a timeout after the first node. I tried again,
and now my kill script hangs on ALL nodes, continuing to the next one only
if I press command-C (over and over again to get to each node). I've never
seen that before. And when I got up to compute-1-13, compute-1-15, and
compute-1-16, which might not even have had my jobs launched on them
(because of a previous problem -- I think I wrote about this to
Josiah but not to the cluster list), they asked me for a password.
I've had to deal with node/password issues before, but I don't see
any reason why this should have cropped up now.
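For reference, here is a minimal sketch of a per-node kill loop that would report a hung or password-prompting node instead of stalling on it. It is not the actual kill script; the function names, the timeout values, and the use of pkill over ssh are all assumptions made for illustration, and it assumes the same username on every node. It relies on the OpenSSH options BatchMode (fail rather than ask for a password) and ConnectTimeout.

```python
import getpass
import subprocess


def build_kill_command(node, process_name="breve", user=None, connect_timeout=5):
    """Return the ssh argv that would kill `process_name` on `node` (hypothetical helper)."""
    user = user or getpass.getuser()  # assumption: same username on every node
    return [
        "ssh",
        "-o", "BatchMode=yes",                      # fail instead of prompting for a password
        "-o", f"ConnectTimeout={connect_timeout}",  # give up quickly on unreachable nodes
        node,
        "pkill", "-u", user, process_name,
    ]


def kill_on_nodes(nodes, run=subprocess.run, overall_timeout=15):
    """Try each node in turn; never block longer than `overall_timeout` seconds per node."""
    results = {}
    for node in nodes:
        try:
            proc = run(
                build_kill_command(node),
                stdin=subprocess.DEVNULL,
                capture_output=True,
                timeout=overall_timeout,
            )
        except subprocess.TimeoutExpired:
            results[node] = "timed out"
            continue
        if proc.returncode == 0:
            results[node] = "killed"
        elif proc.returncode == 1:
            results[node] = "nothing to kill"  # pkill exits 1 when no process matched
        else:
            # ssh itself exits 255 on connection or authentication failure
            results[node] = f"failed (exit {proc.returncode})"
    return results
```

Calling kill_on_nodes(["compute-1-13", "compute-1-15", "compute-1-16"]) would return one status string per node, so a node that wants a password or never answers shows up in the summary rather than requiring a command-C.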
So something seems to be pretty hosed at the moment. All of my breve
processes do seem to have been killed, assuming cluster-top is
telling the truth, but something's wrong.
All of which (along with the other recent problems) is really
puzzling me because I've been running essentially this same code for
months -- it's just breve/PushGP on a straightforward problem, using
the same shell scripts that I've been using for months -- and I've
never had any problems like this before the last couple of
weeks. What's changed? Is it the simultaneous use of the cluster for
lots of rendering? But hasn't that been about the same for a while
too? Could it be Kyle's new work with breve -- Kyle, is it possible
that your block-transport code is consuming a lot of OS network
resources of some kind? Any other ideas?
-Lee
--
Lee Spector, Professor of Computer Science
School of Cognitive Science, Hampshire College
893 West Street, Amherst, MA 01002-3359
lspector at hampshire.edu, http://hampshire.edu/lspector/
Phone: 413-559-5352, Fax: 413-559-5438