[Clusterusers] Re: now something seems really strange on fly
Wm. Josiah Erikson
wjerikson at hampshire.edu
Mon Oct 15 10:43:10 EDT 2007
Lee (and others),
I'm not sure yet exactly why, but here's the situation:
-the "killed" breve jobs aren't showing up in cluster top
-breve is still running on the nodes, taking up memory and putting
the load average up, but not using any CPU (according to top when you
SSH into the node)
-the home directory missing issue is because the NFS server on the
head node has crashed. I restart it and everything is fine, but why did
it crash? The logs show nothing.
-all of this seems to only happen when we run breve. But why? I'm
not convinced that my ulimit stuff worked... in fact, I'm pretty sure it
didn't.
I can't kill the breve jobs - they won't die, not even with a signal
9. I think I'll have to restart the nodes to get them to behave again.
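For anyone who wants to check a node themselves: the symptoms above (memory held, load average up, no CPU use, immune to signal 9) are what Linux processes in uninterruptible sleep (state "D" in top or ps) look like. That would fit with them waiting on the crashed NFS server, but that is a guess on my part, not something I've confirmed. Here is a minimal Python sketch that lists such processes by reading /proc. The function names are made up for illustration and this is not an existing cluster tool.

```python
import os


def parse_stat(stat_text):
    """Return (pid, command, state) from the contents of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split on the LAST ')' rather than on spaces.
    left, _, right = stat_text.rpartition(")")
    pid_text, _, comm = left.partition("(")
    state = right.split()[0]
    return int(pid_text), comm, state


def uninterruptible(proc_root="/proc"):
    """List (pid, command) for every process currently in state D."""
    stuck = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directory names are processes
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited while we were looking, or unreadable
        if state == "D":
            stuck.append((pid, comm))
    return stuck


if __name__ == "__main__":
    for pid, comm in uninterruptible():
        print(pid, comm)
```

Run on a compute node, it prints one line per stuck process. If the breve jobs show up there, restarting the NFS server may let them finish dying without a reboot, though I wouldn't count on it.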
Obviously, I've got my work cut out for me here :) I'll report
anything I figure out. I'm assuming for now that any breve jobs on the
cluster are junk, yes? I don't see any that are actually still using
CPU, meaning, I think, that you killed them but they didn't die.
-Josiah
Lee Spector wrote:
>
> Josiah and clusterusers,
>
> I tried earlier today to kill all of my cluster jobs that had been
> running for the previous day or so. I thought it had worked, but just
> noticed that I had got a timeout after the first node. Tried again --
> and my kill script hangs on ALL nodes, but continues to the next one
> if I command-C (over and over again to get to each node). Never seen
> that before. And when I got up to compute-1-13, compute-1-15, and
> compute-1-16, which might not even have had my jobs launched on them
> (because of a previous problem -- I think I wrote about this to Josiah
> but not to the cluster list), they asked me for a password. I've had
> to deal with node/password issues before, but I don't see any reason
> why this should have cropped up now.
>
> So something seems to be pretty hosed at the moment. All of my breve
> processes do seem to have been killed, assuming cluster-top is telling
> the truth, but something's wrong.
>
> All of which (along with the other recent problems) is really puzzling
> me because I've been running essentially this same code for months --
> it's just breve/PushGP on a straightforward problem, using the same
> shell scripts that I've been using for months -- and I've never had
> problems anything like this before the last couple of weeks. What's
> changed? Is it the simultaneous use of the cluster for lots of
> rendering? But hasn't that been about the same for a while too? Could
> it be Kyle's new work with breve -- Kyle, is it possible that your
> block-transport code is consuming a lot of OS network resources of
> some kind? Any other ideas?
>
> -Lee
>
> --
> Lee Spector, Professor of Computer Science
> School of Cognitive Science, Hampshire College
> 893 West Street, Amherst, MA 01002-3359
> lspector at hampshire.edu, http://hampshire.edu/lspector/
> Phone: 413-559-5352, Fax: 413-559-5438
>
--
Wm. Josiah Erikson
Computing Support
School of Cognitive Science
Hampshire College
Amherst, MA 01002
(413) 559-6091