[Clusterusers] Re: now something seems really strange on fly

Mon Oct 15 10:43:10 EDT 2007

Lee (and others),
    I'm not sure yet exactly why, but here's the situation:

    -the "killed" breve jobs aren't showing up in cluster top
    -breve is still running on the nodes, taking up memory and putting 
the load average up, but not using any CPU (according to top when you 
SSH into the node)
    -the missing home directory issue is because the NFS server on the 
head node has crashed. I restart it and everything is fine, but why did 
it crash? The logs show nothing.

    -all of this seems to only happen when we run breve. But why? I'm 
not convinced that my ulimit stuff worked... in fact, I'm pretty sure it 
didn't.
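
    One way to check whether the ulimit settings actually took effect is 
to print the limits from inside the same job script that launches breve. 
A minimal sketch, assuming Python is available on the nodes; the three 
limits listed are a guess at what might matter, since the thread doesn't 
say which ones were set:

```python
import resource

# Which limits matter here is an assumption; the thread does not say
# which ones the ulimit settings were meant to change.
LIMITS = {
    "RLIMIT_AS": resource.RLIMIT_AS,          # address space (memory)
    "RLIMIT_NPROC": resource.RLIMIT_NPROC,    # number of processes
    "RLIMIT_NOFILE": resource.RLIMIT_NOFILE,  # open file descriptors
}


def format_limit(value):
    """Render a raw limit value, showing RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def current_limits():
    """Return {name: (soft, hard)} for this process, as strings."""
    result = {}
    for name, code in LIMITS.items():
        soft, hard = resource.getrlimit(code)
        result[name] = (format_limit(soft), format_limit(hard))
    return result


if __name__ == "__main__":
    for name, (soft, hard) in current_limits().items():
        print(f"{name}: soft={soft} hard={hard}")
```

This reports the limits of the Python process itself, which it inherits 
from whatever started it, so it only says something about breve if it 
runs from the same shell or script that starts the breve job.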


    I can't kill the breve jobs - they won't die, not even with a signal 
9. I think I'll have to restart the nodes to get them to behave again.
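
    On Linux, processes that ignore signal 9, use no CPU, and still 
count toward the load average are often in uninterruptible sleep (state 
"D" in top and ps), usually waiting on I/O. Whether that is what is 
happening to the breve jobs is an assumption, not something confirmed 
here. A minimal sketch for checking it on a node by reading 
/proc/<pid>/stat; the "breve" name filter and the function names are 
illustrative:

```python
import os


def parse_stat_line(line):
    """Return (pid, command, state) from one /proc/<pid>/stat line."""
    # The command name sits in parentheses and may itself contain spaces
    # or parentheses, so split around the first "(" and the last ")".
    open_paren = line.index("(")
    close_paren = line.rindex(")")
    pid = int(line[:open_paren])
    command = line[open_paren + 1:close_paren]
    state = line[close_paren + 1:].split()[0]
    return pid, command, state


def processes_in_state(wanted="D", name_filter="breve"):
    """List (pid, command) for matching processes in the given state."""
    found = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join("/proc", entry, "stat")) as handle:
                pid, command, state = parse_stat_line(handle.read())
        except (OSError, ValueError):
            continue  # process exited while we were looking, or odd format
        if state == wanted and name_filter in command:
            found.append((pid, command))
    return found


if __name__ == "__main__":
    for pid, command in processes_in_state():
        print(pid, command)
```

If the stuck jobs do show up in state D, signals stay pending until the 
call they are blocked in returns, which would fit with kill -9 appearing 
to do nothing.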

    Obviously, I've got my work cut out for me here :) I'll report 
anything I figure out. I'm assuming for now that any breve jobs on the 
cluster are junk, yes? I don't see any that are actually still using 
CPU, meaning, I think, that you killed them but they didn't die.


-Josiah


Lee Spector wrote:
>
> Josiah and clusterusers,
>
> I tried earlier today to kill all of my cluster jobs that had been 
> running for the previous day or so. I thought it had worked, but just 
> noticed that I had got a timeout after the first node. Tried again -- 
> and my kill script hangs on ALL nodes, but continues to the next one 
> if I command-C (over and over again to get to each node). Never seen 
> that before. And when I got up to compute-1-13, compute-1-15, and 
> compute-1-16, which might not even have had my jobs launched on them 
> (because of a previous problem -- I think I wrote about this to Josiah 
> but not to the cluster list), they asked me for a password. I've had 
> to deal with node/password issues before, but I don't see any reason 
> why this should have cropped up now.
>
> So something seems to be pretty hosed at the moment. All of my breve 
> processes do seem to have been killed, assuming cluster-top is telling 
> the truth, but something's wrong.
>
> All of which (along with the other recent problems) is really puzzling 
> me because I've been running essentially this same code for months -- 
> it's just breve/PushGP on a straightforward problem, using the same 
> shell scripts that I've been using for months -- and I've never had 
> problems anything like this before the last couple of weeks. What's 
> changed? Is it the simultaneous use of the cluster for lots of 
> rendering? But hasn't that been about the same for a while too? Could 
> it be Kyle's new work with breve -- Kyle, is it possible that your 
> block-transport code is consuming a lot of OS network resources of 
> some kind? Any other ideas?
>
>  -Lee
>
> -- 
> Lee Spector, Professor of Computer Science
> School of Cognitive Science, Hampshire College
> 893 West Street, Amherst, MA 01002-3359
> lspector at hampshire.edu, http://hampshire.edu/lspector/
> Phone: 413-559-5352, Fax: 413-559-5438
>

-- 
Wm. Josiah Erikson
Computing Support
School of Cognitive Science
Hampshire College
Amherst, MA 01002
(413) 559-6091