[Clusterusers] Re: now something seems really strange on fly

Mon Oct 15 10:55:24 EDT 2007

Josiah (and clusterusers, and I'm cc-ing Jon too since there may be
breve-related issues at play),

Thanks for the detective work. Please do kill all of my undead breve  
processes. Some of them DO still appear to be generating output, but  
it's nothing useful at this point and it's time to drive the wooden  
stake through their hearts :-).

I think we have too little data to know that this is really breve, because
I was also running breve a lot when this WASN'T happening. But it
could conceivably be something in a recently installed version of
breve... and now there's a new official release. I don't know whether
that has been installed, so maybe that would help (Jon -- can you make
sure that the version on fly is the latest?).

Since there was an NFS server crash on the head node and since Kyle
has been experimenting with some new networking stuff (also in breve),
I think that may be a useful place to look...

Thanks,

  -Lee

On Oct 15, 2007, at 10:43 AM, Wm. Josiah Erikson wrote:

> Lee (and others),
>    I'm not sure yet exactly why, but here's the situation:
>
>    -the "killed" breve jobs aren't showing up in cluster-top
>    -breve is still running on the nodes, taking up memory and  
> putting the load average up, but not using any CPU (according to  
> top when you SSH into the node)
>    -the home directory missing issue is because the NFS server on  
> the head node has crashed. I restart it and everything is fine,   
> but why did it crash? The logs show nothing.
>
>    -all of this seems to only happen when we run breve. But why?  
> I'm not convinced that my ulimit stuff worked... in fact, I'm  
> pretty sure it didn't.
>
>
>    I can't kill the breve jobs - they won't die, not even with a
> signal 9. I think I'll have to restart the nodes to get them to
> behave again.
>
>    Obviously, I've got my work cut out for me here :) I'll report  
> anything I figure out. I'm assuming for now that any breve jobs on  
> the cluster are junk, yes? I don't see any that are actually still  
> using CPU, meaning, I think, that you killed them but they didn't die.
>
>
> -Josiah
>
>
>
> Lee Spector wrote:
>>
>> Josiah and clusterusers,
>>
>> I tried earlier today to kill all of my cluster jobs that had been  
>> running for the previous day or so. I thought it had worked, but  
>> just noticed that I had got a timeout after the first node. Tried  
>> again -- and my kill script hangs on ALL nodes, but continues to  
>> the next one if I command-C (over and over again to get to each  
>> node). Never seen that before. And when I got up to compute-1-13,  
>> compute-1-15, and compute-1-16, which might not even have had my  
>> jobs launched on them (because of a previous problem -- I think I  
>> wrote about this to Josiah but not to the cluster list), they  
>> asked me for a password. I've had to deal with node/password  
>> issues before, but I don't see any reason why this should have  
>> cropped up now.
>>
>> So something seems to be pretty hosed at the moment. All of my  
>> breve processes do seem to have been killed, assuming cluster-top  
>> is telling the truth, but something's wrong.
>>
>> All of which (along with the other recent problems) is really  
>> puzzling me because I've been running essentially this same code  
>> for months -- it's just breve/PushGP on a straightforward problem,  
>> using the same shell scripts that I've been using for months --  
>> and I've never had problems anything like this before the last  
>> couple of weeks. What's changed? Is it the simultaneous use of the  
>> cluster for lots of rendering? But hasn't that been about the same  
>> for a while too? Could it be Kyle's new work with breve -- Kyle,  
>> is it possible that your block-transport code is consuming a lot  
>> of OS network resources of some kind? Any other ideas?
>>
>>  -Lee
>>
>> -- 
>> Lee Spector, Professor of Computer Science
>> School of Cognitive Science, Hampshire College
>> 893 West Street, Amherst, MA 01002-3359
>> lspector at hampshire.edu, http://hampshire.edu/lspector/
>> Phone: 413-559-5352, Fax: 413-559-5438
>>
>
> -- 
> Wm. Josiah Erikson
> Computing Support
> School of Cognitive Science
> Hampshire College
> Amherst, MA 01002
> (413) 559-6091

--
Lee Spector, Professor of Computer Science
School of Cognitive Science, Hampshire College
893 West Street, Amherst, MA 01002-3359
lspector at hampshire.edu, http://hampshire.edu/lspector/
Phone: 413-559-5352, Fax: 413-559-5438