[Clusterusers] Re: now something seems really strange on fly

Mon Oct 15 11:01:31 EDT 2007

I'm beginning to suspect an NFS bug - other people on the ROCKS list 
have been talking about similar things. I think the processes aren't 
dying because they're all waiting to write data to the NFS share and 
they're in "uninterruptible iowait", which is a state in which they will 
not die.
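
For anyone who wants to check a node for this, here is a rough sketch, 
not a tested tool. It assumes a Linux node with /proc mounted, and the 
function names (parse_stat, uninterruptible_processes) are made up for 
illustration; nothing like this is installed on fly. It lists processes 
whose state is "D", which is how uninterruptible sleep shows up in ps 
and top:

```python
import os


def parse_stat(stat_text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the first '(' and the last ')'.
    left = stat_text.index("(")
    right = stat_text.rindex(")")
    pid = int(stat_text[:left].strip())
    comm = stat_text[left + 1:right]
    state = stat_text[right + 1:].split()[0]
    return pid, comm, state


def uninterruptible_processes(proc_root="/proc"):
    """List (pid, comm) for processes currently in state D."""
    found = []
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return found  # no procfs at this path (e.g. not a Linux box)
    for entry in entries:
        if not entry.isdigit():
            continue  # only numeric directories are processes
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited, or the line was not parseable
        if state == "D":
            found.append((pid, comm))
    return found


if __name__ == "__main__":
    for pid, comm in uninterruptible_processes():
        print(pid, comm)
```

If the breve processes show up in this list on a compute node, that 
would fit the theory that they are stuck waiting on the NFS share 
rather than ignoring the kill.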

When I restarted the NFS server, lots of things got better. I'll see if 
I can upgrade the NFS server on the head node and perhaps it will stop 
crashing.

    -Josiah

Lee Spector wrote:
>
> Josiah (and clusterusers, and I'm cc-ing Jon too since there may be 
> breve-related issues at play),
>
> Thanks for the detective work. Please do kill all of my undead breve 
> processes. Some of them DO still appear to be generating output, but 
> it's nothing useful at this point and it's time to drive the wooden 
> stake through their hearts :-).
>
> I think we have too little data to know this is really breve, because 
> I was also running breve a lot when this WASN'T happening. But it 
> could conceivably be something in a recently installed version of 
> breve... and now there's a new official release, and I don't know if 
> that has been installed so maybe that would help (Jon -- can you make 
> sure that the version on fly is the latest?).
>
> Since there was an NFS server crash on the head node and since Kyle 
> has been experimenting with some new networking stuff (also in breve) 
> I think that may be a useful place to look...
>
> Thanks,
>
>  -Lee
>
>
> On Oct 15, 2007, at 10:43 AM, Wm. Josiah Erikson wrote:
>
>> Lee (and others),
>>    I'm not sure yet exactly why, but here's the situation:
>>
>>    -the "killed" breve jobs aren't showing up in cluster top
>>    -breve is still running on the nodes, taking up memory and putting 
>> the load average up, but not using any CPU (according to top when you 
>> SSH into the node)
>>    -the home directory missing issue is because the NFS server on the 
>> head node has crashed. I restart it and everything is fine, but why 
>> did it crash? The logs show nothing.
>>
>>    -all of this seems to only happen when we run breve. But why? I'm 
>> not convinced that my ulimit stuff worked... in fact, I'm pretty sure 
>> it didn't.
>>
>>
>>    I can't kill the breve jobs - they won't die, not even with a 
>> signal 9. I think I'll have to restart the nodes to get them to 
>> behave again.
>>
>>    Obviously, I've got my work cut out for me here :) I'll report 
>> anything I figure out. I'm assuming for now that any breve jobs on 
>> the cluster are junk, yes? I don't see any that are actually still 
>> using CPU, meaning, I think, that you killed them but they didn't die.
>>
>>
>> -Josiah
>>
>>
>>
>> Lee Spector wrote:
>>>
>>> Josiah and clusterusers,
>>>
>>> I tried earlier today to kill all of my cluster jobs that had been 
>>> running for the previous day or so. I thought it had worked, but 
>>> just noticed that I had got a timeout after the first node. Tried 
>>> again -- and my kill script hangs on ALL nodes, but continues to the 
>>> next one if I command-C (over and over again to get to each node). 
>>> Never seen that before. And when I got up to compute-1-13, 
>>> compute-1-15, and compute-1-16, which might not even have had my 
>>> jobs launched on them (because of a previous problem -- I think I 
>>> wrote about this to Josiah but not to the cluster list), they asked 
>>> me for a password. I've had to deal with node/password issues 
>>> before, but I don't see any reason why this should have cropped up now.
>>>
>>> So something seems to be pretty hosed at the moment. All of my breve 
>>> processes do seem to have been killed, assuming cluster-top is 
>>> telling the truth, but something's wrong.
>>>
>>> All of which (along with the other recent problems) is really 
>>> puzzling me because I've been running essentially this same code for 
>>> months -- it's just breve/PushGP on a straightforward problem, using 
>>> the same shell scripts that I've been using for months -- and I've 
>>> never had problems anything like this before the last couple of 
>>> weeks. What's changed? Is it the simultaneous use of the cluster for 
>>> lots of rendering? But hasn't that been about the same for a while 
>>> too? Could it be Kyle's new work with breve -- Kyle, is it possible 
>>> that your block-transport code is consuming a lot of OS network 
>>> resources of some kind? Any other ideas?
>>>
>>>  -Lee
>>>
>>> -- 
>>> Lee Spector, Professor of Computer Science
>>> School of Cognitive Science, Hampshire College
>>> 893 West Street, Amherst, MA 01002-3359
>>> lspector at hampshire.edu, http://hampshire.edu/lspector/
>>> Phone: 413-559-5352, Fax: 413-559-5438
>>>
>>
>> -- 
>> Wm. Josiah Erikson
>> Computing Support
>> School of Cognitive Science
>> Hampshire College
>> Amherst, MA 01002
>> (413) 559-6091
>>
>> _______________________________________________
>> Clusterusers mailing list
>> Clusterusers at lists.hampshire.edu
>> http://lists.hampshire.edu/mailman/listinfo/clusterusers
>
> -- 
> Lee Spector, Professor of Computer Science
> School of Cognitive Science, Hampshire College
> 893 West Street, Amherst, MA 01002-3359
> lspector at hampshire.edu, http://hampshire.edu/lspector/
> Phone: 413-559-5352, Fax: 413-559-5438

-- 
Wm. Josiah Erikson
Computing Support
School of Cognitive Science
Hampshire College
Amherst, MA 01002
(413) 559-6091