[Clusterusers] Re: now something seems really strange on fly

Mon Oct 15 19:04:44 EDT 2007

I'll add that we have experienced odd NFS stuff from fly to keg since  
we got keg. In fact, it was happening even back in the urza days.  
Remember? We'd make a file from fly, it would appear on urza, but not  
when you ls the urza folder from fly. Here's hoping the new NFS stuff  
improves things for all of us!

- chris
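
For anyone who wants to check the "uninterruptible iowait" theory Josiah describes below, here is a minimal sketch; it is not something posted in the thread. It assumes a Linux node where `ps -eo pid,stat,comm` is available, and the function names and sample text are hypothetical. A process whose STAT field starts with D is in uninterruptible sleep, and signals (including signal 9) are not delivered until it leaves that state.

```python
import subprocess


def find_uninterruptible(ps_output):
    """Return (pid, command) pairs for processes whose state starts with 'D'.

    Expects the text produced by `ps -eo pid,stat,comm`.
    """
    stuck = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 2)
        if len(fields) < 3:
            continue  # ignore blank or malformed lines
        pid, stat, comm = fields
        if stat.startswith("D"):  # D = uninterruptible sleep (usually I/O)
            stuck.append((int(pid), comm.strip()))
    return stuck


def current_uninterruptible():
    """Run ps on this machine and report processes in state D."""
    out = subprocess.run(
        ["ps", "-eo", "pid,stat,comm"],
        capture_output=True, text=True, check=True,
    ).stdout
    return find_uninterruptible(out)


# Hypothetical sample, made up for illustration (not real output from fly).
SAMPLE = """\
  PID STAT COMMAND
    1 Ss   init
 4242 D    breve
 4243 R+   top
 4250 Dl   breve
"""

if __name__ == "__main__":
    for pid, comm in find_uninterruptible(SAMPLE):
        print(pid, comm)
```

If breve shows up in that list on a compute node, that would be consistent with the processes being blocked on the NFS share rather than ignoring the kill.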

On Oct 15, 2007, at 11:01 AM, Wm. Josiah Erikson wrote:

> I'm beginning to suspect an NFS bug - other people on the ROCKS  
> list have been talking about similar things. I think the processes  
> aren't dying because they're all waiting to write data to the NFS  
> share and they're in "uninterruptible iowait", which is a state in  
> which they will not die.
>
> When I restarted the NFS server, lots of things got better. I'll  
> see if I can upgrade the NFS server on the head node and perhaps it  
> will stop crashing.
>
>    -Josiah
>
>
>
> Lee Spector wrote:
>>
>> Josiah (and clusterusers, and I'm cc-ing Jon too since there may be  
>> breve-related issues at play),
>>
>> Thanks for the detective work. Please do kill all of my undead  
>> breve processes. Some of them DO still appear to be generating  
>> output, but it's nothing useful at this point and it's time to  
>> drive the wooden stake through their hearts :-).
>>
>> I think we have too little data to know this is really breve,  
>> because I was also running breve a lot when this WASN'T happening.  
>> But it could conceivably be something in a recently installed  
>> version of breve... and now there's a new official release, and I  
>> don't know if that has been installed so maybe that would help  
>> (Jon -- can you make sure that the version on fly is the latest?).
>>
>> Since there was an NFS server crash on the head node and since  
>> Kyle has been experimenting with some new networking stuff (also  
>> in breve) I think that may be a useful place to look...
>>
>> Thanks,
>>
>>  -Lee
>>
>>
>> On Oct 15, 2007, at 10:43 AM, Wm. Josiah Erikson wrote:
>>
>>> Lee (and others),
>>>    I'm not sure yet exactly why, but here's the situation:
>>>
>>>    -the "killed" breve jobs aren't showing up in cluster top
>>>    -breve is still running on the nodes, taking up memory and  
>>> putting the load average up, but not using any CPU (according to  
>>> top when you SSH into the node)
>>>    -the home directory missing issue is because the NFS server on  
>>> the head node has crashed. I restart it and everything is fine,   
>>> but why did it crash? The logs show nothing.
>>>
>>>    -all of this seems to only happen when we run breve. But why?  
>>> I'm not convinced that my ulimit stuff worked... in fact, I'm  
>>> pretty sure it didn't.
>>>
>>>
>>>    I can't kill the breve jobs - they won't die, not even with a  
>>> signal 9. I think I'll have to restart the nodes to get them to  
>>> behave again.
>>>
>>>    Obviously, I've got my work cut out for me here :) I'll report  
>>> anything I figure out. I'm assuming for now that any breve jobs  
>>> on the cluster are junk, yes? I don't see any that are actually  
>>> still using CPU, meaning, I think, that you killed them but they  
>>> didn't die.
>>>
>>>
>>> -Josiah
>>>
>>>
>>>
>>> Lee Spector wrote:
>>>>
>>>> Josiah and clusterusers,
>>>>
>>>> I tried earlier today to kill all of my cluster jobs that had  
>>>> been running for the previous day or so. I thought it had  
>>>> worked, but just noticed that I had got a timeout after the  
>>>> first node. Tried again -- and my kill script hangs on ALL  
>>>> nodes, but continues to the next one if I command-C (over and  
>>>> over again to get to each node). Never seen that before. And  
>>>> when I got up to compute-1-13, compute-1-15, and compute-1-16,  
>>>> which might not even have had my jobs launched on them (because  
>>>> of a previous problem -- I think I wrote about this to Josiah  
>>>> but not to the cluster list), they asked me for a password. I've  
>>>> had to deal with node/password issues before, but I don't see  
>>>> any reason why this should have cropped up now.
>>>>
>>>> So something seems to be pretty hosed at the moment. All of my  
>>>> breve processes do seem to have been killed, assuming cluster-top  
>>>> is telling the truth, but something's wrong.
>>>>
>>>> All of which (along with the other recent problems) is really  
>>>> puzzling me because I've been running essentially this same code  
>>>> for months -- it's just breve/PushGP on a straightforward  
>>>> problem, using the same shell scripts that I've been using for  
>>>> months -- and I've never had problems anything like this before  
>>>> the last couple of weeks. What's changed? Is it the simultaneous  
>>>> use of the cluster for lots of rendering? But hasn't that been  
>>>> about the same for a while too? Could it be Kyle's new work with  
>>>> breve -- Kyle, is it possible that your block-transport code is  
>>>> consuming a lot of OS network resources of some kind? Any other  
>>>> ideas?
>>>>
>>>>  -Lee
>>>>
>>>> -- 
>>>> Lee Spector, Professor of Computer Science
>>>> School of Cognitive Science, Hampshire College
>>>> 893 West Street, Amherst, MA 01002-3359
>>>> lspector at hampshire.edu, http://hampshire.edu/lspector/
>>>> Phone: 413-559-5352, Fax: 413-559-5438
>>>>
>>>
>>> -- 
>>> Wm. Josiah Erikson
>>> Computing Support
>>> School of Cognitive Science
>>> Hampshire College
>>> Amherst, MA 01002
>>> (413) 559-6091
>>>
>>> _______________________________________________
>>> Clusterusers mailing list
>>> Clusterusers at lists.hampshire.edu
>>> http://lists.hampshire.edu/mailman/listinfo/clusterusers
>>
>