[Clusterusers] Re: cluster mailing lists & something funky on compute-1-3

Lee Spector lspector at hampshire.edu
Tue Oct 9 12:46:52 EDT 2007


I've just killed everything of mine, so you can blast away anything  
that's still running as far as I'm concerned. Let me know when I can  
try starting things up again.

Thanks,

  -Lee


On Oct 9, 2007, at 11:03 AM, Wm. Josiah Erikson wrote:

> I'd actually like to compare with you which nodes are supposed to  
> be running breve and which ones actually are right now so that we  
> can en masse reboot any nodes that have hanging breve processes....  
> I suspect that compute-0-15, compute-0-12, 0-16 and 0-17 might be  
> in a similar state... (perhaps have extra breve processes running)
>
> IM or 'phone or some other real-time means would be best for this,  
> probably...
>    -Josiah
>
>
>
> Lee Spector wrote:
>>
>> Drat -- now there's something similar wrong with compute-0-18. It  
>> doesn't hang in the simple cluster-fork test I sent earlier but  
>> it's hanging my launch script (which uses a ./sh loop to ssh to  
>> each node).
>>
>>  -Lee
>>
>>
>> On Oct 9, 2007, at 9:17 AM, Wm. Josiah Erikson wrote:
>>
>>> uh... that's clusterusers at lists.hampshire.edu obviously :)
>>>
>>> compute-1-3 was stuck in permanent iowait with a breve process  
>>> using up more RAM than the machine actually had (meaning it was  
>>> in swap). I think it probably got stuck when the head node  
>>> crashed and ended up with a stale file handle or something when  
>>> the home directory got pulled out from under it.
>>>
>>> I rebooted it :)
>>>
>>>    -Josiah
>>>
>>>
>>>
>>> Wm. Josiah Erikson wrote:
>>>> I have subscribed Adam; Brian was already subscribed, as are  
>>>> Kyle, Lee, Jaime, Chris, me, Michael, and a few others. I think  
>>>> helga can be removed from the discussion unless there is  
>>>> somebody else on that list that should know what's up with the  
>>>> cluster.
>>>>
>>>> clusterusers at hermes doesn't work anymore - that's old and the  
>>>> hostnames have changed. clusteruers at lists.hampshire.edu should  
>>>> be the proper address. We'll see if this gets through :)
>>>>
>>>> I'll go check on compute-1-3 and report back.
>>>>
>>>>    -Josiah
>>>>
>>>>
>>>>
>>>> Lee Spector wrote:
>>>>>
>>>>> It seems like a lot of people are using/concerned with the  
>>>>> status of the cluster these days, but maybe not all of them are  
>>>>> on the cluster-users list. Is that right? It'd be nice to clear  
>>>>> this up so that we're not having unsynced conversations among  
>>>>> those of us on the cluster list, the helga list, and maybe no  
>>>>> list, not to mention individual email side conversations. Can  
>>>>> we take care of this by subscribing helga to cluster-users and  
>>>>> making sure that all non-helga users and administrators are  
>>>>> individually subscribed?
>>>>>
>>>>> On an actual cluster-related note: Does anyone know in what  
>>>>> particular way compute-1-3 is currently hosed, such that it  
>>>>> hangs my cluster-looped scripts? To see what I mean you can try  
>>>>> 'cluster-fork hostname' (I'll include the output below since if  
>>>>> someone fixes compute-1-3 you won't see what I mean...).  
>>>>> Cluster-fork is smart enough to ignore some node pathologies  
>>>>> (e.g. compute-0-10 is currently down, and is properly skipped,  
>>>>> and compute-1-12 and compute-1-13 are refusing ssh connections,  
>>>>> and are properly skipped), but compute-1-3 requires a  
>>>>> command-C. Worse, and more important for me, is that command-C (or  
>>>>> other cluster-fork options) won't work for many of my scripts,  
>>>>> which use ./sh loops to run through the nodes.
>>>>>
>>>>> If someone has the access/knowledge/power to fix compute-1-3  
>>>>> then please do so and let us know. Or if you just have an idea  
>>>>> what might be up with compute-1-3 and how to prevent it then  
>>>>> please share.
>>>>>
>>>>> Thanks,
>>>>>
>>>>>  -Lee
>>>>>
>>>>> -----------
>>>>>
>>>>> [lspector at fly bin]$ cluster-fork hostname
>>>>> compute-0-1:
>>>>> compute-0-1.local
>>>>> compute-0-2:
>>>>> compute-0-2.local
>>>>> compute-0-3:
>>>>> compute-0-3.local
>>>>> compute-0-4:
>>>>> compute-0-4.local
>>>>> compute-0-5:
>>>>> compute-0-5.local
>>>>> compute-0-6:
>>>>> compute-0-6.local
>>>>> compute-0-7:
>>>>> compute-0-7.local
>>>>> compute-0-8:
>>>>> compute-0-8.local
>>>>> compute-0-9:
>>>>> compute-0-9.local
>>>>> compute-0-10: down
>>>>> compute-0-11:
>>>>> compute-0-11.local
>>>>> compute-0-12:
>>>>> compute-0-12.local
>>>>> compute-0-13:
>>>>> compute-0-13.local
>>>>> compute-0-14:
>>>>> compute-0-14.local
>>>>> compute-0-15:
>>>>> compute-0-15.local
>>>>> compute-0-16:
>>>>> compute-0-16.local
>>>>> compute-0-17:
>>>>> compute-0-17.local
>>>>> compute-0-18:
>>>>> compute-0-18.local
>>>>> compute-0-19:
>>>>> compute-0-19.local
>>>>> compute-0-20:
>>>>> compute-0-20.local
>>>>> compute-0-21:
>>>>> compute-0-21.local
>>>>> compute-0-22:
>>>>> compute-0-22.local
>>>>> compute-0-23:
>>>>> compute-0-23.local
>>>>> compute-1-1:
>>>>> compute-1-1.local
>>>>> compute-1-2:
>>>>> compute-1-2.local
>>>>> compute-1-3:             [ HANGS HERE, CTRL-C ALLOWS CONTINUATION ]
>>>>> compute-1-4:
>>>>> compute-1-4.local
>>>>> compute-1-5:
>>>>> compute-1-5.local
>>>>> compute-1-6:
>>>>> compute-1-6.local
>>>>> compute-1-7:
>>>>> compute-1-7.local
>>>>> compute-1-8:
>>>>> compute-1-8.local
>>>>> compute-1-9:
>>>>> compute-1-9.local
>>>>> compute-1-10:
>>>>> compute-1-10.local
>>>>> compute-1-11:
>>>>> compute-1-11.local
>>>>> compute-1-12:
>>>>> ssh: connect to host compute-1-12 port 22: Connection refused
>>>>> compute-1-13:
>>>>> ssh: connect to host compute-1-13 port 22: Connection refused
>>>>> compute-1-14:
>>>>> compute-1-14.local
>>>>> compute-1-15:
>>>>> compute-1-15.local
>>>>> compute-1-16:
>>>>> compute-1-16.local
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> -- 
>>>>> Lee Spector, Professor of Computer Science
>>>>> School of Cognitive Science, Hampshire College
>>>>> 893 West Street, Amherst, MA 01002-3359
>>>>> lspector at hampshire.edu, http://hampshire.edu/lspector/
>>>>> Phone: 413-559-5352, Fax: 413-559-5438
>>>>>
>>>>
>>>
>>> -- 
>>> Wm. Josiah Erikson
>>> Computing Support
>>> School of Cognitive Science
>>> Hampshire College
>>> Amherst, MA 01002
>>> (413) 559-6091
>>>
>>> _______________________________________________
>>> Clusterusers mailing list
>>> Clusterusers at lists.hampshire.edu
>>> http://lists.hampshire.edu/mailman/listinfo/clusterusers

--
Lee Spector, Professor of Computer Science
School of Cognitive Science, Hampshire College
893 West Street, Amherst, MA 01002-3359
lspector at hampshire.edu, http://hampshire.edu/lspector/
Phone: 413-559-5352, Fax: 413-559-5438
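
On the node loops in the quoted messages that hang when one host is wedged: a minimal sketch of one possible workaround is to put a hard per-node time limit on each ssh call, so that a node which accepts the connection but never returns (as compute-1-3 appeared to) is recorded and skipped instead of blocking the whole loop. This is not the launch script from the thread. The function name run_on_nodes, the build_argv hook, the status strings, and the timeout values are all hypothetical, and the ssh options shown (BatchMode, ConnectTimeout) are standard OpenSSH client options chosen here as an assumption about what would suit this cluster.

```python
import subprocess


def run_on_nodes(nodes, remote_command, timeout_s=10, build_argv=None):
    """Run remote_command on each node in turn, waiting at most timeout_s per node.

    Returns a dict mapping node -> (status, output), where status is
    "ok", "failed" (non-zero exit), or "timeout" (gave up waiting).
    """
    if build_argv is None:
        # Assumed ssh invocation. BatchMode=yes stops ssh from prompting for a
        # password; ConnectTimeout only bounds the connection phase, which is
        # why the overall timeout below is still needed for a node that
        # connects and then hangs.
        def build_argv(node, cmd):
            return ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=5", node, cmd]

    results = {}
    for node in nodes:
        try:
            proc = subprocess.run(
                build_argv(node, remote_command),
                stdout=subprocess.PIPE,
                stderr=subprocess.STDOUT,
                universal_newlines=True,
                timeout=timeout_s,  # the local child process is killed on expiry
            )
        except subprocess.TimeoutExpired:
            # Node did not answer in time: note it and move on to the next one.
            results[node] = ("timeout", "")
            continue
        status = "ok" if proc.returncode == 0 else "failed"
        results[node] = (status, proc.stdout.strip())
    return results
```

Calling run_on_nodes(["compute-1-2", "compute-1-3", "compute-1-4"], "hostname") would return one entry per node. The timeout only ends the local ssh client; it does not clean up whatever is stuck on the remote node, so a node reported as "timeout" would still need attention (for example a reboot, as was done for compute-1-3).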



