[Clusterusers] Re: cluster mailing lists & something funky on compute-1-3

Wm. Josiah Erikson wjerikson at hampshire.edu
Tue Oct 9 11:03:36 EDT 2007


I'd actually like to compare with you which nodes are supposed to be 
running breve and which ones actually are right now so that we can 
en masse reboot any nodes that have hanging breve processes.... I suspect 
that compute-0-15, compute-0-12, 0-16 and 0-17 might be in a similar 
state... (perhaps have extra breve processes running)
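
For the comparison itself, here is a minimal sketch of a per-node check that 
should not stall on a wedged node the way a plain ssh loop does. It rests on 
several assumptions: the process is actually named "breve", GNU timeout and 
the procps ps are installed, and the helper name breve_counts, the SSH_CMD 
and NODE_TIMEOUT variables, and the 10-second limit are all made up for 
illustration.

```shell
#!/bin/sh
# Sketch only: the helper name, variables, and time limits are assumptions.

SSH_CMD=${SSH_CMD:-ssh}            # overridable, e.g. for a dry run
NODE_TIMEOUT=${NODE_TIMEOUT:-10}   # seconds to wait on any one node

# Print one line per node given as an argument:
#   "<node> <count>"      number of breve processes found
#   "<node> TIMEOUT"      ssh did not finish within NODE_TIMEOUT
#   "<node> UNREACHABLE"  ssh failed (down, connection refused, ...)
breve_counts() {
    for node in "$@"; do
        # timeout(1) kills the local ssh even when the remote side is stuck;
        # ConnectTimeout only covers connection setup, and BatchMode and -n
        # keep ssh from waiting on a password prompt or on stdin.
        if count=$(timeout "$NODE_TIMEOUT" "$SSH_CMD" -n \
                       -o BatchMode=yes -o ConnectTimeout=5 \
                       "$node" 'ps -C breve -o pid= | wc -l' 2>/dev/null); then
            echo "$node $count"
        else
            status=$?
            if [ "$status" -eq 124 ]; then   # 124 is timeout's "timed out" code
                echo "$node TIMEOUT"
            else
                echo "$node UNREACHABLE"
            fi
        fi
    done
}
```

Something like "breve_counts compute-0-12 compute-0-15 compute-0-16 
compute-0-17" would then list the suspects one per line, and anything 
reported as TIMEOUT is a candidate for the reboot.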

IM or 'phone or some other real-time means would be best for this, 
probably...
    -Josiah



Lee Spector wrote:
>
> Drat -- now there's something similar wrong with compute-0-18. It 
> doesn't hang in the simple cluster-fork test I sent earlier but it's 
> hanging my launch script (which uses a ./sh loop to ssh to each node).
>
>  -Lee
>
>
> On Oct 9, 2007, at 9:17 AM, Wm. Josiah Erikson wrote:
>
>> uh... that's clusterusers at lists.hampshire.edu obviously :)
>>
>> compute-1-3 was stuck in permanent iowait with a breve process using 
>> up more RAM than the machine actually had (meaning it was in swap). I 
>> think it probably got stuck when the head node crashed and ended up 
>> with a stale file handle or something when the home directory got 
>> pulled out from under it.
>>
>> I rebooted it :)
>>
>>    -Josiah
>>
>>
>>
>> Wm. Josiah Erikson wrote:
>>> I have subscribed Adam. Brian was already subscribed, as are Kyle, 
>>> Lee, Jaime, Chris, me, Michael, and a few others. I think helga can 
>>> be removed from the discussion unless there is somebody else on that 
>>> list that should know what's up with the cluster.
>>>
>>> clusterusers at hermes doesn't work anymore - that's old and the 
>>> hostnames have changed. clusteruers at lists.hampshire.edu should be 
>>> the proper address. We'll see if this gets through :)
>>>
>>> I'll go check on compute-1-3 and report back.
>>>
>>>    -Josiah
>>>
>>>
>>>
>>> Lee Spector wrote:
>>>>
>>>> It seems like a lot of people are using/concerned with the status 
>>>> of the cluster these days, but maybe not all of them are on the 
>>>> cluster-users list. Is that right? It'd be nice to clear this up so 
>>>> that we're not having unsynced conversations among those of us on 
>>>> the cluster list, the helga list, and maybe no list, not to mention 
>>>> individual email side conversations. Can we take care of this by 
>>>> subscribing helga to cluster-users and making sure that all 
>>>> non-helga users and administrators are individually subscribed?
>>>>
>>>> On an actual cluster-related note: Does anyone know in what 
>>>> particular way compute-1-3 is currently hosed, such that it hangs 
>>>> my cluster-looped scripts? To see what I mean you can try 
>>>> 'cluster-fork hostname' (I'll include the output below since if 
>>>> someone fixes compute-1-3 you won't see what I mean...). 
>>>> Cluster-fork is smart enough to ignore some node pathologies (e.g. 
>>>> compute-0-10 is currently down, and is properly skipped, and 
>>>> compute-1-12 and compute-1-13 are refusing ssh connections, and are 
>>>> properly skipped), but compute-1-3 requires a command-C. Worse, and 
>>>> more important for me, is that command-C (or other cluster-fork 
>>>> options) won't work for many of my scripts, which use ./sh loops to 
>>>> run through the nodes.
>>>>
>>>> If someone has the access/knowledge/power to fix compute-1-3 then 
>>>> please do so and let us know. Or if you just have an idea what 
>>>> might be up with compute-1-3 and how to prevent it then please share.
>>>>
>>>> Thanks,
>>>>
>>>>  -Lee
>>>>
>>>> -----------
>>>>
>>>> [lspector at fly bin]$ cluster-fork hostname
>>>> compute-0-1:
>>>> compute-0-1.local
>>>> compute-0-2:
>>>> compute-0-2.local
>>>> compute-0-3:
>>>> compute-0-3.local
>>>> compute-0-4:
>>>> compute-0-4.local
>>>> compute-0-5:
>>>> compute-0-5.local
>>>> compute-0-6:
>>>> compute-0-6.local
>>>> compute-0-7:
>>>> compute-0-7.local
>>>> compute-0-8:
>>>> compute-0-8.local
>>>> compute-0-9:
>>>> compute-0-9.local
>>>> compute-0-10: down
>>>> compute-0-11:
>>>> compute-0-11.local
>>>> compute-0-12:
>>>> compute-0-12.local
>>>> compute-0-13:
>>>> compute-0-13.local
>>>> compute-0-14:
>>>> compute-0-14.local
>>>> compute-0-15:
>>>> compute-0-15.local
>>>> compute-0-16:
>>>> compute-0-16.local
>>>> compute-0-17:
>>>> compute-0-17.local
>>>> compute-0-18:
>>>> compute-0-18.local
>>>> compute-0-19:
>>>> compute-0-19.local
>>>> compute-0-20:
>>>> compute-0-20.local
>>>> compute-0-21:
>>>> compute-0-21.local
>>>> compute-0-22:
>>>> compute-0-22.local
>>>> compute-0-23:
>>>> compute-0-23.local
>>>> compute-1-1:
>>>> compute-1-1.local
>>>> compute-1-2:
>>>> compute-1-2.local
>>>> compute-1-3:             [ HANGS HERE, CTRL-C ALLOWS CONTINUATION ]
>>>> compute-1-4:
>>>> compute-1-4.local
>>>> compute-1-5:
>>>> compute-1-5.local
>>>> compute-1-6:
>>>> compute-1-6.local
>>>> compute-1-7:
>>>> compute-1-7.local
>>>> compute-1-8:
>>>> compute-1-8.local
>>>> compute-1-9:
>>>> compute-1-9.local
>>>> compute-1-10:
>>>> compute-1-10.local
>>>> compute-1-11:
>>>> compute-1-11.local
>>>> compute-1-12:
>>>> ssh: connect to host compute-1-12 port 22: Connection refused
>>>> compute-1-13:
>>>> ssh: connect to host compute-1-13 port 22: Connection refused
>>>> compute-1-14:
>>>> compute-1-14.local
>>>> compute-1-15:
>>>> compute-1-15.local
>>>> compute-1-16:
>>>> compute-1-16.local
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> -- 
>>>> Lee Spector, Professor of Computer Science
>>>> School of Cognitive Science, Hampshire College
>>>> 893 West Street, Amherst, MA 01002-3359
>>>> lspector at hampshire.edu, http://hampshire.edu/lspector/
>>>> Phone: 413-559-5352, Fax: 413-559-5438
>>>>
>>>
>>
>> -- 
>> Wm. Josiah Erikson
>> Computing Support
>> School of Cognitive Science
>> Hampshire College
>> Amherst, MA 01002
>> (413) 559-6091
>>
>> _______________________________________________
>> Clusterusers mailing list
>> Clusterusers at lists.hampshire.edu
>> http://lists.hampshire.edu/mailman/listinfo/clusterusers
>
>

-- 
Wm. Josiah Erikson
Computing Support
School of Cognitive Science
Hampshire College
Amherst, MA 01002
(413) 559-6091



