[Clusterusers] Re: cluster mailing lists & something funky on compute-1-3

Wm. Josiah Erikson wjerikson at hampshire.edu
Tue Oct 9 13:51:21 EDT 2007


The cluster is a blank slate, waiting for new things.

compute-1-3 is still down, but should be back up shortly. compute-0-10 
is still waiting for a power supply, so is down as well, but everything 
else should be good.

    -Josiah
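
A note on the hanging launch scripts discussed below: a per-node time limit is one way to keep a single stuck node from stalling a loop that sshes to each node in turn. This is only a minimal sketch, not the actual launch script. It assumes GNU coreutils `timeout` is installed on the head node, and `run_on_nodes`, `REMOTE`, and the 30-second limit are made-up names and values for illustration.

```shell
#!/bin/sh
# Hypothetical helper: run one command on each node in turn, giving up on
# a node after LIMIT seconds so one hung node cannot stall the whole loop.
# Assumes GNU coreutils `timeout` is available (an assumption, not a given).
# REMOTE defaults to ssh; it is a variable only so that the loop can be
# tried out with a stand-in command instead of real nodes.

run_on_nodes() {
    limit=$1
    cmd=$2
    shift 2
    for node in "$@"; do
        echo "$node:"
        # BatchMode=yes     fail instead of prompting for a password
        # ConnectTimeout=5  bound the connection attempt to a dead node
        # timeout           bound the whole session, including a node that
        #                   accepts the connection but never answers
        # </dev/null        keep ssh from reading the script's stdin
        if timeout "$limit" ${REMOTE:-ssh -o BatchMode=yes -o ConnectTimeout=5} \
               "$node" "$cmd" </dev/null; then
            :
        else
            echo "skipped (exit status $?)"
        fi
    done
}

# Example call (node names taken from this thread):
#   run_on_nodes 30 hostname compute-0-18 compute-1-3 compute-1-4
```

ConnectTimeout on its own would probably not have helped with compute-1-3, since a node stuck in iowait may still accept the connection and then never respond. The outer `timeout` is there for that case; it reports exit status 124 when the limit is hit.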


Lee Spector wrote:
> I've just killed everything of mine, so you can blast away anything 
> that's still running as far as I'm concerned. Let me know when I can 
> try starting things up again.
>
> Thanks,
>
>  -Lee
>
>
> On Oct 9, 2007, at 11:03 AM, Wm. Josiah Erikson wrote:
>
>> I'd actually like to compare with you which nodes are supposed to be 
>> running breve and which ones actually are right now so that we can 
>> en masse reboot any nodes that have hanging breve processes.... I 
>> suspect that compute-0-15, compute-0-12, 0-16 and 0-17 might be in a 
>> similar state... (perhaps have extra breve processes running)
>>
>> IM or 'phone or some other real-time means would be best for this, 
>> probably...
>>    -Josiah
>>
>>
>>
>> Lee Spector wrote:
>>>
>>> Drat -- now there's something similar wrong with compute-0-18. It 
>>> doesn't hang in the simple cluster-fork test I sent earlier but it's 
>>> hanging my launch script (which uses a ./sh loop to ssh to each node).
>>>
>>>  -Lee
>>>
>>>
>>> On Oct 9, 2007, at 9:17 AM, Wm. Josiah Erikson wrote:
>>>
>>>> uh... that's clusterusers at lists.hampshire.edu obviously :)
>>>>
>>>> compute-1-3 was stuck in permanent iowait with a breve process 
>>>> using up more RAM than the machine actually had (meaning it was in 
>>>> swap). I think it probably got stuck when the head node crashed and 
>>>> ended up with a stale file handle or something when the home 
>>>> directory got pulled out from under it.
>>>>
>>>> I rebooted it :)
>>>>
>>>>    -Josiah
>>>>
>>>>
>>>>
>>>> Wm. Josiah Erikson wrote:
>>>>> I have subscribed Adam; Brian was already subscribed, as are Kyle, 
>>>>> Lee, Jaime, Chris, me, Michael, and a few others. I think helga 
>>>>> can be removed from the discussion unless there is somebody else 
>>>>> on that list that should know what's up with the cluster.
>>>>>
>>>>> clusterusers at hermes doesn't work anymore - that's old and the 
>>>>> hostnames have changed. clusteruers at lists.hampshire.edu should be 
>>>>> the proper address. We'll see if this gets through :)
>>>>>
>>>>> I'll go check on compute-1-3 and report back.
>>>>>
>>>>>    -Josiah
>>>>>
>>>>>
>>>>>
>>>>> Lee Spector wrote:
>>>>>>
>>>>>> It seems like a lot of people are using/concerned with the status 
>>>>>> of the cluster these days, but maybe not all of them are on the 
>>>>>> cluster-users list. Is that right? It'd be nice to clear this up 
>>>>>> so that we're not having unsynced conversations among those of us 
>>>>>> on the cluster list, the helga list, and maybe no list, not to 
>>>>>> mention individual email side conversations. Can we take care of 
>>>>>> this by subscribing helga to cluster-users and making sure that 
>>>>>> all non-helga users and administrators are individually subscribed?
>>>>>>
>>>>>> On an actual cluster-related note: Does anyone know in what 
>>>>>> particular way compute-1-3 is currently hosed, such that it hangs 
>>>>>> my cluster-looped scripts? To see what I mean you can try 
>>>>>> 'cluster-fork hostname' (I'll include the output below since if 
>>>>>> someone fixes compute-1-3 you won't see what I mean...). 
>>>>>> Cluster-fork is smart enough to ignore some node pathologies 
>>>>>> (e.g. compute-0-10 is currently down, and is properly skipped, 
>>>>>> and compute-1-12 and compute-1-13 are refusing ssh connections, 
>>>>>> and are properly skipped), but compute-1-3 requires a command-C. 
>>>>>> Worse, and more important for me, is that command-C (or other 
>>>>>> cluster-fork options) won't work for many of my scripts, which 
>>>>>> use ./sh loops to run through the nodes.
>>>>>>
>>>>>> If someone has the access/knowledge/power to fix compute-1-3 then 
>>>>>> please do so and let us know. Or if you just have an idea what 
>>>>>> might be up with compute-1-3 and how to prevent it then please 
>>>>>> share.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>>  -Lee
>>>>>>
>>>>>> -----------
>>>>>>
>>>>>> [lspector at fly bin]$ cluster-fork hostname
>>>>>> compute-0-1:
>>>>>> compute-0-1.local
>>>>>> compute-0-2:
>>>>>> compute-0-2.local
>>>>>> compute-0-3:
>>>>>> compute-0-3.local
>>>>>> compute-0-4:
>>>>>> compute-0-4.local
>>>>>> compute-0-5:
>>>>>> compute-0-5.local
>>>>>> compute-0-6:
>>>>>> compute-0-6.local
>>>>>> compute-0-7:
>>>>>> compute-0-7.local
>>>>>> compute-0-8:
>>>>>> compute-0-8.local
>>>>>> compute-0-9:
>>>>>> compute-0-9.local
>>>>>> compute-0-10: down
>>>>>> compute-0-11:
>>>>>> compute-0-11.local
>>>>>> compute-0-12:
>>>>>> compute-0-12.local
>>>>>> compute-0-13:
>>>>>> compute-0-13.local
>>>>>> compute-0-14:
>>>>>> compute-0-14.local
>>>>>> compute-0-15:
>>>>>> compute-0-15.local
>>>>>> compute-0-16:
>>>>>> compute-0-16.local
>>>>>> compute-0-17:
>>>>>> compute-0-17.local
>>>>>> compute-0-18:
>>>>>> compute-0-18.local
>>>>>> compute-0-19:
>>>>>> compute-0-19.local
>>>>>> compute-0-20:
>>>>>> compute-0-20.local
>>>>>> compute-0-21:
>>>>>> compute-0-21.local
>>>>>> compute-0-22:
>>>>>> compute-0-22.local
>>>>>> compute-0-23:
>>>>>> compute-0-23.local
>>>>>> compute-1-1:
>>>>>> compute-1-1.local
>>>>>> compute-1-2:
>>>>>> compute-1-2.local
>>>>>> compute-1-3:             [ HANGS HERE, CTRL-C ALLOWS CONTINUATION ]
>>>>>> compute-1-4:
>>>>>> compute-1-4.local
>>>>>> compute-1-5:
>>>>>> compute-1-5.local
>>>>>> compute-1-6:
>>>>>> compute-1-6.local
>>>>>> compute-1-7:
>>>>>> compute-1-7.local
>>>>>> compute-1-8:
>>>>>> compute-1-8.local
>>>>>> compute-1-9:
>>>>>> compute-1-9.local
>>>>>> compute-1-10:
>>>>>> compute-1-10.local
>>>>>> compute-1-11:
>>>>>> compute-1-11.local
>>>>>> compute-1-12:
>>>>>> ssh: connect to host compute-1-12 port 22: Connection refused
>>>>>> compute-1-13:
>>>>>> ssh: connect to host compute-1-13 port 22: Connection refused
>>>>>> compute-1-14:
>>>>>> compute-1-14.local
>>>>>> compute-1-15:
>>>>>> compute-1-15.local
>>>>>> compute-1-16:
>>>>>> compute-1-16.local
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> -- 
>>>>>> Lee Spector, Professor of Computer Science
>>>>>> School of Cognitive Science, Hampshire College
>>>>>> 893 West Street, Amherst, MA 01002-3359
>>>>>> lspector at hampshire.edu, http://hampshire.edu/lspector/
>>>>>> Phone: 413-559-5352, Fax: 413-559-5438
>>>>>>
>>>>>
>>>>
>>>> -- 
>>>> Wm. Josiah Erikson
>>>> Computing Support
>>>> School of Cognitive Science
>>>> Hampshire College
>>>> Amherst, MA 01002
>>>> (413) 559-6091
>>>>
>>>> _______________________________________________
>>>> Clusterusers mailing list
>>>> Clusterusers at lists.hampshire.edu
>>>> http://lists.hampshire.edu/mailman/listinfo/clusterusers

-- 
Wm. Josiah Erikson
Computing Support
School of Cognitive Science
Hampshire College
Amherst, MA 01002
(413) 559-6091