[Clusterusers] Re: cluster mailing lists & something funky on compute-1-3
Wm. Josiah Erikson
wjerikson at hampshire.edu
Tue Oct 9 11:03:36 EDT 2007
I'd actually like to compare with you which nodes are supposed to be
running breve and which ones actually are right now, so that we can
reboot en masse any nodes that have hanging breve processes. I suspect
that compute-0-15, compute-0-12, compute-0-16 and compute-0-17 might be
in a similar state (perhaps with extra breve processes running).
IM or 'phone or some other real-time means would be best for this,
probably...
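
In case it helps with the comparison, here is a rough sketch of how the
check could be scripted without the loop itself hanging on a stuck node.
It is not part of any existing cluster tooling. The function names, the
15-second limit, and the assumption that the processes show up under the
name "breve" in pgrep are all mine; the node list is just the four
suspects above. ConnectTimeout only covers the connection itself, so the
overall timeout is what catches a node that accepts ssh and then stalls
(as compute-1-3 did).

```python
import subprocess

# The four nodes suspected above; adjust to taste.
SUSPECT_NODES = ["compute-0-12", "compute-0-15", "compute-0-16", "compute-0-17"]


def count_breve(node, timeout=15, run=subprocess.run):
    """Return how many breve processes run on `node`, or None if unknown.

    None means the node was unreachable, hung past `timeout` seconds,
    or the remote command failed for some other reason.
    """
    cmd = [
        "ssh",
        "-o", "BatchMode=yes",      # never stop to prompt for a password
        "-o", "ConnectTimeout=5",   # give up quickly on nodes that are down
        node,
        "pgrep breve",              # assumption: process name contains "breve"
    ]
    try:
        # The overall timeout kills ssh if the node connects but then stalls.
        result = run(cmd, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return None
    # pgrep exits 0 on a match and 1 on no match; ssh itself exits 255
    # on connection errors. Anything else is treated as unknown.
    if result.returncode not in (0, 1):
        return None
    # pgrep prints one PID per line, so counting tokens counts processes.
    return len(result.stdout.split())


def report(nodes=SUSPECT_NODES, **kwargs):
    """Map each node name to its breve process count (or None)."""
    return {node: count_breve(node, **kwargs) for node in nodes}
```

Calling report() from the head node returns a dict of node name to
count; a None entry marks a node that is probably a reboot candidate.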
-Josiah
Lee Spector wrote:
>
> Drat -- now there's something similar wrong with compute-0-18. It
> doesn't hang in the simple cluster-fork test I sent earlier but it's
> hanging my launch script (which uses a ./sh loop to ssh to each node).
>
> -Lee
>
>
> On Oct 9, 2007, at 9:17 AM, Wm. Josiah Erikson wrote:
>
>> uh... that's clusterusers at lists.hampshire.edu obviously :)
>>
>> compute-1-3 was stuck in permanent iowait with a breve process using
>> up more RAM than the machine actually had (meaning it was in swap). I
>> think it probably got stuck when the head node crashed and ended up
>> with a stale file handle or something when the home directory got
>> pulled out from under it.
>>
>> I rebooted it :)
>>
>> -Josiah
>>
>>
>>
>> Wm. Josiah Erikson wrote:
>>> I have subscribed Adam; Brian was already subscribed, as are Kyle,
>>> Lee, Jaime, Chris, me, Michael, and a few others. I think helga can
>>> be removed from the discussion unless there is somebody else on that
>>> list that should know what's up with the cluster.
>>>
>>> clusterusers at hermes doesn't work anymore - that's old and the
>>> hostnames have changed. clusteruers at lists.hampshire.edu should be
>>> the proper address. We'll see if this gets through :)
>>>
>>> I'll go check on compute-1-3 and report back.
>>>
>>> -Josiah
>>>
>>>
>>>
>>> Lee Spector wrote:
>>>>
>>>> It seems like a lot of people are using/concerned with the status
>>>> of the cluster these days, but maybe not all of them are on the
>>>> cluster-users list. Is that right? It'd be nice to clear this up so
>>>> that we're not having unsynced conversations among those of us on
>>>> the cluster list, the helga list, and maybe no list, not to mention
>>>> individual email side conversations. Can we take care of this by
>>>> subscribing helga to cluster-users and making sure that all
>>>> non-helga users and administrators are individually subscribed?
>>>>
>>>> On an actual cluster-related note: Does anyone know in what
>>>> particular way compute-1-3 is currently hosed, such that it hangs
>>>> my cluster-looped scripts? To see what I mean you can try
>>>> 'cluster-fork hostname' (I'll include the output below since if
>>>> someone fixes compute-1-3 you won't see what I mean...).
>>>> Cluster-fork is smart enough to ignore some node pathologies (e.g.
>>>> compute-0-10 is currently down, and is properly skipped, and
>>>> compute-1-12 and compute-1-13 are refusing ssh connections, and are
>>>> properly skipped), but compute-1-3 requires a command-C. Worse, and
>>>> more important for me, is that command-C (or other cluster-fork
>>>> options) won't work for many of my scripts, which use ./sh loops to
>>>> run through the nodes.
>>>>
>>>> If someone has the access/knowledge/power to fix compute-1-3 then
>>>> please do so and let us know. Or if you just have an idea what
>>>> might be up with compute-1-3 and how to prevent it then please share.
>>>>
>>>> Thanks,
>>>>
>>>> -Lee
>>>>
>>>> -----------
>>>>
>>>> [lspector at fly bin]$ cluster-fork hostname
>>>> compute-0-1:
>>>> compute-0-1.local
>>>> compute-0-2:
>>>> compute-0-2.local
>>>> compute-0-3:
>>>> compute-0-3.local
>>>> compute-0-4:
>>>> compute-0-4.local
>>>> compute-0-5:
>>>> compute-0-5.local
>>>> compute-0-6:
>>>> compute-0-6.local
>>>> compute-0-7:
>>>> compute-0-7.local
>>>> compute-0-8:
>>>> compute-0-8.local
>>>> compute-0-9:
>>>> compute-0-9.local
>>>> compute-0-10: down
>>>> compute-0-11:
>>>> compute-0-11.local
>>>> compute-0-12:
>>>> compute-0-12.local
>>>> compute-0-13:
>>>> compute-0-13.local
>>>> compute-0-14:
>>>> compute-0-14.local
>>>> compute-0-15:
>>>> compute-0-15.local
>>>> compute-0-16:
>>>> compute-0-16.local
>>>> compute-0-17:
>>>> compute-0-17.local
>>>> compute-0-18:
>>>> compute-0-18.local
>>>> compute-0-19:
>>>> compute-0-19.local
>>>> compute-0-20:
>>>> compute-0-20.local
>>>> compute-0-21:
>>>> compute-0-21.local
>>>> compute-0-22:
>>>> compute-0-22.local
>>>> compute-0-23:
>>>> compute-0-23.local
>>>> compute-1-1:
>>>> compute-1-1.local
>>>> compute-1-2:
>>>> compute-1-2.local
>>>> compute-1-3: [ HANGS HERE, CTRL-C ALLOWS CONTINUATION ]
>>>> compute-1-4:
>>>> compute-1-4.local
>>>> compute-1-5:
>>>> compute-1-5.local
>>>> compute-1-6:
>>>> compute-1-6.local
>>>> compute-1-7:
>>>> compute-1-7.local
>>>> compute-1-8:
>>>> compute-1-8.local
>>>> compute-1-9:
>>>> compute-1-9.local
>>>> compute-1-10:
>>>> compute-1-10.local
>>>> compute-1-11:
>>>> compute-1-11.local
>>>> compute-1-12:
>>>> ssh: connect to host compute-1-12 port 22: Connection refused
>>>> compute-1-13:
>>>> ssh: connect to host compute-1-13 port 22: Connection refused
>>>> compute-1-14:
>>>> compute-1-14.local
>>>> compute-1-15:
>>>> compute-1-15.local
>>>> compute-1-16:
>>>> compute-1-16.local
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Lee Spector, Professor of Computer Science
>>>> School of Cognitive Science, Hampshire College
>>>> 893 West Street, Amherst, MA 01002-3359
>>>> lspector at hampshire.edu, http://hampshire.edu/lspector/
>>>> Phone: 413-559-5352, Fax: 413-559-5438
>>>>
>>>
>>
>> --
>> Wm. Josiah Erikson
>> Computing Support
>> School of Cognitive Science
>> Hampshire College
>> Amherst, MA 01002
>> (413) 559-6091
>>
>> _______________________________________________
>> Clusterusers mailing list
>> Clusterusers at lists.hampshire.edu
>> http://lists.hampshire.edu/mailman/listinfo/clusterusers
>
> --
> Lee Spector, Professor of Computer Science
> School of Cognitive Science, Hampshire College
> 893 West Street, Amherst, MA 01002-3359
> lspector at hampshire.edu, http://hampshire.edu/lspector/
> Phone: 413-559-5352, Fax: 413-559-5438
--
Wm. Josiah Erikson
Computing Support
School of Cognitive Science
Hampshire College
Amherst, MA 01002
(413) 559-6091