[Clusterusers] Re: cluster mailing lists & something funky on compute-1-3

Wm. Josiah Erikson wjerikson at hampshire.edu
Tue Oct 9 09:17:40 EDT 2007


uh... that's clusterusers at lists.hampshire.edu obviously :)

compute-1-3 was stuck in permanent iowait with a breve process using up 
more RAM than the machine actually had (meaning it was in swap). I think 
it probably got stuck when the head node crashed and ended up with a 
stale file handle or something when the home directory got pulled out 
from under it.

I rebooted it :)

    -Josiah
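For spotting this kind of hang before it blocks everyone's scripts, here is a minimal sketch, assuming a Linux node with the standard /proc filesystem. A process that is blocked on I/O, whether from heavy swapping or from a home directory that disappeared underneath it, typically sits in uninterruptible sleep, shown as state "D" in /proc/<pid>/stat. Brief D states are normal; a process that stays in D for a long time is the warning sign. The function names parse_stat and stuck_processes are hypothetical, not part of any cluster tool.

```python
import os


def parse_stat(line):
    """Split a /proc/<pid>/stat line into (pid, comm, state).

    comm is wrapped in parentheses and may itself contain spaces or ')',
    so we split on the LAST ')' instead of on whitespace.
    """
    open_paren = line.index("(")
    close_paren = line.rindex(")")
    pid = int(line[:open_paren].strip())
    comm = line[open_paren + 1:close_paren]
    # The single-letter state is the first field after the closing paren.
    state = line[close_paren + 1:].split()[0]
    return pid, comm, state


def stuck_processes(proc_root="/proc"):
    """Return (pid, comm) for every process currently in state 'D'."""
    stuck = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as /proc/meminfo
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError):
            # The process exited between listdir() and open(), or the
            # file had unexpected contents; either way, skip it.
            continue
        if state == "D":
            stuck.append((pid, comm))
    return stuck
```

Run it a few times, a minute or so apart, on a suspect node: a process like the breve job above that shows up every time is a good candidate for the one holding the node in iowait.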



Wm. Josiah Erikson wrote:
> I have subscribed Adam. Brian was already subscribed, as are Kyle, Lee, 
> Jaime, Chris, me, Michael, and a few others. I think helga can be 
> removed from the discussion unless there is somebody else on that list 
> that should know what's up with the cluster.
>
> clusterusers at hermes doesn't work anymore - that's old and the 
> hostnames have changed. clusteruers at lists.hampshire.edu should be the 
> proper address. We'll see if this gets through :)
>
> I'll go check on compute-1-3 and report back.
>
>    -Josiah
>
>
>
> Lee Spector wrote:
>>
>> It seems like a lot of people are using/concerned with the status of 
>> the cluster these days, but maybe not all of them are on the 
>> cluster-users list. Is that right? It'd be nice to clear this up so 
>> that we're not having unsynced conversations among those of us on the 
>> cluster list, the helga list, and maybe no list, not to mention 
>> individual email side conversations. Can we take care of this by 
>> subscribing helga to cluster-users and making sure that all non-helga 
>> users and administrators are individually subscribed?
>>
>> On an actual cluster-related note: Does anyone know in what 
>> particular way compute-1-3 is currently hosed, such that it hangs my 
>> cluster-looped scripts? To see what I mean you can try 'cluster-fork 
>> hostname' (I'll include the output below since if someone fixes 
>> compute-1-3 you won't see what I mean...). Cluster-fork is smart 
>> enough to ignore some node pathologies (e.g. compute-0-10 is 
>> currently down, and is properly skipped, and compute-1-12 and 
>> compute-1-13 are refusing ssh connections, and are properly skipped), 
>> but compute-1-3 requires a command-C. Worse, and more important for 
>> me, is that command-C (or other cluster-fork options) won't work for 
>> many of my scripts, which use ./sh loops to run through the nodes.
>>
>> If someone has the access/knowledge/power to fix compute-1-3 then 
>> please do so and let us know. Or if you just have an idea what might 
>> be up with compute-1-3 and how to prevent it then please share.
>>
>> Thanks,
>>
>>  -Lee
>>
>> -----------
>>
>> [lspector at fly bin]$ cluster-fork hostname
>> compute-0-1:
>> compute-0-1.local
>> compute-0-2:
>> compute-0-2.local
>> compute-0-3:
>> compute-0-3.local
>> compute-0-4:
>> compute-0-4.local
>> compute-0-5:
>> compute-0-5.local
>> compute-0-6:
>> compute-0-6.local
>> compute-0-7:
>> compute-0-7.local
>> compute-0-8:
>> compute-0-8.local
>> compute-0-9:
>> compute-0-9.local
>> compute-0-10: down
>> compute-0-11:
>> compute-0-11.local
>> compute-0-12:
>> compute-0-12.local
>> compute-0-13:
>> compute-0-13.local
>> compute-0-14:
>> compute-0-14.local
>> compute-0-15:
>> compute-0-15.local
>> compute-0-16:
>> compute-0-16.local
>> compute-0-17:
>> compute-0-17.local
>> compute-0-18:
>> compute-0-18.local
>> compute-0-19:
>> compute-0-19.local
>> compute-0-20:
>> compute-0-20.local
>> compute-0-21:
>> compute-0-21.local
>> compute-0-22:
>> compute-0-22.local
>> compute-0-23:
>> compute-0-23.local
>> compute-1-1:
>> compute-1-1.local
>> compute-1-2:
>> compute-1-2.local
>> compute-1-3:             [ HANGS HERE, CTRL-C ALLOWS CONTINUATION ]
>> compute-1-4:
>> compute-1-4.local
>> compute-1-5:
>> compute-1-5.local
>> compute-1-6:
>> compute-1-6.local
>> compute-1-7:
>> compute-1-7.local
>> compute-1-8:
>> compute-1-8.local
>> compute-1-9:
>> compute-1-9.local
>> compute-1-10:
>> compute-1-10.local
>> compute-1-11:
>> compute-1-11.local
>> compute-1-12:
>> ssh: connect to host compute-1-12 port 22: Connection refused
>> compute-1-13:
>> ssh: connect to host compute-1-13 port 22: Connection refused
>> compute-1-14:
>> compute-1-14.local
>> compute-1-15:
>> compute-1-15.local
>> compute-1-16:
>> compute-1-16.local
>>
>>
>>
>>
>>
>> -- 
>> Lee Spector, Professor of Computer Science
>> School of Cognitive Science, Hampshire College
>> 893 West Street, Amherst, MA 01002-3359
>> lspector at hampshire.edu, http://hampshire.edu/lspector/
>> Phone: 413-559-5352, Fax: 413-559-5438
>>
>

-- 
Wm. Josiah Erikson
Computing Support
School of Cognitive Science
Hampshire College
Amherst, MA 01002
(413) 559-6091
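On the question of keeping one bad node from hanging a whole loop: assuming the sh loops call ssh once per node, each call can wait forever on a node like compute-1-3 that answers the connection but never finishes. Below is a minimal sketch, in Python rather than sh, of one way around that: run each per-node command through subprocess.run with a wall-clock timeout and record the node as timed out instead of blocking. run_on_nodes and make_cmd are hypothetical names for illustration.

```python
import subprocess


def run_on_nodes(nodes, make_cmd, timeout=10):
    """Run one command per node, giving up on a node after `timeout` seconds.

    make_cmd(node) must return the argv list to execute for that node.
    Returns {node: (status, output)} where status is "ok", "error",
    or "timeout".
    """
    results = {}
    for node in nodes:
        try:
            proc = subprocess.run(
                make_cmd(node),
                capture_output=True,
                text=True,
                timeout=timeout,  # run() kills the child when this expires
            )
        except subprocess.TimeoutExpired:
            results[node] = ("timeout", "")
            continue
        if proc.returncode == 0:
            results[node] = ("ok", proc.stdout.strip())
        else:
            # e.g. ssh exits 255 on "Connection refused"
            results[node] = ("error", proc.stderr.strip())
    return results
```

For real use, make_cmd might return `["ssh", "-o", "ConnectTimeout=5", "-o", "BatchMode=yes", node, "hostname"]`. ConnectTimeout only bounds connection setup, so a node that accepts the connection and then stalls (for instance on a stale home directory) is caught only by the overall timeout; BatchMode keeps ssh from sitting at a password prompt.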
