[Clusterusers] Re: cluster mailing lists & something funky on compute-1-3

Lee Spector lspector at hampshire.edu
Tue Oct 9 10:32:18 EDT 2007


Drat -- now there's something similar wrong with compute-0-18. It  
doesn't hang in the simple cluster-fork test I sent earlier but it's  
hanging my launch script (which uses a ./sh loop to ssh to each node).
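
One way to keep a single stuck node from stalling a per-node loop is to put a hard time limit on each ssh call. The following is only a rough sketch, not the actual launch script: it is written in Python rather than sh, and the function name `run_on_nodes`, the `build_argv` hook, and the 10-second default are all made up for illustration. It assumes password-less ssh to each node.

```python
import subprocess


def run_on_nodes(nodes, remote_command, timeout_s=10, build_argv=None):
    """Run remote_command on each node in turn, skipping nodes that hang.

    build_argv is a hypothetical hook: given a node name it returns the
    argv list to execute. It defaults to a plain ssh invocation.
    """
    if build_argv is None:
        # BatchMode=yes makes ssh fail instead of waiting for a password.
        def build_argv(node):
            return ["ssh", "-o", "BatchMode=yes", node, remote_command]

    results = {}
    for node in nodes:
        try:
            done = subprocess.run(
                build_argv(node),
                capture_output=True,
                text=True,
                timeout=timeout_s,  # child is killed if it runs past this
            )
        except subprocess.TimeoutExpired:
            # The node accepted the connection but never answered.
            results[node] = "timed out"
            continue
        if done.returncode == 0:
            results[node] = done.stdout.strip()
        else:
            results[node] = "failed: " + done.stderr.strip()
    return results
```

Calling `run_on_nodes(["compute-1-2", "compute-1-3"], "hostname")` would then record "timed out" for a node that hangs and move on to the next one, instead of waiting for someone to interrupt it by hand.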

  -Lee


On Oct 9, 2007, at 9:17 AM, Wm. Josiah Erikson wrote:

> uh... that's clusterusers at lists.hampshire.edu obviously :)
>
> compute-1-3 was stuck in permanent iowait with a breve process  
> using up more RAM than the machine actually had (meaning it was in  
> swap). I think it probably got stuck when the head node crashed and  
> ended up with a stale file handle or something when the home  
> directory got pulled out from under it.
>
> I rebooted it :)
>
>    -Josiah
>
>
>
> Wm. Josiah Erikson wrote:
>> I have subscribed Adam, Brian was already subscribed, as is Kyle,  
>> Lee, Jaime, Chris, me, Michael, and a few others. I think helga  
>> can be removed from the discussion unless there is somebody else  
>> on that list that should know what's up with the cluster.
>>
>> clusterusers at hermes doesn't work anymore - that's old and the  
>> hostnames have changed. clusteruers at lists.hampshire.edu should be  
>> the proper address. We'll see if this gets through :)
>>
>> I'll go check on compute-1-3 and report back.
>>
>>    -Josiah
>>
>>
>>
>> Lee Spector wrote:
>>>
>>> It seems like a lot of people are using/concerned with the status  
>>> of the cluster these days, but maybe not all of them are on the  
>>> cluster-users list. Is that right? It'd be nice to clear this up  
>>> so that we're not having unsynced conversations among those of us  
>>> on the cluster list, the helga list, and maybe no list, not to  
>>> mention individual email side conversations. Can we take care of  
>>> this by subscribing helga to cluster-users and making sure that  
>>> all non-helga users and administrators are individually subscribed?
>>>
>>> On an actual cluster-related note: Does anyone know in what  
>>> particular way compute-1-3 is currently hosed, such that it hangs  
>>> my cluster-looped scripts? To see what I mean you can try  
>>> 'cluster-fork hostname' (I'll include the output below since if  
>>> someone fixes compute-1-3 you won't see what I mean...).
>>> Cluster-fork is smart enough to ignore some node pathologies (e.g.  
>>> compute-0-10 is currently down, and is properly skipped, and  
>>> compute-1-12 and compute-1-13 are refusing ssh connections, and  
>>> are properly skipped), but compute-1-3 requires a command-C.  
>>> Worse, and more important for me, is that command-C (or other  
>>> cluster-fork options) won't work for many of my scripts, which  
>>> use ./sh loops to run through the nodes.
>>>
>>> If someone has the access/knowledge/power to fix compute-1-3 then  
>>> please do so and let us know. Or if you just have an idea what  
>>> might be up with compute-1-3 and how to prevent it then please  
>>> share.
>>>
>>> Thanks,
>>>
>>>  -Lee
>>>
>>> -----------
>>>
>>> [lspector at fly bin]$ cluster-fork hostname
>>> compute-0-1:
>>> compute-0-1.local
>>> compute-0-2:
>>> compute-0-2.local
>>> compute-0-3:
>>> compute-0-3.local
>>> compute-0-4:
>>> compute-0-4.local
>>> compute-0-5:
>>> compute-0-5.local
>>> compute-0-6:
>>> compute-0-6.local
>>> compute-0-7:
>>> compute-0-7.local
>>> compute-0-8:
>>> compute-0-8.local
>>> compute-0-9:
>>> compute-0-9.local
>>> compute-0-10: down
>>> compute-0-11:
>>> compute-0-11.local
>>> compute-0-12:
>>> compute-0-12.local
>>> compute-0-13:
>>> compute-0-13.local
>>> compute-0-14:
>>> compute-0-14.local
>>> compute-0-15:
>>> compute-0-15.local
>>> compute-0-16:
>>> compute-0-16.local
>>> compute-0-17:
>>> compute-0-17.local
>>> compute-0-18:
>>> compute-0-18.local
>>> compute-0-19:
>>> compute-0-19.local
>>> compute-0-20:
>>> compute-0-20.local
>>> compute-0-21:
>>> compute-0-21.local
>>> compute-0-22:
>>> compute-0-22.local
>>> compute-0-23:
>>> compute-0-23.local
>>> compute-1-1:
>>> compute-1-1.local
>>> compute-1-2:
>>> compute-1-2.local
>>> compute-1-3:             [ HANGS HERE, CTRL-C ALLOWS CONTINUATION ]
>>> compute-1-4:
>>> compute-1-4.local
>>> compute-1-5:
>>> compute-1-5.local
>>> compute-1-6:
>>> compute-1-6.local
>>> compute-1-7:
>>> compute-1-7.local
>>> compute-1-8:
>>> compute-1-8.local
>>> compute-1-9:
>>> compute-1-9.local
>>> compute-1-10:
>>> compute-1-10.local
>>> compute-1-11:
>>> compute-1-11.local
>>> compute-1-12:
>>> ssh: connect to host compute-1-12 port 22: Connection refused
>>> compute-1-13:
>>> ssh: connect to host compute-1-13 port 22: Connection refused
>>> compute-1-14:
>>> compute-1-14.local
>>> compute-1-15:
>>> compute-1-15.local
>>> compute-1-16:
>>> compute-1-16.local
>>>
>>>
>>>
>>>
>>>
>>> -- 
>>> Lee Spector, Professor of Computer Science
>>> School of Cognitive Science, Hampshire College
>>> 893 West Street, Amherst, MA 01002-3359
>>> lspector at hampshire.edu, http://hampshire.edu/lspector/
>>> Phone: 413-559-5352, Fax: 413-559-5438
>>>
>>
>
> -- 
> Wm. Josiah Erikson
> Computing Support
> School of Cognitive Science
> Hampshire College
> Amherst, MA 01002
> (413) 559-6091
>
> _______________________________________________
> Clusterusers mailing list
> Clusterusers at lists.hampshire.edu
> http://lists.hampshire.edu/mailman/listinfo/clusterusers

--
Lee Spector, Professor of Computer Science
School of Cognitive Science, Hampshire College
893 West Street, Amherst, MA 01002-3359
lspector at hampshire.edu, http://hampshire.edu/lspector/
Phone: 413-559-5352, Fax: 413-559-5438



