[Clusterusers] Re: cluster mailing lists & something funky on compute-1-3

Lee Spector lspector at hampshire.edu
Tue Oct 9 12:58:03 EDT 2007


Actually, I prefer swapping to crashing, when it's necessary. If a  
process is near to making an important discovery and it needs a  
little more than the available RAM, and swapping can allow things to  
continue (albeit slowly), then that's much preferable to losing  
everything (which may be the results of several days of computing).  
Chris made some similar suggestions, e.g. having processes dump log  
info and restart somehow, but these will have to be special-cased for  
each application and some may be problematic in other ways.

I do think that just having more RAM may go a long way. It's true  
that some processes may eat 4G soon after they eat 2G, but others may  
top out below that depending on other parameters. (As for the  
economics, I still don't know how the costs of all the options  
compare. Josiah, I got your message about the high cost of upping  
the RAM on some of the old nodes, but I'm not clear about which  
nodes currently have or will soon have more, what the costs of other  
options may be, etc. We should talk about that.)

I do think that partitioning the cluster may be the best thing to do  
in the short term, though if it's a static partition then we're bound  
to be wasting some resources at times and hobbling some of our  
projects at others.

If there's a way to pause a bloating process and ask the user what to  
do that might be interesting, but just killing it will often be bad  
from my perspective.
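
One way the "pause and ask" idea might look on a Linux node is sketched below. This is only a minimal illustration under stated assumptions: it reads resident memory from /proc/<pid>/status, and the function names (rss_kib, pause_if_bloated, resume) and the threshold are hypothetical, not part of any existing cluster tooling. Note that a stopped process keeps the memory it already holds; pausing only stops further growth until someone decides what to do.

```python
import os
import signal


def rss_kib(pid):
    """Return the resident set size of pid in KiB, read from /proc (Linux only)."""
    with open(f"/proc/{pid}/status") as status:
        for line in status:
            if line.startswith("VmRSS:"):
                # Line looks like "VmRSS:     1234 kB"
                return int(line.split()[1])
    return 0  # kernel threads and zombies have no VmRSS line


def pause_if_bloated(pid, limit_kib):
    """Send SIGSTOP to pid if its RSS exceeds limit_kib; return True if paused."""
    if rss_kib(pid) > limit_kib:
        os.kill(pid, signal.SIGSTOP)
        return True
    return False


def resume(pid):
    """Let a paused process continue once the user has decided to keep it."""
    os.kill(pid, signal.SIGCONT)
```

A watchdog could call pause_if_bloated periodically for each job, notify the job's owner when it returns True, and then either call resume or kill the process depending on the answer.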

  -Lee


On Oct 9, 2007, at 11:04 AM, Wm. Josiah Erikson wrote:

> I may want to look into the way that our kernels are allocating  
> RAM, or, possibly, just make our swap space much smaller,  
> eliminating the possibility of having breve trying to run in swap  
> and having it just die instead, which is actually probably better.
>
>    -Josiah
>
>
>
> Lee Spector wrote:
>>
>> Thanks!
>>
>> Interesting... If your theory is right then this will only happen  
>> when the head node crashes, so maybe we should always check for  
>> that whenever rebooting the head.
>>
>> Unfortunately ballooning RAM consumption is probably always going  
>> to be a possibility not just with breve or with our particular  
>> simulations but in any evolutionary context with variable-sized  
>> programs/representations and/or variable-sized populations. There  
>> are ways to limit things but sometimes you don't know what  
>> dimension will grow or how much, it may depend on random seeds or  
>> other things, some limits will prevent evolutionary progress, etc.  
>> So it's something that we have to be able to recover from.
>>
>>  -Lee
>>
>>
>>
>>
>> On Oct 9, 2007, at 9:17 AM, Wm. Josiah Erikson wrote:
>>
>>> uh... that's clusterusers at lists.hampshire.edu obviously :)
>>>
>>> compute-1-3 was stuck in permanent iowait with a breve process  
>>> using up more RAM than the machine actually had (meaning it was  
>>> in swap). I think it probably got stuck when the head node  
>>> crashed and ended up with a stale file handle or something when  
>>> the home directory got pulled out from under it.
>>>
>>> I rebooted it :)
>>>
>>>    -Josiah
>>>
>>>
>>>
>>> Wm. Josiah Erikson wrote:
>>>> I have subscribed Adam; Brian was already subscribed, as are  
>>>> Kyle, Lee, Jaime, Chris, me, Michael, and a few others. I think  
>>>> helga can be removed from the discussion unless there is  
>>>> somebody else on that list that should know what's up with the  
>>>> cluster.
>>>>
>>>> clusterusers at hermes doesn't work anymore - that's old and the  
>>>> hostnames have changed. clusteruers at lists.hampshire.edu should  
>>>> be the proper address. We'll see if this gets through :)
>>>>
>>>> I'll go check on compute-1-3 and report back.
>>>>
>>>>    -Josiah
>>>>
>>>>
>>>>
>>>> Lee Spector wrote:
>>>>>
>>>>> It seems like a lot of people are using/concerned with the  
>>>>> status of the cluster these days, but maybe not all of them are  
>>>>> on the cluster-users list. Is that right? It'd be nice to clear  
>>>>> this up so that we're not having unsynced conversations among  
>>>>> those of us on the cluster list, the helga list, and maybe no  
>>>>> list, not to mention individual email side conversations. Can  
>>>>> we take care of this by subscribing helga to cluster-users and  
>>>>> making sure that all non-helga users and administrators are  
>>>>> individually subscribed?
>>>>>
>>>>> On an actual cluster-related note: Does anyone know in what  
>>>>> particular way compute-1-3 is currently hosed, such that it  
>>>>> hangs my cluster-looped scripts? To see what I mean you can try  
>>>>> 'cluster-fork hostname' (I'll include the output below since if  
>>>>> someone fixes compute-1-3 you won't see what I mean...).  
>>>>> Cluster-fork is smart enough to ignore some node pathologies  
>>>>> (e.g. compute-0-10 is currently down, and is properly skipped,  
>>>>> and compute-1-12 and compute-1-13 are refusing ssh connections,  
>>>>> and are properly skipped), but compute-1-3 requires a command- 
>>>>> C. Worse, and more important for me, is that command-C (or  
>>>>> other cluster-fork options) won't work for many of my scripts,  
>>>>> which use ./sh loops to run through the nodes.
>>>>>
>>>>> If someone has the access/knowledge/power to fix compute-1-3  
>>>>> then please do so and let us know. Or if you just have an idea  
>>>>> what might be up with compute-1-3 and how to prevent it then  
>>>>> please share.
>>>>>
>>>>> Thanks,
>>>>>
>>>>>  -Lee
>>>>>
>>>>> -----------
>>>>>
>>>>> [lspector at fly bin]$ cluster-fork hostname
>>>>> compute-0-1:
>>>>> compute-0-1.local
>>>>> compute-0-2:
>>>>> compute-0-2.local
>>>>> compute-0-3:
>>>>> compute-0-3.local
>>>>> compute-0-4:
>>>>> compute-0-4.local
>>>>> compute-0-5:
>>>>> compute-0-5.local
>>>>> compute-0-6:
>>>>> compute-0-6.local
>>>>> compute-0-7:
>>>>> compute-0-7.local
>>>>> compute-0-8:
>>>>> compute-0-8.local
>>>>> compute-0-9:
>>>>> compute-0-9.local
>>>>> compute-0-10: down
>>>>> compute-0-11:
>>>>> compute-0-11.local
>>>>> compute-0-12:
>>>>> compute-0-12.local
>>>>> compute-0-13:
>>>>> compute-0-13.local
>>>>> compute-0-14:
>>>>> compute-0-14.local
>>>>> compute-0-15:
>>>>> compute-0-15.local
>>>>> compute-0-16:
>>>>> compute-0-16.local
>>>>> compute-0-17:
>>>>> compute-0-17.local
>>>>> compute-0-18:
>>>>> compute-0-18.local
>>>>> compute-0-19:
>>>>> compute-0-19.local
>>>>> compute-0-20:
>>>>> compute-0-20.local
>>>>> compute-0-21:
>>>>> compute-0-21.local
>>>>> compute-0-22:
>>>>> compute-0-22.local
>>>>> compute-0-23:
>>>>> compute-0-23.local
>>>>> compute-1-1:
>>>>> compute-1-1.local
>>>>> compute-1-2:
>>>>> compute-1-2.local
>>>>> compute-1-3:             [ HANGS HERE, CTRL-C ALLOWS CONTINUATION ]
>>>>> compute-1-4:
>>>>> compute-1-4.local
>>>>> compute-1-5:
>>>>> compute-1-5.local
>>>>> compute-1-6:
>>>>> compute-1-6.local
>>>>> compute-1-7:
>>>>> compute-1-7.local
>>>>> compute-1-8:
>>>>> compute-1-8.local
>>>>> compute-1-9:
>>>>> compute-1-9.local
>>>>> compute-1-10:
>>>>> compute-1-10.local
>>>>> compute-1-11:
>>>>> compute-1-11.local
>>>>> compute-1-12:
>>>>> ssh: connect to host compute-1-12 port 22: Connection refused
>>>>> compute-1-13:
>>>>> ssh: connect to host compute-1-13 port 22: Connection refused
>>>>> compute-1-14:
>>>>> compute-1-14.local
>>>>> compute-1-15:
>>>>> compute-1-15.local
>>>>> compute-1-16:
>>>>> compute-1-16.local
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> -- 
>>>>> Lee Spector, Professor of Computer Science
>>>>> School of Cognitive Science, Hampshire College
>>>>> 893 West Street, Amherst, MA 01002-3359
>>>>> lspector at hampshire.edu, http://hampshire.edu/lspector/
>>>>> Phone: 413-559-5352, Fax: 413-559-5438
>>>>>
>>>>
>>>
>>> -- 
>>> Wm. Josiah Erikson
>>> Computing Support
>>> School of Cognitive Science
>>> Hampshire College
>>> Amherst, MA 01002
>>> (413) 559-6091
>>>
>>> _______________________________________________
>>> Clusterusers mailing list
>>> Clusterusers at lists.hampshire.edu
>>> http://lists.hampshire.edu/mailman/listinfo/clusterusers
>>

--
Lee Spector, Professor of Computer Science
School of Cognitive Science, Hampshire College
893 West Street, Amherst, MA 01002-3359
lspector at hampshire.edu, http://hampshire.edu/lspector/
Phone: 413-559-5352, Fax: 413-559-5438



