[Clusterusers] Re: cluster mailing lists & something funky on compute-1-3
Lee Spector
lspector at hampshire.edu
Tue Oct 9 09:43:30 EDT 2007
Thanks!
Interesting... If your theory is right then this will only happen
when the head node crashes, so maybe we should always check for that
whenever rebooting the head.
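
One way to run that check without the check itself hanging is to give every probe a hard timeout. What follows is a minimal Python sketch, assuming passwordless ssh from the head node and the node names that appear in the cluster-fork output quoted further down. The function name check_node and the 10-second timeout are illustrative choices, not part of any existing cluster tool.

```python
import subprocess


def check_node(node, timeout=10.0, command=None):
    """Classify one node as 'ok', 'failed', or 'hung'.

    `command` defaults to an ssh probe; it can be overridden (hypothetical
    hook) so the function can be exercised without a cluster.
    """
    if command is None:
        # BatchMode avoids interactive password prompts; ConnectTimeout
        # bounds the TCP connect. Neither helps if the login itself stalls
        # (e.g. on a stale home directory), hence the outer timeout below.
        command = ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=5",
                   node, "hostname"]
    try:
        result = subprocess.run(
            command,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,  # kills the local ssh client if it stalls
        )
    except subprocess.TimeoutExpired:
        return "hung"
    except OSError:
        # The probe command could not be started at all.
        return "failed"
    return "ok" if result.returncode == 0 else "failed"


if __name__ == "__main__":
    # Node names assumed from the cluster-fork listing in this thread.
    nodes = [f"compute-0-{i}" for i in range(1, 24)]
    nodes += [f"compute-1-{i}" for i in range(1, 17)]
    for node in nodes:
        print(f"{node}: {check_node(node)}")
```

This only keeps the loop moving past a stuck node and reports it; it does not repair a process that is wedged in I/O on the node itself.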
Unfortunately, ballooning RAM consumption is probably always going to
be a possibility, not just with breve or with our particular
simulations, but in any evolutionary context with variable-sized
programs/representations and/or variable-sized populations. There are
ways to limit things, but sometimes you don't know what dimension will
grow or by how much; it may depend on random seeds or other things, some
limits will prevent evolutionary progress, etc. So it's something
that we have to be able to recover from.
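
As one possible recovery aid, a run can be launched under an operating-system memory cap, so that a runaway job fails outright instead of pushing its node into swap. The sketch below assumes Linux and uses Python's standard resource module. The helper name run_capped and the command passed in the usage note are hypothetical, and nothing here is specific to breve.

```python
import resource
import subprocess
import sys


def run_capped(command, max_bytes, timeout=None):
    """Run `command` with its virtual address space capped at `max_bytes`.

    Allocations beyond the cap fail inside the child, so the job dies
    with an error rather than swapping indefinitely.
    """
    def apply_limit():
        # Runs in the child just before exec. Never ask for more than
        # the hard limit already in force, or setrlimit would fail.
        _, hard = resource.getrlimit(resource.RLIMIT_AS)
        limit = max_bytes
        if hard != resource.RLIM_INFINITY:
            limit = min(max_bytes, hard)
        resource.setrlimit(resource.RLIMIT_AS, (limit, limit))

    return subprocess.run(command, preexec_fn=apply_limit, timeout=timeout)


if __name__ == "__main__":
    # Usage: python run_capped.py <gigabytes> <command> [args...]
    cap = int(float(sys.argv[1]) * 1024 ** 3)
    sys.exit(run_capped(sys.argv[2:], cap).returncode)
```

For example, `python run_capped.py 2 ./my_simulation` (a hypothetical script and command name) would stop the simulation once it tried to grow past 2 GB. The cap trades a hung node for a failed run, so it suits cases where a crash is cheaper than a manual reboot; it does not help choose a limit that still allows evolutionary progress.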
-Lee
On Oct 9, 2007, at 9:17 AM, Wm. Josiah Erikson wrote:
> uh... that's clusterusers at lists.hampshire.edu obviously :)
>
> compute-1-3 was stuck in permanent iowait with a breve process
> using up more RAM than the machine actually had (meaning it was in
> swap). I think it probably got stuck when the head node crashed and
> ended up with a stale file handle or something when the home
> directory got pulled out from under it.
>
> I rebooted it :)
>
> -Josiah
>
>
>
> Wm. Josiah Erikson wrote:
>> I have subscribed Adam. Brian was already subscribed, as are Kyle,
>> Lee, Jaime, Chris, me, Michael, and a few others. I think helga
>> can be removed from the discussion unless there is somebody else
>> on that list that should know what's up with the cluster.
>>
>> clusterusers at hermes doesn't work anymore - that's old and the
>> hostnames have changed. clusteruers at lists.hampshire.edu should be
>> the proper address. We'll see if this gets through :)
>>
>> I'll go check on compute-1-3 and report back.
>>
>> -Josiah
>>
>>
>>
>> Lee Spector wrote:
>>>
>>> It seems like a lot of people are using/concerned with the status
>>> of the cluster these days, but maybe not all of them are on the
>>> cluster-users list. Is that right? It'd be nice to clear this up
>>> so that we're not having unsynced conversations among those of us
>>> on the cluster list, the helga list, and maybe no list, not to
>>> mention individual email side conversations. Can we take care of
>>> this by subscribing helga to cluster-users and making sure that
>>> all non-helga users and administrators are individually subscribed?
>>>
>>> On an actual cluster-related note: Does anyone know in what
>>> particular way compute-1-3 is currently hosed, such that it hangs
>>> my cluster-looped scripts? To see what I mean you can try
>>> 'cluster-fork hostname' (I'll include the output below since if
>>> someone fixes compute-1-3 you won't see what I mean...). Cluster-
>>> fork is smart enough to ignore some node pathologies (e.g.
>>> compute-0-10 is currently down, and is properly skipped, and
>>> compute-1-12 and compute-1-13 are refusing ssh connections, and
>>> are properly skipped), but compute-1-3 requires a command-C.
>>> Worse, and more important for me, is that command-C (or other
>>> cluster-fork options) won't work for many of my scripts, which
>>> use ./sh loops to run through the nodes.
>>>
>>> If someone has the access/knowledge/power to fix compute-1-3 then
>>> please do so and let us know. Or if you just have an idea what
>>> might be up with compute-1-3 and how to prevent it then please
>>> share.
>>>
>>> Thanks,
>>>
>>> -Lee
>>>
>>> -----------
>>>
>>> [lspector at fly bin]$ cluster-fork hostname
>>> compute-0-1:
>>> compute-0-1.local
>>> compute-0-2:
>>> compute-0-2.local
>>> compute-0-3:
>>> compute-0-3.local
>>> compute-0-4:
>>> compute-0-4.local
>>> compute-0-5:
>>> compute-0-5.local
>>> compute-0-6:
>>> compute-0-6.local
>>> compute-0-7:
>>> compute-0-7.local
>>> compute-0-8:
>>> compute-0-8.local
>>> compute-0-9:
>>> compute-0-9.local
>>> compute-0-10: down
>>> compute-0-11:
>>> compute-0-11.local
>>> compute-0-12:
>>> compute-0-12.local
>>> compute-0-13:
>>> compute-0-13.local
>>> compute-0-14:
>>> compute-0-14.local
>>> compute-0-15:
>>> compute-0-15.local
>>> compute-0-16:
>>> compute-0-16.local
>>> compute-0-17:
>>> compute-0-17.local
>>> compute-0-18:
>>> compute-0-18.local
>>> compute-0-19:
>>> compute-0-19.local
>>> compute-0-20:
>>> compute-0-20.local
>>> compute-0-21:
>>> compute-0-21.local
>>> compute-0-22:
>>> compute-0-22.local
>>> compute-0-23:
>>> compute-0-23.local
>>> compute-1-1:
>>> compute-1-1.local
>>> compute-1-2:
>>> compute-1-2.local
>>> compute-1-3: [ HANGS HERE, CTRL-C ALLOWS CONTINUATION ]
>>> compute-1-4:
>>> compute-1-4.local
>>> compute-1-5:
>>> compute-1-5.local
>>> compute-1-6:
>>> compute-1-6.local
>>> compute-1-7:
>>> compute-1-7.local
>>> compute-1-8:
>>> compute-1-8.local
>>> compute-1-9:
>>> compute-1-9.local
>>> compute-1-10:
>>> compute-1-10.local
>>> compute-1-11:
>>> compute-1-11.local
>>> compute-1-12:
>>> ssh: connect to host compute-1-12 port 22: Connection refused
>>> compute-1-13:
>>> ssh: connect to host compute-1-13 port 22: Connection refused
>>> compute-1-14:
>>> compute-1-14.local
>>> compute-1-15:
>>> compute-1-15.local
>>> compute-1-16:
>>> compute-1-16.local
>>>
>>>
>>>
>>>
>>>
>>> --
>>> Lee Spector, Professor of Computer Science
>>> School of Cognitive Science, Hampshire College
>>> 893 West Street, Amherst, MA 01002-3359
>>> lspector at hampshire.edu, http://hampshire.edu/lspector/
>>> Phone: 413-559-5352, Fax: 413-559-5438
>>>
>>
>
> --
> Wm. Josiah Erikson
> Computing Support
> School of Cognitive Science
> Hampshire College
> Amherst, MA 01002
> (413) 559-6091
>
> _______________________________________________
> Clusterusers mailing list
> Clusterusers at lists.hampshire.edu
> http://lists.hampshire.edu/mailman/listinfo/clusterusers
--
Lee Spector, Professor of Computer Science
School of Cognitive Science, Hampshire College
893 West Street, Amherst, MA 01002-3359
lspector at hampshire.edu, http://hampshire.edu/lspector/
Phone: 413-559-5352, Fax: 413-559-5438