[Clusterusers] New nodes

Thomas Helmuth thelmuth at cs.umass.edu
Fri Jan 24 15:46:06 EST 2014


Good to know it's been fixed, thanks! I'll probably restart just the runs
that got killed, since it should work alright with my experiments. Is there
an easy way to find out which runs were killed this way?

-Tom


On Fri, Jan 24, 2014 at 3:28 PM, Wm. Josiah Erikson <wjerikson at hampshire.edu> wrote:

>  I'm sending this to the list, as the list may benefit from this knowledge:
>
> Ah! I looked through the logs and figured it out. Here's what happened. I
> think it's probably suitably random, but I'll leave that up to you:
>
> -Owen (an animation student working with Perry, who I've added to this
> list) spooled a render that tickled a bug in Maya that caused a simple,
> routine render to chew up outrageous amounts of RAM
> -Several nodes ran out of RAM
> -The oom-killer (out of memory killer) killed the renders and/or your jobs
>
> The oom-killer doesn't just kill the process that is using the most RAM -
> it's more random than that, though it does prefer processes that are using
> more RAM. I don't pretend to remember the exact details, but we could look
> them up if it's important. However, I am pretty sure that the oom-killer
> killing your jobs is probably suitably random for you to just restart those
> ones.... but I'm not you, or your paper reviewers :)
>
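> For reference, one way to check which runs were hit (assuming the nodes keep
> kernel messages in the dmesg ring buffer or in /var/log/messages - that's an
> assumption about this cluster's logging, not something I've verified) is to
> grep the kernel log for oom-killer entries, roughly:
>
>     # log locations are assumptions; adjust for wherever syslog writes on
>     # these nodes, and reading /var/log/messages may need root
>     dmesg | grep -iE 'out of memory|killed process'
>     grep -iE 'out of memory|killed process' /var/log/messages
>
> The "Killed process <pid> (<name>)" lines include the PID and command name,
> which should match up against the runs that died.
>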
> We found the bug and also a workaround yesterday, so those jobs are using
> reasonable amounts of RAM now.
>
> All the best,
>
>     -Josiah
>
>
>
> On 1/24/14 3:09 PM, Thomas Helmuth wrote:
>
>  I'm seeing it on a variety of nodes, but all are on rack 2. All of the
> killed runs I'm seeing stopped between 3 and 7 AM yesterday morning, though
> that could just be coincidence since I started them Wednesday afternoon.
>
>  -Tom
>
>
> On Fri, Jan 24, 2014 at 2:55 PM, Wm. Josiah Erikson <
> wjerikson at hampshire.edu> wrote:
>
>>  That's very odd. What nodes did that happen on?
>> Killed usually means something actually killed it - either it was
>> pre-empted by tractor, killed by a script, or killed manually - it doesn't
>> usually mean crashed.
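>>
>> A process that gets SIGKILLed normally exits with status 137 (128 + 9), so
>> if tractor records the task's exit code - I'm assuming it does; I haven't
>> checked - that's one way to tell a kill apart from an ordinary crash. A
>> minimal sketch, assuming a bash wrapper around the run; the command name
>> below is just a placeholder:
>>
>>     run_the_job_somehow             # placeholder for the actual run command
>>     status=$?
>>     if [ "$status" -eq 137 ]; then  # 128 + SIGKILL(9)
>>         echo "run was killed, not crashed"
>>     fi
>>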
>>     -Josiah
>>
>>
>> On 1/24/14 2:52 PM, Thomas Helmuth wrote:
>>
>>  No problem about the 1-6 power cord.
>>
>>  Any idea what killed the other runs I mentioned in my second email? I
>> might have to restart that whole set of runs to make sure the stats aren't
>> affected, and it would be nice to know that they won't crash as well.
>>
>>  -Tom
>>
>>
>> On Fri, Jan 24, 2014 at 10:53 AM, Thomas Helmuth <thelmuth at cs.umass.edu> wrote:
>>
>>>   Hi Josiah,
>>>
>>>  Probably unrelated, but it looks like yesterday morning I had some runs
>>> stop before they were done, without printing any errors that I can find. The
>>> only clue is that when I select the run in tractor, it says something
>>> like:
>>>
>>> /bin/sh: line 1: 23661 Killed /share/apps/bin/lein with-profiles
>>> production trampoline run clojush.examples.wc :use-lexicase-selection false
>>> > ../Results/GECCO14/wc/tourney-max-points-1000-two/log1.txt
>>>
>>>  Most of that is just the shell command used to launch the run, but do
>>> you know what that beginning part means, or have any idea why these died
>>> quietly?
>>>
>>>  Thanks,
>>> Tom
>>>
>>>
>>> On Fri, Jan 24, 2014 at 10:36 AM, Thomas Helmuth <thelmuth at cs.umass.edu> wrote:
>>>
>>>>   Hi Josiah,
>>>>
>>>>  Sounds good! At first I was worried this may have crashed some of my
>>>> runs, but it looks like they just never started when they tried to run on
>>>> compute-1-18 around the time of the NIMBY, so it's not a big deal.
>>>>
>>>>  I did just have two runs crash on compute-1-6; any idea if that's
>>>> related?
>>>>
>>>>  We are at crunch time (paper deadline Jan. 29), and the runs I have
>>>> going are pretty important for that, so it would be great to have minimal
>>>> disruption until then!
>>>>
>>>>  -Tom
>>>>
>>>>
>>>> On Fri, Jan 24, 2014 at 9:13 AM, Wm. Josiah Erikson <
>>>> wjerikson at hampshire.edu> wrote:
>>>>
>>>>> Hi guys,
>>>>>     I've got compute-1-17 and compute-1-18 NIMBYed for now because I'm
>>>>> going to move them to make room (physically) in rack 1 for 8 new nodes!
>>>>> (each node has 2x X5560 quad-core 2.8GHz processors and 24GB of RAM)
>>>>>     Just so nobody un-NIMBY's them.
>>>>>
>>>>> --
>>>>> Wm. Josiah Erikson
>>>>> Assistant Director of IT, Infrastructure Group
>>>>> System Administrator, School of CS
>>>>> Hampshire College
>>>>> Amherst, MA 01002
>>>>> (413) 559-6091
>>>>>
>>>>> _______________________________________________
>>>>> Clusterusers mailing list
>>>>> Clusterusers at lists.hampshire.edu
>>>>> https://lists.hampshire.edu/mailman/listinfo/clusterusers
>>>>>
>>>>
>>>>
>>>
>>
>> --
>> Wm. Josiah Erikson
>> Assistant Director of IT, Infrastructure Group
>> System Administrator, School of CS
>> Hampshire College
>> Amherst, MA 01002
>> (413) 559-6091
>>
>>
>
> --
> Wm. Josiah Erikson
> Assistant Director of IT, Infrastructure Group
> System Administrator, School of CS
> Hampshire College
> Amherst, MA 01002
> (413) 559-6091
>
>
>
>