[Clusterusers] New nodes

Thomas Helmuth thelmuth at cs.umass.edu
Fri Jan 24 16:01:54 EST 2014


Perfect -- thanks!
-Tom


On Fri, Jan 24, 2014 at 3:56 PM, Wm. Josiah Erikson <wjerikson at hampshire.edu> wrote:

> [josiah at fly J1401220065]$ pwd
> /helga/public_html/tractor/cmd-logs/thelmuth/J1401220065
> [josiah at fly J1401220065]$ grep Killed *
> T10.log:/bin/sh: line 1: 21228 Killed /share/apps/bin/lein with-profiles production trampoline run clojush.examples.wc :use-lexicase-selection false > ../Results/GECCO14/wc/tourney-max-points-1000-two/log9.txt
> T11.log:/bin/sh: line 1: 29350 Killed /share/apps/bin/lein with-profiles production trampoline run clojush.examples.wc :use-lexicase-selection false > ../Results/GECCO14/wc/tourney-max-points-1000-two/log10.txt
> T15.log:/bin/sh: line 1: 12113 Killed /share/apps/bin/lein with-profiles production trampoline run clojush.examples.wc :use-lexicase-selection false > ../Results/GECCO14/wc/tourney-max-points-1000-two/log14.txt
> T2.log:/bin/sh: line 1: 23661 Killed /share/apps/bin/lein with-profiles production trampoline run clojush.examples.wc :use-lexicase-selection false > ../Results/GECCO14/wc/tourney-max-points-1000-two/log1.txt
> T32.log:/bin/sh: line 1: 1598 Killed /share/apps/bin/lein with-profiles production trampoline run clojush.examples.wc :use-lexicase-selection false > ../Results/GECCO14/wc/tourney-max-points-1000-two/log31.txt
> T39.log:/bin/sh: line 1: 20134 Killed /share/apps/bin/lein with-profiles production trampoline run clojush.examples.wc :use-lexicase-selection false > ../Results/GECCO14/wc/tourney-max-points-1000-two/log38.txt
> T42.log:/bin/sh: line 1: 23599 Killed /share/apps/bin/lein with-profiles production trampoline run clojush.examples.wc :use-lexicase-selection false > ../Results/GECCO14/wc/tourney-max-points-1000-two/log41.txt
> T6.log:/bin/sh: line 1: 19407 Killed /share/apps/bin/lein with-profiles production trampoline run clojush.examples.wc :use-lexicase-selection false > ../Results/GECCO14/wc/tourney-max-points-1000-two/log5.txt
> T9.log:/bin/sh: line 1: 11738 Killed /share/apps/bin/lein with-profiles production trampoline run clojush.examples.wc :use-lexicase-selection false > ../Results/GECCO14/wc/tourney-max-points-1000-two/log8.txt
> [josiah at fly J1401220065]$
>
> So tasks 2, 9, 10, 11, 15, etc. got killed for that job (Job ID 1401220065).
>
> Does that make sense? You can do a similar thing for any other Job IDs
> you're worried about - go to
> /helga/public_html/tractor/cmd-logs/thelmuth/J<jobid> and do the same, or
> extrapolate whatever procedure you like from that :)
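>
> A minimal sketch of that, looped over several job IDs (the second ID here
> is just a hypothetical placeholder):
>
> for jobid in 1401220065 1401220066; do
>     echo "=== J$jobid ==="
>     # -l prints only the names of the task logs that contain "Killed"
>     grep -l Killed /helga/public_html/tractor/cmd-logs/thelmuth/J$jobid/*.log
> done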
>
>     -Josiah
>
>
>
>
>
> On 1/24/14 3:46 PM, Thomas Helmuth wrote:
>
>  Good to know it's been fixed, thanks! I'll probably restart just the runs
> that got killed, since it should work alright with my experiments. Is there
> an easy way to find out which runs were killed this way?
>
>  -Tom
>
>
> On Fri, Jan 24, 2014 at 3:28 PM, Wm. Josiah Erikson <wjerikson at hampshire.edu> wrote:
>
>>  I'm sending this to the list, as the list may benefit from this
>> knowledge:
>>
>> Ah! I looked through the logs and figured it out. Here's what happened. I
>> think it's probably suitably random, but I'll leave that up to you:
>>
>> -Owen (an animation student working with Perry, who I've added to this
>> list) spooled a render that tickled a bug in Maya that caused a simple,
>> routine render to chew up outrageous amounts of RAM
>> -Several nodes ran out of RAM
>> -The oom-killer (out of memory killer) killed the renders and/or your jobs
>>
>> The oom-killer doesn't just kill the process that is using the most RAM -
>> it's more random than that, though it does prefer processes that are using
>> more RAM. I don't pretend to remember the exact details, but we could look
>> them up if it's important. However, I am pretty sure that the oom-killer
>> killing your jobs is probably suitably random for you to just restart
>> those runs... but I'm not you, or your paper reviewers :)
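>>
>> If you want to confirm oom-killer activity on a given node, a minimal
>> sketch, assuming standard Linux kernel logging (run on the node itself):
>>
>> # oom-killer events show up in the kernel ring buffer...
>> dmesg | grep -i -e 'oom' -e 'killed process'
>> # ...and persist in syslog on most distributions:
>> grep -i 'out of memory' /var/log/messages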
>>
>> We found the bug and also a workaround yesterday, so those jobs are using
>> reasonable amounts of RAM now.
>>
>> All the best,
>>
>>     -Josiah
>>
>>
>>
>> On 1/24/14 3:09 PM, Thomas Helmuth wrote:
>>
>>  I'm seeing it on a variety of nodes, but all are on rack 2. All of the
>> killed runs I'm seeing stopped between 3 and 7 AM yesterday morning, though
>> that could just be coincidence since I started them Wednesday afternoon.
>>
>>  -Tom
>>
>>
>> On Fri, Jan 24, 2014 at 2:55 PM, Wm. Josiah Erikson <wjerikson at hampshire.edu> wrote:
>>
>>>  That's very odd. What nodes did that happen on?
>>> Killed usually means something actually killed it - either it was
>>> pre-empted by tractor, killed by a script, or killed manually - it doesn't
>>> usually mean crashed.
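>>>
>>> A quick way to see that "Killed" is the shell reporting a SIGKILL rather
>>> than a crash (this sketch just kills a sleep in an interactive bash):
>>>
>>> $ sleep 60 &
>>> $ kill -9 %1
>>> $ wait
>>> [1]+  Killed                  sleep 60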
>>>     -Josiah
>>>
>>>
>>> On 1/24/14 2:52 PM, Thomas Helmuth wrote:
>>>
>>>  No problem about the 1-6 power cord.
>>>
>>>  Any idea what killed the other runs I mentioned in my second email? I
>>> might have to restart that whole set of runs to make sure the stats aren't
>>> affected, and it would be nice to know that they won't crash as well.
>>>
>>>  -Tom
>>>
>>>
>>> On Fri, Jan 24, 2014 at 10:53 AM, Thomas Helmuth <thelmuth at cs.umass.edu> wrote:
>>>
>>>>   Hi Josiah,
>>>>
>>>>  Probably unrelated, but it looks like yesterday morning I had some
>>>> runs stop before they were done, without printing any errors I can find.
>>>> The only clue is that when I select the run in tractor, it says
>>>> something like:
>>>>
>>>> /bin/sh: line 1: 23661 Killed /share/apps/bin/lein with-profiles production trampoline run clojush.examples.wc :use-lexicase-selection false > ../Results/GECCO14/wc/tourney-max-points-1000-two/log1.txt
>>>>
>>>>  Most of that is just the shell command used to launch the run, but do
>>>> you know what that beginning part means, or have any idea why these died
>>>> quietly?
>>>>
>>>>  Thanks,
>>>> Tom
>>>>
>>>>
>>>> On Fri, Jan 24, 2014 at 10:36 AM, Thomas Helmuth <thelmuth at cs.umass.edu> wrote:
>>>>
>>>>>   Hi Josiah,
>>>>>
>>>>>  Sounds good! At first I was worried this may have crashed some of my
>>>>> runs, but it looks like they just never started when they tried to run on
>>>>> compute-1-18 around the time of the NIMBY, so it's not a big deal.
>>>>>
>>>>>  I did just have two runs crash on compute-1-6, any idea if that's
>>>>> related?
>>>>>
>>>>>  We are at crunch time (paper deadline Jan. 29), and the runs I have
>>>>> going are pretty important for that, so it would be great to have minimal
>>>>> disruption until then!
>>>>>
>>>>>  -Tom
>>>>>
>>>>>
>>>>> On Fri, Jan 24, 2014 at 9:13 AM, Wm. Josiah Erikson <wjerikson at hampshire.edu> wrote:
>>>>>
>>>>>> Hi guys,
>>>>>>     I've got compute-1-17 and compute-1-18 NIMBYed for now because
>>>>>> I'm going to move them (physically) to rack 1 to make room for 8 new
>>>>>> nodes! (Each node has 2x X5560 quad-core 2.8GHz processors and 24GB
>>>>>> of RAM.)
>>>>>>     Just so nobody un-NIMBYs them.
>>>>>>
>>>>>> --
>>>>>> Wm. Josiah Erikson
>>>>>> Assistant Director of IT, Infrastructure Group
>>>>>> System Administrator, School of CS
>>>>>> Hampshire College
>>>>>> Amherst, MA 01002
>>>>>> (413) 559-6091
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>>
>>
>
>
>
>
> _______________________________________________
> Clusterusers mailing list
> Clusterusers at lists.hampshire.edu
> https://lists.hampshire.edu/mailman/listinfo/clusterusers
>
>