[Clusterusers] New nodes
Wm. Josiah Erikson
wjerikson at hampshire.edu
Fri Jan 24 15:56:32 EST 2014
[josiah at fly J1401220065]$ pwd
/helga/public_html/tractor/cmd-logs/thelmuth/J1401220065
[josiah at fly J1401220065]$ grep Killed *
T10.log:/bin/sh: line 1: 21228 Killed /share/apps/bin/lein with-profiles
production trampoline run clojush.examples.wc :use-lexicase-selection
false > ../Results/GECCO14/wc/tourney-max-points-1000-two/log9.txt
T11.log:/bin/sh: line 1: 29350 Killed /share/apps/bin/lein with-profiles
production trampoline run clojush.examples.wc :use-lexicase-selection
false > ../Results/GECCO14/wc/tourney-max-points-1000-two/log10.txt
T15.log:/bin/sh: line 1: 12113 Killed /share/apps/bin/lein with-profiles
production trampoline run clojush.examples.wc :use-lexicase-selection
false > ../Results/GECCO14/wc/tourney-max-points-1000-two/log14.txt
T2.log:/bin/sh: line 1: 23661 Killed /share/apps/bin/lein with-profiles
production trampoline run clojush.examples.wc :use-lexicase-selection
false > ../Results/GECCO14/wc/tourney-max-points-1000-two/log1.txt
T32.log:/bin/sh: line 1: 1598 Killed /share/apps/bin/lein with-profiles
production trampoline run clojush.examples.wc :use-lexicase-selection
false > ../Results/GECCO14/wc/tourney-max-points-1000-two/log31.txt
T39.log:/bin/sh: line 1: 20134 Killed /share/apps/bin/lein with-profiles
production trampoline run clojush.examples.wc :use-lexicase-selection
false > ../Results/GECCO14/wc/tourney-max-points-1000-two/log38.txt
T42.log:/bin/sh: line 1: 23599 Killed /share/apps/bin/lein with-profiles
production trampoline run clojush.examples.wc :use-lexicase-selection
false > ../Results/GECCO14/wc/tourney-max-points-1000-two/log41.txt
T6.log:/bin/sh: line 1: 19407 Killed /share/apps/bin/lein with-profiles
production trampoline run clojush.examples.wc :use-lexicase-selection
false > ../Results/GECCO14/wc/tourney-max-points-1000-two/log5.txt
T9.log:/bin/sh: line 1: 11738 Killed /share/apps/bin/lein with-profiles
production trampoline run clojush.examples.wc :use-lexicase-selection
false > ../Results/GECCO14/wc/tourney-max-points-1000-two/log8.txt
[josiah at fly J1401220065]$
So 2, 9, 10, 11, 15, etc. for that job (Job ID 1401220065).
Does that make sense? You can do a similar thing for other Job IDs you're
worried about - go to
/helga/public_html/tractor/cmd-logs/thelmuth/J<jobid> and do the same,
or extrapolate whatever procedure you like from that :)
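
For example, a quick sketch (the second job ID below is made up just to
show the loop - substitute whatever job IDs you actually ran):

    # hypothetical job IDs - replace with your own
    for jobid in 1401220065 1401229999; do
        echo "== J$jobid =="
        grep -l Killed /helga/public_html/tractor/cmd-logs/thelmuth/J$jobid/T*.log
    done

grep -l just prints the names of the log files that contain a match, so you
get the T<n>.log files for the tasks that were killed.
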
-Josiah
On 1/24/14 3:46 PM, Thomas Helmuth wrote:
> Good to know it's been fixed, thanks! I'll probably restart just the
> runs that got killed, since it should work alright with my
> experiments. Is there an easy way to find out which runs were killed
> this way?
>
> -Tom
>
>
> On Fri, Jan 24, 2014 at 3:28 PM, Wm. Josiah Erikson
> <wjerikson at hampshire.edu> wrote:
>
> I'm sending this to the list, as the list may benefit from this
> knowledge:
>
> Ah! I looked through the logs and figured it out. Here's what
> happened. I think it's probably suitably random, but I'll leave
> that up to you:
>
> -Owen (an animation student working with Perry, who I've added to
> this list) spooled a render that tickled a bug in Maya that caused
> a simple, routine render to chew up outrageous amounts of RAM
> -Several nodes ran out of RAM
> -The oom-killer (out-of-memory killer) killed the renders and/or
> your jobs
>
> The oom-killer doesn't just kill the process that is using the
> most RAM - it's more random than that, though it does prefer
> processes that are using more RAM. I don't pretend to remember the
> exact details, but we could look them up if it's important.
> However, I am pretty sure that the oom-killer killing your jobs is
> probably suitably random for you to just restart those ones....
> but I'm not you, or your paper reviewers :)
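>
> (If you want to double-check on a given node: the kernel logs what the
> oom-killer did, so something like the following - a sketch from memory,
> the exact message wording varies by kernel version, and this assumes the
> node keeps kernel messages in /var/log/messages - should show it:
>
>     dmesg | grep -i 'killed process'
>     grep -i 'out of memory' /var/log/messages
>
> and /proc/<pid>/oom_score for a running process gives a rough sense of
> how attractive it looks to the oom-killer.)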
>
> We found the bug and also a workaround yesterday, so those jobs
> are using reasonable amounts of RAM now.
>
> All the best,
>
> -Josiah
>
>
>
> On 1/24/14 3:09 PM, Thomas Helmuth wrote:
>> I'm seeing it on a variety of nodes, but all are on rack 2. All
>> of the killed runs I'm seeing stopped between 3 and 7 AM
>> yesterday morning, though that could just be coincidence since I
>> started them Wednesday afternoon.
>>
>> -Tom
>>
>>
>> On Fri, Jan 24, 2014 at 2:55 PM, Wm. Josiah Erikson
>> <wjerikson at hampshire.edu> wrote:
>>
>> That's very odd. What nodes did that happen on?
>> Killed usually means something actually killed it - either it
>> was pre-empted by tractor, killed by a script, or killed
>> manually - it doesn't usually mean crashed.
>> -Josiah
>>
>>
>> On 1/24/14 2:52 PM, Thomas Helmuth wrote:
>>> No problem about the 1-6 power cord.
>>>
>>> Any idea what killed the other runs I mentioned in my second
>>> email? I might have to restart that whole set of runs to
>>> make sure the stats aren't affected, and it would be nice to
>>> know that they won't crash as well.
>>>
>>> -Tom
>>>
>>>
>>> On Fri, Jan 24, 2014 at 10:53 AM, Thomas Helmuth
>>> <thelmuth at cs.umass.edu> wrote:
>>>
>>> Hi Josiah,
>>>
>>> Probably unrelated, but it looks like yesterday morning
>>> I had some runs stop before they were done, without
>>> printing any errors that I can find. The only thing I
>>> see is that when I select the run in tractor it says
>>> something like:
>>>
>>> /bin/sh: line 1: 23661 Killed /share/apps/bin/lein
>>> with-profiles production trampoline run
>>> clojush.examples.wc :use-lexicase-selection false >
>>> ../Results/GECCO14/wc/tourney-max-points-1000-two/log1.txt
>>>
>>> Most of that is just the shell command used to launch
>>> the run, but do you know what that beginning part means,
>>> or have any idea why these died quietly?
>>>
>>> Thanks,
>>> Tom
>>>
>>>
>>> On Fri, Jan 24, 2014 at 10:36 AM, Thomas Helmuth
>>> <thelmuth at cs.umass.edu> wrote:
>>>
>>> Hi Josiah,
>>>
>>> Sounds good! At first I was worried this may have
>>> crashed some of my runs, but it looks like they just
>>> never started when they tried to run on compute-1-18
>>> around the time of the NIMBY, so it's not a big deal.
>>>
>>> I did just have two runs crash on compute-1-6 - any
>>> idea if that's related?
>>>
>>> We are at crunch time (paper deadline Jan. 29), and
>>> the runs I have going are pretty important for that,
>>> so it would be great to have minimal disruption
>>> until then!
>>>
>>> -Tom
>>>
>>>
>>> On Fri, Jan 24, 2014 at 9:13 AM, Wm. Josiah Erikson
>>> <wjerikson at hampshire.edu> wrote:
>>>
>>> Hi guys,
>>> I've got compute-1-17 and compute-1-18
>>> NIMBYed for now because I'm going to move them
>>> (physically) to rack 1 to make room for 8 new
>>> nodes! (Each node has 2x X5560 quad-core 2.8GHz
>>> processors and 24GB of RAM.) Just so nobody
>>> un-NIMBYs them.
>>>
>>> --
>>> Wm. Josiah Erikson
>>> Assistant Director of IT, Infrastructure Group
>>> System Administrator, School of CS
>>> Hampshire College
>>> Amherst, MA 01002
>>> (413) 559-6091
>>>
>>>
>>>
>>>
>>>
>>
>> --
>> Wm. Josiah Erikson
>> Assistant Director of IT, Infrastructure Group
>> System Administrator, School of CS
>> Hampshire College
>> Amherst, MA 01002
>> (413) 559-6091
>>
>>
>
> --
> Wm. Josiah Erikson
> Assistant Director of IT, Infrastructure Group
> System Administrator, School of CS
> Hampshire College
> Amherst, MA 01002
> (413) 559-6091
>
>
>
>
>
>
> _______________________________________________
> Clusterusers mailing list
> Clusterusers at lists.hampshire.edu
> https://lists.hampshire.edu/mailman/listinfo/clusterusers
--
Wm. Josiah Erikson
Assistant Director of IT, Infrastructure Group
System Administrator, School of CS
Hampshire College
Amherst, MA 01002
(413) 559-6091