[Clusterusers] New nodes

Wm. Josiah Erikson wjerikson at hampshire.edu
Fri Jan 24 15:56:32 EST 2014


[josiah at fly J1401220065]$ pwd
/helga/public_html/tractor/cmd-logs/thelmuth/J1401220065
[josiah at fly J1401220065]$ grep Killed *
T10.log:/bin/sh: line 1: 21228 Killed /share/apps/bin/lein with-profiles 
production trampoline run clojush.examples.wc :use-lexicase-selection 
false > ../Results/GECCO14/wc/tourney-max-points-1000-two/log9.txt
T11.log:/bin/sh: line 1: 29350 Killed /share/apps/bin/lein with-profiles 
production trampoline run clojush.examples.wc :use-lexicase-selection 
false > ../Results/GECCO14/wc/tourney-max-points-1000-two/log10.txt
T15.log:/bin/sh: line 1: 12113 Killed /share/apps/bin/lein with-profiles 
production trampoline run clojush.examples.wc :use-lexicase-selection 
false > ../Results/GECCO14/wc/tourney-max-points-1000-two/log14.txt
T2.log:/bin/sh: line 1: 23661 Killed /share/apps/bin/lein with-profiles 
production trampoline run clojush.examples.wc :use-lexicase-selection 
false > ../Results/GECCO14/wc/tourney-max-points-1000-two/log1.txt
T32.log:/bin/sh: line 1:  1598 Killed /share/apps/bin/lein with-profiles 
production trampoline run clojush.examples.wc :use-lexicase-selection 
false > ../Results/GECCO14/wc/tourney-max-points-1000-two/log31.txt
T39.log:/bin/sh: line 1: 20134 Killed /share/apps/bin/lein with-profiles 
production trampoline run clojush.examples.wc :use-lexicase-selection 
false > ../Results/GECCO14/wc/tourney-max-points-1000-two/log38.txt
T42.log:/bin/sh: line 1: 23599 Killed /share/apps/bin/lein with-profiles 
production trampoline run clojush.examples.wc :use-lexicase-selection 
false > ../Results/GECCO14/wc/tourney-max-points-1000-two/log41.txt
T6.log:/bin/sh: line 1: 19407 Killed /share/apps/bin/lein with-profiles 
production trampoline run clojush.examples.wc :use-lexicase-selection 
false > ../Results/GECCO14/wc/tourney-max-points-1000-two/log5.txt
T9.log:/bin/sh: line 1: 11738 Killed /share/apps/bin/lein with-profiles 
production trampoline run clojush.examples.wc :use-lexicase-selection 
false > ../Results/GECCO14/wc/tourney-max-points-1000-two/log8.txt
[josiah at fly J1401220065]$

So that's tasks 2, 6, 9, 10, 11, 15, 32, 39, and 42 for that job (Job ID 1401220065).

Does that make sense? You can do a similar thing for any other Job IDs 
you're worried about - go to 
/helga/public_html/tractor/cmd-logs/thelmuth/J<jobid> and do the same, 
or extrapolate whatever procedure you like from that :)
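
If you'd rather not eyeball the grep output, here's a rough sketch 
(untested, but it's only grep/sed/sort) that prints just the task numbers 
whose logs contain "Killed":

cd /helga/public_html/tractor/cmd-logs/thelmuth/J1401220065
# list the T*.log files that mention "Killed", then strip them down to the task number
grep -l Killed T*.log | sed 's/^T\([0-9]*\)\.log$/\1/' | sort -n

Swap in whatever J<jobid> you care about.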

     -Josiah




On 1/24/14 3:46 PM, Thomas Helmuth wrote:
> Good to know it's been fixed, thanks! I'll probably restart just the 
> runs that got killed, since it should work alright with my 
> experiments. Is there an easy way to find out which runs were killed 
> this way?
>
> -Tom
>
>
> On Fri, Jan 24, 2014 at 3:28 PM, Wm. Josiah Erikson 
> <wjerikson at hampshire.edu> wrote:
>
>     I'm sending this to the list, as the list may benefit from this
>     knowledge:
>
>     Ah! I looked through the logs and figured it out. Here's what
>     happened. I think it's probably suitably random, but I'll leave
>     that up to you:
>
>     -Owen (an animation student working with Perry, who I've added to
>     this list) spooled a render that tickled a bug in Maya that caused
>     a simple, routine render to chew up outrageous amounts of RAM
>     -Several nodes ran out of RAM
>     -The oom-killer (out of memory killer) killed the renders and/or
>     your jobs
>
>     The oom-killer doesn't just kill the process that is using the
>     most RAM - it's more random than that, though it does prefer
>     processes that are using more RAM. I don't pretend to remember the
>     exact details, but we could look them up if it's important.
>     However, I am pretty sure that the oom-killer killing your jobs is
>     probably suitably random for you to just restart those ones....
>     but I'm not you, or your paper reviewers :)
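>
>     If you want to see for yourself what it did on a particular node,
>     something along these lines should do it (a rough sketch - you'd
>     ssh to the node first, and the exact kernel message wording can
>     vary between kernel versions):
>
>     # kernel log entries left behind by the oom-killer on this node
>     dmesg | grep -i 'killed process'
>     # the score the kernel currently assigns a running process
>     # (higher means more likely to be picked); <pid> is whatever
>     # process you care about
>     cat /proc/<pid>/oom_score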
>
>     We found the bug and also a workaround yesterday, so those jobs
>     are using reasonable amounts of RAM now.
>
>     All the best,
>
>         -Josiah
>
>
>
>     On 1/24/14 3:09 PM, Thomas Helmuth wrote:
>>     I'm seeing it on a variety of nodes, but all are on rack 2. All
>>     of the killed runs I'm seeing stopped between 3 and 7 AM
>>     yesterday morning, though that could just be coincidence since I
>>     started them Wednesday afternoon.
>>
>>     -Tom
>>
>>
>>     On Fri, Jan 24, 2014 at 2:55 PM, Wm. Josiah Erikson
>>     <wjerikson at hampshire.edu> wrote:
>>
>>         That's very odd. What nodes did that happen on?
>>         Killed usually means something actually killed it - either it
>>         was pre-empted by tractor, killed by a script, or killed
>>         manually - it doesn't usually mean crashed.
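>>
>>         That "Killed" text is just the shell reporting that the
>>         process got SIGKILL; if you can get at the exit status, the
>>         usual tell is 137 = 128 + 9. A quick way to see the pattern
>>         (nothing cluster-specific, just bash):
>>
>>         sleep 100 &
>>         kill -9 $!        # SIGKILL - the same signal the oom-killer uses
>>         wait $!; echo $?  # prints 137, i.e. 128 + signal 9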
>>             -Josiah
>>
>>
>>         On 1/24/14 2:52 PM, Thomas Helmuth wrote:
>>>         No problem about the 1-6 power cord.
>>>
>>>         Any idea what killed the other runs I mentioned in my second
>>>         email? I might have to restart that whole set of runs to
>>>         make sure the stats aren't affected, and it would be nice to
>>>         know that they won't crash as well.
>>>
>>>         -Tom
>>>
>>>
>>>         On Fri, Jan 24, 2014 at 10:53 AM, Thomas Helmuth
>>>         <thelmuth at cs.umass.edu> wrote:
>>>
>>>             Hi Josiah,
>>>
>>>             Probably unrelated, but it looks like yesterday morning
>>>             I had some runs stop before they were done, without
>>>             printing any errors I can find. The only thing I can
>>>             find is when I select the run in tractor it says
>>>             something like:
>>>
>>>             /bin/sh: line 1: 23661 Killed /share/apps/bin/lein
>>>             with-profiles production trampoline run
>>>             clojush.examples.wc :use-lexicase-selection false >
>>>             ../Results/GECCO14/wc/tourney-max-points-1000-two/log1.txt
>>>
>>>             Most of that is just the shell command used to launch
>>>             the run, but do you know what that beginning part means,
>>>             or have any idea why these died quietly?
>>>
>>>             Thanks,
>>>             Tom
>>>
>>>
>>>             On Fri, Jan 24, 2014 at 10:36 AM, Thomas Helmuth
>>>             <thelmuth at cs.umass.edu> wrote:
>>>
>>>                 Hi Josiah,
>>>
>>>                 Sounds good! At first I was worried this may have
>>>                 crashed some of my runs, but it looks like they just
>>>                 never started when they tried to run on compute-1-18
>>>                 around the time of the NIMBY, so it's not a big deal.
>>>
>>>                 I did just have two runs crash on compute-1-6 - any
>>>                 idea if that's related?
>>>
>>>                 We are at crunch time (paper deadline Jan. 29), and
>>>                 the runs I have going are pretty important for that,
>>>                 so it would be great to have minimum disruptions
>>>                 until then!
>>>
>>>                 -Tom
>>>
>>>
>>>                 On Fri, Jan 24, 2014 at 9:13 AM, Wm. Josiah Erikson
>>>                 <wjerikson at hampshire.edu> wrote:
>>>
>>>                     Hi guys,
>>>                         I've got compute-1-17 and compute-1-18
>>>                     NIMBYed for now because I'm going to move them
>>>                     (physically) to rack 1 to make room for 8 new
>>>                     nodes! (Each node has 2x X5560 quad-core
>>>                     2.8 GHz processors and 24 GB of RAM.)
>>>                         Just so nobody un-NIMBYs them.
>>>
>>>                     -- 
>>>                     Wm. Josiah Erikson
>>>                     Assistant Director of IT, Infrastructure Group
>>>                     System Administrator, School of CS
>>>                     Hampshire College
>>>                     Amherst, MA 01002
>>>                     (413) 559-6091
>>>
>>>
>>>
>>>
>>>
>>
>>         -- 
>>         Wm. Josiah Erikson
>>         Assistant Director of IT, Infrastructure Group
>>         System Administrator, School of CS
>>         Hampshire College
>>         Amherst, MA 01002
>>         (413) 559-6091
>>
>>
>
>     -- 
>     Wm. Josiah Erikson
>     Assistant Director of IT, Infrastructure Group
>     System Administrator, School of CS
>     Hampshire College
>     Amherst, MA 01002
>     (413) 559-6091
>
>
>
>
>
>
> _______________________________________________
> Clusterusers mailing list
> Clusterusers at lists.hampshire.edu
> https://lists.hampshire.edu/mailman/listinfo/clusterusers

-- 
Wm. Josiah Erikson
Assistant Director of IT, Infrastructure Group
System Administrator, School of CS
Hampshire College
Amherst, MA 01002
(413) 559-6091


