<div dir="ltr"><div>Perfect -- thanks!<br></div>-Tom<br></div><div class="gmail_extra"><br><br><div class="gmail_quote">On Fri, Jan 24, 2014 at 3:56 PM, Wm. Josiah Erikson <span dir="ltr"><<a href="mailto:wjerikson@hampshire.edu" target="_blank">wjerikson@hampshire.edu</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div bgcolor="#FFFFFF" text="#000000">
[josiah@fly J1401220065]$ pwd<br>
/helga/public_html/tractor/cmd-logs/thelmuth/J1401220065<br>
[josiah@fly J1401220065]$ grep Killed *<br>
T10.log:/bin/sh: line 1: 21228 Killed
/share/apps/bin/lein with-profiles production trampoline run
clojush.examples.wc :use-lexicase-selection false >
../Results/GECCO14/wc/tourney-max-points-1000-two/log9.txt<br>
T11.log:/bin/sh: line 1: 29350 Killed
/share/apps/bin/lein with-profiles production trampoline run
clojush.examples.wc :use-lexicase-selection false >
../Results/GECCO14/wc/tourney-max-points-1000-two/log10.txt<br>
T15.log:/bin/sh: line 1: 12113 Killed
/share/apps/bin/lein with-profiles production trampoline run
clojush.examples.wc :use-lexicase-selection false >
../Results/GECCO14/wc/tourney-max-points-1000-two/log14.txt<br>
T2.log:/bin/sh: line 1: 23661 Killed
/share/apps/bin/lein with-profiles production trampoline run
clojush.examples.wc :use-lexicase-selection false >
../Results/GECCO14/wc/tourney-max-points-1000-two/log1.txt<br>
T32.log:/bin/sh: line 1: 1598 Killed
/share/apps/bin/lein with-profiles production trampoline run
clojush.examples.wc :use-lexicase-selection false >
../Results/GECCO14/wc/tourney-max-points-1000-two/log31.txt<br>
T39.log:/bin/sh: line 1: 20134 Killed
/share/apps/bin/lein with-profiles production trampoline run
clojush.examples.wc :use-lexicase-selection false >
../Results/GECCO14/wc/tourney-max-points-1000-two/log38.txt<br>
T42.log:/bin/sh: line 1: 23599 Killed
/share/apps/bin/lein with-profiles production trampoline run
clojush.examples.wc :use-lexicase-selection false >
../Results/GECCO14/wc/tourney-max-points-1000-two/log41.txt<br>
T6.log:/bin/sh: line 1: 19407 Killed
/share/apps/bin/lein with-profiles production trampoline run
clojush.examples.wc :use-lexicase-selection false >
../Results/GECCO14/wc/tourney-max-points-1000-two/log5.txt<br>
T9.log:/bin/sh: line 1: 11738 Killed
/share/apps/bin/lein with-profiles production trampoline run
clojush.examples.wc :use-lexicase-selection false >
../Results/GECCO14/wc/tourney-max-points-1000-two/log8.txt<br>
[josiah@fly J1401220065]$ <br>
<br>
So runs 2, 9, 10, 11, 15, etc. for that job (Job ID 1401220065).<br>
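That grep-plus-filename procedure can be wrapped in a tiny helper. This is a hypothetical sketch (nothing installed on fly), assuming the per-job directory layout shown in the session above, where TN.log corresponds to run N:

```shell
# Hypothetical helper: list which runs in a job's log directory were
# killed (e.g. by the oom-killer). Log names like T10.log come from
# tractor and correspond to run 10.
killed_runs() {
    # $1 = per-job log directory,
    #      e.g. /helga/public_html/tractor/cmd-logs/thelmuth/J1401220065
    grep -l Killed "$1"/T*.log | sed 's#.*/T\([0-9]*\)\.log$#\1#' | sort -n
}
```

For the session above, `killed_runs /helga/public_html/tractor/cmd-logs/thelmuth/J1401220065` would print 2, 9, 10, 11, 15, and so on, one run number per line.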
<br>
Does that make sense? You can do a similar thing for other Job IDs
you're worried about - go to
/helga/public_html/tractor/cmd-logs/thelmuth/J&lt;jobid&gt; and do
the same, or extrapolate whatever procedure you like from that :)<span class="HOEnZb"><font color="#888888"><br>
<br>
-Josiah</font></span><div><div class="h5"><br>
<br>
<br>
<br>
<br>
<div>On 1/24/14 3:46 PM, Thomas Helmuth
wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">
<div>Good to know it's been fixed, thanks! I'll probably restart
just the runs that got killed, since it should work alright
with my experiments. Is there an easy way to find out which
runs were killed this way?<br>
<br>
</div>
-Tom<br>
</div>
<div class="gmail_extra"><br>
<br>
<div class="gmail_quote">On Fri, Jan 24, 2014 at 3:28 PM, Wm.
Josiah Erikson <span dir="ltr"><<a href="mailto:wjerikson@hampshire.edu" target="_blank">wjerikson@hampshire.edu</a>></span>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div bgcolor="#FFFFFF" text="#000000"> I'm sending this to
the list, as the list may benefit from this knowledge:<br>
<br>
Ah! I looked through the logs and figured it out. Here's
what happened. I think it's probably suitably random, but
I'll leave that up to you:<br>
<br>
-Owen (an animation student working with Perry, who I've
added to this list) spooled a render that tickled a bug in
Maya that caused a simple, routine render to chew up
outrageous amounts of RAM<br>
-Several nodes ran out of RAM<br>
-The oom-killer (out of memory killer) killed the renders
and/or your jobs<br>
<br>
The oom-killer doesn't just kill the process that is using
the most RAM - it's more random than that, though it does
prefer processes that are using more RAM. I don't pretend
to remember the exact details, but we could look them up
if it's important. However, I am pretty sure that the
oom-killer killing your jobs is probably suitably random
for you to just restart those ones.... but I'm not you, or
your paper reviewers :)<br>
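(For the curious: on a Linux node the kernel exposes each process's current OOM "badness" score under /proc, so you can see how the oom-killer would rank things. This is the standard /proc interface, shown here just as a sketch:)

```shell
# The oom-killer picks victims by a per-process "badness" score, not
# strictly by largest memory use. The kernel exposes the current score,
# and oom_score_adj (-1000 to 1000) lets you bias or exempt a process.
cat /proc/self/oom_score      # this shell's current badness score
cat /proc/self/oom_score_adj  # adjustment; -1000 exempts the process
```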
<br>
We found the bug and also a workaround yesterday, so those
jobs are using reasonable amounts of RAM now. <br>
<br>
All the best,<br>
<br>
-Josiah
<div>
<div><br>
<br>
<br>
<div>On 1/24/14 3:09 PM, Thomas Helmuth wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">
<div>I'm seeing it on a variety of nodes, but all
are on rack 2. All of the killed runs I'm seeing
stopped between 3 and 7 AM yesterday morning,
though that could just be coincidence since I
started them Wednesday afternoon.<br>
<br>
</div>
-Tom<br>
</div>
<div class="gmail_extra"><br>
<br>
<div class="gmail_quote">On Fri, Jan 24, 2014 at
2:55 PM, Wm. Josiah Erikson <span dir="ltr"><<a href="mailto:wjerikson@hampshire.edu" target="_blank">wjerikson@hampshire.edu</a>></span>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div bgcolor="#FFFFFF" text="#000000"> That's
very odd. What nodes did that happen on?<br>
Killed usually means something actually
killed it - either it was pre-empted by
tractor, killed by a script, or killed
manually - it doesn't usually mean crashed.<span><font color="#888888"><br>
-Josiah</font></span>
<div>
<div><br>
<br>
<div>On 1/24/14 2:52 PM, Thomas Helmuth
wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">
<div>
<div>No problem about the 1-6
power cord.<br>
<br>
</div>
Any idea what killed the other
runs I mentioned in my second
email? I might have to restart
that whole set of runs to make
sure the stats aren't affected,
and it would be nice to know that
they won't crash as well.<br>
<br>
</div>
-Tom<br>
</div>
<div class="gmail_extra"><br>
<br>
<div class="gmail_quote">On Fri, Jan
24, 2014 at 10:53 AM, Thomas
Helmuth <span dir="ltr"><<a href="mailto:thelmuth@cs.umass.edu" target="_blank">thelmuth@cs.umass.edu</a>></span>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div dir="ltr">
<div>
<div>
<div>Hi Josiah,<br>
<br>
</div>
Probably unrelated, but it
looks like yesterday
morning I had some runs
stop before they were
done, without printing any
errors I can find. The
only thing I can find is
when I select the run in
tractor it says something
like:<br>
<br>
/bin/sh: line 1: 23661
Killed
/share/apps/bin/lein
with-profiles production
trampoline run
clojush.examples.wc
:use-lexicase-selection
false >
../Results/GECCO14/wc/tourney-max-points-1000-two/log1.txt
<br>
<br>
</div>
Most of that is just the
shell command used to launch
the run, but do you know
what that beginning part
means, or have any idea why
these died quietly?<br>
<br>
</div>
Thanks,<br>
Tom<br>
</div>
<div>
<div>
<div class="gmail_extra"> <br>
<br>
<div class="gmail_quote">On
Fri, Jan 24, 2014 at
10:36 AM, Thomas Helmuth
<span dir="ltr"><<a href="mailto:thelmuth@cs.umass.edu" target="_blank">thelmuth@cs.umass.edu</a>></span>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div dir="ltr">
<div>
<div>
<div>Hi Josiah,<br>
<br>
</div>
Sounds good! At
first I was
worried this may
have crashed
some of my runs,
but it looks
like they just
never started
when they tried
to run on
compute-1-18
around the time
of the NIMBY, so
it's not a big
deal.<br>
<br>
</div>
I did just have
two runs crash on
compute-1-6, any
idea if that's
related?<br>
<br>
</div>
<div>We are at
crunch time (paper
deadline Jan. 29),
and the runs I
have going are
pretty important
for that, so it
would be great to
have minimum
disruptions until
then!<br>
</div>
<div><br>
</div>
-Tom<br>
</div>
<div>
<div>
<div class="gmail_extra"><br>
<br>
<div class="gmail_quote">On
Fri, Jan 24,
2014 at 9:13
AM, Wm. Josiah
Erikson <span dir="ltr"><<a href="mailto:wjerikson@hampshire.edu" target="_blank">wjerikson@hampshire.edu</a>></span>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi
guys,<br>
I've got
compute-1-17
and
compute-1-18
NIMBYed for
now because
I'm going to
move them to
make room
(physically)
in rack 1 for
8 new nodes!
(each node has
2x X5560
quad-core
2.8GHz processors
and 24GB of
RAM)<br>
Just so
nobody
un-NIMBY's
them.<span><font color="#888888"><br>
<br>
-- <br>
Wm. Josiah
Erikson<br>
Assistant
Director of
IT,
Infrastructure
Group<br>
System
Administrator,
School of CS<br>
Hampshire
College<br>
Amherst, MA
01002<br>
<a href="tel:%28413%29%20559-6091" value="+14135596091" target="_blank">(413)
559-6091</a><br>
<br>
_______________________________________________<br>
Clusterusers
mailing list<br>
<a href="mailto:Clusterusers@lists.hampshire.edu" target="_blank">Clusterusers@lists.hampshire.edu</a><br>
<a href="https://lists.hampshire.edu/mailman/listinfo/clusterusers" target="_blank">https://lists.hampshire.edu/mailman/listinfo/clusterusers</a><br>
</font></span></blockquote>
</div>
<br>
</div>
</div>
</div>
</blockquote>
</div>
<br>
</div>
</div>
</div>
</blockquote>
</div>
<br>
</div>
</blockquote>
<br>
</div>
</div>
</div>
</blockquote>
</div>
<br>
</div>
</blockquote>
<br>
</div>
</div>
</div>
<br>
<br>
</blockquote>
</div>
<br>
</div>
<br>
</blockquote>
<br>
</div></div></div>
<br></blockquote></div><br></div>