<div dir="ltr"><div>Good to know its been fixed, thanks! I'll probably restart just the runs that got killed, since it should work alright with my experiments. Is there an easy way to find out which runs were killed this way?<br>
<br></div>-Tom<br></div><div class="gmail_extra"><br><br><div class="gmail_quote">On Fri, Jan 24, 2014 at 3:28 PM, Wm. Josiah Erikson <span dir="ltr"><<a href="mailto:wjerikson@hampshire.edu" target="_blank">wjerikson@hampshire.edu</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div bgcolor="#FFFFFF" text="#000000">
I'm sending this to the list, as the list may benefit from this
knowledge:<br>
<br>
Ah! I looked through the logs and figured it out. Here's what
happened. I think it's probably suitably random, but I'll leave that
up to you:<br>
<br>
-Owen (an animation student working with Perry, who I've added to
this list) spooled a render that tickled a bug in Maya that caused a
simple, routine render to chew up outrageous amounts of RAM<br>
-Several nodes ran out of RAM<br>
-The oom-killer (out of memory killer) killed the renders and/or
your jobs<br>
<br>
The oom-killer doesn't just kill the process that is using the most RAM - it's more random than that, though it does prefer processes that are using more RAM. I don't pretend to remember the exact details, but we could look them up if it's important. However, I'm pretty sure that the oom-killer killing your jobs is suitably random for you to just restart those runs... but I'm not you, or your paper reviewers :)<br>
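<br>
If you want a rough sense of how it picks, the kernel keeps a per-process "badness" score you can read out of /proc; something like this should list the current top candidates on a node (a sketch off the top of my head, not tested over there):<br>
<pre>
# Higher oom_score = more likely to be picked by the oom-killer.
# The score mostly tracks memory use but gets adjusted (oom_adj and friends),
# which is why the biggest process isn't always the one that gets shot.
for p in /proc/[0-9]*; do
  printf '%6s  %s\n' "$(cat "$p/oom_score" 2>/dev/null)" \
                     "$(ps -p "${p##*/}" -o pid=,comm=,rss= 2>/dev/null)"
done | sort -rn | head
</pre>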
<br>
We found the bug and also a workaround yesterday, so those jobs are
using reasonable amounts of RAM now. <br>
<br>
All the best,<br>
<br>
-Josiah<div><div class="h5"><br>
<br>
<br>
<div>On 1/24/14 3:09 PM, Thomas Helmuth
wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">
<div>I'm seeing it on a variety of nodes, but all are on rack 2.
All of the killed runs I'm seeing stopped between 3 and 7 AM
yesterday morning, though that could just be coincidence since
I started them Wednesday afternoon.<br>
<br>
</div>
-Tom<br>
</div>
<div class="gmail_extra"><br>
<br>
<div class="gmail_quote">On Fri, Jan 24, 2014 at 2:55 PM, Wm.
Josiah Erikson <span dir="ltr"><<a href="mailto:wjerikson@hampshire.edu" target="_blank">wjerikson@hampshire.edu</a>></span>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div bgcolor="#FFFFFF" text="#000000"> That's very odd. What
nodes did that happen on?<br>
"Killed" usually means something actually killed it - either it was pre-empted by tractor, killed by a script, or killed manually - it doesn't usually mean it crashed on its own.<br>
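<br>
If it helps to rule out a plain crash: when a process is killed with SIGKILL, the shell that launched it is what prints that "Killed" line, and the job's exit status comes back as 128+9. A quick local sketch (not from the tractor logs):<br>
<pre>
# Start a throwaway job, kill -9 it, and check how the shell reports it:
sleep 300 &
kill -KILL $!
wait $!
echo $?   # 137 = 128 + 9 (SIGKILL): reported as "Killed", not a crash
</pre>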
<span><font color="#888888"><br>
-Josiah</font></span>
<div>
<div><br>
<br>
<div>On 1/24/14 2:52 PM, Thomas Helmuth wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">
<div>
<div>No problem about the 1-6 power cord.<br>
<br>
</div>
Any idea what killed the other runs I mentioned
in my second email? I might have to restart that
whole set of runs to make sure the stats aren't
affected, and it would be nice to know that they
won't crash as well.<br>
<br>
</div>
-Tom<br>
</div>
<div class="gmail_extra"><br>
<br>
<div class="gmail_quote">On Fri, Jan 24, 2014 at
10:53 AM, Thomas Helmuth <span dir="ltr"><<a href="mailto:thelmuth@cs.umass.edu" target="_blank">thelmuth@cs.umass.edu</a>></span>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div dir="ltr">
<div>
<div>
<div>Hi Josiah,<br>
<br>
</div>
Probably unrelated, but it looks like yesterday morning I had some runs stop before they were done, without printing any errors that I can find. The only clue is that when I select the run in tractor, it says something like:<br>
<br>
<pre>/bin/sh: line 1: 23661 Killed    /share/apps/bin/lein with-profiles production trampoline run clojush.examples.wc :use-lexicase-selection false > ../Results/GECCO14/wc/tourney-max-points-1000-two/log1.txt</pre>
<br>
</div>
Most of that is just the shell command
used to launch the run, but do you know
what that beginning part means, or have
any idea why these died quietly?<br>
<br>
</div>
Thanks,<br>
Tom<br>
</div>
<div>
<div>
<div class="gmail_extra"> <br>
<br>
<div class="gmail_quote">On Fri, Jan 24,
2014 at 10:36 AM, Thomas Helmuth <span dir="ltr"><<a href="mailto:thelmuth@cs.umass.edu" target="_blank">thelmuth@cs.umass.edu</a>></span>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div dir="ltr">
<div>
<div>
<div>Hi Josiah,<br>
<br>
</div>
Sounds good! At first I was worried this might have crashed some of my runs, but it looks like they just never started when they tried to run on compute-1-18 around the time of the NIMBY, so it's not a big deal.<br>
<br>
</div>
I did just have two runs crash on compute-1-6; any idea if that's related?<br>
<br>
</div>
<div>We are at crunch time (paper deadline Jan. 29), and the runs I have going are pretty important for that, so it would be great to have minimal disruptions until then!<br>
</div>
<div><br>
</div>
-Tom<br>
</div>
<div>
<div>
<div class="gmail_extra"><br>
<br>
<div class="gmail_quote">On
Fri, Jan 24, 2014 at 9:13
AM, Wm. Josiah Erikson <span dir="ltr"><<a href="mailto:wjerikson@hampshire.edu" target="_blank">wjerikson@hampshire.edu</a>></span>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi
guys,<br>
I've got compute-1-17 and compute-1-18 NIMBYed for now because I'm going to move them (physically) to rack 1 to make room for 8 new nodes! (Each node has 2x X5560 quad-core 2.8GHz processors and 24GB of RAM.)<br>
Just so nobody un-NIMBYs them.<span><font color="#888888"><br>
<br>
-- <br>
Wm. Josiah Erikson<br>
Assistant Director of
IT, Infrastructure
Group<br>
System Administrator,
School of CS<br>
Hampshire College<br>
Amherst, MA 01002<br>
<a href="tel:%28413%29%20559-6091" value="+14135596091" target="_blank">(413)
559-6091</a><br>
<br>
_______________________________________________<br>
Clusterusers mailing
list<br>
<a href="mailto:Clusterusers@lists.hampshire.edu" target="_blank">Clusterusers@lists.hampshire.edu</a><br>
<a href="https://lists.hampshire.edu/mailman/listinfo/clusterusers" target="_blank">https://lists.hampshire.edu/mailman/listinfo/clusterusers</a><br>
</font></span></blockquote>
</div>
<br>
</div>
</div>
</div>
</blockquote>
</div>
<br>
</div>
</div>
</div>
</blockquote>
</div>
<br>
</div>
</blockquote>
<br>
<pre cols="72">--
Wm. Josiah Erikson
Assistant Director of IT, Infrastructure Group
System Administrator, School of CS
Hampshire College
Amherst, MA 01002
<a href="tel:%28413%29%20559-6091" value="+14135596091" target="_blank">(413) 559-6091</a>
</pre>
</div>
</div>
</div>
</blockquote>
</div>
<br>
</div>
</blockquote>
<br>
<pre cols="72">--
Wm. Josiah Erikson
Assistant Director of IT, Infrastructure Group
System Administrator, School of CS
Hampshire College
Amherst, MA 01002
<a href="tel:%28413%29%20559-6091" value="+14135596091" target="_blank">(413) 559-6091</a>
</pre>
</div></div></div>
<br>_______________________________________________<br>
Clusterusers mailing list<br>
<a href="mailto:Clusterusers@lists.hampshire.edu">Clusterusers@lists.hampshire.edu</a><br>
<a href="https://lists.hampshire.edu/mailman/listinfo/clusterusers" target="_blank">https://lists.hampshire.edu/mailman/listinfo/clusterusers</a><br>
<br></blockquote></div><br></div>