<html>
<head>
<meta content="text/html; charset=ISO-8859-1"
http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
I'm sending this to the list, as the list may benefit from this
knowledge:<br>
<br>
Ah! I looked through the logs and figured it out. Here's what happened.
I think the kills were probably suitably random for your purposes, but
I'll leave that call up to you:<br>
<br>
-Owen (an animation student working with Perry, who I've added to this
list) spooled a render that tickled a bug in Maya, causing a simple,
routine render to chew up outrageous amounts of RAM<br>
-Several nodes ran out of RAM<br>
-The oom-killer (out of memory killer) killed the renders and/or
your jobs<br>
<br>
The oom-killer doesn't simply kill the process using the most RAM -
its choice is more random than that, though it does prefer processes
with large memory footprints. I don't pretend to remember the exact
details of how it scores candidates, but we could look them up if it's
important. However, I'm pretty sure that the oom-killer killing your
jobs is random enough that you can just restart those runs... but I'm
not you, or your paper reviewers :)<br>
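<br>
If you want to double-check a node yourself, something along these
lines should show whether the oom-killer was responsible (just a
sketch - the exact log location varies by distro, and &lt;pid&gt; is a
placeholder):<br>
<pre>
# look for oom-killer activity in the kernel log on the node
dmesg | grep -i -E 'out of memory|oom-killer|killed process'
grep -i 'killed process' /var/log/messages

# the kernel's current "badness" score for a running process;
# higher scores are more likely to be picked by the oom-killer
cat /proc/&lt;pid&gt;/oom_score
</pre>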
<br>
We found the bug and also a workaround yesterday, so those jobs are
using reasonable amounts of RAM now. <br>
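<br>
And to answer your earlier question about the "/bin/sh: line 1: 23661
Killed ..." line: that's just the shell reporting that PID 23661 was
terminated by a signal from outside the process (SIGKILL, in this
case), which is why the job couldn't print anything on its way out. A
rough sketch of the same thing, with a sleep standing in for the real
job:<br>
<pre>
# a process terminated by SIGKILL (signal 9) exits with status 128 + 9 = 137
sleep 600 &amp;     # stand-in for the real job
kill -9 $!          # roughly what the oom-killer does to its victim
wait $!
echo $?             # prints 137
</pre>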
<br>
All the best,<br>
<br>
-Josiah<br>
<br>
<br>
<div class="moz-cite-prefix">On 1/24/14 3:09 PM, Thomas Helmuth
wrote:<br>
</div>
<blockquote
cite="mid:CABgVVjeYh76d-NWZBHjZVmkuw0_xC0w-FPSneFs7iFTwr_LbHA@mail.gmail.com"
type="cite">
<div dir="ltr">
<div>I'm seeing it on a variety of nodes, but all are on rack 2.
All of the killed runs I'm seeing stopped between 3 and 7 AM
yesterday morning, though that could just be coincidence since
I started them Wednesday afternoon.<br>
<br>
</div>
-Tom<br>
</div>
<div class="gmail_extra"><br>
<br>
<div class="gmail_quote">On Fri, Jan 24, 2014 at 2:55 PM, Wm.
Josiah Erikson <span dir="ltr"><<a moz-do-not-send="true"
href="mailto:wjerikson@hampshire.edu" target="_blank">wjerikson@hampshire.edu</a>></span>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<div bgcolor="#FFFFFF" text="#000000"> That's very odd. What
nodes did that happen on?<br>
"Killed" usually means something external actually killed the process - either
it was pre-empted by tractor, killed by a script, or killed
manually - it doesn't usually mean it crashed on its own.<span
class="HOEnZb"><font color="#888888"><br>
-Josiah</font></span>
<div>
<div class="h5"><br>
<br>
<div>On 1/24/14 2:52 PM, Thomas Helmuth wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">
<div>
<div>No problem about the 1-6 power cord.<br>
<br>
</div>
Any idea what killed the other runs I mentioned
in my second email? I might have to restart that
whole set of runs to make sure the stats aren't
affected, and it would be nice to know that they
won't crash as well.<br>
<br>
</div>
-Tom<br>
</div>
<div class="gmail_extra"><br>
<br>
<div class="gmail_quote">On Fri, Jan 24, 2014 at
10:53 AM, Thomas Helmuth <span dir="ltr"><<a
moz-do-not-send="true"
href="mailto:thelmuth@cs.umass.edu"
target="_blank">thelmuth@cs.umass.edu</a>></span>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0
0 0 .8ex;border-left:1px #ccc
solid;padding-left:1ex">
<div dir="ltr">
<div>
<div>
<div>Hi Josiah,<br>
<br>
</div>
Probably unrelated, but it looks like yesterday morning I had
some runs stop before they were done without printing any
errors that I can find. The only clue is that when I select
the run in tractor it says something like:<br>
<br>
/bin/sh: line 1: 23661 Killed
/share/apps/bin/lein with-profiles
production trampoline run
clojush.examples.wc
:use-lexicase-selection false >
../Results/GECCO14/wc/tourney-max-points-1000-two/log1.txt
<br>
<br>
</div>
Most of that is just the shell command
used to launch the run, but do you know
what that beginning part means, or have
any idea why these died quietly?<br>
<br>
</div>
Thanks,<br>
Tom<br>
</div>
<div>
<div>
<div class="gmail_extra"> <br>
<br>
<div class="gmail_quote">On Fri, Jan 24,
2014 at 10:36 AM, Thomas Helmuth <span
dir="ltr"><<a
moz-do-not-send="true"
href="mailto:thelmuth@cs.umass.edu"
target="_blank">thelmuth@cs.umass.edu</a>></span>
wrote:<br>
<blockquote class="gmail_quote"
style="margin:0 0 0
.8ex;border-left:1px #ccc
solid;padding-left:1ex">
<div dir="ltr">
<div>
<div>
<div>Hi Josiah,<br>
<br>
</div>
Sounds good! At first I was worried this might have
crashed some of my runs, but it looks like they just
never started when they tried to run on compute-1-18
around the time of the NIMBY, so it's not a big
deal.<br>
<br>
</div>
I did just have two runs crash on compute-1-6 - any
idea if that's related?<br>
<br>
</div>
<div>We are at crunch time (paper deadline Jan. 29),
and the runs I have going are pretty important for
that, so it would be great to have minimal
disruption until then!<br>
</div>
<div><br>
</div>
-Tom<br>
</div>
<div>
<div>
<div class="gmail_extra"><br>
<br>
<div class="gmail_quote">On
Fri, Jan 24, 2014 at 9:13
AM, Wm. Josiah Erikson <span
dir="ltr"><<a
moz-do-not-send="true"
href="mailto:wjerikson@hampshire.edu"
target="_blank">wjerikson@hampshire.edu</a>></span>
wrote:<br>
<blockquote
class="gmail_quote"
style="margin:0 0 0
.8ex;border-left:1px #ccc
solid;padding-left:1ex">Hi
guys,<br>
I've got compute-1-17 and compute-1-18 NIMBYed for
now because I'm going to move them (physically) to
rack 1 to make room for 8 new nodes! (Each node has
2x X5560 quad-core 2.8GHz processors and 24GB of
RAM.)<br>
Just so nobody
un-NIMBY's them.<span><font
color="#888888"><br>
<br>
-- <br>
Wm. Josiah Erikson<br>
Assistant Director of
IT, Infrastructure
Group<br>
System Administrator,
School of CS<br>
Hampshire College<br>
Amherst, MA 01002<br>
<a
moz-do-not-send="true"
href="tel:%28413%29%20559-6091" value="+14135596091" target="_blank">(413)
559-6091</a><br>
<br>
_______________________________________________<br>
Clusterusers mailing
list<br>
<a
moz-do-not-send="true"
href="mailto:Clusterusers@lists.hampshire.edu" target="_blank">Clusterusers@lists.hampshire.edu</a><br>
<a
moz-do-not-send="true"
href="https://lists.hampshire.edu/mailman/listinfo/clusterusers"
target="_blank">https://lists.hampshire.edu/mailman/listinfo/clusterusers</a><br>
</font></span></blockquote>
</div>
<br>
</div>
</div>
</div>
</blockquote>
</div>
<br>
</div>
</div>
</div>
</blockquote>
</div>
<br>
</div>
</blockquote>
<br>
<pre cols="72">--
Wm. Josiah Erikson
Assistant Director of IT, Infrastructure Group
System Administrator, School of CS
Hampshire College
Amherst, MA 01002
<a moz-do-not-send="true" href="tel:%28413%29%20559-6091" value="+14135596091" target="_blank">(413) 559-6091</a>
</pre>
</div>
</div>
</div>
</blockquote>
</div>
<br>
</div>
</blockquote>
<br>
<pre class="moz-signature" cols="72">--
Wm. Josiah Erikson
Assistant Director of IT, Infrastructure Group
System Administrator, School of CS
Hampshire College
Amherst, MA 01002
(413) 559-6091
</pre>
</body>
</html>