<div dir="ltr"><div>Good to know it's been fixed, thanks! I'll probably restart just the runs that got killed, since it should work alright with my experiments. Is there an easy way to find out which runs were killed this way?<br>

<br></div>-Tom<br></div><div class="gmail_extra"><br><br><div class="gmail_quote">On Fri, Jan 24, 2014 at 3:28 PM, Wm. Josiah Erikson <span dir="ltr">&lt;<a href="mailto:wjerikson@hampshire.edu" target="_blank">wjerikson@hampshire.edu</a>&gt;</span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
  
    
  
  <div bgcolor="#FFFFFF" text="#000000">
    I'm sending this to the list, as the list may benefit from this
    knowledge:<br>
    <br>
    Ah! I looked through the logs and figured it out. Here's what
    happened. I think it's probably suitably random, but I'll leave that
    up to you:<br>
    <br>
    -Owen (an animation student working with Perry, who I've added to
    this list) spooled a render that tickled a bug in Maya that caused a
    simple, routine render to chew up outrageous amounts of RAM<br>
    -Several nodes ran out of RAM<br>
    -The oom-killer (out of memory killer) killed the renders and/or
    your jobs<br>
    <br>
    The oom-killer doesn't just kill the process that is using the most
    RAM - it's more random than that, though it does prefer processes
    that are using more RAM. I don't pretend to remember the exact
    details, but we could look them up if it's important. However, I am
    pretty sure that the oom-killer's choice of victims was suitably
    random for you to just restart those runs... but I'm not you, or
    your paper reviewers :)<br>
    <br>
    We found the bug and also a workaround yesterday, so those jobs are
    using reasonable amounts of RAM now.    <br>
    <br>
    All the best,<br>
    <br>
        -Josiah<div><div class="h5"><br>
    <br>
    <br>
    <div>On 1/24/14 3:09 PM, Thomas Helmuth
      wrote:<br>
    </div>
    <blockquote type="cite">
      <div dir="ltr">
        <div>I'm seeing it on a variety of nodes, but all are on rack 2.
          All of the killed runs I'm seeing stopped between 3 and 7 AM
          yesterday morning, though that could just be coincidence since
          I started them Wednesday afternoon.<br>
          <br>
        </div>
        -Tom<br>
      </div>
      <div class="gmail_extra"><br>
        <br>
        <div class="gmail_quote">On Fri, Jan 24, 2014 at 2:55 PM, Wm.
          Josiah Erikson <span dir="ltr">&lt;<a href="mailto:wjerikson@hampshire.edu" target="_blank">wjerikson@hampshire.edu</a>&gt;</span>
          wrote:<br>
          <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
            <div bgcolor="#FFFFFF" text="#000000"> That's very odd. What
              nodes did that happen on?<br>
              Killed usually means something actually killed it - either
              it was pre-empted by tractor, killed by a script, or
              killed manually - it doesn't usually mean crashed.<span><font color="#888888"><br>
                      -Josiah</font></span>
              <div>
                <div><br>
                  <br>
                  <div>On 1/24/14 2:52 PM, Thomas Helmuth wrote:<br>
                  </div>
                  <blockquote type="cite">
                    <div dir="ltr">
                      <div>
                        <div>No problem about the 1-6 power cord.<br>
                          <br>
                        </div>
                        Any idea what killed the other runs I mentioned
                        in my second email? I might have to restart that
                        whole set of runs to make sure the stats aren't
                        affected, and it would be nice to know that they
                        won't crash as well.<br>
                        <br>
                      </div>
                      -Tom<br>
                    </div>
                    <div class="gmail_extra"><br>
                      <br>
                      <div class="gmail_quote">On Fri, Jan 24, 2014 at
                        10:53 AM, Thomas Helmuth <span dir="ltr">&lt;<a href="mailto:thelmuth@cs.umass.edu" target="_blank">thelmuth@cs.umass.edu</a>&gt;</span>
                        wrote:<br>
                        <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
                          <div dir="ltr">
                            <div>
                              <div>
                                <div>Hi Josiah,<br>
                                  <br>
                                </div>
                                Probably unrelated, but it looks like
                                yesterday morning I had some runs stop
                                before they were done, without printing
                                any errors I can find. The only thing I
                                can find is when I select the run in
                                tractor it says something like:<br>
                                <br>
                                /bin/sh: line 1: 23661 Killed
                                /share/apps/bin/lein with-profiles
                                production trampoline run
                                clojush.examples.wc
                                :use-lexicase-selection false >
                                ../Results/GECCO14/wc/tourney-max-points-1000-two/log1.txt

                                <br>
                                <br>
                              </div>
                              Most of that is just the shell command
                              used to launch the run, but do you know
                              what that beginning part means, or have
                              any idea why these died quietly?<br>
                              <br>
                            </div>
                            Thanks,<br>
                            Tom<br>
                          </div>
                          <div>
                            <div>
                              <div class="gmail_extra"> <br>
                                <br>
                                <div class="gmail_quote">On Fri, Jan 24,
                                  2014 at 10:36 AM, Thomas Helmuth <span dir="ltr">&lt;<a href="mailto:thelmuth@cs.umass.edu" target="_blank">thelmuth@cs.umass.edu</a>&gt;</span>
                                  wrote:<br>
                                  <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
                                    <div dir="ltr">
                                      <div>
                                        <div>
                                          <div>Hi Josiah,<br>
                                            <br>
                                          </div>
                                          Sounds good! At first I was
                                          worried this may have crashed
                                          some of my runs, but it looks
                                          like they just never started
                                          when they tried to run on
                                          compute-1-18 around the time
                                          of the NIMBY, so it's not a big
                                          deal.<br>
                                          <br>
                                        </div>
                                        I did just have two runs crash
                                        on compute-1-6, any idea if
                                        that's related?<br>
                                        <br>
                                      </div>
                                      <div>We are at crunch time (paper
                                        deadline Jan. 29), and the runs
                                        I have going are pretty
                                        important for that, so it would
                                        be great to have minimal
                                        disruptions until then!<br>
                                      </div>
                                      <div><br>
                                      </div>
                                      -Tom<br>
                                    </div>
                                    <div>
                                      <div>
                                        <div class="gmail_extra"><br>
                                          <br>
                                          <div class="gmail_quote">On
                                            Fri, Jan 24, 2014 at 9:13
                                            AM, Wm. Josiah Erikson <span dir="ltr">&lt;<a href="mailto:wjerikson@hampshire.edu" target="_blank">wjerikson@hampshire.edu</a>&gt;</span>
                                            wrote:<br>
                                            <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi
                                              guys,<br>
                                                  I've got compute-1-17
                                              and compute-1-18 NIMBYed
                                              for now because I'm going
                                              to move them (physically)
                                              to rack 1 to make room for
                                              8 new nodes! (Each node
                                              has 2x X5560 quad-core
                                              2.8GHz processors and 24GB
                                              of RAM.)<br>
                                                  Just so nobody
                                              un-NIMBY's them.<span><font color="#888888"><br>
                                                  <br>
                                                  -- <br>
                                                  Wm. Josiah Erikson<br>
                                                  Assistant Director of
                                                  IT, Infrastructure
                                                  Group<br>
                                                  System Administrator,
                                                  School of CS<br>
                                                  Hampshire College<br>
                                                  Amherst, MA 01002<br>
                                                  <a href="tel:%28413%29%20559-6091" value="+14135596091" target="_blank">(413)
                                                    559-6091</a><br>
                                                  <br>
_______________________________________________<br>
                                                  Clusterusers mailing
                                                  list<br>
                                                  <a href="mailto:Clusterusers@lists.hampshire.edu" target="_blank">Clusterusers@lists.hampshire.edu</a><br>
                                                  <a href="https://lists.hampshire.edu/mailman/listinfo/clusterusers" target="_blank">https://lists.hampshire.edu/mailman/listinfo/clusterusers</a><br>
                                                </font></span></blockquote>
                                          </div>
                                          <br>
                                        </div>
                                      </div>
                                    </div>
                                  </blockquote>
                                </div>
                                <br>
                              </div>
                            </div>
                          </div>
                        </blockquote>
                      </div>
                      <br>
                    </div>
                  </blockquote>
                  <br>
                </div>
              </div>
            </div>
          </blockquote>
        </div>
        <br>
      </div>
    </blockquote>
    <br>
  </div></div></div>

<br></blockquote></div><br></div>