<html>
  <head>
    <meta content="text/html; charset=ISO-8859-1"
      http-equiv="Content-Type">
  </head>
  <body bgcolor="#FFFFFF" text="#000000">
    I'm sending this to the list, as the list may benefit from this
    knowledge:<br>
    <br>
    Ah! I looked through the logs and figured it out. Here's what
    happened. I think which of your runs got killed is probably
    suitably random, but I'll leave that up to you:<br>
    <br>
    -Owen (an animation student working with Perry, who I've added to
    this list) spooled a render that tickled a bug in Maya that caused a
    simple, routine render to chew up outrageous amounts of RAM<br>
    -Several nodes ran out of RAM<br>
    -The oom-killer (out of memory killer) killed the renders and/or
    your jobs<br>
    <br>
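    For reference, the "Killed" that /bin/sh printed in Tom's tractor
    output is what a shell reports when a child process dies from
    SIGKILL, which is the signal the oom-killer sends. A minimal Python
    sketch of how that looks from a parent process, assuming a POSIX
    system; run_and_kill and describe are hypothetical names used only
    for illustration, not anything from tractor:<br>
    <pre>
```python
import signal
import subprocess
import sys


def run_and_kill():
    """Start a sleeping child, send it SIGKILL, and return its return code."""
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(30)"]
    )
    # SIGKILL cannot be caught or ignored, so the child gets no chance
    # to print an error. This is why such runs die quietly.
    child.send_signal(signal.SIGKILL)
    return child.wait()


def describe(returncode):
    """Turn a subprocess return code into a readable cause.

    On POSIX, subprocess reports death by signal N as return code -N.
    """
    try:
        return "killed by " + signal.Signals(-returncode).name
    except ValueError:
        # Not the negative of a known signal number: a normal exit status.
        return "exited with status " + str(returncode)


if __name__ == "__main__":
    print(describe(run_and_kill()))
```
    </pre>
    A return code like this only says that something sent SIGKILL. It
    does not say who sent it, so the kernel log is still the place to
    confirm that it was the oom-killer.<br>
    <br>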
    The oom-killer doesn't just kill the process that is using the most
    RAM. It picks its victim with a scoring heuristic, so the choice is
    less predictable than that, though it does prefer processes that
    are using more RAM. I don't pretend to remember the exact details,
    but we could look them up if it's important. However, I think the
    oom-killer killing your jobs is suitably random for you to just
    restart those runs... but I'm not you, or your paper reviewers :)<br>
    <br>
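    If the details ever matter, Linux exposes the score the oom-killer
    uses for each process in /proc/PID/oom_score (a higher score means
    the process is more likely to be killed). A minimal Python sketch
    that ranks processes by that score, assuming a Linux /proc that
    also provides /proc/PID/comm; read_oom_scores and its proc_root
    parameter are hypothetical names used only for illustration:<br>
    <pre>
```python
import os


def read_oom_scores(proc_root="/proc"):
    """Return (score, pid, name) tuples, highest oom_score first."""
    results = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directory names are processes
        base = os.path.join(proc_root, entry)
        try:
            with open(os.path.join(base, "oom_score")) as f:
                score = int(f.read().strip())
            with open(os.path.join(base, "comm")) as f:
                name = f.read().strip()
        except (OSError, ValueError):
            continue  # process exited meanwhile, or the file is unreadable
        results.append((score, int(entry), name))
    # Highest score first; ties are ordered by pid for stable output.
    results.sort(key=lambda item: (-item[0], item[1]))
    return results


if __name__ == "__main__":
    # Show the ten processes the kernel currently rates as likeliest victims.
    for score, pid, name in read_oom_scores()[:10]:
        print(score, pid, name)
```
    </pre>
    Running this on a node while a job is going would show roughly
    where that job sits relative to everything else on the machine.<br>
    <br>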
    We found the bug and also a workaround yesterday, so those renders
    are using reasonable amounts of RAM now.<br>
    <br>
    All the best,<br>
    <br>
        -Josiah<br>
    <br>
    <br>
    <div class="moz-cite-prefix">On 1/24/14 3:09 PM, Thomas Helmuth
      wrote:<br>
    </div>
    <blockquote
cite="mid:CABgVVjeYh76d-NWZBHjZVmkuw0_xC0w-FPSneFs7iFTwr_LbHA@mail.gmail.com"
      type="cite">
      <div dir="ltr">
        <div>I'm seeing it on a variety of nodes, but all are on rack 2.
          All of the killed runs I'm seeing stopped between 3 and 7 AM
          yesterday morning, though that could just be coincidence since
          I started them Wednesday afternoon.<br>
          <br>
        </div>
        -Tom<br>
      </div>
      <div class="gmail_extra"><br>
        <br>
        <div class="gmail_quote">On Fri, Jan 24, 2014 at 2:55 PM, Wm.
          Josiah Erikson <span dir="ltr"><<a moz-do-not-send="true"
              href="mailto:wjerikson@hampshire.edu" target="_blank">wjerikson@hampshire.edu</a>></span>
          wrote:<br>
          <blockquote class="gmail_quote" style="margin:0 0 0
            .8ex;border-left:1px #ccc solid;padding-left:1ex">
            <div bgcolor="#FFFFFF" text="#000000"> That's very odd. What
              nodes did that happen on?<br>
              Killed usually means something actually killed it - either
              it was pre-empted by tractor, killed by a script, or
              killed manually - it doesn't usually mean crashed.<span
                class="HOEnZb"><font color="#888888"><br>
                      -Josiah</font></span>
              <div>
                <div class="h5"><br>
                  <br>
                  <div>On 1/24/14 2:52 PM, Thomas Helmuth wrote:<br>
                  </div>
                  <blockquote type="cite">
                    <div dir="ltr">
                      <div>
                        <div>No problem about the 1-6 power cord.<br>
                          <br>
                        </div>
                        Any idea what killed the other runs I mentioned
                        in my second email? I might have to restart that
                        whole set of runs to make sure the stats aren't
                        affected, and it would be nice to know that they
                        won't crash as well.<br>
                        <br>
                      </div>
                      -Tom<br>
                    </div>
                    <div class="gmail_extra"><br>
                      <br>
                      <div class="gmail_quote">On Fri, Jan 24, 2014 at
                        10:53 AM, Thomas Helmuth <span dir="ltr"><<a
                            moz-do-not-send="true"
                            href="mailto:thelmuth@cs.umass.edu"
                            target="_blank">thelmuth@cs.umass.edu</a>></span>
                        wrote:<br>
                        <blockquote class="gmail_quote" style="margin:0
                          0 0 .8ex;border-left:1px #ccc
                          solid;padding-left:1ex">
                          <div dir="ltr">
                            <div>
                              <div>
                                <div>Hi Josiah,<br>
                                  <br>
                                </div>
                                Probably unrelated, but it looks like
                                yesterday morning I had some runs stop
                                before they were done, without printing
                                any errors I can find. The only thing I
                                can find is when I select the run in
                                tractor it says something like:<br>
                                <br>
                                /bin/sh: line 1: 23661 Killed
                                /share/apps/bin/lein with-profiles
                                production trampoline run
                                clojush.examples.wc
                                :use-lexicase-selection false >
                                ../Results/GECCO14/wc/tourney-max-points-1000-two/log1.txt

                                <br>
                                <br>
                              </div>
                              Most of that is just the shell command
                              used to launch the run, but do you know
                              what that beginning part means, or have
                              any idea why these died quietly?<br>
                              <br>
                            </div>
                            Thanks,<br>
                            Tom<br>
                          </div>
                          <div>
                            <div>
                              <div class="gmail_extra"> <br>
                                <br>
                                <div class="gmail_quote">On Fri, Jan 24,
                                  2014 at 10:36 AM, Thomas Helmuth <span
                                    dir="ltr"><<a
                                      moz-do-not-send="true"
                                      href="mailto:thelmuth@cs.umass.edu"
                                      target="_blank">thelmuth@cs.umass.edu</a>></span>
                                  wrote:<br>
                                  <blockquote class="gmail_quote"
                                    style="margin:0 0 0
                                    .8ex;border-left:1px #ccc
                                    solid;padding-left:1ex">
                                    <div dir="ltr">
                                      <div>
                                        <div>
                                          <div>Hi Josiah,<br>
                                            <br>
                                          </div>
                                          Sounds good! At first I was
                                          worried this may have crashed
                                          some of my runs, but it looks
                                          like they just never started
                                          when they tried to run on
                                          compute-1-18 around the time
                                          of the NIMBY, so it's not a big
                                          deal.<br>
                                          <br>
                                        </div>
                                        I did just have two runs crash
                                        on compute-1-6, any idea if
                                        that's related?<br>
                                        <br>
                                      </div>
                                      <div>We are at crunch time (paper
                                        deadline Jan. 29), and the runs
                                        I have going are pretty
                                        important for that, so it would
                                        be great to have minimum
                                        disruptions until then!<br>
                                      </div>
                                      <div><br>
                                      </div>
                                      -Tom<br>
                                    </div>
                                    <div>
                                      <div>
                                        <div class="gmail_extra"><br>
                                          <br>
                                          <div class="gmail_quote">On
                                            Fri, Jan 24, 2014 at 9:13
                                            AM, Wm. Josiah Erikson <span
                                              dir="ltr"><<a
                                                moz-do-not-send="true"
                                                href="mailto:wjerikson@hampshire.edu"
                                                target="_blank">wjerikson@hampshire.edu</a>></span>
                                            wrote:<br>
                                            <blockquote
                                              class="gmail_quote"
                                              style="margin:0 0 0
                                              .8ex;border-left:1px #ccc
                                              solid;padding-left:1ex">Hi
                                              guys,<br>
                                                  I've got compute-1-17
                                              and compute-1-18 NIMBYed
                                              for now because I'm going
                                              to (physically) move them
                                              to rack 1 to make room for
                                              8 new nodes! (Each node has
                                              2x X5560 quad-core 2.8GHz
                                              processors and 24GB of
                                              RAM.)<br>
                                                  Just so nobody
                                              un-NIMBY's them.<span><font
                                                  color="#888888"><br>
                                                  <br>
                                                  -- <br>
                                                  Wm. Josiah Erikson<br>
                                                  Assistant Director of
                                                  IT, Infrastructure
                                                  Group<br>
                                                  System Administrator,
                                                  School of CS<br>
                                                  Hampshire College<br>
                                                  Amherst, MA 01002<br>
                                                  <a
                                                    moz-do-not-send="true"
href="tel:%28413%29%20559-6091" value="+14135596091" target="_blank">(413)
                                                    559-6091</a><br>
                                                  <br>
_______________________________________________<br>
                                                  Clusterusers mailing
                                                  list<br>
                                                  <a
                                                    moz-do-not-send="true"
href="mailto:Clusterusers@lists.hampshire.edu" target="_blank">Clusterusers@lists.hampshire.edu</a><br>
                                                  <a
                                                    moz-do-not-send="true"
href="https://lists.hampshire.edu/mailman/listinfo/clusterusers"
                                                    target="_blank">https://lists.hampshire.edu/mailman/listinfo/clusterusers</a><br>
                                                </font></span></blockquote>
                                          </div>
                                          <br>
                                        </div>
                                      </div>
                                    </div>
                                  </blockquote>
                                </div>
                                <br>
                              </div>
                            </div>
                          </div>
                        </blockquote>
                      </div>
                      <br>
                    </div>
                  </blockquote>
                  <br>
                </div>
              </div>
            </div>
          </blockquote>
        </div>
        <br>
      </div>
    </blockquote>
    <br>
    <pre class="moz-signature" cols="72">-- 
Wm. Josiah Erikson
Assistant Director of IT, Infrastructure Group
System Administrator, School of CS
Hampshire College
Amherst, MA 01002
(413) 559-6091
</pre>
  </body>
</html>