<div dir="ltr"><div>Good to know its been fixed, thanks! I'll probably restart just the runs that got killed, since it should work alright with my experiments. Is there an easy way to find out which runs were killed this way?<br>
<br></div>-Tom<br></div><div class="gmail_extra"><br><br><div class="gmail_quote">On Fri, Jan 24, 2014 at 3:28 PM, Wm. Josiah Erikson <span dir="ltr"><<a href="mailto:wjerikson@hampshire.edu" target="_blank">wjerikson@hampshire.edu</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div bgcolor="#FFFFFF" text="#000000">
I'm sending this to the list, as the list may benefit from this
knowledge:<br>
<br>
Ah! I looked through the logs and figured it out. Here's what
happened. I think it's probably suitably random, but I'll leave that
up to you:<br>
<br>
-Owen (an animation student working with Perry, who I've added to
this list) spooled a render that tickled a bug in Maya that caused a
simple, routine render to chew up outrageous amounts of RAM<br>
-Several nodes ran out of RAM<br>
-The oom-killer (out of memory killer) killed the renders and/or
your jobs<br>
<br>
The oom-killer doesn't just kill the process that is using the most RAM - it's more random than that, though it does prefer processes that are using more RAM. I don't pretend to remember the exact details, but we could look them up if it's important. However, I'm pretty sure that the oom-killer killing your jobs is suitably random for you to just restart those runs... but I'm not you, or your paper reviewers :)<br>
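<br>
If you want a rough sense of how it picks, the kernel keeps a per-process "badness" score you can read out of /proc; something like this should list the current top candidates on a node (a sketch off the top of my head, not tested over there):<br>
<pre>
# Higher oom_score = more likely to be picked by the oom-killer.
# The score mostly tracks memory use but gets adjusted (oom_adj and friends),
# which is why the biggest process isn't always the one that gets shot.
for p in /proc/[0-9]*; do
  printf '%6s  %s\n' "$(cat "$p/oom_score" 2>/dev/null)" \
                     "$(ps -p "${p##*/}" -o pid=,comm=,rss= 2>/dev/null)"
done | sort -rn | head
</pre>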
<br>
We found the bug and also a workaround yesterday, so those jobs are
using reasonable amounts of RAM now. <br>
<br>
All the best,<br>
<br>
-Josiah<div><div class="h5"><br>
<br>
<br>
<div>On 1/24/14 3:09 PM, Thomas Helmuth
wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">
<div>I'm seeing it on a variety of nodes, but all are on rack 2.
All of the killed runs I'm seeing stopped between 3 and 7 AM
yesterday morning, though that could just be coincidence since
I started them Wednesday afternoon.<br>
<br>
</div>
-Tom<br>
</div>
<div class="gmail_extra"><br>
<br>
<div class="gmail_quote">On Fri, Jan 24, 2014 at 2:55 PM, Wm.
Josiah Erikson <span dir="ltr"><<a href="mailto:wjerikson@hampshire.edu" target="_blank">wjerikson@hampshire.edu</a>></span>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div bgcolor="#FFFFFF" text="#000000"> That's very odd. What
nodes did that happen on?<br>
"Killed" usually means something actually killed it - either it was pre-empted by tractor, killed by a script, or killed manually - it doesn't usually mean it crashed on its own.<br>
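<br>
If it helps to rule out a plain crash: when a process is killed with SIGKILL, the shell that launched it is what prints that "Killed" line, and the job's exit status comes back as 128+9. A quick local sketch (not from the tractor logs):<br>
<pre>
# Start a throwaway job, kill -9 it, and check how the shell reports it:
sleep 300 &
kill -KILL $!
wait $!
echo $?   # 137 = 128 + 9 (SIGKILL): reported as "Killed", not a crash
</pre>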
<span><font color="#888888"><br>
-Josiah</font></span>
<div>
<div><br>
<br>
<div>On 1/24/14 2:52 PM, Thomas Helmuth wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">
<div>
<div>No problem about the 1-6 power cord.<br>
<br>
</div>
Any idea what killed the other runs I mentioned
in my second email? I might have to restart that
whole set of runs to make sure the stats aren't
affected, and it would be nice to know that they
won't crash as well.<br>
<br>
</div>
-Tom<br>
</div>
<div class="gmail_extra"><br>
<br>
<div class="gmail_quote">On Fri, Jan 24, 2014 at
10:53 AM, Thomas Helmuth <span dir="ltr"><<a href="mailto:thelmuth@cs.umass.edu" target="_blank">thelmuth@cs.umass.edu</a>></span>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div dir="ltr">
<div>
<div>
<div>Hi Josiah,<br>
<br>
</div>
Probably unrelated, but it looks like yesterday morning I had some runs stop before they were done, without printing any errors that I can find. The only clue is that when I select the run in tractor, it says something like:<br>
<br>
<pre>/bin/sh: line 1: 23661 Killed    /share/apps/bin/lein with-profiles production trampoline run clojush.examples.wc :use-lexicase-selection false > ../Results/GECCO14/wc/tourney-max-points-1000-two/log1.txt</pre>
<br>
</div>
Most of that is just the shell command
used to launch the run, but do you know
what that beginning part means, or have
any idea why these died quietly?<br>
<br>
</div>
Thanks,<br>
Tom<br>
</div>
<div>
<div>
<div class="gmail_extra"> <br>
<br>
<div class="gmail_quote">On Fri, Jan 24,
2014 at 10:36 AM, Thomas Helmuth <span dir="ltr"><<a href="mailto:thelmuth@cs.umass.edu" target="_blank">thelmuth@cs.umass.edu</a>></span>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div dir="ltr">
<div>
<div>
<div>Hi Josiah,<br>
<br>
</div>
Sounds good! At first I was worried this might have crashed some of my runs, but it looks like they just never started when they tried to run on compute-1-18 around the time of the NIMBY, so it's not a big deal.<br>
<br>
</div>
I did just have two runs crash on compute-1-6; any idea if that's related?<br>
<br>
</div>
<div>We are at crunch time (paper deadline Jan. 29), and the runs I have going are pretty important for that, so it would be great to have minimal disruptions until then!<br>
</div>
<div><br>
</div>
-Tom<br>
</div>
<div>
<div>
<div class="gmail_extra"><br>
<br>
<div class="gmail_quote">On
Fri, Jan 24, 2014 at 9:13
AM, Wm. Josiah Erikson <span dir="ltr"><<a href="mailto:wjerikson@hampshire.edu" target="_blank">wjerikson@hampshire.edu</a>></span>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi
guys,<br>
I've got compute-1-17 and compute-1-18 NIMBYed for now because I'm going to move them (physically) to rack 1 to make room for 8 new nodes! (Each node has 2x X5560 quad-core 2.8GHz processors and 24GB of RAM.)<br>
Just so nobody un-NIMBYs them.<span><font color="#888888"><br>
<br>
-- <br>
Wm. Josiah Erikson<br>
Assistant Director of
IT, Infrastructure
Group<br>
System Administrator,
School of CS<br>
Hampshire College<br>
Amherst, MA 01002<br>
<a href="tel:%28413%29%20559-6091" value="+14135596091" target="_blank">(413)
559-6091</a><br>
<br>
_______________________________________________<br>
Clusterusers mailing
list<br>
<a href="mailto:Clusterusers@lists.hampshire.edu" target="_blank">Clusterusers@lists.hampshire.edu</a><br>
<a href="https://lists.hampshire.edu/mailman/listinfo/clusterusers" target="_blank">https://lists.hampshire.edu/mailman/listinfo/clusterusers</a><br>
</font></span></blockquote>
</div>
<br>
</div>
</div>
</div>
</blockquote>
</div>
<br>
</div>
</div>
</div>
</blockquote>
</div>
<br>
</div>
</blockquote>
<br>
<pre cols="72">--
Wm. Josiah Erikson
Assistant Director of IT, Infrastructure Group
System Administrator, School of CS
Hampshire College
Amherst, MA 01002
<a href="tel:%28413%29%20559-6091" value="+14135596091" target="_blank">(413) 559-6091</a>
</pre>
</div>
</div>
</div>
</blockquote>
</div>
<br>
</div>
</blockquote>
<br>
<pre cols="72">--
Wm. Josiah Erikson
Assistant Director of IT, Infrastructure Group
System Administrator, School of CS
Hampshire College
Amherst, MA 01002
<a href="tel:%28413%29%20559-6091" value="+14135596091" target="_blank">(413) 559-6091</a>
</pre>
</div></div></div>
<br>_______________________________________________<br>
Clusterusers mailing list<br>
<a href="mailto:Clusterusers@lists.hampshire.edu">Clusterusers@lists.hampshire.edu</a><br>
<a href="https://lists.hampshire.edu/mailman/listinfo/clusterusers" target="_blank">https://lists.hampshire.edu/mailman/listinfo/clusterusers</a><br>
<br></blockquote></div><br></div>