<div dir="ltr"><div>Perfect -- thanks!<br></div>-Tom<br></div><div class="gmail_extra"><br><br><div class="gmail_quote">On Fri, Jan 24, 2014 at 3:56 PM, Wm. Josiah Erikson <span dir="ltr"><<a href="mailto:wjerikson@hampshire.edu" target="_blank">wjerikson@hampshire.edu</a>></span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
  
    
  
  <div bgcolor="#FFFFFF" text="#000000">
    [josiah@fly J1401220065]$ pwd<br>
    /helga/public_html/tractor/cmd-logs/thelmuth/J1401220065<br>
    [josiah@fly J1401220065]$ grep Killed *<br>
    T10.log:/bin/sh: line 1: 21228 Killed                 
    /share/apps/bin/lein with-profiles production trampoline run
    clojush.examples.wc :use-lexicase-selection false >
    ../Results/GECCO14/wc/tourney-max-points-1000-two/log9.txt<br>
    T11.log:/bin/sh: line 1: 29350 Killed                 
    /share/apps/bin/lein with-profiles production trampoline run
    clojush.examples.wc :use-lexicase-selection false >
    ../Results/GECCO14/wc/tourney-max-points-1000-two/log10.txt<br>
    T15.log:/bin/sh: line 1: 12113 Killed                 
    /share/apps/bin/lein with-profiles production trampoline run
    clojush.examples.wc :use-lexicase-selection false >
    ../Results/GECCO14/wc/tourney-max-points-1000-two/log14.txt<br>
    T2.log:/bin/sh: line 1: 23661 Killed                 
    /share/apps/bin/lein with-profiles production trampoline run
    clojush.examples.wc :use-lexicase-selection false >
    ../Results/GECCO14/wc/tourney-max-points-1000-two/log1.txt<br>
    T32.log:/bin/sh: line 1:  1598 Killed                 
    /share/apps/bin/lein with-profiles production trampoline run
    clojush.examples.wc :use-lexicase-selection false >
    ../Results/GECCO14/wc/tourney-max-points-1000-two/log31.txt<br>
    T39.log:/bin/sh: line 1: 20134 Killed                 
    /share/apps/bin/lein with-profiles production trampoline run
    clojush.examples.wc :use-lexicase-selection false >
    ../Results/GECCO14/wc/tourney-max-points-1000-two/log38.txt<br>
    T42.log:/bin/sh: line 1: 23599 Killed                 
    /share/apps/bin/lein with-profiles production trampoline run
    clojush.examples.wc :use-lexicase-selection false >
    ../Results/GECCO14/wc/tourney-max-points-1000-two/log41.txt<br>
    T6.log:/bin/sh: line 1: 19407 Killed                 
    /share/apps/bin/lein with-profiles production trampoline run
    clojush.examples.wc :use-lexicase-selection false >
    ../Results/GECCO14/wc/tourney-max-points-1000-two/log5.txt<br>
    T9.log:/bin/sh: line 1: 11738 Killed                 
    /share/apps/bin/lein with-profiles production trampoline run
    clojush.examples.wc :use-lexicase-selection false >
    ../Results/GECCO14/wc/tourney-max-points-1000-two/log8.txt<br>
    [josiah@fly J1401220065]$ <br>
    <br>
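    So tasks 2, 6, 9, 10, 11, 15, 32, 39, and 42 were killed in that job (Job ID 1401220065); they were writing to log1, log5, log8, log9, log10, log14, log31, log38, and log41.<br>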
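A minimal shell sketch of the grep above. It assumes, as in this listing, that each task's log is named TN.log (T2.log, T10.log, ...) and that the killed command sent its output somewhere with a `>` redirect. `killed_runs` is a hypothetical helper name, not a Tractor tool. It prints each killed task number with the log file that run was writing to:

```shell
# killed_runs DIR (hypothetical helper): for each task log in a Tractor job's
# cmd-log directory that contains a shell "Killed" notice, print
# "TASK OUTPUT_FILE", sorted by task number. Assumes logs are named TN.log.
killed_runs() {
    (
        # Subshell, so the cd does not affect the caller.
        cd "$1" || exit 1
        # -H keeps the "TN.log:" prefix even if only one file matches;
        # -E matches lines like "/bin/sh: line 1: 21228 Killed   lein ...".
        grep -H -E '^/bin/sh: line [0-9]+: +[0-9]+ Killed' T*.log 2>/dev/null |
            # Keep the task number and the file after the last ">" redirect;
            # if there is no redirect, print "-" instead of a file name.
            sed -e 's/^T\([0-9]*\)\.log:.*> *\([^ ]*\) *$/\1 \2/' -e t \
                -e 's/^T\([0-9]*\)\.log:.*/\1 -/' |
            sort -n
    )
}
```

Running `killed_runs /helga/public_html/tractor/cmd-logs/thelmuth/J1401220065` should list the nine runs above. For other jobs, the same call works on their J-prefixed directories under /helga/public_html/tractor/cmd-logs/thelmuth/, for example inside a `for` loop over the job IDs.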
    <br>
    Does that make sense? You can do the same thing for any other Job IDs
    you're worried about - go to
    /helga/public_html/tractor/cmd-logs/thelmuth/J&lt;jobid&gt; and run
    the same grep, or extrapolate whatever procedure you like from that :)<span class="HOEnZb"><font color="#888888"><br>
    <br>
        -Josiah</font></span><div><div class="h5"><br>
    <br>
    <br>
    <br>
    <br>
    <div>On 1/24/14 3:46 PM, Thomas Helmuth
      wrote:<br>
    </div>
    <blockquote type="cite">
      <div dir="ltr">
        <div>Good to know it's been fixed, thanks! I'll probably restart
          just the runs that got killed, since that should work alright
          with my experiments. Is there an easy way to find out which
          runs were killed this way?<br>
          <br>
        </div>
        -Tom<br>
      </div>
      <div class="gmail_extra"><br>
        <br>
        <div class="gmail_quote">On Fri, Jan 24, 2014 at 3:28 PM, Wm.
          Josiah Erikson <span dir="ltr"><<a href="mailto:wjerikson@hampshire.edu" target="_blank">wjerikson@hampshire.edu</a>></span>
          wrote:<br>
          <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
            <div bgcolor="#FFFFFF" text="#000000"> I'm sending this to
              the list, as the list may benefit from this knowledge:<br>
              <br>
              Ah! I looked through the logs and figured it out. Here's
              what happened. I think it's probably suitably random, but
              I'll leave that up to you:<br>
              <br>
              -Owen (an animation student working with Perry, who I've
              added to this list) spooled a render that tickled a bug in
              Maya that caused a simple, routine render to chew up
              outrageous amounts of RAM<br>
              -Several nodes ran out of RAM<br>
              -The oom-killer (out of memory killer) killed the renders
              and/or your jobs<br>
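To confirm that the oom-killer was behind a particular run, one approach is to compare the PID in the Tractor "Killed" line (e.g. 21228) against the PIDs the kernel logs when the oom-killer acts. This is a sketch and the helper names are hypothetical. The exact wording of the kernel message varies between kernel versions, so it only looks for the common "Killed process PID" fragment. PIDs get reused, so a match is evidence rather than proof:

```shell
# oom_killed_pids (hypothetical helper): read kernel log text on stdin and
# print the PID from every "Killed process PID ..." line the OOM killer wrote.
oom_killed_pids() {
    sed -n 's/.*Killed process \([0-9][0-9]*\).*/\1/p'
}

# oom_matches PID... (hypothetical helper): read kernel log text on stdin and
# print each PID given as an argument that the OOM killer also reported.
oom_matches() {
    killed=$(oom_killed_pids)
    for pid in "$@"; do
        # -x: match the whole line, so 2122 does not match 21228.
        if printf '%s\n' "$killed" | grep -qx "$pid"; then
            echo "$pid"
        fi
    done
}
```

On the node that ran the task, `dmesg | oom_matches 21228` is one way to feed it. The dmesg buffer can wrap, so on Red Hat-style systems the syslog file /var/log/messages (an assumption about this cluster's setup) may go back further.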
              <br>
              The oom-killer doesn't just kill the process that is using
              the most RAM - it picks a victim using a heuristic
              "badness" score that favors processes using more RAM, so
              from the outside which process dies is hard to predict. I
              don't pretend to remember the exact details, but we could
              look them up if it's important. However, I'm pretty sure
              the oom-killer killing your jobs is suitably random for you
              to just restart those ones... but I'm not you, or your
              paper reviewers :)<br>
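On Linux, the per-process score the oom-killer ranks candidates by is visible as /proc/PID/oom_score (higher means more likely to be killed), and on recent kernels /proc/PID/oom_score_adj can bias it. Below is a rough sketch that lists the processes a node's oom-killer would currently favor. It assumes a kernel new enough to provide /proc/PID/comm, and `top_oom_candidates` is a hypothetical name:

```shell
# top_oom_candidates [N] (hypothetical helper): print "SCORE PID NAME" for the
# N processes (default 10) with the highest /proc/PID/oom_score.
top_oom_candidates() {
    n=${1:-10}
    for dir in /proc/[0-9]*; do
        # Processes can exit while we loop, so skip any we can no longer read.
        score=$(cat "$dir/oom_score" 2>/dev/null) || continue
        name=$(cat "$dir/comm" 2>/dev/null) || continue
        printf '%s %s %s\n' "$score" "${dir#/proc/}" "$name"
    done | sort -rn | head -n "$n"
}
```

Running it on a node while a large render is going would show whether that render or the lein/java runs sit at the top of the list.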
              <br>
              We found the bug and also a workaround yesterday, so those
              jobs are using reasonable amounts of RAM now.    <br>
              <br>
              All the best,<br>
              <br>
                  -Josiah
              <div>
                <div><br>
                  <br>
                  <br>
                  <div>On 1/24/14 3:09 PM, Thomas Helmuth wrote:<br>
                  </div>
                  <blockquote type="cite">
                    <div dir="ltr">
                      <div>I'm seeing it on a variety of nodes, but all
                        are on rack 2. All of the killed runs I'm seeing
                        stopped between 3 and 7 AM yesterday morning,
                        though that could just be coincidence since I
                        started them Wednesday afternoon.<br>
                        <br>
                      </div>
                      -Tom<br>
                    </div>
                    <div class="gmail_extra"><br>
                      <br>
                      <div class="gmail_quote">On Fri, Jan 24, 2014 at
                        2:55 PM, Wm. Josiah Erikson <span dir="ltr"><<a href="mailto:wjerikson@hampshire.edu" target="_blank">wjerikson@hampshire.edu</a>></span>
                        wrote:<br>
                        <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
                          <div bgcolor="#FFFFFF" text="#000000"> That's
                            very odd. What nodes did that happen on?<br>
                            "Killed" usually means something actually
                            killed it - either it was pre-empted by
                            tractor, killed by a script, or killed
                            manually - it doesn't usually mean it crashed.<span><font color="#888888"><br>
                                    -Josiah</font></span>
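For reference, "Killed" is the shell's notice that a child process died from signal 9 (SIGKILL). That signal cannot be caught, and it is what the Linux oom-killer sends. A crash such as a segfault is usually reported differently (e.g. "Segmentation fault"). By shell convention, an exit status above 128 means death by signal number (status - 128), so 137 means SIGKILL. The following is a small sketch with a hypothetical helper name that decodes a status:

```shell
# describe_exit_status CODE (hypothetical helper): explain a shell exit
# status. By convention a status above 128 means "terminated by signal
# CODE - 128", though a program could also exit with such a code itself.
describe_exit_status() {
    code=$1
    if [ "$code" -gt 128 ]; then
        sig=$((code - 128))
        # kill -l NUMBER prints the signal name without the SIG prefix.
        echo "killed by signal $sig (SIG$(kill -l "$sig"))"
    elif [ "$code" -eq 0 ]; then
        echo "exited normally"
    else
        echo "exited with error status $code"
    fi
}

# Demonstration: a shell that sends SIGKILL to itself.
rc=0
sh -c 'kill -9 $$' || rc=$?
describe_exit_status "$rc"
```

The demonstration should print "killed by signal 9 (SIGKILL)". The parent shell may also print its own "Killed" notice, much like the line in the Tractor log.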
                            <div>
                              <div><br>
                                <br>
                                <div>On 1/24/14 2:52 PM, Thomas Helmuth
                                  wrote:<br>
                                </div>
                                <blockquote type="cite">
                                  <div dir="ltr">
                                    <div>
                                      <div>No problem about the 1-6
                                        power cord.<br>
                                        <br>
                                      </div>
                                      Any idea what killed the other
                                      runs I mentioned in my second
                                      email? I might have to restart
                                      that whole set of runs to make
                                      sure the stats aren't affected,
                                      and it would be nice to know that
                                      they won't crash as well.<br>
                                      <br>
                                    </div>
                                    -Tom<br>
                                  </div>
                                  <div class="gmail_extra"><br>
                                    <br>
                                    <div class="gmail_quote">On Fri, Jan
                                      24, 2014 at 10:53 AM, Thomas
                                      Helmuth <span dir="ltr"><<a href="mailto:thelmuth@cs.umass.edu" target="_blank">thelmuth@cs.umass.edu</a>></span>
                                      wrote:<br>
                                      <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
                                        <div dir="ltr">
                                          <div>
                                            <div>
                                              <div>Hi Josiah,<br>
                                                <br>
                                              </div>
                                              Probably unrelated, but it
                                              looks like yesterday
                                              morning I had some runs
                                              stop before they were
                                              done, without printing any
                                              errors I can find. The
                                              only clue is that when I
                                              select the run in tractor,
                                              it says something like:<br>
                                              <br>
                                              /bin/sh: line 1: 23661
                                              Killed
                                              /share/apps/bin/lein
                                              with-profiles production
                                              trampoline run
                                              clojush.examples.wc
                                              :use-lexicase-selection
                                              false >
                                              ../Results/GECCO14/wc/tourney-max-points-1000-two/log1.txt
                                              <br>
                                              <br>
                                            </div>
                                            Most of that is just the
                                            shell command used to launch
                                            the run, but do you know
                                            what that beginning part
                                            means, or have any idea why
                                            these died quietly?<br>
                                            <br>
                                          </div>
                                          Thanks,<br>
                                          Tom<br>
                                        </div>
                                        <div>
                                          <div>
                                            <div class="gmail_extra"> <br>
                                              <br>
                                              <div class="gmail_quote">On
                                                Fri, Jan 24, 2014 at
                                                10:36 AM, Thomas Helmuth
                                                <span dir="ltr"><<a href="mailto:thelmuth@cs.umass.edu" target="_blank">thelmuth@cs.umass.edu</a>></span>
                                                wrote:<br>
                                                <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
                                                  <div dir="ltr">
                                                    <div>
                                                      <div>
                                                        <div>Hi Josiah,<br>
                                                          <br>
                                                        </div>
                                                        Sounds good! At
                                                        first I was
                                                        worried this may
                                                        have crashed
                                                        some of my runs,
                                                        but it looks
                                                        like they just
                                                        never started
                                                        when they tried
                                                        to run on
                                                        compute-1-18
                                                        around the time
                                                        of the NIMBY, so
                                                        it's not a big
                                                        deal.<br>
                                                        <br>
                                                      </div>
                                                      I did just have
                                                      two runs crash on
                                                      compute-1-6, any
                                                      idea if that's
                                                      related?<br>
                                                      <br>
                                                    </div>
                                                    <div>We are at
                                                      crunch time (paper
                                                      deadline Jan. 29),
                                                      and the runs I
                                                      have going are
                                                      pretty important
                                                      for that, so it
                                                      would be great to
                                                      have minimum
                                                      disruptions until
                                                      then!<br>
                                                    </div>
                                                    <div><br>
                                                    </div>
                                                    -Tom<br>
                                                  </div>
                                                  <div>
                                                    <div>
                                                      <div class="gmail_extra"><br>
                                                        <br>
                                                        <div class="gmail_quote">On

                                                          Fri, Jan 24,
                                                          2014 at 9:13
                                                          AM, Wm. Josiah
                                                          Erikson <span dir="ltr"><<a href="mailto:wjerikson@hampshire.edu" target="_blank">wjerikson@hampshire.edu</a>></span>
                                                          wrote:<br>
                                                          <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi
                                                          guys,<br>
                                                              I've got
                                                          compute-1-17 and
                                                          compute-1-18
                                                          NIMBYed for now
                                                          because I'm going
                                                          to physically move
                                                          them to rack 1 to
                                                          make room for 8 new
                                                          nodes! (Each node
                                                          has 2x X5560
                                                          quad-core 2.8 GHz
                                                          processors and
                                                          24 GB of RAM.)<br>
                                                              Just so
                                                          nobody un-NIMBYs
                                                          them.<span><font color="#888888"><br>
                                                          <br>
                                                          -- <br>
                                                          Wm. Josiah
                                                          Erikson<br>
                                                          Assistant
                                                          Director of
                                                          IT,
                                                          Infrastructure
                                                          Group<br>
                                                          System
                                                          Administrator,
                                                          School of CS<br>
                                                          Hampshire
                                                          College<br>
                                                          Amherst, MA
                                                          01002<br>
                                                          <a href="tel:%28413%29%20559-6091" value="+14135596091" target="_blank">(413)

                                                          559-6091</a><br>
                                                          <br>
_______________________________________________<br>
                                                          Clusterusers
                                                          mailing list<br>
                                                          <a href="mailto:Clusterusers@lists.hampshire.edu" target="_blank">Clusterusers@lists.hampshire.edu</a><br>
                                                          <a href="https://lists.hampshire.edu/mailman/listinfo/clusterusers" target="_blank">https://lists.hampshire.edu/mailman/listinfo/clusterusers</a><br>
                                                          </font></span></blockquote>
                                                        </div>
                                                        <br>
                                                      </div>
                                                    </div>
                                                  </div>
                                                </blockquote>
                                              </div>
                                              <br>
                                            </div>
                                          </div>
                                        </div>
                                      </blockquote>
                                    </div>
                                    <br>
                                  </div>
                                </blockquote>
                                <br>
                              </div>
                            </div>
                          </div>
                        </blockquote>
                      </div>
                      <br>
                    </div>
                  </blockquote>
                  <br>
                </div>
              </div>
            </div>
            <br>
            <br>
          </blockquote>
        </div>
        <br>
      </div>
      <br>
    </blockquote>
    <br>
  </div></div></div>

<br></blockquote></div><br></div>