<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  </head>
  <body text="#000000" bgcolor="#FFFFFF">
    <p>Hm, yeah, and then they do this one by one:</p>
    <p>top - 12:07:01 up 2 days, 19:41,  1 user,  load average: 2.86,
      0.65, 0.27<br>
      Tasks: 713 total,   1 running, 712 sleeping,   0 stopped,   0
      zombie<br>
      Cpu(s): 98.7%us,  0.1%sy,  0.0%ni,  1.2%id,  0.0%wa,  0.0%hi, 
      0.0%si,  0.0%st<br>
      Mem:  66012900k total,  7350780k used, 58662120k free,   157388k
      buffers<br>
      Swap:  2047996k total,        0k used,  2047996k free,   610008k
      cached<br>
      <br>
         PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+ 
      COMMAND          <br>
       63986 thelmuth  20   0 20.8g 4.2g  11m S 3161.1  6.6   6:27.23
      java            <br>
       63928 thelmuth  20   0 18.7g 330m  11m S  0.0  0.5   0:09.88
      java              <br>
       63780 thelmuth  20   0 18.7g 327m  11m S  0.0  0.5   0:09.61
      java              <br>
       64232 thelmuth  20   0 18.7g 252m  11m S  0.0  0.4   0:06.74
      java              <br>
       64053 thelmuth  20   0 18.7g 185m  11m S  0.0  0.3   0:05.91
      java              <br>
       64112 thelmuth  20   0 18.7g 182m  11m S  0.0  0.3   0:05.53
      java              <br>
       63838 thelmuth  20   0 18.7g 177m  11m S  0.0  0.3   0:05.95
      java              <br>
       64172 thelmuth  20   0 18.7g 173m  11m S  0.0  0.3   0:04.87
      java              <br>
    </p>
    <p>    -Josiah</p>
    <p><br>
    </p>
    <br>
    <div class="moz-cite-prefix">On 7/24/17 12:06 PM, Wm. Josiah Erikson
      wrote:<br>
    </div>
    <blockquote type="cite"
      cite="mid:22dc9ae6-4a1b-0e65-0ee1-c96a0ba9a19a@hampshire.edu">
      <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
      <p>I am seeing something similar here on Tom's Novelty Search |
        Distance: hamming | Super Anagrams - run 0 job (though it might
        happen on any job):</p>
      <p>-The node is not running out of RAM and there are no errors</p>
      <p>-However, before the job actually starts doing anything, there
        is a long (5-10 minute or longer) delay after initial memory
        allocation:</p>
      <p><br>
      </p>
      <p>top - 12:04:09 up 2 days, 19:38,  1 user,  load average: 0.00,
        0.11, 0.09<br>
        Tasks: 714 total,   1 running, 713 sleeping,   0 stopped,   0
        zombie<br>
        Cpu(s):  0.1%us,  0.1%sy,  0.0%ni, 99.9%id,  0.0%wa,  0.0%hi, 
        0.0%si,  0.0%st<br>
        Mem:  66012900k total,  3158436k used, 62854464k free,   157252k
        buffers<br>
        Swap:  2047996k total,        0k used,  2047996k free,   609584k
        cached<br>
        <br>
           PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+ 
        COMMAND          <br>
         63928 thelmuth  20   0 18.7g 330m  11m S  0.0  0.5   0:09.82
        java              <br>
         63986 thelmuth  20   0 18.7g 327m  11m S  0.0  0.5   0:10.52
        java              <br>
         64232 thelmuth  20   0 18.7g 252m  11m S  0.0  0.4   0:06.68
        java              <br>
         64053 thelmuth  20   0 18.7g 185m  11m S  0.0  0.3   0:05.85
        java              <br>
         64112 thelmuth  20   0 18.7g 182m  11m S  0.0  0.3   0:05.47
        java              <br>
         63838 thelmuth  20   0 18.7g 177m  11m S  0.0  0.3   0:05.90
        java              <br>
         64172 thelmuth  20   0 18.7g 173m  11m S  0.0  0.3   0:04.80
        java              <br>
         63780 thelmuth  20   0 18.7g 169m  11m S  0.0  0.3   0:05.52
        java              <br>
      </p>
      <p>Very little CPU is being used, the network isn't sluggish, and
        it's not out of RAM (though that's a LOT of VIRT for each job to
        claim... but Java just kinda seems to do that). Occasionally
        you'll see one of the processes use some CPU for a second, and
        in a few minutes they all start chugging away. Weird, and I'm
        not sure what's up at the moment, but it doesn't seem to
        actually harm anything - just weird.<br>
      </p>
      <p>    -Josiah</p>
      <p><br>
      </p>
      <br>
      <div class="moz-cite-prefix">On 7/19/17 10:55 AM, Wm. Josiah
        Erikson wrote:<br>
      </div>
      <blockquote type="cite"
        cite="mid:8410f858-dc8b-a67a-9055-81dae081bebf@hampshire.edu">
        <meta http-equiv="Content-Type" content="text/html;
          charset=utf-8">
        <p>I just saw one of Tom's jobs do the same thing (run rack0
          nodes out of RAM - not sure if any of the jobs actually
          crashed yet, but still...), so I did two things:</p>
        <p>1. Set the per-blade "sh" limit to 8 (previously unlimited)<br>
        </p>
        <p>2. Set the number of slots on Rack 0 nodes to 8 (previously
          16)<br>
        </p>
        <p>    -Josiah</p>
        <p><br>
        </p>
        <br>
        <div class="moz-cite-prefix">On 7/19/17 9:40 AM, Wm. Josiah
          Erikson wrote:<br>
        </div>
        <blockquote type="cite"
          cite="mid:c5cc0cfa-7e80-81c6-b1c9-0efc88e76b43@hampshire.edu">
          <meta http-equiv="Content-Type" content="text/html;
            charset=utf-8">
          <p>Also, some additional notes:</p>
          <p>1. I've set the "bash" limit tag to 8 per node. Tom seems
            to use sh instead of bash, so this won't affect him, I
            think.<br>
          </p>
          <p>2. If you want to do 4 per node instead, just add the limit
            tag "4pernode" to a RemoteCmd in the .alf file, like this:</p>
          <p>    RemoteCmd "bash -c 'env blah blah'" -tags "4pernode"
            -service "tom"</p>
          <p>3. There's also a 2pernode tag</p>
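          <p>For context, here's a minimal sketch of how such a limit
            tag fits into a complete job script, assuming Tractor's
            Tcl-style .alf syntax; the job title, task title, and
            command below are placeholders, not taken from an actual
            run:</p>

```tcl
# Sketch of an .alf job script; -tags attaches the per-node limit tag
# and -service selects which nodes may take the task. All names here
# (example-run, run-0, ./run.sh) are placeholders.
Job -title {example-run} -subtasks {
  Task -title {run-0} -cmds {
    RemoteCmd {bash -c "./run.sh"} -service {tom} -tags {4pernode}
  }
}
```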
          <p>    -Josiah</p>
          <p><br>
          </p>
          <br>
          <div class="moz-cite-prefix">On 7/19/17 9:30 AM, Wm. Josiah
            Erikson wrote:<br>
          </div>
          <blockquote type="cite"
            cite="mid:53149407-f293-d666-8194-a9cdb4f44b52@hampshire.edu">
            <meta http-equiv="Content-Type" content="text/html;
              charset=utf-8">
            <p>Yeah, it ran out of RAM - it probably tried to take 16
              jobs simultaneously. It's got 64GB of RAM, so it should
              be able to handle quite a few. It's currently running at
              under 1/3 RAM usage with 3 jobs, so 8 jobs should be
              safe. I'll limit "bash" limit-tag jobs (what yours
              default to, since you didn't set a tag explicitly and
              that's the command you're running) to 8 per node. I could
              lower it to 4 if you prefer. Let me know.<br>
            </p>
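            <p>The back-of-the-envelope math above can be sketched as a
              quick shell calculation; the ~7GB-per-job figure is an
              assumption read off the "under 1/3 RAM usage with 3 jobs"
              observation, not a measured value:</p>

```shell
# Rough per-node slot budget: how many jobs fit in RAM with headroom?
total_mb=$((64 * 1024))   # node RAM: 64GB (from this thread)
per_job_mb=7168           # assumed ~7GB resident per job (1/3 of RAM across 3 jobs)
headroom_mb=8192          # leave ~8GB for the OS and file cache
slots=$(( (total_mb - headroom_mb) / per_job_mb ))
echo "safe slots per node: $slots"
```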
            <p>    -Josiah</p>
            <p><br>
            </p>
            <br>
            <div class="moz-cite-prefix">On 7/19/17 8:38 AM, Thomas
              Helmuth wrote:<br>
            </div>
            <blockquote type="cite"
cite="mid:CABgVVjfn0FW28918bU4GDky0NCX-4kyv4+uApHwUcBzhWWd8NA@mail.gmail.com">
              <div dir="auto">Moving today, so I'll be brief.
                <div dir="auto"><br>
                </div>
                <div dir="auto">This sounds like the memory problems
                  we've had on rack 4. There, there are just too many
                  cores for the amount of memory, so if too many runs
                  start, none get enough memory. That's why rack 4 isn't
                  on the Tom or big go service tags. It probably means
                  we should remove rack 0 from those tags for now, until
                  we figure it out.</div>
                <div dir="auto"><br>
                </div>
                <div dir="auto">I've done runs on rack 4 by limiting
                  each run to 1GB of memory.</div>
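                <div dir="auto">A per-run cap like that can be set on
                  the JVM command line; a minimal sketch, where the jar
                  name and entry point are hypothetical placeholders
                  rather than Clojush's real ones:</div>

```shell
# Cap each run's Java heap so several jobs can coexist on one node.
# -Xmx bounds the maximum heap; -Xms pre-allocates it up front, so an
# over-committed node fails fast at startup instead of mid-run.
# The jar and namespace below are placeholders, not Clojush's actual names.
cmd="java -Xmx1g -Xms512m -jar clojush.jar my.problem.namespace"
echo "$cmd"
```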
                <div dir="auto"><br>
                </div>
                <div dir="auto">Tom</div>
              </div>
              <div class="gmail_extra"><br>
                <div class="gmail_quote">On Jul 19, 2017 7:46 AM, "Lee
                  Spector" <<a href="mailto:lspector@hampshire.edu"
                    moz-do-not-send="true">lspector@hampshire.edu</a>>
                  wrote:<br type="attribution">
                  <blockquote class="quote" style="margin:0 0 0
                    .8ex;border-left:1px #ccc solid;padding-left:1ex"><br>
                    I launched some runs this morning (a couple of hours
                    ago -- I'm in Europe) that behaved oddly. I don't
                    know if this is related to recent
                    upgrades/reconfiguration, or what, but I thought I'd
                    share them here.<br>
                    <br>
                    First, there were weird delays between launches and
                    the generation of the initial output. I'm accustomed
                    to delays of a couple of minutes at this point,
                    which I think we've determined (or just guessed) are
                    due to the necessity of re-downloading dependencies
                    on fly nodes, which doesn't occur when running on
                    other machines. But these took much longer, maybe
                    30-45 minutes or more (I'm not sure of the exact
                    times, but they were on that order). And weirder
                    still, there were long stretches during which many
                    runs had launched but only one was producing output
                    and making progress, with the others really kicking
                    in only after the early one(s) finished. I'd
                    understand this if the non-producing ones hadn't
                    launched at all because they were waiting for free
                    slots... but that's not what was happening.<br>
                    <br>
                    And then one crashed, on compute-0-1 (this was
                    launched with the "tom" service tag, using the
                    run-fly script in Clojush), with nothing in the .err
                    file (from standard error) but the following in the
                    .out file (from standard output):<br>
                    <br>
                    Producing offspring...<br>
                    Java HotSpot(TM) 64-Bit Server VM warning: INFO:
                    os::commit_memory(0x00000006f0e00000, 1059061760, 0)
                    failed; error='Cannot allocate memory' (errno=12)<br>
                    #<br>
                    # There is insufficient memory for the Java Runtime
                    Environment to continue.<br>
                    # Native memory allocation (malloc) failed to
                    allocate 1059061760 bytes for committing reserved
                    memory.<br>
                    # An error report file with more information is
                    saved as:<br>
                    # /home/lspector/runs/july19a-9235/Clojush/hs_err_pid57996.log<br>
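                    <br>
                    When os::commit_memory fails with errno=12 (ENOMEM),
                    the first thing worth checking is how much memory
                    the node could actually hand out at that moment; a
                    quick Linux-specific sketch using /proc/meminfo:<br>

```shell
# MemAvailable estimates how much a new allocation could get without
# pushing the node into swap; compare it to the ~1GB commit that failed.
avail_kb=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)
echo "available: $((avail_kb / 1024)) MB"
```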
                    <br>
                     -Lee<br>
                    <br>
                    --<br>
                    Lee Spector, Professor of Computer Science<br>
                    Director, Institute for Computational Intelligence<br>
                    Hampshire College, Amherst, Massachusetts, 01002,
                    USA<br>
                    <a href="mailto:lspector@hampshire.edu"
                      moz-do-not-send="true">lspector@hampshire.edu</a>,
                    <a href="http://hampshire.edu/lspector/"
                      rel="noreferrer" target="_blank"
                      moz-do-not-send="true">http://hampshire.edu/lspector/</a>,
                    <a href="tel:413-559-5352" value="+14135595352"
                      moz-do-not-send="true">413-559-5352</a><br>
                    <br>
                    _______________________________________________<br>
                    Clusterusers mailing list<br>
                    <a href="mailto:Clusterusers@lists.hampshire.edu"
                      moz-do-not-send="true">Clusterusers@lists.hampshire.edu</a><br>
                    <a
                      href="https://lists.hampshire.edu/mailman/listinfo/clusterusers"
                      rel="noreferrer" target="_blank"
                      moz-do-not-send="true">https://lists.hampshire.edu/mailman/listinfo/clusterusers</a><br>
                  </blockquote>
                </div>
                <br>
              </div>
              <br>
            </blockquote>
            <br>
            <pre class="moz-signature" cols="72">-- 
Wm. Josiah Erikson
Assistant Director of IT, Infrastructure Group
System Administrator, School of CS
Hampshire College
Amherst, MA 01002
(413) 559-6091
</pre>
            <br>
          </blockquote>
        </blockquote>
      </blockquote>
    </blockquote>
  </body>
</html>