<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<p>Hm, yeah, and then they start up one by one:</p>
<pre>
top - 12:07:01 up 2 days, 19:41,  1 user,  load average: 2.86, 0.65, 0.27
Tasks: 713 total,   1 running, 712 sleeping,   0 stopped,   0 zombie
Cpu(s): 98.7%us,  0.1%sy,  0.0%ni,  1.2%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  66012900k total,  7350780k used, 58662120k free,   157388k buffers
Swap:  2047996k total,        0k used,  2047996k free,   610008k cached

  PID USER      PR  NI  VIRT  RES  SHR S   %CPU %MEM    TIME+  COMMAND
63986 thelmuth  20   0 20.8g 4.2g  11m S 3161.1  6.6   6:27.23 java
63928 thelmuth  20   0 18.7g 330m  11m S    0.0  0.5   0:09.88 java
63780 thelmuth  20   0 18.7g 327m  11m S    0.0  0.5   0:09.61 java
64232 thelmuth  20   0 18.7g 252m  11m S    0.0  0.4   0:06.74 java
64053 thelmuth  20   0 18.7g 185m  11m S    0.0  0.3   0:05.91 java
64112 thelmuth  20   0 18.7g 182m  11m S    0.0  0.3   0:05.53 java
63838 thelmuth  20   0 18.7g 177m  11m S    0.0  0.3   0:05.95 java
64172 thelmuth  20   0 18.7g 173m  11m S    0.0  0.3   0:04.87 java
</pre>
<p> -Josiah</p>
<p><br>
</p>
<br>
<div class="moz-cite-prefix">On 7/24/17 12:06 PM, Wm. Josiah Erikson
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:22dc9ae6-4a1b-0e65-0ee1-c96a0ba9a19a@hampshire.edu">
<p>I am seeing something similar here on Tom's Novelty Search |
Distance: hamming | Super Anagrams - run 0 job (it would probably
happen on any job):</p>
<p>- The node is not running out of RAM and there are no errors.</p>
<p>- However, before the job actually starts doing anything, there is
a long (5-10 minute or longer) delay after the initial memory
allocation:</p>
<p><br>
</p>
<pre>
top - 12:04:09 up 2 days, 19:38,  1 user,  load average: 0.00, 0.11, 0.09
Tasks: 714 total,   1 running, 713 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.1%us,  0.1%sy,  0.0%ni, 99.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  66012900k total,  3158436k used, 62854464k free,   157252k buffers
Swap:  2047996k total,        0k used,  2047996k free,   609584k cached

  PID USER      PR  NI  VIRT  RES  SHR S   %CPU %MEM    TIME+  COMMAND
63928 thelmuth  20   0 18.7g 330m  11m S    0.0  0.5   0:09.82 java
63986 thelmuth  20   0 18.7g 327m  11m S    0.0  0.5   0:10.52 java
64232 thelmuth  20   0 18.7g 252m  11m S    0.0  0.4   0:06.68 java
64053 thelmuth  20   0 18.7g 185m  11m S    0.0  0.3   0:05.85 java
64112 thelmuth  20   0 18.7g 182m  11m S    0.0  0.3   0:05.47 java
63838 thelmuth  20   0 18.7g 177m  11m S    0.0  0.3   0:05.90 java
64172 thelmuth  20   0 18.7g 173m  11m S    0.0  0.3   0:04.80 java
63780 thelmuth  20   0 18.7g 169m  11m S    0.0  0.3   0:05.52 java
</pre>
<p>Very little CPU is being used, the network isn't sluggish, and the
node isn't out of RAM (though that's a LOT of virtual memory for each
job to claim... but Java just kind of seems to do that). Occasionally
one of the processes will use some CPU for a second, and after a few
minutes they all start chugging away. It's weird and I'm not sure
what's up at the moment, but it doesn't seem to actually harm
anything - just weird.<br>
</p>
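<p>(If it happens again, something along these lines might show what the
apparently idle JVMs are waiting on - assuming the JDK tools are installed
on the node, and using 63928 as a stand-in for one of the PIDs above:)</p>
<pre>
# Per-thread CPU usage for one of the stalled java processes
top -H -p 63928

# Dump the JVM's thread stacks to see where it is blocked
# (jstack ships with the JDK; run it as the process owner)
jstack 63928

# Alternative: SIGQUIT makes the JVM print a thread dump to its stdout
kill -3 63928
</pre>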
<p> -Josiah</p>
<p><br>
</p>
<br>
<div class="moz-cite-prefix">On 7/19/17 10:55 AM, Wm. Josiah
Erikson wrote:<br>
</div>
<blockquote type="cite"
cite="mid:8410f858-dc8b-a67a-9055-81dae081bebf@hampshire.edu">
<p>I just saw one of Tom's jobs do the same thing (run rack0
nodes out of RAM - not sure if any of the jobs actually
crashed yet, but still...), so I did two things:</p>
<p>1. Set the per-blade "sh" limit to 8 (previously unlimited)<br>
</p>
<p>2. Set the number of slots on Rack 0 nodes to 8 (previously
16)<br>
</p>
<p> -Josiah</p>
<p><br>
</p>
<br>
<div class="moz-cite-prefix">On 7/19/17 9:40 AM, Wm. Josiah
Erikson wrote:<br>
</div>
<blockquote type="cite"
cite="mid:c5cc0cfa-7e80-81c6-b1c9-0efc88e76b43@hampshire.edu">
<p>Also, some additional notes:</p>
<p>1. I've set the "bash" limit tag to 8 per node. Tom seems
to use sh instead of bash, so this won't affect him, I
think.<br>
</p>
<p>2. If you want to do 4 per node instead, just add the limit
tag "4pernode" to a RemoteCmd in the .alf file, like this:</p>
<p> RemoteCmd "bash -c 'env blah blah'" -tags "4pernode"
-service "tom"</p>
<p>3. There's also a 2pernode tag</p>
<p> -Josiah</p>
<p><br>
</p>
<br>
<div class="moz-cite-prefix">On 7/19/17 9:30 AM, Wm. Josiah
Erikson wrote:<br>
</div>
<blockquote type="cite"
cite="mid:53149407-f293-d666-8194-a9cdb4f44b52@hampshire.edu">
<p>Yeah, it ran out of RAM - it probably tried to take 16 jobs
simultaneously. It's got 64GB of RAM, so it should be able to handle
quite a few. It's currently running at under 1/3 RAM usage with 3
jobs, so 8 jobs should be safe. I'll limit jobs with the "bash" limit
tag (which is what yours default to, since you didn't set one
explicitly and that's the command you're running) to 8 per node. I
could lower it to 4 if you prefer. Let me know.<br>
</p>
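<p>(As a sanity check - this is an assumption about the JVM defaults,
not something I've confirmed on every node - HotSpot's default max heap
is typically 1/4 of physical RAM, i.e. roughly 16 GB on a 64 GB node,
which would explain 16 simultaneous Java runs blowing past physical
memory. Something like this would confirm it:)</p>
<pre>
# Print the max heap size the JVM would use by default on this node
java -XX:+PrintFlagsFinal -version | grep -i maxheapsize

# And check the node's actual memory headroom
free -m
</pre>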
<p> -Josiah</p>
<p><br>
</p>
<br>
<div class="moz-cite-prefix">On 7/19/17 8:38 AM, Thomas
Helmuth wrote:<br>
</div>
<blockquote type="cite"
cite="mid:CABgVVjfn0FW28918bU4GDky0NCX-4kyv4+uApHwUcBzhWWd8NA@mail.gmail.com">
<div dir="auto">Moving today, so I'll be brief.
<div dir="auto"><br>
</div>
<div dir="auto">This sounds like the memory problems
we've had on rack 4. There, there are just too many
cures for the amount of memory, so if too many runs
start, none get enough memory. That's why rack 4 isn't
on the Tom or big go service tags. Probably means we
should remove rack 0 from those tags for now until we
figure it out.</div>
<div dir="auto"><br>
</div>
<div dir="auto">I've done runs on rack 4 by limiting
each run to 1GB of memory.</div>
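<div dir="auto">(For anyone who wants to do the same, a minimal sketch -
the jar name is just a placeholder - of capping the heap so a run can't
claim more than 1 GB:)</div>
<pre>
# Hypothetical invocation: -Xmx1g caps this run's Java heap at 1 GB
java -Xmx1g -jar clojush-standalone.jar ...

# For Leiningen-launched runs the equivalent is
#   :jvm-opts ["-Xmx1g"]
# in project.clj.
</pre>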
<div dir="auto"><br>
</div>
<div dir="auto">Tom</div>
</div>
<div class="gmail_extra"><br>
<div class="gmail_quote">On Jul 19, 2017 7:46 AM, "Lee
Spector" <<a href="mailto:lspector@hampshire.edu"
moz-do-not-send="true">lspector@hampshire.edu</a>>
wrote:<br type="attribution">
<blockquote class="quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex"><br>
I launched some runs this morning (a couple of hours
ago -- I'm in Europe) that behaved oddly. I don't
know if this is related to recent
upgrades/reconfiguration, or what, but I thought I'd
share them here.<br>
<br>
First, there were weird delays between launches and
the generation of the initial output. I'm accustomed
to delays of a couple of minutes at this point,
which I think we've determined (or just guessed) are
due to the necessity of re-downloading dependencies
on fly nodes, which doesn't occur when running on
other machines. But these took much longer, maybe
30-45 minutes or longer (not sure of the exact
times, but they were on that order). Weirder still,
there were long stretches during which many runs had
launched but only one was producing output and making
progress, with the others really kicking in only after
the early one(s) finished. I'd understand this if the
non-producing ones hadn't been launched at all because
they were waiting for free slots... but that's not what
was happening.<br>
<br>
And then one crashed, on compute-0-1 (this was
launched with the "tom" service tag, using the
run-fly script in Clojush), with nothing in the .err
file (from standard error) but the following in the
.out file (from standard output):<br>
<br>
<pre>
Producing offspring...
Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x00000006f0e00000, 1059061760, 0) failed; error='Cannot allocate memory' (errno=12)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (malloc) failed to allocate 1059061760 bytes for committing reserved memory.
# An error report file with more information is saved as:
# /home/lspector/runs/july19a-9235/Clojush/hs_err_pid57996.log
</pre>
<br>
-Lee<br>
<br>
--<br>
Lee Spector, Professor of Computer Science<br>
Director, Institute for Computational Intelligence<br>
Hampshire College, Amherst, Massachusetts, 01002,
USA<br>
<a href="mailto:lspector@hampshire.edu"
moz-do-not-send="true">lspector@hampshire.edu</a>,
<a href="http://hampshire.edu/lspector/"
rel="noreferrer" target="_blank"
moz-do-not-send="true">http://hampshire.edu/lspector/</a><wbr>,
<a href="tel:413-559-5352" value="+14135595352"
moz-do-not-send="true">413-559-5352</a><br>
<br>
_______________________________________________<br>
Clusterusers mailing list<br>
<a href="mailto:Clusterusers@lists.hampshire.edu"
moz-do-not-send="true">Clusterusers@lists.hampshire.edu</a><br>
<a
href="https://lists.hampshire.edu/mailman/listinfo/clusterusers"
rel="noreferrer" target="_blank"
moz-do-not-send="true">https://lists.hampshire.edu/mailman/listinfo/clusterusers</a><br>
</blockquote>
</div>
<br>
</div>
</blockquote>
</blockquote>
</blockquote>
</blockquote>
</blockquote>
<br>
<pre class="moz-signature" cols="72">--
Wm. Josiah Erikson
Assistant Director of IT, Infrastructure Group
System Administrator, School of CS
Hampshire College
Amherst, MA 01002
(413) 559-6091
</pre>
</body>
</html>