<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<p>I am seeing something similar here with Tom's "Novelty Search |
Distance: hamming | Super Anagrams - run 0" job (though it would
probably happen on any job):</p>
<p>-The node is not running out of RAM and there are no errors</p>
<p>-However, before the job actually starts doing anything, there is
a long delay (5-10 minutes or longer) after the initial memory
allocation:</p>
<p><br>
</p>
<p>top - 12:04:09 up 2 days, 19:38, 1 user, load average: 0.00,
0.11, 0.09<br>
Tasks: 714 total, 1 running, 713 sleeping, 0 stopped, 0
zombie<br>
Cpu(s): 0.1%us, 0.1%sy, 0.0%ni, 99.9%id, 0.0%wa, 0.0%hi,
0.0%si, 0.0%st<br>
Mem: 66012900k total, 3158436k used, 62854464k free, 157252k
buffers<br>
Swap: 2047996k total, 0k used, 2047996k free, 609584k
cached<br>
<br>
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+
COMMAND <br>
63928 thelmuth 20 0 18.7g 330m 11m S 0.0 0.5 0:09.82
java <br>
63986 thelmuth 20 0 18.7g 327m 11m S 0.0 0.5 0:10.52
java <br>
64232 thelmuth 20 0 18.7g 252m 11m S 0.0 0.4 0:06.68
java <br>
64053 thelmuth 20 0 18.7g 185m 11m S 0.0 0.3 0:05.85
java <br>
64112 thelmuth 20 0 18.7g 182m 11m S 0.0 0.3 0:05.47
java <br>
63838 thelmuth 20 0 18.7g 177m 11m S 0.0 0.3 0:05.90
java <br>
64172 thelmuth 20 0 18.7g 173m 11m S 0.0 0.3 0:04.80
java <br>
63780 thelmuth 20 0 18.7g 169m 11m S 0.0 0.3 0:05.52
java <br>
</p>
<p>Very little CPU is being used, the network isn't sluggish, and the
node isn't out of RAM (though that's a LOT of VIRT for each job to
claim... but Java just kind of does that). Occasionally you'll see
one of the processes use some CPU for a second, and after a few
minutes they all start chugging away. I'm not sure what's going on
yet, but it doesn't seem to actually harm anything - it's just
weird.<br>
</p>
<p> -Josiah</p>
<br>
<div class="moz-cite-prefix">On 7/19/17 10:55 AM, Wm. Josiah Erikson
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:8410f858-dc8b-a67a-9055-81dae081bebf@hampshire.edu">
<p>I just saw one of Tom's jobs do the same thing (run the rack 0
nodes out of RAM - I'm not sure whether any of the jobs actually
crashed yet, but still...), so I did two things:</p>
<p>1. Set the per-blade "sh" limit to 8 (previously unlimited)<br>
</p>
<p>2. Set the number of slots on Rack 0 nodes to 8 (previously 16)<br>
</p>
<p> -Josiah</p>
<p><br>
</p>
<br>
<div class="moz-cite-prefix">On 7/19/17 9:40 AM, Wm. Josiah
Erikson wrote:<br>
</div>
<blockquote type="cite"
cite="mid:c5cc0cfa-7e80-81c6-b1c9-0efc88e76b43@hampshire.edu">
<p>Also, some additional notes:</p>
<p>1. I've set the "bash" limit tag to 8 per node. Tom seems to
use sh instead of bash, so this won't affect him, I think.<br>
</p>
<p>2. If you want to do 4 per node instead, just add the limit
tag "4pernode" to a RemoteCmd in the .alf file, like this (a fuller
sketch follows point 3 below):</p>
<p> RemoteCmd "bash -c 'env blah blah'" -tags "4pernode"
-service "tom"</p>
<p>3. There's also a 2pernode tag</p>
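<p>In case it's useful, here's a rough sketch of where such a tagged
RemoteCmd might sit in an Alfred/Tractor-style .alf job script. This
is only an illustration - the job and task titles and the wrapped
bash command are made-up placeholders, not taken from anyone's
actual script:</p>
<pre wrap="">## Hypothetical .alf sketch; only the -tags and -service options
## matter here - the titles and the bash command are placeholders.
## "4pernode" caps the command at 4 concurrent copies per node;
## "2pernode" would cap it at 2.
Job -title "example-clojush-run" -subtasks {
    Task "run 0" -cmds {
        RemoteCmd "bash -c 'env blah blah'" -tags "4pernode" -service "tom"
    }
}
</pre>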
<p> -Josiah</p>
<p><br>
</p>
<br>
<div class="moz-cite-prefix">On 7/19/17 9:30 AM, Wm. Josiah
Erikson wrote:<br>
</div>
<blockquote type="cite"
cite="mid:53149407-f293-d666-8194-a9cdb4f44b52@hampshire.edu">
<p>Yeah, it ran out of RAM - it probably tried to take 16 jobs
simultaneously. The node has 64GB of RAM, so it should be able to
handle quite a few at once. It's currently running at under 1/3
RAM usage with 3 jobs (roughly 7GB per job), so 8 jobs - around
56GB - should be safe. I'll limit jobs with the "bash" limit tag
(which is what yours default to, since you didn't set a tag
explicitly and bash is the command you're running) to 8 per node.
I could lower it to 4 if you prefer. Let me know.<br>
</p>
<p> -Josiah</p>
<p><br>
</p>
<br>
<div class="moz-cite-prefix">On 7/19/17 8:38 AM, Thomas
Helmuth wrote:<br>
</div>
<blockquote type="cite"
cite="mid:CABgVVjfn0FW28918bU4GDky0NCX-4kyv4+uApHwUcBzhWWd8NA@mail.gmail.com">
<div dir="auto">Moving today, so I'll be brief.
<div dir="auto"><br>
</div>
<div dir="auto">This sounds like the memory problems we've
had on rack 4. There, there are just too many cures for
the amount of memory, so if too many runs start, none
get enough memory. That's why rack 4 isn't on the Tom or
big go service tags. Probably means we should remove
rack 0 from those tags for now until we figure it out.</div>
<div dir="auto"><br>
</div>
<div dir="auto">I've done runs on rack 4 by limiting each
run to 1GB of memory.</div>
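<div dir="auto"><br>
</div>
<div dir="auto">(For reference, "limiting each run to 1GB" here means
capping the JVM heap. Below is a rough sketch of one way to do that;
the jar name is a placeholder, not the actual run-fly invocation:)</div>
<pre wrap="">## Hypothetical sketch: cap a run's JVM heap at 1GB with -Xmx.
## For Leiningen-launched runs, the equivalent knob is
##   :jvm-opts ["-Xmx1g"]
## in project.clj.
java -Xmx1g -jar your-run.jar   # placeholder jar/arguments
</pre>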
<div dir="auto"><br>
</div>
<div dir="auto">Tom</div>
</div>
<div class="gmail_extra"><br>
<div class="gmail_quote">On Jul 19, 2017 7:46 AM, "Lee
Spector" <<a href="mailto:lspector@hampshire.edu"
moz-do-not-send="true">lspector@hampshire.edu</a>>
wrote:<br type="attribution">
<blockquote class="quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex"><br>
I launched some runs this morning (a couple of hours
ago -- I'm in Europe) that behaved oddly. I don't know
if this is related to recent upgrades/reconfiguration,
or what, but I thought I'd share them here.<br>
<br>
First, there were weird delays between launches and
the generation of the initial output. I'm accustomed
to delays of a couple of minutes at this point, which
I think we've determined (or just guessed) are due to
the necessity of re-downloading dependencies on fly
nodes, which doesn't occur when running on other
machines. But these took much longer, maybe 30-45
minutes or longer (not sure of the exact times, but
they were on that order). And weirder still, there
were long stretches during which many runs had launched but
only one was producing output and making progress, with the
others really kicking in only after the early one(s)
finished. I'd understand this if the non-producing ones
hadn't been launched at all because they were waiting for
free slots... but that's not what was happening.<br>
<br>
And then one crashed, on compute-0-1 (this was
launched with the "tom" service tag, using the run-fly
script in Clojush), with nothing in the .err file
(from standard error) but the following in the .out
file (from standard output):<br>
<br>
Producing offspring...<br>
Java HotSpot(TM) 64-Bit Server VM warning: INFO:
os::commit_memory(0x00000006f0e00000, 1059061760,
0) failed; error='Cannot allocate memory' (errno=12)<br>
#<br>
# There is insufficient memory for the Java Runtime
Environment to continue.<br>
# Native memory allocation (malloc) failed to allocate
1059061760 bytes for committing reserved memory.<br>
# An error report file with more information is saved
as:<br>
# /home/lspector/runs/july19a-9235/Clojush/hs_err_pid57996.log<br>
<br>
-Lee<br>
<br>
--<br>
Lee Spector, Professor of Computer Science<br>
Director, Institute for Computational Intelligence<br>
Hampshire College, Amherst, Massachusetts, 01002, USA<br>
<a href="mailto:lspector@hampshire.edu"
moz-do-not-send="true">lspector@hampshire.edu</a>, <a
href="http://hampshire.edu/lspector/"
rel="noreferrer" target="_blank"
moz-do-not-send="true">http://hampshire.edu/lspector/</a><wbr>,
<a href="tel:413-559-5352" value="+14135595352"
moz-do-not-send="true">413-559-5352</a><br>
</blockquote>
</div>
<br>
</div>
<br>
<fieldset class="mimeAttachmentHeader"></fieldset>
<br>
<pre wrap="">_______________________________________________
Clusterusers mailing list
<a class="moz-txt-link-abbreviated" href="mailto:Clusterusers@lists.hampshire.edu" moz-do-not-send="true">Clusterusers@lists.hampshire.edu</a>
<a class="moz-txt-link-freetext" href="https://lists.hampshire.edu/mailman/listinfo/clusterusers" moz-do-not-send="true">https://lists.hampshire.edu/mailman/listinfo/clusterusers</a>
</pre>
</blockquote>
<br>
<pre class="moz-signature" cols="72">--
Wm. Josiah Erikson
Assistant Director of IT, Infrastructure Group
System Administrator, School of CS
Hampshire College
Amherst, MA 01002
(413) 559-6091
</pre>
<br>
<fieldset class="mimeAttachmentHeader"></fieldset>
<br>
<pre wrap="">_______________________________________________
Clusterusers mailing list
<a class="moz-txt-link-abbreviated" href="mailto:Clusterusers@lists.hampshire.edu" moz-do-not-send="true">Clusterusers@lists.hampshire.edu</a>
<a class="moz-txt-link-freetext" href="https://lists.hampshire.edu/mailman/listinfo/clusterusers" moz-do-not-send="true">https://lists.hampshire.edu/mailman/listinfo/clusterusers</a>
</pre>
</blockquote>
<br>
<pre class="moz-signature" cols="72">--
Wm. Josiah Erikson
Assistant Director of IT, Infrastructure Group
System Administrator, School of CS
Hampshire College
Amherst, MA 01002
(413) 559-6091
</pre>
</blockquote>
<br>
<pre class="moz-signature" cols="72">--
Wm. Josiah Erikson
Assistant Director of IT, Infrastructure Group
System Administrator, School of CS
Hampshire College
Amherst, MA 01002
(413) 559-6091
</pre>
</blockquote>
<br>
<pre class="moz-signature" cols="72">--
Wm. Josiah Erikson
Assistant Director of IT, Infrastructure Group
System Administrator, School of CS
Hampshire College
Amherst, MA 01002
(413) 559-6091
</pre>
</body>
</html>