<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<p>I just saw one of Tom's jobs do the same thing (run rack0 nodes
out of RAM - not sure if any of the jobs actually crashed yet, but
still...), so I did two things:</p>
<p>1. Set the per-blade "sh" limit to 8 (previously unlimited)<br>
</p>
<p>2. Set the number of slots on Rack 0 nodes to 8 (previously 16)<br>
</p>
<p> -Josiah</p>
<p><br>
</p>
<br>
<div class="moz-cite-prefix">On 7/19/17 9:40 AM, Wm. Josiah Erikson
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:c5cc0cfa-7e80-81c6-b1c9-0efc88e76b43@hampshire.edu">
<p>Also, some additional notes:</p>
<p>1. I've set the "bash" limit tag to 8 per node. Tom seems to
use sh instead of bash, so this won't affect him, I think.<br>
</p>
<p>2. If you want to do 4 per node instead, just add the limit tag
"4pernode" to a RemoteCmd in the .alf file, like this:</p>
<p> RemoteCmd {bash -c "env blah blah"} -tags {4pernode}
-service {tom}</p>
<p>3. There's also a 2pernode tag (see the sketch below).</p>
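<p>For reference, a minimal .alf job using one of these tags might
look like the sketch below. This is only a sketch: the job/task
titles and the bash command are placeholders, and your generated
.alf files may be laid out differently.</p>
<pre>
Job -title {example-job} -service {tom} -subtasks {
    Task {run-1} -cmds {
        RemoteCmd {bash -c "env FOO=bar ./my-run-script"} -service {tom} -tags {4pernode}
    }
}
</pre>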
<p> -Josiah</p>
<p><br>
</p>
<br>
<div class="moz-cite-prefix">On 7/19/17 9:30 AM, Wm. Josiah
Erikson wrote:<br>
</div>
<blockquote type="cite"
cite="mid:53149407-f293-d666-8194-a9cdb4f44b52@hampshire.edu">
<p>Yeah, it ran out of RAM - it probably tried to take 16 jobs
simultaneously. It's got 64GB of RAM, so it should be able to
handle quite a few. It's currently running at under 1/3 RAM usage
with 3 jobs, so 8 jobs should be safe. I'll limit jobs with the
"bash" limit tag (which is what yours default to, since you didn't
set one explicitly and that's the command you're running) to 8 per
node. I could lower it to 4 if you prefer. Let me know.<br>
</p>
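<p>(Rough math: a bit under a third of 64GB across 3 jobs is
roughly 7GB per job, so 8 jobs would be around 56GB and still
leave some headroom, while 16 jobs at that size would want well
over 100GB - which lines up with the crash.)</p>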
<p> -Josiah</p>
<p><br>
</p>
<br>
<div class="moz-cite-prefix">On 7/19/17 8:38 AM, Thomas Helmuth
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:CABgVVjfn0FW28918bU4GDky0NCX-4kyv4+uApHwUcBzhWWd8NA@mail.gmail.com">
<div dir="auto">Moving today, so I'll be brief.
<div dir="auto"><br>
</div>
<div dir="auto">This sounds like the memory problems we've
had on rack 4. There, there are just too many cures for
the amount of memory, so if too many runs start, none get
enough memory. That's why rack 4 isn't on the Tom or big
go service tags. Probably means we should remove rack 0
from those tags for now until we figure it out.</div>
<div dir="auto"><br>
</div>
<div dir="auto">I've done runs on rack 4 by limiting each
run to 1GB of memory.</div>
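<div dir="auto">For example, one way to cap a JVM-based run (like
Clojush) at a 1GB heap is to pass -Xmx to the JVM. This is only a
sketch - the classpath, namespace, and args below are placeholders,
and the actual run-fly invocation may differ:</div>
<pre>
# hypothetical command line - adapt to however your runs are launched
bash -c "java -Xmx1g -cp CLOJUSH_CLASSPATH clojure.main -m SOME_NAMESPACE ARGS"
</pre>
<div dir="auto">For Leiningen-based runs, the equivalent is adding
:jvm-opts ["-Xmx1g"] to project.clj.</div>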
<div dir="auto"><br>
</div>
<div dir="auto">Tom</div>
</div>
<div class="gmail_extra"><br>
<div class="gmail_quote">On Jul 19, 2017 7:46 AM, "Lee
Spector" <<a href="mailto:lspector@hampshire.edu"
moz-do-not-send="true">lspector@hampshire.edu</a>>
wrote:<br type="attribution">
<blockquote class="quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex"><br>
I launched some runs this morning (a couple of hours ago
-- I'm in Europe) that behaved oddly. I don't know if
this is related to recent upgrades/reconfiguration, or
what, but I thought I'd share them here.<br>
<br>
First, there were weird delays between launches and the
generation of the initial output. I'm accustomed to
delays of a couple of minutes at this point, which I
think we've determined (or just guessed) are due to the
necessity of re-downloading dependencies on fly nodes,
which doesn't occur when running on other machines. But
these took much longer, maybe 30-45 minutes or longer
(not sure of the exact times, but they were on that
order). And weirder still, for long stretches many runs had
launched but only one was producing output and making progress,
with the others really kicking in only after the early one(s)
finished.
I'd understand this if the non-producing ones weren't
launched at all, because they were waiting for free
slots.... but that's not what was happening.<br>
<br>
And then one crashed, on compute-0-1 (this was launched
with the "tom" service tag, using the run-fly script in
Clojush), with nothing in the .err file (from standard
error) but the following in the .out file (from standard
output):<br>
<br>
Producing offspring...<br>
Java HotSpot(TM) 64-Bit Server VM warning: INFO:
os::commit_memory(0x00000006f0e00000, 1059061760,
0) failed; error='Cannot allocate memory' (errno=12)<br>
#<br>
# There is insufficient memory for the Java Runtime
Environment to continue.<br>
# Native memory allocation (malloc) failed to allocate
1059061760 bytes for committing reserved memory.<br>
# An error report file with more information is saved
as:<br>
# /home/lspector/runs/july19a-9235/Clojush/hs_err_pid57996.log<br>
<br>
-Lee<br>
<br>
--<br>
Lee Spector, Professor of Computer Science<br>
Director, Institute for Computational Intelligence<br>
Hampshire College, Amherst, Massachusetts, 01002, USA<br>
<a href="mailto:lspector@hampshire.edu"
moz-do-not-send="true">lspector@hampshire.edu</a>, <a
href="http://hampshire.edu/lspector/" rel="noreferrer"
target="_blank" moz-do-not-send="true">http://hampshire.edu/lspector/</a><wbr>,
<a href="tel:413-559-5352" value="+14135595352"
moz-do-not-send="true">413-559-5352</a><br>
<br>
_______________________________________________<br>
Clusterusers mailing list<br>
<a href="mailto:Clusterusers@lists.hampshire.edu"
moz-do-not-send="true">Clusterusers@lists.hampshire.<wbr>edu</a><br>
<a
href="https://lists.hampshire.edu/mailman/listinfo/clusterusers"
rel="noreferrer" target="_blank"
moz-do-not-send="true">https://lists.hampshire.edu/<wbr>mailman/listinfo/clusterusers</a><br>
</blockquote>
</div>
<br>
</div>
<br>
<fieldset class="mimeAttachmentHeader"></fieldset>
<br>
<pre wrap="">_______________________________________________
Clusterusers mailing list
<a class="moz-txt-link-abbreviated" href="mailto:Clusterusers@lists.hampshire.edu" moz-do-not-send="true">Clusterusers@lists.hampshire.edu</a>
<a class="moz-txt-link-freetext" href="https://lists.hampshire.edu/mailman/listinfo/clusterusers" moz-do-not-send="true">https://lists.hampshire.edu/mailman/listinfo/clusterusers</a>
</pre>
</blockquote>
<br>
</blockquote>
<br>
<pre class="moz-signature" cols="72">--
Wm. Josiah Erikson
Assistant Director of IT, Infrastructure Group
System Administrator, School of CS
Hampshire College
Amherst, MA 01002
(413) 559-6091
</pre>
</blockquote>
<br>
<pre class="moz-signature" cols="72">--
Wm. Josiah Erikson
Assistant Director of IT, Infrastructure Group
System Administrator, School of CS
Hampshire College
Amherst, MA 01002
(413) 559-6091
</pre>
</body>
</html>