[Clusterusers] sluggishness/errors on new fly launches

Wm. Josiah Erikson wjerikson at hampshire.edu
Wed Jul 19 09:30:00 EDT 2017


Yeah, it ran out of RAM - it probably tried to take 16 jobs simultaneously.
It's got 64GB of RAM, so it should be able to handle quite a few. It's
currently running at under 1/3 RAM usage with 3 jobs, so 8 jobs should
be safe. I'll cap jobs with the "bash" limit tag (which yours default to,
since you didn't set one explicitly and bash is the command you're
running) at 8 per node. I could lower it to 4 if you prefer. Let me know.
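
For anyone who wants to check that estimate, here is a rough sketch in Python. It uses only the figures above (64GB, 3 jobs, under 1/3 usage) and assumes, which is my assumption and not something measured, that memory use scales linearly with the number of jobs. The function name is just for illustration.

```python
# Rough per-node headroom estimate; all figures come from the message above.
NODE_RAM_GB = 64            # RAM on the node
OBSERVED_JOBS = 3           # jobs running when usage was checked
OBSERVED_FRACTION = 1 / 3   # upper bound on the RAM fraction in use


def estimated_usage_gb(jobs: int) -> float:
    """Upper-bound RAM estimate, ASSUMING usage grows linearly with job count."""
    per_job_gb = NODE_RAM_GB * OBSERVED_FRACTION / OBSERVED_JOBS
    return jobs * per_job_gb


if __name__ == "__main__":
    for jobs in (4, 8, 16):
        usage = estimated_usage_gb(jobs)
        verdict = "fits" if usage <= NODE_RAM_GB else "exceeds RAM"
        print(f"{jobs:2d} jobs: ~{usage:.1f} GB of {NODE_RAM_GB} GB ({verdict})")
```

By this estimate 8 jobs stay under 64GB and 16 do not, but runs whose heaps grow over time could still use more than what the 3 current jobs show.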

    -Josiah



On 7/19/17 8:38 AM, Thomas Helmuth wrote:
> Moving today, so I'll be brief.
>
> This sounds like the memory problems we've had on rack 4. There, there
> are just too many cores for the amount of memory, so if too many runs
> start, none get enough memory. That's why rack 4 isn't on the tom or
> big go service tags. Probably means we should remove rack 0 from those
> tags for now until we figure it out.
>
> I've done runs on rack 4 by limiting each run to 1GB of memory.
>
> Tom
>
> On Jul 19, 2017 7:46 AM, "Lee Spector" <lspector at hampshire.edu> wrote:
>
>
>     I launched some runs this morning (a couple of hours ago -- I'm in
>     Europe) that behaved oddly. I don't know if this is related to
>     recent upgrades/reconfiguration, or what, but I thought I'd share
>     them here.
>
>     First, there were weird delays between launches and the generation
>     of the initial output. I'm accustomed to delays of a couple of
>     minutes at this point, which I think we've determined (or just
>     guessed) are due to the necessity of re-downloading dependencies
>     on fly nodes, which doesn't occur when running on other machines.
>     But these took much longer, maybe 30-45 minutes or longer (not
>     sure of the exact times, but they were on that order). And weirder
>     still, there were long times during which many had launched but
>     only one was producing output and proceeding for a long time, with
>     others really kicking in only after the early one(s) finished. I'd
>     understand this if the non-producing ones weren't launched at all,
>     because they were waiting for free slots.... but that's not what
>     was happening.
>
>     And then one crashed, on compute-0-1 (this was launched with the
>     "tom" service tag, using the run-fly script in Clojush), with
>     nothing in the .err file (from standard error) but the following
>     in the .out file (from standard output):
>
>     Producing offspring...
>     Java HotSpot(TM) 64-Bit Server VM warning: INFO:
>     os::commit_memory(0x00000006f0e00000, 1059061760, 0) failed;
>     error='Cannot allocate memory' (errno=12)
>     #
>     # There is insufficient memory for the Java Runtime Environment to
>     continue.
>     # Native memory allocation (malloc) failed to allocate 1059061760
>     bytes for committing reserved memory.
>     # An error report file with more information is saved as:
>     # /home/lspector/runs/july19a-9235/Clojush/hs_err_pid57996.log
>
>      -Lee
>
>     --
>     Lee Spector, Professor of Computer Science
>     Director, Institute for Computational Intelligence
>     Hampshire College, Amherst, Massachusetts, 01002, USA
>     lspector at hampshire.edu,
>     http://hampshire.edu/lspector/, 413-559-5352
>
>     _______________________________________________
>     Clusterusers mailing list
>     Clusterusers at lists.hampshire.edu
>     https://lists.hampshire.edu/mailman/listinfo/clusterusers

-- 
Wm. Josiah Erikson
Assistant Director of IT, Infrastructure Group
System Administrator, School of CS
Hampshire College
Amherst, MA 01002
(413) 559-6091
