[Clusterusers] sluggishness/errors on new fly launches

Thomas Helmuth trhtom at gmail.com
Wed Jul 19 08:38:26 EDT 2017


Moving today, so I'll be brief.

This sounds like the memory problems we've had on rack 4. The nodes there
have too many cores for the amount of memory, so if too many runs start at
once, none of them gets enough memory. That's why rack 4 isn't on the "tom"
or big go service tags. It probably means we should remove rack 0 from
those tags until we figure it out.

I've done runs on rack 4 by limiting each run to 1GB of memory.
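
To make the arithmetic behind that concrete, here is a minimal sketch. The node sizes are hypothetical (the real core counts and memory for rack 0 and rack 4 aren't given in this thread), and both function names are made up for illustration:

```python
# Rough sketch of why "too many cores for the amount of memory" starves runs.
# All node figures used below are HYPOTHETICAL placeholders, not the
# cluster's actual specs.

def memory_per_run_gb(node_memory_gb: float, concurrent_runs: int) -> float:
    """Memory each run would get if the node's RAM were split evenly."""
    if concurrent_runs <= 0:
        raise ValueError("concurrent_runs must be positive")
    return node_memory_gb / concurrent_runs


def max_runs_that_fit(node_memory_gb: float,
                      per_run_cap_gb: float,
                      reserved_gb: float = 0.0) -> int:
    """Number of runs that fit if each run is capped at per_run_cap_gb.

    reserved_gb is memory set aside for the OS and other processes.
    """
    if per_run_cap_gb <= 0:
        raise ValueError("per_run_cap_gb must be positive")
    usable_gb = node_memory_gb - reserved_gb
    return max(0, int(usable_gb // per_run_cap_gb))


if __name__ == "__main__":
    # Hypothetical node: 32 cores sharing 16 GB of RAM.
    print(memory_per_run_gb(16, 32))                 # one run per core
    print(max_runs_that_fit(16, 1.0, reserved_gb=2)) # with a 1 GB cap per run
```

For JVM runs, one common way to impose a cap like this is the standard `-Xmx` heap option (for example `-Xmx1g`). Note that `-Xmx` bounds only the Java heap, so the whole process can use somewhat more than that.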

Tom

On Jul 19, 2017 7:46 AM, "Lee Spector" <lspector at hampshire.edu> wrote:


I launched some runs this morning (a couple of hours ago -- I'm in Europe)
that behaved oddly. I don't know if this is related to recent
upgrades/reconfiguration, or what, but I thought I'd share them here.

First, there were weird delays between launches and the generation of the
initial output. I'm accustomed to delays of a couple of minutes at this
point, which I think we've determined (or just guessed) are due to the
necessity of re-downloading dependencies on fly nodes, which doesn't occur
when running on other machines. But these took much longer, maybe 30-45
minutes or longer (not sure of the exact times, but they were on that
order). Weirder still, there were long stretches during which many runs had
launched but only one was producing output and making progress, with the
others really kicking in only after the early one(s) finished. I'd
understand this if the non-producing ones hadn't launched at all because
they were waiting for free slots, but that's not what was happening.

And then one crashed, on compute-0-1 (this was launched with the "tom"
service tag, using the run-fly script in Clojush), with nothing in the .err
file (from standard error) but the following in the .out file (from
standard output):

Producing offspring...
Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x00000006f0e00000, 1059061760, 0) failed; error='Cannot allocate memory' (errno=12)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (malloc) failed to allocate 1059061760 bytes for committing reserved memory.
# An error report file with more information is saved as:
# /home/lspector/runs/july19a-9235/Clojush/hs_err_pid57996.log
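
For anyone scanning .out files for this failure, a small sketch of pulling the numbers out of that warning follows. The function name and the returned field names are invented for illustration, and the pattern is only known to match the message exactly as quoted above:

```python
import re

# Pattern for the HotSpot warning quoted above. \s* also matches newlines,
# so it still works when the mail client has wrapped the line.
COMMIT_FAILURE = re.compile(
    r"os::commit_memory\((0x[0-9a-fA-F]+),\s*(\d+),\s*(\d+)\)\s*failed;"
    r"\s*error='([^']*)'\s*\(errno=(\d+)\)"
)


def parse_commit_failure(text):
    """Return details of the first failed commit_memory call, or None."""
    match = COMMIT_FAILURE.search(text)
    if match is None:
        return None
    address, size, _third_arg, message, errno = match.groups()
    size_bytes = int(size)
    return {
        "address": address,
        "bytes": size_bytes,
        "mib": size_bytes / 2**20,  # bytes -> MiB
        "message": message,
        "errno": int(errno),
    }


if __name__ == "__main__":
    # The warning as it appears (wrapped) in the .out file above.
    log = (
        "Java HotSpot(TM) 64-Bit Server VM warning: INFO:\n"
        "os::commit_memory(0x00000006f0e00000,\n"
        "1059061760, 0) failed; error='Cannot allocate memory' (errno=12)\n"
    )
    info = parse_commit_failure(log)
    print(info["mib"], info["errno"])  # 1010.0 12
```

The failed request works out to exactly 1010 MiB, just under 1 GiB, and errno 12 is ENOMEM on Linux: the operating system declined to hand over the memory.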

 -Lee

--
Lee Spector, Professor of Computer Science
Director, Institute for Computational Intelligence
Hampshire College, Amherst, Massachusetts, 01002, USA
lspector at hampshire.edu, http://hampshire.edu/lspector/, 413-559-5352
