[Clusterusers] sluggishness/errors on new fly launches

Wed Jul 19 11:13:04 EDT 2017

Sounds great, thanks!

On Jul 19, 2017 10:55 AM, "Wm. Josiah Erikson" <wjerikson at hampshire.edu>
wrote:

> I just saw one of Tom's jobs do the same thing (run rack0 nodes out of RAM
> - not sure if any of the jobs actually crashed yet, but still...), so I did
> two things:
>
> 1. Set the per-blade "sh" limit to 8 (previously unlimited)
>
> 2. Set the number of slots on Rack 0 nodes to 8 (previously 16)
>
>     -Josiah
>
>
>
> On 7/19/17 9:40 AM, Wm. Josiah Erikson wrote:
>
> Also, some additional notes:
>
> 1. I've set the "bash" limit tag to 8 per node. Tom seems to use sh
> instead of bash, so this won't affect him, I think.
>
> 2. If you want to do 4 per node instead, just add the limit tag "4pernode"
> to a RemoteCmd in the .alf file, like this:
>
>     RemoteCmd {bash -c "env blah blah"} -tags "4pernode" -service "tom"
>
> 3. There's also a 2pernode tag
>
>     -Josiah
>
>
>
> On 7/19/17 9:30 AM, Wm. Josiah Erikson wrote:
>
> Yeah, it ran out of RAM - probably tried to take 16 jobs simultaneously.
> It's got 64GB of RAM, so it should be able to do quite a few. It's currently
> running at under 1/3 RAM usage with 3 jobs, so 8 jobs should be safe. I'll
> limit jobs with the "bash" limit tag (what yours default to, since you
> didn't set one explicitly and bash is the command you're running) to 8 per
> node. I could lower it to 4 if you prefer. Let me know.
>
>     -Josiah
>
>
>
> On 7/19/17 8:38 AM, Thomas Helmuth wrote:
>
> Moving today, so I'll be brief.
>
> This sounds like the memory problems we've had on rack 4. There, there are
> just too many cores for the amount of memory, so if too many runs start,
> none get enough memory. That's why rack 4 isn't on the Tom or big go
> service tags. Probably means we should remove rack 0 from those tags for
> now until we figure it out.
>
> I've done runs on rack 4 by limiting each run to 1GB of memory.
>
> Tom
>
> On Jul 19, 2017 7:46 AM, "Lee Spector" <lspector at hampshire.edu> wrote:
>
>
> I launched some runs this morning (a couple of hours ago -- I'm in Europe)
> that behaved oddly. I don't know if this is related to recent
> upgrades/reconfiguration, or what, but I thought I'd share them here.
>
> First, there were weird delays between launches and the generation of the
> initial output. I'm accustomed to delays of a couple of minutes at this
> point, which I think we've determined (or just guessed) are due to the
> necessity of re-downloading dependencies on fly nodes, which doesn't occur
> when running on other machines. But these took much longer, maybe 30-45
> minutes or longer (not sure of the exact times, but they were on that
> order). And weirder still, there were long times during which many had
> launched but only one was producing output and proceeding for a long time,
> with others really kicking in only after the early one(s) finished. I'd
> understand this if the non-producing ones weren't launched at all, because
> they were waiting for free slots.... but that's not what was happening.
>
> And then one crashed, on compute-0-1 (this was launched with the "tom"
> service tag, using the run-fly script in Clojush), with nothing in the .err
> file (from standard error) but the following in the .out file (from
> standard output):
>
> Producing offspring...
> Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x00000006f0e00000, 1059061760, 0) failed; error='Cannot allocate memory' (errno=12)
> #
> # There is insufficient memory for the Java Runtime Environment to continue.
> # Native memory allocation (malloc) failed to allocate 1059061760 bytes for committing reserved memory.
> # An error report file with more information is saved as:
> # /home/lspector/runs/july19a-9235/Clojush/hs_err_pid57996.log
>
>  -Lee
>
> --
> Lee Spector, Professor of Computer Science
> Director, Institute for Computational Intelligence
> Hampshire College, Amherst, Massachusetts, 01002, USA
> lspector at hampshire.edu, http://hampshire.edu/lspector/, 413-559-5352
>
> _______________________________________________
> Clusterusers mailing list
> Clusterusers at lists.hampshire.edu
> https://lists.hampshire.edu/mailman/listinfo/clusterusers
>
> --
> Wm. Josiah Erikson
> Assistant Director of IT, Infrastructure Group
> System Administrator, School of CS
> Hampshire College
> Amherst, MA 01002
> (413) 559-6091
>
>
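The memory reasoning in the thread above (64GB per rack 0 node, under 1/3 of RAM in use with 3 jobs, 16 slots originally) can be checked with a quick back-of-the-envelope calculation. This is only a rough sketch: it assumes memory use scales linearly with the number of jobs and that the 3 observed jobs were typical, neither of which is guaranteed for JVM runs. The function and variable names are made up for illustration.

```python
# Rough per-node memory check using the figures quoted in the thread.
# Assumption (not from the thread): each job uses at most the per-job
# average observed when 3 jobs were running, and usage scales linearly.

NODE_RAM_GB = 64           # RAM per rack 0 node, as stated in the thread
OBSERVED_JOBS = 3          # jobs running when usage was checked
OBSERVED_FRACTION = 1 / 3  # "under 1/3 RAM usage", treated as an upper bound


def per_job_upper_bound_gb(node_ram_gb, jobs, fraction_used):
    """Upper bound on RAM per job, given total usage at a known job count."""
    return node_ram_gb * fraction_used / jobs


def fits(node_ram_gb, jobs, per_job_gb):
    """True if `jobs` jobs at `per_job_gb` each fit in the node's RAM."""
    return jobs * per_job_gb <= node_ram_gb


per_job = per_job_upper_bound_gb(NODE_RAM_GB, OBSERVED_JOBS, OBSERVED_FRACTION)

for slots in (4, 8, 16):
    needed = slots * per_job
    ok = fits(NODE_RAM_GB, slots, per_job)
    print(f"{slots:2d} jobs: up to {needed:.1f} GB needed, fits in {NODE_RAM_GB} GB: {ok}")
```

Under these assumptions, 4 or 8 jobs per node stay within 64GB while 16 do not, which is consistent with the out-of-memory crash and with lowering the limit to 8.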