[Clusterusers] sluggishness/errors on new fly launches

Wed Jul 19 10:55:20 EDT 2017

I just saw one of Tom's jobs do the same thing (run rack0 nodes out of
RAM - not sure if any of the jobs actually crashed yet, but still...),
so I did two things:

1. Set the per-blade "sh" limit to 8 (previously unlimited)

2. Set the number of slots on Rack 0 nodes to 8 (previously 16)

    -Josiah
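
P.S. For anyone who wants to check the arithmetic behind the 8-slot figure, here is a minimal sketch. It assumes memory use grows roughly linearly with the number of jobs and takes the "under 1/3 of 64GB with 3 jobs" observation from earlier in this thread as an upper bound. The function name and the 10% headroom are illustrative assumptions of mine, not cluster settings.

```python
# Rough slot estimate for a rack 0 node, using figures quoted in this thread.
# Assumption: memory use scales roughly linearly with the number of jobs.

def estimate_safe_slots(total_gb, observed_jobs, observed_fraction, headroom=0.9):
    """Return (slots, per_job_gb) if every job uses as much as the observed ones."""
    per_job_gb = total_gb * observed_fraction / observed_jobs
    usable_gb = total_gb * headroom  # hypothetical margin left for the OS
    return int(usable_gb // per_job_gb), per_job_gb

slots, per_job = estimate_safe_slots(total_gb=64, observed_jobs=3,
                                     observed_fraction=1 / 3)
print(f"about {per_job:.1f} GB per job, {slots} jobs fit")
print(f"16 jobs would need about {16 * per_job:.0f} GB")
```

Under those assumptions 8 jobs fit on a node, while 16 simultaneous jobs would want well over the 64GB available, which matches what we saw.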

On 7/19/17 9:40 AM, Wm. Josiah Erikson wrote:
>
> Also, some additional notes:
>
> 1. I've set the "bash" limit tag to 8 per node. Tom seems to use sh
> instead of bash, so this won't affect him, I think.
>
> 2. If you want to do 4 per node instead, just add the limit tag
> "4pernode" to a RemoteCmd in the .alf file, like this:
>
>     RemoteCmd {bash -c "env blah blah"} -tags {4pernode} -service {tom}
>
> 3. There's also a 2pernode tag
>
>     -Josiah
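
If you generate your .alf files from a script, the tag choice above could be wrapped up roughly like this. This is only a sketch: `remote_cmd` is a hypothetical helper, and the brace quoting assumes the .alf file follows the usual Tcl-style syntax. The tag names ("4pernode", "2pernode") and the "tom" service are the ones mentioned in this thread.

```python
def remote_cmd(command, limit_tag, service):
    """Format one RemoteCmd line with a per-node limit tag and a service key.

    Assumption: the .alf file accepts Tcl-style brace quoting for arguments.
    """
    return f"RemoteCmd {{{command}}} -tags {{{limit_tag}}} -service {{{service}}}"

# Same command as in the example above, limited to 4 jobs per node.
line = remote_cmd('bash -c "env blah blah"', "4pernode", "tom")
print(line)
```

Swapping "4pernode" for "2pernode" gives the stricter limit.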
>
>
>
> On 7/19/17 9:30 AM, Wm. Josiah Erikson wrote:
>>
>> Yeah, it ran out of RAM - probably tried to take 16 jobs
>> simultaneously. It's got 64GB of RAM, so should be able to do quite a
>> few. It's currently running at under 1/3 RAM usage with 3 jobs, so 8
>> jobs should be safe. I'll limit "bash" limit tag jobs (what yours
>> default to since you didn't set one explicitly and that's the command
>> you're running) to 8 per node. I could lower it to 4 if you prefer.
>> Let me know.
>>
>>     -Josiah
>>
>>
>>
>> On 7/19/17 8:38 AM, Thomas Helmuth wrote:
>>> Moving today, so I'll be brief.
>>>
>>> This sounds like the memory problems we've had on rack 4. There,
>>> there are just too many cores for the amount of memory, so if too
>>> many runs start, none get enough memory. That's why rack 4 isn't on
>>> the Tom or big go service tags. Probably means we should remove rack
>>> 0 from those tags for now until we figure it out.
>>>
>>> I've done runs on rack 4 by limiting each run to 1GB of memory.
>>>
>>> Tom
>>>
>>> On Jul 19, 2017 7:46 AM, "Lee Spector" <lspector at hampshire.edu>
>>> wrote:
>>>
>>>
>>>     I launched some runs this morning (a couple of hours ago -- I'm
>>>     in Europe) that behaved oddly. I don't know if this is related
>>>     to recent upgrades/reconfiguration, or what, but I thought I'd
>>>     share them here.
>>>
>>>     First, there were weird delays between launches and the
>>>     generation of the initial output. I'm accustomed to delays of a
>>>     couple of minutes at this point, which I think we've determined
>>>     (or just guessed) are due to the necessity of re-downloading
>>>     dependencies on fly nodes, which doesn't occur when running on
>>>     other machines. But these took much longer, maybe 30-45 minutes
>>>     or longer (not sure of the exact times, but they were on that
>>>     order). And weirder still, there were long times during which
>>>     many had launched but only one was producing output and
>>>     proceeding for a long time, with others really kicking in only
>>>     after the early one(s) finished. I'd understand this if the
>>>     non-producing ones weren't launched at all, because they were
>>>     waiting for free slots.... but that's not what was happening.
>>>
>>>     And then one crashed, on compute-0-1 (this was launched with the
>>>     "tom" service tag, using the run-fly script in Clojush), with
>>>     nothing in the .err file (from standard error) but the following
>>>     in the .out file (from standard output):
>>>
>>>     Producing offspring...
>>>     Java HotSpot(TM) 64-Bit Server VM warning: INFO:
>>>     os::commit_memory(0x00000006f0e00000, 1059061760, 0) failed;
>>>     error='Cannot allocate memory' (errno=12)
>>>     #
>>>     # There is insufficient memory for the Java Runtime Environment
>>>     to continue.
>>>     # Native memory allocation (malloc) failed to allocate
>>>     1059061760 bytes for committing reserved memory.
>>>     # An error report file with more information is saved as:
>>>     # /home/lspector/runs/july19a-9235/Clojush/hs_err_pid57996.log
>>>
>>>      -Lee
>>>
>>>     --
>>>     Lee Spector, Professor of Computer Science
>>>     Director, Institute for Computational Intelligence
>>>     Hampshire College, Amherst, Massachusetts, 01002, USA
>>>     lspector at hampshire.edu, http://hampshire.edu/lspector/,
>>>     413-559-5352
>>>
>>>     _______________________________________________
>>>     Clusterusers mailing list
>>>     Clusterusers at lists.hampshire.edu
>>>     https://lists.hampshire.edu/mailman/listinfo/clusterusers
>>>
>>>
>>>
>>>
>>
>> -- 
>> Wm. Josiah Erikson
>> Assistant Director of IT, Infrastructure Group
>> System Administrator, School of CS
>> Hampshire College
>> Amherst, MA 01002
>> (413) 559-6091
>>
>>

