[Clusterusers] sluggishness/errors on new fly launches
Wm. Josiah Erikson
wjerikson at hampshire.edu
Mon Jul 24 12:06:37 EDT 2017
I am seeing something similar here on Tom's "Novelty Search | Distance:
hamming | Super Anagrams - run 0" job (though it would probably happen on any job):
-The node is not running out of RAM, and there are no errors.
-However, before the job actually starts doing anything, there is a long
(5-10 minute or longer) delay after the initial memory allocation:
top - 12:04:09 up 2 days, 19:38, 1 user, load average: 0.00, 0.11, 0.09
Tasks: 714 total, 1 running, 713 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.1%us, 0.1%sy, 0.0%ni, 99.9%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem:  66012900k total,  3158436k used, 62854464k free,   157252k buffers
Swap:  2047996k total,        0k used,  2047996k free,   609584k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
63928 thelmuth  20   0 18.7g 330m  11m S  0.0  0.5  0:09.82 java
63986 thelmuth  20   0 18.7g 327m  11m S  0.0  0.5  0:10.52 java
64232 thelmuth  20   0 18.7g 252m  11m S  0.0  0.4  0:06.68 java
64053 thelmuth  20   0 18.7g 185m  11m S  0.0  0.3  0:05.85 java
64112 thelmuth  20   0 18.7g 182m  11m S  0.0  0.3  0:05.47 java
63838 thelmuth  20   0 18.7g 177m  11m S  0.0  0.3  0:05.90 java
64172 thelmuth  20   0 18.7g 173m  11m S  0.0  0.3  0:04.80 java
63780 thelmuth  20   0 18.7g 169m  11m S  0.0  0.3  0:05.52 java
Very little CPU is being used, the network isn't sluggish, and the node
isn't out of RAM (though that is a LOT of VIRT for each job to claim...
but Java just kinda seems to do that). Occasionally you'll see one of the
processes use some CPU for a second, and after a few minutes they all
start chugging away. It's weird and I'm not sure what's up at the moment,
but it doesn't seem to actually harm anything - just weird.
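If anyone wants to poke at a stalled batch, two quick checks are below. This is
just a sketch: the PID is one of the java processes from the top output above
(substitute a current one), and it assumes a full JDK (with jstack) on the node.

    # Dump the thread stacks of one of the idle-looking JVMs (run as the job
    # owner or root) to see whether it's waiting on the network, on disk I/O,
    # or just loading classes:
    jstack 63928 | head -60

    # Show the maximum heap the JVM picks by default on these nodes; on a
    # 64 GB machine it typically reserves about a quarter of physical RAM,
    # which is why VIRT looks huge even though RES stays small:
    java -XX:+PrintFlagsFinal -version | grep -i MaxHeapSize

If that reservation ever becomes a problem, passing an explicit -Xmx (e.g.
-Xmx8g) to the java invocations would cap it.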
-Josiah
On 7/19/17 10:55 AM, Wm. Josiah Erikson wrote:
>
> I just saw one of Tom's jobs do the same thing (run rack0 nodes out of
> RAM - not sure if any of the jobs actually crashed yet, but still...),
> so I did two things:
>
> 1. Set the per-blade "sh" limit to 8 (previously unlimited)
>
> 2. Set the number of slots on Rack 0 nodes to 8 (previously 16)
>
> -Josiah
>
>
>
> On 7/19/17 9:40 AM, Wm. Josiah Erikson wrote:
>>
>> Also, some additional notes:
>>
>> 1. I've set the "bash" limit tag to 8 per node. Tom seems to use sh
>> instead of bash, so this won't affect him, I think.
>>
>> 2. If you want to do 4 per node instead, just add the limit tag
>> "4pernode" to a RemoteCmd in the .alf file, like this:
>>
>> RemoteCmd "bash -c \"env blah blah\"" -tags "4pernode" -service "tom"
>>
>> 3. There's also a 2pernode tag
>>
>> -Josiah
>>
>>
>>
>> On 7/19/17 9:30 AM, Wm. Josiah Erikson wrote:
>>>
>>> Yeah, it ran out of RAM - probably tried to take 16 jobs
>>> simultaneously. It's got 64GB of RAM, so should be able to do quite
>>> a few. It's currently running at under 1/3 RAM usage with 3 jobs, so
>>> 8 jobs should be safe. I'll limit "bash" limit tag jobs (what yours
>>> default to since you didn't set one explicitly and that's the
>>> command you're running) to 8 per node. I could lower it to 4 if you
>>> prefer. Let me know.
>>>
>>> -Josiah
>>>
>>>
>>>
>>> On 7/19/17 8:38 AM, Thomas Helmuth wrote:
>>>> Moving today, so I'll be brief.
>>>>
>>>> This sounds like the memory problems we've had on rack 4. There are
>>>> just too many cores there for the amount of memory, so if too many
>>>> runs start, none of them gets enough memory. That's why rack 4 isn't on
>>>> the Tom or big go service tags. That probably means we should remove
>>>> rack 0 from those tags for now until we figure it out.
>>>>
>>>> I've done runs on rack 4 by limiting each run to 1GB of memory.
>>>>
>>>> Tom
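(For reference, the 1 GB cap Tom mentions above is just a JVM heap limit. A
minimal sketch of what that looks like, assuming a run launched as a plain
java process - the jar name and arguments below are placeholders, and for
lein-based Clojush runs the equivalent knob is :jvm-opts in project.clj:

    # Cap the heap at 1 GB so that many runs fit comfortably on a 64 GB node.
    java -Xmx1g -Xms256m -jar clojush-run.jar <problem-namespace> <args>

The same -Xmx flag is what keeps each JVM from trying to reserve its default
quarter-of-RAM heap.)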
>>>>
>>>> On Jul 19, 2017 7:46 AM, "Lee Spector" <lspector at hampshire.edu> wrote:
>>>>
>>>>
>>>> I launched some runs this morning (a couple of hours ago -- I'm
>>>> in Europe) that behaved oddly. I don't know if this is related
>>>> to recent upgrades/reconfiguration, or what, but I thought I'd
>>>> share them here.
>>>>
>>>> First, there were weird delays between launches and the
>>>> generation of the initial output. I'm accustomed to delays of a
>>>> couple of minutes at this point, which I think we've determined
>>>> (or just guessed) are due to the necessity of re-downloading
>>>> dependencies on fly nodes, which doesn't occur when running on
>>>> other machines. But these took much longer, maybe 30-45 minutes
>>>> or longer (not sure of the exact times, but they were on that
>>>> order). And weirder still, there were long times during which
>>>> many had launched but only one was producing output and
>>>> proceeding for a long time, with others really kicking in only
>>>> after the early one(s) finished. I'd understand this if the
>>>> non-producing ones weren't launched at all, because they were
>>>> waiting for free slots.... but that's not what was happening.
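(If the slow starts really are dependency downloads, one low-risk experiment -
just a sketch, assuming the runs share a home directory and go through
Leiningen - is to pre-fetch the dependencies once before launching a batch,
so the individual fly jobs don't each pay the download cost at startup:

    # Run once from the Clojush checkout that the runs use; lein deps pulls
    # everything into the local ~/.m2 cache.
    cd /path/to/Clojush && lein deps

If the delays persist after that, the downloads probably aren't the culprit.)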
>>>>
>>>> And then one crashed, on compute-0-1 (this was launched with
>>>> the "tom" service tag, using the run-fly script in Clojush),
>>>> with nothing in the .err file (from standard error) but the
>>>> following in the .out file (from standard output):
>>>>
>>>> Producing offspring...
>>>> Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x00000006f0e00000, 1059061760, 0) failed; error='Cannot allocate memory' (errno=12)
>>>> #
>>>> # There is insufficient memory for the Java Runtime Environment to continue.
>>>> # Native memory allocation (malloc) failed to allocate 1059061760 bytes for committing reserved memory.
>>>> # An error report file with more information is saved as:
>>>> # /home/lspector/runs/july19a-9235/Clojush/hs_err_pid57996.log
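(The hs_err log it points to records the JVM arguments and heap state at the
time of the crash - a quick way to confirm whether any -Xmx was in effect for
that run. A sketch, using the path from the message above:

    # "VM Arguments" is a standard section of HotSpot hs_err files; it lists
    # the jvm_args and the java_command that was run.
    grep -A3 "VM Arguments" /home/lspector/runs/july19a-9235/Clojush/hs_err_pid57996.log

The "Heap" section of the same file shows how full the heap was when the
commit failed.)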
>>>>
>>>> -Lee
>>>>
>>>> --
>>>> Lee Spector, Professor of Computer Science
>>>> Director, Institute for Computational Intelligence
>>>> Hampshire College, Amherst, Massachusetts, 01002, USA
>>>> lspector at hampshire.edu, http://hampshire.edu/lspector/, 413-559-5352
>>>>
>>>> _______________________________________________
>>>> Clusterusers mailing list
>>>> Clusterusers at lists.hampshire.edu
>>>> https://lists.hampshire.edu/mailman/listinfo/clusterusers
>>>
>>
>
--
Wm. Josiah Erikson
Assistant Director of IT, Infrastructure Group
System Administrator, School of CS
Hampshire College
Amherst, MA 01002
(413) 559-6091