[Clusterusers] sluggishness/errors on new fly launches

Wm. Josiah Erikson wjerikson at hampshire.edu
Mon Jul 24 12:07:54 EDT 2017


Hm, yeah, and then they kick in one by one:

top - 12:07:01 up 2 days, 19:41,  1 user,  load average: 2.86, 0.65, 0.27
Tasks: 713 total,   1 running, 712 sleeping,   0 stopped,   0 zombie
Cpu(s): 98.7%us,  0.1%sy,  0.0%ni,  1.2%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  66012900k total,  7350780k used, 58662120k free,   157388k buffers
Swap:  2047996k total,        0k used,  2047996k free,   610008k cached

   PID USER      PR  NI  VIRT  RES  SHR S   %CPU %MEM    TIME+  COMMAND
 63986 thelmuth  20   0 20.8g 4.2g  11m S 3161.1  6.6   6:27.23 java
 63928 thelmuth  20   0 18.7g 330m  11m S    0.0  0.5   0:09.88 java
 63780 thelmuth  20   0 18.7g 327m  11m S    0.0  0.5   0:09.61 java
 64232 thelmuth  20   0 18.7g 252m  11m S    0.0  0.4   0:06.74 java
 64053 thelmuth  20   0 18.7g 185m  11m S    0.0  0.3   0:05.91 java
 64112 thelmuth  20   0 18.7g 182m  11m S    0.0  0.3   0:05.53 java
 63838 thelmuth  20   0 18.7g 177m  11m S    0.0  0.3   0:05.95 java
 64172 thelmuth  20   0 18.7g 173m  11m S    0.0  0.3   0:04.87 java
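
If this startup lull keeps happening, one quick way to see what the seemingly
idle JVMs are actually doing is a per-thread CPU view plus a thread dump (the
PID below is just one of the processes from the table above; run these as the
owner of the process):

    # per-thread CPU usage for one of the waiting Java processes
    top -H -p 63928

    # thread dump - shows whether it's sitting in classloading, GC, or network I/O
    jstack 63928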

    -Josiah



On 7/24/17 12:06 PM, Wm. Josiah Erikson wrote:
>
> I am seeing something similar here on Tom's "Novelty Search | Distance:
> hamming | Super Anagrams - run 0" job (though it would probably happen
> on any job):
>
> -The node is not running out of RAM and there are no errors
>
> -However, before the job actually starts doing anything, there is a
> long delay (5-10 minutes or longer) after the initial memory allocation:
>
>
> top - 12:04:09 up 2 days, 19:38,  1 user,  load average: 0.00, 0.11, 0.09
> Tasks: 714 total,   1 running, 713 sleeping,   0 stopped,   0 zombie
> Cpu(s):  0.1%us,  0.1%sy,  0.0%ni, 99.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> Mem:  66012900k total,  3158436k used, 62854464k free,   157252k buffers
> Swap:  2047996k total,        0k used,  2047996k free,   609584k cached
>
>    PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>  63928 thelmuth  20   0 18.7g 330m  11m S  0.0  0.5   0:09.82 java
>  63986 thelmuth  20   0 18.7g 327m  11m S  0.0  0.5   0:10.52 java
>  64232 thelmuth  20   0 18.7g 252m  11m S  0.0  0.4   0:06.68 java
>  64053 thelmuth  20   0 18.7g 185m  11m S  0.0  0.3   0:05.85 java
>  64112 thelmuth  20   0 18.7g 182m  11m S  0.0  0.3   0:05.47 java
>  63838 thelmuth  20   0 18.7g 177m  11m S  0.0  0.3   0:05.90 java
>  64172 thelmuth  20   0 18.7g 173m  11m S  0.0  0.3   0:04.80 java
>  63780 thelmuth  20   0 18.7g 169m  11m S  0.0  0.3   0:05.52 java
>
> Very little CPU is being used, the network isn't sluggish, and the node
> isn't out of RAM (though that's a LOT of virtual memory for each job to
> claim... but Java just kinda seems to do that). Occasionally you'll see
> one of the processes use some CPU for a second, and in a few minutes
> they all start chugging away. I'm not sure what's up at the moment, but
> it doesn't seem to actually harm anything - it's just weird.
>
>     -Josiah
>
>
> On 7/19/17 10:55 AM, Wm. Josiah Erikson wrote:
>>
>> I just saw one of Tom's jobs do the same thing (run the rack 0 nodes
>> out of RAM - not sure whether any of the jobs actually crashed yet, but
>> still...), so I did two things:
>>
>> 1. Set the per-blade "sh" limit to 8 (previously unlimited)
>>
>> 2. Set the number of slots on Rack 0 nodes to 8 (previously 16)
>>
>>     -Josiah
>>
>>
>>
>> On 7/19/17 9:40 AM, Wm. Josiah Erikson wrote:
>>>
>>> Also, some additional notes:
>>>
>>> 1. I've set the "bash" limit tag to 8 per node. Tom seems to use sh
>>> instead of bash, so this won't affect him, I think.
>>>
>>> 2. If you want to do 4 per node instead, just add the limit tag
>>> "4pernode" to a RemoteCmd in the .alf file, like this:
>>>
>>>     RemoteCmd {bash -c "env blah blah"} -tags "4pernode" -service "tom"
>>>
>>> 3. There's also a "2pernode" tag.
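>>>
>>> It's used the same way; a sketch with the same placeholder command as
>>> above:
>>>
>>>     RemoteCmd {bash -c "env blah blah"} -tags "2pernode" -service "tom"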
>>>
>>>     -Josiah
>>>
>>>
>>>
>>> On 7/19/17 9:30 AM, Wm. Josiah Erikson wrote:
>>>>
>>>> Yeah, it ran out of RAM - it probably tried to take 16 jobs
>>>> simultaneously. It's got 64GB of RAM, so it should be able to handle
>>>> quite a few. It's currently at under 1/3 RAM usage with 3 jobs (roughly
>>>> 7GB per job, so 8 jobs would stay under about 56GB), which means 8 jobs
>>>> should be safe. I'll limit "bash" limit-tag jobs (which is what yours
>>>> default to, since you didn't set a tag explicitly and bash is the
>>>> command you're running) to 8 per node. I could lower it to 4 if you
>>>> prefer. Let me know.
>>>>
>>>>     -Josiah
>>>>
>>>>
>>>>
>>>> On 7/19/17 8:38 AM, Thomas Helmuth wrote:
>>>>> Moving today, so I'll be brief.
>>>>>
>>>>> This sounds like the memory problems we've had on rack 4. There, there
>>>>> are just too many cores for the amount of memory, so if too many runs
>>>>> start, none of them get enough memory. That's why rack 4 isn't in the
>>>>> "tom" or "big go" service tags. This probably means we should remove
>>>>> rack 0 from those tags for now, until we figure it out.
>>>>>
>>>>> I've done runs on rack 4 by limiting each run to 1GB of memory.
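>>>>>
>>>>> (Presumably that means capping the JVM heap, e.g. something along these
>>>>> lines, where the jar name and arguments are just placeholders:
>>>>>
>>>>>     java -Xmx1g -jar clojush.jar <problem-args>
>>>>>
>>>>> or the equivalent :jvm-opts ["-Xmx1g"] entry in project.clj.)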
>>>>>
>>>>> Tom
>>>>>
>>>>> On Jul 19, 2017 7:46 AM, "Lee Spector" <lspector at hampshire.edu> wrote:
>>>>>
>>>>>
>>>>>     I launched some runs this morning (a couple of hours ago --
>>>>>     I'm in Europe) that behaved oddly. I don't know if this is
>>>>>     related to recent upgrades/reconfiguration, or what, but I
>>>>>     thought I'd share them here.
>>>>>
>>>>>     First, there were weird delays between launches and the
>>>>>     generation of the initial output. I'm accustomed to delays of
>>>>>     a couple of minutes at this point, which I think we've
>>>>>     determined (or just guessed) are due to the necessity of
>>>>>     re-downloading dependencies on fly nodes, which doesn't occur
>>>>>     when running on other machines. But these took much longer,
>>>>>     maybe 30-45 minutes or longer (not sure of the exact times,
>>>>>     but they were on that order). And weirder still, there were
>>>>>     long stretches during which many runs had launched but only one
>>>>>     was producing output and making progress, with the others really
>>>>>     kicking in only after the early one(s) finished. I'd understand
>>>>>     this if the non-producing ones hadn't been launched at all,
>>>>>     because they were waiting for free slots... but that's not what
>>>>>     was happening.
>>>>>
>>>>>     And then one crashed, on compute-0-1 (this was launched with
>>>>>     the "tom" service tag, using the run-fly script in Clojush),
>>>>>     with nothing in the .err file (from standard error) but the
>>>>>     following in the .out file (from standard output):
>>>>>
>>>>>     Producing offspring...
>>>>>     Java HotSpot(TM) 64-Bit Server VM warning: INFO:
>>>>>     os::commit_memory(0x00000006f0e00000, 1059061760, 0) failed;
>>>>>     error='Cannot allocate memory' (errno=12)
>>>>>     #
>>>>>     # There is insufficient memory for the Java Runtime
>>>>>     Environment to continue.
>>>>>     # Native memory allocation (malloc) failed to allocate
>>>>>     1059061760 bytes for committing reserved memory.
>>>>>     # An error report file with more information is saved as:
>>>>>     # /home/lspector/runs/july19a-9235/Clojush/hs_err_pid57996.log
>>>>>
>>>>>      -Lee
>>>>>
>>>>>     --
>>>>>     Lee Spector, Professor of Computer Science
>>>>>     Director, Institute for Computational Intelligence
>>>>>     Hampshire College, Amherst, Massachusetts, 01002, USA
>>>>>     lspector at hampshire.edu, http://hampshire.edu/lspector/, 413-559-5352
>>>>>
>>>>
>>>
>>
>

-- 
Wm. Josiah Erikson
Assistant Director of IT, Infrastructure Group
System Administrator, School of CS
Hampshire College
Amherst, MA 01002
(413) 559-6091
