[Clusterusers] Errors on fly

Wed Apr 22 19:43:26 EDT 2015

(now including the fly listserv)

I just started some genetic programming runs, and got weird Java errors
that look like they ran out of memory on some of the rack2 nodes (see
below). I just remembered that Piper was having issues earlier today with
RAM, so maybe this is related? Does this seem plausible?
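
One quick way to check would be to sort the crashed runs by error signature, so the memory failures can be separated from the older synchronizer crash. Below is a rough Python sketch, not something we currently run. It assumes the run logs are plain text and that the marker strings, copied from the JVM messages quoted below, are enough to tell the two apart. The function name and category labels are made up for illustration.

```python
import re

# Marker strings copied from the JVM messages quoted in this thread.
OOM_MARKERS = ("Cannot allocate memory", "insufficient memory")
# Matches e.g. "Internal Error (synchronizer.cpp:1401)".
INTERNAL_MARKER = re.compile(r"Internal Error \(([^)]+)\)")


def classify_jvm_crash(log_text):
    """Return a hypothetical label for the kind of crash seen in log_text."""
    if any(marker in log_text for marker in OOM_MARKERS):
        return "out-of-memory"
    if INTERNAL_MARKER.search(log_text):
        return "internal-error"
    return "unknown"
```

Reading each run's output log into a string and passing it to classify_jvm_crash would give a rough count of each kind of failure per node.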

BTW, if people are in crunch time for D3 projects, I'd be happy to hold off
on my runs for a few days, or to lower my priority. Let me know!

Tom

On Wed, Apr 22, 2015 at 7:31 PM, Thomas Helmuth <thelmuth at cs.umass.edu>
wrote:

> (including Lee in case he has any ideas -- Lee, see older message below
> first)
>
> So, it looks like it was a different Java error before! The old ones
> looked more like:
>
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  Internal Error (synchronizer.cpp:1401), pid=28984, tid=140656188155648
> #  guarantee(mid->header()->is_neutral()) failed: invariant
> #
> # JRE version: 7.0_03-b04
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (22.1-b02 mixed mode linux-amd64 compressed oops)
> # Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
> #
> # An error report file with more information is saved as:
> # /home/thelmuth/Clojush/hs_err_pid28984.log
> #
> # If you would like to submit a bug report, please visit:
> #   http://bugreport.sun.com/bugreport/crash.jsp
> #
>
> Which has no mention of memory problems. I've attached one of the new
> error logs in case it's useful. I'll double check if I see any differences
> in our Java arguments regarding memory, but as far as I can tell, they're
> no different.
>
> Tom
>
> On Wed, Apr 22, 2015 at 7:22 PM, Thomas Helmuth <thelmuth at cs.umass.edu>
> wrote:
>
>> Hi Josiah,
>>
>> I just started my first runs on fly since the hard drive switch. I'm
>> getting some weird errors. The first is that every run started on
>> compute-1-17 crashed with the following message:
>>
>> ====[2015/04/22 19:06:03 /J1504220051/T10/C10/thelmuth on compute-1-17 ]====
>>  /bin/sh: line 0: cd: /home/thelmuth/ClusteringBench/: Not a directory
>>  /bin/sh: /home/thelmuth/Results/clustering-bench/determin-decim/ratio-0.25/replace-space-with-newline/tourney-7/logs/log9.txt: Stale file handle
>>  /bin/sh: line 0: cd: /home/thelmuth/Results/clustering-bench/determin-decim/ratio-0.25/replace-space-with-newline/tourney-7/csv/: Not a directory
>> ...
>>
>> It sounds like 1-17 for some reason cannot access my homedir. When I
>> tried to SSH to compute-1-17, it asked for my password, which other nodes
>> don't do. So it sounds like something is amiss there.
>>
>> I also had some other runs, apparently all on rack 2 nodes, that crashed,
>> printing the following to the output log files:
>>
>> Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x00000007e5500000, 447741952, 0) failed; error='Cannot allocate memory' (errno=12)
>> #
>> # There is insufficient memory for the Java Runtime Environment to continue.
>> # Native memory allocation (malloc) failed to allocate 447741952 bytes for committing reserved memory.
>> # An error report file with more information is saved as:
>> # /home/thelmuth/ClusteringBench/hs_err_pid21904.log
>>
>> This looks like an error we used to get with Java but that I haven't seen
>> recently. I checked, and I don't think anything has changed in our code
>> regarding memory management.
>>
>> Is it possible that one or both of these errors are caused by something
>> in the move, or an upgrade to tractor or something? I can go back and look
>> for similar errors to the second one to see if we figured out what went
>> wrong there.
>>
>> Tom
>>
>
>