<div dir="ltr"><div><div>(now including the fly listserv)<br><br></div><div>I just started some genetic programming runs, and got weird Java errors that look like they ran out of memory on some of the rack2 nodes (see below). I just remembered that Piper was having issues earlier today with RAM, so maybe this is related? Does this seem plausible?<br><br></div>BTW, if people are in crunch time for D3 projects, I'd be happy to hold off on my runs for a few days, or to lower my priority. Let me know!<br><br></div>Tom<br></div><div class="gmail_extra"><br><div class="gmail_quote">On Wed, Apr 22, 2015 at 7:31 PM, Thomas Helmuth <span dir="ltr"><<a href="mailto:thelmuth@cs.umass.edu" target="_blank">thelmuth@cs.umass.edu</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div><div>(including Lee in case he has any ideas -- Lee, see older message below first)<br></div><div><br>So, it looks like it was a different Java error before! The old ones looked more like:<br><br># A fatal error has been detected by the Java Runtime Environment:<br>#<br>#  Internal Error (synchronizer.cpp:1401), pid=28984, tid=140656188155648<br>#  guarantee(mid->header()->is_neutral()) failed: invariant<br>

#<br># JRE version: 7.0_03-b04<br># Java VM: Java HotSpot(TM) 64-Bit Server VM (22.1-b02 mixed mode linux-amd64 compressed oops)<br># Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again<br>#<br># An error report file with more information is saved as:<br># /home/thelmuth/Clojush/hs_err_pid28984.log<br>#<br># If you would like to submit a bug report, please visit:<br>

#   <a href="http://bugreport.sun.com/bugreport/crash.jsp" target="_blank">http://bugreport.sun.com/bugreport/crash.jsp</a><br>#<br><br></div>Which has no mention of memory problems. I've attached one of the new error logs in case it's useful. I'll double check if I see any differences in our Java arguments regarding memory, but as far as I can tell, they're no different.<br><br></div>Tom<br></div><div class="HOEnZb"><div class="h5"><div class="gmail_extra"><br><div class="gmail_quote">On Wed, Apr 22, 2015 at 7:22 PM, Thomas Helmuth <span dir="ltr"><<a href="mailto:thelmuth@cs.umass.edu" target="_blank">thelmuth@cs.umass.edu</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div><div><div><div><div><div>Hi Josiah,<br><br></div>I just started my first runs on fly since the hard drive switch. I'm getting some weird errors. The first is that every run started on compute-1-17 crashed with the following message:<br><br>====[2015/04/22 19:06:03 /J1504220051/T10/C10/thelmuth on compute-1-17 

]====<br> /bin/sh: line 0: cd: /home/thelmuth/ClusteringBench/: Not a directory<br> /bin/sh: /home/thelmuth/Results/clustering-bench/determin-decim/ratio-0.25/replace-space-with-newline/tourney-7/logs/log9.txt: Stale file handle<br> /bin/sh: line 0: cd: /home/thelmuth/Results/clustering-bench/determin-decim/ratio-0.25/replace-space-with-newline/tourney-7/csv/:

 Not a directory<br>...<br><br></div>It sounds like 1-17 for some reason cannot access my homedir. When I tried to SSH to compute-1-17, it asked for my password, which it doesn't do for other nodes. So, sounds like there's something amiss there.<br><br></div>I had some other runs, it looks like all on rack 2 nodes, that crashed printing the following to the output log files:<br><br>Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x00000007e5500000, 447741952, 0) failed; error='Cannot allocate memory' (errno=12)<br>#<br># There is insufficient memory for the Java Runtime Environment to continue.<br># Native memory allocation (malloc) failed to allocate 447741952 bytes for committing reserved memory.<br># An error report file with more information is saved as:<br># /home/thelmuth/ClusteringBench/hs_err_pid21904.log<br><br></div>This looks like an error we used to get with Java but that I haven't seen recently. I checked, and I don't think anything has changed in our code regarding memory management.<br><br></div>Is it possible that one or both of these errors are caused by something in the move, or an upgrade to tractor or something? I can go back and look for similar errors to the second one to see if we figured out what went wrong there.<br><br></div>Tom<br></div>

</blockquote></div><br></div>

</div></div></blockquote></div><br></div>