[Clusterusers] Errors on fly

Thomas Helmuth thelmuth at cs.umass.edu
Wed Apr 22 20:03:18 EDT 2015


Sounds good!

On Wed, Apr 22, 2015 at 7:59 PM, Wm. Josiah Erikson <wjens at hampshire.edu>
wrote:

>  I've changed MinRAM to 3GB. I think actually, on second thought, that 4GB
> might prevent machines that only have 8GB from taking jobs when they should
> be able to. I'll watch and tweak as necessary.
> Wow the cluster is getting hammered. Fun :)
>     -Josiah
>
>
>
> On 4/22/15 7:54 PM, Thomas Helmuth wrote:
>
> Ok, good, glad it's not something in Clojush doing the crashing! I'm
> guessing somewhere in the range of 4GB would do the trick. So, let's try
> that?
>
> On Wed, Apr 22, 2015 at 7:52 PM, Wm. Josiah Erikson <wjens at hampshire.edu>
> wrote:
>
>> I have rebooted compute-1-17. That should fix that. (Apologies to Piper,
>> who had a maxwell job running. You'll have to retry that one)
>>
>> The other problem is that some nodes are actually out of memory because
>> of the insanely intensive rendering jobs that are running on them. We could
>> play with minimum free memory required to take new jobs.... how much RAM do
>> you think is necessary? Right now it's set to 1GB. Maybe I should try
>> changing it to 2GB or 4GB?
>>     -Josiah
>>
>>
>>
>> On 4/22/15 7:22 PM, Thomas Helmuth wrote:
>>
>>> Hi Josiah,
>>>
>>> I just started my first runs on fly since the hard drive switch. I'm
>>> getting some weird errors. The first is that every run started on
>>> compute-1-17 crashed with the following message:
>>>
>>> ====[2015/04/22 19:06:03 /J1504220051/T10/C10/thelmuth on compute-1-17
>>> ]====
>>>  /bin/sh: line 0: cd: /home/thelmuth/ClusteringBench/: Not a directory
>>>  /bin/sh:
>>> /home/thelmuth/Results/clustering-bench/determin-decim/ratio-0.25/replace-space-with-newline/tourney-7/logs/log9.txt:
>>> Stale file handle
>>>  /bin/sh: line 0: cd:
>>> /home/thelmuth/Results/clustering-bench/determin-decim/ratio-0.25/replace-space-with-newline/tourney-7/csv/:
>>> Not a directory
>>> ...
>>>
>>> It sounds like 1-17 for some reason cannot access my homedir. When I
>>> tried to SSH to compute-1-17, it asked for my password, which it doesn't
>>> for other nodes. So, it sounds like there's something amiss there.
>>>
>>> I had some other runs, apparently all on rack 2 nodes, that crashed,
>>> printing the following to the output log files:
>>>
>>> Java HotSpot(TM) 64-Bit Server VM warning: INFO:
>>> os::commit_memory(0x00000007e5500000, 447741952, 0) failed; error='Cannot
>>> allocate memory' (errno=12)
>>> #
>>> # There is insufficient memory for the Java Runtime Environment to
>>> continue.
>>> # Native memory allocation (malloc) failed to allocate 447741952 bytes
>>> for committing reserved memory.
>>> # An error report file with more information is saved as:
>>> # /home/thelmuth/ClusteringBench/hs_err_pid21904.log
>>>
>>> This looks like an error we used to get with Java but that I haven't
>>> seen recently. I checked, and I don't think anything has changed in our
>>> code regarding memory management.
>>>
>>> Is it possible that one or both of these errors are caused by something
>>> in the move, or an upgrade to tractor or something? I can go back and look
>>> for similar errors to the second one to see if we figured out what went
>>> wrong there.
>>>
>>> Tom
>>>
>>
>>
>
>
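The trade-off in the MinRAM discussion above (1GB vs. 3GB vs. 4GB of free memory required before a node takes a new job) can be illustrated with a small sketch. This is not Tractor's actual configuration or API; `can_accept_job` and the node figures (8 GiB total, 4.5 GiB in use) are hypothetical and chosen only to show why a 4GB threshold can lock out an 8GB machine that a 3GB threshold would still admit. The last line converts the failed allocation size from the JVM log into MiB.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def can_accept_job(free_bytes: int, min_ram_bytes: int) -> bool:
    """Hypothetical check: node takes a job only if enough memory is free."""
    return free_bytes >= min_ram_bytes


# Hypothetical node: 8 GiB total, 4.5 GiB already used by other jobs.
total = 8 * GIB
used = int(4.5 * GIB)
free = total - used  # 3.5 GiB free

# Compare the thresholds mentioned in the thread.
for threshold_gib in (1, 3, 4):
    ok = can_accept_job(free, threshold_gib * GIB)
    print(f"min free {threshold_gib} GiB -> accepts job: {ok}")

# Size of the allocation the JVM failed to commit, taken from the log.
failed_alloc = 447741952
print(f"failed allocation: {failed_alloc / MIB:.0f} MiB")
```

Under these assumed numbers, the node is rejected at a 4 GiB threshold but accepted at 3 GiB, which matches the reasoning given for settling on 3GB.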