[Clusterusers] Errors on fly

Thomas Helmuth thelmuth at cs.umass.edu
Thu Apr 23 07:53:34 EDT 2015


Sure, no problem! Let me know when you're done rendering.

Tom

On Wed, Apr 22, 2015 at 10:08 PM, Piper Odegard <pto10 at hampshire.edu> wrote:

>  Hi Tom--can you ease up on your farm usage for the next week? Two of us
> are trying to render out our div 3 projects for a show this Friday and
> then other screenings next week. It's a pretty hideous time crunch for us
> to get everything rendered by the deadline.
>
> Thanks,
> - Piper
>
>
>
>
>
> On 2015-04-22 20:03, Thomas Helmuth wrote:
>
> Sounds good!
>
> On Wed, Apr 22, 2015 at 7:59 PM, Wm. Josiah Erikson <wjens at hampshire.edu>
> wrote:
>
>> I've changed MinRAM to 3GB. On second thought, I think 4GB might prevent
>> machines that only have 8GB from taking jobs when they should be able to.
>> I'll watch and tweak as necessary.
>> Wow, the cluster is getting hammered. Fun :)
>>     -Josiah
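
A quick way to sanity-check a threshold like the 3GB above is to compare it
against what the nodes actually report as free. The sketch below only
illustrates that comparison; it is not how Tractor's MinRAM setting is
implemented, and the 3072MB value simply mirrors the 3GB figure mentioned
above.

    #!/bin/sh
    # Illustration only: does this node currently have enough free memory
    # to be worth dispatching another job to?
    THRESHOLD_MB=3072   # assumed value, matching the 3GB discussed above

    # MemAvailable is the kernel's estimate of memory free for new work;
    # fall back to MemFree on older kernels that don't report it.
    avail_kb=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)
    [ -z "$avail_kb" ] && avail_kb=$(awk '/^MemFree:/ {print $2}' /proc/meminfo)
    avail_mb=$((avail_kb / 1024))

    if [ "$avail_mb" -ge "$THRESHOLD_MB" ]; then
        echo "$(hostname): ${avail_mb}MB free, above the ${THRESHOLD_MB}MB threshold"
    else
        echo "$(hostname): ${avail_mb}MB free, below the ${THRESHOLD_MB}MB threshold"
    fi

Run across the nodes with a simple ssh loop, this shows which machines a
given threshold would currently exclude.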
>>
>>
>>
>> On 4/22/15 7:54 PM, Thomas Helmuth wrote:
>>
>> Ok, good, glad it's not something in Clojush doing the crashing! I'm
>> guessing somewhere in the range of 4GB would do the trick. So, let's try
>> that?
>>
>> On Wed, Apr 22, 2015 at 7:52 PM, Wm. Josiah Erikson <wjens at hampshire.edu>
>> wrote:
>>
>>> I have rebooted compute-1-17. That should fix that. (Apologies to Piper,
>>> who had a Maxwell job running. You'll have to retry that one.)
>>>
>>> The other problem is that some nodes are actually out of memory because
>>> of the insanely intensive rendering jobs that are running on them. We could
>>> play with the minimum free memory required to take new jobs... How much RAM
>>> do you think is necessary? Right now it's set to 1GB. Maybe I should try
>>> changing it to 2GB or 4GB?
>>>     -Josiah
>>>
>>>
>>>
>>> On 4/22/15 7:22 PM, Thomas Helmuth wrote:
>>>
>>>> Hi Josiah,
>>>>
>>>> I just started my first runs on fly since the hard drive switch. I'm
>>>> getting some weird errors. The first is that every run started on
>>>> compute-1-17 crashed with the following message:
>>>>
>>>> ====[2015/04/22 19:06:03 /J1504220051/T10/C10/thelmuth on compute-1-17]====
>>>>  /bin/sh: line 0: cd: /home/thelmuth/ClusteringBench/: Not a directory
>>>>  /bin/sh: /home/thelmuth/Results/clustering-bench/determin-decim/ratio-0.25/replace-space-with-newline/tourney-7/logs/log9.txt: Stale file handle
>>>>  /bin/sh: line 0: cd: /home/thelmuth/Results/clustering-bench/determin-decim/ratio-0.25/replace-space-with-newline/tourney-7/csv/: Not a directory
>>>> ...
>>>>
>>>> It sounds like 1-17 for some reason cannot access my homedir. When I
>>>> try to SSH to compute-1-17, it asks for my password, which it doesn't do
>>>> for other nodes. So it sounds like there's something amiss there.
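
Both symptoms point the same way: "Stale file handle" and "Not a directory"
on paths under /home, plus the password prompt (sshd falls back to password
authentication when it can't read ~/.ssh/authorized_keys), are what you'd
expect if the node's NFS mount of /home had gone stale. A few commands along
these lines, run on compute-1-17, would confirm it; this assumes /home is
NFS-mounted on the compute nodes, which the thread doesn't state explicitly.

    # Diagnostic sketch, assuming /home on the compute nodes is an NFS mount
    # (run on compute-1-17 as root):

    mount | grep /home     # is /home mounted, and from which server?
    df -h /home            # does the server still answer for it?
    ls /home/thelmuth      # "Stale file handle" here confirms a stale mount
                           # rather than a permissions problem

    # A lazy unmount followed by a remount (or a reboot, as was done here)
    # usually clears a stale handle, assuming /home has an fstab entry:
    umount -l /home && mount /home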
>>>>
>>>> I had some other runs, apparently all on rack 2 nodes, that crashed,
>>>> printing the following to their output log files:
>>>>
>>>> Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x00000007e5500000, 447741952, 0) failed; error='Cannot allocate memory' (errno=12)
>>>> #
>>>> # There is insufficient memory for the Java Runtime Environment to continue.
>>>> # Native memory allocation (malloc) failed to allocate 447741952 bytes for committing reserved memory.
>>>> # An error report file with more information is saved as:
>>>> # /home/thelmuth/ClusteringBench/hs_err_pid21904.log
>>>>
>>>> This looks like an error we used to get from Java but that I haven't seen
>>>> recently. I checked, and I don't think anything has changed in our code
>>>> regarding memory management.
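
For what it's worth, os::commit_memory failing with errno=12 means the OS
refused memory the JVM had already reserved, so the node was genuinely out
of RAM at that moment rather than the JVM hitting its own limit. The MinRAM
change discussed above addresses the dispatch side; on the demand side, one
option is to cap the heap explicitly at launch so a run's footprint stays
bounded. The command below is only a sketch, since the thread doesn't show
how the Clojush runs are actually started, and the jar name and heap sizes
are placeholders.

    # Sketch only: cap the JVM heap so a run never tries to commit more than
    # the node can spare. The jar name and sizes are placeholders, not the
    # real launch command used for these runs.
    java -Xms512m -Xmx2g -jar clojush-standalone.jar  # plus the usual problem arguments

With Leiningen, the same cap could instead go in project.clj under
:jvm-opts ["-Xmx2g"].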
>>>>
>>>> Is it possible that one or both of these errors were caused by something
>>>> in the move, or by an upgrade to Tractor? I can go back and look for
>>>> errors similar to the second one to see if we figured out what went
>>>> wrong there.
>>>>
>>>> Tom
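
For digging up earlier occurrences of that second error, something along
these lines would do; the paths are taken from the excerpts above and are
assumptions about where old run output lives.

    # Sketch: find earlier runs that died with the same JVM error.
    grep -rl "insufficient memory for the Java Runtime Environment" \
        /home/thelmuth/Results/ 2>/dev/null

    # The hs_err_pid*.log files the JVM leaves behind record how much memory
    # it asked for and what the machine had free at the time:
    ls -lt /home/thelmuth/ClusteringBench/hs_err_pid*.log | head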
>>>
>>>
>
> _______________________________________________
> Clusterusers mailing list
> Clusterusers at lists.hampshire.edu
> https://lists.hampshire.edu/mailman/listinfo/clusterusers
>
>