[Clusterusers] compute-1-10

Wm. Josiah Erikson wjerikson at hampshire.edu
Fri Jan 20 13:02:06 EST 2017


Yes, it was bad RAM. I found and removed the offending stick and will
RMA it. For now, compute-1-10 "only" has 24GB of RAM.

    -Josiah



On 1/17/17 2:41 PM, Wm. Josiah Erikson wrote:
>
>     Not sure what exactly is going on there, but I'm suspicious of bad
> memory. I'm NIMBYing it and will run a memory test on it when the
> existing jobs are finished or errored out :)
>
>     Sending this to the list too so everyone knows why compute-1-10 is
> NIMBYed.
>
>     -Josiah
>
>
>
> On 1/17/17 10:55 AM, Thomas Helmuth wrote:
>> Sure! It looks like most or all of them threw our old friend hs_err
>> messages with a bunch of Java info. I've attached a bunch of them. I
>> remember trying to get to the bottom of these years ago, I think with
>> a different compute node, and we ended up just ignoring the error and
>> taking the node off my tag.
>>
>> The errors printed to tractor are all over the place -- from just
>> saying they aborted to null pointer exceptions to array index out of
>> bounds exceptions. I think they're unrelated and were just caused by
>> whatever threw the hs_err.
>>
>> All of these errors occurred within 1-3 minutes of starting the run.
>>
>> Thanks,
>> Tom
>>
>> On Tue, Jan 17, 2017 at 9:11 AM, Wm. Josiah Erikson
>> <wjerikson at hampshire.edu> wrote:
>>
>>     My quick and dirty look at the node doesn't see anything wrong
>>     with it - can you send me a link to or the text of the weird error?
>>
>>         -Josiah
>>
>>
>>
>>     On 1/13/17 3:51 PM, Thomas Helmuth wrote:
>>     > Hi Josiah,
>>     >
>>     > I have some runs going on fly, and noticed that a bunch of them on
>>     > compute-1-10 crashed with various weird error messages. This was the
>>     > only node with weird crashes, so I'm wondering if something is going
>>     > bad in that node. Any ideas? Would you be able to either take that
>>     > node offline, or remove it from tag "tom" so my runs don't use it?
>>     >
>>     > Thanks,
>>     > Tom
>>
>>     --
>>     Wm. Josiah Erikson
>>     Assistant Director of IT, Infrastructure Group
>>     System Administrator, School of CS
>>     Hampshire College
>>     Amherst, MA 01002
>>     (413) 559-6091
>>
>>
>

-- 
Wm. Josiah Erikson
Assistant Director of IT, Infrastructure Group
System Administrator, School of CS
Hampshire College
Amherst, MA 01002
(413) 559-6091
