[Clusterusers] compute-1-10

Thomas Helmuth helmutht at wlu.edu
Fri Jan 20 13:33:36 EST 2017


Awesome, thanks for finding this!

Tom

On Fri, Jan 20, 2017 at 1:02 PM, Wm. Josiah Erikson <wjerikson at hampshire.edu> wrote:

> Yes, it was bad RAM. I found and removed the offending stick and will RMA
> it. For now, compute-1-10 "only" has 24GB of RAM.
>
>     -Josiah
>
>
>
> On 1/17/17 2:41 PM, Wm. Josiah Erikson wrote:
>
>     Not sure what exactly is going on there, but I'm suspicious of bad
> memory. I'm NIMBYing it and will run a memory test on it when the existing
> jobs are finished or errored out :)
>
>     Sending this to the list too so everyone knows why compute-1-10 is
> NIMBYed.
>
>     -Josiah
>
>
>
> On 1/17/17 10:55 AM, Thomas Helmuth wrote:
>
> Sure! It looks like most or all of them threw our old friend hs_err
> messages with a bunch of Java info. I've attached a bunch of them. I
> remember trying to get to the bottom of these years ago, I think with a
> different compute node, and we ended up just ignoring the error and taking
> the node off my tag.
>
> The errors printed to tractor are all over the place -- from just saying
> they aborted to null pointer exceptions to array index out of bounds
> exceptions. I think they're unrelated and were just caused by whatever
> threw the hs_err.
>
> All of these errors occurred within 1-3 minutes of starting the run.
>
> Thanks,
> Tom
>
> On Tue, Jan 17, 2017 at 9:11 AM, Wm. Josiah Erikson <
> wjerikson at hampshire.edu> wrote:
>
>> My quick and dirty look at the node didn't turn up anything wrong with it -
>> can you send me a link to the weird error, or the text of it?
>>
>>     -Josiah
>>
>>
>>
>> On 1/13/17 3:51 PM, Thomas Helmuth wrote:
>> > Hi Josiah,
>> >
>> > I have some runs going on fly, and noticed that a bunch of them on
>> > compute-1-10 crashed with various weird error messages. This was the
>> > only node with weird crashes, so I'm wondering if something is going
>> > bad in that node. Any ideas? Would you be able to either take that
>> > node offline, or remove it from tag "tom" so my runs don't use it?
>> >
>> > Thanks,
>> > Tom
>>
>> --
>> Wm. Josiah Erikson
>> Assistant Director of IT, Infrastructure Group
>> System Administrator, School of CS
>> Hampshire College
>> Amherst, MA 01002
>> (413) 559-6091
>>
>>
>