[Clusterusers] compute-1-10

Wm. Josiah Erikson wjerikson at hampshire.edu
Tue Jan 17 14:41:53 EST 2017


    Not sure what exactly is going on there, but I'm suspicious of bad
memory. I'm NIMBYing it and will run a memory test on it when the
existing jobs are finished or errored out :)

    Sending this to the list too so everyone knows why compute-1-10 is
NIMBYed.

    -Josiah



On 1/17/17 10:55 AM, Thomas Helmuth wrote:
> Sure! It looks like most or all of them threw our old friend hs_err
> messages with a bunch of Java info. I've attached a bunch of them. I
> remember trying to get to the bottom of these years ago, I think with
> a different compute node, and we ended up just ignoring the error and
> taking the node off my tag.
>
> The errors printed to tractor are all over the place -- from just
> saying they aborted to null pointer exceptions to array index out of
> bounds exceptions. I think they're unrelated and were just caused by
> whatever threw the hs_err.
>
> All of these errors occurred within 1-3 minutes of starting the run.
>
> Thanks,
> Tom
>
> On Tue, Jan 17, 2017 at 9:11 AM, Wm. Josiah Erikson
> <wjerikson at hampshire.edu> wrote:
>
>     My quick and dirty look at the node doesn't turn up anything wrong
>     with it. Can you send me a link to, or the text of, the weird error?
>
>         -Josiah
>
>
>
>     On 1/13/17 3:51 PM, Thomas Helmuth wrote:
>     > Hi Josiah,
>     >
>     > I have some runs going on fly, and noticed that a bunch of them on
>     > compute-1-10 crashed with various weird error messages. This was the
>     > only node with weird crashes, so I'm wondering if something is going
>     > bad in that node. Any ideas? Would you be able to either take that
>     > node offline, or remove it from tag "tom" so my runs don't use it?
>     >
>     > Thanks,
>     > Tom
>
>     --
>     Wm. Josiah Erikson
>     Assistant Director of IT, Infrastructure Group
>     System Administrator, School of CS
>     Hampshire College
>     Amherst, MA 01002
>     (413) 559-6091
>
>

