[Clusterusers] fly's hardware problem

Thomas Helmuth thelmuth at cs.umass.edu
Tue Dec 22 10:21:33 EST 2015


Hi Josiah,

As far as I know, neither Lee nor I plan on using the cluster for GP runs over
the break. On the other hand, we are working on some papers with deadlines
in late January that have a bunch of data on fly that we'll need. Will we
not even be able to access that data? That could be a problem.

Basically, I don't think we'll need tractor or access to the nodes, but it
would be good to be able to access and gather data. If that's not possible,
it's not a huge deal since we have some initial data, but we'd need access
to it ASAP after break.

Thanks!
Tom

On Tue, Dec 22, 2015 at 10:13 AM, Wm. Josiah Erikson <
wjerikson at hampshire.edu> wrote:

> Hey folks,
>     I'm trying to get an idea of how much people were planning on using
> the cluster over the holidays, i.e. how much effort I should put into
> this. Are there folks who were planning to do important work over the
> break whose plans will be totally screwed up if it's not back until
> after the new year?
>     I will come in tomorrow and probably be able to get it back up
> anyway, just want to know what kind of priority to give it in case of
> mishap.
>     Thanks and happy holidays!
>     -Josiah
>
>
> On 12/22/15, 8:47 AM, Wm. Josiah Erikson wrote:
> > I got some time this morning and came in on a vacation day...
> >
> > After some mucking around with BIOS and CPUs and memory config, things
> > are deteriorating quickly...
> >
> > Well, the machine won't even post now, so I'll have to move everything
> > over to a new machine and get a new license file from Pixar. I'll try to
> > get this done tomorrow morning, but this may mean that the cluster will
> > be down until January if everything doesn't go smoothly.
> >
> > Apologies!
> >     -Josiah
> >
> >
> > On 12/21/15 9:28 PM, Wm. Josiah Erikson wrote:
> >> ...or not. It appears to be happening again. I'm out tomorrow, but I
> >> should be in briefly on Wednesday and will make another attempt.
> >>     -Josiah
> >>
> >>
> >> On 12/21/15, 11:04 AM, Wm. Josiah Erikson wrote:
> >>> One of the power supplies was probably bad (dim/flickering lights, and
> >>> Google searches had other people experiencing this problem with a bad
> >>> power supply). I replaced it with a spare I had sitting around (thanks,
> >>> Mt. Holyoke!) and hopefully that will be the end of that. Of course, it
> >>> could be something else too :)
> >>> For now, fly's back up. Hopefully for good, but no guarantees.
> >>>     -Josiah
> >>>
> >>>
> >>> On 12/21/15 10:35 AM, Wm. Josiah Erikson wrote:
> >>>> Well, it just kernel panicked. Neat. That's new. So it's down now
> >>>> until I get it back up :) I'll keep you updated...
> >>>>     -Josiah
> >>>>
> >>>>
> >>>> On 12/21/15 10:08 AM, Wm. Josiah Erikson wrote:
> >>>>> Hello all,
> >>>>>     I rebooted fly again - it is not fixed yet, though it is up and
> >>>>> running currently. I will probably take it down later today to try
> >>>>> removing one of the CPUs to see if it is the problem. This should
> >>>>> result in less than 10 minutes of downtime, and jobs should resume
> >>>>> where they left off, so no need to hold off launching renders or
> >>>>> whatever. My current theory is that CPU #2 is faulty. It could also
> >>>>> be a bad power supply. I'll keep you updated.
> >>>>>
>
> --
> -----
> Wm. Josiah Erikson
> Head, Systems and Networking
> Hampshire College
> Amherst, MA 01002
>
> _______________________________________________
> Clusterusers mailing list
> Clusterusers at lists.hampshire.edu
> https://lists.hampshire.edu/mailman/listinfo/clusterusers
>