[Clusterusers] fly's hardware problem

Wm. Josiah Erikson wjerikson at hampshire.edu
Tue Dec 22 10:13:50 EST 2015


Hey folks,
    I'm trying to get an idea of how much people were planning on using
the cluster over the holidays, i.e. how much effort I should put into
this. Are there folks who were planning to do important work over the
break whose plans will be totally screwed up if it's not back until
after the new year?
    I will come in tomorrow and probably be able to get it back up
anyway, just want to know what kind of priority to give it in case of
mishap.
    Thanks and happy holidays!
    -Josiah


On 12/22/15, 8:47 AM, Wm. Josiah Erikson wrote:
> I got some time this morning and came in on a vacation day...
>
> After some mucking around with the BIOS, CPUs, and memory config, things
> deteriorated quickly...
>
> Well, the machine won't even POST now, so I'll have to move everything
> over to a new machine and get a new license file from Pixar. I'll try to
> get this done tomorrow morning, but this may mean that the cluster will
> be down until January if everything doesn't go smoothly.
>
> Apologies!
>     -Josiah
>
>
> On 12/21/15 9:28 PM, Wm. Josiah Erikson wrote:
>> ...or not. It appears to be happening again. I'm out tomorrow, but I
>> should be in briefly on Wednesday and will make another attempt.
>>     -Josiah
>>
>>
>> On 12/21/15, 11:04 AM, Wm. Josiah Erikson wrote:
>>> One of the power supplies was probably bad (dim/flickering lights, and
>>> Google searches had other people experiencing this problem with a bad
>>> power supply). I replaced it with a spare I had sitting around (thanks,
>>> Mt. Holyoke!) and hopefully that will be the end of that. Of course, it
>>> could be something else too :)
>>> For now, fly's back up. Hopefully for good, but no guarantees.
>>>     -Josiah
>>>
>>>
>>> On 12/21/15 10:35 AM, Wm. Josiah Erikson wrote:
>>>> Well, it just kernel panicked. Neat. That's new. So it's down now until
>>>> I get it back up :) I'll keep you updated...
>>>>     -Josiah
>>>>
>>>>
>>>> On 12/21/15 10:08 AM, Wm. Josiah Erikson wrote:
>>>>> Hello all,
>>>>>     I rebooted fly again - it is not fixed yet, though it is up and
>>>>> running currently. I will probably take it down later today to try
>>>>> removing one of the CPUs to see if it is the problem. This should
>>>>> result in less than 10 minutes of downtime, and jobs should resume where
>>>>> they left off, so no need to hold off launching renders or whatever. My
>>>>> current theory is that CPU #2 is faulty. It could also be a bad power
>>>>> supply. I'll keep you updated.
>>>>>

-- 
-----
Wm. Josiah Erikson
Head, Systems and Networking
Hampshire College
Amherst, MA 01002
