[Clusterusers] fly's hardware problem

Wm. Josiah Erikson wjerikson at hampshire.edu
Wed Dec 23 06:41:57 EST 2015


Hi all,
    Thanks for all the feedback and the thanks. As far as I can tell,
everything went well. I swapped the disks, controller, memory, power
supplies, and extra dual-interface network card into a new machine, and
it came up with a few tweaks. Luckily eth0 was on the card I swapped
over, so no need for a new license file!
    Hopefully the problem was with the motherboard or CPUs on the old
machine; otherwise I just swapped the problem over... only time will tell.
    Enjoy, and tell me if you have any issues.
    All the best,
    -Josiah


On 12/22/15 10:13 AM, Wm. Josiah Erikson wrote:
> Hey folks,
>     I'm trying to get an idea of how much people were planning on using
> the cluster over the holidays, i.e. how much effort I should put into
> this. Are there folks who were planning to do important work over the
> break whose plans will be totally screwed up if it's not back until
> after the new year?
>     I will come in tomorrow and probably be able to get it back up
> anyway, just want to know what kind of priority to give it in case of
> mishap.
>     Thanks and happy holidays!
>     -Josiah
>
>
> On 12/22/15, 8:47 AM, Wm. Josiah Erikson wrote:
>> I got some time this morning and came in on a vacation day...
>>
>> After some mucking around with BIOS and CPUs and memory config, things
>> deteriorated quickly...
>>
>> Well, the machine won't even POST now, so I'll have to move everything
>> over to a new machine and get a new license file from Pixar. I'll try to
>> get this done tomorrow morning, but this may mean that the cluster will
>> be down until January if everything doesn't go smoothly.
>>
>> Apologies!
>>     -Josiah
>>
>>
>> On 12/21/15 9:28 PM, Wm. Josiah Erikson wrote:
>>> ...or not. It appears to be happening again. I'm out tomorrow, but I
>>> should be in briefly on Wednesday and will make another attempt.
>>>     -Josiah
>>>
>>>
>>> On 12/21/15, 11:04 AM, Wm. Josiah Erikson wrote:
>>>> One of the power supplies was probably bad (dim/flickering lights, and
>>>> Google searches had other people experiencing this problem with a bad
>>>> power supply). I replaced it with a spare I had sitting around (thanks,
>>>> Mt. Holyoke!) and hopefully that will be the end of that. Of course, it
>>>> could be something else too :)
>>>> For now, fly's back up. Hopefully for good, but no guarantees.
>>>>     -Josiah
>>>>
>>>>
>>>> On 12/21/15 10:35 AM, Wm. Josiah Erikson wrote:
>>>>> Well, it just kernel panicked. Neat. That's new. So it's down now until
>>>>> I get it back up :) I'll keep you updated...
>>>>>     -Josiah
>>>>>
>>>>>
>>>>> On 12/21/15 10:08 AM, Wm. Josiah Erikson wrote:
>>>>>> Hello all,
>>>>>>     I rebooted fly again - it is not fixed yet, though it is up and
>>>>>> running currently. I will probably take it down later today to try
>>>>>> removing one of the CPUs to see if it is the problem. This should
>>>>>> result in less than 10 minutes of downtime, and jobs should resume where
>>>>>> they left off, so no need to hold off launching renders or whatever. My
>>>>>> current theory is that CPU #2 is faulty. It could also be a bad power
>>>>>> supply. I'll keep you updated.
>>>>>>

-- 
Wm. Josiah Erikson
Assistant Director of IT, Infrastructure Group
System Administrator, School of CS
Hampshire College
Amherst, MA 01002
(413) 559-6091


