[Clusterusers] fly's hardware problem

Jaime Davila jjdCCS at hampshire.edu
Wed Dec 23 10:02:39 EST 2015


Thanks Josiah, much appreciated.

BTW, it looks like the clock on fly is out of sync. I don't think it's
an issue; I just mention it in case it affects tractor in some way. My
jobs get listed initially as having been running for 4+ hours, and then
their run time gets corrected when they finish. They seem to be running
fine, but I don't know if internally tractor looks at those times for
anything like load balancing, etc. As far as I can tell my processes are
not being affected by any of it.

Thanks again,

Jaime


On 12/23/2015 06:41 AM, Wm. Josiah Erikson wrote:
> Hi all,
>     Thanks for all the feedback and thanks. As far as I can tell,
> everything went well. I swapped the disks, controller, memory, power
> supplies, and extra dual-interface network card into a new machine, and
> it came up with a few tweaks. Luckily eth0 was on the card I swapped
> over, so no need for a new license file!
>     Hopefully the problem was with the motherboard or CPUs on the old
> machine; otherwise I just swapped the problem over... only time will tell.
>     Enjoy, and tell me if you have any issues.
>     All the best,
>     -Josiah
> 
> 
> On 12/22/15 10:13 AM, Wm. Josiah Erikson wrote:
>> Hey folks,
>>     I'm trying to get an idea of how much people were planning on using
>> the cluster over the holidays, i.e. how much effort I should put into
>> this. Are there folks who were planning to do important work over the
>> break whose plans will be totally screwed up if it's not back until
>> after the new year?
>>     I will come in tomorrow and probably be able to get it back up
>> anyway, just want to know what kind of priority to give it in case of
>> mishap.
>>     Thanks and happy holidays!
>>     -Josiah
>>
>>
>> On 12/22/15, 8:47 AM, Wm. Josiah Erikson wrote:
>>> I got some time this morning and came in on a vacation day...
>>>
>>> After some mucking around with BIOS and CPUs and memory config, things
>>> deteriorated quickly...
>>>
>>> Well, the machine won't even POST now, so I'll have to move everything
>>> over to a new machine and get a new license file from Pixar. I'll try to
>>> get this done tomorrow morning, but this may mean that the cluster will
>>> be down until January if everything doesn't go smoothly.
>>>
>>> Apologies!
>>>     -Josiah
>>>
>>>
>>> On 12/21/15 9:28 PM, Wm. Josiah Erikson wrote:
>>>> ...or not. It appears to be happening again. I'm out tomorrow, but I
>>>> should be in briefly on Wednesday and will make another attempt.
>>>>     -Josiah
>>>>
>>>>
>>>> On 12/21/15, 11:04 AM, Wm. Josiah Erikson wrote:
>>>>> One of the power supplies was probably bad (dim/flickering lights, and
>>>>> Google searches had other people experiencing this problem with a bad
>>>>> power supply). I replaced it with a spare I had sitting around (thanks,
>>>>> Mt. Holyoke!) and hopefully that will be the end of that. Of course, it
>>>>> could be something else too :)
>>>>> For now, fly's back up. Hopefully for good, but no guarantees.
>>>>>     -Josiah
>>>>>
>>>>>
>>>>> On 12/21/15 10:35 AM, Wm. Josiah Erikson wrote:
>>>>>> Well, it just kernel panicked. Neat. That's new. So it's down now until
>>>>>> I get it back up :) I'll keep you updated...
>>>>>>     -Josiah
>>>>>>
>>>>>>
>>>>>> On 12/21/15 10:08 AM, Wm. Josiah Erikson wrote:
>>>>>>> Hello all,
>>>>>>>     I rebooted fly again - it is not fixed yet, though it is up and
>>>>>>> running currently. I will probably take it down later today to try
>>>>>>> removing one of the CPUs to see if it is the problem. This should
>>>>>>> result in less than 10 minutes of downtime, and jobs should resume where
>>>>>>> they left off, so no need to hold off launching renders or whatever. My
>>>>>>> current theory is that CPU #2 is faulty. It could also be a bad power
>>>>>>> supply. I'll keep you updated.
>>>>>>>
> 

-- 
------------------------------------
Jaime Davila
Associate Professor of Computer Science
Hampshire College, School of Cognitive Science
http://hampshire.edu/jdavila/
=== N.B. I sign my emails electronically with a publicly available key.
=== To verify the authenticity of this email,
=== install PGP software for your email client
=== and retrieve public key for address jdavila at hampshire.edu
------------------------------------
