[Clusterusers] fly's hardware problem

Wm. Josiah Erikson wjerikson at hampshire.edu
Wed Dec 23 10:09:53 EST 2015


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512

Quite so - thanks for pointing that out. Fixed!
    -Josiah


On 12/23/15 10:02 AM, Jaime Davila wrote:
> Thanks Josiah, much appreciated.
>
> BTW, it looks like the clock on fly is out of sync. I don't think it's
> an issue, I just mention it in case that affects tractor in some way. My
> jobs get listed initially as having been running for 4+ hours, and then
> their run time gets corrected when they finish. They seem to be running
> fine, but I don't know if internally tractor looks at those times for
> anything like load balancing, etc. As far as I can tell my processes are
> not being affected by any of it.
>
> Thanks again,
>
> Jaime
>
>
> On 12/23/2015 06:41 AM, Wm. Josiah Erikson wrote:
>> Hi all,
>>     Thanks for all the feedback and the thank-yous. As far as I can tell,
>> everything went well. I swapped the disks, controller, memory, power
>> supplies, and extra dual-interface network card into a new machine, and
>> it came up with a few tweaks. Luckily eth0 was on the card I swapped
>> over, so no need for a new license file!
>>     Hopefully the problem was with the motherboard or CPUs on the old
>> machine, otherwise I just swapped the problem over... only time will
>> tell.
>>     Enjoy, and tell me if you have any issues.
>>     All the best,
>>     -Josiah
>>
>>
>> On 12/22/15 10:13 AM, Wm. Josiah Erikson wrote:
>>> Hey folks,
>>>     I'm trying to get an idea of how much people were planning on using
>>> the cluster over the holidays, i.e. how much effort I should put into
>>> this. Are there folks who were planning to do important work over the
>>> break whose plans will be totally screwed up if it's not back until
>>> after the new year?
>>>     I will come in tomorrow and probably be able to get it back up
>>> anyway, just want to know what kind of priority to give it in case of
>>> mishap.
>>>     Thanks and happy holidays!
>>>     -Josiah
>>>
>>>
>>> On 12/22/15, 8:47 AM, Wm. Josiah Erikson wrote:
>>>> I got some time this morning and came in on a vacation day...
>>>>
>>>> After some mucking around with the BIOS, CPUs, and memory config, things
>>>> deteriorated quickly...
>>>>
>>>> Well, the machine won't even POST now, so I'll have to move everything
>>>> over to a new machine and get a new license file from Pixar. I'll try to
>>>> get this done tomorrow morning, but this may mean that the cluster will
>>>> be down until January if everything doesn't go smoothly.
>>>>
>>>> Apologies!
>>>>     -Josiah
>>>>
>>>>
>>>> On 12/21/15 9:28 PM, Wm. Josiah Erikson wrote:
>>>>> ...or not. It appears to be happening again. I'm out tomorrow, but I
>>>>> should be in briefly on Wednesday and will make another attempt.
>>>>>     -Josiah
>>>>>
>>>>>
>>>>> On 12/21/15, 11:04 AM, Wm. Josiah Erikson wrote:
>>>>>> One of the power supplies was probably bad (dim/flickering lights, and
>>>>>> Google searches had other people experiencing this problem with a bad
>>>>>> power supply). I replaced it with a spare I had sitting around (thanks,
>>>>>> Mt. Holyoke!) and hopefully that will be the end of that. Of course, it
>>>>>> could be something else too :)
>>>>>> For now, fly's back up. Hopefully for good, but no guarantees.
>>>>>>     -Josiah
>>>>>>
>>>>>>
>>>>>> On 12/21/15 10:35 AM, Wm. Josiah Erikson wrote:
>>>>>>> Well, it just kernel panicked. Neat. That's new. So it's down now until
>>>>>>> I get it back up :) I'll keep you updated...
>>>>>>>     -Josiah
>>>>>>>
>>>>>>>
>>>>>>> On 12/21/15 10:08 AM, Wm. Josiah Erikson wrote:
>>>>>>>> Hello all,
>>>>>>>>     I rebooted fly again - it is not fixed yet, though it is up and
>>>>>>>> running currently. I will probably take it down later today to try
>>>>>>>> removing one of the CPUs to see if it is the problem. This should
>>>>>>>> result in less than 10 minutes of downtime, and jobs should resume where
>>>>>>>> they left off, so no need to hold off launching renders or whatever. My
>>>>>>>> current theory is that CPU #2 is faulty. It could also be a bad power
>>>>>>>> supply. I'll keep you updated.
>>>>>>>>
>>
>

- -- 
Wm. Josiah Erikson
Assistant Director of IT, Infrastructure Group
System Administrator, School of CS
Hampshire College
Amherst, MA 01002
(413) 559-6091
-----BEGIN PGP SIGNATURE-----
Version: GnuPG/MacGPG2 v2.0.22 (Darwin)
Comment: GPGTools - https://gpgtools.org

iQEcBAEBCgAGBQJWerlBAAoJEEhWIsC/LdC33qQH/1Cy/LSR0njniUVcaMLr3p+X
4Nq2sX2nFICfxXnNimy8gbArJYs8BXtSS3KG1OPjncOMEvvRLLKihe0yIp+p0DjE
2B7GCseDnm6pgNuRMp5K4oO6t5ITBlZj9zbwD90Mh17cnviVzQCbNxNMN7Y6Xy3H
fHKOghYY/QLJi4xNHtcVHs7GTUGuaH4ZDNNhfdJFRsYT9cnqjO149DavPSjD+oiA
ZjakVHynObRydd+Y21FmT4WFUXFX6XtN5h4d+KCthnFn/cAe0CBitxpoGa+Jlliv
dDrUz1i45FFDzuG2BNA+O7a1i12YCHIkWSNvcRiTbT8VCJWbUV/D/JC5xVZsQkw=
=TC8s
-----END PGP SIGNATURE-----
