[Clusterusers] Switching to new keg
Wm. Josiah Erikson
wjerikson at hampshire.edu
Mon Jun 22 16:38:35 EDT 2015
I've got a couple of nodes (compute-1-4 and compute-2-17) that won't
remount /helga. It looks, Tom, like your processes are still running and
won't let go of /helga because they were in fact writing their logs to it,
which I did not anticipate. If I can reboot those nodes, we should be
all set.
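For the record, if rebooting weren't convenient, something along these
lines would probably release the mount instead. This is only a sketch I
haven't run on these nodes; it assumes the tractor-blade process holding
the logs (PID 3717 on compute-1-4 below) can safely be stopped:

# see what's still holding files open on the stale mount
lsof | grep /helga
fuser -vm /helga

# stop the holder, then lazy-unmount and remount from fstab
kill 3717            # the tractor-blade python-bin from the lsof output; kill -9 if it hangs
umount -l /helga     # lazy unmount detaches the mount point even while it is "busy"
mount /helga         # /helga has an fstab entry, so a bare mount brings it back

Anyway, here's what the two nodes look like right now: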
[root at fly install]# ssh compute-1-4
Rocks Compute Node
Rocks 6.1.1 (Sand Boa)
Profile built 09:02 08-Jun-2015
Kickstarted 09:09 08-Jun-2015
[root at compute-1-4 ~]# ls /helga
ls: cannot access /helga: Stale file handle
[root at compute-1-4 ~]# umount -f /helga
umount2: Device or resource busy
umount.nfs: /helga: device is busy
umount2: Device or resource busy
umount.nfs: /helga: device is busy
[root at compute-1-4 ~]# ps aux | grep max
root 723 0.0 0.0 105312 864 pts/0 S+ 16:31 0:00 grep max
root 3717 0.1 0.0 349216 13068 ? Sl Jun08 38:24
/opt/pixar/tractor-blade-1.7.2/python/bin/python-bin
/opt/pixar/tractor-blade-1.7.2/blade-modules/tractor-blade.py
--dyld_framework_path=None
--ld_library_path=/opt/pixar/tractor-blade-1.7.2/python/lib:/opt/gridengine/lib/linux-x64:/opt/openmpi/lib:/opt/python/lib:/share/apps/maxwell-3.1:/share/apps/maxwell64-3.1
--debug --log /var/log/tractor-blade.log -P 8000
[root at compute-1-4 ~]# ps aux | grep Max
root 725 0.0 0.0 105312 864 pts/0 S+ 16:32 0:00 grep Max
[root at compute-1-4 ~]# lsof | grep /helga
lsof: WARNING: can't stat() nfs4 file system /helga
Output information may be incomplete.
python-bi 3717 root 5w unknown /helga/public_html/tractor/cmd-logs/thelmuth/J1506210001/T2.log (stat: Stale file handle)
python-bi 3717 root 8w unknown /helga/public_html/tractor/cmd-logs/thelmuth/J1506210001/T9.log (stat: Stale file handle)
python-bi 3717 root 12w unknown /helga/public_html/tractor/cmd-logs/thelmuth/J1506210001/T14.log (stat: Stale file handle)
python-bi 3717 root 16w unknown /helga/public_html/tractor/cmd-logs/thelmuth/J1506210001/T24.log (stat: Stale file handle)
[root at compute-1-4 ~]# top
top - 16:32:23 up 14 days, 7:21, 1 user, load average: 33.77, 35.33, 35.73
Tasks: 230 total, 1 running, 229 sleeping, 0 stopped, 0 zombie
Cpu(s): 50.2%us, 0.2%sy, 0.0%ni, 49.6%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 15928760k total, 8015840k used, 7912920k free, 238044k buffers
Swap: 2047992k total, 0k used, 2047992k free, 1094888k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
8428 thelmuth 20 0 6089m 1.5g 11m S 254.9 10.1 2809:37 java
8499 thelmuth 20 0 6089m 1.5g 11m S 243.6 10.1 2829:15 java
8367 thelmuth 20 0 6089m 1.5g 11m S 236.0 10.0 2829:42 java
8397 thelmuth 20 0 6089m 1.5g 11m S 28.3 10.1 2787:46 java
737 root 20 0 15036 1220 840 R 1.9 0.0 0:00.01 top
1536 root 20 0 0 0 0 S 1.9 0.0 10:24.07 kondemand/6
1 root 20 0 19344 1552 1232 S 0.0 0.0 0:00.46 init
2 root 20 0 0 0 0 S 0.0 0.0 0:00.02 kthreadd
3 root RT 0 0 0 0 S 0.0 0.0 0:00.53 migration/0
4 root 20 0 0 0 0 S 0.0 0.0 0:02.85 ksoftirqd/0
5 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/0
6 root RT 0 0 0 0 S 0.0 0.0 0:01.50 watchdog/0
7 root RT 0 0 0 0 S 0.0 0.0 0:00.49 migration/1
8 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/1
9 root 20 0 0 0 0 S 0.0 0.0 0:01.85 ksoftirqd/1
10 root RT 0 0 0 0 S 0.0 0.0 0:01.21 watchdog/1
11 root RT 0 0 0 0 S 0.0 0.0 0:00.50 migration/2
12 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/2
13 root 20 0 0 0 0 S 0.0 0.0 0:02.12 ksoftirqd/2
14 root RT 0 0 0 0 S 0.0 0.0 0:01.06 watchdog/2
15 root RT 0 0 0 0 S 0.0 0.0 0:01.81 migration/3
16 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/3
17 root 20 0 0 0 0 S 0.0 0.0 0:00.50 ksoftirqd/3
18 root RT 0 0 0 0 S 0.0 0.0 0:00.87 watchdog/3
19 root RT 0 0 0 0 S 0.0 0.0 0:01.76 migration/4
20 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/4
21 root 20 0 0 0 0 S 0.0 0.0 0:04.57 ksoftirqd/4
22 root RT 0 0 0 0 S 0.0 0.0 0:01.03 watchdog/4
23 root RT 0 0 0 0 S 0.0 0.0 0:00.51 migration/5
24 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/5
25 root 20 0 0 0 0 S 0.0 0.0 0:02.27 ksoftirqd/5
[root at compute-1-4 ~]# umount -f /helga
umount2: Device or resource busy
umount.nfs: /helga: device is busy
umount2: Device or resource busy
umount.nfs: /helga: device is busy
[root at compute-1-4 ~]# umount -f /helga
umount2: Device or resource busy
umount.nfs: /helga: device is busy
umount2: Device or resource busy
umount.nfs: /helga: device is busy
[root at compute-1-4 ~]# mount -o remount /helga
mount.nfs: Stale file handle
[root at compute-1-4 ~]# mount -o remount /helga
mount.nfs: Stale file handle
[root at compute-1-4 ~]# top
top - 16:33:19 up 14 days, 7:22, 1 user, load average: 36.17, 35.76, 35.86
Tasks: 229 total, 1 running, 228 sleeping, 0 stopped, 0 zombie
Cpu(s): 99.6%us, 0.4%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 15928760k total, 8016408k used, 7912352k free, 238044k buffers
Swap: 2047992k total, 0k used, 2047992k free, 1094888k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
8499 thelmuth 20 0 6089m 1.5g 11m S 209.2 10.1 2831:08 java
8397 thelmuth 20 0 6089m 1.5g 11m S 198.9 10.1 2789:37 java
8367 thelmuth 20 0 6089m 1.5g 11m S 196.2 10.0 2831:37 java
8428 thelmuth 20 0 6089m 1.5g 11m S 192.9 10.1 2811:28 java
764 root 20 0 15164 1352 952 R 0.7 0.0 0:00.03 top
1532 root 20 0 0 0 0 S 0.3 0.0 20:31.35 kondemand/2
1534 root 20 0 0 0 0 S 0.3 0.0 12:32.23 kondemand/4
1536 root 20 0 0 0 0 S 0.3 0.0 10:24.14 kondemand/6
1537 root 20 0 0 0 0 S 0.3 0.0 17:00.03 kondemand/7
3717 root 20 0 341m 12m 3000 S 0.3 0.1 38:24.92 python-bin
1 root 20 0 19344 1552 1232 S 0.0 0.0 0:00.46 init
2 root 20 0 0 0 0 S 0.0 0.0 0:00.02 kthreadd
3 root RT 0 0 0 0 S 0.0 0.0 0:00.53 migration/0
4 root 20 0 0 0 0 S 0.0 0.0 0:02.85 ksoftirqd/0
5 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/0
6 root RT 0 0 0 0 S 0.0 0.0 0:01.50 watchdog/0
7 root RT 0 0 0 0 S 0.0 0.0 0:00.49 migration/1
8 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/1
9 root 20 0 0 0 0 S 0.0 0.0 0:01.85 ksoftirqd/1
10 root RT 0 0 0 0 S 0.0 0.0 0:01.21 watchdog/1
11 root RT 0 0 0 0 S 0.0 0.0 0:00.50 migration/2
12 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/2
13 root 20 0 0 0 0 S 0.0 0.0 0:02.12 ksoftirqd/2
14 root RT 0 0 0 0 S 0.0 0.0 0:01.06 watchdog/2
15 root RT 0 0 0 0 S 0.0 0.0 0:01.81 migration/3
16 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/3
17 root 20 0 0 0 0 S 0.0 0.0 0:00.50 ksoftirqd/3
18 root RT 0 0 0 0 S 0.0 0.0 0:00.87 watchdog/3
19 root RT 0 0 0 0 S 0.0 0.0 0:01.76 migration/4
20 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/4
21 root 20 0 0 0 0 S 0.0 0.0 0:04.57 ksoftirqd/4
[root at compute-1-4 ~]# umount -f /helga
umount2: Device or resource busy
umount.nfs: /helga: device is busy
umount2: Device or resource busy
umount.nfs: /helga: device is busy
[root at compute-1-4 ~]# mount /helga
mount.nfs: Stale file handle
[root at compute-1-4 ~]# logout
Connection to compute-1-4 closed.
[root at fly install]# ssh compute-2-17
Rocks Compute Node
Rocks 6.1.1 (Sand Boa)
Profile built 23:42 25-Dec-2014
Kickstarted 23:58 25-Dec-2014
[root at compute-2-17 ~]# ls /helga
ls: cannot access /helga: Stale file handle
[root at compute-2-17 ~]# umount -f /helga
umount2: Device or resource busy
umount.nfs: /helga: device is busy
umount2: Device or resource busy
umount.nfs: /helga: device is busy
[root at compute-2-17 ~]# lsof | grep /helga
lsof: WARNING: can't stat() nfs4 file system /helga
Output information may be incomplete.
python-bi 28615 root 8w unknown /helga/public_html/tractor/cmd-logs/hz12/J1504210040/T3.log (stat: Stale file handle)
[root at compute-2-17 ~]#
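If any other nodes are in the same state, a quick sweep like the
following should find them. This is just a hypothetical one-liner; it
assumes passwordless root ssh to the compute nodes (which is how I got
in above), and I think 'rocks run host' would do much the same across
the whole cluster:

# check each node for a stale or missing /helga mount
for node in compute-1-4 compute-2-17; do    # extend to the full compute node list
    ssh "$node" 'ls /helga >/dev/null 2>&1 && echo "$(hostname -s): ok" || echo "$(hostname -s): stale or unmounted"'
done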
On 6/22/15 4:33 PM, Wm. Josiah Erikson wrote:
> Uh... shouldn't have been. I wonder if it was because I yanked its
> logfile place out from under it? I did not expect that to happen! It's
> back now, but it looks like all of your runs errored out. I'm so sorry!
> -Josiah
>
>
> On 6/22/15 4:00 PM, Thomas Helmuth wrote:
>> Very cool!
>>
>> It looks like tractor is down. Is that related to this move?
>>
>> Tom
>>
>> On Mon, Jun 22, 2015 at 1:50 PM, Wm. Josiah Erikson
>> <wjerikson at hampshire.edu> wrote:
>>
>> Keg will be going down shortly, and coming back up as a harder, faster,
>> better, stronger version shortly as well, I hope :)
>> -Josiah
>>
>>
>> On 6/11/15 9:53 AM, Wm. Josiah Erikson wrote:
>> > Hi all,
>> > I'm pretty sure there is 100% overlap between people who care
>> > about fly and people who care about keg at this point (though there
>> > are some people who care about fly and not so much keg, like Lee and
>> > Tom - sorry), so I'm sending this to this list.
>> > I have a new 32TB 14+2 RAID6 with 24GB of RAM superfast (way
>> > faster than gigabit) keg all ready to go! I would like to bring it up
>> > on Monday, June 22nd. It would be ideal if rendering was NOT happening
>> > at that time, to make my rsyncing life easier :) Any objections?
>> > -Josiah
>> >
>>
>> --
>> Wm. Josiah Erikson
>> Assistant Director of IT, Infrastructure Group
>> System Administrator, School of CS
>> Hampshire College
>> Amherst, MA 01002
>> (413) 559-6091
>>
>> _______________________________________________
>> Clusterusers mailing list
>> Clusterusers at lists.hampshire.edu
>> https://lists.hampshire.edu/mailman/listinfo/clusterusers
>>
>>
>>
>>
>> _______________________________________________
>> Clusterusers mailing list
>> Clusterusers at lists.hampshire.edu
>> https://lists.hampshire.edu/mailman/listinfo/clusterusers
>
> --
> Wm. Josiah Erikson
> Assistant Director of IT, Infrastructure Group
> System Administrator, School of CS
> Hampshire College
> Amherst, MA 01002
> (413) 559-6091
>
>
> _______________________________________________
> Clusterusers mailing list
> Clusterusers at lists.hampshire.edu
> https://lists.hampshire.edu/mailman/listinfo/clusterusers
--
Wm. Josiah Erikson
Assistant Director of IT, Infrastructure Group
System Administrator, School of CS
Hampshire College
Amherst, MA 01002
(413) 559-6091