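Resolving an Oracle RAC VIP failure on AIX 5.3 TL9+
A colleague was installing Oracle 10.2 RAC on AIX 5.3 TL9. After the 10.2.0.1 CRS was installed, the VIP would come ONLINE at first and then go OFFLINE a short while later.
When the problem was reported, the first thing that came to mind was bug 8413088, which we had hit on AIX 6.1. The operating system version was different, but since the TL was fairly recent I checked anyway with the following command: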
netstat -f inet -n -I en0
Name Mtu Network Address Ipkts Ierrs Opkts Oerrs Coll
en0 1500 link#2 0.21.5e.b3.45.a0 480117 0 197560 0 0
en0 1500 172.17 172.17.0.3 480117 0 197560 0 0
Judging from the output, this is not that bug. On 6.1 the output has one extra column (ZoneID) and looks like this:
Name Mtu Network Address ZoneID Ipkts Ierrs Opkts Oerrs Coll
en0 1500 link#2 32.6b.89.34.64.4 - 4943149 0 3409007 0 0
en0 1500 10.20.16 10.20.16.8 - 4943149 0 3409007 0 0
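The comparison above comes down to whether the header row of the netstat output contains a ZoneID column. A minimal shell sketch of that check follows. It assumes the header is the first line of the output, and has_zoneid_column is a hypothetical helper name, not something shipped with AIX or Oracle:

```shell
# Hypothetical helper: reads "netstat -f inet -n -I <if>" output on stdin and
# succeeds (exit 0) only if the header row contains a ZoneID column.
has_zoneid_column() {
  awk '
    NR == 1 {
      # Scan the header fields for the extra column seen on AIX 6.1.
      for (i = 1; i <= NF; i++)
        if ($i == "ZoneID") found = 1
    }
    END { exit (found ? 0 : 1) }
  '
}
```

It could be used as: netstat -f inet -n -I en0 | has_zoneid_column && echo "6.1-style output, check bug 8413088"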
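With that bug ruled out, the cause had to be looked for elsewhere.
I re-checked the network interface configuration with the oifcfg command and confirmed there were no other errors.
I then checked the CRS logs. The ora.NODE1.vip.log file under racg contained a large number of errors like the following: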
Invalid parameters, or failed to bring up VIP (host=NODE1)
The log on node2 contained the same error:
Invalid parameters, or failed to bring up VIP (host=NODE2)
These messages show that Oracle CRS itself considers the VIP to have failed. Enable debug tracing for the resource:
./crsctl debug log res "ora.his2.vip:5"
srvctl start nodeapps -n his2
Check the cssd log on node 2 to confirm that the debug setting has taken effect:
2011-06-23 15:17:45.445: [ CRSOCR][8235]32CAAOCR SET Debug Level[ora.his2.vip]: 2
Looking at ora.NODE2.vip.log again, more detailed information is now available:
Thu Jun 23 15:20:30 BEIST 2011 [ 229504 ] Calling getifbyip
Thu Jun 23 15:20:30 BEIST 2011 [ 229504 ] getifbyip: started for 172.17.0.2
Thu Jun 23 15:20:30 BEIST 2011 [ 229504 ] getif
2011-06-23 15:20:32.983: [ RACG][1] [69690][1][ora.node2.vip]: byip: checking if failover is happening (en0)
Thu Jun 23 15:20:30 BEIST 2011 [ 229504 ] getifbyip: failover is not happening (en0)
Thu Jun 23 15:20:30 BEIST 2011 [ 229504 ] getifbyip: returning IP en0
Thu Jun 23 15:20:30 BEIST 2011 [ 229504 ] Completed g
2011-06-23 15:20:32.983: [ RACG][1] [69690][1][ora.node2.vip]: etifbyip en0
Thu Jun 23 15:20:30 BEIST 2011 [ 229504 ] Completed with initial interface test
Thu Jun 23 15:20:30 BEIST 2011 [ 229504 ] Broadcast = 172.17.0.255
Thu Jun 23 15:20:30 BEIST 2011 [ 229504 ] checkIf: start for if=en0
Thu Jun 23 15:20:30 BEIST 2
2011-06-23 15:20:32.983: [ RACG][1] [69690][1][ora.node2.vip]: 011 [ 229504 ] checkIf: entstat checked if=en0 failed
Thu Jun 23 15:20:30 BEIST 2011 [ 229504 ] defaultgw: started
Thu Jun 23 15:20:30 BEIST 2011 [ 229504 ] defaultgw: completed with 172.17.0.254
Thu Jun 23 15:20:32 BEIST 2011 [ 229504 ] checkIf: RX pac
2011-06-23 15:20:32.983: [ RACG][1] [69690][1][ora.node2.vip]: kets checked if=en0 failed
Interface en0 checked failed (host=node2)
Thu Jun 23 15:20:32 BEIST 2011 [ 229504 ] checkIf: end for if=en0
Thu Jun 23 15:20:32 BEIST 2011 [ 229504 ] Performing CRS_STAT testing
Thu Jun 23 15:20:32 BEIST 2011 [ 229504 ] Completed
2011-06-23 15:20:32.983: [ RACG][1] [69690][1][ora.node2.vip]: CRS_STAT testing
Thu Jun 23 15:20:32 BEIST 2011 [ 229504 ] Completed second gateway test
Thu Jun 23 15:20:32 BEIST 2011 [ 229504 ] Interface tests
Invalid parameters, or failed to bring up VIP (host=node2)
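The debug output shows that the entstat check on en0 failed, and that the script then reported the original error: Invalid parameters, or failed to bring up VIP.
Searching Metalink turned up the corresponding note, which states that the problem affects versions 10.2.0.1 through 10.2.0.4. Now that 10.2.0.5 is available, simply applying the 10.2.0.5 patch set also resolves it.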
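In the trace above, the script's messages are cut in the middle of a line wherever a CRS "[ RACG]" prefix line begins, which makes the failing checks hard to grep for. A rough sketch for stitching the pieces back together follows. It assumes every prefix line continues the line before it, which holds for the excerpt above but may not hold for every log. rejoin_racg_log is a hypothetical helper name:

```shell
# Hypothetical helper: reads a racg VIP log on stdin, strips the
# "timestamp: [ RACG][n] [pid][n][resource]: " prefix, and appends the rest of
# that line to the previous line (assumption: the prefix always splits a line).
rejoin_racg_log() {
  awk '
    {
      if (match($0, /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9] [0-9:.]+: \[ *RACG\]\[[0-9]+\] \[[0-9]+\]\[[0-9]+\]\[[^]]+\]: /)) {
        # Continuation chunk: glue the payload onto the buffered line.
        buf = buf substr($0, RLENGTH + 1)
      } else {
        # Ordinary line: flush the buffered line and start a new one.
        if (NR > 1) print buf
        buf = $0
      }
    }
    END { if (NR > 0) print buf }
  '
}
```

For example, rejoin_racg_log < ora.NODE2.vip.log | grep -E 'checked if=.* failed' would then list the interface checks that failed.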
VIP on AIX 5.3TL9+ Fails to Come Up with "Invalid Parameters, Or Failed To Bring Up VIP" [ID 959746.1]
Modified 04-MAY-2011 Type PROBLEM Status PUBLISHED
Applies to:
Oracle Server - Enterprise Edition - Version: 10.2.0.1 to 10.2.0.4 - Release: 10.2 to 10.2
IBM AIX on POWER Systems (64-bit)
IBM AIX Based Systems (64-bit)
Symptoms
On AIX 5.3 TL9+ , AIX 6 or AIX 6.1, the VIP fails to come up, with error "Invalid Parameters, Or Failed To Bring Up VIP".
Tracing the racgvip command shows that it fails when checking to see if the public interface (NIC) is up.
However, the public interface is up.
An error message similar to the following may be seen in logs:
2009-07-23 17:40:05.812: [ RACG][1] [270490][1][ora.srvr0101.vip]: Thu Jul 23 17:40:05 BST 2009 [ 159774 ] IsIfAlive: /usr/bin/entstat -d en0 failed. Return = 1 (host=srvr0101)
Thu Jul 23 17:40:05 BST 2009 [ 159774 ] checkIf: end for if=en0
Invalid parameters, or failed to bring up VIP (host=srvr0101)
Changes
The adapter type for the public network is LHEA (IBM Logical Host Ethernet Adapter):
# /usr/bin/entstat -d en0
-------------------------------------------------------------
ETHERNET STATISTICS (en0) :
Device Type: Host Ethernet Adapter (l-hea)
...
Cause
The entstat output for LHEA is different from a regular adapter.
The racgvip script uses entstat output to determine if the specified interface is up or not; because the entstat output for LHEA is different, the check fails, therefore the VIP will not come up.
This is a known issue reported in the following bug:
Bug 8725020 - VIP WONT RUN (LHEA) ADAPTER 5.3 TL9
Solution
The following workaround fixes the racgvip script so that it does not fail for LHEA adapters:
1. Backup the racgvip script
2. Edit this line in the script using vi:
$ENTSTAT -d $_IF | $GREP -iEq '.*lan.*state.*:.*operational.*|.*link.*status.*:.*up.*|.*port.*operational.*state.*:.*up.*'
and replace it with this:
$ENTSTAT -d $_IF | $GREP -iEq '.*lan.*state.*:.*operational.*|.*link.*status.*:.*up.*|.*port.*operational.*state.*:.*up.*|.*driver.*flags.*:.*up.*'
(Notice that an extra regexp clause has been tacked on the end of the grep argument.)
3. Make sure that no stray characters have been introduced.
4. Save the racgvip file.
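Before editing the script, the effect of the extra clause can be tried out in isolation. The sketch below copies the two patterns from the note and runs them with grep against a sample line. The "Driver Flags: Up Broadcast Running" sample is an assumption about what entstat prints for an LHEA adapter, so substitute the real output of /usr/bin/entstat -d en0 on the affected host. link_is_up is a hypothetical helper name:

```shell
# Patterns copied from the note: the original check, and the check with the
# extra "driver flags" clause appended.
OLD_RE='.*lan.*state.*:.*operational.*|.*link.*status.*:.*up.*|.*port.*operational.*state.*:.*up.*'
NEW_RE="${OLD_RE}|.*driver.*flags.*:.*up.*"

# Hypothetical helper: $1 is the pattern, stdin is "entstat -d <if>" output.
# Exit status 0 means the pattern treats the interface as up.
link_is_up() {
  grep -iEq "$1"
}

# Assumed sample of LHEA entstat output (not taken from the affected host).
SAMPLE='Device Type: Host Ethernet Adapter (l-hea)
Driver Flags: Up Broadcast Running'

if printf '%s\n' "$SAMPLE" | link_is_up "$OLD_RE"; then echo "old: up"; else echo "old: not detected"; fi
if printf '%s\n' "$SAMPLE" | link_is_up "$NEW_RE"; then echo "new: up"; else echo "new: not detected"; fi
```

On the real system, /usr/bin/entstat -d en0 | link_is_up "$NEW_RE" should succeed once the interface is genuinely up. If it does not, the edited racgvip check would still fail.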
References
BUG:8725020 - VIP WONT RUN (LHEA) ADAPTER 5.3 TL9
NOTE:567286.1 - RAC on AIX: With Virtual Interfaces Racgvip Fails Even Though Public Interface is Up