GC CR Block Lost vs 12c

August 20, 2016 ❖ Tags: oracle, rac, bugs

Had a panic after a recent upgrade when we noticed that gc cr block lost had a 600x increase. Fortunately, Ram Cheruvattath had run across this previously and directed me to

False increase of ‘Global Cache Blocks Lost’ or ‘gc blocks lost’ after upgrade to 12c (Doc ID 2096299.1)

SQL> ed
Wrote file afiedit.sql
1 SELECT sysdate snap_time,a.inst_ID "INSTANCE", A.VALUE "GC BLOCKS LOST",
2 B.VALUE "GC CUR BLOCKS SERVED",
3 C.VALUE "GC CR BLOCKS SERVED",
4 A.VALUE/(B.VALUE+C.VALUE) RATIO
5 FROM GV$SYSSTAT A, GV$SYSSTAT B, GV$SYSSTAT C
6 WHERE A.NAME='gc blocks lost' AND
7 B.NAME='gc current blocks served' AND
8 C.NAME='gc cr blocks served' and
9 B.INST_ID=a.inst_id AND
10 C.INST_ID = a.inst_id
11* order by 2
SQL> /

SNAP_TIME INSTANCE GC BLOCKS LOST GC CUR BLOCKS SERVED GC CR BLOCKS SERVED RATIO
----------------- ---------- -------------- -------------------- ------------------- ----------
20160810 14:01:08 1 19373 6732648 7025417 .00140812
20160810 14:01:08 2 16649 7331907 5754179 .001272267
20160810 14:01:08 3 85182 10767159 8645896 .004387872
20160810 14:01:08 4 37808 8047542 14586125 .001670432
20160810 14:01:08 5 37987 8310080 15634197 .001586475
20160810 14:01:08 6 1877 2385405 713257 .000605745

6 rows selected.

SQL> /

SNAP_TIME INSTANCE GC BLOCKS LOST GC CUR BLOCKS SERVED GC CR BLOCKS SERVED RATIO
----------------- ---------- -------------- -------------------- ------------------- ----------
20160810 14:01:26 1 19446 6744274 7031562 .001411602
20160810 14:01:26 2 16716 7344979 5759818 .001275563
20160810 14:01:26 3 85187 10771512 8650641 .004386074
20160810 14:01:26 4 37917 8074631 14623146 .001670516
20160810 14:01:26 5 38083 8336128 15713424 .001583522
20160810 14:01:26 6 1877 2385555 713323 .000605703

6 rows selected.

This upgrade included a migration to brand new hardware so I was suspicious that that was the case, but having so many bad cables would have been even more suspicious. A reporting database that didn’t include new hardware exhibited the same behavior so, bug it is.

Now this being said, on a completely different databases, one of the instances on an Active Data Guard standby was reporting 30K lost blocks a minute. So, something else was going on. This server was showing many packet reassembles failures:

> cat packets.sh
while true;
do
  x=`netstat -s | grep "packet reassembles failed"|awk '{print $1}'`
  echo `date` ": " `expr $x - $t`;
  t=$x
  sleep 5;
done;

...
Wed Aug 17 16:08:01 CDT 2016 :  265
Wed Aug 17 16:08:06 CDT 2016 :  138
Wed Aug 17 16:08:11 CDT 2016 :  211
Wed Aug 17 16:08:16 CDT 2016 :  1105
Wed Aug 17 16:08:21 CDT 2016 :  1649
Wed Aug 17 16:08:26 CDT 2016 :  1449
Wed Aug 17 16:08:31 CDT 2016 :  1896
Wed Aug 17 16:08:36 CDT 2016 :  2175
Wed Aug 17 16:08:41 CDT 2016 :  3152
Wed Aug 17 16:08:46 CDT 2016 :  3205
Wed Aug 17 16:08:51 CDT 2016 :  2801
Wed Aug 17 16:08:56 CDT 2016 :  2675
Wed Aug 17 16:09:01 CDT 2016 :  1328
Wed Aug 17 16:09:06 CDT 2016 :  451
Wed Aug 17 16:09:11 CDT 2016 :  89
...

This issue was resolved by increasing the buffer size from 256 to 1M. Still need to figure out why there was such an increase in traffic for this to even manifest. But, stable for now.