Troubleshooting and Diagnosing Oracle Database 12.2 and Oracle RAC https://www.linkedin.com/in/raosandesh/ sandeshr

Sandesh Rao, Senior Director, RAC Development Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |

Safe Harbor Statement The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle.

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. | Confidential – Oracle Restricted

2

Common Questions
• How do I contact you?
– LinkedIn – Sandesh Rao
– Email – [email protected]

• Where do I get your presentation?
– http://otnyathra.in/downloads/

• Which books on RAC do I read for basics or internals?
– Oracle Database 11g Oracle Real Application Clusters Handbook, 2nd Edition (Oracle Press)
– Pro Oracle Database 11g RAC on Linux (Expert's Voice in Oracle), 2nd Edition
– Oracle 10g RAC Grid, Services and Clustering, 1st Edition
– Pro Oracle Database 10g RAC on Linux: Installation, Administration, and Performance (Expert's Voice in Oracle), 1st Corrected ed., Corr. 3rd printing
– Oracle Database 12c Release 2 Oracle Real Application Clusters Handbook: Concepts, Administration, Tuning & Troubleshooting (Oracle Press), 1st Edition
– Documentation – Autonomous Computing Guide, RAC Admin guide

3

Agenda • Architectural Overview • Troubleshooting Scenarios • Proactive and Reactive tools • Q&A

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |

Grid Infrastructure Overview • Grid Infrastructure is the name for the combination of – Oracle Cluster Ready Services (CRS) – Oracle Automatic Storage Management (ASM)

• The Grid Home contains the software for both products • CRS can also be installed standalone for ASM and/or Oracle Restart • CRS can run by itself or in combination with other vendor clusterware • The Grid Home and the RDBMS home must be installed in different locations – The installer locks the Grid Home path by setting root permissions.

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |

Grid Infrastructure Overview • CRS requires shared Oracle Cluster Registry (OCR) and Voting files – Must be in ASM or a CFS – The OCR is backed up automatically every 4 hours to GRID_HOME/cdata – Backups are kept for 4, 8 and 12 hours, 1 day and 1 week – The OCR is restored with ocrconfig – The Voting file is backed up into the OCR at each change – The Voting file is restored with crsctl
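For illustration, a hedged sketch of the corresponding commands (backup paths and disk group names are placeholders):
ocrconfig -showbackup                 (list the automatic OCR backups)
ocrconfig -restore <backup_file>      (restore the OCR; run as root with the stack down)
crsctl query css votedisk             (list the current voting files)
crsctl replace votedisk +DATA         (re-create the voting files in an ASM disk group)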

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |

Grid Infrastructure Overview
• For the network, CRS requires
– One or more high-speed, low-latency, redundant private networks for inter-node communication
– Think of the interconnect as a memory backplane for the cluster
– Should be a separate physical network or a managed converged network
– VLANs are supported
– Used for:
• Clusterware messaging
• RDBMS messaging and block transfer
• ASM messaging
• HANFS for block traffic

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |

Grid Infrastructure Overview • Only one set of Clusterware daemons can run on each node • The CRS stack is spawned from Oracle HA Services Daemon (ohasd) • On Unix ohasd runs out of inittab with respawn • A node can be evicted when deemed unhealthy – May require reboot but at least CRS stack restart (rebootless restart) – IPMI integration or diskmon in case of Exadata

• CRS provides Cluster Time Synchronization services – Always runs, but in observer mode if ntpd is configured
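To confirm which mode CTSS is running in, the following check can be used (the exact message text varies by version):
crsctl check ctss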

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |

Grid Infrastructure Processes Agents change everything • Multi-threaded daemons • Manage multiple resources and types • Implement entry points for multiple resource types – start, stop, check, clean, fail

• oraagent, orarootagent, application agent, script agent, cssdagent • Single process started from init on Unix (ohasd) • Diagram below shows all core resources

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |

Grid Infrastructure Processes
(Diagram: the Clusterware stack startup levels – Level 0, Level 1, Level 2a, Level 2b, Level 3, Level 4a and Level 4b – described on the following slides.)

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |

Grid Infrastructure Processes Init Scripts

• /etc/init.d/ohasd ( location O/S dependent ) – RC script with “start” and “stop” actions – Initiates Oracle Clusterware autostart – Control file coordinates with CRSCTL

• /etc/init.d/init.ohasd ( location O/S dependent ) – OHASD Framework Script runs from init/upstart – Control file coordinates with CRSCTL – Named pipe syncs with OHASD

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |

Grid Infrastructure Processes • Level 1: OHASD Spawns: – cssdagent - Agent responsible for spawning CSSD – orarootagent - Agent responsible for managing all root owned ohasd resources – oraagent - Agent responsible for managing all oracle owned ohasd resources – cssdmonitor - Monitors CSSD and node health (along with the cssdagent)

• Level 2a: OHASD rootagent spawns: – CRSD - Primary daemon responsible for managing cluster resources. – CTSSD - Cluster Time Synchronization Services Daemon – Diskmon ( Exadata ) – ACFS (ASM Cluster File System) Drivers

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |

Grid Infrastructure Processes • Level 2b: OHASD oraagent spawns: – MDNSD – Multicast DNS daemon – GIPCD – Grid IPC Daemon – GPNPD – Grid Plug and Play Daemon – EVMD – Event Monitor Daemon – ASM – ASM instance started here as it may be required by CRSD

• Level 3: CRSD spawns: – orarootagent - Agent responsible for managing all root owned crsd resources. – oraagent - Agent responsible for managing all nonroot owned crsd resources. • One is spawned for every user that has CRS resources to manage.

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |

Grid Infrastructure Processes Startup Sequence

• Level 4: CRSD oraagent spawns: – ASM Resource - ASM Instance(s) resource (proxy resource) – Diskgroup - Used for managing/monitoring ASM diskgroups. – DB Resource - Used for monitoring and managing the DB and instances – SCAN Listener - Listener for single client access name, listening on SCAN VIP – Listener - Node listener listening on the Node VIP – Services - Used for monitoring and managing services – ONS - Oracle Notification Service – eONS - Enhanced Oracle Notification Service ( pre 11.2.0.2 ) – GSD - For 9i backward compatibility – GNS (optional) - Grid Naming Service - Performs name resolution

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |

Oracle Flex Cluster

The standard going forward (every Oracle 12c Rel. 2 cluster is a Flex Cluster by default.)

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |

15

Under the Hood: Any New Install Ends Up in a Flex Cluster

[GRID]> crsctl get cluster name
CRS-6724: Current cluster name is 'SolarCluster'
[GRID]> crsctl get cluster class
CRS-41008: Cluster class is 'Standalone Cluster'
[GRID]> crsctl get cluster type
CRS-6539: The cluster type is 'flex'.

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |

16

Cluster Domain
(Diagram: a Cluster Domain connected by a private network.)
• Database Member Clusters – use local ASM, the ASM Service of the Domain Services Cluster, or its IO & ASM Service
• Application Member Cluster – GI only
• Domain Services Cluster (DSC) – provides shared services on top of shared ASM and SAN/NAS storage:
– Management Repository (GIMR) Service
– Trace File Analyzer (TFA) Service
– Rapid Home Provisioning (RHP) Service
– ASM Service
– IO Service
– Additional optional services

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |

17

ASM Flex Diskgroups 1
Database-oriented storage management for more flexibility and availability
• Pre-12.2 diskgroup organization (diagram): the files of DB1, DB2 and DB3 are interleaved within one diskgroup – shared resource management
• 12.2 Flex Diskgroup organization (diagram): each database's files are grouped into its own File Group inside the Flex Diskgroup – database-oriented resource management

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. | Oracle Confidential – Internal/Restricted/Highly Restricted

18

ASM Flex Diskgroups 2
Database-oriented storage management for more flexibility and availability (12.2 Flex Diskgroup organization)
• Flex Diskgroups enable
– Quota Management – limit the space databases can allocate in a diskgroup and thereby improve the ability to consolidate databases into fewer diskgroups
– Redundancy Change – use lower redundancy for less critical databases
– Shadow Copies ("split mirrors") – easily and dynamically create database clones for test/dev or production databases

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. | Oracle Confidential – Internal/Restricted/Highly Restricted

19

Node Weighting in Oracle RAC 12c Release 2
Idea: everything else being equal, let the majority of work survive
• Node Weighting is a new feature that considers the workload hosted in the cluster during fencing
• The idea is to let the majority of work survive if everything else is equal
– Example: in a 2-node cluster, the node hosting the majority of services (at fencing time) is meant to survive

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |

20

CSS_CRITICAL – Fencing with Manual Override
Node eviction despite the workload; the workload will fail over.

srvctl modify database -help | grep critical
…
-css_critical {YES | NO}  Define whether the database or service is CSS critical

crsctl set server css_critical {YES|NO} + server restart

CSS_CRITICAL can be set on various levels / components to mark them as “critical” so that the cluster will try to preserve them in case of a failure.

CSS_CRITICAL will be honored if no other technical reason prohibits survival of the node which has at least one critical component at the time of failure.

A fallback scheme is applied if CSS_CRITICAL settings do not lead to an actionable outcome.
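For illustration, a hedged sketch of marking components as critical, assuming a database named orcl (the names are placeholders); the server-level setting only takes effect after the stack on that node is restarted (crsctl stop/start crs run as root):
srvctl modify database -db orcl -css_critical YES
crsctl set server css_critical YES
crsctl stop crs
crsctl start crs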

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |

21

Proven Features – Even More Beneficial on the DSC

Autonomous Health Framework (powered by machine learning) works more efficiently for you on the DSC, as continuous analysis is taken off the production cluster.

The DSC is the ideal hosting environment for Rapid Home Provisioning (RHP) enabling software fleet management.

Oracle ASM 12c Rel. 2 based storage consolidation is best performed on the DSC, as it enables numerous additional features and use cases.

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |

22

Node Eviction Basics

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |

Basic RAC Cluster with Oracle Clusterware
(Diagram: nodes running CSSD, connected by a public LAN and a private LAN / interconnect, with each node attached through the SAN network to the shared Voting Disk.)

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |

What does CSSD do? CSSD monitors and evicts nodes
• Monitors nodes using 2 communication channels:
– Private interconnect ↔ network heartbeat
– Voting Disk based communication ↔ disk heartbeat
• Evicts (forcibly removes from the cluster) nodes, depending on heartbeat feedback (failures)

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |

Network Heartbeat Interconnect basics • Each node in the cluster is “pinged” every second • Nodes must respond in css_misscount time (defaults to 30 secs.) – Reducing the css_misscount time is generally not supported

• Network heartbeat failures will lead to node evictions – CSSD-log: [date / time] [CSSD][1111902528]clssnmPollingThread: node mynodename (5) at 75% heartbeat fatal, removal in 6.770 seconds
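The configured threshold can be confirmed with the following command (output format varies by version):
crsctl get css misscount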


Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |

Disk Heartbeat Voting Disk basics – Part 1 • Each node in the cluster "pings" (reads/writes) the Voting Disk(s) every second • Nodes must receive a response within the (long / short) diskTimeout time – I/O errors indicate clear accessibility problems → the timeout is irrelevant

• Disk heartbeat failures will lead to node evictions – CSSD-log: … [CSSD] [1115699552] >TRACE: clssnmReadDskHeartbeat: node(2) is down. rcfg(1) wrtcnt(1) LATS(63436584) Disk lastSeqNo(1)
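Similarly, the configured disk timeout can be checked with:
crsctl get css disktimeout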


Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |

Voting Disk Structure Voting Disk basics – Part 2 • Voting Disks contain dynamic and static data: – Dynamic data: disk heartbeat logging – Static data: information about the nodes in the cluster

• With 11.2.0.1 Voting Disks got an "identity", e.g. a Voting Disk serial number:
[GRID]> crsctl query css votedisk
1. 2 1212f9d6e85c4ff7bf80cc9e3f533cc1 (/dev/sdd5) [DATA]
• Voting Disks must therefore not be copied using "dd" or "cp" anymore
(Diagram: voting disk layout with areas for node information and disk heartbeat logging.)

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |

"Simple Majority Rule" Voting Disk basics – Part 3
• Oracle supports redundant Voting Disks for disk failure protection
• The "Simple Majority Rule" applies:
– Each node must "see" the simple majority of the configured Voting Disks at all times in order not to be evicted (to remain in the cluster)
– i.e. at least trunc(n/2 + 1) voting disks, with n = number of voting disks configured and n >= 1

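For example, with n = 3 configured voting disks a node must be able to access at least trunc(3/2 + 1) = 2 of them at all times, and with n = 5 at least 3; a node that loses access to the majority is evicted.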

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |

Insertion 1: "Simple Majority Rule" in extended Oracle clusters
• The same principles apply – the Voting Disks are just geographically dispersed
• See http://www.oracle.com/goto/rac – "Using standard NFS to support a third voting file for extended cluster configurations" (PDF)

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |

Insertion 2: Voting Disk in Oracle ASM
The way of storing Voting Disks doesn't change their use
[GRID]> crsctl query css votedisk
1. 2 1212f9d6e85c4ff7bf80cc9e3f533cc1 (/dev/sdd5) [DATA]
2. 2 aafab95f9ef84f03bf6e26adc2a3b0e8 (/dev/sde5) [DATA]
3. 2 28dd4128f4a74f73bf8653dabd88c737 (/dev/sdd6) [DATA]
Located 3 voting disk(s).

• Oracle ASM auto creates 1/3/5 Voting Files – Based on Ext/Normal/High redundancy and on Failure Groups in the Disk Group – Per default there is one failure group per disk – ASM will enforce the required number of disks – New failure group type: Quorum Failgroup

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |

Why are nodes evicted? → To prevent worse things from happening… • Evicting (fencing) nodes is a preventive measure (a good thing)! • Nodes are evicted to prevent consequences of a split brain: – Shared data must not be written by independently operating nodes – The easiest way to prevent this is to forcibly remove a node from the cluster


Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |

How are nodes evicted? EXAMPLE: Heartbeat failure • The network heartbeat between nodes has failed – It is determined which nodes can still talk to each other – A "kill request" is sent to the node(s) to be evicted • Using all (remaining) communication channels → the Voting Disk(s) • A node is requested to "kill itself"; executor: typically CSSD


Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |

Re-bootless Node Fencing (restart)

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |

Re-bootless Node Fencing (restart) Fence the cluster, do not reboot the node • Until Oracle Clusterware 11.2.0.2, fencing meant “re-boot” • With Oracle Clusterware 11.2.0.2, re-boots will be seen less, because:

– Re-boots affect applications that might run on a node but are not protected – Customer requirement: prevent a reboot, just stop the cluster – implemented...


Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |

Re-bootless Node Fencing (restart) How it works • With Oracle Clusterware 11.2.0.2, re-boots will be seen less:

– Instead of fast re-booting the node, a graceful shutdown of the stack is attempted • Then IO issuing processes are killed; it is made sure that no IO process remains

– For a RAC DB mainly the log writer and the database writer are of concern


Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |

Re-bootless Node Fencing (restart) EXCEPTIONS
• With Oracle Clusterware 11.2.0.2, re-boots will be seen less, unless:
– IF the check for a successful kill of the IO processes fails → reboot
– IF CSSD gets killed during the operation → reboot
– IF cssdmonitor is not scheduled → reboot
– IF the stack cannot be shut down in "short_disk_timeout" seconds → reboot


Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |

Troubleshooting Scenarios Cluster Startup Problem Triage (11.2+) – Startup Sequence
(Diagram: cluster startup diagnostic flow.)
• First check whether the stack is starting at all:
ps -ef | grep init.ohasd
ps -ef | grep ohasd.bin
– If these are not running, check crsctl config crs (the autostart setting) and ohasd.log; if the cause is obvious, engage the sysadmin team, otherwise gather a TFA Collector collection and engage Oracle Support
• Then check whether the remaining daemons are running:
ps -ef | grep cssdagent
ps -ef | grep ocssd.bin
ps -ef | grep orarootagent
ps -ef | grep ctssd.bin
ps -ef | grep crsd.bin
ps -ef | grep cssdmonitor
ps -ef | grep oraagent
ps -ef | grep ora.asm
ps -ef | grep gpnpd.bin
ps -ef | grep mdnsd.bin
ps -ef | grep evmd.bin
crsctl check crs
crsctl check cluster
– If they are not, review ohasd.log, the agent logs and the process logs, check the OLR permissions and compare against a reference system; if the cause is obvious, engage the sysadmin team, otherwise gather a TFA Collector collection and engage Oracle Support

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |


Troubleshooting Scenarios Cluster Startup Problem Triage

• Multicast Domain Name Service Daemon (mDNS(d)) – Used by Grid Plug and Play to locate profiles in the cluster, as well as by GNS to perform name resolution. The mDNS process is a background process on Linux and UNIX and on Windows. – Uses multicast for cache updates on service advertisement arrival/departure. – Advertises/serves on all found node interfaces. – Log is GI_HOME/log/<node>/mdnsd/mdnsd.log

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |


Troubleshooting Scenarios Cluster Startup Problem Triage
• cssd agent and monitor
– Same functionality in both agent and monitor
– Functionality of several pre-11.2 daemons consolidated in both
• OPROCD – system hang
• OMON – oracle clusterware monitor
• VMON – vendor clusterware monitor
– Run realtime with locked down memory, like CSSD
– Provide enhanced stability and diagnosability
– Logs are
• GI_HOME/log/<node>/agent/oracssdagent_root/oracssdagent_root.log
• GI_HOME/log/<node>/agent/oracssdmonitor_root/oracssdmonitor_root.log
• 12c – ORACLE_BASE/diag/node/agent/..

Troubleshooting Scenarios Node Evictions
(Diagram: node eviction diagnostic flow.)
• Start from the cluster alert log and ocssd.log and gather a TFA Collector collection
• Determine the eviction scenario: missing network heartbeat (NHB), missing disk heartbeat (DHB), or a fenced / resource-starved node (check free memory, CPU load and node response in the system log)
• MOS notes referenced in the flow: 1531223.1, 1328466.1, 1050693.1, 1534949.1, 1546004.1, 1549428.1, 1466639.1
• If the cause is obvious, engage the appropriate team (networking, storage or sysadmin); if not resolved, engage Oracle Support with the collection

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |


Missing Network Heartbeat (1) • ocssd.log from node 1 • ===> sending network heartbeats other nodes. Normally, this message is output once every 5 messages (seconds) • 2016-08-13 17:00:20.023: [ CSSD][4096109472]clssnmSendingThread: sending status msg to all nodes • 2016-08-13 17:00:20.023: [ CSSD][4096109472]clssnmSendingThread: sent 5 status msgs to all nodes • ===> The network heartbeat is not received from node 2 (drrac2) for 15 consecutive seconds. • ===> This means that 15 network heartbeats are missing and is the first warning (50% threshold). • 2016-08-13 17:00:22.818: [ CSSD][4106599328]clssnmPollingThread: node drrac2 (2) at 50% heartbeat fatal, removal in 14.520 seconds • 2016-08-13 17:00:22.818: [ CSSD][4106599328]clssnmPollingThread: node drrac2 (2) is impending reconfig, flag 132108, misstime 15480 • ===> continuing to send the network heartbeats and log messages once every 5 messages • 2016-08-13 17:00:25.023: [ CSSD][4096109472]clssnmSendingThread: sending status msg to all nodes • 2016-08-13 17:00:25.023: [ CSSD][4096109472]clssnmSendingThread: sent 5 status msgs to all nodes • ===> 75% threshold of missing network heartbeat is reached. This is second warning. • 2016-08-13 17:00:29.833: [ CSSD][4106599328]clssnmPollingThread: node drrac2 (2) at 75% heartbeat fatal, removal in 7.500 seconds Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |

Missing Network Heartbeat (2)

• ===> continuing to send the network heartbeats and log messages once every 5 messages • 2016-08-13 17:00:30.023: [ CSSD][4096109472]clssnmSendingThread: sending status msg to all nodes • 2016-08-13 17:00:30.023: [ CSSD][4096109472]clssnmSendingThread: sent 5 status msgs to all nodes • ===> continuing to send the network heartbeats, but the message is logged after 4 messages • 2016-08-13 17:00:34.021: [ CSSD][4096109472]clssnmSendingThread: sending status msg to all nodes • 2016-08-13 17:00:34.021: [ CSSD][4096109472]clssnmSendingThread: sent 4 status msgs to all nodes • ===> Last warning shows that 90% threshold of the missing network heartbeat is reached. • ===> The eviction will occur in 2.49 seconds. • 2016-08-13 17:00:34.841: [ CSSD][4106599328]clssnmPollingThread: node drrac2 (2) at 90% heartbeat fatal, removal in 2.490 seconds, seedhbimpd 1 • ===> Eviction of node 2 (drrac2) started • 2016-08-13 17:00:37.337: [ CSSD][4106599328]clssnmPollingThread: Removal started for node drrac2 (2), flags 0x2040c, state 3, wt4c 0 • ===> This shows that the node 2 is actively updating the voting disks • 2016-08-13 17:00:37.340: [ CSSD][4085619616]clssnmCheckSplit: Node 2, drrac2, is alive, DHB (1281744040, 1396854) more than disk timeout of 27000 after the last NHB (1281744011, 1367154) Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |

Missing Network Heartbeat (3) • ===> Evicting node 2 (drrac2) • 2016-08-13 17:00:37.340: [ CSSD][4085619616](:CSSNM00007:)clssnmrEvict: Evicting node 2, drrac2, from the cluster in incarnation 169934272, node birth incarnation 169934271, death incarnation 169934272, stateflags 0x24000 • ===> Reconfigured the cluster without node 2 • 2016-08-13 17:01:07.705: [ CSSD][4043389856]clssgmCMReconfig: reconfiguration successful, incarnation 169934272 with 1 nodes, local node number 1, master node number 1

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |

Missing Network Heartbeat (4) • ocssd.log from node 2: • ===> Logging the message to indicate 5 network heartbeats are sent to other nodes • 2016-08-13 17:00:26.009: [ CSSD][4062550944]clssnmSendingThread: sending status msg to all nodes • 2016-08-13 17:00:26.009: [ CSSD][4062550944]clssnmSendingThread: sent 5 status msgs to all nodes • ===> First warning of reaching 50% threshold of missing network heartbeats • 2016-08-13 17:00:26.213: [ CSSD][4073040800]clssnmPollingThread: node drrac1 (1) at 50% heartbeat fatal, removal in 14.540 seconds • 2016-08-13 17:00:26.213: [ CSSD][4073040800]clssnmPollingThread: node drrac1 (1) is impending reconfig, flag 394254, misstime 15460 • ===> Logging the message to indicate 5 network heartbeats are sent to other nodes • 2016-08-13 17:00:31.009: [ CSSD][4062550944]clssnmSendingThread: sending status msg to all nodes • 2016-08-13 17:00:31.009: [ CSSD][4062550944]clssnmSendingThread: sent 5 status msgs to all nodes • ===> Second warning of reaching 75% threshold of missing network heartbeats • 2016-08-13 17:00:33.227: [ CSSD][4073040800]clssnmPollingThread: node drrac1 (1) at 75% heartbeat fatal, removal in 7.470 seconds Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |

Missing Network Heartbeat (5) • ===> Logging the message to indicate 4 network heartbeats are sent • 2016-08-13 17:00:35.009: [ CSSD][4062550944]clssnmSendingThread: sending status msg to all nodes • 2016-08-13 17:00:35.009: [ CSSD][4062550944]clssnmSendingThread: sent 4 status msgs to all nodes • ===> Third warning of reaching 90% threshold of missing network heartbeats • 2016-08-13 17:00:38.236: [ CSSD][4073040800]clssnmPollingThread: node drrac1 (1) at 90% heartbeat fatal, removal in 2.460 seconds, seedhbimpd 1 • ===> Logging the message to indicate 5 network heartbeats are sent to other nodes • 2016-08-13 17:00:40.008: [ CSSD][4062550944]clssnmSendingThread: sending status msg to all nodes • 2016-08-13 17:00:40.009: [ CSSD][4062550944]clssnmSendingThread: sent 5 status msgs to all nodes • ===> Eviction started for node 1 (drrac1) • 2016-08-13 17:00:40.702: [ CSSD][4073040800]clssnmPollingThread: Removal started for node drrac1 (1), flags 0x6040e, state 3, wt4c 0 • ===> Node 1 is actively updating the voting disk, so this is a split brain condition • 2016-08-13 17:00:40.706: [ CSSD][4052061088]clssnmCheckSplit: Node 1, drrac1, is alive, DHB (1281744036, 1243744) more than disk timeout of 27000 after the last NHB (1281744007, 1214144) • 2016-08-13 17:00:40.706: [ CSSD][4052061088]clssnmCheckDskInfo: My cohort: 2 • 2016-08-13 17:00:40.707: [ CSSD][4052061088]clssnmCheckDskInfo: Surviving cohort: 1 Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |

Missing Network Heartbeat (6) • ===> Node 2 is aborting itself to resolve the split brain and ensure the cluster integrity • 2016-08-13 17:00:40.707: [ CSSD][4052061088](:CSSNM00008:)clssnmCheckDskInfo: Aborting local node to avoid splitbrain. Cohort of 1 nodes with leader 2, drrac2, is smaller than cohort of 1 nodes led by node 1, drrac1, based on map type 2 • 2016-08-13 17:00:40.707: [ CSSD][4052061088]################################### • 2016-08-13 17:00:40.707: [ CSSD][4052061088]clssscExit: CSSD aborting from thread clssnmRcfgMgrThread • 2016-08-13 17:00:40.707: [ CSSD][4052061088]###################################

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |

Missing Network Heartbeat (7)
• Observations
1. Both nodes reported missing heartbeats at the same time
2. Both nodes sent heartbeats to the other nodes the whole time
3. Node 2 aborted itself to resolve the split brain
• Conclusion
1. This is likely a network problem; engage the network team
2. Check OSWatcher output (netstat and traceroute)
– Configure the private.net file; it is not configured by default
3. Check CHM
4. Check the system log

Voting Disk Access Problem (1) ocssd.log: ===> The first error indicating that it could not read voting disk -- first message to indicate a problem accessing the voting disk 2016-08-13 18:31:19.787: [ SKGFD][4131736480]ERROR: -9(Error 27072, OS Error (Linux Error: 5: Input/output error Additional information: 4 Additional information: 721425 Additional information: -1) ) 2016-08-13 18:31:19.787: [ CSSD][4131736480](:CSSNM00060:)clssnmvReadBlocks: read failed at offset 529 of /dev/sdb8 2016-08-13 18:31:19.802: [ CSSD][4131736480]clssnmvDiskAvailabilityChange: voting file /dev/sdb8 now offline Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |

Voting Disk Access Problem (2) ====> The error message that shows a problem accessing the voting disk repeats once every 4 seconds 2016-08-13 18:31:23.782: [ CSSD][150477728]clssnmvDiskOpen: Opening /dev/sdb8 2016-08-13 18:31:23.782: [ SKGFD][150477728]Handle 0xf43fc6c8 from lib :UFS:: for disk :/dev/sdb8: 2016-08-13 18:31:23.782: [ CLSF][150477728]Opened hdl:0xf4365708 for dev:/dev/sdb8: 2016-08-13 18:31:23.787: [ SKGFD][150477728]ERROR: -9(Error 27072, OS Error (Linux Error: 5: Input/output error Additional information: 4 Additional information: 720913 Additional information: -1) ) 2016-08-13 18:31:23.787: [ CSSD][150477728](:CSSNM00060:)clssnmvReadBlocks: read failed at offset 17 of /dev/sdb8

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |

Voting Disk Access Problem (3) ====> The last error that shows a problem accessing the voting disk. ====> Note that the last message is 200 seconds after the first message ====> because the long disktimeout is 200 seconds 2016-08-13 18:34:37.423: [ CSSD][150477728]clssnmvDiskOpen: Opening /dev/sdb8 2016-08-13 18:34:37.423: [ CLSF][150477728]Opened hdl:0xf4336530 for dev:/dev/sdb8: 2016-08-13 18:34:37.429: [ SKGFD][150477728]ERROR: -9(Error 27072, OS Error (Linux Error: 5: Input/output error Additional information: 4 Additional information: 720913 Additional information: -1) ) 2016-08-13 18:34:37.429: [ CSSD][150477728](:CSSNM00060:)clssnmvReadBlocks: read failed at offset 17 of /dev/sdb8 Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |

Voting Disk Access Problem (4) ====> This message shows that ocssd.bin tried accessing the voting disk for 200 seconds 2016-08-13 18:34:38.205: [ CSSD][4110736288](:CSSNM00058:)clssnmvDiskCheck: No I/O completions for 200880 ms for voting file /dev/sdb8) ====> ocssd.bin aborts itself with an error message that the majority of voting disks are not available. In this case, there was only one voting disk, but if three voting disks were available, as long as two voting disks are accessible, ocssd.bin will not abort. 2016-08-13 18:34:38.206: [ CSSD][4110736288](:CSSNM00018:)clssnmvDiskCheck: Aborting, 0 of 1 configured voting disks available, need 1 2016-08-13 18:34:38.206: [ CSSD][4110736288]################################### 2016-08-13 18:34:38.206: [ CSSD][4110736288]clssscExit: CSSD aborting from thread clssnmvDiskPingMonitorThread 2016-08-13 18:34:38.206: [ CSSD][4110736288]###################################



Conclusion The voting disk was not available, engage storage team Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |

Troubleshooting Scenarios Node Eviction Triage • Time synchronisation issue • Cluster Time Synchronisation Services daemon – Provides time management in a cluster for Oracle. • Observer mode when Vendor time synchronisation s/w is found – Logs time difference to the CRS alert log • Active mode when no Vendor time sync s/w is found

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |

Troubleshooting Scenarios Node Eviction Triage • Cluster Ready Services Daemon – The CRSD daemon is primarily responsible for maintaining the availability of application resources, such as database instances. CRSD is responsible for starting and stopping these resources, relocating them when required to another node in the event of failure, and maintaining the resource profiles in the OCR (Oracle Cluster Registry). In addition, CRSD is responsible for overseeing the caching of the OCR for faster access, and also backing up the OCR. – Log file is GI_HOME/log/<node>/crsd/crsd.log • Rotation policy 10-50M • Retention policy 10 logs • Dynamic in 12.1 and can be changed

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |

Troubleshooting Scenarios Node Eviction Triage • CRSD oraagent – CRSD’s oraagent manages • all database, instance, service and diskgroup resources • node listeners • SCAN listeners, and ONS – If the Grid Infrastructure owner is different from the RDBMS home owner then you would have 2 oraagents each running as one of the installation owners. The database, and service resources would be managed by the RDBMS home owner and other resources by the Grid Infrastructure home owner. – Log file is • GI_HOME/log/<node>/agent/crsd/oraagent_<user>/oraagent_<user>.log

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |

Troubleshooting Scenarios Node Eviction Triage • CRSD orarootagent – CRSD's rootagent manages • GNS and its VIP • Node VIP • SCAN VIP • network resources. – Log file is • GI_HOME/log/<node>/agent/crsd/orarootagent_root/orarootagent_root.log

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |

Troubleshooting Scenarios Node Eviction Triage
• Agent return codes – the check entry point must return one of the following return codes:
• ONLINE
• UNPLANNED_OFFLINE – target=online; may be recovered or failed over
• PLANNED_OFFLINE
• UNKNOWN – cannot be determined; if previously online or partial, then keep monitoring
• PARTIAL – some of a resource's services are available, e.g. instance up but not open
• FAILED – requires the clean action

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |

Troubleshooting Scenarios Automatic Diagnostic Repository (ADR)
§ Important logs and traces
§ 11.2 – only the databases use ADR
• Grid Infrastructure files are in $GI_HOME/log/<node_name>/
– $GI_HOME/log/myHost/cssd
– $GI_HOME/log/myHost/alertmyHost.log
§ 12c – Grid Infrastructure and Database both use ADR
§ Different locations for Grid Infrastructure and Databases
§ Grid Infrastructure
• alert.log, cssd.log, crsd.log, etc.
§ Databases
• alert.log, background process traces, foreground process traces
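As an illustration of browsing the 12c ADR homes with the adrci utility (the home path below is a placeholder and depends on the installation):
adrci
adrci> show homes
adrci> set home diag/crs/<node_name>/crs
adrci> show alert -tail 50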

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |

Oracle's Database and Clusterware Tools
• What if issues were detected before they had an impact?
• What if you were notified with a specific diagnosis and corrective actions?
• What if resource bottlenecks threatening SLAs were identified early?
• What if bottlenecks could be automatically relieved just in time?
• What if database hangs and node reboots could be eliminated?
The tools: Hang Manager, Trace File Analyzer, Quality of Service Management, Cluster Health Advisor, EXAchk, Memory Guard, Cluster Health Monitor, ORAchk, Cluster Verification Utility

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. | Confidential – Oracle Restricted

60

Maintains Compliance with Best Practices and Alerts Vulnerabilities to Known Issues

Oracle 12c ORAchk & EXAchk Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |

61

Why Oracle ORAchk & EXAchk
• Automatic proactive warning of problems before they impact you
• Get scheduled health reports sent to you in email
• Health checks for the most impactful reoccurring problems
• Runs in your environment with no need to send anything to Oracle
• Findings can be integrated into other tools of choice
• Common framework: EXAchk for Engineered Systems, ORAchk for non-Engineered Systems

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |

62

Oracle Stack Coverage
• Oracle Engineered Systems
– Oracle Database Appliance
– Oracle Exadata Database Machine
– Oracle SuperCluster / MiniCluster
– Oracle Private Cloud Appliance
– Oracle Big Data Appliance
– Oracle Exalogic Elastic Cloud
– Oracle Exalytics In-Memory Machine
– Oracle Zero Data Loss Recovery Appliance
• Oracle ASR
• Oracle Systems – Oracle Solaris, cross stack checks, Solaris Cluster, OVN
• Oracle Database
– Standalone Database
– Grid Infrastructure & RAC
– Maximum Availability Architecture (MAA) Scorecard
– Upgrade Readiness Validation
– Golden Gate
– Oracle Restart
• Oracle Enterprise Manager Cloud Control – Repository, Agent, OMS
• Oracle Middleware – Application Continuity, Oracle Identity and Access Management Suite (Oracle IAM)
• Oracle E-Business Suite – Payables, Workflow, Purchasing, Process Manufacturing, Order Management, Receivables, Fixed Assets, HCM, CRM, Project Billing
• Oracle Siebel – database best practices
• Oracle PeopleSoft – database best practices
• Oracle SAP – Exadata best practices

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |

63

Profiles
• Profiles provide logical grouping of checks which are about similar topics
• Run only the checks in a specific profile: ./exachk -profile <profile>
• Run everything except the checks in a specific profile: ./exachk -excludeprofile <profile>

Profile – Description
asm – ASM checks
avdf – Audit Vault configuration checks
clusterware – Oracle Clusterware checks
control_VM – Checks only for the Control VM (ec1-vm, ovmm, db, pc1, pc2); no cross-node checks
corroborate – Exadata checks that need further review by the user to determine pass or fail
dba – DBA checks
ebs – Oracle E-Business Suite checks
eci_healthchecks – Enterprise Cloud Infrastructure health checks
ecs_healthchecks – Enterprise Cloud System health checks
goldengate – Oracle GoldenGate checks
hardware – Hardware-specific checks for Oracle Engineered Systems
maa – Maximum Availability Architecture checks
ovn – Oracle Virtual Networking checks
platinum – Platinum certification checks
preinstall – Pre-installation checks
prepatch – Checks to execute before patching
security – Security checks
solaris_cluster – Solaris Cluster checks
storage – Oracle Storage Server checks
switch – InfiniBand switch checks
sysadmin – Sysadmin checks
user_defined_checks – Run user-defined checks from user_defined_checks.xml

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |

64

Profiles
• Profiles provide logical grouping of checks which are about similar topics
• Run only the checks in a specific profile: ./orachk -profile <profile>
• Run everything except the checks in a specific profile: ./orachk -excludeprofile <profile>

Profile – Description
asm – ASM checks
bi_middleware – Oracle Business Intelligence checks
clusterware – Oracle Clusterware checks
dba – DBA checks
ebs – Oracle E-Business Suite checks
emagent – Cloud Control agent checks
emoms – Cloud Control management server checks
em – Cloud Control checks
goldengate – Oracle GoldenGate checks
hardware – Hardware-specific checks for Oracle Engineered Systems
oam – Oracle Access Manager checks
oim – Oracle Identity Manager checks
oud – Oracle Unified Directory server checks
ovn – Oracle Virtual Networking checks
peoplesoft – PeopleSoft best practices
preinstall – Pre-installation checks
prepatch – Checks to execute before patching
security – Security checks
siebel – Siebel checks
solaris_cluster – Solaris Cluster checks
storage – Oracle Storage Server checks
switch – InfiniBand switch checks
sysadmin – Sysadmin checks
user_defined_checks – Run user-defined checks from user_defined_checks.xml

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |

65

Keep Track of Changes to the Attributes of Important Files
• Track changes to the attributes of important files with -fileattr
– Looks at all files & directories within the Grid Infrastructure and Database homes by default
– The list of monitored directories and their contents can be configured to your specific requirements
– Use -fileattr start to take the first snapshot: ./orachk -fileattr start
$ ./orachk -fileattr start
CRS stack is running and CRS_HOME is not set. Do you want to set CRS_HOME to /u01/app/11.2.0.4/grid?[y/n][y]
Checking ssh user equivalency settings on all nodes in cluster
Node mysrv22 is configured for ssh user equivalency for oradb user
Node mysrv23 is configured for ssh user equivalency for oradb user
List of directories(recursive) for checking file attributes:
/u01/app/oradb/product/11.2.0/dbhome_11203
/u01/app/oradb/product/11.2.0/dbhome_11204
orachk has taken snapshot of file attributes for above directories at: /orahome/oradb/orachk/orachk_mysrv21_20170504_041214

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |

66

Keep Track of Changes to the Attributes of Important Files
• Compare current attributes against the first snapshot using -fileattr check: ./orachk -fileattr check
$ ./orachk -fileattr check -includedir "/root/myapp/config" -excludediscovery
CRS stack is running and CRS_HOME is not set. Do you want to set CRS_HOME to /u01/app/12.2.0/grid?[y/n][y]
Checking for prompts on myserver18 for oragrid user...
Checking ssh user equivalency settings on all nodes in cluster
Node myserver17 is configured for ssh user equivalency for root user
List of directories(recursive) for checking file attributes:
/root/myapp/config
• Results of the snapshot comparison are also shown in the HTML report output
Checking file attribute changes...
"/root/myapp/config/myappconfig.xml" is different:
Baseline : 0644 oracle root /root/myapp/config/myappconfig.xml
Current  : 0644 root   root /root/myapp/config/myappconfig.xml
…etc
Note:
• Use the same arguments with check that you used with start
• Will proceed to perform standard health checks after attribute checking
• File attribute changes will also show in the HTML report output

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |

67

Improve performance of SQL queries • Many new checks focus on known issues in 12c Optimizer as well as SQL Plan Management

All contained in the dba profile: -profile dba

• These checks target problems such as: – Wrong results returned – High memory & CPU usage – Errors such as ORA-00600 or ORA-07445 – Issues with cursor usage – Other general SQL plan management problems
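For example, these SQL and optimizer related checks can be run on their own using the same profile switch shown earlier:
./orachk -profile dba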

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |

Oracle Database Security Assessment Tool (DBSAT) included • DBSAT analyzes database configurations and security policies • Uncovers security risks • Improves the security posture of Oracle Databases

All results included within report output under the check: Validate database security configuration using database security assessment tool Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |

Upgrade to Database 12.2 with confidence
• New checks to help when upgrading the database to 12.2
• Both pre- and post-upgrade verification to prevent problems related to:
– OS configuration
– Grid Infrastructure & Database patch prerequisites
– Database configuration
– Cluster configuration
• Pre upgrade: -u -o pre
• Post upgrade: -u -o post
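As a hedged illustration (run from the directory where ORAchk is installed, against the environment being upgraded):
./orachk -u -o pre     (before upgrading to 12.2)
./orachk -u -o post    (after the upgrade)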

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |

Oracle Health Checks Collection Manager • New Collection Manager app built on APEX 5 theme • Tabs replaced with drop down menus for easier navigation • ORAchk & EXAchk continue to ship with APEX 4 app too • No more new functionality in the APEX 4 app, all new features will go into the APEX 5 app Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |

71

Enterprise Manager Integration

•Related checks grouped into compliance standards

•View targets checked, violations & average score

•Drill down into compliance standard to see individual check results

•View break down by target

•Check results integrated into EM compliance framework via plugin •View results in native EM compliance dashboards

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |

72

Provision • Use Enterprise Manager provisioning feature and select ORAchk/EXAchk

• Once selected, this launches the provisioning wizard; choose the system type

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |

73

View Results by Compliance Standard • Filter by "Exachk%"

Drill into applicable standard and view individual checks & target status

Click individual checks for recommendation details

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |

74

JSON Output to Integrate with Kibana, Elastic Search etc.
• The JSON provides many tags to allow dashboard filtering based on facts such as:
– Engineered System type
– Engineered System version
– Hardware type
– Node name
– OS version
– Rack identifier
– Rack type
– Database version
– And more...
• Kibana can be used to view health check compliance across your data center
• Results can also be filtered based on any combination of exposed system attributes

75

JSON Output to Integrate with Kibana, Elastic Search etc

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |

76

Speeds Issue Diagnosis, Triage and Resolution

Oracle 12c Trace File Analyzer

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. | Confidential – Oracle Internal

77

Why TFA? Provides one interface for all diagnostic needs

Collects data across the cluster and consolidates it in one place

Collects all relevant diagnostic data at the time of the problem

Reduces time required to obtain diagnostic data, which saves your business money

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. | Confidential – Oracle Internal

78

Supported Platforms and Versions
• All Oracle Database & Grid Infrastructure versions 10.2+ are supported
• All major operating systems are supported
– Linux (OEL, RedHat, SUSE, Itanium & zLinux)
– Oracle Solaris (SPARC & x86-64)
– AIX
– HPUX (Itanium & PA-RISC)
– Windows
• You probably already have TFA installed, as it is included with:
– Oracle Grid Infrastructure 11.2.0.4+, 12.1.0.2+, 12.2.0.1+
– Oracle Database 12.2.0.1+
• Updated quarterly via 1513912.1
• OS versions supported are the same as those supported by the Database
• Java Runtime Edition 1.8 required

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |

79

Linux / Unix Installation
Root / Daemon install:
1. Download from 1513912.1
2. Copy to one required machine and unzip
3. Run ./installTFA
Will:
– Install on all nodes
– Auto discover relevant Oracle software & Exadata storage servers
– Start monitoring for problems & perform auto collections

Non-root / Non-daemon install:
1. Download from 1513912.1
2. Copy to every required machine and unzip
3. Run ./installTFA -extractto <dir> -javahome <jre_home>
Will:
– Only install on the current host
– Not do automatic collections
– Not collect from remote hosts
– Not collect files unreadable by the install user

Recommended install location: /opt/oracle.tfa

80

Architecture
(Diagram: TFA daemons on the initiator node and the remote nodes, coordinating scripts, alerts & log files into a cluster-wide collection.)
• A TFA daemon runs on each cluster node (or on a single instance when no Grid Infrastructure is used)
• Command line communication is via the tfactl command
• The TFA daemons on all nodes coordinate:
– Script execution
– Collection of diagnostics
– Trimming of log contents
• The cluster-wide collection output is consolidated on one node – the initiator node, where the command originated
• The daemon is only used when installed as root

81

Automatic Diagnostic Collections – Oracle Trace File Analyzer
1. A significant problem occurs in Oracle Grid Infrastructure or the Database(s) and TFA automatically detects the event
2. TFA collects & packages the relevant diagnostics
3. The relevant DBA and/or sysadmin is notified by email
4. The collection is uploaded to Oracle Support for further help

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |

82

Command Interfaces
• Command line – specify all command options at the command line: tfactl
• Shell – set and change context, then run commands from within the shell:
tfactl
tfactl > database MyDB
MyDB tfactl > oratop
• Menu – select menu navigation options, then choose the command you want to run: tfactl menu

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |

83

Maintain
• Option 1 – Applying standard PSUs will automatically update TFA
– PSUs do not contain Support Tools Bundle updates
• Option 2 – To update with the latest TFA & Support Tools Bundle:
1. Download the latest version from 1513912.1
2. Repeat the same installation steps
Upgrade to the latest version whenever possible to include bug fixes, new features & optimizations

84

View System & Cluster Summary Choose an option to drill down further

Quick summary of status of key components

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |

85

Summary ASM Drill Down Example – ASM Overview
(Screenshot: ASM cluster-wide summary and status; problems found on myserver69, plus a disk space warning on both servers.)

86

Summary ASM Drill Down Example
(Screenshot: node-wise view drilling into myserver69 – ASM status summary, recent problems detected, and component status.)

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |

87

Investigate Logs & Look for Errors
• Analyze all important recent log entries: tfactl analyze -last 1d
• Search recent log entries: tfactl analyze -search "ora-00600" -last 8h
(Screenshot: searching for "ora-00600".)

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |

88

Perform Analysis Using the Included Tools
• orachk or exachk – Provides health checks for the Oracle stack. Oracle Trace File Analyzer will install either Oracle EXAchk for Engineered Systems (see document 1070954.1 for more details) or Oracle ORAchk for all non-Engineered Systems (see document 1268927.2 for more details)
• oswatcher – Collects and archives OS metrics. These are useful for instance or node evictions & performance issues. See document 301137.1 for more details
• procwatcher – Automates & captures database performance diagnostics and session-level hang information. See document 459694.1 for more details
• oratop – Provides near real-time database monitoring. See document 1500864.1 for more details
• sqlt – Captures SQL trace data useful for tuning. See document 215187.1 for more details
• alertsummary – Provides a summary of events for one or more database or ASM alert files from all nodes
• ls – Lists all files TFA knows about for a given file name pattern across all nodes
• pstack – Generates the process stack for specified processes across all nodes
• grep – Searches alert or trace files with a given database and file name pattern for a search string
• summary – Provides a high level summary of the configuration
• vi – Opens alert or trace files matching a given database and file name pattern in the vi editor
• tail – Runs a tail on alert or trace files for a given database and file name pattern
• param – Shows all database and OS parameters that match a specified pattern
• dbglevel – Sets and unsets multiple CRS trace levels with one command
• history – Shows the shell history for the tfactl shell
• changes – Reports changes in the system setup over a given time period, including database parameters, OS parameters and patches applied
• calog – Reports major events from the Cluster Event log
• events – Reports warnings and errors seen in the logs
• managelogs – Shows disk space usage and purges ADR log and trace files
• ps – Finds processes
• triage – Summarizes oswatcher/exawatcher data

Not all tools are included in the Grid or Database install. Download from 1513912.1 to get the full collection of tools.
Verify which tools you have installed: tfactl toolstatus

89

OS Watcher (Support Tools Bundle) Collect & Archive OS Metrics • Executes standard UNIX utilities (e.g. vmstat, iostat, ps, etc) on regular intervals • Built in Analyzer functionality to summarize, graph and report upon collected metrics • Output is Required for node reboot and performance issues • Simple to install, extremely lightweight • Runs on ALL platforms (Except Windows) • MOS Note: 301137.1 – OS Watcher Users Guide
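A hedged example of starting OS Watcher standalone from its install directory, following the user guide in document 301137.1 (the first argument is the snapshot interval in seconds, the second the hours of archive to retain; both values here are placeholders):
cd oswbb
nohup ./startOSWbb.sh 30 48 &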

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |

90

Procwatcher (Support Tools Bundle) Monitor & Examine Database Processes • Single instance & RAC • Generates session wait, lock and latch reports as well as call stacks from any problem process(s) • Ability to collect stack traces of specific processes using Oracle Tools and OS Debuggers • Typically reduces SR resolution for performance related issues • Runs on ALL major UNIX Platforms • MOS Note: 459694.1 – Procwatcher Install Guide

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |

91

oratop (Support Tools Bundle) Near Real-Time Database Monitoring
• Single instance & RAC
• Monitors current database activities
• Database performance
• Identifies contention and bottlenecks

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |

92

Analyze
• Each tool can be run using tfactl in shell mode
• Start the tfactl shell with: tfactl
• Run a tool with the tool name: tfactl > orachk
1. Where necessary, set context with database: tfactl > database MyDB
2. Then run the tool: MyDB tfactl > oratop
3. Clear context with database: MyDB tfactl > database

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |

93

One Command SRDCs • For certain types of problems Oracle Support will ask you to run a Service Request Data Collection (SRDC) • Previously this would have involved: • Reading many different support documents • Collecting output from many different tasks • Gathering lots of different diagnostics • Packaging & uploading • Now just run: tfactl diagcollect -srdc <srdc_type> Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |

94

Faster & Easier SR Data Collection
tfactl diagcollect -srdc <srdc_type>

Type of Problem – SRDC Types – Collection Scope
• ORA errors (ORA-00600, ORA-00700, ORA-04030, ORA-04031, ORA-07445, ORA-27300, ORA-27301, ORA-27302) – Local only
• Other internal database errors – internalerror – Local only
• Database performance problems – dbperf – Cluster wide
• Database patching problems – dbpatchinstall (new), dbpatchconflict (new) – Local only
• Database install / upgrade problems – dbinstall (new), dbupgrade (new) – Local only
• Enterprise Manager tablespace usage metric problems – emtbsmetrics (new) – Local only (on the EM Agent target)
• Enterprise Manager general metrics page or threshold problems (run all three SRDCs) – emdebugon (new), emdebugoff (new), emmetricalert (new) – Local only (on the EM Agent target, OMS and/or Repository DB)

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |

95

One Command SRDCs – Examples of What's Collected
ORA-4031: tfactl diagcollect -srdc ora4031
1. IPS package
2. Patch listing
3. AWR report
4. Memory information
5. RDA

Database performance: tfactl diagcollect -srdc dbperf
1. ADDM report
2. AWR for good and problem period
3. AWR Compare Period report
4. ASH report for good and problem period
5. OS Watcher
6. IPS package (if errors during problem period)
7. ORAchk (performance related checks)

96

Manual Data Gathering vs One Command SRDC
Manual data gathering:
1. Generate ADDM reviewing Document 1680075.1
2. Identify "good" and "problem" periods and gather AWR reviewing Document 1903158.1
3. Generate AWR compare report (awrddrpt.sql) using "good" and "problem" periods
4. Generate ASH report for "good" and "problem" periods reviewing Document 1903145.1
5. Collect OSWatcher data reviewing Document 301137.1
6. Check alert.log if there are any errors during the "problem" period
7. Find any trace files generated during the "problem" period
8. Collate and upload all the above files/outputs to the SR

TFA SRDC:
1. Run tfactl diagcollect -srdc dbperf
2. Upload the resulting zip file to the SR

Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |

97

One Command SRDC Interactive Mode
tfactl diagcollect -srdc <srdc_type>
1. Enter or default the event date/time and database name
2. TFA scans the system to identify the 10 most recent matching events (ORA-600 example shown)
3. Once the relevant event is chosen, it proceeds with the diagnostic collection
4. All required files are identified
5. Trimmed where applicable
6. Packaged in a zip ready to provide to Support

98

One Command SRDC Silent Mode

tfactl diagcollect -srdc <srdc_type> -database -for
