Monday, 7 April 2014

Manual Recovery Mechanisms in Oracle SOA Suite 11g – BPEL Process Manager

Hi Folks,

My intention with this post is to provide a quick reference for manual recovery of faults within SOA.
It aims to present some of the valuable information about manual recovery in one place.

Integration flows can fail at run time with a variety of errors. The cause of these failures could be either business errors or system errors. When synchronous integration flows fail, they are restarted from the beginning. Asynchronous integration flows, on the other hand, can potentially be resubmitted or recovered from designated, pre-configured milestones within the flow when they fail. These milestones could be persistence points such as queues, topics, or database tables, where the state of the flow was last persisted. Recovery is a mechanism whereby a faulted asynchronous flow can be rerun from such a persistence milestone.

BPEL Message Recovery:
To understand BPEL message recovery, let us briefly look at how the BPEL Service Engine performs asynchronous processing. Asynchronous BPEL processes use an intermediate Delivery Store in the SOA Infrastructure database to store the incoming request. The message is then picked up, and further BPEL processing happens in an Invoke Thread. The Invoke Thread is one of the free threads from the ‘Invoke Thread Pool’ configured for the BPEL Service Engine. The processing of the message from the Delivery Store until the next dehydration point in the BPEL process, or the next commit point in the flow, constitutes a transaction. The figure below shows, at a high level, asynchronous request handling by the BPEL Invoke Thread. Any unhandled error during this processing causes the message to roll back to the Delivery Store.

During Recovery of these messages, the end user cannot make any modifications to the original payload. The messages marked recoverable can either be recovered or aborted. In the former case, the original message is simply redelivered for processing again.

The BPEL configuration property ‘MaxRecoverAttempt’ determines the number of times a message can be recovered, manually or automatically. Messages move to the exhausted state after reaching MaxRecoverAttempt. They can then be selected and ‘Reset’ to make them available for manual or automatic recovery again.
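
To make the attempt counter and the exhausted state concrete, here is a minimal Python sketch. It is only a model of the behavior described above, not anything from the BPEL engine: the class, its fields, and the assumption that ‘Reset’ clears the counter are all hypothetical, and the limit of 2 is simply the default quoted later in this post.

```python
from dataclasses import dataclass

MAX_RECOVER_ATTEMPT = 2  # default value quoted later in this post


@dataclass
class RecoverableMessage:
    """Hypothetical model of one message in the recovery queue."""
    message_id: str
    recover_count: int = 0
    state: str = "UNDELIVERED"

    def record_failed_recovery(self) -> None:
        """Count one failed recovery attempt (manual or automatic)."""
        if self.state == "EXHAUSTED":
            raise ValueError("message is exhausted; reset it first")
        self.recover_count += 1
        if self.recover_count >= MAX_RECOVER_ATTEMPT:
            self.state = "EXHAUSTED"

    def reset(self) -> None:
        """Mimic 'Reset' (assumption: it clears the attempt counter)."""
        self.recover_count = 0
        self.state = "UNDELIVERED"


msg = RecoverableMessage("msg-1")
msg.record_failed_recovery()
msg.record_failed_recovery()
print(msg.state)  # prints EXHAUSTED
msg.reset()
print(msg.state)  # prints UNDELIVERED
```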

A Fault Policy with configurable Actions can be bound to a SOA component. Policies can be attached at the Composite, Component or Reference level. The configured Actions are executed when the invocation fails. The available Actions include retry, abort, human intervention, custom Java callout, etc. When the Action applied is human intervention, the faults become available for Manual Recovery from the Oracle Enterprise Manager Fusion Middleware Control [EM FMWC Console]. They show up as recoverable instances in the faults tab of ‘SOA->faults and rejected messages’.

Automatic recovery program for pending BPEL call back messages:

The BPEL engine stores all asynchronous call-back messages in a database table called dlv_message. You can see all such messages in the call-back manual recovery area of the BPEL console. The query used by the BPEL console joins the dlv_message and work_item tables. It simply picks up all call-back messages that are undelivered and have not been modified within a certain threshold time.

Call-back messages are processed in the following steps:
· The BPEL engine assigns the call-back message to the delivery service
· The delivery service saves the message into the dlv_message table with state 'UNDELIVERED' (0)
· The delivery service schedules a dispatcher thread to process the message asynchronously
· The dispatcher thread enqueues the message into a JMS queue
· The message is picked up by an MDB
· The MDB delivers the message to the actual BPEL process waiting for the call-back and changes the state to 'HANDLED' (2)

Given the above steps, it is always possible that a message is present in the dlv_message table but the MDB fails to deliver it to the BPEL process, which leaves the message stuck in state 0.
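
The selection rule described above (undelivered, and not modified within a threshold) can be sketched in a few lines of Python. This is an illustration only: the rows live in memory, the `CallbackRow` fields are hypothetical stand-ins rather than the real dlv_message columns, and the 10-minute threshold is just an example value.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

UNDELIVERED = 0  # state written when the delivery service saves the message
HANDLED = 2      # state written once the MDB has delivered the message


@dataclass
class CallbackRow:
    """Hypothetical in-memory stand-in for one dlv_message row."""
    message_id: str
    state: int
    last_modified: datetime


def find_stuck_callbacks(rows, now, threshold):
    """Return undelivered rows untouched for longer than `threshold`."""
    cutoff = now - threshold
    return [r for r in rows if r.state == UNDELIVERED and r.last_modified < cutoff]


now = datetime(2014, 4, 7, 12, 0)
rows = [
    CallbackRow("a", UNDELIVERED, now - timedelta(minutes=30)),
    CallbackRow("b", HANDLED, now - timedelta(minutes=30)),
    CallbackRow("c", UNDELIVERED, now - timedelta(minutes=1)),
]
stuck = find_stuck_callbacks(rows, now, timedelta(minutes=10))
print([r.message_id for r in stuck])  # prints ['a']
```

Message "b" is skipped because it was already handled, and "c" because it was modified too recently to count as stuck.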

Recovering instances from the recovery queue:

Instances in the recovery queue can be recovered manually to continue processing.
Below are some of the reasons instances go to manual recovery.

1. There are not enough threads or memory to process the message.
2. The server shuts down or crashes before it finishes processing the BPEL message.
3. The engine could not finish processing the message before reaching the time-out dictated by the transaction-timeout configuration.

BPEL Process Manager has a nice UI for looking at and managing these, but what if we need to be alerted when a process goes into one of these states? BPEL PM does not have that capability. We can either write custom code for that or manually go and reinitiate the recoverable instances in EM.
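
As a rough idea of what such custom code might look like, here is a small Python sketch that turns per-type counts of pending messages into an alert line. How the counts are obtained (a database query, an MBean call, etc.) is deliberately left out, and the function name and message format are my own assumptions; only the three type names come from the recovery console.

```python
def build_recovery_alert(pending_by_type):
    """Return an alert string, or None when nothing needs recovery."""
    pending = {t: n for t, n in pending_by_type.items() if n > 0}
    if not pending:
        return None
    total = sum(pending.values())
    details = ", ".join(f"{t}: {n}" for t, n in sorted(pending.items()))
    return f"{total} BPEL message(s) pending recovery ({details})"


# Hypothetical counts; in practice they would come from your own polling job.
print(build_recovery_alert({"Invoke": 3, "Activity": 0, "Callback": 1}))
# prints: 4 BPEL message(s) pending recovery (Callback: 1, Invoke: 3)
```

The returned string could then be handed to whatever channel you already use for alerts (mail, monitoring tool, and so on).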

Recovering the BPEL instances:
1. Log in to the EM console.
2. Right-click soa-infra and click Service Engine --> BPEL.
3. Click the Recovery tab.
4. Change the Type as needed (Invoke, Activity, Callback), set the Message State to “Undelivered” and click Search.
5. All the recoverable messages that match the criteria will be displayed.
6. Select the required messages and click the Recovery button.

Auto Recovery feature in BPEL

‘Auto Recovery’ configuration is done by setting a few MBean properties in the EM console.
To configure it, navigate to soa-infra -> SOA Administration -> BPEL Properties -> More BPEL Configuration Properties -> RecoveryConfig.

This brings up a screen showing the default parameters. BPEL auto recovery is enabled by default. The properties startWindowTime and stopWindowTime specify the period during which auto recovery is active. By default, the auto recovery feature is active from 12 AM to 4 AM every day (remember that this is SOA server time). We can change these settings by updating the time values in 24-hour format and clicking Apply.
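
To illustrate how a start/stop window like this can be evaluated, here is a small Python sketch. It is not the engine's code: the function name is hypothetical, and treating the stop time as exclusive and allowing a window that wraps past midnight are my own assumptions.

```python
from datetime import time


def in_recovery_window(now, start=time(0, 0), stop=time(4, 0)):
    """Return True if server time `now` falls inside the recovery window."""
    if start <= stop:
        # Same-day window, e.g. 00:00 to 04:00.
        return start <= now < stop
    # Window that wraps past midnight, e.g. 22:00 to 02:00.
    return now >= start or now < stop


print(in_recovery_window(time(2, 30)))  # prints True
print(in_recovery_window(time(5, 0)))   # prints False
print(in_recovery_window(time(23, 0), start=time(22, 0), stop=time(2, 0)))  # prints True
```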

The property maxMessageRaiseSize specifies the number of messages to be sent in each recovery attempt; in effect, it is the batch size.
The property subsequentTriggerDelay specifies the interval between consecutive auto recovery attempts; the default value is 300 seconds.
The property threshHoldTimeInMinutes is used by the BPEL engine to mark an instance as eligible for auto recovery once a recoverable fault occurs; the default is 10 minutes.
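
Here is a minimal Python sketch of how the eligibility threshold and the batch size could interact in a single recovery run. It is an illustration under assumptions, not the engine's algorithm: the function and its arguments are hypothetical, the oldest-first ordering is my own choice, and the batch size of 50 is only an example value.

```python
from datetime import datetime, timedelta


def select_batch(fault_times, now, threshold_minutes=10, max_message_raise_size=50):
    """Pick the messages one recovery run would submit.

    fault_times maps a message id to the time its recoverable fault occurred.
    """
    if max_message_raise_size <= 0:
        return []  # a batch size of 0 means auto recovery is disabled
    cutoff = now - timedelta(minutes=threshold_minutes)
    oldest_first = sorted(fault_times.items(), key=lambda kv: kv[1])
    eligible = [mid for mid, faulted_at in oldest_first if faulted_at <= cutoff]
    return eligible[:max_message_raise_size]


now = datetime(2014, 4, 7, 12, 0)
faults = {
    "m1": datetime(2014, 4, 7, 11, 0),
    "m2": datetime(2014, 4, 7, 11, 40),
    "m3": datetime(2014, 4, 7, 11, 55),  # too recent to be eligible yet
}
print(select_batch(faults, now))                            # prints ['m1', 'm2']
print(select_batch(faults, now, max_message_raise_size=1))  # prints ['m1']
```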

Note that none of these properties controls the number of recovery attempts to be made; that is a separate MBean property. To set it, navigate to soa-infra -> SOA Administration -> BPEL Properties -> More BPEL Configuration Properties -> MaxRecoverAttempt. The default value is 2.

To disable ‘Auto Recovery’, set the maxMessageRaiseSize property value to 0.

Auto Recovery Behavior:
Whenever a recoverable fault occurs during BPEL processing, it becomes visible in the Recovery console. (The term is fairly abstract; I verified this behavior with Remote, Binding and User Defined Faults.) If Auto Recovery is enabled, the BPEL runtime tries to auto recover the instance after threshHoldTimeInMinutes. If that is not successful, further recovery attempts are made, up to the number given by MaxRecoverAttempt, at the interval given by subsequentTriggerDelay. If the instance still fails after the maximum number of recovery attempts, it is marked as exhausted (it can be queried on the recovery console using the message state exhausted). We can use the ‘Reset’ button to make these instances eligible for Auto Recovery again.

Note that we observe this behavior only when the fault is thrown back to the BPEL runtime or the fault is not caught within the BPEL process.
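
As a worked example of the timing, the Python sketch below computes when auto recovery attempts would be expected for a single fault. It rests on assumptions of mine: the first attempt counts toward MaxRecoverAttempt, every attempt fails, and the recovery window is ignored. The function name is hypothetical; the default values are the ones quoted in this post.

```python
from datetime import datetime, timedelta


def planned_attempts(fault_time, threshold_minutes=10,
                     trigger_delay_seconds=300, max_recover_attempt=2):
    """Return the expected times of each auto recovery attempt."""
    first = fault_time + timedelta(minutes=threshold_minutes)
    return [first + timedelta(seconds=trigger_delay_seconds * i)
            for i in range(max_recover_attempt)]


for attempt in planned_attempts(datetime(2014, 4, 7, 1, 0)):
    print(attempt.strftime("%H:%M"))
# prints 01:10, then 01:15; after that the message would be marked exhausted
```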

SOA New Features for BPEL Message Recovery

SOA added an important feature to the Message Recovery subject.

SOA added more proactive alerts for stuck BPEL messages. A small part of this feature was available since SOA but just for the Composite Flow Trace, so you had to know which flow trace might have problems. Now, clicking soa-infra inside EM shows a global alert that there are messages needing recovery:

We also see the same alert when clicking a composite that has messages pending recovery; it disappears when we move to a composite that does not have such messages.
By clicking Show Details, we can see how many messages are pending recovery, grouped by type:

Clicking Go to BPEL Recovery Console redirects you to the BPEL recovery console, where you can recover or cancel the message:

Well, that’s it. This new feature is simple yet very powerful: it helps SOA administrators get alerts when messages need recovery and be more proactive when administering the SOA environment.

Happy Learning...!!!!!!!!!!!! Fun Sharing.........!!!!!!!!!