HAPI-FHIR 5.1.0 introduced support for batch processing using the Spring Batch framework. However since its introduction, we discovered Spring Batch jobs do not recover well from ungraceful server shutdowns, which are increasingly common as implementors move to containerized deployment technologies such as Kubernetes.
HAPI-FHIR 6.0.0 has begun the process of replacing Spring Batch with a custom batch framework, called "batch2". This new "batch2" framework is designed to scale well across multiple processes sharing the same message broker, and most importantly, to robustly recover from unexpected server restarts.
A HAPI-FHIR batch job definition consists of a job name, version, parameter json input type, and a chain of job steps. Each step of the chain declares the json output type it produces, which will be the input type for the following step. The final step in the chain does not declare an output type as the final step will typically do the work of the job, e.g. reindex resources, export data to disk, etc.
After a job has been defined, instances of that job can be submitted for batch processing by populating a JobInstanceStartRequest
with the job name and job parameters json and then submitting that request to the Batch Job Coordinator.
The Batch Job Coordinator will then store two records in the database:
QUEUED
: that is the parent record for all data concerning this jobREADY
: this describes the first "chunk" of work required for this job. The first Batch Work Chunk contains no data.A Scheduled Job runs periodically (once a minute). For each Job Instance in the database, it:
COMPLETE
status). If the job is finished, purges any left over work chunks still in the database.POLL_WAITING
work chunks to READY
if their nextPollTime
has expired.COMPLETE
status). If the job is finished, purges any leftover work chunks still in the database.GATE_WAITING
to READY
. If the the job is being moved to its final reduction step, chunks are moved from GATE_WAITING
to REDUCTION_READY
.REDUCTION_READY
will be consumed at this point.READY
work chunks into the QUEUED
state and publishes a message to the Batch Notification Message Channel to inform worker threads that a work chunk is now ready for processing. ** An exception is for the final reduction step, where work chunks are not published to the Batch Notification Message Channel, but instead processed inline.
HAPI-FHIR Batch Jobs run based on job notification messages of the Batch Notification Message Channel (named batch2-work-notification
).
When a notification message arrives, the handler does the following:
QUEUED
to IN_PROGRESS
QUEUED
to IN_PROGRESS
CANCELLED
and abort processing 1. If the step creates new work chunks, each work chunk will be created in either the GATE_WAITING
state (for gated jobs) or READY
state (for non-gated jobs) and will be handled in the next maintenance job pass.IN_PROGRESS
to COMPLETED
, and the data it contained is deleted.RetryChunkLaterException
, the work chunk status is changed from IN_PROGRESS
to POLL_WAITING
, and a nextPollTime
value will be set.IN_PROGRESS
to either ERRORED
or FAILED
, depending on the severity of the error.The first step in a job definition is executed with just the job parameters.
Middle Steps in the job definition are executed using the initial Job Parameters and the Work Chunk data produced from the previous step.
The final step operates the same way as the middle steps, except it does not produce any new work chunks.
If a Job Definition is set to having Gated Execution, then all work chunks for a step must be COMPLETED
before any work chunks for the next step may begin.
A Batch Job Maintenance Service runs every minute to monitor the status of all Job Instances and the Job Instance is transitioned to either COMPLETED
, ERRORED
or FAILED
according to the status of all outstanding work chunks for that job instance. If the job instance is still IN_PROGRESS
this maintenance service also estimates the time remaining to complete the job.
The job instance ID and work chunk ID are both available through the logback MDC and can be accessed using the %X
specifier in a logback.xml
file. See Logging for more details about logging in HAPI.