Offset Management
Offset management is handled in the code by the PartitionTracker class. Each partition of a topic specified in the topics property has an associated PartitionTracker, ensuring that aggregation is only performed on data coming from the same partition.
The PartitionTracker maintains and uses three types of offsets:
- Processed Offset: Tracks the offset of the most recent record received for the relevant partition. This is managed by the Put function within the AggregationSinkTask.
- Flushed Offset: Represents the offset of the latest record that has been sent, along with its aggregation, to the Kafka topic defined by topicOut.
- Commit Offset: Refers to the offset of the last flushed record whose offset has been committed to Kafka. Once a record's offset is committed, it will not be reprocessed, even in cases of task error or rebalance.
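
To make these roles concrete, here is a minimal sketch of what a per-partition tracker could look like. The field and method names (setProcessed, setFlushed, uncommittedFlushedOffset, setCommitted) are illustrative assumptions, not the connector's actual API:

```java
import java.util.OptionalLong;

// Hypothetical per-partition tracker holding the three offsets described above.
public class PartitionTracker {
    private long processedOffset = -1L; // most recent record received in put()
    private long flushedOffset = -1L;   // latest record whose aggregation was sent to topicOut
    private long committedOffset = -1L; // last flushed offset committed back to Kafka

    // Called for every record the task receives.
    public void setProcessed(long offset) { processedOffset = offset; }

    // Called once the aggregation containing the record has been produced to topicOut.
    public void setFlushed(long offset) { flushedOffset = offset; }

    // The flushed offset still awaiting commit, or empty if nothing new to commit.
    public OptionalLong uncommittedFlushedOffset() {
        return flushedOffset > committedOffset
                ? OptionalLong.of(flushedOffset)
                : OptionalLong.empty();
    }

    // Called after Kafka Connect has committed the offset.
    public void setCommitted(long offset) { committedOffset = offset; }
}
```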
Here’s how these offsets are managed in practice:
When a new SinkRecord arrives in the Put function of AggregationSinkTask, its offset is recorded as processed. Once a threshold (such as element count, value pattern, timeout, or retention) is met, the aggregation containing the record is sent to Kafka, and the record's offset is marked as flushed. When the preCommit method of AggregationSinkTask is triggered, the flushed offset of each partition is committed, provided it has not been committed already.
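
The sketch below shows how put and preCommit could cooperate under this scheme, using one PartitionTracker per TopicPartition as sketched above. The helpers thresholdMet and sendAggregationToTopicOut are hypothetical stand-ins for the connector's aggregation logic:

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.connect.sink.SinkRecord;
import org.apache.kafka.connect.sink.SinkTask;

// Skeleton of the task's offset handling; aggregation details are elided.
public abstract class AggregationSinkTaskSketch extends SinkTask {
    private final Map<TopicPartition, PartitionTracker> trackers = new HashMap<>();

    @Override
    public void put(Collection<SinkRecord> records) {
        for (SinkRecord record : records) {
            TopicPartition tp = new TopicPartition(record.topic(), record.kafkaPartition());
            PartitionTracker tracker = trackers.computeIfAbsent(tp, k -> new PartitionTracker());
            tracker.setProcessed(record.kafkaOffset());

            // Element count, value pattern, timeout, retention, ...
            if (thresholdMet(tp)) {
                sendAggregationToTopicOut(tp); // produce the aggregation to topicOut
                tracker.setFlushed(record.kafkaOffset());
            }
        }
    }

    @Override
    public Map<TopicPartition, OffsetAndMetadata> preCommit(
            Map<TopicPartition, OffsetAndMetadata> currentOffsets) {
        Map<TopicPartition, OffsetAndMetadata> toCommit = new HashMap<>();
        trackers.forEach((tp, tracker) ->
                tracker.uncommittedFlushedOffset().ifPresent(offset -> {
                    // Kafka offset commits name the next offset to read, hence the +1.
                    toCommit.put(tp, new OffsetAndMetadata(offset + 1));
                    tracker.setCommitted(offset);
                }));
        return toCommit;
    }

    // Placeholders for the connector's actual aggregation logic.
    protected abstract boolean thresholdMet(TopicPartition tp);
    protected abstract void sendAggregationToTopicOut(TopicPartition tp);
}
```

Whatever preCommit returns is exactly what Kafka Connect commits, so offsets that are only processed (never flushed) stay uncommitted and those records are redelivered after a failure or rebalance.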
At-least-once delivery is guaranteed, meaning a record is considered fully processed only when its offset is committed. Any record with a processed or flushed (but uncommitted) offset may be received again by the connector if a task failure or rebalance occurs. This means a record, even if already flushed, could be reprocessed and its aggregation sent again to Kafka under failure scenarios.
This design ensures a reliable at-least-once delivery model.