Fix Oplog Issues
Replication Oplog alerts can be triggered when the amount of oplog data generated on a primary cluster member is larger than the cluster's configured oplog size.
Alert Conditions
You can configure the following alert conditions in the project-level alert settings page to trigger alerts.
Replication Oplog Window is (X)
Occurs if the approximate amount of time available in the primary's replication oplog meets or falls below the specified threshold. This is the amount of time that the primary can continue logging at the current rate of oplog data generation.
Oplog Data Per Hour is (X)
Occurs if the amount of data per hour being written to a primary's replication oplog meets or exceeds the specified threshold.
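These thresholds correspond to values you can also inspect directly from mongosh. The sketch below reads the primary's oplog statistics as a rough cross-check; it assumes your database user is allowed to read the local database, which may be restricted on some managed cluster tiers.

```javascript
// Connect mongosh to the primary, then inspect the replication oplog.

// Human-readable summary: configured oplog size, used space, and the
// time range the oplog currently covers (the oplog window).
rs.printReplicationInfo()

// Programmatic version of the same data.
const info = db.getReplicationInfo()
print(`Configured oplog size: ${info.logSizeMB} MB`)
print(`Current oplog window:  ${info.timeDiffHours} hours`)
```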
Common Triggers
These are a few common events that may lead to increased oplog activity:
- Intensive write and update operations in a short period of time.
- The cluster's configured oplog size is smaller than the value in the Oplog GB / Hour graph observed in the cluster metrics view.
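To compare these two values outside the metrics view, a rough estimate of the generation rate can be derived from the same oplog statistics; this approximation only reflects the activity the oplog currently holds.

```javascript
// Approximate the oplog generation rate: usedMB of oplog data was produced
// over the timeDiffHours of activity the oplog currently covers.
const info = db.getReplicationInfo()
const gbPerHour = (info.usedMB / 1024) / info.timeDiffHours
print(`Approximate oplog generation rate: ${gbPerHour.toFixed(3)} GB / hour`)
print(`Configured oplog size:             ${(info.logSizeMB / 1024).toFixed(1)} GB`)
```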
Fix the Immediate Problem
Consider the following actions to help resolve Replication Oplog alerts:
- Increase the oplog size by editing your cluster's configuration so that it is higher than the peak value shown in the Oplog GB / Hour graph in the cluster metrics view. Also increase the oplog size if you foresee intense write and update operations occurring in a short time period (see the resize sketch after this list).
  Note: You may need to increase your cluster's storage to free enough space to resize the oplog.
- Ensure that all write operations specify a write concern of majority so that writes are replicated to at least one secondary before the application issues its next write operation. This controls the rate of traffic from your application by preventing the primary from accepting writes more quickly than the secondaries can handle (see the write concern example after this list).
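For reference, on a self-managed replica set you can resize the oplog with the replSetResizeOplog administrative command; on Atlas, change the oplog size through the cluster configuration instead. The size value below (in MB) is an example only.

```javascript
// Run against each member in turn (secondaries first, then the primary),
// connected directly to that member. size is specified in megabytes; pick
// a value above your peak Oplog GB / Hour rate.
db.adminCommand({ replSetResizeOplog: 1, size: 16000 })

// Confirm the new size and the resulting oplog window.
rs.printReplicationInfo()
```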
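As an illustration of the write concern recommendation, the following mongosh example (using a hypothetical events collection) waits for majority acknowledgment before returning, which naturally limits how quickly the application can issue its next write.

```javascript
// Insert that waits for a majority of replica set members to acknowledge
// the write before returning control to the application. The collection
// name and document fields are hypothetical.
db.events.insertOne(
  { type: "signup", createdAt: new Date() },
  { writeConcern: { w: "majority", wtimeout: 5000 } }
)
```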
Implement a Long-Term Solution
For more information on understanding oplog sizing requirements, see the Replica Set Oplog section of the MongoDB server documentation.
Monitor Your Progress
You might observe the following scenarios when these alerts trigger:
- The Oplog GB / Hour graph in the metrics view spikes upward.
- The Replication Oplog Window graph in the metrics view is low.
- The Atlas View and Download MongoDB Logs for secondary or unhealthy nodes display the following message:
  We are too stale to use <node>:27017 as a sync source.
- An Atlas node reports a state of STARTUP2 or RECOVERING for an extended period of time.
Typically, this indicates that the node has "fallen off the oplog" and is unable to keep up with the oplog data being generated by the primary node. In this case, the node requires an initial sync to recover and to ensure that the data is consistent across all nodes. You can check the state of a node using the rs.status() shell method.
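As a starting point, a sketch along the following lines summarizes each member's state and approximate replication lag from mongosh; the output format is illustrative.

```javascript
// Summarize member states and approximate replication lag using rs.status().
// A member stuck in STARTUP2 or RECOVERING, or with steadily growing lag,
// may have fallen off the oplog and need an initial sync.
const status = rs.status()
const primary = status.members.find(m => m.stateStr === "PRIMARY")
status.members.forEach(m => {
  const lagSeconds = primary ? (primary.optimeDate - m.optimeDate) / 1000 : NaN
  print(`${m.name}: ${m.stateStr}, lag ~ ${lagSeconds}s`)
})
```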