5. The Durability Service

NOTE: Durability is deprecated and will no longer be supported; please use DLite instead.

This section describes the most important concepts and mechanisms of the current durability service implementation, starting with the purpose of the service. After that, all of its concepts and mechanisms are described.

The exact fulfilment of the durability responsibilities is determined by the configuration of the Durability Service. Detailed descriptions of all available configuration parameters and their purpose can be found in the Configuration section.

5.1. Durability Service Purpose

OpenSplice will make sure data is delivered to all ‘compatible’ subscribers that are available at the time the data is published using the ‘communication paths’ that are implicitly created by the middleware based on the interest of applications that participate in the domain. However, subscribers that are created after the data has been published (called late-joiners) may also be interested in the data that was published before they were created (called historical data). To facilitate this use case, DDS provides a concept called durability in the form of a Quality of Service (DurabilityQosPolicy).

The DurabilityQosPolicy prescribes how published data needs to be maintained by the DDS middleware and comes in four flavours:

VOLATILE

Data does not need to be maintained for late-joiners (default).

TRANSIENT_LOCAL

Data needs to be maintained for as long as the DataWriter is active.

TRANSIENT

Data needs to be maintained for as long as the middleware is running on at least one of the nodes.

PERSISTENT

Data needs to outlive system downtime. This implies that it must be kept somewhere on permanent storage in order to be able to make it available again for subscribers after the middleware is restarted.

In OpenSplice, the realisation of the non-volatile properties is the responsibility of the durability service. Maintenance and provision of historical data could in theory be done by a single durability service in the domain, but for fault-tolerance and efficiency one durability service is usually running on every computing node. These durability services are on the one hand responsible for maintaining the set of historical data and on the other hand responsible for providing historical data to late-joining subscribers. The configurations of the different services drive the behaviour on where and when specific data will be maintained and how it will be provided to late-joiners.

5.2. Durability Service Concepts

The following subsections describe the concepts that drive the implementation of the OpenSplice Durability Service.

5.2.1. Role and Scope

Each OpenSplice node can be configured with a so-called role. A role is a logical name, and different nodes can be configured with the same role. The role itself does not impose anything, but multiple OpenSplice services use the role as a mechanism to distinguish behaviour between nodes with equal and different roles. The durability service allows configuring a so-called scope, which is an expression that is matched against the roles of other nodes. By using a scope, the durability service can be instructed to apply different behaviour with respect to merging of historical data sets (see Merge policy) to and from nodes that have equal or different roles.

Please refer to the Configuration section for detailed descriptions of:

  • //OpenSplice/Domain/Role

  • //OpenSplice/DurabilityService/NameSpaces/Policy/Merge[@scope]
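
For illustration, the fragment below sketches how a node could be given a role and how a merge policy could be scoped to nodes with that same role. The role name, name-space name, and merge action are hypothetical placeholders; consult the Configuration section for the authoritative syntax.

  <OpenSplice>
    <Domain>
      <!-- The logical role of this node -->
      <Role>RoleA</Role>
    </Domain>
    <DurabilityService name="durability">
      <NameSpaces>
        <Policy nameSpace="defaultNamespace" aligner="true" alignee="Initial" durability="Durable">
          <!-- Only merge historical data with nodes whose role matches 'RoleA' -->
          <Merge type="Merge" scope="RoleA"/>
        </Policy>
      </NameSpaces>
    </DurabilityService>
  </OpenSplice>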

5.2.2. Name-spaces

A sample published in DDS for a specific topic and instance is bound to one logical partition. This means that in case a publisher is associated with multiple partitions, a separate sample for each of the associated partitions is created. Even though they are syntactically equal, they have different semantics (consider for instance the situation where you have a sample in the ‘simulation’ partition versus one in the ‘real world’ partition).

Because applications might impose semantic relationships between instances published in different partitions, a mechanism is required to express this relationship and ensure consistency between partitions. For example, an application might expect a specific instance in partition Y to be available when it reads a specific instance from partition X.

This implies that the data in both partitions needs to be maintained as one single set. For persistent data, this dependency implies that the durability services in a domain need to make sure that this data set is re-published from one single persistent store instead of combining data coming from multiple stores on disk. To express this semantic relation between instances in different partitions to the durability service, the user can configure so-called ‘name-spaces’ in the durability configuration file.

Each name-space is formed by a collection of partitions and all instances in such a collection are always handled as an atomic data-set by the durability service. In other words, the data is guaranteed to be stored and reinserted as a whole.

This atomicity also implies that a name-space is a system-wide concept, meaning that different durability services need to agree on its definition, i.e. which partitions belong to one name-space. This doesn’t mean that each durability service needs to know all name-spaces, as long as the name-spaces one does know don’t conflict with one of the others in the domain. Name-spaces that are completely disjoint can co-exist (their intersection is an empty set); name-spaces conflict when they intersect. For example: name-spaces {p1, q} and {p2, r} can co-exist, but name-spaces {s, t} and {s, u} cannot.

Furthermore, it is important to know that there is a set of configurable policies for name-spaces, allowing durability services throughout the domain to take different responsibilities for each name-space with respect to maintaining and providing the data that belongs to it. The durability name-spaces define the mapping between logical partitions and the responsibilities that a specific durability service needs to fulfil. The default configuration file contains only one name-space (holding all partitions).

Next to expressing a semantic relationship between data in one name-space, a name-space serves a second purpose: differentiating the responsibilities of a particular durability service for a specific data-set. Even though there may not be any relation between instances in different partitions, grouping specific partitions in different name-spaces can still be perfectly valid. The need for availability of non-volatile data under specific conditions (fault-tolerance) on the one hand, versus requirements on performance (memory usage, network bandwidth, CPU usage, etc.) on the other, may force the user to split up the maintenance of the non-volatile data-set over multiple durability services in the domain. Illustrative of this balance between fault-tolerance and performance is the example of maintaining all data in all durability services, which is maximally fault-tolerant but also requires the most resources. The name-spaces concept allows the user to divide the total set of non-volatile data over multiple name-spaces and assign different responsibilities to different durability services in the form of so-called name-space policies.

Please refer to the Configuration section for a detailed description of:

  • //OpenSplice/DurabilityService/NameSpaces/NameSpace
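
For example, a name-space that treats two related partitions as one atomic data-set could be sketched as follows (the name-space and partition names are illustrative):

  <DurabilityService name="durability">
    <NameSpaces>
      <!-- All instances in these partitions are maintained as one atomic data-set -->
      <NameSpace name="TrackData">
        <Partition>Tracks</Partition>
        <Partition>TrackHistory</Partition>
      </NameSpace>
    </NameSpaces>
  </DurabilityService>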

5.2.3. Name-space policies

This section describes the policies that can be configured per name-space giving the user full control over the fault-tolerance versus performance aspect on a per name-space level.

Please refer to the Configuration section for a detailed description of:

  • //OpenSplice/DurabilityService/NameSpaces/Policy

5.2.3.1. Alignment policy

The durability services in a domain are on the one hand responsible for maintaining the set of historical data between services and on the other hand responsible for providing historical data to late-joining applications. The configurations of the different services drive the behaviour on where and when specific data will be kept and how it will be provided to late-joiners. The optimal configuration is driven by fault-tolerance on the one hand and resource usage (like CPU usage, network bandwidth, disk space and memory usage) on the other hand. One mechanism to control the behaviour of a specific durability service is the usage of alignment policies that can be configured in the durability configuration file. This configuration option allows a user to specify if and when data for a specific name-space (see the section about Name-spaces) will be maintained by the durability service and whether or not it is allowed to act as an aligner for other durability services when they require (part of) the information.

The alignment responsibility of a durability service is therefore configurable by means of two different configuration options being the aligner and alignee responsibilities of the service:

Aligner policy

TRUE

The durability service will align others if needed.

FALSE

The durability service will not align others.

Alignee policy

INITIAL

Data will be retrieved immediately when the data is available and continuously maintained from that point forward.

LAZY

Data will be retrieved on first arising interest on the local node and continuously maintained from that point forward.

ON_REQUEST

Data will be retrieved only when requested by a subscriber, but not maintained. Therefore each request will lead to a new alignment action.

Note that a node which has a Lazy alignee policy configured will only acquire historical data for partition/topic combinations for which readers exist on the local node. If there are no such readers, the node will not know about data that has been published by publishers on remote nodes.

Data that is published by a node with a Lazy alignee policy will not be cached by the local durability service if no readers for the data exist on the local node. If at the time of publication there is also no remote durability service available that is able to cache the data, then the data is not maintained in the system and will not be made available to late joiners.

NOTE: When DDSI is used as the networking service, fictitious readers are locally created for writers that publish durable data. The creation of these fictitious readers causes the durability service to always cache durable data, even in the case of a Lazy alignee policy. Therefore, lazy alignment effectively defaults to Initial when DDSI is used.

Please refer to the Configuration section for detailed descriptions of:

  • //OpenSplice/DurabilityService/NameSpaces/Policy[@aligner]

  • //OpenSplice/DurabilityService/NameSpaces/Policy[@alignee]
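
A minimal sketch of both attributes on a single policy (the name-space and partition names are illustrative placeholders):

  <NameSpaces>
    <NameSpace name="TrackData">
      <Partition>Tracks</Partition>
    </NameSpace>
    <!-- Act as an aligner for others; obtain data immediately when it is available -->
    <Policy nameSpace="TrackData" aligner="true" alignee="Initial" durability="Durable"/>
  </NameSpaces>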

5.2.3.2. Durability policy

The durability service is capable of maintaining (part of) the set of non-volatile data in a domain. Normally this means that data written as volatile is not stored, data written as transient is stored in memory, and data written as persistent is stored both in memory and on disk. However, there are use cases where the durability service is required to ‘weaken’ the DurabilityQosPolicy associated with the data, for instance by storing persistent data only in memory as if it were transient. Reasons for this are performance impact (CPU load, disk I/O) or simply because no permanent storage (in the form of a hard disk) is available on a node. Be aware that it is not possible to ‘strengthen’ the durability of the data (Persistent > Transient > Volatile).

The durability service has the following options for maintaining a set of historical data:

PERSISTENT

Store persistent data on permanent storage, keep transient data in memory, and don’t maintain volatile data.

TRANSIENT

Keep both persistent and transient data in memory, and don’t maintain volatile data.

VOLATILE

Don’t maintain persistent, transient, or volatile data.

This configuration option is called the ‘durability policy’.

Please refer to the Configuration section for a detailed description of:

  • //OpenSplice/DurabilityService/NameSpaces/Policy[@durability]
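
For example, a policy that weakens persistent data to in-memory storage, as if it were transient, might look like the sketch below (assuming the attribute accepts values corresponding to the options listed above):

  <!-- Keep persistent and transient data in memory only; do not store on disk -->
  <Policy nameSpace="TrackData" aligner="true" alignee="Initial" durability="Transient"/>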

5.2.3.3. Delayed alignment policy

The durability service has a mechanism in place to make sure that when multiple services with a persistent dataset exist, only one set (typically the one with the newest state) will be injected in the system (see Persistent data injection). This mechanism will, during the startup of the durability service, negotiate with other services which one has the best set (see Master selection). After negotiation the ‘best’ persistent set (which can be empty) is restored and aligned to all durability services.

Once persistent data has been re-published in the domain by a durability service for a specific name-space, other durability services in that domain cannot decide to re-publish their own set for that name-space from disk any longer. Applications may already have started their processing based on the already-published set, and re-publishing another set of data may confuse the business logic inside applications. Other durability services will therefore back-up their own set of data and align and store the set that is already available in the domain.

It is important to realise that an empty set of data is also considered a set. This means that once a durability service in the domain decides that there is no data (and has triggered applications that the set is complete), other late-joining durability services will not re-publish any persistent data that they potentially have available.

Some systems however do require re-publishing persistent data from disk if the already re-published set is empty and no data has been written for the corresponding name-space. The durability service can be instructed to still re-publish data from disk in this case by means of an additional policy in the configuration called ‘delayed alignment’. This Boolean policy instructs a late-joining durability service whether or not to re-publish persistent data for a name-space that has been marked complete already in the domain, but for which no data exists and no DataWriters have been created. Whatever setting is chosen, it should be consistent between all durability services in a domain to ensure proper behaviour on the system level.

Please refer to the Configuration section for a detailed description of:

  • //OpenSplice/DurabilityService/NameSpaces/Policy[@delayedAlignment]
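
A sketch enabling delayed alignment for a name-space (the attribute is assumed to be Boolean, consistent with the description above; all names are placeholders):

  <!-- Re-publish persistent data from disk even when an empty set was already
       declared complete, provided no DataWriters have been created yet -->
  <Policy nameSpace="TrackData" aligner="true" alignee="Initial"
          durability="Durable" delayedAlignment="true"/>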

5.2.3.4. Merge policy

A ‘split-brain syndrome’ can be described as the situation in which two different nodes (possibly) have a different perception of (part of) the set of historical data. This split-brain occurs when two nodes or two sets of nodes (i.e. two systems) that are participating in the same DDS domain have been running separately for some time and suddenly get connected to each other. This syndrome also arises when nodes re-connect after being disconnected for some time. Applications on these nodes may have been publishing information for the same topic in the same partition without this information reaching the other party. Therefore their perception of the set of data will be different.

In many cases, after this has occurred the exchange of information is no longer allowed, because there is no guarantee that data between the connected systems doesn’t conflict. For example, consider a fault-tolerant (distributed) global id service: this service will provide globally-unique ids, but this will be guaranteed if and only if there is no disruption of communication between all services. In such a case a disruption must be considered permanent and a reconnection must be avoided at any cost.

Some new environments demand support for (re)connecting two separate systems though. One can think of ad-hoc networks where nodes dynamically connect when they are near each other and disconnect again when they are out of range, but also of systems where temporary loss of network connections is normal. Another use case is the deployment of OpenSplice in a hierarchical network, where higher-level ‘branch’ nodes need to combine the different historical data sets from multiple ‘leaves’ into their own data set. In these new environments there is the same strong need for the availability of data for ‘late-joining’ applications (non-volatile data) as in any other system.

For these kinds of environments the durability service has additional functionality to support the alignment of historical data when two nodes get connected. Of course, the basic use case of a newly-started node joining an existing system is supported, but in contrast to that situation there is no universal truth in determining who has the best (or the right) information when two already-running nodes (re)connect. When this situation occurs, the durability service provides the following possibilities to handle it:

IGNORE

Ignore the situation and take no action at all. This means new knowledge is not actively built up. Durability is passive and will only build up knowledge that is ‘implicitly’ received from that point forward (simply by receiving updates that are published by applications from that point forward and delivered using the normal publish-subscribe mechanism).

DELETE

Dispose and delete all historical data. This means existing data is disposed and deleted and other data is not actively aligned. Durability is passive and will only maintain data that is ‘implicitly’ received from that point forward.

MERGE

Merge the historical data with the data set that is available on the connecting node.

REPLACE

Dispose and replace all historical data by the data set that is available on the connecting node. Because all data is disposed first, a side effect is that instances present both before and after the merge operation transition through NOT_ALIVE_DISPOSED and end up as NEW instances, with corresponding changes to the instance generation counters.

CATCHUP

Updates the historical data to match the historical data on the remote node by disposing those instances available in the local set but not in the remote set, and adding and updating all other instances. The resulting data set is the same as that for the REPLACE policy, but without the side effects. In particular, the instance state of instances that are both present on the local node and remote node and for which no updates have been done will remain unchanged.

caution

Note that REPLACE and CATCHUP result in the same data set, but the instance states of the data may differ.

From this point forward this set of options will be referred to as ‘merge policies’.

Like the networking service, the durability service allows configuration of a so-called scope as part of the merge policy, giving the user full control over which merge policy is selected based on the role of the (re-)connecting node. The scope is a logical expression that is matched against the role of remote durability services: every time nodes get physically connected, each matches the role of the other party against its configured scope to determine whether communication is allowed and, if so, whether a merge action is required. Because of this scope, the merge behaviour for (re-)connections can be configured on a per-role basis. It might for instance be necessary to merge data when re-connecting to a node with the same role, whereas (re-)connecting to a node with a different role requires no action.

Please refer to the Configuration section for a detailed description of:

  • //OpenSplice/DurabilityService/NameSpaces/Policy/Merge
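
For illustration, the sketch below applies a MERGE action towards nodes with the same role and ignores nodes with any other role. The role name, the wildcard scope expression, and the use of multiple Merge elements per policy are assumptions; see the Configuration section for the authoritative syntax.

  <Policy nameSpace="TrackData" aligner="true" alignee="Initial" durability="Durable">
    <!-- Merge divergent data sets when re-connecting to nodes in role 'RoleA' -->
    <Merge type="Merge" scope="RoleA"/>
    <!-- Take no action for nodes with any other role -->
    <Merge type="Ignore" scope="*"/>
  </Policy>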

5.2.3.5. Prevent aligning equal data sets

As explained in previous sections, temporary disconnections can cause durability services to get out-of-sync, meaning that their data sets may diverge. To recover from such situations, merge policies have been defined (see Merge policy) with which a user can specify how to combine divergent data sets when nodes become reconnected. Many of these situations involve the transfer of data sets from one durability service to the other, which may generate a considerable amount of traffic for large data sets.

If the data sets do not get out-of-sync during disconnection it is not necessary to transfer data sets from one durability service to the other. Users can specify whether to compare data sets before alignment using the equalityCheck attribute. When this check is enabled, hashes of the data sets are calculated and compared; when they are equal, no data will be aligned. This may save valuable bandwidth during alignment. If the hashes are different then the complete data sets will be aligned.

Comparing data sets does not come for free, as it requires hash calculations over the data sets. For large sets this overhead may become significant; for that reason it is not recommended to enable this feature for frequently-changing data sets. Doing so would impose the penalty of having to calculate hashes when the hashes are likely to differ and the data sets need to be aligned anyway.

Comparison of data sets using hashes is currently only supported for operational nodes that diverge; no support is provided during initial startup.

Please refer to the Configuration section for a detailed description of:

  • //OpenSplice/DurabilityService/NameSpaces/Policy[@equalityCheck]
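
A sketch of a policy with the equality check enabled (names other than the equalityCheck attribute itself are placeholders):

  <!-- Compare hashes of the data sets before alignment; skip alignment when equal -->
  <Policy nameSpace="TrackData" aligner="true" alignee="Initial"
          durability="Durable" equalityCheck="true"/>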

5.2.3.6. Dynamic name-spaces

As specified in the previous sections, a set of policies can be configured for a given (set of) name-space(s). One may not know the complete set of name-spaces for the entire domain though, especially when new nodes dynamically join the domain. However, to achieve maximum fault-tolerance, one may still need to define behaviour for a durability service by means of a set of policies for name-spaces that have not been configured on the current node.

Every name-space in the domain is identified by a logical name. To allow a durability service to fulfil a specific role for any name-space, each policy needs to be configured with a name-space expression that is matched against the names of the name-spaces in the domain. If the policy matches a name-space, it will be applied by the durability service, independently of whether or not the name-space itself is configured on the node where this durability service runs. This concept is referred to as ‘dynamic name-spaces’.

Please refer to the Configuration section for a detailed description of:

  • //OpenSplice/DurabilityService/NameSpaces/Policy[@nameSpace]
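
For example, a dynamic name-space policy could use a pattern in the nameSpace attribute so that it also applies to name-spaces defined on other nodes. The wildcard syntax shown here is an assumption; check the Configuration section for the exact expression syntax.

  <!-- Applies to every name-space in the domain whose name starts with 'Track',
       whether or not that name-space is configured on this node -->
  <Policy nameSpace="Track*" aligner="true" alignee="Initial" durability="Durable"/>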

5.2.3.7. Master/slave

Each durability service that is responsible for maintaining data in a name-space must maintain the complete set for that name-space. It can achieve this either by requesting data from a durability service that indicates it has a complete set or, if none is available, by requesting all data from all services for that name-space and combining the responses into a single complete set. This is the only way to ensure that all available data is obtained. In a system where all nodes are started at the same time, none of the durability services will have the complete set, because applications on some nodes may already have started to publish data. In the worst case, every service that starts then needs to ask every other service for its data. This concept is not very scalable and also leads to a lot of unnecessary network traffic, because multiple nodes may (partly) have the same data. Besides that, start-up times of such a system will grow rapidly when new nodes are added. Therefore the so-called ‘master’ concept has been introduced.

Durability services will determine one ‘master’ for every name-space per configured role amongst themselves. Once the master has been selected, this master is the one that will obtain all historical data first (this also includes re-publishing its persistent data from disk) and all others wait for that process to complete before asking the master for the complete set of data. The advantage of this approach is that only the master (potentially) needs to ask all other durability services for their data and all others only need to ask just the master service for its complete set of data after that.

Additionally, a durability service is capable of combining alignment requests coming from multiple remote durability services and will align them all at the same time using the internal multicast capabilities. The combination of the master concept and the capability of aligning multiple durability services at the same time make the alignment process very scalable and prevent the start-up times from growing when the number of nodes in the system grows. The timing of the durability protocol can be tweaked by means of configuration in order to increase chances of combining alignment requests. This is particularly useful in environments where multiple nodes or the entire system is usually started at the same time and a considerable amount of non-volatile data needs to be aligned.

5.3. Mechanisms

5.3.1. Interaction with other durability services

To be able to obtain or provide historical data, the durability service needs to communicate with other durability services in the domain. The other durability services that participate in the same domain are called ‘fellows’. The durability service uses regular DDS to communicate with its fellows. This means all information exchange between different durability services is done via standard DataWriters and DataReaders (without relying on non-volatile data properties, of course).

Depending on the configured policies, DDS communication is used to determine and monitor the topology, to exchange information about available historical data, and to align actual data with fellow durability services.

5.3.2. Interaction with other OpenSplice services

In order to communicate with fellow durability services through regular DDS DataWriters and DataReaders, the durability service relies on the availability of a network service. This can be either the interoperable DDSI or the real-time networking service, or even a combination of multiple networking services in more complex environments. As networking services are pluggable like the durability service itself, they are separate processes or threads that perform tasks asynchronously next to the tasks that the durability service is performing. Some configuration is required to instruct the durability service to synchronise its activities with the configured networking service(s). The durability service aligns data separately per partition-topic combination. Before it can start alignment for a specific partition-topic combination, it needs to be sure that the networking service(s) have detected that combination and will deliver and send over the network any data published from that point forward. The durability service needs to be configured with the networking service(s) that must be attached to a partition-topic combination before alignment starts. This principle is called ‘wait-for-attachment’.
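
A sketch of the wait-for-attachment configuration, instructing the durability service not to start alignment before the networking service named ‘networking’ has attached to the relevant partition-topic combination (the service name and wait count are illustrative):

  <Network>
    <WaitForAttachment maxWaitCount="100">
      <!-- Name of the networking service to wait for, as configured in the domain -->
      <ServiceName>networking</ServiceName>
    </WaitForAttachment>
  </Network>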

Furthermore, the durability service is responsible for periodically announcing its liveliness to the splice-daemon. This allows the splice-daemon to take corrective measures in case the durability service becomes unresponsive. The durability service has a separate so-called ‘watch-dog’ thread to perform this task. The configuration file allows configuring the scheduling class and priority of this watch-dog thread.

Finally, the durability service is also responsible for monitoring the splice-daemon. In case the splice-daemon itself fails to update its lease or initiates regular termination, the durability service will terminate automatically as well.

Please refer to the Configuration section for a detailed description of:

  • //OpenSplice/DurabilityService/Network

5.3.3. Interaction with applications

The durability service is responsible for providing historical data to late-joining subscribers.

Applications can use the DCPS API call wait_for_historical_data on a DataReader to synchronise on the availability of the complete set of historical data. Depending on whether the historical data is already available locally, data can be delivered immediately after the DataReader has been created or must first be aligned from another durability service in the domain. Once all historical data has been delivered to the newly-created DataReader, the durability service triggers the DataReader, unblocking the wait_for_historical_data performed by the application. If the application does not need to block until the complete set of historical data is available before it starts processing, there is no need to call wait_for_historical_data. It should be noted that in such a case historical data is still delivered by the durability service when it becomes available.

5.3.4. Parallel alignment

When a durability service is started and joins an already-running domain, it usually obtains historical data from one or more already-running durability services. In case multiple durability services are started around the same time, each of them needs to obtain a set of historical data from the already-running domain. The set of data that needs to be obtained by the various durability services is often the same, or at least has a large overlap. Instead of aligning each newly-joining durability service separately, aligning all of them at the same time is very beneficial, especially if the set of historical data is quite big. By using the built-in multicast and broadcast capabilities of DDS, a durability service is able to align as many other durability services as desired in one go. This ability reduces the CPU, memory and bandwidth usage of the durability service and makes the alignment scale, also in situations where many durability services are started around the same time and a large set of historical data exists. The concept of aligning multiple durability services at the same time is referred to as ‘parallel alignment’.

To allow this mechanism to work, durability services in a domain determine a master durability service for each name-space. Every durability service elects the same master for a given name-space based on a set of rules that are explained later on in this document. When a durability service needs to be aligned, it will always send its request for alignment to its selected master. This results in only one durability service being asked for alignment by any other durability service in the domain for a specific name-space, and also allows the master to combine similar requests for historical data. To be able to combine alignment requests from different sources, a master will wait for a period of time between receiving a request and answering it. This period of time is called the ‘request-combine period’.

The actual amount of time that defines the ‘request-combine period’ is configurable. Increasing it will increase the likelihood of parallel alignment, but will also delay the start of alignment of the remote durability service; in case only one request comes in within the configured period, this is non-optimal behaviour. The optimal configuration for the request-combine period therefore depends heavily on the anticipated behaviour of the system, and the optimal value may be different in every use case.

In some systems, all nodes are started simultaneously, but from that point forward new nodes start or stop only sporadically. In such systems, a different request-combine period is desired during the start-up phase than during the operational phase. That is why the configuration of this period is split into two settings: one for the start-up phase and one for the operational phase.

Please refer to the Configuration section for a detailed description of:

  • //OpenSplice/DurabilityService/Network/Alignment/RequestCombinePeriod
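
For example, a configuration with a generous combine period during start-up and a short one during the operational phase might look as follows (values are in seconds and chosen purely for illustration):

  <Network>
    <Alignment>
      <RequestCombinePeriod>
        <!-- Wait long during start-up, when many similar requests are likely -->
        <Initial>2.5</Initial>
        <!-- Keep alignment latency low once the system is operational -->
        <Operational>0.1</Operational>
      </RequestCombinePeriod>
    </Alignment>
  </Network>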

5.3.5. Tracing

Configuring durability services throughout a domain and finding out what exactly happens during the lifecycle of the service can prove difficult.

OpenSplice developers sometimes need more detailed durability-specific state information than is available in the regular OpenSplice info and error logs in order to analyse what is happening. To allow retrieval of more internal information about the service for (off-line) analysis, whether to improve performance or to analyse potential issues, the service can be configured to trace its activities to a specific output file on disk.

By default, this tracing is turned off for performance reasons, but it can be enabled by configuring it in the XML configuration file.

The durability service supports various tracing verbosity levels. In general, the more verbose the configured level (FINEST being the most verbose), the more detailed the information in the tracing file will be.

Please refer to the Configuration section for a detailed description of:

  • //OpenSplice/DurabilityService/Tracing
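
A sketch that enables verbose tracing to a file; the child element names follow the pattern of the default configuration file, but treat the exact set of elements as an assumption and check the Configuration section:

  <DurabilityService name="durability">
    <Tracing>
      <!-- FINEST is the most verbose level -->
      <Verbosity>FINEST</Verbosity>
      <OutputFile>durability.log</OutputFile>
    </Tracing>
  </DurabilityService>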

5.4. Lifecycle

During its lifecycle, the durability service performs all kinds of activities to be able to live up to the requirements imposed by the DDS specification with respect to non-volatile properties of published data. This section describes the various activities that a durability service performs to be able to maintain non-volatile data and provide it to late-joiners during its lifecycle.

5.4.1. Determine connectivity

Each durability service constantly needs knowledge of all other durability services that participate in the domain, to determine the logical topology and changes in that topology (i.e. to detect connecting, disconnecting and re-connecting nodes). This allows the durability service, for instance, to determine where non-volatile data is potentially available and whether a remote service will still respond to requests that have been reliably sent to it.

To determine connectivity, each durability service sends out a heartbeat periodically (every configurable amount of time) and checks whether incoming heartbeats have expired. When a heartbeat from a fellow expires, the durability service considers that fellow disconnected and expects no more answers from it. This means a new aligner will be selected for any outstanding alignment requests for the disconnected fellow. When a heartbeat from a newly (re)joining fellow is received, the durability service will assess whether that fellow is compatible and if so, start exchanging information.

Please refer to the Configuration section for a detailed description of:

  • //OpenSplice/DurabilityService/Network/Heartbeat
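
A sketch of the heartbeat configuration is shown below. The ExpiryTime element and update_factor attribute are assumptions based on the configuration pattern used elsewhere in OpenSplice; the authoritative syntax is in the Configuration section.

  <Network>
    <Heartbeat>
      <!-- Consider a fellow disconnected when no heartbeat arrives within 10 s;
           send own heartbeats every update_factor * ExpiryTime seconds -->
      <ExpiryTime update_factor="0.2">10.0</ExpiryTime>
    </Heartbeat>
  </Network>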

5.4.2. Determine compatibility

When a durability service detects a remote durability service in the domain it is participating in, it will determine whether that service has a compatible configuration before deciding to start communicating with it. A reason not to start communicating with the newly-discovered durability service would be a mismatch in configured name-spaces. As explained in the section about the Name-spaces concept, having different name-spaces is not an issue as long as they do not overlap. In case an overlap is detected, no communication will take place between the two ‘incompatible’ durability services. Such an incompatibility is considered a mis-configuration of the system and is reported as such in the OpenSplice error log.

Once the durability service determines that its name-spaces are compatible with those of all other discovered durability services, it will continue with the selection of a master for every name-space, which is the next phase in its lifecycle.

5.4.3. Master selection

For each Namespace and Role combination there shall be at most one Master Durability Service. The Master Durability Service coordinates the single-source re-publishing of persistent data, allows parallel alignment after system start-up, and coordinates the recovery from a split-brain syndrome after connecting nodes that have selected a different Master, which indicates that more than one state of the data may exist.

Therefore, after system start-up as well as after any topology change (i.e. late-joining nodes or a leaving master node), a master selection process will take place for each affected Namespace/Role combination.

To control the master selection process a masterPriority attribute can be used.

Each Durability Service can be configured with a masterPriority attribute per namespace: an integer value between 0 and 255 that specifies the eagerness of the Durability Service to become Master for that namespace. The values 0 and 255 have a special meaning. The value 0 indicates that the Durability Service will never become Master. The value 255 indicates that the Durability Service will not use priorities but instead uses the legacy selection algorithm. If not configured, the default is 255.
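
For illustration, assuming masterPriority is an attribute of the name-space policy (see the Configuration section for the authoritative syntax), two different nodes could be configured as follows:

  <!-- On a node that is strongly preferred as master for this name-space -->
  <Policy nameSpace="TrackData" aligner="true" alignee="Initial"
          durability="Durable" masterPriority="200"/>

  <!-- On a node that must never become master (e.g. a resource-constrained node) -->
  <Policy nameSpace="TrackData" aligner="true" alignee="Initial"
          durability="Durable" masterPriority="0"/>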

During the master selection process, the Durability Services exchange the masterPriority and quality for each namespace. The namespace quality is the timestamp of the latest update of the persistent data set stored on disk; it only plays a role in master selection initially, when no master has been chosen before and persistent data has not been injected yet.

Each Durability Service determines the Master based upon the highest non-zero masterPriority; in case of multiple candidates it further selects based on namespace quality (but only if persistent data has not been injected before), and if multiple candidates still remain, it selects the one with the highest system id. The system id is an arbitrary value that uniquely identifies a durability service. After selection, each Durability Service communicates its determined master and, on agreement, the selection is effectuated. On disagreement, which may occur if some Durability Services had a temporarily different view of the system, the master selection process restarts until all Durability Services have the same view of the system and have made the same selection. If no durability service with a masterPriority greater than zero exists, no master will be selected.

Summarizing, the precedence rules for master selection are (from high to low):

  1. The namespace masterPriority

  2. The namespace quality, if no data has been injected before.

  3. The Durability Service system id, which is unique for each durability service.

Please refer to the Configuration section for a detailed description of:

  • //OpenSplice/DurabilityService/Network/InitialDiscoveryPeriod

5.4.4. Persistent data injection

As persistent data needs to outlive system downtime, this data needs to be re-published in DDS once a domain is started.

If only one node is started, the durability service on that node can simply re-publish the persistent data from its disk. However, if multiple nodes are started at the same time, things become more difficult. Each of them may have a different set available on permanent storage, due to the fact that the durability services have been stopped at different moments in time. Therefore only one of them should be allowed to re-publish its data, to prevent inconsistencies and duplication of data.

The steps below describe how a durability service currently determines whether or not to inject its data during start-up:

  1. Determine validity of own persistent data — During this step the durability service determines whether its persistent store was completely filled with all persistent data in the domain during the last run. If the service was shut down during the initial alignment of the persistent data in the last run, the set of data will be incomplete, and the service will restore its back-up of a full set of (older) data if one is available from an earlier run. This is done because it is considered better to re-publish an older but complete set of data than a part of a newer set.

  2. Determine quality of own persistent data — If persistence has been configured, the durability service will inspect the quality of its persistent data on start-up. The quality is determined per name-space by looking at the time-stamps of the persistent data on disk. The latest time-stamp of the data on disk is used as the quality of the name-space. This information is useful when multiple nodes are started at the same time. Since there can only be one source per name-space that is allowed to actually inject the data from disk into DDS, this mechanism allows the durability services to select the source that has the latest data, because this is generally considered the best data. If this is not true, intervention is required: the data on the node must be replaced by the correct data, either by a supervisor (a human or a system-management application) replacing the data files, or by starting the nodes in the desired sequence so that the data is replaced by alignment.

  3. Determine topology — During this step, the durability service determines whether there are other durability services in the domain and what their state is. If this service is the only one, it will select itself as the ‘best’ source for the persistent data.

  4. Determine master — During this step the durability service will determine who will inject persistent data or who has injected persistent data already. The one that will or already has injected persistent data is called the ‘master’. This process is done on a per name-space level (see previous section).

    1. Find existing master – In case the durability service joins an already-running domain, the master has already been determined, and it has already injected the persistent data from its disk or is doing so right now. In this case, the durability service will set its current set of persistent data aside and will align data from the already-existing master node. If there is no master yet, persistent data has not been injected yet.

    2. Determine new master – If the master has not been determined yet, the durability service determines the master for itself based on who has the best quality of persistent data. In case there is more than one service with the ‘best’ quality, the one with the highest system id (unique number) is selected. Furthermore, a durability service that is marked as not being an aligner for a name-space cannot become master for that name-space.

  5. Inject persistent data — During this final step the durability service injects its persistent data from disk into the running domain. This is only done when the service has determined that it is the master. In any other situation the durability service backs up its current persistent store and fills a new store with the data it aligns from the master durability service in the domain, or postpones alignment until a master becomes available in the domain.

caution

It is strongly discouraged to re-inject persistent data from a persistent store in a running system after persistent data has been published. Behaviour of re-injecting persistent stores in a running system is not specified and may be changed over time.

5.4.5. Discover historical data

During this phase, the durability service finds out what historical data is available in the domain that matches any of the locally configured name-spaces. All necessary topic definitions and partition information are retrieved during this phase. This step is performed before the historical data is actually aligned from others. The process of discovering historical data continues during the entire lifecycle of the service and is based on the reporting of locally-created partition-topic combinations by each durability service to all others in the domain.

5.4.6. Align historical data

Once all topic and partition information for all configured name-spaces are known, the initial alignment of historical data takes place. Depending on the configuration of the service, data is obtained either immediately after discovering it or only once local interest in the data arises. The process of aligning historical data continues during the entire lifecycle of the durability service.

5.4.7. Provide historical data

Once (a part of) the historical data is available in the durability service, it is able to provide historical data to local DataReaders as well as other durability services.

Providing of historical data to local DataReaders is performed automatically as soon as the data is available. This may be immediately after the DataReader is created (in case historical data is already available in the local durability service at that time) or immediately after it has been aligned from a remote durability service.

Providing of historical data to other durability services is done only on request by these services. In case the durability service has been configured to act as an aligner for others, it will respond to requests for historical data that are received. The set of locally available data that matches the request will be sent to the durability service that requested it.

5.4.8. Merge historical data

When a durability service discovers a remote durability service and detects that neither that service nor the service itself is in start-up phase, it concludes that they have been running separately for a while (or the entire time) and both may have a different (but potentially complete) set of historical data. When this situation occurs, the configured merge-policies will determine what actions are performed to recover from this situation. The process of merging historical data will be performed every time two separately running systems get (re-)connected.

5.5. Threads description

This section contains a short description of each durability thread. When applicable, relevant configuration parameters are mentioned.

5.5.1. ospl_durability

This is the main durability service thread. It starts most of the other threads, e.g. the listener threads that are used to receive the durability protocol messages, and it initiates initial alignment when necessary. This thread is responsible for periodically updating the service-lease so that the splice-daemon is aware the service is still alive. It also periodically (every 10 seconds) checks the state of all other service threads to detect if deadlock has occurred. If deadlock has occurred the service will indicate which thread didn’t make progress and action can be taken (e.g. the service refrains from updating its service-lease, causing the splice daemon to execute a failure action). Most of the time this thread is asleep.

5.5.2. conflictResolver

This thread is responsible for resolving conflicts. If a conflict has been detected and stored in the conflictQueue, the conflictResolver thread takes the conflict, checks whether the conflict still exists, and if so, starts the procedure to resolve the conflict (i.e., start to determine a new master, send out sample requests, etc).

5.5.3. statusThread

This thread is responsible for sending status messages to other durability services. These messages are sent periodically between durability services to inform each other about their state (e.g., INITIALIZING or TERMINATING).

Configuration parameters:

  • //OpenSplice/DurabilityService/Watchdog/Scheduling

5.5.4. d_adminActionQueue

The durability service maintains a queue to schedule timed actions. The d_adminActionQueue thread periodically checks (every 100 ms) whether an action is scheduled. Example actions are: electing a new master, detecting new local groups, and deleting historical data.

Configuration parameters:

  • //OpenSplice/DurabilityService/Network/Heartbeat/Scheduling

5.5.5. AdminEventDispatcher

Communication between the splice-daemon and the durability service is managed by events. The AdminEventDispatcher thread listens and acts upon these events. For example, the creation of a new topic is noticed by the splice-daemon, which generates an event for the durability service, which in turn schedules an action to request historical data for the topic.

5.5.6. groupCreationThread

The groupCreationThread is responsible for the creation of groups that exist in other federations. When a durability service receives a newGroup message from another federation, it must create the group locally in order to acquire data for it. Creation of a group may fail in case a topic is not yet known. The thread will retry with a 10ms interval.

5.5.7. sampleRequestHandler

This thread is responsible for the handling of sample requests. When a durability service receives a d_sampleRequest message (see the sampleRequestListener thread) it will not immediately answer the request, but wait until the time to combine requests has expired (see //OpenSplice/DurabilityService/Network/Alignment/RequestCombinePeriod). When this time has expired, the sampleRequestHandler answers the request by collecting the requested data and sending it as d_sampleChain messages to the requestor.

Configuration parameters:

  • //OpenSplice/DurabilityService/Network/Alignment/AlignerScheduling

5.5.8. resendQueue

This thread is responsible for re-injecting messages into a group after an earlier injection has been rejected. When a durability service has received historical data from a fellow, the historical data is injected into the group (see d_sampleChain). Injection of historical data can be rejected, e.g., when resource limits are reached. When this happens, a new attempt to inject the data is scheduled using the resendQueue thread, which tries to deliver the data 1 second later.

Configuration parameters:

  • //OpenSplice/DurabilityService/Network/Alignment/AligneeScheduling

5.5.9. masterMonitor

The masterMonitor is the thread that handles the selection of a new master. This thread is invoked when the conflict resolver detects that a master conflict has occurred. The masterMonitor is responsible for collecting master proposals from other fellows and sending out proposals to other fellows.

5.5.10. groupLocalListenerActionQueue

This thread is used to handle historical data requests from specific readers, and to handle delayed alignment (see //OpenSplice/DurabilityService/NameSpaces/Policy[@delayedAlignment]).

5.5.11. d_groupsRequest

The d_groupsRequest thread is responsible for processing incoming d_groupsRequest messages from other fellows. When a durability service receives such a message from a fellow, it will send information about its groups to the requestor by means of d_newGroup messages. This thread collects the group information, packs it in d_newGroup messages and sends them to the requestor. It will only do something when a d_groupsRequest has been received from a fellow; most of the time it will sleep.

5.5.12. d_nameSpaces

This thread is responsible for processing incoming d_nameSpaces messages from other fellows. Durability services send each other their namespaces state so that they can detect potential conflicts. The d_nameSpaces thread processes and administers every incoming d_nameSpaces message. When a conflict is detected, the conflict is scheduled, which may cause the conflictResolver thread to kick in.

5.5.13. d_nameSpacesRequest

The d_nameSpacesRequest thread is responsible for processing incoming d_nameSpacesRequest messages from other fellows. A durability service can request the namespaces from a fellow by sending a d_nameSpacesRequest message to the fellow. Whenever a durability service receives a d_nameSpacesRequest message it will respond by sending its set of namespaces to the fellow. As a side effect, new fellows can be discovered if a d_nameSpacesRequest is received from an unknown fellow.

5.5.14. d_status

The d_status thread is responsible for processing incoming d_status messages from other fellows. Durability services periodically send each other status information (see the statusThread). NOTE: in earlier versions, missing d_status messages could lead to the conclusion that a fellow had been removed. In recent versions this mechanism has been replaced: the durability service now bases the liveliness of remote federations on heartbeats (see the dcpsHeartbeatListener thread). Effectively, the d_status message is no longer used to verify the liveliness of remote federations; it is only used to transfer the durability state of a remote federation.

5.5.15. d_newGroup

The d_newGroup thread is responsible for handling incoming d_newGroup messages from other fellows. Durability services inform each other about the groups in their namespaces by sending d_newGroup messages to each other (see also the d_groupsRequest thread).

5.5.16. d_sampleChain

The d_sampleChain thread handles incoming d_sampleChain messages from other fellows. When a durability service answers a d_sampleRequest, it must collect the requested data and send it to the requestor; the collected data is packed in d_sampleChain messages. The d_sampleChain thread handles the incoming d_sampleChain messages and applies the configured merge policy to the data. For example, in case of a MERGE it injects all the received data into the local group and delivers the data to the available readers.

Configuration parameters:

  • //OpenSplice/DurabilityService/Network/Alignment/AligneeScheduling

5.5.17. d_sampleRequest

The d_sampleRequest thread is responsible for handling incoming d_sampleRequest messages from other fellows. A durability service can request historical data from a fellow by sending a d_sampleRequest message. The d_sampleRequest thread is used to process d_sampleRequest messages. Because d_sampleRequest messages are not handled immediately, they are stored in a list and handled later on (see thread sampleRequestHandler).

5.5.18. d_deleteData

The d_deleteData thread is responsible for handling incoming d_deleteData messages from other fellows. An application can call delete_historical_data(), which causes all historical data up to that point in time to be deleted. To propagate the deletion of historical data to all available durability services in the system, durability services send each other a d_deleteData message. The d_deleteData thread handles incoming d_deleteData messages and takes care that the relevant data is deleted. This thread will only be active after delete_historical_data() has been called.

5.5.19. dcpsHeartbeatListener

The dcpsHeartbeatListener is responsible for the liveliness detection of remote federations. This thread listens to DCPSHeartbeat messages that are sent by federations. It is used to detect new federations as well as federations that disconnect. This thread will only do something when there is a change in the federation topology; most of the time it will be asleep.

5.5.20. d_capability

This thread is responsible for processing d_capability messages from other fellows. As soon as a durability service detects a fellow, it sends its list of capabilities to the fellow. The fellow can use this list to find out what functionality is supported by the durability service. Similarly, the durability service can receive capabilities from the fellow. This thread is used to process the capabilities sent by a fellow and will only do something when a fellow is detected.

5.5.21. remoteReader

The remoteReader thread is responsible for the detection of remote readers on other federations. The DDSI service performs discovery and reader-writer matching; this is an asynchronous mechanism. When a durability service (say A) receives a request from a fellow durability service (say B) and DDSI is used as the networking service, A cannot be sure that DDSI has already detected the reader on B that should receive the answer to the request. To ensure that durability services only answer when all relevant remote readers have been detected, the remoteReader thread keeps track of the readers that have been discovered by DDSI. Only when all relevant readers have been discovered are durability services allowed to answer requests. This prevents DDSI from dropping messages destined for readers that have not been discovered yet.

5.5.22. persistentDataListener

The persistentDataListener thread is responsible for persisting durable data. When a durability service retrieves persistent data, the data is stored in a queue. The persistentDataListener thread takes the data from the queue and stores it in the persistent store. For large data sets, persisting the data can take quite some time, depending mostly on the performance of the disk.

Note that this thread is only created when persistency is enabled (i.e. //OpenSplice/DurabilityService/Persistent/StoreDirectory has a value set).

Configuration parameters:

  • //OpenSplice/DurabilityService/Persistent/Scheduling
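
A sketch that enables persistency by setting a store directory (the path is hypothetical):

  <DurabilityService name="durability">
    <Persistent>
      <StoreDirectory>/opt/ospl/durability</StoreDirectory>
    </Persistent>
  </DurabilityService>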

5.5.23. historicalDataRequestHandler

This thread is responsible for handling incoming historicalDataRequest messages from durability clients. In case an application does not have the resources to run a durability service but still wants to acquire historical data, it can configure a client. The client sends HistoricalDataRequest messages to the durability service. These messages are handled by the historicalDataRequestHandler thread.

Note that this thread is only created when client durability is enabled (the //OpenSplice/DurabilityService/ClientDurability element exists).

5.5.24. durabilityStateListener

This thread is responsible for handling incoming durabilityStateRequest messages from durability clients.

Note that this thread is only created when client durability is enabled (the //OpenSplice/DurabilityService/ClientDurability element exists).