Release Highlights

This page contains a list of all bug fixes and changes incorporated in the OpenSplice V7.x series of releases.
  • Fixed Bugs and Changes
  • OpenSplice V7.1.0 contains the following changes
  • OpenSplice V7.0 contains the following changes

    Fixed Bugs and Changes

    7.1.0

    Report ID. Description
    OSPL-14910 DLite can wrongly terminate under high load because it fails to renew its liveliness lease
    Under high load (e.g. alignments that take longer to publish than two times the discovery interval) DLite may fail to update its liveliness lease in time. When that happens, DLite will be wrongly killed by the Splice daemon.
    Solution: Additional time-based lease-renewal points have been added in DLite to guarantee timely lease renewal.
    OSPL-14901 / 00021806 Fix for CVE-2020-18734
    A stack buffer overflow in ddsi q_bitset_template.h causes a crash. More info at https://nvd.nist.gov/vuln/detail/CVE-2020-18734
    Solution: The defect is fixed by adding a bounds check on the buffer.
    OSPL-14899 Samples may be incorrectly dropped when samples from the same writer arrive out-of-order.
    When a writer writes a sample, a sequence number is added to the sample. The sequence number is incremented for each sample. The written sample is put into the network queue to be processed by the network service, which transmits it to the other nodes. However, the network service (either ddsi2 or RT networking) applies its own sequence number scheme to provide reliable communication and in-order delivery. Recently an input filter was implemented at the receiving side which checks whether samples from the same writer arrive in order. Normally that will be the case. However, when the network queue at the sending side is full, samples from different writer instances may not be placed in the network queue in order of writer sequence number, and the input filter may then incorrectly drop these samples. The same may occur when using transient-local.
    Solution: The input filter has been removed.
    OSPL-14897 DLite must give aligners with a configured catchup scope precedence above other aligners.
    All nodes in a system must perform catchup when an aligner joins with an explicit catchup scope. It is important that this aligner provides data before any other potential aligner node, because another node may otherwise provide data that should be deleted by catchup and contaminate the data of the catchup aligner.
    Solution: The catchup scope is added to the DLite scheduling algorithm so that catchup nodes take precedence over non-catchup nodes. Multiple catchup nodes can cause consistency problems when scopes overlap; in such cases a particular preference is likely desired, which can be controlled with the DLite priority attribute.
    OSPL-14895 / 00021882 The networking bridge terminates shortly after being started.
    The networking bridge listens to the builtin topics to determine if a certain topic should be forwarded via the bridge or not. For that purpose a waitset is used. The main loop waits on this waitset for events to be processed. When the networking bridge has to terminate it triggers the waitset to return from waiting. The event handler in the networking bridge incorrectly sets the terminate flag when handling a trigger event which causes the networking bridge to terminate.
    Solution: The networking bridge event handler ignores the trigger event.
    OSPL-14893 DLite may unjustly dispose ALIVE instances when using tombstone_cleanup_delay
    If you configure DLite to use a tombstone_cleanup_delay, then any tombstones exceeding this specified duration will be purged and DLite will record the sequence number of the most recently purged tombstone as the PSN (the Purge Sequence Number). This PSN indicates the boundary between sequence numbers for which differential alignment is possible (those > PSN) and those for which a catchup policy will need to be applied (those <= PSN). However, due to a bug in the alignment algorithm, samples with a sequence number < PSN belonging to ALIVE instances may unintentionally end up disposing their instance nevertheless.
    Solution: The alignment algorithm has been modified so that aligned samples with a sequence number < PSN will not automatically mark their instance for disposal anymore.
    OSPL-14892 A DLite service leaks group info records in the internal pending list if they are not part of the global partition
    A DLite service will try to create kernel groups when groups are being provisioned by other DLite services and are locally unknown. The creation may fail when the required topic for a group doesn't exist; in that case DLite will store the group information in a pending list until the topic becomes available. When a topic becomes available, a group lookup and group create is performed on all matching groups in the pending list. Unfortunately the group lookup function only returns groups from the global partition and not from other partitions. This results in leakage of group info records in the pending list. The groups that are not created because the lookup failed are not a serious problem: they will eventually be created in the next alignment, which will be triggered by the remaining differences between DLite services.
    Solution: The group lookup function has been fixed so that it also returns groups from partitions other than the global partition.
    OSPL-14891 Make DLite protocol extensible for future requirements
    The DLite protocol topics shall allow additional fields to support future extensions and remain backward compatible.
    Solution: A name-value type is introduced and each protocol topic is extended with a sequence of name-value pairs to support future extensions. DLite will ignore any name-value pairs it doesn't understand. Note that this does break backward compatibility with the previous V7.0 and older releases
    OSPL-14890 / 00021878 Memory leak when using the dispose_all_data function.
    When dispose_all_data is called an internal Topic (DCPSCandMCommand) message is written to communicate the dispose_all_data to the other nodes in the system. This caused a small memory leak at the node where dispose_all_data is called.
    Solution: After writing the DCPSCandMCommand message, the memory is freed.
    OSPL-14885 Catchup may cause repeated alignment when the aligner does not have a writer that the alignee has.
    When a catchup is performed and finished, the alignee should have the same state as the aligner. If the aligner detects after the catchup that there is still a difference with the alignee, it will perform a catchup again. In this case, it appeared that the alignee had knowledge and samples of a (not alive) writer which was unknown to the aligner. Although the first catchup caused the removal of the samples of this writer, the aligner did not remove the catchup ranges from the fellow's state and gap. This caused the aligner to determine that there was still a difference in the state, causing a new alignment.
    Solution: When an aligner provisions data to fellows that have writers with data within the catchup scope that are unknown to the aligner, the aligner will remove these writer data ranges from its fellow's state and gap, because the fellow will perform catchup and no longer have the data. After this the aligner's perception of the fellow's state and gap matches the actual state and gap of the fellow and no longer causes repetitive alignments.
    OSPL-14882 A catchup alignment could incorrectly remove newer live data that was received before the alignment was completed.
    A full catchup alignment has to result in the alignee getting the same state as the aligner. For that purpose the full catchup will remove and dispose those instances for which there are no samples in the alignment data set. However, it may occur that some of the instances have been updated by live data before the alignment has been completed by the DLite service. In that case the catchup may incorrectly remove samples from the transient store which are newer than the data contained in the alignment snapshot.
    Solution: When performing a full catchup, samples newer than the snapshot time of the alignment data are not removed.
    OSPL-14880 Creation of new groups can result in realignment of already aligned data
    Once alignment has taken place, all fellow gaps are removed at the aligner, on the assumption that they have been resolved by the alignment that took place. Since the introduction of the NACK protocol changes, fellows no longer send their updated state after alignment, so the fellow states are not updated either. The combination of removing the gaps and not updating the fellow states means that the aligner will recalculate the previous gaps again after a topology change, and will again align data that was previously aligned.
    Solution: Instead of deleting the gaps and retaining the old fellow state, the aligner now updates the state and gaps with the aligned data so that they reflect the actual state of the fellows. Any subsequent topology change that recalculates gaps will then calculate the correct gaps and not resend a previously aligned state difference.
    OSPL-14879 On some platforms the ddsi service incorrectly reports the error "Failed to create interface monitor handle"
    To monitor network interface changes the ddsi service uses the rtnetlink functionality. Depending on the platform, rtnetlink support is either enabled or disabled in the provided OpenSplice release. When rtnetlink support is disabled the ddsi service incorrectly reports the error message.
    Solution: When rtnetlink support is not enabled in the OpenSplice release, the ddsi service no longer reports the error message.
    OSPL-14878 Instance in the NO_WRITERS state might be revived incorrectly
    When an instance in the NO_WRITERS state (written by a Writer with its autodispose set to FALSE) is consumed by a DataReader, it gets purged after expiry of the RetentionPeriod, because the NO_WRITERS state is considered an end-of-lifecycle event and no new updates are to be expected. However, if due to some external event (for example a late joiner, or a disconnect/reconnect cycle on another node) this same instance gets re-aligned (and therefore multicast to all DLite services), the node that already consumed this data will get it re-inserted into its DataReader as well, which, because it purged all traces of its previous incarnation, will revive the instance as if it were a completely new, never-before-seen instance.
    Solution: The transient store will check whether an aligned sample is a duplicate of something that is already there. If that is the case, it will no longer be forwarded to the Reader, preventing its revival.
    OSPL-14874 / 00021718 A disconnection during an alignment session may cause alignment to never complete.
    When during an alignment session a disconnection occurs, the alignee is not made aware that the session will never be closed, and continues waiting for the session to continue. The configurable session timeout that would inform the alignee about this has recently been defaulted to infinite (where it used to be a finite value before), causing the session never to be closed from the perspective of the alignee. For that reason, the alignee considers itself to still be in the BUSY state, causing the aligner not to start another alignment session after it re-connects.
    Solution: When the alignee discovers that its aligner has disconnected, it will now automatically close any open alignment session from that aligner, resetting its BUSY state when no other alignment is taking place. This causes the aligner to start another alignment session upon the reconnect.
    OSPL-14869 / 00021863 Catchup may cause repeated alignment when the aligner does not have a writer that the alignee has.
    When a catchup is performed and finished, the alignee should have the same state as the aligner. If the aligner detects after the catchup that there is still a difference with the alignee, it will perform a catchup again. In this case, it appeared that the alignee had knowledge and samples of a (not alive) writer which was unknown to the aligner. Although the first catchup caused the removal of the samples of this writer, the alignee still communicated the existence of this writer to the aligner. This caused the aligner to determine that there was still a difference in the state, causing a new alignment.
    Solution: When the alignment data does not contain a writer which is locally known and a catchup is applied, then this writer state is reset to the initial state, which will prevent it from appearing in the state information communicated by the provisioned node.
    OSPL-14865 / 00021858 Templatized constructor for dds::core::InstanceHandle from anything needs to be explicit.
    In the isocpp2 API, dds::core::InstanceHandle has a templatized constructor that can accept any kind of parameter. That means that the compiler can invoke this constructor whenever it determines that a certain argument does not match the expected parameter type but an InstanceHandle would. This may lead to undesired conversions that the application programmer never intended.
    Solution: The templatized InstanceHandle constructor from anything is now made explicit, so that the application programmer would need to explicitly invoke this constructor when such a transformation is desired.
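    As an illustration, a minimal isocpp2 sketch of the effect of this change; the data type Msg and the helper remember_handle are hypothetical:

        #include <dds/dds.hpp>

        void remember_handle(const dds::core::InstanceHandle& handle);   // hypothetical helper

        void example(dds::pub::DataWriter<Msg>& writer, const Msg& sample)
        {
            // register_instance already returns an InstanceHandle, so this keeps working:
            dds::core::InstanceHandle handle(writer.register_instance(sample));
            remember_handle(handle);

            // Before this fix an unrelated argument could silently be converted into an
            // InstanceHandle through the templatized constructor:
            // remember_handle(sample);                              // no longer compiles implicitly
            // remember_handle(dds::core::InstanceHandle(sample));   // still possible, but now explicit
        }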
    OSPL-14864 C99 Throughput example was reporting large numbers of samples as "out of order".
    When running the C99 Throughput example, the receiving side would report many messages as being "out of order". This report was not so much caused by order inversion between messages, but by gaps in subsequent sequence numbers, which were caused by the fact that the Reader had a KEEP_LAST policy with a depth of 1 instead of a KEEP_ALL policy like it should have had. This caused some messages to be overwritten before they could have been consumed, causing the reported sequence number gap.
    Solution: The Reader used in the C99 Throughput example now uses the correct KEEP_ALL policy by deriving itself from the TopicQos, which was already KEEP_ALL. This is done by no longer passing the ReaderQos explicitly, but by passing a NULL pointer instead. This causes the Reader to copy from the TopicQos instead.
    OSPL-14858 Unexpected disposal of data after a catchup.
    When a reconnect occurs, a DLite service that acts as aligner will send its data to an alignee. In the case where a catchup was configured, the alignee is expected to catchup the data from the aligner. As part of the procedure to catchup the data, the alignee will dispose the data that is not being aligned by the aligner. If the alignee was not able to process all the alignment data (e.g., because a topic definition is (still) unknown), then the catchup procedure should be aborted to prevent the data from being wrongly disposed. However, due to a bug the catchup procedure was still being carried out even in the case where a failure was detected. This could lead to unexpected disposal of data after a catchup.
    Solution: When an error during alignment is discovered, the catchup procedure is aborted and no data is disposed.
    OSPL-14853 / 00021853 An application that causes heap corruption may hang when the ospl signal handler tries to release and clean up the corresponding shared memory resources.
    When a termination or exception signal occurs in an application, the ospl signal handler will try to detach the application from shared memory. Depending on the setting of the InProcessExceptionHandling configuration parameter, the signal handler will either try to clean up the shared memory resources allocated to the application or leave this to the spliced daemon. The latter is the safest option in a deployed system. However, cleaning up the shared memory resources may use the malloc function to temporarily allocate some memory on the heap; this may also occur during normal operation. When the application causes heap corruption and raises a SEGV, the malloc used by the signal handler to clean up may again cause a SEGV in the signal handler thread, which in this case caused a deadlock and made the application hang instead of exiting with a core file. Note that it is not possible to remove the use of malloc when accessing shared memory. What can be done is to limit the use of malloc in the critical path to limit the chance that this problem occurs. Another problem is that the signal handler is not able to handle additional signals that may occur while handling the first signal. For example, when InProcessExceptionHandling is set to true, the cleanup of the resources may cause operations on DDS entities that occur during the handling of the signal to throw exceptions (isocpp2). When these exceptions are not caught by the application, an ABORT signal is raised, which could not be handled correctly by the signal handler.
    Solution: The use of malloc has been removed from some of the critical paths in the ospl kernel, and the signal handler no longer aborts when a second signal occurs during handling of the first signal.
    OSPL-14847 / 00021846 When DDS security is used the tuner may not be able to create a reader for topics allowed by security.
    For the tuner to create a reader or writer for a certain topic it needs the associated topic and type definition. Topic discovery would provide that information. However when DDS security is used the topic discovery information may not be distributed. It appears that topic information is only sent when the permission file specifies the all wildcard "*" for the partition part of an allow rule. At the moment the topic is created the ddsi service will try to send topic discovery information and will ask the access control plugin if that is allowed. However the access control plugin will reject that request because the partition related to a reader or writer is not yet known.
    Solution: When DDS security is enabled topic discovery information is sent when access control permits the creation of a reader or writer. In this case the associated topic information will be distributed.
    OSPL-14772 / 00021794 The spliced daemon may deadlock on termination when a service did not terminate in time and the spliced daemon forcefully terminates the service.
    When a process terminates or crashes, the spliced daemon will try to clean up the shared memory resources that the process may have left behind. When a process crashes while accessing shared memory, which should not occur, the spliced daemon will not try to clean up the shared memory resources and will terminate, because the state of the shared memory could be compromised. However, in this case the spliced daemon receives a termination signal and starts terminating and shutting down the services. It appears that the ddsi and durability services do not terminate fast enough, which causes the spliced daemon to send a kill -9 to these services. Although the spliced daemon is in the terminating phase, it still detects that the durability process has terminated and that the durability process was accessing shared memory when it received the kill -9 signal. Because the spliced daemon is already in the terminating state, it does not check whether the shared memory is compromised and starts cleaning up the shared memory, which causes the deadlock because the durability service was still holding a lock.
    Solution: When the splice daemon during termination has to forcefully terminate a non-responding service, it directly terminates without performing a cleanup action.
    OSPL-14694 / 00021729 Rank values and GenerationCount values of SampleInfo object in isocpp2 are always set to 0.
    The attributes in the rank() object and generation_count() object in the SampleInfo of the ISOCPP2 API were always set to 0, even in cases where they should have been > 0. This was caused by the Reader modifying a copy of the object instead of its original value.
    Solution: Instead of obtaining a copy of the rank() object or generation_count() object and modifying its attributes, we now instantiate a new rank() or generation_count() object and set that directly into the SampleInfo.
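    For reference, a minimal isocpp2 read loop (with a hypothetical IDL type Msg) that inspects the fields that were previously always zero:

        #include <iostream>
        #include <dds/dds.hpp>

        void print_sample_info(dds::sub::DataReader<Msg>& reader)
        {
            dds::sub::LoanedSamples<Msg> samples = reader.read();
            for (const auto& s : samples) {
                const dds::sub::SampleInfo& info = s.info();
                std::cout << "sample rank: " << info.rank().sample()
                          << ", disposed generations: " << info.generation_count().disposed()
                          << ", no_writers generations: " << info.generation_count().no_writers()
                          << std::endl;
            }
        }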
    OSPL-14681 / 00021724 Updated used SQLite version
    The SQLite version used in OpenSplice contained CVE vulnerabilities.
    Solution: The version of SQLite we use has been updated to 3.41.0 for the following OpenSplice product codes: P839, P840, P792, P738, P704.
    OSPL-14676 / 00021718 Data can be missed when an autodispose writer reconnects.
    In the case where an autodispose writer disconnects, an invalid sample is created to communicate that the data has become disposed. Once the reconnect occurs, the data has to be made alive again. This is typically done by reinserting the data. However, in the case where the invalid sample was pushed out of the history, the information that data could become alive again after a reconnect is lost. In this case, the timestamp that is used to remember the last time data was consumed may be set incorrectly. This incorrect setting could prevent data from becoming visible after a reconnect.
    Solution: The mechanism to correctly deal with the last consumed timestamp when an invalid sample is pushed out of the history is fixed.
    OSPL-14675 / 00021716 There was a memory leak in the kernel, when setting QoS on a Subscriber.
    There was a memory leak in the OpenSplice kernel which would occur when using any of the APIs to set a QoS on a Subscriber.
    Solution: The memory leak was fixed.

    7.0

    Report ID. Description
    OSPL-14846 Missing state change messages on instances becoming alive after reconnect
    Instances that have become autodisposed after a disconnect do not always become alive again after a reconnect when no new data is published. These instances should become alive again regardless of whether new data is published, and in the absence of new data the last known state should be made available to readers to communicate the state change. The aforementioned reinsertion of the last known state is currently not made available to readers because DDSI wrongly unregisters disconnected writers explicitly.
    Solution: DDSI now unregisters disconnected writers implicitly.
    OSPL-14836 DLite must not align the builtin topics when RT networking is configured to align these.
    When the RT networking service is configured to align the builtin topics, the DLite service must not do the same and should refrain from aligning the builtin topics as well. Note that the default setting of the networking service is not to align the builtin topics.
    Solution: DLite checks whether the configuration file specifies that RT networking will align the builtin topics and, if so, refrains from aligning them itself.
    OSPL-14832 Dlite internal unsafe multiplication of configuration discovery_interval with discovery_limit.
    Within DLite the discovery_interval is multiplied by the discovery_limit to get the maximum period by which alignment can be delayed by new fellows. A normal integer multiplication is used, but this can lead to invalid values because it doesn't consider special duration and limit values such as an invalid duration and an infinite limit. The resulting delay period will be incorrect when multiplying with these values and will affect the scheduling behaviour between DLite services.
    Solution: The multiplication of os_duration with an unsigned int is now performed by the operation os_durationMul(), which does consider these special values.
    OSPL-14825 / 00021833 [NodeJS] IDL containing a union field of type sequence did not work.
    If an IDL type contained a Union object with a field of type Sequence, then the NodeJS API would either fail with an obscure error or write data that did not contain the sequence data.
    Solution: NodeJS now supports sequence fields within a union.
    OSPL-14820 Removal of obsolete DLite configuration elements.
    The following DLite configuration elements have become obsolete:
    • Purging/Rule/catchup
    • Settings/startup_delay
    Solution: Both have been removed and are no longer accepted as configuration elements.
    OSPL-14818 DLite aligns transient-local data when ddsi is configured as a networking service.
    DLite is not expected to align transient-local data when ddsi is used as a networking service, because ddsi is already tasked with this responsibility. However, it was noticed that DLite did align transient-local data in this case. This is inefficient.
    Solution: When ddsi is used, transient-local topics are now excluded from alignment.
    OSPL-14812 Provide a single DDSI2 service which includes the features previously provided by the extended DDSI2E service.
    The DDSI2E service, which provides additional features, has been renamed to DDSI2, leaving a single DDSI2 service.
    Solution: The DDSI2 service provides all the features of the DDSI2E service.
    OSPL-14809 / 00021824 Potential lease expiry for DLite when processing a large batch of incoming beads.
    When DLite is processing a large batch of incoming samples, it might not renew its lease in time, since the thread responsible for processing the incoming beads is the same thread that is also responsible for renewing its lease during its idle time. If the batch of incoming beads is very large, the processing thread will process the entire batch prior to renewing its lease and responding to state messages of others. If processing this entire batch takes too much time, the spliced might consider the DLite service not responsive, and terminate the service. Also, other DLite fellows might be impacted by the DLite service not responding to state messages of others in time.
    Solution: A limit is now imposed on the size of a batch of beads to be processed. When a batch reaches this limit, it is processed in its entirety, followed by a lease renewal and a response to any incoming state messages. After that, the next batch of incoming beads is processed.
    OSPL-14806 In isocpp2, a scenario can occur where a waitset wakes up due to a spurious event where it actually should have timed out.
    Waitsets can be awoken by spurious wakeups. When in isocpp2 such a spurious wakeup occurs at the time when the timeout also expires, the timeout is ignored. As a result, no timeout is detected.
    Solution: The code has been changed so that in this (rare) event a timeout is returned.
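    For reference, a minimal isocpp2 sketch of the pattern involved, assuming a previously created read condition; with this fix the TimeoutError is raised even when a spurious wakeup coincides with the expiry of the timeout:

        #include <dds/dds.hpp>

        void wait_for_data(const dds::sub::cond::ReadCondition& read_condition)
        {
            dds::core::cond::WaitSet waitset;
            waitset.attach_condition(read_condition);
            try {
                dds::core::cond::WaitSet::ConditionSeq triggered =
                    waitset.wait(dds::core::Duration::from_secs(1));
                // ... handle the triggered conditions ...
            } catch (const dds::core::TimeoutError&) {
                // no condition triggered within one second
            }
        }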
    OSPL-14805 Order reversal between live received unregister message and aligned historical data message can lead to not alive instances wrongly becoming alive.
    The problem is caused by the kernel dropping unregister messages for empty and no writer instances because it wrongly assumes that data is always received in order and that there is no need to keep the unregister message.
    Solution: The unregister message is no longer dropped for an empty and no writer instance but is instead inserted so that older data can no longer wrongly alter the instance state.
    OSPL-14804 A persistent private group that is supposed to remain local can cause wait_for_historical_data() to block forever
    Persistent groups are loaded from disk during start up of a node. When networking uses the IgnoredPartition setting to prevent advertising these groups over the network, then the data in such groups is assumed to be local to the node. After all, networking will prevent any live data for such groups ending up at remote nodes. It turned out that such groups may (wrongly) be considered incomplete, which causes wait_for_historical_data() to block. Because alignment for local groups is also blocked, there is no way this group can become unblocked.
    Solution: Persistent private groups are now considered to be complete by definition.
    OSPL-14801 The configurator tool osplconf doesn't work.
    The configurator tool osplconf doesn't work. This is caused by an error in the xml document describing all configurator options, which is used as input by the osplconf tool. The error was that the //OpenSplice/Dlite/Provisioning/simultaneous element did not specify a maximum length. Furthermore, an example described under //OpenSplice/Purging/Rule was not rendered correctly.
    Solution: The missing element was added, and the example has been updated so that it is rendered correctly.
    OSPL-14799 When multiple writers exist for a group, purging based on PSN may lead to incorrect results.
    When multiple writers exist for a group, catchup based on the PSN (Publisher Sequence Number) is no longer viable because the PSN no longer relates to all registered sequence numbers. The aligner must instead align all messages and notify receivers that a full group catchup is required instead of a PSN-based catchup. This mechanism failed because the mark flag (which is used to dispose instances that have been purged) was not set on aligned messages when full group catchup is specified. As a result, instances are not unmarked by aligned messages and are wrongly removed, or instances belonging to a different writer are marked.
    Solution: In the described situations a full catchup is carried out, and instances are now properly marked.
    OSPL-14798 Fix memory leaks and some other minor issues in the coherency mechanism.
    Code reviews pointed out that the coherency mechanism was suffering from some memory leaks, and also that incomplete coherent sets might be delivered in the case where during alignment the start of a session was missed, in which case it would just flush everything it received so far up to the end-of-session.
    Solution: Memory leaks have been fixed, and for historical alignment purposes a coherent update is ignored when you have not received both the start and the end of the alignment session.
    OSPL-14797 DLite scheduling can hang causing a system not to become aligned.
    When a new writer is created and publishes data during alignment, a new gap can occur between the aligner and alignee that remains undetected by the aligner. This is caused by the aligner not seeing the difference between old and new gap, and results in the aligner taking no action. This problem normally doesn't persist forever; it can be resolved by new topology changes triggering new alignments. The problem can be detected in the DLite tracing, when DLite reports that a fellow has a GAP but it remains in the MONITORING state.
    Solution: The problem is solved by removing the old gap of fellows that are aligned so that it will detect the new gap.
    OSPL-14792 Improve the support for handling builtin topics by RT Networking.
    When the RT Networking Service is configured to align the builtin topics, it currently uses a distinct DDS reader and writer to exchange the builtin topic information. Using DDS readers and writers consumes unnecessary resources and causes the sending of the builtin topic information to be mixed with the application data, which may introduce extra and unnecessary latency. A better solution is to transmit the builtin topic data directly within the Networking Service.
    Solution: When the RT Networking Service is configured to handle the builtin topics it will send the serialized builtin topic samples directly and with priority.
    OSPL-14791 When using the RT Networking Service, a node that reconnects could cause a number of system heartbeat changes.
    When an RT Networking Service is configured with several reliable channels, each channel may cause a change in the state of the system heartbeat related to a remote node when that node reconnects. This may cause an unnecessary load on the spliced daemon and the Durability Service. This occurs because each channel maintains its own node administration. This node administration has to be made partly global to enable all channels to use the same information and to provide only a single state change of the system heartbeat related to a remote node.
    Solution: The node administration is shared between the network channels.
    OSPL-14788 Enable support for timestamps after the year 2038 in the OpenSplice APIs by default.
    Although previous versions of OpenSplice supported timestamps beyond the year 2038, the APIs still used a 32-bit value by default for the seconds member of DDS::Time_t. The newer isocpp2 and java5 APIs already supported the use of timestamps beyond the year 2038. For the classic C++ and classic Java APIs, the custom libraries could be used to generate support for timestamps beyond 2038. Support for timestamps beyond the year 2038 has now been enabled by default, which means that DDS::Time_t now uses a 64-bit value for the seconds.
    Solution: The definition of DDS::Time_t has been changed to support timestamps beyond the year 2038.
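    As a small illustration with the classic C++ API (MsgDataWriter is the IDL-generated writer for a hypothetical type Msg), a source timestamp beyond 2038 can now be expressed directly:

        #include "ccpp_dds_dcps.h"

        void write_beyond_2038(MsgDataWriter_ptr writer, const Msg& msg)
        {
            DDS::Time_t timestamp;
            timestamp.sec     = 2524608000LL;   // 2050-01-01T00:00:00Z, outside the old 32-bit range
            timestamp.nanosec = 0;
            writer->write_w_timestamp(msg, DDS::HANDLE_NIL, timestamp);
        }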
    OSPL-14787 / 00021808 Transient-local alignment may be slow in the case of large fragmented user samples.
    When retransmitting a fragmented message, ddsi will first send the first fragment of the sample to provide better flow control of large samples. This prevents complete samples from having to be retransmitted when fragments are lost. When using this mode, ddsi will handle one sample at a time and proceed with the next sample after the first sample has been completely acknowledged. This could cause the alignment of a large amount of transient-local data to become slow, because it is bound by the roundtrip latency. To accelerate the alignment, a number of samples could be partly retransmitted.
    Solution: When a number of fragmented samples are scheduled for retransmission, fragments of several of these samples are retransmitted before waiting for a nackfrag message.
    OSPL-14780 Writers with a finite deadline can auto-unregister instances.
    Creating a writer with deadline 1ns, writing 1 sample and then doing nothing forever after will eventually result in the instance being unregistered. This is evidently incorrect. The issue is caused by increasing the deadline counter regardless of whether auto unregister has been enabled or not.
    Solution: Increasing the deadline count now takes the auto unregister setting into account.
    OSPL-14770 / 00021791 The durability service may deadlock when resolving a connect conflict with nodes having a role defined.
    When detecting a fellow, the durability service creates a connect conflict. A connect conflict can be combined with an existing connect conflict from a different fellow, which enables a connect conflict to be resolved for all fellows in one alignment action. However, connect conflicts from fellows which have different roles cannot be combined. The role of a fellow becomes known when a namespace message is received. Initially, connect conflicts of fellows are combined, but when the role information becomes available the connect conflicts have to be split again. The split of the connect conflicts caused a deadlock because the lock of the conflict was not released when adding the split conflict to the conflict administration.
    Solution: When a connect conflict for a particular fellow is removed from a combined connect conflict because the role of this fellow does not match, then release the conflict administration lock before adding the new conflict.
    OSPL-14763 DLite slow processing of incoming alignment data.
    The main loop of DLite periodically waits in a waitset for incoming protocol messages. If alignment beads are received the waitset will unblock and DLite will process the incoming alignment data. However, every 100 messages DLite will pause processing of alignment and return to the main loop to verify if it needs to do some housekeeping. After that it should continue processing the remaining beads. However, it continues by calling the waitset wait again and this is incorrect in the case where no new data is received and remaining unprocessed data still exists. If no new data is received, the data available status is not set because it was reset in the previous cycle, so the wait will block until it receives new data or a timeout occurs (1 second), after which it will process the next 100 beads. This means that in a worst case scenario the processing of received alignment data will add a 1 second delay (timeout) for every 100 beads.
    Solution: The solution is to check if unprocessed data exists before entering the waitset wait. The waitset wait shall be skipped in the case where unprocessed data exists.
    OSPL-14759 / 00021786 Using ordered access with group scope could result in a segmentation fault.
    To provide ordered access with group scope, a subscriber-related resource is shared by the readers of this subscriber. Not all operations on this shared resource were properly locked, which could allow multiple readers to manipulate this shared resource at the same time, causing it to become corrupt.
    Solution: All concurrent operations on the shared resource which is used to provide coherent access on group scope are now properly locked.
    OSPL-14756 Memory leak in spliced command processing.
    When spliced processes a command, a string is allocated that is never freed. This results in a memory leak.
    Solution: The string is now freed when it is not needed any more.
    OSPL-14749 / 00021776 A sample from a reliable writer can get lost when the sample is written during startup of OpenSplice.
    When an application writer writes a sample, the sample is put in a queue. The samples on this queue are then handled by the ddsi service and forwarded on the network. To handle a sample the ddsi service needs the information about the application writer. For that purpose the ddsi service listens to the internally generated builtin topics that are created when the application writer is created. When the ddsi service reads a sample from the internal queue it checks if it already knows about the application writer, which means that it has received the internally generated builtin topic associated with the application writer. The ddsi service will drop the sample when it has not yet received the corresponding builtin topic. Normally this cannot happen because the builtin topic is created when the application writer is created, and thus before a sample can be written. However, during startup of OpenSplice it could occur that the ddsi service is not yet ready to receive the internally generated builtin topics; the resend manager is then responsible for providing the builtin topics to the ddsi service at a later time. In that case there is a small chance that the ddsi service retrieves an application sample from the queue before it has received the corresponding builtin topic, and then drops the sample.
    Solution: The spliced daemon will set the state to operational after the networking services have been initialized and are able to process the builtin topics. This will resolve the issue because the creation of a domain participant will wait until the state becomes operational.
    OSPL-14748 Simultaneous alignment
    Under normal circumstances, DLite services that have data to align will align one after the other (i.e., serialized alignment). Especially in cases where there are many candidate aligners, it may take a while before all of them have aligned their data. To speed up alignment, a new option has been implemented to allow simultaneous alignment (as opposed to serialized alignment). The configuration option //OpenSplice/DLite/Provisioning/simultaneous can be used to specify the maximum number of aligners that are allowed to align simultaneously. The intended behaviour of a DLite service that has configured this option to value n is as follows:
    • as long as there are other candidate aligners which have a priority higher than specified by //OpenSplice/DLite/Provisioning/priority, then this DLite service will NOT start alignment (i.e., highest priority wins).
    • if all other candidate aligners have a priority lower than //OpenSplice/DLite/Provisioning/priority, then this DLite service will start alignment.
    • if there are no other candidate aligners with a priority higher than //OpenSplice/DLite/Provisioning/priority, and there are multiple candidate aligners which have the same priority, then this node will start alignment if it belongs to the top n best aligners.
    Valid values for the //OpenSplice/DLite/Provisioning/simultaneous setting are integers >= 1, and 'inf' (for infinite). The default is 1 (indicating serialized alignment). Parallel alignment can speed up the overall alignment process because alignments can occur simultaneously. However, the downside is that simultaneous alignment can cause message collisions in cases where too much traffic is generated, potentially leading to temporary disconnects which cause realignments, and so on. This effectively slows down alignment. Note: the current implementation interprets all values > 1 as infinite, meaning that there is no bound on the number of nodes that align simultaneously.

    Solution: Rudimentary support for simultaneous alignment has been implemented.
    OSPL-14746 / 00021774 DLite can crash when persistent data needs to be persisted.
    When persistent data is published, a DLite service may need to persist this data (e.g., on disk). The producer of the data is decoupled from the consumer by a bounded queue: the data published by the writer is added to the queue (as long as there is space available), and the persistency thread takes this data and stores it on disk. It turned out that the implementation of this queue contained a bug, which could lead to buffer overflow and a crash.
    Solution: The implementation of the persistent queue has been refactored and simplified.
    OSPL-14744 Atomic creation of a snapshot
    When an aligner needs to align data, it creates a snapshot of the data to align. If an alignee needs to catch up with the data, the alignee must dispose the data that no longer exists at the time the snapshot was made. This requires that a snapshot is made atomically, and that the timestamp of the snapshot is used when disposing the data. In particular, in case of a catchup, any live data that was published after the snapshot was taken and inserted by an alignee before the snapshot data was received and injected should remain unaffected by the dispose. This was not happening, and this could lead to a situation where live data could become disposed.
    Solution: The timestamp at which a snapshot is made is now part of the alignment protocol.
    OSPL-14743 Fix various bugs in the catchup mechanism.
    We have found several issues in DLite that can cause the wrong behavior when running in catchup mode:
    • flags in dstate.h had an unintended overlapping postfix. This was erroneously introduced while manually resolving a merge conflict.
    • When a catchup session is performed for the second time, it will revert to differential alignment for every Writer that was already covered in the first alignment session.
    • L_MARK flag set in v_groupWriter prior to processing alignment data from a catchup session is not reset when receiving WriterBeads as part of that catchup session. This is wrong, since it will cause those writers to be considered dead, where in fact they may still be alive and kicking.

    Solution: The erroneous postfix in the flags has been removed, a full alignment is now applied in case of a second alignment, and the L_MARK flag is reset when beads are received.
    OSPL-14736 Improved efficiency of the DLite state protocol.
    DLite state messages are used for full state notifications on discovery and partial notifications on dynamic interest. Currently, on a dynamic interest change a full state message is communicated where only a partial update is required.
    Solution: An additional flag is added to the state message to indicate whether a full or partial state is communicated and in case of dynamic interest changes only a partial state is communicated.
    OSPL-14734 Update the decode-ddsi-log script to accommodate recent changes.
    In the past there have been changes to the ddsi logging. The script that analyzes ddsi logs has been updated to accommodate these changes. In particular, the script is now able to detect whether discovery completes when there are no readers (or writers) on the remote node, by updating the logic to distinguish a pre-emptive ACKNACK from an ACK in the absence of any data. Also, the group beads published by DLite are now parsed correctly.
    Solution: The decode-ddsi-log script has been updated.
    OSPL-14733 The dispatcher thread of services that watch "spliced" has the same name as the main thread.
    All OpenSplice services start a dispatcher thread watching "spliced", but give it the same name as the main thread. By giving it a separate name, it becomes easier to distinguish the main thread from this dispatcher thread when monitoring per-thread CPU loads.
    Solution: Each dispatcher thread now has a dedicated name.
    OSPL-14732 Potential crash when ddsi is under high load
    The part of DDSI that buffers messages received out-of-order tracks the number of received but discarded bytes. If the highest sequence number (interval) in the admin concerns a GAP message, and later arriving messages with lower sequence numbers push it out, a null pointer dereference would occur.
    Solution: The null pointer dereference is fixed.
    OSPL-14731 DLite may not start when persistency is not configured.
    DLite may not start if built without sqlite because one of the stubs it then calls was missing a return statement.
    Solution: The return statement has been added.
    OSPL-14730 The networking service may fail to execute the isolate node command properly.
    When the networking service receives the isolate node command, a reliable channel still has to handle the messages that were sent before the isolate command was issued. Thus it has to wait until all reliable messages are acknowledged before disconnecting the node from the network. This is to ensure that every node in the system has received the same messages. For that purpose it uses the number of messages that are still present in the resend queues associated with the receiving nodes. When this number reaches 0 the isolate is considered finished. However, it may occur that the count of queued messages is incorrectly increased twice when a message is resent.
    Solution: The count of queued messages in the resend queues is corrected.
    OSPL-14718 / 00021757 Alignment scheduling can get stuck after alignment in the presence of topology changes during alignment.
    The problem is that an aligner is not aware that it can provision additional data after alignment because it has failed to detect a new gap that became available during alignment. In this scenario all other nodes will wait for this node to provision the data, whereas the node itself does not act, and the system will hang until some other event occurs that causes a re-evaluation of aligners. It turns out that the status of the aligner is reset to MONITORING after alignment, assuming all gaps are resolved; however, that might not be true: if in the meantime a new writer was created and has published data, a new gap exists. In these scenarios the status should be set to EVALUATING, meaning that the node is aware of being the aligner and will act on it. The status is not set to EVALUATING because no change was seen in the existence of gaps (the old gap was replaced by the new gap).
    Solution: The solution is to check after alignment whether the state was reset to MONITORING and in that case, call dspace_update_status which will check for gaps and set the state to EVALUATING again if new gaps exist.
    OSPL-14714 / 00021756 Ddsi discovery of remote entities may fail after an asymmetrical disconnect.
    The ddsi discovery protocol for readers and writers uses transient-local semantics. When an asymmetrical disconnect occurs, caused by massive packet loss, it may occur that a transient-local reader does not receive all the data of the corresponding transient-local writer, because the writer did not notice the disconnect, assumes that all readers have received all the data, and does not send a heartbeat. The asymmetrically disconnected reader does not send an acknack to retrigger the transient-local realignment either, because a heartbeat from the writer was already received before the asymmetrical disconnect occurred.
    Solution: A reader keeps asking for data (by sending an acknack) at a configurable interval (1s by default) when it detects that it has not received all the data.
    OSPL-14713 / 00021525 Simulink integration functions idlImportSl and idlImportSLWithIncludePath fail if an output directory is specified.
    Both idlImportSl and idlImportSLWithIncludePath accept an optional final argument 'outputDirectory'. When a caller provides this parameter, the resulting call to the IDLPP processor will fail, resulting in the function failing.
    Solution: The order of arguments passed to IDLPP has been changed to prevent the failure.
    OSPL-14707 Discovery of remote transient /persistent groups can take a long time
    Responsibility for ensuring that transient/persistent groups become known in the system lies with DLite. Unfortunately, DLite disseminates them as part of alignment. If it takes a while before a node aligns its data, it also takes a long time for other nodes to discover transient/persistent groups that are known by the aligner. During this time, any live data that is published for such a group will not be stored by the nodes that have no knowledge of the existence of the group. Only when the alignee is made aware of the existence of the group is it able to receive live data. This is evidently an inefficient procedure, because all the live data that has been missed must be aligned. To speed up this process, it makes sense for a DLite service to actively discover the presence of transient/persistent groups for live writers and create a group as soon as it discovers one. In that way live data can already be received prior to alignment taking place.
    Solution: A DLite service is now equipped with a reader for DCPSPublications to discover the presence of transient/persistent writers. As soon as the topic definition is known, a DLite service can now match the topic against its interest set and create a group for it, prior to alignment taking place.
    OSPL-14699 It may occur that DLite fails to receive messages from fellow DLite instances.
    The OpenSplice implementation contains an object called group, which controls within an OpenSplice instance the distribution of the samples related to a particular topic-partition combination. As such the group has a function to route samples between local writers and local readers, and also to and from the configured networking services. A group is created when the first reader or writer for that group is created. On group creation the networking services should also connect to that group, to allow written samples to be forwarded on the network and received samples to be delivered to readers. It appeared that during initialization of the networking service there is a race condition in the notification mechanism that allows the networking service to connect to the group. In this case the networking service did not detect the presence of one of the groups related to the DLite service, which caused the DLite protocol to fail to operate correctly.
    Solution: The race condition that existed when retrieving the list of existing groups at startup of a service is resolved.
    OSPL-14697 Make the duration in which a state update is expected by DLite configurable
    Whenever a DLite service has aligned data, it expects a state message back from all fellow DLite services. To prevent the DLite service from waiting forever in cases where one or more of these fellow DLite services are not able to send a state message (e.g., because one of the fellows has crashed), a fail-safe mechanism has been implemented. This fail-safe mechanism currently expects a state message back within 10 * the discovery period (see //OpenSplice/DLite/Settings/discovery_interval). Although the duration in which to expect a state message can be indirectly configured by setting the discovery interval, this is not always desirable. In some cases you may want to increase the duration without increasing the discovery interval. To facilitate these use cases, it is better to make the duration used to determine whether the fail-safe must kick in independent of the discovery interval. For this reason, a new optional configuration option //OpenSplice/DLite/Settings/unresponsive_delay has been implemented that controls the interval during which a state message from a fellow is expected. Setting this value too low is undesirable, and a warning will be generated in ospl-info.log whenever the delay is smaller than 10 * discovery_interval. The default is infinite.
    Solution: A new configuration option //OpenSplice/DLite/Settings/unresponsive_delay is implemented that controls the interval during which a state message from a fellow is expected.
    OSPL-14695 On some machines the XML parser for ddsi2 could hang.
    On machines where "char" is signed (many machines), incorrect sign extension can cause the XML parser in ddsi2 to hang, incorrectly interpreting a 255 byte as EOF in one place and as regular data in another.
    Solution: The incorrect sign extension has been fixed.
    OSPL-14685 Whitelisting of partition/topics in ddsi, and regular expression support for blacklisting and whitelisting.
    Currently, ddsi allows blacklisting of partition/topic combinations to indicate which partition/topic combinations must NOT be sent over the network. In situations where there are many partition/topic combinations, it is cumbersome to add many entries to the blacklist. In these cases, it might be much more efficient to whitelist the partition/topic combinations that should be sent over the network. The idea is that everything that matches the blacklist is not sent over the network, and only the partition/topic combinations that match the whitelist will be sent over the network. If the whitelist is not specified, it is considered a whitelist that matches all partition/topics. An additional attribute (regex) has been implemented that indicates whether full POSIX regular expression matching should be used in the matching criteria.
    Solution: A whitelist is implemented that allows a user to specify which partition/topic combinations must be sent over the network.
    OSPL-14684 / 00021591 The functions ignore_participant, ignore_publication and ignore_subscription do not work correctly.
    The functions ignore_participant, ignore_publication and ignore_subscription do not work as intended. The idea is that you pass the instance handle of the entity you want to ignore (you can obtain this instance handle from the builtin topics, or from the function get_instance_handle on the Entity class), and then all data originating from that Entity will be discarded. However, something went wrong in the translation of the instance handle into the intended target, causing it not to be located and therefore not to be ignored.
    Solution: The translation from instance handle to intended target has been corrected, which causes the intended target to be ignored correctly.
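    A short classic C++ sketch of the intended usage, assuming the handle of the remote writer to ignore is taken from the DCPSPublication builtin topic ('participant' is an existing DomainParticipant):

        DDS::Subscriber_var builtin_sub = participant->get_builtin_subscriber();
        DDS::PublicationBuiltinTopicDataDataReader_var pub_reader =
            DDS::PublicationBuiltinTopicDataDataReader::_narrow(
                builtin_sub->lookup_datareader("DCPSPublication"));

        DDS::PublicationBuiltinTopicDataSeq data;
        DDS::SampleInfoSeq info;
        pub_reader->read(data, info, DDS::LENGTH_UNLIMITED,
                         DDS::ANY_SAMPLE_STATE, DDS::ANY_VIEW_STATE, DDS::ANY_INSTANCE_STATE);

        for (DDS::ULong i = 0; i < data.length(); i++) {
            bool unwanted = false;   // application-specific criterion on data[i] goes here
            if (unwanted) {
                // Discard all data coming from this remote writer from now on.
                participant->ignore_publication(info[i].instance_handle);
            }
        }
        pub_reader->return_loan(data, info);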
    OSPL-14683 / 00021723 A take/read_next_instance on a dataview may incorrectly fail.
    The take/read_next_instance on a view loops through the view instances until it finds an instance and sample that pass the provided instance and sample masks. However, the state of the sample was already changed before the check against the provided instance mask was performed. For example, the sample may already have been set to read before the check on the instance mask is performed, which then indicates that the sample does not match.
    Solution: When performing a take/read_next_instance operation on a view, it is now first checked whether the instance passes the provided mask.
    OSPL-14673 Alignment beads may get dropped prematurely, causing loss of data.
    When a DLite alignee receives alignment data from an aligner, the alignee will process the data. To prevent starvation in case processing takes too long, the processing is temporarily interrupted after a certain number of beads have been processed. When the processing of beads is interrupted, it can happen that the remaining beads are silently dropped, causing a memory leak and data state inconsistencies.
    Solution: No beads are silently dropped anymore.
    OSPL-14670 Dependencies should only be fulfilled by fellows that can act as provider.
    Currently, any fellow that matches the name expression of a Dependency is considered a provider and fulfills dependency requirements, regardless of whether the fellow will provision data or not. This can lead to situations where a dependency is resolved because a fellow appears that is not able to provide the required data. Consequently, wait_for_historical_data() may return empty-handed. In retrospect, it is logical not to consider fellows that are not willing to provide the data when checking if the minimum dependency has been reached. Only fellows that can actually provide the data should be taken into account.
    Solution: When checking if a dependency is fulfilled, the ability of a fellow to provide the required data is taken into account.
    OSPL-14668 Catchup wrongly processes groups for which the aligner is not a provisioner.
    Catchup should not process groups that match the catchup scope but are not provisioned by the aligner. By performing catchup on these groups, the alignee may end up in a state where data is disposed that should not have been disposed. The problem is caused by the fact that the aligner only communicates the catchup scope and not the provisioning scope.
    Solution: The provisioning scope is added to the session start bead so that alignees can prohibit catchup on groups not matching the provisioning scope.
    OSPL-14666 / 00021711 Java Exception in OpenSplice Tuner when viewing Query 'Data type' details
    In OpenSplice Tuner, if you select a Query object, view its details, and then switch to the 'Data type' tab, a Java exception occurs.
    Solution: The exception has been corrected, and the 'Data type' tab now correctly displays the data type of the Topic associated with the Query.
    OSPL-14662 dbd should display sample and instance states in a more user friendly way.
    dbd used to display instance states and sample states by the integer value of their combined flags. This was not very user-friendly, because the end user would then need to derive manually which bits were set and then derive which flags they represented. It would be much more helpful if dbd would do this work instead.
    Solution: dbd now displays instance and sample states as a concatenation of the names of the flags that are set.
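    A minimal sketch of this kind of decoding (illustrative flag names and bit values only; the actual kernel state encoding used by dbd is internal):

        #include <cstdint>
        #include <string>

        /* Hypothetical bit values, for illustration only. */
        enum StateBits : uint32_t {
            STATE_READ      = 1u << 0,
            STATE_NEW       = 1u << 1,
            STATE_DISPOSED  = 1u << 2,
            STATE_NOWRITERS = 1u << 3
        };

        /* Turn a combined flag word such as 5 into "READ|DISPOSED"
           instead of printing the raw integer value. */
        std::string state_to_string(uint32_t state)
        {
            std::string result;
            auto append = [&](uint32_t bit, const char *name) {
                if (state & bit) {
                    if (!result.empty()) result += "|";
                    result += name;
                }
            };
            append(STATE_READ,      "READ");
            append(STATE_NEW,       "NEW");
            append(STATE_DISPOSED,  "DISPOSED");
            append(STATE_NOWRITERS, "NOWRITERS");
            return result.empty() ? "0" : result;
        }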
    OSPL-14649 / 00021696 Previously disposed topics present in DDS after a restart of the system.
    When an instance becomes disposed and unregistered, the instance may be purged; the samples contained in that instance are then removed from the instance. For a persistent topic the messages should also be removed from persistent storage, to prevent these samples and instances from reappearing after a restart of the system. However, it appeared that samples purged from the instances in shared memory were not removed from the persistent store in all cases.
    Solution: In all cases where a sample of a persistent topic is removed the sample is also removed from the persistent storage.
    OSPL-14648 / 00021698 The creation of a durable group that is not supposed to be provisioned to other DLite services, can nevertheless cause state changes of fellows.
    When a durable group (e.g., a transient writer for a particular partition/topic combination) is created, other DLite services that are still unaware of the existence of the group are not necessarily in sync any more. To become in sync again, they have to check whether data has been published for this group, and if so, this data has to be aligned before the other DLite services are in sync again. This basically means that the creation of a durable group may cause fellows to become temporarily out-of-sync. Resolving this can be computationally intensive. The above procedure was carried out for the creation of all durable groups, even the ones that do not match the provisioning scope of the node that created the group. However, if a group does not match the provisioning scope, there is no need to resync, because this node is not going to provide the data anyway. This means all fellows are already synchronised. Not resyncing in this case means that no unnecessary calculations are needed, and wait_for_historical_data() may unblock earlier.
    Solution: When a durable group is created that does not match the provisioning scope, fellows are not considered out of sync and no resync is needed.
    OSPL-14647 / 00021695 00021686 00021634 Delivery of coherent data can fail when part of the data is received via alignment, and the other part via the live path.
    It is possible that part of a coherent data set is being delivered via alignment, and another part via live data. In the case where an alignee receives a complete data set via the live path, it is possible that the alignee wrongly delivers historical data of a previous, but still incomplete, transaction. This issue has been fixed by improving the coherency implementation, in particular by creating different administrations for the historical data path and the live path. This ensures that a complete data set which has been delivered via the live path does not affect the data that has been delivered via alignment.
    Solution: The administration for coherent data received via alignment is separated from the administration for coherent data that is received via the live path.
    OSPL-14646 / 00021697 When an alignee catches up with an alignment data set from an aligner, other alignees may end up disposing the same data set.
    An aligner can specify a catchup policy. In such a case, an alignee should "take over" the data set of the aligner by disposing the instances that the aligner does not have, and inserting the instances that the aligner does have but the alignee misses. Internally, this is done by marking all data instances of the alignee prior to the alignment. When alignment data is inserted, the mark indicator is reset. All instances that are marked after alignment has completed are apparently not present anymore and should be disposed. When a late joining node receives alignment data, then other nodes also receive this data. It turned out that nodes that had already received all the data (e.g., due to a previous alignment) still marked all instances. In the case where the data set of the node has no differences, such a node would never inject the data and therefore also never reset the mark indicator. Consequently, when alignment has completed all instances will be disposed.
    Solution: The optimization to not inject the data has been removed. Now the data is injected, and the mark indicator is reset. This prevents the data from becoming disposed.
    OSPL-14613 String values in metrics published by DLite are not properly formatted as JSON
    A DLite service can publish metrics that can be used to analyze the health of the DLite services. These metrics are published as a JSON-formatted string. It turned out that non-null string values in a JSON metric were missing their enclosing quotes, rendering the JSON invalid.
    Solution: Non-null string values in metrics published by DLite are now enclosed by quotes, as prescribed by the JSON syntax.
    OSPL-14609 / 00021671 Networking service stopped because a channel had reached the maximum number of defragmentation buffers.
    A networking channel has a configured maximum number of defragmentation buffers, which are used to store incoming packets before defragmentation and deserialization of the topic samples can be performed. The use of defragmentation buffers may increase when there is packet loss or when there is a peak load of incoming packets. In this case the problem seems to be caused by a peak load of incoming packets and the delivery queue becoming full, which causes incoming packets to be held in the defragmentation buffers until space in the delivery queue becomes available. Note that the delivery queue contains the topic messages that have to be delivered to the local readers. To reduce the risk of this issue occurring, the size of the delivery queue has to be increased.
    Solution: A configuration option has been added which allows the size of the delivery queue to be specified.
    OSPL-14608 / 00021647 After a reboot, catchup with an empty set does not work.
    Assume that an aligner has configured the catchup setting to indicate that alignees should catch up their data set with the aligner. When the aligner publishes a transient data set it will be received by the alignee. If the aligner subsequently reboots and throws away its data set during the reboot, then the aligner joins the system with an empty data set after the reboot. The alignee should catch up with the aligner again and consequently should dispose all of its data too. This was not happening, because the aligner has no data to align after the reboot. Because no data was being aligned, the catch up action was not invoked, and the alignee did not dispose the data.
    Solution: When the aligner determines if it has data to align, it now also takes into consideration if the alignee has writers that the aligner does not have anymore. If so, this is now considered as a reason to align.
    OSPL-14606 / 00021668 Durability incorrectly discards a native state conflict when receiving a namespace's message out of order.
    A durability protocol message could be received out-of-order. In this case, an old namespace message gets processed after a namespace message from the master node, indicating a state change which would normally generate a native state conflict. However, the processing of the old namespace message causes the namespace state to be reset, which causes the native state conflict to be discarded.
    Solution: The durability service discards messages which are older than the last handled message.
    OSPL-14605 The durability service may incorrectly regard a terminated fellow as alive.
    When receiving a message the durability service will check if the corresponding fellow is alive and discard the message when the fellow is not alive. However, it may occur that a message is already accepted and not yet processed when the durability service detects that the fellow has been terminated. In that case the handling of that message may make the fellow alive again.
    Solution: When adding a fellow as a result of handling a message from that fellow, check again if the fellow is not marked as terminated. A fellow can only become alive when the corresponding system heartbeat is received by the durability service.
    OSPL-14602 Deadlock in Durability during termination
    During termination of the durability service, a deadlock can occur when an action is present in the actionQueue. During termination, all actions are removed from the actionQueue; while doing this, each action is executed one last time. However, when durability is already in the terminating state, the execution of such an action can cause durability to deadlock.
    Solution: A check is added so that no action is executed when durability is in terminating mode, as there is no use in running the action at this stage.
    OSPL-14600 / 00021664 Alignment of coherent updates may lose coherent data
    The algorithm used to align coherent updates was squashing each individual coherent update into one combined coherent update, flushing it when the contributions of all participating Writers had been accounted for. For that purpose, it was important that the contribution of each individual Writer was collected prior to squashing it. For that reason, a sorting algorithm would line up each contribution first by Writer and then by ptid (Publisher TransactionId). However, the store used to hold these sorted contributions was itself unsorted. In other words, a sorted list went in, and an unsorted list came back out. This caused the flushing algorithm to sometimes flush prematurely, namely when every Writer had been accounted for, but not every contribution for that Writer had been consumed yet.
    Solution: The store used to hold individual Writer contributions now maintains the order in which the Writer contributions have been inserted.
    OSPL-14582 Memory leak when aligning topics
    A DLite service sends various types of bead as alignment data. A particular type of bead is a group bead. Such a bead is used to communicate the partition and topic name of transient/persistent data from the aligner to the alignee. It turned out that the topic name was leaking because it was accidentally allocated twice.
    Solution: The superfluous memory allocation has been removed.
    OSPL-14577 / 00021579 Coherent data sets that have recently been received are incorrectly disposed a few seconds later if a catchup scope is configured
    Suppose that an alignee needs to catch up a coherent data set from an aligner. When an aligner aligns the coherent data set to an alignee for the first time, the data set is received by the alignee. If the aligner creates new writers for partition/topic combinations that are not yet known by the alignee, then the alignee may need historical data from these writers. This will trigger a second alignment to acquire the data. The second alignment causes a catchup, which disposes the data that was received previously. The issue is caused by a bug in the calculation that determines which groups should be disposed. As a result, groups that should not have been disposed were accidentally disposed.
    Solution: The calculation bug is fixed.
    OSPL-14563 Shared memory leakage of kernel group writers.
    The kernel keeps track of all discovered writers per kernel group by means of a v_groupWriter object. This object leaked in several places:
    1. In DLite, after looking up the writer info used by alignment, the reference to the object is not freed.
    2. In the kernel, caused by unlocked modifications.
    3. In the kernel, volatile writer objects are not freed immediately when writers are deleted, but instead after the configured DLite alignstate-cleanup-delay, which is infinite by default and is not supposed to apply to volatile writers.
    Solution: All three issues are fixed: a c_free is added in DLite, the administration modifications are now performed within the locks, and a test for volatile writers is added that sets their cleanup time to zero.
    OSPL-14561 Unable to determine completeness of a group coherent transaction in case of non-matching writers
    When a node has interest in only a subset of coherent data, it does not need to take a subscription on the data that it is not interested in. To decide whether the data set is complete, the receiving node needs to calculate a digest over all participating writers in the coherent data set. To calculate such a digest, all writers that participate in the coherent data set should be known. However, if the node is only interested in a subset of the data, then the node will not receive any historical data that has been published by such a writer. Furthermore, if the writer is not present anymore, then it is also not possible to discover the writer. In such a case, it is impossible for the receiving node to calculate the digest.
    Solution: To still be able to decide whether a data set is complete, a workaround is implemented that assumes that all durable writers will eventually be discovered. As long as nodes do not take a subscription on a subset of coherent data, this assumption holds.
    OSPL-14558 / 00021627 Reader instances of a reader which uses user defined keys may leak.
    When performing a take operation on a reader that uses user-defined keys, the instance is purged directly from the reader when it becomes empty. However, in this case the reader instance is not properly released, which causes a memory leak.
    Solution: When removing an instance from a reader that uses user-defined keys, the instance is directly removed and released.
    OSPL-14557 / 00021626 A memory leak may occur when using a dataview.
    When performing a read or take on a dataview the read operation walks through the instance table associated with the view and temporarily increments the reference count of that instance. After processing this instance the reference count is not decremented which causes a memory leak.
    Solution: Release a dataview instance after it has been accessed by a read or take operation because the read or take increases the reference count of the instance.
    OSPL-14551 / 00021619 A segmentation fault may occur in ReadCondition::take_next_instance.
    The table containing the reader instances may be updated when a new sample is injected in the reader. At one location the update of this table is performed without locking the reader which causes a segmentation error when the table is modified during a read or take operation.
    Solution: The reader instance table is locked when accessing or modifying the table.
    OSPL-14549 / 00021617 Endless repeating alignment caused by missing sequence number range merging
    The observed behavior of readers not being triggered was caused by an aligner hanging in an endless provisioning loop. The aligner did not see that the provisioned data was accepted by the alignee because the alignee did not merge the whole range of aligned sequence numbers in its state that it sent back to the aligner. As a result the aligner started provisioning again and again. Other nodes that also wanted to become aligner, as a result, never got selected to provision data. This resulted into a cascade of completeness issues.
    Solution: The problem was caused by a side effect of a catchup fix: one statement to merge ranges was accidentally deleted from the code. This line has been added again to solve this issue.
    OSPL-14541 Tombstones remain indefinitely in the persistent store when they are not purged before shutdown.
    Tombstones are normally removed from the persistent store once purged. However, if the node shuts down before purging can be applied, the tombstones remain in the persistent store forever.
    Solution: On startup when loading persistent data, tombstones are detected and removed before writing persistent data back into the persistent store.
    OSPL-14535 Order reversal coherent updates caused by partially live received update
    Nodes that join a system in the middle of a coherent update will receive the last part of the update as live data, but require alignment to provide the first part of the coherent update. Only when the coherent update is complete can the data be delivered to the readers. In this scenario, order reversal between successive coherent updates can occur when a new coherent update is published after the partially live-received update but before alignment has completed that update. In this case the new coherent update will be delivered before the previous update that is still awaiting alignment.
    Solution: This problem is solved by additionally provisioning the messages of the unfinished coherent updates during alignment, rather than waiting until a next alignment is started to resolve the remaining differences. These additional messages from the incomplete coherent update will, together with the received live messages, complete the coherent updates immediately after alignment, and can be read before any newer updates when using wait_for_historical_data.
    OSPL-14526 / 00021596 A waitset wait on a readerview may deadlock.
    When the status or queue condition on a readerview is used in a waitset, the wait operation on the waitset may deadlock when the reader sample which has an associated view sample is purged at approximately the same time. The purge of the reader sample purges the associated view samples but it performs this operation without taking the corresponding view lock.
    Solution: The purge operations of the dataview instances are properly locked with the lock of the corresponding dataview.
    OSPL-14507 / 21577 Improper formatting of the nanosecond field in DLite traces.
    DLite traces are timestamped in a 'second.nanosecond' format. The nanosecond field was printed in a format where preceding zeros are omitted. This evidently leads to an incorrect timestamp. It could lead to a situation where log line 1 is published prior to log line 2, but the timestamp of log line 1 is later than the timestamp of log line 2.
    Solution: The formatting of the nanosecond field is fixed by including the leading zeros.
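    The underlying problem is the familiar formatting pitfall sketched below (a generic example, not the DLite trace code): printing the nanosecond field without zero padding turns, for instance, 5 seconds and 42 nanoseconds into "5.42", which reads as 5.42 seconds.

        #include <cstdio>

        void print_timestamp(long sec, long nsec)
        {
            /* Buggy: leading zeros of the nanosecond field are dropped,
               so (5, 42) is printed as "5.42". */
            std::printf("buggy: %ld.%ld\n", sec, nsec);

            /* Fixed: pad the nanosecond field to nine digits,
               so (5, 42) is printed as "5.000000042". */
            std::printf("fixed: %ld.%09ld\n", sec, nsec);
        }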
    OSPL-14491 Parse external commands for a DLite service that contain empty scopes.
    DLite services can react to external (topic-based) commands. One of these commands, the 'set-provisioning' command, can be used to change the provisioning scope. When the empty scope was set, a bug prevented this from having any effect, making the setting of the empty scope effectively a no-op.
    Solution: The set-provisioning command with an empty provisioning scope will now lead to an empty provisioning scope.
    OSPL-14487 / 00021579 Implement a command to change the catchup scope of a DLite service.
    DLite services can receive commands to change their behaviour. One of these commands is the set-catchup command. This command can be used to change the catchup scope of a DLite service (see the //OpenSplice/DLite/Catchup configuration setting). For example, the command "set-catchup:scope=*.*Prov" sets the catchup scope to all partition/topic combinations matching *.*Prov.
    Solution: The command is implemented, and the catchup scope can now be changed by sending a command.
    OSPL-14461 / 00021577, OSPL-14548 / 00021613 Fellows are considered prematurely NOTALIVE
    A DLite service keeps track of the state of other DLite services (called fellows). If a fellow is expected to send an update of its state, and the update does not arrive in time, then the fellow is declared NOTALIVE. Each time a sign of life is received from the fellow, the timer is reset. Due to an error in the logic, this timer was NOT reset in the case where a sign of life was received while the fellow did not indicate progress compared to its previous state. Not resetting the timer in this case may cause the timer to expire and the fellow to be declared NOTALIVE, even though the fellow is still alive. The fellow is discovered again when a sign of life is received, and in that case realignment is started. This cycle can continue multiple times, contributing to long alignment delays.
    Solution: The logic to declare a fellow NOTALIVE has been changed. Now a fellow is not declared NOTALIVE anymore when a sign of life is received even if the fellow did not show any progress.
    OSPL-14459 The 'reporting' attribute for the /Domain/UserClock element leads to an error.
    Every element in the configuration is validated against an xml meta description that describes the allowed elements in a configuration file. This ensures that every element that can be provided matches the format defined in the xml file. The downside of this validation step is that elements that are NOT in the xml are considered erroneous, so any configuration file containing elements that are not specified in the meta description is considered a misconfiguration. In this case the meta description xml had not been updated with the 'reporting' attribute, and so any configuration file that contains the reporting attribute leads to an error, even though OpenSplice actually contains the logic to cope with the reporting element.
    Solution: The xsd has been updated, and now contains the //OpenSplice/Domain/UserClock[@reporting] attribute.
    OSPL-14454 / 00021199 A crash occurs when kernel tracing is enabled.
    Kernel tracing can be enabled by setting an environment variable. When enabled, a crash occurs. The stacktrace that is obtained shows that an invalid print formatter for 64-bit integers on little endian platforms is responsible for the crash.
    Solution: The formatting of this print statement has been fixed.
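    The general class of problem is sketched below (a generic example, not the kernel tracing code itself): passing a 64-bit integer to a conversion specifier that expects a 32-bit int is undefined behaviour and can corrupt the variadic argument handling, while the portable fix is the PRId64 macro from <cinttypes> (or <inttypes.h> in C).

        #include <cinttypes>
        #include <cstdio>

        void trace_value(int64_t value)
        {
            /* Buggy: "%d" expects an int; passing a 64-bit integer is
               undefined behaviour and can crash on some platforms.
            std::printf("value = %d\n", value);
            */

            /* Portable fix: use the 64-bit conversion macro. */
            std::printf("value = %" PRId64 "\n", value);
        }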
    OSPL-14451 / 00021569 LabView incorrectly handled IDL enum fields
    The LabView integration incorrectly mapped IDL enum fields to 16-bit unsigned integers, instead of 32-bit unsigned integers. LabView applications using an IDL struct containing an enum field would incorrectly serialize that field and, potentially, any other fields following the enum.
    Solution: The LabView IDL import process has been updated. Re-import IDL files referencing enum fields so that updated VIs can be generated.
    OSPL-14414 Reader instances not being triggered by an aligned dispose or unregister message after reconnect.
    In the case where a reader has read a message that remains in the reader's history cache and a dispose or unregister message is aligned after a reconnect, a read with a not-read mask does not return the data, despite the fact that the received invalid sample has not been read. The received invalid message is supposed to piggyback on the existing valid sample; however, that sample is already marked as read and therefore will not be returned. For this reason the read flag of the existing sample is normally reset to make the invalid received sample accessible, but in this particular case this did not happen.
    Solution: An additional test is added to detect this particular case and reset the read flag on the existing sample.
    OSPL-14392 When (online) migration of a persistent store fails, this is not detected.
    To support users who want to use the DLite service instead of the legacy Durability Service, a utility has been developed that migrates a legacy persistent store to a persistent store that can be used by DLite (see the section on 'durability_migrate' in the deployment manual). This utility has an offline and online mode of operation. Currently, the utility does not return a proper return code in the case of failure. As a consequence, when this utility is started in online mode and migration fails, then DLite has no way of noticing that migration failed and DLite will start as if nothing bad has happened. This is incorrect. The proper action to take is to prevent DLite from starting because its precondition (the availability of a correctly migrated store) has not been met.
    Solution: The durability_migrate utility now returns an indication when something bad happened (0 for success, any other value indicates failure). DLite will terminate when the online invocation of durability_migrate has failed.
    OSPL-14376 / 00021510 Group coherent update can cause heap corruption and the DLite service to crash.
    Finalization of group coherent updates initializes a temporary array in heap memory. Initialization is performed by a loop that iterated once too often, thereby writing a word (NULL) beyond the end of the array. This potentially overwrites and corrupts the memory header of the next allocated piece of memory. If that happens, DLite will crash as soon as any malloc or free runs into the corrupted administration.
    Solution: Fixed the stop condition of the initializer loop so that it no longer writes beyond the end of the array.
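    The pattern behind this kind of heap corruption is the familiar off-by-one sketched below (a generic example, not the DLite source): an initialisation loop whose stop condition is '<=' instead of '<' writes one element past the end of the allocation and clobbers the allocator's bookkeeping for the next block.

        #include <cstddef>
        #include <cstdlib>

        void initialise(std::size_t n)
        {
            void **slots = (void **) std::malloc(n * sizeof(void *));

            /* Buggy: 'i <= n' iterates once too often and writes a NULL word
               just beyond the end of the array, corrupting the heap metadata
               of the next allocation.
            for (std::size_t i = 0; i <= n; i++) slots[i] = NULL;
            */

            /* Fixed stop condition: stay within the allocated n elements. */
            for (std::size_t i = 0; i < n; i++) {
                slots[i] = NULL;
            }

            std::free(slots);
        }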
    OSPL-14360 struct and union definitions that have the same name as modules in which they are defined can cause crashes.
    Internal processing (deserialization) of the XML type descriptor will try to process the module as if it was the struct definition and will lead to a crash. The problem is that an internal search operation during deserialization searches the wrong scope and returns the module with the same name.
    Solution: The search operation is replaced by one that searches the correct scope.
    OSPL-14340 / 00021199 Remote late joining group-coherent subscribers read unexpected instances or instances in the NOT_ALIVE_NOWRITERS state but expected NOT_ALIVE_DISPOSED state.
    Group Coherent Subscribers that join a system with a Group Coherent Publisher that continuously publishes Group Coherent updates, can partly receive Group Coherent updates. A late joining Subscriber can miss the beginning of a Group Coherent update but receive the end of the Group Coherent update. Alignment should provide the missing first part and complete the Group Coherent update. Three problems are found:
    1. The first part that was missed was not forwarded to the network after the update was completed.
    2. After alignment, all Group Coherent updates at the alignee side were flushed, including still-unfinished live Group Coherent updates, which were thereby flushed prematurely.
    3. If an unregister message was received out of order for an instance that was revived afterward, the unregister would still be applied to the now revived instance and set its instance state to NO_WRITERS, even though the instance should have remained ALIVE.
    These three issues can lead to missing and incomplete updates, which in turn can lead to instances not being disposed or removed, but also not being updated.

    Solution: Forwarding to the network of Group Coherent updates when becoming complete is added, and unfinished Group Coherent updates are no longer completed by the end of alignment. Also, unregister messages are no longer applied when their instance has been revived by newer messages.
    OSPL-14255 / 00021085 In certain circumstances BY_RECEPTION_TIMESTAMP topics (including the builtin topics) may go back in time.
    BY_RECEPTION_TIMESTAMP topics (this includes all builtin topics, which are BY_RECEPTION_TIMESTAMP according to the DDS specification) would always append newly arriving samples to the end of their corresponding instance queue. This would allow them to go back in time if samples are ever received out-of-order. One particular scenario where this could wreak havoc is when builtin topics that are aligned using either Durability or DLite (this is, for example, the case when using native networking, where the builtin topics do not get aligned as part of the networking protocol) get disposed before the transient snapshot (in which they are still included) arrives. In a case like that, you first get a DISPOSE message for a DCPSPublication through the live networking path, followed by the sample preceding it from the transient snapshot, which would result in the DCPSPublication representing a particular Writer ending up in the ALIVE state instead of in the DISPOSED state. This could cause Readers to assume there is still at least one ALIVE Writer while in fact there is not, causing their data to stay ALIVE even when this is incorrect. Also, mechanisms like synchronous reliability do not work correctly in scenarios like that: if a DCPSSubscription represents a synchronous Reader, then the writing side can be fooled into believing there is still a Reader that needs to acknowledge its data, when in fact this Reader has already left the system. The writing side will then block waiting for an acknowledgement that will never be sent, effectively blocking the Writer application indefinitely.
    Solution: BY_RECEPTION_TIMESTAMP topics will no longer be allowed to go back in time for data originating from the same DataWriter. Each individual instance with samples originating from the same source becomes effectively "eventually consistent".
    OSPL-14233 For the RT networking service, allow the use of the loopback interface.
    Specifying the loopback interface in the configuration of the networking service does not work. When selecting the network interface to be used, the networking service checks if the interface is multicast capable, which is normally not the case for the loopback interface. Therefore the networking service will ignore the configured loopback interface and select the first multicast-capable interface.
    Solution: When the networking configuration specifies the loopback interface, it is accepted without checking whether it is multicast capable. When using the loopback interface to communicate between networking instances on the same node, it is necessary for the GlobalPartition address to be a multicast address and for EnableMulticastLoopback to be enabled (which is the default).
    OSPL-14137 / 00020974 Kill SIGSTOP on spliced does not trigger any change of the ospl status.
    The ospl tool only watches the state of the spliced daemon through the corresponding key file. When a STOP signal is sent to the spliced daemon, it also stops the spliced daemon's watchdog, so the spliced daemon cannot detect this situation and inform the ospl tool. The lease manager is responsible for watching the progress of the services and of the spliced daemon itself.
    Solution: This problem is solved by attaching the ospl tool to the shared memory segment, where it checks whether the lease of the spliced daemon has expired and reports the corresponding status.
    OSPL-14012 / 00020823 Tester: Old browser nodes not removed even when "Show disposed participants" is unchecked.
    Items representing disposed entities in the tree view of the Tester "Browser" tab are not removed even when the "Show disposed participants" option is cleared. Instead, these tree items are displayed with an orange background.
    Solution: Tester has been updated to remove these nodes from the tree view, when the "Show disposed participants" option is cleared.
    OSPL-13885 / 00020778 Instance liveliness is not correctly managed when configuring native networking and DLite.
    Instance liveliness, when using native networking, depends on the alignment of the DDS built-in publicationInfo topic. DLite does not automatically align the DDS built-in publicationInfo topic; currently, the topic must be specified explicitly in the provisioning scope. Built-in topics should be provisioned automatically when not managed otherwise, without the need for explicit configuration.
    Solution: DLite internally extends the configured provisioning scope with the built-in topics in case these are not managed otherwise.
    OSPL-13743 / 00020702 Possible alignment mismatch when an asymmetrical disconnect occurs during alignment.
    When nodes get reconnected after being disconnected, a request for alignment data is sent. When there is an asymmetrical disconnect AFTER the aligner has received the request but BEFORE the aligner has actually sent the data, then the alignee drops the request but the aligner does not. When the asymmetrical disconnect is resolved, the alignee sends a new request for alignment data to the aligner. It now can happen that the aligner sends the alignment data of the FIRST request to the alignee, and the alignee considers this as the answer to the SECOND request. When the alignee receives the alignment data to the SECOND request, the data gets dropped because there is no outstanding request anymore. This can lead to an incorrect state whenever the alignment set has been updated between the first and the second request.
    Solution: The answer to the first sample request is not considered a valid answer to the second request any more.
    OSPL-12836 Whitelisting of partition/topics in ddsi
    Currently, ddsi allows blacklisting of partition/topic combinations to indicate which partition/topic combinations must NOT be sent over the network. In situations where there are many partition/topic combinations, it is cumbersome to add many entries to the blacklist. In these cases it might be much more efficient to whitelist the partition/topic combinations that should be sent over the network. The idea is that everything that matches the blacklist is not sent over the network, and only the partition/topic combinations that match the whitelist will be sent over the network. If the whitelist is not specified it is considered as a whitelist that matches all partition/topics.
    Solution: A whitelist is implemented that allows a user to specify which partition/topic combinations must be sent over the network.