Investigating issues with duplicate data in CMDB
Successful identification between multiple source datasets requires each of the following:
- Each data provider populates data to the same classes in the CMDB, but into different source datasets.
- Each data provider populates values to the attributes used for identification.
- Each data provider populates identical values to the attributes used for identification.
- Reconciliation identification rules look for a match on the identification attributes, from the most restrictive to less restrictive combination of attribute values.
- Reconciliation tries to identify the CI in the second source dataset after the first one has been identified and merged into the golden dataset
- Reconciliation identification rules ignore default values such as NULL or 0 in identification attributes.
- Data providers do no populate non-unique, non-default values like "N/A" or "None" to identification attributes.
- If host names are re-used by different computer systems, they are purged from the golden dataset before the new computer system using it is leveraged
Some of these requirements depend on implementation decisions and quality of data before Reconciliation is attempted. Below is a good sequence of steps to follow to determine which requirements are met.
|To check for ...||Query or operation||Results and next steps|
Duplicate values of IntegrationId attribute
Perform a query in the ARSystem database for the specified DatasetId and attribute, for example BMC.ADDM and ADDMIntegrationId
SELECT COUNT(*),ADDMIntegrationId FROM BMC_CORE_BMC_BASEELEMENT
WHERE DatasetId = 'BMC.ADDM'
GROUP BY ADDMIntegrationid HAVING COUNT (*) > 1 ;
The query should return no results for the IntegrationId, DatasetId combination.
Check if the discovery source has multiple CIs with the same value of the IntegrationId, or if one of the CIs is considered deleted in the discovery source. This may indicate a deletion was not propagated to the source dataset in CMDB.
Also check if the attribute used as an IntegrationId is recycled, or re-used by the discovery source after it is considered deleted. This is another case where you would not want to leave-soft-deleted data in the CMDB. See Investigating soft-deleted data issues
|Hosts with duplicate identification attribute values within a specified DatasetId|
Examine identification rules for the class and make a list of all attributes used for identification.
Then perform a query in the ARSystem database for a match on all these attributes within a dataset.
WHERE DatasetId = 'BMC.ASSET'
GROUP BY HostName, Domain, SerialNumber, TokenId, isVirtual HAVING COUNT(*)>1;
If this returns any results, it means the data in the identification attributes is insufficient to uniquely identify a CI. See section below on "Solving Reconciliation issues by adding identification attributes".
|Computer System Identification attributes not populated|
Perform a query in the ARSystem database:
SELECT Name, instanceid, DatasetId FROM BMC_CORE_BMC_ComputerSystem
WHERE TokenId = '0' AND SerialNumber IS NULL AND Hostname IS NULL;
|Computer Systems without values for TokenId, SerialNumber, or Hostname are the primary attributes for identifying a computer system.|
If none of these three attributes are populated, Reconciliation is unlikely to find a match with data populated by another data source. See section below on "Solving Reconciliation identification issues using an IntegrationID attribute"
|Identification attribute value consistency for Computer Systems|
Perform a query in the ARSystem database:
SELECT Name, SerialNumber, Hostname, Domain, TokenId, isVirtual FROM
ORDER BY Hostname, SerialNumber, DatasetId, TokenID;
|View the results to verify the identification attribute values are populated to both datasets in the same format, with consistent values and case. Also scan for meaningless values like N/A, None, or Unknown.|
|Identification attribute value consistency for weak members between datasets|
Perform a query in the ARSystem database, for example, on the BMC_OperatingSystem class:
GROUP BY Name, DatasetId, TokenId;
View the results to verify the Name value is used consistent across datasets.
For example, if the Name attribute is used to identify OperatingSystems, it is important that both data sources provide the same value of the operating system Name. Viewing the query results on classes of interest will keep the results list short enough to scan.
Sequence of operations
Select an example of a CI duplicated in the golden dataset which is merged from two source datasets with identical values for identification attributes.
Check whether the CreateDate of both CIs in the golden dataset are later than the Modified Date of both CIs in the source datasets. This indicates the CIs were both identified before either was merged to the golden dataset.
|Avoid this situation by scheduling reconciliation jobs appropriately so identification and merging of the first source dataset is complete, before the second reconciliation job runs.|