Overview of datasets in an ETL task

This topic contains the following sections:

Datasets

dataset is a logical structure used to define and organize the data collected by an ETL task in a standard format.

The dataset structure is predefined in BMC Capacity Optimization; every dataset has a name and a numeric identifier (datasetid) that univocally refers to a specific structure. Every structure details a group of metrics, for example, all metrics inherent to hard drives are defined in the System Disk (SYSDSK) dataset.

Datasets are used to ensure that the imported data satisfy a well defined format specification, and to let you know when some imported metric is unsupported (that is not listed in a dataset).

The Administration > DATA WAREHOUSE > Datasets & metrics page displays the available datasets, and lists the supported metrics for each dataset. BMC Capacity Optimization integrators and ETL developers can find information about the dataset structure within the BMC Capacity Optimization ETL Development Studio. For more information, see Developing custom ETLs.

Horizontal and Vertical datasets

There are two kinds of datsets in BMC Capacity Optimization:

  • Horizontal datasets: For every supported metric there is one column, and every row defines a new data sample. Most datasets have a horizontal structure, and using horizontal datasets, you can load only the metrics (OOTB and custom) related to the specified (instantiated) dataset.
  • Vertical datasets: They have generic column names, and each metric sample is added to a new row. Using vertical datasets, you can load any BMC Capacity Optimization metric (all the metrics related to all the horizontal datasets). The SYSDAT or WKLDAT datasets in BMC Capacity Optimization have a Vertical structure. SYSDAT is used to load system data in a vertical structure, whereas WKLDAT is used to load business driver data in a vertical structure.

If you are going to import configuration and/or performance data contained in CSV files or you are writing a custom connector, you have to fill these two datasets in different ways:

  • In horizontal datasets, you can put values for more metrics in the same row.
  • In vertical datasets, for every metric you have to fill one row specifying the OBJNAME (metric)

Example scenario

You want to import the samples for CPU_UTIL and CPU_UTILMHZ of two systems, each sampled at midnight.

Case 1: Horizontal dataset
Dataset name: SYSGLB

TSDURATION SYSNMDS_SYSNMSYSTYPENMCPU_UTILCPU_UTILMHZ
2014-08-30
22:07:14
86400System1System1gm:vmw30500
2014-08-30
22:07:14
86400System2System2gm:vmw60600
2014-08-30
22:07:14
86400System1System1gm:vmw801100
2014-08-30
22:07:14
86400System2System2gm:vmw10650

Case 2: Vertical dataset
Dataset name: SYSDAT

TSDURATIONSYSNMDS_SYSNMSYS
TYPENM
OBJNMSUB
OBJNM
VALUE
2014-08-30
22:07:14
00:00
86400System1System1gm:vmwCPU_UTILGLOBAL500
2014-08-30
22:07:14
00:00
86400System1System1gm:vmwCPU_
UTILMHZ
GLOBAL30
2014-08-30
22:07:14
00:00
86400System2System2gm:vmwCPU_UTILGLOBAL600
2014-08-30
22:07:14
00:00
86400System2System2gm:vmwCPU_
UTILMHZ
GLOBAL60
2014-08-30
22:07:14
86400System1System1gm:vmwCPU_UTILGLOBAL1100
2014-08-30
22:07:14
86400System1System1gm:vmwCPU_
UTILMHZ
GLOBAL80
2014-08-30
22:07:14
86400System2System2gm:vmwCPU_UTILGLOBAL650
2014-08-30
22:07:14
86400System2System2gm:vmwCPU_
UTILMHZ
GLOBAL10

Dataset conventions

The following conventions must be kept in mind while working with datasets:

  • Timestamps must be in the YYYY-MM-DD HH:MI:SS format.
  • Data for percentage metrics must be between 0 and 1.
  • Data for the Duration column must be in seconds.

Related topics

Was this page helpful? Yes No Submitting... Thank you

Comments