Using ETL datasets


Overview of datasets

A dataset is an instance of the ETL::DataSet class, a sequential-write, random-read matrix data structure.

It can be thought of as a collection of data presented in tabular form, where each column represents a particular variable and each row corresponds to a member of the dataset. Sequential-write means that rows can be added only at the end of the data structure; random-read means that you can read a row or a column at any position in the dataset.
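The following minimal sketch illustrates these semantics using the ETL::DataSet methods described later in this topic; the dataset name, column names, and values are illustrative only:

my $ds = new ETL::DataSet("EXAMPLE");  # illustrative dataset name
$ds->setColumns("HOST","CPU_UTIL");    # define the column structure

# Sequential write: rows can only be appended at the end
$ds->addRow("host01",75);
$ds->addRow("host02",42);

# Random read: any row or cell can be read by position
my @second = $ds->getRow(1);           # the second row
my $util = $ds->get(0,1);              # row 0, column 1 (CPU_UTIL)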

Because a dataset can be very large, the DataSet object is not loaded entirely into memory. Instead, the ETL saves data to a temporary file on disk and restores it into memory when needed, which prevents data loss and ensures efficient handling of the dataset.

The ETL Datasets view in Integration Studio provides information about the structure of the dataset. The following figure shows a list of available datasets with their unique IDs, names, and other related information.

Figure: The ETL Datasets view (dataset1.png)

The structure of the preceding figure can be explained as follows:

Icons to identify special columns

  • star_red.png: Represents a mandatory column.
  • key1.png: Represents an identification column for the measured object.

Tables

  • On the left: The Supported datasets table.
  • On the right: The columns that compose the dataset selected on the left (in this case, SYSPROC).

Prefix

The DS_ prefix indicates that an identifier is valid only within the data source; another ETL can use a different ID for the same system.

General pointers

  • If two rows contain the same value in the ID column (in the Supported datasets table), they are associated with the same series.
  • The current module (in this case: MyParserP.pm) can be made to use one or more available datasets by selecting the checkbox preceding the ID column (in the Supported datasets list). A list of these selected datasets is communicated to BMC TrueSight Capacity Optimization during the activation phase.
  • The list of selected datasets indicates what the current module (MyParserP.pm) is capable of producing and not what it actually produces. The choice to load a particular dataset is made during the configuration of an ETL instance.
Example

To create a new dataset, TEST, with two columns, Col_A and Col_B, write the following code:

my $res = new ETL::DataSet("TEST");  # create the TEST dataset
$res->setColumns("Col_A","Col_B");   # define its two columns

Populating the dataset

In an ETL, every dataset contained in the result set must follow the standard structure. This can be achieved by using the ETL::DefChecker object, which correctly initializes the dataset columns and creates the appropriate structure.

Populating a dataset comprises the following activities:

Creating and initializing a dataset

The following code illustrates how to create and initialize a DataSet object corresponding to a standard dataset named SYSGLB:

$res = new ETL::DataSet("SYSGLB");  # create the SYSGLB dataset
my $dc = $conf->getDefChecker();    # obtain the DefChecker from the configuration
$dc->initializeColumns($res);       # initialize the standard column structure

Adding data to a dataset

To add data to the dataset, call the addRow method. The arguments passed in the call must correspond positionally to the column structure of the dataset. The following code illustrates this operation:

$res->addRow(1,"alpha");  # 1 fills the first column, "alpha" the second

Associative population

An effective way to add data to a dataset on the basis of column names is associative population. Its advantage is that it does not depend on the positional structure of the dataset. The required steps are:

  1. Create a new and empty row in the dataset.
  2. Populate each column separately.
  3. Add the new row to the dataset.

The following code illustrates the preceding steps:

my @row = $res->newRow();        # Create an empty row in the dataset
$res->fillRow("A",1,\@row);      # Populate column 'A'
$res->fillRow("B","beta",\@row); # Populate column 'B'
$res->addRow(@row);              # Append the populated row to the dataset
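For example, associative population makes it easy to load a row from data that is already keyed by column name, such as a record produced by a parser. The following sketch assumes a hypothetical %record hash whose keys match the dataset column names:

# Hypothetical parsed record, keyed by column name
my %record = ( "A" => 2, "B" => "gamma" );

my @row = $res->newRow();                       # create an empty row
foreach my $col (keys %record) {
    $res->fillRow($col, $record{$col}, \@row);  # populate each column by name
}
$res->addRow(@row);                             # append the populated row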

Reading the dataset

After you populate a dataset, you can read it using the following methods:

  • getRow
  • get

Note

Row and column indices are zero-based: row indices range from 0 to size()-1, and column indices range from 0 to the number of columns minus 1.

The following code illustrates both read methods:

# Loop through the DataSet rows
for (my $rowidx = 0; $rowidx < $res->size(); $rowidx++) {

    # Get the entire row as an array
    my @row = $res->getRow($rowidx);

    # Read associatively from the row array
    my $adata = $res->getByColName("a",@row);

    # Or, equivalently, find the column index
    my $colidx = $res->findColumn("a");

    # and then access positionally by (row,column)
    $adata = $res->get($rowidx,$colidx);
}
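Similarly, a single column can be read across all rows by resolving its index once and then accessing each row positionally. The following sketch uses only the methods shown above; the column name is illustrative:

# Collect the values of column "a" across all rows
my $colidx = $res->findColumn("a");
my @column;
for (my $rowidx = 0; $rowidx < $res->size(); $rowidx++) {
    push @column, $res->get($rowidx, $colidx);
}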

Copying data between datasets

To copy data from a source dataset into a target dataset according to the target dataset's structure, use the fillAll method.

Example

Consider a source dataset, stored in the variable $Source, that contains three columns: A, B, and C. To copy data from $Source to a target dataset called $Target, according to the structure of $Target, use the following code:

my $Target = new ETL::DataSet("TARGET");
$Target->setColumns("A","C","D");
$Target->fillAll($Source);

When you execute this code, the result is:

  • Columns A and C from $Source will be copied to $Target.
  • Column B from $Source will not be copied to $Target at all.
  • Column D in $Target will remain empty.

Related topics

Dataset-reference-for-ETL-tasks

Overview-of-datasets-in-an-ETL-task

 
