Using ETL datasets

A dataset is an instance of the ETL::DataSet class that has a sequential-write, and random-read matrix data structure.

It can be better understood as a collection of data presented in tabular form, where each column represents a particular variable, and each row corresponds to a given member of the dataset in question. Sequential-write means that rows can be added only to the end of the data structure, and random-read means you can read a row or a column present at any position in the dataset.

The Dataset object is not loaded entirely into memory, as it is very large in size. Instead, ETL uses a temporary file to save data to disk, and later restore it into memory to prevent data loss, and to ensure efficient handling of the dataset.

Example

To create a new dataset, TEST, with two columns, Col_A and Col_B, write the following code:

my $res = new ETL::DataSet("TEST");
$res->setColumns("Col_A","Col_B");

Populating the dataset

In an ETL, the structure of every dataset contained in the result set must be standard and similar. This can be achieved by using the ETL::DefChecker object, that correctly initializes dataset columns and creates the appropriate structure.

Populating a dataset comprises the following activities:

Creating and initializing a dataset

The following code illustrates how to create and initialize a DataSet object corresponding to a standard dataset named SYSGLB:

$res = new ETL::DataSet("SYSGLB");
my $dc = $conf->getDefChecker();
$dc->initializeColumns($res);

Adding data to a dataset

To add data to the dataset, use the addRow method explicitly. The parameter array in the call must positionally correspond to the column structure of the dataset. The following code illustrates this operation:

$res->addRow(1,"alpha");

Associative population

A rather effective way to add data to datasets on the basis of column names is associative population. This approach has an advantage in that it does not depend on the structure of the dataset. The necessary steps required to achieve this are:

  1. Create a new and empty row in the dataset.
  2. Populate each column separately.
  3. Add the new row to the dataset.

The following code illustrates the preceding steps:

my @row = $res->newRow();        # Create an empty row in the dataset #
$res->fillRow("A",1,\@row);      # Populate column 'A' #
$res->fillRow("B","beta",\@row); # Populate column 'B' #
$res->addRow(@row);              # Add rows 'A' and 'B' to the dataset #

Reading the dataset

After you populate a dataset, you can read it using the following methods:

  • getRow
  • get

Note

Row and column indices range from 0 to -1.

The following code explains this Read activity:

# a loop through DataSet rows
for (my $rowidx = 0; $rowidx<$res->size();$rowidx++) {

          # getting the entire row as an array
   	  my @row = $res->getRow($rowidx);

	  # reading associatively from the row array
	  my $adata = $res->getByColName("a",@row);

	  # or, equivalently finding the column index
	  my $colidx = $res->findColumn("a");

	  # and then accessing positionally by (row,column)
	  $adata = $res->get($rowidx,$colidx);
}

Copying data between datasets

To copy data from a source dataset to target as per the target dataset's structure, use the fillAll method.

Example

Consider a source dataset stored in the variable $Source that contains three columns – A, B, and C. To copy data from $Source to a target dataset called $Target, as per the structure of $Target, use the following code:

my $Target = new ETL::DataSet("TARGET");
$Target->setColumns("A","C","D");
$Target->fillAll($Source);

When you execute this code, the result is:

  • Columns A and C from $Source will be copied to $Target.
  • Column B from $Source will not be copied to $Target at all.
  • Column D in $Target will remain empty.

Related topics

Dataset reference for ETL tasks

Overview of datasets in an ETL task

Was this page helpful? Yes No Submitting... Thank you

Comments