Overview of the ETL workflow
The ETL (Extract, Transform, and Load) process is implemented by different modules that run on top of a common engine framework (see ETL development API constructs for details). As shown in the diagram, the data import process is divided into three phases (a code sketch follows the list):
- Data extraction phase
  - (Optional) A "before process command" can be executed to perform tasks required by the import procedure (for example, locking the database schema, or telling the monitoring application to flush its data to a CSV file).
  - The extractor module reads the data from an external database or file; this data is generally gathered by a third-party monitoring application.
  - If the source data resides in a file, the extractor feeds it to a parser module.
  - The data is then converted to the internal representation format: the dataset.
- Transformation phase
  - (Optional) One or more transformer modules read the dataset and perform in-memory transformations; this phase is optional and its application depends on the extracted data.
- Loading phase
  - One or more loader modules store the dataset in BMC Helix Continuous Optimization.
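The three phases can be summarized with a minimal sketch. The Dataset, Extractor, Transformer, Loader, and EtlEngine types below are hypothetical stand-ins for illustration only, not the actual ETL development API constructs:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the engine's internal representation format.
class Dataset {
    final List<Map<String, Object>> rows = new ArrayList<>();
}

// Hypothetical module contracts; the real ETL development API constructs differ.
interface Extractor   { Dataset extract(); }
interface Transformer { Dataset transform(Dataset in); }
interface Loader      { void load(Dataset out); }

// Orchestration of the three phases described above.
class EtlEngine {
    private final Runnable beforeCommand;          // optional, may be null
    private final Extractor extractor;
    private final List<Transformer> transformers;  // optional, may be empty
    private final List<Loader> loaders;
    private final Runnable afterCommand;           // optional, may be null

    EtlEngine(Runnable before, Extractor extractor, List<Transformer> transformers,
              List<Loader> loaders, Runnable after) {
        this.beforeCommand = before;
        this.extractor = extractor;
        this.transformers = transformers;
        this.loaders = loaders;
        this.afterCommand = after;
    }

    void run() {
        if (beforeCommand != null) beforeCommand.run(); // e.g. lock the schema
        Dataset ds = extractor.extract();               // extraction phase
        for (Transformer t : transformers) {            // optional transformations
            ds = t.transform(ds);
        }
        for (Loader l : loaders) {                      // loading phase
            l.load(ds);
        }
        if (afterCommand != null) afterCommand.run();   // e.g. release the lock
    }
}
```

For brevity, the sketch folds the parser step used by file-based sources into the extractor.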
Additional information
It is important to note that the "before" and "after" commands are part of the import process but have nothing to do with the data itself; they are instrumental to the smooth execution of the process. These commands can be written in any language, for example, shell script, Java, or C#.
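Because these commands are external to the data flow, an engine typically just shells out to the operating system and checks the exit code. A minimal sketch, assuming a hypothetical helper around Java's ProcessBuilder:

```java
import java.io.IOException;

// Hypothetical helper: run a configured "before"/"after" command and abort
// the import if the command fails. The command itself can be anything the
// configuration points to (shell script, Java launcher, and so on).
class ExternalCommand {
    static void run(String... commandAndArgs) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(commandAndArgs)
                .inheritIO()   // surface the command's own output in the ETL log
                .start();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("Before/after command failed with exit code " + exitCode);
        }
    }
}
```

For example, ExternalCommand.run("/opt/etl/bin/lock-schema.sh") would run a (hypothetical) schema-locking script before extraction begins.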
Tip
An action is an operation that can be performed at any step of the import process and that generally deals with data manipulation.
The modularity of the process allows you to change many of its settings and modules by editing only the configuration, without writing a single line of code.
The following diagram summarizes the standard ETL workflow:
Entity name resolution
Sometimes data sources adhere to a naming convention that is different from the one used by BMC Helix Continuous Optimization; this makes it necessary to translate names according to the naming policy of the Optimize catalog.
It is possible to configure ETL-specific lookup tables that define, for each entity, the translation from the name used by the ETL task (the ds_sysnm or ds_wklnm metrics) to the identifier used in the Optimize catalog (sysnm or wklnm). This process decouples source naming from data warehouse (DWH) naming, and each ETL has its own lookup table.
The lookup tables are automatically populated by ETL tasks when they find new objects in the source data; a new entry is created for each new object, with identical Optimize catalog and data source names. You can then manually modify the Optimize catalog name to make it comply with the BMC Helix Continuous Optimization naming policies.
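The auto-population behavior can be pictured with a short sketch; the LookupTable class below is hypothetical, not the product's implementation:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical per-ETL lookup table mapping data source names
// (ds_sysnm / ds_wklnm) to Optimize catalog names (sysnm / wklnm).
class LookupTable {
    private final Map<String, String> sourceToCatalog = new HashMap<>();

    // Called for every entity found in the source data. Unknown entities
    // get a new entry whose catalog name is initially identical to the
    // source name, mirroring the automatic population described above.
    String resolve(String sourceName) {
        return sourceToCatalog.computeIfAbsent(sourceName, name -> name);
    }

    // Manual step: rename the catalog side of an entry to comply with the
    // naming policies; the data source side never changes.
    void rename(String sourceName, String catalogName) {
        sourceToCatalog.put(sourceName, catalogName);
    }
}
```

With this model, resolve returns the source name verbatim the first time an entity is seen; after an administrator calls rename, subsequent runs translate the same source name to the chosen catalog name.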
If the measured entities are new, the ETL automatically proposes a name that you can manually modify. On the other hand, if the entities are already measured by another ETL, they are already associated with a name in the Optimize catalog. It is therefore important to spot potential naming conflicts when creating new ETL tasks.
If you detect this situation before the first run of the ETL task, you can perform a manual or shared lookup and solve the issue; this is the recommended solution. If you detect it only after running the ETL task, the reconciliation process has to be performed. To avoid naming conflicts, run a newly configured ETL task in simulation mode and check the imported data before actually importing it into BMC Helix Continuous Optimization. Lookup reconciliation is possible but discouraged; always try to fix every lookup issue before running the ETL task for the first time.
lastcounter parameter
When running ETL tasks, only new data must be loaded; the system therefore needs to keep track of which data has already been imported.
Each ETL has an associated parameter called lastcounter, which keeps track of the last imported samples. This counter is updated after every run; at the start of the next one, the extractor module reads it to distinguish old data from new.
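The resulting read-then-advance cycle looks roughly like the sketch below; the CounterStore interface and its method names are hypothetical, not the engine's API:

```java
// Hypothetical persistence of the lastcounter between runs.
interface CounterStore {
    String readLastCounter(String etlId);
    void writeLastCounter(String etlId, String newValue);
}

class LastCounterCycle {
    static void runOnce(String etlId, CounterStore store) {
        // 1. Read the counter saved by the previous run.
        String lastCounter = store.readLastCounter(etlId);

        // 2. Extract only data newer than the counter. What "newer" means
        //    depends on the extractor: a timestamp or ROWID for databases,
        //    a file name pattern for file-based sources.
        String newestSeen = extractNewerThan(lastCounter);

        // 3. Persist the new high-water mark for the next run.
        store.writeLastCounter(etlId, newestSeen);
    }

    // Placeholder for the real extractor call; returns the highest
    // counter value seen during this run.
    private static String extractNewerThan(String lastCounter) {
        return lastCounter; // stub
    }
}
```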
As ETL extractors deal with both databases and files, different lastcounter logics are required. The counter can be a timestamp, as is usually the case for ETL tasks that collect data from databases, or a regular expression that prevents the ETL task from parsing files whose extension indicates that they have already been examined. In particular:
- Database extractors usually store the timestamp or ROWID of the data as the lastcounter; this procedure is very accurate.
- File extractors generally mark each parsed file with a done suffix and move it to an archive directory, or recognize unparsed files through a file name pattern. The latter method involves a file name pattern containing the creation date: the ETL looks for files created the day before the execution and imports them. A problem arises if an ETL task run is missed: you must manually edit the lastcounter parameter to recover the files that were not imported (see the sketch after this list).
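A sketch of the two file-extractor strategies; the file names, the archive layout, and the metrics_&lt;yyyy-MM-dd&gt;.csv convention are all hypothetical:

```java
import java.io.File;
import java.io.FilenameFilter;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

class FileExtractorHelpers {

    // Strategy 1: after parsing, mark the file with a "done" suffix and
    // move it to an archive directory so it is never picked up again.
    static void markDone(Path parsedFile, Path archiveDir) throws Exception {
        Path target = archiveDir.resolve(parsedFile.getFileName() + ".done");
        Files.move(parsedFile, target, StandardCopyOption.REPLACE_EXISTING);
    }

    // Strategy 2: recognize unparsed files through a name pattern that
    // contains the creation date, importing the files created the day
    // before the run (hypothetical "metrics_<yyyy-MM-dd>.csv" convention).
    static File[] filesForYesterday(File sourceDir) {
        String stamp = LocalDate.now().minusDays(1)
                .format(DateTimeFormatter.ISO_LOCAL_DATE); // yyyy-MM-dd
        FilenameFilter byName = (dir, name) ->
                name.equals("metrics_" + stamp + ".csv");
        return sourceDir.listFiles(byName);
    }
}
```

The date-based variant makes the missed-run problem concrete: if the ETL does not run for a day, that day's file never matches the filter, and the selection window tracked by the lastcounter must be adjusted by hand.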