Developing a custom parser module


A parser is a class that extends the Parser class in BMC Helix Continuous Optimization. For the parser to produce data, you need to implement the abstract parse method. 

About the abstract parse method

When you implement the abstract parse method, note that it receives the full name of the file to parse as its first parameter and returns a DataSetList object that contains the extracted data.

The parser does not find and select the files to parse; the ETL framework does this in advance, based on the configuration that you provide when you create the ETL.

For instance, suppose the ETL is configured to access a Secure File Transfer Protocol (SFTP) folder and select files that match a certain pattern. The ETL framework copies the selected files via SFTP to the local ETL engine disk and then sequentially calls the parse method of the defined custom parser for each file. This means that the ETL will:

  1. Call the parse method for the first file.
  2. Populate the output dataset with the result.
  3. Call the parse method for the second file.
  4. Append the result to the dataset, and so on.
After parsing each file, and depending on the configuration, the ETL framework renames or moves the parsed file.
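The sequence above can be sketched in plain Java. This is only an illustration: the real framework classes are internal to BMC Helix Continuous Optimization, so here a List<String> stands in for DataSetList and parseFile stands in for the custom parser's parse method.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: List<String> stands in for DataSetList and
// parseFile stands in for the custom parser's parse method.
public class EtlLoopSketch {

    // Stand-in for Parser.parse: turns one file into a list of rows.
    static List<String> parseFile(String filename) {
        List<String> rows = new ArrayList<>();
        rows.add("row parsed from " + filename);
        return rows;
    }

    public static void main(String[] args) {
        // Files already selected and copied locally by the framework.
        String[] selectedFiles = {"metrics_2023-07-07.txt", "metrics_2023-07-08.txt"};

        List<String> outputDataSet = new ArrayList<>();
        for (String file : selectedFiles) {
            // Steps 1 and 3: call parse for each file in sequence.
            List<String> result = parseFile(file);
            // Steps 2 and 4: populate or append the output dataset with the result.
            outputDataSet.addAll(result);
            // The framework would then rename or move the parsed file.
        }
        System.out.println(outputDataSet.size()); // prints 2
    }
}
```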

Consider the following example:

Example

You need to extract the CPU Utilization metric from a text file in the following format:

2023-07-07/CPU:15.2,13,25,15,13,25,15.1,13,25.1,15.1,13,25,15,13,25
2023-07-07/DISK:1,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1
2023-07-08/CPU:15,13,25,15,13,25,15,13,25,15,13,25,15,13,25,15,13,25
2023-07-08/DISK:1,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1
.. ..
..

The contents of this file are not columnar; each row refers to a single metric and contains the hourly samples of a 24-hour day. To parse a file in this format, you need a custom parser.

Full example code

You can download the full code of the example presented: MyParserP.pm.


Editing the parser code

Every new custom parser module that you create in the ETL Development Kit uses a code template that contains some auto-generated code pieces.


The following steps sequentially illustrate the initial code that you need to write and the other operations that you can perform on parsers.

  1. Write the initial parser code:

    package ETL.parser;

    import com.neptuny.cpit.etl.DataSetList;
    import com.neptuny.cpit.etl.parser.Parser;

    /**
     * Parser template
     */
    public class MyParserP extends Parser {

        @Override
        public DataSetList parse(String filename) throws Exception {
            // TODO: parsing code goes here
            return null;
        }

        @Override
        public DataSetList adjustParseResult(DataSetList in) throws Exception {
            // Add post-parse adjustments here, if needed
            return in;
        }
    }

    The template contains a stub of the parse method, which you must implement, and an adjustParseResult method that you can optionally use to modify the result after parsing.


  2. Instruct the parser to prepare the output datasets.

    To extract the CPU Utilization metric, you can access the ETL Datasets view. This metric belongs to the SYSGLB (Global System Data) dataset. Hence, the output dataset can be built using the following code:

    DataSetList dsList = new DataSetList();
    DataSet res = new DataSet("SYSGLB");
    this.getConf().getDefChecker().initializeColumns(res);
    dsList.add(res);

  3. Open the file and read lines of text. 

    long totallines = 0;
    long goodlines = 0;
    long convertedlines = 0;

    // Requires: java.io.BufferedReader, java.io.File, java.io.FileReader
    BufferedReader filereader = new BufferedReader(new FileReader(new File(filename)));
    try {
        String line;
        while ((line = filereader.readLine()) != null) {
            totallines++;

            line = line.trim();
            if (line.length() == 0) {
                continue; // Skip empty lines
            }

            goodlines++;

            // TODO: parse the line here (see step 4); the row below is a placeholder
            String[] row = res.newRow();
            res.fillRow("TS", "2023-07-07 10:00:00", row);
            res.fillRow("DURATION", "300", row);
            res.fillRow("DS_SYSNM", "server1", row);
            res.fillRow("CPU_UTIL", "0.5", row);
            res.addRow(row);

            convertedlines++; // Increment when a line is parsed and imported successfully
        }
    } finally {
        filereader.close();
    }

  4. Parse a line of text, extract the CPU Utilization samples, and put the data in the dataset.

    Important

    The SYSGLB dataset has three mandatory columns: timestamp (TS), DURATION of the sample, and the system name (DS_SYSNM). In this example, the metric name for CPU Utilization is CPU_UTIL.

    The code for parsing lines and filling the dataset looks similar to the one elaborated in the following example:

    // Requires: java.util.regex.Matcher, java.util.regex.Pattern
    Pattern linePattern = Pattern.compile("(\\d{4}-\\d{2}-\\d{2})\\/CPU:(.*)");
    Matcher lineMatcher = linePattern.matcher(line);
    if (lineMatcher.matches()) {
        String day = lineMatcher.group(1);
        String[] samples = lineMatcher.group(2).split(",");
        for (int h = 0; h < samples.length; h++) {
            String dayhour = String.format("%s %02d:00:00", day, h);
            double val = Double.parseDouble(samples[h]) / 100;
            String[] row = res.newRow();
            res.fillRow("TS", dayhour, row);
            res.fillRow("DURATION", "3600", row);
            res.fillRow("DS_SYSNM", "server1", row);
            res.fillRow("CPU_UTIL", Double.toString(val), row);
            res.addRow(row);
        }
    }

Badly formed file

We recommend that you add control code to the main parser code to ensure that the parser runs successfully. Without such checks, a badly formed file – a file that contains incorrectly formed lines, which the parser rejects – can go unnoticed. Badly formed files are common in practice, so the parser should detect and report them.

Rejection Percentage

As a best practice, we recommend that you calculate a rejection percentage (denoted by rp) on the parsed content. rp is the percentage of rejected lines relative to the number of lines that are expected to be good and well-formed.

To derive a rejection percentage value, do the following:

  • Count the total number of lines in the file (tot).
  • Count the number of lines that match the regular expression used to select good lines (match).
  • In this example, half of the lines are CPU lines, so the number of expected good lines is tot/2.
  • Therefore —
    rp = ((tot/2) - match) / (tot/2) * 100

After the rejection percentage rp is calculated, it can be logged to help the administrator detect bad files, or an error can be raised if rp is too high. Calculating rp is only a recommendation. With the rp calculation and logging code added, the parser is complete (see the code example in Developing-a-custom-parser-module).
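As a sketch, the rp calculation and threshold check might look like the following. The helper and the 20% threshold are illustrative choices, not part of the product API; totallines and matched correspond to the counters maintained in the parsing loop of step 3.

```java
// Sketch of a rejection-percentage check. The 20% threshold is an
// arbitrary illustrative value; adjust it to your tolerance for bad lines.
public class RejectionCheck {

    static double rejectionPercentage(long tot, long match) {
        double expectedGood = tot / 2.0; // half of the lines are CPU lines
        if (expectedGood == 0) {
            return 0.0; // empty file: nothing to reject
        }
        return (expectedGood - match) / expectedGood * 100.0;
    }

    public static void main(String[] args) throws Exception {
        long totallines = 10; // lines read from the file
        long matched = 4;     // lines that matched the CPU regular expression

        double rp = rejectionPercentage(totallines, matched);
        System.out.println("rejection percentage: " + rp + "%");

        if (rp > 20.0) {
            // Fail the ETL run (or just log a warning) on a badly formed file.
            throw new Exception("Too many rejected lines: " + rp + "%");
        }
    }
}
```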



 
