Sizing and scalability requirements for the ETL Engine server
ETL Engine servers can be scaled horizontally and vertically. The major sizing drivers for ETL Engine servers are:
- The required data processing throughput in samples per day. This value is the number of managed entities, times the average number of samples collected for each entity in a day.
- The number of connector instances (tasks) scheduled on the ETL Engine.
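The throughput driver above can be estimated with a quick calculation. This is a minimal sketch; the entity and sample counts are illustrative, not product figures:

```shell
#!/bin/sh
# Estimate daily sample throughput for an ETL Engine.
# Illustrative numbers: 10,000 managed entities, each producing
# one sample every 5 minutes (288 samples per day).
entities=10000
samples_per_entity_per_day=288

samples_per_day=$((entities * samples_per_entity_per_day))
echo "Estimated throughput: $samples_per_day samples/day"

# More than 20 million samples per day calls for multiple ETL Engines.
if [ "$samples_per_day" -gt 20000000 ]; then
  echo "Configure multiple ETL Engine servers."
fi
```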
Use the sizing and scalability guidelines to determine the hardware capacity for ETL Engine servers according to your environment size.
To process more than 20 million samples per day, configure multiple ETL Engine servers.
For more information, refer to the following sections:
- Guidelines for disk space
- Guidelines for connector scheduling on ETL Engines
- Guidelines for service connectors
- Guidelines for remote ETL Engines
- Setting the Data hub JVM heap size
- Increasing the database connection pool size
- Guidelines for using a remote ETL Engine
Guidelines for disk space
The default values above allow for ten days of temporary files and log files accumulated during the normal day-to-day population activity. The default period setting for the File System Cleaner system task is ten days. If you increase this period for any reason, adjust the above numbers accordingly.
You need additional disk space for special activities:
- Bulk import of data, for example, for addition of new data sources with historical data.
- Recovery import of data, that is, when a data source stops for a day or two for any reason and its data must be recovered.
For the above special activities, estimate additional capacity using the number of anticipated additional samples per day. Temporary files and logs from these samples will remain on the disk for ten days (or whatever the File System Cleaner system task period is set to).
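The estimate for these special activities can be sketched as a short calculation. The per-sample disk footprint below (`bytes_per_sample`) is an illustrative assumption, not a product figure; substitute the value from your own sizing data:

```shell
#!/bin/sh
# Rough extra disk estimate for a bulk or recovery import.
# bytes_per_sample is an assumed illustrative value.
extra_samples_per_day=2000000
bytes_per_sample=100
retention_days=10   # File System Cleaner default period

extra_bytes=$((extra_samples_per_day * bytes_per_sample * retention_days))
extra_gb=$((extra_bytes / 1024 / 1024 / 1024))
echo "Additional disk needed: ~${extra_gb} GB"
```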
Guidelines for connector scheduling on ETL Engines
Follow these guidelines when scheduling connectors to avoid congestion:
- Do not configure a single connector instance to populate more than two million samples in each run. You can configure multiple connector instances to divide the work into smaller chunks, or you can configure multiple ETL Engine servers to manage the higher volume, when the data source can support it.
- Schedule connectors so that no more than one connector instance is running on any CPU core at any given time. Take into account the amount of time required to execute each connector. For example, if an ETL Engine server is configured with two CPU cores, ensure that no more than two connectors are running at any given time.
- If you are trying to scale up by increasing the size of the ETL Engine machine, consider that certain types of connectors require significantly more memory than others. For example, connectors written in Java might require twice as much memory as other types of connectors, or more.
- Avoid congestion at the warehousing engine. Do not import data in large volumes, such as when recovering historical data. Split this data into smaller chunks.
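The concurrency and run-size guidelines above can be sanity-checked with a short script. The core, connector, and sample counts here are illustrative:

```shell
#!/bin/sh
# Sanity-check a connector schedule against the guidelines:
# no more concurrent connectors than CPU cores, and no single
# connector run over two million samples.
cpu_cores=2
concurrent_connectors=2
samples_per_run=1500000

ok=yes
[ "$concurrent_connectors" -le "$cpu_cores" ] || ok=no
[ "$samples_per_run" -le 2000000 ] || ok=no
echo "schedule check: $ok"
```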
Guidelines for service connectors
The guidelines above are for scheduled connectors, which are periodically started by the scheduler as batch jobs and run as separate processes. For service connectors, which run continuously within the same process as the scheduler, stricter guidelines apply:
- A single ETL Engine machine can run no more than four service connectors.
- Service connectors use heap memory in the scheduler process, so increase the heap size by 512 MB for each service connector. For example, if the default heap size is 1 GB, this means that running two service connectors will require the heap size to be increased to 2 GB.
- The total heap size of the scheduler should remain within half the total memory of the ETL Engine machine. If you need to add more than two service connectors, use the larger 8 GB RAM size ETL Engine machine.
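The heap-size rule above can be expressed as a quick calculation, using the figures from this section (default 1 GB heap, 512 MB per service connector, heap capped at half the machine's RAM):

```shell
#!/bin/sh
# Scheduler heap needed for service connectors:
# default heap plus 512 MB per service connector.
default_heap_mb=1024
service_connectors=2   # must not exceed 4 per ETL Engine machine

heap_mb=$((default_heap_mb + service_connectors * 512))
echo "Set SCHEDULER_HEAP_SIZE to ${heap_mb}m"

# The heap should stay within half the machine's total memory.
machine_ram_mb=4096
if [ "$heap_mb" -gt $((machine_ram_mb / 2)) ]; then
  echo "Use the larger 8 GB RAM ETL Engine machine."
fi
```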
Modifying the heap size of the ETL Engine scheduler
- In the installation directory on the ETL Engine machine, open the customenvpre.sh file for editing.
- In the #SCHEDULER section, find the following statements:
#SCHEDULER_HEAP_SIZE="1024m"
#export SCHEDULER_HEAP_SIZE
- If the statements are preceded by a '#' character, remove it from both to uncomment them, and change the value (1024m) to the new heap size.
- Restart the scheduler.
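After the edit, the uncommented statements in customenvpre.sh might look like this; the 2048m value is shown only as an example:

```shell
# customenvpre.sh - #SCHEDULER section, after uncommenting
SCHEDULER_HEAP_SIZE="2048m"
export SCHEDULER_HEAP_SIZE
```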
Sizing considerations for the VMware vCenter Extractor Service ETL module
- The main drivers for sizing the VMware vCenter Extractor Service ETL module are:
- Number of VMs
- Number of managed clusters
- For every cluster in your vCenter, follow these guidelines:
- VMs per ETL module: Maximum 2000
- Scheduler heap: 2 GB for the first 2000 VMs, plus 1 GB for every additional 2000 VMs
- Data storage size on the ETL Engine computer: 10 GB per 2000 VMs
- Limit the number of clusters for each ETL module because the number of clusters affects the number of threads used by the ETL module.
- Provision enough CPU resources to ensure timely polling of data.
- To manage a large vSphere environment that contains multiple vCenter servers:
- Ensure that the Capacity Optimization Data Warehouse, Application Server, and ETL Engine Server are sized appropriately. For information, see Sizing and scalability considerations and Sizing considerations for ETL Engine servers.
- Use multiple ETL modules for data collection. You can split the vCenters and record the mapping in the following format:

ETL module    vCenter
v1            vCenter server 1 - 50
v2            vCenter server 51 - n
...           ...
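Taken together, the heap and storage guidelines above can be sketched as a quick calculation. The VM count is illustrative, and treating the heap figure as a per-scheduler total for the modules handling those VMs is an assumption:

```shell
#!/bin/sh
# Scheduler heap and storage estimate for the VMware vCenter
# Extractor Service ETL module: 2 GB heap for the first 2000 VMs,
# 1 GB per additional 2000 VMs, and 10 GB storage per 2000 VMs.
vms=6000

blocks=$(( (vms + 1999) / 2000 ))   # round up to 2000-VM blocks
heap_gb=$(( 2 + (blocks - 1) ))     # 2 GB first block, 1 GB each extra
storage_gb=$(( blocks * 10 ))

echo "Heap: ${heap_gb} GB, storage: ${storage_gb} GB"
```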
Guidelines for remote ETL Engines
Unlike local ETL Engines, which load data into the data warehouse by connecting directly to the Oracle database, remote ETL Engines load data through the Data hub. Installations that load data from remote ETL Engines therefore have different sizing and scalability characteristics: in general, a remote ETL Engine consumes additional resources on the Data hub machine. To decide when to use a remote ETL Engine, see Guidelines for using a remote ETL Engine.
Remote ETL Engines use a two-step process to load data into the data warehouse:
- A store-and-forward messaging infrastructure transfers data to the Data hub machine (usually the same as the Application Server machine).
- The data is then loaded into the data warehouse by connecting to the Oracle database.
This two-step process involves use of the following resources:
- Reads and writes to the disk on the Data hub machine.
- Enough disk space on the Data hub machine to accumulate transferred data for one day.
- CPU and memory on the Data hub machine to parse the transferred data and create data for loading into Oracle.
- Database connection pools for the Data hub application.
Even if there are sufficient resources for all of the above, the overall process also takes longer to finish than for local ETL Engines.
The following sections provide guidelines for tuning the Data hub when you use remote ETL Engines:
Setting the Data hub JVM heap size
- Open the customenvpre.sh file for editing.
- Find the following statements:
#DATAHUB_HEAP_SIZE="1024m"
#export DATAHUB_HEAP_SIZE
- Remove the '#' character preceding both statements to uncomment them, and replace 1024m with the new heap size.
- Restart the Data hub service.
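After the edit, the uncommented statements in customenvpre.sh might look like this; the 2048m value is shown only as an example:

```shell
# customenvpre.sh - after uncommenting
DATAHUB_HEAP_SIZE="2048m"
export DATAHUB_HEAP_SIZE
```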
Increasing the database connection pool size
- Open the scheduler/conf/ds.properties file for editing.
- Find the maxTotal= property and change its value to the new pool size.
- Save the ds.properties file.
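A hedged example of the property after editing; the value 100 is illustrative, and the appropriate pool size depends on how many remote ETL Engines load data through the Data hub:

```properties
# scheduler/conf/ds.properties (excerpt)
maxTotal=100
```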
Guidelines for using a remote ETL Engine
The TrueSight Capacity Optimization ETL Engine is a server that runs connectors that populate data from external data sources to the Data Warehouse database.
An ETL Engine can be configured in two ways:
- Local: Directly populates the database (default).
- Remote: Populates the database via the Application Server's message service. You can select this option during installation.
The same connectors can run in both types of ETL Engine. The choice between local and remote is purely a deployment choice. Selection criteria explains how to choose the appropriate configuration for a TrueSight Capacity Optimization ETL Engine.
Selection criteria
In most cases, a local ETL Engine is recommended because it offers a higher throughput connection to TrueSight Capacity Optimization, to load high volumes of data.
A remote ETL Engine provides two advantages to connectors running in it:
- It is able to populate data via HTTPS.
- It uses a store-and-forward infrastructure to cope with unreliable network links.
However, a remote ETL Engine introduces overhead for robust transmission, thus reducing the total throughput obtained.
The remote ETL Engine option should be adopted only in the following cases:
- A limited-bandwidth or unstable connection is present between data sources and the TrueSight Capacity Optimization central installation, for example a low-capacity WAN link. In this case, a connector trying to populate data through this link could experience repeated failures. Running the connector in a remote ETL Engine ensures that data extraction occurs reliably, close to the data source, while the load phase occurs over the WAN link through a store-and-forward messaging infrastructure in TrueSight Capacity Optimization.
- It is not possible to expose the needed TCP ports on the TrueSight Capacity Optimization database machine or the TrueSight Capacity Optimization Application Server machine. In this case, a connector trying to populate data through this link would be unable to make database updates. Running the connector in a remote ETL Engine allows data population to occur via HTTP or HTTPS ports on the TrueSight Capacity Optimization Application Server machine.
You can always use multiple local ETL Engines over different LAN segments to drive higher volumes.
Network visibility details
A local ETL Engine requires TCP visibility of the TrueSight Capacity Optimization database server and application server from the ETL Engine machine; specifically TCP port 1521 for the database (default Oracle installations) and TCP port 8280 (default Data hub port) for the TrueSight Capacity Optimization Application Server.
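The visibility requirement above can be captured as a small checklist script. The host names are illustrative, and the use of netcat's -z probe is an assumption; substitute your preferred connectivity tool. The script emits the probe commands rather than running them, so the list can be reviewed first (pipe the output to sh to execute):

```shell
#!/bin/sh
# TCP visibility a local ETL Engine needs (hosts are illustrative):
#   1521 - Oracle database listener (default)
#   8280 - Data hub on the Application Server (default)
db_host=co-database.example.com
as_host=co-appserver.example.com

probes="nc -z $db_host 1521
nc -z $as_host 8280"
echo "$probes"
```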
A remote ETL Engine can be configured to use either:
- A direct HTTP connection to the TrueSight Capacity Optimization Application Server, which exposes both HTTP and JMS over HTTP on TCP port 8280 (the default; this port must be open for communication).
- An HTTPS front end based on Apache Web Server that exposes the same ports over HTTPS (optional).