Indicators for scaling
This topic provides guidelines that you can use for deciding when to scale. Note that this information is indicative and is meant for reference purposes only.
The following factors are an indication that you might need to scale the Collection Station, Indexer, or Search components:
- Product performance is deteriorating.
- Hardware resources such as the processor, memory, storage, and disk I/O start exceeding acceptable limits.
- The rate of indexing is falling behind the rate of data collection.
- The Anomaly baseline job takes more than 7 minutes to complete.
Note
If the Anomaly baseline job takes more than 7 minutes to complete, perform one or more of the following actions to decrease the baseline job completion time:
- Scale the Indexer.
- If the Search server has a lot of idle CPU, increase the baseline.cluster.search.concurrency property in the custom/conf/server/olaengineCustomConfig.properties file.
- Scale the Search server.
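As an illustration, the concurrency property can be raised by editing the custom configuration file. The value shown here is only an assumed example, not a recommended setting; choose a value based on the idle CPU available on the Search server.

```properties
# custom/conf/server/olaengineCustomConfig.properties
# Illustrative value only; tune based on idle CPU on the Search server.
baseline.cluster.search.concurrency=4
```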
When you scale a component, you need to keep in mind that the issue observed for scaling that component can often shift to another component. For example, you notice that the rate of indexing is not keeping up with the incoming volume of data. After some analysis, you determine that there is an issue with the Collection Station and you scale the Collection Station to address the issue. At this time, the indexing rate is able to keep up with the incoming data volume. Now, you add a few hundred more data collectors to the system and notice that once again the indexing rate is not keeping up with the incoming volume of data. You may erroneously conclude that you need to scale the Collection Station to fix the issue. In this scenario, it is possible that the problem has shifted from the Collection Station component to the Indexer component, and that the Indexer needs to be scaled.
Indicators for scaling Collection Station
The Collection Station needs to be scaled when the volume of data to be indexed exceeds the capacity of the currently available hardware.
The following table lists the factors that you can use as indicators for scaling up the Collection Station component, along with the functions supported by each factor.
| Factor | Indicators for scaling | Functions supported |
|---|---|---|
| Processor | Average CPU required exceeds 75%. | |
| Memory | Maximum Java heap size required is greater than 80% of the physical memory available. This is controlled by the | |
| Disk I/O | Disk I/O transfer rate exceeds 75% of the disk I/O available. | |
Indicators for scaling Indexer
The Indexer needs to be scaled when the overall product usage exceeds the hardware capacity and starts negatively impacting the Collection Station (indexing) or Search components.
The following table lists the factors that you can use as indicators for scaling up the Indexer component, along with the functions supported by each factor.
| Factor | Indicators for scaling | Functions supported |
|---|---|---|
| Processor | Average CPU required exceeds 75%. | |
| Memory | Maximum Java heap size required is greater than 60% of the physical memory available. This is controlled by the | |
| Disk I/O | Disk I/O transfer rate exceeds 75% of the disk I/O available. | |
| Disk space | Disk space required exceeds 75% of the available disk space. Disk space required = (Rate of disk space growth per day) * (Number of days of data retention) | |
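The disk-space indicator can be checked with a short calculation. The following sketch applies the formula and the 75% threshold from the table; the growth rate, retention period, and available capacity are hypothetical example values.

```python
# Estimate whether the Indexer's disk-space requirement exceeds the 75% threshold.
# The growth rate, retention period, and available capacity are example values.

def disk_space_required(growth_gb_per_day: float, retention_days: int) -> float:
    """Disk space required = (rate of disk space growth per day) * (days of retention)."""
    return growth_gb_per_day * retention_days

def needs_scaling(required_gb: float, available_gb: float, threshold: float = 0.75) -> bool:
    """Scaling is indicated when required space exceeds 75% of available space."""
    return required_gb > threshold * available_gb

required = disk_space_required(growth_gb_per_day=20, retention_days=14)
print(required, needs_scaling(required, available_gb=350))  # 280 True
```

With 20 GB of growth per day and 14 days of retention, 280 GB is required, which exceeds 75% of a 350 GB disk (262.5 GB), so scaling is indicated.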
Indicators for scaling Search
The Search component needs to be scaled when the overall product usage exceeds the hardware capacity and starts negatively impacting the Search component.
Use the following factors as indicators for scaling up the Search component:
| Factor | Indicators for scaling | Functions supported |
|---|---|---|
| Processor | Average CPU required exceeds 75%. | |
| Memory | Maximum Java heap size required is greater than 75% of the physical memory available. This is controlled by the | |
| Anomaly baseline job | Anomaly baseline job completion time is greater than 7 minutes and the Indexer is not CPU bottlenecked. This can be monitored through the itda.log file. | Anomaly detection |
Variables that impact hardware resources
Overall, the variables described in the following table determine the amount by which the resources required by the Collection Station, Indexer, and Search components are impacted. Your decision to scale depends on the amount by which the resources in your environment are affected by each of these usage factors.
For example, if you have a large volume of data to index, you need to scale the Collection Station and Indexer components. Because the level of impact for this usage factor is high, you should strongly consider scaling.
The following table provides information about variables that impact your hardware resources and thereby affect your decision to scale the Collection Station, Indexer, and Search components. The level of impact indicates the effect on the Java heap size and CPU, and is described as High, Medium, Low, or None.
Variables that determine impact to components that can be scaled
Variable | Collection Station | Indexer | Search |
---|---|---|---|
Large volume of data to index | High | High | None |
Large number of concurrent users actively using the product | None | High | High |
Large number of email notifications (containing reports) | None | None | High |
Large number of notifications | None | Medium | None |
Large number of fields added to the Filters panel | None | High | None |
Concurrent searches with large time ranges | None | High | None |
Large number of concurrent searches (without search commands) | None | High | Low |
Large number of concurrent simple search commands (dedup, group, filter, head, tail) | None | High | Medium |
Large number of concurrent advanced search commands (timechart, stats, top, rare) | None | High | Medium (Java heap size), Low (CPU) |
Higher data retention period (for Java heap size) | None | High | None |
Large number of data collectors | Medium | Medium | None |
Large number of dashboard (or dashlet) executions | None | High | Medium |
Large number of hosts | High | High | High |
Assessing the load handled by Collection Station or Indexer
You can assess the amount of load handled by the Collection Station or Indexer by finding out the indexing lag. The indexing lag is the time lag between the time at which data was collected and the time at which data was indexed, for each poll (made by the data collector).
This can be done by performing searches on the data available in the Collection_metrics.log file. For example, you can run the following search query:
_index=metrics * && engine=COLLECTION_STATION && indexing-lag=* |
head 100000 | filter indexing-lag > 300000
This search query displays results that have an indexing lag greater than five minutes. The indexing lag is indicated by the indexing-lag field in the search results. The value of this field is provided in milliseconds. The results displayed by running this search query indicate that the indexing activity is not able to keep pace with the collection activity.
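The underlying lag computation is simple to sketch outside the product: subtract the collection timestamp from the indexing timestamp for each poll and flag lags above the five-minute (300000 ms) threshold used in the query. The poll records here are hypothetical example data.

```python
# Compute indexing lag (indexed time minus collection time) per poll and flag
# polls whose lag exceeds five minutes (300000 ms).
# The poll records below are hypothetical example data.
polls = [
    {"collected_ms": 1_000_000, "indexed_ms": 1_450_000},  # lag 450000 ms
    {"collected_ms": 2_000_000, "indexed_ms": 2_120_000},  # lag 120000 ms
]

THRESHOLD_MS = 300_000  # five minutes, as in the search query above

lagging = [
    p["indexed_ms"] - p["collected_ms"]
    for p in polls
    if p["indexed_ms"] - p["collected_ms"] > THRESHOLD_MS
]
print(lagging)  # lags that indicate indexing is falling behind
```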
Examples
You can modify the search query in various ways depending on your goal.
Example 1: If you want to see results with an indexing lag greater than two minutes, change the string 300000 to 120000 in the search query, as follows:
_index=metrics * && engine=COLLECTION_STATION && indexing-lag=* |
head 100000 | filter indexing-lag > 120000
Example 2: If you want to use the search query to see the pattern in which the indexing lag occurred for the last seven days, then you can run the following search query with the time range set to Last 7 days.
_index=metrics * && engine=COLLECTION_STATION && indexing-lag=* |
head 100000 | filter indexing-lag > 300000 | timechart span=1d count(indexing-lag), avg(indexing-lag) by collectorid
In the preceding search query, collectorid represents the Collection Station identifier. For each collectorid, the search query displays the following information:
- Total number of polls where the indexing lag is greater than five minutes (represented by the count function).
- Average indexing lag obtained from the polls where the indexing lag is greater than five minutes (represented by the avg function).
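The count and avg aggregation per collectorid can be illustrated with a short sketch; the records here are hypothetical polls that already exceed the five-minute lag threshold.

```python
# Reproduce the count(indexing-lag) and avg(indexing-lag) aggregation per
# collectorid, as in the timechart query above.
# The records below are hypothetical example polls over the lag threshold.
from collections import defaultdict

records = [
    {"collectorid": "cs-1", "indexing_lag_ms": 400_000},
    {"collectorid": "cs-1", "indexing_lag_ms": 600_000},
    {"collectorid": "cs-2", "indexing_lag_ms": 350_000},
]

lags = defaultdict(list)
for r in records:
    lags[r["collectorid"]].append(r["indexing_lag_ms"])

# (count, average) of indexing lag per Collection Station
summary = {cid: (len(v), sum(v) / len(v)) for cid, v in lags.items()}
print(summary)  # {'cs-1': (2, 500000.0), 'cs-2': (1, 350000.0)}
```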
You can also monitor the search results obtained by running the preceding search queries by adding dashboards or notifications. To do this, you need to first save the search query. For more information, see Saving and sharing searches for analytics and monitoring.