Monday, May 18, 2015

Understand BPPM As A Decision Maker - Part 8: Implementation - combination of static and dynamic thresholds

We have gone through the details of static and dynamic thresholds in the last two posts.  In addition to setting static thresholds and dynamic thresholds separately, you can also combine them on a BPPM server to add more flexibility to your threshold settings.

The first option is to add a dynamic adjustment to a static threshold.  In order to do that, you must set your static threshold at BPPM server, not at each PATROL agent.  In addition to the severity, duration, and threshold value that are included in a normal static threshold, you can also add a dynamic adjustment here by specifying whether the threshold violation also has to be outside a baseline.  You can select the auto, hourly, daily, weekly, hourly+daily, or all baselines.

An example of the first option would be to set up a threshold for the number of login errors in the last data collection.  If you want to set a static threshold as 3, you may want to add 'outside auto baseline' as a dynamic adjustment so that the alert won't be raised if the baseline during that time of day (such as 9am) is 4.
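
The login error example above can be sketched in a few lines of Python.  This is only an illustration of the decision logic, not BPPM code: the function name is hypothetical, and representing the baseline as a single upper value is an assumption made to keep the sketch short.

```python
def static_with_baseline_alert(value, static_threshold, baseline_high):
    """Return True when an alert should be raised.

    Hypothetical model: the value must exceed the static threshold AND
    be outside (above) the baseline's upper value for that time of day.
    """
    violates_static = value > static_threshold
    outside_baseline = value > baseline_high
    return violates_static and outside_baseline


# 9am: the baseline is 4 login errors and the static threshold is 3.
print(static_with_baseline_alert(4, 3, 4))  # False: above 3 but within baseline
print(static_with_baseline_alert(6, 3, 4))  # True: above both
```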

The second option is to add a static adjustment to a dynamic threshold.  In addition to severity, duration, baseline, sampling window, absolute deviation, and percent deviation, you can also add a static adjustment here by specifying whether the threshold violation also has to violate a static threshold value.

An example of the second option would be to set up a threshold for CPU utilization.  If you want to set a dynamic threshold as outside of auto baseline for 10 minutes with percent deviation as 15%, you may want to add a threshold value 50 as a static adjustment so that the alert won't be raised when the CPU utilization is 45% for 10 minutes even if the baseline is 35%.
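
The CPU example above can be sketched as follows.  This is a simplified illustration, not BPPM code: the function name is hypothetical, the baseline is treated as a single value, the percent deviation is assumed to be relative to that value, and the 10-minute duration is left out.

```python
def dynamic_with_static_alert(value, baseline, pct_deviation, static_value):
    """Return True when an alert should be raised.

    Hypothetical model: the value must be above the baseline by more than
    pct_deviation percent AND also be above the static threshold value.
    """
    # Compare without dividing so the check is exact for whole numbers.
    outside_baseline = (value - baseline) * 100 > pct_deviation * baseline
    violates_static = value > static_value
    return outside_baseline and violates_static


# Baseline 35%, percent deviation 15%, static adjustment 50.
print(dynamic_with_static_alert(45, 35, 15, 50))  # False: outside baseline, below 50
print(dynamic_with_static_alert(55, 35, 15, 50))  # True: outside baseline and above 50
```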

You may wonder what the difference is between the first option and the second option.  When should you use static threshold with dynamic adjustment and when should you use dynamic threshold with static adjustment?

Dynamic threshold with static adjustment contains deviation in absolute value and in percent value.  This feature is not available with static thresholds.  Using deviation in a dynamic threshold gives you a cushion or buffer when comparing to a baseline.  I personally find this feature very useful and I use deviation in most of my dynamic thresholds with or without static adjustments.

Static threshold with dynamic adjustment contains a 'predict' feature.  This feature is not available with dynamic thresholds.  Using the 'predict' feature in a static threshold allows you to receive a predictive alert when an attribute with a fixed capacity is approaching its limit.  This is very useful for attributes such as disk space utilization.

As a decision maker, you will need to determine if you need to combine static thresholds and dynamic thresholds to add more flexibility to your thresholds.  If so, you will also need to decide which way to go: to add dynamic adjustment to a static threshold, or to add static adjustment to a dynamic threshold.

Tuesday, May 12, 2015

Understand BPPM As A Decision Maker - Part 7: Implementation - dynamic thresholds

As mentioned previously, a dynamic threshold doesn't have an absolute value by itself.  The threshold value is calculated on the fly based on historical data values from a specified time period (also called baseline). A dynamic threshold needs to contain the following details:

1) Duration: How long does the threshold need to be violated before an alert will be raised?  By default, the duration is 0, meaning as soon as the threshold is violated an alert will be raised immediately.

2) Baseline: You can choose hourly, daily, weekly, hourly & daily, and all baselines.  The default is auto baseline, meaning that BPPM server will automatically choose the best baseline for you.

3) Sampling Window: How long must a parameter/attribute value be collected before an alert can be raised?  The default is 10 minutes or 5 data points, whichever is longer.

4) Absolute Deviation: By how much, in absolute value, must the parameter/attribute value be above or below the threshold before an alert can be raised?  The default is 1.

5) Percent Deviation: By how much, in percentage, must the parameter/attribute value be above or below the threshold before an alert can be raised?  The default is 5%.

For example, you may want to set a dynamic threshold for your web transaction response time as follows: 1) Duration = 5 minutes; 2) Auto baseline; 3) Sampling window = 10 minutes; 4) Absolute Deviation = 1; 5) Percent Deviation = 40%.  If it normally takes 5 seconds to complete a web transaction during the same time of the day, but now it takes 7 seconds (40% more than 5 seconds) consistently for the last 5 minutes, an alert will be raised. 
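
The response time example above can be walked through with a small Python sketch.  It is an illustration under stated assumptions, not how BPPM computes anything internally: the function names are hypothetical, the baseline is a single value, both deviations are assumed to be required together, a deviation equal to the configured amount is assumed to count as a violation, and the sampling window is ignored.

```python
def deviates(value, baseline, abs_dev, pct_dev):
    """True when value differs from baseline by at least both deviations."""
    diff = abs(value - baseline)
    return diff >= abs_dev and diff * 100 >= pct_dev * baseline


def first_alert_minute(samples, baseline, abs_dev, pct_dev, duration_min):
    """Return the minute at which an alert would be raised, or None.

    samples is a list of (minute, value) pairs in time order.  The alert
    is raised once the violation has lasted duration_min minutes.
    """
    streak_start = None
    for minute, value in samples:
        if deviates(value, baseline, abs_dev, pct_dev):
            if streak_start is None:
                streak_start = minute
            if minute - streak_start >= duration_min:
                return minute
        else:
            streak_start = None  # violation interrupted, start over
    return None


# Normal response time is 5 seconds; from minute 4 onward it takes 7 seconds.
slow = [(m, 5.0) for m in range(4)] + [(m, 7.0) for m in range(4, 11)]
print(first_alert_minute(slow, 5.0, 1, 40, 5))   # 9

# A 3-minute spike does not last long enough to raise an alert.
spike = [(0, 5.0), (1, 7.0), (2, 7.0), (3, 7.0), (4, 5.0), (5, 5.0)]
print(first_alert_minute(spike, 5.0, 1, 40, 5))  # None
```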

As with a static threshold, a dynamic threshold can also have three different scopes: global, local, and instance.

Dynamic thresholds can only be set at BPPM server.  You can choose to use either BPPM operations console or CMA to set a dynamic threshold.  In BPPM operations console, you can use either options menu or tools menu.  In CMA, you can use global thresholds method or CMA policies.  If you use both BPPM operations console and CMA, the thresholds will be combined.  In case of conflict, the thresholds set by CMA will override the thresholds set by BPPM operations console.

In order to set a dynamic threshold in BPPM server, the parameter/attribute value must be stored in BPPM server database, meaning that the data must be streamed.  By default, all PATROL data are streamed to BPPM server database.  But you may want to filter out some data in order not to exceed the capacity of 1.7 million attributes per BPPM server.

As a decision maker, you will need to come up with a detailed specification (duration, baseline, sampling window, absolute and percent deviation) after you decide on using dynamic thresholds for some data.  You will need to decide if you need local or instance thresholds in addition to global thresholds.  You will also need to decide which method to use - BPPM operations console or CMA.

Tuesday, May 5, 2015

Understand BPPM As A Decision Maker - Part 6: Implementation - static thresholds

In the previous post, we discussed static thresholds and dynamic thresholds in general.  Since there are many different variations of static thresholds, we are going to look into the details.

A static threshold can have three different scopes: global, local, and instance.

A static threshold with global scope applies to all servers and all instances in your environment.  For example, a global critical threshold with service status = 3 means that if the parameter status is equal to 3 for any service running on any server, a critical alert will be raised.

A static threshold with local scope applies to one particular server.  For example, a local critical threshold with free disk space percentage <15 means that if the parameter 'free disk space percentage' is below 15% for any disk running on this particular server, a critical alert will be raised.  A local threshold will always override the global threshold.  In this example, the global threshold could be free disk space percentage <10.  But because the applications running on this particular server tend to fill up disk space much faster than other servers, you may want to use a more conservative local threshold.
 
A static threshold with instance scope applies to one particular instance.  For example, an instance critical threshold with free disk space percentage <20 means that if the parameter 'free disk space percentage' is below 20% for one particular disk (e.g. C drive) running on any server, a critical alert will be raised.  An instance threshold will always override the global threshold.  In this example, the global threshold could be free disk space percentage <10.  But because C drive is usually smaller and more critical to keep the server up than other drives, you may want to use a more conservative instance threshold.
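
The three scopes can be pictured as a lookup that falls back from the most specific setting to the global one.  The sketch below is hypothetical: the function, server names, and dictionaries are made up for illustration, and checking the instance threshold before the local one is an assumption (the only rule stated above is that local and instance thresholds each override the global threshold).

```python
def resolve_free_disk_threshold(server, disk, global_pct, local_by_server, by_instance):
    """Pick the free disk space threshold (in percent) that applies."""
    if disk in by_instance:          # instance scope, e.g. the C drive
        return by_instance[disk]
    if server in local_by_server:    # local scope, one particular server
        return local_by_server[server]
    return global_pct                # global scope, everything else


GLOBAL_PCT = 10
LOCAL = {"appsrv01": 15}   # hypothetical server that fills its disks quickly
INSTANCE = {"C:": 20}      # C drive is smaller and more critical

print(resolve_free_disk_threshold("websrv02", "D:", GLOBAL_PCT, LOCAL, INSTANCE))  # 10
print(resolve_free_disk_threshold("appsrv01", "D:", GLOBAL_PCT, LOCAL, INSTANCE))  # 15
print(resolve_free_disk_threshold("websrv02", "C:", GLOBAL_PCT, LOCAL, INSTANCE))  # 20
```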
 
As we mentioned in the previous post, a static threshold can be configured at each PATROL agent or at BPPM server or at both places.  And BPPM does not relate the static thresholds configured at each PATROL agent with the ones at BPPM server.  If you decide to configure static thresholds at both PATROL agents and BPPM server, you need to keep track of them manually so there won't be any gap or overlap.

You may want to ask: Why not just configure all static thresholds in BPPM server?   There are two major limitations for this approach.

The first limitation is that each BPPM server can only store 1,700,000 attributes/parameters in its database.  If you have a large environment, you can only store a small subset of your parameters in BPPM server database.  In order to configure a static threshold for a parameter in BPPM server, this parameter must be stored in BPPM server database. 

The second limitation is that BPPM server still doesn't have an application-level quick fail-over architecture. If BPPM server becomes unavailable, no threshold can be applied and thus no alert can be raised until the OS-based secondary BPPM server is up - which usually takes 10 minutes or longer.

Some BMC customers with small environments in non-critical businesses did choose to configure all static thresholds in BPPM server.  So if that is doable in your environment, you can absolutely configure all static thresholds in BPPM server.

There is another aspect of static thresholds that you can set: duration - how long the threshold has to be violated before raising the alert.

If you set static thresholds at each PATROL agent, the duration is represented by the number of polling cycles.  To set your desired duration for a particular parameter, you must know the default polling cycle for that parameter and reset the polling cycle if the default one does not meet your needs.  The polling cycle for a parameter determines how often (in seconds) the parameter value will be collected.  The combination of polling cycle and the number of polling cycles determines the threshold duration in seconds.

If you set static thresholds at BPPM server, the duration is represented by the number of minutes, so the polling cycle is not needed.
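
As a rough worked example of that arithmetic, the sketch below assumes the duration is simply the polling cycle multiplied by the number of cycles.  The function names and the sample numbers are hypothetical.

```python
def duration_seconds(polling_cycle_s, cycles):
    """Threshold duration implied by a polling cycle and a cycle count."""
    return polling_cycle_s * cycles


def cycles_needed(polling_cycle_s, desired_duration_s):
    """Smallest number of polling cycles that covers the desired duration."""
    return -(-desired_duration_s // polling_cycle_s)  # integer ceiling


print(duration_seconds(60, 5))   # 300 seconds with a 60-second polling cycle
print(cycles_needed(120, 600))   # 5 cycles of 120 seconds cover 10 minutes
print(cycles_needed(90, 600))    # 7 cycles, because 6 cycles are only 540 seconds
```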

Finally if you set static thresholds at each PATROL agent, you can choose to use either pconfig method or CMA method.  In pconfig method, you use either PCM (PATROL Configuration Manager) or pconfig scripts.  In CMA method, you use CMA policies.  If you use both, the thresholds will be combined.  In case of conflict, the thresholds set by CMA method will override the thresholds set by pconfig method.
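
The 'combined, with CMA winning on conflict' behavior can be pictured as a dictionary merge.  This is only a mental model: the parameter names and values are made up, and the real configuration stores are not Python dictionaries.

```python
# Hypothetical thresholds set through each method (parameter name -> value).
pconfig_thresholds = {"CPU.util": 90, "DISK.free_pct": 10}
cma_thresholds = {"DISK.free_pct": 15, "MEM.util": 85}

# Later entries win, so CMA overrides pconfig where both set the same parameter.
effective = {**pconfig_thresholds, **cma_thresholds}

print(effective)  # {'CPU.util': 90, 'DISK.free_pct': 15, 'MEM.util': 85}
```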

If you set dynamic thresholds at BPPM server, you can choose to use either BPPM operations console or CMA.  In BPPM operations console, you can use either options menu or tools menu.  In CMA, you can use global thresholds method or CMA policies.  If you use both BPPM operations console and CMA, the thresholds will be combined.  In case of conflict, the thresholds set by CMA will override the thresholds set by BPPM operations console. 

As a decision maker, you can tell by now that there are a lot more decisions to make after you decide on using static thresholds for some data.  You will need to decide if you need local or instance thresholds in addition to global thresholds.  You will need to decide where you want to set them - at each PATROL agent or at BPPM server.  You will need to decide threshold durations.  To set static thresholds in PATROL agents, you will need to decide which method to use - pconfig or CMA. To set dynamic thresholds in BPPM server, you will need to decide which method to use - BPPM operations console or CMA.

Tuesday, April 28, 2015

Understand BPPM As A Decision Maker - Part 5: Implementation - thresholds

Once you have decided what to do with installation, console, and data, your next decision is about thresholds.  In my personal opinion, thresholds are the heart of enterprise system management because they determine what alerts and how many alerts you are going to receive.

A threshold is associated with a direction (above or below) and a severity.  If a threshold's direction is above and the severity is critical, when the parameter/attribute value is above the threshold, the threshold is violated and a critical alert is generated.  If a threshold's direction is below and the severity is warning, when the parameter/attribute value is below the threshold, the threshold is violated and a warning alert is generated.
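
The direction and severity rule can be written down as a small Python sketch.  The class and function names are hypothetical and exist only to illustrate the rule described above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Threshold:
    value: float
    direction: str  # "above" or "below"
    severity: str   # e.g. "warning" or "critical"


def check(threshold: Threshold, reading: float) -> Optional[str]:
    """Return the alert severity if the reading violates the threshold."""
    if threshold.direction == "above":
        violated = reading > threshold.value
    elif threshold.direction == "below":
        violated = reading < threshold.value
    else:
        raise ValueError(f"unknown direction: {threshold.direction}")
    return threshold.severity if violated else None


cpu_critical = Threshold(value=90, direction="above", severity="critical")
disk_warning = Threshold(value=15, direction="below", severity="warning")

print(check(cpu_critical, 95))  # critical
print(check(disk_warning, 12))  # warning
print(check(disk_warning, 40))  # None
```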

If you are familiar with thresholds in general, keep in mind that thresholds in BPPM are more complicated than in most other enterprise system management software for historical reasons.

The data collection agent PATROL was initially architected in 1995 as a completely self-contained system with its own local storage, thresholds, and alert system.  In other words, thresholds can be set in each PATROL agent.

When BMC merged PATROL, BMC Event Manager, and ProactiveNet into BPPM as one product, all or part of the data collected by PATROL agents can be stored in BPPM server database as a duplicate.  Therefore, thresholds can also be set at BPPM server in addition to each PATROL agent.

There are two kinds of thresholds: static thresholds and dynamic thresholds.  Static thresholds can be set in either PATROL agents or in BPPM server or both.  Static thresholds set in PATROL agents and in BPPM server work independently.  If you set static thresholds in both places, it is a manual effort to make sure there is no gap and no overlap between them.  Dynamic thresholds can only be set in BPPM server.

A static threshold has an absolute value.  For example, you can set your free disk percentage threshold at 10% so that you will receive an alert when you have less than 10% free disk space left.  There are many different variations for static thresholds that we will discuss in detail in the next post.

A dynamic threshold doesn't have an absolute value by itself.  The threshold value is calculated on the fly based on historical data values from a specified time period (for example, hourly, daily, weekly, etc.).  Dynamic threshold is also called baseline.  For example, you can set your CPU utilization threshold as 10% above hourly baseline so that you will receive a high CPU alert when your current CPU utilization is more than 10% above the historical average value for the same period of the day.  There are many different variations for dynamic thresholds and a dynamic threshold can also be combined with a static threshold to make it more flexible.  We will discuss dynamic thresholds in more detail in a later post.
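
To make the hourly baseline idea concrete, here is a minimal Python sketch.  It assumes the baseline is a plain average of past values for the same hour of the day and that '10% above' is relative to that average.  The function names and sample data are hypothetical, and the actual baseline calculation in BPPM may differ.

```python
from statistics import mean


def hourly_baseline(history, hour):
    """Average of past values recorded during the given hour of the day.

    history is a list of (hour_of_day, value) pairs.
    """
    return mean(value for h, value in history if h == hour)


def above_baseline(current, baseline, pct):
    """True when current is more than pct percent above the baseline."""
    return (current - baseline) * 100 > pct * baseline


# Hypothetical CPU utilization history: 9am is usually quiet, 2pm is busy.
history = [(9, 40), (9, 44), (9, 36), (14, 70), (14, 74)]

baseline_9am = hourly_baseline(history, 9)
print(baseline_9am)                          # 40
print(above_baseline(45, baseline_9am, 10))  # True: more than 10% above 40
print(above_baseline(43, baseline_9am, 10))  # False: within 10% of 40
```

The same 45% reading at 2pm would not raise an alert in this sketch, because the baseline for that hour is much higher.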

As a decision maker, the first thing to do is to determine on what types of data you want to set static thresholds and on what types of data you want to set dynamic thresholds.  Examples of data types include: availability (status), number of errors/failures, percentage of errors/failures, percentage of capacity utilization, resource utilization per server, resource utilization per component, response time, and wait time.  Hold a brainstorming session with your implementation team and your end users to list all types of data collected in your environment.  Then make a decision on how you want to set their thresholds.

Monday, April 20, 2015

Understand BPPM As A Decision Maker - Part 4: Implementation - console and data

Once you have decided what to do with installation of BPPM server(s), BPPM integration service(s), BPPM cells, and PATROL local/remote agents, your next decision is about console - the user interface that your operations support staff will use to interact with BPPM on a daily basis.  And a related decision is about data - what data you want to save in BPPM server database.

When you install BPPM server, you automatically install BPPM operations console, a web interface that displays data and events.  From a historical point of view, BPPM operations console evolved from the native ProactiveNet console, as BPPM is an integrated product from PATROL, BMC Event Manager, and ProactiveNet.  What about native PATROL console?  Do you need to install it?

It is BMC's intention to replace PATROL console with BPPM operations console.  At this time, BPPM operations console still cannot completely replace all the features available in PATROL console, though your operations support staff can perform most of their work with BPPM operations console.  However, menu commands that help you configure PATROL KMs interactively or diagnose system issues are only available with PATROL console.

A key differentiator here is where data are stored.  PATROL was initially architected back in 1995 when networks were not as fast and reliable as they are today.  All PATROL data are stored locally.  Since PATROL was integrated into BPPM, selected or all PATROL data can be saved in BPPM server database.

When you are using PATROL console, you are viewing PATROL data stored on each PATROL agent.  When you are using BPPM operations console, you are viewing PATROL data stored in BPPM server database.  You can use BPPM operations console to view PATROL data stored on each PATROL agent (but not in BPPM server database) as 'non-streamed' data on demand, but based on my observation the data are consistently about 10 minutes old.

Along with a decision on PATROL console, you must decide what PATROL data you want to save in BPPM server database.  There are three major reasons to save PATROL data in BPPM server database: 1) To compute dynamic thresholds - Dynamic thresholds will be discussed in a later post; 2) To be included in reports - Data in report database are based on data in BPPM server database; 3) To see real-time (not 10-minute-old) data in BPPM operations console.

It would be nice to save all PATROL data in BPPM server database so that you don't have to decide which data to eliminate.  If you have a small site, this is entirely possible.  But if you have a large site, you must consider the cost and complexity of multiple BPPM servers as each BPPM server can only contain 1,700,000 attributes or 250,000 instances.  It is a trade-off between having more PATROL data in BPPM server database and keeping the cost of BPPM servers under control.
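
A back-of-the-envelope estimate of how many BPPM servers the stated limits imply can be sketched as follows.  The function name and the sample environment are hypothetical, and the estimate ignores headroom for growth and any other sizing factors.

```python
MAX_ATTRIBUTES_PER_SERVER = 1_700_000
MAX_INSTANCES_PER_SERVER = 250_000


def min_bppm_servers(total_attributes, total_instances):
    """Smallest server count that stays within both per-server limits."""
    by_attributes = -(-total_attributes // MAX_ATTRIBUTES_PER_SERVER)  # ceiling
    by_instances = -(-total_instances // MAX_INSTANCES_PER_SERVER)     # ceiling
    return max(by_attributes, by_instances, 1)


# Hypothetical large site: 4,000,000 attributes across 300,000 instances.
print(min_bppm_servers(4_000_000, 300_000))  # 3, driven by the attribute limit
# Hypothetical small site that fits on one server.
print(min_bppm_servers(500_000, 60_000))     # 1
```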

In summary, in order to decide if PATROL console should be implemented, you need to have an in-depth discussion with your implementation team (consultants or employees) and operations staff regarding the requirement of PATROL console for PATROL KM configuration, their preferred tools for troubleshooting, and guidelines on what data should be stored in BPPM server database.

Monday, April 13, 2015

Understand BPPM As A Decision Maker - Part 3: Implementation - installation

Many people tend to believe that BPPM implementation is simply BPPM installation.  In fact, installation is only a small part of implementation.  We will discuss some key points about installation here and leave other parts of implementation to the next few posts.

Keep in mind that implementation doesn't start with installation.  It starts with planning.

1) Do not start installation before capacity planning.  You need to decide how many BPPM servers, how many BPPM cells, and how many PATROL agents you need to install.  Not only do you need to consider your current business capability, you also need to consider future growth.

2) Do not start installation before high availability planning.  You need to decide if you want to have fail-over capability for BPPM servers, BPPM cells, and BPPM remote monitoring agents.  If high availability is desired, you also need to decide the level of high availability (application, OS, or VM level). You must understand what kind of protection each level of high availability offers you.  High availability decision is first determined by your business requirements and then determined by your implementation budget and maintenance budget.

3) Do not start installation before remote monitoring planning.  Although most monitoring happens locally, some monitoring can only be done remotely such as VMware and ping monitoring, and some monitoring gives you the option between local and remote monitoring such as OS monitoring.  There are pros and cons between local and remote monitoring that you must understand.

4) Do not start installation before you decide how you want to install each PATROL agent.  BPPM uses PATROL agent to collect data.  Although some collection can happen remotely, the majority of data must be collected locally.  Installing hundreds and thousands of PATROL agents is labor-intensive work, though different installation methods have different requirements.  You can choose manual installation, CMA based installation, server duplication, using old PATROL Distribution Server, using BMC Client Management software, or using BMC BladeLogic client, etc.  You need to understand how long each installation method will take you and if there is an additional software license you must purchase.

5) Do not start installation before you decide how you want to assign agent tags to each PATROL agent.  An agent tag tells what information should be monitored on each server.  For example, your AIX server with Oracle database running should have an agent tag for UNIX operating system and another agent tag for Oracle database.  There are different ways to assign agent tags to each server including as part of installation package, as a post installation script, or through CMA.  But you need to have the decision made in advance - this is what I refer to as 'framework'.

6) Do not start installation before you decide on a naming convention for each BPPM component in your environment.  Without having a consistent way to name each server, cell, integration service, and configuration file in your dev, QA, and production environment, sooner or later, you will find yourself in a big mess.  This is another thing I refer to as 'framework'.

Your implementation team (consultants or employees) should be able to describe to you various options as well as pros and cons in the above decisions.  You, as a decision maker, should make the decision together with your implementation team on whether or not and how you should go with BPPM from a resource point of view.

Tuesday, April 7, 2015

Understand BPPM As A Decision Maker - Part 2: Required skills

In the previous post, I listed five aspects to be considered for any enterprise monitoring software.  Because each aspect has a different skill set requirement, it can potentially require separate resources, though some resources can be shared.

1) Implementation - requires designing, capacity planning, installation, and framework creation skills.
2) Development - requires designing, coding, and framework creation skills.
3) Integration - requires coding, 3rd-party software, and framework creation skills.
4) Administration - requires user interaction, configuration, and framework following skills.
5) Operation - requires user interaction, monitoring, and procedure following skills.

In general, aspects 1), 2) and 3) are the responsibilities of one (or more) implementation team(s).  Aspects 4) and 5) are the responsibilities of separate (or combined) administration and operations teams.  In this post, I want to spend some time addressing how they relate to each other.  To make it more intuitive, I am using my previous 'home construction' example again.  In home construction, here is what each aspect looks like:

1) Implementation - drawing blueprints and building a house
2) Development - replacing a standard jacuzzi bathtub with a wheelchair-accessible shower
3) Integration - installing solar panels on the roof
4) Administration - installing ceiling fans or repairing a leaking kitchen sink
5) Operation - cleaning the floor when it is dirty or turning on/off porch lights at dusk/dawn every day

For resources, here is what each aspect looks like:

1) Implementation - builder
2) Development - specialized crew from the same builder
3) Integration - specialized crew from the same builder or another company
4) Administration - handyman
5) Operation - homeowner

By the above comparison, here are some general points to help you understand BPPM as a decision maker:

1) Begin with the end in mind: How do you want to use enterprise monitoring software in your IT organization?  For example, do you want your operations team to use this tool to perform root cause analysis, or do you want to leave root cause analysis to the assigned system administrator or DBA?  The more details you know about how you will be using the monitoring software, the easier your decision will be.

2) Good implementation is crucial: Implementation is like the plumbing of a house.  It would be very expensive to add another bathroom if the plumbing wasn't there in the first place.  If the implementation is not done correctly, you may not have any other choice but to redo the entire implementation.  For example, one of my previous clients handed me a partially implemented environment where BPPM and Entuity were running on the same server.  I could not find any way to repair it and had to redo the entire implementation on two separate servers.

3) Not all BPPM experiences are the same: BPPM experience in implementation and BPPM experience in administration are two different kinds of experience though they share some common technical skill sets.  An implementation team's goal is to create a consistent framework that the administration team and operation team can adopt quickly without consulting them, so the implementation team can move on to another implementation project.  An administration team's goal is to use special knowledge in configuration so that they have 'job security' to stay forever.

4) One time vs repetitive work: Aspects 1), 2), and 3) are part of an implementation project.  They are a one-time cost.  Aspects 4) and 5) are repetitive operations.  They are a recurring cost.  If an implementation project is done right, it can cut down a tremendous amount of recurring operation cost because the amount of work required for administrators and operators is minimized.  The recent trend from my observation seems to indicate that many organizations didn't have their BPPM implementation done right and now they are trying to hire the best full-time administrators to make up for it.  Keep in mind, the supply of 'super handyman' is very limited.  At some point, it may cost less to redo the incorrect implementation than to rely on a permanent 'super handyman'.