Tuesday, April 19, 2011

CMG in the Clouds
April Fools’ Day this year brought with it the Rocky Mountain CMG conference. It had the best view of any CMG show I have attended.

David Halbig from First Data Corp. (also the guy who organizes the show) kicked off with a surprisingly refreshing presentation for a CMG region. He talked about the cloud in an extremely practical way and laid out how to manage performance in it.
David laid out some very compelling examples of issues that can potentially emerge when moving to the cloud. David broke up the monitoring approach into:
1. Continuous Monitoring
2. Specialized – intermittent use monitoring (the fire hydrant)
3. Middleware monitoring (application deep dive)
4. Business Transaction Monitoring – for a cross tier understanding of what is going on
David talked about how many vendors say that they are “BTM capable” – but really do not have the “real thing” – his “sure fire” list of requirements included:
• Horizontal view of aggregate and single transactions across all tiers of interest
• Resource consumption information from each monitored tier – basically saying that a network tap based solution cannot cut it.
• Auto-discovery of transaction path (said that some products needed to be “trained” on transaction paths)
• Capture path to non-monitored tiers
• Continuous operation at volume
• Low transaction path overhead
We (Correlsense) also presented, focusing on how cloud (public/private), Agile, SOA and virtualization create an ever-changing environment where the only things that stay the same are the user’s expectation of sub-second response times and the transactions that run the business.
Although, as usual, the vast majority of the audience was not anticipating any migration to the cloud in the near future, the sessions were very informative because they dealt with managing performance in a constantly changing environment, something that everyone has to deal with.

Tuesday, April 5, 2011

Improving Conversion Rates

There is a clear correlation between the time it takes a landing page to load and the conversion rate of the call to action on that page. We now know that the time it takes to load your site will affect not only your conversion rate but your ranking at Google as well. If you agree with these statements, continue reading... If not, I suggest you read this article and this one as well.

Today, the main optimization tools deal with questions like "where is the traffic coming from" and "what are people doing on my site before they convert". The traditional toolset available to the web analyst includes tools like Google Analytics, which provide information about where visitors are coming from; funnel analytics tools, which provide information about the steps in your acquisition funnel; A/B and MVT testing tools for running experiments on the traffic to your site; and user recording and life-cycle tools that let you view your users' flow.

None of the above deals with site performance, even though we all agree it carries significant weight in your conversion rate.

A few months ago, one of our clients decided to put his site on a CDN. His optimization manager told him that, like any other change made to the site, they should run a test and evaluate the ROI of the change. They agreed to run a simple A/B test on their Brazilian display traffic (traffic from banners): 50% would be redirected to the regular site and 50% to the CDN version.

Before starting the test, they went to their IT manager and asked him to check, using his monitoring tool, how long each version took to load. The IT manager did not have a robot/agent in Brazil, so he bought a virtual server there, installed the agent, and came back two weeks later with the following results: the original site took 25 seconds to load, while the new version took 6 seconds. Although 19 seconds is clearly a huge difference, they decided to run the test anyway.

A few days later, they got the results: the conversion rate of the new version was 65% LOWER than that of the original version... Based on the results, it appeared that the longer the user waited, the more likely he was to convert (original version: 25 seconds to load, 6.4% conversion rate; new version: 6 seconds to load, 2.2% conversion rate)...

This is obviously wrong. Looking at the number of pageviews for each variation, they noticed that the original version had far fewer pageviews than the new version (about 1:3, when it should have been 1:1). This led them to the conclusion that, because of the load time, most users left the page before it was fully loaded and before the web analytics tool sent the event to the server. The users who did wait 25 seconds for the page were more likely to convert...
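
The arithmetic behind this conclusion can be sketched in a few lines of Python. This is a back-of-the-envelope calculation, not the client's actual analysis. It assumes the 50/50 split really delivered equal traffic to both versions and that the CDN version recorded a pageview for every arriving visitor, so a 1:3 pageview ratio means only about a third of the original version's visitors were ever counted. The function names are illustrative.

```python
def relative_drop(baseline_rate, new_rate):
    """Relative decrease of new_rate compared with baseline_rate."""
    return (baseline_rate - new_rate) / baseline_rate


def rate_per_arrival(measured_rate, recorded_share):
    """Conversion rate per arriving visitor.

    measured_rate is conversions / recorded pageviews; recorded_share is the
    fraction of arriving visitors who stayed long enough to be recorded.
    """
    return measured_rate * recorded_share


# Figures from the story above.
original_measured = 0.064  # 25-second page
cdn_measured = 0.022       # 6-second page

# Assumption: equal traffic to both versions, the CDN version records every
# visitor, and pageviews are 1:3, so the original records about 1 in 3.
original_recorded_share = 1 / 3
cdn_recorded_share = 1.0

drop = relative_drop(original_measured, cdn_measured)
original_true = rate_per_arrival(original_measured, original_recorded_share)
cdn_true = rate_per_arrival(cdn_measured, cdn_recorded_share)

print(f"Apparent drop: {drop:.1%}")
print(f"Original, per arriving visitor: {original_true:.2%}")
print(f"CDN, per arriving visitor: {cdn_true:.2%}")
```

Under these assumptions the per-visitor rates come out roughly equal (about 2.1% versus 2.2%), which suggests the apparent 65% drop is largely an artifact of which visitors get counted.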

Another thing they found was that when they measured the actual time it took a visitor to load the page, it was much more than 25 seconds (closer to 40 seconds). The IT manager's measurement did not include the rendering time or the actual time it took the visitor to download all the files (relying on the network speed determined by the server will never be accurate).
You can assume visitors who are waiting 40 seconds for a page are really interested in its content...

The main things you should take from this story are:
- Page load time and abandonment rate during page load should be part of the basic metrics your analysts are using
- You should have a real site performance analytics tool that will capture 100% of your traffic and not count on robots or other sniffing tools
- You have all kinds of KPIs that provide you with a clear view of all of your SLAs and response times through the funnel, except for one of the most important steps: when the visitor arrives at your store.

RUM measures the load time of each and every request for 100% of your traffic, from all around the world, and in the most accurate way. It does not rely on robots and synthetic requests, nor does it calculate the processing time on the server, use the size of the response, or guess the network speed of your visitors. Instead, it uses both client and server agents to calculate the actual time from the moment the user requested the page until the page was fully rendered. It also tells you how many visitors left before getting the page (and before you got a pageview event to your web analytics tool). RUM has a simple yet powerful user interface that was designed not only for IT people.
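
To make the idea concrete, here is a minimal sketch of how load time and abandonment could be derived from per-visit timestamps. It is not how the RUM tool is implemented. The `Visit` record, its field names, and the sample values are all hypothetical, and it assumes a request timestamp plus an optional "fully rendered" timestamp reported by the client.

```python
from dataclasses import dataclass
from statistics import median
from typing import List, Optional


@dataclass
class Visit:
    """One page request (hypothetical record layout)."""
    visit_id: str
    requested_at: float            # seconds; when the page was requested
    rendered_at: Optional[float]   # seconds; None if the visitor left early


def summarize(visits: List[Visit]) -> dict:
    """Return median load time and the share of visits abandoned mid-load."""
    load_times = [v.rendered_at - v.requested_at
                  for v in visits if v.rendered_at is not None]
    abandoned = sum(1 for v in visits if v.rendered_at is None)
    return {
        "median_load_seconds": median(load_times) if load_times else None,
        "abandonment_rate": abandoned / len(visits) if visits else 0.0,
    }


# Made-up sample: two visitors waited for the page, two gave up.
sample = [
    Visit("a", 0.0, 38.0),
    Visit("b", 10.0, 52.0),
    Visit("c", 20.0, None),
    Visit("d", 30.0, None),
]
print(summarize(sample))
```

The point of the sketch is that visits without a "rendered" timestamp are kept and counted rather than silently dropped, which is exactly the information a pageview-based analytics tool never receives.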

Please contact us to schedule a demo or download the free edition of the tool.

Monday, November 8, 2010

Videos: Business Transaction Management by Correlsense

Correlsense is an industry leader when it comes to helping enterprises achieve IT Reliability. With their product, SharePath, they are changing the way enterprise companies monitor applications. In these videos, Lanir Shacham, CTO and founder, talks about what Correlsense is all about and what SharePath does:

Wednesday, August 25, 2010

More Information on Business Transaction Management

The market has continued to grow. There are now free options out there to get your toes wet with some software. One definitely worth checking out is the free Real User Monitoring tool from Correlsense. Make sure to check them out on the web and also on Twitter and Facebook.

Also check out this video from CEO Oren Elias about their free Real User Monitoring offer.

Monday, February 8, 2010

Why IT Operations is like an Action TV Series

I like watching the series "24", though I can’t really explain why. Every time they nearly get the bad guys something goes wrong, there’s some sort of twist in the plot and they need to start all over again. For example, I’m sure that you are familiar with the following classic scene: The CTU (Counter Terrorism Unit) chopper is following a suspect who is driving a black van. The suspect’s van enters a tunnel, but the van doesn’t leave the tunnel. Instead, a number of different vehicles leave the tunnel at the same time, and the suspect is probably in one of them. By the time they figure out that the black van has been left empty in the tunnel, they have already lost the suspect. They shout “We have lost visual!!!” and are back to looking for the bad guys all over again - then they call Jack Bauer…

IT Operations is just like the CTU; the CTU is responsible for making sure that life goes on without any unpleasant surprises. Similarly, IT Operations needs to do the same in its own space and make sure that the business keeps on running and that business transactions are being executed properly and on time.

When something is about to go wrong, the CTU and IT Operations are expected to prevent it before it affects anyone. So they set up the war room, call everyone in, and start doing their detective work to find the needle in the haystack. If they don’t find it and something goes wrong then the results are significant; either people get hurt (in the CTU's case), or business is impacted.

IT Operations' War Chest

So which tools could IT Operations use to find out that there is a problem, identify the root cause of it and resolve the issue?

For example, IT Operations could use HTTP network appliances that help see every HTTP transaction and measure its response time. These network appliances are just like the CTU's choppers, they do not have adequate visibility into the datacenter. They can indicate that something is wrong with the response time of a transaction, but they cannot show why the response time of the transaction is high and cannot provide the visibility needed for resolution.

IT Operations also uses Event Correlation and Analysis (ECA) tools. ECA tools are like CSI detectives (yes… that’s another one I watch…), and rely on other tools to collect information for them, just like a CSI detective who collects evidence from a crime scene. ECA tools are only as effective as the products that they rely on to provide them with the data. The issue with ECA tools is that, just like at a crime scene, the thief does not usually leave his ID behind, so all you are left with is clues, and no accurate data to work with.

Additional tools that IT Ops relies on are: dashboards that monitor server resource consumption, J2EE/.Net tools that are capable of performing drill-down diagnostics in application and database layers, synthetic transaction tools and Real User Measurement (RUM) tools. With all of these monitoring tools, IT Operations still finds itself in a situation where all lights are green while users are complaining about bad response times. In spite of all of the investment in monitoring tools, the infrastructure that IT Operations is accountable for is still unpredictable. Why?

A Simple Example

Perhaps it’s best to take a look at this classic example: One of our customers had a problem with a wire-transfer transaction. The liability for the problem kept on going back and forth between Operations and Applications, who were pointing fingers at each other as to who was responsible for the issue. “All lights are green” said Operations, “We tested the application and it works just fine” said Applications. Simply put, no existing monitoring tool could point out the problem.

So what was the problem? The answer is simple; it appears that, by design, wire-transfers for over $100K were querying the mainframe nearly 100 times, while other transfers would query it only a couple of times. Same end-user, same application, same transaction, but just a single parameter made the transaction take a whole different path, and made the difference between a 3 second and a 2 minute response time.

What Exactly Are You Monitoring?

Now the question is: why can't existing monitoring tools identify the problem? The reason is simple. Traditional monitoring tools monitor the infrastructure and not the transactions. In a complex heterogeneous infrastructure, there are many tools for monitoring each and every component, but no single spinal cord that is able to show how transactions behave across components. None of the tools are able to deterministically correlate a single request coming into a server with all of the associated requests going out of a server and keep on doing so throughout the Transaction Path. Just like the chopper which could not figure out which of the vehicles coming out of the tunnel contained the suspect who came into the tunnel in the first place.
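
As an illustration of what such correlation makes possible, the sketch below groups per-tier segments by a shared transaction ID and counts calls per tier. It is a toy model, not a description of how any BTM product works: the `Segment` tuple, the tier names, and the sample data are hypothetical, and it simply assumes every segment already carries the ID of the transaction activation that caused it.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, NamedTuple


class Segment(NamedTuple):
    """One hop of a transaction through a tier (hypothetical layout)."""
    transaction_id: str
    tier: str
    elapsed_seconds: float


def calls_per_tier(segments: Iterable[Segment]) -> Dict[str, Counter]:
    """Map each transaction ID to a count of segments seen on each tier."""
    paths: Dict[str, Counter] = defaultdict(Counter)
    for seg in segments:
        paths[seg.transaction_id][seg.tier] += 1
    return dict(paths)


# Made-up data echoing the wire-transfer story: a small transfer queries the
# mainframe twice, a large one queries it about a hundred times.
segments = [Segment("small-transfer", "web", 0.1),
            Segment("large-transfer", "web", 0.1)]
segments += [Segment("small-transfer", "mainframe", 1.0)] * 2
segments += [Segment("large-transfer", "mainframe", 1.0)] * 100

for tx_id, tiers in calls_per_tier(segments).items():
    print(tx_id, dict(tiers))
```

With per-tier infrastructure monitoring alone, both transfers look like the same healthy application. The difference only shows up once segments are tied to the transaction that triggered them.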

This situation raises some strategic questions regarding your monitoring approach. How effective is a monitoring framework without that business context? Are you supposed just to make sure the servers are up and running and applications are responding, or is your real goal to make sure that the business transactions are being executed as intended and on time?

“In God we trust; all others must bring data”

Applications are tricky, transactions are tricky, and they become even trickier in a complex heterogeneous infrastructure that is composed of multiple platforms, operating systems, application nodes, tiers, databases and where communication between components is in different protocols back and forth for every single click of a button by an end-user.

Only by being able to trace each and every single transaction activation throughout its entire path - 100% of the time, for all transactions, across all components - will you be able to systematically collect the granular information necessary to get business-contextualized visibility into your datacenter. This kind of visibility is a key factor in being able to identify problems effectively when, or even before, they arise.

W. Edwards Deming said: “In God we trust; all others must bring data”. I think he was absolutely right. IT Operations can use choppers, or CSI crime-lab detectives, or Jack Bauers. They all have their roles, but when it comes to fast and effective problem identification, as well as many other IT-related decision-making processes (that’s a whole different article…), real accurate data is required – no partial data, no assumptions.

Business Transaction Management provides you with that data, and by doing so, it provides your IT Organization with visibility and predictability. Wouldn’t it be great if you could go to sleep at night knowing that your infrastructure is reliable? That is, unless you want to play the role of the CTU Director…

The 8th season of 24 premieres on January 17th, 2010. 'Till then – why don’t you get yourself a Business Transaction Management solution…

Wednesday, July 15, 2009

IT Reliability through Business Transaction Management

...Continued from the last post
Enabling Reliable IT – Managing Performance
How do you know when an end user is experiencing bad response times?
  • They call into the help desk to complain – usually only after a number of past events where they were unhappy with the application’s reliability.
  • An end user monitoring tool measures bad response times
End User Measurements
There are a great number of tools on the market today that perform this task in a variety of ways; below is a summary of the different approaches.
Software Based Real User Measurements
Desktop Agent
  • Strength – enables the monitoring of the end user’s desktop and can measure response times for fat client based applications
  • Weakness – must be installed at each desktop
JavaScript Injection
  • Strength – no need for end user installation
  • Weakness – JavaScript needs to be added to web application code
Browser Plug-In
  • Strength – easy installation without code modification
  • Weakness – still requires end user installation
Network Appliance Based Real User Measurements
All of these solutions utilize a network sniffer installed at a port mirror in order to guess end user response times. The advantage of using this solution is the ease of installation; the disadvantages are the cost of putting these probes at all of the points where the network is accessed and the limited accuracy of the data.
Synthetic End User Measurements Performed by “Robots”
The classic availability monitors use this approach - scripts are used to “ping” the system and check its availability. The advantage is that availability can be monitored overnight and before the morning workload; the disadvantage is that real user response times are not being measured and scripts have to be modified with every change of the application.
Finding the Location of the Problem
Now that you know there is a problem, since the end user is experiencing reliability issues with an application, narrowing down the location of the problem is the next step. Research has shown that 80% of time spent on troubleshooting performance problems is spent on finding the actual location of the problem.

In the picture below “John the User” is experiencing poor response times – the IT department is tasked with resolving this issue.
To be continued next week...

Sunday, July 5, 2009

IT Reliability through Business Transaction Management

I just got back from a regional CMG conference where the majority of attendees identify themselves as IT performance and capacity professionals. Ultimately the objective of their work is to make the organization’s IT systems more reliable.
  • By planning for tomorrow’s capacity they ensure that the applications that our co-workers and customers rely on will continue to be reliable throughout increased usage and new changes.
  • By managing IT performance they enable the day to day reliability of these applications so that revenue can be generated and growth achieved.
By focusing our efforts on the greater goal of IT reliability we are able to heal the great disconnect that happens much too often between the business and IT. Showing the business graphs of increased CPU consumption or bandwidth utilization does not enable them to relate to the performance and capacity issues that they are concerned with. IT resource consumption metrics may be the “vital signs” of IT systems, but this is kind of like checking a patient’s pulse and respiratory rate and saying that they are healthy enough even though the patient may be suffering from a terminal disease. The lights may be on, but are the IT systems you manage perceived as reliable by the end users and the business?

Put yourself in the shoes of the end user; you want any action that you perform within an application to have a quick and valid response. That is reliability in the eyes of the business and the end user and that is exactly what your work as a performance or capacity professional is seeking to enable.

What connects all of the various parts of the puzzle? What links the business, the end users, the network, firewalls, proxy servers, web servers, application servers, load balancers, message brokers, databases, and mainframes? Transactions.
When posing the question "how do YOU define a transaction?" to a room full of IT professionals you are likely to receive a handful of different answers.
Typically one would define an HTTP request, database call, CICS transaction or SOAP request as a transaction. In this paper these are defined as transaction segments that are part of a greater transaction.
A transaction is the most elementary unit of work that can be performed by a user of an application. Whenever a user clicks a button within an application, they have performed a “transaction activation”. This “transaction activation” can trigger any number of IT processes within the datacenter that are used to get the work done. A transaction could be a transfer of funds, a purchase, an update of information, or the opening of a new account – any user interaction with the application – also known as a business service or unit of work.

To be continued next week...