What Causes Enterprise Applications to Slow Down - part 2

October 8th, 2008

In the previous post we identified the Wide Area Network, and the impairments it introduces, as a key reason why a local user (let’s say in NYC) experiences a faster application than a remote user (let’s say in Tokyo). I also presented a question to the group: “We identified network latency as one of the key reasons that impact application performance; we also said that a typical WAN link will introduce 10 – 500 msec of latency. The question is, why does network latency impact application performance? Surely a user doesn’t notice an increase of a few milliseconds in response time; even 500 milliseconds = ½ second goes by in a blink. So why does network latency have such a big impact on application performance?”

The following posts will answer this question and more, but in order to get answers we need to first address additional questions:

Consider the remote user in Tokyo who is accessing multiple applications that are all hosted in the NYC data center.

Will all these applications perform the same way across the Wide Area Network?

The obvious answer is NO: some applications will perform well over the WAN, others will perform poorly intermittently, and some will always perform poorly for a remote user.

The answers to the next questions are less obvious:

  • Why do some applications perform well over the network while others fail miserably?
  • What is it about the way the application is designed and architected that allows one application to perform better than others?
  • What are the key design flaws that cause applications to perform poorly over the network?

Since there are a lot of different applications, there are also a lot of different answers to these questions. In the next couple of posts we will focus on enterprise data-driven transactional applications and their performance flaws over the network (ignoring for a moment backend or desktop related bottlenecks). We will further limit this category to client-server applications, ignoring for a moment the complexity of N-Tier applications and applications built on multiple web services; these will be dealt with in future posts.

Over the years I have been asked to analyze the performance problems of many transactional applications, and specifically their performance degradation over the network. The following are the key application design attributes that I found lead to performance degradation over the network:

  1. The number of (blocking) application turns per transaction (or how chatty the application is)
  2. The transaction size (or how much data needs to be downloaded from the server to the client in order to complete each transaction)
  3. The transaction efficiency factor (how much data a transaction downloads per application turn)
  4. The transaction initialization size (how much data the transaction downloads initially vs. in subsequent navigational steps)
  5. The caching ratio (how much data is cached locally as a percentage of the overall data needed by the application)
  6. The latency scale factor (how the backend’s ability to scale changes when network latency is added between front end clients and the backend)
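To make the first three attributes more concrete, here is a minimal sketch. It relies on a common first-order approximation that is my assumption for illustration, not a formula from this post: response time ≈ (number of blocking turns × round trip latency) + (transaction size / bandwidth). The function names and the sample transaction are hypothetical.

```python
def efficiency_factor(transaction_bytes: int, turns: int) -> float:
    """Bytes downloaded per application turn (attribute 3 in the list above)."""
    return transaction_bytes / turns


def estimated_response_time(turns: int, round_trip_sec: float,
                            transaction_bytes: int, bandwidth_bps: float) -> float:
    """First-order estimate: every blocking turn waits one round trip,
    and the payload still has to be pushed through the link."""
    latency_cost = turns * round_trip_sec
    transfer_cost = (transaction_bytes * 8) / bandwidth_bps
    return latency_cost + transfer_cost


# Hypothetical transaction: 200 turns, 500 KB, over a 10 Mbps link.
local = estimated_response_time(200, 0.001, 500_000, 10_000_000)   # 1 msec round trip
remote = estimated_response_time(200, 0.200, 500_000, 10_000_000)  # 200 msec round trip
print(f"local: {local:.1f} s, remote: {remote:.1f} s")
```

Under these assumed numbers the transfer cost is identical for both users; the whole difference comes from the turns multiplied by the latency, which is why a chatty application suffers most over the WAN.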

In the next couple of posts I will explain each of the above factors and describe how each of them impacts enterprise application performance, so sign up for the RSS feed to be notified of these future posts.

Until then…

A Data Driven Transactional Application – A glossary post

October 8th, 2008

A data driven transactional application supports the execution of business processes. Each business process (such as book sale, update employee status, submit work hours, etc.) is comprised of multiple business transactions. A business transaction is described as the interaction and managed outcome of a well-defined step within a business process. A transaction is usually triggered by user interaction and its outcome can be measured and verified. The following is an example of a business process for updating an employee record and its underlying transactions as it is presented as part of a test plan for that business process:

 

Business Process Test Plan - Update employee record from Tokyo

| Transaction name | Trigger | Expected outcome | Response time for a user in Tokyo | Service Level Objective |
|---|---|---|---|---|
| Sign in | User enters his credentials into the sign in page and clicks submit | The user is signed in and the application displays the home page | | 7 seconds |
| Navigate to company address book | Click on “Company address book tab” | The company address book page is displayed | | 3 seconds |
| Find employee | Enter employee name in the search box | Employee search results are displayed | | 7 seconds |
| Select employee | Click on employee link | Employee data page is displayed | | 7 seconds |
| Edit employee data | Click on edit | Employee edit data page is displayed | | 3 seconds |
| Update employee records | Click on submit | Employee records are updated | | 7 seconds |
| Sign out | Click on sign out | Sign in page is displayed | | 3 seconds |

 

A FREE and EASY way to understand your network performance

October 3rd, 2008

In the previous post, I presented one explanation for why remote users of enterprise applications experience slower applications compared to their local colleagues. That reason is the impact that WAN impairments such as delay, packet loss and jitter have on enterprise applications. A question that keeps repeating itself in my consulting engagements is “How do I know how much latency and packet loss there is on a given network link?”

So today, I want to share a tip about a tool that I have been using for years to get a quick snapshot of the existing impairments on a given link. This tool is the Shunra VE Network Catcher Lite (full disclosure: I have worked for Shunra for the past 7 years and am still involved with the company in a consulting capacity). The Network Catcher Lite tool is a Win32 application that captures latency, packet loss and jitter from a user’s PC to a target server. The following tip explains how you can get a snapshot of any network link that your PC can access.

You can download the tool from http://www.shunra.com/free_network_monitoring_tool.aspx (fill out the download form at the bottom of the page). You can also find it on Tucows at http://www.tucows.com/preview/502454

  1. Once downloaded, launch the application from the Program Files menu.
  2. Enter a URL for a remote server or web site (e.g. www.excellingit.com, or a server or router inside your intranet at a remote site).
  3. Fill in the recording parameters as follows:
    • Enter the IP or URL of the target server (e.g. www.excellingit.com).
    • The sample interval determines the granularity of the recording; I use 100 msec.
    • The recording time specifies the duration of the snapshot; the maximum is 15 minutes (900 seconds).
    • The packet length determines the size of the recording packet; I typically leave it at 32 bytes.
    • TOS allows you to record the impairments on different classes of service across a network; 0 stands for best effort.
  4. Click Start Recording.
  5. Make sure that a green line begins to chart inside the grid. If you see a red line at 100% loss, it means that you cannot connect to the remote server. One trick I use in this case is to tracert to the remote server and use the last accessible hop; more on that in a later post.

Here is how you should read the recorded data: the green line shows the latency value across the timeline of the recording, and any red spikes represent packet loss events. Here are some examples:

This is a recording of a link from Tokyo to NY.

The recording was taken off hours, so the floor of the recording at 85 milliseconds is a good estimate for the one way propagation delay. This link will never have latency lower than that floor, so applications that will be used in Tokyo should be tested with at least 85 milliseconds of latency; that is the best case scenario for that link.

Most of the time networks are stable and exhibit a small deviation in latency, but even the best run networks can get congested (on average 0.5% to 1% of the time, which could add up to over 3 days a year). During a congestion event the latency can increase by several orders of magnitude over the propagation delay, and the standard deviation of latency increases dramatically, which increases the jitter.

Consider the following recording of a link from Dallas to NY during a congestion event.

See how the latency starts to deviate dramatically and how one way delay increases from a floor of 25 msec to over 120 msec. You can also see that packet loss (the red spikes) increases during a congestion event. That is because routers that are buffering packets eventually run out of buffer space and begin dropping packets.

That’s it for now, hopefully you will find this tool useful. In future posts I will explain the relationship between network conditions and application response times. At that time the use for this tool will morph as it becomes a part of the performance engineering process.

Until then, let me know if you have any feedback on this tip; I will try to address it the best I can.

Amichai

 

What Causes Enterprise Applications to Slow Down - part 1

September 8th, 2008

The next several posts will deal with performance engineering concepts that will help us answer a simple question: “Why do applications slow down?”

Anyone who has been involved with applications, either from the IT side or as an end user, knows that there are many reasons for applications to slow down (viruses, slow desktops with limited resources, Windows’ tendency to decay over time, running out of memory or disk drive space, etc.). However, in this blog we will mainly focus on Enterprise applications and specifically on performance issues with global Enterprise applications. I have found that performance issues with global Enterprise applications are the most difficult to identify and consequently the most difficult to fix.

 

Performance issues of global Enterprise applications tend to result from the global nature of these applications which introduces the following three (3) conditions:

 

1. Users access these applications from remote sites over the Wide Area Network

2. These Enterprise applications need to perform under high end user volume

3. And the most elusive of them all, these applications need to perform under distributed load. This combines performance over the network with performance under end user load

 

Our first discussion will be about what causes applications to slow down as end users access them from remote locations. This topic requires an understanding of two (2) concepts: network impairments and application network design. These will be the topics of the next 2 posts.

The network impact on application performance

Let’s begin with a simple question: “An application is hosted in a NYC data center, now consider two (2) users, one is located in the NYC data center and one is located in a remote office in San Francisco, will the application perform the same for both users? In other words will the application be as responsive to the San Francisco user as it is to the NYC user?”

 

Well the obvious answer is NO, in most cases the NYC user will enjoy a faster more responsive application. What is less obvious is why? What is it about the network that causes remote users to experience a slower application than local users?

 

When I ask this question during my training seminars, I get a variety of answers. Many of them are right, but I would like to address one wrong answer that keeps repeating itself for some reason.

 

Collisions – there is a general conception that collisions are a common phenomenon on the network that can explain any bad thing that happens to applications. The truth is that collisions are almost a thing of the past (on Enterprise LANs anyhow), and even when they happen they can’t explain why a remote user has a worse experience than a local user: collisions are a phenomenon of local area Ethernet networks, so both users face a similar chance of collision.

 

Now to the right answers to the question: what is it about the Wide Area Network that causes applications to slow down?

There are 5 key conditions that predominantly exist on Wide Area Networks and impact application performance, each in its own way:

 

1. Network Latency – the time it takes a packet to traverse from a source to the destination across the network, measured in milliseconds [msec]. A typical WAN link will introduce latency in the range of 10msec – 500 msec.

2. Bandwidth constraints – how fast can data be processed by the network link, measured in bits per second [bps, Kbps, Mbps, Gbps]

3. Bandwidth utilization (background traffic) – the percentage of bandwidth that is utilized by traffic that already exists on the link (background traffic).

4. Jitter – the deviation of the inter packet gap of sequential packets across a network link. It is a result of the deviation of the network latency and is sometimes used interchangeably with that standard deviation; measured in milliseconds [msec].

5. Packet Loss – the chance to drop a packet across an end to end network link, measured in %. Sometimes presented as the inverse metric called packet delivery rate.

 

The above are called network impairments, you can click on each one of the links to learn more about them and their causes.

Network impairments are performance conditions that inhibit the flow of data across a network. Each impairment type has an impact on the performance of business applications and network services. Some applications may be very sensitive to network impairments and some may be almost network agnostic. Sorting applications based on their network sensitivity is one of the important steps in performance engineering.

 

In the next post we will discuss how application design can impact performance across the network. But in the meantime I would like to introduce a question for the group:

 

“We identified network latency as one of the key reasons that impact application performance; we also said that a typical WAN link will introduce 10 – 500 msec of latency. The question is, why does network latency have a big impact on application performance? Surely a user doesn’t notice an increase of a few msec in response time; even 500 msec = ½ second goes by in a blink. So think about it and let me know what you found based on your experience: why does network latency have such a big impact on application performance?”

 

That and more will be covered in the next post.

 

Talk to you soon,

 

Amichai

Network Latency – a glossary post

September 8th, 2008

Network Latency/Network Delay

One-way network latency is defined as the amount of time it takes for a packet to traverse a particular network path, from a device that created the packet to the destination device. This is also known as end-to-end latency.

Network delay is composed of propagation, processing and serialization, and queuing delay.

Propagation delay is the time it takes the physical signal to traverse the path. This delay is usually fairly constant if there are no route changes (it is more constant if the end-points are static and connected via optical fiber, and varies more if one of the end-points is a moving airplane or satellite). Propagation delay is the result of pure physics, where

t(msec) = d(km) / (2/3 × speed of light in km/msec) ≈ d(km) / 200
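The propagation delay formula above can be sketched in a few lines. The function name and the sample distance are assumptions for illustration; a real cable route is longer than the straight-line distance between two cities, so a measured floor will be higher than this estimate.

```python
SPEED_OF_LIGHT_KM_PER_SEC = 299_792.458   # speed of light in vacuum
FIBER_FACTOR = 2 / 3                      # signal speed relative to c, as in the formula above


def propagation_delay_msec(distance_km: float) -> float:
    """One-way propagation delay in milliseconds for a path of the given length."""
    speed_km_per_msec = SPEED_OF_LIGHT_KM_PER_SEC * FIBER_FACTOR / 1000
    return distance_km / speed_km_per_msec


# Assumed path length of 11,000 km, chosen only as an example.
print(f"{propagation_delay_msec(11_000):.1f} msec one way")
```

A handy rule of thumb that falls out of this: roughly 1 msec of one-way delay for every 200 km of path.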

Processing and serialization delay is the aggregate time it takes each hop on the route to process and transmit the packet. Depending on bit-rate (bandwidth), it may be a significant portion of overall delay or may be negligible. In modern networks, where high bit-rate connections are becoming the norm, serialization delay is becoming more and more negligible. In any case, for a given packet size and path, serialization delay is constant (unless affected by hardware compression); it changes only when the route or the packet length changes. Processing time in each hop depends on the services that this hop provides (bridging, routing, encryption, compression, tunneling, etc.) and on the speed of the device itself, but generally in modern networks this is a number of a much lower order of magnitude than the rest of the latency factors.
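To see why serialization delay becomes negligible on fast links, here is a small sketch for a single hop. The function name, the 1500-byte packet and the list of link speeds are assumptions chosen for illustration.

```python
def serialization_delay_msec(packet_bytes: int, link_bps: float) -> float:
    """Time to clock one packet onto a link of the given bit-rate, in milliseconds."""
    return packet_bytes * 8 / link_bps * 1000


# Assumed 1500-byte packet over three link speeds.
for name, bps in [("T1 (1.544 Mbps)", 1_544_000),
                  ("100 Mbps", 100_000_000),
                  ("1 Gbps", 1_000_000_000)]:
    print(f"{name}: {serialization_delay_msec(1500, bps):.3f} msec")
```

On the T1 the packet takes several milliseconds to serialize, while on the gigabit link it takes about a hundredth of a millisecond, which is small next to a typical WAN propagation delay.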

Queuing delay is the time a packet spends in router queues. This time depends naturally on queue lengths: for an unloaded network it would be negligible; for a network that is heavily congested it could considerably contribute to the end to end delay. It is the most variable delay component in a typical modern network. Queuing delay changes based on the burstiness of the traffic, since a burst of traffic creates an increase in the queue depth. Queuing delay is the main reason for jitter, which is caused by the variation of latency over time.

 

 


Latency and bandwidth together determine the “speed” of a connection.

Latency can increase as the bandwidth utilization and traffic load changes but it will never go below the propagation delay (simple physics). As load increases, it is possible that latency will increase since buffers may begin to populate on the path between the sender and receiver. 

Bandwidth - a glossary post

September 8th, 2008

Bandwidth constraints:

Network capacity, or bandwidth, is the number of bits a network connection or interface can carry in a given period of time. It is measured in bps (bits-per-second), Kbps, Mbps or Gbps. The greater the bandwidth, the greater the number of concurrent application sessions the link can serve (for a given transaction) and the greater the rate that each application session can consume from the network.

Bandwidth Utilization:

Bandwidth utilization is a measure of how much of the link’s maximum data rate is being used. An intuitive way to think about utilization is to picture the WAN circuit as a pipe of a certain diameter that is partly filled with something we call traffic. Bandwidth utilization is a function of the number of concurrent application sessions across the link and the average rate used by each session. For example, if a T1 link (1544 Kbps) serves an average of 20 concurrent application sessions and each session uses 50 Kbps on average each way, then the link is 64.8% utilized ((50 Kbps * 20 sessions)/1544 Kbps = 64.8%).

Network performance starts to degrade at about 70% bandwidth utilization (BU), is badly degraded at 80%, and at 90% or more the link is “flooded”. Smaller pipes flood more easily, which significantly increases latency and jitter.
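The utilization arithmetic and the thresholds above can be sketched as follows. The function names and the "healthy" label for links below 70% are my own assumptions; the other labels follow the thresholds in the text.

```python
def bandwidth_utilization_pct(sessions: int, avg_rate_kbps: float, link_kbps: float) -> float:
    """Share of the link's capacity consumed by the concurrent sessions, in percent."""
    return sessions * avg_rate_kbps / link_kbps * 100


def classify(utilization_pct: float) -> str:
    """Map a utilization figure onto the rough thresholds quoted above."""
    if utilization_pct >= 90:
        return "flooded"
    if utilization_pct >= 80:
        return "badly degraded"
    if utilization_pct >= 70:
        return "degrading"
    return "healthy"  # assumed label for anything below 70%


# The T1 example from the text: 20 sessions at 50 Kbps each on a 1544 Kbps link.
t1 = bandwidth_utilization_pct(20, 50, 1544)
print(f"{t1:.1f}% utilized: {classify(t1)}")
```

With these numbers, adding just a handful of sessions at the same rate would push the link across the 70% mark.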

Packet Loss - a glossary post

September 8th, 2008

Packet Loss:

The term “packet loss” is used to describe the probability of dropping a packet at any point across the network link. The key reasons for packet loss across a network are:
 
  • When networks are congested for a long period of time, the router buffers become saturated; at some point there is no room to store more packets in the buffers, so newly arriving packets get dropped
  • Some routers implement a mechanism called random early detection (RED). Since IP doesn’t provide any explicit way to indicate that the network is congested, routers use loss for that purpose. Such loss is a way of communicating back to the users the need to scale back the offered load on the network. RED is becoming less common as new QoS techniques are used to implement rate control.
  • Communications across wireless and cellular networks may encounter packet loss due to interference or a weak signal
  • Hardware and cable errors are a common cause of packet loss

From the network perspective, loss can be categorized as random loss or loss due to congestion. The average packet loss rate for a network connection gives an overall sense of the quality of the connection. A connection with less than 1 percent average packet loss is considered a decent connection. But average loss doesn’t tell the whole story; the type, or pattern, of packet loss also matters. There are at least two kinds of packet loss that should be considered: ‘Random’ loss and ‘Burst’ loss. To explain the difference between them, let’s suppose we are trying to run 2 Voice over IP conversations over 2 links that have an average of 1 percent packet loss. Call A loses one packet in every 100 packets over the entire call (random loss) while Call B loses 100 packets in two incidents - at the beginning and the end of the call (burst loss). Which call would you rather have? That’s why it is important to consider not just the average packet loss but also the type of loss and information on any bursts of packet loss over time.
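The two calls can be modeled with a short sketch that reports both the average loss and the longest burst. The call length of 10,000 packets and the even 50/50 split of Call B’s two incidents are assumptions made so that both calls average 1 percent; the function names are hypothetical.

```python
def loss_rate_pct(lost: list[bool]) -> float:
    """Average packet loss over the whole call, in percent."""
    return sum(lost) / len(lost) * 100


def longest_burst(lost: list[bool]) -> int:
    """Length of the longest run of consecutively lost packets."""
    longest = current = 0
    for is_lost in lost:
        current = current + 1 if is_lost else 0
        longest = max(longest, current)
    return longest


TOTAL_PACKETS = 10_000  # assumed call length

# Call A: every 100th packet is lost, spread evenly over the call.
call_a = [i % 100 == 99 for i in range(TOTAL_PACKETS)]
# Call B: 50 packets lost at the start and 50 at the end of the call.
call_b = [i < 50 or i >= TOTAL_PACKETS - 50 for i in range(TOTAL_PACKETS)]

for name, call in [("Call A", call_a), ("Call B", call_b)]:
    print(f"{name}: {loss_rate_pct(call):.1f}% loss, longest burst {longest_burst(call)} packets")
```

Both calls report the same average, so only the burst figure tells them apart.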

Jitter - a glossary post

September 8th, 2008

Jitter:

 

Network jitter is the variation in the inter packet gap of subsequent data packets as they arrive over a network. For most types of data applications large gap variations between arrival times are acceptable. For voice or video applications, relatively small jitter can cause perceptible disturbances in the recreated voice or video at the receiving end.
Because Wide Area Networks over IP generally rely on shared access to the network and can have many asynchronous sessions all sharing the same network they tend to introduce relatively high jitter.

Jitter can be reduced with buffers. Buffering some incoming packets and then outputting them at a more regular rate reduces jitter. This is fine for applications that are not interactive, such as streaming music or videos. For interactive applications such as VoIP conversations or video conferencing, this scheme will add network latency, which reduces the quality of the conversation.
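As a minimal sketch of how jitter could be quantified from packet arrival times, the code below uses the standard deviation of the inter packet gaps. That choice of statistic is an assumption (other definitions of jitter exist), and the arrival times are made-up sample data.

```python
import statistics


def inter_packet_gaps(arrival_msec: list[float]) -> list[float]:
    """Gaps between consecutive packet arrival times, in milliseconds."""
    return [later - earlier for earlier, later in zip(arrival_msec, arrival_msec[1:])]


def jitter_msec(arrival_msec: list[float]) -> float:
    """Population standard deviation of the inter packet gaps."""
    return statistics.pstdev(inter_packet_gaps(arrival_msec))


# Packets sent every 20 msec; hypothetical arrival times at the receiver.
steady = [0, 20, 40, 60, 80]     # gaps stay at 20 msec
bursty = [0, 20, 55, 60, 100]    # gaps of 20, 35, 5 and 40 msec

print(f"steady: {jitter_msec(steady):.1f} msec, bursty: {jitter_msec(bursty):.1f} msec")
```

A jitter buffer sized for the bursty stream would need to hold packets long enough to absorb the largest gaps, which is the added latency described above.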

Performance Engineering - Why so many companies don’t get it - Part 3

July 29th, 2008

In the previous 2 posts we described several ways in which sub optimal performance engineering practices manifest themselves, as well as identified the lack of goal commonality between developers and performance engineers as one of the key reasons behind these sub optimal practices. In this post I want to look at the problem from a more holistic and organizational perspective.

Losing sight of the goal

What happens when IT departments lose sight of the performance engineering goal? (Reminder: in short, the goal is to improve the end user’s quality of experience and productivity, while maintaining system costs within budget).

Well what happens is that each department gets lost in its own tactical goal:

  1. The Capacity planning team focuses on efficient and accurate hardware provisioning
  2. The load testing team focuses on test coverage and scale requirements
  3. The network engineering team focuses on the speed and capacity of the pipes
  4. The data base team focuses on the performance of the data bases
  5. The server team focuses on the performance of the backend servers
  6. The desktop team focuses on the performance of the desktop clients

What the organization ends up with is a set of local optimums, but in many cases those local optimums don’t amount to an optimal system. What’s missing in the above list is at least one department that is responsible for meeting the goal. It is very rare to find a team that oversees the end to end responsiveness and performance of the application across all its components from the end user’s perspective. It is even harder to find a team that is held accountable for end user performance.

 

But is it wrong for each team to improve its domain and make sure it is optimal? Well the counter intuitive answer is yes, it is wrong and for the following reasons:

Focusing on the wrong bottlenecks

Let’s consider the following transaction as an example: this transaction generates a time sheet report for global employees. This transaction is served by a client (web browser with Java widgets), a few web servers behind a load balancer, a few application servers and a data base server. Now let’s see what happens if there are performance issues with this transaction. Naturally each team will spend time improving its own domain: the data base team may index the employee data base to cut the data base response time in half, the server team adds more web servers behind the load balancer to increase the application’s scalability, and the network team adds more bandwidth to the data center router. All these steps sound like they should help, no? Well, the realistic answer is that in some cases none of these steps help; in fact 3 negative things happen here:

a. The teams spent time and money on the wrong bottlenecks

b. The real bottleneck is still out there

c. Increasing the speed of non-bottleneck components places more strain on the real bottleneck, slowing things down even further
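A deliberately simplified sketch of points a and b: if each tier is modeled as having a maximum throughput, the system can go no faster than its slowest tier. The capacities below are invented numbers for illustration, and the model is an assumption; it does not capture the extra strain described in point c.

```python
def system_throughput(capacity_per_tier: dict[str, float]) -> tuple[str, float]:
    """Return the slowest tier and its capacity, which caps the whole system."""
    bottleneck = min(capacity_per_tier, key=capacity_per_tier.get)
    return bottleneck, capacity_per_tier[bottleneck]


# Hypothetical capacities in transactions per second.
before = {"web servers": 400, "application servers": 150,
          "data base": 300, "network": 500}

# Each team doubles the capacity of its own domain; nobody touches the application servers.
after = {**before, "web servers": 800, "data base": 600, "network": 1000}

print("before:", system_throughput(before))
print("after: ", system_throughput(after))
```

In this model three teams did real work, yet the end to end number the user cares about did not move.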

 

In future posts I will give specific examples of several problems that cannot be addressed in the realm of one IT department. Those problems usually result from interdependencies between the different systems (servers, networks and data bases). It takes a holistic and multi-disciplinary process to find the right bottleneck, let alone find a solution for the problem. It may sound complicated, but the concept is quite simple: when dealing with performance, it does little good to focus on local optimums. Any optimization effort that is not spent on the actual bottleneck is counterproductive and a waste of IT resources and money. Remember, the goal is to improve end user response time at the desktop, not to optimize a specific component that is part of a bigger system.

 

Even though the concept is simple, the solution isn’t always as simple. In future posts, I will offer practical ways to find application performance bottlenecks; I have used them in many engagements and they haven’t failed me yet. But before we can talk about the solutions, it is important to understand the problems, so in the next several posts we will cover a few basic performance engineering concepts.

 

Talk to you soon…

Performance Engineering - Why so many companies don’t get it - Part 2

June 27th, 2008


Anyone who was ever part of a performance engineering process should be able to relate to the following story:

“…Version 3.5 of a critical application is scheduled for release in 6 weeks. The latest stable build (internal version) of the application finally made it to the hands of the performance engineering team 1 week ago. The team ran a few performance tests and found that this new version performs much slower than the previous 3.0 version. The team even identified potential tuning opportunities, but they require changes in the code. Trying to get time from the developers to address these issues was nearly impossible, as some are already working on a patch that adds more functionality to this version and some are already deployed on another project. The application ends up being deployed as is, clients begin complaining about the poor performance and the business unit points fingers at the performance engineering team for not delivering on its goal…”

 

Why does this happen?

 

Lack of goal commonality

The main reason behind this kind of story and others like it is the lack of goal commonality. The development team, which has the biggest impact on application performance, is measured on timely delivery of specified functionality and is rarely measured on the performance and responsiveness of those applications. Application performance is the goal of the performance engineering team, which can usually just verify the compliance of applications with their performance goals but can rarely effect any changes that will actually improve performance. The performance engineering team can oversee performance improvement projects, but the changes have to be driven through the development team, which as already established is not measured on performance and hence will be somewhat reluctant to spend time on these performance improving projects. There are other examples of lack of goal commonality within IT (Network Ops and Application development is another common example), but none impact application performance (or lack thereof) more than the above.

Performance Vs. Functionality – lack of goal commonality makes this a tough choice

But even if the development team isn’t measured on performance, don’t developers care about the performance and responsiveness of their applications?

Well of course they do.

I come from a development background and I can tell you first hand that developers want to take pride in the code they write, but when faced with the choice of spending time on performance improvements or on functionality enhancements, they are usually forced to choose the latter.

Lack of tools and expertise

 

To make things worse, the lack of goal commonality prevents developers from being able to address performance problems even if they set aside the time for it, since for the most part developers are not equipped with the right tools, best practices and expertise to address potential performance issues. When development managers make their hiring decisions and development tool selections, they tend to make choices that will improve the team’s ability to deliver functionality fast and tend to focus less on tools and expertise that are performance oriented.

Performance later

There is a tendency in the development world to push performance to the last stage in the development life cycle, also known as “performance last” or “performance later”. While this approach may make sense initially, since you can’t test for performance before you have runnable code that passed functional testing, it is one of the reasons behind many of the performance problems that exist in applications today. As we will see in future posts, many performance problems are a result of decisions made very early in the development process, such as platform selection, development language selection and even mundane things such as the type of user interface container selected for a specific web page.

 

In summary we identified the lack of goal commonality between application development teams and performance engineering groups as one of the reasons behind a sub optimal performance engineering practice. In the next post we will look into the lack of a clear goal as a fundamental roadblock in achieving efficient performance engineering.

 

In the meantime I am interested in your feedback. Check your company’s software specification documents: how many have performance requirements documented in the specifications? How many of those end up in the test plan? Curious to hear what you find.

Talk to you soon…