Wednesday, August 11, 2010

Architectural Principles

Twelve Architectural Principles

In this section, we introduce twelve architectural principles. Many times after engagements, we will “seed” the architectural principle gardens of our clients with our twelve principles and then ask them to run their own process, taking as many of ours as they would like, discarding any that do not work for them, and adding as many as they would like. We only ask that they let us know what they are considering so that we can modify our principles over time if they come up with an especially ingenious or useful principle. The Venn diagram shown in Figure 12.3 depicts our principles as they relate to scalability, availability, and cost. We will discuss each of the principles at a high level and then dig more deeply into those that are identified as having an impact on scalability.

Figure 12.3. AKF Architecture Principles


N+1 Design

Simply stated, this principle is the need to ensure that anything you develop has at least one additional instance of that system in the event of failure. Apply the rule of three that we will discuss in Chapter 32, Planning Data Centers, or what we sometimes call ensuring that you build one for you, one for the customer, and one to fail. This principle holds true for everything from large data center design to Web Services implementations.
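
As a rough illustration of the N+1 arithmetic, the following Python sketch sizes a pool so that it can lose one instance and still carry peak load. The function names and the load figures are hypothetical, and the sketch assumes load spreads evenly across identical instances.

```python
import math


def instances_needed(peak_load: float, capacity_per_instance: float,
                     spares: int = 1) -> int:
    """Return the instance count that serves peak_load plus `spares` extras.

    Assumes identical instances and evenly balanced load (an illustrative
    simplification, not something the text specifies).
    """
    if peak_load < 0 or capacity_per_instance <= 0:
        raise ValueError("peak_load must be >= 0 and capacity must be > 0")
    # N: the minimum number of instances that can carry the peak.
    base = max(1, math.ceil(peak_load / capacity_per_instance))
    # +1 (or more): headroom so that a failure does not drop us below N.
    return base + spares


def survives_single_failure(instances: int, peak_load: float,
                            capacity_per_instance: float) -> bool:
    """True if the pool still covers peak_load after losing one instance."""
    return (instances - 1) * capacity_per_instance >= peak_load


if __name__ == "__main__":
    n = instances_needed(peak_load=1000, capacity_per_instance=300)
    print(n, survives_single_failure(n, 1000, 300))
```
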

Design for Rollback

This is a critical principle for Web services, Web 2.0, or Software as a Service (SaaS) companies. Whatever you build, ensure that it is backward compatible. Make sure that you can roll it back if you find yourself in a position of spending too much time “fixing forward.” Some companies will indicate that they can roll back within a specific window of time, say the first couple of hours. Unfortunately, some of the worst and most disastrous failures don’t show up for a few days, especially when those failures have to do with customer data corruption. In the ideal case, you will also design to allow something to be rolled, pushed, or deployed while your product or platform is still “live.” The rollback process will be covered in more detail in Chapter 18, Barrier Conditions and Rollback.

Design to Be Disabled

When designing systems, especially very risky systems that communicate with other systems or services, design them to be capable of being “marked down” or disabled. This may give you additional time to “fix forward” or ensure that you don’t go down as a result of a bug that introduces strange out-of-bounds demand characteristics on your system.
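
One simple way to make a feature capable of being “marked down” is a runtime switch that callers check before invoking the risky dependency. The Python sketch below is illustrative only: the class name, the `call_with_fallback` helper, and the "recommendations" feature are hypothetical, and a real deployment would typically hold the switch state somewhere operators can change without a code push.

```python
import threading


class MarkdownSwitch:
    """Tracks which named features are currently marked down (disabled)."""

    def __init__(self):
        self._disabled = set()
        self._lock = threading.Lock()

    def mark_down(self, name: str) -> None:
        with self._lock:
            self._disabled.add(name)

    def mark_up(self, name: str) -> None:
        with self._lock:
            self._disabled.discard(name)

    def is_up(self, name: str) -> bool:
        with self._lock:
            return name not in self._disabled


def call_with_fallback(switch: MarkdownSwitch, name: str, func, fallback):
    """Call func() unless the feature is marked down; then return fallback."""
    if not switch.is_up(name):
        return fallback
    return func()


if __name__ == "__main__":
    switch = MarkdownSwitch()
    switch.mark_down("recommendations")
    # The page still renders, just without the risky feature.
    print(call_with_fallback(switch, "recommendations", lambda: ["item"], []))
```
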

Design to Be Monitored

As we’ve discussed earlier in this book, systems should be designed from the ground up to be monitored. This goes beyond just applying agents to a system to monitor the utilization of CPU, memory, or disk I/O. It also goes beyond simply logging errors. You want your system to identify when it is performing differently than it normally operates in addition to telling you when it is not functioning properly.
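
To make “performing differently than it normally operates” concrete, the sketch below compares each new measurement against a rolling baseline and flags large deviations. It is one possible approach under stated assumptions: the class name, the window size, and the three-standard-deviation threshold are all hypothetical choices, not values from the text.

```python
from collections import deque
import statistics


class BaselineMonitor:
    """Flags measurements that deviate sharply from a rolling baseline."""

    def __init__(self, window: int = 60, threshold: float = 3.0,
                 min_samples: int = 10):
        self._samples = deque(maxlen=window)  # most recent measurements
        self._threshold = threshold           # allowed deviation, in stdevs
        self._min_samples = min_samples       # baseline size before judging

    def observe(self, value: float) -> bool:
        """Return True if value is unusual versus the baseline, then record it."""
        unusual = False
        if len(self._samples) >= self._min_samples:
            mean = statistics.fmean(self._samples)
            stdev = statistics.pstdev(self._samples)
            if stdev > 0:
                unusual = abs(value - mean) > self._threshold * stdev
            else:
                # A perfectly flat baseline: any change at all is a deviation.
                unusual = value != mean
        self._samples.append(value)
        return unusual


if __name__ == "__main__":
    monitor = BaselineMonitor(window=20, min_samples=5)
    for latency_ms in [100, 102, 98, 101, 99, 100, 250]:
        if monitor.observe(latency_ms):
            print("unusual latency:", latency_ms)
```

Note that the system in this sketch is still “working” when it reports 250 ms; the point is that it is no longer behaving the way it normally does.
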

Design for Multiple Live Sites

Many companies have disaster recovery centers with systems sitting mostly idle or used for QA until such time as they are needed. The primary issue with such solutions is that it takes a significant amount of time to fail over and validate the disaster recovery center in the event of a disaster. A better solution is to be serving traffic out of both sites live, such that the team is comfortable with the operation of both sites. Our rule of three applies here as well and in most cases you can operate three sites live at equal to or lower cost than the operation of a hot site and a cold disaster recovery site. We’ll discuss this topic in greater detail later in the chapter.
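
One way to see why more live sites can cost no more than a hot site plus a disaster recovery site is a back-of-the-envelope capacity calculation. The sketch below assumes traffic is split evenly across identical live sites and that the surviving sites must carry full peak load after any one site is lost; it ignores fixed per-site costs, data replication, and network charges, all of which matter in practice. The function name is hypothetical.

```python
def total_capacity_fraction(live_sites: int) -> float:
    """Total capacity to buy, as a multiple of peak load.

    Each of the `live_sites` sites is sized so that the remaining
    (live_sites - 1) sites can carry the whole peak after one site fails.
    """
    if live_sites < 2:
        raise ValueError("need at least two sites to survive a site failure")
    per_site = 1.0 / (live_sites - 1)  # share of peak each site must handle
    return live_sites * per_site


if __name__ == "__main__":
    for n in (2, 3, 4):
        print(n, "sites:", round(total_capacity_fraction(n) * 100), "% of peak")
```

Under these assumptions, two sites require 200 percent of peak capacity in total, whereas three live sites require 150 percent.
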

Use Mature Technologies

When you are buying technology, use technology that is proven and that has already had the bugs worked out of it. There are many cases where you might be interested in the vendor-promised competitive edge that some new technology offers. Be careful here, because if you become an early adopter of software or systems, you will also be on the leading edge of finding all the bugs with that software or system. If availability and reliability are important to you and your customers, try to be an early majority or late majority adopter of those systems that are critical to the operations of your service, product, or platform.

Asynchronous Design

Whenever possible, systems should communicate in an asynchronous fashion. Asynchronous systems tend to be more fault tolerant to extreme load and do not easily fall prey to the multiplicative effects of failure that characterize synchronous systems. We will discuss the reasons for this in greater detail in the next section of this chapter.
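
A minimal sketch of the idea, assuming a single in-process worker: the caller hands a message to a bounded queue and returns immediately, so a slow or failing consumer does not stall the caller. The `AsyncDispatcher` class and its parameters are hypothetical; a production system would more likely use a message bus between separate services.

```python
import queue
import threading


class AsyncDispatcher:
    """Hands messages to a background worker through a bounded queue."""

    def __init__(self, handler, max_pending: int = 1000):
        self._queue = queue.Queue(maxsize=max_pending)
        self._handler = handler
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def submit(self, message) -> bool:
        """Enqueue without blocking; return False if the queue is full."""
        try:
            self._queue.put_nowait(message)
            return True
        except queue.Full:
            # Shed load instead of making the caller wait on the consumer.
            return False

    def _run(self) -> None:
        while True:
            message = self._queue.get()
            try:
                self._handler(message)
            except Exception:
                # A handler failure must not take down the worker or the caller;
                # a real system would log it and possibly retry.
                pass
            finally:
                self._queue.task_done()

    def drain(self) -> None:
        """Block until every accepted message has been handled."""
        self._queue.join()


if __name__ == "__main__":
    handled = []
    dispatcher = AsyncDispatcher(handled.append)
    for i in range(5):
        dispatcher.submit(i)
    dispatcher.drain()
    print(handled)
```

The bounded queue is a deliberate choice here: when the consumer falls behind, `submit` reports failure rather than blocking, which keeps a downstream slowdown from propagating back to the caller.
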

Stateless Systems

Although some systems need state, state has a cost in terms of availability, scalability, and overall cost of your system. When you store state, you do so at a cost of memory or disk space and maybe the cost of databases. This results in additional calls that are often made in synchronous fashion, which in turn reduces availability. As state is often costly compared to stateless systems, it increases the per unit cost of scaling your site. Try to avoid state whenever possible.
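
One common way to keep servers stateless, shown here only as an illustration, is to hand the small amount of per-user state back to the client in a signed token, so that any server can verify and use it without a synchronous session lookup. The function names, the payload fields, and the secret below are hypothetical, and the sketch omits expiry and encryption, which a real system would need.

```python
import base64
import hashlib
import hmac
import json


def encode_token(payload: dict, secret: bytes) -> str:
    """Serialize payload and append an HMAC-SHA256 signature."""
    body = base64.urlsafe_b64encode(
        json.dumps(payload, sort_keys=True).encode("utf-8"))
    signature = hmac.new(secret, body, hashlib.sha256).hexdigest()
    return body.decode("ascii") + "." + signature


def decode_token(token: str, secret: bytes):
    """Return the payload if the signature is valid, otherwise None."""
    body, _, signature = token.rpartition(".")
    expected = hmac.new(secret, body.encode("utf-8"),
                        hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how much of the signature matched.
    if not hmac.compare_digest(signature.encode("utf-8"),
                               expected.encode("utf-8")):
        return None
    return json.loads(base64.urlsafe_b64decode(body.encode("ascii")))


if __name__ == "__main__":
    secret = b"example-secret"  # hypothetical; load from configuration in practice
    token = encode_token({"user_id": 42, "cart": ["sku-1"]}, secret)
    # Any server holding the secret can recover the state; none has to store it.
    print(decode_token(token, secret))
```
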

Scale Out Not Up

This is the principle that addresses the need to scale horizontally rather than vertically. Whenever you base the viability of your business on faster, bigger, and more expensive hardware, you define a limit on the growth of your business. That limit may change with time as larger scalable multiprocessor systems or vendor supported distributed systems become available, but you are still implicitly stating that you will grow governed by third-party technologies. When it comes to ensuring that you can meet your shareholder needs, design your systems to be able to be horizontally split in terms of data, transactions, and customers.
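
A horizontal split by customer can be as simple as a stable function from customer identifier to shard number. The sketch below is a minimal example under stated assumptions: the `shard_for` name and the customer identifiers are hypothetical, and plain modulo placement moves most customers when the shard count changes, so a real system might prefer a lookup table or consistent hashing.

```python
import hashlib


def shard_for(customer_id: str, shard_count: int) -> int:
    """Map a customer to one of `shard_count` shards, stably across processes."""
    if shard_count < 1:
        raise ValueError("shard_count must be at least 1")
    # Use a cryptographic digest rather than hash(), which Python randomizes
    # per process for strings.
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


if __name__ == "__main__":
    for customer in ("acme", "globex", "initech"):
        print(customer, "->", shard_for(customer, 4))
```

Adding capacity then means adding shards on inexpensive machines rather than buying a larger single machine.
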

Design for at Least Two Axes of Scale

Whenever you design a major system, you should ensure that it is capable of being split on at least two axes of the cube that we introduce in Chapter 22, Introduction to the AKF Scale Cube, to ensure that you have plenty of room for “surprise” demand. This does not mean that you need to implement those splits on day one, but rather that they are thought through and at least architected so that the long lead time of rearchitecting a system is avoided.

Buy When Non Core

We will discuss this a bit more in Chapter 15, Focus on Core Competencies: Build Versus Buy. Although we have this identified as a cost initiative, we can make arguments that it affects scalability and availability as well as productivity even though productivity isn’t a theme within our principles. The basic premise is that regardless of how smart you and your team are, you simply aren’t the best at everything. Furthermore, your shareholders really expect you to focus on the things that really create competitive differentiation and therefore shareholder value. So only build things when you are really good at it and it makes a significant difference in your product, platform, or system.

Use Commodity Hardware

We often get a lot of pushback on this one, but it fits in well with the rest of the principles we’ve outlined. It is similar to our principle of using mature technologies. Hardware, especially servers, moves at a rapid pace toward commoditization, characterized by the market buying predominantly based on cost. If you can develop your architecture such that you can scale horizontally easily, you should be buying the cheapest hardware you can get your hands on, assuming that the cost of ownership of that hardware (including the cost of handling higher failure rates) is lower than that of higher-end hardware.
