Architecture

5 reasons why Distributed Systems are hard to program

July 23rd, 2008  |  Published in Architecture, Distributed Systems, programming  | Add to del.icio.us

Here are 5 reasons why I found distributed system are hard to program. This is not some sort of thorough analysis, but merely my observations in dealing with such systems. For completeness, here is the definition of “Distributed System” I used.
A distributed system contains of more than one process that runs as a single system. These processes can be on the same computer or multiple computers that are on a local area network or geographically distributed over a wide area network.

Without any further do here are the reasons in no particular order.

1. Difficulty in identifying and dealing with failures.
When communicating between processes failures can happen at many levels. Dealing with them is not trivial. Of course you rely on frameworks based on technologies like RMI, CORBA, COM, SOAP, AMQP, REST(is an architectural style not a standard) etc to handle these. But the fact remains that you still need to clearly think about these cases and handle these situations properly.

For example if we consider a simple interaction between two processes on different computers, the following failures can happen.

  • Failures that occur within the process that initiates the communication (sending the message or invoking the RPC call).
  • Failures between the time the process hands over the request to the OS and the OS writing it to the network.
  • Network failures between the time it takes to transmit the packets from one computer to the other.
  • Failures between the time the OS on the receiving end receives the packets and then handing it over to the recipient process.
  • Failures that occur when the recipient process tried to process the request/message.

Sometimes the framework you use, is unable to/may not report all these error cases. Sometimes when the error is reported, it may not contain enough information to figure out at which level the error occurred.
Did it reach the remote computer? if so how far up the stack did it go?. If the receiving process got the request or message did the error occur before or after the request/message was processed?
In some cases where idempotency is built into the the receiving application or the framework/protocol (ex a message client that detects duplicate messages, or doing an HTTP GET) a simple retry maybe ok. In some cases Idempotency and retrying maybe expensive or difficult to implement. In such cases careful thought needs to be given on how these different errors are identified and handled.

2. Achieving consistency in data across processes.
One of the hardest problems in programming distributed systems is achieving a consistent view of data across the processes. When one processes updates some data, you need to replicate them across the other processes, so if any other process decides to operate on the same set of data, then it is doing so on the most current copy.
Lets look at two examples.

Assume a global banking application for ABC bank. A customer goes to a branch in New York, US and deposits money to an account. A few moments later his relative in London, UK does a withdraw on that account. Due to latency there is obviously a time lag before the process in London, UK sees the updated amount in the account.

In an online trading system, a user in NY places an item for sale. The transaction is updated on the closest data center which is in Boston. A few moments later another user in LA is searching for the exact same item and is served off a data center in Phoenix. The user in LA may or may not see the item due to the latency involved in replicating the data across

For example 1 strong consistency is required, while for example 2, you could get away with weak consistency, for example by setting an SLA that says data is valid within a 5 min time window.
This is not an easy problem to solve and this area itself is a subject on its own. Wener Vogels wrote a nice peice on this called Eventually Consistent which is worth reading.
Of course there are specialized frameworks/libraries that can handle this for you. But still there is no escape for you and you pretty much need to have an understanding of the pros and cons of various approaches, failure modes etc.

3. Heterogeneous nature of the components involved in the system.
A distributed system may contain components written in a variety of languages deployed across machines with different architectures and operating systems. Needless to say that this poses certain challenges (especially integration, interoperability issues) when implementing the system. A whole range of standards/technologies were presented to solve these issues, including but not limited to CORBA, SOAP, AMQP, REST (is an architectural style not a standard) and RPC based frameworks like ICE, Thrift, Etch etc. Anyone who has worked with these technologies knows that neither of these are trivial to use nor provide a complete solution in every situation.

If anybody has read the recent posts by Steve Vinoski and the discussions around it would realize the issues/challenges surrounding RPC. The following paper discuss the impedance mismatch problems when working with IDL based systems. The issues with type systems and data formats are not limited to RPC only. When using a message oriented approach like SOAP (doc lit style) or AMQP you will end up tunneling data thats not supported by the protocol as a string or a sequence of bytes. When using REST you would need to represent your resource in a format the requesting application understands/supports, which maybe quite different from the native format.

Again not an easy issue to deal with no matter what technology or framework is used. As an architect/developer you need to understand these issues and deal with them accordingly.

4. Testing a distributed system is quite difficult.
This is arguably one of the hardest aspects of developing a distributed system. Verification of the behavior and impact of your code in the system is not easy.
There are many aspects that needs to be tested, and doing so before every checkin is not a fun task at all. Running some of these tests before every checkin is not practical. But its a good idea to run them nightly and some tests during the weekend. Here are some of the areas that needs to be tested (I plan to write another blog entry elaborating on the testing aspects).

  • Functionality testing (can be covered with well written unit testing)
  • Integration testing - you need to test the distributed system as a whole with all the components involved
  • Interoperability testing - this is crucial when heterogeneous components (different languages, OS) are involved, and is quite different from integration testing
  • TCK compliance - If your system is based on standards/specifications, then you need to ensure that you haven’t broken anything w.r.t compliance
  • Performance testing - to ensure that your changes haven’t accidentally caused a degradation in performance
  • Stress testing - to ensure that your checkin hasn’t accidentally caused any stability issues - ex increased chance of deadlocks when the load increases
  • Soak testing - to ensure that your checkin hasn’t caused any longevity issues - ex a memory leak thats manifested after a couple hours, days

Most often than not developers cut corners in their testing as running these tests are tedious and time consuming. Also these tests need to be run regularly to catch issues in a timely manner and the best way to tackle this issue is to automate as much testing as possible. There many options with continuous build systems like cruisecontrol or using a plain old cron job.
Functionality testing, TCK compliance, certain types of integration and interoperability tests can be run periodically.
In most organizations test machines are just lying around doing nothing during the night (unless around the clock testing is done with development centers in different time zones.). Instead of wasting computing cycles, you could automate test suites to run during the night. More time consuming integration and interoperability tests, performance, stress and soak testing can be done nightly, while more longer duration soak testing can be scheduled to run during the weekends.

While testing is a tough issue for any type of system, distributed systems have a lot more failure points which adds to the complexity.
Getting these tests right to cover these failure points and executing them needs a lot of careful thought and planning.

5. The technologies involved in distributed systems are not easy to understand .
Distributed system are not easy to understand. Neither are the myriad of technologies used in developing these systems.
Most folks find it difficult to grasp the concepts behind these technologies. If you look into the discussions and misconceptions surrounding REST you can understand what I am trying to get at. CORBA was not an easy spec to understand, so is WS-* or AMQP. While it is true that you don’t need to understand everything to develop using them, you still need at least a reasonable understanding to figure how to tackle some of the above mentioned issues. Frameworks based on these technologies are touted as the cure for these problems. Sure they could help, but it still does not shift the burden away from you.
To compound the issue all sorts of vendors keep touting their technology/framework as the next silver bullet. No matter what vendor you use, at the end of the day you are still responsible for getting it right. And it is not an easy task. You need to face the reality that distributed systems are hard and that you cannot hide every complexity behind some framework.

Restructuring Code

June 9th, 2008  |  Published in Architecture, programming  | Add to del.icio.us

Most programmers need to deal with restructuring the code they work on due to a variety of reasons. While most of the time it is driven by demand, sometimes it is also done for personal reasons. Here are some of the reasons I have had or seen within the teams that I have worked over the years.

  1. The current code would have reached a stage where it is impossible to do any more modifications without breaking something else
  2. The requirements have changed so much that the current architecture/design cannot handle it without a redesign
  3. The current application doesn’t/will not scale, perform well enough as it wasn’t designed to handle the current load/anticipated growth
  4. The folks who worked on the code are no longer there and nobody knows what the code really does (or what it was supposed to do)
  5. You don’t really like the current way it is implemented and think that there is a better way to do it using framework X or library Y

While sometimes restructuring could be done easily, but 90% of the time it is not a trivial task. When the frustration gets to you, you may have even entertained the idea of rewriting an application/module/section from scratch. Is this really a good idea? Sometimes this maybe the only option, but most of the time this may end up being a bad idea due to a variety of reasons.

  • One of the biggest mistakes is to throw away the old code without any due consideration simply based on the assumption that the old code is bad and we are going to write much better code.

    Throwing away the old code (especially if it was in production) means, you are throwing away months (or years) of tested, battle hardned code that may have had fixes for bugs that you aren’t even aware of. If you don’t take this into account, the new code you write may end up showing the same bugs that are already fixed in the old code. This will waste a lot of time ,effort, knowledge gained over the years

  • This method or class is ugly, lets throw it away and write again.

    That odd looking method or that badly written class may have some fix for a race condition or an optimization that one of your customers is depending upon. Discarding that code as crap without really understanding whats going on may end up with dire consequences for your team.

  • Not paying enough attention to the existing unit tests/ test frameworks when building the new system.

    These tests were added for a reason, possibly in response to a bug or some sort of intermittent failure that was reported on a production system. These failures/errors may have been reported way before you joined the company. Sometimes none of the current team members are aware of all the issues. So discarding the test code means you are throwing away years of hard work and knowledge

  • The current code is way behind all the cool technology we have today. The new framework X can do things a lot more elegantly, so lets re write.

    This is by far the worst mistake. If something is not broken then why try to fix it?. The code is ugly, or technology used is not cool are not good reasons. Simply bcos the style or structure of the code is not according to your personal preference is not a reason at all for unwanted restructuring .

I have been guilty of doing most the of the above mentioned points and through hard lessons I have realized that,

  • The best approach for restructuring starts with taking stock of the existing code base and tests written against that code. This will help you understand the strengths and weaknesses in the existing code, so you could ensure that you preserve the strong points while avoiding the mistakes.
  • It is best to reuse as much code as possible, bcos no matter how ugly the code is, it has been tested/reviewed etc.
  • Incremental changes are better than one massive code change. Incremental changes allows you to gauge the impact on the system more easily through feedback from tests etc. It is not fun to see 80 test failures after you make a change and can lead to frustration/preasure that will in turn result in more bad decesions . A couple of test failures is easy to deal with and provides a more managable approach.
  • After each iteration it is important to ensure that the existing tests pass. Analyze why the tests are failing, and make modifications/add new tests if nessacery if the existing tests are not sufficient enough to cover the changes you made. Failure to do so can result in a lot of pain down the road.
  • Avoid the temptation to rewrite everything. Personal preferences and ego shouldn’t get in the way. If something works then don’t change it.
  • Remember that humans make mistakes and restructing will always not garuntee that it will be atleast as good as the previous attempt. I have seen and have been part of several failed restructuring attempts

Having said all of that, sometimes you have no option but to rewrite from scratch. But IMHO that should be your last resort.

MapReduce vs RDBMS

January 26th, 2008  |  Published in Architecture, Parallel Computing  | Add to del.icio.us

My friend Matt brought to my attention an article on why MapReduce is a step backward, written by David J. DeWitt and Michael Stonebraker. I also stumbled on this blog entry written by Mark CC from Google. It is sad to see that most people advocate a particular solution to every problem imaginable and then refuse to look at anything else from a neutral pov. I let you draw your own conclusions.

Architecting for latency - What I learnt from Dan Pritchett’s (eBay) talk

January 21st, 2008  |  Published in Architecture  | Add to del.icio.us

It’s been almost 3 months since I attended the Colorado Software Summit (2007), but due to studies and the Red Hat Messaging launch I never got around to blog about the notes I took during Dan’s talk on “Architecting for latency”. I managed to blog about his other talk on Scalability here. Well better late than never. So here it is.

One of the key points he mentioned was that people often ignored latency and tried to work around it instead of embracing the reality. Which is exactly the second fallacy of distributed computing - Latency is zero. Understanding this reality is important especially if you are involved in systems that are distributed geographically. Here are some of the tips I noted down.

  • Try to serve users from where they are located.
    Move latency from users into your network. This will move the complexity into your system, but it will improve the user perceived performance
    Ex: If you serve your European customers from a data center in US, your customers will experience some delays. If you add a data center in Europe then it will improve the customer experience, but now it will add more complexity into your system as you would need to deal with consistency between the data centers ..etc
  • Co-locating your services is a good strategy, but it may not be possible all the time. So always think about how you would architect your system or components so there are no issues if you move them apart from 10ms to 100ms.
  • Don’t couple your system to the hardware or network topology. Upgrading might be a nightmare
  • Trying to achieve global data consistency limits options
    Think about what your business can tolerate when it comes to inconsistent state? Trying to achieve global consistency can make things very complicated.
    Keep your users in one data center. Bouncing them between data centers will force your to maintain global consistency. Tell them that the data is valid within x secs.
  • Prioritize according to user needs
    Here is a nice example given by Dan. SLA for seeing the payments due on an invoice is 5 mins and the customer never complains:)
    SLA for seeing money in a sellers account is 10 secs, or else the users get a bit upset.
  • If you are recovering from a failure, reduce serviceability until your system gets back to normal state. This will prevent overloading the system and increasing latency.
  • Use distributed transactions carefully
    Here is an example.dtx

    • Component C in the first configuration is complicated as it needs to ensure both data bases are updated. DB2 can be in a different geographical location and updating it can add a lot to latency.
    • In configuration two, C is smaller and simple. We can introduce other components to process the data without impacting the user perceived performance for the client.
    • Another advantage is that C can do a transaction even when DB2 is down.

I REST my case

January 21st, 2008  |  Published in Architecture, General, REST, SOA  | Add to del.icio.us

Mark Baker has hung up his boots on the REST vs SOAP debate. I appreciate his effort in building awareness about the value of REST and convincing people that it provides a solid basis for designing distributed systems. In the same post, Mark also says that, “The war really has been won”. Other REST folks including Stefan Tilkov says more or less the same thing too. I don’t know. In my humble opinion, a few years from now when the systems we design/build today using a RESTful, WS-* (or whatever) approach will show us which approach yields the better result in terms of scalability, extensibility, reliability, interoperability, flexibility, versioning, reusability ..etc. Sure the REST folks would say the web has being there for a while now and it works. But usually there is a human on the other side that drives the interaction. So it remains to be seen whether it will be the same with application-to-application interactions as well.
After all the hype surrounding REST,WS-* etc.. dies out and when people have enough experience building real world applications using both approaches and realizes the advantages and disadvantages of each approach there will be less debate as to which is better, or whether we need both approaches ..etc. The answers to these questions will become more clear to the ordinary folks in time. I for one will have an open mind and is very interested to see the outcome of all of this.

The REST vs SOAP debate has been more emotional/religious and less technical as of late. Several folks burned bridges due to insidious remarks, inflammatory comments and even personal attacks. Irrespective of the technical merits one should be able to tolerate/appreciate differences of opinions and debate in a more disciplined manner than resorting to personal attacks or inflammatory comments. Even if you are the most intelligent person in the world it doesn’t matter if you can’t put forward your technical arguments without degrading your own self by making inflammatory comments or personal attacks.

Scaling your system - What I learnt from Dan Pritchett’s (eBay) talk

November 16th, 2007  |  Published in Architecture  | Add to del.icio.us

I was lucky enough to attend both talks given by Dan Pritchett during the Colorado Software Summit this year. The first one was, “You Scaled your what”. If you want to know what scalability is, here is an excellent introduction by Werner Vogels. After listening to what Dan had to say, the following points stood out in my mind

  • You need to understand and take complete control of your architecture. Read my post on Architecture is your responsibility for more thoughts on this.
  • You need to have scalability in mind right from the beginning. Trying to achieve scalability later can be time consuming and very costly. Quoting from Werner Vogels post
  • Why is scalability so hard? Because scalability cannot be an after-thought. It requires applications and platforms to be designed with scaling in mind, such that adding resources actually results in improving the performance or that if redundancy is introduced the system performance is not adversely affected. Many algorithms that perform reasonably well under low load and small datasets can explode in cost if either requests rates increase, the dataset grows or the number of nodes in the distributed system increases.

  • Transactional scaling is just one dimension. You need to think about Scalability of data, operational,deployment, power ..etc. This is a minimalistic set. Try to figure out what dimensions are important to your organization.
  • All scalability dimensions are related and impacts each other. Any dimension ignored can could evolve into a problem for your application
  • Prefer vertical over horizontal scaling. Vertical scaling is better for your vendors and is not a viable long term strategy. There is so much you can get by increasing memory, CPU etc..

Transactional Scaling
Usually measured in TPS and is a traditional indicator for application performance.

  • Keep asking the question “How long can the business survive?” based on,
    • Time-to-live on current resources.
    • Time-to-live on maximum plausible configuration.
  • These metrics should be taken regularly to anticipate possible production bottle necks and identify issues before they become a crisis.

Data Scaling
How well does your data scale? Think about,

  • Functional Decomposition, group data by logical relationships, business importance,transactional volumes etc.
  • Think about partitioning data (sharding).
  • Is all data equally important? prioritizing your data and allocating resources accordingly will help you scale better.

Operational Scalability
How hard is it to run your software? Operational scalability is a software problem and you need to think about operational concerns right from the beginning. Pay attention to,

  • Logging metrics, Monitoring.
  • Controlling/updating/tuning live apps without disrupting traffic.

Deployment Scalability
You need to design/architect your systems while keeping the following in mind,

  • Ability to do incremental roll outs (and rollback if there are problems) without disrupting live traffic.
  • Managing component dependencies during deployment without disrupting live traffic.
  • Your architecture shouldn’t assume or decouple itself to any hardware,network topology or data center topology. This allows you to take advantage of new hardware, network topologies ..etc without significant changes.

Power Scalability
Power can be a limiting factor in a data center and may put bounds on transactional scaling.

  • How efficient is your software?, wasted clock cycles == wasted watts.
  • Consider vitalization for best utilization of your hardware resources.

Some good tips I managed to note down

  • Run old and new schema parallelly and then take out the old schema after a while when you gain enough confidence.
  • Prioritize services, Based on the SLA’s you could take a hit on certain services over others.
  • Incremental rollouts is a very good way to roll out new features while managing risk and also prevents taking the system offline.
  • Schedule deployment during working hours instead of weekends/nights as this enables your developers, support staff to attend to problems while they are alert and without being distracted by non work issues.If you do incremental rollouts this is possible as you are not disrupting traffic.

The value of principled design - REST is just one example

November 14th, 2007  |  Published in Architecture, REST, SOA  | Add to del.icio.us

To me the value of Roy Fielding’s dissertation goes beyond REST. Steve Vinoksi summarised it very well in one of his comments while answering a comment I made on his blog.

It’s(Roy’s dissertation) not really primarily about REST; rather, it’s about principled design. Much of his dissertation is about architectural elements, principles, constraints, properties, and the relationships between them all. REST is used as a very clear example in chapter 5 of what principled design is all about.

Why can’t we use a principled design approach when we do SOA or for that matter any other architecture?

When we add contraints or relax constraints we induce certain properties in our architecture. As an architect you make an educated desision as to what constraints make sense in your environment and what doesn’t. When designing systems don’t we go through decisions like “should we make these services stateless or statefull ..etc” during our design meetings ?

I think in what ever system you design as an architect you should think through and note down the constraints you want to impose on your system. This will provide a proper foundation to your system and an excellent guideline to your developers which will clearly communicate the desired goals of your system. Then later on when somebody else wants to relax any of these constraints or add more constraints they already have a guideline and can see how the “relaxing of an existing constraint” or the “addition of a new constraint” can impact the overall system.

REST is just a name coined by Roy to identify a set of constraints, and they are not the only constraints, nor the best combination of constraints in every situation. As Steve mentioned Roy spends the first few chapters providing an excellent analysis about “architectural elements, principles, constraints, properties, and the relationships between them all” and of course the value of a principled design approach.
To me the value of Roy’s thesis goes beyond REST and I hope most people would realize the same.

Architecture is YOUR responsibility

October 29th, 2007  |  Published in Architecture  | Add to del.icio.us

I read this post on Steve Vinoski’s blog that quoted Ron Schmelzer of ZapThink, who makes an excellent point. “Architecture is YOUR responsibility“. Well guess what, as much as vendors would like to say it is not, the reality is that you need to make the critical decisions about the architecture. Instead of some vendor, you need to be in charge of the direction and overall vision in terms of the architecture. Instead of choosing a vendor/product and building your strategy/architecture around it, you need to think through your strategy/architecture and choose the right vendor/product that can help you achieve your vision. If anybody was lucky enough to attend a talk given by Dan Pritchett (eBay), you would have realized that companies who understood this reality and took responsibility for the architectural decisions eventually made it big.

Dan’s comments on architecture was very insightful (I want to write a separate post on what I learned from his talk at the recently concluded Colorado Software Summit). The underlying truth of everything he said, was that they understood and took responsibility for the architectural decisions they made, instead of relying on some vendor to provide direction and overall vision.

There is no vendor out there, that can provide you with some ESB that can magically transform your enterprise into a SOA platform or some messaging middleware that can help you scale your enterprise to whatever limits you want unless you know what you are doing and take ownership of the overall vision. You need to understand the overall architecture, make decisions and take responsibility for them. An ESB or a messaging middleware are merely a bunch of tools that help you get there or in other words they are just a means to an end not the end itself.

There is no framework out there that can force architectural decisions on your solutions that you are not willing to make yourself. During my REST in peace talk, there was a surprising number of folks who asked me about a framework that can help them develop RESFTful services. Guess what, the road to a RESTful approach (or for that matter any architectural style) starts with the architectural decisions you make (the way you think/design your services) and not with some framework where you have to flip a switch or use a bunch of annotations that turns your code into a RESTful service. That is precisely why the contract first approach is recommended over a code first approach when you do web services. You need to think about how you design your service first and then use some framework to generate your WSDL and your code from that, not the other way around.

We all remember how the EJB mania deceived us. Many companies paid millions of dollars to App Server vendors to solve their architectural problems. The whole notion of “you only need to think/write the business logic, and we will take care of the remoting, transactions, persistence, scalability ..etc” was just an illusion. Neither did it preclude people from making extremely stupid architectural decisions nor did it provide anymore scalability than the simple tomcat web server for most of the use cases.

You need to think carefully about the architectural decisions you make and understand the impact it has on the overall goals/vision of your enterprise. You need to be aware of operational, load, managerial and geographical scalability from day one. You cannot offset your lack of architectural vision by using some framework, product or vendor. It will only make your vendor happy, but not your customers.