Architecting for latency - What I learnt from Dan Pritchett’s (eBay) talk
January 21st, 2008 | Published in Architecture
It’s been almost three months since I attended the Colorado Software Summit (2007), but between my studies and the Red Hat Messaging launch I never got around to blogging about the notes I took during Dan’s talk on “Architecting for latency”. I managed to blog about his other talk on Scalability here. Well, better late than never, so here it is.
One of his key points was that people often ignore latency, or try to work around it, instead of embracing the reality. That is exactly the second fallacy of distributed computing: “Latency is zero”. Understanding this is especially important if you work on systems that are distributed geographically. Here are some of the tips I noted down.
- Try to serve users from where they are located.
Move latency from users into your network. This moves the complexity into your system, but it improves the user-perceived performance.
Ex: If you serve your European customers from a data center in the US, they will experience some delays. Adding a data center in Europe improves the customer experience, but it also adds complexity to your system, because you now need to deal with consistency between the data centers, etc.
- Co-locating your services is a good strategy, but it may not be possible all the time. So always think about how you would architect your system or components so that there are no issues if you move them apart and the latency between them goes from 10ms to 100ms.
- Don’t couple your system to the hardware or network topology, or upgrading might become a nightmare.
- Trying to achieve global data consistency limits options
Think about what your business can tolerate when it comes to inconsistent state. Trying to achieve global consistency can make things very complicated.
Keep your users in one data center. Bouncing them between data centers will force you to maintain global consistency. Tell them that the data is valid within x secs.
- Prioritize according to user needs
Here is a nice example given by Dan. The SLA for seeing the payments due on an invoice is 5 mins, and the customers never complain :)
The SLA for seeing money in a seller’s account is 10 secs, or else the users get a bit upset.
- If you are recovering from a failure, reduce serviceability until your system gets back to its normal state. This will prevent overloading the system and increasing latency.
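To make the recovery tip a bit more concrete, here is a minimal sketch of one way to reduce serviceability: admit only a fraction of the requests right after a failure and ramp back up to full service over a fixed window. The class name `RecoveryThrottle`, the linear ramp, and the 10% floor are my own assumptions for illustration; Dan did not describe a specific mechanism.

```python
import random
import time


class RecoveryThrottle:
    """Hypothetical admission control: serve a growing share of requests
    while the system recovers, so the backlog does not overload it."""

    def __init__(self, ramp_seconds, floor=0.1,
                 clock=time.monotonic, rng=random.random):
        self.ramp_seconds = ramp_seconds  # how long the ramp-up lasts
        self.floor = floor                # share admitted right after the failure
        self.clock = clock                # injectable so the ramp can be tested
        self.rng = rng                    # injectable source of numbers in [0, 1)
        self.recovery_started = None      # None means normal operation

    def start_recovery(self):
        """Call this when the system comes back up after a failure."""
        self.recovery_started = self.clock()

    def admitted_fraction(self):
        """Share of requests to serve right now (assumed linear ramp)."""
        if self.recovery_started is None:
            return 1.0
        elapsed = self.clock() - self.recovery_started
        if elapsed >= self.ramp_seconds:
            self.recovery_started = None  # recovery is over, full service again
            return 1.0
        return self.floor + (1.0 - self.floor) * elapsed / self.ramp_seconds

    def admit(self):
        """Return True if this request should be served, False if shed."""
        return self.rng() < self.admitted_fraction()
```

A request handler would call `admit()` first and return a "try again later" response when it comes back `False`. Which requests to shed first is a business decision, in the same spirit as the SLA example above.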
- Use distributed transactions carefully
Here is an example.
- Component C in the first configuration is complicated, as it needs to ensure that both databases are updated. DB2 can be in a different geographical location, and updating it can add a lot to the latency.
- In configuration two, C is smaller and simpler. We can introduce other components to process the data without impacting the user-perceived performance for the client.
- Another advantage is that C can do a transaction even when DB2 is down.
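The points above can be sketched in code. This is only my reading of configuration two, under the assumption that C commits to DB1 alone and hands the DB2 update to a queue that a separate component drains. The dictionaries stand in for the two databases, and `component_c`, `replicator`, and `demo` are hypothetical names rather than anything from the talk.

```python
import queue
import threading

db1 = {}                            # stand-in for the local database
db2 = {}                            # stand-in for the remote database
pending = queue.Queue()             # updates waiting to be applied to DB2
db2_available = threading.Event()   # set while DB2 is reachable


def component_c(key, value):
    """Handle a client request: commit locally, defer the remote update."""
    db1[key] = value                # the only write on the user's critical path
    pending.put((key, value))       # DB2 is updated later by the replicator
    return "ok"


def replicator(stop):
    """Separate component that applies queued updates to DB2 when it is up."""
    while not stop.is_set():
        if not db2_available.wait(timeout=0.05):
            continue                # DB2 is down, keep the updates queued
        try:
            key, value = pending.get(timeout=0.05)
        except queue.Empty:
            continue
        db2[key] = value
        pending.task_done()


def demo():
    """C accepts a transaction while DB2 is down; DB2 catches up later."""
    stop = threading.Event()
    worker = threading.Thread(target=replicator, args=(stop,), daemon=True)
    worker.start()
    result = component_c("invoice-1", "paid")   # DB2 is still down here
    db2_before = dict(db2)
    db2_available.set()                         # DB2 comes back
    pending.join()                              # wait for the backlog to drain
    stop.set()
    worker.join()
    return result, db2_before, dict(db2)
```

The price is that DB2 lags behind DB1 for a while, which ties back to the earlier tip about deciding how much inconsistency the business can tolerate. A real system would also need a durable queue so that pending updates survive a crash of C.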