Three key principles for designing good systems

May 18, 2019

Everything difficult is usually complex, and everything complex is difficult. The resources and skills required to deal with complexity increase exponentially as complexity increases. An assembly line of 50 workers is n times harder to manage than one of 25. A flowchart with 10 branches is n times harder to understand than one with 5. A program of 1000 lines is n times harder to manage than one of 500. And n is usually much higher than 2 in all these cases. Thus complexity is the fundamental obstacle in running or scaling any system. All large successful systems, whether business processes, assembly lines, services, or software programs, need to arrest growing complexity.

Complexity manifests in terms of known unknowns, unknown unknowns and unknowables, in increasing order of how hard they are to manage. Complexity is also multidimensional. It can be the amount of resources required, such as money, time, things or people. It can be cognitive complexity, such as the time required to understand something, the skill of the person required for the job, the chances of making a mistake, or the number of people who need to arrive at a consensus. One way to quantify complexity is the cardinality of certain things, events or attributes in a system: the number of people, items, conditions, assumptions, inputs, outcomes, etc. It may not be necessary to define it mathematically or to understand the exact relationship between resources and complexity. But it is important to have clarity on what kind of complexity you are dealing with and what parameters affect it.

When it comes to software architecture, many developers associate it with choices of infrastructure, language, databases, frameworks, etc. However, this is usually not where they go badly wrong. Most of these choices evolve with time and end up as technical debt anyway. Teams come to a grinding halt because they fail to understand and follow some fundamental principles, and thereby blow the complexity of the system out of proportion.

How to keep the complexity of a system in check?

Flow - Most complex systems are split across layers of abstraction. For example, a mobile app may have a networking layer, a view controller layer, a model layer, a database layer, etc. A server-side program might have primary data sources, secondary data sources serving as views, and so on. A web app might have a hierarchy of UI components in React and might use Redux or GraphQL to feed data to the components and drive actions.

Firstly - Commands and data should flow unidirectionally to keep systems simple. If X modifies Y, then Y should not modify X as part of the same operation. This also means X should not read back the state of Y as part of the same operation. X and Y can be services, objects or functions. Read or query a particular piece of data only once during a workflow. For example, if service X stores all users and service Y indexes all users stored by X, then X must not query the index of Y during an operation in which it has modified a user object. Or, if object B is modified every time object A is modified, then B must not also be modified directly in the respects that A drives. In a workflow that is meant to update a list of users and send them emails, read the list only once. The crux is that the answer to any question such as "who modifies this" or "where is this coming from" should be a single entity or a single place. The number of possibilities increases exponentially with the number of branches, and so does the entropy of the system. Ultimately it leads to debugging hell.
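The user store and index example above can be sketched roughly as follows. This is only an illustration under assumptions: `UserStore`, `UserIndex`, `on_user_changed` and `update_email` are hypothetical names, and in-memory dictionaries stand in for real services.

```python
class UserIndex:
    """Hypothetical service Y: a derived, searchable view of users."""

    def __init__(self):
        self.by_email = {}

    def on_user_changed(self, user_id, email):
        # Y only receives updates pushed by X; it never calls back into X.
        # Drop any stale entry for this user, then add the new one.
        self.by_email = {e: u for e, u in self.by_email.items() if u != user_id}
        self.by_email[email] = user_id


class UserStore:
    """Hypothetical service X: the single owner of user records."""

    def __init__(self, index):
        self.users = {}
        self.index = index

    def update_email(self, user_id, email):
        self.users[user_id] = email
        self.index.on_user_changed(user_id, email)  # data flows X -> Y only
        # Answer from X's own state; do not read the result back from Y.
        return self.users[user_id]


index = UserIndex()
store = UserStore(index)
store.update_email(1, "a@example.com")
store.update_email(1, "b@example.com")
```

With this shape, "who modifies the index" has one answer (the store), and the store never depends on the index's state within the same operation.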

Secondly - Be clear whether communication between two actors in a particular direction consists of commands and events, or of objects, but not both. For example, in Redux, a UI component receives properties from the store and dispatches actions back to the store. Each direction carries only one kind of message, which is typical of Redux. Communication between two consistent systems, or from a store to an interface, is usually modelled in terms of objects. When communication is meant to extend workflows and logical operations, it is modelled in terms of events that have a definite semantic meaning and can be interpreted wholly and unambiguously at the other end.


Redux data flow (https://github.com/reduxjs/redux/issues/653)
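The store-to-component and component-to-store directions described above can be imitated in a few lines of Python. This is a toy sketch, not Redux's actual API: `Store`, `reducer`, the `todo/added` action type and the `rendered` list (standing in for a UI component) are all assumptions made for illustration.

```python
def reducer(state, action):
    """Pure function: (state, action) -> new state. Never mutates its input."""
    if action["type"] == "todo/added":
        return {**state, "todos": state["todos"] + [action["text"]]}
    return state  # unknown actions leave the state unchanged


class Store:
    def __init__(self, reducer, initial_state):
        self._reducer = reducer
        self._state = initial_state
        self._listeners = []

    def get_state(self):
        return self._state

    def subscribe(self, listener):
        self._listeners.append(listener)

    def dispatch(self, action):
        # Upward direction: only events (actions) travel to the store.
        self._state = self._reducer(self._state, action)
        # Downward direction: only objects (state) travel to the listeners.
        for listener in self._listeners:
            listener(self._state)


rendered = []  # stands in for a UI component's render output
store = Store(reducer, {"todos": []})
store.subscribe(lambda state: rendered.append(list(state["todos"])))
store.dispatch({"type": "todo/added", "text": "write tests"})
```

The listener never writes to the store's state, and the store never receives objects from the listener, so each direction carries exactly one kind of message.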

Guarantees - Another great source of complexity in a system is its guarantees, both those it provides and those it assumes. These guarantees are usually related to consistency, availability, ordered processing, idempotence, scalability, response times, etc. Each of these guarantees is extremely hard to engineer, and many are incompatible with each other. So be mindful of what guarantees you want to build and assume in your systems. You must reason about why you really need them and whether the system can be built with simpler assumptions or design choices. It is very likely that the guarantees you think you have built into the system will fail in a small number of cases, which may be very tedious to debug or reproduce and nearly impossible to fix.


CAP Theorem (http://guide.couchdb.org/draft/consistency.html)

Similarly, be mindful of what guarantees you assume from a system. Most of the time, these guarantees come with a lot of caveats that you may not be aware of. These problems can be very difficult to remediate later.
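As one illustration of building on a simpler assumption, a consumer can assume only that a message may arrive more than once and make its own handling idempotent, instead of depending on an upstream exactly-once delivery guarantee. The article does not prescribe this pattern; the sketch below, including the names `PaymentHandler` and `handle` and the in-memory set of seen ids, is an assumption for illustration.

```python
class PaymentHandler:
    """Hypothetical consumer that tolerates duplicate deliveries."""

    def __init__(self):
        self.balance = 0
        self._seen = set()  # ids of messages already applied

    def handle(self, message_id, amount):
        # Applying the same message twice must have no further effect.
        if message_id in self._seen:
            return False  # duplicate: ignored
        self._seen.add(message_id)
        self.balance += amount
        return True  # applied


handler = PaymentHandler()
first = handler.handle("m1", 100)
second = handler.handle("m1", 100)  # redelivery of the same message
```

In a real system the set of seen ids would itself need durable storage and a retention policy, which is exactly the kind of caveat hidden inside a guarantee.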

Boundaries - Even moderately large systems consist of many logical units, and it makes sense to split a complex system into parts. Doing so helps people understand and manage the system and build it collaboratively. It is also much easier and more feasible to build tighter guarantees within smaller subsystems, which interact with each other under different sets of looser guarantees. Hence it is very important to get these boundaries right. You can call them bounded contexts, objects, etc., but the idea is essentially the same. Boundaries are not simply about services. Boundaries are your semantically coherent units, which are also tightly coupled in terms of architectural guarantees. For example, you may want the account balance and transactions to be strongly consistent, but bookings and the account balance can be eventually consistent, with failure handling in case of inconsistency (such as cancelling the booking). Hence accounts + transactions is one context and bookings is another.
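The accounts and bookings example can be sketched as two contexts with different guarantees. This is a minimal sketch under assumptions: the class and function names are hypothetical, a single in-process method stands in for a database transaction, and `settle` stands in for work that would run later (for instance from a queue).

```python
class InsufficientFunds(Exception):
    pass


class Accounts:
    """Accounts + transactions context: balance and ledger change together."""

    def __init__(self, balance):
        self.balance = balance
        self.transactions = []

    def charge(self, booking_id, amount):
        if amount > self.balance:
            raise InsufficientFunds(booking_id)
        # Both updates happen in one step, standing in for one DB transaction.
        self.balance -= amount
        self.transactions.append((booking_id, -amount))


class Bookings:
    """Bookings context: only eventually consistent with the balance."""

    def __init__(self):
        self.status = {}

    def create(self, booking_id):
        self.status[booking_id] = "pending"

    def confirm(self, booking_id):
        self.status[booking_id] = "confirmed"

    def cancel(self, booking_id):
        self.status[booking_id] = "cancelled"


def settle(bookings, accounts, booking_id, amount):
    """Runs after the booking exists; compensates if the charge fails."""
    try:
        accounts.charge(booking_id, amount)
    except InsufficientFunds:
        bookings.cancel(booking_id)  # failure handling across the boundary
    else:
        bookings.confirm(booking_id)
    return bookings.status[booking_id]


accounts = Accounts(balance=100)
bookings = Bookings()
bookings.create("b1")
bookings.create("b2")
result_b1 = settle(bookings, accounts, "b1", 80)
result_b2 = settle(bookings, accounts, "b2", 50)
```

Between `create` and `settle` a booking can exist without a matching charge, which is the looser guarantee the boundary allows; the balance and the ledger are never out of step with each other.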

Secondly, the most difficult part of extending a system is understanding it. If a developer needs to understand how the entire system works in order to make even a small change, the system will never scale. Your ability to extend a system is therefore not proportional to how many classes and interfaces you have made, but to how little a developer needs to understand about the rest of the system in order to extend or build upon it. It is necessary to get your boundaries and guarantees right because they naturally scope the understanding needed to change any part of the system. If your boundaries are not exclusive, fully separated and self-contained, they will serve little purpose. Don't try to make everything extensible. If your boundaries are right, you can always split parts into smaller parts later on.

Lastly, the various parts of the system need to integrate in order to function as one system. This is called the integration layer, and it includes how the parts communicate with and depend on each other. For example, if one service writes an object which another service reads, the integration layer is implicitly the database. If there is a GraphQL layer for combining data from various independent services, then the integration layer is the GraphQL API layer. If they share a common module, the integration layer is the code. Thus everything that is shared or common across a boundary is an integration. Services may integrate at various points for different needs. Sometimes these integration points are accidental rather than by design. Remember - the lower you push integration down the stack (API being highest, followed by storage and finally code), the higher the cost of making changes and the harder it is to understand the impact of those changes. Textbook domain-driven design is where everything integrates at the API and messaging layer and there is no other integration point.
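A small sketch of integrating at the API level rather than at storage: the consumer depends only on a narrow interface, so the owning service can change its storage layout freely. The names `UserApi`, `UserService`, `get_display_name` and `greeting` are hypothetical, and the sample row is made up for illustration.

```python
from typing import Protocol


class UserApi(Protocol):
    """The only thing shared across the boundary: a narrow contract."""

    def get_display_name(self, user_id: int) -> str: ...


class UserService:
    """Owns its storage; the row layout is private and free to change."""

    def __init__(self):
        self._rows = {1: {"first": "Ada", "last": "Lovelace"}}

    def get_display_name(self, user_id: int) -> str:
        row = self._rows[user_id]
        return f"{row['first']} {row['last']}"


def greeting(users: UserApi, user_id: int) -> str:
    # The consumer never touches _rows, so it integrates at the API only.
    return f"Hello, {users.get_display_name(user_id)}!"


message = greeting(UserService(), 1)
```

Had `greeting` read `_rows` directly, the integration point would have moved down to storage, and any change to the row layout would ripple across the boundary.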

The more exclusively and exhaustively you can attribute the ownership of parts of the system to individuals or teams, the more easily you will be able to maintain and manage it. Avoid building components that will be or will end up being jointly owned by more than one team. A component that is owned by too many is owned by nobody. Clearly define not just the contracts and abstractions but also the vision behind each component or unit. All problems become people problems, in the long run.

Finally

To build something great, a rule book is not enough. There can be too many rules, and there are always more things that could be done better. My philosophy is to proceed with a sense of beauty. Beauty lies in order, harmony, simplicity and attention to detail. Beauty is a pursuit leading to joy and is pursued for its own sake. Great engineers have cultivated this sense of beauty. And great systems are built in this pursuit.