Engineering practices after you have found product market fit

November 20, 2020

As an engineering leader in a young organisation, you come across plenty of advice about best practices and things to implement in your organisation. Much of it is correct and makes perfect sense from one point of view. But as a leader, you need to manage ten different priorities, and it can be hard to figure out their relative importance and how much to invest where in your scheme of things. At least it was for me. As I transitioned from an individual contributor and researcher to a senior engineering leader, I realised there is a lot of nontrivial management required to deliver good software. So I wrote this post to give a macro, top-down perspective of the engineering management problem and my take on its different aspects, as exhaustively as I could. These views are influenced by my experience with a B2B SaaS platform; some things might be more or less important for you based on your business context.

Once you have found product market fit and you are sure that you are there to stay, you should think about building a lasting organisation; hence these practices matter. On the other hand, you cannot implement everything from day one because you have limited resources and a lot to deliver, and premature optimisation is a waste of energy and opportunity. The key challenge is to find the right balance.

A lot of startups I have seen ignore these challenges for as long as they can and pay for it in the long run. No one likes to slow down, especially when you think you could have released faster. However, with time you may find that it becomes increasingly difficult to make even simple changes to your software, productivity declines sharply and predictability takes a big hit. All this breeds frustration in teams and corrodes team culture, and at this stage introducing these processes becomes non-linearly expensive and difficult. Such companies end up losing their market position and the trust of their customers. They seem fatally stuck with problems due to unreadable code, unpredictable deployments, long and inaccurate QA cycles, day-to-day customer issues in their core workflows, hours of debugging and running data corrections, and people leaving to find better learning opportunities elsewhere. So I recommend that you prioritise these early on and strike a balance that works for you.

Delivering good quality software and keeping customers happy requires supporting infrastructure and processes. The infrastructure needed to reliably and productively deploy and run your software is -

  • Unit tests
  • Integration tests
  • CI/CD
  • Environments
  • Documentation
  • Monitoring and Alerting
  • Incident Management
  • Logging and log management
  • User analytics
  • Error monitoring
  • Disaster Recovery
  • Security

These processes are needed to increase productivity and deliver software on schedule -

  • Development techniques - TDD / ATDD / BDD
  • Software development processes - Agile / Waterfall
  • Branch management - GitFlow / GitlabFlow / Trunk based development
  • Project management - Scrum / Kanban
  • Issues and support tickets
  • Communication - Slack / Discord / Mail
  • Code reviews

Other important factors are

  • Code quality
  • High level design

As I said, I will do my best to go over these things and give my take on each of them.

Code quality - The code is the fundamental support of your product and the primary output of your engineering team.

If your codebase is clean, easy to understand, DRY and well structured -

  • It is simpler to maintain and modify

  • It leads to fewer bugs, fewer regressions and fewer surprises in production.

  • It needs less documentation

  • You can move engineers around easily across projects. They do not complain about dealing with someone else's mess. Handovers and cross team movements are easy. Engineers are not forever stuck with the code they wrote. Everyone can take their time off. They can grow and work on newer things.

Hence, productivity is higher. Keeping a codebase clean requires continuous effort in code reviews, regular refactors and training young engineers. Different codebases will be at different stages of maturity and lifecycle. For refactoring and maintenance, prioritise codebases of common libraries, classes, functions and APIs, followed by complex workflows. Free up some bandwidth of your senior engineers for refactors and reviews. It is best if you do this continuously, rather than having to rewrite a whole codebase someday.

High level design - Good design is necessary for scalability, extensibility and reliability. The key problem is it is hard to find design flaws early because usually things break down later with scale and with time. Also, good design is subjective and you might not have enough understanding of the problem yet to arrive at one. Still, here are some pointers -

  • Preserve read and write ownership of objects and database or database tables. Do not access other services' database or tables in your services. Integrate data and services at a higher level rather than write cross table queries. Do not expose APIs which allow other services to read and write into the objects directly without any validation or schema. Rather than CRUD APIs, have APIs for functionalities in your services and keep underlying storage private. For frontend facing APIs, consider GraphQL seriously.

  • Avoid data inconsistencies as much as possible. Either all errors have been handled or workflows have been designed to avoid this. In case of errors either retry or rollback or at least have a dead letter queue where failed transactions are logged. This is a common problem especially in NoSQL where redundancy is a common design pattern. Data inconsistency is a plague and the errors will multiply and cause downstream workflows to break. It has the potential to give you a good steady stream of customer issues everyday.

  • Do not incorrectly assume strong guarantees such as ordered events, sequential execution, atomic operations, etc. When things scale, a small percentage of errors becomes a sizeable absolute number. Design a system that relies least on chance.

  • Be wary before you introduce a new database or a new streaming system or some other cool stuff. You will need technical expertise in it sooner or later, so it should be worth it. Similarly, using too many infrastructure components will be a devops nightmare down the line when they start requiring maintenance. It is easy to spin up a Kubernetes cluster or a new database but much harder to manage one and keep it highly available. Keep things minimalistic and simple. Do not introduce new technologies, frameworks, etc. unless you can really justify them.
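The "retry, roll back, or at least dead-letter" advice above can be sketched in a few lines. This is a minimal illustration, not a production implementation; the `process`, `send_to_dlq` and `backoff` callables are hypothetical stand-ins for your own business logic, durable storage and sleep strategy.

```python
import time

MAX_RETRIES = 3

def handle(message, process, send_to_dlq, backoff=time.sleep):
    """Try `process` up to MAX_RETRIES times; on final failure, dead-letter.

    Failed transactions are never silently dropped: they either succeed
    eventually or land in the dead letter queue for later inspection.
    """
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            process(message)
            return True
        except Exception as exc:
            if attempt == MAX_RETRIES:
                send_to_dlq(message, exc)
                return False
            backoff(2 ** attempt)  # exponential backoff before the next try
    return False
```

The point is the shape, not the numbers: every write path has a defined outcome, so data inconsistency cannot creep in through unhandled errors.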

Unit tests - The benefits of having unit tests are

  • Fewer regressions
  • Faster QA cycles
  • Local development and testing
  • Automated PR approvals in CI/CD pipeline

Additionally and importantly, unit tests act as an implicit specification of what the code does. This helps in understanding PRs better in terms of what has changed, since changes in tests are part of the PR. It helps the developers understand the magnitude and the kind of changes they are making early on which helps to grow the codebase in a better manner.

The two key challenges with writing unit tests are

  • When to write - When code is changing rapidly it might seem pointless to write unit tests. And once you have moved on, it's hard to prioritise and get back to write unit tests.
  • Skillset - Writing testable code and understanding what tests to write is not rocket science, but it is not trivial either.

So where should you prioritise writing unit tests? Most code is layered and things are built on top of one another. Some parts of the code are more critical and are called more often than others. Prioritise these codebases; it matters more that they have their specs coded as tests.

Test definitions can come either from requirements / specifications or they can be written for existing code. In the early stages my guess is you will not be following TDD, etc. hence I assume the latter. Some basic tips which I recommend -

  • Test what the code is meant to do, e.g. if there is no handling for a specific bad input, do not write a test for that input. Otherwise, there is no end. Intuitively this means: derive specifications from the code and convert those into tests.

  • Do not be religious that every function needs to be unit tested or that every function which is called needs to be mocked. Just make sure everything runs locally and external services are mocked. Test logical units together. That may be more than one function.

  • Keep the tests very simple. Sometimes we complicate a single unit test with cases, loops, etc. You can keep the tests DRY without putting everything in a single test.

  • Use them in your CI/CD. Create visibility. Otherwise, they will never be maintained.
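A small sketch of the "test logical units together, mock only external services" tip above. The pricing functions and the exchange-rate lookup are hypothetical examples; the pattern is to inject the external dependency so the logical unit (lookup plus arithmetic) is tested as one piece.

```python
import unittest

def get_exchange_rate(currency):
    """Stand-in for a call to an external service; never hit in tests."""
    raise RuntimeError("would call an external service")

def price_in_currency(price_usd, currency, get_rate=get_exchange_rate):
    """The logical unit under test: rate lookup combined with conversion."""
    rate = get_rate(currency)
    return round(price_usd * rate, 2)

class TestPricing(unittest.TestCase):
    def test_converts_using_mocked_rate(self):
        # Only the external service is replaced; the unit's own logic runs.
        self.assertEqual(
            price_in_currency(2.5, "INR", get_rate=lambda c: 80.0), 200.0
        )
```

Note there is no mock for `round` or any internal helper; everything that can run locally does run.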

Integration tests - The core benefit of having an integration suite is that you will be able to ensure that your infrastructure is functioning properly (assuming that you have the logic tested in your unit tests). You will be able to sanity-check your environment: whether all pieces have been wired correctly, messages are flowing, the network is configured correctly, etc. You will sometimes need to deploy a few or all of your services locally, or create a new development environment for load tests, etc. An integration test suite will make this very fast.

Although it might seem unnecessary, I recommend it because it won't take a lot of time. I suggest keeping it simple: check whether all flows are working. Don't check for various cases, inputs, etc. - that must be handled by the unit tests. Just wiring up a few API calls should not be a big effort and it will give you a good enough integration suite. If your architecture involves queues or streaming systems, it might be somewhat non-trivial - but then you will need it even more.
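"Just wiring up a few API calls" can be as small as the sketch below: one request per flow, checking only that the pieces are reachable, not the business logic. The base URL and endpoint paths are made up for illustration; substitute your own.

```python
import urllib.request

BASE_URL = "http://localhost:8080"  # hypothetical; point at your environment

SMOKE_ENDPOINTS = [
    "/health",
    "/api/users/me",
    "/api/orders?limit=1",
]

def smoke_test(base_url=BASE_URL, endpoints=SMOKE_ENDPOINTS):
    """Hit each flow once; return the list of endpoints that failed."""
    failures = []
    for path in endpoints:
        try:
            with urllib.request.urlopen(base_url + path, timeout=5):
                pass  # any successful response means the flow is wired up
        except OSError as exc:  # URLError and HTTPError both subclass OSError
            failures.append((path, str(exc)))
    return failures  # an empty list means every piece is wired correctly
```

Run it against a freshly created environment before handing it to anyone; an empty result is your wiring sanity check.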

Secondly, the good thing about doing something, however small, rather than nothing is that everything grows over time. We tend to ignore the effects of compounding, but a start made early can go a long way when you compound the small increments over years.

CI/CD - It's better to automate anything you would otherwise do at least thrice a day, and it's easy. Whether you use GitLab, Travis, Cloud Build or CodePipeline does not matter too much. In the initial stages you will not need multi-project pipelines, etc., so anything should be good. Rather than caring much about the CI/CD platform, focus on these things which are core to any good CI/CD process -

  • Make your Dockerfiles (or other artifacts) as slim and quick to build as possible
  • Reuse your artifacts from test to production. If not, then production branches should strictly follow test branches. Restrict PRs coming into production branches to your main test branch.
  • Manage your configurations properly. Do not hard-code any settings or keys in your repository.
  • Have linters and unit test runners in your CI/CD pipeline.

Environments - There is only so long you can continue to deploy directly from local to production. There are various levels at which your code needs to be tested before it goes to production -

  • Code correctness
  • Code formatting and structure
  • Logical correctness
  • Compatibility with other services

Things which can go wrong in a release apart from code are

  • Application and runtime configurations
  • Data / schema / migrations
  • Logging, monitoring, alerting and analytics
  • Changes to deployments, CI/CD, etc.
  • Incorrect code merges

I recommend having at least one staging environment which is identical to your production deployment but scaled down for cost. Test not just code but also infrastructure changes, schema changes, data migrations, etc. in this environment. Use the exact artifacts you tested in this environment in production rather than building again. If you have other development environments, it's enough if the behaviour of the code is similar to production. These environments will be useful to integrate your code more often with other in-development code and to perform developer testing and automated integration testing. For example, you may have a serverless function in production but simply mock it behind a server in a development environment. Or you can run local emulators like AWS Localstack or the Firebase emulator to simulate the same behaviour. The more your infrastructure is coded and the fewer the manual steps, the easier it is to manage multiple environments.
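One concrete piece of the "do not hard-code settings" and multi-environment advice is reading all configuration from the environment, with safe defaults only outside production. A minimal sketch, with illustrative variable names (`APP_ENV`, `DATABASE_URL`, `LOG_LEVEL`):

```python
import os

def load_settings(env=os.environ):
    """Build the app's settings dict entirely from environment variables."""
    app_env = env.get("APP_ENV", "development")
    if app_env == "production" and "DATABASE_URL" not in env:
        # Fail fast: production must never run on a fallback default.
        raise RuntimeError("DATABASE_URL must be set in production")
    return {
        "env": app_env,
        "database_url": env.get("DATABASE_URL", "sqlite:///dev.db"),
        "log_level": env.get(
            "LOG_LEVEL", "DEBUG" if app_env == "development" else "INFO"
        ),
    }
```

With this shape, staging and production differ only in the values injected by your deployment, never in the code or the repository.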

Documentation - There are many things as an engineer you can document such as -

  • APIs
  • Functions
  • Source code files
  • Blocks of code
  • Architecture
  • Services

It's easy and desirable to say "document everything", but it is not practical. I think clean code should take care of a lot of these, and beyond that the opportunity cost is not too high if your code quality is good. What is mostly neglected and seldom documented are the functional and non-functional requirements of the system. These are the hardest to write, require the most clarity, and will be the most missed when you want to take high-level decisions on which product features to build and which problems to fix. You will want to go back in time and understand why you made certain decisions, and that knowledge is remarkably easy to lose over time or when people leave. Definitely document these.

The other things you must document are deployment and migration checklists. These build discipline around production deployments and go a long way in preventing incidents due to human error. Along with this, also document manual QA test cases and checklists; this will help you predict QA timelines and resources and drive QA accountability.

Monitoring and Alerting - The basic principle is to collect metrics from all necessary parts of the software deployed in production and to set alerts on these metrics so that you know if something is wrong before your users tell you. Monitoring falls into basically two categories: infrastructure and application monitoring.

  • Infrastructure - EC2 machines, Kubernetes, Ingresses, Load balancers, Databases, Messaging, Data streaming, etc.
  • Applications - Containers, Serverless applications, Batch jobs, etc.

Both are priced separately. You can use an external service like Datadog or New Relic, or use the one from your cloud provider such as CloudWatch or Stackdriver for metrics. These metrics can be easily integrated with incident management platforms like PagerDuty, Opsgenie or VictorOps, where you can set who is notified when a metric raises an alert. You can also set up open source Prometheus to collect metrics and use Grafana for dashboards, or use other popular stacks like ELK (Elasticsearch, Logstash and Kibana) or TICK (Telegraf, InfluxDB, Chronograf and Kapacitor).
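The metric-plus-alert principle is simple enough to show in a toy sketch. This is purely illustrative; in practice the collection and alerting below are what Prometheus, Datadog or CloudWatch do for you.

```python
import time

class Metric:
    """A named time series of (timestamp, value) points."""

    def __init__(self, name):
        self.name = name
        self.points = []

    def record(self, value):
        self.points.append((time.time(), value))

    def latest(self):
        return self.points[-1][1] if self.points else None

def check_alerts(metrics, rules):
    """rules: {metric_name: (threshold, comparator)} -> alerts now firing.

    A comparator is any callable (value, threshold) -> bool, so the same
    machinery covers "error rate too high" and "free disk too low".
    """
    firing = []
    for name, (threshold, comparator) in rules.items():
        value = metrics[name].latest()
        if value is not None and comparator(value, threshold):
            firing.append((name, value))
    return firing
```

Whatever platform you pick, this is the loop it runs: record, compare against a rule, page someone when the rule fires.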

This is important because -

  • It is the most basic service expectation that your platform is always available, no matter how small a company you are, and getting alerted as soon as anything goes down is the only way to ensure it.

  • You will need metrics to understand and tune how you are using your resources and how much you should scale.

  • You will need the application and host metrics to debug various kinds of runtime issues (especially when you cannot simulate them).

  • To set a basic culture of accountability and excellence. For example, in the absence of any monitoring, developers can take services down at will, write bad quality non-rolling deployments, deploy directly to production and break things, etc. Unless the thought that every millisecond of uptime matters is ingrained in your engineering team, you will not have excellence in processes. As I said, these things grow over time and become exponentially more difficult later on, so it is always good to start early.

Besides, it does not take a lot of effort or money. I suggest going for Datadog or New Relic. Even CloudWatch and Stackdriver will do if your platform is limited in the number of components and especially if you use a lot of serverless offerings. Running your own monitoring infrastructure requires effort and maintenance and it is hard to have it always up and running. Hence I would not recommend this as your primary monitoring solution. If you go with Datadog for example, a lot of metrics for almost all kinds of infrastructure and applications will come out of the box. It will also integrate with PagerDuty or Opsgenie very easily. It will take some effort when deploying new infrastructure pieces or applications and that should be largely sufficient.

Incident Management - Apart from setting up PagerDuty, etc. to get immediate calls on alerts, the only thing you should watch out for here is to debug and find the RCA for every downtime. Many times a simple application restart or a manual database edit will solve the problem, and it is very tempting and habit-forming to simply do that when the bell rings. But this behaviour piles up critical technical debt which will definitely blow up in your face one day. Things always break. Accept that and RCA everything as a matter of habit and process. Better still, document every incident. You can do this easily in PagerDuty, etc., or in JIRA (if you use it).

Logging and log management - There are a number of options for making your logs searchable in order to make your debugging faster -

  • Managed log services such as Splunk, Sumologic, Datadog, etc.
  • Cloud provider's default log management such as CloudWatch, Stackdriver, etc.
  • Managed Elasticsearch such as Amazon Elasticsearch Service (with UltraWarm storage)
  • Self managed ELK stack
  • Search on cloud storage such as Athena over S3, etc.
  • Kafka as rotating log
  • Log rotate, filebeat and grep

The decision of which one to use is mostly about the costs of ingestion and storage. Some platforms offer a lot of intelligence over logs, but at this stage easy log search at low cost should be your primary goal. If nothing else, you must at least have all logs rotated and shipped to a common storage where you can grep them. You can also ship logs from there to Elasticsearch, etc. Make sure you do not lose the last few log lines when an application crashes.

The key is to get logging right and the rest is very easy. Keep your logs minimalistic and clear. Log everything which is necessary and nothing which is not. Log all state changes and write workflows. Log in a consistent format across all your applications. Use log levels correctly and consistently. Having written guidelines is necessary to have such consistency. Have provision for changing log levels at runtime without redeploying or restarting the application. Don't blindly dump all requests and responses (I have seen this so many times). Have systematic error and exception handling in your applications. Throw early, catch late. Have some way to trace calls across applications using request identifiers, etc. Use stdout and stderr. Avoid writing logs to files, especially inside containers.
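Several of the guidelines above (one consistent format, stdout only, correct levels, request identifiers for tracing, levels changeable at runtime) can be sketched with the standard library alone. The JSON shape and the helper names here are illustrative, not a prescribed format.

```python
import json
import logging
import sys

def build_logger(name="app", level="INFO"):
    """One logger, one handler, writing to stdout (never to a file)."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(logging.Formatter("%(message)s"))
    logger.addHandler(handler)
    return logger

def log_event(logger, level, message, request_id, **fields):
    """Emit one consistently shaped line; request_id traces the call chain."""
    payload = {"level": level, "message": message, "request_id": request_id}
    payload.update(fields)
    logger.log(getattr(logging, level), json.dumps(payload))

# Changing verbosity at runtime, without redeploying or restarting:
# logging.getLogger("app").setLevel("DEBUG")
```

Every service logging in this shape means one grep (or one Elasticsearch query) can follow a single `request_id` across your whole platform.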

User analytics - Building measurability in your product is important because it will help you

  • Understand user behaviour and usage of your product
  • Debug user tickets by going through the events when the user faced that issue
  • Get data for business metrics such as daily active users, user retention, time to convert, churn prediction, etc.

Platforms like Mixpanel and Google Analytics help you do this; both are good. There are tools like Segment and Snowplow which help funnel these events into various destinations, but I think they would be unnecessary at this stage.

To do this right you need to focus on these aspects -

  • The events are categorised correctly. You will need to give some thought here. For e.g., should your event be my_button_click, or button_click with a property button=my_button? Be consistent. Have some kind of guideline which you follow everywhere. Your data is only as good as how easily you can analyse it. You want a product analyst to be able to get all insights from the data, rather than have an engineer write a script to extract and manipulate data which then needs clean-up in Excel. Also, the simpler the analysis, the lower the chances of numerical and interpretational errors.

  • All the necessary attributes of events are captured. This again will need some thought. Because it is possible that you might not have all the data available to log the event at the source. It would not be practical to get the necessary data from the server only to log events. We had used Snowplow to enrich the events in the server before sending it forward. But at your stage this might be an overkill. Some platforms like MixPanel allow you to identify entities and provide their static properties. That may be a good approximation for most purposes.

  • All the events are being captured exhaustively. Make sure your events are not blocked by ad blockers, etc. Events lost or missed are lost forever.
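The naming-guideline point above can be enforced at the point of tracking, so inconsistent names never reach your analytics. A sketch, assuming a snake_case guideline with the specifics pushed into properties (`button_click` plus `button=my_button`); the schema is illustrative.

```python
import re
import time

# The guideline encoded as a rule: lowercase words joined by underscores,
# e.g. button_click, page_view. Specific names belong in properties.
EVENT_NAME = re.compile(r"^[a-z]+(_[a-z]+)*$")

def build_event(name, user_id, **properties):
    """Validate and shape an analytics event before it is sent anywhere."""
    if not EVENT_NAME.match(name):
        raise ValueError(f"event name {name!r} violates the naming guideline")
    return {
        "event": name,
        "user_id": user_id,
        "timestamp": time.time(),
        "properties": properties,
    }
```

For example, `build_event("button_click", user_id=7, button="signup")` passes, while `build_event("MyButtonClick", user_id=7)` is rejected at the source instead of polluting years of data.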

Error monitoring - Not all end users will report errors when they encounter one. They will either find a workaround or stop using your product. If they do report one, maybe a handful will sit with you over a call to reproduce it. None of this goes well with a customer-centric culture. Hence, you must capture all errors with stacktraces occurring at the end user, such as in the browser or mobile app. You may look at Sentry, which is a standard tool for this. It integrates well with other monitoring platforms and has good support for source maps in the case of minified files. Other platforms like Stackdriver, Datadog and New Relic also offer alternatives. The tools are quite easy to use, so you should not think of skipping this. If you follow "throw early, catch late" in your code, then it will be just a matter of adding a few lines in your central error-handling routine. I recommend connecting Sentry with PagerDuty, etc. so that you are notified when anything breaks for the end user. Your policy should be zero client-side errors or warnings. You might have to filter out connection-related and other recoverable errors.
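A central error-handling routine with a recoverable-error filter can be sketched with the standard library. The `report` callable is a hypothetical stand-in; in practice it would be a couple of lines calling the Sentry SDK or a similar service.

```python
import sys
import traceback

def make_excepthook(report, ignore=(ConnectionError,)):
    """Wrap sys.excepthook so every uncaught exception reaches `report`.

    Exceptions in `ignore` (connection blips and other recoverable noise)
    are still printed but not reported, keeping the zero-error policy clean.
    """
    def excepthook(exc_type, exc, tb):
        if not issubclass(exc_type, ignore):
            report("".join(traceback.format_exception(exc_type, exc, tb)))
        sys.__excepthook__(exc_type, exc, tb)  # keep the default behaviour
    return excepthook

# Wire it up once at startup, e.g.:
# sys.excepthook = make_excepthook(report=send_to_error_monitoring)
```

One hook at the top level means "throw early, catch late" costs each developer nothing: errors thrown anywhere surface in one place.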

Disaster Recovery - There are two basic aspects to disaster recovery -

  • When an availability zone goes down does your platform keep running
  • In case of a complete disaster such as infrastructure failure (for eg. once our Kubernetes cluster crashed) or ransomware attack, etc. how quickly and to what extent are you able to recover? These are technically called Recovery Time Objective (RTO) and Recovery Point Objective (RPO).

Achieving point 1 is relatively easy. Most cloud providers give options for distributing your infrastructure across multiple availability zones quite easily; you just need to set up your infrastructure correctly once. The second aspect is more difficult. The key idea is that your company should survive a black swan event which might happen once in the many years of its existence. It can be a disaster or a sabotage. RPO can be achieved by regular database backups. These backups can be stored in another account and encrypted for additional protection. The challenge for a distributed system is that restoring individual databases will result in a certain percentage of errors due to data inconsistency. Solving this would be technically too complex and I recommend not going there. Just make sure databases are backed up at the same cadence and around similar timestamps. We had an in-house solution for real-time backup by writing change-data-capture events to Kafka, but for many years we managed with snapshot backups, and so can you.

Achieving RTO requires you to restore (a) the infrastructure and (b) the data. Restoring infrastructure is easiest if your infrastructure is code, i.e. its creation and configuration are done entirely via scripts. These can be written using CloudFormation, Terraform or python scripts over your cloud provider's APIs. Practically, even if you have these, some steps will be manual. You must document all the steps you need to execute manually in your cloud console; the more manual steps you have, the more necessary this is. Without this, you are going to be totally lost. Secondly, restoring data should be easy if your databases are managed; otherwise it will require some small scripts to repopulate data. This is not something to lose sleep over, but you should conduct a mock restoration exercise once so that you know you are backing up everything. After things blow up, you cannot go back in time and fetch what you missed.
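Part of "knowing you are backing up everything" can be automated: given the latest backup timestamp per datastore, flag anything stale and anything drifting too far from the others (backups should share a cadence and land around similar timestamps). A sketch, with illustrative thresholds:

```python
import time

def audit_backups(latest_backup_ts, now=None, max_age_s=86400, max_skew_s=3600):
    """latest_backup_ts: {datastore_name: unix_ts of newest backup}.

    Returns a list of human-readable problems; empty means all datastores
    were backed up recently and at roughly the same time.
    """
    now = time.time() if now is None else now
    problems = []
    for name, ts in latest_backup_ts.items():
        if now - ts > max_age_s:
            problems.append(f"{name}: last backup older than {max_age_s}s")
    newest = max(latest_backup_ts.values())
    for name, ts in latest_backup_ts.items():
        if newest - ts > max_skew_s:
            problems.append(f"{name}: drifts {int(newest - ts)}s behind others")
    return problems
```

Run something like this daily and alert on a non-empty result; it turns the backup policy from a wiki page into a checked invariant.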

Security - Ideally, you need to protect your infrastructure and data against

  • Bots
  • Hackers
  • Internal agents, employees, etc.

How far you go depends on your level of paranoia. A reasonable middle path which I can recommend is

  • Most bot or ransomware attacks are due to negligence. These are relatively easy to protect against with some conscious effort and awareness.

  • Hacking a reasonably secure system is not easy. Isolate your systems reasonably.

  • Protecting against internal sabotage can be extremely difficult. Use disaster recovery approach here to recover post such incident. Leave audit trails. Protect yourself legally.

The basic elements of security which I recommend you must follow -

  • Isolation of your instances, containers, databases, etc. from the public internet.

  • Not having access keys checked into your source code. This is very risky: someone can gain access to your account or a subset of it.

  • Correct implementation of authentication and authorisation within your services and APIs. An end user with an authentication token should not be able to access data outside its scope.

  • Isolate your production environment from your other development environments. It can prevent an accident due to running a wrong command on the production database.

  • Use of VPN to access your infrastructure (it has become very easy with most cloud providers to configure one).

  • Use of most restrictively scoped IAM roles when granting access to a service or other system. Do not use user account credentials or roles in code. Use service specific roles. Be very specific about tokens used in client side code.

  • Authenticate all access to infrastructure and databases using Active Directory, etc. Retain auditable access and command logs. It might not really help to pin someone down but it does make one conscious before acting adventurous.

If you want to go a step further, you can implement access control for internal employees to access and store customer data. It might be necessary later when you go for SOC2, etc. compliances.

Development techniques - At this stage TDD / ATDD / BDD etc. would be overkill. I only recommend writing good functional and non-functional specs before starting to code. However, behave is a good way to write tests; you may take a cue from it to make your tests readable.

Software development processes and Project Management - The key idea is to bring predictability of effort and timelines to software development. The two popular ideologies you will hear about are Agile and Waterfall. Agile is based on continuous delivery of software, while the Waterfall model is based on planned longer-term releases. There is also something called "nothing" - working long hours and releasing when it's ready (which is probably what you are doing right now). There is also a lot of hubris around Agile which might not lead you anywhere practically in terms of your business objectives.

There are two basic elements to any such process - planning and project / task management. Planning can be vague, and any random goal setting can get passed off as a plan. Task management involves logging what has been done, what is currently being done and what is to be done. The first two are relatively simple: you can do that in your weekly scrums, move task items on JIRA and mistake that for Agile. The key is in knowing what is to be done, which ties back to the planning part. Few engineers are good at this and fewer enjoy it. Estimation, especially, is like some black occult art of software development.

I recommend -

  • Focus on having a good quality backlog of tasks. The tasks should carry a definition of done i.e. what it means to complete the task. You can do this on Trello or JIRA or on Github itself. This makes a plan tangible. You probably won't get this right on the first day, or even the first month. But gradually it will improve, like everything else. How you pick these tasks up - daily, weekly, scrum, etc. can be flexible and I think not so important.

  • Once tasks have a clear definition of done, attach estimates to them. Remember, it's hard to get estimates right, so don't beat yourself up over them or spend too much effort on getting them. Nevertheless, they are important for a sense of deadline and accountability. You can choose whether to estimate in hours, days or even weeks - or simply small, medium and large. It might be very tempting to skip all this and jump directly to code. But without resource estimation beforehand, you will not be able to decide how much effort to invest in something - or how much not to invest. Ultimately productivity is about achieving goals with limited resources, and the factor which contributes by far the most (by a factor of 100) is understanding how much time to spend where. Time is usually wasted not in the minutes but in the months. Also, if you want to work efficiently and sustainably on the long list of things in this post, you should have good quality backlogs for everything - features, technical debt, devops, testing, etc. This will go a very long way in setting a sustainable cadence of work and team culture.

Branch management - The key objective is to have faster and more predictable code merges. This becomes a real challenge when you have multiple feature branches in development simultaneously, each with multiple versions of the code. I highly recommend going through the trunk based development philosophy. Whatever process you follow, the key requirement is feature flags: the ability to turn off an in-development feature and still move that code to production. This facilitates easy and more frequent code merges so that the code does not diverge too much across branches. The optimal version of this is trunk based development, in which there is only one main long-lived branch. You must also remove feature flags once the feature is released, to avoid a feature-flag explosion in the future.
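A feature flag can start as something this small. The sketch below is a minimal illustration (an environment-variable backend and a made-up pricing example); real systems usually add per-user targeting and a management UI, but the merge-strategy benefit is the same: unfinished code merges to the main branch and stays dark until the flag is turned on.

```python
import os

def flag_enabled(name, env=os.environ):
    """Simplest possible backend: FEATURE_<NAME>=on in the environment."""
    return env.get(f"FEATURE_{name.upper()}", "off") == "on"

def checkout(cart, env=os.environ):
    if flag_enabled("new_pricing", env):
        return new_pricing_total(cart)  # in-development path, off by default
    return sum(item["price"] for item in cart)

def new_pricing_total(cart):
    # Safe to merge to the main branch: unreachable while the flag is off.
    raise NotImplementedError("new pricing still in development")
```

Once `new_pricing` ships, both the flag check and the old path get deleted, which is the flag-cleanup discipline mentioned above.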

Issues and support tickets - With any running software you will need to deal with customer issues on a day-to-day basis. It makes sense to optimise this process so that less energy is spent managing it. Some things you can do are -

  • Do not have customer facing teams or support teams post issues on your dev slack channel or mailing groups. This should be reserved for emergencies. It is highly disruptive. Instead, have them file issues in JIRA. Respond to them at a cadence. Not everything is urgent. If everything is urgent, then nothing remains urgent.

  • Categorise issues according to severity levels and have a priority assigned to them. This should be based on the customer impact - the number of users affected and the criticality of the feature affected. You should have SLAs for every priority and severity level. Track the currently open issues on a dashboard. Track the metrics on issue resolution time by priority and severity and see how many SLAs are being met.

  • Tag issues with the customer, product area, etc. to understand issue trends in the long run

  • Tag issues with root causes - regression, scale, unanticipated user behaviour, infrastructure failure, etc.

  • Do a monthly deep dive on the number of issues by product area, etc. It will give you an early idea when the platform starts deteriorating.
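The severity-and-SLA bookkeeping above fits in a few lines once issues carry a severity and an opened-at time. The severity names and SLA windows below are illustrative; pick numbers that match the customer-impact tiers you actually define.

```python
# Hours allowed to resolve an issue at each severity level (illustrative).
SLA_HOURS = {"sev1": 4, "sev2": 24, "sev3": 72}

def sla_breaches(open_issues, now_h):
    """open_issues: [(issue_id, severity, opened_at_h)] -> ids past their SLA.

    Times are in hours on any common clock; only differences matter.
    """
    return [
        issue_id
        for issue_id, severity, opened_at_h in open_issues
        if now_h - opened_at_h > SLA_HOURS[severity]
    ]
```

Feed this from your JIRA export and put the result on the dashboard; the monthly deep dive then starts from a list of concrete misses rather than anecdotes.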

The benefits are -

  • Managing issues becomes part of the daily process. You do not need to follow up individually on the status of every issue and customer.

  • Engineers can prioritise their work better. Otherwise issues either become too disruptive or end up being ignored.

  • Engineers can coordinate on issues better. First level debugging can be done by QA themselves, and team members can take turns on the initial debugging of issues.

  • It drives a culture of customer centricity in your team. How professionally and swiftly you respond to issues determines how much trust and goodwill you earn. All software has issues; how you respond makes the largest difference. The customers you win over by your response to their problems will become your champions.

  • Once you set this process up, it will keep running on its own. It is a one-time effort.

Code reviews - These are important for two reasons

  • Fixing bad code structure as early as possible, before other things are built on top of it and it becomes massive technical debt
  • To drive a culture of mentorship and growth between senior and junior engineers

To be effective

  • Review code as frequently as possible. Do not make it a step right before code goes to production; big pull requests are hardly ever reviewed. Have a weekly cadence instead of reviewing at the end of development. The key is smaller, more frequent pull requests.

  • Focus on code structure rather than on checking the logic or other such details. Code reviews are not meant to prevent bugs; that is what unit tests and QA are for. This way code reviews stay enjoyable for everyone.

The main challenge in the early stages is that there is no one to review. Either senior engineers don't have enough bandwidth, there is no one senior enough to do effective reviews, or you have only one engineer for a certain area of expertise. Also, you don't feel the pinch yet because everything works, but engineering problems usually take time to surface. My suggestion is to fill at least one senior position in your team. Since you have product market fit, you should be able to raise funds, and the cost of engineering personnel is relatively small compared to the cost of customer acquisition; your unit economics should have a comfortable margin around engineering cost, especially in the long run. So if cost is an issue, solve it with your CEO. If you still cannot afford or fully utilise very senior talent, have friendly advisors on board, or get expert help on hourly contracts if required. A few hours a month, or a go-to person who can answer your team's questions at the right time, can make a lot of difference. As a CTO your primary job is to help your team scale beyond their capabilities. Getting a good senior engineer on board early will pay great dividends in the long run. Leverage them to raise the bar for other engineers: it will help you avoid a lot of technical debt later and set a culture of excellence in your team early on.

Finally

This is a long list. How do you get things done? Here are some general tips -

  • Initiate. A lot of times things do not work out the first time. Initiate again; that is your job as a leader. Understand what didn't work last time, fix it and start again. I believe none of the above things are optional in the long run, though your opinion may differ on the details of implementation. The later you start, the more difficult it will be.

  • Delegate. Invest in people. Look beyond getting the current work done or the current problem solved. In my experience, it takes about two to three months for any responsibility to be handed over meaningfully. If you are the only superman today, fighting all fires and telling everyone what to do, it won't be sustainable. Start by handing over small and less important things, even if it means spending more time initially. Like everything else, this will improve with time. If you cannot trust your team, then something more fundamental is wrong and you should look at that (either with yourself or with your team).

  • Measure. People (usually) do as you inspect, not as you instruct. Making speeches or writing emails won't translate into behaviour unless you monitor accordingly. There are two ways to monitor something - (a) by getting into the details of what is happening and (b) by measuring outputs or tangible indicators. Imagine yourself led by a manager who asks for details of random things on random days and drops loose instructions which are then followed up at random intervals. Get into details when fixing or debugging things or mentoring people, but not as your primary way of keeping track of things. Have metrics which you review at a particular cadence (maybe another post). This will alleviate a lot of anxiety for yourself and your team and give structure to things. You will be able to set targets, track improvements over longer periods of time, and look back and feel proud of yourself and your team.

  • Focus. Be disciplined. Do everything (or at least most things) at a cadence. Be predictable. Set tangible and clear targets. Communicate goals and results clearly. Have predefined processes or guidelines which are seldom broken but frequently updated with equal discipline and consensus. The fewer decisions your team needs to make while executing, the more creativity they can invest in engineering and innovation where it matters (read about decision fatigue). Some constraints are better than no constraints when you need to optimise.

You will be surprised.