June 08, 2014
Do's and Don'ts for Startup Operations
I've been at several mid-sized companies, post-startup acquisitions, or startups trying to grow and scale. All of them have had issues that could be prevented by thinking big while starting small. If you think about what your system will look like at hundreds of servers, you quickly realize that setting things up right to start with will save you effort if and when you grow and scale.
Here are the first 20 of these recommendations that I have come up with:
- Set up all production servers on UTC time. This will save you grief in your logging around daylight saving time and multiple colos. You won't ever wonder which logs coincide with which time in different locations.
- Have a sane and enforced naming scheme. Do this for hostnames, data centers, and clusters. Do not name hosts foo1, foo2, ... foo10, foo11. Try foo001-sfo, foo002-sfo, ..., foo010-sfo, foo011-sfo. Then you can add foo001-aws or foo001-nyc later, and you won't have sorting issues or have to rename hundreds of servers. Remember, if you want to grow, you need to think of management by many.
- Separate prod and non-prod. Keep your development and testing out of prod. Then no one will run an unauthorized test on your early adopters (the people whose trust and recommendations you need). The idea is that production should be stable and solid, and only well-tested code should go there.
- Plan for redundancy and graceful failover from the start. Make developers build it into the software. Your customers will have fewer things to complain about down the road. From their viewpoint, any failover or server glitch should be invisible.
- Cluster for scalability and expansion. Even if you only have 5 servers now, plan for 500. If your venture goes well, you won't have to redesign your server farm every 6 months. This also means thinking about load and database sharing. Plan for load balancing and firewalling from the start. After all, your goal isn't hundreds of users, it's millions. You need to be able to recover quickly from a slashdotting.
- Start with reproducible system builds. Use kickstart and configuration management, put system configs into source control, and document system setup per class of server. No special snowflakes. Someone should be able to rebuild your entire stack from documentation and your source code repo.
- Set up your files and filesystems sanely. Don't pile all your stuff in one directory and expect it to scale. Partition your hosts, whether you use LVM or hard partitions, even if you use RAID. Don't let your software write lots of little files. Plan for growth in your data and applications.
- Set up with an eye to having multiple DCs or POPs worldwide. Plan to be able to expand your data centers or cloud installations. If you do cloud, use a strategy that allows for multiple hosting firms, multiple stacks, and rapid failover between regions. That way you aren't dependent on some other company's business plan for your success.
- No SPOFs. No single points of failure, whether machines or, more importantly, people. If you have people whose death would severely damage your company, they are your Achilles heel. Document or die, and protect your data. One person told me about a startup he and a partner were building that died when his partner was killed in an accident: the partner had kept all of the technical "secret sauce" in his head.
- Don't put your crown jewels in the public cloud. Keep your corporate LDAP, source code repository, personnel data, company wikis, bug tracking, etc. on site, under your close control. Don't be an easy mark for hacking and theft of your intellectual property. Sure, the cloud is convenient and cheap, but when you are a small company, you don't have a lot of time and cash for lawsuits when your stuff gets stolen.
- Monetize early, even if you plan to be acquired. A company with a realistic, solid path to profit has much more negotiating leverage and value. Note: The ad-supported model is not as lucrative as you think. Plus, advertisers sometimes demand control over content.
- Don't allow 'quick' spaghetti coding! Don't let overworked developers throw half-assed bits of code at the customer to see what sticks. No cowboys. Don't "build fast, fix later". Later never comes, mistakes can cost you customer confidence early when it counts most, and the mess is still embedded in your code years later. Don't do "agile" fallacies. Write it right to start with, and grow with a solid foundation.
- Test, test, test!! Anything that has customer data, test the hell out of it for security vulnerabilities and unwanted behavior. Being agile and doing rapid development is not an excuse for lack of quality and refusal to test. Invest in a test/QA group early, and don't make it subordinate to development.
- Build your application stacks wide, not deep. Don't build them one to one, one to many, but many to many. Be able to separate and recombine application layers as needed. The flexibility means you can operate small to large. Keep like with like, but be able to split out intensive pieces onto specialized hardware.
- Plan for big data handling while you're still small. Don't envision hundreds or thousands of users, but millions. Plan how you will store, back up, protect and analyze that data. It should not be an afterthought, but an asset.
- If you use open source, plan to give back improvements and bug fixes. It helps attract talent and pays back the community that made your company possible. Don't be like some big companies that run on 90% open source, but forbid their employees to contribute bug fixes and infrastructure tools.
- Use best practices in both development and operations. Write sensible logs to sensible locations: system level logs to /var/log, application logs to /application/directory/log. Don't write logs to /tmp. Make your logs rotatable by logrotate, so there is no need for hand pruning or roll-your-own scripts. Don't re-implement standard Linux utilities with "custom" local tools that do half as much and are buggy as hell. "Not invented here" is the wrong way to do operations. Your secret sauce should be your application, not your proprietary packaging and deployment system. Unless your business is writing tools and utilities, don't re-implement what's already out there.
- Code smart software. Build monitoring, alerting and performance metrics into your application stack from the start. Expose performance times either by log, SNMP, or scoreboard. Don't make it so you have to run in debug mode to get metrics. Your software should tell you how it's doing.
- Plan to be able to deploy and roll back rapidly to clusters at scale. If it takes you 4 hours at 20 servers, how long will it take at 2000 servers? Modularize, separate and package. You should be able to deploy to just one part, adjust config, and go, then roll it back just as fast. It might take longer to design and code, but preventing outages is actually important when you're small. When you get larger, this plus good automated testing makes CI/CD possible. Plan for it from the start.
- Make developers carry pagers. They built it, they need to maintain and fix it. They should be called for on-call escalation. Operations is the HW and OS layer. Developers should start out shadowing Ops for a week before they write a line of code.
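To make the naming scheme recommendation concrete, here is a minimal Python sketch of why zero-padded names sort cleanly and unpadded ones don't. The `hostname` helper and its three-digit default width are illustrative assumptions, not part of any tool mentioned above.

```python
def hostname(role: str, index: int, site: str, width: int = 3) -> str:
    """Build a name like foo001-sfo: role, zero-padded index, site suffix."""
    if index < 0 or index >= 10 ** width:
        raise ValueError(f"index {index} does not fit in {width} digits")
    return f"{role}{index:0{width}d}-{site}"


# Hypothetical host numbers, chosen to cross the one-digit/two-digit boundary.
numbers = (1, 2, 10, 11)
unpadded = [f"foo{i}" for i in numbers]
padded = [hostname("foo", i, "sfo") for i in numbers]

# Plain string sorting puts foo10 and foo11 ahead of foo2.
print(sorted(unpadded))  # ['foo1', 'foo10', 'foo11', 'foo2']

# Zero-padded names sort in numeric order with no special handling.
print(sorted(padded))    # ['foo001-sfo', 'foo002-sfo', 'foo010-sfo', 'foo011-sfo']
```

Picking the width up front is the real decision: three digits covers 999 hosts per role and site before a rename would be needed.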
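The UTC recommendation is about server clocks, but the same idea can be applied inside the application so that log timestamps agree even on a host that was set up wrong. This is only a sketch using Python's standard `logging` module; the logger name `myapp` and the timestamp format are assumptions made for the example.

```python
import logging
import time


def utc_formatter() -> logging.Formatter:
    """Return a formatter whose timestamps are rendered in UTC."""
    formatter = logging.Formatter(
        fmt="%(asctime)sZ %(levelname)s %(name)s %(message)s",
        datefmt="%Y-%m-%dT%H:%M:%S",
    )
    # logging converts timestamps with time.localtime by default;
    # switching to time.gmtime makes every host write the same clock.
    formatter.converter = time.gmtime
    return formatter


def make_logger(name: str) -> logging.Logger:
    """Attach a stderr handler with UTC timestamps to the named logger."""
    handler = logging.StreamHandler()
    handler.setFormatter(utc_formatter())
    logger = logging.getLogger(name)
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger


log = make_logger("myapp")  # hypothetical application name
log.info("service started")
```

With this in place, lines from hosts in different colos can be merged and sorted by timestamp without any offset arithmetic.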
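The deployment question above (4 hours at 20 servers, so how long at 2000?) is worth working through with numbers. The sketch below assumes the 4 hours comes from deploying one server at a time, which the original figure does not actually say, and it uses a hypothetical batch size of 100 to show what parallel rollout buys you.

```python
import math


def serial_minutes(servers: int, minutes_per_server: int) -> int:
    """Total time when servers are deployed strictly one after another."""
    return servers * minutes_per_server


def batched_minutes(servers: int, minutes_per_server: int, batch_size: int) -> int:
    """Total time when batch_size servers are deployed in parallel per round."""
    batches = math.ceil(servers / batch_size)
    return batches * minutes_per_server


# Assumption: 4 hours for 20 servers, done serially, is 12 minutes each.
PER_SERVER = (4 * 60) // 20

print(serial_minutes(2000, PER_SERVER) / 60)        # 400.0 hours
print(batched_minutes(2000, PER_SERVER, 100) / 60)  # 4.0 hours
```

Under these assumptions a serial process that is tolerable at 20 servers takes more than two weeks at 2000, while batching 100 at a time brings it back to the original 4 hours. Rollback has the same arithmetic, which is why it needs to be designed in alongside the deploy.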
There are more discrete examples of many of these. I'll be writing those up in subsequent entries, along with some contributions from other seasoned veterans who've seen the mess that lack of planning and understanding this stuff can make.
Posted by ljl at 06:38 PM