We are living in times when you hear about DevOps everywhere, and about how the walls between the two worlds of Development and Operations should be torn down. But all these talks are given from the point of view of the developer and the business, never from the point of view of the administrator.
We come from a time when operations teams were split into several escalation levels, each one less populated and more skilled than the one before. So we have a first level of people with basic knowledge who work 24x7, covering any kind of incident that could happen. When something happens, they try to solve it with the knowledge they have (usually more documents than knowledge…), and when something does not work as expected they forward it to a second level that knows the platform better, probably an on-call team, and so on through as many levels as we want. How does all of this fit with DevOps, CI & CD and so on? OK, pretty easy.
Level 1 no longer exists today: monitoring tools, CI & CD and so on make this first level unnecessary. If you can write a document with the steps to follow when something goes wrong, you are writing code, just inside a document, so nothing stops you from delivering an automated tool that does the same thing. So, in plain English, yesterday's first-level operators are now scripts. But we still need our operations teams, our 24x7 service and so on, for sure, because from time to time (more often than we'd like) something out of the ordinary happens and it needs to be managed.
Automation is never going to replace L2 & L3, so you are going to need skilled people to handle incidents. Maybe you can have a smaller team as you automate more processes, but you can never get rid of the knowledge part; that's not the point. We could discuss whether this team should be the development team or a mixed team from both worlds, and either could be right. Any approach is valid here. So, we've implemented all our fancy new CI & CD processes and monitoring tools, and the platform seems to run without any help or support until something really strange happens. What do we do with the former first-level people, then? Of course, teach them the skills to be valuable as L2 & L3, so they have to become better operators / system administrators / whatever word you like the most. And how can they do that?
As I said before, we are moving away from a world where operations teams work from written procedures and are not allowed to look beyond the approved protocol, but that is no longer the way L2 & L3 work. When you are facing an incident, the procedure is pretty much the same as hunting a bug or, if we step outside the IT world, like solving a crime. What are the similarities between solving a crime and managing a platform? OK, let's enumerate them:
- What? What happened to my system? You start with the consequences of the issue: probably an error log trace, probably another system calling you because your system is unavailable. OK, here you have it: this is your dead body.
- When? Once you know something went wrong, you start looking for the root cause. You search the log traces to find the first one that triggered the problem, discarding the traces that are only consequences of that first one, and you try to work out when everything started to fail. To do that, you need to look for evidence about the crash and so on. So now you are investigating: searching for evidence and talking to witnesses (yes, your log traces are the most trustworthy witnesses you are going to find; they rarely lie. It's as if they were on the stand in front of a judge).
- And now? How & why? This is the difficult part. How and why are the next steps, just as in a bug hunt, but with one main difference: when the dev team is hunting a bug, they can correlate the evidence gathered in step two with the source code they built to learn how and why everything went wrong. In your case, as a system administrator, you are probably facing a proprietary system, so you don't have access to the code, or you wouldn't know how to tackle it even if it were open source, and you probably don't even have access to the dev team's source code. So, how do you solve this?
- OK, most of you are probably thinking something like: know the product and your platform. Be a certified operator of the product you manage, know its whole manual, and so on. That can be helpful, because it means you know better how things work at a high level… but let's be clear: have you ever found, in a certification course, exam, documentation or whatever, information so low-level that it helped with the specific case you were facing? If the answer is yes, maybe you were not facing a difficult bug but a basic configuration error.
- So, what can we do? As the title says: learn to code. But you are probably thinking: how is knowing how to code related to hunting a bug when I can't even take a look at the code? And learn to code in what language? In the ones used by the components of my platform? In Java? In Go? In Node.js? In C++? In x86? All of them? OK, you're right, maybe the answer is not simply "learn to code", but that's the idea: learn to code, learn to design, learn to architect solutions. Do you want to know why? OK, let's see. In my whole career I've worked with a lot of different products, different approaches, different paradigms, different base languages, different everything, but all of them share one thing, which all systems share nowadays: they are built by people.
Yes, every piece of software, every server, every program, every web page, every everything is built by a person, like you and like me.
If you think that every product and piece of software is made by geniuses, you are wrong. Are you aware of how much software is out there? Do you think there are that many geniuses in the world? Of course, they are skilled people and some of them are truly brilliant, and that's why they usually follow common sense when they architect, design and build their solutions.
And that's the point we can use to crack our case and solve our murder, because with the evidence we have and what we know about building solutions, we can ask ourselves: OK, how would I have built this if I were the one in charge of this piece of software? And you are going to see that you are right almost every time…
But I'm missing another important point that we left unanswered before: learn to code in which language? In the one your platform is based on. If you are managing an OSGi-based platform, learn a lot of Java development and OSGi development and architecture, and you are going to find that the issues are always the same kind of thing: a dependency between two OSGi modules, an Import-Package statement that should be there, the order in which the packages are loaded, or an Export-Package statement that should be there…
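To make the OSGi case a bit more concrete, here is a minimal Python sketch of the kind of check that knowing the platform lets you write: it compares the Import-Package headers of a set of bundle manifests against the Export-Package headers of the same set and reports imports that nobody exports. The bundle names and manifest contents are hypothetical, and the sketch makes simplifying assumptions: it ignores version ranges, optional imports and packages provided by the framework or the JRE, and it only reads the first package of each clause.

```python
import re


def parse_manifest(text):
    """Parse MANIFEST.MF text into a dict of header name -> value.

    Lines starting with a single space continue the previous header.
    """
    headers = {}
    current = None
    for line in text.splitlines():
        if line.startswith(" ") and current:
            headers[current] += line[1:]
        elif ":" in line:
            name, _, value = line.partition(":")
            current = name.strip()
            headers[current] = value.strip()
    return headers


def package_names(header_value):
    """Return the bare package names of an Import-/Export-Package value."""
    # Split on commas outside double quotes: version ranges such as
    # "[1.0,2.0)" contain commas that do not separate clauses.
    clauses = re.split(r',(?=(?:[^"]*"[^"]*")*[^"]*$)', header_value)
    # Simplification: keep only the first package of each clause and
    # drop attributes and directives (everything after the first ';').
    return {c.split(";")[0].strip() for c in clauses if c.strip()}


def unresolved_imports(manifests):
    """Map bundle name -> sorted imported packages that no bundle exports.

    `manifests` maps a (hypothetical) bundle name to its manifest text.
    """
    parsed = {name: parse_manifest(text) for name, text in manifests.items()}
    exported = set()
    for headers in parsed.values():
        exported |= package_names(headers.get("Export-Package", ""))
    missing = {}
    for name, headers in parsed.items():
        gap = package_names(headers.get("Import-Package", "")) - exported
        if gap:
            missing[name] = sorted(gap)
    return missing


# Hypothetical manifests, for illustration only.
MANIFESTS = {
    "billing": (
        "Bundle-SymbolicName: billing\n"
        'Import-Package: com.example.api;version="[1.0,2.0)",\n'
        " com.example.audit\n"
    ),
    "api": (
        "Bundle-SymbolicName: api\n"
        'Export-Package: com.example.api;version="1.2.0"\n'
    ),
}

if __name__ == "__main__":
    print(unresolved_imports(MANIFESTS))
```

With these sample manifests, the `billing` bundle imports `com.example.audit` and no bundle exports it, which is exactly the "statement that should be there" kind of issue. A real resolver does much more than this, so treat the result as a hint about where to look, not as a verdict.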
The same applies if you are running a .NET desktop application: learn a lot of .NET development and you'll be skilled enough not to need a document telling you what to do, because you know how it should work, and that is going to lead you to why this is happening.
With all those questions answered, only one thing is left: you need to put a plan in motion to mitigate or solve the issue, so that it never happens again. And with all of that, we file our arrest warrant for the incident.
Finally you reach the court part, where you present your evidence and your theory about how and why this happened (the motive :P), and you should be able to convince the jury (the customer) beyond a reasonable doubt. You finish with the sentence you asked for the bug/crash/incident, which is the mitigation plan, and your platform is a better world with one less incident walking free.
What we describe here is how to do a post-mortem analysis, and for most of you this is probably daily stuff. But time and again, when we work with a customer's operations team, we notice that they don't follow this approach, so they get stuck because they don't have a document that tells them what to do, step by step, in these strange situations.
So, I'd like to finish with an anthem to summarize all of this. When you are facing an incident: "Keep calm, apply common sense and start reading the log traces!!"
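As a starting point for that log reading, the "When?" step described earlier (find the first trace that triggered the problem, not its consequences) can be sketched in Python. The log line format, the severity names and the sample lines are assumptions made for illustration; adapt the pattern to whatever your platform actually writes.

```python
import re
from datetime import datetime

# Assumed line format: "YYYY-MM-DD HH:MM:SS LEVEL message".
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (\w+) (.*)$")
SEVERE = {"ERROR", "FATAL"}  # assumed severity names


def parse(lines):
    """Yield (timestamp, level, message) for every line matching the format.

    Lines that do not match (stack trace continuations, for example)
    are skipped.
    """
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
            yield stamp, match.group(2), match.group(3)


def first_failure(lines, context=2):
    """Return the earliest severe event and the events just before it.

    `lines` may mix traces from several files; they are ordered by
    timestamp before searching.
    """
    events = sorted(parse(lines), key=lambda event: event[0])
    for index, event in enumerate(events):
        if event[1] in SEVERE:
            return event, events[max(0, index - context):index]
    return None, []


# Hypothetical traces, deliberately out of order.
SAMPLE = [
    "2024-01-01 10:00:05 ERROR Request failed: upstream unavailable",
    "2024-01-01 10:00:01 INFO Config reloaded",
    "2024-01-01 10:00:03 ERROR Connection pool exhausted",
    "2024-01-01 10:00:02 WARN Pool usage at 95%",
]

if __name__ == "__main__":
    culprit, before = first_failure(SAMPLE)
    print("first failure:", culprit)
    for event in before:
        print("  before it:", event)
```

In this sample, the failed request at 10:00:05 is only a consequence; the first failure is the exhausted connection pool at 10:00:03, and the warning just before it is the witness worth questioning next.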