Debugging 101: Locating the Problem
I was recently observing a couple of programmers attempting to fix a bug and watched (with considerable tension) their painstaking stabs in the dark as they tweaked and changed different lines of code.
I instantly knew what I would do instead: an approach much closer to a binary search (see the example below) that rapidly solves almost every bug I come across.
It consists of 3 steps:
- Replicate the problem
- Follow the request sequence
- Adjust code as appropriate
Please note that this is language-agnostic: from JavaScript to Go, the concept is the same. Experience and familiarity primarily help with step 3 and, more relevant to this article, with understanding what the request sequence is.
For the sake of this article, I’m going to show how this applies to a user who is reporting an error that the “login doesn’t work”. Let’s go through the steps.
Replicate the problem
Jane says that the “login doesn’t work”. This isn’t very helpful (and hopefully we’re directing people to give better bug reports).
It is nearly impossible to solve any problem or bug if you’re not able to replicate it.
We try to log in and we succeed. We try to log in with a new account… and succeed. This is not very helpful. We need to be able to replicate the problem consistently, or at least understand how to replicate it (if it’s dependent on a timer and we can only trigger it every X hours, that can be enough of an understanding to solve it).
Let’s say we go back and get more details from Jane: she tells us that she was trying to use the Google Authentication and that it happened after she logged out of her Gmail and logged back in.
We try this with our own account, and voila — we get a “Not authorized to login” error.
We can replicate the problem, so on to the next step.
Follow the request sequence
Let me first define what I mean when I say “request sequence”. I mean everything that happens between the last action you took to reproduce the error up until the point of the error. This can include backend, frontend, third-party services, API calls, JSON interpreters, DevOps, etc.
The best example I’ve ever seen was in response to the question:
“What happens when you type google.com into your browser and press enter?”
This approach can have a nearly unlimited depth, but we never need to dive further than is necessary.
Let’s look at the Login problem. We can generalize the events.
- Click “Login with Google”
- Redirect to Google Auth page
- Type in email and password into Google’s Login screen and click “Login”
- Redirect back to our website
- Get logged in
Or, in our case:
- Receive “Not authorized to login” error.
Now we start our “binary search”. Here’s what I mean by binary search (as I am using it a little atypically): if we’re looking through a span of numbers from 1 to 71 and we want to find the number 7, we can start in the middle, simply ask “is it lower or higher?”, and repeat on the half that remains.
In our case, we want the same approach but the question is: “is the problem earlier or later in the request chain?”
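As a rough illustration of the "earlier or later?" question, here is a minimal Python sketch. It assumes each step in the chain has some check that tells us whether everything is still fine at that point, and that once something is broken every later check fails too. The function name `first_broken_step` and the hard-coded check results are hypothetical and only stand in for real inspection (looking at a URL, a log line, a database row).

```python
from typing import Callable, Sequence


def first_broken_step(checks: Sequence[Callable[[], bool]]) -> int:
    """Return the index of the earliest step whose check fails.

    Assumes that once a step is broken every later check fails too,
    and that the final check fails (the bug has been replicated).
    """
    lo, hi = 0, len(checks) - 1  # the first failure lies somewhere in [lo, hi]
    while lo < hi:
        mid = (lo + hi) // 2
        if checks[mid]():
            lo = mid + 1  # still fine here: the problem is later
        else:
            hi = mid  # already broken here: the problem is here or earlier
    return lo


# Hypothetical stand-ins for the five login steps; things break at step 4.
steps = [
    "click Login with Google",
    "redirect to Google Auth page",
    "submit credentials to Google",
    "redirect back to our website",
    "get logged in",
]
healthy = [True, True, True, False, False]
checks = [lambda ok=ok: ok for ok in healthy]

broken = first_broken_step(checks)
print(f"First broken step: {broken + 1} ({steps[broken]})")
```

In practice the checks are done by hand, and some steps cannot be checked at all, so the "middle" is whichever checkable step best splits what remains.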
We have 5 steps, with the last step being where the problem shows up. We need to find where the error starts. In this case, we have full control only over steps 1, 2, and 5. We have partial control over step 4 (we tell Google where to redirect back to), and no control over step 3.
Since we can’t check step 3, we should start with either step 2 or step 4. Because step 4 is closer to the problem, I will start there.
Now, we need to do some sort of check to see if everything is working at this point (“is the problem earlier or later?”). The very first part of step 4 is to see if we land on the redirect link.
We look at the error page we’re on, and yes, we’re on the redirect link. This indicates that steps 1–3 are working and that something is breaking between steps 4 and 5.
This means breaking it down further. Step 4 consists of:
a) Receiving information from Google
b) Checking to see if we have a User in our system with a matching Google token
c) Logging that user in or rejecting them with an error message from Google
d) Redirecting to the dashboard
Binary search again: we should check after b). Do we have a user in our system with a matching Google token?
No, we do not.
However, we know that we should have a user in our system that matches. Our system is, presumably, decently stable and people have been logging in just fine.
Great, now we know the problem lies in either 4a or 4b.
Check if 4a has the proper data — it does.
Our problem lies in 4b.
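The checks on 4a and 4b can be as simple as a log line at each boundary. The sketch below is a hypothetical Python version of the callback handler, not the actual code from this system: `handle_google_callback`, `USERS_BY_TOKEN`, and the payload shape are all assumptions made for illustration.

```python
import logging

logging.basicConfig(level=logging.DEBUG)
log = logging.getLogger("login")

# Hypothetical in-memory user store, keyed by the Google token saved earlier.
USERS_BY_TOKEN = {"token-old": {"email": "jane@example.com"}}


def handle_google_callback(payload: dict) -> str:
    # a) Receive information from Google
    log.debug("4a payload received: %s", payload)

    # b) Look for a user with a matching Google token
    user = USERS_BY_TOKEN.get(payload.get("token"))
    log.debug("4b user lookup result: %s", user)

    # c) Log the user in, or reject them
    if user is None:
        log.debug("4c rejecting login")
        return "Not authorized to login"

    # d) Redirect to the dashboard
    log.debug("4d redirecting %s to dashboard", user["email"])
    return "redirect:/dashboard"


print(handle_google_callback({"email": "jane@example.com", "token": "token-new"}))
```

Reading the log from top to bottom answers the question at each boundary: the 4a line shows sensible data, the 4b line shows no user, so the search narrows to the lookup.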
And repeat…
Rather than bore you with stepping through the whole process, I’ll summarize. Splitting 4b into further steps in the request sequence showed that we did have a user under that email address, but with a different, non-matching token.
Our findings are as follows: when you sign out of Gmail and back in, the next time you log in Google provides the system with a different token than the one we have.
Adjust code as appropriate
This usually draws on your knowledge and experience as a programmer, as well as your ability to Google and search Stack Overflow. It’s not the objective of this article to go into it deeply. In our faux-debugging, we find that we need a process to renew tokens for users, so we implement that, and voila! We’re done.
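For the curious, one possible shape of such a renewal is sketched below in Python. It is only an illustration under stated assumptions: the `User` class, the `USERS` list, and `find_user` are hypothetical, and the sketch assumes the data from Google has already been verified as genuine before we trust the email address in it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class User:
    email: str
    google_token: str


# Hypothetical user table; the stored token went stale after the Gmail re-login.
USERS = [User(email="jane@example.com", google_token="token-old")]


def find_user(email: str, token: str) -> Optional[User]:
    """Match on token first; if that fails, match on email and renew the token.

    Assumes the payload was already verified as coming from Google.
    """
    for user in USERS:
        if user.google_token == token:
            return user
    for user in USERS:
        if user.email == email:
            user.google_token = token  # renew the stored token
            return user
    return None


user = find_user("jane@example.com", "token-new")
print(user)
```

Whether falling back to the email address is acceptable depends on the system's security requirements; the point here is only that the fix lands exactly where the search ended, in 4b.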
Knowledge of the request chain
Knowledge of the request chain is often undervalued in companies. Who needs to know all of that detail?
At a previous company I asked my CTO if we could get some documentation for what our request lifecycle* looked like. I was given a very high-level overview and was somewhat brushed off. This made it incredibly hard to debug, and we often reverted to the shoot-in-the-dark approach.
*Request lifecycle vs request chain: a lifecycle is the request chain specific to a domain (HTTP Request Lifecycle, Laravel Request Lifecycle), as opposed to the full chain of requests between your action and another event, which may span domains.
Summary
Solving bugs consists of three steps:
- Replicate the problem
- Follow the request sequence
- Adjust code as appropriate
The request sequence consists of every event that happens between your request and the problem that you’ve replicated. To find a bug, all you have to do is follow the chain (I use a binary search approach).
Knowledge and experience of the language and framework are helpful in quickly locating which part of the request chain the bug may be in, but the approach is the same.