Data Scientists, You Can’t Trust Your Memory

Here’s what you can do about it

Do you remember the details of the data science project you worked on a few years ago? I certainly don’t. Thankfully I wrote some pointers for my CV back then; otherwise, I’d have no clue.

How about the project you’re working on right now? The code and the visualizations are all vividly painted in your memory. But won’t you forget them all too soon as well?

Science says that every time we try to recall a memory, it’s less accurate than before. It’s natural to be confident in ourselves, but the truth is we can’t remember everything. And even if you could, as a data scientist, there’s no need.

The best data scientists in the industry are excellent problem solvers. They think outside the box and come up with innovative solutions. They don’t spend their mental capacity on remembering the details of older work. They have systems in place that take care of themselves. That’s the whole point.

You don’t need to trust your memory for every task you do. Instead, you could follow proven techniques and best practices to keep most of the details out of your memory. Read on to learn more about these techniques and how to incorporate them into your professional life.

Create a Design Document for the Non-techies

I learnt this concept during my time at Google. The idea is to create a design document for every project right from initiation and to update it as the project progresses. We keep the document simple so that non-technical stakeholders can understand the approach, progress and results.

At my current work, we follow a similar approach and call it an “Analytics Document”. It’s a set of PowerPoint slides detailing the approach, keeping a non-technical audience in mind.

You might wonder why this is important. As data scientists, it’s easy to get caught up in technical details, but not every stakeholder understands algorithms. Don’t get me wrong; I’m not saying technical information is unimportant. It gets documented elsewhere (more on this later).

Here are some basic steps to create this:

  • Create a simple PowerPoint presentation for the project you’re working on
  • Outline the overall approach in a single slide. You should simplify the approach until you can get it into a single slide.
  • Explain every step of the approach in more detail in the next slides. Try your best to use simple terms so that anyone can understand the content.
  • Next, present the results derived from the work you’ve done.
  • Risks and Mitigations: The business audience wants to know everything about the anticipated risks and how you can mitigate them.

Even after the project is completed, when someone wants to know more about it (which will 100% happen), I share this document for them to go through. It saves a lot of my time and keeps me from getting stressed about forgetting the older projects' details.

Set up Version Control From Day 1

Setting up version control is one of the basic software engineering best practices every data scientist should follow. Version control software helps you build reproducible work while collaborating with your team.

Besides, something I’d like to highlight here is the ability to track the project's direction of development. Let me explain: when you start developing the solution, you encounter hurdles. At each point, you pause and problem-solve and make multiple decisions.

These decisions shape the code you write, and it’s essential to keep track of them. They’re crystal clear in your memory now, and you’re sure you’ll remember every incremental development as the project progresses.

Let’s suppose you decide to take a new opportunity in a year. Would you be in a position to explain the incremental developments in detail? Conversely, would you be comfortable if your colleague didn’t transfer the knowledge adequately before leaving the team?

Wouldn’t you want a system that takes care of itself without needing your memory?

There are so many scenarios you can’t foresee; trust me on this. You’ll be better off incorporating version control into your workflow.

How to do this:

  • Head over to your organization’s GitHub. If your organization uses an alternative such as Bitbucket, GitLab or Azure DevOps, stick to that; the steps are much the same anyway.
  • Create a repository. Depending on the nature of the project, you might have to keep it private.
  • Add a file, which initially is a brief overview of the project. You can reuse some of the contents from the design document.
  • Clone this repository to your local machine and start coding.
  • Incrementally push your code to the repository as you progress.
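
The last step above can be sketched in Python as a small helper that prepares one incremental commit. This is only an illustration, not a prescribed workflow: the helper names (`commit_commands`, `run_commands`), the file names and the commit message are hypothetical, and the sketch assumes the repository has already been cloned and has a remote named `origin`. The point is that the commit message records the decision behind the change, so the history remembers it for you.

```python
import subprocess
from pathlib import Path


def commit_commands(files, decision_note):
    """Build the git commands for one incremental commit.

    decision_note becomes the commit message; use it to record why the
    change was made, so the history explains the project's direction.
    """
    if not decision_note.strip():
        raise ValueError("Describe the decision behind this change.")
    return [
        ["git", "add", *files],
        ["git", "commit", "-m", decision_note],
        ["git", "push", "origin", "HEAD"],
    ]


def run_commands(commands, repo_dir, dry_run=True):
    """Print each command; execute it inside repo_dir only if dry_run is False."""
    for command in commands:
        print(" ".join(command))
        if not dry_run:
            subprocess.run(command, cwd=Path(repo_dir), check=True)


commands = commit_commands(
    files=["README.md", "src/features.py"],  # hypothetical file names
    decision_note="Drop account_age feature: it leaks the target label",
)
# Dry run by default: the commands are printed, nothing is executed.
run_commands(commands, repo_dir=".", dry_run=True)
```

In day-to-day work you would simply type these commands in a terminal; wrapping them in a function is just a way to show the habit of never committing without stating the reason.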

If you’ve never worked with version control before, it will take a while to become comfortable with it, but it’s pretty much standard across the industry, so it’s worth starting now.

Write Beautifully Commented, Self-Explanatory Code

The biggest mistake aspiring and junior data scientists make is to write code that only solves the problem. Working code is all that matters, no? Wrong.

Besides being correct, code must be 1) self-explanatory and 2) sufficiently commented. Code is not the place to show off your programming skill with complex constructs or cryptic variable names.

Otherwise, chances are you’ll never figure out why you wrote the code the way you did in the first place. I admit this has happened to my colleagues and me, and troubleshooting and bug-fixing such code is a nightmare. (Of course, I was an inexperienced junior data scientist once; it’s part of the process.)

You’re going to master writing better code only with experience, but there are certain things you can do consciously to get better at it.

How to do this:

Head over to your favourite open-source library. Check out its code and see how a comment at the top of each script details its purpose. See how the code is modularized into functions.

Pick any function and see how the function name tells you what it does. See if you can find any variables with no context, such as x, y, a, b or c.
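
To make the contrast concrete, here is a minimal Python sketch of the same logic written twice. Everything in it is made up for illustration (the function names, the order data and the threshold); it is not taken from any particular library. Both functions return the same result, but only the second one tells a future reader what it is for.

```python
# Hard to revisit: the names say nothing about intent.
def f(a, b):
    return [x for x in a if x[1] > b]


# Easier to revisit: the name, docstring and variables explain the purpose.
def filter_high_value_orders(orders, minimum_amount):
    """Return the (order_id, amount) pairs whose amount exceeds minimum_amount."""
    return [
        (order_id, amount)
        for order_id, amount in orders
        if amount > minimum_amount
    ]


orders = [("A-1", 120.0), ("A-2", 35.5), ("A-3", 410.0)]  # made-up sample data
high_value_orders = filter_high_value_orders(orders, minimum_amount=100.0)
```

A year from now, a call to `filter_high_value_orders` needs no explanation, while a call to `f` sends you back to read its body.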

Thousands of developers work collaboratively on these open-source libraries. They can’t teach every single new developer what the code does. The code needs to be simple and self-explanatory so that new developers can start contributing to it.

If industry experts with decades of experience don’t trust their memory, who are we to trust ours?

Wrap Up with Detailed Technical Documentation

So you have created a design document, set up version control and written commented code; what’s next?

I’ll be honest with you; writing technical documentation sucks. I hate detailing how the code works, how to execute the pipelines, how to check model stability, and so on. I’m not going to lie; I wouldn’t have done it if it wasn’t mandatory to get the sign-off.

The first time I created one, I spent about 3 hours a day for a week. That’s a lot. Fast forward a year: the models had become stale, and we needed to retrain them. And I couldn’t recall what we had done a year earlier.

Imagine going through 25+ scripts to grasp what we did a year before.

It would have been a nightmare. Thanks to the mandatory requirement, I had written the technical documentation. I picked it up and went through everything in one go to regain the much-needed familiarity.

The brutal truth is that you’ll only understand the importance years later.

How to create this:

  • Write for a technical data scientist who has never worked on this project.
  • Structure the document according to the typical data science lifecycle: business problem, data sources, data engineering pipelines, initial data exploration, data pre-processing, model development, model tuning, deployment and model stability dashboard.
  • You don’t have to include code in this document; referring to the scripts is sufficient.
  • It’ll be hard to get this right the first time, so iterate on drafts and get them reviewed by a senior data scientist at work. With their experience, they’ll know what important information is missing.
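
If it helps to start from something rather than a blank page, the structure above can be turned into a skeleton with a few lines of Python. This is a rough sketch under stated assumptions: the function `build_doc_skeleton`, the project name and the script name are hypothetical, and the section list is simply the lifecycle described above. It refers to scripts by name instead of copying code into the document.

```python
# Section names follow the lifecycle stages listed above.
LIFECYCLE_SECTIONS = [
    "Business problem",
    "Data sources",
    "Data engineering pipelines",
    "Initial data exploration",
    "Data pre-processing",
    "Model development",
    "Model tuning",
    "Deployment",
    "Model stability dashboard",
]


def build_doc_skeleton(project_name, script_refs=None):
    """Return a plain-text outline with one numbered heading per lifecycle stage.

    script_refs optionally maps a section name to the scripts it should point
    to, so the document refers to code instead of copying it.
    """
    script_refs = script_refs or {}
    lines = [f"Technical documentation: {project_name}", ""]
    for number, section in enumerate(LIFECYCLE_SECTIONS, start=1):
        lines.append(f"{number}. {section}")
        for script in script_refs.get(section, []):
            lines.append(f"   See script: {script}")
        lines.append("")
    return "\n".join(lines)


skeleton = build_doc_skeleton(
    "Churn model",  # hypothetical project name
    script_refs={"Model development": ["train_model.py"]},  # hypothetical script
)
print(skeleton)
```

Paste the printed outline into whatever tool your team uses and fill in each section; the skeleton only guarantees that no stage of the lifecycle gets skipped.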

No matter how tempting it is to rush and finish this document, please don’t. Remember, you’re writing for your future self.

Checklists — Your Future Self Will Thank You

At this point, you’re probably thinking: this guy must have an awful memory. Maybe, maybe not. I don’t care. I have thanked my past self countless times, so I’ll continue to preach this technique.

Here’s my secret: I create checklists for most of my routine tasks. Why? More often than not, we need to do these tasks again. Instead of using my brain to work out the steps every time, I create a checklist once and refer to it on every later occasion.

How to create this:

Books have been written on creating effective checklists, but we don’t have to complicate something this simple.

  • Open a Google Doc and title it with the task at hand. (Other options are Notion templates or simply a handy pen and paper.)
  • Start a numbered list and write down the steps you follow to finish the task. Write only the essentials and keep each step short.
  • For separate tasks, create separate documents. You may organize them into folders as you like.

You can now leave them in Google Docs and find them with a simple search. Easy, right? I use this technique in my personal life too, and it’s rewarding. Honestly, try checklists, and your future self will thank you.

Finally, Why Do It All Alone?

The last thing I do is convince everyone of the value of not trusting their memory. Guess what I’m up to with you now, buddy?

Think about this: data science work is heavily team-oriented. We need data engineers, machine learning engineers, data analysts and data scientists to collaborate. If the rest of the team isn’t documenting the work properly, it will affect the team sooner or later.

The key is to practice what you preach first. Which means:

  • You’re going to start creating a design document for a non-technical audience.
  • You’re going to set up version control for your next project from day 1.
  • You’re going to write better commented, self-explanatory code.
  • You’re going to put in a genuine effort to write the most detailed technical document out there for your current project.
  • You’re going to create checklists for most routine tasks your team has to go through.

If you start doing all the above, you’ve won the battle.

Let them see your work.

Impress them by showing them your process.

They’ll fall in love.

They’ll want to be like you.

Finally, let them in with a little bit of guidance.

Nothing’s better than your whole team following the best practices.
