Hands-on Tutorial

This 21 Step Guide Will Help Implement Your Machine Learning Project Idea

You can’t go wrong with this.

Arunn Thevapalan

12 Apr 2021 • 10 min read

I’ve summarized my experience working on 25+ projects over a span of 4 years into this single guide.

We all get cool ideas for projects. Some of them are not feasible, but some of them are. Yet, we don’t implement them. We want to do it, but we don’t know where to start.

What’s worse, months later we see someone on LinkedIn sharing a cool demo of a simple idea turned into a project.

Would it be helpful if someone helped you turn your idea into a reality? If you had a clear step-by-step framework to execute, would it kickstart your project? Wouldn’t you want to be the one who shares those cool demos on LinkedIn?

If you nodded yes to any of the above, this guide is for you. It’s something I wish existed when I was working on my first, second, and third projects.

Now it’s yours.

From Idea to Implementation: The 21 Proven Steps

When I initially created this guide, there were 20 steps. Not so cool of a number. Then I realized the 21st is the most important of them all. And I’m glad I talk about #13 since not many follow that. Not even me until last year.

Let’s get rolling?

1. Start with a business problem

Many people, including me in the past, are guilty of skipping this step. People like to start with a cool AI problem and build something around it. The problem? It fails, and nobody uses it. Andrew Ng recommends starting with the business problem.

Ask yourself: which pain point of the user is your solution trying to address?

If you’re starting, and your goal is to build the first project to your portfolio, it’s okay to skip this and go to the next step. In all other cases, it’s crucial to identify the business problem.

2. Acquire the relevant data

Once we understand the problem, we need relevant data. Most data sources are openly available on sites like Kaggle and the UCI Machine Learning Repository, so it’s worth scanning them. Most computer vision problems have data here. For NLP, here’s a good place to search.

If your problem is unique, chances are you’ll have to acquire your own data. Some common options are scraping the internet or manually labeling the data you collect.

When you go out of the way, the project often stands out, and you’ll benefit in the long run.

3. Set up a Git repository

It’s best to adopt software engineering best practices from the start of the project. Creating a Git repository is a good first step. GitHub is a free and popular hosting platform for Git repositories, and getting started is fairly simple.

Use this guide from GitHub to create your first repository. It’s okay to keep it private till you finish the project. The idea is that as you progress through the project, you keep adding code, data, and documents to the repository. That way, you can always refer back to the direction and flow of the project.

4. Prepare your data

Your data may have missing values, messy columns, outliers, incorrect entries, and so on. We need to look at the data strictly from a quality standpoint and work towards cleaning and preparing it.

A popular saying in the machine learning world is “Garbage in, garbage out!” It means that when you use messy, low-quality data, the results will not be accurate. I constantly search Google and Stack Overflow while doing this because we can’t always remember how to tackle every data quality issue.
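To make the cleaning step concrete, here is a minimal sketch using only the Python standard library (in practice you would likely reach for pandas). The records, the column names, and the `max_age` threshold are all hypothetical, chosen just to show two common fixes: dropping implausible values and filling missing ones with the column median.

```python
from statistics import median

# Hypothetical raw records; None marks a missing value.
raw_rows = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 48000},
    {"age": 29, "income": None},
    {"age": 41, "income": 61000},
    {"age": 410, "income": 58000},  # implausible age, probably a typo
]


def clean_rows(rows, max_age=120):
    """Drop rows with an implausible age, then fill gaps with the column median."""
    # Incorrect data: keep a row only if its age is missing or plausible.
    plausible = [r for r in rows if r["age"] is None or 0 <= r["age"] <= max_age]
    cleaned = [dict(r) for r in plausible]  # copy so the input stays untouched
    for column in ("age", "income"):
        known = [r[column] for r in cleaned if r[column] is not None]
        fill_value = median(known)
        for r in cleaned:
            if r[column] is None:
                r[column] = fill_value
    return cleaned


cleaned_rows = clean_rows(raw_rows)
print(cleaned_rows)
```

Whether to drop, cap, or correct an outlier depends on the problem, so treat the rule above as one option rather than the right answer.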

5. Exploratory Data Analysis (EDA)

During EDA, we use several techniques to better understand the dataset. This phase is crucial because this is where you truly understand the data at hand. We uncover hidden patterns from the data, which can help us better solve the business problem.

A hack I often use is to take advantage of tools such as Pandas Profiling which helps us understand the data in less time. Here’s a guide I had written on how to do this for your next project.

The deeper you dig into the data, the more you uncover, and you’ll often need to go back to step 4 and clean the data further. But hey, that’s a sign you’re becoming a better data analyst.
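If you want a feel for what an EDA pass computes before reaching for a profiling tool, here is a small standard-library sketch. The dataset (hours studied and a pass/fail label) is made up for illustration; the point is the three questions it asks: what does a column look like, how balanced are the labels, and does a feature differ between classes?

```python
from collections import Counter
from statistics import mean, stdev

# Hypothetical dataset of (hours_studied, passed) pairs.
records = [(1.0, "no"), (2.0, "no"), (2.5, "no"), (4.0, "yes"),
           (5.0, "yes"), (1.5, "no"), (6.0, "yes"), (3.0, "no")]


def summarize(values):
    """Basic numeric summary: count, mean, standard deviation, min, max."""
    return {
        "count": len(values),
        "mean": round(mean(values), 3),
        "std": round(stdev(values), 3),
        "min": min(values),
        "max": max(values),
    }


hours = [h for h, _ in records]
label_counts = Counter(label for _, label in records)
mean_hours_by_label = {
    label: round(mean(h for h, l in records if l == label), 3)
    for label in label_counts
}

print(summarize(hours))     # shape of the numeric column
print(label_counts)         # class balance
print(mean_hours_by_label)  # does the feature separate the classes?
```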

6. Frame the machine learning problem

Since we have already identified the business problem, here we reframe it into a machine learning problem. Broadly speaking, most business problems fall into one of these 3 types of machine learning problems.

Supervised Learning: The data used will also have labels. A machine learning algorithm tries to learn what patterns in the data lead to the labels.
Unsupervised Learning: When you have data but no labels. The algorithm finds similar patterns in data and groups them together.
Transfer Learning: When you take the information an existing machine learning model has learned and adjust it to your own problem. This is handy when training a model from scratch is expensive.

If you’re not familiar with any of the above or their further breakdown (such as regression, classification, and clustering), please look them up on Google or YouTube and gain an in-depth understanding. Understanding what sort of problem you’re trying to solve is crucial to proceed further.

7. Choose the right evaluation metrics

Accuracy is not the only metric you should care about. In fact, sometimes, accuracy can be misleading, especially when the data is imbalanced.

Some other metrics commonly used in machine learning problems are precision, recall, F1 score, receiver operating characteristic, area under curve, mean absolute error, root mean square error, and more.

It was overwhelming when I first heard of these metrics. If you’ve got the data from Kaggle, they normally list the relevant metric in the competition itself. Here are guides to help you choose metrics for regression and classification problems.
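Here is a short sketch of why accuracy can mislead on imbalanced data. The labels are invented: 95 negatives and 5 positives, scored against a "model" that always predicts negative. The metric formulas are the standard ones, written by hand so the example needs no extra libraries.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    # Guard against division by zero when nothing is predicted positive.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical imbalanced labels: 95 negatives, 5 positives.
y_true = [0] * 95 + [1] * 5
always_negative = [0] * 100  # a "model" that never predicts the positive class

metrics = classification_metrics(y_true, always_negative)
print(metrics)
```

The accuracy comes out at 0.95 even though the model never finds a single positive case, which is exactly what recall and F1 (both 0.0 here) expose.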

8. Split the data accordingly

We never work with the complete dataset at once. We break it into train, validation, and test sets. Sometimes the dataset isn’t large enough, and in such scenarios, we skip the separate validation set and use k-fold cross-validation instead.

If you’re a beginner, take some time to go through this detailed guide and understand the difference between the datasets and how to allocate data points efficiently.
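As a rough illustration of both approaches, the sketch below splits a list into train, validation, and test sets, and also generates k-fold index pairs. The 70/15/15 split, the seed, and the function names are my own choices for the example, not a rule; libraries such as scikit-learn provide ready-made versions of both.

```python
import random


def train_val_test_split(items, val_fraction=0.15, test_fraction=0.15, seed=42):
    """Shuffle once, then cut into train, validation, and test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = round(len(shuffled) * test_fraction)
    n_val = round(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


def k_fold_indices(n_items, k=5):
    """Yield (train_indices, val_indices); every index is validated exactly once.

    Indices are taken in order, so shuffle the data first if its order matters.
    """
    indices = list(range(n_items))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


data = list(range(100))
train, val, test = train_val_test_split(data)
print(len(train), len(val), len(test))

for train_idx, val_idx in k_fold_indices(10, k=3):
    print(val_idx)
```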

9. Build your first model (baseline)

You have the data cleaned and split, ready to be modeled. Don’t think too much about performance; just get your first model built. Depending on your problem type, you may use basic algorithms such as linear regression, naive Bayes classification, or k-means clustering.

Use the cleaned data as-is, feed it to the algorithm, and evaluate the metrics you chose. The idea here is to build your baseline model and use it as a benchmark to improve on gradually through iteration.
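For a sense of how small a baseline can be, here is a one-feature linear regression fitted with the closed-form least-squares formula and scored with mean absolute error. The numbers are invented (think house size versus price), and a real project would normally use a library implementation; this is only a sketch of the "fit, predict, measure" loop.

```python
from statistics import mean


def fit_simple_linear_regression(xs, ys):
    """Ordinary least squares for one feature; returns (slope, intercept)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


# Hypothetical data, already split into train and test.
train_x, train_y = [1.0, 2.0, 3.0, 4.0], [1.5, 2.5, 3.5, 4.5]
test_x, test_y = [5.0, 6.0], [5.4, 6.6]

slope, intercept = fit_simple_linear_regression(train_x, train_y)
predictions = [slope * x + intercept for x in test_x]
baseline_mae = mean_absolute_error(test_y, predictions)

print(f"slope={slope:.2f} intercept={intercept:.2f} MAE={baseline_mae:.2f}")
```

Whatever number the baseline produces becomes the benchmark every later experiment has to beat.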

10. Set up an experimentation framework

You have chosen the metrics. You’ve even built your first model. What you need now is a framework that allows you to iterate rapidly.

You arrive at the best solution through iteration only, so setting up an experimentation framework helps you iterate faster. Until last year, even I skipped this step, and it took me far longer to finish projects than it should have.

A good tool to start with is MLflow, an open-source library for managing the end-to-end machine learning workflow. Using it, you can rapidly set up experiments and log all the results so you can refer to them whenever needed. Here’s a detailed guide on how to use MLflow like a pro.

11. Do feature engineering (from your EDA)

Now that you’ve got your setup ready, it’s time to improve your baseline models. To do so, you need more meaningful features. In my experience, feature engineering makes or breaks the models.

Feature engineering is nothing but transforming the existing variables in the data into more meaningful ones. Go back to the EDA you’ve done, see which variables could be most relevant to the problem, and use them to create more features. The Feature Engineering module from the Python Data Science Handbook is a good guide to learn more.
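A tiny sketch of what "transforming existing variables" can look like: starting from a hypothetical order record with a date, a total, and an item count, we derive a weekend flag, the month, and a price per item. The field names are assumptions made for the example.

```python
from datetime import date

# Hypothetical raw order records.
orders = [
    {"order_date": date(2021, 4, 10), "total": 120.0, "items": 4},
    {"order_date": date(2021, 4, 12), "total": 45.0, "items": 1},
]


def add_features(order):
    """Return a copy of the record with derived features added."""
    weekday = order["order_date"].weekday()  # Monday is 0, Sunday is 6
    return {
        **order,
        "is_weekend": weekday >= 5,
        "month": order["order_date"].month,
        "price_per_item": order["total"] / order["items"],
    }


featured = [add_features(order) for order in orders]
for row in featured:
    print(row)
```

None of the derived values are new information, but a model may find "weekend" or "price per item" far easier to use than a raw date and two separate numbers.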

12. Improve baseline by adding new features

Once you’ve developed new features in the step above, it’s time to add them to the baseline model. Normally the performance of the new model should be better than the baseline. If it isn’t, we haven’t added good features and should go back to feature engineering to create better ones.

13. Log all the experimentation results

Every different model you train from the baseline model is considered an experiment in MLflow, and it’s vital we log them. I mention MLflow since that’s what I use, but if you’re using another tool, that’s okay too.

The idea is you need to log all the experiments, including all details such as which features are being used, which model is being trained, and what the evaluation metrics are. Doing this during development helps us keep track of the project's direction and eventually choose the best performing model.
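If you are not ready to set up a tracking tool yet, even a plain log file covers the essentials. The sketch below is a stand-in, not MLflow: it appends each experiment as one line of JSON and later picks the best run. The experiment names, feature lists, and metric values are placeholders, not real results.

```python
import json
import os
import tempfile


def log_experiment(path, name, features, model, metrics):
    """Append one experiment record to the log as a line of JSON."""
    record = {"name": name, "features": features,
              "model": model, "metrics": metrics}
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def load_experiments(path):
    """Read every logged experiment back into a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# A temporary folder keeps the example self-contained.
log_path = os.path.join(tempfile.mkdtemp(), "experiments.jsonl")

# Placeholder runs; the MAE values are made up for illustration.
log_experiment(log_path, "baseline", ["size"], "linear_regression", {"mae": 0.42})
log_experiment(log_path, "with_age", ["size", "age"], "linear_regression", {"mae": 0.35})

runs = load_experiments(log_path)
best = min(runs, key=lambda r: r["metrics"]["mae"])  # lower MAE is better
print(best["name"], best["features"])
```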

14. Explore different machine learning algorithms

Many data scientists feel that one particular algorithm always performs best. This is not true; the preference is a personal bias, not a property of the algorithm. There’s no single best algorithm; it always depends on the data and the problem we are trying to solve.

The only way to know which is best is to try several of them. If you’ve created the experimentation framework, exploring different models shouldn’t take a lot of time. You can feed the same features to multiple algorithms to see which performs best. PyCaret has a compare_models() function, which does this in a few lines of code.
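The idea behind comparing algorithms is just a loop: same data, same metric, different models. The sketch below is hand-rolled (it is not how PyCaret does it) and compares two deliberately simple classifiers on an invented one-feature dataset.

```python
from collections import Counter


def majority_class(train, x):
    """Ignore the input and predict the most common training label."""
    return Counter(label for _, label in train).most_common(1)[0][0]


def nearest_neighbor(train, x):
    """Predict the label of the closest training point (1-NN)."""
    return min(train, key=lambda pair: abs(pair[0] - x))[1]


def accuracy(predict, train, test):
    """Fraction of test points the given predictor labels correctly."""
    hits = sum(predict(train, x) == label for x, label in test)
    return hits / len(test)


# Hypothetical dataset of (value, label) pairs.
train = [(1.0, "a"), (1.5, "a"), (2.0, "a"), (6.0, "b"), (7.0, "b")]
test = [(1.2, "a"), (6.5, "b"), (5.5, "b"), (2.2, "a")]

candidates = {"majority_class": majority_class,
              "nearest_neighbor": nearest_neighbor}
scores = {name: accuracy(fn, train, test) for name, fn in candidates.items()}
ranking = sorted(scores, key=scores.get, reverse=True)

print(scores)
print("best:", ranking[0])
```

Because every candidate goes through the same `accuracy` function on the same split, the scores are directly comparable, which is the whole point.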

15. Select your best features and the algorithm

Now it’s time to review all the logged experiments and evaluate objectively, without bias, which algorithm performs better. See which combination of features helps boost performance because, at times, fewer features work better.

Another factor to consider is the level of interpretability and the time taken for training the model. Based on all of these considerations, finalize the model and the set of features you’re going to feed into it.

16. Optimize the hyperparameters

The last optimization you can do is to tune the hyperparameters. Hyperparameters are settings the user chooses before training a model, as opposed to the parameters the model learns from the data.

Keeping the algorithm and the features constant, we change the hyperparameters of the model and try to find the values that maximize performance. We can use several techniques to achieve this, such as grid search and random search. Here’s a detailed guide I found that’ll help you understand hyperparameter optimization better.
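Grid search itself is simple enough to sketch in a few lines: try every combination of hyperparameter values and keep the one that scores best on the validation set. The example below uses a small hand-written k-nearest-neighbours classifier with two hyperparameters, `k` and the distance `metric`; the data points and the grid values are made up for illustration.

```python
import math
from collections import Counter
from itertools import product


def knn_predict(train, point, k, metric):
    """Majority vote among the k training points nearest to `point`."""
    def distance(a, b):
        if metric == "manhattan":
            return abs(a[0] - b[0]) + abs(a[1] - b[1])
        return math.dist(a, b)  # euclidean
    nearest = sorted(train, key=lambda item: distance(item[0], point))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def validation_accuracy(train, val, k, metric):
    """Share of validation points classified correctly for one setting."""
    hits = sum(knn_predict(train, p, k, metric) == label for p, label in val)
    return hits / len(val)


# Hypothetical 2-D points with labels.
train = [((1, 1), "a"), ((1, 2), "a"), ((2, 1), "a"),
         ((6, 6), "b"), ((7, 6), "b"), ((6, 7), "b"), ((3, 3), "b")]
val = [((2, 2), "a"), ((6, 5), "b"), ((1, 3), "a"), ((5, 6), "b")]

grid = {"k": [1, 3, 5], "metric": ["euclidean", "manhattan"]}
results = []
for k, metric in product(grid["k"], grid["metric"]):
    score = validation_accuracy(train, val, k, metric)
    results.append(({"k": k, "metric": metric}, score))

best_params, best_score = max(results, key=lambda r: r[1])
print(best_params, best_score)
```

Note that the search is scored on the validation set, so the test set stays untouched for the final check. Random search follows the same loop but samples combinations instead of enumerating them all.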

We have done everything to build the best machine learning model for the problem at hand.

17. Modularize the code

By now, whether you realized it or not, you’ve already worn the “data analyst” and “data scientist” hats. It’s time to wear the “machine learning engineer” hat. A machine learning engineer ensures the models developed are end-user friendly and production-ready. As the first step, we need to go from notebook-style coding to modularized pipelines with software best practices.

Start by breaking the code into logical functions such that each function does only one thing. You end up with a set of functions and a main script that calls them. (Alternatively, you may follow an object-oriented programming style.)
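As a rough picture of what that structure looks like, here is a miniature pipeline where each function does one job and `main()` wires them together. The function names, the toy data, and the through-the-origin model are placeholders of my own; to keep the sketch short it also evaluates on the training rows, which a real pipeline should not do.

```python
from statistics import mean


def load_data():
    """Stand-in loader; a real project would read a file or database here."""
    return [(1.0, 2.1), (2.0, 3.9), (None, 5.0), (4.0, 8.2)]


def clean(rows):
    """Drop rows with a missing feature value."""
    return [(x, y) for x, y in rows if x is not None]


def train(rows):
    """Fit y = slope * x through the origin by least squares."""
    return sum(x * y for x, y in rows) / sum(x * x for x, _ in rows)


def evaluate(slope, rows):
    """Mean absolute error of the fitted line on the given rows."""
    return mean(abs(y - slope * x) for x, y in rows)


def main():
    """Run the whole pipeline and return the results."""
    rows = clean(load_data())
    slope = train(rows)
    # Evaluated on the training rows only to keep the sketch short.
    return {"slope": slope, "mae": evaluate(slope, rows)}


if __name__ == "__main__":
    print(main())
```

Once the steps are separate functions, swapping the loader, adding a feature step, or testing one piece in isolation no longer means editing a long notebook cell.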

18. Build an ML web application

I’m a big fan of building machine learning web apps. They’re beginner-friendly and impress users easily. I know because I’ve built them for my clients, and they were impressed.

Since you have successfully developed your project, the idea here is to take a few hours and wrap it up as an ML app using libraries like Streamlit. Here’s a detailed guide where I build a project and create an ML app from scratch, step by step. Trust me; you’ll have everything you need in this guide.

19. Dockerize the application

Why Docker? Can’t the DevOps team take care of it? No. Many people make this mistake. It’s your project, so you know what’s best for it. The project you built on your machine should work just as well in the cloud or on anyone else’s machine. That’s the goal.

Simply put: with Docker, the applications you build become reproducible anywhere. You don’t need to learn everything, just enough, and I’m here to help you with that. Here’s the detailed 3-step guide on how to dockerize any machine learning application.

20. Deploy it to the real world

You have built the machine learning app. You have dockerized it. What’s left? The world needs to see what you’ve built, champ!

Shipping your project to the world is the most exciting thing to do. Even after doing this 25+ times, I still get excited when I’m deploying a new project!

Deployment practices in the real world take time to master; however, a good starting point is to use prebuilt platforms such as Streamlit Sharing, Netlify, etc. Here’s a guide that outlines one of the easiest ways you may deploy your project.

21. Document your work and write about it!

If you’ve done everything above and missed this, what’s the point? If you’ve taken the effort to finish every single step, then I want you to share what you have built with the world proudly.

Go back to the GitHub repository you have been maintaining so far. Use readme files to explain everything about your project as clearly as possible. The idea is to help someone who knows nothing about your project to get up and running with your project.

Finally, write a blog post on how and what you’ve built and your learning experience. Here’s one of my earliest examples. There’s nothing wrong with blowing your own trumpet when you’ve done the work. Be proud!

How Do I Know This Guide Works?

There are countless guides like this on the internet. Some are great too. But none of them seem to work. Do you know the real problem? We save them or bookmark them for later. And forget about them.

Nothing’s wrong with the guide; it’s just that we don’t take action. Sad but true.

I want this to be different. I want you to be different this time. If you know me, you know I love to help you beyond this article. To take real action. You can always reach out to me on LinkedIn, but please do this first, now.

Brainstorm a project idea. You probably know some already. Something worth doing.
Create the repository now and leave a link to it in the comments. It’ll keep you accountable.
Build a rough plan on a piece of paper and commit to working on it for the next 30 days.

I’ve given you the framework and linked many guides above, but I trust you to use Google for more. There’s always more, but let’s start with this, shall we? Thirty days is not too long, but it’s long enough to finish a portfolio project you can be proud of.

The project you build could be helpful for your internship, job, or freelancing opportunity. Who knows where it could take you? The world is full of possibilities.

You’ll never know unless you start. So, start now?
