This is a straightforward guide with a generic and efficient method for tackling machine learning projects. Of course, every problem has particularities & subtleties of its own, but here’s a general-purpose framework with the most useful elements of the ML pipeline.
Machine Learning is most certainly one of the trendiest topics of the last few years. With all the hype and fuss around it, it can be hard to know how to start a project and get it right. How do you make choices, and what should you start with?
I. ⚒️ Which tool should I use?
R or Python? Julia? COBOL? The answer is:
Just go for what matches your needs and what you know how to use efficiently. Also, here's a more elaborate answer 😉
👨💻 II. How do I get the job done efficiently?
1. Start with descriptive statistics & graphics:
Import your data and do some univariate, bivariate, and multivariate (PCA?) exploration, with some dataviz, and try to answer these questions:
- What is your data about?
- What does it look like on plots? Does it have any special structure?
- Are there any obvious correlations or relationships between the variables?
- Are there missing values? On which variables? At what rate? Why are they missing?
- Any outliers? Why? Any obvious ones that you can remove confidently?
- What does the distribution of your target variable look like? Any strong imbalance (very common in real-life problems)?
I use pandas and seaborn a lot here. I also make contingency tables and run Chi-square tests of independence for categorical variables.
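Here is a minimal exploration sketch along those lines. It assumes a pandas DataFrame loaded from a hypothetical file, with hypothetical column names ("target", "some_category"); adapt it to your own data.

```python
import pandas as pd
import seaborn as sns
from scipy.stats import chi2_contingency

df = pd.read_csv("data.csv")  # hypothetical file

# Univariate overview: types, missing values, basic statistics
df.info()
print(df.describe(include="all"))
print(df.isna().mean().sort_values(ascending=False))  # missing-value rate per column

# Target distribution: any strong imbalance?
print(df["target"].value_counts(normalize=True))

# Bivariate look: pairwise plots and correlations on numeric columns
sns.pairplot(df.select_dtypes("number").sample(min(len(df), 1000)))
print(df.corr(numeric_only=True))

# Contingency table + Chi-square test of independence for a categorical pair
table = pd.crosstab(df["some_category"], df["target"])
chi2, p_value, dof, _ = chi2_contingency(table)
print(f"chi2={chi2:.2f}, p-value={p_value:.4f}")
```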
First things first: try without machine learning.
Using ML effectively is tricky. You need a robust pipeline, clean data flows, monitoring, orchestration, etc. I recommend always starting without it 😄 Can you solve your problem in an easier way? Probably.
The first rule of machine learning: Start without machine learning
— Eugene Yan (@eugeneyan) September 10, 2021
How well can you solve the problem using simple heuristics? They still work incredibly well; people were building intelligent systems before ML. At this point, using the exploration steps above, you should have a pretty clear answer. If that is not the case yet, do some more exploration: boxplots, scatterplots.
Using all that insight, if your data is “classic”, you can probably already build a decent system based on heuristics. A few examples (a quick sketch follows this list):
- Recommendation system: recommending the top-performing items from the previous period can work really well. It's also easy and cheap.
- Classification: regex can do wonders. Obvious thresholds on numerical variables can help a lot too!
- Anomaly detection: depending on your data and sector, basic statistics (min, max, standard deviations, medians) can help you determine “what is normal”. Anything that deviates from that would be “an anomaly”.
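As a toy illustration of the classification heuristics above (regex plus an obvious threshold), here is a sketch with no ML at all. The column names, pattern, and threshold are hypothetical; the point is that such a baseline is cheap and gives you something to beat.

```python
import re
import pandas as pd

SPAM_PATTERN = re.compile(r"(free money|click here|winner)", re.IGNORECASE)

def heuristic_predict(row: pd.Series) -> int:
    """Flag a message as positive (1) if it matches an obvious pattern
    or exceeds a simple numeric threshold found during exploration."""
    if SPAM_PATTERN.search(row["text"]):
        return 1
    if row["links_count"] > 5:  # hypothetical threshold
        return 1
    return 0

# Usage: df["prediction"] = df.apply(heuristic_predict, axis=1)
# Compare this baseline's precision/recall to any ML model you build later.
```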
Doesn’t work? Let’s move on to an ML solution.
2. Do some preprocessing
I recommend doing preprocessing with scikit-learn transformers rather than Pandas (a minimal sketch follows the list below). For R, use the dplyr and tidyverse packages.
- Split the predictors (X) and your target (y), then split into train and test sets (it’s important to do this before any pre-processing to avoid leakage later). You should have 4 datasets now: two matrices (train & test predictors) and two vectors (train & test targets)
- Use one-hot encoding for categorical variables (while keeping your dimensionality under control: avoid doing this on categorical variables with more than 10 levels)
- Scale true numerical variables (not the binary ones!)
- Impute missing values (I like kNN imputation for this, [KNNImputer()](https://scikit-learn.org/stable/modules/generated/sklearn.impute.KNNImputer.html) in sklearn)
- Deal with imbalance (under- or oversampling, if the ML library you plan on using doesn’t handle this internally; I’m talking about model arguments like class_weight in sklearn and Keras)
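Here is the promised sketch of that preprocessing flow with scikit-learn transformers. It assumes a DataFrame `df` with a target column named "target" (hypothetical names), and fits every transformer on the train split only to avoid leakage.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import KNNImputer

X = df.drop(columns="target")
y = df["target"]

# Split BEFORE fitting any transformer, to avoid leakage
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

categorical_cols = X.select_dtypes(include="object").columns
numerical_cols = X.select_dtypes(include="number").columns

preprocess = ColumnTransformer([
    # One-hot encode categoricals (keep their cardinality under control beforehand)
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    # Impute, then scale true numerical variables
    ("num", Pipeline([
        ("impute", KNNImputer(n_neighbors=5)),
        ("scale", StandardScaler()),
    ]), numerical_cols),
])

# Fit on train only, then transform both sets
X_train_prep = preprocess.fit_transform(X_train)
X_test_prep = preprocess.transform(X_test)
```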
3. Keep the right features only
Selecting the right features in your data will make the difference between passable performance with long training times and great accuracy with short training times. Remove redundant features: some features don’t offer any new information. Thou shalt delete them (see the sketch after this list).
- Delete near-zero variance features
- Delete redundant features (features that are very highly correlated with others, or linear combinations that carry no additional information)
- If needed, use a feature selection procedure to eliminate useless data. Parsimony is awesome 😄 it keeps your model simple and helps avoid overfitting. Keep it generalizable.
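A minimal feature-filtering sketch for the first two points: drop near-zero-variance columns, then drop one column of each highly correlated pair. It assumes a numeric feature DataFrame `X_train` (as produced by your own preprocessing), and the 0.95 correlation cutoff is just an illustrative choice.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# 1) Near-zero-variance features
selector = VarianceThreshold(threshold=1e-4)
selector.fit(X_train)
low_variance = X_train.columns[~selector.get_support()]

# 2) Highly correlated features: keep only one column of each correlated pair
corr = X_train.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
redundant = [col for col in upper.columns if (upper[col] > 0.95).any()]

to_drop = set(low_variance) | set(redundant)
X_train_reduced = X_train.drop(columns=to_drop)
print(f"Dropped {len(to_drop)} features, kept {X_train_reduced.shape[1]}")
```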
4. Look for the simplest model that does the job:
Scikit-learn is a must, caret (R) is good, Keras is wow for Deep Learning.
Fit a simple model or two, starting with linear models (logistic regression if you are doing binary classification). Then maybe a basic decision tree. Rank features by importance to understand which variables are most linked to your target (variable importance): more insight!
Is the performance satisfying already? If not, move to a Random Forest or XGBoost model (those are pretty efficient 😉). You have a complex problem or complex data? OK, why not try a neural network now. But keep in mind this comes at the expense of explainability and ease of operational use (the Grail). Keep things as simple as possible; you’ll thank yourself a year from now.
Does it look OK to you? Move on to tuning the hyper-parameters. I use a classic RandomizedSearchCV() to find the best model parameters.
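Here is a sketch of that progression: a simple, explainable baseline first, then a Random Forest, then a randomized hyper-parameter search. It assumes the preprocessed arrays X_train_prep / y_train from the earlier sketches (hypothetical names), and the parameter grid is just an example.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, cross_val_score

# 1) Simple, explainable baseline (class_weight handles imbalance internally)
baseline = LogisticRegression(max_iter=1000, class_weight="balanced")
print("baseline CV accuracy:",
      cross_val_score(baseline, X_train_prep, y_train, cv=5).mean())

# 2) Only if the baseline is not enough: a Random Forest
forest = RandomForestClassifier(class_weight="balanced", random_state=42)

# 3) Tune hyper-parameters with a randomized search
param_distributions = {
    "n_estimators": [100, 300, 500],
    "max_depth": [None, 5, 10, 20],
    "min_samples_leaf": [1, 5, 10],
}
search = RandomizedSearchCV(
    forest, param_distributions, n_iter=20, cv=5,
    scoring="accuracy", random_state=42, n_jobs=-1,
)
search.fit(X_train_prep, y_train)
print("best params:", search.best_params_)
print("best CV accuracy:", search.best_score_)

# Variable importance: which features drive the predictions?
importances = search.best_estimator_.feature_importances_
```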
Stop when your performance target is met. Real life is not Kaggle; spending 3 more weeks for a 0.1% performance gain is a waste of time.
Always check whether your model generalizes. Then check again. Obviously, you must always evaluate your model on the held-out test set.
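As a final sanity check, something like the snippet below does that last evaluation on the held-out test set, reusing the fitted `search` and the preprocessed `X_test_prep` / `y_test` from the earlier sketches (hypothetical names).

```python
from sklearn.metrics import classification_report

y_pred = search.best_estimator_.predict(X_test_prep)
print(classification_report(y_test, y_pred))  # precision/recall/F1 per class
```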
The research question is central, keep it in mind. Especially when you have a large and rich data set, it’s very easy to get lost or distracted. Keep the focus on your destination: the research question.
Okay, you’re all set now 😉
And remember : Less is More, Simple is Awesome.
Build a complex model for the sake of doing so, and you’ll later find yourself with:
- A difficult data pipeline to set up,
- A complex codebase to maintain,
- A hard time explaining your product to the business teams,
- Explainability issues,
- Lots of tricky documentation to write and keep up to date.
Want a more elaborate answer? Get yourself a good book on the subject. I personally like this one, it’s been my bible for a while, and it’s REALLY articulate: Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems by Aurélien Géron.
It has been translated into many languages and covers at least 95% of what you need to know to do ML efficiently. The book comes with a bunch of great Jupyter notebooks you can use freely (it’s hands-on!).
🎁 Wanna go further? Here are some free, high-quality resources:
- The Google Machine Learning Guides
- The State of Machine Learning Frameworks in 2019
- Algebra, Topology, Differential Calculus, and Optimization Theory For Computer Science and Machine Learning
- Machine Learning Crash Course with TensorFlow APIs (Google)
- Amazon’s Machine Learning University
- Rules of Machine Learning: Best Practices for ML Engineering (Google)
Hope you enjoyed the read ! 😊👋
Anas EL KHALOUI