This story tackles a very common, and frankly tiresome, question in the Data Science space. I am not advertising any tool or package here; all views are my own and based on my personal experience, so please feel free to judge me.
The first and most common question when getting hands-on with data science is which tools and programming language to use.
I. Break Free!
Set aside commercial tools like SAS and MATLAB: open source projects have strong, very rich communities working on them and are extremely popular among practitioners. They offer more functionality, give complete freedom, and evolve much faster than tools like SAS thanks to frequent new versions and state-of-the-art packages.
They are arguably the best choice for data science. Plus they’re free, which is also nice: your choice costs you nothing, so you don’t have to worry about it. You can switch tools whenever you want, or use a combination of them.
Please, this is very important, keep this comic in mind:
II. Who’s best?
So what is the “best” toolset? R? Python? Julia? Supposing you have complete freedom of choice, the answer is pretty straightforward: IT DOES NOT MATTER AT ALL. Tools are nothing more than tools, and they should be chosen with a purpose in mind. R and Python are the most common choices, and they have pretty much the same functionality and features. In short, having used both of them in various contexts, my opinion is:
If your IT department is familiar with a tool, deploying to production will be easier if you develop your model in that particular tool. Most enterprise infrastructures already have Python jobs running, so interfacing with your Python code or integrating it into existing processes will probably be easier.
If you have a “traditional” Computer Science background, built around Java, C & similar languages, Python is a great choice because of its mature object-oriented programming features.
Coming from Statistics? Chances are you already have some R skills. Go for it! Despite the recent hype around Python, R is still an excellent framework for data science and has the same set of capabilities. Yes, you can use Spark, Keras and TensorFlow with R, and you can deploy your model as a REST API, use version control, work with databases and “big data”, scrape the web, etc. (Well, you guessed it, R fan over here, and this is starting to look like an awful sales pitch…)
Planning on deploying your model or tool as a nice and clean web user interface with no front-end development knowledge?
- R: check out Shiny, htmlwidgets and flexdashboard
- Python: check out Dash, Streamlit and Gradio
Looking for ease of use, flexibility and speed in prototyping? R has a few awesome turnkey packages like caret for ML, recommenderlab for RecSys, etc. Same for Python with scikit-learn, Plotly, etc.
Don’t know where to start and never coded before? Have some Python: there are great tutorials, and it’s quite easy to get started. Also, the syntax is super simple: pretty much no “;” at the end of the line and no { } in loops and control structures, just tabs/spaces and line breaks. This is super nice to work with.
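To make the syntax point concrete, here is a tiny sketch of a loop with a condition in Python. The function name `describe` and the sample numbers are made up for illustration. Notice that blocks are delimited by indentation alone, with no semicolons and no braces.

```python
def describe(values):
    """Return a short label for each number in `values`."""
    labels = []
    for v in values:              # the loop body is simply indented
        if v < 0:                 # no parentheses or braces required
            labels.append("negative")
        elif v == 0:
            labels.append("zero")
        else:
            labels.append("positive")
    return labels


print(describe([-2, 0, 7]))
```

Changing the indentation changes which block a line belongs to, so a consistent style (four spaces is the usual convention) matters more than in brace-based languages.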
More of a GUI person who prefers point-and-click over code? R has many easy-to-use graphical add-ins for all kinds of tasks, from data management to ML and modeling. The excellent Flow UI from H2O.ai is also worth mentioning for including an AutoML engine in a web GUI.
Looking for faster training? It seems like scikit-learn, with the data management formalism it enforces (numerical variables only, which implies one-hot encoding or embedding categorical features), is slightly faster. Parallelism (distributing computation across many CPU cores) is also a tiny bit trickier in R than in Python.
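If “one-hot encoding” is new to you, here is a rough plain-Python sketch of the idea. It deliberately does not use scikit-learn, and the helper `one_hot` and the `colors` sample are hypothetical names invented for this example. In a real project you would more likely rely on a library encoder.

```python
def one_hot(values):
    """One-hot encode a list of categorical values.

    Returns the sorted list of categories and one 0/1 row per input value.
    """
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)   # one column per category
        row[index[v]] = 1             # flag the column matching this value
        rows.append(row)
    return categories, rows


colors = ["red", "green", "red", "blue"]
categories, encoded = one_hot(colors)
print(categories)
print(encoded)
```

Each categorical column turns into as many numeric columns as it has distinct values, which is why this formalism can make datasets much wider.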
So just make your choice, depending on what will help you get stuff done fast and easily. Be lazy, and keep in mind that you can always go back and learn another language. Also, with all the resources and awesome communities on the interwebs, learning and using multiple languages at the same time has never been so easy.
I, for example, used R for years before switching to Python because of libraries like PySpark. I still use R (for maintaining this blog, for example).
NB: thanks to the folks at RStudio, R has a real, fully functional interface to Python called reticulate. That means you can now load your dataset with pandas, do all your data management with dplyr and the tidyverse, visualize the data with ggplot2, and then fit a scikit-learn model or call TensorFlow. All of that in the same R script. It’s not going to end the eternal debate, but check it out! Using it, I noticed a bit of annoying overhead with serialization and deserialization. Also, the conda environment management adds complexity and disk footprint. Really cool feature, but I prefer using pure Python or R for the sake of keeping things simple.
NB2: I’ve been hearing some crazy things about Julia. The programming language seems to be über powerful and fast, and there is a growing number of Data Science and ML libraries. Will definitely check it out.
I hope you enjoyed the read! 😊👋
Anas EL KHALOUI