Data Engineering For Data Scientists

All companies want to extract the most out of their data. However, many find it difficult to deploy the insights gained from data science. Part of the problem can be a lack of understanding of the roles and skill sets of both data scientists and data engineers.

On the face of it, both data scientists and data engineers work with the same medium: data. However, a data scientist’s key outputs are insights, whereas a data engineer’s key outputs are products. The key to publishing a data solution efficiently is to understand how the scientists and the engineers interact and to make good use of both.

Here are some common pitfalls companies fall into when trying to productionise a data solution, followed by some habits data scientists can adopt to create a good platform for engineers to build upon, without asking the data scientists to develop the entire system themselves.

Antipatterns to Avoid

Treating the Value of Notebooks as the Code, Not the Insights

Data scientists often develop solutions in notebooks such as Jupyter Notebooks. While notebooks are fantastic for quickly exploring data, testing hypotheses, and prototyping, they should be treated as a way of getting quick insights rather than as enduring code. Typically, a notebook is a record of all your explorations, complete with the dead ends, visualizations, and tangents along the way. This is unnecessary and confusing in a production codebase. One advantage of notebooks, the ability to run code out of order, also becomes a weakness, as the code will probably not run from top to bottom. Furthermore, notebooks do not easily support testing, something that is critical to creating robust and maintainable software.

Once you have your insights, you should be prepared to throw away your notebook code and essentially start again.

A tempting alternative is for the data scientists to hand notebooks directly to the engineers to productionise. However, this causes slowdowns, headaches, and tension for the data engineers, who now have to separate relevant from irrelevant information and understand the complex analysis, in addition to their role of integrating the method into the existing ecosystem. The end result is often that the data engineer has to painstakingly rewrite the entire notebook.

Getting the Data Scientists to Deploy their own Products

Another option companies take is to have the data scientists build the initial product. Data scientists are typically trained in mathematics, statistics, and algorithm development; most are not equipped to build a fully automated data platform with CI/CD. The result will lack the maturity expected of an automated system and will take significantly longer to build.

Strategies to Employ

Convert Your Insights into a Proof of Concept

Once you have valuable insights from your notebook and wish to turn them into a product, consider transforming your notebook into a proof of concept (POC) to showcase the value of the insights. This quickly turns the notebook into simple, self-standing code for the engineers, and also allows users to start using the product immediately and provide valuable feedback.

A simple starting point is to split your notebook into discrete functions, segment your code by purpose (e.g. model training, inference, ETL), and then move these segments into independent scripts within a repository.
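As a rough illustration of that split, here is a minimal sketch using only the Python standard library. Every name in it (extract_rows, train_model, predict, LinearModel, and the x and y column names) is an assumption made for the example rather than something prescribed here, and the least-squares line simply stands in for whatever model your notebook actually produced. Each function could then move to its own script, such as a hypothetical etl.py, train.py, and inference.py.

```python
"""Hypothetical layout: one function per stage, each movable to its own script."""
import csv
import io
from dataclasses import dataclass


@dataclass
class LinearModel:
    """Stand-in for the model artefact produced by the notebook analysis."""
    slope: float
    intercept: float


def extract_rows(csv_text: str) -> list[tuple[float, float]]:
    """ETL step: parse CSV text with 'x' and 'y' columns into numeric pairs."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(float(row["x"]), float(row["y"])) for row in reader]


def train_model(rows: list[tuple[float, float]]) -> LinearModel:
    """Training step: ordinary least-squares fit of y = slope * x + intercept."""
    if not rows:
        raise ValueError("training data must not be empty")
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in rows)
    if var_x == 0:
        raise ValueError("cannot fit a line when all x values are identical")
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    slope = cov_xy / var_x
    return LinearModel(slope=slope, intercept=mean_y - slope * mean_x)


def predict(model: LinearModel, x: float) -> float:
    """Inference step: score a single input with the trained model."""
    return model.slope * x + model.intercept
```

Because each stage takes plain inputs and returns plain outputs, the engineers can swap the CSV parsing for a real data source or wrap predict in an API without touching the training logic.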

Writing tests at this point will enable the engineers to quickly understand the code and refactor it where needed. Focus your testing on areas outside the engineers’ experience, such as model training and inference; this gives the team more confidence in the functionality, so they can focus on building an end-to-end product first. Give less focus to areas like data loading and API functionality, as these sections will likely receive the most revisions anyway. You want the core of the analytics product to be well tested, while peripheral aspects need less coverage.
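To make this concrete, here is a small sketch of what tests around training and inference might look like. The train_model and predict functions are deliberately trivial stand-ins (the "model" is just the mean of the training values), and all names are assumptions made for the example. The tests use plain assert statements, so they can be called directly or, if saved in a file named something like test_model.py, collected by a runner such as pytest.

```python
import math


def train_model(values):
    """Stand-in training step: the 'model' is simply the mean of the values."""
    if not values:
        raise ValueError("training data must not be empty")
    return sum(values) / len(values)


def predict(model, n_requests):
    """Stand-in inference step: return the learned mean once per request."""
    return [model] * n_requests


def test_train_model_learns_mean():
    # Pins down the expected numerical behaviour of training.
    assert math.isclose(train_model([1.0, 2.0, 3.0]), 2.0)


def test_train_model_rejects_empty_input():
    # Documents how training should fail on bad input.
    try:
        train_model([])
    except ValueError:
        return
    raise AssertionError("expected ValueError for empty training data")


def test_predict_returns_one_value_per_request():
    # Fixes the output shape that downstream code can rely on.
    model = train_model([4.0, 6.0])
    assert predict(model, 3) == [5.0, 5.0, 5.0]
```

Tests like these double as documentation: an engineer who refactors the training code can see which behaviours the data scientist considers essential.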

Involve Data Engineering Early

Once you know you want to move forward with your data solution, engage with data engineering. The considerations will differ significantly depending on whether you are deploying to an existing system or the work is part of a much larger business project.

Laying out requirements early to the engineering team will help you get the context for the larger ecosystem you are deploying into and allow the engineering team to give feedback on how a transition should look. Data scientists should work with the engineers to understand the development and operations lifecycle and discuss things like:

How should errors and other info be captured and logged?
How do you want configuration variables to be managed?
How do you pass in parameters?
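One possible set of answers to these questions, sketched with the Python standard library only: errors and info go through the logging module, configuration comes from environment variables, and run parameters come from command-line arguments. The job name, the MODEL_PATH and LOG_LEVEL variables, and the --input and --threshold flags are all hypothetical; your engineering team may well prefer different conventions, which is exactly why the conversation is worth having early.

```python
import argparse
import logging
import os

logger = logging.getLogger("scoring_job")  # hypothetical job name


def load_config(env=None):
    """Configuration: read settings from environment variables with defaults."""
    env = os.environ if env is None else env
    return {
        "model_path": env.get("MODEL_PATH", "model.pkl"),  # hypothetical variable
        "log_level": env.get("LOG_LEVEL", "INFO"),  # hypothetical variable
    }


def parse_args(argv=None):
    """Parameters: values that change per run arrive as command-line flags."""
    parser = argparse.ArgumentParser(description="Hypothetical scoring job")
    parser.add_argument("--input", required=True, help="path to the input file")
    parser.add_argument("--threshold", type=float, default=0.5)
    return parser.parse_args(argv)


def main(argv=None, env=None):
    """Run the job and return a process exit code (0 on success, 1 on failure)."""
    config = load_config(env)
    args = parse_args(argv)
    logging.basicConfig(level=config["log_level"])
    logger.info("scoring %s with model %s", args.input, config["model_path"])
    try:
        if not 0.0 <= args.threshold <= 1.0:
            raise ValueError(f"threshold out of range: {args.threshold}")
        # The real scoring logic would run here.
    except ValueError:
        # Errors: log the full traceback, then signal failure to the caller.
        logger.exception("run failed")
        return 1
    return 0
```

A script entry point could call main() with no arguments, in which case argparse reads the real command line and load_config reads the real environment.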

Closing Remarks

Understanding the skill sets of both your data science and data engineering teams is key to getting the smoothest transition of your insights into production and, hopefully, to avoiding some of the common pitfalls companies fall into.

Adam Fletcher