Radio DaTa Podcast

Data Journey with Yetunde Dada & Ivan Danov (QuantumBlack) – Kedro (an open-source MLOps framework) – introduction, benefits, use-cases, data & insights used for its development

In this episode of the RadioData Podcast, Adam Kawa talks with Yetunde Dada & Ivan Danov about QuantumBlack, Kedro and trends in the MLOps landscape, such as the abundance of MLOps tools and the rise of LLMOps. We encourage you to listen to the whole podcast or, if you prefer reading, skip to the key takeaways listed below.

_________________

Host: Adam Kawa, CEO of GetInData | Part of Xebia

Since 2010, Adam has been working with Big Data at Spotify (where he proudly operated one of the largest and fastest-growing Hadoop clusters in Europe), Truecaller and as a Cloudera Training Partner. Nine years ago, he co-founded GetInData | Part of Xebia – a company that helps its customers to become data-driven and builds custom Big Data solutions. Adam is also the creator of many community initiatives such as the RadioData podcast, Big Data meetups and the DATA Pill newsletter.

Guests: Yetunde Dada & Ivan Danov

Yetunde is a Product Director at QuantumBlack and has been in the company for almost 4 years. 

Ivan is a Software Engineer and has been working for QuantumBlack for 6 years. He has been working on Kedro since the beginning. 

_________________

QuantumBlack

QuantumBlack, a McKinsey company, is a data science and advanced analytics company that works with customers from various industries. QuantumBlack was founded in 2009 and has its headquarters in London, United Kingdom. In 2015 the company became a part of McKinsey & Company, a global management consulting firm, and it now operates as part of McKinsey's global analytics practice.

_________________

Key takeaways:

1. What is Kedro?

Kedro is an open-source, Python workflow development framework that helps ML practitioners write maintainable and modular analytics code which is production ready. It achieves this by enabling teams to adopt software engineering best practices.

Most companies have separate research and production units. Research units often work with Jupyter notebooks and are responsible for inventing new solutions, whereas production units try to implement their work and run it in a production environment.

Kedro tries to give everyone, regardless of the team, the same level of software engineering practices, which makes the code production ready right from the start, or with much less refactoring than would be needed without those practices.

As a result, a lot of the data engineering and data science prototypes that teams write are production ready right from the start, or become production ready with a little bit of work. In the end, this approach brings more value to the company.

2. What are the most important reasons why data practitioners choose to use Kedro?

If you are a data scientist, you may want to pick up Kedro because you are collaborating with other team members and want to write well-structured code that is easy to share, maintain and understand.

ML engineers often pick up Kedro because it helps them to create an environment where other team members can write prototypes in a specific, well structured way. It allows the users to build software that is easily scalable and can be run in different environments.

If you are a data engineer, you are involved in creating large scale feature engineering pipelines or some form of data cleaning. Kedro can provide a well structured workflow for those types of tasks.

The last group that benefits from Kedro is project leads. Kedro Viz can give you a bird's-eye view of the pipeline structure.

The code that is produced with Kedro is more modular and usable across different projects.

3. How do large enterprises work with Kedro? What are the differences between enterprises that use and do not use Kedro?

The example from QuantumBlack is that before Kedro, each team developed its part of the code in a different programming language, and the subsequent integration was a nightmare. The code was hard to read and understand, and it was complicated to move it between different projects and environments.

When Kedro was introduced, it presented a common language that everyone could use to communicate with each other about the project. It presented a level of abstraction that helped with communication about data engineering tasks. Teams could suddenly start talking about the development of a "node" or a "pipeline", and everyone had a common understanding of what that meant.
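
To make the "node" and "pipeline" vocabulary more concrete, here is a minimal plain-Python sketch of the general idea: a node is a function with named inputs and a named output, and a pipeline is a collection of nodes executed in an order determined by their data dependencies. This is only an illustration and deliberately does not use Kedro's actual API; the names Node, run_pipeline, clean and summarize are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional


@dataclass
class Node:
    """Hypothetical stand-in for a pipeline node: a function plus named data."""
    func: Callable[..., Any]
    inputs: List[str]
    output: str


def run_pipeline(nodes: List[Node], data: Dict[str, Any]) -> Dict[str, Any]:
    """Run each node as soon as all of its named inputs are available."""
    results = dict(data)
    pending = list(nodes)
    while pending:
        ready = [n for n in pending if all(name in results for name in n.inputs)]
        if not ready:
            raise ValueError("Unresolvable inputs: check node dependencies")
        for n in ready:
            results[n.output] = n.func(*(results[name] for name in n.inputs))
            pending.remove(n)
    return results


def clean(raw: List[Optional[float]]) -> List[float]:
    # Drop missing values (an illustrative cleaning step)
    return [x for x in raw if x is not None]


def summarize(values: List[float]) -> float:
    # Compute the mean of the cleaned values
    return sum(values) / len(values)


# Nodes are listed out of order on purpose: execution order follows the data.
pipeline = [
    Node(summarize, ["clean_data"], "mean_value"),
    Node(clean, ["raw_data"], "clean_data"),
]

outputs = run_pipeline(pipeline, {"raw_data": [1.0, None, 3.0]})
print(outputs["mean_value"])
```

Because each step declares what it consumes and produces by name, team members can discuss and reorder steps without reading each other's function bodies, which is the kind of shared understanding described above.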

It also presented a common code base structure which was cleaner and easier to work with across multiple teams.

As a consequence of that, they started to build bigger and bigger projects which were more usable. They have been able to industrialize the way that they write machine learning code at QuantumBlack thanks to Kedro.

One of the companies that benefited greatly from Kedro is Telkomsel (the link to the article about Telkomsel and Kedro). Telkomsel is Indonesia's largest communications company. Telkomsel used Kedro in several of their data engineering projects, and the benefits that they emphasize are improved collaboration, configuration management and the visualization of data pipelines.

Data scientists who did not have a software engineering background became better software engineers by using Kedro.

4. What are the numbers or statistics that show the adoption of Kedro?

Kedro has over 8 thousand stars on GitHub. The growth of the project was largely organic. There are over 1.6 thousand projects that depend on Kedro, and the number is growing. There are also almost 180 contributors to the project.

There are hundreds of companies that use Kedro. Some of the most notable ones are listed in the README on the project's main GitHub page, for example: Absa, AXA UK, NASA, ING, GetInData and AMAI GmbH.

5. Is there any specific segment of companies that benefit from Kedro the most?

If you are collaborating with others and are building a data engineering or data science pipeline, then Kedro is for you. Kedro supports creating code that should be deployed to production.

Kedro is designed to be platform agnostic. It provides freedom in writing data pipeline code without having to worry about which cloud provider it is going to be used with. There are a number of plugins (some of which are developed by GetInData, such as Kedro VertexAI and Kedro AzureML) which enable different data sources and data platforms / cloud providers to be used with Kedro. They provide a level of abstraction that helps to write more modular and reusable code.
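
The platform-agnostic idea can be illustrated with a small plain-Python sketch: the processing step only refers to datasets by name, while where the data actually lives is decided separately. This is not Kedro's real API or plugin mechanism; the classes InMemoryDataset and JsonFileDataset and the function run_step are hypothetical names chosen for illustration.

```python
import json
import os
import tempfile
from typing import Any, Dict, List, Protocol


class Dataset(Protocol):
    """Hypothetical interface: anything that can load and save data."""
    def load(self) -> Any: ...
    def save(self, data: Any) -> None: ...


class InMemoryDataset:
    def __init__(self, data: Any = None) -> None:
        self._data = data

    def load(self) -> Any:
        return self._data

    def save(self, data: Any) -> None:
        self._data = data


class JsonFileDataset:
    def __init__(self, path: str) -> None:
        self._path = path

    def load(self) -> Any:
        with open(self._path, encoding="utf-8") as f:
            return json.load(f)

    def save(self, data: Any) -> None:
        with open(self._path, "w", encoding="utf-8") as f:
            json.dump(data, f)


def double_values(values: List[int]) -> List[int]:
    # The business logic knows nothing about storage
    return [2 * v for v in values]


def run_step(catalog: Dict[str, Dataset], source: str, target: str) -> None:
    # Look datasets up by name, so the storage backend can be swapped freely
    catalog[target].save(double_values(catalog[source].load()))


# The same step runs against two different storage setups.
memory_catalog = {"numbers": InMemoryDataset([1, 2, 3]), "doubled": InMemoryDataset()}
run_step(memory_catalog, "numbers", "doubled")
memory_result = memory_catalog["doubled"].load()

with tempfile.TemporaryDirectory() as tmp:
    file_catalog = {
        "numbers": InMemoryDataset([1, 2, 3]),
        "doubled": JsonFileDataset(os.path.join(tmp, "doubled.json")),
    }
    run_step(file_catalog, "numbers", "doubled")
    file_result = file_catalog["doubled"].load()

print(memory_result == file_result)
```

Only the catalog changes between the two runs; the step itself stays untouched, which is the property that makes code portable across environments.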

Kedro wants to be a bridge between data scientists and production. They want data scientists to have a uniform experience, regardless of what they are developing and which cloud provider is going to be used to run the code.

6. Do you analyze data to define the product roadmap for Kedro?

Kedro is supposed to be governed by the community, and all of the QuantumBlack work is done in public. You can see the GitHub issues that they are working on and the milestones that they are currently trying to achieve.

Kedro has an opt-in telemetry plugin that sends usage data back to the team's database, so that they can see which commands are used more often and which are not. This helps them decide what the Kedro development team should focus on next.

In terms of what is coming next for the Kedro project, QuantumBlack is working on improving templating and configuration management in newly created Kedro projects.

They also want to improve existing features and make sure that they are working as intended. They will also focus more on integration with Databricks, Sagemaker and AzureML. They want to equip Kedro users with appropriate tools to work with those services.

The Kedro Viz project is also supposed to see improvements in visualizing dashboards and pipelines.

They also plan to improve Kedro online courses and documentation that will explain the basics of Kedro and how to take advantage of its features.

7. Can you share your thoughts on the future evolution of MLOps? What are the most important trends that you see when working with the open-source community and companies in regards to building their ML solutions?

Regarding MLOps tooling, there seem to be too many tools, and they predict either convergence or clear dominant players emerging in certain areas of MLOps.

They also expect to see new literature about best practices and code quality in data science projects, similar to what already exists for software engineering.

They also cannot ignore that right now there is a lot of talk about ChatGPT and new language models, which is probably going to be a trend in the upcoming years.

8. Iguazio is a Tel Aviv based company that offers ML platforms for large scale businesses. It is said in the article that Iguazio and QuantumBlack want to team up in the future to create one unified single product, that combines the best of both worlds of Kedro and Iguazio’s product. What does this mean for Kedro?

QuantumBlack has a product that covers a similar field to the one created by Iguazio, and the two plan to join forces to create a better solution together.

They want Kedro to run natively together with Iguazio's solution. Their goal is to achieve such a level of integration that as few steps as possible are needed from both groups of users to transfer a project from one environment to the other.

Another benefit is that they have acquired another platform that Kedro runs on, which brings more experience and a better understanding of what Kedro should look like in order to be more flexible and useful in different scenarios. That said, Kedro is still going to remain a platform-agnostic tool.

___________________

These are just snippets from the entire conversation, which you can listen to in the full podcast episode.

8 September 2023
