Databricks has handed over MLflow, its platform for managing the lifecycle of machine-learning projects presented two years ago, to the open-source organization.
Databricks, the software house behind the cluster-computing framework Apache Spark, has handed over MLflow to the Linux Foundation. The non-profit organization aims to act as a vendor-neutral hub with an open governance model to grow the MLflow project and enable broader community participation. The announcement was made at this year’s online Spark + AI Summit, which is taking place this week.
MLflow in general
MLflow is described as a platform for lifecycle management of machine-learning projects and was presented at the Spark + AI Summit two years ago. Databricks created the project from the observation that, unlike traditional software development, which is primarily concerned with code versions, machine-learning projects must also version data sets, model parameters and algorithms.
In addition, ML projects are usually highly iterative. MLflow aims to make this process manageable by providing a platform for the entire ML development lifecycle, from data preparation to production deployment, including tracking experiments, packaging code into reproducible runs, and sharing and collaborating on models.
MLflow in detail
MLflow combines three components – MLflow Tracking, MLflow Projects and MLflow Models – which can be used both locally in the data center and in the cloud. MLflow Tracking provides users with an API and UI for logging parameters, code versions, metrics and output files when executing ML code. This allows experiments to be recorded and compared – from Python, R, Java or via a REST API.
With MLflow Projects, code can be packaged in a file directory or Git repository in such a way that it can be reused reproducibly – for example, transferred to new platforms or handed to other data scientists. Finally, MLflow Models prepares models for deployment and adapts them for serving with ML frameworks such as TensorFlow, Keras or PyTorch.
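The packaging convention centers on an MLproject file in the project root that declares the environment and entry points. A hedged sketch (file names, parameters and the training script are illustrative assumptions, not from the article):

```yaml
name: demo_project

conda_env: conda.yaml   # reproducible environment definition

entry_points:
  main:
    parameters:
      learning_rate: {type: float, default: 0.01}
    command: "python train.py --learning-rate {learning_rate}"
```

With such a file in place, the project can be run reproducibly from the command line or against a Git URL with `mlflow run`.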
The project now reportedly has over 200 contributors and is downloaded more than two million times a month, with downloads growing fourfold year over year.
MLflow isn’t the first project that Databricks has handed over to the Linux Foundation. Delta Lake, the open-source project for managing data lakes, ended up with the non-profit organization last fall. In addition, the software manufacturer recently announced the acquisition of the open-source project Redash, a tool that connects to a variety of data sources to visualize and share queries and analyses. Databricks’ new Delta Engine is also intended to give users faster access to all the information in their data lakes.
Databricks also announced version 1.0 of Koalas at this year’s Spark + AI Summit. Koalas is an overlay framework that brings the pandas API to Apache Spark and is intended to greatly simplify working with the two data-analysis and big-data tools. Version 1.0 implements the most commonly used pandas functions, reportedly covering 80 percent of all pandas APIs. Additionally, Koalas supports Apache Spark 3.0, Python 3.8, a Spark accessor, new type hints and better in-place operations.