Machine Learning Model and Workflow Management Framework
Updated: Apr 25, 2019
An ML and data analytics training and model management system that supports multiple ML engineers and data scientists on on-premise infrastructure
A computer system vendor was helping a technical institute build on-premise infrastructure. The institute planned to open data science and machine learning classes and was looking for a solution that would let it offer pilot classes to interested students. It had a number of older machines and a limited budget to acquire a few new machines with GPUs. If the pilot classes went well, it planned to acquire more machines from the system vendor.
The system vendor approached us to offer a scalable solution with the following requirements:
The system allows multiple students and faculty members to conveniently share storage and compute resources for their in-class and lab project work.
It allows the IT admin to allocate and oversubscribe GPU and storage resources for the students.
It offers students a software development and debugging environment for commonly used machine learning frameworks such as TensorFlow, PyTorch, and MXNet.
It allows multiple students to run long-running ML training jobs within their allocated resources, and it automatically schedules queued jobs when there is a backlog.
It provides the admin with a dashboard to conveniently monitor and manage the load and jobs.
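The allocation and backlog requirements above can be illustrated with a minimal sketch. It is not the delivered system: the class `QuotaScheduler`, the `Job` fields, and the first-in, first-out policy with skipping of jobs that do not yet fit are all assumptions made for illustration. Oversubscription is modeled by allowing the per-student quotas to add up to more than the physical GPU count.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Job:
    job_id: str
    student: str
    gpus: int  # number of GPUs the job needs


class QuotaScheduler:
    """Hypothetical FIFO scheduler with per-student GPU quotas."""

    def __init__(self, total_gpus, quotas):
        self.total_gpus = total_gpus          # physical GPUs in the cluster
        self.quotas = dict(quotas)            # student -> max GPUs in use at once
        self.in_use = {s: 0 for s in quotas}  # student -> GPUs currently held
        self.backlog = deque()                # jobs waiting for resources
        self.running = {}                     # job_id -> Job

    def free_gpus(self):
        return self.total_gpus - sum(self.in_use.values())

    def submit(self, job):
        """Queue a job and return the ids of jobs started as a result."""
        if job.student not in self.quotas:
            raise KeyError(f"no quota allocated for {job.student}")
        if job.gpus > min(self.quotas[job.student], self.total_gpus):
            raise ValueError("job can never fit within quota or cluster size")
        self.backlog.append(job)
        return self._dispatch()

    def finish(self, job_id):
        """Release a finished job's GPUs and start whatever now fits."""
        job = self.running.pop(job_id)
        self.in_use[job.student] -= job.gpus
        return self._dispatch()

    def _dispatch(self):
        started, waiting = [], deque()
        # Walk the backlog in submission order; jobs that do not fit stay queued.
        while self.backlog:
            job = self.backlog.popleft()
            within_quota = (
                self.in_use[job.student] + job.gpus <= self.quotas[job.student]
            )
            if within_quota and job.gpus <= self.free_gpus():
                self.in_use[job.student] += job.gpus
                self.running[job.job_id] = job
                started.append(job.job_id)
            else:
                waiting.append(job)
        self.backlog = waiting
        return started


# Two physical GPUs, quotas summing to four (oversubscribed).
sched = QuotaScheduler(total_gpus=2, quotas={"alice": 2, "bob": 2})
first = sched.submit(Job("a1", "alice", 2))   # starts immediately
second = sched.submit(Job("b1", "bob", 1))    # waits: no free GPUs
third = sched.submit(Job("a2", "alice", 1))   # waits behind b1
resumed = sched.finish("a1")                  # backlog drains in order
```

In this sketch a job is started only when it fits both the student's quota and the free physical GPUs, so oversubscribed quotas lead to queueing rather than failure. A production system would more likely delegate this to an existing cluster scheduler than implement it by hand.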
Full Product Design and Development Service
We worked with the system vendor’s engineering and product management teams, providing consulting and engineering services throughout the product development process:
Training: We offered multiple training and demo sessions to educate the product managers and engineers about machine learning algorithm development, training, and tuning workflows.
Product requirements definition: We helped product managers to define product requirements that would best fit the needs of the clients.
Project planning: We worked with the engineering team to define an Agile project plan that rolled out features to the product team over a few iterations.
Development: We helped the engineering team implement Dockerized services, data analytics, machine learning workflow management, and model management.
Testing: We developed a set of data science and machine learning tests to validate the correct operation of the system in different cluster configurations.
We completed the product release on time and fully satisfied the product team's requirements.