Polyaxon v1.9: Serving models in-cluster guides, UI managed connections, Improved Tensorboard services, Improved upload command

Today, we are pleased to announce the v1.9 release of our MLOps platform, a stable version that brings several new guides for serving machine learning REST APIs on Kubernetes, Tensorboard in sync mode, upload command improvements, new copy mode capabilities, and UI-based connections management.

Agent Connections

Polyaxon provides an agent management UI that allows users to add, update, and remove agents, and to manage the queues associated with each agent.

Each agent in Polyaxon is deployed on a Kubernetes cluster or namespace where users can schedule their workload and mount volumes, secrets, and connections.

Until today, Polyaxon provided a global connections UI that showed all connections attached to an organization, where end users could order the connections per agent.

In this version we released a new UI to show connections directly on the agent’s settings page:

agent-connections-viewing

We also enabled a new feature for managing connections directly from the UI instead of the YAML file, so users can add, remove, and update their connections without redeploying the agent and without downtime:

agent-connections-management

This new feature can be enabled from each agent's settings page, and it can be disabled again at any time to go back to config-file-based connections management.

agent-settings

Note: In both cases, cluster admins will still need to manage the artifacts store, the ingress, and other important initial configuration directly from the config deployment file.

Improved UI for pipelines and project overview

The Polyaxon UI now shows basic stats on every project overview page and on every Matrix/DAG run overview page.

  • Project overview

project-overview

  • Pipeline overview

run-overview

Runtime restriction tab

The project settings page has a new advanced restriction tab that allows admins to enable or disable the kinds of workload allowed on each project. This new restriction is useful for defining projects that only need access to services, or for preventing projects from starting hyperparameter tuning operations.

project-runtime-restriction

New pending logic

In previous versions, Polyaxon had an indicator is_approved that was initially made for providing a human-in-the-loop validation process, where operations marked as awaiting approval would require user intervention to allow them to be scheduled on the Kubernetes cluster.

The is_approved logic was also used by the upload command to provide a synchronous process for uploading and then starting an operation; however, that led to a couple of edge cases. In this version, we refactored the logic to expose a more generic pending mechanism, which is now used by the approval process, the upload process, and the cache service.

Runs are marked as pending and the UI shows what type of action is required to resume scheduling:

  • pending: approval

run-pending-approval

  • pending: upload

run-pending-upload

Finally, this new pending logic allows the platform to perform the checks and compilation asynchronously, whereas previously the process had to be synchronous.

Improved upload command

We are very happy to finally announce that the upload command is now stable and generally available. For the last couple of months, the upload command was in a beta phase and had several edge cases. In this release we reworked several aspects of the upload logic to handle:

  • Upload and eager mode: Polyaxon CE users can now start a grid or random search or a mapping in eager mode with the upload command.
  • Restarting operations where the code or some artifacts were initialized using the upload command.
  • Auto-tracking of lineage information of the uploaded artifacts.

Improved restart logic

The Polyaxon restart command has also received several quality-of-life enhancements: users can now restart operations with new names, descriptions, and tags. The restart command will also automatically forward any artifacts that were necessary during the initialization process (previously the restart logic would fail for runs started with the upload command).

The restart command with the copy mode was also significantly improved; previously the copy mode copied all artifacts, but we have heard from users that they sometimes need more control over the artifacts to copy during the restart process, for example copying a specific checkpoint in case of a deep learning experiment.

The CLI exposes new arguments that allow users to specify the directories and/or files to copy during the restart process:

polyaxon ops restart --copy-dir=dir1 --copy-dir=path/dir2 --copy-file=path/dir3/file1

And to copy everything, they can use the --copy flag:

polyaxon ops restart --copy

New plugin to mount the artifacts store

Several users request the default artifacts store for their operations, whether to have direct access to previous runs’ outputs or to manage new outputs manually. Previously, in order to request the default artifacts store, users needed to add the store name to the connections list in their manifests. This is still possible, but for most use cases it means that the components and operations are decorated with specific values and depend on the deployment configuration of that Polyaxon instance.

Starting with v1.9.1, users can request the artifacts store without setting any connection name, which should improve portability and readability of the polyaxonfiles:

plugins:
  mountArtifactsStore: true

Additionally, requesting the artifacts store via the plugins section will automatically inject a new global variable in the context, globals.store_path, which removes the need to access the schema or to know beforehand whether the artifacts store is a volume, a host path, or a blob storage bucket.
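As a minimal sketch of how the two pieces fit together, the component below requests the artifacts store through the plugins section and passes globals.store_path to the container. IMAGE and COMMAND are placeholders, as in the other examples in this post, and the --data-dir flag is a hypothetical argument of the user's own program, not something Polyaxon defines:

```yaml
version: 1.1
kind: component
plugins:
  # Request the default artifacts store without naming a connection
  mountArtifactsStore: true
run:
  kind: job
  container:
    image: IMAGE
    command: [COMMAND]
    # --data-dir is a hypothetical flag of the user's program;
    # globals.store_path is resolved by Polyaxon at compile time
    args: ['--data-dir={{ globals.store_path }}']
```

Because no connection name appears in the manifest, the same file should be reusable across deployments that configure their artifacts store differently.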

New Tensorboard versions

Based on the new mountArtifactsStore plugin, we are now distributing two new Tensorboard versions, tensorboard:single-run-storepath and tensorboard:multi-run-storepath, that start Tensorboard directly from the artifacts store instead of using the initializer.

These versions are useful for starting a service for experiments that are still running, since Tensorboard will keep syncing the data.

Note: These versions will not work with Azure blob, Minio artifacts stores, or other unsupported backends.
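As a rough sketch, one of these versions could be requested with an operation like the one below. The uuid parameter name is an assumption carried over from the existing tensorboard:single-run component, and RUN_UUID is a placeholder for the run to visualize; check the component's declared inputs before relying on either:

```yaml
version: 1.1
kind: operation
# Store-path variant: reads event files from the mounted artifacts store
hubRef: tensorboard:single-run-storepath
params:
  # Assumed input name; RUN_UUID is a placeholder, not a real value
  uuid:
    value: RUN_UUID
```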

Other improvements

Converting all parameters to args

The context now has a new variable, params.as_args, that converts all params to CLI arguments. For instance, the following example:

version: 1.1
kind: component
inputs:
  - name: message1
    type: str
  - name: message2
    type: str
  - name: message3
    type: str
run:
  kind: job
  container:
    image: IMAGE
    command: [COMMAND]
    args: ['--message1={{ message1 }}', '--message2={{ message2 }}', '--message3={{ message3 }}']

can be written using params.as_args:

version: 1.1
kind: component
inputs:
  - name: message1
    type: str
  - name: message2
    type: str
  - name: message3
    type: str
run:
  kind: job
  container:
    image: IMAGE
    command: [COMMAND]
    args: '{{ params.as_args }}'

Docs

We refactored the intro section of the documentation to provide a comprehensive tutorial and guides for getting started with Polyaxon. This refactoring was based on feedback from several users. The idea is to provide a simple introduction to several aspects of the platform and to reserve the reference sections for users who need to customize specific aspects of their manifests.

We also added new sections and tutorials for hosting and serving in-cluster REST APIs for ML models, based on Streamlit, FastAPI, and Flask. These tutorials are not intended to replace platforms that are built for production serving and deployment. Rather, they aim to give a simple infrastructure and guidelines for hosting internal APIs and services, or for testing before moving to a system like KFServing.

Upcoming automatic build

We are making good progress on reintroducing an automatic build process, and we intend to release a beta version on Polyaxon Cloud next week or the week after. This initial release will have the following characteristics:

  • It will not replace the ad-hoc build operations; users can still create independent polyaxonfiles with a kaniko/dockerize hub ref.
  • The no-build requirement that the platform provides does not change; users who have stable pipelines that do not require frequent changes to their images can safely ignore this feature.
  • A new section called build allows users to define the necessary fields for creating a container, as well as other flags for the queue, preset, resources, node selectors, … specific to the build.
  • An image based on the project and the run’s uuid, i.e. project:build-uuid, is generated automatically and set on the main container.
  • When the build and matrix sections are used together, a single build operation will be scheduled and will be used for all runs.

Learn More about Polyaxon

This blog post only goes over a couple of the features that we have shipped since our last product update; several other features and fixes are worth checking. To learn more about all the features, fixes, and enhancements, please visit the release notes, the current known issues, and the short-term roadmap.

Polyaxon continues to grow quickly and keeps improving and providing the simplest machine learning layer on Kubernetes. We hope that these updates will improve your workflows and increase your productivity, and again, thank you for your continued feedback and support.
