Polyaxon v1.5: Events, Hooks, & Joins

Today, we are pleased to announce the v1.5 release, a stable version which brings several new features, enhancements, and fixes. This release does not introduce any breaking changes and is fully compatible with previous releases.

Joins

Until now, Polyaxon provided several interfaces for fanning out operations either with a list of parameters using Mapping or based on a hyperparameter tuning algorithm supported by the Matrix section.

The Join interface is a new abstraction for performing fan-in operations. A typical use case for such an interface is the map-reduce pattern, but it’s also the interface Polyaxon uses to provide performance-based Tensorboards, i.e. starting a Tensorboard based on a search: {query: metrics.loss:<0.01, sort: metrics.loss, limit: 10}.

Polyaxon Join is not a replacement for other map-reduce frameworks; rather, it provides a convenient way to collect all inputs, outputs, lineages, contexts, and artifacts from upstream runs based on the Polyaxon Query Language.

A Join can be used either in an independent operation or in the context of a DAG, and each operation can perform one or multiple joins.
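To make the fan-in idea concrete, here is a minimal Python sketch of what a join conceptually does: filter upstream runs with a query, sort them, apply a limit, and collect one value per matched run. This is not Polyaxon's implementation; the run records, `join_runs`, and `collect` are hypothetical names used only for illustration.

```python
# Hypothetical run records; in Polyaxon these would be tracked runs in a project.
RUNS = [
    {"uuid": "a1", "metrics": {"loss": 0.004}, "artifacts": {"tensorboard": "a1/tb"}},
    {"uuid": "b2", "metrics": {"loss": 0.200}, "artifacts": {"tensorboard": "b2/tb"}},
    {"uuid": "c3", "metrics": {"loss": 0.009}, "artifacts": {"tensorboard": "c3/tb"}},
    {"uuid": "d4", "metrics": {"loss": 0.001}, "artifacts": {"tensorboard": "d4/tb"}},
]


def join_runs(runs, max_loss, limit):
    """Fan-in: keep runs with loss below `max_loss`, sort by loss, apply `limit`."""
    matched = [run for run in runs if run["metrics"]["loss"] < max_loss]
    matched.sort(key=lambda run: run["metrics"]["loss"])
    return matched[:limit]


def collect(runs, section, key):
    """Gather one value per matched run into a single list."""
    return [run[section][key] for run in runs]


# Roughly: query "metrics.loss:<0.01", sort "metrics.loss", limit 2.
top = join_runs(RUNS, max_loss=0.01, limit=2)
tensorboard_dirs = collect(top, "artifacts", "tensorboard")
print(tensorboard_dirs)  # ['d4/tb', 'a1/tb']
```

The point of the sketch is that the downstream operation never names the upstream runs explicitly; it only states the criteria, and the collected values arrive as a list.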

Let’s look at some concrete examples.

Performance-based Tensorboard

A performance-based Tensorboard operation allows starting a Tensorboard dynamically, based on some criteria, without prior knowledge of the runs’ IDs.

version: 1.1
kind: operation
name: compare-top-experiments
joins:
- query: "metrics.loss:<0.01"
  sort: "metrics.loss"
  limit: "5"
  params:
    tensorboards:
      value: {dirs: "{{ artifacts.tensorboard }}"}
component:
  inputs:
  - {name: tensorboards, type: artifacts, toInit: true}
  run:
    kind: service
    ports:
    - 6006
    container:
      image: tensorflow/tensorflow:2.2.0
      command:
      - tensorboard
      args:
      - '--logdir={{globals.artifacts_path}}'
      - '--port={{globals.ports[0]}}'
      - '--path_prefix={{globals.base_url}}'
      - '--host=0.0.0.0'

In this example, Polyaxon will automatically perform a search and collect artifacts logged under the name tensorboard. Note that with the artifacts prefix, Polyaxon will look in the lineage table. If you do not log the lineage using Polyaxon, you can still pass a subpath instead, e.g. sub-path/in/each/run/in/the/search.

Map-Reduce

Joins can be used to automate a fan-out -> fan-in, or map-reduce, process.

version: 1.1
kind: component
run:
  kind: dag
  operations:
  - name: fan_out
    hubRef: "my-component:v1"
    matrix:
      kind: random
      numRuns: 20
      params:
        learning_rate:
          kind: linspace
          value: 0.001:0.1:5
        dropout:
          kind: choice
          value: [0.25, 0.3]
        conv_activation:
          kind: pchoice
          value: [[relu, 0.2], [sigmoid, 0.8]]
        epochs:
          kind: choice
          value: [5, 10]
  - name: fan_in
    params:
      matrix_uuid:
        ref: ops.fan_out
        value: globals.uuid
        contextOnly: true
    joins:
    - query: "metrics.accuracy:>0.9, pipeline:{{ matrix_uuid }}"
      sort: "-metrics.accuracy"
      params:
        uuids: {value: "globals.uuid", contextOnly: true}
        learning_rates: {value: "inputs.learning_rate", contextOnly: true}
        accuracies: {value: "outputs.accuracy", contextOnly: true}
        losses: {value: "outputs.loss", contextOnly: true}
    component:
      run:
        kind: job
        container:
          image: image
          command: ["/bin/bash", "-c"]
          args: ["echo {{ uuids }}; echo {{ learning_rates }}; echo {{ accuracies }}; echo {{ losses }}"]

In the example above, instead of searching the complete project, we restrict the search to a specific subset defined by the pipeline managing the random search algorithm (the same logic can be used for Mapping, grid search, Bayesian optimization, …).

In this example, the reduce operation does nothing important; it just prints some of the collected inputs and outputs.
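A more useful reduce step could, for instance, pick the best run out of the joined values. The Python sketch below assumes the join delivers `uuids`, `learning_rates`, `accuracies`, and `losses` as parallel lists in the same order (an assumption made for this illustration); `best_run` is a hypothetical helper, not part of Polyaxon.

```python
def best_run(uuids, learning_rates, accuracies, losses):
    """Return the run with the highest accuracy, breaking ties by lowest loss."""
    if not (len(uuids) == len(learning_rates) == len(accuracies) == len(losses)):
        raise ValueError("joined lists must have the same length")
    if not uuids:
        return None  # the join matched no runs
    rows = zip(uuids, learning_rates, accuracies, losses)
    uuid, lr, acc, loss = max(rows, key=lambda row: (row[2], -row[3]))
    return {"uuid": uuid, "learning_rate": lr, "accuracy": acc, "loss": loss}


# Illustrative values, not real results.
summary = best_run(
    uuids=["a", "b", "c"],
    learning_rates=[0.001, 0.05, 0.1],
    accuracies=[0.91, 0.95, 0.95],
    losses=[0.3, 0.2, 0.1],
)
print(summary)
```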

Hooks

Note: Hooks are currently available in the commercial version only, but will be available in Polyaxon CE soon.

If you are using Polyaxon, you are already aware that you can provide:

  • init containers: an interface for users to run init containers before the main container containing the logic for training models or processing data.
  • sidecar containers: specialized containers running as sidecars to the main container.

The hooks interface extends this lifecycle with post-done logic. Typically, hooks are operations that run after the main logic to notify external systems, trigger evaluation logic, generate reports, …

Compared to the init and sidecar abstractions, we made the decision to run hooks outside of the pod where the main logic is running, for several reasons:

  • To allow users to release important resources, e.g. GPU/TPU that might not be needed for running the hook(s).
  • To make clear what users should run in such operations. We expect users to use this interface for recurrent, abstracted logic that depends on an upstream operation yet applies to most operations with similar characteristics.

Users can run full components, with their own init and sidecar containers, in hooks, and can run many hooks per operation. Each hook runs based on:

  • A trigger.
  • A set of conditions.
  • The full context of the main operation.

All valid hooks will be automatically scheduled to run as soon as the main operation reaches a final status.
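The selection logic described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Hook` class, the status names, and the `done` trigger meaning "any final status" are hypothetical here and are not taken from Polyaxon's code.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

# Assumed set of final statuses for this sketch.
FINAL_STATUSES = {"succeeded", "failed", "stopped"}


@dataclass
class Hook:
    name: str
    trigger: str  # a final status, or "done" for any final status
    condition: Optional[Callable[[Dict], bool]] = None  # evaluated on the context


def hooks_to_schedule(hooks, status, context):
    """Return the names of hooks whose trigger and condition both match."""
    if status not in FINAL_STATUSES:
        return []  # hooks only run once the main operation has finished
    selected = []
    for hook in hooks:
        if hook.trigger not in ("done", status):
            continue
        if hook.condition is None or hook.condition(context):
            selected.append(hook.name)
    return selected


hooks = [
    Hook("notify", "done"),
    Hook("evaluate", "succeeded", lambda ctx: ctx["outputs"]["accuracy"] > 0.9),
    Hook("alert", "failed"),
]
print(hooks_to_schedule(hooks, "succeeded", {"outputs": {"accuracy": 0.93}}))
```

Note that a condition is evaluated only after the trigger matches, so a hook waiting on `succeeded` never inspects the outputs of a failed run.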

Events

Events are the last major addition to the Polyaxonfile specification in this release. Users can now run DAGs with full events support.

Most workflow orchestrators support an aggregated upstream condition, e.g. all succeeded, all failed, or all done. In other words, an orchestrator schedules an operation or a task only after the upstream is finished. That was also the case for Polyaxon until this release.

Events allow starting an operation in response to any event generated by an upstream entity. In this release, the entity is an operation running in the same DAG context. In the future, the entity could be anything from a Git commit, a new S3 blob, to an internal alert, or a new registered model version.

Here’s an example of starting a Tensorboard as soon as a training operation starts running:

version: 1.1
kind: component
name: experiment-with-tensorboard
run:
  kind: dag
  operations:
  - name: experiment
    pathRef: "./experiment.yml"
    params:
      learning_rate:
        value: 0.005
      epochs:
        value: 10
  - name: tensorboard
    hubRef: tensorboard
    termination:
      timeout: 7200
    params:
      uuid:
        ref: ops.experiment
        value: globals.uuid
    events:
      - ref: ops.experiment
        kinds: [run_status_running]

This DAG will schedule two operations: a job for training a DL experiment and a Tensorboard to visualize the outputs of the experiment. Instead of waiting for the experiment to finish before starting a Tensorboard, or copying the UUID of the job to start the Tensorboard manually, this DAG will schedule the Tensorboard automatically as soon as the training starts running. If the training fails because of some compilation error, the Tensorboard will be marked as skipped.
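The behavior described for this DAG can be modeled with a small Python sketch: the downstream operation is scheduled when an awaited event kind appears, and skipped if the upstream reaches a final status first. Only `run_status_running` comes from the example above; the other event kind names and the `resolve_downstream` function are assumptions made for illustration.

```python
# Assumed names for final-status events in this sketch.
FINAL_KINDS = {"run_status_succeeded", "run_status_failed", "run_status_stopped"}


def resolve_downstream(upstream_events, awaited_kinds):
    """Walk upstream events in order and decide the downstream operation's fate."""
    for kind in upstream_events:
        if kind in awaited_kinds:
            return "scheduled"  # the awaited event fired
        if kind in FINAL_KINDS:
            return "skipped"  # upstream finished without emitting the awaited event
    return "waiting"  # upstream is still in progress


awaited = {"run_status_running"}
print(resolve_downstream(["run_status_created", "run_status_running"], awaited))
print(resolve_downstream(["run_status_created", "run_status_failed"], awaited))
```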

Compiler

Optimized compilation and context resolution

This version brings several new heuristics to optimize the process of resolving and converting Polyaxonfiles, with gains ranging from a couple of milliseconds to a second in some cases, which should translate to faster scheduling of operations.

We also consolidated the interface for requesting and resolving information from the compiler’s context, and moved the context to its own documentation section.

Improved init artifacts specification

This was requested several times. The artifacts initializer has been extended to accept a custom destination path where it should store the collected artifacts:

init:
  files:
    - file1
    - - file2
      - path/to/store/file2
  dirs:
    - dir1
    - - dir2
      - path/to/store/dir2

file1 and dir1 use the form users are already familiar with. file2 and dir2 show the new capability: the initializer accepts a tuple specifying the source path and the destination path.
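One way to think about the two accepted forms is as a normalization step that turns every entry into a (from, to) pair. The Python sketch below is a hypothetical illustration of that idea, not the initializer's actual code; `normalize_entries` and the use of `None` for "default destination" are assumptions.

```python
def normalize_entries(entries):
    """Turn a mix of `path` and `[path_from, path_to]` items into (from, to) pairs."""
    pairs = []
    for entry in entries:
        if isinstance(entry, str):
            # Plain path: None stands for the initializer's default destination.
            pairs.append((entry, None))
        elif isinstance(entry, (list, tuple)) and len(entry) == 2:
            path_from, path_to = entry
            pairs.append((path_from, path_to))
        else:
            raise ValueError(f"unsupported entry: {entry!r}")
    return pairs


files = ["file1", ["file2", "path/to/store/file2"]]
print(normalize_entries(files))
# [('file1', None), ('file2', 'path/to/store/file2')]
```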

Restart with copy improvement

As a result of the previous enhancement, when a user restarts an operation using the copy mode, the copied information will be initialized automatically under the new run.

Note: Restart with copy mode is an advanced feature to allow users to resume training of an experiment with updated code/params/configuration/resources multiple times without mutating the original run.

New input type UUID

The Polyaxonfile parser can now validate uuid types properly. Users can use this new type instead of str to validate inputs and fail faster.
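To illustrate why a dedicated type fails faster than str, here is a minimal Python sketch using the standard library's `uuid` module. It shows the general idea of rejecting a malformed value at parse time; it is not the Polyaxonfile parser's code, and `parse_uuid` is a hypothetical helper.

```python
import uuid


def parse_uuid(value):
    """Return a uuid.UUID, or raise ValueError if `value` is not a valid UUID."""
    if isinstance(value, uuid.UUID):
        return value
    if not isinstance(value, str):
        raise ValueError(f"expected a UUID string, got {type(value).__name__}")
    return uuid.UUID(value)  # raises ValueError on malformed input


for candidate in ["8aac02e3-c3b0-4f59-9d3d-1e0c3a1f2b7a", "not-a-uuid"]:
    try:
        print("valid:", parse_uuid(candidate))
    except ValueError:
        print("invalid:", candidate)
```

With a plain str input, the second value would only cause an error later, when something tries to look up a run with it.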

UI

Several new lineage tabs

It was always possible to filter all upstream/downstream runs based on another run, or all runs that are clones of a specific run, but pulling such information required a manual search in the comparison table.

Polyaxon UI now exposes several new lineage tabs to show:

  • Artifacts & Connections requested for a specific run.
  • All clones.
  • Upstream & Downstream edge runs.

Improved run’s overview page

  • Namespace and artifacts store.
  • Better documentation with Readme (the markdown preview has a similar style as the run’s overview page, in both light and dark themes).
  • IO tables with search and pagination.
  • Copying of the complete inputs and outputs as JSON objects.

Wait time

The dashboard now shows a new field, wait time, which represents all phases that come before an operation is scheduled on Kubernetes. This information is available on the overview page and in the comparison table. Users can query and sort by wait time, as with other metadata.

By analyzing the wait time of runs filtered by a specific queue, you should have more context to optimize and organize your queues.
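As a rough illustration of that kind of analysis, the Python sketch below averages wait time per queue. It assumes wait time can be approximated as the gap between a run's creation and its scheduling timestamps; the `created_at`, `scheduled_at`, and `queue` field names and the sample values are hypothetical.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def average_wait_by_queue(runs):
    """Return the mean wait time in seconds for each queue."""
    waits = defaultdict(list)
    for run in runs:
        wait = (run["scheduled_at"] - run["created_at"]).total_seconds()
        waits[run["queue"]].append(wait)
    return {queue: mean(values) for queue, values in waits.items()}


# Illustrative records, not real measurements.
runs = [
    {"queue": "gpu", "created_at": datetime(2021, 1, 1, 10, 0, 0),
     "scheduled_at": datetime(2021, 1, 1, 10, 5, 0)},
    {"queue": "gpu", "created_at": datetime(2021, 1, 1, 10, 10, 0),
     "scheduled_at": datetime(2021, 1, 1, 10, 11, 40)},
    {"queue": "cpu", "created_at": datetime(2021, 1, 1, 10, 0, 0),
     "scheduled_at": datetime(2021, 1, 1, 10, 0, 30)},
]
print(average_wait_by_queue(runs))  # {'gpu': 200.0, 'cpu': 30.0}
```

A queue whose average wait is consistently high may need more concurrency or more resources behind it.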

Consolidated pipeline overview

We moved several aspects that provide information about the pipeline into a subsection:

  • Pipeline Concurrency
  • Children run kinds
  • Pipeline progress

Docs

We published a new how-tos section that should feature short guides on how to use Kubernetes capabilities, as well as answers to recurrent questions that we get from users. We also improved and published several new sections about scheduling and how the compiler resolves information. We are in the process of reworking several guides in the quick-start, as well as introducing a new examples section with guides to understand how some features work.

Learn More about Polyaxon

This blog post only goes over a couple of the features that we shipped since our last product update; there are several other features and fixes worth checking. To learn more about all the features, fixes, and enhancements, please visit the release notes.

Polyaxon continues to grow quickly and keeps improving and providing the simplest machine learning layer on Kubernetes. We hope that these updates will improve your workflows and increase your productivity, and again, thank you for your continued feedback and support.
