Getting Started
This guide describes in detail the contents of the example package on GitHub and introduces dagster concepts.
Motivation
Let us first motivate the need for this library. Composable graphs have two important qualities that enable several potential new uses:
- Integrated. Because composable graphs are YAML-based, it is possible to create dagster jobs from any language. Even within a Python application it may be beneficial to define jobs through a more limited API, since complex jobs can become unmaintainable in part because Python-native functionality is used to define them.
- Dynamic. Changing the YAML file and reloading the code location is all it takes to update a job. Because the composable graph definition may be loaded from any location over the network, dagster jobs can be changed dynamically as needed. A possible application of this dynamism is a GUI-based job editor that enables a no-code approach to defining dagster jobs.
With these advantages in mind, the sections below show how to define a job in plain dagster and how to define the same job using a composable graph.
Define a job
In dagster an op is the smallest unit of computation. Each op is executed separately and forms the basis of dagster's functionality. Ops are arranged into graphs to enable reuse. The main difference between ops and graphs is that the body of a graph must be composed entirely of dagster ops; that is, it is not possible to mix dagster ops and plain Python functions inside a graph.
Finally, these ops and graphs are combined into dagster jobs. For example:
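The original listing is not reproduced here; the following is a minimal sketch, where the op names return_five and add_one are illustrative and not taken from the example package:

```python
from dagster import job, op


@op
def return_five() -> int:
    # Smallest unit of computation: an op with no inputs.
    return 5


@op
def add_one(number: int) -> int:
    # A second op whose input is wired to the output of return_five below.
    return number + 1


@job
def example_job():
    # The job body composes ops; dagster infers the dependency graph
    # from how outputs are passed as inputs.
    add_one(return_five())
```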
Notice that because jobs are typically defined as Python functions, it is difficult to change them at runtime or to create them from programming languages other than Python.
Once a job is defined as above, it is exposed to dagster as part of the code location. In this minimal example we create a file named code_location.py as follows:
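The original listing is omitted here; a minimal sketch, assuming the job above lives in a module named example_job.py (a hypothetical name), could look like this:

```python
# code_location.py
from dagster import Definitions

from example_job import example_job  # hypothetical module containing the job above

# The Definitions object is what dagster discovers when it loads this
# file as a code location.
defs = Definitions(jobs=[example_job])
```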
Then start the dagster webserver by running:
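The exact command is not shown in this extract; a typical invocation, assuming a recent dagster release that provides the dagster dev command, would be:

```bash
dagster dev -f code_location.py
```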
Opening dagster in a browser then shows the job we just created.
Define a composable graph
Instead of defining the job as Python code, we write a file in .yaml format following a specific schema. The job above would be defined as:
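The original YAML listing is not reproduced here. The sketch below is purely illustrative and does not reflect the library's exact schema; only the operations, dependencies, and function fields mentioned in this guide come from the text, while everything else (the name field, the dotted paths, the module name example_job) is assumed:

```yaml
# Illustrative sketch only; consult the library's schema for the real format.
name: example_job
operations:
  - name: return_five
    function: example_job.return_five   # dotted path to a dagster op or graph
  - name: add_one
    function: example_job.add_one
dependencies:
  add_one:
    number: return_five   # wire add_one's "number" input to return_five's output
```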
Here the two relevant sections are operations and dependencies. The former defines the nodes in the graph and references the Python function that defines each of them; note that both dagster ops and graphs are supported. The latter connects the inputs of nodes defined in operations to the outputs of other nodes.
Instead of the code location presented in the previous section, the job is now created as follows:
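Again, the original listing is omitted; the sketch below assumes an import path for compose_job and that it accepts a parsed YAML document, neither of which is confirmed by this guide:

```python
# code_location.py
from pathlib import Path

import yaml
from dagster import Definitions

# Hypothetical import path; the module that exposes compose_job may differ.
from composable_graphs import compose_job

# Load the composable graph definition. It could equally be fetched over
# the network instead of read from a local file.
definition = yaml.safe_load(Path("example_job.yaml").read_text())

# compose_job builds a dagster job from the definition by importing the
# referenced ops and graphs at runtime (exact signature assumed).
defs = Definitions(jobs=[compose_job(definition)])
```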
In this revised version the job is no longer imported but created using the function compose_job. The ops and graphs composed into the job are imported dynamically using the value provided in the function field. Because these are resolved at runtime, it is possible to modularize and fully decouple the definition of ops and graphs from the jobs that execute them. This is conceptually similar to how assets may be defined in multiple code locations, as described in the dagster documentation.
Further reading
- Read the post “Abstracting Pipelines for Analysts with a YAML DSL” on the dagster blog, which inspired this idea.