
Getting Started

Introduction

This guide describes in detail the contents of the example package on GitHub and introduces dagster concepts.

Motivation

Let us first motivate the need for this library. Composable graphs have two important qualities that enable several potential new uses:

  1. Integrated. Because composable graphs are YAML-based, it is possible to create dagster jobs from any language. Even within a Python application it may be beneficial to define jobs using a more limited API to simplify job definition, since complex jobs may become unmaintainable precisely because Python-native functionality is used to define them.

  2. Dynamic. Changing the YAML file and reloading the code location is all it takes to update a job. Combined with the fact that the composable graph definition may be loaded from any location over the network, this makes it possible to change dagster jobs dynamically as needed.

    A possible application of this dynamism is to have a GUI-based job editor to enable a no-code approach to defining dagster jobs.

Considering these advantages, the sections below show how to define a job in plain dagster and how to define the same job as a composable graph.

Defining a job

In dagster an op is the smallest unit of computation. Each op is executed separately and is the basis of dagster functionality. Ops are arranged in graphs to enable reusability. The main difference between ops and graphs is that the body of a graph must be composed entirely of dagster ops; it is not possible to mix dagster ops and plain Python functions.

Finally these ops and graphs are combined into dagster jobs. For example:

jobs.py

```python
from dagster import job, op, graph


@op
def return_five():
    return 5


@op
def add_one(arg):
    return arg + 1


@graph
def return_six():
    # A graph that combines two ops.
    return add_one(return_five())


@job
def return_seven():
    # To define the job both an op and a graph are used.
    add_one(return_six())
```

Notice that because jobs are typically defined as Python functions, it is difficult to change them at runtime or to create them from programming languages other than Python.

Once a job is defined as above it is exposed to dagster as part of the code location. In this minimal example we create a file named code_location.py as follows:

code_location.py

```python
from dagster import Definitions

# Assume jobs.py from above sits next to this file.
from jobs import return_seven

defs = Definitions(jobs=[return_seven])
```

Then start the dagster webserver by running

Terminal window

```shell
dagster dev -f code_location.py
```

Opening dagster in a browser then shows the job we just created.

Defining a composable graph

Instead of defining the job as Python code, we write a file in YAML format with a certain schema. The job above would be defined as:

return_seven.yaml

```yaml
apiVersion: truevoid.dev/v1alpha1
kind: ComposableGraph
metadata:
  name: return-seven
spec:
  operations:
    - name: return_six
      function: jobs.return_six
    - name: add_one
      function: jobs.add_one
  dependencies:
    - name: add_one
      inputs:
        - node: return_six
```

Here the two relevant sections are operations and dependencies. The former defines the nodes in the graph and references the Python function that implements each of them; notice that dagster ops and graphs are both supported. The latter connects the inputs of nodes defined in operations to the outputs of other nodes.
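To make the mapping concrete, the sketch below queries a parsed spec (written here as a plain Python dict mirroring the YAML above) for the upstream nodes of a given node. The helper upstream_nodes is hypothetical and only illustrates the structure of the dependencies section; it is not part of the library.

```python
# The spec as it would look after parsing the YAML file above.
spec = {
    "operations": [
        {"name": "return_six", "function": "jobs.return_six"},
        {"name": "add_one", "function": "jobs.add_one"},
    ],
    "dependencies": [
        {"name": "add_one", "inputs": [{"node": "return_six"}]},
    ],
}


def upstream_nodes(spec, name):
    """Return the names of nodes whose outputs feed the given node."""
    for dep in spec.get("dependencies", []):
        if dep["name"] == name:
            return [inp["node"] for inp in dep["inputs"]]
    return []


print(upstream_nodes(spec, "add_one"))  # ['return_six']
```

Nodes that appear in operations but not in dependencies, such as return_six here, have no upstream inputs.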

Instead of importing the job as in the previous section, the code location now creates it as follows:

code_location.py

```python
from dagster import Definitions

from dagster_composable_graphs.compose import (
    compose_job,
    load_graph_def_from_yaml,
)

defs = Definitions(
    jobs=[compose_job(load_graph_def_from_yaml("return_seven.yaml"))]
)
```

In this revised version the job is no longer imported but created using the function compose_job. The ops and graphs composed into the job are imported dynamically using the value provided in the function field. Because these are resolved at runtime, it is possible to modularize and fully decouple the definition of ops and graphs from the jobs that execute them. This is conceptually similar to how assets may be defined in multiple code locations, as described in the dagster documentation.
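Dynamic imports of this kind can be done with the Python standard library alone. The helper below, import_by_path, is a hypothetical illustration of how a dotted path such as jobs.return_six may be resolved at runtime; the library's actual implementation may differ.

```python
import importlib


def import_by_path(path: str):
    """Import an attribute given a dotted path such as 'jobs.return_six'."""
    module_name, _, attribute = path.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, attribute)


# Using a standard-library function as a stand-in for an op:
sqrt = import_by_path("math.sqrt")
print(sqrt(9.0))  # 3.0
```

Because the path is an ordinary string, it can come from a YAML file, a database, or a network location, which is what enables the decoupling described above.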

Further reading

  • Read the post “Abstracting Pipelines for Analysts with a YAML DSL” on the dagster blog that started this idea.