# prefect-dask

Use `prefect-dask` to speed up code with parallelization.

## Run tasks on Dask

The `DaskTaskRunner` is a parallel task runner that submits tasks to the `dask.distributed` scheduler.
By default, a temporary Dask cluster is created for the duration of the flow run.
For example, this flow counts up to 10 in parallel (note that the output is not sequential).
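A minimal sketch of such a flow might look like the following (the task and flow names are illustrative):

```python
from prefect import flow, task
from prefect_dask.task_runners import DaskTaskRunner

@task
def shout(number):
    # Each task prints its number; because tasks run in parallel on Dask,
    # the printed order is not guaranteed to be sequential.
    print(f"#{number}")

@flow(task_runner=DaskTaskRunner())
def count_to(highest_number):
    for number in range(highest_number):
        shout.submit(number)

if __name__ == "__main__":
    # The __main__ guard is required because DaskTaskRunner uses
    # multiprocessing (see the note below).
    count_to(10)
```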
If you already have a Dask cluster running, you can provide its connection URL with the `address` argument.
To configure your flow to use the `DaskTaskRunner`:

1. Make sure the `prefect-dask` collection is installed as described earlier: `pip install prefect-dask`.
2. In your flow code, import `DaskTaskRunner` from `prefect_dask.task_runners`.
3. Assign it as the task runner when the flow is defined using the `task_runner=DaskTaskRunner` argument.
For example, this flow uses a `DaskTaskRunner` configured to access an existing Dask cluster at `http://my-dask-cluster`:
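A sketch of that configuration, assuming a scheduler is reachable at that address:

```python
from prefect import flow
from prefect_dask.task_runners import DaskTaskRunner

@flow(task_runner=DaskTaskRunner(address="http://my-dask-cluster"))
def my_flow():
    # Tasks submitted from this flow run on the existing cluster.
    ...
```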
`DaskTaskRunner` accepts the following optional parameters:
| Parameter | Description |
|---|---|
| `address` | Address of a currently running Dask scheduler. |
| `cluster_class` | The cluster class to use when creating a temporary Dask cluster. It can be either the full class name (for example, `"distributed.LocalCluster"`), or the class itself. |
| `cluster_kwargs` | Additional kwargs to pass to the `cluster_class` when creating a temporary Dask cluster. |
| `adapt_kwargs` | Additional kwargs to pass to `cluster.adapt` when creating a temporary Dask cluster. Note that adaptive scaling is only enabled if `adapt_kwargs` are provided. |
| `client_kwargs` | Additional kwargs to use when creating a `dask.distributed.Client`. |
**Multiprocessing safety**
Note that, because the `DaskTaskRunner` uses multiprocessing, calls to flows in scripts must be guarded with `if __name__ == "__main__":` or you will encounter warnings and errors.

If you don't provide the `address` of a Dask scheduler, Prefect creates a temporary local cluster automatically. The number of workers used is based on the number of cores available to your execution environment. The default provides a mix of processes and threads that should work well for most workloads. If you want to specify this explicitly, you can pass values for `n_workers` or `threads_per_worker` to `cluster_kwargs`, as in the sketch below.
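For instance, a sketch of pinning the local cluster's size (the specific values are illustrative):

```python
from prefect_dask.task_runners import DaskTaskRunner

# Use 4 worker processes, each running 2 threads.
DaskTaskRunner(
    cluster_kwargs={"n_workers": 4, "threads_per_worker": 2}
)
```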
## Distributing Dask collections across workers

If you use a Dask collection, such as a `dask.DataFrame` or `dask.Bag`, to distribute the work across workers and achieve parallel computations, use one of the context managers `get_dask_client` or `get_async_dask_client`:
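A sketch using the synchronous context manager (the dataset and date range are illustrative):

```python
import dask
from prefect import flow, task
from prefect_dask import DaskTaskRunner, get_dask_client

@task
def compute_task():
    with get_dask_client() as client:
        # Build a Dask DataFrame and distribute the computation
        # across the cluster's workers.
        df = dask.datasets.timeseries("2000", "2001", partition_freq="4w")
        summary_df = client.compute(df.describe()).result()
    return summary_df

@flow(task_runner=DaskTaskRunner())
def dask_flow():
    prefect_future = compute_task.submit()
    return prefect_future.result()

if __name__ == "__main__":
    dask_flow()
```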
These context managers can be used the same way in both flow run contexts and task run contexts.
**Resolving futures in sync client**
Note, by default, `dask_collection.compute()` returns concrete values while `client.compute(dask_collection)` returns Dask Futures. Therefore, if you call `client.compute`, you must resolve all futures before exiting out of the context manager by either:

- setting `sync=True`
- calling `result()`

For more information, visit the Dask docs on Waiting on Futures.

In an asynchronous context, use `get_async_dask_client` instead.
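A short sketch of the two resolution options inside the sync context manager (continuing the `df` example above):

```python
with get_dask_client() as client:
    df = dask.datasets.timeseries("2000", "2001", partition_freq="4w")

    # Option 1: sync=True returns concrete values directly.
    summary = client.compute(df.describe(), sync=True)

    # Option 2: resolve the returned future before the context exits.
    future = client.compute(df.describe())
    summary = future.result()
```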
**Resolving futures in async client**
With the async client, you do not need to set `sync=True` or call `result()`. However, you must `await client.compute(dask_collection)` before exiting the context manager.

To invoke `compute` from the Dask collection itself, set `sync=False` and `await` the returned future before exiting out of the context manager: `await dask_collection.compute(sync=False)`.
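An async sketch under the same assumptions as the sync example:

```python
import asyncio

import dask
from prefect import flow, task
from prefect_dask import DaskTaskRunner, get_async_dask_client

@task
async def compute_task():
    async with get_async_dask_client() as client:
        df = dask.datasets.timeseries("2000", "2001", partition_freq="4w")
        # Must be awaited before the context manager exits.
        summary_df = await client.compute(df.describe())
    return summary_df

@flow(task_runner=DaskTaskRunner())
async def dask_flow():
    prefect_future = await compute_task.submit()
    return await prefect_future.result()

if __name__ == "__main__":
    asyncio.run(dask_flow())
```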
## Use a temporary cluster
The `DaskTaskRunner` is capable of creating a temporary cluster using any of Dask's cluster-manager options. This can be useful when you want each flow run to have its own Dask cluster, allowing for per-flow adaptive scaling.
To configure, you need to provide a `cluster_class`. This can be:

- A string specifying the import path to the cluster class (for example, `"dask_cloudprovider.aws.FargateCluster"`)
- The cluster class itself
- A function for creating a custom cluster

You can also configure `cluster_kwargs`, which takes a dictionary of keyword arguments to pass to `cluster_class` when starting the flow run.
For example, to configure a flow to use a temporary `dask_cloudprovider.aws.FargateCluster` with 4 workers running with an image named `my-prefect-image`:
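A sketch of that configuration (the image name comes from the example above; the kwargs are forwarded to `FargateCluster`):

```python
from prefect import flow
from prefect_dask.task_runners import DaskTaskRunner

@flow(
    task_runner=DaskTaskRunner(
        cluster_class="dask_cloudprovider.aws.FargateCluster",
        cluster_kwargs={
            "image": "my-prefect-image",
            "n_workers": 4,
        },
    )
)
def my_flow():
    ...
```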
## Connect to an existing cluster

Multiple Prefect flow runs can all use the same existing Dask cluster. You might manage a single long-running Dask cluster (maybe using the Dask Helm Chart) and configure flows to connect to it during execution. This has a few downsides when compared to using a temporary cluster (as described above):

- All workers in the cluster must have dependencies installed for all flows you intend to run.
- Multiple flow runs may compete for resources. Dask tries to do a good job sharing resources between tasks, but you may still run into issues.

To configure a `DaskTaskRunner` to connect to an existing cluster, pass in the address of the scheduler to the `address` argument:
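For example (the scheduler address is a placeholder):

```python
from prefect import flow
from prefect_dask.task_runners import DaskTaskRunner

@flow(task_runner=DaskTaskRunner(address="tcp://my-dask-scheduler:8786"))
def my_flow():
    ...
```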
## Adaptive scaling

One nice feature of using a `DaskTaskRunner` is the ability to scale adaptively to the workload. Instead of specifying `n_workers` as a fixed number, this lets you specify a minimum and maximum number of workers to use, and the Dask cluster will scale up and down as needed.
To do this, you can pass `adapt_kwargs` to `DaskTaskRunner`. This takes the following fields:

- `maximum` (`int` or `None`, optional): the maximum number of workers to scale to. Set to `None` for no maximum.
- `minimum` (`int` or `None`, optional): the minimum number of workers to scale to. Set to `None` for no minimum.
For example, here we configure a flow to run on a `FargateCluster` scaling up to at most 10 workers:
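A sketch of that adaptive configuration:

```python
from prefect import flow
from prefect_dask.task_runners import DaskTaskRunner

@flow(
    task_runner=DaskTaskRunner(
        cluster_class="dask_cloudprovider.aws.FargateCluster",
        # Adaptive scaling is enabled because adapt_kwargs is provided.
        adapt_kwargs={"maximum": 10},
    )
)
def my_flow():
    ...
```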