Ecosystem
There are a number of open source projects that extend the Dask interface and provide different mechanisms for deploying Dask clusters. This list is likely incomplete, so if you spot something missing, please suggest a fix!
Building on Dask
Many packages include built-in support for Dask collections or wrap Dask collections internally to enable parallelization.
Array
xarray: Wraps Dask Array, offering the same scalability but with axis labels, which add convenience when dealing with complex datasets (see the sketch after this list).
cupy: GPU-enabled, NumPy-compatible arrays that can be used as the blocks of Dask Arrays. See the section GPUs for more information.
sparse: Implements sparse arrays of arbitrary dimension on top of numpy and scipy.sparse.
pint: Attaches physical units to array data, allowing arithmetic between quantities and conversions from and to different units.
HyperSpy: Uses Dask to allow for scalability on multi-dimensional datasets where navigation and signal axes can be separated (e.g. hyperspectral images).
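To make the xarray entry concrete, here is a minimal sketch of a Dask-backed xarray computation; the dimension names, array shape, and chunk size are arbitrary choices for the example.

```python
import numpy as np
import xarray as xr

# chunk() makes xarray wrap a Dask Array instead of a NumPy array,
# so the reduction below builds a lazy task graph.
data = xr.DataArray(
    np.random.random((1000, 1000)),
    dims=("time", "location"),
).chunk({"time": 100})

# Axis labels replace positional axis numbers; nothing executes
# until .compute() is called.
mean_over_time = data.mean(dim="time").compute()
```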
DataFrame
cudf: Part of the Rapids project, implements GPU-enabled dataframes which can be used as partitions in Dask DataFrames (see the sketch after this list).
dask-geopandas: Early-stage subproject of geopandas, enabling parallelization of geopandas dataframes.
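As a hedged sketch of the cudf entry above: the snippet below shows the usual usage pattern, assuming a CUDA-capable GPU, an installed RAPIDS stack, and some hypothetical CSV files matching data-*.csv with key and value columns.

```python
import dask_cudf  # requires the RAPIDS libraries and a CUDA GPU

# Each partition of this collection is a cudf.DataFrame resident on
# the GPU; the collection API otherwise mirrors dask.dataframe.
ddf = dask_cudf.read_csv("data-*.csv")  # hypothetical input files
result = ddf.groupby("key").value.mean().compute()
```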
SQL
blazingSQL: Part of the Rapids project, implements SQL queries using cuDF and Dask, for execution on CUDA/GPU-enabled hardware, including referencing externally-stored data.
dask-sql: Adds a SQL query layer on top of Dask (see the sketch after this list). The API matches blazingSQL, but it runs on CPUs instead of GPUs. It is still under development and not ready for production use.
fugue-sql: Adds an abstraction layer that makes code portable across computing frameworks such as Pandas, Spark, and Dask.
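To make the dask-sql entry concrete, here is a minimal sketch using its Context API; the table name and data are invented for the example.

```python
import pandas as pd
import dask.dataframe as dd
from dask_sql import Context

# Register a Dask DataFrame as a SQL table; queries are compiled
# into ordinary Dask DataFrame operations and stay lazy.
c = Context()
df = dd.from_pandas(
    pd.DataFrame({"name": ["a", "b", "a"], "x": [1, 2, 3]}),
    npartitions=2,
)
c.create_table("people", df)

result = c.sql("SELECT name, SUM(x) AS total FROM people GROUP BY name")
print(result.compute())
```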
Machine Learning
dask-ml: Implements distributed versions of common machine learning algorithms.
scikit-learn: Pass ‘dask’ as the joblib backend to parallelize scikit-learn algorithms, with Dask executing the work (see the sketch after this list).
xgboost: Powerful and popular library for gradient boosted trees; includes native support for distributed training using dask.
lightgbm: Similar to XGBoost; LightGBM also natively supports distributed training for gradient boosted decision trees.
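As a sketch of the scikit-learn entry above: with a Client connected, joblib’s ‘dask’ backend ships scikit-learn’s internal parallelism to the cluster. The dataset and parameter grid here are placeholders.

```python
import joblib
from dask.distributed import Client
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

client = Client()  # here a local cluster; any Dask cluster works

X, y = make_classification(n_samples=1000, random_state=0)
search = GridSearchCV(SVC(), {"C": [0.1, 1.0, 10.0]}, cv=3)

# Route joblib's worker pool through the Dask scheduler so the
# cross-validation fits run on cluster workers.
with joblib.parallel_backend("dask"):
    search.fit(X, y)
```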
Deploying Dask
There are many different ways to deploy a Dask distributed cluster.
dask-jobqueue: Deploy Dask on job queuing systems like PBS, Slurm, MOAB, SGE, LSF, and HTCondor (see the sketch after this list).
dask-kubernetes: Deploy Dask workers on Kubernetes from within a Python script or interactive session.
dask-helm: Deploy Dask and (optionally) Jupyter or JupyterHub on Kubernetes easily using Helm.
dask-yarn / Hadoop: Deploy Dask on YARN clusters, such as those found in traditional Hadoop installations.
dask-cloudprovider: Deploy Dask on various cloud platforms such as AWS, Azure, and GCP leveraging cloud native APIs.
dask-gateway: Secure, multi-tenant server for managing Dask clusters. Launch and use Dask clusters in a shared, centrally managed cluster environment, without requiring users to have direct access to the underlying cluster backend.
dask-cuda: Construct a Dask cluster which resembles LocalCluster and is specifically optimized for GPUs.
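Most of these projects share the same usage pattern: instantiate a cluster object, scale it, and connect a distributed Client. A sketch with dask-jobqueue’s SLURMCluster follows; the resource values are illustrative, not recommendations.

```python
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

# Resource requests here are illustrative; match them to your site.
cluster = SLURMCluster(cores=8, memory="16GB", walltime="01:00:00")
cluster.scale(jobs=4)  # submit four Slurm jobs, one worker each

client = Client(cluster)  # computations now run on those workers
```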
Commercial Dask Deployment Options
You can use Coiled to handle the creation and management of Dask clusters on cloud computing environments (AWS and GCP).
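A minimal sketch of the Coiled workflow, assuming an already-configured Coiled account; the worker count is an arbitrary example value.

```python
import coiled
from dask.distributed import Client

# Coiled provisions and manages the cloud resources for you;
# the arguments here are illustrative.
cluster = coiled.Cluster(n_workers=10)
client = Client(cluster)
```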