- Big Data
- 4.8K
Dask Distributed
Dask.distributed. Dask.distributed is a lightweight library for distributed computing in Python. It extends both the…
Dask.distributed. Dask.distributed is a lightweight library for distributed computing in Python. It extends both the…
Dask.distributed. Dask.distributed is a lightweight library for distributed computing in Python. It extends both the concurrent.futures and dask APIs to moderate sized clusters.
This chart will do the following:
Distributed serves to complement the existing PyData analysis stack. In particular it meets the following needs:
Dask.distributed is a centrally managed, distributed, dynamic task scheduler. The central dask-scheduler process coordinates the actions of several dask-worker processes spread across multiple machines and the concurrent requests of several clients.
The scheduler is asynchronous and event driven, simultaneously responding to requests for computation from multiple clients and tracking the progress of multiple workers. The event-driven and asynchronous nature makes it flexible to concurrently handle a variety of workloads coming from multiple users at the same time while also handling a fluid worker population with failures and additions. Workers communicate amongst each other for bulk data transfer over TCP.
Internally the scheduler tracks all work as a constantly changing directed acyclic graph of tasks. A task is a Python function operating on Python objects, which can be the results of other tasks. This graph of tasks grows as users submit more computations, fills out as workers complete tasks, and shrinks as users leave or become disinterested in previous results.
Users interact by connecting a local Python session to the scheduler and submitting work, either by individual calls to the simple interface client.submit(function, *args, **kwargs) or by using the large data collections and parallel algorithms of the parent dask library. The collections in the dask library like dask.array and dask.dataframe provide easy access to sophisticated algorithms and familiar APIs like NumPy and Pandas, while the simple client.submit interface provides users with custom control when they want to break out of canned “big data” abstractions and submit fully custom workloads.
Dask supports a real-time task framework that extends Python’s concurrent.futures interface. This interface is good for arbitrary task scheduling, like dask.delayed, but is immediate rather than lazy, which provides some more flexibility in situations where the computations may evolve over time.
These features depend on the second generation task scheduler found in dask.distributed (which, despite its name, runs very well on a single machine).
The dask.distributed scheduler works well on a single machine. It is sometimes preferred over the default scheduler for the following reasons:
It provides access to asynchronous API, notably Futures
It provides a diagnostic dashboard that can provide valuable insight on performance and progress
It handles data locality with more sophistication, and so can be more efficient than the multiprocessing scheduler on workloads that require multiple processes.
You can create a dask.distributed scheduler by importing and creating a Client with no arguments. This overrides whatever default was previously set.
Tell us about a new Kubernetes application
Never miss a thing! Sign up for our newsletter to stay updated.
Discover and learn about everything Kubernetes