dispy
dispy is a framework, developed in Python, for parallel execution of computations by distributing them across multiple processors on a single machine (SMP) or among many machines in a cluster or grid. dispy is well suited to the data parallel (SIMD) paradigm, where a computation is evaluated independently with different (large) datasets. Salient features of dispy are:
- dispy is implemented with asyncoro, an independent framework for asynchronous, concurrent, distributed, network programming with coroutines (without threads). asyncoro uses non-blocking sockets with I/O notification mechanisms epoll, kqueue and poll, and Windows I/O Completion Ports (IOCP) for high performance and scalability, so dispy works efficiently with a single node or large cluster(s) of nodes.
- Computations (Python functions or standalone programs) and their dependencies (files, Python functions, classes, modules) are distributed automatically.
- Computation nodes can be anywhere on the network (local or remote). For security, either simple hash based authentication or SSL encryption can be used.
- A computation may specify which nodes are allowed to execute it.
- After each execution is finished, the results of execution, output, errors and exception trace are made available for further processing.
- Nodes may become available dynamically: dispy will schedule jobs whenever a node is available and computations can use that node.
- If a callback function is provided, dispy executes that function when a job is finished; this feature is useful for processing job results (see the sketch after this list).
- Client-side and server-side fault recovery are supported:
If the user program (client) terminates unexpectedly (e.g., due to an uncaught exception), the nodes continue to execute the scheduled jobs. If the client-side fault recovery option is used when creating a cluster, the results of the jobs that were scheduled but unfinished at the time of the crash can be retrieved later.
If a computation is marked reentrant (with the 'reentrant=True' option) when a cluster is created and a node (server) executing jobs for that computation fails, dispy automatically resubmits those jobs to other available nodes.
- dispy can be used in a single process that uses all the nodes exclusively (with JobCluster - simpler to use) or in multiple processes sharing the nodes (with SharedJobCluster and dispyscheduler).
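A minimal sketch of the callback and reentrant features together (assuming, as described above, that JobCluster accepts 'callback' and 'reentrant' keyword arguments and that the callback receives the finished DispyJob; the 'compute' function here is just a placeholder):

import dispy

def compute(n):
    import time, socket
    time.sleep(n)
    return (socket.gethostname(), n)

def job_callback(job):
    # called by dispy when a job finishes; results can be processed here
    if job.exception is None:
        print('callback: job finished with %s' % str(job.result))

if __name__ == '__main__':
    # 'reentrant=True' asks dispy to resubmit jobs if their node fails
    cluster = dispy.JobCluster(compute, callback=job_callback, reentrant=True)
    jobs = [cluster.submit(n) for n in range(5)]
    cluster.wait()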
dispy consists of 4 components:
- dispy.py (client) provides two ways of creating "clusters": JobCluster when only one instance of dispy may run and SharedJobCluster when multiple instances may run (in separate processes). If JobCluster is used, the scheduler contained within dispy.py will distribute jobs on the server nodes; if SharedJobCluster is used, a separate scheduler (dispyscheduler) must be running.
- dispynode.py executes jobs on behalf of dispy. dispynode must be running on each of the (server) nodes that form the cluster.
- dispyscheduler.py is needed only when SharedJobCluster is used; it provides a scheduler that can be shared by multiple dispy users (a client-side sketch follows this list).
- dispynetrelay.py is needed when nodes are located across different networks; it relays information about nodes on a network to the scheduler. If all the nodes are on the same network, there is no need for dispynetrelay - the scheduler and nodes automatically discover each other.
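For the shared setup, a minimal client-side sketch is shown below (assuming dispyscheduler.py is already running on a reachable host and that SharedJobCluster takes a 'scheduler_node' argument naming that host; the IP address is a placeholder):

import dispy

def compute(n):
    import socket
    return (socket.gethostname(), n * n)

if __name__ == '__main__':
    # 'scheduler_node' (assumed parameter name) is the host running dispyscheduler.py
    cluster = dispy.SharedJobCluster(compute, scheduler_node='192.168.3.100')
    jobs = [cluster.submit(n) for n in range(10)]
    for job in jobs:
        host, result = job()
        print('%s computed %s' % (host, result))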
Quick Guide
Below is a quick guide on how to use dispy. More details are available in the dispy documentation.
As a tutorial, consider the following program, in which 'compute' is distributed to nodes on a local network for parallel execution. First, run the 'dispynode' program on each of the nodes on the network. If parallelizing on an SMP (a single node with multiple cores), start 'dispynode' on that node (which also runs the dispy client).
#!/usr/bin/env python

def compute(n):
    import time, socket
    time.sleep(n)
    host = socket.gethostname()
    return (host, n)

if __name__ == '__main__':
    import dispy, random
    cluster = dispy.JobCluster(compute)
    jobs = []
    for n in range(20):
        job = cluster.submit(random.randint(5, 20))
        job.id = n
        jobs.append(job)
    # cluster.wait()
    for job in jobs:
        host, n = job()
        print '%s executed job %s at %s with %s' % (host, job.id, job.start_time, n)
        # other fields of 'job' that may be useful:
        # print job.stdout, job.stderr, job.exception, job.ip_addr, job.start_time, job.end_time
    cluster.stats()
Now run the above program, which creates a cluster with the function 'compute'. dispy schedules this computation with various arguments to all the nodes running 'dispynode'; each time a computation is submitted, dispy returns an instance of DispyJob, or job for short. The nodes execute the computation with the job's arguments in isolation - computations shouldn't depend on global state, such as modules imported outside of the computation, global variables, etc. In this case, 'compute' needs the modules 'time' and 'socket', so it must import them. The program then retrieves the results of execution for each job with 'job()'.
Further examples of creating clusters with JobCluster (using SharedJobCluster is similar):
- cluster = dispy.JobCluster('/some/program', nodes=['192.168.3.*'])

  distributes '/some/program' (an executable program, instead of a Python function) to all nodes whose IP address starts with '192.168.3'.

- cluster = dispy.JobCluster(compute, depends=[ClassA, moduleB, 'file1'])

  distributes 'compute' along with ClassA (Python object), moduleB (Python object) and 'file1' (a file). Presumably ClassA, moduleB and file1 are needed by 'compute'.

- cluster = dispy.JobCluster(compute, secret='super')

  distributes 'compute' to nodes that also use the secret 'super' (i.e., nodes started with 'dispynode -s super'). Note that the secret is used only for establishing communication initially; it is not used to encrypt programs or code for Python objects. This can be useful to prevent other users from (inadvertently) using the nodes. If encryption is needed, use SSL; see below.

- cluster = dispy.JobCluster(compute, certfile='mycert', keyfile='mykey')

  distributes 'compute' and encrypts all communication using the SSL certificate stored in the file 'mycert' and the key stored in the file 'mykey'. In this case, dispynode must also use the same certificate and key; i.e., each dispynode must be invoked with 'dispynode --certfile="mycert" --keyfile="mykey"'.

  If both certificate and key are stored in the same file, say 'mycertkey', pass that file as certfile:

  cluster = dispy.JobCluster(compute, certfile='mycertkey')

- cluster1 = dispy.JobCluster(compute1, nodes=['192.168.3.2', '192.168.3.5'])
  cluster2 = dispy.JobCluster(compute2, nodes=['192.168.3.10', '192.168.3.11'])

  distributes 'compute1' to nodes 192.168.3.2 and 192.168.3.5, and 'compute2' to nodes 192.168.3.10 and 192.168.3.11. With this setup, specific computations can be scheduled on certain node(s). As mentioned above, with JobCluster the set of nodes for one cluster must be disjoint from the set of nodes of any other cluster running at the same time. Otherwise, SharedJobCluster must be used.
See examples for more complete/concrete examples.
After a cluster is created for a computation, it can be evaluated with multiple instances of data by calling the cluster.submit function. The result is a 'job', which is an instance of DispyJob (see dispy.py). The user can set its 'id' field to any appropriate value; this may be useful, for example, to distinguish one job from another, as sketched below. In the above example, 'id' is set to a unique number, although dispy doesn't require this field to be unique - in fact, dispy doesn't use the 'id' field at all.
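For instance, 'id' can hold any value meaningful to the user, such as a tuple, so that jobs (and their results) can be looked up by it later. A small sketch, assuming the 'cluster' and 'compute' from the Quick Guide above (the tuple ids are arbitrary placeholders):

jobs_by_id = {}
for i, n in enumerate([5, 10, 15]):
    job = cluster.submit(n)
    # 'id' can be any value meaningful to the user; dispy itself ignores it
    job.id = ('batch1', i)
    jobs_by_id[job.id] = job
for job_id, job in jobs_by_id.items():
    print('%s -> %s' % (job_id, job()))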
A job's 'status' field is read-only; its value is one of Created, Running, Finished, Cancelled or Terminated, indicating the current status of the job. If a job is created with SharedJobCluster, its status is not updated to Running when the job is actually running.
When a submitted job is called with job(), it returns that job's execution result, possibly waiting until the job is finished. After a job is complete:
- job.result will have the function's return value (job.result is the same as the return value of job())
- job.stdout and job.stderr will have the stdout and stderr strings
- job.exception will have the exception trace if executing the job raised an exception; in this case job.result will be None
- job.start_time will be the time when the job was scheduled for execution on a node
- job.end_time will be the time when the results became available
After jobs are submitted, cluster.wait() can be used to wait until all submitted jobs for that cluster have finished. If necessary, the results of execution can be retrieved with either job() or job.result, as described above and sketched below.
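A minimal sketch of this pattern, assuming the 'cluster' and 'jobs' from the Quick Guide above and that the status constants are attributes of DispyJob (as defined in dispy.py):

cluster.wait()  # block until all submitted jobs have finished
for job in jobs:
    if job.status == dispy.DispyJob.Finished:
        print('job %s finished with %s' % (job.id, str(job.result)))
    else:
        # Cancelled/Terminated or failed jobs; the exception trace, if any, is in job.exception
        print('job %s did not finish: %s' % (job.id, job.exception))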
dispy can also be used as a command-line tool; in this case the computations can only be programs and the dependencies can only be files.
dispy.py -f /some/file1 -f file2 -a "arg11 arg12" -a "arg21 arg22" -a "arg3" /some/program

will distribute '/some/program' with dependencies '/some/file1' and 'file2' and then execute '/some/program' in parallel with arg11 and arg12 (two arguments to the program), arg21 and arg22 (two arguments), and arg3 (one argument).
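Roughly the same distribution can be expressed from Python; a hedged sketch, assuming that when a cluster is created from a program path, the arguments passed to cluster.submit become the program's command-line arguments (the paths and arguments are those from the command above):

import dispy

if __name__ == '__main__':
    cluster = dispy.JobCluster('/some/program', depends=['/some/file1', 'file2'])
    # each submit call runs '/some/program' with one set of command-line arguments
    jobs = [cluster.submit('arg11', 'arg12'),
            cluster.submit('arg21', 'arg22'),
            cluster.submit('arg3')]
    for job in jobs:
        job()  # wait for the program to finish
        print(job.stdout)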