dispy : Distribute Computations for Parallel Execution


There are two ways to create clusters with dispy: JobCluster and SharedJobCluster. If only one instance of dispy may be running at any time, JobCluster is simple to use; it already contains a scheduler that schedules jobs to nodes running 'dispynode'. If, however, multiple programs using dispy may be running at any time, JobCluster cannot be used - the scheduler in each instance of dispy would assume the nodes are controlled exclusively by it, causing conflicts. Instead, SharedJobCluster must be used. In this case, dispyscheduler must also be running on some computer, and SharedJobCluster must set the scheduler_node parameter to the node running dispyscheduler (the default is the host that calls SharedJobCluster).

Once an instance of JobCluster/SharedJobCluster is created, it can be used to schedule jobs by submitting the arguments with which to invoke the computation.
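As a minimal sketch (assuming 'dispynode' is already running on one or more machines on the local network; the 'compute' function here is hypothetical):

import dispy

def compute(n):
    # executed on the nodes; a computation must do its own imports
    import time, socket
    time.sleep(n)
    return (socket.gethostname(), n * n)

if __name__ == '__main__':
    cluster = dispy.JobCluster(compute)
    jobs = []
    for n in range(4):
        job = cluster.submit(n)
        job.id = n
        jobs.append(job)
    for job in jobs:
        host, result = job()  # waits for this job to finish and returns its result
        print('%s computed job %s: %s' % (host, job.id, result))
    cluster.stats()
    cluster.close()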

JobCluster

Although JobCluster and SharedJobCluster have many parameters, most are optional; in many cases, specifying just computation, and possibly depends, is sufficient (a sketch combining a few of the optional parameters follows this list). JobCluster(computation, nodes=['*'], depends=[], callback=None, ip_addr=None, ext_ip_addr=None, port=51347, node_port=51348, fault_recover=False, dest_path=None, loglevel=logging.WARNING, cleanup=True, pulse_interval=None, ping_interval=None, reentrant=False, secret='', keyfile=None, certfile=None) where
  • computation is either a Python function or a string. If it is a string, it must be the path to an executable program. This computation is sent to the nodes in the given cluster.
  • nodes is a list. Each element must be either a string or a pair (tuple of two elements). If an element is a string, it must be an IP address or a host name. If an element is a pair, the first element of the pair must be an IP address or name, and the second element must be the port number where that node is serving (needed if it is different from the default 51348). This list serves two purposes: dispy initially sends a request to all the nodes listed, to find out information about them (e.g., the number of processing units available for dispy), then sends the given computation to only those nodes that match the listed nodes (dispy may know about nodes not listed in this computation, as it also broadcasts an identification request). The wildcard '*' can be used to match (part of) any IP address; e.g., '192.168.3.*' matches any node whose IP address starts with '192.168.3'. If there are any nodes beyond the local network, all such nodes should be mentioned in nodes. If there are many such nodes (outside the local network), it may be cumbersome to list them all (it is not possible to send identification requests to outside networks with a wildcard in them); in that case, dispynetrelay may be started on one of the nodes on that network, and the node running dispynetrelay should be added to the nodes list (along with a wildcard for that network, so that the other nodes on that network match the wildcard); see below for examples.
  • depends is a list of dependencies needed for computation. Each element of this list can be a Python function, a Python class, an instance of a class (object), a Python module, or a path to a file. Only modules that are not already present on the nodes need be listed; standard modules that are present on all nodes do not need to be listed here.
  • callback is a function. When a job's results become available, dispy will call the provided callback function with that job as the argument. If a job sends provisional results with 'dispy_provisional_result' multiple times, dispy will call the callback each such time. The (provisional) results of the computation can be retrieved with the 'result' field of the job, etc. While computations run on the nodes in isolated environments, callbacks run in the context of the user program from which (Shared)JobCluster is called - for example, callbacks can access global variables in the program that created the cluster(s).
  • ip_addr is the IP address to use for (client) communication. If it is not set, the IP address of the host calling JobCluster is used. If the host has multiple interfaces and the default address is not the right choice, this parameter can be set to the correct address.
  • ext_ip_addr is the external IP address to use for (client) communication. This may be needed when the client is behind a NAT firewall/gateway and (some) nodes are outside it. Typically, in such a case, ext_ip_addr must be the address of the NAT firewall/gateway, and the NAT firewall/gateway must forward ports to ip_addr appropriately. See below for more information.
  • port is the port to use for (client) communication. Usually not necessary. If not given, dispy will request the socket library to choose any available port.
  • node_port is the port to use for communicating with the nodes (servers). If this is different from the default, 'dispynode' programs must be run with the same port.
  • dest_path is the directory on the nodes where files are transferred. The default is to create a separate directory for each computation. If a computation transfers files (dependencies) and the same computation is run again with the same files, the transfer can be avoided by specifying the same dest_path, along with the option 'cleanup=False'.
  • fault_recover must be either True or a string. If it is a string, dispy uses it as the path to a file where it stores information about jobs that have been scheduled but not yet finished. In case the user program terminates unexpectedly (for example, because of a network failure or an uncaught exception), the results of submitted jobs can later be retrieved through the 'fault_recover_jobs' function (see below). If this option is True, dispy stores the information in a file of the form '_dispy_fault_recover_YYYYMMDDHHMMSS' in the current directory. Note that dispy keeps information only about jobs that have been submitted for execution but not yet finished. Once a job is finished (i.e., the job result is received by dispy's scheduler), its information is lost. If it is necessary to keep information about finished jobs, callbacks can be used to persist job results.
  • loglevel is the message priority for the logging module.
  • cleanup: whether any files transferred should be deleted after the computation is done. If it is False, the files are left on the nodes; this may speed up later runs if the same files are needed for another cluster. However, it can be a security risk and/or require manual cleanup. If the same files are used for multiple clusters, cleanup may be set to False and the same dest_path used.
  • pulse_interval is the number of seconds between 'pulse' messages that nodes send to indicate they are alive and computing submitted jobs. If this value is given as an integer or floating point number between 1 and 600, a node is presumed dead if 5*pulse_interval seconds elapse without a pulse message. See 'reentrant' below.
  • reentrant must be either True or False. This value is used only if 'pulse_interval' is set for any of the clusters. If pulse_interval is given and reentrant is False (the default), jobs scheduled on a node presumed dead are automatically cancelled (for such jobs, the execution result, output and error fields are set to None, the exception field is set to 'Cancelled' and the status is set to Cancelled); if reentrant is True, jobs scheduled on a dead node are resubmitted to other available nodes.
  • ping_interval is a number of seconds. Normally dispy can locate nodes running dispynode by broadcasting UDP ping messages on the local network and sending point-to-point UDP messages to nodes on remote networks. However, UDP messages may get lost. ping_interval is the number of seconds between repeated ping messages to find any nodes that have missed previous ping messages.
  • secret is a string that is (hashed and) used for handshaking in communication with the nodes. This prevents unauthorized use of the nodes. However, the hashed string (not the secret itself) is passed in clear text, so an unauthorized, determined person may be able to circumvent it.
  • keyfile is the path to a file containing the private key for SSL communication (see the Python 'ssl' module). This key may be stored in 'certfile' itself, in which case this should be None.
  • certfile is the path to a file containing the SSL certificate (see the Python 'ssl' module).
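To illustrate how a few of these optional parameters combine, here is a sketch; the function names, the secret value and the pulse settings are hypothetical:

import dispy

def double(x):
    # sent along with 'compute' via 'depends', so nodes can call it
    return 2 * x

def compute(n):
    return double(n) + 1

def job_done(job):
    # runs in the client program's context when a job's result arrives
    print('job finished with result: %s' % str(job.result))

if __name__ == '__main__':
    # nodes are presumed dead after 5*10 seconds without a pulse; with
    # reentrant=True, jobs on a dead node are resubmitted to other nodes
    cluster = dispy.JobCluster(compute, depends=[double], callback=job_done,
                               pulse_interval=10, reentrant=True,
                               secret='example', fault_recover=True)
    jobs = [cluster.submit(n) for n in range(4)]
    cluster.wait()
    cluster.close()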

SharedJobCluster

SharedJobCluster has almost the same syntax, except as noted below.

SharedJobCluster(computation, nodes=['*'], depends=[], ip_addr=None, ext_ip_addr=None, scheduler_node=None, port=51347, dest_path=None, loglevel=logging.WARNING, cleanup=True, reentrant=False, secret='', keyfile=None, certfile=None) where all arguments common to JobCluster are the same, and
  • scheduler_node is either the IP address or host name where dispyscheduler is running; if it is not given, the node where SharedJobCluster is invoked is used (see the sketch after this list).
  • pulse_interval is not used in case of SharedJobCluster; instead, 'dispyscheduler' must be called with '--pulse_interval' option appropriately.
  • secret is a string that is (hashed and) used for handshaking of communication with dispyscheduler.
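For instance, a minimal sketch, assuming 'dispyscheduler' is running on a host named 'scheduler_host' (a hypothetical name):

import dispy

def compute(n):
    return n * n

if __name__ == '__main__':
    # 'scheduler_host' is a placeholder for the host running dispyscheduler
    cluster = dispy.SharedJobCluster(compute, scheduler_node='scheduler_host')
    job = cluster.submit(5)
    print('result: %s' % job())  # waits for the job and returns its result
    cluster.close()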
A cluster has the following methods (a usage sketch follows this list):
  • cluster.submit is available on the cluster returned by JobCluster. It should be called with the arguments exactly as expected by the 'computation' given to JobCluster. If 'computation' is a Python function, the arguments may also contain keyword arguments. All arguments must be serializable (picklable). If an argument is a class object that contains non-serializable members, the class may provide a __getstate__ method for this purpose (see the '_Job' class in dispy.py for an example). If 'computation' is a program, all arguments must be strings. submit returns a 'job' object. Results from the execution of the computation with the given arguments will be available in the 'job' object after the execution finishes.
  • cluster() will wait for all submitted jobs to complete.
  • cluster.wait() will wait for all submitted jobs to complete.
  • cluster.close() will wait for all submitted jobs to complete and then cleanup (such as removing any transferred files, deleting 'computation' from the nodes etc.).
  • cluster.cancel(job) will remove the submitted job, terminating it if it has already started execution. Note that if the job execution has side effects (such as updating databases, files etc.), cancelling the job may leave unpredictable side effects, depending on the point at which it is cancelled.
  • cluster.stats() will print statistics about nodes, time each node spent executing jobs etc.
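A sketch of how these methods fit together, continuing with a cluster created as above (and a hypothetical one-argument 'compute'):

jobs = []
for n in range(10):
    job = cluster.submit(n)
    job.id = n
    jobs.append(job)

cluster.cancel(jobs[0])  # terminate the first job if it already started
cluster.wait()           # block until the remaining jobs finish
cluster.stats()          # print per-node usage statistics
cluster.close()          # wait, then clean up transferred files on the nodes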

Fault Recovery

As noted above, if the 'fault_recover' option is used when creating a cluster, dispy stores information about scheduled but unfinished jobs in a file. If the user program then terminates unexpectedly, the nodes that execute those jobs can't send the results back to dispy. In such cases, the results for the jobs can be retrieved from the nodes with the following function in dispy:

fault_recover_jobs(fault_recover_file, ip_addr=None, secret='', node_port=51348, certfile=None, keyfile=None) where
  • fault_recover_file is the path to the file used or created when JobCluster was created.
  • ip_addr is the IP address to use for (client) communication. This may be needed in case the client has multiple interfaces and the default interface is not the right choice (this would be the same as the 'ip_addr' option used for JobCluster).
  • secret is a string that is (hashed and) used for handshaking in communication with the nodes (should be the same as the one used when creating JobCluster).
  • node_port is the port to use for communicating with the nodes (servers). If this is different from the default, 'dispynode' programs must be run with the same port. This option should be the same as the one used when creating JobCluster.
  • certfile is path to file containing SSL certificate (see Python 'ssl' module) (same as the one used when creating JobCluster).
  • keyfile is path to file containing private key for SSL communication (see Python 'ssl' module). This key may be stored in 'certfile' itself, in which case this should be None. This option should be same as the one used when creating JobCluster.

This function reads the information about jobs in fault_recover_file, retrieves the DispyJob instance (which contains results, stdout, stderr, status etc.) for each job that was scheduled for execution but unfinished at the time of the crash, and returns them as a list. If a job has finished executing by the time 'fault_recover_jobs' is called, the information about it is deleted from both the node and fault_recover_file, so the results for finished jobs can't be retrieved more than once. However, if a job is still executing, the status field of its DispyJob would be DispyJob.Running, and the results for this job can be retrieved again (until that job finishes) by calling 'fault_recover_jobs'. Note that 'fault_recover_jobs' is available as a separate function - it doesn't need a JobCluster or SharedJobCluster instance. In fact, 'fault_recover_jobs' must not be used while a cluster that uses the same recover file is running.
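For example, a sketch of recovering results after a crash; the recover file name below is hypothetical (it would be whatever was passed as, or created for, 'fault_recover' when the cluster was created):

import dispy

# placeholder for the file created/used by the crashed program's
# JobCluster(..., fault_recover=...)
jobs = dispy.fault_recover_jobs('_dispy_fault_recover_20120101120000')
for job in jobs:
    if job.status == dispy.DispyJob.Finished:
        print('recovered result: %s' % str(job.result))
    elif job.status == dispy.DispyJob.Running:
        print('job still executing; call fault_recover_jobs again later')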

Note that dispy sends only the given computation and its dependencies to the nodes; the program itself is not transferred. So if the computation is a Python function, it must import all the modules it uses, even if the program imported those modules before the cluster was created.
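For example, even if the client program itself imports a module, the computation must import it again in its own scope (a minimal sketch):

import hashlib  # imported by the client program; not visible on the nodes

def compute(data):
    # the node executes only this function, so it must do its own imports
    import hashlib
    return hashlib.md5(data).hexdigest()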

Provisional Results

The 'dispy_provisional_result' function can be used in computations (Python functions) to send provisional results back to the client. For example, in optimization computations there may be many (sub-)optimal results that the computations can report to the client (program), which may then cancel jobs, create additional jobs, etc. 'dispy_provisional_result' can be used to send any information, any number of times, back to the client. As an illustrative example, consider:

#!/usr/bin/env python

import random, dispy

def compute(n, threshold):
    import random, time, socket
    name = socket.gethostname()
    for i in xrange(0, n):
        r = random.uniform(0, 1)
        if r <= threshold:
            # possible result
            dispy_provisional_result((name, r))
        time.sleep(0.1)
    # final result
    return None

def job_callback(job):
    # callback is called for Terminated status, too, in which case
    # job.result would be None
    if job.result is not None:
        if job.result[1] < 0.005:
            # acceptable result; terminate jobs
            print '%s computed: %s' % (job.result[0], job.result[1])
            global jobs, cluster
            for j in jobs:
                if j.status == dispy.DispyJob.ProvisionalResult:
                    cluster.cancel(j)

if __name__ == '__main__':
    cluster = dispy.JobCluster(compute, callback=job_callback)
    jobs = []
    for n in xrange(4):
        job = cluster.submit(random.randint(50, 100), 0.2)
        if job is None:
            print 'creating job %s failed!' % n
            continue
        job.id = n
        jobs.append(job)
    cluster.wait()
    cluster.stats()
    cluster.close()

In the above example, the computations send a provisional result whenever a computed number is <= the threshold (0.2). If a computed number is < 0.005, job_callback deems it acceptable and terminates the computations.

NAT/Firewall Forwarding

By default the dispy client uses UDP and TCP ports 51347, dispynode uses UDP and TCP ports 51348, and dispyscheduler uses UDP and TCP ports 51347 and TCP port 51348. If the client/node/scheduler is behind a NAT firewall/gateway, these ports must be forwarded appropriately and the 'ext_ip_addr' option must be used. For example, if the dispy client is behind a NAT firewall/gateway, JobCluster/SharedJobCluster must set 'ext_ip_addr' to the NAT firewall/gateway address, and the firewall/gateway must forward UDP and TCP ports 51347 to the IP address where the client is running. Similarly, if dispynode is behind a NAT firewall/gateway, the 'ext_ip_addr' option must be used.

For example, the ext_ip_addr option can be used to work with the Amazon EC2 cloud computing service. With the EC2 service, a node has a private IP address (called 'Private DNS Address') on a private network of the form 10.x.x.x, and a public IP address (called 'Public DNS Address') of the form ec2-x-x-x-x.x.amazonaws.com. After launching instance(s), one can copy dispy files to the node(s) and run dispynode as dispynode.py --ext_ip_addr ec2-x-x-x-x.x.amazonaws.com (this address can't be used with the '-i'/'--ip_addr' option, as the network interface is configured with the private IP address only). This node can then be used by a dispy client from outside the EC2 network by specifying ec2-x-x-x-x.x.amazonaws.com in the 'nodes' list (thus using EC2 servers to augment local processing units). Roughly, dispy uses 'ext_ip_addr' similarly to NAT - it announces 'ext_ip_addr' to other services instead of the configured 'ip_addr', so that external services send requests to 'ext_ip_addr'; if the firewall/gateway forwards them appropriately, dispy will process them.
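Putting the pieces together, a sketch of using an EC2 node from an outside client (ec2-x-x-x-x.x.amazonaws.com is the placeholder public address from above):

# on the EC2 node (shell):
#   dispynode.py --ext_ip_addr ec2-x-x-x-x.x.amazonaws.com

import dispy

def compute(n):
    return n * n

if __name__ == '__main__':
    # the client reaches the node through its public EC2 address
    cluster = dispy.JobCluster(compute, nodes=['ec2-x-x-x-x.x.amazonaws.com'])
    job = cluster.submit(7)
    print('result: %s' % job())
    cluster.close()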

Examples

Below are some examples of various use cases for creating clusters:

  1. cluster = dispy.JobCluster(compute, nodes=['node20', '192.168.2.21', 'node24']) sends computation to nodes 'node20', 'node24' and node with IP address '192.168.2.21'; in this case, these nodes could be in different networks, as explicit names / IP addresses are listed.
  2. cluster = dispy.JobCluster(compute, nodes=['192.168.2.*']) sends computation to all nodes whose IP address starts with '192.168.2'. In this case, it is assumed that '192.168.2' is the local network (since dispy can't send identification requests to outside networks with a wildcard).
  3. cluster = dispy.JobCluster(compute, nodes=['192.168.3.5', '192.168.3.22', '172.16.11.22', 'node39', '192.168.2.*']) sends computation to nodes with IP addresses '192.168.3.5', '192.168.3.22' and '172.16.11.22', to node 'node39' (since explicit names / IP addresses are listed, they can be on different networks), and to all nodes whose IP address starts with '192.168.2' (the local network).
  4. cluster = dispy.JobCluster(compute, nodes=['192.168.3.5', '192.168.3.*', '192.168.2.*']) In this case, dispy will send an identification request to the node with IP address '192.168.3.5'. If this node is running 'dispynetrelay', then all the nodes on that network are eligible for executing this computation, as the wildcard '192.168.3.*' matches their IP addresses. In addition, the computation is also sent to all nodes whose IP address starts with '192.168.2' (the local network).
  5. cluster = dispy.JobCluster(compute, nodes=['192.168.3.5', '192.168.8.20', '172.16.2.99', '*']) In this case, dispy will send identification requests to the nodes with IP addresses '192.168.3.5', '192.168.8.20' and '172.16.2.99'. If these nodes are all running dispynetrelay, then all the nodes on those networks are eligible for executing this computation, as the wildcard '*' matches their IP addresses. In addition, the computation is also sent to all nodes on the local network (since they also match the wildcard '*' and the identification request is broadcast on the local network).
  6. Assuming that 192.168.1.39 is the (private) IP address where dispy client is used, a.b.c.d is the (public) IP address of NAT firewall/gateway (that can be reached from outside) and dispynode is running at another public IP address e.f.g.h (so that a.b.c.d and e.f.g.h can communicate, but e.f.g.h can't communicate with 192.168.1.39),
    cluster = dispy.JobCluster(compute, ip_addr='192.168.1.39', ext_ip_addr='a.b.c.d', nodes=['e.f.g.h']) would work if NAT firewall/gateway forwards UDP and TCP ports 51347 to 192.168.1.39.