Handling “Partial Failures”
Failures the system needs to overcome:
- A network connection breaks.
- The host system goes down.
- The JVM running the remote MPJ task halts for some other reason (e.g., a Java exception occurs).
- The program that initiated the MPJ job is killed.
If any particular MPJ job terminates unexpectedly:
- Concurrent tasks associated with other MPJ jobs should be unaffected, even if they were initiated by the same daemon.
- All processes associated with that job must shut down cleanly within some (preferably short) interval of time.
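
A minimal sketch of the per-job shutdown behavior described above, under stated assumptions: threads stand in for the remote MPJ processes, and the class JobRegistry with its methods startTask, abortJob, awaitShutdown, and isRunning is hypothetical, not part of any MPJ API. It only illustrates the idea that a failure in one task aborts the tasks of that job alone, and that the shutdown is bounded by a timeout.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of per-job bookkeeping in a daemon. Not the MPJ API. */
public class JobRegistry {
    // jobId -> threads standing in for that job's MPJ tasks (assumption: one thread per task)
    private final Map<String, List<Thread>> tasksByJob = new ConcurrentHashMap<>();

    /** Starts a task for the given job. A RuntimeException in the body counts as a partial failure. */
    public Thread startTask(String jobId, Runnable body) {
        Thread t = new Thread(() -> {
            try {
                body.run();
            } catch (RuntimeException e) {
                // One task failed: abort the rest of this job, and only this job.
                abortJob(jobId);
            }
        }, jobId + "-task");
        t.setDaemon(true); // do not keep the JVM alive for leftover tasks
        tasksByJob.computeIfAbsent(jobId, k -> new CopyOnWriteArrayList<>()).add(t);
        t.start();
        return t;
    }

    /** Interrupts every registered task of one job. Tasks of other jobs are not touched. */
    public void abortJob(String jobId) {
        for (Thread t : tasksByJob.getOrDefault(jobId, Collections.emptyList())) {
            t.interrupt();
        }
    }

    /** Waits up to timeoutMillis for all tasks of the job to exit. Returns true if none is left alive. */
    public boolean awaitShutdown(String jobId, long timeoutMillis) {
        List<Thread> tasks = tasksByJob.getOrDefault(jobId, Collections.emptyList());
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        try {
            for (Thread t : tasks) {
                long remainingMs = TimeUnit.NANOSECONDS.toMillis(deadline - System.nanoTime());
                if (remainingMs <= 0) break; // join(0) would wait forever, so stop here
                t.join(remainingMs);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
        }
        return !isRunning(jobId);
    }

    /** Returns true if at least one task of the job is still alive. */
    public boolean isRunning(String jobId) {
        for (Thread t : tasksByJob.getOrDefault(jobId, Collections.emptyList())) {
            if (t.isAlive()) return true;
        }
        return false;
    }

    /** A long-running stand-in task that exits cleanly when interrupted. */
    public static Runnable sleeper() {
        return () -> {
            try {
                Thread.sleep(60_000);
            } catch (InterruptedException e) {
                // clean exit on abort
            }
        };
    }

    public static void main(String[] args) {
        JobRegistry reg = new JobRegistry();
        reg.startTask("jobA", sleeper());
        reg.startTask("jobA", sleeper());
        reg.startTask("jobB", sleeper());
        // Simulate a partial failure inside jobA.
        reg.startTask("jobA", () -> { throw new IllegalStateException("simulated failure"); });
        System.out.println("jobA stopped in time: " + reg.awaitShutdown("jobA", 2000));
        System.out.println("jobB still running: " + reg.isRunning("jobB"));
    }
}
```

In this sketch the timeout passed to awaitShutdown plays the role of the "preferably short" interval. A real daemon would additionally need to detect broken connections and dead hosts, which threads inside a single JVM cannot model.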