Failure Modes


Unknown

This condition will require the attention of the Mecca Coordinators.

Please send e-mail to the Mecca Coordinators. If the problem is repeated for a block of runs, then you might consider calling the on-call Mecca Coordinator.


Multiple runs/tape

This is a non-fatal condition, meaning that we will not rerun the job.

It appears when processing a tape that was logged with multiple runs. In this case PassOne should process the events from all runs on the tape as one job under the first run number. Be aware that the number of events processed will be the sum over all runs and will therefore be larger than the number of WhyMe events for the first run.
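
As a rough illustration of the event-count comparison described above, the sketch below sums the per-run event counts for a multi-run tape. The run numbers, the counts, and the function name are all hypothetical, and the sketch assumes the runs are listed in the order they were logged, so that the first entry is the run number the job is filed under.

```python
# Hypothetical per-run event counts, standing in for a WhyMe database lookup.
whyme_events = {9001: 41250, 9002: 38700, 9003: 12050}


def expected_events_on_tape(runs_on_tape, events_by_run):
    """Return (job_run_number, expected_total) for a tape logged with several runs."""
    if not runs_on_tape:
        raise ValueError("tape has no runs logged")
    job_run = runs_on_tape[0]  # assumption: the list is in logging order
    total = sum(events_by_run[run] for run in runs_on_tape)
    return job_run, total


job_run, total = expected_events_on_tape([9001, 9002, 9003], whyme_events)
print(job_run, total)  # 9001 92000
```

The total to compare against the job's processed-event count is the sum, not the count recorded for the first run alone.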


WhyMe database error

This is typically a non-fatal condition, meaning that we will not rerun the job.

Usually this error is the result of an earlier data entry error in the WhyMe database. If you suspect something more sinister, e-mail the Mecca Coordinators.


One worker process died

This is a non-fatal condition, meaning that we will not rerun the job.

It is the result of an error in the analysis code which has crashed one of several worker node processes. When this error occurs we typically lose about 100 events which were in the buffer of the crashed process, but we also lose the oddpacks and event statistics from all events analyzed by the crashed worker process (up to 70,000 events if the crash comes late in the job).


Multiple worker processes died

This is typically a non-fatal condition, meaning that we will not rerun the job.

It is the result of an error or errors in the analysis code which crashed multiple worker node processes. When this error occurs we typically lose about 100 events per crash. (See One worker process died.)
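
To make the loss estimate concrete, here is a minimal sketch that applies the figures quoted above: about 100 buffered events lost per crash, plus the oddpacks and event statistics for every event the crashed worker had already analyzed. The function name and the input counts are hypothetical, and the 100-event figure is only the rough number given in the text.

```python
BUFFERED_EVENTS_PER_CRASH = 100  # "about 100 events", a rough figure from the text


def estimate_crash_losses(events_analyzed_by_crashed_workers):
    """Take one already-analyzed event count per crashed worker process."""
    n_crashes = len(events_analyzed_by_crashed_workers)
    return {
        # Events sitting in the buffer of each crashed process.
        "events_lost": n_crashes * BUFFERED_EVENTS_PER_CRASH,
        # Events that were analyzed, but whose oddpacks and statistics are gone.
        "events_missing_statistics": sum(events_analyzed_by_crashed_workers),
    }


print(estimate_crash_losses([70000, 1500]))
```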


No ODDPACK file

This is typically a non-fatal condition, meaning that we will not rerun the job.

This error occurs when the oddpack file cannot be found by the Factoids process on ea831. This condition could have any number of causes. If you see this error, send an e-mail to the Mecca Coordinators including as much information as possible (run number, tape number, farm queue, etc.). If the problem occurs for a block of runs, then you might consider calling the on-call Mecca Coordinator.


Did not `exit gracefully'

This error will require the attention of the Mecca Coordinators.

Please send e-mail to the Mecca Coordinators. If the problem is repeated for a block of runs, then you might consider calling the on-call Mecca Coordinator.


No C!!! completion message

This is typically a fatal error condition. The run will most likely be reprocessed.

This error occurs when, for some reason, not all files from the input tape are read in for processing. If this error appears without any other tape errors, then something strange has happened and you should e-mail the Mecca Coordinators.


Too few events processed

This is typically a fatal error condition. The run may or may not be reprocessed.

If this error appears without any other errors, then it is most likely due to an undetected tape error during input staging. E-mail the Mecca Coordinators and DO NOT resubmit without instructions.


Input staging failed

This error condition is not fatal. The run will not be reprocessed.

The data from the input jobs will be read directly from tape during the analysis job.


Failed to allocate input staging disk.

This error condition is not fatal. The run will not be reprocessed.

The data from the input jobs will be read directly from tape during the analysis job.


Output staging failed

This error condition is fatal. The run might be reprocessed, or a Mecca Coordinator may be able to stage the data by hand.

E-mail the Mecca Coordinators and DO NOT resubmit without instructions.


Failed to allocate output staging disk.

This error condition is not fatal. The run will not be reprocessed.

The job output will be written directly to tape during the analysis job.


Input tape error

This is always a fatal error condition. The run will be reprocessed.

This error can occur for a number of reasons.

Usually you can just resubmit a job after a tape error. If you suspect that there is a problem with the tape or the tape drive, e-mail the Mecca Coordinators before resubmitting.


Output tape error

This is always a fatal error condition. The run will be reprocessed.

This error can occur for a number of reasons.

Usually you can just resubmit a job after a tape error. If you suspect that there is a problem with the tape or the tape drive, e-mail the Mecca Coordinators before resubmitting.


Operator tape mount error

This is always a fatal error condition. Some set of runs may need to be reprocessed.

This error occurs when the run number does not match the requested tape label in the WhyMe database.
An e-mail will be sent automatically to the operators to determine which tapes were mounted.

The problem can be either a simple operator tape mis-mount or a tape labeling error that occurred during FOCUS data shifts.
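

For quick reference, the failure modes above can be collected into a lookup table. The sketch below is only an illustration of that idea: the Triage type, the triage function, and the wording of each action are hypothetical summaries of the text on this page, not part of the Mecca software. Falling back to the Unknown entry for unrecognized messages is likewise an assumption.

```python
from typing import NamedTuple


class Triage(NamedTuple):
    fatal: str   # "yes", "no", "typically", "typically not" or "unspecified"
    action: str  # short summary of what this page asks the shifter to do


EMAIL = "e-mail the Mecca Coordinators"
EMAIL_OR_CALL = EMAIL + "; consider calling the on-call Coordinator if repeated for a block of runs"

TRIAGE = {
    "Unknown": Triage("unspecified", EMAIL_OR_CALL),
    "Multiple runs/tape": Triage("no", "none; expect the event count summed over all runs"),
    "WhyMe database error": Triage("typically not", EMAIL + " only if something looks wrong"),
    "One worker process died": Triage("no", "none"),
    "Multiple worker processes died": Triage("typically not", "none"),
    "No ODDPACK file": Triage("typically not", EMAIL_OR_CALL),
    "Did not `exit gracefully'": Triage("unspecified", EMAIL_OR_CALL),
    "No C!!! completion message": Triage("typically", EMAIL + " if there are no other tape errors"),
    "Too few events processed": Triage("typically", EMAIL + "; do not resubmit without instructions"),
    "Input staging failed": Triage("no", "none; input is read directly from tape"),
    "Failed to allocate input staging disk.": Triage("no", "none; input is read directly from tape"),
    "Output staging failed": Triage("yes", EMAIL + "; do not resubmit without instructions"),
    "Failed to allocate output staging disk.": Triage("no", "none; output is written directly to tape"),
    "Input tape error": Triage("yes", "resubmit; " + EMAIL + " first if the tape or drive is suspect"),
    "Output tape error": Triage("yes", "resubmit; " + EMAIL + " first if the tape or drive is suspect"),
    "Operator tape mount error": Triage("yes", "none; the operators are e-mailed automatically"),
}


def triage(message):
    """Look up a failure message; unrecognized messages fall back to 'Unknown'."""
    return TRIAGE.get(message, TRIAGE["Unknown"])


print(triage("Output staging failed").action)
```

The keys repeat the headings on this page verbatim, including the trailing period on the two disk-allocation messages.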