PASS1 Instructions

SHIFT DUTIES

The duties of the Pass1 shift are the following:

SHIFT PREPARATION

Before taking your PASS1 shift, please do the following:

FARM CONFIGURATION

This section is intended to give a very basic introduction to the Fermilab Farms. It is by no means complete, but it is all the shift should need to know. More information is available, of course.

MONITORING PASS1

Check whether you need to submit a job on any of the farm systems, either on the web page or by logging onto the farm host. In general, keep five jobs running or queued on each system, and always make sure each queue has at least one job that has not yet started. If you will not be checking the systems for an extended period of 8-10 hours (e.g. overnight), make sure there are at least five jobs on each system before you leave. To check the number of jobs you can do the following:

The second type of essential monitoring uses the xtop or top commands (see the section on commands). These allow continuous monitoring of CPU use on the farms. Long periods (more than 30 minutes) in which no CPU is being used are a good indicator that something is wrong, so this method can give early warning of a halted farm, which is a serious problem.
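The two monitoring rules above can be written down as simple checks. The following is a minimal sketch in Python and is not part of the Pass1 tools: the function names, the job-state labels ("running", "queued") and the (minute, cpu_percent) sample format are assumptions made for illustration. Only the five-job target and the 30-minute idle limit come from the text.

```python
from typing import List, Tuple

TARGET_JOBS = 5          # jobs to keep running/queued on each system (from the text)
MAX_IDLE_MINUTES = 30    # idle CPU for longer than this suggests a problem (from the text)


def jobs_to_submit(states: List[str], target: int = TARGET_JOBS) -> int:
    """Return how many new jobs one farm system needs.

    `states` holds one hypothetical label per job: "running" or "queued".
    """
    missing = max(0, target - len(states))
    # Even with enough jobs, keep at least one that has not started yet.
    if missing == 0 and "queued" not in states:
        missing = 1
    return missing


def longest_idle_minutes(samples: List[Tuple[int, float]]) -> int:
    """Longest span, in minutes, between consecutive samples that all show no CPU use.

    `samples` is a time-ordered list of (minute, cpu_percent) pairs, for
    example values read off top at intervals (hypothetical format).
    """
    longest = 0
    idle_start = None
    for minute, cpu in samples:
        if cpu <= 0.0:
            if idle_start is None:
                idle_start = minute
            longest = max(longest, minute - idle_start)
        else:
            idle_start = None
    return longest


def farm_looks_halted(samples: List[Tuple[int, float]],
                      limit: int = MAX_IDLE_MINUTES) -> bool:
    """True when the CPU has been idle for more than `limit` minutes."""
    return longest_idle_minutes(samples) > limit
```

For instance, a system with three running jobs and nothing queued needs two more jobs, while a system with five running jobs and nothing queued still needs one so that a job is always waiting.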

SUBMITTING JOBS

To submit a job on the farm you should:

  1. Reload any monitoring pages and check your e-mail for special instructions from the czars.

  2. Find the next tape number you want to submit. Note that you can get this from the MECCA access page by clicking on the latest run number. A table will show the run and tape numbers for the relevant range of runs with the status of each. Select the next one which is not yet submitted or one that failed previously and is not yet resubmitted. Note the TAPE NUMBER, not the run number.

    The table also shows the expected number of events. If there are runs missing between tapes, you can click on the tape label to see whether multiple runs were stored on the tape; this link gives a better estimate of the number of events on the tape. Do not submit tapes with fewer than 800,000 events: submitting short runs disrupts the timing of the farms and wastes CPU time. Instead, e-mail a Pass1 czar (when you e-mail your usual report) and ask for special instructions regarding these tapes.

  3. The Pass1 status code gives the status of a run. D means the job has completed successfully, S means it has been submitted, P means the job has completed with some minor error, and E means that an error may have occurred which requires the attention of the czars. Combinations are possible; for example, ES means the job had an error the first time and has been resubmitted.

  4. Log onto the correct host (from the table above). Log in as user e831p1.

  5. You should be in the home (top) area of user e831p1 (i.e. ~e831p1). To submit a job, use the command "p1submit # xxxx", where # is the farm system (a-f, t) and xxxx is the 4-digit TAPE NUMBER.

    Enter the username (e831p1) and password when prompted. If you make a mistake typing in the system or tape number, deliberately type in a wrong password (please do not hit CTRL-C). If you type in the username or password incorrectly, just try again.

    See the section on staging below for one minor modification to the p1submit command usage.

  6. You can check if the job was submitted properly with the correct tape numbers by viewing the Mecca Access page or using the monitoring commands below.
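The selection and submission steps above can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not a Pass1 tool: the Tape record, the use of an empty status string for a tape that was never submitted, and the choice of the lowest tape number as "the next one" are all hypothetical. The status letters, the 800,000-event minimum and the "p1submit # xxxx" form are taken from the text.

```python
from typing import List, NamedTuple, Optional

MIN_EVENTS = 800_000            # tapes with fewer events are not submitted (from the text)
VALID_SYSTEMS = set("abcdeft")  # farm systems a-f and t (from the text)

STATUS_MEANING = {
    "D": "completed successfully",
    "S": "submitted",
    "P": "completed with a minor error",
    "E": "possible error needing czar attention",
}


class Tape(NamedTuple):
    number: int    # TAPE NUMBER, not the run number
    events: int    # expected number of events
    status: str    # Pass1 status code; "" is assumed to mean "never submitted"


def describe_status(code: str) -> List[str]:
    """Expand a (possibly combined) status code such as "ES", letter by letter."""
    return [STATUS_MEANING[letter] for letter in code]


def next_tape(tapes: List[Tape]) -> Optional[Tape]:
    """Return the lowest-numbered tape that is unsubmitted, or failed and not resubmitted.

    Tapes below MIN_EVENTS are skipped; those go to a czar for instructions.
    """
    for tape in sorted(tapes, key=lambda t: t.number):
        if tape.status in ("", "E") and tape.events >= MIN_EVENTS:
            return tape
    return None


def p1submit_command(system: str, tape_number: int) -> str:
    """Build the "p1submit # xxxx" command line after checking both arguments."""
    if system not in VALID_SYSTEMS:
        raise ValueError(f"unknown farm system: {system!r}")
    if not 0 <= tape_number <= 9999:
        raise ValueError(f"tape number must have at most 4 digits: {tape_number}")
    return f"p1submit {system} {tape_number:04d}"
```

With a made-up tape number, p1submit_command("a", 1234) returns the string "p1submit a 1234". Checking the arguments before typing the command avoids the wrong-password workaround described in step 5. Note that tapes showing E are resubmitted only once the czars have said to do so.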

Canceling Jobs

If you find that you need to cancel a job, e.g. one submitted with the wrong tape or system number, issue the command:

p1cancel system_letter job#

For example: p1cancel a 25

This command can also be used to abort a running job.

If you must cancel a running job or one that is on-deck, let a Pass1 czar know that you did so as there may need to be special cleanup and accounting modifications. In general, the shift should never have to do this.
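As a rough illustration of the cancel procedure, the Python sketch below builds the command line and flags when a czar should be told. It is not part of the Pass1 software. It assumes that p1cancel accepts the same system letters as p1submit (a-f, t) and that job numbers are non-negative integers; the job-state labels are hypothetical.

```python
VALID_SYSTEMS = set("abcdeft")  # assumed to match the p1submit system letters


def p1cancel_command(system: str, job_number: int) -> str:
    """Build the "p1cancel system_letter job#" command line."""
    if system not in VALID_SYSTEMS:
        raise ValueError(f"unknown farm system: {system!r}")
    if job_number < 0:
        raise ValueError(f"job number cannot be negative: {job_number}")
    return f"p1cancel {system} {job_number}"


def must_notify_czar(job_state: str) -> bool:
    """Cancelling a running or on-deck job may need cleanup by a czar.

    `job_state` is a hypothetical label: "queued", "on-deck" or "running".
    """
    return job_state in ("running", "on-deck")
```

Here p1cancel_command("a", 25) reproduces the example from the text, "p1cancel a 25".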

PROBLEMS

Jobs occasionally fail, or at least seem to. These are the jobs that appear with an E in the table of jobs over the last 36 hours. At least once per day, but not more than twice per day, you should send e-mail to the czars with a list of these runs. The czars will investigate and tell you whether to resubmit these jobs or whether they are OK. If the jobs are OK, the czars will modify the database so that the E no longer appears. This may take some time, as the czars are often quite busy. Single failures are not considered critical to progress since there are thousands of tapes left to analyze.

Typically Pass1 czars take one-week shifts in which they deal with these non-critical problems.

Less frequently, something will happen that either causes a queue to freeze or causes all the jobs in the queue to die in rapid succession. This situation requires the immediate attention of the czars. You may try e-mail first, but if you don't hear anything back in a few minutes, you should call or page a czar. If there is any doubt as to whether a problem is serious or not, make sure you get in contact with a czar (not just the primary czar). See below for phone and pager numbers.

One exception to this division is Output Staging errors. If you notice one of these, e-mail the czars immediately, but you needn't call or page anyone.
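The division between routine and urgent problems described above can be summarized as a small decision table. The Python sketch below is only an illustration: the Problem categories and the Response fields are names invented here, and the mapping simply restates the rules in this section.

```python
from dataclasses import dataclass
from enum import Enum


class Problem(Enum):
    SINGLE_FAILURE = "a job shows E in the table"
    QUEUE_FROZEN = "a queue has frozen"
    JOBS_DYING = "all jobs in a queue die in rapid succession"
    OUTPUT_STAGING_ERROR = "output staging error"
    UNSURE = "not sure whether the problem is serious"


@dataclass(frozen=True)
class Response:
    email_immediately: bool   # write to the czars right away
    page_if_no_reply: bool    # call or page if e-mail goes unanswered for a few minutes
    note: str


def triage(problem: Problem) -> Response:
    """Map a problem category to the response this section asks for."""
    if problem is Problem.SINGLE_FAILURE:
        return Response(False, False,
                        "list the run in the routine e-mail report (once or twice per day)")
    if problem is Problem.OUTPUT_STAGING_ERROR:
        return Response(True, False,
                        "e-mail the czars immediately; no call or page needed")
    # Frozen queues, mass failures and doubtful cases all need a czar promptly.
    return Response(True, True,
                    "e-mail first, then call or page a czar if there is no quick reply")
```

Treating the doubtful case the same way as a frozen queue follows the rule that, when in doubt, the shift should make sure a czar is reached.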

Name             E-mail             Office (Work)           Home               Pager
Irwin Gaines     gaines@fnal.gov    (630) 840-4022          (630) 420-1452*    (888) 390-9193
Jon Link         link@fnal.gov      (630) 840-2183          (630) 584-9613     (800) 241-9016*
Alberto Sánchez  asanchez@fnal.gov  +55 (21) 541 0337 x197  +55 (21) 542 2602
Eric Vaandering  ewv@fnal.gov       (303) 492-4821          (303) 543-8924     (800) 241-9165*
*Preferred after-hours contact method

SHIFT CONCLUSION

On the last day of your shift (Monday) you should do the following:


More detailed information is available from these sources:

PASS1 shift schedule

FOCUS phone list page

Summary of FOCUS Pass1 commands

Description of staging software

Send comments to: FOCUS Pass1 Czars