For other OD problems, please see the Help! Guide for SuperK Outer Detector Troubles (http://www-sk.icrr.u-tokyo.ac.jp/~berns/SUPERK/odhelp.html).
[Figure: OD-DAQ overview diagram]
The FSCCs are 68000-based CPU modules running the VxWorks real-time OS. Once powered up, they download their individual boot and runtime software from sukant via a private Ethernet connection. Sukant is set up as a file host for the FSCCs but does not control them directly. After bootup, the FSCCs automatically start their TDC data-taking processes (sktrig_lcsr_fbc.o).
For more details see Mei-Li's and Jordan's FSCC/DC2 manual.
1.2. Dual-Port Memory Control via DC2 Controller
There are four DC2 controller modules in the back of the OD-DAQ VME crate, one for the Fastbus data of each outer hut. Each of them is in charge of two Dual-Port Memory (DPM) modules via a specialized VSB bus interface. Each DC2 and its pair of DPMs is isolated from the others.
The DC2s are also 68000-based CPU modules running their own microcode (based on the VxWorks real-time OS), stored in on-board EPROMs. Once powered up, they boot from their EPROMs and immediately run their runtime software in a continuous loop. Their purpose is to collect the Fastbus data and store it directly into one of the two DPM modules they control in "ping-pong" mode, while the second DPM module is released for read access by sukant (via the bit3 module). Communication with sukant (via bit3) is done through shared registers in the DPM module (the so-called "mailbox"), which determine the "ping-pong" page-flip mode, the data size for read transfer, etc.
For more details see Mei-Li's and Jordan's FSCC/DC2 manual.
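The "ping-pong" scheme described above can be illustrated with a small sketch. This is only a toy model under stated assumptions: the class names, the `mailbox` dictionary and its field names are hypothetical and do not reflect the real DC2 microcode or the actual DPM register layout.

```python
from dataclasses import dataclass, field


@dataclass
class DualPortMemory:
    """Stand-in for one DPM module (hypothetical model, not the real hardware)."""
    data: list = field(default_factory=list)


class PingPongPair:
    """Two DPM pages: one is filled by the writer while the other is readable."""

    def __init__(self):
        self.pages = [DualPortMemory(), DualPortMemory()]
        # Hypothetical "mailbox" fields: which page is being written,
        # and how many words are ready in the released page.
        self.mailbox = {"write_page": 0, "read_size": 0}

    def write(self, word):
        """Writer side (the DC2 in the text): append to the current write page."""
        self.pages[self.mailbox["write_page"]].data.append(word)

    def flip(self):
        """Release the filled page for reading and start filling the other one."""
        filled = self.mailbox["write_page"]
        self.mailbox["read_size"] = len(self.pages[filled].data)
        self.mailbox["write_page"] = 1 - filled
        self.pages[self.mailbox["write_page"]].data.clear()

    def read(self):
        """Reader side (sukant in the text): copy out the released page."""
        released = 1 - self.mailbox["write_page"]
        return list(self.pages[released].data[: self.mailbox["read_size"]])
```

The point of the design is that the writer and the reader never touch the same page at the same time; only the page flip, signalled through the mailbox, hands a page from one side to the other.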
1.3. The OD-DAQ online run control acquisition via "sukant"
(sparc20)
- the actual OD-DAQ online control software -
[Figure: OD DAQ software block diagram]
The collector controls the OD-DAQ VME crate and collects all event data from the DPMs, GPS modules and various other modules, which contain event numbers, local time data, etc. All raw data is checked for consistency and then stored in NOVA buffer blocks for the sorter. Once a NOVA buffer is filled to a certain limit (currently set to 256 events), it is released for access by the sorter, and a new NOVA buffer is prepared for the next data block. Various error detection and auto-correction mechanisms are implemented to ensure smooth operation without user action.
The sorter then sorts the raw TDC/GPS/timing/etc. data from the collector event-by-event for each newly filled NOVA buffer and does some additional error checking before releasing the sorted data to the sender.
The sender finally takes over the latest sorted NOVA buffer and keeps it stored until the "event-builder" on sukonh requests new events via net-shared memory for integrating with the Inner Detector data.
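The collector, sorter and sender chain described above can be sketched as a simple pipeline. This is a minimal illustration, not the actual OD-DAQ code: the function and class names, the dictionary event format with an `event_number` key, and the flushing of a partially filled buffer at the end of input are all assumptions, and the real NOVA buffer manager is replaced by plain Python lists.

```python
from collections import deque

EVENTS_PER_BUFFER = 256  # release limit quoted in the text


def collector(raw_events, limit=EVENTS_PER_BUFFER):
    """Group raw events into buffers; release each buffer once it holds `limit` events."""
    buffer = []
    for event in raw_events:
        buffer.append(event)
        if len(buffer) >= limit:
            yield buffer
            buffer = []
    if buffer:  # flush a partially filled buffer at end of input (assumption)
        yield buffer


def sorter(buffers):
    """Sort each released buffer by event number."""
    for buffer in buffers:
        yield sorted(buffer, key=lambda ev: ev["event_number"])


class Sender:
    """Hold sorted buffers until a consumer (the event builder) asks for them."""

    def __init__(self):
        self.pending = deque()

    def accept(self, buffer):
        self.pending.append(buffer)

    def request(self):
        """Return the oldest pending buffer, or None if nothing is waiting."""
        return self.pending.popleft() if self.pending else None
```

A short usage example with a reduced buffer limit:

```python
raw = [{"event_number": n} for n in (3, 1, 2, 5, 4)]
sender = Sender()
for buf in sorter(collector(raw, limit=2)):
    sender.accept(buf)
first = sender.request()  # the buffer holding events 1 and 3, now in order
```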
Occasionally, at an observed rate of approx. 1 in 20, the reboot is not successful, most likely due to a TDC eprom problem (see also "TDC Test Instruction"). It is then recommended to turn off the power supply of the affected crate, wait for 30 seconds, and turn it back on again. The FSCC should then be fully operational again for data taking within approx. 1 minute.
| cannot open NOVA buffer manager(/tmp/nova). ret=-1. |
| Fatal initialization error! The nova daemon process (novad) is not running! Therefore, the collector cannot communicate with the sorter. This typically happens if the run control has been reset during calibration runs while sukant was disabled (OFF button selected for sukant). Best if run control is completely restarted again from scratch. |
| NOVA get error (status=X)! Anti-Collector exiting now ... BYE! |
| Rare but fatal error! Either the nova daemon (novad) died or was killed, or the anti-sender and/or sorter died recently and therefore no more free Nova buffers available. Might require cleanup and re-initialization of run control. |
| could not install signal catcher ... I QUIT! |
| Fatal initialization error! The interrupt routine for the VME crate could not be initialized. A very rare problem. Might require rebooting sukant, power-cycling the OD-DAQ VME crate and running "initvme" on sukant. |
| No data in OD TDCs, automatic fast FSCC reset executed. | or |
|---|---|
| No data in hut N TDCs, automatic fast FSCC reset executed. |   |
| No data or corrupted data was received from the TDCs. The "fast" FSCC reset only takes 2-3 seconds. If a hut number (N) is given, then the problem was identified for a single hut only. If this message appears only once, then only a handful of TDC events are missing, and the problem has been fixed. No actions needed! But repeated messages in a short time mean there's a more serious problem, and it usually comes along with the following additional error message: | |
| OD TDCs problems; starting automatic 60-sec FSCC reset now! | or |
|---|---|
| hut N TDC/FSCC problems; starting automatic 60-sec FSCC reset now! |   |
|
No data or corrupted data was received from the TDCs, and repeated fast FSCC reset attempts were not successful. If this message appears only once within a longer period, then the problem was probably auto-recovered. No user intervention is needed in that case, but keep an eye on the event display. There will be at least 1 minute of blank OD data, though. Also, look for the "heartbeat" dots on the antisender@sukant status window on sukonh. If this message pops up repeatedly, e.g. every 3 minutes or more often, then one hut (number N) or more has a more serious problem with a TDC module or with the FSCC controller itself! What to do? See the "One Fastbus crate might have stopped TDC dataflow?" section in the OD help menu. | |
| TDC mismatch flag detected --> fast FSCC reset executed, fixing problem. |
| The sorter detected data mismatches within the TDC modules and signaled the collector to re-initialize the FSCC modules (takes only 2-3 seconds). Usually, the problem is fixed right after the reset. OD data will briefly be blank on the event display, but will reappear after 30-60 seconds at most. No action needed. |
| FSCC reset flag detected --> started 60 sec FSCC reset to fix problem. |
| Rare. The sorter saw more than just mismatches and asks the collector to perform a full FSCC reset, which takes approx. 60 seconds. The problem is usually fixed after the reset phase, and no action is needed. But expect approx. 2 minutes of blank OD data on the event display... |
| OD DC2 modules are reset now. Fixing problem! |
| The sorter asks the collector to re-initialize the DC2 controllers if DPM page mismatches are detected. A rare problem, though. The DC2 init phase lasts only a few seconds. Expect some blank OD data events, but otherwise no action is needed. |
| No signal from primary GPS receiver! GPS data screwed up now... |
| There's no data from the primary GPS receiver (TrueTime). Power outage in the radon hut? Check if the radon-free-air-supply STATUS still works. |
| Both GPS receivers dead! Power outage in Radon Hut?? |
| The message speaks for itself - I hope... |
| GPS signal error: See shift manual | or |
|---|---|
| GPS lost signal lock: See shift manual |   |
| This means that there was a signal interruption of the primary GPS receiver data. Most likely preceded by one of the above GPS error messages. Check if there's still power in the radon hut... | |
| (function_name) lseek error ... bit3 problem? ... I QUIT! | or |
|---|---|
| (function_name) device read error ... bit3 problem? ... I QUIT! | or |
| (function_name) could not get exclusive lock on device ... I QUIT! | or |
| (function_name) failed to set DMA (PIO) address modifier ... I QUIT! | or |
| (function_name) failed to set data width (page number) ... I QUIT! |   |
| Fatal error(s)! Severe problem with the bit3 device driver! Did someone switch off the OD-DAQ VME crate, or disconnect the bit3 cable? | |
| (function_name) bit3 problem? ... only X bytes read from device. |
| Error with the bit3 device driver. It will probably lead to one of the fatal messages above sooner or later. It should be rare, though, and probably requires power-cycling the OD-DAQ VME crate, and maybe even a reboot of sukant (with "initvme" after the reboot, etc.). |
| DC2 ERROR: Hut N page X not ready! Exiting ... | or |
|---|---|
| ERROR VME/VSB: DC2 N init BAD! I QUIT ... | or |
| ERROR VME/VSB: DC2 N (response to) activate BAD! I QUIT ... | followed by |
| STOP+ABORT run! Push reset buttons on all 4 DC2s at rear of ODDAQ VME crate. | |
Fatal error! There's an initialization problem with one DC2 module or all four DC2 modules. They are probably hung. Usually, stopping and restarting the run should automatically re-initialize the DC2s. If the run restart fails repeatedly, then manually resetting the modules might fix the problem: Go to the center hut, locate the four DC2 modules in the back of the OD-DAQ VME crate (see photo and description), then push the top black button on each of the DC2 modules. You should see the 4 green LEDs on each of the modules cycling on briefly, then off, starting from the bottom to the top. The top green LED should then stay on continuously; the DC2 is then ready for taking data again. [The yellow LEDs should always stay on, I believe...] | |
| Bit-3 access problems! If persists then please stop/abort/initialize. |
| Speaks for itself... Maybe someone turned off the power in the center hut? |
| OD and HE trigger problems!! Please check sukon9 or trigger hardware! |
| The collector didn't see any new event data for more than 30 seconds. Most likely there's no trigger signal coming from the VMETRG module (sukon9 crashed? power failure?). |
| run XXXX is over 24 hours! Please start new run. |
| NEW! Reminder message for forgetful shift people ... shows up within 10 minutes after run time has reached 24 hours - then once per hour again if shift person should have fallen asleep ... :-) |
| Anti-Sorter: could not open NOVA descriptor (ret=X)! I quit! |
| Fatal initialization error! Same as with the anti-collector error message shown above: The nova daemon process (novad) is not running! Therefore, the sorter cannot communicate with the collector and sender. This typically happens if the run control has been reset during calibration runs while sukant was disabled (OFF button selected for sukant). Best if run control is completely restarted again from scratch. |
| High OD/HE trigger rate (XXX Hz)!!! Reaching CPU limits on sukant! |
|
This is a warning message. The current bit3 device and sukant CPU limits allow only an average continuous trigger rate (OD and HE) of approx. 120 Hz. Short trigger bursts below 1 kHz are usually no problem. But if you see this message coming up rapidly and often, then expect high ODDAQ data loss soon. There might be a flasher or maybe even a supernova! Do not stop the run if only this message shows up on the msgctrl window. It could be caused by a supernova, and we do not want to lose important data! Check whether there are other possible errors before deciding to stop the run. |
| LARGE EVENT OFFSET FOUND IN TDC DATA => RE-INITIALIZING DC2/DPMs NOW! |
| Warning message about a TDC data mismatch, possibly caused by a very high-rate trigger burst, e.g. a high SLE to OD/HE ratio with more than 65536 global trigger events within one Nova buffer block. This can lead to a wrong 16-bit event-number rollover count between the four quadrants' data blocks. Another cause could be a briefly malfunctioning DC2 module (e.g. page flipping out of sync with the other three DC2s). The sorter instructs the collector to re-initialize all DC2 modules in an attempt to synchronize the TDC data again. If you see this message repeating, then stop the run and start a new run. |
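The 16-bit rollover issue mentioned above can be illustrated with a short sketch. This is a generic counter-unwrapping routine, not the sorter's actual code; the function name is hypothetical, and it assumes that consecutive samples are fewer than 65536 events apart.

```python
ROLLOVER = 1 << 16  # 65536, the 16-bit event-number range mentioned in the text


def unwrap_event_numbers(raw_numbers):
    """Extend a sequence of 16-bit event numbers to monotonically increasing values.

    Assumes consecutive entries are fewer than 65536 events apart; a decrease
    is interpreted as exactly one rollover.
    """
    unwrapped = []
    rollovers = 0
    previous = None
    for raw in raw_numbers:
        if previous is not None and raw < previous:
            rollovers += 1
        unwrapped.append(raw + rollovers * ROLLOVER)
        previous = raw
    return unwrapped
```

If 65536 or more events pass between two samples, the counter wraps completely and this logic cannot see the gap, so different data blocks can end up with different rollover counts. That is the kind of mismatch a re-initialization is meant to clear.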
| Invalid NOVA buffer header words!! Sender or eventbuilder dead?? | or |
|---|---|
| nova_put() error (status=X)!! ==> sender or eventbuilder dead? |   |
| The current Nova buffer doesn't have the correct anti-collector pattern, but seems to be a feed-back from the anti-sender. This usually means that the anti-sender has died a few moments ago, possibly caused by a dead or malfunctioning eventbuilder or disconnect from sukonh. So, check whether the anti-sender is still alive! Probably, the run needs to be stopped, aborted and then re-initialized. | |
| nova_get() interrupted or no free buffer!! ... Retrying. |
| The sorter cannot get enough Nova buffer space to process more data. A buffer overflow at the anti-sender is a possible cause, usually the result of a very high trigger-rate burst. If it doesn't recover, you will eventually see the following fatal error message: |
| NOVA daemon error!!! ... nova_get() status=X ... I QUIT! |
| Fatal Nova daemon problem which kills the anti-sorter! Rare problem, though. |
| No TDC data in OD DAQ - trying to recover now. | or |
|---|---|
| No OD TDC data in hut N - trying to recover now. |   |
| The current Nova buffer received from the anti-collector has no TDC data from one hut or several huts. If a hut number (e.g. N=2) is identified, and you see this message repeatedly, then go to that hut and power-cycle the Fastbus crate. For instructions, see the "One Fastbus crate might have stopped TDC dataflow?" section in the OD help menu. | |
| Check OD HV X.Y.Z or connector at OD paddle card x.y.z ! | or |
|---|---|
| Check QTC card a.b.c.d! |   |
| I hope the message speaks for itself. It appears when a 12-channel or 16-channel block of consecutive OD PMT channels has no data, a sure sign that either an OD High Voltage channel (hut X, channel Y.Z) has tripped or a cable at the OD paddle card (hut x, crate y, card z) is disconnected. A tripped OD HV channel can be confirmed with "antihv" on sukrfm1. A 16-channel block with empty OD PMT data is usually a sign of a malfunctioning QTC module. | |
| XXX OD PMT 12-channel blocks do not have data! Is OD HV on? | or |
|---|---|
| YYY QTC cards do not have data! Check OD HV or OD electronics! |   |
| This message appears when there are several consecutive OD PMT channels without data in the current run. Please check the "antihv" window on sukrfm1 to see whether there is a problem with OD High Voltage. It is also possible that data cables between the OD paddle cards and QTCs or (rarely) between the QTCs and TDCs were disconnected by mistake. | |
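The check behind the two message groups above (blocks of 12 or 16 consecutive channels without data) can be sketched as follows. This is only an illustration: the function name and the per-channel hit-count list are hypothetical, and the sketch assumes that blocks are aligned to multiples of the block size, which may not match the real channel mapping.

```python
def empty_blocks(hit_counts, block_size):
    """Return start indices of aligned blocks of `block_size` channels with no hits.

    `hit_counts` is a list of per-channel hit counts (hypothetical input format).
    """
    starts = []
    for start in range(0, len(hit_counts) - block_size + 1, block_size):
        if not any(hit_counts[start:start + block_size]):
            starts.append(start)
    return starts
```

Following the text, an empty 12-channel block would point to an OD HV channel or a paddle card connection, and an empty 16-channel block to a QTC module.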
All of the following messages are possible results of a fatal error where the anti-sender is about to die or is being killed. In all cases, the run needs to be aborted and then initialized again before a new run can be started:
| Cannot do nova_open() | or |
|---|---|
| Unrecoverable NOVA daemon error! |   |
| The Nova Daemon (novad) is either not running or was killed during the run. Might need to run cleanup of all sukant processes or maybe even full cleanup. | |
| No ACTIVATE flag |
| Fatal error caused by run control. Very rare, though. |
| Cannot connect server | or |
|---|---|
| Cannot send a server name to the host |   |
| The connection to sukonh was lost! This error seems to have happened quite frequently after a run was stopped since the new network was installed... | |
| Unrecoverable data error! |
| The last data block from the sorter was corrupted in such a bad way that the anti-sender couldn't recover (tried repeatedly before), gave up and died. |
| make_sendbuf(): Cannot send XXX bytes more than YYY byte buffer! ... I QUIT! |
| Rare error. Server event buffer overflow, sender cannot process more data. |
| Can not send data... | or |
|---|---|
| Network error!! Could not send final data! | or |
| Network error at event XXXXXX!! |   |
| Network socket error. Connection to sukonh lost? | |