I need a fresh pair of eyes.
We're using a 15km fibre optic line across which Fibre Channel and 10GbE are multiplexed (passive optical CWDM). For FC we have long-distance lasers rated for up to 40km (Skylane SFCxx0404F0D). The multiplexer is limited by the SFPs, which can do 4Gb Fibre Channel at most. The FC switch is a Brocade 5000 series.
The respective wavelengths are 1550, 1570, 1590 and 1610nm for FC and 1530nm for 10GbE.
The problem is that the 4GbFC fabrics are almost never clean. Sometimes they stay clean for a while, even with a lot of traffic on them. Then they may suddenly start producing errors (RX CRC, RX encoding, RX disparity, ...) even with only marginal traffic. I am attaching some error and traffic graphs. Errors are currently on the order of 50-100 per 5 minutes at 1Gb/s of traffic.
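To put those counters in perspective, here is a rough sketch that turns "50-100 errors per 5 minutes at 1Gb/s" into an approximate bit error ratio. It assumes, as a simplification, that each counted error corresponds to about one corrupted bit and that only the 1Gb/s of traffic counts as transferred bits; the function name is made up for illustration. The 1e-12 figure is the BER objective usually quoted for Fibre Channel links.

```python
# Rough BER estimate from the error counters quoted above.
# Assumptions (not from the switch output): one counted error ~ one bad bit,
# and only the ~1 Gb/s of payload traffic is counted as transferred bits.

def estimate_ber(errors: int, interval_s: float, bit_rate: float) -> float:
    """Return errors divided by the number of bits moved in the interval."""
    return errors / (bit_rate * interval_s)

INTERVAL_S = 5 * 60        # 5-minute sampling window
TRAFFIC_BPS = 1e9          # about 1 Gb/s of traffic
FC_TARGET_BER = 1e-12      # BER objective commonly quoted for FC links

ber_low = estimate_ber(50, INTERVAL_S, TRAFFIC_BPS)
ber_high = estimate_ber(100, INTERVAL_S, TRAFFIC_BPS)

print(f"estimated BER: {ber_low:.1e} to {ber_high:.1e}")
print(f"factor above target: {ber_low / FC_TARGET_BER:.0f}x to "
      f"{ber_high / FC_TARGET_BER:.0f}x")
```

Under these assumptions the link sits a couple of orders of magnitude above the target, so the errors are not just cosmetic noise.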
Optics
Here is the power output of one port summarized (collected using sfpshow on different switches)
Units: uW (microwatt)

        SITE-A                        SITE-B
**********************************************************
FAB1    SW1  TX 1234.3   ->   RX   49.1  SW3   1550nm (ko)
        SW1  RX   95.2   <-   TX 1175.6  SW3
FAB2    SW2  TX 1422.0   ->   RX  104.6  SW4   1610nm (ok)
        SW2  RX   54.3   <-   TX 1468.4  SW4
What I find curious at this point is the asymmetry in the power levels. SW2 transmits at 1422uW, which SW4 receives at 104uW, yet SW2 receives SW4's signal, sent at a similar power, at only 54uW.
The reverse holds for SW1 and SW3.
Anyway, the SFPs have RX sensitivity down to -18dBm (ca. 16uW), so in any case it should be fine... But nothing is.
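For reference, the sfpshow readings can be converted into per-direction loss and margin with a few lines of Python. This is only a sketch: the link labels are my own, the values are the ones from the table above, and -18dBm is the RX sensitivity quoted for the SFPs.

```python
import math

RX_SENSITIVITY_DBM = -18.0  # RX sensitivity quoted for the SFPs

def uw_to_dbm(power_uw: float) -> float:
    """Convert optical power from microwatts to dBm (0 dBm = 1 mW)."""
    return 10 * math.log10(power_uw / 1000.0)

def dbm_to_uw(power_dbm: float) -> float:
    """Convert optical power from dBm to microwatts."""
    return 1000.0 * 10 ** (power_dbm / 10)

def link_report(tx_uw: float, rx_uw: float) -> tuple[float, float]:
    """Return (path loss in dB, margin above RX sensitivity in dB)."""
    loss_db = uw_to_dbm(tx_uw) - uw_to_dbm(rx_uw)
    margin_db = uw_to_dbm(rx_uw) - RX_SENSITIVITY_DBM
    return loss_db, margin_db

# (TX at sender, RX at receiver) in uW, taken from the sfpshow summary
LINKS = {
    "SW1 -> SW3 (1550nm)": (1234.3, 49.1),
    "SW3 -> SW1 (1550nm)": (1175.6, 95.2),
    "SW2 -> SW4 (1610nm)": (1422.0, 104.6),
    "SW4 -> SW2 (1610nm)": (1468.4, 54.3),
}

for name, (tx, rx) in LINKS.items():
    loss, margin = link_report(tx, rx)
    print(f"{name}: loss {loss:5.1f} dB, margin {margin:4.1f} dB")
```

Worked through, this gives roughly 14.0 and 10.9 dB of loss for the two directions of FAB1, and 11.3 and 14.3 dB for FAB2. So each fabric has about 3 dB of asymmetry between directions, and the weakest receiver is still about 5 dB above the stated sensitivity.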
Some SFPs have been diagnosed as malfunctioning by the manufacturer (the 1550nm ones shown above with "ko"). The 1610nm ones apparently are ok, they have been tested using a traffic generator. The leased line has also been tested more than once. All is within tolerances. I'm awaiting the replacements but for some reason I don't believe it will make things better as the apparently good ones don't produce ZERO errors either.
Earlier there was active equipment involved (some kind of 4GFC retimer) before putting the signal on the line. No idea why. That equipment was eliminated because of the problems so we now only have:
the long distance laser in the switch,
(new) 10m LC-SC monomode cable to the mux (for each fabric),
the leased line,
the same thing but reversed on the other side of the link.
FC switches
Here is a port config from the Brocade portcfgshow (it's like that on both sides, obviously)
Area Number: 0
Speed Level: 4G
Fill Word(On Active) 0(Idle-Idle)
Fill Word(Current) 0(Idle-Idle)
AL_PA Offset 13: OFF
Trunk Port ON
Long Distance LS
VC Link Init OFF
Desired Distance 32 Km
Reserved Buffers 70
Locked L_Port OFF
Locked G_Port OFF
Disabled E_Port OFF
Locked E_Port OFF
ISL R_RDY Mode OFF
RSCN Suppressed OFF
Persistent Disable OFF
LOS TOV enable OFF
NPIV capability ON
QOS E_Port OFF
Port Auto Disable: OFF
Rate Limit OFF
EX Port OFF
Mirror Port OFF
Credit Recovery ON
F_Port Buffers OFF
Fault Delay: 0(R_A_TOV)
NPIV PP Limit: 126
CSCTL mode: OFF
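As a sanity check on the "Desired Distance 32 Km" / "Reserved Buffers 70" pair, here is a small sketch of two estimates. The first is a rule of thumb often quoted for Brocade long-distance ports (distance in km times speed in Gb/s, divided by 2, plus 6); treat it as an assumption, not as vendor documentation. The second is a plain physics estimate for the real 15km line, assuming about 5us/km propagation in fibre, full-size 2148-byte frames and the 4.25GBaud 8b/10b line rate of 4GFC. All function names are hypothetical.

```python
import math

def rule_of_thumb_buffers(distance_km: int, speed_gbps: int) -> int:
    """Assumed rule of thumb: distance * speed / 2 + 6 buffers."""
    return distance_km * speed_gbps // 2 + 6

def credits_for_round_trip(distance_km: float,
                           baud_rate: float = 4.25e9,   # 4GFC line rate
                           frame_bytes: int = 2148,     # full-size FC frame
                           us_per_km: float = 5.0) -> int:
    """Credits needed to keep the link full for one round trip.

    Each byte is 10 bits on the wire with 8b/10b encoding; a credit is
    only returned after the frame has travelled out and the R_RDY back.
    """
    frame_time_s = frame_bytes * 10 / baud_rate
    round_trip_s = 2 * distance_km * us_per_km * 1e-6
    return math.ceil(round_trip_s / frame_time_s)

print(rule_of_thumb_buffers(32, 4))    # 70, matches "Reserved Buffers 70"
print(credits_for_round_trip(15))      # 30 for the actual 15 km line
```

If these assumptions hold, 70 reserved buffers is more than twice what a 15km round trip needs with full-size frames, so buffer credit starvation is an unlikely explanation for CRC and encoding errors.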
Forcing the links to 2GbFC produces no errors, but we bought 4GbFC and we want 4GbFC.
I don't know where to look anymore. Any ideas what to try next or how to proceed?
If we can't make 4GbFC work reliably, I wonder what the people working with 8 or 16Gb FC do... I don't assume that "a few errors here and there" are acceptable.
Oh, and BTW, we are in contact with every one of the manufacturers (FC switch, MUX, SFPs, ...). Apart from suggesting that the SFPs be changed (some have been changed before), nobody has a clue. Brocade SAN Health says the fabric is ok. The MUX, well, it's passive, it's only a prism, nature at its best.
Any shots in the dark?
APPENDIX: Answers to your questions
@Chopper3:
This is the second generation of Brocades exhibiting the problem. Before we had 5000s, now we have 5100s.
In the beginning, when we still had the active MUX, we once rented a long-distance laser and put it directly into the switch to run tests for a day. During that day, of course, it was clean. But as I said, sometimes it's clean just like that. And sometimes it's not.
Alternative switches would mean to rebuild the entire SAN with those only to test. Alternative SFPs, well they're hard to come by just like that.
@longneck:
The line is rented. It's a dark fibre (9um monomode) so there's no one else on it.
Sure there are splices. I can't go and look but I have to trust they have been done correctly.
As I said the line has been checked and rechecked (using an optical time-domain reflectometer).
Obviously you don't have all this equipment yourself because it's way too expensive.
@mdpc:
What would be the "wrong" type of cable according to you? Up to the switch everything is monomode, yes. The connectors are the correct ones too. Yeah, I know there are the green ones where the fibre is cut off at a certain angle etc. But we have the correct ones as far as I know.
Progress Report #1
We have had two fabrics (=2x2 switches) with Brocade 5100s with FabricOS 6.4.1 and two fabrics (another 2x4 switches) on FabricOS 7.0.2.
On the long-distance ISLs (one in each fabric) it turned out that with FOS 6.4.1, setting a port to long distance issues warnings about the VC Init setting and consequently the fill word. But those are only warnings. FOS 7.0.2 requires you to modify VCI and the fill word for long-distance links.
Setting FOS 6.4.1 to the LS (long-distance, static distance) setting with the wrong VCI and fill word settings made the whole fabric inoperable (stuck in an SCN loop; use fabriclog -s to see it, as it doesn't show up anywhere else: no port error counters or anything else increasing).
Currently I'm giving the one fabric with the IMHO more correct settings a beating and it seems to do fine, whereas the other one without much traffic still has errors here and there.
In short:
We have eliminated the active part of the MUX (the FC retimer).
We are putting the long-distance SFPs into the end equipment itself.
Just to be sure we bought new monomode cables to connect the end equipment to the remaining passive part of the MUX.
We are now trying out several long distance configs.
It's almost black magic. Everything that happens is mostly empirical; no one seems to have a clue what the exact reasons for doing something are. ("We have tried this, and it didn't work, then we tried that and it worked, so we stuck with that." But no one really seems to know why.)
I'll keep you updated.
Progress Report #2
We got the new lasers for one of the fabrics on warranty. It's ultra clean even on 4GbFC.
They're transmitting with roughly 2mW (3dBm) whereas the others are only at 1.5mW (ca. 1.8dBm), although that should really be enough.
The other fabric (where the lasers are apparently ok) still produces one or two CRCs infrequently.
Using sfpshow, the SFP producing the actual RX errors shows:
Status/Ctrl: 0x82
Alarm flags[0,1] = 0x5, 0x40
Warn Flags[0,1] = 0x5, 0x40
Now I'll have to find out what that means. Not sure if it was there before.
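A first attempt at decoding those values, as a Python sketch. It assumes that sfpshow prints the raw SFF-8472 diagnostic bytes unchanged: Status/Ctrl as byte 110 of the A2h page, and the two alarm (and warning) flag bytes as bytes 112/113 (116/117). That mapping is an assumption on my part, not something confirmed by Brocade documentation; the bit names follow the SFF-8472 layout.

```python
# Bit names per the SFF-8472 layout. Assumption: sfpshow prints these
# bytes raw (Status/Ctrl = byte 110, flags = bytes 112/113 and 116/117).

STATUS_CTRL_BITS = {
    7: "TX Disable state", 6: "Soft TX Disable", 5: "RS(1) state",
    4: "Rate Select state", 3: "Soft Rate Select", 2: "TX Fault",
    1: "RX LOS", 0: "Data Not Ready",
}
FLAG_BYTE0_BITS = {
    7: "Temp High", 6: "Temp Low", 5: "Vcc High", 4: "Vcc Low",
    3: "TX Bias High", 2: "TX Bias Low",
    1: "TX Power High", 0: "TX Power Low",
}
FLAG_BYTE1_BITS = {7: "RX Power High", 6: "RX Power Low"}

def decode(value: int, names: dict[int, str]) -> list[str]:
    """List the names of all set bits, highest bit first."""
    return [names.get(bit, f"reserved bit {bit}")
            for bit in range(7, -1, -1) if value & (1 << bit)]

print(decode(0x82, STATUS_CTRL_BITS))  # ['TX Disable state', 'RX LOS']
print(decode(0x05, FLAG_BYTE0_BITS))   # ['TX Bias Low', 'TX Power Low']
print(decode(0x40, FLAG_BYTE1_BITS))   # ['RX Power Low']
```

If the mapping is right, the flags say TX bias low, TX power low and RX power low, and the status byte says TX disabled plus RX loss of signal. These flags are latched, so this combination may simply stem from an earlier moment when the port was disabled or the link was down, rather than from the current state.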
Well I'll first clear my head with a week of vacation. 8-)