We had several hundred units on which the cable management was done wrong.
Basically, a person on the line decided it was a good idea, for cable management, to zip-tie the fan, power, and I2C cables together.
Question: Why is this wrong, you might ask?
Answer: There are a couple of reasons, listed below.
1) Zip-tying these cables has been shown to put strain on the I2C connector, which can cause a poor connection on one of the pins, leading to intermittent I2C errors.
2) Tying these cables together also runs the I2C cable in parallel with the fan cables, which carry a pulse-width-modulated (PWM) signal to control the fan speed. This switching signal creates electrical noise on the fan wires, so if they run too close in parallel to the I2C cable, that noise can couple over onto the I2C cable and cause intermittent I2C ERRORS.
These intermittent I2C errors caused the Linux I2C service in charge of collecting telemetry data from the switchboard and power supply sensors to get backed up, because its timeout waiting for a response from a sensor was too long. When the I2C service got too backed up, it starved another service, called watchdog, running on the Linux shell of the CPU time it needed to respond to the switch core's watchdog requests. Basically, if the switch core does not get a response from the Linux shell for 1 second, it assumes the Linux shell is locked up and forces a cold reboot of the switch. Keep in mind that 1 second is almost an eternity in computer time.
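To see how a long per-read timeout can snowball into a missed 1-second deadline, here is a minimal back-of-the-envelope sketch. All of the numbers (sensor count, timeout, read time) are hypothetical illustrations, not the switch's actual values:

```python
# Back-of-the-envelope sketch of the starvation effect.
# All numbers are hypothetical -- NOT the switch's actual values.
WATCHDOG_DEADLINE_MS = 1000  # switch core cold-reboots after 1 s of silence
SENSORS = 20                 # sensors polled per telemetry cycle (assumed)
TIMEOUT_MS = 500             # per-read I2C timeout, set too high (assumed)
READ_OK_MS = 5               # time for a successful I2C read (assumed)

def cycle_time_ms(timeout_ms, failed_reads):
    """Worst-case time for one telemetry cycle: each failed read
    blocks for the full timeout; successful reads return quickly."""
    ok_reads = SENSORS - failed_reads
    return failed_reads * timeout_ms + ok_reads * READ_OK_MS

# With no errors the cycle is quick, but just 3 flaky reads push it
# well past the 1000 ms deadline, starving the watchdog responder.
print(cycle_time_ms(TIMEOUT_MS, failed_reads=0))  # 100
print(cycle_time_ms(TIMEOUT_MS, failed_reads=3))  # 1585
```

The point: a handful of intermittent errors per polling cycle is enough to blow past the deadline when each one blocks for the full timeout.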
The problem was not affecting all WS-26-500-DC units, so we were confused and looking in all the wrong areas for the cause of the reboots. At first we thought the CPU was simply over-utilized, since, as we all know, the WS-26-500-DC has the highest CPU utilization because it has to monitor the most sensors. So we released v1.5.0rc1, in which our programmers optimized the code, reducing CPU utilization on all switches by as much as 40%. Sadly this was not the issue, but it was a good thing to do anyway, as it frees up more room to add the future creature features you guys are always asking for. Remember, the CPU that runs the Linux shell handles the UI/CLI, stats collection, daemons, and so on. That CPU has no direct correlation to the amount of data the switch is passing/forwarding, as the switch core has its own CPU for packet forwarding.
Now, version 1.5.0rc1 did add a new feature that helped narrow this down: it reports in the switch log if the reboot was caused by the watchdog. From people posting their logs, we were able to confirm that this was indeed what was happening, but we still did not know why.
Intellipop's Log wrote: Dec 31 19:00:06 netonix: 1.5.0rc1-201803191145 on WS-26-500-DC
Dec 31 19:00:11 system: Setting MAC address from flash configuration: EC:13:B2:06:09:3E
Dec 31 19:00:14 admin: adding lan (eth0) to firewall zone lan
Dec 31 19:00:15 admin: Unable to query power supply
Dec 31 19:00:27 STP: MSTI0: New root on port 2, root path cost is 20000, root bridge id is 32768.64-D1-54-D5-11-AB
Dec 31 19:00:47 UI: i2c error setting 0x47 12 110
Dec 31 19:01:08 UI: i2c error setting 0x47 14 122
Dec 31 19:01:12 dropbear[931]: Running in background
Dec 31 19:01:15 switch[974]: Detected cold (watchdog) boot
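If you want to check a saved copy of your own log for the same signature, a simple grep works. This sketch just uses the excerpt above as sample data; "switch.log" is a hypothetical file name for wherever you saved your log:

```shell
# Write a sample log file (here, two lines from the excerpt above).
cat > switch.log <<'EOF'
Dec 31 19:01:12 dropbear[931]: Running in background
Dec 31 19:01:15 switch[974]: Detected cold (watchdog) boot
EOF

# Search for the watchdog-reboot marker that v1.5.0rc1 logs on boot.
grep -n "Detected cold (watchdog) boot" switch.log
```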
What happened next was pure luck, but it was good luck, so we will take it. We had people basically staring at the Device/Status TABs on several switches we were running tests on, and we happened to notice that every so often we would get an intermittent ERROR on the Board and CPU Temp, as shown in the picture below.
[Image: intermittent ERROR on the Board and CPU Temp fields of the Device/Status TAB]
So we decided to open the chassis and check the I2C cable, and that is when we noticed the cables were tied together, which should not have been the case.
[Image: fan, power, and I2C cables zip-tied together inside the chassis]
So we carefully clipped the ties off and redid the cable management as we had originally intended, without the ties, as shown below.
[Image: corrected cable management with the ties removed]
After making sure the connectors were all seated properly and the cables were run properly, we put the chassis back together, and the intermittent I2C ERRORS went away.
But even with the ERRORS, this should not have caused a watchdog reboot, so our programmers and engineers went to work to find out why. It turns out a simple timeout was set too high. We reduced it, which prevents the I2C service from getting backed up and starving the watchdog responder service of the CPU time it needs to respond to the switch core, which in turn prevents a cold reboot. This software change was released in v1.5.0rc2.
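The idea behind shrinking the timeout can be sketched as a simple budget calculation. Again, these numbers are hypothetical illustrations, not the actual firmware values: pick the per-read timeout so that even if every sensor read in a cycle times out, the telemetry cycle still finishes well under the 1-second watchdog deadline.

```python
# Sizing sketch for the fix -- hypothetical numbers, not the actual
# firmware values. The per-read I2C timeout is chosen so a fully
# failing telemetry cycle still leaves CPU time for the watchdog.
WATCHDOG_DEADLINE_MS = 1000  # switch core deadline (1 second)
SENSORS = 20                 # sensors polled per cycle (assumed)
BUDGET_FRACTION = 0.5        # spend at most half the deadline on I2C (assumed)

# Worst case: all SENSORS reads time out, so the whole cycle takes
# SENSORS * timeout. Solve for the largest timeout that fits the budget.
max_timeout_ms = WATCHDOG_DEADLINE_MS * BUDGET_FRACTION / SENSORS
print(max_timeout_ms)  # 25.0 -> at most 25 ms per read
```

With a bound like this, no amount of intermittent I2C errors can push a cycle past the watchdog deadline.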
Now, even if you have an intermittent I2C ERROR showing on the Device/Status TAB, v1.5.0rc2 "should" prevent any reboots. I still advise you to check for the intermittent ERROR by simply sitting and watching the Device/Status TAB for up to 10 minutes after upgrading to v1.5.0rc2. If you see this ERROR on the CPU or Board Temp, I would schedule a time to fix it as described in the pictures above. You are also welcome to open an RMA and send the switch to us to fix, but we are giving you permission to cut the warranty label if you see this error, and it will not void your warranty on any switch with a manufacture date prior to this post. We have since corrected this and opened every single switch in the warehouse, fixing any that were done wrong, so this will not happen moving forward.
Also, not all switches with tied cables will have this issue. It is random, depending on how the cables were tied, what position they were in when tied, and whether tying the cables put strain on the I2C connectors, causing a poor connection.
So to recap on what you should do if you have a WS-26-500-DC:
Upgrade firmware as soon as possible to v1.5.0rc2 or newer
Check for an intermittent I2C ERROR on the Device/Status TAB for the CPU and/or Board temp.
If you see the ERROR, then either cut the warranty label, open the chassis, and fix it yourself, or get an RMA # for us to fix it.
We are still fine-tuning the firmware on this issue, but we feel that v1.5.0rc2 should prevent any reboots from it. It also lowers the CPU utilization on all models by as much as 40%, which is a good thing.
You can download v1.5.0rc2 HERE.
Please make a post in this thread if you find the telemetry ERROR, letting us know whether fixing the cable management clears it, and, if you were having reboots, whether the firmware upgrade and cable fix made your issue go away.
Sorry for any problems this may have caused you.