Update - 1st of September
Here is an update, along with some conclusions, regarding the issues we encountered on 18th of August. You will find at the end of this email a detailed chronology of the outage and of the subsequent investigations.
1 - Outage, troubleshooting and resolution:
There were two issues causing the outage on 18th of August:
- Traffic flooding on several ports, reaching 300Mbps
- Amplification of this flooding by a loop reaching 2Gbps.
After having solved the loop issue, we focused our efforts on understanding the origin of this unicast flooding.
Flooding then reappeared on some ports and, after in-depth investigation, we found the root cause: the unicast flooding was correlated with port flapping.
To prevent the issue from reappearing, we modified the Juniper configurations following the recommendations of Juniper experts. We also reinforced layer-2 filtering, especially on interconnections with partner IXPs.
The situation is now stable and, as stated previously, you can re-enable your BGP sessions (we still see a few BGP sessions down).
During troubleshooting, we deactivated the export of flow statistics. We will schedule a maintenance window shortly to re-enable flow monitoring.
2 - Understanding the issue:
Although the issue has been resolved, we want to understand the behaviour we observed on the Juniper platform, so we are still working closely with Juniper. Two dedicated JTAC engineers are assigned to the case, one on site in our premises and one in Amsterdam, and the Juniper US engineering team is also involved. We will update this ticket whenever useful information can be added.
3 - Action plan:
- Route servers connected to the core infrastructure: to reduce the impact on the route servers should an outage occur, we plan to connect them directly to the VPLS core backbone. They are currently connected to our VM infrastructure.
- Removing abnormal traffic: during the troubleshooting, we observed abnormal traffic sent by some members (proxy ARP, IPv6 Router Advertisements, STP, CDP/FDP, OSPF/IS-IS, etc.). We will contact these members so that this traffic is removed and configurations become homogeneous.
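For illustration, most of this abnormal control-plane traffic can be recognised from the well-known multicast destination MAC addresses these protocols use. The sketch below is a hypothetical classifier, not part of our actual filtering configuration; the function name is illustrative:

```python
# Hypothetical sketch: tag captured frames that carry link-local control
# protocols members should not send towards the exchange fabric.
# Classification here is by well-known destination MAC address only.
KNOWN_CONTROL_MACS = {
    "01:80:c2:00:00:00": "STP",                   # IEEE 802.1D bridge group address
    "01:00:0c:cc:cc:cc": "CDP",                   # Cisco Discovery Protocol (also VTP/DTP)
    "01:e0:52:cc:cc:cc": "FDP",                   # Foundry Discovery Protocol
    "01:80:c2:00:00:14": "IS-IS (AllL1ISs)",
    "01:80:c2:00:00:15": "IS-IS (AllL2ISs)",
    "01:00:5e:00:00:05": "OSPF (AllSPFRouters)",  # maps from 224.0.0.5
    "01:00:5e:00:00:06": "OSPF (AllDRouters)",    # maps from 224.0.0.6
    "33:33:00:00:00:01": "IPv6 RA (all-nodes)",   # maps from ff02::1
}

def classify_frame(dst_mac):
    """Return the protocol name if the destination MAC matches a known
    control protocol, else None (i.e. regular member traffic)."""
    return KNOWN_CONTROL_MACS.get(dst_mac.lower())
```

Proxy ARP cannot be spotted from the destination MAC alone (replies are plain unicast ARP), so a real filter would also need to inspect the ARP payload.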
First Report - 23rd of August
On Monday 18th of August, around 12:00, we started receiving "broadcast" traffic (traffic that was replicated on several ports). This traffic was amplified by a loop and reached 2Gbps. After the loop was stopped, the traffic decreased significantly. We then had to change the way the unwanted broadcast traffic was filtered, as the previously configured filters were not working as they should.
We contacted Juniper TAC about this issue and opened several cases to understand why the traffic was not properly blocked, either by the filters applied globally to the VPLS instance or by the filters applied on member ports.
Some unwanted traffic reappeared during the week, and we found a workaround to stop it immediately: clearing entries from the VPLS MAC table (an operation with no impact on member traffic). After further investigation with Juniper TAC, and after collecting traffic captures, we determined that the unwanted traffic was unicast (not broadcast) traffic sent by our Juniper equipment towards some interfaces; it was flooded as unknown unicast even though the destination MAC address was known in the VPLS table.
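For reference, on Junos-based platforms this kind of workaround is typically performed with operational-mode commands along the following lines. This is an illustrative sketch only: the exact syntax depends on platform and Junos version, and the instance name below is a placeholder, not our actual configuration:

```
show vpls mac-table instance VPLS-INSTANCE     # inspect learned MAC entries
clear vpls mac-table instance VPLS-INSTANCE    # flush entries; relearned from live traffic
```

Because the entries are immediately relearned from live traffic, flushing them does not disrupt member forwarding.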
We are working closely with the Juniper engineering team to determine whether this problem is already known, and we hold conference calls with them several times a day. In the meantime, to prevent the issue from recurring, we applied the following workaround: a script installed on our Juniper equipment automatically stops the unwanted traffic when it appears.
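Conceptually, such a script is a simple detection loop. The sketch below is an illustrative approximation in Python, not the actual script deployed on our routers (whose remediation step clears the VPLS MAC table through the router's management interface); the threshold, sample count and interface names are all assumptions:

```python
# Illustrative detection logic only. Assumption: per-interface output
# rates are polled periodically, in bits per second.
FLOOD_THRESHOLD_BPS = 10_000_000  # 10Mbps, the lower bound of the flooding observed
SUSTAINED_SAMPLES = 3             # require consecutive samples to avoid false positives

def interfaces_flooding(rates_bps, threshold=FLOOD_THRESHOLD_BPS,
                        sustained=SUSTAINED_SAMPLES):
    """Return the interfaces whose last `sustained` rate samples all exceed
    `threshold`, i.e. likely victims of unknown-unicast flooding."""
    flagged = []
    for ifname, samples in rates_bps.items():
        recent = samples[-sustained:]
        if len(recent) == sustained and all(r > threshold for r in recent):
            flagged.append(ifname)
    return flagged

def remediate(ifname):
    # Stub: in production this would trigger the MAC-table clearing
    # workaround via the router's management API.
    print(f"flooding detected on {ifname}, clearing VPLS MAC table")
```

Requiring several consecutive samples above the threshold trades a few seconds of reaction time for robustness against short legitimate traffic bursts.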
We apologize again for these issues, and we would like to thank the France-IX members who sent us data to help identify the problem.
We are still working with Juniper and will send a full report later on.
Detailed chronology:
Monday 18th of August, around 12:00: traffic flooding appeared on multiple interfaces, reaching 300Mbps per member.
Monday 18th of August, at 15:02: a loop appeared on the network and amplified the 300Mbps of flooded traffic to 2Gbps. The France-IX LAN was also impacted, and sessions with the route servers were flapping.
Monday 18th of August, around 17:00: we stopped the loop, but 300Mbps of flooding remained. By applying modifications to the filters, we managed to stop the flooding around 18:00.
Monday 18th of August, at 19:00: three Juniper cases were opened to understand the behaviour of the filters and of the platform during this outage. We held at least two conference calls a day to exchange information and track the progress of the investigations.
From Tuesday 19th to Monday 25th of August: we again observed flooding (unicast traffic, between 10Mbps and 100Mbps) on some ports at the Telehouse2 and Interxion2 PoPs.
Thursday 21st of August: we installed two sniffers in order to capture and analyse this abnormal traffic.
Conclusions: after ruling out several possibilities, we finally correlated the captured traffic with some flapping ports. As soon as we moved these ports to new interfaces, the flooding disappeared. Following Juniper's advice, we applied configuration changes so that this kind of issue is avoided even when ports flap.
We are still working with Juniper to understand some behaviours on the platform.
PS: We described above only the steps that led to resolving the issue. We also explored other avenues and made other modifications to the platform (such as removing the IPFIX configuration, modifying filtering rules and removing VLAN configurations); they are not detailed here as they turned out not to be relevant.