I have been told I am dead, so this silo will stop! Found a bug! Will stop. #7505
Very interesting, @t-gena. My first suspicion is that the VMs have inconsistent clocks and the final suspicion vote happened around 5 minutes after the first vote, causing the second voter not to mark the silo as dead (because it considered the first vote not recent enough). Conversely, the silo being voted on may have had a clock set further back in time, considered both votes recent, and therefore concluded that it had been declared dead by the second voter. Are you able to post a dump of the cluster membership table from your database?
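To make that scenario concrete, here is a minimal C# sketch (illustrative only, not Orleans source; the 2-minute vote expiration window is an assumption taken from the discussion below) of how a freshness check against the local clock discards a vote from a peer whose clock lags:

```csharp
using System;

class VoteSkewDemo
{
    // Assumed 2-minute death-vote expiration window (see discussion below).
    static readonly TimeSpan VoteExpiration = TimeSpan.FromMinutes(2);

    // Each silo stamps votes with its OWN wall clock; the tallier compares
    // them against its own "now".
    static bool IsVoteFresh(DateTime voteTimestampUtc, DateTime tallierNowUtc) =>
        tallierNowUtc - voteTimestampUtc < VoteExpiration;

    static void Main()
    {
        var voterAClock = new DateTime(2022, 3, 1, 12, 0, 0, DateTimeKind.Utc);
        var voterBClock = voterAClock.AddMinutes(3); // voter B runs 3 minutes ahead

        // Voter A's vote, stamped moments ago by A's clock, already looks
        // more than 2 minutes old to voter B, so B never reaches the quorum.
        Console.WriteLine(IsVoteFresh(voteTimestampUtc: voterAClock,
                                      tallierNowUtc: voterBClock)); // False

        // Meanwhile, a silo with a lagging clock still sees both votes as
        // fresh and concludes it has been declared dead.
    }
}
```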
Thank you for the hint, @ReubenBond. Indeed, the clocks are not synchronized. Please find the membership table below. Note:
Clock skew likely explains the discrepancy, and I was wrong about the duration. Why is the probe timeout 2s in your case? That's quite aggressive; the default of 5s is already aggressive. Additionally, are your hosts heavily loaded? You shouldn't be seeing this many death votes, and I'm interested to understand more about why you're seeing so many.
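For reference, the probe timeout being discussed is the `ProbeTimeout` setting on `ClusterMembershipOptions`. A minimal sketch of setting it via the standard options pattern (the host bootstrap shown is illustrative):

```csharp
using System;
using Microsoft.Extensions.Hosting;
using Orleans.Configuration;
using Orleans.Hosting;

var host = Host.CreateDefaultBuilder()
    .UseOrleans(silo =>
        silo.Configure<ClusterMembershipOptions>(options =>
        {
            // 2s (the value used in this thread) is aggressive; even the
            // 5s default assumes responsive, lightly loaded hosts.
            options.ProbeTimeout = TimeSpan.FromSeconds(5);
        }))
    .Build();

await host.RunAsync();
```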
I don't know if this is relevant, but in our early stages of using Orleans we got the IP configuration wrong, and since the silos failed to connect to each other, they marked the unreachable silos as dead. When a silo sees its own death mark, it shuts itself down. If this is irrelevant and you can connect from one VM to the other VMs via the advertised addresses, you can ignore this message.
Does that mean that if a silo's clock is behind by 3 minutes, the death vote will expire not in 2 minutes but in 5?
Looks like we misunderstood the meaning of the
Not at that point in time. But death votes came for the silo which was shut down ungracefully, and that behavior I expected. What I did not expect was that the other 2 silos would go down as well.
@zahirtezcan-bugs, this is not our case. The cluster had a valid state (verified via the Dashboard) until the silo which failed right after start joined the cluster.
It depends: how long are the GC pauses? If you expect your process to stop for multiple seconds at a time, you're going to need to increase that probe timeout (not decrease it). You should also expect erratic performance in that case, since there is a very loud noisy neighbor sharing the same runtime.
@ReubenBond thanks for the explanation. Meanwhile, I was able to reproduce the issue (in about 90% of cases) in a more isolated environment.
Steps to reproduce:
Result
Expected
Still not clear
Hope the description above helps. Thank you for the assistance; I'm looking forward to your answers. Edit: extending ProbeTimeout to 5s or 15s did not help.
Do you see any logs pertaining to expired messages? If I'm understanding your testing scenario correctly (essentially unloaded, but with a 3 min time difference), then there may be a bug somewhere in how message expiration is being calculated. Would you mind providing full logs?
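To illustrate the expired-messages hypothesis, here is a minimal sketch (illustrative only, not Orleans internals) of how a 3-minute skew can make a short message TTL appear already expired on arrival:

```csharp
using System;

// Illustrative only, not Orleans internals: the sender stamps a deadline
// with its own clock, and the receiver checks it against a clock that
// runs 3 minutes ahead.
var senderNowUtc = DateTime.UtcNow;
var deadline = senderNowUtc + TimeSpan.FromSeconds(30); // 30s TTL at the sender
var receiverNowUtc = senderNowUtc.AddMinutes(3);        // receiver clock skewed ahead

bool expiredOnArrival = receiverNowUtc > deadline;
Console.WriteLine(expiredOnArrival); // True: the message looks expired on arrival
```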
No, but if a silo process pauses for some period of time, any messages sent to it will not be processed until it resumes. You will therefore find that a portion of placements stall, either because the directory partition for a grain lives on the stalled silo or because placement picked that silo for a new grain, and of course any messages to grains already active on that silo will not be processed until the silo resumes. It's not a great situation to be in, and you generally wouldn't want a production system behaving that way. Regarding the points you are not clear on:
@t-gena I've opened a PR to harden the vote tallying logic against clock skew, but that only solves half of the issue. The bigger question is why you're seeing so many call timeouts to begin with.
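For illustration, one way such hardening can work (a sketch of the idea, not the actual code from the linked PR) is to judge vote freshness against the newest vote in the set rather than the tallier's local clock:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class VoteTally
{
    // Judge freshness against the newest vote in the set rather than the
    // tallier's local clock, so a skewed local clock cannot silently
    // discard other silos' votes.
    public static int CountFreshVotes(
        IReadOnlyList<DateTime> voteTimestampsUtc, TimeSpan expiration)
    {
        if (voteTimestampsUtc.Count == 0) return 0;
        var newest = voteTimestampsUtc.Max();
        return voteTimestampsUtc.Count(t => newest - t < expiration);
    }
}
```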
@ReubenBond thank you for the detailed answers. I will apply the proposed changes and we will do another round of testing. I will provide feedback afterward.
This is good news, thank you. When can we expect it to be merged to main and pushed to NuGet?
How do you propose to tackle it? Will full logs help?
Full logs should help to identify the issue.
Please find the logs attached.
HOST2 in your logs dies because Kestrel is unable to bind to its port: it joins the cluster and seems to crash immediately because of that. The other hosts then have to detect that it's dead using pings, which is where you run into the bug fixed by the PR I linked, caused by the 3 min clock desync. That causes HOST1 to crash, and HOST0 is left unable to vote either of them dead, since one host is not allowed to take down the rest of the cluster. Does that all make sense? Is there a reason the clocks are still out of sync? Is it just for testing?
In production this option will be set to true; it was mainly used during testing.
This is the crash-test scenario where the service cannot start, which leads to an ungraceful shutdown of the silo right after startup. Your explanation of why HOST2 was suspected by the others to be dead makes sense. I also understand that unsynchronized clocks together with
It is just for testing and providing feedback on this issue.
The original issue here has been fixed, so I'll close this. Please open a new issue if you run into something similar in future. |
Good day.
We want to replace Apache.Ignite with Orleans as our horizontal scaling and cluster management framework.
During testing we encountered a problem where 2 silos went down at the same time with the message:
I have been told I am dead, so this silo will stop! Found a bug! Will stop.
The situation.
3 silos were running. Let's name them Silo0, Silo1, Silo2. I have replaced IP addresses in the logs with these values.
Then another silo, Silo4, was started and shut down (not gracefully) almost right away.
As a result, Silo1 and Silo2 went down with the message posted above. In addition, the log of each of them (find the expanded logs below) contained: Silo Silo4 is suspected by 2 which is more or equal than 2, but is not marked as dead. This is a bug!!!
Questions (by priority).
The environment.
Logs.
Unfortunately, the logs for Silo0 are not available (but Silo0 kept running anyway).
Silo1
Silo2
I would appreciate your help.