
I have been told I am dead, so this silo will stop! Found a bug! Will stop. #7505

Closed
t-gena opened this issue Jan 21, 2022 · 16 comments
Assignees: ReubenBond
Labels: area-silos (category for all the silos related issues)
Milestone: Discussion
@t-gena commented Jan 21, 2022

Good day.

We want to replace Apache.Ignite with Orleans as our horizontal-scaling and cluster-management framework.

While testing, we ran into a problem where 2 silos went down at the same time with the message:
I have been told I am dead, so this silo will stop! Found a bug! Will stop.

The situation.

3 silos were running. Let's name them Silo0, Silo1, Silo2. I have replaced IP addresses in the logs with these values.
Then another silo, Silo4, was started and shut down (not gracefully) almost right away.
As a result, Silo1 and Silo2 went down with the message posted above. In addition, the log of each of them (see the expanded logs below) contained: Silo Silo4 is suspected by 2 which is more or equal than 2, but is not marked as dead. This is a bug!!!

Questions (by priority).

  1. According to the message, this is a bug. What caused it? Is it a misconfiguration on our side?
  2. Who told Silo1 & Silo2 to die? As per the configuration, there should be at least 2 votes to kick another silo out of the cluster.
  3. What is the state of the cluster while Silo4 is down ungracefully? Does it mean that no new grains can be allocated/called until Silo4's status changes to Dead? We call grains using IClusterClient.
  4. Is there any good explanation/documentation of how it is decided to exclude silo(s) from the cluster (e.g. 2 data centers with an equal number of silos lose connectivity to each other)?

The environment.

  • Each silo is hosted on a separate VM.
  • VMs are hosted on-premises.
  • AdoNetClustering is used (w/ MS SQL Server).
  • Silos use the same configuration (full dump below; a minimal wiring sketch follows the dump):
Configuration Orleans.Configuration.ActivationCountBasedPlacementOptions: 
ChooseOutOf: 2

Configuration Orleans.Configuration.ClusterMembershipOptions: 
NumMissedTableIAmAliveLimit: 2
LivenessEnabled: True
ProbeTimeout: 00:00:02
TableRefreshTimeout: 00:01:00
DeathVoteExpirationTimeout: 00:02:00
IAmAliveTablePublishTimeout: 00:05:00
MaxJoinAttemptTime: 00:05:00
ValidateInitialConnectivity: False
UseLivenessGossip: True
NumProbedSilos: 3
NumMissedProbesLimit: 3
NumVotesForDeathDeclaration: 2
DefunctSiloExpiration: 7.00:00:00
DefunctSiloCleanupPeriod: 01:00:00
IsRunningAsUnitTest: False
LocalHealthDegradationMonitoringPeriod: 00:00:10
ExtendProbeTimeoutDuringDegradation: False
EnableIndirectProbes: False

Configuration Orleans.Configuration.ClusterOptions: 
ClusterId: Obsidian
ServiceId: OVALiS

Configuration Orleans.Configuration.ConnectionOptions: 
ProtocolVersion: Version1
ConnectionsPerEndpoint: 1
ConnectionRetryDelay: 00:00:01
OpenConnectionTimeout: 00:00:05

Configuration Orleans.Configuration.ConsistentRingOptions: 
NumVirtualBucketsConsistentRing: 30
UseVirtualBucketsConsistentRing: True

Configuration Orleans.Configuration.EndpointOptions: 
AdvertisedIPAddress: ....
SiloPort: 8081
GatewayPort: 30001
SiloListeningEndpoint: 
GatewayListeningEndpoint: 

Configuration Orleans.Configuration.GrainCollectionOptions: 
CollectionQuantum: 00:02:00
CollectionAge: 02:00:00
ClassSpecificCollectionAge.Qoniac.Ovalis.Core.Server.Clustering.Orleans.Grains.AffinitySiloAddressGrain: 2.00:00:00
DeactivationTimeout: 00:00:30

Configuration Orleans.Configuration.GrainDirectoryOptions: 
CachingStrategy: Adaptive
CacheSize: 1000000
InitialCacheTTL: 00:00:30
MaximumCacheTTL: 00:04:00
CacheTTLExtensionFactor: 2
LazyDeregistrationDelay: 00:01:00

Configuration Orleans.Configuration.GrainVersioningOptions: 
DefaultCompatibilityStrategy: BackwardCompatible
DefaultVersionSelectorStrategy: AllCompatibleVersions

Configuration Orleans.Configuration.LoadSheddingOptions: 
LoadSheddingEnabled: False
LoadSheddingLimit: 95

Configuration Orleans.Configuration.PerformanceTuningOptions: 
DefaultConnectionLimit: 200
Expect100Continue: False
UseNagleAlgorithm: False
MinDotNetThreadPoolSize: 0
MinIOThreadPoolSize: 0

Configuration Orleans.Configuration.SchedulingOptions: 
PerformDeadlockDetection: False
AllowCallChainReentrancy: True
MaxActiveThreads: 4
DelayWarningThreshold: 00:00:10
ActivationSchedulingQuantum: 00:00:00.1000000
TurnWarningLengthThreshold: 00:00:00.2000000
MaxPendingWorkItemsSoftLimit: 0
EnableWorkerThreadInjection: False
StoppedActivationWarningInterval: 00:01:00

Configuration Orleans.Configuration.SerializationProviderOptions: 
FallbackSerializationProvider: 
MaxSustainedSerializationContextCapacity: 64

Configuration Orleans.Configuration.SiloMessagingOptions: 
SiloSenderQueues: 4
GatewaySenderQueues: 4
MaxForwardCount: 2
ClientDropTimeout: 00:01:00
ClientRegistrationRefresh: 00:05:00
ClientGatewayShutdownNotificationTimeout: 00:00:05
MaxEnqueuedRequestsSoftLimit: 0
MaxEnqueuedRequestsHardLimit: 0
MaxEnqueuedRequestsSoftLimit_StatelessWorker: 0
MaxEnqueuedRequestsHardLimit_StatelessWorker: 0
MaxRequestProcessingTime: 02:00:00
AssumeHomogenousSilosForTesting: False
DEFAULT_SHUTDOWN_REROUTE_TIMEOUT: 00:00:10
ShutdownRerouteTimeout: 00:00:10
SystemResponseTimeout: 00:00:30
GrainWorkloadAnalysisPeriod: 00:00:05
RequestProcessingWarningTime: 00:00:05
RequestQueueDelayWarningTime: 00:00:10
ResponseTimeout: 00:00:30
ResponseTimeoutWithDebugger: 00:30:00
DropExpiredMessages: True
BufferPoolBufferSize: 4096
BufferPoolMaxSize: 10000
BufferPoolPreallocationSize: 250
PropagateActivityId: False
LargeMessageWarningThreshold: 85000
MaxMessageHeaderSize: 10485760
MaxMessageBodySize: 104857600

Configuration Orleans.Configuration.StatisticsOptions: 
PerfCountersWriteInterval: 00:00:30
LogWriteInterval: 00:05:00
CollectionLevel: Info

Configuration Orleans.Configuration.TelemetryOptions: 
Consumers.0: OrleansDashboard.DashboardTelemetryConsumer

Configuration Orleans.Configuration.TypeManagementOptions: 
TypeMapRefreshInterval: 00:01:00
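For reference, a minimal sketch (not taken from this issue) of how options like the ones dumped above are typically applied on the silo host builder; `siloBuilder`, the ADO.NET invariant, and the connection string are placeholders rather than values from this report:

```csharp
using System;
using Orleans.Configuration;
using Orleans.Hosting;

// Hedged sketch only: shows how the reported options map onto silo configuration,
// e.g. inside Host.CreateDefaultBuilder().UseOrleans(siloBuilder => { ... }).
siloBuilder
    .UseAdoNetClustering(options =>
    {
        options.Invariant = "System.Data.SqlClient";            // placeholder
        options.ConnectionString = "<ms-sql-connection-string>"; // placeholder
    })
    .Configure<ClusterOptions>(options =>
    {
        options.ClusterId = "Obsidian";
        options.ServiceId = "OVALiS";
    })
    .Configure<ClusterMembershipOptions>(options =>
    {
        options.ProbeTimeout = TimeSpan.FromSeconds(2);  // reported value; the default is 5s
        options.NumVotesForDeathDeclaration = 2;
        options.ValidateInitialConnectivity = false;     // discussed further down in the thread
    })
    .ConfigureEndpoints(siloPort: 8081, gatewayPort: 30001);
```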

Logs.

Unfortunately, the logs for Silo0 are not available (but Silo0 kept running anyway).

Silo1

System.OperationCanceledException: The ping attempt was cancelled after 00:00:02.0001725. Ping #2
2022-01-14 08:39:25.588[  31] WARN  Orleans.Runtime.MembershipService.SiloHealthMonitor Did not get response for probe #3 to silo Silo4 after 00:00:02.0194873. Current number of consecutive failed probes is 2
System.OperationCanceledException: The ping attempt was cancelled after 00:00:02.0194771. Ping #3
2022-01-14 08:39:27.604[ 116] WARN  Orleans.Runtime.MembershipService.SiloHealthMonitor Did not get response for probe #4 to silo Silo4 after 00:00:02.0143516. Current number of consecutive failed probes is 3
System.OperationCanceledException: The ping attempt was cancelled after 00:00:02.0143435. Ping #4
2022-01-14 08:39:27.611[  31] INFO  Orleans.Runtime.MembershipService.MembershipTableManager -Putting my vote to mark silo Silo4 as DEAD #2. Previous suspect list is [], trying to update to [<Silo1, 2022-01-14 07:39:27.611 GMT>], eTag=, freshVotes is []
2022-01-14 08:39:27.627[ 116] INFO  Orleans.Runtime.MembershipService.MembershipTableManager ProcessTableUpdate (called from TrySuspectOrKill) membership table: 8 silos, 4 are Active, 4 are Dead, Version=<2133, 2133>. All silos: [SiloAddress=Silo2 SiloName=Silo_56215 Status=Active, SiloAddress=Silo1 SiloName=Silo_9da6a Status=Active, SiloAddress=Silo0 SiloName=Silo_bece0 Status=Active, SiloAddress=Silo4 SiloName=Silo_c21e1 Status=Active, SiloAddress=S10.147.58.121:8080:377702519 SiloName=Silo_99ab1 Status=Dead, SiloAddress=S10.147.58.108:8082:375974107 SiloName=Silo_db65e Status=Dead, SiloAddress=S10.147.58.108:8080:375974221 SiloName=Silo_06aa2 Status=Dead, SiloAddress=S10.147.58.108:8081:377439496 SiloName=Silo_ab05c Status=Dead]
2022-01-14 08:39:27.627[ 116] INFO  Orleans.Runtime.MembershipService.MembershipGossiper Gossiping Silo1 status Active to 3 partners
2022-01-14 08:39:29.629[  69] INFO  Orleans.Networking Establishing connection to endpoint Silo4
2022-01-14 08:39:31.645[  66] WARN  Orleans.Runtime.MembershipService.SiloHealthMonitor Did not get response for probe #6 to silo Silo4 after 00:00:02.0158648. Current number of consecutive failed probes is 1
System.OperationCanceledException: The ping attempt was cancelled after 00:00:02.0158526. Ping #6
2022-01-14 08:39:32.707[  66] WARN  Orleans.Runtime.MembershipService.SiloHealthMonitor Did not get response for probe #34772 to silo Silo2 after 00:00:02.0156488. Current number of consecutive failed probes is 1
System.OperationCanceledException: The ping attempt was cancelled after 00:00:02.0156413. Ping #34772
2022-01-14 08:39:33.660[  69] WARN  Orleans.Runtime.MembershipService.SiloHealthMonitor Did not get response for probe #7 to silo Silo4 after 00:00:02.0142774. Current number of consecutive failed probes is 2
System.OperationCanceledException: The ping attempt was cancelled after 00:00:02.0142701. Ping #7
2022-01-14 08:39:34.646[  66] WARN  Orleans.Networking Connection attempt to endpoint Silo4 failed
System.Threading.Tasks.TaskCanceledException: A task was canceled.
   at Orleans.Internal.OrleansTaskExtentions.MakeCancellable[T](Task`1 task, CancellationToken cancellationToken)
   at Orleans.Runtime.Messaging.ConnectionManager.ConnectAsync(SiloAddress address)
2022-01-14 08:39:34.665[ 116] WARN  Orleans.Runtime.DeploymentLoadPublisher An exception was thrown by PublishStatistics.UpdateRuntimeStatistics(). Ignoring.
Orleans.Runtime.OrleansMessageRejectionException: Exception while sending message: Orleans.Runtime.Messaging.ConnectionFailedException: Unable to connect to endpoint Silo4. See InnerException
 ---> System.Threading.Tasks.TaskCanceledException: A task was canceled.
   at Orleans.Internal.OrleansTaskExtentions.MakeCancellable[T](Task`1 task, CancellationToken cancellationToken)
   at Orleans.Runtime.Messaging.ConnectionManager.ConnectAsync(SiloAddress address)
   --- End of inner exception stack trace ---
   at Orleans.Runtime.Messaging.ConnectionManager.ConnectAsync(SiloAddress address)
   at Orleans.Runtime.Messaging.ConnectionManager.GetConnectionAsync(SiloAddress endpoint)
   at Orleans.Runtime.Messaging.OutboundMessageQueue.<SendMessage>g__SendAsync|10_0(ValueTask`1 c, Message m)
   at Orleans.Runtime.DeploymentLoadPublisher.PublishStatistics(Object _)
2022-01-14 08:39:34.666[  66] WARN  Orleans.Runtime.MembershipService.SiloHealthMonitor Did not get response for probe #8 to silo Silo4 after 00:00:01.0045517. Current number of consecutive failed probes is 3
Orleans.Runtime.OrleansMessageRejectionException: Exception while sending message: Orleans.Runtime.Messaging.ConnectionFailedException: Unable to connect to endpoint Silo4. See InnerException
 ---> System.Threading.Tasks.TaskCanceledException: A task was canceled.
   at Orleans.Internal.OrleansTaskExtentions.MakeCancellable[T](Task`1 task, CancellationToken cancellationToken)
   at Orleans.Runtime.Messaging.ConnectionManager.ConnectAsync(SiloAddress address)
   --- End of inner exception stack trace ---
   at Orleans.Runtime.Messaging.ConnectionManager.ConnectAsync(SiloAddress address)
   at Orleans.Runtime.Messaging.ConnectionManager.GetConnectionAsync(SiloAddress endpoint)
   at Orleans.Runtime.Messaging.OutboundMessageQueue.<SendMessage>g__SendAsync|10_0(ValueTask`1 c, Message m)
   at Orleans.Runtime.Scheduler.AsyncClosureWorkItem.Execute()
   at Orleans.Runtime.MembershipService.SiloHealthMonitor.ProbeDirectly(CancellationToken cancellation)
**2022-01-14 08:39:34.669[  66] ERROR Orleans.Runtime.MembershipService.MembershipTableManager -Silo Silo4 is suspected by 2 which is more or equal than 2, but is not marked as dead. This is a bug!!!
2022-01-14 08:39:34.669[  66] ERROR Orleans.Runtime.MembershipService.MembershipTableManager I have been told I am dead, so this silo will stop! Found a bug! Will stop.**
2022-01-14 08:39:34.670[  66] ERROR Orleans.Runtime.FatalErrorHandler Fatal error from Orleans.Runtime.MembershipService.MembershipTableManager. Context: I have been told I am dead, so this silo will stop! Found a bug! Will stop.
2022-01-14 08:39:34.723[ 145] WARN  Orleans.Runtime.MembershipService.SiloHealthMonitor Did not get response for probe #34773 to silo Silo2 after 00:00:02.0143580. Current number of consecutive failed probes is 2
System.OperationCanceledException: The ping attempt was cancelled after 00:00:02.0143493. Ping #34773

Silo2

2022-01-14 08:39:24.433[ 197] WARN  Orleans.Runtime.MembershipService.SiloHealthMonitor Did not get response for probe #2 to silo Silo4 after 00:00:02.0000697. Current number of consecutive failed probes is 1
System.OperationCanceledException: The ping attempt was cancelled after 00:00:02.0000616. Ping #2
2022-01-14 08:39:26.433[  14] WARN  Orleans.Runtime.MembershipService.SiloHealthMonitor Did not get response for probe #3 to silo Silo4 after 00:00:01.9989425. Current number of consecutive failed probes is 2
System.OperationCanceledException: The ping attempt was cancelled after 00:00:01.9989339. Ping #3
2022-01-14 08:39:27.647[ 197] INFO  Orleans.Runtime.MembershipService.MembershipTableManager Received cluster membership snapshot via gossip: [Version: 2133, 27 silos, SiloAddress=S10.147.58.121:8080:377698378 SiloName=Silo_0d93d Status=Dead, SiloAddress=S10.147.58.121:8080:377702519 SiloName=Silo_99ab1 Status=Dead, SiloAddress=Silo2 SiloName=Silo_56215 Status=Active, SiloAddress=S10.147.58.121:8080:376649749 SiloName=Silo_bf2c0 Status=Dead, SiloAddress=S10.147.58.121:8080:376578011 SiloName=Silo_7b29e Status=Dead, SiloAddress=S10.147.58.121:8080:376829867 SiloName=Silo_63121 Status=Dead, SiloAddress=S10.147.58.121:8080:377439366 SiloName=Silo_c81b6 Status=Dead, SiloAddress=S10.147.58.121:8080:377262083 SiloName=Silo_dfa9a Status=Dead, SiloAddress=S10.147.58.121:8080:377254565 SiloName=Silo_ecc0c Status=Dead, SiloAddress=S10.147.58.121:8080:375973709 SiloName=Silo_255c7 Status=Dead, SiloAddress=S10.147.58.121:8080:376237638 SiloName=Silo_8c4b5 Status=Dead, SiloAddress=Silo4 SiloName=Silo_c21e1 Status=Active, SiloAddress=S10.147.58.108:8081:375974016 SiloName=Silo_fe14c Status=Dead, SiloAddress=S10.147.58.108:8082:375974107 SiloName=Silo_db65e Status=Dead, SiloAddress=S10.147.58.108:8080:375974221 SiloName=Silo_06aa2 Status=Dead, SiloAddress=S10.147.58.108:8081:375974452 SiloName=Silo_b41ca Status=Dead, SiloAddress=S10.147.58.108:8081:375974516 SiloName=Silo_124ff Status=Dead, SiloAddress=S10.147.58.108:8081:376237796 SiloName=Silo_e0633 Status=Dead, SiloAddress=S10.147.58.108:8081:377254729 SiloName=Silo_ea7d4 Status=Dead, SiloAddress=S10.147.58.108:8081:377261539 SiloName=Silo_081bd Status=Dead, SiloAddress=S10.147.58.108:8081:377439496 SiloName=Silo_ab05c Status=Dead, SiloAddress=S10.147.58.108:8081:376660276 SiloName=Silo_f46c6 Status=Dead, SiloAddress=S10.147.58.108:8081:376649335 SiloName=Silo_796ac Status=Dead, SiloAddress=S10.147.58.108:8081:376670382 SiloName=Silo_66852 Status=Dead, SiloAddress=S10.147.58.108:8081:376830961 SiloName=Silo_bae9b Status=Dead, SiloAddress=Silo1 SiloName=Silo_9da6a Status=Active, SiloAddress=Silo0 SiloName=Silo_bece0 Status=Active]
2022-01-14 08:39:27.909[ 159] INFO  Orleans.Runtime.MembershipService.MembershipTableManager Received cluster membership snapshot via gossip: [Version: 2134, 27 silos, SiloAddress=S10.147.58.121:8080:377698378 SiloName=Silo_0d93d Status=Dead, SiloAddress=S10.147.58.121:8080:377702519 SiloName=Silo_99ab1 Status=Dead, SiloAddress=Silo2 SiloName=Silo_56215 Status=Active, SiloAddress=S10.147.58.121:8080:376649749 SiloName=Silo_bf2c0 Status=Dead, SiloAddress=S10.147.58.121:8080:376578011 SiloName=Silo_7b29e Status=Dead, SiloAddress=S10.147.58.121:8080:376829867 SiloName=Silo_63121 Status=Dead, SiloAddress=S10.147.58.121:8080:377439366 SiloName=Silo_c81b6 Status=Dead, SiloAddress=S10.147.58.121:8080:377262083 SiloName=Silo_dfa9a Status=Dead, SiloAddress=S10.147.58.121:8080:377254565 SiloName=Silo_ecc0c Status=Dead, SiloAddress=S10.147.58.121:8080:375973709 SiloName=Silo_255c7 Status=Dead, SiloAddress=S10.147.58.121:8080:376237638 SiloName=Silo_8c4b5 Status=Dead, SiloAddress=Silo4 SiloName=Silo_c21e1 Status=Active, SiloAddress=S10.147.58.108:8081:375974016 SiloName=Silo_fe14c Status=Dead, SiloAddress=S10.147.58.108:8082:375974107 SiloName=Silo_db65e Status=Dead, SiloAddress=S10.147.58.108:8080:375974221 SiloName=Silo_06aa2 Status=Dead, SiloAddress=S10.147.58.108:8081:375974452 SiloName=Silo_b41ca Status=Dead, SiloAddress=S10.147.58.108:8081:375974516 SiloName=Silo_124ff Status=Dead, SiloAddress=S10.147.58.108:8081:376237796 SiloName=Silo_e0633 Status=Dead, SiloAddress=S10.147.58.108:8081:377254729 SiloName=Silo_ea7d4 Status=Dead, SiloAddress=S10.147.58.108:8081:377261539 SiloName=Silo_081bd Status=Dead, SiloAddress=S10.147.58.108:8081:377439496 SiloName=Silo_ab05c Status=Dead, SiloAddress=S10.147.58.108:8081:376660276 SiloName=Silo_f46c6 Status=Dead, SiloAddress=S10.147.58.108:8081:376649335 SiloName=Silo_796ac Status=Dead, SiloAddress=S10.147.58.108:8081:376670382 SiloName=Silo_66852 Status=Dead, SiloAddress=S10.147.58.108:8081:376830961 SiloName=Silo_bae9b Status=Dead, SiloAddress=Silo1 SiloName=Silo_9da6a Status=Active, SiloAddress=Silo0 SiloName=Silo_bece0 Status=Active]
2022-01-14 08:39:28.434[  60] WARN  Orleans.Runtime.MembershipService.SiloHealthMonitor Did not get response for probe #4 to silo Silo4 after 00:00:01.9997111. Current number of consecutive failed probes is 3
System.OperationCanceledException: The ping attempt was cancelled after 00:00:01.9997033. Ping #4
**2022-01-14 08:39:28.437[  60] ERROR Orleans.Runtime.MembershipService.MembershipTableManager -Silo Silo4 is suspected by 2 which is more or equal than 2, but is not marked as dead. This is a bug!!!
2022-01-14 08:39:28.443[  60] ERROR Orleans.Runtime.MembershipService.MembershipTableManager I have been told I am dead, so this silo will stop! Found a bug! Will stop.**
2022-01-14 08:39:28.449[  60] ERROR Orleans.Runtime.FatalErrorHandler Fatal error from Orleans.Runtime.MembershipService.MembershipTableManager. Context: I have been told I am dead, so this silo will stop! Found a bug! Will stop.

I would appreciate your help.

ghost added the Needs: triage 🔍 label on Jan 21, 2022
@ReubenBond (Member)

Very interesting, @t-gena. My first suspicion is that the VMs have inconsistent clocks and that the final suspicion vote happened around 5 minutes after the first vote, causing the second voter not to mark the silo as dead (because it considered the first vote not recent enough). Conversely, the silo being voted on may have had a clock set further back in time, so it considered both votes recent and concluded that it should already have been declared dead by the second voter.

Are you able to post a dump of the cluster membership table from your database?

@t-gena (Author) commented Jan 21, 2022

Thank you for the hint @ReubenBond.

Indeed, the clocks are not synchronized:
Silo0 = Silo4
Silo1 = Silo2 = Silo0 - 3 minutes, which is still less than 5 minutes.

Please find the membership table below. Note:

  • Silo4 was restarted after the first failure.
  • The times in the logs in the initial post are GMT+1.
| Address | Port | Generation | SiloName | Status | ProxyPort | SuspectTimes | StartTime | IAmAliveTime |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Silo4 | 8080 | 379842136 | Silo_c21e1 | 6 | 30000 | Silo1:8081@379772351, 2022-01-14 07:39:27.611 GMT; Silo0:8081@379770381, 2022-01-14 07:49:58.805 GMT; Silo4:8084@379842593, 2022-01-14 07:50:01.340 GMT | 2022-01-14 07:42:16.453 | 2022-01-14 07:42:18.413 |
| Silo4 | 8084 | 379842266 | Silo_c4603 | 6 | 30000 | Silo0:8081@379770381, 2022-01-14 07:49:52.586 GMT; Silo4:8084@379842593, 2022-01-14 07:49:54.277 GMT | 2022-01-14 07:44:26.573 | 2022-01-14 07:44:30.613 |
| Silo4 | 8084 | 379842593 | Silo_bfd95 | 6 | 30004 | Silo0:8081@379770381, 2022-01-14 08:18:44.250 GMT | 2022-01-14 07:49:53.477 | 2022-01-14 08:14:56.637 |
| Silo0 | 8081 | 379770381 | Silo_bece0 | 6 | 30001 | Silo0:8081@379770381, 2022-01-15 14:30:54.273 GMT | 2022-01-13 11:46:21.767 | 2022-01-15 14:30:54.273 |
| Silo1 | 8081 | 379772351 | Silo_9da6a | 6 | 30001 | Silo0:8081@379770381, 2022-01-14 07:50:01.571 GMT; Silo4:8084@379842593, 2022-01-14 07:50:03.707 GMT | 2022-01-13 12:19:11.753 | 2022-01-14 07:39:12.727 |
| Silo2 | 8080 | 377774428 | Silo_56215 | 6 | 30000 | Silo0:8081@379770381, 2022-01-14 07:49:59.555 GMT; Silo4:8084@379842593, 2022-01-14 07:50:03.517 GMT | 2021-12-21 09:20:29.103 | 2022-01-14 07:35:32.653 |

@ReubenBond (Member)

Clock sync likely explains the discrepancy. I was wrong about the duration: DeathVoteExpirationTimeout is the relevant value and the default is 2 minutes.

Why is the probe timeout 2s in your case? That's quite aggressive. The default of 5s is already aggressive. Additionally, are your hosts heavily loaded? You shouldn't be seeing this many death votes. I'm interested to understand more about why you're seeing so many votes.

ReubenBond self-assigned this on Jan 21, 2022
@zahirtezcan-bugs

I don't know if this is relevant, but in our early stages of using Orleans we got the IP configuration wrong; since the silos failed to connect to each other, they marked the unreachable silos as dead. When a silo sees the death mark, it shuts itself down.

If this is irrelevant and you can connect from within one VM to the other VMs via the advertised addresses, then you can ignore this message.

@t-gena (Author) commented Jan 21, 2022

DeathVoteExpirationTimeout is the relevant value and the default is 2 minutes.

Does that mean that if a silo's clock is 3 minutes behind, the death vote will expire not in 2 minutes but in 5?

Why is the probe timeout 2s in your case? That's quite aggressive. The default of 5s is already aggressive.

It looks like we misunderstood the meaning of ProbeTimeout; we treated it as 'the timeout of an individual ping, after which it fails'.
To test failover, we correlate it with NumMissedProbesLimit when a node goes down ungracefully. It appeared that during the probing period we could neither allocate new grains nor reach already-allocated ones via IClusterClient. Is that expected behavior?

Additionally, are your hosts heavily loaded? You shouldn't be seeing this many death votes.

Not at that point in time. The death votes were for the silo that was shut down ungracefully, which I expected; what I did not expect is that the other 2 silos went down as well.
In general, the load on the hosts is expected to be high. We use Orleans to allocate computation affinity; the real workloads are scheduled via RPCs that do not involve communication with grains. The computations use LOH allocations, and compacting the LOH may take system resources. Should we expect any problems when Orleans, high-load computations, and LOH allocations are all hosted in the same process?

@t-gena (Author) commented Jan 24, 2022

I don't know if this is relevant, but in our early stages of using Orleans we got the IP configuration wrong; since the silos failed to connect to each other, they marked the unreachable silos as dead.

@zahirtezcan-bugs, this is not our case. The cluster was in a valid state (verified via the Dashboard) until the silo that failed right after startup joined the cluster.

rafikiassumani-msft added the area-silos label and removed the Needs: triage 🔍 label on Jan 27, 2022
rafikiassumani-msft added this to the Discussion milestone on Jan 27, 2022
@ReubenBond (Member)

Should we expect any problems when Orleans, high-load computations, and LOH allocations are all hosted in the same process?

It depends: how long are the GC pauses? If you expect your process to stop for multiple seconds at a time, you're going to need to increase that probe timeout (not decrease it). You should also expect erratic performance in that case, since there is a very noisy neighbor sharing the same runtime.

@t-gena (Author) commented Jan 28, 2022

@ReubenBond thanks for the explanation.
Does it mean that if one of the silos is not responding (e.g. due to LOH compaction), the whole cluster is unusable (no possibility to allocate/call grains via IClusterClient) during this time?

Meanwhile, I was able to reproduce the issue (in about 90% of cases) in a more isolated environment.

  1. Clocks: Silo0 = Silo2 = now, Silo1 = now - 3 min.
  2. No extra load on VMs.
  3. Configuration of the silos as in the first post.

Steps to reproduce:

  1. Start Silo0.
  2. Start Silo1.
  3. Start Silo2, wait for it to join the cluster, then shut it down ungracefully.

Result

  1. Silo1 goes down with the message "I have been told I am dead...".
  2. Silo0 is stuck (e.g. the Dashboard cannot load any information) and repeatedly logs:
2022-01-28 15:16:55.050 [  26] WARN  Orleans.Runtime.CallbackData Response did not arrive on time in 00:00:30 for message: Request Silo0:8081:381075231*stg/MembershipOracle/0000000f@S0000000f->Silo1:8081:381075083*stg/MembershipOracle/0000000f@S0000000f InvokeMethodRequest Orleans.Runtime.IMembershipService:MembershipChangeNotification #1006. Target History is: <Silo1:8081:381075083:*stg/MembershipOracle/0000000f:@S0000000f>. About to break its promise.
2022-01-28 15:16:55.050 [  26] WARN  Orleans.Runtime.CallbackData Response did not arrive on time in 00:00:30 for message: Request Silo0:8081:381075231*stg/MembershipOracle/0000000f@S0000000f->Silo1:8081:381075083*stg/MembershipOracle/0000000f@S0000000f InvokeMethodRequest Orleans.Runtime.IMembershipService:Ping #1012. Target History is: <Silo1:8081:381075083:*stg/MembershipOracle/0000000f:@S0000000f>. About to break its promise.
2022-01-28 15:16:55.050 [  26] WARN  Orleans.Runtime.CallbackData Response did not arrive on time in 00:00:30 for message: Request Silo0:8081:381075231*stg/MembershipOracle/0000000f@S0000000f->Silo1:8081:381075083*stg/MembershipOracle/0000000f@S0000000f InvokeMethodRequest Orleans.Runtime.IMembershipService:MembershipChangeNotification #1010. Target History is: <Silo1:8081:381075083:*stg/MembershipOracle/0000000f:@S0000000f>. About to break its promise.
2022-01-28 15:16:55.050 [  21] ERROR Orleans.Runtime.MembershipService.MembershipSystemTarget [DRS\gennadiy.tuzov] Exception while sending gossip notification to remote silo Silo1:8081:381075083
System.TimeoutException: Response did not arrive on time in 00:00:30 for message: Request Silo0:8081:381075231*stg/MembershipOracle/0000000f@S0000000f->Silo1:8081:381075083*stg/MembershipOracle/0000000f@S0000000f InvokeMethodRequest Orleans.Runtime.IMembershipService:MembershipChangeNotification #1006. Target History is: <Silo1:8081:381075083:*stg/MembershipOracle/0000000f:@S0000000f>.
   at Orleans.Runtime.MembershipService.MembershipSystemTarget.GossipToRemoteSilo(SiloAddress silo, MembershipTableSnapshot snapshot, SiloAddress updatedSilo, SiloStatus updatedStatus)
2022-01-28 15:16:55.050 [  21] ERROR Orleans.Runtime.MembershipService.MembershipSystemTarget [DRS\gennadiy.tuzov] Exception while sending gossip notification to remote silo Silo1:8081:381075083
System.TimeoutException: Response did not arrive on time in 00:00:30 for message: Request Silo0:8081:381075231*stg/MembershipOracle/0000000f@S0000000f->Silo1:8081:381075083*stg/MembershipOracle/0000000f@S0000000f InvokeMethodRequest Orleans.Runtime.IMembershipService:MembershipChangeNotification #1010. Target History is: <Silo1:8081:381075083:*stg/MembershipOracle/0000000f:@S0000000f>.
   at Orleans.Runtime.MembershipService.MembershipSystemTarget.GossipToRemoteSilo(SiloAddress silo, MembershipTableSnapshot snapshot, SiloAddress updatedSilo, SiloStatus updatedStatus)
2022-01-28 15:16:55.207 [  21] WARN  Orleans.Networking [DRS\gennadiy.tuzov] Connection attempt to endpoint Silo2:8080:381075358 failed
System.Threading.Tasks.TaskCanceledException: A task was canceled.
   at Orleans.Internal.OrleansTaskExtentions.MakeCancellable[T](Task`1 task, CancellationToken cancellationToken)
   at Orleans.Runtime.Messaging.ConnectionManager.ConnectAsync(SiloAddress address)
2022-01-28 15:16:55.207 [  21] ERROR Orleans.Runtime.MembershipService.MembershipSystemTarget [DRS\gennadiy.tuzov] Exception while sending gossip notification to remote silo Silo2:8080:381075358
Orleans.Runtime.OrleansMessageRejectionException: Exception while sending message: Orleans.Runtime.Messaging.ConnectionFailedException: Unable to connect to endpoint Silo2:8080:381075358. See InnerException
 ---> System.Threading.Tasks.TaskCanceledException: A task was canceled.
   at Orleans.Internal.OrleansTaskExtentions.MakeCancellable[T](Task`1 task, CancellationToken cancellationToken)
   at Orleans.Runtime.Messaging.ConnectionManager.ConnectAsync(SiloAddress address)
   --- End of inner exception stack trace ---
   at Orleans.Runtime.Messaging.ConnectionManager.ConnectAsync(SiloAddress address)
   at Orleans.Runtime.Messaging.ConnectionManager.GetConnectionAsync(SiloAddress endpoint)
   at Orleans.Runtime.Messaging.OutboundMessageQueue.<SendMessage>g__SendAsync|10_0(ValueTask`1 c, Message m)
   at Orleans.Runtime.MembershipService.MembershipSystemTarget.GossipToRemoteSilo(SiloAddress silo, MembershipTableSnapshot snapshot, SiloAddress updatedSilo, SiloStatus updatedStatus)
2022-01-28 15:16:55.207 [  21] ERROR Orleans.Runtime.MembershipService.MembershipSystemTarget [DRS\gennadiy.tuzov] Exception while sending gossip notification to remote silo Silo2:8080:381075358
Orleans.Runtime.OrleansMessageRejectionException: Exception while sending message: Orleans.Runtime.Messaging.ConnectionFailedException: Unable to connect to endpoint Silo2:8080:381075358. See InnerException
 ---> System.Threading.Tasks.TaskCanceledException: A task was canceled.
   at Orleans.Internal.OrleansTaskExtentions.MakeCancellable[T](Task`1 task, CancellationToken cancellationToken)
   at Orleans.Runtime.Messaging.ConnectionManager.ConnectAsync(SiloAddress address)
   --- End of inner exception stack trace ---
   at Orleans.Runtime.Messaging.ConnectionManager.ConnectAsync(SiloAddress address)
   at Orleans.Runtime.Messaging.ConnectionManager.GetConnectionAsync(SiloAddress endpoint)
   at Orleans.Runtime.Messaging.OutboundMessageQueue.<SendMessage>g__SendAsync|10_0(ValueTask`1 c, Message m)
   at Orleans.Runtime.MembershipService.MembershipSystemTarget.GossipToRemoteSilo(SiloAddress silo, MembershipTableSnapshot snapshot, SiloAddress updatedSilo, SiloStatus updatedStatus)
2022-01-28 15:16:55.207 [  21] ERROR Orleans.Runtime.MembershipService.MembershipSystemTarget [DRS\gennadiy.tuzov] Exception while sending gossip notification to remote silo Silo2:8080:381075358
Orleans.Runtime.OrleansMessageRejectionException: Exception while sending message: Orleans.Runtime.Messaging.ConnectionFailedException: Unable to connect to endpoint Silo2:8080:381075358. See InnerException
 ---> System.Threading.Tasks.TaskCanceledException: A task was canceled.
   at Orleans.Internal.OrleansTaskExtentions.MakeCancellable[T](Task`1 task, CancellationToken cancellationToken)
   at Orleans.Runtime.Messaging.ConnectionManager.ConnectAsync(SiloAddress address)
   --- End of inner exception stack trace ---
   at Orleans.Runtime.Messaging.ConnectionManager.ConnectAsync(SiloAddress address)
   at Orleans.Runtime.Messaging.ConnectionManager.GetConnectionAsync(SiloAddress endpoint)
   at Orleans.Runtime.Messaging.OutboundMessageQueue.<SendMessage>g__SendAsync|10_0(ValueTask`1 c, Message m)
   at Orleans.Runtime.MembershipService.MembershipSystemTarget.GossipToRemoteSilo(SiloAddress silo, MembershipTableSnapshot snapshot, SiloAddress updatedSilo, SiloStatus updatedStatus)
2022-01-28 15:16:55.207 [  21] ERROR Orleans.Runtime.MembershipService.MembershipSystemTarget [DRS\gennadiy.tuzov] Exception while sending gossip notification to remote silo Silo2:8080:381075358
Orleans.Runtime.OrleansMessageRejectionException: Exception while sending message: Orleans.Runtime.Messaging.ConnectionFailedException: Unable to connect to endpoint Silo2:8080:381075358. See InnerException
 ---> System.Threading.Tasks.TaskCanceledException: A task was canceled.
   at Orleans.Internal.OrleansTaskExtentions.MakeCancellable[T](Task`1 task, CancellationToken cancellationToken)
   at Orleans.Runtime.Messaging.ConnectionManager.ConnectAsync(SiloAddress address)
   --- End of inner exception stack trace ---
   at Orleans.Runtime.Messaging.ConnectionManager.ConnectAsync(SiloAddress address)
   at Orleans.Runtime.Messaging.ConnectionManager.GetConnectionAsync(SiloAddress endpoint)
   at Orleans.Runtime.Messaging.OutboundMessageQueue.<SendMessage>g__SendAsync|10_0(ValueTask`1 c, Message m)
   at Orleans.Runtime.MembershipService.MembershipSystemTarget.GossipToRemoteSilo(SiloAddress silo, MembershipTableSnapshot snapshot, SiloAddress updatedSilo, SiloStatus updatedStatus)
2022-01-28 15:16:55.207 [  21] WARN  Orleans.Runtime.MembershipService.SiloHealthMonitor [DRS\gennadiy.tuzov] Did not get response for probe #27 to silo Silo2:8080:381075358 after 00:00:00.9706049. Current number of consecutive failed probes is 24
Orleans.Runtime.OrleansMessageRejectionException: Exception while sending message: Orleans.Runtime.Messaging.ConnectionFailedException: Unable to connect to endpoint Silo2:8080:381075358. See InnerException
 ---> System.Threading.Tasks.TaskCanceledException: A task was canceled.
   at Orleans.Internal.OrleansTaskExtentions.MakeCancellable[T](Task`1 task, CancellationToken cancellationToken)
   at Orleans.Runtime.Messaging.ConnectionManager.ConnectAsync(SiloAddress address)
   --- End of inner exception stack trace ---
   at Orleans.Runtime.Messaging.ConnectionManager.ConnectAsync(SiloAddress address)
   at Orleans.Runtime.Messaging.ConnectionManager.GetConnectionAsync(SiloAddress endpoint)
   at Orleans.Runtime.Messaging.OutboundMessageQueue.<SendMessage>g__SendAsync|10_0(ValueTask`1 c, Message m)
   at Orleans.Runtime.Scheduler.AsyncClosureWorkItem.Execute()
   at Orleans.Runtime.MembershipService.SiloHealthMonitor.ProbeDirectly(CancellationToken cancellation)
  3. The membership table looks as follows:
| Address | Port | Generation | SiloName | Status | ProxyPort | SuspectTimes | StartTime | IAmAliveTime |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Silo2 | 8080 | 381075358 | Silo_9df6a | 3 | 30000 | Silo1:8081@381075083, 2022-01-28 14:13:03.479 GMT; Silo0:8081@381075231, 2022-01-28 14:16:55.222 GMT | 2022-01-28 14:15:58.490 | 2022-01-28 14:15:59.963 |
| Silo0 | 8081 | 381075231 | Silo_8aedf | 3 | 30001 | | 2022-01-28 14:13:51.563 | 2022-01-28 14:13:52.237 |
| Silo1 | 8081 | 381075083 | Silo_11814 | 3 | 30001 | Silo0:8081@381075231, 2022-01-28 14:16:55.238 GMT | 2022-01-28 14:11:24.253 | 2022-01-28 14:11:25.283 |

Expected

  1. The cluster contains Silo0 and Silo1 and is responsive.
  2. Silo2.Status = 6 (Dead).

Still not clear

  1. "...Found a bug!" message means it is a bug in the Orleans framework or misuse/misconfiguration of it?
  2. The above case shows really critical situation which may happen in production where the whole cluster is not usable. If it is expected behavior (e.g.again misconfiguration) is there any documentation/best practices for that?

I hope the description above helps. Thank you for the assistance; I'm looking forward to your answers.

Edit: extending ProbeTimeout to 5s or 15s did not help.

@ReubenBond (Member)

Do you see any logs pertaining to expired messages? If I'm understanding your testing scenario correctly (essentially unloaded, but with a 3 min time difference), then there may be a bug somewhere in how message expiration is being calculated. Would you mind providing full logs?

Does it mean that if one of the silos is not responding (e.g. due to LOH compaction), the whole cluster is unusable (no possibility to allocate/call grains via IClusterClient) during this time?

No, but if a silo process pauses for some period of time, any messages sent to it will not be processed until it resumes. You will therefore find that a portion of placements stall, either because the directory partition for the grain lives on the stalled silo or because placement picked that silo for a new grain, and of course any messages to grains that are already active on that silo will not be processed until it resumes. It's not a great situation to be in, and you generally wouldn't want a production system behaving that way.

Regarding the points you are not clear on:

  1. It is a bug, but it should be a very rare corner case that happens when there is significant clock skew (more than about 2 minutes) and silos are suspecting each other (whether because of a bug triggered by the severe clock skew, or because of GC pauses and/or thread pool starvation). We may be able to add some clock-skew compensation logic. The membership algorithm tallies the entries in "SuspectTimes" that fall within a particular recency period (in the future, or less than 2 minutes in the past, according to the local clock). It assumes that the voting silos have similarly low clock skew, so their tallies should be the same. The relevant code is this check: `if (now < otherVoterTime || now.Subtract(otherVoterTime) < expiration)`. In this case, a slow (behind-the-clock) silo sees votes from the future and counts them as fresh votes. The check does not consider the reported timing between votes, so maybe that's the right change to make to fix the crash: recency should count votes within a fixed-width window (rather than also including any future vote).
  2. These silos are not responsive from Orleans' perspective, and it is a dire situation. Assuming there is a lot of load on the system (and it's not all caused by a clock-skew bug), you may be able to get things to stagger along if you loosen the timings enough, but I recommend fixing the core issues instead. If you do want to loosen the timings, you can increase DeathVoteExpirationTimeout to 10 minutes to cover the clock skew, increase ProbeTimeout to 30s to cover the long pauses, and set both ExtendProbeTimeoutDuringDegradation & EnableIndirectProbes to true, so that unhealthy silos (based upon a self-test) don't vote others unhealthy (no throwing stones in a glass house) and probes are sent via an intermediary as a sanity check. A configuration sketch follows this list.
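A minimal sketch of the loosened settings suggested in point 2 above; the values are the ones mentioned there, and `siloBuilder` is a placeholder for your own host builder:

```csharp
// Illustrative only: loosened liveness settings to tolerate clock skew and long pauses.
siloBuilder.Configure<ClusterMembershipOptions>(options =>
{
    options.DeathVoteExpirationTimeout = TimeSpan.FromMinutes(10); // cover the ~3 min clock skew
    options.ProbeTimeout = TimeSpan.FromSeconds(30);               // cover long GC pauses
    options.ExtendProbeTimeoutDuringDegradation = true;            // self-diagnosed unhealthy silos ease off
    options.EnableIndirectProbes = true;                           // probe via an intermediary as a sanity check
});
```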

@ReubenBond (Member)

@t-gena I've opened a PR to harden the vote tallying logic against clock skew, but that only solves half of the issue. The bigger question is why you're seeing so many call timeouts to begin with.
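To illustrate the kind of hardening being discussed, a hypothetical sketch of the fixed-width recency window described in the previous comment (this is not the PR's code or the actual Orleans implementation):

```csharp
using System;

// Hypothetical illustration only: treat a suspicion vote as "fresh" when it lies within a
// fixed-width window around the local clock, so votes stamped in the future (by a silo
// whose clock runs ahead) do not count as fresh indefinitely.
static bool IsVoteFresh(DateTime now, DateTime voteTime, TimeSpan expiration)
{
    // Symmetric window: |now - voteTime| < expiration, instead of
    // "any future vote OR a past vote younger than expiration".
    var age = (now - voteTime).Duration();
    return age < expiration;
}
```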

@t-gena (Author) commented Jan 31, 2022

@ReubenBond thank you for the detailed answers. I will apply the proposed changes and we will do another round of testing. I will provide feedback afterward.

I've opened a PR to harden the vote tallying logic against clock skew, but that only solves half of the issue.

This is good news, thank you. When can we expect it to be merged to main and pushed to NuGet?

The bigger question is why you're seeing so many call timeouts to begin with.

How do you propose to tackle it? Will full logs help?

@ReubenBond (Member) commented Jan 31, 2022

Full logs should help to identify the issue.

@t-gena (Author) commented Feb 1, 2022

Please find logs attached.
logs.zip

@ReubenBond (Member)

Why is ValidateInitialConnectivity set to false? We strongly advise against that and will likely remove the option.

HOST2 in your logs dies because Kestrel is unable to bind to its port. It joins the cluster and seems to crash immediately because of that. The other hosts have to detect that it's dead using pings, which is where you run into the bug fixed by the PR I linked, caused by the 3 min clock desync. That causes HOST1 to crash, and HOST0 is left unable to declare either of them dead, since a single host is not able to take down the rest of the cluster. Does that all make sense?

Is there a reason the clocks are still out of sync, or is it just for testing?

@t-gena (Author) commented Feb 2, 2022

Why is ValidateInitialConnectivity set to false? We strongly advise against that and will likely remove the option.

In production this option will be set to true; it was set to false mainly during testing.

HOST2 in your logs dies because Kestrel is unable to bind to its port. It joins the cluster and seems to crash immediately because of that. The other hosts have to detect that it's dead using pings, which is where you run into the bug fixed by the PR I linked, caused by the 3 min clock desync. That causes HOST1 to crash, and HOST0 is left unable to declare either of them dead, since a single host is not able to take down the rest of the cluster. Does that all make sense?

This is the crash-test scenario where the service cannot start, which leads to an ungraceful shutdown of the silo right after startup. Your explanation of why HOST2 was suspected by the others to be dead makes sense. I also understand that unsynchronized clocks together with DeathVoteExpirationTimeout can cause the issue when a ping cannot reach the target silo.

  1. But why did HOST0 suspect HOST1 to be dead at all? Since there were no network outages, the ping should have returned a successful result. Or do I misunderstand something?
  2. After HOST1 was told it is dead and before stopping itself, should it set (or at least try to set) its Status in the membership table to 6 (Dead)?
  3. According to the logs, both HOST0 and HOST1 suspected HOST2 as dead, but in the membership table HOST2 still has Status = 3 (Active).

Is there a reason the clocks are still out of sync, or is it just for testing?

It is just for testing and providing feedback on this issue.

@ReubenBond (Member)

The original issue here has been fixed, so I'll close this. Please open a new issue if you run into something similar in future.

github-actions bot locked and limited conversation to collaborators on Jan 10, 2024