Conversation
This fixes various problems with error reporting when running tests with goroutines. Unfortunately, it still seems to fail for Style1 for any replica count, but Style2 does not fail even for replica counts up to 180.
To avoid that the process create to many connections (open files) each test should stop their servers.
This still contains the old versions that don't work, but I've disabled them. Things should be cleaned up now that the new version (seems to) work. Have been able to start 200 servers.
Previously there was no way to know whether or not a goroutine had in fact been started; henceforth, the errChan would be closed by the defer since the createReplicas() call would reach the return before all goroutines had the chance to send their err, causing a panic. In this new version, we wait for all goroutines to start before we check for errors on the channel.
johningve
reviewed
Mar 8, 2022
johningve
reviewed
Mar 8, 2022
johningve
reviewed
Mar 8, 2022
johningve
reviewed
Mar 8, 2022
johningve
reviewed
Mar 8, 2022
tests/all2all/alltoallconfig_test.go
Outdated
| nodeMap[replica.address] = replica.id | ||
| } | ||
| for _, replica := range replicas { | ||
| replica.createConfiguration(nodeMap) |
Member
There was a problem hiding this comment.
The error returned by this method is not checked.
Member
There was a problem hiding this comment.
Interestingly, this seems to bring back the bug we are fighting with here. If we run the code with:
$ go test -v -run TestAllToAllConfiguration -replicas 10
=== RUN TestAllToAllConfiguration
--- PASS: TestAllToAllConfiguration (1.06s)vs
go test -v -run TestAllToAllConfiguration -replicas 11
=== RUN TestAllToAllConfiguration
alltoallconfig_test.go:33: could not create configuration: connection failed for addr: 127.0.0.1:65204: dialing node failed: context deadline exceeded
--- FAIL: TestAllToAllConfiguration (1.16s)And this is consistently working for up to 10 replicas, but not for 11 or above.
Member
There was a problem hiding this comment.
I added TestGrpcDial to try to understand better what's going on. This version fails consistently with 84 replicas (but not with fewer on my machine that is):
go test -v -run TestGrpc -replicas 84
=== RUN TestGrpcDial
alltoallconfig_test.go:133: context deadline exceeded
--- FAIL: TestGrpcDial (2.05s)
Member
There was a problem hiding this comment.
I can't reproduce this on my machine
❯ ./all2all.test -test.v -test.run="TestGrpcDial" -replicas=10000
=== RUN TestGrpcDial
--- PASS: TestGrpcDial (4.84s)
PASS
It only fails for me when the kernel runs out of ports 😅
❯ ./all2all.test -test.v -test.run="TestGrpcDial" -replicas=50000
=== RUN TestGrpcDial
alltoallconfig_test.go:123: listen tcp 127.0.0.1:0: bind: address already in use
--- FAIL: TestGrpcDial (9.77s)
FAIL
Just wanted to check if this issue still persists (it does), but thought I just make a few tweaks while I was at it.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Tests to form configuration with a configurable number of replicas and wait time.