Possible scheduling issue with painlessMesh TimeSync
Using the same test harness as issue #487 (that is a simple 2 node mesh) and with the additional discipline suggested by #487 it appears that TimeSync still has some odd behaviour. This output was logged by the adopter from the nodeTimeAdjusted callback
void nodeTimeAdjustedCallback(int32_t offset) {
Serial.printf("At %lu.%lu seconds, Adjusted mesh time to %lu.%lu (Offset = %ld)\n", micros()/1000000L,micros() % 1000000L,mesh.getNodeTime()/1000000L,mesh.getNodeTime() % 1000000L,offset);
}
Comments added manually;
(When reviewing be aware an offset of 0 indicates a discarded, long running request that must have been initiated significantly earlier than the timestamp indicates)
setLogLevel: ERROR |
--> Start: New Connection, nodeId = 3006701716
--> Start: New Connection, {
"nodeId": 366351453,
"subs": [
{
"nodeId": 3006701716
}
]
}
At 28.869156 seconds, Adjusted mesh time to 4283.529939 (Offset = -40306517) // Initial sync, expect a large offset
At 29.182061 seconds, Adjusted mesh time to 4283.842841 (Offset = 0) // Would expect a retry 200ms later but it gave us an outlier and was zeroed
At 89.149542 seconds, Adjusted mesh time to 48.841491 (Offset = -1535) // Sixty seconds later the request completes OK with offset of -1535
At 149.157130 seconds, Adjusted mesh time to 108.847695 (Offset = -1386) // Sixty seconds later again the request completes with an offset of -1386
At 149.397686 seconds, Adjusted mesh time to 109.88249 (Offset = 0) // So why has painlessMesh scheduled another attempt 200ms later?
// In fact it took over 200ms to complete so must have been scheduled at
// approximately the same time as the attempt 149.157130
At 209.170983 seconds, Adjusted mesh time to 168.860102 (Offset = -1445) // Here is a perfectly acceptable reply and offset
At 269.178982 seconds, Adjusted mesh time to 228.866793 (Offset = -1309) // but we get another reply just 8ms later?
At 269.409921 seconds, Adjusted mesh time to 229.97731 (Offset = 0) // Result of the line above it is perfectly acceptable so why did we call for this again?
// Especially as it took ages to complete
At 329.193346 seconds, Adjusted mesh time to 288.879509 (Offset = -1646) // 60 seconds pass and a good sync request and reply
At 389.201034 seconds, Adjusted mesh time to 348.885925 (Offset = -1275) // 60 seconds pass and a good sync request and reply
At 389.422462 seconds, Adjusted mesh time to 349.107350 (Offset = 0) // but there we are again 200ms later
At 449.230546 seconds, Adjusted mesh time to 408.912387 (Offset = -3048) // 60 seconds pass and a good sync request and reply
At 509.233926 seconds, Adjusted mesh time to 468.915665 (Offset = -104) // 60 seconds pass and a good sync request and reply
At 509.237931 seconds, Adjusted mesh time to 468.919756 (Offset = 88) // but there we are again 4ms later
At 569.247086 seconds, Adjusted mesh time to 528.927319 (Offset = -1592) // and again 9ms later
At 569.441576 seconds, Adjusted mesh time to 529.121810 (Offset = 0) // and a final one 200ms later but having taken ~200ms to complete
At 629.261349 seconds, Adjusted mesh time to 588.940388 (Offset = -1194) // 60 seconds pass and a good sync request and reply
At 689.271192 seconds, Adjusted mesh time to 648.950215 (Offset = -18) // 60 seconds pass and a good sync request and reply
At 689.454464 seconds, Adjusted mesh time to 649.133484 (Offset = 0) // followed by much delayed reply 180ms later
At 749.283317 seconds, Adjusted mesh time to 708.959752 (Offset = -2585) // 60 seconds pass and a good sync request and reply
At 809.292795 seconds, Adjusted mesh time to 768.968604 (Offset = -628) // 60 seconds pass and a good sync request and reply
At 809.382321 seconds, Adjusted mesh time to 769.58128 (Offset = 0) // just 90ms later we get another reply
At 869.305916 seconds, Adjusted mesh time to 828.979442 (Offset = -2281) // 60 seconds pass and a good sync request and reply
At 929.318240 seconds, Adjusted mesh time to 888.988970 (Offset = -2798) // 60 seconds pass and a good sync request and reply
At 929.342815 seconds, Adjusted mesh time to 889.1665 (Offset = -11878) // 24 ms later a delayed response results in false offset
At 929.687371 seconds, Adjusted mesh time to 889.346221 (Offset = 0) // followed an outlier response 340ms later
At 989.654979 seconds, Adjusted mesh time to 949.325719 (Offset = 11890) // Correction of false offset
At 989.861912 seconds, Adjusted mesh time to 949.532621 (Offset = -33) // so we would expect this retry 200ms or so later
At 1049.871452 seconds, Adjusted mesh time to 1009.540462 (Offset = -1697) // 60 seconds pass and a good sync request and reply
At 1050.311787 seconds, Adjusted mesh time to 1009.980799 (Offset = 0) // so why does this turn up later?
This is far from documented proof that there is an issue but it does appear from the additional replies that additional requests are being sent? Is this because the TIME_SYNC_INTERVAL is too low and is firing before the previous is complete. I doubt it is that simple.
What is clear is that turning on logging for S_TIME increases the reported round trip sync time from sub 1000us to a consistent 17500us and the problem goes away. Simply studying the problem affects the observations.
More work to be done but any ideas and suggestions welcome.