With the MobileNetV2 model, 224x224x3 images, and 5 epochs, I ran varying numbers of Thor slaves with the default CPU and memory settings to evaluate differences in accuracy and timing.
# of Thor Slaves (with default CPU and Memory) | End Time (Total Cluster Time) |
1 | *error-terminated |
2 | *error-terminated |
4 (default # of Thor slaves) | 28:12 (second trial: 28:13) |
8 | Failed in all three trials. Trials 1 and 2 failed because the gnncarina resource group on Azure must be set to selected networks, so I have to add the network each time I create a new AKS cluster. Trial 3 failed because there was not enough memory. |
Since only the 4-slave configuration ran successfully under the default settings, I manually changed the CPU to 8 and the memory to 16G.
The expectation is that more Thor slaves lead to a shorter running time.
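As a rough way to check this expectation against the timings in the table that follows, here is a minimal Python sketch that converts the clock-style end times to seconds and computes speedup and parallel efficiency relative to the 1-slave run. The helper names (`to_seconds`, `speedup`, `efficiency`, `ideal_time`) are hypothetical and not part of any HPCC Systems tooling. The sketch assumes the times are formatted as `H:MM:SS` or `MM:SS`, and that the runs being compared used the same model, data, and number of epochs.

```python
def to_seconds(clock: str) -> int:
    """Convert an 'H:MM:SS' or 'MM:SS' string to total seconds."""
    seconds = 0
    for part in clock.strip().split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def speedup(baseline: str, observed: str) -> float:
    """How many times faster the observed run is than the baseline run."""
    return to_seconds(baseline) / to_seconds(observed)


def efficiency(baseline: str, observed: str, workers: int) -> float:
    """Speedup divided by worker count; 1.0 means perfectly linear scaling."""
    return speedup(baseline, observed) / workers


def ideal_time(baseline: str, workers: int) -> float:
    """Run time in seconds if scaling were perfectly linear (an assumption)."""
    return to_seconds(baseline) / workers


if __name__ == "__main__":
    one_slave = "1:38:12"   # measured with 1 Thor slave (8 CPU, 16G)
    two_slaves = "48:10"    # measured with 2 Thor slaves (8 CPU, 16G)
    print(f"speedup:    {speedup(one_slave, two_slaves):.2f}x")
    print(f"efficiency: {efficiency(one_slave, two_slaves, 2):.2f}")
    print(f"ideal time: {ideal_time(one_slave, 2):.0f} s")
```

An efficiency close to 1.0 would indicate that doubling the number of Thor slaves roughly halved the running time. Values well below 1.0 would suggest that overhead such as data distribution or synchronization is eating into the gain.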
# of Thor Slaves (with 8 CPU and 16G Memory) | End Time (Total Cluster Time) |
1 | 1:38:12 |
2 | 48:10 ^^ 1 vs 2 epochs |