DAY 3+4: TESTING VARIOUS THOR SLAVES

Using the MobileNetV2 model, 224x224x3 images, and 5 epochs, I ran the job with different numbers of Thor slaves at the default CPU and memory settings to compare accuracy and timing.

# of Thor Slaves (with default CPU and Memory)    End Time (Total Cluster Time)
1*                                                error-terminated
2*                                                error-terminated
4 (default # of thor slaves)                      First trial: 28:12; second trial: 28:13
8                                                 First trial: Failed; second trial: Failed; third trial: still failed

Why the third trial with 8 slaves failed: not enough memory.

Why the first two trials failed: the gnncarina resource group on Azure must be set to selected networks, so each time I create a new AKS cluster I have to add its network.


Since only the 4-slave configuration ran successfully under the default settings, I manually set the CPU to 8 and the memory to 16G.

The expectation is that more Thor slaves will give a shorter running time.

# of Thor Slaves (with 8 CPU and 16G Memory)      End Time (Total Cluster Time)
1                                                 1:38:12
2                                                 48:10
^^ 1 vs 2 epochs
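
To put a number on the difference between the two timings above, here is a minimal sketch that converts the clock strings to seconds and takes their ratio. The helper names (to_seconds, speedup) are my own, not part of any HPCC tooling. It assumes the times are H:MM:SS or MM:SS and that the two runs used the same workload; if they did not, the ratio is not a like-for-like comparison.

```python
def to_seconds(clock: str) -> int:
    """Convert an 'H:MM:SS' or 'MM:SS' string to a number of seconds."""
    seconds = 0
    for part in clock.split(":"):
        # Each field to the left is worth 60x the one to its right.
        seconds = seconds * 60 + int(part)
    return seconds


def speedup(baseline: str, candidate: str) -> float:
    """Ratio of baseline time to candidate time (values above 1 mean faster)."""
    return to_seconds(baseline) / to_seconds(candidate)


# Timings copied from the 8 CPU / 16G table.
one_slave = "1:38:12"
two_slaves = "48:10"

print(to_seconds(one_slave), to_seconds(two_slaves))  # 5892 2890
print(round(speedup(one_slave, two_slaves), 2))       # 2.04
```

Under those assumptions, going from 1 to 2 slaves roughly halves the total cluster time, which matches the expectation stated earlier.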
