Now that we have benchmarks for EVMMAX precompiles we can experiment with different Montgomery multiplication algorithms. Currently we use CIOS.
The reference paper we used have a summary of the complexity of each algorithm. However, these may not be directly applicable to small numbers like 256-bit. Check also what other projects are doing.