Attempt to fix failing test by dealing with overflow with three rounds,
instead of previous subtract modulus solution. Also optimise out shifts
by using memcpy / memmove instead. Remove final sub to return canonical
result, as this is not required here.
Signed-off-by: Paul Elliott <paul.elliott@arm.com>