Tensor Core

1st

4 processing blocks * 2 Tensor Cores per block * 64 FP16 FMA per Tensor Core per clock = 512 FP16 FMA per SM per clock
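
A minimal sketch of the arithmetic above, and of how a per-SM FMA rate turns into a peak FLOPS figure. Reading the three factors as processing blocks per SM, Tensor Cores per block, and FMAs per Tensor Core per clock is an interpretation of the formula. The function names are hypothetical, and the SM count and clock (80 SMs, 1.53 GHz, a V100-class part) are assumptions that do not come from these notes.

```python
def fma_per_sm_per_clock(blocks_per_sm: int,
                         tensor_cores_per_block: int,
                         fma_per_tensor_core: int) -> int:
    """FP16 FMAs one SM can issue per clock across all its Tensor Cores."""
    return blocks_per_sm * tensor_cores_per_block * fma_per_tensor_core


def peak_tflops(fma_per_sm: int, sm_count: int, clock_hz: float) -> float:
    """Peak throughput in TFLOPS; one FMA counts as 2 floating-point ops."""
    return fma_per_sm * 2 * sm_count * clock_hz / 1e12


first_gen = fma_per_sm_per_clock(4, 2, 64)
print(first_gen)  # 512

# Assumed configuration (not from the notes): 80 SMs at 1.53 GHz.
print(round(peak_tflops(first_gen, 80, 1.53e9), 1))  # 125.3
```

The factor of 2 is there because a fused multiply-add is conventionally counted as two floating-point operations when quoting FLOPS.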

2nd

3rd

4 processing blocks * 1 Tensor Core per block * 256 FP16 FMA per Tensor Core per clock = 1024 FP16 FMA per SM per clock
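
To compare this with the 1st-generation figure (4 * 2 * 64 = 512), the sketch below puts both layouts side by side. The class and field names are hypothetical, and the meaning assigned to each factor (blocks per SM, Tensor Cores per block, FMAs per Tensor Core per clock) is an assumed reading of the formulas.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TensorCoreLayout:
    blocks_per_sm: int           # processing blocks in one SM
    tensor_cores_per_block: int  # Tensor Cores in each block
    fma_per_tensor_core: int     # FP16 FMAs per Tensor Core per clock

    @property
    def tensor_cores_per_sm(self) -> int:
        return self.blocks_per_sm * self.tensor_cores_per_block

    @property
    def fma_per_sm(self) -> int:
        return self.tensor_cores_per_sm * self.fma_per_tensor_core


first_gen = TensorCoreLayout(4, 2, 64)
third_gen = TensorCoreLayout(4, 1, 256)

print(first_gen.tensor_cores_per_sm, first_gen.fma_per_sm)  # 8 512
print(third_gen.tensor_cores_per_sm, third_gen.fma_per_sm)  # 4 1024

# Half as many Tensor Cores per SM, each 4x wider: net 2x per SM per clock.
print(third_gen.fma_per_tensor_core / first_gen.fma_per_tensor_core)  # 4.0
print(third_gen.fma_per_sm / first_gen.fma_per_sm)  # 2.0
```

So the per-SM doubling comes from fewer but wider Tensor Cores, not from adding more of them.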
