daily note

gpu code

Midgard

Each core writes one 32-bit pixel per clock, so an 8-core design has a total of 256 bits of memory bandwidth (for both read and write) per clock cycle.
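
The 256-bit figure is just cores times pixel width. A minimal sketch of that arithmetic (the function name is made up for this note, not anything from ARM):

```python
# One 32-bit pixel per core per clock, as stated in the note above.
BITS_PER_PIXEL = 32


def bandwidth_bits_per_clock(num_cores: int) -> int:
    """Bits moved per clock cycle if every core handles one 32-bit pixel."""
    return num_cores * BITS_PER_PIXEL


print(bandwidth_bits_per_clock(8))  # 256
```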

arithmetic pipeline

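SIMD vector processing engine operating on 128-bit (four-word) registers. A shader core can have several of these, typically two. The registers can be accessed flexibly as 2 x FP64, 4 x FP32, 8 x FP16, 2 x int64, 4 x int32, 8 x int16, or 16 x int8.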

This allows OpenCL kernels operating on 8-bit luminance data to process 16 pixels per SIMD unit per clock cycle.
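
The lane counts all follow from dividing the 128-bit register by the element width. A small sketch to check the list (names are illustrative only):

```python
# Width of one SIMD register, per the note above.
REGISTER_BITS = 128


def lanes(element_bits: int) -> int:
    """Number of elements of the given width that fit in one 128-bit register."""
    if element_bits <= 0 or REGISTER_BITS % element_bits != 0:
        raise ValueError("element width must evenly divide the register width")
    return REGISTER_BITS // element_bits


# Element types listed in the note and their widths in bits.
for name, bits in [("FP64", 64), ("FP32", 32), ("FP16", 16),
                   ("int64", 64), ("int32", 32), ("int16", 16), ("int8", 8)]:
    print(f"{lanes(bits)} x {name}")
```

The int8 case gives 16 lanes, which is where the 16 pixels per clock for 8-bit luminance data comes from.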

For Mali-T604 and Mali-T628, peak performance is 17 FP32 FLOPS per ALU (arithmetic pipeline) per cycle.

FLOPS per instruction type

7 dot product (4 multiplies, 3 adds)

1 scalar add

4 vec4 add

4 vec4 multiply

1 scalar multiply
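
Quick check that the breakdown above adds up to 17 (the dictionary keys are just labels for this note):

```python
# FP32 FLOPS contributed per cycle by each instruction type, from the list above.
FLOPS_PER_CYCLE = {
    "dot product (4 mul + 3 add)": 7,
    "scalar add": 1,
    "vec4 add": 4,
    "vec4 multiply": 4,
    "scalar multiply": 1,
}

total_flops = sum(FLOPS_PER_CYCLE.values())
print(total_flops)  # 17
```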

Mali-T760: at 600 MHz with 16 cores, peak floating-point performance is 326 FP32 GFLOPS, i.e. 16 cores * 600 MHz * 2 pipelines * 17 FP32 FLOPS. Each shader core contains two arithmetic pipelines, each delivering 17 FP32 FLOPS per clock cycle.
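
The same calculation as a sketch, so other core counts or clocks can be plugged in. The function and parameter names are made up here; the defaults are the figures from the note.

```python
def peak_gflops(num_cores: int, clock_mhz: int,
                pipelines_per_core: int = 2,
                flops_per_pipeline_per_cycle: int = 17) -> float:
    """Theoretical peak FP32 GFLOPS: cores * clock * pipelines * FLOPS per cycle."""
    mflops = num_cores * clock_mhz * pipelines_per_core * flops_per_pipeline_per_cycle
    return mflops / 1000  # MFLOPS -> GFLOPS


print(peak_gflops(16, 600))  # 326.4
```

This is a theoretical peak only; real kernels rarely keep every pipeline issuing every instruction type each cycle.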
load/store pipeline

texture pipeline

Texture memory access: bilinear filtering takes one clock cycle; trilinear filtering loads from two different mipmap levels in memory and takes two clock cycles.
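
A rough upper bound on filtered samples per second from those cycle counts. This assumes one texture pipeline per shader core that is never stalled (no cache misses, no other bottleneck), which is an assumption of this sketch and not something the note states; names are illustrative.

```python
# Clock cycles per filtered sample, from the note above.
CYCLES_PER_SAMPLE = {"bilinear": 1, "trilinear": 2}


def peak_samples_per_second(num_cores: int, clock_hz: int, mode: str) -> int:
    """Upper bound on filtered texture samples per second.

    Assumes one texture pipeline per core, always busy (hypothetical best case).
    """
    return num_cores * clock_hz // CYCLES_PER_SAMPLE[mode]


# Example with a 16-core part at 600 MHz.
print(peak_samples_per_second(16, 600_000_000, "bilinear"))   # 9600000000
print(peak_samples_per_second(16, 600_000_000, "trilinear"))  # 4800000000
```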

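Each shader core contains two 16 KB L1 data caches, one for texture access and one for general data access.
There is a single logical L2 cache shared by all shader cores. Its size is configured by the silicon vendor, typically 32 KB or 64 KB per instantiated shader core.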
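
A sketch of the cache totals these figures imply, assuming the L2 simply scales linearly with the core count (real configurations are up to the vendor; names are illustrative):

```python
L1_KB_PER_CACHE = 16      # per the note: two 16 KB L1 data caches per shader core
L1_CACHES_PER_CORE = 2    # one for texture, one for general data access


def l1_kb_per_core() -> int:
    """Total L1 data cache per shader core in KB."""
    return L1_KB_PER_CACHE * L1_CACHES_PER_CORE


def l2_kb_total(num_cores: int, kb_per_core: int) -> int:
    """Shared L2 size in KB, assuming linear scaling with core count."""
    if kb_per_core not in (32, 64):
        raise ValueError("the note only mentions 32 KB or 64 KB per core")
    return num_cores * kb_per_core


print(l1_kb_per_core())     # 32
print(l2_kb_total(16, 32))  # 512
print(l2_kb_total(16, 64))  # 1024
```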