
gpu code

https://github.com/Oramy/m2-cgpu

dl deploy

https://github.com/uber/neuropod

github

https://www.ruanyifeng.com/blog/2017/12/travis_ci_tutorial.html

gpu resource

https://www.edge-ai-vision.com/2015/10/a-quick-guide-to-writing-opencl-kernels-for-powervr-rogue-gpus/

mail

Midgard

Midgard can write one 32-bit pixel per core per clock, so an 8-core design provides a total of 8 x 32 = 256 bits of memory bandwidth (for both read and write) per clock cycle.

gpu arch

shader core

Tripipe design:

  • arithmetic pipeline

    A SIMD vector processing engine that operates on 128-bit registers (four 32-bit words). A core can have more than one, typically two per shader core. The data types it can flexibly access include 2 x FP64, 4 x FP32, 8 x FP16, 2 x int64, 4 x int32, 8 x int16, or 16 x int8.

    OpenCL kernels operating on 8-bit luminance data can therefore process 16 pixels per SIMD unit per clock cycle.

    For Mali-T604 and T628, peak performance is 17 FP32 FLOPS per arithmetic pipeline per cycle, broken down as:

    FLOPS  instruction
    7      dot product (4 muls, 3 adds)
    1      scalar add
    4      vec4 add
    4      vec4 multiply
    1      scalar multiply

    Mali-T760 at 600 MHz with 16 cores: FP32 throughput is about 326 GFLOPS, i.e. 16 cores * 600 MHz * 2 pipelines * 17 FP32 FLOPS. Each shader core contains two arithmetic pipelines, each delivering 17 FP32 FLOPS per clock cycle (a quick worked check follows the pipeline list below).

  • load/store pipeline

  • texture pipeline.

    Texture memory access: bilinear filtering takes one clock cycle, while trilinear filtering has to load from two different mipmap levels in memory and therefore takes two clock cycles.
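
A quick sanity check of the Mali-T760 peak-FLOPS figure quoted above, as a minimal sketch that only uses the numbers stated in this section (cores, clock, pipelines per core, FLOPS per pipeline per clock):

#include <cstdio>

int main() {
    // Values quoted above for Mali-T760 at 600 MHz with 16 cores.
    const double cores              = 16;
    const double clock_hz           = 600e6;  // 600 MHz
    const double pipelines_per_core = 2;      // arithmetic pipelines per shader core
    const double flops_per_clock    = 17;     // FP32 FLOPS per pipeline per clock

    const double peak_gflops =
        cores * clock_hz * pipelines_per_core * flops_per_clock / 1e9;
    std::printf("peak FP32 = %.1f GFLOPS\n", peak_gflops);  // ~326.4
    return 0;
}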

memory system

Each shader core contains two 16 KB L1 data caches, one for texture accesses and one for general data accesses.
A single logical L2 cache is shared by all shader cores; its size is configured by the silicon vendor, typically 32 KB or 64 KB per shader core instance.

https://community.arm.com/arm-community-blogs/b/graphics-gaming-and-vr-blog/posts/arm-mali-compute-architecture-fundamentals

Paddle Inference study notes

Code walkthrough

The Paddle Inference code lives under paddle/fluid/inference.

Engine base class

class EngineBase {
 public:
  using DescType = ::paddle::framework::proto::BlockDesc;

  // Build the model and do some preparation, for example, in TensorRT, run
  // createInferBuilder, buildCudaEngine.
  virtual void Build(const DescType& paddle_model) = 0;

  // Execute the engine, that will run the inference network.
  virtual void Execute(int batch_size) = 0;

  virtual ~EngineBase() {}
};

To be added:

framework::ProgramDesc
framework::Executor* executor
framework::Scope* scope
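
As a minimal sketch (this is not Paddle's real TensorRT/Lite engine, just an illustration of the interface), a concrete engine derives from EngineBase and fills in the two virtuals:

class DummyEngine : public EngineBase {
 public:
  // Build(): translate the ops in the BlockDesc into backend-specific layers
  // and compile/prepare the backend network.
  void Build(const DescType& paddle_model) override { /* ... */ }

  // Execute(): bind inputs/outputs and run the prepared network for one batch.
  void Execute(int batch_size) override { /* ... */ }
};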

Paddle Inference execution flow

paddle/fluid/framework/naive_executor.cc#L41
void NaiveExecutor::Run() {
#ifdef PADDLE_WITH_MKLDNN
  platform::AttachPointerHashToMKLDNNKey(this, place_);
  platform::RegisterModelLayout(ops_, place_);
#endif
  platform::ScopedFlushDenormal flush;
  for (auto &op : ops_) {
    VLOG(4) << std::this_thread::get_id() << " run "
            << op->DebugStringEx(scope_) << " on scope " << scope_;
    op->SetIsCalledByExecutor(false);
    op->Run(*scope_, place_);
  }
}
paddle/fluid/framework/operator.cc#L204
void OperatorBase::Run(const Scope& scope, const platform::Place& place) {
  auto dev_id = place.device;
  platform::SetDeviceId(dev_id);
  auto op_name = platform::OpName(outputs_, Type());
  RunImpl(scope, place);
}
void OperatorWithKernel::RunImpl(const Scope& scope,
                                 const platform::Place& place,
                                 RuntimeContext* runtime_ctx) const {
  platform::DeviceContextPool& pool = platform::DeviceContextPool::Instance();
  auto* dev_ctx = pool.Get(place);
  auto exe_ctx = ExecutionContext(*this, scope, *dev_ctx, *runtime_ctx);
  // using cache
  if (kernel_type_.get()) {
    dev_ctx = pool.Get(kernel_type_->place_);
  }
  {
    impl_ =
        new CacheImpl(new phi::KernelContext(),
                      new RuntimeInferShapeContext(*this, *runtime_ctx));
    BuildPhiKernelContext(*runtime_ctx, dev_ctx, impl_->getKernelContext());

    (*pt_kernel_)(impl_->getKernelContext());
  }
}
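
Putting the three snippets together, the inference call chain is NaiveExecutor::Run → OperatorBase::Run → OperatorWithKernel::RunImpl, which builds a phi::KernelContext for the op and finally invokes the selected phi kernel through pt_kernel_.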

Miscellaneous notes

ccache and distcc

export USE_CCACHE=1
export CCACHE_DIR=/home/xx/tools/.ccache   # where the cache is stored
ccache -M 50G   # set the maximum cache size
ccache -s       # show cache statistics
ccache -C       # clear the whole cache
The cleanest way to use ccache with CMake (CMake > 3.5): pass -DCMAKE_CXX_COMPILER_LAUNCHER=ccache on the command line, or add it to the CMakeLists.txt:
find_program(CCACHE_FOUND ccache)
if(CCACHE_FOUND)
  set(CMAKE_CXX_COMPILER_LAUNCHER ccache)
endif()
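
distcc itself is not shown above; ccache can forward the actual compile jobs to it via the CCACHE_PREFIX environment variable (export CCACHE_PREFIX=distcc), so the two tools can be combined.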

conv

input        filter       output
1x32x40x80   16x32x3x3    1x16x40x80

int kernel_size = kernel_w * kernel_h;                          // spatial size of the filter
int num_input = weight_data_size / kernel_size / num_output;   // recover input channel count
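
A worked check of the two formulas for the shapes in the table above (a minimal sketch, assuming the usual NCHW input and OIHW filter layout):

#include <cstdio>

int main() {
    // filter 16x32x3x3: num_output x num_input x kernel_h x kernel_w
    const int num_output       = 16;
    const int kernel_h         = 3;
    const int kernel_w         = 3;
    const int weight_data_size = 16 * 32 * 3 * 3;  // 4608 weights in total

    const int kernel_size = kernel_w * kernel_h;                           // 9
    const int num_input   = weight_data_size / kernel_size / num_output;  // 32
    std::printf("kernel_size=%d num_input=%d\n", kernel_size, num_input);
    return 0;
}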

Configuring the Aliyun EPEL mirror on CentOS 8

First install the EPEL release package:
yum install -y https://mirrors.aliyun.com/epel/epel-release-latest-8.noarch.rpm
Then replace the URLs in the repo configs with the Aliyun mirror addresses:
sed -i 's|^#baseurl=https://download.fedoraproject.org/pub|baseurl=https://mirrors.aliyun.com|' /etc/yum.repos.d/epel*
sed -i 's|^metalink|#metalink|' /etc/yum.repos.d/epel*

Notes on common Linux commands

shell

  • Replace a given string in files under a directory
    Replace the string printf with print in every file under the current directory, including subdirectories:
    sed -i "s/printf/print/g" `grep printf -rl ./`

These notes mainly cover the usage of Google gflags.

Common usage

Defining flags

A flag is usually defined in a .cc file and declared in a .h file; any other file that includes that .h can then use the flag variable defined in the .cc.

Flags and argument parsing

gflags::ParseCommandLineFlags(&argc, &argv, true); 

This tells the program to process the flags passed on the command line. The last argument is remove_flags. If it is true, the matched flags and their values are removed from argv and argc is adjusted, so argv keeps only the non-flag command-line arguments. If it is false, argc stays unchanged, but the entries in argv are reordered so that the flags come before the other arguments.

Setting flags on the command line

Changing a flag's default value
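
A minimal sketch tying the points above together; the flag name batch_size and the binary name demo are made up for illustration:

// main.cc
#include <cstdio>
#include <gflags/gflags.h>

// Definition (normally lives in a .cc); other files would put
// DECLARE_int32(batch_size); in a shared header to access FLAGS_batch_size.
DEFINE_int32(batch_size, 1, "batch size used for inference");

int main(int argc, char* argv[]) {
    // remove_flags = true: matched flags are stripped from argv, argc adjusted.
    gflags::ParseCommandLineFlags(&argc, &argv, true);

    // Set on the command line:  ./demo --batch_size=8
    std::printf("batch_size = %d\n", FLAGS_batch_size);

    // Change the flag's *default* value programmatically.
    gflags::SetCommandLineOptionWithMode("batch_size", "4",
                                         gflags::SET_FLAGS_DEFAULT);
    return 0;
}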

This post covers building and using Tencent TNN.

Download and build

Download

git clone https://github.com/Tencent/TNN

CMake and OpenCV are also required; install them separately.

linux x86

mkdir build && cd build
cmake .. -DTNN_X86_ENABLE=ON

Model deployment example

int main() {
    // TODO: the actual TNN init / forward calls; see the sketch below.
    return 0;
}
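
As a rough sketch of what that example would contain, here is the typical TNN demo flow written from memory; the header paths, enum values, and signatures below are assumptions and should be checked against the TNN source before use:

#include <fstream>
#include <memory>
#include <sstream>
#include <string>
#include <vector>
// Header paths assumed from the TNN repo layout.
#include "tnn/core/tnn.h"
#include "tnn/core/instance.h"
#include "tnn/core/mat.h"

// Read a whole file (tnnproto / tnnmodel) into a string.
static std::string ReadFile(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    std::stringstream ss;
    ss << in.rdbuf();
    return ss.str();
}

int main() {
    // Model config: TNN takes the proto/model file contents as strings.
    TNN_NS::ModelConfig model_config;
    model_config.model_type = TNN_NS::MODEL_TYPE_TNN;
    model_config.params = {ReadFile("model.tnnproto"), ReadFile("model.tnnmodel")};

    TNN_NS::TNN net;
    TNN_NS::Status status = net.Init(model_config);

    // Network config: run on x86, matching the -DTNN_X86_ENABLE build above.
    TNN_NS::NetworkConfig network_config;
    network_config.device_type = TNN_NS::DEVICE_X86;
    auto instance = net.CreateInst(network_config, status);

    // Feed an input Mat, run the network, and fetch the output Mat.
    std::vector<int> dims = {1, 3, 224, 224};  // example NCHW input shape
    auto input = std::make_shared<TNN_NS::Mat>(TNN_NS::DEVICE_X86,
                                               TNN_NS::NCHW_FLOAT, dims);
    TNN_NS::MatConvertParam param;
    instance->SetInputMat(input, param);
    instance->Forward();

    std::shared_ptr<TNN_NS::Mat> output;
    instance->GetOutputMat(output);
    return 0;
}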

cmake

cmake usage

hexo

hexo init

This command initializes the blog.

new post

hexo new [layout] title

-p lets you specify a custom path for the new post; for example, the following creates tensorrt.md under source/article:

hexo new -p article/tensorrt "TensorRT"

admin

ref

Hexo NexT admin management

Welcome to Hexo! This is your very first post. Check documentation for more info. If you get any problems when using Hexo, you can find the answer in troubleshooting or you can ask me on GitHub.

Quick Start

Create a new post

$ hexo new "My New Post"

More info: Writing

Run server

$ hexo server

More info: Server

Generate static files

$ hexo generate

More info: Generating

Deploy to remote sites

$ hexo deploy

More info: Deployment