daily note

Posted on 2021-03-16 Edited on 2023-10-03

gpu code

https://github.com/Oramy/m2-cgpu

dl deploy

https://github.com/uber/neuropod

github

https://www.ruanyifeng.com/blog/2017/12/travis_ci_tutorial.html

gpu resource

Arm Mali GPU Best Practices Developer Guide
https://developer.arm.com/documentation/101897/latest
Arm Mali Bifrost and Valhall OpenCL Developer Guide
https://developer.arm.com/documentation/101574/latest/

https://www.edge-ai-vision.com/2015/10/a-quick-guide-to-writing-opencl-kernels-for-powervr-rogue-gpus/

Mali

Midgard

Each shader core can write one 32-bit pixel per clock, so an 8-core design has a total of 256 bits of memory bandwidth (for both read and write) per clock cycle.

Tripipe design:

arithmetic pipeline

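A SIMD vector processing engine that operates on 128-bit, 4-word registers. A shader core can contain several arithmetic pipelines, typically two. The registers can be flexibly accessed as 2 x FP64, 4 x FP32, 8 x FP16, 2 x int64, 4 x int32, 8 x int16, or 16 x int8.

As a result, OpenCL kernels operating on 8-bit luminance data can process 16 pixels per SIMD unit per clock cycle.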
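The lane counts listed above follow from dividing the 128-bit register width by the element width. A minimal sketch of that arithmetic; `kRegisterBits` and `LanesPerRegister` are illustrative names, not part of any Arm API:

```cpp
#include <cassert>

// Width of one Midgard arithmetic-pipeline register, in bits (from the text).
constexpr int kRegisterBits = 128;

// Number of elements of the given bit width that fit in one register.
// Hypothetical helper, written only to illustrate the arithmetic.
int LanesPerRegister(int element_bits) {
  // The element width must be positive and divide the register evenly.
  assert(element_bits > 0 && kRegisterBits % element_bits == 0);
  return kRegisterBits / element_bits;
}
```

For example, `LanesPerRegister(32)` gives the 4 x FP32 case, and `LanesPerRegister(8)` gives the 16 lanes that correspond to 16 8-bit luminance pixels per SIMD unit per clock.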

For Mali T604 and T628, peak performance is 17 FP32 FLOPS per ALU per cycle.

flops	instruction
7	dot product (4 muls, 3 adds)
1	scalar add
4	vec4 add
4	vec4 multiply
1	scalar multiply

Mali-T760: at 600 MHz with 16 cores, peak floating-point performance is 326 FP32 GFLOPS, that is, 16 * 600M * 2 * 17 FP32 FLOPS. Each core contains two arithmetic pipelines, and each pipeline delivers 17 FP32 FLOPS per clock cycle.
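The 326 GFLOPS figure can be reproduced by multiplying out the four factors. The sketch below is only that arithmetic and gives a theoretical peak, not a measured number; `PeakGflops` and its parameter names are made up for illustration, and the per-cycle breakdown is the one from the flops/instruction table.

```cpp
#include <cassert>

// FLOPS per arithmetic pipeline per cycle, summed from the table:
// dot product (7) + scalar add (1) + vec4 add (4) + vec4 multiply (4)
// + scalar multiply (1).
constexpr int kFlopsPerPipePerCycle = 7 + 1 + 4 + 4 + 1;

// Theoretical peak FP32 throughput in GFLOPS.
// Hypothetical helper; all parameter names are illustrative.
double PeakGflops(int cores, double clock_mhz, int pipes_per_core,
                  int flops_per_pipe_per_cycle) {
  assert(cores > 0 && clock_mhz > 0.0 && pipes_per_core > 0);
  const double flops_per_second =
      cores * clock_mhz * 1e6 * pipes_per_core * flops_per_pipe_per_cycle;
  return flops_per_second / 1e9;
}
```

Calling `PeakGflops(16, 600.0, 2, kFlopsPerPipePerCycle)` evaluates 16 * 600e6 * 2 * 17 / 1e9, which is approximately 326.4.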

load/store pipeline

texture pipeline

Texture memory access: bilinear filtering takes one clock cycle. Trilinear filtering loads from two different mipmap levels in memory and takes two clock cycles.

memory system

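Each shader core contains two 16KB L1 data caches, one for texture access and one for general data access.
A single logical L2 cache is shared by all shader cores. Its size is configured by the silicon vendor, typically 32KB or 64KB per instantiated shader core.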

https://community.arm.com/arm-community-blogs/b/graphics-gaming-and-vr-blog/posts/arm-mali-compute-architecture-fundamentals

paddle inference

Posted on 2021-03-16 Edited on 2023-10-03

Study notes on Paddle Inference

Code walkthrough

The Paddle Inference code lives under paddle/fluid/inference.

Engine base class

class EngineBase {
 public:
  using DescType = ::paddle::framework::proto::BlockDesc;
  // Build the model and do some preparation, for example, in TensorRT, run
  // createInferBuilder, buildCudaEngine.
  virtual void Build(const DescType& paddle_model) = 0;
  // Execute the engine, that will run the inference network.
  virtual void Execute(int batch_size) = 0;
  virtual ~EngineBase() {}
};
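To show how this interface is meant to be used, here is a minimal sketch of a derived engine. It is not Paddle code: `FakeBlockDesc` is a stand-in for ::paddle::framework::proto::BlockDesc so that the example compiles on its own, the base class is re-declared over that stand-in, and `RecordingEngine` is a hypothetical subclass that only records what it was asked to do.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Stand-in for ::paddle::framework::proto::BlockDesc (assumption: the real
// type is a protobuf message; only a list of op type names is modeled here).
struct FakeBlockDesc {
  std::vector<std::string> op_types;
};

// Same shape as the EngineBase interface, re-declared over the stand-in type.
class EngineBase {
 public:
  using DescType = FakeBlockDesc;
  virtual void Build(const DescType& model) = 0;
  virtual void Execute(int batch_size) = 0;
  virtual ~EngineBase() {}
};

// Hypothetical engine: Build() copies the op list, Execute() counts runs.
class RecordingEngine : public EngineBase {
 public:
  void Build(const DescType& model) override {
    op_types_ = model.op_types;
    built_ = true;
  }

  void Execute(int batch_size) override {
    assert(built_ && "Build() must be called before Execute()");
    last_batch_size_ = batch_size;
    ++run_count_;
  }

  std::size_t num_ops() const { return op_types_.size(); }
  int last_batch_size() const { return last_batch_size_; }
  int run_count() const { return run_count_; }

 private:
  std::vector<std::string> op_types_;
  bool built_ = false;
  int last_batch_size_ = 0;
  int run_count_ = 0;
};
```

The point of the split is that callers hold an `EngineBase*` and call `Build()` once, then `Execute()` per batch, without knowing which backend sits behind it.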

To be added

framework::ProgramDesc
framework::Executor* executor
framework::Scope* scope

Paddle Inference execution flow

paddle/fluid/framework/naive_executor.cc#L41

void NaiveExecutor::Run() {
#ifdef PADDLE_WITH_MKLDNN
  platform::AttachPointerHashToMKLDNNKey(this, place_);
  platform::RegisterModelLayout(ops_, place_);
#endif
  platform::ScopedFlushDenormal flush;
  for (auto &op : ops_) {
    VLOG(4) << std::this_thread::get_id() << " run "
            << op->DebugStringEx(scope_) << " on scope " << scope_;
    op->SetIsCalledByExecutor(false);
    op->Run(*scope_, place_);
  }
}
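The core of NaiveExecutor::Run() is a plain loop: every op runs in order against the same scope, with no scheduling in between. The self-contained sketch below imitates just that loop. `Scope`, `Op`, `MiniExecutor`, and `ScaleThenAddOne` are simplified stand-ins invented for this example and do not match the real Paddle types.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Stand-in for framework::Scope: a flat name -> value table (assumption).
struct Scope {
  std::map<std::string, float> vars;
};

// Stand-in for an operator: a type name plus the work it does on a scope.
struct Op {
  std::string type;
  std::function<void(Scope&)> run;
};

// Imitates the loop in NaiveExecutor::Run(): ops execute one by one,
// in the order they were given, all on the same scope.
class MiniExecutor {
 public:
  MiniExecutor(std::vector<Op> ops, Scope* scope)
      : ops_(std::move(ops)), scope_(scope) {}

  void Run() {
    assert(scope_ != nullptr);
    for (auto& op : ops_) {
      op.run(*scope_);
    }
  }

 private:
  std::vector<Op> ops_;
  Scope* scope_;
};

// Example helper: computes x * 2 + 1 through two ops run in sequence.
float ScaleThenAddOne(float x) {
  Scope scope;
  scope.vars["x"] = x;
  std::vector<Op> ops = {
      {"scale", [](Scope& s) { s.vars["x"] *= 2.0f; }},
      {"add_one", [](Scope& s) { s.vars["x"] += 1.0f; }},
  };
  MiniExecutor executor(std::move(ops), &scope);
  executor.Run();
  return scope.vars["x"];
}
```

Because the ops run strictly in order, `ScaleThenAddOne(3.0f)` computes (3 * 2) + 1; swapping the two ops would compute (3 + 1) * 2 instead.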

paddle/fluid/framework/operator.cc#L204

void OperatorBase::Run(const Scope& scope, const platform::Place& place) {
  auto dev_id = place.device;
  platform::SetDeviceId(dev_id);
  auto op_name = platform::OpName(outputs_, Type());
  RunImpl(scope, place);
}

void OperatorWithKernel::RunImpl(const Scope& scope,
                                 const platform::Place& place,
                                 RuntimeContext* runtime_ctx) const {
  platform::DeviceContextPool& pool = platform::DeviceContextPool::Instance();
  auto* dev_ctx = pool.Get(place);
  auto exe_ctx = ExecutionContext(*this, scope, *dev_ctx, *runtime_ctx);
  // using cache
  if (kernel_type_.get()) {
    dev_ctx = pool.Get(kernel_type_->place_);
  }
  {
    impl_ = new CacheImpl(new phi::KernelContext(),
                          new RuntimeInferShapeContext(*this, *runtime_ctx));
    BuildPhiKernelContext(*runtime_ctx, dev_ctx, impl_->getKernelContext());

    (*pt_kernel_)(impl_->getKernelContext());
  }
}

notes

Posted on 2021-03-12 Edited on 2023-10-03

Miscellaneous notes

ccache and distcc

export USE_CCACHE=1
export CCACHE_DIR=/home/xx/tools/.ccache
ccache -M 50G
ccache -s
ccache -C

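The best way to use ccache with CMake (cmake > 3.5): pass -DCMAKE_CXX_COMPILER_LAUNCHER=ccache on the command line, or add the following to the CMake configuration file (for example CMakeLists.txt):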
find_program(CCACHE_FOUND ccache)
if(CCACHE_FOUND)
  set(CMAKE_CXX_COMPILER_LAUNCHER ccache)
endif()

conv

input	filter	output
1x32x40x80	16x32x3x3	1x16x40x80

int kernel_size = kernel_w * kernel_h;
int num_input = weight_data_size / kernel_size / num_output;
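Plugging the shapes from the table into the two lines above recovers the input channel count from the flat weight size. The sketch below also checks the output size with the usual convolution formula. Padding 1 and stride 1 are assumptions, inferred only from the output keeping the 40x80 spatial size with a 3x3 filter. `NumInput` and `OutDim` are illustrative helper names.

```cpp
#include <cassert>

// Recover the number of input channels from the flat weight buffer size,
// following the two lines above: weights are
// num_output x num_input x kernel_h x kernel_w.
int NumInput(int weight_data_size, int kernel_w, int kernel_h,
             int num_output) {
  const int kernel_size = kernel_w * kernel_h;
  // The weight buffer must split evenly into filters.
  assert(weight_data_size % (kernel_size * num_output) == 0);
  return weight_data_size / kernel_size / num_output;
}

// Standard convolution output size along one dimension.
// pad and stride are assumptions; the table does not state them.
int OutDim(int in, int kernel, int pad, int stride) {
  assert(stride > 0);
  return (in + 2 * pad - kernel) / stride + 1;
}
```

For the 16x32x3x3 filter the weight buffer holds 16 * 32 * 3 * 3 = 4608 values, so `NumInput(4608, 3, 3, 16)` is 4608 / 9 / 16 = 32, matching the 32 input channels. With pad 1 and stride 1, `OutDim(40, 3, 1, 1)` and `OutDim(80, 3, 1, 1)` give 40 and 80.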

Configuring the Aliyun mirror for EPEL on CentOS 8

Common commands

Posted on 2021-03-11 Edited on 2023-10-03

A record of commonly used Linux commands.

shell

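Replace a string across all files in a directory
Replace the string printf with print in every file under the current directory, including subdirectories:
sed -i "s/printf/print/g" `grep printf -rl ./`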

Common commands

Posted on 2021-03-11 Edited on 2023-10-03

gflags

Posted on 2021-03-11 Edited on 2023-10-03

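An introduction to using Google gflags.

Common usage

Defining flags

Flags are typically defined in a .cc file and declared in a .h file. Any other file that includes that .h can then use the flag variables defined in the .cc.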

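Flags and argument parsing

gflags::ParseCommandLineFlags(&argc, &argv, true);

This tells the program to process the arguments passed on the command line. The last parameter is remove_flags. If it is true, the flags and their values are removed from argv and argc is updated, so argv keeps only the remaining command-line arguments. If it is false, argc stays unchanged and the contents of argv are reordered so that the flags come before the command-line arguments.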

Setting flags on the command line

Changing a flag's default value

TNN

Posted on 2021-03-06 Edited on 2023-10-03

This post covers building and using Tencent TNN.

Download and build

Download

git clone https://github.com/Tencent/TNN

cmake and OpenCV are also required and must be installed separately.

linux x86

mkdir build && cd build
cmake .. -DTNN_X86_ENABLE=ON

Model deployment example

int main() {
  return 0;
}

cmake

Posted on 2021-03-01 Edited on 2023-10-03

cmake

cmake usage

hexo

Posted on 2021-03-01 Edited on 2023-10-03

hexo

hexo init

This command initializes the blog.

new post

hexo new [layout] <title>

-p sets a custom path for the new post. For example, the command below creates tensorrt.md under the source/article directory:

hexo new -p article/tensorrt "TensorRT"

admin

ref

Hexo NexT admin backend

Hello World

Posted on 2021-01-01 Edited on 2023-10-03

Welcome to Hexo! This is your very first post. Check documentation for more info. If you get any problems when using Hexo, you can find the answer in troubleshooting or you can ask me on GitHub.

Quick Start

Create a new post

$ hexo new "My New Post"

More info: Writing

Run server

$ hexo server

More info: Server

Generate static files

$ hexo generate

More info: Generating

Deploy to remote sites

$ hexo deploy

More info: Deployment