VPN service

Posted on 2023-01-02 Edited on 2023-10-03

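Building a VPN proxy on a VPS

How it works

Preparation

1. Free domain names

Free domain registration at freenom
Registration page: https://my.freenom.com/
When registering, the address in your profile must match the location of your IP address. From a mainland China IP, use the Gooreplacer Chrome extension to redirect www.google.com/recaptcha to recaptcha.net/recaptcha
Free domain registration at eu.org
Registration page: https://nic.eu.org/arf/en (requires a proxy)
Free domain registration at https://pp.ua/

2. DNS resolution

https://topdn.net
Simple to configure, and records update quickly
cloudflare
I recommend cloudflare: it is full-featured and can also hide the server's IP address

3. CDN (optional)

https://www.cloudflare.com/ can hide the VPS IP address, and can also front an overseas IP that has already been blocked by the firewall (for example, an Alibaba Cloud Hong Kong host)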

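Setup steps

1. Server-side v2ray or trojan with traffic camouflage

2. Configuring the v2rayNG client

3. Relaying traffic through a CDN (optional)

Purposes of relaying: 1. hide the VPS IP; 2. recover an overseas IP that has been blocked

Reference links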

cudnn

Posted on 2022-04-06 Edited on 2023-10-03

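cudnn optimization settings

cudnn deterministic

When set to true, cudnn is restricted to deterministic algorithms, so results are reproducible, possibly at some cost in speed. Automatically searching for the most efficient algorithm for the current configuration is the job of the separate benchmark setting, and the algorithms it selects may be non-deterministic.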

Thread pool

Posted on 2022-04-05 Edited on 2023-10-03

Thread pool

Supports use as a singleton and accepts tasks with arbitrary arguments.

#pragma once

#include <condition_variable>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <queue>
#include <stdexcept>
#include <thread>
#include <vector>

class ThreadPool {
 public:
  using Task = std::function<void()>;
  explicit ThreadPool(int num_threads): running_(true) {
    threads_.resize(num_threads);
    for (auto& thread : threads_) {
      thread.reset(new std::thread(&ThreadPool::TaskLoop, this));
    }
  }
  ~ThreadPool() {
    {
      std::unique_lock<std::mutex> lock(mutex_);
      running_ = false;
    }
    scheduled_.notify_all();
    for (auto& thread : threads_) {
      thread->join();
      thread.reset(nullptr);
    }
  }
  static ThreadPool* GetInstance() {
    std::call_once(init_flag_, &ThreadPool::Init);
    return threadpool_.get();
  }
  

  ThreadPool(const ThreadPool& pool) = delete;
  ThreadPool& operator=(const ThreadPool& pool) = delete;
  // Submits a callable with its arguments and returns a future for the result.
  template<class F, class... Args>
  auto Commit(F&& f, Args&&... args) -> std::future<decltype(f(args...))> {
    using RetType = decltype(f(args...));
    auto task = std::make_shared<std::packaged_task<RetType()>>(
      std::bind(std::forward<F>(f), std::forward<Args>(args)...)
    );
    std::future<RetType> future = task->get_future();
    {
        std::lock_guard<std::mutex> lock(mutex_);
        // running_ is written under mutex_, so read it under the same lock
        if (!running_) {
          throw std::runtime_error("ThreadPool is not running");
        }
        tasks_.emplace([task]() {
            (*task)();
        });
    }
    scheduled_.notify_one();
    return future;
  }

 private:
  void TaskLoop() {
    while (true) {
      Task task;
      {
        std::unique_lock<std::mutex> lock(mutex_);
        scheduled_.wait(
            lock, [this] { return !this->tasks_.empty() || !this->running_; });
        if (!running_ && tasks_.empty()) {
          return;
        }
        task = std::move(tasks_.front());
        tasks_.pop();
      }
      task();
    }
  }
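  static void Init() {
    if (threadpool_ == nullptr) {
      // hardware_concurrency() may return 0 when the value cannot be determined
      int num_threads = static_cast<int>(std::thread::hardware_concurrency());
      if (num_threads <= 0) {
        num_threads = 1;
      }
      threadpool_.reset(new ThreadPool(num_threads));
    }
  }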

 private:
  static std::unique_ptr<ThreadPool> threadpool_;
  static std::once_flag init_flag_;

  std::vector<std::unique_ptr<std::thread>> threads_;
  std::queue<Task> tasks_;
  std::mutex mutex_;
  bool running_;
  std::condition_variable scheduled_;
};

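// If this header is included from more than one translation unit, move these
// two definitions into a single .cpp file to avoid multiple-definition link errors.
std::unique_ptr<ThreadPool> ThreadPool::threadpool_ = nullptr;
std::once_flag ThreadPool::init_flag_;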

Usage example

#include <chrono>
#include <future>
#include <iostream>
#include <thread>
#include <vector>

#include "thread_pool.h"

struct sum {
  int operator()(int a, int b) {
      int res = a + b;
      return res;
  }
};

int print(int a) {
    return a;
}

class A {
public:
  static int calc(int val) {
      return val;
  }
};
int main() {
  ThreadPool executor(10);
  auto result = executor.Commit(&print, 3);
  std::cout << "result: " << result.get() << std::endl;

  // auto result = executor.Commit(A::calc, 3);
  // std::cout << "result: " << result.get() << std::endl;
  ThreadPool pool(4);
  std::vector<std::future<int>> results;
  std::chrono::seconds span(1);
  for (int i = 0; i < 2; ++ i) {
    results.emplace_back(
        pool.Commit([i, span] {
            std::cout << "run " << i << std::endl;
            std::this_thread::sleep_for(span);
            return i * i;
        })
    );
  }

  for (auto && item : results) {
      // Each task sleeps for `span`, so wait longer than that before giving up
      if(item.wait_for(2 * span) == std::future_status::ready) {
        std::cout << item.get() << std::endl;
      }
  }

  return 0;
}

Key points

Work queue
Thread factory
Saturation policy (handler)
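
The work queue and saturation policy listed above can be sketched as a bounded queue that decides what to do when it is full. This is only a minimal illustration under assumptions: BoundedTaskQueue, SaturationPolicy, TryPush, TryPop, and Size are hypothetical names, and the three policies shown are common choices rather than anything the ThreadPool class above implements.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <utility>

// Hypothetical policies applied when the queue is already full.
enum class SaturationPolicy {
  kReject,      // refuse the new task and report failure to the caller
  kDropOldest,  // discard the oldest queued task to make room
  kCallerRuns,  // run the new task on the submitting thread
};

// Hypothetical bounded work queue; not part of the ThreadPool class.
class BoundedTaskQueue {
 public:
  using Task = std::function<void()>;

  BoundedTaskQueue(std::size_t capacity, SaturationPolicy policy)
      : capacity_(capacity), policy_(policy) {
    assert(capacity_ > 0);
  }

  // Returns true if the task was queued or executed, false if it was rejected.
  bool TryPush(Task task) {
    std::unique_lock<std::mutex> lock(mutex_);
    if (tasks_.size() < capacity_) {
      tasks_.push(std::move(task));
      return true;
    }
    switch (policy_) {
      case SaturationPolicy::kReject:
        return false;
      case SaturationPolicy::kDropOldest:
        tasks_.pop();
        tasks_.push(std::move(task));
        return true;
      case SaturationPolicy::kCallerRuns:
        lock.unlock();  // do not hold the lock while running user code
        task();
        return true;
    }
    return false;
  }

  // Moves the next task into *out; returns false if the queue is empty.
  bool TryPop(Task* out) {
    std::lock_guard<std::mutex> lock(mutex_);
    if (tasks_.empty()) return false;
    *out = std::move(tasks_.front());
    tasks_.pop();
    return true;
  }

  std::size_t Size() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return tasks_.size();
  }

 private:
  const std::size_t capacity_;
  const SaturationPolicy policy_;
  mutable std::mutex mutex_;
  std::queue<Task> tasks_;
};
```

With kReject, the submitter sees a false return value and can retry or fail. With kCallerRuns, the submitting thread slows down naturally because it does the work itself, which acts as a simple form of back pressure.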

work stealing

Posted on 2022-03-19 Edited on 2023-10-03

work stealing

Nsight System

Posted on 2022-03-17 Edited on 2023-10-03

Nsight System

Download: https://developer.nvidia.com/gameworksdownload#?search=nsight

Nsight Compute

OpenCL

Posted on 2022-03-06 Edited on 2023-10-03

OpenCL platform model

fiber(work item) - wave - workgroup

(thread - warp - block?)
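
The hierarchy above can be made concrete with a little index arithmetic. The sketch below is plain C++ that mimics, for a 1-D range with zero global offset, how a global work-item ID splits into a workgroup ID and a local ID. The further split into waves assumes work items are packed into waves in linear order and that the wave size is supplied by the caller; both are assumptions, since the wave size and packing are hardware dependent. WorkItemPosition, Locate, and GlobalId are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>

// Position of one work item in a 1-D NDRange (illustrative only).
struct WorkItemPosition {
  std::size_t group_id;  // which workgroup (CUDA analogue: block index)
  std::size_t local_id;  // index inside the workgroup (CUDA analogue: thread index)
  std::size_t wave_id;   // which wave inside the workgroup (CUDA analogue: warp)
  std::size_t lane_id;   // index inside the wave
};

// Decomposes a global work-item ID. wave_size is an assumption passed in by
// the caller (for example 32 or 64); real hardware decides this value.
inline WorkItemPosition Locate(std::size_t global_id, std::size_t local_size,
                               std::size_t wave_size) {
  assert(local_size > 0 && wave_size > 0);
  WorkItemPosition p;
  p.group_id = global_id / local_size;
  p.local_id = global_id % local_size;
  p.wave_id = p.local_id / wave_size;
  p.lane_id = p.local_id % wave_size;
  return p;
}

// Inverse of the first step: global ID from group ID and local ID.
inline std::size_t GlobalId(std::size_t group_id, std::size_t local_id,
                            std::size_t local_size) {
  return group_id * local_size + local_id;
}
```

For example, with a workgroup size of 128 and an assumed wave size of 64, global ID 200 lands in workgroup 1 at local ID 72, which is lane 8 of wave 1.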

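OpenCL execution model

Context

Command queue

Kernel execution

OpenCL memory model

Memory types

host memory
global memory
constant memory
Low latency when held on chip, high latency when held in system RAM. Holds constant data for all work items in a work group.
local memory
Shared by all work items within a work group.
Local Memory coalesced access
private memory

Memory object types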

buffer
image
pipe

OpenCL API

clCreateProgramWithSource()
clBuildProgram()
clLinkProgram()
clUnloadPlatformCompiler()
clCreateProgramWithBinary()

clCreate{Image|Buffer}
clEnqueueNDRangeKernel()

cl_mem clCreateBuffer(
    cl_context context,
    cl_mem_flags flags,
    size_t size,
    void *host_ptr,
    cl_int *errcode_ret);

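OpenCL performance optimization

Memory

local memory
Work items that exchange data through local memory must synchronize with a barrier, and the barrier is expensive.

Barriers often introduce synchronization latency that stalls the ALUs and lowers ALU utilization.

In some cases, staging data in local memory requires synchronization whose latency cancels out the performance gained from local memory. In such cases, using global memory directly and avoiding the barrier may be the better choice.

OpenCL resources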

https://developer.qualcomm.com/download/adrenosdk/adreno-opencl-programming-guide.pdf
https://developer.qualcomm.com/sites/default/files/docs/adreno-gpu/developer-guide/gpu/gpu.html
https://developer.qualcomm.com/blog/matrix-multiply-adreno-gpus-part-1-opencl-optimization

Qualcomm_Mobile_OpenCL Chinese translation

MacTex

Posted on 2022-03-02 Edited on 2023-10-03

install

macOS

brew install mactex

ubuntu

apt update
apt-get install texlive-xetex latex-cjk-all texmaker

link

https://github.com/FengMengZhao/LaTeX_generate_Chinese_resume
https://github.com/billryan/resume/tree/zh_CN
https://herechen.github.io/post/latex-skills/#latexmk-%E8%87%AA%E5%8A%A8%E5%8C%96%E7%BC%96%E8%AF%91

Modern Cpp

Posted on 2022-01-09 Edited on 2023-10-03

cpp11

type traits

std::integral_constant

Wraps a static constant of the specified type. Defined in header <type_traits>.
template<class T, T v>
struct integral_constant;
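
A short sketch of how std::integral_constant tends to be used: the value lives in the type, so it can be checked at compile time and used to select an overload (tag dispatch). Four, FourAtRuntime, IsIntegralImpl, and IsIntegral are hypothetical names chosen for this illustration; everything else comes from <type_traits>.

```cpp
#include <cassert>
#include <type_traits>

// std::integral_constant<T, v> carries the value v in its type.
using Four = std::integral_constant<int, 4>;
static_assert(Four::value == 4, "value is available at compile time");
static_assert(std::is_same<Four::value_type, int>::value, "value_type is T");

// std::true_type and std::false_type are integral_constant<bool, ...>.
static_assert(
    std::is_same<std::true_type, std::integral_constant<bool, true>>::value,
    "true_type is integral_constant<bool, true>");

// Runtime view of the same constant: an object converts implicitly to its value.
inline int FourAtRuntime() {
  Four four;
  int v = four;  // uses integral_constant::operator value_type()
  assert(v == Four::value);
  return v;
}

// Tag dispatch: the overload is chosen by the type of the tag argument.
inline bool IsIntegralImpl(std::true_type) { return true; }
inline bool IsIntegralImpl(std::false_type) { return false; }

template <class T>
bool IsIntegral(T) {
  // std::is_integral<T> derives from std::true_type or std::false_type.
  return IsIntegralImpl(std::is_integral<T>{});
}
```

Calling IsIntegral(42) selects the std::true_type overload, while IsIntegral(1.5) selects the std::false_type overload. The choice is made by the compiler, not by a runtime branch.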

TensorRT

Posted on 2022-01-06 Edited on 2023-10-03

introduction

install

There are three ways to install:

container: download the NGC container;
debian packages
pip

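Installing via container

Download https://github.com/NVIDIA/TensorRT/blob/main/docker/ubuntu-18.04.Dockerfile
docker build -f ubuntu-18.04.Dockerfile --build-arg CUDA_VERSION=11.4.3 --tag=tensorrt-ubuntu .

Installing via debian packages

Installing via pip

Unlike installing the wheel shipped inside the TensorRT package, this method manages the TensorRT installation itself, so the TensorRT package does not need to be installed beforehand. Currently only Python 3.6 to 3.9 and CUDA 11.4 are supported.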

Before installing

python3 -m pip install nvidia-pyindex

When running pip install, additionally pass --extra-index-url https://pypi.ngc.nvidia.com

Install the TensorRT wheel

python3 -m pip install --upgrade nvidia-tensorrt

Verify the installation

python3
>>> import tensorrt
>>> print(tensorrt.__version__)
>>> assert tensorrt.Builder(tensorrt.Logger())

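TensorRT ecosystem

basic workflow

convert

Use TF-TRT
Use Torch-TensorRT
Convert a .onnx model with the ONNX converter
Build the network with the TensorRT API

deploy

Using TensorFlow
Deploy the model with TensorFlow directly; OPs that TensorRT does not support fall back to the TensorFlow implementation.

Using the TRT Runtime API
Lowest overhead, with fine-grained control. OPs that are not natively supported must be implemented as plugins.

Using Nvidia Triton Inference Server
Supports multiple frameworks, including TensorFlow, TensorRT, PyTorch, ONNX Runtime, or a custom framework.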

TensorRT basics

Creating the engine

// Logger is assumed to be a user-defined subclass of nvinfer1::ILogger
Logger gLogger;
nvinfer1::IBuilder* builder = nvinfer1::createInferBuilder(gLogger);
nvinfer1::INetworkDefinition* network = builder->createNetworkV2(
    1U << static_cast<uint32_t>(nvinfer1::NetworkDefinitionCreationFlag::kEXPLICIT_BATCH));

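Building for inference

nvinfer1::IBuilderConfig* config = builder->createBuilderConfig();
// setMemoryPoolLimit takes the pool type and a size in bytes
config->setMemoryPoolLimit(nvinfer1::MemoryPoolType::kWORKSPACE, 1U << 20);
// Set the inference precision
config->setFlag(nvinfer1::BuilderFlag::kFP16);

// buildSerializedNetwork returns a serialized plan; a runtime deserializes it into an engine
nvinfer1::IHostMemory* plan = builder->buildSerializedNetwork(*network, *config);
nvinfer1::IRuntime* runtime = nvinfer1::createInferRuntime(gLogger);
nvinfer1::ICudaEngine* engine = runtime->deserializeCudaEngine(plan->data(), plan->size());
nvinfer1::IExecutionContext* context = engine->createExecutionContext();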

void* buffers[n];
engine->getBindingIndex(
context->enqueueV2(buffers, stream, nullptr);

dynamic shape

createNetwork() and createNetworkV2() differ in two ways. First, the former works with (C,H,W) dimensions, while the latter works with (B,C,H,W). Second, the latter supports dynamic shapes.

plugin

createNetwork()
createNetworkV2()

TensorRT optimization

https://blog.csdn.net/qq_33287871/article/details/117201271
Weight & Activation Precision Calibration
Layer & Tensor Fusion
Kernel Auto-Tuning
Dynamic Tensor Memory
Multi-Stream Execution

TensorRT API

bool reshapeWeights(
    const Weights& input, int32_t const* shape, int32_t const* shapeOrder, void* data, int32_t nbDims) noexcept;
bool reorderSubBuffers(
    void* input, int32_t const* order, int32_t num, int32_t size) noexcept;
bool transposeSubBuffers(
    void* input, DataType type, int32_t num, int32_t height, int32_t width) noexcept;

TensorRT common issues

DLA

DLA Supported Layers

reference

TensorRT

TensorRT Developer Guide

TensorRT API

ONNX-TensorRT

Torch-TensorRT

Metal for Paddle Lite

Posted on 2022-01-05 Edited on 2023-10-03

Metal for Paddle Lite

Metal kernel and context
Metal OP execution
