A Brief Analysis of the 2D Convolution Code in TensorFlow

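TensorFlow's basic compute functions, called kernels, are all defined under the tensorflow/core/kernels directory. Convolution, one of the most common hotspots in deep neural networks, is defined and implemented there as well; the main file involved is conv_ops.cc. Based on the TensorFlow r1.5 source, this post takes a first look at how the convolution op is put together.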

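1. A brief look at the 2D convolution code structure

2D convolution is a major hotspot in convolutional networks. Profiling VGG16 inference (CPU only) with Intel VTune gives a total CPU time of 43.576 s, of which the convolution op tensorflow::Conv2DOp::operator() accounts for 37.172 s, or 85.3%. It is unmistakably the number-one hotspot.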

In tensorflow/core/kernels/conv_ops.cc, the skeleton of Conv2DOp looks like this:

// File: tensorflow/core/kernels/conv_ops.cc

template <typename Device, typename T>
class Conv2DOp : public BinaryOp<T> {
public:
  explicit Conv2DOp(OpKernelConstruction* context) : BinaryOp<T>(context) {
    ...
  }
...
  void Compute(OpKernelContext* context) override {
    ...
    if (LaunchDeepConvOp<Device, T>::Run(
            context, input, filter, dimensions.batch, dimensions.input_rows,
            dimensions.input_cols, dimensions.in_depth, dimensions.filter_rows,
            dimensions.filter_cols, dimensions.pad_rows, dimensions.pad_cols,
            dimensions.out_rows, dimensions.out_cols, dimensions.out_depth,
            dimensions.dilation_rows, dimensions.dilation_cols,
            dimensions.stride_rows, dimensions.stride_cols, output,
            params_.data_format)) {
      return;
    }

    launcher_(context, use_cudnn_, cudnn_use_autotune_, input, filter,
              dimensions.dilation_rows, dimensions.dilation_cols,
              dimensions.stride_rows, dimensions.stride_cols, params_.padding,
              output, params_.data_format);
  }

 private:
  Conv2DParameters params_;
  bool use_cudnn_;
  bool cudnn_use_autotune_;

  LaunchConv2DOp<Device, T> launcher_;

  TF_DISALLOW_COPY_AND_ASSIGN(Conv2DOp);
...
};

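Whether LaunchDeepConvOp::Run() actually performs the computation depends on the convolution parameters. When it does not, the launcher_() call that follows is executed; this is what happens in the VGG16 example profiled above, where launcher_() is called directly. The real entry points are therefore void LaunchConv2DOp<CPUDevice, T>::operator() and void LaunchConv2DOp<GPUDevice, T>::operator().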

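In the CPU version, void LaunchConv2DOp<CPUDevice, T>::operator(), the arguments are validated and then void LaunchGeneric::operator() is called directly. That function in turn picks one of two compute kernels depending on the conditions: functor::MatMulConvFunctor<Device, T>() or functor::SpatialConvolution<Device, T>().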
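Why a matrix-multiplication kernel can stand in for a convolution is easiest to see for a 1x1 filter with stride 1 on NHWC data: every output pixel is a dot product over the input channels, so the whole operation is a single (N*H*W x C_in) by (C_in x C_out) product. The sketch below checks this numerically. It is self-contained and does not use TensorFlow; the names conv1x1_direct and conv1x1_as_matmul are illustrative only, and the exact conditions LaunchGeneric tests before choosing a kernel are not reproduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Direct 1x1 convolution with stride 1.
// Input is NHWC [n, h, w, cin]; the 1x1 filter is stored as [cin, cout];
// the output is NHWC [n, h, w, cout].
std::vector<float> conv1x1_direct(const std::vector<float>& in,
                                  const std::vector<float>& filter,
                                  int n, int h, int w, int cin, int cout) {
  std::vector<float> out(static_cast<std::size_t>(n) * h * w * cout, 0.0f);
  for (int b = 0; b < n; ++b)
    for (int y = 0; y < h; ++y)
      for (int x = 0; x < w; ++x) {
        // Linear index of pixel (b, y, x), shared by input and output.
        const std::size_t pixel = (static_cast<std::size_t>(b) * h + y) * w + x;
        for (int co = 0; co < cout; ++co) {
          float acc = 0.0f;
          for (int ci = 0; ci < cin; ++ci)
            acc += in[pixel * cin + ci] *
                   filter[static_cast<std::size_t>(ci) * cout + co];
          out[pixel * cout + co] = acc;
        }
      }
  return out;
}

// The same computation written as a plain row-major matrix product:
// A is [n*h*w, cin], B is [cin, cout], C is [n*h*w, cout].
std::vector<float> conv1x1_as_matmul(const std::vector<float>& in,
                                     const std::vector<float>& filter,
                                     int n, int h, int w, int cin, int cout) {
  const std::size_t m = static_cast<std::size_t>(n) * h * w;
  std::vector<float> out(m * cout, 0.0f);
  for (std::size_t i = 0; i < m; ++i)
    for (int j = 0; j < cout; ++j) {
      float acc = 0.0f;
      for (int p = 0; p < cin; ++p)
        acc += in[i * cin + p] * filter[static_cast<std::size_t>(p) * cout + j];
      out[i * cout + j] = acc;
    }
  return out;
}
```

No data is rearranged in the second version: the NHWC buffer is already the row-major matrix A, which is what makes this shortcut cheap.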

The GPU version, void LaunchConv2DOp<GPUDevice, T>::operator(), is somewhat more involved. Structurally it looks like this:

// File: tensorflow/core/kernels/conv_ops.cc

template <typename T>
void LaunchConv2DOp<GPUDevice, T>::operator()(
    OpKernelContext* ctx, bool use_cudnn, bool cudnn_use_autotune,
    const Tensor& input_param, const Tensor& filter, int row_dilation,
    int col_dilation, int row_stride, int col_stride, const Padding& padding,
    Tensor* output, TensorFormat data_format) {
    ...
    if (/* 1x1 filter */) {
        // Call cuBLAS directly.
        bool blas_launch_status =
            stream
                ->ThenBlasGemm(no_transpose, no_transpose, n, m, k, 1.0f, b_ptr, n,
                               a_ptr, k, 0.0f, &c_ptr, n)
                .ok();
        ...
        return;
    } else if (/* input data and filter have the same height/width */) {
        // Call cuBLAS directly.
        bool blas_launch_status =
            stream
                ->ThenBlasGemm(no_transpose, no_transpose, n, m, k, 1.0f, b_ptr, n,
                               a_ptr, k, 0.0f, &c_ptr, n)
                .ok();
        ...
        return;
    }
    ...
    if (padding == SAME) {
        // handle padding
        ...
    }
    if (data_format == FORMAT_NHWC) {
        // Convert the input tensor from NHWC to NCHW.
        ...
    }
    // Set parameters of cudnn
    ...
    // Transform filter, allocate output tensor memory
    ...
    // Set more parameters
    ...
    AlgorithmConfig algorithm_config;
    if (cudnn_use_autotune &&
        !AutoTuneConv::GetInstance()->Find(conv_parameters, &algorithm_config)) {
        // Try to get the best algorithm
        ...
    }
    ...
    bool cudnn_launch_status =
        stream
            ->ThenConvolveWithAlgorithm(input_desc, input_ptr, filter_desc,
                                        filter_ptr, conv_desc, output_desc,
                                        &output_ptr, &scratch_allocator,
                                        algorithm_config, nullptr)
            .ok();
    ...
    if (data_format == FORMAT_NHWC) {
        // Convert the output tensor back from NCHW to NHWC.
        ...
    }
    ...
}
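The order of the branches in the outline above can be condensed into a small decision function. This is only a sketch: ConvShape, GpuConvPath and ChooseGpuConvPath are hypothetical names, not TensorFlow types. The unit-stride and VALID-padding requirements are assumptions added here because they are what makes a single GEMM equivalent to the convolution; the real conditions in conv_ops.cc check more than this and are not reproduced.

```cpp
#include <cassert>

// Hypothetical, simplified description of one convolution call.
struct ConvShape {
  int in_rows, in_cols;
  int filter_rows, filter_cols;
  int stride_rows, stride_cols;
  bool valid_padding;  // true for VALID, false for SAME
};

enum class GpuConvPath { kGemm1x1, kGemmFullImage, kCudnn };

// Mirrors the branch order of the outline: two GEMM shortcuts first,
// then the general cuDNN path.
GpuConvPath ChooseGpuConvPath(const ConvShape& s) {
  const bool unit_stride = s.stride_rows == 1 && s.stride_cols == 1;
  // Assumption: the 1x1 shortcut needs stride 1 so that every input pixel
  // maps to exactly one output pixel.
  if (s.filter_rows == 1 && s.filter_cols == 1 && unit_stride) {
    return GpuConvPath::kGemm1x1;
  }
  // Assumption: the full-image shortcut needs VALID padding, so the filter
  // covers the whole input once and the spatial output is 1x1.
  if (s.filter_rows == s.in_rows && s.filter_cols == s.in_cols &&
      s.valid_padding) {
    return GpuConvPath::kGemmFullImage;
  }
  return GpuConvPath::kCudnn;
}
```

A typical VGG16 layer (3x3 filter on a 224x224 input with SAME padding) matches neither shortcut and lands on the cuDNN path.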

When it performs the computation, ThenConvolveWithAlgorithm() eventually calls port::Status CudnnSupport::DoConvolveImpl() in tensorflow/stream_executor/cuda/cuda_dnn.cc:

// File: tensorflow/stream_executor/cuda/cuda_dnn.cc

template <class T>
port::Status CudnnSupport::DoConvolveImpl(
    Stream* stream, const dnn::BatchDescriptor& input_descriptor,
    const DeviceMemory<T>& input_data,
    const dnn::FilterDescriptor& filter_descriptor,
    const DeviceMemory<T>& filter_data,
    const dnn::ConvolutionDescriptor& convolution_descriptor,
    const dnn::BatchDescriptor& output_descriptor, DeviceMemory<T>* output_data,
    ScratchAllocator* scratch_allocator,
    const dnn::AlgorithmConfig& algorithm_config,
    dnn::ProfileResult* output_profile_result) {
  ...
  RETURN_IF_CUDNN_ERROR(cudnnConvolutionForward(
      cudnn.handle(),
      /*alpha=*/alpha, /*srcDesc=*/input_nd.handle(),
      /*srcData=*/input_data.opaque(), /*filterDesc=*/filter.handle(),
      /*filterData=*/filter_data.opaque(), /*convDesc=*/conv.handle(),
      /*algo=*/ToConvForwardAlgo(algo_desc), /*workSpace=*/scratch.opaque(),
      /*workSpaceSizeInBytes=*/scratch.size(), /*beta=*/beta,
      /*yDesc=*/output_nd.handle(), /*y=*/output_data->opaque()));
  ...
}

cudnnConvolutionForward is cuDNN's forward-convolution API.
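The GPU outline earlier converts NHWC input to NCHW before the cuDNN call and converts the output back afterwards. The index mapping behind that conversion can be sketched with plain loops. TensorFlow performs it on the device; the CPU functions below (NhwcToNchw and NchwToNhwc, names chosen here for illustration) only show which element goes where.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Copies a dense NHWC buffer [n, h, w, c] into NCHW order [n, c, h, w].
std::vector<float> NhwcToNchw(const std::vector<float>& src,
                              int n, int h, int w, int c) {
  std::vector<float> dst(src.size());
  for (int b = 0; b < n; ++b)
    for (int y = 0; y < h; ++y)
      for (int x = 0; x < w; ++x)
        for (int ch = 0; ch < c; ++ch) {
          const std::size_t from =
              ((static_cast<std::size_t>(b) * h + y) * w + x) * c + ch;
          const std::size_t to =
              ((static_cast<std::size_t>(b) * c + ch) * h + y) * w + x;
          dst[to] = src[from];
        }
  return dst;
}

// Inverse mapping: NCHW [n, c, h, w] back to NHWC [n, h, w, c].
std::vector<float> NchwToNhwc(const std::vector<float>& src,
                              int n, int h, int w, int c) {
  std::vector<float> dst(src.size());
  for (int b = 0; b < n; ++b)
    for (int ch = 0; ch < c; ++ch)
      for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) {
          const std::size_t from =
              ((static_cast<std::size_t>(b) * c + ch) * h + y) * w + x;
          const std::size_t to =
              ((static_cast<std::size_t>(b) * h + y) * w + x) * c + ch;
          dst[to] = src[from];
        }
  return dst;
}
```

For a 1x2 image with 3 channels, the NHWC buffer {0, 1, 2, 3, 4, 5} becomes {0, 3, 1, 4, 2, 5} in NCHW: each channel's pixels end up contiguous.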

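The overall call flow of LaunchConv2DOp<GPUDevice, T>::operator() is shown in the diagram below:

[Figure: call diagram of LaunchConv2DOp<GPUDevice, T>::operator()]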

2. tensorflow::Tensor and Eigen::Tensor

tensorflow::Tensor is the core data structure in TensorFlow. It is built on top of Eigen::Tensor and wraps it further.
tensorflow::Tensor is mainly defined in tensorflow/core/framework/tensor.h, while its connection to Eigen::Tensor lives mainly in tensorflow/core/framework/tensor_types.h:

// File: tensorflow/core/framework/tensor_types.h

...
#include "third_party/eigen3/unsupported/Eigen/CXX11/Tensor"
...
namespace tensorflow {

// Helper to define Tensor types given that the scalar is of type T.
template <typename T, int NDIMS = 1, typename IndexType = Eigen::DenseIndex>
struct TTypes {
  // Rank-<NDIMS> tensor of scalar type T.
  typedef Eigen::TensorMap<Eigen::Tensor<T, NDIMS, Eigen::RowMajor, IndexType>,
                           Eigen::Aligned>
      Tensor;
  typedef Eigen::TensorMap<
      Eigen::Tensor<const T, NDIMS, Eigen::RowMajor, IndexType>, Eigen::Aligned>
      ConstTensor;
  ...
  // Rank-1 tensor (vector) of scalar type T.
  typedef Eigen::TensorMap<Eigen::Tensor<T, 1, Eigen::RowMajor, IndexType>,
                           Eigen::Aligned>
      Flat;
  typedef Eigen::TensorMap<
      Eigen::Tensor<const T, 1, Eigen::RowMajor, IndexType>, Eigen::Aligned>
      ConstFlat;
  ...
};
...

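The TTypes struct defines Tensor, ConstTensor, Flat and a number of other member types. In the tensorflow::Tensor class defined in tensorflow/core/framework/tensor.h, many member functions return or operate on values of these types (TTypes::Tensor, TTypes::ConstTensor and so on); that code is not reproduced here. The more interesting questions are: how do we get an Eigen::Tensor out of a tensorflow::Tensor, and from there the raw pointer to the underlying data array? And how do operations performed through that pointer make their way back to the original tensorflow::Tensor?

For getting an Eigen::Tensor out of a tensorflow::Tensor, the answer is in the comments of tensorflow/core/framework/tensor.h: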

  /// Return the tensor data as an `Eigen::Tensor` of the data type and a
  /// specified shape.
  ///
  /// These methods allow you to access the data with the dimensions
  /// and sizes of your choice.  You do not need to know the number of
  /// dimensions of the Tensor to call them.  However, they `CHECK` that
  /// the type matches and the dimensions requested creates an
  /// `Eigen::Tensor` with the same number of elements as the tensor.
  ///
  /// Example:
  ///
  /// ```c++
  ///
  ///     typedef float T;
  ///     Tensor my_ten(...built with Shape{planes: 4, rows: 3, cols: 5}...);
  ///     // 1D Eigen::Tensor, size 60:
  ///     auto flat = my_ten.flat<T>();
  ///     // 2D Eigen::Tensor 12 x 5:
  ///     auto inner = my_ten.flat_inner_dims<T>();
  ///     // 2D Eigen::Tensor 4 x 15:
  ///     auto outer = my_ten.shaped<T, 2>({4, 15});
  ///     // CHECK fails, bad num elements:
  ///     auto outer = my_ten.shaped<T, 2>({4, 8});
  ///     // 3D Eigen::Tensor 6 x 5 x 2:
  ///     auto weird = my_ten.shaped<T, 3>({6, 5, 2});
  ///     // CHECK fails, type mismatch:
  ///     auto bad   = my_ten.flat<int32>();
  ///
  /// ```
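The shapes in that comment all describe the same 60 row-major elements; only the way indices are folded into an offset differs. The following sketch reproduces the arithmetic without TensorFlow or Eigen. NumElements, SameElementCount and RowMajorOffset are helper names made up for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Number of elements implied by a shape.
std::size_t NumElements(const std::vector<std::size_t>& shape) {
  std::size_t n = 1;
  for (std::size_t d : shape) n *= d;
  return n;
}

// A reinterpretation such as shaped<T, 2>({4, 15}) is only legal when the
// element counts agree; this is the condition the CHECK enforces.
bool SameElementCount(const std::vector<std::size_t>& a,
                      const std::vector<std::size_t>& b) {
  return NumElements(a) == NumElements(b);
}

// Row-major offset of a multi-index within a shape
// (index.size() must equal shape.size()).
std::size_t RowMajorOffset(const std::vector<std::size_t>& shape,
                           const std::vector<std::size_t>& index) {
  std::size_t offset = 0;
  for (std::size_t i = 0; i < shape.size(); ++i) {
    offset = offset * shape[i] + index[i];
  }
  return offset;
}
```

Element (2, 1, 3) of the 4 x 3 x 5 tensor sits at offset 38. The same offset is reached as (7, 3) in the 12 x 5 view and as (2, 8) in the 4 x 15 view, which is why these views need no copy.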

To go one step further and obtain the raw pointer to the data array together with its length, see this code from LaunchConv2DOp<GPUDevice, T>::operator():

    // tensorflow::Tensor input
    auto a_ptr = AsDeviceMemory(input.template flat<T>().data(),
                                input.template flat<T>().size());

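Because flat<T>() is an Eigen::TensorMap, a view over the tensor's buffer rather than a copy, anything written through that pointer is reflected directly in the original tensorflow::Tensor.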
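The write-through behaviour is the ordinary behaviour of a non-owning (pointer, length) pair. A minimal sketch, using made-up types in place of the TensorFlow ones: OwnedBuffer stands in for the tensor that owns its memory, and RawView for what data() and size() hand out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a tensor that owns its buffer.
struct OwnedBuffer {
  std::vector<float> storage;
};

// Non-owning pointer and length; copying it never copies the data.
struct RawView {
  float* data;
  std::size_t size;
};

RawView AsRawView(OwnedBuffer& t) {
  return RawView{t.storage.data(), t.storage.size()};
}

// Touches the data only through the raw pointer.
void ScaleInPlace(RawView v, float factor) {
  for (std::size_t i = 0; i < v.size; ++i) v.data[i] *= factor;
}
```

After ScaleInPlace(AsRawView(t), 2.0f), reading t.storage shows the doubled values, even though the function never saw the OwnedBuffer itself.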

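3. Supplementary notes

3.1 Understanding padding

Convolution has a drawback: values in the middle of the input matrix take part in many operations while values near the edges take part in few, and the output matrix is smaller than the input. Padding addresses this with a simple idea: add extra rows and columns of data around the border.

So as not to affect the result, the padded values are normally 0; the only question is how many rows and columns to add. Two padding modes are common: valid and same. The former adds no padding; the latter pads so that the output has the same size as the input.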

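Let the input matrix be $n \times n$ (it is usually square), the filter $f \times f$, and p the number of rows and columns padded on each side. With stride 1 the output has size $(n+2p-f+1) \times (n+2p-f+1)$.
For same padding this must equal $n \times n$, so $n = n + 2p - f + 1$, which gives $p = \frac{f-1}{2}$.
When f is odd (the usual case), p is given by this formula.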
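The two formulas can be checked with a few lines of code. This sketch covers only what the text covers: stride 1, symmetric padding, and odd f. The function names are illustrative.

```cpp
#include <cassert>

// Output length along one dimension for input length n, filter length f and
// p padded rows/columns on each side, with stride 1.
int OutputSize(int n, int f, int p) { return n + 2 * p - f + 1; }

// Per-side padding that keeps the output length equal to n.
// Valid for odd f and stride 1 only.
int SamePadding(int f) { return (f - 1) / 2; }
```

For a 3x3 filter on a 224x224 input, valid padding (p = 0) shrinks the output to 222x222, while same padding uses p = 1 and keeps it at 224x224.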

For details, see: https://blog.csdn.net/qq_30979017/article/details/79407720

3.2 Eigen::Tensor documentation

https://bitbucket.org/eigen/eigen/src/default/unsupported/Eigen/CXX11/src/Tensor/README.md

3.3 cuDNN API documentation

https://docs.nvidia.com/deeplearning/sdk/cudnn-developer-guide/index.html
