A Brief Analysis of 2D Convolution Code in TensorFlow

TensorFlow's fundamental compute functions, or kernels, are all defined under the tensorflow/core/kernels directory. Convolution, one of the most common hotspot operations in deep neural networks, is naturally defined and implemented there as well; the main file involved is conv_ops.cc. Based on the TensorFlow r1.5 code, this article attempts an initial analysis of the convolution operation, as a window into the larger codebase.

1. The structure of the 2D convolution code

2D convolution is a critical hotspot function in convolutional networks. Profiling VGG16 inference (CPU-only) with Intel VTune gives a total CPU time of 43.576s, of which the convolution operator tensorflow::Conv2DOp::operator() takes 37.172s, or 85.3%, making it the undisputed number-one hotspot.


In the file tensorflow/core/kernels/conv_ops.cc, the main code skeleton of Conv2DOp is as follows:

// File: tensorflow/core/kernels/conv_ops.cc

template <typename Device, typename T>
class Conv2DOp : public BinaryOp<T> {
 public:
  explicit Conv2DOp(OpKernelConstruction* context) : BinaryOp<T>(context) {
    ...
  }
...
  void Compute(OpKernelContext* context) override {
    ...
    if (LaunchDeepConvOp<Device, T>::Run(
            context, input, filter, dimensions.batch, dimensions.input_rows,
            dimensions.input_cols, dimensions.in_depth, dimensions.filter_rows,
            dimensions.filter_cols, dimensions.pad_rows, dimensions.pad_cols,
            dimensions.out_rows, dimensions.out_cols, dimensions.out_depth,
            dimensions.dilation_rows, dimensions.dilation_cols,
            dimensions.stride_rows, dimensions.stride_cols, output,
            params_.data_format)) {
      return;
    }

    launcher_(context, use_cudnn_, cudnn_use_autotune_, input, filter,
              dimensions.dilation_rows, dimensions.dilation_cols,
              dimensions.stride_rows, dimensions.stride_cols, params_.padding,
              output, params_.data_format);
  }

 private:
  Conv2DParameters params_;
  bool use_cudnn_;
  bool cudnn_use_autotune_;

  LaunchConv2DOp<Device, T> launcher_;

  TF_DISALLOW_COPY_AND_ASSIGN(Conv2DOp);
...
};

Whether LaunchDeepConvOp::Run() actually performs the computation depends on the convolution parameters; when it declines, control falls through to launcher_(). That is what happens in the VGG16 example profiled above: launcher_() is invoked directly. The real entry points, then, are void LaunchConv2DOp<CPUDevice, T>::operator() and void LaunchConv2DOp<GPUDevice, T>::operator().
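For reference, the generic LaunchDeepConvOp template is essentially a stub that always declines; as of r1.5, only a CPUDevice/float specialization (guarded by CanUseDeepConv2D()) overrides it for shapes where the deep-convolution path pays off. The following is a paraphrased sketch, not a verbatim quote; the parameter list mirrors the Run() call shown above:

// Paraphrased from tensorflow/core/kernels/conv_ops.cc (r1.5), not verbatim.
template <typename Device, typename T>
struct LaunchDeepConvOp {
  static bool Run(OpKernelContext* ctx, const Tensor& input,
                  const Tensor& filter, int batch, int input_rows,
                  int input_cols, int in_depth, int filter_rows,
                  int filter_cols, int pad_rows, int pad_cols, int out_rows,
                  int out_cols, int out_depth, int dilation_rows,
                  int dilation_cols, int stride_rows, int stride_cols,
                  Tensor* output, TensorFormat data_format) {
    return false;  // generic case: decline, so Compute() falls through to launcher_
  }
};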

In the CPU version, void LaunchConv2DOp<CPUDevice, T>::operator(), once the parameter checks pass it calls void LaunchGeneric::operator() directly. That function then dispatches to one of two compute kernels: functor::MatMulConvFunctor<Device, T>() when the convolution reduces to a single matrix multiplication (a 1x1 filter with unit strides, or a filter that exactly covers the input under VALID padding), and functor::SpatialConvolution<Device, T>() for the general case.
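To see why the 1x1 case collapses into a single GEMM, consider a minimal standalone sketch in plain Eigen (independent of TensorFlow; the shapes are arbitrary):

// Hypothetical standalone example, not part of the TensorFlow sources.
#include <Eigen/Dense>
#include <iostream>

int main() {
  // NHWC input flattened to [batch*H*W, in_depth] = [6, 3]; a 1x1 filter is
  // nothing but an [in_depth, out_depth] = [3, 4] matrix.
  Eigen::MatrixXf pixels = Eigen::MatrixXf::Random(6, 3);
  Eigen::MatrixXf filter = Eigen::MatrixXf::Random(3, 4);

  // With a 1x1 window and stride 1, each output pixel depends on exactly one
  // input pixel, so the whole convolution is one matrix multiplication:
  Eigen::MatrixXf out = pixels * filter;  // [batch*H*W, out_depth]
  std::cout << out.rows() << " x " << out.cols() << std::endl;  // 6 x 4
  return 0;
}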

The GPU version, void LaunchConv2DOp<GPUDevice, T>::operator(), is more involved. Structurally, the main flow is:

// File: tensorflow/core/kernels/conv_ops.cc

template <typename T>
void LaunchConv2DOp<GPUDevice, T>::operator()(
    OpKernelContext* ctx, bool use_cudnn, bool cudnn_use_autotune,
    const Tensor& input_param, const Tensor& filter, int row_dilation,
    int col_dilation, int row_stride, int col_stride, const Padding& padding,
    Tensor* output, TensorFormat data_format) {
    ...
    if (/* 1x1 filter */) {  // call cuBLAS directly
        bool blas_launch_status =
            stream
                ->ThenBlasGemm(no_transpose, no_transpose, n, m, k, 1.0f, b_ptr,
                               n, a_ptr, k, 0.0f, &c_ptr, n)
                .ok();
        ...
        return;
    } else if (/* the input data and filter have the same height/width */) {
        // call cuBLAS directly
        bool blas_launch_status =
            stream
                ->ThenBlasGemm(no_transpose, no_transpose, n, m, k, 1.0f, b_ptr,
                               n, a_ptr, k, 0.0f, &c_ptr, n)
                .ok();
        ...
        return;
    }
    ...
    if (padding == SAME) {
        // handle padding
        ...
    }
    if (data_format == FORMAT_NHWC) {
        // Convert the input tensor from NHWC to NCHW.
        ...
    }
    // Set parameters of cudnn
    ...
    // Transform filter, allocate output tensor memory
    ...
    // Set more parameters
    ...
    AlgorithmConfig algorithm_config;
    if (cudnn_use_autotune &&
      !AutoTuneConv::GetInstance()->Find(conv_parameters, &algorithm_config)) {
        // Try to get the best algorithm
        ...
    }
    ...
    bool cudnn_launch_status =
        stream
            ->ThenConvolveWithAlgorithm(input_desc, input_ptr, filter_desc,
                                        filter_ptr, conv_desc, output_desc,
                                        &output_ptr, &scratch_allocator,
                                        algorithm_config, nullptr)
            .ok();
    ...
    if (data_format == FORMAT_NHWC) {
        // Convert the output tensor back from NCHW to NHWC.
        ...
    }
    ...
}

When ThenConvolveWithAlgorithm() performs the actual computation, it ultimately calls port::Status CudnnSupport::DoConvolveImpl() in tensorflow/stream_executor/cuda/cuda_dnn.cc:

// File: tensorflow/stream_executor/cuda/cuda_dnn.cc

template <class T>
port::Status CudnnSupport::DoConvolveImpl(
    Stream* stream, const dnn::BatchDescriptor& input_descriptor,
    const DeviceMemory<T>& input_data,
    const dnn::FilterDescriptor& filter_descriptor,
    const DeviceMemory<T>& filter_data,
    const dnn::ConvolutionDescriptor& convolution_descriptor,
    const dnn::BatchDescriptor& output_descriptor, DeviceMemory<T>* output_data,
    ScratchAllocator* scratch_allocator,
    const dnn::AlgorithmConfig& algorithm_config,
    dnn::ProfileResult* output_profile_result) {
  ...
  RETURN_IF_CUDNN_ERROR(cudnnConvolutionForward(
      cudnn.handle(),
      /*alpha=*/alpha, /*srcDesc=*/input_nd.handle(),
      /*srcData=*/input_data.opaque(), /*filterDesc=*/filter.handle(),
      /*filterData=*/filter_data.opaque(), /*convDesc=*/conv.handle(),
      /*algo=*/ToConvForwardAlgo(algo_desc), /*workSpace=*/scratch.opaque(),
      /*workSpaceSizeInBytes=*/scratch.size(), /*beta=*/beta,
      /*yDesc=*/output_nd.handle(), /*y=*/output_data->opaque()));
  ...
}

cudnnConvolutionForward is exactly cuDNN's forward-convolution interface.
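For comparison, here is a minimal standalone program that drives cudnnConvolutionForward directly, without the stream-executor wrappers. It is a sketch under assumptions: cuDNN 6+ (for this cudnnSetConvolution2dDescriptor signature), CUDA error handling skipped, and cleanup elided. Build along the lines of g++ example.cc -lcudnn -lcudart with CUDA include/lib paths set.

// Hypothetical standalone example, not part of the TensorFlow sources.
#include <cudnn.h>
#include <cuda_runtime.h>
#include <cstdio>

#define CHECK_CUDNN(call)                                        \
  do {                                                           \
    cudnnStatus_t s = (call);                                    \
    if (s != CUDNN_STATUS_SUCCESS) {                             \
      std::printf("cuDNN error: %s\n", cudnnGetErrorString(s));  \
      return 1;                                                  \
    }                                                            \
  } while (0)

int main() {
  // One NCHW float image: batch 1, 3 channels, 8x8; four 3x3 filters.
  const int N = 1, C = 3, H = 8, W = 8, K = 4, FH = 3, FW = 3;
  cudnnHandle_t cudnn;
  CHECK_CUDNN(cudnnCreate(&cudnn));

  cudnnTensorDescriptor_t x_desc, y_desc;
  cudnnFilterDescriptor_t w_desc;
  cudnnConvolutionDescriptor_t conv_desc;
  CHECK_CUDNN(cudnnCreateTensorDescriptor(&x_desc));
  CHECK_CUDNN(cudnnCreateTensorDescriptor(&y_desc));
  CHECK_CUDNN(cudnnCreateFilterDescriptor(&w_desc));
  CHECK_CUDNN(cudnnCreateConvolutionDescriptor(&conv_desc));
  CHECK_CUDNN(cudnnSetTensor4dDescriptor(x_desc, CUDNN_TENSOR_NCHW,
                                         CUDNN_DATA_FLOAT, N, C, H, W));
  CHECK_CUDNN(cudnnSetFilter4dDescriptor(w_desc, CUDNN_DATA_FLOAT,
                                         CUDNN_TENSOR_NCHW, K, C, FH, FW));
  // pad 1, stride 1, dilation 1: "same" output size for a 3x3 filter.
  CHECK_CUDNN(cudnnSetConvolution2dDescriptor(conv_desc, 1, 1, 1, 1, 1, 1,
                                              CUDNN_CROSS_CORRELATION,
                                              CUDNN_DATA_FLOAT));

  int on, oc, oh, ow;
  CHECK_CUDNN(cudnnGetConvolution2dForwardOutputDim(conv_desc, x_desc, w_desc,
                                                    &on, &oc, &oh, &ow));
  CHECK_CUDNN(cudnnSetTensor4dDescriptor(y_desc, CUDNN_TENSOR_NCHW,
                                         CUDNN_DATA_FLOAT, on, oc, oh, ow));

  float *d_x, *d_w, *d_y;
  cudaMalloc(&d_x, sizeof(float) * N * C * H * W);
  cudaMalloc(&d_w, sizeof(float) * K * C * FH * FW);
  cudaMalloc(&d_y, sizeof(float) * on * oc * oh * ow);
  cudaMemset(d_x, 0, sizeof(float) * N * C * H * W);
  cudaMemset(d_w, 0, sizeof(float) * K * C * FH * FW);

  // IMPLICIT_GEMM needs no workspace; TensorFlow instead autotunes the
  // algorithm choice, as seen in LaunchConv2DOp above.
  const float alpha = 1.0f, beta = 0.0f;
  CHECK_CUDNN(cudnnConvolutionForward(
      cudnn, &alpha, x_desc, d_x, w_desc, d_w, conv_desc,
      CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_GEMM,
      /*workSpace=*/nullptr, /*workSpaceSizeInBytes=*/0, &beta, y_desc, d_y));

  std::printf("output: %d x %d x %d x %d\n", on, oc, oh, ow);
  // Descriptor/memory cleanup elided for brevity.
  cudnnDestroy(cudnn);
  return 0;
}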

Putting it all together, the call chain of LaunchConv2DOp<GPUDevice, T>::operator() runs roughly: Conv2DOp::Compute() -> launcher_() -> LaunchConv2DOp<GPUDevice, T>::operator() -> Stream::ThenConvolveWithAlgorithm() -> CudnnSupport::DoConvolveImpl() -> cudnnConvolutionForward().


2. tensorflow::Tensor and Eigen::Tensor

tensorflow::Tensor is the core data structure of TensorFlow. It is built on top of Eigen::Tensor, as a further layer of encapsulation over it.
tensorflow::Tensor is defined mainly in tensorflow/core/framework/tensor.h, and its coupling to Eigen::Tensor lives mainly in tensorflow/core/framework/tensor_types.h:

// File: tensorflow/core/framework/tensor_types.h

...
#include "third_party/eigen3/unsupported/Eigen/CXX11/Tensor"
...
namespace tensorflow {

// Helper to define Tensor types given that the scalar is of type T.
template <typename T, int NDIMS = 1, typename IndexType = Eigen::DenseIndex>
struct TTypes {
  // Rank-<NDIMS> tensor of scalar type T.
  typedef Eigen::TensorMap<Eigen::Tensor<T, NDIMS, Eigen::RowMajor, IndexType>,
                           Eigen::Aligned>
      Tensor;
  typedef Eigen::TensorMap<
      Eigen::Tensor<const T, NDIMS, Eigen::RowMajor, IndexType>, Eigen::Aligned>
      ConstTensor;
  ...
  // Rank-1 tensor (vector) of scalar type T.
  typedef Eigen::TensorMap<Eigen::Tensor<T, 1, Eigen::RowMajor, IndexType>,
                           Eigen::Aligned>
      Flat;
  typedef Eigen::TensorMap<
      Eigen::Tensor<const T, 1, Eigen::RowMajor, IndexType>, Eigen::Aligned>
      ConstFlat;
  ...
};
...

The struct TTypes defines Tensor, ConstTensor, Flat, and a family of related members. In the class tensorflow::Tensor, defined in tensorflow/core/framework/tensor.h, a host of member functions return or operate on values tied to these TTypes::Tensor, TTypes::ConstTensor, etc. definitions; the code is not reproduced here. The questions we care about more are: how do we obtain the Eigen::Tensor from a tensorflow::Tensor, and further the underlying void* array pointer? And how do operations on that array pointer propagate back into the original tensorflow::Tensor structure?
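Before turning to those questions, note that every member of TTypes is an Eigen::TensorMap, i.e. a non-owning view that layers shape information over an existing buffer without copying it. A minimal standalone illustration in plain Eigen (no TensorFlow involved):

// Hypothetical standalone example, not part of the TensorFlow sources.
#include <unsupported/Eigen/CXX11/Tensor>

int main() {
  float buf[12];
  // View the raw buffer as a row-major 3x4 tensor; no data is copied.
  Eigen::TensorMap<Eigen::Tensor<float, 2, Eigen::RowMajor>> t(buf, 3, 4);
  t.setZero();
  t(1, 2) = 42.0f;  // writes straight into buf[1 * 4 + 2]
  return buf[6] == 42.0f ? 0 : 1;  // exit code 0: the write is visible in buf
}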

tensorflow::Tensor中获取Eigen::Tensor,答案就在tensorflow/core/framework/tensor.h的注释中:

  /// Return the tensor data as an `Eigen::Tensor` of the data type and a
  /// specified shape.
  ///
  /// These methods allow you to access the data with the dimensions
  /// and sizes of your choice.  You do not need to know the number of
  /// dimensions of the Tensor to call them.  However, they `CHECK` that
  /// the type matches and the dimensions requested creates an
  /// `Eigen::Tensor` with the same number of elements as the tensor.
  ///
  /// Example:
  ///
  /// ```c++
  ///     typedef float T;
  ///     Tensor my_ten(...built with Shape{planes: 4, rows: 3, cols: 5}...);
  ///     // 1D Eigen::Tensor, size 60:
  ///     auto flat = my_ten.flat<T>();
  ///     // 2D Eigen::Tensor 12 x 5:
  ///     auto inner = my_ten.flat_inner_dims<T>();
  ///     // 2D Eigen::Tensor 4 x 15:
  ///     auto outer = my_ten.shaped<T, 2>({4, 15});
  ///     // CHECK fails, bad num elements:
  ///     auto outer = my_ten.shaped<T, 2>({4, 8});
  ///     // 3D Eigen::Tensor 6 x 5 x 2:
  ///     auto weird = my_ten.shaped<T, 3>({6, 5, 2});
  ///     // CHECK fails, type mismatch:
  ///     auto bad   = my_ten.flat<int32>();
  /// ```

To go one step further and obtain the underlying void* data array pointer and its length, see this code from LaunchConv2DOp<GPUDevice, T>::operator():

    // tensorflow::Tensor input
    auto a_ptr = AsDeviceMemory(input.template flat<T>().data(),
                                input.template flat<T>().size());

Because this pointer aliases the tensor's underlying storage, operations on the memory it addresses are reflected directly in the original tensorflow::Tensor structure.
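A minimal sketch of the same round trip on the CPU side (a hypothetical standalone snippet linked against TensorFlow's C++ library; it is not taken from conv_ops.cc):

// Hypothetical standalone example, not part of the TensorFlow sources.
#include "tensorflow/core/framework/tensor.h"
#include "tensorflow/core/framework/tensor_shape.h"

int main() {
  tensorflow::Tensor t(tensorflow::DT_FLOAT, tensorflow::TensorShape({2, 3}));
  auto flat = t.flat<float>();  // Eigen::TensorMap over t's own buffer
  float* p = flat.data();       // raw pointer into the same storage
  for (int i = 0; i < flat.size(); ++i) p[i] = static_cast<float>(i);
  // Writes through p are visible through the Tensor itself:
  return t.matrix<float>()(1, 2) == 5.0f ? 0 : 1;  // exit code 0
}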


3. Supplementary notes

3.1 Understanding padding

Convolution has a shortcoming: elements in the interior of the input matrix take part in many operations while elements near the edges take part in few, and the output matrix shrinks after each convolution. Padding exists to solve exactly this, and the idea is simple: artificially add/fill some extra rows and columns around the border.

To avoid distorting the result, the filled values are normally 0; the only question is how many rows and columns to fill. There are two common padding modes, valid and same: the former fills nothing, while the latter fills just enough that the output is the same size as the input.

Let the input matrix be $n \times n$ (usually square), the filter $f \times f$, and let $p$ be the number of rows/columns to pad on each side. For same padding with stride 1, we need

$$n \times n = (n + 2p - f + 1) \times (n + 2p - f + 1),$$

i.e. $n = n + 2p - f + 1$, which gives $p = \frac{f-1}{2}$. When $f$ is odd (the usual case), $p$ is computed directly from this formula.
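As a quick numeric check of these formulas, a small self-contained program (the helper names are ours, chosen for illustration):

#include <iostream>

// Per-side padding for "same" output at stride 1, for odd f: p = (f - 1) / 2.
int same_padding(int f) { return (f - 1) / 2; }

// Output edge length for an n x n input, f x f filter, padding p, stride 1.
int output_size(int n, int f, int p) { return n + 2 * p - f + 1; }

int main() {
  const int n = 224, f = 3;
  const int p = same_padding(f);                   // p = 1
  std::cout << output_size(n, f, p) << std::endl;  // 224: same as the input
  std::cout << output_size(n, f, 0) << std::endl;  // 222: valid padding shrinks it
  return 0;
}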

For more detail, see: https://blog.csdn.net/qq_30979017/article/details/79407720

3.2 Eigen::Tensor documentation

https://bitbucket.org/eigen/eigen/src/default/unsupported/Eigen/CXX11/src/Tensor/README.md

3.3 cuDNN API documentation

https://docs.nvidia.com/deeplearning/sdk/cudnn-developer-guide/index.html
