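TensorFlow's basic compute functions, known as kernels, are all defined under the tensorflow/core/kernels directory. Convolution is one of the most common hotspot operations in deep neural networks, so it is defined and implemented there as well; the main file involved is conv_ops.cc. This article is based on the TensorFlow r1.5 code and attempts a first analysis of its convolution operation, to give a glimpse of how it works.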
1. A brief look at the 2D convolution code structure
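2D convolution is a critical hotspot function in convolutional networks. We used Intel VTune to profile inference on a VGG16 example (CPU only). The program's total CPU time was 43.576 s, of which the convolution operation tensorflow::Conv2DOp::operator() took 37.172 s, or 85.3%, making it unmistakably the number one hotspot.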
In the file tensorflow/core/kernels/conv_ops.cc, the main skeleton of Conv2DOp is as follows:
// File: tensorflow/core/kernels/conv_ops.cc
template <typename Device, typename T>
class Conv2DOp : public BinaryOp<T> {
public:
explicit Conv2DOp(OpKernelConstruction* context) : BinaryOp<T>(context) {
...
}
...
void Compute(OpKernelContext* context) override {
...
if (LaunchDeepConvOp<Device, T>::Run(
context, input, filter, dimensions.batch, dimensions.input_rows,
dimensions.input_cols, dimensions.in_depth, dimensions.filter_rows,
dimensions.filter_cols, dimensions.pad_rows, dimensions.pad_cols,
dimensions.out_rows, dimensions.out_cols, dimensions.out_depth,
dimensions.dilation_rows, dimensions.dilation_cols,
dimensions.stride_rows, dimensions.stride_cols, output,
params_.data_format)) {
return;
}
launcher_(context, use_cudnn_, cudnn_use_autotune_, input, filter,
dimensions.dilation_rows, dimensions.dilation_cols,
dimensions.stride_rows, dimensions.stride_cols, params_.padding,
output, params_.data_format);
}
private:
Conv2DParameters params_;
bool use_cudnn_;
bool cudnn_use_autotune_;
LaunchConv2DOp<Device, T> launcher_;
TF_DISALLOW_COPY_AND_ASSIGN(Conv2DOp);
...
};
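Whether LaunchDeepConvOp::Run() performs the computation depends on the convolution parameters; when it does not, the launcher_() call that follows is used. That is what happens in the VGG16 example analyzed above, which goes straight to launcher_(). The real entry points are therefore void LaunchConv2DOp<CPUDevice, T>::operator() and void LaunchConv2DOp<GPUDevice, T>::operator().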
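In the CPU version, void LaunchConv2DOp<CPUDevice, T>::operator(), once the parameter validity checks pass, void LaunchGeneric::operator() is called directly. That function in turn selects, based on several conditions, one of two compute kernels: functor::MatMulConvFunctor<Device, T>() or functor::SpatialConvolution<Device, T>().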
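To see why a matrix-multiply kernel can stand in for a convolution at all, consider the 1x1 filter with stride 1, the same case the GPU path below hands to cuBLAS. The following is a minimal sketch, not TensorFlow code: the function names MatMul, Conv1x1Direct and Conv1x1AsMatMul are hypothetical, and the NHWC input layout and [in_depth, out_depth] filter layout are assumptions made for the illustration. It shows that the convolution reduces to a single [batch*rows*cols, in_depth] x [in_depth, out_depth] product.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major matrix multiply: returns C[m x n] = A[m x k] * B[k x n].
std::vector<float> MatMul(const std::vector<float>& a,
                          const std::vector<float>& b,
                          std::size_t m, std::size_t k, std::size_t n) {
  assert(a.size() == m * k);
  assert(b.size() == k * n);
  std::vector<float> c(m * n, 0.0f);
  for (std::size_t i = 0; i < m; ++i) {
    for (std::size_t j = 0; j < n; ++j) {
      float acc = 0.0f;
      for (std::size_t p = 0; p < k; ++p) {
        acc += a[i * k + p] * b[p * n + j];
      }
      c[i * n + j] = acc;
    }
  }
  return c;
}

// Direct 1x1 convolution with stride 1 (hypothetical helper).
// input:  NHWC layout, shape [batch, rows, cols, in_depth]
// filter: shape [1, 1, in_depth, out_depth], stored as [in_depth, out_depth]
std::vector<float> Conv1x1Direct(const std::vector<float>& input,
                                 const std::vector<float>& filter,
                                 std::size_t batch, std::size_t rows,
                                 std::size_t cols, std::size_t in_depth,
                                 std::size_t out_depth) {
  assert(input.size() == batch * rows * cols * in_depth);
  assert(filter.size() == in_depth * out_depth);
  std::vector<float> output(batch * rows * cols * out_depth, 0.0f);
  for (std::size_t b = 0; b < batch; ++b) {
    for (std::size_t r = 0; r < rows; ++r) {
      for (std::size_t c = 0; c < cols; ++c) {
        // Each pixel is an independent dot product over the input channels.
        const std::size_t pixel = (b * rows + r) * cols + c;
        for (std::size_t o = 0; o < out_depth; ++o) {
          float acc = 0.0f;
          for (std::size_t i = 0; i < in_depth; ++i) {
            acc += input[pixel * in_depth + i] * filter[i * out_depth + o];
          }
          output[pixel * out_depth + o] = acc;
        }
      }
    }
  }
  return output;
}

// The same result as one matrix product: every pixel becomes a row of A.
std::vector<float> Conv1x1AsMatMul(const std::vector<float>& input,
                                   const std::vector<float>& filter,
                                   std::size_t batch, std::size_t rows,
                                   std::size_t cols, std::size_t in_depth,
                                   std::size_t out_depth) {
  return MatMul(input, filter, batch * rows * cols, in_depth, out_depth);
}
```

Because no spatial neighborhood is involved, the NHWC buffer can be passed to the multiply unchanged, with no data rearrangement.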
The GPU version, void LaunchConv2DOp<GPUDevice, T>::operator(), is somewhat more complex. Structurally, it looks like this:
// File: tensorflow/core/kernels/conv_ops.cc
template <typename T>
void LaunchConv2DOp<GPUDevice, T>::operator()(
OpKernelContext* ctx, bool use_cudnn, bool cudnn_use_autotune,
const Tensor& input_param, const Tensor& filter, int row_dilation,
int col_dilation, int row_stride, int col_stride, const Padding& padding,
Tensor* output, TensorFormat data_format) {
...
if (/* 1x1 filter */) {  // Call cuBLAS directly.
bool blas_launch_status =
stream
->ThenBlasGemm(no_transpose, no_transpose, n, m, k, 1.0f, b_ptr, n,
a_ptr, k, 0.0f, &c_ptr, n)
.ok();
...
return;
} else if (/* the input data and filter have the same height/width */) {
// call cublas directly
bool blas_launch_status =
stream
->ThenBlasGemm(no_transpose, no_transpose, n, m, k, 1.0f, b_ptr, n,
a_ptr, k, 0.0f, &c_ptr, n)
.ok();
...
return;
}
...
if (padding == SAME) {
// handle padding
...
}
if (data_format == FORMAT_NHWC) {
// Convert the input tensor from NHWC to NCHW.
...
}
// Set parameters of cudnn
...
// Transform filter, allocate output tensor memory
...
// Set more parameters
...
AlgorithmConfig algorithm_config;
if (cudnn_use_autotune &&
!AutoTuneConv::GetInstance()->Find(conv_parameters, &algorithm_config)) {
// Try to get the best algorithm
...
}
...
bool cudnn_launch_status =
stream
->ThenConvolveWithAlgorithm(input_desc, input_ptr, filter_desc,
filter_ptr, conv_desc, output_desc,
&output_ptr, &scratch_allocator,
algorithm_config, nullptr)
.ok();
...
if (data_format == FORMAT_NHWC) {
// Convert the output tensor back from NCHW to NHWC.
...
}
...
}
When it performs the computation, ThenConvolveWithAlgorithm() eventually calls port::Status CudnnSupport::DoConvolveImpl() in tensorflow/stream_executor/cuda/cuda_dnn.cc:
// File: tensorflow/stream_executor/cuda/cuda_dnn.cc
template <class T>
port::Status CudnnSupport::DoConvolveImpl(
Stream* stream, const dnn::BatchDescriptor& input_descriptor,
const DeviceMemory<T>& input_data,
const dnn::FilterDescriptor& filter_descriptor,
const DeviceMemory<T>& filter_data,
const dnn::ConvolutionDescriptor& convolution_descriptor,
const dnn::BatchDescriptor& output_descriptor, DeviceMemory<T>* output_data,
ScratchAllocator* scratch_allocator,
const dnn::AlgorithmConfig& algorithm_config,
dnn::ProfileResult* output_profile_result) {
...
RETURN_IF_CUDNN_ERROR(cudnnConvolutionForward(
cudnn.handle(),
/*alpha=*/alpha, /*srcDesc=*/input_nd.handle(),
/*srcData=*/input_data.opaque(), /*filterDesc=*/filter.handle(),
/*filterData=*/filter_data.opaque(), /*convDesc=*/conv.handle(),
/*algo=*/ToConvForwardAlgo(algo_desc), /*workSpace=*/scratch.opaque(),
/*workSpaceSizeInBytes=*/scratch.size(), /*beta=*/beta,
/*yDesc=*/output_nd.handle(), /*y=*/output_data->opaque()));
...
}
And cudnnConvolutionForward is precisely the forward convolution interface of cuDNN.
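The GPU path above converts an NHWC input to NCHW before calling cuDNN and converts the output back afterwards. As a rough illustration of what that layout change means for the underlying buffer, here is a CPU-side sketch; the functions NhwcToNchw and NchwToNhwc are hypothetical names introduced for this example and are not the TensorFlow conversion functors.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reorders a dense buffer from NHWC to NCHW (hypothetical helper).
std::vector<float> NhwcToNchw(const std::vector<float>& src, std::size_t n,
                              std::size_t h, std::size_t w, std::size_t c) {
  assert(src.size() == n * h * w * c);
  std::vector<float> dst(src.size());
  for (std::size_t in = 0; in < n; ++in) {
    for (std::size_t ih = 0; ih < h; ++ih) {
      for (std::size_t iw = 0; iw < w; ++iw) {
        for (std::size_t ic = 0; ic < c; ++ic) {
          const std::size_t src_idx = ((in * h + ih) * w + iw) * c + ic;
          const std::size_t dst_idx = ((in * c + ic) * h + ih) * w + iw;
          dst[dst_idx] = src[src_idx];
        }
      }
    }
  }
  return dst;
}

// Inverse reordering, from NCHW back to NHWC (hypothetical helper).
std::vector<float> NchwToNhwc(const std::vector<float>& src, std::size_t n,
                              std::size_t h, std::size_t w, std::size_t c) {
  assert(src.size() == n * h * w * c);
  std::vector<float> dst(src.size());
  for (std::size_t in = 0; in < n; ++in) {
    for (std::size_t ic = 0; ic < c; ++ic) {
      for (std::size_t ih = 0; ih < h; ++ih) {
        for (std::size_t iw = 0; iw < w; ++iw) {
          const std::size_t src_idx = ((in * c + ic) * h + ih) * w + iw;
          const std::size_t dst_idx = ((in * h + ih) * w + iw) * c + ic;
          dst[dst_idx] = src[src_idx];
        }
      }
    }
  }
  return dst;
}
```

In NHWC the channels of one pixel are contiguous; in NCHW each channel forms a contiguous plane. The two conversions are exact inverses of each other, so the round trip only costs time and memory traffic.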
The overall call flow of LaunchConv2DOp<GPUDevice, T>::operator() is sketched in the figure below:
2. tensorflow::Tensor and Eigen::Tensor
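tensorflow::Tensor is the core data structure in TensorFlow. It is built on top of Eigen::Tensor and is a further wrapper around it.
tensorflow::Tensor is mainly defined in tensorflow/core/framework/tensor.h, while its connection to Eigen::Tensor is mainly made in tensorflow/core/framework/tensor_types.h: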
// File: tensorflow/core/framework/tensor_types.h
...
#include "third_party/eigen3/unsupported/Eigen/CXX11/Tensor"
...
namespace tensorflow {
// Helper to define Tensor types given that the scalar is of type T.
template <typename T, int NDIMS = 1, typename IndexType = Eigen::DenseIndex>
struct TTypes {
// Rank-<NDIMS> tensor of scalar type T.
typedef Eigen::TensorMap<Eigen::Tensor<T, NDIMS, Eigen::RowMajor, IndexType>,
Eigen::Aligned>
Tensor;
typedef Eigen::TensorMap<
Eigen::Tensor<const T, NDIMS, Eigen::RowMajor, IndexType>, Eigen::Aligned>
ConstTensor;
...
// Rank-1 tensor (vector) of scalar type T.
typedef Eigen::TensorMap<Eigen::Tensor<T, 1, Eigen::RowMajor, IndexType>,
Eigen::Aligned>
Flat;
typedef Eigen::TensorMap<
Eigen::Tensor<const T, 1, Eigen::RowMajor, IndexType>, Eigen::Aligned>
ConstFlat;
...
};
...
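The struct TTypes defines Tensor, ConstTensor, Flat and a number of other member types. In the class tensorflow::Tensor defined in tensorflow/core/framework/tensor.h, many member functions return or operate on values of the types TTypes::Tensor, TTypes::ConstTensor and so on; the code is not reproduced here. The questions we care about more are: how do we get an Eigen::Tensor out of a tensorflow::Tensor, and from there the underlying void* array pointer? And how do operations on that pointer propagate back to the original tensorflow::Tensor?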
As for getting an Eigen::Tensor from a tensorflow::Tensor, the answer is in the comments of tensorflow/core/framework/tensor.h:
/// Return the tensor data as an `Eigen::Tensor` of the data type and a
/// specified shape.
///
/// These methods allow you to access the data with the dimensions
/// and sizes of your choice. You do not need to know the number of
/// dimensions of the Tensor to call them. However, they `CHECK` that
/// the type matches and the dimensions requested creates an
/// `Eigen::Tensor` with the same number of elements as the tensor.
///
/// Example:
///
/// ```c++
///
///     typedef float T;
///     Tensor my_ten(...built with Shape{planes: 4, rows: 3, cols: 5}...);
///     // 1D Eigen::Tensor, size 60:
///     auto flat = my_ten.flat<T>();
///     // 2D Eigen::Tensor 12 x 5:
///     auto inner = my_ten.flat_inner_dims<T>();
///     // 2D Eigen::Tensor 4 x 15:
///     auto outer = my_ten.shaped<T, 2>({4, 15});
///     // CHECK fails, bad num elements:
///     auto outer = my_ten.shaped<T, 2>({4, 8});
///     // 3D Eigen::Tensor 6 x 5 x 2:
///     auto weird = my_ten.shaped<T, 3>({6, 5, 2});
///     // CHECK fails, type mismatch:
///     auto bad = my_ten.flat<int32>();
///
/// ```
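The shape arithmetic behind that comment can be checked without TensorFlow. The sketch below uses hypothetical helpers (NumElements, FlatInnerDims, CanReshape) that only mimic the dimension bookkeeping described in the comment: collapsing all but the last dimension, and requiring the requested shape to hold the same number of elements. They are an illustration under those assumptions, not the library's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Total number of elements described by a shape.
std::int64_t NumElements(const std::vector<std::int64_t>& dims) {
  std::int64_t count = 1;
  for (std::int64_t d : dims) {
    count *= d;
  }
  return count;
}

// Collapses every dimension except the last into one, which is the 2D
// result the comment attributes to flat_inner_dims<T>().
std::vector<std::int64_t> FlatInnerDims(const std::vector<std::int64_t>& dims) {
  assert(!dims.empty());
  std::int64_t outer = 1;
  for (std::size_t i = 0; i + 1 < dims.size(); ++i) {
    outer *= dims[i];
  }
  return {outer, dims.back()};
}

// A reinterpretation is only valid when the element counts agree; this is
// the condition the comment says shaped<T, N>() CHECKs.
bool CanReshape(const std::vector<std::int64_t>& from,
                const std::vector<std::int64_t>& to) {
  return NumElements(from) == NumElements(to);
}
```

For the {4, 3, 5} tensor in the comment, this gives 60 elements, a 12 x 5 inner-dims view, accepts {4, 15} and {6, 5, 2}, and rejects {4, 8}.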
To go one step further and obtain the void * data array pointer and its length, see this code in LaunchConv2DOp<GPUDevice, T>::operator():
// tensorflow::Tensor input
auto a_ptr = AsDeviceMemory(input.template flat<T>().data(),
input.template flat<T>().size());
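Any write to the memory this pointer refers to is reflected directly in the original tensorflow::Tensor, because the pointer addresses the tensor's own buffer rather than a copy.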
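The aliasing behavior can be illustrated with a small stand-in that does not depend on Eigen or TensorFlow. In this sketch, Buffer and MatrixView are hypothetical types invented for the example: Buffer plays the role of the owning tensor, and MatrixView plays the role of a map that holds only a pointer and dimensions. It is an analogy under those assumptions, not how either library is implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical non-owning, row-major 2D view over memory owned elsewhere.
struct MatrixView {
  float* data;
  std::size_t rows;
  std::size_t cols;

  // Element access writes straight into the owner's storage.
  float& operator()(std::size_t r, std::size_t c) const {
    assert(r < rows && c < cols);
    return data[r * cols + c];
  }
};

// Hypothetical owner of the storage, standing in for the tensor object.
struct Buffer {
  std::vector<float> storage;

  explicit Buffer(std::size_t n) : storage(n, 0.0f) {}

  // Hands out a view with the requested shape; no data is copied.
  MatrixView matrix(std::size_t rows, std::size_t cols) {
    assert(rows * cols == storage.size());
    return MatrixView{storage.data(), rows, cols};
  }
};
```

Writing through the view, or through the raw pointer taken from it, changes the owner's storage, since all three refer to the same memory.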
3. Background notes
3.1 Understanding padding
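Convolution has a drawback: elements in the middle of the input matrix take part in many operations, elements at the edges take part in few, and the matrix shrinks after convolution. Padding exists to address this, and the idea is simple: artificially add (pad) extra rows and columns of data around the border.
So as not to affect the result, the padded values are usually 0; the only question is how many rows and columns to add. Two padding modes are common: valid and same. The former adds no padding; the latter pads so that the output has the same size as the input.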
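Let the input matrix be of size $n \times n$ (usually square), the filter of size $f \times f$, and $p$ the number of rows and columns padded on each side. With stride 1 the output size is $(n+2p-f+1) \times (n+2p-f+1)$, so same padding requires
$n \times n = (n+2p-f+1) \times (n+2p-f+1)$, that is, $n = n + 2p - f + 1$, which gives $p = \frac{f-1}{2}$.
When $f$ is odd (the usual case), $p$ is an integer and is given by the formula above.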
For details, see: https://blog.csdn.net/qq_30979017/article/details/79407720
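The formulas in this section can be checked numerically. The sketch below assumes stride 1, a square input, and symmetric padding of p on each side, as in the derivation; ConvOutputSize and SamePadding are hypothetical helper names, not TensorFlow functions.

```cpp
#include <cassert>

// Output width/height for an n x n input, f x f filter, padding p per side,
// stride 1: n + 2p - f + 1.
int ConvOutputSize(int n, int f, int p) {
  assert(n > 0 && f > 0 && p >= 0);
  assert(n + 2 * p >= f);  // The filter must fit inside the padded input.
  return n + 2 * p - f + 1;
}

// Padding per side that keeps the output the same size as the input,
// valid for odd f and stride 1: p = (f - 1) / 2.
int SamePadding(int f) {
  assert(f > 0 && f % 2 == 1);
  return (f - 1) / 2;
}
```

For example, a 224 x 224 input with a 3 x 3 filter shrinks to 222 x 222 without padding, while padding by SamePadding(3) = 1 on each side keeps it at 224 x 224.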
3.2 Eigen::Tensor documentation
https://bitbucket.org/eigen/eigen/src/default/unsupported/Eigen/CXX11/src/Tensor/README.md
3.3 cuDNN API documentation
https://docs.nvidia.com/deeplearning/sdk/cudnn-developer-guide/index.html