Tracing the source code of unsorted_segment_sum


unsorted_segment_sum

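While working with TensorFlow I came across several operators that behave much like unsorted_segment_sum, so I traced each of them through the source code and am noting the results here.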

The tf.math.unsorted_segment_sum version

tf.math.unsorted_segment_sum(
    data,          # <tf.Tensor 'wide_deep/deep/mul_4:0' shape=(?, ?, 32) dtype=float32>
    segment_ids,   # <tf.Tensor 'wide_deep/deep/add_4:0' shape=(?, ?) dtype=int64>
    num_segments,  # <tf.Tensor 'wide_deep/deep/Cast_9:0' shape=() dtype=int64>
    name=None      # None
)

Parameters:

data :

A Tensor. Must be one of the following types: float32, float64, int32, uint8, int16, int8, complex64, int64, qint8, quint8, qint32, bfloat16, uint16, complex128, half, uint32, uint64.

segment_ids : the array of segment indices; its shape must be a prefix of data.shape.

A Tensor. Must be one of the following types: int32, int64. A tensor whose shape is a prefix of data.shape.

num_segments : the number of segments.

A Tensor. Must be one of the following types: int32, int64.

name :

A name for the operation (optional).

Returns: a tensor of the same type as data, with shape (num_segments,) + data.shape[k:], where k is the rank of segment_ids.

A Tensor. Has the same type as data.
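The shape rule can be written out as a small pure-Python helper. This is only an illustrative sketch: the function name unsorted_segment_sum_shape is made up for this note and is not part of TensorFlow.

```python
def unsorted_segment_sum_shape(data_shape, segment_ids_shape, num_segments):
    """Return (num_segments,) + data_shape[k:], where k = rank of segment_ids.

    Hypothetical helper for illustration; it only mirrors the documented rule.
    """
    data_shape = tuple(data_shape)
    segment_ids_shape = tuple(segment_ids_shape)
    k = len(segment_ids_shape)
    # segment_ids.shape must be a prefix of data.shape.
    if data_shape[:k] != segment_ids_shape:
        raise ValueError("segment_ids.shape must be a prefix of data.shape")
    return (num_segments,) + data_shape[k:]


print(unsorted_segment_sum_shape((3, 4), (3,), 2))         # (2, 4)
print(unsorted_segment_sum_shape((8, 5, 32), (8, 5), 10))  # (10, 32)
```

With the shapes from the call above, data of shape (?, ?, 32) and segment_ids of shape (?, ?) give an output of shape (num_segments, 32).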

Behavior

c = tf.constant([[1,2,3,4], [5,6,7,8], [4,3,2,1]])
tf.unsorted_segment_sum(c, tf.constant([0, 1, 0]), num_segments=2)
# ==> [[sum of the rows in segment 0], [sum of the rows in segment 1]]
# ==> [[c[0] + c[2]], [c[1]]]
# ==> [[ 5,  5, 5, 5], [5,  6, 7, 8]]
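The same result can be reproduced without TensorFlow. The following is a minimal pure-Python sketch of the semantics, assuming 2-D data and 1-D segment_ids; it is a reference for reading the example, not how TensorFlow computes it.

```python
def unsorted_segment_sum(data, segment_ids, num_segments):
    """Sum the rows of `data` that share a segment id (1-D segment_ids only)."""
    width = len(data[0]) if data else 0
    # Every segment starts at zero, so unused segments stay all-zero.
    out = [[0] * width for _ in range(num_segments)]
    for row, seg in zip(data, segment_ids):
        for col, value in enumerate(row):
            out[seg][col] += value
    return out


c = [[1, 2, 3, 4], [5, 6, 7, 8], [4, 3, 2, 1]]
print(unsorted_segment_sum(c, [0, 1, 0], 2))
# [[5, 5, 5, 5], [5, 6, 7, 8]]
```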

Implementation

Python side, TensorFlow 1.14.0:

tensorflow/python/ops/gen_math_ops.py(11767)unsorted_segment_sum()

gen_math_ops.py is a Python file generated at build time; it actually calls into the C++ code through _pywrap_tensorflow.TFE_Py_FastPathExecute:

tensorflow/core/ops/math_ops.cc(1252)REGISTER_OP("UnsortedSegmentSum")

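The UnsortedSegmentSum kernel is fairly complex and comes in several versions. Taking the GPU version as the example, it is first defined indirectly through REGISTER_GPU_KERNEL_UNSORTEDSEGMENT: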

tensorflow/core/kernels/segment_reduction_ops.cc(584)REGISTER_GPU_KERNEL_UNSORTEDSEGMENT("UnsortedSegmentSum", type, index_type, functor::Zero<type>, functor::SumOpGpu<type>)

The REGISTER_GPU_KERNEL_UNSORTEDSEGMENT macro ultimately uses REGISTER_KERNEL_BUILDER to register the UnsortedSegmentReductionOp class:

tensorflow/core/kernels/segment_reduction_ops.cc(467)class UnsortedSegmentReductionOp

The actual work happens in the Compute function:

tensorflow/core/kernels/segment_reduction_ops.cc(472)Compute()

REGISTER_GPU_KERNEL_UNSORTEDSEGMENT specifies functor::UnsortedSegmentFunctor as the DeviceReductionFunctor, and Compute calls it directly:

tensorflow/core/kernels/segment_reduction_ops_gpu.cu.cc(176)struct UnsortedSegmentFunctor

UnsortedSegmentFunctor launches two CUDA kernels:

The first kernel, SetToValue, sets every value of the output tensor to 0 (functor::Zero, as specified in REGISTER_GPU_KERNEL_UNSORTEDSEGMENT):

tensorflow/core/util/gpu_device_functions.h(472)SetToValue()
tensorflow/core/kernels/segment_reduction_ops.h(107)struct Zero

The second kernel, UnsortedSegmentCustomKernel, applies functor::SumOpGpu (also specified in REGISTER_GPU_KERNEL_UNSORTEDSEGMENT) to every element:

tensorflow/core/kernels/segment_reduction_ops_gpu.cu.cc(109)UnsortedSegmentCustomKernel()
tensorflow/core/kernels/segment_reduction_ops.h(72)struct SumOpGpu

In effect, it simply calls CudaAtomicAdd for every element.
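The two-pass structure can be imitated in plain Python over flattened buffers. This is a rough sketch and not the CUDA code: each loop iteration stands in for one GPU thread, the index arithmetic is my own reading of how a flat element maps to its output slot, and skipping out-of-range segment ids is an assumption about the kernel's behavior.

```python
def unsorted_segment_sum_flat(data, segment_ids, num_segments, inner_dim):
    """Emulate a zero-fill pass followed by a per-element accumulate pass.

    `data` is a flat list of len(segment_ids) * inner_dim values.
    """
    # Pass 1 (role of SetToValue with functor::Zero): zero the whole output.
    output = [0] * (num_segments * inner_dim)
    # Pass 2 (role of the custom kernel): one iteration per input element.
    for idx, value in enumerate(data):
        row, offset = divmod(idx, inner_dim)
        seg = segment_ids[row]
        if seg < 0 or seg >= num_segments:
            continue  # assumption: ids outside [0, num_segments) are skipped
        # On the GPU this += would have to be an atomic add, because many
        # threads may target the same output slot at the same time.
        output[seg * inner_dim + offset] += value
    return output


flat = [1, 2, 3, 4, 5, 6, 7, 8, 4, 3, 2, 1]
print(unsorted_segment_sum_flat(flat, [0, 1, 0], 2, 4))
# [5, 5, 5, 5, 5, 6, 7, 8]
```

The zero-fill pass is what lets the accumulate pass be a pure "add into place" operation, which is why the sum reduces to one atomic add per element.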

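C++ source files such as .cc files can only be found in the source tree before compilation; after the build they are compiled into .so files.

The tf.scatter_add version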

tf.scatter_add(
    ref,     # <tf.Tensor 'wide_deep/deep/transpose:0' shape=(32, ?, 6) dtype=float32>
    indices, # <tf.Tensor 'wide_deep/deep/transpose_1:0' shape=(32, ?, ?) dtype=int64>
    updates, # <tf.Tensor 'wide_deep/deep/mul_4:0' shape=(?, ?, 32) dtype=float32>
    use_locking=False,  # False
    name=None           # None
)

Parameters

ref: the target; same type as updates. Here the input is an all-zero tensor.

A Variable.

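indices: the index ids; each entry corresponds to one slice of updates (the counterpart of data above) and says which position in ref that slice is added to.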

A Tensor. Must be one of the following types: int32, int64. A tensor of indices into the first dimension of ref.

updates: the counterpart of data above; its shape is indices.shape + ref.shape[1:].

A Tensor. Must have the same type as ref. A tensor of updated values to store in ref.

use_locking: whether to take a lock while performing ref += updates.

An optional bool. Defaults to False. If True, the assignment will be protected by a lock; otherwise the behavior is undefined, but may exhibit less contention.

name:

A name for the operation (optional).

Returns:

Same as ref. Returned as a convenience for operations that want to use the updated values after the update is done.

Constraint:

updates.shape = indices.shape + ref.shape[1:]
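The constraint can be checked mechanically. A small sketch follows, with shapes given as plain tuples; scatter_add_shapes_ok is a name invented for this note, not a TensorFlow function.

```python
def scatter_add_shapes_ok(ref_shape, indices_shape, updates_shape):
    """Check updates.shape == indices.shape + ref.shape[1:] (illustrative helper)."""
    expected = tuple(indices_shape) + tuple(ref_shape)[1:]
    return tuple(updates_shape) == expected


print(scatter_add_shapes_ok((5, 3), (2,), (2, 3)))  # True: vector indices
print(scatter_add_shapes_ok((5, 3), (), (3,)))      # True: scalar index
print(scatter_add_shapes_ok((5, 3), (2,), (2,)))    # False: trailing dim missing
```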

Behavior

# Scalar indices
ref[indices, ...] += updates[...]

# Vector indices (for each i)
ref[indices[i], ...] += updates[i, ...]

# High rank indices (for each i, ..., j)
ref[indices[i, ..., j], ...] += updates[i, ..., j, ...]
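For the vector-indices case, the rule can be sketched in pure Python on nested lists. This is an illustration of the semantics only, assuming a 2-D ref; it is not the TensorFlow implementation.

```python
def scatter_add(ref, indices, updates):
    """In place: ref[indices[i]][...] += updates[i][...] for 1-D indices."""
    for i, target in enumerate(indices):
        for col, value in enumerate(updates[i]):
            ref[target][col] += value
    return ref


ref = [[0, 0], [0, 0], [0, 0]]
scatter_add(ref, [2, 0, 2], [[1, 1], [2, 2], [3, 3]])
print(ref)  # [[2, 2], [0, 0], [4, 4]]
```

Repeated indices accumulate (row 2 receives two updates here). Starting from an all-zero ref with num_segments rows, this gives the same result as the unsorted_segment_sum example earlier: indices [0, 1, 0] applied to the rows of c yields [[5, 5, 5, 5], [5, 6, 7, 8]].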

Implementation

Python side, TensorFlow 1.14.0:

tensorflow/python/ops/gen_state_ops.py(719)scatter_add()

gen_state_ops.py is a Python file generated at build time; it actually calls into the C++ code through _op_def_lib._apply_op_helper:

tensorflow/core/ops/state_ops.cc(146)REGISTER_OP("ScatterAdd")

The ScatterAdd kernel is also fairly involved. Its class is not defined directly; it is defined indirectly through REGISTER_SCATTER_KERNEL:

tensorflow/core/kernels/scatter_op.cc(256)REGISTER_SCATTER_KERNEL(type, dev, "ScatterAdd", scatter_op::UpdateOp::ADD);

This macro ultimately uses REGISTER_KERNEL_BUILDER to register the ScatterUpdateOp class:

tensorflow/core/kernels/scatter_op.cc(73)class ScatterUpdateOp

The actual work happens in Compute:

tensorflow/core/kernels/scatter_op.cc(84)Compute()

Compute only decides whether to take the lock and then calls DoCompute:

tensorflow/core/kernels/scatter_op.cc(97)DoCompute()

DoCompute itself mostly validates the arguments; the actual computation is done by functor::ScatterFunctor. Only the GPU version is covered here:

tensorflow/core/kernels/scatter_functor_gpu.cu.h(118)struct ScatterFunctor

This operator launches a single CUDA kernel, scatter_op_gpu::ScatterOpCustomKernel:

tensorflow/core/kernels/scatter_functor_gpu.cu.h(73)ScatterOpCustomKernel()

The kernel applies ScatterOpKernelBody to every element; here the scatter_op::UpdateOp::ADD specialization is used (as specified by REGISTER_SCATTER_KERNEL):

tensorflow/core/kernels/scatter_functor_gpu.cu.h(43)struct ScatterOpKernelBody

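In effect, it simply performs a CudaAtomicAdd for every element.

C++ source files such as .cc files can only be found in the source tree before compilation; after the build they are compiled into .so files.

The torch.scatter_add version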

torch.Tensor.scatter_add(  # called on a tensor (self)
    dim,
    index,
    src
)

Parameters

self (Tensor) : the tensor on which scatter_add is called.

dim (int) : a single int, the dimension of self along which src is added.

the axis along which to index.

index (LongTensor) : the index ids; src is added into self at position index along dimension dim. It is either empty or has the same size as src.

the indices of elements to scatter and add, can be either empty or the same size of src. When empty, the operation returns identity.

src (Tensor) : the elements to add.

the source elements to scatter and add.

Returns:

A tensor with the same shape as self.

Constraint:

index.size(d) <= src.size(d) for all dimensions d, and that index.size(d) <= self.size(d) for all dimensions d != dim.

Behavior

self[index[i][j][k]][j][k] += src[i][j][k]  # if dim == 0
self[i][index[i][j][k]][k] += src[i][j][k]  # if dim == 1
self[i][j][index[i][j][k]] += src[i][j][k]  # if dim == 2
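The 2-D analogue of these rules can be sketched in pure Python. This is a minimal reference, assuming nested lists, dim in {0, 1}, and index with the same size as src; it is not PyTorch's implementation.

```python
import copy


def scatter_add_2d(self_tensor, dim, index, src):
    """Return a copy of self_tensor with src scattered and added along dim."""
    if dim not in (0, 1):
        raise ValueError("this sketch only handles dim 0 or 1")
    out = copy.deepcopy(self_tensor)
    for i, index_row in enumerate(index):
        for j, target in enumerate(index_row):
            if dim == 0:
                out[target][j] += src[i][j]  # self[index[i][j]][j] += src[i][j]
            else:
                out[i][target] += src[i][j]  # self[i][index[i][j]] += src[i][j]
    return out


zeros = [[0, 0, 0], [0, 0, 0]]
src = [[1, 2, 3], [4, 5, 6]]
print(scatter_add_2d(zeros, 0, [[0, 1, 0], [0, 1, 1]], src))
# [[5, 0, 3], [0, 7, 6]]
print(scatter_add_2d(zeros, 1, [[0, 0, 2], [1, 1, 1]], src))
# [[3, 0, 3], [0, 15, 0]]
```

Unlike tf.scatter_add, where one index selects a whole slice of ref, here every element of src carries its own index along dim.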

Implementation

Python side, PyTorch 1.5.1:

torch/onnx/symbolic_opset9.py(1938)scatter_add()

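This function is the symbolic that torch.onnx uses when exporting the operator, and its Python source shows directly that scatter_add is expressed in three steps: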

Step 1: create an all-zero tensor to_add with the same size as self:

sizes = self.type().sizes()

to_add = g.op("Constant", value_t=torch.zeros(sizes, dtype=dtype))

Step 2: use a scatter operation to write the elements of src into to_add at the positions given by index along dimension dim:

to_add = sym_help._scatter_helper(g, to_add, dim, index, src)

Finally, add to_add to self:

add(g, self, to_add)
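The three steps can be mimicked in 1-D pure Python to see what the decomposition computes. This is a sketch under my own assumptions: scatter_1d is a hypothetical stand-in for a scatter that assigns rather than accumulates (last write wins in this sequential version), and none of these helpers are PyTorch or ONNX functions.

```python
def scatter_1d(target, index, src):
    """Plain scatter: write src[i] into position index[i] (assignment)."""
    out = list(target)
    for i, pos in enumerate(index):
        out[pos] = src[i]
    return out


def scatter_add_via_scatter(self_values, index, src):
    to_add = [0] * len(self_values)            # step 1: zeros shaped like self
    to_add = scatter_1d(to_add, index, src)    # step 2: scatter src into to_add
    return [a + b for a, b in zip(self_values, to_add)]  # step 3: self + to_add


def scatter_add_1d(self_values, index, src):
    """Direct accumulate-in-place reference, for comparison."""
    out = list(self_values)
    for i, pos in enumerate(index):
        out[pos] += src[i]
    return out


print(scatter_add_via_scatter([10, 20, 30], [2, 0], [1, 5]))  # [15, 20, 31]
print(scatter_add_1d([10, 20, 30], [2, 0], [1, 5]))           # [15, 20, 31]
print(scatter_add_via_scatter([10, 20, 30], [1, 1], [1, 5]))  # [10, 25, 30]
print(scatter_add_1d([10, 20, 30], [1, 1], [1, 5]))           # [10, 26, 30]
```

In this sketch the two agree when the indices are unique and differ when an index repeats, because an assigning scatter keeps only one of the colliding values while a true scatter_add sums them.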

I did not find the concrete C++ and CUDA implementation in the PyTorch source.

