unsorted_segment_sum
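While working in TensorFlow I ran into several operators that behave much like unsorted_segment_sum, so I traced them through the source code and am noting the results here.
The tf.math.unsorted_segment_sum version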
tf.math.unsorted_segment_sum(
data, # <tf.Tensor 'wide_deep/deep/mul_4:0' shape=(?, ?, 32) dtype=float32>
segment_ids,# <tf.Tensor 'wide_deep/deep/add_4:0' shape=(?, ?) dtype=int64>
num_segments,# <tf.Tensor 'wide_deep/deep/Cast_9:0' shape=() dtype=int64>
name=None # None
)
Parameters:
data: A Tensor. Must be one of the following types: float32, float64, int32, uint8, int16, int8, complex64, int64, qint8, quint8, qint32, bfloat16, uint16, complex128, half, uint32, uint64.
segment_ids: The array of segment indices; its shape must be a prefix of data.shape. A Tensor. Must be one of the following types: int32, int64.
num_segments: The number of segments. A Tensor. Must be one of the following types: int32, int64.
name: A name for the operation (optional).
Returns: A Tensor with the same type as data and shape (num_segments, data.shape[segment_ids.dims()], ..., data.shape[data.dims() - 1]).
Behavior

import tensorflow as tf

c = tf.constant([[1, 2, 3, 4], [5, 6, 7, 8], [4, 3, 2, 1]])
tf.math.unsorted_segment_sum(c, tf.constant([0, 1, 0]), num_segments=2)
# ==> [[sum of rows in segment 0], [sum of rows in segment 1]]
# ==> [c[0] + c[2], c[1]]
# ==> [[5, 5, 5, 5], [5, 6, 7, 8]]
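To make the semantics concrete without needing TensorFlow, here is a minimal pure-Python sketch. It assumes 1-D segment_ids and data given as a list of equal-length rows; the function only mirrors the behavior described above and is not TensorFlow's implementation.

```python
def unsorted_segment_sum(data, segment_ids, num_segments):
    """Reference semantics for 1-D segment_ids over a list of equal-length rows."""
    width = len(data[0])
    # Every segment starts at zero, so segments that receive no rows stay all-zero.
    out = [[0] * width for _ in range(num_segments)]
    for row, seg in zip(data, segment_ids):
        for col, value in enumerate(row):
            out[seg][col] += value
    return out


c = [[1, 2, 3, 4], [5, 6, 7, 8], [4, 3, 2, 1]]
result = unsorted_segment_sum(c, [0, 1, 0], num_segments=2)
print(result)  # [[5, 5, 5, 5], [5, 6, 7, 8]]
```

The ids do not need to be sorted or contiguous, which is what separates this operator from the sorted segment_sum family.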
Implementation
Python side, TensorFlow 1.14.0:
tensorflow/python/ops/gen_math_ops.py(11767)unsorted_segment_sum()
gen_math_ops.py is a Python file generated at build time; it actually calls into the C++ code through _pywrap_tensorflow.TFE_Py_FastPathExecute:
tensorflow/core/ops/math_ops.cc(1252)REGISTER_OP("UnsortedSegmentSum")
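The UnsortedSegmentSum class is fairly complex and comes in several versions. Taking the GPU version as the example, it is first defined indirectly through REGISTER_GPU_KERNEL_UNSORTEDSEGMENT: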
tensorflow/core/kernels/segment_reduction_ops.cc(584)REGISTER_GPU_KERNEL_UNSORTEDSEGMENT("UnsortedSegmentSum", type, index_type, functor::Zero<type>, functor::SumOpGpu<type>)
The REGISTER_GPU_KERNEL_UNSORTEDSEGMENT macro ultimately registers the UnsortedSegmentReductionOp class through REGISTER_KERNEL_BUILDER:
tensorflow/core/kernels/segment_reduction_ops.cc(467)class UnsortedSegmentReductionOp
The actual implementation is in the Compute function:
tensorflow/core/kernels/segment_reduction_ops.cc(472)Compute()
REGISTER_GPU_KERNEL_UNSORTEDSEGMENT specifies functor::UnsortedSegmentFunctor as the DeviceReductionFunctor, and Compute calls it directly:
tensorflow/core/kernels/segment_reduction_ops_gpu.cu.cc(176)struct UnsortedSegmentFunctor
UnsortedSegmentFunctor launches two CUDA kernels:
The first kernel, SetToValue, initializes every element of the output tensor to 0 (functor::Zero, as specified in REGISTER_GPU_KERNEL_UNSORTEDSEGMENT):
tensorflow/core/util/gpu_device_functions.h(472)SetToValue()
tensorflow/core/kernels/segment_reduction_ops.h(107)struct Zero
The second kernel, UnsortedSegmentCustomKernel, applies functor::SumOpGpu to each element (also specified in REGISTER_GPU_KERNEL_UNSORTEDSEGMENT):
tensorflow/core/kernels/segment_reduction_ops_gpu.cu.cc(109)UnsortedSegmentCustomKernel()
tensorflow/core/kernels/segment_reduction_ops.h(72)struct SumOpGpu
In effect, it calls CudaAtomicAdd for each element.
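The two-kernel structure described above can be emulated sequentially in plain Python. This is only a sketch under stated assumptions: the function names are hypothetical stand-ins for the CUDA kernels, tensors are flattened row-major buffers, each loop iteration plays the role of one GPU thread, and skipping out-of-range segment ids is an assumption about the kernel's bounds check.

```python
def set_to_value(count, value):
    """Stand-in for the SetToValue kernel: fill the flat output buffer."""
    return [value] * count


def unsorted_segment_custom_kernel(data, segment_ids, inner_dim, num_segments, out):
    """Stand-in for the second kernel; one iteration handles one input element."""
    for i in range(len(data)):
        input_row = i // inner_dim   # which row of data this element belongs to
        offset = i % inner_dim       # position of the element inside its row
        seg = segment_ids[input_row]
        if seg < 0 or seg >= num_segments:
            continue                 # assumption: out-of-range ids are skipped
        # On the GPU this accumulation is a CudaAtomicAdd, since many threads
        # may target the same output element at once.
        out[seg * inner_dim + offset] += data[i]


flat_data = [1, 2, 3, 4, 5, 6, 7, 8, 4, 3, 2, 1]  # 3 rows x 4 columns, row-major
out = set_to_value(2 * 4, 0)
unsorted_segment_custom_kernel(flat_data, [0, 1, 0], 4, 2, out)
print(out)  # [5, 5, 5, 5, 5, 6, 7, 8]
```

The atomic add is what lets rows with the same segment id be processed by different threads in any order.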
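C++ source files such as .cc exist only in the source tree before compilation; after the build they are compiled into .so files.
The tf.scatter_add version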
tf.scatter_add(
ref, # <tf.Tensor 'wide_deep/deep/transpose:0' shape=(32, ?, 6) dtype=float32>
indices, # <tf.Tensor 'wide_deep/deep/transpose_1:0' shape=(32, ?, ?) dtype=int64>
updates, # <tf.Tensor 'wide_deep/deep/mul_4:0' shape=(?, ?, 32) dtype=float32>
use_locking=False, #False
name=None #None
)
Parameters:
ref: The target, with the same type as updates; here an all-zeros tensor is passed in. A Variable.
indices: The index ids, in one-to-one correspondence with the elements of data (passed here as updates); they indicate which position in ref each update is added to. A Tensor. Must be one of the following types: int32, int64. A tensor of indices into the first dimension of ref.
updates: This is data from the unsorted_segment_sum call; its leading dimensions match indices.shape (see the constraint below). A Tensor. Must have the same type as ref. A tensor of updated values to store in ref.
use_locking: Whether to take a lock while performing ref += updates. An optional bool. Defaults to False. If True, the assignment will be protected by a lock; otherwise the behavior is undefined, but may exhibit less contention.
name: A name for the operation (optional).
Returns: Same as ref. Returned as a convenience for operations that want to use the updated values after the update is done.
Constraint:
updates.shape = indices.shape + ref.shape[1:]
Behavior
# Scalar indices
ref[indices, ...] += updates[...]
# Vector indices (for each i)
ref[indices[i], ...] += updates[i, ...]
# High rank indices (for each i, ..., j)
ref[indices[i, ..., j], ...] += updates[i, ..., j, ...]
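As a rough illustration of the vector-indices case, the sketch below applies the update rule to nested Python lists. It assumes 1-D indices and a 2-D ref, so updates.shape = indices.shape + ref.shape[1:] holds as (3,) + (2,) = (3, 2); it is a reference for the semantics only, not TensorFlow code.

```python
def scatter_add(ref, indices, updates):
    """In-place ref[indices[i]][...] += updates[i][...] for 1-D indices."""
    for i, target in enumerate(indices):
        for col, value in enumerate(updates[i]):
            ref[target][col] += value
    return ref


ref = [[0, 0], [10, 10], [20, 20]]
scatter_add(ref, [2, 0, 2], [[1, 1], [2, 2], [3, 3]])
print(ref)  # [[2, 2], [10, 10], [24, 24]]
```

Row 2 is targeted twice and receives both updates, which is the accumulation behavior that distinguishes scatter_add from a plain scatter_update.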

Implementation
Python side, TensorFlow 1.14.0:
tensorflow/python/ops/gen_state_ops.py(719)scatter_add()
gen_state_ops.py is a Python file generated at build time; it actually calls into the C++ code through _op_def_lib._apply_op_helper:
tensorflow/core/ops/state_ops.cc(146)REGISTER_OP("ScatterAdd")
The implementation of the ScatterAdd class is fairly complex. The class is not defined directly but indirectly through REGISTER_SCATTER_KERNEL:
tensorflow/core/kernels/scatter_op.cc(256)REGISTER_SCATTER_KERNEL(type, dev, "ScatterAdd", scatter_op::UpdateOp::ADD);
This macro ultimately registers the ScatterUpdateOp class through REGISTER_KERNEL_BUILDER:
tensorflow/core/kernels/scatter_op.cc(73)class ScatterUpdateOp
The actual implementation is in Compute:
tensorflow/core/kernels/scatter_op.cc(84)Compute()
Compute only decides whether to take the lock and then calls DoCompute:
tensorflow/core/kernels/scatter_op.cc(97)DoCompute()
DoCompute in turn only validates the arguments; the real work is done by functor::ScatterFunctor. Looking only at the GPU implementation:
tensorflow/core/kernels/scatter_functor_gpu.cu.h(118)struct ScatterFunctor
This operator launches a single CUDA kernel, scatter_op_gpu::ScatterOpCustomKernel:
tensorflow/core/kernels/scatter_functor_gpu.cu.h(73)ScatterOpCustomKernel()
The kernel runs ScatterOpKernelBody on every element; here the scatter_op::UpdateOp::ADD specialization is used (as specified in REGISTER_SCATTER_KERNEL):
tensorflow/core/kernels/scatter_functor_gpu.cu.h(43)struct ScatterOpKernelBody
In effect, it performs a CudaAtomicAdd for each element.
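Since both GPU paths reduce to a per-element atomic add, unsorted_segment_sum should match scatter_add applied to an all-zeros ref. The pure-Python sketch below checks that on a small example; both functions are simplified stand-ins (1-D ids, 2-D data) written for this note, not the TensorFlow kernels.

```python
def unsorted_segment_sum(data, segment_ids, num_segments):
    # Zero-fill step, then accumulate each row into its segment.
    out = [[0] * len(data[0]) for _ in range(num_segments)]
    for row, seg in zip(data, segment_ids):
        for col, value in enumerate(row):
            out[seg][col] += value
    return out


def scatter_add(ref, indices, updates):
    # No zero-fill here: the caller supplies the initial contents of ref.
    for i, target in enumerate(indices):
        for col, value in enumerate(updates[i]):
            ref[target][col] += value
    return ref


data = [[1, 2, 3, 4], [5, 6, 7, 8], [4, 3, 2, 1]]
ids = [0, 1, 0]
via_segment_sum = unsorted_segment_sum(data, ids, 2)
via_scatter_add = scatter_add([[0] * 4 for _ in range(2)], ids, data)
print(via_segment_sum == via_scatter_add)  # True
```

The visible difference in these sketches is who provides the zeros: unsorted_segment_sum spends a kernel initializing its own output, while scatter_add accumulates into whatever ref already holds.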
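C++ source files such as .cc exist only in the source tree before compilation; after the build they are compiled into .so files.
The torch.scatter_add version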
torch.scatter_add(
input, # called self below; the method form is self.scatter_add(dim, index, src)
dim,
index,
src
)
Parameters:
self (Tensor): The tensor on which scatter_add is called; it is normally invoked as a method of a tensor.
dim (int): A single int, the dimension of self along which src is added. The axis along which to index.
index (LongTensor): The index ids; src is added at position index along dimension dim of self. It is either empty or the same size as src. The indices of elements to scatter and add, can be either empty or the same size of src. When empty, the operation returns identity.
src (Tensor): The elements to add. The source elements to scatter and add.
Returns:
A tensor with the same dimensions as self.
Constraint:
index.size(d) <= src.size(d) for all dimensions d, and index.size(d) <= self.size(d) for all dimensions d != dim.
Behavior
self[index[i][j][k]][j][k] += src[i][j][k] # if dim == 0
self[i][index[i][j][k]][k] += src[i][j][k] # if dim == 1
self[i][j][index[i][j][k]] += src[i][j][k] # if dim == 2
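The three formulas differ only in which coordinate is replaced by the index value. A small pure-Python sketch over 3-D nested lists makes this explicit; it is an out-of-place reference written for this note (the parameter name self_tensor is a stand-in for self) and does not call PyTorch.

```python
import copy


def scatter_add(self_tensor, dim, index, src):
    """Out-of-place scatter_add on 3-D nested lists, following the three formulas."""
    out = copy.deepcopy(self_tensor)
    for i, plane in enumerate(index):
        for j, row in enumerate(plane):
            for k, target in enumerate(row):
                pos = [i, j, k]
                pos[dim] = target  # replace only the coordinate along dim
                out[pos[0]][pos[1]][pos[2]] += src[i][j][k]
    return out


zeros = [[[0, 0]], [[0, 0]]]  # shape (2, 1, 2)
src = [[[1, 2]], [[3, 4]]]
along_dim0 = scatter_add(zeros, 0, [[[1, 0]], [[1, 1]]], src)
along_dim2 = scatter_add(zeros, 2, [[[1, 1]], [[0, 1]]], src)
print(along_dim0)  # [[[0, 2]], [[4, 4]]]
print(along_dim2)  # [[[0, 3]], [[3, 4]]]
```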
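Implementation
Python side, PyTorch 1.5.1:
torch/onnx/symbolic_opset9.py(1938)scatter_add()
Note that this file defines the symbolic function used for ONNX export rather than the native kernel. The Python source shows directly that, there, scatter_add is built in three steps:
First, create an all-zeros tensor to_add with the same size as self.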
sizes = self.type().sizes()
to_add = g.op("Constant", value_t=torch.zeros(sizes, dtype=dtype))
Second, use a scatter operation to assign the elements of src, according to index, to the corresponding positions along dimension dim of to_add.
to_add = sym_help._scatter_helper(g, to_add, dim, index, src)
Finally, add to_add to self.
add(g, self, to_add)
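The three steps can be sketched on 1-D lists as follows. The helper names are hypothetical, and the sketch assumes that a sequential scatter lets later writes win. It suggests one caveat worth keeping in mind: because step two assigns rather than accumulates, the decomposition matches a direct scatter-add only when index contains no duplicates.

```python
def scatter(base, index, src):
    """1-D scatter by assignment: out[index[i]] = src[i] (later writes win here)."""
    out = list(base)
    for i, target in enumerate(index):
        out[target] = src[i]
    return out


def scatter_add_three_steps(self_tensor, index, src):
    to_add = [0] * len(self_tensor)                      # step 1: zeros shaped like self
    to_add = scatter(to_add, index, src)                 # step 2: scatter src into to_add
    return [a + b for a, b in zip(self_tensor, to_add)]  # step 3: self + to_add


def scatter_add_direct(self_tensor, index, src):
    """Accumulating reference: out[index[i]] += src[i]."""
    out = list(self_tensor)
    for i, target in enumerate(index):
        out[target] += src[i]
    return out


base = [10, 20, 30]
unique = scatter_add_three_steps(base, [2, 0], [1, 2])
print(unique)  # [12, 20, 31]
dup_steps = scatter_add_three_steps(base, [1, 1], [5, 7])
dup_direct = scatter_add_direct(base, [1, 1], [5, 7])
print(dup_steps, dup_direct)  # [10, 27, 30] [10, 32, 30]
```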
I could not find the concrete C++ and CUDA implementation in the PyTorch source.
