An Introduction to TensorFlow Probability Distributions

This article is a repost. 2021-02-02 18:04

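Abstract: TensorFlow Distributions provides two kinds of abstractions: distributions and bijectors. Distributions are a collection of probability distributions with fast, numerically stable methods for sampling, computing log-probabilities, and computing other statistics. Bijectors are a collection of composable, deterministic transformations of distributions.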

1. Distributions

1.1 methods

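Every distribution implements at least the following methods: sample, log_prob, batch_shape_tensor, and event_shape_tensor. Many also implement others, such as cdf, survival_function, quantile, mean, variance, and entropy. The Distribution base class derives some methods from others: prob from log_prob, and log_survival_function from log_cdf.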
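To make the relationship between the core and derived methods concrete, here is a minimal pure-Python sketch. ToyNormal is a hypothetical stand-in written for this illustration, not the TFP class; it only shows how prob can be derived from log_prob and log_survival_function from log_cdf.

```python
import math
import random


class ToyNormal:
    """Hypothetical scalar Normal used for illustration (not the TFP class)."""

    def __init__(self, loc, scale):
        self.loc = loc
        self.scale = scale

    def sample(self, rng=random):
        # Draw one value with the standard library's Gaussian sampler.
        return rng.gauss(self.loc, self.scale)

    def log_prob(self, x):
        z = (x - self.loc) / self.scale
        return -0.5 * z * z - math.log(self.scale) - 0.5 * math.log(2.0 * math.pi)

    def prob(self, x):
        # Derived from log_prob, mirroring what the base class does.
        return math.exp(self.log_prob(x))

    def log_cdf(self, x):
        z = (x - self.loc) / (self.scale * math.sqrt(2.0))
        return math.log(0.5 * (1.0 + math.erf(z)))

    def log_survival_function(self, x):
        # Derived from log_cdf: log(1 - exp(log_cdf(x))).
        return math.log1p(-math.exp(self.log_cdf(x)))


dist = ToyNormal(loc=0.0, scale=1.0)
density_at_zero = dist.prob(0.0)
log_sf_at_zero = dist.log_survival_function(0.0)
```

A subclass written this way only has to supply log_prob and log_cdf; the derived methods come for free.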

1.2 shape semantics

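The shape of a tensor drawn from a distribution is split into three parts: sample shape, batch shape, and event shape.

sample shape: the shape of independent, identically distributed draws from the given distribution;

batch shape: the shape of independent but not identically distributed draws; that is, a single object can hold a set of distributions from the same family with different parameters. In machine learning, the batch shape is typically used to give each example in a mini-batch its own distribution;

event shape: the shape of a single draw from the distribution.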
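The three parts compose by concatenation. The helper below is a hypothetical illustration (not a library function) of how the shape of a sampled tensor, and of the corresponding log-probability, follows from them.

```python
def sample_output_shape(sample_shape, batch_shape, event_shape):
    """Shape of a sampled tensor: sample shape + batch shape + event shape."""
    return tuple(sample_shape) + tuple(batch_shape) + tuple(event_shape)


# 5 i.i.d. draws from a batch of 3 differently parameterized
# distributions over 2-dimensional events.
draws_shape = sample_output_shape([5], [3], [2])

# log_prob reduces over the event dimensions only, so the event part
# of the shape disappears.
log_prob_shape = draws_shape[:-1]
```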

1.3 sampling

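reparameterization: every distribution has a reparameterization property that describes how automatic differentiation interacts with sampling. There are currently two values: "fully reparameterized" and "not reparameterized".

fully reparameterized: for example, for dist = Normal(loc, scale), the sample y = dist.sample() is computed internally as x = tf.random_normal([]); y = scale * x + loc. The sample y is reparameterized because it is a smooth function of the parameters loc and scale and of the parameter-free sample x.

not reparameterized: for example, a Gamma distribution sampled by acceptance-rejection produces samples that are not a smooth function of its parameters.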

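end-to-end automatic differentiation: combined with TensorFlow, a fully reparameterized distribution supports end-to-end automatic differentiation. For example, to minimize the expected loss E[φ(Y)] under a distribution Y, one can minimize the Monte Carlo approximation

$$S_N = \frac{1}{N} \sum_{n=1}^{N} \phi(Y_n), \qquad Y_n \overset{\text{iid}}{\sim} Y$$

S_N then serves as an estimate of the expected loss, and ∇_λ S_N as an estimate of the gradient ∇_λ E[φ(Y)], where λ denotes the parameters of Y.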
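The estimator S_N and its gradient can be sketched without any autodiff library. The example below assumes a scalar Normal(loc, scale), the loss φ(y) = y², and differentiates by hand with respect to loc; the function name and these choices are illustrative assumptions, not part of the original text.

```python
import random


def pathwise_gradient_estimate(loc, scale, num_samples, seed=0):
    """Estimate S_N and dS_N/dloc for phi(y) = y**2 with Y ~ Normal(loc, scale)."""
    rng = random.Random(seed)
    loss_sum = 0.0
    grad_sum = 0.0
    for _ in range(num_samples):
        eps = rng.gauss(0.0, 1.0)  # parameter-free noise
        y = loc + scale * eps      # reparameterized sample
        loss_sum += y * y          # phi(y)
        grad_sum += 2.0 * y        # phi'(y) * dy/dloc, with dy/dloc = 1
    return loss_sum / num_samples, grad_sum / num_samples


s_n, grad_loc = pathwise_gradient_estimate(loc=1.5, scale=0.5, num_samples=20000)
# Exact values for comparison: E[Y^2] = loc^2 + scale^2 and d/dloc E[Y^2] = 2 * loc.
```

Because y is a smooth function of loc, the gradient can flow through the sample itself; with a rejection sampler there is no such path.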

1.4 high order distributions

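TransformedDistribution: applying an invertible, differentiable transformation to a base distribution yields a TransformedDistribution. For example, a standard Gumbel distribution can be built from an Exponential distribution: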

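# tfd and tfb are assumed to alias the distributions and bijectors modules.
# The Affine arguments below follow the older API used in the original example.
# If X ~ Exponential(1), then -log(X) is standard Gumbel.
# Chain applies its bijectors right to left: log (Invert(Exp)) first, then negation.
standard_gumbel = tfd.TransformedDistribution(
    distribution=tfd.Exponential(rate=1.),
    bijector=tfb.Chain([
        tfb.Affine(
            scale_identity_multiplier=-1.,
            event_ndims=0),
        tfb.Invert(tfb.Exp()),
    ]))
standard_gumbel.batch_shape  # ==> []
standard_gumbel.event_shape  # ==> []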
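As a sanity check on this construction, the change-of-variables computation can be done by hand in plain Python. The function names below are hypothetical helpers for this sketch; it compares the density of -log(X) for X ~ Exponential(1) with the closed-form standard Gumbel log-density.

```python
import math


def exponential_log_prob(x, rate=1.0):
    """Log-density of Exponential(rate) at x > 0."""
    return math.log(rate) - rate * x


def transformed_log_prob(y):
    """Log-density of Y = -log(X), X ~ Exponential(1), by change of variables."""
    x = math.exp(-y)      # inverse of the map x -> -log(x)
    inverse_log_det = -y  # log |d/dy exp(-y)| = log(exp(-y)) = -y
    return exponential_log_prob(x) + inverse_log_det


def standard_gumbel_log_prob(y):
    """Closed-form log-density of the standard Gumbel distribution."""
    return -(y + math.exp(-y))


checks = [(transformed_log_prob(y), standard_gumbel_log_prob(y))
          for y in (-1.0, 0.0, 0.7, 3.0)]
```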

Building on the Gumbel distribution, one can construct a Gumbel-Softmax (Concrete) distribution:

alpha = tf.stack([
    tf.fill([28 * 28], 2.),
    tf.ones(28 * 28)])

concrete_pixel = tfd.TransformedDistribution(
    distribution=standard_gumbel,
    bijector=tfb.Chain([
        tfb.Sigmoid(),
        tfb.Affine(shift=tf.log(alpha)),
    ]),
    batch_shape=[2, 28 * 28])
concrete_pixel.batch_shape  # ==> [2, 784]
concrete_pixel.event_shape  # ==> []

Independent: reinterprets batch dimensions as event dimensions, converting between batch shape and event shape. For example:

image_dist = tfd.TransformedDistribution(
    distribution=tfd.Independent(concrete_pixel),
    bijector=tfb.Reshape(
        event_shape_out=[28, 28, 1],
        event_shape_in=[28 * 28]))
image_dist.batch_shape  # ==> [2]
image_dist.event_shape  # ==> [28, 28, 1]
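The shape bookkeeping performed by Independent can be sketched as a small pure-Python function. independent_shapes is a hypothetical helper, and the default used here (keep only the leftmost batch dimension) is an assumption chosen to match the shapes shown in this article.

```python
def independent_shapes(batch_shape, event_shape, reinterpreted_batch_ndims=None):
    """Return (batch_shape, event_shape) after moving batch dims into the event."""
    batch_shape, event_shape = list(batch_shape), list(event_shape)
    if reinterpreted_batch_ndims is None:
        # Assumed default: keep only the leftmost batch dimension.
        reinterpreted_batch_ndims = max(0, len(batch_shape) - 1)
    split = len(batch_shape) - reinterpreted_batch_ndims
    # The rightmost batch dimensions become the leftmost event dimensions.
    return batch_shape[:split], batch_shape[split:] + event_shape


# concrete_pixel has batch shape [2, 784] and a scalar event.
new_batch, new_event = independent_shapes([2, 28 * 28], [])
```

Reshape then turns the [784] event into [28, 28, 1] without touching the batch shape.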

Mixture: defines a new distribution as a mixture of several component distributions. For example:

image_mixture = tfd.MixtureSameFamily(
    mixture_distribution=tfd.Categorical(
        probs=[0.2, 0.8]),
    components_distribution=image_dist)
image_mixture.batch_shape  # ==> []
image_mixture.event_shape  # ==> [28, 28, 1]

1.5 distribution functionals

A functional takes one or more distributions as input and returns a scalar, for example entropy, cross entropy, mutual information, or KL divergence.

p = tfd.Normal(loc=0., scale=1.)
q = tfd.Normal(loc=-1., scale=2.)
xent = p.cross_entropy(q)
kl = p.kl_divergence(q)
# ==> xent - p.entropy()
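For two scalar Normal distributions these functionals have closed forms, so the identity KL(p, q) = H(p, q) - H(p) can be checked directly. The helper functions below are hypothetical and written only for this sketch; they use the same parameters as p and q above.

```python
import math


def normal_entropy(scale):
    """Entropy H(p) of Normal(loc, scale); it does not depend on loc."""
    return 0.5 * math.log(2.0 * math.pi * math.e * scale ** 2)


def normal_cross_entropy(p_loc, p_scale, q_loc, q_scale):
    """Cross entropy H(p, q) = -E_p[log q(X)] for two Normals."""
    return (math.log(q_scale) + 0.5 * math.log(2.0 * math.pi)
            + (p_scale ** 2 + (p_loc - q_loc) ** 2) / (2.0 * q_scale ** 2))


def normal_kl(p_loc, p_scale, q_loc, q_scale):
    """KL(p || q) for two Normals, in closed form."""
    return (math.log(q_scale / p_scale)
            + (p_scale ** 2 + (p_loc - q_loc) ** 2) / (2.0 * q_scale ** 2)
            - 0.5)


xent = normal_cross_entropy(0.0, 1.0, -1.0, 2.0)
entropy = normal_entropy(1.0)
kl = normal_kl(0.0, 1.0, -1.0, 2.0)
```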

2. Bijectors

2.1 definition

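The Bijector API provides an interface for transforming distributions with differentiable, bijective maps (diffeomorphisms). Given a random variable X and a diffeomorphism F, we can define a new random variable Y = F(X) whose density is

$$p_Y(y) = p_X\left(F^{-1}(y)\right) \left|DF^{-1}(y)\right|$$

where DF^{-1}(y) is the Jacobian of F^{-1} at y (equivalently, the inverse of the Jacobian of F) and |·| denotes the absolute value of its determinant. (Reference: https://zhuanlan.zhihu.com/p/100287713)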

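Each Bijector subclass corresponds to one such F, and TransformedDistribution automatically computes the density of Y = F(X). Bijectors thus let us build many new distributions from existing ones.

A bijector mainly provides the following three functions:

forward: implements x → F(x); TransformedDistribution.sample uses it to map a tensor drawn from the base distribution to a sample of Y;

inverse: the inverse of forward, implementing y → F^-1(y); TransformedDistribution.log_prob uses it to compute the log-probability (the density formula above);

inverse_log_det_jacobian: computes log |DF⁻¹(y)|; TransformedDistribution.log_prob uses it to compute the log-probability (the density formula above).

With bijectors, TransformedDistribution can implement sample, log_prob, and prob automatically and efficiently. For bijectors with a constant Jacobian, it also implements some basic statistics automatically, such as mean, variance, and entropy.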
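The division of labor between the three functions can be sketched in pure Python. ExpBijector and transformed_log_prob are hypothetical names for this illustration, not the library classes; pushing a standard Normal through exp should reproduce the log-Normal density, which gives a simple correctness check.

```python
import math


class ExpBijector:
    """Hypothetical scalar bijector F(x) = exp(x)."""

    def forward(self, x):
        return math.exp(x)

    def inverse(self, y):
        return math.log(y)

    def inverse_log_det_jacobian(self, y):
        # log |d/dy log(y)| = log(1 / y) = -log(y)
        return -math.log(y)


def standard_normal_log_prob(x):
    return -0.5 * x * x - 0.5 * math.log(2.0 * math.pi)


def transformed_log_prob(base_log_prob, bijector, y):
    """log p_Y(y) = log p_X(F^-1(y)) + log |DF^-1(y)|."""
    x = bijector.inverse(y)
    return base_log_prob(x) + bijector.inverse_log_det_jacobian(y)


def log_normal_log_prob(y):
    """Closed-form log-density of the standard log-Normal distribution."""
    return -math.log(y) - 0.5 * math.log(2.0 * math.pi) - 0.5 * math.log(y) ** 2


pairs = [(transformed_log_prob(standard_normal_log_prob, ExpBijector(), y),
          log_normal_log_prob(y))
         for y in (0.1, 1.0, 2.5)]
```

Sampling would use only forward: draw x from the base distribution and return bijector.forward(x).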

The following applies an affine transformation to a Laplace distribution:

vector_laplace = tfd.TransformedDistribution(
    distribution=tfd.Laplace(loc=0., scale=1.),
    bijector=tfb.Affine(
        shift=tf.Variable(tf.zeros(d)),
        scale_tril=tfd.fill_triangular(
            tf.Variable(tf.ones(d * (d + 1) // 2)))),
    event_shape=[d])

Because its parameters are tf.Variables, this distribution is learnable.

2.2 composability

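Bijectors can be composed into higher-order bijectors, such as Chain and Invert.

The Chain bijector can build a rich family of distributions, for example a multivariate logit-Normal distribution: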

matrix_logit_mvn = tfd.TransformedDistribution(
    distribution=tfd.Normal(0., 1.),
    bijector=tfb.Chain([
        tfb.Reshape([d, d]),
        tfb.SoftmaxCentered(),
        tfb.Affine(scale_diag=diag),
    ]),
    event_shape=[d * d])

Invert effectively doubles the number of available bijectors by swapping the inverse and forward functions. For example:

softminus_gamma = tfd.TransformedDistribution(
    distribution=tfd.Gamma(
        concentration=alpha,
        rate=beta),
    bijector=tfb.Invert(tfb.Softplus()))
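Both higher-order bijectors are small wrappers, which the following pure-Python sketch tries to make explicit. The classes are hypothetical scalar stand-ins (log-det-Jacobian terms are omitted for brevity), not the library implementations.

```python
import math


class Exp:
    def forward(self, x):
        return math.exp(x)

    def inverse(self, y):
        return math.log(y)


class Negate:
    def forward(self, x):
        return -x

    def inverse(self, y):
        return -y


class Invert:
    """Swap the forward and inverse functions of a wrapped bijector."""

    def __init__(self, bijector):
        self.bijector = bijector

    def forward(self, x):
        return self.bijector.inverse(x)

    def inverse(self, y):
        return self.bijector.forward(y)


class Chain:
    """Compose bijectors; the last one in the list is applied first."""

    def __init__(self, bijectors):
        self.bijectors = list(bijectors)

    def forward(self, x):
        for b in reversed(self.bijectors):
            x = b.forward(x)
        return x

    def inverse(self, y):
        for b in self.bijectors:
            y = b.inverse(y)
        return y


# x -> -log(x): the map used earlier to turn Exponential into Gumbel.
neg_log = Chain([Negate(), Invert(Exp())])
forward_value = neg_log.forward(2.0)
round_trip = neg_log.inverse(forward_value)
```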

2.3 caching

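Bijectors automatically cache the input/output pairs of their operations, including the log det Jacobian. The point of caching is that the inverse can be obtained cheaply even when it is slow to compute, numerically unstable, or hard to implement. A cache hit occurs when computing the probability of a value that was just produced by sampling. If q(x) is the density of x = f(ε) with ε ~ r, then caching reduces the cost of computing q(x_i):

$$q(x_i) = r\left(f^{-1}(x_i)\right) \left|Df^{-1}(x_i)\right|, \qquad f^{-1}(x_i) = \varepsilon_i \ \text{(read from the cache)}$$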

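The caching mechanism can also be used for efficient importance sampling, which in its standard form evaluates q at samples that were just drawn from it:

$$\mu = \int h(x)\, p(x)\, dx = \int h(x)\, \frac{p(x)}{q(x)}\, q(x)\, dx \approx \frac{1}{N} \sum_{i=1}^{N} h(x_i)\, \frac{p(x_i)}{q(x_i)}, \qquad x_i \sim q$$

where h is the integrand of interest and p is the target density.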
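The idea behind the cache can be illustrated with a small memoizing wrapper. CachingBijector is a hypothetical class for this sketch; it keys the cache on float values, whereas the real implementation is assumed to key on tensor objects, so treat this only as a picture of the mechanism.

```python
import math


class CachingBijector:
    """Hypothetical bijector wrapper that remembers forward input/output pairs."""

    def __init__(self, forward_fn, inverse_fn):
        self._forward_fn = forward_fn
        self._inverse_fn = inverse_fn
        self._inverse_cache = {}  # maps output y -> input x
        self.inverse_calls = 0    # counts real inverse computations

    def forward(self, x):
        y = self._forward_fn(x)
        self._inverse_cache[y] = x
        return y

    def inverse(self, y):
        if y in self._inverse_cache:
            # Cache hit: y came from forward, so its preimage is already known.
            return self._inverse_cache[y]
        self.inverse_calls += 1
        return self._inverse_fn(y)


bij = CachingBijector(math.exp, math.log)
y = bij.forward(0.3)
x_back = bij.inverse(y)          # served from the cache
calls_after_hit = bij.inverse_calls
bij.inverse(5.0)                 # not produced by forward: computed for real
calls_after_miss = bij.inverse_calls
```

A side benefit visible here is that the cached preimage is exact, while log(exp(0.3)) may differ from 0.3 by rounding.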

3. Applications

3.1 Kernel density estimation (KDE)

For example, the code below builds a mixture of Gaussians from n mvn_diag kernels, each with weight 1/n. Note that Independent reinterprets the shape of the distribution: tfd.Normal(loc=x, scale=1.) creates a distribution with batch_shape = [n, d] and event_shape = []; wrapping it in Independent yields a distribution with batch_shape = [n] and event_shape = [d].

Independent documentation: https://www.tensorflow.org/probability/api_docs/python/tfp/distributions/Independent?hl=zh-cn

f = lambda x: tfd.Independent(tfd.Normal(
    loc=x, scale=1.))
n = x.shape[0].value
kde = tfd.MixtureSameFamily(
    mixture_distribution=tfd.Categorical(
        probs=[1 / n] * n),
    components_distribution=f(x))
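What this mixture computes can be written out by hand. The sketch below is a pure-Python version of the same density, assuming unit-scale diagonal Gaussian kernels and equal weights 1/n; the function names are hypothetical helpers, and the mixture is evaluated with a log-sum-exp for numerical stability.

```python
import math


def diag_normal_log_prob(point, loc, scale=1.0):
    """Log-density of a diagonal Gaussian: the sum of per-dimension terms."""
    return sum(-0.5 * ((p - m) / scale) ** 2
               - math.log(scale) - 0.5 * math.log(2.0 * math.pi)
               for p, m in zip(point, loc))


def kde_log_prob(point, data, scale=1.0):
    """Log-density of an equally weighted mixture of kernels centered on data."""
    n = len(data)
    terms = [math.log(1.0 / n) + diag_normal_log_prob(point, row, scale)
             for row in data]
    top = max(terms)  # log-sum-exp trick
    return top + math.log(sum(math.exp(t - top) for t in terms))


data = [[0.0, 0.0], [2.0, 0.0]]  # n = 2 points in d = 2 dimensions
log_density = kde_log_prob([1.0, 0.0], data)
```

The inner sum over dimensions plays the role of Independent, and the outer log-sum-exp over data points plays the role of MixtureSameFamily.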

3.2 Variational autoencoder (VAE)

Paper: https://arxiv.org/pdf/1312.6114.pdf

Blog post: https://spaces.ac.cn/archives/5253

def make_encoder(x, z_size=8):
    # The network outputs 2 * z_size values: the mean and the pre-softplus scale.
    net = make_nn(x, z_size * 2)
    return tfd.MultivariateNormalDiag(
        loc=net[..., :z_size],
        scale_diag=tf.nn.softplus(net[..., z_size:]))


def make_decoder(z, x_shape=(28, 28, 1)):
    net = make_nn(z, tf.reduce_prod(x_shape))
    logits = tf.reshape(
        net, tf.concat([[-1], x_shape], axis=0))
    # Independent treats the pixel dimensions as a single event.
    return tfd.Independent(tfd.Bernoulli(logits))


def make_prior(z_size=8, dtype=tf.float32):
    return tfd.MultivariateNormalDiag(
        loc=tf.zeros(z_size, dtype))


def make_nn(x, out_size, hidden_size=(128, 64)):
    net = tf.layers.flatten(x)
    for h in hidden_size:
        net = tf.layers.dense(
            net, h, activation=tf.nn.relu)
    return tf.layers.dense(net, out_size)

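3.3 Probabilistic programming with Edward

tfd serves as the backend of Edward. The following code implements a stochastic recurrent neural network (stochastic RNN), whose hidden state is random.

Stochastic RNN paper: https://arxiv.org/pdf/1411.7610.pdf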

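import tensorflow as tf
from edward.models import Normal

# K (latent size), D (observation size) and T (number of steps) are assumed to be defined.
# Preallocate two separate lists; `z = x = []` would alias one empty list and fail on z[0].
z, x = [None] * T, [None] * T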
z[0] = Normal(loc=tf.zeros(K), scale=tf.ones(K))
h = tf.layers.dense(
    z[0], 512, activation=tf.nn.relu)
loc = tf.layers.dense(h, D, activation=None)
x[0] = Normal(loc=loc, scale=0.5)
for t in range(1, T):
    inputs = tf.concat([z[t - 1], x[t - 1]], 0)
    loc = tf.layers.dense(
        inputs, K, activation=tf.tanh)
    z[t] = Normal(loc=loc, scale=0.1)
    h = tf.layers.dense(
        z[t], 512, activation=tf.nn.relu)
    loc = tf.layers.dense(h, D, activation=None)
    x[t] = Normal(loc=loc, scale=0.5)
