MINIMALIST: Mutual INformatIon Maximization for Amortized Likelihood Inference from Sampled Trajectories

Simulation-based inference enables learning the parameters of a model even when its likelihood cannot be computed in practice. One class of methods uses data simulated with different parameters to infer models of the likelihood-to-evidence ratio, or equivalently the posterior function. Here we frame the inference task as an estimation of an energy function parametrized with an artificial neural network. We present an intuitive approach where the optimal model of the likelihood-to-evidence ratio is found by maximizing the likelihood of simulated data. Within this framework, the connection between the task of simulation-based inference and mutual information maximization is clear, and we show how several known methods of posterior estimation relate to alternative lower bounds to mutual information. These distinct objective functions aim at the same optimal energy form and therefore can be directly benchmarked. We compare their accuracy in the inference of model parameters, focusing on four dynamical systems that encompass common challenges in time series analysis: dynamics driven by multiplicative noise, nonlinear interactions, chaotic behavior, and high-dimensional parameter space.


I. INTRODUCTION
Model-based Bayesian inference relies on knowing the probabilistic description of a process. Traditional methods rely on computing the likelihood of the observed data given the model parameters, in order to maximize or sample from the posterior. For many models, in particular those with multiple interacting degrees of freedom or hidden variables, the likelihood function may be impractical to evaluate. In cases where drawing data from the generative process is possible, simulation-based inference techniques can be used as a powerful alternative approach for characterizing the underlying model.
Population genetics provides many examples of such problems. The observed quantities in this context are often based on sequencing data and are "far" from the quantities described by population dynamics models: it is in principle possible to write down likelihood functions, but they typically depend on a number of hidden variables that need to be marginalized out, making their evaluation impractical. Approximate Bayesian Computation (ABC) was first used for posterior inference in the context of population genetics [1], and since then numerous new approaches to simulation-based inference have been developed to answer particular questions of phylodynamics and sequencing data analysis.
More broadly, methods for simulation-based inference can be organized in two classes [2,3]. In the first class, observations and simulated data are compared within the inference process, as in the original ABC approach [4,5]. Methods belonging to the second class proceed in two stages. First, they use a large number of simulations to learn an approximate model for the likelihood function [8,9], or alternatively the posterior function [10], or the likelihood-to-evidence ratio [11][12][13], that is amortized over the simulation examples. The amortized model is then used to evaluate the likelihood of observations and to evaluate the posterior for model parameters. With theoretical developments in machine learning and improvements in computing power, this class of algorithms has seen a renewed interest in the past years.
Here, we model the likelihood-to-evidence ratio as a Boltzmann factor and use it to infer the corresponding energy function. We show that the maximum-likelihood estimation of this factor is equivalent to the maximization of a lower bound to the mutual information between parameters of the simulation and the simulated data. We exploit this equivalence by testing different lower bounds to mutual information as objective functions for simulation-based inference and optimize the parameters of artificial neural networks to approximate the posterior. We compare this approach to a recently developed method [12,13], where a model of the likelihood-to-evidence ratio is learned through the optimization of a binary classifier operating on simulated data, and which has been shown to provide state-of-the-art performance on a variety of tasks [3]. We assess the accuracy of the methods to infer the parameters of four dynamical systems: the Ornstein-Uhlenbeck process, a multidimensional stochastic process with analytically tractable posterior; the birth-death process, with multiplicative noise-driven dynamics; the Susceptible-Infected-Recovered (SIR) model of epidemiology, a simple system with elementary non-linear interactions; and the Lorenz attractor, a dynamical system that exhibits chaotic behavior. The task of finding the parameters of stochastic processes by maximizing the likelihood of observed discrete trajectories is generically difficult [14]. Our experiments encompass the most common challenges in the analysis of time-series data and together show that simulation-based inference offers a viable alternative to analytical methods.

II. METHODS
We aim to estimate parameters θ of a model given a set of data x obtained from stochastic simulation of that model, P(x|θ), with a prior P(θ). First, we reinterpret the likelihood-to-evidence ratio as a Boltzmann factor,

P(x|θ)/P(x) = (1/Z) e^{-E(x,θ)},  (1)

where the energy function E(x, θ) captures how the joint distribution of data and parameters P_joint(x, θ) = P(x|θ)P(θ) deviates from the independent distribution P_indep(x, θ) = P(x)P(θ), as shown schematically in Fig. 1A, in which the pairs (x, θ) sampled from P_joint have lower energies than the samples from P_indep.
The energy E(x, θ) is generally a non-linear function describing the dependence between data and parameters. Z is the partition function, which ensures that the probability density P(x, θ) is normalized. E and Z are each defined up to constants (additive for E, multiplicative for Z). Given the energy function E(x, θ), we recover the posterior probability density P(θ|x) = (1/Z) e^{-E(x,θ)} P(θ).
We will now describe an inference scheme to learn a model of the energy function from simulated data, relying on the flexibility of artificial neural networks. Specifically, we approximate the energy E by a multi-layered network E_φ characterized by a set of parameters φ. Under a given model E_φ, the joint distribution is approximated as

P^φ_joint(x, θ) = (1/Z_φ) e^{-E_φ(x,θ)} P(x)P(θ).  (2)

We simulate samples from the joint distribution, denoted J = {(x_i, θ_i)}_{i=1}^N, by drawing a model parameter from the prior distribution, θ_i ∼ P(θ), and simulating x_i ∼ P(x|θ_i).
To learn the neural network parameters φ, we need to maximize the log-likelihood of the simulated sample J under a given model E_φ,

(1/N) log P^φ_joint(J) = E_J[-E_φ(x, θ)] - log Z_φ + E_J[log P(x)P(θ)],  (3)

where E_J[·] denotes the empirical average over samples J. The partition function is approximated using importance sampling on samples drawn from P_indep,

Z_φ ≈ E_I[e^{-E_φ(x,θ)}],  (4)

where E_I[·] is the empirical average over a large set I of independently drawn parameter and data pairs (x, θ) ∼ P_indep(x, θ) = P(θ)P(x). In practice, I may be obtained by shuffling the indices of J [13]. With this estimate of Z_φ, and noting that the last term of eq. 3 does not depend on φ, the problem is reduced to maximizing

M(φ; I, J) = E_J[-E_φ(x, θ)] - log E_I[e^{-E_φ(x,θ)}],  (5)

which in the infinite data limit constitutes a lower bound (the Donsker-Varadhan representation) to the mutual information I(X; Θ) between simulated data and simulation parameters,

M(φ; I, J) → I(X; Θ) - D_KL(P_joint ‖ P^φ_joint) ≤ I(X; Θ),  (6)

where D_KL(P_joint ‖ P^φ_joint) is the Kullback-Leibler divergence between the true joint distribution and its model (2). This bound was extensively studied in [15] as an estimate of the mutual information from discrete samples drawn from joint distributions. Here, we will use this representation to learn the energy function E_φ, which approximates the likelihood-to-evidence ratio. We refer to this method as Mutual Information Neural Estimation (MINE). Its rationale is presented schematically in Fig. 1B.
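As a concrete illustration, the sketch below evaluates the Donsker-Varadhan objective, E_J[-E_φ] - log E_I[e^{-E_φ}], on a toy Gaussian model where the optimal critic -E is known analytically. The Gaussian setup and all concrete values are illustrative assumptions, not the paper's experiments; in the actual method a trained network E_φ replaces the analytic critic, but the structure of the estimate, including the shuffling trick used to build the independent set I, is the same.

```python
import numpy as np

# Toy model (an illustrative assumption): theta ~ N(0,1), x = theta + Gaussian noise.
rng = np.random.default_rng(0)
N, sigma = 100_000, 1.0
theta = rng.normal(0.0, 1.0, N)            # theta_i ~ P(theta)
x = theta + rng.normal(0.0, sigma, N)      # x_i ~ P(x|theta_i)

def neg_energy(x, th):
    """Optimal critic -E(x, theta) = log P(x|theta) - log P(x), known analytically
    for this toy model; in the paper's scheme a network E_phi is trained instead."""
    log_cond = -0.5 * (x - th) ** 2 / sigma**2 - 0.5 * np.log(2 * np.pi * sigma**2)
    var_x = 1.0 + sigma**2
    log_marg = -0.5 * x**2 / var_x - 0.5 * np.log(2 * np.pi * var_x)
    return log_cond - log_marg

t_joint = neg_energy(x, theta)                    # pairs (x, theta) from P_joint
t_indep = neg_energy(x, rng.permutation(theta))   # shuffled pairs from P_indep

# Donsker-Varadhan objective: M = E_J[-E] - log E_I[exp(-E)]
mi_dv = t_joint.mean() - np.log(np.mean(np.exp(t_indep)))
mi_true = 0.5 * np.log(1.0 + 1.0 / sigma**2)      # analytic I(X; Theta) for this model
```

For this model the mutual information is exactly (1/2) log(1 + 1/σ²), so the estimate can be checked against the closed form.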
The connection to mutual information estimation established above opens the possibility of employing other empirical mutual information estimates to perform simulation-based inference. An alternative lower bound to I(X; Θ), first introduced in [16], is the so-called f-divergence representation (FDIV),

L_f(φ; I, J) = E_J[-E_φ(x, θ)] - E_I[e^{-E_φ(x,θ)}] + 1.  (7)

This estimator defines an alternative objective function to (5) that can be used to infer an optimal energy model E_φ. Note that in the limit of infinite data, N → ∞, and when the class of models {E_φ}_φ can represent the true energy exactly, the maxima of (5) and (7) are both reached at the true value of E(x, θ), where they give the true value of I(X; Θ) (see Appendix A). Outside of this limit, using one of these objective functions may prove more beneficial. In particular, the second term of L_f(φ; I, J) and its gradients may be reliably estimated by averaging over small batches, unlike the second term of M(φ; I, J), because of the logarithm, giving FDIV an advantage for stochastic gradient descent algorithms. While the Donsker-Varadhan bound on the mutual information is tighter, i.e. L_f(φ; I, J) ≤ M(φ; I, J) holds for N → ∞ [15], it is unclear whether it might produce a more reliable estimate of E(x, θ).
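The two properties above can be checked numerically on a Gaussian toy model with a known optimal critic (an illustrative assumption; the L_f form used below is the bound E_J[-E] - E_I[e^{-E}] + 1, matching the description in the text): the f-divergence estimate never exceeds the Donsker-Varadhan one on the same samples, and its second term averages linearly over minibatches, whereas batch-averaging the Donsker-Varadhan estimate introduces an upward bias through the logarithm.

```python
import numpy as np

rng = np.random.default_rng(0)
N, sigma = 100_000, 1.0
theta = rng.normal(0.0, 1.0, N)
x = theta + rng.normal(0.0, sigma, N)

def neg_e(x, th):
    # optimal critic -E = log P(x|theta) - log P(x) for this Gaussian toy model
    lc = -0.5 * (x - th) ** 2 / sigma**2 - 0.5 * np.log(2 * np.pi * sigma**2)
    v = 1.0 + sigma**2
    return lc + 0.5 * x**2 / v + 0.5 * np.log(2 * np.pi * v)

t_joint = neg_e(x, theta)
t_indep = neg_e(x, rng.permutation(theta))

# full-batch estimates of the two bounds
dv = t_joint.mean() - np.log(np.mean(np.exp(t_indep)))      # Donsker-Varadhan (M)
fdiv = t_joint.mean() - np.mean(np.exp(t_indep)) + 1.0      # f-divergence (L_f)

# averages of per-minibatch estimates (100 batches of 1000 samples each)
tj_b = t_joint.reshape(100, -1)
ti_b = t_indep.reshape(100, -1)
dv_batched = np.mean(tj_b.mean(1) - np.log(np.mean(np.exp(ti_b), 1)))
fdiv_batched = np.mean(tj_b.mean(1) - np.mean(np.exp(ti_b), 1) + 1.0)
```

By Jensen's inequality the batched Donsker-Varadhan average can only overestimate the full-batch value, while the batched f-divergence average reproduces it exactly, which is the advantage for stochastic gradients noted in the text.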
A third alternative is to use the original approach to likelihood-to-evidence ratio estimation proposed in [12] and [13]. The energy E(x, θ) may be rewritten in terms of a classifier between the two hypotheses of (x, θ) originating from the joint or the independent distribution in a mixture P_mix = (1/(k+1)) P_joint + (k/(k+1)) P_indep (in [13], k = 1). We define

d(x, θ) = P_joint(x, θ) / (P_joint(x, θ) + k P_indep(x, θ)).  (8)

The classifier is parametrized by a neural network, d = d_φ, and is trained by minimizing the binary cross-entropy

S(φ; I, J) = -(1/(k+1)) E_J[log d_φ(x, θ)] - (k/(k+1)) E_I[log(1 - d_φ(x, θ))].  (9)

Similarly to objectives (5) and (7), in the N → ∞ limit and when the class of models {d_φ}_φ contains the true d, this cross-entropy is minimized at the true value d(x, θ).
In [17] it was shown that the classifier can also be used as an estimator of mutual information by computing the mean logarithm of the predicted likelihood-to-evidence ratio,

I(X; Θ) ≈ E_J[log( k d_φ(x, θ) / (1 - d_φ(x, θ)) )].  (10)

Indeed, in the infinite data limit, all three objective functions, (5), (7), and (9), share the same optimum; see Appendix A. For high-dimensional random variables, the binary cross-entropy (9) sets a tighter lower bound and a more accurate estimate of the mutual information than the f-divergence estimator (7) [17]. However, its accuracy was not directly compared to the estimator we propose in eq. 5. We will refer to the inference approach based on minimizing the binary cross-entropy loss in eq. 9 as BCE.
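The classifier route can be sketched end to end with plain logistic regression on a Gaussian toy model (an illustrative assumption: the paper trains a multilayer perceptron d_φ, and the quadratic features below are hand-picked so that the toy problem is exactly realizable). With k = 1, minimizing the binary cross-entropy between joint and shuffled pairs and then averaging the predicted log ratio over joint samples recovers the mutual information.

```python
import numpy as np

rng = np.random.default_rng(0)
N, sigma = 20_000, 1.0
theta = rng.normal(0.0, 1.0, N)
x = theta + rng.normal(0.0, sigma, N)
theta_sh = rng.permutation(theta)          # shuffled -> independent pairs (k = 1)

def features(x, th):
    # quadratic features suffice for this Gaussian toy problem
    # (an assumption replacing the paper's multilayer perceptron)
    return np.stack([np.ones_like(x), x**2, x * th, th**2], axis=1)

F = np.concatenate([features(x, theta), features(x, theta_sh)])
y = np.concatenate([np.ones(N), np.zeros(N)])   # label 1 = joint, 0 = independent

w = np.zeros(F.shape[1])
for _ in range(4000):                      # full-batch gradient descent on the BCE loss
    p = 1.0 / (1.0 + np.exp(-F @ w))       # d_phi(x, theta)
    w -= 0.2 * F.T @ (p - y) / len(y)

# mean log of the predicted likelihood-to-evidence ratio over joint samples (k = 1):
# the logit of d_phi is exactly log[d/(1-d)]
mi_bce = (features(x, theta) @ w).mean()
mi_true = 0.5 * np.log(1.0 + 1.0 / sigma**2)
```

The logit of the trained classifier directly approximates the log likelihood-to-evidence ratio, i.e. the negative energy up to the constant log Z.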
Finally, once an energy model E_φ has been trained by optimizing M(φ; I, J), L_f(φ; I, J), or S(φ; I, J) over φ, the posterior of parameters given an observation x may be calculated as P(θ|x) = (1/Z_φ) e^{-E_φ(x,θ)} P(θ). We note that if the prior is changed, the energy function needs to be re-inferred. When θ is of high dimension, scanning the posterior for all possible values of θ may be impractical. In that case, we generate samples of θ from the posterior using a Markov-chain Monte-Carlo method with the Metropolis-Hastings acceptance probability

p_acc(θ → θ') = min{ 1, [e^{-E_φ(x,θ')} P(θ') q(θ|θ')] / [e^{-E_φ(x,θ)} P(θ) q(θ'|θ)] },  (11)

with q an ergodic Markov transition probability in the parameter space. This procedure generalizes in a straightforward way to the case of multiple observations drawn with the same set of parameters.
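A minimal sketch of this posterior-sampling step, assuming a scalar parameter and the analytic energy of a Gaussian toy model in place of a trained E_φ (all concrete values are illustrative): the chain targets e^{-E(x,θ)}P(θ), and with a symmetric Gaussian proposal the q factors cancel in the acceptance ratio.

```python
import numpy as np

rng = np.random.default_rng(0)
x_obs, sigma = 1.0, 1.0   # one observation of the toy model x = theta + noise

def neg_energy(x, th):
    # analytic -E = log P(x|theta) - log P(x); a trained E_phi would be used instead
    return -0.5 * (x - th) ** 2 / sigma**2 + 0.5 * x**2 / (1.0 + sigma**2)

def log_prior(th):
    return -0.5 * th**2   # P(theta) = N(0, 1)

def metropolis_hastings(n_steps=20_000, step=1.0, burn_in=2_000):
    th = 0.0
    log_p = neg_energy(x_obs, th) + log_prior(th)
    samples = np.empty(n_steps)
    for i in range(n_steps):
        prop = th + step * rng.normal()                 # symmetric proposal: q cancels
        log_p_prop = neg_energy(x_obs, prop) + log_prior(prop)
        if np.log(rng.uniform()) < log_p_prop - log_p:  # accept with min(1, ratio)
            th, log_p = prop, log_p_prop
        samples[i] = th
    return samples[burn_in:]

samples = metropolis_hastings()
```

For this conjugate toy model the true posterior is N(x_obs/2, 1/2), so the chain's mean and variance can be checked directly.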

III. RELATIONS TO OTHER WORK
The presented methods are related to several recent approaches. For completeness, we discuss the similarities and differences between the presented and other methods.

Posterior inference methods
An alternative to likelihood-to-evidence ratio estimation (RE) is the framework of Neural Posterior Estimation (NPE), which consists in fitting a conditional density estimator directly to the posterior. Several recently developed NPE algorithms [18][19][20] have been compared with the BCE method in [3], as part of a public benchmark of ABC, NPE, and RE methods across several simulation-based inference tasks. The comparison reveals that there is no single best algorithm; NPE and RE approaches yield similar performance and consistently outperform ABC. The benchmark we present below is limited to different RE methods and compares alternative loss functions.

Noise contrastive estimation
In [21], the NPE and the RE methods have been presented as two instances of a more general scheme, which the authors termed noise contrastive estimation. The learning algorithm introduced in [21] is based on a multi-sample loss function that allows interpolating between the two approaches. Importantly, ref. [21] also suggests a link between the field of mutual information estimation and simulation-based inference, since the multi-sample loss function is also a lower bound to I(X; Θ). Here we formalize this link and test its applicability on a variety of examples by proposing two new methods (MINE and FDIV) for likelihood-to-evidence ratio estimation. We detail how our approach fits within the noise contrastive estimation scheme in Appendix B.

Learning optimal experimental designs and sufficient statistics
An important direction of research in simulation-based inference aims at finding experimental designs that are most informative about the model parameters [24,25]. In [22], an optimal design is found by maximizing the mutual information between the design choice and the target variable of interest in the experiment. This task extends the original RE framework to optimize over an additional hidden variable, the design, that specifies the simulation setup.
In [23], mutual information is maximized to build sufficient summary statistics of the simulated data. This technique allows for automatically learning an optimal representation of the data without expert knowledge of process-specific observables. The statistics obtained this way can then generically be passed as input to simulation-based inference methods. While the procedure to find summary statistics is similar to the MINE and BCE methods described here, the algorithm of [23] does not exploit the energy model for posterior inference. Instead, the model is discarded and the statistics are independently utilized with another simulation-based inference method.
Concurrently to the first version of this paper, a new development in this line of research [26] has been proposed. The authors compare different mutual information lower bounds for Bayesian experimental design and observe a consistent trend with our findings, namely that the BCE method performs overall better than other estimators. It is an independent validation of our results using a closely related task.

IV. EXPERIMENTS
We set out to examine the 3 presented methods for estimating the likelihood-to-evidence ratio (MINE, FDIV, and BCE) to infer the parameters of simple dynamical models from discrete samples of their trajectories. We chose 4 contexts that together encompass the range of difficulties in the inference of model parameters: (i) the stochastic birth-death process, (ii) the epidemiological Susceptible-Infected-Recovered (SIR) process, (iii) the multidimensional Ornstein-Uhlenbeck process, and (iv) the chaotic system of the Lorenz attractor. Example trajectories of each model are shown in Fig. 2.

Ornstein-Uhlenbeck process
The Ornstein-Uhlenbeck (OU) process is a multidimensional Markov process driven by additive Gaussian white noise. It is applied in many branches of science, notably to describe the velocity of a Brownian particle [27], the fluctuations of interest rates [28], or the evolution of continuous phenotypic traits [29,30]. The trajectories are a solution of a stochastic differential equation,

dx = -γ(x - µ) dt + σ dW,  (12)

where µ is the stationary mean and γ is the damping matrix, assumed to be symmetric. W stands for the multidimensional Wiener process, and σ is the noise amplitude. We use the Euler-Maruyama integration scheme to obtain the numerical solutions of this equation [31].
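A minimal Euler-Maruyama integrator for the OU dynamics might look as follows (a sketch; the step size, horizon, and parameter values are illustrative assumptions, not those of the paper's experiments):

```python
import numpy as np

def simulate_ou(gamma, mu, sigma, x0, dt, n_steps, rng):
    """Euler-Maruyama scheme for dx = -gamma (x - mu) dt + sigma dW."""
    d = len(x0)
    traj = np.empty((n_steps + 1, d))
    traj[0] = x0
    for t in range(n_steps):
        dw = rng.normal(0.0, np.sqrt(dt), d)   # Wiener increments, var = dt
        traj[t + 1] = traj[t] - gamma @ (traj[t] - mu) * dt + sigma @ dw
    return traj

rng = np.random.default_rng(0)
d = 2
traj = simulate_ou(np.eye(d), np.zeros(d), np.eye(d), np.zeros(d),
                   dt=0.01, n_steps=20_000, rng=rng)
```

With γ = σ = I the stationary variance is 1/2 per component, a quick sanity check for the integrator.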
The corresponding Fokker-Planck equation for this process can be solved to obtain the true posterior (see Appendix C).
To study how the performance of each method scales with dimension, for 1 ≤ d ≤ 5, we fix µ = 0 and σ = I, where I is the identity matrix, and infer the damping matrix parametrized as γ = I + εg, where g is a Gaussian orthogonal matrix and ε < 1, which ensures that the damping matrix is positive definite (see Appendix C for more details). Since γ is symmetric, in the d-dimensional case we have d(d+1)/2 parameters to infer. The prior is given by the Gaussian Orthogonal Ensemble distribution density, P(g) ∝ exp(-d Tr(g²)/4).
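Sampling from this GOE prior can be sketched as below. The normalization follows from P(g) ∝ exp(-d Tr(g²)/4), which implies variance 2/d for diagonal and 1/d for off-diagonal entries; the value of ε is an illustrative assumption.

```python
import numpy as np

def sample_goe(d, rng):
    """Draw a symmetric g with density P(g) ~ exp(-d Tr(g^2)/4),
    i.e. Var(g_ii) = 2/d and Var(g_ij) = 1/d for i != j."""
    a = rng.normal(0.0, 1.0, (d, d))
    return (a + a.T) / np.sqrt(2.0 * d)

rng = np.random.default_rng(0)
eps = 0.1                                       # illustrative value of epsilon
gamma = np.eye(3) + eps * sample_goe(3, rng)    # damping matrix gamma = I + eps*g
```

The symmetrized construction gives exactly the stated entry variances, which can be verified empirically over many draws.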

Birth-death process
The birth-death process is a discrete one-dimensional Markov process with multiplicative demographic noise. The number of individuals n is subject to variation due to stochastic birth and death events, occurring at rates nλ and nδ, respectively. We use the Gillespie algorithm to sample trajectories from this process [32]. We parametrize the process with the average exponential drift α = λ - δ and the noise timescale β = λ + δ. We use uniform priors for both of these variables: P(α) = U(-2, 2) and P(β) = U(2, 20).
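The Gillespie simulation of this process can be sketched as follows (rates, initial population, and horizon are illustrative assumptions): waiting times are exponential with total rate n(λ + δ), and each event is a birth with probability λ/(λ + δ).

```python
import numpy as np

def gillespie_birth_death(lam, delta, n0, t_max, rng):
    """Exact stochastic simulation (Gillespie) of the birth-death process:
    n -> n+1 at rate n*lam, n -> n-1 at rate n*delta."""
    t, n = 0.0, n0
    times, counts = [t], [n]
    while n > 0:
        t += rng.exponential(1.0 / (n * (lam + delta)))  # waiting time to next event
        if t > t_max:
            break
        n += 1 if rng.uniform() < lam / (lam + delta) else -1
        times.append(t)
        counts.append(n)
    return np.array(times), np.array(counts)

rng = np.random.default_rng(0)
times, counts = gillespie_birth_death(1.2, 1.0, 50, 5.0, rng)
```

The mean obeys E[n(t)] = n0 e^{αt} with α = λ - δ, which gives a quick consistency check by averaging over many trajectories.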

SIR model
The Susceptible-Infected-Recovered (SIR) model is a staple of epidemiological modeling. Any member of the susceptible population S can be infected at rate β upon contact with one of the I infected individuals. The infected individuals can become resistant R at a rate γ. We simulate the trajectories of the SIR model using the Gillespie algorithm [32]. We infer the rates β and γ under uniform priors P(β) = P(γ) = U(0, 1), given samples from the (S, I) trajectories.
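A Gillespie sketch of the SIR dynamics is shown below. The propensity convention β S I / N and the population sizes are illustrative assumptions (the paper does not state whether the contact rate is density- or frequency-dependent; rescale β for the other convention).

```python
import numpy as np

def gillespie_sir(beta, gamma, S0, I0, t_max, rng):
    """Stochastic SIR: infection at rate beta*S*I/N, recovery at rate gamma*I.
    Returns rows (t, S, I)."""
    N = S0 + I0
    t, S, I = 0.0, S0, I0
    out = [(t, S, I)]
    while t < t_max and I > 0:
        a_inf = beta * S * I / N       # propensity of S + I -> 2I
        a_rec = gamma * I              # propensity of I -> R
        total = a_inf + a_rec
        t += rng.exponential(1.0 / total)
        if rng.uniform() < a_inf / total:
            S, I = S - 1, I + 1
        else:
            I -= 1
        out.append((t, S, I))
    return np.array(out)

rng = np.random.default_rng(0)
traj = gillespie_sir(0.6, 0.2, 990, 10, 60.0, rng)
```

Basic invariants of the process (S never increases, I stays non-negative, S + I never exceeds the population) make convenient sanity checks.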

Lorenz attractor
The Lorenz system is a 3-dimensional chaotic system governed by the equations

ẋ = σ(y - x),  ẏ = x(ρ - z) - y,  ż = xy - βz.  (14)

We simulate this deterministic process starting from a random position (x0 + η, y0, z0), where η is the noise in the initial position drawn from a uniform distribution, η ∼ U(-0.1, 0.1). We fix the parameters σ = 10 and β = 8/3 and set out to infer ρ. The ensemble of trajectories starting in the vicinity of x0 diverges with the characteristic time set by the inverse of the largest Lyapunov exponent of the system, λ = λ(ρ). We start sampling from the trajectories at a random initial time drawn from a Gamma distribution, t0 ∼ Γ(k = 5, θ = 2). We then take 5 samples from each trajectory at time windows that are larger than the characteristic time for chaotic divergence, ∆t = 2λ⁻¹. We set λ ≈ 0.905, which is the Lyapunov exponent for ρ = 28, a transition point where some but not all solutions of the Lorenz system are chaotic. We infer the parameter ρ in a chaotic regime using a uniform prior P(ρ) = U(30, 40).
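Integration of eq. 14 and the sensitivity to initial conditions can be sketched as follows (fixed-step RK4 with the step size, horizon, and perturbation below are illustrative assumptions; the paper does not specify its integrator):

```python
import numpy as np

def lorenz_traj(rho, x0, dt=0.001, n_steps=30_000, sigma=10.0, beta=8.0 / 3.0):
    """Fixed-step 4th-order Runge-Kutta integration of the Lorenz equations."""
    def f(v):
        x, y, z = v
        return np.array([sigma * (y - x), x * (rho - z) - y, x * y - beta * z])
    traj = np.empty((n_steps + 1, 3))
    v = np.array(x0, dtype=float)
    traj[0] = v
    for i in range(n_steps):
        k1 = f(v)
        k2 = f(v + 0.5 * dt * k1)
        k3 = f(v + 0.5 * dt * k2)
        k4 = f(v + dt * k3)
        v = v + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)
        traj[i + 1] = v
    return traj

a = lorenz_traj(28.0, [1.0, 1.0, 1.0])
b = lorenz_traj(28.0, [1.0 + 1e-4, 1.0, 1.0])   # slightly perturbed initial condition
```

Over 30 time units, a perturbation of 10⁻⁴ is amplified by roughly e^{λt} until it saturates at the attractor scale, which is the divergence the sampling scheme above is designed to outrun.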
The artificial neural networks used for all 3 methods were multilayer perceptrons [33] with two hidden layers and a hyperbolic tangent activation function. This architecture choice was found to be expressive enough across tasks and, thanks to its simplicity, allowed us to perform a well-grounded comparison of the three methods without advanced regularisation techniques (see Appendix D for details on hyperparameter choices).

FIG. 3. In order to evaluate the posteriors we first choose a reference hypothesis θ*: for Ornstein-Uhlenbeck, µ* = 5 and σ* = 1; for birth-death, β* = 10 and α* = 0.2; for SIR, β* = 0.6 and γ* = 0.2; for the Lorenz attractor, ρ* = 35. We show posteriors P_∞(θ|x_{1:M}) calculated using models trained with N = 10^7 trajectories. For the Ornstein-Uhlenbeck process the exact posterior density P(θ|x_{1:M}) is also shown.

V. RESULTS
Given enough data and a powerful enough neural network, we expect the optima of the objective functions M, L_f, and S to converge, and the estimated energy function should approach the true value.
To confirm the validity of the proposed methods, in the first set of experiments we use a large number of simulations to study the convergence of the posterior functions. To this end, we choose a hypothesis θ* and simulate M trajectories, x_{1:M} = {x_m}_{m=1}^M, x_m ∼ P(x|θ*), with M = 2 for SIR and M = 5 for the other tasks. We evaluate the posteriors P̂_l(θ|x_{1:M}), with l indexing one of the three methods (MINE, FDIV, or BCE) and φ the optimal parameter set for each method (we drop the explicit dependency on φ in P̂_l for ease of notation). The posteriors converge when the amortized inference is done on a training set with at least N = 10^7 samples (and 10^7 samples for validation); see Fig. 3. We define a reference posterior P_∞(θ|x_{1:M}) ≡ ⟨P̂_l(θ|x_{1:M})⟩_l, obtained with N = 10^7, as the average over the three estimators. In the case of the Ornstein-Uhlenbeck process, where the true posterior P(θ|x_{1:M}) can be calculated analytically, P_∞(θ|x_{1:M}) agrees with the analytical prediction. This first result confirms the validity of our approach.
As the sample size N is reduced, the amortized posteriors differ. To study the performance of the three methods under different simulation budgets N, for each task we simulate N_tot = 2 × 10^7 samples J = {(x_i, θ_i)}. We perform the inference of the amortized likelihood-to-evidence ratio with varying simulation budgets, where both the training and the validation data are equal-sized subsamples of J with N ∈ {10^4, 10^5, 10^6}. Inference and comparison are performed 10 times on independent subsamples of J. To obtain samples from the independent set I we shuffle the joint samples k = 5 times, so that N_I = 5N. A larger shuffled data set N_I can improve the inference but at the cost of computing power, which sets a trade-off between performance and training time.
We compare the accuracy of the 3 inference methods (MINE, FDIV, BCE) on the 4 tasks (OU, birth-death, SIR, Lorenz) based on the following three metrics.

Global comparison
The first metric used for the benchmark is the mutual information given a density estimator, computed with eq. 5. For each N, it is evaluated on test data composed of the remaining N_tot - N samples. Unlike the other two comparisons (see below), it is a global metric that tests the approximation of the likelihood-to-evidence ratio over all θ and x values. For this reason, we use it to perform hyperparameter tuning for each task and each objective with N = 10^5; see Appendix D.

FIG. 4. Benchmarking of the methods. We compare the three objectives M, eq. 5 (MINE), L_f, eq. 7 (FDIV), and S, eq. 9 (BCE), for 3 different metrics. We perform 10 replicates of the inference and comparison for simulation budgets N ∈ {10^4, 10^5, 10^6} for 4 systems: Ornstein-Uhlenbeck (A,B,C), birth-death (D,E,F), SIR (G,H,I), and Lorenz attractor (J,K,L). In the first row we compare the mutual information on held-out test data using eq. 5 with the estimated E_φ. For the following two metrics we need to instead choose a hypothesis θ*; see Fig. 3 for the exact values. In the second row we compare the Jensen-Shannon divergence D_JS(P_∞(θ|x_{1:M}), P^N_l(θ|x_{1:M})) between the reference and inferred posteriors. In the last row we compare the Jensen-Shannon divergence D_JS(P(x|θ*), P^N_l(x|θ*)) using sampled trajectories from the simulator P(x|θ*) and the inferred distribution P̂_l(x|θ*).

For all four tasks, the value of the estimated mutual information grows with the simulation budget and yields comparable performances for the 3 methods (Fig. 4 A, D, G, J). For the Ornstein-Uhlenbeck process, the BCE method reaches consistently higher values of mutual information. For the Lorenz attractor, the MINE method is significantly outperformed by the other methods in the low-data limit (N = 10^4).

Posterior comparison
The objective of simulation-based inference is to find the posterior distribution over model parameters. To characterize the inference accuracy as a function of the simulation budget N and the method l, we evaluate the Jensen-Shannon divergence between the inferred and the reference posterior, D_JS(P_∞(θ|x_{1:M}), P^N_l(θ|x_{1:M})), where D_JS(p, q) = (1/2) ∫ [p(x) log(p(x)/m(x)) + q(x) log(q(x)/m(x))] dx with m(x) = (p(x) + q(x))/2, by scanning through the parameter space with the prior P(θ). A larger Jensen-Shannon divergence indicates a larger deviation between the inferred posterior and the reference (i.e., a lower performance).
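For densities evaluated on a common parameter grid, as when scanning θ against the prior, D_JS can be computed directly; a minimal sketch, with Gaussian test densities as illustrative inputs:

```python
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence between two densities on a common grid.
    Inputs are normalized to probability vectors; 0*log(0) is treated as 0."""
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0
        return np.sum(a[mask] * np.log(a[mask] / b[mask]))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

grid = np.linspace(-5.0, 5.0, 1001)
p = np.exp(-0.5 * grid**2)              # illustrative densities on the grid
q = np.exp(-0.5 * (grid - 1.0) ** 2)
djs = js_divergence(p, q)
```

D_JS is symmetric, vanishes only for identical distributions, and is bounded above by log 2, which makes it a convenient normalized comparison score.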
All methods show comparable performances, and the Jensen-Shannon divergence decays as a function of the simulation budget N (Fig. 4 B, E, H, K, and Fig. 5 for the Ornstein-Uhlenbeck process with d ≥ 1).

FIG. 5. The Ornstein-Uhlenbeck process in d ≥ 1. We compare the three objective functions M (MINE), L_f (FDIV), and S (BCE) for dimensions d = 1, 2, 3, 4, 5. We perform 10 replicates of the inference with simulation budgets N = 10^4, 10^5, 10^6 for changing dimension d = 1, 2, 3, 4, 5. We compare the posterior with the analytical prediction for a hypothesis (g*)_ij = -1. We compute the Jensen-Shannon divergence D_JS(P(θ|x_{1:M}), P^N_l(θ|x_{1:M})) between the true and inferred posteriors for each element of the damping matrix γ independently. We show the divergence for diagonal (A-E) and off-diagonal terms (F-I).

At N = 10^4 the accuracy of the posterior inference is decreased for all objective functions, as reflected by the large variance of the D_JS. In the case of the Lorenz attractor, this simulation budget is also insufficient for the MINE method, which performs significantly worse than the classifier-based (BCE) and f-divergence (FDIV) estimators. Fig. 5 presents the comparison of the objective functions for the inference of the damping matrix elements in the high-dimensional Ornstein-Uhlenbeck process. We compare the performance in terms of the estimated posterior's divergence from the analytical prediction. We compute the Jensen-Shannon divergence independently for the marginal posterior distributions over each element of the damping matrix (a total of d(d+1)/2 unique elements in dimension d). We present the results independently for diagonal and off-diagonal elements of the matrix γ (in dimension d there are d diagonal and d(d-1)/2 off-diagonal elements).
The efficacy of the three methods is comparable, and the average performance does not significantly decrease up to d = 5, for which we estimate 15 elements of the matrix g. At the same time, the variation in the Jensen-Shannon divergence increases with dimension as the inference task becomes more difficult. This is particularly pronounced for the MINE method in dimensions 4 and 5 at low (N = 10^4) as well as intermediate (N = 10^5) simulation budgets. In these instances, using the f-divergence objective function yields the best performance.

Likelihood comparison
The third metric is the Jensen-Shannon divergence D_JS(P(x|θ*), P^N_l(x|θ*)) between the true and approximated likelihood for a given hypothesis θ*. This D_JS cannot be directly evaluated by summing over x, because x is typically of high dimension. We thus rely on samples from these two distributions and infer an additional classifier to estimate D_JS; see Appendix E.
The performance of the 3 methods is comparable (Fig. 4 C, F, I, L). For the Ornstein-Uhlenbeck process, the BCE method infers more accurate likelihood functions at N = 10^4 and N = 10^5, but it is outperformed by MINE at higher simulation budgets.
The results of the benchmark shown in Fig. 4 suggest that all estimators show reliable performances across different tasks and simulation budgets. While the first metric is global and the two other metrics are local, they draw a consistent picture. A higher simulation budget enhances the performance of all methods. The BCE method tends to perform better at the lowest simulation budget. All three methods perform similarly in the middle and high data regimes.

VI. CONCLUSION
We analyzed the problem of inferring an amortized estimator for the likelihood-to-evidence ratio over model parameters, using simulated data. We showed that this inference can be performed by maximization of the mutual information between simulated data and parameters of the model. This formulation captures an intuition that inference can be performed when we can extract the dependence between parameters and observed data, as measured by the mutual information. Our formalism opens up possibilities for using algorithms and techniques developed in the context of mutual information estimation [38] for inverse problems.
The likelihood function we propose is equivalent to the mutual information bound analyzed in [15]. However, while in [15] the focus is on the estimation of the absolute value of this quantity, we are interested in the inferred energy function that can be used to evaluate the posterior distribution for model parameters. Previous work that used classifiers for simulation-based inference [12,13] also fits naturally within our framework, since logistic regression is linked to mutual information estimation [17]. The methods we studied rely on two lower-bound estimators of mutual information, which are based on (i) the Donsker-Varadhan [15] and (ii) the f-divergence [16] representations of the Kullback-Leibler divergence. It would be interesting to explore other known mutual information estimators for simulation-based inference [38].
We showed that the mutual information-based methods (MINE and FDIV), implemented in flexible neural networks, can reliably infer the posterior of the parameters and give results consistent with the previously proposed classifier-based technique (BCE) [12,13] when the simulation budget is sufficient. We benchmarked the three approaches and found that their performances are comparable in the intermediate data regime, while in the low data regime the classifier-based method performs consistently better. The main limitation of the two proposed objective functions M and L_f is that they require large simulation budgets for accurate inference.
Our choice to implement the neural network as a multilayer perceptron with two hidden layers was motivated by having a simple and reliable architecture, to better focus on the relative performance of the different objective functions. For the specific task of inferring model parameters from discrete samples of trajectories, absolute performance could be increased by choosing network architectures adapted to the data structure, such as convolutional and recurrent layers.
Existing approaches to simulation-based inference, such as ABC, suffer from the need to define ad-hoc summary statistics to be matched between data and model. An important property of mutual information is its invariance upon the reparametrization of its variables. This enables inference and comparison of different parametrizations of the observed data, as different choices can be evaluated using the absolute value of the mutual information. A specific application that could be interesting to explore is inference for population genetics models, where the choice of summary statistics to use for ABC analysis has always been critical, and where the ability to flexibly compare different parametrization choices greatly improves performance, as shown in [23].
Another possibility would be to explore more principled regularization techniques, such as the information bottleneck method [39]. This approach could be used to infer summary statistics of the data that are maximally informative about the parameters of the model. The summary statistics could then be added as additional variables for the observations of related tasks, such as model extensions, in a transfer-learning fashion.
In conclusion, our work helps to clarify the link between mutual information estimation and simulation-based inference. We believe that this connection can be a fruitful source of improved methods for amortized inference.
FIG. 1. Schematic of the approach. (A) Pairs (x, θ) sampled from P_joint have lower energies than pairs sampled from P_indep. (B) Rationale of the inference scheme: the simulator produces joint samples, which are shuffled to obtain independent samples, and the energy model is learned to distinguish the two.
n q h 2 j D X l T N K G Y Y b T d q o o F j G n r X h 8 O / N b j 1 R p l s g H M 0 l p J P B Q s g E j 2 D i p 5 5 e y r h J I M 2 E 5 N o m a 9 v x y U A n m Q K s k z E k Z c t R 7 / l e 3 n x A r q D S E Y 6 0 7 Y Z C a K M P K M M L p t N i 1 m q a Y j P G Q d h y V W F A d Z f P D p + j M K X 0 0 S J Q r a d B c / T 2 R Y a H 1 R M S u U 2 A z 0 s v e T P z P 6 1 g z u I 4 y J l N r q C S L R Q P L k U n Q L A X U Z 4 o S w y e O Y K K Y u x W R E V a Y G J d V 0 Y U Q L r + 8 S pr V S n h R q d 5 f l m s 3 e R w F O I F T O I c Q r q A G d 1 C H B h C w 8 A y v 8 O Y 9 e S / e u / e x a F 3 z 8 p l j + A P v 8 w c + j 5 N 4 < / l a t e x i t > I = {(x i , ✓ ⇡(i) )} < l a t e x i t s h a 1 _ b a s e 6 4 = " e P Q H K 0 + 2 q Z m I 6 u d A F v O p C j 6 N S q g FIG. 1. (A)Distribution of energies of independent and joint pairs (x, θ).Pairs from the joint distribution have lower energy E, while pairs from the independent distribution have higher energy, as the majority of these independent samples are relatively unlikely under a joint model.(B) Schematic of the method.We first sample parameters θ from the prior P (θ) and then sample observations x from the simulator P (x|θ) to obtain pairs (xi, θi).In order to generate pairs (xi, θj) from the independent distribution we shuffle the two initial vectors.Both sets are used to infer a model of energy E φ by maximizing a log-likelihood in(5) with φ parameters of the artificial neural network used for the inference.
where π is a random permutation of N elements; the shuffle can be applied multiple times with different permutations. Counting all possible combinations, the set I can have a maximal size of max(N_I) = N^2 − N under a fixed simulation budget. We denote the relative size of the two sets by k = N_I/N.
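As a concrete illustration, the joint set {(x_i, θ_i)} and the shuffled independent set I can be constructed as follows. This is a minimal sketch: the Gaussian prior and simulator are hypothetical stand-ins for the dynamical systems studied here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the prior P(theta) and the simulator P(x|theta).
prior = lambda n: rng.normal(0.0, 1.0, size=n)
simulator = lambda theta: rng.normal(theta, 0.1, size=5)  # short "trajectory"

def joint_pairs(n):
    """Draw n pairs (x_i, theta_i) from the joint distribution."""
    thetas = prior(n)
    xs = np.stack([simulator(t) for t in thetas])
    return xs, thetas

def independent_pairs(xs, thetas, k):
    """Build I = {(x_i, theta_pi(i))} by applying k random permutations,
    discarding fixed points so no joint pair leaks into I."""
    n = len(xs)
    x_out, t_out = [], []
    for _ in range(k):
        perm = rng.permutation(n)
        keep = perm != np.arange(n)  # drop indices i with pi(i) == i
        x_out.append(xs[keep])
        t_out.append(thetas[perm[keep]])
    return np.concatenate(x_out), np.concatenate(t_out)

xs, thetas = joint_pairs(1000)
x_ind, t_ind = independent_pairs(xs, thetas, k=2)
# N_I is at most k * N here, bounded overall by N^2 - N
```

Discarding the fixed points of each permutation keeps the two sets disjoint; repeating the shuffle grows N_I toward the N^2 − N bound at no extra simulation cost.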

FIG. 2. Example trajectories (A,C,E,G) and posterior inference (B,D,F,H) for Ornstein-Uhlenbeck (A,B), Birth-Death (C,D), SIR (E,F), and Lorenz attractor (G,H) systems. Trajectories are simulated with the parameter marked in red on the posterior plots. Circles indicate the discrete observations used for inference. Posteriors were estimated over 10 trajectories.
FIG. 4. Benchmarking of the methods. We compare the three objectives M, eq. 5 (MINE), L_f, eq. 7 (FDIV), and S, eq. 8 (BCE) using 3 different metrics. We perform 10 replicates of the inference and comparison for simulation budgets N ∈ {10^4, 10^5, 10^6} for 4 systems: Ornstein-Uhlenbeck (A,B,C), Birth-Death (D,E,F), SIR (G,H,I), and Lorenz attractor (J,K,L). In the first row we compare the mutual information on held-out test data using eq. 5 with the estimated E_φ. For the following two metrics we instead need to choose a hypothesis θ*; see Fig. 3 for the exact values. In the second row we compare the Jensen-Shannon divergence D_JS(P^∞(θ|x_1:M), P_l^N(θ|x_1:M)) between the reference and inferred posteriors. In the last row we compare the Jensen-Shannon divergence D_JS(P(x|θ*), P_l^N(x|θ*)) using sampled trajectories from the simulator P(x|θ*) and the inferred distribution P_l^N(x|θ*).
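The Jensen-Shannon divergences in the last two rows are computed between distributions known only through samples. A minimal histogram-based estimator for one-dimensional samples is sketched below; the binning and sample sizes are illustrative and not the paper's exact procedure.

```python
import numpy as np

def js_divergence(samples_p, samples_q, bins=50):
    """Estimate D_JS (in bits) between two 1-d sample sets via a shared histogram."""
    lo = min(samples_p.min(), samples_q.min())
    hi = max(samples_p.max(), samples_q.max())
    edges = np.linspace(lo, hi, bins + 1)
    p, _ = np.histogram(samples_p, bins=edges)
    q, _ = np.histogram(samples_q, bins=edges)
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)  # mixture distribution

    def kl(a, b):
        mask = a > 0  # wherever a > 0, the mixture m is also > 0
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

rng = np.random.default_rng(1)
same = js_divergence(rng.normal(0, 1, 10_000), rng.normal(0, 1, 10_000))
diff = js_divergence(rng.normal(0, 1, 10_000), rng.normal(3, 1, 10_000))
# identical distributions give D_JS near 0; nearly disjoint ones approach 1 bit
```

With base-2 logarithms D_JS is bounded in [0, 1], which makes it a convenient scale for comparing inferred and reference distributions across systems and simulation budgets.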