Bilevel Optimization for Traffic Mitigation in Optimal Transport Networks

Global infrastructure robustness and local transport efficiency are critical requirements for transportation networks. However, since passengers often travel greedily to maximize their own benefit and trigger traffic jams, overall transportation performance can be heavily disrupted. We develop adaptation rules that leverage Optimal Transport theory to effectively route passengers along their shortest paths while also strategically tuning edge weights to optimize traffic. As a result, we enforce both global and local optimality of transport. We prove the efficacy of our approach on synthetic networks and on real data. Our findings on the International European highways suggest that thoughtfully devised routing schemes might help to lower car-produced carbon emissions.

Introduction.-Transportnetworks are ubiquitous in nature and engineering, spanning from living organisms to cities and telecommunications.Many of these systems can be modeled by adaptation rules that follow the principle of minimum energy, regulating edge flows to optimize transportation costs.Examples in biology are plants, whose profiles emerge from a trade-off between minimization of hydraulic resistance and carbon cost [1], and leaves, shaped by the interplay of nutrients' transport efficiency and robustness to damage [2][3][4].
Similarly, adaptation rules have been employed to model traffic flows in urban transportation by jointly minimizing the energy dissipated by the passengers and the construction cost of the infrastructure [5][6][7][8][9][10][11].While these models set forth a first approach to simulate traffic flows using adaptation, they crucially neglect that passengers in a transportation network do not move cohesively to minimize a unique cost.Instead, they choose their routes greedily to maximize their benefit (Wardrop's first principle) [12][13][14].As a consequence, transport networks may be globally inefficient.
In this work, we propose a set of adaptation equations to find traffic flows that mitigate congestion, considered as a proxy for global efficiency, while trading off against the shortest routes.
We frame the problem in a bilevel optimization setup, which poses a competition between greedy passengers and a network manager.The passengers minimize their origin-destination path cost seeking for the User Equilibrium [15] (lower-level problem), whereas the network manager guarantees global efficiency by mitigating traffic bottlenecks on edges to achieve the System Optimum (upper-level problem), while implicitly accounting for passengers' shortest path.We tackle the optimization problem by alternating Optimal Transport (OT)-inspired adaptation rules for the lower-level optimization, and a Projected Stochastic Gradient Descent (PSGD) scheme for the upper-level optimization.
Traffic mitigation is performed by minimizing a quadratic loss function that penalizes edges whose traffic exceeds a prefixed threshold.The minimization problem can be treated analytically by assuming that the network edges are endowed with capacities and weights (resistances) and their flows are the gradient of a scalar potential, as for electrical networks.We derive closed-form gradients for the weights, which can be interpreted as the cost that passengers pay for traveling.In practice, network managers would implement these weights by strategically designing incentives or disincentives, e.g., assigning road tolls, to encourage passengers to relocate from jammed edges.The task of traffic mitigation has been addressed using several methods.These include belief propagation [30][31][32], adaptive dynamical networks [33], MCMC schemes [34], cellar automata [35,36], and heuristic routing models [37].
A bilevel optimization problem similar to the one studied here was solved using message-passing [38].While the problem's setting is similar to ours, the methodologies differ since we alternate adaptation rules for the capacities with global descent for the weights, whereas message-passing uses local updates for flows.Our approach outputs individual passengers' optimal paths, whereas the formulation in Li et al. [38] can only extract aggregate routes.
We find that our method effectively trades off traffic mitigation against the shortest passengers' routes.Namely, both on synthetic topologies and real roads, it returns optimal transport networks where congestion is heavily reduced.We argue that this result is beneficial for reducing the carbon footprint of roads.We also show that the uncoordinated actions of network manager and passengers can be counterproductive, i.e., they may increase traffic, with an outcome opposite to that intended.
Problem.-We take a network G(V, E) where M ≥ 1 groups of greedy passengers i can travel from origin nodes O i (one node per group), to possibly multiple destination nodes D i .Stationary numbers of entry and exit passengers are stored in a mass matrix with entries Si for v ∈ D i , and Si v = 0 otherwise.We assume that the system is isolated, i.e., that passengers entering the network must also exit.This condition is v Si v = 0 for all i.When traveling along an edge, passengers pay a cost we > 0, and lastly, each edge is equipped with a capacity that controls the rate at which passengers i are allocated along each edge e-c i e ≥ 0. Intuitively, one could think of capacities as the space occupied by passengers of type i, i.e., larger space accommodates more passengers.All problem variables have been introduced with units, however, these can be nondimensionalized to derive scale-independent adaptation rules [39].We denote dimensional quantities with a tilde, and dimensionless ones without.
Lower-level optimization.
-The lower-level problem allows us to find the cheapest routes from O i to D i .In order to model traffic flows, we introduce the fluxes F i e , specifying the displacement of S i along an edge e.In analogy with electrical networks, we assume that there exists an auxiliary pressure potential p i v on each node v due to index i.We interpret them as the travel demand from passengers traveling from v. With this, we define the potential-based fluxes for all e = (u, v) and i, i.e., Poiseuille's Law, as Fluxes must obey Kirchhoff's law.We can write it as e B ve F i e = S i v , where B is a conventionally oriented incidence matrix of the network.Substituting Eq. ( 1) in Kirchhoff's law, the potential becomes a function of c and w, namely where † denotes the Moore-Penrose inverse and L i uv = e (c i e /w e )B ue B ve are entries of the network weighted Laplacian.With this substitution, F ≡ F (c, w) is also a function of only c and w, the only independent problem's variables.
For any fixed set of weights, we write the lower-level problem as The convex OT cost J in Eq. ( 2) is the sum over M indexes of the w-shortest path costs J i = e w e |F i e | [16,17].Its only minimizer is the overlap of M shortest paths from all O i to D i , that are found with c using Eq. ( 1) and Kirchhoff's law.
-The upper-level problem formalizes the task of the network manager of tuning w to mitigate traffic jams triggered by the passengers.We measure traffic by penalizing congested links where i |F i e | exceeds a threshold θ ≥ 0, above which infrastructural failures may occur.Conveniently, we introduce ∆ e = i |F i e | − θ.Analogously to Eqs. (2)-(3), for any set of capacities, the upper-level optimization is where H is the Heaviside step function.In Eq. ( 4), other objective functions, e.g., the hinge loss can be utilized [38,40], we do not explore this here.Furthermore, the weights are constrained to be larger than a small ϵ > 0. This means that passengers cannot profit (w < 0), or travel for free (w = 0).Practically, this ensures that the Laplacian L is well-defined.Bilevel Optimization.-Wecombine the two optimization problems into one.Suppose that the network manager is regularly informed of the passengers' routes, and using such information they tune the weights to mitigate traffic.After each update, passengers reroute accordingly to the updated weights.
Formally, this translates into the problem where the equality in Eq. ( 7) comes from the convexity of J [16,17].In Eq. ( 6) we explicit the dependence on w as a variable and on c as a parameter (conversely for Eq. ( 7)).
Optimal Transport dynamics.-Tofind the shortest paths required for the lower-level problem, we couple fluxes and capacities with the ODEs where fluxes obey Kirchhoff's law.In Eq. ( 8), edges with high flux enlarge, whereas those where the negative decaying term prevails shrink.Crucially, asymptotic solutions converge to the minimum OT cost J in Eq. ( 2), being the Wasserstein distance between passengers' entry and exit distributions [39], and whose minimizers are origin-destination shortest paths.Beside Kirchhoff's law and positivity, capacities in Eqs. ( 2)-(3) are otherwise unconstrained.One can potentially add additional constraints, e.g., a limited budget, by employing recent ideas in the context of adaptation equations [41].We do not explore this here.
Projected Stochastic Gradient Descent (PSGD).-Minimization of Eq. ( 6) is performed using SGD with a projection step to enforce w ≥ ϵ.Importantly, we can derive a closed-form expression for the gradients Ψ e = ∂Ω/∂w e [39].To explore the non-convex landscape of the minimization in Eqs. ( 6)- (7), we update the weights with dropout at each The black rectangle frames a region where the network manager triggers high congestion.We conveniently normalize θ ⋆ and ρ.
Bilevel optimization scheme.-Inorder to find the optimal c and w, and hence F , we iterate between Eq. ( 8) and PSGD recursively.The scheme is repeated until J and Ω converge.A diagram outlining the optimization method is in Fig. 1, we also provide an open-source code (BROT, Bilevel Routing on networks with Optimal Transport) [42].
Experimental setup.
-We analyze BROT's optimal networks against two baselines.The first, referred to as OT, consists of finding passengers' shortest paths without any intervention from the network manager.We assume a unitary cost per unit of length fare, i.e., we set w = ℓ with ℓ the Euclidean lengths of the edges, and numerically integrate Eq. ( 8).The second, referred to as PSGD, reflects the scenario of a network manager that tunes w only relying on the shortest paths taken when w = ℓ, and that disregards how fluxes redistribute while updating w.In practice, this corresponds to running PSGD only, with initial conditions being w(0) = ℓ + ξ and c i e ≃ |F i Dij,e | [39], and then, to integrating Eq. ( 8) with w = w ⋆ PSGD being the optimal weights returned by the network manager.Here, ξ is a small zero-sum uniform noise, F Dij are the shortest path fluxes computed with Dijkstra's algorithm, and the approximation arises because, to avoid numerical instabilities, a small non-zero c i e is allocated to all edges.We fix BROT's initial conditions to w( 0 We evaluate J and Ω at convergence for all methods with different q, and ranging θ from θ = 0 to a large value θ ⋆ where few edges are congested.Results are in Fig. 2(a).
Since for OT the network manager does not intervene, J is constant for at all θ, and it is the origin-destination shortest length.Its profile changes when the network manager influences passengers' routes by tuning the weights.Specifically, for PSGD J drops when reducing θ, making it cheaper for the passengers to move.On the contrary, lower θ corresponds to a larger J for BROT.This behavior seemingly favors an uninformed network manager (PSGD) over an informed one (BROT).However, the profile of Ω shows that, even though the traveling cost of PSGD is cheaper, all transport networks at convergence are highly congested (large Ω).BROT successfully trades off the cost of traveling against traffic, outputting low values of Ω for all θ, with only a mild increase as θ approaches zero.This is clarified in Fig. 2(c), where BROT generates ramified loopy networks.
The dropout parameter q allows to explore the minimization landscape of Eqs. ( 6)- (7).By decreasing q, i.e., setting more gradients to zero, BROT returns lower Js at all θ, whereas PSGD gives higher ones, conversely for Ω.This impacts the network topologies, which are less ramified and akin to OT trees [43], when q is lower and for the same θ [39].The tradeoff between J and Ω is further laid out in Fig. 2(b) where we show J − J OT against Ω − Ω 0 , Ω 0 = 0. We highlight in red the non-dominated points (also referred as maximal points) at four values of θ, computed over all q [as in Fig. 2(a)] and 25 random initializations of BROT.Such points are the best J-Ω trade-off attained by the experimental runs [39].
For all q and sufficiently low θ, the Price of Anarchy (PoA) [44] is greater for PSGD than for OT, i.e., the network manager's intervention increases traffic congestion, having the opposite effect to that intended.We illustrate exemplary networks at convergence in Fig. 2(c).The parameter ρ = w ⋆ X − ℓ (X = BROT, PSGD), expressing the variation of cost, indicates that the uninformed network manager naively-and significantlydecreases the cost of a small fraction of edges [squared in Fig. 2(d)].This encourages fluxes to largely concentrate on them, thus creating congestion.
To further discern the nature of congestion, we propose two additional metrics.First, the Gini coefficient of the fluxes,  and t θ,e (s) = ℓ e /v ∞ otherwise.Here v ∞ = 1 is a (conventionally fixed) free-flow velocity, and s is a sensitivity coefficient to penalize traffic.Results are in Fig. 3.
The Gini coefficient of PSGD fluctuates slightly around the high values attained by the congested shortest path network of OT.For BROT, as θ decreases-more flux gets penalized-Gini sharply drops, yielding progressively distributed networks.The total travel time reveals once again that the uncoordinated action of passengers and network manager may be detrimental compared to having no tuning of w.In fact, times for PSGD are higher than those for OT.BROT keeps T θ (s) small for any value of θ and for both low and high sensitivity.Finally, as θ increases, traffic gradually mitigates, with lim θ→+∞ T θ (s) = T ∞ (T ∞ = J OT ) being the travel time for infinite capacities, when all passengers flow freely.
The E-road network.-Westudy the methods on a graph extracted from the International European highways (E-road) [46,47], of size |V | = 541 and |E| = 712.Entry inflows of passengers are populations of 15 large cities.We assume that all passengers travel from one city to another.Thus, we set for O i and v ∈ D i (being also origin nodes O j ) the exiting number of passengers Sv to be proportional to the product r v = SO i SO j , properly normalized to ensure conservation of mass.In this way, cities with high inflows have large outflows, and vice versa for small ones.The total number of passengers to be routed is i SO i ≃ 3 • 10 7 .We fix θ (dimensionalized by S c ) so that 43% of the passengers reroute from their congested shortest path, found with Dijkstra's and w = ℓ.
Results are in Fig. 4. We observe that in the shortest path configuration of OT, a large volume of passengers travels between the two most populous cities, Madrid and Berlin, on the southernmost region of the network.The uninformed network (PSGD) heavily increases the price of the connections to Milan [39].This causes a heavy rerouting from Madrid to the north, and congests the roads connecting Madrid to Paris, and then from Paris to Berlin.In contrast, BROT distributes traffic over We study the average travel time for all routing schemes.This is ⟨ Tθ (s , where tθ is a dimensionalized latency function computed using l, the Euclidean distance between cities, and v ∞ = 100 (km/hours).
Results for s = 1, 5 in Fig. 4 show that the average travel time of BROT is substantially lower than that of OT and PSGD.Particularly, for low sensitivity BROT's ⟨ Tθ (s)⟩ is approximately 1.7 (hours), while OT's and PSGD's are 2.3 (hours) and 3.1 (hours).Here, BROT leads to a reduction in traveled time of approximately 26% and 45% compared to OT and PSGD.This result becomes starker if the sensitivity increases, here BROT reduces ⟨ Tθ (s)⟩ of 48% compared to OT-from 5 (hours) to 2.6 (hours)-and of 74% compared to PSGD-whose heavy congestion gives ⟨ Tθ (s)⟩ ≃ 10 (hours).Once again, the PoA (the travel time) is higher if the network manager's intervention is uncoordinated with the passengers (PSGD), as opposed to when there is no intervention (OT).
Conclusion.-BROT relies on theoretical assumptions that can be challenging to meet in real-world traffic control [48], e.g., passengers rerouting more unpredictably than expected by theoretical models.Nevertheless, our analysis on the E-road network demonstrates how an informed tuning of road tollswhere the network manager factors in passengers' reroutingcan be beneficial for reducing the carbon footprint of roads, since traffic jams, and hence longer travels, critically impact greenhouse gas emissions of vehicles [49][50][51].
To facilitate practitioners using our algorithms, we opensource our code [42].
Bilevel Optimization for Traffic Mitigation in Optimal Transport Networks: Supplementary Material (SM) Alessandro Lonardi When building the model setup, we assume that passengers move greedily according to Wardrop's first principle.This means that, for a given set of weights w, they travel from their origins O i to their destinations D i minimizing the OT cost J = ei w e |F i e |.As a consequence, a meaningful comparison of BROT is proposed against two methods.The first, PSGD, consists of a scheme where initially only Ω is minimized by a network manager that tunes w while ignoring how passengers reroute when the weights update.We suppose that, at t = 0, the network manager knows which would be the paths that the passengers were to take if they moved minimizing J with w = ℓ-we compute these fluxes with Dijstra's algorithm.In PSGD, only at convergence of w the passengers choose on which path to travel.The second scheme is OT, where the passengers find their shortest path for w = ℓ.Here, the network manager does not intervene on w.
We give an overview of the schemes in Fig. S1, which is a companion Figure of Fig. 1 (main text).Here we consider an entry and an exit node where unitary mass flows in (red plus) and flows out (blue minus) the network.Additionally to BROT, PSGD (with dropout probability fixed at q = 1, i.e., vanilla GD), and OT, we also extract the paths the passengers would take if they did not act greedily.This simulates the unrealistic-for our case of study-situation where the agents on a road network follow the instruction of an oracle, and accept to reroute in order to achieve a social optimum rather than maximizing their own benefit.
The costs J and Ω in Fig. S1 show that if passengers unrealistically moved in a coordinated way, then traffic congestion would be greatly minimized.However, this would cause J to explode.The OT cost J can be minimized by OT and PSGD, which in turn trigger traffic congestion.As discussed in the main text, it may happen that congestion is larger in PSGD than in OT, showing that the uncoordinated actions of passengers and network manager can be detrimental to the global efficiency of the transport network.Only BROT is able to trade off between J and Ω.
The networks in Fig. S1 reflect the costs profiles.Particularly, in the coordinated transport network (purple), passengers distribute over the whole area of the network and keep congestion low.Here, ρ = w ⋆ PSGD − ℓ (w ⋆ PSGD are the weights tuned by the network manager) shows that the straight line connecting the origin and destination nodes is heavily penalized, while other parallel connections and oblique links branching from the origin and destination become cheaper.In the OT network (blue), passengers simply travel on the shortest straight path.In PSGD (orange), passengers split in two separate branches, which however are congested.Only BROT (green) is able to trade off between congestion and total travel cost by outputting a network where fluxes distribute hierarchically to prevent over-trafficking and to minimize the travel cost.Noticeably, the action of the network manager (here quantified by ρ = w ⋆ BROT − ℓ) is less invasive in this scenario, i.e., edge costs need to be varied less to find an optimal configuration.

NONDIMENSIONALIZATION OF THE MODEL
The scale-independent adaptation equations for the evolution of weights and capacities can be derived by rescaling dimensiondependent quantities.We start with the constant coefficients ODEs In Eq. (S1) we write a dimension-dependent version of Eq. ( 8, main text).In Eq. (S2) the gradient flow equation associated with the GD update used to tune the weights.Here, α, β, and γ are constant coefficients with appropriate dimensions.We then choose the following nondimensionalization: where t c , c c , w c and S c are characteristic units.As mentioned in the main text, in order to fix ideas, one could think of such units as time, length, price of tolls applied to roads, and number of passengers, for t c , c c , w c , and S c , respectively.Importantly, such a choice of the parameters' dimensions is not binding since for different physical scenario the problem variables assume diverse meanings.For example, in the case of plant leaves the capacity c could be thought of as the production rate of auxin transporting proteins [4,56].Substituting Eqs.showing that, to recover the adimensional model, we can conventionally set We also notice that J = ei we | F i e | = (w c S c ) ei w e |F i e |, which is J = (w c S c )J.Moreover, Ω = Ω/S 2 c , with θ = θ/S c for an opportunely dimension-dependent capacity threshold θ, and T (s) = (w OT,c S c )T (s), where w OT,c is the nondimensionalization coefficient used for the constant weights of OT.As a consequence, when comparing the costs Ω and J, and the total traveled time T θ (s) between different methods we perform the following nondimensionalization.We fix w X,c = e w⋆ X,e where X is one between OT, BROT, or PSGD, and starred weights are those at convergence.Additionally, we set S c to be the total inflowing number of passengers, i.e., S c = i Si O i .This yields, for any algorithm X, CONNECTION WITH OPTIMAL TRANSPORT The adaptation rules governing the evolution of the capacities [Eq.( 8, main text)] is tightly connected with Optimal Transport (OT) theory.In detail, consider the problem formulation proposed in the main text.By fixing i = 1 we obtain that the passengers inflows µ = S O i and outflows ν = −S D i are two atomic probability distributions supported on the origin and destination nodes, respectively, that need to be mapped one into the other by minimizing the transportation cost.This problem, using a standard OT formulation (primal Kantorovich Problem) is the linear program [54] min π∈Π(µ,ν) u∈V,v∈V Here Π(µ, ν) is the set of transport paths π (expressing the probability of an assignment of passengers on uv) that satisfy the conservation constraints v π uv = µ u and u π uv = ν v .The cost w uv corresponds to the price one needs to pay to move passengers from u to v. Since the transport of passengers is supported on a network, we fix w uv = +∞ if two nodes u, v are not connected.It can be proved [29] that the problem in Eq. (S13) and the minimization problem in Eq. ( 2, main text) correspond, i.e., the minimization objectives and the search space defined by Π(µ, ν) and Kirchhoff's law are the same, therefore shortest path fluxes are exactly OT paths.Moreover, OT fluxes are asymptotic solutions of Eq. ( 8, main text) [29].Noticeably, the optimal cost J is exactly the Wasserstein distance between µ and ν, when w uv satisfies the properties of a metric: (i) symmetry, (ii) vanishing along the diagonal, (iii) triangular inequality [55].These requirements may not hold in our setup, nevertheless, J can still be interpreted as the cost that passengers pay to move on the network.We also remark that even if disconnected nodes yield an infinite edge cost, the finiteness of Eq. ( S13) is guaranteed by assuming the existence of at least one path joining O i and D i where passengers can travel.
OT paths for i > 1 are the overlap of M independent solutions of Eq. (S13), with passengers that move from µ i = S O i to ν i = −S D i for all i.

CLOSED-FORM EXPRESSION OF THE UPPER-LEVEL PROBLEM GRADIENTS
We derive the gradients of the upper-level problem in Eq. ( 6, main text) in closed-form.Let us start by applying the chain rule to Ω, which is defined as with ∆ e = i |F i e | − θ and H Heaviside step function.In order to ease notations, for any feasible set of capacities c, we denote all edges for which ∆ e ≥ 0 as w e ∈ W . Conversely, if ∆ e < 0 (hence H(∆ e ) = 0), then w e / ∈ W .Moreover, we compactly use This allows us to write To manipulate Eq. (S17), we introduce the auxiliary variables r i e = w e /c i e , that we can use to write compactly all problem's main variables.In particular, we denote fluxes, as variables of r, as F i e = (P i u − P i v )/r i e .Least-squares potentials are e ∂r i e Now, write differences of pressure along the network edges as ∆P i e = P i u − P i v , and when computing the gradients with respect to an edge e, we separate contributions for e ′ ̸ = e, and for e ′ = e.In particular, we write e ′ ∀e ′ = e .

(S20)
To simplify notations, we substitute Eq. (S20) into Eq.(S18), and group all terms in one unique expression.This reads ∂r i e ∂w e (S21) with δ e ′ e Kronecker delta.Moreover, derivatives of pressure differences can be written making explicit the definition of the least-squares potential, that is, Now, in order to conclude the derivations, we need to calculate the derivatives of the Laplacian Moore-Penrose inverse in Eq. (S22).This can be done by following the detailed calculation of ?? .We denote their closed-form expression with e as defined in Eq. (S37).Substituting Eq. (S37) into Eq.(S22), and then Eq. (S22) back into Eq.(S21), we obtain ∂r i e Making explicit capacities and weights in all terms of Eq. (S23) yields which we simplify further by writing where we introduced G i e ′ e = vu B ve ′ B ue L i † vu .Notice that G yields non-zero terms for all edges in W and not, hence, congestion on an edge e can affect the weight of any other edge of the network.We conclude by substituting Eq. (S32) in Eq. (S17), and finally obtain the gradients of Ω in closed-form.These read A similar result can be found in [38] (Appendix IIIC).
Differentiation of the Laplacian Moore-Penrose inverse The following calculations can be carried out identically for any index i ∈ M , thus we omit it.We also assume that L has constant rank, i.e., the network does not disconnect in multiple connected components.With this assumption, we can write ( [53], Theorem 4.3) the Laplacian Moore-Penrose inverse derivatives as L † , (S35) which we further simplify, for all u, v ∈ V and e = (x, y) ∈ W , as follows: The bilevel minimization problem in Eqs. ( 6)-( 7) (main text) is highly non-convex.This implies that the energy landscape of (Ω, J) has several local minima, that we wish to explore.Such a task is carried out by implementing batch GD.Particularly, at every discrete time step n ∈ N 0 we draw an |E|-dimensional vectors with components sampled from a Bernoulli probability distribution with parameter q, i.e., α e (n) ∼ Bernoulli(q).We multiply α(n) and Ψ(n) so that, on average, at every step |E|(1−q) random gradients are set to zero.
Additionally, in order constrain the weights onto their feasibility set C = {w ∈ R |E| : w e ≥ ϵ > 0}, we perform a projection step at every descent iteration.Thus, for every discrete time step n ∈ N 0 , we modify the update of the weights as Other projection methods could be used to constrain the weights.Particularly, in our numerical code [42] we also implement the method of Muehlebach and Jordan [52].This consists of adding a "cleverly-designed" momentum term to the descent scheme, which ensures that, given that w(0) ∈ C, then w will fall in the feasibility region at convergence.Remarkably, this approach has the advantage of being easily adaptable to highly non-linear constraints and has recently been used to formulate a constrained dynamical formulation of OT like the one used in this work [41].However, since the structure of C in our case is simple, we observe that it does not give any numerical benefit compared to PSGD.Hence, we opt for the latter.The threshold ϵ has been set to ϵ = 0.01 • min w.

Initial conditions
Initial conditions for OT, PSGD, and BROT are set as follows: OT: Conditions in Eq. (S41) allow us to compute the shortest path fluxes without the intervention of the network manager, hence with w = ℓ.In this case, the width of roads at t = 0 is set to be uniform and equal to the inflowing passengers S i O i , since passengers could potentially travel on any edge of the network.In PSGD, we suppose that the network manager is-only initially-informed about passengers' shortest paths, hence we should initialize the capacities as c i e = |F i e |.This condition comes from the fact that at convergence of Eq. ( 8, main text) and for optimal solutions of Eq. (2, main text) the scaling c i e ∼ |F i e | holds [6].In order to prevent numerical instabilities, in Eq. (S42) we assign c i e ≃ |F i Dij,e | to edges that are traversed by shortest path fluxes, and c i e = 0.05 • S i O i to all the others.The weights are initialized as equal to the lengths, with a small zero-sum noise used to explore the cost landscape of Ω, together with the dropout coefficient α.In PSGD, c gets updated only after the full update of the weights by the network manager is performed, then the OT paths are computed for w = w ⋆ PSGD .For BROT, the network manager is initially uninformed about passengers' routes, hence c i e (0) = S i O i .Similarly to PSGD, w(0) = ℓ + ξ.

Implementation details
In Algorithm 1 we write a pseudocode for the implementation of OT, PSGD, and BROT.In Table S1, we provide a detailed list of all parameters used for our experiments.Ranging the seeds for ξ and using the random dropouts α, we explore the cost landscape of the bilevel optimization problem.
A point is x m said to be dominating x n if x m ̸ = x n and x m ≺ x n , where x m ≺ x n indicates that ω m ≤ ω n and j m ≤ j n , with at least one inequality being strict.We look for and highlight those x n that are dominated by no points outputted by all experimental runs, at a fixed value of θ.

Synthetic experiments
We show additional results that complement those of the synthetic experiments discussed in the main text.The interpretation of these results is as that of those already discussed in the manuscript, therefore, here we only present them briefly.
First, Fig. S2 and Fig. S3 are companion figures to Fig. 2 (main text).These consist of the cost plots for J and Ω, of J − J OT vs. Ω − Ω 0 for D = 4 and q = 0.1, 0.25, 0.5, 0.75, 1 and of a detailed visualization of the transport networks that we extract with OT, PSGD, and BROT for D = 4 and q = 0.5, 1.In Fig. S4 [companion Figure to Fig. 3 (main text)] we display the Gini coefficient and the total travel time T θ (s) for D = 4 and q = 0.5, 1.Similarly, Fig. S5 and Fig. S6 are detailed visualizations of the networks with D = 8 and q = 0.5, 1, and a companion Figure to Fig. 3 (main text), where q = 0.5.
It is worth mentioning how lowering q accentuate the contribution of the greedy passengers, thus yielding, for the same value of θ, costs that are similar to those of OT, and networks with fewer loops.

E-road network
In Fig. S7 we show companion plots to those in Fig. 4 (main text).Particularly, we color the E-road network with the nondimensionalized Euclidean lengths ℓ, which are used for the initial conditions Eqs. (S41)-(S43).Then, we show the change of cost at convergence of the numerical schemes, i.e., ρ X = w ⋆ X − ℓ, with X = PSGD (q = 1), BROT.We also display the distribution of passengers on edges at convergence of Dijkstra's, i.e., the configuration based on which the uninformed network manager of PSGD tunes the weight.The highlighted value of θ is that fixed for all numerical experiments on the E-road network.
Additionally, in Fig. S8 we display experiments on the E-road performed fixing q = 0.25, 0.5, 0.75.In these we see how increasing q in PSGD, the PoA (quantified by the average travel time ⟨ Tθ (s)⟩) gets higher because of the starker lack of cooperation between greedy passengers and network manager.This yields transport networks with more congested paths.Conversely, BROT keeps the average travel time low for all q, i.e., it finds different local minimizers-transport networks-that are not over-trafficked.S7.E-road transport networks, companion Figure to Fig. 4 (main text).Colors of edges follow the colorbars for ℓ (the Euclidian lengths of edges) and for ρ, being the difference in cost between the weights at convergence and their initial configuration.The bottom right histogram is the shortest path distribution, computed with Dijkstra, of the fluxes.In red we mark the value of θ that has been used for all experiments on the E-road network, which penalizes approximately 43% of the total number of passengers traveling along their shortest path.

BROT PSGD BROT PSGD
BROT PSGD q =1.0 q =0.25 q =0.5 q =0.75 S8.E-road transport networks, experiments for q = 0.25, 0.5, 0.75.Colors of edges are the passengers travelling in the network.In the first column, we zoom into an area of interest of the networks outputted by BROT in order to highlight the presence of different network configurations that are decongested.Colors and hatches of histograms are as in Fig. 4 (main text).In gray, we draw the average travel times for q = 1.

FIG. 2 .
FIG. 2. Overview of the routing schemes.(a) J and Ω against θ.(b) Trade-off J − JOT vs. Ω − Ω0 with varying (θ, q, ξ).Non-dominated points for θ/θ ⋆ ≃ {0.06, 0.2, 0.3, 0.4} are in red.(c) BROT's networks at different θ.Edge widths are proportional to the average fluxes in 50 runs of the algorithm.Gray edge contours are fluxes' standard deviations.(d) Cost (left) and flux (right) networks for all methods and θ/θ ⋆ = 0.4.Flux networks are as in (c), whereas edges in the cost networks are colored with ρ and their widths are proportional to the fluxes.The black rectangle frames a region where the network manager triggers high congestion.We conveniently normalize θ ⋆ and ρ.
) = ℓ + ξ and c i e (0) = S i O i .Synthetic experiments.-First,we study a network of size |V | = 300, |E| = 864, with nodes placed uniformly at random in the unitary disk, and edges extracted from their Delaunay triangulation.Entry and exit inflows are S i O i = +1 on an origin node at the center, and S i D i = −1/D, on D = 4, 8 destinations D i on the disk edge.Since M = 1, there is only a single index i.Here we discuss results for D = 8, for experiments with varying q for D = 4, 8, see Supp.Mat.[39].

FIG. 3 .
FIG. 3. Measuring traffic congestion, D = 8.(a) Gini coefficient against θ.(b) T θ (s) against θ.Solid lines correspond to low sensitivity s = 1 and dashed ones to s = 50, in red we draw T∞ (free flow).Shades are standard deviations over 50 realizations of the algorithms.

FIG. 4 .
FIG. 4. E-road transport networks.Nodes in red are 15 main cities taken as passenger inflows, their size is proportional to the entry inflows.Edge widths are the total number of passengers i | F i e |, gray shades are standard deviations over 50 realizations of the algorithms.
FIG. S1.Bilevel optimization scheme in detail.We plot costs and network topology for different adaptation rules.Histograms and networks labeled with purple, blue, orange, and green squares correspond to coordinated traffic, OT, PSGD, and BROT, respectively.Hatch styles are J and Ω. Edges in black are proportional to i |F i e |, while those colored with ρ express the change in (the normalized) w.

50 FIG 5 FIG 5 FIG
FIG. S2.Transport cost, congestion cost, and their trade-off for D = 4.For an extensive caption, refer to Fig. 2 (main text).
FIG.S8.E-road transport networks, experiments for q = 0.25, 0.5, 0.75.Colors of edges are the passengers travelling in the network.In the first column, we zoom into an area of interest of the networks outputted by BROT in order to highlight the presence of different network configurations that are decongested.Colors and hatches of histograms are as in Fig.4(main text).In gray, we draw the average travel times for q = 1.