Robust principal component analysis for modal decomposition of corrupt fluid flows

Isabel Scherl ,1,* Benjamin Strom,1 Jessica K. Shang,2 Owen Williams ,3 Brian L. Polagye ,1 and Steven L. Brunton 1 1Department of Mechanical Engineering, University of Washington, Seattle, Washington 98195, USA 2Department of Mechanical Engineering, University of Rochester, Rochester, New York 14627, USA 3Department of Aeronautics and Astronautics, University of Washington, Seattle, Washington 98195, USA


I. INTRODUCTION
The ability to understand, model, and control fluid flows is foundational to advancing technologies in transportation, energy, health, and defense. The challenges these fields pose are not easily solved by first-principles analysis without intense simplification. Instead, we rely on data from simulations and experiments [1][2][3]. The fidelity of both approaches have improved dramatically, generating vast and increasing volumes of data [4]. However, despite this apparently high-dimensional data, fluid dynamics are often characterized by the evolution of a few dominant coherent structures that are energetically or dynamically important [5][6][7][8][9]. Thus, even with increasing ambient measurement dimension, the intrinsic dimension of the flow may remain relatively low. Modal decomposition techniques are designed to extract these meaningful patterns from highdimensional fluids data [3,10], resulting in a compact representation that can be used for accurate and efficient reduced-order models and control [11,12].
The majority of modal decompositions are linear [3,10], although emerging techniques in machine learning are providing improved nonlinear pattern extraction [1]. Linear regression and least-squares optimization are particularly widely used, as in the ubiquitous proper orthogonal decomposition (POD) [also known as principal component analysis (PCA)] [5][6][7]9,13] and the emerging dynamic mode decomposition (DMD) [8,[14][15][16]. POD provides a principled approach to decomposing high-dimensional fluid flow data into a hierarchy of orthogonal modes that are ordered in terms of their ability to capture the energy in the flow; because these modes are orthogonal, it is possible to obtain reduced-order models by Galerkin projection of the Navier-Stokes equations onto a truncated POD basis [17][18][19]. DMD is a related technique to decompose a flow into spatiotemporal coherent structures that are each constrained to have coherent and linear dynamics in time. The least-squares regression underlying these approaches is highly susceptible to outliers and corrupted data [20,21]. Outliers and corrupt entries differ significantly from the distribution of other measurements [22] and are common in experimental measurement techniques such as particle image velocimetry (PIV). Thus, modal decomposition techniques such as POD/PCA and DMD are fragile with respect to outliers. Further, even though POD is robust to Gaussian white noise [23], DMD is sensitive to noisy data [24][25][26]. Even techniques that are robust to corrupt velocity fields, such as finite-time Lyapunov exponents and Lagrangian coherent structures [27][28][29][30][31][32], are unable to process velocity fields with missing regions, which is common in experimentally acquired data, so that interpolation must be used to fill in missing vectors. In this work, we explore the use of the robust principal component analysis (RPCA) [21] to process corrupt flow fields, leveraging global correlations in the data. We emphasize the impact of this approach on modal analysis, including POD/PCA and DMD.

A. Experimental challenges
Experimental techniques to measure fluid flows have evolved rapidly over the past century, with the ultimate goal of acquiring full flow fields with high spatial and temporal resolution. Laser-based imaging techniques have evolved from point measurements [33,34] to two-dimensional (2D) and 3D field measurements [35][36][37][38][39][40]. PIV has since become a cornerstone of experimental fluid mechanics, providing nonintrusive velocity field measurements across a range of applications. Improvements in PIV hardware, including more powerful lasers, higher-resolution and frame-rate cameras, advanced image processing technology, and the development of tomographic PIV are providing unprecedented views into real flows. Despite the undeniable success of PIV, there are several well-known challenges to acquiring clean and accurate data. Multiple factors in the PIV data acquisition and processing pipeline can contribute to velocity vector outliers that degrade the resulting velocity fields. These include inadequate illumination and irregularities in the light field, background speckle, seeding density and nonpassivity of the particles, sharp gradients in flow properties, optical issues, such as alignment and aberration, limited resolution and shot noise in the image recording, and out of plane motion of the particles when measuring in 2D [37,41]. Because of a fundamental trade-off between the quantity and quality of PIV data, in both space and time, researchers continue to push the resolution limits of current systems. Thus, flow fields acquired with PIV often have missing and/or corrupt measurements. This has motivated processing techniques to improve PIV data quality of PIV data [42][43][44], including predictor-corrector schemes [45], spatial filtering to remove frequencies not possible for the measurement resolution [46], and POD-based background removal [47]. The identification of spurious vectors has been studied extensively [48,49], and the normalized median filter is a robust and well-used method [50]. However, missing FIG. 1. Schematic of RPCA filtering applied to corrupt flow-field data. Corrupted snapshots are arranged as column vectors in the matrix X, which is decomposed into the sum of a low-rank matrix L and a sparse matrix of outliers S. vectors often cluster in regions of high shear, presenting a challenge for standard vector validation and interpolation methods that rely on local flow information [50]. In this work, we leverage robust statistics and global spatiotemporal coherent structures across the entire data set to fill in missing measurements and improve modal decomposition of fluid flow fields.

B. Contributions of this work
We investigate the use of RPCA [21], a robust variant of POD/PCA [20,51]. RPCA uses a sparsity-promoting optimization to decompose a data matrix into the sum of a low-rank matrix containing coherent structures and a sparse matrix containing outliers. RPCA was originally popularized in the Netflix matrix completion algorithm for its recommender system [52] and has since been widely used for image and video processing [53], electrical capacitance tomography [54], and voice separation [55]. Here, we use RPCA filtering to process flow-field data from several simulations and experiments. Figure 1 demonstrates the ability of RPCA to uncover and isolate the dominant low-rank coherent structures from sparse outliers in flow data from an idealized example. In addition to directly analyzing and processing flow-field data, we also perform PCA and DMD modal analyses on the data before and after RPCA filtering to assess its performance.
Here we consider a range of simulated and experimentally acquired flow fields of varying complexity to isolate and analyze various aspects of the algorithm applied to data from fluid mechanics. First, we investigate high-fidelity flow fields from direct numerical simulations of a laminar flow past a cylinder and a turbulent channel flow, where it is possible to artificially add corrupt velocity field vectors to compare the RPCA filtered fields with a known ground truth. Next, we apply the method to two experimentally acquired data sets, including a companion laminar flow past a cylinder and measurements of a cross-flow turbine wake. Although there is not a ground-truth model for these flows, it is possible to estimate the effect of RPCA filtering in reducing outliers and corruption by analyzing the DMD spectrum, which has well-stereotyped behavior for such periodic wake flows [24]. In all cases, we show that the RPCA filtered fields yield DMD spectra that are more consistent with a periodic wake in the absence of noise.
This work is organized as follows: First, we present the standard POD/PCA and DMD modal analysis techniques in Sec. II, followed by the RPCA method in Sec. III. Section IV describes the four flow fields used in this analysis. Results of RPCA filtering on these flow fields and its impact on downstream modal analysis are presented in Sec. V. Depending on the data set in question, the data matrix may be artificially corrupted prior to RPCA filtering (Sec. IV). The results of principal component analysis and dynamic mode decomposition performed on the data matrix (X) are referred to as PCA and DMD modes, respectively, whereas those same operations performed on the low-rank matrix (L) are referred to as RPCA and RDMD modes.

II. MODAL ANALYSIS
Extracting coherent structures from high-dimensional data has been a central challenge in fluid mechanics for decades. Here we review two leading modal decomposition techniques for data from fluid mechanics, the POD, also known as PCA (Sec. II A), and the DMD (Sec. II B). Both methods apply equally well to data from simulations or experiments. We use these two modal decomposition techniques to assess the effectiveness of RPCA filtering, in processing and correcting corrupt flow fields. Both techniques are based on the singular value decomposition (SVD) [56][57][58][59] and there are several detailed discussions of these decompositions [3,8,10,20]. The RPCA algorithm is explained in Sec. III.
In this work, we follow the flowchart shown in Fig. 2. RPCA filtering is applied to a data matrix (X) which is bisected into the low-rank structure (L) and sparse (S) subspaces. From there, RPCA modes and DMD modes are calculated from the low-rank data. We also calculate the POD/PCA and DMD modes on the data matrix.

A. Proper orthogonal decomposition
Proper orthogonal decomposition-referred to as PCA throughout results-is a widely used method to identify spatially correlated coherent structures from data, decomposing the flow field into a linear combination of orthogonal modes that are arranged hierarchically by energy content. There are several variants of POD [3,6,7,9,13,60], and we will present a variant of the snapshot POD of Sirovich that relies on the numerically stable SVD [20,60]. First, flow-field data (e.g., a velocity or vorticity field) is measured or computed on a discrete spatial grid, and m snapshots of these flow fields are collected at various times t 1 , t 2 , . . . , t m . The flow-field data at time t k may be reshaped into a column vector x k = x(t k ) ∈ R n , where n denotes the number of flow variables times the number of spatial grid locations. Next, a data matrix is formed by arranging the column vectors x k in a matrix X: Finally, POD modes are obtained by computing the singular value decomposition of X ∈ R n×m : where superscript T defines the matrix transpose, U ∈ R n×n , ∈ R n×m , and V ∈ R m×m . The columns of U are POD modes with the same dimension as a flow field x. POD modes are orthonormal so that U T U = I; similarly, V T V = I. Moreover, the columns of U (respectively, rows of V T ) are arranged in order of their importance in describing the data. The importance of each 054401-4 mode (i.e., column of U) is given by the corresponding entry of the non-negative, diagonal matrix of singular values ∈ R n×m . The matrix X will exhibit low-rank structure, so that it is well approximated by the first r m < n columns of U and V: where U r and V r denote the first r columns of each matrix and r denotes the first r × r sub-block of . In fact, the Eckart-Young theorem states that this is the optimal rank-r approximation of the matrix X in a least-squares sense. More details about the SVD can be found in Ref. [20]. After truncating all but the first r dominant modes, a flow-field snapshot x may be approximated by a linear combination of these modes: where α k is the POD mode coefficient.
Because the POD modal basis is orthogonal, it is possible to obtain a reduced-order nonlinear dynamical system for the evolution of the coefficients α k (t ) in time via Galerkin projection of the Navier-Stokes equations onto the POD basis. In this way, the POD basis may be thought of a datadriven generalization of the Fourier basis that is tailored to a particular flow field. POD is also closely related to PCA [51], the Karhunen-Loève decomposition [61], empirical orthogonal functions [62], or the Hotelling transform [63,64].

B. Dynamic mode decomposition
DMD is a modal decomposition technique that simultaneously identifies spatially coherent modes that are constrained to have the same linear behavior in time, given by oscillations at a fixed frequency with growth or decay [8,15]. Thus, the dynamic mode decomposition provides a dimensionality reduction into a set of spatial modes along with a linear model for how these modes evolve in time. This is in contrast to POD, which results in orthogonal modes arranged in terms of energy content and without consideration of dynamics. However, in many formulations, DMD is closely related to POD, and may be thought of as a linear combination of POD modes that results in linear evolution in time. DMD also has deep connections to nonlinear dynamical systems via Koopman operator theory [8,14,16,65,66].
In the original formulation of DMD [15], the snapshots in the data matrix in Eq. (1) are spaced evenly in time, so that t k = k t with t sufficiently small to resolve the highest frequencies in the dynamics. Generalizations have since been formulated to allow for nonsequential time series [16,67] and for data that is under-resolved in space [68,69] or time [70]; however, for simplicity, we will present the standard exact DMD formulation of Tu et al. [16] with evenly spaced and sequential snapshots. DMD seeks to identify the leading eigenvalues and eigenvectors of the best-fit linear operator A that evolves snapshots forward in time: The eigenvectors φ of A have the dimensions of a flow field and correspond to spatiotemporal coherent structures whose dynamics in time evolve according to the associated eigenvalue γ . In practice, this operator is identified by first splitting the data in Eq. (1) into two matrices: and then solving for the best-fit operator that satisfies 054401-5 via the following least-squares optimization problem: Here we are minimizing the Frobenius norm || · || F via the pseudoinverse In practice, the matrix A is far too large to analyze directly, and instead, we project A onto an r-dimensional POD subspace, given by the columns of U r : The eigenvalues ofÃ are the same as the eigenvalues of A, which are known as the DMD eigenvalues. They are computed via the eigendecomposition of the r × r matrixÃ: Finally, the corresponding DMD modes are reconstructed using the full-dimensional data along with the reduced eigenvectors in W: This formula for the eigenvectors is from the exact DMD algorithm [8,16]; the original formulation of Schmid [15] computes modes as = U r W.
With the DMD modes and eigenvalues it is possible to reconstruct the state at time k t, where the vector b of mode amplitudes is generally computed as More principled approaches to select the few dominant modes have been considered based on sparsity-promoting optimization [71]. The spectral expansion above may also be written in continuous time by introducing the continuous eigenvalues ω = log(γ )/ t: where is a diagonal matrix containing the continuous-time eigenvalues ω j . DMD is known to be extremely sensitive to noisy data [24][25][26], and the eigenvalues specifically suffer from a bias that is not reduced with increasing data. There are several modifications to make DMD more robust to noise, including averaging forward-time and backward-time operators [25], total least squares [26], and variable projection [67]. For periodic wake data, as explored in three of the examples in this paper, the discrete-time eigenvalues should occur in complex conjugate pairs γ ,γ exactly on the unit circle in the complex plane for clean data. Similarly, the continuoustime eigenvalues should be in complex conjugate pairs ±iω on the imaginary axis, indicating pure oscillations with no growth or decay [24,72]. In Ref. [24], Bagheri characterized the perturbative effect of noise on these eigenvalues, deriving an asymptotic expression for how high frequency eigenvalues become increasingly affected by noise. If the true continuous-time eigenvalue should be ±iω in the absence of noise, Bagheri showed that in the presence of perturbative white noise with magnitude 1 the observed eigenvalue pair is 054401-6 where C is a sensitivity constant. Thus, low noise levels cause a spurious real-valued damping − Cω 2 that is quadratic in the frequency. We will make extensive use of this property to assess the quality of our RPCA filtered fields by computing the ratio of the best-fit factor C before and after applying RPCA filtering. C should decrease as a consequence of reduced noise and corruption.

III. ROBUST EXTRACTION OF FLUID COHERENT STRUCTURES
Techniques based on least-squares regression, such as POD/PCA and DMD, are highly susceptible to outliers and corrupted data, making them fragile with respect to some experimental measurement errors. Outliers and corruption are defined as data points that differ significantly from the statistical distribution of the majority of the data set [22], so that they cannot be considered as the original data plus a small-to-moderate amount of white noise. To mitigate this sensitivity, Candes et al. [21] have developed RPCA that seeks to decompose a data matrix X into a structured low-rank matrix L that is characterized by dominant coherent structures and a sparse matrix S containing outliers and corrupt data: The principal components of L are robust to outliers and corrupt data, which are isolated in S. This decomposition, also referred to as a filter, has profound implications for many modern problems of interest, including video surveillance (where the background objects appear in L and foreground objects appear in S), facial recognition (eigenfaces are in L and shadows, occlusions, etc., are in S), natural language processing and latent semantic indexing, and ranking problems. 1 Standard PCA/POD is effective at removing white noise that is smaller than the relevant singular values in the data [23]; however, it is not able to remove outliers. Instead, RPCA is used to correct outliers that differ significantly from the distribution of the other observations. Mathematically, the goal is to find L and S that satisfy the following: min L,S rank(L) + ||S|| 0 subject to L + S = X.
||S|| 0 counts the number of nonzero elements in S, quantifying how sparse it is. rank(L) is the number of nonzero singular values in L, quantifying how many linearly independent rows and columns describe the data. However, neither the rank(L) nor the ||S|| 0 terms are convex, making this optimization intractable. Similarly to the compressed sensing problem, it is possible to solve for L and S with high probability using a convex relaxation of (16): where || · || * is the nuclear norm, given by the sum of singular values which is a proxy for the rank of the matrix, and || · || 1 is the 1-norm of the matrix viewed as a vector, given by the sum of the magnitudes of each entry in the matrix, which is a proxy for the || · || 0 norm of a matrix; the hyperparameter λ 0 is given by λ 0 = λ/ √ max(n, m). The solution to (17) converges to the solution of (16) with high probability if λ = 1, where n and m are the dimensions of X, given that L is not sparse and S is not low rank. In the examples below, these assumptions may only be partially valid, so the optimal value of λ may vary slightly. The convex problem in (17) is known as principal component pursuit and may be solved using the augmented Lagrange multiplier (ALM) algorithm.
Specifically, an augmented Lagrangian may be constructed: where Y is the matrix of Lagrange multipliers and υ is a hyperparameter. We then solve for L k and S k to minimize L, update the Lagrange multipliers and iterate until convergence. In this work, an inexact ALM implementation from Ref. [73] is used. The alternating directions method (ADM) [74,75] provides another simple procedure. After the low-rank matrix L is obtained, it is possible to compute robust POD/PCA modes (Sec. II A) as in Eq. (2): Henceforth, we refer to the modes in U from L as RPCA modes. We note that in many flow applications it is important to subtract the mean flow before computing POD, which allows the POD eigenvalues to be interpreted as the variance of fluctuations and the expansion to respect boundary conditions by construction. However, before computing RPCA, it may be difficult to obtain an accurate mean flow estimate. Instead, we advocate computing the RPCA, then subtracting the mean of L from itself, and finally computing POD on the mean-subtracted low-rank matrix; this final POD step will remove small amounts of white noise.
Similarly, it is also possible to compute robust DMD (RDMD) modes and eigenvalues. We note that the RDMD in this work should not be confused with the recursive DMD of Noack et al. [76], which uses the same acronym. We also note that this decomposition is similar in spirit to the coherent vortex simulation approach [77], which separates turbulent flows into coherent and random parts based on a wavelet decomposition. However, RPCA does not perform this decomposition using a universal basis, such as wavelets, that rely on scale separation but rather based on statistical correlations in the data.

IV. MODEL FLOWS
We demonstrate RPCA filtering on several example data sets of varying complexity, drawn from direct numerical simulations (DNS) and PIV data from experiments. Figure 3 provides an overview of the four example flow fields.

A. Cylinder flow
Flow past a cylinder is a canonical example in fluid mechanics. We consider data from DNS at a diameter-based Reynolds number of 100 and from PIV measurements at Reynolds number 413 [78].
The DNS data are generated by simulating the two-dimensional incompressible Navier-Stokes equations using the immersed boundary projection method [79,80]. The computational domain comprises four nested grids: The finest grid covers a domain of 9 × 4 and the largest grid covers a domain of 72 × 32, where lengths are nondimensionalized by the cylinder diameter. Each grid contains 449 × 199 points with a resolution of 50 points per cylinder diameter. The time step is t = 0.02 and data are sampled at intervals of 10 t (30 times the vortex shedding frequency) with m = 150 snapshots saved, covering 5 vortex shedding cycles. The DNS provides a benchmark, where the uncorrupted flow field is known, to quantitatively assess performance of RPCA filtering on data with artificial salt-and-pepper corruption. Corrupted sample points are chosen uniformly in space and time at a given rate, and both the u and v velocity components at each selected location are randomly assigned a value of ±10 times the standard deviation of the streamwise velocity data. In addition, we consider a second case where corrupted sample points are chosen with a bias toward regions of high vorticity or shear magnitude, which is more physically realistic for PIV data. second case, we select measurements for corruption based on a probability density given by α + |ω|, where α is a small positive constant and |ω| is the absolute value of the vorticity; when the rate of corruption is sufficiently high, these corrupted fields begin to resemble the uniformly corrupted cases, but with more corruption in vortex cores. Because vorticity is calculated from velocity fields using a finite-difference derivative, there is a higher rate of corruption in the vorticity fields than in the velocity fields.
The PIV data has frame size of 135 × 80 grid points with a resolution of 8 points per cylinder diameter. Data are sampled at a rate of 20 Hz (125 times the shedding frequency) with m = 8000 snapshots saved, which corresponds to 64 vortex shedding cycles.

B. Turbulent channel flow
For a more complex and multiscale flow, we consider DNS data from a forced, fully developed turbulent channel flow data with a friction velocity Reynolds number of Re τ = 1000, from the Johns Hopkins Turbulence Database [81]. This example provides a test case to see how turbulent kinetic energy at various scales is filtered depending on the level of added noise. The addition of noise is similar to the cylinder DNS where randomly selected sample points of the streamwise and cross-stream velocity fields are assigned a value of ±10 standard deviations of the streamwise velocity data. Due to the size of the full data set, we only consider two-dimensional fields on the midplane, with a 512 × 512 grid of three component velocity measurements spanning the channel width. Data are sampled at a rate of 966 times the mean flow-through time with m = 1000 snapshots.

C. Cross-flow turbine wake
Finally, we consider PIV wake data from a cross-flow turbine experiment conducted at the University of Washington. Cross-flow turbines can be used to extract power from wind and water currents for renewable energy generation. This flow exhibits both coherent and broadband phenomena and provides a challenging test-case RPCA filtering. The frame consists of 158 × 98 grid points, at a resolution of 99 points per rotor diameter. Data are sampled at a rate of 32 times the blade-pass frequency with m = 1000 snapshots. Vectors were calculated using a multigrid, multipass algorithm with adaptive image deformation [82]. Resulting vector fields were then 054401-9 FIG. 4. RPCA filtering removes noise and outliers in the flow past a cylinder (black circle), from DNS (left) with 10% of velocity field measurements corrupted with salt-and-pepper noise, and PIV measurements (right). All frames show resultant vorticity fields. As the parameter λ is decreased, RPCA filtering is more aggressive, eventually incorrectly identifying coherent flow structures as outliers.
validated using a normalized median filter with potential replacement by secondary correlation peaks. The cross-correlation and validation steps result in missing data, particularly in regions of high vorticity and shear. To apply RPCA filtering, these missing values are randomly assigned a value of ±10 standard deviations of the streamwise flow data, in contrast to the experimental cylinder wake where missing measurements were previously interpolated.

V. RESULTS
We now explore the ability of RPCA filtering to isolate and remove noise and corruption from the example flow fields. We will begin with the simulated and experimental flow past a cylinder, followed by data from the Johns Hopkins turbulent channel flow simulation, and ending with the experimental wake of a cross-flow turbine. Figure 4 shows the results of RPCA filtering for flow past a cylinder, providing a side-by-side comparison of PIV and corrupted DNS data. Although the Reynolds numbers differ by a factor of four, the flow fields are qualitatively similar, characterized by periodic, laminar vortex shedding. For λ = 1, the data are correctly segmented with the coherent flow in L and the sparse corruption in S. When λ is too small, RPCA filtering is overly aggressive, incorrectly including relevant flow structures in S, and when λ is too large, the corruption is not filtered.

A. Cylinder flow
For the experimental data in Fig. 4 (right), the optimal value of λ is less clear. For λ = 0.1, the low-rank field L is visually smoother than the field at λ = 1, but the sparse matrix S contains a significant portion of the wake structures, indicating overfiltering. This filtering becomes more pronounced in the movies, where it is clear that much of the high-frequency "noise" in the bypass flow is actually free-stream turbulence, which is consistent with the turbulence intensity of the 054401-10 experiments. Further, as subsequently discussed, when we compute the RPCA modes, it is clear that the λ = 0.1 case is heavily filtering out all but the first three modes. Thus, it appears that the theoretically optimal value λ = 1 has the best performance, although there may be a trade-off between filtering ambient free-stream turbulence and coherent structures of interest in experiments. Figure 5 shows the results of RPCA filtering on the simulated data for the second case of vorticity-biased corruption. Again, in all cases, the theoretically optimal value of λ = 1 yields the best segmentation of the corruption into the matrix S. When the rate of corruption is increased from 1% (left) to 10% (right), the free-stream flow begins to become corrupted, resembling the uniform corruption case in Fig. 4. The mean error and relative nuclear norm of the low-rank matrix (L) compared to the true, uncorrupted data (X) are shown in Fig. 6 for varying percentages of corrupt entries. Statistically, results are similar for vorticity-biased and randomly-distributed corruption. In both cases, RPCA filtering is remarkably robust to corruption, even for corruption in excess of 50% of the measurements. This laminar vortex shedding example is an ideal application for RPCA filtering, as the true flow field is low rank and the corruption is sparse; it is unlikely that this will hold as well for data exhibiting broadband turbulence.

PCA analysis for cylinder wake flows
We now investigate the impact of RPCA filtering on modal decompositions. Figures 7 and 8 show four leading PCA and RPCA modes for 1% and 10% vorticity-biased corruption, respectively. The first mode corresponds to the mean flow, and the remaining modes come in energetic pairs where the corresponding coefficients α 2 j and α 2 j+1 oscillate sinusoidally at the same frequency but π/2 out of phase, sweeping out a circle in the phase plane. Thus, we only show one mode, u 2 j+1 from each of the first three energetic mode pairs. In all cases, the RPCA modes show dramatic improvement, while significant artifacts remain in the PCA modes. We also investigate the effect of increasing the amount of data, and there is a clear improvement in RPCA modes from two to five vortex shedding cycles; in contrast, the PCA modes do not improve appreciably with more data. To quantify the improvement observed above, we compute the L 2 error between the PCA and RPCA modes of corrupted data and the PCA modes for the clean data (i.e., DNS results) as a function of the number of shedding periods. As shown in Fig. 9, the RPCA mode velocity fields quickly converge to a small error as the amount of data is increased for both the 1% and 10% corruption cases, while the PCA modes converge much more slowly and still have considerable error after five shedding periods are included in the analysis.
The modes for the PIV data for the cylinder flow are shown in Fig. 10. This figure highlights the effect of λ, the sparsity hyperparameter, which was previously discussed with respect to Fig. 4. In this case, we do not have a clean ground-truth data set to compare against. Although the flow field in Fig. 4 appears to have less corruption for λ = 0.1, here we see that all RPCA modes after the first three modes are heavily filtered, as seen in the rapid drop off in the singular values after the third mode. The corresponding modes are highly corrupt, further supporting that λ = 0.1 is not a good choice. In contrast, the RPCA modes for the theoretically optimal λ = 1 case appear to have slightly less free-stream corruption than the PCA modes. Also, as expected, for a large enough value of λ, the RPCA filtering has little effect on the modes.

DMD analysis for cylinder wake flows
DMD is known to be quite sensitive to noisy data, making this a challenging test case for RPCA filtering. Figure 11 shows the discrete-time eigenvalues for the cylinder DNS data with vorticity-biased corruption. For the cylinder wake, the uncorrupted or true DMD eigenvalues may be computed from the noiseless data, and they are equally spaced on the unit circle in the complex plane. In all cases, the RPCA-filtered DMD (RDMD) data result in dramatically better agreement with the uncorrupted or true DMD eigenvalues compared with the corrupted DMD eigenvalues. Even with only a single period of data and η = 10% corruption, the RDMD values capture the first six low-frequency mode pairs; in contrast, even with five periods of data and as little as η = 1% corruption, corrupted DMD only captures the first two low-frequency mode pairs, and with considerably more spurious damping. To see this more clearly, we plot the eigenvalues in continuous-time in Fig. 12, where the x axis is the imaginary eigenvalue component and the y axis is the real eigenvalue component, which is a standard way to plot DMD eigenvalues [15]. Here, the best-fit parabolas for the RDMD eigenvalues and the first seven corrupted DMD eigenvalues are shown in dashed lines. The curvature of these parabolas is directly related to the noise amplitude, as in Eq. (14) from Ref. [24]. The same continuous-time eigenvalue plot is shown for the PIV cylinder wake data in Fig. 13. In both cases, we see that the parabolic eigenvalue fit for the RDMD eigenvalues has a smaller curvature than for the corrupted DMD eigenvalues, indicating a quantitative and significant reduction in noise.

B. Turbulent channel flow
The trade-off between filtering corruption and small-scale structures is also apparent in the turbulent channel flow DNS. Unlike the cylinder wake, this flow field contains broadband turbulent phenomena across multiple spatial and temporal scales. Figure 14 shows RPCA filtering for various levels of corruption, sweeping across the tuning parameter λ. The corresponding turbulent kinetic 054401-14 FIG. 12. Continuous-time DMD eigenvalues for the simulated cylinder data along with parabolic eigenvalue fits to estimate the error as in Ref. [24]. Here we use five vortex shedding periods with η = 1% corrupt values. The RDMD parabolic coefficient is approximately 2 × 10 4 times smaller than the DMD coefficient. energy (TKE) is shown in Fig. 15, providing a summary of the various scales that are filtered. The value of λ that preserves the true TKE spectrum varies with the degree of velocity field corruption. In the uncorrupted case (η = 0), we can clearly see the effect of filtering on the turbulent coherent structures, indicating that some fine-scale structures are filtered for λ = 1. As the degree of corruption increases to η = 2%, we see that the curves for λ 2 remain relatively unchanged, although the λ > 2 curves begin to exhibit spurious high-frequency spatial structures (i.e., corruption is present in L). As the rate of corruption increases to η = 10%, spurious highfrequency energy also appears for λ = 2. In this case, it is clear that the optimal filtering value λ changes with the level of corruption. For relatively limited corruption, a larger value of λ may be used but must be decreased toward the theoretically optimal value of λ = 1 for higher levels of corruption. Finally, we note that, unlike the cylinder wake cases, it is not surprising that λ = 1 is suboptimal because the channel flow is not fundamentally low rank which deviates from an underlying assumption of the RPCA algorithm.

C. Cross-flow turbine wake
As a final example, we consider the use of RPCA filtering to identify outliers and fill in missing PIV data collected in the wake of a cross-flow turbine, as shown in Fig. 16. There are several stages in the PIV processing pipeline where RPCA filtering could be applied, including after initial cross-correlation, after conventional normalized median filter vector validation, and after FIG. 13. Continuous-time DMD eigenvalues for the PIV cylinder data along with parabolic eigenvalue fits to estimate the error as in Ref. [24]. The RDMD parabolic coefficient is approximately two times smaller than the DMD coefficient.

054401-15
FIG. 14. RPCA filtering for turbulent channel flow vorticity fields with various levels of added noise and tuning parameter λ; η represents the percentage of corrupted measurements in the velocity fields. The border colors match the color of the curve at the corresponding η in Fig. 15. linear interpolation. For the cases shown here, we use λ = 1.6, which results in a velocity in the bypass flow, or lower third of the frame, that visually matches the frequency content of the unfiltered data. There are enough missing velocity vectors (23% and 20%, respectively) to degrade the effectiveness of both median filtering and interpolation. In contrast, RPCA filtering produces flow fields that capture dominant coherent structures for either cross-correlated or median-filtered fields. Finally, by investigating the standard deviation of all flow fields collected at a given turbine angular position (i.e., phase), it is clear that the RPCA filtering can be used to remove artifacts introduced by linear interpolation. This is consistent with the intuition that vector validation and interpolation should fail in these regions where there is high density of missing data that are spatially clustered.
The continuous-time DMD eigenvalues for the cross-flow turbine wake are shown in Fig. 17. In this plot, the parabolic fits for DMD eigenvalues computed after interpolation and RDMD eigenvalues computed after vector validation are displayed as dotted lines. For this case, the coefficient of the parabolic fit for the RDMD-based eigenvalues is six times smaller than the parabolic fit coefficient for the DMD-based eigenvalues. This demonstrates a significant quantitative improvement of the DMD spectrum using RPCA filtering to process the data.

054401-16
FIG. 15. TKE spectra for various levels of corruption and RPCA filtering. The TKE profiles provide a summary of the filtering that occurs at various scales. As corruption increases, the filtering remove more highfrequency information.
FIG. 16. RPCA filtering of cross-flow turbine wake PIV data. The standard PIV processing pipeline (top row) includes several steps where RPCA filtering can be applied (bottom row). In the cross-correlated streamwise velocity field (top left), 23% of the velocity vectors are missing. Vector validation reduces the missing vectors to 20%. Finally, linear interpolation is used to fill in these missing vectors. In all cases, RPCA filtering captures the relevant phase-averaged coherent structures with fewer outliers and missing data, which appear as dark spots in the standard deviation plot.
FIG. 17. Continuous-time DMD eigenvalues for the turbine wake PIV data, along with parabolic eigenvalue fits to estimate the error as in Ref. [24]. The RDMD parabolic coefficient is approximately six times smaller than the DMD coefficient.

VI. CONCLUSIONS AND DISCUSSION
In this work, we have demonstrated the ability of RPCA filtering to effectively recover dominant coherent structures from corrupt flow fields with missing measurements. Unlike standard POD/PCA, which is based on least squares and is susceptible to outliers and corruption, RPCA utilizes sparse optimization to separate a data matrix into a low-rank matrix containing correlated structures and a sparse matrix containing the spurious entries.
We apply RPCA filtering to several types of fluid flow data (DNS and PIV), ranging from laminar vortex shedding behind a circular cylinder, to fully turbulent channel flow DNS, and concluding with an experimental flow past a cross-flow turbine. These flows exhibit a variety of phenomena and a range of measurement quality. The DNS examples provide us with a baseline, where it is possible to add corruption to quantitatively assess the performance of RPCA. For flow past a cylinder in DNS, RPCA filtering is extremely effective at separating the true flow field from considerable corruption, with robust recovery even in flow fields with excess of 50% of the measurements corrupted. In the experimental counterpart, RPCA is still able to remove large outliers and corruption, although there is a trade-off between filtering the background turbulence and coherent structures in the wake. The fully turbulent channel flow DNS provides an opportunity to more fully explore this trade-off in a controlled setting, where we can incrementally increase the corruption ratio and observe the filtering effects on various spatial frequencies. As expected, an increasingly aggressive filtering leads to degradation at higher wave numbers, although dominant coherent structures are robustly preserved. Finally, the wake behind a cross-flow turbine provides a practical real-world flow that directly benefits from improved PIV processing. In all three wake flows we also assess the performance of RPCA filtering to yield more accurate modal decompositions. Although we do not have ground-truth measurements and modal decompositions, except in the case of direct numerical simulations, we know that continuous-time DMD eigenvalues should be arranged on the imaginary axis in the complex plane for clean data, and deviations from this may be quantified using the derivation from Bagheri [24]. In all three cases, we see considerable reduction in spurious damping, indicating the denoising effectiveness of RPCA. Based on these results, we believe that RPCA can be a valuable algorithm in the arsenal of PIV processing and filtering techniques, particularly when the processing pipeline culminates in modal analysis.
There are a number of future directions motivated by this work. First, RPCA depends on the hyperparameter λ, and a better understanding of how to objectively choose λ for different conditions is important. Because RPCA is based on sparse, nonconvex optimization, it is also likely that improved optimization techniques may improve speed and robustness. Although this work considered three-dimensional flows, the data comprised two-dimensional cross sections, and the current analysis could be extended to flow volumes. In principle, the RPCA method should generalize, although there may be computational scaling challenges. Recent results have extended RPCA from linear subspaces to manifolds [83], so it may be possible to robustly characterize fluids data that is well described by a low-dimensional manifold [84], rather than a low-dimensional POD subspace. Nonstationary flows may be more challenging for this method, as the bulk distribution will drift. Similarly shocks may be erroneously flagged as outliers; however, this may provide an opportunity to identify shocks in the data. Investigating these flows is an important avenue of future work. It would also be useful to extend this work to PIV measurements of other turbine configurations [85]. Finally, the quality of the RPCA filtered flow fields for additional downstream analyses should be assessed for example, in dynamical systems modeling via Galerkin projection [19] or regression [18] onto the filtered modes and in control [86].
The code and videos for this work are available [87,88].