Microlensing signatures of extended dark objects using machine learning

This paper presents a machine learning-based method for detecting the unique gravitational microlensing signatures of extended dark objects, such as boson stars, axion miniclusters, and subhalos. We adapt MicroLIA, a machine learning-based package tailored to handle the challenges posed by low-cadence data in microlensing surveys. Using realistic observational timestamps, our models are trained on simulated light curves to distinguish microlensing by point-like and extended lenses, as well as other object classes that produce variable magnitudes. We show that boson stars, examples of objects with a relatively flat mass distribution, can be confidently identified for $0.8 \lesssim r/r_E\lesssim 3$. Intriguingly, we also find that more sharply peaked structures, such as NFW subhalos, can be distinguished from point-like lenses under a regular observation cadence. Our findings significantly advance the potential of microlensing data in uncovering the elusive nature of extended dark objects. The code and dataset used are also provided.


I. INTRODUCTION
Macroscopic dark matter candidates, with masses ranging from large asteroids (∼ 10⁻¹⁵ M⊙) to stars (∼ M⊙), offer compelling alternatives to the traditional particle-based theories. These celestial objects, potentially formed in the early universe, are primarily detectable via their gravitational effects: gravitational lensing (e.g. [1][2][3][4]) and gravitational waves (e.g. [5][6][7][8][9][10][11][12]). Indeed, gravitational microlensing is one of the most important ways of probing compact objects such as "MACHOs" or primordial black holes (PBHs), a dark matter (DM) candidate consisting of compact objects formed in the early Universe. Through surveys of a range of sources, microlensing by such point-like lenses has been used to constrain the fraction of DM such objects can comprise over a wide range of masses (see e.g. [4]).
It has also been proposed that dark matter can instead be comprised of extended objects, such as boson stars (e.g. [13][14][15][16]), axion miniclusters [17], and subhalos [18][19][20][21][22]. Like compact objects, such objects can bend the light of distant stars. Whether this effect can be probed by a microlensing survey depends on the size of the object compared to the Einstein radius, the characteristic length scale set by the mass of the dark matter lens and the distances to the lens and the light source: dilute dark objects, which are transparent to light, are ineffective lenses. Using conservative assumptions about the number of events observed, Refs. [23, 24] derived modified constraints on extended dark matter objects.
Interestingly, structures with radii close to the Einstein radius may give distinct microlensing signatures. How the mass is spread within these structures affects the lensing effect. This was demonstrated explicitly for microlensing of various dark matter structures in [23][24][25].
In objects with a flatter density profile, such as boson stars, the microlensing magnification time series can deviate significantly from that expected from a point-like lens such as a PBH, for example featuring caustic crossings, as can be seen in Fig. 1. These distinguishing features of extended dark lenses can in principle be used to make a positive discovery.
In this work we take the first step towards a positive discovery of a microlensing signature by an extended dark object. To this end we develop an analysis pipeline to search for signatures of extended dark matter objects in time series data. Specifically, we train a histogram-based gradient boosted classifier to identify such features, and describe how these observations can be used to search for dark objects in a range of experiments.
This paper is organised as follows. We first review microlensing by extended lenses and estimate the sensitivity of the OGLE survey to the modified light curves of boson stars. We then describe the generation of our dataset and our methodology, followed by an analysis of our results and a discussion.

II. MICROLENSING SIGNATURES OF EXTENDED LENSES
Let us first review some basics of gravitational microlensing, largely following the treatment in Ref. [26]. We will need to define some important parameters of the lens setup; a depiction of the geometry can be found in [23]. The observer-lens, lens-source, and observer-source distances are denoted $D_L$, $D_{LS}$, and $D_S$ respectively. From the perspective of the observer, the lens center subtends angles of $\beta$ and $\theta_i$ with the source and with the images of the source, respectively. As $D_L$, $D_S$, and $D_{LS}$ are much larger than all other scales in the problem, lensing calculations can be simplified using the small-angle approximation. In this approximation, the deflections $\alpha = 4GM/(c^2 \xi)$ occur only when starlight encounters the "lens plane" perpendicular to the observer-source axis.
Assuming that the lens is spherically symmetric with density distribution $\rho(r)$ and total mass $M = 4\pi \int \mathrm{d}r\, r^2 \rho(r)$, the lensing equation reads
$$\beta = \theta - \theta_E^2\, \frac{M(\theta)}{M\, \theta}, \qquad (1)$$
where $M(\theta)$ is the lens mass enclosed within the projected radius $D_L \theta$ on the lens plane, and where [27]
$$\theta_E \equiv \sqrt{\frac{4 G M}{c^2} \frac{D_{LS}}{D_L D_S}} \qquad (2)$$
is the point-like Einstein angle, the value of $\theta$ for a point-like lens ($M(\theta) \to M$) at zero impact parameter ($\beta = 0$). This in turn defines the point-like Einstein radius
$$r_E \equiv D_L \theta_E \qquad (3)$$
on the lens plane. As the total lens mass $M$ and the distances $D_L$, $D_{LS}$, $D_S$ enter the problem only through their contributions to the Einstein radius $r_E$, it is convenient to express all angles in units of $\theta_E$ and all distances in units of $r_E$. Thus, we define $u \equiv \beta/\theta_E = D_L \beta / r_E$ and $\tau \equiv \theta/\theta_E = D_L \theta / r_E$ (note that the latter was named $t$ in [23, 24]), which allows us to rewrite (1) as
$$u = \tau - \frac{m(\tau)}{\tau}, \qquad (4)$$
where we have also defined $m(\tau) \equiv M(\theta_E \tau)/M$, which describes the distribution of the lens mass, with density $\rho$, projected onto the lens plane. The lensing equation (4) can be used to find the position(s) of the images $\tau_i$ given a position of the lens $u$. As the images subtend different solid angles than the unlensed source, microlensing alters the observed flux. The magnification of each image is the ratio of the angular extent of the image to that of the source,
$$\mu_i = \left| \frac{\tau_i}{u} \frac{\mathrm{d}\tau_i}{\mathrm{d}u} \right|, \qquad (5)$$
and the total magnification is the sum over images,
$$\mu_{tot} = \sum_i \mu_i. \qquad (6)$$
The light curve as a function of time $t$ for a lens with velocity $v$ and minimum impact parameter $\xi_{min}$ can now be found through $u\, r_E = \sqrt{\xi_{min}^2 + v^2 t^2}$. Whether a microlensing event is observable depends on the minimum detectable magnification for a given telescope, as well as on the range of cadences of the microlensing survey, which sets the transit timescales to which it is sensitive. Typically, a transit is counted as a lensing event if $\mu > 1.34$, which for a point-like lens occurs for impact parameters (in units of the Einstein radius) $u \leq 1$. In [23, 24] the microlensing efficiency of an extended lens compared to that of a point-like lens was defined as the maximum impact parameter $u_{1.34}$ for which a threshold magnification is produced: $\mu_{tot}(u \leq u_{1.34}) \geq 1.34$. For a point-like lens, $u_{1.34} = 1$ by
definition, and the naive expectation for extended lenses would be that $u_{1.34} \leq 1$. Remarkably, this is not necessarily the case: in particular for lenses with a reasonably flat density profile, $u_{1.34}$ can be larger than one. Given $u_{1.34}$, the microlensing differential event rate for a single source, with respect to the typical event timescale $t_E$ and $x = D_L/D_S$, observed by a particular experiment, can be calculated as
$$\frac{\mathrm{d}^2\Gamma}{\mathrm{d}x\, \mathrm{d}t_E} = \varepsilon(t_E)\, \frac{2 D_S}{M v_0^2}\, f_{DM}\, \rho_{DM}(x)\, v_E^4(x)\, e^{-v_E^2(x)/v_0^2},$$
where $\varepsilon(t_E)$ is the efficiency of telescopic detection, $v_E(x) \equiv 2 u_{1.34}(x) r_E(x)/t_E$, $v_0 = 220$ km/s is the dark matter circular speed in the galaxy, and $\rho_{DM}(x)$ is the DM density projected onto the line of sight, for example following an isothermal profile in the Milky Way galaxy, with $R_{Sol} = 8.5$ kpc and $\rho_s = 1.39$ GeV/cm$^3$. The total number of events is then given by
$$N_{events} = N_\star\, T_{obs} \int_{t_{E,min}}^{t_{E,max}} \mathrm{d}t_E \int_0^1 \mathrm{d}x\, \frac{\mathrm{d}^2\Gamma}{\mathrm{d}x\, \mathrm{d}t_E},$$
where $N_\star$ is the number of observed sources, $T_{obs}$ is the total observation time, and $t_{E,min}$ ($t_{E,max}$) is the minimum (maximum) timescale of an event.
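As a concrete illustration of the threshold condition discussed above, the following sketch evaluates the standard point-lens magnification, $\mu(u) = (u^2+2)/(u\sqrt{u^2+4})$, and recovers $u_{1.34} \simeq 1$ by bisection. This is a minimal standalone example, not the analysis pipeline of this work; the extended-lens profiles and MicroLIA are not used here.

```python
import numpy as np

def point_lens_magnification(u):
    """Total magnification of both images of a point-like lens,
    with impact parameter u in units of the Einstein radius."""
    u = np.asarray(u, dtype=float)
    return (u**2 + 2.0) / (u * np.sqrt(u**2 + 4.0))

def u_threshold(mu_min=1.34, lo=1e-3, hi=10.0, tol=1e-8):
    """Largest impact parameter giving magnification >= mu_min, found
    by bisection (the magnification decreases monotonically with u)."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if point_lens_magnification(mid) >= mu_min:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Note that $\mu(1) = 3/\sqrt{5} \approx 1.34$, so `u_threshold()` returns a value very close to 1, the conventional $u_{1.34} = 1$ for a point-like lens.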
Comparing $N_{events}$ with the number of observed events in microlensing surveys, [23, 24] set constraints on extended lenses. However, it is important to point out that the surveys identify microlensing events based on a comparison with the point-like magnification light curve. For extended lenses, in particular lenses with $\tau_m \equiv r_{lens}/r_E \sim 1$, this is not a good approximation. In Fig. 2 we estimate the range of lens masses and radial sizes for which we expect significant deviations from the point-like light curve, for the example of a boson star observed in the OGLE-IV survey (5 year dataset) [3], anticipating that the non-compactness of the lens can be resolved for $0.8 < \tau_m < 3$ (which we will verify in the next section). Here we follow the mass profile of boson stars $m(\tau)$ outlined in the appendix of [23]. As in that work, the sensitivity is computed using the Poissonian 90% confidence limit based on a signal comprised of dark matter and astrophysical foregrounds (see [3]). We note that as the OGLE collaboration did not search for the particular microlensing light curves predicted by boson stars, this is an estimate only. Comparing Fig. 2 with the sensitivity of the OGLE-IV survey to boson stars in [23], we note that for a given lens size, the lighter masses lead to distinguishable features in the light curve, as expected from the dependence of the Einstein radius on $M$.
Because the Einstein radius varies along the line of sight to the source, for a lens of a particular size $r_{lens}$ a range of $\tau_m(x)$ is relevant, where $x = D_L/D_S$. One might wonder what the distribution in $\tau_m$ is. This is given by $dN/d\tau_m = (dN/dx)(dx/d\tau_m) = f_{DM}\, \rho_{DM}(x)\, dx/d\tau_m$. For the OGLE sources, we find that this distribution peaks at $\tau_m(x = 0.5)$ and falls rapidly, like $\tau_m^{-3}$, away from it. Thus, we expect that, given a particular lens mass, it is in practice possible to interpret a measurement of $\tau_m$ directly in terms of a lens size.

III. DATASET GENERATION AND METHODOLOGY
In this work we adapt MicroLIA, a tool developed and detailed in [29]. The classifier developed in that paper is a machine learning model utilising the Random Forest algorithm, specifically designed for the detection of microlensing events in astronomical surveys. It is tailored to handle the challenges posed by low-cadence data, which typically suffer from irregular signal sampling and thus lower signal-to-noise ratios, making microlensing event detection more difficult. MicroLIA distinguishes between microlensing and other variable star events using 148 features derived from the light curve and its derivative time series. The classifier categorises events into classes such as microlensing, eclipsing binaries, and regular variable stars, focussing on the accurate identification of microlensing amidst these.
In this work, we extend the scope of MicroLIA by including extended dark matter objects which act as non-point-like microlensing lenses, namely Boson Stars (BS) and Navarro-Frenk-White subhalos (NFW). For this, we generated microlensing light curves for these extended objects using their mass profiles, $m(\tau)$, which have previously been calculated in [23]. From these mass profiles we fit an interpolating function, and for each value of the impact parameter $u$ we solve the lensing equation (4) to obtain the total magnification (6) of the light curve.
The observed impact parameters are closely related to the cadence of a survey. For a lens with a characteristic timescale defined by the crossing of the Einstein radius, $t_E \equiv r_E/v$, and a minimum impact parameter $u_0$ producing a magnification peak at time $t_0$, the observed impact parameters are
$$u(t) = \sqrt{u_0^2 + \left(\frac{t - t_0}{t_E}\right)^2},$$
where $t$ is the survey time. Therefore, the cadence at which the survey collects observational data has a significant impact on the observed light curve. For this study, we considered two possible cases for the data collection timestamps. In the first case we used OGLE-II timestamps (but not the light curves nor their errors), which are input into MicroLIA and randomly sampled when simulating individual light curves. We note that the regularity of the cadence in OGLE-II is not significantly different from other iterations of the survey. In the second case we considered a perfectly regular daily cadence, where all timestamps are equally spaced by the same interval. As we will see below, this case reveals some identification opportunities that are obscured by irregular cadences, such as the case with OGLE-II timestamps. We will refer to these two cases as OGLE-II Timestamps and Regular Daily Cadence, respectively.
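The effect of cadence on a sampled light curve can be sketched as follows. Assuming the standard point-lens magnification formula, the snippet below samples the same underlying transit at a regular daily cadence and at a randomly thinned (irregular) cadence; all numbers are illustrative, and the actual pipeline uses MicroLIA with the real OGLE-II timestamps.

```python
import numpy as np

def impact_parameter(t, t0, tE, u0):
    """Impact parameter u(t) for a transit peaking at t0, with
    Einstein-radius crossing time tE and minimum impact u0."""
    return np.sqrt(u0**2 + ((t - t0) / tE) ** 2)

def point_lens_light_curve(t, t0, tE, u0):
    """Point-lens magnification sampled at the survey timestamps t."""
    u = impact_parameter(t, t0, tE, u0)
    return (u**2 + 2.0) / (u * np.sqrt(u**2 + 4.0))

# Regular daily cadence over a 200-day window (illustrative numbers)
t_daily = np.arange(0.0, 200.0, 1.0)
mu_daily = point_lens_light_curve(t_daily, t0=100.0, tE=20.0, u0=0.3)

# Irregular cadence: randomly keep ~40% of the nights, a crude
# stand-in for survey gaps (the study uses actual OGLE-II timestamps)
rng = np.random.default_rng(0)
t_irregular = t_daily[rng.random(t_daily.size) < 0.4]
mu_irregular = point_lens_light_curve(t_irregular, t0=100.0, tE=20.0, u0=0.3)
```

The irregular sampling can miss or poorly resolve the peak near $t_0$, which is precisely the information the classifier relies on.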
Both datasets were generated by simulating 100 000 light curves for each of the six classes: Cataclysmic Variables (CV), RR Lyrae and Cepheid Variables (VARIABLE), Mira long-period variables (LPV), Point-like Microlensing (ML), Boson Stars (BS), and NFW Subhalos (NFW). The CV, VARIABLE, LPV, and ML light curves were generated using MicroLIA's simulation, and BS and NFW were generated using our own simulation. We applied the same selection criteria to the three microlensing source events (ML, BS, NFW). The criteria are the same as MicroLIA's, but we further demanded that the observed magnification be at least 1.34, a common criterion imposed by microlensing surveys. The light curves were simulated with a minimum magnitude of 15, a maximum magnitude of 20, and with Gaussian noise.
The extended microlensing sources, BS and NFW, have mass profiles which depend on the parameter $\tau_m$, which follows some distribution depending on the prevalence of such objects. For this study, we sampled $\tau_m$ log-uniformly on $[0.5, 5]$, since, as discussed in the previous section, the physical distribution is strongly peaked at the $\tau_m$ reached at $x = 0.5$, and the largest deviations in the light curves are expected for $\tau_m \sim 1$. Finally, we sampled the minimal impact parameter, $u_0$, uniformly, with a different range for each of the three microlensing sources (for ML, $U(0, 1.5)$); we emphasise that these distributions are used during the simulation step, i.e. before the selection criteria, including $\mu \geq 1.34$, are imposed.
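The parameter sampling above can be sketched as follows; the log-uniform draw is our reading of "sampled logarithmically", and the $U(0, 1.5)$ range applies to the ML class only (the BS and NFW ranges differ).

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# tau_m drawn log-uniformly on [0.5, 5]: uniform in log(tau_m)
tau_m = np.exp(rng.uniform(np.log(0.5), np.log(5.0), size=n))

# Minimum impact parameter for the point-like (ML) class, U(0, 1.5);
# the BS and NFW ranges differ and are not reproduced here
u0_ml = rng.uniform(0.0, 1.5, size=n)
```

A quick sanity check on the log-uniform draw: its median sits at the geometric mean of the endpoints, $\sqrt{0.5 \times 5} \approx 1.58$, so the sample is weighted towards the small-$\tau_m$ end on a linear scale.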
In total, we generated 600 000 light curves, of which only O(0.1%), having missing values for some of the features, were dropped. The dataset was then split into train, validation, and test subsets with proportions 0.5:0.25:0.25, corresponding to around 300 000, 150 000, and 150 000 light curves respectively, with the six classes equally represented in each of the sets. For each light curve, we computed 74 features using the light curve time series, in addition to the same 74 using the derivative of the light curve time series, for a total of 148 features. The computed features relate to statistics of the time series of the light curve, as well as other time series quantities. See [29] and MicroLIA's API reference for the full description of each feature. The training set was used to conduct exploratory data analysis and to train machine learning models. The validation set was used for model selection and comparison, and to produce preliminary analysis plots and statistics. The test set was used to produce the final analysis plots and statistics, presented in the next section. The full dataset can be downloaded from [30].
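A 0.5:0.25:0.25 stratified split of this kind can be reproduced with scikit-learn, for example as below (with a small synthetic stand-in for the 148-feature table):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 6 balanced classes, a handful of features
rng = np.random.default_rng(0)
n_per_class, n_features = 1000, 10
X = rng.normal(size=(6 * n_per_class, n_features))
y = np.repeat(np.arange(6), n_per_class)

# 0.5 : 0.25 : 0.25 split, stratified so all classes stay balanced
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=0)
```

Stratifying on the labels guarantees the six classes remain equally represented in each subset, as in the splits described above.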
For the multiclassification task, we trained a histogram-based gradient boosted classifier, a member of the broader machine learning family of gradient boosting machines (GBM). Like Random Forests (RF), implemented in MicroLIA, GBM are ensemble models that leverage the power of multiple weak estimators, usually small trees, to produce a strong estimator. In RF the trees are independent of each other and the final prediction is the ensemble average over the predictions of all the trees in the forest. In GBM the trees are not independent but are trained sequentially, so that each improves on the previous iteration. Schematically, consider $F(X)$ to be the output of the GBM as a function of the data $X$. In the first iteration, we train a simple estimator, $F_1(X)$, to match a target label $y$. Since the estimator is simple, it will not be very accurate and the prediction will have a certain residual error, $y - F_1(X)$. In the second iteration, we train another weak estimator, $h_1(X)$, not on the desired output but on the residual error of the previous step, to create a new, improved estimator $F_2(X) = F_1(X) + h_1(X)$; in the ideal case $h_1(X) = y - F_1(X)$ and we would have the desired prediction. However, each step leaves a residual error $y - F_m(X)$, so the process can be repeated until the desired accuracy (or the maximum number of iterations) is reached. The final GBM output is then the sum over all the steps, $F(X) = F_1(X) + \sum_m h_m(X)$. This presentation is schematic and can be generalised to any problem with a differentiable loss function.
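The residual-fitting scheme described above can be demonstrated on a toy regression problem. This is a pedagogical sketch with squared loss, a fixed learning rate, and shallow scikit-learn trees, not the classifier used in this work:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy residual boosting: fit a sequence of shallow trees, each on the
# residual error left by the sum of the previous ones (squared loss)
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=500)

prediction = np.zeros_like(y)
for m in range(50):
    residual = y - prediction          # y - F_m(X)
    h = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    prediction += 0.1 * h.predict(X)   # F_{m+1} = F_m + 0.1 * h_m

mse = float(np.mean((y - prediction) ** 2))
```

The shrinkage factor (0.1 here) plays the role of a learning rate: each weak tree corrects only a fraction of the remaining residual, which is what makes the sequential ensemble robust.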
GBM have been known in the literature to be powerful estimators for tabular data, which is what we explore in this work. The usual implementation of GBM uses simple trees at each step. However, this requires sorting the data at each iteration, making tree-based GBM computationally heavy for datasets larger than a few tens of thousands of examples, which is our case. To mitigate this, a histogram-based variant has been developed that sorts and bins the data once (i.e., assigns each data point to the bins of a histogram), producing orders-of-magnitude speed improvements in training and prediction. In this work, we used the histogram-based GBM implemented by scikit-learn's HistGradientBoostingClassifier (HGBC).
In a preliminary study, we compared the HGBC against other GBM implementations and scikit-learn's RF, observing all GBM to outperform the RF in terms of the Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC) evaluated on the validation set. We then tuned the hyperparameters of the HGBC, observing minimal improvements in its discriminating performance, for which reason we decided to keep the default hyperparameters for the rest of the study presented herein.

IV. ANALYSIS
In this section, we present the multiclassification analysis on the two generated datasets: the first generated with the OGLE-II timestamps, and the second generated with an ideal regular daily cadence. The purpose of performing two analyses is to assess whether and how the discrimination power is affected by the cadence.

A. OGLE-II Timestamps
In Fig. 3 we present the confusion matrix of the six-way (All vs. All) multiclassification performed by the HGBC. We observe that the non-microlensing events (CV, LPV, VARIABLE) have minimal to nonexistent overlap with any other class. Conversely, the HGBC predictions for the microlensing events (ML, BS, and NFW) overlap significantly. However, we observe that BS suffers considerably less contamination from ML and NFW events than those two do from each other. This suggests that BS events are the easiest of the three microlensing cases to isolate, and therefore to detect.
Since Fig. 3 suggests that BS events are easy to isolate, it is important to understand the nature of the BS we identify. All BS follow the same mass profile, m(τ), which can be wider or narrower depending on the parameter τ_m, associated with the radius of the BS. In Fig. 4 we show how the probability of a BS being identified as such by the HGBC depends on τ_m. We find that there is a sweet spot at τ_m ≃ 2 that maximises the correct identification of BS, with some BS light curves being classified with high confidence, i.e. with P(y = BS|X) ≃ 1. This also motivates, a posteriori, our choice of the interval 0.8 < τ_m < 3 in Fig. 2. In Fig. 1 we present the 10 BS light curves with the highest value of P(y = BS|X), and we observe that they all exhibit the three-peak magnification profile produced by caustics that one expects from BS sources.

[Fig. 3: Confusion matrix of the All vs. All classification with OGLE-II timestamps.]

Our analysis so far has shown that the discrimination between microlensing sources (ML, BS, NFW) and other sources of light curves (CV, LPV, VARIABLE) is a relatively easy task when the OGLE-II cadence is observed, with almost nonexistent overlap between these two broad classes. As such, we now focus on the more difficult three-way classification aimed at discriminating between the three microlensing sources. In Fig. 5 we present the ROC curve and the AUC of each class (versus the other two) for a HGBC trained on this three-way multiclassification problem.
The advantage of analysing a ROC curve, and its area, over a confusion matrix is that the ROC curve captures the classification performance over all possible output threshold cuts, while the confusion matrix only shows the true and false positives when the predicted class is assigned as the one with maximal probability. For example, in Fig. 5 we can see that BS light curves can be isolated with high purity (i.e. with a False Positive Rate around 0.0), while this is not possible for either ML or NFW light curves, again reinforcing that BS light curves are easier to identify than the remaining microlensing sources due to their unique three-peak profile. Furthermore, Fig. 5 suggests that point-like microlensing light curves are only slightly easier to identify than NFW ones, since their ROC curve has a True Positive Rate greater than that of the NFW for all values of the False Positive Rate. This is also evidenced by the ML AUC, 0.65, which is greater than that of the NFW, 0.62.
Having been able to isolate the BS light curves from other microlensing sources, we now turn to interpreting our results. Unfortunately, although HGBC are incredibly powerful estimators, they do not provide a clear way of interpreting their predictions, a common challenge when using machine learning in high-dimensional multivariate analyses. In this study, we implemented a backward Sequential Feature Selection (SFS) loop to assess which of the 148 features are relevant for this classification task. The backward SFS loop starts by fitting an HGBC using all 148 features and evaluating its performance on the validation set. We then train 148 HGBCs, one on each of the 148 possible subsets of 147 features (i.e., with one feature removed), and evaluate their performance on the validation set. Keeping only the best set of 147 features, the same step can be repeated for the 147 subsets of 146 features, and so on, for a total of 148 iterations and 11 026 HGBCs trained on 11 026 feature subsets, a feasible study given the increased training speed offered by the HGBC. In Fig. 6 we show the performance of the HGBC on the microlensing-focused three-way multiclassification for each of the possible sources and their geometric mean. We see that only around 25 of the 148 features derived from the light curve time series and its derivative are relevant for this task. We also note that the discrimination performance degrades for all three classes at the same stage, suggesting that the final 25 features are relevant for all three classes. The list of the final 25 features by survival ranking is presented in Table I.
For the dataset using OGLE-II timestamps, which is the subject of the analysis in this section, we find that the complexity of the time series is the most important feature for separating the light curves of the three microlensing sources, surviving the whole backward selection loop until it is the only feature left. In Fig. 7 we show the distribution of the time series complexity for BS against τ_m, where we observe a boost in the time series complexity at τ_m ≃ 2, the same sweet spot identified above in Fig. 4, suggesting that the complexity of the time series of a BS light curve is a driving feature for it to be correctly identified as a BS. The same is not observed for the NFW events, which we do not show for the sake of presentational tidiness.
In Fig. 1 we show the 10 most distinctive BS light curves, i.e. the 10 BS light curves with the highest P(y = BS|X) as identified by the HGBC trained on the three-way multiclassification task on the OGLE-II dataset. These light curves all exhibit three very clear magnitude peaks, a hallmark feature arising from the caustics produced by the extended nature of the lens.

B. Regular Daily Cadence Timestamps
So far our analysis has focused on light curves simulated using OGLE-II timestamps. This reflects well the sensitivity to extended objects in a realistic microlensing survey, with its given observational constraints. However, one might wonder to what extent the conclusions drawn so far are sensitive to the details of the observation cadence, especially its irregularity. To address this, we repeat the analysis with light curves simulated with a regular daily cadence, i.e. where observations are taken exactly 24 hours apart.
We begin with the six-way All vs. All multiclassification task. In Fig. 8 we show the confusion matrix obtained using the HGBC. Although similar to its counterpart with OGLE-II timestamps, Fig. 3, it has some noticeable differences. First, we notice that the contamination in BS positive predictions has decreased from O(10%) down to O(1%), suggesting an improved capacity of the HGBC to produce a pure sample of BS light curves. Secondly, we notice that the cross-misclassification between NFW and ML light curves has also decreased, improving upon the OGLE-II timestamps case.
As before, we now focus on the three-way multiclassification problem restricted to the three microlensing sources. The ROC curves for each of the three classes and their respective AUC can be seen in Fig. 10. Again, there are some noticeable differences from the OGLE-II case shown in Fig. 5. The first thing to notice is that all AUC have increased compared to the OGLE-II timestamps case, indicating an easier discrimination when using regular daily cadence timestamps. Next, the ROC curve for the BS is significantly more peaked at small False Positive Rate, providing further evidence that BS light curves are better classified with regular daily cadence. More interesting, however, is that the ML and NFW ROC curves no longer follow the same trend as in the OGLE-II case. More precisely, we can observe that they now cross, whereas before they did not. This happens halfway through the curve, with the NFW ROC curve having the upper hand over the ML ROC curve at lower values of the False Positive Rate; in addition, the NFW ROC curve attains a non-vanishing True Positive Rate at low False Positive Rate values. This implies that with regular daily cadence it is possible to isolate NFW light curves with little to no contamination from any other class, something that was impossible to achieve using the OGLE-II cadence.
The previous result points to the possibility, in principle, of completely isolating NFW light curves. It is then important to understand the nature of the NFW light curves that we can isolate. In Fig. 11 we show how the probability of an NFW light curve being correctly identified as such varies with the NFW τ_m parameter. Although less pronounced than what we saw before for the BS light curves, we can observe a sweet spot for correct classification at 1 ≲ τ_m ≲ 2, with some light curves being assigned P(y = NFW|X) ≃ 1. We notice that the equivalent scatter for the OGLE-II timestamps, not shown here to declutter the presentation, does not exhibit this pattern, suggesting that there is important information in the light curve that can only be accessed with regular cadence.
Contrary to BS light curves, NFW light curves do not have a clear visual profile compared to point-like microlensing sources. In Fig. 12 we compare the 100 most easily identifiable point-like microlensing curves, i.e. those with the highest P(y = ML|X), against the 100 most easily identifiable NFW curves, i.e. those with the highest P(y = NFW|X), where the probabilities are obtained from the HGBC trained on the three-way multiclass classification on the regular cadence timestamps dataset. To aid visual comparison, we normalised the magnitudes by min-max scaling their values into the [0, 1] interval. Although the difference is very nuanced, the NFW curves tend to be narrower than the ML curves. This subtlety explains why regular cadence is so important for identifying NFW light curves: one needs a good resolution of the light curve profile to capture this nuance, which proved impossible when using the OGLE-II timestamps.
Finally, we perform a backward SFS for the regular daily cadence dataset using a three-way HGBC focused on the microlensing classes. The discrimination performance versus the number of features is shown in Fig. 13, where we observe a trend similar to that already discussed for the OGLE-II timestamps dataset, with the ROC AUC degrading significantly below 25 features. However, it is worth noting that the performance degrades first for the NFW and ML cases, with the ROC AUC associated with the classification of the BS light curves staying at its maximum value until around 15 features remain. This suggests that fewer relevant features are needed to distinguish BS light curves from the other classes than to correctly identify the others. In Table I we present the top 25 features, and we observe considerable overlap with the OGLE-II study. Perhaps curiously, we count 10 of the features to have been computed on the derivative of the time series, whereas for OGLE-II the number is six, possibly hinting at the importance of the derivative of the time series for identifying the nuanced differences in shape between the ML and the NFW light curve magnification peaks.

V. DISCUSSION
In this work we have studied the observation of microlensing signatures of extended dark objects in time series data, and their distinction from other signals, most importantly point-like lenses. We have focused on two lens profiles, each exemplifying a class of objects: the profile of an NFW subhalo is more peaked, whereas boson stars have a more diffuse profile. As expected, boson stars can be confidently distinguished from point-like lenses for 0.8 ≲ τ_m (= r_lens/r_E) ≲ 3. This is due to the characteristic caustics in the light curves of these objects.
As an exercise, we studied how the regularity of the time series cadence affects the confidence of the detections. Interestingly, for regular daily cadence we also find confident detections of NFW subhalos. These detections occur for 0.9 ≲ τ_m ≲ 4, despite the fact that no caustics are observed in the light curves. Though the regularity of (daily) observations depends on many conditions that are beyond the observer's control, this is an interesting observation which warrants investigation for other microlensing scenarios and targets.
Stellar binaries or exoplanets orbiting a lensing body may give rise to perturbations and caustics in the light curve. Unlike for extended dark objects, these are asymmetric or one-sided features (see e.g. [31]). Microlensing has allowed for the discovery of over 100 exoplanets, particularly of near-terrestrial size [32]. In this work, we have not considered these signatures, which may give rise to confusion with low-significance boson star identifications. We leave such an analysis for future work.
We performed the analysis in this work by extending the MicroLIA algorithm, which utilises 148 features derived from the light curves in a single band to distinguish between objects. Our SFS analysis showed that of these, only 25 were needed for optimal classification. For our regular cadence analysis, further features could help distinguish between NFW subhalos and point-like lenses. In future work, we will study the potential benefits of learning directly on the light curves.
We note that in our analysis we have assumed a Gaussian noise model, which may not be realistic. In future work towards the application of our methodology, further noise models should be considered; these could lead to larger misclassification rates between classes of events.
Microlensing surveys typically only release data on candidate events, which have passed a strict selection that would likely have cut our extended-dark-object light curves. Exceptions include the UKIRT Microlensing Survey [33, 34] and the VISTA Variables in the Via Lactea Survey [35], which we plan to analyse in future work. Upcoming microlensing opportunities include the launch of infrared astronomy experiments in the mid-2020s. In particular, the Nancy Grace Roman Space Telescope (previously WFIRST; a NASA space mission) has as a key objective the discovery of exoplanets through the microlensing technique [36], but will also probe primordial black holes [37] and extended dark objects.
Finally, for some surveys, the finite extent of the source becomes important. This is the case, for example, for the Subaru-HSC survey of M31 [38]: because of its sensitivity to small transit times (and hence small Einstein radii), the angular extent of source stars corresponds to a distance at the lens larger than the Einstein radius, suppressing the magnification relative to point-like sources [16, 39-41]. We leave an analysis of the magnification curves with finite-source effects for future work.

SOFTWARE
For the computation of the mass profile of the extended sources and the sensitivity estimation, we used Mathematica version 12. The dataset generation, machine learning training, and the final analysis were performed in Python 3.9.18, making use of several packages, of which we make special note:
• Light curve simulation and time series feature extraction were performed using MicroLIA. We used the code directly from the MicroLIA GitHub repository, as it includes considerable changes relative to the packaged version 2.6.0. More precisely, the version used makes proper use of the derivative time series, as well as its errors as statistical weights. These two fixes were contributions of this work.
The full list of packages and their versions can be found in the requirements.txt file included in the repository hosting the code used in this work.
FIG. 1. The 10 most distinctive boson star light curves, using the dataset generated with OGLE-II timestamps.

FIG. 3. Confusion matrix for the six-way All vs. All multiclassification performed by the HGBC using the dataset generated with OGLE-II timestamps. The entries are rounded to three significant digits.

FIG. 4. Probability of a BS light curve being correctly identified as one by the HGBC versus the boson star τ_m parameter, using the dataset generated with OGLE-II timestamps.
FIG. 5. ROC curves and their areas obtained from a HGBC trained on the three-way multiclassification task using the dataset generated with OGLE-II timestamps.

FIG. 6. Backward SFS (features reduced from right to left) for a HGBC trained on the three-way multiclassification task using the dataset generated with OGLE-II timestamps.

FIG. 8. Confusion matrix for the six-way All vs. All multiclassification performed by the HGBC using the dataset generated with regular daily cadence timestamps. The entries are rounded to three significant digits.

FIG. 9. Probability of a BS light curve being correctly identified as one by the HGBC versus the boson star τ_m parameter, using the dataset generated with regular cadence timestamps.
FIG. 10. ROC curves and their areas obtained from a HGBC trained on the three-way multiclassification task using the dataset generated with regular daily cadence timestamps.

FIG. 11. Probability of a NFW light curve being correctly identified as one by the HGBC versus the NFW τ_m parameter, using the dataset generated with regular daily cadence timestamps.

TABLE I. The 25 surviving features, ranked by survivability of the backward Sequential Feature Selection loop. Features computed on the derivative of the time series are identified as such. See [29] for details about the features.