Over the summer of the 2024–25 academic year, Codie Gerlach-Wood, a PhD student on the Compass programme, co-supervised three undergraduate research interns alongside Dr Juliette Unwin. What follows is a blog post by these students – Hou Hin Ip, Ka Nam Lam, and Joshua Man Yu Ng – detailing their work.
Introduction
Traditionally, predicting social indicators in developing regions has relied on large-scale household surveys, which are expensive and time-consuming. As a result, researchers have been exploring alternative methods, such as using low-cost, publicly available satellite imagery to estimate these variables [1].
By applying machine learning to satellite images, such techniques have significant potential as tools for analysing demographic, health, and development indicators. Our eight-week bursary research project focuses on KidSat [2], a model pipeline for applying advanced computer vision models to the challenge of child poverty estimation. After reviewing the original pipeline, we extended the work presented in the paper by introducing several modified data processing measures and a spatial encoding framework to enrich the information during the fine-tuning stage.
As a result, we demonstrate that an improved data cleaning and imagery selection pipeline, combined with a geographical encoder and a LightGBM regression head, reduces the Mean Absolute Error by 23.19%, and that the model remains consistent when scaled to 34 African countries, up from 16 countries in the original paper.
The Original Model Architecture
The KidSat project used satellite imagery to predict the proportion of individuals under the age of 18 within a region (cluster) who meet the condition of “severe deprivation”. According to UNICEF [3], a child is considered deprived if they experience deprivation in at least one of six criteria: health, nutrition, water, education, sanitation, or housing. To calculate this target variable, large-scale household surveys from the DHS programme are utilised. From these surveys, 17 key variables are collected to calculate the targeted index.
The KidSat paper employed computer vision models to learn patterns from different satellite images and key variables for fine-tuning. Assessed using five-fold cross-validation, the fine-tuned model generates an embedding from each satellite image, which is then passed to a linear regression head to predict severe deprivation. The paper conducted experiments with various models, including DINOv2, SatMAE, and MOSAIKS. Among these, DINOv2, developed by Meta, demonstrated the best performance in terms of Mean Absolute Error (MAE).

Data Processing and Data Quality
As the DINOv2 model requires a normalised dataset for fine-tuning, with all values scaled between 0 and 1, we applied several fundamental data-cleaning rules to ensure no outliers or missing values were present during the fine-tuning stage. We also explored new variables, such as a rural–urban indicator and a relative wealth index, to provide additional information for the model.
In addition to adjusting existing cleaning methods to produce more reliable data for training, we identified further issues that required attention:
Quality of Image Data
Our primary image source is Landsat 7/8 imagery from Google Earth Engine, which often suffers from corruption such as Scan Line Corrector (SLC) errors and obstruction by clouds and other weather conditions.
Although images from different timestamps within the same clusters are captured in the original pipeline, there is no mechanism for selecting the best images for fine-tuning and evaluation. To address this, we developed a system to detect corrupted pixels. We also implemented a cloud detection algorithm to identify pixels with a high likelihood of being classified as clouds, inspired by the FMASK approach [4]. By combining the two algorithms, we were able to provide recommendations for selecting imagery with the lowest proportion of cloud or corrupted pixels.
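The selection rule above can be sketched as follows. The thresholds, the per-band logic, and the function names are illustrative assumptions for this post, not the exact values used in our pipeline:

```python
import numpy as np

def corruption_fraction(img, dark_thresh=0.02, bright_thresh=0.85):
    """Fraction of pixels flagged as corrupted (SLC gaps appear as near-zero
    stripes) or cloud-like (thick cloud is bright in every band).

    img: float array of shape (H, W, bands), values scaled to [0, 1].
    The thresholds are illustrative, not the project's calibrated values.
    """
    dark = (img < dark_thresh).all(axis=-1)
    bright = (img > bright_thresh).all(axis=-1)
    return float((dark | bright).mean())

def pick_best_image(images, max_bad=0.30):
    """Return the index of the cleanest image for a cluster,
    or None if every candidate exceeds the cutoff."""
    scores = [corruption_fraction(im) for im in images]
    best = int(np.argmin(scores))
    return best if scores[best] <= max_bad else None
```

In practice a spectral-index-based cloud test (as in FMASK) replaces the simple brightness rule, but the selection logic is the same: score every timestamp, keep the cleanest, and drop clusters with no acceptable image.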

Problem of One-Hot Encoding
To preprocess categorical data, the original paper applied one-hot encoding, expanding 17 DHS variables into 99 columns. However, the resulting features were highly sparse, with many columns consisting of over 99% zeros, adding noise and hindering model performance. To address this, we re-categorised overly sparse categories into larger, meaningful categories while preserving key information.
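As a sketch of this re-categorisation, the helper below collapses one-hot columns whose positive rate falls below a cutoff into a single "other" column. The column names, the 1% cutoff, and the prefix convention are illustrative assumptions, not the project's exact rules:

```python
import pandas as pd

def merge_sparse_onehots(df, prefix, min_frac=0.01, other_name="other"):
    """Collapse one-hot columns sharing `prefix` whose positive rate is
    below `min_frac` into a single `<prefix>_<other_name>` column."""
    cols = [c for c in df.columns if c.startswith(prefix)]
    sparse = [c for c in cols if df[c].mean() < min_frac]
    if len(sparse) > 1:
        df = df.copy()
        # Logical OR of the rare dummy columns becomes one merged indicator.
        df[f"{prefix}_{other_name}"] = df[sparse].max(axis=1)
        df = df.drop(columns=sparse)
    return df
```

Applied per DHS variable, this kind of merge is what took our feature set from 99 columns down to 51 while keeping each original category represented somewhere.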
Extension of KidSat: Incorporating Spherical Harmonic Information
We noticed that the original model relies heavily on satellite image quality and DHS data, leaving significant room to improve accuracy by incorporating low-cost additional information. In line with the findings of Rußwurm et al. [5], we adopt a spherical harmonics (SH) + Sinusoidal Representation Networks (SIREN) location encoder to inject geometry-aware geographic information alongside the imagery.
The following section provides a brief description of our implementation process and the mathematical background involved; the detailed coding work can be found on our GitHub.
Mathematical Foundation
Spherical Harmonics Basis Functions
SH provide a natural coordinate system for functions defined on spherical surfaces. Any function $f(\lambda,\phi)$ on a sphere, where $\lambda\in[-\pi,\pi]$ and $\phi\in[-\frac{\pi}{2},\frac{\pi}{2}]$ represents longitude and latitude respectively, can be expressed as a weighted sum of orthogonal SH basis functions:
$$f(\lambda,\phi)=\sum^\infty_{\ell=0}\sum^\ell_{m=-\ell}\omega^m_\ell Y^m_\ell(\theta(\phi),\varphi(\lambda))$$
where $\omega^m_\ell$ are learnable weights, and $Y^m_\ell$ are the SH basis functions of degree $\ell$ and order $m$, evaluated at the colatitude $\theta(\phi)$ and azimuth $\varphi(\lambda)$; these are elaborated in the section “Complex Spherical Harmonics with Real-Imaginary Decomposition”.
Link to Kernels
The connection between SH and kernels can be found in the work of Minh et al. [6] and Dutordoir et al. [7]. By Mercer’s theorem on the sphere, any continuous positive-definite zonal kernel admits the SH expansion
$$
K(x,y)=\sum_{\ell=0}^{\infty}\sum_{m=-\ell}^{\ell}\lambda_{\ell, m}\,Y_{\ell, m}(x)\,\overline{Y_{\ell, m}(y)}.
$$
In this decomposition, the eigenvalues $\lambda_{\ell,m}$ quantify the contribution of each SH basis function to geographic similarity: low degrees $\ell$ (e.g., $\ell=0,1$) capture global spatial patterns, while high degrees (e.g., $\ell=15$, our implementation's default maximum degree) capture local details such as neighbourhood-level proximity or regional land-use patterns. The SH vector therefore acts as an explicit kernel feature map on the sphere.
Complex Spherical Harmonics with Real-Imaginary Decomposition
We employ the standard complex form of spherical harmonics (SH) and decompose each complex coefficient into its real and imaginary components for compatibility with machine learning frameworks. Each harmonic \(Y_{\ell}^{m}\) is computed using the complex formulation:
$$
Y_{\ell}^{m}(\theta,\varphi)
= \sqrt{\frac{(2\ell+1)}{4\pi}\,\frac{(\ell-|m|)!}{(\ell+|m|)!}}\;
P_{\ell}^{|m|}(\cos\theta)\,e^{\,i m \varphi},
\qquad \ell\!\ge 0,\;-\,\ell\!\le m\!\le \ell,
$$
where $P_{\ell}^{m}$ are the associated Legendre polynomials,
$\theta=\frac{\pi}{2} - \phi \cdot \frac{\pi}{180}$ and $\varphi= \lambda \cdot \frac{\pi}{180}$ are the colatitude and azimuth angle respectively (with longitude $\lambda$ and latitude $\phi$ given in degrees).
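A minimal sketch of this basis evaluation, using SciPy's associated Legendre function `lpmv` (which includes the Condon–Shortley phase). The degree-15 truncation matches the default mentioned earlier; treating the inputs as degrees and the function name are assumptions for illustration:

```python
import numpy as np
from math import factorial
from scipy.special import lpmv

def sh_basis(lon_deg, lat_deg, L=15):
    """Evaluate the complex SH basis up to degree L at one location,
    split into real/imaginary parts for ML compatibility.

    Coordinates are in degrees: theta = pi/2 - lat (colatitude),
    varphi = lon (azimuth), matching the formulation above.
    """
    theta = np.pi / 2 - np.deg2rad(lat_deg)
    varphi = np.deg2rad(lon_deg)
    feats = []
    for l in range(L + 1):
        for m in range(-l, l + 1):
            norm = np.sqrt((2 * l + 1) / (4 * np.pi)
                           * factorial(l - abs(m)) / factorial(l + abs(m)))
            y = norm * lpmv(abs(m), l, np.cos(theta)) * np.exp(1j * m * varphi)
            feats.extend([y.real, y.imag])
    return np.array(feats)  # length 2 * (L + 1)**2, i.e. 512 for L = 15
```

The $\ell=0$ component is constant ($1/\sqrt{4\pi}$), reflecting the "global" end of the spectrum discussed above, while the high-degree components oscillate rapidly across the sphere.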
Integration with Neural Architecture
SIREN Network Integration
We extend the SH encoding with SIREN, which uses sine activation functions particularly well-suited for representing spatial patterns. The SIREN architecture maps the SH-encoded coordinates to a learned geographic embedding through multiple hidden layers with sine activations:
$$\operatorname{SIREN}(\operatorname{SH}(\lambda, \phi)) = \sin(W_n(\sin(W_{n-1}(\ldots \sin(W_1 \cdot \operatorname{SH}(\lambda, \phi) + b_1) \ldots) + b_{n-1}) + b_n))$$
Our default SIREN architecture consists of four hidden layers with 256 neurons each, producing a 128-dimensional learned geographic embedding.
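A minimal NumPy sketch of the forward pass above, using the SIREN initialisation scheme of Sitzmann et al. with frequency $\omega_0 = 30$. The layer sizes follow the text; the weights here are random and untrained, and the helper names are ours:

```python
import numpy as np

rng = np.random.default_rng(0)

def siren_init(fan_in, fan_out, w0=30.0, first=False):
    """SIREN weight init: U(-1/fan_in, 1/fan_in) for the first layer,
    U(-sqrt(6/fan_in)/w0, sqrt(6/fan_in)/w0) for later layers."""
    bound = 1.0 / fan_in if first else np.sqrt(6.0 / fan_in) / w0
    return rng.uniform(-bound, bound, size=(fan_in, fan_out))

def siren_forward(x, layers, w0=30.0):
    """Apply sin(w0 * (xW + b)) through each hidden layer, then a linear output."""
    for W, b in layers[:-1]:
        x = np.sin(w0 * (x @ W + b))
    W, b = layers[-1]
    return x @ W + b

# Shapes mirror the text: SH input (512-d for L=15) -> four hidden layers
# of 256 neurons -> 128-d geographic embedding.
dims = [512, 256, 256, 256, 256, 128]
layers = [(siren_init(dims[i], dims[i + 1], first=(i == 0)), np.zeros(dims[i + 1]))
          for i in range(len(dims) - 1)]
emb = siren_forward(rng.normal(size=(4, 512)), layers)  # (4, 128) embedding batch
```

The sine activations let the network represent high-frequency spatial variation that ReLU MLPs struggle with, which is why SIREN pairs naturally with the oscillatory SH inputs.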
Two-Stage Training Process
The integration of SH with SIREN follows a two-stage training approach:
- Pre-training Stage: The SH+SIREN network is pre-trained to predict poverty indicators directly from geographic coordinates, enabling it to learn location-specific spatial patterns and relationships.
$$\text{geo} = \operatorname{SIREN}([\Re Y_{\ell}^{m},\Im Y_{\ell}^{m}]_{\ell\le L})$$
- Feature Fusion Stage: The learned geographic embeddings from the pre-trained SH+SIREN network are concatenated with visual features extracted from satellite imagery using DINOv2: $$z = [\text{geo(loc)} \| \text{DINOv2(image)}]$$ This fused feature representation is then passed to a regression head for final poverty level prediction.
Regression Heads
As in the equation for the “Feature Fusion Stage”, the fused vector \(z\) combines visual and geographic information: DINOv2 encodes what a place looks like through visual similarity, while SH-SIREN encodes where it is through geographic similarity. This fusion leverages complementary information. To better exploit it, we evaluate a broader family of prediction heads beyond the linear ridge-regression baseline, motivated by the cross-modal interaction patterns inherent in the fused space.
While large self-supervised visual encoders often perform well with a linear regression head, the fused space introduces non-trivial cross-modal interactions that linear models cannot capture without explicitly engineered interaction terms. Concretely, a linear head cannot represent multiplicative, context-dependent effects between visual and geographic information.
In our setting, the same visual cue can carry different poverty-related meanings across geographic contexts. For instance, “sparse vegetation” is typical in arid regions (e.g., the Sahel) and weakly informative about poverty level, but in humid regions (e.g., rural Southeast Asia), it can indicate soil degradation or limited irrigation, correlating with higher deprivation. These conditional relationships are inherently multiplicative or non-linear, motivating predictors that can capture such interactions.
We therefore report results for ridge (baseline), Random Forest, LightGBM, XGBoost, and a shallow MLP.
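The fusion step and the linear baseline head can be sketched as below. The 768-dimensional image embedding is an assumption (the DINOv2 ViT-B output size; the exact dimension depends on the backbone variant), and in practice a tree ensemble such as LightGBM would be dropped in where the closed-form ridge fit appears:

```python
import numpy as np

def fuse(geo_emb, img_emb):
    """Feature fusion: z = [geo(loc) || DINOv2(image)], concatenated per cluster."""
    return np.concatenate([geo_emb, img_emb], axis=1)

def ridge_fit(X, y, alpha=1.0):
    """Closed-form ridge regression, the linear baseline head:
    w = (X^T X + alpha I)^(-1) X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ y)

# Toy stand-ins for real embeddings (dimensions per the text / assumption).
rng = np.random.default_rng(1)
geo = rng.normal(size=(100, 128))   # SH(+SIREN) geographic embedding
img = rng.normal(size=(100, 768))   # DINOv2 image embedding (assumed dim)
z = fuse(geo, img)                  # (100, 896) fused representation
w = ridge_fit(z, rng.uniform(size=100))
pred = z @ w                        # predicted deprivation per cluster
```

Swapping the head only changes the last two lines; the fused representation `z` is shared across ridge, Random Forest, LightGBM, XGBoost, and the MLP.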

Results and Discussion
Improvements in Data Cleaning Process and Imagery Quality
By re-categorising and merging one-hot DHS variables, and introducing new variables, we reduced the fine-tuning dimension from 99 to 51. This approach yielded a consistent improvement in MAE by reducing noise and redundant columns.
For the imagery quality issues, we observed that nearly 15% of the selected imagery contained more than 30% cloud or dark corrupted pixels. We therefore applied a cutoff: affected imagery is replaced with a cleaner alternative when available, and the cluster is discarded otherwise. Unsurprisingly, this algorithm, applied alone or in combination with the column reduction, further improved MAE by ensuring higher image quality for learning and inference.
Table 1: Comparison of baseline and improvement strategies, evaluated on NVIDIA A40 GPU.
| Method (Ridge regression head) | MAE |
|---|---|
| Baseline | 0.2167 (± 0.0013) |
| Strategy 1: Re-categorize Columns + Add New Variables | 0.2045 (± 0.0012) |
| Strategy 2: Improvement in Imagery Choice | 0.2102 (± 0.0005) |
| Adopt Strategy 1 + Strategy 2 | 0.1980 (± 0.0007) |
Implementation of Spherical Harmonic Information
Table 2 demonstrates that adding the SH encoder significantly improved MAE, as this geographical information is strongly correlated with deprivation indicators. Moreover, it helped improve predictions for clusters with slightly corrupted imagery.
Regarding regression heads, LightGBM achieved the best performance when combined with SH encoding, demonstrating that tree ensembles outperform the ridge baseline. This supports the earlier discussion that modeling non-linear interactions between geographic context and visual cues is critical for maximising the predictive power of the fused representation.
Surprisingly, SH + SIREN underperformed plain SH. This can be explained by overfitting and representation overlap: since the SIREN network was pretrained on the same DHS poverty target used to supervise DINOv2 fine-tuning, the geo branch lost its role as a label-agnostic location prior and instead learned redundant features overlapping with the visual branch. Also, the extra capacity of the periodic MLP (SIREN) increased variance and led to overfitting in this data regime.
Table 2: Comparison of MAEs for regression heads under different geo-encoder settings, evaluated on A40 GPU.
| Encoder | Regression Head | Without Data/Imagery Improvements | With Data/Imagery Improvements |
|---|---|---|---|
| Without Geo Encoder | Ridge | 0.2167 ± 0.0013 | 0.1980 ± 0.0008 |
| SH + SIREN | Ridge | 0.2154 ± 0.0008 | 0.1885 ± 0.0018 |
| SH + SIREN | XGBoost | 0.2110 ± 0.0009 | 0.1853 ± 0.0022 |
| SH + SIREN | LightGBM | 0.2110 ± 0.0014 | 0.1847 ± 0.0018 |
| SH + SIREN | Random Forest | 0.2146 ± 0.0010 | 0.1855 ± 0.0018 |
| SH + SIREN | MLP | 0.2149 ± 0.0011 | 0.1897 ± 0.0013 |
| SH Encoder | Ridge | 0.2031 ± 0.0010 | 0.1888 ± 0.0011 |
| SH Encoder | XGBoost | 0.1833 ± 0.0007 | 0.1771 ± 0.0020 |
| SH Encoder | LightGBM | 0.1835 ± 0.0006 | 0.1759 ± 0.0016 |
| SH Encoder | Random Forest | 0.1888 ± 0.0010 | 0.1797 ± 0.0018 |
| SH Encoder | MLP | 0.1902 ± 0.0007 | 0.1823 ± 0.0024 |
Final Output and Further Exploration

Based on the experiments above, adopting SH encoding with a LightGBM regression head, together with the improved data processing pipeline and the modified imagery selection process, achieved the best performance, reducing MAE by 23.19%.

While the original KidSat project covered only 16 countries in Southern and Eastern Africa, we expanded training to an additional 18 countries in Central and Western Africa. Using SH encoding and LightGBM, we obtained an MAE of 0.1658 ($\pm$ 0.0005), demonstrating the consistency of the model despite demographic differences between these regions.
Conclusions and Reflections
Throughout this project, we explored the use of satellite imagery for predicting social variables and successfully improved accuracy by adopting various methods. Our key takeaway is that, when combined with low-cost and representative data such as geolocation information, satellite imagery can partially substitute for household surveys.
Our investigation has several limitations. For instance, the cloud detection method for image selection requires abundant satellite imagery and may underperform when data are scarce. Future work could incorporate additional accessible datasets, such as nighttime imagery or different spectral band combinations [8], to further supplement the model. Exploring alternative geo-encoding methods and more advanced fusion techniques, such as attention-based or contrastive cross-modal learning, may also improve performance.
Finally, we’re grateful to have participated in this 8-week bursary project. Despite the challenges of applying techniques we had never learned before, it was a valuable opportunity to experience research life and to learn how to apply data cleaning, modeling, and machine learning principles to real-world work. We especially thank Ettie and Codie for their guidance and helpful recommendations throughout the supervision!
References
1. Ola Hall et al. (2023). “A review of machine learning and satellite imagery for poverty prediction: Implications for development research and applications.” Journal of International Development, 35(7):1753–1768.
2. Makkunda Sharma et al. (2024). “KidSat: Satellite imagery to map childhood poverty dataset and benchmark.” arXiv preprint arXiv:2407.05986.
3. UNICEF (2024). “Child poverty: Overview.” UNICEF Data. Available at: https://data.unicef.org/topic/child-poverty/overview/.
4. Shi Qiu et al. (2017). “Improving Fmask cloud and cloud shadow detection in mountainous area for Landsats 4–8 images.” Remote Sensing of Environment, 199:107–119.
5. Marc Rußwurm et al. (2023). “Geographic location encoding with spherical harmonics and sinusoidal representation networks.” arXiv preprint arXiv:2310.06743.
6. Ha Quang Minh, Partha Niyogi, and Yuan Yao (2006). “Mercer’s theorem, feature maps, and smoothing.” In International Conference on Computational Learning Theory. Springer Berlin Heidelberg.
7. Vincent Dutordoir, Nicolas Durrande, and James Hensman (2020). “Sparse Gaussian processes with spherical harmonic features.” In International Conference on Machine Learning. PMLR.
8. Fan Yang et al. (2024). “Uncertainty-aware regression for socio-economic estimation via multi-view remote sensing.” arXiv preprint arXiv:2411.14119.