Multi-Scale Representation Learning


OUSD (R&E) MODERNIZATION PRIORITY: Artificial intelligence/machine learning


TECHNOLOGY AREA(S): Information systems, modeling and simulation technology


OBJECTIVE: Develop a single neural network that learns representations at multiple spatial and semantic scales and that may be applied to different geospatial tasks, such as land cover segmentation, object detection, key-point matching, and few-shot/fine-grained/long-tailed classification.


DESCRIPTION: NGA is interested in a single, hierarchical network that learns representations at multiple spatial and semantic scales and can improve performance across the stages common to diverse geospatial computer vision pipelines.


Existing representation learning techniques are often tailored to a specific task such as semantic segmentation, classification, object detection, or key-point matching. As a result, the trained feature extractors focus on global image-level features, object-level features, or local interest-point features, but do not extract all such feature types well across varying scales. This problem is compounded when introducing data that differs fundamentally in format, such as imagery with more or fewer bands than the data the feature extractor was trained on.
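As an illustration of the band-count mismatch, one common workaround (a sketch of general practice, not a method prescribed by this topic) is to "inflate" a backbone's pretrained first-layer RGB kernels to an arbitrary number of input bands by averaging the pretrained channels and replicating the result, rescaled to preserve activation magnitude. The function name and the plain-list weight layout below are hypothetical:

```python
def inflate_input_kernel(rgb_kernel, num_bands):
    """Adapt a pretrained first-layer conv kernel from 3 RGB input
    channels to `num_bands` channels (e.g. 4-16 band satellite imagery).

    `rgb_kernel` is a nested list shaped (out_ch, 3, k, k). Each new
    band channel gets the mean of the three pretrained channels,
    rescaled by 3 / num_bands so the expected activation magnitude
    (sum over input channels) is preserved.
    """
    out = []
    for filt in rgb_kernel:  # one filter per output channel
        k = len(filt[0])
        # positionwise mean over the 3 pretrained input channels
        mean = [[sum(filt[c][i][j] for c in range(3)) / 3.0
                 for j in range(k)] for i in range(k)]
        scale = 3.0 / num_bands
        chan = [[w * scale for w in row] for row in mean]
        # replicate the rescaled mean channel num_bands times
        out.append([[row[:] for row in chan] for _ in range(num_bands)])
    return out
```

This keeps the pretrained spatial structure of each filter while letting the same backbone ingest multispectral input; the new band channels can then be fine-tuned to differentiate.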


Recent advances in representation learning and generalizability suggest that such a one-network solution may be on the horizon. CNN feature extractors for object detection have for several years used feature pyramid networks, which are architected to extract features at different scales [1]. Self-supervised learning has now matched or exceeded the transferability of supervised techniques and has demonstrated promising performance on diverse downstream tasks that require learning different feature types and scales [2, 3]. Transformers, which have achieved state-of-the-art (SoTA) results on a variety of vision benchmarks, have shown the ability to work across mid-sized and small image scales when pre-trained on large datasets, have reached parity with SoTA on self-supervised vision tasks, and have been successfully applied to remote sensing [4, 5, 6]. Moreover, the attention layers of a pre-trained frozen transformer generalize across a wide variety of data types and tasks, for example from language to vision [7].
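The core mechanism behind feature pyramid networks [1] can be sketched in miniature: a bottom-up pass produces progressively coarser feature maps, and a top-down pass upsamples each coarse (semantically stronger) map and merges it into the next finer one. The toy sketch below uses average pooling for the bottom-up path and elementwise addition for the merge, omitting the learned lateral 1x1 convolutions of a real FPN; all function names are illustrative:

```python
def pool2x2(grid):
    """Downsample a 2D feature map by 2x2 average pooling."""
    h, w = len(grid) // 2, len(grid[0]) // 2
    return [[(grid[2*i][2*j] + grid[2*i][2*j+1] +
              grid[2*i+1][2*j] + grid[2*i+1][2*j+1]) / 4.0
             for j in range(w)] for i in range(h)]

def feature_pyramid(feat, levels=3):
    """Bottom-up path: feature maps at progressively coarser scales."""
    pyramid = [feat]
    for _ in range(levels - 1):
        pyramid.append(pool2x2(pyramid[-1]))
    return pyramid

def upsample2x(grid):
    """Nearest-neighbour 2x upsampling."""
    out = []
    for row in grid:
        wide = [v for v in row for _ in (0, 1)]
        out.append(wide)
        out.append(wide[:])
    return out

def fpn_merge(bottom_up):
    """Top-down path: start from the coarsest map and merge each
    upsampled map into the next finer one by elementwise addition
    (the lateral 1x1 convs of a real FPN are omitted for brevity)."""
    merged = [bottom_up[-1]]
    for fine in reversed(bottom_up[:-1]):
        up = upsample2x(merged[0])
        merged.insert(0, [[f + u for f, u in zip(frow, urow)]
                          for frow, urow in zip(fine, up)])
    return merged
```

The merged outputs carry coarse semantic context at every resolution, which is why detection heads attached at each pyramid level can handle objects of very different sizes from one shared backbone.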


PHASE I: Develop a neural network architecture that learns representations at multiple spatial and semantic scales, along with a pre-training methodology using publicly available satellite imagery and/or Government furnished WorldView-3 imagery. Self-supervised pre-training is preferred. Using the same pre-trained network backbone, demonstrate near-parity with SoTA on two different satellite imagery computer vision benchmark tasks requiring either different resolution imagery or different feature scales. Proposers are expected to identify in their proposals which benchmarks they will target.


PHASE II: Extend Phase I results to 4+ computer vision benchmarks using 3+ different image resolutions and/or feature scales. Develop techniques to use the same pre-trained backbone with 4-16 band imagery and demonstrate parity with SoTA on at least two associated benchmarks. Collaborate with NGA’s SAFFIRE program for testing and evaluation on classified imagery, and provide code and support for integration.


Deliverables include a comprehensive report on the architecture, training scheme, and benchmark performance, delivered to NGA at the conclusion of Phase I, the Phase II midpoint, and the Phase II conclusion; at least two papers submitted to academic journals or conferences by the conclusion of Phase II; all data procured, curated, and/or labeled during the period of performance; and unrestricted delivery (or open-sourcing) of all code. Proposing teams are expected to have a strong and ongoing academic publication track record on related research topics.


PHASE III DUAL USE APPLICATIONS: A single neural network that learns representations at multiple spatial and semantic scales has the potential to apply broadly to diverse machine learning tasks across Government and industry. For example, such technology could improve performance in all aspects of geospatial computer vision, as well as diverse fields such as facial recognition, self-driving cars, and robotics.



  1. Lin, T.-Y., et al., “Feature pyramid networks for object detection,” 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), doi: 10.1109/CVPR.2017.106.
  2. Xiao, T., Wang, X., Efros, A., Darrell, T., “What should not be contrastive in contrastive learning,” 9th International Conference on Learning Representations (ICLR 2021), Virtual Event, Austria, 3-7 May 2021.
  3. Xiao, T., Reed, C., Wang, X., Keutzer, K., Darrell, T., “Region similarity representation learning,” arXiv preprint arXiv:2103.12902.
  4. Dosovitskiy, A., et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” 9th International Conference on Learning Representations (ICLR 2021), Virtual Event, Austria, 3-7 May 2021.
  5. Caron, M., et al., “Emerging properties in self-supervised vision transformers,” arXiv preprint arXiv:2104.14294 (2021).
  6. Bazi, Y., et al., “Vision transformers for remote sensing image classification,” Remote Sens. 2021, 13(3), 516.
  7. Lu, K., Grover, A., Abbeel, P., Mordatch, I., “Pretrained transformers as universal computation engines,” arXiv preprint arXiv:2103.05247.
  8. Cao, B., Araujo, A., Sim, J., “Unifying deep local and global features for image search,” European Conference on Computer Vision (ECCV), Springer, 2020.
  9. Tian, Y., et al., “HyNet: Learning local descriptor with hybrid similarity measure and triplet loss,” arXiv preprint arXiv:2006.10202 (2020).
  10. Yang, Z., Dan, T., Yang, Y., “Multi-temporal remote sensing image registration using deep convolutional features,” IEEE Access 6 (2018): 38544-38555.


KEYWORDS: Artificial intelligence, deep learning, machine learning, representation learning, computer vision, remote sensing