
Data Labeling and Curation at Scale (DLCS) for Machine Learning Algorithms


The DHS Science & Technology Directorate (S&T) laboratories and DHS component operational partners generate large volumes of data (up to 10,000 or more measurements per day) from test events, prototype demonstrations, or targeted stream of commerce (SoC) data collections. These data are highly valuable to DHS and our R&D partners in developing next-generation detection algorithms, such as those used at airports to screen people and accessible property for explosives and prohibited items. Currently, all collected data must be hand-annotated and stored on physical hard disks. This process is extremely time- and labor-intensive, and it limits DHS's ability to develop curated data sets and share data with R&D partners. R&D partners must also accept the data in the formats and with the labels that were created by hand, because DHS does not currently have the capability to rapidly re-annotate or reformat existing data sets.
DHS is seeking innovative techniques to accelerate, and add flexibility to, its data collection, labeling, storage, and distribution processes. The current state of the art relies heavily on human labeling and on knowing the desired metadata and curation schemes a priori. Successful solutions will limit the human intervention required to perform these tasks, relying instead on automated software to handle most routine activities. It is assumed that the provided solution may include certain commercial-off-the-shelf (COTS) modules, but the focus of the research should be on novel data ingestion, labeling, and curation techniques. Any COTS modules included should support Government-approved cybersecurity standards, such as FedRAMP approval and/or compliance with FIPS 140-3 specifications.
Capabilities of particular interest include the ability to ingest specialized file formats such as Hierarchical Data Format (HDF) and Digital Imaging and Communications in Security (DICOS), an adaptation of Digital Imaging and Communications in Medicine (DICOM), along with other defined but unusual data types, and then to process the data to assess complexity, identify common features and defined labels, and generate ground truth data for these files. Areas of uncertainty may be flagged for later human review, at which point the human-generated ground truth may be analyzed to enhance the automated tools. Once stored, the data should be easy to curate, reprocess (e.g., change file formats or ground truth formats), and distribute as packaged data sets. A successful solution should scale significantly to support long-term use by DHS.
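As an illustration only, the flag-for-review step described above could be sketched as a simple confidence-threshold triage. This is a minimal sketch under stated assumptions, not a prescribed design: the `AutoLabel` record, the `triage` and `apply_review` helpers, the 0.9 threshold, and the sample file identifiers and labels are all hypothetical and do not come from the topic description.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class AutoLabel:
    """Hypothetical record for one automatically proposed label."""
    file_id: str       # identifier of an ingested file (assumed naming)
    label: str         # label proposed by an automated labeler
    confidence: float  # labeler confidence in [0.0, 1.0]


def triage(labels: List[AutoLabel],
           threshold: float = 0.9) -> Tuple[List[AutoLabel], List[AutoLabel]]:
    """Split proposals into accepted ground truth and a human-review queue.

    The threshold is an assumed tuning parameter, not a DHS requirement.
    """
    accepted: List[AutoLabel] = []
    review: List[AutoLabel] = []
    for item in labels:
        if item.confidence >= threshold:
            accepted.append(item)
        else:
            review.append(item)
    return accepted, review


def apply_review(item: AutoLabel, human_label: str) -> AutoLabel:
    """Record a human decision as ground truth with full confidence."""
    return AutoLabel(item.file_id, human_label, 1.0)


# Made-up sample proposals, for illustration only.
proposed = [
    AutoLabel("scan_001", "benign", 0.98),
    AutoLabel("scan_002", "prohibited_item", 0.62),
    AutoLabel("scan_003", "benign", 0.91),
]
accepted, review = triage(proposed)
resolved = [apply_review(item, "benign") for item in review]
```

In this sketch, only low-confidence items reach a human, and the resolved items could later be compared against the original proposals to measure and improve the automated labeler.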