TY - JOUR
T1 - PhytoOracle
T2 - Scalable, modular phenomics data processing pipelines
AU - Gonzalez, Emmanuel M.
AU - Zarei, Ariyan
AU - Hendler, Nathanial
AU - Simmons, Travis
AU - Zarei, Arman
AU - Demieville, Jeffrey
AU - Strand, Robert
AU - Rozzi, Bruno
AU - Calleja, Sebastian
AU - Ellingson, Holly
AU - Cosi, Michele
AU - Davey, Sean
AU - Lavelle, Dean O.
AU - Truco, Maria José
AU - Swetnam, Tyson L.
AU - Merchant, Nirav
AU - Michelmore, Richard W.
AU - Lyons, Eric
AU - Pauli, Duke
N1 - Publisher Copyright: Copyright © 2023 Gonzalez, Zarei, Hendler, Simmons, Zarei, Demieville, Strand, Rozzi, Calleja, Ellingson, Cosi, Davey, Lavelle, Truco, Swetnam, Merchant, Michelmore, Lyons and Pauli.
PY - 2023
Y1 - 2023
N2 - As phenomics data volume and dimensionality increase due to advancements in sensor technology, there is an urgent need to develop and implement scalable data processing pipelines. Current phenomics data processing pipelines lack modularity, extensibility, and processing distribution across sensor modalities and phenotyping platforms. To address these challenges, we developed PhytoOracle (PO), a suite of modular, scalable pipelines for processing large volumes of field phenomics RGB, thermal, PSII chlorophyll fluorescence 2D images, and 3D point clouds. PhytoOracle aims to (i) improve data processing efficiency; (ii) provide an extensible, reproducible computing framework; and (iii) enable data fusion of multi-modal phenomics data. PhytoOracle integrates open-source distributed computing frameworks for parallel processing on high-performance computing, cloud, and local computing environments. Each pipeline component is available as a standalone container, providing transferability, extensibility, and reproducibility. The PO pipeline extracts and associates individual plant traits across sensor modalities and collection time points, representing a unique multi-system approach to addressing the genotype-phenotype gap. To date, PO supports lettuce and sorghum phenotypic trait extraction, with a goal of widening the range of supported species in the future. At the maximum number of cores tested in this study (1,024 cores), PO processing times were: 235 minutes for 9,270 RGB images (140.7 GB), 235 minutes for 9,270 thermal images (5.4 GB), and 13 minutes for 39,678 PSII images (86.2 GB). These processing times represent end-to-end processing, from raw data to fully processed numerical phenotypic trait data. Repeatability values of 0.39-0.95 (bounding area), 0.81-0.95 (axis-aligned bounding volume), 0.79-0.94 (oriented bounding volume), 0.83-0.95 (plant height), and 0.81-0.95 (number of points) were observed in Field Scanalyzer data. We also show the ability of PO to process drone data with a repeatability of 0.55-0.95 (bounding area).
AB - As phenomics data volume and dimensionality increase due to advancements in sensor technology, there is an urgent need to develop and implement scalable data processing pipelines. Current phenomics data processing pipelines lack modularity, extensibility, and processing distribution across sensor modalities and phenotyping platforms. To address these challenges, we developed PhytoOracle (PO), a suite of modular, scalable pipelines for processing large volumes of field phenomics RGB, thermal, PSII chlorophyll fluorescence 2D images, and 3D point clouds. PhytoOracle aims to (i) improve data processing efficiency; (ii) provide an extensible, reproducible computing framework; and (iii) enable data fusion of multi-modal phenomics data. PhytoOracle integrates open-source distributed computing frameworks for parallel processing on high-performance computing, cloud, and local computing environments. Each pipeline component is available as a standalone container, providing transferability, extensibility, and reproducibility. The PO pipeline extracts and associates individual plant traits across sensor modalities and collection time points, representing a unique multi-system approach to addressing the genotype-phenotype gap. To date, PO supports lettuce and sorghum phenotypic trait extraction, with a goal of widening the range of supported species in the future. At the maximum number of cores tested in this study (1,024 cores), PO processing times were: 235 minutes for 9,270 RGB images (140.7 GB), 235 minutes for 9,270 thermal images (5.4 GB), and 13 minutes for 39,678 PSII images (86.2 GB). These processing times represent end-to-end processing, from raw data to fully processed numerical phenotypic trait data. Repeatability values of 0.39-0.95 (bounding area), 0.81-0.95 (axis-aligned bounding volume), 0.79-0.94 (oriented bounding volume), 0.83-0.95 (plant height), and 0.81-0.95 (number of points) were observed in Field Scanalyzer data. We also show the ability of PO to process drone data with a repeatability of 0.55-0.95 (bounding area).
KW - data management
KW - distributed computing
KW - high performance computing
KW - image analysis
KW - morphological phenotyping
KW - phenomics
KW - physiological phenotyping
KW - point cloud analysis
UR - http://www.scopus.com/inward/record.url?scp=85150463445&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85150463445&partnerID=8YFLogxK
U2 - 10.3389/fpls.2023.1112973
DO - 10.3389/fpls.2023.1112973
M3 - Article
SN - 1664-462X
VL - 14
JO - Frontiers in Plant Science
JF - Frontiers in Plant Science
M1 - 1112973
ER -