Dataset summary
Introduction
We've built the dccvalidator tool to streamline the process of data validation and QA/QC. As the AMP-AD Knowledge Portal has grown to 50+ studies and over 70,000 data files, we've realized a need to be more standardized in our approaches to data curation. Thus, we built an application that performs many of the routine data quality checks we previously conducted by hand, with the hopes that it will help you, the data contributor, get your data checked, validated, and shared more easily and quickly.
The application can be found at https://shinypro.synapse.org/users/kwoo/dccvalidator-app/
Requirements
To use this application you must:
- Be logged in to Synapse in your browser
- Be a Synapse certified user
- Be a member of the AMP-AD consortium team
Some portions of the app submit data to Synapse. This allows curators at Sage to troubleshoot issues if needed; no one outside the Sage curation team will be able to download the data.
Instructions
This topic has a general overview of the data contribution process and detailed instructions for each step, including uploading documentation, metadata requirements, validating and reviewing the metadata, and uploading the dataset.
General Process Overview
- Contact the AMP-AD team to discuss the study and the expected data. Receive staging folder synIDs for each expected dataset.
- Upload documentation and validate the metadata and manifest files in dccvalidator.
- Contact the AMP-AD team when all files pass validation. The team will verify items not checked by the dccvalidator. Receive permissions to upload data to the staging folder.
- Use the validated manifest to upload the data with syncToSynapse (see Synapse documentation for uploading data in bulk).
- Contact the AMP-AD team. The team will do the final verifications before releasing the data.
Documentation Upload
Each study in AMP-AD has accompanying documentation in the portal. You can submit your documentation through the dccvalidator app on the Documentation page. There should be a study description for the whole study, and an assay description for each of the assays that was performed. These can be in a single file, or you can upload multiple files to the assay description section.
Data Validation
Metadata Requirements
Each study should include metadata that would help a new researcher understand and reuse the data. In most cases, we will expect 4 files:
- Individual metadata: a csv file describing each individual in the study.
- Biospecimen metadata: a csv file describing the specimens that were collected.
- Assay metadata: a csv file describing the assay that was performed. If multiple assays were part of the study, there will be one assay file for each.
- A manifest listing each file that will be uploaded. You will use this file to upload your data after it has been validated and approved. The manifest should be in tsv (tab-delimited text) format.
We provide templates for all of the metadata files within the portal: https://www.synapse.org/#!Synapse:syn18512044
You can download these files, fill out the first tab of each, and save that tab as a .csv or .tsv file. The other tabs describe the variables and allowed values in the template. If you do not have any data for some of the columns, you can leave them blank (but do not remove the column header).
If you don't see a template for the assay(s) in your study, or if not all of the metadata types above seem relevant to your study, please get in touch with us at AMPAD_SageAdmin@synapse.org.
Validating the Metadata and Manifest
The data validation portion of the app allows you to upload metadata files (as .csv) and the manifest (as .tsv or .txt) and view the results of a series of automated checks.
Examples of the types of checks we perform are:
- All required columns from the templates are present
- Individuals and specimens have unique identifiers
- Metadata terms conform to a controlled vocabulary where applicable
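To get a feel for these checks before uploading, the sketch below runs rough local equivalents of the three examples above. It is not the dccvalidator implementation. The function name `check_metadata`, the column names (`individualID`, `sex`, `diagnosis`), and the allowed values are assumptions made up for illustration; the real required columns and vocabularies come from the templates.

```python
import csv
import io


def check_metadata(rows, required_columns, id_column, allowed_values):
    """Return a list of human-readable problems found in the metadata rows."""
    problems = []
    columns = set(rows[0].keys()) if rows else set()

    # Check 1: all required columns are present.
    for col in required_columns:
        if col not in columns:
            problems.append(f"missing required column: {col}")

    # Check 2: identifiers are unique (blank identifiers are skipped here).
    seen = set()
    for row in rows:
        value = row.get(id_column) or ""
        if not value:
            continue
        if value in seen:
            problems.append(f"duplicate {id_column}: {value}")
        seen.add(value)

    # Check 3: non-blank values come from the allowed vocabulary.
    for col, allowed in allowed_values.items():
        for row in rows:
            value = row.get(col) or ""
            if value and value not in allowed:
                problems.append(f"invalid value in {col}: {value}")

    return problems


# Hypothetical individual metadata; column names and values are illustrative.
sample_csv = """individualID,sex,species
A1,female,Human
A2,male,Human
A2,unknown sex,Human
"""
rows = list(csv.DictReader(io.StringIO(sample_csv)))
problems = check_metadata(
    rows,
    required_columns=["individualID", "sex", "diagnosis"],
    id_column="individualID",
    allowed_values={"sex": {"female", "male"}},  # assumed vocabulary
)
for problem in problems:
    print(problem)
```

A local pass like this does not replace validation in the app; it only helps catch obvious problems early.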
Viewing Data Summary
We also provide a summary of the files you have uploaded, showing the number of individuals, specimens, and files. We visualize the data in each column by its data type to help spot unexpected missing values.
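A similar summary can be approximated locally. The sketch below counts distinct identifiers and blank cells per column in a biospecimen file. The helper names and the `specimenID`, `individualID`, and `tissue` columns are assumptions for illustration, and the app's own summary may be computed differently.

```python
import csv
import io


def count_distinct(rows, column):
    """Count distinct non-blank values in one column."""
    return len({row[column] for row in rows if row.get(column)})


def missing_by_column(rows):
    """Count blank or absent cells per column; columns with none are omitted."""
    missing = {}
    for row in rows:
        for col, value in row.items():
            if col is None:
                continue  # extra cells beyond the header; ignored in this sketch
            if value is None or value.strip() == "":
                missing[col] = missing.get(col, 0) + 1
    return missing


# Hypothetical biospecimen metadata; column names are illustrative.
sample_csv = """specimenID,individualID,tissue
S1,A1,cortex
S2,A1,
S3,A2,cortex
"""
rows = list(csv.DictReader(io.StringIO(sample_csv)))
n_specimens = count_distinct(rows, "specimenID")
n_individuals = count_distinct(rows, "individualID")
missing = missing_by_column(rows)
print(n_specimens, n_individuals, missing)
```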
Uploading Data
Once your data has passed validation and the AMP-AD data curators have granted you edit permissions to the staging folder for your study, use your validated manifest file to upload your data and metadata with syncToSynapse. You can run syncToSynapse in the Python client or the R client. To get started with the Synapse programmatic clients, please visit the Synapse docs.
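As a rough illustration of the Python route, the sketch below first checks that the manifest header contains the `path` and `parent` columns that the Synapse manifest format is understood to require, then calls `synapseutils.syncToSynapse`. The helper names and the demo file are hypothetical, and the sketch assumes the `synapseclient` package is installed and your Synapse credentials are already configured. Check the Synapse documentation for the options available in your client version.

```python
import csv
import os
import tempfile

# Columns assumed to be required by the Synapse manifest format.
REQUIRED_MANIFEST_COLUMNS = ("path", "parent")


def manifest_columns_missing(manifest_path):
    """Return the required columns absent from a tab-delimited manifest header."""
    with open(manifest_path, newline="") as handle:
        header = next(csv.reader(handle, delimiter="\t"), [])
    return [col for col in REQUIRED_MANIFEST_COLUMNS if col not in header]


def upload_with_manifest(manifest_path, dry_run=True):
    """Check the manifest header locally, then hand the manifest to syncToSynapse."""
    missing = manifest_columns_missing(manifest_path)
    if missing:
        raise ValueError(f"manifest is missing columns: {missing}")

    # Imported here so the local check above works without the client installed.
    import synapseclient
    import synapseutils

    syn = synapseclient.Synapse()
    syn.login()  # assumes credentials are stored in your Synapse configuration
    # With dryRun=True the manifest is validated but nothing is uploaded.
    synapseutils.syncToSynapse(syn, manifest_path, dryRun=dry_run)


# Local demo of the header check only; no connection to Synapse is made.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "manifest.tsv")
    with open(demo_path, "w", newline="") as handle:
        handle.write("path\tassay\n/data/sample1.bam\trnaSeq\n")
    missing = manifest_columns_missing(demo_path)
print(missing)
```

Running `upload_with_manifest("manifest.tsv")` with the default `dry_run=True` first, and only then with `dry_run=False`, keeps a malformed manifest from starting a partial upload.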
Upload Study Documentation
This information should be similar to a materials and methods section in a paper. An example of what a study should include can be found here for an animal model study and here for a human study. If you wish, also provide an acknowledgement statement and/or reference that should be included in publications resulting from secondary data use; examples can be found here. This can be provided as part of the study documentation text.
Study Description
Each study should be given both a descriptive and an abbreviated name. The abbreviation will be used to annotate all content associated with the study. For a study with a human cohort, the study description should include:
- study type (randomized controlled study, prospective observational study, case-control study, or post-mortem study)
- disease focus
- diagnostic criteria and inclusion/exclusion criteria of study participants
- (for post mortem studies) the brain bank name(s) and links to website(s)
For a study with an animal model cohort, the study description should include:
- species
- treatments
- (if genetically modified) genotype and genetic background. Provide a link to the strain datasheet(s) if a commercial model, or a description of how it was created if not.
For studies using in-vitro cell culture, the study description should include:
- species
- cell type
- cell culture information (such as primary or immortalized cell line, passage, treatments, differentiation). If a commercial cell line, provide a link.
Include citations for more study information if available.
Assay Description
For each assay, provide a summary of sample processing, data generation, and data processing, including which organs and tissues the samples came from. For other tests (such as cognitive assessments or imaging), include a description of how the test was done. Include links for any commercial equipment or tools, code repositories, and citations for more information, if available.
Detailed protocols are highly recommended. These can be uploaded as PDF files together with the data files, or provided as links to protocol repositories such as protocols.io or Open Lab Book.