Examples
*********

The package has not been published on PyPI yet, so it cannot be installed with pip. For now, the suggested method is to clone the repository and view the example notebooks.

Useful query filters for GDC API endpoints
==========================================

The following examples show how to use the filters provided by the ``GDCQueryFilters`` class to query different GDC API endpoints and data types.

RNA-Seq Filter
^^^^^^^^^^^^^^^

.. code-block:: python

    from Connectors.gdc_filters import GDCQueryFilters

    gdc_filters = GDCQueryFilters()
    rna_seq_filter = gdc_filters.rna_seq_filter()

    # Use this filter with the 'files' endpoint
    # Example: requests.post("https://api.gdc.cancer.gov/files", json={"filters": rna_seq_filter, "size": 10})

WGS Filter
^^^^^^^^^^^^^

.. code-block:: python

    wgs_filter = gdc_filters.wgs_filter()

    # Use this filter with the 'files' endpoint
    # Example: requests.post("https://api.gdc.cancer.gov/files", json={"filters": wgs_filter, "size": 10})

Methylation Filter
^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: python

    methylation_filter = gdc_filters.methylation_filter()

    # Use this filter with the 'files' endpoint
    # Example: requests.post("https://api.gdc.cancer.gov/files", json={"filters": methylation_filter, "size": 10})

Top Mutated Genes Filter
^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: python

    top_mutated_genes_filter = gdc_filters.top_mutated_genes_by_project_filter("TCGA-BRCA", top_n=5)

    # Use this filter with the 'analysis/top_mutated_genes_by_project' endpoint
    # Example: requests.get("https://api.gdc.cancer.gov/analysis/top_mutated_genes_by_project",
    #                       params={"filters": json.dumps(top_mutated_genes_filter), "fields": "gene_id,symbol,score", "size": 5})

Custom RNA-Seq Data Filter
^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: python

    custom_params = {
        "cases.project.primary_site": ["Breast"],
        "cases.demographic.gender": ["female"]
    }
    custom_rna_seq_filter = gdc_filters.rna_seq_data_filter(field_params=custom_params)

    # Use this filter with the 'files' endpoint
    # Example: requests.post("https://api.gdc.cancer.gov/files", json={"filters": custom_rna_seq_filter, "size": 10})

Data Processing and Analysis Examples
======================================

Cohort Creation of Bulk RNA Seq Experiments from Genomic Data Commons (GDC)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: python

    """
    This example demonstrates how to create a data matrix for differential gene
    expression (DE) or machine learning (ML) analysis. You can select the primary
    site of the samples and the downstream analysis you want to perform.
    """
    import grequests
    import src.Engines.gdc_engine as gdc_engine
    from importlib import reload
    reload(gdc_engine)

    # Initialize the engine (same parameters as in the BigQuery example below)
    params = {
        'files.experimental_strategy': 'RNA-Seq',
        'data_type': 'Gene Expression Quantification'
    }
    gdc_eng_inst = gdc_engine.GDCEngine(**params)

    # Create dataset for differential gene expression
    rna_seq_DGE_data = gdc_eng_inst.run_rna_seq_data_matrix_creation(primary_site='Kidney', downstream_analysis='DE')

    # Create dataset for machine learning analysis
    rna_seq_ML_data = gdc_eng_inst.run_rna_seq_data_matrix_creation(primary_site='Kidney', downstream_analysis='ML')

Migrating GDC RNA-Seq Expression Data to your BigQuery Database
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Make sure to run this code in a Jupyter notebook or script located in the root directory of OmixHub.

This example demonstrates a workflow for uploading RNA-Seq data from multiple primary sites to BigQuery:

1. Initialize the ``BigQueryUtils`` class with a specific project ID.
2. Define a schema for the BigQuery table, including the fields related to RNA-Seq data.
3. Create a new BigQuery table with the defined schema, including partitioning and clustering.
4. Initialize a ``GDCEngine`` instance to fetch data from the GDC API.
5. Iterate through a list of primary sites, fetching data for each site from GDC.
6. Load the fetched data into the BigQuery table for each primary site.

This strategy uploads data from multiple primary sites into a single, well-structured BigQuery table. Partitioning and clustering can significantly improve query performance on large datasets.

Key features demonstrated:

- Creating a table with a specific schema
- Implementing partitioning and clustering for better query performance
- Batch processing of multiple primary sites
- Integration with ``GDCEngine`` for data retrieval
- Using ``tqdm`` for progress tracking during the upload process

This approach is particularly useful for large-scale genomic data analysis, because it lets researchers store and query RNA-Seq data across multiple primary sites in a cloud-based environment.

.. code-block:: python

    """
    For downstream applications, it is tedious to make API calls to GDC every time
    you need to access the data for analysis. This example demonstrates how to
    create a BigQuery database for the data you need so that downstream
    applications can access the data easily.
    """
    import gevent.monkey
    gevent.monkey.patch_all(thread=False, select=False)

    from Connectors.gcp_bigquery_utils import BigQueryUtils
    from google.cloud import bigquery
    from tqdm import tqdm
    from Engines.gdc_engine import GDCEngine

    # Initialize BigQueryUtils with your project
    project_id = 'rnaseqml'
    bq_utils = BigQueryUtils(project_id=project_id)

    # Define the table ID
    table_id = 'rnaseqml.rnaseqexpression.expr_clustered'

    # Define the schema for your table
    schema = [
        bigquery.SchemaField("case_id", "STRING", mode="NULLABLE"),
        bigquery.SchemaField("file_id", "STRING", mode="NULLABLE"),
        bigquery.SchemaField("expr_unstr_count", "INTEGER", mode="REPEATED"),
        bigquery.SchemaField("tissue_type", "STRING", mode="NULLABLE"),
        bigquery.SchemaField("sample_type", "STRING", mode="NULLABLE"),
        bigquery.SchemaField("primary_site", "STRING", mode="NULLABLE"),
        bigquery.SchemaField("tissue_or_organ_of_origin", "STRING", mode="NULLABLE"),
        bigquery.SchemaField("age_at_diagnosis", "FLOAT", mode="NULLABLE"),
        bigquery.SchemaField("primary_diagnosis", "STRING", mode="NULLABLE"),
        bigquery.SchemaField("race", "STRING", mode="NULLABLE"),
        bigquery.SchemaField("gender", "STRING", mode="NULLABLE"),
        bigquery.SchemaField("group_identifier", "INTEGER", mode="NULLABLE")
    ]

    # Create table with partitioning and clustering
    bq_utils.create_bigquery_table_with_schema(
        table_id=table_id,
        schema=schema,
        partition_field="group_identifier",
        clustering_fields=["primary_site", "tissue_type"]
    )

    # Initialize GDCEngine
    params = {
        'files.experimental_strategy': 'RNA-Seq',
        'data_type': 'Gene Expression Quantification'
    }
    gdc_eng_inst = GDCEngine(**params)

    # List of primary sites to process
    primary_sites = ['Esophagus', 'Lung', 'Breast']  # Add more sites as needed

    # Specify the kind of downstream analysis you want to perform
    downstream_analysis = 'DE'

    # Process each primary site
    for site in tqdm(primary_sites):
        # Get data from GDC
        json_object = gdc_eng_inst.get_data_for_bq(site, downstream_analysis=downstream_analysis, format='json')

        # Load data into BigQuery
        job = bq_utils.load_json_data(json_object, schema, table_id)
        job.result()  # Wait for the job to complete
        print(f"Data for {site} loaded successfully.")

    print("All data loaded successfully.")

Run an analysis for Differential Gene Expression (DE) and Gene Set Enrichment Analysis (GSEA)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: python

    """
    This example demonstrates how to run differential gene expression (DE)
    analysis with PyDeSeq, followed by gene set enrichment analysis (GSEA), on
    expression data stored in BigQuery. You select the primary site and the
    primary diagnosis to compare against normal tissue from the same site.
    """
    import pandas as pd
    from importlib import reload
    import src.Engines.analysis_engine as analysis_engine
    import src.Connectors.gcp_bigquery_utils as gcp_bigquery_utils
    reload(analysis_engine)
    reload(gcp_bigquery_utils)

    # 1. Download the dataset from BigQuery for a given primary diagnosis and
    #    primary site, together with the normal tissue for that primary site
    project_id = 'rnaseqml'
    dataset_id = 'rnaseqexpression'
    table_id = 'expr_clustered_08082024'
    bq_queries = gcp_bigquery_utils.BigQueryQueries(project_id=project_id,
                                                    dataset_id=dataset_id,
                                                    table_id=table_id)
    pr_site = 'Head and Neck'
    pr_diag = 'Squamous cell carcinoma, NOS'
    data_from_bq = bq_queries.get_df_for_pydeseq(primary_site=pr_site, primary_diagnosis=pr_diag)

    # 2. Data preprocessing for PyDeSeq and GSEA
    # Initialize the Analysis Engine
    analysis_eng = analysis_engine.AnalysisEngine(data_from_bq, analysis_type='DE')
    if not analysis_eng.check_tumor_normal_counts():
        raise ValueError("Tumor and Normal counts should be at least 10 each")

    # Replace this path with the location of your own gene annotation file
    gene_ids_or_gene_cols_df = pd.read_csv('/Users/abhilashdhal/Projects/personal_docs/data/Transcriptomics/data/gene_annotation/gene_id_to_gene_name_mapping.csv')
    gene_ids_or_gene_cols = list(gene_ids_or_gene_cols_df['gene_id'].to_numpy())

    # Expand the nested expression data from BigQuery
    exp_df = analysis_eng.expand_data_from_bq(data_from_bq, gene_ids_or_gene_cols=gene_ids_or_gene_cols, analysis_type='DE')

    # Get metadata and counts for PyDeSeq
    metadata = analysis_eng.metadata_for_pydeseq(exp_df=exp_df)
    counts_for_de = analysis_eng.counts_from_bq_df(exp_df, gene_ids_or_gene_cols)

    # 3. Run PyDeSeq
    res_pydeseq = analysis_eng.run_pydeseq(metadata=metadata, counts=counts_for_de)

    # Merge gene names, which GSEA requires and which are more informative than gene IDs
    res_pydeseq_with_gene_names = pd.merge(res_pydeseq, gene_ids_or_gene_cols_df, left_on='index', right_on='gene_id')

    # 4. Run GSEA for the same tumor vs. normal comparison using a gene set database
    # Explore the gene set options from gseapy
    from gseapy.plot import gseaplot
    import gseapy as gp
    from gseapy import dotplot
    gsea_options = gp.get_library_name()
    print(gsea_options)

    # Select a gene set, run GSEA and plot the results
    gene_set = 'Human_Gene_Atlas'
    result, plot = analysis_eng.run_gsea(res_pydeseq_with_gene_names, gene_set)
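The merge step in the example above joins on a column named ``index``, which only exists if the results DataFrame carries the gene IDs as a regular column. The sketch below illustrates the same join on toy data, under the assumption (not confirmed by the source) that the DE results keep gene IDs in the DataFrame index; all gene IDs, gene names, and statistics here are hypothetical placeholders.

```python
import pandas as pd

# Hypothetical stand-in for DE results: gene IDs are stored in the index
# (an assumption for illustration, not the confirmed output of run_pydeseq).
toy_results = pd.DataFrame(
    {"log2FoldChange": [2.1, -1.4, 0.3], "padj": [0.001, 0.02, 0.8]},
    index=["ENSG_TOY_1", "ENSG_TOY_2", "ENSG_TOY_3"],
)

# Hypothetical gene annotation table mapping gene IDs to gene names.
toy_annotation = pd.DataFrame(
    {
        "gene_id": ["ENSG_TOY_1", "ENSG_TOY_2", "ENSG_TOY_4"],
        "gene_name": ["GENE_A", "GENE_B", "GENE_D"],
    }
)

# reset_index() turns the unnamed index into a column called "index",
# which is what left_on="index" expects.
merged = pd.merge(
    toy_results.reset_index(),
    toy_annotation,
    left_on="index",
    right_on="gene_id",
    how="inner",  # genes without an annotation entry are dropped
)

print(merged[["gene_id", "gene_name", "log2FoldChange", "padj"]])
```

With an inner join, genes that are missing from the annotation file silently disappear from the merged table, so it can be worth comparing ``len(merged)`` with the number of rows in the original results before running GSEA.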