API Reference
=============

.. toctree::
   :maxdepth: 2
   :caption: Modules

This package has two modules, detailed below.

Connectors
----------

.. automodule:: src.Connectors
   :members:
   :undoc-members:
   :show-inheritance:

1. GDC Endpoint Connectors
^^^^^^^^^^^^^^^^^^^^^^^^^^

.. autoclass:: src.Connectors.gdc_endpt_base.GDCEndptBase
   :members:
   :undoc-members:
   :show-inheritance:

2. GDC Filters
^^^^^^^^^^^^^^

.. autoclass:: src.Connectors.gdc_filters.GDCQueryFilters
   :members:
   :undoc-members:
   :show-inheritance:

Example usage:

.. code-block:: python

   from Connectors.gdc_filters import GDCQueryFilters

   gdc_filters = GDCQueryFilters()

   # Create a custom RNA-Seq data filter
   custom_params = {
       "primary_site": ["Breast"],
       "cases.demographic.gender": ["female"]
   }
   custom_rna_seq_filter = gdc_filters.rna_seq_data_filter(field_params=custom_params)

For more examples and detailed usage, see the :ref:`examples` section.

.. autoclass:: src.Connectors.gdc_filters.GDCFacetFilters
   :members:
   :undoc-members:
   :show-inheritance:

Example usage:

.. code-block:: python

   from Connectors.gdc_filters import GDCFacetFilters

   facet_filters = GDCFacetFilters()

   # Create a facet filter
   facet_filter = facet_filters.create_facet_filter(
       "cases.project.primary_site", ["Breast", "Lung"]
   )

3. Google Cloud Connector
^^^^^^^^^^^^^^^^^^^^^^^^^

.. class:: src.Connectors.gcp_bigquery_utils.BigQueryUtils(project_id)

   Utility class for interacting with Google BigQuery.

   .. method:: table_exists(table_ref)

      Example:

      .. code-block:: python

         from src.Connectors.gcp_bigquery_utils import BigQueryUtils

         bq_utils = BigQueryUtils("my-project-id")
         table_ref = "my-project.my_dataset.my_table"
         exists = bq_utils.table_exists(table_ref)
         print(f"Table exists: {exists}")

   .. method:: dataset_exists(dataset_id)

      Example:

      .. code-block:: python

         from src.Connectors.gcp_bigquery_utils import BigQueryUtils

         bq_utils = BigQueryUtils("my-project-id")
         dataset_id = "my-project.my_dataset"
         exists = bq_utils.dataset_exists(dataset_id)
         print(f"Dataset exists: {exists}")

   .. method:: upload_df_to_bq(table_id, df)

      Example:

      .. code-block:: python

         import pandas as pd
         from src.Connectors.gcp_bigquery_utils import BigQueryUtils

         bq_utils = BigQueryUtils("my-project-id")
         df = pd.DataFrame({"col1": [1, 2, 3], "col2": ["a", "b", "c"]})
         table_id = "my-project.my_dataset.my_table"
         job = bq_utils.upload_df_to_bq(table_id, df)
         job.result()  # Wait for the job to complete

   .. method:: create_bigquery_table_with_schema(table_id, schema, partition_field=None, clustering_fields=None)

      Example:

      .. code-block:: python

         from google.cloud import bigquery
         from src.Connectors.gcp_bigquery_utils import BigQueryUtils

         bq_utils = BigQueryUtils("my-project-id")
         table_id = "my-project.my_dataset.my_table"
         schema = [
             bigquery.SchemaField("name", "STRING"),
             bigquery.SchemaField("age", "INTEGER"),
         ]
         table = bq_utils.create_bigquery_table_with_schema(table_id, schema)

   .. method:: df_to_json(df, file_path="data.json")

      Example:

      .. code-block:: python

         import pandas as pd
         from src.Connectors.gcp_bigquery_utils import BigQueryUtils

         bq_utils = BigQueryUtils("my-project-id")
         df = pd.DataFrame({"col1": [1, 2, 3], "col2": ["a", "b", "c"]})
         bq_utils.df_to_json(df, "output.json")

   .. method:: load_json_data(json_object, schema, table_id)

      Example:

      .. code-block:: python

         from google.cloud import bigquery
         from src.Connectors.gcp_bigquery_utils import BigQueryUtils

         bq_utils = BigQueryUtils("my-project-id")
         json_object = [{"name": "Alice", "age": 30}, {"name": "Bob", "age": 25}]
         schema = [
             bigquery.SchemaField("name", "STRING"),
             bigquery.SchemaField("age", "INTEGER"),
         ]
         table_id = "my-project.my_dataset.my_table"
         job = bq_utils.load_json_data(json_object, schema, table_id)
         job.result()  # Wait for the job to complete

   .. method:: run_query(query)

      Example:

      .. code-block:: python

         from src.Connectors.gcp_bigquery_utils import BigQueryUtils

         bq_utils = BigQueryUtils("my-project-id")
         query = "SELECT * FROM `my-project.my_dataset.my_table` LIMIT 10"
         df = bq_utils.run_query(query)
         print(df.head())

.. autoclass:: src.Connectors.gcp_bigquery_utils.BigQueryQueries
   :members:
   :undoc-members:
   :show-inheritance:

Example usage:

.. code-block:: python

   from src.Connectors.gcp_bigquery_utils import BigQueryQueries

   # Initialize BigQueryQueries
   project_id = "your-project-id"
   dataset_id = "your-dataset-id"
   table_id = "your-table-id"
   bq_queries = BigQueryQueries(project_id, dataset_id, table_id)

   # Get primary site options
   primary_sites = bq_queries.get_primary_site_options()
   print("Primary site options:", primary_sites)

   # Get primary diagnosis options for a specific primary site
   primary_site = "Breast"
   diagnoses = bq_queries.get_primary_diagnosis_options(primary_site)
   print(f"Primary diagnosis options for {primary_site}:", diagnoses)

   # Get DataFrame for PyDESeq2 analysis
   primary_diagnosis = "Invasive Ductal Carcinoma"
   df_pydeseq = bq_queries.get_df_for_pydeseq(primary_site, primary_diagnosis)
   print("DataFrame for PyDESeq2 analysis:")
   print(df_pydeseq.head())

   # Get DataFrame for recurrence-free survival analysis
   df_rfs = bq_queries.get_df_for_recurrence_free_survival_exp(primary_site, primary_diagnosis)
   print("DataFrame for recurrence-free survival analysis:")
   print(df_rfs.head())

   # Get all primary diagnoses for a primary site
   all_diagnoses_df = bq_queries.get_all_primary_diagnosis_for_primary_site(primary_site)
   print(f"All primary diagnoses for {primary_site}:")
   print(all_diagnoses_df)

This example demonstrates how to use all public methods of the `BigQueryQueries` class:

1. Initialize the `BigQueryQueries` instance with project, dataset, and table IDs.
2. Get a list of primary site options using `get_primary_site_options()`.
3. Get primary diagnosis options for a specific primary site using `get_primary_diagnosis_options()`.
4. Retrieve a DataFrame for PyDESeq2 analysis with `get_df_for_pydeseq()`.
5. Get a DataFrame for recurrence-free survival analysis using `get_df_for_recurrence_free_survival_exp()`.
6. Fetch all primary diagnoses for a given primary site with `get_all_primary_diagnosis_for_primary_site()`.


The `BigQueryQueries` methods provide a convenient interface for querying and retrieving data from your custom BigQuery dataset for various genomic analyses. Creating BigQuery tables with partitioning and clustering for optimized query performance is shown in the :ref:`upload_data_to_bq` example and the :ref:`examples` section.

Engines
-------

.. automodule:: src.Engines
   :members:
   :undoc-members:
   :show-inheritance:

1. Analysis Engine
^^^^^^^^^^^^^^^^^^

.. autoclass:: src.Engines.analysis_engine.AnalysisEngine
   :members:
   :undoc-members:
   :show-inheritance:

Example usage:

.. code-block:: python

   import pandas as pd
   from Engines.analysis_engine import AnalysisEngine
   from Connectors.gcp_bigquery_utils import BigQueryQueries

   # Initialize BigQuery connection
   project_id = "your_project_id"
   dataset_id = "your_dataset_id"
   table_id = "your_table_id"
   bq_queries = BigQueryQueries(project_id, dataset_id, table_id)

   # Fetch data from BigQuery
   primary_site = "Breast"
   primary_diagnosis = "Invasive Ductal Carcinoma"
   df = bq_queries.get_df_for_pydeseq(primary_site, primary_diagnosis)
   data_from_bq = df.copy()

   # Optional: Add simulated samples if available
   simulated_samples = None  # Replace with actual simulated samples if available
   if simulated_samples is not None:
       data_from_bq = pd.concat([data_from_bq, simulated_samples], ignore_index=True)

   # Initialize AnalysisEngine
   analysis_cls = AnalysisEngine(data_from_bq, analysis_type='DE')

   # Check if there are enough tumor and normal samples
   if not analysis_cls.check_tumor_normal_counts():
       raise ValueError("Tumor and Normal counts should be at least 10 each")

   # Load gene IDs (adjust the path as needed)
   gene_ids_or_gene_cols = list(pd.read_csv('path/to/gene_id_to_gene_name_mapping.csv')['gene_id'])

   # Expand data for differential expression analysis
   exp_data = analysis_cls.expand_data_from_bq(data_from_bq, gene_ids_or_gene_cols, 'DE')

   # Prepare counts and metadata for PyDESeq2
   counts_for_de = analysis_cls.counts_from_bq_df(exp_data, gene_ids_or_gene_cols)
   metadata = analysis_cls.metadata_for_pydeseq(exp_data)

   # Perform differential expression analysis
   de_results = analysis_cls.run_pydeseq(metadata=metadata, counts=counts_for_de)

   # Display results
   print(de_results.head())

   # Optional: Perform Gene Set Enrichment Analysis (GSEA)
   gene_set = "path/to/your/gene_set.gmt"  # Replace with actual gene set file path
   gsea_results, _, _ = analysis_cls.run_gsea(de_results, gene_set)

   # Display GSEA results
   print(gsea_results.head())

This example demonstrates how to use the `AnalysisEngine` class for differential expression analysis:

1. Fetches data from BigQuery using the `BigQueryQueries` class.
2. Optionally adds simulated samples to the dataset.
3. Initializes the `AnalysisEngine` with the data and specifies the analysis type as 'DE' (Differential Expression).
4. Checks if there are enough tumor and normal samples for analysis.
5. Loads gene IDs from a CSV file.
6. Expands the data for differential expression analysis.
7. Prepares counts and metadata for PyDESeq2.
8. Runs the differential expression analysis using PyDESeq2.
9. Optionally performs Gene Set Enrichment Analysis (GSEA) on the differential expression results.

This workflow covers the key functionality of the `AnalysisEngine` class: differential expression analysis followed by optional enrichment analysis.

2. BigQuery Engine
^^^^^^^^^^^^^^^^^^

.. autoclass:: src.Engines.bigquery_engine.BigQueryEngine
   :members:
   :undoc-members:
   :show-inheritance:

3. GDC Engine
^^^^^^^^^^^^^

.. autoclass:: src.Engines.gdc_engine.GDCEngine
   :members:
   :undoc-members:
   :show-inheritance:

Example usage:

.. code-block:: python

   from Engines.gdc_engine import GDCEngine
   import json

   # Initialize GDCEngine
   params = {
       'files.experimental_strategy': 'RNA-Seq',
       'data_type': 'Gene Expression Quantification'
   }
   gdc_eng_inst = GDCEngine(**params)

   # Set parameters
   new_params = {'cases.project.primary_site': 'Lung'}
   gdc_eng_inst.set_params(**new_params)

   # Get RNA-Seq metadata (private helper, called here only for illustration)
   rna_seq_metadata = gdc_eng_inst._get_rna_seq_metadata()
   print(rna_seq_metadata['metadata'].head())

   # Run RNA-Seq data matrix creation
   primary_site = 'Lung'
   ml_data_matrix = gdc_eng_inst.run_rna_seq_data_matrix_creation(
       primary_site, downstream_analysis='ML'
   )
   print(ml_data_matrix.head())

   # Create identifier
   sample_row = ml_data_matrix.iloc[0]
   identifier = gdc_eng_inst.create_identifier(sample_row)
   print(f"Identifier: {identifier}")

   # Make count data for BigQuery
   json_object, gene_cols = gdc_eng_inst.make_count_data_for_bq(
       primary_site, downstream_analysis='DE', format='json'
   )
   print(f"Number of gene columns: {len(gene_cols)}")
   print(json.dumps(json_object[0], indent=2))

   # Make data for recurrence-free survival
   rfs_data, rfs_gene_cols = gdc_eng_inst.make_data_for_recurrence_free_survival(
       primary_site, downstream_analysis='ML', format='dataframe'
   )
   print(rfs_data.head())
   print(f"Number of gene columns for RFS: {len(rfs_gene_cols)}")

This example demonstrates the main methods of the `GDCEngine` class, including one private helper (`_get_rna_seq_metadata()`):

1. Initializing the `GDCEngine` with parameters.
2. Setting new parameters using `set_params()`.
3. Retrieving RNA-Seq metadata with `_get_rna_seq_metadata()`.
4. Running RNA-Seq data matrix creation with `run_rna_seq_data_matrix_creation()`.
5. Creating a unique identifier for a row using `create_identifier()`.
6. Making count data for BigQuery with `make_count_data_for_bq()`.
7. Preparing data for recurrence-free survival analysis with `make_data_for_recurrence_free_survival()`.

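
The parameter dictionaries passed to `GDCEngine` use GDC field names as keys. As a rough illustration of how such field/value pairs could map onto the nested filter JSON that the GDC API accepts (an ``and`` of ``in`` clauses), here is a minimal sketch. ``build_gdc_filter`` is a hypothetical helper written for this page; it is not part of this package and may differ from how `GDCQueryFilters` builds its filters.

.. code-block:: python

   import json


   def build_gdc_filter(field_params):
       """Combine {field: value(s)} pairs into one GDC-style "and" filter.

       Hypothetical helper for illustration only.
       """
       clauses = []
       for field, values in field_params.items():
           # Accept a single string as well as a list of values
           if isinstance(values, str):
               values = [values]
           clauses.append(
               {"op": "in", "content": {"field": field, "value": list(values)}}
           )
       return {"op": "and", "content": clauses}


   params = {
       "files.experimental_strategy": "RNA-Seq",
       "cases.project.primary_site": ["Lung"],
   }
   gdc_filter = build_gdc_filter(params)
   print(json.dumps(gdc_filter, indent=2))

Each key becomes one ``in`` clause, so adding a field narrows the query and adding a value to a field's list widens it.
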
The `GDCEngine` methods cover the GDC workflow from initial querying to preparing data for downstream analyses, including differential expression and machine learning tasks.
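
If JSON records need to be staged in a file before loading, note that BigQuery file loads expect newline-delimited JSON (one object per line) rather than a single JSON array. The sketch below shows that serialization with the standard library only. The helper name ``records_to_ndjson`` and the record fields are hypothetical; the records returned by ``make_count_data_for_bq()`` will have their own keys.

.. code-block:: python

   import io
   import json


   def records_to_ndjson(records, stream):
       """Write one JSON object per line and return the number of lines written.

       Hypothetical helper for illustration only.
       """
       count = 0
       for record in records:
           stream.write(json.dumps(record, sort_keys=True) + "\n")
           count += 1
       return count


   # Made-up records standing in for real count data
   records = [
       {"case_id": "case-0001", "tissue_type": "Tumor", "counts": [12, 0, 7]},
       {"case_id": "case-0002", "tissue_type": "Normal", "counts": [3, 5, 1]},
   ]

   buffer = io.StringIO()  # Swap for open("data.json", "w") to write a file
   written = records_to_ndjson(records, buffer)
   ndjson_lines = buffer.getvalue().splitlines()
   print(f"Wrote {written} records")
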