# Analyze

## Identify differentially abundant genes between the control (the inoculum) and treatment conditions with `mbarq analyze`

### Input/Output Files

**Required Inputs**

- Count file produced by `mbarq merge`

| barcode   | Name  | Sample1 | Sample2 | ... |
|:----------|:-----:|:-------:|:-------:|:---:|
| ACCTGGTAG | geneA |   500   |  1000   | ... |
| ACCGGGGAA | geneA |   100   |   500   | ... |
| CCCGGGAAA | geneB |   300   |   300   | ... |

- Sample data file (CSV) in the following format:

| sampleID | treatment  |
|:---------|:----------:|
| Sample1  | control    |
| Sample2  | treatment1 |
| ...      | ...        |

- The name of the column indicating treatment in the sample data should be specified using ``--treatment_column`` (for the example above, ``--treatment_column treatment``)
- The treatment level that should be used as the control/baseline should be specified using ``--baseline`` (for the example above, ``--baseline control``)

**Suggested Inputs**

- We highly recommend adding control strains (i.e. strains with barcodes inserted into fitness-neutral locations) to the barcode library. This greatly facilitates quality control and analysis of the data.
- If control strains are present in the library, the control barcodes can be specified with a control file using the ``--control_file`` option.
    - In the simplest case, the control file contains only the barcode sequences of the control strains (1 barcode per line).
    - If different control strains were added at different concentrations, the concentration of each barcode can be specified in the second column.
    - If the control strains include different genotypes (e.g. wild type as well as negative control strains), the genotype can be specified in the third column.
    - Only wild-type strains will be used for quality control and analysis. The wild-type genotype should be specified as `wt`, `WT`, or `wildtype`.
    - The control file should be in CSV format and contain NO header.


| Barcode [Required] | Concentration [Optional] | Genotype [Optional] |
|:-------------------|:------------------------:|:-------------------:|
| ACCTGGGTT          |          0.005           |         wt          |
| CCGGAAGGT          |          0.001           |         wt          |

**Output Files**

- ``mbarq_merged_counts_batch.txt``: Information on sample and batch
- ``mbarq_merged_counts.correlations.csv``: Correlation for each batch
- ``mbarq_merged_counts_rra_results.csv``: Information for each gene about number of barcodes, LFC and false discovery rate
- ``mbarq_merged_counts_barcodes_results.csv``: Information for each barcode about LFC and significance scores

For each comparison:

- ``mbarq_merged_counts_cond1_vs_cond0.gene_summary.txt``: Summary for each gene
- ``mbarq_merged_counts_cond1_vs_cond0.report.Rmd``: MAGeCK Comparison Report
- ``mbarq_merged_counts_cond1_vs_cond0.sgrna_summary.txt``: Summary for each sgRNA

**Output Format Options**

The final results can be output in two formats:

- **Long format (default)**: Each row represents a gene-treatment combination. This format includes a 'contrast' column indicating the treatment condition.
- **Wide format**: Each row represents a gene, with separate columns for each treatment condition (e.g., 'LFC_d1', 'LFC_d2'). This format is useful for downstream analysis and visualization.

Use the `--format` option to specify the desired output format.

**Format Examples:**

*Long format (default):*

| Name  | number_of_barcodes | LFC  | neg_selection_fdr | pos_selection_fdr | contrast |
|-------|--------------------|------|-------------------|-------------------|----------|
| geneA | 3                  | 1.2  | 0.05              | 0.9               | d1       |
| geneA | 3                  | 1.5  | 0.03              | 0.8               | d2       |
| geneB | 2                  | -0.8 | 0.8               | 0.1               | d1       |
| geneB | 2                  | -0.9 | 0.7               | 0.2               | d2       |

*Wide format:*

| Name  | number_of_barcodes | LFC_d1 | LFC_d2 | neg_selection_fdr_d1 | neg_selection_fdr_d2 | pos_selection_fdr_d1 | pos_selection_fdr_d2 |
|-------|--------------------|--------|--------|----------------------|----------------------|----------------------|----------------------|
| geneA | 3                  | 1.2    | 1.5    | 0.05                 | 0.03                 | 0.9                  | 0.8                  |
| geneB | 2                  | -0.8   | -0.9   | 0.8                  | 0.7                  | 0.1                  | 0.2                  |

### Example Usage

```bash
# Basic usage with long format output (default)
# <count_file>, <sample_data> and <control_file> are placeholders for your own file paths
mbarq analyze -i <count_file> -s <sample_data> -c <control_file> \
    --treatment_column treatment --baseline control

# Output results in wide format
mbarq analyze -i <count_file> -s <sample_data> -c <control_file> \
    --treatment_column treatment --baseline control --format wide
```

### All Options

```
mbarq analyze

Usage: mbarq analyze

Options:
  -i, --count_file FILE     CSV file produced by `mbarq merge`
  -s, --sample_data FILE    CSV file containing sample data
  -c, --control_file FILE   control barcode file, see documentation for proper format
  -g, --gene_name STR       column in the count file containing gene identifiers  [Name]
  --treatment_column STR    column in sample data file indicating treatment
  --baseline STR            treatment level to use as control/baseline, ex. day0
  -n, --name STR            experiment name, by default will try to use count file name
  -o, --out_dir DIR         Output directory
  --norm_method STR         mageck normalization method: median, total, or control.
                            By default will use control barcodes if provided, otherwise median
  --filter_low_counts INT   filter out barcodes with < N reads across all conditions  [0]
  -f, --format STR          output file format: long or wide  [long]
  -h, --help                Show this message and exit.
```
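
### Relating the long and wide formats

To make the relationship between the two output formats concrete, the sketch below pivots the long-format example table from the Format Examples section into the wide layout using only the Python standard library. It is an illustration of the reshaping, not mbarq's own implementation: the function name `long_to_wide` and the constants `ID_COLS` and `VALUE_COLS` are hypothetical, and the column names are assumed to match the example tables.

```python
import csv
import io

# Long-format rows copied from the example table in this document.
LONG_CSV = """Name,number_of_barcodes,LFC,neg_selection_fdr,pos_selection_fdr,contrast
geneA,3,1.2,0.05,0.9,d1
geneA,3,1.5,0.03,0.8,d2
geneB,2,-0.8,0.8,0.1,d1
geneB,2,-0.9,0.7,0.2,d2
"""

# Hypothetical helper constants: columns kept once per gene, and columns
# that are spread into one column per contrast.
ID_COLS = ["Name", "number_of_barcodes"]
VALUE_COLS = ["LFC", "neg_selection_fdr", "pos_selection_fdr"]


def long_to_wide(rows):
    """Pivot long rows (one per gene and contrast) into one row per gene."""
    contrasts = []  # contrasts in order of first appearance
    genes = {}      # gene name -> wide row under construction
    for row in rows:
        contrast = row["contrast"]
        if contrast not in contrasts:
            contrasts.append(contrast)
        wide = genes.setdefault(row["Name"], {col: row[col] for col in ID_COLS})
        for col in VALUE_COLS:
            wide[f"{col}_{contrast}"] = row[col]
    header = ID_COLS + [f"{col}_{c}" for col in VALUE_COLS for c in contrasts]
    # A gene missing from one contrast gets an empty cell for that contrast.
    table = [[wide.get(col, "") for col in header] for wide in genes.values()]
    return header, table


header, table = long_to_wide(csv.DictReader(io.StringIO(LONG_CSV)))
print(",".join(header))
for line in table:
    print(",".join(line))
```

The printed rows correspond to the wide-format example table: one row per gene, with `LFC`, `neg_selection_fdr` and `pos_selection_fdr` each expanded into a `_d1` and a `_d2` column. In practice, passing `--format wide` to `mbarq analyze` avoids the need for this step.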