Analyze
Identify differentially abundant genes between the control (the inoculum) and treatment conditions with `mbarq analyze`
Input/Output Files
Required Inputs
Count file produced by `mbarq merge`:

| barcode | Name | Sample1 | Sample2 | … |
|---|---|---|---|---|
| ACCTGGTAG | geneA | 500 | 1000 | … |
| ACCGGGGAA | geneA | 100 | 500 | … |
| CCCGGGAAA | geneB | 300 | 300 | … |
Sample data file (CSV) in the following format:

| sampleID | treatment |
|---|---|
| Sample1 | control |
| Sample2 | treatment1 |
| … | … |

Name of the column indicating treatment in the sample data should be specified using `--treatment_column` (for the example above, `--treatment_column treatment`).
Treatment level that should be used as a control/baseline should be specified using `--baseline` (for the example above, `--baseline control`).
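Before running the analysis, it can help to confirm that the two required inputs agree with each other. The following is a minimal sketch, not part of mbarq: `check_inputs` is a hypothetical helper, and it assumes that the sample identifiers live in a column named `sampleID` (as in the example above) and that each identifier appears as a column in the count file.

```python
import csv
import io


def check_inputs(count_text, sample_text, treatment_column, baseline,
                 sample_id_column="sampleID"):
    """Return a list of problems; an empty list means the inputs look consistent.

    Hypothetical helper for illustration only; mbarq may perform different checks.
    """
    count_columns = next(csv.reader(io.StringIO(count_text)))
    rows = list(csv.DictReader(io.StringIO(sample_text)))
    sample_columns = list(rows[0].keys()) if rows else []

    # Both the sample ID column (assumed name) and the treatment column must exist.
    for column in (sample_id_column, treatment_column):
        if column not in sample_columns:
            return [f"sample data has no column named {column!r}"]

    problems = []
    # Every sample listed in the sample data should have a column of counts.
    for row in rows:
        if row[sample_id_column] not in count_columns:
            problems.append(
                f"sample {row[sample_id_column]!r} is missing from the count file")

    # The baseline must be one of the levels found in the treatment column.
    levels = {row[treatment_column] for row in rows}
    if baseline not in levels:
        problems.append(
            f"baseline {baseline!r} is not a level of {treatment_column!r}")
    return problems


# Example data copied from the tables above.
COUNTS = """barcode,Name,Sample1,Sample2
ACCTGGTAG,geneA,500,1000
ACCGGGGAA,geneA,100,500
CCCGGGAAA,geneB,300,300
"""
SAMPLES = """sampleID,treatment
Sample1,control
Sample2,treatment1
"""

print(check_inputs(COUNTS, SAMPLES, "treatment", "control"))
print(check_inputs(COUNTS, SAMPLES, "treatment", "day0"))
```

The first call uses the values from the example above; the second passes a baseline (`day0`) that does not occur in the `treatment` column, so it reports a problem.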
Suggested Inputs
We highly recommend adding control strains (i.e. strains with barcodes inserted into fitness-neutral locations) to the barcode library. This greatly facilitates quality control and analysis of the data.
If control strains are present in the library, the control barcodes can be specified with a control file using the `--control_file` option.
In the simplest case, the control file contains only the barcode sequences of the control strains (1 barcode per line).
If different control strains were added at different concentrations, the concentration of each barcode can be specified in the second column.
If the control strains include different genotypes (e.g. wild type as well as negative control strains), the genotype can be specified in the third column.
Only wild-type strains will be used for quality control and analysis; these should be labelled `wt`, `WT`, or `wildtype`.
The control file should be in CSV format and contain NO header.

| [Required] | [Optional] | [Optional] |
|---|---|---|
| ACCTGGGTT | 0.005 | wt |
| CCGGAAGGT | 0.001 | wt |

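The control file layout described above (one to three columns, no header) can be read with a few lines of Python. This is an illustrative sketch rather than mbarq's own parser: `read_control_file` and `wild_type_barcodes` are hypothetical names, the third example row (`TTTTGGGAA`, `mutant`) is invented to show the filtering, and treating barcodes without a genotype as wild type is an assumption based on the single-column case.

```python
import csv
import io

# Labels the documentation lists for wild-type control strains.
WILD_TYPE_LABELS = {"wt", "WT", "wildtype"}


def read_control_file(text):
    """Parse a header-less control file into (barcode, concentration, genotype) tuples."""
    controls = []
    for row in csv.reader(io.StringIO(text)):
        fields = [field.strip() for field in row]
        if not fields or not fields[0]:
            continue  # skip blank lines
        barcode = fields[0]
        # Columns 2 and 3 are optional, so missing values become None.
        concentration = float(fields[1]) if len(fields) > 1 and fields[1] else None
        genotype = fields[2] if len(fields) > 2 and fields[2] else None
        controls.append((barcode, concentration, genotype))
    return controls


def wild_type_barcodes(controls):
    """Keep wild-type barcodes.

    Assumption: a barcode with no genotype column is treated as wild type.
    """
    return [barcode for barcode, _, genotype in controls
            if genotype is None or genotype in WILD_TYPE_LABELS]


# First two rows come from the table above; the third is a made-up negative control.
CONTROL_TEXT = """ACCTGGGTT,0.005,wt
CCGGAAGGT,0.001,wt
TTTTGGGAA,0.001,mutant
"""

controls = read_control_file(CONTROL_TEXT)
print(wild_type_barcodes(controls))
```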
Output Files
mbarq_merged_counts_batch.txt: Information on sample and batch
mbarq_merged_counts.correlations.csv: Correlation for each batch
mbarq_merged_counts_rra_results.csv: Information for each gene about number of barcodes, LFC and false discovery rate
mbarq_merged_counts_barcodes_results.csv: Information for each barcode about LFC and significance scores
For each comparison:
mbarq_merged_counts_cond1_vs_cond0.gene_summary.txt: Summary for each gene
mbarq_merged_counts_cond1_vs_cond0.report.Rmd: MAGeCK Comparison Report
mbarq_merged_counts_cond1_vs_cond0.sgrna_summary.txt: Summary for each sgRNA
Output Format Options
The final results can be output in two formats:
Long format (default): Each row represents a gene-treatment combination. This format includes a ‘contrast’ column indicating the treatment condition.
Wide format: Each row represents a gene, with separate columns for each treatment condition (e.g., ‘LFC_d1’, ‘LFC_d2’). This format is useful for downstream analysis and visualization.
Use the --format option to specify the desired output format.
Format Examples:
Long format (default):

| Name | number_of_barcodes | LFC | neg_selection_fdr | pos_selection_fdr | contrast |
|---|---|---|---|---|---|
| geneA | 3 | 1.2 | 0.05 | 0.9 | d1 |
| geneA | 3 | 1.5 | 0.03 | 0.8 | d2 |
| geneB | 2 | -0.8 | 0.8 | 0.1 | d1 |
| geneB | 2 | -0.9 | 0.7 | 0.2 | d2 |

Wide format:

| Name | number_of_barcodes | LFC_d1 | LFC_d2 | neg_selection_fdr_d1 | neg_selection_fdr_d2 | pos_selection_fdr_d1 | pos_selection_fdr_d2 |
|---|---|---|---|---|---|---|---|
| geneA | 3 | 1.2 | 1.5 | 0.05 | 0.03 | 0.9 | 0.8 |
| geneB | 2 | -0.8 | -0.9 | 0.8 | 0.7 | 0.1 | 0.2 |

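The relationship between the two formats can be sketched in Python. The snippet below is only an illustration of the reshaping, assuming the column names shown in the examples above; `long_to_wide` is a hypothetical helper and is not how mbarq produces its wide output.

```python
import csv
import io

# Per-contrast columns, as named in the long-format example above.
VALUE_COLUMNS = ["LFC", "neg_selection_fdr", "pos_selection_fdr"]


def long_to_wide(long_rows):
    """Pivot long-format rows (one per gene and contrast) into one row per gene."""
    contrasts = []  # contrasts in order of first appearance
    genes = {}      # gene name -> wide row (dicts preserve insertion order)
    for row in long_rows:
        contrast = row["contrast"]
        if contrast not in contrasts:
            contrasts.append(contrast)
        wide = genes.setdefault(row["Name"], {
            "Name": row["Name"],
            "number_of_barcodes": row["number_of_barcodes"],
        })
        # Suffix each per-contrast value with the contrast name, e.g. LFC_d1.
        for column in VALUE_COLUMNS:
            wide[f"{column}_{contrast}"] = row[column]
    header = ["Name", "number_of_barcodes"] + [
        f"{column}_{contrast}" for column in VALUE_COLUMNS for contrast in contrasts
    ]
    return header, list(genes.values())


# Long-format example copied from the table above.
LONG_TEXT = """Name,number_of_barcodes,LFC,neg_selection_fdr,pos_selection_fdr,contrast
geneA,3,1.2,0.05,0.9,d1
geneA,3,1.5,0.03,0.8,d2
geneB,2,-0.8,0.8,0.1,d1
geneB,2,-0.9,0.7,0.2,d2
"""

header, wide_rows = long_to_wide(csv.DictReader(io.StringIO(LONG_TEXT)))
print(",".join(header))
for wide_row in wide_rows:
    # A gene missing from some contrast would get an empty cell here.
    print(",".join(wide_row.get(column, "") for column in header))
```

Values are kept as strings so that the reshaped rows match the source file exactly; convert them with `float` if numeric comparisons are needed downstream.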
Example Usage
# Basic usage with long format output (default)
mbarq analyze -i <count_file> -s <sample_data_file> -c <control_file> \
--treatment_column treatment --baseline control
# Output results in wide format
mbarq analyze -i <count_file> -s <sample_data_file> -c <control_file> \
--treatment_column treatment --baseline control --format wide
All Options
mbarq analyze

Usage: mbarq analyze <options>

Options:
  -i, --count_file FILE    CSV file produced by `mbarq merge`
  -s, --sample_data FILE   CSV file containing sample data
  -c, --control_file FILE  control barcode file, see documentation for proper
                           format
  -g, --gene_name STR      column in the count file containing gene
                           identifiers [Name]
  --treatment_column STR   column in sample data file indicating treatment
  --baseline STR           treatment level to use as control/baseline, ex.
                           day0
  -n, --name STR           experiment name, by default will try to use count
                           file name
  -o, --out_dir DIR        Output directory
  --norm_method STR        mageck normalization method: median, total, or
                           control. By default will use control barcodes if
                           provided, otherwise median
  --filter_low_counts INT  filter out barcodes with < N reads across all
                           conditions [0]
  -f, --format STR         output file format: long or wide [long]
  -h, --help               Show this message and exit.