Analyze#

Identify differentially abundant genes between the control (the inoculum) and treatment conditions with mbarq analyze#

Input/Output Files#

Required Inputs

  • Count file produced by mbarq merge

barcode

Name

Sample1

Sample2

ACCTGGTAG

geneA

500

1000

ACCGGGGAA

geneA

100

500

CCCGGGAAA

geneB

300

300

  • Sample data file (CSV) in the following format:

sampleID

treatment

Sample1

control

Sample2

treatment1

  • Name of the column indicating treatment in the sample data should be specified using --treatment_column (for the example above, --treatment_column treatment)

  • Treatment level that should be used as a control/baseline should be specified using --baseline (for the example above, --baseline control)

Suggested Inputs

  • We highly recommend adding control strains (i.e. strains with barcodes inserted into fitness-neutral locations) to the barcode library. This greatly facilitates quality control and analysis of the data.

  • If control strains are present in the library, the control barcodes can be specified with a control file using the --control_file option.

    • In the simplest option, the control file will only contain the barcode sequences of the control strains (1 barcode per line).

    • If different control strains were added at different concentrations, the concentration of each barcode can be specified in the second column.

    • If control strains included strains of different genotypes (ex. wild type as well as negative control strains), the genotype can be specified in the 3rd column.

    • Only wild-type strains will be used for quality control and analysis. This should be specified as wt, WT, or wildtype.

    • The control file should be in CSV format, and contain NO header.

[Required]

[Optional]

[Optional]

ACCTGGGTT

0.005

wt

CCGGAAGGT

0.001

wt

Output Files

  • mbarq_merged_counts_batch.txt: Information on sample and batch

  • mbarq_merged_counts.correlations.csv: Correlation for each batch

  • mbarq_merged_counts_rra_results.csv: Information for each gene about number of barcodes, LFC and false discovery rate

  • mbarq_merged_counts_barcodes_results.csv: Information for each barcode about LFC and significance scores

For each comparison:

  • mbarq_merged_counts_cond1_vs_cond0.gene_summary.txt: Summary for each gene

  • mbarq_merged_counts_cond1_vs_cond0.report.Rmd: MAGeCK Comparison Report

  • mbarq_merged_counts_cond1_vs_cond0.sgrna_summary.txt: Summary for each sgRNA

Output Format Options

The final results can be output in two formats:

  • Long format (default): Each row represents a gene-treatment combination. This format includes a ‘contrast’ column indicating the treatment condition.

  • Wide format: Each row represents a gene, with separate columns for each treatment condition (e.g., ‘LFC_d1’, ‘LFC_d2’). This format is useful for downstream analysis and visualization.

Use the --format option to specify the desired output format.

Format Examples:

Long format (default):

Name

number_of_barcodes

LFC

neg_selection_fdr

pos_selection_fdr

contrast

geneA

3

1.2

0.05

0.9

d1

geneA

3

1.5

0.03

0.8

d2

geneB

2

-0.8

0.8

0.1

d1

geneB

2

-0.9

0.7

0.2

d2

Wide format:

Name

number_of_barcodes

LFC_d1

LFC_d2

neg_selection_fdr_d1

neg_selection_fdr_d2

pos_selection_fdr_d1

pos_selection_fdr_d2

geneA

3

1.2

1.5

0.05

0.03

0.9

0.8

geneB

2

-0.8

-0.9

0.8

0.7

0.1

0.2

Example Usage#

# Basic usage with long format output (default)
mbarq analyze -i <count_file> -s <sample_data_file> -c <control_file> \ 
--treatment_column treatment --baseline control 

# Output results in wide format
mbarq analyze -i <count_file> -s <sample_data_file> -c <control_file> \ 
--treatment_column treatment --baseline control --format wide

All Options#

mbarq analyze
Usage: mbarq analyze <options>

Options:
  -i, --count_file FILE    CSV file produced by `mbarq merge`
  -s, --sample_data FILE   CSV file containing sample data
  -c, --control_file FILE  control barcode file, see documentation for proper
                           format
  -g, --gene_name STR      column in the count file containing gene
                           identifiers [Name]
  --treatment_column STR   column in sample data file indicating treatment
  --baseline STR           treatment level to use as control/baseline, ex.
                           day0
  -n, --name STR           experiment name, by default will try to use count
                           file name
  -o, --out_dir DIR        Output directory
  --norm_method STR        mageck normalization method: median, total, or 
                           control. By default will use control barcodes if 
                           provided, otherwise median
  --filter_low_counts INT  filter out barcodes with < N reads across all 
                           conditions [0]
  -f, --format STR         output file format: long or wide [long]
  -h, --help               Show this message and exit.