Identifying Salmonella colonization determinants in mice
This analysis is based on the BarSeq screen described in "Import of aspartate and malate by DcuABC drives H2/fumarate respiration to promote initial Salmonella gut-lumen colonization in mice". In that study, the authors generated and sequenced a barcoded library of Salmonella mutants. The library was then used to infect LCM mice, and fecal pellets from the infected mice were collected on days 1, 2, 3, and 4 post-infection. Each fecal sample was sequenced to count the abundance of each barcoded mutant. The goal of the analysis was to identify which mutants are lost on different days post-infection, and thus which genes are important for Salmonella pathogenesis.
Setup#
Make sure you have followed the installation instructions.
Download and unpack the test data. After running the commands below, you should see a directory named nguyenb_walkthrough, which contains all the data you need for this walkthrough.
tar -xvf nguyenb.tar.gz
cd nguyenb_walkthrough
ls
Make sure mbarq is installed, and that you have created and activated the mbarq environment.
conda activate mbarq
Mapping#
The first step in analyzing a random barcode mutagenesis experiment is to determine the position of each barcode in the host genome. This can be accomplished with mbarq map.
Files needed for this analysis:
library_11_1_sub_1.fq.gz: a subsample of the library sequencing file. The reads in this file contain the barcode plus host DNA sequence, which allows us to identify the location of each barcoded insertion.
SL1344.fna: the genome sequence of the Salmonella strain used to construct the library.
SL1344.gff: the matching annotation file.
Note
For mbarq to run, you need to specify the structure of the transposon construct used in the experiment: the conserved IR motif, the length of the barcode, the length of the spacer (if any) between the barcode and the IR, and their positions relative to each other. The read below is an example from the library_11_1_sub_1.fq.gz file. It starts with the barcode (17 nt), followed by the spacer (13 nt, shown in lower case), the conserved transposon sequence (IR, GTGTATAAGAGACAG), and finally the host sequence. For mbarq, this translates into -tn B17N13GTGTATAAGAGACAG
GGGACCAAAGTACTAGAtcagggttgagatGTGTATAAGAGACAGATTGTATTCGCC[…]
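To make the structure string more concrete, the following Python sketch splits the example read into its parts. It only illustrates the notation and is not how mbarq parses reads internally. The function names `parse_tn_structure` and `split_read` are hypothetical, only the barcode-spacer-IR layout used in this walkthrough is handled, and the read is assumed to start exactly at the barcode.

```python
import re

def parse_tn_structure(tn):
    """Parse a string of the form B<barcode length>N<spacer length><IR motif>."""
    match = re.fullmatch(r"B(\d+)N(\d+)([ACGT]+)", tn)
    if match is None:
        raise ValueError(f"unsupported structure string: {tn}")
    return int(match.group(1)), int(match.group(2)), match.group(3)

def split_read(read, tn):
    """Split a read into barcode, spacer, IR motif and host sequence."""
    barcode_len, spacer_len, motif = parse_tn_structure(tn)
    motif_start = barcode_len + spacer_len
    motif_end = motif_start + len(motif)
    # Assumption: the read begins at the first base of the barcode.
    if read[motif_start:motif_end].upper() != motif:
        raise ValueError("IR motif not found at the expected position")
    return {
        "barcode": read[:barcode_len],
        "spacer": read[barcode_len:motif_start],
        "motif": read[motif_start:motif_end],
        "host": read[motif_end:],
    }

# Example read from the walkthrough (the host part is truncated in the text).
read = "GGGACCAAAGTACTAGAtcagggttgagatGTGTATAAGAGACAGATTGTATTCGCC"
parts = split_read(read, "B17N13GTGTATAAGAGACAG")
for name, sequence in parts.items():
    print(f"{name}: {sequence}")
```

If the motif is not where the structure string predicts, the sketch raises an error instead of guessing, which makes a mismatch between the construct and the `-tn` value easy to spot.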
To map barcodes to insertions run:
mbarq map -f library_11_1_sub_1.fq.gz -g SL1344.fna -a SL1344.gff \
-tn B17N13GTGTATAAGAGACAG -n nguyenb_library_map
The final results are saved in nguyenb_library_map.annotated.csv. Because this is a test dataset, we did not filter the results. For real data, we recommend filtering out barcodes that are supported by only a few reads. The appropriate threshold varies from dataset to dataset; anything from 10 to 100 reads can be reasonable (for example, specifying -l 10 filters out any barcodes supported by fewer than 10 reads).
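The effect of such a threshold can be sketched in Python as follows. The column names `barcode` and `number_of_reads` and the example rows are hypothetical, invented for illustration; check the header of your own `.annotated.csv` file before adapting this. In practice, the `-l` option mentioned above does the filtering for you.

```python
import csv
import io

# Hypothetical excerpt of a mapping table; real column names may differ.
example_csv = """barcode,number_of_reads
AAAACCCCGGGGTTTTA,250
CCCCAAAATTTTGGGGC,9
GGGGTTTTAAAACCCCG,10
"""

def filter_by_read_support(rows, min_reads):
    """Keep barcodes supported by at least `min_reads` reads."""
    return [row for row in rows if int(row["number_of_reads"]) >= min_reads]

rows = list(csv.DictReader(io.StringIO(example_csv)))
kept = filter_by_read_support(rows, min_reads=10)
print([row["barcode"] for row in kept])
```

Trying a few thresholds and checking how many barcodes remain is a simple way to choose a cutoff for a given dataset.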
Counting#
Now we are ready to analyze our samples (i.e. fecal pellets collected from different mice on different days post-infection). For each sample, run the mbarq count command to generate a table of barcode counts.
Files needed for this analysis:
library_11_1_map.annotated.csv: a full map of the mutant library (the one we generated above was done using a subsample of the data, and would not contain all the barcodes). Contains information about the chromosomal location of each of the barcodes.
dnaid1315_124_subsample.fasta.gz: a subsample of the sequencing file generated from one of the fecal samples. The reads in this file contain only the barcode sequences (no host DNA), which allows us to count the abundance of each barcode in the sample.
To count the barcodes run:
mbarq count -f dnaid1315_124_subsample.fasta.gz -m nguyenb_library_map.annotated.csv \
-tn B17N13GTGTATAAGAGACAG
You can examine the resulting count table: dnaid1315_124_subsample_mbarq_counts.csv
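Since mbarq count is run once per sample, it can be convenient to generate the commands programmatically. The Python sketch below only builds and prints the command lines, using the same `-f`, `-m` and `-tn` options as above. The sample file names in `samples` are hypothetical, and the helper `build_count_command` is not part of mbarq.

```python
import shlex

def build_count_command(reads_file, mapping_file, tn_structure):
    """Return the mbarq count command for one sample as a list of arguments."""
    return ["mbarq", "count", "-f", reads_file, "-m", mapping_file, "-tn", tn_structure]

# Hypothetical sample files; replace with your own sequencing files.
samples = ["sample_A.fasta.gz", "sample_B.fasta.gz"]

commands = [
    build_count_command(sample, "library_11_1_map.annotated.csv", "B17N13GTGTATAAGAGACAG")
    for sample in samples
]
for command in commands:
    # Print only; to execute, pass `command` to subprocess.run(command, check=True).
    print(shlex.join(command))
```

Keeping each command as a list of arguments avoids quoting problems if a file name contains spaces.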
Merging count files#
After generating the count files for each of your samples, you can merge them into a single file using mbarq merge. To demonstrate this, we will use two previously generated count files, for samples dnaid1315_17 and dnaid1315_18.
Files needed for this analysis:
dnaid1315_17_mbarq_counts.csv and dnaid1315_18_mbarq_counts.csv are examples of barcode count files.
To merge the count tables into a single table run:
mbarq merge -i dnaid1315_17_mbarq_counts.csv,dnaid1315_18_mbarq_counts.csv -a Name -n nguyenb_counts
You can examine the resulting count table: nguyenb_counts_mbarq_merged_counts.csv
Note
You can also place all the count files into the same directory, and specify the directory name with -d instead of listing all the file names as was shown above.
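Conceptually, merging combines per-sample tables keyed by barcode into one table with one column per sample. The Python sketch below illustrates the idea on made-up counts. It is not mbarq's implementation, the function `merge_counts` is hypothetical, and filling barcodes missing from a sample with 0 is an assumption made for this example.

```python
def merge_counts(tables):
    """Merge {sample: {barcode: count}} into {barcode: {sample: count}}."""
    barcodes = sorted({bc for counts in tables.values() for bc in counts})
    return {
        # Assumption: a barcode absent from a sample is recorded as 0.
        bc: {sample: counts.get(bc, 0) for sample, counts in tables.items()}
        for bc in barcodes
    }

# Made-up counts for two samples.
tables = {
    "dnaid1315_17": {"AAAACCCCGGGGTTTTA": 120, "CCCCAAAATTTTGGGGC": 30},
    "dnaid1315_18": {"AAAACCCCGGGGTTTTA": 95, "GGGGTTTTAAAACCCCG": 12},
}

merged = merge_counts(tables)
for barcode, counts in merged.items():
    print(barcode, counts)
```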
Analysis#
The goal of this experiment was to identify potential fitness factors on different days of Salmonella infection. Thus, we want to compare mutant abundances in the samples from d1, d2, d3, and d4 to those in the inoculum (mutant library + control strains cultured in LB), labeled as d0.
Files needed for this analysis:
library_11_1_mbarq_merged_counts.csv contains counts for all the samples used in the experiment.
sample_data.csv contains sample metadata, i.e. day post-infection and mouse ID for each of the samples.
control_strains.csv contains a list of barcoded wild-type isogenic strains used as a control in the study.
Note
You can read more about the sample_data.csv and control_strains.csv file formats in the analysis section of the documentation.
mkdir results
mbarq analyze -i library_11_1_mbarq_merged_counts.csv -s sample_data.csv -c control_strains.csv --treatment_column day --baseline d0 -o results
mbarq analyze creates a folder containing library_11_1_mbarq_merged_counts_rra_results.csv that lists log fold changes (LFC) and false discovery rates (FDR) for each gene in the library. You can upload this file to the mBARq App to create heatmaps, perform functional analysis, and visualize the results in the context of KEGG metabolic maps.
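As a rough intuition for what a log fold change expresses, the Python sketch below computes the log2 ratio of a mutant's relative abundance on a given day versus the inoculum (d0). This is a simplified illustration with made-up numbers and a pseudocount of 1 (an assumption). It does not reproduce the normalization or the statistics that mbarq analyze performs, and the function `log2_fold_change` is hypothetical.

```python
import math

def log2_fold_change(count, total, baseline_count, baseline_total, pseudocount=1):
    """log2 ratio of relative abundance in a sample versus the baseline."""
    sample_fraction = (count + pseudocount) / total
    baseline_fraction = (baseline_count + pseudocount) / baseline_total
    return math.log2(sample_fraction / baseline_fraction)

# Made-up example: a mutant drops from 999 to 124 reads per 100,000 reads.
lfc = log2_fold_change(count=124, total=100_000,
                       baseline_count=999, baseline_total=100_000)
print(round(lfc, 2))
```

A negative value means the mutant is depleted relative to d0, and a positive value means it is enriched.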