Microbiome Data Science with R

For a year, I have uploaded data uploaded to http://lassesen.com/ubiome/ for interested parties (i.e. data scientists and statisticians) to explore. Making anonymous data available for research has been a key goal for my ubiome site (http://microbiomeprescription.azurewebsites.net ).

In this post, I will document how people can load and so some exploration of it. There are some excellent on line material for learning R, for example dataCamp where the presenters are assistant professors and Ph.D. – but do a clean presentation in their lower level courses.

Getting the data loaded in R

I assume you have installed R or even better, R-Studio (the free one is fine).

I suggest that the following packages are installed:

  1. tidyverse this is a bundle of other packages (the typical ones used by dataCamp)

Now execute (commands are in bold, messages returned in italic):

library(tidyverse)
— Attaching packages ———————————————————————————————————————————————- tidyverse 1.2.1 —
v ggplot2 3.1.0 v purrr 0.3.1
v tibble 2.1.1 v dplyr 0.8.0.1
v tidyr 0.8.3 v stringr 1.4.0
v readr 1.3.1 v forcats 0.4.0
— Conflicts ————————————————————————————————————————————————- tidyverse_conflicts() —
x dplyr::filter() masks stats::filter()
x dplyr::lag() masks stats::lag()
Warning messages:
1: package ‘tidyverse’ was built under R version 3.5.3
2: package ‘tibble’ was built under R version 3.5.3
3: package ‘dplyr’ was built under R version 3.5.3

Next we grab the list of symptoms off the web site and place the data in symptoms

symptoms <-read_csv(“http://lassesen.com/ubiome/Symptoms.csv”)
Parsed with column specification:
cols(
symptomId = col_double(),
symptom_name = col_character()
)

Next let us inspect the data in symptoms

symptoms
A tibble: 277 x 2
symptomId symptom_name

1 186 Age: 0-10
2 187 Age: 10-20
3 188 Age: 20-30
4 189 Age: 30-40
5 190 Age: 40-50
6 191 Age: 50-60
7 192 Age: 60-70
8 193 Age: 70-80
9 194 Age: 80-90
10 195 Age: 91 and higher
… with 267 more rows

Next we will grab the bacteria taxonomy and place in taxonomy. I cheated to both load and list a portion by putting ( ) around the import. Note that we have the hierarchy included (parent_taxon) as well as the tax_rank.

library(tidyverse)

(taxonomy <-read_csv(“http://lassesen.com/ubiome/Taxonomy.csv”))
Parsed with column specification:
cols(
taxon = col_double(),
parent_taxon = col_double(),
tax_name = col_character(),
tax_rank = col_character()
)
A tibble: 4,454 x 4
taxon parent_taxon tax_name tax_rank <dbl> <dbl> <chr> <chr>
1 1 0 root root
2 203486 1 Dictyoglomi class
3 4751 1 Fungi kingdom
4 33154 1 Opisthokonta no_rank
5 2 1 Bacteria root
6 2157 1 Archaea superkingdom
7 2759 1 Eukaryota superkingdom
8 1437201 2 Pentapetalae class
9 538999 2 Clostridiales incertae sedis family
10 2323 2 unclassified Bacteria kingdom
… with 4,444 more rows

Next we are going to grab all of the samples available (pre-filtered to “gut” only before I uploaded). Note that is none was detected (0) there is no observation — This means that some of the analysis may need to use NA.

(samples <-read_csv(“http://lassesen.com/ubiome/samples.csv”))
Parsed with column specification:
cols(
sampleId = col_double(),
taxon = col_double(),
count_norm = col_double()
)
A tibble: 173,356 x 3
sampleId taxon count_norm

1 6 1 1014583
2 6 2 1000000
3 6 237 2614
4 6 712 280
5 6 724 280
6 6 729 280
7 6 815 158414
8 6 816 158321
9 6 818 155
10 6 820 373
… with 173,346 more rows

The last items is the symptoms to sample mixture. Remember that some people do provide symptoms and some people do not. This means that if we are looking at symptoms vs taxonomy (bacteria) — we must restrict observations to only those with symptoms.

(sampleSymptom <-read_csv(“http://lassesen.com/ubiome/SampleSymptom.csv”))
Parsed with column specification:
cols(
sampleId = col_double(),
symptomId = col_double()
)
A tibble: 8,254 x 2
sampleId symptomId

1 -609 1
2 -609 2
3 -609 3
4 -609 4
5 -609 5
6 -609 6
7 -609 17
8 -609 18
9 -609 19
10 -609 21
… with 8,244 more rows


Simple Manipulations

First we are going to create a data.frame containing only the sampleId of people that have contributed symptoms. Note that we went from
173,356 observations down to 54,632 observations. Roughly 1/3 of the uploads provided symptoms.

(samplesWithSymptoms <- inner_join(samples,sampleSymptom, by=”sampleId”))
A tibble: 54,632 x 4
sampleId taxon count_norm symptomId

1 421 1 1037869 131
2 421 2 1000000 131
3 421 237 3809 131
4 421 543 55 131
5 421 579 55 131
6 421 815 196324 131
7 421 816 196324 131
8 421 818 8480 131
9 421 820 2224 131
10 421 821 109575 131
… with 54,622 more rows

Adding Symptom Names

(samplesWithSymptomNames <- inner_join(samplesWithSymptoms,symptoms, by=”symptomId”))
A tibble: 54,632 x 5
sampleId taxon count_norm symptomId symptom_name

1 421 1 1037869 131 Asymptomatic: No Health Issues
2 421 2 1000000 131 Asymptomatic: No Health Issues
3 421 237 3809 131 Asymptomatic: No Health Issues
4 421 543 55 131 Asymptomatic: No Health Issues
5 421 579 55 131 Asymptomatic: No Health Issues
6 421 815 196324 131 Asymptomatic: No Health Issues
7 421 816 196324 131 Asymptomatic: No Health Issues
8 421 818 8480 131 Asymptomatic: No Health Issues
9 421 820 2224 131 Asymptomatic: No Health Issues
10 421 821 109575 131 Asymptomatic: No Health Issues
… with 54,622 more rows

Adding Taxonomy Information

As above, with Tidyverse, it is a simple one line of code

(samplesExpanded <- inner_join(samplesWithSymptomNames,taxonomy, by=”taxon”))
A tibble: 54,632 x 8
sampleId taxon count_norm symptomId symptom_name parent_taxon tax_name tax_rank

1 421 1 1037869 131 Asymptomatic: No Health Issues 0 root root
2 421 2 1000000 131 Asymptomatic: No Health Issues 1 Bacteria root
3 421 237 3809 131 Asymptomatic: No Health Issues 49546 Flavobacterium genus
4 421 543 55 131 Asymptomatic: No Health Issues 91347 Enterobacteriaceae family
5 421 579 55 131 Asymptomatic: No Health Issues 543 Kluyvera genus
6 421 815 196324 131 Asymptomatic: No Health Issues 171549 Bacteroidaceae family
7 421 816 196324 131 Asymptomatic: No Health Issues 815 Bacteroides genus
8 421 818 8480 131 Asymptomatic: No Health Issues 816 Bacteroides thetaiotaomicron species
9 421 820 2224 131 Asymptomatic: No Health Issues 816 Bacteroides uniformis species
10 421 821 109575 131 Asymptomatic: No Health Issues 816 Bacteroides vulgatus species
… with 54,622 more rows

That’s it for the first post. I will followup with some explorations in coming posts. (If you want to guest write some R analysis posts — write it up as a word document or r-markup file and email to me, ken/at/lassesen.com)

It is my belief that microbiome patterns are more strongly associated with symptoms than ‘official diagnosis’

Bottom Line

Getting the data from the web into R is just a few lines of code (just copy and paste below)

  • library(tidyverse)
  • (symptoms <-read_csv(“http://lassesen.com/ubiome/Symptoms.csv”) )
  • (samples <-read_csv(“http://lassesen.com/ubiome/samples.csv”))
  • (taxonomy <-read_csv(“http://lassesen.com/ubiome/Taxonomy.csv”))
  • (sampleSymptom <-read_csv(“http://lassesen.com/ubiome/SampleSymptom.csv”))
  • (samplesWithSymptoms <- inner_join(samples,sampleSymptom, by=”sampleId”))
  • (samplesWithSymptoms <- inner_join(samples,sampleSymptom, by=”sampleId”))
  • (samplesExpanded <- inner_join(samplesWithSymptomNames,taxonomy, by=”taxon”))

Please pass along to any statisticians or data scientists that you know.