December 22, 2019 / by Rafał Małanij
Apart from research and implementation projects built on Big Data technologies, our team is also involved in educational programmes. We run Big Data classes, and recently some of us taught Python at Warsaw University of Technology.
In the middle of December, however, we had a chance to show how to analyse omics data in a distributed computing environment. ICM UW, together with the Institute of Mother and Child, organised a course for non-geneticists on analysing genomics data: “Omics Data Science - Bioinformatics and Large Scale Medical Data Analysis”.
The whole course was organised as a couple of 2-day sessions. We were responsible for delivering a workshop on genomic pipelines and genomic data analysis. As we work mainly with Big Data tools, we prepared a few Jupyter notebooks showing what a genomic pipeline is and how to run it in a distributed environment. We introduced participants to the variety of genomic file formats and went through the structure of FASTQ, BAM and VCF files. All participants were able to run the alignment and annotation steps, both distributed with the Apache Spark framework.
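To give a flavour of the FASTQ structure mentioned above, here is a minimal sketch in plain Python. It is not taken from the workshop notebooks: the names `FastqRecord`, `read_fastq` and `phred_scores` are made up for illustration, and it assumes the simple four-line FASTQ layout (header starting with `@`, sequence, separator starting with `+`, quality string) with Phred+33 quality encoding.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    """One read: identifier, bases and per-base quality characters."""
    read_id: str
    sequence: str
    quality: str


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a four-line-per-read FASTQ stream (assumed layout)."""
    while True:
        line = handle.readline()
        if not line:          # end of file
            break
        header = line.strip()
        if not header:        # skip blank lines between records
            continue
        sequence = handle.readline().strip()
        separator = handle.readline().strip()
        quality = handle.readline().strip()
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"Malformed FASTQ record near: {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"Sequence/quality length mismatch in: {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode quality characters into Phred scores (Phred+33 assumed)."""
    return [ord(char) - offset for char in quality]


# A tiny in-memory example with a single, made-up read.
sample = "@read1\nGATTACA\n+\nIIIIII#\n"
for record in read_fastq(io.StringIO(sample)):
    print(record.read_id, record.sequence, phred_scores(record.quality))
```

In a distributed setting the same per-record logic would be applied to partitions of a much larger file; the sketch only shows what a single record looks like.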
For the analysis part we asked participants to load a file from the Broad Institute containing data about CNVs within the BRCA2 gene. It was a good exercise for learning PySpark, SparkSQL and the concept of DataFrames while exploring the structure of the data. Finally, we visualised the data in IGV embedded in a Jupyter notebook and in a couple of Matplotlib plots.
That was a lot of material to explain and learn. Both days were packed with practical knowledge and many exercises for the students to practise what they had learnt. What is more, there are few opportunities to work with genomic data on distributed computing clusters, so this was probably a completely new experience for the participants. Still, I am sure that they enjoyed it a lot!
We plan to run this workshop at conferences and events. We have submitted a proposal to ASHG 2020 with a shortened version of the workshop. If you are organising an event around genomics, let us know. I am sure your participants will be extremely interested in learning how to analyse thousands of genomes in a “Data Science” way.