Apart from research and implementation projects built on Big Data technologies, our team is also involved in educational programmes. We run Big Data classes, and recently some of us taught Python at Warsaw University of Technology.
In the middle of December we also had a chance to show how to analyse omics data in a distributed computing environment. ICM UW, together with the Institute of Mother and Child, organised a course for non-geneticists on analysing genomics data: Omics Data Science - Bioinformatics and Large Scale Medical Data Analysis.
The whole course was organised as a series of 2-day sessions. We were responsible for delivering a workshop on genomic pipelines and genomic data analysis. As we work mainly with Big Data tools, we prepared a few Jupyter notebooks showing what a genomic pipeline is and how to run it in a distributed environment. We introduced participants to the variety of genomic file formats and went through the structure of FASTQ, BAM and VCF files. All participants were able to run the alignment and annotation steps, both distributed with the Apache Spark framework.
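To give a flavour of the file-structure part, here is a minimal sketch of reading FASTQ records in plain Python. It is not taken from the workshop notebooks: the names `parse_fastq` and `FastqRecord` and the two example reads are made up for illustration, and the code assumes the common layout of four lines per record with Phred+33 quality encoding.

```python
from typing import Iterable, Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    """One sequencing read (hypothetical helper type for this sketch)."""
    read_id: str
    sequence: str
    qualities: List[int]  # Phred scores, one per base


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield records from four-line FASTQ text (assumes Phred+33 qualities)."""
    it = (line.rstrip("\n") for line in lines)
    for header in it:
        if not header:
            continue  # tolerate blank lines between records
        sequence = next(it, None)
        separator = next(it, None)
        quality = next(it, None)
        if sequence is None or separator is None or quality is None:
            raise ValueError("truncated FASTQ record: " + header)
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record: " + header)
        if len(sequence) != len(quality):
            raise ValueError("sequence/quality length mismatch: " + header)
        # Phred+33: the quality score is the ASCII code of the character minus 33.
        scores = [ord(ch) - 33 for ch in quality]
        yield FastqRecord(header[1:], sequence, scores)


# Two made-up reads, only to show the layout: id, bases, '+', qualities.
EXAMPLE = """@read1
GATTACA
+
IIIIIII
@read2
ACGT
+
!5I?
"""

records = list(parse_fastq(EXAMPLE.splitlines()))
for rec in records:
    print(rec.read_id, rec.sequence, rec.qualities)
```

Real FASTQ files are usually gzip-compressed and far too large for a list, so in practice one would iterate over the generator lazily or hand the file to a dedicated library instead.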
For the analysis part we asked participants to load a file from the Broad Institute containing data about CNVs (copy number variants) within the BRCA2 gene. It was a good exercise for learning PySpark, SparkSQL and the concept of DataFrames while exploring the structure of the data. Finally, we visualised the data in IGV embedded in a Jupyter notebook and in a couple of Matplotlib plots.
That was a lot of material to explain and to learn. Both days were packed with practical knowledge and many exercises for practising what the students had learnt. We plan to run these workshops at conferences and other events.