December 10, 2018 / by Agnieszka Szmurło

Reference set selection for CNV detection

Recently we worked on improving the performance of available CNV callers through careful selection of the reference sample set. Our aim was to choose the samples most similar to the investigated one, and we accomplished this with a clustering-based approach.

We evaluated both the kNN and k-means clustering methods. The results show that both improve the performance of CNV callers compared with using the whole sample set as the reference. However, the k-means method has much lower computational complexity, so we suggest it as the preferred way of preprocessing reference data.
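To make the idea concrete, here is a minimal sketch of k-means-based reference set selection. It is not the code from the publication. It assumes that each sample is described by a vector of normalized per-target read-depth values. The function names (`kmeans_labels`, `select_reference_set`), the toy profiles and the farthest-first initialization are hypothetical choices made for illustration only.

```python
import math


def distance(a, b):
    """Euclidean distance between two coverage profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(points):
    """Component-wise mean of a non-empty list of profiles."""
    return [sum(col) / len(points) for col in zip(*points)]


def kmeans_labels(profiles, k, iterations=100):
    """Plain Lloyd's k-means; returns {sample_name: cluster_index}.

    Assumption: `profiles` maps sample names to equal-length lists of
    normalized read-depth values (hypothetical input format).
    """
    names = sorted(profiles)
    # Deterministic farthest-first initialization (an illustrative choice).
    centroids = [list(profiles[names[0]])]
    while len(centroids) < k:
        farthest = max(
            names,
            key=lambda n: min(distance(profiles[n], c) for c in centroids),
        )
        centroids.append(list(profiles[farthest]))

    labels = {}
    for _ in range(iterations):
        # Assignment step: each sample goes to its nearest centroid.
        new_labels = {
            n: min(range(k), key=lambda c: distance(profiles[n], centroids[c]))
            for n in names
        }
        if new_labels == labels:
            break  # converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [profiles[n] for n in names if labels[n] == c]
            if members:
                centroids[c] = mean(members)
    return labels


def select_reference_set(profiles, target, k):
    """Return the samples that share a cluster with `target`."""
    labels = kmeans_labels(profiles, k)
    return sorted(
        n for n in profiles if n != target and labels[n] == labels[target]
    )


# Toy data: two groups of samples with clearly different coverage levels.
profiles = {
    "s1": [1.0, 1.1, 0.9],
    "s2": [1.1, 1.0, 1.0],
    "s3": [0.9, 1.0, 1.1],
    "s4": [3.0, 3.1, 2.9],
    "s5": [3.1, 2.9, 3.0],
}
print(select_reference_set(profiles, "s1", k=2))
```

Clustering is done once for the whole cohort, and every sample then reuses the cluster it belongs to as its reference set. A kNN-style selection instead has to rank neighbours separately for each investigated sample, which fits the lower computational cost reported for k-means above.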

The method overview is shown below:

[Figure: method overview. Image alt text: "alignment reads to reference"]

The publication has been submitted to BMC Bioinformatics and is currently available as a preprint on bioRxiv. Testing scripts can be found on GitHub.

