Bohdan Khomtchouk, Ph.D.

Postdoctoral Research Scholar at Stanford University

Bioinformatician and computational biologist with a postdoctoral research appointment in the Department of Biology at Stanford University.

About Me

I am a data science postdoctoral fellow working in the field of computational epigenetics in the laboratory of Prof. Dr. Or Gozani M.D., Ph.D. at Stanford University in Stanford, CA USA. My favorite large-scale data science project is Biochat. My published works to date include a Molecular Psychiatry paper, a PLoS One paper, a BMC Bioinformatics paper, a Briefings in Bioinformatics paper, a Source Code for Biology and Medicine paper, a Journal of the American Heart Association paper, an R/Bioconductor package, and a Python/PyPI package. I have also published an organic chemistry textbook with CRC Press (Taylor & Francis). In addition, I have released several preprints, which are currently under peer review. These preprints span the fields of mathematical genetics and evolutionary biology (here, here, and here), databases and search engines (here), and nonlinear physics (here). As a bioinformatics software engineer and biostatistician, I am a big fan of the C and R programming languages (respectively), especially when I can combine them (see geneXtendeR). I also love programming in Lisp and talk about its advantages for bioinformatics and computational biology here. Overall, I enjoy writing extremely efficient implementations of various algorithms and data structures for solving computationally complex biological problems. I am always looking for yet another way to shave off yet another millisecond of runtime performance, even if it may not be the most practical use of my time (but a millisecond saved on each of a million cycles adds up to over 16 minutes, which is certainly tangible!).

Previously, I was a genetics doctoral candidate funded by the National Defense Science & Engineering Graduate (NDSEG) Fellowship, working in the laboratory of Prof. Dr. Claes Wahlestedt M.D., Ph.D. at the University of Miami Miller School of Medicine in Miami, FL USA. I graduated with a Ph.D. in Human Genetics and Genomics and my dissertation title is: Histone modification ChIP-seq algorithm engineering and high performance bioinformatics graphics and analysis software for computational epigenetics. My thesis committee chair was Prof. Dr. Nicholas Tsinoremas Ph.D. and my thesis committee members were: Prof. Dr. Mitsunori Ogihara Ph.D., Prof. Dr. Eden Martin Ph.D., and Prof. Dr. Phil Harvey Ph.D. My biophysics advisor was Prof. Dr. Wolfgang Nonner M.D. and my bioinformatics advisor was Derek Van Booven, Senior Bioinformatics Analyst at the John P. Hussman Institute for Human Genomics. My academic appointments were in the Center for Therapeutic Innovation and the Department of Psychiatry and Behavioral Sciences.

Prior to graduate school, I was a triple-major summa cum laude (B.Sc. Mathematics, B.Sc. Physics, B.Sc. Molecular Biology and Biochemistry) at Benedictine University in Lisle, IL USA.

NIH biosketch

I am a bioinformatician, computational geneticist, and (epi-)genome researcher. I work across a broad array of omics data, including next-generation sequencing assays (RNA-seq, small RNA-seq, methyl-seq, ATAC-seq, and ChIP-seq) and mass spectrometry. My primary area of expertise is in the development of novel algorithms and computational pipelines for histone modification ChIP-seq data analysis -- the focus of my Ph.D. dissertation. As part of my thesis, I made several important basic science contributions to the understanding of epigenetic mechanisms in alcohol addiction and cardiac ischemia, as well as several critical computational breakthroughs in the area of scientific visualization. In addition, I launched several large-scale software engineering products such as: a universal data search engine for all bioinformatics databases worldwide, a high performance graphics engine for rendering ultra fast low memory interactive biological heatmaps, and a high performance computing protocol for efficiently traversing and indexing extremely large directory trees via a remote network. I have also built several large-scale mathematical and statistical big data models for understanding the dynamics of codon usage bias across taxa in the evolutionary tree of life. As such, my primary research interests span epigenetics, integrative multi-omics, mathematical biology, and software engineering.

Publications

Books

Survival Guide to Organic Chemistry: Bridging the Gap from General Chemistry (CRC Press (Taylor & Francis), 2016, pp. 1-674)
Patrick E. McMahon, Bohdan B. Khomtchouk, Claes Wahlestedt

Articles

shinyheatmap: Ultra fast low memory heatmap web interface for big data genomics (PLoS One, 2017, 12(5):e0176334. doi: 10.1371/journal.pone.0176334.)
Bohdan B. Khomtchouk, James R. Hennessy, Claes Wahlestedt

MicroScope: ChIP-seq and RNA-seq software analysis suite for gene expression heatmaps (BMC Bioinformatics, 2016, 17:390, pp. 1-9)
Bohdan B. Khomtchouk, James R. Hennessy, Claes Wahlestedt

How the Strengths of Lisp-Family Languages Facilitate Building Complex and Flexible Bioinformatics Applications (Briefings in Bioinformatics, 2016, doi: 10.1093/bib/bbw130, pp. 1-7)
Bohdan B. Khomtchouk, Edmund Weitz, Peter D. Karp, Claes Wahlestedt

Dependence-induced increase of alcohol self-administration and compulsive drinking mediated by the histone methyltransferase PRDM2 (Molecular Psychiatry, 2016, doi: 10.1038/mp.2016.131, pp. 1-13)
Estelle Barbier, Andrea Johnstone, Bohdan Khomtchouk, Jenica Tapocik, Caleb Pitcairn, Faazal Rehman, Eric Augier, Abbey Borich, Jesse Schank, Christopher Rienas, Derek Van Booven, Sun Hui, Daniel Natt, Claes Wahlestedt, Markus Heilig

Ischemic preconditioning confers epigenetic repression of Mtor and induction of autophagy through G9a-dependent H3K9 di-methylation (Journal of the American Heart Association, 2016, 5:12, pp. 1-27)
Olof Gidlöf, Andrea Johnstone, Kerstin Bader, Bohdan Khomtchouk, Jiaqi O'Reilly, Derek Van Booven, Claes Wahlestedt, Bernhard Metzler, David Erlinge

HeatmapGenerator: high performance RNAseq and microarray visualization software suite to examine differential gene expression levels using an R and C++ hybrid computational pipeline (Source Code for Biology and Medicine, 2014, 9:30, pp. 1-6)
Bohdan Khomtchouk, Derek Van Booven, Claes Wahlestedt

Dissertation

Histone Modification ChIP-seq Algorithm Engineering and High Performance Bioinformatics Graphics and Analysis Software for Computational Epigenetics (Open Access Dissertations, 2017, pp. 1-106)
Bohdan B. Khomtchouk

Preprints

A global perspective of codon usage (bioRxiv (Cold Spring Harbor Laboratory), 2016, pp. 1-9)
Bohdan B. Khomtchouk, Claes Wahlestedt, Wolfgang Nonner

Codon usage is a stochastic process across genetic codes of the kingdoms of life (bioRxiv (Cold Spring Harbor Laboratory), 2016, pp. 1-18)
Bohdan B. Khomtchouk, Claes Wahlestedt, Wolfgang Nonner

SUPERmerge: ChIP-seq coverage island analysis algorithm for broad histone marks (bioRxiv (Cold Spring Harbor Laboratory), 2017, pp. 1-12)
Bohdan B. Khomtchouk, Derek Van Booven, Claes Wahlestedt

PubData: search engine for bioinformatics databases worldwide (bioRxiv (Cold Spring Harbor Laboratory), 2016, pp. 1-8)
Bohdan B. Khomtchouk, Kasra A. Vand, Thor Wahlestedt, Kelly Khomtchouk, Mohammed K. Sayed, Claes Wahlestedt

geneXtendeR: R/Bioconductor package for functional annotation of histone modification ChIP-seq data in a 3D genome world (bioRxiv (Cold Spring Harbor Laboratory), 2016, pp. 1-15)
Bohdan B. Khomtchouk, Derek Van Booven, Claes Wahlestedt

Zipf's law emerges asymptotically during phase transitions in communicative systems (arXiv (Cornell University Library), 2016, pp. 1-5)
Bohdan B. Khomtchouk, Claes Wahlestedt

The mathematics of the genetic code reveal that frequency degeneracy leads to exponential scaling in the DNA codon distribution of Homo sapiens (arXiv (Cornell University Library), 2014, pp. 1-8)
Bohdan B. Khomtchouk

Abstracts

Epigenetic enzymes as a novel class of targets for disease-modifying pharmacotherapies in alcohol addiction (Alcohol, 2017, 60:209)
Estelle Barbier, Andrea Johnstone, Bohdan Khomtchouk, Jenica Tapocik, Caleb Pitcairn, Faazal Rehman, Eric Augier, Abbey Borich, Jesse Schank, Christopher Rienas, Derek Van Booven, Sun Hui, Daniel Natt, Claes Wahlestedt, Markus Heilig

Software packages

geneXtendeR (Bioconductor) (GitHub)

geneXtendeR is an R/Bioconductor package for histone modification ChIP-seq analysis. It is designed to optimally annotate a histone modification ChIP-seq peak input file with functionally important genomic features (e.g., genes associated with peaks) based on optimization calculations. geneXtendeR optimally extends the boundaries of every gene in a genome by some genomic distance (in DNA base pairs) for the purpose of flexibly incorporating cis-regulatory elements, such as enhancers and promoters, as well as downstream elements that are important to the function of the gene (relative to an epigenetic histone modification ChIP-seq dataset). geneXtendeR computes optimal gene extensions tailored to the broadness of the specific epigenetic mark (e.g., H3K9me1, H3K27me3), as determined by a user-supplied ChIP-seq peak input file. As such, geneXtendeR maximizes the signal-to-noise ratio of locating genes closest to and directly under peaks. By performing a computational expansion of this nature, ChIP-seq reads that would initially not map strictly to a specific gene can now be optimally mapped to the regulatory regions of the gene, thereby implicating the gene as a potential candidate, and thereby making the ChIP-seq experiment more successful. Such an approach becomes particularly important when working with epigenetic histone modifications that have inherently broad peaks.
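The gene-extension idea can be illustrated with a minimal sketch. This is not geneXtendeR's code or API (the package is written in R and C): the function names `extend_gene`, `overlaps`, and `peaks_captured` are hypothetical, the coordinates are made up, and strand is ignored, so every gene is simply extended toward lower coordinates.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # (start, end), inclusive, on a single chromosome


def extend_gene(gene: Interval, upstream: int, downstream: int = 0) -> Interval:
    """Widen a gene's boundaries by a number of base pairs (strand ignored)."""
    start, end = gene
    return (max(0, start - upstream), end + downstream)


def overlaps(a: Interval, b: Interval) -> bool:
    """True when two inclusive intervals share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def peaks_captured(genes: List[Interval], peaks: List[Interval], upstream: int) -> int:
    """Count peaks that fall on at least one gene after extension."""
    extended = [extend_gene(g, upstream) for g in genes]
    return sum(any(overlaps(p, g) for g in extended) for p in peaks)


# Hypothetical toy data: two genes and four peaks.
genes = [(1000, 2000), (10000, 12000)]
peaks = [(400, 600), (1500, 1600), (9200, 9400), (50000, 50500)]

# Scan a few candidate extensions and see how many peaks each one captures.
counts: Dict[int, int] = {ext: peaks_captured(genes, peaks, ext) for ext in (0, 500, 1000)}
print(counts)
```

Comparing the counts across candidate extensions is the kind of calculation an extension-selection step could build on; how geneXtendeR actually scores and selects an extension is not specified here.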

happybiRthday (CRAN) (GitHub)

happybiRthday is an R package hosted on CRAN that calculates the upcoming birthdays of GitHub repos. Software creation is a pretty big deal! A repository's initial commit date can be thought of as its birthday. Next time, drop in and wish a developer (any GitHub username) a happy birthday for their repo(s). Or maybe just toast to the upcoming anniversary of your own software! The software life cycle is too short not to celebrate!
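As a rough illustration of the date arithmetic involved (written in Python rather than R, and not taken from the package), the next birthday of a repo can be derived from its initial commit date. The function name `next_birthday` and the choice to celebrate a February 29 commit on March 1 in non-leap years are assumptions of this sketch.

```python
from datetime import date


def next_birthday(first_commit: date, today: date) -> date:
    """Return the next anniversary of first_commit on or after today."""

    def anniversary(year: int) -> date:
        try:
            return first_commit.replace(year=year)
        except ValueError:
            # Assumption: a Feb 29 commit is celebrated on Mar 1 in non-leap years.
            return date(year, 3, 1)

    candidate = anniversary(today.year)
    if candidate < today:
        candidate = anniversary(today.year + 1)
    return candidate


# Hypothetical repo whose initial commit landed on 10 May 2016.
upcoming = next_birthday(date(2016, 5, 10), today=date(2017, 6, 1))
print(upcoming.isoformat())
```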

FTPwalker (PyPI)

FTPwalker is a network programming Python package for optimally traversing extremely large FTP directory trees. It constitutes the algorithmic heart of the PubData search engine. FTPwalker creates a dictionary formatted as a JSON file in the user’s home directory containing all the full paths as keys and the respective filenames as values. FTPwalker is designed with speed in mind by utilizing state-of-the-art high performance parallelism and concurrency algorithms to traverse huge directory trees through a remote network via an FTP connection. The resultant hash table (i.e., dictionary) supports fast lookup for any file in any biological database (or any remote database with an extremely large file system).
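The shape of the resulting index (full paths as keys, filenames as values, serialized as JSON) can be sketched as follows. To stay runnable without a network connection, this sketch walks a throwaway local directory with `os.walk` as a stand-in for a remote FTP tree, and it does so sequentially; it does not reproduce FTPwalker's FTP handling or its parallel traversal. The function name `index_tree` and the sample filenames are hypothetical.

```python
import json
import os
import tempfile


def index_tree(root):
    """Map every file's full path to its filename."""
    index = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            index[os.path.join(dirpath, name)] = name
    return index


# Build a tiny throwaway tree standing in for a remote FTP server.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "geo", "series"))
    for rel in ("README.txt", os.path.join("geo", "series", "GSE0001.txt")):
        with open(os.path.join(root, rel), "w") as handle:
            handle.write("placeholder\n")

    index = index_tree(root)

    # Persist the dictionary as JSON, then read it back to confirm the round trip.
    json_path = os.path.join(root, "index.json")
    with open(json_path, "w") as handle:
        json.dump(index, handle, indent=2)
    with open(json_path) as handle:
        reloaded = json.load(handle)

filenames = sorted(index.values())
print(filenames)
```

Because the index is a plain dictionary, checking whether a given path exists is a single hash lookup once the JSON file has been loaded.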

Software projects

Biochat (GitHub)

Biochat aims at providing an interactive workbench for biological databases (e.g., Gene Expression Omnibus (GEO), miRBase, TCGA, Human Epigenome Atlas, etc.) to learn to communicate with each other by matching and pairing data records across the biological data-verse (terabytes of publicly available data). Biochat is written in Common Lisp and operates based on efficient categorization and pairing of similar items (e.g., words that describe data records) into groups. It is basically the high performance computing (HPC) data science equivalent of the chemistry saying "like dissolves like." We apply the "like dissolves like" principle to teach data files to learn to talk to each other (quite literally). In order to talk, data must first be able to find each other in space (not a trivial task, considering that there are dozens of bioinformatics databases out there... see how we've tackled this problem with PubData). So how, for example, is an RNA-seq dataset supposed to find its potentially related ChIP-seq dataset (e.g., according to some combination of similar cell type, histone mark, sequencing details, etc.)? Through metadata, of course! However, for the datasets to meet each other via a similar metadata footprint requires sophisticated NLP strategies to introduce them. Once the datasets meet, we can let the conversations (i.e., integrative bioinformatics analyses) begin! Hence the name: Biochat. Our ultimate goal is to make integrative multi-omics a lot easier (and more fun) through artificial intelligence (AI). Right now, we are barely scratching the surface with NLP. Thus, we are currently implementing novel neural network approaches to help us teach data to talk to each other (stay tuned!).
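To make the "like dissolves like" idea concrete, here is a deliberately simple sketch that pairs data records by the overlap of the words in their metadata. Biochat itself is written in Common Lisp; this Python sketch is not its code, and Jaccard similarity on whitespace-split tokens is a stand-in chosen for brevity, not the NLP strategy the project uses. The record names and metadata strings are made up.

```python
def tokens(text):
    """Lowercased set of whitespace-separated words in a metadata string."""
    return set(text.lower().split())


def jaccard(a, b):
    """Shared words divided by total distinct words (0.0 for two empty sets)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def best_match(query, candidates):
    """Name of the candidate record whose metadata is most similar to the query."""
    q = tokens(query)
    return max(candidates, key=lambda name: jaccard(q, tokens(candidates[name])))


# Hypothetical metadata for one RNA-seq record and two ChIP-seq records.
rna_seq = "RNA-seq mouse hippocampus adult"
chip_seq_records = {
    "chip_A": "ChIP-seq H3K27me3 mouse hippocampus adult",
    "chip_B": "ChIP-seq H3K9me1 human liver",
}

match = best_match(rna_seq, chip_seq_records)
score = jaccard(tokens(rna_seq), tokens(chip_seq_records[match]))
print(match, score)
```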

SUPERmerge (GitHub)

SUPERmerge is a ChIP-seq read pileup analysis and annotation algorithm for investigating alignment (BAM) files of diffuse histone modification ChIP-seq datasets with broad chromatin domains at a single base pair resolution level. SUPERmerge allows flexible regulation of a variety of read pileup parameters, thereby revealing how read islands aggregate into areas of coverage across the genome and what annotation features they map to within individual biological replicates. SUPERmerge is especially useful for investigating low sample size ChIP-seq experiments in which epigenetic histone modifications (e.g., H3K9me1, H3K27me3) result in inherently broad peaks with a diffuse range of signal enrichment spanning multiple consecutive genomic loci and annotated features.
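One way to picture how read islands aggregate into broader areas of coverage is interval merging with a tunable gap. The sketch below is an illustration only and is not SUPERmerge's algorithm or interface: the function name `merge_islands`, the `max_gap` parameter, and the coordinates are all hypothetical.

```python
def merge_islands(islands, max_gap):
    """Merge (start, end) islands whose separation is at most max_gap base pairs."""
    merged = []
    for start, end in sorted(islands):
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to the previous island: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(island) for island in merged]


# Hypothetical read islands on one chromosome.
islands = [(100, 200), (250, 400), (1000, 1100), (1150, 1300), (5000, 5100)]

strict = merge_islands(islands, max_gap=0)
relaxed = merge_islands(islands, max_gap=100)
print(len(strict), relaxed)
```

Raising the gap tolerance turns five narrow islands into three broader domains, which is the kind of parameter sensitivity that matters for diffuse histone marks.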

shinyDE (GitHub)

shinyDE is an R Shiny web server for all differential expression (DE) callers. Currently, shinyDE supports edgeR, DESeq2, baySeq, NOISeq, SAMSeq, DEGseq, EBSeq, and PoissonSeq. Users can run multiple DE callers in parallel and observe mutual overlaps of statistical results called in common between, e.g., edgeR and DESeq2, which can then further be sorted by p-value, FDR, etc. shinyDE also supports heatmaps, gene ontology analyses, and Venn diagrams.
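The overlap of results between two callers amounts to a set intersection followed by a sort. The sketch below uses made-up gene names and p-values and plain dictionaries in place of real edgeR or DESeq2 output; `common_hits` is a hypothetical helper, not part of shinyDE.

```python
def common_hits(results_a, results_b, alpha=0.05):
    """Genes significant in both result sets, sorted by their worse p-value."""
    significant_a = {gene for gene, p in results_a.items() if p < alpha}
    significant_b = {gene for gene, p in results_b.items() if p < alpha}
    shared = significant_a & significant_b
    return sorted(shared, key=lambda gene: max(results_a[gene], results_b[gene]))


# Hypothetical p-values reported by two DE callers for the same experiment.
caller_one = {"GeneA": 0.001, "GeneB": 0.03, "GeneC": 0.2}
caller_two = {"GeneA": 0.002, "GeneB": 0.04, "GeneD": 0.01}

shared_genes = common_hits(caller_one, caller_two)
print(shared_genes)
```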

NGStoolkit (GitHub)

NGStoolkit is a one-stop shop next-generation sequencing analysis toolkit written in Bash that allows you to run a complete NGS analysis pipeline (e.g., RNA-seq) by issuing just one command from the Terminal. Currently, NGStoolkit supports RNA-seq analysis via a single-click automated RNA-seq pipeline starting with bz2 files through differential expression. Current work is underway to expand NGStoolkit into ChIP-seq, methyl-seq, and exome-seq analysis territory to create a comprehensive NGS analysis toolkit.

Smalls (GitHub)

Smalls is a Python/R/shell scripting software program that efficiently aligns miRNA seed sequences to three-prime and five-prime untranslated regions in DNA (3'-UTR and 5'-UTR gene regions). Smalls' efficiency and flexibility stem from its utilization of advanced string algorithms and user-specified options for approximate matching (e.g., 2 mismatches). miRNA binding at UTR regions has been shown to influence biological processes as diverse as embryonic development and disease progression. It is, therefore, imperative to be able to computationally evaluate a candidate list of miRNAs in relation to their potential UTR binding targets. By evaluating which specific miRNAs align most often in the genome (as quantified by raw count of successful alignment at a given approximate mismatch rate), we can determine which specific miRNAs are most promising to experimentally validate in the lab. More importantly, we can determine the precise genomic position(s) of these miRNA-UTR alignments and, therefore, the identity of the respective gene(s) involved in the biological process at hand. Current work is underway to expand Smalls to cover other types of small RNAs in addition to miRNAs (e.g., snoRNAs, snRNAs, siRNAs, piRNAs, lncRNAs, vlincRNAs).
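Approximate matching with a mismatch budget can be sketched with a brute-force sliding window. This is a plain Hamming-distance scan, not the string algorithms Smalls uses, and it compares the two strings directly on the same strand, ignoring reverse complementarity. The function name `find_approx` and both sequences are made up for illustration.

```python
def find_approx(seed, utr, max_mismatches):
    """Return (position, mismatches) for every window within the mismatch budget."""
    hits = []
    for i in range(len(utr) - len(seed) + 1):
        window = utr[i:i + len(seed)]
        mismatches = sum(a != b for a, b in zip(seed, window))
        if mismatches <= max_mismatches:
            hits.append((i, mismatches))
    return hits


# Hypothetical 6-mer seed and UTR fragment.
seed = "GAGGTA"
utr = "TTGAGGTACCGAGCTAAA"

exact_hits = find_approx(seed, utr, max_mismatches=0)
fuzzy_hits = find_approx(seed, utr, max_mismatches=1)
print(exact_hits, fuzzy_hits)
```

The number of hits per seed is the raw alignment count mentioned above, and the first element of each hit is the alignment position within the scanned sequence.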

Triglav (GitHub)

Triglav is a bioinformatics software program written in the C programming language that performs the following operations:

  • Generates a single nucleotide polymorphism (SNP) list for every input file in your genomic sample pool
  • Extracts variants based on each of these SNP lists
  • Merges all extracted variants into one final file

Triglav outperforms the standard vcf-merge utility found in vcftools, which typically gives rise to problems when merging indels with normal variant call format (VCF) files. Also, the vcf-merge utility only takes the first values (in the INFO field) from the list of merged VCF files, thereby affecting other post-analysis scripts such as VQSR for downstream filtering of the data. Current work is underway to transition Triglav from an HPC environment to a cloud infrastructure. Also, an indel-merge feature is currently under development and testing.
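The three operations listed above can be pictured with a toy sketch in Python. It is one possible reading of the workflow and not Triglav's C implementation: the helper names `is_snp`, `snp_list`, and `merge_snps` are hypothetical, only the first five tab-separated VCF columns are read, and the merged result is kept in memory instead of being written to a file.

```python
BASES = {"A", "C", "G", "T"}


def is_snp(ref, alt):
    """True when REF and every ALT allele are single nucleotides."""
    return ref in BASES and all(allele in BASES for allele in alt.split(","))


def snp_list(vcf_lines):
    """Collect (chrom, pos, ref, alt) for each SNP record, skipping headers and indels."""
    snps = []
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        if is_snp(ref, alt):
            snps.append((chrom, int(pos), ref, alt))
    return snps


def merge_snps(samples):
    """Merge per-sample SNP lists, recording which samples carry each variant."""
    merged = {}
    for name, lines in samples.items():
        for key in snp_list(lines):
            merged.setdefault(key, []).append(name)
    return dict(sorted(merged.items()))


# Hypothetical two-sample pool; the second record of sample_1 is a deletion.
samples = {
    "sample_1": ["##fileformat=VCFv4.2", "chr1\t100\t.\tA\tG", "chr1\t200\t.\tAT\tA"],
    "sample_2": ["chr1\t100\t.\tA\tG", "chr2\t50\t.\tC\tT"],
}

merged = merge_snps(samples)
print(merged)
```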

Conferences

Great Lakes Bioinformatics (GLBIO) Conference, University of Illinois at Chicago, Chicago, IL --- Spring 2017

Oral presentation: "geneXtendeR: R/Bioconductor package for functional annotation of histone modification ChIP-seq data in a 3D genome world"

Fourth International Congress on Alcoholism and Stress: A Framework for Future Treatment Strategies, Volterra, Italy --- Spring 2017

Meeting abstract: "Epigenetic enzymes as a novel class of targets for disease-modifying pharmacotherapies in alcohol addiction"

10th European Lisp Symposium, Brussels, Belgium --- Spring 2017

Invited keynote: "How the Strengths of Lisp-Family Languages Facilitate Building Complex and Flexible Bioinformatics Applications"

ISBRA ESBRA World Congress on Alcohol and Alcoholism, Berlin, Germany --- Fall 2016

Poster presentation: "Dependence-induced increase of alcohol self-administration and compulsive drinking mediated by the histone methyltransferase PRDM2"

Miami Winter Symposium, Hyatt Regency Miami, Miami, FL --- Winter 2016

Poster presentation: "Python and Apache Spark bioinformatics software program that sensitively detects the presence of snoRNAs and miRNAs within NGS samples"

Miami Winter Symposium, Hyatt Regency Miami, Miami, FL --- Winter 2015

Poster presentation: "Personalized computational epigenomics pipeline to investigate complex brain tissue using ChIP-seq"

Awards

National Defense Science & Engineering Graduate Fellowship --- 2014-2017

University of Miami Graduate Fellowship --- Fall 2013

Who's Who Among Students in American Universities & Colleges Award --- Spring 2016

Undergraduate Alumni

James R. Hennessy, B.Sc. --- Current position: Data Analyst I, The Genome Technology Center, Stanford University School of Medicine

    Years mentored: 2015-2017
    Role: undergraduate student researcher (University of Miami, Department of Mathematics)

Philosophy

Contact