Bohdan Khomtchouk, Ph.D.
Postdoctoral Research Scholar at Stanford University
Bioinformatician and computational biologist with a postdoctoral research appointment in the Department of Biology at Stanford University.
My official Stanford University profile can be accessed here.
My published works to date include a Molecular Psychiatry paper, a PLoS One paper, a BMC Bioinformatics paper, a Briefings in Bioinformatics paper, a Source Code for Biology and Medicine paper, a Journal of the American Heart Association paper, an R/Bioconductor package, and a Python/PyPI package. I have also published an organic chemistry textbook with CRC Press (Taylor & Francis). In addition, I have released several preprints, which are currently under peer review. These preprints span the fields of mathematical genetics and evolutionary biology (here, here, and here), databases and search engines (here), and nonlinear physics (here). As a bioinformatics software engineer and biostatistician, I am a big fan of the C and R programming languages (respectively), especially when I can combine them to do some low-level statistical programming (see geneXtendeR). Overall, I enjoy writing extremely efficient implementations of algorithms and data structures for solving computationally complex biological problems. I am always looking for yet another way to shave off yet another millisecond of runtime, even if it may not be the most practical use of my time!
Previously, I was a genetics doctoral candidate funded by the National Defense Science & Engineering Graduate (NDSEG) Fellowship, working in the laboratory of Prof. Dr. Claes Wahlestedt M.D., Ph.D. at the University of Miami Miller School of Medicine in Miami, FL, USA. I graduated with a Ph.D. in Computational Human Genetics and Genomics; my dissertation title is: Histone modification ChIP-seq algorithm engineering and high performance bioinformatics graphics and analysis software for computational epigenetics. My thesis committee chair was Prof. Dr. Nicholas Tsinoremas Ph.D., and my thesis committee members were Prof. Dr. Mitsunori Ogihara Ph.D., Prof. Dr. Eden Martin Ph.D., and Prof. Dr. Phil Harvey Ph.D. My biophysics advisor was Prof. Dr. Wolfgang Nonner M.D. and my bioinformatics advisor was Derek Van Booven, Senior Bioinformatics Analyst at the John P. Hussman Institute for Human Genomics. My academic appointments were in the Center for Therapeutic Innovation and the Department of Psychiatry and Behavioral Sciences.
Prior to graduate school, I was a triple major, graduating summa cum laude (B.Sc. Mathematics, B.Sc. Physics, B.Sc. Molecular Biology and Biochemistry) from Benedictine University in Lisle, IL, USA. I was also a visiting researcher at Ecole Polytechnique Fédérale de Lausanne (EPFL) -- see press release, certificate of completion, and acceptance letter. Finally, I was an inventor at the Einstein patent bureau (Swiss Federal Institute of Intellectual Property) -- see official invention disclosure form.
- Favorite language: R
- Favorite package: data.table
- Favorite function: data.table::fread()
- Favorite IDE: RStudio
- Favorite editor: Atom
- Favorite book: SICP
- Favorite bioinformatician: Brian Haas
- Favorite programmer: Matt Dowle
NIH biosketch
I am a bioinformatician, computational geneticist, and (epi-)genome researcher. I work across a broad array of next-generation sequencing data, including RNA-seq, small RNA-seq, methyl-seq, ATAC-seq, mass spectrometry, and transcription factor and histone modification ChIP-seq. My primary area of training and expertise is in the development of novel algorithms and computational pipelines for histone modification ChIP-seq data analysis -- the focus of my PhD dissertation. As part of my thesis, I made several important basic science contributions to the understanding of epigenetic mechanisms in alcohol addiction and cardiac ischemia, as well as several critical computational breakthroughs in the area of scientific visualization. In addition, I launched several large-scale software engineering products such as a high performance graphics engine for rendering ultra fast low memory interactive biological heatmaps. I have also built several large-scale mathematical and statistical big data models for understanding the dynamics of codon usage bias across taxa in the evolutionary tree of life. Currently, my research involves understanding the data science behind aging-related diseases as well as creating artificial intelligence and machine learning software to organize the world's biological information at a massive scale -- working at the interdisciplinary interface of big data, integrative bioinformatics, multi-omics, and statistical learning.
Survival Guide to Organic Chemistry: Bridging the Gap from General Chemistry
(CRC Press (Taylor & Francis), 2016, pp. 1-674)
Patrick E. McMahon, Bohdan B. Khomtchouk, Claes Wahlestedt
shinyheatmap: Ultra fast low memory heatmap web interface for big data genomics
(PLoS One, 2017, 12(5):e0176334. doi: 10.1371/journal.pone.0176334.)
Bohdan B. Khomtchouk, James R. Hennessy, Claes Wahlestedt
How the Strengths of Lisp-Family Languages Facilitate Building Complex and Flexible Bioinformatics Applications
(Briefings in Bioinformatics, 2016, doi: 10.1093/bib/bbw130, pp. 1-7)
Bohdan B. Khomtchouk, Edmund Weitz, Peter D. Karp, Claes Wahlestedt
Dependence-induced increase of alcohol self-administration and compulsive drinking mediated by the histone methyltransferase PRDM2
(Molecular Psychiatry, 2016, doi: 10.1038/mp.2016.131, pp. 1-13)
Estelle Barbier, Andrea Johnstone, Bohdan Khomtchouk, Jenica Tapocik, Caleb Pitcairn, Faazal Rehman, Eric Augier, Abbey Borich, Jesse Schank, Christopher Rienas, Derek Van Booven, Sun Hui, Daniel Natt, Claes Wahlestedt, Markus Heilig
Ischemic preconditioning confers epigenetic repression of Mtor and induction of autophagy through G9a-dependent H3K9 di-methylation
(Journal of the American Heart Association, 2016, 5:12, pp. 1-27)
Olof Gidlöf, Andrea Johnstone, Kerstin Bader, Bohdan Khomtchouk, Jiaqi O'Reilly, Derek Van Booven, Claes Wahlestedt, Bernhard Metzler, David Erlinge
HeatmapGenerator: high performance RNAseq and microarray visualization software suite to examine differential gene expression levels using an R and C++ hybrid computational pipeline
(Source Code for Biology and Medicine, 2014, 9:30, pp. 1-6)
Bohdan Khomtchouk, Derek Van Booven, Claes Wahlestedt
Histone Modification ChIP-seq Algorithm Engineering and High Performance Bioinformatics Graphics and Analysis Software for Computational Epigenetics
(Open Access Dissertations, 2017, pp. 1-106)
Bohdan B. Khomtchouk
A global perspective of codon usage
(bioRxiv (Cold Spring Harbor Laboratory), 2016, pp. 1-9)
Bohdan B. Khomtchouk, Claes Wahlestedt, Wolfgang Nonner
SUPERmerge: ChIP-seq coverage island analysis algorithm for broad histone marks
(bioRxiv (Cold Spring Harbor Laboratory), 2017, pp. 1-12)
Bohdan B. Khomtchouk, Derek Van Booven, Claes Wahlestedt
PubData: search engine for bioinformatics databases worldwide
(bioRxiv (Cold Spring Harbor Laboratory), 2016, pp. 1-8)
Bohdan B. Khomtchouk, Kasra A. Vand, Thor Wahlestedt, Kelly Khomtchouk, Mohammed K. Sayed, Claes Wahlestedt
geneXtendeR: optimized functional annotation of ChIP-seq data
(bioRxiv (Cold Spring Harbor Laboratory), 2017, pp. 1-7)
Bohdan B. Khomtchouk, Derek Van Booven, Claes Wahlestedt
Zipf's law emerges asymptotically during phase transitions in communicative systems
(arXiv (Cornell University Library), 2016, pp. 1-5)
Bohdan B. Khomtchouk, Claes Wahlestedt
The mathematics of the genetic code reveal that frequency degeneracy leads to exponential scaling in the DNA codon distribution of Homo sapiens
(arXiv (Cornell University Library), 2014, pp. 1-8)
Bohdan B. Khomtchouk
Abstracts
Epigenetic enzymes as a novel class of targets for disease-modifying pharmacotherapies in alcohol addiction
(Alcohol, 2017, 60:209)
Estelle Barbier, Andrea Johnstone, Bohdan Khomtchouk, Jenica Tapocik, Caleb Pitcairn, Faazal Rehman, Eric Augier, Abbey Borich, Jesse Schank, Christopher Rienas, Derek Van Booven, Sun Hui, Daniel Natt, Claes Wahlestedt, Markus Heilig
geneXtendeR (Bioconductor) (GitHub)
geneXtendeR is an R/Bioconductor package for histone modification ChIP-seq analysis. It is designed to optimally annotate a histone modification ChIP-seq peak input file with functionally important genomic features (e.g., genes associated with peaks) based on optimization calculations. geneXtendeR optimally extends the boundaries of every gene in a genome by some genomic distance (in DNA base pairs) for the purpose of flexibly incorporating cis-regulatory elements, such as enhancers and promoters, as well as downstream elements that are important to the function of the gene (relative to an epigenetic histone modification ChIP-seq dataset). geneXtendeR computes optimal gene extensions tailored to the broadness of the specific epigenetic mark (e.g., H3K9me1, H3K27me3), as determined by a user-supplied ChIP-seq peak input file. As such, geneXtendeR maximizes the signal-to-noise ratio of locating genes closest to and directly under peaks. By performing a computational expansion of this nature, ChIP-seq reads that would initially not map strictly to a specific gene can now be optimally mapped to the regulatory regions of the gene, thereby implicating the gene as a potential candidate, and thereby making the ChIP-seq experiment more successful. Such an approach becomes particularly important when working with epigenetic histone modifications that have inherently broad peaks.
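The gene-extension idea can be illustrated with a toy sketch in Python. This is not the geneXtendeR API or its optimization procedure: the `Gene` and `Peak` types, the `peaks_near_genes` helper, the symmetric extension on both sides of each gene, and the choice to ignore strand are all assumptions made for illustration. It only shows how the number of peaks assigned to genes changes as the extension grows.

```python
from typing import Dict, List, NamedTuple


class Gene(NamedTuple):
    name: str
    chrom: str
    start: int
    end: int


class Peak(NamedTuple):
    chrom: str
    start: int
    end: int


def peaks_near_genes(peaks: List[Peak], genes: List[Gene], extension: int) -> int:
    """Count peaks that overlap at least one gene after widening every
    gene by `extension` base pairs on both sides (strand is ignored)."""
    count = 0
    for p in peaks:
        for g in genes:
            same_chrom = p.chrom == g.chrom
            overlaps = p.start <= g.end + extension and p.end >= g.start - extension
            if same_chrom and overlaps:
                count += 1
                break  # count each peak once, even if several genes match
    return count


# Made-up coordinates, purely for illustration.
genes = [Gene("A", "chr1", 10_000, 12_000), Gene("B", "chr1", 30_000, 31_000)]
peaks = [
    Peak("chr1", 9_200, 9_400),    # 600 bp before gene A
    Peak("chr1", 11_000, 11_500),  # inside gene A
    Peak("chr1", 32_500, 32_900),  # 1,500 bp after gene B
    Peak("chr2", 100, 200),        # no gene on this chromosome
]

# Scan a few candidate extensions and record how many peaks land on a gene.
counts: Dict[int, int] = {
    ext: peaks_near_genes(peaks, genes, ext) for ext in (0, 500, 1000, 2000)
}
```

Comparing the counts across extensions gives a rough feel for how a broader mark benefits from a larger extension; the package itself bases the choice on its own optimization calculations.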
happybiRthday is an R package hosted on CRAN that calculates the upcoming birthday dates of GitHub repos. Software creation is a pretty big deal! A repository's initial commit date can be thought of as its birthday. Next time, drop in and wish a developer (any GitHub username) a happy birthday for their repo(s). Or maybe just toast to the upcoming anniversary of your own software! The software life cycle is too short not to celebrate!
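The date arithmetic behind a repo "birthday" can be sketched in a few lines of Python. This is only an illustration of the idea, not the happybiRthday interface: the function name `next_birthday` is hypothetical, the initial commit date is assumed to be already known, and moving a February 29 birthday to March 1 in non-leap years is an arbitrary choice.

```python
from datetime import date


def next_birthday(first_commit: date, today: date) -> date:
    """Return the next anniversary of `first_commit` on or after `today`."""

    def anniversary(year: int) -> date:
        try:
            return first_commit.replace(year=year)
        except ValueError:
            # February 29 does not exist in this year; use March 1 instead.
            return date(year, 3, 1)

    candidate = anniversary(today.year)
    if candidate < today:
        candidate = anniversary(today.year + 1)
    return candidate


# Made-up dates for illustration.
upcoming = next_birthday(date(2015, 6, 10), today=date(2017, 9, 1))
turning = upcoming.year - 2015  # age the repo reaches on that day
```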
FTPwalker is a network programming Python package for optimally traversing extremely large FTP directory trees. It constitutes the algorithmic heart of the PubData search engine. FTPwalker creates a dictionary formatted as a JSON file in the user’s home directory containing all the full paths as keys and the respective filenames as values. FTPwalker is designed with speed in mind by utilizing state-of-the-art high performance parallelism and concurrency algorithms to traverse huge directory trees through a remote network via an FTP connection. The resultant hash table (i.e., dictionary) supports fast lookup for any file in any biological database (or any remote database with an extremely large file system).
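As a rough illustration of the output format described above (full paths as keys, filenames as values, serialized as JSON), here is a minimal Python sketch. It is not FTPwalker's code or API: `walk_tree`, the `list_dir` callback, and the in-memory `FAKE_TREE` are hypothetical stand-ins, and the level-by-level thread pool is just one simple way to list several directories concurrently.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List, Tuple

# A lister takes a directory path and yields (name, is_directory) pairs.
Lister = Callable[[str], Iterable[Tuple[str, bool]]]


def walk_tree(list_dir: Lister, root: str = "/", workers: int = 8) -> Dict[str, str]:
    """Map every full file path under `root` to its filename."""
    index: Dict[str, str] = {}
    frontier: List[str] = [root]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while frontier:
            # List all directories at the current depth concurrently.
            listings = list(pool.map(lambda d: (d, list(list_dir(d))), frontier))
            frontier = []
            for directory, entries in listings:
                for name, is_dir in entries:
                    path = directory.rstrip("/") + "/" + name
                    if is_dir:
                        frontier.append(path)  # visit on the next pass
                    else:
                        index[path] = name
    return index


# In-memory stand-in for a remote server (made-up contents).
FAKE_TREE = {
    "/": [("pub", True), ("README", False)],
    "/pub": [("genomes", True), ("notes.txt", False)],
    "/pub/genomes": [("hg38.fa.gz", False)],
}


def fake_lister(path: str) -> List[Tuple[str, bool]]:
    return FAKE_TREE.get(path, [])


index = walk_tree(fake_lister)
index_json = json.dumps(index, indent=2, sort_keys=True)
```

Against a real server, a lister could be built on the standard library's `ftplib.FTP.mlsd`, which yields `(name, facts)` pairs; since a single FTP connection should not be shared across threads, each worker would likely need its own connection.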
Software projects (in progress)
biosemble is a Python natural language processing (NLP) software program written in highly-tuned Cython and optimized Python for assembling biological wordnets from structured and unstructured biological text. Structured text includes resources like biologically relevant dictionaries and encyclopedias, while unstructured text includes biologically relevant textbooks.
Biochat aims to provide an interactive workbench in which biological databases (e.g., Gene Expression Omnibus (GEO), miRBase, TCGA, Human Epigenome Atlas) learn to communicate with each other by matching and pairing data records across the biological data-verse (terabytes of publicly available data). Biochat's mission is to fundamentally transform how people perform biological data science by unifying it, going from thousands of scattered database silos (that act as data storage repositories) to one intelligent centralized framework (that acts as a living, breathing AI to integrate large-scale data), thereby opening doors to more biological breakthroughs based on existing publicly available data. Biochat is written in Common Lisp and operates by efficiently categorizing and pairing similar items (e.g., words that describe data records) into groups. It is basically the high performance computing (HPC) data science equivalent of the chemistry saying "like dissolves like." We apply the "like dissolves like" principle to teach data files to talk to each other (quite literally). In order to talk, datasets must first be able to find each other in space (not a trivial task, considering that there are dozens of bioinformatics databases out there... see how we've tackled this problem with PubData). So how, for example, is an RNA-seq dataset supposed to find its potentially related ChIP-seq dataset (e.g., according to some combination of similar cell type, histone mark, sequencing details, etc.)? Through metadata, of course! However, for the datasets to meet each other via a similar metadata footprint requires sophisticated NLP strategies to introduce them. Once the datasets meet, we can let the conversations (i.e., integrative bioinformatics analyses) begin! Hence the name: Biochat. Our ultimate goal is to make integrative multi-omics a lot easier (and more fun) through artificial intelligence (AI). Right now, we are barely scratching the surface with NLP. Thus, we are currently implementing novel neural network approaches to help us teach data to talk to each other (stay tuned!).
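To make the "like dissolves like" pairing concrete, here is a deliberately simple sketch in Python (Biochat itself is written in Common Lisp, and this is not its code). It pairs records by Jaccard similarity of their metadata words; the record IDs, the metadata strings, and the `tokens`, `jaccard`, and `best_match` helpers are all made up for illustration, and real metadata matching would need far more careful NLP.

```python
import re
from typing import Dict, Set, Tuple


def tokens(text: str) -> Set[str]:
    """Lowercase a metadata string and split it into alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Shared words divided by all distinct words (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_match(query: str, candidates: Dict[str, str]) -> Tuple[str, float]:
    """Return the (record id, score) of the candidate most similar to `query`."""
    q = tokens(query)
    scored = {rid: jaccard(q, tokens(meta)) for rid, meta in candidates.items()}
    best = max(scored, key=scored.get)
    return best, scored[best]


# Made-up metadata footprints.
rna_seq_meta = "RNA-seq mouse cortex neurons"
chip_seq_records = {
    "chipA": "ChIP-seq H3K27me3 mouse cortex neurons",
    "chipB": "ChIP-seq H3K9me1 human liver",
}

match_id, match_score = best_match(rna_seq_meta, chip_seq_records)
```

Here the RNA-seq record pairs with `chipA` because the two share species, tissue, and cell-type words, while `chipB` shares only the generic token "seq".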
PubData is a search engine and file retrieval system for all bioinformatics databases worldwide. PubData searches biomedical FTP data in a user-friendly fashion similar to how PubMed searches biomedical literature. PubData is hosted as both a web application and a standalone graphical user interface (GUI) software program, while PubMed is hosted as an online web server. PubData is built on novel network programming and natural language processing algorithms that can patch into the FTP servers of any user-specified bioinformatics database, query its contents, and retrieve files for download. PubData is written in the Python programming language (specifically, Django and PyQt4). PubData can remotely search, access, view, and retrieve files from the deeply nested directory trees of any major bioinformatics database via a local computer network. By assembling all major bioinformatics databases under the roof of one software program, PubData allows the user to avoid the unnecessary hassle and non-standardized complexities inherent to accessing databases one-by-one using an Internet browser. More importantly, it allows a user to query multiple databases simultaneously for user-specified keywords (e.g., human, cancer, transcriptome). As such, PubData allows researchers to search, access, view, and download files from the FTP servers of any major bioinformatics database directly from one centralized location. By using only a GUI or web application, PubData allows the user to simultaneously surf multiple bioinformatics FTP servers directly from the comfort of their local computer.
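A multi-database keyword query of the kind described above can be sketched as follows. This is an assumption-laden toy, not PubData's implementation: the database names, the paths, and the `search_all` function are hypothetical, each database is represented by a pre-built path-to-filename index, and a file matches only when every keyword appears in its path (case-insensitive).

```python
from typing import Dict, Iterable, List

# One pre-built index per database: full path -> filename (made-up entries).
INDEXES: Dict[str, Dict[str, str]] = {
    "dbA": {
        "/pub/human/cancer/transcriptome.tsv": "transcriptome.tsv",
        "/pub/mouse/liver.tsv": "liver.tsv",
    },
    "dbB": {
        "/data/Human_Cancer_exome.vcf": "Human_Cancer_exome.vcf",
        "/data/human_cancer_transcriptome_v2.txt": "human_cancer_transcriptome_v2.txt",
    },
}


def search_all(indexes: Dict[str, Dict[str, str]], keywords: Iterable[str]) -> Dict[str, List[str]]:
    """Return, per database, the sorted paths that contain every keyword."""
    wanted = [k.lower() for k in keywords]
    hits: Dict[str, List[str]] = {}
    for db_name, index in indexes.items():
        matches = sorted(
            path for path in index if all(k in path.lower() for k in wanted)
        )
        if matches:
            hits[db_name] = matches
    return hits


hits = search_all(INDEXES, ["human", "cancer", "transcriptome"])
```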
SUPERmerge is a ChIP-seq read pileup analysis and annotation algorithm for investigating alignment (BAM) files of diffuse histone modification ChIP-seq datasets with broad chromatin domains at a single base pair resolution level. SUPERmerge allows flexible regulation of a variety of read pileup parameters, thereby revealing how read islands aggregate into areas of coverage across the genome and what annotation features they map to within individual biological replicates. SUPERmerge is especially useful for investigating low sample size ChIP-seq experiments in which epigenetic histone modifications (e.g., H3K9me1, H3K27me3) result in inherently broad peaks with a diffuse range of signal enrichment spanning multiple consecutive genomic loci and annotated features.
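The notion of reads aggregating into coverage islands can be illustrated with a small interval-merging sketch in Python. This is not SUPERmerge's algorithm or parameter set: the `merge_islands` function and its `max_gap` parameter are hypothetical, reads are given as plain half-open `(start, end)` intervals on one chromosome rather than parsed from a BAM file, and depth thresholds and annotation are left out.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on a single chromosome


def merge_islands(reads: List[Interval], max_gap: int = 0) -> List[Interval]:
    """Merge read intervals into islands, joining neighbours whose gap
    (next start minus current island end) is at most `max_gap` base pairs."""
    islands: List[List[int]] = []
    for start, end in sorted(reads):
        if islands and start - islands[-1][1] <= max_gap:
            # Close enough to the current island: extend it.
            islands[-1][1] = max(islands[-1][1], end)
        else:
            islands.append([start, end])
    return [(s, e) for s, e in islands]


# Made-up read positions for illustration.
reads = [(100, 150), (140, 190), (400, 450), (460, 510), (1000, 1050)]

strict = merge_islands(reads, max_gap=0)     # only overlapping or touching reads merge
relaxed = merge_islands(reads, max_gap=50)   # small gaps are bridged
broad = merge_islands(reads, max_gap=500)    # one broad domain
```

Raising the gap tolerance turns scattered read islands into progressively broader domains, which is the kind of behavior one would want to tune for diffuse marks.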
Bay Area R useR Group (R Programming Language), Intuit Building 9 – Invention and Innovation, Mountain View, CA --- Fall 2017
Bay Area Lisp & Scheme Users Group, Hacker Dojo, Santa Clara, CA --- Fall 2017
Invited talk: "Biochat: organizing the world's biological information through AI"
Great Lakes Bioinformatics (GLBIO) Conference, University of Illinois at Chicago, Chicago, IL --- Spring 2017
Oral presentation: "geneXtendeR: R/Bioconductor package for functional annotation of histone modification ChIP-seq data in a 3D genome world"
Fourth International Congress on Alcoholism and Stress: A Framework for Future Treatment Strategies, Volterra, Italy --- Spring 2017
Meeting abstract: "Epigenetic enzymes as a novel class of targets for disease-modifying pharmacotherapies in alcohol addiction"
10th European Lisp Symposium, Brussels, Belgium --- Spring 2017
Invited keynote: "How the Strengths of Lisp-Family Languages Facilitate Building Complex and Flexible Bioinformatics Applications"
ISBRA ESBRA World Congress on Alcohol and Alcoholism, Berlin, Germany --- Fall 2016
Poster presentation: "Dependence-induced increase of alcohol self-administration and compulsive drinking mediated by the histone methyltransferase PRDM2"
Miami Winter Symposium, Hyatt Regency Miami, Miami, FL --- Winter 2016
Poster presentation: "Python and Apache Spark bioinformatics software program that sensitively detects the presence of snoRNAs and miRNAs within NGS samples"
Miami Winter Symposium, Hyatt Regency Miami, Miami, FL --- Winter 2015
Poster presentation: "Personalized computational epigenomics pipeline to investigate complex brain tissue using ChIP-seq"
Big Data with R, rstudio::conf 2018 – Manchester Grand Hyatt, San Diego, CA --- Winter 2018
2-day workshop: Learn how to use R with Hive, SQL Server, Oracle and other scalable external data sources along with Big Data clusters in this two-day workshop. We will cover how to connect, retrieve schema information, upload data, and explore data outside of R. For databases, we will focus on the dplyr, DBI and odbc packages. These packages enable us to use the same dplyr verbs inside R but are translated and sent as SQL queries. For Big Data clusters, we will also learn how to use the sparklyr package to run models inside Spark and return the results to R. We will review recommendations for connection settings, security best practices and deployment options. Throughout the workshop, we will take advantage of the new data connections available with the RStudio IDE.
StanfordR (Stanford R Group) (StanfordR)
pyStanford (Stanford Python Meetup) (pyStanford)
Stanford Biolisp (Stanford Common Lisp, Scheme, and Clojure Meetup) (Biolisp)
NIH/NIA Stanford Training Program in Aging Research (T32 AG0047126), National Institute on Aging of the National Institutes of Health --- 2017-2018
National Defense Science & Engineering Graduate Fellowship (32 CFR 168a), Department of Defense (Army Research Office, Biosciences Division) --- 2014-2017
UM Graduate Fellowship, University of Miami --- Fall 2013
Who's Who Among Students in American Universities & Colleges Award, Randall-Reilly Publishing --- Spring 2016
James R. Hennessy, B.Sc. --- Current position: Data Analyst I, The Genome Technology Center, Stanford University School of Medicine
- Years mentored: 2015-2017
- Role: undergraduate student researcher (University of Miami, Department of Mathematics)