Today’s project is something new to this site – some bioinformatics code! I feel a tad ridiculous that it’s taken me so many years to post code related to my field, but hey – better late than never. (That seems to be a recurring theme around here…)
First, a primer. If you don’t know what the title of this article refers to, here are some Wikipedia links to get you started:
Hidden Markov models (including some examples in Python)
Viterbi algorithm / Viterbi paths (including even more Python examples)
As a bonus, I’m including sections from my original write-up on this program (it began as a university project) in hopes that they’ll shed some light on how the code works and what its potential usefulness is.
The algorithm implementations are quite fast, and the visual feedback is an excellent addition (IMO). Full source code and a sample executable are provided, as well as a FASTA file with data from chromosome #1 in the human genome (bases 6000-12000, I believe).
(FYI, my original lab assignment link is here: http://dna.cs.byu.edu/bio465/Labs/hmm.shtml. It contains quite a bit of helpful information and links if you’re looking to write your own HMM implementation.)
INTRODUCTION
CpG islands are regions of DNA characterized by a large number of adjacent cytosine and guanine nucletoides linked by phosphodiester bonds. Additionally, CpG islands appear in some 70% of promoters of human genes (40% of mammalian genes). Unlike CpG sites in the coding region of a gene, in most instances the CpG sites in CpG islands are unmethylated if genes are expressed. This observation led to the speculation that methylation of CpG sites in the promoter of a gene may inhibit the expression of a gene. (Wikipedia, retrieved Feb 2007)
In this project, I was particularly concerned with generating an accurate Hidden Markov model to define the relationship between normal states (B) and island states (I) within a region of the human chromosome. Such an implementation is useful for gene finding, as CpG islands tend to appear near the promoters of important mammalian genes. The automation of this process is critical because such islands are impossible to locate by simply looking at a strand of DNA (they may cover several hundred bases and their C/G content may be only slightly higher than expected values).
PROCEDURE
I opted to code this project in the Microsoft Visual Basic 6.0 programming language. I believe this implementation is highly appropriate, as I can quickly generate both graphs for visualization and a simple, easy-to-modify GUI (since finding CpG islands can require a lot of trial-and-error using different parameters, and this is not conducive to a command-line program). VB6 is excellent for quickly assembling GUIs, and for algorithms of this nature its speed is comparable to that of a raw C++ implementation. (BTW, if you are a Linux user please note that this project works flawlessly under Wine 1.1.27, and likely earlier versions as well.)
I will refer to the following image as I walk through my implementation of a HMM.

Step 1: Loading a FASTA file
This routine isn’t of much interest, except to mention a programming quirk particular to VB6. A, C, G, and T entries from the FASTA file are reassigned into a byte array as the numbers 0, 1, 2, and 3. In VB6, number comparison and processing is generally faster than string comparison and processing. (To any VB experts reading this – please note that I am aware of mechanisms for overcoming this, but the fact of the matter is that I prefer working with byte arrays. :)
Step 2: HMM Parameters
As per the lab specifications, the user can change 14 HMM-related parameters. Additionally, I have chosen to add an option to let the program estimate the eight state probabilities. This is a trivial operation, but it seems to markedly increase the accuracy of the algorithm. The estimation of A, C, G and T’s being emitted in a B state is computed by the percentage of occurrences of each of these entries in the original FASTA data. I predict that given the low percentage of I states in a length of DNA, this method of estimating emission probabilities is likely quite accurate. Once those categories have been determined, the I state probabilities are computed by doubling the C and G probabilities and halving the A and T probabilities.
Step 3: The HMM and Viterbi algorithms
This step begins by moving all HMM parameters into look-up tables. The logs of these values are pre-calculated, allowing us to add them (instead of multiplying) for future HMM calculations.
Next, all HMM probabilities are calculated and stored.
This is followed by a 2nd loop backward through the data while calculating all data necessary for the Viterbi algorithm. (Note: this is not necessarily the most efficient way to do this, but the speed difference is negligible for sequences less than 100,000 bases long.)
After all Viterbi data has been pre-calculated, the traceback part of the Viterbi algorithm becomes extremely fast. (Another note: execution speed of this chunk of code is slightly reduced because I translate the states into a string and display that string in the textbox in the upper-right of the program window. This is mainly for convenience and/or to satisfy my curiosity.)
Step 4: The Sliding Window
This step is one where I felt a proper GUI was essential. I have chosen to display two graphs; the top graph displays raw counts of C and G occurrences as compared to the expected value (where 0.5 is used as the expected ratio). The bottom graph displays the ratio of I and B states, and it is this data that is used to recommend the location of any CpG islands.
A noteworthy shortcoming of this implementation is that it is designed to only find the most probable CpG island (instead of multiple islands). In a more formal implementation, returning any/all potential CpG island locations would be more useful.
(And for the record, this step is my favorite part of the program. I find it to be well-implemented and surprisingly graphically pleasing for a bioinformatics tool. :)
RESULTS
By my estimation, the program’s most accurate CpG prediction for this FASTA file is bases 2018-2045 (which corresponds to bases 6018-6045 in the original data, I believe). If you run the program yourself, you will find that most of its estimates fall around this range. Adjusting the sliding window doesn’t have tremendous effect on the estimates, although larger windows will result in larger CpG island predictions (not surprising). The percentage of I states varies dramatically; certain parameters result in levels as high as 30-40% while others are less than 5%. For my best estimate (above), the percentage of I states was 22%.
Finally, code execution is swift – typically a matter of seconds – and if the text display in the top-right is disabled the program runs more-or-less instantaneously on modern hardware.

Print This Article
Very good code, i understand the HMM and Viterbi from your code, not from many articles on the fancy articles :) tnx and respect !
Happy to help! :)