top of page

A concise introduction to Markov chain models

How do they work?


A Markov chain is a stochastic process that models a system that transitions from one state to another, where the probability of the next state only depends on the current state and not on the previous history. 

For example, assuming that X0 is the current state of a system or process, the probability of a state, X1 , depends only on X0 which is of course the current state of the system as stated.

P(X1) = f(P(X0)) 

It may be hard to think of any real-life processes that follow this behaviour because there is the belief that all events happen in a sequence because of each other. Here are some examples: 

Games e.g. chess - If your king is in a certain spot on a chess board, there will be a maximum of 4 transition states that can be achieved that all depend on the initial position of chess piece. The parameters for the Markov model will obviously vary depending on your position on the board which is the essence of the Markov process. 

Genetics - The genetic code of an organism can be modelled as a Markov chain, where each nucleotide (A, C, G, or T) is a state, and the probability of the next nucleotide depends only on the current one.

Text generation - Consider the current state as the most recent word. The transition states would be all possible words which could follow on from said word. Next word prediction algorithms can utilize a first-order Markov process to predict the next word in a sentence based on the most recent word.

The text generation example is particularly interesting because only considering the previous word when trying to predict the next word sentence would lead to a very random sentence. That is where we can change things up using various mathematical techniques.

k-Order Markov Chains (adding more steps)

In a first-order Markov chain, we only consider the immediately preceding state to predict the next state. However, in k-order Markov chains, we broaden our perspective. Here’s how it works:

Definition: a k-order Markov chain considers the previous states (or steps) when predicting the next state. It’s like looking further back in time to inform our predictions.

Example: suppose we’re modelling the weather. In a first-order Markov chain, we’d only look at today’s weather to predict tomorrow’s weather. But in a second-order Markov chain, we’d consider both today’s and yesterday’s weather. Similarly, a third-order Markov chain would involve three days of historical data.

By incorporating more context, k-order chains can capture longer-term dependencies and patterns. As k increases, the model becomes more complex, and we need more data to estimate transition probabilities accurately.

See diagram below for a definition of higher order Markov chains.

Markov chains for Natural Language Processing

A Markov chain can generate text by using a dictionary of words as the states, and the frequency of words in a corpus of text as the transition probabilities. Given an input word, such as "How", the Markov chain can generate the next word, such as "to", by sampling from the probability distribution of words that follow "How" in the corpus. Then, the Markov chain can generate the next word, such as "use", by sampling from the probability distribution of words that follow "to" in the corpus. This process can be repeated until a desired length or end of sentence is reached. 

That is a basic example and for more complex NLP tasks we can employ more complex Markov models such as k-order, variable, n-gram or even hidden Markov models.

Limitations of Markov models

Markov models for tasks such as text generation will struggle because they are too simplistic to create text that is intelligent and sometimes even coherent. Here are some reasons why:

Fixed Transition Probabilities: Markov models assume that transition probabilities are constant throughout. In reality, language is dynamic, and context can change rapidly. Fixed probabilities may not capture these nuances effectively.

Local Dependencies: Markov chains have local dependencies, meaning they only consider a

limited context (e.g., the previous word). They don’t capture long-range dependencies or global context.

Limited Context Window: Markov models have a fixed context window (e.g., first-order, second order, etc.). If the context extends beyond this window, the model won’t capture it. 

Sparse Data: Markov models rely on observed data (transition frequencies) from the training corpus. If certain word combinations are rare or absent, the model struggles to estimate accurate probabilities. 

Lack of Learning: Markov models don’t learn from gradients or backpropagation. They’re based solely on observed statistics. 

Written by Temi Abbass


1.  “Improving the Markov Chain Approach for Generating Text Used for…”: This work focuses on text generation using Markov chains. It highlights the chance based transition process and the representation of temporal patterns determined by probability over sample observations.  

2. “Synthetic Text Generation for Sentiment Analysis”: This paper discusses text generation using latent Dirichlet allocation (LDA) and a text generator based on Markov chain models. It explores approaches for generating synthetic text for sentiment analysis

3.  “A Systematic Review of Hidden Markov Models and Their Applications”: This review paper provides insights into HMMs, a statistical model designed using a Markov process with hidden states. It discusses their applications in various fields, including robotics, finance, social science, and ecological time series data analysis.

Project Gallery

bottom of page