Sinem's Blog

Understanding Morgan Fingerprints

When I started learning cheminformatics, I struggled with a basic question: how do computers compare molecules? Then I discovered Morgan Fingerprints, and it all made sense.

The Problem

You can’t store entire molecular structures in a database and compare millions of them — it’s too slow. You need a compact representation that captures what makes a molecule unique.

What Are Morgan Fingerprints?

A Morgan Fingerprint converts a molecule into a binary vector — a string of 1s and 0s.

For example:

Molecule A: 10110101101...
Molecule B: 10110101100...

Each bit represents a specific chemical feature found in the molecule:

  • Bit 1: Contains a carbon-carbon double bond
  • Bit 5: Has a nitrogen atom at distance 2 from a carbon
  • Bit 8: Contains an aromatic ring
  • And so on…

Molecules with similar structures have similar bit patterns. Compare them quickly using just these vectors instead of full chemical structures.

How It Works: The Circular Algorithm

Morgan Fingerprints use circular neighborhoods around each atom:

  1. Start at each atom
  2. Expand outward — examine atoms 1 bond away, 2 bonds away, 3 bonds away, etc. (controlled by radius)
  3. Hash each neighborhood — convert into a unique number
  4. Create the fingerprint — combine all hashes into bits

Radius matters: Radius 2 is the standard choice — it balances capturing enough molecular detail with computational efficiency. Radius 3-4 captures broader patterns but increases computational burden. For large-scale searches, radius 2 is usually the sweet spot.

Why This Matters for Drug Discovery

  • Speed: Compare fingerprints instead of full structures—billions of comparisons become feasible
  • Similarity Search: Find drug-like compounds by searching for similar fingerprints
  • Predict Properties: Similar fingerprints often mean similar biological activity
  • Navigate Chemical Space: Intelligently search trillions of possible molecules

Code to Generate

About RDKit: RDKit is the open-source standard for cheminformatics in Python. Nearly every drug discovery company, research lab, and cheminformatics project uses it. It handles everything from parsing molecules to computing fingerprints to predicting molecular properties. It’s free, well-maintained, and industry-standard.

from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles("CCO")  # ethanol

# Generate a Morgan fingerprint with radius 2
gen = AllChem.GetMorganGenerator(radius=2)
fingerprint = gen.GetFingerprint(mol)

# View as bit vector
print(fingerprint.ToBinary())  # 10110101101...

What’s happening:

  1. Chem.MolFromSmiles("CCO") — Convert a molecule into RDKit’s format using SMILES notation. CCO is ethanol.

  2. AllChem.GetMorganGenerator(radius=2) — Create a Morgan Fingerprint generator with radius 2 (how many bonds out from each atom to examine).

  3. gen.GetFingerprint(mol) — Generate the actual fingerprint. RDKit runs the circular algorithm: examines each atom, expands outward to bonds 1 and 2 away, hashes neighborhoods, and builds the binary vector.

  4. fingerprint.ToBinary() — Convert to a readable bit string like 10110101101.... Each 1 means that chemical feature exists; 0 means it doesn’t.

The beauty: RDKit abstracts away the complexity. You specify the molecule and radius — it handles all the hashing and neighborhood detection. That’s why fingerprints are so practical in real work.

Why This Changed Everything For Me

Morgan Fingerprints showed me that chemistry becomes computable. You don’t need to memorize molecular properties — you can measure them mathematically.

That’s the real power: turning “these molecules seem similar” into something a computer can verify in microseconds.


The bottom line: Morgan Fingerprints are the bridge between chemistry and computation. They’re used in every major drug discovery pipeline because they work.