SmartsMatch
smarts.bio's proprietary AI-powered protein similarity search. It uses protein language model embeddings to find structurally and functionally related proteins. It is faster than BLAST and can detect remote homologs that sequence-alignment methods miss.
How SmartsMatch works
Your query protein sequence is encoded into a high-dimensional vector using a protein language model. That vector is compared against the SmartsBio AI index — a pre-embedded snapshot of UniProtKB/Swiss-Prot, PDB chains, and selected AlphaFold models — using approximate nearest-neighbour (ANN) search.
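The comparison step can be illustrated with a minimal sketch. It assumes embeddings are plain lists of floats and uses an exact brute-force scan, not the ANN search the service uses; the function names, the three-dimensional toy vectors, and the accessions are hypothetical and are not part of the SmartsBio API.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # Treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def top_hits(query_vec, index, top_k=10, threshold=0.7):
    """Exact scan over a {accession: vector} index (illustrative, not ANN)."""
    scored = [(acc, cosine_similarity(query_vec, vec)) for acc, vec in index.items()]
    kept = [(acc, score) for acc, score in scored if score >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:top_k]


# Toy index with made-up 3-dimensional vectors; real embeddings are high-dimensional.
toy_index = {
    "hit_a": [1.0, 0.0, 0.0],
    "hit_b": [0.9, 0.1, 0.0],
    "hit_c": [0.0, 1.0, 0.0],
}
hits = top_hits([1.0, 0.0, 0.0], toy_index, top_k=2, threshold=0.7)
for accession, score in hits:
    print(f"{accession} sim={score:.3f}")
```

An exact scan like this grows linearly with index size, which is presumably why an approximate method is used for an index of this scale: it trades a small amount of recall for much lower latency.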
smartsmatch_protein (tools scope)
Find the most similar proteins in the SmartsBio AI index for a given query sequence. Returns ranked hits with similarity scores, UniProt/PDB accessions, and functional annotations.
Parameters (* marks a required parameter)
| Parameter | Type | Description |
|---|---|---|
| sequence * | string | Amino-acid sequence in single-letter IUPAC code. Min 10 residues, max 2 048 residues. |
| top_k | integer | Number of top hits to return (default 10, max 100). |
| threshold | float | Minimum similarity score 0–1 (default 0.7). Lower values return more distant hits. |
| search_mode | string | sequence — prioritise sequence similarity; functional — prioritise functional analogs (may have low sequence identity). Default: sequence. |
| index | string | Which index to search: all (default), swissprot, pdb, or alphafold. |
| include_annotations | boolean | If true, each hit includes GO terms, EC numbers, and subcellular localisation from UniProt (default true). |
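The limits in the table can be checked on the client before a request is sent. The helper below is a sketch, not part of the SDK: `build_input` is a hypothetical name, the lower bound of 1 for top_k is an assumption (the table only states the maximum), and restricting input to the 20 standard residues is also an assumption, since the table does not say whether ambiguity codes such as X or B are accepted.

```python
# Assumption: only the 20 standard residues are accepted.
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")
SEARCH_MODES = {"sequence", "functional"}
INDEXES = {"all", "swissprot", "pdb", "alphafold"}


def build_input(sequence, top_k=10, threshold=0.7, search_mode="sequence",
                index="all", include_annotations=True):
    """Validate parameters against the documented limits and return the input dict."""
    seq = "".join(sequence.split()).upper()  # Drop whitespace and line breaks
    if not 10 <= len(seq) <= 2048:
        raise ValueError(f"sequence length {len(seq)} is outside 10-2048 residues")
    unknown = set(seq) - STANDARD_AMINO_ACIDS
    if unknown:
        raise ValueError(f"unexpected residue codes: {sorted(unknown)}")
    if not 1 <= top_k <= 100:  # Lower bound of 1 is an assumption
        raise ValueError("top_k must be between 1 and 100")
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0 and 1")
    if search_mode not in SEARCH_MODES:
        raise ValueError(f"search_mode must be one of {sorted(SEARCH_MODES)}")
    if index not in INDEXES:
        raise ValueError(f"index must be one of {sorted(INDEXES)}")
    return {
        "sequence": seq,
        "top_k": top_k,
        "threshold": threshold,
        "search_mode": search_mode,
        "index": index,
        "include_annotations": include_annotations,
    }


payload = build_input("MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSY", threshold=0.75)
print(payload["top_k"], payload["index"])
```

The returned dictionary has the same shape as the input argument of the tool call, so it could be passed through unchanged.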
Response fields (per hit)
| Field | Description |
|---|---|
| accession | UniProt accession or PDB chain ID (e.g. P38398 or 1JM7_A) |
| name | Protein name from UniProt or PDB |
| organism | Source organism scientific name |
| similarity_score | Cosine similarity 0–1 between query and hit embeddings |
| sequence_identity | Pairwise sequence identity % (computed post-retrieval) |
| source | One of swissprot, trembl, pdb, or alphafold |
| go_terms | List of GO term objects with id, name, namespace |
| ec_numbers | EC number(s) for enzyme hits (e.g. 3.1.1.1) |
from smartsbio import SmartsBio

client = SmartsBio(api_key="sk_live_...")

result = client.tools.run(
    tool_id="smartsmatch_protein",
    input={
        "sequence": "MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSY...",
        "top_k": 10,
        "threshold": 0.75,
        "search_mode": "sequence",
        "index": "all",
    },
)

for hit in result["hits"]:
    print(
        f"{hit['accession']:<12} "
        f"sim={hit['similarity_score']:.3f} "
        f"id={hit['sequence_identity']:.1f}% "
        f"{hit['name'][:50]}"
    )

SmartsMatch vs BLAST
Use SmartsMatch when you need speed, remote homolog detection, or functional analog discovery. Fall back to BLAST (ncbi_blast) when you need strict statistical E-values or coverage of the full NCBI nt/nr databases.
| | SmartsMatch | BLAST (ncbi_blast) |
|---|---|---|
| Speed | < 1 second | 10 s – 10 min |
| Remote homolog detection (<25% identity) | Excellent | Poor–Fair |
| Functional analog search | Supported (search_mode=functional) | Not supported |
| Statistical E-values | Not provided | Yes |
| Database coverage | 250 M proteins (UniProt + PDB + AlphaFold) | NCBI nt/nr/refseq (trillions of bases) |
| Nucleotide search | Protein only | blastn, blastx, tblastn, tblastx |
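The guidance in this comparison can be condensed into a small routing helper. This is only a sketch of the decision rule described above: `choose_tool` and its arguments are hypothetical, and using ncbi_blast as a tool ID is an assumption based on how it is named in this section.

```python
def choose_tool(query_type, needs_evalues=False, needs_ncbi_coverage=False):
    """Return a tool ID following the SmartsMatch vs BLAST guidance.

    query_type is "protein" or "nucleotide" (hypothetical labels).
    """
    if query_type not in {"protein", "nucleotide"}:
        raise ValueError("query_type must be 'protein' or 'nucleotide'")
    # SmartsMatch is protein only and reports no E-values.
    if query_type == "nucleotide" or needs_evalues or needs_ncbi_coverage:
        return "ncbi_blast"  # Assumed tool ID
    return "smartsmatch_protein"


print(choose_tool("protein"))
print(choose_tool("protein", needs_evalues=True))
```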
Use cases
Find remote homologs across kingdoms
When sequence identity is too low for BLAST to find anything, SmartsMatch can still detect evolutionarily distant relatives that share the same fold.
result = client.tools.run(
    tool_id="smartsmatch_protein",
    input={
        "sequence": my_novel_sequence,
        "top_k": 20,
        "threshold": 0.6,  # Lower threshold for distant homologs
        "search_mode": "sequence",
        "index": "swissprot",  # Curated hits only
    },
)

# Filter to hits with low sequence identity (BLAST zone-of-darkness)
remote = [h for h in result["hits"] if h["sequence_identity"] < 25]
print(f"Found {len(remote)} remote homologs below 25% identity:")
for h in remote:
    print(f"  {h['accession']}  sim={h['similarity_score']:.3f}  {h['organism']}")

Discover functional analogs (convergent evolution)
Use search_mode: functional to find proteins that perform the same biochemical function through a different fold — e.g., alternative proteases or convergent ATP-binding domains.
result = client.tools.run(
    tool_id="smartsmatch_protein",
    input={
        "sequence": serine_protease_sequence,
        "top_k": 30,
        "threshold": 0.65,
        "search_mode": "functional",  # Functional analogs, not just homologs
    },
)

# Explore EC numbers to confirm convergent function
ec_counts: dict[str, int] = {}
for h in result["hits"]:
    for ec in h.get("ec_numbers", []):
        ec_counts[ec] = ec_counts.get(ec, 0) + 1

print("Most common EC numbers among hits:")
for ec, count in sorted(ec_counts.items(), key=lambda x: -x[1])[:5]:
    print(f"  {ec}: {count} hits")

Let the agent pick SmartsMatch automatically
When you use the Query endpoint, the agent will automatically use SmartsMatch for protein similarity tasks and combine it with structural (PDB/AlphaFold) and functional (UniProt/InterPro) data sources.
response = client.query.run(
    "Find proteins similar to this sequence and tell me what they do: MTEYKLVVVGAGGVGK...",
)

# The agent uses SmartsMatch internally, then enriches hits
# with UniProt annotations and PDB structures automatically.
print(response.answer)