What Is SMILES Notation? The Basics of Representing Molecules as Strings

In areas where chemistry and computer science intersect, a "language" for handling molecular structures on computers is essential. One of the most widely used examples is SMILES notation.

SMILES stands for Simplified Molecular Input Line Entry System. It was proposed by David Weininger in the late 1980s as a notation for representing molecular structures as single-line strings. Today, it is widely used in chemical databases, cheminformatics, AI drug discovery, molecular generation models, and related fields.

For example, ethanol has the condensed formula CH3CH2OH, but in SMILES it can be written as follows.

CCO

This means that carbon, carbon, and oxygen are connected in sequence by single bonds, with the hydrogens left implicit. The strength of SMILES is that it is relatively readable for humans and easy for computers to process. Formats such as MOL files and SDF can store detailed structural information, but when listing large numbers of molecules or using them as input for machine learning models, the compactness of SMILES is a major advantage.

1. Basic SMILES Syntax

SMILES represents molecules by combining atoms, bonds, branches, ring structures, and other elements. The basic symbols are as follows.

Element	Notation	Example	Meaning
Atom	Element symbol	`C`, `O`, `N`	Carbon, oxygen, nitrogen
Single bond	Usually omitted	`CC`	Single bond between carbons
Double bond	`=`	`C=C`	Ethylene
Triple bond	`#`	`C#C`	Acetylene
Branch	`()`	`CC(C)C`	Isobutane
Ring structure	Numbers	`C1CCCCC1`	Cyclohexane
Aromatic atom	Lowercase	`c1ccccc1`	Benzene

Atoms commonly used in organic chemistry (B, C, N, O, P, S, F, Cl, Br, and I), often called the organic subset, can be written without square brackets. For these atoms, hydrogens are added implicitly until the atom reaches its standard valence.

Examples are shown below.

Molecule	SMILES	Notes
Methane	`C`	Four hydrogens are implicitly attached to carbon
Ethane	`CC`	Two carbons connected by a single bond
Ethanol	`CCO`	Carbon, carbon, and oxygen connected in a straight chain
Acetic acid	`CC(=O)O`	Contains a carbonyl group and a hydroxy group
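
The implicit-hydrogen rule can be sketched in a few lines. The following is a minimal illustration, assuming the normal valences listed in the OpenSMILES specification; the names `NORMAL_VALENCES` and `implicit_hydrogens` are made up for this example and are not part of any toolkit.

```python
# Normal valences for unbracketed (organic subset) atoms.
# Assumption: values follow the OpenSMILES specification.
NORMAL_VALENCES = {
    "B": (3,), "C": (4,), "N": (3, 5), "O": (2,), "P": (3, 5),
    "S": (2, 4, 6), "F": (1,), "Cl": (1,), "Br": (1,), "I": (1,),
}

def implicit_hydrogens(symbol: str, bond_order_sum: int) -> int:
    """Implicit H count for an unbracketed atom with the given total bond order."""
    for valence in NORMAL_VALENCES[symbol]:
        # Use the lowest normal valence that can accommodate the explicit bonds.
        if valence >= bond_order_sum:
            return valence - bond_order_sum
    return 0  # explicit bonds already reach or exceed the highest normal valence

# Acetic acid, CC(=O)O: methyl C, carbonyl C, carbonyl O, hydroxy O
for symbol, bonds in [("C", 1), ("C", 4), ("O", 2), ("O", 1)]:
    print(symbol, bonds, implicit_hydrogens(symbol, bonds))
```

Real toolkits apply additional rules (for example for aromatic atoms), so this should be read as an illustration of the idea rather than a complete valence model.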

Square brackets are used when you need to explicitly specify charge, isotope, metal atoms, or unusual valence.

Sodium ion: [Na+]
Iron(II) ion: [Fe+2]
Ammonium ion: [NH4+]
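
To show what a bracket atom carries, here is a small sketch that pulls isotope, element symbol, hydrogen count, and charge out of a single bracket atom with a regular expression. The function name `parse_bracket_atom` and the returned dictionary layout are assumptions made for this example, and the pattern covers only the common fields (no atom classes).

```python
import re

BRACKET_ATOM = re.compile(
    r"\[(?P<isotope>\d+)?"
    r"(?P<symbol>[A-Z][a-z]?|[a-z][a-z]?)"
    r"(?P<chiral>@@?)?"
    r"(?P<hcount>H\d*)?"
    r"(?P<charge>\+\d+|-\d+|\++|-+)?\]"
)

def parse_bracket_atom(text: str) -> dict:
    """Split one bracket atom such as [NH4+] into its fields."""
    m = BRACKET_ATOM.fullmatch(text)
    if m is None:
        raise ValueError(f"not a supported bracket atom: {text!r}")
    h = m.group("hcount")
    c = m.group("charge")
    if c is None:
        charge = 0
    else:
        sign = 1 if c[0] == "+" else -1
        digits = c.lstrip("+-")
        # "+2" means charge 2; "++" also means charge 2
        charge = sign * (int(digits) if digits else len(c))
    return {
        "isotope": int(m.group("isotope")) if m.group("isotope") else None,
        "symbol": m.group("symbol"),
        "hcount": 0 if h is None else int(h[1:] or 1),
        "charge": charge,
    }

for atom in ["[Na+]", "[Fe+2]", "[NH4+]"]:
    print(atom, parse_bracket_atom(atom))
```

Note that inside brackets hydrogens are never implicit: `[Na+]` has no hydrogens, and the four hydrogens of ammonium must be written out as `H4`.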

2. Branches, Rings, and Aromaticity

Branches are represented with parentheses. For example, isobutane can be written as follows.

CC(C)C

This indicates that a methyl group, written inside the parentheses, branches from the second carbon of the main chain.

In ring structures, the same number is placed at two positions to indicate where the ring closes.

C1CCCCC1

This represents cyclohexane. The 1 attached to the first and last carbon indicates that those two atoms are bonded to form a ring.
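
The branch and ring-closure rules above can be made concrete with a toy parser. The sketch below handles only a small subset (unbracketed aliphatic atoms, the bond symbols `-`, `=`, `#`, parentheses, and single-digit ring closures); `parse_smiles` is a name chosen for this example, not a function from an existing library.

```python
import re

# Subset only: unbracketed aliphatic atoms, -, =, #, branches, single-digit rings.
TOKEN = re.compile(r"Cl|Br|[BCNOPSFI]|[-=#()]|\d")
BOND_ORDER = {"-": 1, "=": 2, "#": 3}

def parse_smiles(smiles):
    """Return (atoms, bonds); bonds are (i, j, order) tuples over atom indices."""
    atoms, bonds = [], []
    branch_stack = []   # atoms to return to when ")" closes a branch
    open_rings = {}     # ring digit -> (atom index, bond order given at opening)
    prev = None         # index of the atom the next atom will bond to
    pending = None      # bond order set by an explicit bond symbol
    pos = 0
    while pos < len(smiles):
        match = TOKEN.match(smiles, pos)
        if match is None:
            raise ValueError(f"unsupported character at position {pos}")
        tok, pos = match.group(), match.end()
        if tok == "(":
            branch_stack.append(prev)
        elif tok == ")":
            if not branch_stack:
                raise ValueError("unmatched ')'")
            prev = branch_stack.pop()
        elif tok in BOND_ORDER:
            pending = BOND_ORDER[tok]
        elif tok.isdigit():
            if prev is None:
                raise ValueError("ring closure before any atom")
            if tok in open_rings:   # second occurrence: close the ring
                start, order = open_rings.pop(tok)
                bonds.append((start, prev, pending or order or 1))
            else:                   # first occurrence: remember the atom
                open_rings[tok] = (prev, pending)
            pending = None
        else:                       # an atom symbol
            atoms.append(tok)
            if prev is not None:
                bonds.append((prev, len(atoms) - 1, pending or 1))
            prev, pending = len(atoms) - 1, None
    if branch_stack or open_rings:
        raise ValueError("unclosed branch or ring")
    return atoms, bonds

print(parse_smiles("CC(C)C"))    # isobutane
print(parse_smiles("C1CCCCC1"))  # cyclohexane
```

A branch is just a stack push and pop, and a ring closure is a lookup of the atom that first carried the same digit. Production parsers also handle bracket atoms, aromaticity, stereochemistry, and two-digit ring closures, all of which this sketch deliberately leaves out.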

Benzene can be written in Kekulé form as follows.

C1=CC=CC=C1

When aromaticity is expressed explicitly, lowercase c is used.

c1ccccc1

However, aromaticity perception is not uniquely defined. Toolkits such as RDKit, Open Babel, and Daylight use different aromaticity models and normalization rules, so the same structure may be written with lowercase or uppercase atoms depending on the tool. For this reason, when using SMILES in research or data processing, it is important to record the tool and version used.

3. Representing Stereochemistry

SMILES can represent not only atom connectivity but also some stereochemical information.

Stereochemical information	Symbol	Meaning
Chiral center	`@`, `@@`	Local orientation of a tetrahedral center
Geometric isomerism	`/`, `\`	Relative arrangement around a double bond

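For chiral centers, @ or @@ is written inside a bracket atom, as in [C@H] or [C@@H]. Looking from the first neighbor toward the center, @ means the remaining neighbors appear counterclockwise in the order they are written, and @@ means clockwise. These symbols do not directly mean R or S configuration. They describe local stereochemistry based on the order in which neighboring atoms appear in the SMILES string, so CIP priority rules must be applied separately to determine R/S configuration.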

Geometric isomerism around double bonds is represented using / and \.

Cl/C=C/Cl
Cl/C=C\Cl

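These strings distinguish the two isomers of 1,2-dichloroethylene: the first is the trans (E) isomer and the second is the cis (Z) isomer. However, / and \ only have meaning relative to each other and to the double bond they flank, so the configuration cannot be determined from a single symbol in isolation.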
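
For the simplest case, a string of the form X/C=C/Y with one written substituent on each side, the relationship between the two slashes can be checked directly: matching directions give the trans arrangement and opposite directions give cis. The sketch below covers only that pattern; `double_bond_geometry` is a hypothetical helper, and general SMILES needs a full parser.

```python
import re

# Only handles strings shaped like  X/C=C/Y  or  X/C=C\Y  (and the \ variants).
SIMPLE_ALKENE = re.compile(
    r"(?P<left>[A-Za-z]+)(?P<d1>[/\\])C=C(?P<d2>[/\\])(?P<right>[A-Za-z]+)"
)

def double_bond_geometry(smiles: str) -> str:
    """Return 'trans' or 'cis' for a simple X/C=C/Y style string."""
    m = SIMPLE_ALKENE.fullmatch(smiles)
    if m is None:
        raise ValueError("only strings of the form X/C=C/Y are handled")
    # Same direction on both sides: substituents on opposite sides (trans).
    return "trans" if m["d1"] == m["d2"] else "cis"

print(double_bond_geometry("Cl/C=C/Cl"))
print(double_bond_geometry("Cl/C=C\\Cl"))  # backslash escaped in Python source
```

When pasting such strings into Python source, remember that the backslash must be escaped or the string written as a raw literal.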

4. Why Canonical SMILES Matters

An important point about SMILES is that the same molecule can have multiple valid representations.

For example, ethanol can be represented in either of the following ways.

CCO
OCC

Both represent the same molecule, but the strings are different. Because a plain string comparison treats them as two separate entries, this causes problems in database search and duplicate removal.

This is where Canonical SMILES is used.

Item	Description
Purpose	Make it easier to treat the same molecule as the same string
Method	Rank atoms in the molecular graph and generate a representative SMILES
Benefit	Useful for duplicate removal, search, and data organization
Caution	Output depends on the software implementation

Canonical SMILES is very useful, but no single canonicalization algorithm is standardized across tools. RDKit, Daylight, Open Babel, and other toolkits may output different canonical strings for the same molecule, so canonical SMILES should only be compared when they were generated by the same software. Therefore, when using it in research, it is best to state which software and version were used for normalization.
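
The idea of picking one representative string per molecule can be illustrated with a deliberately tiny toy. The sketch below is not a real canonicalization algorithm: it only accepts unbranched, acyclic chains of organic-subset atoms, where the only alternative spelling within that subset is the reversed chain, and it keeps whichever spelling sorts first. The names `toy_canonical` and `deduplicate` are invented for this example.

```python
import re

_ATOM = r"(?:Cl|Br|[BCNOPSFI])"
_TOKEN = re.compile(r"Cl|Br|[BCNOPSFI]|[=#]")
# Toy subset: atom, then any number of (optional bond symbol + atom).
_CHAIN = re.compile(rf"{_ATOM}(?:[=#]?{_ATOM})*")

def toy_canonical(smiles: str) -> str:
    """Pick a representative spelling for a linear chain (toy rule)."""
    if _CHAIN.fullmatch(smiles) is None:
        raise ValueError(f"outside the toy subset: {smiles!r}")
    tokens = _TOKEN.findall(smiles)
    reversed_form = "".join(reversed(tokens))
    # Both spellings describe the same chain; keep the one that sorts first.
    return min(smiles, reversed_form)

def deduplicate(smiles_list):
    """Keep one entry per molecule, in order of first appearance."""
    seen = {}
    for s in smiles_list:
        seen.setdefault(toy_canonical(s), s)
    return list(seen)

print(deduplicate(["CCO", "OCC", "CC=O", "O=CC"]))
```

Real canonicalizers rank atoms in the molecular graph so that branches, rings, and stereochemistry are also handled, which is where implementations start to differ.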

5. Differences from InChI and SELFIES

SMILES is not the only way to represent molecules as strings. Two well-known alternatives are InChI and SELFIES.

Notation	Main use	Strengths	Cautions
SMILES	Structure description, search, machine learning	Short and relatively readable	Invalid strings can be generated
InChI	Compound identification, standardization	Powerful as an identifier	Hard for humans to read
SELFIES	Molecular generation, machine learning	Designed so valid molecules are easier to generate	Strings tend to be longer

InChI (International Chemical Identifier) is a chemical substance identifier developed under the leadership of IUPAC. It describes atoms, bonds, hydrogens, charges, stereochemistry, isotopes, and other information in a layered structure, making it suitable for standard compound identification. On the other hand, it is not suitable for intuitive human reading.

SELFIES (SELF-referencIng Embedded Strings) is a notation proposed with machine learning, especially molecular generation models, in mind. With SMILES, AI systems may generate syntactically invalid strings or chemically inappropriate molecules. SELFIES is designed so that, in principle, any SELFIES string corresponds to a valid molecule.
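
To make "syntactically invalid" concrete, the following sketch checks two necessary conditions for a well-formed SMILES string: parentheses must balance and every ring-closure digit must appear an even number of times. It is a rough filter written for this article (`has_balanced_syntax` is a hypothetical name) and says nothing about chemical validity such as valence.

```python
def has_balanced_syntax(smiles: str) -> bool:
    """Necessary (not sufficient) syntax check: branches and ring closures pair up."""
    depth = 0            # current nesting level of parentheses
    open_rings = set()   # ring digits seen an odd number of times so far
    in_bracket = False   # inside [...] digits mean isotope, H count, or charge
    for ch in smiles:
        if in_bracket:
            if ch == "]":
                in_bracket = False
            continue
        if ch == "[":
            in_bracket = True
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False   # ")" without a matching "("
        elif ch.isdigit():
            open_rings ^= {ch}  # toggle: open on first sight, close on second
    return depth == 0 and not open_rings and not in_bracket

for s in ["c1ccccc1", "c1ccccc", "CC(C", "[NH4+]"]:
    print(s, has_balanced_syntax(s))
```

A generative model trained on SMILES has to learn these long-range pairings on its own, which is one reason invalid outputs occur.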

6. Applications in AI Drug Discovery and Machine Learning

Because SMILES is a string representation, techniques developed in natural language processing can be applied to chemistry. By training models such as RNNs, VAEs, and Transformers on large collections of SMILES, models learn patterns of "chemically plausible strings."
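
Before a sequence model can be trained, each SMILES string has to be split into tokens and mapped to integer ids. The sketch below shows one possible regex-based tokenizer; the pattern and the helper names `tokenize`, `build_vocab`, and `encode` are assumptions for illustration, not the preprocessing of any specific published model.

```python
import re

SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]"          # bracket atom kept as one token, e.g. [NH4+]
    r"|Br|Cl"              # two-letter organic-subset atoms
    r"|%\d{2}"             # two-digit ring closure such as %12
    r"|[BCNOPSFIbcnops]"   # one-letter aliphatic and aromatic atoms
    r"|[=#\-/\\().:~*>]"   # bonds, stereo marks, branches, separators
    r"|\d"                 # single-digit ring closure
)

def tokenize(smiles: str) -> list:
    """Split a SMILES string into chemically meaningful tokens."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"could not tokenize {smiles!r}")
    return tokens

def build_vocab(corpus) -> dict:
    """Assign an integer id to every token, in order of first appearance."""
    vocab = {}
    for smiles in corpus:
        for tok in tokenize(smiles):
            vocab.setdefault(tok, len(vocab))
    return vocab

def encode(smiles: str, vocab: dict) -> list:
    """Convert a SMILES string into a list of token ids."""
    return [vocab[tok] for tok in tokenize(smiles)]

vocab = build_vocab(["CCO", "CC(=O)O", "c1ccccc1Cl"])
print(tokenize("c1ccccc1Cl"))
print(encode("CC(=O)O", vocab))
```

Tokenizing at the atom level rather than the character level keeps `Cl` from being read as a carbon followed by an unknown symbol, and keeps bracket atoms such as `[NH4+]` intact.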

Main applications include the following.

Application area	Description
Molecular generation	Generate new molecular candidates from existing compounds
Property prediction	Predict solubility, membrane permeability, toxicity, and other properties
Activity prediction	Predict binding to target proteins and pharmacological activity
Retrosynthetic analysis	Predict synthetic routes and precursors from a target molecule

When handling chemical reactions, Reaction SMILES is used. The basic format is as follows.

reactants>agents>products

When the agent field is not used, it is written as follows.

reactants>>products

In retrosynthetic analysis, precursors and reaction routes for synthesizing a target molecule are predicted from the SMILES of that target. This can also be treated as a problem of "translating" a product string into reactant strings.
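
The reactants>agents>products layout can be split mechanically, which is how reaction datasets are usually turned into source and target sequences. The sketch below assumes that molecules within each field are separated by `.`, the SMILES symbol for disconnected components; `parse_reaction_smiles` is a name made up for this example.

```python
def parse_reaction_smiles(rxn: str) -> dict:
    """Split a Reaction SMILES string into its three fields."""
    parts = rxn.split(">")
    if len(parts) != 3:
        raise ValueError("expected exactly two '>' separators")

    def molecules(field: str) -> list:
        # "." separates individual molecules; an empty field yields an empty list
        return [m for m in field.split(".") if m]

    reactants, agents, products = (molecules(p) for p in parts)
    return {"reactants": reactants, "agents": agents, "products": products}

# Acid-catalyzed esterification of acetic acid with ethanol
print(parse_reaction_smiles("CC(=O)O.OCC>[H+]>CC(=O)OCC.O"))
# Same reaction without an agent field
print(parse_reaction_smiles("CC(=O)O.OCC>>CC(=O)OCC.O"))
```

For a retrosynthesis model framed as translation, the `products` list would form the input sequence and the `reactants` list the output sequence.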

7. MolSketch Can Automatically Generate Structures from SMILES

In research and development, SMILES is important not only as a "string for storage" but also as an input format for quickly calling up structural formulas.

In MolSketch, entering a SMILES string automatically generates the corresponding chemical structure. If you already have SMILES from a paper, database, internal document, or another tool, you can use it directly as a starting point for creating a structure.

For example, the following workflow is possible.

Copy a SMILES string you already have
Paste it into the MolSketch input field
Automatically generate the structure
Edit, save, or export it as needed

In this way, MolSketch connects "chemical information as a string" with "chemical structures as diagrams." Instead of drawing everything manually from scratch, it is often more efficient to start from SMILES and adjust only the necessary parts.

It is especially useful for the following cases.

Quickly turning a compound found in a paper or database into a diagram
Making small edits based on an existing molecule
Creating structural formula images for presentations or reports

Understanding SMILES lets you use it not merely as text input, but as an entry point for converting chemical information directly into editable structural formulas.

8. Points to Keep in Mind When Using SMILES

SMILES is convenient, but it is not universal. In particular, the following points require attention.

Multiple SMILES can exist for the same molecule.
Canonical SMILES depends on software implementation.
Aromaticity perception differs between tools.
If stereochemistry is omitted, isomers may not be distinguishable.
3D coordinates and conformational information are usually not included.
Results can vary depending on how salts, solvents, protonation states, and tautomers are handled.
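
As an illustration of why salt handling changes results, here is one simple policy: split the SMILES on `.` and keep the fragment with the most atoms. This is only one of several possible conventions and is not the behavior of any particular toolkit; `largest_fragment` is a hypothetical helper, and its atom count is a rough regex-based estimate.

```python
import re

# Rough atom counter: bracket atoms, two-letter halogens, one-letter atoms.
ATOM = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOPSFIbcnops]")

def largest_fragment(smiles: str) -> str:
    """Keep the dot-separated fragment with the most atoms (first one on ties)."""
    fragments = smiles.split(".")
    return max(fragments, key=lambda frag: len(ATOM.findall(frag)))

# Sodium acetate: the counterion is dropped, the charged acetate is kept as written
print(largest_fragment("CC(=O)[O-].[Na+]"))
```

Note that the result still carries the negative charge. Whether to neutralize it, keep the counterion, or choose a tautomer is a separate decision, and two datasets prepared with different choices will not match string for string.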

SMILES is mainly a format for representing molecular graphs and some stereochemical information. When protein binding modes, molecular conformations, solvent effects, and similar 3D information are important, formats such as SDF, MOL2, and PDB, as well as 3D structure generation methods, need to be used together.

Summary

SMILES notation is a concise and powerful chemical language for representing molecular structures as single-line strings. Because it can express atoms, bonds, branches, ring structures, aromaticity, and stereochemistry in short strings, it is widely used in chemical databases, cheminformatics, AI drug discovery, molecular generation models, and other fields.

At the same time, SMILES has the following limitations.

Limitation	Notes
Multiple representations	The same molecule can be represented by multiple strings
Implementation dependence	Canonical SMILES and aromaticity perception differ between tools
Omitted stereochemistry	Isomers may not be distinguishable
Lack of 3D information	Coordinates and conformations are usually not included

In modern data-driven chemistry, SMILES is a basic interface that allows computers to read and write molecules. Just as chemists draw structural formulas, AI learns molecules through SMILES and explores new candidate compounds. Understanding SMILES will become an increasingly important foundation for researchers involved in chemistry, life sciences, materials science, and AI drug discovery.


References

Weininger, D. "SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules." Journal of Chemical Information and Computer Sciences, 28(1), 31–36, 1988.
Weininger, D., Weininger, A., Weininger, J. L. "SMILES. 2. Algorithm for generation of unique SMILES notation." Journal of Chemical Information and Computer Sciences, 29(2), 97–101, 1989.
Heller, S. R. et al. "InChI, the IUPAC International Chemical Identifier." Journal of Cheminformatics, 7, 23, 2015.
Krenn, M. et al. "SELFIES: a robust representation of semantically constrained graphs with an example in chemistry." Machine Learning: Science and Technology, 1(4), 045024, 2020.
Gómez-Bombarelli, R. et al. "Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules." ACS Central Science, 4(2), 268–276, 2018.