## Saturday, January 13, 2018

### Number of possible fragments for the connectivity-based hierarchy scheme

Faithful readers of this blog (hi, mom!) will know that we have been working with the connectivity-based hierarchy (CBH) approach for a while (paper almost done). The method works by breaking molecules up into fragments and truncating with hydrogens. In the CBH-1 scheme you fragment into bonds (so propane would be fragmented into 2 ethane molecules) and in the CBH-2 scheme you include all bonds to an atom with 2 or more bonds (butane would be fragmented into 2 propane molecules).

I started wondering how many different fragments we would need to cover most organic molecules wth the CBH-2 scheme so I wrote some code (shown below) to find out and the number turns out to be 15,670 neutral molecules using ["C","N","O","F","Si","P","S","Cl","Br","I"]

This number also includes CBH-1 fragments because you need them in the CBH-2 scheme. There are a few special cases missing such as isocyanide and there aren't any rings such as cyclopropane, since these are not made until you get higher up in the CBH hierarchy.  Also, there are some very weird molecules that you'll probably never see as a functional group in an organic molecule.

The code considers all possible combinations (so it runs for a long time) and then uses RDKit to figure out if it's a reasonable molecule.

As mentioned the code only generates neutral molecules, so the actual number of fragments needed will be higher.

## Monday, January 1, 2018

### Planned papers for 2018

A year ago I thought I'd probably publish seven papers in 2017:

Accepted
1. Protein structure refinement using a quantum mechanics-based chemical shielding predictor
2. Prediction of pKa values for drug-like molecules using semiempirical quantum chemical methods

Probable
3. Intermolecular Interactions in the Condensed Phase: Evaluation of Semi-empirical Quantum Mechanical Methods
4. Fast Prediction of the Regioselectivity of Electrophilic Aromatic Substitution Reactions of Heteroaromatic Systems Using Semi-Empirical Quantum Chemical Methods
5. Benchmarking cost vs. accuracy for computation of NMR shielding constants by quantum mechanical methods
6. Improved prediction of chemical shifts using machine learning
7. PM6 for all elements in GAMESS, including PCM interface

As you can see from the links the end result will be three papers, because paper 4, while accepted in 2017, will be officially published in 2018. In addition I also published this short preprint and a proposal.

Here's the plan for 2018

Accepted

Probable
2. Random Versus Systematic Errors in Reaction Enthalpies Computed using Semi-empirical and Minimal Basis Set Methods
3. Improving Solvation Energy Predictions using the SMD Solvation Method and Semi-empirical Electronic Structure Methods

Maybe
4. Towards a barrier height benchmark set for biologically relevant systems - part 2
5. pKaSQM: Automated Prediction of pKa Values for Druglike Molecules Using Semiempirical Quantum Chemical Methods
6. Prediction of CH pKa values

Paper 2 is essentially done and I've blogged about it here and here. I thought Paper 3 was also essentially done, but we discovered that the PM6/SMD geometry optimization has some problems with X-H bond breaking so we have to go back and look at that.  I am still quite confident we will get this out in 2018.

Paper 4: much of the work is done but we haven't started in the manuscript. We have collected 11 new systems to include in the database and we plan to improve the accuracy of the benchmark values using the approach in paper 2.

Paper 5: As I wrote in July: "This paper is 2/3 written and presents a completely automated PM3-based pKa prediction protocol. The method works quite well, but most outliers turn out to be due to high energy conformations. The main remaining issue is to find a conformer-search protocol that consistently gives low-energy conformations. Depending on how much time I have to devote to paper 4 and the proposal mentioned below, I am still hopefull I can get this published this year." You can read more about it here, here, and here.

This was put on the back burner to focus on other things, but we have started to look at the conformation issue again with my new PhD student Mads. It also looks like the accuracy can be improved by using the new PM3/SMD radii from paper 3. My main issue with QM-based pKa prediction is that I am not sure we can ever beat the accuracy of the empirical predictors.

Paper 6: I think it might be more fruitful to focus on CH pKa values since they are of interest to synthetic chemists and there is currently no other predictor that I know of. I have written some prototype code and I am in the process of creating a test set from Bordwell's data as a side project, but the fate of this project is really in the hands of the Danish Science Foundation. I'll know more in the Spring.

Work in progress
I have been working on some prototype codes (here, here, and here) aimed at high throughput screening of barrier heights and Mads will be building on this in 2018.

## Thursday, December 28, 2017

### Predicting pKa values using PM3 - conformer search part 2

Disclaimer: These are preliminary results and may contain errors
In a previous post a looked at the effect of conformational search on the accuracy of pKa predictions.  This is a follow up from a slightly different angle using data obtained by Mads.

The plot shows
$$\Delta \Delta G(n) = G_{0}(n) - G_{+}(n) - (G_{0}(\min) - G_{+}(\min))$$
where $G_{0}(n)$ is the minimum free energy value found among $n$ conformations of neutral acebutolol generated by RDKit and optimized using either PM3/COSMO (blue) or PM3/SMD (red). $G_{+}(n)$ is the corresponding value for protonated acebutolol.  $G_{0}(n)$ (and $G_{+}(n)$) are found using $n = 11, 21, 31, ..., 201$ starting conformations and $G_{0}(\min)$ is the lowest value of  found for $G_{0}(n)$ and similarly for $G_{+}(\min)$.

So $G_{0}(\min) - G_{+}(\min)$ is our best estimate of the correct $\Delta G$ value and $\Delta \Delta G(n)$ is the deviation from the best estimate for a give value of $n$.

Keeping in mind that a 1.4 kcal/mol error corresponds to a 1 pH unit error in pKa, we are clearly no where near convergence. The fact that we observe this using two different programs (MOPAC and GAMESS) indicates that the problem is probably not the optimizer.

The next plot shows $\Delta G_X(n) =G_X(n)-G_X(\min)$. All errors are above 1 kcal/mol for $n<50$. The error for COSMO neutral does not dip below 0.5 kcal/mol until $n =$ 181, compared to $n=$ 101 for SMD neutral.

The RDKit starting geometries are not pre-minimized using MMFF. The above plot shows the corresponding plot for MMFF optimized in the gas phase.  I think the much better convergence is due to the optimization being done in the gas phase.  It's interesting that, again, the neutral is slower to converge, and only does so for $n >$ 140.

I think the next step is do redo this analysis with focus on finding the lowest energy conformer rather than the accuracy of the pKa values.