1. Introduction
SISSO (Sure Independence Screening and Sparsifying Operator), developed by Ouyang et al. (Phys. Rev. Mater. 2018, 2.8, 083802), is a machine learning-based feature selection algorithm that is increasingly used across various domains, including materials science, drug discovery, and computational chemistry. In these fields, feature selection plays a pivotal role in achieving efficient modeling and accurate prediction.
SISSO is designed to identify the most important features or variables in a dataset. It explores the mathematical dependence of target properties on a set of input features within the framework of compressed-sensing-based dimensionality reduction. In materials science and catalysis, SISSO aims to reduce the complexity of the feature space while retaining essential chemical information. This enhances the interpretability and transferability of machine learning models while maintaining robust predictive performance.
2. Algorithm
For a supervised learning task involving multiple features and a target property, the SISSO method works in several steps. First, the SISSO algorithm creates a feature space by combining the input features using a given set of mathematical operators, namely Ĥ(m) ≡ {I, +, −, ×, ÷, log, exp, exp⁻, ⁻¹, ², ³, ⁶, √, | |, sin, cos}, where only physically meaningful combinations, such as those between features with the same unit, are retained (indicated by the (m) notation). Applying these operators generates a wide range of non-linear expressions from the given features, forming an extensive candidate space for further analysis.
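The feature-construction step can be sketched as follows. This is a minimal illustration and not the SISSO implementation: the toy values, the function name expand_once, and the reduced operator subset are assumptions, and unit checking is omitted because all three toy features are assumed to share one unit.

```python
import itertools
import math

# Hypothetical toy data: three primary features (assumed to share one unit)
# for four samples. The values are made up for illustration.
primary = {
    "x": [0.80, 0.92, 0.69, 0.58],
    "y": [0.28, 0.40, 0.92, 0.82],
    "z": [0.11, 0.46, 0.61, 0.29],
}

# A small subset of the operator set, chosen for illustration only.
UNARY = {
    "exp": math.exp,
    "^2": lambda v: v * v,
    "^-1": lambda v: 1.0 / v,
}
BINARY = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a / b,
}
COMMUTATIVE = {"+", "*"}


def expand_once(features):
    """Apply every operator once to the given features (one round of combination)."""
    new = dict(features)  # the identity operator I keeps the existing features
    for name, col in features.items():
        for op, fn in UNARY.items():
            new[f"{op}({name})"] = [fn(v) for v in col]
    for (na, a), (nb, b) in itertools.combinations(features.items(), 2):
        for op, fn in BINARY.items():
            new[f"({na}{op}{nb})"] = [fn(p, q) for p, q in zip(a, b)]
            if op not in COMMUTATIVE:
                # a-b and b-a (or a/b and b/a) are distinct candidates
                new[f"({nb}{op}{na})"] = [fn(q, p) for p, q in zip(a, b)]
    return new


candidates = expand_once(primary)
print(len(primary), "primary features ->", len(candidates), "candidates")
```

Feeding the result back into expand_once would build the next, much larger round of candidates, which is why the candidate space grows so quickly with complexity.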
Next, sure independence screening (SIS), a feature selection technique, ranks the candidate descriptors by the strength of their correlation with the target property. Through this process, the subset of descriptors that correlate most strongly with the target variable is selected.
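A minimal sketch of the screening idea, assuming the ranking score is the absolute Pearson correlation with the target. The function names pearson and sis and the toy data are hypothetical; the real code may differ in its scoring details.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a constant column carries no information
    return cov / math.sqrt(var_a * var_b)


def sis(candidates, target, nf_sis):
    """Return the names of the nf_sis candidates most correlated with the target."""
    ranked = sorted(
        candidates,
        key=lambda name: abs(pearson(candidates[name], target)),
        reverse=True,
    )
    return ranked[:nf_sis]


# Hypothetical candidate features and target values.
candidates = {
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [4.0, 1.0, 3.0, 2.0],
    "c": [-2.0, -4.1, -5.9, -8.0],
}
target = [2.0, 4.0, 6.0, 8.0]

top = sis(candidates, target, nf_sis=2)
print(top)
```

Note that the sign of the correlation does not matter for the ranking: the anti-correlated feature "c" is kept ahead of the weakly correlated "b".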
Subsequently, a Sparsifying Operator (SO) is applied to further promote sparsity in the descriptor space. The SO encourages most of the descriptor coefficients to be zero or close to zero, effectively reducing the dimensionality of the problem.
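One way to picture the sparsifying step is the 'L0' approach: try every combination of a fixed number of screened features, fit each by least squares, and keep the combination with the lowest RMSE. The sketch below assumes exactly that and is not the SISSO source; solve, fit_rmse, l0_select, and the toy data are hypothetical, and the tiny solver has no protection against collinear features.

```python
import itertools
import math


def solve(A, b):
    """Solve the small linear system A x = b (Gaussian elimination, partial pivoting)."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x


def fit_rmse(columns, target):
    """Least-squares fit target ~ c0 + sum(c_i * column_i); return (coefficients, RMSE)."""
    rows = [[1.0] + list(vals) for vals in zip(*columns)]
    k = len(rows[0])
    # Normal equations (A^T A) c = A^T y; adequate for a handful of columns.
    AtA = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    Aty = [sum(r[i] * t for r, t in zip(rows, target)) for i in range(k)]
    coef = solve(AtA, Aty)
    resid = [t - sum(c * v for c, v in zip(coef, r)) for r, t in zip(rows, target)]
    return coef, math.sqrt(sum(e * e for e in resid) / len(target))


def l0_select(subspace, target, dim):
    """Exhaustively test every dim-feature combination; keep the lowest-RMSE one."""
    best = None
    for names in itertools.combinations(subspace, dim):
        coef, rmse = fit_rmse([subspace[n] for n in names], target)
        if best is None or rmse < best[2]:
            best = (names, coef, rmse)
    return best


# Hypothetical screened features; the target is built as 1 + 2*f1 - 3*f3.
subspace = {
    "f1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "f2": [2.0, 1.0, 4.0, 3.0, 6.0],
    "f3": [1.0, 0.0, 2.0, 1.0, 0.0],
}
target = [0.0, 5.0, 1.0, 6.0, 11.0]

names, coef, rmse = l0_select(subspace, target, dim=2)
print(names, [round(c, 6) for c in coef], rmse)
```

Because the search is exhaustive, its cost grows combinatorially with the subspace size and the descriptor dimension, which is why the screening step that precedes it matters.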
The final step involves constructing a model using the selected descriptors and their corresponding target values. Machine learning algorithms, such as linear regression or support vector machines, can be employed to build the predictive model.
3. Advantages
In contrast to black-box machine learning methods like artificial neural networks, SISSO stands out by its ability to uncover mathematical mappings that can convey physical insights. This capability is of utmost importance for developing meaningful descriptors in various physical and chemical applications. Unlike other symbolic regression approaches such as genetic algorithms and random search, SISSO offers reduced bias since it conducts an exhaustive search of the solution space, evaluating all expressions within a certain complexity. Additionally, the resulting SISSO descriptors tend to have low complexity, enhancing their resilience to data noise. Particularly, when the desired formula involves only a few coefficients, SISSO may require a smaller training dataset, which is a significant advantage considering the challenges of acquiring large amounts of "big data."
4. Implementation
4.1. Installation
To acquire the latest version of SISSO, visit https://github.com/rouyang2017/SISSO and follow the installation instructions outlined in the manual. For Linux operating systems, unzip the package and navigate to the 'src' subdirectory. Then, execute one of the following commands:
mpiifort -fp-model precise var_global.f90 libsisso.f90 DI.f90 FC.f90 SISSO.f90 -o ~/bin/SISSO
or
mpiifort -O2 var_global.f90 libsisso.f90 DI.f90 FC.f90 SISSO.f90 -o ~/bin/SISSO
Note: the first command (-fp-model precise) gives better accuracy and run-to-run reproducibility of floating-point
calculations; the second (-O2) is about 2x faster, but tiny run-to-run variations may occur between
processors of different types, e.g. Intel and AMD.
If MPI-related errors occur during compilation, open the file 'var_global.f90' and
replace the line "use mpi" with "include 'mpif.h'". However, "use mpi" is strongly encouraged.
4.2. Data preparation
Prepare the training data in the following format. The first column holds the name of each material or system, the second column holds the target property, and the remaining columns hold the features. If you have multiple target properties, you need to create a separate SISSO job for each one. The provided bash code can assist in automatically generating multiple jobs from a regular .csv file for supervised machine learning.
Material | Target | Feature1 | Feature2 | Feature3 |
material1 | 2.9523636 | 0.796520614 | 0.284965064 | 0.105907601 |
material2 | 3.574732022 | 0.921702233 | 0.395324106 | 0.458804669 |
material3 | 3.205701264 | 0.800595223 | 0.818392742 | 0.246040896 |
material4 | 3.22204552 | 0.687809368 | 0.92417909 | 0.61173436 |
material5 | 2.572562089 | 0.580261273 | 0.81898682 | 0.288152336 |
material6 | 3.466886983 | 0.876347448 | 0.848937896 | 0.268953142 |
material7 | 3.647028173 | 0.990047837 | 0.347398189 | 0.28413612 |
material8 | 2.054711809 | 0.504345427 | 0.470434555 | 0.02066922 |
material9 | 1.548757666 | 0.27684717 | 0.740252464 | 0.1765122 |
material10 | 0.932156388 | 0.041605399 | 0.199852554 | 0.47616647 |
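As an illustration of splitting a multi-target .csv file into one input table per target, here is a short Python sketch. It assumes a whitespace-separated layout with a header row (please check the SISSO manual for the exact format your version expects); the function name split_targets, the column names, and the values are made up for the example.

```python
import csv
import io


def split_targets(csv_text, name_col, target_cols):
    """Build one whitespace-separated table per target: name, target, features."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    feature_cols = [c for c in rows[0] if c != name_col and c not in target_cols]
    jobs = {}
    for target in target_cols:
        header = [name_col, target] + feature_cols
        lines = [" ".join(header)]
        for row in rows:
            lines.append(" ".join(row[c] for c in header))
        jobs[target] = "\n".join(lines) + "\n"
    return jobs


# Hypothetical CSV with two target properties and two features.
example = """Material,Target1,Target2,Feature1,Feature2
material1,2.95,0.10,0.797,0.285
material2,3.57,0.20,0.922,0.395
"""

jobs = split_targets(example, "Material", ["Target1", "Target2"])
print(jobs["Target2"])
```

Each entry of jobs could then be written to the train.dat file of its own job directory, next to a copy of SISSO.in.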
4.3. Set keywords in SISSO.in
Similar to many modeling codes, you must specify a set of keywords in the input file, SISSO.in. Refer to the example below to correctly set the keywords:
!>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
! Below are the list of keywords for SISSO. Use exclamation mark,!,to comment out a line.
! The (R), (C) and (R&C) denotes the keyword to be used by regression, classification and both, respectively.
! More explanations on these keywords can be found in the SISSO_guide.pdf
! Users need to change the setting below according to your data and job.
!>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
ptype=1 !Property type 1: regression, 2:classification.
ntask=1 !(R&C) Multi-task learning (MTL) is invoked if >1.
task_weighting=1 !(R) MTL 1: no weighting (tasks treated equally), 2: weighted by the # of samples.
scmt=.false. !(R) Sign-Constrained MTL is invoked if .true.
desc_dim=3 !(R&C) Dimension of the descriptor, a hyperparameter.
nsample=10 !(R) Number of samples in train.dat. For MTL, set nsample=N1,N2,... for each task.
!nsample=(n1,n2,...) !(C) Number of samples. For MTL, set nsample=(n1,n2,...),(m1,m2,...),... for each task.
restart=0 !(R&C) 0: starts from scratch, 1: continues the job(progress in the file CONTINUE)
!>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
! Feature construction (FC) and sure independence screening (SIS)
! Implemented operators:(+)(-)(*)(/)(exp)(exp-)(^-1)(^2)(^3)(sqrt)(cbrt)(log)(|-|)(scd)(^6)(sin)(cos)
! scd: standard Cauchy distribution
!>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
nsf= 3 !(R&C) Number of scalar features provided in the file train.dat
ops='(+)(-)(*)(/)(exp)(^-1)(^2)(^3)' !(R&C) Please customize the operators from the list shown above.
fcomplexity=2 !(R&C) Maximal feature complexity (# of operators in a feature), integer usually 0 to 7.
funit=(1:3) !(1:2)(3:3) !(R&C) (n1:n2): features from n1 to n2 in the train.dat have same units
fmax_min=1e-3 !(R&C) The feature will be discarded if the max. abs. value in it is < fmax_min.
fmax_max=1e5 !(R&C) The feature will be discarded if the max. abs. value in it is > fmax_max.
nf_sis=5 !(R&C) Number of features in each of the SIS-selected subspace.
!>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
! Descriptor identification (DI) via sparse regression
!>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
method_so='L0' !(R&C) 'L0' or 'L1L0'(LASSO+L0). The 'L0' is recommended for both ptype=1 and 2.
nl1l0= 1 !(R) Only useful if method_so = 'L1L0', number of LASSO-selected features for the L0.
fit_intercept=.true. !(R) Fit to a nonzero (.true.) or zero (.false.) intercept for the linear model.
metric='RMSE' !(R) The metric for model selection in regression: RMSE or MaxAE (max absolute error)
nmodels=100 !(R&C) Number of the top-ranked models to output (see the folder 'models')
!isconvex=(1,1) !(C) Each data group constrained to be convex domain, 1: YES; 0: NO
!bwidth=0.001 !(C) Boundary tolerance for classification
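A mismatch between nsample or nsf and the actual data table is an easy mistake to make. The sketch below is a hypothetical pre-flight check, not part of SISSO: it assumes the "keyword=value !comment" layout shown above, handles only the single-task regression form of nsample, and the names parse_sisso_in and check_against_table are invented for the example.

```python
def parse_sisso_in(text):
    """Collect keyword=value pairs, ignoring everything after an exclamation mark."""
    settings = {}
    for line in text.splitlines():
        body = line.split("!", 1)[0].strip()  # drop comments and commented-out lines
        if "=" in body:
            key, value = body.split("=", 1)
            settings[key.strip()] = value.strip()
    return settings


def check_against_table(settings, n_rows, n_feature_cols):
    """Return a list of human-readable problems (empty if the counts agree)."""
    problems = []
    if int(settings["nsample"]) != n_rows:
        problems.append(f"nsample={settings['nsample']} but the table has {n_rows} rows")
    if int(settings["nsf"]) != n_feature_cols:
        problems.append(f"nsf={settings['nsf']} but the table has {n_feature_cols} features")
    return problems


# A shortened copy of the keywords above, used as sample input.
sample = """\
ptype=1 !Property type 1: regression, 2:classification.
desc_dim=3 !(R&C) Dimension of the descriptor
nsample=10 !(R) Number of samples in train.dat
!nsample=(n1,n2,...) !(C) Number of samples
nsf= 3 !(R&C) Number of scalar features provided in the file train.dat
funit=(1:3) !(1:2)(3:3) !(R&C) features with the same units
"""

settings = parse_sisso_in(sample)
print(check_against_table(settings, n_rows=10, n_feature_cols=3))
print(check_against_table(settings, n_rows=12, n_feature_cols=3))
```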
Based on my experience, I would like to share the recommended settings for certain keywords in SISSO:
i. ntask = n: In SISSO, multi-task learning involves dividing the dataset into n subsets (tasks). All tasks share the same descriptors, but each task gets its own coefficients. If ntask is not equal to 1, you should set nsample = N1, N2, N3, ..., Nn to specify the number of samples belonging to each subset.
ii. desc_dim = m: This keyword determines the dimension of the final descriptor. For example, a 1D descriptor (desc_dim = 1) predicts the target property as P = c0 + c1*D1, where P is the target property, D1 is a SISSO-suggested descriptor (a combination of features and mathematical operators), and c0 and c1 are the coefficients. Similarly, a 3D descriptor (desc_dim = 3) relates to P as P = c0 + c1*D1 + c2*D2 + c3*D3, where D2 and D3 are also SISSO-suggested descriptors, and c2 and c3 are the corresponding coefficients.
iii. fcomplexity (or rung for SISSO v.1.X): This parameter controls how far the input features are combined; the SISSO.in comment defines it as the maximal number of operators in a feature. When fcomplexity is set to 0, SISSO screens descriptors solely from the original input features. Increasing it, for example to 1, lets SISSO combine the original features using mathematical operators, giving a larger feature space to screen for the optimal descriptor. Setting fcomplexity = 2 or 3 is generally sufficient for most scientific problems. Higher values are usually unnecessary: they significantly increase run time and may lead to overfitting.
5. Example
Attached is an example (download here) comprising input and output files for SISSO v3.2. To submit a SISSO job, run SISSO with mpirun and redirect the output to a log file.
#!/bin/sh
#BSUB -q queue_name
#BSUB -n 24
#BSUB -o SISSO.log
#BSUB -e SISSO.err
mpirun -np 24 <your_directory>/SISSO > log
The SISSO.out file contains the best-performing 1D, 2D, ..., mD descriptors. As an example, the 3D descriptor is presented as follows:
P = 0.6135000008E+00 + 0.1326243282E+01 * [(x+(x+z))] + 0.1984883858E+00 * [(x*(x+y))] - 0.1077885067E+01 * [(z/exp(y))]
================================================================================
3D descriptor (model):
RMSE and MaxAE: 0.008646 0.015835
@@@descriptor:
1:[(x+(x+z))]
5:[(x*(x+y))]
7:[(z/exp(y))]
coefficients_001: 0.1326243282E+01 0.1984883858E+00 -0.1077885067E+01
Intercept_001: 0.6135000008E+00
RMSE,MaxAE_001: 0.8645999147E-02 0.1583524165E-01
================================================================================
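To sanity-check a reported model, you can evaluate it by hand: multiply each descriptor by its coefficient, sum the terms, and add the intercept. The sketch below does this for the 3D model above, assuming x, y, and z stand for Feature1, Feature2, and Feature3 of the training table (an assumption; the output does not state the mapping). Three rows of the table are used as examples.

```python
import math

# Values copied from the 3D model in SISSO.out shown above.
INTERCEPT = 0.6135000008
COEFFICIENTS = (1.326243282, 0.1984883858, -1.077885067)


def descriptors(x, y, z):
    """The three descriptors of the 3D model: (x+(x+z)), (x*(x+y)), (z/exp(y))."""
    return (x + (x + z), x * (x + y), z / math.exp(y))


def predict(x, y, z):
    """P = intercept + c1*D1 + c2*D2 + c3*D3."""
    return INTERCEPT + sum(c * d for c, d in zip(COEFFICIENTS, descriptors(x, y, z)))


# (name, target, Feature1, Feature2, Feature3) taken from the training table.
rows = [
    ("material1", 2.9523636, 0.796520614, 0.284965064, 0.105907601),
    ("material9", 1.548757666, 0.27684717, 0.740252464, 0.1765122),
    ("material10", 0.932156388, 0.041605399, 0.199852554, 0.47616647),
]

for name, target, x, y, z in rows:
    p = predict(x, y, z)
    print(f"{name}: target={target:.4f} predicted={p:.4f} abs. error={abs(p - target):.4f}")
```

The absolute errors for these rows should stay below the MaxAE reported in SISSO.out if the feature mapping assumed here is correct.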
Additionally, more detailed information is provided in the subdirectories:
models/
This folder contains the files describing the top n models/descriptors, where n is set by nmodels in SISSO.in.
desc_dat/
The folder “desc_dat” contains the data files of the identified descriptors shown in the SISSO.out.
SIS_subspaces/
There are n SIS-subspaces in a SISSO-nD calculation. These subspaces, including the
feature expressions and feature data, are kept in the folder “SIS_subspaces”.
Furthermore, the correlation between the real target property and the SISSO-predicted values generated by the 1D, 2D, and 3D descriptors is presented. It is evident that higher-dimensional descriptors tend to yield better prediction performance. However, it is important to note that excessively high-dimensional descriptors can lead to overfitting. To mitigate this, the use of test sets and cross-validation methods is recommended.
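A cross-validation loop around SISSO can be organized as sketched below. The helper names k_fold_indices and rmse_maxae are hypothetical, and the step that actually trains SISSO on each training split is left out, since it depends on your job setup.

```python
import math
import random


def rmse_maxae(actual, predicted):
    """Root-mean-square error and maximum absolute error of a set of predictions."""
    errors = [abs(a - p) for a, p in zip(actual, predicted)]
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return rmse, max(errors)


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train, test) index lists; every sample is in exactly one test fold."""
    order = list(range(n_samples))
    random.Random(seed).shuffle(order)  # fixed seed keeps the split reproducible
    folds = [order[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield sorted(train), sorted(test)


# Ten samples split into five folds of two held-out samples each.
for train, test in k_fold_indices(10, 5):
    print("train:", train, "test:", test)
```

For each split, the training rows would be written to train.dat (with nsample adjusted accordingly), SISSO run on them, and the resulting model evaluated on the held-out rows with rmse_maxae. Comparing the held-out errors across descriptor dimensions indicates where added dimensions stop helping.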