eem_processing
eempy.eem_processing
EEMDataset
Build an EEM dataset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `eem_stack` | `np.ndarray` | The 3D EEM stack, with shape (n_samples, n_ex_wavelengths, n_em_wavelengths). | required |
| `ex_range` | `np.ndarray` | A 1D NumPy array of the excitation wavelengths. | required |
| `em_range` | `np.ndarray` | A 1D NumPy array of the emission wavelengths. | required |
| `index` | `list or None` | Optional. The name used to label each sample. The number of elements in the list should equal the number of samples in the eem_stack (with the same sample order). | `None` |
| `ref` | `pd.DataFrame or None` | Optional. The reference data, e.g., the contaminant concentrations in each sample. It should have a length equal to the number of samples in the eem_stack. The index of each sample should be the name given in parameter "index". It is possible to have more than one column. NaN is allowed (for example, if contaminant concentrations in specific samples are unknown). | `None` |
| `cluster` | `list or None` | Optional. The classification of samples, e.g., the output of EEM clustering algorithms. The number of elements in the list should equal the number of samples in the eem_stack (with the same sample order). | `None` |
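The shape requirements above can be checked before building a dataset. The following is a minimal sketch using synthetic data; `check_eem_inputs` is a hypothetical helper written for illustration and is not part of eempy.

```python
import numpy as np

# Synthetic inputs with the shapes described in the parameter table.
n_samples, n_ex, n_em = 4, 5, 6
rng = np.random.default_rng(0)
eem_stack = rng.random((n_samples, n_ex, n_em))
ex_range = np.linspace(250, 450, n_ex)
em_range = np.linspace(300, 600, n_em)
index = [f"sample_{i}" for i in range(n_samples)]
cluster = [0, 0, 1, 1]


def check_eem_inputs(eem_stack, ex_range, em_range, index=None, cluster=None):
    """Hypothetical helper: True if all lengths agree with eem_stack."""
    n, n_ex, n_em = eem_stack.shape
    ok = len(ex_range) == n_ex and len(em_range) == n_em
    if index is not None:
        ok = ok and len(index) == n
    if cluster is not None:
        ok = ok and len(cluster) == n
    return bool(ok)


inputs_ok = check_eem_inputs(eem_stack, ex_range, em_range, index, cluster)
```

With eempy installed, these arrays would then be passed by keyword using the parameter names documented above (`eem_stack`, `ex_range`, `em_range`, `index`, `ref`, `cluster`).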
aqy
aqy(abs_stack, ex_range_abs, target_ex=None)
Calculate the apparent quantum yield (AQY).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `abs_stack` | `ndarray` | Absorbance spectra stack. | required |
| `ex_range_abs` | `ndarray` | Excitation wavelengths of the absorbance spectra. | required |
| `target_ex` | `float or None` | Excitation wavelength for AQY. If None is passed, all excitation wavelengths will be returned. | `None` |
Returns:
| Name | Type | Description |
|---|---|---|
| `aqy` | `DataFrame` | Apparent quantum yield (AQY). |
bix
bix()
Calculate the biological index (BIX).
Returns:
| Name | Type | Description |
|---|---|---|
| `bix` | `DataFrame` | BIX |
correlation
correlation(variables, fit_intercept=True)
Analyze the correlation between reference and fluorescence intensity at each pair of ex/em.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `variables` | `list` | List of variables (i.e., the headers of the reference table) to be fitted. | required |
| `fit_intercept` | `bool` | Whether to fit the intercept for linear regression. | `True` |
Returns:
| Name | Type | Description |
|---|---|---|
| `corr_dict` | `dict` | A dictionary containing multiple correlation evaluation metrics. |
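To illustrate what a correlation "at each pair of ex/em" looks like, here is a minimal sketch that computes a Pearson coefficient per pixel with NumPy. The function `pixelwise_pearson` is hypothetical; the metrics actually stored in `corr_dict` are not listed here, and this sketch does not handle NaN in the reference.

```python
import numpy as np


def pixelwise_pearson(eem_stack, reference):
    """Pearson r between the reference and the intensity at every (ex, em) pixel."""
    n_samples = eem_stack.shape[0]
    x = eem_stack.reshape(n_samples, -1)          # one column per pixel
    xc = x - x.mean(axis=0)                       # center each pixel
    yc = reference - reference.mean()             # center the reference
    denom = np.sqrt((xc ** 2).sum(axis=0) * (yc ** 2).sum())
    r = (xc * yc[:, None]).sum(axis=0) / denom
    return r.reshape(eem_stack.shape[1:])


# Synthetic data: every pixel scales linearly with the reference,
# except pixel (0, 0), which scales with the opposite sign.
reference = np.array([1.0, 2.0, 3.0, 4.0])
base = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
eem_stack = reference[:, None, None] * base
eem_stack[:, 0, 0] *= -1.0
r_map = pixelwise_pearson(eem_stack, reference)
```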
cutting
cutting(ex_min, ex_max, em_min, em_max, inplace=True)
Cut every EEM in the dataset to a new excitation/emission window.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `ex_min` | `float` | Lower bound of the excitation wavelength window to keep (nm). | required |
| `ex_max` | `float` | Upper bound of the excitation wavelength window to keep (nm). | required |
| `em_min` | `float` | Lower bound of the emission wavelength window to keep (nm). | required |
| `em_max` | `float` | Upper bound of the emission wavelength window to keep (nm). | required |
| `inplace` | `bool` | If True, overwrite | `True` |
Returns:
| Name | Type | Description |
|---|---|---|
| `eem_dataset_new` | `EEMDataset` | EEM dataset after cutting. The dataset's |
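Cutting to a wavelength window amounts to boolean selection along the excitation and emission axes. The sketch below assumes inclusive bounds and the (n_samples, n_ex, n_em) axis order; `cut_eem_stack` is a hypothetical stand-in, not the library's implementation.

```python
import numpy as np


def cut_eem_stack(eem_stack, ex_range, em_range, ex_min, ex_max, em_min, em_max):
    """Keep only the pixels inside the [ex_min, ex_max] x [em_min, em_max] window."""
    ex_keep = (ex_range >= ex_min) & (ex_range <= ex_max)
    em_keep = (em_range >= em_min) & (em_range <= em_max)
    cut = eem_stack[:, ex_keep, :][:, :, em_keep]
    return cut, ex_range[ex_keep], em_range[em_keep]


ex_range = np.arange(250, 451, 50)    # 250, 300, 350, 400, 450
em_range = np.arange(300, 601, 100)   # 300, 400, 500, 600
eem_stack = np.ones((2, ex_range.size, em_range.size))
cut, ex_new, em_new = cut_eem_stack(eem_stack, ex_range, em_range, 300, 400, 350, 550)
```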
fi
fi()
Compute the fluorescence index (FI) for each sample.
FI is computed as intensity(ex=370 nm, em=470 nm) divided by intensity(ex=370 nm, em=520 nm).
Returns:
| Name | Type | Description |
|---|---|---|
| `fi` | `DataFrame` | Fluorescence index values. Note: the current implementation labels the output column as "BIX" even though the values correspond to FI. |
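The FI ratio defined above can be reproduced with plain NumPy. This sketch assumes that the nearest grid point is used when 370, 470 or 520 nm is not exactly on the wavelength axis; `nearest` and `fluorescence_index` are hypothetical helpers.

```python
import numpy as np


def nearest(axis, target):
    """Index of the axis value closest to target."""
    return int(np.argmin(np.abs(axis - target)))


def fluorescence_index(eem_stack, ex_range, em_range):
    """FI = I(ex=370, em=470) / I(ex=370, em=520) for each sample."""
    i_ex = nearest(ex_range, 370)
    numerator = eem_stack[:, i_ex, nearest(em_range, 470)]
    denominator = eem_stack[:, i_ex, nearest(em_range, 520)]
    return numerator / denominator


ex_range = np.array([350.0, 370.0, 390.0])
em_range = np.array([450.0, 470.0, 520.0])
eem_stack = np.zeros((2, 3, 3))
eem_stack[:, 1, 1] = [3.0, 4.0]   # intensity at ex=370, em=470
eem_stack[:, 1, 2] = [2.0, 2.0]   # intensity at ex=370, em=520
fi_values = fluorescence_index(eem_stack, ex_range, em_range)
```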
filter_by_cluster
filter_by_cluster(cluster_names, inplace=True)
Select the samples that belong to the given cluster(s).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `cluster_names` | `int/float/str or list of int/float/str` | Names of the cluster(s) to keep. | required |
| `inplace` | `bool` | If True, overwrite the EEMDataset object. | `True` |
Returns:
| Name | Type | Description |
|---|---|---|
| `eem_dataset_new` | `EEMDataset` | The filtered EEM dataset. |
filter_by_index
filter_by_index(mandatory_keywords, optional_keywords, inplace=True)
Select the samples whose indexes contain the given keyword.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `mandatory_keywords` | `str or list of str` | Keywords for selecting samples whose indexes contain all the mandatory keywords. | required |
| `optional_keywords` | `str or list of str` | Keywords for selecting samples whose indexes contain any of the optional keywords. | required |
| `inplace` | `bool` | If True, overwrite the EEMDataset object. | `True` |
Returns:
| Name | Type | Description |
|---|---|---|
| `eem_dataset_new` | `EEMDataset` | The filtered EEM dataset. |
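The "all mandatory, any optional" logic can be sketched on a plain list of sample names. How the two keyword sets are combined is an assumption here (a sample must satisfy both conditions, and an empty keyword set imposes no condition); `select_by_keywords` is a hypothetical helper.

```python
def _as_list(keywords):
    """Accept a single string, a list of strings, or None."""
    if keywords is None:
        return []
    return [keywords] if isinstance(keywords, str) else list(keywords)


def select_by_keywords(index, mandatory_keywords=None, optional_keywords=None):
    """Return the sample names matching all mandatory and any optional keyword."""
    mandatory = _as_list(mandatory_keywords)
    optional = _as_list(optional_keywords)
    selected = []
    for name in index:
        has_all = all(kw in name for kw in mandatory)
        has_any = any(kw in name for kw in optional) if optional else True
        if has_all and has_any:
            selected.append(name)
    return selected


index = ["2024_inlet_A", "2024_outlet_A", "2023_inlet_B"]
selected = select_by_keywords(index, "2024", ["inlet", "B"])
```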
gaussian_filter
gaussian_filter(sigma=1, truncate=3, inplace=True)
Apply Gaussian filtering to every EEM in the dataset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `sigma` | `float` | Standard deviation of the Gaussian kernel. | `1` |
| `truncate` | `float` | Truncate the filter at this many standard deviations. | `3` |
| `inplace` | `bool` | If True, overwrite | `True` |
Returns:
| Name | Type | Description |
|---|---|---|
| `eem_dataset_new` | `EEMDataset` | EEM dataset with Gaussian filtering applied. |
hix
hix()
Calculate the humification index (HIX).
Returns:
| Name | Type | Description |
|---|---|---|
| `hix` | `DataFrame` | HIX |
ife_correction
ife_correction(absorbance, ex_range_abs, inplace=True)
Apply inner filter effect (IFE) correction to every EEM using absorbance spectra.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `absorbance` | `ndarray` | Absorbance spectra stack (n_samples, n_abs_wavelengths). | required |
| `ex_range_abs` | `ndarray` | Wavelength axis (nm) for the absorbance spectra. | required |
| `inplace` | `bool` | If True, overwrite | `True` |
Returns:
| Name | Type | Description |
|---|---|---|
| `eem_dataset_new` | `EEMDataset` | IFE-corrected EEM dataset. |
interpolation
interpolation(ex_range_new, em_range_new, method, inplace=True)
Interpolate every EEM onto a new excitation/emission wavelength grid.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `ex_range_new` | `ndarray` | Target excitation wavelength axis (nm). | required |
| `em_range_new` | `ndarray` | Target emission wavelength axis (nm). | required |
| `method` | `(str, {linear, nearest, slinear, cubic, quintic})` | Interpolation method passed to | required |
| `inplace` | `bool` | If True, overwrite | `True` |
Returns:
| Name | Type | Description |
|---|---|---|
| `eem_dataset_new` | `EEMDataset` | EEM dataset interpolated to the new wavelength grid. The dataset's |
mean
mean()
Calculate mean of each pixel over all samples.
Returns:
| Name | Type | Description |
|---|---|---|
| `mean` | `ndarray` | |
median_filter
median_filter(window_size=(3, 3), mode='reflect', inplace=True)
Apply median filtering to every EEM in the dataset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `window_size` | `tuple of two integers` | Gives the shape that is taken from the input array, at every element position, to define the input to the filter function. | `(3, 3)` |
| `mode` | `str, {'reflect', 'constant', 'nearest', 'mirror', 'wrap'}` | The mode parameter determines how the input array is extended beyond its boundaries. | `'reflect'` |
| `inplace` | `bool` | If True, overwrite the EEMDataset object with the processed EEMs. | `True` |
Returns:
| Name | Type | Description |
|---|---|---|
| `eem_dataset_new` | `EEMDataset` | The processed EEM dataset. |
nan_imputing
nan_imputing(method='linear', fill_value='linear_ex', inplace=True)
Impute NaN pixels in every EEM in the dataset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `method` | `(str, {linear, cubic})` | 2D interpolation method passed to | `"linear"` |
| `fill_value` | `(float or str, {linear_ex, linear_em})` | How to fill pixels outside the convex hull of non-NaN data. | `"linear_ex"` |
| `inplace` | `bool` | If True, overwrite | `True` |
Returns:
| Name | Type | Description |
|---|---|---|
| `eem_dataset_new` | `EEMDataset` | EEM dataset with NaN pixels filled. |
peak_picking
peak_picking(ex, em)
Return the fluorescence intensities at the location closest to the given (ex, em).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `ex` | `float or int` | Excitation wavelength of the wanted location. | required |
| `em` | `float or int` | Emission wavelength of the wanted location. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| `fi` | `DataFrame` | Table of fluorescence intensities at the wanted location for all samples. |
| `ex_actual` | | The actual ex of the extracted fluorescence intensities. |
| `em_actual` | | The actual em of the extracted fluorescence intensities. |
raman_normalization
raman_normalization(ex_range_blank=None, em_range_blank=None, blank=None, from_blank=False, integration_time=1, ex_target=350, bandwidth=5, rsu_standard=20000, manual_rsu=1, inplace=True)
Normalize every EEM in the dataset by a Raman scattering unit (RSU). RSU can be supplied directly (`from_blank=False`) or calculated from blank EEM data (`from_blank=True`). The normalization factor is RSU_raw divided by (`rsu_standard` * `integration_time`).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `blank` | `ndarray` | Blank EEM(s) used to estimate RSU when | `None` |
| `ex_range_blank` | `ndarray` | Excitation wavelength axis for the blank EEM(s). | `None` |
| `em_range_blank` | `ndarray` | Emission wavelength axis for the blank EEM(s). | `None` |
| `from_blank` | `bool` | If True, calculate RSU from the provided blank EEM(s). | `False` |
| `integration_time` | `float` | Integration time used for the blank measurement. | `1` |
| `ex_target` | `float` | Excitation wavelength (nm) at which RSU is computed. | `350` |
| `bandwidth` | `float` | Raman peak bandwidth (nm) used for regional integration. | `5` |
| `rsu_standard` | `float` | Scaling factor applied to RSU to control the magnitude of normalized intensities. | `20000` |
| `manual_rsu` | `float` | RSU used directly when | `1` |
| `inplace` | `bool` | If True, overwrite | `True` |
Returns:
| Name | Type | Description |
|---|---|---|
| `eem_dataset_new` | `EEMDataset` | Raman-normalized EEM dataset. |
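The normalization factor described for `raman_normalization` can be worked through numerically. The sketch below assumes that each EEM is divided by the factor RSU_raw / (`rsu_standard` * `integration_time`); `raman_normalize` is a hypothetical helper and RSU_raw is passed in directly rather than integrated from a blank.

```python
import numpy as np


def raman_normalize(eem_stack, rsu_raw, rsu_standard=20000, integration_time=1):
    """Divide every EEM by rsu_raw / (rsu_standard * integration_time)."""
    factor = rsu_raw / (rsu_standard * integration_time)
    return eem_stack / factor, factor


eem_stack = np.full((2, 3, 4), 10.0)
normalized, factor = raman_normalize(eem_stack, rsu_raw=40000.0)
```

With the default `rsu_standard` of 20000 and an integration time of 1, a raw RSU of 40000 gives a factor of 2, so every intensity is halved.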
raman_scattering_removal
raman_scattering_removal(width=5, interpolation_method='linear', interpolation_dimension='2d', inplace=True, recover_original_nan=True)
Remove the first-order Raman scattering band and fill the masked region.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `width` | `float` | Total width (nm) of the Raman scattering band to mask. | `5` |
| `interpolation_method` | `(str, {linear, cubic, nan, zero})` | Method used to fill the masked region. | `"linear"` |
| `interpolation_dimension` | `(str, {'1d-ex', '1d-em', '2d'})` | Interpolation axis/dimension used when | `"2d"` |
| `recover_original_nan` | `bool` | If True, preserve NaN pixels that existed before scattering removal. | `True` |
| `inplace` | `bool` | If True, overwrite | `True` |
Returns:
| Name | Type | Description |
|---|---|---|
| `eem_dataset_new` | `EEMDataset` | EEM dataset with Raman scattering removed and filled. |
rayleigh_scattering_removal
rayleigh_scattering_removal(width_o1=15, width_o2=15, interpolation_dimension_o1='2d', interpolation_dimension_o2='2d', interpolation_method_o1='zero', interpolation_method_o2='linear', inplace=True, recover_original_nan=True)
Remove first- and second-order Rayleigh scattering bands and fill the masked regions.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `width_o1` | `float` | Total width (nm) of the first-order Rayleigh band (Em = Ex). | `15` |
| `width_o2` | `float` | Total width (nm) of the second-order Rayleigh band (Em = 2*Ex). | `15` |
| `interpolation_dimension_o1` | `(str, {'1d-ex', '1d-em', '2d'})` | Interpolation axis/dimension for the first-order band. | `"2d"` |
| `interpolation_dimension_o2` | `(str, {'1d-ex', '1d-em', '2d'})` | Interpolation axis/dimension for the second-order band. | `"2d"` |
| `interpolation_method_o1` | `(str, {linear, cubic, nan, zero, none})` | Fill method for the first-order band. | `"zero"` |
| `interpolation_method_o2` | `(str, {linear, cubic, nan, zero, none})` | Fill method for the second-order band. | `"linear"` |
| `recover_original_nan` | `bool` | If True, preserve NaN pixels that existed before scattering removal. | `True` |
| `inplace` | `bool` | If True, overwrite | `True` |
Returns:
| Name | Type | Description |
|---|---|---|
| `eem_dataset_new` | `EEMDataset` | EEM dataset with Rayleigh scattering removed and filled. |
region_masking
region_masking(ex_min, ex_max, em_min, em_max, fill_value='nan', inplace=True)
Mask a rectangular excitation/emission region in every EEM in the dataset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `ex_min` | `float` | Lower bound of the excitation wavelength window to mask (nm). | `230` |
| `ex_max` | `float` | Upper bound of the excitation wavelength window to mask (nm). | `500` |
| `em_min` | `float` | Lower bound of the emission wavelength window to mask (nm). | `250` |
| `em_max` | `float` | Upper bound of the emission wavelength window to mask (nm). | `810` |
| `fill_value` | `(str, {nan, zero})` | How to fill the masked region. | `"nan"` |
| `inplace` | `bool` | If True, overwrite | `True` |
Returns:
| Name | Type | Description |
|---|---|---|
| `eem_dataset_new` | `EEMDataset` | EEM dataset with regional masking applied. |
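Masking a rectangular region can be expressed as a 2D boolean mask applied to every sample. This is a sketch under the assumption of inclusive bounds; `mask_region` is a hypothetical helper that takes a numeric fill value rather than the `'nan'`/`'zero'` strings documented above.

```python
import numpy as np


def mask_region(eem_stack, ex_range, em_range, ex_min, ex_max, em_min, em_max,
                fill_value=np.nan):
    """Replace pixels inside the ex/em rectangle with fill_value in all samples."""
    masked = eem_stack.astype(float)          # astype returns a copy
    ex_in = (ex_range >= ex_min) & (ex_range <= ex_max)
    em_in = (em_range >= em_min) & (em_range <= em_max)
    region = ex_in[:, None] & em_in[None, :]  # shape (n_ex, n_em)
    masked[:, region] = fill_value
    return masked


ex_range = np.array([250.0, 300.0, 350.0])
em_range = np.array([300.0, 400.0, 500.0, 600.0])
eem_stack = np.ones((2, 3, 4))
masked = mask_region(eem_stack, ex_range, em_range, 300, 350, 400, 500)
```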
regional_integration
regional_integration(ex_min, ex_max, em_min, em_max) -> pd.DataFrame
Calculate regional integration of samples.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `ex_min` | `float` | The lower boundary of excitation wavelengths of the integrated region. | required |
| `ex_max` | `float` | The upper boundary of excitation wavelengths of the integrated region. | required |
| `em_min` | `float` | The lower boundary of emission wavelengths of the integrated region. | required |
| `em_max` | `float` | The upper boundary of emission wavelengths of the integrated region. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| `integrations` | `DataFrame` | |
sort_by_index
sort_by_index(inplace=True)
Sort the samples in eem_stack, index and reference (if it exists) by the index.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `inplace` | `bool` | If True, overwrite the EEMDataset object. | `True` |
Returns:
| Name | Type | Description |
|---|---|---|
| `eem_dataset_new` | `EEMDataset` | The processed EEM dataset. |
splitting
splitting(n_split, rule: str = 'random', random_state=None, kw_top=None, kw_bot=None, idx_top=None, idx_bot=None)
Split the EEM dataset into multiple sub-datasets.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `n_split` | `int` | The number of splits. | required |
| `rule` | `(str, {random, sequential})` | If 'random' is passed, the split will be generated randomly. If 'sequential' is passed, the dataset will be split according to index order. | `'random'` |
| `random_state` | `int` | Random seed for splitting. | `None` |
Returns:
| Name | Type | Description |
|---|---|---|
| `model_list` | `list` | A list of sub-datasets. Each of them is an EEMDataset object. |
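The difference between the two rules can be illustrated on sample positions alone. The sketch below assumes near-equal split sizes via `np.array_split`; `split_indices` is a hypothetical helper and does not reproduce the `kw_top`/`kw_bot`/`idx_top`/`idx_bot` handling.

```python
import numpy as np


def split_indices(n_samples, n_split, rule="random", random_state=None):
    """Return n_split arrays of sample positions."""
    if rule == "random":
        order = np.random.default_rng(random_state).permutation(n_samples)
    elif rule == "sequential":
        order = np.arange(n_samples)
    else:
        raise ValueError("rule must be 'random' or 'sequential'")
    return np.array_split(order, n_split)


sequential_parts = split_indices(7, 3, rule="sequential")
random_parts = split_indices(7, 3, rule="random", random_state=0)
```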
std
std()
Calculate standard deviation of each pixel over all samples.
Returns:
| Name | Type | Description |
|---|---|---|
std |
ndarray
|
|
subsampling
subsampling(portion=0.8, inplace=True)
Randomly select a portion of the EEMs in the dataset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `portion` | `float` | The portion of samples to select. | `0.8` |
| `inplace` | `bool` | If True, overwrite the EEMDataset object. | `True` |
Returns:
| Name | Type | Description |
|---|---|---|
| `eem_dataset_sub` | `ndarray` | New EEM dataset. |
| `selected_indices` | `ndarray` | Indices of selected EEMs. |
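Random subsampling can be sketched as drawing sample positions without replacement. The rounding of `portion * n_samples` and the sorting of the selected positions are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
eem_stack = rng.random((10, 3, 4))
portion = 0.8

# Number of samples to keep, then draw positions without replacement.
n_selected = int(round(portion * eem_stack.shape[0]))
selected_indices = np.sort(rng.choice(eem_stack.shape[0], size=n_selected, replace=False))
eem_stack_sub = eem_stack[selected_indices]
```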
tf_normalization
tf_normalization(inplace=True)
Normalize every EEM by its total fluorescence. Each sample is divided by its total fluorescence, normalized to the mean total fluorescence across the dataset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `inplace` | `bool` | If True, overwrite | `True` |
Returns:
| Name | Type | Description |
|---|---|---|
| `eem_dataset_new` | `EEMDataset` | Total-fluorescence-normalized EEM dataset. |
| `weights` | `ndarray` | Per-sample normalization factors (total fluorescence divided by the dataset mean). |
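The weights described above can be computed directly. This sketch assumes total fluorescence is the NaN-ignoring sum over all pixels of a sample; `tf_normalize` is a hypothetical helper.

```python
import numpy as np


def tf_normalize(eem_stack):
    """Divide each EEM by (its total fluorescence / mean total fluorescence)."""
    tf = np.nansum(eem_stack, axis=(1, 2))   # total fluorescence per sample
    weights = tf / tf.mean()                 # per-sample normalization factors
    return eem_stack / weights[:, None, None], weights


eem_stack = np.stack([np.full((2, 2), 1.0), np.full((2, 2), 3.0)])
normalized, weights = tf_normalize(eem_stack)
```

After normalization every sample has the same total fluorescence, equal to the dataset mean (8 in this example).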
threshold_masking
threshold_masking(threshold, fill, mask_type='greater', inplace=True)
Mask fluorescence intensity values above or below a threshold across all samples.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `threshold` | `float or int` | Intensity threshold. | required |
| `fill` | `float or int` | Value used to replace masked pixels. | required |
| `mask_type` | `(str, {greater, smaller})` | Whether to mask values greater than or smaller than | `"greater"` |
| `inplace` | `bool` | If True, overwrite | `True` |
Returns:
| Name | Type | Description |
|---|---|---|
| `eem_dataset_new` | `EEMDataset` | EEM dataset with threshold masking applied. |
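Threshold masking maps naturally onto `np.where`. The sketch assumes strict inequalities for both mask types; `threshold_mask` is a hypothetical helper.

```python
import numpy as np


def threshold_mask(eem_stack, threshold, fill, mask_type="greater"):
    """Replace values above (or below) the threshold with fill."""
    if mask_type == "greater":
        mask = eem_stack > threshold
    elif mask_type == "smaller":
        mask = eem_stack < threshold
    else:
        raise ValueError("mask_type must be 'greater' or 'smaller'")
    return np.where(mask, fill, eem_stack)


eem_stack = np.array([[[1.0, 5.0], [10.0, 2.0]]])
masked_high = threshold_mask(eem_stack, threshold=4, fill=0.0)
masked_low = threshold_mask(eem_stack, threshold=4, fill=0.0, mask_type="smaller")
```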
total_fluorescence
total_fluorescence()
Calculate total fluorescence of each sample.
Returns:
| Name | Type | Description |
|---|---|---|
| `tf` | `ndarray` | |
variance
variance()
Calculate variance of each pixel over all samples.
Returns:
| Name | Type | Description |
|---|---|---|
| `variance` | `ndarray` | |
zscore
zscore()
Calculate zscore of each pixel over all samples.
Returns:
| Name | Type | Description |
|---|---|---|
| `zscore` | `ndarray` | |
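The per-pixel statistics (`mean`, `std`, `variance`, `zscore`) all reduce over the sample axis. The sketch below assumes population statistics (`ddof=0`) and a z-score of (value - pixel mean) / pixel standard deviation; the library's exact conventions are not stated here.

```python
import numpy as np

rng = np.random.default_rng(0)
eem_stack = rng.random((5, 3, 4))      # (n_samples, n_ex, n_em)

pixel_mean = eem_stack.mean(axis=0)    # shape (n_ex, n_em)
pixel_std = eem_stack.std(axis=0)
pixel_var = eem_stack.var(axis=0)
zscores = (eem_stack - pixel_mean) / pixel_std   # shape (n_samples, n_ex, n_em)
```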
PARAFAC
Parallel factor analysis (PARAFAC) model for an excitation–emission matrix (EEM) dataset.
This class fits a low-rank PARAFAC (CP) decomposition to a 3D EEM stack with shape (n_samples, n_ex,
n_em) by factorizing it into:
- A sample-mode score matrix A with shape (n_samples, n_components).
- An excitation-mode loading matrix B with shape (n_ex, n_components).
- An emission-mode loading matrix C with shape (n_em, n_components).
Each component r corresponds to a rank-1 outer product A[:, r] ⊗ B[:, r] ⊗ C[:, r], and the reconstructed EEM stack is obtained by summing these rank-1 components over r = 1...n_components.
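The reconstruction from rank-1 components can be written as a single `einsum`. The sketch uses random factor matrices with the shapes listed above and checks the result against an explicit sum of outer products.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_ex, n_em, n_components = 4, 5, 6, 2
A = rng.random((n_samples, n_components))   # sample-mode scores
B = rng.random((n_ex, n_components))        # excitation-mode loadings
C = rng.random((n_em, n_components))        # emission-mode loadings

# Sum over r of A[:, r] (outer) B[:, r] (outer) C[:, r].
reconstructed = np.einsum("ir,jr,kr->ijk", A, B, C)

# The same quantity, built one rank-1 component at a time.
explicit = sum(
    np.multiply.outer(np.multiply.outer(A[:, r], B[:, r]), C[:, r])
    for r in range(n_components)
)
```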
Optional regularization and constraints include:
- Non-negativity.
- Elastic-net regularization on any factor (L1/L2 mix).
- Quadratic priors on A, B, and/or C (controlled by prior_dict_sample, prior_dict_ex, prior_dict_em and gamma_sample, gamma_ex and gamma_em), with NaNs allowed to skip entries. This is useful when fitted scores or spectral components are desired to be close (but not necessarily identical) to prior knowledge. For example, if a component's concentration is known for some samples, a prior vector of length n_samples can be passed with real values for known samples and NaN for unknown samples.
- A ratio constraint on paired rows of A: A[idx_top] ≈ beta * A[idx_bot]. This is useful when the ratios of component amplitudes between two sets of samples are desired to be constant. For example, if each sample is measured both unquenched and quenched using a fixed quencher dosage, then for a given chemically consistent component the ratio between unquenched and quenched amplitudes may be approximately constant across samples (Hu et al., ES&T, 2025). In this case, passing the unquenched and quenched sample indices to idx_top and idx_bot encourages a constant ratio. lam controls the strength of this regularization.
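As an illustration of the last two items, the sketch below writes both terms as simple squared-error penalties. The exact form used by the solver is not given here, so the scaling and the squared-norm choice are assumptions, and both helper functions are hypothetical.

```python
import numpy as np


def quadratic_prior_penalty(factor_column, prior, gamma):
    """gamma * squared distance to the prior, skipping NaN prior entries (assumed form)."""
    known = ~np.isnan(prior)
    return gamma * np.sum((factor_column[known] - prior[known]) ** 2)


def ratio_penalty(A, idx_top, idx_bot, beta, lam):
    """lam * ||A[idx_top] - beta * A[idx_bot]||^2 with one beta per component (assumed form)."""
    diff = A[idx_top] - beta * A[idx_bot]
    return lam * np.sum(diff ** 2)


# Prior known for samples 0 and 2 only; sample 1 is skipped via NaN.
scores = np.array([1.0, 2.0, 3.0])
prior = np.array([1.0, np.nan, 5.0])
prior_cost = quadratic_prior_penalty(scores, prior, gamma=0.5)

# Rows 0-1 are "top" samples, rows 2-3 their paired "bot" samples.
A = np.array([[2.0, 4.0], [6.0, 3.0], [1.0, 2.0], [3.0, 1.5]])
ratio_cost = ratio_penalty(A, idx_top=[0, 1], idx_bot=[2, 3],
                           beta=np.array([2.0, 2.0]), lam=1.0)
```

In this example the top rows are exactly twice the bottom rows, so the ratio term vanishes for `beta = 2`.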
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `n_components` | `int` | Number of PARAFAC components (rank of the CP decomposition). | required |
| `non_negativity` | `bool` | Whether to enforce non-negativity constraints on the factor matrices. | `True` |
| `solver` | `{'mu', 'hals'}` | Optimization algorithm used when | `'hals'` |
| `init` | `{'svd', 'random'} or tensorly.CPTensor` | Initialization strategy for the factor matrices. If a | `'svd'` |
| `custom_init` | `optional` | Custom initialization passed to the HALS solver (when supported by the backend implementation). | `None` |
| `fixed_components` | `optional` | Component(s) to keep fixed during fitting (backend-specific behavior). | `None` |
| `tf_normalization` | `bool` | Whether to normalize each EEM by its total fluorescence during model fitting. | `False` |
| `loadings_normalization` | `{'sd', 'maximum', None}` | Post-fit normalization applied to excitation/emission loadings, with the sample scores scaled accordingly. 'sd': normalize each loading vector to unit standard deviation. 'maximum': normalize each loading vector to unit maximum. None: no loading normalization. | `'maximum'` |
| `sort_components_by_em` | `bool` | Whether to sort components by the emission peak position (ascending). If | `True` |
| `alpha_sample` | `float` | Regularization strength applied to the sample-mode factor matrix (backend-specific). | `0` |
| `alpha_ex` | `float` | Regularization strength applied to the excitation-mode factor matrix (backend-specific). | `0` |
| `alpha_em` | `float` | Regularization strength applied to the emission-mode factor matrix (backend-specific). | `0` |
| `l1_ratio` | `float` | Elastic-net mixing parameter used by the backend ( | `1` |
| `prior_dict_sample` | `dict` | Prior information for the sample-mode factor matrix (backend-specific). | `None` |
| `prior_dict_ex` | `dict` | Prior information for the excitation-mode factor matrix (backend-specific). | `None` |
| `prior_dict_em` | `dict` | Prior information for the emission-mode factor matrix (backend-specific). | `None` |
| `gamma_sample` | `float` | Additional prior/penalty strength for the sample-mode factor matrix (backend-specific). | `0` |
| `gamma_ex` | `float` | Additional prior/penalty strength for the excitation-mode factor matrix (backend-specific). | `0` |
| `gamma_em` | `float` | Additional prior/penalty strength for the emission-mode factor matrix (backend-specific). | `0` |
| `ref_components` | `optional` | Reference component definitions used by the backend prior/regularization logic (backend-specific). | `None` |
| `kw_top` | `str` | Keyword used to identify "top" EEM from | `None` |
| `kw_bot` | `str` | Keyword used to identify "bot" EEM from | `None` |
| `idx_top` | `list of int` | 0-based integer positions of samples in eem_dataset used as the numerator ("top") group (e.g., [0, 1, 2]). | `None` |
| `idx_bot` | `list of int` | 0-based integer positions of samples in eem_dataset used as the denominator ("bot") group (e.g., [3, 4, 5]). | `None` |
| `lam` | `float` | Strength of ratio-based regularization between "top" and "bot" samples (backend-specific). | `0` |
| `max_iter_als` | `int` | Maximum number of outer ALS iterations. | `100` |
| `tol` | `float` | Convergence tolerance for the ALS loop. | `1e-6` |
| `max_iter_nnls` | `int` | Maximum number of iterations for NNLS subproblems (when used by the backend). | `500` |
| `random_state` | `int or numpy.random.RandomState` | Random seed or RNG used for reproducible initialization (when supported). | `None` |
| `mask` | `array-like` | An ideally sparse mask array for missing values (backend-specific). When provided, masked entries are ignored in fitting. | `None` |
Attributes:
| Name | Type | Description |
|---|---|---|
| `score` | `DataFrame or None` | Sample scores (sample loadings). |
| `ex_loadings` | `DataFrame or None` | Excitation-mode loadings for each component. |
| `em_loadings` | `DataFrame or None` | Emission-mode loadings for each component. |
| `fmax` | `DataFrame or None` | The maximum fluorescence intensity of components. Fmax is calculated by multiplying the maximum excitation loading and maximum emission loading for each component by its score. |
| `nnls_fmax` | `DataFrame or None` | Fmax estimated from refitting PARAFAC components to the original EEMs using NNLS. It may be slightly different from |
| `components` | `ndarray or None` | Component EEMs with shape |
| `cptensors` | `CPTensor or None` | Fitted CP/PARAFAC tensor representation returned by the underlying solver. |
| `eem_stack_train` | `ndarray or None` | EEM stack used for model fitting, with shape |
| `eem_stack_reconstructed` | `ndarray or None` | Reconstructed EEM stack from the fitted model, with shape |
| `ex_range` | `ndarray or None` | Excitation wavelength grid corresponding to |
| `em_range` | `ndarray or None` | Emission wavelength grid corresponding to |
| `beta` | `ndarray or None` | Component-wise ratio parameters used when ratio regularization / beta fitting is enabled. |
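The Fmax definition in the attribute table (score multiplied by the maximum excitation loading and the maximum emission loading of each component) can be written in one line of NumPy. The factor matrices below are made-up values used only to show the arithmetic.

```python
import numpy as np

score = np.array([[1.0, 2.0], [3.0, 4.0]])         # (n_samples, n_components)
ex_loadings = np.array([[0.5, 1.0], [1.0, 0.2]])   # (n_ex, n_components)
em_loadings = np.array([[2.0, 0.5], [1.0, 3.0]])   # (n_em, n_components)

# Per-component maxima broadcast across the sample rows.
fmax = score * ex_loadings.max(axis=0) * em_loadings.max(axis=0)
```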
References
[1] Tensorly documentation for CP/PARAFAC decomposition.
[2] Hu, Yongmin, Céline Jacquin, and Eberhard Morgenroth. "Fluorescence Quenching as a Diagnostic Tool for Prediction Reliability Assessment and Anomaly Detection in EEM-Based Water Quality Monitoring." Environmental Science & Technology 59.36 (2025): 19490-19501.
component_peak_locations
component_peak_locations()
Get the ex/em of component peaks
Returns:
| Name | Type | Description |
|---|---|---|
| `max_exem` | `list` | A list of (ex, em) of component peaks. |
core_consistency
core_consistency()
Calculate the core consistency of the established PARAFAC model
Returns:
| Name | Type | Description |
|---|---|---|
| `cc` | `float` | Core consistency. |
export
export(filepath, info_dict)
Export the PARAFAC model to a text file that can be uploaded to the online PARAFAC model database Openfluor (https://openfluor.lablicate.com/#).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `filepath` | `str` | Location of the saved text file. Please specify the ".csv" extension. | required |
| `info_dict` | `dict` | A dictionary containing the model information. Possible keys include: name, creator, date, email, doi, reference, unit, toolbox, fluorometer, nSample, decomposition_method, validation, dataset_calibration, preprocess, sources, description. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| `info_dict` | `dict` | A dictionary containing the information of the PARAFAC model. |
fit
fit(eem_dataset: EEMDataset)
Establish a PARAFAC model based on a given EEM dataset
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `eem_dataset` | `EEMDataset` | The EEM dataset used to fit the PARAFAC model. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| `self` | `object` | The established PARAFAC model. |
leverage
leverage(mode: str = 'sample')
Calculate the leverage of a selected mode.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `mode` | `(str, {ex, em, sample})` | The mode for which the leverage is calculated. | `'sample'` |
Returns:
| Name | Type | Description |
|---|---|---|
| `lvr` | `DataFrame` | The table of leverage. |
predict
predict(eem_dataset: EEMDataset, fit_intercept=False, fit_beta=False, idx_top=None, idx_bot=None)
Predict the score and Fmax of a given EEM dataset using the fitted components. This method can be applied to a new EEM dataset independent of the one used to establish the PARAFAC model.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `eem_dataset` | `EEMDataset` | The EEM dataset to be predicted. | required |
| `fit_intercept` | `bool` | Whether to calculate the intercept. | `False` |
| `fit_beta` | `bool` | Whether to fit the beta parameter (the proportions between "top" and "bot" samples). | `False` |
| `idx_top` | `list` | List of indices of samples serving as numerators in ratio calculation. | `None` |
| `idx_bot` | `list` | List of indices of samples serving as denominators in ratio calculation. | `None` |
Returns:
| Name | Type | Description |
|---|---|---|
| `score_sample` | `DataFrame` | The fitted score. |
| `fmax_sample` | `DataFrame` | The fitted Fmax. |
| `eem_stack_pred` | `np.ndarray (3d)` | The EEM dataset reconstructed. |
residual
residual()
Get the residual of the established PARAFAC model, i.e., the difference between the original EEM dataset and the reconstructed EEM dataset.
Returns:
| Name | Type | Description |
|---|---|---|
| `res` | `np.ndarray (3d)` | The residual. |
sample_relative_rmse
sample_relative_rmse()
Calculate the normalized root mean squared error (normalized RMSE) of EEM of each sample. It is defined as the RMSE divided by the mean of original signal.
Returns:
| Name | Type | Description |
|---|---|---|
| `relative_rmse` | `DataFrame` | Table of normalized RMSE. |
sample_rmse
sample_rmse()
Calculate the root mean squared error (RMSE) of EEM of each sample.
Returns:
| Name | Type | Description |
|---|---|---|
| `rmse` | `DataFrame` | Table of RMSE. |
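The two error metrics, per-sample RMSE and RMSE divided by the mean of the original signal, can be sketched as follows. Both functions are hypothetical stand-ins that take the original and reconstructed stacks as plain arrays and average over all pixels of a sample.

```python
import numpy as np


def sample_rmse(original, reconstructed):
    """Root mean squared error over all pixels of each sample."""
    return np.sqrt(np.mean((original - reconstructed) ** 2, axis=(1, 2)))


def sample_relative_rmse(original, reconstructed):
    """RMSE divided by the mean of the original signal of each sample."""
    return sample_rmse(original, reconstructed) / original.mean(axis=(1, 2))


original = np.stack([np.full((2, 2), 4.0), np.full((2, 2), 8.0)])
reconstructed = original.copy()
reconstructed[0] -= 2.0            # sample 0 is off by 2 everywhere
rmse = sample_rmse(original, reconstructed)
relative_rmse = sample_relative_rmse(original, reconstructed)
```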
sample_summary
sample_summary()
Get a table showing the score, Fmax, leverage, RMSE and normalized RMSE for each sample.
Returns:
| Name | Type | Description |
|---|---|---|
| `summary` | `DataFrame` | Table of samples' score, Fmax, leverage, RMSE and normalized RMSE. |
variance_explained
variance_explained()
Calculate the explained variance of the established PARAFAC model
Returns:
| Name | Type | Description |
|---|---|---|
| `ev` | `float` | The explained variance. |
EEMNMF
Non-negative matrix factorization (NMF) model for an excitation–emission matrix (EEM) dataset.
This class fits a low-rank NMF decomposition to a 3D EEM stack by unfolding it into a 2D non-negative matrix with shape (n_samples, n_pixels) and factorizing it into:
- A non-negative sample score matrix W with shape (n_samples, n_components).
- A non-negative component matrix H with shape (n_components, n_pixels), where n_pixels = n_ex * n_em in the unfolded representation.
The fitted NMF components are reshaped back to EEM form with shape (n_components, n_ex, n_em). Component
amplitudes are reported as Fmax-like values using:
- fmax : scores from the NMF factorization, rescaled to account for component normalization.
- nnls_fmax : scores refit by non-negative least squares (NNLS) against the extracted components,
which can differ slightly from fmax due to the non-exact NMF reconstruction and/or constraints.
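The unfolding and the reshaping of components back to EEM form are plain `reshape` operations. The sketch below uses random non-negative W and H in place of a fitted model and assumes row-major (C-order) flattening of the (ex, em) grid.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_ex, n_em, n_components = 4, 3, 5, 2
eem_stack = rng.random((n_samples, n_ex, n_em))

# Unfold each EEM into one row of n_pixels = n_ex * n_em values.
unfolded = eem_stack.reshape(n_samples, n_ex * n_em)

# Stand-ins for the fitted factors.
W = rng.random((n_samples, n_components))      # sample scores
H = rng.random((n_components, n_ex * n_em))    # unfolded components

# Reshape components and the reconstruction back to EEM form.
components = H.reshape(n_components, n_ex, n_em)
reconstructed = (W @ H).reshape(n_samples, n_ex, n_em)
```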
Optional regularization / constraints (solver-dependent) include:
- Non-negativity (always enforced by this model).
- Elastic-net regularization on W and/or H (L1/L2 mix).
- Quadratic priors on W and/or H (controlled by prior_dict_W, prior_dict_H and
gamma_W, gamma_H), with NaNs allowed to skip entries. This is useful when fitted scores
or spectral components are desired to be close (but not necessarily identical) to prior knowledge.
For example, if a component’s concentration is known for some samples, a prior vector of length
n_samples can be passed with real values for known samples and NaN for unknown samples.
- A ratio constraint on paired rows of W: W[idx_top] ≈ beta * W[idx_bot]. This is useful when
the ratios of component amplitudes between two sets of samples are desired to be constant. For example,
if each sample is measured both unquenched and quenched using a fixed quencher dosage, then for a given
chemically consistent component the ratio between unquenched and quenched amplitudes may be approximately
constant across samples (Hu et al., ES&T, 2025). In this case, passing the unquenched and quenched sample
indices to idx_top and idx_bot encourages a constant ratio. lam controls the strength of this
regularization.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `n_components` | `int` | Number of NMF components (rank of the factorization). | required |
| `solver` | `{'cd', 'mu', 'hals'}` | Optimization algorithm used to fit NMF. | `'cd'` |
| `init` | `str` | Initialization strategy passed to the selected solver. Common options include | `'nndsvda'` |
| `custom_init` | `optional` | Custom initialization passed to the HALS solver (when supported by the backend implementation). | `None` |
| `fixed_components` | `optional` | Component(s) to keep fixed during fitting (backend-specific behavior). | `None` |
| `beta_loss` | `{'frobenius', 'kullback-leibler', 'itakura-saito'}` | Beta divergence used by the | `'frobenius'` |
| `alpha_sample` | `float` | Regularization strength applied to the sample-mode factor matrix | `0` |
| `alpha_component` | `float` | Regularization strength applied to the component matrix | `0` |
| `l1_ratio` | `float` | Elastic-net mixing parameter used by the backend ( | `1` |
| `prior_dict_W` | `dict` | Prior information for the sample-mode factor matrix | `None` |
| `prior_dict_H` | `dict` | Prior information for the component matrix | `None` |
| `prior_dict_A` | `dict` | Additional prior mapping used by the HALS backend (backend-specific). | `None` |
| `prior_dict_B` | `dict` | Additional prior mapping used by the HALS backend (backend-specific). | `None` |
| `prior_dict_C` | `dict` | Additional prior mapping used by the HALS backend (backend-specific). | `None` |
| `gamma_W` | `float` | Additional prior/penalty strength for the sample-mode factor matrix | `0` |
| `gamma_H` | `float` | Additional prior/penalty strength for the component matrix | `0` |
| `gamma_A` | `float` | Additional prior/penalty strength for backend-specific prior term A (HALS solver only). | `0` |
| `gamma_B` | `float` | Additional prior/penalty strength for backend-specific prior term B (HALS solver only). | `0` |
| `gamma_C` | `float` | Additional prior/penalty strength for backend-specific prior term C (HALS solver only). | `0` |
| `ref_components` | `optional` | Reference component definitions used by the backend prior/regularization logic (backend-specific). | `None` |
| `kw_top` | `str` | Keyword used to identify "top" EEM from | `None` |
| `kw_bot` | `str` | Keyword used to identify "bot" EEM from | `None` |
| `idx_top` | `list of int` | 0-based integer positions of samples in eem_dataset used as the numerator ("top") group (e.g., [0, 1, 2]). | `None` |
| `idx_bot` | `list of int` | 0-based integer positions of samples in eem_dataset used as the denominator ("bot") group (e.g., [3, 4, 5]). | `None` |
| `lam` | `float` | Strength of ratio-based regularization between "top" and "bot" samples (HALS solver only). | `0` |
| `fit_rank_one` | `bool` | Whether to enable a rank-one component constraint/penalty in the HALS backend (backend-specific). | `False` |
| `normalization` | `{'pixel_std', None}` | Optional preprocessing applied to the unfolded data matrix before factorization. | `None` |
| `sort_components_by_em` | `bool` | Whether to sort components by the emission peak position (ascending). If | `True` |
| `max_iter_als` | `int` | Maximum number of outer iterations for the HALS solver. | `100` |
| `max_iter_nnls` | `int` | Maximum number of iterations for NNLS subproblems (when used by the backend). | `500` |
| `tol` | `float` | Convergence tolerance passed to the solver. | `1e-5` |
| `random_state` | `int` | Random seed used by solvers that support it. | `42` |
Attributes:
| Name | Type | Description |
|---|---|---|
fmax |
DataFrame or None
|
Sample-mode component amplitudes computed from the fitted NMF |
nnls_fmax |
DataFrame or None
|
Component amplitudes computed by refitting each EEM using NNLS with the fitted components.
Columns follow the naming convention |
components |
ndarray or None
|
Component EEMs with shape |
eem_stack_train |
ndarray or None
|
EEM stack used for model fitting, with shape |
eem_stack_reconstructed |
ndarray or None
|
Reconstructed EEM stack from the fitted model, with shape |
eem_stack_unfolded |
ndarray or None
|
Unfolded 2D matrix used by the solver, with shape |
normalization_factor_std |
ndarray or None
|
Per-pixel standard deviation used when |
normalization_factor_max |
ndarray or None
|
Per-component scaling factors (maximum value of each component in the unfolded space) used
to normalize |
ex_range |
ndarray or None
|
Excitation wavelength grid corresponding to |
em_range |
ndarray or None
|
Emission wavelength grid corresponding to |
beta |
ndarray or None
|
Component-wise ratio parameters used when ratio regularization / beta fitting is enabled (backend-specific). |
decomposer |
object or None
|
Underlying solver object when using scikit-learn NMF (e.g., fitted |
reconstruction_error |
float or None
|
Reconstruction error if provided by the backend/solver; otherwise |
objective_function_error |
object or None
|
Objective tracking information if provided by the backend/solver; otherwise |
References
[1] scikit-learn documentation for sklearn.decomposition.NMF (Coordinate Descent and Multiplicative Updates).
[2] Hu, Yongmin, Céline Jacquin, and Eberhard Morgenroth. "Fluorescence Quenching as a Diagnostic Tool for
Prediction Reliability Assessment and Anomaly Detection in EEM-Based Water Quality Monitoring."
Environmental Science & Technology 59.36 (2025): 19490-19501.
component_peak_locations
component_peak_locations()
Get the excitation/emission (ex/em) wavelengths of the component peaks.
Returns:
| Name | Type | Description |
|---|---|---|
max_exem |
list
|
A list of (ex, em) pairs, one for each component peak. |
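A minimal sketch of how peak locations can be read from a stack of component EEMs, assuming components are stored with shape (n_components, n_ex, n_em) and that ex_range and em_range follow the same axis order. The arrays below are synthetic and not produced by the library.

```python
import numpy as np

ex_range = np.array([250.0, 275.0, 300.0, 325.0])
em_range = np.array([300.0, 350.0, 400.0, 450.0, 500.0])

# Two synthetic components, each with a single dominant pixel
components = np.zeros((2, 4, 5))   # assumed layout: (n_components, n_ex, n_em)
components[0, 1, 2] = 1.0
components[1, 3, 4] = 0.8

max_exem = []
for comp in components:
    # Flat index of the maximum, converted back to (ex index, em index)
    i_ex, i_em = np.unravel_index(np.argmax(comp), comp.shape)
    max_exem.append((float(ex_range[i_ex]), float(em_range[i_em])))
```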
fit
fit(eem_dataset)
Fit NMF model.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
eem_dataset
|
EEMDataset
|
The EEM dataset used to fit the NMF model. |
required |
predict
predict(eem_dataset: EEMDataset, fit_intercept=False, fit_beta=False, idx_top=None, idx_bot=None)
Predict the scores and Fmax of a given EEM dataset using the fitted components. This method can be applied to a new EEM dataset that is independent of the one used to establish the NMF model.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
eem_dataset
|
EEMDataset
|
The EEM dataset to be predicted. |
required |
fit_intercept
|
bool
|
Whether to calculate the intercept. |
False
|
fit_beta
|
bool
|
Whether to fit the beta parameter (the proportions between "top" and "bot" samples). |
False
|
idx_top
|
list
|
List of indices of samples serving as numerators in ratio calculation. |
None
|
idx_bot
|
list
|
List of indices of samples serving as denominators in ratio calculation. |
None
|
Returns:
| Name | Type | Description |
|---|---|---|
score_sample |
DataFrame
|
The fitted score. |
fmax_sample |
DataFrame
|
The fitted Fmax. |
eem_stack_pred |
np.ndarray (3d)
|
The reconstructed EEM stack. |
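The idea of predicting with fixed components can be sketched as a non-negative fit of sample scores while the component matrix stays unchanged. The example below swaps in plain multiplicative updates for the Frobenius loss rather than the library's own routine, and it assumes Fmax is the score multiplied by each component's maximum; both choices are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_ex, n_em = 6, 8
grid = np.arange(n_ex * n_em)

# Two synthetic non-negative components in unfolded (pixel) space
H = np.stack([np.exp(-0.5 * ((grid - 12) / 3.0) ** 2),
              np.exp(-0.5 * ((grid - 34) / 3.0) ** 2)])
W_true = rng.uniform(0.5, 2.0, size=(5, 2))
X = W_true @ H                       # unfolded "new" EEMs, shape (n_samples, n_pixels)

# Update only the scores W; the components H are kept fixed
W = np.ones((5, 2))
for _ in range(200):
    W *= (X @ H.T) / (W @ H @ H.T + 1e-12)

eem_stack_pred = (W @ H).reshape(5, n_ex, n_em)   # reconstructed EEM stack
fmax = W * H.max(axis=1)                          # assumed Fmax definition
```

Because the updates only rescale positive entries, the scores remain non-negative throughout, which is the property a non-negative least-squares refit would also guarantee.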
residual
residual()
Get the residual of the established NMF model, i.e., the difference between the original EEM dataset and the reconstructed EEM dataset.
Returns:
| Name | Type | Description |
|---|---|---|
res |
np.ndarray (3d)
|
the residual |
sample_normalized_rmse
sample_normalized_rmse()
Calculate the normalized root mean squared error (normalized RMSE) of each sample's EEM. It is defined as the RMSE divided by the mean of the original signal.
Returns:
| Name | Type | Description |
|---|---|---|
normalized_sse |
DataFrame
|
Table of normalized RMSE |
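The per-sample error metrics can be sketched directly from their definitions: the residual is the original stack minus the reconstruction, the RMSE is taken over all ex/em pixels of a sample, and the normalized RMSE divides by that sample's mean signal. The arrays below are made-up values, and the sketch returns plain arrays rather than the DataFrame the method produces.

```python
import numpy as np

eem_stack = np.array([[[1.0, 2.0], [3.0, 4.0]],
                      [[2.0, 2.0], [2.0, 2.0]]])
eem_stack_reconstructed = np.array([[[1.0, 2.0], [3.0, 2.0]],
                                    [[2.0, 2.0], [2.0, 2.0]]])

res = eem_stack - eem_stack_reconstructed              # residual, same 3D shape
rmse = np.sqrt(np.mean(res ** 2, axis=(1, 2)))         # one value per sample
normalized_rmse = rmse / np.mean(eem_stack, axis=(1, 2))
```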
sample_rmse
sample_rmse()
Calculate the root mean squared error (RMSE) of each sample's EEM.
Returns:
| Name | Type | Description |
|---|---|---|
sse |
DataFrame
|
Table of RMSE |
variance_explained
variance_explained()
Calculate the explained variance of the established NMF model
Returns:
| Name | Type | Description |
|---|---|---|
ev |
float
|
the explained variance |
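One common definition of explained variance is one minus the ratio of the residual sum of squares to the total sum of squares about the mean. The sketch below uses that definition as an assumption; the library may use a different convention (for example, an uncentered total or a percentage scale).

```python
import numpy as np

eem_stack = np.array([[[1.0, 2.0], [3.0, 4.0]],
                      [[2.0, 2.0], [2.0, 2.0]]])
eem_stack_reconstructed = np.array([[[1.0, 2.0], [3.0, 2.0]],
                                    [[2.0, 2.0], [2.0, 2.0]]])

ss_res = np.sum((eem_stack - eem_stack_reconstructed) ** 2)   # residual sum of squares
ss_tot = np.sum((eem_stack - eem_stack.mean()) ** 2)          # total sum of squares (centered)
ev = 1.0 - ss_res / ss_tot                                    # assumed definition, as a fraction
```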
SplitValidation
Validate PARAFAC or NMF models by comparing component consistency across EEM sub-datasets.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
base_model
|
PARAFAC or EEMNMF
|
Base model used to fit each sub-dataset. |
required |
n_splits
|
int
|
Number of splits used to create sub-datasets. |
4
|
combination_size
|
int or {"half"}
|
Number of splits assembled into each combination. If "half" is passed, each combination uses half of the splits (split-half validation). |
"half"
|
rule
|
{"random", "sequential"}
|
Split rule for the dataset. "sequential" splits by index order. |
"random"
|
random_state
|
int
|
Random seed used when |
None
|
Attributes:
| Name | Type | Description |
|---|---|---|
eem_subsets |
dict
|
Mapping of subset labels to EEMDataset instances. |
subset_specific_models |
dict
|
Mapping of subset labels to fitted PARAFAC or EEMNMF models. |
eem_dataset_full |
EEMDataset or None
|
The full dataset used to generate splits. |
compare_components
compare_components()
Compare component EEMs between models fitted to paired sub-datasets.
Returns:
| Name | Type | Description |
|---|---|---|
similarities_components |
DataFrame
|
Similarity scores for component EEMs. |
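A common similarity score for comparing two component EEMs is Tucker's congruence coefficient, i.e., the cosine similarity of the flattened components. Whether compare_components uses exactly this score is an assumption here; the helper name tucker_congruence is hypothetical.

```python
import numpy as np

def tucker_congruence(a, b):
    """Cosine similarity between two flattened component EEMs."""
    a, b = np.ravel(a), np.ravel(b)
    return float(a @ b / np.sqrt((a @ a) * (b @ b)))

c1 = np.array([[0.0, 1.0], [2.0, 1.0]])
c2 = 3.0 * c1                               # same shape, different scale
c3 = np.array([[1.0, 0.0], [0.0, 0.0]])     # no overlap with c1

same_shape = tucker_congruence(c1, c2)
no_overlap = tucker_congruence(c1, c3)
```

The coefficient is insensitive to scale, so two components that differ only in amplitude score 1, which suits a comparison of spectral shapes across sub-datasets.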
compare_parafac_loadings
compare_parafac_loadings()
Compare excitation/emission loadings between PARAFAC models fitted to paired sub-datasets.
This method is only meaningful for PARAFAC models because it relies on Ex/Em loadings.
Returns:
| Name | Type | Description |
|---|---|---|
similarities_ex |
DataFrame
|
Similarity scores for excitation loadings per component. |
similarities_em |
DataFrame
|
Similarity scores for emission loadings per component. |
correlation_cv
correlation_cv(ref_col)
Cross-validate reference correlations using component Fmax values.
For each split pair, fit a linear regression on the training subset and evaluate on the paired test subset. Metrics are reported for each component as R2 and RMSE for both training and test.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
ref_col
|
str
|
Column name in |
required |
Returns:
| Type | Description |
|---|---|
DataFrame
|
Table of R2 and RMSE metrics for each component and split pairing. |
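The train/test evaluation for a single component and a single split pair can be sketched as follows. The sketch assumes the reference variable is regressed on Fmax with an intercept; the data are synthetic, and r2_rmse is a hypothetical helper.

```python
import numpy as np

def r2_rmse(y, y_hat):
    """Return (R2, RMSE) for observed y and predicted y_hat."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return float(1.0 - ss_res / ss_tot), float(np.sqrt(np.mean((y - y_hat) ** 2)))

# Synthetic Fmax of one component and a perfectly linear reference
fmax_train = np.array([1.0, 2.0, 3.0, 4.0])
ref_train = 2.0 * fmax_train + 1.0
fmax_test = np.array([1.5, 2.5, 3.5])
ref_test = 2.0 * fmax_test + 1.0

# Fit on the training subset only, then evaluate on both subsets
slope, intercept = np.polyfit(fmax_train, ref_train, deg=1)
r2_train, rmse_train = r2_rmse(ref_train, slope * fmax_train + intercept)
r2_test, rmse_test = r2_rmse(ref_test, slope * fmax_test + intercept)
```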
fit
fit(eem_dataset: EEMDataset)
Fit the base model on each sub-dataset and store the fitted models.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
eem_dataset
|
EEMDataset
|
Full dataset used for splitting and model fitting. |
required |
Returns:
| Name | Type | Description |
|---|---|---|
self |
SplitValidation
|
Fitted validation object. |
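To illustrate how n_splits, combination_size and rule interact, the sketch below splits sample positions into four parts and assembles complementary halves, which is the classic split-half layout. The pairing of each combination with its complement and all variable names are assumptions for this example, not the class's internal logic.

```python
import itertools
import numpy as np

n_samples, n_splits = 12, 4
rng = np.random.default_rng(0)
order = rng.permutation(n_samples)       # rule="random"; np.arange(n_samples) for "sequential"
splits = np.array_split(order, n_splits)
combination_size = n_splits // 2         # combination_size="half"

# Pair each combination of splits with its complement, skipping mirrored duplicates
pairs, seen = [], set()
for combo in itertools.combinations(range(n_splits), combination_size):
    rest = tuple(i for i in range(n_splits) if i not in combo)
    if rest in seen:
        continue
    seen.add(combo)
    pairs.append((combo, rest))

# Sample positions belonging to every sub-dataset
subset_indices = {c: np.concatenate([splits[i] for i in c])
                  for pair in pairs for c in pair}
```

With four splits this yields three pairs of complementary halves, each of which could be fitted separately and then compared component by component.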
KMethod
K-method (e.g., K-PARAFACs or K-NMFs) for EEM clustering by minimizing reconstruction error (Hu et al., Water Research, 2025).
This class implements the K-method family of clustering algorithms for excitation–emission matrix (EEM) datasets. The key hypothesis is that fitting EEMs with high chemical composition variability using a single, unified set of components (e.g., one PARAFAC or NMF model) can lead to over-generalized component formation and large reconstruction error. In contrast, EEMs sharing similar chemical compositions can be clustered and represented by cluster-specific component sets, resulting in a number of unique component sets that better capture the variability in chemical composition between clusters and reduce overall reconstruction error.
Based on this hypothesis, K-method searches for a clustering strategy that minimizes the overall reconstruction
error by iterating between:
- Estimation: fit a base decomposition model (base_model) separately on each current cluster to obtain
cluster-specific models.
- Assignment: assign each sample to the cluster whose model yields the smallest distance (e.g.,
reconstruction RMSE), forming updated clusters.
Repeating this procedure yields cluster-specific PARAFAC/NMF models that (ideally) reconstruct the dataset better than a single unified model.
In addition, K-method can be run multiple times with subsampling to form a consensus matrix and then derive a final clustering using hierarchical clustering on a distance matrix computed from consensus values.
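The estimation/assignment loop described above can be sketched with a deliberately simple stand-in for the base model: a single non-negative component per cluster (the cluster mean spectrum) fitted to 1D toy spectra. The toy model, the helper names, and the stopping rule based on unchanged labels are assumptions for illustration; the class itself works with PARAFAC/NMF models on EEMs.

```python
import numpy as np

def fit_model(X):
    """Toy base model: one component, the mean spectrum of the cluster."""
    return X.mean(axis=0)

def sample_rmse(X, comp):
    """RMSE of each sample after least-squares scaling of the component."""
    scores = X @ comp / (comp @ comp)
    return np.sqrt(np.mean((X - np.outer(scores, comp)) ** 2, axis=1))

def k_method(X, n_initial_splits, max_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    labels = rng.permutation(len(X)) % n_initial_splits     # random initial partition
    for _ in range(max_iter):
        clusters = np.unique(labels)                         # emptied clusters drop out
        models = [fit_model(X[labels == k]) for k in clusters]           # estimation
        errors = np.column_stack([sample_rmse(X, m) for m in models])
        new_labels = clusters[np.argmin(errors, axis=1)]                 # assignment
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels

# Two groups of samples with different spectral shapes and random amplitudes
grid = np.arange(20)
a = np.exp(-0.5 * ((grid - 5) / 1.5) ** 2)
b = np.exp(-0.5 * ((grid - 14) / 1.5) ** 2)
rng = np.random.default_rng(1)
X = np.vstack([np.outer(rng.uniform(1, 3, 6), a),
               np.outer(rng.uniform(1, 3, 6), b)])
labels = k_method(X, n_initial_splits=2)
```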
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
base_model
|
object
|
Base decomposition model used within each cluster (e.g., an instance of |
required |
n_initial_splits
|
int
|
Number of splits used in initialization (the first partition of the dataset before iterative refinement). |
required |
distance_metric
|
{'reconstruction_error', 'reconstruction_error_with_beta', 'quenching_coefficient'}
|
Criterion used to assign samples to clusters in the assignment step.
- |
'reconstruction_error'
|
max_iter
|
int
|
Maximum number of K-method iterations in a single base clustering run. |
20
|
tol
|
float
|
Convergence tolerance based on similarity between cluster-specific models of two consecutive iterations.
If the average Tucker’s congruence (or component similarity proxy) exceeds |
0.001
|
elimination
|
{'default'} or int
|
Minimum allowed cluster size during optimization. Clusters with fewer samples than the threshold are removed.
- |
'default'
|
Attributes:
| Name | Type | Description |
|---|---|---|
unified_model |
object or None
|
Unified model fitted once on the full dataset (a deep copy of |
label_history |
list or None
|
History of cluster assignments. For base clustering runs, this is typically a list containing a DataFrame with per-sample labels across iterations. |
error_history |
list or None
|
History of per-sample distances/errors (e.g., RMSE) across iterations, typically stored as DataFrames. |
silhouette_score |
float or None
|
Silhouette score computed on the final distance matrix during hierarchical clustering (when available). |
labels |
ndarray or None
|
Final cluster labels for each sample. Labels are cluster IDs returned by hierarchical clustering (typically 1..K), or by base clustering when used directly. |
index_sorted |
list or None
|
Dataset index reordered by the final hierarchical clustering labels (when available). |
ref_sorted |
DataFrame or None
|
Reference table reordered by the final hierarchical clustering labels (when available). |
threshold_r |
float or None
|
Distance threshold used for hierarchical clustering cut (derived from the linkage matrix). |
eem_clusters |
dict or None
|
Mapping from cluster label to an |
cluster_specific_models |
dict or None
|
Mapping from cluster label to the fitted cluster-specific model (deep copies of |
consensus_matrix |
ndarray or None
|
Consensus matrix |
distance_matrix |
ndarray or None
|
Distance matrix derived from consensus, typically |
linkage_matrix |
ndarray or None
|
Hierarchical clustering linkage matrix computed from the consensus-derived distance matrix. |
consensus_matrix_sorted |
ndarray or None
|
Consensus matrix reordered by the final cluster labels for visualization. |
References
[1] Hu, Yongmin, Eberhard Morgenroth, and Céline Jacquin. "Online monitoring of greywater reuse system using excitation-emission matrix (EEM) and K-PARAFACs." Water Research 268 (2025): 122604.
base_clustering
base_clustering(eem_dataset: EEMDataset)
Run the clustering once.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
eem_dataset
|
EEMDataset
|
The EEM dataset to be clustered. |
required |
Returns:
| Name | Type | Description |
|---|---|---|
cluster_labels |
ndarray
|
Cluster labels. |
label_history |
list
|
Cluster labels in each iteration. |
error_history |
list
|
Average reconstruction error (RMSE) in each iteration. |
calculate_consensus
calculate_consensus(eem_dataset: EEMDataset, n_base_clusterings: int, subsampling_portion: float)
Run the clustering many times and combine the outputs of all runs to obtain an optimal clustering.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
eem_dataset
|
EEMDataset
|
EEM dataset. |
required |
n_base_clusterings
|
int
|
Number of base clustering runs. |
required |
subsampling_portion
|
float
|
The fraction of EEMs retained after subsampling. |
required |
Returns:
| Name | Type | Description |
|---|---|---|
self |
object
|
The established K-method model (e.g., K-PARAFACs).
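A consensus matrix over subsampled runs is commonly computed as the fraction of runs in which two samples were clustered together, counted only over runs in which both were sampled. The sketch below follows that standard consensus-clustering convention, which is assumed rather than confirmed to match the library; consensus_matrix is a hypothetical helper and the runs are made-up labels.

```python
import numpy as np

def consensus_matrix(label_runs, n_samples):
    """label_runs: list of (sampled positions, cluster labels) per base clustering."""
    together = np.zeros((n_samples, n_samples))
    sampled = np.zeros((n_samples, n_samples))
    for idx, labels in label_runs:
        idx, labels = np.asarray(idx), np.asarray(labels)
        block = np.ix_(idx, idx)
        sampled[block] += 1                                   # both samples were drawn
        together[block] += labels[:, None] == labels[None, :] # and shared a cluster
    # Pairs never sampled together keep a consensus of 0
    return np.divide(together, sampled, out=np.zeros_like(together), where=sampled > 0)

runs = [
    ([0, 1, 2], [1, 1, 2]),
    ([1, 2, 3], [1, 2, 2]),
    ([0, 1, 3], [1, 1, 2]),
]
M = consensus_matrix(runs, n_samples=4)
```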
hierarchical_clustering
hierarchical_clustering(eem_dataset, n_clusters, consensus_conversion_power=1)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
eem_dataset
|
EEMDataset
|
EEM dataset to cluster. |
required |
n_clusters
|
int
|
Number of clusters. |
required |
consensus_conversion_power
|
float
|
The exponent adjusting the conversion from the consensus matrix (M) to the distance matrix (D) used for hierarchical clustering: D_{i,j} = (1 - M_{i,j})^consensus_conversion_power. This number influences the gradient of distance with respect to consensus. A smaller number leads to a sharper increase of distance at consensus values close to 1. |
1
|
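The effect of consensus_conversion_power on the distance matrix can be checked numerically with the formula D = (1 - M)^power. The consensus values below are made up for illustration.

```python
import numpy as np

M = np.array([[1.0, 0.9, 0.1],
              [0.9, 1.0, 0.2],
              [0.1, 0.2, 1.0]])

D_linear = (1.0 - M) ** 1.0   # default power of 1
D_sharp = (1.0 - M) ** 0.5    # smaller power: distance grows faster near consensus 1
```

For the pair with consensus 0.9, the distance rises from 0.1 to about 0.32 when the power drops from 1 to 0.5, while pairs with low consensus change far less. The resulting matrix is what a hierarchical clustering routine would then take as its input distances.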
predict
predict(eem_dataset: EEMDataset)
Fit the cluster-specific models to a given EEM dataset. Each EEM in the dataset is assigned to the model that produces the lowest RMSE.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
eem_dataset
|
EEMDataset
|
The EEM dataset to be predicted. |
required |
Returns:
| Name | Type | Description |
|---|---|---|
best_model_label |
DataFrame
|
The best-fit model for every EEM. |
score_all |
DataFrame
|
The score fitted with each cluster-specific model. |
fmax_all |
DataFrame
|
The fmax fitted with each cluster-specific model. |
sample_error |
DataFrame
|
The RMSE fitted with each cluster-specific model. |
combine_eem_datasets
combine_eem_datasets(list_eem_datasets)
Combine all EEMDataset objects in a list
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
list_eem_datasets
|
list
|
List of EEM datasets. |
required |
Returns:
| Name | Type | Description |
|---|---|---|
eem_dataset_combined |
EEMDataset
|
The combined EEM dataset. |