eem_processing

eempy.eem_processing

EEMDataset

Build an EEM dataset.

Parameters:

Name Type Description Default
eem_stack np.ndarray

The 3D EEM stack, with shape (n_samples, n_ex_wavelengths, n_em_wavelengths).

required
ex_range np.ndarray

A 1D NumPy array of the excitation wavelengths.

required
em_range np.ndarray

A 1D NumPy array of the emission wavelengths.

required
index list or None

Optional. The name used to label each sample. The number of elements in the list should equal the number of samples in the eem_stack (with the same sample order).

None
ref pd.DataFrame or None

Optional. The reference data, e.g., the contaminant concentrations in each sample. It should have a length equal to the number of samples in the eem_stack. The index of each sample should be the name given in parameter "index". It is possible to have more than one column. NaN is allowed (for example, if contaminant concentrations in specific samples are unknown).

None
cluster list or None

Optional. The classification of samples, e.g., the output of EEM clustering algorithms. The number of elements in the list should equal the number of samples in the eem_stack (with the same sample order).

None
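As a minimal sketch of the inputs described above, the following builds synthetic arrays with the documented shapes and checks that they agree with each other. The wavelength grids, sample names, and random intensities are made up for illustration. The final comment only indicates how the arrays would be handed over and is not executed.

```python
import numpy as np

# Synthetic inputs, for illustration only.
n_samples = 4
ex_range = np.arange(250, 455, 5)   # 41 excitation wavelengths (nm)
em_range = np.arange(300, 602, 2)   # 151 emission wavelengths (nm)
rng = np.random.default_rng(0)
eem_stack = rng.random((n_samples, ex_range.size, em_range.size))
index = [f"sample_{i}" for i in range(n_samples)]

# Consistency checks implied by the parameter descriptions.
assert eem_stack.shape == (n_samples, ex_range.size, em_range.size)
assert len(index) == eem_stack.shape[0]

# With eempy installed, these arrays would be passed on as
# EEMDataset(eem_stack, ex_range, em_range, index=index)
```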

aqy

aqy(abs_stack, ex_range_abs, target_ex=None)

Calculate the apparent quantum yield (AQY).

Parameters:

Name Type Description Default
abs_stack ndarray

absorbance spectra stack

required
ex_range_abs ndarray

excitation wavelengths of absorbance spectra

required
target_ex float or None

excitation wavelength for AQY. If None is passed, all excitation wavelengths will be returned.

None

Returns:

Name Type Description
aqy DataFrame

apparent quantum yield (AQY)

bix

bix()

Calculate the biological index (BIX).

Returns:

Name Type Description
bix DataFrame

BIX

correlation

correlation(variables, fit_intercept=True)

Analyze the correlation between reference and fluorescence intensity at each pair of ex/em.

Parameters:

Name Type Description Default
variables list

List of variables (i.e., the headers of the reference table) to be fitted

required
fit_intercept bool

Whether to fit the intercept for linear regression.

True

Returns:

Name Type Description
corr_dict dict

A dictionary containing multiple correlation evaluation metrics.

cutting

cutting(ex_min, ex_max, em_min, em_max, inplace=True)

Cut every EEM in the dataset to a new excitation/emission window.

Parameters:

Name Type Description Default
ex_min float

Lower bound of the excitation wavelength window to keep (nm).

required
ex_max float

Upper bound of the excitation wavelength window to keep (nm).

required
em_min float

Lower bound of the emission wavelength window to keep (nm).

required
em_max float

Upper bound of the emission wavelength window to keep (nm).

required
inplace bool

If True, overwrite self and return it. If False, return a new EEMDataset instance.

True

Returns:

Name Type Description
eem_dataset_new EEMDataset

EEM dataset after cutting. The dataset's ex_range and em_range are updated accordingly.
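The windowing idea can be sketched with plain NumPy boolean masks. This is not the library's implementation: the helper name `cut_window` is hypothetical, and treating both bounds as inclusive is an assumption.

```python
import numpy as np

def cut_window(eem_stack, ex_range, em_range, ex_min, ex_max, em_min, em_max):
    """Keep only the pixels inside the ex/em window (bounds assumed inclusive)."""
    ex_keep = (ex_range >= ex_min) & (ex_range <= ex_max)
    em_keep = (em_range >= em_min) & (em_range <= em_max)
    # Slice the excitation axis first, then the emission axis.
    cut = eem_stack[:, ex_keep][:, :, em_keep]
    return cut, ex_range[ex_keep], em_range[em_keep]

ex_range = np.arange(240, 460, 10)   # 240 ... 450 nm
em_range = np.arange(300, 610, 10)   # 300 ... 600 nm
eem_stack = np.zeros((3, ex_range.size, em_range.size))

cut, ex_cut, em_cut = cut_window(eem_stack, ex_range, em_range, 250, 400, 320, 500)
```

The wavelength axes are returned together with the cut stack, mirroring the note that ex_range and em_range are updated along with the data.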

fi

fi()

Compute the fluorescence index (FI) for each sample.

FI is computed as intensity(ex=370 nm, em=470 nm) divided by intensity(ex=370 nm, em=520 nm).

Returns:

Name Type Description
fi DataFrame

Fluorescence index values. Note: the current implementation labels the output column as "BIX" even though the values correspond to FI.
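The ratio above can be sketched in NumPy as follows. The helper names are hypothetical, and using the nearest grid point when 370, 470, or 520 nm is not on the wavelength grid is an assumption rather than documented behavior.

```python
import numpy as np

def nearest(axis, target):
    """Index of the grid point closest to target."""
    return int(np.argmin(np.abs(np.asarray(axis) - target)))

def fluorescence_index(eem_stack, ex_range, em_range):
    """FI = I(ex=370, em=470) / I(ex=370, em=520), one value per sample."""
    i_ex = nearest(ex_range, 370)
    numerator = eem_stack[:, i_ex, nearest(em_range, 470)]
    denominator = eem_stack[:, i_ex, nearest(em_range, 520)]
    return numerator / denominator

ex_range = np.array([360.0, 370.0, 380.0])
em_range = np.array([470.0, 520.0])
eem_stack = np.ones((2, 3, 2))
eem_stack[0, 1] = [3.0, 2.0]   # sample 0 at ex = 370 nm
eem_stack[1, 1] = [1.0, 4.0]   # sample 1 at ex = 370 nm

fi_values = fluorescence_index(eem_stack, ex_range, em_range)
```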

filter_by_cluster

filter_by_cluster(cluster_names, inplace=True)

Select the samples that belong to certain cluster(s).

Parameters:

Name Type Description Default
cluster_names int/float/str or list of int/float/str

cluster names.

required
inplace bool

If True, overwrite the EEMDataset object.

True

Returns:

Name Type Description
eem_dataset_new EEMDataset

The filtered EEM dataset.

filter_by_index

filter_by_index(mandatory_keywords, optional_keywords, inplace=True)

Select the samples whose indexes contain the given keyword.

Parameters:

Name Type Description Default
mandatory_keywords str or list of str

Keywords for selecting samples whose indexes contain all the mandatory keywords.

required
optional_keywords str or list of str

Keywords for selecting samples whose indexes contain any of the optional keywords.

required
inplace bool

if True, overwrite the EEMDataset object.

True

Returns:

Name Type Description
eem_dataset_new EEMDataset

The filtered EEM dataset.

gaussian_filter

gaussian_filter(sigma=1, truncate=3, inplace=True)

Apply Gaussian filtering to every EEM in the dataset.

Parameters:

Name Type Description Default
sigma float

Standard deviation of the Gaussian kernel.

1
truncate float

Truncate the filter at this many standard deviations.

3
inplace bool

If True, overwrite self and return it. If False, return a new EEMDataset instance.

True

Returns:

Name Type Description
eem_dataset_new EEMDataset

EEM dataset with Gaussian filtering applied.

hix

hix()

Calculate the humification index (HIX).

Returns:

Name Type Description
hix DataFrame

HIX

ife_correction

ife_correction(absorbance, ex_range_abs, inplace=True)

Apply inner filter effect (IFE) correction to every EEM using absorbance spectra.

Parameters:

Name Type Description Default
absorbance ndarray

Absorbance spectra stack (n_samples, n_abs_wavelengths).

required
ex_range_abs ndarray

Wavelength axis (nm) for the absorbance spectra.

required
inplace bool

If True, overwrite self and return it. If False, return a new EEMDataset instance.

True

Returns:

Name Type Description
eem_dataset_new EEMDataset

IFE-corrected EEM dataset.

interpolation

interpolation(ex_range_new, em_range_new, method, inplace=True)

Interpolate every EEM onto a new excitation/emission wavelength grid.

Parameters:

Name Type Description Default
ex_range_new ndarray

Target excitation wavelength axis (nm).

required
em_range_new ndarray

Target emission wavelength axis (nm).

required
method (str, {linear, nearest, slinear, cubic, quintic})

Interpolation method passed to scipy.interpolate.RegularGridInterpolator.

required
inplace bool

If True, overwrite self and return it. If False, return a new EEMDataset instance.

True

Returns:

Name Type Description
eem_dataset_new EEMDataset

EEM dataset interpolated to the new wavelength grid. The dataset's ex_range and em_range are updated accordingly.

mean

mean()

Calculate the mean of each pixel over all samples.

Returns:

Name Type Description
mean ndarray

median_filter

median_filter(window_size=(3, 3), mode='reflect', inplace=True)

Apply median filtering to every EEM in the dataset.

Parameters:

Name Type Description Default
window_size tuple of two integers

Gives the shape that is taken from the input array, at every element position, to define the input to the filter function.

(3, 3)
mode str, {‘reflect’, ‘constant’, ‘nearest’, ‘mirror’, ‘wrap’}

The mode parameter determines how the input array is extended beyond its boundaries.

'reflect'
inplace bool

if True, overwrite the EEMDataset object with the processed EEMs.

True

Returns:

Name Type Description
eem_dataset_new EEMDataset

The processed EEM dataset.

nan_imputing

nan_imputing(method='linear', fill_value='linear_ex', inplace=True)

Impute NaN pixels in every EEM in the dataset.

Parameters:

Name Type Description Default
method (str, {linear, cubic})

2D interpolation method passed to scipy.interpolate.griddata.

"linear"
fill_value (float or str, {linear_ex, linear_em})

How to fill pixels outside the convex hull of non-NaN data.

"linear_ex"
inplace bool

If True, overwrite self and return it. If False, return a new EEMDataset instance.

True

Returns:

Name Type Description
eem_dataset_new EEMDataset

EEM dataset with NaN pixels filled.

peak_picking

peak_picking(ex, em)

Return the fluorescence intensities at the location closest to the given (ex, em).

Parameters:

Name Type Description Default
ex float or int

excitation wavelength of the wanted location

required
em float or int

emission wavelength of the wanted location

required

Returns:

Name Type Description
fi DataFrame

table of fluorescence intensities at the wanted location for all samples

ex_actual

the actual ex of the extracted fluorescence intensities

em_actual

the actual em of the extracted fluorescence intensities
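A minimal sketch of the nearest-location lookup, assuming "closest" means the smallest absolute wavelength difference along each axis independently. The function name `pick_peak` is hypothetical.

```python
import numpy as np

def pick_peak(eem_stack, ex_range, em_range, ex, em):
    """Return intensities at the grid point nearest to (ex, em), plus that point."""
    i = int(np.argmin(np.abs(ex_range - ex)))
    j = int(np.argmin(np.abs(em_range - em)))
    return eem_stack[:, i, j], ex_range[i], em_range[j]

ex_range = np.array([250.0, 255.0, 260.0])
em_range = np.array([400.0, 402.0, 404.0, 406.0])
eem_stack = np.arange(2 * 3 * 4, dtype=float).reshape(2, 3, 4)

intensities, ex_actual, em_actual = pick_peak(
    eem_stack, ex_range, em_range, ex=256, em=403.4
)
```

Here the requested location (256, 403.4) is not on the grid, so the lookup falls back to (255, 404), which is why the actual wavelengths are returned alongside the intensities.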

raman_normalization

raman_normalization(ex_range_blank=None, em_range_blank=None, blank=None, from_blank=False, integration_time=1, ex_target=350, bandwidth=5, rsu_standard=20000, manual_rsu=1, inplace=True)

Normalize every EEM in the dataset by a Raman scattering unit (RSU). RSU can be supplied directly (from_blank=False) or calculated from blank EEM data (from_blank=True). The normalization factor is RSU_raw divided by (rsu_standard * integration_time).

Parameters:

Name Type Description Default
blank ndarray

Blank EEM(s) used to estimate RSU when from_blank=True.

None
ex_range_blank ndarray

Excitation wavelength axis for the blank EEM(s).

None
em_range_blank ndarray

Emission wavelength axis for the blank EEM(s).

None
from_blank bool

If True, calculate RSU from the provided blank EEM(s).

False
integration_time float

Integration time used for the blank measurement.

1
ex_target float

Excitation wavelength (nm) at which RSU is computed.

350
bandwidth float

Raman peak bandwidth (nm) used for regional integration.

5
rsu_standard float

Scaling factor applied to RSU to control the magnitude of normalized intensities.

20000
manual_rsu float

RSU used directly when from_blank=False.

1
inplace bool

If True, overwrite self and return it. If False, return a new EEMDataset instance.

True

Returns:

Name Type Description
eem_dataset_new EEMDataset

Raman-normalized EEM dataset.
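The arithmetic of the normalization factor can be sketched as below. The helper name `raman_factor` is hypothetical, and dividing the intensities by the factor is an assumption about how the factor is applied; the RSU value itself is made up.

```python
import numpy as np

def raman_factor(rsu_raw, rsu_standard=20000, integration_time=1):
    """Normalization factor: RSU_raw / (rsu_standard * integration_time)."""
    return rsu_raw / (rsu_standard * integration_time)

eem_stack = np.full((2, 3, 4), 10.0)
factor = raman_factor(rsu_raw=40000, rsu_standard=20000, integration_time=1)

# Assumption: each EEM is divided by the factor.
normalized = eem_stack / factor
```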

raman_scattering_removal

raman_scattering_removal(width=5, interpolation_method='linear', interpolation_dimension='2d', inplace=True, recover_original_nan=True)

Remove the first-order Raman scattering band and fill the masked region.

Parameters:

Name Type Description Default
width float

Total width (nm) of the Raman scattering band to mask.

5
interpolation_method (str, {linear, cubic, nan, zero})

Method used to fill the masked region.

"linear"
interpolation_dimension (str, {'1d-ex', '1d-em', '2d'})

Interpolation axis/dimension used when interpolation_method is not "nan" or "zero".

"2d"
recover_original_nan bool

If True, preserve NaN pixels that existed before scattering removal.

True
inplace bool

If True, overwrite self and return it. If False, return a new EEMDataset instance.

True

Returns:

Name Type Description
eem_dataset_new EEMDataset

EEM dataset with Raman scattering removed and filled.

rayleigh_scattering_removal

rayleigh_scattering_removal(width_o1=15, width_o2=15, interpolation_dimension_o1='2d', interpolation_dimension_o2='2d', interpolation_method_o1='zero', interpolation_method_o2='linear', inplace=True, recover_original_nan=True)

Remove first- and second-order Rayleigh scattering bands and fill the masked regions.

Parameters:

Name Type Description Default
width_o1 float

Total width (nm) of the first-order Rayleigh band (Em = Ex).

15
width_o2 float

Total width (nm) of the second-order Rayleigh band (Em = 2*Ex).

15
interpolation_dimension_o1 (str, {'1d-ex', '1d-em', '2d'})

Interpolation axis/dimension for the first-order band.

"2d"
interpolation_dimension_o2 (str, {'1d-ex', '1d-em', '2d'})

Interpolation axis/dimension for the second-order band.

"2d"
interpolation_method_o1 (str, {linear, cubic, nan, zero, none})

Fill method for the first-order band.

"zero"
interpolation_method_o2 (str, {linear, cubic, nan, zero, none})

Fill method for the second-order band.

"linear"
recover_original_nan bool

If True, preserve NaN pixels that existed before scattering removal.

True
inplace bool

If True, overwrite self and return it. If False, return a new EEMDataset instance.

True

Returns:

Name Type Description
eem_dataset_new EEMDataset

EEM dataset with Rayleigh scattering removed and filled.
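To illustrate where the two bands lie, the sketch below builds boolean masks around the lines Em = Ex and Em = 2*Ex. It assumes each band is centered on its line, so a total width w covers w/2 on either side; the helper name is hypothetical, and the fill step only mimics the "zero" and "nan" options rather than any interpolation.

```python
import numpy as np

def rayleigh_masks(ex_range, em_range, width_o1=15, width_o2=15):
    """Boolean (n_ex, n_em) masks for first- and second-order Rayleigh bands."""
    ex_grid, em_grid = np.meshgrid(ex_range, em_range, indexing="ij")
    mask_o1 = np.abs(em_grid - ex_grid) <= width_o1 / 2
    mask_o2 = np.abs(em_grid - 2 * ex_grid) <= width_o2 / 2
    return mask_o1, mask_o2

ex_range = np.arange(250, 351, 50)   # 250, 300, 350 nm
em_range = np.arange(250, 701, 50)   # 250 ... 700 nm
mask_o1, mask_o2 = rayleigh_masks(ex_range, em_range)

eem = np.ones((ex_range.size, em_range.size))
eem[mask_o1] = 0.0      # comparable to a "zero" fill of the first-order band
eem[mask_o2] = np.nan   # comparable to a "nan" fill of the second-order band
```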

region_masking

region_masking(ex_min, ex_max, em_min, em_max, fill_value='nan', inplace=True)

Mask a rectangular excitation/emission region in every EEM in the dataset.

Parameters:

Name Type Description Default
ex_min float

Lower bound of the excitation wavelength window to mask (nm).

230
ex_max float

Upper bound of the excitation wavelength window to mask (nm).

500
em_min float

Lower bound of the emission wavelength window to mask (nm).

250
em_max float

Upper bound of the emission wavelength window to mask (nm).

810
fill_value (str, {nan, zero})

How to fill the masked region.

"nan"
inplace bool

If True, overwrite self and return it. If False, return a new EEMDataset instance.

True

Returns:

Name Type Description
eem_dataset_new EEMDataset

EEM dataset with regional masking applied.

regional_integration

regional_integration(ex_min, ex_max, em_min, em_max) -> pd.DataFrame

Calculate regional integration of samples.

Parameters:

Name Type Description Default
ex_min float

The lower boundary of excitation wavelengths of the integrated region.

required
ex_max float

The upper boundary of excitation wavelengths of the integrated region.

required
em_min float

The lower boundary of emission wavelengths of the integrated region.

required
em_max float

The upper boundary of emission wavelengths of the integrated region.

required

Returns:

Name Type Description
integrations DataFrame

sort_by_index

sort_by_index(inplace=True)

Sort the sample order of eem_stack, index and reference (if exists) by the index.

Parameters:

Name Type Description Default
inplace bool

If True, overwrite the EEMDataset object.

True

Returns:

Name Type Description
eem_dataset_new EEMDataset

The processed EEM dataset.

splitting

splitting(n_split, rule: str = 'random', random_state=None, kw_top=None, kw_bot=None, idx_top=None, idx_bot=None)

Split the EEM dataset into multiple sub-datasets.

Parameters:

Name Type Description Default
n_split int

The number of splits.

required
rule (str, {random, sequential})

If 'random' is passed, the split will be generated randomly. If 'sequential' is passed, the dataset will be split according to index order.

'random'
random_state int

Random seed for splitting.

None

Returns:

Name Type Description
model_list list

A list of sub-datasets. Each of them is an EEMDataset object.
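The difference between the two rules can be sketched on sample positions alone. This is an illustration under assumptions, not the library's logic: the helper name is hypothetical, and the way unequal group sizes are distributed follows np.array_split.

```python
import numpy as np

def split_indices(n_samples, n_split, rule="random", random_state=None):
    """Return n_split arrays of sample positions."""
    if rule == "random":
        order = np.random.default_rng(random_state).permutation(n_samples)
    elif rule == "sequential":
        order = np.arange(n_samples)
    else:
        raise ValueError("rule must be 'random' or 'sequential'")
    return np.array_split(order, n_split)

sequential_parts = split_indices(10, 3, rule="sequential")
random_parts = split_indices(10, 3, rule="random", random_state=0)
```

Each array of positions could then be used to select the matching rows of the EEM stack, index, and reference data.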

std

std()

Calculate the standard deviation of each pixel over all samples.

Returns:

Name Type Description
std ndarray

subsampling

subsampling(portion=0.8, inplace=True)

Randomly select a portion of the EEMs in the dataset.

Parameters:

Name Type Description Default
portion float

The fraction of samples to select (e.g., 0.8 selects 80% of the samples).

0.8
inplace bool

if True, overwrite the EEMDataset object.

True

Returns:

Name Type Description
eem_dataset_sub ndarray

New EEM dataset.

selected_indices ndarray

Indices of selected EEMs.

tf_normalization

tf_normalization(inplace=True)

Normalize every EEM by its total fluorescence. Each sample is divided by its total fluorescence, normalized to the mean total fluorescence across the dataset.

Parameters:

Name Type Description Default
inplace bool

If True, overwrite self and return it. If False, return a new EEMDataset instance.

True

Returns:

Name Type Description
eem_dataset_new EEMDataset

Total-fluorescence-normalized EEM dataset.

weights ndarray

Per-sample normalization factors (total fluorescence divided by the dataset mean).
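A minimal NumPy sketch of this normalization, assuming total fluorescence is the sum of all pixel intensities of a sample (NaNs ignored). The function name is hypothetical.

```python
import numpy as np

def tf_normalize(eem_stack):
    """Divide each EEM by its total fluorescence relative to the dataset mean."""
    tf = np.nansum(eem_stack, axis=(1, 2))   # total fluorescence per sample
    weights = tf / tf.mean()                 # per-sample normalization factors
    return eem_stack / weights[:, None, None], weights

eem_stack = np.stack([np.full((2, 2), 1.0), np.full((2, 2), 3.0)])
normalized, weights = tf_normalize(eem_stack)
```

After normalization every sample has the same total fluorescence, equal to the dataset mean before normalization.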

threshold_masking

threshold_masking(threshold, fill, mask_type='greater', inplace=True)

Mask fluorescence intensity values above or below a threshold across all samples.

Parameters:

Name Type Description Default
threshold float or int

Intensity threshold.

required
fill float or int

Value used to replace masked pixels.

required
mask_type (str, {greater, smaller})

Whether to mask values greater than or smaller than threshold.

"greater"
inplace bool

If True, overwrite self and return it. If False, return a new EEMDataset instance.

True

Returns:

Name Type Description
eem_dataset_new EEMDataset

EEM dataset with threshold masking applied.
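The masking rule can be sketched with np.where. The function name is hypothetical, and using strict comparisons (values equal to the threshold are kept) is an assumption.

```python
import numpy as np

def threshold_mask(eem_stack, threshold, fill, mask_type="greater"):
    """Replace values above ('greater') or below ('smaller') the threshold."""
    if mask_type == "greater":
        mask = eem_stack > threshold
    elif mask_type == "smaller":
        mask = eem_stack < threshold
    else:
        raise ValueError("mask_type must be 'greater' or 'smaller'")
    return np.where(mask, fill, eem_stack)

eem_stack = np.array([[[0.5, 2.0], [-0.1, 1.0]]])
masked_high = threshold_mask(eem_stack, threshold=1.5, fill=np.nan)
masked_low = threshold_mask(eem_stack, threshold=0, fill=0, mask_type="smaller")
```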

total_fluorescence

total_fluorescence()

Calculate the total fluorescence of each sample.

Returns:

Name Type Description
tf ndarray

variance

variance()

Calculate the variance of each pixel over all samples.

Returns:

Name Type Description
variance ndarray

zscore

zscore()

Calculate the z-score of each pixel over all samples.

Returns:

Name Type Description
zscore ndarray
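The pixel-wise statistics (mean, std, variance, z-score) all reduce over the sample axis of the stack. The sketch below uses population statistics (ddof=0), which is an assumption; the library may use a different convention.

```python
import numpy as np

# Three samples, one excitation wavelength, two emission wavelengths.
eem_stack = np.array([[[1.0, 2.0]], [[3.0, 2.0]], [[5.0, 8.0]]])

mean = eem_stack.mean(axis=0)        # shape (n_ex, n_em)
std = eem_stack.std(axis=0)          # population standard deviation
variance = eem_stack.var(axis=0)
zscore = (eem_stack - mean) / std    # shape (n_samples, n_ex, n_em)
```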

PARAFAC

Parallel factor analysis (PARAFAC) model for an excitation–emission matrix (EEM) dataset.

This class fits a low-rank PARAFAC (CP) decomposition to a 3D EEM stack with shape (n_samples, n_ex, n_em) by factorizing it into:

- A sample-mode score matrix A with shape (n_samples, n_components).
- An excitation-mode loading matrix B with shape (n_ex, n_components).
- An emission-mode loading matrix C with shape (n_em, n_components).

Each component r corresponds to a rank-1 outer product A[:, r] ⊗ B[:, r] ⊗ C[:, r], and the reconstructed EEM stack is obtained by summing these rank-1 components over r = 1...n_components.
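The sum of rank-1 outer products can be written in NumPy as a single einsum. The factor matrices here are random stand-ins rather than fitted values, and the explicit loop is included only to show that both forms agree.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_ex, n_em, n_components = 5, 4, 6, 2
A = rng.random((n_samples, n_components))   # sample scores
B = rng.random((n_ex, n_components))        # excitation loadings
C = rng.random((n_em, n_components))        # emission loadings

# Sum over r of A[:, r] (outer) B[:, r] (outer) C[:, r].
reconstructed = np.einsum("ir,jr,kr->ijk", A, B, C)

# The same sum written out component by component.
explicit = sum(
    A[:, r][:, None, None] * B[:, r][None, :, None] * C[:, r][None, None, :]
    for r in range(n_components)
)
```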

The fit supports the following optional constraints and regularization terms:

- Non-negativity.
- Elastic-net regularization on any factor (L1/L2 mix).
- Quadratic priors on A, B, and/or C (controlled by prior_dict_sample, prior_dict_ex, prior_dict_em and gamma_sample, gamma_ex, gamma_em), with NaNs allowed to skip entries. This is useful when fitted scores or spectral components should be close (but not necessarily identical) to prior knowledge. For example, if a component's concentration is known for some samples, a prior vector of length n_samples can be passed with real values for known samples and NaN for unknown samples.
- A ratio constraint on paired rows of A: A[idx_top] ≈ beta * A[idx_bot]. This is useful when the ratios of component amplitudes between two sets of samples should be constant. For example, if each sample is measured both unquenched and quenched using a fixed quencher dosage, then for a given chemically consistent component the ratio between unquenched and quenched amplitudes may be approximately constant across samples (Hu et al., ES&T, 2025). In this case, passing the unquenched and quenched sample indices to idx_top and idx_bot encourages a constant ratio. lam controls the strength of this regularization.
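To make the last two terms concrete, the sketch below evaluates a quadratic prior penalty that skips NaN entries and a ratio penalty on paired rows. Both are written as plain squared errors, which is an assumption: the exact objective used by the solver is not given here, and the function names are hypothetical.

```python
import numpy as np

def prior_penalty(factor_column, prior, gamma):
    """gamma * squared distance to the prior, ignoring NaN (unknown) entries."""
    known = ~np.isnan(prior)
    return gamma * np.sum((factor_column[known] - prior[known]) ** 2)

def ratio_penalty(A, idx_top, idx_bot, beta, lam):
    """lam * squared deviation from A[idx_top] = beta * A[idx_bot]."""
    diff = A[idx_top] - beta * A[idx_bot]
    return lam * np.sum(diff ** 2)

# Four samples, two components; rows 0-1 are "top", rows 2-3 are "bot".
A = np.array([[2.0, 4.0], [4.0, 2.0], [1.0, 2.0], [2.0, 1.0]])

# Prior for component 0: known for samples 0 and 2, unknown elsewhere.
prior = np.array([2.5, np.nan, 1.0, np.nan])
p = prior_penalty(A[:, 0], prior, gamma=2.0)

# With beta = 2 for both components the pairs match exactly.
beta = np.array([2.0, 2.0])
r = ratio_penalty(A, [0, 1], [2, 3], beta, lam=1.0)
```

In this toy case only the sample 0 entry contributes to the prior penalty, and the ratio penalty vanishes because every "top" row is exactly twice its paired "bot" row.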

Parameters:

Name Type Description Default
n_components int

Number of PARAFAC components (rank of the CP decomposition).

required
non_negativity bool

Whether to enforce non-negativity constraints on the factor matrices.

True
solver {'mu', 'hals'}

Optimization algorithm used when non_negativity=True.

- 'mu': Multiplicative Updates solver (tensorly.decomposition.non_negative_parafac).
- 'hals': Hierarchical Alternating Least Squares solver with optional priors/regularization (eempy.solver.parafac_with_prior_hals).

If non_negativity=False, a standard alternating least squares solver (tensorly.decomposition.parafac) is used regardless of this setting.

'hals'
init {'svd', 'random'} or tensorly.CPTensor

Initialization strategy for the factor matrices. If a tensorly.CPTensor is provided, it is used as the initialization.

'svd'
custom_init optional

Custom initialization passed to the HALS solver (when supported by the backend implementation).

None
fixed_components optional

Component(s) to keep fixed during fitting (backend-specific behavior).

None
tf_normalization bool

Whether to normalize each EEM by its total fluorescence during model fitting.

False
loadings_normalization {'sd', 'maximum', None}

Post-fit normalization applied to excitation/emission loadings, with the sample scores scaled accordingly.

- 'sd': normalize each loading vector to unit standard deviation.
- 'maximum': normalize each loading vector to unit maximum.
- None: no loading normalization.

'maximum'
sort_components_by_em bool

Whether to sort components by the emission peak position (ascending). If False, components are kept in the solver output order (which may correlate with variance contribution depending on the solver).

True
alpha_sample float

Regularization strength applied to the sample-mode factor matrix (backend-specific).

0
alpha_ex float

Regularization strength applied to the excitation-mode factor matrix (backend-specific).

0
alpha_em float

Regularization strength applied to the emission-mode factor matrix (backend-specific).

0
l1_ratio float

Elastic-net mixing parameter used by the backend (1 corresponds to L1 only; 0 to L2 only).

1
prior_dict_sample dict

Prior information for the sample-mode factor matrix (backend-specific).

None
prior_dict_ex dict

Prior information for the excitation-mode factor matrix (backend-specific).

None
prior_dict_em dict

Prior information for the emission-mode factor matrix (backend-specific).

None
gamma_sample float

Additional prior/penalty strength for the sample-mode factor matrix (backend-specific).

0
gamma_ex float

Additional prior/penalty strength for the excitation-mode factor matrix (backend-specific).

0
gamma_em float

Additional prior/penalty strength for the emission-mode factor matrix (backend-specific).

0
ref_components optional

Reference component definitions used by the backend prior/regularization logic (backend-specific).

None
kw_top str

Keyword used to identify "top" EEM from eem_dataset.index during fitting. "Top" and "bot" EEMs are assumed to be paired one-to-one and aligned by selection order (first "top" ↔ first "bot", etc.). A recommended naming convention is "a_sharing_sample_name" + "kw_top" or "kw_bot" for the quenched and unquenched EEM derived from the same original sample, so the pair differs only by kw_top/kw_bot and alignment is preserved when selecting by keywords. An alternative approach is to provide idx_top and idx_bot to directly specify "top" and "bot" EEMs by positions.

None
kw_bot str

Keyword used to identify "bot" EEM from eem_dataset.index during fitting.

None
idx_top list of int

0-based integer positions of samples in eem_dataset used as the numerator ("top") group (e.g., [0, 1, 2]).

None
idx_bot list of int

0-based integer positions of samples in eem_dataset used as the denominator ("bot") group (e.g., [3, 4, 5]).

None
lam float

Strength of ratio-based regularization between "top" and "bot" samples (backend-specific).

0
max_iter_als int

Maximum number of outer ALS iterations.

100
tol float

Convergence tolerance for the ALS loop.

1e-6
max_iter_nnls int

Maximum number of iterations for NNLS subproblems (when used by the backend).

500
random_state int or numpy.random.RandomState

Random seed or RNG used for reproducible initialization (when supported).

None
mask array-like

An ideally sparse mask array for missing values (backend-specific). When provided, masked entries are ignored in fitting.

None

Attributes:

Name Type Description
score DataFrame or None

Sample scores (sample loadings).

ex_loadings DataFrame or None

Excitation-mode loadings for each component.

em_loadings DataFrame or None

Emission-mode loadings for each component.

fmax DataFrame or None

The maximum fluorescence intensity of components. Fmax is calculated by multiplying the maximum excitation loading and maximum emission loading for each component by its score.

nnls_fmax DataFrame or None

Fmax estimated from refitting PARAFAC components to the original EEMs using NNLS. It may be slightly different from fmax due to the non-exact fit.

components ndarray or None

Component EEMs with shape (n_components, n_ex, n_em) constructed from excitation/emission loadings.

cptensors CPTensor or None

Fitted CP/PARAFAC tensor representation returned by the underlying solver.

eem_stack_train ndarray or None

EEM stack used for model fitting, with shape (n_samples, n_ex, n_em).

eem_stack_reconstructed ndarray or None

Reconstructed EEM stack from the fitted model, with shape (n_samples, n_ex, n_em).

ex_range ndarray or None

Excitation wavelength grid corresponding to ex_loadings and components.

em_range ndarray or None

Emission wavelength grid corresponding to em_loadings and components.

beta ndarray or None

Component-wise ratio parameters used when ratio regularization / beta fitting is enabled.
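The relation between score, loadings, fmax, and components described in the attributes above can be sketched with small hand-written arrays. The values are illustrative, not fitted results, and plain arrays stand in for the DataFrame attributes.

```python
import numpy as np

score = np.array([[1.0, 2.0], [3.0, 4.0]])                     # (n_samples, n_components)
ex_loadings = np.array([[0.2, 0.1], [0.5, 0.4]])               # (n_ex, n_components)
em_loadings = np.array([[1.0, 0.3], [0.6, 2.0], [0.1, 0.5]])   # (n_em, n_components)

# Fmax: score times the maximum excitation and emission loading per component.
fmax = score * ex_loadings.max(axis=0) * em_loadings.max(axis=0)

# Component EEMs: outer product of excitation and emission loadings.
components = np.einsum("jr,kr->rjk", ex_loadings, em_loadings)
```

For non-negative loadings, the peak of each component EEM equals the product of the two loading maxima, so fmax is the score scaled by that peak height.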

References

[1] Tensorly documentation for CP/PARAFAC decomposition.

[2] Hu, Yongmin, Céline Jacquin, and Eberhard Morgenroth. "Fluorescence Quenching as a Diagnostic Tool for Prediction Reliability Assessment and Anomaly Detection in EEM-Based Water Quality Monitoring." Environmental Science & Technology 59.36 (2025): 19490-19501.

component_peak_locations

component_peak_locations()

Get the ex/em of component peaks

Returns:

Name Type Description
max_exem list

A list of (ex, em) of component peaks.

core_consistency

core_consistency()

Calculate the core consistency of the established PARAFAC model

Returns:

Name Type Description
cc float

core consistency

export

export(filepath, info_dict)

Export the PARAFAC model to a text file that can be uploaded to the online PARAFAC model database OpenFluor (https://openfluor.lablicate.com/#).

Parameters:

Name Type Description Default
filepath str

Location of the saved text file. Please specify the ".csv" extension.

required
info_dict dict

A dictionary containing the model information. Possible keys include: name, creator, date, email, doi, reference, unit, toolbox, fluorometer, nSample, decomposition_method, validation, dataset_calibration, preprocess, sources, description.

required

Returns:

Name Type Description
info_dict dict

A dictionary containing the information of the PARAFAC model.

fit

fit(eem_dataset: EEMDataset)

Establish a PARAFAC model based on a given EEM dataset

Parameters:

Name Type Description Default
eem_dataset EEMDataset

The EEM dataset used to fit the PARAFAC model.

required

Returns:

Name Type Description
self object

The established PARAFAC model

leverage

leverage(mode: str = 'sample')

Calculate the leverage of a selected mode.

Parameters:

Name Type Description Default
mode (str, {ex, em, sample})

The mode of which the leverage is calculated.

'sample'

Returns:

Name Type Description
lvr DataFrame

The table of leverage
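Leverage is commonly defined as the diagonal of the projection ("hat") matrix of a factor matrix, F (FᵀF)⁻¹ Fᵀ. The sketch below uses that standard definition; whether this method computes it in exactly this way is an assumption, and the helper name is hypothetical.

```python
import numpy as np

def factor_leverage(factor):
    """Diagonal of factor @ pinv(factor.T @ factor) @ factor.T."""
    hat = factor @ np.linalg.pinv(factor.T @ factor) @ factor.T
    return np.diag(hat)

# Three samples, two components.
A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
lev = factor_leverage(A)
```

With this definition the leverages lie between 0 and 1 and sum to the rank of the factor matrix, so unusually large values point to samples (or wavelengths) that dominate the model.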

predict

predict(eem_dataset: EEMDataset, fit_intercept=False, fit_beta=False, idx_top=None, idx_bot=None)

Predict the score and Fmax of a given EEM dataset using the fitted components. This method can be applied to a new EEM dataset independent of the one used to establish the PARAFAC model.

Parameters:

Name Type Description Default
eem_dataset EEMDataset

The EEM dataset to be predicted.

required
fit_intercept bool

Whether to calculate the intercept.

False
fit_beta bool

Whether to fit the beta parameter (the proportions between "top" and "bot" samples).

False
idx_top list

List of indices of samples serving as numerators in ratio calculation.

None
idx_bot list

List of indices of samples serving as denominators in ratio calculation.

None

Returns:

Name Type Description
score_sample DataFrame

The fitted score.

fmax_sample DataFrame

The fitted Fmax.

eem_stack_pred np.ndarray (3d)

The EEM dataset reconstructed.

residual

residual()

Get the residual of the established PARAFAC model, i.e., the difference between the original EEM dataset and the reconstructed EEM dataset.

Returns:

Name Type Description
res np.ndarray (3d)

the residual

sample_relative_rmse

sample_relative_rmse()

Calculate the normalized root mean squared error (normalized RMSE) of EEM of each sample. It is defined as the RMSE divided by the mean of original signal.

Returns:

Name Type Description
relative_rmse DataFrame

Table of normalized RMSE
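The definition above (RMSE divided by the mean of the original signal) can be sketched per sample as follows. The arrays are toy values standing in for the training and reconstructed EEM stacks, and the function names are hypothetical.

```python
import numpy as np

def per_sample_rmse(original, reconstructed):
    """Root mean squared error over all pixels of each sample."""
    return np.sqrt(np.mean((original - reconstructed) ** 2, axis=(1, 2)))

def per_sample_relative_rmse(original, reconstructed):
    """RMSE divided by the mean of the original signal of each sample."""
    return per_sample_rmse(original, reconstructed) / np.mean(original, axis=(1, 2))

original = np.array([[[2.0, 2.0], [2.0, 2.0]], [[4.0, 4.0], [4.0, 4.0]]])
reconstructed = original - 1.0   # constant error of 1 in every pixel

rmse = per_sample_rmse(original, reconstructed)
relative_rmse = per_sample_relative_rmse(original, reconstructed)
```

The same absolute error yields a smaller relative error for the brighter sample, which is the point of normalizing by the mean signal.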

sample_rmse

sample_rmse()

Calculate the root mean squared error (RMSE) of EEM of each sample.

Returns:

Name Type Description
rmse DataFrame

Table of RMSE

sample_summary

sample_summary()

Get a table showing the score, Fmax, leverage, RMSE and normalized RMSE for each sample.

Returns:

Name Type Description
summary DataFrame

Table of samples' score, Fmax, leverage, RMSE and normalized RMSE.

variance_explained

variance_explained()

Calculate the explained variance of the established PARAFAC model

Returns:

Name Type Description
ev float

the explained variance

EEMNMF

Non-negative matrix factorization (NMF) model for an excitation–emission matrix (EEM) dataset.

This class fits a low-rank NMF decomposition to a 3D EEM stack by unfolding it into a 2D non-negative matrix with shape (n_samples, n_pixels) and factorizing it into:

- A non-negative sample score matrix W with shape (n_samples, n_components).
- A non-negative component matrix H with shape (n_components, n_pixels), where n_pixels = n_ex * n_em in the unfolded representation.

The fitted NMF components are reshaped back to EEM form with shape (n_components, n_ex, n_em). Component amplitudes are reported as Fmax-like values using:

- fmax: scores from the NMF factorization, rescaled to account for component normalization.
- nnls_fmax: scores refit by non-negative least squares (NNLS) against the extracted components, which can differ slightly from fmax due to the non-exact NMF reconstruction and/or constraints.
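The unfolding and reshaping steps can be sketched with NumPy reshapes. W and H below are random stand-ins for fitted factors, so the product W @ H does not approximate the data; the example only demonstrates the shapes and the round trip between the 3D and 2D layouts.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_ex, n_em, n_components = 6, 4, 5, 2
eem_stack = rng.random((n_samples, n_ex, n_em))

# Unfold each EEM into a row of n_pixels = n_ex * n_em values.
X = eem_stack.reshape(n_samples, n_ex * n_em)

# Random stand-ins for fitted factors.
W = rng.random((n_samples, n_components))      # sample scores
H = rng.random((n_components, n_ex * n_em))    # unfolded components

# Fold the components and the reconstruction back to EEM form.
components = H.reshape(n_components, n_ex, n_em)
reconstructed = (W @ H).reshape(n_samples, n_ex, n_em)
```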

Optional regularization / constraints (solver-dependent) include:

- Non-negativity (always enforced by this model).
- Elastic-net regularization on W and/or H (L1/L2 mix).
- Quadratic priors on W and/or H (controlled by prior_dict_W, prior_dict_H and gamma_W, gamma_H), with NaNs allowed to skip entries. This is useful when fitted scores or spectral components should be close (but not necessarily identical) to prior knowledge. For example, if a component's concentration is known for some samples, a prior vector of length n_samples can be passed with real values for known samples and NaN for unknown samples.
- A ratio constraint on paired rows of W: W[idx_top] ≈ beta * W[idx_bot]. This is useful when the ratios of component amplitudes between two sets of samples should be constant. For example, if each sample is measured both unquenched and quenched using a fixed quencher dosage, then for a given chemically consistent component the ratio between unquenched and quenched amplitudes may be approximately constant across samples (Hu et al., ES&T, 2025). In this case, passing the unquenched and quenched sample indices to idx_top and idx_bot encourages a constant ratio. lam controls the strength of this regularization.

Parameters:

Name Type Description Default
n_components int

Number of NMF components (rank of the factorization).

required
solver {'cd', 'mu', 'hals'}

Optimization algorithm used to fit NMF.

- 'cd': Coordinate Descent solver (scikit-learn decomposition.NMF).
- 'mu': Multiplicative Updates solver (scikit-learn decomposition.NMF).
- 'hals': Hierarchical Alternating Least Squares solver with optional priors/regularization (eempy.solver.nmf_with_prior_hals).

'cd'
init str

Initialization strategy passed to the selected solver. Common options include 'random', 'nndsvd', 'nndsvda', 'nndsvdar' (solver-dependent). For HALS, a custom initialization can be provided via custom_init when supported.

'nndsvda'
custom_init optional

Custom initialization passed to the HALS solver (when supported by the backend implementation).

None
fixed_components optional

Component(s) to keep fixed during fitting (backend-specific behavior).

None
beta_loss {'frobenius', 'kullback-leibler', 'itakura-saito'}

Beta divergence used by the 'mu' solver. Ignored by 'cd' and 'hals'.

'frobenius'
alpha_sample float

Regularization strength applied to the sample-mode factor matrix W (backend-specific). For scikit-learn, this maps to alpha_W.

0
alpha_component float

Regularization strength applied to the component matrix H (backend-specific). For scikit-learn, this maps to alpha_H.

0
l1_ratio float

Elastic-net mixing parameter used by the backend (1 corresponds to L1 only; 0 to L2 only).

1
prior_dict_W dict

Prior information for the sample-mode factor matrix W (HALS solver only). Keys are component indices (int); values are 1D arrays of length n_samples. Use NaNs to indicate unknown entries that should not contribute to the penalty.

None
prior_dict_H dict

Prior information for the component matrix H (HALS solver only). Keys are component indices (int); values are 1D arrays of length n_pixels. Use NaNs to indicate unknown entries that should not contribute to the penalty.

None
prior_dict_A dict

Additional prior mapping used by the HALS backend (backend-specific).

None
prior_dict_B dict

Additional prior mapping used by the HALS backend (backend-specific).

None
prior_dict_C dict

Additional prior mapping used by the HALS backend (backend-specific).

None
gamma_W float

Additional prior/penalty strength for the sample-mode factor matrix W (HALS solver only).

0
gamma_H float

Additional prior/penalty strength for the component matrix H (HALS solver only).

0
gamma_A float

Additional prior/penalty strength for backend-specific prior term A (HALS solver only).

0
gamma_B float

Additional prior/penalty strength for backend-specific prior term B (HALS solver only).

0
gamma_C float

Additional prior/penalty strength for backend-specific prior term C (HALS solver only).

0
ref_components optional

Reference component definitions used by the backend prior/regularization logic (backend-specific).

None
kw_top str

Keyword used to identify "top" EEMs from eem_dataset.index during fitting. "Top" and "bot" EEMs are assumed to be paired one-to-one and aligned by selection order (first "top" ↔ first "bot", etc.). A recommended naming convention is a shared sample name followed by kw_top or kw_bot for the two EEMs (e.g., unquenched and quenched) derived from the same original sample, so that the names in a pair differ only by kw_top/kw_bot and alignment is preserved when selecting by keyword. Alternatively, provide idx_top and idx_bot to specify the "top" and "bot" EEMs directly by position.

None
kw_bot str

Keyword used to identify "bot" EEMs from eem_dataset.index during fitting.

None
idx_top list of int

0-based integer positions of samples in eem_dataset used as the numerator ("top") group (e.g., [0, 1, 2]).

None
idx_bot list of int

0-based integer positions of samples in eem_dataset used as the denominator ("bot") group (e.g., [3, 4, 5]).

None
lam float

Strength of ratio-based regularization between "top" and "bot" samples (HALS solver only).

0
fit_rank_one bool

Whether to enable a rank-one component constraint/penalty in the HALS backend (backend-specific).

False
normalization {'pixel_std', None}

Optional preprocessing applied to the unfolded data matrix before factorization.
- None: no normalization.
- 'pixel_std': divide each pixel (feature) by its standard deviation across samples.

None
sort_components_by_em bool

Whether to sort components by the emission peak position (ascending). If False, components are kept in the solver output order.

True
max_iter_als int

Maximum number of outer iterations for the HALS solver.

100
max_iter_nnls int

Maximum number of iterations for NNLS subproblems (when used by the backend).

500
tol float

Convergence tolerance passed to the solver.

1e-5
random_state int

Random seed used by solvers that support it.

42

Attributes:

Name Type Description
fmax DataFrame or None

Sample-mode component amplitudes computed from the fitted NMF W (and rescaling after component normalization). Columns follow the naming convention "component {i} NMF-Fmax".

nnls_fmax DataFrame or None

Component amplitudes computed by refitting each EEM using NNLS with the fitted components. Columns follow the naming convention "component {i} NNLS-Fmax".

components ndarray or None

Component EEMs with shape (n_components, n_ex, n_em) constructed from the unfolded H. Each component is normalized by its maximum value (peak intensity equals 1), and the scaling is carried into fmax.

eem_stack_train ndarray or None

EEM stack used for model fitting, with shape (n_samples, n_ex, n_em).

eem_stack_reconstructed ndarray or None

Reconstructed EEM stack from the fitted model, with shape (n_samples, n_ex, n_em).

eem_stack_unfolded ndarray or None

Unfolded 2D matrix used by the solver, with shape (n_samples, n_pixels).

normalization_factor_std ndarray or None

Per-pixel standard deviation used when normalization='pixel_std'. Shape is (n_pixels,). None if no pixel-wise standard-deviation normalization was applied.

normalization_factor_max ndarray or None

Per-component scaling factors (maximum value of each component in the unfolded space) used to normalize components and rescale reported amplitudes. Shape is (n_components,).

ex_range ndarray or None

Excitation wavelength grid corresponding to components.

em_range ndarray or None

Emission wavelength grid corresponding to components.

beta ndarray or None

Component-wise ratio parameters used when ratio regularization / beta fitting is enabled (backend-specific).

decomposer object or None

Underlying solver object when using scikit-learn NMF (e.g., fitted sklearn.decomposition.NMF). May be None depending on the solver/backend implementation.

reconstruction_error float or None

Reconstruction error if provided by the backend/solver; otherwise None.

objective_function_error object or None

Objective tracking information if provided by the backend/solver; otherwise None.
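The relationship between eem_stack_unfolded, normalization_factor_std, normalization_factor_max, components and fmax can be sketched with plain NumPy. This is a minimal sketch under stated assumptions: W and H are random stand-ins for fitted factors, and whether or where eempy undoes the pixel-wise standard-deviation scaling after fitting is not described on this page, so that step is left out.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_ex, n_em, n_components = 6, 4, 5, 2

eem_stack = rng.random((n_samples, n_ex, n_em))

# Unfold each EEM into one row: (n_samples, n_pixels) with n_pixels = n_ex * n_em.
X = eem_stack.reshape(n_samples, n_ex * n_em)

# 'pixel_std' normalization: divide each pixel by its standard deviation across samples.
factor_std = X.std(axis=0)
X_norm = X / factor_std

# Stand-ins for fitted factors (a real solver would estimate these from X or X_norm).
W = rng.random((n_samples, n_components))
H = rng.random((n_components, n_ex * n_em))

# Normalize each component to a peak of 1 and move the scale into the amplitudes,
# so that the product of amplitudes and components is unchanged.
factor_max = H.max(axis=1)
components = (H / factor_max[:, None]).reshape(n_components, n_ex, n_em)
fmax = W * factor_max[None, :]
```

Because the scale removed from each component is multiplied back into the corresponding column of amplitudes, the reconstruction is the same before and after normalization.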

References

[1] scikit-learn documentation for sklearn.decomposition.NMF (Coordinate Descent and Multiplicative Updates).

[2] Hu, Yongmin, Céline Jacquin, and Eberhard Morgenroth. "Fluorescence Quenching as a Diagnostic Tool for Prediction Reliability Assessment and Anomaly Detection in EEM-Based Water Quality Monitoring." Environmental Science & Technology 59.36 (2025): 19490-19501.

component_peak_locations

component_peak_locations()

Get the excitation/emission wavelengths (ex, em) of the component peaks.

Returns:

Name Type Description
max_exem list

A list of (ex, em) pairs of the component peaks.
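Locating a component peak amounts to finding the position of the maximum in each component EEM and mapping it back to the wavelength grids. The sketch below shows one way to do this with NumPy; the function name peak_locations and the toy arrays are hypothetical, and the ordering of ex_range and em_range in eempy may differ from the ascending grids used here.

```python
import numpy as np

def peak_locations(components, ex_range, em_range):
    """Return [(ex, em), ...] at the maximum of each component EEM."""
    peaks = []
    for comp in components:
        i_ex, i_em = np.unravel_index(np.argmax(comp), comp.shape)
        peaks.append((ex_range[i_ex], em_range[i_em]))
    return peaks

ex_range = np.array([250.0, 260.0, 270.0])
em_range = np.array([300.0, 350.0, 400.0, 450.0])

# Two toy components with shape (n_ex, n_em), each with a single non-zero pixel.
components = np.zeros((2, 3, 4))
components[0, 1, 2] = 1.0   # peak at ex=260, em=400
components[1, 2, 0] = 1.0   # peak at ex=270, em=300

max_exem = peak_locations(components, ex_range, em_range)
```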

fit

fit(eem_dataset)

Fit NMF model.

Parameters:

Name Type Description Default
eem_dataset EEMDataset

The EEM dataset used to fit the NMF model.

required

predict

predict(eem_dataset: EEMDataset, fit_intercept=False, fit_beta=False, idx_top=None, idx_bot=None)

Predict the score and Fmax of a given EEM dataset using the fitted components. This method can be applied to a new EEM dataset that is independent of the one used to fit the NMF model.

Parameters:

Name Type Description Default
eem_dataset EEMDataset

The EEM dataset to be predicted.

required
fit_intercept bool

Whether to calculate the intercept.

False
fit_beta bool

Whether to fit the beta parameter (the ratio of component amplitudes between "top" and "bot" samples).

False
idx_top list

List of indices of samples serving as numerators in ratio calculation.

None
idx_bot list

List of indices of samples serving as denominators in ratio calculation.

None

Returns:

Name Type Description
score_sample DataFrame

The fitted score.

fmax_sample DataFrame

The fitted Fmax.

eem_stack_pred np.ndarray (3d)

The reconstructed EEM stack.
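Conceptually, prediction keeps the fitted components fixed and estimates non-negative amplitudes for the new samples. The sketch below illustrates this with multiplicative updates on unfolded data; it is a stand-in for whatever non-negative least squares routine eempy uses internally, not a description of it. The function name fit_scores_fixed_components is hypothetical, and because the toy components have a peak of 1, scores and Fmax coincide here. Intercept and beta fitting are not covered.

```python
import numpy as np

def fit_scores_fixed_components(X, H, n_iter=2000, eps=1e-12):
    """Non-negative scores W that reduce ||X - W @ H||_F with H held fixed,
    using multiplicative updates (a stand-in for an NNLS solver)."""
    W = np.ones((X.shape[0], H.shape[0]))
    HHt = H @ H.T
    XHt = X @ H.T
    for _ in range(n_iter):
        W *= XHt / (W @ HHt + eps)
    return W

# Two Gaussian-shaped components on a 1D "pixel" axis, each with a peak of 1.
pixels = np.arange(50)
H = np.vstack([np.exp(-0.5 * ((pixels - 15) / 4.0) ** 2),
               np.exp(-0.5 * ((pixels - 35) / 4.0) ** 2)])

# Hypothetical amplitudes of three new samples and their unfolded EEMs.
W_true = np.array([[1.0, 0.5],
                   [0.2, 2.0],
                   [3.0, 0.3]])
X_new = W_true @ H

fmax_pred = fit_scores_fixed_components(X_new, H)
X_pred = fmax_pred @ H   # reconstructed (unfolded) data
```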

residual

residual()

Get the residual of the established NMF model, i.e., the difference between the original EEM dataset and the reconstructed EEM dataset.

Returns:

Name Type Description
res np.ndarray (3d)

the residual

sample_normalized_rmse

sample_normalized_rmse()

Calculate the normalized root mean squared error (normalized RMSE) of the EEM of each sample, defined as the RMSE divided by the mean of the original signal.

Returns:

Name Type Description
normalized_sse DataFrame

Table of normalized RMSE

sample_rmse

sample_rmse()

Calculate the root mean squared error (RMSE) of the EEM of each sample.

Returns:

Name Type Description
sse DataFrame

Table of RMSE
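The per-sample RMSE and its normalized variant (RMSE divided by the mean of the original signal) can be written in a few lines of NumPy. This is a minimal sketch on plain arrays; the helper names below are local to the example, and NaN handling and the DataFrame output of the actual methods are left out.

```python
import numpy as np

def rmse_per_sample(eem_stack, eem_stack_reconstructed):
    """RMSE over all ex/em pixels, one value per sample."""
    res = eem_stack - eem_stack_reconstructed
    return np.sqrt(np.mean(res ** 2, axis=(1, 2)))

def normalized_rmse_per_sample(eem_stack, eem_stack_reconstructed):
    """RMSE divided by the mean of the original signal of each sample."""
    rmse = rmse_per_sample(eem_stack, eem_stack_reconstructed)
    return rmse / eem_stack.mean(axis=(1, 2))

original = np.full((2, 3, 4), 10.0)
reconstructed = original.copy()
reconstructed[1] += 2.0   # constant error of 2 in the second sample only

rmse = rmse_per_sample(original, reconstructed)
nrmse = normalized_rmse_per_sample(original, reconstructed)
```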

variance_explained

variance_explained()

Calculate the explained variance of the established NMF model.

Returns:

Name Type Description
ev float

the explained variance

SplitValidation

Validate PARAFAC or NMF models by comparing component consistency across EEM sub-datasets.

Parameters:

Name Type Description Default
base_model PARAFAC or EEMNMF

Base model used to fit each sub-dataset.

required
n_splits int

Number of splits used to create sub-datasets.

4
combination_size int or {"half"}

Number of splits assembled into each combination. If "half" is passed, each combination uses half of the splits (split-half validation).

"half"
rule {"random", "sequential"}

Split rule for the dataset. "sequential" splits by index order.

"random"
random_state int

Random seed used when rule="random".

None

Attributes:

Name Type Description
eem_subsets dict

Mapping of subset labels to EEMDataset instances.

subset_specific_models dict

Mapping of subset labels to fitted PARAFAC or EEMNMF models.

eem_dataset_full EEMDataset or None

The full dataset used to generate splits.
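The splitting scheme can be pictured as follows: the samples are divided into n_splits groups, and groups are then assembled into combinations that are compared against each other. The sketch below assumes that, with combination_size="half", each half-size combination is paired with its complement (for four splits A, B, C, D: AB vs CD, AC vs BD, AD vs BC). That pairing rule and the function name make_split_pairs are assumptions for illustration; the page does not spell out how eempy pairs the combinations.

```python
import itertools
import numpy as np

def make_split_pairs(n_samples, n_splits=4, rule="random", random_state=None):
    """Split sample positions into n_splits groups, then pair every half-size
    combination of groups with its complement (assumed pairing scheme)."""
    order = np.arange(n_samples)
    if rule == "random":
        order = np.random.default_rng(random_state).permutation(n_samples)
    splits = np.array_split(order, n_splits)

    half = n_splits // 2
    pairs = []
    for combo in itertools.combinations(range(n_splits), half):
        rest = tuple(s for s in range(n_splits) if s not in combo)
        if combo < rest:  # keep each pairing once (AB|CD but not CD|AB)
            a = np.concatenate([splits[s] for s in combo])
            b = np.concatenate([splits[s] for s in rest])
            pairs.append((combo, rest, a, b))
    return pairs

pairs = make_split_pairs(n_samples=12, n_splits=4, rule="sequential")
for combo, rest, a, b in pairs:
    print(combo, "vs", rest, "->", len(a), "and", len(b), "samples")
```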

compare_components

compare_components()

Compare component EEMs between models fitted to paired sub-datasets.

Returns:

Name Type Description
similarities_components DataFrame

Similarity scores for component EEMs.
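One common similarity measure for comparing component EEMs is Tucker's congruence coefficient, which is insensitive to the overall scale of each component. The sketch below computes it for toy Gaussian-shaped components; it is offered as an assumption about what a "similarity score" could look like, since this page does not state which metric compare_components uses.

```python
import numpy as np

def tucker_congruence(a, b):
    """Tucker's congruence coefficient between two arrays of equal shape."""
    a, b = np.ravel(a), np.ravel(b)
    return float(a @ b / np.sqrt((a @ a) * (b @ b)))

ex = np.linspace(250, 450, 21)[:, None]
em = np.linspace(300, 600, 31)[None, :]

def gaussian_peak(ex0, em0):
    """Toy component EEM with a single peak at (ex0, em0)."""
    return np.exp(-((ex - ex0) / 30.0) ** 2 - ((em - em0) / 40.0) ** 2)

comp_a = gaussian_peak(300, 420)
comp_b = 2.5 * gaussian_peak(300, 420)   # same shape, different scale
comp_c = gaussian_peak(400, 550)         # different peak position

sim_same = tucker_congruence(comp_a, comp_b)
sim_diff = tucker_congruence(comp_a, comp_c)
```

Components that differ only by scale give a coefficient of 1, while components with well-separated peaks give a value close to 0.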

compare_parafac_loadings

compare_parafac_loadings()

Compare excitation/emission loadings between PARAFAC models fitted to paired sub-datasets.

This method is only meaningful for PARAFAC models because it relies on Ex/Em loadings.

Returns:

Name Type Description
similarities_ex DataFrame

Similarity scores for excitation loadings per component.

similarities_em DataFrame

Similarity scores for emission loadings per component.

correlation_cv

correlation_cv(ref_col)

Cross-validate reference correlations using component Fmax values.

For each split pair, fit a linear regression on the training subset and evaluate on the paired test subset. Metrics are reported for each component as R2 and RMSE for both training and test.

Parameters:

Name Type Description Default
ref_col str

Column name in eem_dataset_full.ref used as the reference variable.

required

Returns:

Type Description
DataFrame

Table of R2 and RMSE metrics for each component and split pairing.
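The train/test evaluation for a single component and a single split pair can be sketched as below. The example regresses a reference variable on one component's Fmax with np.polyfit and reports R2 and RMSE on both subsets. The direction of the regression, the function name linear_fit_metrics and the toy numbers are assumptions for illustration.

```python
import numpy as np

def linear_fit_metrics(x_train, y_train, x_test, y_test):
    """Fit y = a*x + b on the training subset; report R2 and RMSE on both subsets."""
    a, b = np.polyfit(x_train, y_train, deg=1)

    def metrics(x, y):
        pred = a * x + b
        ss_res = np.sum((y - pred) ** 2)
        ss_tot = np.sum((y - np.mean(y)) ** 2)
        return 1.0 - ss_res / ss_tot, np.sqrt(np.mean((y - pred) ** 2))

    r2_train, rmse_train = metrics(x_train, y_train)
    r2_test, rmse_test = metrics(x_test, y_test)
    return {"r2_train": r2_train, "rmse_train": rmse_train,
            "r2_test": r2_test, "rmse_test": rmse_test}

# Hypothetical Fmax of one component (x) and a reference concentration (y).
fmax_train = np.array([1.0, 2.0, 3.0, 4.0])
ref_train = 2.0 * fmax_train + 1.0
fmax_test = np.array([1.5, 2.5, 3.5])
ref_test = 2.0 * fmax_test + 1.0

result = linear_fit_metrics(fmax_train, ref_train, fmax_test, ref_test)
```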

fit

fit(eem_dataset: EEMDataset)

Fit the base model on each sub-dataset and store the fitted models.

Parameters:

Name Type Description Default
eem_dataset EEMDataset

Full dataset used for splitting and model fitting.

required

Returns:

Name Type Description
self SplitValidation

Fitted validation object.

KMethod

K-method (e.g., K-PARAFACs or K-NMFs) for EEM clustering by minimizing reconstruction error (Hu et al., Water Research, 2025).

This class implements the K-method family of clustering algorithms for excitation–emission matrix (EEM) datasets. The key hypothesis is that fitting EEMs with high chemical composition variability using a single, unified set of components (e.g., one PARAFAC or NMF model) can lead to over-generalized component formation and large reconstruction error. In contrast, EEMs sharing similar chemical compositions can be clustered and represented by cluster-specific component sets, resulting in a number of unique component sets that better capture the variability in chemical composition between clusters and reduce overall reconstruction error.

Based on this hypothesis, K-method searches for a clustering strategy that minimizes the overall reconstruction error by iterating between:
- Estimation: fit a base decomposition model (base_model) separately on each current cluster to obtain cluster-specific models.
- Assignment: assign each sample to the cluster whose model yields the smallest distance (e.g., reconstruction RMSE), forming updated clusters.

Repeating this procedure yields cluster-specific PARAFAC/NMF models that (ideally) reconstruct the dataset better than a single unified model.
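The estimation/assignment iteration can be illustrated with a deliberately simplified stand-in. In the sketch below the base model is swapped for a rank-one SVD of the unfolded samples in each cluster (not PARAFAC or NMF), the distance is the per-sample reconstruction RMSE, and the loop stops when labels no longer change. Elimination of small clusters, the tol-based convergence check and consensus clustering are omitted, and all names are hypothetical.

```python
import numpy as np

def fit_rank1(X):
    """Toy base model: leading right singular vector of the cluster's samples."""
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return vt[0]

def rmse_under_model(X, v):
    """Per-sample RMSE after projecting each row of X onto the unit vector v."""
    recon = np.outer(X @ v, v)
    return np.sqrt(np.mean((X - recon) ** 2, axis=1))

def k_method(X, init_labels, max_iter=20):
    labels = np.asarray(init_labels).copy()
    for _ in range(max_iter):
        clusters = np.unique(labels)
        # Estimation: fit one model per current cluster.
        models = [fit_rank1(X[labels == k]) for k in clusters]
        # Assignment: move each sample to the model with the smallest RMSE.
        errors = np.column_stack([rmse_under_model(X, v) for v in models])
        new_labels = clusters[np.argmin(errors, axis=1)]
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels

# Six samples built from spectrum "a" followed by six built from spectrum "b".
pixels = np.arange(60)
spec_a = np.exp(-0.5 * ((pixels - 15) / 5.0) ** 2)
spec_b = np.exp(-0.5 * ((pixels - 45) / 5.0) ** 2)
scales = np.linspace(1.0, 2.0, 6)
X = np.vstack([np.outer(scales, spec_a), np.outer(scales, spec_b)])

# Deliberately imperfect start: two "b" samples begin in cluster 0.
init = np.array([0] * 8 + [1] * 4)
labels = k_method(X, init)
```

Starting from the imperfect partition, the two misplaced samples move to the cluster whose model reconstructs them better, after which the labels stop changing.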

In addition, K-method can be run multiple times with subsampling to form a consensus matrix and then derive a final clustering using hierarchical clustering on a distance matrix computed from consensus values.

Parameters:

Name Type Description Default
base_model object

Base decomposition model used within each cluster (e.g., an instance of PARAFAC or EEMNMF). Before passing to KMethod, the base model should be properly configured (e.g., number of components, regularizations to be implemented, etc.).

required
n_initial_splits int

Number of splits used in initialization (the first partition of the dataset before iterative refinement).

required
distance_metric {'reconstruction_error', 'reconstruction_error_with_beta', 'quenching_coefficient'}

Criterion used to assign samples to clusters in the assignment step.
- 'reconstruction_error': assign each sample to the model with the smallest per-sample RMSE.
- 'reconstruction_error_with_beta': like 'reconstruction_error', but samples are paired into top/bot groups and reconstructed with a beta-based model that forces the fmax ratios between paired samples to equal the beta values across all samples (requires kw_top and kw_bot in base_model).
- 'quenching_coefficient': assign samples based on the similarity of estimated quenching coefficients derived from paired top/bot samples (requires kw_top and kw_bot).

'reconstruction_error'
max_iter int

Maximum number of K-method iterations in a single base clustering run.

20
tol float

Convergence tolerance based on similarity between cluster-specific models of two consecutive iterations. If the average Tucker’s congruence (or component similarity proxy) exceeds 1 - tol, convergence is declared.

0.001
elimination {'default'} or int

Minimum allowed cluster size during optimization. Clusters with fewer samples than the threshold are removed.
- 'default': use base_model.n_components as the minimum cluster size.
- int: explicit minimum cluster size.

'default'

Attributes:

Name Type Description
unified_model object or None

Unified model fitted once on the full dataset (a deep copy of base_model). Used as a reference for aligning components and for some distance calculations.

label_history list or None

History of cluster assignments. For base clustering runs, this is typically a list containing a DataFrame with per-sample labels across iterations.

error_history list or None

History of per-sample distances/errors (e.g., RMSE) across iterations, typically stored as DataFrames.

silhouette_score float or None

Silhouette score computed on the final distance matrix during hierarchical clustering (when available).

labels ndarray or None

Final cluster labels for each sample. Labels are cluster IDs returned by hierarchical clustering (typically 1..K), or by base clustering when used directly.

index_sorted list or None

Dataset index reordered by the final hierarchical clustering labels (when available).

ref_sorted DataFrame or None

Reference table reordered by the final hierarchical clustering labels (when available).

threshold_r float or None

Distance threshold used for hierarchical clustering cut (derived from the linkage matrix).

eem_clusters dict or None

Mapping from cluster label to an EEMDataset containing the EEMs assigned to that cluster.

cluster_specific_models dict or None

Mapping from cluster label to the fitted cluster-specific model (deep copies of base_model fitted on each cluster).

consensus_matrix ndarray or None

Consensus matrix M with shape (n_samples, n_samples), where M[i, j] is the fraction of base runs in which sample i and j co-occur in the same cluster.

distance_matrix ndarray or None

Distance matrix derived from consensus, typically D[i, j] = (1 - M[i, j])**p (see consensus_conversion_power).

linkage_matrix ndarray or None

Hierarchical clustering linkage matrix computed from the consensus-derived distance matrix.

consensus_matrix_sorted ndarray or None

Consensus matrix reordered by the final cluster labels for visualization.

References

[1] Hu, Yongmin, Eberhard Morgenroth, and Céline Jacquin. "Online monitoring of greywater reuse system using excitation-emission matrix (EEM) and K-PARAFACs." Water Research 268 (2025): 122604.

base_clustering

base_clustering(eem_dataset: EEMDataset)

Run the clustering once.

Parameters:

Name Type Description Default
eem_dataset EEMDataset

The EEM dataset to be clustered.

required

Returns:

Name Type Description
cluster_labels ndarray

Cluster labels.

label_history list

Cluster labels in each iteration.

error_history list

Average reconstruction error (RMSE) in each iteration.

calculate_consensus

calculate_consensus(eem_dataset: EEMDataset, n_base_clusterings: int, subsampling_portion: float)

Run the clustering many times and combine the outputs of all runs to obtain an optimal clustering.

Parameters:

Name Type Description Default
eem_dataset EEMDataset

EEM dataset.

required
n_base_clusterings int

Number of base clustering runs.

required
subsampling_portion float

The fraction of EEMs retained after subsampling.

required

Returns:

Name Type Description
self object

The established K-method model (e.g., K-PARAFACs).
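The consensus matrix summarizes how often two samples end up in the same cluster across base clustering runs. The sketch below computes it from a table of labels with one row per run. It ignores subsampling (every run labels every sample), which is a simplification; how eempy treats samples left out of a given run is not described here, and the function name consensus_matrix is local to this example.

```python
import numpy as np

def consensus_matrix(label_runs):
    """M[i, j] = fraction of runs in which samples i and j share a cluster.
    `label_runs` has shape (n_runs, n_samples); subsampling is ignored here."""
    label_runs = np.asarray(label_runs)
    same = label_runs[:, :, None] == label_runs[:, None, :]
    return same.mean(axis=0)

# Three hypothetical base clustering runs over four samples.
runs = [[0, 0, 1, 1],
        [1, 1, 0, 0],   # same partition as the first run, different label names
        [0, 1, 1, 1]]
M = consensus_matrix(runs)
```

Only co-membership matters, so runs that produce the same partition under different label names contribute identically.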

hierarchical_clustering

hierarchical_clustering(eem_dataset, n_clusters, consensus_conversion_power=1)

Parameters:

Name Type Description Default
eem_dataset EEMDataset

EEM dataset to cluster.

required
n_clusters int

Number of clusters.

required
consensus_conversion_power float

The power p used to convert the consensus matrix (M) into the distance matrix (D) for hierarchical clustering: D[i, j] = (1 - M[i, j])**p. It controls how quickly distance grows as consensus falls. A smaller value leads to a sharper increase in distance for consensus values close to 1.

1
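The effect of consensus_conversion_power can be checked numerically. The sketch below applies D[i, j] = (1 - M[i, j])**p to a small hand-written consensus matrix for p = 1 and p = 0.5; the function name consensus_to_distance and the matrix values are made up for illustration.

```python
import numpy as np

def consensus_to_distance(M, power=1.0):
    """D[i, j] = (1 - M[i, j]) ** power."""
    return (1.0 - np.asarray(M, dtype=float)) ** power

M = np.array([[1.0, 0.99, 0.5],
              [0.99, 1.0, 0.5],
              [0.5, 0.5, 1.0]])

D_linear = consensus_to_distance(M, power=1.0)
D_sqrt = consensus_to_distance(M, power=0.5)
```

With p = 0.5, a pair with consensus 0.99 is assigned a distance of about 0.1 instead of 0.01, so small departures from full consensus are penalized much more strongly.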

predict

predict(eem_dataset: EEMDataset)

Fit the cluster-specific models to a given EEM dataset. Each EEM in the dataset is assigned to the model that produces the lowest RMSE.

Parameters:

Name Type Description Default
eem_dataset EEMDataset

The EEM dataset to be predicted.

required

Returns:

Name Type Description
best_model_label DataFrame

The best-fit model for every EEM.

score_all DataFrame

The score fitted with each cluster-specific model.

fmax_all DataFrame

The fmax fitted with each cluster-specific model.

sample_error DataFrame

The RMSE fitted with each cluster-specific model.

combine_eem_datasets

combine_eem_datasets(list_eem_datasets)

Combine all EEMDataset objects in a list into a single EEMDataset.

Parameters:

Name Type Description Default
list_eem_datasets list

List of EEM datasets.

required

Returns:

Name Type Description
eem_dataset_combined EEMDataset

EEM dataset combined.
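At the array level, combining datasets means concatenating the EEM stacks along the sample axis and merging the sample labels in the same order. The sketch below works on plain NumPy arrays and lists rather than EEMDataset objects. It assumes all stacks share the same ex/em grid; the function name combine_stacks is hypothetical, and how eempy merges ref and cluster is not covered.

```python
import numpy as np

def combine_stacks(stacks, indexes):
    """Concatenate EEM stacks along the sample axis and merge their sample labels.
    All stacks are assumed to share the same ex/em grid."""
    shapes = {s.shape[1:] for s in stacks}
    if len(shapes) != 1:
        raise ValueError("all EEM stacks must have the same (n_ex, n_em) shape")
    combined = np.concatenate(stacks, axis=0)
    combined_index = [name for idx in indexes for name in idx]
    return combined, combined_index

stack_1 = np.zeros((2, 3, 4))
stack_2 = np.ones((3, 3, 4))
combined, names = combine_stacks([stack_1, stack_2],
                                 [["a1", "a2"], ["b1", "b2", "b3"]])
```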