HiggsML package

HiggsML.datasets module

class HiggsML.datasets.Data(input_dir, test_size=0.3)

Bases: object

A class to represent a dataset.

Parameters:
  • input_dir (str): The directory path of the input data.

Attributes:
  • __train_set (dict): A dictionary containing the train dataset.

  • __test_set (dict): A dictionary containing the test dataset.

  • input_dir (str): The directory path of the input data.

Methods:
  • load_train_set(): Loads the train dataset.

  • load_test_set(): Loads the test dataset.

  • get_train_set(): Returns the train dataset.

  • get_test_set(): Returns the test dataset.

  • delete_train_set(): Deletes the train dataset.

  • get_syst_train_set(): Returns the train dataset with systematic variations.

delete_train_set()

Deletes the train dataset.

get_test_set()

Returns the test dataset.

Returns:

dict: The test dataset.

get_train_set()

Returns the train dataset.

Returns:

dict: The train dataset.

HiggsML.datasets.download_dataset(input_str)

Downloads and extracts the NeurIPS 2024 public dataset.

Returns:

Data: The path to the extracted input data.

Raises:

  • HTTPError: If there is an error while downloading the dataset.

  • FileNotFoundError: If the downloaded dataset file is not found.

  • zipfile.BadZipFile: If the downloaded file is not a valid zip file.

HiggsML.ingestion module

class HiggsML.ingestion.Ingestion(data=None)

Bases: object

Class for handling the ingestion process.

Args:

data (object): The data object.

Attributes:
  • start_time (datetime): The start time of the ingestion process.

  • end_time (datetime): The end time of the ingestion process.

  • model (object): The model object.

  • data (object): The data object.

fit_submission()

Fit the submitted model.

get_duration()

Get the duration of the ingestion process.

Returns:

timedelta: The duration of the ingestion process.

init_submission(Model, model_type='sample_model')

Initialize the submitted model.

Args:

Model (object): The model class.

load_train_set(**kwargs)

Load the training set.

Returns:

object: The loaded training set.

predict_submission(test_settings, initial_seed=31415)

Make predictions using the submitted model.

Args:

test_settings (dict): The test settings.

save_duration(output_dir=None)

Save the duration of the ingestion process to a file.

Args:

output_dir (str): The output directory to save the duration file.

save_result(output_dir=None)

Save the ingestion result to files.

Args:

output_dir (str): The output directory to save the result files.

show_duration()

Show the duration of the ingestion process.

start_timer()

Start the timer for the ingestion process.

stop_timer()

Stop the timer for the ingestion process.
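The timer methods above can be pictured with the following minimal sketch. The class name Timer is hypothetical, and returning None before the timer has been stopped is an assumption; this section does not say how Ingestion handles that case.

```python
from datetime import datetime, timedelta


class Timer:
    """Hypothetical stand-in for the start/stop/duration bookkeeping."""

    def __init__(self):
        self.start_time = None
        self.end_time = None

    def start_timer(self):
        self.start_time = datetime.now()

    def stop_timer(self):
        self.end_time = datetime.now()

    def get_duration(self):
        # Assumption: no duration is available until both times are set.
        if self.start_time is None or self.end_time is None:
            return None
        return self.end_time - self.start_time
```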

HiggsML.systematics module

This module contains the systematics functions for the FAIR Challenge. Originally written by David Rousseau and Victor Estrade.

class HiggsML.systematics.V4(apx=0.0, apy=0.0, apz=0.0, ae=0.0)

Bases: object

A simple 4-vector class to ease calculation

copy()

Copy the current V4 object

Parameters:

None

Returns:

copy (V4): a copy of the current V4 object

deltaEta(v)

Compute the pseudo-rapidity difference with another V4 object

Parameters:

v (V4): the other V4 object

Returns:

deltaEta (float): pseudo-rapidity difference

deltaPhi(v)

Compute the azimuthal angle difference with another V4 object

Parameters:

v (V4): the other V4 object

Returns:

deltaPhi (float): azimuthal angle difference
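As a rough illustration of an azimuthal difference, the sketch below wraps the raw difference into [-pi, pi). The standalone function delta_phi and the choice of range are assumptions for illustration; this section does not state which convention V4.deltaPhi follows.

```python
import math


def delta_phi(phi1, phi2):
    """Return phi1 - phi2 wrapped into the interval [-pi, pi)."""
    return (phi1 - phi2 + math.pi) % (2.0 * math.pi) - math.pi
```

Wrapping matters because two directions at phi = 3.0 and phi = -3.0 are close on the circle even though their raw difference is 6.0.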

deltaR(v)

Compute the delta R with another V4 object

Parameters:

v (V4): the other V4 object

Returns:

deltaR (float): delta R with the other V4 object
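Delta R is conventionally the distance in the (eta, phi) plane. The sketch below uses that standard definition with a hypothetical free function delta_r taking plain floats; it is not taken from the V4 source.

```python
import math


def delta_r(eta1, phi1, eta2, phi2):
    """Distance in the (eta, phi) plane, with the phi difference wrapped."""
    d_eta = eta1 - eta2
    d_phi = (phi1 - phi2 + math.pi) % (2.0 * math.pi) - math.pi
    return math.hypot(d_eta, d_phi)
```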

eWithM(m=0.0)

Compute the energy with a given mass

Parameters:

m (float): mass

Returns:

e (float): energy with a given mass
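For reference, the energy of a particle with 3-momentum (px, py, pz) and mass m follows from E^2 = |p|^2 + m^2. A minimal sketch with a hypothetical free function (the real method reads the momentum from the V4 instance):

```python
import math


def energy_with_mass(px, py, pz, m=0.0):
    """Energy from the relativistic relation E^2 = |p|^2 + m^2."""
    return math.sqrt(px * px + py * py + pz * pz + m * m)
```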

eta()

Compute the pseudo-rapidity

Parameters:

None

Returns:

eta (float): pseudo-rapidity
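The pseudo-rapidity can be written as eta = asinh(pz / pt), which is equivalent to -ln(tan(theta / 2)) for polar angle theta. The sketch below is a standalone illustration with a hypothetical function name; it is undefined for pt = 0, and how V4.eta treats that case is not stated here.

```python
import math


def pseudorapidity(px, py, pz):
    """eta = asinh(pz / pt); raises ZeroDivisionError when pt is zero."""
    pt = math.hypot(px, py)
    return math.asinh(pz / pt)
```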

m()

Compute the mass

Parameters:

None

Returns:

m (float): mass
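The invariant mass follows from m^2 = E^2 - |p|^2. In the sketch below, clipping a slightly negative m^2 (which can arise from rounding) to zero is an assumption; this section does not say how V4.m handles it.

```python
import math


def invariant_mass(px, py, pz, e):
    """Mass from m^2 = E^2 - |p|^2, with negative m^2 clipped to zero."""
    m2 = e * e - (px * px + py * py + pz * pz)
    return math.sqrt(max(m2, 0.0))
```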

p()

Compute the norm of the 3D momentum

Parameters:

None

Returns:

p (float): norm of the 3D momentum

p2()

Compute the squared norm of the 3D momentum

Parameters:

None

Returns:

p2 (float): squared norm of the 3D momentum

phi()

Compute the azimuthal angle

Parameters:

None

Returns:

phi (float): azimuthal angle

pt()

Compute the norm of the transverse momentum

Parameters:

None

Returns:

pt (float): norm of the transverse momentum

pt2()

Compute the squared norm of the transverse momentum

Parameters:

None

Returns:

pt2 (float): squared norm of the transverse momentum

scale(factor=1.0)

Apply a simple scaling

scaleFixedM(factor=1.0)

Scale (keeping mass unchanged)
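One way to read "keeping mass unchanged" is: scale the 3-momentum by the factor, then recompute the energy from the original mass. The sketch below follows that reading with a hypothetical free function; the actual method may differ in detail.

```python
import math


def scale_fixed_mass(px, py, pz, e, factor=1.0):
    """Scale the 3-momentum and recompute E so that the mass is preserved."""
    m2 = max(e * e - (px * px + py * py + pz * pz), 0.0)
    px, py, pz = px * factor, py * factor, pz * factor
    e = math.sqrt(px * px + py * py + pz * pz + m2)
    return px, py, pz, e
```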

setPtEtaPhiM(pt=0.0, eta=0.0, phi=0.0, m=0)

Re-initialize with : pt, eta, phi and m
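The conversion from (pt, eta, phi, m) to Cartesian components uses the standard collider relations px = pt cos(phi), py = pt sin(phi), pz = pt sinh(eta). A minimal sketch that returns a tuple instead of updating a V4 instance (the function name is hypothetical):

```python
import math


def from_pt_eta_phi_m(pt=0.0, eta=0.0, phi=0.0, m=0.0):
    """Return (px, py, pz, e) for the given pt, eta, phi and mass."""
    px = pt * math.cos(phi)
    py = pt * math.sin(phi)
    pz = pt * math.sinh(eta)
    e = math.sqrt(px * px + py * py + pz * pz + m * m)
    return px, py, pz, e
```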

sum(v)

Add another V4 into self

HiggsML.systematics.all_bkg_weight_norm(weights, label, systBkgNorm)

Apply a scaling to the weight.

Args:

  • weights (array-like): The weights to be scaled

  • label (array-like): The labels

  • systBkgNorm (float): The scaling factor

Returns:

array-like: The scaled weights
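The idea of a background normalisation systematic can be sketched in plain Python as follows. The encoding label 0 = background and 1 = signal is an assumption for illustration, as is the function name; the package operates on array-like inputs rather than lists.

```python
def scale_background_weights(weights, labels, factor):
    """Multiply the weights of background events (label 0) by factor."""
    return [w * factor if lab == 0 else w for w, lab in zip(weights, labels)]
```

Signal weights pass through untouched, so only the expected background yield changes. The per-process variants would apply the same operation to a subset selected by detailed label.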

HiggsML.systematics.diboson_bkg_weight_norm(weights, detailedlabel, systBkgNorm)

Apply a scaling to the weight. For Diboson background

Args:
  • weights (array-like): The weights to be scaled

  • detailedlabel (array-like): The detailed labels

  • systBkgNorm (float): The scaling factor

Returns:

array-like: The scaled weights

HiggsML.systematics.get_bootstrapped_dataset(test_set, mu=1.0, seed=31415, ttbar_scale=None, diboson_scale=None, bkg_scale=None, poisson=True)

Generate a bootstrapped dataset

Args:
  • test_set (dict): The original test dataset

  • mu (float): The scaling factor for the htautau signal

  • seed (int): The random seed

  • ttbar_scale (float): The scaling factor for ttbar background

  • diboson_scale (float): The scaling factor for diboson background

  • bkg_scale (float): The scaling factor for other backgrounds

Returns:

pandas.DataFrame: The bootstrapped dataset
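A Poisson bootstrap of weighted events can be sketched as below: each event is repeated a Poisson-distributed number of times whose mean is its weight, with signal weights first multiplied by mu. This is an illustrative reading of the poisson=True option, not the package's code; the helper names and the label encoding (1 = signal) are assumptions, and the background scale factors are left out.

```python
import math
import random


def poisson_draw(rng, lam):
    """Draw from Poisson(lam) with Knuth's method (fine for small lam)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def bootstrap(events, weights, labels, mu=1.0, seed=31415):
    """Repeat each event Poisson(weight) times; signal weights scale by mu."""
    rng = random.Random(seed)
    sample = []
    for event, w, lab in zip(events, weights, labels):
        expected = w * mu if lab == 1 else w
        sample.extend([event] * poisson_draw(rng, expected))
    return sample
```

Fixing the seed makes the pseudo-experiment reproducible, which is why the function takes one.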

HiggsML.systematics.mom4_manipulate(data, systTauEnergyScale, systJetEnergyScale, soft_met, seed=31415)

Manipulate the primary inputs PRI_had_pt, PRI_jet_leading_pt and PRI_jet_subleading_pt, and recompute the other values accordingly.

Args:
  • data (pandas.DataFrame): The dataset to be manipulated

  • systTauEnergyScale (float): The factor applied to PRI_had_pt

  • systJetEnergyScale (float): The factor applied to all jet pt

  • soft_met (float): The additional soft MET energy

  • seed (int): The random seed

Returns:

pandas.DataFrame: The manipulated dataset

HiggsML.systematics.postprocess(data)

Select the events that satisfy the following conditions:
  • PRI_had_pt > 26

  • PRI_jet_leading_pt > 26

  • PRI_jet_subleading_pt > 26

  • PRI_lep_pt > 20

This is applied to the dataset after the systematics are applied

Args:

data (pandas.DataFrame): The manipulated dataset

Returns:

pandas.DataFrame: The postprocessed dataset
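The selection can be sketched on plain dictionaries as below. The thresholds come from the conditions listed above; applying them literally to every event, including events without jets, is an assumption, since this section does not say how missing jets are treated. On a DataFrame the same test would be a boolean mask.

```python
# Thresholds taken from the documented selection conditions.
CUTS = {
    "PRI_had_pt": 26.0,
    "PRI_jet_leading_pt": 26.0,
    "PRI_jet_subleading_pt": 26.0,
    "PRI_lep_pt": 20.0,
}


def passes_selection(event):
    """True if every listed quantity is strictly above its threshold."""
    return all(event[name] > cut for name, cut in CUTS.items())


def select(events):
    """Keep only the events that pass the selection."""
    return [event for event in events if passes_selection(event)]
```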

HiggsML.systematics.systematics(data_set=None, tes=1.0, jes=1.0, soft_met=0.0, seed=31415, ttbar_scale=None, diboson_scale=None, bkg_scale=None, dopostprocess=True)

Apply systematics to the dataset

Args:
  • data_set (dict or pandas.DataFrame): The dataset to apply systematics to

  • tes (float): The factor applied to PRI_had_pt

  • jes (float): The factor applied to all jet pt

  • soft_met (float): The additional soft MET energy

  • seed (int): The random seed

  • ttbar_scale (float): The scaling factor for ttbar background

  • diboson_scale (float): The scaling factor for diboson background

  • bkg_scale (float): The scaling factor for other backgrounds

Returns:

dict: The dataset with applied systematics

HiggsML.systematics.ttbar_bkg_weight_norm(weights, detailedlabel, systBkgNorm)

Apply a scaling to the weight. For ttbar background

Args:
  • weights (array-like): The weights to be scaled

  • detailedlabel (array-like): The detailed labels

  • systBkgNorm (float): The scaling factor

Returns:

array-like: The scaled weights

HiggsML.derived_quantities module

This module contains the functions to calculate the derived quantities of the HEP dataset. Originally written by David Rousseau and Victor Estrade.

HiggsML.derived_quantities.DER_data(data)

The data is expected to be clean (no Weight, no eventId, etc.). This function modifies the dataframe data in place, so make a copy first if you need to keep the original.

HiggsML.derived_quantities.calcul_int(data)

Calculate the px, py, pz and E components of the particles' 4-momenta.

Args:

data (pandas.DataFrame): Input data containing the particle properties.

Returns:

pandas.DataFrame: Dataframe with the derived quantities calculated.

HiggsML.derived_quantities.f_DER_deltaeta_jet_jet(data)

Calculate the absolute value of the difference of the pseudorapidity of the two jets

Parameters:

data (dataframe)

HiggsML.derived_quantities.f_DER_deltar_had_lep(data)

Calculate the delta R between the hadron and the lepton

HiggsML.derived_quantities.f_DER_lep_eta_centrality(data)

Calculate the centrality of the lepton

Parameters:

data (dataframe)
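In the 2014 HiggsML challenge, the lepton eta centrality was defined as exp(-4 / (eta1 - eta2)^2 * (eta_lep - (eta1 + eta2) / 2)^2), where eta1 and eta2 are the jet pseudorapidities. Assuming this package keeps that definition (not confirmed here), a scalar sketch with a hypothetical function name looks like this:

```python
import math


def lep_eta_centrality(eta_lep, eta_jet1, eta_jet2):
    """1 when the lepton sits midway between the jets, 1/e at a jet's eta."""
    midpoint = 0.5 * (eta_jet1 + eta_jet2)
    spread = (eta_jet1 - eta_jet2) ** 2  # must be non-zero
    return math.exp(-4.0 / spread * (eta_lep - midpoint) ** 2)
```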

HiggsML.derived_quantities.f_DER_mass_jet_jet(data)

Calculate the invariant mass of the two jets

Parameters:

data (dataframe)

HiggsML.derived_quantities.f_DER_mass_transverse_met_lep(data)

Calculate the transverse mass between the MET and the lepton

Parameters:

data (dataframe)
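For massless objects the transverse mass is commonly written as mT = sqrt(2 * pt_lep * met * (1 - cos(dphi))). The sketch below uses that textbook form on scalars; whether the package neglects the lepton mass in the same way is an assumption, and the function name is hypothetical.

```python
import math


def transverse_mass(pt_lep, phi_lep, met, phi_met):
    """Transverse mass of a massless lepton and the missing transverse energy."""
    d_phi = phi_lep - phi_met
    return math.sqrt(2.0 * pt_lep * met * (1.0 - math.cos(d_phi)))
```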

HiggsML.derived_quantities.f_DER_mass_vis(data)

Calculate the invariant mass of the hadron and the lepton

Parameters:

data (dataframe)
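If both particles are treated as massless, the invariant mass of the pair reduces to m^2 = 2 * pt1 * pt2 * (cosh(eta1 - eta2) - cos(phi1 - phi2)). The sketch below applies that formula to scalars; the massless treatment and the function name are assumptions for illustration.

```python
import math


def visible_mass(pt1, eta1, phi1, pt2, eta2, phi2):
    """Invariant mass of two massless particles given pt, eta and phi."""
    m2 = 2.0 * pt1 * pt2 * (math.cosh(eta1 - eta2) - math.cos(phi1 - phi2))
    return math.sqrt(max(m2, 0.0))
```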

HiggsML.derived_quantities.f_DER_met_phi_centrality(data)

Calculate the centrality of the MET

Parameters:

data (dataframe)

HiggsML.derived_quantities.f_DER_prodeta_jet_jet(data)

Calculate the product of the pseudorapidities of the two jets

Parameters:

data (dataframe)

HiggsML.derived_quantities.f_DER_pt_h(data)

Calculate the transverse momentum of the hadronic system

Parameters:

data (dataframe)

HiggsML.derived_quantities.f_DER_pt_ratio_lep_had(data)

Calculate the ratio of the transverse momentum of the lepton and the hadron

Parameters:

data (dataframe)

HiggsML.derived_quantities.f_DER_pt_tot(data)

Calculate the total transverse momentum

Parameters:

data (dataframe)

HiggsML.derived_quantities.f_DER_sum_pt(data)

Calculate the sum of the transverse momentum of the lepton, the hadron and the jets

Parameters:

data (dataframe)

HiggsML.derived_quantities.f_del_DER(data)

Delete all the unnecessary columns that were used to calculate the DER variables

Parameters:

data (dataframe)

HiggsML.score module

class HiggsML.score.Scoring

Bases: object

This class is used to compute the scores for the competition. For more details, see the evaluation page.

Attributes:
  • start_time (datetime): The start time of the scoring process.

  • end_time (datetime): The end time of the scoring process.

  • ingestion_results (list): The ingestion results.

  • ingestion_duration (float): The ingestion duration.

  • scores_dict (dict): The scores dictionary.

Methods:
  • start_timer(): Start the timer.

  • stop_timer(): Stop the timer.

  • get_duration(): Get the duration of the scoring process.

  • show_duration(): Show the duration of the scoring process.

  • load_ingestion_duration(ingestion_duration_file): Load the ingestion duration.

  • load_ingestion_results(prediction_dir='./', score_dir='./'): Load the ingestion results.

  • compute_scores(test_settings): Compute the scores.

  • RMSE_score(mu, mu_hat, delta_mu_hat): Compute the RMSE score.

  • MAE_score(mu, mu_hat, delta_mu_hat): Compute the MAE score.

  • Quantiles_Score(mu, p16, p84, eps=1e-3): Compute the Quantiles Score.

  • write_scores(): Write the scores.

  • save_figure(mu, p16s, p84s, set=0): Save the figure.

MAE_score(mu, mu_hat, delta_mu_hat)

Compute the mean absolute error between the true value mu and the predicted value mu_hat.

Args:
  • mu (float): The true value.

  • mu_hat (np.array): The predicted value.

  • delta_mu_hat (np.array): The uncertainty on the predicted value

Quantiles_Score(mu, p16, p84, eps=0.001)

Compute the quantiles score based on the true value mu and the quantiles p16 and p84.

Args:
  • mu (array): The true mu value.

  • p16 (array): The 16th percentile.

  • p84 (array): The 84th percentile.

  • eps (float, optional): A small value to avoid division by zero. Defaults to 1e-3.
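Two quantities that are natural ingredients of an interval score are the coverage (the fraction of intervals [p16, p84] that contain the true mu) and the mean interval width. The sketch below computes only these two; how Quantiles_Score combines them, and where eps enters, is not given in this section, so the function name and the return value are illustrative assumptions.

```python
def coverage_and_width(mu, p16, p84):
    """Return (fraction of intervals containing mu, mean interval width)."""
    inside = [lo <= m <= hi for m, lo, hi in zip(mu, p16, p84)]
    coverage = sum(inside) / len(inside)
    width = sum(hi - lo for lo, hi in zip(p16, p84)) / len(p16)
    return coverage, width
```

For a well-calibrated 68% interval one would expect the coverage to sit near 0.68, with a narrower width being better at equal coverage.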

RMSE_score(mu, mu_hat, delta_mu_hat)

Compute the root mean squared error between the true value mu and the predicted value mu_hat.

Args:
  • mu (float): The true value.

  • mu_hat (np.array): The predicted value.

  • delta_mu_hat (np.array): The uncertainty on the predicted value.
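The two point-estimate metrics can be sketched in plain Python as below, using the textbook definitions of RMSE and MAE for a single true value mu and a list of predictions. The sketch deliberately ignores delta_mu_hat, because this section does not describe how the uncertainty enters the scores; the function names are hypothetical.

```python
import math


def rmse(mu, mu_hat):
    """Root mean squared error of the predictions around the true value."""
    return math.sqrt(sum((m - mu) ** 2 for m in mu_hat) / len(mu_hat))


def mae(mu, mu_hat):
    """Mean absolute error of the predictions around the true value."""
    return sum(abs(m - mu) for m in mu_hat) / len(mu_hat)
```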

compute_scores(test_settings)

Compute the scores for the competition based on the test settings.

Args:

test_settings (dict): The test settings.

load_ingestion_duration(ingestion_duration_file)

Load the ingestion duration.

Args:

ingestion_duration_file (str): The ingestion duration file.

load_ingestion_results(prediction_dir='./', score_dir='./')

Load the ingestion results.

Args:

  • prediction_dir (str, optional): location of the predictions. Defaults to './'.

  • score_dir (str, optional): location of the scores. Defaults to './'.

save_figure(mu, p16s, p84s, set=0, true_mu=None, result_text=None)

Save the figure of the mu distribution.

Args:
  • mu (array): The true mu value.

  • p16s (array): The 16th percentile.

  • p84s (array): The 84th percentile.

  • set (int, optional): The set number. Defaults to 0.

HiggsML.visualization module

HiggsML.visualization.correlation_plots(dfall, target, columns=None)

Plots correlation matrices of the dataset features.

Args:
  • dfall : Pandas Dataframe

  • target : numpy array with labels

  • columns : List of column names to consider

_images/correlation_plots.png
HiggsML.visualization.event_vise_syst(dfall, df_syst, columns=None, sample_size=100)

Plots the event-wise shift between the nominal dataset and the systematically shifted dataset.

Args:

  • dfall : The nominal dataset.

  • df_syst : The systematics shifted dataset.

  • sample_size : The number of samples to consider (default: 100).

  • columns : The list of column names to consider.

images/event_vise_syst.png

HiggsML.visualization.pair_plots(dfall, target, sample_size=10, columns=None)

Plots pair plots of the dataset features.

Args:
  • dfall : Pandas Dataframe

  • target : numpy array with labels

  • sample_size (int): The number of samples to consider (default: 10).

  • columns : List of column names to consider

_images/pair_plot.png
HiggsML.visualization.pair_plots_syst(dfall, df_syst, sample_size=100, columns=None)

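Plots pair plots between the nominal dataset and the systematics shifted dataset.

Args:
  • dfall (DataFrame): The nominal dataset.

  • df_syst (DataFrame): The systematics shifted dataset.

  • sample_size (int): The number of samples to consider (default: 100).

  • columns : The list of column names to consider.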

images/pair_plot_syst.png

HiggsML.visualization.stacked_histogram(dfall, target, weights, detailed_label, field_name, mu_hat=1.0, nbins=30, y_scale='linear')

Plots a stacked histogram of a specific field in the dataset.

Args:
  • dfall : Pandas Dataframe

  • target : numpy array with labels

  • weights : numpy array with event weights

  • detailed_label : numpy array with detailed labels of the events

  • field_name : The name of the field to plot.

  • mu_hat : The value of mu (default: 1.0).

  • nbins (int): The number of bins for the histogram (default: 30).

  • y_scale : The scale of the y-axis (default: 'linear').

_images/stacked_histogram.png
HiggsML.visualization.visualize_coverage(ingestion_result_dict, ground_truth_mus)

Plots a coverage plot of the mu values.

Args:
  • ingestion_result_dict : A dictionary containing the ingestion results.

  • ground_truth_mus : A dictionary of ground truth mu values.

images/coverage_plot.png
HiggsML.visualization.visualize_scatter(ingestion_result_dict, ground_truth_mus)

Plots a scatter plot of ground truth vs. predicted mu values.

Args:
  • ingestion_result_dict : A dictionary containing the ingestion results.

  • ground_truth_mus : A dictionary of ground truth mu values.

images/scatter_plot_mu.png