A Large Causal Model (LCM) for causal discovery is a model that unveils the cause-and-effect relationships between variables in complex systems. These models are particularly useful for understanding how changes in one factor lead to changes in another, rather than just identifying correlations. Additionally, they help identify which variables influence others, especially in complex or high-dimensional datasets where traditional methods might miss key connections.
Approach | Discovers causal connections | Fast inference | Scales to large datasets |
---|---|---|---|
Traditional Causal Discovery Methods | Yes | No | No |
Large Causal Models | Yes | Yes | Yes |
The LCM takes as input a dataset of time series and automatically predicts the full time graph, which represents the causal relationships between the time series over a given time window. This graph is then condensed into a simpler representation, the summary graph, in which each time series is represented as a node and each discovered cause-effect relationship is represented as a directed link.
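The condensation from full time graph to summary graph can be illustrated with a minimal sketch. It assumes the full time graph is available as a list of `(source, target, lag)` triples, which is a hypothetical representation chosen for illustration and not necessarily the one the LCM uses internally; the function name `summarize` is likewise an assumption.

```python
from collections import defaultdict


def summarize(lagged_edges):
    """Collapse lagged edges (source, target, lag) into summary edges.

    Returns a dict mapping each (source, target) pair to the sorted
    list of lags at which that relationship appears.
    """
    summary = defaultdict(set)
    for source, target, lag in lagged_edges:
        summary[(source, target)].add(lag)
    return {edge: sorted(lags) for edge, lags in summary.items()}


# Hypothetical full time graph over three series (illustrative values only).
full_time_graph = [("A", "B", 1), ("A", "B", 2), ("B", "C", 1)]
summary_graph = summarize(full_time_graph)
print(summary_graph)
```

Each key of the result corresponds to one directed link in the summary graph, regardless of how many lags support it.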
The setup phase is divided into three steps:

1. Install the dependencies listed in requirements.txt (e.g. with pip install -r requirements.txt).
2. Download one of the pretrained models listed below.
3. Place the downloaded model in the res/ folder.

Currently, three pretrained LCM models of different sizes are available:

- LCM_CI_CR_1.3M_12_3_joint_220k (30 MB, Small): download link
- LCM_CI_9.6M_joint_220k_permuted_3 (167 MB, Medium): download link
- lcm_CI_RH_12_3_merged_290k (4.6 GB, Large): download link

First, import the necessary modules for data generation, model prediction, and result visualization:
from pathlib import Path
from utils.causal_model import CausalModel # architecture module
from utils.data_utils import create_example_data # example data creation module
from utils.plotting_utils import plot_summary_from_pred, plot_summary_graph # plotting module
The LCM takes as input a temporal dataset of shape (N, D), where N is the sample size and D is the number of features (time series). In this example, we generate synthetic data with 1000 time samples, where each column represents a different time series. The data are min-max normalized, and the random seed is set to 42 for reproducibility.
set_seed(42)
df = create_example_data(n=1000)
variable_names = list(df.columns)
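For reference, min-max normalization rescales each column to the range [0, 1]. The sketch below works on plain Python lists; the function name `min_max_normalize` and the sample values are illustrative and not part of the repository, and mapping constant columns to 0.0 is an assumption made here to avoid division by zero.

```python
def min_max_normalize(rows):
    """Rescale each column of a list of equal-length rows to [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    normalized = []
    for row in rows:
        normalized.append([
            # A constant column has zero range; map it to 0.0 (assumption).
            (value - lo) / (hi - lo) if hi > lo else 0.0
            for value, lo, hi in zip(row, lows, highs)
        ])
    return normalized


# Three time samples (rows) of two series (columns).
raw = [[1.0, 10.0], [2.0, 30.0], [3.0, 20.0]]
scaled = min_max_normalize(raw)
print(scaled)
```

With a pandas DataFrame, the equivalent column-wise operation for non-constant columns is `(df - df.min()) / (df.max() - df.min())`.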
Load the pretrained .ckpt model for causal prediction:
models_path = 'res'
model_name = 'lcm_CI_RH_12_3_merged_290k'
model = CausalModel(model_name=model_name, model_path=Path(models_path) / f"{model_name}.ckpt")
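Before loading, it can help to check which checkpoints are actually present in the models folder. The following is a minimal sketch using only the standard library; `find_checkpoints` is a hypothetical helper, not part of the repository, and it assumes checkpoints are stored as `<model_name>.ckpt`, as in the loading code above.

```python
import tempfile
from pathlib import Path


def find_checkpoints(models_path):
    """Return the names (without the .ckpt extension) of all checkpoints in a folder."""
    return sorted(p.stem for p in Path(models_path).glob("*.ckpt"))


# Demonstration on a temporary folder with a placeholder checkpoint file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "demo_model.ckpt").touch()
    (Path(tmp) / "notes.txt").touch()
    available = find_checkpoints(tmp)

print(available)
```

Calling `find_checkpoints("res")` in the project root would then list the model names that can be passed as `model_name`.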
Run model.predict to perform causal discovery on the data. The max_lag_to_predict parameter specifies the maximum lag, i.e. the size of the time window over which causal relationships are analyzed:
# Run causal discovery with a maximum lag of 1
pred = model.predict(df, max_lag_to_predict = 1)
The result is a lagged adjacency tensor of shape (D, D, max_lag), where D is the number of time series and pred[i, j, k] represents the probability that the j-th time series at time t-(max_lag - k) causes the i-th time series at time t.
The predicted causal relationships can be visualized using plot_summary_from_pred. The plt_thr parameter controls the density of the graph: higher values result in fewer edges being displayed.
plot_summary_from_pred(pred, variable_names, plt_thr=0.25)
In the resulting graph, an edge from time series A to B marked as t-1 means that time series A at time t-1 causes time series B at time t.
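The indexing convention and the effect of the threshold can be made concrete with a small sketch that lists the edges a given threshold would keep. It is only an illustration: `lagged_edges` is a hypothetical helper, the prediction values are made up, and the strict `>` comparison is an assumption about how thresholding behaves, not a description of what `plot_summary_from_pred` does internally.

```python
def lagged_edges(pred, names, threshold):
    """List (cause, effect, lag, probability) for entries above a threshold.

    pred[i][j][k] is read as the probability that series j at time
    t - (max_lag - k) causes series i at time t.
    """
    max_lag = len(pred[0][0])
    edges = []
    for i, effect in enumerate(names):
        for j, cause in enumerate(names):
            for k in range(max_lag):
                prob = pred[i][j][k]
                if prob > threshold:
                    # Index k corresponds to a lag of (max_lag - k) time steps.
                    edges.append((cause, effect, max_lag - k, prob))
    return edges


# Hypothetical prediction for two series with max_lag = 2.
names = ["A", "B"]
pred = [
    [[0.1, 0.0], [0.2, 0.1]],   # effects on A: from A, then from B
    [[0.05, 0.9], [0.3, 0.0]],  # effects on B: from A, then from B
]
edges = lagged_edges(pred, names, threshold=0.25)
for cause, effect, lag, prob in edges:
    print(f"{cause} (t-{lag}) -> {effect} (t): {prob}")
```

Raising the threshold to 0.5 in this example would drop the weaker B to B edge and keep only the A to B edge at lag 1, which mirrors how a higher plt_thr yields a sparser graph.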
As an alternative to using a specific causal model or threshold, the get_best_graph function can be applied. It evaluates all available models and thresholds and returns the causal graph that best represents the relationships in the dataset.
import utils.prediction_utils as pu
G = pu.get_best_graph(df, models_folder = models_path)
plot_summary_graph(G, variable_names)
The following publications have emerged from this research collaboration:
The latest version of the LCMs works under the following assumptions: