A Large Causal Model (LCM) for causal discovery is a model that unveils the cause-and-effect relationships between variables in complex systems. These models are particularly useful for understanding how changes in one factor lead to changes in another, rather than just identifying correlations. Additionally, they help identify which variables influence others, especially in complex or high-dimensional datasets where traditional methods might miss key connections.
Approach | Discovers causal connections | Fast inference | Scales to large datasets |
---|---|---|---|
Traditional Causal Discovery Methods | Yes | No | No |
Large Causal Models | Yes | Yes | Yes |
The LCM takes as input a dataset of time series and automatically predicts the full time graph representing the causal relationships between the time series over a time window. This graph is then condensed into a simpler representation, the summary graph, in which each time series is a node and each discovered cause-effect relationship is a directed link.
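The condensation into a summary graph can be sketched as follows. This is only an illustration, not the LCM's actual implementation: the function name `summary_from_lagged`, the 0.5 threshold, and the max-over-lags rule are assumptions. It takes a tensor of lagged edge probabilities and marks a directed link whenever any lag exceeds the threshold.

```python
import numpy as np

def summary_from_lagged(lagged: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Collapse a (D, D, L) tensor of lagged edge probabilities into a (D, D)
    boolean summary adjacency matrix.

    summary[i, j] is True when series j is predicted to cause series i
    at any of the L lags.
    """
    if lagged.ndim != 3 or lagged.shape[0] != lagged.shape[1]:
        raise ValueError("expected a tensor of shape (D, D, L)")
    # Keep the strongest lag for every (effect, cause) pair, then threshold.
    return lagged.max(axis=2) > threshold

# Toy example with 3 series and 2 lags (values are made up for illustration).
lagged = np.zeros((3, 3, 2))
lagged[1, 0, 1] = 0.9  # series 0 -> series 1, above the threshold
lagged[2, 1, 0] = 0.7  # series 1 -> series 2, above the threshold
lagged[0, 2, 0] = 0.2  # series 2 -> series 0, below the threshold
summary = summary_from_lagged(lagged)
```

Taking the maximum over lags is one simple choice; other aggregation rules are possible and may match the repository's behavior more closely.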
The setup phase is divided into three steps:
1. Install the dependencies listed in `requirements.txt` (e.g. with `pip install -r requirements.txt`).
2. Download a pretrained model.
3. Place the downloaded model in the `res/` folder.

Currently, two pretrained LCM models of different sizes are available:

- `deep_CI_12_3_fine_tuned_frozen_sim_82k_pre_joint_220k` (210 MB, Small): download link (external link will be available soon)
- `lcm_CI_RH_12_3_merged_290k` (4.6 GB, Large): download link (external link will be available soon)

First, import the necessary modules for data generation, model prediction, and result visualization:
from pathlib import Path
from utils.model_wrapper import Architecture_PL
from utils.cp_utils import set_seed, create_example_data, run_cp_and_parse_res
from utils.plotting_utils import plot_summary_from_pred
The LCM takes as input a temporal dataset of shape `(N, D)`, where N is the sample size and D the feature size (number of time series). In this example, we generate synthetic data with 1000 time samples, where each column represents a different time series. The data are min-max normalized, and the random seed is set to `42` for reproducibility.
set_seed(42)
df = create_example_data(n=1000)
variable_names = list(df.columns)
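To use your own data instead of the synthetic example, the series would need the same samples-by-series layout and a comparable scaling. Below is a minimal sketch of column-wise min-max normalization with NumPy. The helper name `min_max_normalize` and the handling of constant columns are assumptions, and the normalization performed inside `create_example_data` may differ in detail.

```python
import numpy as np

def min_max_normalize(data) -> np.ndarray:
    """Scale each column (one time series) of an (N, D) array to [0, 1]."""
    data = np.asarray(data, dtype=float)
    col_min = data.min(axis=0)
    col_range = data.max(axis=0) - col_min
    # Assumption: constant columns are mapped to 0 to avoid division by zero.
    safe_range = np.where(col_range == 0, 1.0, col_range)
    return (data - col_min) / safe_range

# Three made-up series on very different scales, 1000 time samples each.
rng = np.random.default_rng(42)
raw = rng.normal(size=(1000, 3)) * [1.0, 10.0, 100.0]
scaled = min_max_normalize(raw)
```

After scaling, every column spans exactly the interval from 0 to 1 regardless of its original units.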
Load the pretrained `.ckpt` model for causal prediction:
models_path = 'res'
model_name = 'lcm_CI_RH_12_3_merged_290k'
model = Architecture_PL.load_from_checkpoint(Path(models_path) / f"{model_name}.ckpt")
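Because the pretrained models have to be downloaded manually, it may help to confirm that the file is in place before loading it. The sketch below uses only `pathlib`. The helper `resolve_checkpoint` is hypothetical and not part of the repository.

```python
from pathlib import Path

def resolve_checkpoint(models_path: str, model_name: str) -> Path:
    """Return '<models_path>/<model_name>.ckpt', or raise a clear error if it is missing."""
    ckpt_path = Path(models_path) / f"{model_name}.ckpt"
    if not ckpt_path.is_file():
        raise FileNotFoundError(
            f"Checkpoint not found at {ckpt_path}. "
            "Download the pretrained model and place it in this folder."
        )
    return ckpt_path
```

The returned path could then be passed to `Architecture_PL.load_from_checkpoint` in place of the inline path expression.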
Run `run_cp_and_parse_res` to perform causal discovery on the data. The `max_lag` parameter specifies the maximum time window size for analyzing causal relationships:
# Run causal discovery with a maximum lag of 2
pred = run_cp_and_parse_res(model_name, model=model, df=df, max_lag=2, seed=42)
The result is a lagged adjacency tensor of shape `(D, D, max_lag)`, where D is the number of time series and `pred[i, j, k]` represents the probability that the j-th time series at time t-(max_lag - k) causes the i-th time series at time t.

The predicted causal relationships can be visualized using `plot_summary_from_pred`. The `plt_thr` parameter controls the density of the graph: higher values result in fewer edges being displayed.
plot_summary_from_pred(pred, variable_names, plt_thr=0.5)
In the resulting graph, an edge from time series A to B marked as `t-1` means that time series A at time t-1 caused time series B at time t.
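The indexing convention of the prediction tensor can be made concrete with a short sketch that lists the entries above a threshold together with their lag labels. It assumes the tensor can be converted to a NumPy array and is indexed as `pred[i, j, k]` as described above. The function name `list_lagged_edges` and the toy values are hypothetical, and the actual return type of `run_cp_and_parse_res` may require a conversion first.

```python
import numpy as np

def list_lagged_edges(pred, names, threshold=0.5):
    """Return (cause, effect, lag, probability) tuples for entries above threshold.

    pred[i, j, k] is read as: series j at time t-(max_lag - k)
    causes series i at time t.
    """
    pred = np.asarray(pred)
    max_lag = pred.shape[2]
    edges = []
    for i, j, k in zip(*np.nonzero(pred > threshold)):
        lag = max_lag - k  # k = 0 is the oldest lag, k = max_lag - 1 is lag 1
        edges.append((names[j], names[i], int(lag), float(pred[i, j, k])))
    return edges

# Toy tensor for two series and max_lag = 2 (values are made up).
pred = np.zeros((2, 2, 2))
pred[1, 0, 1] = 0.8  # A at t-1 -> B at t
pred[0, 1, 0] = 0.6  # B at t-2 -> A at t
for cause, effect, lag, prob in list_lagged_edges(pred, ["A", "B"]):
    print(f"{cause} (t-{lag}) -> {effect} (t): p={prob:.2f}")
```

Raising `threshold` in this sketch has the same qualitative effect as raising `plt_thr` in the plot: fewer edges survive.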
Coming soon
The latest version of the LCMs works under the following assumptions: