Large Causal Models for Causal Discovery on Time Series

Large Causal Model

A Large Causal Model (LCM) for causal discovery is a model that uncovers the cause-and-effect relationships between variables in complex systems. Rather than only identifying correlations, it shows how a change in one variable leads to a change in another. This makes LCMs especially useful for complex or high-dimensional datasets, where traditional methods might miss key connections.

Approach	Discovers causal connections	Fast inference	Scales to large datasets
Traditional Causal Discovery Methods	Yes	No	No
Large Causal Models	Yes	Yes	Yes

How does it work?

The LCM takes a dataset of time series as input and automatically predicts the full-time graph, which represents the causal relationships between the time series over a given time window. This graph is then condensed into a simpler representation, the summary graph, in which each time series is a node and each discovered cause-effect relationship is a directed link.
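
The condensation step can be sketched as follows. This is a minimal illustration and not the LCM's actual implementation: it assumes the full-time graph is stored as a boolean array indexed as [effect, cause, lag slot], and the helper name to_summary_graph is hypothetical.

```python
import numpy as np

def to_summary_graph(lagged):
    """Collapse a lagged adjacency array of shape (D, D, L) into a (D, D) summary graph.

    summary[i, j] is True when series j causes series i at any lag.
    """
    lagged = np.asarray(lagged, dtype=bool)
    return lagged.any(axis=2)

# Toy full-time graph over 3 series and 2 lag slots (hypothetical values).
lagged = np.zeros((3, 3, 2), dtype=bool)
lagged[1, 0, 0] = True  # series 0 -> series 1 in the first lag slot
lagged[1, 0, 1] = True  # series 0 -> series 1 in the second lag slot
lagged[2, 1, 1] = True  # series 1 -> series 2 in the second lag slot

summary = to_summary_graph(lagged)

# Directed links of the summary graph as (cause, effect) pairs.
edges = [(int(j), int(i)) for i, j in zip(*np.nonzero(summary))]
```

The two lagged links from series 0 to series 1 collapse into a single directed link, so the summary graph has fewer edges than the full-time graph.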

Try it yourself!

0. Setup

The setup phase is divided into three steps:

Download the code at https://github.com/mensxmachina/LCM-FORTH-Huawei
Install the required dependencies in requirements.txt (e.g. with pip install -r requirements.txt)
Download the pretrained model weights and save them under the res/ folder

Currently, two pretrained LCM models of different sizes are available:

deep_CI_12_3_fine_tuned_frozen_sim_82k_pre_joint_220k (small, 210 MB): download link (external link will be available soon)
lcm_CI_RH_12_3_merged_290k (large, 4.6 GB): download link (external link will be available soon)

1. Import the Required Libraries

First, import the modules needed for data generation, model prediction, and result visualization:


        from pathlib import Path
        from utils.model_wrapper import Architecture_PL
        from utils.cp_utils import set_seed, create_example_data, run_cp_and_parse_res
        from utils.plotting_utils import plot_summary_from_pred

2. Load the Data

The LCM takes as input a temporal dataset of shape (N, D), where N is the sample size (number of time steps) and D is the feature size (number of time series). In this example, we generate synthetic data with 1000 time samples, where each column represents a different time series. The data are min-max normalized, and the random seed is set to 42 for reproducibility.


        set_seed(42)
        
        df = create_example_data(n=1000)
        variable_names = list(df.columns)
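
To prepare your own data in the same (N, D) layout, the min-max scaling can be sketched as below. This is only an assumption about the preprocessing: it scales each column independently to [0, 1], which may differ from what create_example_data does internally, and the helper name min_max_normalize is hypothetical.

```python
import numpy as np

def min_max_normalize(data):
    """Scale each column of an (N, D) array to the range [0, 1]."""
    data = np.asarray(data, dtype=float)
    col_min = data.min(axis=0)
    col_range = data.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0  # constant columns are mapped to 0
    return (data - col_min) / col_range

# Hypothetical raw data: 1000 time steps, 4 time series.
rng = np.random.default_rng(42)
raw = rng.normal(loc=5.0, scale=2.0, size=(1000, 4))

scaled = min_max_normalize(raw)
```

The scaled array keeps one row per time step and one column per time series, matching the input layout described above.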

3. Load the Pretrained Model

Load the .ckpt pretrained model for causal prediction:


        models_path = 'res'
        model_name = 'lcm_CI_RH_12_3_merged_290k'
        
        model = Architecture_PL.load_from_checkpoint(Path(models_path) / f"{model_name}.ckpt")
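
Because the weights are downloaded separately, it can help to check that the checkpoint file is in place before loading it. The sketch below uses only the standard library; the helper name checkpoint_path is hypothetical and not part of the repository.

```python
from pathlib import Path

def checkpoint_path(models_path, model_name):
    """Return the expected .ckpt path, raising a clear error if the file is missing."""
    path = Path(models_path) / f"{model_name}.ckpt"
    if not path.is_file():
        raise FileNotFoundError(
            f"Checkpoint not found: {path}. "
            f"Download the pretrained weights into '{models_path}/' first."
        )
    return path
```

The returned path can then be passed to Architecture_PL.load_from_checkpoint in place of the inline expression used above.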

4. Perform Causal Discovery

Call run_cp_and_parse_res to perform causal discovery on the data. The max_lag parameter specifies the maximum number of past time steps (lags) considered when looking for causal relationships:


        # Run causal discovery with a maximum lag of 2
        pred = run_cp_and_parse_res(model_name, model=model, df=df, max_lag=2)

The result is a lagged adjacency tensor of shape (D, D, max_lag) where:

D is the number of input time series (the number of columns of the input data)
pred[i, j, k] represents the probability that the j-th time-series at time t-(max_lag - k) causes the i-th time-series at time t.
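
The index convention above can be made concrete with a short sketch that turns the tensor into a list of readable edges. It assumes pred can be converted to a NumPy array and that an edge is reported when its probability exceeds a chosen threshold; the helper name list_lagged_edges and the toy values are hypothetical.

```python
import numpy as np

def list_lagged_edges(pred, names, threshold=0.5):
    """Return (cause, effect, lag, probability) tuples for entries above the threshold."""
    pred = np.asarray(pred)
    max_lag = pred.shape[2]
    edges = []
    for i, j, k in zip(*np.nonzero(pred > threshold)):
        lag = max_lag - int(k)  # index k corresponds to time t-(max_lag - k)
        edges.append((names[j], names[i], lag, float(pred[i, j, k])))
    return edges

# Toy prediction for 3 series and max_lag=2 (hypothetical probabilities).
toy = np.zeros((3, 3, 2))
toy[1, 0, 1] = 0.9  # A at t-1 -> B at t
toy[2, 0, 0] = 0.7  # A at t-2 -> C at t
toy[2, 1, 1] = 0.2  # below the threshold, not reported

edges = list_lagged_edges(toy, ["A", "B", "C"])
```

Note that the last index along the lag axis corresponds to lag 1, and the first index to lag max_lag.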

5. Visualize the Results

The predicted causal relationships can be visualized using plot_summary_from_pred. The plt_thr parameter controls the density of the graph: higher values result in fewer edges being displayed.


        plot_summary_from_pred(pred, variable_names, plt_thr=0.5)

In the resulting graph, an edge from time series A to time series B marked as t-1 means that A at time t-1 causes B at time t.
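
To get a feel for a suitable threshold before plotting, one can count how many summary-graph edges survive at different values. The sketch below assumes an edge is kept when its highest probability across lags exceeds the threshold; this is an assumption about how plot_summary_from_pred filters edges, and the helper name count_summary_edges is hypothetical.

```python
import numpy as np

def count_summary_edges(pred, threshold):
    """Count (effect, cause) pairs whose best probability over all lags exceeds the threshold."""
    pred = np.asarray(pred)
    return int((pred.max(axis=2) > threshold).sum())

# Hypothetical prediction tensor for 4 series and 2 lags.
rng = np.random.default_rng(0)
toy_pred = rng.random((4, 4, 2))

# Higher thresholds can only keep the same number of edges or fewer.
counts = {thr: count_summary_edges(toy_pred, thr) for thr in (0.3, 0.5, 0.7, 0.9)}
```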

Publications

The following publications have emerged from this research collaboration:

Temporal Causal-based Simulation for Realistic Time-series Generation: Describes the causal-based resimulation framework developed to augment the training data with realistic, real-world-like time series (link)

Assumptions and Limitations of the Current Model

The latest version of the LCMs works under the following assumptions:

Causal Markov Condition and Faithfulness: These assumptions imply that the causal structure of the system fully captures the dependencies among the variables and that the observed data conforms to the causal relationships described by the model.
Causal Inference of up to 12 Variables and 3 Time Lags: The models can handle inputs of up to 12 variables and up to 3 time lags.
No Contemporaneous Effects: We assume that all cause-effect relationships occur after at least one time lag, meaning that there are no contemporaneous effects where variables influence each other at the same time step.
No Unobserved Confounders: The model relies on the assumption that all relevant causes are observable and accounted for.
Causal Stationarity: The relationships between variables and their influences do not evolve or change in different time periods.
Time-Series Stationarity: The input time-series data are assumed to be stationary, meaning their statistical properties, such as mean and variance, do not change over time.
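
Some of these constraints can be checked before running the model. The sketch below validates the variable and lag limits stated above and adds a crude drift check for stationarity. The helper name check_lcm_input, the drift_tol parameter, and the half-versus-half mean comparison are assumptions of this sketch; the drift check is a rough heuristic, not a formal stationarity test.

```python
import numpy as np

MAX_VARIABLES = 12  # variable limit stated in the list above
MAX_LAGS = 3        # lag limit stated in the list above

def check_lcm_input(data, max_lag, drift_tol=0.5):
    """Return a list of warnings; an empty list means no problem was detected."""
    data = np.asarray(data, dtype=float)
    if data.ndim != 2:
        return [f"expected a 2-D array of shape (samples, variables), got {data.ndim}-D"]
    problems = []
    n_samples, n_vars = data.shape
    if n_vars > MAX_VARIABLES:
        problems.append(f"{n_vars} variables exceed the limit of {MAX_VARIABLES}")
    if not 1 <= max_lag <= MAX_LAGS:
        problems.append(f"max_lag={max_lag} is outside the range 1..{MAX_LAGS}")
    # Crude heuristic: compare the mean of the first and second half of each series,
    # measured in units of that series' standard deviation.
    half = n_samples // 2
    if half > 0:
        scale = data.std(axis=0)
        scale[scale == 0] = 1.0  # avoid division by zero for constant series
        drift = np.abs(data[:half].mean(axis=0) - data[half:].mean(axis=0)) / scale
        for col in np.nonzero(drift > drift_tol)[0]:
            problems.append(f"column {int(col)} may be non-stationary (mean drift detected)")
    return problems

rng = np.random.default_rng(0)
ok = rng.normal(size=(500, 5))                                    # stationary noise
trend = rng.normal(size=(500, 1)) + 0.05 * np.arange(500)[:, None]  # upward trend

ok_report = check_lcm_input(ok, max_lag=2)
trend_report = check_lcm_input(trend, max_lag=2)
```

A check like this cannot verify the causal assumptions (faithfulness, no unobserved confounders, no contemporaneous effects); those have to be judged from knowledge of the system being measured.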
