optrade.data


optrade.data.contracts

class Contract(root, start_date, exp, strike, interval_min, right)[source]

Bases: object

A class representing an options contract with methods for optimal contract selection.

The Contract class defines the structure of an options contract including the underlying security, dates, strike price, and other key parameters.

Parameters:
  • root (str)

  • start_date (str)

  • exp (str)

  • strike (float)

  • interval_min (int)

  • right (str)

__init__(root, start_date, exp, strike, interval_min, right)[source]

Initialize a Contract instance.

Parameters:
  • root (str) – Root symbol of the underlying security (e.g., “AAPL” representing Apple Inc.)

  • start_date (str) – Start date in YYYYMMDD format (e.g., “20241107” representing November 7, 2024)

  • exp (str) – Expiration date in YYYYMMDD format (e.g., “20241206” representing December 6, 2024)

  • strike (float) – Strike price (e.g., 225 representing $225)

  • interval_min (int) – Interval in minutes (e.g., 1 representing 1 minute)

  • right (str) – Option type (‘C’ for call, ‘P’ for put)

Returns:

None

classmethod find_optimal(root, start_date, interval_min, right, target_tte, tte_tolerance, moneyness, strike_band=0.05, hist_vol=None, volatility_scaled=False, volatility_scalar=1.0, verbose=True, warning=False, dev_mode=False)[source]

Find the optimal contract for a given security, start date, and approximate TTE.

Parameters:
  • root (str) – Underlying stock symbol

  • start_date (str) – Start date for the contract in YYYYMMDD format

  • interval_min (int) – Interval in minutes

  • right (str) – Option type (C for call, P for put)

  • target_tte (int) – Target time to expiration in days

  • tte_tolerance (Tuple[int, int]) – Acceptable range for TTE as (min_days, max_days)

  • moneyness (str) – Contract moneyness (OTM, ATM, ITM)

  • strike_band (float | None) – Target percentage band for strike selection

  • hist_vol (float | None) – Historical volatility for dynamic strike selection

  • volatility_scaled (bool) – Whether to select strike by volatility

  • volatility_scalar (float | None) – Scaling factor for volatility-based strike selection

  • verbose (bool) – Whether to print verbose output

  • warning (bool)

  • dev_mode (bool)

Return type:

Contract
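The interaction between moneyness and strike_band can be illustrated with a small standalone sketch. The selection rule below (shift the spot price by strike_band in the direction implied by moneyness and right, then take the nearest listed strike) is an assumption made for illustration, not the documented behavior of find_optimal; pick_strike and the strike list are hypothetical.

```python
from typing import List


def pick_strike(spot: float, strikes: List[float], right: str,
                moneyness: str, strike_band: float = 0.05) -> float:
    """Pick the listed strike closest to a band-shifted target (assumed rule)."""
    if moneyness == "ATM":
        target = spot
    else:
        # OTM calls and ITM puts sit above spot; ITM calls and OTM puts below.
        above = (moneyness == "OTM") == (right == "C")
        target = spot * (1 + strike_band) if above else spot * (1 - strike_band)
    return min(strikes, key=lambda k: abs(k - target))


# Hypothetical strike ladder around a spot price of 227
strikes = [210.0, 215.0, 220.0, 225.0, 230.0, 235.0, 240.0]
otm_call = pick_strike(227.0, strikes, right="C", moneyness="OTM")
itm_call = pick_strike(227.0, strikes, right="C", moneyness="ITM")
atm_call = pick_strike(227.0, strikes, right="C", moneyness="ATM")
```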

load_data(clean_up=False, offline=False, save_dir=None, warning=False, dev_mode=False)[source]

Load data for the selected contract.

Parameters:
  • clean_up (bool) – Whether to clean up the data after use

  • offline (bool) – Whether to load saved data from disk

  • save_dir (str | None) – Directory to save/load data

  • warning (bool) – Whether to display warnings

  • dev_mode (bool) – Whether to use development mode

Returns:

pd.DataFrame – The loaded data containing NBBO quotes and OHLCVC data for the contract and the underlying

Return type:

DataFrame

class ContractDataset(root, total_start_date, total_end_date, contract_stride, interval_min, right, target_tte, tte_tolerance, moneyness, strike_band=0.05, volatility_scaled=False, volatility_scalar=1.0, hist_vol=None, verbose=False, save_dir=None, warning=True, dev_mode=False, contract_dir=None)[source]

Bases: object

A dataset containing options contracts generated with consistent parameters.

Parameters:
  • root (str)

  • total_start_date (str)

  • total_end_date (str)

  • contract_stride (int)

  • interval_min (int)

  • right (str)

  • target_tte (int)

  • tte_tolerance (Tuple[int, int])

  • moneyness (str)

  • strike_band (float)

  • volatility_scaled (bool)

  • volatility_scalar (float)

  • hist_vol (float | None)

  • verbose (bool)

  • save_dir (str | None)

  • warning (bool)

  • dev_mode (bool)

  • contract_dir (Path | None)

__init__(root, total_start_date, total_end_date, contract_stride, interval_min, right, target_tte, tte_tolerance, moneyness, strike_band=0.05, volatility_scaled=False, volatility_scalar=1.0, hist_vol=None, verbose=False, save_dir=None, warning=True, dev_mode=False, contract_dir=None)[source]

Initialize the ContractDataset with the specified parameters.

Parameters:
  • root (str) – The security root symbol

  • total_start_date (str) – Start date for the dataset (YYYYMMDD)

  • total_end_date (str) – End date for the dataset (YYYYMMDD)

  • contract_stride (int) – Days between consecutive contracts

  • interval_min (int) – Data interval in minutes

  • right (str) – Option type (C/P)

  • target_tte (int) – Target time to expiration in days

  • tte_tolerance (Tuple[int, int]) – Acceptable range for TTE as (min_days, max_days)

  • moneyness (str) – Contract moneyness (OTM/ATM/ITM)

  • strike_band (float) – Target percentage band for strike selection

  • volatility_scaled (bool) – Whether to scale by volatility

  • volatility_scalar (float) – Scaling factor for volatility

  • hist_vol (float | None) – Historical volatility for dynamic strike selection

  • verbose (bool) – Whether to print verbose output

  • save_dir (str | None)

  • warning (bool)

  • dev_mode (bool)

  • contract_dir (Path | None)

Return type:

None

generate()[source]

Generate all contracts in the dataset based on configuration parameters. Contracts are generated by starting from total_start_date and advancing by contract_stride days until reaching the last valid date that allows for contracts within the specified time-to-expiration tolerance.

Returns:

ContractDataset – The dataset with all generated contracts

Return type:

ContractDataset
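The stepping behavior described for generate() can be sketched with the standard library alone. The stopping rule used here (a start date is valid while start + min_days still falls on or before total_end_date) is an assumption about what "last valid date" means; contract_start_dates is a hypothetical helper, not part of the package.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def contract_start_dates(total_start_date: str, total_end_date: str,
                         contract_stride: int,
                         tte_tolerance: Tuple[int, int]) -> List[str]:
    """List candidate contract start dates in YYYYMMDD format."""
    fmt = "%Y%m%d"
    current = datetime.strptime(total_start_date, fmt)
    end = datetime.strptime(total_end_date, fmt)
    min_tte = tte_tolerance[0]
    dates = []
    # Assumed stopping rule: the minimum TTE must still fit before the end date.
    while current + timedelta(days=min_tte) <= end:
        dates.append(current.strftime(fmt))
        current += timedelta(days=contract_stride)
    return dates


starts = contract_start_dates("20240101", "20240131", 5, (20, 40))
```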

save(filename=None, clean_file=False)[source]

Save the dataset to a pickle file.

Parameters:
  • filename (str | None) – Optional custom file path. If None, a default name is generated

  • clean_file (bool) – Whether to delete the existing file if it exists

Returns:

str – Path where the pickle file was saved

Return type:

None

classmethod load(filepath)[source]

Load a dataset from a pickle file.

Parameters:

filepath (Path)

Return type:

ContractDataset

get_contract_datasets(root, start_date, end_date, contract_stride, interval_min, right, target_tte, tte_tolerance, moneyness, strike_band=0.05, volatility_type='period', volatility_scaled=False, volatility_scalar=1.0, train_split=0.7, val_split=0.1, clean_up=False, offline=False, save_dir=None, verbose=False, dev_mode=False)[source]

Returns the training, validation, and test contract datasets. These contain mutually exclusive contracts over mutually exclusive time periods to prevent information leakage during training and evaluation.

Parameters:
  • root (str) – Underlying stock symbol

  • start_date (str) – Start date for the total dataset in YYYYMMDD format

  • end_date (str) – End date for the total dataset in YYYYMMDD format

  • contract_stride (int) – Number of days between each contract

  • interval_min (int) – Interval in minutes for the underlying stock data

  • right (str) – Option type (C for call, P for put)

  • target_tte (int) – Target time to expiration in days

  • tte_tolerance (Tuple[int, int]) – Tuple of (min, max) time to expiration tolerance in days

  • moneyness (str) – Moneyness of the option contract (OTM, ATM, ITM)

  • strike_band (float | None) – Target band for moneyness selection, proportion of current underlying price

  • volatility_type (str | None) – Type of historical volatility to use

  • volatility_scaled (bool | None) – Whether to scale strikes based on historical volatility

  • volatility_scalar (float | None) – Scalar to adjust historical volatility-based strike selection

  • train_split (float) – Proportion of total days to use for training

  • val_split (float) – Proportion of total days to use for validation

  • clean_up (bool) – Whether to clean up the data after use

  • offline (bool) – Whether to load saved contracts from disk

  • save_dir (str | None) – Directory to save/load contracts

  • verbose (bool) – Whether to print verbose output

  • dev_mode (bool) – Whether to use development mode

Returns:

Training, validation, and test contract datasets.

Return type:

Tuple[ContractDataset, ContractDataset, ContractDataset]
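A chronological split of this kind can be sketched as follows. How the package rounds the boundaries and whether the ranges are inclusive are assumptions here; split_date_ranges is a hypothetical helper that only shows how train_split and val_split could partition the total date range into non-overlapping periods.

```python
from datetime import datetime, timedelta


def split_date_ranges(start_date, end_date, train_split=0.7, val_split=0.1):
    """Split [start_date, end_date] into consecutive train/val/test ranges."""
    fmt = "%Y%m%d"
    start = datetime.strptime(start_date, fmt)
    end = datetime.strptime(end_date, fmt)
    total_days = (end - start).days
    train_end = start + timedelta(days=round(total_days * train_split))
    val_end = train_end + timedelta(days=round(total_days * val_split))

    def as_str(d):
        return d.strftime(fmt)

    # Each period starts the day after the previous one ends, so none overlap.
    return {
        "train": (as_str(start), as_str(train_end)),
        "val": (as_str(train_end + timedelta(days=1)), as_str(val_end)),
        "test": (as_str(val_end + timedelta(days=1)), as_str(end)),
    }


ranges = split_date_ranges("20240101", "20240410")
```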

optrade.data.features

dt_features(df, feats, dt_col='datetime', market_open_time='09:30:00', market_close_time='16:00:00')[source]

Generates datetime features for options.

Parameters:
  • df (DataFrame) – DataFrame containing a datetime column.

  • feats (List[str]) – List of datetime features to generate. Options include:
      - minute_of_day: Minute of trading day (0-389 for standard session)
      - sin_minute_of_day: Sine transformation of time of day (continuous circular feature)
      - cos_minute_of_day: Cosine transformation of time of day (continuous circular feature)
      - day_of_week: Day of week (0=Monday, 4=Friday)
      - hour_of_week: Hour position in trading week as proportion (0.0-1.0)
      - sin_hour_of_week: Sine transformation of hour of week (continuous circular feature)
      - cos_hour_of_week: Cosine transformation of hour of week (continuous circular feature)

  • dt_col (str | None) – Name of datetime column. If None, will attempt to detect it. Defaults to datetime.

  • market_open_time (str | None) – Market open time in HH:MM:SS format. Defaults to 09:30:00.

  • market_close_time (str | None) – Market close time in HH:MM:SS format. Defaults to 16:00:00.

Returns:

Original DataFrame with additional datetime feature columns, prefixed with dt_.

Return type:

DataFrame

Examples

Basic usage:

>>> import pandas as pd
>>> data = pd.DataFrame({
...     "datetime": pd.date_range("2023-01-02 09:30:00", periods=5, freq="1min")
... })
>>> feats = ["minute_of_day", "day_of_week"]
>>> result = dt_features(data, feats)
>>> result.columns
Index(['datetime', 'dt_minute_of_day', 'dt_day_of_week'], dtype='object')

Using custom datetime column name:

>>> data = pd.DataFrame({
...     "timestamp": pd.date_range("2023-01-02 09:30:00", periods=5, freq="1min")
... })
>>> result = dt_features(data, feats, dt_col="timestamp")
>>> result.columns
Index(['timestamp', 'dt_minute_of_day', 'dt_day_of_week'], dtype='object')
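For intuition about the circular features, the following standalone sketch computes minute_of_day and its sine and cosine encodings for a single timestamp. It assumes the period of the encoding equals the session length (390 minutes for the default open and close times); minute_of_day_features is a hypothetical helper and may not match the package's exact scaling.

```python
import math
from datetime import datetime


def minute_of_day_features(ts, market_open_time="09:30:00",
                           market_close_time="16:00:00"):
    """Compute minute-of-day and day-of-week features for one timestamp."""
    open_t = datetime.strptime(market_open_time, "%H:%M:%S")
    close_t = datetime.strptime(market_close_time, "%H:%M:%S")
    session_minutes = int((close_t - open_t).total_seconds() // 60)
    # Minutes elapsed since the market open
    minute = (ts.hour * 60 + ts.minute) - (open_t.hour * 60 + open_t.minute)
    # Assumed period: one full trading session maps to one full circle
    angle = 2 * math.pi * minute / session_minutes
    return {
        "dt_minute_of_day": minute,
        "dt_sin_minute_of_day": math.sin(angle),
        "dt_cos_minute_of_day": math.cos(angle),
        "dt_day_of_week": ts.weekday(),  # 0=Monday
    }


feats_open = minute_of_day_features(datetime(2023, 1, 2, 9, 30))
feats_later = minute_of_day_features(datetime(2023, 1, 2, 9, 34))
```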

tte_features(df, feats, exp)[source]

Generate Time to Expiration (TTE) features for a given DataFrame.

Parameters:
  • df (pd.DataFrame) – DataFrame containing datetime column in format “YYYY-MM-DD HH:MM:SS”. The function will try to identify a datetime column if not explicitly named “datetime”.

  • feats (List) – List of features to generate. Options include:
      - “linear”: raw TTE in minutes
      - “inverse”: 1/TTE (in minutes)
      - “sqrt”: √(TTE minutes)
      - “inverse_sqrt”: 1/√(TTE minutes)
      - “exp_decay”: exp(-TTE/contract_length)

  • exp (str) – The expiration date of the option in YYYYMMDD format. The expiration time is assumed to be 16:30 (4:30 PM) on the expiration date.

Returns:

pd.DataFrame – The original DataFrame with additional TTE feature columns. Each requested feature is added with the prefix “tte_” (e.g., “tte_inverse”). All TTE features are guaranteed to be float64 type.

Return type:

DataFrame
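The five TTE transformations can be written out for a single timestamp as below. The 16:30 expiration time follows the parameter description above; the value used for contract_length (passed in explicitly as contract_length_min) is an assumption, since its exact definition is not given here, and tte_feature_row is a hypothetical helper.

```python
import math
from datetime import datetime


def tte_feature_row(ts, exp, contract_length_min):
    """Compute TTE features (in minutes) for one timestamp before expiration."""
    # Expiration is assumed to occur at 16:30 on the expiration date
    expiry = datetime.strptime(exp, "%Y%m%d").replace(hour=16, minute=30)
    tte = (expiry - ts).total_seconds() / 60.0  # must be positive
    return {
        "tte_linear": tte,
        "tte_inverse": 1.0 / tte,
        "tte_sqrt": math.sqrt(tte),
        "tte_inverse_sqrt": 1.0 / math.sqrt(tte),
        "tte_exp_decay": math.exp(-tte / contract_length_min),
    }


# 09:30 on expiration day is 7 hours (420 minutes) before 16:30
row = tte_feature_row(datetime(2024, 12, 6, 9, 30), "20241206",
                      contract_length_min=420.0)
```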

get_volatility_features(df, feats, root, right, risk_free_rate=0.045, rolling_volatility_range=None)[source]

Computes volatility features from stock and option data.

Parameters:
  • df (DataFrame) – DataFrame with required columns

  • feats (List[str]) – List of feature names to compute

  • risk_free_rate (float) – Risk-free rate

  • short_window – Lookback for short-term realized vol

  • long_window – Lookback for long-term realized vol

  • return_type – ‘log’ or ‘simple’ returns

  • root (str)

  • right (str)

  • rolling_volatility_range (List[int] | None)

Returns:

DataFrame with new volatility features

Return type:

DataFrame
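As a rough illustration of realized-volatility features, the sketch below computes log returns, a rolling standard deviation, and a short-to-long volatility ratio using only the standard library. The window lengths, the use of the sample standard deviation, and the absence of annualization are all assumptions; none of these helpers are part of the package.

```python
import math
import statistics


def log_returns(prices):
    """Log returns between consecutive prices."""
    return [math.log(b / a) for a, b in zip(prices, prices[1:])]


def rolling_volatility(returns, window):
    """Rolling sample standard deviation; None until the window is full."""
    out = []
    for i in range(len(returns)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(statistics.stdev(returns[i + 1 - window: i + 1]))
    return out


prices = [100.0, 100.5, 100.2, 100.9, 101.4, 101.0, 101.8]
rets = log_returns(prices)
short_vol = rolling_volatility(rets, window=2)  # hypothetical short lookback
long_vol = rolling_volatility(rets, window=4)   # hypothetical long lookback
# Ratio of short-term to long-term volatility, where both are defined
vol_ratio = [s / l if s is not None and l else None
             for s, l in zip(short_vol, long_vol)]
```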

transform_features(df, core_feats, tte_feats=None, datetime_feats=None, vol_feats=None, rolling_volatility_range=None, root=None, right=None, strike=None, exp=None, keep_datetime=False)[source]

Selects and transforms features from a DataFrame based on specified feature lists.

This function allows the selection of core features from NBBO and OHLCVC data, as well as the generation of time-to-expiration features and datetime-based features. It can also calculate derived features such as returns, moneyness, and LOB imbalance.

Parameters:
  • df (DataFrame) – The DataFrame containing the raw features.

  • core_feats (List[str]) – List of core features to select.

  • tte_feats (List[str] | None) – List of Time to Expiration (TTE) features to generate.

  • datetime_feats (List[str] | None) – List of datetime features to generate.

  • strike (float | None) – Strike price of the option, required for moneyness and distance_to_strike calculations.

  • exp (str | None) – Expiration date string in YYYYMMDD format, required for TTE feature generation.

  • vol_feats (List[str] | None) – List of volatility features to generate.

  • root (str | None) – Stock symbol (e.g., “AAPL”), required for volatility feature generation.

  • right (str | None) – Option type (“C” for call, “P” for put), required for volatility feature generation.

  • rolling_volatility_range (List[int] | None) – List of intervals in minutes for rolling volatility features.

  • keep_datetime (bool) – If True, keep the datetime column in the output DataFrame. Otherwise, drop it.

Returns:

DataFrame containing only the requested features.

Return type:

DataFrame

Core feature options (subset of NBBO and OHLCVC):
  • datetime: Timestamp of the data point

  • {asset}_mid_price: Mid price of the asset

  • {asset}_bid_size: Size of the bid

  • {asset}_bid_exchange: Exchange of the bid

  • {asset}_bid: Bid price

  • {asset}_bid_condition: Condition of the bid

  • {asset}_ask_size: Size of the ask

  • {asset}_ask_exchange: Exchange of the ask

  • {asset}_ask: Ask price

  • {asset}_ask_condition: Condition of the ask

  • {asset}_open: Opening price

  • {asset}_high: High price

  • {asset}_low: Low price

  • {asset}_close: Closing price

  • {asset}_volume: Volume

  • {asset}_count: Count

where “{asset}” is either “option” or “stock”.

Advanced core feature options:
  • {asset}_returns: Mid-price returns

  • log_{asset}_returns: Log mid-price returns

  • {asset}_lob_imbalance: Limit order book imbalance

  • {asset}_quote_spread: Quote spread normalized by mid-price

  • moneyness: Log(S/K)

  • distance_to_strike: Linear distance to strike price

where “{asset}” is either “option” or “stock”.
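The derived features listed above can be sketched for a single quote as follows. Moneyness as log(S/K) and the spread normalized by mid-price come from the list itself; the imbalance formula (bid_size - ask_size) / (bid_size + ask_size) and the sign convention S - K for distance_to_strike are assumptions, and the function names are hypothetical.

```python
import math


def derived_quote_features(bid, ask, bid_size, ask_size):
    """Mid price, normalized quote spread, and order book imbalance."""
    mid = (bid + ask) / 2.0
    return {
        "mid_price": mid,
        "quote_spread": (ask - bid) / mid,
        # Assumed definition: ranges from -1 (all ask) to +1 (all bid)
        "lob_imbalance": (bid_size - ask_size) / (bid_size + ask_size),
    }


def moneyness(stock_mid, strike):
    """Log-moneyness, log(S/K)."""
    return math.log(stock_mid / strike)


def distance_to_strike(stock_mid, strike):
    """Linear distance to strike (assumed sign convention: S - K)."""
    return stock_mid - strike


stock = derived_quote_features(bid=224.9, ask=225.1, bid_size=300, ask_size=100)
```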

TTE features options:
  • tte: Time to expiration

  • inverse: Inverse time to expiration

  • sqrt: Square root of time to expiration

  • inverse_sqrt: Inverse square root of time to expiration

  • exp_decay: Exponential decay of time to expiration

Datetime features options:
  • minute_of_day: Minute of the day

  • sin_minute_of_day: Sine of minute of the day

  • cos_minute_of_day: Cosine of minute of the day

  • day_of_week: Day of the week

  • sin_day_of_week: Sine of day of the week

  • cos_day_of_week: Cosine of day of the week

  • hour_of_week: Hour of the week

  • sin_hour_of_week: Sine of hour of the week

  • cos_hour_of_week: Cosine of hour of the week

Volatility feature options:
  • rolling_volatility: Rolling volatility over specified interval in minutes, set by rolling_volatility_range parameter.

  • vol_ratio: Ratio of short-term to long-term volatility

Examples

Basic usage:

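from optrade.data.contracts import Contract
from optrade.data.features import transform_features

# Example contract, using the sample values from the Contract parameter descriptions
contract = Contract(
    root="AAPL",
    start_date="20241107",
    exp="20241206",
    strike=225,
    interval_min=1,
    right="C",
)
df = contract.load_data()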

# TTE features
tte_feats = ["sqrt", "exp_decay"]

# Datetime features
datetime_feats = ["sin_minute_of_day", "cos_minute_of_day",
                  "sin_hour_of_week", "cos_hour_of_week"]

# Select features
core_feats = [
    "option_returns",
    "stock_returns",
    "distance_to_strike",
    "moneyness",
    "option_lob_imbalance",
    "option_quote_spread",
    "stock_lob_imbalance",
    "stock_quote_spread",
    "option_mid_price",
    "option_bid_size",
    "option_bid",
    "option_ask_size",
    "option_close",
    "option_volume",
    "option_count",
    "stock_mid_price",
    "stock_bid_size",
    "stock_bid",
    "stock_ask_size",
    "stock_ask",
    "stock_volume",
    "stock_count",
]

df = transform_features(
    df=df,
    core_feats=core_feats,
    tte_feats=tte_feats,
    datetime_feats=datetime_feats,
    strike=contract.strike,
    exp=contract.exp
)

optrade.data.forecasting

class ForecastingDataset(data, seq_len, pred_len, target_channels=None, target_type='multistep', dtype='float32', normalize_target=False)[source]

Bases: Dataset

Parameters:
  • data (DataFrame)

  • seq_len (int)

  • pred_len (int)

  • target_channels (List[str] | None)

  • target_type (str)

  • dtype (str)

  • normalize_target (bool)

__init__(data, seq_len, pred_len, target_channels=None, target_type='multistep', dtype='float32', normalize_target=False)[source]

Initializes the ForecastingDataset class.

Parameters:
  • data (pd.DataFrame) – Input DataFrame containing the time series data.

  • seq_len (int) – Length of the lookback window for each sample.

  • pred_len (int) – Length of the forecast window (number of steps ahead to predict).

  • target_channels (Optional[List[str]]) – List of column names to include as target channels. If None, all columns are used.

  • target_type (str) – Type of target to predict. Must be one of:
      - “multistep”: Predicts the full future sequence (regression).
      - “average”: Predicts the average value over the forecast window (regression).
      - “average_direction”: Predicts the sign of the average change (binary classification).

  • dtype (str) – Data type for the internal PyTorch tensors (e.g., “float32”, “float64”). Default is “float32”.

  • normalize_target (bool) – Whether to apply normalization to the target variable(s).

Returns:

None

Return type:

None
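The three target types can be illustrated on a plain list of values. This is a sketch under stated assumptions: the series is treated as returns, and “average_direction” is encoded as 1.0 when the window mean is positive and 0.0 otherwise; the actual encoding and tensor layout used by ForecastingDataset may differ. make_target is a hypothetical helper.

```python
def make_target(future, target_type="multistep"):
    """Build a target from the forecast window for one channel."""
    if target_type == "multistep":
        return list(future)
    mean = sum(future) / len(future)
    if target_type == "average":
        return [mean]
    if target_type == "average_direction":
        # Assumed binary encoding of the sign of the average
        return [1.0 if mean > 0 else 0.0]
    raise ValueError(f"Unknown target_type: {target_type}")


series = [0.01, -0.02, 0.03, 0.01, -0.01, 0.02, 0.00, -0.03]
seq_len, pred_len, idx = 4, 3, 1
lookback = series[idx: idx + seq_len]                       # model input
future = series[idx + seq_len: idx + seq_len + pred_len]    # forecast window
multistep = make_target(future, "multistep")
average = make_target(future, "average")
direction = make_target(future, "average_direction")
```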

to_numpy()[source]

Converts the dataset into a set of NumPy arrays for scikit-learn model training.

Returns:

Tuple[np.ndarray, np.ndarray] – A tuple containing:
  • inputs: NumPy array of shape (num_samples, seq_len, num_features).

  • targets: NumPy array of shape (num_samples, pred_len, num_target_features).

If datetime is available, the tuple also contains:
  • input_datetimes: NumPy array of shape (num_samples, seq_len).

  • target_datetimes: NumPy array of shape (num_samples, pred_len).

Return type:

Tuple[ndarray, ndarray] | Tuple[ndarray, ndarray, ndarray, ndarray]

get_item(idx)[source]

Get a sample from the dataset. This method retrieves an input-target pair at the specified index, with the input being the lookback window and the target being the forecast window based on the target_type.

Parameters:

idx (int) – Index of the starting point of the lookback window.

Returns:

If datetime is available, a tuple (input_tensor, target_tensor, input_datetime, target_datetime):
  • input_tensor: Lookback window of shape (num_features, seq_len).

  • target_tensor: Target window with shape depending on target_type:
      - “multistep”: (num_target_features, pred_len)
      - “average”: (num_target_features, 1)
      - “average_direction”: (num_target_features, 1)

  • input_datetime: Datetime values for input window of shape (seq_len,).

  • target_datetime: Datetime values for target window of shape (pred_len,).

Otherwise, a tuple (input_tensor, target_tensor):
  • input_tensor: Lookback window of shape (num_features, seq_len).

  • target_tensor: Target window with shape as described above.

Return type:

Tuple[Tensor, Tensor] | Tuple[Tensor, Tensor, ndarray, ndarray]

normalize_concat_dataset(concat_dataset, scaler)[source]

Modifies the data in a ConcatDataset in-place by normalizing it using a fitted StandardScaler.

Parameters:
  • concat_dataset (ConcatDataset) – ConcatDataset object containing ForecastingDatasets

  • scaler (StandardScaler) – Fitted StandardScaler from scikit-learn.

Returns:

None

Return type:

None

normalize_datasets(train_dataset, val_dataset, test_dataset)[source]

Normalizes financial time series datasets using StandardScaler. Fits scaler only on training data to prevent look-ahead bias.

Parameters:
  • train_dataset (ConcatDataset) – Training dataset (ConcatDataset of ForecastingDatasets)

  • val_dataset (ConcatDataset) – Validation dataset

  • test_dataset (ConcatDataset) – Test dataset

Returns:

Tuple[ConcatDataset, ConcatDataset, ConcatDataset, StandardScaler] – Normalized training, validation, and test datasets, and the fitted StandardScaler

Return type:

Tuple[ConcatDataset, ConcatDataset, ConcatDataset, StandardScaler]
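The reason for fitting only on training data can be seen in a small standalone sketch. It mimics the standardization that scikit-learn's StandardScaler performs (per-column mean and population standard deviation) using the standard library; fit_scaler and transform are hypothetical helpers, not the package's implementation.

```python
import statistics


def fit_scaler(rows):
    """Compute per-column mean and population std from training rows only."""
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    # Population std (ddof=0); fall back to 1.0 for constant columns
    stds = [statistics.pstdev(c) or 1.0 for c in cols]
    return means, stds


def transform(rows, means, stds):
    """Standardize rows with previously fitted statistics."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


train = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
val = [[4.0, 40.0]]

# Statistics come from the training split only, so no information from
# validation or test data leaks into the normalization.
means, stds = fit_scaler(train)
train_scaled = transform(train, means, stds)
val_scaled = transform(val, means, stds)
```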

get_forecasting_dataset(contract_dataset, tte_tolerance, seq_len=None, pred_len=None, core_feats=['option_returns'], tte_feats=None, datetime_feats=None, vol_feats=None, rolling_volatility_range=None, keep_datetime=False, target_type='multistep', clean_up=False, offline=False, intraday=False, target_channels=None, dtype='float32', normalize_target=False, save_dir=None, download_only=False, validate_contracts=False, modify_contracts=False, verbose=False, warning=True, dev_mode=False)[source]

Creates a PyTorch dataset object composed of multiple ForecastingDatasets, each representing different option contracts.

Parameters:
  • contract_dataset (ContractDataset) – ContractDataset object containing option contract parameters

  • tte_tolerance (Tuple[int, int]) – Tuple of (min, max) time to expiration tolerance in days

  • core_feats (List[str]) – List of core features to include

  • tte_feats (List[str] | None) – List of time-to-expiration features to include

  • datetime_feats (List[str] | None) – List of datetime features to include

  • vol_feats (List[str] | None) – List of volatility features to include

  • rolling_volatility_range (List[int] | None) – List of rolling volatility ranges to include

  • keep_datetime (bool) – Whether to keep the datetime column in the dataset

  • target_type (str) – Type of forecasting target. Options: “multistep” (float), “average” (float), or “average_direction” (binary).

  • clean_up (bool) – Whether to clean up the data after use

  • offline (bool) – Whether to load saved contracts from disk

  • intraday (bool) – Whether to use intraday data

  • target_channels (List[str] | None) – List of target channels to include in the target tensor. If None, all channels will be included.

  • seq_len (int | None) – Sequence length of lookback window (input)

  • pred_len (int | None) – Prediction length of forecast window (target)

  • dtype (str) – Data type for the PyTorch tensors

  • normalize_target (bool) – Whether to normalize the target variable(s)

  • save_dir (str | None) – Save directory

  • download_only (bool) – Whether to download data only (used mainly for Universe class)

  • validate_contracts (bool) – Whether to validate contracts by requesting data from ThetaData API and adjusting start and end dates if necessary.

  • modify_contracts (bool) – Whether to delete the old contracts .pkl file and save the (new) validated contracts in the same path. Warning: This will overwrite the old contracts.

  • verbose (bool) – Whether to print verbose output

  • warning (bool) – Whether to print verbose DataValidationError statements as warnings or errors.

  • dev_mode (bool) – Whether to run in development mode.

Returns:

ContractDataset – The updated ContractDataset object if download_only=True or validate_contracts=True.

Tuple[ConcatDataset, ContractDataset] – A tuple containing the concatenated PyTorch dataset and the updated ContractDataset if download_only=False.

Return type:

ContractDataset | Tuple[ConcatDataset, ContractDataset]

calibrate_new_contract(contract_dataset, original_contract, candidate_start_date, candidate_exp, tte_tolerance, expirations_exist=False, save_dir=None, verbose=False, dev_mode=False)[source]

Parameters:
  • contract_dataset (ContractDataset)

  • original_contract (Contract)

  • candidate_start_date (str)

  • candidate_exp (str)

  • tte_tolerance (Tuple[int, int])

  • expirations_exist (bool)

  • save_dir (str | None)

  • verbose (bool)

  • dev_mode (bool)

Return type:

Tuple[bool, Contract | None]

get_valid_start_date(candidate_start_date)[source]

Return the next valid NYSE trading day given a candidate date in YYYYMMDD format.

This function checks whether the provided date falls on a weekend or a NYSE holiday. If so, it advances the date forward to the next valid trading day.

Parameters:

candidate_start_date (str) – The date to validate, in ‘YYYYMMDD’ format.

Returns:

str – The next valid NYSE trading day in ‘YYYYMMDD’ format.

Raises:

ValueError – If no valid trading day is found within the search buffer.

Return type:

str
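The forward search described above can be sketched without any market-calendar dependency. The holiday set below is a small hand-written sample for illustration only, and the ten-day search limit is an assumed value for the search buffer; next_trading_day is a hypothetical helper.

```python
from datetime import datetime, timedelta

# Sample of NYSE holidays for illustration; a real calendar has many more.
HOLIDAYS = {"20241128", "20241225", "20250101"}


def next_trading_day(candidate, holidays=HOLIDAYS, max_lookahead=10):
    """Return the first weekday on or after `candidate` that is not a holiday."""
    fmt = "%Y%m%d"
    day = datetime.strptime(candidate, fmt)
    for _ in range(max_lookahead):
        # weekday() < 5 means Monday through Friday
        if day.weekday() < 5 and day.strftime(fmt) not in holidays:
            return day.strftime(fmt)
        day += timedelta(days=1)
    raise ValueError(f"No valid trading day within {max_lookahead} days of {candidate}")


thursday = next_trading_day("20241107")      # already a trading day
after_weekend = next_trading_day("20241109") # Saturday
after_holiday = next_trading_day("20241128") # Thanksgiving
```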

get_forecasting_loaders(train_contract_dataset, val_contract_dataset, test_contract_dataset, seq_len, pred_len, tte_tolerance, core_feats=['option_returns'], tte_feats=None, datetime_feats=None, vol_feats=None, rolling_volatility_range=None, keep_datetime=False, target_channels=None, target_type='multistep', batch_size=32, shuffle=True, drop_last=False, num_workers=4, prefetch_factor=None, pin_memory=False, persistent_workers=True, clean_up=False, offline=False, save_dir=None, verbose=False, scaling=False, intraday=False, dtype='float32', normalize_target=False, modify_contracts=False, warning=True, dev_mode=False)[source]

Forms training, validation, and test dataloaders for option contract data.

Parameters:
  • train_contract_dataset (ContractDataset) – Contract dataset for training

  • val_contract_dataset (ContractDataset) – Contract dataset for validation

  • test_contract_dataset (ContractDataset) – Contract dataset for testing

  • seq_len (int) – Sequence length for input data

  • pred_len (int) – Prediction length for forecasting

  • tte_tolerance (Tuple[int, int]) – Tuple of (min, max) time to expiration tolerance in days

  • core_feats (List[str]) – List of core features to include

  • tte_feats (List[str] | None) – List of time-to-expiration features to include

  • datetime_feats (List[str] | None) – List of datetime features to include

  • keep_datetime (bool) – Whether to keep the datetime column in the dataset

  • target_type (str) – Type of forecasting target. Options: “multistep” (float), “average” (float), or “average_direction” (binary).

  • batch_size (int) – Number of samples per batch

  • shuffle (bool) – Whether to shuffle the data

  • drop_last (bool) – Whether to drop the last incomplete batch

  • num_workers (int) – Number of subprocesses to use for data loading

  • prefetch_factor (int | None) – Number of batches to prefetch

  • pin_memory (bool) – Whether to pin memory for faster GPU transfer

  • clean_up (bool) – Whether to clean up the data after use

  • offline (bool) – Whether to load saved contracts from disk

  • save_dir (str | None) – Directory to save/load processed datasets

  • modify_contracts (bool) – Whether to modify contracts if they are invalid in get_forecasting_dataset function calls.

  • verbose (bool) – Whether to print verbose output

  • scaling (bool) – Whether to normalize the datasets

  • intraday (bool) – Whether to use intraday data

  • target_channels (List[str] | None) – List of target channels for forecasting

  • dtype (str) – Data type for tensors

  • normalize_target (bool) – Whether to normalize the target variable(s)

  • warning (bool) – Whether to show warnings

  • dev_mode (bool) – Whether to run in development mode

  • vol_feats (List[str] | None)

  • rolling_volatility_range (List[int] | None)

  • persistent_workers (bool)

Returns:

Tuple[DataLoader, DataLoader, DataLoader] – Train, validation, and test data loaders if scaling=False.

Tuple[DataLoader, DataLoader, DataLoader, StandardScaler] – Train, validation, and test data loaders, and the scaler if scaling=True.

Return type:

Tuple[DataLoader, DataLoader, DataLoader, None] | Tuple[DataLoader, DataLoader, DataLoader, StandardScaler]

create_windows(df, seq_len, pred_len, window_stride, intraday=False)[source]

Generates rolling windows of data for a given DataFrame. Intended primarily for scikit-learn models and/or intraday modeling; otherwise, use optrade.data.forecasting.get_forecasting_loaders or optrade.data.forecasting.get_forecasting_dataset.

Parameters:
  • df (pd.DataFrame) – DataFrame containing the data.

  • seq_len (int) – Length of the input sequence.

  • pred_len (int) – Length of the prediction sequence.

  • window_stride (int) – Number of steps to move the window forward.

  • intraday (bool) – Whether the data is intraday or not. If True, the function will first split the data into separate trading days before creating individual windows that cannot crossover between days. Otherwise, the function will create windows that can span multiple days.

Returns:

Tuple[np.ndarray, np.ndarray] – A tuple containing:
  • input (np.ndarray): Array of input windows of shape (num_windows, seq_len, num_features), where num_features is the number of columns in the DataFrame (removing datetime but adding returns).

  • target (np.ndarray): Array of target windows of shape (num_windows, pred_len, 1). The target contains only returns for ‘option_mid_price’.

Return type:

Tuple[ndarray, ndarray]
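The effect of the intraday flag on windowing can be shown with plain lists. This sketch uses (day, value) pairs in place of a DataFrame and a single feature, so the shapes differ from the arrays described above; rolling_windows is a hypothetical helper that only illustrates why intraday windows never cross a day boundary.

```python
from itertools import groupby


def rolling_windows(rows, seq_len, pred_len, window_stride, intraday=False):
    """rows: chronologically ordered list of (day, value) pairs."""
    if intraday:
        # Split into one group per trading day so windows stay within a day
        groups = [list(g) for _, g in groupby(rows, key=lambda r: r[0])]
    else:
        groups = [rows]
    inputs, targets = [], []
    for group in groups:
        values = [v for _, v in group]
        last_start = len(values) - seq_len - pred_len
        for start in range(0, last_start + 1, window_stride):
            inputs.append(values[start: start + seq_len])
            targets.append(values[start + seq_len: start + seq_len + pred_len])
    return inputs, targets


# Two days with six observations each
rows = [("d1", i) for i in range(6)] + [("d2", i) for i in range(10, 16)]
x_all, y_all = rolling_windows(rows, seq_len=3, pred_len=2, window_stride=1)
x_day, y_day = rolling_windows(rows, seq_len=3, pred_len=2, window_stride=1,
                               intraday=True)
```

With intraday=False the fifth window spans both days, whereas with intraday=True every window is drawn from a single day.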

optrade.data.thetadata

get_roots(sec='option', save_dir=None, clean_up=False, offline=False, dev_mode=False)[source]

Fetches all root symbols for a given security type.

Parameters:
  • sec (str) – The security type. Options: ‘option’, ‘stock’, ‘index’.

  • save_dir (str) – Directory to save the CSV file (default: current directory)

  • clean_up (bool) – Whether to clean up the CSV files after merging. If True, the CSV files are saved in a temp folder and then subsequently deleted before returning the df.

  • offline (bool) – Whether to work in offline mode, using previously saved data.

  • dev_mode (bool) – Whether to run in development mode.

Returns:

pd.DataFrame – The DataFrame containing the root symbols for the given security type.

Return type:

DataFrame

get_expirations(root, save_dir='.', clean_up=False, offline=False, dev_mode=False)[source]

Fetch option expiration dates for a given root symbol and save to CSV.

Parameters:
  • root (str) – The root symbol to get expirations for.

  • save_dir (str) – Directory to save the CSV file (default: current directory)

  • clean_up (bool) – Whether to clean up the CSV files after merging. If True, the CSV files are saved in a temp folder and then subsequently deleted before returning the df.

  • offline (bool) – Whether to work in offline mode, using previously saved data.

  • dev_mode (bool) – Whether to run in development mode.

Returns:

pd.DataFrame – The DataFrame containing the expiration dates for the given root symbol.

Return type:

DataFrame

get_strikes(root, exp, save_dir='.', clean_up=False, offline=False, dev_mode=False)[source]

Fetch option strike prices for a given root symbol and expiration, saving to CSV.

Parameters:
  • root (str) – The root symbol to get strikes for.

  • exp (str) – The expiration date to get strikes for.

  • save_dir (str) – Directory to save the CSV file (default: current directory)

  • clean_up (bool) – Whether to clean up the CSV files after merging. If True, the CSV files are saved in a temp folder and then subsequently deleted before returning the df.

  • offline (bool) – Whether to work in offline mode, using previously saved data.

  • dev_mode (bool) – Whether to run in development mode.

Returns:

pd.DataFrame – The DataFrame containing the strike prices for the given root and expiration.

Return type:

DataFrame

find_optimal_exp(root, start_date, target_tte, tte_tolerance, clean_up=False, dev_mode=False)[source]

Returns the closest valid TTE to target_tte within tolerance range and its expiration date.

Parameters:
  • root (str) – The root symbol of the underlying security

  • start_date (str) – The start date in YYYYMMDD format

  • target_tte (int) – Desired days to expiry (e.g., 30)

  • tte_tolerance (Tuple[int, int]) – (min_tte, max_tte) acceptable range

  • save_dir – Directory to save the data files.

  • clean_up (bool) – Whether to clean up the CSV files after merging. If True, the CSV files are saved in a temp folder and then subsequently deleted before returning the df.

  • dev_mode (bool) – Whether to run in development mode.

Returns:

Tuple[str, int] – A tuple containing the optimal expiration date (in YYYYMMDD format) and the corresponding time-to-expiration in days.

Return type:

Tuple[str | None, int | None]
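The selection rule described for find_optimal_exp can be sketched in a few lines. This is only an illustration, not the library's implementation: the helper name closest_expiration is hypothetical, the list of expirations is passed in directly instead of being fetched, and TTE is assumed to be counted in calendar days.

```python
from datetime import datetime
from typing import List, Optional, Tuple


def closest_expiration(
    start_date: str,
    expirations: List[str],
    target_tte: int,
    tte_tolerance: Tuple[int, int],
) -> Tuple[Optional[str], Optional[int]]:
    """Pick the expiration whose TTE is nearest target_tte within tolerance.

    Hypothetical helper; assumes TTE is measured in calendar days.
    """
    start = datetime.strptime(start_date, "%Y%m%d")
    min_tte, max_tte = tte_tolerance
    best: Tuple[Optional[str], Optional[int]] = (None, None)
    for exp in expirations:
        tte = (datetime.strptime(exp, "%Y%m%d") - start).days
        if not min_tte <= tte <= max_tte:
            continue  # outside the acceptable (min_tte, max_tte) range
        if best[1] is None or abs(tte - target_tte) < abs(best[1] - target_tte):
            best = (exp, tte)
    return best


exps = ["20241115", "20241206", "20241220"]
result = closest_expiration("20241107", exps, target_tte=30, tte_tolerance=(20, 40))
print(result)  # ('20241206', 29)
```

When no expiration falls inside the tolerance range, the sketch returns (None, None), which mirrors the optional members in the documented return type.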

load_stock_data(root, start_date, end_date, interval_min=1, save_dir=None, clean_up=False, offline=False, dev_mode=False)[source]

Gets historical quote-level (NBBO) and OHLC (open, high, low, close) data from the ThetaData API for stocks across multiple exchanges, aggregated by interval_min (minimum interval: 1 minute).

Note

Data from OHLC ends at 15:59:00, while quote data ends at 16:00:00, so for simplicity we remove all rows with 16:00:00 in datetime from quote data, before merging quote and OHLC data.

Parameters:
  • root (str) – The root symbol of the underlying security.

  • start_date (str) – The start date of the data in YYYYMMDD format.

  • end_date (str) – The end date of the data in YYYYMMDD format.

  • interval_min (int) – The interval in minutes between data points.

  • save_dir (str) – The directory to save the data.

  • clean_up (bool) – Whether to clean up the CSV files after merging. If True, the CSV files are saved in a temp folder and then subsequently deleted before returning the df.

  • offline (bool) – Whether to work in offline mode, using previously saved data.

  • dev_mode (bool) – Whether to run in development mode.

Returns:

pd.DataFrame – The merged NBBO quote and OHLCVC data.

Return type:

DataFrame
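The alignment step from the note above (dropping the 16:00:00 quote rows before merging) can be illustrated as follows. This is a dependency-free sketch using plain dictionaries rather than the DataFrames the library returns; the function name merge_quote_ohlc and the column names bid, ask, and close are assumptions made for the example.

```python
from typing import Dict, List


def merge_quote_ohlc(quotes: List[Dict], ohlc: List[Dict]) -> List[Dict]:
    """Drop 16:00:00 quote rows, then inner-join quotes and bars on datetime.

    Hypothetical helper; column names are illustrative only.
    """
    # Quote data runs through 16:00:00 but OHLC stops at 15:59:00,
    # so the final quote row has no matching bar and is removed first.
    trimmed = [q for q in quotes if not q["datetime"].endswith("16:00:00")]
    bars = {bar["datetime"]: bar for bar in ohlc}
    return [{**q, **bars[q["datetime"]]} for q in trimmed if q["datetime"] in bars]


quotes = [
    {"datetime": "2024-11-07 15:58:00", "bid": 1.10, "ask": 1.15},
    {"datetime": "2024-11-07 15:59:00", "bid": 1.12, "ask": 1.16},
    {"datetime": "2024-11-07 16:00:00", "bid": 1.11, "ask": 1.17},
]
ohlc = [
    {"datetime": "2024-11-07 15:58:00", "close": 1.12},
    {"datetime": "2024-11-07 15:59:00", "close": 1.14},
]
merged = merge_quote_ohlc(quotes, ohlc)
print(len(merged))  # 2
```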

load_stock_data_eod(root, start_date, end_date, save_dir=None, clean_up=False, offline=False, dev_mode=False)[source]

Gets historical End of Day (EOD) reports from the ThetaData API for stocks across multiple exchanges. Each report is generated around 17:15:00 ET and contains NBBO and OHLCVC data.

Parameters:
  • root (str) – The root symbol of the underlying security.

  • start_date (str) – The start date of the data in YYYYMMDD format.

  • end_date (str) – The end date of the data in YYYYMMDD format.

  • save_dir (str) – The directory to save the data.

  • clean_up (bool) – Whether to clean up the CSV files after merging. If True, the CSV files are saved in a temp folder and then subsequently deleted before returning the df.

  • offline (bool) – Whether to work in offline mode, using previously saved data.

  • dev_mode (bool) – Whether to run in development mode.

Returns:

pd.DataFrame – The merged quote-level and OHLC data.

Return type:

DataFrame

find_optimal_strike(root, start_date, exp, right, interval_min, moneyness, strike_band=0.05, volatility_scaled=False, hist_vol=None, volatility_scalar=1.0, clean_up=False, offline=False, deterministic=True, dev_mode=False)[source]

Finds the optimal strike price for option return forecasting, prioritizing strikes that are likely to provide meaningful price movement data.

Parameters:
  • root (str) – The root symbol of the option

  • start_date (str) – The start date in YYYYMMDD format

  • exp (str) – The expiration date in YYYYMMDD format

  • right (str) – Option type - “C” for call or “P” for put

  • interval_min (int) – The interval in minutes between data points (the resolution of the data).

  • moneyness (str) – Desired moneyness - “OTM”, “ITM”, or “ATM”

  • strike_band (float | None) – Base percentage distance from current price for strike selection

  • volatility_scaled (bool) – Whether to adjust strike_band based on historical volatility

  • hist_vol (float | None) – Historical volatility to use for scaling strike_band (required if volatility_scaled=True).

  • volatility_scalar (float | None) – The number of standard deviations to scale the strike_band by.

  • clean_up (bool) – Whether to clean up the CSV files after merging. If True, the CSV files are saved in a temp folder and then subsequently deleted before returning the df.

  • offline (bool) – Whether to work in offline mode, using previously saved data.

  • deterministic (bool | None) – Use deterministic algorithm for strike selection (True by default, stochastic mode not yet implemented).

  • dev_mode (bool) – Whether to run in development mode (True) or production mode (False).

Returns:

float – The optimal strike price for option return forecasting based on the specified criteria.

Return type:

Tuple[float, str]
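One plausible reading of how moneyness, strike_band, and the volatility options interact is sketched below. It is not the library's algorithm: the helper pick_strike is hypothetical, the underlying price and the strike list are passed in directly, and the rule that the scaled band equals hist_vol * volatility_scalar is an assumption. The sketch also returns only the strike, not the second element of the documented return type.

```python
from typing import List, Optional


def pick_strike(
    strikes: List[float],
    price: float,
    right: str,
    moneyness: str,
    strike_band: float = 0.05,
    volatility_scaled: bool = False,
    hist_vol: Optional[float] = None,
    volatility_scalar: float = 1.0,
) -> float:
    """Return the listed strike closest to a moneyness-based target price."""
    if right not in ("C", "P") or moneyness not in ("OTM", "ATM", "ITM"):
        raise ValueError("right must be 'C'/'P' and moneyness 'OTM'/'ATM'/'ITM'")
    if volatility_scaled:
        if hist_vol is None:
            raise ValueError("hist_vol is required when volatility_scaled=True")
        band = hist_vol * volatility_scalar  # assumed scaling rule
    else:
        band = strike_band
    if moneyness == "ATM":
        target = price
    else:
        # OTM calls and ITM puts sit above spot; ITM calls and OTM puts below.
        above = (right == "C") == (moneyness == "OTM")
        target = price * (1 + band) if above else price * (1 - band)
    return min(strikes, key=lambda k: abs(k - target))


strikes = [210.0, 215.0, 220.0, 225.0, 230.0, 235.0, 240.0]
print(pick_strike(strikes, 225.0, "C", "OTM"))  # 235.0
print(pick_strike(strikes, 225.0, "P", "OTM"))  # 215.0
```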

load_option_data(root, start_date, end_date, exp, strike, interval_min, right, save_dir=None, clean_up=False, offline=False, count_ohlc_zeros=False, dev_mode=False)[source]

Gets historical quote-level (NBBO) and OHLC (open, high, low, close) data from the ThetaData API for options across multiple exchanges, aggregated by interval_min (minimum interval: 1 minute).

Note

Data from OHLC ends at 15:59:00, while quote data ends at 16:00:00, so for simplicity we remove all rows with 16:00:00 in datetime from quote data, before merging quote and OHLC data.

Parameters:
  • root (str) – The root symbol of the underlying security.

  • start_date (str) – The start date of the data in YYYYMMDD format.

  • end_date (str) – The end date of the data in YYYYMMDD format.

  • exp (Optional[str]) – The expiration date of the option in YYYYMMDD format.

  • strike (int) – The strike price of the option in dollars.

  • interval_min (int) – The interval in minutes between data points.

  • right (str) – The type of option, either ‘C’ for call or ‘P’ for put.

  • save_dir (str) – The directory to save the data.

  • clean_up (bool) – Whether to clean up the CSV files after merging. If True, the CSV files are saved in a temp folder and then subsequently deleted before returning the df.

  • offline (bool) – Whether to work in offline mode, using previously saved data.

  • count_ohlc_zeros (bool) – Whether to count the proportion of zero values in OHLC transactions data.

  • dev_mode (bool) – Whether to run in development mode.

Returns:

pd.DataFrame – Merged DataFrame containing quote-level (NBBO) and OHLC data for the specified option.

Return type:

DataFrame
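To make the count_ohlc_zeros option more concrete, the snippet below shows one way a proportion of zero OHLC values could be computed. It is a sketch under assumptions: the helper zero_proportion is hypothetical, rows are plain dictionaries, and zeros are counted per cell across the four price columns, which may differ from how the library counts them.

```python
from typing import Dict, List, Sequence


def zero_proportion(
    rows: List[Dict[str, float]],
    columns: Sequence[str] = ("open", "high", "low", "close"),
) -> float:
    """Fraction of OHLC cells equal to zero (hypothetical helper)."""
    total = len(rows) * len(columns)
    if total == 0:
        return 0.0  # nothing to count
    zeros = sum(1 for row in rows for col in columns if row[col] == 0)
    return zeros / total


bars = [
    {"open": 1.10, "high": 1.15, "low": 1.08, "close": 1.12},
    {"open": 0.0, "high": 0.0, "low": 0.0, "close": 0.0},
]
print(zero_proportion(bars))  # 0.5
```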

load_all_data(root, start_date, exp, interval_min, right, strike, save_dir=None, clean_up=False, offline=False, warning=False, dev_mode=False)[source]

Gets historical quote-level (NBBO) and OHLC (open, high, low, close) data from the ThetaData API for an option combined with its underlying stock across multiple exchanges, aggregated by interval_min (minimum interval: 1 minute).

Note

Data from OHLC ends at 15:59:00, while quote data ends at 16:00:00, so for simplicity we remove all rows with 16:00:00 in datetime from quote data, before merging quote and OHLC data.

Parameters:
  • root (str) – The root symbol of the underlying security.

  • start_date (str) – The start date of the data in YYYYMMDD format.

  • exp (str) – The expiration date of the option in YYYYMMDD format.

  • interval_min (int) – The interval in minutes between data points.

  • right (str) – The type of option, either ‘C’ for call or ‘P’ for put.

  • strike (float) – The strike price of the option in dollars.

  • save_dir (str) – The directory to save the data.

  • clean_up (bool) – Whether to clean up the CSV files after merging. If True, the CSV files are saved in a temp folder and then subsequently deleted before returning the df.

  • offline (bool) – Whether to use offline (already saved) data instead of calling ThetaData API directly (default: False).

  • dev_mode (bool) – Whether to run in development mode.

  • warning (bool)

Returns:

pd.DataFrame – The combined quote-level and OHLC data for an option and its underlying.

Return type:

DataFrame

optrade.data.universe

class Universe(start_date, end_date, sp_500=False, nasdaq_100=False, dow_jones=False, candidate_roots=None, volatility=None, pe_ratio=None, debt_to_equity=None, beta=None, market_cap=None, sector=None, industry=None, dividend_yield=None, earnings_volatility=None, market_beta=None, size_beta=None, value_beta=None, profitability_beta=None, investment_beta=None, momentum_beta=None, all_metrics=False, save_dir=None, verbose=False, dev_mode=False)[source]

Bases: object

Parameters:
  • start_date (str)

  • end_date (str)

  • sp_500 (bool)

  • nasdaq_100 (bool)

  • dow_jones (bool)

  • candidate_roots (List[str] | None)

  • volatility (str | None)

  • pe_ratio (str | None)

  • debt_to_equity (str | None)

  • beta (str | None)

  • market_cap (str | None)

  • sector (str | None)

  • industry (str | None)

  • dividend_yield (str | None)

  • earnings_volatility (str | None)

  • market_beta (str | None)

  • size_beta (str | None)

  • value_beta (str | None)

  • profitability_beta (str | None)

  • investment_beta (str | None)

  • momentum_beta (str | None)

  • all_metrics (bool)

  • save_dir (str | None)

  • verbose (bool)

  • dev_mode (bool)

__init__(start_date, end_date, sp_500=False, nasdaq_100=False, dow_jones=False, candidate_roots=None, volatility=None, pe_ratio=None, debt_to_equity=None, beta=None, market_cap=None, sector=None, industry=None, dividend_yield=None, earnings_volatility=None, market_beta=None, size_beta=None, value_beta=None, profitability_beta=None, investment_beta=None, momentum_beta=None, all_metrics=False, save_dir=None, verbose=False, dev_mode=False)[source]

A class for defining the universe of stocks and options for data retrieval and analysis.

This class contains parameters for filtering stocks based on various factors and selecting options contracts based on specific criteria.

Parameters:
  • start_date (str)

  • end_date (str)

  • sp_500 (bool)

  • nasdaq_100 (bool)

  • dow_jones (bool)

  • candidate_roots (List[str] | None)

  • volatility (str | None)

  • pe_ratio (str | None)

  • debt_to_equity (str | None)

  • beta (str | None)

  • market_cap (str | None)

  • sector (str | None)

  • industry (str | None)

  • dividend_yield (str | None)

  • earnings_volatility (str | None)

  • market_beta (str | None)

  • size_beta (str | None)

  • value_beta (str | None)

  • profitability_beta (str | None)

  • investment_beta (str | None)

  • momentum_beta (str | None)

  • all_metrics (bool)

  • save_dir (str | None)

  • verbose (bool)

  • dev_mode (bool)

Return type:

None

start_date

Start date for data retrieval in YYYYMMDD format.

Type:

str, optional

end_date

End date for data retrieval in YYYYMMDD format.

Type:

str, optional

sp_500

If True, use S&P 500 stocks as the candidate universe. Default is False.

Type:

bool

nasdaq_100

If True, use NASDAQ 100 stocks as the candidate universe. Default is False.

Type:

bool

dow_jones

If True, use Dow Jones Industrial Average stocks as the candidate universe. Default is False.

Type:

bool

candidate_roots

Candidate root symbols to be filtered by other parameters. Used only if no collection (sp_500, nasdaq_100, etc.) is selected.

Type:

list, optional

volatility

The volatility of the stock. Options: ‘low’, ‘medium’, ‘high’. Based on the terciles of volatility from the candidate universe.

Type:

str, optional

pe_ratio

The P/E ratio of the stock. Options: ‘low’, ‘medium’, ‘high’. Based on the terciles of P/E ratio from the candidate universe.

Type:

str, optional

debt_to_equity

The debt to equity ratio of the stock. Options: ‘low’, ‘medium’, ‘high’. Based on the terciles of debt to equity from the candidate universe.

Type:

str, optional

beta

The beta of the stock. Options: ‘low’, ‘medium’, ‘high’. Based on the terciles of beta from the candidate universe.

Type:

str, optional

market_cap

The market cap of the stock. Options: ‘low’, ‘medium’, ‘high’. Based on the terciles of market cap from the candidate universe.

Type:

str, optional

sector

The sector of the stock. Options: ‘tech’, ‘healthcare’, ‘financial’, ‘consumer_cyclical’, ‘consumer_defensive’, ‘industrial’, ‘energy’, ‘materials’, ‘utilities’, ‘real_estate’, ‘communication’.

Type:

str, optional

industry

The industry of the stock matching Yahoo Finance classifications.

Type:

str, optional

dividend_yield

The dividend yield of the stock. Options: ‘low’, ‘medium’, ‘high’. Based on the terciles of dividend yield from the candidate universe.

Type:

str, optional

earnings_volatility

The earnings volatility of the stock. Options: ‘low’, ‘medium’, ‘high’. Based on the terciles of earnings volatility from the candidate universe.

Type:

str, optional

market_beta

The market beta of the stock. Options: ‘high’, ‘low’, ‘neutral’. Based on absolute thresholds: a beta below 0.9 is ‘low’, a beta above 1.1 is ‘high’, and anything in between is ‘neutral’.

Type:

str, optional

size_beta

The size beta of the stock. Options: ‘small_cap’, ‘large_cap’, ‘neutral’. Based on 30th and 70th percentiles of beta from the candidate universe.

Type:

str, optional

value_beta

The value beta of the stock. Options: ‘value’, ‘growth’, ‘neutral’. Based on 30th and 70th percentiles of beta from the candidate universe.

Type:

str, optional

profitability_beta

The profitability beta of the stock. Options: ‘robust’, ‘weak’, ‘neutral’. Based on 30th and 70th percentiles of beta from the candidate universe.

Type:

str, optional

investment_beta

The investment beta of the stock. Options: ‘conservative’, ‘aggressive’, ‘neutral’. Based on 30th and 70th percentiles of beta from the candidate universe.

Type:

str, optional

momentum_beta

The momentum beta of the stock, as used in the Carhart four-factor model. Options: ‘high’, ‘low’, ‘neutral’. Based on 30th and 70th percentiles of beta from the candidate universe.

Type:

str, optional

all_metrics

If True, computes all metrics for the candidate universe. Default is False.

Type:

bool

save_dir

Directory to save the contract datasets and raw data.

Type:

str, optional

verbose

Whether to print verbose output. Default is False.

Type:

bool

dev_mode

If True, enables development mode specific data directory management. Default is False.

Type:

bool

set_roots()[source]

Fetches constituents of a specified index using public data on Wikipedia and updates candidate_roots.

Return type:

None

get_market_metrics(remove_roots=False)[source]

Retrieves market metrics data for each stock in candidate_roots from various sources. Only includes metrics that are specified in the filter criteria.

Parameters:

remove_roots (bool)

Return type:

None

get_factor_exposures(remove_roots=False)[source]

Computes and categorizes Fama-French factor exposures for each stock in the universe, using Kenneth French’s data library and fitting the specified factor model (ff3, c4, or ff5) with linear regression.

Parameters:

remove_roots (bool)

Return type:

None
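As a simplified illustration of estimating and categorizing a factor exposure, the sketch below fits a single-factor regression (market beta only) and applies the 0.9 and 1.1 thresholds documented for market_beta. The library fits multi-factor models, so this is not its implementation; the function names and the toy return series are made up for the example.

```python
from typing import List


def market_beta(stock_excess: List[float], market_excess: List[float]) -> float:
    """OLS slope of stock excess returns on market excess returns."""
    n = len(market_excess)
    mean_s = sum(stock_excess) / n
    mean_m = sum(market_excess) / n
    cov = sum((s - mean_s) * (m - mean_m) for s, m in zip(stock_excess, market_excess))
    var = sum((m - mean_m) ** 2 for m in market_excess)
    return cov / var


def classify_market_beta(beta: float) -> str:
    """Apply the documented absolute thresholds of 0.9 and 1.1."""
    if beta < 0.9:
        return "low"
    if beta > 1.1:
        return "high"
    return "neutral"


# Toy data: a stock that moves 1.5x the market in every period.
market = [0.01, -0.02, 0.03, 0.00]
stock = [1.5 * m for m in market]
beta = market_beta(stock, market)
print(classify_market_beta(beta))  # high
```

A multi-factor fit (ff3, c4, or ff5) would regress on several factor series at once, which in practice calls for a least-squares solver rather than the closed-form slope used here.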

get_percentiles(metric, bins=3)[source]

filter_three_level(filtered_roots, metric, level_value)[source]

Parameters:
  • filtered_roots (List[str])

  • metric (str)

  • level_value (str | None)

Return type:

List[str]

filter_five_level(filtered_roots, metric, level_value)[source]

Parameters:
  • filtered_roots (List[str])

  • metric (str)

  • level_value (str | None)

Return type:

List[str]

filter_categorical(filtered_roots, metric, category_value)[source]

Parameters:
  • filtered_roots (List[str])

  • metric (str)

  • category_value (str | None)

Return type:

List[str]

filter()[source]

Filters the universe of stocks based on the specified criteria.

  • For ThreeFactorLevel: ‘low’ (0-33%), ‘medium’ (33-66%), ‘high’ (66-100%)

  • For FiveFactorLevel: ‘very_low’ (0-20%), ‘low’ (20-40%), ‘medium’ (40-60%), ‘high’ (60-80%), ‘very_high’ (80-100%)

Return type:

None
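The three-level and five-level buckets listed for filter() amount to splitting the candidate universe into equal-sized percentile bands. A minimal rank-based sketch is shown below; the helper names, the rank-based bucketing rule, and the volatility figures are all assumptions for illustration and may not match how the library computes percentiles.

```python
from typing import Dict, List

LEVELS_3 = ["low", "medium", "high"]
LEVELS_5 = ["very_low", "low", "medium", "high", "very_high"]


def bucket_by_rank(values: Dict[str, float], levels: List[str]) -> Dict[str, str]:
    """Assign each root to an equal-sized percentile bucket by rank."""
    ranked = sorted(values, key=values.get)
    n = len(ranked)
    return {
        root: levels[min(i * len(levels) // n, len(levels) - 1)]
        for i, root in enumerate(ranked)
    }


def filter_by_level(
    roots: List[str], values: Dict[str, float], level: str, levels: List[str]
) -> List[str]:
    """Keep only the roots whose metric falls in the requested bucket."""
    buckets = bucket_by_rank({r: values[r] for r in roots}, levels)
    return [r for r in roots if buckets[r] == level]


# Made-up volatility numbers, for illustration only.
vol = {"AAPL": 0.25, "MSFT": 0.22, "TSLA": 0.60, "KO": 0.12, "NVDA": 0.50, "JNJ": 0.14}
print(filter_by_level(list(vol), vol, "high", LEVELS_3))  # ['TSLA', 'NVDA']
```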

download(contract_stride, interval_min, right, target_tte, tte_tolerance, moneyness, train_split, val_split, strike_band=0.05, volatility_type='period', volatility_scaled=False, volatility_scalar=None)[source]

Downloads options contract datasets and market data for the filtered universe of stocks. To be used in conjunction with offline=True when calling get_forecasting_loaders() for higher efficiency during model training.

Parameters:
  • contract_stride (int) – Number of days between consecutive contracts.

  • interval_min (int) – Interval in minutes for the options data.

  • right (str) – Type of contract (‘C’ for call or ‘P’ for put).

  • target_tte (int) – Target time to expiration in days.

  • tte_tolerance (Tuple[int, int]) – Lower and upper bounds for the time to expiration.

  • moneyness (str) – Moneyness of the option. Options: “ATM”, “ITM”, or “OTM”.

  • strike_band (float) – Strike band for the option.

  • train_split (float) – Proportion of contracts to use for training.

  • val_split (float) – Proportion of contracts to use for validation.

  • volatility_type (str, optional) – Type of volatility to use for scaling. Options: “daily”, “period”, or “annualized”.

  • volatility_scaled (bool, optional) – Whether to scale the volatility.

  • volatility_scalar (float, optional) – Scalar to multiply the volatility by.

  • dev_mode (bool, optional) – Whether to use development mode.

Returns:

None

Return type:

None
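The roles of contract_stride, train_split, and val_split can be pictured with the sketch below. It is an illustration under stated assumptions, not the library's logic: both helper names are hypothetical, the stride is applied in calendar days without skipping weekends or holidays, and the split is chronological with the remainder going to the test set.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def contract_start_dates(start_date: str, end_date: str, contract_stride: int) -> List[str]:
    """Start dates spaced contract_stride calendar days apart (assumption)."""
    current = datetime.strptime(start_date, "%Y%m%d")
    end = datetime.strptime(end_date, "%Y%m%d")
    dates = []
    while current <= end:
        dates.append(current.strftime("%Y%m%d"))
        current += timedelta(days=contract_stride)
    return dates


def split_contracts(
    items: List[str], train_split: float, val_split: float
) -> Tuple[List[str], List[str], List[str]]:
    """Chronological train/val/test split; the remainder becomes the test set."""
    n_train = int(len(items) * train_split)
    n_val = int(len(items) * val_split)
    return items[:n_train], items[n_train:n_train + n_val], items[n_train + n_val:]


dates = contract_start_dates("20240101", "20240131", contract_stride=3)
train, val, test = split_contracts(dates, train_split=0.6, val_split=0.2)
print(len(dates), len(train), len(val), len(test))  # 11 6 2 3
```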

get_forecasting_loaders(root, tte_tolerance, seq_len, pred_len, scaling=False, dtype='float32', core_feats=['option_returns'], tte_feats=None, datetime_feats=None, keep_datetime=False, target_channels=None, target_type='multistep', offline=False, batch_size=32, shuffle=True, drop_last=False, num_workers=4, prefetch_factor=2, pin_memory=False, persistent_workers=True)[source]

Parameters:
  • root (str) – Root symbol of the stock.

  • contract_stride (int) – Number of days between consecutive contracts.

  • interval_min (int) – Interval in minutes for the options data.

  • right (str) – Type of contract (‘C’ for call or ‘P’ for put).

  • target_tte (int) – Target time to expiration in days.

  • tte_tolerance (Tuple[int, int]) – Lower and upper bounds for the time to expiration.

  • moneyness (str) – Moneyness of the option. Options: “ATM”, “ITM”, or “OTM”.

  • seq_len (int) – Sequence length for the input data.

  • pred_len (int) – Prediction length for the target data.

  • dtype_str (str) – Data type for the input and target data.

  • train_split (float) – Proportion of contracts to use for training.

  • val_split (float) – Proportion of contracts to use for validation.

  • scaling (bool) – Whether to scale the data.

  • dtype (str) – Data type for the input and target data.

  • core_feats (List[str]) – Core features to include in the input data.

  • tte_feats (List[str], optional) – Time-to-expiration features to include in the input data.

  • datetime_feats (List[str], optional) – Datetime features to include in the input data.

  • keep_datetime (bool, optional) – Whether to keep the datetime features in the input data.

  • target_channels (List[str], optional) – Target channels to include in the target data.

  • target_type (str, optional) – Type of forecasting target. Options: “multistep” (float), “average” (float), or “average_direction” (binary).

  • strike_band (float, optional) – Strike band for the option.

  • volatility_type (str, optional) – Type of volatility to use for scaling. Options: “daily”, “period”, or “annualized”.

  • volatility_scaled (bool, optional) – Whether to scale the volatility.

  • volatility_scalar (float, optional) – Scalar to multiply the volatility by.

  • offline (bool, optional) – Whether to use offline data for faster training.

  • batch_size (int, optional) – Batch size for the data loader.

  • shuffle (bool, optional) – Whether to shuffle the data.

  • drop_last (bool, optional) – Whether to drop the last incomplete batch.

  • num_workers (int, optional) – Number of workers for the data loader.

  • prefetch_factor (int, optional) – Prefetch factor for the data loader.

  • pin_memory (bool, optional) – Whether to pin memory for the data loader.

  • persistent_workers (bool, optional) – Whether to use persistent workers for the data loader.

  • dev_mode (bool, optional) – Whether to use development mode.

Returns:

Tuple[DataLoader, DataLoader, DataLoader] – Train, validation, and test data loaders if scaling=False.

Tuple[DataLoader, DataLoader, DataLoader, StandardScaler] – Train, validation, and test data loaders, and the scaler if scaling=True.

Return type:

Tuple[DataLoader, DataLoader, DataLoader] | Tuple[DataLoader, DataLoader, DataLoader, StandardScaler]
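The relationship between seq_len, pred_len, and target_type can be sketched with a plain sliding window over a single series. This is an assumption-laden illustration rather than the library's dataset code: make_windows is a hypothetical helper, the input is a simple list instead of a multi-channel tensor, and ‘average_direction’ is assumed to mean 1 when the mean of the future values is positive and 0 otherwise.

```python
from typing import List, Tuple, Union

Target = Union[List[float], float, int]


def make_windows(
    series: List[float], seq_len: int, pred_len: int, target_type: str = "multistep"
) -> List[Tuple[List[float], Target]]:
    """Build (input, target) pairs from a sliding window over one series."""
    samples = []
    for i in range(len(series) - seq_len - pred_len + 1):
        x = series[i:i + seq_len]
        future = series[i + seq_len:i + seq_len + pred_len]
        if target_type == "multistep":
            y: Target = future  # predict every future step
        elif target_type == "average":
            y = sum(future) / pred_len  # predict the mean of the horizon
        elif target_type == "average_direction":
            y = 1 if sum(future) / pred_len > 0 else 0  # assumed encoding
        else:
            raise ValueError(f"unknown target_type: {target_type}")
        samples.append((x, y))
    return samples


series = [1.0, -2.0, 3.0, 1.0, -1.0, 2.0]
for x, y in make_windows(series, seq_len=3, pred_len=2, target_type="average"):
    print(x, y)
# [1.0, -2.0, 3.0] 0.0
# [-2.0, 3.0, 1.0] 0.5
```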

Module contents