blueetl_core.etl

Pandas accessors.

Functions

register_accessors()

Register the accessors.

Classes

`ETLBaseAccessor`(pandas_obj)	Accessor with methods common to Series and DataFrame.
`ETLDataFrameAccessor`(pandas_obj)	DataFrame accessor.
`ETLIndexAccessor`(pandas_obj)	Index accessor.
`ETLSeriesAccessor`(pandas_obj)	Series accessor.

class blueetl_core.etl.ETLBaseAccessor(pandas_obj: PandasT)

Bases: ABC, Generic[PandasT, PandasGroupByT]

Accessor with methods common to Series and DataFrame.

Initialize the accessor.

add_conditions(conditions: str | list[str], values: Any, *, inner: bool = False, drop: bool = False, dtypes: Any = None) → PandasT

Add one or multiple conditions into the outermost or innermost position.

Parameters:

conditions – single condition or list of conditions to be added.
values – single value or list of values, one for each condition.
inner (bool) – if True, add the conditions in the innermost position.
drop (bool) – if True, drop the existing conditions.
dtypes – if not None, it’s used to enforce the dtype of the new levels. It can be a single dtype, or a list of dtypes, one for each condition. Examples: int, float, “category”…

Returns:

resulting Series or DataFrame.
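
The effect of adding a condition can be approximated with plain pandas. The following is a minimal sketch, assuming that a condition corresponds to an index level (as the description of conditions() suggests); it is not the library's implementation, and the data and level names are hypothetical. It does not cover the drop and dtypes parameters.

```python
import pandas as pd

# Hypothetical data: a Series indexed by a single condition, "gid".
s = pd.Series([10.0, 20.0], index=pd.Index([1, 2], name="gid"))

# Approximate adding the condition "simulation_id" with value 0 in the
# outermost position: pd.concat with keys/names prepends an index level.
outer = pd.concat([s], keys=[0], names=["simulation_id"])

# Approximate inner=True by moving the new level to the innermost position.
inner = outer.reorder_levels(["gid", "simulation_id"])

outer_names = list(outer.index.names)
inner_names = list(inner.index.names)
```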

complementary_conditions(conditions: str | list[str]) → list[str]

Return the difference between the object conditions and the specified conditions.

Parameters:

conditions – single condition or list of conditions used to calculate the difference.

conditions() → list[str]

Names for each of the index levels.

first(_query: dict | None = None, /, **params) → Any

Execute the query and return the first resulting record.

groupby_except(conditions: str | list[str], *args, **kwargs) → PandasGroupByT

Group by all the conditions except for the ones specified.

Parameters:

conditions – single condition or list of conditions to be excluded from the groupby
args – positional arguments to be passed to groupby
kwargs – key arguments to be passed to groupby
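
Grouping by every condition except the excluded ones can be sketched in plain pandas as follows. This assumes that conditions are index level names; the helper `complementary` is hypothetical and only mirrors what complementary_conditions is described as doing.

```python
import pandas as pd

# Hypothetical data with three conditions in the index.
index = pd.MultiIndex.from_tuples(
    [(0, "a", 1), (0, "a", 2), (1, "b", 1), (1, "b", 2)],
    names=["simulation_id", "circuit", "gid"],
)
s = pd.Series([1.0, 3.0, 5.0, 7.0], index=index)


def complementary(obj, conditions):
    """Return the index level names that are not in `conditions`."""
    if isinstance(conditions, str):
        conditions = [conditions]
    return [name for name in obj.index.names if name not in conditions]


# Group by all the levels except "gid", then aggregate.
remaining = complementary(s, "gid")
means = s.groupby(level=remaining).mean()
```

Aggregating after such a groupby removes the excluded level from the result, which is close to what pool is described as doing.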

keep_conditions(conditions: str | list[str]) → PandasT

Remove the conditions not specified.

Parameters:

conditions – single condition or list of conditions to keep.

Returns:

resulting Series or DataFrame.

labels(conditions: list[str] | None = None) → list[Index]

Unique labels for each level, or for the specified levels.

Parameters:

conditions – list of condition names, or None to consider all the levels.

Returns:

list of indexes with unique labels, one for each level.

labels_of(condition: str) → Index

Unique labels for a specific level in the index.

Parameters:

condition – condition name.

Returns:

index with unique labels.
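
In plain pandas, the unique labels of a level can be obtained with Index.unique(level=...). The sketch below uses hypothetical data and is only an approximation of what labels and labels_of return.

```python
import pandas as pd

# Hypothetical data with two conditions in the index.
index = pd.MultiIndex.from_tuples(
    [(0, "a"), (0, "b"), (1, "a")],
    names=["simulation_id", "circuit"],
)
s = pd.Series([1.0, 2.0, 3.0], index=index)

# Unique labels of a single level (compare with labels_of).
circuit_labels = s.index.unique(level="circuit")

# Unique labels of every level, one Index per level (compare with labels).
all_labels = [s.index.unique(level=name) for name in s.index.names]
```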

one(_query: dict | None = None, /, **params) → Any

Execute the query and return the unique resulting record.

pool(conditions: str | list[str], func: Callable) → PandasT

Remove one or more conditions grouping by the remaining conditions.

Parameters:

conditions – single condition or list of conditions to be removed from the index.
func – function that should accept a single element. If the returned value is a Series, it will be used as an additional level in the MultiIndex of the returned object.

q(_query: dict[str, Any] | list[dict[str, Any]] | None = None, /, **params) → PandasT

Given a query dict, list, or some query params, return the filtered Series or DataFrame.

Filter by columns (for DataFrames) and index names (for both Series and DataFrames). If a name is present in both columns and index names, only the column is considered.

If a list is passed, the items must be query dictionaries, that will be OR-ed together.

All the keys of the dict are combined in AND, while the values can be scalar, list, or dict.

If value is a scalar, the exact value will be matched.
If value is a list, the values in the list are considered in OR.
If value is a dict, a more advanced filter can be specified using the supported operators: eq, ne, le, lt, ge, gt, isin.

Query and named params cannot be specified together. If they are both empty or missing, the original Series or DataFrame is returned.

This method is similar to the standard query method for pandas DataFrames, but it accepts a dict instead of a string, and has some limitations on the possible filters to be applied.

Parameters:

_query – query dictionary, where the keys are columns or index levels, and the values can be scalar, list, or dict values.
**params – named params can be specified as an alternative to the _query dictionary.

Examples

{“mtype”: “SO_BP”, “etype”: “cNAC”} -> mtype == SO_BP AND etype == cNAC
{“mtype”: [“SO_BP”, “SP_AA”]} -> mtype == SO_BP OR mtype == SP_AA
{“gid”: {“ge”: 3, “lt”: 8}} -> gid >= 3 AND gid < 8
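
The matching rules described above can be illustrated with a small pure-Python sketch that filters a list of dict records instead of a Series or DataFrame. It is not the library's implementation: the function names, the sample records, and the choice of ValueError when both a query and named params are given are assumptions made for illustration.

```python
import operator

# Operator names listed in the documentation, mapped to Python callables.
OPERATORS = {
    "eq": operator.eq,
    "ne": operator.ne,
    "le": operator.le,
    "lt": operator.lt,
    "ge": operator.ge,
    "gt": operator.gt,
    "isin": lambda value, allowed: value in allowed,
}


def matches(record, query):
    """Return True if the record satisfies every key of the query (AND)."""
    for key, condition in query.items():
        value = record[key]
        if isinstance(condition, dict):
            # Dict value: every operator must hold.
            ok = all(OPERATORS[op](value, arg) for op, arg in condition.items())
        elif isinstance(condition, list):
            # List value: any of the listed values matches (OR).
            ok = value in condition
        else:
            # Scalar value: exact match.
            ok = value == condition
        if not ok:
            return False
    return True


def filter_records(records, query=None, **params):
    """Filter records using a query dict, a list of query dicts, or named params."""
    if query and params:
        # Assumption: the exception type raised by the library is not documented here.
        raise ValueError("query and named params cannot be specified together")
    query = query or params
    if not query:
        return list(records)
    queries = query if isinstance(query, list) else [query]
    # A list of query dicts is OR-ed together.
    return [r for r in records if any(matches(r, item) for item in queries)]


# Hypothetical records.
records = [
    {"mtype": "SO_BP", "etype": "cNAC", "gid": 1},
    {"mtype": "SP_AA", "etype": "bAC", "gid": 3},
    {"mtype": "SO_BP", "etype": "bAC", "gid": 7},
    {"mtype": "SP_PC", "etype": "cNAC", "gid": 8},
]

in_range = filter_records(records, {"gid": {"ge": 3, "lt": 8}})
either = filter_records(records, [{"mtype": "SP_PC"}, {"gid": 1}])
by_params = filter_records(records, etype="bAC")
```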

remove_conditions(conditions: str | list[str]) → PandasT

Remove one or more conditions.

Parameters:

conditions – single condition or list of conditions to remove.

Returns:

resulting Series or DataFrame.

select(drop_level: bool = True, **kwargs) → PandasT

Filter the series or dataframe based on some conditions on the index.

Note: if the level doesn’t need to be dropped, it’s possible to use etl.q instead. TODO: consider if it can be deprecated in favour of etl.q, and removed.

Parameters:

drop_level (bool) – True to drop the conditions from the returned object.
kwargs – conditions used to filter, specified as name=value.

class blueetl_core.etl.ETLDataFrameAccessor(pandas_obj: PandasT)

Bases: ETLBaseAccessor[DataFrame, DataFrameGroupBy]

DataFrame accessor.

Initialize the accessor.

groupby_apply_parallel(groupby_columns: list[str], selected_columns: list[str] | None = None, *, sort: bool = True, observed: bool = True, func: Callable | None = None, jobs: int | None = None, backend: str | None = None) → DataFrame

Call groupby_iter and apply the given function in parallel, returning a DataFrame.

To work as expected, func should return a DataFrame or a Series, and all the returned objects should have the same index and columns.

Still experimental.

groupby_iter(groupby_columns: list[str], selected_columns: list[str] | None = None, sort: bool = True, observed: bool = True) → Iterator[tuple[tuple, DataFrame]]

Group the dataframe by columns and yield each record as a tuple (key, df).

It can be used as a replacement for the iteration over df.groupby, but:

the yielded keys are namedtuples, instead of tuples
the yielded dataframes contain only the selected columns, if specified

Parameters:

groupby_columns – list of column names to group by.
selected_columns – list of column names to be included in the yielded dataframes. If None, all the columns are included.
sort – Sort group keys.
observed – This only applies if any of the groupers are Categoricals. If True: only show observed values for categorical groupers. If False: show all values for categorical groupers.

Yields:

a tuple (key, df), where key is a namedtuple with the grouped columns
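
The behaviour described for groupby_iter can be approximated with plain pandas and collections.namedtuple. The function below is a hypothetical sketch under that assumption, not the library's code, and the sample data is invented.

```python
from collections import namedtuple

import pandas as pd


def groupby_iter_sketch(df, groupby_columns, selected_columns=None, sort=True, observed=True):
    """Yield (key, df) pairs where key is a namedtuple of the grouped columns."""
    RecordKey = namedtuple("RecordKey", groupby_columns)
    for key, group_df in df.groupby(groupby_columns, sort=sort, observed=observed):
        if not isinstance(key, tuple):
            # Some pandas versions yield a scalar key for a single grouping column.
            key = (key,)
        if selected_columns is not None:
            group_df = group_df[selected_columns]
        yield RecordKey(*key), group_df


# Hypothetical data.
df = pd.DataFrame(
    {
        "simulation_id": [0, 0, 1],
        "circuit": ["a", "a", "b"],
        "gid": [1, 2, 3],
        "value": [0.1, 0.2, 0.3],
    }
)

results = list(groupby_iter_sketch(df, ["simulation_id", "circuit"], ["gid", "value"]))
first_key, first_df = results[0]
```

Because the keys are namedtuples, the grouped values can be read by name, for example first_key.circuit, rather than by position.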

groupby_run_parallel(groupby_columns: list[str], selected_columns: list[str] | None = None, *, sort: bool = True, observed: bool = True, func: Callable | None = None, jobs: int | None = None, backend: str | None = None) → list[Any]

Call groupby_iter and apply the given function in parallel, returning the results.

Parameters:

groupby_columns – see groupby_iter.
selected_columns – see groupby_iter.
sort – see groupby_iter.
observed – see groupby_iter.
func – callable accepting the parameters: key (NamedTuple), df (pd.DataFrame)
jobs – number of jobs (see run_parallel)
backend – parallel backend (see run_parallel)

Returns:

list of results.
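
The general pattern of running a function over each group in parallel can be sketched with the standard library. This example uses concurrent.futures.ThreadPoolExecutor as a stand-in; the backend used by run_parallel is not described here, so the executor, the worker count, and the `summarize` function are assumptions for illustration only.

```python
from concurrent.futures import ThreadPoolExecutor

import pandas as pd


def summarize(key, df):
    """Hypothetical function accepting (key, df), as required for func."""
    return key, float(df["value"].sum())


# Hypothetical data.
df = pd.DataFrame(
    {
        "simulation_id": [0, 0, 1],
        "circuit": ["a", "a", "b"],
        "value": [1.0, 2.0, 3.0],
    }
)

# Materialize the groups, then apply the function to each group in a thread pool.
groups = list(df.groupby(["simulation_id", "circuit"]))
with ThreadPoolExecutor(max_workers=2) as executor:
    # executor.map returns the results in the same order as the input groups.
    results = list(executor.map(lambda item: summarize(*item), groups))

totals = [total for _, total in results]
```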

insert_columns(loc: int, columns: list, values: list) → None

Insert multiple columns, similar to repeatedly calling DataFrame.insert().

iter() → Iterator[tuple[NamedTuple, NamedTuple]]

Iterate over the items, yielding a tuple (named_index, value) for each element.

The returned named_index is a namedtuple representing the value of the index. The returned value is a namedtuple as returned by pandas.DataFrame.itertuples.

iterdict() → Iterator[tuple[dict, dict]]

Iterate over the items, yielding a tuple (named_index, value) for each element.

The returned named_index is a dict representing the value of the index. The returned value is a dict containing a key for each column.

This method can be used as an alternative to iter when:

The column or index names contain invalid identifiers, or
it’s more convenient to work with dictionaries.

Valid identifiers, as required for namedtuple field names, consist of letters, digits, and underscores, but do not start with a digit or underscore and cannot be a Python keyword.
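
The rule above can be checked with the standard library. The sketch below uses a hypothetical helper, `is_valid_field_name`, and hypothetical column names to show which names could be used as namedtuple fields, and what happens when an invalid one is passed to collections.namedtuple.

```python
import keyword
from collections import namedtuple


def is_valid_field_name(name):
    """Return True if the name can be used as a namedtuple field name."""
    return (
        name.isidentifier()
        and not keyword.iskeyword(name)
        and not name.startswith("_")
    )


# Hypothetical column or index names.
names = ["gid", "firing_rate", "firing rate", "class", "_private", "2nd"]
valid = [name for name in names if is_valid_field_name(name)]

# namedtuple rejects field names that are not valid identifiers.
error = None
try:
    namedtuple("Row", ["gid", "firing rate"])
except ValueError as exc:
    error = str(exc)
```

When names like "firing rate" are present, a dict keyed by the original names avoids the problem, which is the case iterdict is meant for.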

class blueetl_core.etl.ETLIndexAccessor(pandas_obj: Index)

Bases: object

Index accessor.

Initialize the accessor.

astype(dtype) → Index

Create a new Index with the given dtypes.

Parameters:

dtype – numpy dtype or pandas type, or dict of dtypes when applied to a MultiIndex. Any (u)int16 or (u)int32 dtype is considered as (u)int64, since Pandas doesn’t have a corresponding Index type for them.

Returns:

a copy of index using the specified dtypes.

property dtypes: Series

Return the dtypes of the index.

iter() → Iterator[tuple]

Iterate over the index, yielding a namedtuple for each element.

It can be used as an alternative to the pandas iteration over the index to yield named tuples instead of standard tuples.

It works with both Indexes and MultiIndexes.

iterdict() → Iterator[dict]

Iterate over the index, yielding a dict for each element.

This method can be used as an alternative to iter when:

The index names contain invalid identifiers, or
it’s more convenient to work with dictionaries.

It works with both Indexes and MultiIndexes.

class blueetl_core.etl.ETLSeriesAccessor(pandas_obj: PandasT)

Bases: ETLBaseAccessor[Series, SeriesGroupBy]

Series accessor.

Initialize the accessor.

iter() → Iterator[tuple[NamedTuple, Any]]

Iterate over the items, yielding a tuple (named_index, value) for each element.

The returned named_index is a namedtuple representing the value of the index. The returned value is the actual value of each element of the series.

iterdict() → Iterator[tuple[dict, Any]]

Iterate over the items, yielding a tuple (named_index, value) for each element.

The returned named_index is a dict representing the value of the index. The returned value is the actual value of each element of the series.

This method can be used as an alternative to iter when:

The index names contain invalid identifiers, or
it’s more convenient to work with dictionaries.

Valid identifiers, as required for namedtuple field names, consist of letters, digits, and underscores, but do not start with a digit or underscore and cannot be a Python keyword.

unpool(func: Callable) → Series

Apply the given function to the object elements and add a condition to the index.

Parameters:

func – function that should accept a single element and return a Series object. The name of that Series will be used as the name of the new level in the MultiIndex of the returned object.

blueetl_core.etl.register_accessors() → None

Register the accessors.

It must be called once, before accessing the etl namespace.
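
For context, pandas exposes a public extension API for attaching a namespace to Series, DataFrame, and Index objects. The sketch below registers a hypothetical `demo` namespace on Series with pandas.api.extensions.register_series_accessor. How register_accessors sets up the etl namespace internally is not stated here, so this only illustrates the general mechanism.

```python
import pandas as pd


# "demo" is a hypothetical namespace used only for this illustration.
@pd.api.extensions.register_series_accessor("demo")
class DemoAccessor:
    def __init__(self, pandas_obj):
        # pandas passes the Series that the accessor was reached from.
        self._obj = pandas_obj

    def conditions(self):
        """Names of the index levels of the wrapped Series."""
        return list(self._obj.index.names)


s = pd.Series([1, 2], index=pd.Index(["a", "b"], name="circuit"))

# The namespace is available only after the registration above has run.
names = s.demo.conditions()
```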