ads.bds package
Submodules
ads.bds.auth module
- exception ads.bds.auth.KRB5KinitError
Bases:
Exception
Raised when the `kinit -kt` command fails to generate a cached ticket from the keytab file and the krb5 config file.
- ads.bds.auth.has_kerberos_ticket()
Check whether a cached Kerberos ticket exists.
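A minimal sketch of what such a check can look like, assuming the standard MIT Kerberos tools are available (`klist -s` exits 0 when a valid cached ticket exists). `has_ticket_sketch` is a hypothetical stand-in, not the ADS implementation:

```python
import subprocess

def has_ticket_sketch() -> bool:
    """Return True if a valid (non-expired) Kerberos ticket is cached.

    `klist -s` is the standard MIT Kerberos way to probe the credential
    cache: it prints nothing and exits 0 when a usable ticket exists.
    """
    try:
        return subprocess.run(["klist", "-s"], capture_output=True).returncode == 0
    except FileNotFoundError:
        # Kerberos client tools are not installed on this machine.
        return False
```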
- ads.bds.auth.init_ccache_with_keytab(principal: str, keytab_file: str) None
Initialize credential cache using keytab file.
- Parameters:
principal (str) – The unique identity to which Kerberos can assign tickets.
keytab_file (str) – Path to your keytab file.
- Returns:
Nothing.
- Return type:
None
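The KRB5KinitError description above indicates the cache is initialized by shelling out to `kinit -kt`. A hedged sketch of that flow follows; `init_ccache_sketch` is hypothetical and stands in for the ADS internals:

```python
import subprocess

def init_ccache_sketch(principal: str, keytab_file: str) -> None:
    """Initialize the credential cache via `kinit -kt <keytab> <principal>`,
    raising on failure much as KRB5KinitError does."""
    proc = subprocess.run(
        ["kinit", "-kt", keytab_file, principal],
        capture_output=True,
        text=True,
    )
    if proc.returncode != 0:
        raise RuntimeError(f"kinit failed: {proc.stderr.strip()}")
```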
- ads.bds.auth.krbcontext(principal: str, keytab_path: str, kerb5_path: str = '~/.bds_config/krb5.conf') None
A context manager for Kerberos-related actions. It provides a Kerberos context that you can put code inside. If no cached ticket exists, it automatically initializes the credential cache from the keytab; otherwise it does nothing.
- Parameters:
principal (str) – The unique identity to which Kerberos can assign tickets.
keytab_path (str) – Path to your keytab file.
kerb5_path (str, optional) – Path to your krb5 config file.
- Returns:
Nothing.
- Return type:
None
Examples
>>> from ads.bds.auth import krbcontext
>>> from pyhive import hive
>>> with krbcontext(principal="your_principal", keytab_path="your_keytab_path"):
...     hive_cursor = hive.connect(host="your_hive_host",
...                                port="your_hive_port",
...                                auth='KERBEROS',
...                                kerberos_service_name="hive").cursor()
- ads.bds.auth.refresh_ticket(principal: str, keytab_path: str, kerb5_path: str = '~/.bds_config/krb5.conf') None
Generate a new cached ticket based on the principal and keytab file path.
- Parameters:
principal (str) – The unique identity to which Kerberos can assign tickets.
keytab_path (str) – Path to your keytab file.
kerb5_path (str, optional) – Path to your krb5 config file.
- Returns:
Nothing.
- Return type:
None
Examples
>>> from ads.bds.auth import refresh_ticket
>>> from pyhive import hive
>>> refresh_ticket(principal="your_principal", keytab_path="your_keytab_path")
>>> hive_cursor = hive.connect(host="your_hive_host",
...                            port="your_hive_port",
...                            auth='KERBEROS',
...                            kerberos_service_name="hive").cursor()
ads.bds.big_data_service module
- class ads.bds.big_data_service.ADSHiveConnection(host: str, port: str = '10000', auth_mechanism: str = 'GSSAPI', driver: str = 'impyla', **kwargs)
Bases:
object
Initiate the connection.
- Parameters:
host (str) – Hive host name.
port (str) – Hive port. Defaults to "10000".
auth_mechanism (str) – Defaults to "GSSAPI". Use "PLAIN" for an unsecured cluster.
driver (str) – Defaults to "impyla". Client used to communicate with Hive. Only impyla is supported so far.
kwargs – Other connection parameters accepted by the client.
- insert(table_name: str, df: DataFrame, if_exists: str, batch_size: int = 1000, **kwargs)
Insert a pandas DataFrame into a table.
- Parameters:
table_name (str) – Table name, which may include the database name. By default the 'default' database is used. You can specify the database via table_name=<db_name>.<tb_name>.
df (pd.DataFrame) – Data to be inserted into the database.
if_exists (str) – Whether to replace, append, or fail if the table already exists.
batch_size (int, default 1000) – Inserting in batches improves insertion performance. Choose this value based on available memory and network bandwidth.
kwargs (dict) – Other parameters used by pandas.DataFrame.to_sql.
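The effect of `batch_size` can be illustrated with a small batching helper. This is plain Python, not the ADS internals:

```python
def iter_batches(rows, batch_size=1000):
    """Yield successive `batch_size`-sized slices of `rows`.

    Batched inserts trade peak memory for fewer round trips to the
    server, which is why a larger batch_size usually inserts faster
    as long as each batch still fits in memory.
    """
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]
```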
- query(sql: str, bind_variables: Optional[Dict] = None, chunksize: Optional[int] = None) Union[DataFrame, Iterator[DataFrame]]
Query data; supports SELECT statements.
- Parameters:
sql (str) – SQL query.
bind_variables (Optional[Dict]) – Parameters to be bound to variables in the SQL query, if any. Impyla supports all DB API paramstyles, including qmark, numeric, named, format, and pyformat.
chunksize (Optional[int]) – Chunk size of each DataFrame in the iterator.
- Returns:
A pandas DataFrame or a pandas DataFrame iterator.
- Return type:
Union[pd.DataFrame, Iterator[pd.DataFrame]]
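The `bind_variables` mechanics can be demonstrated with the stdlib sqlite3 module, which implements the DB API qmark paramstyle (one of the styles listed above). A Hive cursor would accept bound parameters analogously; sqlite3 here is only a stand-in:

```python
import sqlite3

# sqlite3 uses the DB API "qmark" paramstyle: `?` placeholders plus a
# tuple of values, so parameters are bound rather than string-formatted.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER, name TEXT)")
conn.execute("INSERT INTO employees VALUES (?, ?)", (1, "alpha"))
row = conn.execute("SELECT name FROM employees WHERE id = ?", (1,)).fetchone()
# row is now ("alpha",)
```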
- class ads.bds.big_data_service.HiveConnection(**params)
Bases:
ABC
Base class interface.
Set up the Hive connection.
- abstract get_cursor()
Returns the cursor from the connection.
- Returns:
cursor using a specific client.
- Return type:
HiveServer2Cursor
- abstract get_engine()
Returns the engine from the connection.
- Return type:
Engine object for the connection.
- class ads.bds.big_data_service.HiveConnectionFactory
Bases:
object
- clientprovider = {'impyla': <class 'ads.bds.big_data_service.ImpylaHiveConnection'>}
- classmethod get(driver='impyla')
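The factory resolves a driver name to a connection class through its `clientprovider` dictionary. A minimal sketch of the same registry pattern follows; the classes here are placeholders, not the ADS ones:

```python
class FakeImpylaConnection:
    """Placeholder standing in for ImpylaHiveConnection."""

class ConnectionFactory:
    # driver name -> connection class, mirroring clientprovider above
    clientprovider = {"impyla": FakeImpylaConnection}

    @classmethod
    def get(cls, driver="impyla"):
        # unknown drivers resolve to None rather than raising
        return cls.clientprovider.get(driver)
```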
- class ads.bds.big_data_service.ImpylaHiveConnection(**params)
Bases:
HiveConnection
ImpylaHiveConnection class, which uses the impyla client.
Set up the Impala connection.
- get_cursor() impala.hiveserver2.HiveServer2Cursor
Returns the cursor from the connection.
- Returns:
cursor using impyla client.
- Return type:
impala.hiveserver2.HiveServer2Cursor
- get_engine(schema='default')
Return the SQLAlchemy engine from the connection.
- Parameters:
schema (str) – Defaults to "default". The default schema used for queries.
- Returns:
engine using a specific client.
- Return type:
sqlalchemy.engine