ads.bds package

Submodules

ads.bds.auth module

exception ads.bds.auth.KRB5KinitError[source]

Bases: Exception

Error raised when the kinit -kt command fails to generate a cached ticket from the keytab file and the krb5 config file.

ads.bds.auth.has_kerberos_ticket()[source]

Check whether a cached Kerberos ticket exists.
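
Examples

A minimal sketch that initializes the credential cache only when no cached ticket exists; the principal and keytab path are placeholders:

>>> from ads.bds.auth import has_kerberos_ticket, init_ccache_with_keytab
>>> if not has_kerberos_ticket():
...     init_ccache_with_keytab("your_principal", "your_keytab_path")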

ads.bds.auth.init_ccache_with_keytab(principal: str, keytab_file: str) → None[source]

Initialize credential cache using keytab file.

Parameters:
  • principal (str) – The unique identity to which Kerberos can assign tickets.

  • keytab_file (str) – Path to your keytab file.

Returns:

Nothing.

Return type:

None
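
Examples

A minimal sketch with placeholder values for the principal and keytab file:

>>> from ads.bds.auth import init_ccache_with_keytab
>>> init_ccache_with_keytab(principal="your_principal", keytab_file="your_keytab_path")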

ads.bds.auth.krbcontext(principal: str, keytab_path: str, kerb5_path: str = '~/.bds_config/krb5.conf') → None[source]

A context manager for Kerberos-related actions. It provides a Kerberos context to run your code in: if no cached ticket exists, it automatically initializes the credential cache with the keytab; otherwise, it does nothing.

Parameters:
  • principal (str) – The unique identity to which Kerberos can assign tickets.

  • keytab_path (str) – Path to your keytab file.

  • kerb5_path (str, optional) – Path to your krb5 config file. Defaults to '~/.bds_config/krb5.conf'.

Returns:

Nothing.

Return type:

None

Examples

>>> from ads.bds.auth import krbcontext
>>> from pyhive import hive
>>> with krbcontext(principal="your_principal", keytab_path="your_keytab_path"):
...     hive_cursor = hive.connect(host="your_hive_host",
...                                port="your_hive_port",
...                                auth='KERBEROS',
...                                kerberos_service_name="hive").cursor()

ads.bds.auth.refresh_ticket(principal: str, keytab_path: str, kerb5_path: str = '~/.bds_config/krb5.conf') → None[source]

Generate a new cached ticket based on the principal and the keytab file path.

Parameters:
  • principal (str) – The unique identity to which Kerberos can assign tickets.

  • keytab_path (str) – Path to your keytab file.

  • kerb5_path (str, optional) – Path to your krb5 config file. Defaults to '~/.bds_config/krb5.conf'.

Returns:

Nothing.

Return type:

None

Examples

>>> from ads.bds.auth import refresh_ticket
>>> from pyhive import hive
>>> refresh_ticket(principal="your_principal", keytab_path="your_keytab_path")
>>> hive_cursor = hive.connect(host="your_hive_host",
...                    port="your_hive_port",
...                    auth='KERBEROS',
...                    kerberos_service_name="hive").cursor()

ads.bds.big_data_service module

class ads.bds.big_data_service.ADSHiveConnection(host: str, port: str = '10000', auth_mechanism: str = 'GSSAPI', driver: str = 'impyla', **kwargs)[source]

Bases: object

Initialize the connection.

Parameters:
  • host (str) – Hive host name.

  • port (str) – Hive port. Defaults to “10000”.

  • auth_mechanism (str) – Defaults to “GSSAPI”. Use “PLAIN” for an unsecured cluster.

  • driver (str) – Defaults to “impyla”. The client used to communicate with Hive. Only impyla is supported so far.

  • kwargs – Other connection parameters accepted by the client.
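
Examples

A minimal sketch of opening a connection with the default port and “GSSAPI” auth mechanism; the host name is a placeholder:

>>> from ads.bds.big_data_service import ADSHiveConnection
>>> connection = ADSHiveConnection(host="your_hive_host")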

insert(table_name: str, df: DataFrame, if_exists: str, batch_size: int = 1000, **kwargs)[source]

Insert data into a table from a pandas DataFrame.

Parameters:
  • table_name (str) – Table name, which can include the database name. By default it will use the ‘default’ database. You can specify the database name via table_name=<db_name>.<tb_name>.

  • df (pd.DataFrame) – Data to be inserted into the database.

  • if_exists (str) – Whether to replace, append or fail if the table already exists.

  • batch_size (int, default 1000) – Inserting in batches improves insertion performance. Choose this value based on available memory and network bandwidth.

  • kwargs (dict) – Other parameters used by pandas.DataFrame.to_sql.
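
Examples

An illustrative sketch; the host, table name, and DataFrame contents are placeholders:

>>> import pandas as pd
>>> from ads.bds.big_data_service import ADSHiveConnection
>>> connection = ADSHiveConnection(host="your_hive_host")
>>> df = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})
>>> connection.insert(table_name="default.your_table", df=df, if_exists="replace")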

query(sql: str, bind_variables: Dict | None = None, chunksize: int | None = None) → DataFrame | Iterator[DataFrame][source]

Query data using a SELECT statement.

Parameters:
  • sql (str) – SQL query.

  • bind_variables (Optional[Dict]) – Parameters to be bound to variables in the SQL query, if any. Impyla supports all DB API paramstyles, including qmark, numeric, named, format, and pyformat.

  • chunksize (Optional[int]) – Chunk size of each DataFrame in the iterator.

Returns:

A pandas DataFrame or a pandas DataFrame iterator.

Return type:

Union[pd.DataFrame, Iterator[pd.DataFrame]]
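
Examples

A sketch using the pyformat paramstyle for a bind variable, plus a chunked read; the host and table name are placeholders:

>>> from ads.bds.big_data_service import ADSHiveConnection
>>> connection = ADSHiveConnection(host="your_hive_host")
>>> df = connection.query("SELECT * FROM default.your_table WHERE id = %(id)s",
...                       bind_variables={"id": 1})
>>> for chunk in connection.query("SELECT * FROM default.your_table", chunksize=1000):
...     print(chunk.shape)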

class ads.bds.big_data_service.HiveConnection(**params)[source]

Bases: ABC

Base class interface.

Set up the Hive connection.

abstract get_cursor()[source]

Returns the cursor from the connection.

Returns:

cursor using a specific client.

Return type:

HiveServer2Cursor

abstract get_engine()[source]

Returns the engine from the connection.

Return type:

Engine object for the connection.
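
Examples

An illustrative sketch of the subclassing contract; the PyHiveConnection class and its use of pyhive are assumptions for illustration, not part of ads:

>>> from pyhive import hive
>>> from ads.bds.big_data_service import HiveConnection
>>> class PyHiveConnection(HiveConnection):  # hypothetical subclass
...     def __init__(self, **params):
...         self.params = params  # assumed attribute, not part of the base API
...     def get_cursor(self):
...         # pyhive returns a HiveServer2-compatible cursor
...         return hive.connect(**self.params).cursor()
...     def get_engine(self):
...         raise NotImplementedError("this sketch only implements get_cursor")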

class ads.bds.big_data_service.HiveConnectionFactory[source]

Bases: object

clientprovider = {'impyla': <class 'ads.bds.big_data_service.ImpylaHiveConnection'>}

classmethod get(driver='impyla')[source]
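
Judging from the clientprovider mapping above, get() presumably resolves the connection class registered for a driver. A short sketch; the printed class name assumes that mapping:

>>> from ads.bds.big_data_service import HiveConnectionFactory
>>> connection_class = HiveConnectionFactory.get(driver="impyla")
>>> connection_class.__name__
'ImpylaHiveConnection'
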
class ads.bds.big_data_service.ImpylaHiveConnection(**params)[source]

Bases: HiveConnection

ImpylaHiveConnection class, which uses the impyla client.

Set up the impala connection.

get_cursor() → impala.hiveserver2.HiveServer2Cursor[source]

Returns the cursor from the connection.

Returns:

cursor using impyla client.

Return type:

impala.hiveserver2.HiveServer2Cursor

get_engine(schema='default')[source]

Return the SQLAlchemy engine from the connection.

Parameters:

schema (str) – Defaults to “default”. The default schema used for the query.

Returns:

engine using a specific client.

Return type:

sqlalchemy.engine
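
Examples

A sketch that reads a table through the engine with pandas; the host and table name are placeholders, and forwarding host/port keyword arguments to the underlying impyla client is an assumption:

>>> import pandas as pd
>>> from ads.bds.big_data_service import ImpylaHiveConnection
>>> connection = ImpylaHiveConnection(host="your_hive_host", port=10000)
>>> engine = connection.get_engine(schema="default")
>>> df = pd.read_sql("SELECT * FROM your_table", con=engine)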

Module contents