Datalake Integration¶
Corridor can connect to different kinds of data lakes, where data may be stored as Parquet, ORC, Avro, Hive tables, or any other format.
The user can define a custom file handler that contains the logic to read data from and write data to the data lake.
Example¶
This example focuses on creating a data source handler that reads from Hive tables.
The user needs to inherit from the base class DataSourceHandler and define the functions read_from_location and write_to_location.
from corridor_api.config.handlers import DataSourceHandler


class HiveTable(DataSourceHandler):
    """
    Consider a case where every data table is a table in the Hive metastore.
    The table `location` identifier is the table name.
    """

    name = 'hive'
    write_format = 'parquet'

    def read_from_location(self, location, nrow=None):
        # Initialize Spark via findspark when it is installed,
        # otherwise fall back to importing pyspark directly.
        try:
            import findspark
            findspark.init()
            import pyspark
        except ImportError:
            import pyspark

        spark = pyspark.sql.SparkSession.builder.getOrCreate()
        data = spark.table(location)
        if nrow is not None:
            data = data.limit(nrow)
        return data

    def write_to_location(self, data, location, mode='error'):
        # Save the DataFrame as a Hive table using the configured format and
        # write mode ('error', the default, fails if the table already exists).
        return data.write.format(self.write_format).mode(mode).saveAsTable(location)
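Once the handler is defined, it can be exercised directly to verify the read and write logic. The snippet below is a minimal sketch; the table names are hypothetical, and it assumes the handler can be instantiated without arguments.

handler = HiveTable()

# Read at most 100 rows from an existing Hive table (hypothetical name).
df = handler.read_from_location('sales_db.transactions', nrow=100)
df.show()

# Write the sample back as a new Hive table; the default mode 'error'
# fails if the target table already exists.
handler.write_to_location(df, 'sales_db.transactions_sample')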
Configuration¶
Once the handler class is created, it can be set up in api_config.py as below
(assuming the handler is defined in a file called hive_table_handler.py alongside api_config.py).
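The exact registration setting is not shown here; the snippet below is a hypothetical sketch that assumes api_config.py exposes a DATA_SOURCE_HANDLERS list for registering custom handlers.

# api_config.py
# Hypothetical registration sketch; the actual setting name used by
# Corridor may differ from DATA_SOURCE_HANDLERS.
from hive_table_handler import HiveTable

DATA_SOURCE_HANDLERS = [HiveTable]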