cashfree/file-genie

FileGenie

FileGenie is a Python library that simplifies parsing files stored in AWS S3 across formats such as TEXT, CSV, EXCEL, ZIP, XML, and PDF. Users can define custom functions for data massaging and transformation, and the library generates output tailored to the provided configuration.

Features

  • Multi-format Support: Effortlessly parse files in formats such as TEXT, CSV, EXCEL, ZIP, XML, and PDF directly from AWS S3.
  • Flexible Response Types: Generate responses tailored to user needs, including DATAFRAME, JSON, or FILE outputs.
  • Password-Protected Files: Seamlessly parse files secured with passwords.
  • Custom Edge Case Handling: Apply user-defined functions to address specific data massaging and transformation requirements, such as sanitizing data, converting values, or reformatting date fields.
  • AWS S3 Integration: Fetch files directly from AWS S3 buckets using IAM roles for secure access.
  • Streamlined Configuration: Set up with minimal configuration, eliminating the need to write a parser for each file type.

Installation

Install the SDK using pip:

pip install file_genie

Prerequisites

  • Your application should be deployed on AWS EKS to enable the SDK to utilize AWS S3 credentials.
  • Python: >= 3.6
  • Pandas: 2.0.0

Getting Started

Define Custom Edge Cases: If you need to sanitize columns during file parsing (e.g., standardise column values to a common format before applying custom logic), you can define custom functions for the SDK to use.

To implement this:

  • Create an edgeCases folder in your project.
  • Add a file named user_edge_cases.py.
  • Define your custom functions in this file.
  • Reference these functions in the edge_case section of the file_config.
  • The SDK will automatically import and apply these functions during file parsing or transformation.

The import performed by the SDK is equivalent to:

from edgeCases import user_edge_cases
self.edge_cases = user_edge_cases
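
As an illustration, a user_edge_cases.py module might look like the sketch below. The source does not document the signature the SDK uses when it calls an edge-case function, so the (rows, column) signature, the list-of-dicts row representation, and the fixed conversion rate are all assumptions. Adapt them to whatever the SDK actually passes (for example a pandas DataFrame, where the same per-value logic could be applied with df[column].map(sanitize_amount)).

```python
# edgeCases/user_edge_cases.py (illustrative sketch; signatures are assumptions)
from decimal import Decimal, InvalidOperation

# Hypothetical conversion rate; a real one would come from your own config.
ASSUMED_RATE = Decimal("0.01")  # e.g. minor units to major units


def sanitize_amount(value):
    """Strip whitespace and thousands separators, then parse as Decimal.

    Returns None when the value cannot be parsed.
    """
    if value is None:
        return None
    text = str(value).strip().replace(",", "")
    try:
        return Decimal(text)
    except InvalidOperation:
        return None


def convert_amount_as_per_currency(rows, column):
    """Return new rows with `column` sanitized and multiplied by ASSUMED_RATE.

    `rows` is assumed to be a list of dicts; unparseable values become None.
    The input rows are left unmodified.
    """
    converted = []
    for row in rows:
        amount = sanitize_amount(row.get(column))
        new_row = dict(row)
        new_row[column] = None if amount is None else amount * ASSUMED_RATE
        converted.append(new_row)
    return converted
```

Keeping the per-value helper separate from the column-level function makes the parsing rule easy to unit test on its own, independent of how the SDK hands data to the edge case.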

Define the configuration required for the file parsing logic and the S3 bucket names:

    s3_config = {
        "upload_bucket": "s3_bucket_name",    # placeholder: your upload bucket name
        "download_bucket": "s3_bucket_name",  # placeholder: your download bucket name
    }
    file_config = {
        "file_source_1": {
            "read_from_s3_func": "read_complete_excel_file",
            "parameters_for_read_s3": None,
            "file_dtype": {
                "Order_Number": str,
                "Added On": str,
                "Added By": str,
            },
            "columns_mapping": {
                # "Column name in file": "Column name required in output"
                "Transaction Type": "TransactionType",
                "Cust Name": "CustomerName",
                "Cust ID": "CustomerId",
                "Transaction Amount": "Amount",
                "OrderNumber": "TransactionReference",
                "Reference ID": "CustomerReferenceId",
                "Target Date": "TargetDate",
                "TransactionDate": "TransactionDate",
                "FeeAmount": "ServiceCharge",
                "TaxAmount": "ServiceTax",
                "NetAmount": "NetAmount",
            },
            "edge_case": {
                # Key: name of an edge-case function defined in user_edge_cases.py.
                # Value: the params required by that function (e.g. a dict, list, or str).
                # Here convert_amount_as_per_currency is applied while transforming the
                # entries, and "Amount" is the param naming the column to convert.
                "convert_amount_as_per_currency": "Amount",
            },
        },
    }
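
To make the intent of columns_mapping concrete, the sketch below renames the keys of a single parsed row using a mapping shaped like the one above. It only illustrates the mapping semantics and is not the SDK's implementation; the rename_columns helper, the sample row, and the choice to keep unmapped columns unchanged are assumptions.

```python
def rename_columns(row, columns_mapping):
    """Return a copy of `row` with keys renamed per `columns_mapping`.

    Keys without a mapping are kept as-is (an assumption, not documented
    SDK behaviour).
    """
    return {columns_mapping.get(name, name): value for name, value in row.items()}


columns_mapping = {
    "Cust Name": "CustomerName",
    "Transaction Amount": "Amount",
}
row = {"Cust Name": "Asha", "Transaction Amount": "100.00", "Remarks": "ok"}
renamed = rename_columns(row, columns_mapping)
print(renamed)
# {'CustomerName': 'Asha', 'Amount': '100.00', 'Remarks': 'ok'}
```

Since the edge_case entry refers to "Amount", which is an output column name, it seems plausible that edge cases run after renaming, though the source does not state the order.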

read_from_s3_func: This field in the FileGenie configuration specifies the function used to parse a specific file type from AWS S3. Depending on the file format, choose one of the following functions:

  • readFromS3 - Parses TXT, EXCEL, CSV, XML, and PDF files.
  • readZipFromS3 - Parses ZIP files.
  • read_complete_excel_file - Parses EXCEL files containing multiple sheets.
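
Because read_from_s3_func is a plain string, a misspelled name would presumably only surface once the SDK tries to parse a file. A small pre-flight check along the following lines could catch that earlier. The validate_file_config helper is hypothetical and not part of the SDK; the set of allowed names is taken from the list above.

```python
# Reader names listed in this README; extend if the SDK adds more.
KNOWN_READERS = {"readFromS3", "readZipFromS3", "read_complete_excel_file"}


def validate_file_config(file_config):
    """Return a list of human-readable problems found in `file_config`."""
    problems = []
    for source, settings in file_config.items():
        reader = settings.get("read_from_s3_func")
        if reader is None:
            problems.append(f"{source}: missing 'read_from_s3_func'")
        elif reader not in KNOWN_READERS:
            problems.append(f"{source}: unknown reader '{reader}'")
    return problems


file_config = {
    "file_source_1": {"read_from_s3_func": "read_complete_excel_file"},
    "file_source_2": {"read_from_s3_func": "readFroms3"},  # typo on purpose
}
print(validate_file_config(file_config))
# ["file_source_2: unknown reader 'readFroms3'"]
```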

parameters_for_read_s3: This field in the FileGenie configuration specifies additional parameters required for reading the file, such as password_protected, password, and sep. The following parameters are available:

  • password_protected: Whether the file is password protected.
  • passowrd_secret_key: Secret key name for password.
  • skiprows: Rows to skip at the start.
  • sep: Delimiter for CSV parsing.
  • header: Row number(s) to use as column names.
  • has_header: Specify if the file has a header.
  • skip_header: Skip the header row during processing.
  • sheet_name: Target sheet in an Excel file.
  • parser_func: Custom parser function.
  • chunksize: Number of rows to read per chunk.
  • skip_footer: Rows to skip at the end.
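
For instance, a pipe-delimited CSV with two banner rows before the header and one trailer row might be described as follows. The parameter names come from the list above; the values, their types, and the file_source_2 key are assumptions, so check them against the SDK before relying on them.

```python
# Hypothetical parameters for a pipe-delimited CSV export.
csv_read_params = {
    "sep": "|",        # delimiter used in the file
    "skiprows": 2,     # ignore two banner lines at the top
    "header": 0,       # first remaining row holds the column names
    "skip_footer": 1,  # ignore one trailer line at the bottom
}

file_config = {
    "file_source_2": {
        "read_from_s3_func": "readFromS3",
        "parameters_for_read_s3": csv_read_params,
    }
}
```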

Import and initialise FileGenie:

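from file_genie import FileGenie

# ParsedDataResponseType must also be imported from the SDK (its import path is not shown here).
file_genie = FileGenie(config={"s3_config": s3_config, "file_config": file_config})
# file_source is presumably the key of the matching entry in file_config, e.g. "file_source_1".
parsed_data = file_genie.parse("s3://your-bucket-name/path/to/your/file.csv", file_source, ParsedDataResponseType.DATAFRAME.value)
# By default, the SDK returns the response as a DATAFRAME.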
