
Add support for reading and writing avro files #367

Open
wants to merge 13 commits into develop
Conversation


@juarezr juarezr commented Jan 30, 2020

Add support for reading and writing Avro files using FastAvro.

Avro is faster and safer than text formats such as CSV, JSON, or XML.

As Avro is typed, the field types are detected from the values. Once bonobo starts preserving types, those could be used to determine the field types.

Tested with the workflow mysql -> sqlalchemy -> bonobo -> avro.

Publishing now to gather suggestions.
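
For reference, a minimal sketch of how the writer might be used in a graph, assuming the PR exposes AvroWriter at the top level like the existing CsvWriter (the import path and sample rows are illustrative):

import bonobo
from bonobo import AvroWriter  # assumption: exposed at top level like bonobo.CsvWriter

def extract():
    # illustrative rows; in practice these come from a query or another node
    yield '011990-99999', 22, 59.1
    yield '011990-99998', 25, 62.4

graph = bonobo.Graph(
    extract,
    AvroWriter('weather.avro'),  # drop-in replacement for bonobo.CsvWriter('weather.csv')
)

if __name__ == '__main__':
    bonobo.run(graph)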


lgtm-com bot commented Jan 30, 2020

This pull request introduces 2 alerts when merging 078fccc into 274058d - view on LGTM.com

new alerts:

  • 1 for Signature mismatch in overriding method
  • 1 for Unused local variable


lgtm-com bot commented Jan 30, 2020

This pull request introduces 1 alert when merging d62465a into 274058d - view on LGTM.com

new alerts:

  • 1 for Signature mismatch in overriding method

Member

hartym commented Feb 1, 2020

Hello,

This should be an optional dependency; I don't think 90% of users want to install this dep if they are not going to use it.

Thanks a lot for your contribution. What's the status of the code? Should I test it already?


lgtm-com bot commented Feb 2, 2020

This pull request introduces 1 alert when merging adc4431 into 274058d - view on LGTM.com

new alerts:

  • 1 for Signature mismatch in overriding method

Author

juarezr commented Feb 2, 2020

Regarding the status of the code:

  • Code is working
    • The reading and writing of avro files is working fine
    • Tested with a pipeline querying a remote mysql database and writing to avro files
    • It has 2 tests for basic reading and writing
    • Missing user documentation and more test cases
  • Some guidance would be appreciated on:
  • Avro columns/fields are typed, but I couldn't find how bonobo transfers/preserves value type information. I concluded that bonobo doesn't have this feature.
    • A case for type preservation is data transfer between RDBMS servers.
    • To solve this, the code has two strategies:
      • a fields property for specifying the type of the fields
      • automatic type detection based on the first row/record

Regarding optional dependency:

  • I agree that avro support should be optional
  • The strategy that I was thinking is:
    • if you don't use AvroReader/AvroWriter you are not affected
    • if you do use them, you should also pip install fastavro; otherwise it will fail (a minimal lazy-import sketch follows this list)
  • Right now I am fighting with the CI tests on GitHub missing the fastavro lib.
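
One possible way to implement the optional dependency (a sketch, not necessarily how this PR does it) is to import fastavro lazily inside the node and raise a helpful error when it is missing:

def _require_fastavro():
    """Import fastavro on first use so bonobo itself does not depend on it."""
    try:
        import fastavro
    except ImportError as exc:
        raise ImportError(
            'AvroReader/AvroWriter require the optional "fastavro" package; '
            'install it with: pip install fastavro'
        ) from exc
    return fastavro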

If you want to test it, just pip install fastavro in an environment and replace a CsvWriter with an AvroWriter.

What do you think?

Thanks

Member

hartym commented Feb 3, 2020

Avro columns/fields are typed, but I couldn't find how bonobo transfers/preserves value type information. I concluded that bonobo doesn't have this feature.

Bonobo uses python's type system, which does not allow implicit conversions.

To solve this, the code has two strategies: a fields property for specifying the type of the fields, and automatic type detection based on the first row/record

Not sure what "code" you're talking about, but this is not far from what bonobo does (see NodeExecutionContext).

It also seems that the tests should do a bit more: you need to "assert" things so pytest will actually check something. I'll have a try as soon as possible.

Thanks !

Author

juarezr commented Feb 3, 2020

Avro columns/fields are typed, but I couldn't find how bonobo transfers/preserves value type information. I concluded that bonobo doesn't have this feature.

Bonobo uses python's type system, which does not allow implicit conversions.

Let me explain in more detail what I tried to tackle.

For writing to an Avro file, one should define a schema like this:

schema = {
    'doc': 'A weather reading.',
    'name': 'Weather',
    'namespace': 'test',
    'type': 'record',
    'fields': [
        {'name': 'station', 'type': 'string'},
        # logicalType must annotate the type itself, not the field
        {'name': 'day', 'type': {'type': 'int', 'logicalType': 'date'}},
        {'name': 'time', 'type': {'type': 'long', 'logicalType': 'time-micros'}},
        {'name': 'temp', 'type': 'int'},
        {'name': 'umidity', 'type': {'type': 'bytes', 'logicalType': 'decimal', 'precision': 4, 'scale': 2}},
    ],
}
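
For context, writing records against that schema with fastavro looks roughly like this (the sample values are made up; once the schema is parsed, fastavro's logical type support accepts date, time and Decimal objects directly):

from datetime import date, time
from decimal import Decimal
from fastavro import parse_schema, writer

records = [
    {
        'station': '011990-99999',
        'day': date(2019, 12, 25),
        'time': time(21, 22, 23),
        'temp': 22,
        'umidity': Decimal('59.10'),
    },
]

with open('weather.avro', 'wb') as out:
    # parse_schema resolves the logical types before writing
    writer(out, parse_schema(schema), records)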

The workflow that I was trying to handle is the most common in ETL processing:

  1. Querying a database in a RDBMS with a query like select * from mytable
  2. Doing some transformation in the rows of data:
    a. adding, converting, joining and splitting fields from a row
    b. joining and splitting a row in a one-to-many transformation
    c. joining data from other flow/file/query into the row (like descriptions of a key)
  3. Writing the rows/records to the Avro file with the exact or approximate field types

However, at step 3, there was no type information available for creating the required schema.

The problem happens because Python's type system may not be enough to represent the richness of types in an RDBMS database:

  • Using a Python float (double) for values like money or units of measure can lead to rounding and imprecision that affect the values.
  • The value 0.1 (10%), among many others, does not exist exactly in floating-point representation (see the snippet after this list).
  • Any reasonable person will complain about wrong values in their paycheck, for example.
  • The issue does not show up with CSV because everything is converted to ASCII text.
  • But formats like Avro, Parquet, ORC, Protobuf, etc. were designed to exchange field types, because developers suffered from the loss of information and the performance of CSV.
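
A quick illustration of the rounding issue and of how decimal.Decimal avoids it:

from decimal import Decimal

print(0.1 + 0.2)                        # 0.30000000000000004 (binary float rounding)
print(Decimal('0.1') + Decimal('0.2'))  # 0.3 (exact decimal arithmetic)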

Using the current value types given by bonobo, I got the following types working:

  • string
  • float (double, 64bits)
  • int (int64)
  • bytes (blob)

But I struggled with the following types:

  • integer types: int8, int16, int32 as storage size matters
  • half float: float32 promoted to float64
  • exact types: decimal(prec,dec), numeric(prec,dec), money promoted to float64
  • date, time, datetime and timezone: worked with some adaptation
  • bit, boolean: got mapped to int

I tried to solve this in two ways:

  1. Creating a property in AvroWriter for the developer to specify a list of types
  2. Trying to autodetect the types from the first row returned by bonobo. Here I suffered from reduced type information.

I'm looking for the best way to handle this issue.

Author

juarezr commented Feb 3, 2020

It seems also that the tests should do a bit more, you need to "assert" things so pytest will actually check something.

I agree with that.
I just would like to settle the design before committing time to writing more tests.

Thanks

Author

juarezr commented Feb 3, 2020

To reproduce the issue I did the following:

  1. created a database in remotemysql.com
  2. executed the script below to create a table with some records
  3. created a graph in bonobo that:
    a. connected bonobo to the remote database with SQLAlchemy
    b. executed a query select * from testing in a Bonobo graph node
    c. used AvroWriter to write to an Avro file (see the sketch after the SQL script)
  4. used the debugger to inspect the resulting types

Notice that most types mapped to int, float, string, bytes, and datetime.datetime

create table testing (
	fboolean 	BOOLEAN,
	fchar		CHAR(10),
	fvarchar	VARCHAR(10),
	ftext		TEXT(10),
	ftinyint	TINYINT,
	fsmallint	SMALLINT,
	fmediumint	MEDIUMINT,
	fint		INT,
	fbigint		BIGINT,
	fdecimal	DECIMAL(9,2),
	ffloat		FLOAT,
	fdouble		DOUBLE,
	fbit		BIT,
	fdate		DATE,
	ftime		TIME,
	fdatetime 	DATETIME,
	ftimestamp 	TIMESTAMP,
	fyear		YEAR
);

insert into testing values(
	TRUE, 
	'abcdef', 'ghijkl', 'mnopqr', 
	1, 123, 32000, 66000, 1234567890,
	123.456, 456.789, 123.789, 1, 
	'2019-12-25', '21:22:23', '2019-12-25 21:22:23', '2019-10-25 17:22:23', 
	2019
);

insert into testing values(
	false, 
	'vidi', 'vini', 'vinci', 
	2, 121, 32023, 66066, 9876543210,
	234.567, 567.890, 234.890, 0, 
	'2019-12-15', '15:22:23', '2019-12-15 16:22:23', '2019-10-15 17:15:23', 
	2018
);
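
For reference, the graph from step 3 might look roughly like this (the connection string is a placeholder, the AvroWriter import is an assumption, and the default 'sqlalchemy.engine' service name is assumed for bonobo_sqlalchemy.Select):

import bonobo
import bonobo_sqlalchemy
import sqlalchemy

from bonobo import AvroWriter  # assumption: exposed at top level like bonobo.CsvWriter

def get_services():
    # placeholder credentials for the remotemysql.com database
    engine = sqlalchemy.create_engine('mysql+pymysql://user:password@remotemysql.com/mydb')
    return {'sqlalchemy.engine': engine}

graph = bonobo.Graph(
    bonobo_sqlalchemy.Select('SELECT * FROM testing'),
    AvroWriter('testing.avro'),
)

if __name__ == '__main__':
    bonobo.run(graph, services=get_services())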

Member

hartym commented Feb 4, 2020

I understand your point.

It should be possible to use types other than the builtins; for example, one could use decimals (https://docs.python.org/3/library/decimal.html) to avoid wrong numbers on paychecks, or numpy types to have rightly sized variables. There are two ways to do so, and I think (but I may be wrong) it's not bonobo's job to handle this (or at least, not more than providing a way to let the user do it).

Either your data producer already knows how to produce those types as output (a db driver that would yield numpy integers, for example); in that case the job's already done, and bonobo will just pass those values through. Or your data producer produces other types (assuming they do not contain unwanted approximations), and you can have a node in charge of casting things (a minimal sketch follows). This is of course less effective, but it may still work in certain situations, as it frees up memory waste for further processing, and there should be a limited amount of rows waiting to be converted. This is already something you can do in a node.
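
As a sketch of that second option (the field names and target types are made up), a casting node in bonobo is just a transformation that yields converted rows:

from decimal import Decimal

def cast_types(station, temp, humidity):
    # illustrative conversions: turn driver-provided floats/strings into exact or compact types
    yield station, int(temp), Decimal(str(humidity))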

So as I see it (but let me know if I'm wrong, you may have thought about this more than I have), there is one "correct" way, which is the responsibility of whatever talks with the data source, and one workaround, which is possible.

Am I missing something?

Or are you suggesting that you would need some metadata information storage about columns?

Author

juarezr commented Feb 14, 2020

Thanks for summarizing the implications around my point.

After reading that, I've found some new things about the issues:

  • The precision loss could be a database/driver/DBAPI issue, because the DBAPI returns floats and not decimals. But I need to investigate further with other MySQL drivers, PostgreSQL, etc.
  • SQLAlchemy provides abstractions for most common database data types, and a mechanism for specifying your own custom data types (a sketch follows).
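
For context, SQLAlchemy's hook for custom data types is TypeDecorator; a sketch (illustrative only) that hands DECIMAL values back as decimal.Decimal instead of floats:

import decimal
from sqlalchemy import types

class ExactDecimal(types.TypeDecorator):
    """Illustrative custom type: always return result values as decimal.Decimal."""
    impl = types.Numeric

    def process_result_value(self, value, dialect):
        if value is None or isinstance(value, decimal.Decimal):
            return value
        return decimal.Decimal(str(value))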

Author

juarezr commented Feb 14, 2020

Regarding the metadata information storage about columns, I think that the importance of preserving the column types depends on the type of output and on the level of effort and complexity of development and use.

So the source column type information will or won't matter depending on the use case/scenario.

For instance, considering only the output/destination:

  1. It won't matter for transferring data to untyped file formats like CSV, JSON, XML, fixed-width or other text-based output, because it is mandatory to convert the values to string/text.
  2. It won't matter for transferring data from a dbms/database table/query to another dbms/database, because the target table will already have columns with defined types and an implicit type conversion will happen.
  3. It will matter for transferring data to typed file formats like Avro, Parquet, ORC, Protobuf, Thrift or other binary-based output, because it is mandatory to specify the column types.

For instance, considering only the effort and complexity of the translation:

  • For 1 and 2, it's not worth investing time and effort, because the current behavior already works well.
  • For 3, I think the handling will depend on how one balances the effort needed for development and the burden/benefits of solving it automatically between:
    • bonobo library and bonobo developer
    • ETL developer/bonobo user

The most common ETL use cases I know are:

  • Transfer some database table/query rows into files like CSV or typed format.
  • Transfer the exported file into another database table.
  • Transfer some database table/query rows directly into another database table.

Basically we have a combination of:

  • non capable producers vs non capable consumers
  • capable producers vs non capable consumers
  • capable producers vs capable consumers
  • non capable producers vs capable consumers

Considering all this, the possible decisions are something like:

  1. Maintain the current as-is behavior and let the ETL developer specify the types for every transfer, every time.
  2. Try to help the ETL developer by detecting the types, and let them handle the cases where it doesn't fit well.
  3. Create a new transfer plane for exchanging metadata between capable producers and capable consumers.

One could think about these solutions as follows:

  • 1 could be a valid and rightful decision, as it avoids complexity and reduces source code size and clutter.
  • 3 would be the ideal scenario, but it could also be very laborious/arduous.
  • 2 could be an interesting approach and even an intermediate step towards 3.
  • 2 may be a necessary measure to handle transfers from non-capable producers to capable consumers.

What do you think?
Would any effort towards having this in bonobo be desired or acceptable?

Member

hartym commented Feb 14, 2020

Hey.

Thanks for the detailed analysis/explanation.

From what I understand, I think that all use cases are already handled by... the Python type system :) Let's say we have int16 and int32 types understood by some consumer. Then the only need is to have some type (as in a Python type) that carries enough metadata for the consumer to output the correct type.

There are two things that are not done by bonobo (but I pretty much think it's not its responsibility, although you're welcome to prove me wrong):

  • Mapping types or objects from or into a target domain (understanding that an int16 is a 16-bit integer in the Avro format, for example; this is purely conceptual, I know nothing about the Avro format and its capabilities). This is hardly doable in bonobo, because it would pretty much require knowledge about all possible target domains, which is unrealistic and not wanted anyway (it would be too generic, and real-world usage would most probably require rewriting it anyway). An example of a vanilla-python library that does this is the json library (or is it simplejson? I always forget the name). You get the mappings for most simple types built in, but if you want to convert decimals, timestamps, etc. into JSON, you need to write your own converters. Which is fine, as depending on your application, serializing a timestamp to JSON may mean a lot of different things that Python cannot guess for you. Another example (on the producer side) is the one you describe above: well thought-out libraries like SQLAlchemy allow you to map things to whatever types you need.
  • Ensuring a column's value is always of a given type at a given stage of a pipeline. This is something that bonobo does not enforce at the graph level, but you're still allowed to do it pretty easily: you can have a kind of "identity" transformation (one that yields NOT_MODIFIED every time) that adds a validation step (so in fact, it yields NOT_MODIFIED if your constraints are respected, for example type constraints, and raises an error otherwise). This should be trivial to implement (a minimal sketch follows this list), and I'm not sure a generic tool for that would be useful, but it may be (a kind of node factory that takes a sort of stream schema and does the "if" stuff; the tricky part for a generic version is "what is a schema?").
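
A minimal sketch of such a validation node, assuming a three-field row and using bonobo's NOT_MODIFIED constant:

from bonobo.constants import NOT_MODIFIED

def validate_row(station, temp, humidity):
    # pass rows through untouched, but fail fast when a type constraint is violated
    if not isinstance(station, str):
        raise TypeError('station must be a str, got %r' % type(station))
    if not isinstance(temp, int):
        raise TypeError('temp must be an int, got %r' % type(temp))
    return NOT_MODIFIED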

Do you have concrete cases that cannot be handled this way?

Also, I think you should focus on Avro-only test cases: if we are able to produce whatever the avro-related nodes expect, and we ensure the said nodes are indeed working correctly, it does not matter what kind of node (or graph) produced the data. Not sure this is something you tried to do in your tests, but as you're describing cases using remote databases, I prefer to state this; sorry if it's useless.

Sorry, I still did not find the time to actually test what your code does, but I'm on it as soon as possible. If you focus the merge request on having avro readers/writers using an optional dependency (and with tests :p), I think we can integrate it pretty soon.

If you think another topic from this discussion is worth considering, maybe we should open specific issues to discuss it?

Thanks

Author

juarezr commented Feb 19, 2020

Hi,

Regarding focus, I'm planning to continue in the following way:

  1. Change the code in the pull request to:
    a. Allow the user/dev to directly specify the desired schema.
    b. In the absence of a specified schema, try to create one from the Python types as best as one can.
  2. Test a bit with other RDBMSs like PostgreSQL, MSSQL and Firebird to catch more issues with the design.
  3. Write some test cases simulating basic functionality like extraction from an RDBMS.
  4. Add some documentation to the bonobo docs regarding output to Avro.
  5. Further discuss the type mapping/remapping in another issue, as suggested.

Also, I need some help regarding:

  • Figuring out where to place documentation for AvroWriter/AvroReader.
  • Whether the code follows the guidelines.
  • Whether there are any mistakes I should avoid.
  • Whether there is a form that performs faster or is better suited for maintenance.

Resuming the discussion: regarding type mapping, I pretty much agree with your conclusions:

  • Ensuring a column's value is always of a given type at a given stage of a pipeline probably is unnecessary and overkill.
  • Also mapping types or objects from a source domain probably is unnecessary and overkill.
  • Mapping types or objects into a target domain is what I need to write Avro files with a defined schema, but for outputs like CSV or JSON it is meaningless.

What I think is worth exploring in bonobo for the type mapping is a simpler solution like:

  1. Let bonobo and python type system continue handling field's values as is today.
  2. Do not map or convert values besides what producers have already done. This way we avoid losing precision and corrupting data.
  3. Explore ways of extracting original type information from producers that are capable of that, like RDBMS/SQLAlchemy.
  4. Allow writers, like Avro, access this type information for automatically creating their typed fields when needed.

For instance, if I have a table in MySQL created like this:

create table wrong_types (
    ftinyint    TINYINT,
    fsmallint   SMALLINT,
    fmediumint  MEDIUMINT,
    fint        INT,
    fdecimal    DECIMAL(9,2),
    ffloat      FLOAT,
    fbit        BIT
);

Today, without extra type information beyond the Python types, it's only possible to create a schema like:

schema = {
  'fields': [
      {'name': 'ftinyint', 'type': 'long'},
      {'name': 'fsmallint', 'type': 'long'},
      {'name': 'fmediumint', 'type': 'long'},
      {'name': 'fint', 'type': 'long'},
      {'name': 'fdecimal', 'type': 'double'},
      {'name': 'ffloat', 'type': 'double'},
      {'name': 'fbit', 'type': 'bytes'},
  ],
  ...
  }

But knowing the type information one could create a better, smaller and faster schema like:

schema = {
  'fields': [
      {'name': 'ftinyint', 'type': 'int'},
      {'name': 'fsmallint', 'type': 'int'},
      {'name': 'fmediumint', 'type': 'int'},
      {'name': 'fint', 'type': 'int'},
      {'name': 'fdecimal', 'type': {'type': 'bytes', 'logicalType': 'decimal', 'precision': 9, 'scale': 2}},
      {'name': 'ffloat', 'type': 'float'},
      {'name': 'fbit', 'type': 'boolean'},
  ],
  ...
  }

The biggest offender there is the mapping of DECIMAL(9,2) and BIT, because it causes loss of precision and incorrect type handling. Most of the other types only cause record/file size increases.
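
For reference, the kind of naive fallback mapping that produces the first (weaker) schema above, as a sketch under the assumption that only the Python type of the first row's values is available:

import datetime

# fallback mapping from Python types to Avro types (illustrative, not the PR's exact code)
PYTHON_TO_AVRO = {
    bool: 'boolean',
    int: 'long',
    float: 'double',
    str: 'string',
    bytes: 'bytes',
    datetime.datetime: {'type': 'long', 'logicalType': 'timestamp-micros'},
    datetime.date: {'type': 'int', 'logicalType': 'date'},
}

def guess_fields(names, first_row):
    """Build Avro field definitions from the first row's Python values."""
    return [
        {'name': name, 'type': PYTHON_TO_AVRO.get(type(value), 'string')}
        for name, value in zip(names, first_row)
    ]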

Should we continue this discussion in a separate issue?


lgtm-com bot commented Mar 7, 2020

This pull request introduces 3 alerts when merging 145de5a into 274058d - view on LGTM.com

new alerts:

  • 1 for Signature mismatch in overriding method
  • 1 for Module imports itself
  • 1 for Unused local variable


lgtm-com bot commented Mar 7, 2020

This pull request introduces 3 alerts when merging 165575d into 274058d - view on LGTM.com

new alerts:

  • 1 for Signature mismatch in overriding method
  • 1 for Module imports itself
  • 1 for Unused local variable

Author

juarezr commented Mar 7, 2020

Hi,

Can you review this pull request?

I think I've reached the point where integration can start.

thanks


lgtm-com bot commented Mar 7, 2020

This pull request introduces 2 alerts when merging 3fd0c88 into 274058d - view on LGTM.com

new alerts:

  • 1 for Signature mismatch in overriding method
  • 1 for Module imports itself

Author

juarezr commented Apr 17, 2020

Any news about this PR?
Thanks
