How to speed up creating dataframe faster for large dataset #546

Open
rvyas opened this issue Sep 18, 2022 · 1 comment

rvyas commented Sep 18, 2022

Hi,
I am creating a dataframe from 3.5M records with 25 vectors. It takes over a minute.

require 'benchmark'
require 'daru'

# Build the data: 3.5M records, each a hash with the same ~25 keys.
data = [
  {m: 'abc', a: 1.2, b: 2.1, c: 2.3},
  {m: 'xyz', a: 1.1, b: 22.1, c: 223.3}
  ...
]

# Convert from an array of hashes to a hash of arrays
vc = {}
data.first.keys.each do |ky|
  vc[ky] = data.map{|dt| dt[ky]}
end

Benchmark.bm do |x|
  x.report("df array_of_hash: ") { Daru::DataFrame.new(data, clone: false) }
  x.report("df hash_of_array: ") { Daru::DataFrame.new(vc, clone: false) }
end

##
#                              user     system      total        real
# df array_of_hash:   86.398855   0.311986  86.710841 ( 86.850770)
# df hash_of_array:   21.745897   0.027261  21.773158 ( 21.814447)

After converting the data (which itself took about a minute), it is a little faster, but 21 seconds is still a long time to create a dataframe.

Any ideas how to speed this up?
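
One thing that might be worth trying for the conversion step (not for `Daru::DataFrame.new` itself) is to replace the key-by-key `map` with a single `values_at` pass followed by `Array#transpose`. The sketch below is only an illustration: `rows_to_columns` is a hypothetical helper, not part of daru, it assumes every hash has the same keys as the first one, and it has not been benchmarked on 3.5M records. It also builds an intermediate array of rows, so memory use may go up.

```ruby
# Hypothetical helper (not part of daru): turn an array of hashes into a
# hash of arrays. Assumes every row has the same keys as the first row.
def rows_to_columns(rows)
  return {} if rows.empty?

  keys = rows.first.keys
  # One pass to pull values in a fixed key order, then a transpose
  # to flip rows into columns.
  columns = rows.map { |row| row.values_at(*keys) }.transpose
  keys.zip(columns).to_h
end

# Small stand-in for the real 3.5M-record data set.
data = [
  { m: 'abc', a: 1.2, b: 2.1,  c: 2.3 },
  { m: 'xyz', a: 1.1, b: 22.1, c: 223.3 }
]

vc = rows_to_columns(data)
```

The resulting `vc` has the same shape as the hash built by the loop above, so it could be passed to `Daru::DataFrame.new(vc, clone: false)` in the same way. Whether it is actually faster would need to be measured with `Benchmark` on the real data.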

kojix2 (Member) commented Sep 18, 2022

Unfortunately, daru currently has no active developer.
I recommend either creating your own fork under another name, such as daru2, and taking over the project, or using one of the following alternatives.

The former is recommended for general use.
The latter is a new data frame library with Apache Arrow as its backend. Its functionality may improve in the future.
