1.8.1 bugfixes & enhancements:
* #84, highlight columns based on dtype
* #92, build columns with random values
* #111, fixed syntax error in code export & str() fix for column builder names
* #113, updates to "Value Counts" chart in "Column Analysis" for number of values and ordinal entry
* #114, added "Export CSV" link & the ability to export dataframes to CSV/TSV
* #116, updated styling of github fork link
* #119, fixed bug with queries not being passed to functions
* #120, allowing for duplicates in bar charts
* added "category breakdown" in column analysis popup for float columns
* fixed bug where previous "show missing only" selection was not being recognized
Andrew Schonfeld committed Mar 28, 2020
1 parent 50baa40 commit 0e073eb
Showing 47 changed files with 2,058 additions and 573 deletions.
2 changes: 1 addition & 1 deletion .circleci/config.yml
@@ -5,7 +5,7 @@ defaults: &defaults
CIRCLE_ARTIFACTS: /tmp/circleci-artifacts
CIRCLE_TEST_REPORTS: /tmp/circleci-test-results
CODECOV_TOKEN: b0d35139-0a75-427a-907b-2c78a762f8f0
-VERSION: 1.8.0
+VERSION: 1.8.1
PANDOC_RELEASES_URL: https://github.com/jgm/pandoc/releases
steps:
- checkout
12 changes: 12 additions & 0 deletions CHANGES.md
@@ -1,5 +1,17 @@
## Changelog

### 1.8.1 (2020-3-29)
* [#92](https://github.com/man-group/dtale/issues/92), column builders for random data
* [#84](https://github.com/man-group/dtale/issues/84), highlight columns based on dtype
* [#111](https://github.com/man-group/dtale/issues/111), fix for syntax error in charts code export
* [#113](https://github.com/man-group/dtale/issues/113), updates to "Value Counts" chart in "Column Analysis" for number of values and ordinal entry
* [#114](https://github.com/man-group/dtale/issues/114), export data to CSV/TSV
* [#116](https://github.com/man-group/dtale/issues/116), updated styling for github fork link so "Code Export" is partially clickable
* [#119](https://github.com/man-group/dtale/issues/119), fixed bug with queries not being passed to functions
* [#120](https://github.com/man-group/dtale/issues/120), fix to allow duplicate x-axis entries in bar charts
* added "category breakdown" in column analysis popup for float columns
* fixed bug where previous "show missing only" selection was not being recognized

### 1.8.0 (2020-3-22)
* [#102](https://github.com/man-group/dtale/issues/102), interactive column filtering for string, date, int, float & bool
* better handling for y-axis management in charts. Now able to toggle between default, single & multi axis
34 changes: 23 additions & 11 deletions README.md
@@ -44,13 +44,14 @@ D-Tale was the product of a SAS to Python conversion. What was originally a per
- [Describe](#describe), [Filter](#filter), [Building Columns](#building-columns), [Reshape](#reshape), [Charts](#charts), [Coverage (Deprecated)](#coverage-deprecated), [Correlations](#correlations), [Heat Map](#heat-map), [Instances](#instances), [Code Exports](#code-exports), [About](#about), [Resize](#resize), [Shutdown](#shutdown)
- [Column Menu Functions](#column-menu-functions)
- [Filtering](#filtering), [Moving Columns](#moving-columns), [Hiding Columns](#hiding-columns), [Lock](#lock), [Unlock](#unlock), [Sorting](#sorting), [Formats](#formats), [Column Analysis](#column-analysis)
-  - [Menu Functions within a Jupyter Notebook](#menu-functions-within-a-jupyter-notebook)
+  - [Menu Functions Depending on Browser Dimensions](#menu-functions-depending-on-browser-dimensions)
- [For Developers](#for-developers)
- [Cloning](#cloning)
- [Running Tests](#running-tests)
- [Linting](#linting)
- [Formatting JS](#formatting-js)
- [Docker Development](#docker-development)
- [Global State/Data Storage](#global-state_data-storage)
- [Startup Behavior](#startup-behavior)
- [Documentation](#documentation)
- [Requirements](#requirements)
@@ -368,6 +369,7 @@ This video shows you how to build the following:
- Numeric: adding/subtracting two columns or columns with static values
- Bins: bucketing values using pandas cut & qcut as well as assigning custom labels
- Dates: retrieving date properties (hour, weekday, month...) as well as conversions (month end)
- Random: columns of data type (int, float, string & date) populated with random uniformly distributed values.
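The Random builder can be approximated with plain numpy/pandas. A minimal sketch under my own assumptions (the column names, ranges and string length below are illustrative, not necessarily D-Tale's defaults):

```python
import random
import string

import numpy as np
import pandas as pd

df = pd.DataFrame(index=range(5))

# int: uniform integers in [low, high)
df['rand_int'] = np.random.randint(0, 100, size=len(df))

# float: uniform floats in [low, high)
df['rand_float'] = np.random.uniform(0, 1, len(df))

# string: random fixed-length identifiers built from uppercase letters & digits
df['rand_str'] = [
    ''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(10))
    for _ in range(len(df))
]

# date: random draws from a daily date range
dates = pd.date_range('2000-01-01', '2000-12-31', freq='D')
df['rand_date'] = np.random.choice(dates, len(df))
```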

#### Reshape

@@ -476,6 +478,11 @@ d.offline_chart(chart_type='bar', x='x', y='z3', agg='sum')
```
[![](http://img.youtube.com/vi/DseSmc3fZvc/0.jpg)](http://www.youtube.com/watch?v=DseSmc3fZvc "Offline Charts Tutorial")

**Pro Tip: If you are generating offline charts in Jupyter notebooks and run out of memory, please add the following to your command line when starting Jupyter**

`--NotebookApp.iopub_data_rate_limit=1.0e10`


**Disclaimer: Long Running Chart Requests**

If you choose to build a chart that requires a lot of computational resources, it will take some time to run. Because of the way Flask & plotly/dash interact, this will block you from performing any other request until it completes. There are two courses of action in this situation:
@@ -678,18 +685,23 @@ Here's a grid of all the formats available with -123456.789 as input:
#### Column Analysis
Based on the data type of a column, different charts will be shown.

-| Data Type | Chart |
-|---------------|----------------|
-| Integer | Histogram, Value Counts|
-| Float | Value Counts |
-| Date | Value Counts |
-| String | Value Counts |
+| Chart | Data Types | Sample |
+|---------------|----------------|--------|
+| Histogram | Float, Int |![](https://raw.githubusercontent.com/aschonfeld/dtale-media/master/images/analysis/histogram.PNG)|
+| Value Counts | Int, String, Bool, Date, Category|![](https://raw.githubusercontent.com/aschonfeld/dtale-media/master/images/analysis/value_counts.PNG)|
+| Category | Float |![](https://raw.githubusercontent.com/aschonfeld/dtale-media/master/images/analysis/category.PNG)|

-*Histograms* can be displayed in any number of bins (default: 20), simply type a new integer value in the bins input
-![](https://raw.githubusercontent.com/aschonfeld/dtale-media/master/images/Histogram.png)
-*Value Counts* are a bar chart containing the counts of each unique value in a column.
+**Histogram** can be displayed with any number of bins (default: 20); simply type a new integer value in the bins input

+**Value Counts** by default shows the top 100 values ranked by frequency. If you would like to show the least frequent values, simply make your number negative (-10 => 10 least frequent values)

+**Value Counts w/ Ordinal** you can also apply an ordinal to your **Value Counts** chart by selecting a column (of type int or float) and an aggregation (default: sum) to apply to it (sum, mean, etc...). This column is grouped by the column you're analyzing, and the value produced by the aggregation is used to sort your bars and is also displayed as a line. Here's an example:

+![](https://raw.githubusercontent.com/aschonfeld/dtale-media/master/images/analysis/value_counts_ordinal.PNG)

+**Category (Category Breakdown)** when viewing float columns you can also see them broken down by a categorical column (string, date, int, etc...). Selecting a category column displays the frequency of each category as a line, along with bars for the float column you're analyzing, grouped by that category and computed with your aggregation (default: mean).
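The three chart types above boil down to a few pandas operations. A rough sketch with illustrative data (the frame and column names are mine, not from D-Tale):

```python
import pandas as pd

df = pd.DataFrame({
    'cat': ['a', 'a', 'b', 'c', 'c', 'c'],
    'val': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# Value Counts: top-N unique values ranked by frequency
top = df['cat'].value_counts().head(2)

# Value Counts w/ Ordinal: aggregate another column per value and sort the bars by it
ordinal = df.groupby('cat')['val'].sum().sort_values()

# Category breakdown: frequency of each category (line) plus the aggregated float (bars)
breakdown = df.groupby('cat')['val'].agg(['count', 'mean'])
```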

### Menu Functions Depending on Browser Dimensions
Depending on the dimensions of your browser window, the following buttons will not open modals but rather separate browser windows: Correlations, Describe & Instances (see images from [Jupyter Notebook](#jupyter-notebook); Charts will always open in a separate browser window)
@@ -786,7 +798,7 @@ $ python
Then view your D-Tale instance in your browser using the link that gets printed


-### Global State/Data Storage
+## Global State/Data Storage

If D-Tale is running in an environment with multiple python processes (ex: on a web server running [gunicorn](https://github.com/benoitc/gunicorn)) it will most likely encounter issues with inconsistent state. Developers can fix this by configuring the system D-Tale uses for storing data. Detailed documentation is available here: [Data Storage and managing Global State](https://github.com/man-group/dtale/blob/master/docs/GLOBAL_STATE.md)

2 changes: 1 addition & 1 deletion docker/2_7/Dockerfile
@@ -44,4 +44,4 @@ WORKDIR /app

RUN set -eux \
; . /root/.bashrc \
-; easy_install dtale-1.8.0-py2.7.egg
+; easy_install dtale-1.8.1-py2.7.egg
2 changes: 1 addition & 1 deletion docker/3_6/Dockerfile
@@ -44,4 +44,4 @@ WORKDIR /app

RUN set -eux \
; . /root/.bashrc \
-; easy_install dtale-1.8.0-py3.7.egg
+; easy_install dtale-1.8.1-py3.7.egg
4 changes: 2 additions & 2 deletions docs/source/conf.py
@@ -64,9 +64,9 @@
# built documents.
#
# The short X.Y version.
-version = u'1.8.0'
+version = u'1.8.1'
# The full version, including alpha/beta/rc tags.
-release = u'1.8.0'
+release = u'1.8.1'

# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
19 changes: 13 additions & 6 deletions dtale/charts/utils.py
@@ -198,8 +198,10 @@ def check_exceptions(df, allow_duplicates, unlimited_data=False, data_limit=1500
    :raises Exception: if any failure condition is met
    """
    if not allow_duplicates and any(df.duplicated()):
-        raise Exception(
-            '{} contains duplicates, please specify group or additional filtering'.format(', '.join(df.columns)))
+        raise Exception((
+            "{} contains duplicates, please specify group or additional filtering or select 'No Aggregation' from"
+            ' Aggregation drop-down.'
+        ).format(', '.join(df.columns)))
    if not unlimited_data and len(df) > data_limit:
        raise Exception(limit_msg.format(data_limit))
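The duplicate guard above hinges on `DataFrame.duplicated`. A minimal standalone illustration (the sample data is mine, not from the repo):

```python
import pandas as pd

# toy frame containing one fully duplicated row
df = pd.DataFrame({'x': [1, 1, 2], 'y': [10, 10, 20]})

# mirrors the guard: any fully duplicated row would trigger the exception
# unless duplicates are explicitly allowed (e.g. via 'No Aggregation')
has_dupes = bool(df.duplicated().any())
deduped_ok = not df.drop_duplicates().duplicated().any()
```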

@@ -225,7 +227,8 @@ def build_agg_data(df, x, y, inputs, agg, z=None):
    :return: dataframe of aggregated data
    :rtype: :class:`pandas:pandas.DataFrame`
    """
    if agg == 'raw':
        return df, []
    z_exists = len(make_list(z))
    if agg == 'corr':
        if not z_exists:
@@ -256,7 +259,7 @@
        groups = df.groupby(x)
        return getattr(groups[y], agg)().reset_index(), [
            "chart_data = chart_data.groupby('{x}')[['{y}']].{agg}().reset_index()".format(
-                x=x, y=y, agg=agg
+                x=x, y=make_list(y)[0], agg=agg
            )
        ]

@@ -365,8 +368,12 @@ def _group_filter():
    code.append("chart_data = chart_data.dropna()")

    dupe_cols = [x_col] + (y_cols if len(z_cols) else [])
-    check_exceptions(data[dupe_cols].rename(columns={'x': x}), allow_duplicates, unlimited_data=unlimited_data,
-                     data_limit=40000 if len(z_cols) else 15000)
+    check_exceptions(
+        data[dupe_cols].rename(columns={'x': x}),
+        allow_duplicates or agg == 'raw',
+        unlimited_data=unlimited_data,
+        data_limit=40000 if len(z_cols) else 15000
+    )
    data_f, range_f = build_formatters(data)
    ret_data = dict(
        data={str('all'): data_f.format_lists(data)},
132 changes: 131 additions & 1 deletion dtale/column_builders.py
Original file line number Diff line number Diff line change
@@ -1,3 +1,6 @@
import random
import string

import numpy as np
import pandas as pd

@@ -14,8 +17,10 @@ def __init__(self, data_id, column_type, name, cfg):
            self.builder = DatetimeColumnBuilder(name, cfg)
        elif column_type == 'bins':
            self.builder = BinsColumnBuilder(name, cfg)
+        elif column_type == 'random':
+            self.builder = RandomColumnBuilder(name, cfg)
        else:
-            raise NotImplementedError('{} column builder not implemented yet!'.format(column_type))
+            raise NotImplementedError("'{}' column builder not implemented yet!".format(column_type))

    def build_column(self):
        data = global_state.get_data(self.data_id)
@@ -129,3 +134,128 @@ def build_code(self):
        s_str = "df.loc[:, '{name}'] = pd.Series({name}_data.cat.codes.map({name}_cats), index=df.index, name='{name}')"
        bins_code.append(s_str.format(name=self.name))
        return '\n'.join(bins_code)


def id_generator(size=10, chars=string.ascii_uppercase + string.digits):
    return ''.join(random.choice(chars) for _ in range(int(size)))


class RandomColumnBuilder(object):

    def __init__(self, name, cfg):
        self.name = name
        self.cfg = cfg

    def build_column(self, data):
        rand_type = self.cfg['type']
        if 'string' == rand_type:
            kwargs = dict(size=self.cfg.get('length', 10))
            if self.cfg.get('chars'):
                kwargs['chars'] = self.cfg['chars']
            return pd.Series(
                [id_generator(**kwargs) for _ in range(len(data))], index=data.index, name=self.name
            )
        if 'int' == rand_type:
            low = self.cfg.get('low', 0)
            high = self.cfg.get('high', 100)
            return pd.Series(
                np.random.randint(low, high=high, size=len(data)), index=data.index, name=self.name
            )
        if 'date' == rand_type:
            start = pd.Timestamp(self.cfg.get('start') or '19000101')
            end = pd.Timestamp(self.cfg.get('end') or '21991231')
            business_days = self.cfg.get('businessDay') is True
            timestamps = self.cfg.get('timestamps') is True
            if timestamps:
                def pp(start, end, n):
                    start_u = start.value // 10 ** 9
                    end_u = end.value // 10 ** 9
                    return pd.DatetimeIndex(
                        (10 ** 9 * np.random.randint(start_u, end_u, n)).view('M8[ns]')
                    )

                dates = pp(pd.Timestamp(start), pd.Timestamp(end), len(data))
            else:
                dates = pd.date_range(start, end, freq='B' if business_days else 'D')
                dates = [dates[i] for i in np.random.randint(0, len(dates) - 1, size=len(data))]
            return pd.Series(dates, index=data.index, name=self.name)
        if 'bool' == rand_type:
            return pd.Series(np.random.choice([True, False], len(data)), index=data.index, name=self.name)
        if 'choice' == rand_type:
            choices = self.cfg.get('choices') or 'a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z'
            choices = choices.split(',')
            return pd.Series(np.random.choice(choices, len(data)), index=data.index, name=self.name)

        # floats
        low = self.cfg.get('low', 0)
        high = self.cfg.get('high', 1)
        return pd.Series(np.random.uniform(low, high, len(data)), index=data.index, name=self.name)

    def build_code(self):
        rand_type = self.cfg['type']
        if 'string' == rand_type:
            kwargs = []
            if self.cfg.get('length') != 10:
                kwargs.append('size={size}'.format(size=self.cfg.get('length')))
            if self.cfg.get('chars'):
                kwargs.append("chars='{chars}'".format(chars=self.cfg.get('chars')))
            kwargs = ', '.join(kwargs)
            return (
                'import string\nimport random\n\n'
                'def id_generator(size=10, chars=string.ascii_uppercase + string.digits):\n'
                "\treturn ''.join(random.choice(chars) for _ in range(size))\n\n"
                "df.loc[:, '{name}'] = pd.Series([id_generator({kwargs}) for _ in range(len(df))], index=df.index)"
            ).format(kwargs=kwargs, name=self.name)

        if 'bool' == rand_type:
            return (
                "df.loc[:, '{name}'] = pd.Series(np.random.choice([True, False], len(df)), index=df.index)"
            ).format(name=self.name)
        if 'date' == rand_type:
            start = pd.Timestamp(self.cfg.get('start') or '19000101')
            end = pd.Timestamp(self.cfg.get('end') or '21991231')
            business_days = self.cfg.get('businessDay') is True
            timestamps = self.cfg.get('timestamps') is True
            if timestamps:
                code = (
                    'def pp(start, end, n):\n'
                    '\tstart_u = start.value // 10 ** 9\n'
                    '\tend_u = end.value // 10 ** 9\n'
                    '\treturn pd.DatetimeIndex(\n'
                    "\t\t(10 ** 9 * np.random.randint(start_u, end_u, n, dtype=np.int64)).view('M8[ns]')\n"
                    ')\n\n'
                    "df.loc[:, '{name}'] = pd.Series(\n"
                    "\tpp(pd.Timestamp('{start}'), pd.Timestamp('{end}'), len(df)), index=df.index\n"
                    ')'
                ).format(name=self.name, start=start.strftime('%Y%m%d'), end=end.strftime('%Y%m%d'))
            else:
                freq = ", freq='B'" if business_days else ''
                code = (
                    "dates = pd.date_range('{start}', '{end}'{freq})\n"
                    'dates = [dates[i] for i in np.random.randint(0, len(dates) - 1, size=len(df))]\n'
                    "df.loc[:, '{name}'] = pd.Series(dates, index=df.index)"
                ).format(name=self.name, start=start.strftime('%Y%m%d'), end=end.strftime('%Y%m%d'), freq=freq)
            return code
        if 'choice' == rand_type:
            choices = self.cfg.get('choices') or 'a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z'
            choices = choices.split(',')
            return "df.loc[:, '{name}'] = pd.Series(np.random.choice(['{choices}'], len(df)), index=df.index)".format(
                choices="', '".join(choices), name=self.name
            )

        if 'int' == rand_type:
            low = self.cfg.get('low', 0)
            high = self.cfg.get('high', 100)
            return (
                'import numpy as np\n\n'
                "df.loc[:, '{name}'] = pd.Series(np.random.randint({low}, high={high}, size=len(df)), "
                'index=df.index)'
            ).format(name=self.name, low=low, high=high)

        low = self.cfg.get('low', 0)
        high = self.cfg.get('high', 1)
        return (
            'import numpy as np\n\n'
            "df.loc[:, '{name}'] = pd.Series(np.random.uniform({low}, {high}, len(df)), index=df.index)"
        ).format(low=low, high=high, name=self.name)
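The `pp` helper in `build_column` above is worth unpacking: it draws uniform epoch seconds between the two bounds and reinterprets them as nanosecond-precision datetimes. A standalone sketch (adding the `dtype=np.int64` that the code-export variant in this diff uses, to avoid overflow on platforms where the default integer is 32-bit):

```python
import numpy as np
import pandas as pd

def random_timestamps(start, end, n):
    # draw uniform epoch seconds, scale to nanoseconds, reinterpret as datetime64[ns]
    start_u = start.value // 10 ** 9
    end_u = end.value // 10 ** 9
    secs = np.random.randint(start_u, end_u, n, dtype=np.int64)
    return pd.DatetimeIndex((10 ** 9 * secs).view('M8[ns]'))

ts = random_timestamps(pd.Timestamp('2020-01-01'), pd.Timestamp('2020-12-31'), 4)
```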