1.8.1 bugfixes & enhancements:
* #84, highlight columns based on dtype
* #92, build columns with random values
* #111, fixed syntax error in code export & str() fix for column builder names
* #113, updates to "Value Counts" chart in "Column Analysis" for number of values and ordinal entry
* #114, added "Export CSV" link & the ability to export dataframes to CSV/TSV
* #116, updated styling of github fork link
* #119, fixed bug with queries not being passed to functions
* #120, allowing for duplicates in bar charts
* added "category breakdown" in column analysis popup for float columns
* fixed bug where previous "show missing only" selection was not being recognized
Andrew Schonfeld committed Mar 28, 2020
1 parent 50baa40 commit 0e073eb
Showing 47 changed files with 2,058 additions and 573 deletions.
2 changes: 1 addition & 1 deletion .circleci/config.yml
@@ -5,7 +5,7 @@ defaults: &defaults
CIRCLE_ARTIFACTS: /tmp/circleci-artifacts
CIRCLE_TEST_REPORTS: /tmp/circleci-test-results
CODECOV_TOKEN: b0d35139-0a75-427a-907b-2c78a762f8f0
-VERSION: 1.8.0
+VERSION: 1.8.1
PANDOC_RELEASES_URL: https://github.com/jgm/pandoc/releases
steps:
- checkout
12 changes: 12 additions & 0 deletions CHANGES.md
@@ -1,5 +1,17 @@
## Changelog

### 1.8.1 (2020-3-29)
* [#92](https://github.com/man-group/dtale/issues/92), column builders for random data
* [#84](https://github.com/man-group/dtale/issues/84), highlight columns based on dtype
* [#111](https://github.com/man-group/dtale/issues/111), fix for syntax error in charts code export
* [#113](https://github.com/man-group/dtale/issues/113), updates to "Value Counts" chart in "Column Analysis" for number of values and ordinal entry
* [#114](https://github.com/man-group/dtale/issues/114), export data to CSV/TSV
* [#116](https://github.com/man-group/dtale/issues/116), updated styling for github fork link so "Code Export" is partially clickable
* [#119](https://github.com/man-group/dtale/issues/119), fixed bug with queries not being passed to functions
* [#120](https://github.com/man-group/dtale/issues/120), fix to allow duplicate x-axis entries in bar charts
* added "category breakdown" in column analysis popup for float columns
* fixed bug where previous "show missing only" selection was not being recognized

### 1.8.0 (2020-3-22)
* [#102](https://github.com/man-group/dtale/issues/102), interactive column filtering for string, date, int, float & bool
* better handling for y-axis management in charts. Now able to toggle between default, single & multi axis
34 changes: 23 additions & 11 deletions README.md
@@ -44,13 +44,14 @@ D-Tale was the product of a SAS to Python conversion. What was originally a per
- [Describe](#describe), [Filter](#filter), [Building Columns](#building-columns), [Reshape](#reshape), [Charts](#charts), [Coverage (Deprecated)](#coverage-deprecated), [Correlations](#correlations), [Heat Map](#heat-map), [Instances](#instances), [Code Exports](#code-exports), [About](#about), [Resize](#resize), [Shutdown](#shutdown)
- [Column Menu Functions](#column-menu-functions)
- [Filtering](#filtering), [Moving Columns](#moving-columns), [Hiding Columns](#hiding-columns), [Lock](#lock), [Unlock](#unlock), [Sorting](#sorting), [Formats](#formats), [Column Analysis](#column-analysis)
-  - [Menu Functions within a Jupyter Notebook](#menu-functions-within-a-jupyter-notebook)
+  - [Menu Functions Depending on Browser Dimensions](#menu-functions-depending-on-browser-dimensions)
- [For Developers](#for-developers)
- [Cloning](#cloning)
- [Running Tests](#running-tests)
- [Linting](#linting)
- [Formatting JS](#formatting-js)
- [Docker Development](#docker-development)
- [Global State/Data Storage](#global-state_data-storage)
- [Startup Behavior](#startup-behavior)
- [Documentation](#documentation)
- [Requirements](#requirements)
@@ -368,6 +369,7 @@ This video shows you how to build the following:
- Numeric: adding/subtracting two columns or columns with static values
- Bins: bucketing values using pandas cut & qcut as well as assigning custom labels
- Dates: retrieving date properties (hour, weekday, month...) as well as conversions (month end)
- Random: columns of data type (int, float, string & date) populated with random uniformly distributed values.
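The Random builder can be approximated with plain numpy/pandas. A minimal sketch under my own assumptions (the column names, ranges and string length below are illustrative, not necessarily D-Tale's defaults):

```python
import random
import string

import numpy as np
import pandas as pd

df = pd.DataFrame(index=range(5))

# int: uniform integers in [low, high)
df['rand_int'] = np.random.randint(0, 100, size=len(df))

# float: uniform floats in [low, high)
df['rand_float'] = np.random.uniform(0, 1, len(df))

# string: random fixed-length identifiers built from uppercase letters & digits
df['rand_str'] = [
    ''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(10))
    for _ in range(len(df))
]

# date: random draws from a daily date range
dates = pd.date_range('2000-01-01', '2000-12-31', freq='D')
df['rand_date'] = np.random.choice(dates, len(df))
```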

#### Reshape

@@ -476,6 +478,11 @@ d.offline_chart(chart_type='bar', x='x', y='z3', agg='sum')
```
[![](http://img.youtube.com/vi/DseSmc3fZvc/0.jpg)](http://www.youtube.com/watch?v=DseSmc3fZvc "Offline Charts Tutorial")

**Pro Tip: If you are generating offline charts in Jupyter notebooks and run out of memory, please add the following to your command line when starting Jupyter**

`--NotebookApp.iopub_data_rate_limit=1.0e10`


**Disclaimer: Long Running Chart Requests**

If you choose to build a chart that requires a lot of computational resources, it will take some time to run. Because of the way Flask & plotly/dash interact, this will block you from performing any other request until it completes. There are two courses of action in this situation:
@@ -678,18 +685,23 @@ Here's a grid of all the formats available with -123456.789 as input:
#### Column Analysis
Based on the data type of a column, different charts will be shown.

-| Data Type | Chart |
-|---------------|----------------|
-| Integer | Histogram, Value Counts|
-| Float | Value Counts |
-| Date | Value Counts |
-| String | Value Counts |
+| Chart | Data Types | Sample |
+|---------------|----------------|--------|
+| Histogram | Float, Int |![](https://raw.githubusercontent.com/aschonfeld/dtale-media/master/images/analysis/histogram.PNG)|
+| Value Counts | Int, String, Bool, Date, Category|![](https://raw.githubusercontent.com/aschonfeld/dtale-media/master/images/analysis/value_counts.PNG)|
+| Category | Float |![](https://raw.githubusercontent.com/aschonfeld/dtale-media/master/images/analysis/category.PNG)|

-*Histograms* can be displayed in any number of bins (default: 20), simply type a new integer value in the bins input
-![](https://raw.githubusercontent.com/aschonfeld/dtale-media/master/images/Histogram.png)
-*Value Counts* are a bar chart containing the counts of each unique value in a column.
+**Histogram** can be displayed with any number of bins (default: 20); simply type a new integer value in the bins input

+**Value Counts** by default shows the top 100 values ranked by frequency. If you would like to show the least frequent values, simply make your number negative (-10 => 10 least frequent values)

+**Value Counts w/ Ordinal** you can also apply an ordinal to your **Value Counts** chart by selecting a column (of type int or float) and an aggregation (default: sum) to apply to it (sum, mean, etc...). This column is grouped by the column you're analyzing, and the value produced by the aggregation is used to sort your bars and is also displayed as a line. Here's an example:

+![](https://raw.githubusercontent.com/aschonfeld/dtale-media/master/images/analysis/value_counts_ordinal.PNG)

+**Category (Category Breakdown)** when viewing float columns you can also see them broken down by a categorical column (string, date, int, etc...). Selecting a category column displays the frequency of each category as a line, along with bars for the float column you're analyzing, grouped by that category and computed with your aggregation (default: mean).
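The three chart types above boil down to a few pandas operations. A rough sketch with illustrative data (the frame and column names are mine, not from D-Tale):

```python
import pandas as pd

df = pd.DataFrame({
    'cat': ['a', 'a', 'b', 'c', 'c', 'c'],
    'val': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# Value Counts: top-N unique values ranked by frequency
top = df['cat'].value_counts().head(2)

# Value Counts w/ Ordinal: aggregate another column per value and sort the bars by it
ordinal = df.groupby('cat')['val'].sum().sort_values()

# Category breakdown: frequency of each category (line) plus the aggregated float (bars)
breakdown = df.groupby('cat')['val'].agg(['count', 'mean'])
```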

### Menu Functions Depending on Browser Dimensions
Depending on the dimensions of your browser window, the following buttons will not open modals but rather separate browser windows: Correlations, Describe & Instances (see images from [Jupyter Notebook](#jupyter-notebook); Charts will always open in a separate browser window)
@@ -786,7 +798,7 @@ $ python
Then view your D-Tale instance in your browser using the link that gets printed


-### Global State/Data Storage
+## Global State/Data Storage

If D-Tale is running in an environment with multiple python processes (ex: on a web server running [gunicorn](https://github.com/benoitc/gunicorn)) it will most likely encounter issues with inconsistent state. Developers can fix this by configuring the system D-Tale uses for storing data. Detailed documentation is available here: [Data Storage and managing Global State](https://github.com/man-group/dtale/blob/master/docs/GLOBAL_STATE.md)

2 changes: 1 addition & 1 deletion docker/2_7/Dockerfile
@@ -44,4 +44,4 @@ WORKDIR /app

RUN set -eux \
; . /root/.bashrc \
-; easy_install dtale-1.8.0-py2.7.egg
+; easy_install dtale-1.8.1-py2.7.egg
2 changes: 1 addition & 1 deletion docker/3_6/Dockerfile
@@ -44,4 +44,4 @@ WORKDIR /app

RUN set -eux \
; . /root/.bashrc \
-; easy_install dtale-1.8.0-py3.7.egg
+; easy_install dtale-1.8.1-py3.7.egg
4 changes: 2 additions & 2 deletions docs/source/conf.py
@@ -64,9 +64,9 @@
# built documents.
#
# The short X.Y version.
-version = u'1.8.0'
+version = u'1.8.1'
# The full version, including alpha/beta/rc tags.
-release = u'1.8.0'
+release = u'1.8.1'

# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
19 changes: 13 additions & 6 deletions dtale/charts/utils.py
@@ -198,8 +198,10 @@ def check_exceptions(df, allow_duplicates, unlimited_data=False, data_limit=1500
    :raises Exception: if any failure condition is met
    """
    if not allow_duplicates and any(df.duplicated()):
-        raise Exception(
-            '{} contains duplicates, please specify group or additional filtering'.format(', '.join(df.columns)))
+        raise Exception((
+            "{} contains duplicates, please specify group or additional filtering or select 'No Aggregation' from"
+            ' Aggregation drop-down.'
+        ).format(', '.join(df.columns)))
    if not unlimited_data and len(df) > data_limit:
        raise Exception(limit_msg.format(data_limit))
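The duplicate guard above hinges on `DataFrame.duplicated`. A minimal standalone illustration (the sample data is mine, not from the repo):

```python
import pandas as pd

# toy frame containing one fully duplicated row
df = pd.DataFrame({'x': [1, 1, 2], 'y': [10, 10, 20]})

# mirrors the guard: any fully duplicated row would trigger the exception
# unless duplicates are explicitly allowed (e.g. via 'No Aggregation')
has_dupes = bool(df.duplicated().any())
deduped_ok = not df.drop_duplicates().duplicated().any()
```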

@@ -225,7 +227,8 @@ def build_agg_data(df, x, y, inputs, agg, z=None):
    :return: dataframe of aggregated data
    :rtype: :class:`pandas:pandas.DataFrame`
    """
    if agg == 'raw':
        return df, []
    z_exists = len(make_list(z))
    if agg == 'corr':
        if not z_exists:
@@ -256,7 +259,7 @@
        groups = df.groupby(x)
        return getattr(groups[y], agg)().reset_index(), [
            "chart_data = chart_data.groupby('{x}')[['{y}']].{agg}().reset_index()".format(
-                x=x, y=y, agg=agg
+                x=x, y=make_list(y)[0], agg=agg
            )
        ]

@@ -365,8 +368,12 @@ def _group_filter():
    code.append("chart_data = chart_data.dropna()")

    dupe_cols = [x_col] + (y_cols if len(z_cols) else [])
-    check_exceptions(data[dupe_cols].rename(columns={'x': x}), allow_duplicates, unlimited_data=unlimited_data,
-                     data_limit=40000 if len(z_cols) else 15000)
+    check_exceptions(
+        data[dupe_cols].rename(columns={'x': x}),
+        allow_duplicates or agg == 'raw',
+        unlimited_data=unlimited_data,
+        data_limit=40000 if len(z_cols) else 15000
+    )
    data_f, range_f = build_formatters(data)
    ret_data = dict(
        data={str('all'): data_f.format_lists(data)},
132 changes: 131 additions & 1 deletion dtale/column_builders.py
Original file line number Diff line number Diff line change
@@ -1,3 +1,6 @@
import random
import string

import numpy as np
import pandas as pd

@@ -14,8 +17,10 @@ def __init__(self, data_id, column_type, name, cfg):
            self.builder = DatetimeColumnBuilder(name, cfg)
        elif column_type == 'bins':
            self.builder = BinsColumnBuilder(name, cfg)
+        elif column_type == 'random':
+            self.builder = RandomColumnBuilder(name, cfg)
        else:
-            raise NotImplementedError('{} column builder not implemented yet!'.format(column_type))
+            raise NotImplementedError("'{}' column builder not implemented yet!".format(column_type))

    def build_column(self):
        data = global_state.get_data(self.data_id)
@@ -129,3 +134,128 @@ def build_code(self):
        s_str = "df.loc[:, '{name}'] = pd.Series({name}_data.cat.codes.map({name}_cats), index=df.index, name='{name}')"
        bins_code.append(s_str.format(name=self.name))
        return '\n'.join(bins_code)


def id_generator(size=10, chars=string.ascii_uppercase + string.digits):
    return ''.join(random.choice(chars) for _ in range(int(size)))


class RandomColumnBuilder(object):

    def __init__(self, name, cfg):
        self.name = name
        self.cfg = cfg

    def build_column(self, data):
        rand_type = self.cfg['type']
        if 'string' == rand_type:
            kwargs = dict(size=self.cfg.get('length', 10))
            if self.cfg.get('chars'):
                kwargs['chars'] = self.cfg['chars']
            return pd.Series(
                [id_generator(**kwargs) for _ in range(len(data))], index=data.index, name=self.name
            )
        if 'int' == rand_type:
            low = self.cfg.get('low', 0)
            high = self.cfg.get('high', 100)
            return pd.Series(
                np.random.randint(low, high=high, size=len(data)), index=data.index, name=self.name
            )
        if 'date' == rand_type:
            start = pd.Timestamp(self.cfg.get('start') or '19000101')
            end = pd.Timestamp(self.cfg.get('end') or '21991231')
            business_days = self.cfg.get('businessDay') is True
            timestamps = self.cfg.get('timestamps') is True
            if timestamps:
                def pp(start, end, n):
                    start_u = start.value // 10 ** 9
                    end_u = end.value // 10 ** 9
                    return pd.DatetimeIndex(
                        (10 ** 9 * np.random.randint(start_u, end_u, n)).view('M8[ns]')
                    )

                dates = pp(pd.Timestamp(start), pd.Timestamp(end), len(data))
            else:
                dates = pd.date_range(start, end, freq='B' if business_days else 'D')
                dates = [dates[i] for i in np.random.randint(0, len(dates) - 1, size=len(data))]
            return pd.Series(dates, index=data.index, name=self.name)
        if 'bool' == rand_type:
            return pd.Series(np.random.choice([True, False], len(data)), index=data.index, name=self.name)
        if 'choice' == rand_type:
            choices = self.cfg.get('choices') or 'a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z'
            choices = choices.split(',')
            return pd.Series(np.random.choice(choices, len(data)), index=data.index, name=self.name)

        # floats
        low = self.cfg.get('low', 0)
        high = self.cfg.get('high', 1)
        return pd.Series(np.random.uniform(low, high, len(data)), index=data.index, name=self.name)

    def build_code(self):
        rand_type = self.cfg['type']
        if 'string' == rand_type:
            kwargs = []
            if self.cfg.get('length') != 10:
                kwargs.append('size={size}'.format(size=self.cfg.get('length')))
            if self.cfg.get('chars'):
                kwargs.append("chars='{chars}'".format(chars=self.cfg.get('chars')))
            kwargs = ', '.join(kwargs)
            return (
                'import string\nimport random\n\n'
                'def id_generator(size=10, chars=string.ascii_uppercase + string.digits):\n'
                "\treturn ''.join(random.choice(chars) for _ in range(size))\n\n"
                "df.loc[:, '{name}'] = pd.Series([id_generator({kwargs}) for _ in range(len(df))], index=df.index)"
            ).format(kwargs=kwargs, name=self.name)

        if 'bool' == rand_type:
            return (
                "df.loc[:, '{name}'] = pd.Series(np.random.choice([True, False], len(df)), index=df.index)"
            ).format(name=self.name)
        if 'date' == rand_type:
            start = pd.Timestamp(self.cfg.get('start') or '19000101')
            end = pd.Timestamp(self.cfg.get('end') or '21991231')
            business_days = self.cfg.get('businessDay') is True
            timestamps = self.cfg.get('timestamps') is True
            if timestamps:
                code = (
                    'def pp(start, end, n):\n'
                    '\tstart_u = start.value // 10 ** 9\n'
                    '\tend_u = end.value // 10 ** 9\n'
                    '\treturn pd.DatetimeIndex(\n'
                    "\t\t(10 ** 9 * np.random.randint(start_u, end_u, n, dtype=np.int64)).view('M8[ns]')\n"
                    ')\n\n'
                    "df.loc[:, '{name}'] = pd.Series(\n"
                    "\tpp(pd.Timestamp('{start}'), pd.Timestamp('{end}'), len(df)), index=df.index\n"
                    ')'
                ).format(name=self.name, start=start.strftime('%Y%m%d'), end=end.strftime('%Y%m%d'))
            else:
                freq = ", freq='B'" if business_days else ''
                code = (
                    "dates = pd.date_range('{start}', '{end}'{freq})\n"
                    'dates = [dates[i] for i in np.random.randint(0, len(dates) - 1, size=len(df))]\n'
                    "df.loc[:, '{name}'] = pd.Series(dates, index=df.index)"
                ).format(name=self.name, start=start.strftime('%Y%m%d'), end=end.strftime('%Y%m%d'), freq=freq)
            return code
        if 'choice' == rand_type:
            choices = self.cfg.get('choices') or 'a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z'
            choices = choices.split(',')
            return "df.loc[:, '{name}'] = pd.Series(np.random.choice(['{choices}'], len(df)), index=df.index)".format(
                choices="', '".join(choices), name=self.name
            )

        if 'int' == rand_type:
            low = self.cfg.get('low', 0)
            high = self.cfg.get('high', 100)
            return (
                'import numpy as np\n\n'
                "df.loc[:, '{name}'] = pd.Series(np.random.randint({low}, high={high}, size=len(df)), "
                'index=df.index)'
            ).format(name=self.name, low=low, high=high)

        low = self.cfg.get('low', 0)
        high = self.cfg.get('high', 1)
        return (
            'import numpy as np\n\n'
            "df.loc[:, '{name}'] = pd.Series(np.random.uniform({low}, {high}, len(df)), index=df.index)"
        ).format(low=low, high=high, name=self.name)
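The `pp` helper in `build_column` above is worth unpacking: it draws uniform epoch seconds between the two bounds and reinterprets them as nanosecond-precision datetimes. A standalone sketch (adding the `dtype=np.int64` that the code-export variant in this diff uses, to avoid overflow on platforms where the default integer is 32-bit):

```python
import numpy as np
import pandas as pd

def random_timestamps(start, end, n):
    # draw uniform epoch seconds, scale to nanoseconds, reinterpret as datetime64[ns]
    start_u = start.value // 10 ** 9
    end_u = end.value // 10 ** 9
    secs = np.random.randint(start_u, end_u, n, dtype=np.int64)
    return pd.DatetimeIndex((10 ** 9 * secs).view('M8[ns]'))

ts = random_timestamps(pd.Timestamp('2020-01-01'), pd.Timestamp('2020-12-31'), 4)
```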