Create 11. Visualization Data - Data Analyst.ipynb

chuongmep · Feb 27, 2024 · 60cd4c8
1 parent 23aad3a
commit 60cd4c8
Showing 1 changed file with 375 additions and 0 deletions.
diff --git a/docs/Tutorials/11. Visualization Data - Data Analyst.ipynb b/docs/Tutorials/11. Visualization Data - Data Analyst.ipynb
@@ -0,0 +1,375 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## .NET DataFrame"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The .NET DataFrame is a data structure provided by the Microsoft.Data.Analysis library in .NET. It is designed to handle large amounts of tabular data efficiently through a familiar API. A DataFrame is similar to a table in a relational database or a data frame in R or Python (pandas).\n",
+    "\n",
+    "Here are some key features of .NET DataFrame:\n",
+    "\n",
+    "1. **Ease of use**: You can easily manipulate data and perform statistical functions on it. It supports operations like group by, join, sort, filter, and others.\n",
+    "\n",
+    "2. **High performance**: DataFrame is designed to handle large data sets. It stores data column by column in typed columns, which keeps memory use low and makes column-wise operations fast.\n",
+    "\n",
+    "3. **Flexibility**: It can handle different data types (integers, strings, floats, etc.) and allows adding, editing, or deleting columns.\n",
+    "\n",
+    "4. **Integration with .NET**: Since it's a .NET library, it can be used with other .NET libraries and tools, and it benefits from .NET's strong type checking.\n",
+    "\n",
+    "Here is a simple example of how to use it:\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {
+    "dotnet_interactive": {
+     "language": "csharp"
+    },
+    "polyglot_notebook": {
+     "kernelName": "csharp"
+    },
+    "vscode": {
+     "languageId": "polyglot-notebook"
+    }
+   },
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div><div></div><div></div><div><strong>Installed Packages</strong><ul><li><span>APSToolkit, 1.0.5</span></li></ul></div></div>"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/plain": [
+       "Loading extensions from `C:\\Users\\vho2\\.nuget\\packages\\microsoft.data.analysis\\0.21.1\\interactive-extensions\\dotnet\\Microsoft.Data.Analysis.Interactive.dll`"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    }
+   ],
+   "source": [
+    "// APSToolkit already includes the Microsoft.Data.Analysis library, so we only need to install APSToolkit\n",
+    "#r \"nuget:APSToolkit\""
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {
+    "dotnet_interactive": {
+     "language": "csharp"
+    },
+    "polyglot_notebook": {
+     "kernelName": "csharp"
+    },
+    "vscode": {
+     "languageId": "polyglot-notebook"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Name      Age       \r\n",
+      "John      33        \r\n",
+      "Bob       21        \r\n",
+      "\r\n"
+     ]
+    }
+   ],
+   "source": [
+    "using Microsoft.Data.Analysis;\n",
+    "\n",
+    "// Create a DataFrame with two columns\n",
+    "DataFrame df = new DataFrame(\n",
+    "    new StringDataFrameColumn(\"Name\", new string[] { \"John\", \"Sue\", \"Bob\" }),\n",
+    "    new Int32DataFrameColumn(\"Age\", new int[] { 33, 45, 21 })\n",
+    ");\n",
+    "\n",
+    "// Filter the data\n",
+    "DataFrame filtered = df.Filter(df[\"Age\"].ElementwiseLessThan(40));\n",
+    "\n",
+    "// Display the result\n",
+    "Console.WriteLine(filtered);"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "\n",
+    "\n",
+    "In this example, a DataFrame is created with two columns, \"Name\" and \"Age\". Then, a filter is applied to get only the rows where the age is less than 40. The result is then printed to the console."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## APSToolkit DataFrame"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "APSToolkit adds helper functions that load the data it returns into a DataFrame. Some of the possible data formats are:\n",
+    "- DataTable (System.Data.DataTable)\n",
+    "- Excel - The Excel file extracted from the data\n",
+    "- CSV - The CSV file extracted from the data\n",
+    "- Parquet - The Parquet file extracted from the data\n",
+    "- ... and more"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### DataTable\n",
+    "The DataTable is a .NET class that represents a table of in-memory data. It is a powerful and flexible data structure that can be used to store, manipulate, and analyze data. It is part of the System.Data namespace and is widely used in .NET applications for working with databases and other data sources."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {
+    "dotnet_interactive": {
+     "language": "csharp"
+    },
+    "polyglot_notebook": {
+     "kernelName": "csharp"
+    },
+    "vscode": {
+     "languageId": "polyglot-notebook"
+    }
+   },
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<table id=\"table_638446472287731250\"><thead><tr><th><i>index</i></th><th>Name</th><th>Age</th></tr></thead><tbody><tr><td><i><div class=\"dni-plaintext\"><pre>0</pre></div></i></td><td>John</td><td>33</td></tr><tr><td><i><div class=\"dni-plaintext\"><pre>1</pre></div></i></td><td>Sue</td><td>45</td></tr><tr><td><i><div class=\"dni-plaintext\"><pre>2</pre></div></i></td><td>Bob</td><td>21</td></tr></tbody></table><style>\r\n",
+       ".dni-code-hint {\r\n",
+       "    font-style: italic;\r\n",
+       "    overflow: hidden;\r\n",
+       "    white-space: nowrap;\r\n",
+       "}\r\n",
+       ".dni-treeview {\r\n",
+       "    white-space: nowrap;\r\n",
+       "}\r\n",
+       ".dni-treeview td {\r\n",
+       "    vertical-align: top;\r\n",
+       "    text-align: start;\r\n",
+       "}\r\n",
+       "details.dni-treeview {\r\n",
+       "    padding-left: 1em;\r\n",
+       "}\r\n",
+       "table td {\r\n",
+       "    text-align: start;\r\n",
+       "}\r\n",
+       "table tr { \r\n",
+       "    vertical-align: top; \r\n",
+       "    margin: 0em 0px;\r\n",
+       "}\r\n",
+       "table tr td pre \r\n",
+       "{ \r\n",
+       "    vertical-align: top !important; \r\n",
+       "    margin: 0em 0px !important;\r\n",
+       "} \r\n",
+       "table th {\r\n",
+       "    text-align: start;\r\n",
+       "}\r\n",
+       "</style>"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    }
+   ],
+   "source": [
+    "using System.Data;\n",
+    "using APSToolkit.Utils;\n",
+    "DataTable dataTable = new DataTable();\n",
+    "dataTable.Columns.Add(\"Name\", typeof(string));\n",
+    "dataTable.Columns.Add(\"Age\", typeof(int));\n",
+    "dataTable.Rows.Add(\"John\", 33);\n",
+    "dataTable.Rows.Add(\"Sue\", 45);\n",
+    "dataTable.Rows.Add(\"Bob\", 21);\n",
+    "// load into DataFrame\n",
+    "Microsoft.Data.Analysis.DataFrame df = APSToolkit.Utils.DataFrame.LoadFromDataTable(dataTable);\n",
+    "// visualize the DataFrame\n",
+    "df"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### Excel"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "APSToolkit supports loading a DataFrame from an Excel file. The function requires the file path and the name of the sheet to load."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "dotnet_interactive": {
+     "language": "csharp"
+    },
+    "polyglot_notebook": {
+     "kernelName": "csharp"
+    },
+    "vscode": {
+     "languageId": "polyglot-notebook"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "// Load an Excel sheet into a DataFrame\n",
+    "Microsoft.Data.Analysis.DataFrame df = APSToolkit.Utils.DataFrame.LoadFromExcel(\"path_to_excel.xlsx\", \"sheet_name\");"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### Parquet"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Parquet is a columnar storage file format that is optimized for use with big data processing frameworks like Apache Hadoop, Apache Spark, and others. It's compatible with most of the data processing frameworks in the Hadoop environment and is designed to perform best with complex data in bulk.\n",
+    "\n",
+    "Here are some key features of Parquet:\n",
+    "\n",
+    "1. **Columnar Storage**: Unlike row-based files like CSV or TSV, Parquet is a columnar storage file format, which allows it to provide efficient compression and encoding schemes. This structure also allows for better performance when querying data.\n",
+    "\n",
+    "2. **Schema Evolution**: Parquet supports complex nested data structures and allows for schema evolution, where you can add, remove, or modify columns.\n",
+    "\n",
+    "3. **Compression and Encoding**: Parquet provides efficient compression and encoding schemes to store data more compactly. It also allows different encoding and compression schemes to be specified for different columns.\n",
+    "\n",
+    "4. **Language Independent**: Parquet is a language-neutral format, with reader and writer libraries available for many languages, including Java, C++, Python, and .NET.\n",
+    "\n",
+    "Here is an example of how to load a Parquet file into a DataFrame:\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "dotnet_interactive": {
+     "language": "csharp"
+    },
+    "polyglot_notebook": {
+     "kernelName": "csharp"
+    },
+    "vscode": {
+     "languageId": "polyglot-notebook"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "Microsoft.Data.Analysis.DataFrame df = APSToolkit.Utils.DataFrame.LoadFromParquet(\"path_to_parquet.parquet\");"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "\n",
+    "\n",
+    "In this example, the Parquet file at the given path is loaded into a DataFrame by calling LoadFromParquet."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### Other Formats Supported by Microsoft.Data.Analysis"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "dotnet_interactive": {
+     "language": "csharp"
+    },
+    "polyglot_notebook": {
+     "kernelName": "csharp"
+    },
+    "vscode": {
+     "languageId": "polyglot-notebook"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "Microsoft.Data.Analysis.DataFrame df = Microsoft.Data.Analysis.DataFrame.LoadCsv(\"path_to_csv.csv\");\n",
+    "// Microsoft.Data.Analysis.DataFrame df = Microsoft.Data.Analysis.DataFrame.LoadCsvFromString(\"csv_string\");\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Data Visualization"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "dotnet_interactive": {
+     "language": "csharp"
+    },
+    "polyglot_notebook": {
+     "kernelName": "csharp"
+    },
+    "vscode": {
+     "languageId": "polyglot-notebook"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "// Try to count how many elements there are in each category"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Data Analysis"
+   ]
+  }
+ ],
+ "metadata": {
+  "language_info": {
+   "name": "python"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}