Show HN: An open-source tool that semantically profiles your data using LLMs

https://github.com/Cocoon-Data-Transformation/cocoon


License: MIT

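Building a chatbot for your data and pipelines is challenging because they are often too large (e.g., 1,000+ tables) to fit within the LLM context window. Cocoon addresses this by creating a RAG (retrieval-augmented generation) layer for your data and pipelines. With Cocoon's RAG, we offer a Cursor-style chatbot for your data tasks.

  • Live demo of RAG on Hubspot + Salesforce data: https://cocoon-rag-851564657364.us-east1.run.app/
  • Learn more about all the features: https://cocoon-data-transformation.github.io/page/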
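To make the retrieval idea concrete, here is a minimal sketch of picking only the relevant table descriptions before prompting an LLM. It is not Cocoon's implementation: the `catalog` contents and the functions `tokenize`, `retrieve_tables`, and `build_prompt` are hypothetical, and plain token overlap stands in for whatever retrieval method Cocoon actually uses.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def retrieve_tables(question, catalog, k=2):
    """Return up to k table names ranked by token overlap with the question."""
    question_tokens = Counter(tokenize(question))
    scored = []
    for name, description in catalog.items():
        table_tokens = Counter(tokenize(name + " " + description))
        overlap = sum(min(count, table_tokens[tok])
                      for tok, count in question_tokens.items())
        scored.append((overlap, name))
    # Highest overlap first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for score, name in scored[:k] if score > 0]


def build_prompt(question, catalog, k=2):
    """Build a prompt containing only the retrieved table descriptions."""
    selected = retrieve_tables(question, catalog, k)
    context = "\n".join(f"- {name}: {catalog[name]}" for name in selected)
    return f"Relevant tables:\n{context}\n\nQuestion: {question}"


# Hypothetical catalog; a real warehouse may have 1,000+ entries.
catalog = {
    "orders": "customer orders with order date, amount, and customer id",
    "customers": "customer names, emails, and signup date",
    "web_logs": "raw page view events with url and timestamp",
}
```

Calling `build_prompt("total order amount per customer", catalog)` would include only the `orders` and `customers` descriptions, so the prompt stays small no matter how many tables the catalog holds.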

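Get Started

  • Online service to clean your uploaded CSV: https://cocoon-data-transformation.github.io/page/clean
  • Google Colab notebook for Data Warehouse RAG: https://colab.research.google.com/github/Cocoon-Data-Transformation/cocoon/blob/main/demo/Cocoon_Stage_Demo.ipynb
  • Google Colab notebook for Data Pipeline RAG: https://colab.research.google.com/github/Cocoon-Data-Transformation/cocoon/blob/main/demo/Cocoon_RAG_pipeline.ipynb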

Cocoon is available on PyPI. Create a virtual environment, then run:

pip install cocoon_data -U

To get started, you need to connect to:

  • LLMs (e.g., GPT-4, Claude-3, Gemini-Ultra, or your local LLMs)
  • Data warehouses (e.g., Snowflake, BigQuery, DuckDB...)
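
import openai
import snowflake.connector
from cocoon_data import *

# If you use OpenAI GPT-4, set your API key ('xycabc' is a placeholder)
openai.api_key = 'xycabc'
# If you use Snowflake, open a connection (fill in your own credentials)
con = snowflake.connector.connect(...)
# Create the query helper widget and the main Cocoon workflow
query_widget, cocoon_workflow = create_cocoon_workflow(con)
# A helper widget to query your data warehouse
query_widget.display()
# The main panel to interact with Cocoon
cocoon_workflow.start()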
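The snippet above hard-codes the API key. As a hedged alternative, the sketch below reads the key from an environment variable instead. The helper name `load_api_key` is hypothetical and not part of Cocoon; `OPENAI_API_KEY` is the variable name the OpenAI client library conventionally reads.

```python
import os


def load_api_key(env_var="OPENAI_API_KEY"):
    """Return the API key stored in env_var, or raise if it is unset or blank."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return key


# Usage in the notebook (assumes the variable was exported beforehand):
#   openai.api_key = load_api_key()
```

This keeps the secret out of the notebook itself, which matters if the notebook is shared or committed to version control.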

🎉 You should then see the main Cocoon panel rendered in your notebook.

We also offer a browser UI, which supports only the chat-over-RAG feature. To launch it:

pip install cocoon_data -U
cocoon_data

You should then see the browser UI.

Posted on Hacker News by zh2408, 2024-05-03:

The problem we solve is profiling tables: this is the initial step where you need to understand the table and identify any anomalies.

During the process, many small decisions require semantic understanding. For example, missing values are normal for 'deathdate' (still alive) but abnormal for 'name.' For outliers, 100 for ages is fine, but some are -1, which is impossible! We use LLMs to semantically understand your tables and detect anomalies.

You can try it by uploading a CSV, and we will email back the profile: https://cocoon-data-transformation.github.io/page/

Let me know your feedback. Thanks!
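The semantic checks described in the post above (missing values are acceptable for 'deathdate' but not for 'name', and an age of -1 is impossible) can be sketched as a small rule table applied to rows. This is an illustration only, not Cocoon's code: the `EXPECTATIONS` dictionary is hand-written here as a stand-in for decisions an LLM would make, the upper age bound of 130 is an assumption, and `find_anomalies` is a hypothetical helper.

```python
# Hypothetical per-column expectations. In the workflow described above, an LLM
# would infer these from column names and sample values; here they are hard-coded.
EXPECTATIONS = {
    "name": {"nullable": False},
    "deathdate": {"nullable": True},                   # missing means still alive
    "age": {"nullable": False, "min": 0, "max": 130},  # 130 is an assumed bound
}


def find_anomalies(rows, expectations):
    """Return (row_index, column, reason) tuples for values that break a rule."""
    anomalies = []
    for index, row in enumerate(rows):
        for column, rule in expectations.items():
            value = row.get(column)
            if value is None or value == "":
                # A missing value is only an anomaly if the column is not nullable.
                if not rule.get("nullable", True):
                    anomalies.append((index, column, "unexpected missing value"))
                continue
            low, high = rule.get("min"), rule.get("max")
            if low is not None and value < low:
                anomalies.append((index, column, f"value {value} below {low}"))
            elif high is not None and value > high:
                anomalies.append((index, column, f"value {value} above {high}"))
    return anomalies


rows = [
    {"name": "Ada", "age": 100, "deathdate": None},       # fine: old but alive
    {"name": "", "age": 42, "deathdate": "2001-05-01"},   # missing name
    {"name": "Bo", "age": -1, "deathdate": None},         # impossible age
]
```

Running `find_anomalies(rows, EXPECTATIONS)` flags the empty name in the second row and the age of -1 in the third, while leaving the missing death dates and the age of 100 alone.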