This is Part 1 of my “Understanding Unstructured Data” series. Part 2 focuses on analyzing structured data extracted from unstructured text with a LangChain agent.

Understanding Unstructured Data | Part 1: Extraction

Confectionary Intelligence Gathering & Data Processing

Use case: Extracting Unstructured Competitive Intelligence Data with LLMs

Imagine you are a bakery, and you’ve sent out your confectioner intelligence team to gather competitor data. They report back on what the competition is up to, and they have lots of great ideas that you’d like to apply to your business. However, the data is unstructured! How can you analyze this data to understand what’s being asked for the most and best prioritize the next steps for your business?

Code Available on Github:

Note: the code is split into two files: unstructured_extraction_chain.ipynb and unstructured_pydantic.ipynb depending on the exact tool used.

AI-projects/unstructured_data at main · ingridstevens/AI-projects

To explore this use case, I’ve created a toy data set. Here’s an example datapoint in the dataset:

At Velvet Frosting Cupcakes, our team learned about the unveiling of a seasonal pastry menu that changes monthly. Introducing a rotating seasonal menu at our bakery using the “SeasonalJoy” subscription platform and adding a special touch to our cookies with the “FloralStamp” cookie stamper could keep our offerings fresh and exciting for customers.

Option 1: create_extraction_chain

We can start by looking at the data and in doing so, we can identify a rough schema — or structure — to extract. Using LangChain, we can create an extraction chain.

from langchain.chains import create_extraction_chain
# updated Mar 1 2024 to reflect updated langchain import syntax
from langchain_openai import ChatOpenAI

# Schema
schema = {
    "properties": {
        "company": {"type": "string"},
        "offering": {"type": "string"},
        "advantage": {"type": "string"},
        "products_and_services": {"type": "string"},
        "additional_details": {"type": "string"},
    }
}

…next, let’s define a few test inputs:

# Inputs
in1 = """Sweet Delights Bakery introduced lavender-infused vanilla cupcakes with a honey buttercream frosting, using the "Frosting-Spreader-3000". This innovation could inspire our next cupcake creation"""
in2 = """Whisked Away Cupcakes introduced a dessert subscription service, ensuring regular customers receive fresh batches of various sweets. Exploring a similar subscription model using the "SweetSubs" program could boost customer loyalty."""
in3 = """At Velvet Frosting Cupcakes, our team learned about the unveiling of a seasonal pastry menu that changes monthly. Introducing a rotating seasonal menu at our bakery using the "SeasonalJoy" subscription platform and adding a special touch to our cookies with the "FloralStamp" cookie stamper could keep our offerings fresh and exciting for customers."""

inputs = [in1, in2, in3]

…and create the Chain

# Run chain
llm = ChatOpenAI(temperature=0, model="gpt-3.5-turbo")
chain = create_extraction_chain(schema, llm)

…finally, run the chain over the examples

for text in inputs:
    print(chain.run(text))

Now we have structured outputs as Python lists of dictionaries:

[{'company': 'Sweet Delights Bakery', 'offering': 'lavender-infused vanilla cupcakes', 'advantage': 'inspiring next cupcake creation', 'products_and_services': 'Frosting-Spreader-3000'}]
[{'company': 'Whisked Away Cupcakes', 'offering': 'dessert subscription service', 'advantage': 'ensuring regular customers receive fresh batches of various sweets', 'products_and_services': '', 'additional_details': ''}, {'company': '', 'offering': 'subscription model using the "SweetSubs" program', 'advantage': 'boost customer loyalty', 'products_and_services': '', 'additional_details': ''}]
[{'company': 'Velvet Frosting Cupcakes', 'offering': 'rotating seasonal menu', 'advantage': 'fresh and exciting offerings', 'products_and_services': 'SeasonalJoy subscription platform, FloralStamp cookie stamper'}]
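
Two things stand out in these outputs: the second input produced two records, and some records omit keys such as additional_details. Here is a minimal sketch of one way to normalize such results into uniform rows before loading them into a table. The helper name normalize_records, the shortened sample data, and the choice of None for missing or empty values are my assumptions, not part of LangChain.

```python
# Field names taken from the schema defined earlier.
FIELDS = [
    "company",
    "offering",
    "advantage",
    "products_and_services",
    "additional_details",
]


def normalize_records(records):
    """Return one dict per extracted record, with every schema field present.

    Missing keys and empty strings both become None (an assumption made
    here so that gaps are easy to spot later).
    """
    rows = []
    for record in records:
        row = {}
        for field in FIELDS:
            value = record.get(field)
            row[field] = value if value else None
        rows.append(row)
    return rows


# Hypothetical, shortened version of the second output shown above.
sample = [
    {"company": "Whisked Away Cupcakes", "offering": "dessert subscription service"},
    {"company": "", "offering": "subscription model", "advantage": "boost customer loyalty"},
]
for row in normalize_records(sample):
    print(row)
```

Keeping every record, rather than only the first one per input, preserves cases like the second example, where one note yields two entries.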

Let’s Update Our Original Data with the Additional Parameters

This is an okay start, and it appears to work. A more useful workflow, however, loads the CSV of competitive intelligence, runs each row through the extraction chain, and merges the parsed fields back into the original dataset. The Python code below does just that:

import numpy as np
import pandas as pd
from langchain.chains import create_extraction_chain
# updated Mar 1 2024 to reflect updated langchain import syntax
from langchain_openai import ChatOpenAI

# Load in the data.csv (semicolon separated) file
df = pd.read_csv("data.csv", sep=';')

# Define Schema based on your data
schema = {
    "properties": {
        "company": {"type": "string"},
        "offering": {"type": "string"},
        "advantage": {"type": "string"},
        "products_and_services": {"type": "string"},
        "additional_details": {"type": "string"},
    }
}

# Create extraction chain
llm = ChatOpenAI(temperature=0, model="gpt-3.5-turbo")
chain = create_extraction_chain(schema, llm)

# ----------
# Add the data to a data frame
# ----------

# Extract information and create a DataFrame from the list of dictionaries
# Note: chain.run returns a list; [0] keeps only the first record for each row
extracted_data = df['INTEL'].apply(lambda x: chain.run(x)[0]).apply(pd.Series)

# Replace missing values with NaN
extracted_data.replace('', np.nan, inplace=True)

# Concatenate the extracted_data DataFrame with the original df
df = pd.concat([df, extracted_data], axis=1)

# display the data frame
df.head()

Figure: create_extraction_chain results (15 sec)

This run took about 15 seconds, and it hasn’t found all the information we’re requesting.
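
One way to make "hasn't found all the information" measurable is to count, per field, how many rows came back non-empty. The sketch below does that on plain dictionaries. The function name fill_rate and the sample rows are hypothetical, chosen only for illustration.

```python
def fill_rate(rows, fields):
    """Return {field: fraction of rows with a non-empty value}."""
    if not rows:
        return {field: 0.0 for field in fields}
    rates = {}
    for field in fields:
        # A value counts as filled unless it is missing, None, or "".
        filled = sum(1 for row in rows if row.get(field) not in (None, ""))
        rates[field] = filled / len(rows)
    return rates


# Hypothetical rows, shaped like the extraction output.
rows = [
    {"company": "Sweet Delights Bakery", "offering": "cupcakes", "additional_details": ""},
    {"company": "Whisked Away Cupcakes", "offering": "", "additional_details": ""},
]
fields = ["company", "offering", "additional_details"]
for field, rate in fill_rate(rows, fields).items():
    print(f"{field}: {rate:.0%}")
```

On the DataFrame built above, extracted_data.notna().mean() gives the same per-column fraction once empty strings have been replaced with NaN.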

Let’s try a different method instead.

Option 2: Pydantic

In the code that follows, Pydantic is used to define data models that represent the structure of the competitive intelligence information. Pydantic is a data validation and parsing library for Python that lets you define simple or complex data structures using Python type annotations. In this case, we use Pydantic models (Competitor and Company) to define the structure of the competitive intelligence data.

import pandas as pd
from typing import Optional, Sequence
# updated Mar 1 2024 to reflect updated langchain import syntax
from langchain_openai import OpenAI
from langchain.output_parsers import PydanticOutputParser
from langchain.prompts import PromptTemplate
from pydantic import BaseModel

# Load data from CSV
df = pd.read_csv("data.csv", sep=';')

# Pydantic models for competitive intelligence
class Competitor(BaseModel):
    company: str
    offering: str
    advantage: str
    products_and_services: str
    additional_details: str

class Company(BaseModel):
    """Identifying information about all competitive intelligence in a text."""
    company: Sequence[Competitor]

# Set up a Pydantic parser and prompt template
parser = PydanticOutputParser(pydantic_object=Company)
prompt = PromptTemplate(
    template="Answer the user query.\n{format_instructions}\n{query}\n",
    input_variables=["query"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
)

# Create the model once and reuse it for every row
model = OpenAI(temperature=0)

# Function to process each row and extract information
def process_row(row):
    _input = prompt.format_prompt(query=row['INTEL'])
    output = model.invoke(_input.to_string())
    result = parser.parse(output)
    
    # Convert Pydantic result to a dictionary
    competitor_data = result.model_dump()

    # Flatten the nested structure for DataFrame creation
    flat_data = {'INTEL': [], 'company': [], 'offering': [], 'advantage': [], 'products_and_services': [], 'additional_details': []}

    for entry in competitor_data['company']:
        flat_data['INTEL'].append(row['INTEL'])
        flat_data['company'].append(entry['company'])
        flat_data['offering'].append(entry['offering'])
        flat_data['advantage'].append(entry['advantage'])
        flat_data['products_and_services'].append(entry['products_and_services'])
        flat_data['additional_details'].append(entry['additional_details'])

    # Create a DataFrame from the flattened data
    df_cake = pd.DataFrame(flat_data)

    return df_cake

# Apply the function to each row and concatenate the results
intel_df = pd.concat(df.apply(process_row, axis=1).tolist(), ignore_index=True)

# Display the resulting DataFrame
intel_df.head()

Figure: Pydantic results (9.2 sec)

That was really quick! And it found details for all of the entries, unlike the create_extraction_chain attempt.

Concluding Thoughts

I found that PydanticOutputParser was faster and more reliable. Each PydanticOutputParser run took about 1 sec and 400 tokens, whereas each create_extraction_chain run took about 2.5 sec and 250 tokens.
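
To see what those per-run figures imply at scale, here is a small back-of-the-envelope projection. The per-run numbers are the ones quoted above. The row count of 1,000 is hypothetical, and the sketch assumes the calls run one after another with no batching.

```python
def project(rows, seconds_per_run, tokens_per_run):
    """Total time in seconds and total tokens for `rows` sequential runs."""
    return rows * seconds_per_run, rows * tokens_per_run


n_rows = 1000  # hypothetical dataset size

# (approach, seconds per run, tokens per run) as reported in the text
approaches = [
    ("PydanticOutputParser", 1.0, 400),
    ("create_extraction_chain", 2.5, 250),
]
for name, secs, toks in approaches:
    total_secs, total_toks = project(n_rows, secs, toks)
    print(f"{name}: {total_secs / 60:.1f} min, {total_toks:,} tokens")
```

Under these assumptions the Pydantic route is 2.5 times faster but uses 1.6 times as many tokens, so the better choice depends on whether latency or token cost matters more for your dataset.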

We’ve managed to extract some structured data out of unstructured text! That’s great! The next step is to analyze that data. Part 2 focuses on analyzing structured data extracted from unstructured text with a LangChain agent.