Feature Request: Graph RAG - Add extra attributes #684
Comments
Thank you for this write-up. I debated whether to duplicate the attribute data between databases and graphs. With 7.0, I decided to only sync text data with the graph component. But I agree that it would be nice to have access to all attributes to handle scenarios such as the one mentioned above. With that, I'll modify the logic to sync attributes. I'll think about whether attribute syncing should be always on, on by default with an option to disable it, or off by default with an option to enable it.
Hello! This is a detailed, sequential proposal to integrate typing and advanced querying capabilities into TxtAI, leveraging NetworkX and gradually adding RDF/SPARQL features:

Title: Progressive Extension of Graph Capabilities in TxtAI with Typing and RDF/SPARQL Support

Phase 1: Adding Basic Typing

1. Extend the TxtAI Graph class:

from txtai.graph import Graph as TxtAIGraph
import networkx as nx

class TypedGraph(TxtAIGraph):
    def add_typed_node(self, node, node_type, **attr):
        # Store the type as a regular node attribute
        self.graph.add_node(node, type=node_type, **attr)

    def add_typed_edge(self, u, v, edge_type, **attr):
        # Store the type as a regular edge attribute
        self.graph.add_edge(u, v, type=edge_type, **attr)

    def get_nodes_by_type(self, node_type):
        return [node for node, data in self.graph.nodes(data=True) if data.get('type') == node_type]

    def get_edges_by_type(self, edge_type):
        return [(u, v) for u, v, data in self.graph.edges(data=True) if data.get('type') == edge_type]

Usage example:

G = TypedGraph()
G.add_typed_node(1, "person", name="Alice")
G.add_typed_node(2, "person", name="Bob")
G.add_typed_node(3, "skill", name="Python")
G.add_typed_edge(1, 3, "has_skill")
G.add_typed_edge(2, 3, "has_skill")
people = G.get_nodes_by_type("person")
skills = G.get_nodes_by_type("skill")

Phase 2: Integrating RDFLib-NetworkX

2. Add RDFLib-NetworkX as a dependency and extend the TypedGraph class:

from rdflib_networkx import network_to_rdflib, rdflib_to_network

class RDFTypedGraph(TypedGraph):
    def to_rdf(self):
        return network_to_rdflib(self.graph)

    def from_rdf(self, rdf_graph):
        self.graph = rdflib_to_network(rdf_graph)

    def load_rdf(self, file_path, format='turtle'):
        import rdflib
        g = rdflib.Graph()
        g.parse(file_path, format=format)
        self.from_rdf(g)

Usage example:

G = RDFTypedGraph()
G.load_rdf("data.ttl")
rdf_graph = G.to_rdf()
rdf_graph.serialize(destination='output.ttl', format='turtle')

Phase 3: Adding Basic SPARQL Support

3. Implement a simple SPARQL execution method:

from rdflib.plugins.sparql import prepareQuery

class SPARQLGraph(RDFTypedGraph):
    def execute_sparql(self, query_string):
        # Convert to RDF on every call, then run the prepared query
        rdf_graph = self.to_rdf()
        query = prepareQuery(query_string)
        results = rdf_graph.query(query)
        return list(results)

Usage example:

G = SPARQLGraph()
G.load_rdf("data.ttl")
results = G.execute_sparql("""
SELECT ?person ?skill
WHERE {
?person a :Person ;
:hasSkill ?skill .
}
""")
for row in results:
    print(f"Person: {row.person}, Skill: {row.skill}")

Phase 4: Integration with Existing TxtAI Features

4. Ensure compatibility with TxtAI's semantic search methods:

from txtai.embeddings import Embeddings

class EnhancedGraph(SPARQLGraph):
    def __init__(self):
        super().__init__()
        self.embeddings = Embeddings()

    def semantic_subgraph(self, query, limit=5):
        similar = self.embeddings.search(query, limit)
        subgraph = self.graph.subgraph([node for node, _ in similar])
        # Wrap a copy of the subgraph in a new EnhancedGraph
        result = EnhancedGraph()
        result.graph = subgraph.copy()
        return result

    def sparql_with_embedding(self, query, sparql_template):
        # Use the id of the best semantic match as the SPARQL entity
        similar = self.embeddings.search(query, 1)[0][0]
        sparql_query = sparql_template.format(entity=similar)
        return self.execute_sparql(sparql_query)

Final usage example:

G = EnhancedGraph()
G.load_rdf("knowledge_base.ttl")
# Semantic search + subgraph
subgraph = G.semantic_subgraph("machine learning")
# SPARQL query with embedding
results = G.sparql_with_embedding("AI techniques", """
SELECT ?related_concept
WHERE {{
<{entity}> :relatedTo ?related_concept .
}}
""")
for row in results:
    print(f"Related concept: {row.related_concept}")

This progressive approach allows for the addition of typing, RDF support, and SPARQL querying while maintaining compatibility with TxtAI's existing NetworkX-based infrastructure. It provides a smooth transition to more advanced knowledge graph capabilities while preserving the integration with TxtAI's semantic search features.

Practical Usage Example:

Once this class is implemented, here is how it could be used to solve the initial problem:

# Creating the graph
G = TypedGraph()
# Adding people
G.add_typed_node('A1', 'person', name='Person A')
G.add_typed_node('B1', 'person', name='Person B')
# Adding skills
G.add_typed_node('S1', 'skill', name='Soccer')
G.add_typed_node('S2', 'skill', name='Swimming')
# Adding relationships
G.add_typed_edge('A1', 'S1', 'has_skill')
G.add_typed_edge('A1', 'S2', 'has_skill')
G.add_typed_edge('B1', 'S1', 'has_skill')
# Searching for people
persons = G.get_nodes_by_type('person')
print("People:", persons)
# Searching for shared skills
skills = G.get_nodes_by_type('skill')
for skill in skills:
    persons_with_skill = [n for n in G.graph.neighbors(skill) if G.graph.nodes[n].get('type') == 'person']
    if len(persons_with_skill) > 1:
        print(f"Skill {G.graph.nodes[skill]['name']} shared by: {persons_with_skill}")

This approach solves the initial problem by allowing the addition of custom attributes (such as type) to nodes and edges, and using them for advanced queries.
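The shared-skill search above can also be tried without TxtAI installed. The following is a minimal sketch that works on a plain NetworkX graph; the helper names `build_example_graph` and `shared_skills` are made up for illustration, and it assumes the `type` attribute is stored directly on the nodes, as in the proposal.

```python
import networkx as nx


def build_example_graph():
    """Build the four-node example: two people, two skills."""
    graph = nx.Graph()
    graph.add_node("A1", type="person", name="Person A")
    graph.add_node("B1", type="person", name="Person B")
    graph.add_node("S1", type="skill", name="Soccer")
    graph.add_node("S2", type="skill", name="Swimming")
    # Every edge gets the same 'type' attribute
    graph.add_edges_from(
        [("A1", "S1"), ("A1", "S2"), ("B1", "S1")], type="has_skill"
    )
    return graph


def shared_skills(graph):
    """Map each skill node to the people sharing it (two or more)."""
    shared = {}
    for node, data in graph.nodes(data=True):
        if data.get("type") != "skill":
            continue
        people = sorted(
            n for n in graph.neighbors(node)
            if graph.nodes[n].get("type") == "person"
        )
        if len(people) > 1:
            shared[node] = people
    return shared


print(shared_skills(build_example_graph()))
```

Using `data.get("type")` rather than `data["type"]` keeps the search from failing on nodes that were added without a type.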
Owlready2 is missing from the provided implementation. To make the example more complete and leverage the capabilities of Owlready2, we can modify and extend the implementation. Here's a step-by-step example incorporating Owlready2:

import types

from txtai.graph import Graph as TxtAIGraph
import networkx as nx
from owlready2 import *
from rdflib_networkx import network_to_rdflib, rdflib_to_network
from rdflib.plugins.sparql import prepareQuery
from txtai.embeddings import Embeddings

class EnhancedGraph(TxtAIGraph):
    def __init__(self):
        super().__init__()
        self.onto = get_ontology("http://test.org/onto.owl")
        self.onto.metadata.declare()
        self.embeddings = Embeddings()

    def add_typed_node(self, node, node_type, **attr):
        with self.onto:
            # Create the class on first use, then the individual
            if not self.onto[node_type]:
                types.new_class(node_type, (Thing,))
            new_individual = self.onto[node_type](node)
            for key, value in attr.items():
                setattr(new_individual, key, value)
        self.graph.add_node(node, type=node_type, **attr)

    def add_typed_edge(self, u, v, edge_type, **attr):
        with self.onto:
            # Create the object property on first use
            if not self.onto[edge_type]:
                types.new_class(edge_type, (ObjectProperty,))
            # Relations are set on the subject individual
            getattr(self.onto[u], edge_type).append(self.onto[v])
        self.graph.add_edge(u, v, type=edge_type, **attr)

    def get_nodes_by_type(self, node_type):
        return list(self.onto[node_type].instances())

    def get_edges_by_type(self, edge_type):
        return [(u, v) for u, v, data in self.graph.edges(data=True) if data.get('type') == edge_type]

    def to_rdf(self):
        return network_to_rdflib(self.graph)

    def from_rdf(self, rdf_graph):
        self.graph = rdflib_to_network(rdf_graph)

    def load_rdf(self, file_path, format='turtle'):
        self.onto = get_ontology(file_path).load()
        for cls in self.onto.classes():
            for instance in cls.instances():
                self.add_typed_node(instance.name, cls.name)
        for prop in self.onto.object_properties():
            for subject, obj in prop.get_relations():
                self.add_typed_edge(subject.name, obj.name, prop.name)

    def execute_sparql(self, query_string):
        rdf_graph = self.to_rdf()
        query = prepareQuery(query_string)
        results = rdf_graph.query(query)
        return list(results)

    def semantic_subgraph(self, query, limit=5):
        similar = self.embeddings.search(query, limit)
        subgraph = self.graph.subgraph([node for node, _ in similar])
        # Wrap a copy of the subgraph in a new EnhancedGraph
        result = EnhancedGraph()
        result.graph = subgraph.copy()
        return result

    def sparql_with_embedding(self, query, sparql_template):
        similar = self.embeddings.search(query, 1)[0][0]
        sparql_query = sparql_template.format(entity=similar)
        return self.execute_sparql(sparql_query)

    def reason(self):
        with self.onto:
            sync_reasoner()

# Usage example
G = EnhancedGraph()
# Adding people
G.add_typed_node('A1', 'Person', name='Person A')
G.add_typed_node('B1', 'Person', name='Person B')
# Adding skills
G.add_typed_node('S1', 'Skill', name='Soccer')
G.add_typed_node('S2', 'Skill', name='Swimming')
# Adding relationships
G.add_typed_edge('A1', 'S1', 'hasSkill')
G.add_typed_edge('A1', 'S2', 'hasSkill')
G.add_typed_edge('B1', 'S1', 'hasSkill')
# Searching for people
persons = G.get_nodes_by_type('Person')
print("People:", [p.name for p in persons])
# Searching for shared skills
skills = G.get_nodes_by_type('Skill')
for skill in skills:
    persons_with_skill = [n.name for n in skill.hasSkill.inverse()]
    if len(persons_with_skill) > 1:
        print(f"Skill {skill.name} shared by: {persons_with_skill}")
# Using SPARQL
results = G.execute_sparql("""
SELECT ?person ?skill
WHERE {
?person a :Person ;
:hasSkill ?skill .
}
""")
for row in results:
    print(f"Person: {row.person}, Skill: {row.skill}")
# Using semantic search
subgraph = G.semantic_subgraph("sports")
# Using reasoning
G.reason()

This implementation incorporates Owlready2, allowing for more advanced ontology manipulation and reasoning. It combines the strengths of NetworkX, RDFLib, and Owlready2, providing a powerful toolkit for working with typed graphs, RDF data, and ontologies within the TxtAI framework.
Another approach:
• Use NetworkX's built-in attribute handling:

import networkx as nx

def sync_attributes(G, attributes):
    # Nodes that are not already in G are silently skipped
    nx.set_node_attributes(G, attributes)

# Usage
G = nx.Graph()
G.add_nodes_from([1, 2])
attributes = {1: {'type': 'person'}, 2: {'type': 'skill'}}
sync_attributes(G, attributes)

• Extend TxtAI's Graph class:

from txtai.graph import Graph

class EnhancedGraph(Graph):
    def sync_attributes(self, attributes):
        nx.set_node_attributes(self.graph, attributes)

    def get_node_attributes(self, attribute):
        return nx.get_node_attributes(self.graph, attribute)

• Utilize NetworkX's JSON functionality:

import json
import networkx as nx

def export_to_json(G, filename):
    data = nx.node_link_data(G)
    with open(filename, 'w') as f:
        json.dump(data, f)

def import_from_json(filename):
    with open(filename, 'r') as f:
        data = json.load(f)
    return nx.node_link_graph(data)

# Extend TxtAI's Graph class
class EnhancedGraph(Graph):
    def to_json(self, filename):
        export_to_json(self.graph, filename)

    @classmethod
    def from_json(cls, filename):
        G = import_from_json(filename)
        enhanced_graph = cls()
        enhanced_graph.graph = G
        return enhanced_graph

• Modify TxtAI's Graph class to include these new methods:

import json
import networkx as nx
from txtai.graph import Graph

class EnhancedGraph(Graph):
    def sync_attributes(self, attributes):
        nx.set_node_attributes(self.graph, attributes)

    def get_node_attributes(self, attribute):
        return nx.get_node_attributes(self.graph, attribute)

    def to_json(self, filename):
        data = nx.node_link_data(self.graph)
        with open(filename, 'w') as f:
            json.dump(data, f)

    @classmethod
    def from_json(cls, filename):
        with open(filename, 'r') as f:
            data = json.load(f)
        G = nx.node_link_graph(data)
        enhanced_graph = cls()
        enhanced_graph.graph = G
        return enhanced_graph

Benefits:
This implementation provides a straightforward way to handle attribute synchronization and basic JSON import/export within the TxtAI framework, addressing the initial attribute problem while enhancing interoperability with external systems.

Integration of Cypher for Graph Queries

Implementation:
from txtai.graph import Graph as TxtAIGraph
from grandcypher import GrandCypher
import networkx as nx

class CypherEnabledGraph(TxtAIGraph):
    def __init__(self):
        super().__init__()
        self.cypher = GrandCypher(self.graph)

    def cypher_query(self, query):
        return self.cypher.run(query)

    def add_typed_node(self, node, node_type, **attr):
        attr['type'] = node_type
        self.graph.add_node(node, **attr)

    def add_typed_edge(self, u, v, edge_type, **attr):
        attr['type'] = edge_type
        self.graph.add_edge(u, v, **attr)

    def get_nodes_by_type(self, node_type):
        query = f"""
        MATCH (n)
        WHERE n.type = '{node_type}'
        RETURN n
        """
        return self.cypher_query(query)

    def get_edges_by_type(self, edge_type):
        query = f"""
        MATCH ()-[r]->()
        WHERE r.type = '{edge_type}'
        RETURN r
        """
        return self.cypher_query(query)

    def find_connected_nodes(self, start_node, relationship_type, end_node_type):
        query = f"""
        MATCH (start {{id: '{start_node}'}})-[r:{relationship_type}]->(end {{type: '{end_node_type}'}})
        RETURN end
        """
        return self.cypher_query(query)

Usage example:

graph = CypherEnabledGraph()
# Adding nodes and edges with types
graph.add_typed_node('A1', 'person', name='Person A')
graph.add_typed_node('B1', 'person', name='Person B')
graph.add_typed_node('S1', 'skill', name='Python')
graph.add_typed_edge('A1', 'S1', 'has_skill')
graph.add_typed_edge('B1', 'S1', 'has_skill')
# Performing type-based queries
persons = graph.get_nodes_by_type('person')
skills = graph.get_nodes_by_type('skill')
has_skill_edges = graph.get_edges_by_type('has_skill')
# Finding connected nodes
python_skilled_persons = graph.find_connected_nodes('S1', 'has_skill', 'person')
print("Persons:", persons)
print("Skills:", skills)
print("Has Skill Edges:", has_skill_edges)
print("Persons with Python skill:", python_skilled_persons)

This implementation addresses the initial type problem: the integration with grand-cypher allows for more expressive and powerful queries while maintaining compatibility with TxtAI's existing graph structure. This approach provides a good balance between simplicity, integration with TxtAI's ecosystem, and the ability to perform complex graph queries.

Enhancing RAG Capabilities

Implementation:
from txtai.graph import Graph
from txtai.pipeline import LLM
from transformers import AutoTokenizer, AutoModel
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import torch
import networkx as nx

class AttributeAwareGraph(Graph):
    def __init__(self):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
        self.model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

    def embed_text(self, text):
        # Mean-pool the token embeddings into a single vector
        inputs = self.tokenizer(text, return_tensors="pt", truncation=True, padding=True)
        with torch.no_grad():
            outputs = self.model(**inputs)
        return outputs.last_hidden_state.mean(dim=1).squeeze().numpy()

    def get_node_embedding(self, node):
        # Describe the node by its id, type, and remaining attributes
        attributes = self.graph.nodes[node]
        text = f"Node: {node}, Type: {attributes.get('type', 'Unknown')}, " + \
            ", ".join([f"{k}: {v}" for k, v in attributes.items() if k != 'type'])
        return self.embed_text(text)

    def get_subgraph_embedding(self, nodes):
        embeddings = [self.get_node_embedding(node) for node in nodes]
        return np.mean(embeddings, axis=0)

class EnhancedRAG:
    def __init__(self, graph, llm_model="gpt-3.5-turbo"):
        self.graph = graph
        self.llm = LLM(model=llm_model)

    def generate_response(self, query, k=3):
        # Embed the query text directly; it is not a node in the graph
        query_embedding = self.graph.embed_text(query)
        # Find most similar nodes
        similarities = []
        for node in self.graph.graph.nodes():
            node_embedding = self.graph.get_node_embedding(node)
            similarity = cosine_similarity([query_embedding], [node_embedding])[0][0]
            similarities.append((node, similarity))
        top_nodes = sorted(similarities, key=lambda x: x[1], reverse=True)[:k]
        # Generate context from top nodes
        context = "\n".join([f"Node {node}: {dict(self.graph.graph.nodes[node])}" for node, _ in top_nodes])
        # Generate response using LLM
        prompt = f"Query: {query}\nContext:\n{context}\nResponse:"
        response = self.llm(prompt)
        return response

graph = AttributeAwareGraph()
graph.add_node("A1", type="person", name="Alice", age=30)
graph.add_node("B1", type="person", name="Bob", age=35)
graph.add_node("S1", type="skill", name="Python")
graph.add_edge("A1", "S1", type="has_skill")
rag = EnhancedRAG(graph)
response = rag.generate_response("Who knows Python?")
print(response)

This implementation enhances the RAG capabilities by making the RAG process more aware of node types and attributes, which indirectly addresses the initial type problem. It allows for more precise text generation based on the structured information in the graph, including types and other attributes. The implementation is well-integrated with TxtAI's ecosystem, extending its Graph class and using its LLM pipeline. It also leverages NetworkX for graph operations and the transformers library for generating embeddings, both of which are commonly used in the TxtAI ecosystem.
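The retrieval step described here (turn each node's attributes into text, embed it, rank nodes by cosine similarity to the query) can be sketched without downloading a model. In the sketch below a bag-of-words count vector stands in for the sentence-transformers embedding, so the scores only reflect word overlap. `node_text`, `embed`, `cosine`, and `top_nodes` are illustrative names, not TxtAI APIs.

```python
import math
import re
from collections import Counter


def node_text(node_id, attributes):
    """Render a node and its attributes as one line of text."""
    extras = ", ".join(f"{k}: {v}" for k, v in attributes.items() if k != "type")
    return f"Node: {node_id}, Type: {attributes.get('type', 'Unknown')}, {extras}"


def embed(text):
    """Stand-in embedding: lowercase word counts (not a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def top_nodes(nodes, query, k=3):
    """Rank nodes by similarity between the query and their attribute text."""
    query_vector = embed(query)
    scored = [
        (node_id, cosine(query_vector, embed(node_text(node_id, attrs))))
        for node_id, attrs in nodes.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


NODES = {
    "A1": {"type": "person", "name": "Alice", "age": 30},
    "B1": {"type": "person", "name": "Bob", "age": 35},
    "S1": {"type": "skill", "name": "Python"},
}

best_node, best_score = top_nodes(NODES, "Who knows Python?", k=1)[0]
print(best_node)
```

Because the ranking only looks at node text, the person nodes score zero for this query. Answering "who knows Python?" still requires following the `has_skill` edges from the matched skill node.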
Hello,
While trying to work this into a Graph LLM RAG, I wanted to run some queries based on node type (for example: a node is a 'person', a 'skill', ...).
The idea was to have person A, identified by node A1.
Person B, identified by node B1.
Skills like soccer or swimming (S1 and S2).
So, 4 nodes in the graph at this point.
The idea is that if persons A1 and B1 share a skill, the query returns those vertices, along with the skill connecting them.
But when I run the following query, it returns an empty list:
MATCH P=(N)-[*1..2]->(D)
WHERE N.type == 'person'
RETURN P
Wasn't this supposed to return at least the person and their skills? I'm assuming the issue is that the graph has no 'type' attribute in it.
Main point/question: Is it possible to add additional attributes? So far, I've only been able to set the attributes [id, text, topic, topicrank]. How can a new attribute, like 'type', be added and persisted?
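For reference, this is what adding and persisting a `type` attribute looks like on a plain NetworkX graph. It is only a sketch of the underlying mechanism, assuming a NetworkX backend; it does not show how txtai itself syncs or stores attributes, and the helper names are made up.

```python
import json

import networkx as nx


def save_graph(graph):
    """Serialize the graph, including all node attributes, to JSON text."""
    return json.dumps(nx.node_link_data(graph))


def load_graph(text):
    """Rebuild a graph from JSON text produced by save_graph."""
    return nx.node_link_graph(json.loads(text))


def nodes_of_type(graph, node_type):
    """Return the sorted ids of nodes whose 'type' attribute matches."""
    return sorted(
        node for node, data in graph.nodes(data=True)
        if data.get("type") == node_type
    )


graph = nx.DiGraph()
for node_id, text in [("A1", "Person A"), ("B1", "Person B"),
                      ("S1", "Soccer"), ("S2", "Swimming")]:
    graph.add_node(node_id, text=text)

# Attach the extra 'type' attribute to nodes that already exist
nx.set_node_attributes(graph, {
    "A1": {"type": "person"}, "B1": {"type": "person"},
    "S1": {"type": "skill"}, "S2": {"type": "skill"},
})
graph.add_edges_from([("A1", "S1"), ("A1", "S2"), ("B1", "S1")])

# Round-trip through JSON to confirm the attribute survives
restored = load_graph(save_graph(graph))
print(nodes_of_type(restored, "person"))
```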