Create a Record
This tutorial will guide you through the steps to create a record for your collection.
You can create a record in your collection by any of the following methods:
- Provide the text content directly.
- Provide the URL of a webpage, and TaskingAI will scrape the contents.
- Upload a file, and TaskingAI will extract the text content. Supported file formats: .txt, .pdf, .docx, .md, .html.
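Before uploading, it can be convenient to check a file name against the supported formats listed above. The following is a minimal local sketch; the helper name is_supported_file is hypothetical and is not part of the TaskingAI SDK.

```python
from pathlib import Path

# Extensions listed in this tutorial as supported for file records.
SUPPORTED_EXTENSIONS = {".txt", ".pdf", ".docx", ".md", ".html"}


def is_supported_file(path: str) -> bool:
    """Hypothetical helper: True if the file extension is in the supported list.

    The comparison is case-insensitive, so "Report.PDF" is accepted.
    """
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS


print(is_supported_file("notes.md"))   # True
print(is_supported_file("image.png"))  # False
```

This only inspects the file name; it does not verify that the file content actually matches the extension.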
Text Splitters
When creating records that involve text processing, you can utilize a text splitter to divide the text content into more manageable chunks. TaskingAI currently supports two primary types of text splitters:
TokenTextSplitter
This splitter divides the text based on a specified number of tokens. It is configured with the following parameters:
- chunk_size: The maximum number of tokens per chunk. This determines how large each text chunk can be.
- chunk_overlap: The number of tokens that consecutive chunks will overlap. A value of 0 indicates that there is no overlap between chunks.
When creating a TextSplitter object, you can specify the type of splitter and the parameters required for that splitter, or simply use one of the predefined splitters: TokenTextSplitter or SeparatorTextSplitter.
Example Usage:
token_text_splitter = {
    "type": "token",
    "chunk_size": 200,
    "chunk_overlap": 20
}
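To make chunk_size and chunk_overlap concrete, here is a minimal sketch of overlapping token windows. It is an illustration only, not TaskingAI's implementation: it assumes whitespace-separated words as stand-in tokens, whereas a real tokenizer is model-specific, and the function name split_by_tokens is hypothetical. It also assumes chunk_overlap must be smaller than chunk_size, which the tutorial does not state.

```python
def split_by_tokens(tokens, chunk_size, chunk_overlap=0):
    """Split a token list into windows of at most chunk_size tokens.

    Consecutive windows share chunk_overlap tokens.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    # Assumption: the overlap has to be smaller than the chunk itself,
    # otherwise the window would never advance.
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once the current window reaches the end of the input.
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Whitespace words stand in for tokens here.
words = "one two three four five six seven eight nine ten".split()
for chunk in split_by_tokens(words, chunk_size=4, chunk_overlap=1):
    print(chunk)
# ['one', 'two', 'three', 'four']
# ['four', 'five', 'six', 'seven']
# ['seven', 'eight', 'nine', 'ten']
```

With chunk_overlap set to 1, the last token of each chunk is repeated as the first token of the next one, which is what keeps context from being cut at chunk boundaries.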
SeparatorTextSplitter
This splitter uses specified separators to divide the text content into chunks.
If a separated chunk exceeds the chunk_size, it will be further divided into smaller chunks.
The parameters for this splitter are:
- separators: A list of delimiter strings that will be used to split the text.
- chunk_size: The maximum number of tokens per chunk. (This is a mandatory parameter.)
- chunk_overlap: The number of tokens that consecutive chunks will overlap. A value of 0 indicates no overlap.
Example Usage:
separator_text_splitter = {
    "type": "separator",
    "separators": ["\n\n"],
    "chunk_size": 200,
    "chunk_overlap": 20
}
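The behavior described above (split on separators, then subdivide any piece that exceeds chunk_size) can be sketched as follows. This is an illustration under stated assumptions, not TaskingAI's implementation: whitespace-separated words stand in for tokens, the overlap is applied only inside oversized pieces, small pieces are not merged, and the function name split_by_separators is hypothetical.

```python
import re


def split_by_separators(text, separators, chunk_size, chunk_overlap=0):
    """Split text on any of the separators, then subdivide oversized pieces."""
    if not separators:
        raise ValueError("at least one separator is required")
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")

    # Build one regex that matches any of the literal separator strings.
    pattern = "|".join(re.escape(sep) for sep in separators)
    pieces = [p.strip() for p in re.split(pattern, text) if p.strip()]

    step = chunk_size - chunk_overlap
    chunks = []
    for piece in pieces:
        words = piece.split()  # stand-in tokenization (assumption)
        if len(words) <= chunk_size:
            chunks.append(piece)
            continue
        # The piece is too long: fall back to overlapping windows.
        for start in range(0, len(words), step):
            chunks.append(" ".join(words[start:start + chunk_size]))
            if start + chunk_size >= len(words):
                break
    return chunks


text = "a b c\n\nd e f g h\n\ni"
print(split_by_separators(text, ["\n\n"], chunk_size=3, chunk_overlap=1))
# ['a b c', 'd e f', 'f g h', 'i']
```

In this example the middle paragraph has five words, so it exceeds chunk_size and is divided into two overlapping chunks, while the other two paragraphs are kept whole.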
When creating a TextSplitter object, specify the type of splitter (token or separator) along with the necessary parameters. Both splitters are designed to optimize the processing of large text datasets by breaking them down into more manageable segments.
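If splitter configurations are built in several places, a small helper can catch mistakes before the request is sent. The sketch below produces dictionaries in the same shape as the examples in this section. The helper name make_text_splitter is hypothetical and not part of the SDK, and the rule that chunk_overlap must be smaller than chunk_size is an assumption rather than something this tutorial states.

```python
def make_text_splitter(splitter_type, chunk_size, chunk_overlap=0, separators=None):
    """Hypothetical helper: build a text splitter config dict with basic checks."""
    if splitter_type not in ("token", "separator"):
        raise ValueError("splitter_type must be 'token' or 'separator'")
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    # Assumption: overlap must be non-negative and smaller than chunk_size.
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= overlap < chunk_size")

    config = {
        "type": splitter_type,
        "chunk_size": chunk_size,
        "chunk_overlap": chunk_overlap,
    }
    if splitter_type == "separator":
        # The separator splitter needs at least one delimiter string.
        if not separators:
            raise ValueError("separators are required for the separator splitter")
        config["separators"] = list(separators)
    return config


print(make_text_splitter("token", chunk_size=200, chunk_overlap=20))
# {'type': 'token', 'chunk_size': 200, 'chunk_overlap': 20}
```

The returned dictionary can be passed wherever this tutorial passes a text_splitter value.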
Create a record with text content
To create a new record from text, use the create_record method with type set to text. The main parameters are:
- collection_id: The identifier of the collection where the record will be stored.
- content: The textual content of the record.
- text_splitter: The text splitter to use for splitting the text into smaller chunks.
- metadata: A dictionary containing the metadata of the record.
import taskingai

record = taskingai.retrieval.create_record(
    collection_id="YOUR_COLLECTION_ID",
    type="text",
    content="Machine learning is a subfield of artificial intelligence...",
    text_splitter={"type": "token", "chunk_size": 200, "chunk_overlap": 20},
    metadata={"file_name": "machine_learning.pdf"}
)
The chunk_size and chunk_overlap of the text splitter represent the maximum number of tokens per text chunk and the number of tokens shared between consecutive chunks, with 0 indicating no overlap.
After executing this function, a new record is initiated within the specified collection.
The text_splitter parameter is not a property of the record, so it is not included in the response.
It is used only during creation to split the text into smaller chunks.
Create a record with a web URL
To create a new record from a web URL, use the same create_record method, but set the type to web and provide the URL of the webpage.
The textual content of the webpage will be scraped and stored in the record.
import taskingai

record = taskingai.retrieval.create_record(
    collection_id="YOUR_COLLECTION_ID",
    type="web",
    url="https://www.tasking.ai",
    text_splitter={"type": "token", "chunk_size": 200, "chunk_overlap": 20},
)
Create a record with an uploaded file
Creating a record from an uploaded file is a two-step process: first upload the file, then create the record based on the uploaded file.
Upload a file:
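To upload a file, use the upload_file method:
# Open the file in binary mode; the with block closes it once the upload returns.
with open("PATH_TO_FILE", "rb") as f:
    file = taskingai.file.upload_file(file=f, purpose="record_file")
print(f"uploaded file id: {file.file_id}")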
Create a record based on the uploaded file:
record = taskingai.retrieval.create_record(
    collection_id="YOUR_COLLECTION_ID",
    type="file",
    title="Machine learning",
    file_id=file.file_id,
    text_splitter={"type": "token", "chunk_size": 200, "chunk_overlap": 20},
)
print(f"created record: {record.record_id}\n")
Record Status
In some cases, the record status will remain creating for some time after the create call.
Generally, after a few seconds, the record status will change to ready.
When the record status changes to ready, the text has been split into chunks and the embeddings of these chunks have been built.
Record chunks can be retrieved in response to a user's query only when the record is in the ready status.
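Because a record may stay in the creating status for a few seconds, client code usually needs to wait before querying it. The following is a minimal polling sketch. The function wait_until_ready and its fetch_status argument are hypothetical: fetch_status is any caller-supplied callable that returns the current status string, for example by re-fetching the record through the SDK. That lookup call is not covered in this tutorial, so it is left to the caller here.

```python
import time


def wait_until_ready(fetch_status, timeout=30.0, interval=1.0):
    """Poll fetch_status() until it returns "ready" or the timeout elapses.

    fetch_status is a caller-supplied callable (an assumption of this sketch)
    that returns the record's current status string.
    """
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        if status == "ready":
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"record still in status {status!r} after {timeout} seconds")
        time.sleep(interval)


# Demonstration with a fake status source instead of a real API call.
fake_statuses = iter(["creating", "creating", "ready"])
print(wait_until_ready(lambda: next(fake_statuses), timeout=5, interval=0))
# ready
```

A fixed polling interval is the simplest choice; for records that take longer to process, a gradually increasing interval would reduce the number of status requests.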