Create a Record

This tutorial will guide you through the steps to create a record for your collection.

You can add a record to your collection using any of the following methods:

  1. Provide the text content directly.
  2. Provide the URL of a webpage, and TaskingAI will scrape the contents.
  3. Upload a file, and TaskingAI will extract the text content. Supported file formats: .txt, .pdf, .docx, .md, .html.

Text Splitters

When creating records that involve text processing, you can use a text splitter to divide the text content into more manageable chunks. TaskingAI currently supports two types of text splitters: TokenTextSplitter and SeparatorTextSplitter. When creating a TextSplitter object, you can specify the type of splitter and the parameters required for that splitter, or simply use one of these predefined splitters.

TokenTextSplitter

This splitter divides the text based on a specified number of tokens. It is configured with the following parameters:

  • chunk_size: The maximum number of tokens per chunk. This determines how large each text chunk can be.
  • chunk_overlap: The number of tokens shared between consecutive chunks. A value of 0 indicates that there is no overlap between chunks.

Example Usage:

token_text_splitter = {
    "type": "token",
    "chunk_size": 200,
    "chunk_overlap": 20,
}
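To make the interaction between chunk_size and chunk_overlap concrete, here is a minimal sketch of a sliding-window split. It is not TaskingAI's implementation: the function name split_by_tokens is hypothetical, and plain list items stand in for real tokens, since the tokenizer TaskingAI uses is not described here.

```python
def split_by_tokens(tokens, chunk_size, chunk_overlap):
    """Slide a window of chunk_size tokens, stepping by chunk_size - chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the input
    return chunks


# Assumption: simple string items stand in for tokens produced by a tokenizer.
tokens = [f"t{i}" for i in range(10)]
chunks = split_by_tokens(tokens, chunk_size=4, chunk_overlap=1)
print(chunks)
```

With chunk_size=4 and chunk_overlap=1, each chunk starts three tokens after the previous one, so the last token of one chunk is repeated as the first token of the next.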

SeparatorTextSplitter

This splitter uses specified separators to divide the text content into chunks. If a separated chunk exceeds the chunk_size, it will be further divided into smaller chunks. The parameters for this splitter are:

  • separators: A list of delimiter strings that will be used to split the text.
  • chunk_size: The maximum number of tokens per chunk. (This is a mandatory parameter.)
  • chunk_overlap: The number of tokens that consecutive chunks will overlap. A value of 0 indicates no overlap.

Example Usage:

separator_text_splitter = {
    "type": "separator",
    "separators": ["\n\n"],
    "chunk_size": 200,
    "chunk_overlap": 20,
}
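The two-stage behaviour described above (split on separators first, then further divide any piece that exceeds chunk_size) can be sketched as follows. This is an illustration under stated assumptions rather than TaskingAI's actual algorithm: split_by_separators is a hypothetical name, whitespace-separated words stand in for tokens, and chunk_overlap is left out to keep the sketch short.

```python
def split_by_separators(text, separators, chunk_size):
    """Split on each separator in turn, then cap every piece at chunk_size words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    pieces = [text]
    for sep in separators:
        # Re-split every piece produced so far on the next separator.
        pieces = [part for piece in pieces for part in piece.split(sep)]
    chunks = []
    for piece in pieces:
        words = piece.split()  # assumption: words stand in for tokens
        # Pieces longer than chunk_size are divided into consecutive sub-chunks.
        for start in range(0, len(words), chunk_size):
            chunks.append(" ".join(words[start:start + chunk_size]))
    return chunks


text = "one two three\n\nfour five six seven eight"
chunks = split_by_separators(text, ["\n\n"], chunk_size=3)
print(chunks)
```

Here the first paragraph fits within chunk_size and stays whole, while the second paragraph has five words and is divided into two chunks.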

When creating a TextSplitter object, specify the type of splitter (token or separator) along with the necessary parameters. Both splitters are designed to optimize the processing of large text datasets by breaking them down into more manageable segments.

Create a record with text content

To create a new record from text, use the create_record method with type set to text. The method takes the following parameters:

  • collection_id: The identifier of the collection where the record will be stored.
  • content: The textual content of the record.
  • text_splitter: The text splitter to use for splitting the text into smaller chunks.
  • metadata: A dictionary containing the metadata of the record.

import taskingai

record = taskingai.retrieval.create_record(
    collection_id="YOUR_COLLECTION_ID",
    type="text",
    content="Machine learning is a subfield of artificial intelligence...",
    text_splitter={"type": "token", "chunk_size": 200, "chunk_overlap": 20},
    metadata={"file_name": "machine_learning.pdf"},
)

In the text splitter, chunk_size sets the maximum number of tokens per chunk, and chunk_overlap sets the number of tokens shared between consecutive chunks; a value of 0 means no overlap.

After executing this function, a new record is created within the specified collection.

Note

The text_splitter parameter is not a property of the record, so it is not included in the response. It is used only during creation to split the text into smaller chunks.

Create a record with a web URL

To create a new record from a web URL, use the same create_record method, but set the type to web and provide the URL of the webpage. The textual content of the webpage will be scraped and stored in the record.

import taskingai

record = taskingai.retrieval.create_record(
    collection_id="YOUR_COLLECTION_ID",
    type="web",
    url="https://www.tasking.ai",
    text_splitter={"type": "token", "chunk_size": 200, "chunk_overlap": 20},
)

Create a record with an uploaded file

Creating a record from an uploaded file is a two-step process: first upload the file, then create the record based on the uploaded file.

Upload a file:

To upload a file, use the upload_file method:

import taskingai
with open("PATH_TO_FILE", "rb") as f:
    file = taskingai.file.upload_file(file=f, purpose="record_file")
print(f"uploaded file id: {file.file_id}")

Create a record based on the uploaded file:

record = taskingai.retrieval.create_record(
    collection_id="YOUR_COLLECTION_ID",
    type="file",
    title="Machine learning",
    file_id=file.file_id,
    text_splitter={"type": "token", "chunk_size": 200, "chunk_overlap": 20},
)
print(f"created record: {record.record_id}\n")

Record Status

In some cases, the record status remains creating for some time after the create call. Typically, it changes to ready after a few seconds.

When the record status changes to ready, the text has been split into chunks and the embeddings for those chunks have been built. Record chunks can be retrieved in response to a user's query only when the record is in the ready status.
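Because a record may stay in the creating status for a few seconds, a caller may want to wait before querying it. The following is a minimal polling sketch, not part of the TaskingAI client: wait_until_ready is a hypothetical helper, and it takes a fetch_status callable so that it makes no assumptions about how the status is read. The demo feeds it a fixed sequence of statuses instead of calling the service.

```python
import time


def wait_until_ready(fetch_status, timeout=30.0, interval=1.0):
    """Call fetch_status() until it returns "ready" or the timeout elapses.

    Returns True if the status became "ready", False if the wait timed out.
    """
    deadline = time.monotonic() + timeout
    while True:
        if fetch_status() == "ready":
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Demo with a simulated status sequence (no network calls are made).
statuses = iter(["creating", "creating", "ready"])
became_ready = wait_until_ready(lambda: next(statuses), timeout=5.0, interval=0.01)
print(became_ready)
```

In real use, fetch_status would be a function that re-fetches the record and returns its current status. The exact client call and attribute name for that are not shown in this tutorial, so check the client reference before wiring it in.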