Skip to main content

Query Chunks

Chunks are integral elements of the TaskingAI retrieval system. They are derived from records and stored within collections.

The TaskingAI retrieval system is designed to efficiently query and retrieve specific chunks of information from a collection. This process is essential for extracting relevant data from large datasets. Here’s a guide on how to perform chunk queries in TaskingAI.


Preparing the Collection and Records

Before querying, it's essential to have a collection with records already created and processed. Each record is split into smaller segments known as chunks. These chunks are stored within the collection along with their generated embeddings.

For a successful query, ensure that the collection’s status is "ready". This status indicates that the collection is fully processed and available for querying. Similarly, check that the records within the collection are also marked as "ready". Only records with this status contribute chunks to the query results.

Here is an example of how to build a knowledge base for the retrieval system:

import taskingai

# create a collection
collection = taskingai.retrieval.create_collection(
embedding_model_id="YOUR_MODEL_ID",
capacity=1000,
)

After the status of the collection is "ready", create some records for testing purposes:

# define a text splitter
text_splitter = {
"type": "token",
"chunk_size": 200,
"chunk_overlap": 20,
}

# create record 1 (machine learning)
taskingai.retrieval.create_record(
collection_id=collection.collection_id,
type="text",
content="Machine learning is a subfield of artificial intelligence (AI) that involves the development of algorithms that allow computers to learn from and make decisions or predictions based on data. The term \"machine learning\" was coined by Arthur Samuel in 1959. In other words, machine learning enables a system to automatically learn and improve from experience without being explicitly programmed. This is achieved by feeding the system massive amounts of data, which it uses to learn patterns and make inferences. There are three main types of machine learning: 1. Supervised Learning: This is where the model is given labeled training data and the goal of learning is to generalize from the training data to unseen situations in a principled way. 2. Unsupervised Learning: This involves training on a dataset without explicit labels. The goal might be to discover inherent groupings or patterns within the data. 3. Reinforcement Learning: In this type, an agent learns to perform actions based on reward/penalty feedback to achieve a goal. It's commonly used in robotics, gaming, and navigation. Deep learning, a subset of machine learning, uses neural networks with many layers (\"deep\" structures) and has been responsible for many recent breakthroughs in AI, including speech recognition, image recognition, and natural language processing. It's important to note that machine learning is a rapidly developing field, with new techniques and applications emerging regularly.",
text_splitter=text_splitter,
)

# create record 2 (Michael Jordan)
taskingai.retrieval.create_record(
collection_id=collection.collection_id,
type="text",
content="Michael Jordan, often referred to by his initials MJ, is considered one of the greatest players in the history of the National Basketball Association (NBA). He was known for his scoring ability, defensive prowess, competitiveness, and clutch performances. Born on February 17, 1963, Jordan played 15 seasons in the NBA, primarily with the Chicago Bulls, but also with the Washington Wizards. His professional career spanned two decades from 1984 to 2003, during which he won numerous awards and set multiple records. Here are some key highlights of his career: - Scoring: Jordan won the NBA scoring title a record 10 times. He also has the highest career scoring average in NBA history, both in the regular season (30.12 points per game) and in the playoffs (33.45 points per game). - Championships: He led the Chicago Bulls to six NBA championships and was named Finals MVP in all six of those Finals (1991-1993, 1996-1998). - MVP Awards: Jordan was named the NBA's Most Valuable Player (MVP) five times (1988, 1991, 1992, 1996, 1998). - Defensive Ability: He was named to the NBA All-Defensive First Team nine times and won the NBA Defensive Player of the Year award in 1988. - Olympics: Jordan also won two Olympic gold medals with the U.S. basketball team, in 1984 and 1992. - Retirements and Comebacks: Jordan retired twice during his career. His first retirement came in 1993, after which he briefly played minor league baseball. He returned to the NBA in 1995. He retired a second time in 1999, only to return again in 2001, this time with the Washington Wizards. He played two seasons for the Wizards before retiring for good in 2003. After his playing career, Jordan became a team owner and executive. As of my knowledge cutoff in September 2021, he is the majority owner of the Charlotte Hornets. Off the court, Jordan is known for his lucrative endorsement deals, particularly with Nike. The Air Jordan line of sneakers is one of the most popular and enduring in the world. His influence also extends to the realms of film and fashion, and he is recognized globally as a cultural icon. In 2000, he was inducted into the Basketball Hall of Fame.",
text_splitter=text_splitter,
)

# create record 3 (Granite)
taskingai.retrieval.create_record(
collection_id=collection.collection_id,
type="text",
content="Granite is a type of coarse-grained igneous rock composed primarily of quartz and feldspar, among other minerals. The term \"granitic\" means granite-like and is applied to granite and a group of intrusive igneous rocks. Description of Granite * Type: Igneous rock * Grain size: Coarse-grained * Composition: Mainly quartz, feldspar, and micas with minor amounts of amphibole minerals * Color: Typically appears in shades of white, pink, or gray, depending on their mineralogy * Crystalline Structure: Yes, due to slow cooling of magma beneath Earth's surface * Density: Approximately 2.63 to 2.75 g/cm³ * Hardness: 6-7 on the Mohs hardness scale Formation Process Granite is formed from the slow cooling of magma that is rich in silica and aluminum, deep beneath the earth's surface. Over time, the magma cools slowly, allowing large crystals to form and resulting in the coarse-grained texture that is characteristic of granite. Uses Granite is known for its durability and aesthetic appeal, making it a popular choice for construction and architectural applications. It's often used for countertops, flooring, monuments, and building materials. In addition, due to its hardness and toughness, it is used for cobblestones and in other paving applications. Geographical Distribution Granite is found worldwide, with significant deposits in regions such as the United States (especially in New Hampshire, which is also known as \"The Granite State\"), Canada, Brazil, Norway, India, and China. Varieties There are many varieties of granite, based on differences in color and mineral composition. Some examples include Bianco Romano, Black Galaxy, Blue Pearl, Santa Cecilia, and Ubatuba. Each variety has unique patterns, colors, and mineral compositions.",
text_splitter=text_splitter,
)

Query Relevant Chunks

After establishing a knowledge base, you can easily retrieve useful chunks from it.

The chunk querying is executed by specifying the following parameters: collection_id, query_text, top_k, and score_threshold.

  • collection_id: The identifier of the collection to query.
  • query_text: The text query used to retrieve relevant chunks.
  • top_k: The number of top chunks to return.
  • score_threshold (Optional): The minimum score required for a chunk to be considered relevant. This should be set as a float value between 0 and 1.
  • max_tokens (Optional): The maximum number of tokens to return in the chunk. This is useful for limiting the size of the returned chunks.
# query chunks
chunks = taskingai.retrieval.query_chunks(
collection_id=collection.collection_id,
query_text="Basketball",
top_k=10,
score_threshold=0.2,
max_tokens=500,
)

The result should be related to basketball, so it is expected to contain chunks from record 2 (Michael Jordan).

Best Practices

Document Quality: It is crucial to maintain high-quality text documents. Ensure that your documents are well-structured and free from errors. This ensures optimal results as the system relies heavily on the quality of text data for accurate processing and retrieval.

Regular Updates: Keeping your knowledge base current is vital. Regularly update your collection with new and relevant information to maintain its relevance and accuracy. This helps the system to provide the most current and applicable results in response to queries.

Avoid Duplicate Records: Be cautious not to add duplicate records to your collection. Duplicate entries can lead to decreased performance and efficiency in the retrieval system. It can cause redundancy in the search results, making it harder to find unique and relevant information quickly.