Azure OpenAI + LLM (Large language model)

This repository contains references to LLM, as well as prompt engineering libraries, focused on Azure-related libraries.

Disclaimer: Not being able to keep up with and test every recent update, sometimes I simply copied them into this repository for later review. Please be aware that some code might be outdated.

Writing Rule: Brief each item on one or a few lines as much as possible.

What's the difference between Azure OpenAI and OpenAI?

OpenAI is a better option if you want to use the latest features like function calling, plug-ins, and access to the latest models.
Azure OpenAI is recommended if you require a reliable, secure, and compliant environment.
Azure OpenAI provides seamless integration with other Azure services..
Azure OpenAI offers private networking and role-based authentication, and responsible AI content filtering.
Azure OpenAI does not use user input as training data for other customers. Data, privacy, and security for Azure OpenAI

Section 1 : LlamaIndex & Vector Storage (Database)
Section 2 : Azure OpenAI and RAG demo
Section 3 : Microsoft Semantic Kernel
- Semantic Kernel
- Semantic Kernel sample code
Section 4 : Langchain
Section 5: Prompt Engineering & Finetuning
- Prompt Engineering
- Finetuning & Model Compression
  - Finetuning : PEFT - LoRA - QLoRA
  - Llama2 Finetuning: Llama 2
  - RLHF (Reinforcement Learning from Human Feedback) & SFT
  - Quantization: [contd.]
  - Pruning and Sparsification
  - Knowledge Distillation: Small size with Textbooks
- Visual Prompting
  - What is the Visual Prompting?
Section 6: Large Language Model: Challenges and Solutions
- Context Constraints: incl. RoPE
- OpenAI's plans
- Token Limits, Trustworthy APIs, and Memory Optimization
  - Approaches To Solve LLM Token Limits
  - Building Trustworthy, Safe and Secure LLM
  - LLM to Master APIs: incl. Gorilla
  - Memory Optimization: PagedAttention & Flash Attention
  - Language Modeling Is ...
Section 7: Open-source LLM & Generative AI Landscape
Section 8 : References
- Survey of LLMs papers
- picoGPT and lit-gpt: Implementation of LLMs
- Agents: AutoGPT and Communicative Agents
- Large Language and Vision Assistant
- MLLM (Multimodal large language model)
- ChatGPT for Robotics
- Application and UI/UX
- Data Extraction methods
  - Math problem-solving skill
  - Extract data from Tables
- Awesome demo Incl. E2E game creation
- 日本語 (Japanese Materials)
- Other Materials
Section 9 : Relevant solutions
- Microsoft Fabric: Single unified data analytics solution
- Office Copilot: Semantic Interpreter, Natural Language Commanding via Program Synthesis
- microsoft/unilm: Microsoft Foundation models
Section 10 : AI Tools
- AI Tools
Section 11 : Datasets for LLM Training
- Datasets for LLM Training
Section 12 : LLM Evaluation
- Evaluation of Large Language Models & LLMOps
Contributors
- Contributors: 👀
Symbols
- ref: external url
- doc: archived doc
- cite: the source of comments
- git: github link

Section 1 : LlamaIndex and Vector Storage (Database)

LlamaIndex (formerly GPT Index) is a data framework for LLM applications to ingest, structure, and access private or domain-specific data. The high-level API allows users to ingest and query their data in a few lines of code. ref

LlamaIndex

This section has been created for testing and feasibility checks using elastic search as a vector database and integration with LlamaIndex. LlamaIndex is specialized in integration layers to external data sources.

- index.json : Vector data local backup created by llama-index
- index_vector_in_opensearch.json : Vector data stored in Open search (Source: `files\all_h1.pdf`)
- llama-index-azure-elk-create.py: llama-index ElasticsearchVectorClient (Unofficial file to manipulate vector search, Created by me, Not Fully Tested)
- llama-index-lang-chain.py : Lang chain memory and agent usage with llama-index
- llama-index-opensearch-create.py : Vector index creation to Open search
- llama-index-opensearch-query-chatgpt.py : Test module to access Azure Open AI Embedding API.
- llama-index-opensearch-query.py : Vector index query with questions to Open search
- llama-index-opensearch-read.py : llama-index ElasticsearchVectorClient (Unofficial file to manipulate vector search, Created by me, Not Fully Tested)
- env.template : The properties. Change its name to `.env` once your values settings is done.
- Opensearch & Elasticsearch setup
  - docker : Opensearch Docker-compose
  - docker-elasticsearch : Not working for ES v8, requiring security plug-in with mandatory
  - docker-elk : Elasticsearch Docker-compose, Optimized Docker configurations with solving security plug-in issues.
  - es-open-search-set-analyzer.py : Put Language analyzer into Open search
  - es-open-search.py : Open search sample index creation
  - es-search-set-analyzer.py : Put Language analyzer into Elastic search
  - es-search.py : Usage of Elastic search python client
  - files : The Sample file for consuming

LlamaIndex example

llama-index-es-handson\callback-debug-handler.py: callback debug handler
llama-index-es-handson\chat-engine-flare-query.py: FLARE
llama-index-es-handson\chat-engine-react.py: ReAct
llama-index-es-handson\milvus-create-query.py: Milvus Vector storage

LlamaIndex Deep dive

Hign-Level Concepts
Query engine vs Chat engine
1. The query engine wraps a retriever and a response synthesizer into a pipeline, that will use the query string to fetch nodes (sentences or paragraphs) from the index and then send them to the LLM (Language and Logic Model) to generate a response
2. The chat engine is a quick and simple way to chat with the data in your index. It uses a context manager to keep track of the conversation history and generate relevant queries for the retriever. Conceptually, it is a stateful analogy of a Query Engine.

Storage Context vs Service Context

Both the Storage Context and Service Context are data classes.

index = load_index_from_storage(storage_context, service_context=service_context)

Storage Context is responsible for the storage and retrieval of data in Llama Index, while the Service Context helps in incorporating external context to enhance the search experience.

The Service Context is not directly involved in the storage or retrieval of data, but it helps in providing a more context-aware and accurate search experience.

# The storage context container is a utility container for storing nodes, indices, and vectors. 
class StorageContext:
  docstore: BaseDocumentStore
  index_store: BaseIndexStore
  vector_store: VectorStore
  graph_store: GraphStore

# The service context container is a utility container for LlamaIndex index and query classes. 
class ServiceContext:
  llm_predictor: BaseLLMPredictor
  prompt_helper: PromptHelper
  embed_model: BaseEmbedding
  node_parser: NodeParser
  llama_logger: LlamaLogger
  callback_manager: CallbackManager

CallbackManager (Japanese)
Customize TokenTextSplitter (Japanese)
Chat engine - ReAct mode
Fine-Tuning a Linear Adapter for Any Embedding Model: Fine-tuning the embeddings model requires you to reindex your documents. With this approach, you do not need to re-embed your documents. Simply transform the query instead. ref

Vector Storage Comparison

Not All Vector Databases Are Made Equal
Printed version for "Medium" limits. doc
Faiss: Facebook AI Similarity Search (Faiss) is a library for efficient similarity search and clustering of dense vectors. It is used as an alternative to a vector database in the development and library of algorithms for a vector database. It is developed by Facebook AI Research. git

Milvus (A cloud-native vector database) Embedded git

[JMO]: Alternative option to replace PineCone and Redis Search in OSS. It offers support for multiple languages, addresses the limitations of RedisSearch, and provides cloud scalability and high reliability with Kubernetes. However, for local and small-scale applications, Chroma and Qdrant have positioned themselves as the SQLite in vector databases.

pip install milvus
Docker compose: https://milvus.io/docs/install_offline-docker.md
Milvus Embedded through python console only works in Linux and Mac OS.

In Windows, Use this link, https://github.com/matrixji/milvus/releases.

# Step 1. Start Milvus

1. Unzip the package
Unzip the package, and you will find a milvus directory, which contains all the files required.

2. Start a MinIO service
Double-click the run_minio.bat file to start a MinIO service with default configurations. Data will be stored in the subdirectory s3data.

3. Start an etcd service
Double-click the run_etcd.bat file to start an etcd service with default configurations.

4. Start Milvus service
Double-click the run_milvus.bat file to start the Milvus service.

# Step 2. Run hello_milvus.py

After starting the Milvus service, you can test by running hello_milvus.py. See Hello Milvus for more information.

Vector Storage Options for Azure

Pgvector extension on Azure Cosmos DB for PostgreSQL: Langchain Document ref
Vector Search in Azure Cosmos DB for MongoDB vCore
Vector search (public preview) - Azure Cognitive Search: Langchain Document ref
Azure Cache for Redis Enterprise: Enterprise Redis Vector Search Demo

Note: Azure Cache for Redis Enterprise: Enterprise Sku series are not able to deploy by a template such as Bicep and ARM.
azure-vector-db-python\vector-db-in-azure-native.ipynb: sample code for vector databases in azure

text-embedding-ada-002 & Lucene based search engine

Azure Open AI Embedding API, text-embedding-ada-002, supports 1536 dimensions. Elastic search, Lucene based engine, supports 1024 dimensions as a max. Open search can insert 16,000 dimensions as a vector storage. Open search is available to use as a vector database with Azure Open AI Embedding API.
ref: text-embedding-ada-002: Smaller embedding size. The new embeddings have only 1536 dimensions, one-eighth the size of davinci-001 embeddings, making the new embeddings more cost effective in working with vector databases.
ref: However, one exception to this is that the maximum dimension count for the Lucene engine is 1,024, compared with 16,000 for the other engines. ref
@LlamaIndex ElasticsearchReader class: The name of the class in LlamaIndex is ElasticsearchReader. However, actually, it can only work with open search.
Vector Search with OpenAI Embeddings: Lucene Is All You Need: Our experiments were based on Lucene 9.5.0, but indexing was a bit tricky because the HNSW implementation in Lucene restricts vectors to 1024 dimensions, which was not sufficient for OpenAI’s 1536-dimensional embeddings. Although the resolution of this issue, which is to make vector dimensions configurable on a per codec basis, has been merged to the Lucene source trunk git, this feature has not been folded into a Lucene release (yet) as of early August 2023.

Section 2 : Azure OpenAI and RAG demo

Microsoft Azure OpenAI relevant LLM Framework & Copilot Stack

Semantic Kernel: Semantic Kernel is an open-source SDK that lets you easily combine AI services like OpenAI, Azure OpenAI, and Hugging Face with conventional programming languages like C# and Python. An LLM Ochestrator, similar to Langchain. / git
guidance: A guidance language for controlling large language models. Simple, intuitive syntax, based on Handlebars templating. Domain Specific Language (DSL) for handling model interaction. Langchain libaries but different approach rather than ochestration, particularly effective for implementing Chain of Thought. / git
Azure Machine Learning Promt flow: Visual Designer for Prompt crafting. Use Jinja as a prompt template language. / ref / git
Prompt Engine: Craft prompts for Large Language Models: npm install prompt-engine / git / python
TypeChat: TypeChat replaces prompt engineering with schema engineering. To build natural language interfaces using types. / git
DeepSpeed: DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
LMOps: a collection of tools for improving text prompts used as input to generative AI models. The toolkit includes Promptist, which optimizes a user's text input for text-to-image generation, and Structured Prompting.
Copilot Stack: Microsoft 365 Copilot, Dynamics 365 Copilot, Copilot in Microsoft Viva and Microsoft Security Copilot

ChatGPT + Enterprise data RAG (Retrieval-Augmented Generation) Demo

What is the RAG (Retrieval-Augmented Generation)?

RAG (Retrieval-Augmented Generation) : Integrates the retrieval (searching) into LLM text generation. RAG helps the model to “look up” external information to improve its responses.

cite
In a 2020 paper, Meta (Facebook) came up with a framework called retrieval-augmented generation to give LLMs access to information beyond their training data. ref
In 2021, Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
1. RAG-sequence — We retrieve k documents, and use them to generate all the output tokens that answer a user query.
2. RAG-token— We retrieve k documents, use them to generate the next token, then retrieve k more documents, use them to generate the next token, and so on. This means that we could end up retrieving several different sets of documents in the generation of a single answer to a user’s query.
- Of the two approaches proposed in the paper, the RAG-sequence implementation is pretty much always used in the industry. It’s cheaper and simpler to run than the alternative, and it produces great results. cite
4 RAG techniques implemented in llama_index / cite / git
1. SQL Router Query Engine: Query router that can reference your vector database or SQL database
2. Sub Question Query Engine: Break down the complex question into sub-questions
3. Recursive Retriever + Query Engine: Reference node relationships, rather than only finding a node (chunk) that is most relevant.
4. Self Correcting Query Engines: Use an LLM to evaluate its own output.
The Problem with RAG
1. A question is not semantically similar to its answers. Cosine similarity may favor semantically similar texts that do not contain the answer.
2. Semantic similarity gets diluted if the document is too long. Cosine similarity may favor short documents with only the relevant information.
3. The information needs to be contained in one or a few documents. Information that requires aggregations by scanning the whole data.

Demo Deployment Steps

The files in this directory, extra_steps, have been created for managing extra configurations and steps for launching the demo repository.

git : Python, ReactJs, Typescript
1. (optional) Check Azure module installation in Powershell by running ms_internal_az_init.ps1 script
2. (optional) Set your Azure subscription Id to default
  
  Start the following commands in ./azure-search-openai-demo directory
3. (deploy azure resources) Simply Run azd up
  
  The azd stores relevant values in the .env file which is stored at ${project_folder}\.azure\az-search-openai-tg\.env.
4. Move to app by cd app command
5. (sample data loading) Move to scripts then Change into Powershell by Powershell command, Run prepdocs.ps1
  - console output (excerpt)
    Uploading blob for page 29 -> role_library-29.pdf Uploading blob for page 30 -> role_library-30.pdf Indexing sections from 'role_library.pdf' into search index 'gptkbindex' Splitting './data\role_library.pdf' into sections Indexed 60 sections, 60 succeeded
6. Move to app by cd .. and cd app command
7. (locally running) Run start.cmd
- console output (excerpt)
```
Building frontend


> [email protected] build \azure-search-openai-demo\app\frontend
> tsc && vite build

vite v4.1.1 building for production...
✓ 1250 modules transformed.
../backend/static/index.html                    0.49 kB
../backend/static/assets/github-fab00c2d.svg    0.96 kB
../backend/static/assets/index-184dcdbd.css     7.33 kB │ gzip:   2.17 kB
../backend/static/assets/index-41d57639.js    625.76 kB │ gzip: 204.86 kB │ map: 5,057.29 kB

Starting backend

* Serving Flask app 'app'
* Debug mode: off
WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
* Running on http://127.0.0.1:5000
Press CTRL+C to quit
...
```
Running from second times
1. Move to app by cd .. and cd app command
2. (locally running) Run start.cmd
(optional)
- fix_from_origin : The modified files, setup related
- ms_internal_az_init.ps1 : Powershell script for Azure module installation
- ms_internal_troubleshootingt.ps1 : Set Specific Subscription Id as default

Azure OpenAI samples

Azure OpenAI samples: ref
The repository for all Azure OpenAI Samples complementing the OpenAI cookbook.: ref
Azure-Samples ref
- Azure OpenAI with AKS By Terraform: https://github.com/Azure-Samples/aks-openai-terraform
- Azure OpenAI with AKS By Bicep: https://github.com/Azure-Samples/aks-openai
- Enterprise Logging: https://github.com/Azure-Samples/openai-python-enterprise-logging
- Azure OpenAI with AKS by Terraform (simple version): https://github.com/Azure-Samples/azure-openai-terraform-deployment-sample
- ChatGPT Plugin Quickstart using Python and FastAPI: https://github.com/Azure-Samples/openai-plugin-fastapi
- Azure-Cognitive-Search-Azure-OpenAI-Accelerator: https://github.com/MSUSAzureAccelerators/Azure-Cognitive-Search-Azure-OpenAI-Accelerator
Azure OpenAI Network Latency Test Script : ref

Azure Reference Architectures


Azure OpenAI Embeddings QnA	Azure Cosmos DB + OpenAI ChatGPT C# blazor and Azure Custom Template

C# Implementation ChatGPT + Enterprise data with Azure OpenAI and Cognitive Search	Simple ChatGPT UI application Typescript, ReactJs and Flask

Azure Video Indexer demo Azure Video Indexer + OpenAI	Miyagi Integration demonstrate for multiple langchain libraries

Azure Open AI work with Cognitive Search act as a Long-term memory
Tech community
1. Grounding LLMs: Retrieval-Augmented Generation (RAG)
2. Revolutionize your Enterprise Data with ChatGPT
3. Check Your Facts and Try Again: Improving Large Language Models with External Knowledge and Automated Feedback

Azure Cognitive Search : Vector Search

In the vector databases category within Azure, several alternative solutions are available. However, ACS is the only option that provides a range of choices, including a conventional Lucene-based search engine and a hybrid search incorporating vector search capabilities.
git: Vector Search Sample Code
Azure Cognitive Search supports
1. Text Search
2. Pure Vector Search
3. Hybrid Search (Text search + Vector search)
4. Semantic Hybrid Search (Text search + Semantic search + Vector search)
azure-search-vector-sample\azure-search-vector-python-sample.ipynb: Azure Cognitive Search - Vector and Hybrid Search
Azure Cognitive Search offers a set of capabilities designed to improve relevance in these scenarios. We use a combination of hybrid retrieval (vector search + keyword search) + semantic ranking as the most effective approach for improved relevance out-of–the-box. TL;DR: Hybrid search performance is better than Vector only search. ref

Azure Enterprise Services

Bing Chat Enterprise Privacy and Protection
1. Bing Chat Enterprise doesn't have plugin support
2. Only content provided in the chat by users is accessible to Bing Chat Enterprise.
Azure OpenAI Service On Your Data in Public Preview ref

Section 3 : Microsoft Semantic Kernel

Microsoft Langchain Library supports C# and Python and offers several features, some of which are still in development and may be unclear on how to implement. However, it is simple, stable, and faster than Python-based open-source software. The features listed on the link include: Semantic Kernel Feature Matrix / old

Semantic Kernel

This section includes how to utilize Azure Cosmos DB for vector storage and vector search by leveraging the SemanticKernel.

appsettings.template.json : Environment value configuration file.
ComoseDBVectorSearch.cs : Vector Search using Azure Cosmos DB
CosmosDBKernelBuild.cs : Kernel Build code (test)
CosmosDBVectorStore.cs : Embedding Text and store it to Azure Cosmos DB
LoadDocumentPage.cs : PDF splitter class. Split the text to unit of section. (C# version of azure-search-openai-demo/scripts/prepdocs.py)
LoadDocumentPageOutput : LoadDocumentPage class generated output
MemoryContextAndPlanner.cs : Test code of context and planner
MemoryConversationHistory.cs : Test code of conversation history
Program.cs : Run a demo. Program Entry point
SemanticFunction.cs : Test code of conversation history
semanticKernelCosmos.csproj : C# Project file
Settings.cs : Environment value class
SkillBingSearch.cs : Bing Search Skill
SkillDALLEImgGen.cs : DALLE Skill

Semantic Kernel Notes

Semantic Kernel Planner
Is Semantic Kernel Planner the same as LangChain agents?

Planner in SK is not the same as Agents in LangChain. cite
```
Agents in LangChain use recursive calls to the LLM to decide the next step to take based on the current state.

The two planner implementations in SK are not self-correcting.

  Sequential planner tries to produce all the steps at the very beginning, so it is unable to handle unexpected errors.
  Action planner only chooses one tool to satisfy the goal
```
- Stepwise Planner released. The Stepwise Planner features the "CreateScratchPad" function, acting as a 'Scratch Pad' to aggregate goal-oriented steps.
  
  ScratchPad: Using "program execution" strategy boosts performance of large language model tasks by enforcing the use of a "scratch pad." For instance, instead of requesting the LLM's output for a Python function with a specific input, users can ask for the execution trace. This prompts the model to generate predictions for each intermediate step of the function, thereby increasing the probability of the LLM producing the correct final line. cite
Semantic Kernel supports Azure Cognitive Search Vector Search. July 19th, 2023 Dev Blog
SemanticKernel Implementation sample to overcome Token limits of Open AI model. Semantic Kernel でトークンの限界を超えるような長い文章を分割してスキルに渡して結果を結合したい (zenn.dev) Semantic Kernel でトークンの限界を超える

Bing search Web UI and Semantic Kernel sample code

Semantic Kernel sample code to integrate with Bing Search

\ms-semactic-bing-notebook
- gs_chatgpt.ipynb: Azure Open AI ChatGPT sample to use Bing Search
- gs_davinci.ipynb: Azure Open AI Davinci sample to use Bing Search
Bing Search UI for demo

\bing-search-webui: (Utility, to see the search results from Bing Search API)

Section 4 : Langchain & Its Competitors

LangChain is a framework for developing applications powered by language models. (1) Be data-aware: connect a language model to other sources of data. (2) Be agentic: Allow a language model to interact with its environment.
- It highlights two main value props of the framework:
1. Components: modular abstractions and implementations for working with language models, with easy-to-use features.
2. Use-Case Specific Chains: chains of components that assemble in different ways to achieve specific use cases, with customizable interfaces.
cite: ref

cite: packt
```
chain = prompt | model | StrOutputParser() | search
```

Langchain and Prompt engineering libraries

Langchain Feature Matrix & Cheetsheet

Feature Matrix: LangChain Features
- Feature Matrix: Snapshot in 2023 July
Cheetsheet: LangChain CheatSheet
LangChain Cheetsheet KD-nuggets: LangChain Cheetsheet KD-nuggets doc
LangChain AI Handbook: published by Pinecone
Awesome Langchain: Curated list of tools and projects using LangChain.

Langchain Impressive Features

Langchain/cache: Reducing the number of API calls
Langchain/context-aware-splitting: Splits a file into chunks while keeping metadata
LangChain Expression Language: A declarative way to easily compose chains together
LangSmith Platform for debugging, testing, evaluating.
langflow: LangFlow is a UI for LangChain, designed with react-flow.
Flowise Drag & drop UI to build your customized LLM flow

Langchain Quick Start: How to Use

deeplearning.ai\langchain-chat-with-your-data: DeepLearning.ai LangChain: Chat with Your Data
deeplearning.ai\langchain-llm-app-dev: LangChain for LLM Application Development
@practical-ai sample code
- langchain-@practical-ai\Langchain_1_(믹스의_인공지능).ipynb : Langchain Get started
- langchain-@practical-ai\Langchain_2_(믹스의_인공지능).ipynb : Langchain Utilities
```
from langchain.chains.summarize import load_summarize_chain
chain = load_summarize_chain(chat, chain_type="map_reduce", verbose=True)
chain.run(docs[:3])
```
  cite: @practical-ai

Langchain chain type: Summarizer

stuff: Sends everything at once in LLM. If it's too long, an error will occur.
map_reduce: Summarizes by dividing and then summarizing the entire summary.
refine: (Summary + Next document) => Summary
map_rerank: Ranks by score and summarizes to important points.

Langchain Agent

If you're using a text LLM, first try zero-shot-react-description.
If you're using a Chat Model, try chat-zero-shot-react-description.
If you're using a Chat Model and want to use memory, try conversational-react-description.
self-ask-with-search: self ask with search paper
react-docstore: ReAct paper
Agent Type

class AgentType(str, Enum):
    """Enumerator with the Agent types."""

    ZERO_SHOT_REACT_DESCRIPTION = "zero-shot-react-description"
    REACT_DOCSTORE = "react-docstore"
    SELF_ASK_WITH_SEARCH = "self-ask-with-search"
    CONVERSATIONAL_REACT_DESCRIPTION = "conversational-react-description"
    CHAT_ZERO_SHOT_REACT_DESCRIPTION = "chat-zero-shot-react-description"
    CHAT_CONVERSATIONAL_REACT_DESCRIPTION = "chat-conversational-react-description"
    STRUCTURED_CHAT_ZERO_SHOT_REACT_DESCRIPTION = (
        "structured-chat-zero-shot-react-description"
    )
    OPENAI_FUNCTIONS = "openai-functions"
    OPENAI_MULTI_FUNCTIONS = "openai-multi-functions"

ReAct vs MRKL (miracle)

ReAct is inspired by the synergies between "acting" and "reasoning" which allow humans to learn new tasks and make decisions or reasoning.

MRKL stands for Modular Reasoning, Knowledge and Language and is a neuro-symbolic architecture that combines large language models, external knowledge sources, and discrete reasoning

cite: ref

zero-shot-react-description This agent uses the ReAct framework to determine which tool to use based solely on the tool’s description. Any number of tools can be provided. This agent requires that a description is provided for each tool.

react-docstore This agent uses the ReAct framework to interact with a docstore. Two tools must be provided: a Search tool and a Lookup tool (they must be named exactly as so). The Search tool should search for a document, while the Lookup tool should lookup a term in the most recently found document. This agent is equivalent to the original ReAct paper, specifically the Wikipedia example.

According to my understanding, MRKL is implemented by using ReAct framework in langchain ,which is called zero-shot-react-description. The original ReAct is been implemented in react-docstore agent type.

ps. MRKL is published at 1 May 2022, earlier than ReAct, which is published at 6 Oct 2022.

Criticism to Langchain

The Problem With LangChain: ref, git
What’s your biggest complaint about langchain?: ref
Langchain Is Pointless: ref

LangChain has been criticized for making simple things relatively complex, which creates unnecessary complexity and tribalism that hurts the up-and-coming AI ecosystem as a whole. The documentation is also criticized for being bad and unhelpful.

Langchain & Its Competitors

Langchain vs LlamaIndex

Basically LlamaIndex is a smart storage mechanism, while Langchain is a tool to bring multiple tools together. cite
LangChain offers many features and focuses on using chains and agents to connect with external APIs. In contrast, LlamaIndex is more specialized and excels at indexing data and retrieving documents.

Langchain vs Semantic Kernel

Langchain	Semantic Kernel
Memory	Memory
Tookit	Plugin (pre. Skill)
Tool	LLM prompts (semantic functions) or native C# or Python code (native function)
Agent	Planner
Chain	Steps, Pipeline
Tool	Connector

Semantic Kernel : Semantic Function

Semantic Function - expressed in natural language in a text file "skprompt.txt" using SK's Prompt Template language. Each semantic function is defined by a unique prompt template file, developed using modern prompt engineering techniques. cite

Semantic Kernel : Prompt Template language Key takeaways

1. Variables : use the {{$variableName}} syntax : Hello {{$name}}, welcome to Semantic Kernel!
2. Function calls: use the {{namespace.functionName}} syntax : The weather today is {{weather.getForecast}}.
3. Function parameters: {{namespace.functionName $varName}} and {{namespace.functionName "value"}} syntax
   : The weather today in {{$city}} is {{weather.getForecast $city}}.
4. Prompts needing double curly braces :
   {{ "{{" }} and {{ "}}" }} are special SK sequences.
5. Values that include quotes, and escaping :

    For instance:
    ... {{ 'no need to \\"escape" ' }} ...
    is equivalent to:
    ... {{ 'no need to "escape" ' }} ...

Semantic Kernel Glossary

Glossary in Git

Glossary in MS Doc

Term	Short Description
ASK	A user's goal is sent to SK as an ASK
Kernel	The kernel orchestrates a user's ASK
Planner	The planner breaks it down into steps based upon resources that are available
Resources	Planning involves leveraging available skills, memories, and connectors
Steps	A plan is a series of steps for the kernel to execute
Pipeline	Executing the steps results in fulfilling the user's ASK

Langchain vs Semantic Kernel vs Azure Machine Learning Prompt flow

What's the difference between LangChain and Semantic Kernel?

LangChain has many agents, tools, plugins etc. out of the box. More over, LangChain has 10x more popularity, so has about 10x more developer activity to improve it. On other hand, Semantic Kernel architecture and quality is better, that's quite promising for Semantic Kernel. ref
What's the difference between Azure Machine Learing PromptFlow and Semantic Kernel?
1. Low/No Code vs C#, Python, Java
2. Focused on Prompt orchestrating vs Integrate LLM into their existing app.
Using Prompt flow with Semantic Kernel: ref

Prompt Template Language

	Handlebars.js	Jinja2	Prompt Template
Conditions	{{#if user}} Hello {{user}}! {{else}} Hello Stranger! {{/if}}	{% if user %} Hello {{ user }}! {% else %} Hello Stranger! {% endif %}	Branching features such as "if", "for", and code blocks are not part of SK's template language.
Loop	{{#each items}} Hello {{this}} {{/each}}	{% for item in items %} Hello {{ item }} {% endfor %}	By using a simple language, the kernel can also avoid complex parsing and external dependencies.
Langchain Library	guidance	Langchain & Prompt flow	Semactic Kernel
URL	ref	ref	ref

Section 5: Prompt Engineering & Finetuning

Prompt Engineering

Zero-shot
- Large Language Models are Zero-Shot Reasoners
Few-shot Learning
- Open AI: Language Models are Few-Shot Learners
Chain of Thought (CoT): ReAct and Self Consistency also inherit the CoT concept.
Recursively Criticizes and Improves (RCI)
- Critique: Review your previous answer and find problems with your answer.
- Improve: Based on the problems you found, improve your answer.
ReAct: Grounding with external sources. (Reasoning and Act): Combines reasoning and acting ref
Chain-of-Thought Prompting
- Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Tree of Thought: Self-evaluate the progress intermediate thoughts make towards solving a problem git / Agora: Tree of Thoughts (ToT) git
- tree-of-thought\forest_of_thought.py: Forest of thought Decorator sample
- tree-of-thought\tree_of_thought.py: Tree of thought Decorator sample
- tree-of-thought\react-prompt.py: ReAct sample without Langchain
Graph of Thoughts (GoT) Solving Elaborate Problems with Large Language Models git
Retrieval Augmented Generation (RAG): To address such knowledge-intensive tasks. RAG combines an information retrieval component with a text generator model.
Zero-shot, one-shot and few-shot
Prompt Engneering overview
Prompt Concept
1. Question-Answering
2. Roll-play: Act as a [ROLE] perform [TASK] in [FORMAT]
3. Reasoning
4. Prompt-Chain
Chain-of-Verification reduces Hallucination in LLMs: A four-step process that consists of generating a baseline response, planning verification questions, executing verification questions, and generating a final verified response based on the verification results.
Prompt Engineering Guide
- Prompt Engineering: Prompt Engineering, , also known as In-Context Prompting ...
- Prompt Engineering Guide: Copyright © 2023 DAIR.AI
Promptist
- Promptist: Microsoft's researchers trained an additional language model (LM) that optimizes text prompts for text-to-image generation.
  - For example, instead of simply passing "Cats dancing in a space club" as a prompt, an engineered prompt might be "Cats dancing in a space club, digital painting, artstation, concept art, soft light, hdri, smooth, sharp focus, illustration, fantasy."

Azure OpenAI Prompt Guide

Prompt engineering techniques

OpenAI Prompt Guide

DeepLearning.ai Short Courses

Awesome ChatGPT Prompts

Awesome ChatGPT Prompts

Finetuning & Model Compression

PEFT: Parameter-Efficient Fine-Tuning (Youtube)

PEFT: Parameter-Efficient Fine-Tuning. PEFT is an approach to fine tuning only a few parameters.

Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning

Category: Represent approach - Description - Pseudo Code ref

Adapters: Adapters - Additional Layers. Inference can be slower.

def transformer_with_adapter(x):
  residual = x
  x = SelfAttention(x)
  x = FFN(x) # adapter
  x = LN(x + residual)
  residual = x
  x = FFN(x) # transformer FFN
  x = FFN(x) # adapter
  x = LN(x + residual)
  return x

Soft Prompts: Prompt-Tuning - Learnable text prompts. Not always desired results.

def soft_prompted_model(input_ids):
  x = Embed(input_ids)
  soft_prompt_embedding = SoftPromptEmbed(task_based_soft_prompt)
  x = concat([soft_prompt_embedding, x], dim=seq)
  return model(x)

Selective: BitFit - Update only the bias parameters. fast but limited.

params = (p for n,p in model.named_parameters() if "bias" in n)
optimizer = Optimizer(params)

Reparametrization: LoRa - Low-rank decomposition. Efficient, Complex to implement.

def lora_linear(x):
  h = x @ W # regular linear
  h += x @ W_A @ W_B # low_rank update
  return scale * h

LoRA: Low-Rank Adaptation of Large Language Models: LoRA is one of PEFT technique. To represent the weight updates with two smaller matrices (called update matrices) through low-rank decomposition. git
QLoRA: Efficient Finetuning of Quantized LLMs: 4-bit quantized pre-trained language model into Low Rank Adapters (LoRA). git
Training language models to follow instructions with human feedback
Fine-tuning a GPT — LoRA: Comprehensive guide for LoRA. Printed version for backup. doc
LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models: A combination of sparse local attention and LoRA git
1. The document states that LoRA alone is not sufficient for long context extension.
2. Although dense global attention is needed during inference, fine-tuning the model can be done by sparse local attention, shift short attention (S2-Attn).
3. S2-Attn can be implemented with only two lines of code in training.
LIMA: Less Is More for Alignment: fine-tuned with the standard supervised loss on only 1,000 carefully curated prompts and responses, without any reinforcement learning or human preference modeling. LIMA demonstrates remarkably strong performance, either equivalent or strictly preferred to GPT-4 in 43% of cases.
Efficient Streaming Language Models with Attention Sinks 1. StreamingLLM, an efficient framework that enables LLMs trained with a finite length attention window to generalize to infinite sequence length without any fine-tuning. 2. We neither expand the LLMs' context window nor enhance their long-term memory. git
- Key-Value (KV) cache is an important component in the StreamingLLM framework.
1. Window Attention: Only the most recent Key and Value states (KVs) are cached. This approach fails when the text length surpasses the cache size.
2. Sliding Attention /w Re-computation: Rebuilds the Key-Value (KV) states from the recent tokens for each new token. Evicts the oldest part of the cache.
3. StreamingLLM: One of the techniques used is to add a placeholder token (yellow-colored) as a dedicated attention sink during pre-training. This attention sink attracts the model’s attention and helps it generalize to longer sequences. Outperforms the sliding window with re-computation baseline by up to a remarkable 22.2× speedup.

Llama2 Finetuning

A key difference between Llama 1 and Llama 2 is the architectural change of attention layer, in which Llama 2 takes advantage of Grouped Query Attention (GQA) mechanism to improve efficiency.

Multi-query attention (MQA)

Coding LLaMA 2 from scratch in PyTorch - KV Cache, Grouped Query Attention, Rotary PE, RMSNorm Youtube / git

Rotary PE

def apply_rotary_embeddings(x: torch.Tensor, freqs_complex: torch.Tensor, device: str):
    # Separate the last dimension pairs of two values, representing the real and imaginary parts of the complex number
    # Two consecutive values will become a single complex number
    # (B, Seq_Len, H, Head_Dim) -> (B, Seq_Len, H, Head_Dim/2)
    x_complex = torch.view_as_complex(x.float().reshape(*x.shape[:-1], -1, 2))
    # Reshape the freqs_complex tensor to match the shape of the x_complex tensor. So we need to add the batch dimension and the head dimension
    # (Seq_Len, Head_Dim/2) --> (1, Seq_Len, 1, Head_Dim/2)
    freqs_complex = freqs_complex.unsqueeze(0).unsqueeze(2)
    # Multiply each complex number in the x_complex tensor by the corresponding complex number in the freqs_complex tensor
    # Which results in the rotation of the complex number as shown in the Figure 1 of the paper
    # (B, Seq_Len, H, Head_Dim/2) * (1, Seq_Len, 1, Head_Dim/2) = (B, Seq_Len, H, Head_Dim/2)
    x_rotated = x_complex * freqs_complex
    # Convert the complex number back to the real number
    # (B, Seq_Len, H, Head_Dim/2) -> (B, Seq_Len, H, Head_Dim/2, 2)
    x_out = torch.view_as_real(x_rotated)
    # (B, Seq_Len, H, Head_Dim/2, 2) -> (B, Seq_Len, H, Head_Dim)
    x_out = x_out.reshape(*x.shape)
    return x_out.type_as(x).to(device)

KV Cache, Grouped Query Attention

  # Replace the entry in the cache
  self.cache_k[:batch_size, start_pos : start_pos + seq_len] = xk
  self.cache_v[:batch_size, start_pos : start_pos + seq_len] = xv

  # (B, Seq_Len_KV, H_KV, Head_Dim)
  keys = self.cache_k[:batch_size, : start_pos + seq_len]
  # (B, Seq_Len_KV, H_KV, Head_Dim)
  values = self.cache_v[:batch_size, : start_pos + seq_len]

  # Since every group of Q shares the same K and V heads, just repeat the K and V heads for every Q in the same group.

  # (B, Seq_Len_KV, H_KV, Head_Dim) --> (B, Seq_Len_KV, H_Q, Head_Dim)
  keys = repeat_kv(keys, self.n_rep)
  # (B, Seq_Len_KV, H_KV, Head_Dim) --> (B, Seq_Len_KV, H_Q, Head_Dim)
  values = repeat_kv(values, self.n_rep)

Comprehensive Guide for LLaMA with RLHF: StackLLaMA: A hands-on guide to train LLaMA with RLHF
Official LLama Recipes incl. Finetuning: git
The sources of Inference code and finetuning code are commented on the files. git
- llama2-trial.ipynb: LLama 2 inference code in local
- llama2-finetune.ipynb: LLama 2 Finetuning
- llama_2_finetuning_inference.ipynb: LLama 2 Finetuning with Inference
- Llama_2_Fine_Tuning_using_QLora.ipynb: ref
Llama 2 ONNX git
- ONNX: ONNX stands for Open Neural Network Exchange. It is an open standard format for machine learning interoperability. ONNX enables AI developers to use models with a variety of frameworks, tools, runtimes, and compilers.
- ONNX Runtime can be used on mobile devices. ONNX Runtime gives you a variety of options to add machine learning to your mobile application. ONNX Runtime mobile is a reduced size, high performance package for edge devices, including smartphones and other small storage devices.
LLM-Engine: The open source engine for fine-tuning LLM git
- finetune_llama_2_on_science_qa.ipynb: git

RLHF (Reinforcement Learning from Human Feedback) & SFT (Supervised Fine-Tuning)

Machine learning technique that trains a "reward model" directly from human feedback and uses the model as a reward function to optimize an agent's policy using reinforcement learning

cite
Libraries: TRL, trlX, Argilla

TRL: from the Supervised Fine-tuning step (SFT), Reward Modeling step (RM) to the Proximal Policy Optimization (PPO) step

The three steps in the process: 1. pre-training on large web-scale data, 2. supervised fine-tuning on instruction data (instruction tuning), and 3. RLHF. ref
Reinforcement Learning from Human Feedback (RLHF) is a process of pretraining and retraining a language model using human feedback to develop a scoring algorithm that can be reapplied at scale for future training and refinement. As the algorithm is refined to match the human-provided grading, direct human feedback is no longer needed, and the language model continues learning and improving using algorithmic grading alone. ref
Supervised Fine-Tuning (SFT) fine-tuning a pre-trained model on a specific task or domain using labeled data. This can cause more significant shifts in the model’s behavior compared to RLHF.
Proximal Policy Optimization (PPO) is a policy gradient method for reinforcement learning that aims to have the data efficiency and reliable performance of TRPO (Trust Region Policy Optimization), while using only first-order optimization. It does this by modifying the objective function to penalize changes to the policy that move the probability ratio away from 1. This results in an algorithm that is easier to implement and tune than TRPO while still achieving good performance. TRPO requires second-order optimization, which can be more difficult to implement and computationally expensive.
First-order optimization methods are a class of optimization algorithms that use only the first derivative (gradient) of the objective function to find its minimum or maximum. These methods include gradient descent, stochastic gradient descent, and their variants.
Second-order methods: Second derivative (Hessian) of the objective function
RLAF (Reinforcement Learning from AI Feedback): Uses AI feedback to generate instructions for the model. TLDR: CoT (Chain-of-Thought, Improved), Few-shot (Not improved). Only explores the task of summarization. After training on a few thousand examples, performance is close to training on the full dataset. RLAIF and RLHF vs. the reference summaries is also not statistically significant.

Model Compression for Large Language Models

A Survey on Model Compression for Large Language Models ref

Quantization

Quantization-aware training (QAT): The model is further trained with quantization in mind after being initially trained in floating-point precision.
Post-training quantization (PTQ): The model is quantized after it has been trained without further optimization during the quantization process.

Method	Pros	Cons
Post-training quantization	Easy to use, no need to retrain the model	May result in accuracy loss
Quantization-aware training	Can achieve higher accuracy than post-training quantization	Requires retraining the model, can be more complex to implement

bitsandbytes: 8-bit optimizers git

Pruning and Sparsification

Pruning: The process of removing some of the neurons or layers from a neural network. This can be done by identifying and removing neurons or layers that have little or no impact on the output of the network.
Sparsification is indeed a technique used to reduce the size of large language models by removing redundant parameters.
Both sparsification and pruning involve removing neurons or connections from the network. The main difference between network sparsification and model pruning is that there is no operational difference between them, and a pruned network usually leads to a sparser network.

Small size with Textbooks: High quality synthetic dataset

ph-1.5: Textbooks Are All You Need II. Phi 1.5 is trained solely on synthetic data. Despite having a mere 1 billion parameters compared to Llama 7B's much larger model size, Phi 1.5 often performs better in benchmark tests.
ph-1: Despite being small in size, phi-1 attained 50.6% on HumanEval and 55.5% on MBPP. Textbooks Are All You Need. ref
Orca: Orca learns from rich signals from GPT 4 including explanation traces; step-by-step thought processes; and other complex instructions, guided by teacher assistance from ChatGPT. ref

Large Transformer Model Inference Optimization

Large Transformer Model Inference Optimization

Visual Prompting

What is visual prompting: Similarly to what has happened in NLP, large pre-trained vision transformers have made it possible for us to implement Visual Prompting. Printed version for backup doc
Visual Prompting paper
Andrew Ng’s Visual Prompting Livestream

Section 6 : Large Language Model: Challenges and Solutions

Context constraints

Introducing 100K Context Windows: hundreds of pages, Around 75,000 words; demo Anthropic Claude
Rotary Positional Embedding (RoPE) / Printed version for backup ref / doc
How is this different from the sinusoidal embeddings used in "Attention is All You Need"?
1. Sinusoidal embeddings apply to each coordinate individually, while rotary embeddings mix pairs of coordinates
2. Sinusoidal embeddings add a cos or sin term, while rotary embeddings use a multiplicative factor.
Lost in the Middle: How Language Models Use Long Contexts
1. Best Performace when relevant information is at beginning
2. Too many retrieved documents will harm performance
3. Performacnce decreases with an increase in context
Structured Prompting: Scaling In-Context Learning to 1,000 Examples
1. Microsoft's Structured Prompting allows thousands of examples, by first concatenating examples into groups, then inputting each group into the LM. The hidden key and value vectors of the LM's attention modules are cached. Finally, when the user's unaltered input prompt is passed to the LM, the cached attention vectors are injected into the hidden layers of the LM.
2. This approach wouldn't work with OpenAI's closed models. because this needs to access [keys] and [values] in the transformer internals, which they do not expose. You could implement yourself on OSS ones.
  
  cite

OpenAI's plans

OpenAI's plans according to Sam Altman

Archived Link : Printed version for backup doc

OpenAI Plugin and function calling

ChatGPT Plugin
ChatGPT Function calling

Under the hood, functions are injected into the system message in a syntax the model has been trained on. This means functions count against the model's context limit and are billed as input tokens. If running into context limits, we suggest limiting the number of functions or the length of documentation you provide for function parameters.

Azure OpenAI start to support function calling. ref

OSS Alternatives for OpenAI Advanced Data Analytics (aka. Code Interpreter)

OpenAI Code Interpreter Integration with Sandboxed python execution environment

We provide our models with a working Python interpreter in a sandboxed, firewalled execution environment, along with some ephemeral disk space.
OSS Code Interpreter A LangChain implementation of the ChatGPT Code Interpreter.
SlashGPT The tool integrated with "jupyter" agent
gpt-code-ui An open source implementation of OpenAI's ChatGPT Code interpreter.
Open Interpreter: Let language models run code on your computer.

GPT-4 details leaked

GPT-4V(ision) system card: ref / ref
The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)
GPT-4 details leaked

GPT-4 is a language model with approximately 1.8 trillion parameters across 120 layers, 10x larger than GPT-3. It uses a Mixture of Experts (MoE) model with 16 experts, each having about 111 billion parameters. Utilizing MoE allows for more efficient use of resources during inference, needing only about 280 billion parameters and 560 TFLOPs, compared to the 1.8 trillion parameters and 3,700 TFLOPs required for a purely dense model.

The model is trained on approximately 13 trillion tokens from various sources, including internet data, books, and research papers. To reduce training costs, OpenAI employs tensor and pipeline parallelism, and a large batch size of 60 million. The estimated training cost for GPT-4 is around $63 million. ref

OpenAI Products

ChatGPT can now see, hear, and speak: It has recently been updated to support multimodal capabilities, including voice and image. Whisper / CLIP
GPT-3.5 Turbo Fine-tuning Fine-tuning for GPT-3.5 Turbo is now available, with fine-tuning for GPT-4 coming this fall. August 22, 2023
DALL·E 3 : In September 2023, OpenAI announced their latest image model, DALL-E 3 git
Open AI Enterprise: Removes GPT-4 usage caps, and performs up to two times faster ref
Custom instructions: In a nutshell, the Custom Instructions feature is a cross-session memory that allows ChatGPT to retain key instructions across chat sessions.

ChatGPT : “user”, “assistant”, and “system” messages.

To be specific, the ChatGPT API allows for differentiation between “user”, “assistant”, and “system” messages.

always obey "system" messages.
all end user input in the “user” messages.
"assistant" messages as previous chat responses from the assistant.

Presumably, the model is trained to treat the user messages as human messages, system messages as some system level configuration, and assistant messages as previous chat responses from the assistant. ref

Approaches To Solve LLM Token Limits

Open AI Tokenizer: GPT-3, Codex Token counting
tiktoken: BPE tokeniser for use with OpenAI's models. Token counting.
What are tokens and how to count them?
5 Approaches To Solve LLM Token Limits : Printed version for backup doc

Building Trustworthy, Safe and Secure LLM

NeMo Guardrails: Building Trustworthy, Safe and Secure LLM Conversational Systems
Trustworthy LLMs: Comprehensive overview for assessing LLM trustworthiness; Reliability, safety, fairness, resistance to misuse, explainability and reasoning, adherence to social norms, and robustness.
Political biases of LLMs: From Pretraining Data to Language Models to Downstream Tasks: Tracking the Trails of Political Biases Leading to Unfair NLP Models: cited by
Red Teaming: The term red teaming has historically described systematic adversarial attacks for testing security vulnerabilities. LLM red teamers should be a mix of people with diverse social and professional backgrounds, demographic groups, and interdisciplinary expertise that fits the deployment context of your AI system. ref

LLM to Master APIs

Gorilla: An API store for LLMs: Gorilla: Large Language Model Connected with Massive APIs git
1. Used GPT-4 to generate a dataset of instruction-api pairs for fine-tuning Gorilla.
2. Used the abstract syntax tree (AST) of the generated code to match with APIs in the database and test set for evaluation purposes.
3. cite
Another user asked how Gorilla compared to LangChain; Patil replied: Langchain is a terrific project that tries to teach agents how to use tools using prompting. Our take on this is that prompting is not scalable if you want to pick between 1000s of APIs. So Gorilla is a LLM that can pick and write the semantically and syntactically correct API for you to call! A drop in replacement into Langchain!
Meta: Toolformer: Language Models That Can Use Tools, by MetaAI git
ToolLLM: : Facilitating Large Language Models to Master 16000+ Real-world APIs git

Memory Optimization

Transformer cache key-value tensors of context tokens into GPU memory to facilitate fast generation of the next token. However, these caches occupy significant GPU memory. The unpredictable nature of cache size, due to the variability in the length of each request, exacerbates the issue, resulting in significant memory fragmentation in the absence of a suitable memory management mechanism.
To alleviate this issue, PagedAttention was proposed to store the KV cache in non-contiguous memory spaces. It partitions the KV cache of each sequence into multiple blocks, with each block containing the keys and values for a fixed number of tokens.
PagedAttention : vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention, 24x Faster LLM Inference doc. paper: in prep

PagedAttention for a prompt “the cat is sleeping in the kitchen and the dog is”. Key-Value pairs of tensors for attention computation are stored in virtual contiguous blocks mapped to non-contiguous blocks in the GPU memory.
TokenAttention an attention mechanism that manages key and value caching at the token level. git
Flash Attention & FlashAttention-2: An method that reorders the attention computation and leverages classical techniques (tiling, recomputation). Instead of storing each intermediate result, use kernel fusion and run every operation in a single kernel in order to avoid memory read/write overhead. git -> Compared to a standard attention implementation in PyTorch, FlashAttention-2 can be up to 9x faster

Language Modeling Is

Language Modeling Is Compression: Lossless data compression, while trained primarily on text, compresses ImageNet patches to 43.4% and LibriSpeech samples to 16.4% of their raw size, beating domain-specific compressors like PNG (58.5%) or FLAC (30.3%).

Section 7 : List of OSS LLM & Generative AI Landscape

LLM Evolutionary tree and LLaMA Family

Evolutionary Graph of LLaMA Family
LLM evolutionary tree
A Survey of Large Language Models paper git
LLM evolutionary tree: A curated list of practical guide resources of LLMs (LLMs Tree, Examples, Papers) git

Generative AI Revolution: Exploring the Current Landscape

The Generative AI Revolution: Exploring the Current Landscape : Printed version for backup doc

A Taxonomy of Natural Language Processing

An overview of different fields of study and recent developments in NLP. Printed version for backup doc ref

“Exploring the Landscape of Natural Language Processing Research” ref

NLP taxonomy

Distribution of the number of papers by most popular fields of study from 2002 to 2022

OSS (Open-source) LLM

List of OSS LLM
Printed version for "Medium" limits. doc
LLM Collection: promptingguide.ai

Huggingface Open LLM Learboard

Huggingface Open LLM Learboard
Upstage's 70B Language Model Outperforms GPT-3.5: ref

Huggingface Transformer

huggingface/transformers: 🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX. (github.com)

LLM for Coding

Huggingface StarCoder: A State-of-the-Art LLM for Code
git: bigcode/starcoder
Code Llama: Built on top of Llama 2, free for research and commercial use. ref, git

Democratizing the magic of ChatGPT with open models

The LLMs mentioned here are just small parts of the current advancements in the field. Most OSS LLM models have been built on the facebookresearch/llama. For a comprehensive list and the latest updates, please refer to the "Generative AI Landscape / List of OSS LLM" section.
facebookresearch/llama: Commercial use
Llama 2: Available for commercial use ref / demo
Falcon LLM Apache 2.0 license
OSS LLM
- StableVicuna First Open Source RLHF LLM Chatbot
- Alpaca: Fine-tuned from the LLaMA 7B model
- gpt4all: Run locally on your CPU
- vicuna: 90% ChatGPT Quality
- Koala: Focus on dialogue data gathered from the web.
- dolly: Databricks
- Cerebras-GPT: 7 GPT models ranging from 111m to 13b parameters.
- GPT4All Download URL
- KoAlpaca: Alpaca for korean

Section 8 : References

Survey of LLMs papers

Picked out the list by [cited by count] and used [survey] as a search keyword. The papers on a specific topic are included even if few [cited by count].
A Survey of LLMs
- A Survey of Transformers:(cited by)
- A Survey of Large Language Models:(cited by)
- A Comprehensive Survey of AI-Generated Content (AIGC): A History of Generative AI from GAN to ChatGPT:(cited by)
- Summary of ChatGPT/GPT-4 Research and Perspective Towards the Future of Large Language Models:(cited by)
Application of LLMs
- Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond:(cited by)
Tuning & Learning
Vision & Trustworthy
Google AI Research Recap
etc.

picoGPT and lit-gpt

An unnecessarily tiny implementation of GPT-2 in NumPy. picoGPT: Transformer Decoder

q = x @ w_k # [n_seq, n_embd] @ [n_embd, n_embd] -> [n_seq, n_embd]
k = x @ w_q # [n_seq, n_embd] @ [n_embd, n_embd] -> [n_seq, n_embd]
v = x @ w_v # [n_seq, n_embd] @ [n_embd, n_embd] -> [n_seq, n_embd]

# In picoGPT, combine w_q, w_k and w_v into a single matrix w_fc
x = x @ w_fc # [n_seq, n_embd] @ [n_embd, 3*n_embd] -> [n_seq, 3*n_embd]

lit-gpt: Hackable implementation of state-of-the-art open-source LLMs based on nanoGPT. Supports flash attention, 4-bit and 8-bit quantization, LoRA and LLaMA-Adapter fine-tuning, pre-training. Apache 2.0-licensed. git

Agents: AutoGPT and Communicative Agents

AgentBench Evaluating LLMs as Agents: Assess LLM-as Agent’s reasoning and decision-making abilities.
Auto-GPT: Most popular
babyagi: Most simplest implementation - Coworking of 4 agents
microsoft/JARVIS
SuperAGI: GUI for agent settings
lightaime/camel: 🐫 CAMEL: Communicative Agents for “Mind” Exploration of Large Scale Language Model Society (github.com)
1:1 Conversation between two ai agents Camel Agents - a Hugging Face Space by camel-ai Hugging Face (camel-agents)
Microsoft Autogen: Customizable and conversable agents framework ref

Large Language and Vision Assistant

CLIP: CLIP (Contrastive Language-Image Pretraining), Predict the most relevant text snippet given an image. git
LLaVa: Large Language-and-Vision Assistant git
MiniGPT-4: Enhancing Vision-language Understanding with Advanced Large Language Models git
TaskMatrix, aka VisualChatGPT: Microsoft TaskMatrix git; GroundingDINO + SAM git
BLIP-2: Salesforce Research, Querying Transformer (Q-Former) / git / ref / Youtube / BLIP: git
- Q-Former (Querying Transformer): A transformer model that consists of two submodules that share the same self-attention layers: an image transformer that interacts with a frozen image encoder for visual feature extraction, and a text transformer that can function as both a text encoder and a text decoder.
- Q-Former is a lightweight transformer which employs a set of learnable query vectors to extract visual features from the frozen image encoder. It acts as an information bottleneck between the frozen image encoder and the frozen LLM.
Vision capability to a LLM ref
Cross-attention ref
- The model has three sub-models:
  1. A model to obtain image embeddings
  2. A text model to obtain text embeddings
  3. A model to learn the relationships between them
This is analogous to adding vision capability to a LLM.

MLLM (multimodal large language model)

Facebook
1. facebookresearch/ImageBind: ImageBind One Embedding Space to Bind Them All git
2. facebookresearch/segment-anything(SAM): The repository provides code for running inference with the SegmentAnything Model (SAM), links for downloading the trained model checkpoints, and example notebooks that show how to use the model. git
3. facebookresearch/SeamlessM4T SeamlessM4T is the first all-in-one multilingual multimodal AI translation and transcription model. This single model can perform speech-to-text, speech-to-speech, text-to-speech, and text-to-text translations for up to 100 languages depending on the task. ref
Microsoft: Kosmos
1. Language Is Not All You Need: Aligning Perception with Language Models Kosmos-1
2. Kosmos-2: Grounding Multimodal Large Language Models to the World
3. Kosmos-2.5: A Multimodal Literate Model
Microsoft: BEiT-3
- BEiT-3: Image as a Foreign Language: BEiT Pretraining for Vision and Vision-Language Tasks
TaskMatrix.AI
- TaskMatrix.AI: Completing Tasks by Connecting Foundation Models with Millions of APIs
Benchmarking Multimodal LLMs
- SEED-Bench: Benchmarking Multimodal LLMs git

ChatGPT for Robotics

PromptCraft-Robotics: Robotics and a robot simulator with ChatGPT integration git

Application and UI/UX

Gradio: Build Machine Learning Web Apps - in Python
Text generation web UI: Text generation web UI
Very Simple Langchain example using Open AI: langchain-ask-pdf
An open source implementation of OpenAI's ChatGPT Code interpreter: gpt-code-ui
Open AI Chat Mockup: An open source ChatGPT UI. mckaywrigley/chatbot-ui
Streaming with Azure OpenAI SSE
BIG-AGI FKA nextjs-chatgpt-app
Embedding does not use Open AI. Can be executed locally: pdfGPT
Tiktoken Alternative in C#: microsoft/Tokenizer: .NET and Typescript implementation of BPE tokenizer for OpenAI LLMs. (github.com)
Azure OpenAI Proxy: OpenAI API requests converting into Azure OpenAI API requests
Opencopilot: Build and embed open-source AI Copilots into your product with ease.
TaxyAI/browser-extension: Browser Automation by Chrome debugger API and Prompt > src/helpers/determineNextAction.ts
Spring AI: Developing AI applications for Java.
RAG capabilities of LlamaIndex to QA about SEC 10-K & 10-Q documents: A real world full-stack application using LlamaIndex

Data Extraction methods

Math problem-solving skill

Plugin: Wolfram alpha
Improving mathematical reasoning with process supervision
Math formula OCR: MathPix, OSS LaTeX-OCR
Math soving optimized LLM WizardMath : Developed by adapting Evol-Instruct and Reinforcement Learning techniques, these models excel in math-related instructions like GSM8k and MATH. git
Nougat: Neural Optical Understanding for Academic Documents: The academic document PDF parser that understands LaTeX math and tables. git

Extract data from Tables

Azure Form Recognizer: ref
Table to Markdown format: Table to Markdown

Awesome demo

FRVR Official Teaser: Prompt to Game: AI-powered end-to-end game creation
rewind.ai: Rewind captures everything you’ve seen on your Mac and iPhone

日本語（Japanese Materials）

LLM研究プロジェクト: ブログ記事一覧
ブレインパッド社員が投稿したQiita記事まとめ: ブレインパッド社員が投稿したQiita記事まとめ
rinna: rinnaの36億パラメータの日本語GPT言語モデル: 3.6 billion parameter Japanese GPT language model
rinna: bilingual-gpt-neox-4b: 日英バイリンガル大規模言語モデル
法律:生成AIの利用ガイドライン: Legal: Guidelines for the Use of Generative AI
New Era of Computing - ChatGPT がもたらした新時代
大規模言語モデルで変わるMLシステム開発: ML system development that changes with large-scale language models
GPT-4登場以降に出てきたChatGPT/LLMに関する論文や技術の振り返り: Review of ChatGPT/LLM papers and technologies that have emerged since the advent of GPT-4
LLMを制御するには何をするべきか？: How to control LLM
生成AIのマルチモーダルモデルでできること -タスク紹介編-: What can be done with multimodal models of generative AI
LLMの推論を効率化する量子化技術調査: Survey of quantization techniques to improve efficiency of LLM reasoning
LLMの出力制御や新モデルについて: About LLM output control and new models
Azure OpenAIを活用したアプリケーション実装のリファレンス: 日本マイクロソフトリファレンスアーキテクチャ
生成AI・LLMのツール拡張に関する論文の動向調査: Survey of trends in papers on tool extensions for generative AI and LLM
LLMの学習・推論の効率化・高速化に関する技術調査: Technical survey on improving the efficiency and speed of LLM learning and inference

Other materials

Attention Is All You Need: The Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Illustrated transformer
Must read: the 100 most cited AI papers in 2022 : doc
The Best Machine Learning Resources : doc
What are the most influential current AI Papers?: NLLG Quarterly arXiv Report 06/23 git
OpenAI Cookbook Examples and guides for using the OpenAI API
gpt4free for educational purposes only
Comparing Adobe Firefly, Dalle-2, OpenJourney, Stable Diffusion, and Midjourney: Generative AI for images
Open Problem and Limitation of RLHF: Provides an overview of open problems and the limitations of RLHF
Ai Fire: AI Fire Learning resources doc
IbrahimSobh/llms: Language models introduction with simple code.

Section 9 : Relevant solutions

Microsoft Fabric: Fabric integrates technologies like Azure Data Factory, Azure Synapse Analytics, and Power BI into a single unified product
Microsoft Office Copilot: Natural Language Commanding via Program Synthesis: Semantic Interpreter, a natural language-friendly AI system for productivity software such as Microsoft Office that leverages large language models (LLMs) to execute user intent across application features.
Weights & Biases: Visualizing and tracking your machine learning experiments wandb.ai doc: deeplearning.ai/wandb
activeloopai/deeplake: AI Vector Database for LLMs/LangChain. Doubles as a Data Lake for Deep Learning. Store, query, version, & visualize any data. Stream data in real-time to PyTorch/TensorFlow. ref
mosaicml/llm-foundry: LLM training code for MosaicML foundation models
Generate 3D objects conditioned on text or images openai/shap-e
Drag Your GAN: Interactive Point-based Manipulation on the Generative Image Manifold git
string2string: The library is an open-source tool that offers a comprehensive suite of efficient algorithms for a broad range of string-to-string problems. string2string
Sentence Transformers: Python framework for state-of-the-art sentence, text and image embeddings. Useful for semantic textual similar, semantic search, or paraphrase mining. git

Section 10 : AI Tools

The leader: http://openai.com
The runner-up: http://bard.google.com
Open source: http://huggingface.co/chat
Searching web: http://perplexity.ai
Content writing: http://jasper.ai/chat
Sales and Marketing: http://chatspot.ai / cite
Oceans of AI - All AI Tools https://play.google.com/store/apps/details?id=in.blueplanetapps.oceansofai&hl=en_US
Newsletters & Tool Databas: https://www.therundown.ai/
allAIstartups: https://www.allaistartups.com/ai-tools
Future Tools: https://www.futuretools.io/
Edge and Chrome Extension & Plugin
- MaxAI.me
- BetterChatGPT
- ChatHub All-in-one chatbot client Webpage
- ChatGPT Retrieval Plugin
Vercel AI Vercel AI Playground / Vercel AI SDK git
Quora Poe A chatbot service that gives access to GPT-4, gpt-3.5-turbo, Claude from Anthropic, and a variety of other bots.

Section 11 : Datasets for LLM Training

LLM-generated datasets:
1. Self-Instruct: Seed task pool with a set of human-written instructions.
2. Self-Alignment with Instruction Backtranslation: Without human seeding, use LLM to produce instruction-response pairs. The process involves two steps: self-augmentation and self-curation.
LLMDataHub: Awesome Datasets for LLM Training: A quick guide (especially) for trending instruction finetuning datasets
Open LLMs and Datasets: A list of open LLMs available for commercial use.
SQuAD: The Stanford Question Answering Dataset (SQuAD), a set of Wikipedia articles, 100,000+ question-answer pairs on 500+ articles.

RedPajama: LLaMA training dataset of over 1.2 trillion tokens git: Pretrain for a base model

{
    "text": ...,
    "meta": {"url": "...", "timestamp": "...", "source": "...", "language": "...", ...},
    "red_pajama_subset": "common_crawl" | "c4" | "github" | "books" | "arxiv" | "wikipedia" | "stackexchange"
}

databricks-dolly-15k: Instruction-Tuned git: SFT training - QA pairs or Dialog

{
  "prompt": "What is the capital of France?",
  "response": "The capital of France is Paris."
},
{
    "prompt": "Can you give me a recipe for chocolate chip cookies?",
    "response": "Sure! ..."
}

Anthropic human-feedback: RLHF training - Chosen and Rejected pairs

{
  "chosen": "I'm sorry to hear that. Is there anything I can do to help?", 
  "rejected": "That's too bad. You should just get over it."
}

大規模言語モデルのデータセットまとめ: 大規模言語モデルのデータセットまとめ

Dataset example

cite: https://docs.argilla.io/

SFT Dataset

category	instruction	context	response
0	open_qa	How do I get rid of mosquitos in my house?	You can get rid of mosquitos in your house by ...
1	classification	Classify each country as "African" or "European"	Nigeria: African Rwanda: African Portugal: European
2	information_extraction	Extract the unique names of composers from the text.	To some extent, European and the US traditions...
3	general_qa	Should investors time the market?	Timing the market is based on predictions of t...

RLHF Dataset

instruction	chosen_response	rejected_response
What is Depreciation	Depreciation is the drop in value of an asset ...	What is Depreciation – 10 Important Facts to K...
What do you know about the city of Aberdeen in Scotland?	Aberdeen is a city located in the North East of Scotland. It is known for its granite architecture and its offshore oil industry.	As an AI language model, I don't have personal knowledge or experiences about Aberdeen.
Describe thunderstorm season in the United States and Canada.	Thunderstorm season in the United States and Canada typically occurs during the spring and summer months, when warm, moist air collides with cooler, drier air, creating the conditions for thunderstorms to form.	Describe thunderstorm season in the United States and Canada.

Section 12 : LLM Evaluation & LLMOps

Evaluation
- Evaluation of Large Language Models: A Survey on Evaluation of Large Language Models
- promptfoo: Test your prompts. Evaluate and compare LLM outputs, catch regressions, and improve prompt quality.
- PromptTools: Open-source tools for prompt testing git
- OpenAI Evals: git
- TruLens-Eval: Instrumentation and evaluation tools for large language model (LLM) based applications. git
LLMOps: Large Language Model Operations
- Pezzo: Open-source, developer-first LLMOps platform git
- Azure Machine Learning studio Model Data Collector: Collect production data, analyze key safety and quality evaluation metrics on a recurring basis, receive timely alerts about critical issues, and visualize the results. ref

Name		Name	Last commit message	Last commit date
Latest commit History 94 Commits
code		code
deeplearning.ai		deeplearning.ai
files		files
infra		infra
.gitignore		.gitignore
README.md		README.md

anil2799/azure-openai-llm-vector-langchain

Folders and files

Latest commit

History

Repository files navigation