Pulling Models in Cortex

Cortex provides a streamlined way to pull (download) machine learning models from Hugging Face and other third-party sources, as well as import models from local storage. This functionality allows users to easily access a variety of pre-trained models to enhance their applications.

Features

Model Retrieval: Download models directly from Hugging Face or third-party repositories.
Local Import: Import models stored on your local machine.
User-Friendly Interface: Access models through a Command Line Interface (CLI) or an HTTP API.
Model Selection: Choose your desired model from a provided selection menu in the CLI.

Usage

Pulling Models via CLI

Open the CLI: Launch the Cortex CLI on your terminal.
Select Model: Use the selection menu to browse available models.
- Enter the corresponding number for your desired model quant.
Provide Repository Handle: Input the repository handle (e.g., username/repo_name for Hugging Face) when prompted.
Download Model: Cortex will handle the download process automatically.

For pulling models from Cortex model registry, simply type cortex pull <model_name> to your terminal.


cortex pull tinyllama
Downloaded models:
    tinyllama:1b-gguf
Available to download:
    1. tinyllama:1b-gguf-q2-k
    2. tinyllama:1b-gguf-q3-kl
    3. tinyllama:1b-gguf-q3-km
    4. tinyllama:1b-gguf-q3-ks
    5. tinyllama:1b-gguf-q4-km
    6. tinyllama:1b-gguf-q4-ks
    7. tinyllama:1b-gguf-q5-km
    8. tinyllama:1b-gguf-q5-ks
    9. tinyllama:1b-gguf-q6-k
    10. tinyllama:1b-gguf-q8-0
    11. tinyllama:gguf
Select a model (1-11):

Pulling models with repository handle

When user want to pull a model which is not ready in Cortex model registry, user can provide the repository handle to Cortex.

For example, we can pull model from QuantFactory-FinanceLlama3 by enter to terminal cortex pull QuantFactory/finance-Llama3-8B-GGUF.


cortex pull QuantFactory/finance-Llama3-8B-GGUF
Select an option
    1. finance-Llama3-8B.Q2_K.gguf
    2. finance-Llama3-8B.Q3_K_L.gguf
    3. finance-Llama3-8B.Q3_K_M.gguf
    4. finance-Llama3-8B.Q3_K_S.gguf
    5. finance-Llama3-8B.Q4_0.gguf
    6. finance-Llama3-8B.Q4_1.gguf
    7. finance-Llama3-8B.Q4_K_M.gguf
    8. finance-Llama3-8B.Q4_K_S.gguf
    9. finance-Llama3-8B.Q5_0.gguf
    10. finance-Llama3-8B.Q5_1.gguf
    11. finance-Llama3-8B.Q5_K_M.gguf
    12. finance-Llama3-8B.Q5_K_S.gguf
    13. finance-Llama3-8B.Q6_K.gguf
    14. finance-Llama3-8B.Q8_0.gguf
Select an option (1-14):

Pulling models with direct url

Clients can pull models directly using a URL. This allows for the direct download of models from a specified location without additional configuration.


cortex pull https://huggingface.co/QuantFactory/OpenMath2-Llama3.1-8B-GGUF/blob/main/OpenMath2-Llama3.1-8B.Q4_0.gguf
Validating download items, please wait..
Start downloading..
QuantFactory:OpenMat 0%[==================================================] [00m:00s] 3.98 MB/0.00 B

Pulling Models via HTTP API

To pull a model using the HTTP API, make a POST request to the following endpoint:


curl --request POST \
  --url http://localhost:39281/v1/models/pull \
  --header 'Content-Type: application/json' \
  --data '{
  "model": "tinyllama:gguf"
}'

Notes

Ensure that you have an active internet connection for pulling models from external repositories.
For local model imports, specify the path to the model in your CLI command or API request.

Observing download progress

Unlike the CLI, where users can observe the download progress directly in the terminal, the HTTP API must be asynchronous. Therefore, clients can monitor the download progress by listening to the Event WebSocket API at ws://127.0.0.1:39281/events.

Download started event

DownloadStarted event will be emitted when the download starts. It will contain the DownloadTask object. Each DownloadTask will have an unique id, along with a type of downloading (e.g. Model, Engine, etc.).
DownloadTask's id will be required when client wants to stop a downloading task.


{
  "task": {
    "id": "tinyllama:1b-gguf-q2-k",
    "items": [
      {
        "bytes": 0,
        "checksum": "N/A",
        "downloadUrl": "https://huggingface.co/cortexso/tinyllama/resolve/1b-gguf-q2-k/metadata.yml",
        "downloadedBytes": 0,
        "id": "metadata.yml",
        "localPath": "/Users/user_name/cortexcpp/models/cortex.so/tinyllama/1b-gguf-q2-k/metadata.yml"
      },
      {
        "bytes": 0,
        "checksum": "N/A",
        "downloadUrl": "https://huggingface.co/cortexso/tinyllama/resolve/1b-gguf-q2-k/model.gguf",
        "downloadedBytes": 0,
        "id": "model.gguf",
        "localPath": "/Users/user_name/cortexcpp/models/cortex.so/tinyllama/1b-gguf-q2-k/model.gguf"
      },
      {
        "bytes": 0,
        "checksum": "N/A",
        "downloadUrl": "https://huggingface.co/cortexso/tinyllama/resolve/1b-gguf-q2-k/model.yml",
        "downloadedBytes": 0,
        "id": "model.yml",
        "localPath": "/Users/user_name/cortexcpp/models/cortex.so/tinyllama/1b-gguf-q2-k/model.yml"
      }
    ],
    "type": "Model"
  },
  "type": "DownloadStarted"
}

Download updated event

DownloadUpdated event will be emitted when the download is in progress. It will contain the DownloadTask object. Each DownloadTask will have an unique id, along with a type of downloading (e.g. Model, Engine, etc.).
A DownloadTask will have a list of DownloadItems. Each DownloadItem will have the following properties:
- id: the id of the download item.
- bytes: the total size of the download item.
- downloadedBytes: the number of bytes that have been downloaded so far.
- checksum: the checksum of the download item.
Client can use the downloadedBytes and bytes properties to calculate the download progress.


{
  "task": {
    "id": "tinyllama:1b-gguf-q2-k",
    "items": [
      {
        "bytes": 58,
        "checksum": "N/A",
        "downloadUrl": "https://huggingface.co/cortexso/tinyllama/resolve/1b-gguf-q2-k/metadata.yml",
        "downloadedBytes": 58,
        "id": "metadata.yml",
        "localPath": "/Users/user_name/cortexcpp/models/cortex.so/tinyllama/1b-gguf-q2-k/metadata.yml"
      },
      {
        "bytes": 432131456,
        "checksum": "N/A",
        "downloadUrl": "https://huggingface.co/cortexso/tinyllama/resolve/1b-gguf-q2-k/model.gguf",
        "downloadedBytes": 235619714,
        "id": "model.gguf",
        "localPath": "/Users/user_name/cortexcpp/models/cortex.so/tinyllama/1b-gguf-q2-k/model.gguf"
      },
      {
        "bytes": 562,
        "checksum": "N/A",
        "downloadUrl": "https://huggingface.co/cortexso/tinyllama/resolve/1b-gguf-q2-k/model.yml",
        "downloadedBytes": 562,
        "id": "model.yml",
        "localPath": "/Users/user_name/cortexcpp/models/cortex.so/tinyllama/1b-gguf-q2-k/model.yml"
      }
    ],
    "type": "Model"
  },
  "type": "DownloadUpdated"
}

Download success event

The DownloadSuccess event indicates that all items in the download task have been successfully downloaded. This event provides details about the download task and its items, including their IDs, download URLs, local paths, and other properties. In this event, the bytes and downloadedBytes properties for each item are set to 0, signifying that the download is complete and no further bytes are pending.


{
  "task": {
    "id": "tinyllama:1b-gguf-q2-k",
    "items": [
      {
        "bytes": 0,
        "checksum": "N/A",
        "downloadUrl": "https://huggingface.co/cortexso/tinyllama/resolve/1b-gguf-q2-k/metadata.yml",
        "downloadedBytes": 0,
        "id": "metadata.yml",
        "localPath": "/Users/user_name/cortexcpp/models/cortex.so/tinyllama/1b-gguf-q2-k/metadata.yml"
      },
      {
        "bytes": 0,
        "checksum": "N/A",
        "downloadUrl": "https://huggingface.co/cortexso/tinyllama/resolve/1b-gguf-q2-k/model.gguf",
        "downloadedBytes": 0,
        "id": "model.gguf",
        "localPath": "/Users/user_name/cortexcpp/models/cortex.so/tinyllama/1b-gguf-q2-k/model.gguf"
      },
      {
        "bytes": 0,
        "checksum": "N/A",
        "downloadUrl": "https://huggingface.co/cortexso/tinyllama/resolve/1b-gguf-q2-k/model.yml",
        "downloadedBytes": 0,
        "id": "model.yml",
        "localPath": "/Users/user_name/cortexcpp/models/cortex.so/tinyllama/1b-gguf-q2-k/model.yml"
      }
    ],
    "type": "Model"
  },
  "type": "DownloadSuccess"
}

Importing local-models

When clients have models that are not inside the Cortex data folder and wish to run them inside Cortex, they can import local models using either the CLI or the HTTP API.

via CLI

Use the following command to import a local model using the CLI:


cortex models import --model_id my-tinyllama --model_path /Users/user_name/cortexcpp/models/cortex.so/tinyllama/1b-gguf/model.gguf

Response:


Successfully import model from  '/Users/user_name/cortexcpp/models/cortex.so/tinyllama/1b-gguf/model.gguf' for modeID 'my-tinyllama'.

via HTTP API

Use the following curl command to import a local model using the HTTP API:


curl --request POST \
  --url http://127.0.0.1:39281/v1/models/import \
  --header 'Content-Type: application/json' \
  --data '{
  "model": "model-id",
  "modelPath": "absolute/path/to/gguf",
  "name": "model display name"
}'

Aborting Download Task

Clients can abort a downloading task using the task ID. Below is a sample curl command to abort a download task:


curl --location --request DELETE 'http://127.0.0.1:3928/models/pull' \
--header 'Content-Type: application/json' \
--data '{
    "taskId": "tinyllama:1b-gguf-q2-k"
}'

An event with type DownloadStopped will be emitted when the task is successfully aborted.

Listing local-available models via CLI

You can list your ready-to-use models via CLI using cortex models list command.


cortex models list
+---------+-------------------+
| (Index) | ID                |
+---------+-------------------+
| 1       | tinyllama:1b-gguf |
+---------+-------------------+

For more options, use cortex models list --help command.


cortex models list -h
List all local models
Usage:
cortex models [options] [subcommand]
Positionals:
  filter TEXT                 Filter model id
Options:
  -h,--help                   Print this help message and exit
  -e,--engine                 Display engine
  -v,--version                Display version

Listing local-available models via HTTP API

This section describes how to list all models that are available locally on your system using the HTTP API. By making a GET request to the specified endpoint, you can retrieve a list of models along with their details, such as model ID, name, file paths, engine type, and version. This is useful for managing and verifying the models you have downloaded and are ready to use in your local environment.


curl --request GET \
 --url http://127.0.0.1:39281/v1/models

Response:


{
  "data": [
    {
      "model": "tinyllama:1b-gguf",
      "name": "tinyllama",
      "files": [
        "models/cortex.so/tinyllama/1b-gguf/model.gguf"
      ],
      "engine": "llama-cpp",
      "version": "1",
      # Omit some configuration parameters
    }
  ],
  "object": "list",
  "result": "OK"
}

With Cortex, pulling and managing models is simplified, allowing you to focus more on building your applications!

Features​

Usage​

Pulling Models via CLI​

Pulling models with repository handle​

Pulling models with direct url​

Pulling Models via HTTP API​

Notes​

Observing download progress​

Download started event​

Download updated event​

Download success event​

Importing local-models​

via CLI​

via HTTP API​

Aborting Download Task​

Listing local-available models via CLI​

Listing local-available models via HTTP API​