www.together.ai
34.249.200.254  Public Scan

Submitted URL: http://together.ai/
Effective URL: https://www.together.ai/
Submission: On May 02 via api from US — Scanned from DE

Form analysis 2 forms found in the DOM

Name: email-form-2 (method: GET)

<form id="email-form-2" name="email-form-2" data-name="Email Form 2" method="get" fs-cmsfilter-element="filters" data-wf-page-id="6589b69bcbb8c584607c03ff" data-wf-element-id="c46924a9-b883-a3d1-bd45-136dd5e25b44" aria-label="Email Form 2">
  <div class="filters-wrapper">
    <div class="filters-menu"><label checked="" fs-cmsfilter-element="clear" class="tab fs-cmsfilter_active w-radio"><input type="radio" data-name="filter" id="all" name="filter" class="w-form-formradioinput hide w-radio-input" value="all"><span
          class="text-size-navigation w-form-label" for="all">All</span></label></div>
    <div class="w-dyn-list">
      <div fs-cmsstatic-element="list" role="list" class="filters-menu w-dyn-items">
        <div role="listitem" class="filters-item w-dyn-item"><label class="tab w-radio"><input type="radio" data-name="filter" id="radio-2-0-0" name="filter" class="w-form-formradioinput hide w-radio-input" value="radio"><span
              fs-cmsfilter-field="type" class="text-size-navigation w-form-label" for="radio-2">Chat</span></label></div>
        <div role="listitem" class="filters-item w-dyn-item"><label class="tab w-radio"><input type="radio" data-name="filter" id="radio-2-1-1" name="filter" class="w-form-formradioinput hide w-radio-input" value="radio"><span
              fs-cmsfilter-field="type" class="text-size-navigation w-form-label" for="radio-2">Language</span></label></div>
        <div role="listitem" class="filters-item w-dyn-item"><label class="tab w-radio"><input type="radio" data-name="filter" id="radio-2-2-2" name="filter" class="w-form-formradioinput hide w-radio-input" value="radio"><span
              fs-cmsfilter-field="type" class="text-size-navigation w-form-label" for="radio-2">Embeddings</span></label></div>
        <div role="listitem" class="filters-item w-dyn-item"><label class="tab w-radio"><input type="radio" data-name="filter" id="radio-2-3-3" name="filter" class="w-form-formradioinput hide w-radio-input" value="radio"><span
              fs-cmsfilter-field="type" class="text-size-navigation w-form-label" for="radio-2">Image</span></label></div>
        <div role="listitem" class="filters-item w-dyn-item"><label class="tab w-radio"><input type="radio" data-name="filter" id="radio-2-4-4" name="filter" class="w-form-formradioinput hide w-radio-input" value="radio"><span
              fs-cmsfilter-field="type" class="text-size-navigation w-form-label" for="radio-2">Code</span></label></div>
      </div>
    </div>
  </div>
</form>

Name: email-form (method: GET)

<form id="email-form" name="email-form" data-name="Email Form" method="get" class="footer_form-inner" data-wf-page-id="6589b69bcbb8c584607c03ff" data-wf-element-id="d9d8a00a-8aed-ff21-807d-bc6c899f4abb" aria-label="Email Form"><input
    class="footer_form-field w-input" maxlength="256" name="email-2" data-name="Email 2" placeholder="your@email.com" type="email" id="email-2" required=""><input type="submit" data-wait="" class="footer_form-submit w-button" value=""></form>

Text Content




Together AI launches Meta Llama 3 with over 350 tokens per second



Products

Together Inference
Together Fine-tuning
Together Custom Models
Together GPU Clusters

Solutions

What we offer
Customer stories
Why open-source
Industries & use cases

Research
Blog

About

Values
Careers
Team

Pricing
Contact
Docs

Get Started







TOGETHER.WE BUILD

The fastest cloud platform
for building and running
generative AI.

Start building now
Docs


 * 01 TOGETHER INFERENCE
   
   The fastest inference stack available — just an API call away.
   
   
   


 * 02 TOGETHER FINE-TUNING
   
   Train your own generative AI model with your private data.
   
   
   


 * 03 TOGETHER GPU CLUSTERS
   
   Dedicated training clusters ready-to-go with the Together Training stack.
   
   
   


 * 04 TOGETHER CUSTOM MODELS
   
   Build state-of-the-art models from scratch with ease.
   
   
   




THE FASTEST CLOUD FOR GEN AI.


BUILT ON LEADING AI RESEARCH.


INNOVATIONS

Our research team is behind breakthrough AI models, datasets, and optimizations.

See all research


COCKTAIL SGD

With Cocktail SGD, we’ve addressed a key hindrance to training generative AI
models in a distributed environment: networking overhead. Cocktail SGD is a set
of optimizations that reduces network overhead by up to 117x.

Read more



FLASHATTENTION 2

FlashAttention is now broadly used across Transformer models. It accelerates
training and fine-tuning of LLMs by up to 9x and achieves an unprecedented 72%
model FLOPs utilization for training on NVIDIA A100s.

Read more



SUB-QUADRATIC MODEL ARCHITECTURES

In close collaboration with Hazy Research, we’re working on the next core
architecture for generative AI models that will provide even faster performance
with longer context. Our research published in this area includes Striped Hyena,
Monarch Mixer, and FlashConv.

Read more



REDPAJAMA

Our RedPajama project enables leading generative AI models to be available as
fully open-source. The RedPajama models have been downloaded millions of times,
and the RedPajama dataset has been used to create over 500 leading models.

Read more

100+ OPEN MODELS

All · Chat · Language · Embeddings · Image · Code
Try now
together.ai


Chat



Llama 3 is an auto-regressive language model that uses an optimized transformer
architecture. The tuned versions use supervised fine-tuning (SFT) and
reinforcement learning with human feedback (RLHF) to align with human
preferences for helpfulness and safety.


LLAMA-3

VIEW MORE



Chat



Arctic is a dense-MoE Hybrid transformer architecture pre-trained from scratch
by the Snowflake AI Research Team.


ARCTIC-INSTRUCT

VIEW MORE



DBRX Instruct is a mixture-of-experts (MoE) large language model trained from
scratch by Databricks. DBRX Instruct specializes in few-turn interactions.


DBRX-INSTRUCT

VIEW MORE



Chat



Llama 2-chat leverages publicly available instruction datasets and over 1
million human annotations. Available in three sizes: 7B, 13B and 70B parameters


LLAMA-2 CHAT

VIEW MORE



Chat



The Mixtral-8x7B Large Language Model (LLM) is a pretrained generative Sparse
Mixture of Experts.


MIXTRAL INSTRUCT

VIEW MORE



Chat



A hybrid architecture composed of multi-head, grouped-query attention and gated
convolutions arranged in Hyena blocks, different from traditional decoder-only
Transformers


STRIPED HYENA NOUS

VIEW MORE



Language



The Mixtral-8x22B Large Language Model (LLM) is a pretrained generative Sparse
Mixture of Experts.


MIXTRAL-8X22B

VIEW MORE



Chat



Trained from scratch on a vast dataset of 2 trillion tokens in both English and
Chinese.


DEEPSEEK-67B

VIEW MORE



Chat



Instruct fine-tuned version of Mistral-7B-v0.1.


MISTRAL INSTRUCT

VIEW MORE



Language



The Yi series models are large language models trained from scratch by
developers at 01.AI


01-AI YI

VIEW MORE



Embeddings



A universal English sentence embedding model by WhereIsAI. Its embedding
dimension is 1024, and it supports a context length of up to 512 tokens.


UAE-LARGE V1

VIEW MORE



Image



A text-to-image generative AI model that excels at creating 1024x1024 images.


STABLE DIFFUSION XL 1.0

VIEW MORE



Embeddings



An 80M-parameter checkpoint of M2-BERT, pretrained with sequence length 32768
and fine-tuned for long-context retrieval.


M2-BERT 80M 32K RETRIEVAL

VIEW MORE



Chat



Code Llama is a family of large language models for code based on Llama 2
providing infilling capabilities, support for large input contexts, and
zero-shot instruction following ability for programming tasks.


CODE LLAMA INSTRUCT

VIEW MORE



Language



Designed for few-shot prompts, fine-tuned over the RedPajama-INCITE-Base-7B-v1
base model.


REDPAJAMA-INCITE INSTRUCT

VIEW MORE



Chat



Vicuna is a chat assistant trained by fine-tuning Llama 2 on user-shared
conversations collected from ShareGPT.


VICUNA V1.5 16K

VIEW MORE



Chat



The first Nous collection of datasets and models, made by fine-tuning mostly on
data created in-house by Nous.


NOUS CAPYBARA

VIEW MORE



Language



NexusRaven is an open-source and commercially viable function calling LLM that
surpasses the state-of-the-art in function calling capabilities.


NEXUS RAVEN

VIEW MORE



Chat



Nous-Hermes-Llama2-13b is a state-of-the-art language model fine-tuned on over
300,000 instructions.


NOUS HERMES LLAMA-2

VIEW MORE



Chat



This model achieves a substantial and comprehensive improvement on coding,
mathematical reasoning and open-domain conversation capacities


WIZARD LM

VIEW MORE



Code



This model empowers Code LLMs with complex instruction fine-tuning, by adapting
the Evol-Instruct method to the domain of code.


WIZARDCODER PYTHON V1.0

VIEW MORE



Embeddings



An 80M-parameter checkpoint of M2-BERT, pretrained with sequence length 8192
and fine-tuned for long-context retrieval.


M2-BERT 80M 8K RETRIEVAL

VIEW MORE



Embeddings



An 80M-parameter checkpoint of M2-BERT, pretrained with sequence length 2048
and fine-tuned for long-context retrieval.


M2-BERT 80M 2K RETRIEVAL

VIEW MORE



Embeddings



BAAI general embedding model (large, English, v1.5). FlagEmbedding can map any
text to a low-dimensional dense vector that can be used for tasks like
retrieval, classification, clustering, or semantic search. It can also be used
in vector databases for LLMs.


BGE-LARGE-EN V1.5

VIEW MORE



Chat



Extending LLaMA-2 to 32K context, built with Meta's Position Interpolation and
Together AI's data recipe and system optimizations, instruction tuned by
Together


LLAMA-2-7B-32K-INSTRUCT

VIEW MORE



Embeddings



BAAI general embedding model (base, English, v1.5). FlagEmbedding can map any
text to a low-dimensional dense vector that can be used for tasks like
retrieval, classification, clustering, or semantic search. It can also be used
in vector databases for LLMs.


BGE-BASE-EN V1.5

VIEW MORE



Language



A 7.3B-parameter model that outperforms Llama 2 13B on all benchmarks and
approaches CodeLlama 7B performance on code. It uses grouped-query attention
(GQA) for faster inference and sliding window attention (SWA) to handle longer
sequences at smaller cost.


MISTRAL

VIEW MORE



Code



Code Llama is a family of large language models for code based on Llama 2
providing infilling capabilities, support for large input contexts, and
zero-shot instruction following ability for programming tasks.


CODE LLAMA PYTHON

VIEW MORE



Image



Latent text-to-image diffusion model capable of generating photo-realistic
images given any text input.


STABLE DIFFUSION 2.1

VIEW MORE



Chat



Vicuna is a chat assistant trained by fine-tuning Llama 2 on user-shared
conversations collected from ShareGPT.


VICUNA V1.5

VIEW MORE



Language



Phind-CodeLlama-34B-v1 further trained on an additional 1.5B tokens of
high-quality programming-related data; proficient in Python, C/C++, TypeScript,
Java, and more.


PHIND CODE LLAMA V2

VIEW MORE



Image



Fine-tuned version of Stable Diffusion focused on photorealism.


REALISTIC VISION 3.0

VIEW MORE



Language



Language model trained on 2 trillion tokens with double the context length of
Llama 1. Available in three sizes: 7B, 13B and 70B parameters


LLAMA-2

VIEW MORE



Code



Code Llama is a family of large language models for code based on Llama 2
providing infilling capabilities, support for large input contexts, and
zero-shot instruction following ability for programming tasks.


CODE LLAMA

VIEW MORE



Language



Extending LLaMA-2 to 32K context, built with Meta's Position Interpolation and
Together AI's data recipe and system optimizations.


LLAMA-2-32K

VIEW MORE



Chat



This model is a 75/25 merge of the Chronos (13B) and Nous Hermes (13B) models,
giving it a strong ability to produce evocative storywriting and follow a
narrative.


CHRONOS HERMES

VIEW MORE



Chat



An instruction fine-tuned LLaMA-2 (70B) model by merging Platypus2 (70B) by
garage-bAInd and LLaMA-2 Instruct v2 (70B) by upstage.


PLATYPUS2 INSTRUCT

VIEW MORE



Language



This model achieves a substantial and comprehensive improvement on coding,
mathematical reasoning and open-domain conversation capacities.


WIZARDLM V1.0 (70B)

VIEW MORE



Chat



A merge of MythoLogic-L2 and Huginn using a highly experimental tensor-type
merge technique. The main difference from MythoMix is that more of Huginn is
allowed to intermingle with the single tensors located at the front and end of
the model.


MYTHOMAX-L2

VIEW MORE



Chat



7B-parameter version of the large language model series Qwen (abbr. Tongyi
Qianwen), proposed by Alibaba Cloud. Qwen-7B-Chat is a large-model-based AI
assistant trained with alignment techniques.


QWEN-CHAT

VIEW MORE



Language



7B-parameter version of the large language model series Qwen (abbr. Tongyi
Qianwen), proposed by Alibaba Cloud. Qwen-7B is a Transformer-based large
language model pretrained on a large volume of data, including web texts,
books, code, etc.


QWEN

VIEW MORE



Chat



Chat model fine-tuned using data from Dolly 2.0 and Open Assistant over the
RedPajama-INCITE-Base-7B-v1 base model.


REDPAJAMA-INCITE CHAT

VIEW MORE



Language



Base model that aims to replicate the LLaMA recipe as closely as possible (blog
post).


REDPAJAMA-INCITE

VIEW MORE



Embeddings



Pretrained model on English language using a masked language modeling (MLM)
objective. The embedding dimension is 768, and the number of model parameters is
110M. This model is uncased: it does not make a difference between english and
English.


BERT

VIEW MORE



Embeddings



This is a sentence-transformers model: It maps sentences & paragraphs to a 768
dimensional dense vector space and was designed for semantic search. It has been
trained on 500K (query, answer) pairs from the MS MARCO dataset. Its embedding
dimension is 768 with 512 max context length.


SENTENCE-BERT

VIEW MORE
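Since sentence embeddings like these are compared by vector similarity, here is a
minimal sketch of how a query is matched against candidate passages using cosine
similarity. The 768-dimensional vectors below are random stand-ins for
illustration; in practice they would come from the embedding model itself.

# Minimal sketch of semantic search over dense sentence embeddings using
# cosine similarity. The vectors are random stand-ins, not real embeddings.
import numpy as np

rng = np.random.default_rng(0)
query_vec = rng.normal(size=768)          # embedding of the query
passage_vecs = rng.normal(size=(3, 768))  # embeddings of candidate passages

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

scores = [cosine_similarity(query_vec, p) for p in passage_vecs]
best = int(np.argmax(scores))
print(f"Best match: passage {best} (score {scores[best]:.3f})")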



Language



This model can be used to moderate other chatbot models. Built by fine-tuning
the GPT-JT model on Ontocord.ai's OIG-moderation dataset v0.1.


GPT-JT-MODERATION

VIEW MORE



Language



Fork of GPT-J instruction tuned to excel at few-shot prompts (blog post).


GPT-JT

VIEW MORE



Chat



Chat model fine-tuned from EleutherAI’s GPT-NeoX with over 40 million
instructions on carbon-reduced compute.


GPT-NEOXT-CHAT-BASE

VIEW MORE



Language



An auto-regressive language model, based on the transformer architecture. The
model comes in different sizes: 7B, 13B, 33B and 65B parameters.


LLAMA

VIEW MORE



Chat



Falcon-40B-Instruct is a causal decoder-only model built by TII based on
Falcon-40B and finetuned on a mixture of Baize.


FALCON INSTRUCT

VIEW MORE



Language



Falcon-40B is a causal decoder-only model built by TII and trained on 1,000B
tokens of RefinedWeb enhanced with curated corpora.


FALCON

VIEW MORE



Image



An open-source Stable Diffusion model fine-tuned on Midjourney images.


OPENJOURNEY V4

VIEW MORE



Chat



Chat model based on EleutherAI’s Pythia-7B, fine-tuned with data focusing on
dialog-style interactions.


PYTHIA-CHAT-BASE

VIEW MORE



Image



Dreambooth model trained on a diverse set of analog photographs to provide an
analog film effect.


ANALOG DIFFUSION

VIEW MORE





TOGETHER INFERENCE

The fastest inference on the market [1]. Start with an easy-to-use API and
immediately benefit from the fastest backend optimizations anywhere.
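As a rough illustration of what such an API call can look like, here is a minimal
Python sketch. It assumes an OpenAI-compatible chat completions endpoint at
api.together.xyz and an API key in the TOGETHER_API_KEY environment variable; the
endpoint URL, model name, and payload fields are assumptions, not details taken
from this page.

# Minimal sketch of a chat completion request (endpoint, model name, and
# payload fields are assumptions modeled on common OpenAI-compatible APIs).
import os
import requests

API_URL = "https://api.together.xyz/v1/chat/completions"  # assumed endpoint
headers = {"Authorization": f"Bearer {os.environ['TOGETHER_API_KEY']}"}

payload = {
    "model": "meta-llama/Llama-3-70b-chat-hf",  # illustrative model name
    "messages": [
        {"role": "user", "content": "Summarize FlashAttention-2 in two sentences."}
    ],
    "max_tokens": 256,
}

resp = requests.post(API_URL, headers=headers, json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])

The input/output pairs that follow show the kinds of prompts and responses such
an API handles.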




SPEED RELATIVE
TO TGI, VLLM OR
OTHER INFERENCE SERVICES


3x faster [2]


LLAMA-2 70B


117
TOKENS / SEC [3]


COST RELATIVE TO GPT-3.5-TURBO


6x lower cost [4]
Input

Summarize the following article: "Transformers are the workhorse architecture
behind modern foundation models. Our team at Together AI has released
optimizations like FlashAttention-2 to help further scale their capabilities.
However, the computational requirements for Transformers increase quadratically
with sequence length, which limits the extent of knowledge these models can
capture from available datasets. We are interested in new, sub-quadratic
approaches to increase scale and eventually produce models that learn from data
more comprehensively. Today, in partnership with Hazy Research, we’re releasing a
first look at Monarch Mixer—an exploration into an alternative, more efficient
model architecture. Monarch Mixer (M2) is sub-quadratic in both sequence length
and model dimension, enabling it to scale more efficiently while still matching
quality. Our first target for M2 is BERT, the most popular model in use today
for language tasks. M2-BERT is 25% more parameter-efficient than BERT and trains
up to 3x faster. This is just the beginning of research into Monarch Mixer
models, and we are excited to share the results. You can access the first set of
checkpoints at a standard sequence length today, and expect long-sequence BERT
models to be coming soon! "



Output

The article discusses the limitations of Transformers, which are the workhorse
architecture behind modern foundation models. The authors introduce Monarch
Mixer, a new exploration into an alternative, more efficient model architecture
that is sub-quadratic in both sequence length and model dimension. The article
highlights the benefits of Monarch Mixer, including its ability to scale more
efficiently and match the quality of Transformers. The authors also provide a
first look at Monarch Mixer-BERT, which is 25% more parameter-efficient than
BERT and trains up to 3x faster. They encourage readers to access the first set
of checkpoints and expect long-sequence BERT models to be coming soon.


Input

You are an expert programmer that writes simple, concise code and explanations.
Write a python function that accepts a long input string, extracts words in
between `<span>` and `<\span>` and returns a list.



Output

Here is a Python function that accepts a long input string, extracts words in
between <span> and <\span> and returns a list:

import re

def extract_words(input_string):
    # Capture the text between <span> and </span> tags, non-greedily.
    pattern = r"<span>(.*?)<\/span>"
    matches = re.findall(pattern, input_string)
    return matches
‍
This function uses the re module to perform a regular expression search on the
input string. The regular expression pattern <span>(.*?)<\/span> matches any
text that is enclosed in <span> and <\span> tags. The findall function returns
a list of all the matches found in the input string. Here is an example of how
you can use this function:

input_string = "<span>Hello</span><span>World</span>"
words = extract_words(input_string)
print(words) # Output: ['Hello', 'World']
‍
This code will extract the words "Hello" and "World" from the input string and
return them in a list.


Input

<s>[INST] <<SYS>>Your task is to extract top 10 keywords from a provided
text.<</SYS>>\n\nScaling Transformers to longer sequence lengths has been a
major problem in the last several years, promising to improve performance in
language modeling and high-resolution image understanding, as well as to unlock
new applications in code, audio, and video generation. The attention layer is
the main bottleneck in scaling to longer sequences, as its runtime and memory
increase quadratically in the sequence length. FlashAttention exploits the
asymmetric GPU memory hierarchy to bring significant memory saving (linear
instead of quadratic) and runtime speedup (2-4× compared to optimized
baselines), with no approximation. However, FlashAttention is still not nearly
as fast as optimized matrix-multiply (GEMM) operations, reaching only 25-40\% of
the theoretical maximum FLOPs/s. We observe that the inefficiency is due to
suboptimal work partitioning between different thread blocks and warps on the
GPU, causing either low-occupancy or unnecessary shared memory reads/writes. We
propose FlashAttention-2, with better work partitioning to address these issues.
In particular, we (1) tweak the algorithm to reduce the number of non-matmul
FLOPs (2) parallelize the attention computation, even for a single head, across
different thread blocks to increase occupancy, and (3) within each thread block,
distribute the work between warps to reduce communication through shared memory.
These yield around 2× speedup compared to FlashAttention, reaching 50-73\% of
the theoretical maximum FLOPs/s on A100 and getting close to the efficiency of
GEMM operations. We empirically validate that when used end-to-end to train
GPT-style models, FlashAttention-2 reaches training speed of up to 225 TFLOPs/s
per A100 GPU (72\% model FLOPs utilization).[/INST]



Output

1. Scaling
2. Transformers
3. Longer sequence lengths
4. Language modeling
5. High-resolution image understanding
6. Code generation
7. Audio generation
8. Video generation
9. Attention layer
10. Memory saving


Input

<s>[INST] <<SYS>>You are a helpful, polite customer service agent for Awesome
Cheeses Inc named Sally. If the customer mentions anything about food, tell them
you have a 20% off deal on gouda cheese. <</SYS>>\n\nhi who am i speaking
to?[/INST]Hello! this is Sally from Awesome Cheeses. How is your day
going?</s><s>[INST]im hungry for something cheesy[/INST]



Output

Well, you've come to the right place! I'm happy to help you with that. We're
running a special promotion right now - you can get 20% off all gouda cheese
products. Would you like me to tell you a bit more about our gouda selection? We
have a variety of delicious flavors to choose from, such as Smoked Gouda, Goat
Gouda, and Garlic Gouda. Which one sounds the most appealing to you?


Input

Label the sentences as either "positive", "negative", "mixed", or "neutral":

Sentence: I can say that there isn't anything I would change.

Sentence: I'm not sure about this.

Sentence: I liked some parts but I didn't like other parts.

Sentence: I think the background image could have been better.

Sentence: I really like it.



Output

Sentence: I can say that there isn't anything I would change.
Label: positive

Sentence: I'm not sure about this.
Label: neutral

Sentence: I liked some parts but I didn't like other parts.
Label: mixed

Sentence: I think the background image could have been better.
Label: negative

Sentence: I really like it.
Label: positive


 * 1. Document Summarization
   
   Turn lengthy documents into concise summaries.
   
   
 * 2. Code generation
   
   Automate the creation of repetitive or template-based code.
   
   
 * 3. ENTITY EXTRACTION
   
   Identify and extract specific data points from unstructured text.
   
   
 * 4. CHAT
   
   Engage users in real-time conversations to assist or inform.
   
   
 * 5. Sentiment analysis
   
   Determine the emotional tone or subjective opinion expressed in a piece of
   text.
   
   

Together Inference




TOGETHER FINE-TUNING

Fine-tune leading open-source models with your data to achieve greater accuracy
for your tasks.



together files upload acme_corp_customer_support.jsonl
  
{
  "filename" : "acme_corp_customer_support.json",
  "id": "file-aab9997e-bca8-4b7e-a720-e820e682a10a",
  "object": "file"
}
  
  
together finetune create --training-file file-aab9997e-bca8-4b7e-a720-e820e682a10a \
  --model togethercomputer/RedPajama-INCITE-7B-Chat
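For reference, the training file uploaded above is newline-delimited JSON. Below
is a minimal sketch of how such a file might be produced; the single "text" field
per line and the conversation markup are assumptions about the expected format,
not details taken from this page.

# Minimal sketch of writing a JSONL training file like the one uploaded above.
# The "text" field name and the <human>/<bot> markup are assumptions.
import json

examples = [
    {"text": "<human>: How do I reset my Acme widget?\n<bot>: Hold the power button for ten seconds, then release."},
    {"text": "<human>: What is your refund policy?\n<bot>: Purchases can be refunded within 30 days of delivery."},
]

with open("acme_corp_customer_support.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")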





together finetune create --training-file $FILE_ID \
  --model $MODEL_NAME \
  --wandb-api-key $WANDB_API_KEY \
  --n-epochs 10 \
  --n-checkpoints 5 \
  --batch-size 8 \
  --learning-rate 0.0003
{
    "training_file": "file-aab9997-bca8-4b7e-a720-e820e682a10a",
    "model_output_name": "username/togethercomputer/llama-2-13b-chat",
    "model_output_path": "s3://together/finetune/63e2b89da6382c4d75d5ef22/username/togethercomputer/llama-2-13b-chat",
    "Suffix": "Llama-2-13b 1",
    "model": "togethercomputer/llama-2-13b-chat",
    "n_epochs": 4,
    "batch_size": 128,
    "learning_rate": 1e-06,
    "checkpoint_steps": 2,
    "created_at": 1687982945,
    "updated_at": 1687982945,
    "status": "pending",
    "id": "ft-5bf8990b-841d-4d63-a8a3-5248d73e045f",
    "epochs_completed": 3,
    "events": [
        {
            "object": "fine-tune-event",
            "created_at": 1687982945,
            "message": "Fine tune request created",
            "type": "JOB_PENDING",
        }
    ],
    "queue_depth": 0,
    "wandb_project_name": "Llama-2-13b Fine-tuned 1"
}



 * Start simple
   
   Fine-tune your model with just two commands.
   
   
 * Go deep
   
   Control hyperparameters like learning rate, batch size, and epochs to
   optimize model quality.
   
   

Together Fine-tuning




NEED A CLUSTER OF H100S ON INFINIBAND TODAY?

Reserve your cluster today



TOGETHER GPU CLUSTERS

Get your own private GPU cluster for large training and fine-tuning today.

‍

Use our purpose-built training clusters with H100 and A100 GPUs connected over
fast Infiniband networks. Your cluster comes optimized for distributed training
with the high-performance Together Training stack out of the box. You focus on
your model, and we’ll ensure everything runs smoothly.




 * 01
   
   HIGH END CLUSTERS WITHOUT LONG-TERM COMMITS
   
   
   
   We offer flexible terms – even with our highest quality hardware. You can
   commit to just a month or reserve capacity for up to 5 years.


 * 02
   
   CLUSTERS FROM 16 GPUS TO 2048 GPUS
   
   
   
   A100 PCIe Clusters Node Specifications: 
   
   - 8x A100 / 80GB / PCIe
   
   - 200Gb node-to-node non-blocking Ethernet
   
   - 120 vCPU Intel Xeon (Ice Lake)
   
   - 960GB RAM
   
   - 7.68 TB NVMe storage
   
   ‍
   
   A100 SXM Clusters Node Specifications: 
   
   - 8x NVIDIA A100 80GB SXM
   
   - 4120 vCPU Intel Xeon (Sapphire Rapids)
   
   - 960 GB RAM
   
   - 8 x 960GB NVMe storage
   
   - 200 Gbps Ethernet or 3200 Gbps Infiniband configs available
   
   ‍
   
   H100 Clusters Node Specifications: 
   
   - 8x Nvidia H100 / 80GB / SXM5
   - 3.2 Tbps Infiniband network
   - 2x AMD EPYC 9474F 48 Cores 96 Threads 3.6GHz CPUs
   - 1.5TB ECC DDR5 Memory
   
   - 8x 3.84TB NVMe SSDs


 * 03
   
   SNAPPY SETUP. BLAZING FAST TRAINING.
   
   
   
   We value your time. Clusters are pre-configured for high-speed distributed
   training, using Slurm and the Together Custom Models stack to get you up and
   running at lightspeed.

Together GPU Clusters


TOGETHER CUSTOM MODELS

We built the open-source RedPajama models. Now, we’re helping you build models
from scratch. 

‍

Together Custom Models optimizes model builds end-to-end from data collection
and preparation to deployment. And in the end, you own the model.

Together Custom Models



TRAINING


9x
faster [5]


NETWORK
COMMUNICATION


117x
lower [6]


RELATIVE TO AWS


4x
lower cost [7]


WE BELIEVE THE FUTURE OF AI IS


OPEN SOURCE


01 TRANSPARENCY



Inspect how models are trained and what data is used to increase accuracy and
minimize risks


02 NO LOCK-IN



You own the model you fine-tune, not your cloud provider. Change providers for
whatever reason, including price changes.


03 PRIVACY



Maintain complete data privacy by storing data locally or in our secure cloud


BUILD, DEPLOY, AND SCALE. ALL IN A SINGLE PLATFORM.

 * 01
   
   
   
   BUILD
   
   Whether prompt engineering, fine-tuning, or training, we are ready to meet
   your business demands.

 * 02
   
   
   
   DEPLOY
   
   Easily integrate your new model into your production application using the
   Together Inference API.

 * 03
   
   
   
   SCALE
   
   With the fastest performance available and elastic scaling, Together AI is
   built to scale with your needs as you grow.


CUSTOMER STORIES

See how we support leading teams around the world. Our customers are creating
innovative generative AI applications, faster.




PIKA CREATES THE NEXT GEN TEXT-TO-VIDEO MODELS ON TOGETHER GPU CLUSTERS





NEXUSFLOW USES TOGETHER GPU CLUSTERS TO BUILD CYBERSECURITY MODELS





ARCEE BUILDS DOMAIN ADAPTIVE LANGUAGE MODELS WITH TOGETHER CUSTOM MODELS




Start building yours here →
 1. Testing conducted by Together AI in November 2023 using Llama-2-70B running
    on Together Inference, TGI, vLLM, Anyscale, Perplexity, and Open AI. Mosaic
    ML comparison based on published numbers in Mosaic ML blog. Detailed results
    and methodology published here.
 2. Testing conducted by Together AI in November 2023 using Llama-2-70B running
    on Together Inference, TGI, vLLM, Anyscale, Perplexity, and Open AI. Mosaic
    ML comparison based on published numbers in Mosaic ML blog. Detailed results
    and methodology published here.
 3. Testing conducted by Together AI in November 2023 using Llama-2-70B running
    on Together Inference. Detailed results and methodology published here.
 4. Based on published pricing November 8th, 2023, comparing Open AI
    GPT-3.5-Turbo to Llama-2-13B on Together Inference using Serverless
    Endpoints. Assumes equal number of input and output tokens.
 5. Compared to a standard attention implementation in PyTorch, FlashAttention-2
    can be up to 9x faster. Source.
 6. Testing methodology and results published in this research paper.
 7. Based on published pricing November 8th, 2023, comparing AWS Capacity Blocks
    and AWS p5.48xlarge instances to Together GPU Clusters configured with an
    equal number of H100 SXM5 GPUs on our 3200 Gbps Infiniband networking
    configuration.




SUBSCRIBE TO NEWSLETTER

 * Products
   
 * Solutions
   
 * Research
   
 * Blog
   
 * About
   
 * Pricing
   
 * Contact
   

 * 
 * 
 * 

© 2023 San Francisco, CA 94114

 * Privacy policy
   
 * Terms of service