hx-zero
hx-zero OP t1_j07d431 wrote
Reply to comment by SleekEagle in [Project] Run and fine-tune BLOOM-176B at home using a peer-to-peer network by hx-zero
Training from scratch is slow because you need to synchronize all model weights/gradients on each step (though it's possible for somewhat smaller models with some optimizations).
In case of fine-tuning (especially prompt tuning), you train only a small percentage of the weights, so the communication overhead is no longer that large. Still, this is enough to adapt the LM to most downstream tasks.
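To illustrate why prompt tuning sends so little over the network, here is a minimal, self-contained sketch of the idea: the big model is frozen, and only a handful of "soft prompt" embeddings are trained. The class and parameter names here are invented for illustration and are not the Petals API.

```python
import torch
import torch.nn as nn

class PromptTunedModel(nn.Module):
    """Wraps a frozen LM and trains only a few 'soft prompt' embeddings."""

    def __init__(self, frozen_lm: nn.Module, hidden_size: int, num_prompt_tokens: int = 16):
        super().__init__()
        self.frozen_lm = frozen_lm
        for p in self.frozen_lm.parameters():
            p.requires_grad = False  # the big model is never updated
        # The ONLY trainable weights: num_prompt_tokens x hidden_size values
        self.soft_prompt = nn.Parameter(torch.randn(num_prompt_tokens, hidden_size) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # Prepend the learned prompt to every sequence in the batch
        batch = input_embeds.shape[0]
        prompt = self.soft_prompt.unsqueeze(0).expand(batch, -1, -1)
        return self.frozen_lm(torch.cat([prompt, input_embeds], dim=1))
```

With, say, 16 prompt tokens at hidden size 14336 (BLOOM's hidden size), the trainable state is a few hundred KB, versus hundreds of GB for the full model — which is why synchronizing it between peers is cheap.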
hx-zero OP t1_j05i3j5 wrote
Reply to comment by ReginaldIII in [Project] Run and fine-tune BLOOM-176B at home using a peer-to-peer network by hx-zero
A Petals client does not allow others to use your GPU by default, you need to explicitly run a Petals server (a separate program) for this.
In the Colab example, we only run the client, so its GPU can't be used by anyone besides the user directly running the notebook.
hx-zero OP t1_j04g7yj wrote
Reply to comment by ReginaldIII in [Project] Run and fine-tune BLOOM-176B at home using a peer-to-peer network by hx-zero
Sure!
Regarding offloading:
- Offloading is another method for running large LMs when you don't have the GPU memory to fit the entire model. Imagine you have an A100 GPU with 80 GB of memory and want to generate text with BLOOM, a 70-block transformer model with ~2.5 GB of weights per block. For each token, offloading loads the first third of the model (~27 blocks) from RAM/SSD into GPU memory, runs a forward pass through them, then frees the memory and loads the next third, and so on.
- The table shows that inference with offloading is very slow compared to Petals. That's because it involves copying hundreds of GB of block weights to your GPU memory to generate every new token in a sequence.
- Even though Petals may send data to a server on a different continent over the Internet, it turns out to be much faster because it just doesn't send much. It sends only activations, which are thousands of times smaller than the weights of one BLOOM block (and the weights are already loaded onto a server's GPU).
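The offloading loop described above can be sketched in a few lines of PyTorch. This is a toy illustration of the general technique, not the code of any specific offloading library; the constants reflect the numbers in the example above.

```python
import torch

NUM_BLOCKS = 70      # BLOOM has 70 transformer blocks
GPU_CAPACITY = 27    # assume ~27 blocks (~67 GB at ~2.5 GB/block) fit on the GPU at once

def offloaded_forward(blocks, hidden, device="cpu"):
    """Run a forward pass through all blocks, loading them chunk by chunk.

    `blocks` live in RAM/on SSD; only GPU_CAPACITY of them are resident at a
    time. Every generated token repeats this whole load/compute/free cycle,
    which is why offloading pays a huge weight-transfer cost per token.
    """
    for start in range(0, len(blocks), GPU_CAPACITY):
        chunk = blocks[start:start + GPU_CAPACITY]
        resident = [b.to(device) for b in chunk]   # copy block weights to GPU memory
        for block in resident:
            hidden = block(hidden)                 # forward pass through the chunk
        del resident                               # free GPU memory for the next chunk
    return hidden
```

By contrast, the activations Petals sends per token are on the order of hidden_size values (14336 for BLOOM) — tens of KB instead of the ~2.5 GB of weights per block, which is the core of the speed difference.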
Regarding "Petals on 3 physical servers" vs. "14 real servers":
- The first setup is artificial: we use 3 high-end servers located in one room and simulate different latency/bandwidth restrictions for research purposes.
- The second setup is realistic: we use 14 different servers with consumer-grade GPUs, spread across Europe and North America. So the GPUs are heterogeneous, latency may vary, we may have packet loss, etc.
Regarding "8 clients running simultaneously":
- Other rows measure the performance of a client if it uses a Petals swarm alone. This row shows how the performance degrades if we have 8 concurrent clients.
You can find these and other details of the experiments in our paper (the table I've sent is from its updated version that we haven't published yet).
hx-zero OP t1_j04dewf wrote
Reply to comment by randyzmzzzz in [Project] Run and fine-tune BLOOM-176B at home using a peer-to-peer network by hx-zero
Not really: federated learning focuses on data privacy (and doesn't usually involve huge models), while Petals focuses on making it possible to run a huge model without having many resources yourself (and doesn't provide data privacy guarantees).
hx-zero OP t1_j03zy85 wrote
Reply to comment by Acceptable-Cress-374 in [Project] Run and fine-tune BLOOM-176B at home using a peer-to-peer network by hx-zero
Yes, it's technically possible to integrate GPT-NeoX in our code instead of BLOOM (requires some work, but it's not too hard).
Also, it may be possible to fit GPT-NeoX into 20 GB of VRAM (i.e., one 3090) using the recent LLM.int8() work: https://huggingface.co/blog/hf-bitsandbytes-integration We use this approach to make BLOOM consume as little memory as possible in Petals.
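For intuition, here is a toy sketch of row-wise absmax int8 quantization, the basic mechanism underlying LLM.int8(). The real method additionally keeps outlier feature dimensions in fp16 (that part is omitted here), and in practice you'd use the bitsandbytes library rather than hand-rolling this.

```python
import torch

def quantize_rowwise_int8(weight: torch.Tensor):
    """Row-wise absmax quantization: float weights -> int8 values + per-row scales."""
    scales = weight.abs().amax(dim=1, keepdim=True) / 127.0   # one scale per row
    q = torch.clamp((weight / scales).round(), -127, 127).to(torch.int8)
    return q, scales

def dequantize_rowwise_int8(q: torch.Tensor, scales: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scales

w = torch.randn(4, 8)
q, s = quantize_rowwise_int8(w)
w_hat = dequantize_rowwise_int8(q, s)
# int8 storage is 4x smaller than fp32 (2x smaller than fp16), and the
# rounding error per weight is bounded by half a quantization step:
max_err = (w - w_hat).abs().max()
```

This 2x saving over fp16 is roughly how a ~40 GB model can be squeezed toward a single 3090-class GPU, at the cost of small quantization error.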
hx-zero OP t1_j03yfov wrote
Reply to comment by ReginaldIII in [Project] Run and fine-tune BLOOM-176B at home using a peer-to-peer network by hx-zero
Yeah, we compared Petals to a server with 3x A100 running tensor-parallel code based on Megatron-DeepSpeed, see the green row in this table. The table also shows how Petals performance degrades if we have concurrent clients and how it compares to offloading.
Adding more servers usually doesn't make inference significantly faster. New servers mostly increase the swarm's capacity, so it can provide the speed of ~1 step/sec to a larger number of clients.
I don't think we've done any comparisons with federated/split learning systems since, as far as I understand, they mostly don't work well on models of this size (100B+ parameters). But let us know if there are such systems, maybe we will compare Petals to some of them.
hx-zero OP t1_j03tsul wrote
Reply to comment by ReginaldIII in [Project] Run and fine-tune BLOOM-176B at home using a peer-to-peer network by hx-zero
Regarding fault tolerance:
- No chunk losses involved — if a client has trouble sending/receiving chunks from a certain server, it will try other servers holding the necessary blocks until it gets a valid response.
- We don't use any centralized queues like Kafka; instead, the client code chooses and traverses servers by itself until it completes a full forward/backward pass. In this architecture, you can still make the client send the same request to multiple servers (if you want to validate servers' responses against each other or just get the response as soon as possible).
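The fallback behavior above can be sketched as a simple retry loop. The names here are invented for illustration — in the real system the client discovers servers via a DHT and speaks an RPC protocol, neither of which is shown.

```python
import random

class ServerFailed(Exception):
    """Stand-in for a timeout or transport error from one peer."""

def run_remote_block(block_inputs, candidate_servers, max_attempts=3):
    """Try servers holding the needed block until one returns a valid response.

    `candidate_servers` is a list of callables standing in for RPCs to peers
    that serve the required model blocks. (Hypothetical sketch, not the
    Petals client code.)
    """
    servers = list(candidate_servers)
    random.shuffle(servers)                 # spread load across the swarm
    last_error = None
    for server in servers[:max_attempts]:
        try:
            return server(block_inputs)     # remote forward pass through the block
        except ServerFailed as e:
            last_error = e                  # this peer is down or slow: try the next
    raise RuntimeError(f"no server produced a valid response: {last_error}")
```

Because the client drives the traversal itself, a dead or misbehaving peer costs one failed attempt rather than a lost message in a central queue.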
Regarding security & privacy:
- Peers only exchange tensors (activations, gradients) serialized with safe protocols and ask each other to run pre-defined BLOOM blocks on them. They never send code to each other, so no one can execute their own code on your computer.
- It may be possible for peers serving model layers to recover input data and model outputs, or modify the outputs in a malicious way. That's why, at the moment, we ask in the repo & notebook that people never use the public swarm for sensitive data (only for pet projects/research). Instead, you can set up a private Petals swarm hosted by people/orgs you trust. For example, several small companies/labs may collaborate and set up a private swarm to protect their data from others, while still getting the benefits of Petals.
- Still, we have plans to improve the security of the public swarm in the future:
- (a) We plan to add an option for the client to send the same request to several servers and identify discrepancies (if any).
- (b) We're working on a reputation system, so a server that returns invalid outputs loses its reputation and won't be chosen by clients again. Invalid outputs can be reported by clients or detected by special "anti-fraud" nodes that periodically validate the various servers' outputs.
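Ideas (a) and (b) combine naturally: query several servers with the same request, keep the majority answer, and penalize disagreeing servers. The following is a toy sketch of that policy with invented names, not the actual Petals protocol.

```python
from collections import Counter

def validated_forward(request, servers, reputation, penalty=1.0):
    """Send the same request to several servers and trust the majority answer.

    Servers whose output disagrees with the majority lose reputation, so
    clients can deprioritize them later. (Toy illustration; real tensor
    outputs would be compared approximately, not with `==`.)
    """
    answers = {name: fn(request) for name, fn in servers.items()}
    majority, _ = Counter(answers.values()).most_common(1)[0]
    for name, answer in answers.items():
        if answer != majority:
            # record the discrepancy against the disagreeing server
            reputation[name] = reputation.get(name, 0.0) - penalty
    return majority
```

Redundant querying trades extra compute for robustness; a reputation score then lets honest clients avoid paying that cost for known-bad peers.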
Submitted by hx-zero t3_zl03b0 in MachineLearning
hx-zero OP t1_j07df6n wrote
Reply to comment by TrueBirch in [Project] Run and fine-tune BLOOM-176B at home using a peer-to-peer network by hx-zero
I think this is reasonable if these computers have GPUs.