I'm working on an assignment on matrix multiplication with MPI: A*B=C. The requirement is that B be partitioned vertically (by columns). Here's what I intend to do: broadcast matrix A to all processes and scatter B into slices, each containing n/p columns. The following code only works when the number of processes (p) is 1. When p > 1 (say 2), I get:
    [cluster2:21080] *** Process received signal ***
    [cluster2:21080] Signal: Segmentation fault (11)
    [cluster2:21080] Signal code: Address not mapped (1)
    [cluster2:21080] Failing at address: (nil)
    [cluster2:21080] [ 0] /lib/libpthread.so.0(+0xf8f0) [0x7f49f38108f0]
    [cluster2:21080] [ 1] /lib/libc.so.6(memcpy+0xe1) [0x7f49f35024c1]
    [cluster2:21080] [ 2] /usr/lib/libmpi.so.0(ompi_convertor_unpack+0x121)[0x7f49f47c88e1]
    [cluster2:21080] [ 3] /usr/lib/openmpi/lib/openmpi/mca_pml_ob1.so(+0x8a26) [0x7f49f0dcea26]
    [cluster2:21080] [ 4] /usr/lib/openmpi/lib/openmpi/mca_btl_tcp.so(+0x662c) [0x7f49efce462c]
    [cluster2:21080] [ 5] /usr/lib/libopen-pal.so.0(+0x1ede8) [0x7f49f42e0de8]
    [cluster2:21080] [ 6] /usr/lib/libopen-pal.so.0(opal_progress+0x99) [0x7f49f42d5369]
    [cluster2:21080] [ 7] /usr/lib/openmpi/lib/openmpi/mca_pml_ob1.so(+0x5585) [0x7f49f0dcb585]
    [cluster2:21080] [ 8] /usr/lib/openmpi/lib/openmpi/mca_coll_tuned.so(+0xcc01) [0x7f49eeeb1c01]
    [cluster2:21080] [ 9] /usr/lib/openmpi/lib/openmpi/mca_coll_tuned.so(+0x266c) [0x7f49eeea766c]
    [cluster2:21080] [10] /usr/lib/openmpi/lib/openmpi/mca_coll_sync.so(+0x1388) [0x7f49ef0c0388]
    [cluster2:21080] [11] /usr/lib/libmpi.so.0(MPI_Bcast+0x10e) [0x7f49f47d025e]
    [cluster2:21080] [12] ./out(main+0x259) [0x401571]
    [cluster2:21080] [13] /lib/libc.so.6(__libc_start_main+0xfd) [0x7f49f3498c8d]
    [cluster2:21080] [14] ./out() [0x400f29]
    [cluster2:21080] *** End of error message ***
Can someone help me? Thanks. Here is the relevant code:
    //matrices A and B
    //double* A =(double *)malloc(n*n*sizeof(double));
    //double* B =(double *)malloc(n*n*sizeof(double));
    //code initializing A,B...
    //n is the size of the matrix
    //p is the number of processes
    //myrank is the rank of calling process
    MPI_Init (&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank); 
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    //broadcast A to all processes
    MPI_Bcast (A, n*n, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Datatype tmp_type, col_type;
    // extract a slice from B
    MPI_Type_vector(n, num_of_col_per_slice, n, MPI_DOUBLE, &tmp_type);
    // resize the type: lower bound 0, extent n * sizeof(double), which sets where each next slice starts
    MPI_Type_create_resized(tmp_type, 0, n * sizeof(double), &col_type);
    MPI_Type_commit(&col_type);
    //scatter a slice of B to each process 
    MPI_Scatter(B, 1, col_type, B+myrank*n/p, n * n/p, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    //use blas function to calculate A*sliceOfB and store the resulting slice to C
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, n, n/p, n, 1.0, A, n, B+myrank*n/p, n, 0.0, C+myrank*n/p, n);
    //gather all those resulting slices into C
    MPI_Gather (C+myrank*n/p, n*n/p, MPI_DOUBLE, C, n*n/p, MPI_DOUBLE, 0, MPI_COMM_WORLD);
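
A possible direction, sketched under assumptions the post does not confirm. The trace shows `MPI_Bcast` faulting in `memcpy` at address `(nil)`, which would be consistent with `A` being allocated only on rank 0; if so, every rank needs its own `malloc` before the broadcast. Separately, resizing the vector type to an extent of `num_of_col_per_slice * sizeof(double)` rather than `n * sizeof(double)` would make consecutive slices start one slice-width apart, and receiving a slice as `n*n/p` contiguous doubles yields a packed n x (n/p) block whose leading dimension is n/p, not n. The serial C sketch below uses no MPI or BLAS. Its helper names (`pack_cols`, `matmul_slice`, `unpack_cols`, `multiply_by_column_slices`) are hypothetical, and it assumes n is divisible by p. It only illustrates the packed layout that the receive buffer, the multiply, and the gather would have to agree on.

```c
#include <assert.h>
#include <stdlib.h>

/* Copy columns [c0, c0+k) of a row-major n x n matrix M into a packed
   n x k buffer (leading dimension k). This is the layout a rank would
   hold after receiving a column slice as n*k contiguous doubles. */
static void pack_cols(const double *M, int n, int c0, int k, double *slice) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < k; j++)
            slice[i * k + j] = M[i * n + c0 + j];
}

/* Inverse of pack_cols: write a packed n x k slice back into columns
   [c0, c0+k) of the row-major n x n matrix M. */
static void unpack_cols(const double *slice, int n, int c0, int k, double *M) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < k; j++)
            M[i * n + c0 + j] = slice[i * k + j];
}

/* Cs (n x k, ld k) = A (n x n, ld n) * Bs (n x k, ld k), all row-major.
   A BLAS dgemm call on the packed slice would pass k, not n, as the
   leading dimension of Bs and Cs. */
static void matmul_slice(const double *A, const double *Bs, int n, int k,
                         double *Cs) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < k; j++) {
            double sum = 0.0;
            for (int l = 0; l < n; l++)
                sum += A[i * n + l] * Bs[l * k + j];
            Cs[i * k + j] = sum;
        }
}

/* Serial stand-in for scatter / multiply / gather over p "ranks".
   Returns 0 on success, -1 if allocation fails. */
static int multiply_by_column_slices(const double *A, const double *B,
                                     int n, int p, double *C) {
    assert(p > 0 && n % p == 0);   /* assumption: equal-width slices */
    int k = n / p;
    double *Bs = malloc((size_t)n * k * sizeof *Bs);
    double *Cs = malloc((size_t)n * k * sizeof *Cs);
    if (!Bs || !Cs) {
        free(Bs);
        free(Cs);
        return -1;
    }
    for (int r = 0; r < p; r++) {
        pack_cols(B, n, r * k, k, Bs);    /* what rank r would receive */
        matmul_slice(A, Bs, n, k, Cs);    /* local product on rank r   */
        unpack_cols(Cs, n, r * k, k, C);  /* what the gather restores  */
    }
    free(Bs);
    free(Cs);
    return 0;
}
```

In an MPI version, each rank would hold separate local buffers of `n * n/p` doubles for its slices of B and C instead of offsetting into the full matrices, and the root could gather with the same resized column type so the slices land back in the right columns of C.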